Medicinal Plant Identification Using Machine Learning
Identifying plants through their leaves is a well-studied problem with applications ranging
from ecology, horticulture, disease identification, and rare plant preservation to medicinal
uses in Ayurveda and other plant-based medical systems. Our purpose in this project is to
identify plant species from the image of a single leaf using neural networks. We approach
the project using Keras, TensorFlow, and Convolutional Neural Networks (CNNs), a supervised
machine learning technique that learns color, texture, and geometrical features. This
approach gives satisfactory results with high accuracy.
The world bears thousands of plant species, many of which have medicinal value, others of
which are close to extinction, and still others that are harmful to humans. Not only are
plants an essential resource for human beings, but they also form the base of all food chains.
Medicinal plants are used mostly in herbal, Ayurvedic, and folk medicine manufacturing. Herbal
plants are plants that can be used as natural alternatives for curing diseases. About 80% of
people in the world still depend on traditional medicine. Herbal plants are also described as
plants whose parts (leaves, stems, or roots) have properties that can be used as raw
materials in making modern or traditional medicines. These medicinal plants are
often found in forests. There are various types of herbal plants, and they can be identified
in several ways, one of which is identification through the leaves. To protect plant species,
it is crucial to study and classify plants correctly.
It is self-evident that plants are crucial for our survival. Plant identification is therefore
an important field with many significant applications, including the protection and
maintenance of plants, the assessment of variables that are important for their maintenance,
and weed control. It is very difficult for an untrained eye to distinguish between plants, and
there are so many species that identifying all of them manually is impractical.
Technologies such as deep learning, machine learning, and computer vision are well suited
to this detection task. Neural networks are algorithms, loosely inspired by the human brain,
that learn to perform tasks from examples. A Convolutional Neural Network (CNN) is a deep
learning algorithm used for tasks such as image recognition, pattern analysis, and Natural
Language Processing. A CNN learns from the samples we provide to it. It uses successive
layers of filters, each of which recognizes a particular feature of the sample. Owing to the
wide popularity of CNNs and the extensive research done on them, accuracies of up to 90% are
possible with current CNN models.
1.2 MOTIVATION
Local people often do not know enough about the medicinal plants growing around them and
their uses. In the ancient past, Ayurvedic physicians themselves picked the medicinal plants
and prepared the medicines for their patients. Today only a few practitioners follow this
practice. The manufacturing and marketing of Ayurvedic drugs has become a thriving
industry whose turnover exceeds Rs. 4000 crores. The number of licensed Ayurvedic
medicine manufacturers in India easily exceeds 8500. This commercialization of the Ayurvedic
sector has brought into focus several questions regarding the quality of the raw materials
used for Ayurvedic medicines. Today the plants are collected from forest areas by women and
children who are not professionally trained in identifying the correct medicinal plants.
Manufacturing units often receive incorrect or substituted medicinal plants. Most of these
units lack adequate quality control mechanisms to screen these plants. In addition,
confusion due to variations in local names is rampant. Some plants arrive in dried form,
which makes manual identification much more difficult. Incorrect use of medicinal
plants makes the Ayurvedic medicine ineffective and may also produce unpredictable side
effects. In this situation, strict quality control measures must be enforced on Ayurvedic
medicines and on the raw materials used by the industry, in order to sustain the present
growth of the industry by maintaining the efficacy and credibility of the medicines.
A trained botanist looks at all the available features of a plant, such as leaves, flowers,
seeds, roots, and stems, to identify it. Except for the leaf, all of these are 3D objects,
which increases the complexity of analysis by computer. Plant leaves, however, are essentially
2D objects and carry sufficient information to identify the plant. Leaves can be collected
easily, and image acquisition may be carried out using inexpensive digital cameras, mobile
phones, or document scanners. Leaves are available at any time of the year, in contrast to
flowers and seeds. A leaf acquires a specific colour, texture, and shape as it grows, and
later changes are relatively insignificant. Plant recognition based on leaves depends on
finding suitable descriptors and extracting feature vectors from them. The feature vectors of
the training samples are then compared with the feature vector of the test sample to find the
degree of similarity using an appropriate classifier.
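The comparison of feature vectors described above can be sketched as follows. This is a minimal illustration, not the method used in this project: cosine similarity is assumed as the similarity measure, and the three-value descriptors and the labels "neem" and "tulsi" are hypothetical placeholders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(train_vectors, train_labels, test_vector):
    """Return (label, similarity) of the training sample most similar to test_vector."""
    scores = [cosine_similarity(v, test_vector) for v in train_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return train_labels[best], scores[best]

# Hypothetical descriptors: (aspect ratio, mean green value, texture score).
train_vectors = [[1.8, 0.62, 0.30], [0.9, 0.40, 0.75]]
train_labels = ["neem", "tulsi"]  # placeholder class names
label, score = most_similar(train_vectors, train_labels, [1.7, 0.60, 0.35])
```

Here the test vector is assigned the label of its most similar training vector; a real classifier would typically use many more samples and descriptors.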
1.3 PROBLEM STATEMENT
Machine learning is the study and design of algorithms that learn from data; neural
networks, one family of such algorithms, are inspired by the model of the human brain.
Machine learning is becoming more popular in data science fields such as artificial
intelligence (AI) and image identification. Deep learning is supported by various libraries
such as TensorFlow and Keras; Keras is one of the most powerful and easy-to-use Python
libraries for creating machine learning models. Identification of the correct medicinal
leaves can help botanists, taxonomists, and drug manufacturers to make quality drugs and can
reduce the side effects caused by wrong drug delivery. To identify the leaves of the plants, a
type of artificial neural network called a Convolutional Neural Network (CNN) is used. The
architecture used here is DenseNet, a convolutional neural network capable of achieving
high accuracy on challenging datasets.
Automatic detection of medicinal plants opens new doors for the development of
medicines to cure diseases that have not yet been cured by allopathy. It will allow the
layman to be aware of the plants growing in their surroundings and to make the best use of
them to treat common ailments with minimal side effects. Artificial intelligence makes this
purpose even more achievable. A portable system may be developed for field use. In future
research on plant identification, improved machine learning feature selection
models will be used to address accuracy-related issues and enhance performance.
Phase 3 : Feature extraction is the most important step in recognizing an image in computer
vision, since it influences the accuracy of the overall process. Feature extraction is
the process of transforming raw data into numerical features that can be processed while
preserving the information in the original dataset. It often yields better results than
applying machine learning directly to the raw data. The main features of a leaf are shape,
texture, color, and venation.
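As an illustration of turning raw pixels into numerical features, the sketch below computes a few simple shape and color descriptors. It assumes that a binary leaf mask (1 = leaf pixel) is already available. The function names, the tiny mask, and the pixel colors are hypothetical, and the descriptors are examples rather than the exact features used in this project.

```python
def shape_features(mask):
    """Compute simple shape descriptors from a binary leaf mask (1 = leaf pixel)."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    height = rows[-1] - rows[0] + 1   # bounding-box height in pixels
    width = cols[-1] - cols[0] + 1    # bounding-box width in pixels
    area = sum(sum(row) for row in mask)
    return {
        "area": area,
        "aspect_ratio": width / height,
        "extent": area / (width * height),  # fraction of the bounding box filled
    }

def mean_color(pixels, mask):
    """Average (R, G, B) over the leaf pixels only."""
    selected = [pixels[r][c] for r, row in enumerate(mask)
                for c, inside in enumerate(row) if inside]
    n = len(selected)
    return tuple(sum(p[ch] for p in selected) / n for ch in range(3))

# Hypothetical 4x6 mask and a uniformly green leaf on a white background.
mask = [
    [0, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
]
pixels = [[(30, 120, 40) if v else (255, 255, 255) for v in row] for row in mask]
features = shape_features(mask)
feature_vector = [features["area"], features["aspect_ratio"], features["extent"],
                  *mean_color(pixels, mask)]
```

The resulting `feature_vector` has six numbers and could be passed to the later phases in place of the raw image.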
Phase 4 : The next stage is dimensionality reduction, the task of reducing the number of
features in a dataset. In machine learning tasks such as classification, there are often too
many variables to work with. These variables are also called features. Some of these features
can be redundant, adding noise to the dataset, and it makes no sense to keep them in the
training data. This is where the feature space needs to be reduced. The process of
dimensionality reduction essentially transforms data from a high-dimensional feature space to
a low-dimensional feature space.
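One common way to perform this transformation is Principal Component Analysis (PCA). The sketch below is an assumption about how the step could look, not the project's implementation. It finds the first principal component by power iteration and projects two-dimensional toy data onto it. The data is hypothetical, and the code assumes the features are not all constant and that the starting vector is not orthogonal to the principal direction.

```python
import math

def first_principal_component(data, iterations=200):
    """Return (mean, unit direction) of the first principal component."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return mean, vec

def project(data, mean, direction):
    """Map each d-dimensional row to a single coordinate along direction."""
    return [sum((row[j] - mean[j]) * direction[j] for j in range(len(mean)))
            for row in data]

# Hypothetical data: the second feature is exactly twice the first (redundant).
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
mean, direction = first_principal_component(data)
reduced = project(data, mean, direction)  # one value per sample instead of two
```

Because the second feature carries no extra information, a single coordinate per sample is enough to preserve the structure of this toy dataset.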
Phase 5 : The final stage is classification. Classification is the process of assigning an
input image to a particular pre-defined class. A class is defined as a collection of the
feature values that were obtained during the training phase. The algorithms employed in the
classification phase assume that an image consists of various features and that sets of
features belong to several different classes. The input to this phase is the feature vector,
which consists of the extracted features.
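To make the idea of a class as a collection of feature values concrete, the following sketch uses a nearest-centroid classifier. This is a deliberately simple stand-in chosen for illustration (the project itself uses a CNN), and the feature vectors and class names are hypothetical.

```python
import math
from collections import defaultdict

def fit_centroids(vectors, labels):
    """Average the training feature vectors of each class into one centroid."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()}

def classify(centroids, vector):
    """Assign the vector to the class whose centroid is nearest (Euclidean)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))

# Hypothetical two-feature training vectors for two classes.
train_vectors = [[1.0, 0.2], [1.2, 0.3], [0.3, 0.9], [0.4, 1.1]]
train_labels = ["class_a", "class_a", "class_b", "class_b"]
centroids = fit_centroids(train_vectors, train_labels)
predicted = classify(centroids, [1.1, 0.25])
```

Each centroid summarizes the feature values seen for a class during training, and a new feature vector is assigned to the class it lies closest to.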
CHAPTER 2
DATASET DESCRIPTION
2.1 INTRODUCTION
In this project, an image dataset is used. Leaf images of medicinal plants were collected
from the internet using Google Chrome and assembled into a dataset. The dataset contains 31
classes of different medicinal plant species. Each class contains more than 50 images. Basic
manual sampling was performed and images of severely damaged leaves were removed. 70% of the
images are used as training data and 30% are used as test data.
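A 70/30 split of this kind can be sketched as below. The split is done per class (a stratified split) so that every species appears in both subsets. This is an assumption, since the text does not say how the split was made. The file names and the figure of exactly 50 images per class are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(samples, train_fraction=0.7, seed=42):
    """Split (path, label) pairs so each class keeps roughly the same ratio."""
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append((path, label))
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, test = [], []
    for label in sorted(by_class):
        items = by_class[label]
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

# Hypothetical file names: 31 classes with 50 images each.
samples = [(f"class_{c:02d}/img_{i:03d}.jpg", c) for c in range(31) for i in range(50)]
train, test = stratified_split(samples)
```

With these placeholder counts, each class contributes 35 images to the training set and 15 to the test set.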
DATASET
TRAINING DATASET
The training data is the largest subset of the original dataset and is used to train
or fit the machine learning model. The training data is fed to the ML algorithms, which
learn from it how to classify inputs for the given task. For supervised learning, the training
data contains labels, which are needed to train the model to classify.
The quality of the training data that we provide to the model largely determines the model's
accuracy and classification ability: the better the quality of the training data, the
better the performance of the model. Training data makes up approximately 70% of the total
data for this project.
TESTING DATASET
Once the model has been trained with the training dataset, it is tested with the test
dataset. This dataset is used to evaluate the performance of the model and to check that the
model generalizes well to new or unseen data. The test dataset is another subset of the
original data, independent of the training dataset. It has similar types of features and a
similar class probability distribution, and it serves as a benchmark for model evaluation
once model training is completed. Test data is a well-organized dataset that contains data
for each type of scenario that the model would face for the given problem when used in
the real world. Usually, the test dataset is approximately 20-25% of the total original data;
in this project it is 30%.
At this stage, we can also compare the testing accuracy with the training
accuracy, that is, how accurate the model is on the test dataset against the training
dataset. If the accuracy of the model on the training data is considerably greater than that
on the testing data, the model is said to be over-fitting.
CHAPTER 3
LITERATURE REVIEW
3.1 INTRODUCTION
Machine learning is a broad subset of artificial intelligence that enables computers to learn
from data and experience without being explicitly programmed. In recent years, machine
learning has helped to solve complex problems in areas such as finance, healthcare,
manufacturing, and logistics. There are different types of machine learning algorithms, but
the most common are regression and classification algorithms. Regression algorithms are used
to predict continuous values, while classification algorithms are used to assign data to
discrete categories. Machine learning algorithms can be further divided into two categories:
supervised and unsupervised. Supervised algorithms require a training dataset that includes
both the input data and the desired output. Unsupervised algorithms do not require labelled
outputs and instead find structure in the input data on their own. Machine learning itself has
several subfields, including neural networks, deep learning, and reinforcement learning.
Machine learning models make use of several algorithms. While no one algorithm is
considered perfect, some algorithms are better suited to specific tasks. To choose the
right ones, it is good to gain a solid understanding of all the primary algorithms. The
algorithm used in our project for image classification is the Convolutional Neural Network
(CNN).
In the paper by Sue Han, Seng, et al., plant identification using the deep learning CNN
algorithm is proposed. In this method the CNN is used to learn
unsupervised feature representations. The experiments are carried out on 44 different plant
types collected at the Royal Botanic Garden. The experimental results show consistency and
superiority over other hand-crafted feature extraction methods.
Wang-Su and Sang Yong proposed plant leaf recognition using a CNN model and created
two models by adjusting the network depth of GoogLeNet. The performance of each
model was evaluated with discoloration and damage effects applied to the leaves.
An accuracy greater than 80% was obtained even with 30% damage to the leaves.
A novel method for plant leaf classification was proposed by Jiachun Liu, S. Yang, et al.
This method uses a CNN for feature extraction and classification, with a 10-layer
architecture. Leaf image augmentation was applied to enlarge the database and thereby
improve classification performance. A visualization method was used to analyze the factors
affecting accuracy. The method was evaluated on the Flavia dataset and obtained an accuracy of
81.92%.
In another paper, a CNN-based plant classification system is proposed to classify
different plant species from an image database collected from smart agro-stations. In
this method a CNN architecture is used for feature extraction from the plant images. The
approach was tested on the TARBIL database and obtained an accuracy of 80.47% on 16
different plant species. The CNN-based classification shows higher accuracy
than SVM-based classification.
In the paper by Shah, Sougatta Singh, et al., a leaf classification system using a
dual-path deep CNN is proposed. The dual-path CNN performs the following major
operations:
i. Both the shape and texture characteristics are studied.
ii. The obtained features are optimized for classification.
This method obtained an accuracy of about 82.28% on the Flavia dataset,
which is better than other uni-path CNN methods.
CHAPTER 4
Liu, Albert and Yangming Huang developed a plant identification system using SVM for
classification. However, this method holds only for clean images, in which
leaves are well aligned on a contrasting background, with few or no
variations of color or luminance.
Kumar, P. M., Surya, C. M. and Gopi, V. P. used different plant features such as color,
texture, and shape. However, this method works well only for a static or plain background.
Putzu, L., Di Ruberto, C., Fenu, G. make use of saliency map methods such as Graph-Based
Manifold Ranking (GMR), Visual Saliency Feature (VSF), and Gaussian Pyramids for leaf
extraction. However, this method works well only in the presence of an untextured background.
Anantrasirichai, Nantheera, Sion, L., Hannuna and Cedric Nishan Canagarajah proposed a method
based on marker-controlled watershed segmentation. This method still misses some areas because
of reflection, shadow, and disease on the leaves.
One of the most authoritative works in the field of plant identification has been done by Wu
et al. Features are derived from five basic geometric features, and then Principal Component
Analysis (PCA) is used for dimension reduction. They achieved an average accuracy of 90.3% on
the Flavia dataset.
LIMITATIONS
Some methods miss areas of the leaf because of reflection, shadow, and disease on the leaves.
Other methods require images in which leaves are well aligned on a contrasting background,
with few or no variations of color.
A novel method is proposed for identifying medicinal plants from images of both the front and
back sides of the leaves, taken from different angles. The work is based on a database of
leaf images of medicinal plants. Unique combinations of texture, shape, and
morphological features have been identified that maximize the identification rate of green
leaves. When the image of a plant leaf is given to the system, it identifies the
plant and displays its image together with the properties of the leaf or the diseases it
cures. In this method the DenseNet type of Convolutional Neural
Network (CNN) is used because of its compelling advantages: it strengthens
feature propagation and encourages feature reuse, which in turn increases efficiency
and decreases the validation loss. Here TensorFlow and Keras are used to train the model on
the data.
4.2.1 CONVOLUTIONAL NEURAL NETWORK
A convolutional neural network has multiple hidden layers that help in extracting
information from an image. The four important layers in a CNN are:
1. Convolution layer
2. ReLU layer
3. Pooling layer
4. Fully connected layer
1. CONVOLUTIONAL LAYER
This is the first step in the process of extracting valuable features from an image. A
convolution layer has several filters that perform the convolution operation. Every image is
considered as a matrix of pixel values. Consider the following 5x5 image whose pixel values
are either 0 or 1. There’s also a filter matrix with a dimension of 3x3. Slide the filter matrix
over the image and compute the dot product to get the convolved feature matrix.
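The sliding dot product can be sketched in a few lines. The original figure is not reproduced here, so the 5x5 image and 3x3 filter values below are hypothetical examples. As in most CNN libraries, the filter is applied without flipping (strictly a cross-correlation), with stride 1 and no padding.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image (stride 1, no padding)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out_cols)]
            for r in range(out_rows)]

# Hypothetical 5x5 binary image and 3x3 filter.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
feature_map = convolve2d(image, kernel)
# feature_map == [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

A 5x5 image convolved with a 3x3 filter in this way gives a 3x3 convolved feature matrix.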
2. ReLU LAYER
ReLU stands for rectified linear unit. Once the feature maps are extracted, the next
step is to pass them through a ReLU layer, which sets all negative values to zero and leaves
positive values unchanged. The original image is scanned with multiple
convolution and ReLU layers to locate the features and generate the rectified feature
map.
Fig 4.3
Fig 4.4
The ReLU layer helps to identify different features of the image, such as
color, shape, texture, edges, corners, and margin.
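The ReLU operation itself is simple: every negative value in the feature map is replaced by zero. A minimal sketch, using a hypothetical feature map:

```python
def relu(feature_map):
    """Replace every negative value with zero; keep the other values unchanged."""
    return [[max(0, value) for value in row] for row in feature_map]

# Hypothetical feature map containing negative responses.
feature_map = [[-3, 2, 0], [5, -1, 4]]
rectified = relu(feature_map)
# rectified == [[0, 2, 0], [5, 0, 4]]
```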
Fig 4.5 Features of Single Leaf
3. POOLING LAYER
Pooling is a down-sampling operation that reduces the dimensionality of the feature map.
The rectified feature map now goes through a pooling layer to generate a pooled feature map.
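The text does not state which pooling operation is used. The sketch below assumes max pooling with a 2x2 window and stride 2, a common choice, and the input values are hypothetical.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    return [[max(feature_map[r * stride + i][c * stride + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)]
            for r in range(rows)]

# Hypothetical 4x4 rectified feature map.
rectified = [
    [1, 4, 2, 7],
    [2, 6, 8, 5],
    [3, 4, 0, 7],
    [1, 2, 3, 1],
]
pooled = max_pool2d(rectified)
# pooled == [[6, 8], [4, 7]]
```

The 4x4 map is reduced to 2x2, which is the down-sampling effect described above.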
4. FULLY CONNECTED LAYER
The Fully Connected (FC) layer consists of neurons with their weights and biases
and is used to connect the neurons between two different layers. These layers are usually
placed before the output layer and form the last few layers of a CNN architecture.
Here, the feature maps from the previous layers are flattened and fed to the FC layer. The
flattened vector then passes through a few more FC layers, where the mathematical
operations usually take place. At this stage, the classification process begins.
Two fully connected layers are used because they tend to perform better
than a single fully connected layer. These layers in a CNN reduce the need for human
supervision.
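The flattening and fully connected steps can be illustrated as follows. This is a sketch under stated assumptions: the weights and biases are hypothetical, only one FC layer is shown, and a softmax output is assumed to turn the class scores into probabilities.

```python
import math

def flatten(feature_maps):
    """Turn a list of 2D feature maps into one flat vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]

def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the maximum improves numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

pooled = [[[6, 8], [4, 7]]]            # one hypothetical 2x2 pooled feature map
x = flatten(pooled)                    # [6, 8, 4, 7]
weights = [[0.1, 0.0, -0.1, 0.0],      # hypothetical weights for two classes
           [0.0, 0.1, 0.0, -0.1]]
biases = [0.0, 0.5]                    # hypothetical biases
scores = dense(x, weights, biases)
probabilities = softmax(scores)
```

The class with the highest probability is taken as the prediction. In a trained network the weights and biases are learned rather than set by hand.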
ADVANTAGES
Unique combinations of texture, shape, and morphological features have been
identified that maximize the identification rate of green leaves.
In this method the DenseNet type of Convolutional Neural Network (CNN) is used because
of its compelling advantages: it strengthens feature propagation and
encourages feature reuse, which in turn increases efficiency and decreases the
validation loss.
Its built-in convolutional layers reduce the high dimensionality of images without losing
their information.
CHAPTER 5
The model is trained on the dataset of 2835 images over several epochs, and the accuracy and
loss are calculated for each epoch. The number of epochs is the number of times the learning
algorithm sees the complete dataset. One epoch is when the entire dataset is passed forward
and backward through the neural network exactly once. Loss is the summation of errors in the
model, and accuracy is the ratio of correct predictions to total predictions. The weights are
updated through backpropagation.
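The relationship between epochs, loss, and accuracy can be seen in a very small training loop. The sketch below is not the DenseNet used in this project. It trains a single-layer logistic model on four hypothetical samples, with an assumed learning rate, purely to show one forward and backward pass per sample and one loss and accuracy value per epoch.

```python
import math

def train(samples, labels, epochs=20, learning_rate=0.5):
    """Train a one-layer logistic model; return per-epoch (loss, accuracy)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    history = []
    for _ in range(epochs):
        total_loss, correct = 0.0, 0
        for x, y in zip(samples, labels):
            # Forward pass: predicted probability of class 1.
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            total_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            correct += int((p >= 0.5) == bool(y))
            # Backward pass: move each weight against the gradient of the loss.
            error = p - y
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
        history.append((total_loss / len(samples), correct / len(samples)))
    return history

# Hypothetical two-feature samples with binary labels.
samples = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
labels = [0, 0, 1, 1]
history = train(samples, labels)  # one (loss, accuracy) pair per epoch
```

Each entry of `history` corresponds to one epoch, in the same way that each row of an epoch table reports one accuracy or loss value.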
During the training process, some of the images used for training are displayed.
Fig 5.2 Displaying some of the training images
Fig 5.4 Displaying the output
OUTPUT 1 to OUTPUT 5: [output figures]
5.2 RESULT ANALYSIS
The training accuracy and validation accuracy for each epoch are listed in the table below.
EPOCH   MODEL ACCURACY   VALIDATION ACCURACY
1       0.2703           0.3898
2       0.3448           0.4215
3       0.4537           0.5538
4       0.6442           0.6649
5       0.7884           0.7443
6       0.8761           0.7901
7       0.9171           0.7937
8       0.9638           0.8289
9       0.9793           0.8236
10      0.9987           0.8201
11      1.0000           0.8325
12      1.0000           0.8289
13      1.0000           0.8254
14      1.0000           0.8272
15      1.0000           0.8289
16      1.0000           0.8183
17      1.0000           0.8183
18      1.0000           0.8254
19      1.0000           0.8289
20      1.0000           0.8325
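The values in the table can be inspected with a short script. The lists below are copied from the table. The quantities computed (best validation epoch, final gap between the two accuracies, and the first epoch at which model accuracy exceeds validation accuracy) are one possible way to read the results, not an analysis taken from the original work.

```python
train_acc = [0.2703, 0.3448, 0.4537, 0.6442, 0.7884, 0.8761, 0.9171, 0.9638,
             0.9793, 0.9987, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000,
             1.0000, 1.0000, 1.0000, 1.0000]
val_acc = [0.3898, 0.4215, 0.5538, 0.6649, 0.7443, 0.7901, 0.7937, 0.8289,
           0.8236, 0.8201, 0.8325, 0.8289, 0.8254, 0.8272, 0.8289, 0.8183,
           0.8183, 0.8254, 0.8289, 0.8325]

# Epoch numbers are 1-based; max() returns the first epoch reaching the best value.
best_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1
final_gap = train_acc[-1] - val_acc[-1]
first_epoch_train_ahead = next(
    epoch for epoch, (t, v) in enumerate(zip(train_acc, val_acc), start=1) if t > v)
```

Validation accuracy first reaches its maximum of 0.8325 at epoch 11 and stays near 0.82 to 0.83 afterwards, while model accuracy reaches 1.0000. The final gap of about 0.17 may indicate over-fitting.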
Fig 5.11 epoch vs loss graph
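$$\mathrm{LOSS} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
n – number of inputs
i – index of the input (iteration)
y_i – target value for input i
ŷ_i – predicted value for input i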
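Reading the loss above as a mean squared error over n inputs (an assumption about the intended formula), a small worked example with hypothetical target and predicted values looks like this:

```python
def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    n = len(targets)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions)) / n

# Hypothetical values: squared errors are 0.25, 0.25 and 0.0.
targets = [1.0, 0.0, 0.0]
predictions = [0.5, 0.5, 0.0]
loss = mean_squared_error(targets, predictions)  # (0.25 + 0.25 + 0.0) / 3
```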
The model loss and validation loss for each epoch are listed in the table below.
Table 5.2 Epochs vs model loss and validation loss
CHAPTER 6
CHAPTER 7
Plants are necessary for human survival. Herbs, in particular, have been used by indigenous
populations as folk medicines since ancient times. Herbs are typically recognized by
clinicians based on decades of close sensory or olfactory experience. Recent improvements in
analytical technology have made it much easier to identify herbs on the basis of scientific
evidence. This helps many people, particularly those who are not used to recognizing herbs.
In addition to being time-consuming, laboratory-based analysis requires expertise in
sample preparation and data interpretation. As a result, a simple and reliable method for
identifying herbs is required. Herbal identification is expected to benefit from the
combination of computation and statistical analysis. This non-destructive technique will be
the preferred approach for quickly identifying herbs, especially for individuals who do not
have access to expensive analytical equipment. This work reviews different methods for plant
recognition along with their advantages and disadvantages.