Medicinal Plant Identification Using Machine Learning


ABSTRACT

This project is entitled “Medicinal Plant Identification using Machine Learning”. In this
project we identify medicinal plants from their leaves. Since Vedic times, plants have been
used as a source of medicine in Ayurveda. In the preparation of Ayurvedic medicine, identifying
the correct plant is the most important step, and it has so far been done manually. Because of
the demand for mass production, automatic identification of these plants is important.

Identifying plants through their leaves is a widely pursued endeavour with applications
ranging from ecology, horticulture, plant disease identification and rare plant preservation
to medicinal uses in Ayurveda and other plant-based medical systems. Our purpose in this
project is to identify plant species from the image of a single leaf using neural networks.
We approach the problem using Keras, TensorFlow and Convolutional Neural Networks, a
supervised machine learning approach based on color, texture and geometrical features. The
aforementioned approach gives satisfactory results with high accuracy.

Keywords: Plant Identification, Supervised Machine Learning, Convolutional Neural Network,
geometrical features
CHAPTER 1
1.1 INTRODUCTION

The world bears thousands of plant species. Many of them have medicinal value, others are
close to extinction, and still others are harmful to humans. Not only are plants an essential
resource for human beings, they also form the base of all food chains. Medicinal plants are
used mostly in herbal, Ayurvedic and folk medicine manufacturing. Herbal plants are plants
that can be used as natural alternatives for curing diseases. About 80% of people in the
world still depend on traditional medicine. Herbal plants are also described as plants whose
parts (leaves, stems, or roots) have properties that can be used as raw materials in making
modern or traditional medicines. These medicinal plants are often found in forests. There
are various types of herbal plants, and they can be identified in several ways, one of which
is identification through the leaves. To protect plant species, it is crucial to study and
classify plants correctly.

It is self-evident that plants are crucial for our survival. Plant identification is
therefore an important field with many significant applications, including the protection and
maintenance of plants, the assessment of variables that are important for their maintenance,
and weed control. It is very difficult for an untrained eye to distinguish between plants,
and there are so many species that identifying all of them manually is impractical.

Technologies like deep learning, machine learning, and computer vision are very effective
for this kind of detection. Neural networks are algorithms that can teach themselves to
perform tasks in a way loosely modelled on the human mind. A Convolutional Neural Network
(CNN) is a deep learning algorithm used for tasks such as image recognition, pattern analysis
and natural language processing. A CNN teaches itself on the basis of the samples we provide
to it. It uses different layers of filters, each of which recognizes a particular feature of
the sample. Owing to the wide popularity of CNNs and the extensive research done on them,
accuracy of up to 90% is possible with current CNN models.

1.2 MOTIVATION

Local people are not sufficiently knowledgeable about the medicinal plants in their urban
surroundings and their uses. In the ancient past, Ayurvedic physicians themselves picked the
medicinal plants and prepared the medicines for their patients. Today only a few
practitioners follow this practice. The manufacturing and marketing of Ayurvedic drugs has
become a thriving industry whose turnover exceeds Rs. 4000 crores. The number of licensed
Ayurvedic medicine manufacturers in India easily exceeds 8500. This commercialization of the
Ayurvedic sector has brought into focus several questions regarding the quality of the raw
materials used for Ayurvedic medicines. Today the plants are collected from forest areas by
women and children who are not professionally trained in identifying the correct medicinal
plants. Manufacturing units often receive incorrect or substituted medicinal plants, and
most of these units lack adequate quality control mechanisms to screen them. In addition,
confusion due to variations in local names is rampant. Some plants arrive in dried form,
which makes manual identification much more difficult. Incorrect use of medicinal plants
makes Ayurvedic medicine ineffective and may also produce unpredictable side effects. In
this situation, strict quality control measures must be enforced on Ayurvedic medicines and
the raw materials used by the industry, in order to sustain the present growth of the
industry by maintaining the efficacy and credibility of the medicines.

A trained botanist looks at all the available features of a plant, such as leaves, flowers,
seeds, roots and stem, to identify it. Except for the leaf, all of these are 3D objects,
which increases the complexity of analysis by computer. Plant leaves, however, are
essentially 2D objects and carry sufficient information to identify the plant. Leaves can be
collected easily, and image acquisition may be carried out using inexpensive digital cameras,
mobile phones or document scanners. Leaves are available at any time of the year, in
contrast to flowers and seeds. A leaf acquires a specific colour, texture and shape as it
grows, and subsequent changes are relatively insignificant. Plant recognition based on leaves
depends on finding suitable descriptors and extracting feature vectors from them. The feature
vectors of the training samples are then compared with the feature vector of the test sample
to find the degree of similarity using an appropriate classifier.
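
The comparison of feature vectors described above can be illustrated with a nearest-neighbour classifier. This is only a minimal sketch: the three-value feature vectors, the choice of Euclidean distance and the function names are assumptions made for illustration, not the classifier used in this project.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbour(training_samples, test_vector):
    """Return the label of the training sample closest to test_vector.

    training_samples is a list of (feature_vector, label) pairs.
    """
    _, label = min(training_samples, key=lambda item: euclidean(item[0], test_vector))
    return label


# Hypothetical feature vectors: (aspect ratio, mean green value, texture score).
training_samples = [
    ([1.8, 0.62, 0.30], "Neem"),
    ([1.1, 0.71, 0.12], "Betel"),
    ([3.9, 0.55, 0.25], "Oleander"),
]

predicted = nearest_neighbour(training_samples, [1.2, 0.70, 0.15])
print(predicted)
```

A smaller distance means a higher degree of similarity, so the test leaf receives the label of the most similar training leaf.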

1.3 PROBLEM STATEMENT

Machine learning is the study and design of algorithms that learn from data; neural
networks in particular are inspired by the model of the human brain. Machine learning is
becoming more popular in data science fields such as artificial intelligence (AI) and image
identification. Deep learning is supported by various libraries such as TensorFlow and Keras.
Keras is one of the most powerful and easy-to-use Python libraries for creating machine
learning models. Identification of the correct medicinal leaves can help botanists,
taxonomists and drug manufacturers to make quality drugs and can reduce the side effects
caused by delivering the wrong drug. To identify the leaves of the plants, a type of
artificial neural network called a Convolutional Neural Network (CNN) is used. The
architecture used here is DenseNet, a convolutional neural network capable of achieving high
accuracy on challenging datasets.

1.4 RESEARCH SCOPE

Automatic detection of medicinal plants may open new doors for the development of
medicines for diseases that have not yet been cured by allopathy. It will allow the layman
to be aware of the plants growing in their surroundings and to make use of them for common
ailments with few side effects. Artificial intelligence makes this purpose even more
achievable. A portable system may be developed for field use. In future research on plant
identification, improved machine learning feature selection models may be used to address
accuracy-related issues and enhance performance.

1.5 RESEARCH CONTRIBUTION


Phase 1 : The first step is image acquisition, which determines the input image. The quality
of the input image affects the accuracy of the output. Images can be collected through
Google Chrome to form a dataset.

Phase 2 : Image pre-processing is the second stage in the identification process. The raw
image may be captured against its natural background. It might contain noise and could be of
an arbitrary size and taken at an arbitrary angle. Removing noise and adjusting contrast is
called image pre-processing. The scale and orientation of the image also need to be
standardized for proper feature computation and accurate results.

Phase 3 : Feature extraction is the most important step in recognizing an image in computer
vision, since it influences the accuracy of the overall process. Feature extraction is the
process of transforming raw data into numerical features that can be processed while
preserving the information in the original dataset. It often yields better results than
applying machine learning directly to the raw data. The main features of a leaf are shape,
texture, color and venation.

Phase 4 : The next stage is dimensionality reduction, the task of reducing the number of
features in a dataset. In machine learning tasks such as classification, there are often too
many variables to work with. These variables are also called features. Some of these features
can be quite redundant and add noise to the dataset, so it makes no sense to keep them in the
training data. This is where the feature space needs to be reduced. The process of
dimensionality reduction essentially transforms data from a high-dimensional feature space to
a low-dimensional feature space.

Phase 5 : The final stage is classification. Classification is the process of assigning an
input image to a particular pre-defined class. A class is defined as a collection of the
feature values obtained during the training phase. The algorithms employed in the
classification phase assume that an image consists of various features and that each set of
features belongs to one of several different classes. The input to this phase is the feature
vector, which consists of the extracted features.
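
As an illustration of the feature extraction phase, the sketch below computes a few simple geometrical features from a binary leaf mask. The mask, the function name and the particular features (area, aspect ratio, rectangularity) are assumptions chosen for illustration; the project itself lets the CNN learn its features.

```python
def geometric_features(mask):
    """Compute simple shape features from a binary mask (1 = leaf, 0 = background)."""
    coords = [(r, c) for r, row in enumerate(mask) for c, value in enumerate(row) if value]
    if not coords:
        raise ValueError("mask contains no leaf pixels")
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    height = max(rows) - min(rows) + 1  # bounding-box height in pixels
    width = max(cols) - min(cols) + 1   # bounding-box width in pixels
    area = len(coords)                  # number of leaf pixels
    return {
        "area": area,
        "aspect_ratio": width / height,
        "rectangularity": area / (width * height),
    }


# Hypothetical 5x5 mask of a diamond-shaped leaf.
leaf_mask = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]

features = geometric_features(leaf_mask)
print(features)
```

The resulting dictionary of numbers is one possible form of the feature vector that is passed on to the classification phase.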

CHAPTER 2

DATASET DESCRIPTION
2.1 INTRODUCTION

In this project, I used an image dataset. Leaf images of medicinal plants were collected
from the web using Google Chrome to form the dataset. The dataset contains 31 classes of
medicinal plant species, and each class contains more than 50 images. Basic manual screening
was performed to remove images of severely damaged leaves. 70% of the images are used as
training data and 30% as test data.
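
The 70/30 split can be sketched as follows. This is a minimal example that splits each class separately so that every species appears in both subsets; the file names, the fixed seed and the function name are hypothetical, and the project may have split the data differently.

```python
import random


def split_per_class(paths_by_class, train_fraction=0.7, seed=42):
    """Split the image paths of every class into training and test subsets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label, paths in paths_by_class.items():
        shuffled = list(paths)
        rng.shuffle(shuffled)
        cut = int(round(len(shuffled) * train_fraction))
        train += [(path, label) for path in shuffled[:cut]]
        test += [(path, label) for path in shuffled[cut:]]
    return train, test


# Hypothetical file names for two of the 31 classes.
dataset = {
    "Neem": [f"neem_{i}.jpg" for i in range(60)],
    "Betel": [f"betel_{i}.jpg" for i in range(50)],
}

train_set, test_set = split_per_class(dataset)
print(len(train_set), len(test_set))
```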

Fig 2.1 Types of Datasets (the dataset is divided into a training set and a testing set)

TRAINING DATASET

The training data is the largest subset of the original dataset and is used to train, or
fit, the machine learning model. The training data is fed to the ML algorithm, which learns
from it how to perform the classification task. For supervised learning, the training data
contains labels so that the model can be trained to classify.

The quality of the training data provided to the model largely determines the model's
accuracy and classification ability: the better the quality of the training data, the better
the performance of the model. Training data makes up approximately 70% of the total data for
this project.

TESTING DATASET

Once the model has been trained with the training dataset, it is tested with the test
dataset. This dataset is used to evaluate the performance of the model and to check that the
model generalizes well to new or unseen data. The test dataset is another subset of the
original data, independent of the training dataset. However, it has similar types of features
and a similar class probability distribution, and it is used as a benchmark for model
evaluation once training is completed. Test data should be a well-organized dataset that
contains data for each type of scenario that the model would face when used in the real
world. Usually, the test dataset is approximately 20-25% of the total original data; in this
project it is 30%.

At this stage, we can also compare the testing accuracy with the training accuracy, that
is, how accurate the model is on the test dataset compared with the training dataset. If the
accuracy of the model on the training data is considerably greater than that on the testing
data, the model is said to be over-fitting.

Fig 2.2 Dataset

CHAPTER 3

LITERATURE REVIEW
3.1 INTRODUCTION

Machine learning is a broad subset of artificial intelligence that enables computers to
learn from data and experience without being explicitly programmed. In recent years, machine
learning has helped to solve complex problems in areas such as finance, healthcare,
manufacturing, and logistics. There are different types of machine learning algorithms, but
the most common are regression and classification algorithms. Regression algorithms are used
to predict continuous outcomes, while classification algorithms are used to assign data to
categories.

Machine learning algorithms can be further divided into two categories: supervised and
unsupervised. Supervised algorithms require a training dataset that includes both the input
data and the desired output. Unsupervised algorithms do not require the desired output and
instead "learn" structure from the input data on their own. Machine learning itself has
several subfields, including neural networks, deep learning, and reinforcement learning.

Machine learning models make use of several algorithms. While no one algorithm is
considered perfect, some are better suited to specific tasks. To choose the right ones, it is
good to gain a solid understanding of the primary algorithms. The algorithm used in our
project for image classification is the Convolutional Neural Network (CNN).

A Convolutional Neural Network (CNN or ConvNet) is a network architecture for machine
learning that learns directly from data. CNNs are particularly useful for finding patterns in
images to recognize objects, classes, and categories. They can also be quite effective for
classifying audio, time-series, and signal data. A CNN consists of multiple layers and is
mainly used for image processing and object detection. Yann LeCun developed one of the first
CNNs in the late 1980s.

3.2 RELATED WORKS AND DISCUSSION

Sue Han, Seng, et al. proposed plant identification using a deep learning CNN. In this
method the CNN is used to learn unsupervised feature representations. The experiments were
carried out on 44 different plant types collected at the Royal Botanic Garden. The
experimental results show consistency and superiority over hand-crafted feature extraction
methods.

Wang-Su and Sang Yong proposed a plant leaf recognition method using a CNN and created two
models by adjusting the network depth of GoogLeNet. The performance of each model was
evaluated with discoloration and damage effects applied to the leaves. An accuracy greater
than 80% was obtained even with 30% leaf damage.

A novel method for plant leaf classification was proposed by Jiachun Liu, S. Yang, et al.
This method uses a CNN with a 10-layer architecture for feature extraction and
classification. Data augmentation was applied to the leaf images to enlarge the database and
thereby improve the classification performance. A visualization method was used to analyse
the factors affecting accuracy. The method was evaluated on the Flavia dataset and obtained
an accuracy of 81.92%.

In another paper, a CNN-based plant classification system was proposed to classify
different plant species from an image database collected from smart agro-stations. The CNN
architecture is used for feature extraction from the plant images. This approach was tested
on the TARBIL database and obtained an accuracy of 80.47% on 16 different plant species. The
CNN-based classification showed higher accuracy than SVM-based classification.

Shah, Sougatta Singh, et al. proposed a leaf classification system using a dual-path deep
CNN. The dual-path CNN performs the following major operations:
i. Both the shape and texture characteristics are studied.
ii. The obtained features are optimized for classification.
Using this method, an accuracy of about 82.28% was obtained on the Flavia dataset, which is
better than other uni-path CNN methods.

CHAPTER 4

EXISTING MODEL / PROPOSED MODEL

4.1 EXISTING MODEL

Liu, Albert and Yangming Huang developed a plant identification system using an SVM for
classification. However, this method holds only for clean images, in which the leaves are
well aligned on a contrasting background with few or no variations of color or luminance.

Kumar, P. M., Surya, C. M. and Gopi, V. P. used different plant features such as color,
texture and shape. However, this method works well only for a static or plain background.

Putzu, L., Di Ruberto, C. and Fenu, G. made use of leaf extraction based on saliency map
methods such as Graph-Based Manifold Ranking (GMR), Visual Saliency Feature (VSF) and
Gaussian pyramids. However, this method works well only in the presence of an untextured
background.

Anantrasirichai, Nantheera, Sion, L., Hannuna and Cedric Nishan Canagarajah proposed a
method based on marker-controlled watershed segmentation. This method still misses some
areas because of reflection, shadow and disease on the leaves.

One of the most authoritative works in the field of plant identification was done by Wu et
al. Their method starts from five basic geometric features, and Principal Component Analysis
(PCA) is then used for dimension reduction. They achieved an average accuracy of 90.3% on the
Flavia dataset.

LIMITATIONS

• Only a limited number of inputs can be processed.

• Works well only for a static or plain background.

• Some areas are missed because of reflection, shadow and disease on the leaves.

• The images must show leaves that are well aligned on a contrasting background, with few
or no variations of color.

4.2 PROPOSED MODEL

A novel method is proposed for identifying medicinal plants from images of both the front
and back of the leaves taken from different angles. The work is based on a database of leaf
images of medicinal plants. Unique combinations of texture, shape and morphological features
have been identified that maximize the identification rate of green leaves. When the image
of any plant leaf is given to the system, it displays the identified plant along with its
image and the properties of the leaf or the disease it cures. In this method a DenseNet type
of Convolutional Neural Network (CNN) is used because of its several compelling advantages:
it strengthens feature propagation and encourages feature reuse, which in turn increases
efficiency and decreases the validation loss. TensorFlow and Keras are used to train the
model on the data.

Fig 4.1 Proposed Model

4.2.1 CONVOLUTIONAL NEURAL NETWORK

A convolutional neural network (CNN or ConvNet) is a type of model used in machine
learning. It is one of the various types of artificial neural networks, which are used for
different applications and data types. A CNN is a kind of network architecture that is
specifically used for image recognition and tasks that involve the processing of pixel data.
There are other types of neural networks in machine learning, but for identifying and
recognizing objects, CNNs are the network architecture of choice. This makes them highly
suitable for computer vision (CV) tasks.

4.2.2 LAYERS IN A CONVOLUTIONAL NEURAL NETWORK

A convolutional neural network has multiple hidden layers that help in extracting
information from an image. The four important layers in a CNN are:

1. Convolution layer

2. ReLU layer

3. Pooling layer

4. Fully connected layer

1. CONVOLUTIONAL LAYER

This is the first step in extracting valuable features from an image. A convolution layer
has several filters that perform the convolution operation. Every image is treated as a
matrix of pixel values. Consider a 5x5 image whose pixel values are either 0 or 1, and a
filter matrix with a dimension of 3x3. Slide the filter matrix over the image and, at each
position, compute the dot product to get the convolved feature matrix.
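
The sliding dot product can be sketched in plain Python. The pixel and filter values below are assumptions chosen for illustration and need not match Fig 4.2; the sketch uses a stride of 1 and no padding, so a 5x5 image and a 3x3 filter give a 3x3 feature matrix.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over a square image (stride 1, no padding).

    As in CNN layers, the kernel is not flipped, so each output value is the
    dot product of the kernel with the image patch underneath it.
    """
    k = len(kernel)
    out_size = len(image) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]


# Hypothetical 5x5 binary image and 3x3 filter.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
for row in feature_map:
    print(row)
```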

Fig 4.2 Example of Convolutional Layer

2. ReLU LAYER

ReLU stands for rectified linear unit. Once the feature maps are extracted, the next step
is to pass them to a ReLU layer, which sets all negative values to zero. The original image
is scanned with multiple convolution and ReLU layers to locate the features and generate the
rectified feature map.

Fig 4.3 and Fig 4.4 Feature Mapping

Together with the convolution filters, the ReLU layer is used to identify different
features of the image, such as color, shape, texture, edges, corners and margin.
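
A minimal sketch of the ReLU step is shown below, assuming a small feature map with made-up values. Every negative entry becomes zero and every positive entry is kept unchanged.

```python
def relu(feature_map):
    """Apply the rectified linear unit element-wise to a 2D feature map."""
    return [[max(0, value) for value in row] for row in feature_map]


# Hypothetical feature map containing negative responses.
feature_map = [
    [3, -1, 2],
    [-4, 0, 5],
    [1, -2, -3],
]

rectified = relu(feature_map)
for row in rectified:
    print(row)
```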

Fig 4.5 Features of Single Leaf

3. POOLING LAYER

Pooling is a down-sampling operation that reduces the dimensionality of the feature map.
The rectified feature map now goes through a pooling layer to generate a pooled feature map.
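
Max pooling is one common form of this down-sampling; the text does not state which pooling type the project uses, so the 2x2 max pooling below is an assumption, as are the input values.

```python
def max_pool(feature_map, size=2):
    """Down-sample a 2D feature map by taking the maximum of each size x size block."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 4x4 rectified feature map.
rectified = [
    [1, 4, 2, 7],
    [2, 6, 8, 5],
    [3, 4, 0, 7],
    [1, 2, 3, 1],
]

pooled = max_pool(rectified)
for row in pooled:
    print(row)
```

The 4x4 map is reduced to 2x2, which is the reduction in dimensionality that the pooling layer provides.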

Fig 4.6 Pooling Layer

4. FULLY CONNECTED LAYER

The Fully Connected (FC) layer consists of weights and biases along with the neurons and
is used to connect the neurons between two different layers. These layers are usually placed
before the output layer and form the last few layers of a CNN architecture.

Here, the feature maps from the previous layers are flattened and fed to the FC layer. The
flattened vector then passes through a few more FC layers, where the mathematical operations
usually take place. At this stage, the classification process begins. Two fully connected
layers are used because they tend to perform better than a single fully connected layer.
These layers in a CNN reduce the need for human supervision.
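
The flatten and fully connected steps can be sketched as follows. The weights, biases and the three-class output are made-up values for illustration, and SoftMax is used as the output activation as described in Chapter 6; a trained network would learn its weights rather than have them written by hand.

```python
import math


def flatten(feature_map):
    """Turn a 2D feature map into a 1D vector."""
    return [value for row in feature_map for value in row]


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(w * x for w, x in zip(ws, inputs)) + b for ws, b in zip(weights, biases)]


def softmax(scores):
    """Normalize scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical pooled feature map and hand-written weights for 3 classes.
pooled = [[6, 8], [4, 7]]
weights = [
    [0.10, 0.00, 0.00, 0.10],
    [0.00, 0.20, 0.10, 0.00],
    [0.05, 0.05, 0.05, 0.05],
]
biases = [0.0, 0.1, 0.0]

vector = flatten(pooled)
probabilities = softmax(dense(vector, weights, biases))
predicted_class = probabilities.index(max(probabilities))
print(vector, predicted_class)
```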

Fig 4.7 CNN Model

ADVANTAGES

• Unique combinations of texture, shape and morphological features have been identified
that maximize the identification rate of green leaves.

• A DenseNet type of Convolutional Neural Network (CNN) is used because it strengthens
feature propagation and encourages feature reuse, which in turn increases efficiency and
decreases the validation loss.

• It automatically detects the important features without any human supervision.

• A CNN is also computationally efficient.

• Its built-in convolutional layers reduce the high dimensionality of images without losing
their information.

CHAPTER 5

EXPERIMENTATION AND RESULT ANALYSIS


5.1 EXPERIMENTAL EVALUATION

1) Training the Data:

The dataset of 2835 images is used to train the model over a number of epochs, and the
accuracy and loss are calculated for each epoch. The number of epochs is the number of times
the learning algorithm sees the complete dataset: one epoch is when the entire dataset is
passed forward and backward through the neural network once. Loss is the summation of the
errors made by the model, and accuracy is the ratio of correct predictions to total
predictions. Training takes place through back propagation.

Fig 5.1 Training process through back propagation

2) Displaying the Training Images:

During the training process, some of the images used for training are displayed.

Fig 5.2 Displaying some of the training images

3) Input and Output:


The path of an image from the test data is given as input and the program is run. The output image is
displayed with its local name, its scientific name and the properties of the leaf or the disease it cures.

Fig 5.3 Input Given

Fig 5.4 Displaying the output

OUTPUT 1:

Fig 5.5 Neem

OUTPUT 2:

Fig 5.6 Fenugreek

OUTPUT 3:

Fig 5.7 False_Daisy

OUTPUT 4:

Fig 5.8 Betel

OUTPUT 5:

Fig 5.9 Oleander

5.2 RESULT ANALYSIS

Training And Validation Accuracy Graph


After training, the training accuracy and validation accuracy are plotted against the 20
epochs. Here the blue line represents validation accuracy and the red line represents
training accuracy. Accuracy is calculated using the formula below.

Fig 5.10 epoch vs accuracy graph

Formula to calculate accuracy:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

Here the number of correct predictions is the number of positive predictions, and the total
number of predictions is the sum of the positive and negative predictions. Positive
predictions are the values predicted correctly for the given input, and negative predictions
are the errors, that is, the incorrectly predicted values.
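
The accuracy calculation can be sketched as below. The labels are made-up examples that reuse plant names from the output figures, not actual results of the project.

```python
def accuracy(actual, predicted):
    """Ratio of correct predictions to the total number of predictions."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels: four of the five predictions are correct.
actual = ["Neem", "Betel", "Oleander", "Neem", "Fenugreek"]
predicted = ["Neem", "Betel", "Neem", "Neem", "Fenugreek"]

score = accuracy(actual, predicted)
print(score)
```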

The values of the training accuracy and validation accuracy for each epoch are given in the
table below.

EPOCH   MODEL ACCURACY   VALIDATION ACCURACY
1       0.2703           0.3898
2       0.3448           0.4215
3       0.4537           0.5538
4       0.6442           0.6649
5       0.7884           0.7443
6       0.8761           0.7901
7       0.9171           0.7937
8       0.9638           0.8289
9       0.9793           0.8236
10      0.9987           0.8201
11      1.0000           0.8325
12      1.0000           0.8289
13      1.0000           0.8254
14      1.0000           0.8272
15      1.0000           0.8289
16      1.0000           0.8183
17      1.0000           0.8183
18      1.0000           0.8254
19      1.0000           0.8289
20      1.0000           0.8325

Table 5.1 Epoch vs model accuracy and validation accuracy
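
The values in Table 5.1 can be examined with a short script to see where training accuracy overtakes validation accuracy and how large the gap becomes. This is a sketch added for illustration; the variable names are assumptions and the script is not part of the original experiment.

```python
# Accuracy values copied from Table 5.1 (epochs 1 to 20).
train_acc = [0.2703, 0.3448, 0.4537, 0.6442, 0.7884, 0.8761, 0.9171, 0.9638, 0.9793, 0.9987,
             1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000]
val_acc = [0.3898, 0.4215, 0.5538, 0.6649, 0.7443, 0.7901, 0.7937, 0.8289, 0.8236, 0.8201,
           0.8325, 0.8289, 0.8254, 0.8272, 0.8289, 0.8183, 0.8183, 0.8254, 0.8289, 0.8325]

# First epoch (1-based) with the highest validation accuracy.
best_epoch = max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1

# First epoch in which training accuracy exceeds validation accuracy.
first_overtake = next(i + 1 for i, (t, v) in enumerate(zip(train_acc, val_acc)) if t > v)

# Gap between training and validation accuracy after the last epoch.
final_gap = train_acc[-1] - val_acc[-1]

print(best_epoch, first_overtake, round(final_gap, 4))
```

Training accuracy reaches 1.0000 from epoch 11 onward while validation accuracy stays near 0.83, which is the pattern that Chapter 2 describes as over-fitting.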

Training And Validation Loss Graph :


After training, the training loss and validation loss are plotted against the 20 epochs.
Here the blue line represents validation loss and the red line represents training loss.
Loss is calculated using the formula below.

Fig 5.11 epoch vs loss graph

Formula to calculate loss:

$$\text{LOSS} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

$Y_i$ – the actual output

$\hat{Y}_i$ – the predicted output

n – number of inputs

i – iteration index over the inputs
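
A loss computed as the sum of squared differences between the actual and predicted outputs can be sketched as follows. The values are made up for illustration, and the loss function actually passed to Keras in this project is not stated in the text.

```python
def sum_squared_error(actual, predicted):
    """Sum over all inputs of (actual output - predicted output) squared."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must be of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))


# Hypothetical one-hot target and predicted class probabilities.
actual = [1.0, 0.0, 0.0]
predicted = [0.7, 0.2, 0.1]

loss = sum_squared_error(actual, predicted)
print(round(loss, 4))
```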

The values of the training loss and validation loss for each epoch are given in the table
below.

Table 5.2 Epochs vs model loss and validation loss

CHAPTER 6

TOOLS AND TECHNOLOGIES


Python 3.9 is the programming language used and the IDE is Jupyter Notebook. Keras is used
to train the model. It is a high-level neural network library which trains the deep learning
model using epochs and back propagation. In each epoch the data is divided into batches and
the model is trained on them iteratively; during training, the loss is minimized and the
accuracy is monitored.
DenseNet is the type of CNN used, and the library used for the numerical calculations in
DenseNet is TensorFlow. It is an open-source software library which performs computations
using dataflow graphs and provides multiple application interfaces. The activation function
used in the first layer of the CNN is ReLU and the activation function used in the last
layer is SoftMax.
ReLU is a piecewise linear function that outputs the input directly if it is positive and
outputs zero otherwise. SoftMax scales the input values to between 0 and 1 so that they sum
to 1, i.e., it normalizes the output. The optimizer used is Adam, with a learning rate of
0.001. Adam is a stochastic gradient descent method based on the estimation of first-order
and second-order moments.
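
The configuration described above could be set up in Keras roughly as follows. This is a sketch under stated assumptions, not the project's actual code: the DenseNet121 variant, the 224x224 input size, training from random weights (`weights=None`) and the categorical cross-entropy loss are all assumptions, while the 31 classes, the SoftMax output, the Adam optimizer, the learning rate of 0.001 and the 20 epochs come from the text.

```python
NUM_CLASSES = 31         # number of plant classes, from the dataset description
LEARNING_RATE = 0.001    # Adam learning rate, from the text
EPOCHS = 20              # number of epochs, from the result analysis
IMAGE_SIZE = (224, 224)  # assumption: the input size is not given in the text


def build_model(num_classes=NUM_CLASSES, image_size=IMAGE_SIZE):
    """Build and compile a DenseNet-based classifier with a SoftMax output."""
    import tensorflow as tf  # imported here so the constants load without TensorFlow

    # Assumption: DenseNet121 without its ImageNet head, trained from scratch.
    base = tf.keras.applications.DenseNet121(
        include_top=False,
        weights=None,
        input_shape=image_size + (3,),
        pooling="avg",
    )
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
    model = tf.keras.Model(inputs=base.input, outputs=outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
        loss="categorical_crossentropy",  # assumption: one-hot encoded labels
        metrics=["accuracy"],
    )
    return model
```

With hypothetical training and validation datasets named `train_data` and `val_data`, training would then be started with `build_model().fit(train_data, validation_data=val_data, epochs=EPOCHS)`.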

CHAPTER 7

CONCLUSION AND FUTURE DIRECTIONS


7.1 CONCLUSION

Plants are necessary for human survival. Herbs, in particular, have been employed by
indigenous populations as folk medicines since ancient times. Herbs are typically recognized
by clinicians based on decades of intimate sensory or olfactory experience. Recent
improvements in analytical technology have made it much easier to identify herbs on the
basis of scientific evidence. This helps many individuals, particularly those who are not
used to recognizing herbs. In addition to being time-consuming, laboratory-based analysis
requires expertise in sample preparation and data interpretation. As a result, a simple and
reliable method for identifying herbs is required. Herbal identification is anticipated to
benefit from the combination of computation and statistical analysis. This non-destructive
technique will be the preferred approach for quickly identifying herbs, especially for
individuals who are unable to use expensive analytical equipment. This work reviews
different methods for plant recognition along with their advantages and disadvantages.

7.2 FUTURE DIRECTIONS


The proposed methods are not suitable for tiny leaves or plants without a proper leaf.
Efforts may be made to develop methods to identify these types of plants. The algorithms may
be implemented on a standalone single-board computer connected to a scanner. A portable
system may be developed for field use. In future research on plant identification, an
improved machine learning classifier with additional pre-processing and feature selection
models may be used to address accuracy-related issues and enhance performance.
