Crop Yield Prediction Model

Sarthak Aggarwal, Manushivam Maheshwari, Vartul Tripathi and Sangeeta Mittal

Department of Computer Science and Engineering


Jaypee Institute of Information Technology, Noida-62
sarthaakagarwal@gmail.com, harindermaheshwari@gmail.com, vartult2@gmail.com, sangeeta.mittal@jiit.ac.in

Abstract—Food crop production in India is largely about cereal crops like rice and wheat. The sustainability and productivity of rice depend on the area’s climatic conditions, topographic factors, the amount of fertilizer used, and so on. Having a pre-estimate of annual yield is critical to deciding upon policies to ensure food security. Improved techniques to predict crop productivity under different conditions can assist farmers and other stakeholders in making correct and balanced use of resources. In this study, crop yield prediction has been addressed using several machine learning algorithms. Data available on different government sites has been collated to create a feature set. Experimental results show that the proposed work efficiently predicts the chosen crop’s yield.

Keywords—crop analysis; crop yield; data science; multiple linear regression; support vector regression; decision tree regression; prediction

I. INTRODUCTION

The farming sector is one of the backbones of any economy. In India, about 58.4% of the population earn their livelihood from agriculture. India’s recent crop yields are just 30% to 60% of the best crop yields attainable in developed and a few developing countries [3]. In India, a majority of farmers are not getting the expected crop yield. To meet the demands created by an increasing population, changing climatic factors, and deteriorating soil and water conditions, farmers are forced to use synthetic yield-enhancing products. These products, laced with dangerous chemicals, are degrading society’s health at large. This problem can be circumvented by providing farmers with technological solutions that can model the cause and effect of natural conditions on crop yield. In this paper, machine learning methods have been used to find the relation between crop yield and seasonal climatic conditions.

Apart from assets like soil, water and air, which bear on food security [4], agricultural yield primarily depends on environmental conditions. Yield prediction is an important agricultural problem. Every farmer is interested in knowing how much yield to expect. In the past, yield prediction was performed by considering the farmer’s previous experience for a particular city or district. The volume of data in Indian agriculture is enormous; once turned into information, this data is highly useful for many purposes. Yield prediction is a major issue that remains to be solved based on available data. The patterns, associations, or relationships among all this data can provide information. Information can be converted into knowledge about historical patterns and future trends. For example, summary information about crop production can help farmers identify crop losses and prevent them in the future.

This study aims to build a machine learning model that helps farmers predict and analyze production based on available data, and determine the future yield of their crops along with best practices. A number of studies have investigated how Information and Communication Technology can be applied to improve crop yield prediction [2]. The model looks at different variables that define the yield of any crop, such as total area available, seasonal rainfall, fertilizer, state name, production, water quality and minimum support price. These variables are then passed to three different machine learning algorithms, Multiple Linear Regression (MLR), Support Vector Regression (SVR) and Decision Tree Regression (DTR), to predict the yield and the best practices to get a better yield. Multiple Linear Regression determines the relative influence of one or more predictor variables on the criterion value, which in our case is yield. Support Vector Regression has excellent generalization capability with high prediction accuracy, which suits a larger number of parameters such as fertilizer used, water quality, etc. The results were analyzed and conclusions drawn as to the effectiveness of each method for improving crop yield prediction.

II. RELATED WORK

Productivity of crops is one of the main issues in agriculture [7]. Support Vector Regression is a supervised machine learning technique, and there are a number of examples where it has been used in the agriculture domain [1]. Kaur et al. (2017) predicted rice yield using data from 2001-2015 for Ludhiana and Punjab, using the Apriori algorithm to find that temperature and rainfall have a big effect on rice crop yield [1]. Rice yield was also found to be high during low rainfall [1].
Regression or correlation analyses are generally used to characterize the statistical relationship between controlled variables and crop yield [6]. Regression is a statistical measure that can be used to determine the strength of the relationship between one dependent variable and a series of other changing variables known as independent variables (regular attributes) [9]. Multiple linear regression is the same as linear regression, with the only difference being that in multiple linear regression, there can be a multiple number
of independent variables [8]. Many studies have examined the role of mean climate change in agriculture, but an understanding of the influence of inter-annual climate variations on crop yields in different regions remains elusive [11].
Crop production will definitely be affected by variation in temperature [2]. Given relatively small increases in temperature, however, uncertainty in yield predictions could be amplified by higher yield sensitivity to temperature [10]. Gandhi et al. (2015) used SVM to predict rice crop yield with data from 1998-2002 and with two parameters, temperature and rainfall, for Maharashtra state only [2]. K-Clustering was used by Thombare et al. (2017) to predict yield using rainfall [3]. Chong et al. (2015) used the nearest neighbor algorithm to predict yield in China with very high accuracy. An "ARMA" model was built based on the "nearest year" to obtain the years having similar production [5]. Weather prediction based on a machine learning technique called Support Vector Machines has also been proposed. These algorithms have shown better results than the conventional algorithms [4]. Manjula et al. (2017) predicted crop yield in Tamil Nadu with temperature as the main factor [4].

III. RESEARCH METHOD

This section discusses the method used for this research and includes details of the dataset and methodology. We created two datasets for better prediction.
DATASET-1 was collected for the years 2000-2015 with the parameters State Name, Crop Year, Crop Name, Season, Area of cultivation, Production, Rainfall, Minimum Support Price and Yield.
DATASET-2 was collected for the years 2012-2015 with the parameters State Name, Crop Year, Crop Name, Season, Area, Production, Rainfall, Water Quality, Fertilizer, Minimum Support Price and Yield.

A. Dataset Used

All the datasets used in the project were extracted from the openly accessible records of the Indian Government website https://data.gov.in/ [12]. Data was extracted for the Kharif season of the rice crop for the years 2000-2015. All the parameters were collected for almost all the states of India for the rice crop during this period.

• Rainfall (mm) - The total precipitation of the Kharif season (June-November) for each year and every state, calculated from the monthly rainfall of that state for that year.

• Area (Hectare) - The rice cultivated area for each state and each year.

• Production (Tonnes) - The rice production for the above area in each state for each year.

• Yield (Tonnes/Hectare) - Depending on the area and production for each state for each year, the yield was calculated.

• Minimum Support Price - The Minimum Support Price fixed by the government for each year.

• Fertilizer (Tonnes) - The fertilizer (urea) required by each state for the years 2012-2015.

A few more climatic parameters were collected from the website [13]: http://www.cpcbenvis.nic.in/

• Water Quality (pH) - The water quality of each state for each year was collected and used.

Fig. 1. Parameters Used for Analysis

B. Methodology

This section describes the methodology used in this paper to achieve yield prediction.

• DATA COLLECTION:

The dataset was collected from various data sources as described earlier. This raw data was appended to get a single raw data source in the context of the problem.

• DATA PRE-PROCESSING:

STEP 1: Acquired each parameter as the raw dataset (Season, Area, Production, MSP, Yield) and computed the mean to obtain yearly records of each state from 2000 to 2015.

STEP 2: Calculated the Water Quality Index from the minimum, maximum and average values of the city-wise data during the Kharif season (best preferred for rice production, from June to November), and acquired data on the fertilizer (urea) required from the government resources [13] for all the states for the years 2012-2015.

STEP 3: The raw dataset was then collated as a single source to find total area and production under each state for each year and finally had the following fields: Year,
State, Crop, Season, Area, Production, Water Quality, Rainfall, Fertilizer, Minimum Support Price, Yield.

STEP 4: For some years, values for the Water Quality Index and Fertilizer were not available, so we made two datasets to utilize the complete data. Dataset 1 was collected for the years 2000-2015 with the parameters State Name, Crop Year, Crop Name, Season, Area, Production, Rainfall, Minimum Support Price and Yield. Similarly, Dataset 2 was collected for the years 2012-2015 with all the above parameters as well as Water Quality and Fertilizer information. Finally, the data was arranged year-wise.

STEP 5: All fields of both datasets were normalized to center the data (zero mean and unit standard deviation), to eliminate large variations in actual value ranges.

Fig. 2. Methodology [1]

On the preprocessed datasets, a few machine learning algorithms were applied to evaluate the accuracy of yield prediction from the chosen features.

• Multivariate Linear Regression - The model for multivariate or multiple linear regression is:

$$y_i = B_0 + B_1 x_{i1} + B_2 x_{i2} + \dots + B_p x_{ip} + E \tag{1}$$

where
$y_i$ = dependent variable, yield
$x_{i1}$ = independent variable, area
$x_{i2}$ = independent variable, production
$x_{i3}$ = independent variable, rainfall
$x_{i4}$ = independent variable, fertilizer
$E$ = random error in prediction, that is, the variance that cannot be accurately predicted by the model. It is also known as the residual.
$B_0$ = y-intercept, the value of $y_i$ when all independent variables are zero.
$B_1, \dots, B_p$ = regression coefficients; $B_j$ measures the change in the dependent variable per unit change in $x_{ij}$, e.g., the change in yield when area, production, rainfall or fertilizer requirement changes.

• Support Vector Regression - Support Vector Machine (SVM) can be used as a regression method, maintaining all the main features that characterize the algorithm (maximal margin). Support Vector Regression (SVR) uses the same principles as the SVM for classification, with only a few minor differences. First of all, because the output is a real number, it becomes very difficult to predict the information at hand, which has infinite possibilities. In the case of regression, a margin of tolerance (epsilon) is therefore set in approximation to the SVM, and errors that fall within this margin are not penalized.

• Decision trees - A decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node tests an attribute (e.g., area, production or rainfall) and has two or more branches, each representing values for the attribute tested. A leaf node represents a decision on the numerical target (e.g., yield). The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

IV. METRICS

Since our model produces an output for any input or set of inputs, we check these estimated outputs against the actual values that we tried to predict. We call the difference between the actual value and the model’s estimate a residual.

A. Accuracy

Accuracy is an important metric for evaluating the performance of a regression model. It is computed as the fraction of correctly predicted instances out of the total instances presented to the model. A predicted value is considered accurate if the difference between the predicted and actual value is less than or equal to 1%.

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

B. Root Mean Square Error

Root Mean Square Error (RMSE) is a quadratic scoring rule that also measures the average magnitude of the error. It is
the square root of the average of squared differences between prediction and actual observation.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - a_i)^2} \tag{2}$$

where,
• $p_i$ = forecasted/predicted value
• $a_i$ = actual observed value (known result)
• $n$ = total number of values

C. Mean Absolute Error

Mean Absolute Error (MAE) measures the average magnitude of the errors in a set of predictions, without considering their direction.

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |p_i - a_i| \tag{3}$$

where,
• $p_i$ = forecasted/predicted value
• $a_i$ = actual observed value (known result)
• $n$ = total number of values

D. Median Absolute Error

The Median Absolute Error (MdAE) is similar to the MAE, but we start with the absolute values of the residuals and use the median instead of the mean as the measure of centrality.

$$\mathrm{MdAE} = \operatorname{median}(|e_1|, |e_2|, \dots, |e_n|) \tag{4}$$

where,
• $e_i$ = difference between forecast and actual value

V. EXPERIMENTAL RESULTS

This section discusses the results obtained after applying the three machine learning algorithms to the two datasets. On Dataset 1, Support Vector Regression performed best, with an accuracy (as defined in Section IV-A) of 80% and a root mean square error, defined by equation (2), of 0.4795, while Multiple Linear Regression achieved an accuracy of 72% and a root mean square error of 0.5741. Least accurate was decision tree regression, with an accuracy of 63% and a root mean square error of 1.6562.

TABLE I
RESULT FROM DATASET-1

DATASET-1 (2000-2015)        Accuracy   Root Mean Square Error   Mean Absolute Error   Median Absolute Error
Multiple Linear Regression   72%        0.5741                   0.4178                0.2757
Support Vector Regression    80%        0.4795                   0.2100                0.000
Decision Tree Regression     63%        1.6562                   1.0690                0.4278

Fig. 3. Accuracy Comparison

Fig. 4. Root Mean Square Error

On Dataset 2 with water quality and fertilizer, multiple linear regression achieved an accuracy of 92% with root
mean square error of 0.6858 and mean absolute error, defined by equation (3), of 0.4772. Support vector regression achieved an accuracy of 88% and a mean absolute error of 0.1600, while decision tree regression achieved 60%, with a root mean square error of 1.35 and a median absolute error, defined by equation (4), of 0.6325.

TABLE II
RESULT FROM DATASET-2

DATASET-2 (2012-2015)        Accuracy   Root Mean Square Error   Mean Absolute Error   Median Absolute Error
Multiple Linear Regression   92%        0.6858                   0.4772                0.3008
Support Vector Regression    88%        0.4898                   0.1600                0.000
Decision Tree Regression     60%        1.3454                   0.9908                0.6325

Fig. 5. Mean Absolute Error

VI. DISCUSSION AND CONCLUSION

The analysis shows that support vector regression works best with a large number of parameters and a large dataset, as it achieved an accuracy of 80% with a root mean square error of 0.4795 and a mean absolute error of 0.21 on Dataset-1. By comparison, previous research reached an accuracy of 78.76%, a root mean square error of 0.39 and a mean absolute error of 0.23 with the same model on a smaller dataset with fewer parameters [2]. Our method also outperformed another work which used the same amount of data as ours but with fewer parameters and considered only two cities in Punjab [1]. The root mean square errors of Multiple Linear Regression and Support Vector Regression on Dataset-1 are 0.5741 and 0.4795, lower than those of Dey et al., which were 0.602 for multiple linear regression and 0.598 for support vector machine [8]. So, according to our study, Support Vector Regression for the linear data model will work even more efficiently if more parameters are used to predict crop yield.
On Dataset-2, Multiple Linear Regression performed best, with an accuracy of 92%, because of the linearity of the dataset and the smaller amount of data. These results were also better than those of papers using a similar amount of data to ours but fewer parameters [2]. Manjula et al. reported 86% accuracy with fewer parameters and with data from a single state [4].

The proposed work has demonstrated the prediction of rice crop yield by applying different machine learning models: multiple linear regression, support vector regression and decision tree regression. The experimental results show that multiple linear regression works better for smaller amounts of data, with an accuracy of 92% and a median absolute error of 0.3008 for Dataset 2 (2012-2015) and 72% with a median absolute error of 0.2757 for Dataset 1 (2000-2015), while support vector regression is best suited for more data, as it achieved an accuracy of 80% on Dataset 1 (2000-2015) and 88% on Dataset 2 (2012-2015). We were also able to improve prediction performance compared to previous approaches by adding more parameters to the dataset and achieving higher accuracy.
Fig. 6. Median Absolute Error

REFERENCES
[1] K. Kaur and K.S. Attwal, "Effect of Temperature and rainfall on Paddy Yield using Data Mining", International Conference on Cloud Computing, Data Science & Engineering - Confluence, 2017.
[2] N. Gandhi, L.J. Armstrong, O. Petkar and A.K. Tripathy, "Rice crop yield prediction in India using Support Vector Machines", 13th International Joint Conference on Computer Science and Software Engineering (JCSSE), 2016.
[3] R. Thombare, A. Chaudhari, S. Bhosale and P. Dhemey, "Crop Yield Prediction Using Big Data Analytics", International Journal of Computer Mathematical Sciences, 2017.
[4] E. Manjula and S. Djodiltachoumy, "A Model for Prediction of Crop Yield", International Journal of Computational Intelligence and Informatics, Vol. 6, No. 4, March 2017.
[5] Chen Chong, Wu Fan, Guo Xiaoling, Yu Hua and Wang Juyun, "Prediction of Crop Yield using Big Data", 8th International Symposium on Computational Intelligence Data, 2015.
[6] B. S. Ji and J. Wan, "Artificial neural networks for rice yield prediction in mountainous regions", Journal of Agricultural Science, 2007.
[7] P. Kanjana Devi, "Enhanced Crop Yield Prediction and Soil Data Analysis Using Data Mining", International Journal of Modern Computer Science, 2016.
[8] Umid Kumar Dey, Abdullah Hasan Masud and Mohammed Nazim Uddin, "Rice yield prediction model using data mining", in Electrical, Computer and Communication Engineering (ECCE), 2017.
[9] D. Ramesh and B. Vishnu Vardhan, "Analysis of crop yield prediction using data mining techniques", International Journal of Research in Engineering and Technology 4, 2015.
[10] Tao Li, Toshihiro Hasegawa, Xinyou Yin, Yan Zhu, Kenneth Boote, Myriam Adam, Simone Bregaglio et al., "Uncertainties in predicting rice yield by current crop models under a wide range of climatic conditions", Global Change Biology 21, 2015.
[11] Deepak K. Ray, James S. Gerber, Graham K. MacDonald and Paul C. West, "Climate variation explains a third of global crop yield variability", Nature Communications 6, 2015.
[12] Open Government Data Platform India, https://data.gov.in/
[13] Water Quality Dataset, http://www.cpcbenvis.nic.in/
