Sugarcane Yield Grade Prediction Using Random Forest With Forward Feature Selection and Hyper-Parameter Tuning
Abstract. This paper presents a Random Forest (RF) based method for predicting the sugarcane yield grade of a farmer plot. The dataset used in this work is obtained from a set of sugarcane plots around a sugar mill in Thailand. The train and test datasets contain 8,765 and 3,756 records, respectively. We propose a forward feature selection in conjunction with hyper-parameter tuning for training the random forest classifier. The accuracy of our method is 71.88%. We compare the accuracy of our method with two non-machine-learning baselines. The first baseline uses the actual yield of the previous year as the prediction. In the second baseline, the target yield of each plot is predicted manually by a human expert. The accuracies of these baselines are 51.52% and 65.50%, respectively. These results indicate that our proposed method can aid decision making in sugar mill operation planning.
1 Introduction
In each harvest season, sugar mills need an estimate of the sugarcane yield of each plot for the purpose of operation planning. For each production year, the field surveying staff of the sugar mill collect data on each plot. Conventionally, the field experts of the sugar mill use their experience to estimate the sugarcane yield of each plot based on the surveying data and the historical yield profiles of the plot and the farmer. The main disadvantage of this human-based yield estimation is the large discrepancy between the estimated yields and the actual yields.
In this work, we propose a machine learning based method for sugarcane yield grade prediction. The yield grade of each plot is assigned to low-yield-volume, medium-yield-volume or high-yield-volume, and our goal is to predict the yield grade of a plot. The features used in the prediction are the plot characteristics, the sugarcane characteristics, the plot cultivation scheme and the rain volume.
Our proposed method is based on the Random Forest (RF) technique [7]. In the training phase, we propose an RF model training procedure that uses forward feature selection [8] in conjunction with hyper-parameter tuning. We implement our method in Python (Anaconda distribution); in particular, we use the RF implementation of Scikit-Learn [1]. The contribution of this work is a machine learning based sugarcane yield grade prediction method for data obtained from sugarcane plots around a sugar mill in Thailand. Our method outperforms the predictions provided by human experts.
The remainder of the paper is organized as follows. Related work is reviewed in Section 2. In Section 3, we describe the data used in the prediction as well as the details of our proposed method. We then report the prediction accuracy and discuss the results in Section 4. Finally, the conclusion is drawn in Section 5.
2 Related Work
Prior work on sugarcane yield prediction is presented in [2], [3], [4] and [5]; these works are directly related to ours. In [2], the authors propose a sugarcane yield prediction method using a random forest algorithm. The features used in that work include: (i) a biomass index computed from the Agricultural Production Systems sIMulator (APSIM), (ii) yields from the two previous years, (iii) cumulative rainfall, radiation and daily temperature range over two different time periods, (iv) the Southern Oscillation Index (SOI), and (v) the 3-month running average sea surface temperature in the Niño 3.4 region. In [3], the authors propose a method to predict the sugar content of sugarcane harvests in the years 2011-2012, where each observation corresponds to one block in the farms. The authors use 53 features in the prediction, belonging to four groups: (i) soil physics and soil chemistry, (ii) weather, (iii) agricultural practices, and (iv) crop-related information. Three machine learning techniques are used in the prediction: Support Vector Regression, Random Forest, and Regression Trees. In addition, the RReliefF algorithm [6] is used for feature selection. The authors report that the prediction accuracy of their best method in terms of Mean Absolute Error is 2.02 kg/Mg, and that their method can predict 90% of the observations within a precision of 5.40 kg/Mg.
In [4], the authors study the effects of hyper-parameter tuning, feature engineering and feature selection on sugarcane yield prediction techniques. For feature engineering, a set of features derived from the original features is calculated and used in the prediction, e.g., the rate of fertilization for each nutrient (N, P and K), the weather description for four different periods, etc. Feature selection is performed with the RReliefF algorithm [6]. The machine learning techniques used in the study include Support Vector Machine (SVM), Random Forest (RF), Regression Tree (RT), Neural Network (NN), and Boosted Regression Trees (BRT). Hyper-parameter tuning is accomplished by grid search with 10-fold cross validation, using Mean Absolute Error as the metric. The authors evaluate 66 combinations of machine learning technique, tuning, feature selection, and feature engineering, and report that BRT, SVM, and RF perform better than the others, with RF being the best. In [5], the authors propose a method to solve the harvest scheduling problem for a group of sugarcane growers that supply sugarcane to a mill in Thailand. A neural network is applied to predict the sugarcane yields used in the harvest scheduling. The features used in the prediction include crop class, farming skill, cultivars, soil type, the use of irrigation, the age of the cane in days from cultivation to harvest, the average daily minimum/maximum temperature, the average daily rainfall (mm), and the accumulated daily rainfall since germination (mm).
With regard to feature selection for crop yield prediction, relevant work includes [8] and [9]. In [8], the authors propose a forward feature selection method for wheat yield prediction, cast as a regression problem, and use two predictive methods, Regression Tree and Support Vector Regression. Inspired by this work, we adopt their idea in our proposed method presented in Section 3. In [9], the authors evaluate several common predictive modeling techniques applied to crop yield prediction, using a method to define the best feature subset for each model. The predictive modeling methods studied in [9] include multiple linear regression, stepwise linear regression, M5′ regression trees, and artificial neural networks (ANN).
3.1 Data
Table 1. The variables in the dataset.

Variable name | Type | Description
Class_Cane | Category | Cane class (3 types: 1st ratoon cane, 2nd ratoon cane, 3rd ratoon cane)
Type_Cane | Category | Cane variety (4 types: LK92-11, K84-200, K99-72, Khonkean3)
WaterType | Category | Irrigation source (3 types: Rain, Groundwater, Natural canal)
ActionWater | Category | Irrigation action type (2 types: Water pour, Rain)
Epidemic | Category | Epidemic control method (2 types: Pre-emergent, Herbicide)
FertilizerType | Category | Fertilizer type (2 types: Chemical, Organic)
Fertilizer | Category | Fertilizer formula (4 types: 46-0-0, 15-15-15, 16-16-16, 25-7-7)
TypeSoil | Category | Soil type (Loam, Silty clay, Ferus soil)
GrooveWide | Category | Groove width of the plot (4 types: 120, 130, 140, 150 cm)
YieldOldGrade | Category | Yield grade of the plot from the previous season (Grade 1, 2, or 3)
TargetGrade | Category | Target yield grade provided by a human expert (Grade 1, 2, or 3)
TargetOldGrade | Category | Target yield grade from the previous season (Grade 1, 2, or 3)
FarmerContractGrade | Category | Farmer grade provided by the financial department according to his/her profile of sugarcane yields in previous years and his/her profile of financial debts with the mill, in case financial aid is provided to the farmer (3 types: A: Good, B: Fair, C: Poor)
Area_Remain | Continuous | Actual plot area used for sugarcane cultivation (rai)
Distance | Continuous | Distance from the plot to the sugar mill (km)
Rain_Vol | Continuous | Rain volume in the area of the plot (mm)
ContractsArea | Continuous | Amount of sugarcane the farmer commits to deliver to the mill (tons/rai)
YieldGrade | Category | Yield grade (Grade 1, 2, or 3)
Data Preprocessing.
First, we perform an exploratory data analysis (EDA). Some records have missing values (NULL): 41, 45, and 43 records have no value in the fields "Type_Cane", "FertilizerType", and "Fertilizer", respectively. There are also some records that correspond to outliers. A record is flagged as an outlier if it contains a categorical value whose count over the whole dataset is very small compared with the count of the most frequent value. These missing values and outliers most likely arise during the data entry process. We fill in the missing values and replace the outlier values with the mode (most frequent value) of the corresponding fields. After this pre-processing, we apply one-hot encoding (dummy variables) to the categorical variables. Finally, we randomly split the dataset into train data and test data with a 70:30 (train:test) ratio. The numbers of records in the train and test datasets are 8,765 and 3,756, respectively.
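As an illustration of these pre-processing steps, the following is a minimal sketch in Python with pandas and Scikit-Learn. The file name, the exact column lists, and the random seed are illustrative assumptions, not the authors' original code.

# A minimal sketch of the pre-processing described above (assumed file name,
# column lists and seed; not the authors' code).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("sugarcane_plots.csv")  # hypothetical file name

# Fill missing values (and previously identified outlier values) with the mode
# of the corresponding field, e.g. for the three fields that contain NULLs.
for col in ["Type_Cane", "FertilizerType", "Fertilizer"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# One-hot encode the categorical variables (dummy variables).
continuous_cols = ["Area_Remain", "Distance", "Rain_Vol", "ContractsArea"]
categorical_cols = [c for c in df.columns
                    if c not in continuous_cols + ["YieldGrade"]]
df = pd.get_dummies(df, columns=categorical_cols)

# Random 70:30 train/test split.
train_df, test_df = train_test_split(df, test_size=0.3, random_state=0)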
In the training set, the numbers of records of Grade 1, Grade 2, and Grade 3 are 3,933, 3,759, and 1,073, respectively; in the testing set they are 1,712, 1,607, and 437, respectively. The distributions of yield grade in the training and testing datasets are shown in Fig. 1. Clearly, the distributions of the target variable in the train and test datasets are similar.
Fig. 1. The histograms of the distributions of yield grades in the training dataset and the testing dataset.
3.2 Yield grade prediction using Random Forest with Forward Feature
Selection
In this section, we present our Random Forest based method for sugarcane yield grade prediction. As an overview, the input to the system is a record of data with a pre-selected set of variables (listed in Table 1) and the output is the predicted yield grade of the corresponding record. In the model training step, we propose to use forward feature selection in conjunction with hyper-parameter tuning to improve the prediction accuracy.
As a review, Random Forest (RF) is a supervised machine learning algorithm that can be used for both regression and classification tasks. An RF model is an ensemble of decision trees. Each decision tree (DT) is trained on a random subset of the training data, where the sampling is performed with replacement. Furthermore, at each step of the DT construction, the best splitting feature is chosen from a random subset of the features. To make a prediction for a test instance, each DT first makes its own prediction; the predictions of all DTs in the model are then aggregated by either hard or soft voting. Moreover, because the random subsets of training data are sampled with replacement, roughly 37% of the training instances are not used in the construction of each DT (for n instances, the probability that an instance is left out is (1 - 1/n)^n ≈ e^(-1) ≈ 0.37). These samples are called out-of-bag (OOB) instances. We can use the OOB instances to evaluate the predictive accuracy of an RF model by averaging the OOB evaluations over the DTs of the model.
In this work, we use the Random Forest model of Scikit-Learn [1]. The input features are those listed in Table 1, excluding YieldGrade, which is the target of prediction. First, we build an RF predictive model using the default parameters of Scikit-Learn. The accuracies of yield grade classification on the training set and the testing set are 97.96% and 68.29%, respectively, and the OOB accuracy is 64.61%. The large gap between the training and testing accuracies indicates that the model overfits.
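This baseline experiment can be reproduced along the following lines, continuing the pre-processing sketch above. This is a sketch rather than the authors' code: the train_df/test_df variables are assumed to come from the earlier split, and the seed is arbitrary.

# A sketch of the default-parameter baseline and the overfitting check.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X_train = train_df.drop(columns=["YieldGrade"])
y_train = train_df["YieldGrade"]
X_test = test_df.drop(columns=["YieldGrade"])
y_test = test_df["YieldGrade"]

rf = RandomForestClassifier(oob_score=True, random_state=0)
rf.fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, rf.predict(X_train)))  # very high
print("test accuracy: ", accuracy_score(y_test, rf.predict(X_test)))    # much lower -> overfitting
print("OOB accuracy:  ", rf.oob_score_)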
Algorithm 1: RF model training with forward feature selection and hyper-parameter tuning
Input: training data: X = features, Y = class labels; F = set of features
Output: trained model best_clf and best hyper-parameters best_params
1: clf_1, params_1 = GridSearchCV(RandomForestClassifier(), X[F], Y)
2: impVal = clf_1.feature_importance()
3: F = sort_descending(F, impVal)
4: S_0 = {}
5: for i = 1 to |F|
6:     S_i = S_{i-1} ∪ {F[i]}
7:     clf = RandomForestClassifier(params_1)
8:     ValScore[i] = cross_validation_score(clf, X[S_i], Y)
9: end for
10: k = argmax_i ValScore[i]
11: best_clf, best_params = GridSearchCV(RandomForestClassifier(), X[S_k], Y)
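Below is a minimal Python sketch of Algorithm 1 using Scikit-Learn. It assumes X is a DataFrame of (one-hot encoded) training features and y the yield grades; the hyper-parameter grid and the use of 10-fold cross validation are illustrative assumptions, and the sketch ranks individual columns rather than grouping one-hot columns back to their original variables as Table 2 does.

# A sketch of Algorithm 1 (grid, folds and seed are assumptions).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = X_train, y_train   # training features and yield grades (assumed names)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}

# Line 1: tune an RF on all features, then take its feature importances (line 2).
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=10)
search.fit(X, y)
params_1 = search.best_params_
imp_val = search.best_estimator_.feature_importances_

# Line 3: rank the features by descending importance.
ranked = list(X.columns[np.argsort(imp_val)[::-1]])

# Lines 4-9: forward selection - grow the feature subset one feature at a time
# and record the cross-validated accuracy of an RF with the tuned parameters.
val_score = []
for i in range(1, len(ranked) + 1):
    clf = RandomForestClassifier(random_state=0, **params_1)
    val_score.append(cross_val_score(clf, X[ranked[:i]], y, cv=10).mean())

# Lines 10-11: keep the best subset size and re-tune on that subset.
k = int(np.argmax(val_score)) + 1
final = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=10)
final.fit(X[ranked[:k]], y)
best_clf, best_params = final.best_estimator_, final.best_params_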
Table 2. Feature importance values of the RF based model and the mean validation score at each iteration of forward feature selection.

Feature name | Feature importance value | Mean validation score (after adding the feature)
TargetGrade | 0.60273 | 0.66374
FarmerContractGrade | 0.22070 | 0.69577
ContractsArea | 0.05302 | 0.70264
Area_Remain | 0.03336 | 0.70208
Distance | 0.02211 | 0.69649
Rain_Vol | 0.01935 | 0.69864
YieldOldGrade | 0.01931 | 0.70232
Fertilizer | 0.00702 | 0.70367
WaterType | 0.00586 | 0.70511
Class_Cane | 0.00353 | 0.70503
TargetOldGrade | 0.00341 | 0.70511
GrooveWide | 0.00285 | 0.70591
EpidemicType | 0.00194 | 0.70487
Type_Cane | 0.00144 | 0.70495
ActionWaterType | 0.00139 | 0.70487
FertilizerType | 0.00134 | 0.70511
TypeSoil | 0.00065 | 0.70407
5 Conclusion
In this work, we propose methods for classifying the sugarcane yield grade at the plot level from information about plot characteristics. Our methods are based on the Random Forest algorithm, implemented with the random forest of Scikit-Learn [1]. The data used in this work are acquired from a set of sugarcane plots around a sugar mill in Thailand and consist of 12,521 records from two production years. We split the data into a training set and a testing set; the training set is used to build the models that predict the yield grade in the testing set. The classification accuracies of our random forest based methods on the testing dataset are 75.83% and 71.88%, respectively. Our proposed method outperforms the baseline that uses the actual yield of the previous year as the prediction (51.52%) and the baseline in which the target yield of each plot is manually predicted by a human expert (65.50%).
Some possible topics for future work are as follows. First, we can investigate the improvement in prediction accuracy when daily temperature information for the local area and more information about the soil characteristics of each plot are available, as in some related works. Second, a comprehensive study of the effects of hyper-parameter tuning, feature engineering and feature selection on this dataset is another potential direction. Finally, a model stacking technique could be applied to improve the prediction accuracy.
References
1. Pedregosa, F., et al.: Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825-2830 (2011)
2. Everingham, Y., et al.: Accurate prediction of sugarcane yield using a random forest algorithm. Agronomy for Sustainable Development 36(27), 1-9 (2016)
3. de Oliveira, M.P.G., Bocca, F.F., Rodrigues, L.H.A.: From spreadsheets to sugar content modeling: A data mining approach. Computers and Electronics in Agriculture 132, 14-20 (2017)
4. Bocca, F.F., Rodrigues, L.H.A.: The effect of tuning, feature engineering, and feature selection in data mining applied to rainfed sugarcane yield modelling. Computers and Electronics in Agriculture 128, 67-76 (2016)
5. Thuankaewsing, S., Khamjan, S., Piewthongngam, K., Pathumnakul, S.: Harvest scheduling algorithm to equalize supplier benefits: a case study from the Thai sugar cane industry. Computers and Electronics in Agriculture 110, 42-55 (2015)
6. Robnik-Šikonja, M., Kononenko, I.: An adaptation of Relief for attribute estimation in regression. In: Proceedings of the Fourteenth International Conference on Machine Learning, pp. 296-304. Morgan Kaufmann, San Francisco (1997)
7. Breiman, L.: Random forests. Machine Learning 45, 5-32 (2001)
8. Ruß, G., Kruse, R.: Feature selection for wheat yield prediction. In: Bramer, M., Ellis, R., Petridis, M. (eds.) Research and Development in Intelligent Systems XXVI, pp. 465-478. Springer, London (2010)
9. Gonzalez-Sanchez, A., Frausto-Solis, J., Ojeda-Bustamante, W.: Attribute selection impact on linear and nonlinear regression models for crop yield prediction. The Scientific World Journal (2014)