RESEARCH ARTICLE
Use of Statistical Models in Predicting Groundnut Yield in Relation to
Weather Parameters
Abhinaya D1*, Patil S G1, Dheebakaran Ga2, Djanaguiraman M3, Arockia Stephen Raj1
1* Department of Physical Sciences & Information Technology, Agricultural Engineering College & Research Institute, Tamil Nadu Agricultural University, Coimbatore-641 003.
2 Agro Climate Research Centre, Tamil Nadu Agricultural University, Coimbatore.
3 Department of Crop Physiology, Agricultural College & Research Institute, Tamil Nadu Agricultural University, Coimbatore.
ABSTRACT
Keywords: Stepwise Multiple Linear Regression; Ridge regression; LASSO; Elastic Net; Groundnut yield prediction; Weather indices.
shrinkage and selection operator (LASSO), elastic net (ENET) and ridge regression techniques can be used (Das et al., 2017). In this context, the main objective of our study is to develop and select a statistical groundnut yield forecasting model for the Coimbatore district of Tamil Nadu based on the predictive performance and efficiency of the developed models.
MATERIAL AND METHODS
Data Collection
Time series data of groundnut (Arachis hypogaea L.) yield for the Coimbatore district of Tamil Nadu for 29 years (1991 to 2019) were collected from the Season and Crop Report, Department of Economics and Statistics. Daily weather data were collected from the Agro Climate Research Centre, TNAU. Data on five weather variables, namely maximum temperature (Tmax, °C), minimum temperature (Tmin, °C), morning and evening relative humidity (RH I and RH II, %) and rainfall (mm), were used for the 18 weeks of crop cultivation, spanning the 14th to the 31st standard meteorological week (SMW), because groundnut in the Coimbatore district is usually sown during April-May (Chithiraipattam). Daily Tmax, Tmin, RH I and RH II were converted into weekly averages, whereas rainfall was summed by week. Of the 29 years of data, 24 years were used for calibration and the remaining 5 years for validation.
Detrending of Yield Time Series Data
Fluctuations in yield over the years due to technology differences, climatic variability, etc., lead to a nonlinear and non-stationary trend, which has to be removed. The correlation between detrended yield and weather parameters is used to calculate the weights for model development (Wu et al., 2007). In the present investigation, a simple linear
Weighted weather indices:
$$Z_{ij} = \sum_{w=1}^{m} r_{iw}^{j} X_{iw}$$
Where,
X_{iw} = value of the i-th weather variable in the w-th week
r_{iw} = correlation coefficient of detrended yield with the i-th weather variable in the w-th week
m = week of forecast
For j = 0, we have unweighted indices and for j = 1, weighted indices. In total, 11 weather variables were generated by the procedure described above; they are presented in Table 1.
YIELD FORECAST MODELS
Stepwise Multiple Linear Regression
Multiple Linear Regression (MLR) is the most straightforward approach for the development of statistical models. However, its application to datasets with a large number of explanatory variables is not always successful (Balabin et al., 2011). A stepwise regression procedure was adopted to select the best regression variables among many independent variables (Singh et al., 2014). A fundamental problem with stepwise regression is that some real explanatory variables that have causal effects on the dependent variable may happen to be statistically insignificant, while nuisance variables may be coincidentally significant (Smith, 2018). Hence, we opted for alternative methods such as penalized regression.
Penalized Regression
Penalized regression is a better alternative to the linear regression model (the ordinary least squares method). It adds a constraint (penalty) to the equation. The consequence of imposing this penalty is to shrink the coefficient values towards zero. This allows the less contributive variables to have a coefficient close
107 | 10-12 | 2
to zero or equal to zero. The rationale for penalized regression here is to reduce the impact of multicollinearity, since the independent variables in the study are correlated with one another.
Ridge Regression
Ridge regression shrinks the regression coefficients so that variables with a minor contribution to the outcome have their coefficients close to zero. The shrinkage of the coefficients is achieved by penalizing the regression model with a penalty term called the L2-norm, which is the sum of the squared coefficients (Hoerl and Kennard, 1970).
$$L_2 = \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2 + \lambda \sum_{k} \beta_k^2 \tag{2}$$
Where Y_i is the observed value of the dependent variable, \hat{Y}_i is the model output, \beta_k is the coefficient of the k-th independent variable and \lambda is the L2-norm penalty. A larger value of \lambda means a greater amount of shrinkage. Ridge regression keeps all the predictors in the model without making any variable selection.
Lasso Regression (Least Absolute Shrinkage and Selection Operator)
Lasso regression shrinks the regression coefficients toward zero by penalizing the regression model with a penalty term called the L1-norm, which is the sum of the absolute coefficients. In the case of lasso regression, the penalty has the effect of forcing some of the coefficient estimates, those with a minor contribution to the model, to be exactly equal to zero (Tibshirani, 1996). One obvious advantage of lasso regression over ridge regression is that it produces simpler and more interpretable models that incorporate only a reduced set of predictors.
$$L_1 = \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2 + \lambda \sum_{k} |\beta_k| \tag{3}$$
Where Y_i is the observed value of the dependent variable, \hat{Y}_i is the model output, \beta_k is the coefficient of the k-th independent variable and \lambda is the L1-norm penalty.
Elastic Net Regression
Elastic Net combines characteristics of both lasso and ridge, i.e., it is penalized with both the L1 and L2 norms (Zou and Hastie, 2005). The consequence is to shrink the coefficients effectively (as in ridge regression) and to set some coefficients to zero (as in LASSO). Hence it reduces the impact of different features while not eliminating all of the features (Cho et al., 2009).
$$L = \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2 + \lambda_2 \sum_{k} \beta_k^2 + \lambda_1 \sum_{k} |\beta_k| \tag{4}$$
Where Y_i is the observed value of the dependent variable, \hat{Y}_i is the model output, \beta_k is the coefficient of the k-th independent variable, and \lambda_1 and \lambda_2 are the L1-norm and L2-norm penalties.
These methods have two tuning parameters, lambda (\lambda) and alpha (\alpha), which need to be optimized. The overall strength of the penalty is controlled by \lambda (Hastie and Qian, 2014); the optimal \lambda values were selected by minimizing the average mean square error in leave-one-out cross-validation (Piaskowski et al., 2016). The other tuning parameter, \alpha, was set at 0 for ridge, 1 for LASSO and 0.5 for ELNET. The data were analyzed using the 'glmnet' R package (Friedman et al., 2009).
Model Performance
The performance of the developed statistical models was tested using adjusted R2 (R2adj), root mean square error (RMSE), normalised RMSE (nRMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE), calculated using the following formulae:
$$\mathrm{RMSE} = \left[ \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \right]^{1/2} \tag{5}$$
$$\mathrm{nRMSE} = \left[ \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \right]^{1/2} \times \frac{100}{\bar{Y}} \tag{6}$$
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |Y_i - \hat{Y}_i| \tag{7}$$
$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{Y_i - \hat{Y}_i}{Y_i} \right| \tag{8}$$
Y_i = actual value
\hat{Y}_i = model output
\bar{Y} = mean of the actual values
n = number of observations
R2adj values close to 1 and RMSE values close to 0 indicate better performance of the developed models. Likewise, the lower the MAE and MAPE values, the better the model fit. According to nRMSE, the model performance is judged as excellent, good, fair and poor when the values are in the range of <10%, 10–20%, 20–30% and >30%, respectively (Jamieson et al., 1991).
Table 1. Weather indices used in the development of multivariate yield forecasting regression models
Parameter   Unweighted Indices   Weighted Indices
Tmax        Z10                  Z11
Tmin        Z20                  Z21
Rain        Z30                  Z31
RH I        Z40                  Z41
RH II       Z50                  Z51
RESULTS AND DISCUSSION
Summary Statistics of Yield Data
The summary statistics of groundnut yield data (1991-2019) of the Coimbatore district of Tamil Nadu are presented in Table 2. The maximum yield was 2877 kg ha-1, whereas the minimum yield was 1519 kg ha-1. The coefficient of variation of yield was found to be 17.84%. A normal Q-Q plot constructed to test the normality of the yield data affirmed normality, thus satisfying the basic assumptions of parametric models (Fig. 1). Figure 2 shows the Pearson correlation coefficients between all variables. Significant positive correlations (correlation coefficient greater than 0.5) were found between yield and Z11, Z30, Z31, Z51 (P<0.05). The yield was found to be strongly correlated with those variables (p<0.01) and the correlation coefficients ranged between 0.50 and 0.59 (Iqbal et al., 2019).
Figure 1. Normal Q-Q plot for groundnut yield of Coimbatore district.
opted for alternative approaches to fix this problem of multicollinearity.
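The weather indices described in Material and Methods, and the correlations with yield discussed above, can be illustrated with a minimal Python sketch. It is not the authors' code (the study used R); the function names `pearson` and `weather_indices` and the toy numbers are hypothetical, and the detrended yield series is assumed to be already available.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def weather_indices(weekly, detrended_yield):
    """Build the unweighted (j = 0) and weighted (j = 1) index of one variable.

    weekly[t][w] is the value of the weather variable in year t and week w;
    detrended_yield[t] is the detrended yield of year t.
    """
    n_weeks = len(weekly[0])
    # r_w: correlation, across years, of detrended yield with week w
    r = [pearson([year[w] for year in weekly], detrended_yield)
         for w in range(n_weeks)]
    unweighted = [sum(year) for year in weekly]
    weighted = [sum(rw * x for rw, x in zip(r, year)) for year in weekly]
    return unweighted, weighted, r

# Hypothetical toy data: 3 years x 2 weeks of one weather variable
weekly = [[1.0, 3.0], [2.0, 2.0], [3.0, 1.0]]
detrended_yield = [10.0, 20.0, 30.0]
z0, z1, r = weather_indices(weekly, detrended_yield)
print(z0, z1, r)
```

In this toy case the unweighted index is constant across years, while the weighted index varies with yield, which is why the weighting step can make an index informative.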
Table 2. OLS estimates of regression coefficients in MLR and VIF
Variable    Coefficient (± error)    VIF
Intercept   5249.95 ± 5597.78        0
Z10         -14.37 ± 11.0972         12.01887
Z11         28.88 ± 38.93            7.859344
Z20         6.62 ± 7.73              3.753992
Z21         -21.24 ± 41.89           6.089400
Z30         -23.48 ± 16.18           474.9300
Z31         57.18 ± 36.39            507.5759
Z40         -0.39 ± 0.91             6.801643
Z41         4.58 ± 5.49              5.729535
Z50         -1.01 ± 0.86             4.468881
Z51         13.30 ± 3.41             1.651192
T           9.71 ± 8.84              3.438897
Adj R2      0.8215
RMSE        184.09
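The VIF column of Table 2 can be read through the standard identity VIF_k = 1 / (1 - R_k^2), where R_k^2 is the coefficient of determination obtained when predictor k is regressed on the remaining predictors. The Python sketch below inverts that identity for the tabulated values. The cutoff of 10 is a commonly used rule of thumb and is an assumption here, since the threshold used by the authors is not visible in this excerpt; the names `r2_from_vif` and `CUTOFF` are illustrative.

```python
# VIF values copied from Table 2 (intercept omitted)
vif = {
    "Z10": 12.01887, "Z11": 7.859344, "Z20": 3.753992, "Z21": 6.089400,
    "Z30": 474.9300, "Z31": 507.5759, "Z40": 6.801643, "Z41": 5.729535,
    "Z50": 4.468881, "Z51": 1.651192, "T": 3.438897,
}

CUTOFF = 10.0  # assumed rule-of-thumb threshold, not taken from the source

def r2_from_vif(v):
    """Invert VIF = 1 / (1 - R^2) to get the R^2 of a predictor on the others."""
    return 1.0 - 1.0 / v

# Predictors whose VIF exceeds the assumed cutoff
flagged = [name for name, v in vif.items() if v > CUTOFF]

for name in flagged:
    print(f"{name}: VIF = {vif[name]:.2f}, implied R^2 = {r2_from_vif(vif[name]):.4f}")
```

Under this assumed cutoff the two rainfall indices stand out: each is almost completely explained by the other predictors.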
LASSO (Least Absolute Shrinkage and Selection Operator)
In LASSO, feature selection is performed along with regularization of the parameters, which helps prevent the model from overfitting. The developed model explains about 87.46% of the variation in yield. The most influential parameter was found to be maximum temperature, followed by rainfall. According to its nRMSE value, the developed model is considered excellent.
Elastic Net Regression
In the ENet method, which is a combination of ridge and LASSO, maximum temperature was found to be the most critical parameter, followed by rainfall. The developed model explains about 87.48% of the variation in yield. The nRMSE value indicated that the model performance was excellent.
For comparing the performance of SMLR and penalized regression techniques, we used goodness-of-fit measures, i.e., R2adj, RMSE, MAE and MAPE. The adjusted coefficient of determination (R2adj) was significant for all the models included in the study. When considering R2adj, ridge regression was found to have a better fit than the other models. The RMSE value of 114.40 was found the least for
Figure 2. Pearson correlation coefficient between groundnut yield and weather indices of Coimbatore district.
Figure 3. Regularization paths and cross-validation of lambda values for ridge, LASSO and ENet (panel titles: regularization path for ridge regression; cross-validation of lambda values for ridge).
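Why LASSO and ENet perform feature selection while ridge keeps every predictor can be seen in a simplified setting. The Python sketch below assumes an orthonormal design, an idealization that does not hold for the correlated weather indices in this study. Under that assumption, the penalized objectives of Eqs. (2) to (4) have closed-form solutions in terms of the OLS coefficients. The function names and the example coefficients are hypothetical.

```python
def soft_threshold(b, t):
    """Shrink b toward zero by t, returning exactly 0.0 when |b| <= t."""
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0

def penalized_coefficients(beta_ols, lam1=0.0, lam2=0.0):
    """Minimizer of sum (Yhat - Y)^2 + lam2 * sum b^2 + lam1 * sum |b|
    for an orthonormal design (assumption), given the OLS coefficients.

    lam1 = 0 gives ridge, lam2 = 0 gives LASSO, both > 0 gives elastic net.
    """
    return [soft_threshold(b, lam1 / 2.0) / (1.0 + lam2) for b in beta_ols]

beta_ols = [3.0, -0.4, 0.1]  # hypothetical OLS coefficients

ridge = penalized_coefficients(beta_ols, lam2=1.0)
lasso = penalized_coefficients(beta_ols, lam1=1.0)
enet = penalized_coefficients(beta_ols, lam1=1.0, lam2=1.0)
print(ridge, lasso, enet)
```

Ridge scales every coefficient by the same factor and never reaches zero, whereas the L1 term sets the two small coefficients exactly to zero; the elastic net does both.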
Table 4. Goodness-of-fit measures obtained for groundnut yield prediction of Coimbatore district using different statistical models.
MODEL   Adj R2       RMSEC    RMSEV        MAE          MAPE       nRMSEV   F statistic
MLR     0.8215       117.78   184.09       167.17       0.08       9.64     10.62
SMLR    0.8376 (2)   114.34   178.01 (1)   158.99 (3)   0.08 (1)   9.32     14.18
Ridge   0.8785 (1)   136.51   210.08 (4)   166.93 (4)   1.84 (4)   11.00    5.67
LASSO   0.8746 (4)   136.01   188.09 (3)   158.54 (2)   1.76 (3)   9.85     10.39
ELNET   0.8748 (3)   138.63   181.99 (2)   149.61 (1)   1.65 (2)   9.53     7.77
**Values in parentheses refer to the rank of the measure.
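The goodness-of-fit measures reported in Table 4 follow the definitions given in Model Performance (Eqs. 5 to 8). A minimal Python sketch of those formulas, together with the nRMSE classes of Jamieson et al. (1991), is given below. The observed and predicted yields are hypothetical values chosen only to exercise the functions, and the function names are illustrative rather than taken from the study.

```python
from math import sqrt

def rmse(obs, pred):
    """Root mean square error, Eq. (5)."""
    return sqrt(sum((y - p) ** 2 for y, p in zip(obs, pred)) / len(obs))

def nrmse(obs, pred):
    """RMSE as a percentage of the mean observed value, Eq. (6)."""
    return rmse(obs, pred) * 100.0 / (sum(obs) / len(obs))

def mae(obs, pred):
    """Mean absolute error, Eq. (7)."""
    return sum(abs(y - p) for y, p in zip(obs, pred)) / len(obs)

def mape(obs, pred):
    """Mean absolute percentage error, Eq. (8)."""
    return 100.0 / len(obs) * sum(abs((y - p) / y) for y, p in zip(obs, pred))

def nrmse_class(value):
    """Performance class by nRMSE (%): <10, 10-20, 20-30, >30."""
    if value < 10:
        return "excellent"
    if value <= 20:
        return "good"
    if value <= 30:
        return "fair"
    return "poor"

observed = [2000.0, 2500.0]   # hypothetical yields, kg/ha
predicted = [1900.0, 2600.0]  # hypothetical model output, kg/ha

print(rmse(observed, predicted), nrmse(observed, predicted),
      mae(observed, predicted), mape(observed, predicted),
      nrmse_class(nrmse(observed, predicted)))
```

Applied to the nRMSEV column of Table 4, these thresholds place ridge (11.00) in the good class and the remaining models (9.32 to 9.85) in the excellent class.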
in Pakistan using factor and principal component scores in multiple linear regression. J. Anim. Plant Sci., 23(6): 1532-1540.
Friedman, J., Hastie, T., and Tibshirani, R. 2009. glmnet: Lasso and elastic-net regularized generalized linear models. R package version, 1(4): 1-24.
Hastie, T., and Qian, J. 2014. Glmnet vignette, 9: 1-30.
Hoerl, A. E., and Kennard, R. W. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1): 55-67.
Iqbal, F., Ali, M., Huma, Z. E., and Raziq, A. 2019. Predicting live body weight of harnai sheep through penalized regression models. J. Anim. Plant Sci., 29(6): 1541-1548.
Jamieson, P. D., Porter, J. R., and Wilson, D. R. 1991. A test of the computer simulation model ARCWHEAT1 on wheat crops grown in New Zealand. Field Crops Research, 27(4): 337-350.
Kumar, H. V., Shivamurthy, M., and Lunagaria, M. M. 2017. Impact of rainfall variability and trend on rice yield in coastal Karnataka. Journal of Agrometeorology, 19(3): 286-287.
Kumar, N., Pisal, R. R., Shukla, S. P., and Pandey, K. K. 2014. Crop yield forecasting of paddy, sugarcane and wheat through linear regression technique for south Gujarat. Mausam, 65(3): 361-364.
Kutner, M., Nachtsheim, C., and Neter, J. 2004. Applied Linear Regression Models. Richard D. Irwin, Inc.: Homewood, Illinois.
Lobell, D. B., and Burke, M. B. 2010. On the use of statistical models to predict crop yield responses to climate change. Agricultural and Forest Meteorology, 150(11): 1443-1452.
Piaskowski, J. L., Brown, D., and Campbell, K. G. 2016. Near infrared calibration of soluble stem carbohydrates for predicting drought tolerance in spring wheat. Agronomy Journal, 108(1): 285-293.
Rai, Y. K., Ale, B. B., and Alam, J. 2011. Impact assessment of climate change on paddy yield: a case study of Nepal Agriculture Research Council (NARC), Tarahara, Nepal. Journal of the Institute of Engineering, 8(3): 147-167.
Sangun, L., Cankaya, S., Kayaalp, G. T., and Akar, M. 2009. Use of factor analysis scores in multiple regression model for estimation of body weight from some body measurements in Lizardfish. Journal of Animal and Veterinary Advances, 8(1): 47-50.
Sheather, S. 2009. A modern approach to regression with R. Springer Science and Business Media.
Shi, W., Tao, F., and Zhang, Z. 2013. A review on statistical models for identifying climate contributions to crop yields. Journal of Geographical Sciences, 23(3): 567-576.
Singh, R. S., Patel, C., Yadav, M. K., and Singh, K. K. 2014. Yield forecasting of rice and wheat crops for eastern Uttar Pradesh. Journal of Agrometeorology, 16(2): 199.
Smith, G. 2018. Step away from stepwise. Journal of Big Data, 5(1): 1-12.
Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1): 267-288.
Trnka, M., Hlavinka, P., Semerádová, D., Dubrovsky, M., Zalud, Z., and Mozny, M. 2007. Agricultural drought and spring barley yields in the Czech Republic. Plant Soil and Environment, 53(7): 306.
Verma, U., Piepho, H. P., Goyal, A., Ogutu, J. O., and Kalubarme, M. H. 2016. Role of climatic variables and crop condition term for mustard yield prediction in Haryana. Int. J. Agric. Stat. Sci., 12: 45-51.
Wu, Z., Huang, N. E., Long, S. R., and Peng, C. K. 2007. On the trend, detrending, and variability of nonlinear and nonstationary time series. Proceedings of the National Academy of Sciences, 104(38): 14889-14894.
Yakubu, A. 2010. Fixing multicollinearity instability in the prediction of body weight from morphometric traits of White Fulani cows. Journal of Central European Agriculture.
Zou, H., and Hastie, T. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2): 301-320.