Can Machine Learning and Predictor Selection Algorithms Yield Reliable Stream Flow Prediction?
1. Introduction
Over the years, streamflow prediction has undergone several methodological developments aimed at
sustainable decision-making in water infrastructure planning and management. The co-existence of
floods and droughts poses a threat that cannot be eradicated but must be mitigated by robust
streamflow prediction systems so that competing water demands can be met on a sustainable basis.
Streamflow prediction systems are typically categorized into physically based and data-driven models
(He et al., 2014; Zhang et al., 2015; Kratzert et al., 2019), and each category has its strengths and
weaknesses. Physically based models integrate the process-driven complexities of the hydrological
cycle by accounting for the conservation of mass, momentum, and energy, which gives researchers
confidence in decision-making (Toth & Brath, 2007; Zhang et al., 2015; Yang et al., 2020). However,
these models produce larger prediction uncertainty when the comprehensive datasets they require are
unavailable, which reflects their weakness in conceptualizing physical processes with a limited set of
parameters even when a long-term time series is available. This shortcoming limits the transferability
of physically based models (Yadav et al., 2007; Parajka et al., 2013; Samaniego et al., 2017).
Machine learning (ML) algorithms, in contrast, are data-driven approaches that draw inferences from
patterns in the data. ML models are known for predicting streamflow from historical data and for their
capability to emulate the high nonlinearity of natural processes (Toth & Brath, 2007; Yang et al., 2020;
Lin et al., 2021). However, the efficacy of these approaches depends largely on the quality of the data,
because data with measurement uncertainty may lower the reliability of the model.
The literature indicates that the key to building a robust machine learning model for streamflow
prediction lies in its interpretability and its generalizability to previously unseen data. However,
during development and performance testing, ML models are prone to selecting irrelevant predictors,
which leads them to capture noise during training and subsequently degrades model performance
because the dimensionality of the training dataset increases (Raschka, 2020; Gharib & Davies, 2021).
This study proposes a two-stage framework for the model training and testing phases to avoid the
pitfalls described in the preceding paragraph. The literature also shows that very few prediction
models have been developed for agriculture-dominated tropical watersheds. The primary objectives of
this study are: (1) to examine the uncertainties and hindrances associated with the input predictor
batch and algorithms for agriculture-dominated tropical watersheds, and to avoid the pitfalls mentioned
above through a specific preprocessing workflow; (2) to select the optimal hierarchical predictor batch
through a hybrid selection algorithm; and (3) to develop a machine learning-based model for daily
streamflow prediction using a novel data-mining algorithm. The regression algorithms used in this study
are the multi-layer perceptron (ANN), support vector regression (SVR), M5 Prime (M5P), reduced error
pruning tree (REPTree), random tree (RT), and random forest (RF), which are standalone algorithms
that have seldom been applied to streamflow prediction.
2. Methodology
This section describes the study area, the descriptive statistics of the data used, the estimation of
potential evapotranspiration, and the model framework, which involves the selection of machine
learning algorithms, the predictor selection algorithm, the evaluation of model performance using
goodness-of-fit criteria, and model development.
2.1 Study Area
The study area, the Rana watershed, lies in the lower-middle reach of the Mahanadi River basin in the
state of Odisha, India. The watershed extends from 20.11° N to 20.41° N and from 85.39° E to
85.63° E, and is bounded by hills to the east and west and by the Mahanadi River to the north. The
total drainage area of the watershed is 496 km², and its longest flow path is approximately 25 km.
Topographically, the watershed slopes moderately (up to 8%) from south to north, with steeper slopes
of approximately 22% where hillocks and drainage channels occur in the lateritic uplands (Jena et
al., 2021). The average annual rainfall is approximately 1451 mm, of which 70% is received from July
to October (Jena et al., 2020).
2.2 Data
Soil textural information was obtained from Jena et al. (2021), which shows that five USDA soil classes
exist in the study area; the majority of the soil is sandy loam (40.9%), followed by loam (30.9%),
silty loam (19%), silty clay loam (8.8%), and sand (0.4%). Land use/land cover (LULC) data for the
year 2002 at a scale of 1:50,000 were collected from the Odisha Space Application Centre (ORSAC) and
used in the model for the years 1997 to 2002. Subsequently, LULC data in raster format with a pixel
size of 53 m × 53 m for the years 2005, 2010, and 2015 were collected from NRSC, ISRO, Government of
India, Hyderabad, India, and were used for the periods 2003 to 2005, 2006 to 2010, and 2013 to
2016, respectively. The LULC classes found in the Rana watershed are dense forest, shrub forest,
agricultural land, wasteland, water bodies, and built-up areas.
The digital elevation model (DEM) of the watershed was extracted from Advanced Spaceborne Thermal
Emission and Reflection Radiometer (ASTER) data, obtained from Earthdata, with a grid size of
30 m × 30 m. The slope map was generated and classified into five subgroups with ranges of 0%–5.04%,
5.04%–9.62%, 9.62%–16.81%, 16.81%–33.74%, and 33.74%–140.75%. Soil, LULC, and DEM data were
processed and managed using ArcGIS 10.2. Streamflow at the outlet was monitored daily from 2014 to
2016, and measurements for the period 1997 to 2014 were collected from the Central Water
Commission, India. The daily meteorological parameters, namely maximum and minimum temperature,
relative humidity, wind speed, solar radiation, and rainfall, were obtained from an automatic weather
station (AWS) installed in the study watershed.
2.3 Estimation of Potential Evapotranspiration
Several empirical methods exist for determining potential evapotranspiration (PET). They vary in
their input requirements, ranging from simple temperature-driven approaches to energy supply and
vapor transport approaches. Ellenburg et al. (2018) examined the significance of PET for streamflow
modeling at the daily time step and found that the Penman-Monteith equation (Penman monteith,
1963) produces more accurate streamflow predictions than other empirical methods. Therefore, in
this study PET was computed with the FAO Penman-Monteith method using AWS data (Rao et al.,
2012; Cai et al., 2007).
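As a minimal sketch of this calculation, the function below follows the standard FAO-56 daily form of the Penman-Monteith equation. The function names, argument names, and default values are illustrative assumptions and not the authors' code. Net radiation is taken as an input here, whereas in practice it has to be derived from the measured solar radiation.

```python
import math


def saturation_vapour_pressure(temp_c):
    """Saturation vapour pressure (kPa) at air temperature temp_c (deg C)."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def fao56_pet(t_max, t_min, rh_mean, u2, rn, g=0.0, pressure=101.3):
    """Daily reference evapotranspiration (mm/day), FAO-56 Penman-Monteith.

    t_max, t_min : daily maximum and minimum air temperature (deg C)
    rh_mean      : mean relative humidity (%)
    u2           : wind speed at 2 m height (m/s)
    rn           : net radiation (MJ m-2 day-1), assumed to be precomputed
    g            : soil heat flux (MJ m-2 day-1), close to zero at a daily step
    pressure     : atmospheric pressure (kPa), assumed sea-level default
    """
    t_mean = (t_max + t_min) / 2.0
    # Mean saturation vapour pressure from the daily temperature extremes.
    es = (saturation_vapour_pressure(t_max) + saturation_vapour_pressure(t_min)) / 2.0
    # Actual vapour pressure estimated from mean relative humidity.
    ea = rh_mean / 100.0 * es
    # Slope of the saturation vapour pressure curve at mean temperature.
    delta = 4098.0 * saturation_vapour_pressure(t_mean) / (t_mean + 237.3) ** 2
    # Psychrometric constant (kPa per deg C).
    gamma = 0.000665 * pressure
    numerator = (0.408 * delta * (rn - g)
                 + gamma * 900.0 / (t_mean + 273.0) * u2 * (es - ea))
    denominator = delta + gamma * (1.0 + 0.34 * u2)
    return numerator / denominator
```

Applied row by row to the daily AWS record, a function of this kind would yield the PET time series that enters the predictor set.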
2.4 Machine Learning Algorithms
The streamflow prediction skill of the machine learning algorithms was enhanced using the two-fold
framework proposed in this study. Weka 3.8.6 (Khosravi et al., 2021) and a Python environment
(Abbas et al., 2022) were used to implement six heuristic data mining algorithms, namely ANN, SVR,
M5Prime, REPTree, Random Tree (RT), and RF. ANN uses backpropagation to learn a multi-layer
perceptron. Its hyperparameters were determined using a grid-search algorithm with cross-validation,
and pruning was carried out with respect to the number of neurons in the hidden layers, the learning
rate, the momentum, and the sigmoid activation function. Support vector regression uses sequential
minimal optimization for regression (SMO-reg), and its performance is enhanced by selecting the RBF
kernel, which can capture non-linearity because, after normalization, the data are mapped to a
high-dimensional space. The hyperparameters involved are “c” and “gamma”, which control the penalty
on misrepresented points and the shape of the decision boundary, respectively (He et al., 2014).
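The mechanics of a grid search with k-fold cross-validation can be sketched as follows. To keep the example self-contained, a simple Gaussian (RBF) kernel smoother with a single "gamma" parameter stands in for the actual learners; it is not SVR or ANN, and all function names are hypothetical.

```python
import math


def kfold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k interleaved folds over range(n)."""
    for fold in range(k):
        val = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, val


def rbf_smoother(train_x, train_y, gamma):
    """Return a predictor that averages train_y with RBF weights (stand-in model)."""
    def predict(x):
        weights = [math.exp(-gamma * (x - xi) ** 2) for xi in train_x]
        total = sum(weights)
        if total == 0.0:  # all weights underflowed: fall back to the mean
            return sum(train_y) / len(train_y)
        return sum(w * y for w, y in zip(weights, train_y)) / total
    return predict


def cv_rmse(x, y, gamma, k=10):
    """Cross-validated RMSE of the stand-in model for one gamma value."""
    squared_errors = []
    for train, val in kfold_indices(len(x), k):
        model = rbf_smoother([x[i] for i in train], [y[i] for i in train], gamma)
        squared_errors.extend((model(x[i]) - y[i]) ** 2 for i in val)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def grid_search(x, y, gammas, k=10):
    """Evaluate every candidate gamma and return the best one with all scores."""
    scores = {g: cv_rmse(x, y, g, k) for g in gammas}
    best = min(scores, key=scores.get)
    return best, scores
```

With more than one hyperparameter, the same loop would run over every combination in the grid, and only the training period would be passed in so that the test period stays unseen.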
The M5Prime algorithm is a reconstruction of the M5 algorithm for inducing trees of regression models
(Quinlan, 1992). It uses a splitting criterion called “standard deviation reduction” to produce a
decision-based regression tree. Tree growth is followed by pruning and smoothing, which reduce the
estimated error and compensate for the sharp discontinuities that inevitably appear between
adjacent linear models at the leaves of the pruned tree. The hyperparameters involved in this study
were the maximum depth, the minimum number of leaves, and a smoothing constant (Pal & Deswal, 2009;
Ghasemi et al., 2018; Adnan et al., 2019). The Reduced Error Pruning Tree is a fast decision tree
learner that uses information gain or variance reduction as the splitting criterion and prunes the
tree based on the mean square error of its predictions. The process is repeated recursively to interpret
and prune the tree by comparing it against already established trees. This process was repeated for
several specific tree depths to establish the stopping criteria (Joseph K & Ravichandran, 2012; Chen
et al., 2019).
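The standard deviation reduction criterion can be illustrated with a short sketch for a single numeric predictor. It assumes the usual M5 definition, in which the reduction is the standard deviation of the target at the node minus the size-weighted standard deviations of the subsets created by the split; the function names are hypothetical.

```python
from statistics import pstdev


def standard_deviation_reduction(parent, subsets):
    """SDR = sd(parent) - sum(|subset| / |parent| * sd(subset))."""
    n = len(parent)
    return pstdev(parent) - sum(len(s) / n * pstdev(s) for s in subsets)


def best_split(x, y):
    """Return (threshold, sdr) of the binary split on x that maximises SDR."""
    pairs = sorted(zip(x, y))
    targets = [t for _, t in pairs]
    best_threshold, best_sdr = None, float("-inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical predictor values
        sdr = standard_deviation_reduction(targets, [targets[:i], targets[i:]])
        if sdr > best_sdr:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best_sdr = sdr
    return best_threshold, best_sdr
```

A full model tree applies this search recursively over all predictors and then fits linear models at the leaves, which this sketch does not attempt.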
A random tree constructs a tree that considers randomly chosen attributes at each node. It does not prune
and only allows the estimation of class probabilities based on a hold-out set (Xu et al., 2019). RF
constructs a forest of random trees, i.e., it is an ensemble method built on decision trees. It combines
two operations. First, bootstrapping draws resampled training sets, which ensures that the trees are
built on different sizes and combinations of attributes. Second, aggregation combines the outputs of
the individual trees through a voting process to select the final prediction. Together, these two
operations constitute the bagging process. The hyperparameters involved are the maximum depth and
the number of estimators, which provide the stopping criteria. RF can also compute the importance of
each attribute in contributing to the target attribute (Oppel & Schumann, 2020;
Hagen et al., 2021).
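The two bagging operations can be sketched in a few lines. This is a deliberately simplified illustration under stated assumptions: the base learner is a one-split regression stump on a single predictor rather than a full random tree, aggregation is done by averaging (the regression counterpart of voting), and all names are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump by minimising the squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        m_left, m_right = mean(left), mean(right)
        sse = (sum((y - m_left) ** 2 for y in left)
               + sum((y - m_right) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best = (sse, threshold, m_left, m_right)
    if best is None:  # the sample holds a single distinct x value
        m = mean(ys)
        return (float("inf"), m, m)
    return best[1:]


def predict_stump(stump, x):
    threshold, m_left, m_right = stump
    return m_left if x <= threshold else m_right


def fit_bagged(xs, ys, n_estimators=25, seed=0):
    """Bootstrapping: fit one stump per resample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_bagged(models, x):
    """Aggregation: average the predictions of all base learners."""
    return mean(predict_stump(m, x) for m in models)
```

In a random forest, each tree is additionally restricted to a random subset of attributes at every node, and `n_estimators` and the tree depth play the role of the hyperparameters mentioned above.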
2.5 Model Performance Evaluation
To evaluate how well a model generalizes to new instances, a data partitioning strategy called k-fold
cross-validation was used, with k = 10 (tenfold cross-validation). Such an estimate is biased,
however, if the model is exposed to the testing set, because the testing set then guides the choice of
algorithm, model hyperparameters, etc. (Gharib & Davies, 2021). Hence, cross-validation was applied
only to the training set (1997-2012) for the comparison of the algorithms. The evaluation criteria used
to compare algorithms and to measure the prediction skill of the model on an unseen dataset were the
mean absolute error (MAE), root mean square error (RMSE), coefficient of determination (R2), and
Nash-Sutcliffe efficiency (NSE; Nash & Sutcliffe, 1970). The expressions for MAE, RMSE, R2, and NSE are
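$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|h_{si}-h_{oi}\right| \qquad (1)$$

$$RMSE = \sqrt{\frac{\sum_{i=1}^{N}\left(h_{si}-h_{oi}\right)^{2}}{N}} \qquad (2)$$

$$R^{2} = \frac{\left(\sum_{i=1}^{N}\left(h_{oi}-\bar{h}_{o}\right)\left(h_{si}-\bar{h}_{s}\right)\right)^{2}}{\sum_{i=1}^{N}\left(h_{oi}-\bar{h}_{o}\right)^{2}\,\sum_{i=1}^{N}\left(h_{si}-\bar{h}_{s}\right)^{2}} \qquad (3)$$

$$NSE = 1-\frac{\sum_{i=1}^{N}\left(h_{oi}-h_{si}\right)^{2}}{\sum_{i=1}^{N}\left(h_{oi}-\bar{h}_{o}\right)^{2}} \qquad (4)$$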
where $h_{oi}$ is the observed streamflow discharge at test locations; $h_{si}$ is the predicted
streamflow discharge obtained from the model algorithm; $\bar{h}_{o}$ is the mean of the observed
streamflow discharge; $\bar{h}_{s}$ is the mean of the predicted streamflow discharge; and N is the
number of observations. A perfect fit between the observed and predicted streamflow discharge would
yield MAE and RMSE values equal to zero and R2 and NSE values equal to one. Moreover, NSE values
between 0.0 and 1.0 are generally viewed as acceptable performance levels.
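Equations (1) to (4) translate directly into code. The sketch below uses plain Python lists, with `obs` and `sim` standing for the observed and predicted discharge series; the function names are illustrative.

```python
import math


def mae(obs, sim):
    """Mean absolute error, Eq. (1)."""
    return sum(abs(s - o) for o, s in zip(obs, sim)) / len(obs)


def rmse(obs, sim):
    """Root mean square error, Eq. (2)."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))


def r2(obs, sim):
    """Coefficient of determination as the squared correlation, Eq. (3)."""
    mean_o = sum(obs) / len(obs)
    mean_s = sum(sim) / len(sim)
    covariance = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_s = sum((s - mean_s) ** 2 for s in sim)
    return covariance ** 2 / (var_o * var_s)


def nse(obs, sim):
    """Nash-Sutcliffe efficiency, Eq. (4)."""
    mean_o = sum(obs) / len(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    spread = sum((o - mean_o) ** 2 for o in obs)
    return 1.0 - residual / spread
```

Note that a prediction with a constant bias can still reach R2 = 1 while its NSE drops, which is one reason the two metrics are reported together. A prediction equal to the observed mean gives NSE = 0.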
2.6 Hierarchical Predictor Selection Algorithm
In building models, the key to enhancing prediction skill is to select the most pertinent predictors
from the full set of predictors. Using the entire predictor set may cause overfitting and flawed
predictions because of noisy and irrelevant data. To avoid such pitfalls, researchers have employed
standard techniques, namely filter, wrapper, and embedded methods (Lin et al., 2014; Rodriguez-Galiano
et al., 2018; Jena et al., 2021). In this study, a new two-stage technique was framed for predictor
selection, and its workflow helps to build a more generalized model. In stage one, several HIDs were
first developed by forward selection or backward elimination of predictors under the wrapper method,
which explores a large number of combinations of static, spatial, and temporal predictors (Table 1).
The performance of the algorithms on the different HIDs was evaluated using RMSE, MAE, R2, and NSE
with 10-fold cross-validation. The HIDs were then partitioned into groups that can be interpreted as
‘good’, ‘moderate’, and ‘poor’ using a k-means clustering algorithm based on the performance of each
HID (Solomatine & Shrestha, 2009; Jena et al., 2021). Subsequently, the embedded method was used to
rank the predictors in all ‘good’ clustered HIDs by assigning importance calculated from the mean
decrease impurity (Louppe et al., 2013), which has shown practical utility in experimental studies.
For each HID, the predictors were ranked numerically. The predictors thus accrued different ranks in
all the ‘good’ clustered HIDs using the following equation (Breiman, 2001; Louppe et al., 2013)
The predictors were ranked from 1st to nth, where “n” is the number of predictors in a particular HID.
A rank frequency matrix was used to examine the number of times each predictor attained each rank.
Here, we adopted the hybrid wrapper-embedded predictor selection algorithm to select the optimal set
of relevant predictors, named ‘HID-P’, which enhances the prediction skill of the model (Jena et
al., 2021).
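The clustering step can be sketched with a one-dimensional k-means on a single performance score per HID. This is a simplification: the study clusters on HID performance, which may involve several metrics, and the HID names and scores below are hypothetical values used only to show the mechanics.

```python
def kmeans_1d(values, k=3, n_iter=100):
    """Plain k-means on scalar values; returns the k centroids."""
    ordered = sorted(values)
    # Initialise centroids at evenly spaced positions of the sorted values.
    centroids = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def label_hids(scores):
    """Map each HID to 'poor', 'moderate' or 'good' by its nearest centroid."""
    centroids = kmeans_1d(list(scores.values()), k=3)
    order = sorted(range(3), key=lambda c: centroids[c])  # low to high score
    names = {order[0]: "poor", order[1]: "moderate", order[2]: "good"}
    return {
        hid: names[min(range(3), key=lambda c: abs(score - centroids[c]))]
        for hid, score in scores.items()
    }


# Hypothetical NSE-like scores for six HIDs (higher is better).
example_scores = {"HID-1": 0.10, "HID-2": 0.12, "HID-3": 0.20,
                  "HID-4": 0.22, "HID-5": 0.31, "HID-6": 0.33}
example_labels = label_hids(example_scores)
```

Only the HIDs labelled "good" would then be passed on to the importance ranking.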
The predictor selection criteria for stage 1 were:
(a) predictors with higher frequencies of attaining ranks between first and fifth are selected; and
(b) predictors for which ≥50% of their total appearances in ‘good’ HIDs fall within the first to fifth
ranks are selected as appropriate predictors.
Table 1 Predictor Set Hierarchies for stage-1
Note: The graduated green-to-yellow shades represent the magnitude of the frequencies with which
predictors accrue different ranks.
The eight pertinent predictors selected in Stage 1 through the method discussed above were PET; the
LULC classes built-up area, agricultural land, dense forest, shrub forest, wasteland, and water bodies;
and rainfall, all of which had a frequency ranking within the first five (Table 3). For each selected
predictor, the number of times it appeared within the first five ranks was at least 50% of the total
number of times it appeared in the 16 “good” HIDs. For instance, PET was selected because it had
frequencies of 8, 1, 1, 0, and 1 at ranks one to five, respectively, so it appeared within the first
five ranks 11 times out of a total of 16 appearances (68.75%, i.e., more than 50%) in the good
clustered HIDs. Loam, in contrast, had frequencies of 1, 1, and 1 at the third, fourth, and fifth ranks,
respectively, which satisfied criterion (a). However, it failed to satisfy criterion (b) because only
three of its eight appearances (37.5%) fell within the first five ranks. All 25 predictors were
assessed in the same way, and eight predictors were selected as HID-P in stage 1. However, the
performance metrics MAE, RMSE, R2, and NSE were in the ranges 6.6-7.9, 18.5-19.9, 0.24-0.34, and
0.23-0.33, respectively. Hence, there is scope for enhancing model performance through further
analysis in Stage 2, as described in Section 2.6.
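The two criteria and the PET and loam examples can be checked with a short sketch. Criterion (a) is interpreted here as "the predictor attains at least one of the first five ranks", which is an assumption made for illustration; the function names are hypothetical.

```python
def top_rank_share(rank_freq, total_appearances, top_k=5):
    """Share of a predictor's appearances that fall within the first top_k ranks.

    rank_freq[i] is the number of 'good' HIDs in which the predictor
    attained rank i + 1.
    """
    return sum(rank_freq[:top_k]) / total_appearances


def is_selected(rank_freq, total_appearances, top_k=5, min_share=0.5):
    """Apply criteria (a) and (b) to one row of the rank frequency matrix."""
    hits = sum(rank_freq[:top_k])
    criterion_a = hits > 0  # assumed reading of criterion (a)
    criterion_b = hits / total_appearances >= min_share
    return criterion_a and criterion_b


# Frequencies at ranks 1 to 5 as reported in the text.
pet = [8, 1, 1, 0, 1]    # 16 appearances in 'good' HIDs
loam = [0, 0, 1, 1, 1]   # 8 appearances in 'good' HIDs

print(top_rank_share(pet, 16), is_selected(pet, 16))   # 0.6875 True
print(top_rank_share(loam, 8), is_selected(loam, 8))   # 0.375 False
```

Running the same check over every row of the rank frequency matrix would reproduce the stage 1 selection, provided the matrix holds the counts for all 25 predictors.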
Table 4 Frequency of different predictors to accrue ranks from 1st to 6th in all HIDs.
Figure 1 Scatter plot on log scale depicting the correctness of model training and testing. The green
and blue lines represent the confidence and prediction bands, respectively, and the orange and red
lines represent the linear fit and the 1:1 line, respectively.
3.2 Conclusion
In this study, 20 years (1997-2016) of hydrometeorological data, consisting of meteorological,
streamflow, DEM, and land use data, were used to develop a streamflow prediction model for the Rana
basin. The most pertinent predictors specific to this study area were finalized using the proposed
workflow. This study also addressed the lags associated with the hydrological system. The major
findings of this study can be summarized as follows. (i) The bagging-based ensemble model, random
forest, outperformed the other benchmark machine learning models, with significant differences mapped
using heatmaps; the differences between random forest and random tree clearly show the effect of
including or excluding pruning in regression trees, respectively. (ii) The rank frequency matrix
ascertains the dominance of a predictor over other predictors. For instance, rainfall over the
watershed at lags of one and two days proved to be more effective for predicting daily streamflow than
rainfall on the day of prediction alone. Similarly, PET at the first lag is of greater importance than
PET on the day of prediction and PET at the second lag, which supports explicitly modelling the
delay/lag with which rainfall and PET contribute to streamflow. (iii) LULC changes over the study area
were found to be a major driver, next to rainfall and PET. The model captured the dynamics of these
temporal changes, which improved its skill in capturing the patterns in the data.
This comprehensive two-stage framework could be a big leap towards the optimal selection of
predictors and towards daily streamflow prediction systems that need no complex data acquisition: it
requires only temporal satellite images of land use dynamics, available from the authorities, and
automatic weather stations installed in a few nearby basins that can sense solar insolation so that
time-series PET can be derived from net radiation. Such stations also assist in monitoring,
understanding, and strategizing adaptation to climate change, and the synthesis of these data for
streamflow forecasting is the future course of this research. In conclusion, this framework could be a
promising step towards enhancing watershed management in poorly gauged or ungauged basins, mostly in
developing countries, by predicting daily streamflow with minimal data acquisition effort, which
ultimately supports the ecological balance of the watershed through comprehensive planning and
conservation of land and water resources.
References
Abbas, A., Boithias, L., Pachepsky, Y., Kim, K., Chun, J. A., & Cho, K. H. (2022). AI4Water v1.0: an
open-source python package for modeling hydrological time series using data-driven methods.
Geoscientific Model Development, 15(7), 3021–3039.
Adnan, R. M., Liang, Z., Trajkovic, S., Zounemat-Kermani, M., Li, B., & Kisi, O. (2019). Daily
streamflow prediction using optimally pruned extreme learning machine. Journal of Hydrology,
577(May), 123981.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Cai, J., Liu, Y., Lei, T., & Pereira, L. S. (2007). Estimating reference evapotranspiration with the FAO
Penman-Monteith equation using daily weather forecast messages. Agricultural and Forest
Meteorology, 145(1–2), 22–35.
Chen, W., Hong, H., Li, S., Shahabi, H., Wang, Y., Wang, X., & Ahmad, B. Bin. (2019). Flood
susceptibility modelling using novel hybrid approach of reduced-error pruning trees with bagging and
random subspace ensembles. Journal of Hydrology, 575(February), 864–873.
Ellenburg, W. L., Cruise, J. F., & Singh, V. P. (2018). The role of evapotranspiration in streamflow
modeling – An analysis using entropy. Journal of Hydrology, 567(January), 290–304.
Gharib, A., & Davies, E. G. R. (2021). A workflow to address pitfalls and challenges in applying
machine learning models to hydrology. Advances in Water Resources, 152(November 2020), 103920.
Ghasemi, E., Kalhori, H., Bagherpour, R., & Yagiz, S. (2018). Model tree approach for predicting
uniaxial compressive strength and Young’s modulus of carbonate rocks. Bulletin of Engineering
Geology and the Environment, 77(1), 331–343.
Hagen, J. S., Leblois, E., Lawrence, D., Solomatine, D., & Sorteberg, A. (2021). Identifying major
drivers of daily streamflow from large-scale atmospheric circulation with machine learning. Journal of
Hydrology, 596(January), 126086.
He, Z., Wen, X., Liu, H., & Du, J. (2014). A comparative study of artificial neural network, adaptive
neuro fuzzy inference system and support vector machine for forecasting river flow in the semiarid
mountain region. Journal of Hydrology, 509, 379–386.
Jena, S., Mohanty, B. P., Panda, R. K., & Ramadas, M. (2021). Toward Developing a Generalizable
Pedotransfer Function for Saturated Hydraulic Conductivity Using Transfer Learning and Predictor
Selector Algorithm. Water Resources Research, 57(7), e2020WR028862.
Jena, S., Panda, R. K., Ramadas, M., Mohanty, B. P., & Pattanaik, S. K. (2020). Delineation of
groundwater storage and recharge potential zones using RS-GIS-AHP: Application in arable land
expansion. Remote Sensing Applications: Society and Environment, 19(July), 100354.
Joseph K, S., & Ravichandran, T. (2012). A comparative evaluation of software effort estimation using
REPTree and K* in handling with missing values. Australian Journal of Basic and Applied Sciences, 6,
312–317.
Khosravi, K., Golkarian, A., Booij, M. J., Barzegar, R., Sun, W., Yaseen, Z. M., & Mosavi, A. (2021).
Improving daily stochastic stream flow prediction: comparison of novel hybrid data-mining algorithms.
Hydrological Sciences Journal, 66(9), 1457–1474.
Kratzert, F., Klotz, D., Herrnegger, M., Sampson, A. K., Hochreiter, S., & Nearing, G. S. (2019). Toward
Improved Predictions in Ungauged Basins: Exploiting the Power of Machine Learning. Water Resources
Research, 55(12), 11344–11354.
Lin, F., Liang, D., Yeh, C. C., & Huang, J. C. (2014). Novel feature selection methods to financial
distress prediction. Expert Systems with Applications, 41(5), 2472–2483.
Louppe, G., Wehenkel, L., Sutera, A., & Geurts, P. (2013). Understanding variable importances in
Forests of randomized trees. Advances in Neural Information Processing Systems, 1–9.
Nash, J. E., & Sutcliffe, J. V. (1970). River flow forecasting through conceptual models part I — A
discussion of principles. Journal of Hydrology, 10(3), 282–290.
Oppel, H., & Schumann, A. H. (2020). Machine learning based identification of dominant controls on
runoff dynamics. Hydrological Processes, 34(11), 2450–2465.
Pal, M., & Deswal, S. (2009). M5 model tree based modelling of reference evapotranspiration.
Hydrological Processes, March 2009, 1437–1443.
Parajka, J., Viglione, A., Rogger, M., Salinas, J. L., Sivapalan, M., & Blöschl, G. (2013).
Comparative assessment of predictions in ungauged basins – Part 1: Runoff-hydrograph studies.
Hydrology and Earth System Sciences, 17, 1783–1795.
Penman monteith. (1963). Vegetation and hydrology. H. L. Penman (Technical Communication No. 53,
Commonwealth Bureau of Soils, Harpenden) Commonwealth Agricultural Bureaux, Farham Royal,
1963. Pp. v, 124: 72 Tables. 20s. 89(382), 565–566.
Quinlan, J. R. (1992). Learning with continuous classes. Proceedings of the 5th Australian Joint
Conference on Artificial Intelligence (AI '92), 343–348.
Rao, B. B., Sandeep, V. M., & Venkateswarlu, B. (2012). Potential Evapotranspiration estimation for
Indian conditions: Improving accuracy through calibration coefficients. 1–60.
Raschka, S. (2020). Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning.
arXiv:1811.12808v3 [cs.LG], 11 Nov 2020.
Rodriguez-Galiano, V. F., Luque-Espinar, J. A., Chica-Olmo, M., & Mendes, M. P. (2018). Feature
selection approaches for predictive modelling of groundwater nitrate pollution: An evaluation of filters,
embedded and wrapper methods. Science of the Total Environment, 624, 661–672.
Samaniego, L., Kumar, R., Thober, S., Rakovec, O., Zink, M., Wanders, N., Eisner, S., Schmied, H. M.,
Sutanudjaja, E. H., Warrach-Sagi, K., & Attinger, S. (2017). Toward seamless hydrologic predictions
across spatial scales. Hydrology and Earth System Sciences, 21(9), 4323–4346.
Solomatine, D. P., & Shrestha, D. L. (2009). A novel method to estimate model uncertainty using
machine learning techniques. Water Resources Research, 45(1), 1–16.
Toth, E., & Brath, A. (2007). Multistep ahead streamflow forecasting: Role of calibration data in
conceptual and neural network modeling. Water Resources Research, 43(11), 1–11.
Xu, Y., Zhao, X., Chen, Y., & Yang, Z. (2019). Research on a mixed gas classification algorithm based
on extreme random tree. Applied Sciences (Switzerland), 9(9).
Yadav, M., Wagener, T., & Gupta, H. (2007). Regionalization of constraints on expected watershed
response behavior for improved predictions in ungauged basins. Advances in Water Resources, 30(8),
1756–1774.
Yang, S., Yang, D., Chen, J., Santisirisomboon, J., Lu, W., & Zhao, B. (2020). A physical process and
machine learning combined hydrological model for daily streamflow simulations of large watersheds
with limited observation data. Journal of Hydrology, 590(March), 125206.
Zhang, X., Peng, Y., Zhang, C., & Wang, B. (2015). Are hybrid models integrated with data
preprocessing techniques suitable for monthly streamflow forecasting? Some experiment evidences.
Journal of Hydrology, 530, 137–152.