Applied Geochemistry: Mouigni Baraka Nafouanti, Junxia Li, Nasiru Abba Mustapha, Placide Uwamungu, Dalal AL-Alimi
Editorial handling by Prof. M. Kersten

Keywords: Groundwater; Random forest; Artificial neural network; Logistic regression

Groundwater fluoride poses a health risk to humans, and analyzing groundwater quality is time-consuming and expensive. Statistical methods provide a valuable approach to studying the spatial distribution of groundwater fluoride. Random Forest (RF), Artificial Neural Network (ANN), and Logistic Regression (LR) were used in this study to predict groundwater fluoride in the Datong Basin. The chemistry of 482 groundwater samples was used to evaluate the performance of the three statistical techniques and to extract the main factors controlling the enrichment of fluoride in groundwater. The data were split into two parts for the statistical analysis, 80% for training and 20% for testing. The Chi-squared test was applied to select the most relevant variables, and TDS, Cl−, NO3−, Na+, HCO3−, SO42−, K+, Zn, Ca2+, and Mg2+ were selected as the best inputs for fluoride prediction. Models were evaluated using the confusion matrix and the area under the receiver operating characteristic curve (ROC AUC). With these ten input variables, the accuracies of RF, ANN, and LR were 0.89, 0.85, and 0.76, respectively. The mean decrease in impurity (MDI) and the permutation feature show that eight of the ten parameters (TDS, Cl−, NO3−, Na+, HCO3−, SO42−, Ca2+, and Mg2+) influence groundwater fluoride in the study area. RF was the best-performing model, predicting groundwater fluoride contamination in the study area with high conformity and confidence.
* Corresponding author. State Key Laboratory of Biogeology and Environmental Geology, China University of Geosciences, Wuhan, 430074, China.
E-mail address: (J. Li).
Received 18 April 2021; Received in revised form 20 July 2021; Accepted 22 July 2021
Available online 24 July 2021
0883-2927/© 2021 Elsevier Ltd. All rights reserved.
M.B. Nafouanti et al. Applied Geochemistry 132 (2021) 105054
Rafique et al., 2009).

Fluoride concentrations in groundwater systems are reported to be influenced by natural hydrogeochemical processes, such as the dissolution of fluoride-containing minerals (including fluorite and biotite), precipitation of carbonate minerals, Ca–Na exchange on clay minerals, and intense evapotranspiration (Li et al., 2020). Additionally, recent studies showed that anthropogenic activities such as fertilizer use and irrigation directly affect the groundwater fluoride concentration (Ayenew, 2008; Cairncross and Feachem, 1993). Altogether, these factors leave the fate of fluoride in groundwater uncertain. Therefore, groundwater quality assessment, monitoring, and modeling are necessary for identifying groundwater trends and groundwater sustainability.

Numerical models have previously been applied to groundwater quality modeling (Rapantova et al., 2007). However, these models have limitations: they need a large quantity of data and considerable time, and their complex structure restricts their use (Alagha et al., 2014; Coppola et al., 2005). An alternative approach to assessing groundwater contamination is therefore needed.

Machine learning models can provide an efficient alternative for predicting groundwater contamination and have been widely used (e.g., Mohammadi et al., 2016a; Nadiri et al., 2013; Noshad et al., 2019). For example, the artificial neural network (ANN) can detect complex nonlinear relationships between the predictor variables and the dependent variable and can cope with erroneous and voluminous data (Mohammadi et al., 2016a; Tarasov et al., 2018). The random forest (RF) model can handle high-dimensional data, continuous and binary variables, and missing values. Furthermore, logistic regression (LR) is an efficient algorithm that trains quickly and is well suited to binary classification (Stoltzfus, 2011). Many studies have employed ANN, RF, and LR to predict groundwater contamination. For instance, ANN was applied to predict fluoride contamination in groundwater in Khaf (Mohammadi et al., 2016a). Likewise, it was used to forecast nitrate contamination of groundwater in Iran and to predict high fluoride concentrations in groundwater in the Maku area (Nadiri et al., 2019; Ostad-Ali-Askari et al., 2017). Similarly, RF was employed to predict groundwater contamination by uranium in California and by nitrate in southern Spain (Lopez et al., 2020; Rodriguez-Galiano et al., 2014). In addition, LR has been used to predict groundwater contamination. For example, it was employed in India to predict groundwater contamination by fluoride (Podgorski et al., 2018). However, these algorithms were each applied individually, and there is a gap in identifying which machine learning method predicts groundwater contamination most effectively. In this regard, the current study compares three machine learning methods, RF, ANN, and LR, for predicting fluoride in groundwater using binary classification.

The objective of the present work is to identify the most suitable predictive model for fluoride contamination in groundwater in the Datong Basin. To this end, three classifiers, RF, ANN, and LR, were evaluated and compared using physicochemical water parameters from the study area. The variables influencing fluoride in the study area were also determined. This investigation provides insights into using classification models to predict groundwater contamination, both in the study area and elsewhere.

2. Hydrogeological setting

The Datong Basin, with an area of around 6000 km2, belongs to the Shanxi rift system formed by Cenozoic faulted basins (Xing et al., 2013). It is situated in East Asia, in a seasonal monsoon region with a semiarid climate. Topographically, the area is enclosed by mountains and slopes from the northwest to the southeast (Fig. 1). The annual precipitation is between 225 mm and 400 mm, and the evapotranspiration is over 2000 mm. The annual average air temperature is 6.5 °C (Su et al., 2015). The Sanggan and Huangshui rivers are the main rivers crossing the study area. They are used for irrigation because of the agricultural activities developed in the area (Wang and Shpeyzer, 2000).

Bedrock outcrops occur in the west, east, and north. The outcrops in the north are basalt and Archean gneiss. The west consists of Carboniferous–Permian–Jurassic sandstone, Cambrian–Ordovician limestone, and shale. In the northeast, the basin is formed of sparse granite and Archean gneiss. The sediment in the basin is alluvial–pluvial sand and gravel. The central part of the basin is formed by alluvial–pluvial sands and lacustrine and alluvial–lacustrine sandy loam soils. Silts and silty clay rich in organic matter are also reported in the central part of the basin (Guo and Wang, 2005).

Three aquifers, upper, middle, and lower, lie beneath the flat alluvial–lacustrine plain in the basin center. The upper aquifer is formed by sands and gravel, usually occurs between 5 and 60 m below the land surface, and is 2–10 m thick. The middle aquifer consists of sandy gravel and sand from 60 to 160 m below the land surface. Finally, the lower aquifer consists of silt and fine sand at depths greater than 160 m below the land surface (Xie et al., 2009).

Groundwater is recharged by vertical infiltration of meteoric water in the basin, irrigation return flow, and flow through bedrock fractures at the mountain front, accompanied by lateral flow from non-perennial rivers (Guo and Wang, 2005). Evaporation and abstraction are the two major causes of groundwater discharge in the study area.

Fig. 1. Location map of the study area showing the sampling location.

3. Methodology

3.1. Sampling and analytical methods

Details of the groundwater sampling and chemical analysis are given in our previous work (Li et al., 2012). Briefly, 482 samples were collected from different wells in August 2011 (Fig. 1). Quality assurance and quality control were maintained during sampling and all analytical procedures (Li et al., 2012, 2020). All chemical measurements were performed at the State Key Laboratory of Biogeology and Environmental Geology, China University of Geosciences, Wuhan.

3.2. Data preprocessing

The data comprised 16 input variables (TDS, Cl−, NO3−, Na+, K+, HCO3−, SO42−, Ca2+, Mg2+, pH, Ba, Li, Mn, Pb, Sr, and Zn) and the dependent variable. The data were converted into low and high classes by assigning zero (0) to all fluoride concentrations lower than 1 mg/L and one (1) to fluoride concentrations higher than 1 mg/L. The independent variables were then scaled between 0 and 1 for the three algorithms to enhance model speed and accuracy. The data were then randomly divided into two parts, 80% for training and 20% for testing.

3.3. Selection of the relevant input

Discarding significant variables or retaining irrelevant variables degrades machine learning model performance (Gheyas and Smith, 2010). Filter methods were applied in this study to select the relevant inputs. These methods are faster than wrapper methods because they do not involve model training. Moreover, they can determine the relationship between the independent and the dependent variables (Hendrawan and Murase, 2011; Sánchez-Marono et al., 2007).

Among the filter methods, the Chi-squared test was used in this study for feature selection. The Chi-squared statistic compares the observed distribution of each variable in the dataset with the dependent variable. It sums the squared differences between observed and expected values, divided by the expected values, to determine the most relevant independent variables for the prediction (Lee et al., 2011). A variable is independent of the output when the observed counts are close to the expected counts, and it then has a small Chi-squared value. Thus, a high Chi-squared value indicates that the variable is more dependent on the output and can be chosen for model training. The variables were selected with "SelectKBest" from the sklearn library in Python, which retained the k input variables with the highest scores (Table 1). Ten (10) variables, TDS, Cl−, NO3−, Na+, K+, HCO3−, SO42−, Ca2+, Mg2+, and Zn, had high Chi-squared values and were selected as relevant inputs for groundwater fluoride prediction.

The Chi-squared statistic is defined as:

$$\chi_c^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \qquad (1)$$

c = degrees of freedom (the maximum number of independent values that have the freedom to vary in the data sample).
O = Observed value(s) (the values that are observed in the dataset).
E = Expected value(s) (the expected value of each cell is based on the row and column totals: the row total multiplied by the column total, divided by the overall total).

3.4. Random forest modeling

RF is an algorithm that can be used for regression and classification. In this study, random forest classification was used to predict groundwater fluoride contamination. RF combines many decision trees to limit overfitting, produce a robust model, and give high accuracy. Randomness enters the growing of the trees in two ways. First, each tree is grown on a random sample of the data rows drawn with replacement, which leaves about one-third of the rows "out-of-bag" (OOB), that is, not selected for that decision tree. Second, only a restricted number of randomly selected variables is available at each node. The number of trees and the number of predictor variables chosen at each node are the tuning parameters that determine the overall fit of the RF. In this work, one hundred (100) trees were grown to generate the RF model.

In addition, RF can identify significant predictor variables and efficiently describe how they affect contaminant occurrence in aquifers. In this study, to assess the essential variables, the mean decrease in
Table 1
Selection of relevant inputs using the Chi-squared analysis.

Variable   Score
TDS        20668.6
Na+        8967.3
HCO3−      8515.7
NO3−       5226.4
SO42−      2131.2
Cl−        1583.5
Ca2+       1001.9
Mg2+       459.9
K+         59.2
Zn         7.3
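The preprocessing and Chi-squared selection steps of Sections 3.2 and 3.3 can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the data are random stand-ins for the 482-sample table, the short column names are hypothetical, and the order of operations (Chi-squared scoring on the raw non-negative values, then scaling, then splitting) is an assumption, since the text does not state when the scoring was done relative to scaling.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the hydrochemical table (482 samples x 16 inputs).
# These are random numbers, NOT the Datong Basin measurements.
feature_names = ["TDS", "Cl", "NO3", "Na", "K", "HCO3", "SO4", "Ca",
                 "Mg", "pH", "Ba", "Li", "Mn", "Pb", "Sr", "Zn"]
X = rng.lognormal(mean=3.0, sigma=1.0, size=(482, len(feature_names)))
fluoride_mg_l = rng.lognormal(mean=0.0, sigma=1.0, size=482)

# Binary target: 1 for fluoride above 1 mg/L, 0 otherwise.
y = (fluoride_mg_l > 1.0).astype(int)

# Chi-squared filter: score every (non-negative) input against the class
# label and keep the k = 10 highest-scoring columns.
selector = SelectKBest(score_func=chi2, k=10)
X_selected = selector.fit_transform(X, y)
selected_names = [name for name, keep
                  in zip(feature_names, selector.get_support()) if keep]

# Scale the retained inputs to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X_selected)

# Random 80/20 split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0
)

print("selected inputs:", selected_names)
print("train/test sizes:", X_train.shape[0], X_test.shape[0])
```

With real data, `selector.scores_` would hold the per-variable scores of the kind listed in Table 1. Note that `chi2` in scikit-learn requires non-negative inputs, which concentration data satisfy.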
ANN is a model intended to simulate the behavior of biological neurons (Ostad-Ali-Askari et al., 2017). In this study, the ANN applied is the feedforward multilayer perceptron (MLP). The MLP is a type of neural network in which each neuron is connected to the neurons in the layer above it (Nevtipilova, 2014). The MLP used in this study was composed of three types of layers: input, hidden, and output (Fig. 2). The input layer was formed of 10 neurons, one for each predictor variable. The hidden part, where the data are processed, was composed of two layers, and the output layer produces the results. Each layer comprises fundamental elements named neurons, each of which has a threshold and an activation function essential to the training process (Dreyfus, 2008; Mohammadi et al., 2016b). In this study, the "adam" optimizer was used to update the weights in the network. The mathematical expression of the MLP is:

$$X_{nm} = \sum W_{nm} X_n + W_m \qquad (2)$$

where X_n represents the output of node n located in any of the previous layers, W_nm the weight associated with the link connecting nodes n and m, and W_m the bias of node m.

In this study, the activation function used in the hidden layers is ReLU, defined by:

$$f(x) = \max(0, x) = \begin{cases} x_i, & \text{if } x_i > 0 \\ 0, & \text{if } x_i \le 0 \end{cases} \qquad (3)$$

In the output layer, the activation function depends on the type of prediction. For this analysis, the sigmoid activation is applied in the output layer, and it is defined as:

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (4)$$

Fig. 2. Structure of Artificial Neural Network with the Inputs variables of the study area.

3.6. Logistic regression

Logistic regression (LR) is mostly used for binary classification (Qian et al., 2020). It is a transformation of linear regression using the sigmoid function. In this work, LR is applied to predict fluoride in groundwater. The LR equation is:

$$F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \qquad (5)$$

where β0 and β1 are the estimated parameters.

3.7. Model evaluation criteria

The predictive capability of each model in the testing stage was evaluated using its confusion matrix. The accuracy, sensitivity, specificity, and error were calculated to assess the model prediction. The area under the receiver operating characteristic curve (ROC AUC) was also considered to evaluate the LR.

The evaluation of predictive performance for binary classification is mainly based on the confusion matrix. It shows how the model classified the actual values compared to the predicted values (Bowes et al., 2012). The predictions were compared to the observed concentrations to identify the percentage of observations that were correctly classified. The percentage of high-fluoride samples correctly classified is the sensitivity, and the percentage of low-fluoride samples correctly classified is the specificity.

The three models were implemented in Python 3.7.

The metrics derived from the confusion matrix are defined as:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (6)$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (7)$$

Table 3
Statistical metrics for the evaluation of the three algorithms (Random Forest, Neural Networks, and Logistic Regression) using physico-chemical water parameters.

Metrics   RF   ANN   LR

Table 2
Statistical analysis of physico-chemical parameters for groundwater samples from the study area.

Variables            TDS     Cl−     NO3−    SO42−   HCO3−   K+      Na+     Ca2+    Mg2+    Zn     F−
Minimum              289.3   5.3     0.01    0.01    159.2   0.01    5.8     3.2     4.3     0.01   0.01
Maximum              20588   8032    3855    6688    1786    326.7   2895    716.6   1913    18.2   66.7
Mean                 1458    250.4   65.9    274.9   476.2   6.9     251.2   55.7    74.1    0.7    1.7
Standard deviation   1880    608.7   212.9   553.7   264.1   25      391.2   59.8    128.5   0.8    3.4
(unit: mg/L).
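As a rough illustration of the modeling and evaluation steps in Sections 3.4 to 3.7, the sketch below fits the three classifiers with scikit-learn and derives the metrics of Eqs. (6) and (7) from the confusion matrix. It is an assumption-laden sketch rather than the authors' code: the data are synthetic, the hidden-layer widths `(16, 8)` and iteration limits are illustrative guesses (the text only states two hidden layers, ReLU, and the "adam" optimizer), and the specificity line uses the standard definition TN / (TN + FP), whose equation does not appear in this excerpt.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic, already scaled inputs (482 samples x 10 variables) and a
# binary label; random stand-ins, NOT the study data.
X = rng.random((482, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(482) > 0.75).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    # 100 trees, as stated in Section 3.4.
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    # Two hidden layers with ReLU and the Adam solver; the widths (16, 8)
    # and max_iter are illustrative assumptions.
    "ANN": MLPClassifier(hidden_layer_sizes=(16, 8), activation="relu",
                         solver="adam", max_iter=2000, random_state=0),
    "LR": LogisticRegression(max_iter=1000),
}


def evaluate(y_true, y_pred):
    """Accuracy, sensitivity and specificity from a 2x2 confusion matrix."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # Eq. (6)
        "sensitivity": tp / (tp + fn),                # Eq. (7)
        "specificity": tn / (tn + fp),                # standard definition
    }


results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores = evaluate(y_test, model.predict(X_test))
    # ROC AUC uses the predicted probability of the high-fluoride class.
    scores["auc"] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    results[name] = scores
    print(name, {k: round(float(v), 3) for k, v in scores.items()})
```

The numbers printed for the synthetic data have no bearing on the accuracies reported in the paper; only the structure of the comparison is meant to carry over.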
instead combine many trees to produce a prediction, which increases the model performance.

LR showed the lowest performance among the three models in accuracy, sensitivity, and specificity (Table 3). The lower performance of LR can occur with high-dimensional training data, where the model may overfit and be inaccurate on the test data set. Despite the weaker performance of ANN and LR in the current study, there are advantages to using them in other studies to predict groundwater contamination.

The process of groundwater contamination is complicated to understand because of the several fluctuating variables involved. Consequently, the more flexible the algorithm, the more predictive and reliable the model (De'ath and Fabricius, 2000). An algorithm's performance depends on the algorithm structure, the nature of the data, and the parameter selection (Asim et al., 2018). For statistical analysis, feature selection (e.g., filter methods) should be considered to obtain an excellent predictive model in such classification tasks.

4.3. Identification of the variables influencing the fluoride mobilization

The relationship between the predictors and fluoride was determined using the mean decrease in impurity (MDI), a measure of variable importance in RF (Calle and Urrea, 2011). It is a tree-specific feature importance measure computed by the feature importance implemented in the scikit-learn library for RF in Python. The MDI of each feature is accumulated across every tree in the forest each time the variable is chosen to split a node. As shown in Fig. 4, the variables that tend to split nodes closer to the tree root have larger values. Thus, the essential variables of the model are the highest in the plot and have the largest MDI values, which is the case for TDS, Cl−, NO3−, Na+, HCO3−, SO42−, Ca2+, and Mg2+. The variables K+ and Zn are lower in the plot, which means they have a small MDI, suggesting that they do not influence fluoride in the study area.

Fig. 4. Important Features to the Fluoride using Mean Decrease in Impurity in Random forest.

The use of the MDI to determine which variables in a dataset are important to the dependent variable has been reported in previous studies (Breiman, 2001; Meinshausen, 2007; Zhao, 2000). In addition, the MDI was used to identify significant predictors in microarray and facies prediction studies and has shown the ability to identify the important variables related to the dependent variable (Archer and Kimes, 2008; Bhattacharya and Mishra, 2018). The MDI results show that TDS, Cl−, NO3−, Na+, HCO3−, SO42−, Ca2+, and Mg2+ are the variables influencing fluoride in the study area, which is consistent with the findings of previous studies (Chae et al., 2007; Dhiman and Keshari, 2006; Guo et al., 2007).

The permutation feature was adopted to assess variable importance for the ANN and to identify the variables most influential on the output. It measures the decrease in the final model score when a single variable is eliminated (Maier and Dandy, 1996; Wen et al., 2013). Overall, eleven (11) networks were evaluated to determine the variables most significant to the output. Each one showed the change in network accuracy after removing a variable (Table 4).

Table 4
Feature importance using the permutation feature for ANN, showing the change in accuracy after a variable is eliminated.

Variables           Accuracy variation for ANN
All variables       0.85
Eliminated TDS      0.77
Eliminated Cl−      0.82
Eliminated NO3−     0.79
Eliminated SO42−    0.78
Eliminated HCO3−    0.82
Eliminated K+       0.85
Eliminated Na+      0.78
Eliminated Ca2+     0.77
Eliminated Mg2+     0.80
Eliminated Zn       0.85

After eliminating K+ or Zn, the accuracy is 0.85, the same as the original model accuracy. Therefore, K+ and Zn could be excluded from the model, as they do not affect the network accuracy, which suggests that K+ and Zn do not enhance fluoride in the study area. Conversely, when other variables such as TDS, Cl−, NO3−, Na+, HCO3−, SO42−, Ca2+, and Mg2+ are eliminated, the model accuracy decreases, confirming their importance to fluoride. Previous studies used the permutation feature to determine the most important variables for dissolved oxygen and learning event data (Matayoshi et al., 2019; Wen et al., 2013).

Therefore, in this study, the permutation feature and the mean decrease in impurity give the same result: TDS, Cl−, NO3−, Na+, HCO3−, SO42−, Ca2+, and Mg2+ are the variables influencing fluoride in the study area. However, the permutation feature is more widely applicable than the MDI for determining the relationship between the input and output variables. It can be applied to any algorithm to assess the variables essential to the output, whereas the MDI is an importance measure limited to the RF algorithm.

Previous studies have demonstrated different chemicals and processes that influence fluoride in the study area (Su et al., 2013, 2015). High-fluoride groundwater was generally characterized by the water types HCO3–Na(Mg), HCO3·SO4–Na(Mg), and SO4·Cl–Na(Mg) (Su et al., 2013). Moreover, it stated that the increase in groundwater

methods find similar variables related to fluoride, including TDS, Cl−, NO3−, Na+, HCO3−, SO42−, Ca2+, and Mg2+.

According to several model evaluation criteria, the RF algorithm outperformed the ANN and LR in predicting groundwater fluoride contamination. These results suggest that the RF model can be used as a consistent algorithm to predict groundwater fluoride in the Datong Basin and can be applied to other study areas to predict groundwater contamination. However, to build on the consistent performance of RF in predicting groundwater fluoride in the study area, future research should focus on developing other models that are more flexible in predicting groundwater contamination.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

The research work was financially supported by the National Natural Science Foundation of China (4202010400, 41521001, and 41502230), the Ministry of Education of China (111 projects), and the Fundamental Research Funds for the Central Universities, China University of Geosciences (Wuhan) (No.CUGGC07).

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.
