Groundwater Prediction
ABSTRACT Water is essential to every living thing, and drilling is the only way to obtain it from underground. Different advanced technologies have been used to reduce the time and labor that drilling requires. Besides the technology used, other factors are equally important, such as the water level, the hardness of the land, and the number of days spent on the whole process. This study proposes a weighted voting classifier based on Differential Evolution (DE) to classify regions by soil color and land layer. Weights are assigned to the candidate classifiers based on their performance for each class, and the DE optimization algorithm is used to find the optimal weights. Moreover, the study presents a chained multi-objective regression model that simultaneously predicts the water level and total depth at different locations. The proposed work helps the drilling industry increase the rate of penetration (ROP) by selecting regions with soft soil and land layers. Predicting depth and water level allows the industry to estimate water levels in different areas at different depths. The dataset, provided by a research organization, contains information on different drilling points. The results of the proposed weighted voting classifier are compared with traditional machine learning models (kernel Naive Bayes, Gaussian SVM, Quadratic SVM, and Bilayered Neural Network) and a state-of-the-art voting classifier in terms of precision, recall, and accuracy. The proposed regression model is evaluated with well-known metrics, including Mean Absolute Error, Mean Square Error, and R2 score. Finally, the comparison verifies the effectiveness of the enhanced optimization-based classifier and multi-objective regressor.
INDEX TERMS Weighted voting classifier, multi-objective regression, groundwater level, planning and risk assessment, water resource management.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 9, 2021 168329
A. Rizwan et al.: Enhanced Optimization-Based Voting Classifier and Chained Multi-Objective Regressor
problems in city construction, the proposed study predicts the soil color and layer in different areas and estimates the water level and depth.

Drilling is an important, and the only, way to obtain water from underground. Groundwater lies at different levels, and different land layers lead to it [4]. In this study, the soil color and land layer are predicted using the proposed DE-based weighted voting classifier, and the results are compared with some state-of-the-art ML algorithms and an ensemble voting classifier [5]. In ensemble techniques, multiple classifiers are trained, and the knowledge of all classifiers is used to make the final decision [6]. Compared to an individual classifier, a combination of multiple classifiers can be more effective [7]. To get a good ensemble classifier, the candidate classifiers should be as strong as possible. The selection of the best member classifiers is a concern of many researchers [8]. There are multiple ensemble techniques for classification problems, including bagging, boosting, and stacking [9]–[11]. For multi-class problems, the combination of multiple classifiers can be more effective. The drilling dataset has ten different soil colors and eight rock layers of the land. The dataset is provided by a research organization and contains each borehole's information at a certain location. Most of the features in the dataset are continuous numeric values, and the other features are encoded as numeric values to simplify the calculations.

The dataset contains different soil colors and land layers that lead to the water. The prediction of soil color is important to reduce the risks involved in the drilling process. The DE-based optimization algorithm is used to select the best candidates and assign a weight to each classifier based on its performance. The voting strategy merges the knowledge of all the member classifiers into one ensemble model, which is expected to perform better than the individual classifiers. The soil color has ten different values because soil of multiple colors occurs at different depths. The hardness or softness associated with each soil color is also essential for estimating the cost of the whole process before drilling starts. Like soil color, the land layer is another attribute representing the underground layers of the land at different depths. The data contain eight different land layers: landfill layer, sedimentary layer, weathered soil layer, weathered rock layer, soft rock layer, Gyeongam Formation, ordinary rock formation, and burlap soil layer. Each of these layers has a different level of hardness and average digging depth. The hardness and softness of the soil color and land layer are computed from the depth and the number of days spent to achieve that depth.

For the prediction of the water level and total depth of the borehole, multi-regression models are used. The multi-regression models predict two values at the same time based on independent input variables. In the chaining mechanism, two trained models are combined to predict two targets simultaneously. The results of the proposed chained multi-out regression model based on Support Vector Regressor (SVR) are compared with Decision Tree Regressor (DT-R), K-Nearest Neighbor Regressor (KNN-R), and Linear Regressor (LR). The results are compared in terms of Mean Absolute Error (MAE), Mean Square Error (MSE), and R2 score.

The core contributions of the proposed study are listed below.
• Different features are extracted from the dataset, and the hardness level of the underground layers is computed
• The study proposes a weighted voting classifier based on the Differential Evolution optimization algorithm
• The weights of the candidate classifiers are learned by DE based on the performance of each classifier
• A chained multi-objective regression model is presented to predict the water level and drilling depth in different regions
• The study helps the drilling industry select regions with a soft layer and an acceptable range of water level and depth

The rest of the paper is structured as follows: Section II discusses the models and techniques used to facilitate the drilling industry; Section III describes the proposed weighted voting classifier for classification and the chained multi-output model for the prediction of water level and drilling depth. Section IV presents the results of the proposed models and compares them with existing state-of-the-art ML techniques to show the effectiveness of the proposed model. Finally, Section V concludes the contributions with some possible future directions.

II. RELATED WORK
Groundwater is a complex and, at the same time, fragile resource that is substantial to domestic, economic, agricultural, and industrial activities [12]. Moreover, it is a vital source of safe drinking water worldwide and plays a critical role in natural ecosystems. Groundwater exploration can be done in various ways. One of the widely used groundwater exploration and development techniques is drilling. However, drilling is not confined to just boring a hole in the ground and acquiring water from it [13]. Groundwater drilling is becoming a multi-billion industry due to the growing demand for groundwater acquisition through downhole drilling operations. The surge in water demand and the need to provide safe drinking water have intensified drilling operations for groundwater resource exploitation. However, groundwater acquisition is not as simple as it looks [14]. Before drilling, many preliminary surveys and analyses need to be conducted so that the desired groundwater characteristics can be estimated beforehand. A particular geological setting, hydraulic continuity, groundwater depth, quality, and quantity, along with geological layer analysis, are a substantial part of groundwater characteristics.

The drilling process for groundwater acquisition can simply be defined in four major steps. The first is a site visit by a hydrogeological expert to assess geological characteristics so that the risk of drilling into natural hazards
can be avoided [15]. The second phase involves the drilling and construction of boreholes, followed by aquifer testing to determine the water borehole yield. Lastly, the hydrogeologists ascertain the pumping and piping system types based on the intended use of the water resources. However, the drilling process is highly resource intensive and consumes a large budget. The cost of drilling is influenced by the type of ground, groundwater, and borehole depth, the machinery involved, experts, human resources, and materials required [16]. Due to advanced drilling and pumping techniques, the reliance on groundwater has increased manifold. Increased drilling operations all over the world are producing a massive amount of data. Hence, data-driven modeling and the application of advanced data analytics tools to predict downhole environmental aspects are deemed necessary. Geological and groundwater bore complexities incur additional costs and waste scarce resources. Consequently, optimizing drilling operations through the application of machine learning techniques is essential to reduce resource wastage and increase drilling productivity [17]. Prediction and classification of various hydrogeological factors can help drilling companies avoid issues like stuck pipe, cost overruns, and low water levels during the drilling process in different areas.

Recently, many techniques have been applied to optimize the drilling process and achieve optimal drilling parameters such as the rate of penetration [18]–[20]. Predictive modeling and big data analytics for time series drilling data have attracted huge interest from the scientific community. Researchers have successfully implemented machine learning and artificial intelligence methods with a major focus on reducing various parameters that are substantial for groundwater drilling, such as the non-productive and invisible lost time during drilling [21]. For efficient groundwater acquisition, there is a need to optimize drilling process parameters such as the rate of penetration. Articles [22], [23] proposed a machine learning model for predicting the rate of penetration based on data analytics using a real-time data set. The authors consider seven input parameters, including surface parameters and others such as torque, stand pipe pressure, and differential pressure. Experimental results reveal that the rate of penetration increased by 14%. In article [24], the rate of penetration is predicted using statistical regression and artificial neural networks. To find the correlation between various variables, the authors performed exploratory data analysis. Furthermore, the importance of predictors is also computed. For prediction, several models are employed, such as neural network, step-wise regression, classification and regression trees (CART), and K-nearest neighbors. The findings of that work suggest that ensemble methods can improve the performance of the system. Another study [25] proposed a probabilistic frequency-based ratio model for mapping groundwater potential using eight factors related to topography, geology, and satellite imagery. Moreover, the authors analyzed several relationships between the yield of groundwater and hydrogeological factors. Results of the frequency model showed an area under the curve of about 84.78 percent.

Article [26] proposed a random forest-based mapping scheme for groundwater yield. Experimental findings suggested that the self-learning random forest improved generalization performance by 23 percent. Furthermore, a comparative analysis of the proposed approach with random forest, support vector machine, artificial neural network, decision tree, and voted artificial neural network and random forest verified the effectiveness of the proposed solution. Article [27] developed a prediction framework based on machine learning algorithms for predicting the water depth. The input variables considered for the study are meteorological data and historical and upstream water level gauge data from 2009 to 2015. Artificial Neural Network, Random Forest, Decision Tree, and Support Vector Machine models are trained to predict the water level. Experimental findings suggest that Random Forest achieved a root mean square error (RMSE) of 0.09 percent. The study [28] proposed a deep learning framework for predicting the water quality and depth. The deep learning framework comprises a CNN deployed for water level simulation and an LSTM for water level prediction. The results confirm the effectiveness of the proposed solution for accurately simulating water quality and quantity.

Article [29] proposed a modeling approach for identifying changes in groundwater levels using machine learning models. The study applied an ensemble modeling scheme using spectral analysis. Experimental results showed that ensemble learning can be used as an alternative approach for the simulation of groundwater changes and can better ascertain water availability in regions with subsurface properties. The ROP is treated as a prediction problem and solved by using different optimization techniques [30]. An ANN is used to predict the ROP by optimizing the parameters. The method was applied to vertical and horizontal drilling for oil wells. Furthermore, the authors established the relation between irrigation demand and its impact on change in groundwater level. Another study [31] presented a comparative analysis of machine learning models for mapping groundwater potential through prediction. The models involved in the research are mixture discriminant analysis (MDA), random forest (RF), multivariate adaptive regression spline (MARS), and boosted regression tree. The study analyzed various hydrogeological and physiographical factors along with their respective spatial distributions. Results verify the effectiveness of the MDA model for mapping groundwater potential. Article [32] analyzed groundwater prediction approaches that use machine learning methods. The findings of that study suggest that data-driven modeling approaches are highly effective in predictive modeling and decision making; however, there is a need to improve the accuracy of these systems. Article [33] proposed an SVM-based model to detect anomalies in groundwater using real-time data. Another study [34] presented a machine learning-based solution for predicting lithology formation based on drilling parameters. Results show excellent
The soil color and land layer have different patterns and are somewhat related, so neighbor information is useful. Considering this, KNN is selected as one of the candidates for voting.

In the first step of the experiments, raw data is passed to the preprocessing phase before being passed on for classification. Then, as a first step, feature importance is computed, and features are selected to impute the missing values. Instead of removing missing values from the data, we imputed them with the KNN imputation technique.

2) PROPOSED ENSEMBLE WEIGHTED VOTING MODEL
In the ensemble model, the voting classifier is used to classify the data into different classes. Voting itself is not a classifier but a technique to combine the results of multiple classifiers. The wrapper of a voting classifier consists of multiple trained classifiers and assigns the target label to each sample based on the number of votes. The same data samples are passed to all candidate classifiers to train them, and the trained models are combined to predict the final output. Ensemble voting classifiers are well suited to multi-class problems [46]. We have selected four different classifiers for the voting wrapper: Gaussian Support Vector classifier, K-Nearest Neighbor classifier, Kernel Gaussian NB, and Decision Tree. The voting strategy merges the knowledge of all involved candidates and decides the label for each testing sample. The conceptual model of the voting classifier is shown in Figure 5. In the traditional voting classifier, however, each candidate classifier contributes equally to the selection of the target variable. In some situations, one or two candidate classifiers are less confident than the others in assigning a target to a sample, and in that case the wrong class can be assigned.

Considering this limitation of the traditional classifier, we propose a weighted voting classifier in which DE selects the optimal weight for each classifier so that the best target is chosen for each sample. The problem is to assign optimal weights to each classifier based on its confidence for each class. DE is a population-based optimization technique widely used for search problems in the literature. DE first creates a population of N individuals, each a D-dimensional vector $X_{i,G} = [x_{i,G}(1), x_{i,G}(2), \ldots, x_{i,G}(j), \ldots, x_{i,G}(D)]$. DE performs three main steps: mutation, crossover, and selection.

Mutation: Mutation is the first step of the DE optimization algorithm. It generates the donor (mutant) vector $V_{i,G}$ for each target vector $X_{i,G}$ of the current generation.

Crossover: Crossover generates the trial vector $T_{i,G} = [t_{i,G}(1), t_{i,G}(2), \ldots, t_{i,G}(j), \ldots, t_{i,G}(D)]$ by performing the crossover operation between the target vector and the corresponding mutant vector. For each variable j of the D-dimensional vector:

$$t_{i,G}(j) = \begin{cases} v_{i,G}(j), & \text{if } \mathrm{rand}_{i,j}[0,1] < CR \\ x_{i,G}(j), & \text{otherwise} \end{cases} \tag{6}$$

where CR represents the crossover rate in the range [0,1], and the random number is drawn from a uniform distribution for each j-th value. The random value ensures that the trial vector $T_{i,G}$ gets at least one value from the mutant vector $V_{i,G}$.

Selection: The selection step compares each trial vector with its target individual and decides which of the two survives. If the trial is at least as good as the target, the trial individual is carried into the next generation; otherwise the target is kept. The selection operation can be written as

$$X_{i,G+1}(j) = \begin{cases} T_{i,G}(j), & \text{if } f(T_{i,G}) \le f(X_{i,G}) \\ X_{i,G}(j), & \text{otherwise} \end{cases} \tag{7}$$

where $f(\cdot)$ is the function to be optimized; the comparison ensures that the better of the two is kept for each individual.

The described technique is used to select optimal weights for the classifiers; however, multiple other crossover and mutation techniques for DE have been proposed [47]–[50]. Algorithm 1 shows the flow of selecting the best weights using DE from the given search space. The DE algorithm returns the best weight vector $(w_1, \ldots, w_n)$, with one weight for each member classifier.

Algorithm 1 DE Based Proposed Weighted Voting Classifier
  Input: DE control parameters: population size N, mutation factor F, and crossover rate CR
  Output: Optimal weights (w_1, ..., w_n)
  Initialization: population of N individuals, X_G = {X_{1,G}, X_{2,G}, ..., X_{N,G}}  // uniformly distributed
    where X_{i,G} = [x_{i,G}(1), x_{i,G}(2), ..., x_{i,G}(D)] represents the weights (w_1, ..., w_n) of the classifiers
  G ← 0  // generation iteration
  while stopping criteria not satisfied do
    for i = 0; i < N; i++ do
      Select distinct indices r1, r2, r3  /* should be different from i */
      V_{i,G} = X_{r1,G} + F × (X_{r2,G} − X_{r3,G})  // compute mutant vector
      j_rand ← random()  // random value
      for j = 0; j < D; j++ do
        T_{i,G} ← using eq. (6)
      X_{i,G+1} ← fitness of T_{i,G} and X_{i,G} using eq. (7)
    G ← G + 1  // increase termination parameter

The fitness evaluation of the model is based on the accuracy of the member classifiers. For c categories in the target attribute and D classifiers voting, the predicted value $V_p$ of the proposed weighted voting for the k-th sample is

$$V_p = \arg\max_{j} \sum_{i=1}^{D} (\vartheta_{ij} \times w_i) \tag{8}$$

where $\vartheta_{ij}$ is the binary decision variable; $\vartheta_{ij} = 1$ if the j-th category is assigned to the k-th sample by the i-th classifier and
$\vartheta_{ij} = 0$ otherwise. $w_i$ represents the weight of the i-th classifier and is optimized by Algorithm 1.

FIGURE 3. Data preprocessing, feature extraction, state-of-the-art ML techniques, and operational flow of the enhanced optimization-based voting classifier.

C. REGRESSION
1) REGRESSION MODELS
Regression analysis is a type of prediction in which the relationship between dependent and independent variables is investigated. Moreover, these techniques are used in time series prediction, forecasting, and determining the causal relationship between variables. Multiple regression techniques are available to determine or predict the next value of a dependent variable. These models fit a curve to the given data. Linear regression, K-nearest neighbors regression, decision tree regression, and support vector regression are well-known and widely used regression models. We use the traditional regression models to predict water level and drilling depth and compare the results of the conventional models with the proposed SVR-based chained multi-objective regression model in terms of MAE, MSE, and R2 score.

In the proposed chained multi-out regression model, the support vector regressor is selected to build a chain of models, one for each target attribute. The single chained model is able to predict both target attributes at one time, as shown in Figure 6. Because SVR shows better results than the other ML models, it is selected as the candidate for the proposed technique. In the multi-out regression model, the SVR model is first trained on one target attribute, and then the prediction of the first model, along with the training data, is passed to the second model to predict the other attribute. The chained model automatically splits the target attributes, trains both models, chains the trained models, and simultaneously predicts multiple variables.

The prediction of the water level and depth starts with the separation of the target variables from the independent features, as shown in Figure 7. After the separation, the stand-alone regression models are applied to predict the water level and total depth separately. After that, the SVR-based multi-objective chained model is applied to predict both variables
simultaneously. K-fold cross-validation is used to train the model. Finally, the existing and proposed models are evaluated using MAE, MSE, and R2 score.

FIGURE 6. Multiout Regression model based on SVR.

IV. RESULTS
A. EVALUATION METRICS
To evaluate the performance of the proposed classification model and for a fair comparison, accuracy, precision, and recall are used as evaluation metrics. These metrics are widely used for the evaluation of classification models [51]. The accuracy metric refers to how close the measurements are to the true or accepted value, eq. (9).

$$\text{Recall} = \frac{TP}{TP + FN} \tag{10}$$

where TP refers to the number of true positives and FN to the number of false negatives. Precision is closely related to recall; it also reflects information about the true positive class. It is the fraction of true positive samples among all true positive and false positive samples, eq. (11).

$$\text{Precision} = \frac{TP}{TP + FP} \tag{11}$$

As there are multiple classes of soil color and land layer, the reported value is the mean over all classes.

To evaluate the performance of the chained multi-out regression model, MAE, MSE, and R2 are used. MAE is the mean absolute difference between the actual and predicted water level and drilling depth. Eq. (12) shows how the MAE is computed from the actual and predicted values.

$$\text{MAE} = \frac{\sum_{i=1}^{n} |y_i - x_i|}{n} \tag{12}$$

The mean squared deviation (MSD) or mean square error (MSE) averages the squared error between the actual and predicted values.

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i)^2 \tag{13}$$
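The chaining idea of Section III and the regression metrics of eqs. (12) and (13) can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: a tiny k-nearest-neighbor regressor stands in for SVR to keep the sketch dependency-free, the class and function names (`KNNRegressor`, `ChainedRegressor`, `mae`, `mse`, `r2_score`) and the toy data are hypothetical, and R2 is computed with the standard coefficient-of-determination formula because its equation is not reproduced in this section.

```python
from math import dist


class KNNRegressor:
    """Tiny k-nearest-neighbour regressor; a stand-in for SVR in this sketch."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y):
        self.X = [tuple(row) for row in X]
        self.y = list(y)
        return self

    def predict(self, X):
        preds = []
        for query in X:
            # Training-row indices ordered by Euclidean distance to the query.
            order = sorted(range(len(self.X)), key=lambda i: dist(self.X[i], query))
            nearest = order[: self.k]
            preds.append(sum(self.y[i] for i in nearest) / len(nearest))
        return preds


class ChainedRegressor:
    """Two regressors in a chain: the second also sees the first one's prediction."""

    def __init__(self, make_model):
        self.first = make_model()
        self.second = make_model()

    @staticmethod
    def _augment(X, extra):
        # Append the first model's prediction as an additional input feature.
        return [tuple(row) + (value,) for row, value in zip(X, extra)]

    def fit(self, X, y_first, y_second):
        self.first.fit(X, y_first)
        first_pred = self.first.predict(X)
        self.second.fit(self._augment(X, first_pred), y_second)
        return self

    def predict(self, X):
        first_pred = self.first.predict(X)
        second_pred = self.second.predict(self._augment(X, first_pred))
        return first_pred, second_pred


def mae(actual, predicted):
    """Eq. (12): mean absolute error."""
    return sum(abs(y - x) for y, x in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Eq. (13): mean squared error."""
    return sum((y - x) ** 2 for y, x in zip(actual, predicted)) / len(actual)


def r2_score(actual, predicted):
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((y - x) ** 2 for y, x in zip(actual, predicted))
    ss_tot = sum((y - mean) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


# Hypothetical toy data: one input feature and two targets per borehole.
X = [(1,), (2,), (3,), (4,)]
water_level = [10.0, 12.0, 14.0, 16.0]
depth = [50.0, 60.0, 70.0, 80.0]

chain = ChainedRegressor(lambda: KNNRegressor(k=1)).fit(X, water_level, depth)
pred_level, pred_depth = chain.predict([(3,)])
```

Passing a factory (`make_model`) rather than a fitted model keeps the two chain links independent, so any regressor with `fit` and `predict` methods could be swapped in for the stand-in.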
TABLE 2. Average accuracy, precision, and recall of validation and test phase for soil color.
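The class-averaged scores of the kind reported in this table can be illustrated with a short sketch of eqs. (10) and (11). It assumes that "the mean of all classes" means an unweighted (macro) average over classes, that accuracy is the usual fraction of correct predictions, and that a class with no predictions scores 0.0; the function names and the toy labels are hypothetical.

```python
def macro_precision_recall(y_true, y_pred):
    """Per-class precision (eq. 11) and recall (eq. 10), averaged over classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # Assumption: an undefined ratio (empty denominator) counts as 0.0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical soil-color labels for four samples.
y_true = ["brown", "brown", "red", "red"]
y_pred = ["brown", "red", "red", "red"]
precision, recall = macro_precision_recall(y_true, y_pred)
```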
B. CLASSIFICATION RESULTS
To predict the underground soil color and land layer in different regions, multiple classification algorithms are applied, and the results are listed in Tables 2 and 3. The drilling task becomes easier when the land layer and soil color are known in advance because soil colors and land layers have different softness and hardness levels. The average digging capacity per day for each layer and soil color indicates the hardness level. Figure 8 shows the average digging capacity of each soil color and land layer.

There are eight different land layers in the given dataset. Four different state-of-the-art ML algorithms are selected to predict land layers at different locations and depths to solve this multi-class problem. The Bilayered NN performs better than all other ML algorithms in terms of precision, recall, and accuracy. The ensemble voting classifier is also applied so that its results can be compared with the proposed optimization-based weighted voting classifier. The same member classifiers are used for both voting classifiers. The results show that the proposed weighted voting classifier achieves better classification results than the existing voting classifier and traditional ML classification algorithms.

The results of the prediction of soil color are given in Table 2. The results show the effectiveness of the proposed weighted voting classifier. The accuracy achieved by the proposed model is 82.76% for validation and 86.69% for the test phase. The proposed model uses the knowledge of all classifiers based on their performance for each class, so the results are better than those of all the other state-of-the-art techniques. The early prediction of land with soft soil will allow the drilling industry to plan resources before starting the actual process. In addition, the rate of penetration can be increased by considering the predicted information. There are multiple risks involved in drilling, including stuck pipes, broken pipe fluid, and cost overruns. All these risks arise because of hard layers: in a hard layer, pipes get stuck, and the pipe fluid can also be broken. Early prediction can minimize the risks involved in the drilling process. If the predicted layer is hard, the drilling industry should use the related material to reduce the time and cost.

The proposed weighted voting classifier merges the knowledge of all the involved candidate models based on their confidence level of prediction and decides the label. For better measurement, 10-fold cross-validation is used to train the candidate models, and the average accuracy is used as the final parameter for the optimization module. For each iteration, nine out of ten folds are used for training, and the process is repeated ten times so that each fold is used once in the testing phase. The standard deviation, with upper and lower bounds, of the precision, recall, and accuracy achieved in the folds is reported for each classifier. To show the effectiveness of the proposed method, the model is compared with existing ML techniques for both soil color and land layer, as shown in Tables 2 and 3, respectively. The validation and test accuracy, precision, and recall are reported and compared. Because the proposed model considers each candidate classifier's performance, optimal weights are assigned to each classifier. As a result, a candidate with higher accuracy plays a larger role in the selection of the target variable than the other algorithms.

The average per-day digging capacity of each soil color and land layer is computed to analyze the hardness level. The per-day digging capacity of both soil color and land layer is computed using eq. (15).

$$DC(\text{color}, \text{layer}) = \frac{\sum_{i=1}^{n} (ED_i - SD_i)}{n} \tag{15}$$

where DC is the digging capacity, ED is the ending depth, SD is the starting depth, and n is the number of instances of that specific soil color or land layer. Figure 8 illustrates the information computed by eq. (15) for both soil color and land layer. The figure shows that the dark brown soil color is soft because the average depth achieved per day is 4.3 meters. In contrast, the depth achieved for partridge soil color is 2.3 meters, which shows that partridge soil is hard and consumes more days and resources. The drilling industry needs to pay attention when the partridge soil color is predicted because the fluid of the pipe can be broken because of its hardness. In the case of the land layer, the Gyeongam Formation layer is soft compared to all others, while the landfill layer is too hard because only 1.5 meters of depth can be achieved in a whole day of work. By analyzing the results for the land layer, we can say that the rate of penetration in the landfill layer is low. Figure 2 shows the drilling point with a different layered structure. Each slice of the 3D pipe shows the digging depth on a specific day, and the total number of slices in the 3D pipe shows the number of days spent on that particular borehole. The z-axis of the figure shows the total depth of the drilling points. Different colors on one borehole point show the number of days spent on each drilling point and the thickness of each layer. The situation of the underground water table is also depicted in the illustration. In some areas, the water level is low, while in others it is too high.
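The weighted vote of eq. (8) and the DE weight search of Algorithm 1 that underlie these classification results can be sketched as follows. This is a simplified sketch under several assumptions that the text does not fix: the objective f is taken to be the error rate (1 minus accuracy) so that the "≤" test of eq. (7) keeps the better vector, the random value j_rand is treated as a random index that forces one mutant component into the trial, weights are clipped to [0, 1], and the search runs for a fixed number of generations. The function names and the toy votes are hypothetical.

```python
import random


def weighted_vote(votes, weights, n_classes):
    """Eq. (8): votes[i] is the class index chosen by classifier i."""
    score = [0.0] * n_classes
    for cls, w in zip(votes, weights):
        score[cls] += w  # theta_ij is 1 only for the class the classifier picked
    return max(range(n_classes), key=lambda j: score[j])


def vote_error(weights, all_votes, labels, n_classes):
    """Objective f: fraction of samples the weighted vote labels incorrectly."""
    wrong = sum(weighted_vote(v, weights, n_classes) != y
                for v, y in zip(all_votes, labels))
    return wrong / len(labels)


def de_optimize(f, dim, pop_size=10, F=0.5, CR=0.9, generations=30, seed=0):
    """Minimise f over [0, 1]^dim with mutation, crossover, and selection."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    cost = [f(x) for x in pop]
    for _ in range(generations):
        new_pop, new_cost = [], []
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            r1, r2, r3 = rng.sample([k for k in range(pop_size) if k != i], 3)
            mutant = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j]) for j in range(dim)]
            # Crossover, eq. (6); j_rand guarantees one mutant component.
            j_rand = rng.randrange(dim)
            trial = [mutant[j] if (rng.random() < CR or j == j_rand) else pop[i][j]
                     for j in range(dim)]
            trial = [min(1.0, max(0.0, t)) for t in trial]  # assumed weight bounds
            # Selection, eq. (7): keep the trial if it is no worse than the target.
            trial_cost = f(trial)
            if trial_cost <= cost[i]:
                new_pop.append(trial)
                new_cost.append(trial_cost)
            else:
                new_pop.append(pop[i])
                new_cost.append(cost[i])
        pop, cost = new_pop, new_cost
    best = min(range(pop_size), key=lambda i: cost[i])
    return pop[best], cost[best]


# Hypothetical validation votes of three classifiers on five samples, two classes.
all_votes = [(0, 1, 1), (1, 1, 0), (0, 0, 1), (1, 0, 0), (0, 1, 1)]
labels = [0, 1, 0, 1, 0]
weights, err = de_optimize(lambda w: vote_error(w, all_votes, labels, 2), dim=3)
```

In this toy data the first classifier is always right, so equal weights misclassify three of the five samples while any weight vector that lets the first classifier outvote the other two classifies all of them correctly.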
TABLE 3. Average accuracy, precision, and recall of validation and test phase for land layer.
FIGURE 8. Average digging capacity of both soil color and land layer.
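The per-category average behind this figure, eq. (15), amounts to a grouped mean of depth gained per instance. The sketch below assumes that each record is one drilling instance with a starting and an ending depth; the function name, record layout, and numbers are hypothetical.

```python
from collections import defaultdict


def digging_capacity(records):
    """Eq. (15): mean of (ending depth - starting depth) per soil color or layer.

    records: iterable of (category, starting_depth, ending_depth) tuples.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for category, start_depth, end_depth in records:
        total[category] += end_depth - start_depth
        count[category] += 1
    return {category: total[category] / count[category] for category in total}


# Hypothetical drilling log: (soil color, starting depth, ending depth) in meters.
log = [
    ("dark brown", 0.0, 4.0),
    ("dark brown", 4.0, 9.0),
    ("partridge", 9.0, 11.0),
]
capacity = digging_capacity(log)
```

A lower value marks a harder soil color or layer, which is how the text reads the figure.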
To sum up the knowledge of all the candidate classifiers, the patterns illustrated in Figures 9a and 9b are extracted. The sequence of predictions, or the flow of the trained model, is visualized. The sequence shows a separate line for each class label. Numeric values are normalized, and categorical values remain the same. The figure shows some patterns between independent features. When the value of Korea startup changes, the value of the layer also changes in most cases. A color is assigned to each correct and incorrect prediction, as shown in the right bar of the figures.

C. REGRESSION RESULTS
The prediction of water level and depth is based on region, altitude, land layer, days spent, and soil color. The borehole depth is highly dependent on the days spent on the process and the hardness level of both soil color and land layer. Similarly, different land layers and soil color patterns lead to different groundwater levels. The results of the prediction of water level and drilling depth are shown in Tables 4 and 5, respectively. The results of the proposed chained multi-out regression model are compared with conventional ML models. The depth of the borehole varies at different locations according to the level of groundwater. Similarly, the pattern of soil color and land layer also plays an important role in the prediction of depth. The results show the effectiveness of the proposed chained model in terms of MAE, MSE, and R2. The proposed method consists of two regression models trained in sequence. Because of the sequential method, the previously trained model passes its knowledge to the next model: the first model predicts one dependent variable, and the next model uses that predicted variable as an independent variable when it is trained to predict the other variable. Because of this chained strategy, the proposed method performs well compared to other conventional ML techniques. Figure 10 shows the prediction results for the training and testing phases for drilling depth, and Figure 11 shows the prediction results for water level.

The proposed model is based on two chained models to predict two different target variables simultaneously. By predicting the level of water and depth in different regions,
FIGURE 10. Results of multiout regression model for the prediction of next depth.
FIGURE 11. Results of multiout regression model for the prediction of water level.
the early estimation can be done by the drilling industry. Moreover, the prediction of soil color and land layer and their hardness level helps the drilling industry manage the required resources. The selection of the area depends on the preferences of the drilling industry. If the industry has heavy resources, then the hard layer can be compromised, and areas with low depth can be considered. On the other hand, if the labor cost is not a problem and the drilling equipment is not strong enough to drill the hard layer, then an area with a soft layer should be selected. Therefore, the classification and regression results help the industry, while the selection of the region is highly dependent on the preference of the drilling industry and their method of drilling.
TABLE 4. Comparison of proposed chained multiout regressor and traditional ML regressor in terms of MAE, MSE, and R2 for groundwater level prediction.
TABLE 5. Comparison of proposed chained multiout regressor and traditional ML regressor in terms of MAE, MSE, and R2 for borehole depth prediction.
V. CONCLUSION
Large volumes of data are generated by scientific experiments and other sources. Researchers have applied multiple clustering and classification techniques to extract patterns from the data and make it useful. Different ML techniques have also been enhanced to get more accurate results from the raw data. The proposed work presents an optimization-based weighted voting approach to find the hidden patterns in the data. The data is obtained from JNU, Republic of Korea. In the first phase, the data is preprocessed, and some features are extracted, like total depth and number of days spent. Moreover, the data is encoded using an ordinal encoder so that mathematical classification algorithms can be applied. For fair voting, a population-based optimization algorithm is used to find the optimal weights based on the performance of the member classifiers. A weight is assigned to each classifier, and the target variable is assigned to each test data sample. The proposed method is evaluated based on accuracy, precision, and recall. The experiments show that the proposed model outperforms the existing voting classifier and state-of-the-art ML algorithms. The prediction of soil color and land layer in different areas, together with the predicted depth, allows the drilling industry to estimate the time and labor cost early. However, some factors should be considered before starting
the actual drilling process, together with soil color, land layer, water level, and depth of the drilling point. The study uses a weighted voting classifier to predict the soil color and land layer at different locations. At the same time, the hardness and softness of the underground layer are computed to select the related equipment for drilling. A chained multi-out regression model based on SVR is also proposed to predict water level and drilling depth at various locations simultaneously. Two trained models are chained so that the first model injects its knowledge into the second model to make the predictions more accurate. The proposed model is compared with conventional ML regressors, including DT-R, KNN-R, and LR. The results are compared in terms of MAE, MSE, and R2 score. The results show the significance and effectiveness of the chained regression model.
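The chaining idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: a small k-nearest-neighbour regressor stands in for the SVR base model so that the example needs no external libraries, and the names `knn_predict` and `ChainedRegressor`, the feature columns, and all numbers are assumptions made for illustration only. The first stage predicts drilling depth, and the second stage receives that prediction as an extra input when predicting water level. MAE, MSE, and R2 are computed from their standard definitions.

```python
# Toy sketch of a two-stage regression chain. A k-nearest-neighbour
# regressor stands in for SVR; every value below is made up.

def knn_predict(train_X, train_y, query, k):
    """Average the targets of the k training rows closest to `query`."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


class ChainedRegressor:
    """Stage 1 predicts depth; stage 2 uses that prediction as an extra feature."""

    def __init__(self, k=2):
        self.k = k

    def _stage1(self, X):
        return [knn_predict(self.X1, self.depth, row, self.k) for row in X]

    def fit(self, X, depth, water):
        self.X1, self.depth, self.water = X, depth, water
        # Append the stage-1 (in-sample) depth predictions as a new column.
        self.X2 = [row + [d] for row, d in zip(X, self._stage1(X))]
        return self

    def predict(self, X):
        depth_hat = self._stage1(X)
        water_hat = [
            knn_predict(self.X2, self.water, row + [d], self.k)
            for row, d in zip(X, depth_hat)
        ]
        return depth_hat, water_hat


def mae(y, p):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)


def mse(y, p):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, p)) / len(y)


def r2(y, p):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical rows: [altitude_m, days_spent]
X_train = [[10.0, 2.0], [25.0, 4.0], [40.0, 5.0], [55.0, 7.0], [70.0, 9.0]]
depth_train = [30.0, 52.0, 61.0, 85.0, 110.0]
water_train = [8.0, 12.0, 15.0, 19.0, 26.0]
X_test = [[20.0, 3.0], [60.0, 8.0]]
depth_test = [45.0, 95.0]
water_test = [11.0, 22.0]

model = ChainedRegressor(k=2).fit(X_train, depth_train, water_train)
depth_hat, water_hat = model.predict(X_test)
for name, y, p in [("depth", depth_test, depth_hat), ("water", water_test, water_hat)]:
    print(name, "MAE:", mae(y, p), "MSE:", mse(y, p), "R2:", r2(y, p))
```

In this sketch the second stage is trained on in-sample predictions of the first stage for brevity. A more careful setup would probably use held-out predictions, so that the second model sees the same kind of error at training time as at prediction time.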
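The weighted voting step described in the conclusion can be sketched in a similar way. The code below is a simplified illustration under stated assumptions, not the paper's method: it learns one weight per classifier rather than one per classifier and class, it uses a basic DE/rand/1/bin scheme, and the control parameters (`pop_size`, `generations`, `F`, `CR`), the function names, and the toy votes are all hypothetical. The fitness of a weight vector is the accuracy of the resulting weighted vote on labelled validation samples.

```python
import random


def weighted_vote(weights, votes):
    """Return the label with the largest total weight among the classifier votes."""
    score = {}
    for w, label in zip(weights, votes):
        score[label] = score.get(label, 0.0) + w
    # Ties are broken by label order so the result is deterministic.
    return max(sorted(score), key=lambda label: score[label])


def accuracy(weights, all_votes, truth):
    """Fraction of samples whose weighted vote matches the true label."""
    hits = sum(weighted_vote(weights, v) == t for v, t in zip(all_votes, truth))
    return hits / len(truth)


def de_optimize(all_votes, truth, pop_size=12, generations=40, F=0.6, CR=0.9, seed=0):
    """Search for classifier weights in [0, 1] with a basic DE/rand/1/bin loop."""
    rng = random.Random(seed)
    n = len(all_votes[0])
    # The population starts with equal weights plus random candidates.
    pop = [[1.0] * n] + [[rng.random() for _ in range(n)] for _ in range(pop_size - 1)]
    fit = [accuracy(p, all_votes, truth) for p in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(n)
            trial = []
            for j in range(n):
                if rng.random() < CR or j == j_rand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])  # mutation
                    trial.append(min(1.0, max(0.0, v)))          # clip to [0, 1]
                else:
                    trial.append(pop[i][j])                      # inherit from parent
            f = accuracy(trial, all_votes, truth)
            if f >= fit[i]:  # greedy selection keeps the better vector
                pop[i], fit[i] = trial, f
    best = max(range(pop_size), key=lambda i: fit[i])
    return pop[best], fit[best]


# Hypothetical validation votes from three classifiers (one tuple per sample).
truth = ["soft", "hard", "hard", "soft", "hard", "soft", "hard", "soft"]
votes = [
    ("soft", "soft", "hard"),
    ("soft", "hard", "soft"),
    ("hard", "hard", "soft"),
    ("hard", "soft", "hard"),
    ("soft", "hard", "hard"),
    ("soft", "soft", "soft"),
    ("soft", "hard", "soft"),
    ("hard", "soft", "soft"),
]

best_weights, best_acc = de_optimize(votes, truth)
print("equal-weight accuracy:", accuracy([1.0, 1.0, 1.0], votes, truth))
print("DE weights:", best_weights, "accuracy:", best_acc)
```

Because selection is greedy and the equal-weight vector is part of the initial population, the returned accuracy on the validation votes cannot be lower than that of plain majority voting. Whether the gain carries over to unseen data has to be checked on a separate test set.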
ACKNOWLEDGMENT
Any correspondence related to this paper should be addressed to Do Hyeun Kim.
REFERENCES management,’’ IEEE Access, vol. 9, pp. 96092–96113, 2021.
[1] M. Z. Lukawski, B. J. Anderson, C. Augustine, L. E. Capuano, [24] B. Mantha and R. Samuel, ‘‘ROP optimization using artificial intelligence
K. F. Beckers, B. Livesay, and J. W. Tester, ‘‘Cost analysis of oil, gas, techniques with statistical regression coupling,’’ in Proc. Day 3 Wed,
and geothermal well drilling,’’ J. Petroleum Sci. Eng., vol. 118, pp. 1–14, Sep. 2016.
Jun. 2014. [25] M. A. Manap, H. Nampak, B. Pradhan, S. Lee, W. N. A. Sulaiman, and
[2] W. F. Prassl, J. M. Peden, and K. W. Wong, ‘‘A process-knowledge M. F. Ramli, ‘‘Application of probabilistic-based frequency ratio model
management approach for assessment and mitigation of drilling risks,’’ in groundwater potential mapping using remote sensing data and GIS,’’
J. Petroleum Sci. Eng., vol. 49, nos. 3–4, pp. 142–161, Arabian J. Geosci., vol. 7, no. 2, pp. 711–724, Feb. 2014.
Dec. 2005. [26] M. I. Sameen, B. Pradhan, and S. Lee, ‘‘Self-learning random forests model
[3] S. Paul Singh and P. Xavier, ‘‘Causes, impact and control of overbreak for mapping groundwater yield in data-scarce areas,’’ Natural Resour. Res.,
in underground excavations,’’ Tunnelling Underground Space Technol., vol. 28, no. 3, pp. 757–775, Jul. 2019.
vol. 20, no. 1, pp. 63–71, Jan. 2005. [27] C. Choi, J. Kim, H. Han, D. Han, and H. S. Kim, ‘‘Development of water
[4] L.-H. Luu, P. Philippe, G. Noury, J. Perrin, and O. Brivois, ‘‘Erosion level prediction models using machine learning in wetlands: A case study
of cohesive soil layers above underground conduits,’’ EPJ Web Conf., of Upo wetland in South Korea,’’ Water, vol. 12, no. 1, p. 93, 2020.
vol. 140, Oct. 2017, Art. no. 09038. [28] S.-S. Baek, J. Pyo, and J. A. Chun, ‘‘Prediction of water level and water
[5] D. Ruta and G. Gabrys, ‘‘Classifier selection for majority voting,’’ Inf. quality using a CNN-LSTM combined deep learning approach,’’ Water,
Fusion, vol. 6, pp. 63–81, Mar. 2005. vol. 12, no. 12, p. 3399, Dec. 2020.
[6] L. Rokach, ‘‘Ensemble-based classifiers,’’ Artif. Intell. Rev., vol. 33, [29] S. Sahoo, T. A. Russo, J. Elliott, and I. Foster, ‘‘Machine learning algo-
nos. 1–2, pp. 1–39, 2010. rithms for modeling groundwater level changes in agricultural regions of
the U.S.’’ Water Resour. Res., vol. 53, no. 5, pp. 3878–3895, May 2017.
[7] T. Dietterich, ‘‘Ensemble learning,’’ in The Handbook of Brain Theory and
[30] A. Al-AbdulJabbar, A. A. Mahmoud, and S. Elkatatny, ‘‘Artificial neural
Neural Networks, vol. 2, no. 1. Cambridge, MA, USA: MIT Press, 2002,
network model for real-time prediction of the rate of penetration while
pp. 110–125.
horizontally drilling natural gas-bearing sandstone formations,’’ Arabian
[8] Z.-H. Zhou, J. Wu, and W. Tang, ‘‘Ensembling neural networks: Many
J. Geosci., vol. 14, no. 2, pp. 1–14, 2021.
could be better than all,’’ Artif. Intell., vol. 137, no. 1, pp. 239–263, 2002.
[31] A. Al-Fugara, H. R. Pourghasemi, A. R. Al-Shabeeb, M. Habib,
[9] Y. Freund and R. E. Schapire, ‘‘A decision-theoretic generalization of on- R. Al-Adamat, H. Al-Amoush, and L. Adrian Collins, ‘‘A comparison of
line learning and an application to boosting,’’ J. Comput. Syst. Sci., vol. 55, machine learning models for the mapping of groundwater spring poten-
no. 1, pp. 119–139, Aug. 1997. tial,’’ Environ. Earth Sci., vol. 79, no. 10, p. 206, May 2020.
[10] L. Breiman, ‘‘Bagging predictors,’’ Mach. Learn., vol. 24, no. 2, [32] K. Kenda, M. Cerin, M. Bogataj, M. Senozetnik, K. Klemen, P. Pergar,
pp. 123–140, 1996. C. Laspidou, and D. Mladenic, ‘‘Groundwater modeling with machine
[11] D. H. Wolpert, ‘‘Stacked generalization,’’ Neural Netw., vol. 5, no. 2, learning techniques: Ljubljana polje aquifer,’’ Proceedings, vol. 2, no. 11,
pp. 241–259, 1992. p. 697, 2018.
[12] D. Perrone and S. Jasechko, ‘‘Deeper well drilling an unsustainable [33] J. Liu, J. Gu, H. Li, and K. H. Carlson, ‘‘Machine learning and transport
stopgap to groundwater depletion,’’ Nature Sustainability, vol. 2, no. 8, simulations for groundwater anomaly detection,’’ J. Comput. Appl. Math.,
pp. 773–782, Aug. 2019. vol. 380, Dec. 2020, Art. no. 112982.
[13] B. Shirmohammadi, M. Vafakhah, V. Moosavi, and A. Moghaddamnia, [34] A. A. Mahmoud, S. Elkatatny, and A. Al-AbdulJabbar, ‘‘Application
‘‘Application of several data-driven techniques for predicting groundwater of machine learning models for real-time prediction of the formation
level,’’ Water Resour. Manage., vol. 27, no. 2, pp. 419–432, Jan. 2013. lithology and Tops from the drilling parameters,’’ J. Petroleum Sci. Eng.,
[14] R. Ratolojanahary, R. Houé Ngouna, K. Medjaher, F. Dauriac, and vol. 203, Aug. 2021, Art. no. 108574.
M. Sebilo, ‘‘Groundwater quality assessment combining supervised and [35] L. Knoll, L. Breuer, and M. Bach, ‘‘Large scale prediction of groundwater
unsupervised methods,’’ IFAC-Papers Line, vol. 52, no. 10, pp. 340–345, nitrate concentrations from spatial data using machine learning,’’ Sci. Total
Jan. 2019. Environ., vol. 668, pp. 1317–1327, Jun. 2019.
[15] A. R. Bisson and H. J. Lehr, Modern Groundwater Exploration: Dis- [36] A.-N. Khan, N. Iqbal, A. Rizwan, R. Ahmad, and D.-H. Kim, ‘‘An ensem-
covering New Water Resources in Consolidated Rocks Using Innovative ble energy consumption forecasting model based on spatial-temporal clus-
Hydrogeologic Concepts, Exploration, Drilling, Aquifer Testing and Man- tering analysis in residential buildings,’’ Energies, vol. 14, no. 11, p. 3020,
agement Method. Hoboken, NJ, USA: Wiley, 2004. Jan. 2021.
ANAM NAWAZ KHAN received the B.S. and M.S. degrees in computer science from COMSATS University Islamabad, Attock Campus, Pakistan, in 2016 and 2019, respectively. She is currently pursuing the Ph.D. degree with the Department of Computer Engineering, Jeju National University, Republic of Korea. Her research interests include machine learning applications in smart environments, analysis of prediction and optimization algorithms, big data, and IoT-based applications.
NAEEM IQBAL (Member, IEEE) received the M.S. degree in computer science from COMSATS University Islamabad, Attock Campus, Punjab, Pakistan, in 2019. He is currently pursuing the Ph.D. degree with the Department of Computer Engineering, Jeju National University, Republic of Korea. He has professional experience in the software development industry and in academia. He has published more than 20 papers in peer-reviewed international journals and conferences. He is serving as a professional reviewer for various well-reputed journals and conferences. His research interests include AI-based intelligent systems, data science, big data analytics, machine learning, deep learning, analysis of optimization algorithms, the IoT, and blockchain-based secured applications.
RASHID AHMAD received the B.S. degree from the University of Malakand, Pakistan, in 2007, the M.S. degree in computer science from the National University of Computer and Emerging Sciences (NUCES), Islamabad, Pakistan, in 2009, and the Ph.D. degree in computer engineering from Jeju National University, South Korea, in 2015. Currently, he is working as an Assistant Professor at COMSATS University Islamabad, Attock Campus. His research interests include the application of prediction and optimization algorithms to build IoT-based solutions, machine learning, data mining, and related applications.