Groundwater Prediction

Received November 25, 2021, accepted December 6, 2021, date of publication December 8, 2021, date of current version December 29, 2021.

Enhanced Optimization-Based Voting Classifier and Chained Multi-Objective Regressor for
Effective Groundwater Resource Management
1 Department of Computer Engineering, Jeju National University, Jeju-si, Jeju-do 63243, Republic of Korea
2 Department of Computer Science, COMSATS University Islamabad, Attock Campus, Attock 43600, Pakistan
3 BigdataResearch Center, Jeju National University, Jeju-si, Jeju-do 63243, Republic of Korea
4 Department of Computer Engineering and Advanced Technology Research Institute, Jeju National University, Jeju-si, Jeju-do 63243, Republic of Korea

Corresponding author: Do Hyeun Kim (

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the
Korea government (MSIT) (2021-0-00188, Open source development and standardization for AI enabled IoT platforms and interworking),
and by the Energy Cloud R&D Program through the National Research Foundation of Korea (NRF) funded by the
Ministry of Science, ICT (2019M3F2A1073387). Any correspondence related to this paper should be addressed to DoHyeun Kim.

ABSTRACT Water is essential to every living thing, and drilling is the only way to obtain water from
underground. Various advanced technologies have been used to reduce the time and labor that drilling
requires. Besides the technology used, other factors are equally important, such as the water level, the
hardness of the land, and the number of days spent on the whole process. This study proposes a weighted
voting classifier based on Differential Evolution (DE) to classify regions by soil color and land layer.
Weights are assigned to the candidate classifiers based on their performance for each class, and the DE
optimization algorithm is used to find the optimal weights. Moreover, the study presents a chained
multi-objective regression model that simultaneously predicts the water level and total depth at different
locations. The proposed work helps the drilling industry increase the rate of penetration (ROP) by selecting
regions with soft soil and land layers. Predicting depth and water level allows the industry to estimate
water levels in different areas at different depths. The dataset, provided by a research organization,
contains information on different drilling points. The results of the proposed weighted voting classifier
are compared with traditional machine learning models (kernel Naive Bayes, Gaussian SVM, Quadratic SVM, and
Bilayered Neural Network) and a state-of-the-art voting classifier in terms of precision, recall, and
accuracy. The proposed regression model is evaluated with well-known metrics, including Mean Absolute
Error, Mean Square Error, and R2 score. Finally, the comparison verifies the effectiveness of the enhanced
optimization-based classifier and multi-objective regressor.

INDEX TERMS Weighted voting classifier, multi-objective regression, groundwater level, planning and risk
assessment, water resource management.

The associate editor coordinating the review of this manuscript and approving it for publication was Yiqi Liu.

I. INTRODUCTION
Efficient and robust models are required for city construction and drilling-related problems, because
drilling demands considerable labor and time. The process is also expensive in terms of cost and the
equipment to be used [1]. So, there should be a defined procedure to follow once the drilling process is
started. Furthermore, the type of land in different areas and the layers of the land should be known to
reduce risks such as stuck pipe, formation fracturing, and lost circulation [2]. There are multiple soil
colors on the different layers of the land. These soil colors play an important role in the process once
we know the pattern of soil colors that leads to the water [3]. Moreover, the number of days and the
maximum depth should also be known to plan the project and reach the water. By taking into account all these

A. Rizwan et al.: Enhanced Optimization-Based Voting Classifier and Chained Multi-Objective Regressor

problems in city construction, the proposed study predicts the soil color and layer in different areas and
estimates the water level and depth.
Drilling is an important, and the only, way to obtain water from underground. The water has different
underground levels, and the different layers of the land lead us to the water [4]. In this study, the soil
color and land layer are predicted using the proposed DE-based weighted voting classifier, and the results
are compared with some state-of-the-art ML algorithms and an ensemble voting classifier [5]. In ensemble
techniques, multiple classifiers are trained, and the knowledge of all classifiers is used to make the
final decision [6]. Compared to an individual classifier, a combination of multiple classifiers can be
more effective [7]. To get a good ensemble classifier, the candidate classifiers should be as strong as
possible. The selection of the best member classifier is a concern of many researchers [8]. There are
multiple ensemble techniques for classification problems, including bagging, boosting, and stacking
[9]–[11]. For multi-class problems, the combination of multiple classifiers can be more effective. The
drilling dataset has ten different soil colors and eight rock layers of the land. The dataset is provided
by a research organization and contains each borehole's information at a certain location. Most of the
features in the dataset are continuous numeric values, and the other features are encoded as numeric
values to make the calculations easy.
The dataset contains different soil colors and land layers that lead us to the water. The prediction of
soil color is important to reduce the risks involved in the drilling process. The DE-based optimization
algorithm is used to select the best candidates and assign a weight to each classifier based on its
performance. The voting strategy merges the knowledge of all the member classifiers into one ensemble
model. The ensemble model is assumed to be better than the individual classifiers. The soil color has ten
different values because there are multiple colors of soil at different depths. The hardness and softness
of the soil color are also essential for the cost estimation of the whole process before starting
drilling. Like soil color, the land layer is another attribute representing the underground layers of the
land at different depths. There are eight different land layers in the data: Landfill layer, Sedimentary
layer, Weathered soil layer, Weathered rock layer, Soft rock layer, Gyeongam Formation, Ordinary rock
formation, and Burlap soil layer. All these layers have a different level of hardness and average digging
depth. The hardness and softness of the soil color and land layer are computed using the depth and the
number of days spent to achieve that depth.
For the prediction of the water level and total depth of the borehole, multi-regression models are used.
The multi-regression models predict two values at the same time based on independent input variables. In
the chaining mechanism, two trained models are combined to predict two targets simultaneously. The results
of the proposed chained multi-out regression model based on Support Vector Regressor (SVR) are compared
with Decision Tree Regressor (DT-R), K-Nearest Neighbor Regressor (KNN-R), and Linear Regressor (LR). The
results are compared in terms of Mean Absolute Error (MAE), Mean Square Error (MSE), and R2 score.
The core contributions of the proposed study are listed below.
• Different features are extracted from the dataset, and the hardness level of the underground layers is
computed
• The study proposes a weighted voting classifier based on the Differential Evolution optimization
algorithm
• The weights of the candidate classifiers are learned by DE based on the performance of each classifier
• A chained multi-objective regression model is presented to predict the water level and drilling depth in
different regions
• The study helps the drilling industry select regions with a soft layer and an acceptable range of water
level and depth

The rest of the paper is structured as follows: Section II discusses the models and techniques used to
facilitate the drilling industry; Section III describes the proposed weighted voting classifier for
classification and the chained multi-output model for the prediction of water level and drilling depth.
Section IV presents the results of the proposed models and compares them with existing state-of-the-art ML
techniques to show the effectiveness of the proposed model. Finally, Section V concludes the contributions
with some possible future directions.

II. RELATED WORK
Groundwater is a complex and, at the same time, fragile resource that is vital to domestic, economic,
agricultural, and industrial activities [12]. Moreover, it is a vital source of safe drinking water
worldwide and plays a critical role in natural ecosystems. There are various ways in which groundwater
exploration can be done. One of the widely used groundwater exploration and development techniques is
drilling. However, drilling is not confined to just boring a hole in the ground and acquiring water from
it [13]. Groundwater drilling is becoming a multi-billion industry due to the growing demand for
groundwater acquisition through downhole drilling operations. The push to meet massive water demand and
provide safe drinking water has intensified drilling operations for groundwater resource exploitation.
However, groundwater acquisition is not as simple as it looks [14]. Before drilling, many preliminary
surveys and analyses need to be conducted so that the groundwater characteristics can be estimated
beforehand. A particular geological setting, hydraulic continuity, groundwater depth, quality, and
quantity, along with geological layer analysis, are substantial parts of groundwater characteristics.
The drilling process for groundwater acquisition can simply be defined in four major steps. The first is a
site visit by a hydrogeological expert to assess the geological characteristics so that the risk of
drilling into natural hazards


can be avoided [15]. The second phase involves drilling and construction of boreholes, followed by aquifer
testing to determine water borehole yield. Lastly, the hydrogeologists ascertain the pumping and piping
system types based on the intended use of water resources. However, the drilling process is highly
resource-intensive and consumes a large budget. The cost of drilling is influenced by the type of ground,
groundwater, and borehole depth, the machinery involved, experts, human resources, and materials
required [16]. Due to advanced drilling and pumping techniques, the reliance on groundwater has increased
manifold. Increased drilling operations all over the world are producing a massive amount of data. Hence,
data-driven modeling and the application of advanced data analytics tools to predict downhole
environmental aspects are deemed necessary. Geological and groundwater bore complexities are incurring
additional costs and wastage of scarce resources. Consequently, optimizing the drilling operations through
the application of machine learning techniques is essential to reduce resource wastage and increase
drilling productivity [17]. Prediction and classification of various hydrogeological factors can help
drilling companies avoid issues such as stuck pipe, cost overruns, and low water levels during the
drilling process in different areas.
Recently, many techniques have been applied to optimize the drilling process and achieve optimal drilling
parameters such as rate of penetration [18]–[20]. Predictive modeling and big data analytics for time
series drilling data have attracted huge interest from the scientific community. Researchers have
successfully implemented machine learning and artificial intelligence methods with a major focus on
reducing various parameters that are substantial for groundwater drilling, such as the non-productive and
invisible lost time during drilling [21]. For efficient groundwater acquisition, there is a need to
optimize drilling parameters such as the rate of penetration. Articles [22], [23] proposed a machine
learning model for predicting the rate of penetration based on data analytics using a real-time data set.
The authors consider seven input parameters, including surface parameters and others such as torque, stand
pipe pressure, and differential pressure. Experimental results reveal that the rate of penetration
increased by 14%. In article [24], the rate of penetration is predicted using statistical regression and
artificial neural networks. To find the correlation between the variables, the authors performed
exploratory data analysis. Furthermore, the importance of predictors is also computed. For prediction,
several models are employed, such as neural network, step-wise regression, classification and regression
techniques (CART), and K-nearest neighbors. The findings suggest that ensemble methods can improve the
performance of the system. Another study [25] proposed a probabilistic frequency-based ratio model for
mapping groundwater potential using eight factors related to topography, geology, and satellite imagery.
Moreover, the authors analyzed several relationships between the yield of groundwater and hydrogeological
factors. Results of the frequency model showed that the area under the curve is about 84.78 percent.
Article [26] proposed a random forest-based mapping scheme for groundwater yield. Experimental findings
suggested that the self-learning random forest achieved a 23 percent improvement in generalization
performance. Furthermore, the comparative analysis of the proposed approach with random forest, support
vector machine, artificial neural network, decision tree, and voted artificial neural network and random
forest verified the effectiveness of the proposed solution. Article [27] developed a prediction framework
based on machine learning algorithms for predicting the water depth. The input variables considered for
the study are meteorological data and historical and upstream water level gauge data from 2009 to 2015.
Artificial Neural Network, Random Forest, Decision Tree, and Support Vector Machine are trained to predict
the water level. Experimental findings suggest that Random Forest achieved a root mean square error (RMSE)
of 0.09 percent. The study [28] proposed a deep learning framework for predicting the water quality and
depth. The deep learning framework comprises a CNN deployed for water level simulation and an LSTM for
water level prediction. The results confirm the effectiveness of the proposed solution for accurately
simulating water quality and quantity.
Article [29] proposed a modeling approach for identifying changes in groundwater levels using machine
learning models. The study applied an ensemble modeling scheme using spectral analysis. Experimental
results showed that ensemble learning can be used as an alternative approach for the simulation of
groundwater changes and can better ascertain water availability in regions with subsurface properties.
The ROP is treated as a prediction problem and solved using different optimization techniques [30]. ANN is
used to predict the ROP by optimizing the parameters. The method was applied to vertical and horizontal
drilling for oil wells. Furthermore, the authors established the relation between irrigation demand and
its impact on change in groundwater level. Another study [31] presented a comparative analysis of machine
learning models for mapping groundwater potential through prediction. The models involved in the research
are mixture discriminant analysis (MDA), random forest (RF), multivariate adaptive regression spline
(MARS), and boosted regression tree. The study analyzed various hydrogeological and physiographical
factors along with their respective spatial distribution. Results verify the effectiveness of the MDA
model for mapping groundwater potential. Groundwater analysis related to prediction approaches using
machine learning methods has been conducted in article [32]. The findings suggest that data-driven
modeling approaches are highly effective in predictive modeling and decision making; however, there is a
need to improve the accuracy of these systems. Article [33] proposed an SVM-based model to detect
anomalies in groundwater using real-time data. Another study [34] presented a machine learning-based
solution for predicting lithology formation based on drilling parameters. Results show excellent


abilities of the ANN model for the identification of formation lithology. Article [35] employed RF to
predict concentrations of nitrate in groundwater. The findings of the study suggest the use of spatial
predictors for predicting nitrate concentrations.
Although multiple ML techniques have been applied in different application areas [36], [37] and enhanced
versions have been proposed [38], many researchers are concerned with hybrid or ensemble techniques [8].
Majority voting is one of the popular and widely used ensemble techniques. It is simply a decision rule
that decides the label of a sample by combining the knowledge of all candidate classifiers. A simple
voting classifier does not require parameter tuning [39], [40]. Selecting the best classifiers and
assigning weights based on their performance for a particular class is a crucial issue [41]. The weight
assignment can be seen as an optimization problem and can be solved using different optimization
algorithms such as Genetic Algorithm, Particle Swarm Optimization (PSO), and DE.

FIGURE 1. Sequence of one drilling point.
III. METHODOLOGY
The dataset is provided by Jeju National University, Republic of Korea. The data contains multiple
attributes related to several drilling points and reflects the condition of each borehole. The total depth
of the drilling point, the location coordinates, and the groundwater level are given in the dataset.
Different drilling points take a different number of days to complete, so the data contains n rows for one
drilling point, where n is the number of days spent. Each sample gives the starting and ending depth for
one day, together with the soil color and land layer found on that day. Further attributes are listed in
Table 1.
First, the data is preprocessed and some features are extracted from the existing ones. In the raw
dataset, each drilling point contains multiple records, where each record presents information related to
that drilling point. Figure 1 shows the information of one drilling point with six records. Multiple land
layers and soil colors are found during the process of drilling. Similarly, multiple days are spent on
each drilling point and different depths are reached, as illustrated in Figure 1.

A. FEATURE PREPROCESSING AND FEATURE EXTRACTION
To apply the mathematical models to the extracted samples, the data must be in numeric form. To achieve
this, an ordinal encoder is used to convert the categorical values into numbers. The ordinal encoder
assigns a unique number to each category of the feature [42]. For instance, there are ten different soil
colors, so the ordinal encoder assigns the numbers 1-10 to the soil colors. Some features are dropped from
the dataset because they show no pattern with respect to the target attribute. For example, borehole code
and drilling resonance have unique values for each sample, so these attributes are removed from the
dataset, because unique attributes have no pattern in terms of the target feature.
Next, some features are extracted from the data to improve the performance of the model. Existing features
such as starting depth and ending depth are used to extract new features from the data.
The first feature extracted from the data is the total depth of each drilling point. The total depth is
the depth at which the water is found. The total depth of a drilling point is reached over multiple days.
Eq. (1) is used to extract the total depth using the starting and ending depths.

TD = \sum_{i=1}^{n} \left( E_{Depth}^{(i)} - S_{Depth}^{(i)} \right) \qquad (1)

where E_{Depth} is the ending depth, S_{Depth} is the starting depth, and n is the number of instances of a
single borehole. A different number of days is spent on each drilling point. To calculate the number of
days, the instances of each point are counted, and the following query is used to extract the number of
days from the database.

    SELECT COUNT(*) AS `Num of days` FROM `drilling data` GROUP BY location;

After the extraction of features, the feature set is complete and is passed to the classification model.
The total depth is thus a feature computed from the existing ones. The total depth of each borehole, along
with its x and y position, is shown in 3D space in Figure 2.

B. CLASSIFICATION
1) TRADITIONAL MACHINE LEARNING MODELS
Multiple machine learning techniques are used to classify real-world data into various classes. In the
first phase of the experiments, data is passed to the classification model to predict the land layer and
soil color. Figure 3 shows the complete classification model to predict the soil color and the land layer.
Initially, the raw data is passed to the preprocessing


step and then the feature extraction step, as discussed in the previous subsection.

TABLE 1. Detail feature description.

FIGURE 2. 3D visualization of each drilling point.

The processed data is passed to the classification module along with the target classes. Soil color and
land layer are used as target labels one at a time. Among the traditional machine learning models, naïve
Bayes with a kernel-based estimator [43] is used to estimate the probability of the target class. The
general form of the kernel-based estimator is given in eq. (2).

\hat{f}(x, M) = n^{-1} \sum_{i=1}^{n} K_M\left(x - x^{(i)}\right) \qquad (2)

where M is the smoothing matrix, x is a sample from the dataset, and n is the total number of cases from
which the estimator is learned. For the supervised learning approach, the data is in the form
(x^{(1)}, y^{(1)}), ..., (x^{(n)}, y^{(n)}) with n samples. The kernel estimator used here is given in
eq. (3).

\hat{I}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{\hat{f}(x^{(i)}, y^{(i)})}{\hat{f}(x^{(i)}) \, \hat{f}(y^{(i)})} \qquad (3)

where the function \hat{f} represents the density-based kernel estimator given in eq. (2). The purpose of
selecting kernel-based naïve Bayes is to obtain better results than the state-of-the-art naïve Bayes
classifier [44].
Support vector classifier is one of the famous classifiers due to its powerful kernel tricks. SVM not only
finds a separating line but also sets the best hyperplane between the classes. The kernel tricks of the
SVM transform the data into a kernelized feature space where the data is more likely to be linearly
separable. To transform the data from the original to the kernelized feature space, quadratic and Gaussian
kernels are used. The Gaussian RBF kernel is well known and most commonly used to solve multi-class
nonlinear problems [45]. Initially, the data is transformed from the original feature space to the
kernelized feature space using the quadratic kernel, eq. (4), and the Gaussian kernel, eq. (5).

K(x, y) = (x^{T} y + c)^{d} \qquad (4)

where x and y are input samples, d is the degree of the polynomial (d = 2 for the quadratic kernel), and c
is the trade-off parameter, where c ≥ 0.
Similarly, the Gaussian kernel uses gamma to transform the data samples

K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right) \qquad (5)

where the parameter γ is used to scale the mapping; the Gaussian-transformed feature space is more
linearly separable.
Along with machine learning models, some deep learning techniques are also used to solve complex and
multiclass problems. In this study, a Bilayered Neural Network with the ReLU activation function is used
to predict the soil color and land layers from an input set of independent variables. The architecture of
the NN used in the experiment is shown in Figure 4. There are two fully connected layers involved in the
network, followed by hidden layers. In a fully connected layer, all the inputs from one layer are
connected to the activation units of the next layer. The output of the Bilayered neural network is the
optimal weights used to predict the target class in the validation and testing steps. Finally, all the
selected drilling characteristics are passed to the NN to predict the soil color and land layer.
SVC is selected because the kernel tricks of SVM transform the data into a kernelized feature space, while
kernelized NB is suitable for the multiclass problem. DT is one of the famous entropy-based classification
algorithms and is commonly used in multiclass problems. The working strategy of KNN is based on the
neighbor's value.


The soil color and land layer have different patterns and are somehow related, so the neighbor information
is useful. Considering this, KNN is selected as one of the candidates for voting.
In the first step of the experiments, raw data is passed to the preprocessing phase before being passed on
for classification. Then, feature importance is computed, and features are selected to impute the missing
values. Instead of removing missing values from the data, we imputed them using the KNN imputation
technique.
In the ensemble model, the voting classifier is used to classify the data into different classes. Voting
itself is not a classifier but a technique to combine the results of multiple classifiers. The wrapper of
a voting classifier consists of multiple trained classifiers and assigns the target label to each sample
based on the number of votes. The same data samples are passed to all candidate classifiers to train them,
and the classifiers are ensembled to predict the final output. An ensemble voting classifier is well
suited to multi-class problems [46]. We have selected four different classifiers for the voting wrapper:
Gaussian Support Vector classifier, K-Nearest Neighbor classifier, Kernel Gaussian NB, and Decision Tree.
The voting strategy merges the knowledge of all involved candidates and decides the label for each testing
sample. The conceptual model of the voting classifier is shown in Figure 5. In the traditional voting
classifier, however, each candidate classifier contributes equally to the selection of the target
variable. In some situations, one or two candidate classifiers may be less confident than the others in
assigning a target to a sample; in this case, the wrong class can be assigned.
Considering this limitation of the traditional classifier, we propose a weighted voting classifier in
which DE is used to select the optimal weight for each classifier so that the best target variable is
selected for each sample. The problem is to assign optimal weights to each classifier based on its
confidence for each class. DE is a population-based optimization technique widely used for multiple search
problems in the literature. DE first creates a population of N individuals, each a D-dimensional vector
X_{i,G} = [x_{i,G}(1), x_{i,G}(2), ..., x_{i,G}(j), ..., x_{i,G}(D)]. DE performs three main steps:
mutation, crossover, and selection.
Mutation: Mutation is the first step of the DE optimization algorithm; it generates the donor vector
V_{i,G} for each target vector X_{i,G} of the current generation.
Crossover: Crossover generates the trial vector T_{i,G} = [t_{i,G}(1), t_{i,G}(2), ..., t_{i,G}(j), ...,
t_{i,G}(D)] by performing the crossover operation between the target vector and the corresponding donor
vector. For each variable j of the D-dimensional vector:

t_{i,G}(j) = \begin{cases} v_{i,G}(j), & \text{if } rand_{i,j}[0,1] < CR \\ x_{i,G}(j), & \text{otherwise} \end{cases} \qquad (6)

where CR represents the crossover rate in the range [0, 1], and rand_{i,j}[0,1] is a random number drawn
from a uniform distribution for each jth value. The random value ensures that the trial vector T_{i,G}
gets at least one value from the mutant vector V_{i,G}.
Selection: Selection compares each trial vector with its target individual and decides which of the two
survives into the next generation. The trial replaces the target if its fitness is at least as good;
otherwise, the target is retained. The operation of selection can be presented as

X_{i,G+1} = \begin{cases} T_{i,G}, & \text{if } f(T_{i,G}) \le f(X_{i,G}) \\ X_{i,G}, & \text{otherwise} \end{cases} \qquad (7)

where f() is the function to be optimized, which ensures that the selected member is the best for the
individual.
The described technique is used in the selection of optimal weights for the classifiers; however, multiple
other crossover and mutation techniques for DE have been proposed [47]–[50]. Algorithm 1 shows the flow of
the selection of the best weights using DE from the given search space. The DE algorithm returns the best
weight vector (w_1, ..., w_n) for the member classifiers.

Algorithm 1 DE Based Proposed Weighted Voting
  Input: DE control parameters: population size N, mutation factor F, and crossover rate CR
  Output: Optimal weights (w_1, ..., w_n)
  Initialization: population of N individuals X_G = {X_{1,G}, X_{2,G}, ..., X_{N,G}}  // uniformly distributed
    where X_{i,G} = [x_{i,G}(1), x_{i,G}(2), ..., x_{i,G}(D)] represents the weights (w_1, ..., w_n) of the classifiers
  G ← 0  // generation iteration
  while stopping criteria not satisfied do
    for i = 0; i < N; i++ do
      Select distinct indices r1, r2, r3  /* should be different from i */
      V_{i,G} = X_{r1,G} + F × (X_{r2,G} − X_{r3,G})  // compute mutant vector
      j_rand ← random()  // random value
      for j = 0; j < D; j++ do
        T_{i,G} ← using eq. (6)
      X_{i,G+1} ← fitness of T_{i,G} and X_{i,G} using eq. (7)
    G ← G + 1  // increase termination parameter

The fitness evaluation of the model is based on the accuracy of the member classifiers. For c categories
in the target attribute and D classifiers that vote, the predicted value V_p of the proposed weighted
voting for sample k is

V_p = \arg\max_{j} \sum_{i=1}^{D} (\vartheta_{ij} \times w_i) \qquad (8)

where ϑ_{ij} is the binary decision variable; ϑ_{ij} = 1 if the jth category is assigned to sample k by
the ith classifier and


ϑ_{ij} = 0 otherwise. w_i represents the weight of the ith classifier and is optimized by Algorithm 1.

FIGURE 3. Data preprocessing, feature extraction, state-of-the-art ML techniques, and operational flow of the enhanced optimization-based voting classifier.

FIGURE 4. Architecture of Bilayered NN.

C. REGRESSION
1) REGRESSION MODELS
Regression analysis is a type of prediction in which the relationship between dependent and independent
variables is investigated. Moreover, these techniques are used in time series prediction, forecasting, and
determining the causal relationship between variables. Multiple regression techniques are available to
determine or predict the next value of a dependent variable. These models fit a curve to the given data.
Linear regression, K-nearest neighbors regression, decision tree regression, and support vector regression
are well-known and widely used regression models. We use the traditional regression models to predict
water level and drilling depth and compare their results with the proposed SVR-based chained
multi-objective regression model in terms of MAE, MSE, and R2 score.
In the proposed chained multi-out regression model, the support vector regressor is selected to build a
chain of models, one for each target attribute. The single chained model is able to predict both target
attributes at one time, as shown in Figure 6. Because SVR shows better results than the other ML models,
it is selected as the candidate for the proposed technique. In the multi-out regression model, the SVR
model is first trained on one target attribute, and then the prediction of the first model along with the
training data is passed to the second model to predict the other attribute. The chained model
automatically splits the target attributes, trains both models, and chains the trained models to predict
multiple variables simultaneously.
The prediction of the water level and depth starts with the separation of the target variables from the
independent features, as shown in Figure 7. After the separation, the stand-alone regression models are
applied to predict the water level and total depth separately. After that, the SVR-based multi-objective
chained model is applied to predict both variables

A. Rizwan et al.: Enhanced Optimization-Based Voting Classifier and Chained Multi-Objective Regressor

FIGURE 7. Multiout regression model to predict depth and water level

FIGURE 5. Structure of proposed DE-based weighted voting classifier.

used eq. (10).

Recall = (10)
where TP refers to True positive and FN is false-negative rate.
Precision is a little bit change to recall; it reflects the infor-
mation related to the true positive class. The prediction of
true positive samples from all true positive and false-positive
samples eq. (11).

Precision = (11)
As we have multiple classes of soil color and land layer,
so the prediction value is based on the mean of all classes.
FIGURE 6. Multiout Regression model based on SVR. To evaluate the performance of the chained multi-out
regression model, MAE, MSE, and R2 are used. MAE shows
the absolute difference between actual and predicted water
simultaneously. K-Fold cross-validation is used to train the level and drilling depth. The eq. (12) shows how the MAE is
model. Finally, existing and proposed models are evaluated computed from the actual and predicted value.
using MAE, MSE, and R2 score.
|yi − xi |
IV. RESULTS MAE = i=1 (12)
To evaluate the performance of the proposed classification The mean squared deviation (MSD) or mean square
model and for a fair comparison, accuracy, precision, and error (MSE) squares the error between actual and predicted
recall are used as evaluation metrics. These metrics are widely values.
used for the evaluation of classification models [51]. The n
accuracy metric refers to how much the measurements are MSE = (13)
close to the true or accepted value eq. (9). n

TP + TN The R2 coefficient i different from all other error metrics,

Ac = (9) instead of error it shows the accuracy of the model. the
higher value of R2 shows better performance of the model.
where TP is true positive, TN is true negative and N is total The eq. (14)
number of samples. Accuracy is easier to understand that how
positively the model is performing on the data. To check the Unexplained Variation
R2 = 1 − (14)
positive class prediction from all positive samples, recall is Total Variation
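As an illustration of the chaining idea and of eqs. (12)-(14), the following minimal Python sketch trains two regressors in sequence and scores the result. It is not the authors' implementation: a 1-nearest-neighbour regressor stands in for SVR so that the example needs no external library, and the names `knn_predict` and `ChainedRegressor` as well as the toy feature and target values are assumptions made for this example.

```python
def knn_predict(train_X, train_y, query, k=1):
    """Average the targets of the k training rows closest to `query`."""
    ranked = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    nearest = ranked[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


class ChainedRegressor:
    """Two-stage chain: stage 1 predicts target 1, stage 2 also sees that prediction."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, X, y_first, y_second):
        self.X1 = [list(row) for row in X]
        self.y1 = list(y_first)
        # Stage 2 is trained on the original features plus stage 1's prediction.
        first_pred = [knn_predict(self.X1, self.y1, row, self.k) for row in self.X1]
        self.X2 = [row + [p] for row, p in zip(self.X1, first_pred)]
        self.y2 = list(y_second)
        return self

    def predict(self, X):
        results = []
        for row in X:
            row = list(row)
            p1 = knn_predict(self.X1, self.y1, row, self.k)
            p2 = knn_predict(self.X2, self.y2, row + [p1], self.k)
            results.append((p1, p2))
        return results


def mae(actual, predicted):
    """Mean absolute error, eq. (12)."""
    return sum(abs(y - x) for y, x in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error, eq. (13)."""
    return sum((y - x) ** 2 for y, x in zip(actual, predicted)) / len(actual)


def r2_score(actual, predicted):
    """R2 = 1 - unexplained variation / total variation, eq. (14)."""
    mean_actual = sum(actual) / len(actual)
    unexplained = sum((y - x) ** 2 for y, x in zip(actual, predicted))
    total = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - unexplained / total


# Toy data (invented for illustration): two features per row, two targets.
X = [[1.0, 10.0], [2.0, 12.0], [3.0, 15.0], [4.0, 11.0]]
water_level = [5.0, 6.0, 8.0, 7.0]
depth = [20.0, 25.0, 33.0, 30.0]

chain = ChainedRegressor(k=1).fit(X, water_level, depth)
predictions = chain.predict(X)
pred_depth = [p[1] for p in predictions]
print(mae(depth, pred_depth), mse(depth, pred_depth), r2_score(depth, pred_depth))
```

With k = 1 the chain reproduces its own training targets, which serves only as a sanity check of the wiring. Swapping `knn_predict` for an SVR from an ML library would keep the same two-stage structure.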


TABLE 2. Average accuracy, precision, and recall of validation and test phase for soil color.
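The three metrics reported in this table can be computed for a multi-class problem as sketched below. This is a minimal illustration under stated assumptions: accuracy is taken as the fraction of correctly labelled samples, precision and recall are computed per class and then averaged over classes, and the function name `classification_report` and the soil-color labels in the example are hypothetical.

```python
def classification_report(actual, predicted):
    """Return accuracy and class-averaged (macro) precision and recall."""
    labels = sorted(set(actual) | set(predicted))
    pairs = list(zip(actual, predicted))
    accuracy = sum(1 for a, p in pairs if a == p) / len(pairs)

    precisions, recalls = [], []
    for label in labels:
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        # A class with no predictions (or no true samples) contributes 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)

    macro_precision = sum(precisions) / len(labels)
    macro_recall = sum(recalls) / len(labels)
    return accuracy, macro_precision, macro_recall


# Invented example labels; they are not taken from the dataset.
actual = ["brown", "brown", "grey", "grey", "dark", "dark"]
predicted = ["brown", "grey", "grey", "grey", "dark", "brown"]
accuracy, precision, recall = classification_report(actual, predicted)
print(accuracy, precision, recall)
```

Averaging per-class values gives every soil color or land layer the same influence on the reported score, regardless of how many samples it has.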

B. CLASSIFICATION RESULTS
To predict the underground soil color and land layer in different regions, multiple classification algorithms are applied, and the results are listed in Table 2 and 3. The drilling task becomes easier when the land layer and soil color are known in advance, because different soil colors and land layers have different softness and hardness levels. The average digging capacity per day for each layer and soil color indicates the hardness level. Figure 8 shows the average digging capacity of each soil color and land layer.

There are eight different land layers in the given dataset. Four different state-of-the-art ML algorithms are selected to predict land layers at different locations and depths to solve this multi-class problem. The Bilayered NN performs better than all other ML algorithms in terms of precision, recall, and accuracy. The ensemble voting classifier is also applied so that its results can be compared with the proposed optimization-based weighted voting classifier. The same member classifiers are used for both voting classifiers. The results show that the proposed weighted voting classifier achieves better classification results than the existing voting classifier and traditional ML classification algorithms.

The results of the prediction of soil color are presented in Table 2. The results show the effectiveness of the proposed weighted voting classifier. The accuracy achieved by the proposed model is 82.76% for validation and 86.69% for the test phase. The proposed model uses the knowledge of all classifiers based on their performance for each class, so its results are better than those of all other state-of-the-art techniques. The early prediction of land with soft soil allows the drilling industry to work out the required resources before starting the actual process. In addition, the rate of penetration can be increased by considering the predicted information. There are multiple risks involved in drilling, including stuck pipes, broken pipe fluid, and cost overruns. All these risks arise because of hard layers: pipes get stuck, and the pipe fluid can also be broken. Early prediction can minimize the risks involved in the drilling process. If the predicted layer is hard, the drilling industry should use suitable material to reduce time and cost.

The proposed weighted voting classifier merges the knowledge of all the involved candidate models based on their confidence level of prediction and decides the label. For better measurement, 10-fold cross-validation is used to train the candidate models, and the average accuracy is used as the final parameter for the optimization module. In each iteration, nine of the ten folds are used for training, and the process is repeated ten times so that each fold is used once in the testing phase. The standard deviation of the upper and lower bounds of precision, recall, and accuracy achieved in the folds is reported for each classifier. To demonstrate the effectiveness of the proposed method, the model is compared with existing ML techniques for both soil color and land layer, as shown in Table 2 and 3, respectively. The validation and test accuracy, precision, and recall are reported and compared. Because the proposed model considers each candidate classifier's performance, optimal weights are assigned to each classifier. As a result, a candidate with higher accuracy plays a larger role in selecting the target variable than the other algorithms.

The average per-day digging capacity of each soil color and land layer is computed to analyze the hardness level, using eq. (15).

$$DC(color, layer) = \frac{\sum_{i=1}^{n} (ED_i - SD_i)}{n} \tag{15}$$

where DC is the digging capacity, ED is the ending depth, SD is the starting depth, and n is the number of instances of that specific soil color or land layer. Figure 8 illustrates the information computed by eq. (15) for both soil color and land layer. The figure shows that the dark brown soil is soft, because the average depth achieved per day is 4.3 meters. In contrast, the depth achieved in partridge-colored soil is 2.3 meters, which shows that partridge soil is hard and consumes more days and resources. The drilling industry needs to take care when the partridge soil color is predicted, because the fluid of the pipe can be broken by its hardness. In the case of the land layer, the Gyeongam Formation layer is soft compared to all others, while the landfill layer is very hard: only 1.5 meters of depth can be achieved in a whole day of work. From the land layer results, we can say that the rate of penetration in the landfill layer is low. Figure 2 shows the drilling points with their different layered structures. Each slice of a 3D pipe shows the digging depth on a specific day, and the total number of slices in the 3D pipe shows the number of days spent on that particular borehole. The z-axis of the figure shows the total depth of the drilling points. Different colors on one borehole point show the number of days spent on each drilling point and the thickness of each layer. The state of the underground water table is also depicted in the illustration. In some areas the water level is low, while in others it is very high.
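The interplay between the weighted vote of eq. (8) and the DE search for classifier weights can be sketched in a few lines of Python. This is a hedged illustration rather than the paper's Algorithm 1: the mutation scheme (a donor built from three random individuals with scale factor `F`), the clipping of weights to [0, 1], the parameter values, and all function names and toy votes are assumptions made for this example.

```python
import random


def weighted_vote(votes, weights, n_classes):
    """Eq. (8): pick the class whose voters carry the largest total weight."""
    scores = [0.0] * n_classes
    for label, w in zip(votes, weights):
        scores[label] += w
    # Ties resolve to the lowest class index.
    return max(range(n_classes), key=lambda j: scores[j])


def fitness(weights, all_votes, truth, n_classes):
    """Accuracy of the weighted vote over all samples."""
    hits = sum(weighted_vote(v, weights, n_classes) == t
               for v, t in zip(all_votes, truth))
    return hits / len(truth)


def optimise_weights(all_votes, truth, n_classes, pop_size=12,
                     generations=30, F=0.5, CR=0.7, seed=0):
    rng = random.Random(seed)
    dim = len(all_votes[0])  # one weight per member classifier
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    scores = [fitness(x, all_votes, truth, n_classes) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            others = [k for k in range(pop_size) if k != i]
            r1, r2, r3 = rng.sample(others, 3)
            # Mutation: donor vector, clipped to [0, 1].
            donor = [min(1.0, max(0.0, pop[r1][j] + F * (pop[r2][j] - pop[r3][j])))
                     for j in range(dim)]
            # Crossover: take the donor value when the random draw is below CR.
            trial = [donor[j] if rng.random() < CR else pop[i][j]
                     for j in range(dim)]
            trial_score = fitness(trial, all_votes, truth, n_classes)
            # Selection: keep the trial only if it is at least as good.
            if trial_score >= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = max(range(pop_size), key=lambda i: scores[i])
    return pop[best], scores[best]


# Invented votes of three classifiers on six samples (classes 0, 1, 2).
all_votes = [(0, 1, 1), (1, 0, 0), (2, 2, 0), (0, 0, 1), (1, 2, 2), (2, 1, 1)]
truth = [0, 1, 2, 0, 1, 2]
best_weights, best_accuracy = optimise_weights(all_votes, truth, n_classes=3)
print(best_weights, best_accuracy)
```

In this toy setting the first classifier is always right, so equal weights let the other two outvote it on four of six samples. Any weight vector in which the first weight exceeds the sum of the other two classifies every sample correctly, which is the kind of solution the search is meant to find.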


TABLE 3. Average accuracy, precision, and recall of validation and test phase for land layer.
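The validation scores in Tables 2 and 3 are averages over ten folds. A minimal sketch of such a split is given below, assuming consecutive, unshuffled folds; the paper does not state how the folds are drawn, and the names `k_fold_indices` and `cross_validated_mean` are hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if f < n_samples % k else 0)
                  for f in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validated_mean(score_fn, n_samples, k=10):
    """Average a per-fold score; `score_fn(train_idx, test_idx)` is user-supplied."""
    scores = [score_fn(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(25, k=10))
print([len(test) for _, test in folds])
```

In use, `score_fn` would train a candidate classifier on the training indices and return its accuracy on the test indices, so that the mean becomes the value passed to the optimization module.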

FIGURE 8. Average digging capacity of both soil color and land layer.
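The averages plotted in this figure follow eq. (15). A small Python sketch of that computation is shown below, assuming one record per drilling day with a group label (soil color or land layer), a starting depth, and an ending depth; the function name and the sample records are invented for illustration and are not values from the dataset.

```python
from collections import defaultdict


def digging_capacity(records):
    """Average depth gained per record, grouped by soil color or land layer.

    `records` is an iterable of (group, start_depth, end_depth) tuples.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, start_depth, end_depth in records:
        totals[group] += end_depth - start_depth  # ED_i - SD_i
        counts[group] += 1                        # n for this group
    return {group: totals[group] / counts[group] for group in totals}


# Invented daily records: (soil color, starting depth, ending depth) in meters.
records = [
    ("dark brown", 0.0, 4.0),
    ("dark brown", 4.0, 9.0),
    ("partridge", 9.0, 11.0),
    ("partridge", 11.0, 13.5),
]
capacity = digging_capacity(records)
print(capacity)
```

A lower value for a group indicates a harder soil color or land layer, since less depth is gained per day.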

FIGURE 9. The sequence of trained model.

To sum up the knowledge of all the candidate classifiers, the patterns illustrated in Figure 9a and 9b are extracted. The sequence of prediction, or the flow of the trained model, is visualized. The sequence shows a separate line for each class label. Numeric values are normalized, and categorical values remain the same. The figure shows some patterns between independent features. When the value of Korea startup changes, the value of the layer also changes in most cases. A color is assigned to each correct and incorrect prediction, as shown in the right bar of the figures.

C. REGRESSION RESULTS
The prediction of water level and depth is based on region, altitude, land layer, days spent, and soil color. The borehole depth is highly dependent on the days spent on the process and the hardness level of both soil color and land layer. Similarly, different land layers and soil color patterns lead to different groundwater levels. The results of the prediction of water level and drilling depth are shown in Table 4 and 5, respectively. The results of the proposed chained multi-out regression model are compared with conventional ML models. The depth of the borehole varies at different locations according to the level of groundwater. Similarly, the pattern of soil color and land layer also plays an important role in the prediction of depth. The results show the effectiveness of the proposed chained model in terms of MAE, MSE, and R2. The proposed method consists of two trained regression models arranged sequentially. Because of this sequential arrangement, the first trained model predicts one dependent variable and passes that knowledge to the next model, which uses the predicted variable as an additional independent variable when it is trained to predict the other variable. Because of this chained strategy, the proposed method performs well compared to other conventional ML techniques. Figure 10 shows the prediction results for the training and testing phases for drilling depth, and Figure 11 shows the prediction results for water level.

FIGURE 10. Results of multiout regression model for the prediction of next depth.

FIGURE 11. Results of multiout regression model for the prediction of water level.

TABLE 4. Comparison of proposed chained multiout regressor and traditional ML regressor in terms of MAE, MSE, and R2 for groundwater level prediction.

TABLE 5. Comparison of proposed chained multiout regressor and traditional ML regressor in terms of MAE, MSE, and R2 for borehole depth prediction.

The proposed model is based on two chained models to predict two different target variables simultaneously. By predicting the water level and depth in different regions, the drilling industry can make early estimates. Moreover, the prediction of soil color and land layer and their hardness level helps the drilling industry manage the required resources. The selection of the area depends on the preferences of the drilling industry. If the industry has heavy resources, then a hard layer can be tolerated, and areas with low depth can be considered. On the other hand, if labor cost is not a problem and the drilling equipment is not strong enough to drill the hard layer, then an area with a soft layer should be selected. Therefore, the classification and regression results help the industry, while the selection of the region remains highly dependent on the preference of the drilling industry and its method of drilling.

Huge amounts of data are generated by scientific experiments and other mediums. Researchers have applied multiple clustering and classification techniques to extract patterns from the data and make it useful. Different ML techniques have also been enhanced to get more accurate results from raw data. The proposed work presents an optimization-based weighted voting approach to find the hidden patterns in the data. The data is acquired from JNU, Republic of Korea. In the first phase, the data is preprocessed, and some features are extracted, such as total depth and number of days spent. Moreover, the data is encoded using an ordinal encoder so that mathematical classification algorithms can be applied. For fair voting, a population-based optimization algorithm is used to find the optimal weights based on the performance of the member classifiers. A weight is assigned to each classifier, and the target variable is assigned to each test data sample. The proposed method is evaluated based on accuracy, precision, and recall. The experiments show that the proposed model outperforms the existing voting classifier and state-of-the-art ML algorithms. The prediction of soil color and land layer in different areas, together with the predicted depth, allows the drilling industry to estimate time and labor cost early. However, some factors should be considered before starting


[37] Z. Ghaffar, A. Alshahrani, M. Fayaz, A. M. Alghamdi, and J. Gwak, “A topical review on machine learning, software defined networking, Internet of Things applications: Research limitations and challenges,” Electronics, vol. 10, no. 8, p. 880, Jan. 2021.
[38] A. Rizwan, N. Iqbal, R. Ahmad, and D.-H. Kim, “WR-SVM model based on the margin radius approach for solving the minimum enclosing ball problem in support vector machine classification,” Appl. Sci., vol. 11, no. 10, p. 4657, Jan. 2021.
[39] L. I. Kuncheva and J. J. Rodriguez, “A weighted voting framework for classifiers ensembles,” Knowl. Inf. Syst., vol. 38, no. 2, pp. 259–275, Feb. 2014.
[40] L. Lam and C. Y. Suen, “Application of majority voting to pattern recognition: An analysis of its behavior and performance,” IEEE Trans. Syst., Man, Cybern. A, Syst. Humans, vol. 27, no. 5, pp. 553–568, Sep. 1997.
[41] A. Ekbal and S. Saha, “Weighted vote-based classifier ensemble for named entity recognition: A genetic algorithm-based approach,” ACM Trans. Asian Lang. Inf. Process., vol. 10, no. 2, pp. 1–37, Jun. 2011.
[42] C. Friedman, “System and method for language extraction and encoding utilizing the parsing of text data in accordance with domain parameters,” U.S. Patent 6 182 029, Jan. 30, 2001.
[43] Y. Murakami and K. Mizuguchi, “Applying the Naïve Bayes classifier with kernel density estimation to the prediction of protein-protein interaction sites,” Bioinformatics, vol. 26, no. 15, pp. 1841–1848, 2010.
[44] A. Pérez and I. N. Inza, “Bayesian classifiers based on kernel density estimation: Flexible classifiers,” Int. J. Approx. Reasoning, vol. 50, no. 2, pp. 341–362, 2009.
[45] B. Scholkopf, K. K. Sung, and C. Burges, “Comparing support vector machines with Gaussian kernels to radial basis function classifiers,” IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2758–2765, Nov. 1997.
[46] O. Sagi and L. Rokach, “Ensemble learning: A survey,” WIREs Data Mining Knowl. Discovery, vol. 8, no. 4, p. 15, Jul. 2018.
[47] J. Brest, S. Greiner, B. Boskovic, M. Mernik, and V. Zumer, “Self-adapting control parameters in differential evolution: A comparative study on numerical benchmark problems,” IEEE Trans. Evol. Comput., vol. 10, no. 6, pp. 646–657, Dec. 2006.
[48] J. Zhang and A. C. Sanderson, “JADE: Adaptive differential evolution with optional external archive,” IEEE Trans. Evol. Comput., vol. 13, no. 5, pp. 945–958, Oct. 2009.
[49] J. Brest and M. S. Maucec, “Self-adaptive differential evolution algorithm using population size reduction and three strategies,” Soft Comput., vol. 15, pp. 2157–2174, Dec. 2011.
[50] W. Gong, “Repairing the crossover rate in adaptive differential evolution,” Appl. Soft Comput., vol. 15, pp. 149–168, Feb. 2014.
[51] N. Iqbal, R. Ahmad, F. Jamil, and D.-H. Kim, “Hybrid features prediction model of movie quality using multi-machine learning techniques for effective business resource planning,” J. Intell. Fuzzy Syst., vol. 40, no. 5, pp. 9361–9382, Jan. 2021.

ATIF RIZWAN received the M.S. degree in computer science from COMSATS University Islamabad, Attock Campus, Punjab, Pakistan, in 2020. He is currently pursuing the Ph.D. degree with the Department of Computer Engineering, Jeju National University, Republic of Korea. He has two-year academic experience as a Research Associate with COMSATS University Islamabad. He has good industry experience in software development and testing. His research interests include applied machine learning, data and web mining, optimization of core algorithms, and the IoT-based applications.

ANAM NAWAZ KHAN received the B.S. and M.S. degrees in computer science from COMSATS University Islamabad, Attock Campus, Pakistan, in 2016 and 2019, respectively. She is currently pursuing the Ph.D. degree with the Department of Computer Engineering, Jeju National University, Republic of Korea. Her research interests include machine learning applications in smart environments, analysis of prediction and optimization algorithms, big data, and the IoT-based applications.

NAEEM IQBAL (Member, IEEE) received the M.S. degree in computer science from COMSATS University Islamabad, Attock Campus, Punjab, Pakistan, in 2019. He is currently pursuing the Ph.D. degree with the Department of Computer Engineering, Jeju National University, Republic of Korea. He has professional experience in the software development industry and in academia as well. He has published more than 20 papers in peer-reviewed international journals and conferences. He is serving as a professional reviewer for various well-reputed journals and conferences. His research interests include AI-based intelligent systems, data science, big data analytics, machine learning, deep learning, analysis of optimization algorithms, the IoT, and blockchain-based secured applications.

RASHID AHMAD received the B.S. degree from the University of Malakand, Pakistan, in 2007, the M.S. degree in computer science from the National University of Computer and Emerging Sciences (NUCES), Islamabad, Pakistan, in 2009, and the Ph.D. degree in computer engineering from Jeju National University, South Korea, in 2015. Currently, he is working as an Assistant Professor at COMSATS University Islamabad, Attock Campus. His research interests include the application of prediction and optimization algorithms to build IoT-based solutions, machine learning, data mining, and related applications.

DO HYEUN KIM received the B.S. degree in electronics engineering and the M.S. and Ph.D. degrees in information telecommunication from Kyungpook National University, South Korea, in 1988, 1990, and 2000, respectively. He was with the Agency of Defense Development (ADD), from 1990 to 1995. From 2008 to 2009, he was a Visiting Researcher with the Queensland University of Technology, Australia. Since 2004, he has been with Jeju National University, South Korea, where he is currently a Professor with the Department of Computer Engineering. His research interests include sensor networks, M2M/IOT, energy optimization and prediction, intelligent service, and mobile computing.

