Machine Learning Based Missing Data Imputation
ABSTRACT In order to predict and fill in the gaps in categorical datasets, this research looked into the use of
machine learning algorithms. The emphasis was on ensemble models constructed using the Error-Correcting
Output Codes (ECOC) framework, including models based on SVM and KNN as well as a hybrid classifier
that combines models based on SVM, KNN, and MLP. Three diverse datasets (CPU, Hypothyroid, and
Breast Cancer) were employed to validate these algorithms. Results indicated that these machine
learning techniques performed well in predicting and completing missing data, with
effectiveness varying with the specific dataset and missing data pattern. Compared to single models,
ensemble models that made use of the ECOC framework significantly improved prediction accuracy and
robustness. Deep learning for missing data imputation has obstacles despite these encouraging results,
including the requirement for large amounts of labeled data and the possibility of over-fitting. Subsequent
research endeavors ought to evaluate the feasibility and efficacy of deep learning algorithms in the context
of the imputation of missing data.
INDEX TERMS Data cleansing, missing data imputation, classification, regression and categorical datasets.
• Duplicate records: Instances where different or multiple records represent the same entry in the dataset.
• Redundant features: Irrelevant attributes that contribute minimally to the model's construction and potentially extend training duration and increase overfitting risk [4].
• Missing data: Cases where no value has been recorded for a feature. These are common and can substantially influence data interpretation [5].
• Outliers: In statistical analysis, outliers are observations that deviate significantly from others, potentially causing severe issues [6].
Data cleaning involves rectifying these issues, including filling in missing data, smoothing noisy data, identifying or removing outliers, and addressing inconsistencies. The ultimate goal is to develop a tool capable of resolving all the aforementioned problems. Previous research has primarily concentrated on commonly encountered challenges such as incorrect data types, lost data, and outliers [7].
Missing data often impedes useful investigations across various scientific domains. Such research relies on subject cooperation, but complete participation cannot be assumed, which leaves gaps in the data. This paper defines "missing data" as instances where no data exists for the relevant variable.
Even the most carefully planned and conducted studies can yield incomplete results, a problem recognized in both scientific and corporate realms. Missing data complicates the interpretation and understanding of the phenomena under study. The absence of data compromises the validity of scientific research, as reliable conclusions are only drawn through a thorough analysis of complete datasets. Most scientific, commercial, and economic decisions are influenced or informed by published research findings. Hence, proper handling of missing data should be a priority [8], [9].
A. IMPUTATION OF MISSING DATA
Imputation is a technique applied to handle missing data. In this article, we extend the definition of imputation beyond that given by [10], which states, "Imputation is a comprehensive and flexible method for dealing with missing data." This technique involves predicting missing data based on the observed data distribution, commonly referred to as "drawing missing data from the estimated distribution through imputation." Imputation methods predict missing data by utilizing a function of auxiliary variables or predictors. Given its crucial role across various statistical domains, particularly in government statistics, imputation has been extensively discussed in the literature. This process is illustrated in Figure 2.
B. THE ONSET OF MISSING DATA
Missing data might result from human or machine error during sample processing, malfunctioning equipment, transcription issues, dropouts during follow-up and clinical studies, or respondents' unwillingness to answer a specific question, as well as the combination of two nearly identical records in a collection of data. This is also known as "non-response." Non-response at the unit level occurs when responses are available for some sampled units but not all, owing to refusal, inability to attend, absence from home, or untracked cases. At the item level, a respondent may choose not to answer a particular question. Imputation can thus be applied at two levels: unit non-response and item non-response. Any variable that does not have a measurable value for the entire population should be estimated. Of these two levels, this article focuses on item non-response. To clarify how to handle missing data, the aforementioned reasons have been formalized as multiple "missing data mechanisms" [12].
C. MOTIVATION AND BACKGROUND
The missing data must be imputed before the analysis can be completed, as data analysis cannot be performed on incomplete data sets. Neglecting this step could lead to incorrect conclusions. Missing data can result in undesirable outcomes, especially when it causes estimates to be biased. Although methods for imputing missing data have been debated for decades, relatively few studies have examined the accuracy of the machine learning algorithms that are most commonly used to perform this task. There are numerous methods, practices, and procedures for handling missing data and filling it in [13]. Imputation is one of the practices discussed in this paper [14]. It is achieved via the application of machine learning algorithms. Appropriate estimation methods can be used to enhance the quality of the analyzed dataset and help make more informed healthcare decisions [1]. The state-of-the-art AI-enabled imputation was selected after extensive experimental work on all ensembles.
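The data-quality problems listed at the start of this section (duplicate records, redundant features, missing data, and outliers) can each be detected with a few lines of code. The following is a minimal sketch using NumPy on a hypothetical toy table; the table, the "single distinct value" test for redundancy, and the 1.5 × IQR outlier rule are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical toy table: rows are records, columns are numeric features.
data = np.array([
    [25.0, 50.0, 1.0],
    [32.0, 64.0, 1.0],
    [25.0, 50.0, 1.0],    # exact duplicate of the first record
    [41.0, np.nan, 1.0],  # missing value in the second feature
    [38.0, 76.0, 1.0],
    [29.0, 900.0, 1.0],   # extreme value in the second feature
])

# Duplicate records: rows identical to a row seen earlier.
seen, duplicate_rows = set(), []
for i, row in enumerate(map(tuple, data.tolist())):
    if row in seen:
        duplicate_rows.append(i)
    seen.add(row)

# Missing data: number of NaN entries per feature.
missing_per_feature = np.isnan(data).sum(axis=0).tolist()

# Redundant features (simplest case only): a single distinct observed value.
constant_features = [
    j for j in range(data.shape[1])
    if np.unique(data[~np.isnan(data[:, j]), j]).size == 1
]

# Outliers: assumed 1.5 * IQR rule, applied per feature and ignoring NaNs.
q1, q3 = np.nanpercentile(data, [25, 75], axis=0)
iqr = q3 - q1
outlier_mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
outlier_cells = [(int(r), int(c)) for r, c in zip(*np.nonzero(outlier_mask))]

print("duplicate rows:", duplicate_rows)
print("missing values per feature:", missing_per_feature)
print("constant features:", constant_features)
print("outlier cells (row, column):", outlier_cells)
```

Only the missing-data check is central to the rest of the paper; the other three checks are included to show that each issue has a simple operational definition.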
II. LITERATURE REVIEW
A. DATA TYPE DISCOVERY TECHNIQUES
Identifying data types in the original dataset can be accomplished through various methods. Some approaches are straightforward and rely on basic statistics or heuristics. For example, to determine whether a feature is distinct or constant, we can count the number of distinct values the feature takes and compare it to the total number of instances of the feature. However, more advanced or complex methods may require the use of machine learning models for accurate detection.
B. MANAGING MISSING DATA
Missing data, according to [8], is a prevalent problem that either goes unnoticed by scientists or is actively suppressed. In the latter case, researchers are aware of the missing data and focus on arguing why it is irrelevant to the specific study. Data matter when they influence judgments and, ultimately, one's knowledge, both known and unknown. Missing data can have serious consequences for quantitative research, such as information loss, increased standard errors, decreased statistical power, biased parameter estimates, and reduced generalizability of study conclusions [15]. Unfortunately, one of the standard ways for scientists to deal with missing data is to delete the affected cases using ad hoc methods like listwise or pairwise deletion. This usually leads to biased estimates and is criticized for being inefficient. Frangakis and Rubin [16] found that the most common cause of missing data in the NSI dataset is that respondents either opt not to participate in the survey (unit non-response) or decline to answer particular questions (item non-response).
C. THEORY OF MCAR, MAR AND MNAR
Rubin [12] devised techniques to deal with the loss of any data point and split the missing data problem into three distinct missing data mechanisms. In other words, there are three kinds of missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). Following Baraldi and Enders [17], data are MCAR if the probability of being missing is the same for all cases. Under MCAR, the cause of the missingness is unrelated to the data itself. An example from educational research is a student who moves to a different area in the middle of their undergraduate career: the missing data are MCAR if the reason for the move is unrelated to any other variable in the dataset. MCAR is frequently unrealistic for the data at hand.
Data are missing at random (MAR) only when the probability of being missing is the same within each group defined by the observed data [17]. That is, the reason for a variable's missing values is unrelated to the variable itself but may be linked to other observed variables. Despite the misleading name "random," the MAR process is not random: it represents systematic missingness, where the bias in the missing data is tied to other observed aspects of the analysis. When sampling a population, for example, the probability of being included is determined by some known property. MAR is a broader category than MCAR, and the MAR assumption is the foundation for the majority of recent strategies for dealing with missing data.
Finally, if neither MCAR nor MAR holds, the data are missing not at random (MNAR). Under MNAR, the probability of being missing varies for reasons that are unknown, so it depends on unobserved measurements: the value of an unobserved response is determined by information that cannot be assessed. For example, when asked about their spending patterns, students who frequently gambled at casinos tended to avoid the questions out of fear of getting into trouble. As a result, a model cannot predict the missing data adequately, and MNAR is the most difficult case.
Rubin's [12] distinction is critical in understanding why some solutions may not function as expected. The theory explains under which conditions missing data methods produce statistically valid findings, which in turn improves prediction accuracy and effectiveness. This research is built on MCAR data. Although this reduces statistical power, it offers the advantage of preserving the study's goal because the estimated parameters are not biased by the missing data.
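The practical difference between the three mechanisms is easiest to see by generating them. The sketch below is a minimal illustration under stated assumptions: the synthetic `age` and `income` variables, the helper `make_missing`, the 30% missing rate, and the median-split probabilities are all hypothetical and are not part of the study. Under MCAR every row has the same chance of losing its income value; under MAR the chance depends only on the observed `age`; under MNAR it depends on the income value itself.

```python
import numpy as np

def make_missing(age, income, mechanism, rate=0.3, seed=0):
    """Return a copy of `income` with NaNs inserted under the given mechanism."""
    rng = np.random.default_rng(seed)
    if mechanism == "MCAR":
        # Same probability of being missing for every row.
        prob = np.full(len(income), rate)
    elif mechanism == "MAR":
        # Probability depends only on an observed auxiliary variable (age).
        prob = np.where(age > np.median(age), 1.5 * rate, 0.5 * rate)
    elif mechanism == "MNAR":
        # Probability depends on the value that goes missing (income itself).
        prob = np.where(income > np.median(income), 1.5 * rate, 0.5 * rate)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    masked = income.astype(float)  # astype returns a new array
    masked[rng.random(len(income)) < prob] = np.nan
    return masked

# Synthetic data: income rises with age, plus noise.
rng = np.random.default_rng(42)
age = rng.uniform(20, 60, size=5000)
income = 1000 + 50 * age + rng.normal(0, 200, size=5000)

true_mean = float(income.mean())
summary = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    masked = make_missing(age, income, mechanism)
    summary[mechanism] = {
        "missing_rate": float(np.isnan(masked).mean()),
        "observed_mean": float(np.nanmean(masked)),
    }
    print(mechanism, summary[mechanism], "true mean:", round(true_mean, 1))
```

In this setup the mean of the observed incomes stays close to the true mean under MCAR but is pulled downward under MAR and MNAR, because higher incomes are removed more often. The MAR bias could in principle be corrected using `age`, which is fully observed; the MNAR bias cannot, since the variable driving the missingness is the one that is missing.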
D. PROPORTIONS OF MISSING DATA
The distinction between missing data mechanisms is generally accepted by researchers, particularly since it has been demonstrated that this distinction affects a strategy's efficacy. The acceptable rate of missing data, on the other hand, is not agreed upon. There are numerous points of view on the acceptable percentage of missing data in a dataset. According to Schafer [18], a missing rate of 5% or less is inconsequential; hence, values should be imputed when 5% or more of the data are missing. Bennett argues that values should be imputed when the amount of missing data exceeds 10%. As a result, even if only a small fraction of data is missing, a researcher may wish to impute it.
E. MULTIPLE IMPUTATION
To reduce imputation-induced bias, multiple imputation averages the results obtained from several imputed data sets. Multiple imputation basically consists of three steps. First, the missing data in the incomplete data set is imputed m times. It should be noted that the imputed values are drawn from a distribution. This step produces m complete data sets. The second step is to analyze each of the m complete data sets: the mean, variance, and confidence interval of the variable of interest are calculated. Finally, the results of the m analyses are pooled into the final result. Multiple imputation is by far the most complex and popular method. Multiple imputation by chained equations (MICE), which is based on the MCMC algorithm, is the most widely used method of multiple imputation. MICE takes the idea of regression one step further and exploits correlations among responses (Lynn [19]).
Despite these promising results, there are still challenges with deep learning for missing data imputation. These include the need for a significant amount of labeled data and the risk of overfitting. Future studies should assess the practicability and performance of deep learning algorithms when it comes to data imputation [20], [21].
One study compares two conventional methods, MICE and missForest, with deep learning methods: generative adversarial imputation networks (GAIN) with one-hot encoding, GAIN with embedding, variational auto-encoder (VAE) with one-hot encoding, and VAE with embedding. Three simulated datasets and seven real benchmark datasets are considered, covering a range of scenarios with varying feature types at varying sample size levels. Three types of missing data mechanisms (MCAR, MAR, and MNAR), as well as various missing ratios, are used to produce the missing data [22].
To explain the concept of MICE, consider imputing missing data in a simple dataset. Imagine that we have three variables in our dataset: occupation, age, and income, each with missing data. MICE can be carried out in the following steps:
1) First, a simple imputation method, such as imputation by the mean, is used to fill in the missing data.
2) The imputed values for the occupation variable are set back to missing.
3) Linear regression is used to predict the missing data for occupation from age and income, based on all of the observed cases.
4) The values obtained in step 3 are used to impute the missing data for occupation. The occupation variable has no missing data at this point.
5) Steps 2-4 are repeated for the age variable.
6) Steps 2-4 are repeated for the income variable.
7) The entire iterative process is repeated until the imputations of the three variables converge.
Multiple imputation is designed for MAR data but can also produce valid estimates under MNAR.
The authors of [23] propose an approach based on reinforcement learning (RL) for estimating missing data. This method involves learning a strategy for empirically estimating data based on action rewards. The proposed method maintains the variance of the imputed values by imputing missing data using columns with different values, as opposed to working only within the same column (which is analogous to univariate imputation). The authors report that their method outperforms other imputation strategies when applied to various datasets [23].
Another proposed method employs multiple imputation using an iterative Markov chain Monte Carlo (MCMC) simulation method based on the Gibbs sampler algorithm. In earlier attempts, MCMC simulations were used, but only on relatively small data sets with a restricted number of variables. Consequently, an additional contribution of that paper is its application and comparison within a large longitudinal English education study with three iterative specifications. This was accomplished by utilizing the study's findings. The simulation's results reveal how the algorithm eventually converges [24].
Using local feature spaces, the authors of [25] propose two closed-itemset-based methods, CIimpute and ICIimpute, to impute missing data for multiclass matrix data. CIimpute estimates the missing data using closed itemsets extracted from each class. ICIimpute is obtained by modifying CIimpute to include an attribute reduction procedure. The results of the experiments indicate that reducing the number of attributes significantly reduces the computation time and improves the imputation precision. In addition, the results demonstrate that ICIimpute provides superior imputation precision despite requiring a longer computation time than other methods [25].
The authors of [26] propose an autoencoder model that considers spatiotemporal factors to estimate missing data in air quality datasets. The model consists of one-dimensional convolutional layers that provide flexible coverage of air pollutants' spatial and temporal behavior. It incorporates data
from nearby stations to enhance predictions for data-deficient target stations, eliminating the need for additional components such as weather and climate data. The findings demonstrate that the method effectively fills in missing data from discontinuous or long-interval interrupted datasets. Compared to univariate imputation techniques (most frequent, median, and mean imputation), the model achieves up to a 65% improvement in RMSE, and a 20-40% improvement compared to multivariate imputation techniques (decision trees, extra trees, k-nearest neighbors, and Bayesian ridge regression). However, when adjacent sites have a negative or weak correlation, imputation performance is diminished [26].
A new mechanism for predicting and estimating the data lost in IoT gateways has been developed to achieve greater autonomy at the network's edge. In most cases, the computational resources on these gateways are limited. Therefore, the imputation method for missing data must be simple while still producing precise estimates. In light of this, the authors of that study propose two neural network-based regression models to estimate the missing data in IoT gateways. The authors consider not only the precision of the prediction but also the time required to execute the algorithm and the total amount of memory consumed. The authors validated their models using six years' worth of Rio de Janeiro weather data, varying the percentage of missing data. The results indicate that the neural network regression models outperform the other investigated techniques, which are based on the mean and on the repetition of previous values. This is the case for all missing data percentages. In addition, the neural network models can run on IoT gateways due to their relatively short execution times and low memory requirements [27].
The authors of [28] propose a data-driven imputation method for missing data that identifies the optimal imputation technique. This method uses the information already known about the dataset to rank five chosen methods based on their respective estimated error rates. In evaluating the proposed methods, the authors used both a classifier-independent scenario, where they compared the applicability and error rate of each imputation method, and a classifier-dependent scenario, where they compared the prediction accuracy of a random forest classifier using datasets prepared with each imputation method and a baseline without imputation, in which the classification algorithm handles missing data internally.
Based on the results of these two experimental sets, the authors conclude that the proposed data-driven imputation method typically results in more accurate estimates of missing data and improved classifier performance in a longitudinal dataset of human aging. Additionally, the authors note that estimates derived from imputation techniques specifically designed for longitudinal data are extremely precise. This finding supports the idea that utilizing the temporal information inherently present in longitudinal data is beneficial for machine learning applications, which can be effectively achieved using the proposed data-driven methods [28].
F. SUMMARY OF TECHNIQUES
Missing data is unavoidable when handling any amount of medical data. Being able to build prognosis and prediction models based on data sets with substantial amounts of missing data would be an advantage to researchers. A data set has been simulated to be used in predicting patient lifetimes via an artificial neural network (ANN). Various levels of missing data were then simulated, and the missing data were imputed by a variety of methods. FAMD stands for "Factor Analysis for Mixed Data." FAMD is used to analyze data that contains both continuous and categorical variables. It is a development of factor analysis, which finds underlying patterns in data. FAMD handles categorical variables by turning them into dummy variables. The performance of MICE is better than that of FAMD. The lifetime prediction ANNs were then applied to the imputed data, and these results were compared across the different amounts of missing data. The conclusion is that MICE without pooling, MICE with imputed pooling, and MICE with non-imputed pooling all have similar performance. missForest had significantly lower misclassification and loss rates. MICE with non-imputed pooling has the highest theoretical accuracy of the MICE algorithms, and the associated R package has a large degree of tunability. It is therefore recommended that imputation of data sets for ANN lifetime predictions be implemented using one of these two methods, with preference given to the missForest algorithm, particularly for data sets with a high degree of missing data.
Table 1 describes the dataset and shows the missing data percentages for each attribute.
III. PROPOSED METHODOLOGY
In this research, we discuss the development of a Python-based missing data imputation system that provides automated, data-driven support to help users clean their data efficiently. Any Integrated Development and Learning Environment (IDLE) can be used. The proposed model aims to improve data quality in order to train better machine learning models. There are ways to solve a wide range of data problems, but the main concern here is the automatic handling of missing data.
In this section, we explain all the steps followed to develop an automatic method for handling missing data efficiently and accurately. A benchmark dataset will be used to validate the effectiveness of the proposed models for missing data imputation. The data will be preprocessed in order to select the best attributes for handling missing data. This is a simple task that doesn't require any complex operations, such as selecting features using an optimization technique or
TABLE 1. Description of the dataset with missing data percentages for each attribute.
some kind of feature extraction technique to extract hidden information from the data.
In a machine learning-based project, the dataset is resampled using a cross-validation technique, which creates a training set and a test set from the original dataset. Three state-of-the-art cross-validation techniques are used for model performance evaluation and parameter tuning of the proposed classification model. The sections below explain the methodology followed in this research.
A. CLASSIFIERS AND REGRESSION MODEL USED FOR MISSING DATA PREDICTION
1) SUPPORT VECTOR MACHINE
A support vector machine (SVM) is a linear model that, in its basic form, classifies data into only two categories. The SVM model uses a hyperplane to divide the two classes. Because a single hyperplane separates only two classes, a basic SVM cannot classify more than two classes of data. In recent years, a framework-based SVM capable of classifying multi-class data has been developed. The ensemble learning method is applied to the linear model, so that more than one SVM model is used when training for a multi-class classification problem [29]. Sequential minimal optimization is a state-of-the-art SVM framework for multiclass classification problems [30].
2) K-NEAREST NEIGHBOR
The K-nearest neighbor (K-NN) classifier is a lazy classifier because it is an instance-based learner, which means that the K-NN model does not have a training phase. It relies on a similarity measure; computing the similarity itself requires no labels and no training phase. Euclidean distance is the most popular and widely used method for finding the similarity between data points [26].
3) RANDOM FOREST
Random forest is an ensemble method that uses many decision trees. In supervised learning, the decision tree model is considered one of the simplest and most efficient classification models. When the dataset is small, meaning it has fewer records and fewer attributes, a decision tree model achieves higher accuracy. In the random forest model, we have many classifiers, so a voting scheme is used to select the final output class for a data set. The voting is performed using the mode function, which assigns to the test data the class label that is predicted by most of the classifiers [28].
B. PROPOSED FRAMEWORK
The dataset can be loaded into any modern Python-based Integrated Development and Learning Environment (IDLE). The dataset attributes will be checked to see if there are any missing numerical or nominal values. The methodology differs for the two cases: for the prediction of nominal values, classification models will be used in a supervised learning approach, while for estimating numerical values, a regression model will be used. The attribute with missing data is treated as the class attribute, and the remaining attributes serve as predictors. Cross-validation methods are used for splitting a dataset into two subsets: the training set and the test set. For this purpose, three state-of-the-art cross-validation methods will be used: hold-out with 70% of the data for training and 30% for testing, K-fold with a k value of 10, and the leave-one-out method. The models, i.e., the classifiers, will be validated on the out-of-sample data to evaluate their performance. The performance of the classifiers will be evaluated using accuracy, precision, recall, and F-measure, while the performance of the regression model will be evaluated using root mean square error. Overview of the proposed framework for predicting missing nominal and numerical values in a categorical dataset using random forest, SVM, and KNN classifiers:
• Preprocessing: The first step in the framework is to preprocess the dataset to prepare it for imputation. This includes handling any missing data that are present in the target variable (the variable with missing data that you wish to impute), as well as any other missing data in the dataset. It may also involve one-hot encoding categorical variables or standardizing numerical variables.
• Splitting the data: Next, the dataset is split into training and testing sets. The training set is used to train the machine learning models, while the testing set is used to evaluate their performance.
• Training the models: The machine learning models (random forest, SVM, and KNN) are then trained on the training set.
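The classification branch of the framework described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic, the nominal attribute is stored as integer codes with -1 standing for "missing", only the 70/30 hold-out split is shown (10-fold and leave-one-out are omitted), and a small NumPy K-NN with Euclidean distance stands in for the random forest, SVM, and KNN classifiers named in the text. The helper `knn_predict` is hypothetical.

```python
import numpy as np

def knn_predict(train_X, train_y, query_X, k=5):
    """Majority vote among the k nearest training rows (Euclidean distance)."""
    # Pairwise distances, shape (n_query, n_train).
    dists = np.linalg.norm(query_X[:, None, :] - train_X[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = train_y[nearest]  # labels of the k neighbours, shape (n_query, k)
    return np.array([np.bincount(row).argmax() for row in votes])

rng = np.random.default_rng(0)
# Synthetic data: two numeric predictors and one nominal attribute (codes 0, 1, 2).
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = rng.integers(0, 3, size=300)
X = centers[labels] + rng.normal(0.0, 0.7, size=(300, 2))

# Hide about 20% of the nominal values; -1 marks a missing entry.
target = labels.copy()
missing = rng.random(300) < 0.2
target[missing] = -1

# The attribute with missing data is the class attribute. Rows where it is
# observed form the labelled data; 70% train, 30% hold-out test.
observed_idx = rng.permutation(np.flatnonzero(~missing))
cut = int(0.7 * len(observed_idx))
train_idx, test_idx = observed_idx[:cut], observed_idx[cut:]

holdout_pred = knn_predict(X[train_idx], target[train_idx], X[test_idx])
holdout_accuracy = float((holdout_pred == target[test_idx]).mean())

# Impute the missing entries using all observed rows as training data.
imputed = target.copy()
imputed[missing] = knn_predict(X[observed_idx], target[observed_idx], X[missing])
imputation_accuracy = float((imputed[missing] == labels[missing]).mean())

print(f"hold-out accuracy: {holdout_accuracy:.3f}")
print(f"accuracy on the hidden cells: {imputation_accuracy:.3f}")
```

Because the true labels of the hidden cells are known in a simulation, the imputation accuracy can be measured directly; on real data only the hold-out estimate is available.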
to improve the accuracy and robustness of the model by leveraging the strengths of different classifiers.
The confusion matrix produced by comparing the actual values with the predicted missing data is shown in Table 7. There are 55 cases correctly predicted for the right class, whereas the accuracy for the left class is 58%.
Table 8 shows the accuracy and root mean square error over all instances of a dataset with 286 instances. Table 8 presents a thorough performance analysis using a range of performance evaluation metrics, including accuracy, true positive rate, false positive rate, precision, recall, and F1-score. The average accuracy of the random forest classifier for predicting missing data was 57.48%, and it obtained an F1-score of 57.5%. The value of k stays at 10 throughout the experiment for the purpose of cross-validation.
3) SIMULATION OF CPU DATASET
This section describes the results achieved from the simulation of the CPU dataset using different classifiers.
a: SVM-BAGGING ENSEMBLE
The SVM-based ensemble model is an ensemble method that combines the predictions of several linear support vector machine (SVM) models using a mean voting system. This creates an ensemble regression model that can be used to predict continuous values, such as the missing data in our study. The SVM-based ensemble model is based on the error-correcting output codes (ECOC) method, which is a technique for constructing ensemble classifiers by combining the predictions of multiple classifiers. In the ECOC method, several linear SVM models are used to make predictions, and the results are pooled by taking the average of all the predictions made by the individual SVMs.
The ECOC method is a cutting-edge approach that has been shown to be effective in improving the accuracy and robustness of ensemble classifiers. In our study, we used the ECOC method to generate the SVM-based ensemble model, which was then used to predict the missing data in the dataset.
An RMSE of 32.27 was attained with the bagging SVM regression model that was presented for numerical values. Other performance measures, namely the correlation coefficient, mean absolute error, and root relative squared error, are also used to validate the model. These evaluations yield the values 0.64, 14.37, and 77.61, respectively.
b: KNN-BAGGING ENSEMBLE
The KNN-based ensemble model is an ensemble method that combines the predictions of several lazy K-nearest neighbor (KNN) models using a mean voting system. This creates an ensemble regression model that can be used to predict continuous values, such as the missing data in our study. The KNN-based ensemble model is based on the error-correcting output codes (ECOC) method, which is a technique for constructing ensemble classifiers by combining the predictions of multiple classifiers. In the ECOC method, several KNN models are used to make predictions, and the results are pooled by taking the average of all the predictions made by the individual KNNs. The ECOC method is a cutting-edge approach that has been shown to be effective
TABLE 12. Result achieved by bagging mix classifier.
ensemble random forest is 28.13, which is attained with the help of the bagging SVM regression model that was presented for numerical values. Other performance analysis measures, such as the correlation coefficient, mean absolute error, and root relative squared error, are utilized in the validation process of the model. The results of these evaluations yield the values 0.73, 13.33, and 67.67, respectively.
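The four regression measures quoted above (RMSE, correlation coefficient, mean absolute error, and root relative squared error) can be computed as in the following sketch. It is an illustration under stated assumptions: the numbers are hypothetical and unrelated to the paper's experiments, the helper `regression_report` is not from the paper, and RRSE is expressed as a percentage relative to always predicting the mean of the actual values, which is one common convention. The pooled prediction is formed by mean voting over two hypothetical ensemble members.

```python
import numpy as np

def regression_report(actual, predicted):
    """RMSE, MAE, Pearson correlation, and RRSE (in percent) for two vectors."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = predicted - actual
    rmse = np.sqrt(np.mean(errors ** 2))
    mae = np.mean(np.abs(errors))
    correlation = np.corrcoef(actual, predicted)[0, 1]
    # RRSE: squared error relative to always predicting the mean of `actual`.
    baseline = np.sum((actual - actual.mean()) ** 2)
    rrse_percent = 100.0 * np.sqrt(np.sum(errors ** 2) / baseline)
    return {
        "rmse": float(rmse),
        "mae": float(mae),
        "correlation": float(correlation),
        "rrse_percent": float(rrse_percent),
    }

# Hypothetical predictions of two ensemble members for four held-out values.
member_predictions = np.array([
    [11.0, 19.0, 34.0, 36.0],
    [13.0, 17.0, 32.0, 38.0],
])
# Mean voting: the ensemble prediction is the average over the members.
predicted = member_predictions.mean(axis=0)
actual = np.array([10.0, 20.0, 30.0, 40.0])

report = regression_report(actual, predicted)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

An RRSE below 100% means the model does better than the mean predictor, which makes it easier to compare across datasets than the raw RMSE.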
TABLE 14. Bagging mix classification model detailed performance using the hypothyroid dataset.
TABLE 15. The results of the performance evaluation for the random forest classifier and the bagging mix models.
indicate that both algorithms were effective in predicting and imputing the missing data in the dataset.
V. CONCLUSION
In summary, this research explored the use of machine learning algorithms to predict and impute missing data in categorical datasets, employing three distinct datasets (CPU, Hypothyroid, and Breast Cancer) and various ensemble models built on the Error-Correcting Output Codes (ECOC) framework. The occurrence of missing, null, or infinite values is a major issue in all kinds of datasets. The study demonstrated satisfactory performance of these algorithms in predicting and imputing missing data, with the ensemble models within the ECOC framework notably enhancing prediction accuracy and robustness. However, the study's limitations included a narrow focus on select algorithms and datasets, and the fact that algorithm performance could be influenced by specific data characteristics and missing data patterns.
[5] I. Mehmood, M. Sajjad, K. Muhammad, S. I. A. Shah, A. K. Sangaiah, M. Shoaib, and S. W. Baik, "An efficient computerized decision support system for the analysis and 3D visualization of brain tumor," Multimedia Tools Appl., vol. 78, no. 10, pp. 12723–12748, May 2019.
[6] F. E. Grubbs, "Procedures for detecting outlying observations in samples," Technometrics, vol. 11, no. 1, p. 1, Feb. 1969.
[7] S. Agarwal, "Data mining: Data mining concepts and techniques," in Proc. Int. Conf. Mach. Intell. Res. Advancement, Dec. 2013, pp. 203–207.
[8] P. E. McKnight, K. M. McKnight, S. Sidani, and A. J. Figueredo, Missing Data: A Gentle Introduction. New York, NY, USA: Guilford Press, 2007.
[9] M. Liu, S. Li, H. Yuan, M. E. H. Ong, Y. Ning, F. Xie, S. E. Saffari, Y. Shang, V. Volovici, B. Chakraborty, and N. Liu, "Handling missing values in healthcare data: A systematic review of deep learning-based imputation techniques," Artif. Intell. Med., vol. 142, Aug. 2023, Art. no. 102587. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S093336572300101X
[10] R. Little and D. B. Rubin, Incomplete Data. Hoboken, NJ, USA: Wiley, 2014.
[11] R. J. Little and D. B. Rubin, Statistical Analysis With Missing Data, vol. 793. Hoboken, NJ, USA: Wiley, 2019.
[12] D. B. Rubin, "Inference and missing data," Biometrika, vol. 63, no. 3, p. 581, Dec. 1976.
[13] T. Köse, S. Özgür, E. Coşgun, A. Keskinoğlu, and P. Keskinoğlu, "Effect of missing data imputation on deep learning prediction performance for vesicoureteral reflux and recurrent urinary tract infection clinical study," BioMed Res. Int., vol. 2020, pp. 1–15, Jul. 2020.
[14] M. Kazijevs and M. D. Samad, "Deep imputation of missing values in time series health data: A review with benchmarking," J. Biomed. Informat., vol. 144, Aug. 2023, Art. no. 104440. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1532046423001612
[15] T. Sun, S. Zhu, R. Hao, B. Sun, and J. Xie, "Traffic missing data
imputation: A selective overview of temporal theories and algorithms,’’
Despite these limitations, our research provides insightful
Mathematics, vol. 10, no. 14, p. 2544, Jul. 2022. [Online]. Available:
perspectives on the use of machine learning to handle https://www.mdpi.com/2227-7390/10/14/2544
missing data in specific datasets. It emphasizes the poten- [16] C. E. Frangakis and D. B. Rubin, ‘‘Principal stratification in causal
tial of ensemble models and the ECOC framework as inference,’’ Biometrics, vol. 58, no. 1, pp. 21–29, Mar. 2002.
[17] A. N. Baraldi and C. K. Enders, ‘‘An introduction to modern missing data
a viable strategy for improving prediction accuracy and analyses,’’ J. School Psychol., vol. 48, no. 1, pp. 5–37, Feb. 2010.
robustness in missing data imputation. Moreover, it suggests [18] J. L. Schafer, ‘‘Multiple imputation: A primer,’’ Stat. Methods Med. Res.,
future research directions to enhance the performance of vol. 8, no. 1, pp. 3–15, Jan. 1999.
machine learning-based imputation methods, acknowledging [19] P. Lynn, ‘‘Multiple imputation for nonresponse in surveys,’’ 1988.
that missing data imputation is a complex challenge with [20] P. Cihan, ‘‘Deep learning-based approach for missing data imputation,’’
Eskişehir Teknik Üniversitesi Bilim ve Teknoloji Dergisi B Teorik Bilimler,
significant scope for advancement. vol. 8, pp. 336–343, Aug. 2020.
[21] C.-Y. Cheng, W.-L. Tseng, C.-F. Chang, C.-H. Chang, and S. S.-F. Gau,
REFERENCES ‘‘A deep learning approach for missing data imputation of rating scales
assessing attention-deficit hyperactivity disorder,’’ Frontiers Psychiatry,
[1] L. Bargelloni, O. Tassiello, M. Babbucci, S. Ferraresso, R. Franch,
vol. 11, Jul. 2020, doi: 10.3389/fpsyt.2020.00673.
L. Montanucci, and P. Carnier, ‘‘Data imputation and machine learning
improve association analysis and genomic prediction for resistance to fish [22] Y. Sun, J. Li, Y. Xu, T. Zhang, and X. Wang, ‘‘Deep learning
photobacteriosis in the gilthead sea bream,’’ Aquaculture Rep., vol. 20, versus conventional methods for missing data imputation: A review
Jul. 2021, Art. no. 100661. and comparative study,’’ Expert Syst. Appl., vol. 227, Oct. 2023,
[2] M. A. Munson, ‘‘A study on the importance of and time spent on Art. no. 120201. [Online]. Available: https://www.sciencedirect.com/
different modeling steps,’’ ACM SIGKDD Explor. Newslett., vol. 13, no. 2, science/article/pii/S0957417423007030
pp. 65–71, May 2012. [23] S. E. Awan, M. Bennamoun, F. Sohel, F. Sanfilippo, and G. Dwivedi, ‘‘A
[3] J.-U. Kietz, F. Serban, S. Fischer, and A. Bernstein, ‘‘‘Semantics inside!’ reinforcement learning-based approach for imputing missing data,’’ Neural
but let’s not tell the data miners: Intelligent support for data mining,’’ Comput. Appl., vol. 34, no. 12, pp. 9701–9716, Jun. 2022.
in Proc. Semantic Web, Trends Challenges, 11th Int. Conf. (ESWC), [24] A. Elasra, ‘‘Multiple imputation of missing data in educational production
Anissaras, Greece. Cham, Switzerland: Springer, May 2014, pp. 706–720. functions,’’ Computation, vol. 10, no. 4, p. 49, Mar. 2022.
[4] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer, ‘‘Wrangler: Interactive [25] M. Tada, N. Suzuki, and Y. Okada, ‘‘Missing value imputation method for
visual specification of data transformation scripts,’’ in Proc. SIGCHI Conf. multiclass matrix data based on closed itemset,’’ Entropy, vol. 24, no. 2,
Hum. Factors Comput. Syst., May 2011, pp. 3363–3372. p. 286, Feb. 2022.
[26] I. N. K. Wardana, J. W. Gardner, and S. A. Fahmy, ‘‘Estimation of missing LAILA IFTIKHAR received the Master of Science
air pollutant data using a spatiotemporal convolutional autoencoder,’’ (M.S.) degree in computer sciences (database)
Neural Comput. Appl., vol. 34, no. 18, pp. 16129–16154, Sep. 2022. from The University of Agriculture at Peshawar,
[27] C. M. França, R. S. Couto, and P. B. Velloso, ‘‘Missing data imputation in 2023. She is currently an IT Support Staff. Her
in Internet of Things gateways,’’ Information, vol. 12, no. 10, p. 425, research passion is AI-enabled dataset anomalies
Oct. 2021. and missing data detection and rectification.
[28] C. Ribeiro and A. A. Freitas, ‘‘A data-driven missing value imputation
approach for longitudinal datasets,’’ Artif. Intell. Rev., vol. 54, no. 8,
pp. 6277–6307, Dec. 2021.
[29] P. P. Singh, S. Prasad, B. Das, U. Poddar, and D. R. Choudhury,
‘‘Classification of diabetic patient data using machine learning tech-
niques,’’ in Ambient Communications and Computer Systems, G. M. Perez,
S. Tiwari, M. C. Trivedi, and K. K. Mishra, Eds. Singapore: Springer, 2018,
pp. 427–436.
[30] K. Maheswari and P. P. A. Priya, ‘‘Predicting customer behavior in online
shopping using SVM classifier,’’ in Proc. IEEE Int. Conf. Intell. Techn. MOHAMMAD FARHAD BULBUL received the
Control, Optim. Signal Process. (INCOS), Mar. 2017, pp. 1–5. Ph.D. degree from the Department of Informa-
[31] S. F. Crone, S. Lessmann, and R. Stahlbock, ‘‘The impact of preprocessing tion and Computing Science, Peking University,
on data mining: An evaluation of classifier sensitivity in direct marketing,’’ China. He was a Postdoctoral Researcher with the
Eur. J. Oper. Res., vol. 173, no. 3, pp. 781–800, Sep. 2006.
Department of Computer Science and Engineer-
[32] H. Pan, Z. Ye, Q. He, C. Yan, J. Yuan, X. Lai, J. Su, and R. Li, ‘‘Discrete
ing, Pohang University of Science and Technol-
missing data imputation using multilayer perceptron and momentum
gradient descent,’’ Sensors, vol. 22, no. 15, p. 5645, Jul. 2022. [Online]. ogy, South Korea. He is currently an Assistant
Available: https://www.mdpi.com/1424-8220/22/15/5645 Professor with the Department of Mathematics,
Jashore University of Science and Technology,
Bangladesh. His research interests include com-
puter vision, deep learning, pattern recognition, and image processing.