
Received 21 May 2024, accepted 30 May 2024, date of publication 10 June 2024, date of current version 1 July 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3411817

Machine Learning Based Missing Data Imputation in Categorical Datasets

MUHAMMAD ISHAQ 1,*, SANA ZAHIR 1,*, LAILA IFTIKHAR 1, MOHAMMAD FARHAD BULBUL 2, SEUNGMIN RHO 3, AND MI YOUNG LEE 4, (Member, IEEE)


1 Institute of Computer Sciences and Information Technology, The University of Agriculture at Peshawar, Peshawar, Khyber Pakhtunkhwa 25000, Pakistan
2 Department of Mathematics, Jashore University of Science and Technology, Jashore 7408, Bangladesh
3 Department of Industrial Security, Chung-Ang University, Seoul 06974, South Korea
4 Department of Research, Chung-Ang University, Seoul 06974, South Korea

Corresponding author: Mi Young Lee (miylee@cau.ac.kr)


This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the
Ministry of Education under Grant 2021R1I1A1A01055652.
∗ Muhammad Ishaq and Sana Zahir contributed equally to this work.

ABSTRACT In order to predict and fill in the gaps in categorical datasets, this research looked into the use of machine learning algorithms. The emphasis was on ensemble models constructed using the Error Correction Output Codes (ECOC) framework, including models based on SVM and KNN as well as a hybrid classifier that combines SVM-, KNN-, and MLP-based models. Three diverse datasets (the CPU, Hypothyroid, and Breast Cancer datasets) were employed to validate these algorithms. Results indicated that these machine learning techniques provided substantial performance in predicting and completing missing data, with effectiveness varying based on the specific dataset and missing data pattern. Compared to solo models, ensemble models that made use of the ECOC framework significantly improved prediction accuracy and robustness. Despite these encouraging results, deep learning for missing data imputation still faces obstacles, including the requirement for large amounts of labeled data and the possibility of over-fitting. Subsequent research endeavors ought to evaluate the feasibility and efficacy of deep learning algorithms in the context of missing data imputation.

INDEX TERMS Data cleansing, missing data imputation, classification, regression, categorical datasets.

I. INTRODUCTION

"Dirty data" describes raw data that is unprocessed, inconsistent, erroneous, or incomplete, or that has been tampered with. High-quality data is always the foundation for quality decisions, and conclusions drawn from analytical results derived from dirty data are untrustworthy. Consequently, raw data must first be cleaned before being utilized in any analytical process; it is not possible to use raw data directly in analytical methods. Data cleaning is an important part of information quality management. It aims to enhance the overall quality of data by locating and removing errors, omissions, and inconsistencies. This section provides an overview of the proposed technique and an introduction to its theoretical foundations [1].

As a result, preprocessing is required, as illustrated in Figure 1, before machine learning models can be trained or run on raw data. Even though it is necessary and inevitable, data preprocessing is a time-consuming and frustrating procedure. According to industry standards, data scientists typically devote more than half of their analysis time to this task, and those who use the relevant software at work are often not experts in it [2]. Because of this, data scientists are in great need of a tool that will assist them in automating the process [3].

The associate editor coordinating the review of this manuscript and approving it for publication was Chun-Wei Tsai.

Data preprocessing encompasses various tasks such as data cleaning, data integration, and data transformation [3]. It confronts common data challenges like outliers, lost or missing information, and inconsistent naming conventions. The main objective of data cleaning is to address these data problems. The key issues can be categorized as follows:
• Inconsistent column names: This includes inconsistency in the naming of columns on a case-by-case basis [2].

FIGURE 1. An overview of the machine learning process [3].

• Duplicate records: Instances where different or multiple records represent the same entry in the dataset.
• Redundant features: These are irrelevant attributes that contribute minimally to the model's construction and potentially extend training duration and increase overfitting risk [4].
• Missing data: These occur when no feature data values have been recorded. These are common and can substantially influence data interpretation [5].
• Outliers: In statistical analysis, outliers are observations that deviate significantly from others, potentially causing severe issues [6].

Data cleaning involves rectifying these issues, including filling in missing data, smoothing noisy data, identifying or removing outliers, and addressing inconsistencies. The ultimate goal is to develop a tool capable of resolving all the aforementioned problems. Previous research has primarily concentrated on commonly encountered challenges such as incorrect data types, lost data, and outliers [7].

Missing data often impedes useful investigations across various scientific domains. Although such research relies on subject cooperation, complete participation cannot be assumed due to data gaps. This paper defines "missing data" as instances where no data exists for the relevant variable.

Even the most carefully planned and conducted studies can yield incomplete results, a problem recognized in both scientific and corporate realms. Missing data complicates the interpretation and understanding of the phenomena under study. The absence of data compromises the validity of scientific research, as reliable conclusions are only drawn through a thorough analysis of complete datasets. Most scientific, commercial, and economic decisions are influenced or informed by published research findings. Hence, proper handling of missing data should be a priority [8], [9].

A. IMPUTATION OF MISSING DATA
Imputation is a technique applied to handle missing data. In this article, we extend the definition of imputation beyond that given by [10], which states, "Imputation is a comprehensive and flexible method for dealing with missing data." This technique involves predicting missing data based on the observed data distribution, commonly referred to as "drawing missing data from the estimated distribution through imputation." Imputation methods predict missing data by utilizing a function of auxiliary variables or predictors. Given its crucial role across various statistical domains, particularly in government statistics, imputation has been extensively discussed in the literature. This process is illustrated in Figure 2.
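As a toy illustration of predicting missing entries from auxiliary variables (our sketch, not code from the paper; it assumes pandas and scikit-learn, and the column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical dataset: 'income' has gaps; 'age' and 'hours' are auxiliary predictors.
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38, 29],
    "hours":  [40, 45, 38, 42, 40, 35],
    "income": [30_000, 42_000, np.nan, 58_000, np.nan, 33_000],
})

observed = df[df["income"].notna()]
missing = df[df["income"].isna()]

# Fit a predictor of the incomplete variable on the fully observed rows...
model = LinearRegression().fit(observed[["age", "hours"]], observed["income"])

# ...and draw the imputations from the fitted relationship.
df.loc[df["income"].isna(), "income"] = model.predict(missing[["age", "hours"]])
print(df)
```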
B. THE ONSET OF MISSING DATA
Missing data might result from human or machine error during sample processing, malfunctioning equipment, transcription issues, dropouts during follow-up and clinical studies, or respondents' unwillingness to answer a specific topic, as well as the combination of two fairly identical matches in a collection of data. This difference is also known as a "non-response." A programmed non-response occurs when some responses are available but not all are, due to programmed refusal, inability to attend, absence from home, or untracked situations. A respondent may choose not to answer a question. Imputation based on these representations can thus be used at two levels: unit and item non-response. Any variable that does not have a measurable value for the entire population should be estimated. Given the preceding levels, this article focuses on the item level of non-response. To clarify how to handle missing data, the aforementioned reasons have been organized into several "missing data mechanisms" [12].

C. MOTIVATION AND BACKGROUND
It is necessary to interpolate the missing data in order to complete the process, as data analysis cannot be performed on insufficient data sets. This step, if neglected, could lead to incorrect conclusions: missing data can result in undesirable outcomes, especially when it causes estimates to be skewed in the wrong direction. Although the method of interpolating missing data has been the subject of debate for decades, relatively few studies have examined the accuracy of the machine learning algorithms that are most commonly used to perform this task. There are numerous methods, practices, and procedures for handling, resolving, and filling in missing data [13]. The technique of interpolation is one of the practices discussed in this paper [14]; it is achieved via the application of machine learning algorithms. Appropriate estimation methods can be used to enhance the quality of the analyzed dataset and help make more informed healthcare decisions [1]. The state-of-the-art AI-enabled imputation was selected after extensive experimental work on all ensembles.


FIGURE 2. The process of missing data imputation [11].

II. LITERATURE REVIEW
A. DATA TYPE DISCOVERY TECHNIQUES
Identifying data types in the original dataset can be accomplished through various methods. Some approaches are straightforward and rely on basic statistics or heuristics. For example, to determine whether a feature is discrete or continuous, we can calculate the number of distinct values the feature takes and compare it to the total number of instances of the feature. However, more advanced or complex cases may require the use of machine learning models for accurate detection.

B. MANAGING MISSING DATA
Missing data, according to [8], is a prevalent problem that either goes unnoticed by scientists or is actively suppressed. To put it another way, researchers are aware of the missing data and are focused on proving why it is irrelevant to the specific study. Data is notable when it influences judgments and, ultimately, one's knowledge, both known and unknown. Missing data can have serious consequences for quantitative research, such as information loss, increased standard errors, decreased statistical power, biased parameter estimates, and reduced generalizability of study conclusions [15]. Unfortunately, one of the standard ways for scientists to deal with missing data is to delete cases using ad hoc methods like listwise or pairwise elimination, which usually leads to skewed estimates and/or criticism for being inefficient. Frangakis and Rubin [16] found that the most common cause of missing data in the NSI dataset is that respondents opt not to participate in the survey or decline to answer some questions (item non-response; unit non-response).

C. THEORY OF MCAR, MAR AND MNAR
As reported in [12], Rubin devised techniques to deal with the loss of any data point, splitting the missing data problem into three distinct missing data mechanisms. To put it another way, there are three kinds of missing data: "missing completely at random" (MCAR), "missing at random" (MAR), and "missing not at random" (MNAR). Baraldi and Enders [17] consider data MCAR if the likelihood of loss is constant across all cases; under MCAR, the cause of data loss is unrelated to the data itself. When a student in educational research moves to a different area in the middle of their undergraduate career, this is an example of MCAR: the missing data is MCAR if the source of the move is unrelated to any other variables in the dataset. MCAR is frequently not realistic for the data at hand.

Data are missing at random when there is an equal risk of absence inside each cluster defined by the observed data [17]. As a result, if the reason for a variable's missing inputs is unrelated to the variable itself, the problem may be linked to other observable variables. Despite the misleading name "random," the MAR process is not truly random, because it represents systematic missing data in which the bias is tied to other observable aspects of the analysis. When sampling a population, for example, the variance to be included is determined by some known property. MAR is a larger category than MCAR, and the MAR assumption is the foundation for the majority of recent strategies for dealing with missing data. Finally, if neither MCAR nor MAR holds true, the absence is considered non-random. Under MNAR, the likelihood of a value being missing changes for unknown causes; as a result, it depends on unobserved measurements, and the worth of an unseen reply is determined by facts that cannot be assessed. When asked about their spending patterns, students who frequently gambled at casinos, for example, tended to avoid the questions out of fear of getting into trouble; as a result, a model is unable to anticipate such data appropriately. MNAR is therefore the most difficult case.

Rubin's [12] distinction is critical in understanding why some solutions may not function as expected. The theory explains why missing-data approaches produce statistically sound findings, which in turn improves forecast accuracy and effectiveness. This research is built on MCAR data. Although this setting reduces statistical power, it offers the advantage of maintaining the study's goal because the estimated parameters are not influenced by the missing data.
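The three mechanisms can be made concrete by simulating them. The sketch below is our illustration rather than anything from the study; it assumes NumPy, and the variables are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
age = rng.uniform(18, 80, n)                      # fully observed covariate
spending = 100 + 2 * age + rng.normal(0, 10, n)   # variable we will puncture

# MCAR: every value has the same 20% chance of being lost, independent of the data.
mcar = np.where(rng.random(n) < 0.2, np.nan, spending)

# MAR: loss depends only on the *observed* covariate (older respondents skip more).
p_mar = np.clip((age - 18) / 100, 0, 1)
mar = np.where(rng.random(n) < p_mar, np.nan, spending)

# MNAR: loss depends on the unobserved value itself (high spenders stay silent).
p_mnar = np.clip((spending - spending.mean()) / spending.std() * 0.2 + 0.2, 0, 1)
mnar = np.where(rng.random(n) < p_mnar, np.nan, spending)

for name, col in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, f"missing rate = {np.isnan(col).mean():.2f}")
```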


D. PROPORTIONS OF MISSING DATA
Academics generally accept the distinction between missing data mechanisms, particularly given that it has been demonstrated to affect a strategy's efficacy; the acceptable rate of missing data, on the other hand, is less settled, and there are numerous points of view on the acceptable percentage of missing data in a dataset. According to Schafer [18], 5% or less is insignificant; hence, values should be imputed when 5% or more is missing. Bennett argues that values should be imputed when the amount of missing data exceeds 10%. As a result, even if only a small fraction of data is missing, a researcher may desire to impute it.

E. MULTIPLE IMPUTATION
To reduce imputation-induced bias, a method was proposed for averaging the results over multiple imputed data sets. Multivariate imputation basically consists of three steps. First, the incomplete data set's missing values are imputed m times; it should be noted that the estimates are drawn from a distribution. This step produces m complete data sets. The second step is to analyze each of the m complete data sets: the mean, variance, and confidence interval of the variable of interest are calculated. Finally, the results of the m analyses are combined into the final result. Multiple imputation is by far the most complex and popular method. Multiple Imputation by Chained Equations (MICE), which is based on the MCMC algorithm, is the most widely used method of multiple imputation; MICE takes the idea of regression one step further and exploits correlations among responses [19]. Despite promising results, there are still challenges with deep learning for missing data imputation, including the need for a significant amount of labeled data and the risk of overfitting; future studies should assess the practicability and performance of deep learning algorithms when it comes to data imputation [20], [21].

In one effort, two conventional methods, multiple imputation by chained equations (MICE) and missForest, were compared with the deep learning methods generative adversarial imputation networks (GAIN) with one-hot encoding, GAIN with embedding, variational auto-encoder (VAE) with one-hot encoding, and VAE with embedding. Three simulated datasets and seven genuine benchmark datasets are taken into consideration, covering a range of scenarios with varying feature types at varying sample sizes. Three types of missing mechanisms, namely missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR), as well as various missing ratios, are used to produce the missing data [22].

To explain the concept of MICE, consider using it to impute missing data in a simple dataset. Imagine that we have three attributes in our dataset: occupation, age, and income, each with missing data. MICE can be carried out in the following steps:
1) First, a simple imputation method, such as imputation by the mean, is used to fill in all of the missing data.
2) The imputed values for the occupation variable are then set back to missing.
3) Linear regression is used to predict the missing occupation data from age and income, based on all of the observed cases.
4) The values obtained in step 3 are used to impute the missing data for occupation; the occupation variable is no longer missing at this point.
5) Steps 2-4 are repeated for the age variable.
6) Steps 2-4 are repeated for the income variable.
7) The entire iterative process is repeated until the three variables converge.
Multiple imputation is designed for MAR but can also produce valid estimates under MNAR.
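One compact way to express this chained-equation loop in code is scikit-learn's IterativeImputer, which mirrors steps 1-7 above. This is a sketch under the assumption of numerically encoded variables, not the exact software used in the studies cited:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Toy matrix with the three variables from the example: occupation (encoded
# numerically for illustration), age, and income, each with holes.
X = np.array([
    [1.0,    25.0,   30_000.0],
    [np.nan, 32.0,   42_000.0],
    [2.0,    np.nan, 51_000.0],
    [1.0,    41.0,   np.nan],
    [3.0,    38.0,   46_000.0],
])

# Each variable is regressed on the others in turn, and the cycle repeats
# until the imputations stabilize (steps 2-7 above).
mice = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
print(mice.fit_transform(X))
```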
The authors of [23] propose an RL-based approach for estimating missing data, where RL stands for reinforcement learning. This method involves learning a strategy for empirically estimating data based on action rewards. The proposed method maintains the variance of the interpolated values by interpolating missing data in columns with different values, as opposed to working only within the same column (which is analogous to single-variate interpolation). The authors report that their method outperforms other interpolation strategies when applied to various datasets [23].

The method proposed in [24] employs multiple interpolation techniques using an iterative Markov chain Monte Carlo (MCMC) simulation method based on the Gibbs sampler algorithm. In earlier attempts, MCMC simulations were used, but only on relatively small data sets with a restricted number of variables. Consequently, an additional contribution of that paper is its application and comparison within a large longitudinal English education study with three iterative specifications. The simulation's results reveal how the algorithm eventually converges [24].

Using local feature spaces, the authors of [25] propose two closed-itemset-based methods, CIimpute and ICIimpute, to interpolate missing data for multiclass matrix data. CIimpute estimates the missing data using a closed item set extracted from each class; extending CIimpute with an attribute reduction procedure yields ICIimpute. The results of the experiments indicate that reducing the number of attributes significantly reduces the computation time and improves the interpolation precision. In addition, the results demonstrate that ICIimpute provides superior interpolation precision despite requiring a longer computation time compared to other methods [25].

This research proposes an autoencoder model that considers spatiotemporal factors to estimate missing data in air quality datasets. The model consists of one-dimensional convolutional layers that provide flexible coverage of air pollutants' spatial and temporal behavior. It incorporates data


from nearby stations to enhance predictions for data-deficient target stations, eliminating the need for additional components such as weather and climate data. The findings demonstrate that the method effectively fills in missing data from discontinuous or long-interval interrupted datasets. Compared to univariate interpolation techniques (most-common, median, and mean interpolation), the model achieves up to a 65% improvement in RMSE, and a 20-40% improvement compared to multivariate interpolation techniques (decision trees, extra trees, k-nearest neighbors, and Bayesian ridge regression). However, when adjacent sites have a negative or weak correlation, interpolation performance is diminished [26].

A new mechanism for predicting and estimating the amount of data lost in IoT gateways has been developed to achieve greater autonomy at the network's edge. In most cases, the computational resources on these gateways are limited; therefore, the interpolation method for missing data must be simple while still producing precise estimates. In light of this, the authors of [27] propose two neural network-based regression models to estimate the missing data in IoT gateways, considering not only the precision of the prediction but also the time required to execute the algorithm and the total amount of memory consumed. The authors validated their models by utilizing six years' worth of Rio de Janeiro weather data, varying the percentage of missing data, and running the models. The results indicate that the neural network regression models outperform the other investigated interpolation techniques, which are based on the mean and on the repetition of previous values, and this holds for all missing data percentages. In addition, the neural network models can run on IoT gateways due to their relatively short execution times and low memory requirements [27].

The authors of [28] propose a data-driven interpolation method for missing data that identifies the optimal interpolation technique. This method uses the information already known about the dataset to rank five chosen methods based on their respective estimated error rates. In evaluating the proposed methods, the authors utilized both a classifier-independent scenario, where they compared the applicability and error rate of each interpolation method, and a classifier-dependent scenario, where they compared the prediction accuracy of a random forest classifier using datasets prepared with each interpolation method and a baseline method without interpolation. In the classifier-independent scenario, they assessed each interpolation technique's applicability and error rate, allowing the classification algorithm to handle missing data internally.

Based on the results of these two experimental sets, the authors conclude that the proposed data-driven interpolation method typically results in more accurate estimates of missing data and improved classifier performance in a longitudinal dataset of human aging. Additionally, the authors note that estimates derived from interpolation techniques specifically designed for longitudinal data are extremely precise. This finding supports the idea that exploiting the temporal information inherently present in longitudinal data is beneficial for machine learning applications, which can be effectively achieved using the proposed data-driven methods [28].

F. SUMMARY OF TECHNIQUES
Missing data is unavoidable when handling any amount of medical data, and being able to build prognosis and prediction models on data sets with substantial amounts of missing data would be an advantage to researchers. A data set has been simulated to be used in predicting patient lifetimes via an artificial neural network. Various levels of missingness were then simulated, and the missing data were imputed by a variety of methods. FAMD stands for "Factor Analysis for Mixed Data"; this technique is used to analyze data that contains both continuous and categorical variables. It is a development of factor analysis, which finds underlying patterns in data; FAMD handles categorical variables by turning them into dummy variables. The performance of MICE is better than that of FAMD. The lifetime-prediction ANNs were then applied to the imputed data, and the results were compared across the different amounts of missing data. The conclusion of this article is that MICE without pooling, MICE with imputed pooling, and MICE with non-imputed pooling all have similar performance. missForest had significantly lower misclassification and loss rates. MICE with non-imputed pooling has the highest theoretical accuracy of the MICE algorithms, and the associated R package has a large degree of tenability. (Table 1 describes the dataset, showing the missing data percentages for each attribute.) It is therefore the recommendation set forth here that imputation of data sets for ANN lifetime predictions be implemented using one of these two methods, with the weight of the suggestion favoring the missForest algorithm, particularly for data sets with a high degree of missing data.

III. PROPOSED METHODOLOGY
In this research, we discuss the development of a Python-based missing data imputation system that provides automated, data-driven support to help users clean their data efficiently. Any Integrated Development and Learning Environment (IDLE) can be used. The proposed model aims to improve data quality in order to train better machine learning models. There are ways to solve a wide range of data problems, but, to be clear, the main concern here is the automatic handling of missing data. In this section, the suggested method is discussed.

We explain all the steps followed to develop an automatic method for handling missing data efficiently and accurately. A benchmark dataset will be used to validate the effectiveness of the proposed models for missing data imputation. The data will be preprocessed in order to select the best attributes for handling missing data. This is a simple kind of task that does not require any complex operations, like selecting features using an optimization technique or applying some kind of feature extraction technique to extract hidden information from the data.


TABLE 1. Description of the dataset with missing data percentages for each attribute.

In a machine learning-based project, the dataset is resampled by using a cross-validation technique. The cross-validation method is used to create a training set and a test set from the original dataset. Three state-of-the-art cross-validation techniques are widely used for model performance evaluation and parameter tuning of the proposed classification model. The section below explains the methodology that we follow in this research.

A. CLASSIFIERS AND REGRESSION MODEL USED FOR MISSING DATA PREDICTION
1) SUPPORT VECTOR MACHINE
The support vector machine (SVM) is a linear model that classifies data into only two categories. The SVM model uses a hyperplane to divide the two classes with a straight line. Due to the linear nature of SVM, it is not possible to classify more than two classes of data directly. In recent years, a framework-based SVM capable of classifying multi-class data has been developed: the ensemble learning method is applied to the linear model, so more than one SVM model is used to train a model for a multi-class classification problem [29]. Sequential minimal optimization is a state-of-the-art SVM framework for multiclass problem classification [30].

2) K-NEAREST NEIGHBOR
The K-nearest neighbor (KNN) classifier is a lazy classifier because it is an instance-based learner, which means that the K-NN model does not have a training phase. It uses a similarity measure technique, which is considered an unsupervised method because no labels are required and there is no training mode. Euclidean distance is the most popular and widely used method for finding the similarity between data points [26].

3) RANDOM FOREST
Random forest is an ensemble method that uses many decision trees. In supervised learning, the decision tree model is considered the simplest and most efficient classification model. When the dataset size is small, a decision tree model achieves higher accuracy; a small dataset size refers to fewer records and fewer attributes in a dataset. In the random forest model, we have many classifiers, so a voting scheme is used to select the final output class for a data set. The voting is performed using the mode function, which assigns to the test data the class label predicted by most of the classifiers [28].
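A minimal sketch of this mode-based voting over heterogeneous base learners, assuming scikit-learn (whose VotingClassifier with voting='hard' returns the majority class); the dataset and hyperparameters here are placeholders rather than the authors' exact configuration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hard voting returns the class predicted by the majority of the base
# models, i.e. the mode of their individual predictions.
ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="rbf")),
        ("knn", KNeighborsClassifier(n_neighbors=1)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(5, 2), max_iter=2000,
                              random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```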
B. PROPOSED FRAMEWORK
The dataset can be loaded into any modern Python-based Integrated Development and Learning Environment (IDLE). The dataset attributes will be checked to see whether there are any missing numerical or nominal values. The methodology differs between the two cases: for the prediction of nominal values, classification models will be used in a supervised learning approach, while for estimating numerical values, a regression model will be used. We must keep the attribute that has missing data as the class attribute, which is predicted from the remaining attributes (the predictors). Cross-validation methods are used for splitting a dataset into two subsets: the training set and the test set. For this purpose, three state-of-the-art cross-validation methods will be used: hold-out with 70% of the data for training and 30% for testing, K-fold with a k value of 10, and the leave-one-out method. The models will be validated using the out-of-sample data to evaluate the performance of both kinds of models. The performance of the classifiers will be evaluated using accuracy, precision, recall, and F-measure, while the performance of the regression model will be evaluated using root mean square error.
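The three resampling schemes can be sketched as follows, assuming scikit-learn; the dataset and base model are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Hold-out: 70% of the rows train the model, the remaining 30% test it.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=0)
holdout = model.fit(X_tr, y_tr).score(X_te, y_te)

# K-fold with k = 10: ten rotating train/test partitions.
kfold = cross_val_score(model, X, y,
                        cv=KFold(n_splits=10, shuffle=True,
                                 random_state=0)).mean()

# Leave-one-out: each single row serves once as the test set.
loo = cross_val_score(model, X, y, cv=LeaveOneOut()).mean()

print(f"hold-out={holdout:.3f}  10-fold={kfold:.3f}  LOO={loo:.3f}")
```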
An overview of the proposed framework for predicting missing nominal and numerical values in a categorical dataset using random forest, SVM, and KNN classifiers follows:
• Preprocessing: The first step in the framework is to preprocess the dataset to prepare it for imputation. This includes handling any missing data present in the target variable (the variable with missing data that you wish to impute), as well as any other missing data in the dataset. It may also involve one-hot encoding categorical variables or standardizing numerical variables.
• Splitting the data: Next, the dataset is split into training and testing sets. The training set is used to train the machine learning models, while the testing set is used to evaluate their performance.
• Training the models: The machine learning models (random forest, SVM, and KNN) are then trained on the


training set. This involves fitting the models to the data and adjusting the model parameters to optimize their performance.
• Testing the models: The trained models are then evaluated on the testing set to assess their performance in predicting the missing data. This may involve calculating evaluation metrics such as accuracy, precision, or recall.
• Selecting the best model: The performance of the models is compared, and the best-performing model is selected as the final model to be used for imputation.
• Imputing missing data: The final model is then used to impute the missing data in the target variable. This may involve using the model to predict the missing data for each sample in the dataset or using a more complex approach such as multiple imputation.
• Evaluation: The imputed dataset is then evaluated to assess the quality of the imputed values and the overall performance of the imputation process. This may involve comparing the imputed values to the true values (if available) or using other evaluation metrics such as imputation accuracy or fidelity to the original distribution of the data.
The proposed framework is shown in Figure 3.

FIGURE 3. Proposed framework for missing data predictions.
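Taken together, these steps amount to treating the incomplete attribute as a prediction target. The sketch below is our reading of the framework, assuming pandas and scikit-learn; the helper impute_nominal and the column names are hypothetical:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def impute_nominal(df: pd.DataFrame, target: str) -> pd.DataFrame:
    """Fill missing values of a nominal attribute by classifying it
    from the remaining attributes, as outlined in the steps above."""
    known = df[df[target].notna()]
    unknown = df[df[target].isna()]
    if unknown.empty:
        return df
    features = df.columns.drop(target)
    # One-hot encode the predictors so categorical columns are usable,
    # aligning the columns of the two partitions.
    X_known = pd.get_dummies(known[features])
    X_unknown = pd.get_dummies(unknown[features]).reindex(
        columns=X_known.columns, fill_value=0)
    model = RandomForestClassifier(random_state=0).fit(X_known, known[target])
    out = df.copy()
    out.loc[out[target].isna(), target] = model.predict(X_unknown)
    return out

# Hypothetical categorical dataset with a gap in 'breast' (cf. Table 4).
toy = pd.DataFrame({
    "age": ["40-49", "50-59", "30-39", "40-49"],
    "menopause": ["premeno", "ge40", "premeno", "ge40"],
    "breast": ["left", "right", None, "left"],
})
print(impute_nominal(toy, "breast"))
```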
1) DATASET
This subsection describes the datasets used in this research, drawing on several public datasets collected for the evaluation of categorical anomaly detection.

TABLE 2. Summary of the three medical domains: number of examples, number of classes, number of attributes, and average number of values per attribute.

a: PROGNOSIS OF BREAST CANCER RECURRENCE
The Prognosis of Breast Cancer Recurrence dataset is a medical dataset that contains information on breast cancer patients and whether or not their cancer has recurred. The dataset may include information such as patient demographics, tumor characteristics, treatment details, and follow-up information. The goal of the dataset is to predict the likelihood of breast cancer recurrence in patients, which can help inform treatment decisions and improve patient outcomes.

It is not uncommon for datasets in the medical field to have missing data, as it may be difficult to collect complete information for all patients. Therefore, imputing missing data may be necessary in order to accurately analyze the data and make reliable predictions. The specific details of the Prognosis of Breast Cancer Recurrence dataset, including the variables and the percentage of missing data, may vary depending on the source of the dataset. The recurrence class, as demonstrated in the tables, indicates the reappearance of the cancer; in the non-recurrence class, the tumor or infection is wiped out. The domain is characterized by 2 decision classes and 9 attributes. The set of attributes is incomplete because it is not sufficient to fully distinguish cases with different outcomes. At 5 years postoperatively, data were available for 286 patients with known diagnostic status. The five specialists who evaluated the cases gave the correct prognosis in 64 percent of the cases. Table 2 shows the number of examples and attributes and the average number of values in the three medical domains.

b: HYPOTHYROID DATASET
The Hypothyroid dataset is a medical dataset that contains information on patients with hypothyroidism, a condition in which the thyroid gland does not produce enough hormones. The dataset may include information such as patient demographics, symptoms, laboratory test results, and treatment details. The goal of the dataset is to predict the likelihood of a patient having hypothyroidism, which can help diagnose and treat the condition. The hypothyroid dataset consists of data collected from thyroid patients, covering four classes: negative, compensated hypothyroid, primary hypothyroid, and secondary hypothyroid. The data consists of 3771 instances comprising features and class attributes. The total number of attributes in the hypothyroid dataset is 30, where the first attributes are the input (features) to the model and the last attribute is the class attribute in the predictive model's output.


TABLE 3. Some parameters for the experimental work.

c: CPU DATASET
The CPU dataset from Weka is a machine-learning dataset that contains information on computer hardware components. The dataset includes information on the speed, memory size, and other characteristics of CPUs, as well as their price. The goal of the dataset is to predict the price of a CPU based on its characteristics. The CPU dataset from Weka does not typically have missing data, as it is a synthetic dataset that was generated for the purpose of demonstrating machine learning techniques.

However, in real-world datasets, it is not uncommon to have missing data due to incomplete data collection or other factors. In these cases, imputing missing data may be necessary in order to accurately analyze the data and make reliable predictions. The CPU dataset consists of the numeric attributes MYCT, MMIN, MMAX, CACH, CHMIN, and CHMAX, plus a numeric class attribute. The data consists of 209 instances comprising features and class attributes. The total number of attributes in the CPU dataset is 17, where attributes 1 to 16 are the input (features) to the model, while the last attribute is the class attribute in the predictive model's output.

2) PARAMETERS FOR SIMULATION
Table 3 shows the parameters for the proposed model simulation. Imputation of data will be performed using three different types of datasets. The types of classifiers and performance evaluation metrics are also mentioned.

a: ACCURACY
Accuracy describes how close an experimental measurement is to the true value, while precision describes closeness to an accepted standard. For example, a computer can perform a math calculation that is correct given the supplied information but does not match the exact value [31]. Accuracy is calculated through equation (1):

ACC = (TP + TN) / (TP + TN + FP + FN)    (1)

b: PRECISION AND RECALL
The performance of a categorization or information retrieval system is measured using two metrics: precision and recall. Precision is defined as the proportion of relevant samples among all selected samples, while the proportion of selected samples among all relevant samples is known as recall [31]. Precision and recall can be calculated as shown in equations (2) and (3):

Precision = TP / (TP + FP)    (2)
Recall = TP / (TP + FN)    (3)

c: F-MEASURE
The F-measure combines the precision and recall scores, assigning equal weight to each. It enables the use of a single score to evaluate the model while taking into account both its precision and recall, which is useful when describing the model's performance and comparing models [31]. A general formula for the F-measure is as follows:

F-Measure = 2 × (Precision × Recall) / (Precision + Recall)    (4)
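Equations (1)-(4) translate directly into code. The sketch below computes all four metrics from raw confusion-matrix counts; the example counts echo Table 4 (reading the right-breast class as positive) and are otherwise illustrative:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F-measure from confusion counts,
    following equations (1)-(4)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)                   # eq. (1)
    precision = tp / (tp + fp)                                   # eq. (2)
    recall = tp / (tp + fn)                                      # eq. (3)
    f_measure = 2 * precision * recall / (precision + recall)   # eq. (4)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Counts mirroring Table 4: 53 of 116 right-breast and 70 of 131
# left-breast samples classified correctly.
print(classification_metrics(tp=53, tn=70, fp=61, fn=63))
```

With these counts the overall accuracy works out to 123/247, roughly 0.498, which is consistent with the average accuracy reported for the random forest classifier below.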
IV. EXPERIMENTAL RESULTS
In this section, we present the experimental results of our study on the prediction and imputation of nominal and numeric missing data. We evaluated the performance of several machine learning algorithms, including random forest, SVM, and KNN, on a variety of datasets with different levels of missing records. Our goal was to assess the effectiveness of these algorithms in accurately predicting and imputing the missing data, as well as to identify any patterns or trends in their performance. To evaluate the performance of the algorithms, we used a range of evaluation metrics, including accuracy, precision, and recall. We also conducted a detailed analysis of the imputed values, including comparisons to the true values (if available) and analyses of the distribution and statistical properties of the imputed data. Overall, our results show that the machine learning algorithms were able to achieve good performance in predicting and imputing the missing data, with some variations depending on the specific dataset and missing data pattern. In the following sections, we present the results in more detail and discuss their implications and limitations.

During the experiments, two different bagging-based ensemble classifiers were created and simulated. The first ensemble is a combination of linear regression, K-nearest neighbor, and multilayer perceptron models [32], and the second ensemble is a random forest classifier that groups together several decision tree models. MLP models have the ability to learn complex relationships among the data points, so these models are more effective in the imputation of missing values.


TABLE 4. Performance of random forest classifier.

TABLE 5. Confusion matrix achieved using the random forest classifier.

TABLE 6. Classification model detailed performance.

FIGURE 4. Performance comparisons of the random forest and bagging-mix models.

TABLE 7. Confusion matrix of random forest classifier.

In some experiments with medical datasets, the MLP achieves very high accuracy. MLP models are comparatively easy to train and implement, and they can handle all kinds of datasets. Overfitting in the training phase can be overcome through a proper validation set and L1 or L2 regularization; in fact, the MLP classifier and regressor use the parameter alpha for L2 regularization. For large datasets, a dropout layer is also used. The hidden layer sizes can be adjusted, for example to (5, 2). The MLP classifier can predict new samples (missing values) on the basis of past classification experience; MLP also has regression variants, and the MLP classifier can even predict the probability of missing values.
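As a sketch of these knobs, assuming scikit-learn's MLPClassifier (whose alpha parameter is the L2 penalty); the data here is synthetic rather than one of the study's datasets:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Two hidden layers of sizes (5, 2); alpha sets the L2 regularization
# strength used to curb overfitting on the training split.
mlp = MLPClassifier(hidden_layer_sizes=(5, 2), alpha=1e-2,
                    max_iter=2000, random_state=0)
mlp.fit(X_tr, y_tr)

print("validation accuracy:", mlp.score(X_val, y_val))
print("class probabilities for one sample:", mlp.predict_proba(X_val[:1]))
```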
A. SIMULATION OF BREAST DATASET
1) RANDOM FOREST CLASSIFIER
The confusion matrix in Table 4 displays the results of the missing data prediction for the breast type attribute using the random forest classifier. The breast type attribute has missing data, and the confusion matrix shows how well the classifier was able to predict the missing data. In the confusion matrix, the rows represent the true values (i.e., the actual values of the breast type attribute), and the columns represent the predicted values (i.e., the values predicted by the classifier). The diagonal elements of the matrix represent the number of samples that were correctly classified, while the off-diagonal elements represent the number of samples that were misclassified.

As shown in Table 4, there were 116 samples for the right breast and 131 samples for the left breast. The classifier was able to correctly classify 53 of the right breast samples and 70 of the left breast samples. This represents a classification accuracy of 46% for the right breast and 53% for the left breast.

Tables 4 and 5 present the results of a detailed performance analysis of the machine learning algorithms for predicting and imputing missing data. The evaluation metrics used include accuracy, true positive rate, false positive rate, precision, recall, and F1-score. These metrics provide different insights into the performance of the algorithms and can be useful for comparing their effectiveness.

The results in Tables 4 and 5 show that the random forest classifier had an average accuracy of 49% and an F1-score of 49.8%. The F1-score is a balance between precision and recall, and it is a common metric for evaluating the performance of classification algorithms. In order to ensure the reliability of the results, we used cross-validation with a value of k = 10 throughout the experiment. This means that the data was split into 10 folds and the algorithms were trained and evaluated on different combinations of the folds.

2) BAGGING-MIX
The bagging-mix classifier is an ensemble method that combines the predictions of three classification models: a support vector machine (SVM), a K-nearest neighbor (KNN) classifier, and a multilayer perceptron (MLP). To train the SVM model, we applied the radial basis function (RBF) kernel to the 2D feature map, which is a transformation of the data that allows the model to learn nonlinear relationships. The feature map was then converted into a 3D feature map, which is used to make predictions. For the KNN classifier, we set the number of neighbors used for training to 1. This means that the classifier will make predictions based on the closest single neighbor to each sample.

Finally, the bagging-mix classifier uses a voting system to combine the output of the three separate classifiers and make a final prediction for the missing data. This can help

TABLE 8. Bagging-mix model detailed performance analysis.

TABLE 9. Classification model detailed performance.

FIGURE 5. Performance comparison of ensemble regression model using the CPU dataset.

TABLE 10. Classification model detailed performance.

FIGURE 6. Performance comparison using random forest and bagging-mix model using the hypothyroid dataset.

to improve the accuracy and robustness of the model by leveraging the strengths of different classifiers.

The confusion matrix produced by contrasting the actual values with the predicted missing data may be found in Table 7. There are 55 cases correctly predicted for the right class, whereas the accuracy for the left class is 58%.

Table 8 shows the accuracy and root mean square error over all instances in one dataset with 286 instances. Table 8, which presents a thorough performance analysis utilizing a range of assessment measures, includes accuracy, true positive rate, false positive rate, precision, recall, and F1-score. The average accuracy of the random forest classifier for predicting missing data was 57.48%, and it obtained an F1-score of 57.5%. The value of k stays at 10 throughout the experiment for the purpose of cross-validation.

3) SIMULATION OF CPU DATASET
This section describes the results achieved from the simulation of the CPU dataset using different classifiers.

a: SVM-BAGGING ENSEMBLE
The SVM-based ensemble model is an ensemble method that combines the predictions of several linear support vector machine (SVM) models using a mean voting system. This creates an ensemble regression model that can be used to predict continuous values, such as the missing data in our study. The SVM-based ensemble model is based on the error correction output codes (ECOC) method, which is a technique for constructing ensemble classifiers by combining the predictions of multiple classifiers. In the ECOC method, several linear SVM models are used to make predictions, and the results are pooled by taking the average of all the predictions made by the individual SVMs.

The ECOC method is a cutting-edge approach that has been shown to be effective in improving the accuracy and robustness of ensemble classifiers. In our study, we used the ECOC method to generate the SVM-based ensemble model, which was then used to predict the missing data in the dataset.

An RMSE of 32.27 was attained with the bagging SVM regression model proposed for numerical values. Other performance analysis measures, such as the correlation coefficient, mean absolute error, and root relative squared error, are utilized in the validation process of the model; these evaluations yield the values 0.64, 14.37, and 77.61, respectively.
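A sketch of such a mean-pooled bagging regressor, together with the error measures reported above, assuming scikit-learn 1.2+ (BaggingRegressor averages its base learners' predictions); the data and base learner are placeholders, so the numbers will not match the paper's:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR

X, y = make_regression(n_samples=300, n_features=6, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Several linear SVM regressors trained on bootstrap samples; their
# predictions are pooled by averaging, as described above.
bag = BaggingRegressor(estimator=LinearSVR(max_iter=10_000),
                       n_estimators=10, random_state=0).fit(X_tr, y_tr)
pred = bag.predict(X_te)

# The four error measures used in this section.
rmse = np.sqrt(mean_squared_error(y_te, pred))
mae = mean_absolute_error(y_te, pred)
corr = np.corrcoef(y_te, pred)[0, 1]
rrse = np.sqrt(np.sum((y_te - pred) ** 2) / np.sum((y_te - y_te.mean()) ** 2))
print(f"RMSE={rmse:.2f} MAE={mae:.2f} r={corr:.2f} RRSE={rrse:.2f}")
```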
b: KNN-BAGGING ENSEMBLE
The KNN-based ensemble model is an ensemble method that combines the predictions of several lazy K-nearest neighbor (KNN) classifier models using a mean voting system. This creates an ensemble regression model that can be used to predict continuous values, such as the missing data in our study. The KNN-based ensemble model is likewise based on the error correction output codes (ECOC) method: several KNN models are used to make predictions, and the results are pooled by taking the average of all the predictions made by the individual KNNs. In our study, we used the ECOC method to generate the KNN-based ensemble model, which was then used to predict the missing data in the dataset. Table 6 contains the RMSE of 38.16 obtained with the bagging KNN regression model proposed for numerical values. Other performance analysis measures, such as the correlation coefficient, mean absolute error, and root relative squared error, are utilized in the validation process of the model; these evaluations yield the values 0.51, 18.11, and 91.78, respectively.


TABLE 11. Confusion matrix of random forest classifier.

TABLE 12. Results achieved by the bagging-mix classifier.

TABLE 13. Random forest regression model detailed performance.

c: RANDOM FOREST-BAGGING ENSEMBLE
The random forest bootstrapping algorithm is a machine learning technique that combines decision trees and ensemble learning methods to improve the accuracy and robustness of predictions. It works by generating multiple decision trees from a dataset using a process called bootstrapping, which involves randomly selecting a portion of the data and using it to train the trees. Figure 5 shows the ensemble performance comparison using the CPU dataset. The individual decision trees are then averaged together to produce a final prediction or classification. This process is known as ensemble learning, and it relies on the assumption that the errors made by each tree will be distinct from one another, resulting in more accurate overall predictions. One of the key benefits of the random forest bootstrapping algorithm is that it can handle large and complex datasets, and it is often used for tasks such as classification and regression. In our study, we used the random forest bootstrapping algorithm to predict and impute missing data in the dataset. The RMSE of the proposed ensemble random forest is 28.13, attained with the bagging regression model proposed for numerical values. Other performance analysis measures, such as the correlation coefficient, mean absolute error, and root relative squared error, are utilized in the validation process of the model; these evaluations yield the values 0.73, 13.33, and 67.67, respectively.

4) SIMULATION ON HYPOTHYROID DATASET
This section includes the simulation results achieved by different classifiers for the hypothyroid dataset.

a: SIMULATION USING BAGGING-MIX CLASSIFIER
Accuracy, true positive rate, false positive rate, precision, recall, and F1-score are among the performance evaluation metrics included in Table 13, which contains a detailed performance analysis carried out using a variety of assessment metrics. The random forest classifier for predicting missing data attained an accuracy of 69% on average and received an F1-score of 65.1%. For the sake of cross-validation, the value of k remains constant at 10 throughout the experiment.

The same metrics are likewise included in Tables 4 and 7, which contain a detailed performance analysis carried out using a variety of assessment metrics. The random forest classifier for predicting missing data attained an accuracy of 70.28% on average and received an F1-score of 68%, with k again held at 10 for cross-validation, as shown in Table 14 and Figure 6. The bagging-mix classifier's class-wise accuracy is reported in Table 15, and the performance comparison between the random forest and bagging-mix models on the hypothyroid dataset is shown in Figure 6. COSMIC is a large repository of cancer datasets.

The accuracy, precision, recall, and F1-score are all evaluation metrics that are used to measure the performance of a machine learning model. The accuracy is the proportion of correct predictions made by the model, while the precision is the proportion of correct positive predictions among all the positive predictions made by the model. The recall is the proportion of correct positive predictions among all the actual positive samples in the dataset, and the F1-score is a balance between precision and recall. As shown in Table 15, both the random forest classifier and the bagging-mix models had good performance, with the bagging-mix models having slightly higher values for the evaluation metrics. These results indicate that both algorithms were effective in predicting and imputing the missing data in the dataset.


TABLE 14. Bagging mix classification model detailed performance using the hypothyroid dataset.

TABLE 15. The results of the performance evaluation for the random [5] I. Mehmood, M. Sajjad, K. Muhammad, S. I. A. Shah, A. K. Sangaiah,
forest classifier and the bagging mix models. M. Shoaib, and S. W. Baik, ‘‘An efficient computerized decision support
system for the analysis and 3D visualization of brain tumor,’’ Multimedia
Tools Appl., vol. 78, no. 10, pp. 12723–12748, May 2019.
[6] F. E. Grubbs, ‘‘Procedures for detecting outlying observations in samples,’’
Technometrics, vol. 11, no. 1, p. 1, Feb. 1969.
[7] S. Agarwal, ‘‘Data mining: Data mining concepts and techniques,’’ in Proc.
indicate that both algorithms were effective in predicting and Int. Conf. Mach. Intell. Res. Advancement, Dec. 2013, pp. 203–207.
imputing the missing data in the dataset. [8] P. E. McKnight, K. M. McKnight, S. Sidani, and A. J. Figueredo, Missing
Data: A Gentle Introduction. New York, NY, USA: Guilford Press, 2007.
[9] M. Liu, S. Li, H. Yuan, M. E. H. Ong, Y. Ning, F. Xie, S. E. Saffari,
V. CONCLUSION

In summary, this research explored the use of machine learning algorithms to predict and impute missing data in categorical datasets, employing three distinct datasets, namely CPU, Hypothyroid, and Breast Cancer, and various ensemble models built on the Error Correction Output Codes (ECOC) framework. Recurring and non-recurring missing, null, or infinite values are a major issue in all kinds of datasets. The study demonstrated satisfactory performance of these algorithms in predicting and imputing missing data, with the ensemble models within the ECOC framework notably enhancing prediction accuracy and robustness. However, the study's limitations include its narrow focus on selected algorithms and datasets and the fact that algorithm performance can be influenced by specific data characteristics and missing data patterns.

Despite these limitations, our research provides insightful perspectives on the use of machine learning to handle missing data in specific datasets. It emphasizes the potential of ensemble models and the ECOC framework as a viable strategy for improving prediction accuracy and robustness in missing data imputation. Moreover, it suggests future research directions to enhance the performance of machine learning-based imputation methods, acknowledging that missing data imputation is a complex challenge with significant scope for advancement.
LAILA IFTIKHAR received the Master of Science (M.S.) degree in computer sciences (database) from The University of Agriculture at Peshawar, in 2023. She is currently an IT support staff member. Her research passion is AI-enabled dataset anomaly and missing data detection and rectification.

MOHAMMAD FARHAD BULBUL received the Ph.D. degree from the Department of Information and Computing Science, Peking University, China. He was a Postdoctoral Researcher with the Department of Computer Science and Engineering, Pohang University of Science and Technology, South Korea. He is currently an Assistant Professor with the Department of Mathematics, Jashore University of Science and Technology, Bangladesh. His research interests include computer vision, deep learning, pattern recognition, and image processing.

MUHAMMAD ISHAQ received the Ph.D. degree (Hons.) in computer science from Harbin Engineering University, in 2012, as a HEC Scholar. He has several years of professional experience, having served at various well-reputed universities, with a total of almost twelve (12) years of postdoctoral teaching experience. He has organized and attended several national and international conferences, workshops, and seminars. He has also actively contributed to launching new programs and enhancing the curricula of existing programs; in a successful convener role, he secured approval of the BS (artificial intelligence) and BS (bioinformatics) programs from university statutory bodies. Recently, he has been involved in a more challenging computerization task at the university. He successfully managed and promoted the Coursera-Based Higher Education Commission's (HEC) Digital Learning and Skills Enrichment Initiative (DLSEI). He has authored and published several high-quality research manuscripts with significant scientific contributions in reputed international journals. Besides his research contributions, he has successfully supervised numerous undergraduate and graduate research theses. As a young researcher, he has been active in submitting several research and institutional projects to various funding agencies.

SANA ZAHIR received the M.S. degree in computer science from Islamia College, Peshawar, Pakistan. She is currently pursuing the Ph.D. degree with the Institute of Computer Sciences and Information Technology, The University of Agriculture at Peshawar, Peshawar, where she is also a Lecturer. Her primary research interests include machine learning and deep learning, especially in computer vision applications, encompassing human behavior understanding through facial expression analysis and techniques in crowd counting.

SEUNGMIN RHO is currently a Professor with the Department of Industrial Security, Chung-Ang University. His current research interests include databases, big data analysis, music retrieval, multimedia systems, machine learning, knowledge management, and computational intelligence. He has published 300 papers in refereed journals and conference proceedings in these areas. He has been involved in more than 20 conferences and workshops as various chairs and in more than 30 conferences/workshops as a program committee member. He has edited a number of international journal special issues as a Guest Editor, for journals such as Multimedia Systems, Information Fusion, and Engineering Applications of Artificial Intelligence.

MI YOUNG LEE (Member, IEEE) received the M.S. and Ph.D. degrees from the Department of Image and Information Engineering, Pusan National University. She was a Research Professor with Sejong University. Currently, she is a Research Professor with Chung-Ang University, conducting research as a Senior Researcher with the Industrial Security Research Center. She works broadly in artificial intelligence, computer vision, image processing, and energy informatics. She has carried out several research projects successfully, is a Principal Investigator of several ongoing research projects under the supervision of the Korean Government, and has filed more than 13 patents during her career. Her particular research interests include video summarization, movie data analysis, re-identification, electrical energy forecasting, and video retrieval. She has published several novel contributions in these areas in reputed journals and peer-reviewed conference proceedings, including IEEE ACCESS, MDPI Sensors, Multimedia Tools and Applications (Springer), and the 2020 International Joint Conference on Neural Networks.
