
Nigercon Abuad IEEE 2024


Advanced Feature Engineering and Sampling Technique for Effective Heart Disease Prediction System using Machine Learning Technique

Abstract— Predicting heart disease is crucial in healthcare, where accurate detection can significantly impact patient outcomes. A major challenge in prediction is imbalanced data, with underrepresented samples in the minority class leading to biased models. This study addresses this by combining advanced feature engineering with SMOTE (Synthetic Minority Over-sampling Technique) to manage class imbalance. The objectives include refining data through feature engineering, building predictive models using machine learning algorithms, and assessing performance metrics. Algorithms such as Random Forest, SVM, Decision Tree, Logistic Regression, and K-Nearest Neighbor are applied. Results show that robust feature engineering notably enhances performance, with Random Forest achieving the highest accuracy (0.9604), followed by SVM (0.954) and Decision Tree (0.947). The findings highlight the role of feature engineering and sampling in improving heart disease prediction models, with meaningful implications for healthcare advancements.

Keywords— Data Mining, Coronary, SMOTE, Algorithms, Class imbalance

I. INTRODUCTION

The healthcare sector faces several challenges, including poor service quality due to misdiagnosis and ineffective treatments. High-quality care depends on accurate diagnosis and effective treatment, and incorrect diagnoses can lead to disastrous outcomes. According to a WHO survey, heart attacks and strokes account for 17 million deaths globally. In many countries, deaths from heart disease are often linked to factors such as work overload and mental stress, making it a leading cause of adult mortality. Accurate and efficient diagnosis is essential but complex, often relying on a doctor's experience and knowledge, which can result in errors and high treatment costs. To address these issues, researchers have developed automatic medical diagnosis systems. These systems utilize comprehensive databases and decision support tools to diagnose diseases with fewer tests and more effective treatments [1].

The adoption of machine learning in healthcare has surged, enabling the identification of patterns within medical data. Using machine learning to classify cardiovascular disease occurrences can help reduce diagnostic errors. Additionally, data mining and machine learning methodologies are increasingly leveraged to predict the likelihood of developing specific ailments [2]. However, accurately predicting heart disease risk is fraught with challenges, particularly due to the class imbalance commonly found in medical datasets. The imbalance between healthy individuals and those with heart disease skews predictions, causing models to favour the majority class and limiting their ability to correctly identify at-risk patients.

To address these challenges, an advanced technique offers a viable solution: the Synthetic Minority Over-sampling Technique (SMOTE) is employed to generate synthetic data points for the minority class, balancing the dataset and enhancing the model's ability to detect heart disease cases.

The rest of this study is organized as follows: Section II reviews related work on heart disease prediction, Section III outlines the methodology used in this study, Section IV presents the results, discussion, and evaluation, and Section V offers the conclusion.

II. RELATED WORKS

Previous research has extensively explored how classifiers can reliably detect heart disease cases [3]. However, there has been a lack of comprehensive studies that examine all of a patient's conditions and identify the key variables essential for accurate heart disease prediction. Research in related fields indicates that selecting critical features significantly impacts the performance of classifier frameworks. Instead of using every feature available, it is crucial to identify the optimal combination of attributes [4]. Redundant and irrelevant features should be identified and removed before applying classification methods. For data mining professionals in healthcare, it is essential to understand the inter-dependencies among risk factors in a dataset and how each factor influences the accuracy of heart disease predictions.

Considerable effort has been dedicated to diagnosing cardiovascular heart disease using machine learning algorithms, which has spurred the motivation for this study. This paper provides a concise overview of the existing literature. Various algorithms, including Logistic Regression, KNN, and the Random Forest classifier, have been utilized to efficiently predict cardiovascular disease. The reported results indicate that each algorithm exhibits distinct strengths in achieving the defined objectives [5].

The model incorporating the IHDPS (Integrated Heart Disease Prediction System) demonstrated the capability to delineate decision boundaries using both traditional and advanced machine learning and deep learning models. It leveraged crucial factors such as family history associated with heart disease. However, the accuracy achieved by the IHDPS model was notably lower than that of emerging models, such as those detecting coronary heart disease using artificial neural networks and other advanced machine and deep learning algorithms [6].



Fig. 1. Basic Heart Prediction System

Subramanian et al. introduced the diagnosis and prediction of heart disease and blood pressure, alongside other attributes, using neural networks. Their deep neural network incorporated disease-related attributes to produce outputs processed by an output perceptron, comprising around 120 hidden layers. This approach ensures accurate results when applied to test datasets, constituting a fundamental and pertinent technique. Supervised networks have been recommended for heart disease diagnosis. Testing the model with unfamiliar data supplied by a physician yielded predictions based on prior training data, thereby evaluating the model's accuracy.

III. METHODOLOGY

The research adopts a mixed-methods approach, combining qualitative and quantitative methodologies to comprehensively explore and evaluate machine learning techniques for a heart disease prediction system. A five-stage architecture comprising heart disease dataset collection, dataset pre-processing, and classification is represented in Fig. 2. After the classification stage was completed, the performance of the machine learning algorithms was evaluated using F1-score, accuracy, specificity, and sensitivity measures.

A. System Architecture

This graphical framework delineates the system workflow and the interconnections among its subsystems. It also scrutinizes the different processes and activities within each subsystem component. Fig. 2 illustrates the system architecture of the proposed heart disease detection system.

Fig. 2. Proposed System Architecture

B. Data Gathering and Preprocessing

The publicly available heart disease database is used. Indicators of Heart Disease (2022 UPDATE) [7] consists of 319,794 rows and 18 columns. The dataset consists of three types of attributes, Input, Key, and Predictable, which are listed in Table I.

TABLE I. ATTRIBUTES AND ENCODING VALUES

S/N  Attribute         Value
1    HeartDisease      0 = no; 1 = yes
2    PhysicalHealth    Continuous
3    alcoholdrinking   0 = no; 1 = yes
4    SleepTime         Continuous
5    MentalHealth      Continuous
6    Smoking           0 = no; 1 = yes
7    Stroke            0 = no; 1 = yes
8    BMI               Continuous
9    Diffwalking       0 = no; 1 = yes
10   Diabetic          0 = no; 1 = yes
11   Sex               1 = Male; 0 = Female
12   Race              1 = White; 0 = Black
13   Genhealth         1 = Fair; 2 = Very fair; 3 = Good; 4 = Very Good
14   KidneyDisease     1 = yes; 0 = no
15   SkinCancer        1 = yes; 0 = no
16   Physicalactivity  1 = yes; 0 = no
17   Asthma            1 = yes; 2 = no
18   AgeCategory       Continuous

C. Data Preprocessing

Real-life data often contains large numbers of missing and noisy values, so the data are preprocessed to overcome such issues and make predictions robust. Cleaning: the collected data may contain missing values and may be noisy; to obtain an accurate and effective result, the data need to be cleaned of noise and the missing values filled in. Transformation: the format of the data is changed from one form to another to make it more comprehensible. The dataset does not contain any null values. Various plotting techniques were used to check the distribution of the data and to detect outliers. All of these preprocessing steps are crucial before the data are passed on for categorization or prediction.

D. The SMOTE (Synthetic Minority Oversampling Technique)

SMOTE is an oversampling technique in which synthetic samples are generated for the minority class. The algorithm helps to overcome the overfitting problem posed by random oversampling. It focuses on the feature space, generating new instances by interpolating between positive instances that lie close together.

The initial step in the process involves determining the total number of oversampling observations, denoted as "N." Typically, N is configured to achieve a balanced binary class distribution of 1:1, although this parameter can be adjusted to cater to specific requirements. Commencing the iterative procedure, a positive class instance is randomly chosen as the starting point. Subsequently, the K Nearest Neighbors (KNN) algorithm, with a default value of K = 5, is employed to identify the instances neighboring the selected positive class instance. These K nearest neighbors are retrieved to inform the generation of synthetic instances, which is accomplished by interpolating new data points based on the characteristics of the K nearest neighbors. To achieve this, a distance metric is employed to compute the disparity between the feature vector of the chosen instance and those of its neighbors. This calculated difference serves as a pivotal factor in the generation process. To introduce variability and enhance the diversity of the synthetic instances, a random value in the range [0, 1] is multiplied with the difference, and the resultant value is added to the original feature vector of the chosen instance. This step-by-step process of generating synthetic instances through distance-based interpolation is depicted graphically by Satpathy [8].
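The interpolation step described above can be sketched in a few lines of pure Python. This is a minimal illustration, not the implementation used in the study: the function name, defaults, and representation of instances as numeric tuples are all assumptions for the example.

```python
import math
import random

def smote_sample(X_minority, n_synthetic, k=5, seed=0):
    """Sketch of SMOTE: interpolate between a randomly chosen minority
    instance and one of its k nearest neighbors (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(X_minority)                    # random positive instance
        # k nearest neighbors of `base`, excluding `base` itself
        neighbors = sorted((p for p in X_minority if p is not base),
                           key=lambda p: math.dist(p, base))[:k]
        neighbor = rng.choice(neighbors)                 # pick one neighbor
        gap = rng.random()                               # random value in [0, 1]
        # new point = base + gap * (neighbor - base)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic
```

Because each synthetic point is a convex combination of two existing minority instances, it always lies on the line segment between them in feature space.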
Fig. 3. SMOTE Majority Class

E. Heart Disease Classification

Fig. 2 shows a detailed description of the architecture of the proposed methodology. The classification techniques considered in this study are Logistic Regression, Naïve Bayes, K-Nearest Neighbor, Support Vector Machine, Decision Tree, and Random Forest.

a. Logistic Regression
Logistic regression predicts discrete values, such as whether a student passed or failed, through the application of a transformation function known as the logistic function, h(x) = 1/(1 + e^(-x)), which generates an S-curve. Unlike linear regression, where the output is direct, logistic regression outputs probabilities for the default class, ranging between 0 and 1. The logistic function log-transforms the x-value to produce the corresponding output (y-value). Finally, these probabilities are dichotomized into binary categories using a predefined threshold [9].

b. Naïve Bayes
Naïve Bayes is a classification technique rooted in Bayes' theorem, relying on the assumption of predictor independence. Simplistically, a Naïve Bayes classifier operates under the belief that the presence of one feature within a class is unrelated to the presence of any other feature [10]. For example, when identifying an apple based on attributes like being red, rounded, and having a diameter of approximately 3 inches, a Naïve Bayes classifier treats each attribute independently in assessing the likelihood of the fruit being an apple, disregarding interdependence or additional attributes. This model's simplicity makes it well-suited for handling large datasets.

c. K-nearest Neighbor
K-nearest neighbor (KNN) finds application in both classification and regression scenarios, though it is predominantly used for classification tasks. Operating on the principle of determining a new instance's classification based on the consensus of its k nearest neighbors, KNN stores all existing examples and allocates a case to a class based on the majority vote of its neighbors, as determined by a distance function. Various distance functions, such as Euclidean, Manhattan, Minkowski, and Hamming, are employed, with Hamming specifically suited for categorical variables and the others for continuous variables. The choice of the K value can sometimes pose a challenge in KNN modeling [11].

d. Support Vector Machine
Support vector machine (SVM) represents a relatively recent advancement in supervised machine learning. Employing the kernel Adatron technique, SVM maps inputs to a high-dimensional feature space, effectively segregating data into distinct classes by isolating inputs near the data boundaries. This method excels in delineating datasets with complex boundary relationships but is limited to classification tasks, as it cannot approximate functions [12].

e. Decision Tree
Decision tree learning is a predictive modeling technique utilized in statistics, data mining, and machine learning. In classification trees, the target variable takes discrete values, with leaves representing class labels and branches representing attributes that combine to determine these labels [13]. Conversely, regression trees are employed when the target variable assumes continuous values.

f. Random Forest
Random forest employs ensemble methods, combining multiple learning algorithms to enhance prediction accuracy. Unlike statistical ensembles, which are often limitless in structure, machine learning ensembles consist of a finite collection of diverse models. The key challenge lies in identifying base models whose errors are made collectively rather than individually [14]. Even if base classifiers exhibit low accuracy, employing ensembles for classification can still yield high accuracy through the collective decision-making of multiple models.

F. Performance Metrics

This study employed model construction to assess the effectiveness and utility of various classification algorithms for predicting heart disease. The performance of each model was evaluated using a confusion matrix and a comprehensive set of metrics, including accuracy, precision, recall, F1-score, and ROC-AUC score.

IV. RESULT, DISCUSSION, AND EVALUATION

This section presents the experimental results for all heart patients using a dataset obtained from Kaggle. The experiments were conducted in a simulated environment using Python.

A. Data Balancing (SMOTE Technique)

The analysis of heart disease prediction using machine learning algorithms has demonstrated varying performance across models. However, a significant challenge lies in the imbalance of the dataset, as shown in Fig. 4. The dataset's label feature, HadHeartAttack, contains 416,959 instances of "No" for heart attacks compared to only 25,108 instances of "Yes." This imbalance can lead to biased predictions, where the models may perform well on the majority class (no heart attack) but poorly on the minority class (heart attack).
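A quick back-of-the-envelope calculation shows why such imbalance misleads accuracy-based evaluation. This is an illustrative sketch only, using the label counts reported for this dataset:

```python
# HadHeartAttack counts reported for this dataset: 416,959 "No" vs 25,108 "Yes".
no, yes = 416_959, 25_108

# A trivial "classifier" that always predicts the majority class ("No"):
accuracy = no / (no + yes)   # looks impressive on paper...
recall_yes = 0.0             # ...but it never detects a single heart-attack case

print(f"majority-class accuracy: {accuracy:.3f}")   # ~0.943
print(f"recall on the 'Yes' class: {recall_yes}")
```

A model can thus reach roughly 94% accuracy while being clinically useless, which is why recall on the minority class and balanced sampling matter here.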
B. Classification Results

After feature engineering and sampling of the data, five models were created using five algorithms and evaluated using the performance metrics. The initial data was separated into train (80%) and test (20%) sets after being segmented into features (X) and labels (Y). The analysis of the machine learning algorithms for predicting heart disease, as detailed in Table II, demonstrates varying performance metrics across the different models, utilizing eighteen features: HeartDisease, Smoking, alcoholdrinking, stroke, Physicalhealth, mentalhealth, diffwalking, sex, AgeCategory, Race, Diabetic, Physicalactivity, Genhealth, SleepTime, Asthma, KidneyDisease, and SkinCancer.

Fig. 4. HadHeartAttack before SMOTE

The SMOTE pre-processing technique is recognized as one of the most reliable and effective strategies in the machine learning and data mining fields. SMOTE increases the number of data instances by generating synthetic minority class examples, using Euclidean distance to interpolate between existing minority class samples and their nearest neighbors. These new instances are created based on the original features, making them very similar to the original data. In this study, applying the SMOTE technique balanced the "Yes" class, increasing it from 25,108 to 226,013 samples and resulting in a balanced dataset, as illustrated in Fig. 5.

Fig. 5. HadHeartAttack after SMOTE

Subsequently, the machine learning algorithms were re-applied to the balanced dataset and their accuracy was evaluated. Table II shows the classification results of all the classifiers used in this study. The major contribution of this study is the use of a thorough feature engineering approach for analyzing the individual features of the heart disease dataset; about 90 percent of the work was done on dataset pre-processing, and this contributed to the performance of each of the models built with the different algorithms. Random Forest performed best with an accuracy score of 0.9604, followed by the SVM model with an accuracy of 0.954, the Decision Tree algorithm with an accuracy of 0.947, Logistic Regression with an accuracy of 0.845, and K-Nearest Neighbor with an accuracy of 0.749.

TABLE II. MACHINE LEARNING CLASSIFICATION RESULTS

ML Algorithm          Acc (%)  Precision (%)  Recall (%)  F1-score (%)  ROC-AUC (%)
Logistic Regression   84.5     69.23          81.82       75            67
K-NN                  74.9     75             63.64       73.68         74.26
SVM                   94.8     72.73          78.9        81.82         69.79
Decision Tree         94.7     65.23          82.5        77            88.23
Random Forest         96.04    91.62          100         93.5          57.29

Logistic Regression achieved an accuracy of 84.5%, with a precision of 69.23%, recall of 81.82%, F1-score of 75%, and ROC-AUC score of 67%. This indicates balanced performance across metrics, with strong recall but relatively lower precision.

K-Nearest Neighbors (K-NN) showed an accuracy of 74.9%, precision of 75%, recall of 63.64%, F1-score of 73.68%, and ROC-AUC score of 74.26%. K-NN's precision and F1-score are quite similar, suggesting consistency, though its recall is lower than that of Logistic Regression.

Support Vector Machine (SVM) performed with an accuracy of 94.8%, precision of 72.73%, recall of 78.9%, F1-score of 81.82%, and ROC-AUC score of 69.79%, demonstrating high accuracy and balanced recall and precision. Decision Tree, on the other hand, achieved an accuracy of 94.7%, precision of 65.23%, recall of 82.5%, F1-score of 77%, and ROC-AUC score of 88.23%, indicating strong performance, especially in recall, though its precision is slightly lower.

Random Forest outperformed the other models with an accuracy of 96.04%, precision of 91.62%, recall of 100%, F1-score of 93.5%, and ROC-AUC score of 57.29%. Random Forest demonstrates exceptional precision and recall, leading to the highest F1-score, despite a lower ROC-AUC score.

These results highlight that each algorithm has unique strengths. Logistic Regression and K-NN provide balanced results, SVM and Decision Tree offer high accuracy and recall, while Random Forest excels in precision and recall, making it the most effective model for heart disease prediction among the ones evaluated.
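The reported accuracy, precision, recall, and F1 figures all derive from the four confusion-matrix counts. The following is a minimal sketch of how such metrics can be computed for binary labels (1 = heart disease); the function is illustrative and is not the authors' evaluation code:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from binary labels (1 = heart disease)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

On imbalanced data, precision and recall on the minority class are far more informative than raw accuracy, which is why the per-model discussion above weighs all four metrics.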
V. CONCLUSION

The successful application of various machine learning algorithms, particularly the exemplary performance of Random Forest, offers promising insights into the potential of these models in practical healthcare applications. The achieved accuracy scores underscore the viability of leveraging data-driven approaches for early detection and prognosis of heart diseases. Moreover, this study serves as a testament to the pivotal role of meticulous preprocessing in shaping the outcomes of predictive modeling. Approximately 90% of the project's effort is attributed to data preparation, emphasizing the foundational importance of data quality and refinement. It is imperative to acknowledge that while this research has made significant strides, challenges and limitations persist. The generalizability of the developed models to diverse populations and evolving conditions warrants further investigation. Additionally, the ongoing maintenance and updating of the models to align with real-world changes and advancements remain crucial for sustained accuracy and relevance.

In summation, this study not only sheds light on the potential of data-driven approaches in predicting heart diseases but also underscores the interdisciplinary nature of tackling complex health challenges. The integration of data science, medical knowledge, and advanced analytics offers a promising avenue for enhancing healthcare outcomes and contributing to the global effort to combat cardiovascular diseases. As the world grapples with the growing burden of heart diseases, the insights gleaned from this study pave the way for more effective, data-informed strategies in the ongoing battle for heart health.

REFERENCES

[1] B. M. Ramageri, "Association Rule Discovery with Train and Test approach for heart disease prediction," IEEE Transactions on Information Technology in Biomedicine, Vol. 10, No. 2, pp. 334-343, April 2006.
[2] C. M. Bhatt, P. Patel, T. Ghetia, and P. L. Mazzeo, "Effective Heart Disease Prediction Using Machine Learning Techniques," Algorithms, Vol. 16, No. 2, pp. 1-14, Feb. 2023.
[3] A. Rajdha, A. Agarwal, M. Sai, D. Ravi, and P. Ghuli, "Heart disease prediction using machine learning," International Journal of Engineering Research & Technology, Vol. 9, No. 4, pp. 659-662, April 2020.
[4] H. Jindal, S. Agrawal, R. Khera, R. Jain, and P. Nagrath, "Heart disease prediction using machine learning algorithms," in IOP Conference Series: Materials Science and Engineering, Vol. 1022, No. 1, pp. 1-11, 2021.
[5] K. Uyar and A. İlhan, "Diagnosis of heart disease using genetic algorithm based trained recurrent fuzzy neural networks," Procedia Computer Science, Vol. 120, pp. 588-593, 2017.
[6] P. K. Sahoo and P. Jeripothula, "Heart failure prediction using machine learning techniques," Available at SSRN 3759562, 2020.
[7] Kami, Indicators of Heart Disease (2022 UPDATE), Kaggle. https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease, October 2023.
[8] S. Swastik, "Overcoming Class Imbalance using SMOTE Techniques," Analytics Vidhya, Data Science Blogathon. Available: https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techni
[9] R. T. Vollmer, "Multivariate statistical analysis for pathologists: part I, the logistic model," American Journal of Clinical Pathology, Vol. 105, No. 1, pp. 115-126, 1996.
[10] K. Chandel, V. Kunwar, S. Sabitha, T. Choudhury, and S. Mukherjee, "A comparative study on thyroid disease detection using K-nearest neighbor and Naive Bayes classification techniques," CSI Transactions on ICT, Vol. 4, pp. 313-319, 2016.
[11] F. K. Nasser and S. F. Behadili, "Breast cancer detection using decision tree and k-nearest neighbour classifiers," Iraqi Journal of Science, pp. 4987-5003, 2022.
[12] D. T. Do and N. Q. K. Le, "A sequence-based approach for identifying recombination spots in Saccharomyces cerevisiae by using hyper-parameter optimization in FastText and support vector machine," Chemometrics and Intelligent Laboratory Systems, Vol. 194, pp. 103855, 2019.
[13] S. Mishra, P. K. Mallick, H. K. Tripathy, A. K. Bhoi, and A. González-Briones, "Performance evaluation of a proposed machine learning model for chronic disease datasets using an integrated attribute evaluator and an improved decision tree classifier," Applied Sciences, Vol. 10, No. 22, pp. 8137, 2020.
[14] F. R. Aszhari, Z. Rustam, F. Subroto, and A. S. Semendawai, "Classification of thalassemia data using random forest algorithm," in Journal of Physics: Conference Series, Vol. 1490, No. 1, pp. 012050, March 2020. IOP Publishing.
