Nigercon Abuad IEEE 2024
Abstract— Predicting heart disease is crucial in healthcare, where accurate detection can significantly impact patient outcomes. A major challenge in prediction is imbalanced data, with underrepresented samples in the minority class leading to biased models. This study addresses this by combining advanced feature engineering with SMOTE (Synthetic Minority Over-sampling Technique) to manage class imbalance. The objectives include refining data through feature engineering, building predictive models using machine learning algorithms, and assessing performance metrics. Algorithms such as Random Forest, SVM, Decision Tree, Logistic Regression, and K-Nearest Neighbor are applied. Results show that robust feature engineering notably enhances performance, with Random Forest achieving the highest accuracy (0.9604), followed by SVM (0.954) and Decision Tree (0.947). The findings highlight the role of feature engineering and sampling in improving heart disease prediction models, with meaningful implications for healthcare advancements.

Keywords— Data Mining, Coronary, SMOTE, Algorithms, Class imbalance

I. INTRODUCTION

The healthcare sector faces several challenges, including poor service quality due to misdiagnosis and ineffective treatments. High-quality care depends on accurate diagnosis and effective treatment, and incorrect diagnoses can lead to disastrous outcomes. According to a WHO survey, heart attacks and strokes account for 17 million deaths globally. In many countries, deaths from heart disease are often linked to factors like work overload and mental stress, making it a leading cause of adult mortality. Accurate and efficient diagnosis is essential but complex, often relying on a doctor's experience and knowledge, which can result in errors and high treatment costs. To address these issues, researchers have developed automatic medical diagnosis systems. These systems utilize comprehensive databases and decision support tools to diagnose diseases with fewer tests and more effective treatments [1].

The adoption of machine learning in healthcare has surged, enabling the identification of patterns within medical data. Using machine learning to classify cardiovascular disease occurrences can help reduce diagnostic errors. Additionally, data mining and machine learning methodologies are increasingly leveraged to predict the likelihood of developing specific ailments [2]. However, accurately predicting heart disease risk is fraught with challenges, particularly due to the class imbalance commonly found in medical datasets. The imbalance between healthy individuals and those with heart disease skews predictions, causing models to favour the majority class and limiting their ability to correctly identify at-risk patients.

To address these challenges, an advanced technique offers a viable solution. The Synthetic Minority Over-sampling Technique (SMOTE) is employed to generate synthetic data points for the minority class, balancing the dataset and enhancing the model's ability to detect heart disease cases.

The rest of this study is organized as follows: Section II reviews related work on heart disease prediction, Section III outlines the methodology used in this study, Section IV presents the results, discussion, and evaluation, and Section V offers the conclusion.

II. RELATED WORKS

Previous research has extensively explored how classifiers can reliably detect heart disease cases [3]. However, there has been a lack of comprehensive studies that examine all of a patient's conditions and identify the key variables essential for accurate heart disease prediction. Research in related fields indicates that selecting critical features significantly impacts the performance of classifier frameworks. Instead of using every feature available, it is crucial to identify the optimal combination of attributes [4]. Redundant and irrelevant features should be identified and removed before applying classification methods. For data mining professionals in healthcare, it is essential to understand the inter-dependencies among risk factors in a dataset and how each factor influences the accuracy of heart disease predictions.

Considerable effort has been dedicated to diagnosing cardiovascular heart disease using machine learning algorithms, which has spurred the motivation for this study. This paper provides a concise overview of the existing literature. Various algorithms, including Logistic Regression, KNN, and Random Forest Classifier, have been utilized to efficiently predict cardiovascular disease. The reported results indicate that each algorithm exhibits distinct strengths in achieving the defined objectives [5].

The model incorporating IHDPS (Integrated Heart Disease Prediction System) demonstrated the capability to delineate decision boundaries using both traditional and advanced machine learning and deep learning models. It leveraged crucial factors such as family history associated with heart disease. However, the accuracy achieved by the IHDPS model was notably lower than that of emerging models, such as those detecting coronary heart disease using artificial neural networks and other advanced machine and deep learning algorithms [6].
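The class-imbalance problem described in the Introduction can be illustrated with a toy example (the numbers below are hypothetical, not drawn from the study's dataset): a classifier that always predicts the majority class attains high accuracy while detecting no positive cases at all.

```python
# Toy illustration: with 95% negative labels, a classifier that always
# predicts "no disease" reaches 95% accuracy but 0% recall on the
# minority (disease) class. All numbers are hypothetical.
labels = [0] * 95 + [1] * 5      # 95 healthy, 5 with heart disease
predictions = [0] * 100          # majority-class "classifier"

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
recall = sum(p == 1 and y == 1
             for p, y in zip(predictions, labels)) / labels.count(1)

print(accuracy)  # 0.95
print(recall)    # 0.0
```

This is why the study evaluates recall, F1-score, and ROC-AUC alongside accuracy.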
Fig. 3. SMOTE Majority Class

E. Heart Disease Classification

Figure 2 shows a detailed description of the architecture of the proposed methodology, in which the five different classification techniques proposed in this study are Logistic Regression, KNN, SVM, Decision Trees, and Naive Bayes.

a. Logistic Regression

Logistic regression predicts discrete values, such as whether a student passed or failed, through the application of a transformation function known as the logistic function. This function, denoted as h(x) = 1/(1 + e^(-x)), generates an S-curve. Unlike Linear Regression, where the output is direct, logistic regression outputs probabilities for the default class, which by definition range between 0 and 1. By employing the logistic function h(x) = 1/(1 + e^(-x)), the x-value is transformed to produce the corresponding output (y-value). Finally, these probabilities are dichotomized into binary categories using a predefined threshold [9].

b. Naïve Bayes

Naïve Bayes is a classification technique rooted in the Bayes theorem, relying on the assumption of predictor independence. Simplistically, a Naïve Bayes classifier operates under the belief that the presence of one feature within a class is unrelated to the presence of any other feature []. For example, when categorizing fruit, such as identifying an apple based on attributes like being red, rounded, and having a diameter of approximately 3 inches, a Naïve Bayes classifier treats each attribute independently in assessing the likelihood of the fruit being an apple, disregarding interdependence or additional attributes. This model's simplicity makes it well-suited for handling large datasets.

c. K-nearest Neighbor

K-nearest neighbor (KNN) finds application in both classification and regression scenarios, though it is predominantly used for classification tasks. Operating on the principle of determining a new instance's classification based on the consensus of its k nearest neighbors, KNN stores all existing examples and allocates a case to a class based on the majority vote of its neighbors, as determined by a distance function. Various distance functions, such as Euclidean, Manhattan, Minkowski, and Hamming, are employed, with Hamming specifically suited for categorical variables and the others for continuous variables. The choice of the K value can sometimes pose a challenge in KNN modeling [11].

e. Decision Tree

Decision tree learning is a predictive modeling technique utilized in statistics, data mining, and machine learning. In classification trees, the target variable takes discrete values, with leaves representing class labels and branches representing attributes that combine to determine these labels [13]. Conversely, regression trees are employed when the target variable assumes continuous values.

f. Random Forest

Random forest employs ensemble methods, combining multiple learning algorithms to enhance prediction accuracy. Unlike statistical ensembles, which are often limitless in structure, machine learning ensembles consist of a finite collection of diverse models. The key challenge lies in identifying base models that make diverse, uncorrelated errors, rather than requiring each model individually to achieve high accuracy [14]. Even if base classifiers exhibit low accuracy, employing ensembles for classification can still yield high accuracy through the collective decision-making of multiple models.

F. Performance Metrics

This study employed model construction to assess the effectiveness and utility of various classification algorithms for predicting heart disease. The performance of each model was evaluated using a confusion matrix and a comprehensive set of metrics, including accuracy, precision, recall, F1-score, and ROC-AUC score.

IV. RESULTS, DISCUSSION, AND EVALUATION

This section presents the experimental results for heart disease prediction using a dataset obtained from Kaggle. The experiments were conducted in a simulated environment using Python.

A. Data Balancing (SMOTE Technique)

The analysis of heart disease prediction using machine learning algorithms has demonstrated varying performance across models. However, a significant challenge lies in the imbalance of the dataset, as shown in the provided table. The dataset has a label feature titled HadHeartAttack, containing 416,959 instances of "No" for heart attacks compared to only 25,108 instances of "Yes." This imbalance can lead to biased predictions, where the models may perform well on the majority class (no heart attack) but poorly on the minority class (heart attack).
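The core SMOTE step, interpolating between a minority sample and one of its nearest neighbours, can be sketched in plain Python. This is an illustrative sketch of the technique, not the implementation used in this study, and the two-dimensional points below are made up for demonstration.

```python
import math
import random

def smote_sample(minority, k=2, rng=random.Random(0)):
    """Generate one synthetic minority sample: pick a point, pick one of
    its k nearest neighbours (Euclidean distance), and interpolate at a
    random position along the line segment between them."""
    x = rng.choice(minority)
    others = [p for p in minority if p is not x]
    neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
    nb = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb))

# Hypothetical minority-class points (2 features each)
minority = [(1.0, 2.0), (1.5, 2.5), (2.0, 3.0)]
synthetic = smote_sample(minority)
```

Because each synthetic point is a convex combination of two real minority samples, it stays within the region the minority class already occupies, which is why SMOTE-generated instances remain "very similar to the original data."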
B. Classification Results

After feature engineering and sampling of the data, five different models were created using five algorithms and evaluated using the performance metrics. The initial data was separated into train (80%) and test (20%) sets after being segmented into features (X) and labels (Y). The analysis of machine learning algorithms for predicting heart disease, as detailed in Figure 4, demonstrates varying performance metrics across different models, utilizing eighteen features: HeartDisease, Smoking, AlcoholDrinking, Stroke, PhysicalHealth, MentalHealth, DiffWalking, Sex, AgeCategory, Race, Diabetic, PhysicalActivity, GenHealth, SleepTime, Asthma, KidneyDisease, and SkinCancer.

Fig. 4. HadHeartAttack before SMOTE
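The 80/20 train/test split described above can be sketched as follows. This is a simplified, stdlib-only illustration; the study does not specify its exact tooling beyond Python, and the data below is hypothetical.

```python
import random

def train_test_split(X, y, test_ratio=0.2, seed=42):
    """Shuffle an index order, then slice into train (80%) and test (20%)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

# Hypothetical features (X) and labels (y)
X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, y_train, X_test, y_test = train_test_split(X, y)
print(len(X_train), len(X_test))  # 8 2
```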
The SMOTE pre-processing technique is recognized as one of the most reliable and effective strategies in the machine learning and data mining fields. SMOTE increases the number of data instances by generating synthetic minority class examples, using Euclidean distance to interpolate between existing minority class samples and their nearest neighbors. These new instances are created based on the original features, making them very similar to the original data. In this study, applying the SMOTE technique balanced the "Yes" class size from 25,108 to 226,013 samples, resulting in a balanced dataset, as illustrated in Fig. 5.

TABLE II. MACHINE LEARNING CLASSIFICATION RESULTS

ML Algorithm         Acc (%)  Precision (%)  Recall (%)  F1-score (%)  ROC-AUC (%)
Logistic Regression  84.5     69.23          81.82       75            67
K-NN                 74.9     75             63.64       73.68         74.26
SVM                  94.8     72.73          78.9        81.82         69.79
Decision Tree        94.7     65.23          82.5        77            88.23
Random Forest        96.04    91.62          100         93.5          57.29

Logistic Regression achieved an accuracy of 84.5%, with a precision of 69.23%, recall of 81.82%, F1-score of 75%, and ROC-AUC score of 67%. This indicates balanced performance across metrics, with strong recall but relatively lower precision.

K-Nearest Neighbors (K-NN) showed an accuracy of 74.9%, precision of 75%, recall of 63.64%, F1-score of 73.68%, and ROC-AUC score of 74.26%. K-NN's precision and F1-score are quite similar, suggesting consistency, though its recall is lower compared to Logistic Regression.

Support Vector Machine (SVM) performed with an accuracy of 94.8%, precision of 72.73%, recall of 78.9%, F1-score of 81.82%, and ROC-AUC score of 69.79%. SVM demonstrates high accuracy and balanced recall and precision. Decision Tree, on the other hand, achieved an accuracy of 94.7%, precision of 65.23%, recall of 82.5%, F1-score of 77%, and ROC-AUC score of 88.23%. This indicates strong performance, especially in recall, though its precision is slightly lower.

Random Forest outperformed the other models with an accuracy of 96.04%, precision of 91.62%, recall of 100%, F1-score of 93.5%, and ROC-AUC score of 57.29%. Random Forest demonstrates exceptional precision and recall, leading to the highest F1-score, despite a lower ROC-AUC score.

These results highlight that each algorithm has unique strengths. Logistic Regression and K-NN provide balanced results, SVM and Decision Tree offer high accuracy and recall, while Random Forest excels in precision and recall, making it the most effective model for heart disease prediction among the ones evaluated.
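The metrics reported in Table II can all be derived from a confusion matrix, as described in Section F. The sketch below shows the standard formulas; the confusion-matrix counts are hypothetical and do not correspond to any model in this study.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix
    counts (true/false positives, false/true negatives)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 80 true positives, 20 false positives,
# 10 false negatives, 90 true negatives.
acc, prec, rec, f1 = metrics(tp=80, fp=20, fn=10, tn=90)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# → 0.85 0.8 0.889 0.842
```

Note how precision and recall can diverge sharply under class imbalance even when accuracy looks strong, which is why Table II reports all of them.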