Article
A Heart Disease Prediction Model Based on Feature
Optimization and Smote-Xgboost Algorithm
Jian Yang * and Jinhan Guan
School of Information, Shanxi University of Finance and Economics, Taiyuan 030006, China
* Correspondence: yangj@sxufe.edu.cn
Abstract: Heart disease is the leading cause of death worldwide. Researchers have proposed various methods aimed at improving the accuracy and efficiency of the clinical diagnosis of heart disease. Auxiliary diagnostic systems based on machine learning are designed to learn and predict the disease status of patients from large amounts of pathological data. Practice has shown that such systems have the potential to save more lives. Therefore, this paper proposes a new framework for predicting heart disease using the SMOTE-XGBoost algorithm. First, we propose a feature selection method based on information gain, which aims to extract key features from the dataset and prevent model overfitting. Second, we use the SMOTE-ENN algorithm to process the imbalanced data and obtain sample data with roughly the same number of positive and negative cases. Finally, we compare the prediction performance of the XGBoost algorithm and five baseline algorithms on the sample data. The results show that our proposed method achieves the best performance on the five indicators of accuracy, precision, recall, F1-score, and AUC, and that the framework proposed in this paper has significant advantages in heart disease prediction.
Ensemble learning combines multiple weak learners to produce a model with superior overall performance. Bagging and boosting are the two primary model combination techniques; bagging integrates multiple high-variance (overfitting-prone) weak classifiers, whereas boosting integrates multiple high-bias (underfitting) weak classifiers. XGBoost is an efficient implementation of ensemble learning whose main ideas are boosting and the introduction of regularization terms in the objective function to prevent overfitting.
We propose a heart disease prediction model employing the SMOTE-XGBoost algorithm. The model was trained using real pathological data from cardiac patients. The prediction target is major adverse cardiovascular and cerebrovascular events (MACCE), whose occurrence is a key indicator for evaluating the success of coronary heart disease surgery. In summary, we make the following contributions.
• An information-gain-based feature selection method is used to extract the crucial features from the dataset.
• A hybrid technique combining undersampling and oversampling is used to handle the imbalanced data in the selected dataset.
• The efficacy of XGBoost is validated on the preprocessed dataset, and the XGBoost algorithm is compared with five baseline methods using confusion-matrix-based metrics.
The remaining portions of the paper are structured as follows. Section 2 summarizes
the most recent research on heart disease prediction. Section 3 is a brief description of the
dataset and an introduction to the algorithms applied in the framework. Section 4 is a
statistical description of the dataset and a comparison and evaluation of the experimental
models, and Section 5 provides a conclusion and outlook.
2. Related Work
One of the major applications of machine learning in recent years has been the predic-
tion of heart disease, which has had some success. Some scholars have concentrated on the
innovation of data processing techniques such as feature selection, and some scholars have
focused on innovation from the perspective of prediction algorithms.
Modepalli et al. [6] utilized a new model (DT + RF) to predict the occurrence (or
non-occurrence) of heart disease. They chose the UCI dataset to validate the reliability of the hybrid model, comparing the prediction outcomes of the hybrid model with those of each single algorithm within it. They found that the hybrid model has a significant advantage over the single algorithms in terms of accuracy, with a 7% to 9% improvement.
Joo et al. [7] used a dataset of cardiovascular disease with the same features but
different years of return visits to train the model. The authors selected 25 features from the
dataset by combining health examination results and questionnaire responses, and used
four machine learning models to predict the 2-year and 10-year cardiovascular disease
risk, respectively. In particular, they found that the accuracy of each model improved
somewhat if physician medication information was taken into account when performing
feature selection, and that medication information had a strong effect on the prediction of
short-term data in this study.
Li et al. [8] proposed a feature selection approach, fast conditional mutual information (FCMIM), based on conditional mutual information. They employed four common feature selection algorithms and FCMIM on the Cleveland dataset and used six machine learning algorithms to train the models. The results supported the use of this novel feature selection method, with the highest accuracy of 92.37% achieved by the combination of FCMIM and SVM.
Ali et al. [9] used a feature fusion technique to process low-dimensional data extracted
from medical records and sensor data. Then, they employed a feature selection strategy
relating to information gain and feature ranking to obtain the dataset. They achieved
prediction accuracy of 98.5% by applying an ensemble deep learning algorithm.
Rahim et al. [10] applied an oversampling technique to balance the data, and also used
the mean value method to fill in the missing values and feature importance method for
feature selection. They selected three datasets (including the Framingham dataset and the
Cleveland dataset). After data preprocessing on each of the three datasets, the predictive
Information 2022, 13, 475 3 of 15
effectiveness of the new ensemble model (KNN and LR) with and without feature selection
was compared. The results fully validated the advantages of the new ensemble model,
in which the accuracy of the new model with feature selection was as high as 99.1%.
Ishaq et al. [11] used the feature importance of random forest to rank the features and
select the features with higher scores, and also employed the SMOTE technique to balance
the data. They compared the prediction performance of nine commonly used algorithms on
data treated with SMOTE and on unbalanced data without treatment, where it was found
that the prediction accuracy of each model was significantly improved on balanced data.
Khurana et al. [12] found that SVM outperformed all other machine learning algorithms when testing the results on the Cleveland dataset with five feature selection techniques. The prediction accuracy of each machine learning algorithm improved to a different extent after applying feature selection; the combinations of SVM with Chi-Square and with information gain both reached an accuracy of 83.41%.
Ashri et al. [13] applied a genetic-algorithm-based feature selection method, the Simple Genetic Algorithm (SGA), and trained models on the UCI dataset. The two algorithms with the highest accuracy, decision trees and random forests, were selected to build a hybrid ensemble learning model, whose accuracy reached 98.18%.
Bashir et al. [14] proposed a new ensemble learning combinatorial voting approach,
in which four datasets were selected from the UCI database to validate six machine learning
algorithms and five ensemble models with a combination of these six algorithms. They
found that the accuracy of the ensemble models was generally higher than that of the individual algorithms, with the average accuracy of the five ensemble models reaching 83%. The proposed combination can be extended to bagging and boosting to further improve the accuracy.
In conclusion, data preprocessing, such as data standardization and feature selection,
can effectively raise the value of the dataset and greatly enhance the accuracy of a model.
Additionally, ensemble learning models perform well when dealing with heart disease.
The main point of this study is to employ the ensemble learning algorithm Xgboost on a
heart disease dataset after performing feature selection and imbalance processing. Finally,
by contrasting xgboost with other standard algorithms, the effectiveness and accuracy of
the suggested framework in predicting heart disease are confirmed.
3. Method
Figure 1 shows the heart disease prediction framework proposed in this paper.
3.1. Dataset
This paper uses the return-visit data of real patients in a hospital as the research sample. We named this the Heart Disease Dataset (HDD). The dataset has a total of 4232 samples and 37 features, including both numeric and categorical features. The prediction target is major adverse cardiovascular and cerebrovascular events (MACCE), where zero indicates no occurrence and one indicates occurrence.
The numerical features are normalized using min-max scaling:

$$ H = \frac{H_o - H_{\min}}{H_{\max} - H_{\min}} \times (NH_{\max} - NH_{\min}) + NH_{\min} \quad (1) $$

where H refers to the normalized value, H_o refers to the original value, H_min refers to the minimum, H_max refers to the maximum, and NH_max and NH_min refer to the range of values taken by the transformed dataset, usually NH_max = 1 and NH_min = 0. In this paper, H is taken as the experimental dataset, whose range lies in the interval [0, 1].
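As a hedged illustration of Equation (1) with NH_max = 1 and NH_min = 0, the sketch below applies min-max scaling column by column; the function name and the `numeric_cols` list are assumptions, not part of the original framework.

```python
import pandas as pd

def min_max_normalize(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    """Scale the given numeric columns into [0, 1] as in Equation (1)."""
    out = df.copy()
    for col in numeric_cols:
        h_min, h_max = out[col].min(), out[col].max()
        if h_max > h_min:
            # (H_o - H_min) / (H_max - H_min), with NH_max = 1 and NH_min = 0
            out[col] = (out[col] - h_min) / (h_max - h_min)
        else:
            # Constant column: map to 0 to avoid division by zero.
            out[col] = 0.0
    return out
```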
[Figure 1. The proposed heart disease prediction framework. The training set goes through preprocessing, feature selection, and data balancing with SMOTE-ENN (resampling); the XGBoost model and the baseline algorithms are then trained with cross-validation. The testing set goes through the same feature selection and is fed to the trained model for prediction and performance evaluation.]
For feature selection, the information gain of each feature with respect to the prediction label is computed as follows:

$$ E(X) = -\sum_{i} p_i \log_2 p_i \quad (2) $$

$$ E(Y \mid X) = \sum_{v} P(X = v)\, E(Y \mid X = v) \quad (3) $$

$$ IG(X, Y) = E(Y) - E(Y \mid X) \quad (4) $$
where Equation (2) denotes the information entropy of feature X, Equation (3) denotes the conditional entropy of the prediction column Y when feature X is known, and Equation (4) denotes the information gain: the information gain of feature X is the difference between the information entropy of the prediction column Y and the conditional entropy E(Y | X). The information gain is computed for each feature in the dataset and the resulting values are sorted; features with gains larger than a threshold are regarded as essential features and are selected. Its pseudocode is given in Algorithm 1.
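As a hedged illustration of this selection step (a sketch of the idea behind Algorithm 1, not the authors' exact procedure), the information gain can be estimated with scikit-learn's mutual information function, since IG(X, Y) = E(Y) − E(Y | X) is the mutual information between a feature and the label; the threshold value and helper name below are assumptions.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def select_features_by_gain(X: pd.DataFrame, y: pd.Series, threshold: float = 0.01):
    """Rank features by estimated information gain and keep those above a threshold."""
    # mutual_info_classif estimates I(X_i; Y), i.e., the information gain of each feature.
    gains = mutual_info_classif(X, y, random_state=0)
    ranked = sorted(zip(X.columns, gains), key=lambda item: item[1], reverse=True)
    selected = [name for name, gain in ranked if gain > threshold]
    return selected, ranked
```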
After the above preprocessing and feature selection, we obtain a total of 3527 samples, with 15 features and 1 prediction label. Table 1 provides a description of the preprocessed HDD.
[Table 1. Description of the preprocessed HDD: number of samples with MACCE = 0, MACCE = 1, and in total.]
To obtain balanced data, there are three basic strategies: (1) expanding the sample
size from the minority class (oversampling); (2) decreasing the number of samples from
the majority class (undersampling); and (3) combining undersampling and oversampling.
The undersampling method removes samples from the majority class at random, which
may lead to a loss of crucial information that has a considerable impact on the learning
task. The oversampling method directly resamples samples from the minority class, which
may result in overfitting of the model. Furthermore, several researchers have shown that
mixed methods are superior to single methods when processing datasets [16,17].
In this research, a hybrid technique called SMOTE-ENN [18] is utilized to handle
imbalanced data. SMOTE is an oversampling algorithm that synthesizes new minority-class samples by interpolating between existing ones. The ENN algorithm is an undersampling algorithm that decreases the number of samples from the majority class by removing samples whose class differs from that of the majority of their k-nearest neighbors. In this paper, the SMOTE algorithm is used to oversample the minority class (MACCE = 1) until the samples in the majority and minority groups are roughly balanced. Then, the ENN algorithm is applied to remove the overlapping samples in the two categories, yielding the rebalanced dataset. Using this hybrid technique, the proportion of the minority class in the HDD increases from 9.16% to 61.67%.
In Algorithm 2, the SMOTE-ENN pseudocode is included.
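As a hedged, minimal counterpart to Algorithm 2 (not the authors' exact procedure), the same resampling step can be reproduced with the imbalanced-learn library; the toy data, class weights, and random seed below are assumptions used only to make the sketch self-contained.

```python
from collections import Counter
from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification

# Toy imbalanced data standing in for the selected HDD features (X) and MACCE labels (y).
X, y = make_classification(n_samples=3527, n_features=15, weights=[0.91, 0.09], random_state=42)

smote_enn = SMOTEENN(random_state=42)      # SMOTE oversampling followed by ENN cleaning
X_res, y_res = smote_enn.fit_resample(X, y)

print("Before:", Counter(y))               # heavily skewed toward the MACCE = 0 class
print("After: ", Counter(y_res))           # roughly balanced after SMOTE-ENN
```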
3.5. XGBoost
XGBoost is an efficient implementation of the boosting ensemble learning paradigm [19]. The fundamental principle of XGBoost is to train the model on residuals: the outcome of the most recent tree is used as the input for the subsequent iteration, and the error is progressively decreased over numerous serial iterations. Finally, all weak learners are linearly weighted to produce the ensemble learner.
Additionally, when training an XGBoost tree, the best splitting point is chosen using a gain-based greedy algorithm. To better optimize the objective function, XGBoost approximates it with a second-order Taylor expansion, so the optimum can be obtained from the resulting quadratic form. Furthermore, a regularization term is added to control the complexity of the generated tree, lowering the possibility of overfitting the model. The loss function is as follows:
$$ f_{obj}^{(t)} = \sum_{i=1}^{n} \left[ L\!\left(y_i, \hat{y}_i^{(t-1)}\right) + L'\!\left(y_i, \hat{y}_i^{(t-1)}\right) f_t(x_i) + \frac{1}{2} L''\!\left(y_i, \hat{y}_i^{(t-1)}\right) f_t^{2}(x_i) \right] + \Omega(f_t) \quad (5) $$

$$ \Omega(f_t) = \frac{1}{2} \lambda \sum_{j=1}^{T} \|W_j\|^{2} + \gamma T \quad (6) $$
where W_j stands for the leaf node weights, T stands for the total number of leaf nodes, λ and γ are hyperparameters that control the complexity of the tree, and L' and L'' denote the first- and second-order derivatives of the loss with respect to the prediction of the previous iteration.
The XGBoost technique utilizes the shrinkage strategy [20] to ensemble the weak learners and decrease the likelihood of overfitting the model. This ensemble takes the form shown below:

$$ F_m(X) = F_{m-1}(X) + \eta\, f_m(X) \quad (7) $$

where f_m(X) denotes the weak learner generated at the m-th iteration, F_m(X) denotes the ensemble learner after the m-th iteration, and η is the shrinkage factor. Since the parameter η has a strong negative correlation with the number of iterations, the model often has better generalization properties when η takes a smaller value.
Moreover, XGBoost adopts the tree-structured Parzen estimator (TPE) strategy to automatically optimize the hyperparameters of the model for optimal prediction, as well as a block-based technique to enhance its capability to handle large amounts of data and to improve its training efficiency.
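As a hedged sketch of this training setup, continuing from the resampled data (X_res, y_res) in the sketch above, the model can be fit with the xgboost library; the hyperparameter values are illustrative assumptions, not the paper's tuned settings.

```python
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

model = XGBClassifier(
    n_estimators=200,      # number of boosting iterations (trees)
    learning_rate=0.1,     # shrinkage factor eta in Equation (7)
    max_depth=4,           # limits the complexity of each tree
    reg_lambda=1.0,        # L2 penalty on leaf weights (lambda in Equation (6))
    gamma=0.1,             # penalty per leaf (gamma in Equation (6))
    eval_metric="logloss",
)
scores = cross_val_score(model, X_res, y_res, cv=5, scoring="roc_auc")
print("Mean cross-validated AUC:", round(scores.mean(), 4))
```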
$$ p = \frac{1}{1 + e^{-y}} \quad (8) $$

$$ P(C \mid x) = \frac{P(C)}{p(x)} \prod_{i=1}^{d} P(x_i \mid C) \quad (9) $$

The test results are then categorized in accordance with the corresponding probability:

$$ C_{nb} = \underset{C}{\arg\max}\, P(C) \prod_{i=1}^{d} P(x_i \mid C) \quad (10) $$
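Equations (9) and (10) describe the Naive Bayes decision rule; as a hedged sketch of one such baseline, continuing from the resampled data above, a Gaussian Naive Bayes classifier can be evaluated with a confusion matrix. The Gaussian variant, the train/test split, and the seed are assumptions.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, test_size=0.2, random_state=42)

nb = GaussianNB()
nb.fit(X_train, y_train)                 # estimates P(C) and P(x_i | C) as in Equation (9)
y_pred = nb.predict(X_test)              # picks argmax_C P(C) * prod_i P(x_i | C), Equation (10)

print(confusion_matrix(y_test, y_pred))  # basis for accuracy, precision, recall, and F1-score
print("Accuracy:", accuracy_score(y_test, y_pred))
```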
4. Performance Evaluation
4.1. Result of Exploratory Data Analysis
Exploratory data analysis was carried out on this dataset to better understand its characteristics. The observations from these analyses are described below.
The frequency distribution histogram provides a rapid overview of the data’s disper-
sion and central tendency. The distribution of various features is visually represented by
the height of each rectangle in Figure 2, which shows the frequency of occurrence of the
values. Additionally, the ability of the model to predict outcomes is impacted by the degree
of feature correlation.
[Figure 2. Frequency distribution histograms of the normalized features (e.g., nitrate, LM_lesion, ASA, smoke, REV_type, age, LVEF, HBG, BUN, TC, SCV_number, and MACCE), with feature values in [0, 1] on the horizontal axis and frequency on the vertical axis.]