1 Introduction
According to the WHO, cardiovascular diseases (CVDs) cause the deaths of 17.9
million people each year, accounting for 32% of all fatalities worldwide [1].
Heart disease prediction and prevention is one of the major clinical research
areas, and even a tiny improvement in this area is significant for medical science.
Most patients suffering from CVDs are detected only when the disease has become
severe. Thus, detecting CVDs at an earlier stage is necessary for saving lives.
c Springer Nature Switzerland AG 2022
I. Woungang et al. (Eds.): ANTIC 2021, CCIS 1534, pp. 765–776, 2022.
https://doi.org/10.1007/978-3-030-96040-7_57
766 A. Aleem et al.
1. Roulette Wheel Selection: It is a method for choosing the genomes that become the
next parents. Figure 1 shows an arbitrary roulette wheel. Each genome is
given an area on the circle in proportion to its fitness value from the last
state; a higher fitness value corresponds to a larger area for the genome
on the circle/wheel. The wheel is rotated against a fixed pointer to
select a genome. In programming, it is implemented with a random number
generator and a mod function acting as the pointer. The higher the fitness value,
the better the probability of being selected as a parent for the next state. Other
selection methods include rank selection, random selection, etc.
2. Crossover: It is a process for obtaining new genomes by crossing over the
selected parents at random positions, cutting the feature-subset string with a
certain probability. Common methods include single-point crossover, two-point
crossover (shown in Fig. 2), uniform crossover, N-point crossover, etc.
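The two operators above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the pointer is drawn with a uniform random number rather than the mod-based scheme mentioned above, and genomes are assumed to be equal-length bit lists of at least three features.

```python
import random

def roulette_select(population, fitnesses):
    """Pick one genome with probability proportional to its (non-negative) fitness."""
    total = sum(fitnesses)
    # The "pointer": a random position on the wheel's circumference.
    pointer = random.uniform(0, total)
    cumulative = 0.0
    for genome, fitness in zip(population, fitnesses):
        cumulative += fitness
        if pointer <= cumulative:
            return genome
    return population[-1]  # guard against floating-point round-off

def two_point_crossover(parent_a, parent_b):
    """Swap the middle segment between two equal-length feature bit strings."""
    # Pick two distinct cut points (requires at least 3 genes).
    i, j = sorted(random.sample(range(1, len(parent_a)), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b
```

Here a genome is a feature-subset mask such as `[1, 0, 1, 1, 0]`, where a 1 means the corresponding attribute is kept.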
Improved prediction accuracy is achieved by applying feature selection, and among all the
approaches, the GA-based feature selection approach performs best in terms of prediction accuracy.
This article is organized into six sections. The first section introduced the
problem and discussed the probable solution. The rest of the article proceeds as
follows. The second section presents the background and related work on heart
disease prediction and feature selection. The third section presents the proposed
work along with an explanation of the approaches applied. The fourth section
presents the experimental setup along with a brief explanation of the utilized
metrics. The fifth section provides the results and analyses them. Finally,
the sixth and last section concludes this article and provides directions for future
work.
2 Related Work
Much research has been done on CVD detection using machine learning
approaches. A few of the recent and significant contributions are discussed
here to present the background. Kanikar and Shah [9] in 2016 proposed an
approach to predict CVD considering attributes such as age, sex, weight, chest
pain type, etc. The authors applied preprocessing techniques
such as noise removal, discarding records with missing data, filling in default values,
and level-wise classification of attributes to aid decision making. SVM and NB
were employed for the prediction, among which SVM was found to be better.
Mirmozaffari et al. [10] in 2017 used clustering algorithms for feature selection.
The authors applied clustering approaches such as K-means, hierarchical
clustering, and density-based clustering. The best algorithm was chosen
using multilayer filtering preprocessing and a quantitative evaluation method,
comparing the accuracies, error functions, and cluster-building times.
Based on the results, density-based clustering and K-means performed quite well.
Jabbar et al. [11] in 2013 applied the genetic algorithm for selecting the
optimal feature subset for heart disease prediction. This method works well
for pruning redundant and irrelevant features by actually putting every new
generation to the test. In this case, KNN is used as the supervised algorithm to
check the accuracy of every generation. The process is repeated until the performance
starts to stabilize. The authors obtained 4–5% higher accuracy after applying the
GA-based feature selection approach on various UCI repository datasets.
Gokulnath and Shantharajah [12] in 2019 used SVM as the fitness function for the GA;
it performs better than KNN and works well on data with weak linear
dependency. The SVM model was 83.70% accurate when classifying CVD
with the full feature set. However, using the framework for feature reduction, an
improvement of 5% in accuracy was observed.
Gárate-Escamila et al. [13] in 2020 used Principal Component
Analysis (PCA), an unsupervised (filter) method of feature reduction
based on a non-parametric statistical technique. PCA had also been
utilized by Santhanam and Ephzibah [14] in 2013. The authors applied PCA to the
Improving Heart Disease Prediction Using Feature 769
UCI dataset (297 samples in total, with 13 input attributes and one output attribute),
along with a regression technique used for feature reduction and ranking. The features
selected using the PCA method were further utilized for classification and
prediction through regression and feedforward neural network models.
Bashir et al. [15] in 2019 used a hybrid approach combining various feature selection
methods and ensemble learning with Minimum Redundancy Maximum Relevance
(MRMR) feature selection. Mohan et al. [16] in 2019 also used a hybrid
approach, combining a random forest with a linear model (HRFLM) for optimal
performance; an ANN with backpropagation is used within HRFLM.
All these research works are inspired by natural phenomena for optimizing
performance in general and have their own sets of advantages as well
as limitations. This article goes a step further and employs evolutionary algorithms
such as GA and PSO for feature reduction in combination with traditional
machine learning algorithms. The improvement in prediction accuracy opens the
door to employing evolutionary algorithms for feature selection before
predicting a utility value.
3 Proposed Work
The objective is to improve the accuracy of classification models that predict
heart disease when applied to heart datasets. An accurate prediction model
needs a dataset with the best feature set, from which noise and redundancies
have been removed. The wrapper method for feature selection is a wise choice here;
it is applied using the genetic algorithm. However, it may not perform well if proper
parameters for the algorithm are not set. The fitness function plays a significant
role in choosing the next generation for the algorithm, so choosing the right
fitness function can improve the GA further. Parameters such as the crossover and
mutation probability values are optimized by trial and error only. The goal is to optimize
feature selection by using various classification functions as the fitness function in the
genetic algorithm, finding a better next state so as to reach the optimized subset of
features. Since the motive is to remove redundant features, Naïve Bayes is one
of the stronger candidates. Naïve Bayes also resonates with the same principle:
if some redundancy is left in the features, the fitness value will be
meager compared to the others. Due to this, Naïve Bayes will give higher accuracy
than other classification methods. The proposed algorithm is shown
in Algorithm 1, and its flow is explained in Fig. 3. The genomes for the
next state will be only those feature sets that have a higher fitness value. This
hypothesis is further verified with experiments and with other classification methods
as fitness functions in the analysis section.
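The wrapper loop described above can be sketched as follows. This is a minimal sketch, not the paper's implementation: the fitness function is passed in as a parameter, standing in for the Naïve Bayes accuracy (plus feature-count term) that the proposed algorithm evaluates, and all names and default parameter values are illustrative assumptions.

```python
import random

def ga_feature_selection(n_features, fitness_fn, pop_size=20, generations=30,
                         p_crossover=0.8, p_mutation=0.05, seed=0):
    """Search for a feature-subset bit mask maximizing fitness_fn(mask).

    fitness_fn takes a 0/1 mask of length n_features and returns a
    non-negative score (e.g. Naive Bayes accuracy on the selected features).
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]

    def roulette(fits):
        # Fitness-proportional selection over the current population.
        pointer = rng.uniform(0, sum(fits))
        acc = 0.0
        for ind, f in zip(pop, fits):
            acc += f
            if pointer <= acc:
                return ind
        return pop[-1]

    best, best_fit = pop[0][:], float("-inf")
    for _ in range(generations):
        fits = [fitness_fn(ind) for ind in pop]
        for ind, f in zip(pop, fits):
            if f > best_fit:
                best, best_fit = ind[:], f
        nxt = []
        while len(nxt) < pop_size:
            a, b = roulette(fits)[:], roulette(fits)[:]
            if rng.random() < p_crossover:          # single-point crossover
                cut = rng.randrange(1, n_features)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):                    # bit-flip mutation
                nxt.append([g ^ (rng.random() < p_mutation) for g in child])
        pop = nxt[:pop_size]
    return best, best_fit
```

In the paper's setting, `fitness_fn` would train Naïve Bayes on the attributes selected by the mask and return the resulting accuracy, so feature sets carrying redundancy score poorly and are filtered out over the generations.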
4 Experimental Details
WEKA is used as the tool to build the predictive models and further increase their
accuracy. It is a software tool for analyzing and working with different machine
learning models. It has all the packages needed to build a classification model based on
the provided data set.

An excerpt of the proposed procedure (Algorithm 1) follows; each attribute is first standardized as

z = (x − μ)/σ (3)

3: Generate a random population of chromosomes for evaluation.
4: while the desired accuracy is not achieved and the threshold number of iterations is not reached do
5: Train the model using Naïve Bayes and evaluate the accuracy f1(I) for the sub-optimal feature sets using Eq. 4:

f1(I) = (TP + TN)/(TP + TN + FP + FN) (4)

where TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives, respectively.
6: Compute the number-of-genes term f2(I) using Eq. 5:

f2(I) = 1 − (no. of selected features)/(size of feature set) (5)

7: The fitness is evaluated for the genomes (feature sets) using the fitness function of Eq. 6.

In the attribute selection section, various wrapper and
filter methods are available to choose an optimal subset. Ranking of attributes
can be done with other methods like ReliefF algorithm, Pearson’s correlation,
etc. Initially, the data set is loaded into the WEKA tool; then various filter
and selection methods can be applied. The tool allows changing various input
parameters of the algorithms, such as the kernel function for SVM or the number
of generations for the genetic algorithm. The description of the utilized dataset is
provided in Subsect. 4.1, and the flow of experimental activities is discussed in
Subsect. 4.2.
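For concreteness, the two fitness terms of Eqs. 4 and 5 in Algorithm 1 can be computed as below. Since the combining rule of Eq. 6 is not reproduced in this excerpt, the weights `w1` and `w2` are illustrative assumptions, not values from the paper.

```python
def accuracy_term(tp, tn, fp, fn):
    """f1(I) of Eq. 4: classification accuracy from the confusion matrix."""
    return (tp + tn) / (tp + tn + fp + fn)

def gene_term(n_selected, n_total):
    """f2(I) of Eq. 5: rewards smaller feature subsets."""
    return 1 - n_selected / n_total

def fitness(tp, tn, fp, fn, n_selected, n_total, w1=0.8, w2=0.2):
    """Combined fitness in the spirit of Eq. 6; w1, w2 are assumed weights."""
    return w1 * accuracy_term(tp, tn, fp, fn) + w2 * gene_term(n_selected, n_total)
```

For example, a model with 40 true positives, 40 true negatives, and 10 of each error type, using 7 of 14 attributes, has f1 = 0.8 and f2 = 0.5.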
Improving Heart Disease Prediction Using Feature 771
The dataset used is the standard UCI Hungarian dataset, which has 14 attributes
describing various factors such as age, chest pain, exercise-induced angina, etc.
The description of the attributes is given in Table 1. The dataset has 294 instances.
The motive is to design a framework that gives better performance in most of the
predictive models and produces higher accuracy than the others. The data set is split
into a training set (70%) and a testing set (30%).
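In the experiments, the split is performed inside WEKA; a standalone equivalent of the 70/30 shuffle-and-split (with an assumed random seed) might look like:

```python
import random

def train_test_split(rows, train_frac=0.7, seed=42):
    """Shuffle the records and split them into train/test partitions."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(len(rows) * train_frac)           # size of the training partition
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test
```

Applied to the 294 instances of the Hungarian dataset, this yields 205 training and 89 testing records.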
References
1. World Health Organization: Cardiovascular diseases (2021). www.who.int/health-topics/cardiovascular-diseases/tab/tab/1. Accessed 02 Sept 2021
2. Richens, J.G., Lee, C.M., Johri, S.: Improving the accuracy of medical diagnosis
with causal machine learning. Nat. Commun. 11(1), 1–9 (2020)
3. UCI Machine Learning Repository: Heart disease data set. archive.ics.uci.edu/ml/datasets/heart+disease. Accessed 02 Sept 2021
4. Poli, R., Kennedy, J., Blackwell, T.: Particle swarm optimization. Swarm Intell.
1(1), 33–57 (2007)
5. Mirjalili, S.: Genetic algorithm. In: Evolutionary Algorithms and Neural Networks.
Studies in Computational Intelligence, vol. 780. Springer, Cham (2019). https://doi.org/10.1007/978-3-319-93025-1_4
6. Murphy, K.P.: Naive Bayes classifiers. Univ. British Columbia 18(60), 1–8
(2006)
7. Noble, W.S.: What is a support vector machine? Nat. Biotechnol. 24(12), 1565–
1567 (2006)
8. Mathuria, M.: Decision tree analysis on j48 algorithm for data mining. Int. J. Adv.
Res. Comput. Sci. Softw. Eng. 3(6) (2013)
9. Kanikar, P., Shah, D.R.: Prediction of cardiovascular diseases using support vector
machine and Bayesian classification. Int. J. Comput. Appl. 156(2) (2016)
10. Mirmozaffari, M., Alinezhad, A., Gilanpour, A.: Heart disease prediction with data
mining clustering algorithms. Int. J. Comput. Commun. Instrument. Eng. 4(1),
16–19 (2017)
11. Jabbar, M.A., Deekshatulu, B.L., Chandra, P.: Classification of heart disease using k-nearest
neighbor and genetic algorithm. Procedia Technol. 10, 85–94 (2013)
12. Gokulnath, C.B., Shantharajah, S.: An optimized feature selection based on genetic
approach and support vector machine for heart disease. Cluster Comput. 22(6),
14777–14787 (2019)
13. Gárate-Escamila, A.K., El Hassani, A.H., Andrès, E.: Classification models for
heart disease prediction using feature selection and PCA. Inform. Med. Unlocked 19,
100330 (2020)
14. Santhanam, T., Ephzibah, E.P.: Heart disease classification using PCA and feed
forward neural networks. In: Prasath, R., Kathirvalavakumar, T. (eds.) Mining
Intelligence and Knowledge Exploration. LNCS, vol. 8284. Springer, Cham (2013).
https://doi.org/10.1007/978-3-319-03844-5_10
15. Bashir, S., Khan, Z.S., Khan, F.H., Anjum, A., Bashir, K.: Improving heart disease
prediction using feature selection approaches. In: 2019 16th International Bhurban
Conference on Applied Sciences and Technology (IBCAST), pp. 619–623. IEEE
(2019)
16. Mohan, S., Thirumalai, C., Srivastava, G.: Effective heart disease prediction using
hybrid machine learning techniques. IEEE Access 7, 81542–81554 (2019)