Heart Disease Prediction Using Machine Learning Techniques: Raparthi Yaswanth, Y. Md. Riyazuddin
Shan Xu et al. [3] describe cardiovascular disease prediction, in which early detection is important for treating the patient. Their system focuses on improving accuracy with the help of support vector machine and naive Bayes algorithms. It consists of four parts: a data interface that holds the hospitals' raw data, data preprocessing for data integration, feature selection to retrieve only the attributes that are useful for improving accuracy, and classification of those attributes.
Mansoor et al. [4] develop and validate a prediction model by implementing multivariate logistic regression together with full and reduced random forest models. In this method, eleven variables are included in the final model based on backward elimination, while the full random forest model has 32 variables, which feature selection reduces to 17 to improve accuracy. The result can be a useful and accurate tool in clinical practice.
Of the 303 rows in the heart disease dataset, 164 are negative and 139 are positive. The dataset is stored in CSV (comma-separated values) file format, where each row represents a single record.
A. System Model
C. Preprocessing
In this phase, we first gather the training dataset, which is in CSV file format. We read the file with the pandas library and convert its contents to a list array, because the machine cannot interpret data in the raw file format. The input message given by the user is appended to the same list array. Once the conversion is complete, any null values in the training and testing data are removed.
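The preprocessing step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the in-memory CSV, its column names, its values, and the helper name `load_and_clean` are all hypothetical stand-ins for the real dataset file.

```python
import io

import pandas as pd

# Hypothetical miniature CSV; the real file, columns and values are assumptions.
CSV_TEXT = """age,chol,thalach,target
63,233,150,1
37,,187,1
41,204,172,0
56,236,,0
"""


def load_and_clean(csv_source):
    """Read a CSV into a DataFrame and drop every row that has a null or empty cell."""
    frame = pd.read_csv(csv_source)
    return frame.dropna().reset_index(drop=True)


clean = load_and_clean(io.StringIO(CSV_TEXT))

# Convert the cleaned table to a plain list of rows (the "list array" in the text).
rows = clean.to_numpy().tolist()
```

Here `pd.read_csv` accepts a file path as well as a file-like object, so the same helper would work on a CSV file on disk.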
D. Standardization
In this step, we standardize the training dataset so that the data in each column is scaled, which helps in obtaining the most accurate model. To perform the scaling we use the StandardScaler class from the Python scikit-learn library, which provides a fit_transform function; it takes as input the dataset columns which have higher
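A minimal sketch of the scaling step, assuming a small hypothetical feature matrix (the values below are illustrative and not taken from the dataset). StandardScaler rescales each column to zero mean and unit variance, that is, $z = (x - \mu) / \sigma$ computed per column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training features: two columns on different scales.
X_train = np.array([
    [63.0, 233.0],
    [37.0, 250.0],
    [41.0, 204.0],
    [56.0, 236.0],
])

scaler = StandardScaler()

# fit_transform learns each column's mean and standard deviation,
# then returns the standardized training data.
X_scaled = scaler.fit_transform(X_train)

# New (test or user) data is scaled with the statistics learned from training.
X_new_scaled = scaler.transform(np.array([[50.0, 220.0]]))
```

Calling `transform` rather than `fit_transform` on new data keeps the test values on the same scale as the training values.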
Fig. 1. System Architecture
In Figure 1, the system loads the heart disease training dataset, which is obtained from the UCI repository, and selects the best attributes for standardization. Preprocessing is then applied to any row that has null values or empty cells. Once preprocessing and feature extraction are complete, the dataset is split into training and testing sets, and we choose machine learning classifiers such as SVM, KNN, NB, DT, RF, LR and NN for heart disease prediction. The prediction process takes the test data as input and, with the help of the training dataset, returns the output as positive or negative. We also calculate the accuracy measures by splitting the dataset into 70% for training and 30% for testing.
B. Data Collection
The heart disease dataset used in this system is shown in Figure 2. It contains 0's and 1's, where 0 indicates a negative status and 1 indicates a positive status. The dataset was imported from the UCI repository [5] and contains 303 rows.
E. Classification Techniques
Naive Bayes
The Naive Bayes machine learning algorithm is useful for categorizing documents and for email spam filtering, and it is based on Bayes' rule [6]:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \qquad (1)$$
Wherein:
A is a class
B is a message
P(A) is the class probability
P(B) is the probability of a message
International Journal of Innovative Technology and Exploring Engineering (IJITEE)
ISSN: 2278-3075, Volume-9 Issue-5, March 2020
P(B|A) is the conditional probability of message B given class A
P(A|B) is the conditional probability that message B belongs to class A.
In our system, we implement the Naive Bayes algorithm with the GaussianNB class from the Python library sklearn.naive_bayes. This class provides the function fit(), which builds the training model from the independent and dependent values of the dataset, and the function predict(), which takes the testing values as input and predicts the heart disease status as positive or negative.
K-Nearest Neighbor:
The K-nearest neighbor classifier [7] is one of the simplest algorithms for prediction on a dataset, and it relies on the Euclidean distance. Here K is the number of neighbors, so with K = 1 the single nearest neighbor determines the output. The classifier works on a voting basis: during prediction it collects the votes of the nearest neighbors, and the class that receives the most votes becomes the predicted output value.
$$D = \sqrt{\sum_{i} (x_i - y_i)^2} \qquad (2)$$
Where:
D is the distance between x and y
K is the number of nearest neighbors
x and y are independent attribute values
In our system, we implement the K-nearest neighbor algorithm with the KNeighborsClassifier class from the Python library sklearn.neighbors. This class provides the function fit(), which builds the training model from the independent and dependent values of the dataset, and the function predict(), which takes the testing values as input and predicts the heart disease status as positive or negative.
Decision Tree:
A tree has nodes and branches. A decision tree is a classifier in the form of a tree structure. The tree has two types of nodes: decision nodes and leaf nodes. A decision node specifies a choice or test, whose outcome decides which branch to follow, and a leaf node indicates the classification or the value of an example. The decision tree works well for both classification and regression problems. Classification means that we are given a group of data and are supposed to assign the data to a predefined set of classes. The representation for the classification and regression tree (CART) is a binary tree. Each decision node represents a single input variable (x) and a split on that variable. The leaf nodes of the tree contain an output variable (y), which is used to make a prediction. In a decision tree, a greedy approach [9] called recursive binary splitting is used to divide the input space. In this procedure, different split points are tried and tested using a cost function. All input variables and all possible split points are evaluated, and the split with the best cost is chosen in a greedy manner [10]. For classification, the Gini cost function indicates how pure the nodes produced by the split points are. The Gini index follows the equation below:
$$G = 1 - \sum_{i} p_i^2 \qquad (6)$$
If the decision tree applies Information Gain, it follows the equation below:
$$E = -\sum_{i} p_i \log_2 p_i \qquad (7)$$
where $p_i$ denotes the probability of class i, E denotes the entropy and G denotes the Gini index.
In our system, we implement the Decision Tree algorithm with the DecisionTreeClassifier class from the Python library sklearn.tree.
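The Gini index, the entropy, and the greedy choice of a split point described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`gini`, `entropy`, `best_split`) and the toy data are hypothetical, and it handles a single input variable, whereas the system itself relies on sklearn's DecisionTreeClassifier.

```python
import math
from collections import Counter


def gini(labels):
    """Gini index G = 1 - sum(p_i^2) for a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy E = -sum(p_i * log2(p_i)) for a list of class labels."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def best_split(values, labels, cost=gini):
    """Greedily pick the threshold on one variable with the lowest weighted cost."""
    best_threshold, best_cost = None, float("inf")
    for threshold in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        if not left or not right:
            continue  # a split must put samples on both sides
        weighted = (len(left) * cost(left) + len(right) * cost(right)) / len(labels)
        if weighted < best_cost:
            best_threshold, best_cost = threshold, weighted
    return best_threshold, best_cost


# Hypothetical toy data: one attribute and a 0/1 heart disease status.
ages = [29, 35, 41, 52, 58, 63]
status = [0, 0, 0, 1, 1, 1]
threshold, cost_value = best_split(ages, status)
```

A full tree would apply `best_split` over every input variable and then recurse on the two resulting subsets, which is the recursive binary splitting mentioned in the text.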
Support Vector Machine:
The support vector machine [8] is a supervised machine learning algorithm used for the classification of instances. It can separate data linearly, and for non-linear data it can use kernel functions. The SVM separates two classes with the hyperplane that has the largest margin between them. The margin between the two classes is the largest distance between the closest data points of those classes, which are called support vectors.
$$\vec{w} \cdot \vec{u} + b \ge 1 \qquad (3)$$
$$\vec{w} \cdot \vec{u} + b \le -1 \qquad (4)$$
$$y\,(\vec{w} \cdot \vec{u} + b) - 1 \ge 0 \qquad (5)$$
Wherein
b is a constant.
w, u are vectors.
y is the output, where 1 is for positive samples and -1 for negative samples.
In our system, we implement the support vector machine algorithm with the SVR class from the Python library sklearn.svm. This class provides the function fit(), which builds the training model from the independent and dependent values of the dataset, and the function predict(), which takes the test data as input.
Logistic Regression:
Logistic Regression [12] is a supervised machine learning algorithm that solves classification problems. It works similarly to linear regression. It is used to describe data and to explain the relationship between one dependent variable and one or more independent variables. It is a regression model built to predict the probability that the given input data belongs to category "1". It becomes a classifier once a decision threshold is applied, so setting the threshold value is very important. The choice of threshold is largely driven by the precision and recall values.
$$P = \frac{1}{1 + e^{-z}} \qquad (8)$$
where z is a linear combination of the input variables.
In our system, we implement the Logistic Regression algorithm with the LogisticRegression class from the Python library sklearn.linear_model.
Random Forest:
Random Forest [12] is a
