Stroke Prediction Using Machine Learning
---------------------------------------------------------------------------------------------------------------------------------------
Submitted: 25-05-2021 Revised: 01-06-2021 Accepted: 05-06-2021
---------------------------------------------------------------------------------------------------------------------------------------
ABSTRACT: A stroke is a brain attack that occurs when adequate blood flow to the brain is stopped or a blood vessel in the brain ruptures. Several works have been carried out to accurately predict the outcome of various diseases by analyzing the parameters associated with them and by comparing the performance of different predictive data mining technologies. In our work, various classification algorithms are compared on the Stroke Prediction Dataset (Kaggle [1]), with features such as hypertension, smoking status, and BMI. Scikit-Learn's SimpleImputer has been used to fill in the missing values, and ADASYN [6] is used to balance the imbalanced dataset. Random Forest [9,10,11] and XGBoost [13,16,18] are the two classification models compared, with the accuracy of Random Forest (97.67%) just edging out that of the XGBoost classifier (96.894%).

KEYWORDS: Random Forest, XGBoost, SimpleImputer, ADASYN sampling, Stroke.

I. INTRODUCTION
Stroke, also called a cerebrovascular accident (CVA) or "brain attack", is the second leading cause of death globally. Strokes are of two types: ischemic stroke, in which part of the brain loses blood flow, and hemorrhagic stroke, in which bleeding occurs within the brain. Globally, one in four adults over the age of 25 will have a stroke in their lifetime. In the United States, around 795,000 people have a stroke every year, of whom about 610,000 are having a stroke for the first time. Around 87% of strokes are ischemic strokes, in which the blood flow to the brain is blocked. Stroke is one of the leading causes of serious long-term disability and is the cause of reduced mobility in more than half of stroke survivors aged 65 and over. The impact of a stroke is often both short- and long-term, depending on which part of the brain is affected and how quickly it is treated.

Data generated by hospitals in the form of patient records is an important source of information. Using powerful data mining tools, it is possible to extract useful information from these records. The aim of this work is to take a specific area of medical research and help accurately predict the outcome based on specific attributes. The model takes a patient's data as input and provides the appropriate prediction as output. The system extracts hidden knowledge from the clinical database and predicts whether the patient has the disease: using medical profiles such as hypertension, BMI, and smoking status, it is able to predict the chances of a patient contracting the disease. Classification models, when provided with appropriate attributes, can be used for accurate prediction of diseases. A survey of various related and non-related research papers showed that the Random Forest classifier consistently gave the best accuracy among the classification algorithms; accordingly, those authors used Random Forest classification in their work and compared it with other classification algorithms under different feature selection methods. In our work, the Stroke Prediction Dataset (Kaggle [1]) is used to predict whether a patient is likely to have a stroke. This dataset includes patients' hypertension, BMI, and smoking status data.

The rest of the paper is divided into five sections. Section 2 reviews the literature. Section 3 explains our methodology, which consists of the preprocessing, feature selection, sampling, and classification processes. The analysis of the results is shown in Section 4, the dataset used for our model is presented in Section 5, and the conclusion follows in Section 6.
Then a heatmap was used to understand the correlation between the different features in the dataset. A heatmap is a statistical tool used to analyze data by representing the correlation among many features of a dataset, and the correlation between the features and the outcome can be very useful in determining which features are important. The correlation values lie between -1 and 1. If a value is close to 1, the feature influences the output directly; if it is close to -1, the feature influences the final result inversely; and if it is close to 0, the feature is not likely to influence the final output much. The last column of the heatmap shows the correlation of stroke with the other features of the dataset.

Fig 2: Heatmap
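A heatmap like Fig 2 can be generated with pandas and seaborn; the following is a minimal sketch, assuming the Kaggle CSV file name and a binary target column named stroke:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("healthcare-dataset-stroke-data.csv")  # assumed file name

    # Pearson correlations over the numeric columns; every value lies in [-1, 1].
    corr = df.corr(numeric_only=True)

    # Annotated heatmap; the 'stroke' row/column shows each feature's
    # correlation with the target.
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
    plt.title("Feature correlation heatmap")
    plt.show()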
Now it is evident that the features are correlated with the target variable. Since all the features are correlated to a fair degree, none of them can be discarded, and data preprocessing can now be applied to the dataset.

3.2 Data Pre-processing
While visualizing and performing exploratory data analysis, 201 records were found to be missing the Body Mass Index value. Scikit-Learn's SimpleImputer was therefore used to replace these missing values with the mean Body Mass Index. The dataset had 5 categorical attributes (gender, ever_married, work_type, Residence_type, smoking_status), which needed to be encoded before being fed to the models; to resolve this, OneHotEncoder [14] from the Scikit-Learn preprocessing module was used to encode them into numerical values. The resulting records have 21 attributes.

Two attributes have much wider ranges than the rest: avg_glucose_level, ranging from 55.12 to 271.74, and Body Mass Index, ranging from 10.3 to 97.6. To resolve this, feature scaling [15,17] was applied to the dataset; it was done before splitting the data into its two parts, i.e., the train set and the test set, using StandardScaler from Scikit-Learn's preprocessing module. Standardization [17] is the scaling technique used: the values are centered around the mean with a unit standard deviation, so the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation. The formula for standardization is

z = (x - μ) / σ

where μ is the mean of the attribute and σ is its standard deviation.
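A minimal sketch of this preprocessing stage follows. The column names come from the Kaggle dataset schema; the file name, the 80/20 split, and the placement of ADASYN (mentioned in the abstract) before the split are assumptions, since the paper does not spell them out:

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from imblearn.over_sampling import ADASYN

    df = pd.read_csv("healthcare-dataset-stroke-data.csv")  # assumed file name

    # Replace the 201 missing BMI values with the column mean (SimpleImputer).
    df["bmi"] = SimpleImputer(strategy="mean").fit_transform(df[["bmi"]]).ravel()

    # One-hot encode the 5 categorical attributes (OneHotEncoder [14]).
    cat_cols = ["gender", "ever_married", "work_type",
                "Residence_type", "smoking_status"]
    enc = OneHotEncoder(sparse_output=False)  # use sparse=False on scikit-learn < 1.2
    encoded = pd.DataFrame(enc.fit_transform(df[cat_cols]),
                           columns=enc.get_feature_names_out(cat_cols),
                           index=df.index)
    X = pd.concat([df.drop(columns=cat_cols + ["id", "stroke"]), encoded], axis=1)
    y = df["stroke"]

    # Standardize to zero mean and unit variance before the split, as in the paper.
    X = StandardScaler().fit_transform(X)

    # Balance the minority (stroke) class with ADASYN [6]; applying it before
    # the split is an assumption, as the paper does not say where this step sits.
    X, y = ADASYN(random_state=42).fit_resample(X, y)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)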
Random Forest, instead of searching for the most important feature when splitting a node, searches for the best feature among a random subset of features. This is one of the reasons it tends to be a better model [9].

Hyperparameters [11] are used to increase the predictive power of the model and also to increase its speed. n_estimators is the most common hyperparameter used to increase the accuracy of the model: it tells the algorithm the number of trees it has to build. The greater the number of trees, the better the accuracy tends to be.
As the number of trees is increased, however, there is a chance that the computation becomes a bit slower.

max_features is another important hyperparameter [11]. It sets the maximum number of features that Random Forest considers when splitting an internal node, and tuning it can help increase the accuracy of the model.

random_state is also a hyperparameter; it is used to make the output of the model reproducible. When the value of random_state is kept the same, the model trains on the same training data each time, so the time required for computation decreases and the effective speed of the model increases.

The Random Forest algorithm is one of the most widely used algorithms for classification and regression problems. The accuracy achieved by Random Forest is usually among the best of the compared models, and it also runs more efficiently on large databases than many other algorithms [10].

XGBoost Classifier: After using the Random Forest classifier on the dataset, the XGBoost classifier was also applied in an attempt to get even better accuracy [16]. XGBoost, or eXtreme Gradient Boosting, is a sophisticated implementation of the gradient boosting algorithm. It is a supervised learning algorithm that attempts to accurately predict a target variable by combining the estimates of a group of weaker, less complicated models.

Boosting is an ensemble method that seeks to build a robust classification model out of weak classifiers. By adding models on top of one another iteratively, the errors of the older models are corrected by their succeeding predictors until the training data can be accurately predicted. Rather than assigning different weights to the classifiers after every iteration, XGBoost fits each new model to the residuals of the previous prediction and then minimizes the loss incurred when adding the newest prediction; in other words, the model is updated using gradient descent, hence the name gradient boosting. XGBoost specifically implements this algorithm for decision tree boosting, with an additional custom regularization term in the objective function.
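A minimal sketch of how the two classifiers can be fitted and compared on the preprocessed data (X_train, X_test, y_train, y_test come from the preprocessing sketch above; max_features="sqrt" and the XGBoost settings are illustrative assumptions, not values reported in the paper):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from xgboost import XGBClassifier

    # Random Forest with the paper's best-performing tree count; a fixed
    # random_state makes the run reproducible.
    rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                random_state=42)
    rf.fit(X_train, y_train)
    print("Random Forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))

    # XGBoost: gradient-boosted decision trees with a regularized objective.
    xgb = XGBClassifier(n_estimators=200, eval_metric="logloss", random_state=42)
    xgb.fit(X_train, y_train)
    print("XGBoost accuracy:", accuracy_score(y_test, xgb.predict(X_test)))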
XGBoost accuracy: 96.894%
Random Forest: -

n_estimators    Accuracy
200             97.67%
The above table shows the accuracy corresponding to the number of n_estimators. As n_estimators increases, the accuracy of the model also increases; the best accuracy was achieved when n_estimators was 200. After a certain value of n_estimators the accuracy remained the same, and sometimes using too many n_estimators can even decrease the accuracy of the model.
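This behaviour can be checked with a simple sweep over n_estimators; a sketch, reusing the training variables from the earlier sketches:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    # Accuracy typically climbs with more trees, then plateaus; very large
    # forests mainly cost extra computation.
    for n in (10, 50, 100, 200, 400):
        rf = RandomForestClassifier(n_estimators=n, random_state=42)
        rf.fit(X_train, y_train)
        acc = accuracy_score(y_test, rf.predict(X_test))
        print(f"n_estimators={n}: accuracy={acc:.4f}")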
Confusion matrix (rows: actual class, columns: predicted class):

            Predicted 0    Predicted 1
Actual 0        952             20
Actual 1         40            920
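Reading the matrix this way, accuracy works out to (952 + 920) / 1932 ≈ 96.894%, which matches the XGBoost accuracy reported above, so this appears to be the XGBoost confusion matrix. A short sketch of the derived metrics:

    import numpy as np

    # Confusion matrix from the paper: rows = actual 0/1, columns = predicted 0/1.
    cm = np.array([[952, 20],
                   [40, 920]])
    tn, fp = cm[0]
    fn, tp = cm[1]

    accuracy = (tp + tn) / cm.sum()     # 1872 / 1932 ≈ 0.96894
    precision = tp / (tp + fp)          # 920 / 940 ≈ 0.9787
    recall = tp / (tp + fn)             # 920 / 960 ≈ 0.9583
    f1 = 2 * precision * recall / (precision + recall)
    print(f"accuracy={accuracy:.5f}, precision={precision:.4f}, "
          f"recall={recall:.4f}, f1={f1:.4f}")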