
International Journal of Advances in Engineering and Management (IJAEM)

Volume 3, Issue 6, June 2021, pp. 985-992 | www.ijaem.net | ISSN: 2395-5252

Stroke Prediction Using Machine Learning


Vatsal S Chheda1, Samit K Kapadia2, Bhavya K Lakhani3, Pankaj Sonawane4*

1,2,3 Student, Department of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India
4 Assistant Professor, Department of Computer Engineering, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India

Submitted: 25-05-2021 | Revised: 01-06-2021 | Accepted: 05-06-2021

ABSTRACT: A stroke, or brain attack, occurs when adequate blood flow to the brain is stopped or a blood vessel in the brain ruptures. Several works have been carried out to accurately predict the outcome of various diseases by analyzing the parameters associated with them and by comparing the performance of different predictive data mining techniques. In our work, various classification algorithms are compared on the Stroke Prediction Dataset (Kaggle [1]), with features like hypertension, smoking status, BMI, etc. Scikit-Learn's SimpleImputer has been used to fill in the missing values, and ADASYN [6] is used to balance the imbalanced dataset. Random Forest [9,10,11] and XGBoost [13,16,18] are the two classification models compared, with Random Forest's accuracy (97.67%) just edging out the XGBoost classifier (96.894%).

KEYWORDS: Random Forest, XGBoost, SimpleImputer, ADASYN sampling, Stroke.

I. INTRODUCTION
Stroke, also called a cerebrovascular accident (CVA) or "brain attack", is the second leading cause of death globally. Strokes are of two types: ischemic stroke (part of the brain loses blood flow) and hemorrhagic stroke (bleeding occurs within the brain). Globally, one in four adults over the age of 25 will have a stroke in their lifetime. In the United States, around 795,000 people have a stroke every year, and about 610,000 of them are having a stroke for the first time. Around 87% of strokes are ischemic, in which the blood flow to the brain is blocked. Stroke is one of the leading causes of serious long-term disability and is the cause of reduced mobility in more than half of stroke survivors aged 65 and over. The impact of a stroke can be short- and long-term, depending on which part of the brain is affected and how quickly it is treated. Data generated by hospitals as patient records are an important source of information.

Using powerful data mining tools, it is possible to extract useful information from these records. The aim of this work is to take a specific area of medical research and help accurately predict the outcome based on specific attributes. The model takes a patient's data as input and provides the appropriate prediction as output. The system extracts hidden knowledge from the clinical database and predicts whether the patient has the disease. Using medical profiles such as hypertension, BMI, and smoking status, it is able to predict the chances of a patient contracting a disease. Classification models, when provided with appropriate attributes, can be used for accurate prediction of diseases. A survey of related and unrelated research papers showed that the Random Forest classifier consistently gave the best accuracy among the classification algorithms considered, so the authors have used Random Forest classification in this work and compared it with other classification algorithms under different feature selection methods. In our work, the Stroke Prediction Dataset (Kaggle [1]) is used to predict whether a patient is likely to have a stroke. This dataset includes each patient's hypertension, BMI, and smoking status.

The rest of the paper is divided into five sections. Section 2 reviews the literature. Section 3 describes our methodology, which consists of the preprocessing, feature selection, sampling, and classification processes. Section 4 presents the analysis of the results, Section 5 describes the dataset used for our model, and Section 6 concludes the paper.


II. REVIEW OF LITERATURE
In [2], the authors have proposed a model to predict the chances of stroke using various machine learning algorithms such as Decision Tree, Naïve Bayes, and Neural Networks. Principal Component Analysis (PCA) is used for dimensionality reduction. The dataset used is the heart disease dataset provided by UCI, which consists of Social Security Number, EKG (day/month/year), Age, Patient Number, Blood Pressure, Type of Chest Pain, Gender, Cigarettes, Family History, Hypertension, Nitrates, Cholesterol, Years, Heart Rate, and Calcium Channel Blocker. Firstly, pre-processing techniques are used to remove duplicate records, missing data, and noisy and inconsistent data. Secondly, the dimensions of the dataset are reduced using PCA, which determines the attributes most involved in predicting stroke. Lastly, three classification algorithms are used for the diagnosis of patients with stroke disease.

In [3], the authors' aim is to review the existing models for cardiovascular risk assessment. Their research provides an in-depth analysis of various conventional methods and models like the Framingham model, Reynolds Risk Score, MUCA, PROCAM, and Global Vascular Risk Score, and examines the evidence available on new biomarkers and non-clinical measures for improving risk prediction.

In [7], the authors propose a novel automatic feature selection algorithm to predict stroke, using the CHS dataset. The proposed algorithm in combination with SVM achieves a greater AUC score than the Cox proportional hazards model. Further, the authors have designed a margin-based censored regression algorithm that combines the concept of margin-based classifiers with censored regression to achieve a better concordance index than the Cox model. The authors implement various algorithms to evaluate the prediction performance. This research demonstrates that their model can be used for identifying potential risk factors for diseases without performing clinical trials.

In [12], the authors have designed a model to predict the probability of a stroke occurring, using a neural network-based classification model and principal component analysis for dimensionality reduction. Three datasets are used: Stroke & Claudication (STCL), Stroke & TIA (STTIA), and Stroke & Angioplasty (STAN). Feature selection is done using Decision Trees. This research demonstrated that the ANN-based prediction of stroke occurrence yields accuracies of 95% (STCL), 95.2% (STTIA), and 97.7% (STAN) respectively.

III. MODEL ARCHITECTURE
3.1 Data Visualization
Data Visualization is the process of describing data in graphical form. It can be used to find trends and patterns in the dataset, and a visual representation helps in analyzing the data more quickly than the raw data. It can also help in finding flaws in the dataset and provides valuable insights into the data. To understand the data and select features for classification, this paper has used bar plots and a heatmap. The bar plot below shows the counts of the target variable (stroke). It shows that the dataset is imbalanced, so the classification will not be accurate: if a model is trained on this dataset as-is, it will overfit on class 0, the classification will be one-sided, and the model will not be useful in any circumstance.

Fig 1: Bar plot for counts of two classes of stroke
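The bar plot in Fig 1 can be reproduced in a few lines. Below is a minimal sketch, assuming the Kaggle CSV has been saved locally as healthcare-dataset-stroke-data.csv (the file name is an assumption, not something the paper specifies):

    # Minimal sketch: bar plot of the two classes of the target variable.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("healthcare-dataset-stroke-data.csv")

    # Count the 0/1 values of "stroke" and plot them; the imbalance
    # (4861 vs 249 records) is immediately visible.
    df["stroke"].value_counts().plot(kind="bar")
    plt.xlabel("stroke")
    plt.ylabel("count")
    plt.show()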


This paper then used a heatmap to understand the correlation between the different features in the dataset. A heatmap is a statistical tool used to represent the correlation amongst the features of a dataset, and the correlation between the outcome and the features can be very useful in determining which features are important. The values lie between -1 and 1. If a value is close to 1, the feature influences the output directly; if the value is towards -1, the feature influences the final result inversely; and if the value is close to 0, the feature is not likely to influence the final output much. The last column of the heatmap shows the correlation of stroke with the other features of the dataset.

Fig 2: Heatmap
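The heatmap in Fig 2 can be sketched as follows, continuing from the previous snippet (seaborn is assumed to be available, and the numeric_only argument requires a recent pandas):

    # Pairwise correlations of the numeric columns; values lie in [-1, 1].
    import seaborn as sns
    import matplotlib.pyplot as plt

    corr = df.corr(numeric_only=True)
    sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()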

The heatmap shows that the features are correlated with the target variable. Since all the features are correlated to a reasonable degree, no feature can be discarded, and data preprocessing can therefore be applied to the full dataset.

3.2 Data Pre-processing
Upon visualizing the data and performing Exploratory Data Analysis, this paper found 201 records with missing values of the Body Mass Index, so Scikit-Learn's SimpleImputer was used to replace these missing values with the mean Body Mass Index. The dataset has 5 categorical attributes (gender, ever_married, work_type, Residence_type, smoking_status), which need to be encoded before being fed to the models. To resolve this, OneHotEncoder [14] from the Scikit-Learn preprocessing module was used to encode them into numerical values. The resulting records have 21 attributes (a minimal sketch of these steps follows).
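The two steps can be sketched as below, assuming the column names used on Kaggle (bmi for the Body Mass Index) and scikit-learn 1.2 or newer for the sparse_output argument:

    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder

    # Replace the 201 missing BMI values with the column mean.
    imputer = SimpleImputer(strategy="mean")
    df[["bmi"]] = imputer.fit_transform(df[["bmi"]])

    # One-hot encode the five categorical attributes.
    categorical = ["gender", "ever_married", "work_type",
                   "Residence_type", "smoking_status"]
    encoder = OneHotEncoder(sparse_output=False)
    encoded = pd.DataFrame(encoder.fit_transform(df[categorical]),
                           columns=encoder.get_feature_names_out(categorical),
                           index=df.index)
    df = pd.concat([df.drop(columns=categorical), encoded], axis=1)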

Standardization [17] is a scaling technique by which values are centered around the mean with a unit standard deviation: the mean of the attribute becomes zero and the resulting distribution has a standard deviation of one. The standardization formula (Fig 3) is

z = (x − µ) / σ

where µ is the mean and σ the standard deviation of the attribute. Two attributes had much wider ranges than the rest: avg_glucose_level, ranging from 55.12 to 271.74, and Body Mass Index, ranging from 10.3 to 97.6. To resolve this, feature scaling [15,17] was applied to the dataset before splitting the data into two sets, i.e., the train set and the test set, using StandardScaler from Sklearn's preprocessing module.
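A sketch of the scaling and splitting steps follows. As described above, the scaler is applied before the split; the 80/20 split proportion and the random seed are assumptions:

    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split

    # Standardize the two wide-range attributes (mean 0, unit variance).
    scaler = StandardScaler()
    cols = ["avg_glucose_level", "bmi"]
    df[cols] = scaler.fit_transform(df[cols])

    # Separate the features from the target and split into train/test sets.
    X = df.drop(columns=["id", "stroke"])   # "id" holds the Patient Id
    y = df["stroke"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)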


Fig 4: Box Plot of BMI before feature scaling

Fig 5: Box Plot of BMI after feature scaling

3.3 Sampling
Upon visualizing the data, this paper concluded that the dataset provided was imbalanced: it contained 4861 records with a stroke event value of 0 (they never had a stroke) and 249 records with a stroke event value of 1 (they have had a stroke in the past). In such cases, the concept of sampling is used to predict the output with greater accuracy. Sampling is one of the most efficient methods to overcome the problem of an imbalanced dataset. The goals of sampling are to create a dataset with a balanced class distribution, i.e., a roughly equal number of samples on either side, and thereby a clear decision boundary between the classes. Since sampling is used to increase the count of minority observations, the resulting dataset should be reasonable: it should remain very similar to the original dataset in many respects, i.e., the distribution of the resulting dataset should be similar to that of the original dataset.

Sampling can be done in two ways:
1) Oversampling: the number of minority class samples is increased in proportion to the majority class samples.
2) Undersampling: the number of majority class samples is decreased in proportion to the minority class samples.

Fig 6: Types of Sampling


The concept of oversampling has been applied here. The reason is that, as the dataset had only about 5000 samples, it made more sense to increase the number of samples to attain better accuracy for the model.

Adaptive Synthetic Sampling Method for Imbalanced Data (ADASYN) [19]:
ADASYN is an improved version of SMOTE (Synthetic Minority Over-sampling Technique). It inherits the main weakness of SMOTE, i.e., its tendency to create bridges between inner and outer points, and it also finds points that are outside of the homogeneous neighborhood. The algorithm uses Euclidean distance for the KNN step, and the SMOTE algorithm is parameterized with K neighbors.

The steps of the SMOTE algorithm are:
1. A random minority point is selected.
2. Then K nearest neighbors belonging to the same minority class are selected at random.
3. A value of lambda (between 0 and 1) is randomly specified.
4. A new point is generated and placed on the vector between the two points, at lambda percent of the distance from the original point.

The key difference in ADASYN is that it takes into account the density distribution, which defines the number of synthetic instances produced for samples that are difficult to learn [20]. After sampling (see the sketch below), a considerable number of new samples has been created, and the distribution of the target variable is equal. The dataset can now be said to be balanced, and this data can be provided to the classification models.
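A minimal ADASYN sketch using the imbalanced-learn package is given below (the paper cites the ADASYN algorithm [6,19] but does not name a library, so imblearn is an assumption); here resampling is applied to the training split:

    from imblearn.over_sampling import ADASYN

    # Oversample the minority (stroke = 1) class; 5 neighbours is the
    # library default for the KNN step described above.
    adasyn = ADASYN(n_neighbors=5, random_state=42)
    X_res, y_res = adasyn.fit_resample(X_train, y_train)
    print(y_res.value_counts())   # the two classes are now roughly balanced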
3.4 Models
Random Forest: After sampling, the dataset has equal counts of each category and is ready for classification algorithms. One such algorithm is Random Forest classification [9].

The Random Forest classification algorithm belongs to the supervised learning field. It builds a forest by ensembling decision trees, and the model is usually trained with the bagging method. The intuition behind bagging is to combine learning models, which eventually improves the overall result. The random forest builds multiple decision trees on samples of the data and then merges them, taking the average to provide a stable and accurate classification.

The Random Forest classification and regression algorithm works in two phases: in the first phase it builds multiple decision trees and combines them, and in the second phase it obtains a prediction from each tree built in the first phase [10].

Algorithm (a training sketch follows Fig 7):
1. Randomly select n data points from the training set.
2. Build a decision tree from the selected n points.
3. To build the decision trees, provide the number of estimators.
4. Repeat steps 1 and 2.
5. For a new data point, obtain the prediction of every decision tree and assign the point to the category that wins the majority of votes.

Fig 7: Random Forest
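A training sketch for this model with the best reported setting (n_estimators = 200) is shown below; the other argument values are assumptions, not the paper's exact configuration:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    rf = RandomForestClassifier(
        n_estimators=200,     # number of trees; 200 gave the best reported accuracy
        max_features="sqrt",  # features considered per split (assumed value)
        random_state=42,      # fixed seed for reproducible results
    )
    rf.fit(X_res, y_res)
    print("Random Forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))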

Random Forest, instead of searching for the single most important feature, searches for the best feature among a random subset of features. This is one of the reasons why it is a better model [9].

Hyperparameters [11] are used to increase the predictive power of the model and also to increase its speed. n_estimators is the most common hyperparameter used to increase the accuracy of the model; it tells the algorithm the number of trees it has to build. The greater the number of trees, the better the accuracy, but as the number of trees is increased, the computation can become slower.


Max_features is another important hyperparameter [11]. It sets the maximum number of features random forest considers when splitting an internal node, and tuning it can help increase the accuracy of the model.

Random_state is another hyperparameter; it is used to make the output of the model reproducible. When the value of random_state is the same, the model trains on the same training data, so the output is predictable across runs.

The Random Forest algorithm is one of the most widely used algorithms for classification and regression problems. The accuracy achieved by Random Forest is usually the best compared to other models, and it also runs efficiently on large databases compared to other algorithms [10].

XGBoost Classifier: After using the Random Forest classifier on the dataset, the XGBoost classifier was also applied to try to get a better accuracy [16]. XGBoost, or eXtreme Gradient Boosting, is a sophisticated implementation of the gradient boosting algorithm. It is a supervised learning algorithm that attempts to accurately predict a variable by combining the estimates of a group of weaker, less complicated models.

Boosting is an ensemble method that seeks to build a robust classification model out of weak classifiers. By iteratively adding a model on top of the previous models, the errors of the older models are corrected by their succeeding predictor, until the training data can be accurately predicted. Rather than assigning different weights to the classifiers after every iteration, XGBoost fits the new model to the residuals of the previous prediction and then minimizes the loss incurred when adding the newest prediction; in other words, the model is updated using gradient descent, hence the name gradient boosting. XGBoost specifically implements this algorithm for decision tree boosting, with an additional custom regularization term in the objective function.
objective function.

Fig 8: XGBoost Classifier
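A sketch of the XGBoost classifier on the same resampled data follows; the hyperparameter values shown are illustrative, not the paper's exact settings:

    from xgboost import XGBClassifier
    from sklearn.metrics import accuracy_score

    xgb = XGBClassifier(
        n_estimators=200,
        max_depth=6,            # XGBoost's default tree depth [18]
        learning_rate=0.1,
        eval_metric="logloss",  # evaluation metric for boosting rounds
    )
    xgb.fit(X_res, y_res)
    print("XGBoost accuracy:", accuracy_score(y_test, xgb.predict(X_test)))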


Hyperparameters:
max_depth [default=6]
1. The maximum depth of a tree.
2. Increasing the value may result in overfitting.
3. It can take values from 0 to infinity.
4. 0 is only accepted in the loss-guided growing policy, when tree_method is set to hist or gpu_hist, and indicates no limit on the depth.
5. It can be tuned using cross-validation, as sketched below.

max_leaf_nodes
1. Defines the maximum number of terminal (leaf) nodes in a tree.
2. This hyperparameter is often used in place of max_depth: since a binary tree is built, a tree of depth n can produce at most 2^n leaves.
3. It has a default value of 0.
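A small grid-search sketch for tuning these two hyperparameters with cross-validation follows (the grid values are illustrative; in the current XGBoost API the leaf limit is exposed as max_leaves):

    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    grid = GridSearchCV(
        estimator=XGBClassifier(eval_metric="logloss"),
        param_grid={"max_depth": [4, 6, 8],      # deeper trees risk overfitting
                    "max_leaves": [0, 31, 63]},  # 0 means no limit
        cv=5, scoring="accuracy",
    )
    grid.fit(X_res, y_res)
    print(grid.best_params_)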

IV. RESULT AND ANALYSIS
Comparison of the two models:

Model            Accuracy
XGBoost          96.894%
Random Forest    97.67% (n_estimators = 200)


Random Forest accuracy for different values of n_estimators:

n_estimators    Accuracy
200             97.67%

The table above shows the accuracy corresponding to the number of estimators. As n_estimators increases, the accuracy of the model also increases; the best accuracy was achieved when n_estimators was 200. After a certain value of n_estimators, the accuracy remained the same, and sometimes using too many estimators can even decrease the accuracy of the model.

XGBoost confusion matrix:

              Predicted 0    Predicted 1
Actual 0          952             20
Actual 1           40            920

Accuracy score: 96.894%. A sketch of computing these metrics follows.
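    from sklearn.metrics import confusion_matrix, accuracy_score

    y_pred = xgb.predict(X_test)
    # Rows are actual classes (0, 1); columns are predicted classes (0, 1).
    print(confusion_matrix(y_test, y_pred))
    print("Accuracy: %.3f%%" % (100 * accuracy_score(y_test, y_pred)))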

V. DATASET
This research has used the Stroke Prediction Dataset (available on Kaggle [1]). The dataset consists of 5110 records with 12 attributes per record: Patient Id, Age, Gender, Hypertension, Heart Disease, Marriage Status, Work Type, Residence Type, Average Glucose Level, Body Mass Index, Smoking Status, and Stroke Event. From this dataset, ten attributes associated with stroke risk factors, as described in the AHA guideline, are selected for prediction. The Stroke Event value is 1 if the patient had a stroke and 0 if not.

VI. CONCLUSION
In this paper, two different approaches have been discussed to predict whether a person is likely to have a stroke. The Random Forest algorithm [9,10,11] and the XGBoost classifier [13,16,18] have been used to predict the results. The accuracy achieved by the Random Forest algorithm came out to be 97.67%. The hyperparameter tuned was n_estimators; its value was varied starting from 5, and the best accuracy was achieved at 200. The accuracy achieved by the XGBoost classifier was 96.894%. The model can predict by conducting only a minimal number of clinical trials. Features such as avg_glucose_level, smoking status, etc. are given to the model for making a prediction. The outcome can warn patients, helping them to take precautions in advance to prevent a stroke.

ACKNOWLEDGMENT
We would like to thank Prof. Pankaj Sonawane for his guidance throughout the paper. We would also like to thank Kaggle for providing the dataset for the paper.

REFERENCES
[1]. "Stroke Prediction Dataset", Kaggle.com, 2021. [Online]. Available: https://www.kaggle.com/fedesoriano/stroke-prediction-dataset. [Accessed: 02-Apr-2021].
[2]. "(PDF) Prediction of Stroke Using Deep Learning Model", ResearchGate, 2021. [Online]. Available: https://www.researchgate.net/publication/320687273_Prediction_of_Stroke_Using_Deep_Learning_Model. [Accessed: 02-Apr-2021].
[3]. Ijser.org, 2021. [Online]. Available: https://www.ijser.org/researchpaper/Stroke-Prediction-Models-A-Systematic-Review.pdf. [Accessed: 02-Apr-2021].

[4]. Arxiv.org, 2021. [Online]. Available: https://arxiv.org/pdf/1603.02754.pdf. [Accessed: 02-Apr-2021].
[5]. Www3.nd.edu, 2021. [Online]. Available: https://www3.nd.edu/~dial/publications/hoens2013imbalanced.pdf. [Accessed: 02-Apr-2021].
[6]. Ele.uri.edu, 2021. [Online]. Available: https://www.ele.uri.edu/faculty/he/PDFfiles/adasyn.pdf. [Accessed: 02-Apr-2021].
[7]. People.csail.mit.edu, 2021. [Online]. Available: https://people.csail.mit.edu/khosla/papers/kdd2010.pdf. [Accessed: 02-Apr-2021].
[8]. Tutorialspoint.com, 2021. [Online]. Available: https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_with_python_classification_algo. [Accessed: 02-Apr-2021].
[9]. "Random Forests in Machine Learning | Random Forests for Data Science", Analytics Vidhya, 2021. [Online]. Available: https://www.analyticsvidhya.com/blog/2020/12/lets-open-the-black-box-of-random-forests/. [Accessed: 02-Apr-2021].
[10]. "Machine Learning Random Forest Algorithm - Javatpoint", www.javatpoint.com, 2021. [Online]. Available: https://www.javatpoint.com/machine-learning-random-forest-algorithm. [Accessed: 02-Apr-2021].
[11]. "Understanding Random Forest", Medium, 2021. [Online]. Available: https://towardsdatascience.com/understanding-random-forest-58381e0602d2. [Accessed: 02-Apr-2021].
[12]. "API Reference — scikit-learn 0.24.1 documentation", Scikit-learn.org, 2021. [Online]. Available: https://scikit-learn.org/stable/modules/classes.html. [Accessed: 02-Apr-2021].
[13]. J. Brownlee, "Extreme Gradient Boosting (XGBoost) Ensemble in Python", Machine Learning Mastery, 2021. [Online]. Available: https://machinelearningmastery.com/extreme-gradient-boosting-ensemble-in-python/. [Accessed: 02-Apr-2021].
[14]. J. Brownlee, "Why One-Hot Encode Data in Machine Learning?", Machine Learning Mastery, 2021. [Online]. Available: https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/. [Accessed: 02-Apr-2021].
[15]. "ML | Feature Scaling – Part 2", GeeksforGeeks, 2021. [Online]. Available: https://www.geeksforgeeks.org/ml-feature-scaling-part-2/. [Accessed: 02-Apr-2021].
[16]. "XGBoost Parameters | XGBoost Parameter Tuning", Analytics Vidhya, 2021. [Online]. Available: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/. [Accessed: 02-Apr-2021].
[17]. "Feature Scaling | Standardization Vs Normalization", Analytics Vidhya, 2021. [Online]. Available: https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/. [Accessed: 02-Apr-2021].
[18]. "XGBoost Parameters — xgboost 1.4.0-SNAPSHOT documentation", Xgboost.readthedocs.io, 2021. [Online]. Available: https://xgboost.readthedocs.io/en/latest/parameter.html. [Accessed: 02-Apr-2021].
[19]. J. Brownlee, "Tour of Data Sampling Methods for Imbalanced Classification", Machine Learning Mastery, 2021. [Online]. Available: https://machinelearningmastery.com/data-sampling-methods-for-imbalanced-classification/. [Accessed: 20-Apr-2021].
[20]. "Imbalanced Classification | Handling Imbalanced Data using Python", Analytics Vidhya, 2021. [Online]. Available: https://www.analyticsvidhya.com/blog/2020/07/10-techniques-to-deal-with-class-imbalance-in-machine-learning/. [Accessed: 20-Apr-2021].

