Diabetes Prediction Using Machine Learning
A PROJECT REPORT
AFFILIATED TO
CANDIDATE’S DECLARATION
I hereby certify that the work presented in this project, entitled “DIABETES PREDICTION
USING MACHINE LEARNING”, in partial fulfilment of the requirements for the award of the
degree of B.Tech, Computer Science and Engineering, submitted to the Department of
Computer Science & Engineering at Guru Nanak Institute of Technology, Mullana, affiliated
to Kurukshetra University, Kurukshetra, is an authentic record of my work carried out under
the supervision of Mr. Tejbir Rana, Head of Department of CSE, GNIT, Mullana, Ambala.
The matter presented here has not been submitted by me to any other
University/Institute for the award of any other degree.
DIMPLE KUMARI
6319008
This is to certify that the above statement made by the candidate is correct to the best of
my knowledge.
Department of CSE
ACKNOWLEDGEMENT
I acknowledge the contribution of every individual who directly or indirectly helped me in
the development of this project, entitled “DIABETES PREDICTION USING MACHINE LEARNING”.
Without their support it would have been a tough job for me to complete this project.
I express my sincere thanks to Mr. Sidharth Arora (HOD of CS), who guided me through the
process of learning and using Python, which is the language used to develop this
project.
I express my deep sense of gratitude to my colleagues, friends and parents for their valuable
moral support.
Abstract
Diabetes (diabetes mellitus) is a group of metabolic disorders that affects millions of
people. Detecting diabetes is of great significance because the disease can lead to serious
complications. Many research studies have addressed the diagnosis of diabetes, and most of
them are based on one particular data set, the Pima Indian diabetes data set. This data set
comes from a study, begun in 1965, of women of the Pima Indian population, a Native American
community in Arizona in which the onset rate of diabetes is relatively high. Most earlier
studies focused primarily on one or two specialized, complex techniques for testing the
data, while comprehensive research on several general techniques is missing. In this
system, we extensively explore the most popular Machine Learning techniques used to
identify diabetes (e.g. the KNN algorithm), together with data pre-processing methods. We
evaluate these techniques by their cross-validation accuracy on the UCI ML repository
data set.
Keywords: Machine learning, Classification, KNN, Diabetes.
TABLE OF CONTENTS
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
Chapter 1
1.1 Introduction
1.2 The contribution of this work
Chapter 2
2.1 Literature review
Chapter 3
3.1 Methodology
3.2 Data Constraints
3.3 Train Dataset or Test Dataset
3.4 Pre-processing of data
3.5 Features Extraction
3.6 ML Algorithm: KNN
3.7 Result
Chapter 4
4.1 Proposed Work
4.2 Dataset Description
4.3 Correlation Matrix
4.4 Histogram
4.5 Bar Plot For Outcome Class
4.6 K-Nearest Neighbours
4.7 Logistic Regression
4.8 Decision Tree
4.9 Feature importance in decision trees
4.10 Random Forests
4.11 Feature importance in random forests
4.12 Support Vector Machine
4.13 Accuracy Comparison
Chapter 5
5.1 Source Code and Output
Chapter 6
6.1 Conclusion
6.2 Future Scope
6.3 Bibliography
LISTS OF FIGURES
CHAPTER-1
1.1 INTRODUCTION
Diabetes mellitus, or diabetes for short, is a chronic disease that occurs either when the
pancreas does not produce enough insulin or when the body cannot effectively use the
insulin it produces. Diabetes has two main types, type 1 and type 2. In type 1 diabetes
(also known as insulin-dependent or childhood-onset diabetes), the body does not produce
enough insulin, so insulin must be administered daily, whereas in type 2 diabetes
(formerly known as non-insulin-dependent or adult-onset diabetes), the body cannot use
insulin effectively. According to the World Health Organization (WHO), the number of people
with diabetes in 2014 was 422 million. Moreover, in 2016, diabetes was the direct cause of
1.6 million deaths.
Diabetes has different causes. For instance, type 1 diabetes mellitus (T1DM) can develop
due to an autoimmune reaction that destroys the beta cells, the cells in the pancreas that
make insulin, whereas the main risk factors for type 2 diabetes are age, family history of
diabetes, high blood pressure, high levels of triglycerides, heart disease or stroke [3].
Early detection of diabetes can be of great benefit, especially because the rate of
progression from prediabetes to type 2 diabetes is quite high. According to the CDC,
diabetes can affect any part of the body over time, leading to different types of
complications. The most common are divided into micro- and macrovascular disorders. The
former are long-term complications that affect small blood vessels, including retinopathy,
nephropathy, and neuropathy. Macrovascular disorders include ischemic heart disease,
peripheral vascular disease, and cerebrovascular disease. Because of the high mortality and
morbidity of diabetes and its possible complications, it is very important to understand
how to manage diabetes and how to prevent such complications.
To reduce the possibility of developing serious complications related to diabetes, machine
learning and data mining techniques can be applied to diabetes-related datasets. Machine
learning is a branch of artificial intelligence and computer science which focuses on the
use of data and algorithms to imitate the way that humans learn. Machine learning can be
divided into two main categories, namely supervised and unsupervised learning. The main
goal in both cases is to make use of a given dataset to enhance our understanding of the
data and discover useful knowledge. Supervised machine learning is characterized by the use
of labeled data to train its algorithms and can be utilized for classification or
regression tasks. The goal of classification is to assign each unknown instance to one of
the possible classes or categories for prediction or diagnosis purposes.
The proposed work implements several supervised machine learning techniques and algorithms
to predict different complications related to diabetes. Unlike typical diabetes datasets,
the complications data set covers a range of complications such as metabolic syndrome,
dyslipidemia, neuropathy, nephropathy, diabetic foot, hypertension, obesity, and
retinopathy. Furthermore, logistic regression (LR), support vector machine (SVM), decision
tree (DT CART), random forest (RF), AdaBoost, and XGBoost were utilized to build and
evaluate the resulting classifiers.
CHAPTER-2
2.1 LITERATURE REVIEW
Md. Faisal Faruque, Asaduzzaman and Iqbal H. Sarker discussed that diabetes is one of the
most common disorders of the human body and that it is caused by a metabolic disorder. They
used several important ML algorithms, namely support vector machine (SVM), naive Bayes
(NB), KNN and decision tree (DT), to predict diabetes.
Sidong Wei, Xuejiao Zhao and Chunyan Miao described diabetes as a disorder in which the
glucose level in the body is high. In their paper they used popular methods such as SVM and
deep neural networks to identify the disease, together with data pre-processing.
Lakshmi K.S and G. Santhosh Kumar observed that hospital databases serve as a rich source
of information for effective medical diagnosis. They used NLP tools combined with data
mining algorithms for the extraction of rules.
Jian-xun Chen, Shih-Li Su and Che-Ha Chang discussed an ontology that generates a primary
care plan for medical professionals. The results of their paper show that the model can
provide personalized diabetes mellitus care planning efficiently.
M.M. Alotaibi, R.S.H. Istepanian and A. Sungoor presented an intelligent mobile diabetes
management and education system for patients with diabetes. The system is able to store
clinical information about the patient's diabetes, such as blood sugar level and blood
pressure measurements and hypoglycaemia events.
Berina Alic, Lejla Gurbeta and Almir Badnjevic presented an overview of machine learning
techniques for the classification of diabetes and cardiovascular diseases using Bayesian
networks (BNs) and artificial neural networks (ANN).
CHAPTER-3
3.1 METHODOLOGY
The proposed method uses the KNN algorithm for classification and prediction of diabetes
using training data. The proposed system also predicts the time at which a patient is
likely to develop diabetes.
Figure 1: Methodology
3.2 DATA CONSTRAINTS
The data are taken from a publicly available data set. This system uses the Pima Indian
data set to train the model. The data set contains 21 parameters and around 1000 records.
The data set features/parameters are:
• Age
• Gender
• Relation
• DOB
• Symptoms
• Family history etc.
These data are used to train the model for the prediction of diabetes.
3.3 TRAIN DATASET OR TEST DATASET
The training data is the initial set of data from which the program learns. The model must
be trained on this data first, because it is used to fit the features, and this data is
available on the system. It is the data the algorithm uses to teach the model to perform
its task automatically.
Testing data is the input given to the trained model to verify it. It shows how the model
behaves on data that were not used for training, and it is used only for testing.
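The division into training and testing data can be sketched as follows. This is a minimal illustration that uses only the Python standard library; the function name `train_test_split`, the 25% test fraction, the seed and the sample records are assumptions made for this example and are not taken from the project's source code.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the rows and split them into a training list and a test list."""
    shuffled = list(rows)                    # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical records: (glucose, bmi, outcome)
records = [(85, 26.6, 0), (183, 23.3, 1), (89, 28.1, 0), (137, 43.1, 1),
           (116, 25.6, 0), (78, 31.0, 1), (115, 35.3, 0), (197, 30.5, 1)]

train, test = train_test_split(records, test_fraction=0.25)
print(len(train), "training rows,", len(test), "test rows")
```

Shuffling before splitting avoids a split that depends on the order in which the records were stored, and no record appears in both lists.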
3.4 PRE-PROCESSING OF DATA
Data pre-processing converts the raw data into a clean data set. In this step the data are
transformed or encoded into a state that the machine can parse easily. The major tasks of
data pre-processing in the learning process are to remove unwanted data and to fill in
missing values, so that the model can be trained easily.
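As an illustration of the two tasks named above, the sketch below fills missing values and then rescales a column. It assumes that missing entries are marked as `None`, that the median is an acceptable fill value, and that min-max scaling is the desired scaling; these choices and the sample values are assumptions for the example, not a description of the project's actual pipeline.

```python
from statistics import median

def fill_missing(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]

def min_max_scale(column):
    """Rescale the values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant column: nothing to scale
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

glucose = [148, None, 183, 89, None, 116]   # hypothetical readings
cleaned = fill_missing(glucose)
scaled = min_max_scale(cleaned)
print(cleaned)
print([round(v, 2) for v in scaled])
```

Scaling matters for distance-based methods such as KNN, because a feature with a large numeric range would otherwise dominate the distance.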
3.5 FEATURES EXTRACTION
Feature extraction transforms the raw input data into a set of features that are relevant
to the outcome. These features capture the characteristics of the input patterns that help
to differentiate between the classes. The method reduces the amount of resources required
to describe a large set of data. Feature extraction is an attribute reduction process. It
also increases the speed and effectiveness of supervised learning.
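One very simple form of attribute reduction is to drop columns that barely vary, since they cannot help to separate the classes. The sketch below shows this variance filter. It is a stand-in chosen for illustration, not the feature extraction method used in this project, and the column names and values are hypothetical.

```python
from statistics import pvariance

def drop_low_variance(columns, threshold=0.0):
    """Keep only the columns whose population variance exceeds the threshold."""
    return {name: values for name, values in columns.items()
            if pvariance(values) > threshold}

features = {
    "glucose": [85, 183, 89, 137],
    "bmi": [26.6, 23.3, 28.1, 43.1],
    "constant_flag": [1, 1, 1, 1],     # identical for every record
}
reduced = drop_low_variance(features)
print(sorted(reduced))
```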
3.7 Result:
Given the input data, the system predicts the outcome by applying the ML algorithm and
reports the result, supporting accurate detection and treatment of diabetes mellitus.
CHAPTER-4
4.1 Proposed Work:
The prediction or detection of diabetes is a major concern because of the disease's severe
complications. The complications of diabetes are shown in the picture below. Detecting
diabetes mellitus at an early stage plays a significant role in treating it.
The detection of diabetes plays a very important role in human life because the disease
can lead to death. The proposed system is used for early detection of diabetes and for time
prediction, where time prediction means estimating when a patient is likely to develop
diabetes, which helps to improve the patient's habits. The proposed system concentrates
mainly on the development of a machine learning model, and it is also helpful in the
medical sector for identifying diseases. The proposed system automates the prediction of
diabetes using data from previous patients.
Figure-3 : Proposed Model Diagram.
4.3 Correlation Matrix:
Figure-4
It is easy to see that no single feature has a very high correlation with the outcome
value. Some of the features have a negative correlation with the outcome value and some
have a positive correlation.
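The entries of such a correlation matrix are Pearson correlation coefficients. The sketch below computes them with the standard library only; the column names and values are hypothetical and serve just to show the calculation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

data = {
    "glucose": [85, 183, 89, 137, 116, 197],
    "age": [31, 32, 21, 33, 30, 53],
    "outcome": [0, 1, 0, 1, 0, 1],
}
matrix = correlation_matrix(data)
for name in data:
    print(name, round(matrix[name]["outcome"], 2))
```

Reading the "outcome" column of the matrix gives the strength and sign of the linear relationship between each feature and the label.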
4.4 Histogram:
Figure-5
Let’s take a look at the plots. They show how each feature and the label are distributed
over different ranges, which further confirms the need for scaling. Wherever you see
discrete bars, the variable is actually categorical. We will need to handle these
categorical variables before applying Machine Learning. Our outcome labels have two
classes, 0 for no disease and 1 for disease.
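A histogram is obtained by counting how many values fall into each of a number of equal-width bins. The sketch below shows this counting for one continuous column and for the two-valued outcome label; the function name and the sample values are assumptions for illustration.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    counts = [0] * n_bins
    if hi == lo:                       # constant column: everything in one bin
        counts[0] = len(values)
        return counts
    width = (hi - lo) / n_bins
    for v in values:
        # the maximum value is placed in the last bin
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

glucose = [85, 183, 89, 137, 116, 78, 115, 197]   # hypothetical readings
outcome = [0, 1, 0, 1, 0, 1, 0, 1]                # hypothetical labels
print(histogram(glucose, 4))
print(histogram(outcome, 2))
```

A variable with only a few distinct values, such as the outcome label, fills only a few separate bins, which is what appears as discrete bars in the plots.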
4.5 Bar Plot For Outcome Class:
Figure-6
The above graph shows that the data are biased towards data points with an outcome value of 0,
meaning that diabetes was not present. The number of non-diabetics is almost twice the
number of diabetic patients.
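The class balance can be checked by counting the labels, as in the sketch below. The label list is hypothetical and only mirrors the roughly two-to-one ratio described above.

```python
from collections import Counter

outcome = [0, 0, 1, 0, 1, 0, 0, 1, 0]    # hypothetical labels, about 2:1
counts = Counter(outcome)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = majority_count / len(outcome)
print(dict(counts), round(baseline_accuracy, 2))
```

With such an imbalance, a model that always answers "no diabetes" already reaches about two thirds accuracy, so reported accuracies are best read against this baseline.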
4.6 k-Nearest Neighbours:
The k-NN algorithm is arguably the simplest machine learning algorithm. Building the model consists
only of storing the training data set. To make a prediction for a new data point, the algorithm finds
the closest data points in the training data set, its “nearest neighbours.”
First, let’s investigate whether we can confirm the connection between model complexity and
accuracy:
Figure-7
The above plot shows the training and test set accuracy on the y-axis against the setting of
n_neighbors on the x-axis. If we choose a single nearest neighbour, the prediction on the
training set is perfect. But when more neighbours are considered, the training accuracy
drops, indicating that using the single nearest neighbour leads to a model that is too
complex. The best performance is somewhere around 9 neighbours.
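The behaviour described above can be reproduced with a small library-free k-NN sketch. It assumes Euclidean distance, majority voting and features that are already scaled to a comparable range; the data points are hypothetical, and the code is not the project's implementation.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, point, k):
    """Predict the label of `point` by majority vote among its k nearest training points."""
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], point))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def knn_accuracy(train_X, train_y, X, y, k):
    """Fraction of the points in X whose predicted label equals the true label."""
    hits = sum(knn_predict(train_X, train_y, p, k) == label for p, label in zip(X, y))
    return hits / len(y)

# Hypothetical, already scaled training points
train_X = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.8, 0.9), (0.9, 0.8), (0.35, 0.35)]
train_y = [0, 0, 0, 1, 1, 1]

for k in (1, 3):
    print(k, knn_accuracy(train_X, train_y, train_X, train_y, k))
```

With k = 1 every training point is its own nearest neighbour, so the training accuracy is perfect as long as no two identical points carry different labels. With k = 3 the point (0.35, 0.35) is outvoted by its neighbours and the training accuracy drops.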
TABLE-1
4.7 Logistic Regression:
TABLE-2
➔ In the first row, the default value of C=1 gives 77% accuracy on the training set and 78%
accuracy on the test set.
➔ In the second row, using C=0.01 gives 78% accuracy on both the training and the test
sets.
➔ Using C=100 gives slightly lower accuracy on the training set and slightly higher
accuracy on the test set, suggesting that less regularization and a more complex model do
not clearly generalize better than the default setting.
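The role of C can be illustrated with a small gradient-descent sketch of L2-regularized logistic regression. It assumes the convention, used for example by scikit-learn's LogisticRegression, that C is the inverse of the regularization strength. The learning rate, the iteration count and the data are hypothetical, and this is not the solver used to produce the table.

```python
from math import exp, sqrt

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(X, y, C=1.0, lr=0.05, n_iter=2000):
    """Fit L2-regularized logistic regression by plain gradient descent.

    C is the inverse regularization strength: the penalty on the weights is
    ||w||^2 / (2 * C * n), so a smaller C shrinks the weights harder.
    The intercept b is not penalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        errors = [sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - t
                  for x, t in zip(X, y)]
        grad_w = [sum(e * x[j] for e, x in zip(errors, X)) / n + w[j] / (C * n)
                  for j in range(d)]
        grad_b = sum(errors) / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def weight_norm(w):
    return sqrt(sum(wj * wj for wj in w))

# Hypothetical, already scaled features with overlapping classes
X = [(0.1, 0.3), (0.2, 0.2), (0.3, 0.6), (0.4, 0.4),
     (0.6, 0.5), (0.7, 0.9), (0.8, 0.6), (0.9, 0.8)]
y = [0, 0, 0, 1, 0, 1, 1, 1]

w_strong, _ = fit_logistic(X, y, C=0.01)   # strong regularization
w_weak, _ = fit_logistic(X, y, C=100)      # weak regularization
print(round(weight_norm(w_strong), 3), round(weight_norm(w_weak), 3))
```

A small C keeps the weights close to zero and gives a simpler model, while a large C lets the weights grow and gives a more complex one.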
Figure-8
4.8 Decision Tree:
This classifier builds a decision tree and uses it to assign a class value to each data
point. Here, we can vary the maximum number of features to be considered while creating
the model.
TABLE-3
The accuracy on the training set is 100% and the test set accuracy is also good.
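A tree is grown by repeatedly choosing the split that makes the resulting groups as pure as possible. The sketch below shows this choice for a single feature. It assumes Gini impurity as the purity measure, as in CART-style trees; the values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best split `value <= threshold`."""
    best = (None, gini(labels))
    for t in sorted(set(values))[:-1]:   # splitting at the maximum leaves one side empty
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

glucose = [85, 89, 115, 116, 137, 183, 197]   # hypothetical readings
outcome = [0, 0, 0, 0, 1, 1, 1]
threshold, impurity = best_threshold(glucose, outcome)
print(threshold, impurity)
```

A tree that is allowed to keep splitting until every leaf is pure reproduces the training labels exactly, which is why a training accuracy of 100% on its own says little about how well the tree generalizes.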
4.9 Feature Importance in Decision Trees
Feature importance rates how important each feature is for the decision a tree makes. It is
a number between 0 and 1 for each feature, where 0 means “not used at all” and 1 means
“perfectly predicts the target”.
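One common way to obtain such values is to add up the impurity decrease that each feature contributes across the splits of the tree and then scale the totals so that they sum to 1. The sketch below shows only the scaling step; the raw totals are hypothetical numbers chosen for illustration.

```python
def normalize_importances(impurity_decrease):
    """Scale the per-feature impurity decreases so that they sum to 1."""
    total = sum(impurity_decrease.values())
    if total == 0:
        return {name: 0.0 for name in impurity_decrease}
    return {name: value / total for name, value in impurity_decrease.items()}

# Hypothetical total impurity decrease contributed by each feature
raw = {"Glucose": 0.30, "BMI": 0.15, "Age": 0.05, "Gender": 0.0}
importances = normalize_importances(raw)
for name, value in sorted(importances.items(), key=lambda item: -item[1]):
    print(name, round(value, 2))
```

A feature that is never used for a split keeps an importance of 0, and a feature responsible for the whole impurity decrease would receive 1.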
Figure-9
4.10 Random Forest:
This classifier takes the concept of decision trees to the next level. It creates a forest of trees
where each tree is formed by a random selection of features from the total features.
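The randomization behind a forest can be sketched as follows: each tree sees a bootstrap sample of the rows and a random subset of the features, and the forest combines the trees' predictions by majority vote. The sketch leaves out the tree training itself and draws the feature subset once per tree, following the description above; some implementations draw a new subset at every split instead. All names and sizes are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(feature_names, n_features, rng):
    """Pick n_features distinct features for one tree."""
    return rng.sample(feature_names, n_features)

def majority_vote(predictions):
    """Combine the 0/1 predictions of the individual trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
features = ["Glucose", "BMI", "Age", "BloodPressure"]

# Rows and features that each of three trees would be trained on
plan = [(bootstrap_indices(6, rng), random_feature_subset(features, 2, rng))
        for _ in range(3)]
for rows, subset in plan:
    print(rows, subset)

print(majority_vote([1, 0, 1]))
```

Because every tree is trained on a slightly different view of the data, the individual trees make different errors, and averaging them by voting reduces the overfitting of a single deep tree.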
TABLE-4
4.11 Feature Importance in Random Forests:
Figure-10
Like the single decision tree, the random forest gives a lot of importance to the
“Glucose” feature, but it also chooses “BMI” as the 2nd most informative feature overall.
4.12 Support Vector Machine:
This classifier aims to find a hyperplane that separates the classes as well as possible
by maximizing the distance (margin) between the hyperplane and the nearest data points.
Several kernels are available that determine the form of the hyperplane. I tried four
kernels, namely linear, poly, rbf and sigmoid.
Figure-11
As can be seen from the plot above, the linear kernel performed the best for this dataset
and achieved a score of 77%.
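The four kernels can be written down directly as functions of two feature vectors. The sketch below uses their usual textbook forms; the parameter names `gamma`, `coef0` and `degree` follow common usage, and the default values are assumptions for illustration rather than the settings used in the experiment.

```python
from math import exp, tanh

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def poly_kernel(u, v, degree=3, gamma=1.0, coef0=0.0):
    return (gamma * dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma=1.0):
    # depends only on the squared distance between the two points
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return tanh(gamma * dot(u, v) + coef0)

u, v = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(u, v), poly_kernel(u, v, degree=2),
      rbf_kernel(u, u), round(sigmoid_kernel(u, v, gamma=0.1), 3))
```

The linear kernel keeps the decision boundary a flat hyperplane in the original feature space, whereas the other three allow curved boundaries. A linear kernel performing best suggests that the classes are separated about as well by a flat boundary as by a curved one.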
4.13 Accuracy Comparison:
TABLE-5
Table-5 shows the accuracy values for all five machine learning algorithms. The Decision
Tree algorithm gives the best accuracy, with 98% training accuracy and 99% testing
accuracy.
CHAPTER-5
CHAPTER-6
6.1 Conclusion
The prediction of diabetes is of great importance today because of the disease's severe
complications. Diabetes is one of the leading causes of death worldwide. The system model
focuses mainly on identifying diabetes from a set of parameters. The system helps
physicians to predict diabetes at an early stage, so that conventional treatments and
solutions can be given to patients. The system uses ML techniques for the prediction in
order to obtain more precise results. There has been a wealth of research on diabetes
prediction. A diabetes prediction system is useful for hospitals and doctors: it predicts
the disease at an early stage, so doctors can treat patients better. The proposed model is
a real-time application meant for multiple hospitals, and it predicts the disease in less
time. Using machine learning algorithms for disease prediction is expected to give more
accurate and efficient results.
6.2 Future Scope
The proposed system uses the KNN algorithm to detect diabetes. Data science offers many
other classification algorithms, such as Naive Bayes, SVM, Decision Tree and ID3. In
future we can add more algorithms and compare them to find the most efficient one. We can
add a visitor query module, where visitors can post queries to the administrator and the
administrator can reply to them. We can also add a treatment module, where doctors upload
treatment details for patients and patients can view those details.
6.3 BIBLIOGRAPHY
1. World Health Organization. Diabetes. 10 November 2020. Available online:
https://www.who.int/news-room/fact-sheets/detail/diabetes (accessed on 30 November 2020).
2. Centers for Disease Control and Prevention. What Is Type 1 Diabetes. Available online:
https://www.cdc.gov/diabetes/basics/what-is-type-1-diabetes.html (accessed on 22 November
2021).
4. Centers for Disease Control and Prevention. Diabetes. 2019. Available online:
https://www.cdc.gov/diabetes/managing/problems.html (accessed on 30 November 2020).
5. Cade, W.T. Diabetes-Related Microvascular and Macrovascular Diseases in the Physical Therapy
Setting. Phys. Ther. 2008, 88, 1322–1335.
6. Tekieh, M.H.; Raahemi, B. Importance of Data Mining in Healthcare. In Proceedings of the 2015
IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015
(ASONAM), Paris, France, 25–28 August 2015; pp. 1057–1062.
7. Sharma, R.; Singh, S.N.; Khatri, S. Medical Data Mining Using Different Classification and
Clustering Techniques: A Critical Survey. In Proceedings of the 2016 Second International Conference
on Computational Intelligence & Communication Technology (CICT), Ghaziabad, India, 12–13
February 2016; pp. 687–691.
8. Dominic, V.; Gupta, D.; Khare, S.; Aggarwal, A. Investigation of chronic disease correlation using
data mining techniques. In Proceedings of the 2015 2nd International Conference on Recent
Advances in Engineering & Computational Sciences (RAECS), Chandigarh, India, 21–22 December
2015; pp. 1–6.
9. Hasan, M.K.; Alam, M.A.; Das, D.; Hossain, E.; Hasan, M. Diabetes Prediction Using Ensembling of
Different Machine Learning Classifiers. IEEE Access 2020, 8, 76516–76531.
10. Sisodia, D.; Sisodia, D.S. Prediction of Diabetes using Classification Algorithms. Procedia Comput.
Sci. 2018, 132, 1578–1585.
11. Meng, X.-H.; Huang, Y.-X.; Rao, D.-P.; Zhang, Q.; Liu, Q. Comparison of three data mining models
for predicting diabetes or prediabetes by risk factors. Kaohsiung J. Med. Sci. 2013, 29, 93–99.
12. Abdulhadi, N.; Al-Mousa, A. Diabetes Detection Using Machine Learning Classification Methods.
In Proceedings of the 2021 International Conference on Information Technology (ICIT), Amman,
Jordan, 14–15 July 2021; pp. 350–354.
13. Kantawong, K.; Tongphet, S.; Bhrommalee, P.; Rachata, N.; Pravesjit, S. The Methodology for
Diabetes Complications Prediction Model. In Proceedings of the 2020 Joint International Conference
on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical,
Electronics, Computer and Telecommunications Engineering (ECTI DAMT & NCON), Pattaya,
Thailand, 11–14 March 2020; pp. 110–113.
14. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106.
15. Islam, M.S.; Qaraqe, M.K.; Belhaouari, S.B. Early Prediction of Hemoglobin A1c: A novel
Framework for better Diabetes Management. In Proceedings of the 2020 IEEE Symposium Series on
Computational Intelligence (SSCI), Canberra, Australia, 1–4 December 2020; pp. 542–547.
16. Dagliati, A.; Marini, S.; Sacchi, L.; Cogni, G.; Teliti, M.; Tibollo, V.; De Cata, P.; Chiovato, L.;
Bellazzi, R. Machine Learning Methods to Predict Diabetes Complications. J. Diabetes Sci. Technol.
2018, 12, 295–302.