Ieee Conference Paper Template


A Comparative Study of Machine Learning Techniques

for Diabetes Prediction

Syed Abdul Basit Andrabi
Department of Computer Science and Engineering
Chandigarh University
Ghuran Mohali 12345, India
sbasit.11@gmail.com

Inderjeet Singh
Department of Computer Science and Engineering
Chandigarh University
Ghuran Mohali 12345, India
er.indeerjeetsingh1989@gmail.com

Abstract - Diabetes is one of the most serious health concerns in the world. In this paper, several studies on diabetes detection are reviewed, and a comparative analysis of different machine learning algorithms is carried out by applying them to the Pima Indian Diabetes data set. Among all the techniques applied to the Pima Indian data set, Random Forest achieved the highest accuracy.
Index Terms - Disease Prediction, Machine Learning, Computer-Aided Diagnosis, Diabetes.

I. INTRODUCTION
Diabetes refers to chronic conditions characterized by an increased level of blood glucose, commonly referred to as blood sugar. It can cause life-threatening health problems and can damage the kidneys, heart, eyes and nerves. Diabetes is one of the biggest public health concerns in the world, and it has a large impact on both public health and the economy. The two main types are Type 1, which is caused when the pancreas produces little or no insulin, and Type 2, which is caused when the cells do not respond to insulin. According to the World Health Organization, around 422 million people throughout the world are diabetic and 1.5 million deaths are reported every year due to diabetes.
As per the agenda of sustainable development, the member states have to reduce the mortality from NCDs, including diabetes, by one-third, achieve universal health coverage, and provide access to affordable essential medicines.
As per the report of the International Diabetes Federation in its 10th Atlas edition, 537 million adults in the age group of 20-79 are living with diabetes, and it is estimated that this number may increase to 643 million by 2030 and to 783 million by 2045. In the year 2021 diabetes was responsible for 6.7 million deaths, which is equivalent to 1 death every 5 seconds. Diabetes cases around the world are shown in Figure 1 [1]. The figure gives the cases in the year 2021 and the predicted cases for 2030 and 2045.
Figure 1: Diabetes cases around the world and future predictions
Computer-Aided Diagnosis (CAD) is a rapidly expanding and dynamic field of research in the medical industry. Machine learning researchers have used a number of machine learning techniques for disease detection and diagnosis. With machine learning, computers are given the ability to think by developing intelligence through experience.

II. REVIEW OF LITERATURE
Artificial Neural Networks (ANNs) and Bayesian networks have been utilized in the classification of diabetes as well as cardiovascular diseases, and their respective levels of accuracy have been evaluated. This paper reviews a selection of papers from 2010 to 2021. Mainly the multilayer feed-forward neural network and the Naive Bayes network have been reported to give good accuracy.
[2] proposed a system for diabetes prediction using the AdaBoost algorithm. The proposed system uses a series of base classifiers comprising Support Vector Machine (SVM), Naive Bayes and decision tree. The global data, the Pima Indian data set, were retrieved from the repository at the University of California, Irvine. The retrieved data set is used for training and testing. The data set consists of 768 records and 9 attributes. For validation, a local data set is used. Different performance metrics are used to evaluate the proposed model, including accuracy, sensitivity, specificity and error rate.
In [3] two data sets are considered, one on breast cancer and the other on diabetes. The classification is carried out using classification algorithms in the Weka tool. Several classification algorithms were applied to the breast cancer and diabetes data sets, namely J48, SMO, Naive Bayes and MLP. For the diabetes data set SMO gives the best accuracy of 76.80%, and for the breast cancer data set J48 gives the best accuracy of 74.28%. The performance evaluation uses several metrics such as precision and recall.
In [4] a machine learning framework for diabetes prediction is proposed in which several classification algorithms are applied to the Pima Indian Diabetes data set. The algorithms include decision tree, artificial neural network (ANN), logistic regression, random forest and Naive Bayes. 10-fold cross validation was used to assess the performance of the classification algorithms; the results show that Naive Bayes achieved the highest performance.
Figure 2: Workflow of proposed approach
[5] presents the idea of detecting diabetes using machine learning techniques on the Pima Indian data set. SVM and decision tree (DT) classification algorithms are used, and the framework is implemented in R. The SVM reported a classification accuracy of 82%. However, the paper does not mention the validation approach used, which is a critical parameter in machine learning tasks.
In [6] a system is proposed for diabetes analysis and prediction. The system uses two data sets: the Pima Indian Diabetes data set and a data set from 130 US hospitals. The techniques used for analysis are K-nearest neighbor, Naive Bayes, random forest and decision tree. The paper also uses an ensemble method, which shows good results.
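The text does not say how the ensemble in [6] combines its base classifiers. As a minimal sketch, assuming simple hard (majority) voting, the function below merges the label predictions of several models; the function name and the three prediction lists are hypothetical and are not taken from [6].

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists by hard (majority) voting.

    predictions_per_model: list of equal-length label lists, one per base model.
    """
    n_samples = len(predictions_per_model[0])
    combined = []
    for i in range(n_samples):
        # Collect the vote of every model for sample i and keep the most common.
        votes = [preds[i] for preds in predictions_per_model]
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical outputs of three base classifiers on five patients (1 = diabetic).
knn_preds = [1, 0, 1, 0, 0]
nb_preds = [1, 1, 0, 0, 0]
tree_preds = [0, 1, 1, 0, 1]

ensemble_preds = majority_vote([knn_preds, nb_preds, tree_preds])
print(ensemble_preds)
```

With an odd number of binary classifiers a tie cannot occur, which is one reason three or five base models are a common choice for this kind of voting.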
In [7] a technique for the diagnosis of type 2 diabetes is
proposed. The Pima data set of diabetic patients was
used in the research. The data set contains 768 records, of
which 500 belong to healthy women and 268 to women who
suffered from type 2 diabetes. The study uses eight features
for the diagnosis of diabetes. The accuracy of the model was
reported to be 84%.
In [8] the authors use ensemble voting classifiers for
the prediction of diabetes and report accuracies of 80% and 81%
on the Pima Indian diabetes data set. The method was
evaluated using 10-fold cross validation and by splitting the
data into a training set (70%) and a testing set (30%).
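The two validation schemes mentioned for [8] can be sketched as index splits. This is only an illustration under assumptions: the function names, the shuffling seed and the way folds are formed are choices made here, not details reported in [8]; the record count of 768 is the Pima data set size quoted in the text.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, validation) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Taking every k-th index gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

n_records = 768  # size of the Pima Indian data set quoted in the text
train_idx, test_idx = train_test_split_indices(n_records)
print(len(train_idx), len(test_idx))

folds = list(k_fold_indices(n_records, k=10))
print([len(validation) for _, validation in folds])
```

In the hold-out split each record is used either for training or for testing, whereas in 10-fold cross validation every record is used for validation exactly once, which makes the accuracy estimate less dependent on one particular split.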
The paper [9] showed that the data set used for
diagnosis needed to be pre-processed and that
missing values needed to be filled in.
The modified training set improved accuracy while requiring
less time to train.
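The text does not state how [9] fills in the missing values. A minimal sketch, assuming mean replacement with zeros treated as missing (the strategy this paper itself reports for the Pima data), is given below; the function name and the glucose readings are hypothetical.

```python
def impute_zeros_with_mean(column):
    """Replace zeros (treated as missing) with the mean of the non-zero values."""
    observed = [value for value in column if value != 0]
    if not observed:
        return list(column)  # nothing to estimate the mean from
    mean = sum(observed) / len(observed)
    return [mean if value == 0 else value for value in column]

# Hypothetical glucose readings; a value of 0 is not a plausible measurement.
glucose = [148, 85, 0, 89, 137, 0, 116]
print(impute_zeros_with_mean(glucose))
```

The mean is computed from the observed values only, so the zeros that stand in for missing measurements do not pull the estimate down.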
A new approach, proposed in [10], makes use of predictive
analysis to identify the factors that contribute to the early
diagnosis of diabetes mellitus. For diabetes data analysis, the
Random Forest method and the Decision Tree algorithm have the
highest sensitivity and specificity, respectively, of 98.20%
and 98.00%. The Naive Bayes classifier gives the best
accuracy of 82.30%. To increase classification accuracy, the
research additionally generalizes the selection of appropriate
features from the data set.
Machine learning algorithms such as decision trees, neural
networks, and random forests have all been utilized for
diabetes prediction in [11]. The data used in the study were
obtained from a hospital in China and contain 14 features;
principal component analysis and minimum redundancy
techniques are used for dimensionality reduction. The random
forest algorithm achieved the highest accuracy of 80.84% [11].
III. DATA COLLECTION AND MACHINE LEARNING ALGORITHMS TO PREDICT THE DIABETES

The most crucial task is to collect quality and relevant data. In this research we have used the Pima Indians data set, which many researchers have also used, as mentioned in the related work. The data set consists of 768 records with 9 attributes. Exploratory data analysis has been performed on the data in order to gain insight into it and to find missing or ambiguous values. The workflow of the proposed approach is shown in Figure 2. Missing values have been replaced with the mean. The following approaches have been used for the prediction of diabetes.
A. Decision Tree
A decision tree belongs to the class of supervised machine learning algorithms and is mainly used for classification tasks. It is a flowchart-like structure with three main types of nodes: the root node, internal nodes and leaf nodes. The root node indicates the start of the decision, internal nodes represent tests on attributes of the data set, branches represent the decisions, and leaf nodes indicate the outcomes. There are several algorithms to build a decision tree, such as ID3, C4.5 (J48) and CART [12]. In most cases, decision trees are trained in a loss-minimization setting with a greedy algorithm that builds each binary condition iteratively. They are also very cost-effective when it comes to running a test on the system. The problems with decision trees are overfitting and poor performance on difficult classification tasks [13]. In this research we have also used XGBoost, which is a decision-tree-based ensemble learning algorithm.
B. Random Forest
Random forest is another machine learning (ML) algorithm that can overcome the problem of overfitting in decision trees. It is an ensemble learning algorithm that uses the bagging technique. In bagging, several weak models are trained separately and, depending upon the type of problem, the average of these models (a majority vote in the case of classification) is taken to obtain high accuracy [14].
C. Support Vector Machine
It is a machine learning approach that can be used for both classification and regression problems. SVM is a linear classifier, but by transforming the space of the input data into a high-dimensional space it can be turned into a nonlinear one. An SVM uses the kernel trick, which maps the data to a higher-dimensional space so as to make classification easier. SVM's performance is determined by the kernel [15]. The ultimate goal of SVM is to find the best hyperplane in n-dimensional space that separates the data points distinctly. The dimension depends upon the number of features in the data set: if the data has two features the hyperplane is a line, and if it has three the hyperplane is a 2D plane. Figure 3 shows the case of two-feature data. Generalization errors are minimised by maximising the margin between the hyperplane and the data points [16].
Figure 3: Overview of SVM
D. K Nearest Neighbor (KNN)
It is a type of ML algorithm that belongs to the class of supervised algorithms and is mostly used for classification tasks, but it can also be used for regression. It is a flexible algorithm that can also be used to impute blank values and resample data sets. KNN classifies data based on similarity [17].

IV. EVALUATION TECHNIQUES
The techniques used in this research are evaluated using several evaluation measures. In our case the task is a classification problem, for which the performance of an algorithm is quantified in terms of the confusion matrix, which gives various measures such as accuracy, precision, specificity and sensitivity [18]. These measures are explained as follows.
The confusion matrix is a tabular structure that is used to evaluate the performance of a model. The confusion matrix for binary classification is given in Figure 4.
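As a small illustration of Sections III-D and IV, the sketch below labels a few points with a plain K-nearest-neighbor vote and then tallies the confusion-matrix counts, from which accuracy, sensitivity, specificity and precision follow by their standard definitions. Everything specific here is an assumption made for illustration: the two features (glucose, BMI), the toy records, k = 3, Euclidean distance on unscaled features, and the function names. It is not the experimental setup of this paper.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

# Hypothetical two-feature records (glucose, BMI); label 1 = diabetic.
train_X = [(85, 22.0), (90, 24.5), (100, 26.0), (150, 33.0), (160, 35.5), (170, 31.0)]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [(95, 25.0), (155, 34.0), (120, 30.0), (165, 30.0)]
test_y = [0, 1, 1, 1]

y_pred = [knn_predict(train_X, train_y, x, k=3) for x in test_X]
tp, tn, fp, fn = confusion_counts(test_y, y_pred)

accuracy = safe_ratio(tp + tn, tp + tn + fp + fn)
sensitivity = safe_ratio(tp, tp + fn)   # true positive rate (recall)
specificity = safe_ratio(tn, tn + fp)   # true negative rate
precision = safe_ratio(tp, tp + fp)

print("predictions:", y_pred)
print("TP, TN, FP, FN:", (tp, tn, fp, fn))
print("accuracy:", accuracy)
```

Because the distance is computed on raw values, the feature with the larger numeric range (glucose here) dominates the neighbor search; in practice the features would usually be scaled first.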
Figure 4: Confusion matrix

In the following, TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives in the confusion matrix.

Accuracy (A): It is the proportion of instances that are correctly classified by the algorithm. Mathematically it is written as:

$$A = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$

Sensitivity: The sensitivity of a machine learning model refers to its ability to detect positive instances correctly. It is also known as the true positive rate or recall:

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (2)$$

Specificity: The specificity of a machine learning model refers to its ability to detect negative instances correctly. It is also called the true negative rate. It represents the ratio of negative instances correctly classified as negative to the overall negative samples.

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (3)$$

Precision: It indicates the proportion of correctly classified positive predictions to the total number of positive predictions made by the model.

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (4)$$

V. RESULTS
This section presents the performance comparison of the various machine learning algorithms applied to the Pima Indian data set. Our experiments indicate that the Random Forest classifier performed best in comparison to the other applied techniques. The results of our experiment are given in Table 1.

Table 1: Results of some performance metrics

Classifier   Accuracy   Precision   Recall   F1-Score
RF           0.79       0.8         0.89     0.84
DT           0.74       0.78        0.82     0.8
XGBoost      0.74       0.77        0.84     0.62
SVM          0.69       0.63        0.63     0.62
KNN          0.70       0.63        0.63     0.63

CONCLUSION

This paper presents the current scenario of diabetes and its future predictions. Various machine learning algorithms are applied in order to develop a model for diabetes detection. The Random Forest classifier achieved the highest accuracy. More importantly, several papers have used different classifiers without mentioning the problems with the data set. The data set appears to have no missing values, but it has several records where the value is 0, which we replaced with the mean value. The data also has a class imbalance problem, for which we have used the Synthetic Minority Oversampling Technique.

REFERENCES
[1] C. Practice, “Diabetes is ‘a pandemic of unprecedented magnitude’ now affecting one in 10 adults worldwide,” Diabetes Res. Clin. Pract., vol. 181, 2021.
[2] V. V. Vijayan and C. Anjali, “Prediction and diagnosis of diabetes mellitus - A machine learning approach,” 2015 IEEE Recent Adv. Intell. Comput. Syst. RAICS 2015, no. December, pp. 122–127, 2016.
[3] V. Deepika and M. Nidhi, “Analysis and Prediction of Breast Cancer and Diabetes using Data mining classification Techniques,” 2017 Int. Conf. Intell. Sustain. Syst., no. Iciss, pp. 533–538, 2017.
[4] S. M. Hasan Mahmud, M. A. Hossin, M. Razu Ahmed, S. R. H. Noori, and M. N. I. Sarkar, “Machine learning based unified framework for diabetes prediction,” ACM Int. Conf. Proceeding Ser., no. August, pp. 46–50, 2018.
[5] A. Rathore, S. Chauhan, and S. Gujral, “Detecting and Predicting Diabetes Using Supervised Learning: An Approach towards Better Healthcare for Women,” Int. J. Adv. Res. Comput. Sci., vol. 8, no. 5, pp. 1192–1195, 2017.
[6] M. Alehegn, R. R. Joshi, and P. Mulay, “Diabetes analysis and prediction using random forest, KNN, Naïve Bayes, and J48: An ensemble approach,” Int. J. Sci. Technol. Res., vol. 8, no. 9, pp. 1346–1354, 2019.
[7] M. Shanthi, R. Marimuthu, S. N. Shivapriya, and R. Navaneethakrishnan, “Diagnosis of Diabetes using an Extreme Learning Machine Algorithm based Model,” 2019 IEEE 10th Int. Conf. Aware. Sci. Technol. iCAST 2019 - Proc., pp. 1–5, 2019.
[8] N. S. Prema, V. Varshith, and J. Yogeswar, “Prediction of diabetes using ensemble techniques,” Int. J. Recent Technol. Eng., vol. 7, no. 6, pp. 203–205, 2019.
[9] T. Jayalakshmi and A. Santhakumaran, “A novel classification method
for diagnosis of diabetes mellitus using artificial neural networks,”
DSDE 2010 - Int. Conf. Data Storage Data Eng., pp. 159–163, 2010.
[10] N. Sneha and T. Gangil, “Analysis of diabetes mellitus for early
prediction using optimal features selection,” J. Big Data, vol. 6, no. 1,
2019.
[11] Q. Zou, K. Qu, Y. Luo, D. Yin, Y. Ju, and H. Tang, “Predicting Diabetes
Mellitus With Machine Learning Techniques,” Front. Genet., vol. 9, no.
November, pp. 1–10, 2018.
[12] N. T. Shah, M. Z. Khan, M. Ali, B. Khan, and N. Idress, “CART, J-
48graft, J48, ID3, Decision Stump and Random Forest: A comparative
study,” Univ. Swabi J., vol. 2, no. April, pp. 1–6, 2018.
[13] S. Ray, “A Quick Review of Machine Learning Algorithms,” Proc. Int.
Conf. Mach. Learn. Big Data, Cloud Parallel Comput. Trends,
Prespectives Prospect. Com. 2019, pp. 35–39, 2019.
[14] Y. Qi, “Random Forest for Bioinformatics,” Ensemble Mach. Learn., pp.
307–323, 2012.
[15] D. A. Pisner and D. M. Schnyer, Support vector machine. Elsevier Inc.,
2019.
[16] S. Amari and S. Wu, “Improving support vector machine classifiers by
modifying kernel functions,” Neural Networks, vol. 12, no. 6, pp. 783–
789, 1999.
[17] L. Firte, C. Lemnaru, and R. Potolea, “Spam detection filter using KNN
algorithm and resampling,” Proc. - 2010 IEEE 6th Int. Conf. Intell.
Comput. Commun. Process. ICCP10, pp. 27–33, 2010.
[18] M. Heydarian and T. E. Doyle, “MLCM : Multi-Label Confusion
Matrix,” IEEE Access, vol. 10, pp. 19083–19095, 2022.
