Comparison of ML Techniques
1 Diyala University / College of Basic Education / Baqubeh / Diyala / Iraq
2 Diyala University / College of Science / Baqubeh / Diyala / Iraq
Abstract: Diabetes is a disease with no permanent cure; hence, early detection with high accuracy is required. This study aims to compare five machine learning (ML) algorithms and achieve the best accuracy for predicting early-stage diabetes. The dataset used in this work, from the Frankfurt hospital, Germany, includes information on 2000 patients together with nine distinct characteristics for each of them. The five ML algorithms applied to the dataset to predict diabetes are Random Forest (RF), K-Nearest Neighbor (KNN), Gaussian Naïve Bayes (NB), Support Vector Machine (SVM), and Logistic Regression (LR). According to the obtained results, the proposed model with RF achieved an excellent accuracy of 99% in comparison with the remaining classification algorithms used in the proposed model. In addition, the proposed model's efficiency has been compared with previous work, and it achieved the highest accuracy.
Keywords: Gaussian Naïve Bayes (NB); Machine Learning (ML); Random Forest (RF); K-Nearest Neighbor (KNN); Support Vector Machine (SVM); Logistic Regression (LR).
__________________________________________________________
Introduction
Diabetes mellitus, more commonly referred to as "diabetes", is a chronic disease associated with abnormally high levels of the sugar glucose in the blood. Diabetes is due to one of two mechanisms: inadequate production of insulin (which is made by the pancreas and lowers blood glucose), or inadequate sensitivity of cells to the action of insulin. Diabetes mellitus may also develop as a secondary condition linked to another disease, such as pancreatic disease; a genetic syndrome, such as myotonic dystrophy; or drugs, such as glucocorticoids. Gestational diabetes is a temporary condition associated with pregnancy, in which blood glucose levels increase during pregnancy but usually return to normal after delivery. Based on data from the 2011 National Diabetes Fact Sheet, diabetes affects an estimated 25.8 million people in the US, about 8.3% of the population, and approximately 79 million people have been diagnosed with pre-diabetes [1]. In this big-data era, a large volume of data is generated, and machine learning has become an imperative tool to analyze the complexity of the generated data. A plethora of techniques have been applied for data analytics in medical diagnosis, including single classifiers and classifier ensembles [2]. With ML models, it is also possible to improve the quality of medical data, reduce fluctuations in patient rates, and save on medical costs; these models are therefore frequently used for diagnostic analysis in comparison with other conventional methods. Early detection and effective treatment are the only way to reduce the death rates caused by chronic diseases (CDs), so most medical scientists are attracted to the new technologies of predictive models in disease forecasting [3]. Applying machine learning and data mining methods in Diabetes Mellitus (DM) research is a key approach to utilizing the large volumes of available diabetes-related data for extracting knowledge. The severe social impact of this disease renders DM one of the main priorities in medical science research, which inevitably generates huge amounts of data [4]. Data mining represents a significant advance in the type of analytical tools available, and it has been shown that introducing data mining into medical analysis increases diagnostic accuracy, reduces costs, and saves human resources [5]. The motive of this study is to compare the performance of the most effective machine learning techniques used to predict diabetes. This work uses a diabetes dataset collected from the Frankfurt hospital, Germany, containing information about 2000 patients and their corresponding nine unique attributes. The algorithms applied to the dataset to predict diabetes are Naïve Bayes (NB), Random Forest (RF), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Logistic Regression (LR). According to the obtained results, the proposed model with RF achieved an excellent accuracy in comparison with the remaining classification algorithms used in the proposed system.
The rest of the paper is laid out as follows: the second section delves into related works. The description of the diabetes dataset and the five machine learning models used in the proposed model is presented in Section three, along with the relevant methods and materials of this work. Section four illustrates the design of the proposed model. Section five explains the simulation of the proposed method, the evaluation metrics, and the experimental results and discussion. Finally, Section six concludes the proposed model.
Related Works
In the previous work of Miao L. et al. [6], a model was developed for the long-term risk of cardiovascular disease (CVD) as a result of type 2 diabetes (T2D). On data from the Framingham Heart Study longitudinal cohort, they used the K-Nearest Neighbors and Support Vector Machine (SVM) algorithms to construct the prediction models. The dataset was first balanced using the Synthetic Minority Oversampling Technique algorithm. After adjusting the parameters and training 1000 times, the average precision for correctly predicting the prevalence of CVD due to T2D was 96.5%, with an average recall rate of 89.8%. They also used the KNN algorithm to train the dataset, with a 92.9% recall rate.
In the earlier work of Cherradi B. et al. [7], to predict whether or not patients have type 2 diabetes mellitus, the researchers used and evaluated four machine learning algorithms (Artificial Neural Network, K-Nearest Neighbors, Decision Tree, and Deep Neural Network) on two datasets. The first was retrieved from the Frankfurt Hospital in Germany, while the second is the well-known Pima Indian dataset, which contains the same features composed of risk factors, mixed data, and some clinical data. The proposed model achieved its best accuracy rates with KNN (97.53%) and DeepNN (96.35%).
In the past work of Anwar N. H. K. & Saian R. [8], two different diabetes datasets were utilized, the Frankfurt Germany diabetes dataset and the Pima Indian diabetes dataset, and various machine learning models were involved, such as Naïve Bayes, AdaBoost M1, K-Nearest Neighbor, and RIPPER. The main algorithm used in that research is Ant-Miner, which was compared with the other algorithms in terms of accuracy. The highest accuracy obtained for the dataset, 73.64%, was achieved when they implemented the Ant-Miner algorithm.
In the previous work of Maniruzzaman M. et al. [9], a diabetic patient prediction system was built based on machine learning (ML). Logistic Regression (LR) was used to identify risk factors for diabetes disease using p-values and odds ratios (OR). To forecast diabetic patients, they used four classifiers: Naïve Bayes (NB), Decision Tree (DT), AdaBoost (AB), and Random Forest (RF). These protocols were followed and replicated in 20 trials under three groups of partition protocols (K2, K5, and K10). The accuracy (ACC) and area under the curve (AUC) of these classifiers were used to assess their performance. The overall ACC of the ML-based system is 90.62%, and the K10 protocol reaches 94.25% ACC and 0.95 AUC thanks to the combination of LR-based feature selection and an RF-based classifier.
These studies have mainly focused on obtaining acceptable accuracy rates to diagnose and detect diabetic patients using different classification methods. Unlike these approaches, the main focus of the proposed model is to design a diabetes classification model based on five advanced machine learning algorithms to achieve higher accuracy.
Diabetes Dataset
A short overview of the dataset used is given in this section. The diabetes dataset for classification was downloaded from the Kaggle machine learning repository. The dataset comprises instances with a set of numerical attributes and was gathered at the Frankfurt Hospital in Germany [10]. It has 10 attributes and 2000 instances. The first column identifies each instance with an ID number, while the last column of the data table holds the class label that defines the diagnosis: class value 1 means a diabetes patient and class value 0 means a non-diabetes patient [7]. Table 1 [7, 11] lists the diabetes dataset instances and characteristics, as well as some statistical evidence.
where P(h) = |h| / N is the estimated prior probability of hypothesis h (|h| is the number of patterns in class h, N is the total number of patterns, and all hypotheses are assumed equally probable), P(X|h) is the conditional probability of X given h, and P(X) is the prior probability of X. The maximum a posteriori (MAP) hypothesis is used to assign the class h having maximum P(h|X). Equation (3) expresses this as:

hMAP ≡ arg max h∈H P(h|X) = arg max h∈H P(X|h)P(h)   (3)

where H is the set of hypotheses.
The Bayesian classifier can efficiently achieve the minimum error rate if the probability distribution of the data is given. In this context, the expected loss of decision-making (i.e., the conditional risk) will be minimal. This statistically optimal classification rule is widely used as a benchmark against which other classification algorithms are often measured [13].
For 1 ≤ j < i and j ≠ i, using Equation (2) gives Equation (5):

P(Ci|X) = P(X|Ci) P(Ci) / P(X)   (5)
The classifier, however, makes the naive or simplified assumption that the features (whose total number is denoted by n) are conditionally independent of one another, to save on computational costs. The class-conditional independence can be written as:

P(X|Ci) = ∏ j=1..n P(fj | Ci)   (6)

As P(X) is a constant for each class, and P(Ci) = |Ci| / N, the NB classifier needs to maximize only P(X|Ci). Since it just counts the class distribution, this greatly decreases computing costs [12].
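The MAP decision rule above can be sketched with scikit-learn's GaussianNB, which estimates P(h) from class frequencies and models each class-conditional feature density as a Gaussian. The data here is a synthetic stand-in for the diabetes records, not the paper's dataset:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical two-feature, two-class data (1 = diabetic, 0 = non-diabetic).
rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, scale=1.0, size=(100, 2))   # class 0 samples
X1 = rng.normal(loc=3.0, scale=1.0, size=(100, 2))   # class 1 samples
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# GaussianNB estimates P(h) = |h| / N and the per-feature P(fj | h),
# then predicts the MAP class arg max_h P(X|h)P(h).
clf = GaussianNB().fit(X, y)
print(clf.class_prior_)                          # estimated P(h) per class
print(clf.predict([[0.0, 0.0], [3.0, 3.0]]))     # MAP class assignments
```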
When the forest is ready, or designed as above, to classify a new instance it is run down all of the trees that were built. The new instance is assigned a class by each tree, and that assignment is registered as a vote [17].
All of the trees' votes are aggregated, and the class with the most votes, i.e., the largest share of votes, is returned as the prediction for the new instance [15]. During tree construction, when each bootstrap training set is created by sampling with replacement, about 1/3 of the instances are left out. These left-out instances are called the out-of-bag (OOB) data. Each tree has its own OOB set, which is used to estimate its individual error; this is called the OOB error estimate. The generalization error of a random forest is given as [18]:

PE* = P X,Y (mg(X, Y) < 0)   (10)

The margin function is given as:

mg(X, Y) = avk I(hk(X) = Y) − max j≠Y avk I(hk(X) = j)   (11)

The margin function measures the extent to which the average number of votes at (X, Y) for the right class exceeds the average vote for any other class [19]. The strength of the Random Forest is given in terms of the expected value of the margin function as:

s = E X,Y (mg(X, Y))   (12)

If ρ̄ is the mean value of the correlation between base trees, the following equation gives an upper bound for the generalization error:

PE* ≤ ρ̄(1 − s²) / s²   (13)

As a result, to improve Random Forest precision, the base decision trees must be diverse and accurate [19].
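The bootstrap-and-vote scheme and the OOB error estimate described above can be sketched with scikit-learn's RandomForestClassifier; `oob_score=True` scores each sample with the trees that did not see it during bootstrapping. The data is a synthetic stand-in, not the paper's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data with 9 features, mimicking the dataset shape.
X, y = make_classification(n_samples=500, n_features=9, n_informative=5,
                           random_state=42)

# Each tree is grown on a bootstrap sample; the ~1/3 of instances it never
# saw (the OOB data) are used to estimate generalization accuracy.
rf = RandomForestClassifier(n_estimators=200, bootstrap=True,
                            oob_score=True, random_state=42).fit(X, y)
print(rf.oob_score_)   # accuracy estimated on out-of-bag samples
```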
The value of the positive integer k is determined by examining the dataset, as seen in Figure 1. Cross-validation is a technique for determining a good value of k retrospectively by validating each candidate k against an independent data set. In this analysis, ten-fold cross-validation is used, and the value k = 1 is chosen because it yields the best outcome [7].
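The k-selection procedure above can be sketched with scikit-learn's cross-validation utilities: each candidate k is scored by ten-fold cross-validation and the best mean accuracy wins. The data is a synthetic stand-in for the paper's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with the same number of features as the dataset.
X, y = make_classification(n_samples=400, n_features=9, random_state=7)

# Ten-fold cross-validation over candidate k values; keep the best.
scores = {}
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=10).mean()
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```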
The SVM seeks the separating hyperplane that maximizes the margin and reduces the probability of misclassifying test dataset instances. The labeled training data are given as data points of the form [20]:

M = {(x1, y1), (x2, y2), …, (xn, yn)}   (14)

where yn is a constant (+1 or −1) that denotes the class to which point xn belongs, n is the number of data samples, and each xn is a p-dimensional real vector. The SVM classifier converts the input vectors into a decision value and then performs classification with the aid of a threshold value. The dividing hyperplane for the training data can be represented as Equation (15) [20]:

wᵀ · x + b = 0   (15)

where w is a p-dimensional weight vector and b is a scalar. The dividing hyperplane is perpendicular to the vector w, and the margin can be shifted using the offset parameter b. While the training data are linearly separable, select the two bounding hyperplanes such that there are no points between them, and then attempt to maximize the distance between them. The distance between these hyperplanes is 2 ∕ |w|, as shown in Figure (2). To maximize this margin, |w| is minimized while ensuring for all i that (Equation (16)) [21]:

w · xi − b ≥ 1  or  w · xi − b ≤ −1   (16)
Figure 2. SVM hyperplane with maximum margin trained for two-class samples [21].
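The maximum-margin fit above can be sketched with scikit-learn's linear-kernel SVC, whose `coef_` and `intercept_` hold w and b from Equation (15). The two well-separated clusters here are synthetic, not the paper's data:

```python
import numpy as np
from sklearn.svm import SVC

# Linearly separable toy data; labels are +1/-1 as in Equation (14).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# A linear-kernel SVM fits the maximum-margin hyperplane w^T x + b = 0.
svm = SVC(kernel="linear", C=1.0).fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]
margin = 2 / np.linalg.norm(w)     # geometric width of the margin
print(margin)
print(svm.predict([[-2, -2], [2, 2]]))
```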
Equation (19), called the sigmoid function, maps the value of ⊖ᵀx into the [0, 1] range. LR then seeks a value θ such that the probability P(y = 1|x) = h⊖(x) is large when x belongs to class "1" and small when x belongs to class "0" (i.e., P(y = 0|x) is large) [22]:

σ(t) = 1 / (1 + e⁻ᵗ)   (19)
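The sigmoid of Equation (19) can be written directly; the weight vector below is a hypothetical θ used only to show how a score θᵀx is turned into a probability:

```python
import numpy as np

def sigmoid(t):
    """Equation (19): squashes any real t into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-t))

# Logistic regression scores an instance x with theta^T x and reads
# the sigmoid output as P(y = 1 | x).
theta = np.array([0.8, -0.4])   # hypothetical learned weights
x = np.array([2.0, 1.0])
p = sigmoid(theta @ x)          # P(y = 1 | x) for this instance
print(round(float(p), 3))
```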
Criteria of Performance Evaluation
A confusion matrix is a tool used to monitor the accuracy of a classifier in classification [23]. It describes the connection between the actual classes and the predictions. The efficiency of the classification model is determined using the numbers of correct and incorrect classifications recorded in the confusion matrix for each possible variable. Table 2 shows the two-class confusion matrix [12]:
Table 2: Confusion Matrix of Two Classes

                      Predicted
                      Negative    Positive
Actual   Negative     TN          FP
         Positive     FN          TP
Where
- TP and TN stand for True Positive and True Negative, respectively, indicating the proportion of positive and
negative states that were correctly identified.
- FP stands for False Positive, which refers to all negative cases that were falsely classified as positive, and FN
stands for False Negative, which refers to all positive cases that were incorrectly classified as negative.
Accuracy
Accuracy measures the classifier's capability to produce accurate diagnoses [23]. Equation (20) shows the accuracy formula:

Accuracy = ((TP + TN) / (TP + TN + FP + FN)) * 100   (20)
Precision
It simply shows "what number of selected data items are relevant". In other words, out of the observations that an algorithm has predicted to be positive, how many are actually positive. According to formula (21), precision equals the number of true positives divided by the sum of true positives and false positives [25]:

Precision = TP / (TP + FP)   (21)
Sensitivity or Recall
Recall presents "what number of relevant data items are selected". In other words, out of the observations that are actually positive, how many have been predicted as positive by the algorithm. According to formula (22), recall equals the number of true positives divided by the sum of true positives and false negatives [25]:

Recall = TP / (TP + FN)   (22)
Specificity
Specificity is the metric that evaluates a model's ability to predict the true negatives of each available category. The specificity value is computed by applying Equation (23) [25]:

Specificity = TN / (TN + FP)   (23)
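Equations (20)-(23) can be evaluated directly from the four confusion-matrix counts. The counts below are hypothetical, chosen only to illustrate the arithmetic, and do not reproduce the paper's actual results:

```python
# Hypothetical confusion-matrix counts (not the paper's results).
TP, TN, FP, FN = 190, 206, 2, 2

accuracy = (TP + TN) / (TP + TN + FP + FN) * 100   # Eq. (20), percent
precision = TP / (TP + FP)                         # Eq. (21)
recall = TP / (TP + FN)                            # Eq. (22)
specificity = TN / (TN + FP)                       # Eq. (23)
print(accuracy, round(precision, 3), round(recall, 3), round(specificity, 3))
```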
The correlation coefficient r between two score lists x and y is computed as:

r = (N ∑xy − (∑x)(∑y)) / √((N ∑x² − (∑x)²)(N ∑y² − (∑y)²))   (24)

where:
N = the number of pairs of scores
∑xy = the sum of the products of paired scores
∑x = the sum of the x scores
∑y = the sum of the y scores
∑x² = the sum of the squared x scores
∑y² = the sum of the squared y scores

The correlation coefficient is also called the cross-correlation coefficient. It always lies in the range −1 to +1, with −1 indicating that X and Y are negatively correlated and +1 indicating that X and Y are positively correlated.
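The summations defined above can be computed directly on a small pair of score lists; the perfectly proportional y below is chosen so the expected coefficient is +1:

```python
import math

# Small illustrative score lists; y = 2x, so the correlation is perfect.
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

N = len(x)
sxy = sum(a * b for a, b in zip(x, y))   # sum of products of paired scores
sx, sy = sum(x), sum(y)                  # sums of the x and y scores
sx2 = sum(a * a for a in x)              # sum of squared x scores
sy2 = sum(b * b for b in y)              # sum of squared y scores
r = (N * sxy - sx * sy) / math.sqrt((N * sx2 - sx**2) * (N * sy2 - sy**2))
print(r)   # +1 indicates perfect positive correlation
```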
Exploratory data analysis will assist in making procedural decisions, removing uncertainty, and gaining key insights. As a result, exploratory data analysis (EDA) was implemented as a dataset preprocessor [9].
The aim of EDA is to gain a broad understanding of the data. It is primarily carried out to determine the data's properties, patterns, and visualizations, and it helps ensure the data are accurate and ready to be used by machine learning algorithms. In this work, the EDA uses descriptive statistics representing the attribute type, class distribution, mean, standard deviation, median, quartiles, skewness, and correlation [24].
In this stage, the distribution is divided into four parts and the quartile measures are computed above and below the center of the distribution of values. The quartiles break the data into quarters such that 25% of the measurements are less than the lower quartile, 50% are less than the median, and 75% are less than the upper quartile; the median divides the data in half so that 50% of the measurements are below it and 50% are above it. The mean is equal to the sum of all the values in the data divided by their number. For a data set containing n values x1, x2, x3, …, xn, the sample mean, usually denoted by x̄ (pronounced "x bar"), is shown in formula (25), usually written using the Greek capital letter ∑, pronounced "sigma", which means "sum of":

x̄ = (1/n) ∑ i=1..n xi   (25)

where:
∑ = summation of…; x̄ = sample mean; n = number of scores in the sample

The standard deviation formula is given in (26):

s = √( ∑ i=1..n (xi − x̄)² / (n − 1) )   (26)

where:
s = sample standard deviation; ∑ = summation of…; x̄ = sample mean; n = number of scores in the sample
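The descriptive statistics of this EDA stage, the sample mean of Equation (25), the sample standard deviation of Equation (26) with its n − 1 divisor, and the quartiles, can be computed on a small illustrative sample:

```python
import numpy as np

# Small illustrative sample (not taken from the diabetes dataset).
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.sum() / x.size                              # Eq. (25)
s = np.sqrt(((x - mean) ** 2).sum() / (x.size - 1))  # Eq. (26), n - 1
q1, median, q3 = np.percentile(x, [25, 50, 75])      # quartile measures
print(mean, round(float(s), 3), q1, median, q3)
```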
1.1 Splitting the Dataset into Training and Test Sets
This work divides the dataset into an 80% training set for the classifier and a 20% testing set used to evaluate the accuracy of the classification system. The total number of samples in the diabetes dataset is 2000; at this stage the database is split into training and test sets as shown in Table 3.
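The 80% / 20% split can be sketched with scikit-learn's train_test_split; the 2000-row array below is a random stand-in matching the dataset's shape, not the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in for the 2000-sample, 9-feature diabetes dataset.
X = np.random.default_rng(3).normal(size=(2000, 9))
y = np.random.default_rng(4).integers(0, 2, size=2000)

# 80% of the samples go to training, 20% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
print(len(X_train), len(X_test))
```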
The Random Forest algorithm explores the possibilities at random; for each possibility it draws a tree, as shown in the example in Figure 5.
Figure 5. Example of a Random Forest Tree for All Probabilities of the Dataset.
As shown in Figure 5, red in the tree indicates class 1, while blue indicates class 0. In addition, the color gets darker in the leaves of the tree depending on their value, which represents the best value for that group. The algorithm then divides each class into groups and considers all possibilities by drawing independent trees for each group, as shown in Figure 6 for class 0.
Based on the probabilities of all the features in the dataset computed using the Random Forest algorithm, as shown in Figures 5, 6, and 7, the highest value obtained for each feature is taken, and the features are then rearranged from highest to lowest as shown in Figure 8.
Then the five features whose importance exceeds the 80% threshold are selected; the value of this threshold is proposed in this work based on the obtained values. The important features selected by the Random Forest algorithm are GL, AGE, DPF, BP, and BMI; the distribution of the selected features is illustrated in Figure 9.
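The selection step above can be sketched with scikit-learn's feature importances: rank features from highest to lowest, then keep the top ones. Applying the 80% threshold to the cumulative importance is an assumption about the paper's criterion, and the data is synthetic:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 9-feature stand-in for the diabetes dataset.
X, y = make_classification(n_samples=500, n_features=9, n_informative=4,
                           random_state=5)
rf = RandomForestClassifier(n_estimators=200, random_state=5).fit(X, y)

# Rank features by importance (highest first), then keep the smallest
# prefix whose cumulative importance reaches the assumed 80% threshold.
order = np.argsort(rf.feature_importances_)[::-1]
cum = np.cumsum(rf.feature_importances_[order])
selected = order[:int(np.searchsorted(cum, 0.80)) + 1]
print(selected)
```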
Figure 9. Distribution of Data for the GL, AGE, DPF, BP, and BMI Features After Applying the Random Forest Algorithm.
After choosing features with the RF algorithm, the features are standardized by removing the mean and scaling to unit variance. As shown in Equation (27) [11], the standard score z of a sample x is estimated as:

z = (x − u) / s   (27)

where u is the mean of the training samples, or zero if with_mean=False, and s is the standard deviation of the training samples, or one if with_std=False.
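Equation (27) is exactly what scikit-learn's StandardScaler applies per feature; a minimal sketch on a single toy column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One toy feature column; the scaler learns u and s from these samples.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Applies z = (x - u) / s per feature (with_mean and with_std default True).
scaler = StandardScaler()
Z = scaler.fit_transform(X)
print(Z.ravel())   # standardized scores: zero mean, unit variance
```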
Results and Discussion
The results of each step of the proposed system are illustrated in this section. The proposed model was implemented under the Windows 10 Professional operating system on an Intel(R) Core(TM) i5-2450M CPU @ 2.50 GHz with 8 GB of random access memory and a 64-bit system type, and the proposed system was run in a Python-language environment.
Table 1 lists the diabetes dataset instances and attributes, as well as some statistical data. The correlation between the dataset features is also visualized in Figure 10 using a correlation matrix.
Figure 10. Correlation Matrix Analysis for All Features in the Diabetes Dataset.
As shown in Figure 10, all the diagonal elements of the correlation matrix (c) must be 1, because the correlation of a variable with itself is always perfect (cii = 1), and the matrix is symmetric (cij = cji). The diagonal divides the correlation coefficient matrix into two parts (top triangle and bottom triangle), and both hold the same values, which lie in the range [−1, 1]. Table 4 shows the results of applying the EDA technique for descriptive statistics to all features in the diabetes dataset, computing the mean using Equation (25), the standard deviation using Equation (26), and the count, min, max, and quartiles (25%, 50%, 75%).
Table 4. Results of EDA Technique on Diabetes Dataset.
Feature ID count mean SD Min max 25% 50% 75%
From Table 4 the similarity between the features can be examined; when two or more features are found to be equal, only one of them is kept to increase the accuracy of the system in the classification stage. After selecting the important features using the Random Forest algorithm as shown in section (4.3), the data of these features are scaled using Equation (27). The data distribution after applying the scaling algorithm to the five features (Glucose (GL), BMI, Age, Diabetes Pedigree Function (DPF), and Blood Pressure (BP)) is illustrated in Figure 11 for training and in Figure 12 for testing.
Figure 11. Histograms of the Training Features (GL, BMI, Age, DPF, and BP) After Applying Data Scaling.
Figure 12. Histograms of the Testing Features (GL, BMI, Age, DPF, and BP) After Applying Data Scaling.
Figures 11 and 12 show the effect of the scaling algorithm on the distribution of the training and testing data: the values of each feature are limited to a certain range, which gives equal weight to very small values (which could be mere noise) and large values. Scaling up small variables (which may also be irrelevant) can change the results profoundly.
The comparative performance of the NB, RF, KNN, SVM, and LR classification algorithms based on the accuracy rate from Equation (20) is shown in Figure 13, which illustrates that RF obtains the best accuracy rate in comparison with the other classification algorithms used in the proposed system: the accuracy of RF is 99%, followed by KNN at 98.75%, SVM at 81%, LR at 77.5%, and NB at 77.25%.
Figure 13. Comparison Between the Five Classification Algorithms Used in the Proposed System Based on Accuracy Ratio.
Table 5 reports the Precision metric computed using Equation (21), the Sensitivity (Recall) metric computed using Equation (22), and the Specificity metric computed using Equation (23) for the five classification algorithms: Gaussian Naïve Bayes (NB), Random Forest (RF), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Logistic Regression (LR).
Table 5. Classification performance based on Precision, Sensitivity, and Specificity.
No Methods Precision Sensitivity Specificity
Table 6 compares the proposed approach to previous methods in terms of accuracy. With the RF classification algorithm, the proposed method performed well, achieving 99% accuracy. This shows that the proposed method can effectively diagnose diabetes.
Table 6. Comparison of the proposed approach with previous methods on the diabetes dataset in terms of accuracy.
No. Reference Method Accuracy
Conclusion
One of the biggest problems in the healthcare sector is detecting diabetes early. The proposed model aims to predict diabetes accurately. It uses a diabetes dataset that has 2000 instances with 9 features and passes the data through several stages: loading the diabetes dataset, analyzing the dataset using a correlation matrix, preprocessing using descriptive statistics, splitting the dataset into 80% training and 20% test sets, feature selection using Random Forest feature analysis, scaling the selected important features, and the classification stage (Naïve Bayes, Random Forest, Support Vector Machine, K-Nearest Neighbor, and Logistic Regression). The results show that RF obtains the best accuracy rate in comparison with the other classification algorithms used in the proposed system: the accuracy of RF is 99%, while KNN reaches 98.75%, SVM 81%, LR 77.5%, and NB 77.25%. Moreover, compared to previous methods, the proposed approach is more accurate: 99% accuracy was achieved with the RF classification algorithm, an excellent result. This shows that the proposed method can effectively diagnose diabetes.
References
[1] Srivastava, S., Sharma, L., Sharma, V., Kumar, A., & Darbari, H. (2019). Prediction of diabetes using artificial
neural network approach. In Engineering Vibration, Communication and Information Processing (pp. 679-687).
Springer, Singapore.
[2] Tama, B. A., & Rhee, K. H. (2019). Tree-based classifier ensembles for early detection method of diabetes: an
exploratory study. Artificial Intelligence Review, 51(3), 355-370.
[3] Battineni, G., Sagaro, G. G., Chinatalapudi, N., & Amenta, F. (2020). Applications of machine learning predictive models in the chronic disease diagnosis. Journal of personalized medicine, 10(2), 21.
[4] Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., & Chouvarda, I. (2017). Machine learning and data mining methods in diabetes research. Computational and structural biotechnology journal, 15, 104-116.
[5] Kaware, S. R., & Wadne, V. S. (2020). Improve the performance of cancer and diabetes detection using novel technique of machine learning (No. 2332). EasyChair.
[6] Miao, L., Guo, X., Abbas, H. T., Qaraqe, K. A., & Abbasi, Q. H. (2020, August). Using Machine Learning to
Predict the Future Development of Disease. In 2020 International Conference on UK-China Emerging
Technologies (UCET) (pp. 1-4). IEEE.
[7] Daanouni, O., Cherradi, B., & Tmiri, A. (2019, October). Predicting diabetes diseases using mixed data and supervised machine learning algorithms. In Proceedings of the 4th International Conference on Smart City Applications (pp. 1-6).
[8] Anwar, N. H. K., & Saian, R. (2020). Predictive accuracy for two diabetes datasets using ant-miner algorithm. International Journal of Scientific and Technology Research, 9(4), 239-242.
[9] Maniruzzaman, M., Rahman, M. J., Ahammed, B., & Abedin, M. M. (2020). Classification and prediction of
diabetes disease using machine learning paradigm. Health information science and systems, 8(1), 1-14.
[10] https://www.kaggle.com/emrzcn/diabetes-prediction-with-lr-knn-nb-svm-rf-gbm.
[11] Haq, A. U., Li, J. P., Khan, J., Memon, M. H., Nazir, S., Ahmad, S., ... & Ali, A. (2020). Intelligent Machine
Learning Approach for Effective Recognition of Diabetes in E-Healthcare Using Clinical Data. Sensors, 20(9),
2649.
[12] Yin, H., & Chaoyang, Z. (2011, October). An improved bayesian algorithm for filtering spam e-mail. In 2011
2nd International Symposium on Intelligence Information Processing and Trusted Computing (pp. 87-90). IEEE.
[13] Kamel, H., Abdulah, D., & Al-Tuwaijari, J. M. (2019, June). Cancer Classification Using Gaussian Naive
Bayes Algorithm. In 2019 International Engineering Conference (IEC) (pp. 165-170). IEEE.
[14] S-B. Kim, K-S. Han, H-C. Rim and S. H. Myaeng, “Some Effective Techniques for Naive Bayes Text
Classification”, IEEE Transactions on Knowledge and Data Engineering, vol.18, no.11, pp.1457-1466, Nov. 2006.
[15] Arora, J., & Agrawal, U. (2020). Classification of Maize leaf diseases from healthy leaves using Deep
Forest. Journal of Artificial Intelligence and Systems, 2(1), 14-26.
[16] Gobalakrishnan, N., Pradeep, K., Raman, C. J., Ali, L. J., & Gopinath, M. P. (2020, July). A Systematic
Review on Image Processing and Machine Learning Techniques for Detecting Plant Diseases. In 2020
International Conference on Communication and Signal Processing (ICCSP) (pp. 0465-0468), IEEE.
[17] Mecheter, I., Alic, L., Abbod, M., Amira, A., & Ji, J. (2020). MR Image-Based Attenuation Correction of
Brain PET Imaging: Review of Literature on Machine Learning Approaches for Segmentation. Journal of Digital
Imaging, 1-18.
[18] Tigga, N. P., & Garg, S. (2020). Prediction of type 2 diabetes using machine learning classification methods.
Procedia Computer Science, 167, 706-716.
[19] Chow, L. S., & Paramesran, R. (2016). Review of medical image quality assessment. Biomedical signal processing and control, 27, 145-154.
[20] Tambade, S., Somvanshi, M., Chavan, P., & Shinde, S. (2017). SVM-based diabetic classification and
hospital recommendation. International Journal of Computer Applications, 167(1), 40-43.
[21] Nguyen, L. (2017). Tutorial on support vector machine. Applied and Computational Mathematics, 6(4-1), 1-
15.
[22] Zhu, C., Idemudia, C. U., & Feng, W. (2019). Improved logistic regression model for diabetes prediction by
integrating PCA and K-means techniques. Informatics in Medicine Unlocked, 17, 100179.
[23] Doreswamy, H. K. (2012). Performance evaluation of predictive classifiers for knowledge discovery from
engineering materials data sets. arXiv preprint arXiv:1209.2501.
[24] Indrakumari, R., Poongodi, T., & Jena, S. R. (2020). Heart Disease Prediction using Exploratory Data
Analysis. Procedia Computer Science, 173, 130-139.
[25] Vakili, M., Ghamsari, M., & Rezaei, M. (2020). Performance Analysis and Comparison of Machine and
Deep Learning Algorithms for IoT Data Classification. arXiv preprint arXiv:2001.09636.