Education 13 00293
Abstract: A problem that pervades students’ careers is their poor performance in high school. Predicting students’ academic performance helps educational institutions in many ways. Knowing and identifying the factors that can affect students’ academic performance at the beginning of their educational path can help educational institutions achieve their educational goals by providing support to students earlier. The aim of this study was to predict the achievement of early secondary students. Two sets of data were used for high school students who graduated from the Al-Baha region in the Kingdom of Saudi Arabia. In this study, three models were constructed using different algorithms: Naïve Bayes (NB), Random Forest (RF), and J48. Moreover, the Synthetic Minority Oversampling Technique (SMOTE) was applied to balance the data, and features were extracted using the correlation coefficient. The performance of the prediction models was also validated using 10-fold cross-validation and direct partition, in addition to various performance evaluation metrics: true positive (TP) rate, false positive (FP) rate, accuracy, precision, recall, F-measure, and the receiver operating characteristic (ROC) curve. The NB model achieved a prediction accuracy of 99.34%, followed by the RF model with 98.7%.
Keywords: machine learning; educational data mining; secondary school; prediction; academic
performance
The secondary stage can be considered the top of the pyramid of public education in Saudi Arabia. It is the gateway to the world of graduate studies and functional specializations. In addition, secondary education coincides with the critical stage of adolescence, which is accompanied by many psychological and physical changes. This stage therefore requires a careful and insightful look, with the cooperation of many parties, to prepare students and to reduce the inability of many of them to continue their higher education in institutes or through self-directed paths. Academic achievement has an impact on students’ self-confidence and their desire to reach higher ranks [4]. The good academic achievement of students often reflects the quality and success of an educational institution. Conversely, a low level of student achievement, and a low ability of students to obtain a seat in higher education, lowers the reputation of the educational institution [5]. There are several methods by which students’ academic performance can be measured, including the use of data mining techniques.
By textbook definition, data mining is the process of analyzing a quantity of data (usually a large amount) to create a logical relationship that summarizes the data in a new way that is understandable and useful to the data owner [6].
In other words, data mining is the analysis of large collections of observed data in search of summarized forms that are more understandable and useful to the user. With the aim of extracting or discovering useful and exploitable knowledge from a large collection of data, it helps uncover hidden facts, knowledge, and unexpected patterns that exist in large databases [7]. Data mining offers various techniques for taking advantage of data, such as description, prediction, estimation, classification, clustering, and correlation [8].
Data mining technology goes through several stages before reaching results. The first stage begins with the collection of raw data from different data sources, followed by the data pre-processing stage: denoising, excluding conflicting or redundant data, reducing dimensions, extracting features, etc. In the next stage, patterns are identified through several techniques, including clustering and classification. Finally, the results are presented in the last stage. Figure 1 shows a summary of the data mining process in general [9].
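The stages above can be sketched as a minimal end-to-end pipeline. The snippet below is illustrative only: the record fields, cleaning rules, and the toy rule-based classifier are hypothetical, not taken from the paper.

```python
# Minimal sketch of the data mining stages described above (illustrative only;
# the record fields and cleaning rules are hypothetical, not the paper's own).

def preprocess(records):
    """Pre-processing: drop records with missing values, then exact duplicates."""
    cleaned = [r for r in records if all(v is not None for v in r.values())]
    seen, unique = set(), []
    for r in cleaned:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def classify(record):
    """Pattern identification: a toy rule-based classifier."""
    return "pass" if record["grade"] >= 70 else "fail"

# Stage 1: raw data collection (here, hard-coded toy records).
raw = [
    {"grade": 85, "age": 17},
    {"grade": 85, "age": 17},    # duplicate -> excluded in pre-processing
    {"grade": None, "age": 18},  # missing value -> excluded in pre-processing
    {"grade": 60, "age": 19},
]

# Stages 2-4: pre-process, identify patterns, present results.
data = preprocess(raw)
results = [classify(r) for r in data]
print(results)  # ['pass', 'fail']
```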
The application of data mining technology has benefited many fields such as health
care [10,11], business [12], politics [13], education, and others. Educational Data Mining
(EDM) is one of the most important and popular fields of data mining and knowledge
extraction from databases.
The objectives of educational data mining can be divided into three sections: educa-
tional objectives, administrative objectives, and business objectives. One of the educa-
tional objectives is to improve the academic performance of learners.
Decision makers have a huge database of students and learning outcomes. However, this massive amount of data, despite the rich knowledge it contains, has not been investigated effectively to evaluate students’ academic performance in a comprehensive manner and thereby improve the overall performance of the educational institution, especially in rural and suburban areas. A comprehensive literature review has been conducted in this regard.
Educ. Sci. 2023, 13, 293 3 of 26
The proposed study investigates the role of some key demographic factors in addition to
academic factors as well as the academic performance of high school students to predict
success by means of utilizing data mining techniques. It is also concerned with investigat-
ing the relationship between academic performance and a set of demographic factors re-
lated to the student.
The rest of the paper is organized in the following order: the literature is reviewed in
Section 2, a summary of the techniques used is in Section 3, the application and imple-
mentation are in Section 4, the results are presented in Section 5, and the conclusion is in
Section 6.
2. Related Work
Previous studies constitute a body of research dealing with the topic under study; they provide the researcher with information that helps in fully understanding the topic of the scientific research. The following are the most prominent previous studies related to the use of data mining in the educational sector at various educational levels: secondary school, undergraduate level, and master’s level.
In a related study [14], the researchers used the Naïve Bayesian (NB) algorithm to
predict student academic success and behavior. The goal of this study is to use data ex-
traction techniques to help educational institutions gain insight into their educational
level, which can also be useful in enhancing the academic performance of students. The
application was based on a database containing information on 395 high school students
with 35 attributes. Attention has only been given to the set of mathematics degrees in
various courses. The classifier categorized the students into two categories, pass and fail,
with an accuracy of 87% [14].
Another study [15] was conducted with the purpose of building a classifier that com-
prehensively analyzes students’ data and forecasts their performance. The study database
was collected from 649 students from two secondary schools in Portugal. It includes 33
different characteristics including academic and demographic features. Nine different al-
gorithms were implemented which are (NB, decision tree (J48), Random Forests (RF), Ran-
dom Tree (RT), REPTree, JRip, OneR, SimpleLogistic (SL), and ZeroR). The results found
that academic scores had the largest influence on prediction, followed by study time and
school name. The highest score is obtained with OneR and REPTree, with an accuracy of
76.73% [15].
Similarly, the study in [16] aimed to predict the academic achievements of high school students in Malaysia and Turkey. The study focused on the students’ academic achievements in specific scientific subjects (physics, chemistry, and biology) to consider the precautions needed against their failure. The study sample consisted of 922 students from Turkey and 1050 students from Malaysian schools, with 34 features.
The Artificial Neural Network (ANN) algorithm was chosen to build the model via
MATLAB. The proposed models scored 98.0% for the Turkish student sample and 95.5%
for the Malaysian student sample. The study concluded that family factors have a funda-
mental role in influencing the accuracy of predicting students’ success.
The study in [17] aimed to investigate the main factors that affect the overall academic performance of secondary schools in Tunisia. The database covered 105 secondary schools and several predictive factors that could positively or negatively affect a school’s efficiency (school size, school location, students’ economic status, parental pressure, percentage of female students, competition). The study constructed two models using Regression Tree and RF algorithms to identify and visualize factors that could influence secondary school performance. The study showed that the school’s location and parental pressure are among the factors that improve students’ performance. Additionally, smaller class sizes may provide a more effective education and a more positive environment. The study also encouraged the development of parental participation policies to enhance schools’ academic performance. A study [18] proposed a hybrid approach to
greatly affect students’ orientations and their future intentions [23]. Table 1 presents a
summary of the literature reviews made for master’s students.
| Ref. | Year | Country | Algorithm Used | High Accuracy Achieved | Dataset Size | Limitations |
|---|---|---|---|---|---|---|
| [14] | 2017 | Portugal | NB | 87% | 395 | Single algorithm; predicting student achievement in only two categories (pass and fail); feature selection is not used. |
| [15] | 2019 | Portugal | NB, J48, RF, RT, REPTree, JRip, OneR, SL, and ZeroR | 76.7% | 649 | The results of the algorithms were only compared with accuracy. |
| [16] | 2019 | Malaysia and Turkey | ANN | ~96.9% | 922; 1050 | Applied to only one algorithm; student achievement was predicted for some courses, not for the final average. |
| [17] | 2020 | Tunisia | Regression Tree and RF | - | 105 | Academic achievement is predicted at the school level, not for students. |
| [18] | 2020 | - | RF, C5.0, NB, and SVM | 99.7% | 1204 | Most of the factors were significant and did not have a clear and specific measure. |
| [19] | 2019 | Nigeria | RF, Tree Ensemble, DT, NB, LR, and Resilient backpropagation | 51.9% | 1445 | Poor predictive accuracy. |
| [20] | - | Saudi | DT, RF, sequence of layer perception, and LR | - | - | There are no demographic features. |
| [21] | 2020 | Saudi | REPTree | 69.3% | 339 | Decision tree algorithms only. |
| [22] | 2019 | Portugal | ANN, RF, Linear regression | 97.749% | 395 | No academic features. |
| [23] | 2019 | China | CESCA-FKNN, RF, SVM, kernel extreme | 82.47% | 702 | Predicting students’ trends after graduation only; it does not predict student achievement. |
| [24] | 2023 | Chile | ANN, AdaBoost, NB, RF, J48 | 65.2% | 18,610 | Focuses on academic factors; results need improvement; distance learning environment. |
A study [24] presented a Cross-Industry Standard Process for Data Mining (CRISP-
DM) methodology to analyze data from the Distance Education Center of the Universidad
Católica del Norte (DEC-UCN) from 2000 to 2018. The data set size was more than 18,000
records. They have applied several algorithms such as ANN, AdaBoost, NB, RF, and J48.
The highest accuracy was gained for J48. The study highlights the importance of EDM and
aims to further improve it in the future by adding advanced methods.
Yagci (2022) [25] presented an EDM approach to predict the students at risk. The
dataset was taken from a single course at a Turkish state university during the fall semes-
ter of 2019–2020. Several machine learning algorithms have been investigated such as LR,
SVM, RF, and k nearest neighbors (kNN). The highest accuracy was achieved in the range
of 70–75%. The prediction was made based on only three parameters, the student grade,
department, and faculty data.
The following points indicate the research gap; the potential contribution of this work is to fill this gap.
• There is a lack of studies in the KSA predicting the academic performance of high school students.
• Most of the studies in the KSA target the undergraduate level. However, the issues must be addressed earlier for better career counseling/adoption.
• Studies mainly focus on academic performance rather than both demographic and academic factors.
• Most of the studies in the literature target urban-area students. However, in rural areas and suburbs students face more issues, and these are the target of the ongoing study.
From the comprehensive review of the literature over a decade in EDM, it is evident
that:
• NB, DT, and RF are among the most widely used algorithms in education data min-
ing for success prediction.
• Thus, in the current study, their selection is based on their suitability to the EDM,
dataset nature, and size.
• Moreover, it is observed that accuracy is the most widely used metric to evaluate the
efficiency of the EDM algorithms in the literature.
• The most common demographic factors are gender, age, address, the relationship between mother and father, the age and occupation of the father and mother, and the place and type of residence.
• The most used academic factors were the semester grades, the subject grades, and the final grade, in addition to the mock score, the duration of the study, and the number of subjects in a year.
RF shows more effectiveness with big data in terms of its ability to handle many variables without deleting any of them [29]. Additionally, it can estimate missing values. All these features have allowed the RF classifier to spread widely in different applications [30,31]. RF creates multiple decision trees without pruning, which individually have high variance and low bias [32]. The decision trees are combined to obtain a more accurate and stable prediction, and as the number of decision trees increases, the performance of RF increases [28]. The final classification decision is based on the average of the probabilities estimated by all produced trees. The final vote is calculated from Equation (1) [27].
$$RFfi_{p} = \frac{\sum_{t \in \text{all trees}} normfi_{pt}}{\sum_{j \in \text{all features}} \sum_{t \in \text{all trees}} normfi_{jt}} \tag{1}$$

where $RFfi_{p}$ is the final vote (importance) of feature $p$, and $normfi_{pt}$ is the normalized feature importance of feature $p$ in tree $t$.
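Equation (1) translates directly into code: given the normalized importance of each feature in each tree, the final importance of a feature is its summed per-tree importance divided by the total over all features and trees. The forest and importance values below are made up for illustration.

```python
def rf_feature_importance(norm_fi, feature):
    """Equation (1): summed importance of `feature` over all trees,
    divided by the total importance over all features and all trees.

    norm_fi: dict mapping tree index -> {feature name: normalized importance}
    """
    numerator = sum(tree.get(feature, 0.0) for tree in norm_fi.values())
    denominator = sum(v for tree in norm_fi.values() for v in tree.values())
    return numerator / denominator

# Toy forest of two trees over two features (values are hypothetical).
norm_fi = {
    0: {"GS_1": 0.7, "Gender": 0.3},
    1: {"GS_1": 0.5, "Gender": 0.5},
}
print(round(rf_feature_importance(norm_fi, "GS_1"), 4))  # 0.6
```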
3.2. J48
This algorithm falls under the category of decision trees. The J48 algorithm is an extension and upgraded version of the Iterative Dichotomiser 3 (ID3) algorithm, developed by Ross Quinlan [33]. The J48 algorithm analyzes categorical features and can also handle continuous features. It also has an imputation technique with which it can process missing values based on the available data. In addition, it can prune trees and avoid over-fitting the data. These developments enable it to build a tree that is more balanced in terms of flexibility and accuracy [34]. The decision tree includes several decision nodes, which represent attribute tests, while classes are represented by leaf nodes. The feature chosen as the root node is the one with the highest information gain. Rules can be inferred by tracing the tree’s paths from the root to the leaf nodes [35]. Thus, the J48 algorithm contributes to building easy-to-understand models.
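The information-gain criterion described above can be illustrated with a short, self-contained sketch; the toy pass/fail split below is hypothetical, chosen so the gain is easy to check by hand.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent node minus the weighted entropy of the child splits."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

# Toy example: splitting 4 pass / 4 fail students on some attribute that
# separates them perfectly, so the gain equals the parent entropy (1 bit).
parent = ["pass"] * 4 + ["fail"] * 4
children = [["pass"] * 4, ["fail"] * 4]
print(information_gain(parent, children))  # 1.0
```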
4. Empirical Studies
4.1. Description of High School Student Dataset
The dataset for this study was collected using the electronic questionnaire tool. The
questionnaire targeted newly graduated students who had completed their secondary ed-
ucation in Al-Baha educational sector schools. The data set included 526 records with 26
features. Table 2 provides a brief description of all the features contained in the dataset.
| No | Attribute | Description | Values |
|---|---|---|---|
| 1 | Gender | Gender | female = 1; Male = 2 |
| 2 | Age | Age at year of high school graduation | <18 years = 1; 18–20 years = 2; above 20 years = 3 |
| 3 | Social_status | Social status | Single = 1; Married = 2 |
| 4 | Specialization | Specialization | Scientific = 1; Literary = 2; Management = 3 |
| 6 | Rank | Ranking among siblings | Eldest = 1; Middle child = 2; Youngest = 3 |
| 15 | Mother_Job | Mother’s job | Works = 1; does not work = 2; retired = 3 |
| 19 | Acc_place | Accommodation place | Village = 1; Residential scheme = 2 |
| 20 | GS_1 | Grade in semester 1 | 90–100% = 1; 80–89% = 2; 70–79% = 3; less than 70% = 4 |
| 21 | GS_2 | Grade in semester 2 | 90–100% = 1; 80–89% = 2; 70–79% = 3; less than 70% = 4 |
| 22 | GS_3 | Grade in semester 3 | 90–100% = 1; 80–89% = 2; 70–79% = 3; less than 70% = 4 |
| 23 | GS_4 | Grade in semester 4 | 90–100% = 1; 80–89% = 2; 70–79% = 3; less than 70% = 4 |
| 24 | GS_5 | Grade in semester 5 | 90–100% = 1; 80–89% = 2; 70–79% = 3; less than 70% = 4 |
| 25 | GS_6 | Grade in semester 6 | 90–100% = 1; 80–89% = 2; 70–79% = 3; less than 70% = 4 |
| 26 | Class | Final high school graduation rate | 90–100% = 1; 80–89% = 2; 70–79% = 3; less than 70% = 4 |
| No | Attribute | Mean | Median | Standard Deviation | Maximum | Minimum |
|---|---|---|---|---|---|---|
| 1 | Gender | 1.474286 | 1 | 0.499815 | 2 | 1 |
| 2 | Age | 1.967619 | 2 | 0.596506 | 4 | 1 |
| 3 | Ss | 1.135238 | 1 | 0.342304 | 2 | 1 |
| 4 | Sp | 1.601905 | 2 | 0.641739 | 3 | 1 |
| 5 | BS | 2.062857 | 2 | 0.616141 | 3 | 1 |
| 6 | Rank | 2.030476 | 2 | 0.69005 | 3 | 1 |
| 7 | Relative | 1.632381 | 2 | 0.482617 | 2 | 1 |
| 8 | Father_Age | 2.731429 | 3 | 1.054915 | 5 | 1 |
| 9 | Father_Edu | 3.607619 | 4 | 1.247264 | 6 | 1 |
| 10 | Father_live | 1.32 | 1 | 0.659979 | 3 | 1 |
| 11 | Father_Job | 1.664762 | 1 | 0.868224 | 3 | 1 |
| 12 | Mother_Age | 2.308571 | 2 | 0.959098 | 5 | 1 |
| 13 | Mother_Edu | 3.085714 | 3 | 1.314295 | 6 | 1 |
| 14 | Mother_Live | 1.245714 | 1 | 0.584962 | 3 | 1 |
| 15 | Mother_Job | 1.737143 | 2 | 0.541636 | 3 | 1 |
| 16 | F_income | 3.325714 | 4 | 1.208512 | 5 | 1 |
| 17 | Acc_type | 1.750476 | 2 | 0.762069 | 3 | 1 |
| 18 | Rented_A | 1.786667 | 2 | 0.410052 | 2 | 1 |
| 19 | Acc_place | 1.588571 | 2 | 0.492562 | 2 | 1 |
| 20 | GS_1 | 1.952381 | 2 | 0.948894 | 4 | 1 |
| 21 | GS_2 | 1.889524 | 2 | 0.936528 | 4 | 1 |
| 22 | GS_3 | 1.849524 | 2 | 0.911245 | 4 | 1 |
| 23 | GS_4 | 1.761905 | 1 | 0.896599 | 4 | 1 |
| 24 | GS_5 | 1.668571 | 1 | 0.904116 | 4 | 1 |
| 25 | GS_6 | 1.55619 | 1 | 0.788264 | 4 | 1 |
| 26 | Class | 1.84381 | 2 | 0.974084 | 4 | 1 |
The models in this study were built using the RF, NB, and J48 algorithms. Moreover, Microsoft Excel was used in the data pre-processing step and to extract a statistical analysis of the data. Both 10-fold cross-validation (CV) and direct partitioning (75:25) were performed to calculate the accuracy of each model. To evaluate the proposed models, multiple test measures were used: accuracy, precision, recall, F-measure, specificity, and the ROC curve. The complete methodological steps are shown in Figure 4.
[Figure 4: Methodology — collecting the data; pre-processing (digitization, handling missing and conflicting values, data transformation); data augmentation (SMOTE); feature selection.]
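The two validation schemes used in the methodology — 10-fold cross-validation and a 75:25 direct partition — can be sketched as index splits. This is a simplified sketch: stratification and shuffling, which a real toolkit applies, are omitted.

```python
def kfold_indices(n, k=10):
    """Split range(n) into k contiguous folds; each fold serves once as the test set."""
    fold_size, folds = n // k, []
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n  # last fold takes the remainder
        test = list(range(start, end))
        train = [j for j in range(n) if j < start or j >= end]
        folds.append((train, test))
    return folds

def holdout_indices(n, train_ratio=0.75):
    """Direct partition: first 75% of indices for training, the rest for testing."""
    cut = int(n * train_ratio)
    return list(range(cut)), list(range(cut, n))

# Using the study's 526 collected records as the dataset size.
train, test = holdout_indices(526)
folds = kfold_indices(526, k=10)
print(len(train), len(test), len(folds))  # 394 132 10
```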
4.4.1. Digitization
Initially, the data from the electronic questionnaire was collected and then stored and
arranged in Microsoft Excel workbooks. Excel workbooks consist of columns and rows.
Each row represents a record while the columns represent the attributes of the record.
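The digitization step can be sketched as a simple mapping from questionnaire answers to the numeric codes defined in Table 2; the sample record below is hypothetical.

```python
# Encoders follow the value coding of Table 2; the sample answer is made up.
ENCODERS = {
    "Gender": {"female": 1, "male": 2},
    "Social_status": {"single": 1, "married": 2},
    "Specialization": {"scientific": 1, "literary": 2, "management": 3},
}

def digitize(record):
    """Replace each categorical questionnaire answer with its numeric code."""
    return {field: ENCODERS[field][value.lower()] for field, value in record.items()}

answer = {"Gender": "Female", "Social_status": "Single", "Specialization": "Scientific"}
print(digitize(answer))  # {'Gender': 1, 'Social_status': 1, 'Specialization': 1}
```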
Data Augmentation

| Class | Before | After |
|---|---|---|
| A | 250 | 305 |
| B | 153 | 306 |
| C | 76 | 304 |
| D | 46 | 306 |
| Total | 526 | 1221 |
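The SMOTE idea behind the counts above — synthesizing new minority-class samples by interpolating between a real sample and a neighbor — can be sketched in a few lines. This is a simplified version: the original algorithm interpolates toward one of the k nearest neighbors, whereas this sketch pairs random minority samples.

```python
import random

def smote_sketch(minority, n_new, rng=random.Random(0)):
    """Generate n_new synthetic samples by linear interpolation between pairs
    of minority-class samples (simplified SMOTE: the partner is a random
    minority sample rather than one of the k nearest neighbors)."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)      # pick two distinct real samples
        gap = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic

# Toy minority class (feature values are made up, not from the dataset).
minority = [(1.0, 2.0), (2.0, 3.0), (1.5, 2.5)]
new = smote_sketch(minority, n_new=5)
print(len(minority) + len(new))  # 8 samples after augmentation
```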
| No | Feature | Correlation Coefficient |
|---|---|---|
| 7 | Acc_place | 0.2983 |
| 8 | Family_income | 0.2076 |
| 9 | BS | 0.2044 |
| 10 | M_live | 0.1849 |
| 11 | F_job | 0.1721 |
| 12 | Acc_type | 0.1651 |
| 13 | M_job | 0.1453 |
| 14 | M_edu | 0.145 |
| 15 | Social_status | 0.1415 |
| 16 | F_edu | 0.1327 |
| 17 | F_age | 0.1297 |
| 18 | FM_Relative | 0.1255 |
| 19 | M_age | 0.1183 |
| 20 | Gender | 0.1041 |
| 21 | Rented_Acc | 0.0961 |
| 22 | F_live | 0.0934 |
| 23 | Specialization | 0.0931 |
| 24 | Rank | 0.087 |
| 25 | Age | 0.068 |
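A correlation-based ranking like the one above can be reproduced with the plain Pearson formula; the toy vectors below are illustrative, not the dataset's values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy encoded feature vs. the Class label (values are made up).
feature = [1, 2, 3, 4]
label = [1, 2, 3, 4]
print(round(pearson(feature, label), 4))  # 1.0
```

Ranking the features then amounts to sorting them by the absolute value of this coefficient against the Class label.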
To evaluate the performance of the proposed models, the following metrics, Equations (3)–(6), have been used [48].
• Accuracy is the number of correctly classified instances divided by the total number of classified instances:

$$Accuracy = \frac{TruePositive + TrueNegative}{TruePositive + TrueNegative + FalsePositive + FalseNegative} \tag{3}$$

• Recall is the percentage of positive instances in the dataset that are properly identified by the model [48]:

$$Recall = \frac{TruePositive}{TruePositive + FalseNegative} \tag{4}$$

• Precision is the proportion of true positive instances among all predicted positive instances [48]:

$$Precision = \frac{TruePositive}{TruePositive + FalsePositive} \tag{5}$$

• F-measure is the harmonic mean of precision and recall [48]:

$$F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{6}$$
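Equations (3)–(6) translate directly into code. The confusion counts below are hypothetical, not taken from the paper's experiments.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision and F-measure from confusion counts,
    following Equations (3)-(6)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f_measure

# Hypothetical counts for illustration.
acc, rec, prec, f1 = metrics(tp=90, tn=85, fp=5, fn=10)
print(f"accuracy={acc:.3f} recall={rec:.3f} precision={prec:.3f} F={f1:.3f}")
```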
[Figures: accuracy versus the numIteration parameter (10–100) and the seed parameter (1–10); accuracy ranged between 93% and 96%.]
4.7.2. J48
With the J48 classifier, experiments were performed to tune the confidenceFactor and minNumObj parameters. The confidenceFactor parameter was set first; the graph in Figure 8 shows the effect of its values on the accuracy. The highest accuracy was obtained at 0.55 and 0.1 for 10-fold cross-validation and the 75:25 split ratio, respectively. Experiments were then applied to set the second parameter, minNumObj. At a value of 7, the highest accuracy was achieved with 10-fold cross-validation, while a value of 10 was optimal for the 75:25 split ratio (Figure 9). Table 7 presents a summary of the optimal parameter values for the classifier.
[Figure 8: accuracy versus the J48 confidenceFactor parameter (0.1–0.7); accuracy ranged between 87% and 92%.]
[Figure 9: accuracy versus the J48 minNumObj parameter (1–10); accuracy ranged between 87% and 93%.]
[Figure: NB accuracy — 96.38% with 10-fold cross-validation and 98.47% with the 75:25 split ratio.]
of the models through the clear positive difference in the accuracy ratios. The accuracy improved by up to 3.27%.
Table 8. Comparison of performance results for dataset before and after using the SMOTE.
| Type of Dataset | RF 10-Fold | RF 75:25 Split | J48 10-Fold | J48 75:25 Split | NB 10-Fold | NB 75:25 Split |
|---|---|---|---|---|---|---|
| Imbalanced dataset | 95.62% | 95.42% | 92.00% | 91.60% | 96.38% | 98.47% |
| Balanced dataset | 98.20% | 98.69% | 94.92% | 97.70% | 97.54% | 98.69% |
Table 9 shows the values obtained after applying the feature selection technique to the whole data set after it was balanced. The result is based on the outcome after applying RF, J48, and NB on the selected features only. The highest accuracy achieved using all the features was 98.69%, using RF and NB with a 75:25 split. Similarly, the highest accuracy achieved using selected features reached 99.34%, using NB with a 75:25 split. This was obtained with fourteen features, both academic and demographic: Class, GS_4, GS_3, GS_1, GS_5, GS_2, GS_6, Acc_place, Family_income, BS, M_live, F_job, Acc_type, and M_job. The rest of the accuracies are listed in the table.
| Features Selected | RF 10-Fold | RF 75:25 Split | J48 10-Fold | J48 75:25 Split | NB 10-Fold | NB 75:25 Split |
|---|---|---|---|---|---|---|
| All | 98.2% | 98.69% | 94.92% | 97.70% | 97.54% | 98.69% |
| Class, GS_4, GS_3, GS_1, GS_5, GS_2, GS_6, Acc_place, Family_income, BS, M_live, F_job, Acc_type, M_job | 97.13% | 97.38% | 94.92% | 96.07% | 97.30% | 99.34% |
| Class, GS_4, GS_3, GS_1, GS_5, GS_2, GS_6 | 96.48% | 97.38% | 95.33% | 97.70% | 96.31% | 98.03% |
| Class, GS_4, GS_3, GS_1 | 93.20% | 94.75% | 92.22% | 94.43% | 94.27% | 96.72% |
| Class, GS_4 | 79.20% | 80.33% | 79.20% | 80.33% | 79.20% | 80.33% |
| Class | 25.06% | 21.97% | 24.65% | 21.97% | 24.65% | 21.97% |
The results are contrasted between all features and selected features. The highest accuracy was obtained with all features in all but two models. The NB model obtained better accuracy with half of the features, while the J48 model obtained its best 10-fold accuracy with the seven highest-correlation features.
Table 10. Comparison of results between 10-fold cross-validation and direct partition.
| Validation Method | RF | J48 | NB |
|---|---|---|---|
| 10-fold cross-validation | 98.2% | 95.33% | 97.54% |
| 75:25 direct partition | 98.69% | 97.70% | 99.34% |
| Metrics | RF 10-Fold | RF 75:25 Split | J48 10-Fold | J48 75:25 Split | NB 10-Fold | NB 75:25 Split |
|---|---|---|---|---|---|---|
| TP rate | 0.982 | 0.987 | 0.953 | 0.977 | 0.975 | 0.993 |
| FP rate | 0.006 | 0.004 | 0.016 | 0.007 | 0.008 | 0.002 |
| Precision | 0.982 | 0.987 | 0.953 | 0.977 | 0.976 | 0.993 |
| Recall | 0.982 | 0.987 | 0.953 | 0.977 | 0.975 | 0.993 |
| F-Measure | 0.982 | 0.987 | 0.953 | 0.977 | 0.975 | 0.993 |
| ROC Area | 0.999 | 1.000 | 0.989 | 0.998 | 0.997 | 1.000 |
| Accuracy | 98.2% | 98.69% | 95.33% | 97.70% | 97.54% | 99.34% |
In Table 11, the precision rate for all classes (A, B, C, and D) for all six models is
recorded. The results show excellent values for most predictive models. The J48 model
had the lowest value of 95% with the 10-fold and the precision improved with the 75:25
split by 97.7%. While the highest value obtained was 99.3% with the NB 75:25 split ratio
model, which is a significant improvement compared to 10-fold, which had a precision
value of 97.5%. In the RF model, the improvement was slight between the 75:25 partition
method and 10-fold, at 98.2% and 98.7%, respectively. Additionally, as shown in Table 11,
the NB model had the highest recall of 99.3% with a 75:25 split. It is followed by the RF
model with a ratio of 98.7%, with a split of 75:25 as well. The recall values scrolled down
for the rest of the models until they reached the lowest value of 95.33% with the J48 model.
In general, all models achieve good recall values for predictive reliability. Since the values
of precision and recall are very close, the results of calculating the F-Measure values for
all classifiers are also close to them as shown in Table 11. The highest value obtained for
F-Measure was for the NB model by 99.3%. The lowest value was for classifier J48 by
95.33%. As with the precision and recall results, the models achieved excellent values with
F-Measure as well. Table 12 also shows confusion matrices for all models. It is clear from
the values shown in Table 12 that the results of confusion matrices are convergent for all
models with both 10-fold and 75:25 splits. The best confusion matrix is for the NB model
with a partition of 75:25. As shown in the NB model confusion matrix, 66 correct instances
were predicted in class A except for one case that was classified as class B. Moreover, with
class B, 76 instances were predicted to be valid except for one instance which was class A.
All instances of classes C and D were all correctly classified.
RF

| Actual | 10-Fold: A | B | C | D | 75:25 Split: A | B | C | D |
|---|---|---|---|---|---|---|---|---|
| A = 1 | 296 | 9 | 0 | 0 | 66 | 1 | 0 | 0 |
| B = 2 | 9 | 296 | 1 | 0 | 3 | 74 | 0 | 0 |
| C = 3 | 0 | 2 | 301 | 1 | 0 | 0 | 80 | 0 |
| D = 4 | 0 | 0 | 0 | 306 | 0 | 0 | 0 | 81 |

J48

| Actual | 10-Fold: A | B | C | D | 75:25 Split: A | B | C | D |
|---|---|---|---|---|---|---|---|---|
| A = 1 | 289 | 26 | 0 | 0 | 65 | 2 | 0 | 0 |
| B = 2 | 26 | 279 | 1 | 0 | 4 | 73 | 0 | 0 |
| C = 3 | 0 | 3 | 300 | 1 | 0 | 1 | 79 | 0 |
| D = 4 | 0 | 0 | 0 | 306 | 0 | 0 | 0 | 81 |

NB

| Actual | 10-Fold: A | B | C | D | 75:25 Split: A | B | C | D |
|---|---|---|---|---|---|---|---|---|
| A = 1 | 298 | 7 | 0 | 0 | 66 | 1 | 0 | 0 |
| B = 2 | 14 | 292 | 0 | 0 | 1 | 76 | 0 | 0 |
| C = 3 | 0 | 7 | 295 | 2 | 0 | 0 | 80 | 0 |
| D = 4 | 0 | 0 | 0 | 306 | 0 | 0 | 0 | 81 |
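As a check, the overall accuracy can be recovered from a confusion matrix by dividing its diagonal sum by the total number of instances; applying this to the NB 75:25 matrix above reproduces the reported 99.34%.

```python
def accuracy_from_confusion(matrix):
    """Diagonal sum (correct predictions) divided by the total instance count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# NB confusion matrix with the 75:25 split (classes A-D), from Table 12.
nb_75 = [
    [66, 1, 0, 0],
    [1, 76, 0, 0],
    [0, 0, 80, 0],
    [0, 0, 0, 81],
]
print(round(accuracy_from_confusion(nb_75) * 100, 2))  # 99.34
```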
The ROC curve figure displays the performance of the classification models at all
classification thresholds. The curve shows the model’s ability to categorize specific cases
into target groups [49]. All classifier curves are listed in Figures 12–17. All the curves
shown are for Class A classification. The values of the rest of the curves of the categories
(B, C, and D) are similar in each model. The ROC curve with classifiers RF and NB had a
value of 1 which is the highest with a 75:25 split. In addition, the J48 model received a
value of 0.99. These values are considered excellent to the extent that the models are con-
sidered reliable. The values of the curves are shown in Table 11.
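The ROC area reported in Table 11 can be read as the probability that a positive instance is scored above a negative one, which can be computed directly from predicted scores. The scores below are hypothetical; a perfect ranking, as the RF and NB models approached, gives an AUC of 1.0.

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly,
    counting ties as half-correct."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.95, 0.90, 0.30, 0.10]   # hypothetical class-A probabilities
labels = [1, 1, 0, 0]               # 1 = class A, 0 = any other class
print(auc(scores, labels))  # 1.0
```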
From the previous results, it was noted that the model developed using the NB algo-
rithm is the best-performing model. The NB model obtained the highest performance with
the experiments on the full data set. The NB model with the 75:25 split method using the
14 most correlated features obtained an accuracy of 99.34%.
For the whole dataset, the best factors affecting students’ academic achievement were
GS_4 (Grade in semester 4), GS_3 (Grade in semester 3), GS_1 (Grade in semester 1), GS_5
(Grade in semester 5), GS_2 (Grade in semester 2), GS_6 (Grade in semester 6), (Accom-
modation place) Acc_place, (Family income) Family_income, BS (The number of brothers
and sisters), M_live (Does the mother live with the family?), F_job (Father’s job), M_job
(Mother’s job) and Acc_type (Accommodation type). It is depicted in Figure 18 in the form
of a word cloud. All models converged in performance accuracy, but the NB and RF models had the highest performance with a 75:25 split, at 99.34% and 98.69%, respectively. The school administration may take the opportunity to focus counseling and additional care on students who fall behind in the mentioned set of factors.
6. Conclusions
Data mining is a technique that has served many fields and helped discover hidden insights. Educational data mining is one of its best-known branches, with many studies and applications. Mining educational data can help improve students’ academic performance by enabling precautionary measures, especially for students at risk. Education is a societal pillar, and its quality supports the power of nations. Therefore, education is one of the first areas on which the Kingdom of Saudi Arabia is focusing its efforts. In the literature review, studies of educational data at different levels were presented, including applications on educational data for secondary schools, universities, and the master’s level. The literature review also revealed the study gap and indicated the need for further studies in this field.
The aim of this research was to use the data mining mechanism to reveal hidden
patterns in the database of high school students to improve their academic level through
several academic and demographic factors. The database was collected through a ques-
tionnaire targeting students who recently graduated from secondary schools in the Al-
Baha region. In this study, three classifiers were applied to the database: Random Forest,
Naïve Bayes, and J48. In addition, the experiments were applied using both cross-valida-
tion and partitioning methods. The performance of each model was discussed, and a brief
presentation of the performances was presented. The performance has also been im-
proved by using Data Augmentation and Feature Extraction technologies. The Naïve
Bayes models outperformed the rest of the models with an accuracy of 99.34%. The per-
formance of the models was checked by different metrics. The metrics included accuracy,
precision, recall, F-Measure, specificity, and ROC curve. Further, the research has sum-
marized the most dominant factors affecting students’ success at the secondary school
level. The factors are GS_4 (Grade in semester 4), GS_3 (Grade in semester 3), GS_1 (Grade
in semester 1), GS_5 (Grade in semester 5), GS_2 (Grade in semester 2), GS_6 (Grade in
semester 6), (Accommodation place) Acc_place, (Family income) Family_income, BS (The
number of brothers and sisters), M_live (Does the mother live with the family?), F_job
(Father’s job), M_job (Mother’s job), and Acc_type (Accommodation type). Based on these
factors, the school administration may arrange counseling and guidance to the related
students, and it will help improve their success rate, especially in the KSA suburbs such
as the Al-Baha region. This study recommends conducting more research on the same topic with a larger amount of data in the future, in addition to expanding the study to include other levels of education, including primary school and middle school students. Other machine learning and deep learning approaches, such as transfer learning [50–52] and fused and hybrid models [53,54], may be investigated to further fine-tune the prediction results.
References
1. Grossman, P. Teaching Core Practices in Teacher Education; Harvard Education Press: Cambridge, MA, USA, 2018.
2. Quinn, M.A.; Rubb, S.D. The importance of education-occupation matching in migration decisions. Demography 2005, 42, 153–
167.
3. Education in Saudi Arabia. Available online: https://en.wikipedia.org/wiki/Education_in_Saudi_Arabia (accessed on 30 January 2022).
4. Smale-Jacobse, A.E.; Meijer, A.; Helms-Lorenz, M.; Maulana, R. Differentiated Instruction in Secondary Education: A Systematic
Review of Research Evidence. Front. Psychol. 2019, 10, 2366. https://doi.org/10.3389/fpsyg.2019.02366
5. Mosa, M.A. Analyze students’ academic performance using machine learning techniques. J. King Abdulaziz Univ. Comput. Inf.
Technol. Sci. 2021, 10, 97–121.
6. Aggarwal, V.B.; Bhatnagar, V.; Kumar, D. (Eds.) Big Data Analytics; Advances in Intelligent Systems and Computing 654; Springer: Cham, Switzerland, 2015.
7. Han, J.; Kamber, M.; Pei, J. Data Mining, 3rd ed.; Elsevier Science & Technology: Amsterdam, The Netherlands, 2012.
8. Mathew, S.; Abraham, J.T.; Kalayathankal, S.J. Data mining techniques and methodologies. Int. J. Civ. Eng. Technol. 2018, 9, 246–252.
9. Jackson, J. Data Mining; A Conceptual Overview. Commun. Assoc. Inf. Syst. 2002, 8, 19.
10. Yoon, S.; Taha, B.; Bakken, S. Using a data mining approach to discover behavior correlates of chronic disease: A case study of
depression. Stud. Health Technol. Inform. 2014, 201, 71–78.
11. Mamatha Bai, B.G.; Nalini, B.M.; Majumdar, J. Analysis and Detection of Diabetes Using Data Mining Techniques—A Big Data
Application in Health Care; Springer: Singapore.
12. Othman, M.S.; Kumaran, S.R.; Yusuf, L.M. Data Mining Approaches in Business Intelligence: Postgraduate Data Analytic. J.
Teknol. 2016, 78, 75–79. https://doi.org/10.11113/jt.v78.9544.
13. Kokotsaki, D.; Menzies, V.; Wiggins, A. Durham Research Online Woodlands. Crit. Stud. Secur. 2014, 2, 210–222.
14. Athani, S.S.; Kodli, S.A.; Banavasi, M.N.; Hiremath, P.G.S. Predictor using Data Mining Techniques. Int. Conf. Res. Innov. Inf.
Syst. ICRIIS 2017, 1, 170–174.
15. Salal, Y.K.; Abdullaev, S.M.; Kumar, M. Educational data mining: Student performance prediction in academic. Int. J. Eng. Adv.
Technol. 2019, 8, 54–59.
16. Yağci, A.; Çevik, M. Prediction of academic achievements of vocational and technical high school (VTS) students in science
courses through artificial neural networks (comparison of Turkey and Malaysia). Educ. Inf. Technol. 2019, 24, 2741–2761.
https://doi.org/10.1007/s10639-019-09885-4.
17. Rebai, S.; Ben Yahia, F.; Essid, H. A graphically based machine learning approach to predict secondary schools performance in
Tunisia. Socio-Economic Plan. Sci. 2020, 70, 100724. https://doi.org/10.1016/j.seps.2019.06.009.
18. Sokkhey, P.; Okazaki, T. Hybrid Machine Learning Algorithms for Predicting Academic Performance. Int. J. Adv. Comput. Sci.
Appl. 2020, 11, 32–41.
19. Adekitan, A.I.; Noma-Osaghae, E. Data mining approach to predicting the performance of first year student in a university
using the admission requirements. Educ. Inf. Technol. 2019, 24, 1527–1543.
20. Alhassan, A.M. Using Data Mining Techniques to Predict Students' Academic Performance. Master's Thesis, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia, 2020.
21. Alyahyan, E.; Dusteaor, D. Decision Trees for Very Early Prediction of Student's Achievement. In Proceedings of the 2020 2nd
International Conference on Computer and Information Sciences (ICCIS), Sakaka, Saudi Arabia, 13–15 October 2020.
22. Pal, V.K.; Bhatt, V.K.K. Performance prediction for post graduate students using artificial neural network. Int. J. Innov. Technol.
Explor. Eng. 2019, 8, 446–454.
23. Lin, A.; Wu, Q.; Heidari, A.A.; Xu, Y.; Chen, H.; Geng, W.; Li, Y.; Li, C. Predicting Intentions of Students for Master Programs
Using a Chaos-Induced Sine Cosine-Based Fuzzy K-Nearest Neighbor Classifier. IEEE Access 2019, 7, 67235–67248.
24. Sánchez, A.; Vidal-Silva, C.; Mancilla, G.; Tupac-Yupanqui, M.; Rubio, J.M. Sustainable e-Learning by Data Mining—Successful
Results in a Chilean University. Sustainability 2023, 15, 895. https://doi.org/10.3390/su15020895.
25. Yağcı, M. Educational data mining: Prediction of students' academic performance using machine learning algorithms. Smart
Learn. Environ. 2022, 9, 11. https://doi.org/10.1186/s40561-022-00192-z.
26. Hu, C.; Chen, Y.; Hu, L.; Peng, X. A novel random forests based class incremental learning method for activity recognition.
Pattern Recognit. 2018, 78, 277–290.
27. Pavlov, Y.L. Random Forests; De Gruyter: Zeist, The Netherlands, 2019; pp. 1–122.
28. Paul, A.; Mukherjee, D.P.; Das, P.; Gangopadhyay, A.; Chintha, A.R.; Kundu, S. Improved Random Forest for Classification.
IEEE Trans. Image Process. 2018, 27, 4012–4024.
29. Dietterich, T.G. Ensemble Methods in Machine Learning. In Proceedings of the International Workshop on Multiple Classifier
Systems, Cagliari, Italy, 9–11 June 2000; pp. 1–15.
30. Luo, C.; Wang, Z.; Wang, S.; Zhang, J.; Yu, J. Locating Facial Landmarks Using Probabilistic Random Forest. IEEE Signal Process.
Lett. 2015, 22, 2324–2328.
31. Gall, J.; Lempitsky, V. Decision Forests for Computer Vision and Medical Image Analysis; Springer Science & Business Media: Berlin,
Germany, 2013.
32. Paul, A.; Mukherjee, D.P. Reinforced quasi-random forest. Pattern Recognit. 2019, 94, 13–24.
33. Gholap, J. Performance Tuning of J48 Algorithm for Prediction of Soil Fertility. arXiv 2012. https://doi.org/10.48550/arXiv.1208.3943.
34. Christopher, A.B.A.; Balamurugan, S.A.A. Prediction of warning level in aircraft accidents using data mining techniques. Aeronaut. J. 2014, 118, 935–952.
35. Aljawarneh, S.; Yassein, M.B.; Aljundi, M. An enhanced J48 classification algorithm for the anomaly intrusion detection systems.
Clust. Comput. 2019, 22, 10549–10565.
36. Lewis, D.D. Naive (Bayes) at forty: The independence assumption in information retrieval. In Machine Learning: ECML-98. ECML
1998. Lecture Notes in Computer Science; Nédellec, C., Rouveirol, C., Eds.; Springer: Berlin/Heidelberg, Germany, 1998; Volume 1398.
https://doi.org/10.1007/BFb0026666.
37. John, G.H.; Langley, P. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference
on Uncertainty in Artificial Intelligence (UAI’95), Montreal, QC, Canada, 18–20 August 1995; Morgan Kaufmann Publishers
Inc.: San Francisco, CA, USA, 1995; pp. 338–345.
38. Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I.H. The WEKA data mining software: An update. ACM
SIGKDD Explor. Newsl. 2009, 11, 10–18.
39. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random Erasing Data Augmentation. AAAI 2020, 34, 13001–13008.
40. Al-Azani, S.; El-Alfy, E.S.M. Using Word Embedding and Ensemble Learning for Highly Imbalanced Data Sentiment Analysis
in Short Arabic Text. Procedia Comput. Sci. 2017, 109, 359–366.
41. Kumar, V. Feature Selection: A Literature Review. Smart Comput. Rev. 2014, 4. https://doi.org/10.6029/smartcr.2014.03.007.
42. Samuels, P.; Gilchrist, M. Pearson Correlation; Stats Tutor, a Community Project, 2014. Available online: https://www.statstutor.ac.uk/resources/uploaded/pearsoncorrelation3.pdf (accessed on 21 July 2021).
43. Doshi, M.; Chaturvedi, S.K. Correlation Based Feature Selection (CFS) Technique to Predict Student Performance. Int. J. Comput.
Networks Commun. 2014, 6, 197–206.
44. Rahman, A.; Sultan, K.; Aldhafferi, N.; Alqahtani, A. Educational data mining for enhanced teaching and learning. J. Theor.
Appl. Inf. Technol. 2018, 96, 4417–4427.
45. Rahman, A.; Dash, S. Data Mining for Student’s Trends Analysis Using Apriori Algorithm. Int. J. Control Theory Appl. 2017, 10,
107–115.
46. Rahman, A.; Dash, S. Big Data Analysis for Teacher Recommendation using Data Mining Techniques. Int. J. Control Theory Appl.
2017, 10, 95–105.
47. Zaman, G.; Mahdin, H.; Hussain, K.; Rahman, A.U.; Abawajy, J.; Mostafa, S.A. An Ontological Framework for Information
Extraction From Diverse Scientific Sources. IEEE Access 2021, 9, 42111–42124.
48. Alqarni, A.; Rahman, A. Arabic Tweets-Based Sentiment Analysis to Investigate the Impact of COVID-19 in KSA: A Deep
Learning Approach. Big Data Cogn. Comput. 2023, 7, 16. https://doi.org/10.3390/bdcc7010016.
49. Basheer Ahmed, M.I.; Zaghdoud, R.; Ahmed, M.S.; Sendi, R.; Alsharif, S.; Alabdulkarim, J.; Albin Saad, B.A.; Alsabt, R.; Rahman,
A.; Krishnasamy, G. A Real-Time Computer Vision Based Approach to Detection and Classification of Traffic Incidents. Big
Data Cogn. Comput. 2023, 7, 22. https://doi.org/10.3390/bdcc7010022.
50. Nasir, M.U.; Khan, S.; Mehmood, S.; Khan, M.A.; Rahman, A.-U.; Hwang, S.O. IoMT-Based Osteosarcoma Cancer Detection in
Histopathology Images Using Transfer Learning Empowered with Blockchain, Fog Computing, and Edge Computing. Sensors
2022, 22, 5444. https://doi.org/10.3390/s22145444.
51. Nasir, M.U.; Zubair, M.; Ghazal, T.M.; Khan, M.F.; Ahmad, M.; Rahman, A.-U.; Al Hamadi, H.; Khan, M.A.; Mansoor, W. Kidney
Cancer Prediction Empowered with Blockchain Security Using Transfer Learning. Sensors 2022, 22, 7483.
https://doi.org/10.3390/s22197483.
52. Rahman, A.-U.; Alqahtani, A.; Aldhafferi, N.; Nasir, M.U.; Khan, M.F.; Khan, M.A.; Mosavi, A. Histopathologic Oral Cancer
Prediction Using Oral Squamous Cell Carcinoma Biopsy Empowered with Transfer Learning. Sensors 2022, 22, 3833.
https://doi.org/10.3390/s22103833.
53. Farooq, M.S.; Abbas, S.; Rahman, A.U.; Sultan, K.; Khan, M.A.; Mosavi, A. A Fused Machine Learning Approach for Intrusion
Detection System. Comput. Mater. Contin. 2023, 74, 2607–2623.
54. Rahman, A.U.; Dash, S.; Luhach, A.K.; Chilamkurti, N.; Baek, S.; Nam, Y. A Neuro-fuzzy approach for user behaviour classification and prediction. J. Cloud Comput. 2019, 8, 17. https://doi.org/10.1186/s13677-019-0144-9.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual au-
thor(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.