Neural Computing and Applications



A feature weighted support vector machine and artificial neural network algorithm for academic course performance prediction
Chenxi Huang1 • Junsheng Zhou1 • Jinling Chen1 • Jane Yang2 • Kathy Clawson3 • Yonghong Peng4

Received: 8 September 2020 / Accepted: 25 March 2021

© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2021

Academic performance, a globally understood metric, is utilized worldwide across disparate teaching and learning environments and is regarded as a quantifiable indicator of learning gain. The ability to reliably estimate students’ academic performance is important and can help academic staff improve the provision of support. However, academic performance estimation is non-trivial and is affected by multiple factors, including a student’s engagement with learning activities and their social, geographic, and demographic characteristics. This paper investigates the opportunity to develop reliable models for predicting student performance using Artificial Intelligence. Specifically, we propose a two-step academic performance prediction method using a feature weighted support vector machine (SVM) and artificial neural network (ANN) learning. A feature weighted SVM, in which the importance of each feature to the outcome is calculated using information gain ratios, performs coarse-grained binary classification (pass, P1, or fail, P0). Subsequently, detailed score levels are divided from D to A+, and ANN learning is employed for fine-grained, multi-class training of the P1 and P0 classes separately. The experiments and our subsequent ablation study, conducted on student datasets from two Portuguese secondary schools, demonstrate the effectiveness of this hybridized method.

Keywords Artificial intelligence · Academic performance analytics · Feature weighted SVM (FWSVM) · Information gain ratio · ANN

✉ Chenxi Huang
✉ Yonghong Peng

1 School of Informatics, Xiamen University, Xiamen 361005,
2 University of California, San Diego, La Jolla, CA 92093,
3 School of Computer Science, University of Sunderland, Sunderland SR6 0DD, UK
4 Department of Computing and Mathematics, Manchester Metropolitan University, Manchester M1 5GD, UK

1 Introduction

The contribution of Artificial Intelligence (AI) in education is significant and attracting increasing attention. Academic performance may be defined as a quantifiable evaluation of students’ learning gain and is of vital importance, not only as a measure of student success and progression but also as an approach for identifying at-risk students and enabling timely remedial interventions [1]. The potential benefits of automated performance prediction (for example enhanced student monitoring, timely student intervention, enhanced support, and diminished rates of attrition) span primary, secondary, further, and higher education contexts. In China, the promotion rate of junior secondary school graduates reached 94.9% in 2017 [2], while regular senior secondary school students made up only 59.8% of all students in senior secondary schools [3]. To help enrolled students understand their learning situation and to assist teachers in identifying at-risk individuals, academic course performance prediction is a necessary and valuable tool.

Traditional machine learning approaches applied in educational contexts include support vector machine (SVM) [4, 5], decision tree [6], matrix factorization (MF) [7, 8], and their extended models [9, 10]. Among these methods, the SVM algorithm is frequently applied and achieves superior performance [4, 11]. However, it was originally designed to solve binary classification problems and does not consider the relative importance of feature vector elements. For multi-class problems, the algorithm can be extended as a superposition of multiple binary SVM models, but this does not always perform well. As a rapidly developing method, deep learning [12–16] has also shown good performance in the field of education. However, overfitting can become an issue for fully trained deep learning models if the dataset is not sufficiently large. It has been demonstrated that a relatively shallow Artificial Neural Network (ANN) [17] is good enough in terms of accuracy.

In this paper, we combine the SVM and ANN to propose a novel two-step prediction method for academic course performance. We take the weight of different features into consideration by extending the SVM model to a feature weighted SVM (FWSVM). Specifically, the information gain ratio (IGR) [18] is chosen for feature weight calculation, and features of low influence are dismissed before training.

Our SVM-ANN hybrid is implemented as follows. Firstly, the IGR-FWSVM is used to perform coarse-grained binary classification, and each feature vector is categorized as a pass (P1) or fail (P0). Subsequently, detailed levels of scores are divided from D to A+, and ANN is employed for further fine-grained training of the P1 class and the P0 class separately. The experiments and our subsequent ablation study, which are conducted on the student datasets [19] from two Portuguese secondary schools, demonstrate the effectiveness of this hybridized method.

The contributions of our work are summarized as follows: (1) The IGR-FWSVM and the ANN algorithm are adopted separately in the two proposed prediction steps, so that the fine-grained level prediction can build on the preceding pass/fail result. (2) The information gain ratio (IGR) is utilized for feature weight calculation, so that we obtain an optimized SVM model that considers the influence of different features on the outcome.

The remainder of this paper is organized as follows: Sect. 2 analyses the related methods applied in educational analysis and prediction. Section 3 establishes the theory of IGR-FWSVM and introduces the structure of ANN. Our experimental procedure is explained in Sect. 4. Section 5 presents the process of feature selection and analyses the result of our ablation study. Conclusions are offered in Sect. 6.

2 Related work

Existing methods for student achievement prediction may be summarized and categorized as (1) traditional machine learning methods and (2) deep learning methods. Traditional machine learning approaches include support vector machines (SVM) [4, 5], decision trees [6], methods utilizing matrix factorization (MF) [7, 8], and their extended models [9, 10]. Deep learning methods include graph convolutional networks (GNN) [13], fully connected feed-forward networks [14], and long short-term memory networks (LSTM) [16].

2.1 Traditional machine learning

Kloft et al. [4] employed support vector machines (SVM) [5] for weekly dropout prediction in massive open online courses (MOOCs) by analysing features such as students’ historical clickstream data. Similarly, Chen et al. [6] integrated decision tree modelling with an extreme learning machine as a novel hybrid approach to predict student dropout. Sweeney et al. [7, 8] applied matrix factorization (MF)-based methods to predict next-term grades. As one of the extended MF-based models, the Cumulative Knowledge-based Regression Model (CK) proposed by Morsy et al. [9] linked knowledge acquired from preparatory courses with the requirements of a target course in the form of matrix vectors. Abouelmagd et al. [20] proposed a new method to improve the utilization rate of data. Ren et al. [10] then also took the influence of co-taken courses into consideration.

2.2 Deep learning

Arsad et al. [12] used an Artificial Neural Network and Linear Regression to predict the academic performance of students. Hu et al. [13] estimated students’ academic performance using attention-based graph convolutional networks (GNN). According to the course graph of a given student, a personalized graph classification could be obtained by their GNN model. Liu et al. [21] proposed to use Markovian jumping parameters and mixed time-delays to solve the exponential stability problem of Cohen–Grossberg neural networks. Du et al. [22] studied a generalized discrete neutral-type neural network with time-varying delays. Whitehill et al. [14] adopted a fully connected feed-forward neural network to predict the dropout rate. Feng et al. [15] constructed a context-aware feature interaction network in which context information of both courses and students was taken into account. Fei et al. [16] proposed a temporal prediction model for MOOC dropout by handling the extracted features of students’ behaviour with LSTM.


3 Methodology

We propose a two-step classification architecture, from coarse-grained to fine-grained, in which an information gain ratio based feature weighted SVM (IGR-FWSVM) and an artificial neural network (ANN) [17] are adopted in the first and second stage, respectively. Firstly, we use IGR-FWSVM to train a prediction model that separates pass (P1) from fail (P0). Then, detailed levels are divided from D to A+ according to the score (0 to 20), and ANN learning is employed for multi-class modelling of the P1 class and the P0 class separately. Thus, the advantage of SVM as a dichotomous model is fully utilized, and its limitation on multi-class problems is compensated for by the ANN.

3.1 Feature weighted SVM (FWSVM)

The classical SVM algorithm is designed for binary classification through searching for the maximum margin hyperplane (MMH). However, it ignores the importance of individual features to the outcome. To more fully reflect the nature of our data, specifically that different student attributes have different relative importance in the real world, we take the weight of features into account and extend the model to FWSVM. The derivation of FWSVM is as follows.

Given a training dataset $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{in}) \in X = \mathbb{R}^n$ $(i = 1, 2, \ldots, N)$ is the $i$th input tuple with $n$ features and the class label $y_i \in Y = \{-1, 1\}$ $(i = 1, 2, \ldots, N)$ represents the corresponding category, we need to find the hyperplane $w \cdot \phi(xP) + b = 0$ that separates the training space into two classes. In this expression, the input vector $x$ is multiplied by the $n \times n$ feature weighted matrix $P$. The feature mapping function $\phi(\cdot)$ then transforms the weighted input from $\mathbb{R}^n$ to a higher dimensional space $\mathbb{R}^m$ so that a linear separating hyperplane becomes available.

The classifier must satisfy the following inequality:

$$y_i\,(w \cdot \phi(x_i P) + b) \ge 1 - \xi_i, \quad (\xi_i \ge 0,\ i = 1, 2, \ldots, N), \tag{1}$$

where the slack variable $\xi_i$ deals with the noise from outliers and is added each time an error occurs, and where the upper bound of the number of classification errors is defined as $\sum_{i=1}^{N} \xi_i$.

The objective optimization problem is formulated as follows:

$$\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i, \tag{2}$$

$$\text{s.t.}\ \begin{cases} y_i\,(w \cdot \phi(x_i P) + b) \ge 1 - \xi_i \\ \xi_i \ge 0 \end{cases}, \quad (i = 1, 2, \ldots, N), \tag{3}$$

where $C$ represents a positive constant. Equation (2) serves two purposes: (a) finding the smallest value of $\frac{1}{2}\|w\|^2$, which maximizes the margin; (b) minimizing $\sum_{i=1}^{N} \xi_i$ to reduce classification mistakes.

The corresponding Lagrangian function is then constructed by introducing non-negative Lagrangian multipliers $\alpha_i$ and $\mu_i$:

$$L_p(w, b, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left[ y_i\,(w \cdot \phi(x_i P) + b) - 1 + \xi_i \right] - \sum_{i=1}^{N} \mu_i \xi_i. \tag{4}$$

In this way, we can transform the optimization problem into its dual problem and calculate the Karush–Kuhn–Tucker (KKT) conditions [23] of the original problem:

$$\frac{\partial L_p(w, b, \xi, \alpha, \mu)}{\partial w} = w - \sum_{i=1}^{N} \alpha_i y_i \phi(x_i P) = 0 \tag{5}$$

$$\frac{\partial L_p(w, b, \xi, \alpha, \mu)}{\partial b} = -\sum_{i=1}^{N} \alpha_i y_i = 0 \tag{6}$$

$$\frac{\partial L_p(w, b, \xi, \alpha, \mu)}{\partial \xi_i} = C - \alpha_i - \mu_i = 0 \tag{7}$$

$$\alpha_i \ge 0, \quad \mu_i \ge 0, \tag{8}$$

$$\alpha_i \left[ y_i\,(w \cdot \phi(x_i P) + b) - 1 + \xi_i \right] = 0 \tag{9}$$

$$\mu_i \xi_i = 0 \tag{10}$$

$$y_i\,(w \cdot \phi(x_i P) + b) \ge 1 - \xi_i, \quad (\xi_i \ge 0,\ i = 1, 2, \ldots, N) \tag{11}$$

By using the equivalence relation

$$w = \sum_{i=1}^{N} \alpha_i y_i \phi(x_i P), \tag{12}$$

which is derived from Eq. (5), the expression for the dual problem can be simplified as follows:

$$\max_{\alpha} \left( \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, \phi(x_i P) \cdot \phi(x_j P) \right), \quad \text{s.t.}\ \begin{cases} \sum_{i=1}^{N} \alpha_i y_i = 0 \\ 0 \le \alpha_i \le C \end{cases}, \quad (i = 1, 2, \ldots, N) \tag{13}$$

According to the KKT complementarity conditions, any training sample $(x_j, y_j)$ that satisfies $0 < \alpha_j < C$ can be used to compute $b$:


$$b = y_j - w \cdot \phi(x_j P). \tag{14}$$

Given that multiple values of $b$ may exist, we take the average of all results for $b$ as a reasonable value:

$$b = \frac{1}{N_b} \sum_{0 < \alpha_j < C} \left[ y_j - \sum_{i=1}^{N} \alpha_i y_i\, \phi(x_i P) \cdot \phi(x_j P) \right], \tag{15}$$

where $N_b$ means the number of support vectors.

Given a new data sample $x$, the classification decision function is formulated as follows:

$$f(x) = \operatorname{sign}(w \cdot \phi(xP) + b). \tag{16}$$

By substituting the expressions for $w$ and $b$ in Eqs. (12) and (15), the classification decision function is subsequently derived as:

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i \left( \phi(x_i P) \cdot \phi(xP) \right) + \frac{1}{N_b} \sum_{0 < \alpha_j < C} \left[ y_j - \sum_{i=1}^{N} \alpha_i y_i\, \phi(x_i P) \cdot \phi(x_j P) \right] \right). \tag{17}$$

To simplify the expression, we then define a feature weighted kernel function $K_p(x_i, x_j)$:

$$K_p(x_i, x_j) = \phi(x_i P) \cdot \phi(x_j P) = K(x_i P, x_j P), \tag{18}$$

so the non-linear classification decision function is

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K_p(x_i, x) + \frac{1}{N_b} \sum_{0 < \alpha_j < C} \left[ y_j - \sum_{i=1}^{N} \alpha_i y_i K_p(x_i, x_j) \right] \right). \tag{19}$$

Here are three typical kernels with feature weights:

(a) Linear kernel:

$$K_p(x_i, x_j) = (x_i P) \cdot (x_j P) = x_i P P^T x_j^T, \tag{20}$$

(b) Polynomial kernel:

$$K_p(x_i, x_j) = \left( \gamma\, (x_i P) \cdot (x_j P) + r \right)^d = \left( \gamma\, x_i P P^T x_j^T + r \right)^d, \quad (\gamma > 0), \tag{21}$$

(c) Gaussian kernel:

$$K_p(x_i, x_j) = \exp\left( -\gamma\, \frac{\|x_i P - x_j P\|^2}{2\delta^2} \right) = \exp\left( -\gamma\, \frac{(x_i - x_j) P P^T (x_i - x_j)^T}{2\delta^2} \right), \quad (\gamma > 0). \tag{22}$$

3.2 Information gain ratio based feature weighted SVM (IGR-FWSVM)

The FWSVM algorithm is an improved method based on SVM with feature weighting. Therefore, it is of vital importance to determine the principle of calculating characteristic correlation. There have been some commonly used baselines such as variance [24, 25], chi-square [26], correlation coefficient [11], and information gain [27, 28]. Compared with the first three simple operations, we propose that information gain will perform better because it is calculated using information entropy, and the internal relation between features and the target is more deeply reflected. However, if the information gain is directly used as an evaluation standard, features with more values (which may or may not be influential to the outcomes) will be preferred. To circumvent this problem, we utilize a feature’s information gain ratio (IGR) [18].

Given a training set $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ and the $i$th input tuple $x_i = (x_{i1}, x_{i2}, \ldots, x_{in}) \in X = \mathbb{R}^n$ $(i = 1, 2, \ldots, N)$, with corresponding class label $y_i \in Y = \{Y_{-1}, Y_1\} = \{-1, 1\}$ $(i = 1, 2, \ldots, N)$, let $Y_{\{T,k\}}$ represent the subset of $T$ which belongs to class $Y_k$ $(k \in \{-1, 1\})$. It is clear that $Y_{\{T,-1\}} \cup Y_{\{T,1\}} = T$ and $Y_{\{T,-1\}} \cap Y_{\{T,1\}} = \emptyset$. Furthermore, for a given dataset $D$, we define $|D|$ as its size, so $|T|$ and $|Y_{\{T,k\}}|$ represent the size of sample set $T$ and $Y_{\{T,k\}}$, respectively. The probability that a sample belongs to class $Y_k$ can be denoted as $p_k = \frac{|Y_{\{T,k\}}|}{|T|}$, and the entropy of $T$ can be expressed using Eq. (23):

$$\mathrm{Ent}(T) = -\sum_{k \in \{-1, 1\}} p_k \log_2 p_k = -\sum_{k \in \{-1, 1\}} \frac{|Y_{\{T,k\}}|}{|T|} \log_2 \frac{|Y_{\{T,k\}}|}{|T|}. \tag{23}$$

For a certain feature $F$, suppose that the training set $T$ is correspondingly split into subsets $T_1, T_2, \ldots, T_p, \ldots, T_s$ $(p = 1, 2, \ldots, s)$. We can derive the entropy after partition by calculating the weighted sum of all subsets’ entropy:

$$\mathrm{Ent}(T|F) = \sum_{p=1}^{s} \frac{|T_p|}{|T|}\, \mathrm{Ent}(T_p) = -\sum_{p=1}^{s} \frac{|T_p|}{|T|} \sum_{k \in \{-1, 1\}} \frac{|Y_{\{T_p,k\}}|}{|T_p|} \log_2 \frac{|Y_{\{T_p,k\}}|}{|T_p|}. \tag{24}$$

Then we can obtain the information gain $IG(T, F)$ simply by subtracting the new entropy $\mathrm{Ent}(T|F)$ from the original $\mathrm{Ent}(T)$:


$$IG(T, F) = \mathrm{Ent}(T) - \mathrm{Ent}(T|F). \tag{25}$$

To determine the final information gain ratio, we calculate the entropy obtained by taking feature $F$ as the random variable:

$$\mathrm{Ent}_F(T) = -\sum_{p=1}^{s} \frac{|T_p|}{|T|} \log_2 \frac{|T_p|}{|T|}. \tag{26}$$

We then introduce a penalty parameter $M$, which is the reciprocal of $\mathrm{Ent}_F(T)$:

$$M = \frac{1}{\mathrm{Ent}_F(T)} = \frac{1}{-\sum_{p=1}^{s} \frac{|T_p|}{|T|} \log_2 \frac{|T_p|}{|T|}}. \tag{27}$$

The information gain ratio $IGR(T, F)$ can then be given by the following formula:

$$IGR(T, F) = M \times IG(T, F) = \frac{IG(T, F)}{\mathrm{Ent}_F(T)}. \tag{28}$$

If the value of $s$ is bigger, which means that feature $F$ divides $T$ into more subsets $T_p$ $(p = 1, 2, \ldots, s)$, the value of $\mathrm{Ent}_F(T)$ will be bigger, and the penalty parameter $M$ will be smaller (and vice versa). By multiplying the information gain $IG(T, F)$ by the penalty parameter $M$, features with more values will not be overly favoured, and those with fewer values can be fairly treated.

The information gain ratio $IGR(T, F_q)$, which indicates the weight of the $q$th feature $F_q$, is used to measure its importance and contribution to the target prediction. The greater the information gain ratio is, the more influential the corresponding feature is. Consequently, the feature weighted matrix $P$ is constructed as an $n \times n$ diagonal matrix:

$$P = \begin{pmatrix} IGR(T, F_1) & 0 & \cdots & 0 \\ 0 & IGR(T, F_2) & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & IGR(T, F_n) \end{pmatrix}. \tag{29}$$

As we can see, each feature has only one weight value, and there is no interaction between features.

3.3 Artificial neural network (ANN)

After coarse-grained binary classification, we regard student achievement prediction as a multi-class classification problem. The SVM algorithm performs well on two-class classification problems and can be extended to model multiple (> 2) classes, but it is limited in this context because the extension is essentially a superposition of multiple binary classification models. Other commonly adopted approaches for solving the multiple classification problem include decision trees [29, 30] and K-nearest neighbour (KNN) [31, 32] classification. In contrast to existing techniques such as [29–32], we adopt a simple artificial neural network (ANN) [17], trained as a supervised classifier, to deal with the second classification task. The ANN architecture, composed of an input layer, one or more hidden layers, and an output layer, has been widely applied to many decision modelling and forecasting problems. ANNs have a number of benefits: they are non-linear in nature and may be applied to parametric and non-parametric data. Our ANN architecture comprises two hidden layers and is illustrated in Fig. 1.

To determine the parameters of the ANN model, we conduct a series of experiments and find that the model achieves the best training accuracy when there are two hidden layers with 5 and 4 neurons, respectively, as shown in Fig. 1. If the model goes deeper with more connection layers, it overfits, because a deep neural network (DNN) [33] is not well suited to our dataset, which is not large enough. On the other hand, a model with only one hidden layer fails to make full use of the features and to train effectively, which also reduces the prediction accuracy. As for the setting of hidden units in the two layers, we decide on the following empirical formula:

$$h = \sqrt{S_{input} + S_{output}} + a, \tag{30}$$

where $h$ is the number of hidden units, $S_{input}$ and $S_{output}$ represent the dimension of input and output, respectively, and $a$ is a regulating constant ranging from 1 to 10. In our experiments, $S_{input}$ is set to 18 and $S_{output}$ is 1, as there are 18 input features while only one prediction result is the final output. We go through all the selections of $a$, so a total of $10 \times 10 = 100$ training runs are conducted for the combinations of the two layers’ neurons. After this loop of training runs, the network achieves the best performance when the first hidden layer contains five neural nodes and the second layer includes four. To demonstrate the efficiency of the final ANN model, we compare its prediction performance with some other machine learning approaches in the following ablation study.

In more detail, we use the Adam optimizer for the gradient update. To achieve multi-class classification, we use softmax as the output layer and select the node with the highest final result as the classification result.
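As a minimal sketch of the sizing rule in Eq. (30) and of the softmax output described above, the snippet below enumerates candidate hidden-layer widths for a = 1, ..., 10 and selects the output node with the largest softmax probability. Rounding h to the nearest integer and all helper names (`hidden_units`, `candidate_layer_pairs`, `softmax`, `predict_class`) are assumptions made for illustration; they are not taken from the original experiments.

```python
import math
from itertools import product


def hidden_units(s_input, s_output, a):
    """Eq. (30): h = sqrt(S_input + S_output) + a, rounded to an integer (assumption)."""
    return round(math.sqrt(s_input + s_output) + a)


def candidate_layer_pairs(s_input=18, s_output=1, a_values=range(1, 11)):
    """All (first layer, second layer) width pairs when a ranges over a_values for each layer."""
    widths = [hidden_units(s_input, s_output, a) for a in a_values]
    return list(product(widths, widths))


def softmax(logits):
    """Numerically stable softmax over a list of output-layer activations."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits):
    """Index of the output node with the highest softmax probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


pairs = candidate_layer_pairs()
print(len(pairs))                        # number of configurations in the search loop
print(predict_class([0.2, 1.5, -0.3]))   # index of the largest activation
```

Each pair would correspond to one training run in the search loop; the training itself (for example with the Adam optimizer) is deliberately left out of this sketch.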


Fig. 1 Artificial neural network


4 Experimental overview

4.1 Dataset description

We train and evaluate our models using data curated from two Portuguese public schools across a single academic year [19]. Similar to other European countries such as France, achievement is measured on a 20-point scale in increasing order of achievement. A total of 32 attributes are collected per individual student, including demographic, social/emotional, and school-related information (e.g. absences and mid-term grades). Before training our model, we first clean up and encode the data. As there are few dirty or missing records, we simply remove them, which keeps the data usable without changing its distribution. We also apply One-Hot Encoding so that the processed data can be used for machine learning.

The experimental design mainly contains a two-step prediction from coarse-grained to fine-grained. The first step aims to train an IGR-FWSVM based prediction model that separates pass (P1) from fail (P0). Then detailed levels are divided from D to A+ according to the score (0 to 20). For the second prediction, ANN is employed for the training of the P1 class and the P0 class separately, with the level as the label. Therefore, we can conduct our level prediction based on the preceding pass/fail result. In this section, the two steps are explained in detail in turn, and the overall process is summarized at the end.

4.2 Coarse-grained prediction

Coarse-grained class separation is depicted in Fig. 2. Initially, we perform data cleaning by removing dirty or null values and converting non-numerical elements (e.g. sex and school) into numeric encodings, such as binary values 0 and 1. After ensuring the accuracy and validity of the data, a variable rP is used to mark whether a student passed the course according to the corresponding score, based on which we then divide the data into two classes. If the score is at least 12 points (see Table 1), which means the student passed the exam, the sample is labelled rP = 1; otherwise rP = 0. To eliminate the influence of differing scales among indicators, data normalization is conducted to make different indicators comparable.

After pre-processing and normalization, we partition our dataset into a training set and a test set in a ratio of 5 : 1. Then, the K-fold cross-validation algorithm is applied to train the model. As shown in Fig. 3, the training dataset is divided evenly into k subsets, with each subset treated as the validation fold once. The cross-validation is repeated k times, each time using the remaining folds to train the model, and the average accuracy of the k models on their corresponding validation folds is used to judge the quality of model construction for training optimization. In our experiment, k is set to the empirical value 5, which is a trade-off between computation and precision. Thus, the ratio of the training set, validation set, and test set is 4 : 1 : 1. Finally, the accuracy of the model on the test set is taken as the final performance.

Before constructing the feature weighted matrix, we need to calculate the information gain ratio (IGR) for each feature. Due to the wide gap in IGR values among some features, we empirically define a threshold W = 0.1 and remove the features whose IGR values with respect to the final performance are less than W. Among the original 32 inputs, only 18 meet the threshold condition and are retained as valid characteristics. The IGR values of these features are stored in a matrix available for further calculation. We then train the above-mentioned IGR-FWSVM model using the processed training set. The obtained model can be validated by predicting whether the students pass or fail based on the test dataset. Finally, the accuracy of our model is measured by the average performance of the five training iterations.
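The IGR-based weighting and thresholding described above can be sketched in a few lines. The following is an illustrative sketch only, assuming discrete (already encoded) feature values; continuous attributes would first need to be binned, which is not shown. The function names, the toy data, and the choice to return 0 for a constant feature are assumptions made for this example, and the kernel follows Eq. (22) for a diagonal weight matrix P.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of a list of discrete values, as in Eq. (23)."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def information_gain_ratio(feature_values, labels):
    """IGR(T, F) of Eq. (28) for one discrete feature F."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    # Ent(T|F), Eq. (24): size-weighted entropy of the subsets induced by F.
    conditional = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - conditional       # IG(T, F), Eq. (25)
    split_entropy = entropy(feature_values)    # Ent_F(T), Eq. (26)
    # A constant feature has Ent_F(T) = 0; returning 0 here is an assumption.
    return gain / split_entropy if split_entropy > 0 else 0.0


def select_weights(columns, labels, threshold=0.1):
    """Drop features with IGR below the threshold W; the kept IGRs form diag(P)."""
    igr = {name: information_gain_ratio(vals, labels) for name, vals in columns.items()}
    return {name: w for name, w in igr.items() if w >= threshold}


def weighted_gaussian_kernel(xi, xj, weights, gamma=1.0, delta=1.0):
    """Eq. (22) with diagonal P: exp(-gamma * sum_q (w_q * (x_iq - x_jq))^2 / (2 * delta^2))."""
    squared = sum((w * (a - b)) ** 2 for w, a, b in zip(weights, xi, xj))
    return math.exp(-gamma * squared / (2 * delta ** 2))


# Toy data (invented for illustration): six students, pass/fail labels rP.
labels = [1, 1, 0, 0, 1, 0]
columns = {
    "failures": [0, 0, 1, 1, 0, 1],  # separates the two classes perfectly
    "sex_M": [0, 1, 0, 1, 0, 1],     # carries almost no information
}
weights = select_weights(columns, labels)
print(weights)
```

Because P is diagonal, the weighted kernel equals an ordinary Gaussian kernel evaluated on inputs whose columns have been scaled by the retained IGR values, so an off-the-shelf SVM could in principle be trained on the scaled features.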


Fig. 2 Process of the first


4.3 Fine-grained prediction

In this stage, more detailed levels are divided from D to A+ according to the score (0 to 20), as listed in Table 1. As illustrated in Fig. 4, the second prediction is executed separately on the two classes of data partitioned by rP. For convenience, we define the class with rP = 1 as class P1 and the class with rP = 0 as class P0. By training on the P1 and P0 classes separately, the preceding pass/fail result can be used to predict more detailed student performance, which is represented by levels.

For the two datasets of passed and failed students, we train a separate ANN classification model. Taking the training of the P0 class as an example, the first step is again to normalize the data. Subsequently, we use the level instead of rP as the label to train the model named P0-ANN. The input features here are the same 18 features previously selected for the first prediction. In this way, we obtain the model P0-ANN, which can predict the level of a failed student. Similarly, the model P1-ANN can be trained for passed students. Finally, the accuracy is also calculated for comparison.

4.4 The overall process

After the above two stages, the training process is complete. Next, we illustrate the whole process of our two-step student performance prediction. Given a new student observation, a detailed performance prediction is made as follows. Firstly, the information of the student is input to the trained IGR-FWSVM model for a preliminary judgement of whether the student passes. Next, the same features are fed to the corresponding ANN model on the basis of the pass/fail result. If the first model judges that the student fails the course, then the P0-ANN model is chosen to perform the next step; otherwise, the P1-ANN model is used. For example, one possible outcome is that the student is classified as passed in the first step, and the second prediction outputs an A level performance from the four passed levels.
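The routing logic of the overall process can be sketched as follows. This is a minimal illustration, not the trained pipeline: the three model arguments are hypothetical callables standing in for the trained IGR-FWSVM, P0-ANN and P1-ANN models, and the toy stand-ins, including the `grade2` key, are invented for the example.

```python
def two_step_predict(features, pass_fail_model, p0_model, p1_model):
    """Run the coarse pass/fail model, then the level model of the predicted class."""
    r_p = pass_fail_model(features)                  # 1 = pass (class P1), 0 = fail (class P0)
    level_model = p1_model if r_p == 1 else p0_model
    return r_p, level_model(features)                # both models receive the same features


# Hypothetical stand-ins for the trained models (assumptions for illustration only).
def toy_pass_fail(features):
    return 1 if features["grade2"] >= 12 else 0


def toy_p0_ann(features):
    return "C+"


def toy_p1_ann(features):
    return "A"


print(two_step_predict({"grade2": 17}, toy_pass_fail, toy_p0_ann, toy_p1_ann))
print(two_step_predict({"grade2": 6}, toy_pass_fail, toy_p0_ann, toy_p1_ann))
```

Passing the models in as callables keeps the routing independent of how each model was trained, so the same function works with any classifier objects wrapped to expose a single prediction call.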


Fig. 3 K-fold cross-validation algorithm used in dataset partitioning

Table 1 Level division corresponding to the scores

rP                        Scores   Level
rP = 0 (fail/class P0)    0–1      D
                          2–3      C-
                          4–5      C
                          6–7      C+
                          8–9      B-
                          10–11    B
rP = 1 (pass/class P1)    12–14    B+
                          15–16    A-
                          17–18    A
                          19–20    A+
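The level division in Table 1 amounts to a simple lookup. The sketch below assumes integer scores between 0 and 20; the names `LEVEL_BOUNDS`, `PASS_MARK`, `score_to_level` and `score_to_rp` are hypothetical and introduced only for this example, with the pass mark of 12 read off the boundary between the two classes in the table.

```python
# (inclusive upper score bound, level), in increasing order, following Table 1.
LEVEL_BOUNDS = [
    (1, "D"), (3, "C-"), (5, "C"), (7, "C+"), (9, "B-"), (11, "B"),
    (14, "B+"), (16, "A-"), (18, "A"), (20, "A+"),
]
PASS_MARK = 12  # lowest score in class P1 according to Table 1


def score_to_level(score):
    """Map an integer score in [0, 20] to its level label."""
    if not 0 <= score <= 20:
        raise ValueError("score must lie between 0 and 20")
    for upper, level in LEVEL_BOUNDS:
        if score <= upper:
            return level


def score_to_rp(score):
    """Pass/fail label rP: 1 for class P1 (pass), 0 for class P0 (fail)."""
    return 1 if score >= PASS_MARK else 0


for s in (0, 11, 12, 20):
    print(s, score_to_rp(s), score_to_level(s))
```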

Fig. 4 Process of the second prediction

5 Results and analysis

5.1 Feature selection and analysis

In the calculation of feature weights, we try the information gain (IG) and information gain ratio (IGR) separately for comparison. As plotted in Fig. 5a, b, the outcome histogram based on IGR is smoother. In particular, it also gives a good judgement on the binary input ‘higher_yes’ (i.e. whether the student wants to take higher education). In contrast, the information gain for features with fewer choices is almost at the same low level, which is the inherent problem of IG and also the reason we choose IGR instead of IG.

Additionally, we find that there is a wide gap among some of the feature weights. For example, features such as ‘studytime’, ‘fjob’, ‘famrel’, and ‘failures’ have a great impact on the academic achievement of students, with weights that exceed 0.4 and even reach 0.8. On the contrary, ‘sex_M’, ‘address_U’, ‘famsize_LE3’, ‘paid_yes’, and some other characteristics carry little weight, in some cases less than 0.02. It is reasonable to assume that these features of low influence do not contribute to the final prediction and may reduce the accuracy of the calculation if they are taken into account. Thus, we set a threshold W = 0.1 and only select features whose IGR values are greater than W, while other less important ones are removed as noise. In this way, the precision and robustness of our model can be improved while the difficulty of the learning task is kept at an appropriate level. Panels (c) and (d) of Fig. 5 present the IG and IGR values of the 18 main features selected by IGR. As a complement, detailed descriptions of the features of low and high influence are given in Tables 2 and 3, respectively.

According to the analysis of the feature weights, we find that students’ performance in class, such as grades in the two periods and the number of past class failures, has a more significant impact on the final grade. At the same time, family-related variables, such as parents’ education level, work, and family relations, also play a crucial role in students’ academic performance. Nevertheless, gender, address, school, and some other factors seem to be less influential or even unimportant.

5.2 Ablation study

To quantitatively measure the experimental results and compare different methods, we adopt Root Mean Square Error (RMSE) [34] and Mean Absolute Percentage Error (MAPE) [35] as evaluation metrics. The related formulas are defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \tag{31}$$

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \tag{32}$$

where $\hat{y}_i$ and $y_i$ represent the predicted value and true value for the $i$th sample, respectively, and $n$ denotes the size of the prediction dataset. The smaller the values of the two indexes (i.e. the closer to 0), the more accurate the experimental results are.

Based on the two metrics, Table 4 shows the comparison of different approaches’ performance in the preliminary pass/fail prediction. As we can see, ANN performs the


Fig. 5 Comparison of different features’ weights based on two calculation criteria: a all features’ IG values, b all features’ IGR values, c main
features’ IG values, d main features’ IGR values. (IG, information gain; IGR, information gain ratio)

Table 2 Features of low influence and their descriptions

Feature          Description
Sex_M            Student’s sex (binary: female or male)
School_MS        Student’s school (binary: Gabriel Pereira or Mousinho da Silveira)
Address_U        Student’s home address type (binary: urban or rural)
Famsize_LE3      Family size (binary: ≤ 3 or > 3)
Paid_yes         Extra paid classes (binary: yes or no)
Pstatus_T        Relationship between parents (binary: together or apart)
Schoolsup_yes    Extra school education (binary: yes or no)
Famsup_yes       Home education (binary: yes or no)
Activities_yes   Regular activities (binary: yes or no)
Nursery_yes      Go to nursery school (binary: yes or no)
Higher_yes       Want higher education (binary: yes or no)
Internet_yes     Love internet (binary: yes or no)
Romantic_yes     Has a partner (binary: yes or no)
Subject          Has a favourite subject (binary: yes or no)


Table 3 Features of high influence and their descriptions

Feature      Description
Fjob         Father’s job (number: from 0 to 4)
Studytime    Weekly study time [number: (1) < 2 h, (2) 2–5 h, (3) 5–10 h or (4) > 10 h]
Failures     Number of past class failures (number: n if 1 ≤ n < 3, else 4)
Famrel       Quality of family relationships (number: from 1: very bad to 5: excellent)
Mjob         Mother’s job (number: from 0 to 4)
Grade 1      First period grade (number: from 0 to 20)
Grade 2      Second period grade (number: from 0 to 20)
Age          The age of student (number: from 15 to 22)
Medu         The education of mother (number: from 0 to 4)
Fedu         The education of father (number: from 0 to 4)
Reason       Why choose this school
Traveltime   Time from home to school (number: from 1: very low to 5: very high)
Freetime     Time after school (number: from 1: very low to 5: very high)
Goout        Go out with friends (number: from 1: very low to 5: very high)
Dalc         Workday drink (number: from 1: very low to 5: very high)
Walc         Weekend drink (number: from 1: very low to 5: very high)
Health       Health status (number: from 1: very bad to 5: very well)
Absences     Absences times (number: from 0 to 93)

the worst in this binary classification case, while SVM is better than the other comparative methods. Furthermore, we use the information gain ratio as the criterion for feature selection so that the effectiveness of the new model IGR-FWSVM reaches a higher level.

Table 4 Comparison of different methods’ performance in pass/fail prediction

Method          All feature         Main feature
KNN             0.4255   1.226      0.4126   1.2074
LASSO           0.517    1.3172     0.5002   1.3144
RANDOMFOREST    0.2667   1.312      0.274    1.3077
RIDGE           0.3871   1.1902     0.3848   1.1915
ELASTICNET      0.4912   1.3027     0.5103   1.355
GRADIENTBOOST   0.2881   1.0958     0.2755   1.1007
ANN             0.7234   1.451      0.6697   1.4024
SVM             0.2603   1.132      0.2534   1.0737
IGR-FWSVM       0.2491   1.0266     0.2364   0.9734

As for the level prediction performance listed in Table 5, the original SVM algorithm and its IGR-improved version, though good for two-class classification, do not work well on this essentially multi-class classification problem. In contrast, ANN can make better use of the input characteristics through its connection layer, and it does obtain a better result. Moreover, we find that the detailed judgement of the level is more accurate if it is based on the coarse-grained pass/fail classification. The reason is that the first prediction step has narrowed down the range of the subsequent fine-grained judgement, so the errors with a wide margin are reduced. Hence, we propose an IGR-FWSVM and ANN-based algorithm for academic course performance prediction, which achieves the best result in our ablation study.

Table 5 Comparison of performance between the two-step methods and single-step methods
Method            All feature         Main feature
SVM               1.172    0.2745     1.1534   0.2661
IGR-FWSVM         1.0386   0.2502     0.988    0.2406
ANN               0.652    0.3625     0.596    0.3502
SVM + ANN         0.6233   0.2837     0.5724   0.2604
IGR-FWSVM + ANN   0.5738   0.2656     0.5472   0.2423

6 Conclusion

In this paper, we propose a two-step academic course performance prediction based on the feature weighted support vector machine (FWSVM) and artificial neural network (ANN) algorithm. Firstly, we establish the theory of feature weighted SVM, where the importance of different features to the outcome is calculated by the information gain ratio (IGR). Then, the IGR-FWSVM is used to train a coarse-grained classification model between pass (P1) and fail (P0). Finally, detailed score levels are divided from D to A+, and ANN is employed for the further fine-grained training of the P1 class and the P0 class separately so that we can take advantage of the obtained pass/fail prediction results. The proposed approach is evaluated on the student datasets from two Portuguese secondary schools. Our ablation study demonstrates the effectiveness of this hybridized method. We suggest that our methodology may be utilized in a number of teaching and learning contexts to assist practitioners and support positive student outcomes. For example, automated performance prediction may allow more effective student monitoring and provision of timely, individualized student support.
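
The two-step procedure summarized in the conclusion (a coarse pass/fail model followed by a separate fine-grained level model for each coarse class) can be illustrated with a small routing sketch. This is not the authors' implementation. The class names `TwoStepClassifier` and `NearestCentroid`, the factory arguments `make_coarse` and `make_fine`, the intermediate level names, and the toy data are all hypothetical. To stay dependency-free, a tiny nearest-centroid classifier stands in for both the IGR-FWSVM and the ANN. We also assume that each fine-grained model is trained on the students whose true label falls in that coarse class.

```python
class NearestCentroid:
    """Tiny stand-in classifier (NOT an SVM or an ANN), used only for illustration."""

    def fit(self, X, y):
        sums, counts = {}, {}
        for row, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(row))
            for k, v in enumerate(row):
                acc[k] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids_ = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [min(self.centroids_, key=lambda c: dist(row, self.centroids_[c])) for row in X]


class TwoStepClassifier:
    """Coarse pass/fail model, then one fine-grained level model per coarse class."""

    def __init__(self, make_coarse, make_fine):
        self.make_coarse = make_coarse  # factory for the pass/fail model
        self.make_fine = make_fine      # factory for the per-class level models

    def fit(self, X, passed, level):
        self.coarse_ = self.make_coarse().fit(X, passed)
        self.fine_ = {}
        for cls in set(passed):
            rows = [i for i, p in enumerate(passed) if p == cls]
            # Assumption: each fine model is trained on one true coarse class only.
            self.fine_[cls] = self.make_fine().fit(
                [X[i] for i in rows], [level[i] for i in rows]
            )
        return self

    def predict(self, X):
        coarse = self.coarse_.predict(X)
        # Route every sample to the fine model of its predicted coarse class.
        return [self.fine_[c].predict([row])[0] for row, c in zip(X, coarse)]


# Hypothetical toy data: two period grades per student, 1 = pass (P1), 0 = fail (P0).
X = [[4, 5], [7, 8], [11, 12], [14, 15], [17, 18], [19, 19]]
passed = [0, 0, 1, 1, 1, 1]
level = ["D", "D", "C", "B", "A", "A+"]

model = TwoStepClassifier(NearestCentroid, NearestCentroid).fit(X, passed, level)
print(model.predict([[6, 6], [16, 17]]))
```

Because the routing only requires `fit` and `predict`, other estimators with the same interface (for example an SVM for the coarse step and a neural network for the fine step) could be passed in through the two factories. A sample predicted as fail can only receive a level that occurs in the fail class, which is the narrowing effect described in the discussion of Table 5.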
Funding This study was funded by the Fundamental Research Funds for the Central Universities (grant number 20720200094).

Declarations

Conflict of interest The authors declare that they have no conflict of interest or personal relationships that could have appeared to influence the work reported in this paper.


