https://doi.org/10.1007/s00521-021-05962-3
Abstract
Academic performance, a globally understood metric, is utilized worldwide across disparate teaching and learning environments and is regarded as a quantifiable indicator of learning gain. The ability to reliably estimate a student's academic performance is important and can help academic staff improve the provision of support. However, academic performance estimation is recognized to be non-trivial and affected by multiple factors, including a student's engagement with learning activities and their social, geographic, and demographic characteristics. This paper investigates the opportunity to develop reliable models for predicting student performance using artificial intelligence. Specifically, we propose two-step academic performance prediction using feature weighted support vector machine (SVM) and artificial neural network (ANN) learning. A feature weighted SVM, where the importance of different features to the outcome is calculated using information gain ratios, is employed to perform coarse-grained binary classification (pass, P1, or fail, P0). Subsequently, detailed score levels are divided from D to A+, and ANN learning is employed for fine-grained, multi-class training of the P1 and P0 classes separately. The experiments and our subsequent ablation study, which are conducted on the student datasets from two Portuguese secondary schools, demonstrate the effectiveness of this hybridized method.
Keywords Artificial intelligence · Academic performance analytics · Feature weighted SVM (FWSVM) · Information gain ratio · ANN
(SVM) [4, 5], decision tree [6], matrix factorization (MF) [7, 8], and their extended models [9, 10]. Among these methods, the SVM algorithm is frequently applied and achieves superior performance [4, 11]. However, it was originally designed to solve binary classification problems and does not consider the relative importance of feature vector elements. For multiple classification problems, the algorithm can be extended to a superposition of multiple binary classification SVM models, but this does not always perform well. As a rapidly developing method, deep learning [12–16] has also shown good performance in the field of education. However, overtraining can become an issue for fully trained deep learning models if the dataset is not sufficiently large. It has been demonstrated that a comparatively shallow Artificial Neural Network (ANN) [17] is sufficient in terms of accuracy.

In this paper, we combine the SVM and ANN to propose a novel two-step prediction method for academic course performance. We take the weight of different features into consideration by extending the SVM model to a feature weighted SVM (FWSVM). Specifically, the information gain ratio (IGR) [18] is chosen for feature weight calculation, and some low-influence features are dismissed before training.

Our SVM-ANN hybrid is implemented as follows. Firstly, the IGR-FWSVM is used to perform coarse-grained binary classification, and each feature vector is categorized as a pass (P1) or fail (P0). Subsequently, detailed score levels are divided from D to A+, and ANN is employed for further fine-grained training of the P1 class and the P0 class separately. The experiments and our subsequent ablation study, which are conducted on the student datasets [19] from two Portuguese secondary schools, demonstrate the effectiveness of this hybridized method.

The contributions of our work are summarized as follows: (1) The IGR-FWSVM and the ANN algorithm are adopted separately in the two proposed prediction steps, so the fine-grained level prediction can be conducted on the basis of the earlier pass/fail result. (2) The information gain ratio (IGR) is utilized for feature weight calculation, so that we obtain an optimized SVM model that accounts for different features' influence on the outcome.

The remainder of this paper is organized as follows: Sect. 2 analyses the related methods applied in educational analysis and prediction. Section 3 establishes the theory of IGR-FWSVM and introduces the structure of the ANN. Our experimental procedure is explained in Sect. 4. Section 5 presents the process of feature selection and analyses the results of our ablation study. Conclusions are offered in Sect. 6.

2 Related work

Existing methods for student achievement prediction may be summarized and categorized as (1) those incorporating traditional machine learning and (2) deep learning methods. Traditional machine learning approaches include support vector machines (SVM) [4, 5], decision trees [6], methods utilizing matrix factorization (MF) [7, 8], and their extended models [9, 10]. Deep learning methods include graph convolutional networks (GNN) [13], fully connected feed-forward networks [14], and long short-term memory networks (LSTM) [16].

2.1 Traditional machine learning

Kloft et al. [4] employed support vector machines (SVM) [5] for weekly dropout prediction in massive open online courses (MOOCs) by analysing features such as students' historical clickstream data. Similarly, Chen et al. [6] integrated decision tree modelling with an extreme learning machine as a novel hybrid approach to predict student dropout. Sweeney et al. [7, 8] applied matrix factorization (MF)-based methods to predict next-term grades. As one of the extended MF-based models, the Cumulative Knowledge-based Regression Model (CK) proposed by Morsy et al. [9] linked knowledge acquired from preparatory courses with the requirements of a target course in the form of matrix vectors. Abouelmagd et al. [20] proposed a new method to improve the utilization rate of data. Ren et al. [10] then additionally took the influence of co-taken courses into consideration.

2.2 Deep learning

Arsad et al. [12] used an Artificial Neural Network and Linear Regression to predict the academic performance of students. Hu et al. [13] estimated students' academic performance using attention-based graph convolutional networks (GNN); given the course graph of a student, a personalized graph classification could be obtained by their GNN model. Liu et al. [21] proposed the use of Markovian jumping parameters and mixed time-delays to solve the exponential stability problem of Cohen–Grossberg neural networks. Du et al. [22] studied a generalized discrete neutral-type neural network with time-varying delays. Whitehill et al. [14] adopted a fully connected feed-forward neural network to predict the dropout rate. Feng et al. [15] constructed a context-aware feature interaction network in which context information of both courses and students was taken into account. Fei et al. [16] proposed a temporal prediction model for MOOC dropout by handling the extracted features of students' behaviour with an LSTM.
(b) Polynomial kernel:

K_p(x_i, x_j) = \left( c\,(x_i P)(x_j P)^{T} + r \right)^{d} = \left( c\, x_i P P^{T} x_j^{T} + r \right)^{d}, \quad (c > 0), \qquad (21)

(c) Gaussian kernel:

K_g(x_i, x_j) = \exp\!\left( -c\, \frac{\lVert x_i P - x_j P \rVert^{2}}{2 d^{2}} \right) = \exp\!\left( -c\, \frac{(x_i - x_j)\, P P^{T} (x_i - x_j)^{T}}{2 d^{2}} \right), \quad (c > 0). \qquad (22)
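The feature weighting enters the kernels only through the diagonal matrix P (defined in Eq. (29) below). As an illustration, here is a minimal NumPy sketch of the feature-weighted Gaussian kernel of Eq. (22); the function and variable names are ours, and the resulting Gram matrix could, for instance, be passed to an off-the-shelf SVM via scikit-learn's precomputed-kernel interface.

```python
import numpy as np

def weighted_gaussian_kernel(X1, X2, P, c=1.0, d=1.0):
    """Feature-weighted Gaussian kernel of Eq. (22).

    X1, X2 : arrays of shape (n1, m) and (n2, m) holding feature vectors.
    P      : (m, m) diagonal matrix of IGR feature weights (Eq. (29)).
    Returns the (n1, n2) kernel (Gram) matrix.
    """
    Z1, Z2 = X1 @ P, X2 @ P                       # apply the feature weights
    sq_dists = (
        np.sum(Z1 ** 2, axis=1)[:, None]
        + np.sum(Z2 ** 2, axis=1)[None, :]
        - 2.0 * Z1 @ Z2.T
    )                                             # ||x_i P - x_j P||^2 for all pairs
    return np.exp(-c * sq_dists / (2.0 * d ** 2))

# Hypothetical usage with scikit-learn:
# from sklearn.svm import SVC
# K_train = weighted_gaussian_kernel(X_train, X_train, P)
# clf = SVC(kernel="precomputed").fit(K_train, y_train)
# y_pred = clf.predict(weighted_gaussian_kernel(X_test, X_train, P))
```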
For a certain feature F, suppose that the training set T is correspondingly split into subsets T_1, T_2, \ldots, T_p, \ldots, T_s \ (p = 1, 2, \ldots, s). We can derive the entropy after partition by calculating the weighted sum of all subsets' entropies:

\mathrm{Ent}(T \mid F) = \sum_{p=1}^{s} \frac{|T_p|}{|T|} \, \mathrm{Ent}(T_p) = - \sum_{p=1}^{s} \frac{|T_p|}{|T|} \sum_{k \in \{-1, 1\}} \frac{|Y_{\{T_p, k\}}|}{|T_p|} \log_2 \frac{|Y_{\{T_p, k\}}|}{|T_p|}. \qquad (24)

Then we can obtain the information gain IG(T, F) simply by subtracting the new entropy \mathrm{Ent}(T \mid F) from the original \mathrm{Ent}(T):
IG(T, F) = \mathrm{Ent}(T) - \mathrm{Ent}(T \mid F). \qquad (25)

To determine the final information gain ratio, we calculate the entropy of taking feature F as the random variable:

\mathrm{Ent}_F(T) = - \sum_{p=1}^{s} \frac{|T_p|}{|T|} \log_2 \frac{|T_p|}{|T|}. \qquad (26)

We then introduce a penalty parameter M, which is the reciprocal of \mathrm{Ent}_F(T):

M = \frac{1}{\mathrm{Ent}_F(T)} = \frac{1}{- \sum_{p=1}^{s} \frac{|T_p|}{|T|} \log_2 \frac{|T_p|}{|T|}}. \qquad (27)

The information gain ratio IGR(T, F) can then be given by the following formula:

IGR(T, F) = M \cdot IG(T, F) = \frac{IG(T, F)}{\mathrm{Ent}_F(T)}. \qquad (28)

If the value of s is larger, meaning that feature F divides the data into more subsets T_p \ (p = 1, 2, \ldots, s), the value of \mathrm{Ent}_F(T) will be larger and the penalty parameter M will be smaller (and vice versa). By multiplying by the penalty parameter M on top of the information gain IG(T, F), features with many values are not overly favoured, and those with fewer values are treated fairly.

The information gain ratio IGR(T, F_q), which indicates the weight of the qth feature F_q, is used to measure its importance and contribution to the target prediction. The greater the information gain ratio, the more influential the corresponding feature. Consequently, the feature weighted matrix P is constructed as an n \times n diagonal matrix:

P = \begin{bmatrix} IGR(T, F_1) & 0 & \cdots & 0 \\ 0 & IGR(T, F_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & IGR(T, F_n) \end{bmatrix}. \qquad (29)

As we can see, each feature has only one weight value, and there is no interaction between features.
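To make the weight calculation concrete, the sketch below computes IG(T, F), the split entropy Ent_F(T), and IGR(T, F) for each feature, then assembles the diagonal weight matrix P of Eq. (29). It is a minimal illustration with helper names of our own; it assumes discrete (or already discretized) feature values and pass/fail labels in {-1, 1}.

```python
import numpy as np

def entropy(labels):
    """Ent(T): Shannon entropy of the pass/fail labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain_ratio(feature_values, labels):
    """IGR(T, F) = IG(T, F) / Ent_F(T), following Eqs. (24)-(28)."""
    n = len(labels)
    values, counts = np.unique(feature_values, return_counts=True)
    weights = counts / n
    # Ent(T|F): weighted entropy of the subsets induced by feature F, Eq. (24)
    cond_entropy = sum(
        w * entropy(labels[feature_values == v]) for v, w in zip(values, weights)
    )
    ig = entropy(labels) - cond_entropy                    # Eq. (25)
    split_entropy = -np.sum(weights * np.log2(weights))    # Ent_F(T), Eq. (26)
    return ig / split_entropy if split_entropy > 0 else 0.0  # Eq. (28)

def weight_matrix(X, labels):
    """Diagonal feature-weight matrix P of Eq. (29) for an (n_samples, n_features) array X."""
    igr = [info_gain_ratio(X[:, q], labels) for q in range(X.shape[1])]
    return np.diag(igr)
```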
3.3 Artificial neural network (ANN)

After the coarse-grained binary classification, we regard student achievement prediction as a multiple classification problem. The SVM algorithm performs well on two-class classification problems, and it can be extended to model multiple (>2) classes, but it is limited in this context because it is essentially a superposition of multiple binary classification models. Other commonly adopted approaches for solving the multiple classification problem are unsupervised in nature, for example, decision trees [29, 30] and K-nearest neighbour (KNN) [31, 32] classification. In contrast to existing techniques such as [29–32], we formulate class modelling as a supervised learning problem and adopt a simple artificial neural network (ANN) [17] to deal with the second classification task. The ANN architecture, composed of an input layer, one or more hidden layers, and an output layer, has been widely applied to many decision modelling and forecasting problems. ANNs have a number of benefits: they are non-linear in nature and may be applied to parametric and non-parametric data. Our ANN architecture comprises two hidden layers and is illustrated in Fig. 1.

To determine the parameters of the ANN model, we conduct a series of experiments and find that the model achieves the best training accuracy when there are two hidden layers with 5 and 4 neurons, respectively, as shown in Fig. 1. If the model goes deeper with more connection layers, it leads to overfitting, as a deep neural network (DNN) [33] does not suit our dataset, which is not large enough. On the other hand, a model with only one hidden layer fails to make full use of the features and carry out effective training, which also reduces prediction accuracy. For the number of hidden units in the two layers, we use the following empirical formula:

h = \sqrt{S_{\mathrm{input}} + S_{\mathrm{output}}} + a, \qquad (30)

where h is the number of hidden units, S_{\mathrm{input}} and S_{\mathrm{output}} represent the dimensions of the input and output separately, and a is a regulating constant ranging from 1 to 10. In our experiments, S_{\mathrm{input}} is set to 18 and S_{\mathrm{output}} to 1, as there are 18 input features while only one prediction result is the final output. We go through all the selections of a, so a total of 10 × 10 = 100 training runs are conducted over the combinations of the two layers' neuron counts. After this loop training, the network achieves the best performance when the first hidden layer contains five neural nodes and the second layer includes four. To prove the efficiency of our final ANN structure, we compare its prediction performance with some other machine learning approaches in the following ablation study.

For more details, we use the Adam optimizer for gradient updates. To achieve multi-class classification, we use softmax as the output layer and select the node with the highest output as the classification result.
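As an illustration of the architecture just described (18 selected input features, two hidden layers with 5 and 4 neurons, a softmax output, and the Adam optimizer), the following is a minimal sketch. The framework (Keras), the hidden-layer activation, and the number of output classes are our assumptions, since the paper does not specify them.

```python
import tensorflow as tf

NUM_CLASSES = 4  # assumed: number of fine-grained levels handled by one sub-model

model = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation="relu", input_shape=(18,)),  # first hidden layer, 18 inputs
    tf.keras.layers.Dense(4, activation="relu"),                     # second hidden layer
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),        # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# The predicted class is the output node with the highest softmax score:
# predicted_level = model.predict(x_batch).argmax(axis=1)
```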
In this stage, more detailed levels are divided from D to A+ according to the score range 0 to 20, as listed in Table 1. As illustrated in Fig. 4, the second prediction is executed separately on the two classes of data partitioned by rP. For convenience of expression, we define the class whose rP = 1 as class P1, and the rP = 0 one as class P0. By training the P1 and P0 classes separately, the earlier pass/fail result can be utilized to predict more detailed student performance, which is represented by levels.

For the two data sets of passed and failed students, we train an ANN classification model for each. Taking the training of the P0 class as an example, the first step is still to normalize the data. Subsequently, we use the level instead of rP as the label to train the model named P0-ANN. The input features here are the same 18 previously selected for the first prediction. In this way, we obtain the model P0-ANN, which can predict the level of a failed student. Similarly, the model P1-ANN can be trained for passed students. Finally, the accuracy is calculated as well for comparison.

After the above two stages, we have completed the training process. Next, we illustrate the whole process of our two-step student performance prediction. Given a new student observation, we can follow the steps below to conduct a detailed performance prediction. Firstly, the information of the student is input to the trained IGR-FWSVM model for a preliminary pass/fail judgement. Next, the same features are fed to the corresponding ANN model on the basis of the previous pass/fail result. If the first model judges that the student fails the course, the P0-ANN model is chosen to perform the next step; otherwise, it is the model P1-ANN. For example, one possible outcome is that the student is classified as passed in the first step, and the second prediction outputs an A-level performance from the four passed levels.
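Put together, the inference procedure reduces to a pass/fail decision followed by dispatch to the matching level model. The sketch below assumes the three trained models (fwsvm, p1_ann, p0_ann) and a level-name lookup already exist under these hypothetical names.

```python
def predict_level(student_features, fwsvm, p1_ann, p0_ann, level_names):
    """Two-step prediction: IGR-FWSVM pass/fail, then the matching ANN level model.

    student_features : 1-D array of the 18 selected (normalized) features.
    level_names      : mapping from ANN output index to a level label such as 'A'.
    """
    x = student_features.reshape(1, -1)
    passed = fwsvm.predict(x)[0] == 1          # rP = 1 means pass
    ann = p1_ann if passed else p0_ann         # route to the matching fine-grained model
    probs = ann.predict(x)[0]
    return ("pass" if passed else "fail"), level_names[int(probs.argmax())]
```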
5 Results and analysis

5.1 Feature selection and analysis

In the calculation of feature weights, we try the information gain (IG) and the information gain ratio (IGR) separately for comparison. As plotted in Fig. 5a, b, the outcome histogram based on IGR is smoother. In particular, it can also give a good judgement on the binary input 'higher_yes' (i.e. whether the student wants to pursue higher education). However, the information gain for features with fewer choices is almost at the same low level, which is the inherent problem of IG and the reason we choose IGR instead of IG.

Additionally, we find that there exists a wide gap among part of the feature weights. For example, features such as 'studytime', 'fjob', 'famrel', and 'failures' have a great impact on the academic achievement of students, with weights exceeding 0.4 and even reaching 0.8. On the contrary, 'sex_M', 'address_U', 'famsize_LE3', 'paid_yes', and some other characteristics weigh little, even less than 0.02. It is reasonable to assume that these low-influence features do not contribute to the final prediction and may reduce the accuracy of the calculation if they are taken into
account. Thus, we set a threshold W = 0.1 and only select features whose IGR values are greater than W, while other less important ones are removed as noise. In this way, the precision and robustness of our model can be improved while keeping the difficulty of the learning task at an appropriate level. The subgraphs (c) and (d) in Fig. 5 present the IG and IGR values of the 18 main features selected by IGR. As a complement, detailed descriptions of the low-influence and high-influence features are given in Tables 2 and 3, respectively.

According to the analysis of the feature weights, we find that students' performance in class, such as the grades in the two periods and the number of past class failures, has a more significant impact on the final grade. At the same time, family-related variables, such as parents' education level, work, and family relations, also play a crucial role in students' academic performance. Nevertheless, gender, address, school, and some other factors seem to be less influential or even unimportant.

5.2 Ablation study

To quantitatively measure the experimental results and compare different methods, we adopt the Root Mean Square Error (RMSE) [34] and the Mean Absolute Percentage Error (MAPE) [35] as evaluation metrics. The related formulas are defined as follows:

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \qquad (31)

\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \qquad (32)

where \hat{y}_i and y_i represent the predicted value and the true value for the ith sample, respectively, and n denotes the size of the prediction dataset. The smaller the values of the two indexes (i.e. the closer to 0), the more accurate the experimental results.
influential or even unimportant. Based on the two metrics, Table 4 shows the comparison
of different approaches’ performance in the preliminary
pass/fail prediction. As we can see, ANN performs the
Fig. 5 Comparison of different features' weights based on two calculation criteria: (a) all features' IG values, (b) all features' IGR values, (c) main features' IG values, (d) main features' IGR values (IG, information gain; IGR, information gain ratio)
As we can see, ANN performs the worst in this binary classification case, while SVM is better than the other comparative methods. Furthermore, we use the information gain ratio as the criterion for feature selection, so that the effectiveness of the new model, IGR-FWSVM, reaches a higher level.

Table 4 Comparison of different methods' performance in the pass/fail prediction

Method          All features          Main features
                RMSE     MAPE         RMSE     MAPE
KNN             0.4255   1.226        0.4126   1.2074
LASSO           0.517    1.3172       0.5002   1.3144
RandomForest    0.2667   1.312        0.274    1.3077
Ridge           0.3871   1.1902       0.3848   1.1915
ElasticNet      0.4912   1.3027       0.5103   1.355
GradientBoost   0.2881   1.0958       0.2755   1.1007
ANN             0.7234   1.451        0.6697   1.4024
SVM             0.2603   1.132        0.2534   1.0737
IGR-FWSVM       0.2491   1.0266       0.2364   0.9734

As for the level prediction performance listed in Table 5, the original SVM algorithm and its IGR-improved version, though good for two-class classification, do not work well on this essentially multi-class problem. On the contrary, ANN can make better use of the input characteristics through its connection layers and obtains a better result. Moreover, we find that the detailed level judgement is more accurate when it is based on the coarse-grained pass/fail classification. The reason is that the first prediction step has narrowed down the range of the subsequent fine-grained judgement, so errors with a wide margin are reduced. Hence, we propose an IGR-FWSVM and ANN-based algorithm for academic course performance prediction, which achieves the best result in our ablation study.

Table 5 Comparison of performance between the two-step methods and single-step methods

Method          All features          Main features
                RMSE     MAPE         RMSE     MAPE
coarse-grained classification model between pass (P1) and fail (P0). Finally, detailed score levels are divided from D to A+, and ANN is employed for the further fine-grained training of the P1 class and the P0 class separately, so that we can take advantage of the obtained pass/fail prediction results. The proposed approach is evaluated on the student datasets from two Portuguese secondary schools. Our ablation study demonstrates the effectiveness of this hybridized method. We propose that our methodology may be utilized in a number of teaching and learning contexts to assist practitioners and support positive student outcomes. For example, automated performance prediction may allow more effective student monitoring and the provision of timely, individualized student support.

Funding This study was funded by the Fundamental Research Funds for the Central Universities (grant number 20720200094).

Declarations

Conflict of interest The authors declare that they have no conflict of interest or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Kurzweil M, Wu DD (2015) Case study: building a pathway to student success at Georgia State University
2. Ministry of Education of the People's Republic of China (2016) Promotion rate of graduates of regular school by levels. http://en.moe.gov.cn/Resources/Statistics/edu_stat_2015/2015_en01/201610/t20161012_284485.html. Accessed 4 Feb 2021
3. Ministry of Education (2020) Composition of students in senior secondary schools—Ministry of Education of the People's Republic of China. http://en.moe.gov.cn/documents/statistics/2018/national/201908/t20190812_394224.html. Accessed 4 Feb 2021
4. Kloft M, Stiehler F, Zheng Z, Pinkwart N (2015) Predicting MOOC dropout over weeks using machine learning methods. In: Proceedings of the EMNLP 2014 workshop on analysis of large scale social interaction in MOOCs. Association for Computational Linguistics, Stroudsburg, pp 60–65
5. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297. https://doi.org/10.1007/bf00994018
6. Chen J, Feng J, Sun X et al (2019) MOOC dropout prediction using a hybrid algorithm based on decision tree and extreme learning machine. Math Probl Eng 2019:1–11. https://doi.org/10.1155/2019/8404653
7. Sweeney M, Lester J, Rangwala H (2015) Next-term student grade prediction. In: Proceedings of 2015 IEEE international conference on Big Data, IEEE Big Data 2015. IEEE, pp 970–975
8. Sweeney M, Rangwala H, Lester J, Johri A (2016) Next-term student performance prediction: a recommender systems approach, pp 1–27. https://doi.org/10.5281/zenodo.3554603
9. Morsy S, Karypis G (2017) Cumulative knowledge-based regression models for next-term grade prediction. In: Chawla N, Wang W (eds) Proceedings of the 17th SIAM international conference on data mining, SDM 2017. Society for Industrial and Applied Mathematics, Philadelphia, pp 552–560
10. Ren Z, Ning X, Lan AS, Rangwala H (2019) Grade prediction based on cumulative knowledge and co-taken courses. In: EDM 2019—Proceedings of the 12th international conference on educational data mining, pp 158–167
11. Saqlain SM, Sher M, Shah FA et al (2019) Fisher score and Matthews correlation coefficient-based feature subset selection for heart disease diagnosis using support vector machines. Knowl Inf Syst 58:139–167. https://doi.org/10.1007/s10115-018-1185-y
12. Arsad PM, Buniyamin N, Manan JLA (2014) Prediction of engineering students' academic performance using artificial neural network and linear regression: a comparison. In: 2013 IEEE 5th international conference on engineering education: aligning engineering education with industrial needs for nation development, ICEED 2013. IEEE, pp 43–48
13. Hu Q, Rangwala H (2019) Academic performance estimation with attention-based graph convolutional networks. arXiv
14. Whitehill J, Mohan K, Seaton D et al (2017) Delving deeper into MOOC student dropout prediction. arXiv
15. Feng W, Tang J, Liu TX (2019) Understanding dropouts in MOOCs. In: 33rd AAAI conference on artificial intelligence AAAI 2019, 31st innovative applications of artificial intelligence conference IAAI 2019, 9th AAAI symposium on educational advances in artificial intelligence EAAI 2019, vol 33, pp 517–524. https://doi.org/10.1609/aaai.v33i01.3301517
16. Fei M, Yeung DY (2016) Temporal models for predicting student dropout in massive open online courses. In: Proceedings of the 15th IEEE international conference on data mining workshop, ICDMW 2015. IEEE, pp 256–263
17. Basheer IA, Hajmeer M (2000) Artificial neural networks: fundamentals, computing, design, and application. J Microbiol Methods 43:3–31. https://doi.org/10.1016/S0167-7012(00)00201-3
18. Dai J, Xu Q (2013) Attribute selection based on information gain ratio in fuzzy rough set theory with application to tumor classification. Appl Soft Comput J 13:211–221. https://doi.org/10.1016/j.asoc.2012.07.029
19. Cortez P, Silva A (2008) Using data mining to predict secondary school student performance. In: 15th European concurrent engineering conference 2008, ECEC 2008; 5th future of business technology conference, FUBUTEC 2008, pp 5–12
20. Abouelmagd EI, Awad ME, Elzayat EMA, Abbas IA (2014) Reduction the secular solution to periodic solution in the generalized restricted three-body problem. Astrophys Space Sci 350:495–505. https://doi.org/10.1007/s10509-013-1756-z
21. Liu Y, Liu W, Obaid MA, Abbas IA (2016) Exponential stability of Markovian jumping Cohen-Grossberg neural networks with mixed mode-dependent time-delays. Neurocomputing 177:409–415. https://doi.org/10.1016/j.neucom.2015.11.046
22. Du B, Liu Y, Atiatallah Abbas I (2016) Existence and asymptotic behavior results of periodic solution for discrete-time neutral-type neural networks. J Frankl Inst 353:448–461. https://doi.org/10.1016/j.jfranklin.2015.11.013
23. Chambers LG, Fletcher R (2001) Practical methods of optimization. Math Gaz 85:562. https://doi.org/10.2307/3621816
24. He X, Ji M, Zhang C, Bao H (2011) A variance minimization criterion to feature selection using Laplacian regularization. IEEE Trans Pattern Anal Mach Intell 33:2013–2025. https://doi.org/10.1109/TPAMI.2011.44
25. Zhao Z, Zhang R, Cox J et al (2013) Massively parallel feature selection: an approach based on variance preservation. Mach Learn 92:195–220. https://doi.org/10.1007/s10994-013-5373-4
26. Jin C, Ma T, Hou R et al (2015) Chi-square statistics feature selection based on term frequency and distribution for text categorization. IETE J Res 61:351–362. https://doi.org/10.1080/03772063.2015.1021385
27. Lee C, Lee GG (2006) Information gain and divergence-based feature selection for machine learning-based text categorization. Inf Process Manag 42:155–165. https://doi.org/10.1016/j.ipm.2004.08.006
28. Uğuz H (2011) A two-stage feature selection method for text categorization by using information gain, principal component analysis and genetic algorithm. Knowl Based Syst 24:1024–1032. https://doi.org/10.1016/j.knosys.2011.04.014
29. Safavian SR, Landgrebe D (1991) A survey of decision tree classifier methodology. IEEE Trans Syst Man Cybern 21:660–674. https://doi.org/10.1109/21.97458
30. Wang F, Wang Q, Nie F et al (2020) A linear multivariate binary decision tree classifier based on K-means splitting. Pattern Recognit 107:107521. https://doi.org/10.1016/j.patcog.2020.107521
31. Keller JM, Gray MR (1985) A fuzzy K-nearest neighbor algorithm. IEEE Trans Syst Man Cybern SMC 15:580–585. https://doi.org/10.1109/TSMC.1985.6313426
32. Denoeux T (1995) A k-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE Trans Syst Man Cybern 25:804–813. https://doi.org/10.1109/21.376493
33. Hinton G, Deng L, Yu D et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29:82–97. https://doi.org/10.1109/MSP.2012.2205597
34. Chai T, Draxler RR (2014) Root mean square error (RMSE) or mean absolute error (MAE)? Arguments against avoiding RMSE in the literature. Geosci Model Dev 7:1247–1250. https://doi.org/10.5194/gmd-7-1247-2014
35. Swamidass PM (2000) Mean absolute percentage error (MAPE). In: Encyclopedia of production and manufacturing management. Springer, Boston, pp 462–462

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.