Road Accident Prediction and Model Interpretation Using A Hybrid K Means and Random Forest Algorithm Approach
Received: 14 February 2020 / Accepted: 22 June 2020 / Published online: 28 August 2020
© Springer Nature Switzerland AG 2020
Abstract
Road accident severity is a major concern worldwide, particularly in underdeveloped countries. Understanding the primary and contributing factors may help combat road traffic accident severity. This study identified insights and the most significant target-specific contributing factors for road accident severity. To obtain the most determinant road accident variables, a hybrid K-means and random forest (RF) approach was developed. K-means extracts hidden information from the road accident data and creates a new feature in the training set. The distance between each cluster point and the line joining k1 and k9 was calculated, and the point of maximum distance was selected as k, the optimal number of partitions of the training set. RF was then employed for severity classification. Compared with other classification techniques, the proposed approach achieved an accuracy of 99.86%. The target-specific model interpretation showed that driver experience and day, light condition, and driver age and service year of the vehicle were the strongest contributing factors for serious injury, light injury, and fatal severity, respectively. The outcome demonstrates the predictive supremacy of the approach in road accident prediction. Road transport and insurance agencies will benefit from the study in developing road safety strategies.
1 Introduction

Road traffic accident (RTA) is churning the world, killing thousands and demolishing property every day without discrimination, yet not enough attention has been given to mitigating the severity. It is one of the life-threatening incidents in the world in terms of death and property damage. Identifying the primary road traffic accident factors will help to provide an appropriate solution to minimize the adverse effect of severity on human life and property. Road severity does not occur by chance: it has patterns and can be predicted and avoided. So, accidents are "events which can be examined, analyzed, and prevented" [20]. According to the workers' health organization, accidents are characterized thus: "Fatalities are not fated; accidents do not just happen; illness is not random; they are caused" [33]. Traffic accidents occur daily in the capital city of Addis Ababa, Ethiopia, damaging human life and property within a fraction of a second. It is one of the leading terrifying causes of death in the country.

RTA severity has been an active research area in road safety over the last two decades. Researchers have applied interesting methods to road accident severity classification-based models, often building models with traditional statistical-based approaches. These techniques help to get insights and identify the underlying causes of vehicle accidents and related factors in road safety. These days, due to the presence of massive volumes of data, machine learning surpasses conventional statistical-based approaches in predictive modeling [41].
* Pooja, pooja.1@sharda.ac.in; Salahadin Seid Yassin, salahadincs@yahoo.com | 1Department of Computer Science and Engineering,
Sharda University, Greater Noida 201310, India.
Research Article SN Applied Sciences (2020) 2:1576 | https://doi.org/10.1007/s42452-020-3125-1
Much literature has explained the causes of road traffic accident severity in different countries [7, 9, 36, 37, 40, 43, 45]. However, road traffic accident severity prediction research is still developing. In previous studies, we have seen room for a hybrid machine learning approach to improve classification accuracy. To fill the stated gap, we work on a hybrid machine learning approach for road accident classification to improve the effectiveness of prediction accuracy. Previous studies mainly worked on the performance of machine learning-based classification approaches; however, there is a dearth of comparisons among state-of-the-art algorithms, hybrid machine learning, and deep learning algorithms. Sometimes, obtaining a suitable approach makes prediction accuracy more informative; hence, finding the best paradigm helps to identify the most determinant road accident factors. Furthermore, target-specific contributing factors were not considered and identified previously.

The study used hybrid clustering and classification algorithms to predict road accident severity. In this work, a new hybrid K-means and random forest algorithm is proposed to predict target-specific road accident severity. The proposed approach was compared with individual classifiers to measure the performance of the developed model. Accuracy, precision, specificity, and recall were used to compare the new approach with conventional techniques (SVM, KNN, LR, and RF). The new approach is composed of the following phases: (I) removing disturbing noise and filling missing data using the mean for numeric variables and the mode for categorical variables, (II) splitting the dataset into training and test sets, (III) creating a new feature using clustering, (IV) training classifiers, and (V) finally evaluating the performance of the individual classifiers. Moreover, the proposed approach was compared with a deep neural network to evaluate it further against another state-of-the-art classification technique. The evaluation outcome showed the proposed approach performs better than the other classifiers on the classification and performance metrics.

The rest of the study is organized as follows: In Sect. 2, existing research in road accident classification concerning the machine learning approach is discussed. In Sect. 3, the new hybrid machine learning method using k-means and random forest is presented. The experiment, evaluation, and discussion are summarized in Sect. 4. In Sect. 5, results and analysis derived from the experiment are explained, and finally the conclusion is presented in Sect. 6.

2 Literature review

In the area of road safety, traditional statistical model-based techniques were used to predict accident fatality and severity. The mixed logit modeling approach [23, 26], ordered probit model [54], and logit model [11] are a few of the adopted conventional statistical-based studies. Some studies believed the conventional statistical models better identify dependent and independent accident factors [31]. But the conventional statistical-based approach lacks the capability to deal with multidimensional datasets [16]. To combat the limitations of traditional statistical models, many studies nowadays use the ML approach due to its predictive supremacy, time efficiency, and informative dimension. In this decade, the ML approach has been employed in the construction industry [48], occupational accidents [41], agriculture [22], educational classification [53], sentiment classification [50], and banking and insurance [46].

On the other hand, in road accident prediction, many studies have been performed using data mining, machine learning, and deep learning algorithms. Among clustering and classification algorithms, K-means, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Decision Tree (DT), Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Logistic Regression (LR) are at the forefront of building accident severity models. Kwon et al. [28] adopted Naïve Bayes (NB) and Decision Tree (DT) on a California dataset collected from 2004 to 2010. The authors used binary regression to compare the performance of the developed models; Naïve Bayes was more sensitive to risk factors than the Decision Tree model.

Sharma et al. [44] analyzed road accident data using SVM and MLP on a limited number of records (300). Besides, the authors used only two independent variables (alcohol and speed), considering them key factors. Eventually, SVM with an RBF kernel gave better accuracy (94%) than MLP (64%). The study showed that driving at high speed after drinking was the main reason for accident occurrence.

Wahab and Jiang [51] carried out a crash accident study on a Ghana dataset using MLP, PART, and SimpleCART, intending to evaluate classifiers and identify the major factors for motorcycle crashes. The authors used Weka tools to compare and analyze the datasets, and InfoGainAttributeEval was applied to see the most influential variable for motorcycle crashes in Ghana. As a result, the SimpleCART model showed better accuracy than the other classification models.

Kumar et al. [27] implemented k-means and Association Rule data mining approaches to identify the frequency of accident severity locations and to extract hidden information. From the total of 158 locations, 87 were selected after removing locations with an accident frequency count of less than 20. Then k-means was applied to cluster them into three groups; the number of clusters was determined by gap statistics. To get rules, they used a minimum support of 5 percent. As a result, curves and slopes on hilly surfaces were revealed as accident-prone locations. Authors worked on the FARS dataset using data mining techniques to combat death and injury severity during 2007.
Road condition This variable shows the situation of the road during the accident. The variable values are: dry, muddy, and wet.

Light condition This variable connotes the light situation on the road during the accident.

Weather condition This variable indicates the climate condition during the accident. The variable values are: rainy, sunny, cold, and windy.

Casualty class This variable indicates the class of the casualty. The variable values are: driver, passenger, pedestrian, cycle driver, and resident.

Casualty age This variable indicates the age of the casualty.

Casualty sex This variable indicates the sex of the casualty, male or female.

Severity This is the target variable and represents three classes, namely: fatal, serious injury, and light injury.

3.1.2 Preprocessing

Raw datasets were dirty, not in a proper format to be understood by computing machines, and gave incomplete information when used as-is. Using such data will reduce the efficiency of the accident severity prediction model. Therefore, irrelevant data need to be removed to obtain quality data. In the study, before building a model, intensive data preprocessing was employed to get meaningful and determinant risk factors: data cleaning, missing value handling, outlier treatment, dealing with absolute values, encoding, and normalization carefully purify the data before use.

3.1.3 Splitting dataset

The raw dataset and the k-means-created feature are split into a training set and a testing set. The training set is used to learn the newly proposed method. On the other hand, the testing set is used to measure the performance of the newly proposed model. In the study, a 70:30 ratio is used to split the raw dataset: 70% is used to train the prediction model, whereas 30% of the dataset is used to evaluate the classification accuracy of the prediction.

3.1.4 Prediction model

A prediction model is mainly used in machine learning techniques to forecast future behavior by analyzing current and historical data.
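The imputation rule of Sect. 3.1.2 (mean for numeric variables, mode for categorical ones) and the 70:30 split of Sect. 3.1.3 can be sketched in plain Python. The record fields (`casualty_age`, `road_condition`) are illustrative, not the study's actual schema:

```python
from statistics import mean, mode

def impute(rows, numeric_key, categorical_key):
    """Fill missing numeric values with the column mean and
    missing categorical values with the column mode."""
    nums = [r[numeric_key] for r in rows if r[numeric_key] is not None]
    cats = [r[categorical_key] for r in rows if r[categorical_key] is not None]
    num_fill, cat_fill = mean(nums), mode(cats)
    for r in rows:
        if r[numeric_key] is None:
            r[numeric_key] = num_fill
        if r[categorical_key] is None:
            r[categorical_key] = cat_fill
    return rows

def split_70_30(rows):
    """Deterministic 70:30 train/test split."""
    cut = int(len(rows) * 0.7)
    return rows[:cut], rows[cut:]
```

In a real pipeline the fill values fitted on the training portion would also be applied to the test portion, to avoid leakage.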
3.2 K-means technique

The K-means technique [35] is an unsupervised machine learning technique mainly used in statistical data analysis, image processing, signal processing, and information retrieval. The presence of heterogeneity in road accident data may lead to wrong model building and prediction. Unobserved heterogeneity is defined as the presence of critical unseen features correlated with the observed features in model building. To overcome this problem, we engage in clustering our accident dataset. Splitting the dataset based on similarity makes it homogeneous within clusters and heterogeneous between clusters. Besides, clustering in collaboration with classification allows the classifier to train a model in a shorter time, more accurately, and with less computational memory when dealing with a massive dataset [29]. The K-means technique takes M data points in N dimensions as input, with k initial cluster centroids, where k is user-defined to determine the total number of clusters; after calculating their distances from each cluster center, the data points are assigned to their nearest cluster (Hartigan and Wong [24]). All points within a cluster are closer in distance to their own centroid than to any other centroid. The primary goal of the K-means technique is to reduce the Euclidean distance D(Xi, Cj) between each point and its centroid; as a result, intra-cluster variance is reduced and intra-cluster similarity increases. The squared error function is represented in Eq. 1:

f(x) = Σ_{j=1..k} Σ_{i=1..n} |X_i − C_j|²   (1)

where k is the number of clusters, n is the number of cases, C_j is the jth centroid, and X_i is the data point whose Euclidean distance from the centroid is calculated. The K-means algorithm has initialization and iteration phases. In the first phase, data points are randomly assigned to k clusters; then, in the iteration phase, the algorithm calculates the distance between each data point and each cluster center; finally, the algorithm converges when each road accident data point is assigned to its nearest cluster [24]. The K-means algorithm works as follows:

1. Randomly initialize and select the Cj centroids.
2. Calculate the distance between each instance and the Cj centroids.
3. Compute the mean of the data points in each cluster to find its centroid.
4. Repeat the aforementioned steps until each point is assigned to its nearest cluster.

3.3 Random forest

Random forest is an ensemble classification technique, proposed by Breiman and Adele Cutler, that works by building multiple uncorrelated decision trees [13]. It is one of the most robust algorithms for predicting on large datasets. A single decision tree is prone to overfitting, but random forest uses multiple trees to reduce overfitting [13]: it creates many shallow, random-subset trees and then combines or aggregates the subtrees. Also, when employed on large datasets, it gives more accurate predictions and does not relinquish its accuracy when it faces missing data. Random forest combines multiple decision trees during training and then takes their aggregate to build the model. Therefore, weak estimators improve when they are combined; even if some of the decision trees are weak, the overall output tends to be accurate. Figure 2 illustrates a sample random forest implementation.

3.4 Proposed approach

These days, road accident datasets are stored in vast database repositories. A large dataset makes the training and testing phases more complicated and reduces prediction efficiency. Therefore, a powerful model is needed to overcome or minimize the complexity of a huge amount of data. We developed a hybrid K-means and random forest model to get a more efficient predictive model to
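The four K-means steps listed in Sect. 3.2 can be sketched in plain Python (Lloyd's algorithm on made-up 2-D points; an illustration, not the study's implementation):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Assign each point to its nearest centroid, then recompute each
    centroid as the mean of its cluster, until assignments settle."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:               # step 2: distance to each centroid
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        new_centroids = []
        for j, cl in enumerate(clusters):  # step 3: mean of each cluster
            if cl:
                new_centroids.append(tuple(sum(xs) / len(cl) for xs in zip(*cl)))
            else:
                new_centroids.append(centroids[j])
        if new_centroids == centroids:     # step 4: repeat until convergence
            break
        centroids = new_centroids
    return centroids, clusters
```

Each iteration reduces the squared-error objective of Eq. 1 until the cluster assignments stop changing.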
4.1 Dataset manipulation
Precision = TP / (TP + FP)   (5)

F1 Score = 2 × (Precision × Recall) / (Precision + Recall)   (6)

TP: the prediction is positive and it is actually true.
TN: the prediction is negative and it is actually true.
FP: the prediction is positive and it is actually false.
FN: the prediction is negative and it is actually false.

Here TP implies true positive, TN denotes true negative, FP indicates false positive, and FN denotes false negative. In the actual study, true values are represented by true and false, whereas predicted values are denoted by positive and negative.

5.2 Choosing k

There is no specific solution for finding the exact value of k to partition the training dataset. For each k, we can initialize k-means and use the inertia attribute to obtain the sum of squared distances of the training set to the nearest cluster center. When k increases, the sum of squared distances leans towards zero and the percentage of variance increases, as shown in Fig. 4a, b. If we use k at its maximum value in a training set of M instances, each training instance forms its own cluster. Figure 5a is a plot of the sum of squared distances for each k. If the plot looks like an arm, then the elbow of the arm is the optimal k. However, from the graph, the elbow is not clear enough to determine the optimal value of k. We therefore created a line joining the first and last points (i.e., k = 1 and k = 9; Fig. 5b illustrates the line connecting them) and calculated the distance between each cluster point and the line to find the maximum distance. Figure 6a, b shows the distance values of each k point from the line. The maximum distance is at index 2 (i.e., 3.63), so we can say that the optimal value of k is three, and the road accident dataset was clustered into three groups based on this experimentation.
Fig. 4 a The average distance within clusters (SSD) and b the percentage of variance between clusters
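The distance-to-line rule of Sect. 5.2 can be sketched in plain Python: for each k, compute the perpendicular distance of the point (k, inertia) from the straight line joining the first and last points, and take the k with the maximum distance. The inertia values in the test are made up for illustration:

```python
def best_k_by_line_distance(inertias):
    """Pick k where the (k, inertia) point lies farthest from the
    straight line joining the first (k=1) and last points."""
    n = len(inertias)
    x1, y1 = 1, inertias[0]        # first point of the line
    x2, y2 = n, inertias[-1]       # last point of the line
    best_i, best_d = 0, -1.0
    for i, y in enumerate(inertias):
        x = i + 1
        # perpendicular distance from (x, y) to the line (x1, y1)-(x2, y2)
        num = abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1)
        den = ((y2 - y1) ** 2 + (x2 - x1) ** 2) ** 0.5
        d = num / den
        if d > best_d:
            best_i, best_d = i, d
    return best_i + 1, best_d      # k is 1-based
```

In practice, axes are often normalized before computing the distances, since inertia and k live on very different scales.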
Fig. 5 a Elbow technique for optimal k, b line created from k = 1 to k = 9
Fig. 6 a, b Calculated distance values from each k or cluster to the line (value of k = 3)
5.3 Model performance evaluation

In this section, the evaluation of the performance and reliability of the model, and the comparison of the proposed approach with the conventional models, are discussed briefly.

The study employed the k-means algorithm on the raw accident dataset to cluster it into three groups based on the given k value. The newly created cluster was used as a new feature and added to the training set. An experiment was performed on both the raw dataset and the new-feature-added training set. When we employ the unsupervised K-means algorithm on the raw dataset, it scores an accuracy of 42.25%, whereas the supervised machine learning algorithms logistic regression, random forest, support vector machine, and k-Nearest Neighbors scored 86.83%, 87.77%, 68.45%, and 64.97% on the raw dataset, respectively. When the unsupervised and supervised techniques were applied to the new-feature-added training set, K-means, logistic regression, random forest, support vector machine, and k-Nearest Neighbors scored accuracies of 35.83%, 99.13%, 99.86%, 73.13%, and 68.58%, respectively. The experiment revealed that, on both datasets, all classification algorithms perform well on all evaluation metrics except the k-means classifier. An excellent performer on both datasets is random forest. Especially on the cluster-added dataset, the random forest algorithm performed very well: its accuracy dramatically improved to 99.86%. Each supervised machine learning classifier achieved a promising result; random forest accuracy, in particular, heightened from 87.77% to 99.86%. But the unsupervised k-means classifier performed somewhat better on the raw dataset. In Table 1, the performance evaluation of each model is presented before and after adding the new feature to the dataset. The results revealed that the proposed model has better performance than the other models. Table 2 presents the execution time of each model; notably, the KNN model had less execution time than the other models.

In the study, the proposed approach was compared with other related studies. Table 3 shows previous papers that worked on road accident severity prediction using different types of methodology. Our proposed hybrid approach used k-means from clustering and random forest
Table 1 Performance evaluation of classifiers and the proposed approach (Precision, Recall, F1 score, and Accuracy per classifier, on the testing set without and with the new feature)
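The per-class metrics reported in Table 1 follow Eqs. 5 and 6. A minimal one-vs-rest sketch of computing them from labels (illustrative labels, not the study's data):

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1 for one class treated as
    'positive' (one-vs-rest), per Eqs. 5 and 6."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    tn = len(y_true) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

For the three severity classes, macro-averaging these one-vs-rest scores gives a single figure per metric.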
Table 2 The execution time of models (ms)

Model        Training time   Testing time
K-means      191             2.57
LR           231             1.29
RF           399             38
SVM          566             134
KNN          9.7             87
K-means-RF   295             5.71

In this paper, we created a baseline ANN for road accident model prediction with different numbers of dense layers. The rectifier and sigmoid activation functions were used in the input and output layers, respectively. Table 4 presents the performance of the Artificial Neural Network (ANN) with different dense layers. Experiment results showed the best test accuracy was seen with two and three dense layers. Both models (Model 1 = 88.77% and Model 2 = 88.77%) achieved better results than the third model (Model 3 = 88.03%). However, as presented in Fig. 7, Model 2 has a relatively lower test loss value (0.3622) than Model 1 (0.3819) and Model 3 (0.3686), although test loss is not as informative in multi-class classification. On the other hand, Fig. 8 presents the AUC metric values of each neural network model with different numbers of dense layers. As can be seen, all models with different dense layers give similar results.

Table 4 Test accuracy, loss, and ROC curve value of the ANN model with multiple dense layers

Model    Dense layers   Test accuracy (%)   Test loss   ROC curve (%)
Model1   2              88.77               0.3819      96.1
Model2   3              88.77               0.3622      96.1
Model3   4              88.03               0.3686      96.1
and Accuracy. Table 5 presents the performance comparison of the proposed and ANN models. The proposed model achieved a better result than the deep neural network.

Table 5 Comparison of ANN and proposed model performance with different metrics (%)

Model type       Precision   Recall   F1 score   Accuracy
ANN              88          88       88         88
Proposed model   100         100      100        99.86
5.6 Random forest interpretation

Random forest builds numerous decision trees over several subsets of the RTA variables. It is commonly called a black box: it is difficult to know how it processes inside the model. Indeed, it comprises many decision trees, and examining each deep tree's decisions and processing is troublesome and improbable, whereas an individual tree learns on bagged data with randomly selected features [13]. So, we can get insights from the random forest by computing feature importance. Before seeing how the random forest works, let us see how a decision tree works. It has a series of decision paths from the root node to the last leaf, each guarded by a sub-feature. A prediction is the sum of the individual feature contributions and the bias (the mean value of the top-most region covered by the training set). The decision tree prediction function is defined as:

f(x) = C_full + Σ_{k=1..K} contrib(x, k)   (7)

where K is the number of features, C_full is the root node value, and contrib(x, k) is the contribution of the kth feature in the feature vector x. Now let us move to the prediction of the random forest, which, as discussed in Sect. 3.3, is the average of its trees' predictions. Therefore, the random forest prediction function is defined as follows:

f(x) = (1/J) Σ_{j=1..J} f_j(x)   (8)

It is pretty clear that the random forest prediction is the average of the tree biases plus the average of each feature's contributions, which can be defined as

f(x) = (1/J) Σ_{j=1..J} C_{j,full} + Σ_{k=1..K} [ (1/J) Σ_{j=1..J} contrib_j(x, k) ]   (9)

The above expression explains how the random forest black box is processed: by following the decision routes through the trees and computing the contributions of the individual features. Knowing whether a predictive variable relates to the prediction negatively or positively helps to understand detailed information about the model and the influence of each variable on the outcome.

In the experiment, the default parameter setup was used to implement the random forest algorithm. We have seen that day, driver experience, type of vehicle, location, light condition, casualty age, and casualty sex are strong contributors for serious injury; light condition, casualty class, casualty age, and casualty sex are contributors for light injury; whereas driver age, service year, weather condition, and casualty class are strong contributors for fatal accident severity.
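The decomposition of Eq. 7 can be illustrated with a tiny hand-built regression tree (hypothetical thresholds and node means, not the paper's fitted forest): along the decision path, each split contributes the change in node mean it causes, so the leaf prediction equals the root mean (the bias) plus the summed contributions.

```python
# Node layout: (feature_index, threshold, node_mean, left, right); leaves use None.
TREE = (0, 5.0, 0.50,                    # root: split on feature 0, mean 0.50 (bias)
        (1, 2.0, 0.30,                   # left child, mean 0.30
         (None, None, 0.10, None, None),
         (None, None, 0.40, None, None)),
        (None, None, 0.80, None, None))  # right leaf, mean 0.80

def predict_with_contribs(tree, x):
    """Walk the decision path; per Eq. 7 the prediction equals the
    root mean (bias) plus per-feature changes in node mean."""
    bias = tree[2]
    contribs = {}
    node = tree
    while node[0] is not None:
        feat, thr, _mean, left, right = node
        child = left if x[feat] <= thr else right
        contribs[feat] = contribs.get(feat, 0.0) + (child[2] - node[2])
        node = child
    return node[2], bias, contribs
```

Averaging such decompositions over all J trees of a forest yields exactly Eq. 9.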
6 Conclusion

In the study, a hybrid-based approach was developed to predict severity on the RTA dataset. The approach is competitive and better than traditional machine learning algorithms. K-means was used to create a new cluster feature that was added to the training set, and it showed a convincing result when combined with classification algorithms. In the paper, K-means was used to group the road accident dataset based on similarity, and random forest was employed to classify road accident factors into the severity variable. The combination of K-means with random forest outperforms the other conventional models, namely Logistic Regression, k-Nearest Neighbors, and Support Vector Machine. The technique improves classification accuracy: the gains for logistic regression, random forest, support vector machine, and k-nearest neighbors are 12.3, 12.09, 4.8, and 3.61 percentage points, respectively. On the contrary, k-means decreased its accuracy value by 6.42. The experimental result revealed that adding a new cluster feature to the training set has a strong impact on improving classification accuracy; random forest got the best accuracy (99.86%). Before clustering and classification, data preprocessing was performed to purify the raw dataset; missing value treatment and conversion of categorical values were done to get a better result. In the paper, the optimal value of k was discovered by calculating the maximum distance from each cluster point to the line joining k1 to kn. Also, to make the prediction model trustworthy, an interpretation was made to understand how the model processes inside to predict: a prediction is a sum of bias and feature contributions. In the experiment, we showed results that give insights into the contributing variables of the prediction model; knowing the influence of individual variables on the prediction makes the model trustworthy. Moreover, in the study, target-specific variable contributions were explained. Overall, the paper tried to show the effects of combining clustering and classification to improve model accuracy and identified major class-specific contributing factors from the collected road traffic accident data. Another dataset will strengthen our model to get a better result.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.
References

1. Abebe Y, Dida T, Yisma E, Silvestri DM (2018) Ambulance use is not associated with patient acuity after road traffic collisions: a cross-sectional study from Addis Ababa, Ethiopia. BMC Emerg Med 18(1):7
2. Abegaz T, Gebremedhin S (2019) Magnitude of road traffic accident related injuries and fatalities in Ethiopia. PLoS ONE 14(1):e0202240
3. Acurna E, Rodriguez C (2004) The treatment of missing values and its effect in the classifier accuracy. In: Classification, clustering, and data mining applications: Proceedings of the meeting of the International Federation of Classification Societies (IFCS), pp 639–647
4. Al-Radaideh QA, Daoud EJ (2018) Data mining methods for traffic accident severity prediction. Int J Neural Netw Adv Appl 5:1–12
5. Alikhani M, Nedaie A, Ahmadvand A (2013) Presentation of clustering-classification heuristic method for improvement accuracy in classification of severity of road accidents in Iran. Saf Sci 60:142–150
6. AlMamlook RE, Kwayu KM, Alkasisbeh MR, Frefer AA (2019) Comparison of machine learning algorithms for predicting traffic accident severity. In: 2019 IEEE Jordan international joint conference on electrical engineering and information technology (JEEIT). IEEE, pp 272–276
7. Ansari S, Akhdar F, Mandoorah M, Moutaery K (2000) Causes and effects of road traffic accidents in Saudi Arabia. Public Health 114(1):37–39
8. Asefa F, Assefa D, Tesfaye G (2014) Magnitude of, trends in, and associated factors of road traffic collision in Central Ethiopia. BMC Public Health 14(1):1072
9. Balogun J, Abereoje O (1992) Pattern of road traffic accident cases in a Nigerian University teaching hospital between 1987 and 1990. J Trop Med Hyg 95(1):23–9
10. Baru A, Azazh A, Beza L (2019) Injury severity levels and associated factors among road traffic collision victims referred to emergency departments of selected public hospitals in Addis Ababa, Ethiopia: the study based on the Haddon matrix. BMC Emerg Med 19(1):2
11. Bedard M, Guyatt GH, Stones MJ, Hirdes JP (2002) The independent contribution of driver, crash, and vehicle characteristics to driver fatalities. Accid Anal Prev 34(6):717–727
12. Beshah T, Hill S (2010) Mining road traffic accident data to improve safety: role of road-related factors on accident severity in Ethiopia. In: 2010 AAAI Spring symposium series
13. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
14. Casado-Sanz N, Guirao B, Attard M (2020) Analysis of the risk factors affecting the severity of traffic accidents on Spanish crosstown roads: the drivers perspective. Sustainability 12(6):2237
15. Castro Y, Kim YJ (2016) Data mining on road safety: factor assessment on vehicle accidents using classification models. Int J Crashworthiness 21(2):104–111
16. Chen WH, Jovanis PP (2000) Method for identifying factors contributing to driver-injury severity in traffic crashes. Transp Res Rec 1717(1):1–9
17. Deme D (2019) Road traffic accident in Ethiopia from 2007/08-2017/18. Am Int J Sci Eng Res 2(2):49–59
18. Fentahun A (2011) Mining road traffic accident data for predicting accident severity to improve public health-role of driver and road factors in the case of Addis Ababa. PhD thesis, Addis Ababa
19. … road traffic accidents: the case of Addis Ababa city. Addis Ababa University
20. Gissane W (1965) Accidents: a modern epidemic. J Inst Health Educ 3(1):16–18
21. Gu X, Li T, Wang Y, Zhang L, Wang Y, Yao J (2018) Traffic fatalities prediction using support vector machine with hybrid particle swarm optimization. J Algorithms Comput Technol 12(1):20–29
22. Habib MT, Majumder A, Jakaria A, Akter M, Uddin MS, Ahmed F (2018) Machine vision based papaya disease recognition. J King Saud Univ Comput Inf Sci 32(3):300–309
23. Haleem K, Alluri P, Gan A (2015) Analyzing pedestrian crash injury severity at signalized and non-signalized locations. Accid Anal Prev 81:14–23
24. Hartigan JA, Wong MA (1979) Algorithm AS 136: a k-means clustering algorithm. J R Stat Soc Ser C (Appl Stat) 28(1):100–108
25. Hordofa GG, Assegid S, Girma A, Weldemarium TD (2018) Prevalence of fatality and associated factors of road traffic accidents among victims reported to Burayu town police stations, between 2010 and 2015, Ethiopia. J Transp Health 10:186–193
26. Kim JK, Ulfarsson GF, Shankar VN, Mannering FL (2010) A note on modeling pedestrian-injury severity in motor-vehicle crashes with the mixed logit model. Accid Anal Prev 42(6):1751–1758
27. Kumar S, Toshniwal D (2016) A data mining approach to characterize road accident locations. J Mod Transp 24(1):62–72
28. Kwon OH, Rhee W, Yoon Y (2015) Application of classification algorithms for analysis of road safety risk factor dependencies. Accid Anal Prev 75:1–15
29. Kyriakopoulou A, Kalamboukis T (2008) Combining clustering with classification for spam detection in social bookmarking systems. In: ECML PKDD
30. Laytin AD, Seyoum N, Kassa S, Juillard CJ, Dicker RA (2020) Patterns of injury at an Ethiopian referral hospital: using an institutional trauma registry to inform injury prevention and systems strengthening. Afr J Emerg Med 10(2):58–63
31. Lee C, Saccomanno F, Hellinga B (2002) Analysis of crash precursors on instrumented freeways. Transp Res Rec 1784(1):1–8
32. Lee N, Kim JM (2010) Conversion of categorical variables into numerical variables via Bayesian network classifiers for binary classifications. Comput Stat Data Anal 54(5):1247–1265
33. Leka S, Griffiths A, Cox T, World Health Organization et al (2003) Work organisation and stress: systematic problem approaches for employers, managers and trade union representatives. World Health Organization, Geneva
34. Li L, Shrestha S, Hu G (2017) Analysis of road traffic fatal accidents using data mining techniques. In: 2017 IEEE 15th international conference on software engineering research, management and applications (SERA). IEEE, pp 363–370
35. MacQueen J et al (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Oakland, CA, USA, vol 1, pp 281–297
36. Odero W, Khayesi M, Heda P (2003) Road traffic injuries in Kenya: magnitude, causes and status of intervention. Inj Control Saf Promot 10(1–2):53–61
37. Persson A (2008) Road traffic accidents in Ethiopia: magnitude, causes and possible interventions. Adv Transp Stud 15:5–16
38. Powers DMW (2011) Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. J Mach Learn Technol 2(1):37–63
39. Regassa Z (2009) Determining the degree of driver’s responsibil-
University
ity for car accident: the case of Addis Ababa traffic office. Addis
19. Getnet M (2009) Applying data mining with decision tree and
Ababa University, Addis Ababa
rule induction techniques to identify determinant factors of
40. Sameen MI, Pradhan B (2017) Severity prediction of traffic acci-
drivers and vehicles in support of reducing and controlling
dents with recurrent neural networks. Appl Sci 7(6):476
SN Applied Sciences (2020) 2:1576 | https://doi.org/10.1007/s42452-020-3125-1 Research Article
41. Sarkar S, Vinay S, Raj R, Maiti J, Mitra P (2019) Application of optimized machine learning techniques for prediction of occupational accidents. Comput Oper Res 106:210–224
42. Seid M, Azazh A, Enquselassie F, Yisma E (2015) Injury characteristics and outcome of road traffic accident among victims at Adult Emergency Department of Tikur Anbessa specialized hospital, Addis Ababa, Ethiopia: a prospective hospital based study. BMC Emerg Med 15(1):10
43. Seid S et al (2019) Road accident data analysis: data preprocessing for better model building. J Comput Theor Nanosci 16(9):4019–4027
44. Sharma B, Katiyar VK, Kumar K (2016) Traffic accident prediction model using support vector machines with Gaussian kernel. In: Proceedings of fifth international conference on soft computing for problem solving. Springer, Berlin, pp 1–10
45. Singh SK (2017) Road traffic accidents in India: issues and challenges. Transp Res Procedia 25:4708–4719
46. Sundarkumar GG, Ravi V (2015) A novel hybrid undersampling method for mining unbalanced datasets in banking and insurance. Eng Appl Artif Intell 37:368–377
47. Tiwari P, Kumar S, Kalitin D (2017) Road-user specific analysis of traffic accident using data mining techniques. In: International conference on computational intelligence, communications, and business analytics. Springer, Berlin, pp 398–410
48. Tixier AJP, Hallowell MR, Rajagopalan B, Bowman D (2016) Application of machine learning to construction injury prediction. Autom Constr 69:102–114
49. Tulu GS (2015) Pedestrian crashes in Ethiopia: identification of contributing factors through modelling of exposure and road environment variables. PhD thesis, Queensland University of Technology
50. Vinodhini G, Chandrasekaran R (2016) A comparative performance evaluation of neural network based approach for sentiment classification of online reviews. J King Saud Univ Comput Inf Sci 28(1):2–12
51. Wahab L, Jiang H (2019) Severity prediction of motorcycle crashes with machine learning methods. Int J Crashworthiness 24:1–8
52. Xiao J (2019) SVM and KNN ensemble learning for traffic incident detection. Phys A 517:29–35
53. Yahya AA (2017) Swarm intelligence-based approach for educational data classification. J King Saud Univ Comput Inf Sci 31(1):35–51
54. Zajac SS, Ivan JN (2003) Factors influencing injury severity of motor vehicle-crossing pedestrian crashes in rural Connecticut. Accid Anal Prev 35(3):369–379

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.