Road Accident Prediction and Model Interpretation Using A Hybrid K Means and Random Forest Algorithm Approach

Research Article
Salahadin Seid Yassin1 · Pooja1
Received: 14 February 2020 / Accepted: 22 June 2020 / Published online: 28 August 2020
© Springer Nature Switzerland AG 2020
Abstract
Road accident severity is a major concern worldwide, particularly in underdeveloped countries. Understanding the primary and contributing factors may help reduce road traffic accident severity. This study identified insights and the most significant target-specific contributing factors for road accident severity. To find the most determinant road accident variables, a hybrid K-means and random forest (RF) approach was developed. K-means extracts hidden information from road accident data and creates a new feature in the training set. The distance between each cluster and the line joining k1 and k9 was calculated, and the k with the maximum distance was selected as the optimal value for partitioning the training set. RF was employed to classify severity. Compared with other classification techniques, the proposed approach achieved an accuracy of 99.86%. The target-specific model interpretation showed that driver experience and day, light condition, driver age, and service year of the vehicle were the strong contributing factors for serious injury, light injury, and fatal severity, respectively. The outcome demonstrates the predictive supremacy of the approach in road accident prediction. Road transport and insurance agencies will benefit from the study when developing road safety strategies.
Keywords Clustering · Classification · Model interpretation · Hybrid model · Road safety
1 Introduction

Road traffic accidents (RTAs) afflict the world, killing thousands and destroying property every day without discrimination, yet little attention has been given to mitigating their severity. They are among the most life-threatening incidents in the world and a cause of death and property damage. Identifying the primary road traffic accident factors will help to provide an appropriate solution to minimize the adverse effect of severity on human and property loss. Road accident severity does not occur by chance: it has patterns and can be predicted and avoided. So, accidents are “events which can be examined, analyzed, and prevented” [20]. According to workers’ health organization, accidents are described as follows: “Fatalities are not fated; accident does not just happen; illness is not random; they are caused” [33]. Traffic accidents occur daily in Addis Ababa, the capital city of Ethiopia. Human life and property are damaged within a fraction of a second. RTAs are one of the leading causes of death in the country.

RTA severity has been one of the research areas in road safety over the past two decades. Researchers have used interesting methods to build road accident severity classification models. Earlier authors used traditional statistical approaches for model building. These techniques help to get insights and identify the underlying causes of vehicle accidents and related road safety factors. These days, due to the availability of massive volumes of data, machine learning surpasses conventional statistical approaches in prediction [41].
* Pooja, pooja.1@sharda.ac.in; Salahadin Seid Yassin, salahadincs@yahoo.com | 1Department of Computer Science and Engineering,
Sharda University, Greater Noida 201310, India.
SN Applied Sciences (2020) 2:1576 | https://doi.org/10.1007/s42452-020-3125-1
Many studies in different countries have explained the causes of road traffic accident severity [7, 9, 36, 37, 40, 43, 45]. However, road traffic accident severity prediction research is still developing. In previous studies, we saw room for a hybrid machine learning approach to improve classification accuracy. To fill this gap, we work on a hybrid machine learning approach for road accident classification to improve prediction accuracy. Previous studies mainly examined the performance of machine learning-based classification approaches. However, there is a dearth of comparisons among state-of-the-art algorithms, hybrid machine learning, and deep learning algorithms. Obtaining a suitable approach can make prediction results more informative, and finding the best paradigm helps to identify the most determinant road accident factors. Furthermore, target-specific contributing factors have not previously been considered or identified. This study used hybrid clustering and classification algorithms to predict road accident severity. In this work, a new hybrid K-means and random forest algorithm is proposed to predict target-specific road accident severity. The proposed approach is compared with individual classifiers to measure the performance of the developed model. Accuracy, precision, specificity, and recall are used to compare the new approach with conventional techniques (SVM, KNN, LR, and RF). The new approach is composed of the following phases: (I) removing disturbing noise and filling missing data using the mean for numeric variables and the mode for categorical variables, (II) splitting the dataset into training and test datasets, (III) creating a new feature using clustering, (IV) training classifiers, and (V) evaluating the performance of individual classifiers. Moreover, the proposed approach is compared with a deep neural network to evaluate it further against another state-of-the-art classification technique. The evaluation outcome showed that the proposed approach performs better than the other classifiers on the classification performance metrics.

The rest of the study is organized as follows. In Sect. 2, existing research on road accident classification using machine learning is discussed. In Sect. 3, the new hybrid machine learning method using K-means and random forest is presented. The experiment, evaluation, and discussion are summarized in Sect. 4. In Sect. 5, the results and analysis derived from the experiment are explained. Finally, the conclusion is presented in Sect. 6.

2 Literature review

In the area of road safety, traditional statistical model-based techniques were used to predict accident fatality and severity. The mixed logit modeling approach [23, 26], the ordered probit model [54], and the logit model [11] are a few of the conventional statistical approaches adopted. Some studies held that conventional statistical models better identify dependent and independent accident factors [31]. But the conventional statistical approach lacks the capability to deal with multidimensional datasets [16]. To overcome the limitations of traditional statistical models, many studies nowadays use the ML approach because of its predictive supremacy, lower time consumption, and informative dimension. In this decade, the ML approach has been employed in the construction industry [48], occupational accidents [41], agriculture [22], educational classification [53], sentiment classification [50], and banking and insurance [46].

On the other hand, in road accident prediction, many studies have been performed using data mining, machine learning, and deep learning algorithms. Among clustering and classification algorithms, K-means, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Decision Tree (DT), Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Logistic Regression (LR) are at the forefront for building accident severity models. Kwon et al. [28] adopted Naive Bayes (NB) and Decision Tree (DT) on a California dataset collected from 2004 to 2010. The authors used binary regression to compare the performance of the developed models; Naive Bayes was more sensitive to risk factors than the Decision Tree model.

Sharma et al. [44] analyzed road accident data using SVM and MLP on a limited dataset (300 records). In addition, the authors used only two independent variables (alcohol and speed), which they considered the key factors. SVM with an RBF kernel gave better accuracy (94%) than MLP (64%). The study showed that driving at high speed after drinking was the main reason for accident occurrence.

Wahab and Jiang [51] analyzed crash accidents in a Ghana dataset using MLP, PART, and SimpleCART, intending to evaluate classifiers and to identify the major factors in motorcycle crashes. The authors used Weka tools to compare and analyze the datasets and applied InfoGainAttributeEval to find the most influential variable for motorcycle crashes in Ghana. The SimpleCART model showed better accuracy than the other classification models.

Kumar et al. [27] implemented K-means and association rule mining to identify the frequency of accident severity locations and to extract hidden information. Of the total 158 locations, 87 were selected after removing locations with an accident frequency count of less than 20. K-means was then applied to cluster them into three groups, with the number of clusters determined by the gap statistic. To obtain rules, they used a minimum support of 5 percent. As a result, curves and slopes on hilly surfaces were revealed as accident-prone locations. Other authors worked on the FARS dataset for 2007 using data mining techniques to combat death and injury severity.
After preprocessing, the study applied clustering, association rules, and Naive Bayes to find trends in fatal accidents in the USA. The study identified human factors and collision types as the main causes of the fatality rate [34]. Other studies used clustering and classification techniques to build an accurate prediction model in Iran. That research mainly focused on combining K-means clustering with self-organizing maps to get better classification accuracy than ANN and ANFIS. The authors' preferred model performed better than the single classifiers [5]. AlMamlook et al. [6] used AdaBoost, Naive Bayes, Logistic Regression, and random forest to find determinant factors and to identify high-risk highways for Michigan traffic agencies. ROC, AUC, precision, recall, and F1-score were applied to evaluate the models. The study showed that random forest outperforms the other classifiers with an accuracy of 75.5%. Tiwari et al. [47] conducted a data mining study to analyze causality class in traffic accidents. The authors implemented clustering techniques such as K-modes and SOM and classification techniques such as NB, DT, and SVM. Better accuracy was obtained on the clustered dataset than with classification alone.

For existing studies on road accident severity in Ethiopia, see [2, 8, 10, 17, 25, 49]. These works were mostly concerned with road accident analysis and pedestrian severity in Ethiopia using statistical methods. On the other hand, some studies employed data mining techniques (Decision Tree and MLP) in the Weka tool, focusing mainly on driver responsibility [39]. Another study employed the J48 and PART data mining algorithms in the Weka tool on driver and vehicle information, which it considered a major risk in accident severity [19]. In other related work in the country, Beshah [12] sought to identify the key roadway-related variables for accident severity in Ethiopia. The authors used a data mining approach (Decision Tree, Naive Bayes, and KNN) to develop decision rules to improve road safety. Their focus was on analyzing driver and pedestrian crashes, without giving much attention to how machine learning accuracy affects the identification of the major risks influencing road accidents in Ethiopia. At this time there is a great need for more road safety prevention studies due to the growth of crashes. There is still room for improvement in the prediction accuracy of RTA severity in the case of Ethiopia. Therefore, we tried to develop a new hybrid approach to classify road accident severity by combining clustering with classification, which gives remarkable classification results in road accident prediction. In this cooperation, clustering minimizes the sample dataset and classification predicts road traffic severity. In this vein, clustering cooperates indirectly with classification by extracting hidden information from the training set to improve classifier performance.

3 Methodology

The study mainly concerns variable-based classification of road accident severity. It combines K-means clustering with a classification technique to get a better result than an individual classifier. K-means is employed to create new features, and random forest is used for classification. The workflow of the proposed approach is shown in Fig. 1. The major components of the flowchart are as follows.

3.1 Road accident dataset manipulation

3.1.1 Raw traffic accident dataset

The dataset in this study comprises 5000 road traffic accidents collected from federal traffic police agency reports from 2011 to 2018 in Addis Ababa, Ethiopia. One of the challenging parts of this research was collecting sample data from the organization. The original dataset collected from the authority was handled manually. Most of the fields are incomplete. Some of the fields are useless, and some vital fields were not included in the manual documents. The documents were hard to read, as they were written either heedlessly or in a rush. We were forced to record the data in Excel format to ease analysis and prediction. For each accident, 14 variables (ten categorical variables and four numeric variables) were recorded. Among these variables, severity class is the target variable with three values (fatal, severe injury, and light injury). The full dataset is described as follows:

Accident time: This variable gives the time at which the road traffic accident occurred during the day (24 hours).
Driver age: This variable shows the age of the driver. Driver ages are mainly in the range of 18 to 80 years.
Sex: This variable indicates the driver’s sex. The driver’s sex (as observed in the data collection) is either male or female.
Drivers experience: This variable indicates the driver’s experience. It mainly represents how long the driver has been driving a car.
Type of vehicle: This variable gives the type of vehicle, namely: ambulance, car, automobile, Isuzu, taxi, truck, motorcycle, pick up, bus, and minibus.
Service year: This variable indicates the years of service the vehicle has given to the owner.
Location: This variable indicates where the accident occurred, namely: canteen area, public area, organization, government office, hospital, college, vehicle station, and market living area.
Fig. 1 Flowchart of the proposed model framework for predicting road traffic accidents: case of Ethiopia
Road condition: This variable shows the situation of the road during the accident, namely: dry, muddy, and wet.
Light condition: This variable denotes the situation of the road during the accident, namely: dry, muddy, and wet.
Weather condition: This variable indicates the climate condition during the accident, namely: rainy, sunny, cold, and windy.
Causality class: This variable indicates the class of the person affected by the accident, namely: driver, passenger, pedestrian, cycle driver, and resident.
Causality age: This variable indicates the age of the affected person.
Causality sex: This variable indicates the sex of the affected person, male or female.
Severity: This is the target variable and represents three classes, namely: fatal, serious injury, and light injury.

3.1.2 Preprocessing

The raw datasets were dirty, not in a format that computing machines can process, and too incomplete to use as they were. Using such datasets would reduce the efficiency of the accident severity prediction model. Therefore, irrelevant data need to be removed to obtain quality data. In the study, before building a model, intensive data preprocessing techniques such as data cleaning, missing value handling, outlier treatment, absolute value handling, encoding, and normalization were carefully applied to purify the data and obtain meaningful and determinant risk factors.

3.1.3 Splitting dataset

The raw dataset and the K-means-created feature are split into training and testing sets. The training set is used to fit the newly proposed method, and the testing set is used to measure its performance. In the study, a 70:30 ratio is used to split the raw dataset: 70% is used to train the prediction model, whereas 30% is used to evaluate its classification accuracy.

3.1.4 Prediction model

A prediction model is mainly used in machine learning to forecast future behavior by analyzing current and historical data.

3.2 K‑means techniques

The K-means technique [35] is an unsupervised machine learning technique mainly used in statistical data analysis, image processing, signal processing, and information retrieval. The presence of heterogeneity in road accident data may lead to wrong model building and prediction. Unobserved heterogeneity is defined as the presence of critical unseen features correlated with the observed
feature in a model building. To overcome this problem, we apply clustering to our accident dataset. Splitting the dataset based on similarity makes the data homogeneous within clusters and heterogeneous between clusters. Besides, clustering in collaboration with classification lets the classifier train a model in a shorter time, with better accuracy and less memory, when dealing with a massive dataset [29]. The K-means technique takes M data points in N dimensions as input together with k initial cluster centroids, where k is user-defined and determines the total number of clusters; after the distance to each centroid is calculated, each data point is assigned to its nearest cluster (Hartigan and Wong [24]). All points within a cluster are closer to their own centroid than to any other centroid. The primary goal of the K-means technique is to reduce the Euclidean distance D(Xi, Cj) between each point and its centroid. As a result, intra-cluster variance is reduced and intra-cluster similarity increases. The squared error function is given in Eq. 1.

$$f(x) = \sum_{j=1}^{k} \sum_{i=1}^{n} \lvert X_i - C_j \rvert^{2} \tag{1}$$

where k is the number of clusters, n is the number of cases, C_j is the centroid of cluster j, and X_i is a data point whose Euclidean distance from the centroid is calculated. The K-means algorithm has initialization and iteration phases. In the first phase, data points are assigned randomly to k clusters. In the iteration phase, the algorithm calculates the distance between each data point and each cluster center. Finally, the algorithm converges when each road accident data point is assigned to its nearest cluster [24]. The K-means algorithm works as follows:

1. Randomly initialize and select the C_j centroids.
2. Calculate the distance between each instance and the C_j centroids.
3. Compute the mean of the data points in each cluster to find the new centroid.
4. Repeat steps 2 and 3 until each point is assigned to its nearest cluster.

3.3 Random forest

Random forest is an ensemble classification technique proposed by Leo Breiman and Adele Cutler that works by building multiple uncorrelated decision trees [13]. It is one of the robust algorithms for prediction on large datasets. A single decision tree is prone to overfitting, but random forest uses multiple trees to reduce overfitting [13]. The random forest creates many shallow trees on random subsets and then aggregates them to avoid overfitting. When employed on large datasets, it also gives more accurate predictions and maintains its accuracy when a portion of the data is missing. Random forest combines multiple decision trees during training and then takes their aggregate to build the model. Weak estimators therefore improve when they are combined: even if some of the decision trees are weak, their overall output tends to be accurate. Figure 2 illustrates a sample random forest implementation.

Fig. 2 Sample random forest (n-estimator = 5)

3.4 Proposed approach

These days, road accident datasets are stored in vast database repositories. A large dataset makes the training and testing phases more complicated and reduces prediction efficiency. Therefore, a powerful model is needed to overcome or minimize the complexity of a huge dataset. We developed a hybrid K-means and random forest model to enhance the efficiency and accuracy of the prediction model. K-means is an unsupervised machine learning algorithm normally used to find similar groups within a dataset. Even though it is an unsupervised technique, K-means can create new features for the training set to improve the performance of a classifier. Clustering creates a cluster feature and adds it to the training set. Random forest is then employed on the clustered training data to classify the severity of RTAs. Their combination produces a powerful prediction model in terms of generalization performance and predictive accuracy.
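The combination described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn and NumPy, uses synthetic stand-in data instead of the Addis Ababa dataset, and the helper name `add_cluster_feature` is hypothetical. Fitting K-means on the training split only is also an assumption made here to avoid leaking test information; the paper does not state at which point the clustering was fitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def add_cluster_feature(X_train, X_test, k=3, seed=0):
    """Fit K-means on the training set and append the cluster id as a new column.

    Hypothetical helper: the clustering is fitted on the training split only
    (an assumption), then used to label both splits.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_train)
    train_aug = np.column_stack([X_train, km.predict(X_train)])
    test_aug = np.column_stack([X_test, km.predict(X_test)])
    return train_aug, test_aug


# Synthetic stand-in data (not the paper's dataset): 600 rows, 6 numeric
# features, and three severity classes encoded as 0, 1, 2.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=600)
X = rng.normal(size=(600, 6)) + y[:, None]  # shift features by class to add signal

# 70:30 train-test split, as used in the study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

# k = 3 is the number of clusters the study selected.
X_train_aug, X_test_aug = add_cluster_feature(X_train, X_test, k=3)

# Random forest trained on the original features plus the cluster feature.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train_aug, y_train)
accuracy = accuracy_score(y_test, rf.predict(X_test_aug))
print(f"accuracy on synthetic data: {accuracy:.3f}")
```

The accuracy printed here only reflects the synthetic data and says nothing about the figures reported in the study; the point of the sketch is the order of operations (split, cluster, append feature, classify).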
4 Experiment, evaluation, and discussion
In this section, the preprocessing techniques applied to the road accident dataset, the evaluation metrics, and the experimental result analysis are presented.
4.1 Dataset manipulation

The dataset collected from Addis Ababa City is not entirely clear and organized. The raw dataset was recorded manually and is prone to damage. However, it must be in a machine-understandable format to yield meaningful information and to support an efficient intelligent system. The road accident severity prediction model depends on the quality of the dataset. We used different types of data preprocessing techniques to clean the dataset.

• Missing value handling: Missing value treatment is a mandatory task in data preprocessing. Before building a model, missing values need to be filled using a suitable strategy. In the dataset, some attribute values are missing. Building a well-performing prediction model on incomplete data will not give a decisive output. Missing values ought to be handled wisely and either ignored or filled using different methods to get a better result [43]. Ignoring or dropping values is one approach to handling missing values, but dropping may lose valuable information. In this study, attributes with missing values were not dropped. Figure 3 presents the number of missing values and their percentage of the total dataset. Missing values account for less than 50 percent of the total records. We substituted the feature mean for numeric variables and the most frequent value (mode) for categorical variables [3]. For further details on preprocessing, see our previous work [43].
• Categorical value encoding: The raw traffic accident dataset consists of categorical and numeric values. However, many machine learning algorithms require numeric values to build a model, and applying a machine learning algorithm to categorical values is a challenging problem. Therefore, categorical values should either be converted into numeric values or be removed [32]. In the dataset, most of the variables are categorical: among the 14 variables, 10 are categorical and needed to be transformed into a numeric format. Predictive variables and the target variable were converted into numeric form using one-hot encoding and label encoding, respectively.

Fig. 3 The number of missing values and their percentage in the RTA dataset

4.2 Experimental system set up

The study was implemented using Python 3.7 with Jupyter Notebook as the IDE on a system with an Intel Core i7 1.80 GHz CPU, 8 GB RAM, and a 1 TB hard disk. In this section, different experiments are presented: choosing an optimal value of k, evaluating the proposed approach, and comparing conventional algorithms with the new approach.

4.3 Evaluation metrics

In the study, different evaluation metrics are used to measure the performance of the proposed approach in predicting road accident severity, as indicated in Eqs. 2
to 6, namely: accuracy, recall, specificity, precision, and F1 score [38].

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{2}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$\mathrm{Specificity} = \frac{TN}{FP + TN} \tag{4}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{5}$$

$$\mathrm{F1\ score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{6}$$

TP: the prediction is positive and it is correct.
TN: the prediction is negative and it is correct.
FP: the prediction is positive and it is incorrect.
FN: the prediction is negative and it is incorrect.
Here TP denotes true positive, TN denotes true negative, FP denotes false positive, and FN denotes false negative. Actual values are represented by true and false, whereas predicted values are denoted by positive and negative.

5 Experimental result analysis and discussion

5.1 Train‑test split

Once the dataset is prepared for training, it is split into a training set and a testing set. The former is used to fit the classifier, whereas the latter is used to test the performance of the predictive model. In the study, a 70:30 ratio is applied: 70% of the data is used to train the model and 30% is used as the testing set to evaluate the trained model.

5.2 Choosing k

There is no exact method for finding the value of k with which to partition the training dataset. For each k, we can initialize K-means and use the inertia attribute to obtain the sum of squared distances of the training samples to their nearest cluster center. When k increases, the sum of squared distances tends towards zero and the percentage of variance increases, as shown in Fig. 4a, b. If k is set to its maximum value, the number of samples M in the training set, each sample forms its own cluster. Figure 5a plots the sum of squared distances against k. If the plot looks like an arm, then the elbow of the arm is the optimal k. However, the elbow in the graph is not clear enough to determine the optimal value of k. We therefore created a line joining the first and last points (i.e. k = 1 and k = 9), as illustrated in Fig. 5b. Then we calculated the distance between the point for each k and the line to find the maximum distance. Figure 6a, b shows the distance of each k point from the line. The maximum distance is at index 2 (i.e. 3.63), so we take the optimal value of k to be three, and the road accident dataset is clustered into three groups based on this experiment.
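The line-distance rule described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the function name `optimal_k_by_line_distance` is hypothetical, and the SSD values are made up for the example, not taken from the study. In practice the SSD for each k could be read from the `inertia_` attribute of a fitted scikit-learn `KMeans` model.

```python
import math


def optimal_k_by_line_distance(ks, ssd):
    """Return (best_k, distances).

    best_k is the k whose (k, SSD) point lies farthest from the straight
    line joining the first and last points of the curve.
    """
    x1, y1 = ks[0], ssd[0]
    x2, y2 = ks[-1], ssd[-1]
    length = math.hypot(x2 - x1, y2 - y1)
    distances = []
    for x0, y0 in zip(ks, ssd):
        # Perpendicular distance from (x0, y0) to the line through the endpoints.
        numerator = abs((y2 - y1) * x0 - (x2 - x1) * y0 + x2 * y1 - y2 * x1)
        distances.append(numerator / length)
    best_index = max(range(len(ks)), key=distances.__getitem__)
    return ks[best_index], distances


# Hypothetical SSD values for k = 1..9 (illustrative only, not the paper's data).
ks = list(range(1, 10))
ssd = [100.0, 60.0, 35.0, 28.0, 23.0, 19.0, 16.0, 14.0, 12.0]

best_k, distances = optimal_k_by_line_distance(ks, ssd)
print(best_k)  # 3 for these illustrative values
```

Because every distance is divided by the same line length, the selected k is the one with the largest vertical gap between the curve and the line, so rescaling the SSD axis does not change the choice.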
Fig. 4 a The average distance within clusters (SSD) and b the percentage of variance between clusters
Fig. 5 a Elbow technique for optimal k, b line created from k = 1 to k = 9
Fig. 6 a, b Calculated distance values from each k or cluster to the line (value of k = 3)
5.3 Model performance evaluation

In this section, the performance and reliability of the model are evaluated, and the proposed approach is compared with conventional models.

The study applied the K-means algorithm to the raw accident dataset to cluster it into three groups based on the chosen k value. The newly created cluster label was used as a new feature and added to the training set. An experiment was performed on both the raw dataset and the training set with the new feature. On the raw dataset, the unsupervised K-means algorithm scored an accuracy of 42.25%, whereas the supervised machine learning algorithms logistic regression, random forest, support vector machine, and k-Nearest Neighbors scored 86.83%, 87.77%, 68.45%, and 64.97%, respectively. On the training set with the new feature, K-means, logistic regression, random forest, support vector machine, and k-Nearest Neighbors scored accuracies of 35.83%, 99.13%, 99.86%, 73.13%, and 68.58%, respectively. The experiment showed that, on both datasets, all classification algorithms perform well on all evaluation metrics except the K-means classifier. The best performer on both datasets is random forest. On the cluster-added dataset in particular, the random forest algorithm performed very well: its accuracy rose from 87.77% to 99.86%. Each supervised machine learning classifier achieved a promising result. The unsupervised K-means classifier, however, performed somewhat better on the raw dataset. Table 1 presents the performance of each model before and after adding the new feature to the dataset. The results show that the proposed model performs better than the other models. Table 2 presents the execution time of each model. Notably, the KNN model had the shortest training time (9.7 ms).

In the study, the proposed approach was also compared with other related studies. Table 3 lists previous papers that worked on road accident severity prediction using different types of methodology. Our proposed hybrid approach used k-means from clustering and random forest
Table 1 Performance evaluation of classifiers and proposed approach

                     Testing set without new feature           Testing set with new feature
S. No  Classifier    Precision  Recall  F1 score  Accuracy     Precision  Recall  F1 score  Accuracy
1      K-means       47         42      43        42.25        36         36      35        35.83
2      LR            85         87      84        86.83        99         99      99        99.13
3      RF            86         88      87        87.77        100        100     100       99.86
4      SVM           69         68      65        68.45        76         73      70        73.13
5      KNN           64         65      62        64.97        68         69      66        68.58
from classification to improve severity model accuracy and, more importantly, is designed to identify target-specific contributing factors for road accident severity. The proposed approach has rarely been used in related studies, and target-specific classification has not been addressed. This makes the study unique and significant in Ethiopia.

Table 2 The execution time of models (ms)

Model         Training time  Testing time
K-means       191            2.57
LR            231            1.29
RF            399            38
SVM           566            134
KNN           9.7            87
K-means-RF    295            5.71

5.4 ANN experiment analysis

In this paper, we created a baseline ANN for road accident severity prediction with different numbers of dense layers. Rectifier and sigmoid activation functions were used in the input and output layers, respectively. Table 4 presents the performance of the artificial neural network (ANN) with different numbers of dense layers. The experiment showed that the best test accuracy is obtained with two and three dense layers. Both models (Model 1 = 88.77% and Model 2 = 88.77%) achieved a better result than the third model (Model 3 = 88.03%). However, as presented in Fig. 7, Model 2 has a lower test loss (0.3622) than Model 1 (0.3819) and Model 3 (0.3686). Test loss, though, is not especially informative in multi-class classification. Figure 8 presents the AUC values of each neural network model with different numbers of dense layers. As can be seen, all models give similar results regardless of the number of dense layers.

Table 4 Test accuracy, loss, and ROC curve value of ANN model with multiple dense layers

Model    Dense layers  Test accuracy (%)  Test loss  ROC curve (%)
Model 1  2             88.77              0.3819     96.1
Model 2  3             88.77              0.3622     96.1
Model 3  4             88.03              0.3686     96.1

5.5 Comparison of neural network and proposed models

In this study, the ANN model was compared with the proposed hybrid model. The comparative performance of both models showed that the proposed model (hybrid K-means and random forest) performed better than the ANN model in terms of precision, recall, F1 score,
Table 3 Performance comparison of related work models
References              Classifier                          Dataset          Accuracy
Gu et al. [21]          PSO-SVM                             China            –
Xiao et al. [52]        SVM, KNN (Ensemble)                 I-880 data set   99.33%
Castro et al. [15]      BN, JR8 and MLP                     DVSA, UK         72.39%, 72.02%, 71.70% respectively
Al-Radaideh et al. [4]  RF, ANN (backpropagation), SVM      UK               80.6%, 61.4%, 54.8% respectively
Casado et al. [14]      LCC, MNL                            Spain            –
Wahab et al. [51]       MLP, SimpleCart, PART               Ghana            72.16%, 73.45%, 73.81% respectively
Sameen et al. [40]      MLP, BLR, RNN                       Malaysia         65.48%, 58.30%, 71.77% respectively
Fentahun [18]           J48, ID3, PART                      Ethiopia         81.21%, 81.01%, 81.18% respectively
Seid et al. [42]        HMR                                 Ethiopia         NA
Abebe et al. [1]        DSA                                 Ethiopia         –
Laytin et al. [30]      UBA                                 Ethiopia         –
Fig. 7 The validation accuracy and loss of different ANN models

Fig. 8 ROC curve of different ANN models
and Accuracy. Table 5 presents the performance comparison of the proposed and ANN models. The proposed model achieved a better result than the deep neural network.

Table 5 Comparison of ANN and proposed model performance with different metrics (%)
Model type       Precision   Recall   F1 score   Accuracy
ANN              88          88       88         88
Proposed model   100         100      100        99.86
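The four metrics reported in Table 5 can be computed from true and predicted severity labels. The following is a minimal sketch, assuming macro averaging over the severity classes (the paper does not state which averaging was used); the function name `macro_metrics` and the toy labels are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return (precision, recall, f1, accuracy), macro-averaged over classes."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[t] += 1      # t was missed
    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n, accuracy


# Hypothetical toy labels (illustration only, not the RTA dataset)
y_true = ["fatal", "serious", "light", "light", "serious", "light"]
y_pred = ["fatal", "serious", "light", "serious", "serious", "light"]
precision, recall, f1, accuracy = macro_metrics(y_true, y_pred)
print(precision, recall, f1, accuracy)
```

Macro averaging weights each severity class equally, which matters when fatal accidents are much rarer than light injuries; a weighted average would give different values on imbalanced data.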
5.6 Random forest interpretation

Random forest builds numerous decision trees on several subsets of the RTA variables. It is commonly called a black box because it is difficult to know how data are processed inside the model. Indeed, it comprises many decision trees, and examining every deep tree and its decision process is troublesome and impractical. However, each individual tree learns on bagged data with randomly selected features [13], so we can get insights from the random forest by computing feature importance. Before looking at how random forest works, let us see how a decision tree works. It has a series of decision paths from the root node to a leaf, each step guarded by a feature. The prediction is the sum of the individual feature contributions and the bias (the mean value of the top-most region covered by the training set). The decision tree prediction function is defined as:

$$f(x) = C_{\mathrm{full}} + \sum_{k=1}^{K} \mathrm{contrib}(x, k) \tag{7}$$

where $K$ is the number of features, $C_{\mathrm{full}}$ is the root node value, and $\mathrm{contrib}(x, k)$ is the contribution of the $k$th feature in feature vector $x$. The prediction of a random forest, as discussed in Sect. 3.2, is the average of the predictions of its trees. Therefore, the random forest prediction function is defined as follows:

$$f(x) = \frac{1}{J}\sum_{j=1}^{J} f_j(x) \tag{8}$$

where $J$ is the number of trees and $f_j(x)$ is the prediction of the $j$th tree. It follows that the random forest prediction is the average of the biases plus the average contribution of each feature, which can be written as

$$f(x) = \frac{1}{J}\sum_{j=1}^{J} C_{j,\mathrm{full}} + \sum_{k=1}^{K}\left(\frac{1}{J}\sum_{j=1}^{J} \mathrm{contrib}_j(x, k)\right) \tag{9}$$

The above expression explains how the random forest black box can be opened by following the decision paths through the trees and computing the contributions of individual features. Knowing whether each predictive variable is related to the prediction negatively or positively helps to understand the model in detail and to know the influence of each variable on the outcome.

In the experiment, the default parameter setup was used to implement the random forest algorithm. We have seen that day, driver experience, type of vehicle, location, light condition, casualty age, and casualty sex are the strong contributors for serious injury; light condition, casualty class, casualty age, and casualty sex are contributors for light injury; whereas driver age, service year, weather condition, and casualty class are strong contributors for fatal accident severity.

6 Conclusion

In this study, a hybrid approach was developed to predict the severity of accidents in the RTA dataset. The approach is competitive with, and better than, traditional machine learning algorithms. K-means was used to create a new cluster feature that was added to the training set, and it showed a convincing result when combined with classification algorithms. In the paper, K-means was used to group the road accident dataset based on similarity, and random forest was employed to classify road accident factors into the severity variable. The combination of K-means with random forest outperforms other conventional models, namely logistic regression, k nearest neighbor, and support vector machine. The technique used in the experiment improved the classification accuracy of logistic regression, random forest, support vector machine, and k nearest neighbor by 12.3, 12.09, 4.68, and 3.61 percentage points, respectively. In contrast, the accuracy of K-means decreased by 6.42. The experiment revealed that adding a new cluster feature to the training set has a strong impact on classification accuracy. Random forest achieved the best accuracy (99.86%). Before clustering and classification, data preprocessing was performed to clean the raw datasets. Missing value treatment and conversion of categorical values were done to get a better result. In the paper, the optimal value of k was found by calculating the maximum distance from each cluster to the line joining K1 and Kn. Also, to make the prediction model trustworthy, model interpretation was carried out to understand how the model arrives at a prediction internally. A prediction is the sum of the bias and the feature contributions. In the experiment, we showed results that give insights into the contributing variables for the prediction model. Knowing the influence of individual variables makes the prediction model more trustworthy. Moreover, target-specific variable contributions were explained in the study. Overall, the paper tried to show the effects of combining clustering and classification to improve model accuracy and identified the major class-specific contributing factors from the collected road traffic accident data. Another dataset will strengthen our model to get a better result.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.
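As a supplementary illustration of the decomposition in Sect. 5.6 (Eqs. 7 to 9), the sketch below walks the decision path of each tree, credits the change in node value at every split to the feature used there, and averages biases and contributions over the forest. It is a minimal sketch under stated assumptions, not the implementation used in the study: the nested-dictionary tree layout, the helper names (`leaf`, `split`, `tree_contributions`, `forest_contributions`), the two hand-built trees, and all node values are hypothetical, and each node stores a single scalar (for example, the estimated probability of one severity class).

```python
def leaf(value):
    """Hypothetical leaf node holding a prediction value."""
    return {"value": value}


def split(feature, threshold, value, left, right):
    """Hypothetical internal node; `value` is the mean target in this node."""
    return {"feature": feature, "threshold": threshold, "value": value,
            "left": left, "right": right}


def tree_contributions(tree, x, n_features):
    """Eq. 7: decompose one tree's prediction into bias + feature contributions."""
    bias = tree["value"]                 # root node value (C_full)
    contrib = [0.0] * n_features
    node = tree
    while "feature" in node:             # descend until a leaf is reached
        k = node["feature"]
        child = node["left"] if x[k] <= node["threshold"] else node["right"]
        contrib[k] += child["value"] - node["value"]
        node = child
    return bias, contrib


def forest_contributions(trees, x, n_features):
    """Eq. 9: average the biases and the per-feature contributions over J trees."""
    n_trees = len(trees)
    bias_sum = 0.0
    contrib_sum = [0.0] * n_features
    for tree in trees:
        bias, contrib = tree_contributions(tree, x, n_features)
        bias_sum += bias
        contrib_sum = [s + c for s, c in zip(contrib_sum, contrib)]
    return bias_sum / n_trees, [s / n_trees for s in contrib_sum]


# Two hypothetical trees over the features [driver_age, light_condition]
tree_a = split(0, 30, 0.5, leaf(0.8), split(1, 0, 0.3, leaf(0.1), leaf(0.4)))
tree_b = split(1, 0, 0.4, leaf(0.2), leaf(0.6))

x = [45, 1]
bias, contrib = forest_contributions([tree_a, tree_b], x, n_features=2)
prediction = bias + sum(contrib)         # equals the mean of the tree outputs (Eq. 8)
print(bias, contrib, prediction)
```

A negative contribution means the feature pushed the prediction below the bias for this sample, and a positive one means it pushed it above, which is the sign information the interpretation relies on.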
References

1. Abebe Y, Dida T, Yisma E, Silvestri DM (2018) Ambulance use is not associated with patient acuity after road traffic collisions: a cross-sectional study from Addis Ababa, Ethiopia. BMC Emerg Med 18(1):7
2. Abegaz T, Gebremedhin S (2019) Magnitude of road traffic accident related injuries and fatalities in Ethiopia. PLoS ONE 14(1):e0202240
3. Acuña E, Rodriguez C (2004) The treatment of missing values and its effect in the classifier accuracy, classification, clustering, and data mining applications. In: Proceedings of the meeting of the International Federation of Classification Societies (IFCS), pp 639–647
4. Al-Radaideh QA, Daoud EJ (2018) Data mining methods for traffic accident severity prediction. Int J Neural Netw Adv Appl 5:1–12
5. Alikhani M, Nedaie A, Ahmadvand A (2013) Presentation of clustering-classification heuristic method for improvement accuracy in classification of severity of road accidents in Iran. Saf Sci 60:142–150
6. AlMamlook RE, Kwayu KM, Alkasisbeh MR, Frefer AA (2019) Comparison of machine learning algorithms for predicting traffic accident severity. In: 2019 IEEE Jordan international joint conference on electrical engineering and information technology (JEEIT). IEEE, pp 272–276
7. Ansari S, Akhdar F, Mandoorah M, Moutaery K (2000) Causes and effects of road traffic accidents in Saudi Arabia. Public Health 114(1):37–39
8. Asefa F, Assefa D, Tesfaye G (2014) Magnitude of, trends in, and associated factors of road traffic collision in Central Ethiopia. BMC Public Health 14(1):1072
9. Balogun J, Abereoje O (1992) Pattern of road traffic accident cases in a Nigerian University teaching hospital between 1987 and 1990. J Trop Med Hyg 95(1):23–9
10. Baru A, Azazh A, Beza L (2019) Injury severity levels and associated factors among road traffic collision victims referred to emergency departments of selected public hospitals in Addis Ababa, Ethiopia: the study based on the Haddon matrix. BMC Emerg Med 19(1):2
11. Bedard M, Guyatt GH, Stones MJ, Hirdes JP (2002) The independent contribution of driver, crash, and vehicle characteristics to driver fatalities. Accid Anal Prev 34(6):717–727
12. Beshah T, Hill S (2010) Mining road traffic accident data to improve safety: role of road-related factors on accident severity in Ethiopia. In: 2010 AAAI Spring symposium series
13. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
14. Casado-Sanz N, Guirao B, Attard M (2020) Analysis of the risk factors affecting the severity of traffic accidents on Spanish crosstown roads: the drivers perspective. Sustainability 12(6):2237
15. Castro Y, Kim YJ (2016) Data mining on road safety: factor assessment on vehicle accidents using classification models. Int J Crashworthiness 21(2):104–111
16. Chen WH, Jovanis PP (2000) Method for identifying factors contributing to driver-injury severity in traffic crashes. Transp Res Rec 1717(1):1–9
17. Deme D (2019) Road traffic accident in Ethiopia from 2007/08-2017/18. Am Int J Sci Eng Res 2(2):49–59
18. Fentahun A (2011) Mining road traffic accident data for predicting accident severity to improve public health-role of driver and road factors in the case of Addis Ababa. PhD thesis, Addis Ababa University
19. Getnet M (2009) Applying data mining with decision tree and rule induction techniques to identify determinant factors of drivers and vehicles in support of reducing and controlling road traffic accidents: the case of Addis Ababa city. Addis Ababa University, Addis Ababa
20. Gissane W (1965) Accidents: a modern epidemic. J Inst Health Educ 3(1):16–18
21. Gu X, Li T, Wang Y, Zhang L, Wang Y, Yao J (2018) Traffic fatalities prediction using support vector machine with hybrid particle swarm optimization. J Algorithms Comput Technol 12(1):20–29
22. Habib MT, Majumder A, Jakaria A, Akter M, Uddin MS, Ahmed F (2018) Machine vision based papaya disease recognition. J King Saud Univ Comput Inf Sci 32(3):300–309
23. Haleem K, Alluri P, Gan A (2015) Analyzing pedestrian crash injury severity at signalized and non-signalized locations. Accid Anal Prev 81:14–23
24. Hartigan JA, Wong MA (1979) Algorithm AS 136: a k-means clustering algorithm. J R Stat Soc Ser C (Appl Stat) 28(1):100–108
25. Hordofa GG, Assegid S, Girma A, Weldemarium TD (2018) Prevalence of fatality and associated factors of road traffic accidents among victims reported to Burayu town police stations, between 2010 and 2015, Ethiopia. J Transp Health 10:186–193
26. Kim JK, Ulfarsson GF, Shankar VN, Mannering FL (2010) A note on modeling pedestrian-injury severity in motor-vehicle crashes with the mixed logit model. Accid Anal Prev 42(6):1751–1758
27. Kumar S, Toshniwal D (2016) A data mining approach to characterize road accident locations. J Mod Transp 24(1):62–72
28. Kwon OH, Rhee W, Yoon Y (2015) Application of classification algorithms for analysis of road safety risk factor dependencies. Accid Anal Prev 75:1–15
29. Kyriakopoulou A, Kalamboukis T (2008) Combining clustering with classification for spam detection in social bookmarking systems. In: ECML PKDD
30. Laytin AD, Seyoum N, Kassa S, Juillard CJ, Dicker RA (2020) Patterns of injury at an Ethiopian referral hospital: using an institutional trauma registry to inform injury prevention and systems strengthening. Afr J Emerg Med 10(2):58–63
31. Lee C, Saccomanno F, Hellinga B (2002) Analysis of crash precursors on instrumented freeways. Transp Res Rec 1784(1):1–8
32. Lee N, Kim JM (2010) Conversion of categorical variables into numerical variables via Bayesian network classifiers for binary classifications. Comput Stat Data Anal 54(5):1247–1265
33. Leka S, Griffiths A, Cox T, World Health Organization et al (2003) Work organisation and stress: systematic problem approaches for employers, managers and trade union representatives. World Health Organization, Geneva
34. Li L, Shrestha S, Hu G (2017) Analysis of road traffic fatal accidents using data mining techniques. In: 2017 IEEE 15th international conference on software engineering research, management and applications (SERA). IEEE, pp 363–370
35. MacQueen J et al (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Oakland, CA, USA, vol 1, pp 281–297
36. Odero W, Khayesi M, Heda P (2003) Road traffic injuries in Kenya: magnitude, causes and status of intervention. Inj Control Saf Promot 10(1–2):53–61
37. Persson A (2008) Road traffic accidents in Ethiopia: magnitude, causes and possible interventions. Adv Transp Stud 15:5–16
38. Powers DMW (2011) Evaluation: from precision, recall and f-measure to ROC, informedness, markedness and correlation. J Mach Learn Technol 2(1):37–63
39. Regassa Z (2009) Determining the degree of driver's responsibility for car accident: the case of Addis Ababa traffic office. Addis Ababa University, Addis Ababa
40. Sameen MI, Pradhan B (2017) Severity prediction of traffic accidents with recurrent neural networks. Appl Sci 7(6):476
41. Sarkar S, Vinay S, Raj R, Maiti J, Mitra P (2019) Application of optimized machine learning techniques for prediction of occupational accidents. Comput Oper Res 106:210–224
42. Seid M, Azazh A, Enquselassie F, Yisma E (2015) Injury characteristics and outcome of road traffic accident among victims at Adult Emergency Department of Tikur Anbessa specialized hospital, Addis Ababa, Ethiopia: a prospective hospital based study. BMC Emerg Med 15(1):10
43. Seid S et al (2019) Road accident data analysis: data preprocessing for better model building. J Comput Theor Nanosci 16(9):4019–4027
44. Sharma B, Katiyar VK, Kumar K (2016) Traffic accident prediction model using support vector machines with Gaussian kernel. In: Proceedings of fifth international conference on soft computing for problem solving, Springer, Berlin, pp 1–10
45. Singh SK (2017) Road traffic accidents in India: issues and challenges. Transp Res Procedia 25:4708–4719
46. Sundarkumar GG, Ravi V (2015) A novel hybrid undersampling method for mining unbalanced datasets in banking and insurance. Eng Appl Artif Intell 37:368–377
47. Tiwari P, Kumar S, Kalitin D (2017) Road-user specific analysis of traffic accident using data mining techniques. In: International conference on computational intelligence, communications, and business analytics. Springer, Berlin, pp 398–410
48. Tixier AJP, Hallowell MR, Rajagopalan B, Bowman D (2016) Application of machine learning to construction injury prediction. Autom Constr 69:102–114
49. Tulu GS (2015) Pedestrian crashes in Ethiopia: identification of contributing factors through modelling of exposure and road environment variables. PhD thesis, Queensland University of Technology
50. Vinodhini G, Chandrasekaran R (2016) A comparative performance evaluation of neural network based approach for sentiment classification of online reviews. J King Saud Univ Comput Inf Sci 28(1):2–12
51. Wahab L, Jiang H (2019) Severity prediction of motorcycle crashes with machine learning methods. Int J Crashworthiness 24:1–8
52. Xiao J (2019) SVM and KNN ensemble learning for traffic incident detection. Phys A 517:29–35
53. Yahya AA (2017) Swarm intelligence-based approach for educational data classification. J King Saud Univ Comput Inf Sci 31(1):35–51
54. Zajac SS, Ivan JN (2003) Factors influencing injury severity of motor vehicle-crossing pedestrian crashes in rural Connecticut. Accid Anal Prev 35(3):369–379

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
