Crash Severity Analysis Through Nonparametric Machine Learning
Abstract:
In recent years, intelligent transportation systems (ITS) have developed around the world as
part
of smart cities, integrating various technologies like cloud computing, the Internet of Things,
sensors, artificial intelligence, geographical information, and social networks. The innovative
services provided by ITS can improve transportation mobility and safety by making road users
better informed and more coordinated, which helps in addressing the transportation issues
caused by the significant increase in city traffic in the past few decades. Traffic prediction is one
of the key tasks of ITS: it provides essential information to road users and traffic management agencies, enabling better decision-making, and it helps to improve transport network planning to reduce common problems such as road accidents, traffic congestion, and air pollution. Parametric models such as logistic regression and Linear Discriminant Analysis have most commonly been used to explore the factors contributing to crash severity. The results show that road type, area type, type of traffic control, and speed limit are the critical factors affecting crash severity.
Keywords: Crash severity, Data Mining, Traffic Crashes, XGBoost, MLP Classifier.
2. Parameter tuning
To generate the best results from each model, we utilized a grid search over each of the classifiers above to determine the best set of parameters. In addition, a common evaluation function was used to compare the models on accuracy, precision, and recall.
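A minimal sketch of this tuning and evaluation procedure is given below; the parameter grid, the XGBoost configuration, and the synthetic data are illustrative assumptions, not the exact setup used in the study.

```python
# Hedged sketch of grid search plus a shared evaluation function.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

def evaluate_classifier(model, X_test, y_test):
    """Compare fitted models on accuracy, precision and recall (macro-averaged)."""
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="macro"),
        "recall": recall_score(y_test, y_pred, average="macro"),
    }

# Placeholder data standing in for the engineered crash features and severity labels.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=6,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {"max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1, 0.3]}
search = GridSearchCV(XGBClassifier(n_estimators=200), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)
print(evaluate_classifier(search.best_estimator_, X_test, y_test))
```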
New features were also developed from the dataset. The feature ehourCat was created to determine whether rush-hour traffic was a key factor in fatalities and collisions. The feature Type of Traffic Control relates the severity of an accident to the kind of traffic control present at the location.
Features are values derived from raw data and used as input to a machine learning algorithm. High-quality features (informative, relevant, interpretable, and non-redundant) are the basis for modelling and problem-solving, as well as for generating reliable and convincing results.
Some operations are defined to summarise the key information and profile the data series,
including statistical descriptions, threshold-based filtering, aggregated or accumulated values,
etc.
Finally, some informative and interpretable features are preliminarily shortlisted, and learning-based feature selection is applied to select the most important ones. In this section, the features selected for our research are explained, with an illustrative encoding sketch after the list. Feature selection is an important task for the success of a model, since different features have different influences on the prediction results.
• Type: Set to 4 if Fatality takes place, set to 3 if Grievous Injured, set to 2 if Minor Injury, set to
1 for non-injury
• ehourCat: Set to 0 if the time is between 0:00 and 6:00 h, set to 6 if the time is between 6:00 and 10:00 h, set to 10 if the time is between 10:00 and 16:00 h, set to 16 if the time is between 16:00 and 21:00 h, and set to 21 if the time is between 21:00 and 24:00 h
• Speed Limit: Set to 0 if No Speed Sign or Less than 40, set to 1 if 40 to 60, set to 2 if 60 to
80, set to 3 if Greater than 80
• Type of traffic control: Set to 0 in case of No Speed Sign or Uncontrolled, 1 in case of Traffic Light Signal, Stop Sign, or Flashing Signal/Blinker, and 2 in case of Police Control
• Number of Lanes: Set to 0 in case of 2 lanes or fewer, 1 in case of more than 2 lanes
• Road Type: Set to 1 if National Highway, set to 2 if Expressway, set to 3 if State Highway, set
to 4 if Other Roads
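A hedged sketch of two of these encodings is shown below; the raw column names (hour, speed_limit_kmph) are assumed placeholders for the actual dataset fields.

```python
# Illustrative encoding of the ehourCat and Speed Limit features listed above.
import pandas as pd

def encode_ehour(hour):
    """Map hour of day to the ehourCat bins: 0-6 -> 0, 6-10 -> 6, 10-16 -> 10, 16-21 -> 16, 21-24 -> 21."""
    if hour < 6:
        return 0
    if hour < 10:
        return 6
    if hour < 16:
        return 10
    if hour < 21:
        return 16
    return 21

def encode_speed_limit(limit_kmph):
    """Map the posted speed limit to the four Speed Limit categories above."""
    if pd.isna(limit_kmph) or limit_kmph < 40:
        return 0  # no speed sign or less than 40
    if limit_kmph <= 60:
        return 1
    if limit_kmph <= 80:
        return 2
    return 3

df = pd.DataFrame({"hour": [2, 8, 12, 18, 22], "speed_limit_kmph": [None, 50, 70, 90, 30]})
df["ehourCat"] = df["hour"].apply(encode_ehour)
df["Speed Limit"] = df["speed_limit_kmph"].apply(encode_speed_limit)
print(df)
```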
Model-based imputation: Missing values are predicted with a simple machine learning model, using a regression algorithm if the feature with missing values is real-valued and a classification algorithm if it is categorical. Observations containing missing values are treated as test data, and observations without missing values are used as training data for the algorithm that predicts the missing value. For a linear regression, the predicted value is obtained from the weight vector W learned by minimizing the loss on the training data. In this work, a KNN regressor is used to impute the missing values in the dataset.
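A minimal sketch of this KNN-based imputation, using scikit-learn's KNNImputer as one way to realise it; the number of neighbours is an assumed value, not taken from the study.

```python
# KNN imputation sketch: complete rows effectively act as training neighbours,
# and missing entries are filled in from the nearest rows.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [4.0, 5.0, 7.0],
    [2.0, 3.0, 5.0],
])

imputer = KNNImputer(n_neighbors=2)  # assumed neighbour count
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```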
XGBoost
XGBoost is a scalable learning system with a recognised impact on solving machine learning challenges in different application domains. It is much faster than common machine learning methods and can efficiently process billions of examples in a parallel and distributed manner.
Several algorithmic improvements and the ability to handle sparse data are other features of this model. Sparse input is common in many real-world problems for several reasons, such as missing data and multiple zero values, so sparsity awareness is an important feature to include in the algorithm. XGBoost provides it by visiting non-missing entries only. For missing data, the algorithm adds a default direction to each tree node; if a value is not provided in the sparse matrix, the input is classified into the default direction, and the optimal default direction is learnt from the data. This improvement makes the algorithm faster by keeping the computation time linear in the number of non-missing entries. The second feature of XGBoost is the use of column blocks for parallel learning: data are stored in in-memory units called blocks in a compressed column format, with each column sorted by its feature values. The objective function (loss function plus regularization) at iteration t that we need to minimize is the following:
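$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) + \Omega(f_t), \qquad \Omega(f_t) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^{2}$$

written here in the standard XGBoost notation, where $l$ is a differentiable convex loss function, $\hat{y}_i^{(t-1)}$ is the prediction of the first $t-1$ trees for instance $\mathbf{x}_i$, $f_t$ is the tree added at iteration $t$, $T$ is its number of leaves, $w$ is its vector of leaf weights, and $\gamma$ and $\lambda$ are regularization parameters.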
MLP Classifier
An MLP is a finite directed acyclic graph. Nodes that are not the target of any connection are called input neurons. An MLP applied to input patterns of dimension n must have n input neurons, one for each dimension; input neurons are typically enumerated as neuron 1, neuron 2, ..., neuron n. Nodes that are not the source of any connection are called output neurons. An MLP can have more than one output neuron; the number of output neurons depends on how the target values (desired values) of the training patterns are described. All nodes that are neither input neurons nor output neurons are called hidden neurons. Since the graph is acyclic, all neurons can be organized in layers, with the set of input neurons forming the first layer. Connections that hop over several layers are called shortcuts. Most MLPs have a connection structure with connections from all neurons of one layer to all neurons of the next layer, without shortcuts, and all neurons are enumerated.
MLPClassifier supports the cross-entropy loss function, which allows probability estimates to be obtained via the predict_proba method. The MLP is trained using backpropagation; more precisely, it trains with some form of gradient descent, and the gradients are calculated using backpropagation.
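A brief sketch of scikit-learn's MLPClassifier as described above; the layer sizes and the synthetic data are illustrative assumptions, not the configuration used in the study.

```python
# MLPClassifier sketch: cross-entropy loss, backpropagation-based training,
# and probability estimates through predict_proba.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, n_informative=6,
                           n_classes=4, random_state=0)
X = StandardScaler().fit_transform(X)  # neural networks benefit from standardized inputs

mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
mlp.fit(X, y)
print(mlp.predict_proba(X[:3]))  # class probability estimates
```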
Gaussian Naive Bayes classifier
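Gaussian Naive Bayes applies Bayes' theorem under the naive assumption that features are conditionally independent given the class, modelling each continuous feature with a class-conditional Gaussian distribution. Since the study does not spell out its configuration, the sketch below uses placeholder data and default settings only.

```python
# Gaussian Naive Bayes sketch: one of the classifiers compared in this study,
# fitted here on synthetic placeholder data.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=8, n_informative=6,
                           n_classes=4, random_state=0)
gnb = GaussianNB().fit(X, y)
print(gnb.predict(X[:5]))
print(gnb.class_prior_)  # learned class prior probabilities
```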
Decision Trees
A decision tree is a tree in which each node represents a feature (attribute), each link (branch) represents a decision (rule), and each leaf represents an outcome (a categorical or continuous value).
A node with outgoing edges is called an internal or test node; all other nodes are called leaves (also known as terminal or decision nodes). In a decision tree, each internal node splits the instance space into two or more sub-spaces according to a certain discrete function of the input attribute values. Decision trees follow a Sum of Products (SOP) representation. By using information gain as a criterion, we estimate the information contained in each attribute. Decision trees divide the feature space into axis-parallel rectangles or hyperplanes.
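A sketch of a decision tree fitted with the information-gain criterion (called "entropy" in scikit-learn), matching the splitting criterion described above; the depth limit and data are illustrative.

```python
# Decision tree sketch: each internal node is an axis-parallel split on one
# feature, chosen here by information gain (entropy).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=8, n_informative=6,
                           n_classes=4, random_state=0)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree))  # textual view of the learned splits
```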
4 Class and Two Class Model
Two different models were prepared to address the issue of accident severity.
Based on the data available, a four-class model of injury severity was formed: Fatal (if the number of fatalities > 0), Major Injury (if the number of fatalities = 0 and the number of major injuries > 0), Minor Injury (if the number of fatalities = 0, the number of major injuries = 0, and the number of minor injuries > 0), and No Injury (if the number of fatalities = 0, the number of major injuries = 0, and the number of minor injuries = 0).
The output of the second, two-class model is a binary set of injury severity. The first class comprises crashes resulting in urgent hospitalization (i.e., the combined set of Fatal and Grievous Injury), and the second class comprises crashes not requiring urgent hospitalization (i.e., the combined set of Minor Injury and No Injury). The main motive for conducting the binary classification is to address the underreporting of the minor-injury and no-injury data.
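A hedged sketch of how both label sets described above could be derived; the casualty-count column names (n_fatal, n_major, n_minor) are assumed placeholders for the actual dataset fields.

```python
# Deriving the four-class and binary severity labels from casualty counts.
import numpy as np
import pandas as pd

df = pd.DataFrame({"n_fatal": [1, 0, 0, 0],
                   "n_major": [0, 2, 0, 0],
                   "n_minor": [0, 0, 3, 0]})

# Four-class model: 4 = Fatal, 3 = Major (Grievous) Injury, 2 = Minor Injury, 1 = No Injury.
df["severity_4class"] = np.select(
    [df["n_fatal"] > 0, df["n_major"] > 0, df["n_minor"] > 0],
    [4, 3, 2],
    default=1,
)

# Two-class model: 1 = urgent hospitalization (Fatal or Grievous), 0 = otherwise.
df["severity_2class"] = (df["severity_4class"] >= 3).astype(int)
print(df)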
[Table: XGBoost, MLP classifier, Decision Tree, Gaussian Naive Bayes]
The features will be scaled to zero mean and unit variance for determining feature importance.
[Table: Relative importance of variables and their feature importance scores (XGBoost)]
The table above gives the probability of each feature given that a crash did not involve a fatality: Area Type, Motor Vehicle Involved, Road Type, ehourCat, and weekday are the most important features. On the other hand, if a crash involves a fatality, Road Type, ehourCat, weekday, and Motor Vehicle Involved are the most important factors. This gives an idea of the quantitative contribution of each feature to a fatal or non-fatal outcome. For example, in the case of an accident involving a fatality, the contribution probability of Speed Limit is 17.3%; if the accident does not involve a fatality, this contribution drops to 14.33%.
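A sketch of the feature-importance step referred to above: features scaled to zero mean and unit variance, then importance scores read from a fitted XGBoost model. The feature names follow the list in the parameter-tuning section, while the data here are synthetic placeholders.

```python
# Scaling features and reading XGBoost importance scores.
import numpy as np
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

feature_names = ["Area Type", "Road Type", "ehourCat", "Speed Limit",
                 "Type of traffic control", "Number of Lanes",
                 "Motor vehicle involved", "weekday"]
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(feature_names)))  # placeholder feature matrix
y = rng.integers(0, 4, size=500)                # placeholder severity labels

X_scaled = StandardScaler().fit_transform(X)    # zero mean, unit variance
model = XGBClassifier(n_estimators=100).fit(X_scaled, y)
for name, score in sorted(zip(feature_names, model.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```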
CONCLUSION
The study aimed to model a functional relationship between the injury severity classes and the different labels that govern the cause of these crashes. Different classification models were explored to accurately predict the injury severity class and to identify the key parameters, in order to understand how injury severity depends on factors such as area type, weekday, number of lanes, speed limit, and involvement of motorized vehicles. Useful information was obtained from the datasets using data mining techniques, and the findings from this study can guide decision-making authorities in framing better policies for preventing severe road crashes in India. In this study, non-parametric approaches were implemented to model the severity level of injuries occurring in road crashes in India. The performance of each model was estimated, and from the study it can be deduced that XGBoost was the overall better performer for identifying factors influencing severity in the crash data. The results showed that area type, type of traffic control, weekday, and speed limit are significant contributors to severe crashes. The approaches used in this study provide a reasonable estimate of the severity level in traffic crashes and capture the underlying relationship between injury severity and the factors affecting it. Decision-making authorities can use the knowledge of these factors to frame relevant traffic policies and safety plans to reduce fatalities and injuries on the roads.
From the study it emerged that India, being a developing country, lacks many basic infrastructural facilities. The results make evident a serious need for pedestrian facilities such as subways, foot over-bridges (FOBs), sidewalks, and pedestrian traffic lights. Moreover, a lack of proper traffic control has led to an increase in wrong turns, overtaking, and over-speeding near intersections. Another interesting observation was that the absence of an appropriate speed limit was also one of the key determinants influencing crash severity, which highlights that several road stretches need speed control. The insights from the study can help the concerned authorities in decision-making and in establishing safety plans. The study can help in prioritizing the critical issues common to crash-prone locations on highways and in designing specific countermeasures based on the distinctive safety deficiencies. For example, the results suggest that law enforcement needs to be intensified to ensure adherence to traffic rules near intersections; in this context, CCTV cameras can be installed at major intersections to detect over-speeding, unsafe overtaking, and lane changing. Similarly, providing adequate pedestrian facilities near built-up areas can help to cut down severe crash counts. Even low-cost intelligent transportation system (ITS) solutions can be designed to warn vehicles of the presence of pedestrians on highway crosswalks. The study also highlights that over-speeding should be controlled by providing traffic calming measures or posted speed limit signs, which can reduce the proportion of severe accidents on such highways. From this, it can be concluded that non-parametric approaches are effective techniques that can be employed in studies in the field of transportation and safety.