Crash Severity Analysis Through Nonparametric Machine Learning
Abstract:
In recent years, intelligent transportation systems (ITS) have developed around the world as
part
of smart cities, integrating various technologies like cloud computing, the Internet of Things,
sensors, artificial intelligence, geographical information, and social networks. The innovative
services provided by ITS can improve transportation mobility and safety by making road users
better informed and more coordinated, which helps in addressing the transportation issues
caused by the significant increase in city traffic in the past few decades. Traffic prediction is one
of the key tasks of ITS: it provides essential information to road users and traffic management agencies, enabling better decision-making, and it helps to improve transport network planning to reduce common problems such as road accidents, traffic congestion, and air pollution. Parametric models such as logistic regression and Linear Discriminant Analysis have most commonly been used to explore the factors contributing to crash severity. The results show that road type, area type, type of traffic control, and speed limit are the critical factors affecting crash severity.
Keywords: Crash severity, Data Mining, Traffic Crashes, XGBoost, MLP Classifier.
2. Parameter tuning
To generate the best results from each model, we utilized a grid search over each of the classifiers above to determine the best set of parameters. In addition, a common evaluation function was used to compare the models on accuracy, precision, and recall.
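A minimal sketch of this tuning and evaluation procedure is given below; the parameter grid, the XGBoost configuration, and the synthetic data are illustrative assumptions, not the exact setup used in the study.

```python
# Hedged sketch of grid search plus a shared evaluation function.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

def evaluate_classifier(model, X_test, y_test):
    """Compare fitted models on accuracy, precision and recall (macro-averaged)."""
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="macro"),
        "recall": recall_score(y_test, y_pred, average="macro"),
    }

# Placeholder data standing in for the engineered crash features and severity labels.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=6,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {"max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1, 0.3]}
search = GridSearchCV(XGBClassifier(n_estimators=200), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)
print(evaluate_classifier(search.best_estimator_, X_test, y_test))
```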
New features were also developed from the dataset. The feature ehourCat was created to determine whether rush-hour traffic was a key factor in fatalities and collisions. The feature Type of Traffic Control relates the severity of an accident to the kind of traffic control present at the location.
Features are values derived from raw data and used as input to a machine learning algorithm. High-quality features (informative, relevant, interpretable, and non-redundant) are the basis for modelling and problem-solving, as well as for generating reliable and convincing results.
Some operations are defined to summarise the key information and profile the data series,
including statistical descriptions, threshold-based filtering, aggregated or accumulated values,
etc.
Finally, some informative and interpretable features are preliminarily shortlisted, and learning-based feature selection is applied to select the most important ones. In this section, the features selected for our research are explained, with an illustrative encoding sketch after the list. Feature selection is an important task for the success of a model, since different features have different influences on the prediction results.
• Type: Set to 4 if Fatality takes place, set to 3 if Grievous Injured, set to 2 if Minor Injury, set to
1 for non-injury
• ehourCat: Set to 0 if the time is between 0:00 and 6:00 h, set to 6 if the time is between 6:00 and 10:00 h, set to 10 if the time is between 10:00 and 16:00 h, set to 16 if the time is between 16:00 and 21:00 h, and set to 21 if the time is between 21:00 and 24:00 h
• Speed Limit: Set to 0 if No Speed Sign or Less than 40, set to 1 if 40 to 60, set to 2 if 60 to
80, set to 3 if Greater than 80
• Type of traffic control: Set to 0 in case of No Speed Sign or Uncontrolled, 1 in case of Traffic Light Signal, Stop Sign, or Flashing Signal/Blinker, and 2 in case of Police Control
• Number of Lanes: Set to 0 in case of 2 lanes or fewer, 1 in case of more than 2 lanes
• Road Type: Set to 1 if National Highway, set to 2 if Expressway, set to 3 if State Highway, set
to 4 if Other Roads
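A hedged sketch of two of these encodings is shown below; the raw column names (hour, speed_limit_kmph) are assumed placeholders for the actual dataset fields.

```python
# Illustrative encoding of the ehourCat and Speed Limit features listed above.
import pandas as pd

def encode_ehour(hour):
    """Map hour of day to the ehourCat bins: 0-6 -> 0, 6-10 -> 6, 10-16 -> 10, 16-21 -> 16, 21-24 -> 21."""
    if hour < 6:
        return 0
    if hour < 10:
        return 6
    if hour < 16:
        return 10
    if hour < 21:
        return 16
    return 21

def encode_speed_limit(limit_kmph):
    """Map the posted speed limit to the four Speed Limit categories above."""
    if pd.isna(limit_kmph) or limit_kmph < 40:
        return 0  # no speed sign or less than 40
    if limit_kmph <= 60:
        return 1
    if limit_kmph <= 80:
        return 2
    return 3

df = pd.DataFrame({"hour": [2, 8, 12, 18, 22], "speed_limit_kmph": [None, 50, 70, 90, 30]})
df["ehourCat"] = df["hour"].apply(encode_ehour)
df["Speed Limit"] = df["speed_limit_kmph"].apply(encode_speed_limit)
print(df)
```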
Model-based imputation: Missing values are predicted with a simple machine learning model, using a regression algorithm if the feature with missing values is real-valued and a classification algorithm if it is categorical. Observations containing missing values are treated as test data, and observations without missing values are used as training data for the algorithm that predicts the missing value. For a linear regression, the predicted value is obtained from the weight vector W learned by minimizing the loss on the training data. In this work, a KNN regressor is used to impute the missing values in the dataset.
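A minimal sketch of this KNN-based imputation, using scikit-learn's KNNImputer as one way to realise it; the number of neighbours is an assumed value, not taken from the study.

```python
# KNN imputation sketch: complete rows effectively act as training neighbours,
# and missing entries are filled in from the nearest rows.
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [4.0, 5.0, 7.0],
    [2.0, 3.0, 5.0],
])

imputer = KNNImputer(n_neighbors=2)  # assumed neighbour count
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```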
XGBoost
XGBoost is a scalable learning system with a recognised impact on solving machine learning challenges in different application domains. It is much faster than common machine learning methods and can efficiently process billions of examples in a parallel and distributed manner.
Several algorithmic improvements and the ability to handle sparse data are other features of this model. Sparse input is common in many real-world problems for several reasons, such as missing data and multiple zero values, so sparsity awareness is an important feature to include in the algorithm. XGBoost provides it by visiting non-missing entries only. For missing data, the algorithm adds a default direction to each tree node; if a value is not provided in the sparse matrix, the input is classified into the default direction, and the optimal default direction is learnt from the data. This improvement makes the algorithm faster by keeping the computation time linear in the number of non-missing entries. The second feature of XGBoost is the use of column blocks for parallel learning: data are stored in in-memory units called blocks in a compressed column format, with each column sorted by its feature values. The objective function (loss function plus regularization) at iteration t that we need to minimize is the following:
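$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(\mathbf{x}_i)\right) + \Omega(f_t), \qquad \Omega(f_t) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^{2}$$

written here in the standard XGBoost notation, where $l$ is a differentiable convex loss function, $\hat{y}_i^{(t-1)}$ is the prediction of the first $t-1$ trees for instance $\mathbf{x}_i$, $f_t$ is the tree added at iteration $t$, $T$ is its number of leaves, $w$ is its vector of leaf weights, and $\gamma$ and $\lambda$ are regularization parameters.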
MLP Classifier
An MLP is a finite directed acyclic graph. Nodes that are not the target of any connection are called input neurons. An MLP applied to input patterns of dimension n must have n input neurons, one for each dimension; input neurons are typically enumerated as neuron 1, neuron 2, ..., neuron n. Nodes that are not the source of any connection are called output neurons. An MLP can have more than one output neuron; the number of output neurons depends on how the target values (desired values) of the training patterns are described. All nodes that are neither input neurons nor output neurons are called hidden neurons. Since the graph is acyclic, all neurons can be organized in layers, with the set of input neurons forming the first layer. Connections that hop over several layers are called shortcuts. Most MLPs have a connection structure with connections from all neurons of one layer to all neurons of the next layer, without shortcuts, and all neurons are enumerated.
MLPClassifier supports the cross-entropy loss function, which allows probability estimates to be obtained via the predict_proba method. The MLP is trained using backpropagation; more precisely, it trains with some form of gradient descent, and the gradients are calculated using backpropagation.
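A brief sketch of scikit-learn's MLPClassifier as described above; the layer sizes and the synthetic data are illustrative assumptions, not the configuration used in the study.

```python
# MLPClassifier sketch: cross-entropy loss, backpropagation-based training,
# and probability estimates through predict_proba.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, n_informative=6,
                           n_classes=4, random_state=0)
X = StandardScaler().fit_transform(X)  # neural networks benefit from standardized inputs

mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
mlp.fit(X, y)
print(mlp.predict_proba(X[:3]))  # class probability estimates
```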
Gaussian Naive Bayes classifier
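Gaussian Naive Bayes applies Bayes' theorem under the naive assumption that features are conditionally independent given the class, modelling each continuous feature with a class-conditional Gaussian distribution. Since the study does not spell out its configuration, the sketch below uses placeholder data and default settings only.

```python
# Gaussian Naive Bayes sketch: one of the classifiers compared in this study,
# fitted here on synthetic placeholder data.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=8, n_informative=6,
                           n_classes=4, random_state=0)
gnb = GaussianNB().fit(X, y)
print(gnb.predict(X[:5]))
print(gnb.class_prior_)  # learned class prior probabilities
```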
Decision Trees
A decision tree is a tree in which each node represents a feature (attribute), each link (branch) represents a decision (rule), and each leaf represents an outcome (a categorical or continuous value).
A node with outgoing edges is called an internal or test node; all other nodes are called leaves (also known as terminal or decision nodes). In a decision tree, each internal node splits the instance space into two or more sub-spaces according to a certain discrete function of the input attribute values. Decision trees follow a Sum of Products (SOP) representation. By using information gain as a criterion, we estimate the information contained in each attribute. Decision trees divide the feature space into axis-parallel rectangles or hyperplanes.
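A sketch of a decision tree fitted with the information-gain criterion (called "entropy" in scikit-learn), matching the splitting criterion described above; the depth limit and data are illustrative.

```python
# Decision tree sketch: each internal node is an axis-parallel split on one
# feature, chosen here by information gain (entropy).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=8, n_informative=6,
                           n_classes=4, random_state=0)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree))  # textual view of the learned splits
```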
4 Class and Two Class Model
Two different models were prepared to address the issue of accident severity.
Based on the data available, a four-class model of injury severity was formed: Fatal (if the number of fatalities > 0), Major Injury (if the number of fatalities = 0 and the number of major injuries > 0), Minor Injury (if the number of fatalities = 0, the number of major injuries = 0, and the number of minor injuries > 0), and No Injury (if the number of fatalities = 0, the number of major injuries = 0, and the number of minor injuries = 0).
The output of the second, two-class model is a binary set of injury severity. The first class comprises crashes resulting in urgent hospitalization (i.e., the combined set of Fatal and Grievous Injury), and the second class comprises crashes not requiring urgent hospitalization (i.e., the combined set of Minor Injury and No Injury). The main motive for conducting the binary classification is to address the underreporting of the minor-injury and no-injury data.
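A hedged sketch of how both label sets described above could be derived; the casualty-count column names (n_fatal, n_major, n_minor) are assumed placeholders for the actual dataset fields.

```python
# Deriving the four-class and binary severity labels from casualty counts.
import numpy as np
import pandas as pd

df = pd.DataFrame({"n_fatal": [1, 0, 0, 0],
                   "n_major": [0, 2, 0, 0],
                   "n_minor": [0, 0, 3, 0]})

# Four-class model: 4 = Fatal, 3 = Major (Grievous) Injury, 2 = Minor Injury, 1 = No Injury.
df["severity_4class"] = np.select(
    [df["n_fatal"] > 0, df["n_major"] > 0, df["n_minor"] > 0],
    [4, 3, 2],
    default=1,
)

# Two-class model: 1 = urgent hospitalization (Fatal or Grievous), 0 = otherwise.
df["severity_2class"] = (df["severity_4class"] >= 3).astype(int)
print(df)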
[Table: XGBoost, MLP classifier, Decision Tree, Gaussian Naive Bayes]
The features will be scaled to zero mean and unit variance for determining feature importance.
[Table: Relative importance of variables and their feature importance scores (XGBoost)]
The table above gives the probability of each feature given that a crash did not involve a fatality: Area Type, Motor Vehicle Involved, Road Type, ehourCat, and weekday are the most important features. On the other hand, if a crash involves a fatality, Road Type, ehourCat, weekday, and Motor Vehicle Involved are the most important factors. This gives an idea of the quantitative contribution of each feature to a fatal or non-fatal outcome. For example, in the case of an accident involving a fatality, the contribution probability of Speed Limit is 17.3%; if the accident does not involve a fatality, this contribution drops to 14.33%.
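A sketch of the feature-importance step referred to above: features scaled to zero mean and unit variance, then importance scores read from a fitted XGBoost model. The feature names follow the list in the parameter-tuning section, while the data here are synthetic placeholders.

```python
# Scaling features and reading XGBoost importance scores.
import numpy as np
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

feature_names = ["Area Type", "Road Type", "ehourCat", "Speed Limit",
                 "Type of traffic control", "Number of Lanes",
                 "Motor vehicle involved", "weekday"]
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(feature_names)))  # placeholder feature matrix
y = rng.integers(0, 4, size=500)                # placeholder severity labels

X_scaled = StandardScaler().fit_transform(X)    # zero mean, unit variance
model = XGBClassifier(n_estimators=100).fit(X_scaled, y)
for name, score in sorted(zip(feature_names, model.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```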
CONCLUSION
The study aimed to model a functional relationship between the injury severity classes and the different labels that govern the cause of these crashes. Different classification models were explored to accurately predict the injury severity class and to identify the key parameters, in order to understand how injury severity depends on factors such as area type, weekday, number of lanes, speed limit, and involvement of motorized vehicles. Useful information was obtained from the datasets using data mining techniques, and the findings from this study can guide decision-making authorities in framing better policies for preventing severe road crashes in India. In this study, non-parametric approaches were implemented to model the severity level of injuries occurring in road crashes in India. The performance of each model was estimated, and from the study it can be deduced that XGBoost was the overall better performer for identifying factors influencing severity in the crash data. The results showed that area type, type of traffic control, weekday, and speed limit are significant contributors to severe crashes. The approaches used in this study provide a reasonable estimate of the severity level in traffic crashes and capture the underlying relationship between injury severity and the factors affecting it. Decision-making authorities can use the knowledge of these factors to frame relevant traffic policies and safety plans to reduce fatalities and injuries on the roads.
From the study it emerged that India, being a developing country, lacks many basic infrastructural facilities. The results make evident a serious need for pedestrian facilities such as subways, foot over-bridges (FOBs), sidewalks, and pedestrian traffic lights. Moreover, a lack of proper traffic control has led to an increase in wrong turns, overtaking, and over-speeding near intersections. Another interesting observation was that the absence of an appropriate speed limit was also one of the key determinants influencing crash severity, which highlights that several road stretches need speed control. The insights from the study can help the concerned authorities in decision-making and in establishing safety plans. The study can help in prioritizing the critical issues common to crash-prone locations on highways and in designing specific countermeasures based on the distinctive safety deficiencies. For example, the results suggest that law enforcement needs to be intensified to ensure adherence to traffic rules near intersections; in this context, CCTV cameras can be installed at major intersections to detect over-speeding, unsafe overtaking, and lane changing. Similarly, providing adequate pedestrian facilities near built-up areas can help to cut down severe crash counts. Even low-cost intelligent transportation system (ITS) solutions can be designed to warn vehicles of the presence of pedestrians on highway crosswalks. The study also highlights that over-speeding should be controlled by providing traffic calming measures or posted speed limit signs, which can reduce the proportion of severe accidents on such highways. From this, it can be concluded that non-parametric approaches are effective techniques that can be employed in studies in the field of transportation and safety.