
Road Traffic Accidents Injury Data Analytics

2020, International Journal of Advanced Computer Science and Applications

Road safety researchers working on road accident data have witnessed success in road traffic accident analysis through the application of data analytic techniques; however, little progress has been made in the prediction of road injury. This paper applies advanced data analytics methods to predict injury severity levels and evaluates their performance. The study uses predictive modelling techniques to identify risk and the key factors that contribute to accident severity, using publicly available data from the UK Department for Transport covering the period from 2005 to 2019. The paper presents an approach that is general enough to be applied to different datasets from other countries. The results show that tree-based techniques such as XGBoost outperform regression-based ones such as ANN. In addition, the paper identifies interesting relationships and acknowledges issues related to the quality of the data.

Mohamed K. Nour (1), Atif Naseer (2), Basem Alkazemi (3), Muhammad Abid Jamil (4)
(1, 3, 4) College of Computer and Information Systems, Umm Al-Qura University
(2) Science and Technology Unit, Umm Al-Qura University

Keywords—Road Traffic Accidents (RTA) analytics; data mining; machine learning; XGBoost

I. INTRODUCTION

A Road Traffic Accident (RTA) is an unexpected event that occurs unintentionally on the road, involves vehicles and/or other road users, and causes casualties or loss of property. Over 90% of the world's road fatalities occur in low- and middle-income countries, which account for only 48% of the world's registered vehicles [1]. The financial loss, about US$518 billion, exceeds the development assistance allocated to these countries. While developed nations have stable or declining road traffic death rates thanks to coordinated corrective efforts across sectors, developing countries are still losing 1–3% of their gross national product (GNP) to the endemic of traffic casualties. The World Health Organization (WHO) fears that, unless immediate action is taken, road crashes will rise to the fifth leading cause of death by 2030, resulting in an estimated 2.4 million fatalities per year [1]. Thus, measures to reduce crashes based on an in-depth understanding of the underlying causes are of great interest to researchers. The 21st century has seen a rapid growth of road motorisation driven by population growth, massive urbanisation, and the increased mobility of modern society; the risk of road traffic fatality (RTF) rises accordingly, and RTA can be regarded as a "modern epidemic" [1].

Past research on road traffic accident analysis has relied mainly on statistical methods such as linear and Poisson regression. This paper presents an analytic framework to predict accident severity for road traffic accidents. In particular, the paper addresses issues related to data preprocessing and preparation such as data aggregation, transformation, feature engineering and imbalanced data. In addition, the paper aims to apply machine learning models to enable more accurate predictions. Hence, the paper compares the performance of several machine learning algorithms in predicting accident injury severity; in particular, it applies logistic regression, support vector machines, decision trees, random forests, XGBoost and artificial neural network models.

One of the most important tasks in road risk analysis and modelling is to predict the accident severity level. This paper investigates the process of constructing a classification model to predict accident severity level. In particular, the study:

• Presents the data management framework, followed by a discussion of how the data was prepared prior to modelling, including pre-processing and data cleansing. This is presented in the Data Management section.

• Identifies gaps in RTA predictive modelling techniques, gives brief background on each technique used in this paper, and presents prior work in road traffic accident prediction together with data requirements and reported performance results. This is presented in the Data Analysis section.

• Builds prediction models. This part states the performance metrics and the data used, then compares classifiers, in particular logistic regression, support vector machines, neural networks, decision trees, random forests and extreme gradient boosting trees (XGBoost). The unbalanced class distribution is also investigated to assess its impact on predicting injury severity. This is presented in the Results section.

The rest of this paper is organized as follows: Section II reviews previous work. Section III describes the methodology used in this work, and Section IV describes the data management and the patterns in the traffic accident data. Section V presents the results and analysis of all the approaches used in this work. Section VI gives the conclusions and future work.

II. LITERATURE REVIEW

Mehdizadeh et al. [2] presented a comprehensive review of data analytic methods in road safety. Analytic models can be grouped into two categories: (a) predictive or explanatory models that attempt to understand and quantify crash risk, and (b) optimization techniques that focus on minimizing crash risk through route/path selection and rest-break scheduling. Their work presented publicly available data sources and descriptive analytic techniques (data summarization, visualization, and dimension reduction) that can be used to achieve safer routing, and provided code to facilitate data collection and exploration by practitioners and researchers. The paper also reviewed the statistical and machine learning models used for crash risk modelling. Hu et al. [3] categorized the optimization and prescriptive analytic models that focus on minimizing crash risk. Ziakopoulos et al. [4] critically reviewed the existing literature on spatial approaches that include the dimension of space, in its various aspects, in road safety analyses.

Moosavi et al. [5] identified weaknesses in road traffic accident research, including small-scale datasets, dependency on extensive sets of attributes, and lack of applicability to real-time use. Their work proposed a data collection technique together with a deep neural network model called Deep Accident Prediction (DAP); the results showed significant improvements in predicting rare accident events. Zagorodnikh et al. [6] developed an information system that automatically displays accident concentrations on an electronic terrain map for Russian RTA data, to help simplify RTA analysis. Kononen et al. [7] analysed the severity of accidents that occurred in the United States using a logistic regression model. They reported sensitivity of 40% and specificity of 98%, and identified the most important predictors of injury level as change in velocity, seat belt use, and crash direction.

Artificial Neural Networks (ANNs) are among the data mining and non-parametric techniques with which researchers have analysed the severity of accidents and the injuries of those involved. Delen et al. [8] applied ANNs to model the relationships between injury severity levels and crash-related factors, using US crash data with 16 attributes. The work identified the factors that most influence the injury level: seat belt use, alcohol or drug use, age, gender, and vehicle.
Naseer et al. [9] introduced a deep learning based traffic accident analysis method, highlighting deep learning techniques for building prediction and classification models from road accident data. Sharma et al. [10] applied support vector machines with different Gaussian kernel functions to extract important features related to accident occurrence. The paper compared neural networks with support vector machines and reported that SVMs achieved superior accuracy; however, SVMs share the same disadvantages as ANNs in traffic accident severity prediction, as mentioned earlier. Meng et al. [11] used XGBoost to predict accidents from road traffic accident data drawn from multiple sources, combining historical data with weather and traffic data. Schlögl et al. [12] performed multiple experiments showing that XGBoost performs better than several other machine learning algorithms. Ma et al. [13] proposed an XGBoost based framework that analysed the relationship between collision, time, environmental and spatial factors and the fatality rate. Their results show that the proposed method has the best modelling performance compared with other machine learning algorithms, and the paper identified eight factors that have an impact on traffic fatality. Cuenca et al. [14] compared the performance of Naive Bayes, Deep Learning and Gradient Boosting in predicting the severity of injury for Spanish road accidents; their work reported that Deep Learning outperformed the other methods.

III. METHODOLOGY

The methodology adopted in this paper is shown in Fig. 1.

Fig. 1. Proposed Methodology

The first step of the data analytics process is data collection, which is regarded as the primary building block of a successful data analysis project. There are many data sources, such as sensors, visual data from cameras, and IoT and mobile devices, which capture data in different formats and need to be stored in real time or offline. In addition, data is collected from different authorities on traffic volume, accident details and demographic information. The storage can be on local servers or in the cloud.

A key layer of the data management pyramid is data preprocessing. The data acquired from the storage locations cannot be used as it is; it requires preprocessing before any analysis. The acquired data may include missing information that needs to be rectified, as well as duplicated information that needs to be removed. Preprocessing may involve data transformation, which supports normalization, attribute selection, discretization, and hierarchy generation. Data reduction may be required for large-scale data, since analysing a huge amount of data is harder; data reduction increases the efficiency of storage and reduces the analysis cost.

Data analysis uses machine learning algorithms to gain insight into the data. Data analysis is crucial for any organization, as it provides detailed information about the data and supports decision making and predictions about the business. Finally, data can be presented in various forms depending on the type of data being used, for example organized tables, charts, or graphs. Data presentation is important for business users because it delivers the results of the analysis in a visual format.
IV. DATA MANAGEMENT

A. Data Collection

The data comprises publicly available data from the UK government spanning the period 2005 to 2019 [15]. Although the UK Department for Transport provides data from 1979, the data collected from 2005 onwards is reported to be more accurate and to contain less missing data. The records contain information on road traffic collisions involving personal injury that occurred on public roads and were reported to the police. Data is collected by the authorities at the scene of an accident or, in some cases, reported by a member of the public at a police station, then processed and passed on to the authorities. The data includes about 2 million unique collisions, with x, y spatial coordinates available. Data related to traffic flow and information about the UK road network and local authorities are also available separately.

The dataset contains a single entry for each accident with 33 attributes (features), which can be grouped into geographic, accident-related, weather and time attributes. Data related to the vehicles involved in the accidents is stored in a separate file with 16 attributes, and data related to the casualties involved in each accident is stored in a file with 23 attributes. The relation between these three files is one to many, i.e. one accident row can correspond to many casualty rows and many vehicle rows, with the accident index as the linking field.

The severity of an accident is the most important feature for analysing the injury pattern. According to the WSDOT [16], severity levels in accidents are measured using the KABCO scale, with the categories fatal (K), incapacitating injury (A), non-incapacitating injury (B), minor injury (C), and property damage only (PDO or O). In this work, we divided the accidents into three categories: Fatal (where death occurs within 30 days of the accident), Serious injury (where the person requires hospital treatment) and Slight injury (where the person does not require medical treatment). In this project the data is stored in a relational database. In future work, however, the relational data will first be denormalized and then transformed into Hadoop key-value records.
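To make the one-to-many structure of the three files concrete, the sketch below joins accidents, vehicles and casualties on the accident index with pandas. The file and column names (Accidents.csv, Accident_Index, Vehicle_Reference, and so on) are illustrative assumptions based on the description above, not necessarily the exact schema used by the authors.

```python
import pandas as pd

# Hypothetical file names; the published DfT extracts use similar CSV files.
accidents = pd.read_csv("Accidents.csv", low_memory=False)
vehicles = pd.read_csv("Vehicles.csv", low_memory=False)
casualties = pd.read_csv("Casualties.csv", low_memory=False)

# One accident row relates to many vehicle rows and many casualty rows,
# linked by the accident index. A casualty row also references the vehicle
# it was in through Vehicle_Reference.
df = (casualties
      .merge(vehicles, on=["Accident_Index", "Vehicle_Reference"], how="left")
      .merge(accidents, on="Accident_Index", how="left"))

print(df.shape)  # roughly one row per casualty after both joins
```

Joining at the casualty level, as sketched here, matches the paper's choice of casualty severity as the prediction target: the combined table has one row per injured person, enriched with the attributes of the vehicle and the accident involved.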
B. Data Preprocessing

Different data preprocessing and cleaning methods were applied to the data. Data preprocessing involves many tasks and techniques, including dealing with missing values, outlier or anomaly detection, and feature selection. A key stage in data analytics is the selection of data: data needs to be of good quality and clean, and data quality considerations include accuracy, completeness and consistency [17]. Data volume is important as well; the data should be large enough to be of value for predictive modelling, and it must be split into training, test and validation subsets in order to evaluate the model. The following preprocessing steps were applied to make the data ready for analysis and machine learning algorithms:

• Most machine learning methods require the data to be in binary or numeric format, whereas real-life data sources include categorical attributes such as road type and casualty class. All categorical attributes are converted into numeric values.

• Numeric values differ in range. To avoid bias towards large values, all numeric values are normalized to the range 0 to 1.

• Records with missing values are removed.

• Several related fields, such as month, year, week and weekday/weekend, are derived from the date field.

• Low-quality attributes are identified: attributes with more than 70% missing values are removed.

• The correlation between the severity level and every other attribute is calculated. Attributes with very high correlation values are removed, as are attributes with very low correlation values.

• Fields are derived from easting and northing to assign accidents to zones rather than exact locations; the zone size is 1 km².

One of the issues faced when building analytic models for crash severity is the imbalance of the data [18]: fatalities are infrequent, rare events compared to accidents with no or minor injury. Because of this extreme imbalance, most algorithms will not produce good predictive models and will likely misclassify the fatal accidents, since they are not prevalent in the dataset [19]. For imbalanced datasets such as traffic accident data, sampling techniques can help improve classifier accuracy [19]. Two sampling techniques, undersampling and oversampling, are discussed below. Undersampling adjusts the class distribution of a dataset in favour of the minority class: the majority class is reduced, or undersampled [17], by randomly eliminating majority-class records until both classes match. Oversampling, on the other hand, increases the minority class, or oversamples it, until its size matches that of the majority class [18]. However, these techniques require specialised skill, and it can take a significant amount of time to identify the best sample.
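As a minimal sketch of how these preprocessing and undersampling steps could be implemented, assuming pandas and scikit-learn and illustrative column names such as Date, Light_Conditions and Casualty_Severity (with the target coded 0 for fatal/serious and 1 for slight, as in Table III), one possible version is:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import resample

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Treat the -1 codes used for unknown values as missing, drop incomplete rows.
    df = df.mask(df == -1).dropna()

    # Derive calendar features from the date field, then drop the raw field.
    date = pd.to_datetime(df["Date"], dayfirst=True)
    df = df.assign(Month=date.dt.month, Year=date.dt.year,
                   Weekend=(date.dt.dayofweek >= 5).astype(int))
    df = df.drop(columns=["Date"])

    # Binary-encode a categorical attribute, e.g. light conditions
    # (code 1 = daylight in the published field definitions).
    df["Daylight"] = (df["Light_Conditions"] == 1).astype(int)

    # Scale the remaining numeric columns to the [0, 1] range.
    num_cols = df.select_dtypes("number").columns.drop("Casualty_Severity",
                                                       errors="ignore")
    df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
    return df

def undersample(df: pd.DataFrame, target: str = "Casualty_Severity") -> pd.DataFrame:
    # Randomly reduce the majority (slight injury, coded 1) class to the size
    # of the minority (fatal/serious, coded 0) class.
    minority = df[df[target] == 0]
    majority = df[df[target] == 1]
    majority_down = resample(majority, replace=False,
                             n_samples=len(minority), random_state=42)
    return pd.concat([minority, majority_down]).sample(frac=1, random_state=42)
```

Replacing `replace=False` with `replace=True` on the minority class would turn the same helper into a simple random oversampler, the second technique discussed above.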
In this study, feature selection was applied to the dataset. The dataset is preprocessed as specified in Tables I, II and III: Table I shows the accident features, Table II the vehicle features, and Table III the casualty features, together with their types. The tables also show the preprocessing applied to each feature; some features are excluded, while others are scaled or recoded.

TABLE I. ACCIDENT FEATURES (Variable | Type | Preprocessing)

Accident Index | Link field | EXCLUDE (unique in accidents)
Police Force | Number from 1 to 98 | 0 London, 1 otherwise
Accident Severity | 1 Fatal, 2 Serious, 3 Slight | EXCLUDE
Number of Vehicles | Numeric | SCALE
Number of Casualties | Numeric | SCALE
Date (DD/MM/YYYY) | Date | Split into month, year, week, weekend/weekday
Day of Week | 1 to 7 | SCALE
Time (HH:MM) | Time | Split into rush hours and non rush hours
Location Easting OSGR (null if not known) | Numeric | Remove last two digits and scale
Location Northing OSGR (null if not known) | Numeric | Remove last two digits and scale
Longitude (null if not known) | Numeric | INCLUDE
Latitude (null if not known) | Numeric | INCLUDE
Local Authority (District) | 1 to 941 | exclude
Local Authority (Highway Authority, ONS code) | 208 items | exclude
1st Road Class | 1 to 6 | 0 motorway, 1 otherwise
1st Road Number | Numeric | exclude
Road Type | 1 to 12 | 1 motorway
Speed Limit | Numeric | SCALE
Junction Detail | 0 to 9 | 0 no junction, 1 otherwise
Junction Control | 0 to 4 | 0 no junction control, 1 otherwise
2nd Road Class | 0 to 6 | 0 motorway, 1 otherwise
2nd Road Number | Numeric | exclude
Pedestrian Crossing - Human Control | 0 to 2 | 0 no pedestrian crossing, 1 otherwise
Pedestrian Crossing - Physical Facilities | 0 to 8 | 0 no pedestrian crossing, 1 otherwise
Light Conditions | 1 to 7 | 0 daylight, 1 otherwise
Weather Conditions | 1 to 9 | 0 good conditions, 1 otherwise
Road Surface Conditions | 1 to 7 | 0 dry, 1 otherwise
Special Conditions at Site | 0 to 7 | 0 no special conditions, 1 otherwise
Carriageway Hazards | 0 to 7 | 0 no hazard, 1 otherwise
Urban or Rural Area | 1 to 3 | 0 urban, 1 otherwise
Did Police Officer Attend Scene of Accident | 1 to 3 | 0 attended, 1 otherwise

TABLE II. VEHICLE FEATURES (Variable | Type | Preprocessing)

Accident Index | Link field | EXCLUDE (unique accident, one or more vehicles)
Vehicle Reference | Link field | EXCLUDE (unique vehicle, one or more casualties)
Vehicle Type | 1 to 113 | 0 car, 1 otherwise
Towing and Articulation | 1 to 5 | 0 no towing, 1 otherwise
Vehicle Manoeuvre | 1 to 18 | 0 reversing, 1 otherwise
Vehicle Location - Restricted Lane | 0 to 10 | 0 on main carriageway, 1 otherwise
Junction Location | 0 to 8 | 0 no junction, 1 otherwise
Skidding and Overturning | 0 to 5 | 0 no skidding, 1 otherwise
Hit Object in Carriageway | 0 to 12 | 0 no object, 1 otherwise
Vehicle Leaving Carriageway | 0 to 8 | 0 not leaving, 1 otherwise
Hit Object off Carriageway | 0 to 11 | 0 no object off carriageway, 1 otherwise
1st Point of Impact | 0 to 4 | 0 no impact, 1 otherwise
Was Vehicle Left Hand Drive | 1 to 2 | 0 right hand, 1 otherwise
Journey Purpose of Driver | 1 to 6 | 0 work, 1 otherwise
Sex of Driver | 1 to 3 | 0 male, 1 otherwise
Age Band of Driver | 1 to 11 | Numeric
Engine Capacity | Numeric |
Age of Vehicle (manufacture) | Numeric |
Driver IMD Decile | 0 to 10 | 0 deprived, 1 otherwise
Driver Home Area Type | 1 to 3 | 0 deprived, 1 otherwise
TABLE III. CASUALTY FEATURES (Variable | Type | Preprocessing)

Accident Index | Link field | EXCLUDE (unique accident, one or more casualties)
Vehicle Reference | Link field | EXCLUDE (unique in vehicle, one or more in casualty)
Casualty Reference | Link field | EXCLUDE (unique in casualty table)
Casualty Class | 1 to 3 | 0 driver, 1 otherwise
Sex of Casualty | 1 to 2 | 0 male, 1 otherwise
Age Band of Casualty | Numeric | SCALE
Casualty Severity | 1 to 3 | TARGET VARIABLE (0 fatal, 1 otherwise)
Pedestrian Location | 1 to 10 | 0 crossing, 1 otherwise
Pedestrian Movement | 1 to 9 | 0 crossing, 1 otherwise
Car Passenger | 0 to 2 | 0 passenger, 1 otherwise
Bus or Coach Passenger | 0 to 4 | 0 bus or coach passenger, 1 otherwise
Pedestrian Road Maintenance Worker (from 2011) | 0 to 2 | 0 road worker, 1 otherwise
Casualty Type | 0 to 113 | 0 pedestrian, 1 otherwise
Casualty IMD Decile | 0 to 10 | 0 deprived, 1 otherwise
Casualty Home Area Type | 1 to 3 | 0 urban, 1 otherwise

C. Data Analysis

Methods for traffic accident prediction can be broadly classified into three categories: statistical models, machine learning and analytics approaches, and simulation-based methods [17]. This research concentrates on machine learning approaches. Machine learning is a broad concept that includes supervised and unsupervised techniques. Supervised learning techniques include artificial neural networks and their variations (deep learning, self-organising maps), support vector machines (SVM), decision trees, and Bayesian inference. Unsupervised learning includes association rules and clustering techniques; it involves searching for previously unknown patterns or groupings, and these techniques usually work without a prior target variable. Supervised learning, on the other hand, involves classification, prediction and estimation techniques that use a target variable. Classification assigns a class to an instance, i.e. automatically assigning a traffic accident to one predefined severity class. Prediction is similar to classification but assigns a continuous value to an instance. Supervised learning methods usually use two sets: a training set and a test set. The training data is used for learning the model and requires an initial group of labelled traffic accidents. The test set is used to measure the performance of the learned model and includes labelled traffic accident instances that do not participate in learning the classifiers.

This paper focuses on applying classification methods to classify accident severity. Five techniques will be applied and compared: (1) logistic regression models, (2) deep neural networks, (3) support vector machines, (4) decision trees, and (5) extreme gradient boosting.

1) Logistic Regression: Regression models have become an integral component of any data analysis concerned with the relationship between a response variable and one or more explanatory variables. Logistic regression is a maximum-likelihood method that has been used in hundreds of studies of crash outcomes; traditionally, statistical regression models are developed in highway safety studies to associate crash frequency with the most significant variables. Logistic regression is a special case of the generalized linear model (GLM), which generalizes ordinary linear regression by allowing the linear model to be related to a response variable from the exponential family via an appropriate link function. Logistic regression can be binomial or multinomial. The binomial logistic regression model has the following form:

p(y | x, w) = Ber(y | sigm(w^T x))

where w and x are extended vectors, i.e.

w = (b, w_1, w_2, ..., w_D),  x = (1, x_1, x_2, ..., x_D).
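A small, self-contained illustration of this model is shown below, using synthetic stand-in data rather than the study's dataset; it checks that the sigmoid of w^T x + b reproduces scikit-learn's predicted probabilities. The estimator settings are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigm(z):
    # Logistic (sigmoid) link: maps w^T x onto a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic stand-in for the preprocessed accident features and the binary
# severity label (0 = fatal/serious minority, 1 = slight majority).
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.1, 0.9],
                           random_state=0)

clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X, y)

# p(y = 1 | x, w) = sigm(w^T x + b); this matches clf.predict_proba.
manual = sigm(X @ clf.coef_.ravel() + clf.intercept_[0])
assert np.allclose(manual, clf.predict_proba(X)[:, 1])
```

The class_weight="balanced" option is one built-in alternative to the resampling techniques discussed earlier; it re-weights the loss rather than changing the class distribution.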
2) Artificial Neural Networks: Artificial Neural Networks (ANNs) were built to imitate how the human brain works. An ANN is formed by creating a network of small processing units called neurons. Each neuron is very simple, but the network can achieve complex tasks such as pattern recognition, image classification and detection, and natural language processing. Mathematically, an ANN can be viewed as a type of regression system that predicts and estimates new values from historical records. An ANN is able to approximate non-linear functions provided enough data is supplied for training. The architecture of an ANN is built from three layers:

1) Input layer: this layer receives the features for the model.
2) Hidden layer: this consists of one or more layers that determine the depth of the ANN. Each layer is connected to the next through nodes with weighted edges. The performance of the model depends greatly on the hidden layers and their connectivity with the input and output layers.
3) Output layer: this layer produces the model's prediction.

3) Support Vector Machines: Support Vector Machines (SVMs) were introduced as a machine learning technique based on statistical learning theory. SVMs are used for classification and regression problems. The Structural Risk Minimization (SRM) principle applied by SVMs can be superior to Empirical Risk Minimization (ERM), since SRM minimizes the generalization error. The primal form of the SVM for classification is:

H: y = f(x) = sign(w x + b)

For regression the SVM is represented as:

H: y = f(x) = w^T x + b

4) Decision Trees: Decision trees are powerful data mining methods that can be used for classification and prediction. Decision trees represent rules, which are easy to interpret. Several methods are used for creating decision trees, for example Iterative Dichotomiser 3 (ID3) and C4.5. Decision trees are supervised learning methods whose working mechanism is to divide the data randomly into training and testing sets. Single decision trees have high variance, because small changes in the data can yield different results; two ensemble approaches address this, namely bagging and boosting. In bagging, many decision trees are built in parallel and form the base learners, and the sampled data is fed to the learners for training. In boosting, the trees are built sequentially with fewer splits; such small, shallow trees are highly interpretable. Validation techniques such as k-fold cross-validation help in finding the optimal parameters, including the optimal depth of the tree. It is also very important to choose the boosting stopping criteria carefully to avoid over-fitting.
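The sketch below compares scikit-learn stand-ins for these classifier families (an MLP for the ANN, a linear SVM, a single decision tree, and a bagged random forest) using 5-fold cross-validation on synthetic data. It illustrates the comparison protocol only; the hidden layer sizes, depths and other settings are assumptions, not the configurations used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed accident data.
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)

models = {
    "ANN (MLP)": MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500),
    "SVM": LinearSVC(),                       # linear primal form sign(wx + b)
    "Decision tree": DecisionTreeClassifier(max_depth=8),
    "Random forest (bagging)": RandomForestClassifier(n_estimators=200),
}

for name, model in models.items():
    # k-fold cross-validation guards against the high variance of single trees.
    scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
    print(f"{name}: mean balanced accuracy = {scores.mean():.3f}")
```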
5) eXtreme Gradient Boosting (XGBoost): XGBoost is a scalable machine learning system for tree boosting that has proved very popular in machine learning competitions such as those on Kaggle and KDnuggets, where most winning teams either use XGBoost or supplement their solutions with it. This success can mainly be attributed to the scalability inherent in the algorithm: the learning algorithm is optimized to work with sparse data, to run in parallel, and to exploit multithreading [20]. XGBoost is a boosting algorithm that uses gradient descent optimization with a regularized learning objective function. It has the following features:

1) Regularization: XGBoost prevents overfitting by using L1 and L2 regularization.
2) Weighted quantile sketch: finding the split points is a core task of most decision tree algorithms, and their performance is affected when the data is weighted. XGBoost handles weighted data through a distributed weighted quantile sketch algorithm.
3) Block structure for parallel learning: XGBoost utilizes multiple CPU cores using a block structure that is part of its design. Data is sorted and stored in in-memory units, or blocks, which enables data to be reused across iterations. This is also useful for split finding and column sub-sampling.
4) Handling sparse data: data can become sparse for many reasons, such as missing values or one-hot encoding. The XGBoost split finding algorithm can handle different types of sparsity patterns in the data.
5) Cache awareness: in XGBoost, non-contiguous memory access is required to fetch the gradient statistics by row index. XGBoost is therefore designed to make optimal use of the hardware by allocating internal buffers in each thread where the gradient statistics can be stored.
6) Out-of-core computing: this feature optimizes the use of available disk space when handling huge datasets that do not fit into memory.

D. Model Evaluation

Evaluation is a key stage of data analytics that assesses the predictive capability of a model and identifies the model that performs best [17]. Several techniques are normally used to evaluate classification models, such as the confusion matrix, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC). A confusion matrix shows the correct classifications, true positives (TP) and true negatives (TN), in addition to the incorrect classifications, false positives (FP) and false negatives (FN) [21]. From the confusion matrix, accuracy (the percentage of data correctly classified), precision and recall can be calculated. Equations (1) to (6) give the performance metric formulas:

TPR = TP / (TP + FN)                                          (1)
FPR = FP / (FP + TN)                                          (2)
Precision = TP / (TP + FP)                                    (3)
Accuracy = (TP + TN) / (TP + TN + FP + FN)                    (4)
Recall = TP / (TP + FN)                                       (5)
F-measure = (2 * Recall * Precision) / (Recall + Precision)   (6)

V. RESULTS AND ANALYSIS

Using Python, Jupyter Notebook and the scikit-learn, pandas and matplotlib data science libraries, we developed a workflow for processing the dataset and generating the corresponding accident severity prediction models. It is composed of a number of nodes:

1) Dataset: contains the pre-processed data for the experiment.
2) Explore Data: an optional node to help with data exploration and viewing statistics about the data before modelling.
3) Model: contains the algorithms used for model generation.
4) Apply: where the model is applied to the predictors to generate the required results.
5) Predictors: a sample dataset for testing the prediction.
6) Prediction: the resulting table after applying the model to the predictors.

The dataset used is the UK traffic accidents that occurred between 2005 and 2019, obtained from the UK Department for Transport. The data comes in three files, accidents, casualties and vehicles, described in Tables I, II and III. The accident index field joins the three tables in a one-to-many relationship, with one accident record corresponding to one or more casualty and vehicle records. The casualty table has a field called Vehicle Reference that links a particular casualty record with the vehicle and driver information record carrying the same accident index. The original dataset contains about 2M accidents, 2M casualties and 5M vehicles; the combined table resulted in about 3M records.
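Before turning to the exploratory findings, the following sketch illustrates the modelling and evaluation pipeline on synthetic stand-in data: a gradient-boosted classifier (the xgboost package's XGBClassifier) is trained on a 70/30 split and the metrics of Eqs. (1) to (6) are computed from its confusion matrix. The parameter values are illustrative assumptions, not the tuned settings used in the experiments.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the combined, preprocessed accident table.
X, y = make_classification(n_samples=10000, n_features=30, weights=[0.1, 0.9],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Regularised gradient-boosted trees; scale_pos_weight is the usual
# negative/positive ratio re-weighting for residual class imbalance.
model = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1,
                      reg_lambda=1.0,
                      scale_pos_weight=(y_tr == 0).sum() / max((y_tr == 1).sum(), 1))
model.fit(X_tr, y_tr)

tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te)).ravel()
precision = tp / (tp + fp)                                   # Eq. (3)
recall = tp / (tp + fn)                                      # Eq. (5), also TPR
accuracy = (tp + tn) / (tp + tn + fp + fn)                   # Eq. (4)
f_measure = 2 * recall * precision / (recall + precision)    # Eq. (6)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(precision, recall, accuracy, f_measure, auc)
```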
Data was explored using bar charts and histograms to look for trends and patterns. Examples of such graphs are shown in Figs. 2, 3 and 4.

Fig. 2. Accidents per Speed Zone
Fig. 3. Accident Time of Day
Fig. 4. Accidents Day of the Week

The following observations can be noted from the data:

1) Accident severity: more than 90% of records have slight severity.
2) Most crashes involve fewer than five cars, with a median of 2 and a mean of 1.8.
3) The number of casualties ranges from 1 to 93, with a mean of 1.3 and a median of 1.
4) Accidents are spread throughout the week, with a slight increase on Thursdays.
5) Accidents are spread throughout the day, with a slight increase at school return time on working days.
6) First road class 3 roads have the maximum number of accidents, mostly single carriageway roads with a 30 MPH speed limit.
7) Uncontrolled junctions have more accidents than other junction types.
8) Most accidents occur in fine weather conditions with a dry road surface.

The data was then cleaned of incomplete records. All records with empty cells or the value -1 were considered missing and removed. A histogram was then created for each column, and columns with little spread were removed:

1) Bus or Coach Passenger
2) Towing and Articulation
3) Vehicle Location - Restricted Lane
4) 1st Point of Impact
5) Was Vehicle Left Hand Drive?
6) 1st Road Class
7) 1st Road Number
8) 2nd Road Class
9) 2nd Road Number
10) Pedestrian Crossing - Human Control
11) Special Conditions at Site

The resulting number of attributes used for model building is 48 features. Extra preprocessing was applied to categorical attributes by encoding them with 0 and 1 values, and numeric and ordinal fields were scaled to remove the bias due to large values. Out of the 3M accident records, only 29K were fatalities and 190K were serious injuries. Serious and fatal records were therefore grouped into one class, with slight injury as the second class; this reduces the class imbalance and, together with the undersampling method, enables the models to achieve better results.

The selected models for this experiment are logistic regression, decision trees, random forest, support vector machines, neural networks and XGBoost. Hyperparameter tuning was applied to each method. The dataset was partitioned into two parts: 70% for training and 30% for testing. The metric used to evaluate the algorithms was balanced accuracy, which is commonly used with imbalanced datasets. The balanced accuracy results are shown in Table IV, and the ROC curves are shown in Fig. 5.

TABLE IV. BALANCED ACCURACY RESULTS

Method | Balanced accuracy
Logistic Regression | 66.26
Decision Trees | 69.42
Support Vector Machines | 53.22
Neural Networks | 67.23
Random Forest | 73.82
XGBoost | 74.40

XGBoost and random forest showed better performance than logistic regression, support vector machines and neural networks. This can be attributed to the nature of the modelling task and the data used: most attributes have categorical values, for which tree-based methods are reported to outperform regression-based methods. Although a high number of dimensions tends to degrade the performance of tree-based methods, on this data they continued to outperform the linear and non-linear classifiers. One downside is computational cost: as the data size increases, the time required to obtain results grows compared to logistic regression. In addition, further investigation is needed to compare the performance with rule-based methods, which are reported to perform well with categorical data.
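A sketch of the experimental protocol described above (70/30 split, hyperparameter tuning, balanced accuracy, ROC data) is given below on synthetic stand-in data. The grid and the random forest estimator are illustrative; the real search space would be model specific and the study's tuned settings are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=8000, n_features=30, weights=[0.15, 0.85],
                           random_state=1)
# 70/30 split, stratified to preserve the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                           random_state=1)

# Small illustrative hyperparameter grid, scored by balanced accuracy.
grid = GridSearchCV(RandomForestClassifier(random_state=1),
                    param_grid={"n_estimators": [100, 300],
                                "max_depth": [8, 16, None]},
                    scoring="balanced_accuracy", cv=3)
grid.fit(X_tr, y_tr)

best = grid.best_estimator_
print("balanced accuracy:", balanced_accuracy_score(y_te, best.predict(X_te)))
fpr, tpr, _ = roc_curve(y_te, best.predict_proba(X_te)[:, 1])  # data for ROC plots
```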
Fig. 5. ROC Curves for LR, SVM, ANN, DT, RF, XGBoost

Feature importance is shown in Fig. 6, which lists the top 20 attributes affecting the severity level. The top attribute is the casualty type, which specifies whether the casualty was a pedestrian or a passenger; this is followed by vehicle and area attributes. Although 66% of accidents occurred in 30 MPH speed limit zones, the feature importance ranking suggests that the speed limit has less effect on the injury type. This insight can help road traffic authorities prioritise measures to reduce injury levels.

Fig. 6. Feature Importance
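A minimal sketch of how such a ranking can be produced from a trained gradient-boosted model is shown below, again on synthetic stand-in data with placeholder feature names; in the study the ranking would be computed over the 48 engineered accident, vehicle and casualty features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Stand-in data with named columns so the ranking is readable.
X, y = make_classification(n_samples=5000, n_features=10, random_state=2)
cols = [f"feature_{i}" for i in range(X.shape[1])]

model = XGBClassifier(n_estimators=200, max_depth=6).fit(X, y)

# Importances from the fitted booster, sorted to give a top-k ranking
# analogous to Fig. 6.
importance = (pd.Series(model.feature_importances_, index=cols)
              .sort_values(ascending=False))
print(importance.head(20))
```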
VI. CONCLUSION

This paper presented a data analytic framework in which UK traffic accident data was analysed to establish a model for predicting injury severity. The paper used publicly available data from 2005 to 2019 to build prediction models for the injury severity level. It combined all attributes from three data sources to analyse 63 attributes and their relation to accident severity, highlighted issues related to data quality and imbalanced data, and applied techniques to tackle these issues. The paper compared the performance of different machine learning techniques; the XGBoost algorithm was shown to outperform the other techniques with a higher accuracy rate, even with imbalanced data. Further work is suggested to use parallel processing libraries and to compare the performance of rule-based and decision-tree-based techniques.

ACKNOWLEDGMENT

This work is supported by grant number 17-COM-1-010007, Deanship of Scientific Research (DSR) of Umm Al-Qura University, Kingdom of Saudi Arabia. The authors would like to express their gratitude for the support and generous contribution towards pursuing research in this area.

REFERENCES

[1] World Health Organization (WHO), A Road Safety Technical Package, 2017. [Online]. Available: http://iris.paho.org/xmlui/bitstream/handle/123456789/34980/9789275320013-por.pdf?sequence=1&isAllowed=y
[2] A. Mehdizadeh, M. Cai, Q. Hu, M. A. A. Yazdi, N. Mohabbati-Kalejahi, A. Vinel, S. E. Rigdon, K. C. Davis, and F. M. Megahed, "A review of data analytic applications in road traffic safety. Part 1: Descriptive and predictive modeling," Sensors (Switzerland), vol. 20, no. 4, pp. 1–24, 2020.
[3] Q. Hu, M. Cai, N. Mohabbati-Kalejahi, A. Mehdizadeh, M. A. A. Yazdi, A. Vinel, S. E. Rigdon, K. C. Davis, and F. M. Megahed, "A review of data analytic applications in road traffic safety. Part 2: Prescriptive modeling," Sensors (Switzerland), vol. 20, no. 4, pp. 1–19, 2020.
[4] A. Ziakopoulos and G. Yannis, "A review of spatial approaches in road safety," Accid. Anal. Prev., vol. 135, p. 105323, 2020. [Online]. Available: https://doi.org/10.1016/j.aap.2019.105323
[5] S. Moosavi, M. H. Samavatian, A. Nandi, S. Parthasarathy, and R. Ramnath, "Short and long-term pattern discovery over large-scale geo-spatiotemporal data," in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2019, pp. 2905–2913.
[6] N. Zagorodnikh, A. Novikov, and A. Yastrebkov, "Algorithm and software for identifying accident-prone road sections," Transp. Res. Procedia, vol. 36, pp. 817–825, 2018. [Online]. Available: https://doi.org/10.1016/j.trpro.2018.12.074
[7] D. W. Kononen, C. A. Flannagan, and S. C. Wang, "Identification and validation of a logistic regression model for predicting serious injuries associated with motor vehicle crashes," Accid. Anal. Prev., vol. 43, no. 1, pp. 112–122, 2011. [Online]. Available: http://dx.doi.org/10.1016/j.aap.2010.07.018
[8] D. Delen, R. Sharda, and M. Bessonov, "Identifying significant predictors of injury severity in traffic accidents using a series of artificial neural networks," Accid. Anal. Prev., vol. 38, no. 3, pp. 434–444, 2006.
[9] A. Naseer, M. K. Nour, and B. Y. Alkazemi, "Towards deep learning based traffic accident analysis," in 2020 10th Annual Computing and Communication Workshop and Conference (CCWC), 2020, pp. 0817–0820.
[10] B. Sharma, V. K. Katiyar, and K. Kumar, "Traffic accident prediction model using support vector machines with Gaussian kernel," Adv. Intell. Syst. Comput., vol. 437, pp. 1–10, 2016.
[11] H. Meng, X. Wang, and X. Wang, "Expressway crash prediction based on traffic big data," in ACM Int. Conf. Proceeding Ser., 2018, pp. 11–16.
[12] M. Schlögl, R. Stütz, G. Laaha, and M. Melcher, "A comparison of statistical learning methods for deriving determining factors of accident occurrence from an imbalanced high resolution dataset," Accid. Anal. Prev., vol. 127, pp. 134–149, 2019. [Online]. Available: https://doi.org/10.1016/j.aap.2019.02.008
[13] J. Ma, Y. Ding, J. C. Cheng, Y. Tan, V. J. Gan, and J. Zhang, "Analyzing the leading causes of traffic fatalities using XGBoost and grid-based analysis: A city management perspective," IEEE Access, vol. 7, pp. 148059–148072, 2019.
[14] L. G. Cuenca, E. Puertas, N. Aliane, and J. F. Andres, "Traffic accidents classification and injury severity prediction," in 2018 3rd IEEE Int. Conf. Intell. Transp. Eng. (ICITE), 2018, pp. 52–57.
[15] Department for Transport, "Road Traffic Statistics guidance," pp. 1–13, 2014. [Online]. Available: http://data.dft.gov.uk/gb-traffic-matrix/all-traffic-data-metadata.pdf
[16] B. Burdett, "Improving Accuracy of KABCO Injury Severity Assessment by Law Enforcement," Univ. Wisconsin-Madison, 2014.
[17] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. Waltham, MA, 2012.
[18] A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, and F. Herrera, Learning from Imbalanced Data Sets, 2018.
[19] L. Gautheron, A. Habrard, E. Morvant, and M. Sebban, "Metric learning from imbalanced data," in Proc. Int. Conf. Tools with Artif. Intell. (ICTAI), 2019, pp. 923–930.
[20] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2016, pp. 785–794.
[21] P. Hájek, M. Holeňa, and J. Rauch, "The GUHA method and its meaning for data mining," J. Comput. Syst. Sci., vol. 76, no. 1, pp. 34–48, 2010.