Machine learning based real-time vehicle data analysis for safe driving
modeling
Conference Paper · April 2019

DOI: 10.1145/3297280.3297584
All content following this page was uploaded by Pamul Yadav on 28 January 2021.

Machine Learning Based Real-Time Vehicle Data Analysis for Safe
Driving Modeling
Pamul Yadav
School of Elect & Computer Eng., UNIST
P.O. Box 44919
Ulsan, South Korea
pamulydv@unist.ac.kr

Sangsu Jung
MtoV Inc.
P.O. Box 34129
Daejeon, South Korea
sjung@mtov.net

Dhananjay Singh
Dept. of Electronics Engineering, HUFS
P.O. Box 17035
Global Campus - Yongin, South Korea
dsingh@hufs.ac.kr
ABSTRACT
This paper identifies the need to evaluate the meta-features of vehicles, which could help improve drivers’ skills to prevent accidents and also help evaluate the change in the quality of cars over time. We analyze vehicle data using a supervised-learning-based linear regression model that serves as an estimator for Driver’s Safety Metrics and Economic Driving Metrics. The data was collected from fifteen different drivers over a span of one month and amounts to over 15000 data points. The metrics that we have devised have potential applications in automotive technology analysis for developing advanced intelligent vehicles. We also present a system for performing the real-time experiment based on On-Board Diagnostics version II (OBD-II) scanner data. Finally, we analyze and report a parameter accuracy of over 80% for the driver’s safety solution in a real-world scenario.

CCS CONCEPTS
• Computer systems organization → Embedded systems; OBD-II, Dashboard Camera
• Networks → Vehicular networks, Intelligent Transportation System

KEYWORDS
Supervised Learning, Linear Regression, Statistical Analysis, Automotive Vehicle Data

ACM Reference format:
P. Yadav, S. Jung, D. Singh. 2019. In Proceedings of ACM SAC Conference, Limassol, Cyprus, April 8-12, 2019 (SAC’19), 4 pages. DOI: 10.1145/3297280.3297584

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SAC’19, April 8-12, 2019, Limassol, Cyprus
© 2019 Copyright held by the owner/author(s). 978-1-4503-5933-7/19/04. . . $15.00
DOI: 10.1145/3297280.3297584

1 INTRODUCTION
Automotive technologies are providing improved services for driver safety and vehicle security under the umbrella of the Intelligent Transportation System (ITS). In the development of ITS, advanced automotive technologies will play a crucial role in determining the overall experience of users by reducing the risk of road accidents and of cybercrime in the vehicle, by easing the purchase of a used car, and so on. It is often noted that judging a driver’s driving skill is subjective and that it is difficult to set a standard for drivers’ skills [1]. The modern approach to transportation systems is evolving rapidly with intelligent vehicles. The sharp rise in recorded traffic density and road accidents, and the difficulties faced in regulating traffic control effectively in urban and rural areas, have motivated us to develop a smart solution in the context of ITS [2]. The automotive industry has great expectations of these futuristic solutions to improve the safety of people and the security of vehicles. Users are observed to be shifting from an individualistic approach to a data-centric approach based on the OBD-II scanner to obtain an augmented driving experience. In spite of modern command, control, communication, computer and intelligent systems, we still face numerous calamities in which thousands of precious human lives are lost in accidents. Therefore, there is an immediate need to tackle these small-scale yet serious issues using state-of-the-art techniques. We mainly focus on analyzing the data collected from the vehicle using the OBD-II scanner and eventually providing driver safety solutions. We aim to obtain these solutions by observing the blind spots accurately and efficiently using pattern recognition techniques from supervised learning.

In the paper, Section 1 describes the problem statement and gives a brief analysis of the problem that we have attempted to work on. Section 2 discusses the system design used to solve the problem, which encapsulates the definition of the machine learning model, the tuning parameters and the overall data analysis workflow that maintains the order of the whole computing process. Then, in Section 3, we make a detailed performance analysis in terms of the accuracy of our solution. It covers the data generation, the data preprocessing techniques and the algorithms employed in obtaining the results. Finally, in Section 4, we conclude the paper with the overall insights and potential applications of the predictive modelling performed in the experiment.
2 MACHINE LEARNING BASED EXPERIMENTAL MODEL

2.1 System Design
The proposed system consists of external hardware devices, namely an OBD-II scanner and a mini dash camera, which are employed to gather the driving and vehicular data. The OBD-II scanner gathers various data related to vehicle performance, such as speed, acceleration, engine idle time, fuel consumption and distance travelled. The mini dash camera captures real-time images and videos of the events happening in the surroundings of the vehicle. The gathered data is then fed into our utility platform, which transmits it to the cloud server, where the analysis is done and the desired output is produced for our experiment. These results are shared with the driver, insurance companies, and other entities of interest.

Figure 1: Overview of the OBD-II data collection and processing system.
[Diagram labels: Car-Data Collection (OBD Scanner); Outer Data Collection (Mini Dashcam); Data; Cloud Server (Data management web service, Big data analysis server); System Architecture; Data Analysis; B2B (insurance, car sharing, taxi companies); B2C (drivers).]

Therefore, it is a challenge to obtain correct and meaningful knowledge from the collected data. With the growing community interest in machine learning, these techniques have been employed to obtain plausible results and to further advance the knowledge domain in many industries in order to tackle some important challenges. These challenges include, and extend beyond, driver safety performance, estimation of a car’s life, fuel efficiency, and long-distance driving efficiency, all of which involve parametric learning of real-world vehicle-related datasets. In addition, the mini dash cam has the function of recording video; in the event of an accident, it can store the location of the accident and the condition of the vehicle in the case of a rear-end collision by adding the video information of the head-on collision and GPS. The system design has been categorized under the following steps.

2.2 Machine Learning Model
We have used a supervised learning algorithm, which learns from the known target values (labels) for a problem. To train such a model, the vehicle parameters, preferably with a variety of configurations, are required as input variables. Regression models determine continuous numerical values based on multiple input variables, e.g., calculating the ideal speed of a car to minimize fuel consumption according to the road conditions, determining a financial indicator such as gross domestic product based on a changing number of input variables (use of arable land, population education levels, industrial production, etc.), and determining potential market shares with the introduction of new models. We have used a linear regression model for finding future (unknown) values, where the model assumes a relationship between the dependent variable (in our case, the economic driving index or the driver safety index) and the independent variable (in our case, an arithmetic combination of weights and features) [3].

Figure 2: Machine learning model for predicting output values.

2.3 Hyperparameter Tuning
All of the data was processed as described above, and then the hypothesis function was created, to which a bias value was added. We used the Multivariate Linear Regression model as our base model to train on the data, since we found mostly a linear to quadratic relationship among the relevant features. A linear regression model was found to be the best choice for reducing the computational cost and complexity of the training. Hyperparameter tuning was done using ‘GridSearchCV’ in the scikit-learn library [8]. The grid search evaluated all 3 × 5 = 15 combinations of n_estimators and max_features. The hyperparameters were thus fine-tuned automatically, without much manual tuning effort.

2.4 Process Structure
Data collection involves the OBD-II scanner, which gathers data such as fuel efficiency and speed value parameters. We applied normalization and standardization methods to fit the data to the model. The processed data then goes through the feature selection process, which was evaluated using a correlation matrix. After the feature classes were obtained, the data was fed into the supervised-learning-based linear regression model. The model predicted the required output values, which were then sent over the cloud to the database servers where all the processing is done, whereafter the processed information is transmitted to the driver, insurance companies and nearby police stations. This overall system serves as a database, monitoring and alert system to prevent accident risk and to monitor the car’s health [4].

3 PERFORMANCE ANALYSIS

3.1 Data Used
We analyzed the data collected over a span of one month from fifteen different drivers. Data was collected using the OBD-II scanner installed in the test vehicles, developed by MtoV Inc., Korea. In the
original dataset, we recorded 51 features, and after a correlation analysis we retained some of them, such as fuel efficiency, average speed value, maximum speed value, fourth section speed value, interval driving distance, driving time value during green zone, travelling time value, emergency accelerated value, emergency decelerated value, fourth rpm time value and fifth rpm time value [5]. In our dataset, the total number of instances recorded was approximately 6000, which was divided in an 80:20 ratio into training and test sets. We derived the useful features for our scope using a correlation matrix.

TABLE 1: Correlation matrix of SFTY_DRVG_INDX

SFTY_DRVG_INDX          Correlation Matrix Features           Correlation Value
ROF_XCH_SCR             Fuel Efficiency                        1.000000
AVG_SPD_VAL             Average Speed Value                    0.951148
MAX_SPD_VAL             Maximum Speed Value                    0.850964
SPD_TH4_DRVG_TIM_VAL    Fourth Throttle Driving Time Value     0.676948
IDL_HCT                 Idle Time Value                       -0.296508

Table 1 shows the correlation values for our hypothesis #1, the Safe Driving Index, and Table 2 shows the correlation values we used for our hypothesis #2, the Economic Driving Index (ECN_DRVG_INDX). We found that Fuel Efficiency in Table 1 showed a correlation of 1.0 with our hypothesis #1, but considering it alone for the training would lead to overfitting of the data and eventually reduce the efficiency of the model. Therefore, it was important to consider all the features.

TABLE 2: Correlation matrix of ECN_DRVG_INDX

ECNM_DRVG_INDX          Correlation Matrix Features           Correlation Value
TH5_RPM_TIM             Fifth Throttle RPM time               -0.567331
UGY_ACSD_OFT            Urgent Acceleration Number            -0.615989
UGY_RDSD_OFT            Urgent Deceleration Number            -0.621209
TH4_RPM_TIM             Fourth Throttle RPM time              -0.563859

3.2 Data Processing
We processed the data by cleaning it and eliminating the features that were not relevant to our hypothesis. We performed normalization on the data, where the values were shifted and rescaled so that the range ends up within 0 to 5000 in the case of hypothesis 1 and 0 to 200 in the case of hypothesis 2, according to the different sets of values. This was done by subtracting the minimum value and dividing by the maximum minus the minimum.

3.3 Hypothesis Generation
After generating a correlation matrix, we found the appropriate features that were considered in our hypothesis generation. Following are our hypotheses:

(i)

(ii)

We hypothesize an outcome called Economic Driving Index (ECN_DRVG_INDX), represented using h1, and another outcome called Safe Driving Index (SFTY_DRVG_INDX), represented using h2. Based on our correlation values, we frame a linear-regression-based hypothesis where the feature values are represented using xi and the weights for each feature are represented using β.

(iii)

The weights are assumed in the initial stage but are refined with every iteration of the model, while the initial noise in the system is assumed to be a bias value. However, before performing the product of features and weights, we need to perform an inverse operation, as shown in (iii) and (iv), on the feature matrix due to the difference in the data storage formats.

(iv)

We added the bias value to represent an arbitrary but appropriate value to model a real-time scenario of roads. Based on the features that we have considered for deriving the hypothesis, we found that five of the features are useful for the Economic Driving Index (EDI) while four of the features are useful for the Safe Driving Index (SDI).

4 RESULTS AND DISCUSSION
The result analysis consisted of the collection of data from the OBD-II scanner through the app, which was then processed into the machine learning model and finally trained as shown in Fig. 3. Trained output values were used as the benchmark for testing against the gathered data. To do so, we performed k-fold cross-validation with k=10 to train the model. We performed several experiments on the parameters which are essential for testing the vehicle’s safety and economic efficiency. In our first experiment, a relationship between the maximum speed value and the travel time (red zone) is obtained. This relationship describes the total distances travelled while crossing the road given the signal was red. We clearly
observed, as shown in Fig. 4, that when the speed of the car was high, it was more likely for the driver to cross the road at a red signal, which in turn implies an increasing likelihood of meeting with an accident.

Figure 3: A schematic diagram of the analytics process.

We performed k-fold cross-validation [6] with k=10 to train the model. In Hypothesis-1, ECN_DRVG_INDX, we found that the majority of the data was concentrated in the lower left part of the graph, suggesting an inverse logarithmic growth of the trend based on the training data. This showed a positive growth of the ECN_DRVG_INDX based on the hypothesis value. However, the data scatters as the value on the x-axis increases, hinting at a somewhat lower correlation between the predicted value and the hypothesis value. Therefore, the ECN_DRVG_INDX is found to be an inverse logarithmic function of the features.

Figure 4: Linear regression based prediction trend of ECN_DRVG_INDX against hypothesis value.

In Hypothesis-2, the safety driving index (SFTY_DRVG_INDX), we observed a slower initial growth in the trend, which rapidly picks up momentum after crossing a certain threshold value. After this value the trend moves in a logarithmic fashion up towards the positive y-axis. We do observe a few outliers, which were excluded from consideration. Therefore, the safety driving index (SFTY_DRVG_INDX) is found to be a logarithmic function of the features considered in our hypothesis. A good direction for future work would be to obtain different datasets on vehicle metrics and train our model on that data to further improve the accuracy from 80% to more than 90%. Nevertheless, an accuracy of 80% in the economic driving index, which represents a car’s health condition, is good enough to include a human-in-the-loop for the prediction work. We aim to implement our model in an IoT system [7] for a driving school scenario, to judge the driver’s driving ability and to frame a standard for whether the driver should be eligible to obtain a driver’s license.

Figure 5: Linear regression based prediction trend of SFTY_DRVG_INDX against hypothesis value.

5 CONCLUSIONS
In this paper we have obtained some newer insights from the car data analysis, such as the economic driving index (ECN_DRVG_INDX) and the safety driving index (SFTY_DRVG_INDX). The results fit the given features with approximately 80% accuracy and can be helpful in different use cases, such as a parameter for assessing a driver’s driving performance in a driving school, or a good estimate for finding an optimal price for a used car based on several of the factors analyzed in this paper. We also found that the model used to train the data can be improved further by finding better hyperparameter values for the features. It is also possible that different features can be considered for improving the hypothesis.

ACKNOWLEDGMENT
This work was partially supported by the MtoV Inc., VESTELLA Inc., and Hankuk University of Foreign Studies research fund.

REFERENCES
[1] Singh D, Singh M., "Internet of Vehicles for Smart and Safe Driving", International Conference on Connected Vehicles and Expo (ICCVE), Shenzhen, 19-23 Oct., 2015.
[2] Zhang, Y., Lin, W., and Chin, Y., "Data-Driven Driving Skill Characterization: Algorithm Comparison and Decision Fusion," SAE Technical Paper 2009-01-1286, 2009, https://doi.org/10.4271/2009-01-1286. Azevedo, C. L Cardoso.
[3] J. E. Meseguer, C. T. Calafate, J. C. Cano and P. Manzoni, "DrivingStyles: A smartphone application to assess driver behavior," 2013 IEEE Symposium on Computers and Communications (ISCC), Split, 2013, pp.000535-000540. doi: 10.1109/ISCC.2013.6755001.
[4] Schneider, A., Hommel, G., & Blettner, M. (2010). Linear Regression Analysis: Part 14 of a Series on Evaluation of Scientific Publications. Deutsches Ärzteblatt International, 107(44), pp. 776–782.
[5] Kenneth L. Clarkson. 1985. Algorithms for Closest-Point Problems (Computational Geometry). Ph.D. Dissertation. Stanford University, Palo Alto, CA. UMI Order Number: AAT 8506171.
[6] Schneider, A., Hommel, G., & Blettner, M. (2010). Linear Regression Analysis: Part 14 of a Series on Evaluation of Scientific Publications. Deutsches Ärzteblatt International, 107(44), pp. 776–782.
[7] Goszczynska H., Kowalczyk L., Kuraszkiewicz B. (2014) Correlation Matrices as a Tool to Analyze the Variability of EEG Maps. In: Piętka E., Kawa J., Wieclawek W. (eds) Information Technologies in Biomedicine, Volume 4. Advances in Intelligent Systems and Computing, vol 284. Springer.
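The workflow described in Sections 3.1, 3.2, 3.3 and 4 (min-max normalization, correlation screening, an 80:20 split, a linear hypothesis with a bias term whose weights are refined iteratively, and 10-fold cross-validation) can be illustrated with a minimal standard-library Python sketch. It is not the implementation used in the paper. The data is synthetic, and the column names, the coefficients 3.0, 2.0 and -1.0, the learning rate, the use of batch gradient descent, the [0, 1] target range and the mean-absolute-error score are all assumptions made for illustration.

```python
import random
from statistics import mean


def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [new_min] * len(values)
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def predict(bias, weights, row):
    """Hypothesis h(x) = bias + sum_i weights[i] * x[i]."""
    return bias + sum(w * x for w, x in zip(weights, row))


def fit_linear(X, y, lr=0.5, epochs=500):
    """Fit bias and weights by batch gradient descent on the mean squared error."""
    n, d = len(X), len(X[0])
    bias, weights = 0.0, [0.0] * d
    for _ in range(epochs):
        errors = [predict(bias, weights, row) - t for row, t in zip(X, y)]
        bias -= lr * sum(errors) / n
        for j in range(d):
            weights[j] -= lr * sum(e * row[j] for e, row in zip(errors, X)) / n
    return bias, weights


def k_fold_mae(X, y, k=10):
    """Mean absolute error on each of k held-out folds (fold i holds rows i, i+k, ...)."""
    scores = []
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        bias, weights = fit_linear([X[i] for i in train], [y[i] for i in train])
        scores.append(mean(abs(predict(bias, weights, X[i]) - y[i]) for i in test))
    return scores


# Synthetic stand-in for OBD-II records (hypothetical columns and coefficients).
rng = random.Random(0)
avg_speed = [rng.uniform(20, 120) for _ in range(100)]   # km/h
urgent_accel = [rng.uniform(0, 30) for _ in range(100)]  # event count
speed_n = min_max_scale(avg_speed)
accel_n = min_max_scale(urgent_accel)
index = [3.0 + 2.0 * s - 1.0 * a + rng.gauss(0, 0.05) for s, a in zip(speed_n, accel_n)]

X = [list(row) for row in zip(speed_n, accel_n)]
correlations = {"avg_speed": pearson(speed_n, index), "urgent_accel": pearson(accel_n, index)}

cut = int(0.8 * len(X))  # 80:20 split; the synthetic rows are already in random order
bias, weights = fit_linear(X[:cut], index[:cut])
test_mae = mean(abs(predict(bias, weights, r) - t) for r, t in zip(X[cut:], index[cut:]))
fold_mae = k_fold_mae(X, index, k=10)
```

Passing other bounds, for example `min_max_scale(values, 0, 5000)`, gives the kind of target range mentioned in Section 3.2. In this sketch the features are scaled once before splitting for brevity; a stricter evaluation would compute the scaling bounds on the training rows only.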