Prioritization of Near-Miss Incidents Using
Text Mining and Bayesian Network
Abhishek Verma(✉), Deeshant Rajput, and J. Maiti
Department of Industrial and Systems Engineering,
Indian Institute of Technology Kharagpur, Kharagpur 721302, India
abhishekverma.cs@gmail.com, deeshantrajput@gmail.com,
jhareswar.maiti@gmail.com
Abstract. Near-miss incidents can be treated as events that signal weaknesses
in the safety management system (SMS) of a workplace. Analyzing near-misses
reveals the root causes behind such incidents, so that effective safety
interventions can be developed beforehand. Despite their large potential for
improving workplace safety, near-misses are rarely analyzed in the literature
because they are often reported as text narratives. The aim of this study is
therefore to explore text mining for extracting the root causes of near-misses
from the narrative descriptions of such incidents and to measure their
relationships probabilistically. Root causes were extracted using the word cloud
technique, and a causal model was constructed using a Bayesian network (BN).
Finally, using the BN's inference mechanism, scenarios were evaluated and root
causes were listed in prioritized order. A case study in a steel plant validated
the approach and raised concerns about a variety of circumstances, such as
incidents related to collision, slip-trip-fall, and working at height.
Keywords: Bayesian network · Near-miss incidents · Workplace safety · Word cloud · Narrative text
1 Introduction
Near-miss reporting provides the same amount of information as accident reporting,
without any serious consequences. It gives the opportunity to move from reaction to
prediction of incidents. Quantitative estimation of near-miss events is important
because, when constraints are lacking or control over the chain of events is lost, a
near-miss may escalate into an accident in the near future. The aim of this study is to
investigate and model the root causes of near-miss incidents at a steel plant. Data were
captured as incident reports, which contain incident information in textual format. The
narrative text field lets the user include details about the incident, such as a
description of the machine, the exact location, the surrounding conditions, and a
summary of the incident. However, it also makes it harder to extract meaningful
information. To this end, both the structured and the narrative text data were analyzed
using text mining and a Bayesian network (BN). The approach has been implemented to
prioritize the root causes behind near-miss events.
© Springer Nature Singapore Pte Ltd. 2017
M. Singh et al. (Eds.): ICACDS 2016, CCIS 721, pp. 183–191, 2017.
DOI: 10.1007/978-981-10-5427-3_20
The rest of the paper is organized as follows. A brief review of the related literature
is given in Sect. 2. Section 3 briefly describes the data preparation and the techniques
employed. The results obtained from the model and their practical implications are given
in Sect. 4. Finally, conclusions of the study and the scope for future research are given
in Sect. 5.
2 Literature Review
Research in near-miss management is still in its infancy, leaving scope and opportunity
to structure and develop models around it. The importance of near-miss incident data
analysis for safety improvement in an organization has been discussed, and an inverse
proportionality between near-misses and actual accident cases was found [1]. Learning
from near-miss events and prioritizing the causal factors behind them are therefore
essential. Some safety-related studies have used narrative text to extract and predict
incident scenarios [2–4]. Bayesian auto-coding methods have been applied to near-miss
data to minimize the human effort required to manually code large descriptive datasets
[5]. Bier and Mosleh [6] analyzed near-miss cases for a nuclear plant using a
probabilistic model and suggested that near-miss events should be given more preference
than experts' claims. One study evaluated and prioritized the different risks associated
with water mains failure using a Bayesian belief network model [7]. BN has also been
used to prioritize incidents with the help of scores collected from expert opinion.
Recently, a Bayesian network combined with the analytic hierarchy process (AHP) has been
used to prioritize the factors behind near-miss events [8]. In that study, the prior
probability of the factors causing near-misses was decided by expert judgement. It is
better to include the information that an organization's employees provide about the
near-miss incidents they have experienced. Our study incorporates information from all
workers instead of relying only on expert judgement.
3 Methodology
Figure 1 presents the conceptual framework of this study in its qualitative and
quantitative aspects. The network structure of the BN for identifying incident factors
corresponds to the qualitative aspect, while estimating the probabilities of the factors
and drawing inferences from the network correspond to the quantitative aspect. The
factors (root causes) are identified under coded primary causes by text mining the
narrative text. The prior probability of each factor is calculated by measuring its
frequency in the incident reports for the corresponding coded primary cause. After the
prior probabilities of the root causes are found, the conditional probability is
calculated for every dependent node in the BN to capture the cause-consequence
relationships among them. Hypothetical evidence about the absence of each individual
root cause was then considered to see its effect on the overall BN. Scenarios were
generated to draw conclusions about the top contributing root causes. The R language and
OpenMarkov software were used to build the word cloud and the Bayesian network,
respectively. The analysis was run on an Intel Core(TM) i5-4200M CPU @ 2.50 GHz with
4 GB RAM and Windows 8 (64-bit).
[Figure: flowchart with a qualitative and a quantitative track. Box labels: incident
investigation reports; expert opinion; data preparation; identification of 'primary
causes' for near-miss analysis; grouping of 'primary causes'; extracting the narrative
text for the corresponding 'primary cause'; building the word cloud; selecting the highly
probable root cause words from the word cloud; creating a cause-consequence graph based
on the availability of root cause words in different primary causes; determining the
presence of each root cause in an incident report (counted only once per document);
estimating and assigning the unconditional and conditional probabilities of external and
internal nodes, respectively; computing the Bayesian probabilities using OpenMarkov
software; scenario generation using hypothetical evidence; drawing conclusions and
implications.]
Fig. 1. Conceptual framework for study
3.1 Data Collection and Preparation
For this study, incident investigation data from a steel plant are considered. Data
preparation involves resolving issues such as missing values, duplicate data points,
spelling errors, non-vocabulary words, and incomplete information. The raw file,
extracted in MS Excel format from the safety management system (SMS) database of the
plant studied, originally had 9086 records. After pre-processing, 8877 records were
retained for further analysis. Across the 16 primary causes listed in the SMS, 2984 of
these 8877 records were found to be near-miss incidents.
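As an illustration of the pre-processing step, the following minimal Python sketch drops records with missing fields and exact duplicates. It is not the procedure used in the study: the field names `primary_cause` and `narrative` are hypothetical, and spelling correction and vocabulary checks are left out.

```python
def clean_records(records):
    """Drop records with a missing field and exact duplicate records.

    Each record is a dict with the hypothetical keys 'primary_cause'
    and 'narrative'. Narratives are lower-cased and whitespace is
    collapsed before duplicates are compared.
    """
    seen = set()
    cleaned = []
    for rec in records:
        cause = (rec.get("primary_cause") or "").strip()
        text = " ".join((rec.get("narrative") or "").lower().split())
        if not cause or not text:
            continue  # missing value: skip the record
        key = (cause, text)
        if key in seen:
            continue  # duplicate data point: skip the record
        seen.add(key)
        cleaned.append({"primary_cause": cause, "narrative": text})
    return cleaned
```

Normalizing case and whitespace before comparison lets trivially different copies of the same report be treated as duplicates.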
3.2 Word Cloud Preparation
A word cloud extracts information from text content and provides an overview of the key
points. A weighting function calculates the weight of each word using term
frequency-inverse document frequency (tf-idf) [9], which magnifies the importance of
words that occur frequently within a document and rarely across documents. The steps to
create the word cloud using the R language are briefly as follows:
1. The brief description of incidents is extracted for each individual primary cause.
2. Stop words like “a”, “an”, “the”, “and”, “or” are removed.
3. A corpus is created with the ‘Corpus()’ function, and a structured dataset is created
using the ‘TermDocumentMatrix()’ function.
4. Scoring of words is done by using the ‘tf-idf’ statistic given as follows:
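\[
\mathrm{tf}(t, d) = \frac{\text{number of times word } t \text{ appears in a document}}{\text{total number of words in the document}} = \frac{f_d(t)}{\max_{w \in d} f_d(w)} \tag{1}
\]
\[
\mathrm{idf}(t, D) = \ln \frac{\text{total number of documents}}{\text{number of documents with word } t \text{ in it}} = \ln \frac{|D|}{|\{d \in D : t \in d\}|} \tag{2}
\]
\[
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) \tag{3}
\]
where $f_d(t)$ is the frequency of word $t$ in report $d$, and $D$ is the corpus of documents.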
5. The resulting structured set of words and their estimated weights is finally passed to
the ‘wordcloud()’ function to build the word cloud.
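The scoring in Eqs. (1)–(3) can be sketched in a few lines. The study itself used R; the Python below is only an illustrative stand-in, and it assumes the max-normalized symbolic form of Eq. (1), whitespace tokenization, and the short stop-word list from step 2. All function names are hypothetical.

```python
import math
from collections import Counter

# Short stop-word list taken from step 2; a real list would be longer.
STOP_WORDS = {"a", "an", "the", "and", "or"}


def tokenize(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tf(term, doc_tokens):
    """Eq. (1), symbolic form: f_d(t) / max_w f_d(w)."""
    counts = Counter(doc_tokens)
    if not counts:
        return 0.0
    return counts[term] / max(counts.values())


def idf(term, corpus):
    """Eq. (2): ln(|D| / number of documents containing the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0  # term absent from the corpus: no weight
    return math.log(len(corpus) / n_containing)


def tfidf(term, doc_tokens, corpus):
    """Eq. (3): product of tf and idf."""
    return tf(term, doc_tokens) * idf(term, corpus)
```

With this definition, a word that appears in every document receives an idf of zero, so only words that discriminate between reports carry weight in the cloud.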
3.3 Bayesian Network
Bayesian networks (BNs) use a directed acyclic graph (DAG) to represent conditional
probability relationships among a set of variables [10]. As shown in Fig. 2, a BN is
composed of a set of variables (e.g., X, Y1, Y2 and Z) and a set of directed links
between vertices that represent the relationships between these variables. Each variable
has mutually exclusive states (e.g., for X, Y1, Y2 and Z, the states are {Pr = present,
Ab = absent}). Nodes with arcs directed into them are called “child” nodes, and nodes
from which arcs originate are called “parent” nodes (e.g., Y1 and Y2 are children of X
and parents of Z). Edges show conditional dependencies between the variables, such that
the value of any variable is a probabilistic function of the values of its parents in
the DAG. These dependencies on predecessor nodes are quantified through conditional
probability tables (CPTs) attached to each node.
[Figure: DAG with arcs X → Y1, X → Y2, Y1 → Z and Y2 → Z; each node carries the
probability table shown below.]

Var X    Probability
Pr       P(X=Pr)
Ab       P(X=Ab)

Var X    Var Y1 = Pr       Var Y1 = Ab
Pr       P(Y1=Pr|X=Pr)     P(Y1=Ab|X=Pr)
Ab       P(Y1=Pr|X=Ab)     P(Y1=Ab|X=Ab)

Var X    Var Y2 = Pr       Var Y2 = Ab
Pr       P(Y2=Pr|X=Pr)     P(Y2=Ab|X=Pr)
Ab       P(Y2=Pr|X=Ab)     P(Y2=Ab|X=Ab)

Var Y1   Var Y2   Var Z = Pr                Var Z = Ab
Pr       Pr       P(Z=Pr|Y1=Pr, Y2=Pr)      P(Z=Ab|Y1=Pr, Y2=Pr)
Pr       Ab       P(Z=Pr|Y1=Pr, Y2=Ab)      P(Z=Ab|Y1=Pr, Y2=Ab)
Ab       Pr       P(Z=Pr|Y1=Ab, Y2=Pr)      P(Z=Ab|Y1=Ab, Y2=Pr)
Ab       Ab       P(Z=Pr|Y1=Ab, Y2=Ab)      P(Z=Ab|Y1=Ab, Y2=Ab)
Fig. 2. Sample of Bayesian network (BN)
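To make the role of the CPTs concrete, the sample network of Fig. 2 can be evaluated by brute-force enumeration of the joint distribution P(X) P(Y1|X) P(Y2|X) P(Z|Y1, Y2). The sketch below is illustrative only: every probability value in it is invented for the example, and the function names are hypothetical.

```python
from itertools import product

STATES = ("Pr", "Ab")  # present / absent

# Illustrative numbers only; each entry is the probability of the
# "Pr" state, and the "Ab" state is its complement.
P_X = 0.3                                   # P(X=Pr)
P_Y1_GIVEN_X = {"Pr": 0.8, "Ab": 0.1}       # P(Y1=Pr | X)
P_Y2_GIVEN_X = {"Pr": 0.6, "Ab": 0.2}       # P(Y2=Pr | X)
P_Z_GIVEN_Y = {                             # P(Z=Pr | Y1, Y2)
    ("Pr", "Pr"): 0.9,
    ("Pr", "Ab"): 0.5,
    ("Ab", "Pr"): 0.4,
    ("Ab", "Ab"): 0.05,
}


def prob(p_present, state):
    """Probability of a binary state given P(state = Pr)."""
    return p_present if state == "Pr" else 1.0 - p_present


def joint(x, y1, y2, z):
    """Chain-rule factorization implied by the DAG of Fig. 2."""
    return (prob(P_X, x)
            * prob(P_Y1_GIVEN_X[x], y1)
            * prob(P_Y2_GIVEN_X[x], y2)
            * prob(P_Z_GIVEN_Y[(y1, y2)], z))


def marginal_z(z="Pr"):
    """P(Z=z), obtained by summing the joint over X, Y1 and Y2."""
    return sum(joint(x, y1, y2, z)
               for x, y1, y2 in product(STATES, repeat=3))
```

Enumeration is exponential in the number of nodes, so it suits only toy networks like this one; tools such as OpenMarkov use more efficient inference.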
In Bayesian analysis the relation between parent nodes Yi (i = 1, 2, …, n) and the
evidence or child node Z can be computed as:
\[
p(Y_i \mid Z) = \frac{p(Z \mid Y_i)\, p(Y_i)}{\sum_{j=1}^{n} p(Z \mid Y_j)\, p(Y_j)} \tag{4}
\]
where p(Y|Z) represents the posterior probability of occurrence of variable Y given
that Z occurs, p(Y) is the prior probability of Y, and p(Z|Y) denotes the likelihood
of Z given the occurrence of Y.
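Equation (4) can be checked numerically with a short sketch. It assumes that the Yi form a set of mutually exclusive and exhaustive hypotheses, which is what the normalizing sum in the denominator requires; the function name and the example numbers in the usage note are hypothetical.

```python
def posterior(priors, likelihoods):
    """Apply Eq. (4) to mutually exclusive, exhaustive hypotheses.

    priors[h]      = p(Y_h)
    likelihoods[h] = p(Z | Y_h)
    Returns a dict with p(Y_h | Z) for every hypothesis h.
    """
    evidence = sum(likelihoods[h] * priors[h] for h in priors)
    if evidence == 0:
        raise ValueError("evidence Z has zero probability under all hypotheses")
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}
```

For example, with priors 0.2 and 0.8 and likelihoods 0.9 and 0.1, the evidence term is 0.18 + 0.08 = 0.26, so the first hypothesis has posterior 0.18 / 0.26 even though its prior was the smaller of the two.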
4 Results and Discussion
4.1 Root Cause Extraction from Word Cloud
A comparison word cloud allows the differences and similarities between two or more
primary causes to be studied by plotting the word cloud of each primary cause against
the others. Figure 3 shows the comparison cloud of “collision and fall related”
incidents, grouped by the primary causes dashing/collision, skidding, slip/trip/fall,
and working at height. Owing to space limitations, not all clouds are shown in the
paper; readers are encouraged to contact the authors for the remaining ones. Keywords
(root causes) were extracted on the basis of the number of incident reports in which
they appear (each keyword counted only once per report).
Fig. 3. Comparison word cloud of collision and fall related incidents
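Counting each keyword once per report amounts to a document-frequency count, and dividing by the number of reports gives a relative frequency that can serve as a prior probability. The following Python sketch illustrates the idea under stated assumptions: the function name, the keyword-to-variant mapping, and the simple letter-only tokenization are all hypothetical and not taken from the study.

```python
import re


def keyword_priors(reports, keywords):
    """Fraction of reports in which each root-cause keyword appears.

    reports  : list of narrative strings for one primary cause
    keywords : dict mapping a root-cause label to a set of word variants
    A keyword is counted at most once per report, however often it
    occurs inside that report.
    """
    n_reports = len(reports)
    if n_reports == 0:
        return {label: 0.0 for label in keywords}
    priors = {}
    for label, variants in keywords.items():
        hits = sum(
            1 for text in reports
            if set(re.findall(r"[a-z]+", text.lower())) & variants
        )
        priors[label] = hits / n_reports
    return priors
```

Converting each report to a set of words before matching is what enforces the once-per-report rule.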
4.2 BN Causal Model from Data
Figure 4 shows the model constructed from the root causes extracted from the word cloud.
In a BN, the structure of the conditional probability table (CPT) depends on the
conditional independence of the nodes. In our study, two states, ‘present’ and ‘absent’,
were defined for every node. The probability of each independent node was calculated
from its frequency in all the reports of the particular primary cause. OpenMarkov
software [11] was employed for model construction and CPT calculation. Following
experts' opinions, the primary causes were clubbed into four groups on the basis of
similarity, which also reduces the amount of calculation. Thus, three levels of causal
factors (group, primary, and root level) were considered in building the BN. Finally,
the conditional probabilities of all child nodes were computed for all possible
combinations of the states of their parents. The resulting BN has 57 nodes (36
independent nodes and 21 dependent nodes) and 88 links, and 1200 conditional
probabilities were estimated. After all the prior and conditional probabilities at every
node were calculated, the probability of occurrence of a near-miss event came out to be
36.04%. Note that, because of the high connectivity of the network and the low
probability of the individual root-cause nodes, any single node contributes only a small
change to the near-miss index.
Fig. 4. Bayesian network to measure near-miss incidents
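One common way to fill a CPT for a binary child node is to take relative frequencies over the records that share each parent configuration. The sketch below shows that approach as an assumption; the paper does not state that OpenMarkov or the authors computed the CPTs this way, and the function and node names are hypothetical.

```python
from collections import defaultdict


def estimate_cpt(records, child, parents):
    """Relative-frequency estimate of P(child = present | parents).

    records : list of dicts mapping node name -> bool (True = present)
    Returns a dict keyed by the tuple of parent states. Parent
    configurations that never occur in the data are not included.
    """
    counts = defaultdict(lambda: [0, 0])  # config -> [total, child present]
    for rec in records:
        config = tuple(rec[p] for p in parents)
        counts[config][0] += 1
        counts[config][1] += int(rec[child])
    return {config: present / total
            for config, (total, present) in counts.items()}
```

A binary child with k binary parents has 2^k parent configurations, so sparsely observed configurations may need smoothing or expert input before they can be used in the network.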
Table 1. Prioritized order of groups after incorporating the hypothetical evidence about their
absence

Order  Group                           Posterior        Near-miss node probability (%)
                                       probability (%)  in absence of group cause
1      Collision and fall related      75.19            32.77
2      Process related                 32.40            35.24
3      Energy related                  18.66            35.91
4      Material and equipment related  6.34             36.00
Table 2. Prioritized order of primary cause after incorporating the hypothetical evidence about
their absence

Order  Primary cause               Posterior        Near-miss node probability (%)
                                   probability (%)  in absence of primary cause
1      Slip/trip/fall              47.11            35.03
2      Working at height           59.78            35.13
3      Process incidents           32.45            35.55
4      Fire                        19.77            35.82
5      Dashing/collision           15.80            35.83
6      Electrical flash            54.06            35.92
7      Skidding                    4.84             35.99
8      Material handling           32.77            36.00
9      Gas leakage                 4.56             36.02
10     Energy isolation            14.81            36.03
11     Hot metals                  2.97             36.03
12     Hydraulic/pneumatic         2.30             36.03
13     Equipment machinery damage  13.94            36.04
14     Structural integrity        7.18             36.04
15     Lifting tools tackles       1.93             36.04
16     Toxic chemicals             0.97             36.04
Table 3. Prioritized order of root causes obtained from word cloud after incorporating the
hypothetical evidence about their absence

Order  Root cause (primary cause)  Probability (%)  Near-miss node probability (%)
                                                    in absence of root cause
1      Switch/cable                23.16            36.02
2      Stairs/floor condition      15.87            36.02
3      Wire/rope/sling             16.23            36.03
4      Crane                       13.74            36.03
5      Valve/hose                  10.64            36.03
6      Truck/dumper                8.44             36.03
7      Loading/shifting            7.46             36.03
To prioritize the causal factors at the three levels of the network, their effect on the
near-miss node was estimated by incorporating hypothetical evidence of the absence of
individual nodes. This indicates the effect of a particular node on the overall
near-miss node. The prioritized orders at the group, primary cause, and root cause
levels are given in Tables 1, 2 and 3, respectively. In Table 1, all groups are listed
in prioritized order of their effect on the near-miss node. Similarly, in Table 2, the
primary causes are listed in prioritized order to show their individual impact on the
near-miss node. In Table 3, the root causes extracted from the narrative text using text
mining are listed and their effect on the near-miss node is measured. This will help
safety practitioners to put focused interventions and safety control measures in place
to improve safety performance at workplaces.
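The prioritization logic can be illustrated with the group-level figures: each group is ranked by how far the near-miss probability falls below the 36.04% baseline when that group is set to absent. The sketch below uses the values reported in Table 1; the function name is hypothetical, and the inference that produced those values is not reproduced here.

```python
BASELINE = 36.04  # near-miss node probability (%) with no evidence

# Near-miss node probability (%) when each group is set to "absent",
# as reported in Table 1.
PROB_IF_ABSENT = {
    "Collision and fall related": 32.77,
    "Process related": 35.24,
    "Energy related": 35.91,
    "Material and equipment related": 36.00,
}


def rank_by_impact(baseline, prob_if_absent):
    """Sort causes by the drop in near-miss probability they explain."""
    drops = {name: round(baseline - p, 2) for name, p in prob_if_absent.items()}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)
```

A larger drop means that removing the cause would reduce near-misses more, so the cause ranks higher.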
5 Conclusions
The proposed model can extract information at the lowest level using a word cloud and
combines it with a BN to help in prioritizing near-miss incidents. The model is capable
of providing qualitative and quantitative information on the different causes of
near-misses. Slip/trip/fall, material handling, electrical flash, dashing/collision and
energy isolation are found to be the top five primary causes out of the 16 primary
causes reported in the preliminary investigation report data. At the lowest level, 36
root causes were extracted and their impact on the overall network was measured by
providing hypothetical evidence. It was found that switch/cable, condition of
stairs/floor, wire/rope/sling, crane operation, valve/hose dysfunction, heavy vehicles
such as trucks or dumpers, and loading or shifting operations have a greater propensity
to cause incidents. Other root causes can also be prioritized on the basis of their
probability and impact on intermediate nodes. This can help management focus its
attention on improving particular areas. Our future work will focus on analyzing the
organizational hierarchy to find location-specific causes. Moreover, the model can be
validated by using expert opinions as the gold standard, or by using the developed
near-miss based system to predict actual accidents and measuring its predictive power.
References
1. Jones, S., Kirchsteiger, C., Bjerke, W.: The importance of near miss reporting to further
improve safety performance. J. Loss Prev. Process Ind. 12(1), 59–67 (1999)
2. Abdat, F., Leclercq, S., Cuny, X., Tissot, C.: Extracting recurrent scenarios from narrative
texts using a Bayesian network: application to serious occupational accidents with
movement disturbance. Accid. Anal. Prev. 70, 155–166 (2014)
3. Lincoln, A.E., Sorock, G.S., Courtney, T.K., Wellman, H.M., Smith, G.S., Amoroso, P.J.:
Using narrative text and coded data to develop hazard scenarios for occupational injury
interventions. Inj. Prev. 10(4), 249–254 (2004)
4. Sawaragi, T., Ito, K., Horiguchi, Y., Nakanishi, H.: Identifying latent similarities among
near-miss incident records using a text-mining method and a scenario-based approach. In:
Salvendy, G., Smith, M.J. (eds.) Human Interface 2009. LNCS, vol. 5618, pp. 594–603.
Springer, Heidelberg (2009). doi:10.1007/978-3-642-02559-4_65
5. Taylor, J.A., Lacovara, A.V., Smith, G.S., Pandian, R., Lehto, M.: Near-miss narratives from
the fire service: a Bayesian analysis. Accid. Anal. Prev. 62, 119–129 (2014)
6. Bier, V.M., Mosleh, A.: The analysis of accident precursors and near misses: implications
for risk assessment and risk management. Reliab. Eng. Syst. Saf. 27(1), 91–101 (1990)
7. Kabir, G., Tesfamariam, S., Francisque, A., Sadiq, R.: Evaluating risk of water mains failure
using a Bayesian belief network model. Eur. J. Oper. Res. 240(1), 220–234 (2015)
8. Zubair, M., Park, S., Heo, G., Hassan, M.U., Aamir, M.: Study on nuclear accident
precursors using AHP and BBN, a case study of Fukushima accident. Int. J. Energy Res.
39(1), 98–110 (2015)
9. Manning, C.D., Raghavan, P., Schütze, H.: An Introduction to Information Retrieval.
Cambridge University Press, New York (2008)
10. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann Publishers Inc., San Francisco (1988)
11. Arias, M., Díez, F., Palacios, M.: OpenMarkovXML. A format for encoding probabilistic
graphical models (2010)