Prioritization of Near-Miss Incidents Using Text Mining and Bayesian Network

2017, Springer

Abhishek Verma (✉), Deeshant Rajput, and J. Maiti
Department of Industrial and Systems Engineering, Indian Institute of Technology Kharagpur, Kharagpur 721302, India
abhishekverma.cs@gmail.com, deeshantrajput@gmail.com, jhareswar.maiti@gmail.com

Abstract. Near-miss incidents can be treated as events that signal weaknesses in the safety management system (SMS) at the workplace. Analyzing near-misses reveals the root causes behind such incidents so that effective safety interventions can be developed beforehand. Despite its large potential for improving workplace safety, analysis of near-misses is scant in the literature because near-misses are often reported as text narratives. The aim of this study is therefore to explore text mining for extracting the root causes of near-misses from the narrative text descriptions of such incidents and to measure their relationships probabilistically. Root causes were extracted using the word cloud technique, and a causal model was constructed using a Bayesian network (BN). Finally, using the BN's inference mechanism, scenarios were evaluated and root causes were listed in prioritized order. A case study in a steel plant validated the approach and raised concerns about a variety of circumstances, such as incidents related to collision, slip-trip-fall, and working at height.

Keywords: Near-miss incidents · Workplace safety · Word cloud · Narrative text · Bayesian network

1 Introduction

Near-miss reporting yields the same kind of information as accident reporting, but without any serious consequences. It gives the opportunity to move from reaction to prediction of incidents. Quantitative estimation of near-miss events is very important; otherwise, with a lack of imposed constraints or a loss of control over the chain of events, a near miss may escalate to an accident in the near future.
The aim of this study is to investigate and model the root causes of near-miss incidents in a steel plant. Data were captured as incident reports that contain incident information in textual format. The narrative text field allows the user to include details about the incident, such as a description of the machine, the exact location, the surrounding conditions, and a summary of the incident. However, it also makes extracting meaningful information more complex. To serve this purpose, both the structured and the narrative text data were analyzed using text mining and a Bayesian network (BN). The approach has been implemented to prioritize the root causes behind near-miss events.

© Springer Nature Singapore Pte Ltd. 2017. M. Singh et al. (Eds.): ICACDS 2016, CCIS 721, pp. 183–191, 2017. DOI: 10.1007/978-981-10-5427-3_20

The rest of the paper is organized as follows. The related literature is reviewed briefly in Sect. 2. Section 3 describes the preparation of data and the techniques employed. The results obtained from the model and their practical implications are given in Sect. 4. Finally, conclusions of the study and the scope of future research are given in Sect. 5.

2 Literature Review

Research in near-miss management is still in its infancy, leaving scope and opportunity to structure and develop models around it. The importance of near-miss incident data analysis for safety improvement in the organization has been discussed, and an inverse proportionality between near-misses and actual accident cases was found [1]. Learning from near-miss events and prioritizing the causal factors behind near-miss incidents are therefore essential. Some safety-related studies have used narrative text to extract and predict incident scenarios [2–4]. Bayesian auto-coding methods have been applied to near-miss data to minimize the human effort required to manually code large descriptive datasets [5].
Bier and Mosleh [6] analyzed near-miss cases for a nuclear plant using a probabilistic model and suggested that near-miss events should be given more weight than expert claims. A study was conducted to evaluate and prioritize the different risks associated with water mains failure using a Bayesian belief network model [7]. BN was also used to prioritize incidents with the help of scores collected from expert opinion. Recently, a Bayesian network combined with the analytic hierarchy process (AHP) has been used to prioritize the factors behind near-miss events [8]. In that study, the prior probabilities of the factors causing near misses were decided by experts' judgement. It is better to include the information that the organization's employees provide about the near-miss incidents they have experienced. Our study incorporates the information from all workers instead of relying only on expert judgement.

3 Methodology

Figure 1 presents the conceptual framework of this study in its qualitative and quantitative aspects. The network structure of the BN for identifying incident factors corresponds to the qualitative aspect, while estimating the probabilities of factors and inference from the network correspond to the quantitative aspect. The factors (root causes) are identified under coded primary causes by text mining the narrative text. The prior probability of each factor is calculated by measuring its frequency in the incident reports for the corresponding coded primary cause. After the prior probabilities of the root causes are found, the conditional probability is calculated for every dependent node in the BN to capture the cause-consequence relationships among them. Hypothetical evidence about the absence of each individual root cause was then considered to see its effect on the overall BN. Scenarios were generated to draw conclusions about the top contributing root causes. The R language and the OpenMarkov software were used to build the word cloud and the Bayesian network, respectively.
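The prior-probability step described above (frequency of a root cause in the reports of one primary cause, with each report counted once) could be sketched as follows. This is only an illustration under stated assumptions: Python is used instead of the R workflow of the study, and the function name `prior_probability`, the keyword sets, and the four toy reports are hypothetical.

```python
def prior_probability(reports, keywords):
    """Fraction of reports that mention any of the keywords at least once.

    A report counts once no matter how often a keyword is repeated in it.
    """
    if not reports:
        return 0.0
    hits = sum(
        1 for text in reports
        if any(word in keywords for word in text.lower().split())
    )
    return hits / len(reports)


# Hypothetical narratives filed under one primary cause (e.g. slip/trip/fall)
slip_reports = [
    "worker slipped on oily floor",
    "loose cable on stairs stairs blocked",
    "operator tripped over cable",
    "floor wet after cleaning",
]

p_floor = prior_probability(slip_reports, {"stairs", "floor"})
p_cable = prior_probability(slip_reports, {"cable"})
print(p_floor, p_cable)
```

In the second toy report the word "stairs" occurs twice, yet the report contributes a single hit, which mirrors the once-per-document counting used in the study.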
The hardware configuration used for the analysis was an Intel Core(TM) i5-4200M CPU @ 2.50 GHz, 4 GB RAM, and Windows 8 (64 bit).

[Fig. 1: flowchart divided into a qualitative aspect and a quantitative aspect. Box labels: Incident investigation reports; Identification of 'primary causes' for near-miss analysis; Expert opinion; Grouping of 'primary causes'; Extract the narrative text for corresponding 'primary cause'; Building word cloud; Selecting the highly probable root cause word from word cloud; Based on availability of root cause word in different primary causes, create a cause-consequence graph; Determining the presence of root cause in incident report (considering once only in one document); Data preparation; Estimate and assign the unconditional and conditional probability of external node and internal node respectively; Compute the Bayesian probabilities using OpenMarkov software; Scenario generation using hypothetical evidence; Draw conclusion and implication.]

Fig. 1. Conceptual framework for study

3.1 Data Collection and Preparation

For this study, incident investigation data of a steel plant are considered. Data preparation involves resolving various data issues such as missing values, duplicate data points, spelling errors, non-vocabulary words, and incomplete information. The raw file, extracted in MS Excel format from the safety management system (SMS) database of the plant studied, originally had 9086 records. After pre-processing, 8877 records were considered for further analysis. Considering the 16 primary causes listed in the SMS, 2984 of these 8877 records were found to be near-miss incidents.

3.2 Word Cloud Preparation

A word cloud extracts information from text content and provides an overview of the key points.
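One common way to weight the words shown in such a cloud is tf-idf (term frequency times inverse document frequency). The following is a minimal sketch, assuming term frequency is normalized by the most frequent word in each report and idf uses the natural logarithm. Python is used here only for illustration (the study itself used R), and the function name `tfidf_scores`, the small stop-word set, and the three toy reports are hypothetical.

```python
import math
from collections import Counter

# Small illustrative stop-word set (an assumption, not the study's full list)
STOP_WORDS = {"a", "an", "the", "and", "or"}


def tfidf_scores(reports):
    """Return one {word: tf-idf weight} dictionary per report."""
    docs = [
        [w for w in text.lower().split() if w not in STOP_WORDS]
        for text in reports
    ]
    counts = [Counter(doc) for doc in docs]
    # Document frequency: number of reports that contain each word
    doc_freq = Counter(word for c in counts for word in c)
    n_docs = len(docs)

    scores = []
    for c in counts:
        max_f = max(c.values()) if c else 1
        scores.append({
            word: (f / max_f) * math.log(n_docs / doc_freq[word])
            for word, f in c.items()
        })
    return scores


reports = [  # hypothetical narratives
    "worker slipped on the wet floor near the stairs",
    "crane sling wire snapped during lifting",
    "oil on the floor near the crane bay",
]
scores = tfidf_scores(reports)

# Words of the first report, highest weight first
for word, weight in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {weight:.3f}")
```

Words that occur in only one report (such as "slipped") receive a higher weight than words shared by several reports (such as "floor"), and a word present in every report would receive a weight of zero.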
A weighting function is used to calculate the weight of the words using term frequency-inverse document frequency (tf-idf) [9], which magnifies the importance of words that occur frequently within a document and rarely across documents. The steps used to create the word cloud in the R language are as follows:

1. The brief description of incidents is extracted for each individual primary cause.
2. Stop words such as "a", "an", "the", "and", and "or" are removed.
3. A corpus is created with the 'Corpus()' function, and a structured dataset is created with the 'TermDocumentMatrix()' function.
4. Words are scored using the 'tf-idf' statistic, given as follows:

$$\mathrm{tf}(t,d) = \frac{\text{number of times word } t \text{ appears in a document}}{\text{total number of words in the document}} = \frac{f_d(t)}{\max_{w \in d} f_d(w)} \tag{1}$$

$$\mathrm{idf}(t,D) = \ln\frac{\text{total number of documents}}{\text{number of documents with word } t \text{ in it}} = \ln\frac{|D|}{|\{d \in D : t \in d\}|} \tag{2}$$

$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D) \tag{3}$$

where $f_d(t)$ is the frequency of word $t$ in report $d$, and $D$ is the corpus of documents.

5. The words and their estimated weights, in structured format, are finally passed to the 'wordcloud()' function to build the word cloud.

3.3 Bayesian Network

Bayesian networks (BNs) use a directed acyclic graph (DAG) to represent conditional probability relationships between a set of variables [10]. As shown in Fig. 2, a BN is composed of a set of variables (e.g., X, Y1, Y2 and Z) and a set of directed links between vertices that represent the relationships between these variables. Each variable has mutually exclusive states (e.g., for X, Y1, Y2 and Z, the states are {Pr = present, Ab = absent}). Nodes with arcs directed into them are called "child" nodes, and nodes from which arcs originate are called "parent" nodes (e.g., Y1 and Y2 are children of X and the parents of Z).
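The sample structure just described (X as parent of Y1 and Y2, which in turn are the parents of Z) can be written down as a simple parent map. The sketch below is an illustration only and does not reflect how OpenMarkov stores networks; the names `parents`, `STATES`, and `children_of` are hypothetical. It uses the standard-library `graphlib` module (Python 3.9 or later) to confirm that the graph is acyclic and to obtain an ordering in which parents precede their children.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Parent map of the sample network: X -> Y1, X -> Y2, Y1 -> Z, Y2 -> Z
parents = {
    "X": [],
    "Y1": ["X"],
    "Y2": ["X"],
    "Z": ["Y1", "Y2"],
}

# Every node has the same two mutually exclusive states
STATES = ("Pr", "Ab")  # present, absent


def children_of(node):
    """Return the nodes that list `node` among their parents."""
    return sorted(n for n, ps in parents.items() if node in ps)


# static_order() raises graphlib.CycleError if the graph is not a DAG;
# otherwise it yields every node after all of its parents.
order = list(TopologicalSorter(parents).static_order())

print(order)
print(children_of("X"))
```

A parents-first ordering of this kind is convenient when probabilities are later filled in node by node, because each node's table can then be defined after the tables of its parents.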
Edges show conditional dependencies between these variables, such that the value of any variable is a probabilistic function of the values of the variables that are its parents in the DAG. These dependencies on predecessor nodes are quantified through conditional probability tables (CPTs) attached to each node.

[Fig. 2: DAG with arcs X → Y1, X → Y2, Y1 → Z, Y2 → Z, and the following tables attached to the nodes.]

Var X | Probability
Pr | P(X=Pr)
Ab | P(X=Ab)

Var X | Var Y1 = Pr | Var Y1 = Ab
Pr | P(Y1=Pr|X=Pr) | P(Y1=Ab|X=Pr)
Ab | P(Y1=Pr|X=Ab) | P(Y1=Ab|X=Ab)

Var X | Var Y2 = Pr | Var Y2 = Ab
Pr | P(Y2=Pr|X=Pr) | P(Y2=Ab|X=Pr)
Ab | P(Y2=Pr|X=Ab) | P(Y2=Ab|X=Ab)

Var Y1 | Var Y2 | Var Z = Pr | Var Z = Ab
Pr | Pr | P(Z=Pr|Y1=Pr, Y2=Pr) | P(Z=Ab|Y1=Pr, Y2=Pr)
Pr | Ab | P(Z=Pr|Y1=Pr, Y2=Ab) | P(Z=Ab|Y1=Pr, Y2=Ab)
Ab | Pr | P(Z=Pr|Y1=Ab, Y2=Pr) | P(Z=Ab|Y1=Ab, Y2=Pr)
Ab | Ab | P(Z=Pr|Y1=Ab, Y2=Ab) | P(Z=Ab|Y1=Ab, Y2=Ab)

Fig. 2. Sample of Bayesian network (BN)

In Bayesian analysis, the relation between the parent nodes Y_i (i = 1, 2, …, n) and the evidence or child node Z can be computed as:

$$p(Y_i \mid Z) = \frac{p(Z \mid Y_i)\, p(Y_i)}{\sum_{j=1}^{n} p(Z \mid Y_j)\, p(Y_j)} \tag{4}$$

where p(Y|Z) represents the posterior probability of occurrence of variable Y given that Z occurs, p(Y) is the prior probability of Y, and p(Z|Y) denotes the likelihood of Z given the occurrence of Y.

4 Results and Discussion

4.1 Root Cause Extraction from Word Cloud

A comparison word cloud allows the differences or similarities between two or more primary causes to be studied by plotting the word cloud of each primary cause against the others. Figure 3 shows the comparison cloud of "collision and fall related" incidents grouped by the primary causes: dashing/collision, skidding, slip/trip/fall, and working at height. Owing to space limitations, not all clouds are shown in the paper; readers are encouraged to contact the authors for them. Keywords (root causes) were extracted on the basis of their presence in a number of incident reports (counted only once per individual report).

Fig. 3. Comparison word cloud of collision and fall related incidents

4.2 BN Causal Model from Data

Figure 4 shows the model constructed from the root causes extracted from the word cloud. In a BN, the conditional probability table (CPT) structure depends upon the conditional independence of nodes. In our study, two states were defined for every node: 'present' and 'absent'. The probability of each independent node was calculated from its frequency in all the reports of the particular primary cause. In this study, the OpenMarkov software [11] was employed to perform the model construction and CPT calculation. To reduce the computational burden, the primary causes were clubbed into four groups on the basis of similarity, as per experts' opinions. So, three levels of causal factors (group, primary, and root level) were considered for building the BN. Finally, the conditional probabilities of all the child nodes were computed for all possible combinations of the states of their parents. To build the BN, a total of 57 nodes (36 independent nodes and 21 dependent nodes), 88 links, and 1200 conditional probabilities were estimated. After all the prior and conditional probabilities were calculated at every node, the probability of occurrence of a near-miss event came out to be 36.04%. Note that, because of the high connectivity of the network and the low probability of the root cause nodes, any single node contributes only a small change to the near-miss probability.

Fig. 4. Bayesian network to measure near-miss incidents

Table 1. Prioritized order of groups after incorporating the hypothetical evidence about their absence

Order | Group | Posterior probability (%) | Near-miss node probability in absence of group cause (%)
1 | Collision and fall related | 75.19 | 32.40
2 | Process related | 18.66 | 35.24
3 | Energy related | 6.34 | 35.91
4 | Material and equipment related | 32.77 | 36.00

Table 2. Prioritized order of primary causes after incorporating the hypothetical evidence about their absence

Order | Primary cause | Posterior probability (%) | Near-miss node probability in absence of primary cause (%)
1 | Slip/trip/fall | 47.11 | 35.03
2 | Working at height | 59.78 | 35.13
3 | Process incidents | 32.45 | 35.55
4 | Fire | 19.77 | 35.82
5 | Dashing collision | 15.80 | 35.83
6 | Electrical flash | 54.06 | 35.92
7 | Skidding | 4.84 | 35.99
8 | Material handling | 32.77 | 36.00
9 | Gas leakage | 4.56 | 36.02
10 | Energy isolation | 14.81 | 36.03
11 | Hot metals | 2.97 | 36.03
12 | Hydraulic/pneumatic | 2.30 | 36.03
13 | Equipment machinery damage | 13.94 | 36.04
14 | Structural integrity | 7.18 | 36.04
15 | Lifting tools tackles | 1.93 | 36.04
16 | Toxic chemicals | 0.97 | 36.04

Table 3. Prioritized order of root causes obtained from word cloud after incorporating the hypothetical evidence about their absence

Order | Root cause (primary cause) | Probability (%) | Near-miss node probability in absence of root cause (%)
1 | Switch/cable | 23.16 | 36.02
2 | Stairs/floor condition | 15.87 | 36.02
3 | Wire/rope/sling | 16.23 | 36.03
4 | Crane | 13.74 | 36.03
5 | Valve/hose | 10.64 | 36.03
6 | Truck/dumper | 8.44 | 36.03
7 | Loading/shifting | 7.46 | 36.03

To prioritize the causal factors at the three levels of the network, their effect on the near-miss node was estimated by incorporating hypothetical evidence of the absence of different nodes. This indicates the effect of a particular node on the overall near-miss node. The prioritized orders at the group, primary cause, and root cause levels are given in Tables 1, 2 and 3, respectively. In Table 1, all groups are listed in prioritized order of their effect on the near-miss node. Similarly, in Table 2, primary causes are listed in prioritized order to show their individual impact on the near-miss node. In Table 3, the root causes extracted from the narrative text using text mining are listed, and their effect on the near-miss node is measured.
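The absence-evidence ranking used in Tables 1, 2 and 3 can be illustrated on a toy network with two independent root causes and one near-miss node. The sketch below is not the 57-node model of the study and does not use OpenMarkov; it performs inference by brute-force enumeration in Python, and every name and number in it (`priors`, `cpt`, `p_near_miss`, and all probability values) is a hypothetical assumption chosen for illustration.

```python
from itertools import product

# Hypothetical priors P(cause = present) for two independent root causes
priors = {"floor_condition": 0.30, "cable": 0.10}

# Hypothetical CPT: P(near_miss = present | floor_condition, cable),
# keyed by (floor_condition present?, cable present?)
cpt = {
    (True, True): 0.90,
    (True, False): 0.60,
    (False, True): 0.50,
    (False, False): 0.20,
}


def p_near_miss(evidence=None):
    """P(near_miss = present | evidence), enumerating all root-cause states."""
    evidence = evidence or {}
    names = list(priors)
    numerator = 0.0
    denominator = 0.0
    for states in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, states))
        # Skip state combinations that contradict the evidence
        if any(assignment[k] != v for k, v in evidence.items()):
            continue
        weight = 1.0
        for name in names:
            weight *= priors[name] if assignment[name] else 1.0 - priors[name]
        denominator += weight
        numerator += weight * cpt[states]
    return numerator / denominator


baseline = p_near_miss()

# A cause ranks higher when its absence lowers the near-miss probability more
ranking = sorted(priors, key=lambda cause: p_near_miss({cause: False}))

print(f"baseline: {baseline:.2f}")
for cause in ranking:
    print(f"{cause} absent: {p_near_miss({cause: False}):.2f}")
```

As in the tables, causes are ordered by increasing near-miss probability given their absence, so the cause whose removal reduces the near-miss probability the most appears first. Enumeration is only practical for very small networks; a model of the size reported here needs a dedicated inference engine.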
This prioritization will help safety practitioners to apply focused interventions and safety control measures to improve safety performance at workplaces.

5 Conclusions

The proposed model extracts information at the lowest level using a word cloud and combines it with a BN to help in prioritizing near-miss incidents. The model is capable of providing qualitative and quantitative information on the different causes of near-misses. Slip/trip/fall, material handling, electrical flash, dashing/collision and energy isolation are found to be the top five primary causes out of the 16 primary causes reported in the preliminary investigation report data. At the lowest level, 36 root causes were extracted, and their impact on the overall network was measured by providing hypothetical evidence. It was found that switch/cable, condition of stairs/floor, wire/rope/sling, crane operation, valve/hose dysfunction, heavy vehicles such as trucks or dumpers, and loading or shifting operations have a higher propensity to cause incidents. Other root causes can also be prioritized on the basis of their probability and impact on intermediate nodes. This can help management focus its attention on improving particular areas. Our future work will focus on analyzing the organizational hierarchy to find location-specific causes. Moreover, the model can be validated by using expert opinions as the gold standard, or by using the developed near-miss based system to predict actual accidents and measuring its predictive power.

References

1. Jones, S., Kirchsteiger, C., Bjerke, W.: The importance of near miss reporting to further improve safety performance. J. Loss Prev. Process Ind. 12(1), 59–67 (1999)
2. Abdat, F., Leclercq, S., Cuny, X., Tissot, C.: Extracting recurrent scenarios from narrative texts using a Bayesian network: application to serious occupational accidents with movement disturbance. Accid. Anal. Prev. 70, 155–166 (2014)
3. Lincoln, A.E., Sorock, G.S., Courtney, T.K., Wellman, H.M., Smith, G.S., Amoroso, P.J.: Using narrative text and coded data to develop hazard scenarios for occupational injury interventions. Inj. Prev. 10(4), 249–254 (2004)
4. Sawaragi, T., Ito, K., Horiguchi, Y., Nakanishi, H.: Identifying latent similarities among near-miss incident records using a text-mining method and a scenario-based approach. In: Salvendy, G., Smith, M.J. (eds.) Human Interface 2009. LNCS, vol. 5618, pp. 594–603. Springer, Heidelberg (2009). doi:10.1007/978-3-642-02559-4_65
5. Taylor, J.A., Lacovara, A.V., Smith, G.S., Pandian, R., Lehto, M.: Near-miss narratives from the fire service: a Bayesian analysis. Accid. Anal. Prev. 62, 119–129 (2014)
6. Bier, V.M., Mosleh, A.: The analysis of accident precursors and near misses: implications for risk assessment and risk management. Reliab. Eng. Syst. Saf. 27(1), 91–101 (1990)
7. Kabir, G., Tesfamariam, S., Francisque, A., Sadiq, R.: Evaluating risk of water mains failure using a Bayesian belief network model. Eur. J. Oper. Res. 240(1), 220–234 (2015)
8. Zubair, M., Park, S., Heo, G., Hassan, M.U., Aamir, M.: Study on nuclear accident precursors using AHP and BBN, a case study of Fukushima accident. Int. J. Energy Res. 39(1), 98–110 (2015)
9. Manning, C.D., Raghavan, P., Schütze, H.: An Introduction to Information Retrieval. Cambridge University Press, New York (2008)
10. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco (1988)
11. Arias, M., Díez, F., Palacios, M.: OpenMarkovXML. A format for encoding probabilistic graphical models (2010)