A Survey of Data Mining Techniques For C
(USJICT)
Volume 2, Issue 1, January 2018
_______________________________________________________________________________________________________
Abstract: Data mining is one of the most powerful ways of extracting knowledge from large datasets. With the help of machine
learning and artificial intelligence techniques, it is among the best approaches for detecting underlying relationships in data.
Crime detection is an active topic in data mining, in which different patterns of crime are identified. It involves a variety of
steps, from characterizing a crime to detecting a crime pattern. Various crime detection techniques have been discussed in the
literature for this purpose. In this paper, we select widely adopted data mining techniques that are specifically used for crime
detection. We present an analytical study that summarizes the strengths and weaknesses of each technique; each technique is
suited to a specific use. This survey is intended as a guide for researchers to state-of-the-art crime detection techniques in
data mining, along with their pros and cons.
Keywords: Data mining, Crime detection, Classification, Clustering, Association, Prediction, Constraint, Association rules
_____________________________________________________________________________________________________________________
circumstances of usage. The differences and commonalities of these techniques have also been discussed.

II. EXISTING CRIME DATA MINING TECHNIQUES:
Crime can range from simple street crimes to internationally planned crimes [8]. Crime data mining, as compared to usual data
mining, is more concerned with privacy [16]. In structured data, patterns are identified through traditional data mining
techniques such as association, classification, prediction, clustering and outlier analysis [17]. Advanced data mining handles
both structured and unstructured data for pattern recognition [14, 18]. In this section we analyze the existing data mining
techniques that are used for crime detection and investigation.

This is very helpful in investigating the simultaneous occurrences of events [27]. The strength of an association can be
measured in terms of support, the applicability of a rule to a given dataset, and confidence, the frequency with which one item
appears in transactions that contain another item [28].

D. Sequential pattern mining
Sequential pattern mining discovers frequently occurring sequences of items at different intervals of time. It is helpful in
network intrusion detection. For meaningful results, a large amount of structured data is required [2]. Ayres et al. have
provided an algorithm that finds all possible sequences in the transaction data very quickly. They use depth-first traversal
combined with a bitmap representation to achieve this [29].
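
As a minimal sketch of the support and confidence measures described above, and of what it means for a sequence to occur
frequently, the following Python fragment computes them on toy data. The function names and the incident and event labels are
hypothetical and chosen only for illustration. The subsequence check is a naive scan, not the bitmap-based algorithm of Ayres et
al. [29].

```python
from typing import FrozenSet, Iterable, List, Sequence


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Among transactions containing `antecedent`, the fraction that also contain `consequent`."""
    base = support(antecedent, transactions)
    if base == 0.0:
        return 0.0
    both = support(frozenset(antecedent) | frozenset(consequent), transactions)
    return both / base


def sequence_support(pattern: Sequence[str], sequences: List[Sequence[str]]) -> float:
    """Fraction of sequences that contain `pattern` as an ordered, not necessarily contiguous, subsequence."""
    def contains(seq: Sequence[str]) -> bool:
        it = iter(seq)
        # `event in it` advances the iterator, so later events must appear later in seq.
        return all(event in it for event in pattern)

    if not sequences:
        return 0.0
    return sum(1 for s in sequences if contains(s)) / len(sequences)


if __name__ == "__main__":
    # Hypothetical incident records, each a set of attributes.
    incidents = [
        frozenset({"burglary", "night", "residential"}),
        frozenset({"burglary", "night", "commercial"}),
        frozenset({"assault", "night", "street"}),
        frozenset({"burglary", "day", "residential"}),
    ]
    print(support({"burglary", "night"}, incidents))                 # 0.5
    print(round(confidence({"burglary"}, {"night"}, incidents), 2))  # 0.67

    # Hypothetical network event logs, each an ordered list of events.
    logs = [
        ["port_scan", "login_fail", "login_fail", "login_ok"],
        ["login_fail", "login_ok"],
        ["port_scan", "login_ok"],
        ["login_ok", "port_scan"],
    ]
    print(sequence_support(["port_scan", "login_ok"], logs))         # 0.5
```

In this toy data the rule {burglary} -> {night} has support 0.5 and confidence of about 0.67. The pattern (port_scan, login_ok)
is counted only in logs where the events occur in that order.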
University of Sindh Journal of Information and Communication Technology (USJICT), Vol.2(1), pg.: 1-6
It can also help in analyzing the flow of information among these entities, though it won't help in identifying networks' true
leaders [5]. It reveals the structure within some text by presenting interlinked entities [43]. This shows that people have
participated or communicated somewhere [44]. The most widely used measures for SNA are: degree, the number of nodes connected to
a given node [45]; density, the number of edges in a specific area as compared to the overall number of edges [46]; and
centrality, the importance of a node within a structure [47].

Hossein Hassani et al. have reviewed the data mining techniques for crime. Their review covers entity extraction, cluster
analysis, association rule, classification and social network analysis [15]. In [48], a tool based on a Natural Language
Processing technique was discussed for the detection of white collar crimes. However, a comparative analysis of all the
above-mentioned crime data mining techniques is still missing in the literature.

III. ANALYSIS
In this section, a comparative analysis of the techniques is presented on the basis of their strengths and weaknesses. These
have been extracted from the literature, in which authors and researchers have identified the positive and negative impacts of
each technique. Table 1 is a complete analysis of each technique:

TABLE 1 STRENGTH AND WEAKNESS OF EACH CRIME DATA MINING TECHNIQUE

TECHNIQUE                   STRENGTH                                  WEAKNESS
Entity extraction           Machine learning makes it easier          Large amount of clean data required
Clustering                  Detects outliers without requiring any    Computational cost is high; its effectiveness
                            labeled data                              also depends upon the method used
Association rule mining     Supports classification                   It is used for the most accurate
                                                                      classification rules
Sequential pattern mining   Wide range of applicability               Large amount of structured data is required
Deviation detection         Widely applicable in fraud detection      Sometimes its data dependency becomes a hurdle
Classification              Very low time consumption                 Predefined scheme of classification and
                                                                      complete training dataset required
String comparator           Accuracy in terms of numerical value      Large amount of computation required
Social network analysis     Focuses on relationships between actors   Won't identify network's true leaders
                            rather than attributes of actors

The mentioned techniques have been studied in depth to identify the pros and cons of each technique when adopting it for crime
detection. Entity extraction is enhanced by machine learning techniques, but polluted data can be a hurdle, so its weakness is
the requirement of clean data. The strength of clustering is the detection of outliers without labeled data, but the process is
costly and its effectiveness depends on the selected method. Association rule mining supports classification, and its weakness
is that it is specific to classification rules. Sequential pattern mining has a wide range of applicability, but a large amount
of structured data is required for its execution. Deviation detection is widely used in fraud detection, but its data dependency
in some areas is still an open question. Classification is a conventional technique with very low time consumption; its weakness
is a predefined classification scheme that requires a complete training dataset. The string comparator is accurate for numerical
values but requires heavy computation. Social network analysis focuses on relationships between actors rather than their
attributes, which makes it more direct, but it does not identify a network's true leaders.

IV. CLASSIFICATION OF EXISTING TECHNIQUES
In order to understand the techniques used for crime data mining, we first present a classification of these techniques. This
classification contains the data mining techniques that are specifically used for crime data mining and are stated in the
literature. The classification of the techniques, along with the methods they use, is shown in Fig 1.
TABLE 2 USAGE BASED CLASSIFICATION OF CRIME DATA MINING TECHNIQUES

USAGE                                           DATA MINING TECHNIQUE       REFERENCE
Identification of programs written by hackers   Entity extraction           [11, 15, 16, 17, 18, 19, 20, 21]
Identifying criminals following a set pattern   Clustering                  [22, 23]
Identifying network attacks                     Association rule mining     [20, 24, 25, 26]
Intrusion detection                             Sequential pattern mining   [24, 27]
Fraud detection                                 Deviation detection         [24, 28, 29, 30]
Predicting crimes                               Classification              [28, 31-33, 36-41]
Identifying deceptive information               String comparator           [20, 42]

COMPARISON OF CRIME DATA MINING TECHNIQUES

ATTRIBUTE/TECHNIQUE         TIMELINESS                  DATA DEPENDENCY                 ACCURACY
Entity extraction           Less time consumption       Huge amount of data required    Accurate
Clustering                  Moderate time consumption   Less data dependency            Accurate
Association rule mining     Moderate                    Moderate                        Accurate
Sequential pattern mining   Moderate                    Huge amount of data required    Accurate
Deviation detection         Less time consumption       Moderate                        Accurate
[16] H. Kargupta, K. Liu, and J. Ryan, "Privacy sensitive distributed data mining from multi-party data," in International Conference on Intelligence and Security Informatics, 2003, pp. 336-342.
[17] J. Han, J. Pei, and M. Kamber, Data mining: concepts and techniques: Elsevier, 2011.
[18] G. Gupta, Introduction to data mining with case studies: PHI Learning Pvt. Ltd., 2014.
[19] M. Chau, J. J. Xu, and H. Chen, "Extracting meaningful entities from police narrative reports," in Proceedings of the 2002 annual national conference on Digital government research, 2002, pp. 1-5.
[20] S. Baluja, V. O. Mittal, and R. Sukthankar, "Applying Machine Learning for High-Performance Named-Entity Extraction," Computational Intelligence, vol. 16, pp. 586-595, 2000.
[21] A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman, "Exploiting diverse knowledge sources via maximum entropy in named entity recognition," in Proc. of the Sixth Workshop on Very Large Corpora, 1998.
[22] S. Miller, M. Crystal, H. Fox, L. Ramshaw, R. Schwartz, R. Stone, et al., "Algorithms that learn to extract information: Bbn: Tipster phase iii," in Proceedings of a workshop on held at Baltimore, Maryland: October 13-15, 1998, 1998, pp. 75-89.
[24] I. H. Witten, Z. Bray, M. Mahoui, and W. J. Teahan, "Using language models for generic entity extraction," in Proceedings of the ICML Workshop on Text Mining, 1999.
[25] R. V. Hauck, H. Atabakhsh, P. Ongvasith, H. Gupta, and H. Chen, "Using Coplink to analyze criminal-justice data," Computer, vol. 35, pp. 30-37, 2002.
[26] R. T. Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," in Proc. of, 1994, pp. 144-155.
[27] H. Yun, D. Ha, B. Hwang, and K. H. Ryu, "Mining association rules on significant rare data using relative support," Journal of Systems and Software, vol. 67, pp. 181-191, 2003.
[28] P.-N. Tan, Introduction to data mining: Pearson Education India, 2006.
[29] J. Ayres, J. Flannick, J. Gehrke, and T. Yiu, "Sequential pattern mining using a bitmap representation," in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, 2002, pp. 429-435.
[30] C. C. Aggarwal and P. S. Yu, "Outlier detection for high dimensional data," in ACM Sigmod Record, 2001, pp. 37-46.
[31] A. Arning, R. Agrawal, and P. Raghavan, "A Linear Method for Deviation Detection in Large Databases," in KDD, 1996, pp. 164-169.
[32] C. J. Stone, "Classification and regression trees," Wadsworth International Group, vol. 8, pp. 452-456, 1984.
[33] J. R. Quinlan, C4.5: programs for machine learning: Elsevier, 2014.
[34] J. R. Quinlan, "Induction of decision trees," Machine learning, vol. 1, pp. 81-106, 1986.
[35] C. Cortes and V. Vapnik, "Support-vector networks," Machine learning, vol. 20, pp. 273-297, 1995.
[36] P. Langley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in Aaai, 1992, pp. 223-228.
[37] M. D. Richard and R. P. Lippmann, "Neural network classifiers estimate Bayesian a posteriori probabilities," Neural computation, vol. 3, pp. 461-483, 1991.
[38] G. P. Zhang, "Neural networks for classification: a survey," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 30, pp. 451-462, 2000.
[39] H. Gish, "A probabilistic approach to the understanding and training of neural network classifiers," in Acoustics, Speech, and Signal Processing, 1990. ICASSP-90., 1990 International Conference on, 1990, pp. 1361-1364.
[40] P. A. Shoemaker, "A note on least-squares learning procedures and classification by neural network models," IEEE Transactions on Neural Networks, vol. 2, pp. 158-160, 1991.
[41] E. A. Wan, "Neural network classification: A Bayesian interpretation," IEEE Transactions on Neural Networks, vol. 1, pp. 303-305, 1990.
[42] B. Widrow, D. E. Rumelhart, and M. A. Lehr, "Neural networks: applications in industry, business and science," Communications of the ACM, vol. 37, pp. 93-106, 1994.
[43] J. Mena, Investigative data mining for security and criminal detection: Butterworth-Heinemann, 2003.
[44] A. M. Fard and M. Ester, "Collaborative mining in multiple social networks data for criminal group discovery," in Computational Science and Engineering, 2009. CSE'09. International Conference on, 2009, pp. 582-587.
[45] M. K. Sparrow, "The application of network analysis to criminal intelligence: An assessment of the prospects," Social networks, vol. 13, pp. 251-274, 1991.
[46] A. Iriberri and G. Leroy, "Natural language processing and e-government: extracting reusable crime report information," in Information Reuse and Integration, 2007. IRI 2007. IEEE International Conference on, 2007, pp. 221-226.
[47] K. Chan and J. Liebowitz, "The synergy of social network analysis and knowledge mapping: a case study," International journal of management and decision making, vol. 7, pp. 19-35, 2005.
[48] Maartin B., et. al., Performance Evaluation of a Natural Language Processing approach applied in White Collar Crime, Springer adfa, p 1., Berlin, 2011