Project Title: A Major Project Report Submitted in Partial Fulfillment of The Requirements For The Degree of
Submitted By
Boby Mahato
08
Ritu Munda
21
Certified that this project report titled CRIME RATE PREDICTION is the
bonafide work of MR. BOBY MAHATO and MS. RITU MUNDA, who carried out the
research under my supervision. Certified further that, to the best of my knowledge, the
work reported herein does not form part of any other project or dissertation on the
basis of which a degree or award was conferred on an earlier occasion on this or any
other candidate.
DECLARATION
We hereby declare that this report entitled “Crime Rate Prediction”, submitted by Boby Mahato (08) and Ritu
Munda (21) in fulfillment of the requirement for the degree of Master of Computer Application to Usha
Martin University, Ranchi, is our own work and has not been submitted to any other institute.
CERTIFICATE
This is to certify that the report entitled “Crime Rate Prediction”, being submitted by BOBY MAHATO (08) and
RITU MUNDA (21) in fulfillment of the requirement for the degree of Master of Computer Application to
Usha Martin University, Ranchi, is a bonafide work carried out under my/our supervision. The matter embodied
in this report is original and has not been submitted for the award of any other degree.
(Sucheta Panda)
Guide
(Sharmistha Roy)
HOD External Examiner
ACKNOWLEDGEMENTS
We express our deepest sense of gratitude to our guide, SUCHETA PANDA, Faculty of Computing and Information
Technology, Usha Martin University, Ranchi, for suggesting the subject of this work and for constant supervision
throughout. Our guide's co-operation and timely suggestions have been an unparalleled stimulus for us to
progress towards the completion of this project report. Indeed, this continuous involvement has
helped us in bringing out this project work, which otherwise would have remained a distant dream.
We are indeed thankful to SHARMISTHA ROY, HOD, Faculty of Computing and Information
Technology, Usha Martin University, Ranchi, for giving us permission to carry out our project work. We would
like to express our gratitude to all teaching and non-teaching staff members of the Faculty of Computing and
Information Technology, Usha Martin University, Ranchi, for their co-operation in our work.
Boby Mahato
08
Ritu Munda
21
Abstract
Crime analysis and prediction is a systematic approach for identifying and analyzing patterns in crime. The
proposed system can predict regions which have a high probability of crime occurrence and visualize crime-prone
areas. Using data mining, we can extract previously unknown, useful information from unstructured data and use
existing datasets to predict new information. Crime is a dangerous and common social problem
faced worldwide. It affects the quality of life, economic growth and reputation of a nation. To secure
society from crime, there is a need for advanced systems and new approaches to crime analytics that help
protect communities. We propose a system which can analyze, detect, and predict
the probability of various crimes in a given region. This report explains various types of criminal analysis and
crime prediction using several data mining techniques.
Table of Contents
Declaration ………………………………………………………………………………iii
Certificate ………………………………………………………………………………iv
Acknowledgement ………………………………………………………………………………v
Abstract ………………………………………………………………………………vi
Chapter Topic Name Page No.
1. Introduction
List of Tables
Sl. No. Table Name Page No.
Table 1. Sample Data ………………………………………………………….
List of Figures
Sl. No. Figure Name Page No.
Fig. 1 Software Life cycle model ……………………………………………
Chapter 1
Introduction
1.1 Brief overview of crime and its effect on a country
Crime, like many other words, is not always simple to define, as it may mean different things to
different people. According to Britannica, crime can be understood as an act that is socially harmful or
dangerous and that is usually prohibited and punishable under criminal law. Crime is a prevalent social
problem that affects the quality of life and the economic growth of a country.
To understand the effects of crime on society, consider why it is a social problem. Firstly, it creates chaos,
which disrupts the natural order of society. Because crime goes against social conventions, it disrupts many
everyday activities, from running a business to going shopping or even just walking outside. Secondly, crime
impedes collaboration and trust in a community. With higher crime rates, trust in law enforcement is
affected: when people see that the law enforcement agencies meant to maintain the peace have failed to do
their job, their willingness to collaborate decreases, not only with law enforcement but also with others in
their community.
Moving on to economic losses, consider our neighbouring country, Indonesia. It is much like Malaysia and has
an abundance of natural and human resources, which should have accelerated the pace of its economy, and yet
it was found that the number of crimes may have limited economic growth. The growth of Indonesia's economy
is usually attributed to the consumption of goods, which is directly influenced by household income. Foreign
investment also aids economic growth, as it increases the production capacity of the country by reducing the
basic and variable costs of the industrial sector, which in turn increases the purchasing power of the people
and thus consumption. If crime were to increase, however, it would give investors a poor impression of the
country and lead to fewer investments. Accordingly, Kusuma, Hariyani, and Wahyu found that an increase in the
number of criminal acts reduces the gross regional domestic product (GRDP) of the country.
Figure 1.1 Descriptive Data Analysis of Economic Growth Vs Crime (1980-2011)
Another example of crime affecting the economy comes from a study of Pakistan, another developing
country. Pakistan's economy, much like Indonesia's, could benefit from foreign investors, but high crime
rates may have deterred some of them from investing. According to Figure 1.1, Ahmad, Ali, and Ahmad (2014)
found that between 1980 and 2011 economic growth fluctuated from year to year, but the overall trend was that
economic growth decreased as crime increased.
1.2 Problem Statement and motivation
Project objectives:
To produce a system that is able to predict areas that will have higher crime rates.
To explore and enhance classification algorithms to predict future crime categories based on previous crime
trends.
To create a web-based system that allows easy access to the application.
SUPERVISED LEARNING
In the majority of supervised learning applications, the ultimate goal is to develop a finely tuned predictor
function h(x) (sometimes called the “hypothesis”). “Learning” consists of using sophisticated mathematical
algorithms to optimize this function so that, given input data x about a certain domain (say, square footage of a
house), it will accurately predict some interesting value h(x) (say, market price for said house).
$h(x_1, x_2, x_3, x_4) = \theta_0 + \theta_1 x_1 + \theta_2 x_3^2 + \theta_3 x_3 x_4 + \theta_4 x_1^3 x_2^2 + \theta_5 x_2 x_3^4 x_4^2$
This function takes input in four dimensions and has a variety of polynomial terms. Deriving a
normal equation for this function is a significant challenge. Many modern machine learning problems take
thousands or even millions of dimensions of data to build predictions using hundreds of coefficients. Predicting
how an organism’s genome will be expressed, or what the climate will be like in fifty years, are examples of
such complex problems.
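As a minimal sketch of what "learning" a predictor h(x) means, the snippet below fits the simplest possible hypothesis, h(x) = theta0 + theta1 * x, by closed-form least squares. The function names and the four data points are made up for illustration and are not taken from this project's dataset.

```python
def fit_line(xs, ys):
    """Return (theta0, theta1) minimising the squared error of h(x) = theta0 + theta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


def h(x, theta0, theta1):
    """The learned hypothesis: predicted output for input x."""
    return theta0 + theta1 * x


# Hypothetical training data: square footage -> price (in thousands).
square_feet = [1000, 1500, 2000, 2500]
prices = [200, 300, 400, 500]

theta0, theta1 = fit_line(square_feet, prices)
predicted = h(1800, theta0, theta1)  # price estimate for an unseen house
```

With more input dimensions and polynomial terms, as in the four-variable function above, a closed-form solution becomes impractical and iterative optimisation is normally used instead.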
REGRESSION
Regression algorithms are used when there is a relationship between the input variable and the output
variable. They are used for the prediction of continuous variables, for example in weather forecasting and
market trend analysis. Common regression algorithms include:
Linear Regression
Regression Trees
Non-Linear Regression
Bayesian Linear Regression
Polynomial Regression
CLASSIFICATION
Classification algorithms are used when the output variable is categorical, which
means it takes one of two or more classes such as Yes-No, Male-Female, True-False, etc.
A typical application is spam filtering. Common classification algorithms include:
Random Forest
Decision Tree
Logistic Regression
Support Vector Machines
1.4.2 PROPOSED ALGORITHMS
Decision Tree Classification Algorithm
Decision Tree is a supervised learning technique that can be used for both classification and regression
problems, but it is mostly preferred for solving classification problems. It is a tree-structured
classifier, where internal nodes represent the features of a dataset, branches represent the decision rules
and each leaf node represents the outcome.
In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are
used to make a decision and have multiple branches, whereas leaf nodes are the output of those
decisions and do not contain any further branches.
The decisions or tests are performed on the basis of the features of the given dataset.
It is a graphical representation for getting all the possible solutions to a problem/decision based on given
conditions.
It is called a decision tree because, similar to a tree, it starts with the root node, which expands into
further branches and constructs a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands for Classification and Regression
Tree algorithm.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into
sub-trees.
The diagram below explains the general structure of a decision tree.
Figure-1.3 Structure of decision tree
There are various algorithms in machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are two
reasons for using the decision tree:
Decision trees usually mimic human thinking while making a decision, so they are easy to
understand.
The logic behind a decision tree can be easily followed because it is shown as a tree-like
structure.
Decision Tree Terminologies
Root node: the node from which the decision tree starts. It represents the entire dataset, which further
gets divided into two or more homogeneous sets.
Leaf node: a final output node; the tree cannot be segregated further after reaching a leaf
node.
Splitting: the process of dividing the decision node/root node into sub-nodes according to the
given conditions.
Branch/sub-tree: a tree formed by splitting the tree.
Pruning: the process of removing unwanted branches from the tree.
Parent/child node: a node that is divided into sub-nodes is the parent node of those sub-nodes, and the
sub-nodes are its child nodes.
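To make the idea of splitting more concrete, the sketch below scores candidate splits with Gini impurity, the criterion commonly used by CART for classification. This report does not state which criterion it uses, so Gini is an assumption here, and the feature name, records and labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature):
    """Try each observed value of one numeric feature as a threshold.

    Returns (threshold, weighted_gini) for the split 'value <= threshold'
    whose two child nodes have the lowest weighted impurity.
    """
    best = (None, float("inf"))
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        if not left or not right:
            continue  # a split must send records to both children
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical records: hour of day -> crime category.
rows = [{"hour": 2}, {"hour": 3}, {"hour": 14}, {"hour": 16}]
labels = ["burglary", "burglary", "theft", "theft"]
threshold, impurity = best_split(rows, labels, "hour")
```

A full tree builder would apply `best_split` over every feature and recurse on the resulting child nodes until they are pure or a stopping rule is reached.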
In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the
tree. It compares the value of the root attribute with the corresponding attribute of the record and, based on
the comparison, follows the branch and jumps to the next node. At the next node, the algorithm again
compares the attribute value with those of the sub-nodes and moves further. It continues this process until it
reaches a leaf node of the tree.
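The traversal described above can be sketched in a few lines. The tree below is hand-built for illustration only: its features, thresholds and labels are hypothetical and were not learned from any dataset.

```python
def predict(node, record):
    """Walk from the root to a leaf, following the branch each test selects."""
    while "label" not in node:  # decision nodes carry a test, leaf nodes carry a label
        value = record[node["feature"]]
        node = node["left"] if value <= node["threshold"] else node["right"]
    return node["label"]


# Hypothetical hand-built tree.
tree = {
    "feature": "hour", "threshold": 6,
    "left": {"label": "burglary"},
    "right": {
        "feature": "is_weekend", "threshold": 0,
        "left": {"label": "theft"},
        "right": {"label": "assault"},
    },
}

night = predict(tree, {"hour": 3, "is_weekend": 0})
weekend_evening = predict(tree, {"hour": 20, "is_weekend": 1})
```

Each loop iteration corresponds to one comparison at a decision node; the loop ends as soon as a leaf is reached.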
Python Implementation of Decision Tree
Now we will implement the Decision tree using Python. For this, we will use the dataset "user_data.csv," which
we have used in previous classification models. By using the same dataset, we can compare the Decision tree
classifier with other classification models such as KNN, SVM, Logistic Regression, etc.
Steps will also remain the same, which are given below:
Data Pre-processing step
Fitting a Decision-Tree algorithm to the Training set
Predicting the test result
Test accuracy of the result (Creation of Confusion matrix)
Visualizing the test set result.
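The fourth step above, building a confusion matrix and deriving accuracy from it, can be sketched without any external library. The label lists below are invented for illustration and are not results from "user_data.csv" or from this project.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical test-set labels and classifier output (1 = positive class).
y_test = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
acc = accuracy(cm)
```

The off-diagonal cells show which classes are confused with each other, which accuracy alone does not reveal.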
Random Forest Algorithm
Random Forest can be used for both classification and regression problems in ML. It is based on the concept of
ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to
improve the performance of the model. A random forest consists of many decision trees. The ‘forest’ generated by
the random forest algorithm is trained through bagging, or bootstrap aggregating. Bagging is an ensemble
meta-algorithm that improves the accuracy of machine learning algorithms.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets
of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying
on one decision tree, the random forest takes the prediction from each tree and predicts the final output based
on the majority vote of those predictions.
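The bagging step mentioned above can be illustrated with a short sketch: each tree is trained on its own bootstrap sample, drawn with replacement from the training data. The ten-record dataset, the seed and the helper name `bootstrap_sample` are assumptions made for illustration.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) records with replacement, as bagging does for each tree."""
    return [rng.choice(rows) for _ in rows]


rng = random.Random(0)      # fixed seed so the sketch is repeatable
dataset = list(range(10))   # stand-in for ten training records
samples = [bootstrap_sample(dataset, rng) for _ in range(3)]  # one sample per tree
```

Because sampling is done with replacement, some records typically appear more than once in a sample while others are left out, which is what makes the individual trees differ from each other.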
The below diagram explains the working of the Random Forest algorithm
Classification in random forests
Classification in random forests employs an ensemble methodology to attain the outcome. The
training data is used to train various decision trees. This dataset consists of observations and features that
are selected randomly during the splitting of nodes. A random forest system relies on various decision trees.
Every decision tree consists of decision nodes, leaf nodes, and a root node. The leaf node of each tree is the
final output produced by that specific decision tree. The selection of the final output follows the
majority-voting system: the output chosen by the majority of the decision trees becomes the final output of the
random forest system. The diagram below shows a simple random forest classifier.
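The majority-voting rule can be written down directly. In this sketch the list of tree outputs is hypothetical; in a real forest each entry would come from one trained decision tree.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of five trees for a single record.
tree_outputs = ["theft", "burglary", "theft", "assault", "theft"]
final = majority_vote(tree_outputs)
```

In this sketch a tie is resolved in favour of the class that appears first in the list, which is an arbitrary choice; using an odd number of trees for two-class problems avoids ties.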
Advantages of Random Forest
Random Forest is capable of performing both Classification and Regression tasks.
It is capable of handling large datasets with high dimensionality.
It enhances the accuracy of the model and reduces the risk of overfitting.
Disadvantages of Random Forest
Although random forest can be used for both classification and regression tasks, it is less suitable
for regression tasks.
LOGISTIC REGRESSION
Logistic regression is one of the most popular machine learning algorithms and comes under the
supervised learning technique. It is used for predicting a categorical dependent variable from a
given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome
must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc., but
instead of giving the exact value as 0 or 1, it gives probabilistic values which lie between 0 and 1.
Logistic regression is very similar to linear regression except in how it is used. Linear
regression is used for solving regression problems, whereas logistic regression is used for solving
classification problems.
In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function,
whose output is bounded by the two limiting values 0 and 1.
The curve from the logistic function indicates the likelihood of something, such as whether cells
are cancerous or not, or whether a mouse is obese or not based on its weight.
Logistic regression is a significant machine learning algorithm because it can provide
probabilities and classify new data using continuous and discrete datasets.
Logistic regression can be used to classify observations using different types of data and can
easily determine the most effective variables for the classification.
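The "S"-shaped logistic function and the way its output is turned into a class can be sketched as follows. The coefficients `weight` and `bias` and the 0.5 threshold are illustrative assumptions; they are not fitted to any data.

```python
import math


def sigmoid(z):
    """Logistic function: maps a real number z to a value between 0 and 1."""
    # math.exp can overflow for very large negative z; acceptable for a sketch.
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weight, bias):
    """Probability of the positive class for a single input x."""
    return sigmoid(weight * x + bias)


def predict_class(x, weight, bias, threshold=0.5):
    """Convert the probability into a 0/1 class label."""
    return 1 if predict_proba(x, weight, bias) >= threshold else 0


# Hypothetical coefficients, not fitted to any real data.
weight, bias = 1.5, -3.0
p_low = predict_proba(0.0, weight, bias)
p_high = predict_proba(4.0, weight, bias)
```

Training a logistic regression model amounts to choosing the weight and bias so that these probabilities match the observed labels as closely as possible.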
CHAPTER 2
LITERATURE SURVEY
A literature survey is the most important step in the software development process. Before developing a
tool, it is necessary to determine the time factor, the economy and the company's strength. Once these are
satisfied, the next step is to determine which operating system and language can be used to develop the tool.
Once programmers start building the tool, they need a lot of external support, which can be obtained from
senior programmers, from books or from websites. These considerations are taken into account before building
the proposed system. A major part of project development is to consider and fully survey all the needs of the
project: before developing the tools and the associated design, it is necessary to determine and survey the
time factor, resource requirements, manpower, economy, and company strength. Once these things are satisfied
and fully surveyed, the next step is to determine the software specifications of the system, such as what type
of operating system the project requires and what software is needed to proceed to the next step, such as
developing the tools and the associated operations. Here we have taken the general surveys of different
authors and noted the key points of their work. In this project, the literature survey plays a dominant part
in gathering resources from different areas and all related topics that are useful for this section. Its
greatest benefit is that it puts things in order and helps us fit our work to existing knowledge.
2.2 LITERARY REVIEWS
“An Exploration of Crime Prediction Using Data Mining on Open Data” Ginger Saltos and Mihaela
Cocea (2017)
The increase in crime data recording coupled with data analytics resulted in the growth of research
approaches aimed at extracting knowledge from crime records to better understand criminal behavior and
ultimately prevent future crimes. While many of these approaches make use of clustering and association rule mining
techniques, there are fewer approaches focusing on predictive models of crime. In this paper, we explore
models for predicting the frequency of several types of crimes by LSOA code (Lower Layer Super Output
Areas — an administrative system of areas used by the UK police) and the frequency of anti-social behavior
crimes. Three algorithms are used from different categories of approaches: instance-based learning, regression
and decision trees. The data are from the UK police and contain over 600,000 records before preprocessing.
The results, looking at predictive performance as well as processing time, indicate that decision trees (M5P
algorithm) can be used to reliably predict crime frequency in general as well as anti-social behavior frequency.
The experiments were conducted using the SCIAMA High Performance Computer Cluster at the University of
Portsmouth.
“Crime Analysis and Prediction Using Data Mining” Shiju Sathyadevan, Devan M.S, Surya
Gangadharan (IEEE-2014)
Crime analysis and prevention is a systematic approach for identifying and analyzing patterns and trends in
crime. Our system can predict regions which have high probability for crime occurrence and can visualize
crime-prone areas. With the increasing advent of computerized systems, crime data analysts can help law
enforcement officers to speed up the process of solving crimes. Using the concept of data mining we can extract
previously unknown, useful information from unstructured data. Here we have an approach between computer
science and criminal justice to develop a data mining procedure that can help solve crimes faster. Instead of
focusing on causes of crime occurrence like the criminal background of the offender, political enmity, etc., we are
focusing mainly on crime factors of each day. This paper has tested the accuracy of classification and prediction
based on different test sets. Classification is done based on the Bayes theorem which showed more than 90%
accuracy.
“Crime Pattern Analysis, Visualization and Prediction Using Data Mining”
Rajkumar Sakkarai Soundarya Jagan. J Varnikasree P (2015)
Crime against women has become a problem for every nation around the globe, and many countries are
trying to curb it. Preventive measures are taken to reduce the increasing number of cases of crime against
women. A huge amount of data is generated every year from the reporting of crime. This data can
prove very useful in analyzing and predicting crime and can help us prevent crime to some extent. Crime
analysis is an area of vital importance in police departments. The study of crime data can help us analyze crime
patterns, inter-related clues and important hidden relations between crimes. That is why data mining can be
a great aid in analyzing, visualizing and predicting crime using a crime data set. Classification and correlation
of the data set make it easy to understand similarities and dissimilarities amongst the data objects. We group
data objects using a clustering technique. The dataset is classified on the basis of some predefined condition.
Here grouping is done according to the various types of crimes against women taking place in different states
and cities of India. Crime mapping will help the administration to plan strategies for the prevention of crime.
Further, using data mining techniques, data can be predicted and visualized in various forms in order to
provide a better understanding of crime patterns.
“Survey on Crime Analysis and Prediction Using Data Mining Techniques”
Benjamin Fredrick David H. and Suruliandi (2017)
Data mining is the procedure of evaluating and examining large pre-existing databases in order to
generate new information which may be essential to the organization. The extraction of new information is
predicted using the existing datasets. Many approaches for analysis and prediction in data mining have been
proposed, but very few efforts have been made in the field of criminology, and very few have compared the
information that all these approaches produce. Police stations and other similar criminal justice
agencies hold many large databases of information which can be used to predict or analyze criminal
movements and involvement in criminal activity in society. Criminals can also be predicted based on the
crime data. The main aim of this work is to perform a survey of the supervised and unsupervised
learning techniques that have been applied to criminal identification. This paper presents a survey of
crime analysis and crime prediction using several data mining techniques. The quantitative analysis produced
results which show an increase in the accuracy level of classification because of using the GA to optimize the
parameters.
Crimes influence organizations and institutions when they occur frequently in a society. Thus, it
seems necessary to study the reasons, factors and relations behind the occurrence of different crimes and to
find the most appropriate ways to control and avoid more crimes. The main objective of this paper is to classify
clustered crimes based on occurrence frequency during different years. Data mining is used extensively for the
analysis, investigation and discovery of patterns in the occurrence of different crimes. We applied a theoretical
model based on data mining techniques such as clustering and classification to a real crime dataset recorded by
police in England and Wales between 1990 and 2011. We assigned weights to the features in order to improve the
quality of the model and remove low-value features. The Genetic Algorithm (GA) is used for optimizing the
Outlier Detection operator parameters using the Rapid Miner tool.
“Empirical Analysis for Crime Prediction and Forecasting Using Machine Learning
and Deep Learning Techniques” Wajiha Safat, Sohail Asghar, Saira Andleeb Gillani (IEEE-2021)
Crime and violations are a threat to justice and are meant to be controlled. Accurate crime prediction and
forecasting of future trends can help to enhance metropolitan safety computationally. The limited ability of
humans to process complex information from big data hinders the early and accurate prediction and forecasting
of crime. The accurate estimation of the crime rate, types and hot spots from past patterns creates many
computational challenges and opportunities. Despite considerable research efforts, there is still a need for a
better predictive algorithm which directs police patrols toward criminal activities. Previous studies fall short
of achieving crime forecasting and prediction accuracy based on learning models. Therefore, this study applied
different machine learning algorithms, namely logistic regression, support vector machine (SVM), Naïve
Bayes, k-nearest neighbors (KNN), decision tree, multilayer perceptron (MLP), random forest, and eXtreme
Gradient Boosting, and time series analysis by long short-term memory (LSTM) and autoregressive integrated
moving average (ARIMA) models to better fit the crime data. The performance of LSTM for time series analysis
was reasonably adequate in order of magnitude of root mean square error (RMSE) and mean absolute error
(MAE) on both data sets. Exploratory data analysis predicts more than 35 crime types and, overall, these results
provide early identification of crime and of hot spots with a higher crime rate.