Exploratory Data Analysis and Crime Prevention Using Machine Learning The Case of Ghana
ISSN No:-2456-2165
B. Artificial Neural Network
An artificial neural network (ANN) is also simply called a neural network. According to S. N. Sivanandam et al [6], an ANN may be defined as an information processing model that is inspired by the way biological nervous systems, such as the brain, process information. This model tries to replicate only the most basic functions of the brain. The key element of an ANN is the novel structure of its information processing system. An ANN is composed of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems. The fundamental data structure in neural networks is the layer, which is a data-processing module that accepts one or more tensors as inputs. Some layers are stateless, but more frequently layers have a state: the layer's weights, one or several tensors learned with stochastic gradient descent, which together contain the network's knowledge. A neural network learns during training and is able to adapt to stimulus. This is done through the parameterization of the weights [7], which are adjusted repeatedly until the network produces the desired target or response. The main components of neural networks are the layers, networks, objective functions, and optimizers. The authors of [8] proposed a dynamic-programming-based recursive algorithm to find the similarity between the training and test images. To rank the images by similarity with the given object template, a greedy multistate gradient descent method was used. The method proved robust to rotation and deformation but was not able to deal with extreme viewpoint changes.

C. Related Work
Propose a new way of analyzing crime patterns using the combination of Formal Concept Analysis and Geographical Information Science to discover patterns in

Luiz G. A. Alves et al [12] examined a crime dataset from Brazil. They performed predictive analysis on the dataset and tried to establish the relationship between crime and urban indicators. They argued that, because of the non-Gaussian distributions and multicollinearity in the urban indicators, it is common to find controversial conclusions about the influence of some urban indicators on crime. A random forest algorithm was used to predict crime and to quantify the influence of urban indicators on crimes such as homicide. The study also ranked the indicators by importance and found that unemployment and illiteracy are the most important variables for homicides in Brazilian cities.

However, they could not predict or forecast which crimes are most likely to be committed in a given period. The study reported only the accuracy of the model, which it says can predict crime with up to 97% accuracy from the attributes and indicators that affect crime, in order to guide crime control. It is also important to note that their model was able to perform prediction with accuracy between 38% and 39%.

Rizwan Iqbal et al [13] used classification algorithms to predict crime for different states in the United States of America using real data. An open-source tool called WEKA, developed in Java, was used together with the 'Crime and Community' dataset. They compared the results of two machine learning algorithms, namely the Naïve Bayesian algorithm and the Decision Tree algorithm, in predicting a particular category of crime. A 10-fold cross-validation was applied to the dataset separately for the Naïve Bayesian algorithm and the Decision Tree algorithm. It then concluded that the Naïve Bayesian algorithm is the best for performing an accurate prediction of crime category, at 83.9519%. The researchers could not evaluate the performance of their prediction, and did not consider other features to study their effects on crime.
Secondary data is data that has already been collected by a primary source and made available for researchers to use in their own research. This data may have been collected for general use with no specific research purpose. Secondary data sources include books, personal sources, journals, websites, government records, non-governmental organizations, and newspapers, among others. The prevalence of the internet and electronic media has made access to secondary sources of data easy. Using secondary data saves the time and cost of collecting data. It serves as a baseline for primary research, helps in research design, and is associated with quantitative databases, serving as a legitimate avenue for quantitative research.

In this research work, secondary data is the basis, and secondary sources are used to collect all data relevant to the study. The secondary data for this work is the Chicago Crime datasets, which are the primary focus. However, if possible, data on crime affecting rural and farming households and their access to state security when a crime is committed may be considered and shall be stated wherever applicable.

V. DATA COLLECTION AND PREPARATIONS

Data collection is the process by which data is gathered from relevant sources to find solutions to a research problem, to test a research hypothesis and to evaluate the outcomes. Data collection is an important part of quantitative research: it captures quality evidence that allows analysis to lead to a credible solution to a problem.

A dataset is a container for data storage, mostly presented as a two-dimensional array made up of rows and columns. The dataset for this work has 22 attributes. The data is preprocessed using data mining techniques to transform the raw data into an efficient, useful and meaningful format.

After the data has been cleaned, explored and visualized, it is time to select the machine learning model. Choosing a suitable machine learning model makes the work easier and helps to obtain the predictions we expect. Machine learning tasks include classification, regression, dimensionality reduction and clustering. The choice of model depends on the amount of data available for training, testing and prediction. It also depends highly on the problem to be solved. In machine learning, a model is created in order to predict the outcome of an event, such as the crime rate or the price of a house. After the model is created, its performance is measured using the method called Train and Test.

VIII. EVALUATION OF THE MODEL

An important part of predictive modeling is evaluating the model. A machine learning model is always evaluated to determine how good it is at predicting the target on new and future data. Model evaluation aims at estimating the general accuracy of a model on future data. The methods for evaluating a model's performance are divided into two main categories, namely holdout and cross-validation. Cross-validation, also called out-of-sample testing, is used to determine how well the results of a model generalize to an unseen dataset. It involves partitioning the original observations into a training set, which is used to train the model, and an independent set, which is used to evaluate its performance. The most common cross-validation technique is k-fold cross-validation. Holdout validation tests the model on data different from the data it was trained on, to provide an unbiased estimate of learning performance. It involves randomly dividing the data into three subsets, namely the training set, the validation set and the test set. Both methods use a test set to evaluate model performance. It is not recommended to evaluate a model on the data used to build it: a model that simply remembers the whole training set will always predict the correct label for any point in the training set, so the evaluation hides overfitting instead of revealing it.
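The two evaluation schemes described in this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `holdout_split` and `k_fold_splits`, the 60/20/20 split and the choice of k = 5 are hypothetical and are not taken from the paper, which does not state how its data was partitioned.

```python
import random


def holdout_split(n_samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly divide sample indices into training, validation and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_frac)
    n_val = int(n_samples * val_frac)
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield one (train_indices, test_indices) pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    train, val, test = holdout_split(10)
    print("holdout sizes:", len(train), len(val), len(test))
    for train_idx, test_idx in k_fold_splits(10, k=5):
        print("held-out fold:", sorted(test_idx))
```

In k-fold cross-validation every observation is used for testing exactly once, and the k scores are averaged. For time series such as monthly crime counts, a random shuffle may leak future information into training, so a chronological split is often preferred.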
VI. DATA VISUALIZATION

Data visualization is the means of conveying information to users graphically, for example as graphs or maps. It can be said to be a graphical representation of information for easy understanding and interpretation. The important information in the dataset is explored to gain insight into the Chicago crime dataset. This is done through the process of Exploratory Data Analysis (EDA), which enables us to see useful trends in the dataset.

IX. EXPLORATORY DATA ANALYSIS

The entire dataset was visualized and analyzed. All missing values were dropped from the dataset before prediction. Seaborn and matplotlib were used for visualization.
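A minimal sketch of this kind of exploratory step is given below in plain Python, without the plotting libraries. The toy records and the field names (`primary_type`, `location`, `arrest`) are assumptions made for illustration; they are not the paper's data or its actual column names.

```python
from collections import Counter

# Toy records standing in for rows of the crime dataset; the field names
# and values are hypothetical and used only for illustration.
records = [
    {"primary_type": "THEFT", "location": "STREET", "arrest": False},
    {"primary_type": "BATTERY", "location": "RESIDENCE", "arrest": True},
    {"primary_type": "THEFT", "location": None, "arrest": False},
    {"primary_type": "ROBBERY", "location": "SIDEWALK", "arrest": False},
    {"primary_type": "THEFT", "location": "STREET", "arrest": True},
]


def drop_missing(rows):
    """Keep only rows in which no field is missing (None or empty string)."""
    return [r for r in rows if all(v is not None and v != "" for v in r.values())]


def top_counts(rows, field, n=10):
    """Return the n most frequent values of `field` with their counts."""
    return Counter(r[field] for r in rows).most_common(n)


clean = drop_missing(records)
print(len(records) - len(clean), "row(s) dropped for missing values")
print("top crime types:", top_counts(clean, "primary_type"))
print("top locations:", top_counts(clean, "location"))
```

Frequency tables of this kind are what a bar chart of the top ten crimes or locations would display.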
Figure 1

From the dataset between 2012 and 2017, the number of domestic crimes committed was 12000000, and 200000 represents the total number of other crimes.

A total of 110000 offenders could not be arrested, while about 40000 arrests were made by the police department. This is represented graphically below.

Figure 2

The data was studied to identify the top ten (10) crimes committed, which range from theft, the most frequent, to robbery. A little over 300000 thefts were committed between 2012 and 2017, while robbery, with a little over 50000 incidents, was the least frequent of the top 10 crimes over the same period. The top 10 crimes committed are shown in the graph below.

Figure 1

From the analysis in figure 4 below, the number of crimes committed appears evenly distributed across the days of the week. We therefore cannot say that crimes are more likely to be committed on a particular day of the week; however, it is imperative to assume Thursday to be the day

Figure 2

X. LOCATION OF CRIMES

The historical data was visualized to present the locations where crimes are committed and their frequency. From the diagram below, it is observed that the top four locations are STREET, RESIDENCE, APARTMENT and SIDEWALK. Other crimes are committed at OTHER, PARKING LOT/GARAGE, ALLEY, RESIDENTIAL YARD, SMALL RETAIL STORE and SCHOOL, PUBLIC and BUILDING. Street crimes are the most common, with 330471 incidents. The fewest crimes among these locations are committed in School, Public and Building.

Figure 1

In the exploratory analysis of the historical data, the location of crimes is examined to determine the District with the most crimes. From the diagram below, the lowest number of crimes is committed in District 13. District 11 recorded the highest number of crimes between 2012 and 2017.

Figure 2

From the plot in figure 7 below, it can be inferred that more crimes were committed between May and August, with July having the highest number of crimes. The month of
Figure 3
Figure 5
Figure 1

Table 1
S/N   Dates        Actual Crimes (y)   Predicted Crimes (yhat)
1     2014-02-28   17684               17892
2     2014-03-31   19155               22005
3     2014-04-30   20949               22519
4     2014-05-31   22619               24413
5     2014-05-31   20203               24867
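The gap between the actual and predicted counts in Table 1 can be summarized with the error metrics the paper reports for the model (MSE, RMSE and MAE). The sketch below applies the standard definitions to the five rows of the table; the function names are illustrative, and whether the paper computed its metrics in exactly this way is an assumption.

```python
import math

# Actual (y) and predicted (yhat) monthly crime counts copied from Table 1.
actual = [17684, 19155, 20949, 22619, 20203]
predicted = [17892, 22005, 22519, 24413, 24867]


def mean_absolute_error(y, yhat):
    """Average size of the forecast error, ignoring its sign."""
    return sum(abs(a - p) for a, p in zip(y, yhat)) / len(y)


def mean_squared_error(y, yhat):
    """Average squared forecast error; large misses are penalized more heavily."""
    return sum((a - p) ** 2 for a, p in zip(y, yhat)) / len(y)


mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
rmse = math.sqrt(mse)  # same unit as the crime counts

print(f"MAE={mae:.1f}  MSE={mse:.1f}  RMSE={rmse:.1f}")
```

For these five rows the MAE is about 2217 crimes and the RMSE about 2668 crimes. In every row the predicted count is higher than the actual count.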
XI. THE PROPHET

Prophet is a tool developed by Facebook for time series forecasting. It is one of the tools capable of producing forecasts of reasonable quality. It is an open-source tool based on an additive model in which nonlinear trends are fit with yearly, weekly and daily seasonality, and it also accounts for holiday effects. It is used in many applications to produce reliable predictions.

In this work, Prophet is used to model the dynamics of crime. This enables the generation of daily, weekly, monthly and quarterly crime predictions.

When applying Prophet, it is very important that at least 24 months of historical data are available for the efficient and reliable estimation of the various trends.

The model was able to predict the total number of crimes that could be committed per month over 1460 days, representing 4 years, based on the historical data that was obtained. This covers the years 2018 to 2021. The model predicted a decrease in the number of crimes likely to be committed, with 2021 expected to record the fewest crimes. It is important to note that the number of crimes has declined each year since 2012.

The model also predicted the fewest crimes in the period 1st to 30th January, 2021, at 10978, compared with 30, the lowest count in the historical data, recorded for 1st to 30th January, 2017. The forecast components below plot the trend and the yearly and weekly seasonality of the Chicago crime datasets between 2012 and 2021. We therefore visualize the crime predictions from 2018 to 2021.
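The additive structure that Prophet is described as using can be written compactly. The notation below follows the standard Prophet formulation and is given as a sketch; that it matches the paper's own numbered equations is an assumption.

```latex
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
```

Here $y(t)$ is the number of crimes at time $t$, $g(t)$ is the nonlinear trend, $s(t)$ collects the periodic components (yearly, weekly and daily seasonality), $h(t)$ represents holiday effects, and $\varepsilon_t$ is the error term not explained by the model.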
A quality prediction is very important after the model
(2)
(3)

Figure 2

From the diagram below, the blue line in the graph represents the predicted values and the black dots represent the data in the historical dataset.

The table below gives the metrics of our model for a given period or horizon using Facebook Prophet.

Table 1
Horizon   MSE     RMSE    MAE
48        1.606   12.67   8.35
51        1.600   12.65   8.17
53        1.74    13.22   9.45
56        1.64    12.81   8.52
59        1.90    13.79   10.49
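For reference, the three metrics in the table are conventionally defined as follows. These are the standard definitions, assumed here to be the ones used; $y_i$ is an actual value, $\hat{y}_i$ is the corresponding forecast, and $n$ is the number of forecasts evaluated at a given horizon.

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}},
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
```

Lower values indicate forecasts closer to the observed data. RMSE and MAE are expressed in the same unit as the forecast quantity, while MSE is in squared units.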
The forecast or prediction is evaluated on real-world data. More resources should be allocated to the police department to fight crime. It can be concluded that most crimes occur on streets and in residences, with schools and public buildings also among the ten most common locations, and therefore most police personnel must be deployed to these areas. More resources and logistics must be allocated to fight crime in these locations. According to the exploratory data analysis of the historical data, it can be concluded that most offenders are not arrested for the crimes they have committed. As many as 110000 offenders were not arrested, while only 40000 offenders of various crimes were arrested. This could be due to the wrong allocation of logistics to places with less crime. On average, more crimes are committed between May and August each year. February is the month with the lowest crime rate, and this can be attributed to the fact that it has the fewest days of any month in the year.

It was also discovered that many crimes are committed between the hours of 5:00 pm and 10:00 pm. However, it is imperative to understand from the data that the hour with the most crimes is 12:00 noon.

With regard to District-level crimes, it was observed that District 11 is the district with the highest crime between 2012 and 2017. It is followed by Districts 7, 4, 25 and 6. This is an indication that more logistics and personnel will be required in those Districts to help prevent crimes from being committed.

The overall trend from the forecast is that the crime rate keeps decreasing each year. We now know that most crimes happen on streets and sidewalks, and therefore we need extra police personnel on street patrol. A lot of crimes occur in residences and/or apartments, and therefore the Police Service will require more personnel to respond to distress 911 calls from people.

XV. CONCLUSION

In this paper, Facebook Prophet, a machine learning tool, was used to predict or forecast crime. It is recommended that the Ghana Police Service and other security agencies apply machine learning and artificial intelligence in combating crime in the country. The Police Service can use this method to forecast the amount of crime to be committed in the country in a given year and month. It is hereby concluded that it is very necessary to ensure that the Ghana Police Service and other security agencies apply Machine Learning in crime

I express my sincere gratitude to the Almighty God, Shaheed Udham Singh College of Engineering and Technology, I.K.G Punjab Technical University, Jalandhar for giving me the opportunity to work on the thesis during my final year of M.Tech. I owe my sincerest gratitude to Professor Sushil Kamboj, Er. Parvinder Kaur and Er. Deepinda Kaur for their valuable advice and healthy criticism throughout my thesis, which helped me immensely to complete my work successfully. I would like to thank the Head of Department CSE and members of the Departmental research team for their valuable suggestions and healthy criticism during my presentation of the work. I would also like to thank Professor Sukhpreet Kaur, Er. Rasneet Kaur, Dr. Manavjot, and all my lecturers at the CSE department and SUS College for their help. I would also like to thank my parents, Sylvia Somuah, Nancy Edu, Mavis Darko, Andrews Oberko, Asare Lydia, Taiwo Emmanuel, Mary Margaret Dakurah, African students of SUS, friends, etc. who helped me one way or the other in the process of my studies.

REFERENCES

[1]. https://en.wikipedia.org/wiki/Ghana_Police_Service, accessed online on 5 March 2020.
[2]. https://police.gov.gh/en/index.php/functions/#, accessed online on 5 March 2020.
[3]. S. N. Sivanandam and S. N. Deepa, "Principles of Soft Computing".
[4]. Rekha Nagar, Yudhvir Singh, U.I.E.T (M.D.U), "A literature survey on Machine Learning Algorithms".
[5]. Lawrence McClendon and Natarajan Meghanathan, "Using machine learning algorithms to analyze crime data".
[6]. S. N. Sivanandam and S. N. Deepa, "Principles of Soft Computing".
[7]. S. Agatonovic-Kustrin, R. Beresford, "Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research".
[8]. K. Schindler, D. Suter, "Object Detection by Global Contour Shape", Science Direct Journal on Pattern Recognition (2008), Vol. 41, Issue 12, pp. 3736–3748.
[9]. Alice Zheng and Amanda Casari, "Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists".
[10]. Kester Quist-Aphetsi, "Visualization and analysis of geographical crime patterns using formal concept analysis", IJRSG, ISSN No: 2319-3484, Volume 2, Issue 1, Jan. 2013.