
1 Introduction

Technology is a double-edged sword. Criminals have been using technology for various destructive purposes [1]. Preemptive measures are of utmost importance for crime prevention. It is therefore imperative that law enforcement agencies have advanced crime analysis tools to aid them in crime prevention [2,3,4]. Various countries, such as the United States of America, have benefited from this approach [5, 6].

One of the challenges faced by police departments is to minimize threats to society by investigating large volumes of data [7]. Researchers have taken various initiatives to analyze crime data using data mining techniques [1, 8,9,10,11]. The Government of India has also taken an initiative to implement data analytics to facilitate crime prevention measures [11]. The Gujarat Police Department shares the same vision. One of the authors submitted a proposal to the Gujarat Police Department to access confidential crime data for research purposes. The department graciously granted access to monthly and annual crime records for the years 2012 to 2016 for the State of Gujarat in India. The Gujarat Police Department categorizes crime into nine categories: Home Break-In, Injuries, Kidnapping, Murder, Attempt to Murder, Police Raids, Rioting, Robbery and Theft. The goal of this paper is to develop a predictive model of the police department's future solving rates, helping it to become more proactive, and to identify crime associations and crime hotspots. This will further help the Gujarat Police Department to analyze its crime records and provide meaningful insights for decision making to solve the cases recorded. A survey on crime analysis found that 10% of the criminals commit 50% of the crimes [10]. Research initiatives of this kind can therefore help the police department solve crimes at a faster rate.

This research paper is a blend of supervised and unsupervised data mining techniques to analyze crime data. Here we implemented Multiple Linear Regression, Association Rules Analysis and K-Means Clustering algorithms to conduct a comparative study of various crime patterns from the dataset. The above-mentioned methods will help us identify underlying crime patterns and generate valuable insights from the dataset.

The motivation for conducting this research is to aid the Gujarat Police Department in crime prevention and to help it understand the benefits of data mining in this domain. In this paper, different data mining methods are applied and their results discussed; these results can be used by other law enforcement agencies and as a point of reference for future research initiatives in the domain of crime analysis.

The paper is organized into six sections. Section 1 is the introduction. The background to crime analysis and prevention is discussed in the literature review (Sect. 2). The methodology is discussed in Sect. 3. The different phases of data analysis are presented in Sect. 4. In Sect. 5 we discuss the results of the data analysis, including data visualization, graphs, thresholds and performance evaluation. The conclusion of the paper is presented in Sect. 6, which also covers the limitations and future scope of the research.

2 Literature Review

India is the second most populous country in the world, with a population of 1.3 billion people [11, 12]. According to the “7th Schedule Article 246 of the Indian Constitution”, Law and Order is a subject of the state [13]. This has created enormous pressure on state governments to prevent crimes in their respective states. In this research paper, the authors have analyzed crime records for the state of Gujarat using various data mining techniques. The state of Gujarat has a population of 60.3 million people according to the latest census data [14]. The recorded actual strength of police personnel in the state of Gujarat in the year 2011 was 72,838. This makes the task of maintaining law and order extremely challenging, owing to the low police-to-population ratio of 129.89 per 100,000 people [15].

Understanding the administrative structure of the Gujarat State Police Department was important for the authors to collaborate with it effectively. In a state jurisdiction, the head of the police force is the Director General of Police (DGP). A state comprises many districts. A Range, that is, a cluster of neighboring districts, is under the control of a Deputy Inspector General of Police (DIGP). Some states have two or more ranges, based on geography or population; such a state is headed by an Inspector General of Police (IGP). A district is headed by a Senior Superintendent of Police (SSP) or Superintendent of Police (SP), under whom the Assistant Superintendent of Police (ASP) and Deputy Superintendents of Police (DSP) act [11].

Researchers have used various data mining techniques to help law enforcement agencies prevent crimes [16,17,18,19]. During our literature review we observed that, for crime hotspot analysis and crime pattern analysis, various researchers preferred Density Based Clustering and K-Means Clustering, which produced clusters based on the number of crime incidents recorded [1, 10, 20]. The method used for analysis varies greatly with the crime type and data type. Methods such as social network analysis are used to analyze data from various social media platforms, which is extremely important for preventing crimes [4]. Certain classification techniques, such as K-NN (K Nearest Neighbors) classification, help law enforcement agencies to classify crime types based on independent attributes [2].

Studying the work done by other researchers, we learnt that most of them used only one of the above-mentioned data mining techniques. We identified this as our research gap and aimed to develop a holistic research paper which uses multiple unsupervised and supervised data mining methods, along with statistical tests of their performance and accuracy.

Market Basket Analysis using the Apriori algorithm has been a widely used data mining technique for studying consumer behavior in the retail sector [21]. We wanted to study whether market basket analysis can be used in other domains to study the underlying relationships in the data. We came across research in the domain of transportation safety that used market basket analysis to study the associations between fatal car crashes and various underlying factors [22]. We extended this idea to the domain of crime data mining and, after initial data preparation, used market basket analysis to study the associations between different crime types based on the number of crime incidents recorded.

3 Methodology

To analyze the data effectively using multiple data mining techniques and to avoid ambiguities, we followed the SEMMA data mining implementation steps developed by the SAS Institute [23]. SEMMA stands for “Sample”, “Explore”, “Modify”, “Model”, “Assess” [24].

  • Sample: Various sampling strategies are used when the data is too large or complex to be analyzed in full. These sampling strategies try to replicate the properties of the population. In our case we analyzed the complete dataset to derive important insights and inferences.

  • Explore: We conducted exploratory analysis by visualizing the data in R Studio to study descriptive statistics and to identify relationships between the variables in the dataset which helped us understand the data in a concise manner.

  • Modify: Feature engineering is an important element which enriches the data for effective analysis and model development [10]. We added the variables “Population”, “Crime Density” and “Crime Code”. The population of each city was needed because population is an attribute in the crime density formula. The Crime Code was added so that Multiple Linear Regression could be applied: the different Crime Types were encoded as numerical data in order to obtain statistical parameters. These variables improved model accuracy.

  • Model: We developed predictive models using Multiple Linear Regression and used unsupervised learning methods such as K-Means Clustering algorithm and Apriori algorithm to study underlying patterns and associations in the data.

  • Assess: To evaluate the performance of the model, it was tested on unseen data. Statistical measures such as the ‘p-value’, ‘t-value’ and ‘R-Squared’, together with ANOVA tests, were used to select the best model.

A conceptual model was developed to help us perform the data analysis activities in a structured manner. Figure 1 depicts this conceptual model. It illustrates the data analysis processes and techniques used for developing the predictive and classification model using SEMMA. First, the data was obtained from the police department, after which the annual crime data was extracted from the main dataset. This crime dataset was used for detailed analysis. To prepare the data, various irrelevant fields and null values were omitted, and the variable solving rate was checked for correctness and any errors were corrected. To further enrich the data, we added new features such as “Crime Density”, “Crime Code” and “Population”. The average imputation method was used to impute population values for districts whose populations were not available [1]. The cleaned data was loaded into R Studio for further analysis.

Fig. 1. Data analysis – conceptual model

We decided to use two approaches for our analysis. The first, “Crime pattern analysis”, studies patterns and associations in the dataset using K-Means Clustering and Association Rules mining [4]. The second develops predictive models using Multiple Linear Regression to predict the department's solving rate in subsequent years [25].

Various model testing techniques, such as ANOVA, variable importance tests and p-values, were used for finalizing the variables and for model selection. The model was validated by testing it on unseen test data. The model's stability was further tested by performing ten-fold cross validation [26].
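The ten-fold cross validation step can be illustrated with a minimal Python sketch. The authors worked in R, so this is not their code: the data below are synthetic stand-ins rather than the Gujarat records, and the helper names `r_squared` and `k_fold_r_squared` are hypothetical. The sketch splits the rows into ten folds, fits a least-squares regression on nine of them, and scores the held-out fold with R-squared.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def k_fold_r_squared(X, y, k=10, seed=0):
    """Return the held-out R-squared of a least-squares fit for each of k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    X1 = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the nine training folds only.
        beta, *_ = np.linalg.lstsq(X1[train_idx], y[train_idx], rcond=None)
        # Score on the fold that was held out.
        scores.append(r_squared(y[test_idx], X1[test_idx] @ beta))
    return np.array(scores)

# Synthetic stand-in data (assumption): y depends linearly on two predictors plus noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(200, 2))
y = 5.0 + 0.8 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 2.0, size=200)

scores = k_fold_r_squared(X, y, k=10)
print("held-out R-squared per fold:", np.round(scores, 3))
print("mean held-out R-squared:", scores.mean())
```

A small spread of the per-fold scores around their mean is what "stability" refers to in this setting; a large spread would suggest that the fit depends heavily on which rows were used for training.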

4 Data Analysis

Data analysis uses various data mining techniques to analyze the data and uncover information previously unknown to us [26, 27]. Selecting the algorithms that best fit our research needs and help the Gujarat State Police Department was of utmost importance. Various steps were undertaken to analyze the data successfully and extract meaningful information from it.

4.1 Data Preprocessing and Preparation

The dataset comprised variables such as “Year”, “City”, “Crime type”, “Public” (Cases recorded), “Found” (Cases resolved), and “Percentage” (Crime solving percentage). Data cleaning and transformation were carried out to derive better insights from the data. Three additional variables, namely “Population”, “Crime Density” and “Crime Code”, were added. The population for the years 2012 to 2016 was derived from the annual population growth rate of India and the census data for the year 2011 [1, 28]. In addition, the crime density per 100,000 of the general population was calculated using the formula:

$$ \text{Crime density} = \left( \frac{\text{Public (Cases Recorded)}}{\text{Population}} \right) \times 100{,}000 $$

The variable “Crime Code” was derived by encoding the variable “Crime type” as sequential numbers from 1 to 9.
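The three derived variables can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' R code: the base population, the growth rate and the record values are invented; compound annual growth is assumed for the population projection, since the text only says the growth rate and the 2011 census were used; and the order in which crime types receive codes 1 to 9 is also an assumption.

```python
# Assumed constants for illustration only (not taken from the paper).
ASSUMED_CENSUS_2011 = {"CityA": 5_000_000, "CityB": 1_200_000}
ASSUMED_GROWTH_RATE = 0.012  # hypothetical annual growth rate

# The nine categories named in the paper; the code order is an assumption.
CRIME_TYPES = [
    "Home Break-In", "Injuries", "Kidnapping", "Murder", "Attempt to Murder",
    "Police Raids", "Rioting", "Robbery", "Theft",
]
CRIME_CODE = {name: code for code, name in enumerate(CRIME_TYPES, start=1)}

def projected_population(base, base_year, year, rate):
    """Project a census population forward, assuming compound annual growth."""
    return base * (1 + rate) ** (year - base_year)

def crime_density(cases_recorded, population):
    """Cases recorded per 100,000 of the general population."""
    return cases_recorded / population * 100_000

# Hypothetical records using the variable names described in the text.
records = [
    {"year": 2012, "city": "CityA", "crime_type": "Theft", "public": 2500, "found": 900},
    {"year": 2014, "city": "CityB", "crime_type": "Robbery", "public": 120, "found": 80},
]

for rec in records:
    pop = projected_population(
        ASSUMED_CENSUS_2011[rec["city"]], 2011, rec["year"], ASSUMED_GROWTH_RATE
    )
    rec["population"] = pop
    rec["crime_density"] = crime_density(rec["public"], pop)
    rec["crime_code"] = CRIME_CODE[rec["crime_type"]]
    print(rec["city"], rec["year"], rec["crime_code"], round(rec["crime_density"], 2))
```

Note that a sequential code treats the crime type as an ordered number when it enters a regression; whether that ordering is meaningful depends on how the codes are assigned.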

4.2 Data Mining Techniques

Data mining techniques are used to extract valuable information from the data [27]. These techniques are mainly of two types: supervised and unsupervised.

Supervised learning: In this method the dataset is divided into two parts namely Training dataset and Test dataset. The models are developed using the training data and validated by comparing the model’s prediction with that of the unseen test data [5, 27].

Unsupervised learning: In this class of data mining technique, the dataset is analyzed for underlying patterns and relations between different variables of a dataset [5, 27].

We applied both types of data mining technique to analyze the data: Multiple Linear Regression (supervised learning) to develop predictive models, and K-Means Clustering and Association Rules mining (unsupervised learning) to study the underlying patterns and the associations between different attributes in the dataset.

R provides various packages for performing analysis on a dataset. We used R to clean the data, develop the data mining models and test them. The following data mining methods were used to analyze our data.

Multiple Linear Regression: This method uses more than one independent variable to predict the dependent variable and to establish the relationship between them. The difference between the predicted value \( P_{i} \) and the actual value \( A_{i} \) is termed the error. The regression can be expressed as [5, 27]: \( Y_{i} = \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{n} x_{n} + \varepsilon_{i} \)

Where:

  • \( Y_{i} \): denotes the measured value of the dependent variable

  • \( x_{1} \ldots x_{n} \): denotes the values of the independent variables

  • \( \beta_{0} \): denotes a constant

  • \( \beta_{1} \ldots \beta_{n} \): denotes the estimated regression coefficients

  • \( \varepsilon_{i} \): denotes the residual
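To make the regression equation above concrete, the following Python sketch estimates the coefficients by ordinary least squares on synthetic data. It is only an illustration: the authors fitted their model in R on the crime records, whereas the two predictors, the true coefficients and the noise level here are invented.

```python
import numpy as np

# Synthetic stand-in data (assumption): two predictors and a known linear relationship.
rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(0, 50, n)
x2 = rng.uniform(0, 10, n)
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(0, 0.5, n)

# Design matrix: a column of ones for the constant term, then the predictors.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares estimate of (beta_0, beta_1, beta_2).
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ beta
residuals = y - fitted  # the residual term of the regression equation
r2 = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

print("estimated coefficients:", np.round(beta, 3))
print("R-squared on the fitted data:", round(float(r2), 4))
```

Because the model includes a constant term, the residuals sum to approximately zero, which is a quick sanity check on any least-squares fit.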

K-Means Clustering: This method groups similar objects together to analyze crime patterns in the dataset. The algorithm uses a distance function to assign each observation to a cluster: an object is assigned to the cluster whose center is closest to it. The user chooses the number of clusters so as to keep the within-cluster sum of squared distances small [27, 29].
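The assignment and update steps described above, together with the within-cluster sum of squares (WSS) used when choosing the number of clusters, can be sketched as follows. This is a minimal Python illustration and not the R routine the authors used. The points are synthetic, loosely imitating pairs of cases recorded and cases resolved, and the farthest-point initialization is an assumption chosen to keep the example deterministic.

```python
import numpy as np

def farthest_point_init(points, k, rng):
    """Pick one random point, then repeatedly add the point farthest from all chosen centers."""
    centers = [points[rng.integers(len(points))]]
    while len(centers) < k:
        diffs = points[:, None, :] - np.array(centers)[None, :, :]
        nearest = np.linalg.norm(diffs, axis=2).min(axis=1)
        centers.append(points[nearest.argmax()])
    return np.array(centers)

def assign(points, centers):
    """Label each point with the index of its closest center (Euclidean distance)."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=100, seed=0):
    """Alternate assignment and center updates; return labels, centers and WSS."""
    rng = np.random.default_rng(seed)
    centers = farthest_point_init(points, k, rng)
    for _ in range(n_iter):
        labels = assign(points, centers)
        # Move each center to the mean of its points; keep it in place if it has none.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    labels = assign(points, centers)
    wss = float(np.sum((points - centers[labels]) ** 2))
    return labels, centers, wss

# Synthetic stand-in data (assumption): three well-separated groups of 50 points.
rng = np.random.default_rng(1)
blob_means = np.array([[100.0, 60.0], [800.0, 500.0], [2600.0, 900.0]])
points = np.vstack([m + rng.normal(0, 30.0, size=(50, 2)) for m in blob_means])

# WSS for a range of k, as plotted in an elbow chart.
wss_by_k = {k: kmeans(points, k)[2] for k in range(1, 7)}
labels, centers, _ = kmeans(points, 3)
for k, wss in wss_by_k.items():
    print(k, round(wss, 1))
```

In this synthetic example the WSS drops sharply up to three clusters and flattens afterwards, which is the elbow pattern; the number of groups here is a property of the invented data and says nothing about the crime records.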

Association Rule Mining: Association Rules analysis is an unsupervised learning technique used to extract rules describing which items tend to occur together. In this case we mined the associations between the different types of crime committed [27, 30, 31]. The rules are generated from transactions, each of which is identified by a transaction ID. The Association Rules are evaluated using three metrics: Support, Confidence and Lift.

Support measures how frequently a crime type, or a combination of crime types, occurs in the set of transactions in the dataset. Confidence is the ratio of the support of two crime types occurring together to the support of the first crime type on its own, regardless of the other crime types with which it appears in other transactions. It also measures the accuracy of the association rule [32]. Lift is the ratio of the support of the two crimes occurring together to the product of the supports of the two crimes individually [30, 31]. A lift above 1 indicates that the two crime types occur together more often than would be expected if they were independent.
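The three metrics can be computed directly from a list of transactions, as in the Python sketch below. It covers only the rule metrics, not the candidate generation performed by the Apriori algorithm, and the transactions are invented for illustration; they are not baskets from the Gujarat dataset.

```python
# Hypothetical transactions: each set lists crime types that fall into the same basket.
transactions = [
    {"Murder", "Rioting", "Kidnapping"},
    {"Murder", "Rioting", "Kidnapping"},
    {"Robbery", "Kidnapping"},
    {"Theft"},
    {"Murder", "Rioting"},
    {"Robbery", "Injuries"},
    {"Theft"},
    {"Murder", "Kidnapping"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

rule_lhs, rule_rhs = {"Murder", "Rioting"}, {"Kidnapping"}
print("support:", support(rule_lhs | rule_rhs, transactions))
print("confidence:", confidence(rule_lhs, rule_rhs, transactions))
print("lift:", lift(rule_lhs, rule_rhs, transactions))
```

Dividing confidence by the support of the consequent is algebraically the same as dividing the joint support by the product of the two individual supports, so this `lift` matches the definition given in the text.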

The algorithm used for mining Association Rules is the Apriori algorithm, implemented in R using R Studio. The different crime types are mined, and the crime types that are associated with each other according to the above three measures are extracted. This indicates which crime types tend to be committed by criminals given their previously committed crimes. The police department can use this to anticipate a suspect's likely next crime type and be watchful in preventing it.

5 Results and Discussion

After this preliminary analysis of the dataset, we want to help the Gujarat Police Department to be more proactive and to introduce countermeasures in good time. We developed a predictive model using regression [5, 27] in R to predict the cases resolved. These will be the cases which need to be solved on high priority to minimize crime in Gujarat. The regression model provided an accuracy of 85.5%. We observed in Table 3 that the “Crime Type Thieves” is not significant. This leads to an interesting finding: theft-related cases are the least resolved, which needs to be considered by the Gujarat Police. Theft being the least resolved crime type might be one of the reasons that theft is one of the most frequent crimes in Gujarat, as shown in Table 1.

Table 1. Regression model results

To test and validate the Multiple Linear Regression model, we used the analysis of variance test and the model prediction function on unseen test data. Table 2 below displays the result of the analysis of variance test, in which the F-values of the variables we selected are significant.

Table 2. Analysis of variance

Table 3 displays how well our predictive model performed on unseen test data. We obtained an R-Squared value of 77%, which indicates that the model also performed well on unseen test data.

Table 3. Model prediction statistics

To check the impact and importance of each variable in our model, we ran a variable importance test. Here a higher overall value signifies higher importance and relevance. From Table 4 it is clear that the variable ‘Cases Recorded’ has the highest significance for predicting our crime solving rate.

Table 4. Variable importance table

We decided to perform K-Means Clustering and Association Rules Analysis to identify patterns and associations in the dataset, allowing a deeper analysis of the crime data. To explore the underlying patterns in our data we used the K-Means clustering technique [9]. The Elbow method in Fig. 2 shows that six is the optimal number of clusters, as adding another cluster does not appreciably reduce the total within sum of squares. Using this as our K-value, we performed K-Means clustering in R. Figure 3 shows the six homogeneous clusters created based on the variables “Cases recorded” and “Cases resolved”.

Fig. 2. A plot describing the within sum of squares (WSS) by Elbow method

Fig. 3. K-means clustering result

As observed in Fig. 3, clusters can be segregated into different crime zones based on the number of cases recorded. In addition, Cluster 5 is a high-risk crime zone where the number of cases recorded is over 2500 crime instances which is extremely high as compared to the average.

After detailed analysis of Cluster 5, it is evident that all the recorded instances of crime are related to theft (Table 5). Hence, it is recommended that the Gujarat Police Department allocate more resources and prioritize theft-related cases in Ahmedabad city. Table 5 also suggests an interesting inference: because the cases of theft are not properly resolved, they do not have high significance in helping us predict the cases resolved.

Table 5. Contents of K-means cluster 5

Our predictive model helped us to identify that theft is the most frequent crime in Gujarat (Table 1). Crime pattern analysis using K-means clustering further confirms this result, showing that over 2500 recorded instances of crime are related to theft (Fig. 3, Table 5). We therefore planned to explore the associations between crime types using a novel approach based on market basket analysis. To conduct this analysis on the annual crime dataset, we split the dataset by crime type and grouped it by cases recorded to create our baskets [32]. We used different Support and Lift values to analyze these baskets and mine relevant Association Rules, as displayed in Table 6.

Table 6. Association rules for crime types

Based on the lift values, we can infer that suspects committing Murder and Riot are most likely to commit Kidnapping. A set of interesting rules that are relevant and will help the Gujarat State Police Department is displayed in Table 6. Using these rules, the department can narrow down its search to relevant suspects. In addition, understanding the associations between crime types can help the police department to focus on resolving those crime types simultaneously. For example, if the police department adopts a strategy to resolve the crime types of murder, riot and robbery, this can help it to resolve the kidnapping crime type.

Another interesting finding of the association rules analysis is that theft is usually not associated with any other crime type, even though it is the most frequent and most recorded crime type in the state of Gujarat. On the other hand, robbery is associated with most of the crime types. It would be interesting to explore further the difference between robbery and theft and to focus on the cases reported and resolved for both crime types. We plan to extend our analysis along these lines in future.

6 Conclusion

As the population keeps growing and urbanization continues, new types of crime are committed, making cities less safe for the public. A smart and robust approach would help law enforcement agencies to maintain public safety and peace in the cities by preventing crimes from being committed. Data mining techniques have proven effective in analyzing datasets and gathering insights that are useful in many domains. This research paper used a few data mining techniques to predict and identify the patterns of crime types committed in the state of Gujarat. This will help the Gujarat Police Department to identify suspects and prevent them from committing further crimes. The Association Rules indicate which Crime Types are associated based on the cases recorded, and the Multiple Linear Regression model predicts the resolved cases with an accuracy of 85.5%. K-Means Clustering identified the city that is a high crime risk zone relative to the other cities.

In this research we used yearly data (2012–2016) to predict the solving rate and to analyze the patterns of crime. The results of this research will help the Gujarat Police Department to analyze its crime records and provide meaningful insights for decision making to solve the cases recorded. Currently we are working on the monthly crime dataset to develop more robust models that could be implemented with the department. To implement these methodologies successfully, we are studying the As-Is crime solving process of the Gujarat Police Department and subsequently modelling a To-Be process which would integrate the data mining approach into the crime solving process.