Data Mining Using R For Criminal Detection
Abstract
Rising crime rates all over the world have led to serious concerns. While we are busy devising punishments for
criminals, we tend to forget that prevention is the better solution. In today's world, social media has become a
key to deciphering personality. Through our system, we aim to use data extracted from various social media platforms
to narrow down the key factors that might indicate a potential criminal mindset. We use an integrated
development environment, RStudio, to arrive at a conclusion expressed in terms of an individual's criminal potential.
Personality trait analysis has been a subject of interest for years, but in the proposed system we aim to use clustering as well as
classification algorithms to obtain a more accurate analysis of personalities, which are classified on the basis of the Five
Factor Model. The five factors are openness, conscientiousness, extraversion, agreeableness and neuroticism
(OCEAN).
Keywords: Data Mining, Data Source, Personality Traits, Criminal Psychology, Classification, Clustering
_______________________________________________________________________________________________________
I. INTRODUCTION
Data mining[2] is the computational process of discovering patterns in large data sets using techniques at
the intersection of machine learning, statistics, database systems and artificial intelligence. The main goal of the data mining
process is to extract information from an existing data set and transform it into a simplified structure for further use.
The idea for our proposed system arose from the increased crime rate in society. We are designing the system
so that it can help to reduce the crime rate. As most of the population is actively involved in social media sites, we
can get an abstract view of the mindset of a particular suspect. Such a system could therefore help the police department
take a broader approach to solving a case. It is important to mine the peculiarity groups, as this helps in
the criminal background check. Personality analysis based on this mining also plays a critical role, as it has considerable
practical potential.
Mining peculiarity groups and defining a measure of the degree of peculiarity are major concerns. To relate
these to personality trait analysis, the peculiarity groups can be defined in terms of the Five Factor Model[1], a
standard model in psychology for describing the attributes of an individual. Its five factors are Extraversion, Neuroticism,
Agreeableness, Conscientiousness and Openness to New Experiences.
The use of a certain set of keywords, or repeated visits to a particular site, indicates an inclination
towards the corresponding personality trait. Social media platforms are rich in semantic information and in
knowledge related to public domains. We are designing the system so that the data extracted from these platforms gives
an insight into the inclination of the individual towards a particular area of information. This helps to show the direct and
indirect association with a particular type of trait, which is done by classifying the keywords into clusters that
denote the traits. Further analysis of the proportions of these personality traits indicates whether or not a person is
potentially a criminal.
The existing systems provide a generalized view of the personality traits present in a person, but our proposed system moves
a step ahead by detecting criminal behavior in that particular individual using their social media data. The proposed
system also makes use of RStudio, which, to our knowledge, has not previously been used for personality trait analysis. This tool
provides results that are easy to interpret, unlike many more complex tools.
III. COMPARISON
Comparing the related work with our proposed system, we make the following observations:

Table 1: Comparison (Proposed System and Existing System)

| Parameters | Proposed System | Existing System |
|---|---|---|
| Personality Detection | Basic personalities present in an individual, along with analysis of criminal psychology, will approximate the chances of a person being a criminal. | Personality traits of a person have been implemented at a very basic level without further analysis. |
| Accuracy | Both Classification and Clustering algorithms are used, which increases the accuracy. | Classification and Clustering are not applied together in the existing systems. |
| Platform | The most commonly used messaging application, i.e. WhatsApp, is used. | Facebook as well as Twitter data have been used for analysis. |
| Technology Used | The R programming language will be used due to its special feature of statistical computing along with analysis. | Implementation is done using the most commonly used programming language. |
V. EXPERIMENTAL RESULTS
help of data extracted from Twitter and WhatsApp. The following procedure was followed to obtain data from WhatsApp
for further extraction.
1) The chats of an individual were opened and the data was mailed using the Email chat option.
2) A text file was obtained after downloading the emailed chats. The text file contained the date, time and contents of each
chat, separated by a comma (,) and a colon (:) respectively.
3) The text file was converted into an Excel file using the import commands and settings in MS Excel.
4) The data was loaded into RStudio for statistical analysis. Data cleaning was performed with R commands,
giving the data in tabular format with the columns date, time, sender and text.
5) The ggplot2 package was installed to display a histogram of the statistics. Here we considered the chat
count, and the final result is a graph showing the number of chats on different days.
Fig. 4: Graph depicting the individual number of chats of different persons on different days
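The cleaning and counting in steps 4 and 5 can be sketched in base R as follows. This is only an illustration: the sample lines, the assumed export layout "date, time - sender: text", and the names chat_lines, chats and chat_counts are hypothetical and are not taken from our data.

```r
# Hypothetical sample of an exported chat; the layout
# "date, time - sender: text" is an assumption about the export format.
chat_lines <- c(
  "12/03/2017, 21:45 - Alice: Hello there",
  "12/03/2017, 21:47 - Bob: Hi, how are you?",
  "13/03/2017, 09:02 - Alice: Meeting at 10:30 today",
  "13/03/2017, 09:05 - Bob: Ok",
  "13/03/2017, 09:06 - Alice: See you"
)

# Capture date, time, sender and text; the sender ends at the first ": ".
pattern <- "^([^,]+), ([0-9]{1,2}:[0-9]{2}) - ([^:]+): (.*)$"
matched <- regmatches(chat_lines, regexec(pattern, chat_lines))

# Keep only lines that matched the layout (full match plus four groups).
matched <- matched[lengths(matched) == 5]

# Build the tabular form with the columns date, time, sender and text.
chats <- data.frame(
  date   = as.Date(vapply(matched, `[`, character(1), 2), format = "%d/%m/%Y"),
  time   = vapply(matched, `[`, character(1), 3),
  sender = vapply(matched, `[`, character(1), 4),
  text   = vapply(matched, `[`, character(1), 5),
  stringsAsFactors = FALSE
)

# Number of chats per sender per day, i.e. the quantity shown in the graph.
chat_counts <- as.data.frame(
  table(date = chats$date, sender = chats$sender),
  stringsAsFactors = FALSE
)
print(chat_counts)
```

With ggplot2 installed, a call such as ggplot(chat_counts, aes(date, Freq, fill = sender)) + geom_col(position = "dodge") is one way to draw these counts as grouped bars.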
Twitter Extraction
Twitter extraction can be done using two methods: borrowed authentication and direct authentication. We used
direct authentication, as it provides efficient access, i.e. the tweets can be extracted directly from Twitter.
Steps for extracting tweets from Twitter:
1) The Twitter extraction process requires a Twitter application and hence a Twitter account. Sign in at Twitter Developers
with your Twitter login ID and password.
2) The Twitter application then generates the consumer key, consumer secret, access token and access token
secret needed to authenticate the particular account.
3) The Twitter package twitteR was installed in RStudio. After direct authorization, we received the extracted data in an
unstructured format, and the tweets were then written to a .csv file using R.
4) The .csv file was imported into a data frame for analysis by configuring the import details. Further data cleaning was done in RStudio,
yielding a tabular file with appropriate headers for further processing.
VI. ALGORITHMS
Naive Bayes classifiers are a family of supervised learning algorithms based on applying
Bayes' theorem with the "naive" assumption that every pair of features is independent given the class.
Bayes' theorem states the following: P(H|X) = P(X|H)P(H)/P(X)
where P(H|X) is the posterior probability of H conditioned on X,
P(X|H) is the likelihood of X conditioned on H,
P(H) is the prior probability of H, and
P(X) is the prior probability of X.
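As a small worked illustration of the theorem, the following R snippet computes a posterior from made-up numbers. Here H stands for "the message belongs to a given trait class" and X for "the message contains a given keyword"; all three input probabilities are hypothetical values chosen only for the arithmetic.

```r
# Hypothetical inputs, chosen only to illustrate the formula.
p_h             <- 0.2  # P(H): prior probability of the class
p_x_given_h     <- 0.5  # P(X|H): keyword appears given the class
p_x_given_not_h <- 0.1  # P(X|not H): keyword appears otherwise

# P(X) by the law of total probability.
p_x <- p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

# Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X).
posterior <- p_x_given_h * p_h / p_x
print(posterior)
```

With these numbers P(X) = 0.18, so observing the keyword raises the probability of the class from the prior 0.2 to 0.1 / 0.18, roughly 0.56.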
Naive Bayes classification with e1071 package
The e1071 package provides a function named naiveBayes() which performs Bayes classification. The function accepts
categorical data and a contingency table as input data. It returns an object of class naiveBayes. The object can be
passed to predict() to predict the outcomes of unlabeled observations.
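A minimal sketch of this workflow is given below, assuming the e1071 package is installed. The toy data frame, its column names and the "high"/"low" trait labels are hypothetical and serve only to show the calls to naiveBayes() and predict().

```r
library(e1071)  # assumed installed, e.g. via install.packages("e1071")

# Hypothetical labelled observations: two categorical features and a class.
yes_no <- c("no", "yes")
train <- data.frame(
  uses_negative_words = factor(c("yes", "yes", "no", "no", "yes", "no"), levels = yes_no),
  posts_at_night      = factor(c("yes", "no", "no", "yes", "yes", "no"), levels = yes_no),
  trait_level         = factor(c("high", "high", "low", "low", "high", "low"))
)

# Fit the classifier; laplace = 1 applies add-one smoothing to the counts.
model <- naiveBayes(trait_level ~ ., data = train, laplace = 1)

# An unlabeled observation with the same factor levels as the training data.
new_obs <- data.frame(
  uses_negative_words = factor("yes", levels = yes_no),
  posts_at_night      = factor("no",  levels = yes_no)
)

predicted_class <- predict(model, new_obs)                # most likely class
posterior       <- predict(model, new_obs, type = "raw")  # class probabilities
print(predicted_class)
print(posterior)
```

The type = "raw" form is useful when a graded score is wanted rather than a hard class label.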
Naive Bayes classification with caret package
The caret package provides the train() function, which sets up a grid of tuning parameters for a number of classification and
regression routines, fits each model and calculates a resampling-based performance measure.
Clustering:
R has a wide variety of functions for cluster analysis. K-means clustering is an unsupervised learning algorithm that tries to
cluster the data based on their similarity. Unsupervised learning means that there is no outcome to be predicted and the algorithm
just tries to find patterns in the data. In K-means clustering, we specify the number of clusters we want the data to be grouped
into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then, the
algorithm iterates through two steps:
1) Reassign each data point to the cluster whose centroid is nearest.
2) Calculate the new centroid of each cluster.
These two steps are repeated until the within-cluster variation cannot be reduced further. The within-cluster variation is calculated as the
sum of the squared Euclidean distances between the data points and their respective cluster centroids.
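The procedure can be tried with the kmeans() function from base R's stats package, as in the sketch below. The data are simulated and the column names trait_a and trait_b are hypothetical. Note that kmeans() defaults to the Hartigan-Wong variant, so algorithm = "Lloyd" is passed here to select the two-step iteration described in the text.

```r
set.seed(42)  # make the simulated data and random starts reproducible

# Hypothetical two-dimensional scores forming two well-separated groups.
scores <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)
colnames(scores) <- c("trait_a", "trait_b")

# Ask for two clusters; nstart = 10 keeps the best of ten random starts.
fit <- kmeans(scores, centers = 2, nstart = 10, algorithm = "Lloyd")

print(fit$centers)        # centroid of each cluster
print(table(fit$cluster)) # number of observations per cluster

# Total within-cluster variation: the sum of squared Euclidean distances
# between each point and the centroid of the cluster it was assigned to.
manual_wss <- sum((scores - fit$centers[fit$cluster, ])^2)
print(c(manual = manual_wss, reported = fit$tot.withinss))
```

The manually computed value agrees with fit$tot.withinss, which is the quantity the iteration reduces.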
VII. ADVANTAGES
Many methods have already been applied to the analysis of personalities. However, the accuracy of each of these
models has been limited. In our system, we use clustering as well as classification
algorithms to draw the conclusions of this analysis. Initially, the clustering algorithm provides a clustered
version of the personality traits based on the five factors and their corresponding keywords. The clustered data then makes use of
the class labels property to depict criminal habits with the help of the 'Neuroticism' keyword. We make use of the R
programming language, which is one of the most comprehensive statistical analysis packages available. It incorporates most of the
standard statistical tests, models and analyses, and provides a comprehensive language for managing and manipulating data. R works
well with many other tools, importing data, for example, from CSV or directly from Microsoft Excel, Microsoft Access,
MySQL, and SQLite. It also produces graphics output in PDF, JPG, PNG, and SVG formats.
VIII. APPLICATIONS
IX. CONCLUSION
In this paper we have presented a way to analyze information obtained by extracting real-time data. The data will
be used to analyze the criminal psychology of an individual based on the Five Factor Model of Personality[2], i.e. agreeableness,
openness, extraversion, neuroticism and conscientiousness.
Our findings demonstrate that it is possible to determine the personality of individuals by using clustering as well as
classification algorithms. This paper presents a way to make use of personality traits obtained from social media in order to
determine criminal behavior amongst users. The scope of our proposed system could extend to the field of cryptography:
encrypted data from social media could be extracted and further analyzed from various aspects by ethical hackers.
ACKNOWLEDGMENT
Before presenting our project work entitled Analysis of Personality Traits using Data Mining on Real Time Data, we would
like to convey our sincere thanks to the many people who guided us throughout this project work. First, we would like
to express our sincere thanks to our beloved Principal Dr. Suresh Ukarande for providing the facilities to carry out this
work. We would like to express our sincere thanks to our Project Guide Prof. Seema Yadav for her guidance, encouragement,
cooperation and suggestions at each stage of our implementation phase. Finally, we would like to thank our
H.O.D. Prof. Uday Rote, all teaching and non-teaching staff of the college, and our friends for their moral support during
the course of the project work and for their direct and indirect involvement in its completion, which made our
endeavour fruitful.
REFERENCES
[1] Menasha Thilakaratne; Ruvan Weerasinghe; Sujan Perera, "Knowledge-driven Approach to Predict Personality Traits by Leveraging Social Media Data", 2016, p. 288-295, DOI: 10.1109/WI.2016.47. Publisher: IEEE.
[2] Yun Xiong; Yangyong Zhu, "Mining Peculiarity Groups in Day-by-Day Behavioral Datasets", 2009 Ninth IEEE International Conference on Data Mining, 2009, p. 578-587. Publisher: IEEE.
[3] Lima, Ana C.E.S.; de Castro, Leandro N., "Multi-Label Semi-Supervised Classification Applied to Personality Prediction in Tweets", 2013 BRICS Congress on Computational Intelligence & 11th Brazilian Congress on Computational Intelligence, 2013, p. 195-203. Publisher: IEEE.
[4] K.M. Al-Aidaroos; A.A. Bakar; Z. Othman, "Medical Data Classification with Naive Bayes Approach", Information Technology Journal 11(9), 1166-1174, 2012, ISSN: 1812-5638, DOI: 10.3923/itj.2012.1166.1174.
[5] Xi Chen; Hemant Ishwaran, "Random Forests for Genomic Data Analysis", US National Library of Medicine, National Institutes of Health, PubMed, PMID: 22546560, PMCID: PMC3387489, DOI: 10.1016/j.ygeno.2012.04.003.
[6] Alexandros Ladas; Uwe Aickelin; Jon Garibaldi; Eamonn Ferguson, "Using Clustering to extract Personality Information from socio economic data", UKCI 2012, the 12th Annual Workshop on Computational Intelligence, Heriot-Watt University, 2012. Submitted on 8 Jul 2013.
[7] Gao, Rui; Hao, Bibo; Bai, Shuotian; Li, Lin; Li, Ang; Zhu, Tingshao, "Improving User Profile with Personality Traits Predicted from Social Media Content", Proceedings of the 7th ACM Conference on Recommender Systems, 10/12/2013, p. 355-358. Publisher: Association for Computing Machinery.
[8] Farnadi, Golnoosh; Sitaraman, Geetha; Sushmita, Shanu; Celli, Fabio; Kosinski, Michal; Stillwell, David; Davalos, Sergio; Moens, Marie-Francine; Cock, Martine, "Computational personality recognition in social media", User Modeling and User-Adapted Interaction, June 2016, Vol. 26, Issue 2-3, p. 109, 34 p. Publisher: Springer.
[9] W.A. Awad; S.M. ELseuofi, "Machine Learning Methods For Spam E-Mail Classification", International Journal of Computer Science & Information Technology (IJCSIT), Vol. 3, No. 1, Feb 2011, DOI: 10.5121/ijcsit.2011.3112 173
[10] Ping Sun; Irena Begaj; Iris Fermin; Jim McManus, "Creating Health Typologies with Random Forest Clustering", The 2010 International Joint Conference on Neural Networks, IEEE Conference Publications, 2010, p. 1-7, DOI: 10.1109/IJCNN.2010.55965