Data Mining Using R For Criminal Detection

IJIRST International Journal for Innovative Research in Science & Technology| Volume 3 | Issue 09 | February 2017
ISSN (online): 2349-6010
Data Mining using R For Criminal Detection

Shefali Sharma, UG Student, Department of Information Technology, K. J. Somaiya Institute of Engineering and Information Technology, India
Rujutha Shetty, UG Student, Department of Information Technology, K. J. Somaiya Institute of Engineering and Information Technology, India
Bini Shah, UG Student, Department of Information Technology, K. J. Somaiya Institute of Engineering and Information Technology, India
Seema Yadav, Assistant Professor, Department of Information Technology, K. J. Somaiya Institute of Engineering and Information Technology, India
Abstract
Increasing crime rates all over the world have led to serious concerns. While we are busy finding punishments for criminals, we
tend to forget that prevention is always a better solution. In today's world, social media has become a key to deciphering a
personality. Through our system, we aim to use data extracted from various social media platforms to narrow down the key
factors that might lead to a potential criminal mindset. We use the RStudio integrated development environment to arrive at a
conclusion expressed in terms of the criminal potential of an individual. Personality trait analysis has been a subject of interest
for years, but in the proposed system we aim to use clustering as well as classification algorithms to obtain a more accurate
analysis of personalities, which will be classified on the basis of the Five Factor Model. The five factors are openness,
conscientiousness, extraversion, agreeableness and neuroticism (OCEAN).
Keywords: Data Mining, Data Source, Personality Traits, Criminal Psychology, Classification, Clustering
_______________________________________________________________________________________________________
I. INTRODUCTION
Data mining[2] is the computational process of discovering patterns in large data sets using techniques at the intersection of
machine learning, statistics, database systems and artificial intelligence. The main goal of the data mining process is to extract
information from an existing data set and transform it into a simplified structure for further use.
The idea for our proposed system arose from the increased crime rates in society. We are designing the system to help decrease
the crime rate. As most of the population is actively involved in social media sites, we can get an abstract view of the mindset of
a particular suspect. The system can thus help the police department take a broader approach to solving a case. It is important to
mine peculiar object groups, as this helps in the criminal background check. Personality analysis based on this mining also plays
a critical role, as it has considerable practical potential.
Mining peculiarity groups and defining a measure of the degree of peculiarity is a major concern. When correlating these with
personality trait analysis, the peculiarity groups can be defined in terms of the Five Factor Model[1], a standard in psychology
for describing the attributes of an individual. Its five factors are Extraversion, Neuroticism, Agreeableness, Conscientiousness
and Openness to New Experiences.
The use of a certain set of keywords, or repeatedly choosing to visit a particular site, indicates an inclination towards the
corresponding personality trait. We are designing the system so that data extracted from various social media platforms, which
are rich in semantic information and public-domain knowledge, gives us an insight into the inclination of the individual towards
a particular area of information. This helps us show the direct and indirect association with a particular type of trait. It can be
done by classifying the keywords into clusters that denote the traits. Further analysis of the proportions of these personality
traits will indicate how likely a person is to be a criminal.
Existing systems provide a generalized view of the personality of a person, but our proposed system moves a step ahead by
detecting criminal behavior in that individual using their social media data. The proposed system also uses RStudio, which to
our knowledge has not previously been used for personality trait analysis. This tool provides results that are easy to interpret,
unlike many more complex tools.
All rights reserved by www.ijirst.org 91

(IJIRST/ Volume 3 / Issue 09/ 017)
II. EXISTING SYSTEMS
Twitter Data Set Personality

User-generated content on Twitter, i.e. tweets, also provides a source of information for identifying users' personality traits. It
was noted that only a few users posted links on Twitter, and these posts formed the content of the dataset. This dataset has been
used for automatically predicting the personalities of users, as well as for user behavior analysis. It was observed that extroverts
and emotionally stable people tend to be both popular and influential on Twitter[8]. It was also observed that popular users are
creative, while influential people on Twitter are more organized.
Using Clustering to extract Personality Information from Socio Economic Data
This work presents a method to extract behavior related to different groups by using simple clustering methods that can
potentially reveal various aspects of personality[6]. The extraction of clusters such as selfish and non-selfish behavior is based
on the judgments of 52 students with a psychology background, who rated the 9 expenditure attributes of the dataset. The results
demonstrated that it is possible to extract information about the personality of individuals from similar datasets using even
simple data mining techniques.
Improving User Profile with Personality Traits Predicted from Social Media Content
Here, personality was shown to be related to personal preferences in music and GUIs. The accuracy of recommendation systems
can be improved by taking user personality traits into consideration. Results of these surveys indicated that personality could
affect the desire for online purchasing activity. Improving the user profile with personality traits could enhance the performance
of the recommendation system[7]. Personalized recommendation systems based on social media usage behavior and content are
among the leading topics in both academia and business. However, the personality traits of users have mainly been measured
with psychological questionnaires, which makes it difficult to obtain the personality traits of a large number of users in
real-world personalized systems. The system therefore provides an efficient, approximate personality analysis of an individual.
III. COMPARISON
Comparing the related work with our proposed system yields the following observations:
Table 1: Comparison (Proposed System and Existing System)
Parameter: Personality Detection
  Proposed System: The basic personalities present in an individual, together with an analysis of criminal psychology, will approximate the chances of a person being a criminal.
  Existing System: Personality traits of a person have been identified at a very basic level, without further analysis.
Parameter: Accuracy
  Proposed System: Both classification and clustering algorithms are used, which increases the accuracy.
  Existing System: Classification and clustering are not applied together in the existing systems.
Parameter: Platform
  Proposed System: The most commonly used messaging application, i.e. Whatsapp, is used.
  Existing System: Facebook as well as Twitter data have been used for analysis.
Parameter: Technology Used
  Proposed System: The R programming language will be used because of its special features for statistical computing and analysis.
  Existing System: Implementation is done using the most commonly used programming languages.
IV. PROPOSED SYSTEM
The proposed design is divided into five modules:

Data Source
The data sources that we mainly focus on are Whatsapp chats and Twitter tweets. Whatsapp chats will be extracted using the
'email chat' option. Twitter data will be extracted by the direct authentication method using a Twitter application. The data
obtained from these sources will be in unstructured form.
Data Extraction
The data will be extracted from the data sources using R, a programming language and environment for statistical computing
and graphics. RStudio is a free and open-source IDE for R. Data has been extracted from Whatsapp and Twitter using the
corresponding commands. Whatsapp data was extracted via email. Similarly, Twitter data was extracted by creating a Twitter
application and generating the consumer key, access secret key and so on.

R and MySQL connection

Data can be retrieved from and stored in a MySQL database from R by using the RMySQL package. The package simply needs
to be installed and then loaded into the R session with library(). We can also load tweets directly into RStudio by installing the
twitteR package. The tweets can then be brought into a cleaned format.
Classification and Clustering Algorithms
Classification and clustering algorithms will be applied to the data to identify the personality traits of a person based on various
criteria, with the Five Factor Model of Personality as the base. The Big Five traits are Openness, Conscientiousness,
Extraversion, Agreeableness and Neuroticism. The Naive Bayes classifier will be used to classify the personality of an
individual. R packages provide ready-made functions for this.
User Interface
A user interface may be used to display the results as statistical diagrams such as graphs and pie charts. From the analysis, we
will display the chances of a person being a criminal. The details of the graph can be entered in RStudio to obtain the histogram.
This will help us show the association with a particular type of trait in terms of percentage and other parameters.
Fig. 1: Architecture of the proposed system
V. EXPERIMENTAL RESULTS
Whatsapp Data Extraction

Our main aim is to classify personalities under the labels of the various traits and to obtain an average word count over the
interactions an individual has with the other party involved. The current experimental analysis has been carried out on data
extracted from Twitter and Whatsapp. The following procedure was followed to obtain data from Whatsapp for further
processing.
1) The chats of an individual were opened and the data was mailed using the Email chat option.
2) A text file was obtained after downloading the emailed chats. The text file contained the date, time and contents of the
chat, separated by a comma (,) and a colon (:) respectively.
Fig. 2: Whatsapp Text File
3) The text file was converted into an Excel file using various commands and by updating the settings in MS Excel.
4) The data was loaded into RStudio for statistical analysis. Using commands in RStudio, data cleaning was performed and
the data was obtained in tabular format with the fields date, time, sender and text.
Fig. 3: csv file having appropriate headers
5) The ggplot2 package was installed to display a histogram of the statistics. Here we considered the chat count, and the
final result is a graph giving the number of chats on different days.
Fig. 4: Graph depicting the individual number of chats of different persons on different days
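The cleaning and counting steps above can be sketched in base R as follows. This is a minimal illustration, not the commands used in the experiment: the sample lines, the assumed line layout (date, time - sender: text) and the names chat_lines, chats and chat_counts are all hypothetical, and a real export may use a different date or time format.

```r
# Hypothetical sample lines; the exact layout of a real export is an assumption.
chat_lines <- c(
  "12/01/2017, 09:15 - Alice: Good morning",
  "12/01/2017, 09:17 - Bob: Morning: are we meeting today?",
  "13/01/2017, 18:02 - Alice: Running late"
)

# Capture groups: date, time, sender, text
pattern <- "^([0-9/]+), ([0-9:]+) - ([^:]+): (.*)$"
parts <- regmatches(chat_lines, regexec(pattern, chat_lines))

# Keep only lines that matched (full match plus four groups);
# continuation lines of multi-line messages would be dropped here.
parts <- parts[lengths(parts) == 5]

# Tabular form with the fields date, time, sender and text
chats <- data.frame(
  date   = as.Date(vapply(parts, `[`, character(1), 2), format = "%d/%m/%Y"),
  time   = vapply(parts, `[`, character(1), 3),
  sender = vapply(parts, `[`, character(1), 4),
  text   = vapply(parts, `[`, character(1), 5),
  stringsAsFactors = FALSE
)

# Number of messages per day and sender, the kind of count shown in Fig. 4
chat_counts <- table(chats$date, chats$sender)
print(chat_counts)
```

The resulting table can be passed to a plotting function such as barplot() or converted to a data frame for ggplot2.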

Twitter Extraction
Twitter extraction can be done using two methods: borrowed authentication and direct authentication. We have used direct
authentication as it provides efficient access, i.e. it is possible to extract just the tweets directly from Twitter.
Steps for extracting tweets from Twitter:
1) The Twitter extraction process needs a Twitter application and hence a Twitter account. Use your Twitter login ID and
password to sign in at Twitter Developers.
Fig. 5: Username created on Twitter Developers
2) The Twitter application then generates the consumer key, consumer secret, access token and access token secret needed to
authenticate the particular account.
Fig. 6: Secret Keys generated
3) The twitteR package was installed in RStudio. After direct authorization, we received the extracted data in an
unstructured format, and the tweets were then written to a .csv file using R.
Fig. 7: Raw tweets of a particular user obtained
4) The csv file was imported into a data frame for analysis by configuring various details. Further data cleaning was done
using RStudio, and a file with appropriate headers was obtained in tabular format for further processing.

Fig. 8: Tweets obtained in a clean tabular format with important headers
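The data cleaning in step 4 is not spelled out in the paper. One possible form of it, sketched in base R, is shown here. The sample tweets, the function name clean_tweet and the particular cleaning rules (removing URLs, retweet markers, mentions and punctuation) are assumptions made for illustration.

```r
# Hypothetical raw tweets; the URL and user names are placeholders.
raw_tweets <- c(
  "RT @someone: Check this out https://example.com/abc #news",
  "Feeling GREAT today!!!   :)",
  "@friend see you at 5pm"
)

clean_tweet <- function(x) {
  x <- gsub("https?://\\S+", " ", x, perl = TRUE)  # drop URLs
  x <- gsub("^RT\\s+", " ", x, perl = TRUE)        # drop a leading retweet marker
  x <- gsub("@\\w+:?", " ", x, perl = TRUE)        # drop user mentions
  x <- gsub("[^A-Za-z0-9 ]", " ", x)               # drop punctuation and symbols
  x <- tolower(x)
  x <- gsub("\\s+", " ", x, perl = TRUE)           # collapse repeated whitespace
  trimws(x)
}

# Tabular form with one cleaned tweet per row
tweets <- data.frame(text = clean_tweet(raw_tweets), stringsAsFactors = FALSE)
print(tweets)
```

Keeping the rules in one function makes it easy to apply the same cleaning to the Whatsapp text column.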
VI. ALGORITHMS
The Naive Bayes classifier belongs to a family of supervised learning algorithms based on applying Bayes' theorem with the
naive assumption of conditional independence between every pair of features given the class.
Bayes' theorem states: P(H|X) = P(X|H)P(H)/P(X)
where P(H|X) is the posterior probability of H conditioned on X,
P(X|H) is the likelihood of X given H,
P(H) is the prior probability of H, and
P(X) is the prior probability of X.
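As a worked illustration of the formula, the short R snippet below computes a posterior from a prior and two likelihoods. All numbers are hypothetical and chosen only to make the arithmetic easy to follow. H stands for "the user shows a given trait" and X for "a given keyword appears in the user's messages".

```r
# Hypothetical probabilities, for illustration only.
p_h             <- 0.2  # P(H): prior probability of the trait
p_x_given_h     <- 0.6  # P(X|H): keyword appears given the trait
p_x_given_not_h <- 0.1  # P(X|not H): keyword appears without the trait

# P(X) from the law of total probability
p_x <- p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

# Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
p_h_given_x <- p_x_given_h * p_h / p_x
print(p_h_given_x)
```

Here P(X) = 0.12 + 0.08 = 0.20, so the posterior is 0.12 / 0.20 = 0.6: observing the keyword raises the probability of the trait from 0.2 to 0.6.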
Naive Bayes classification with e1071 package
The e1071 package contains a function named naiveBayes() which performs Naive Bayes classification. The function accepts
categorical data and a contingency table as input. It returns an object of class naiveBayes, which can be passed to predict() to
predict the outcomes of unlabeled subjects.
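A minimal sketch of this workflow is given below, assuming the e1071 package is installed. The training data is invented for illustration: the two yes/no keyword features, the trait labels and all variable names are hypothetical and do not come from the experiment.

```r
library(e1071)

# Hypothetical training data: two yes/no keyword features per user and a
# hand-assigned trait label. All values are invented for illustration.
lv <- c("no", "yes")
train <- data.frame(
  aggressive_words = factor(c("yes", "yes", "yes", "no", "no", "no", "yes", "no"),
                            levels = lv),
  posts_often      = factor(c("yes", "no", "yes", "no", "yes", "no", "yes", "yes"),
                            levels = lv),
  trait            = factor(c("neuroticism", "neuroticism", "neuroticism",
                              "agreeableness", "agreeableness", "agreeableness",
                              "neuroticism", "agreeableness"))
)

# Fit the classifier; laplace = 1 adds one pseudo-count per category so that
# no conditional probability is estimated as exactly zero.
model <- naiveBayes(trait ~ ., data = train, laplace = 1)

# New, unlabeled user. Factor levels must match those used in training.
new_user <- data.frame(
  aggressive_words = factor("yes", levels = lv),
  posts_often      = factor("yes", levels = lv)
)

predicted_trait <- predict(model, new_user)                # most probable class
class_probs     <- predict(model, new_user, type = "raw")  # posterior per class
print(predicted_trait)
print(class_probs)
```

The features are declared as factors with explicit levels because predict() looks up categorical values by their level index, so the new data must use the same levels as the training data.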
Naive Bayes classification with caret package
The caret package contains the train() function, which sets up a grid of tuning parameters for a number of classification and
regression routines, fits each model and calculates a resampling-based performance measure.
Clustering:
R has a wide variety of functions for cluster analysis. K-means clustering is an unsupervised learning algorithm that tries to
cluster the data based on their similarity. Unsupervised learning means that there is no outcome to be predicted and the
algorithm just tries to find patterns in the data. In K-means clustering, we specify the number of clusters we want the data to be
grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then, the
algorithm iterates through two steps:
1) Reassign each data point to the cluster whose centroid is nearest.
2) Calculate the new centroid of each cluster.
These two steps are repeated until the within-cluster variation cannot be reduced further. The within-cluster variation is
calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids.
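The procedure can be tried with the kmeans() function from base R (package stats). The sketch below uses synthetic two-dimensional scores in place of real trait data. The data, the column names and the choice of two clusters are assumptions made for illustration.

```r
set.seed(42)

# Hypothetical 2-D scores: two well-separated groups of 20 users each
group_a <- matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2)
group_b <- matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
scores  <- rbind(group_a, group_b)
colnames(scores) <- c("trait_1", "trait_2")

# Ask for 2 clusters; nstart = 10 repeats the random initialisation ten times
# and keeps the solution with the lowest within-cluster sum of squares.
fit <- kmeans(scores, centers = 2, nstart = 10)

print(fit$centers)       # centroid of each cluster
print(fit$cluster)       # cluster label assigned to each observation
print(fit$tot.withinss)  # total within-cluster sum of squared distances
```

Because the starting assignment is random, the numeric labels of the clusters can differ between runs; only the grouping itself is meaningful.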
VII. ADVANTAGES
Many methods have already been applied to the analysis of personalities. However, the accuracy of these methods has been
limited. In our system, we use clustering as well as classification algorithms to deduce the conclusions of the analysis. First, the
clustering algorithm groups the personality traits based on the five factors and their corresponding keywords. The clustered data
then uses the class labels to depict criminal habits with the help of the 'Neuroticism' keyword. We use the R programming
language, which is one of the most comprehensive statistical analysis packages available. It incorporates most statistical tests,
models and analyses, and provides a comprehensive language for managing and manipulating data. R works well with many
other tools, importing data, for example, from CSV files or directly from Microsoft Excel, Microsoft Access, MySQL and
SQLite. It also produces graphics output in PDF, JPG, PNG and SVG formats.

VIII. APPLICATIONS
Criminal Background Check

Whenever a criminal activity occurs and the police department decides to investigate the case, it does a background check to
determine the mindset of the suspect by gathering information from certain people. Our system can instead be used to identify
the mindset of the person concerned by accessing his or her social media accounts and determining the personality based on the
five factors, which in turn indicates whether the person has a criminal mindset. This would play a major role in solving the case.
Medical Data Analysis
The medical field[4] produces very large quantities of electronic data that are becoming more and more complicated. The
generated medical data has certain characteristics that make its analysis highly challenging and rewarding. Using Naive Bayes
to classify medical data allows mining from different perspectives, including the attributes of medical data, the requirements of
designs and systems dealing with this data, and the different methodologies used for medical data mining.
IX. CONCLUSION
In this paper we have presented a way to analyze information obtained by extracting real-time data. The data will be used to
analyze the criminal psychology of an individual based on the Five Factor Model of Personality[2], i.e. agreeableness,
openness, extraversion, neuroticism and conscientiousness.
Our findings demonstrate that it is possible to determine the personality of individuals by using clustering as well as
classification algorithms. This paper presents a way to use personality traits obtained from social media to determine criminal
behavior among users. The scope of our proposed system extends to the field of cryptography: encrypted data from social
media can be extracted and further analyzed from various aspects by ethical hackers.
ACKNOWLEDGMENT
Before presenting our project work entitled Analysis of Personality Traits using Data Mining on Real Time Data, we would
like to convey our sincere thanks to the many people who guided us throughout the course of this project work. First, we would
like to express our sincere thanks to our Principal, Dr. Suresh Ukarande, for providing various facilities to carry out this work.
We would like to express our sincere thanks to our Project Guide, Prof. Seema Yadav, for her guidance, encouragement,
cooperation and suggestions given to us at successive stages of our implementation phase. Finally, we would like to thank our
H.O.D., Prof. Uday Rote, and all the teaching and non-teaching staff of the college and our friends for their moral support
during the course of the project work and for their direct and indirect involvement in the completion of our project, which made
our endeavour fruitful.
REFERENCES
[1] Menasha Thilakaratne, Ruvan Weerasinghe, Sujan Perera, "Knowledge-driven Approach to Predict Personality Traits by Leveraging Social Media Data",
2016, pp. 288-295, DOI: 10.1109/WI.2016.47. Publisher: IEEE.
[2] Yun Xiong, Yangyong Zhu, "Mining Peculiarity Groups in Day-by-Day Behavioral Datasets", 2009 Ninth IEEE International Conference on Data Mining,
2009, pp. 578-587. Publisher: IEEE.
[3] Ana C.E.S. Lima, Leandro N. de Castro, "Multi-Label Semi-Supervised Classification Applied to Personality Prediction in Tweets", 2013 BRICS Congress
on Computational Intelligence & 11th Brazilian Congress on Computational Intelligence, 2013, pp. 195-203. Publisher: IEEE.
[4] K.M. Al-Aidaroos, A.A. Bakar, Z. Othman, "Medical Data Classification with Naive Bayes Approach", Information Technology Journal 11(9),
pp. 1166-1174, 2012, ISSN: 1812-5638, DOI: 10.3923/itj.2012.1166.1174.
[5] Xi Chen, Hemant Ishwaran, "Random Forests for Genomic Data Analysis", US National Library of Medicine, National Institutes of Health, PubMed,
PMID: 22546560, PMCID: PMC3387489, DOI: 10.1016/j.ygeno.2012.04.003.
[6] Alexandros Ladas, Uwe Aickelin, Jon Garibaldi, Eamonn Ferguson, "Using Clustering to extract Personality Information from socio economic data",
UKCI 2012, the 12th Annual Workshop on Computational Intelligence, Heriot-Watt University, 2012. Submitted on 8 Jul 2013.
[7] Rui Gao, Bibo Hao, Shuotian Bai, Lin Li, Ang Li, Tingshao Zhu, "Improving User Profile with Personality Traits Predicted from Social Media
Content", Proceedings of the 7th ACM Conference on Recommender Systems, 10/12/2013, pp. 355-358. Publisher: Association for Computing Machinery.
[8] Golnoosh Farnadi, Geetha Sitaraman, Shanu Sushmita, Fabio Celli, Michal Kosinski, David Stillwell, Sergio Davalos, Marie-Francine Moens,
Martine De Cock, "Computational personality recognition in social media", User Modeling and User-Adapted Interaction, June 2016, Vol. 26,
Issue 2-3, p. 109, 34 pp. Publisher: Springer.
[9] W.A. Awad, S.M. ELseuofi, "Machine Learning Methods For Spam E-Mail Classification", International Journal of Computer Science &
Information Technology (IJCSIT), Vol. 3, No. 1, Feb 2011, p. 173, DOI: 10.5121/ijcsit.2011.3112.
[10] Ping Sun, Irena Begaj, Iris Fermin, Jim McManus, "Creating Health Typologies with Random Forest Clustering", The 2010 International Joint
Conference on Neural Networks, IEEE Conference Publications, 2010, pp. 1-7, DOI: 10.1109/IJCNN.2010.55965
