
CATEGORIZATION OF EMAIL USING

MACHINE LEARNING ON CLOUD


B C SIDDHU SIDDHARTH, MANDARA R BHARADWAJ
R. V. COLLEGE OF ENGINEERING
Mysore Road, Bangalore


Abstract: This project investigates a comparison between two different approaches to classifying
emails based on their categories, and their deployment in the cloud: Naive Bayes and the Hidden
Markov Model (HMM). These two machine learning algorithms are each used to detect whether an
email is important or spam. The Naive Bayes classifier relies on conditional probabilities; it is
fast and works well with small data sets, and it treats individual words as independent features.
The HMM is a generative, probabilistic model that provides a distribution over sequences of
observations. HMMs can handle inputs of variable length and help programs reach the most probable
decision based on both previous decisions and current information. Various combinations of NLP
techniques (stop-word removal, stemming, summarizing) are tried with both algorithms to examine the
differences in accuracy and to find the best method among them.
Index terms: Email Classification, Hidden Markov Model, Naive Bayes, Natural Language Processing, NLTK,
Supervised Learning, Cloud deployment

I. INTRODUCTION

Email is one of the most important means of communication in today's world. Email usage has
increased considerably around the world. In 2015, the number of emails sent and received per day
totaled over 205 billion. This figure was expected to grow at an average annual rate of 3% over the
following four years, reaching over 246 billion by the end of 2019. As of December 2016, spam
messages accounted for 61.66% of email traffic worldwide. Therefore, filtering spam emails has
become a pressing need for email users around the globe. This paper describes methodologies that
can be used to classify emails into different categories such as important and spam. Relevant words
or sentences are considered as features to classify email messages. The difference in nature
between the Naive Bayes classifier and the Hidden Markov Model makes them interesting to compare. A
data set has been collected and preprocessed before evaluating accuracy, precision, recall, and
F-metrics for each algorithm. Stemming, summarizing, and stop-word removal have been used in
various combinations with the algorithms to investigate which algorithm, in which combination,
gives the best result.

II. RELATED WORK

This paper focuses on identifying important emails among spam emails. One major consideration in
the categorization is how to represent the messages. Specifically, one must decide which features
to use and how to apply those features to the categorization. M. Aery et al. [1] gave an approach
based on the premise that patterns can be extracted from a pre-classified email folder and used
effectively for classifying incoming emails. As emails follow a format consisting of headers and a
body, the correlation between different terms is represented in the form of a graph; they chose
graph mining as a viable technique for pattern extraction and classification. R. Islam et al. [2]
proposed a multi-stage classification technique using different popular learning algorithms such as
SVM, Naive Bayes, and boosting, with an analyser that considerably reduces false precision and
increases classification accuracy compared to similar existing techniques. B. Klimt et al. [3]
introduced the Enron corpus as a new dataset for this domain. V. Bhat et al. [4] came up with an
approach that derives a spam filter called Beaks; they classify emails into spam and non-spam, and
their pre-processing technique is designed to identify tag-of-spam words relevant to the data set.
X. Wang et al. [5] reviewed recent approaches to filtering spam email, to categorizing email into a
hierarchy of folders, and to automatically determining the tasks required in response to an email.
According to E. Yitagesul et al. [6], in sender-based detection, email sender information such as
the writing style and the sender user name is used as the major features. The paper by S. Teli [7]
presented a three-phase system for spam detection. In the first phase, the user creates the rules
for classification; rules are nothing but the keywords/phrases that occur in many legitimate or
spam mails. The second phase is the training phase, in which the classifier is trained using spam
and legitimate emails labelled manually by the user; then, with the help of the rules, keywords are
extracted from the classified emails. Once the first and second phases are completed,
classification of emails by the given rules starts: using this knowledge of tokens, the filter
classifies each new incoming email. The probability of the maximum keyword match is calculated, and
the status of a new email is confirmed as spam or important. Two main strategies for detecting spam
email are widely used: sender-based spam detection and content-based spam detection, which
considers only the content of an email. This paper deals with content-based spam detection.

III. TECHNIQUES FOR RETRIEVING RELEVANT INFORMATION

This paper discusses some techniques for eliminating irrelevant data from emails: stemming,
summarizing, and stop-word removal. These techniques can be tried together in eight different
combinations, and all of them have been experimented with. NLTK, one of the dominant platforms for
building Python programs that work with language data, is designed to support research in natural
language processing and closely related areas. It has various text-processing libraries for
classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as
wrappers for industrial-strength NLP libraries, several of which are used in this work. Stemming is
the technique of reducing inflected or derived words to their base form. For grammatical reasons,
documents use different forms of a word, such as meet, meets, and meeting. In many situations, it
is useful for a search for one of these words to return documents that contain another word in the
set. Applying stemming to the strings above, we get "meet" as the base form [8]. Stemming chops off
the ends of words. Algorithms for stemming have been studied in computer science since the 1960s.
The most common and effective algorithm for stemming English is Porter's algorithm; the Porter
Stemmer [9] has been imported from NLTK for stemming. Lemmatization is the process of converting
the words of a sentence to their dictionary form. For instance, given the words amusement, amusing,
and amused, the lemma for each of them would be "amuse". Lemmatization aims to remove inflectional
endings and return the base or dictionary form of a word. This process involves a major linguistic
component, such as morphological analysis through regular relations compiled in finite-state
transducers. Stop words are a set of commonly used words in a language that are filtered out before
or after processing of natural language data, in this case text. The main reason stop words matter
to any program is that, once we remove the words that are very commonly used in a given language,
we can focus on the important words instead. For removing stop words from the emails in the data
set, we looked them up in NLTK's stop-word list, and the result obtained was very accurate.
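The preprocessing steps described above (stop-word removal followed by stemming) can be sketched in
a few lines. The paper uses NLTK's stop-word list and PorterStemmer; for a self-contained sketch,
the toy stop-word list and the simplified suffix-stripping rule below stand in for NLTK's real
components, and the example words are illustrative, not from the paper's data set:

```python
# Simplified preprocessing sketch: stop-word removal, then stemming.
# A toy stop-word list; NLTK's stopwords corpus is far more complete.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "in", "for"}

def simple_stem(word):
    """Chop common inflectional suffixes, keeping a stem of >= 3 letters.
    A crude stand-in for NLTK's PorterStemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lower-case, drop stop words, and stem the remaining tokens."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return [simple_stem(t) for t in tokens]

print(preprocess("The meeting of the amused members"))
# → ['meet', 'amus', 'member']
```

Note that a real Porter stemmer applies several rule phases rather than a single suffix strip; the
sketch only illustrates where the step sits in the pipeline.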
IV. SYSTEM IMPLEMENTATION

1) Naive Bayes Classification

Bayes' theorem [11] provides a way of calculating the posterior probability P(c|x) from P(c), P(x),
and P(x|c):

P(c|x) = P(x|c) P(c) / P(x)

Here, P(c|x) is the posterior probability of the class (c, target) given the predictor (x,
attributes); P(c) is the prior probability of the class; P(x|c) is the likelihood, i.e. the
probability of the predictor given the class; and P(x) is the prior probability of the predictor.

Bayesian classification is used as a probabilistic learning method, and every feature of the
instance being classified is assumed independent of the value of any other feature. The goal is to
build a classifier that will automatically tag new emails with appropriate category labels. The
classifier starts with a list of documents: emails labelled with the appropriate categories. The
first step in creating a classifier is deciding which features of the input are relevant and how to
encode them, so a feature extractor for documents was defined to tell the classifier which aspects
of the data it should pay attention to. Duplicate words were removed from the emails, which made
the checking faster. Then, for every word in word_features, if that word occurred in an email, it
was tagged with the category (important or spam) of that email. Thus, words were found that were
labelled as 'important' and as 'spam'; these word:label pairs were used as the feature set for the
Naive Bayes classifier. At this point, some words in the feature set are labelled as both important
and spam. With the feature extractor defined, it can be used to train the classifier to label new
emails. Ninety percent of the feature set was used as the train_set, while the remaining ten
percent was used as the test_set.
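The paper trains NLTK's NaiveBayesClassifier; a minimal hand-rolled sketch of the same underlying
computation, P(c|x) ∝ P(c)·∏ P(w|c) with Laplace smoothing, is shown below. The toy "emails" and
their labels are hypothetical, chosen only to make the example self-contained:

```python
import math
from collections import Counter

# Toy labelled emails (hypothetical data, for illustration only).
train = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting schedule tomorrow", "important"),
    ("project meeting notes", "important"),
]

# Per-class word counts and class priors.
word_counts = {"spam": Counter(), "important": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(text):
    """Pick the class maximizing log P(c) + sum of log P(w|c), Laplace-smoothed."""
    best_label, best_score = None, -math.inf
    total = sum(class_counts.values())
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for w in text.split():
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("free money"))       # → spam
print(classify("project meeting"))  # → important
```

With NLTK one would instead build boolean word-presence feature dictionaries and call
nltk.NaiveBayesClassifier.train on the (features, label) pairs; the probabilities involved are the
same.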
2) Hidden Markov Model

The HMM is a tool for representing probability distributions over sequences of observations. The
HMM assumes that the observation at time t was generated by some process whose state St is hidden
from the observer. It also assumes that the state of this hidden process satisfies the Markov
property: given the value of St-1, the current state St is independent of all the states prior to
t-1. Graphically, this can be explained as shown in the figure below.

The goal is the same here: to build a classifier that will automatically tag new emails with
appropriate category labels. Two states were used, important and spam. The list word_features was
used as the observations, and the start probability was set as {'important': 0.5, 'spam': 0.5}.
Each word was searched for in both categories, and its occurrences in either category were counted
separately; the probabilities of words appearing in spam emails or important emails were used to
create the emission probability set. To determine the transition probabilities, for each word from
the spam and important emails, the probability of the next word being spam or important was
recorded. The HMM implementation from the sklearn module in Python was used, and the start,
transition, and emission probabilities were set. This way, important words and spam words could be
identified.
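Given the start, transition, and emission probabilities described above, decoding the most probable
state sequence is done with the Viterbi algorithm. The paper used the HMM implementation then
shipped with sklearn (since moved to the separate hmmlearn package); the sketch below hand-rolls
the decoding step instead, with the paper's uniform start probability but made-up placeholder
transition and emission values, not the ones learned in the paper:

```python
# Viterbi decoding over the two hidden states used in the paper.
states = ["important", "spam"]
start_p = {"important": 0.5, "spam": 0.5}   # as in the paper
trans_p = {                                  # placeholder values
    "important": {"important": 0.7, "spam": 0.3},
    "spam": {"important": 0.3, "spam": 0.7},
}
emit_p = {                                   # placeholder values
    "important": {"meeting": 0.4, "schedule": 0.4, "free": 0.1, "offer": 0.1},
    "spam": {"meeting": 0.1, "schedule": 0.1, "free": 0.4, "offer": 0.4},
}

def viterbi(observations):
    """Return the most probable state sequence for the observed words."""
    # v[s] = probability of the best path so far ending in state s;
    # paths[s] = that best path.
    v = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    paths = {s: [s] for s in states}
    for obs in observations[1:]:
        new_v, new_paths = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[p] * trans_p[p][s])
            new_v[s] = v[prev] * trans_p[prev][s] * emit_p[s][obs]
            new_paths[s] = paths[prev] + [s]
        v, paths = new_v, new_paths
    best = max(states, key=lambda s: v[s])
    return paths[best]

print(viterbi(["free", "offer", "meeting"]))
# → ['spam', 'spam', 'important']
```

In hmmlearn the equivalent would be setting startprob_, transmat_, and emissionprob_ on a model and
calling its decode method; the dynamic program is the same.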

V. RESULTS AND ANALYSIS

Table 5.1: Evaluation on Test Set (Naive Bayes in Different Processes)

Table 5.1 shows that, using the basic Naive Bayes approach, out of 100 important emails 68
instances were classified correctly, 23 instances were classified incorrectly, and 9 instances
could not be determined; and out of 100 spam emails 69 instances were classified correctly, 15
instances were classified incorrectly, and 16 instances could not be determined. The total accuracy
achieved was 78.28%. Using only stop-word removal, out of 100 important emails 83 instances were
classified correctly, 13 instances were classified incorrectly, and 4 instances could not be
determined; and out of 100 spam emails 57 instances were classified correctly, 25 instances were
classified incorrectly, and 18 instances could not be determined. The total accuracy achieved was
78.65%. Using both stop-word removal and summarizing, 70 important instances were classified
correctly, 16 were classified incorrectly, and 14 could not be determined; and out of 100 spam
emails 67 instances were classified correctly, 20 were classified incorrectly, and 13 could not be
determined. The total accuracy achieved was 79.19%, which is the best result in this comparison.

Figure 5.1: Accuracy Comparisons (Naive Bayes)

Figure 5.1 shows the accuracy comparison among the 8 different combinational approaches to Naive
Bayes; using stop-word removal and lemmatizing together gives the best accuracy result, 79.19%.
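The reported totals are consistent (to within rounding) with accuracy being computed over the
determined instances only, i.e. accuracy = correct / (correct + incorrect), with the undetermined
instances excluded from the denominator; a quick check against the three reported figures:

```python
def accuracy(correct, incorrect):
    """Accuracy over determined instances (undetermined ones excluded)."""
    return 100.0 * correct / (correct + incorrect)

# (correct, incorrect) summed over the important and spam test emails.
basic = accuracy(68 + 69, 23 + 15)     # basic Naive Bayes
stop = accuracy(83 + 57, 13 + 25)      # stop-word removal only
stop_sum = accuracy(70 + 67, 16 + 20)  # stop words + summarizing

print(round(basic, 1), round(stop, 1), round(stop_sum, 1))  # → 78.3 78.7 79.2
```

These evaluate to approximately 78.28%, 78.65%, and 79.19%, matching the figures quoted above.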
Table 5.2: Evaluation on Test Set (Hidden Markov Model in Different Processes)

Figure 5.2: Accuracy Comparisons (Hidden Markov Model)

After running both the Naive Bayes and HMM algorithms in the 8 combinations of the 3 different
processes, along with the basic approach, we find different classification accuracies for the
different combinations.

Table 5.3: Accuracy Comparison between Naive Bayes and HMM Algorithm

Figure 5.3: Accuracy Comparison between Naive Bayes and HMM Algorithm

VI. CONCLUSION

In summary, we propose a comparative approach to email classification using the Naive Bayes
classifier and the HMM. We categorize emails by considering only the text part of the message body,
because we consider relevant words and sentences as features. After running the same variants on
both algorithms, we compared the results and chose HMM for classification because it gave higher
accuracy. The structure of our analysis has been built in such a way that, with a proper data set
and minor adjustments, it can be used to classify texts into any number of categories.

REFERENCES

[1] Jeffrey Dean, Sanjay Ghemawat, "MapReduce: Simplified data processing on large clusters,"
Communications of the ACM, Vol.51, Issue.1, 2008, pp.107-113.

[2] Hadoop [Internet], http://hadoop.apache.org.

[3] Sudheesh Narayanan, "Securing Hadoop: Implement robust end-to-end security for your Hadoop
ecosystem," 1st Vol, PACKT Publishing, 2014.

[4] So Hyeon Park and Ik Rae Jeong, "A Study on Security Improvement in Hadoop Distributed File
System Based on Kerberos," Journal of the Korea Institute of Information Security and Cryptology,
Vol.23, Issue.5, 2013, pp.803-813.

[5] Liu Yi, Hadoop Crypto Design [Internet],
https://issues.apache.org/jira/secure/attachment/12571116/Hadoop Crypto Design.pdf.

[6] Seonyoung Park and Youngseok Lee, "A Performance Analysis of Encryption in HDFS," Journal of
KISS: Databases, Vol.41, Issue.1, 2014, pp.21-27.

[7] Byeong-yoon Choi, "Design of Cryptographic Processor for AES Rijndael Algorithm," The Journal
of The Korean Institute of Communication Sciences, Vol.26, Issue.10, 2001, pp.1491-1500.

[8] Yong Kuk Cho, Jung Hwan Song, and Sung Woo Kang, "Criteria for Evaluating Cryptographic
Algorithms based on Statistical Testing of Randomness," Journal of the Korea Institute of
Information Security and Cryptology, Vol.11, Issue.6, 2001, pp.67-76.

[9] ARIA Development Team, Block Encryption Algorithm ARIA [Internet],
http://glukjeoluk.tistory.com/attachment/ok110000000002.pdf.

[10] Korea Internet & Security Agency, ARIA specification [Internet],
http://seed.kisa.or.kr/iwt/ko/bbs/EgovReferenceDetail.do?bbsId=BBSMSTR_000000000002&nttId=39&pageIndex=1&searchCnd=&searchWrd=.

[11] Jeffrey Root, Intel® Advanced Encryption Standard Instructions (AES-NI) [Internet],
https://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni.

[12] Weizhong Zhao, Huifang Ma, Qing He, "Parallel k-means clustering based on MapReduce," IEEE
International Conference on Cloud Computing, Springer Berlin Heidelberg, Vol.5931, 2009,
pp.674-679.

[13] Hui Gao, Jun Jiang, Li She, Yan Fu, "A New Agglomerative Hierarchical Clustering Algorithm
Implementation based on the Map Reduce Framework," International Journal of Digital Content
Technology and its Applications, Vol.4, Issue.3, 2010, pp.95-100.

[14] [Internet] https://dumps.wikimedia.org/enwiki/

[15] [Internet] https://dumps.wikimedia.org/metawiki/

[16] MODIS [Internet] https://modis.gsfc.nasa.gov/