Abstract— In addition to the undeniable benefits, the development of the Internet has led to many undesirable effects. Spam emails are one of the most challenging issues faced by Internet users. Spam refers to all emails of unsolicited content that arrive in a user's email box. Spam can often lead to network congestion and blocking, or even damage to the system for receiving and sending electronic messages. Thus, proper classification of spam email from legitimate email has become very important. This paper presents a new approach for feature selection and an Iterative Dichotomiser 3 (ID3) algorithm designed to generate a decision tree for email classification. The experimental results indicate that the proposed model achieves very high accuracy.

I. INTRODUCTION

The Internet as a "network of networks" has expanded the possibilities of communication and placement of content. The email system is one of the most effective and most commonly used means of communication. Unfortunately, the continuous growth of the number of email users has led to a massive increase of spam emails [1]. Spam emails are usually sent in bulk and do not target individual recipients. Whether commercial in nature or not, spam emails can cause serious problems in electronic communication. Spam emails produce a large amount of unwanted data and thus affect network capacity and usage [2]. Because of the large number of spam emails reaching users of email services, it is difficult to distinguish useful emails from unsolicited ones. Thus, managing and filtering emails is an important challenge. The purpose of filtering is to detect and isolate spam emails.

The selection of features from the email body is very important. Features or attributes play a vital role in the process of classification [6]. In this paper, semantic properties of email content are used for feature reduction and selection. In order to reduce the computational demand and to obtain accurate results, the email data is pre-processed [7], [8]. The main aim is to preserve the most important features. After feature selection, the ID3 algorithm is used to generate a decision tree that categorizes emails as spam or ham [9], [10]. The proposed approach is evaluated using accuracy and precision. The performance of the proposed system is measured against the size of the dataset and the feature size.

This paper is organized as follows. The second section explains the proposed approach for spam detection in detail. The third section is the result analysis. The fourth section gives the conclusion.

II. SPAM DETECTION SYSTEM

This section presents the workflow of the Spam Detection (SD) system for the classification of emails into ham and spam emails. The text-based email dataset considered is initially pre-processed for efficient feature extraction. The SD system consists of four modules: email dataset preparation, pre-processing of data, feature selection and classification. The SD process is presented in Fig. 1 and the proposed procedure is briefly explained in the sections below.
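To make the module boundaries concrete, the following minimal sketch outlines the four modules as Python functions. Every function name, the tab-separated input format and the placeholder classifier are assumptions of this sketch, not details taken from the paper; the actual classification module is the ID3 decision tree described below.

    # A minimal, self-contained sketch of the four SD modules; all names
    # and formats here are illustrative assumptions, not the authors' code.

    def load_dataset(path):
        # Module 1: email dataset preparation (assumed "label<TAB>text" lines).
        with open(path, encoding="utf-8") as f:
            pairs = [line.rstrip("\n").split("\t", 1) for line in f]
        labels = [label for label, _ in pairs]
        emails = [text for _, text in pairs]
        return emails, labels

    def preprocess(text):
        # Module 2: pre-processing (here just lowercasing; a real system
        # would also strip headers, punctuation, stop words, etc.).
        return text.lower()

    def select_features(text, spam_words=("call", "txt", "free", "claim")):
        # Module 3: feature selection (counts of indicative words, cf. Table I).
        tokens = text.split()
        return [tokens.count(w) for w in spam_words]

    def classify(features, threshold=1):
        # Module 4 placeholder: the paper's actual classifier is an ID3 tree.
        return "Spam" if sum(features) > threshold else "Ham"

    print(classify(select_features(preprocess("FREE prize! Call now to claim"))))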
TABLE I
FEATURE MATRIX: EACH ROW REPRESENTS AN EMAIL WITH THE FEATURES PRESENTED IN COLUMNS

EMAIL     Numbr   Call   Txt   Free   Claim   Httpaddr   Moneysymb   Total_spam_words   DECISION/CLASS
Email_1     0       1     0      0      0         0           0              1                Ham
Email_2     2       0     0      1      1         1           0              4                Spam
Email_3     1       0     0      3      0         0           0              2                Spam
Email_4     1       0     0      0      0         0           0              0                Ham
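The column names in Table I suggest a normalization step in which numbers, URLs and currency symbols are replaced by placeholder tokens (numbr, httpaddr, moneysymb). The sketch below shows one plausible way to produce a feature row; the regular expressions, the spam-word list and the definition of the total column are assumptions, not taken from the paper.

    import re

    # Placeholder tokens matching the Table I columns; the exact
    # normalization rules are assumed, not specified in the paper.
    NORMALIZE = [
        (re.compile(r"https?://\S+|www\.\S+"), "httpaddr"),
        (re.compile(r"[$€£]"), "moneysymb"),
        (re.compile(r"\d+"), "numbr"),
    ]
    SPAM_WORDS = ["numbr", "call", "txt", "free", "claim", "httpaddr", "moneysymb"]

    def feature_row(email_text):
        text = email_text.lower()
        for pattern, token in NORMALIZE:
            text = pattern.sub(token, text)
        tokens = re.findall(r"[a-z_]+", text)
        counts = [tokens.count(w) for w in SPAM_WORDS]
        return counts + [sum(counts)]  # last column: total_spam_words (one plausible definition)

    # Example: a row analogous to a line of Table I
    print(feature_row("FREE entry! Claim your prize, call 09061701461 now"))
    # -> [1, 1, 0, 1, 1, 0, 0, 4]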
D. Decision tree

A decision tree is a structure that represents a procedure for classifying objects based on their attributes. A decision tree is a tree in which each node represents a feature, each branch represents a decision and each leaf represents an outcome (class or decision). Decision trees can be used to predict the class of an unknown query instance by building a model based on existing data for which the decision is known. To train a decision tree model we need a dataset consisting of a number of training examples characterized by a number of descriptive features and the class. The features can have either nominal or continuous values.

A decision tree consists of a root node, internal nodes and leaf nodes. Internal nodes represent the conditions applied to attributes or features, whereas leaf nodes represent the class. Each node typically has two or more nodes extending from it. When classifying an unknown instance, the instance is routed down the tree according to the values of the attributes in the successive nodes. The main advantage of using a decision tree is that it is easy to follow and understand. Fig. 3 presents an example of a typical decision tree. The words "free" and "money" are typical spam words and they are used as features. If the word "free" appears more than two times in an email, then the email is classified as spam. Otherwise, we check whether the email contains the word "money". If the word "money" appears more than three times, then the email is classified as spam; otherwise it is ham.
Figure 3. An example of a decision tree
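The decision procedure of Fig. 3 can be written directly as nested conditions. This tiny sketch just restates the figure; the count thresholds (two for "free", three for "money") come from the example above.

    def classify_email(tokens):
        # Decision tree from Fig. 3: test "free" first, then "money".
        if tokens.count("free") > 2:
            return "Spam"
        if tokens.count("money") > 3:
            return "Spam"
        return "Ham"

    print(classify_email("free money free offer free".split()))  # -> Spam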
The ID3 algorithm is based on the decision tree algorithm. The ID3 algorithm builds the decision tree based on entropy and information gain. Entropy measures the impurity of an arbitrary collection of samples, while information gain measures the reduction in entropy achieved by partitioning the samples according to a certain attribute. If the target attribute (class) takes on n different values, then the entropy of S relative to this n-wise classification is defined as shown in (1):

    Entropy(S) = \sum_{i=1}^{n} -p_i \log_2 p_i    (1)

where p_i is the proportion (probability) of samples in S belonging to class C_i.

Information gain is calculated to split the attributes further in the tree. The attribute with the highest information gain is always preferred first. Entropy and information gain are related by the following equation:

    gain(S, A_i) = Entropy(S) - Entropy_{A_i}(S)    (2)

where Entropy_{A_i}(S) is the expected entropy if attribute A_i is used to partition the data.
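As a concrete check of (1) and (2), the following sketch computes the entropy of a labeled set and the information gain of one attribute; the tiny example data is invented for illustration.

    from collections import Counter
    from math import log2

    def entropy(labels):
        # Equation (1): sum over classes of -p_i * log2(p_i).
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

    def information_gain(values, labels):
        # Equation (2): Entropy(S) minus the expected entropy after
        # partitioning S by the attribute's values.
        total = len(labels)
        expected = 0.0
        for v in set(values):
            subset = [lab for val, lab in zip(values, labels) if val == v]
            expected += len(subset) / total * entropy(subset)
        return entropy(labels) - expected

    labels = ["Spam", "Spam", "Ham", "Ham"]       # illustrative class labels
    free_flag = [1, 1, 0, 0]                      # "free" present: a perfect split
    print(entropy(labels))                        # 1.0 bit
    print(information_gain(free_flag, labels))    # 1.0 bit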
The algorithm was implemented according to the following steps:
1. Create a root node.
2. Calculate the entropy of the whole (sub-)dataset.
3. Calculate the information gain for each single feature and select the feature with the largest information gain.
4. Assign the (root) node the label of the feature with maximum information gain. Grow an outgoing branch for each feature value and add unlabeled nodes at the end.
5. Split the dataset along the values of the maximum-information-gain feature and remove this feature from the dataset.
6. For each sub-dataset, repeat steps 3 to 5 until a stopping criterion is satisfied.

A compact recursive sketch of these steps is given below.
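This minimal implementation of steps 1 to 6 reuses entropy and information_gain from the previous sketch; the nested-dict tree representation and the stopping criteria (pure node, or no features left) are choices of this sketch, not prescribed by the paper.

    from collections import Counter

    def id3(rows, labels, features):
        # Stop: pure node, or no features left -> majority-class leaf.
        if len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]
        # Steps 2-3: pick the feature with the largest information gain.
        best = max(features,
                   key=lambda f: information_gain([r[f] for r in rows], labels))
        # Step 4: label the node and grow a branch per observed feature value.
        node = {best: {}}
        remaining = [f for f in features if f != best]
        for v in set(r[best] for r in rows):
            # Step 5: split the dataset and drop the used feature.
            idx = [i for i, r in enumerate(rows) if r[best] == v]
            # Step 6: recurse on each sub-dataset.
            node[best][v] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx], remaining)
        return node

    # Rows as {feature: value} dicts, loosely following Table I (binary flags).
    rows = [{"free": 0, "claim": 0}, {"free": 1, "claim": 1},
            {"free": 1, "claim": 0}, {"free": 0, "claim": 0}]
    labels = ["Ham", "Spam", "Spam", "Ham"]
    print(id3(rows, labels, ["free", "claim"]))
    # -> {'free': {0: 'Ham', 1: 'Spam'}} (branch order may vary)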
Since the chosen features have continuous values, to perform a binary split the continuous values need to be converted to nominal ones. This is done using a threshold value. The threshold value is the value that offers maximum information gain for that attribute. For example, the information gain is maximized when the threshold is equal to two for the total_spam_words feature. In fact, for most features it turns out that it is not important how many times a certain spam word occurred in an email, but whether it appeared at all. This conclusion has enabled data dimensionality reduction, since there are some features that have no effect on the decision. A feature that has no influence on the class labels can be discarded. The feature reduction has made the data less sparse and more statistically significant for the ID3 algorithm.
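The threshold can be found by scanning candidate split points and keeping the one with the largest information gain, as sketched below (again reusing information_gain). Using midpoints between sorted distinct values as candidates is a common choice assumed here; on this toy column the midpoint 1.5 plays the role of the paper's threshold of two, since both separate the same rows.

    def best_threshold(values, labels):
        # Try midpoints between consecutive distinct values; binarize the
        # feature at each candidate and keep the most informative threshold.
        distinct = sorted(set(values))
        candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
        best, best_gain = None, -1.0
        for t in candidates:
            binarized = [int(v > t) for v in values]
            g = information_gain(binarized, labels)
            if g > best_gain:
                best, best_gain = t, g
        return best, best_gain

    # total_spam_words column and classes from Table I:
    print(best_threshold([1, 4, 2, 0], ["Ham", "Spam", "Spam", "Ham"]))  # (1.5, 1.0)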
III. EXPERIMENTAL RESULTS

The efficiency of the proposed SD system is assessed by evaluating the performance parameters. Parameters like the true negative rate, false negative rate, false positive rate, precision and accuracy are calculated in order to evaluate the performance of the SD system.

Given a set of labeled data and such a predictive model, every data point lies in one of four categories:
TP (True Positive): the number of instances correctly classified to that class.
TN (True Negative): the number of instances correctly rejected from that class.
FP (False Positive): the number of instances incorrectly classified to that class.
FN (False Negative): the number of instances incorrectly rejected from that class.
These values are often presented in a confusion matrix. A confusion matrix is a summary of prediction results on a classification problem. Table II represents the confusion matrix for email spam classification.

TABLE II
CONFUSION MATRIX

                 Predicted HAM      Predicted SPAM
Actual HAM       True Negative      False Positive
Actual SPAM      False Negative     True Positive
Accordingly, accuracy and precision can be defined as follows:

    accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (3)

    precision = \frac{TP}{TP + FP}    (4)

For a classifier, accuracy is defined as the number of items categorized correctly divided by the total number of items; it is the fraction of the time the classifier makes the correct decision. Precision is defined as the ratio of true positives to predicted positives; it shows how many actual spam emails there are among the predicted ones.
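A direct transcription of (3) and (4) from confusion-matrix counts; the sample counts below are invented for illustration and are not results from the paper.

    def accuracy(tp, tn, fp, fn):
        # Equation (3): correctly classified items over all items.
        return (tp + tn) / (tp + tn + fp + fn)

    def precision(tp, fp):
        # Equation (4): true positives over predicted positives.
        return tp / (tp + fp)

    tp, tn, fp, fn = 240, 730, 10, 20   # illustrative counts only
    print(accuracy(tp, tn, fp, fn))     # 0.97
    print(precision(tp, fp))            # 0.96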
The performance of the proposed SD system is measured against the size of the dataset and the feature size. Datasets of different sizes are used for measuring the performance. For example, in the case of 500 emails being used for the training process, the accuracy was 97.22% using the decision tree classifier. The decision tree classifier achieves over 97.32% classification accuracy for more than 1000 emails. Reducing the number of features also affects the accuracy. As expected, the accuracy increased as the feature size increased. The accuracy using 3 features is 97.12% and, when using 7 features, 97.4% for the same dataset.

TABLE III
CLASSIFICATION RESULTS BASED ON DATASET SIZE AND FEATURE SIZE