Jawaharlal Nehru Technology University-A, Ananthapur: A Social Relevant Project Report Submitted To
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
By
G.MADHU SUJAN - 19AK1A0593
T.KAVERI - 19AK1A0575
N.LAVANYA - 19AK1A0583
A.LARIFA - 19AK1A0581
Associate Professor
(AUTONOMOUS)
2021-2022
ANNAMACHARYA INSTITUTE OF TECHNOLOGY AND SCIENCES
(AUTONOMOUS)
CERTIFICATE
Certified that this is a bonafide record of the Social Relevant Project Report entitled “EMAIL
SPAM DETECTION”, done by G.MADHU SUJAN, REG NO: 19AK1A0593, T.KAVERI, REG NO: 19AK1A0575,
N.LAVANYA, REG NO: 19AK1A0583, A.LARIFA, REG NO: 19AK1A0581, submitted to the faculty of
Computer Science and Engineering, in partial fulfillment of the requirements for the Degree of BACHELOR OF
TECHNOLOGY in Computer Science and Engineering from Jawaharlal Nehru Technological University-A,
Anantapur during the year 2019 - 2023.
Date:______________
Place:Tirupati.
ANNAMACHARYA INSTITUTE OF TECHNOLOGY AND SCIENCES
(AUTONOMOUS)
Venkatapuram(V), Karakambadi (Po), Renigunta(M), Tirupati-517520, A.P
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
DECLARATION
We hereby declare that the project titled “EMAIL SPAM DETECTION” is a genuine project work carried
out by us in the B.TECH (Computer Science and Engineering) degree course of Jawaharlal Nehru
Technology University-A, Ananthapur, and has not been submitted to any other course or university for
the award of any degree.
ACKNOWLEDGEMENT
The satisfaction that accompanies the successful completion of a task would be incomplete without
the mention of the people who made it possible, whose encouragement crowns all the efforts with success.
We avail this opportunity to express our deep sense of gratitude and hearty thanks to Mr. C. GANGI
REDDY, Honorable Secretary of AITS-Tirupati, for providing a congenial atmosphere and encouragement.
We show gratitude to Dr. C. NADHAMUNI REDDY, Principal, for having provided all the
facilities and support.
We would like to thank Mr. B. RAMANA REDDY, Assistant Professor & HOD, Computer
Science and Engineering, for encouragement at various levels of our project.
We are thankful to our guide Dr. T. Sreenivasula Reddy, Computer Science and Engineering, for his
sustained inspiring guidance and cooperation throughout the process of this project.
We express our deep sense of gratitude and thanks to all the Teaching and Non-Teaching Staff of
our college who stood with us during the project and helped us to make it a successful venture.
We place highest regards to our Parents, Friends and Well Wishers who helped a lot in making the
report of this project.
ABSTRACT
Social communication has evolved, yet e-mail remains one of the most common means of communication, used for
both formal and informal purposes. Although many languages have been digitized for the electronic world, English is
still dominant. However, native languages of different regions are gradually emerging. The Urdu language,
originating from South Asia, mostly Pakistan, is also gaining pace as a medium of communication on social
media platforms, websites, and emails. With the increased usage of emails, the number and variety of Urdu spam
content also increase. Spam emails are inappropriate and unwanted messages, usually sent to breach security. These
spam emails include phishing URLs, advertisements, and commercial segments, and are sent to a large number of
indiscriminate recipients. Such content is always a hazard for the user, and many studies have been carried out to
detect such spam content. However, there is a dire need to detect spam emails whose content is written in the Urdu language.
The proposed system, “EMAIL SPAM DETECTION”, utilizes existing machine learning
algorithms including Naive Bayes, CNN, SVM, and LSTM to detect and categorize e-mail content. According to our
findings, the LSTM model outperforms the other models with the highest accuracy of 98.4%.
1. INTRODUCTION
1.1 INTRODUCTION
Email spam has become a major problem nowadays. With the rapid growth of internet users, email spam is
also increasing. People use spam for illegal and unethical conduct, phishing, and fraud. Spammers send
malicious links through spam emails, which can harm our systems and can also sneak into them.
Creating a fake profile and email account is very easy for spammers. They pretend to be genuine
people in their spam emails and target people who are not aware of these frauds.
So, it is necessary to identify the spam mails that are fraudulent.
In this proposed system, a dataset from the “Kaggle” website is used as the training dataset. The loaded dataset
is first checked for duplicates and null values for better performance of the machine. Then, the dataset is
split into two sub-datasets, the “train dataset” and the “test dataset”, in the proportion of 70:30. The “train”
and “test” datasets are then passed as parameters for text-processing.
In text-processing, punctuation symbols and words that are in the stop words list are removed, and the
remaining words are returned as clean words.
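The preparation steps described above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the stop-word list, the sample rows and the helper names (clean_text, drop_nulls_and_duplicates, split_70_30) are hypothetical and are not taken from the project code.

```python
import random
import string

# Assumption: a tiny illustrative stop-word list; the report does not say which list was used.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "for", "of", "and", "you", "your"}


def clean_text(message):
    """Remove punctuation and stop words, returning the remaining lowercase words."""
    no_punct = message.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.lower().split() if word not in STOP_WORDS]


def drop_nulls_and_duplicates(rows):
    """Keep the first copy of each (label, text) row and skip rows with missing fields."""
    seen, kept = set(), []
    for label, text in rows:
        if not label or not text or (label, text) in seen:
            continue
        seen.add((label, text))
        kept.append((label, text))
    return kept


def split_70_30(rows, seed=0):
    """Shuffle with a fixed seed and split into 70% train and 30% test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * 0.7))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sample rows, used only to exercise the helpers.
rows = [
    ("spam", "WIN a FREE prize now!!!"),
    ("ham", "Are you coming to the meeting?"),
    ("spam", "WIN a FREE prize now!!!"),  # duplicate row
    ("ham", None),                        # null text
    ("ham", "See you at lunch."),
]
clean_rows = drop_nulls_and_duplicates(rows)
train_rows, test_rows = split_70_30(clean_rows)
print(clean_text("WIN a FREE prize now!!!"))
print(len(train_rows), len(test_rows))
```

A fixed seed is used so that the same split is produced on every run, which mirrors the role of the random state mentioned in this section.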
After acquiring the values from the “hyperparameter tuning”, the machine is fitted using those values with a
random state. The state of the trained model and its features are saved for future use in testing unseen data.
The machines are trained with the classifiers from the sklearn module in Python, using the values obtained
above.
Automatic email filtering may be the most effective method of detecting spam, but nowadays spammers
can easily bypass many of these spam filtering applications. Naive Bayes is one of the most well-known
algorithms applied in these procedures. The blacklist approach was probably one of the earliest
techniques used for filtering spam. The technique is to accept all mails other than
those from the listed domains/electronic mail ids.
1.5 ADVANTAGES OF PROPOSED SYSTEM
Ensemble methods, on the other hand, have proven to be useful as they use multiple classifiers for class
prediction. Nowadays, lots of emails are sent and received, and filtering them is difficult because our project is
only able to test emails using a limited corpus. Our spam detection project is thus capable of filtering
mails according to the content of the email and not according to the domain names or any other criteria.
Good Efficiency
Greater accuracy
2. ANALYSIS
2.1 INTRODUCTION
Analysis is the process of gathering and interpreting the requirements. Analysis can be done in
different ways. Here, it involves the identification of materials that are suitable for the relevant analysis. It is
important to gather the necessary information first, before organizing or scheduling anything. System analysis
is an important phase of any system development process.
Web Browser
Internet Connection
Laptop with good specifications
What is Colab?
Colab, or "Colaboratory", allows you to write and execute Python in your browser.
Whether you're a student, a data scientist or an AI researcher, Colab can make your work easier.
With Colab you can import an image dataset, train an image classifier on it, and evaluate the
model, all in just a few lines of code. Colab notebooks execute code on Google's cloud servers,
meaning you can leverage the power of Google hardware, including GPUs and TPUs, regardless
of the power of your machine. All you need is a browser.
Colab is used extensively in the machine learning community with applications including:
When creating a machine learning project, we do not always come across clean
and formatted data. Before doing any operation with data, it is mandatory to clean it and put it in
a formatted way. For this, we use a data preprocessing task.
Machine Learning Models
A machine learning model is defined as a mathematical representation of the output of the
training process. Machine learning is the study of different algorithms that can improve
automatically through experience & old data and build the model. A machine learning model is
similar to computer software designed to recognize patterns or behaviors based on previous
experience or data. The learning algorithm discovers patterns within the training data, and it
outputs an ML model which captures these patterns and makes predictions on new data.
o Supervised Learning
o Unsupervised Learning
o Reinforcement Learning
o Classification
o Regression
o Dimensionality Reduction
The main aim of the linear regression model is to find the line that best fits the data points.
Linear regression is extended to multiple linear regression (find a plane of best fit) and polynomial
regression (find the best fit curve).
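As a small illustration of the best fit line, the ordinary least squares slope and intercept for one input variable can be computed directly. The function name fit_line and the sample points are assumptions made for this sketch and are not part of the project.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```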
Classification
Classification models are the second type of Supervised Learning techniques, which are used to
generate conclusions from observed values in the categorical form. For example, the classification
model can identify if the email is spam or not; a buyer will purchase the product or not, etc.
Classification algorithms are used to predict two or more classes and categorize the output into different
groups.
In classification, a classifier model is designed that classifies the dataset into different categories,
and each category is assigned a label.
o Binary classification: If the problem has only two possible classes, it is called a binary classifier.
o Multi-class classification: If the problem has more than two possible classes, it is a multi-class
classifier.
a) Logistic Regression
Logistic Regression is used to solve classification problems in machine learning. It is
similar to linear regression but is used to predict categorical variables. It can predict the output as
either Yes or No, 0 or 1, True or False, etc. However, rather than giving exact values, it provides
probabilistic values between 0 and 1.
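The way logistic regression turns a weighted sum into a value between 0 and 1 can be sketched with the sigmoid function. The weights, bias and feature values below are hypothetical numbers chosen for illustration; a trained model would learn them from data.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(features, weights, bias, threshold=0.5):
    """Return 1 when the probability reaches the threshold, otherwise 0."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0


# Hypothetical weights and bias, not values learned by the project.
weights, bias = [1.2, 0.8], -2.0
print(predict_probability([3, 1], weights, bias))
print(classify([3, 1], weights, bias), classify([0, 0], weights, bias))
```

The threshold of 0.5 is the usual default; raising or lowering it trades one kind of misclassification for the other.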
b) Support Vector Machine
Support vector machine, or SVM, is a popular machine learning algorithm that is widely used
for classification and regression tasks. However, it is mainly used to solve classification
problems. The main aim of SVM is to find the best decision boundary in an N-dimensional space
that can segregate data points into classes; this best decision boundary is known as
the hyperplane. SVM selects the extreme vectors to find the hyperplane, and these vectors are known
as support vectors.
c) Naïve Bayes
Naïve Bayes is another popular classification algorithm used in machine learning. It is called so because it
is based on Bayes' theorem and follows the naïve (independence) assumption between the features.
Bayes' theorem is given as:
P(A|B) = P(B|A) P(A) / P(B)
Each naïve Bayes classifier assumes that the value of a specific variable is independent of any other
variable/feature. For example, suppose a fruit needs to be classified based on color, shape, and taste. Then a
yellow, oval, and sweet fruit will be recognized as a mango. Here each feature is independent of the other
features.
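The independence assumption can be illustrated with a small multinomial Naïve Bayes classifier written from scratch. This is a sketch under stated assumptions and not the sklearn implementation used in the project: the training sentences are made up, and add-one (Laplace) smoothing is assumed so that unseen words do not produce zero probabilities.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """samples is a list of (label, list_of_words) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, words in samples:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(model, words):
    """Return the label with the highest log posterior for the given words."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total)
        # Each word contributes independently (the naive assumption),
        # with add-one smoothing over the vocabulary.
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in words:
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data for illustration only.
model = train_naive_bayes([
    ("spam", ["win", "free", "prize"]),
    ("spam", ["free", "money", "now"]),
    ("ham", ["meeting", "at", "noon"]),
    ("ham", ["lunch", "at", "noon"]),
])
print(predict(model, ["free", "prize"]))
print(predict(model, ["lunch", "noon"]))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on long messages.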
d) Random Forest
As the name suggests, "Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the
prediction from each tree and, based on the majority vote of the predictions, predicts the final
output.
A greater number of trees in the forest generally leads to higher accuracy and helps reduce the problem
of overfitting.
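The majority vote step can be sketched as follows. The three "trees" here are hypothetical hand-written rules standing in for trained decision trees; a real random forest would learn each tree from a random subset of the data.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent prediction in the list."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical stand-ins for trained decision trees.
def tree_a(words):
    return "spam" if "free" in words else "ham"


def tree_b(words):
    return "spam" if "prize" in words else "ham"


def tree_c(words):
    return "spam" if "urgent" in words else "ham"


def forest_predict(trees, words):
    """Collect one prediction per tree and return the majority class."""
    return majority_vote([tree(words) for tree in trees])


trees = [tree_a, tree_b, tree_c]
print(forest_predict(trees, ["free", "prize"]))
print(forest_predict(trees, ["free", "lunch"]))
```

Using an odd number of trees for a two-class problem avoids tied votes.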
e) K-Nearest Neighbor
This algorithm is used to solve classification problems. The K-nearest neighbor, or
K-NN, algorithm basically creates an imaginary boundary to classify the data. When new
data points come in, the algorithm predicts their class from the k nearest data points, that is,
from the nearest side of the boundary line.
A larger k value means smoother curves of separation, resulting in less complex
models, whereas a smaller k value tends to overfit the data, resulting in complex
models.
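A minimal K-NN classifier can be written in a few lines. The two numeric features per mail (for example, the number of links and the number of exclamation marks) and the sample points are assumptions made for this sketch; the project itself uses TF-IDF features.

```python
import math
from collections import Counter


def knn_predict(train_points, query, k=3):
    """train_points is a list of (feature_vector, label) pairs."""
    # Sort the training points by Euclidean distance to the query.
    by_distance = sorted(train_points, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # The predicted class is the most common label among the k nearest points.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training points: (links, exclamation marks) -> label.
train_points = [
    ((0, 0), "ham"), ((1, 0), "ham"), ((0, 1), "ham"),
    ((5, 6), "spam"), ((6, 5), "spam"), ((6, 6), "spam"),
]
print(knn_predict(train_points, (5, 5), k=3))
print(knn_predict(train_points, (1, 1), k=3))
```

With k=1 the prediction follows the single closest point, which is why small k values tend to overfit.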
3. DESIGN
3.1 INTRODUCTION
The aim is to develop one or more designs that can be used to achieve the desired project goals. The
main aim of the system design phase is to provide a design for the specified needs of the email
spam detection system.
An ER Diagram, or Entity Relationship Diagram (ERD), is a diagram
that displays the relationships of entity sets stored in a database. In other words, ER diagrams help to explain
the logical structure of databases. ER diagrams are created based on three basic concepts: entities, attributes
and relationships. The ER Model represents real-world entities and the relationships between them. Creating
an ER Model in DBMS is considered a best practice before implementing your database. Entity
Relationship Diagram symbols and notations mainly consist of three basic symbols, rectangle, oval
and diamond, which represent entities, attributes and relationships respectively. There are some
sub-elements which are based on the main elements in an ER Diagram. An ER Diagram is a visual representation of data
that describes how data is related to each other using different ERD symbols and notations.
Following are the main components and their symbols in ER Diagrams:
Rectangles: This symbol represents entity types
Ellipses: This symbol represents attributes
Diamonds: This symbol represents relationship types
Lines: They link attributes to entity types and entity types with other relationship types
Primary key: Attributes are underlined
Double Ellipses: Represent multi-valued attributes
3.2.1 ER DIAGRAM
3.3 MODULE DESIGN AND ORGANIZATION
Step 1: E-mail Data Collection. The dataset contained in a corpus plays a crucial role in assessing the performance of
any spam filter. ...
Performance Analysis.
4. IMPLEMENTATION AND RESULT
4.1 INTRODUCTION
Implementation is the stage of the project when the theoretical design is turned into a working
system. Thus, it can be considered the most critical stage in achieving a successful new system and in
giving the user confidence that the new system will work and be effective. The implementation stage
involves careful planning, investigation of the existing system and its constraints on implementation,
designing of methods to achieve change and evaluation of changeover methods.
import pandas as pd
# loading the data from csv file to a pandas Dataframe
raw_mail_data = pd.read_csv('/content/mail_data.csv')
print(raw_mail_data)
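Splitting the data into training data & test data
from sklearn.model_selection import train_test_split
# X (mail text) and Y (labels: 0 = spam, 1 = ham) are assumed to have been
# extracted from raw_mail_data before this step
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=3)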
print(X.shape)
print(X_train.shape)
print(X_test.shape)
Feature Extraction
# transform the text data to feature vectors that can be used as input to the Logistic regression
from sklearn.feature_extraction.text import TfidfVectorizer
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)
X_train_features = feature_extraction.fit_transform(X_train)
X_test_features = feature_extraction.transform(X_test)
# convert Y_train and Y_test values as integers
Y_train = Y_train.astype('int')
Y_test = Y_test.astype('int')
Training the Model
Logistic Regression
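from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
m1 = LogisticRegression()
# training the Logistic Regression model with the training data
m1.fit(X_train_features, Y_train)
# prediction on training data
prediction_on_training_data = m1.predict(X_train_features)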
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)
print('Accuracy on training data : ', accuracy_on_training_data)
# prediction on test data
prediction_on_test_data = m1.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)
print('Accuracy on test data : ', accuracy_on_test_data)
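Applying Multinomial Naive Bayes Classification
from sklearn.naive_bayes import MultinomialNB
m2 = MultinomialNB()
m2.fit(X_train_features, Y_train)
# prediction on training data
prediction_on_training_data = m2.predict(X_train_features)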
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)
print('Accuracy on training data : ', accuracy_on_training_data)
Applying RandomForest Classification.
from sklearn.ensemble import RandomForestClassifier
m3 = RandomForestClassifier()
m3.fit(X_train_features, Y_train)
# prediction on training data
prediction_on_training_data = m3.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)
print('Accuracy on training data : ', accuracy_on_training_data)
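Applying KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
m4 = KNeighborsClassifier()
m4.fit(X_train_features, Y_train)
# prediction on training data
prediction_on_training_data = m4.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)
print('Accuracy on training data : ', accuracy_on_training_data)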
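# input mail to classify (the normal message shown in the output screens)
input_mail = ['hii how are you']
# convert the input text to feature vectors
input_data_features = feature_extraction.transform(input_mail)
# making prediction
prediction = m2.predict(input_data_features)
print(prediction)
# label 1 means ham, any other label means spam
if prediction[0] == 1:
    print('Ham mail')
else:
    print('Spam mail')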
4.2 OUTPUT SCREENS
From the above screen, we gave the input message “hii how are you”. It is a normal message, so the ML model
classifies it as ham mail.
From the above screen, we gave a fake message as input, so the ML model predicts it as spam mail.
5. TESTING
5.1 INTRODUCTION
A test case is an object used to execute the other modules in the architecture; it does not
represent an interaction with itself.
Each test case is a set of sequential steps to execute a test operating on a set of predefined inputs to
produce the expected outputs.
The table shows the test cases, corresponding results and the status of the test steps.
A test case consists of the set of conditions under which a tester determines whether the system satisfies
the requirements and works correctly.
Problems in the requirements and design of the application are evaluated during the process of
developing test cases.
The primary goal of software tests is to eliminate bugs in the code.
However, a project can gain additional benefits from a good testing process,
such as enhanced performance, user experience, and security of the overall project.
Often, when working on big projects, the team is divided into several sections.
Each has its development task, and each task has its standalone functionality.
These tasks are then combined to form the overall software product.
That’s why each part must undergo its own testing process to make sure it functions properly before
it is added to the main project.
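The idea of test cases with predefined inputs, expected outputs and a status column can be sketched as a small table-driven check. The function classify_mail below is a hypothetical keyword-based stand-in for the trained model, and the test inputs are illustrative only; they are not results recorded by the project.

```python
def classify_mail(text):
    """Hypothetical stand-in for the trained model: flags a few trigger words."""
    trigger_words = {"free", "prize", "winner"}
    words = set(text.lower().split())
    return "Spam mail" if words & trigger_words else "Ham mail"


# Each test case: (test id, predefined input, expected output).
TEST_CASES = [
    ("TC1", "hii how are you", "Ham mail"),
    ("TC2", "you are a winner claim your free prize", "Spam mail"),
]


def run_test_cases(cases):
    """Execute every case and record the actual output and PASS/FAIL status."""
    results = []
    for case_id, text, expected in cases:
        actual = classify_mail(text)
        status = "PASS" if actual == expected else "FAIL"
        results.append((case_id, expected, actual, status))
    return results


results = run_test_cases(TEST_CASES)
for row in results:
    print(row)
```

Each row of the printed result corresponds to one line of a test case table: the test id, the expected output, the actual output and the status.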
6.1 CONCLUSION
Logistic Regression: 96%
Multinomial Naïve Bayes Classification: 98%
RandomForestClassifier: 88%
KnnClassifier: 90%
From this we can conclude that Multinomial Naïve Bayes Classification gives the best prediction accuracy,
98%, when compared to the remaining ML models.
Efficient pattern detection plays a crucial role in spam mail filtering. Using an ML model, spam detection gives the spam
patterns, non-spam patterns and general patterns, which easily identify whether the mail is spam or ham. The
current method which uses the spam detection method does not include the general patterns. RFD gives the general
patterns, from which the user can decide whether he wants to mark the mail as spam or non-spam to avoid the
loss of important mails. The images which are in the form of spam are also detected using File Properties, Histogram
and Hough Transform. The current proposed system is for English language mails, but as future scope we can design
the system for multiple languages.
7. REFERENCES
1. Nikhil Kumar, Sanket Sonowal, Nishant, “Email Spam Detection Using Machine Learning Algorithms”,
IEEE Conference, 2020.
2. https://www.kaggle.com/venky73/spam-mails-dataset
3. https://jpinfotech.org/email-spam-detection-using-machine-learning-algorithms/