Mini Project II Report


FAKE NEWS DETECTOR

MINI PROJECT - II REPORT

Submitted By
MANORANJANI L N (1805054)
MITHRA V (1805057)
MONAL P (1805058)

In partial fulfilment of the requirements for the award of the degree


of
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY

SRI RAMAKRISHNA ENGINEERING COLLEGE


[Educational Service : SNR Sons Charitable Trust]
[Autonomous Institution, Accredited by NAAC with ‘A’ Grade]
[Approved by AICTE and Permanently Affiliated to Anna University, Chennai]
[ISO 9001:2015 Certified and all eligible programmes Accredited by NBA]
VATTAMALAIPALAYAM, N.G.G.O. COLONY POST,
COIMBATORE - 641022

ANNA UNIVERSITY : CHENNAI 600 025


April 2021
ANNA UNIVERSITY : CHENNAI 600 025
BONAFIDE CERTIFICATE

16IT258 - MINI PROJECT II

Certified that this Mini Project - II Report “Fake News Detector” is the
bonafide work of “Manoranjani L N (1805054), Mithra V (1805057),
Monal P (1805058)” who carried out the project work under my
supervision.

SIGNATURE SIGNATURE

Dr. M. Senthamil Selvi Dr. J. Angel Ida Chellam

Head of the Department Assistant Professor(Sr. Grade)

Department of Information Technology Department of Information Technology

Sri Ramakrishna Engineering College Sri Ramakrishna Engineering College

Vattamalaipalayam Vattamalaipalayam

Coimbatore-22 Coimbatore-22

Submitted for University Examination Viva Voce held on ___________

Internal Examiner External Examiner


ABSTRACT

Machine learning provides systems the ability to automatically learn and improve from
experience without being explicitly programmed. The volume of information on social
media networks has been increasing rapidly, which makes it difficult to classify that
information as true or false. People may trust fake news as true and spread it to
non-social-media users, creating a negative impact on the public. In this project, we aimed
to create a model that classifies a news text as fake or true. This model will help identify
fake news so that it can be disregarded, which can avoid some controversies among the
public.
ACKNOWLEDGEMENT

We put forth our hearts and souls to thank the Almighty for being with us through
our achievements and success. We would like to express our unfathomable thanks to our
esteemed and Honorable Managing Trustee Thiru.D.Lakshminarayanaswamy and
Joint Managing Trustee Thiru.R.Sundar for giving us the chance to be a part of this elite
team at Sri Ramakrishna Engineering College, Coimbatore.

We would like to express our sincere thanks to our honorable Principal Dr.N.R.
Alamelu, for the facilities provided to complete this project.

We take the privilege to thank the Head of the Department of Information
Technology, Dr. M. Senthamil Selvi, for her consistent support and encouragement at
every step of our project work.

We wish to convey our special thanks to our academic coordinator, Dr. K. Deepa,
Professor, Information Technology, for her consistent support, timely help and valuable
suggestions during the entire period of our project work.

We would like to express our sincere thanks to our course instructor,
Mr. C. Ranjeeth Kumar, Assistant Professor (Sr.Gr), Information Technology, for his
valuable support in the completion of this project.

We would like to express our sincere thanks to our project guide, Dr. J. Angel Ida
Chellam, Assistant Professor (Sr.Gr), Information Technology, for her valuable support in
the completion of this project.

We extend our sincere gratitude to all the teaching and non-teaching staff of our
department who helped us during our project.
TABLE OF CONTENTS

CHAPTER NO. TITLE PAGE NO.

ABSTRACT iii

ACKNOWLEDGEMENT iv

LIST OF TABLES vii

LIST OF FIGURES viii

1 INTRODUCTION 1

1.2 PROBLEM STATEMENT 3

1.3 APPLICATIONS 3

2 LITERATURE SURVEY 4

2.1 EXISTING SYSTEM 6

2.2 PROPOSED SYSTEM 7

2.3 FLOW DIAGRAM 8


3 SYSTEM IMPLEMENTATION 9

3.1 SOFTWARE REQUIREMENTS 9

3.1.1 PYTHON 3.7 9

3.1.2 GOOGLE COLAB 9

3.2 DESCRIPTION 10

3.3 DATASET DESCRIPTION 10

3.4 MODULE DESCRIPTION 11

3.4.1 DATA PRE-PROCESSING 11

3.4.2 FEATURE EXTRACTION 12

3.4.2.1 EXTRA TREES CLASSIFIER 12

3.4.3 BUILDING THE MODEL 12


3.4.3.1 LINEAR REGRESSION 13

3.4.3.2 RANDOM FOREST 13

3.4.3.3 DECISION TREE 14

3.4.3.4 ADABOOST 14

3.4.3.5 GAUSSIAN PROCESS 15

3.4.3.6 GRADIENT BOOSTING 16

3.4.4 TESTING 16

3.4.4.1 CROSS VALIDATION 16

4 RESULT 17

5 CONCLUSION AND FUTURE SCOPE 19

5.1 CONCLUSION 19
5.2 FUTURE SCOPE 19

6 REFERENCES 20

7 APPENDICES 22
INTRODUCTION

The advent of the World Wide Web and the rapid adoption of social media
platforms paved the way for a spread of information never before witnessed in human
history. Among other use cases, news outlets benefitted from the widespread use of
social media platforms by providing updated news in near real time to their subscribers.
The traditional way of reading news from newspapers, tabloids, and magazines has
moved to digital forms such as online news platforms, blogs, social media feeds, and
other digital media formats. It became easier for consumers to acquire the latest news at
their fingertips.
There has been a rapid increase in the spread of fake news in the last decade. The
proliferation of online articles that do not conform to facts has led to many problems. It
is not limited to politics but covers various other domains such as sports, health, and
science. One area affected by fake news is the financial markets, where a rumor can have
disastrous consequences and may bring the market to a halt.

As a result, people share information without knowing whether it is true or false,
and the trustworthiness of social media decreases. Fake news may intentionally or
unintentionally harm an individual or a group, for purposes such as political or religious
gain.

Our ability to make decisions depends mostly on the type of information we
consume; our worldview is shaped by the information we digest. There is increasing
evidence that consumers have reacted strongly to news that later proved to be fake. One
recent case is the spread of the novel coronavirus, where fake reports about the origin,
nature, count and behavior of the virus spread over the Internet.
PROBLEM STATEMENT

Fake news on social media has become a major problem, and social media users
are misled by it. The increasing spread of fake news creates a negative impact on every
individual. The main aim of this project is to increase the trustworthiness of online news
among people.

APPLICATIONS
LITERATURE SURVEY

Various studies have been conducted on fake news detection. This survey was done
prior to taking up the project to understand the methods used previously. It helped to
identify the benefits and drawbacks of the existing systems.

1. “Survey on Automated System for Fake News Detection using NLP & Machine
Learning Approach” by Subhadra Gurav, Swati Sase, Supriya Shinde, Prachi Wabale,
and Sumit Hirve. They took a news event, analyzed related data from data sources, and
then used various classification algorithms to classify the news as legitimate or fake.

2. “A Tool for Fake News Detection” by Bashar Al Asaad and Madalina Erascu.
In this paper, they proposed machine learning techniques for fake news detection. They
used a dataset of fake and real news to train a machine learning model using the
scikit-learn library in Python, and extracted features from the dataset using text
representation models. They tested two classification approaches, namely probabilistic
classification and linear classification, on the title and the content, checking whether it is
clickbait or non-clickbait, i.e., fake or real.

3. “False Content Detection with Deep Learning Techniques” by Rachana Kunapareddy,
Sri Rohitha Mandala, and Suhasini Sodagudi. In this paper, datasets were collected and
tested from various resources such as BuzzFeed, Kaggle, GitHub and related sites for news
information. The input layer is the initial layer of the work process in the neural network,
while the output layer is the last layer of neurons and produces the final outputs of the program.
EXISTING SYSTEM
PROPOSED SYSTEM

In this project, a model is built based on a count vectorizer or a TF-IDF matrix,
i.e., word tallies relative to how often the words are used in other articles in our dataset.
Since the task is text classification, implementing binomial logistic regression gives
higher accuracy.
The key choices in developing the model were the text transformation (count
vectorizer vs. TF-IDF vectorizer) and the type of text to use (headlines vs. full text).
The next step is to extract the most useful features for the CountVectorizer or
TfidfVectorizer. This is done by using the n most frequent words and/or phrases,
lower-casing or not, removing stop words (common words such as “the”, “when”, and
“there”), and using only those words that appear at least a given number of times in the
text dataset.
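The two text transformations above can be compared directly in scikit-learn. The following is an illustrative sketch; the three-document corpus is invented for demonstration, while the stop-word removal and lower-casing options mirror the choices described above:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus standing in for the news headlines/full text
docs = [
    "the president announced a new policy",
    "the policy was announced on social media",
    "fake reports spread on social media",
]

# Raw word tallies, with common English stop words removed
count_vec = CountVectorizer(stop_words="english", lowercase=True)
counts = count_vec.fit_transform(docs)

# TF-IDF down-weights words that appear in many articles;
# min_df keeps only words appearing in at least that many documents
tfidf_vec = TfidfVectorizer(stop_words="english", lowercase=True, min_df=1)
tfidf = tfidf_vec.fit_transform(docs)

print(counts.shape, tfidf.shape)
```

Both vectorizers produce a documents-by-vocabulary matrix; only the cell values differ (integer counts vs. weighted frequencies).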

LOGISTIC REGRESSION

As the classification of text is based on a wide feature set with a binary
output (true news/fake news), a logistic regression (LR) model is used, since it provides
an intuitive equation to classify problems into binary or multiple classes. We performed
hyperparameter tuning to get the best result for each individual dataset; multiple
parameters were tested before the maximum accuracy was acquired from the LR model.
Mathematically, the logistic regression hypothesis function can be defined as follows:

h(x) = 1 / (1 + e^-(b0 + b1*x))

Logistic regression uses this sigmoid function to transform the output to a probability
value; the objective is to minimize the cost function to achieve an optimal probability.
The cost function is calculated as:

J = -(1/m) * Σ [ y*log(h(x)) + (1 - y)*log(1 - h(x)) ]

FLOW DIAGRAM
SYSTEM IMPLEMENTATION

SOFTWARE REQUIREMENTS

PYTHON 3.7

Python is an interpreted, high-level, general-purpose programming language.
Python's design philosophy emphasizes code readability, notably through its use of
significant indentation. Its language constructs and object-oriented approach aim to help
programmers write clear, logical code for small and large-scale projects.
Python code is easily understandable by humans, which makes it easier to build
models for machine learning. Since Python is a general-purpose language, it can handle
complex machine learning tasks and enables quick prototyping, which allows a product
to be tested for machine learning purposes.

GOOGLE COLAB

Colaboratory, or “Colab” for short, is a product from Google Research. Colab
allows anybody to write and execute arbitrary Python code through the browser, and is
especially well suited to machine learning, data analysis and education. More technically,
Colab is a hosted Jupyter notebook service that requires no setup to use, while providing
free access to computing resources including GPUs.
DESCRIPTION

DATASET DESCRIPTION

Two datasets have been used in this project, namely fake.csv and true.csv. These
datasets have shapes of (23481, 4) and (21417, 4) respectively. The first column contains
the title of the article, the second column contains the text, the third column contains the
subject of the article, and the fourth column contains the date on which the article was
published. The fake news dataset and the true news dataset are labeled as 1 and 0
respectively. Both datasets are merged into a single dataset, which is then divided into
train and test sets: the test set is 20% of the total data and the train set is 80%.
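The labeling, merging and splitting steps above can be sketched with pandas and scikit-learn. The tiny DataFrames below stand in for fake.csv and true.csv (same four columns, but invented rows) so the snippet stays self-contained:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for fake.csv and true.csv with the report's four columns
fake = pd.DataFrame({
    "title": [f"fake title {i}" for i in range(5)],
    "text": [f"fake article body {i}" for i in range(5)],
    "subject": ["news"] * 5,
    "date": ["April 1, 2021"] * 5,
})
true = pd.DataFrame({
    "title": [f"true title {i}" for i in range(5)],
    "text": [f"true article body {i}" for i in range(5)],
    "subject": ["worldnews"] * 5,
    "date": ["April 1, 2021"] * 5,
})

fake["label"] = 1   # fake news labeled 1, as described above
true["label"] = 0   # true news labeled 0

# Merge both datasets into one, then take an 80/20 train/test split
data = pd.concat([fake, true], ignore_index=True)
X_train, X_test, y_train, y_test = train_test_split(
    data["text"], data["label"], test_size=0.20, random_state=42
)
print(len(X_train), len(X_test))
```

With the real 23481 + 21417 rows, the same call would yield roughly 35918 training and 8980 test articles.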

WORKING PRINCIPLE

Input values (x) are combined linearly using weights or coefficient values to
predict an output value (y). A key difference from linear regression is that the output
value being modeled is a binary value (0 or 1) rather than a numeric value. Below is an
equation for logistic regression:
y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

Making predictions with a logistic regression model is as simple as plugging
numbers into the logistic regression equation and calculating the result. The article is
classified as fake when the probability is more than 0.5 and as true when the probability
is less than 0.5.

MODULE DESCRIPTION
DATA COLLECTION

The datasets have been taken from Kaggle; they contain true and fake news
separately.

FAKE DATASET

TRUE DATASET

DATA ANALYSIS

Libraries like NumPy, Pandas, Seaborn and Matplotlib are used for data analysis.
NLTK is the library used for data cleaning, through which the stop words are removed.
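The cleaning step can be sketched as follows. To keep the snippet self-contained, a small inline set stands in for NLTK's `stopwords.words("english")` list (which would normally be fetched via `nltk.download("stopwords")`):

```python
import re

# Small stand-in for nltk.corpus.stopwords.words("english")
STOPWORDS = {"the", "a", "an", "is", "on", "and", "of", "to", "in"}

def clean(text):
    """Lowercase the text, keep only alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(clean("The virus is spreading on the Internet!"))
```

Removing stop words shrinks the vocabulary so the vectorizer focuses on content-bearing words.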
MODEL TRAINING

The regression model is trained using the scikit-learn (sklearn) library, an efficient
library for machine learning in Python. TfidfVectorizer is used for extracting the
features. SGDClassifier is used to train the model with logistic loss. Using the
GridSearchCV function, we get the accuracy/loss for every combination of
hyperparameters and can choose the one with the best performance.
MODEL TESTING

A confusion matrix is created for evaluating the trained model. It is created using
the sklearn library.
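Building the confusion matrix with sklearn can be sketched as below; the test labels and predictions are hypothetical values for illustration (1 = fake, 0 = true, as above):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical ground-truth labels and model predictions
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)                              # rows = actual class, columns = predicted
print(accuracy_score(y_true, y_pred))  # fraction of correct predictions
```

The diagonal cells count correct classifications; off-diagonal cells count the two kinds of error (true news flagged as fake, and fake news missed).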

RESULT

ACCURACY FOR TEST


ADVANTAGES

● People will start to trust the social media platform, and this model saves them
from being misled.
● Fake news can create disputes among people; with the help of this model, such
controversies can be reduced.
● Highly accurate.

DISADVANTAGES

● As our model is created using a machine learning algorithm, it cannot predict
news beyond the patterns in the given dataset.
● This model is not applicable to photos and videos; only text can be classified.

CONCLUSION AND FUTURE SCOPE

CONCLUSION
