FAKE CONSUMER REVIEW DETECTION

A Project

MASTER OF SCIENCE
in
Software Engineering

by

Aishwarya Pendyala

FALL 2019
© 2019
Aishwarya Pendyala
ALL RIGHTS RESERVED
FAKE CONSUMER REVIEW DETECTION
A Project
by
Aishwarya Pendyala
Approved by:
____________________________
Date
Student: Aishwarya Pendyala
I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for electronic submission to the library.
Abstract
of
FAKE CONSUMER REVIEW DETECTION
by
Aishwarya Pendyala
Product reviews and experience stories are useful for the user as well as the vendor. Reviewers can increase a brand's loyalty and help other customers understand their experience with the product. Similarly, reviews help the vendors gain more profit by increasing the sales of their products. Such valuable opinions, however, invite opinion spamming. For example, one may create fake positive reviews to promote a brand's reputation, or try to demote a competitor's products by leaving fake negative reviews on them.
Unlike the existing work, instead of using a constrained dataset I chose to build one large dataset. Sentiment analysis has been incorporated based on the emojis and text content in the reviews. Fake reviews are detected and categorized. The testing results are obtained through the application of Naïve Bayes, Linear SVC, Support Vector Machine, Random Forest and Decision Tree classifiers, which label each review as fake or genuine. The highest accuracy is obtained by using Naïve Bayes together with the sentiment classifier.
_______________________
Date
DEDICATION
To My Family
ACKNOWLEDGEMENTS
I wholeheartedly show sincere gratitude to my project guide, Dr. Jingwei Yang, and to Dr. Jinsong Ouyang for guiding me with their technical expertise, providing me feedback and suggestions for improving this project, and giving me an opportunity to gain and learn through my project experience. I thank Dr. Jagannadha Chidella for his constant support and encouragement.
I am also thankful to my family for their love, support and trust in me throughout my master's.
TABLE OF CONTENTS
Page
Dedication…………………………………………………...………………...…….vii
Acknowledgments………………………………………………………………......viii
List of Tables…………………………………………………………………….…...xi
List of Figures……………………………………………………………………..…xii
Chapter
1. INTRODUCTION…………………………………………………………………..1
2. PROBLEM STATEMENT………………………………………………………… 4
5.3.1 Hardware Configuration .............................................................21
7. RESULTS………………………………………………………………………....31
8. CONCLUSION…………………………………………………………………...40
9. FUTURE WORK…………………………………………………………………41
References………………….…………………………...…………………………....42
LIST OF TABLES
Tables Page
1. Results ................................................................................................................ 37
LIST OF FIGURES
Figures Page
1. Implementation Architecture………………………………...………………....9
3. Data Exploration...…………………………………………………………….13
6. Preprocessing .................................................................................................... 15
13. Reviews.txt........................................................................................................ 21
Chapter 1: Introduction
Everyone can freely express his or her views and opinions anonymously and without the fear of consequences. Social media and online posting have made it even easier to post confidently and openly. These opinions have both pros and cons: they help the right feedback reach the right person who can fix an issue, but they become a problem when they get manipulated. Because these opinions are regarded as valuable, people with malicious intentions can easily game the system, posting opinions that give an impression of genuineness in order to promote their own product or to discredit competitors' products and services, without revealing their own identity or that of the organization they work for. Such people are called opinion spammers, and these activities can be termed opinion spamming.
There are a few different types of opinion spamming. One type gives undeserved positive opinions to some products in order to promote them; another gives untrue or malicious negative opinions to other products in order to damage their reputation. A lot of research work has been done in the field of sentiment analysis, building models that apply different sentiment analysis techniques to data from various sources, but the primary focus is on the algorithms and not on actual fake review detection. One of many research works, by E. I. Elmurngi and A. Gherbi [1], used machine learning algorithms to classify product reviews on the Amazon.com dataset [2], including customer usage of the product and buying experiences. Opinion mining, a type of natural language processing that tracks the emotions and thought processes of people or users, is used to collect and examine opinions about a product made in social media posts, comments, online product and service reviews, or even tweets. An automated opinion mining system can be built using software that extracts such knowledge from a dataset.
One of the biggest applications of opinion mining is in the online and e-commerce reviews of consumer products, feedback and services. As these opinions are so helpful for both the user and the seller, e-commerce web sites ask their customers to leave feedback and a review about the product or service they purchased. These reviews provide valuable information that is used by potential customers to learn the opinions of previous or current users before they decide to purchase that product from that seller. Similarly, sellers and service providers use this information to identify any defects or problems users face with their products and to understand the competitive landscape.
There is a lot of scope for using opinion mining, and it has many applications for different usages:
Consumers/Buyers: Opinion mining helps buyers compare competing products before taking a decision, without missing out on any other better option.
Businesses/Sellers: Opinion mining helps the sellers reach their audience and understand their perception of the product as well as of the competitors. Such reviews also help the sellers understand issues or defects so that they can improve later versions of their product. These days, encouraging consumers to write a review about a product has become a good strategy for marketing the product through a real audience's voice. However, such precious information gets spammed and manipulated. Among many studies, one fascinating piece of research was done to identify such fake reviews.
People write unworthy positive reviews about products to promote them. In some cases, malicious negative reviews are given to other (competing) products in order to damage their reputation. Some of these consist of non-reviews (e.g., ads and promotions) that contain no genuine opinion at all.
The first challenge here is that a word can be positive in one situation while being negative in another. For example, the word "long" is a positive opinion when it describes a laptop's battery life, while the same word describing the start-up time expresses a negative opinion. This shows that an opinion mining system trained only on words from opinions cannot capture this behavior of a word taking a different meaning in different situations.
Another challenge is that people don't always express opinions the same way. Most traditional text processing techniques assume that small differences in text don't change the meaning much. However, in opinion mining a small change matters: "the service was great" and "the service was not that great" express opposite opinions while differing by only a word.
Finally, in some cases people give contradictory statements, which makes it difficult to anticipate the nature of the opinion. There could be a hidden positive sense in a negative review, and sometimes there is both a positive and a negative opinion about the product. An emotion factor can add a lot to what a person says or expresses, for instance by adding a negative emoji to a positive comment or vice versa. In the millennial world of texting, people have replaced long sentences with short forms and emoticons. These emoticons, when used in text format, are composed of punctuation characters, and there is a good chance that they will be lost while cleaning the text.
Beyond all these challenges, detecting the reviews that are not genuine, or that are used to steer consumers' opinion in a certain direction, becomes even more difficult. Opinion spamming, and thus fake review detection, is a significant problem for e-commerce sites and other service providers, as consumers these days rely heavily on such opinions and reviews.
A lack of genuine feedback, and the practice of creating fake reviews and ratings to support the products on one's own website in order to improve reputation and sales, is unfair and misleading. This has become a common practice these days, which increases the need for a fake review detector.
In a recent study, a method was proposed by E. I. Elmurngi and A. Gherbi [1] using an open source software tool called Weka to implement machine learning algorithms that use sentiment analysis to classify fair and unfair Amazon reviews based on three categories of words: positive, negative and neutral. In that research work, spam reviews are identified by considering only the helpfulness votes cast by customers along with the rating deviation, which limits the overall performance of the system. Also, as per the researchers' observations and experimental results, the existing system uses a Naive Bayes classifier for spam and non-spam classification whose accuracy is quite low, and which therefore may not provide accurate results.
Benevenuto and co-authors [5] have proposed solutions that depend only on the features present in the data set, applying different machine learning algorithms to detect fake news on social media. Though different machine learning algorithms are applied, the approach falls short in showing how well the detection generalizes.
B. Wagh, J. V. Shinde and P. A. Kale [6] worked on Twitter data, analyzing the tweets posted by users with sentiment analysis to classify them into positive and negative. They made use of K-Nearest Neighbors as a strategy to assign sentiment labels, training and testing the set using feature vectors. However, the applicability of their approach to fake review detection is limited.
To address the major problem that online websites face due to opinion spamming, this project proposes to identify such spammed reviews by classifying them as fake or genuine. The method attempts to classify, with greater accuracy, the reviews obtained from freely available datasets from various sources and categories, including service based, product based, customer feedback, experience based and the crawled Amazon dataset, using the Naïve Bayes [7], Linear SVC, SVM, Random Forest and Decision Tree algorithms. In order to improve the accuracy, additional features such as the sentiment of the review, verified purchase, ratings, emoji count and product category, compared with the overall score, are used in addition to the review details.
A classifier is built based on the identified features, and those features are assigned a probability factor or a weight depending on the classified training sets. The high-level architecture of the implementation can be seen in Figure 1. The review data was gathered from multiple sources such as Amazon, websites for booking airlines, hotels and restaurants, CarGurus, and other review sites. Doing so increased the diversity of the review data. A dataset of 21,000 reviews was created.
Processing and refining the data by removing irrelevant and redundant information, as well as noisy and unreliable data, is known as preprocessing. It is performed in the following steps:
The entire review is given as input and is tokenized into sentences using the NLTK package.
Punctuation marks used at the beginning and end of the reviews are removed, along with extra white spaces.
Each individual review is tokenized into words and stored in a list for easier retrieval.
Affixes are removed to obtain the stem of each word. For example, the stem of "cooking" is "cook", and the stemming algorithm knows that the "ing" suffix can be removed.
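As an illustration of these preprocessing steps, the sketch below shows how a single review could be tokenized, cleaned and stemmed with NLTK. It is only a minimal sketch, not the project's exact code; the example review and the helper name preprocess are hypothetical.

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads of the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")

def preprocess(review_text):
    """Tokenize a review, drop punctuation and stop words, and stem every remaining word."""
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))

    words = []
    for sentence in sent_tokenize(review_text):      # review -> sentences
        for token in word_tokenize(sentence):        # sentence -> word tokens
            token = token.strip(string.punctuation).lower()
            if token and token not in stop_words:
                words.append(stemmer.stem(token))    # e.g. "cooking" -> "cook"
    return words

# Hypothetical example review.
print(preprocess("I loved cooking with this pan! Cleaning it was easy."))
```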
Reviewer ID: a reviewer posting multiple reviews under the same Reviewer ID can be an indicator of spamming.
Rating: fake reviews in most scenarios have 5 out of 5 stars to entice the customer, or the lowest rating for the competing products; thus the rating plays an important role in fake review detection.
Verified Purchase: reviews that are fake have a lower chance of coming from a verified purchase.
Thus, this combination of features is selected for identifying the fake reviews.
Sentiment analysis determines whether a review is positive, negative or neutral. It includes predicting whether the review is positive or negative according to the words used in the text, the emojis used, the rating given to the review, and so on. Related research [8] shows that fake reviews carry stronger positive or negative emotions than true reviews. The reason is that fake reviews are written to affect people's opinions, so conveying opinions matters more to their authors than plainly describing the facts. The subjective vs. objective ratio matters: advertisers post fake reviews with more subjective than objective information, expressing emotions such as how happy the product made them rather than conveying what the product is or does. Positive sentiment vs. negative sentiment: the sentiment of the review is analyzed, which in turn helps in deciding whether it is fake or genuine.
The goal of classification is to accurately predict the target class for each case in the data. Each entry in the review file is assigned a weight, depending upon which it is classified as fake or genuine. Comparing the accuracies of the various models and classifiers, with the enhancements applied to the datasets, against the fake and genuine labels helps us cross-validate the classification results.
Collection of data is done by choosing an appropriate dataset. Datasets of such reviews with labels were found from different sources, such as hotel reviews, Amazon product reviews, and other freely available review datasets, and combined into the Reviews.txt file. Firstly, the combined data is explored, as shown in Figure 3. Then, to make it readable, the labels in the dataset are clearly marked as fake or genuine, as shown in Figure 4.
The dataset created from multiple sources of information has many forms of redundant and unclean values. Such data is neither useful nor easy to model.
Preprocessing: The data has been cleaned by removing all null values, white spaces and punctuation. This raw dataset is loaded in the form of <ID, Review text, Label> tuples using the code shown in Figure 5, allowing us to focus only on the textual review content. Then the raw data is preprocessed by applying tokenization, removal of stop words, and stemming.
Figure 6: Preprocessing
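The loading step described above can be sketched as follows. Since the exact layout of Reviews.txt is not reproduced here, the sketch assumes a tab-separated line format of ID, review text and label; the function name load_reviews is hypothetical.

```python
def load_reviews(path):
    """Load the dataset as a list of (ID, review text, label) tuples.

    Assumes each line looks like: <ID> <tab> <review text> <tab> <label>,
    with label 'F' (fake) or 'T' (genuine); adjust to the real file layout.
    """
    reviews = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue                       # skip blank lines
            review_id, text, label = line.split("\t")
            reviews.append((review_id, text, label))
    return reviews

reviews = load_reviews("Reviews.txt")   # assumes the combined dataset file is present
print(len(reviews), reviews[0])
```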
Feature Extraction: The text reviews have different features or peculiarities that can help solve the classification problem, for example the length of the review (fake reviews tend to be shorter, with fewer facts revealed about the product) and repetitive words (fake reviews have a smaller vocabulary, with words repeated). Apart from just the review text, there are other features that can contribute towards classifying reviews as fake. Some of the significant ones used as additional features are the rating, verified purchase and product category. The code snippet used to extract them is shown in Figure 7.
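A rough sketch of such feature extraction with pandas is given below; the in-line rows and the column names (RATING, VERIFIED_PURCHASE, PRODUCT_CATEGORY, REVIEW_TEXT, LABEL) are assumptions standing in for the actual dataset schema, not the project's exact code.

```python
import pandas as pd

# Hypothetical rows standing in for the records parsed from Reviews.txt.
df = pd.DataFrame({
    "RATING": [5, 3, 5],
    "VERIFIED_PURCHASE": ["N", "Y", "Y"],
    "PRODUCT_CATEGORY": ["Instruments", "Books", "Electronics"],
    "REVIEW_TEXT": ["Best ever buy now", "Solid read, slow start", "Battery lasts long"],
    "LABEL": ["fake", "genuine", "genuine"],
})

# Simple per-review features used alongside the review text.
df["REVIEW_LENGTH"] = df["REVIEW_TEXT"].str.split().str.len()     # fake reviews tend to be shorter
df["IS_VERIFIED"] = (df["VERIFIED_PURCHASE"] == "Y").astype(int)  # verified purchases are less likely fake
df["IS_FIVE_STAR"] = (df["RATING"] == 5).astype(int)              # extreme ratings are a warning sign

print(df[["REVIEW_LENGTH", "IS_VERIFIED", "IS_FIVE_STAR", "LABEL"]])
```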
Figure 8 and Figure 9 show the count of the reviews for each feature.
Sentiment Analysis: The processed data is now analyzed for emotion or sentiment, i.e. whether the review is positive or negative. The significant factors for the sentiment analysis of the reviews are the sentiment scores of the emoticons used and the rating of the review. Note that while removing the punctuation marks, a list of emoticons is treated as an exception so that we do not remove or discard them by accident while cleaning the dataset. This sentiment classification is carried out with classifiers such as Naïve Bayes, Linear SVC, non-linear SVM and Random Forest to obtain better accuracy.
Fake Review Detection: The final goal of the project is to classify these reviews as fake or genuine. The preprocessed dataset is thus classified using different classification algorithms.
The experimental configuration for both classifiers was kept the same, and this section describes the configurations used to set up the models for training in Python.
Naïve Bayes [7] and a Decision Tree classifier are used for detecting the genuine (T) and fake (F) reviews across the whole data set. The probability for each word is given by the ratio of the frequency of that word within a class to the total number of words in that class. The dataset is split into 80% training and 20% testing: 16,800 reviews for training and 4,200 for testing. Finally, the data is tested using a test set in which the probability of each review is calculated for each class, and the class with the highest probability value determines the label assigned to the review, i.e. true/genuine (T) or fake (F). The datasets used for training are F-train.txt and T-train.txt. They include the Review ID (e.g. ID-1100) as well as the review text ("Great product"), as shown below in Figure 10 and Figure 11 respectively.
[Figures 10-12: sample entries showing the Review ID and review text columns.]
Figure 12 contains the testing dataset, which has only the ID and text for each review; the output after running the model is stored in output.txt, which contains the resulting label for each review.
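To make the word-probability idea above concrete, here is a simplified, self-contained sketch. The toy training examples are hypothetical stand-ins for F-train.txt and T-train.txt, and add-one smoothing is assumed; it is an illustration of the calculation, not the project's exact implementation.

```python
from collections import Counter

# Hypothetical toy training data: (review text, label), 'T' = genuine, 'F' = fake.
train = [("great product works well", "T"),
         ("battery life is long and solid", "T"),
         ("amazing amazing best product ever buy now", "F"),
         ("best deal huge sale unbelievable price", "F")]

# Word frequencies and word totals per class.
counts = {"T": Counter(), "F": Counter()}
for text, label in train:
    counts[label].update(text.split())
totals = {label: sum(c.values()) for label, c in counts.items()}
vocab = set(counts["T"]) | set(counts["F"])

def word_prob(word, label):
    """P(word | class) = frequency of the word in the class / total words in the class,
    with add-one smoothing so unseen words do not zero out the product."""
    return (counts[label][word] + 1) / (totals[label] + len(vocab))

def classify(text):
    """Assign the class whose product of word probabilities is highest."""
    scores = {label: 1.0 for label in ("T", "F")}
    for word in text.split():
        for label in scores:
            scores[label] *= word_prob(word, label)
    return max(scores, key=scores.get)

print(classify("best product buy now"))   # expected to lean towards 'F'
```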
The sklearn-based classifiers were also used for classification and compared with each other.
a. Multinomial Naïve Bayes: the Naive Bayes classifier [7] is used in natural language processing (NLP) problems; it predicts the tag of a text by calculating the probability of each tag and picking the most likely one.
b. LinearSVC: this classifier classifies data by finding the best-fit hyperplane that separates the classes.
c. SVC: different studies have shown that if you use the default kernel in SVC(), the Radial Basis Function (rbf) kernel, you get a more non-linear decision boundary; depending on the dataset, this can vastly outperform a linear decision boundary.
d. Random Forest: this algorithm, provided by the sklearn library, has also been used for classification; it creates multiple decision trees on randomly selected subsets of the training data.
For these classifiers the Reviews.txt dataset is used. Figure 13 shows the dataset.
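The comparison of these four classifiers could be set up roughly as sketched below; the TF-IDF vectorizer and the tiny in-line dataset are assumptions used to keep the sketch self-contained, not the project's exact setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC

# Tiny hypothetical data; in the project the texts and labels come from Reviews.txt.
texts = ["great product, works as described", "BEST EVER buy now amazing deal",
         "battery life is long and solid", "unbelievable sale five stars!!!"] * 10
labels = ["T", "F", "T", "F"] * 10

# 80% training / 20% testing split, as in the experiments.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "LinearSVC": LinearSVC(),
    "SVC (rbf kernel)": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

for name, model in models.items():
    pipeline = make_pipeline(TfidfVectorizer(), model)   # text -> TF-IDF features -> classifier
    pipeline.fit(X_train, y_train)
    print(name, accuracy_score(y_test, pipeline.predict(X_test)))
```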
After the application of all these classifiers, the accuracy of each is compared and their performance is evaluated for classification of the fake reviews. Some further enhancements were also made to the models, as discussed in the upcoming Chapter 6. These provided even better accuracy results for classifying the fake reviews.
The machine on which this project was built is a personal computer with the following configuration:
RAM: 8GB
Python 3.5.2
First, NumPy provides a collection of high-level mathematical functions to support multi-dimensional matrices and arrays; it is used for faster computations over the data.
The project makes use of the Anaconda environment, an open source distribution for Python which simplifies package management and deployment and is well suited for large-scale data processing.
The biggest challenge was generalizing the behavior to datasets the model was never trained on. In a real-life situation, we can never train a model on every possible scenario; nor is it possible to gather datasets for all kinds of reviews, since they depend on varied dialects. The following are a few techniques or strategies that significantly improved the models' accuracy in classifying the reviews as fake or genuine. They are applied in different phases of the project, making the models more effective, and are discussed in the following sections.
6.1 Enhancement 1
Use a predefined sentiment word list to count the sentiment words in each review. This is based on research whose results have shown that the more sentiment words a review contains, the higher the chance of it being fake. The review text is compared against a list of sentiment words, and the ratio of matching words to the total number of words is computed. This ratio is considered as one of the factors in determining fake reviews, and it is applied during preprocessing as well as classification.
The predefined sentiment list compiled by B. Liu and M. Hu [8] consists of two parts:
a. Positive Words
b. Negative Words
These are included in the sentimentwordlist.txt file for further reference, and a glimpse of the list is shown below. The code snippet that makes use of this sentiment list is shown in Figure 15, and a small sketch of the ratio computation is given below.
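The sketch below illustrates the sentiment-word ratio; the two tiny word sets are stand-ins for the full positive and negative lists read from sentimentwordlist.txt, and the function name is hypothetical.

```python
# Tiny stand-in lexicons; the project reads the full Liu and Hu lists from sentimentwordlist.txt.
POSITIVE_WORDS = {"great", "amazing", "love", "excellent", "best"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "poor", "worst"}

def sentiment_word_ratio(tokens):
    """Ratio of sentiment-bearing words to all words in one tokenized review.

    A higher ratio is treated as a weak signal that the review may be fake.
    """
    if not tokens:
        return 0.0
    hits = sum(1 for word in tokens
               if word in POSITIVE_WORDS or word in NEGATIVE_WORDS)
    return hits / len(tokens)

print(sentiment_word_ratio("amazing best product love it".split()))  # 0.6
```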
6.2 Enhancement 2
Compare the number of verbs and nouns in each review. This is based on research whose results have shown that the more verbs a review has relative to nouns, the higher the chance of it being fake. One of the more powerful aspects of NLTK for Python is its built-in part-of-speech tagger, which can be used in the preprocessing phase of the project. Using NLTK part-of-speech tagging, the review text can be tagged with verbs and nouns, and the counts can then be compared [9]. The code snippet to do that is shown below in Figure 16, and a short sketch follows.
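This verb/noun comparison can be sketched with NLTK's tagger as follows; the example review and the threshold rule are only illustrative.

```python
import nltk

# One-time downloads of the tokenizer and tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

def verb_noun_counts(review_text):
    """Count verbs (VB*) and nouns (NN*) in a review using NLTK part-of-speech tags."""
    tags = nltk.pos_tag(nltk.word_tokenize(review_text))
    verbs = sum(1 for _, tag in tags if tag.startswith("VB"))
    nouns = sum(1 for _, tag in tags if tag.startswith("NN"))
    return verbs, nouns

verbs, nouns = verb_noun_counts(
    "I absolutely loved it, bought it twice and recommended it to everyone")
# More verbs than nouns is treated as one weak indicator of a fake review.
print(verbs, nouns, "suspicious" if verbs > nouns else "ok")
```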
6.3 Enhancement 3
There are reviews by some users that involve discount prices or sales at some store, intended to divert buyers to certain sites. These are mostly for promotional purposes and are done intentionally, mostly by sellers. To consider them in the fake classification, keywords that are common in such reviews are used to identify them, for example:
1. profit
2. sale
3. percent
4. dollars
Reviews using such words are flagged as fake in the testing dataset. Though this is a simple rule-based check, it helps capture promotional reviews; a sketch of the check is shown below.
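A check for these promotional keywords might look like the following sketch; the flagging rule itself is a rough heuristic, not the project's exact logic.

```python
# Keywords that frequently appear in promotional, seller-planted reviews.
PROMO_KEYWORDS = {"profit", "sale", "percent", "dollars"}

def is_promotional(tokens):
    """Flag a tokenized review as promotional (treated as fake) if it uses any promo keyword."""
    return any(word.lower() in PROMO_KEYWORDS for word in tokens)

print(is_promotional("huge sale this week only ten dollars".split()))  # True
print(is_promotional("works fine after two months".split()))           # False
```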
6.4 Enhancement 4
Use the emojis present in the reviews, as they demonstrate the sentiment of the reviewer. The list of emoticons [10] that can be classified as positive, negative or neutral is shown in Figure 17 below.
Positive emojis: [...]
Negative emojis: [...]
Neutral emojis: [...]
Reference: https://li.st/jesseno/positive-negative-and-neutral-emojis-6EGfnd2QhBsa3t6Gp0FRP9
Figure 17: Emoji Classification
When cleaning the dataset during preprocessing, all punctuation is removed from the review text, in line with the sentiment research on emojis [11]. The emojis are kept as an exception by placing them in a separate list, 'items_to_keep[]', while processing the review text. An emoji sentiment ranking expresses the intensity of an emotion as an integer polarity. Some of the most commonly used emojis were selected from a list of 751 emojis with respect to their frequency and distinctiveness in the emoji scores [12]. These scores are referred to when finding the sentiment of the reviews that contain emojis; they are shown below in Figure 19.
The sentiment scores can be assigned according to the UTF-8 code of each emoji. The emoticon score recognition using UTF-8 and the corresponding code snippet are included in Figure 20, and a simplified sketch is given below. This score is used to determine the sentiment of the review, which in turn helps determine the genuineness of the review. Sentiment classification is done using all the same classification algorithms, but before the actual fake review detection step.
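The emoji scoring can be sketched as a lookup from the emoji character to an integer polarity; the three scores below are hypothetical placeholders for the values in Figure 19.

```python
# Hypothetical polarity scores keyed by emoji character (stand-ins for the Figure 19 values).
EMOJI_SCORES = {
    "\U0001F600": 1,   # grinning face -> positive
    "\U0001F621": -1,  # pouting face  -> negative
    "\U0001F610": 0,   # neutral face  -> neutral
}

def emoji_sentiment(review_text):
    """Sum the polarity scores of all known emojis found in a review."""
    return sum(EMOJI_SCORES.get(char, 0) for char in review_text)

review = "Terrible battery \U0001F621 but nice screen \U0001F600\U0001F600"
print(emoji_sentiment(review))  # 1: two positive emojis and one negative
```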
Chapter 7: Results
Data visualization:
The following visualizations show the kind of data that was used. The first depicts how many reviews there are in each product category for each label in Reviews.txt, where label means fake or genuine; for example, for the category Instruments there are 350 reviews with the label fake.
Next, we observe the number of occurrences of reviews for each rating versus the label they have. For example, the number of reviews with a fake label rated 5 out of 5 is higher than the number of reviews with a fake label rated 3. Figure 22 shows the Label vs Rating code snippet, and the comparison of Label vs Rating is shown in Figure 23; a simplified sketch of such a plot follows.
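A plot of this kind could be produced with pandas and seaborn as sketched below; the in-line DataFrame and the column names Label and Rating are assumptions standing in for the data loaded from Reviews.txt.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical frame standing in for the data loaded from Reviews.txt.
df = pd.DataFrame({
    "Label": ["fake", "fake", "genuine", "fake", "genuine", "genuine"],
    "Rating": [5, 5, 4, 5, 3, 5],
})

# Count of reviews per rating, split by fake/genuine label.
sns.countplot(data=df, x="Rating", hue="Label")
plt.title("Label vs Rating")
plt.show()
```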
We then observe the number of occurrences of reviews containing emojis versus the label they have. For example, the number of reviews with a fake label that contain emojis is lower than the number with a genuine label. Figure 24 shows the Label vs Emojis count code snippet, and the comparison of Label vs Emojis is shown in Figure 25.
Similarly, we observe the number of occurrences of reviews by stop-word count versus the label they have. For example, reviews with a fake label contain fewer stop words than reviews with a genuine label. Figure 26 shows the Label vs Stopwords count code snippet, and the comparison of Label vs Stopwords count is shown in Figure 27.
Finally, we observe the number of occurrences of verified purchases versus the label the reviews have. For example, reviews with a fake label have far fewer verified purchases than reviews with a genuine label. Figure 28 shows the Label vs Verified Purchases code snippet, and the comparison of Label vs Verified Purchases is shown in Figure 29.
These snippets of code can be found in the DataVisualization.ipynb file for further reference. The output.txt file is the result generated by the TextBlob Naïve Bayes classifier.
The accuracy scores obtained for this dataset are as follows:
Accuracy: 80.542
F1 score: 77.888
Precision: 80.612
Recall: 79.001
The following results were observed for each of the previously described experimental setups. The results show how the accuracy improved after each enhancement.
Table 1: Results
Another plot of the results is shown in Figure 31, which depicts a bar chart for each classifier in a different color, for the full dataset of 21,000 reviews.
Figure 31: Accuracies of each classifier across the experimental setups (values ranging from roughly 67% to 84%).
Raw data is loaded from the Reviews.txt file, and by just parsing and tokenizing it, the accuracy of each model in predicting whether reviews are fake or genuine is calculated. The best results were obtained using the Naïve Bayes classifier, as evident in the figure.
With additional features such as verified purchase, ratings and the product category of the review included, the accuracy of each model in predicting whether reviews are fake or genuine is calculated again. Previously, the only data features used were the ID, Text and Label tuple of each review in the dataset. After utilizing these other features, the accuracy of the models increased, and the improvements can be seen in the results.
The Testing Dataset setup covers the classification accuracy for the reviews in the testing dataset. Here, as observed, the non-linear SVM classifier performed the best and could give 81% accuracy, showing that it could generalize and predict the fake reviews more reliably.
Sentiment classification predicts the sentiment of the reviews according to the emojis used, the ratio of positive to negative words, and the rating given to the review. This sentiment classification is in turn used in predicting whether the reviews are fake or genuine, and the accuracy results show how each model performed on sentiment classification.
Enhancement 1 is used in predicting the sentiment of the reviews using the list of sentiment words. Enhancement 2 compares the number of verbs and nouns in each review; its effect on accuracy can be regarded as too small to be included in the results. The emoji-based scoring of Enhancement 4 is the addition that helped the most: it improved the sentiment analysis of the reviews and in turn helped the models predict whether the reviews are fake or genuine.
Chapter 8: Conclusion
The fake review detection system is designed for filtering out fake reviews. In this research work, SVM classification provided better classification accuracy than the Naïve Bayes classifier on the testing dataset, revealing that it can generalize better and predict the fake reviews efficiently; on the other hand, the Naïve Bayes classifier performed better than the other algorithms on the training data. This method can be applied to other sampled instances of the dataset. The data visualization helped in exploring the dataset, and the features identified contributed to the accuracy of the classification. The various algorithms used, and their accuracies, show how each of them performed.
Also, the approach provides the user with functionality to recommend the most truthful reviews, enabling the purchaser to make informed decisions about the product. Various factors, such as adding new feature vectors like ratings, emojis and verified purchase, have affected the accuracy of the classification.
Chapter 9: Future Work
1. To use real-time or time-based datasets, which will allow us to compare the timestamps of a user's reviews to find whether a certain user is posting too many reviews in a short period of time.
2. To use and compare other machine learning algorithms, such as logistic regression, against the current models.
3. To develop a similar process with unsupervised learning for unlabeled data to detect fake reviews.
References
3. J. Li, M. Ott, C. Cardie and E. Hovy, “Towards a General Rule for Identifying Deceptive
Opinion Spam,” in Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics, Baltimore, MD, USA, vol. 1, no. 11, pp. 1566-1576,
November 2014.
6. B. Wagh, J. V. Shinde and P. A. Kale, “A Twitter Sentiment Analysis Using NLTK and
Machine Learning Techniques,” International Journal of Emerging Research in
Management and Technology, vol. 6, no. 12, pp. 37-44, December 2017.
7. A. McCallum and K. Nigam, “A Comparison of Event Models for Naive Bayes Text
Classification,” in Proceedings of AAAI-98 Workshop on Learning for Text
Categorization, Pittsburgh, PA, USA, vol. 752, no. 1, pp. 41-48, July 1998.
8. B. Liu and M. Hu, “Opinion Mining, Sentiment Analysis and Opinion Spam Detection,”
[Online]. Available: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon
[Accessed: January 2019].
9. C. Hill, “10 Secrets to Uncovering which Online Reviews are Fake,” [Online]. Available:
https://www.marketwatch.com/story/10-secrets-to-uncovering-which-online-reviews-
are-fake-2018-09-21 [Accessed: March 2019].