Sentiment Analysis of Product-Based Reviews Using Machine Learning Approaches
BY
ANUSUYA DHARA (CSE/2014/041)
ARKADEB SAHA (CSE/2014/048)
SOURISH SENGUPTA (CSE/2014/049)
PRANIT BOSE (CSE/2014/060)
………………………………………
Project Supervisor
Department of Computer Science and Engineering
RCC Institute of Information Technology
Countersigned:
………………………………………
Head
Department of Computer Science & Engineering
RCC Institute of Information Technology
Kolkata – 700015
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
RCC INSTITUTE OF INFORMATION TECHNOLOGY
CERTIFICATE OF APPROVAL
1. ———————————
2. ———————————
(Signature of Examiners)
ACKNOWLEDGEMENT
We would like to express our special thanks and gratitude to our guide, Dr. Anup Kumar Kolya,
who gave us the opportunity to work on this project on the topic of
"Sentiment Analysis of Product-Based Reviews Using Machine Learning Approaches". The work
led us into extensive research that broadened our knowledge considerably, for which we are
thankful.
We would also like to thank our parents and friends, who supported us greatly in completing
this project within the limited time frame.
--------------------------------------------------
ANUSUYA DHARA (CSE/2014/041)
--------------------------------------------------
ARKADEB SAHA (CSE/2014/048)
--------------------------------------------------
SOURISH SENGUPTA (CSE/2014/049)
--------------------------------------------------
PRANIT BOSE (CSE/2014/060)
Table of Contents

1. Abstract
2. Introduction
3. Review of Literature
4. Objective of the Project
5. System Design
6. Methodology for Implementation
7. Implementation Details
8. Results and Sample Output
9. Conclusion
Appendix
References
1. Abstract

Sentiment Analysis, also known as Opinion Mining, refers to the use of natural language
processing and text analysis to systematically identify, extract, quantify, and study affective
states and subjective information. Sentiment analysis is widely applied to reviews and
survey responses, online and social media, and healthcare materials, for applications that
range from marketing to customer service to clinical medicine.
In this project, we aim to perform sentiment analysis of product-based reviews. The data used
in this project are online product reviews collected from amazon.com. We expect to perform
review-level categorization of the review data with promising outcomes.
2. Introduction

"It is a quite boring movie... but the scenes were good enough."
The given line is a movie review stating that "it" (the movie) is
quite boring but that the scenes were good. Understanding such sentiments requires
multiple tasks.
Hence, SENTIMENT ANALYSIS is a kind of text classification
based on the Sentiment Orientation (SO) of the opinions a text contains.
Sentiment analysis of product reviews has recently become very popular in text
mining and computational linguistics research.
3. Review of Literature
A max-entropy POS tagger is used to classify the words of each sentence, with an
additional Python program to speed up the process. Negation words like "no" and "not"
are included among the adverbs, whereas Negation of Adjective and Negation of
Verb are used specifically to identify negated phrases.
The following classification models are selected for
categorization: Naïve Bayesian, Random Forest, Logistic Regression, and Support
Vector Machine.
For feature selection, Pang and Lee suggested removing objective sentences by
extracting subjective ones. They proposed a text-categorization technique that is able
to identify subjective content using minimum cuts. Gann et al. selected 6,799 tokens
based on Twitter data, where each token is assigned a sentiment score, namely TSI
(Total Sentiment Index), characterizing it as a positive or a negative token.
Specifically, the TSI for a certain token is computed as:

TSI = (p - (tp/tn) * n) / (p + (tp/tn) * n)

where p is the number of times the token appears in positive tweets, n is the
number of times it appears in negative tweets, and tp/tn is the ratio of the total
number of positive tweets to the total number of negative tweets in the dataset.
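As an illustration, this scoring can be sketched in a few lines of Python (the function
name and the counts below are illustrative, not the Gann et al. implementation):

def tsi(p, n, total_pos, total_neg):
    # p: occurrences of the token in positive tweets
    # n: occurrences of the token in negative tweets
    # total_pos/total_neg balances unequal class sizes
    ratio = total_pos / total_neg
    return (p - ratio * n) / (p + ratio * n)

# Example: a token seen 120 times in positive and 30 times in negative tweets,
# in a corpus of 60,000 positive and 40,000 negative tweets.
print(round(tsi(120, 30, 60000, 40000), 3))  # 0.455: the token leans positive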
4. Objective of the Project
5. System Design
Hardware Requirements:
Core i5/i7 processor
At least 8 GB RAM
At least 60 GB of Usable Hard Disk Space
Software Requirements:
Python 3.x
Anaconda Distribution
NLTK Toolkit
UNIX/Linux operating system
Data Information:
The Amazon reviews dataset consists of reviews from Amazon. The data span
a period of 18 years, comprising ~35 million reviews up to March 2013.
Reviews include product and user information, ratings, and a plain-text review.
For more information, please refer to the following paper: J. McAuley and J.
Leskovec. Hidden factors and hidden topics: understanding rating dimensions
with review text. RecSys, 2013.
Data Format:
The dataset we use is a .json file. A sample record from the dataset is given below.
{
    "reviewSummary": "Surprisingly delightful",
    "reviewText": "This is a first read filled with unexpected humor and profound insights into the art of politics and policy. In brief, it is sly, wry, and wise.",
    "reviewRating": "4"
}
6. Methodology for Implementation
(Formulation/Algorithm)
DATA COLLECTION:
The data are product reviews collected from amazon.com from May
1996 to July 2014. Each review includes the following information: 1) reviewer ID; 2)
product ID; 3) rating; 4) time of the review; 5) helpfulness; 6) review text. Every rating is
based on a 5-star scale, so all ratings range from 1 star to 5 stars, with no
half-star or quarter-star ratings.
SENTIMENT CLASSIFICATION ALGORITHMS:
The Naïve Bayesian classifier works as follows: suppose that there exists a set of
training data, D, in which each tuple is represented by an n-dimensional feature
vector, X = (x1, x2, ..., xn), indicating n measurements made on the tuple from n attributes or
features. Assume that there are m classes, C1, C2, ..., Cm. Given a tuple X, the classifier will
predict that X belongs to Ci if and only if P(Ci|X) > P(Cj|X), where i, j ∈ [1, m] and i ≠ j.
P(Ci|X) is computed by Bayes' rule as:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)
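A minimal sketch of this decision rule on word-count features, using scikit-learn's
MultinomialNB (the toy reviews and labels are illustrative, not drawn from our dataset):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: label 1 = positive review, 0 = negative review.
docs = ["great product, works well", "terrible quality, broke fast",
        "really good value", "bad experience, not good"]
labels = [1, 0, 1, 0]

vect = CountVectorizer()
X = vect.fit_transform(docs)          # word-count feature vectors x1..xn
clf = MultinomialNB().fit(X, labels)  # estimates P(Ci) and P(xk|Ci)

# The predicted class is the Ci maximizing P(Ci|X).
print(clf.predict(vect.transform(["great value"])))  # expected: [1]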
Random Forest
The random forest classifier was chosen due to its superior performance over a single
decision tree with respect to accuracy. It is essentially an ensemble method based on
bagging. The classifier works as follows: given D, the classifier first creates k bootstrap
samples of D, each of them denoted Di. A Di has the same number of tuples
as D, sampled with replacement from D. Sampling with replacement means
that some of the original tuples of D may not be included in Di, whereas others may occur
more than once. The classifier then constructs a decision tree based on each Di. As a result,
a "forest" consisting of k decision trees is formed.
To classify an unknown tuple, X, each tree returns its class prediction, counting as one vote.
The final decision for X's class is assigned to the class that receives the most votes.
Each tree is grown by splitting on the attribute that minimizes the Gini index of the
resulting partitions:

Gini(D) = 1 - Σ (i = 1..m) pi²

where pi is the probability that a tuple in D belongs to class Ci. The Gini index measures
the impurity of D: the lower the index value, the better D was partitioned.
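A minimal sketch of this voting scheme with scikit-learn's RandomForestClassifier
(the synthetic data stands in for our review feature vectors; k corresponds to
n_estimators):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary dataset standing in for the review features.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# k bootstrap samples -> k trees; Gini impurity is the default split criterion.
forest = RandomForestClassifier(n_estimators=100, criterion='gini', random_state=42)
forest.fit(X, y)

# Each tree casts one vote; the majority class is returned.
print(forest.predict(X[:5]))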
Support vector machine (SVM) is a method for the classification of both linear and
nonlinear data. If the data is linearly separable, the SVM searches for the linear optimal
separating hyperplane (the linear kernel), which is a decision boundary that separates data
of one class from another. Mathematically, a separating hyperplane can be written
as W·X + b = 0, where W = (w1, w2, ..., wn) is a weight vector, X is a training tuple, and b is
a scalar. To optimize the hyperplane, the problem essentially reduces to the
minimization of ∥W∥, subject to the constraints:

W·Xi + b ≥ +1 if yi = +1
W·Xi + b ≤ -1 if yi = -1
If the data is linearly inseparable, the SVM uses nonlinear mapping to transform the data
into a higher dimension. It then solves the problem by finding a linear hyperplane in that
space. Functions that perform such transformations are called kernel functions. The kernel
function selected for our experiment is the Gaussian Radial Basis Function (RBF):

K(Xi, Xj) = exp(-γ ∥Xi - Xj∥²)

where Xi are support vectors, Xj are testing tuples, and γ is a free parameter that uses the
default value from scikit-learn in our experiment. The figure on the next page shows a
classification example of SVM based on the linear kernel and the RBF kernel.
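A minimal sketch contrasting the two kernels on linearly inseparable data (toy data;
gamma='scale' is scikit-learn's default choice for γ):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not separable by any line in the original space.
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)
rbf_svm = SVC(kernel='rbf', gamma='scale').fit(X, y)  # K(Xi,Xj)=exp(-γ||Xi-Xj||²)

print('linear accuracy:', linear_svm.score(X, y))  # near chance on this data
print('rbf accuracy:', rbf_svm.score(X, y))        # close to 1.0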
Logistic Regression
Logistic regression predicts the probability of an outcome that can have only two
values (i.e. a dichotomy). The prediction is based on the use of one or several predictors
(numerical and categorical). A linear regression is not appropriate for predicting the
value of a binary variable, for two reasons:
i) A linear regression will predict values outside the acceptable range (e.g. predicting
probabilities outside the range 0 to 1).
ii) Since dichotomous experiments can have only one of two possible values
for each trial, the residuals will not be normally distributed about
the predicted line.
On the other hand, a logistic regression produces a logistic curve, which is limited to
values between 0 and 1. Logistic regression is similar to a linear regression, but the
curve is constructed using the natural logarithm of the "odds" of the target variable,
rather than the probability. Moreover, the predictors do not have to be normally
distributed or have equal variance in each group.
Logistic regression uses maximum likelihood estimation (MLE) to obtain the model
coefficients that relate the predictors to the target. After an initial function is estimated,
the process is repeated until the LL (log likelihood) no longer changes significantly.
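A minimal sketch of the bounded predictions in practice, using scikit-learn's
LogisticRegression (whose solver performs the maximum-likelihood fit; the
one-dimensional data below is illustrative only):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy predictor and binary target (illustrative data only).
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)  # coefficients obtained by MLE

# Unlike linear regression, the outputs stay within (0, 1).
print(model.predict_proba([[2.0]]))  # [[P(y=0), P(y=1)]]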
7. Implementation Details
The training of the dataset consists of the following steps:
i) Initial fetching of the data is done using Python file handlers.
ii) Reviews rated 3 out of 5 are removed, as they signify a neutral review, and all we
are concerned with is positive and negative reviews.
iii) The entire task of preprocessing the review data is handled by the
utility class "NltkPreprocessor".
iv) The time required to prepare the data is then displayed.
Preprocessing Data: This is a vital part of training the dataset. Words in the file are
considered both as single words and as pairs of words, because a single word can be
misleading: for example, "bad" on its own is negative, but "not bad" reads as positive.
In such cases, considering only single words would train the model on the wrong signal.
So word pairs are checked for the occurrence of modifiers before an adjective, which,
if present, can give the phrase a different meaning, as the sketch below illustrates.
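A minimal sketch of this with scikit-learn's ngram_range option (the same option used
in the appendix pipeline):

from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams alone split "not bad" into "not" and "bad", losing the reversal;
# ngram_range=(1, 2) also keeps "not bad" as a single feature.
vect = TfidfVectorizer(ngram_range=(1, 2))
vect.fit(["the movie was not bad", "the movie was bad"])
print(vect.get_feature_names_out())  # includes the bigram 'not bad'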
Training Data / Evaluation: The main chunk of code that does the whole
evaluation of sentiment analysis based on the preprocessed data is part of this section.
The following steps are performed:
i) Naïve Bayes, Logistic Regression, Linear SVM and Random Forest classifiers are
applied to the dataset for evaluation of sentiments.
ii) The accuracy, precision, recall, and evaluation time are calculated and displayed.
iii) Prediction on the test data is done, and the confusion matrix of the prediction is displayed.
iv) Total positive and negative reviews are counted.
v) A review-like sentence is taken as input on the console; the console outputs 1 for a
positive input and 0 for a negative input.
8. Results and Sample Output
The ultimate outcome of training on the public review dataset is that the
machine is capable of judging whether an entered sentence bears a positive or a
negative response.
Precision (also called positive predictive value) is the fraction of relevant
instances among the retrieved instances, while recall (also known as sensitivity) is
the fraction of relevant instances that have been retrieved out of the total number of
relevant instances. Both precision and recall are therefore based on an understanding
and measure of relevance.
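In terms of true positives (TP), false positives (FP), and false negatives (FN), they
reduce to the standard formulas:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)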
When using normalized units, the area under the curve (often referred to as
simply the AUC) is equal to the probability that a classifier will rank a randomly
chosen positive instance higher than a randomly chosen negative one (assuming
'positive' ranks higher than 'negative'). This can be seen as follows: with TPR(T) and
FPR(T) denoting the true and false positive rates at threshold T, the area under the
curve is given by

A = ∫ TPR(T) FPR′(T) dT

(the integral boundaries are reversed, as a large T has a lower value
on the x-axis).
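This rank interpretation can be checked with a small simulation (synthetic scores;
roc_auc_score is scikit-learn's AUC implementation):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, 5000)  # scores of positive instances
neg = rng.normal(0.0, 1.0, 5000)  # scores of negative instances

y = np.r_[np.ones(5000), np.zeros(5000)]
scores = np.r_[pos, neg]

# The two numbers below agree up to sampling noise: AUC equals the
# probability that a random positive outranks a random negative.
print(roc_auc_score(y, scores))
print((pos[:, None] > neg[None, :]).mean())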
The machine evaluates the accuracy of training on the data, along with the precision,
recall and F1 score.
The confusion matrix of the evaluation is calculated.
The model is thus capable of judging an externally written review as positive or
negative.
A positive review is marked as [1], and a negative review is marked
as [0].
[Figure: confusion matrix layout, with True/False Positive and True/False Negative quadrants]
The confusion matrices of the classifiers are as follows:

68556   69928
11470   10098

69963   62695
10063   17331
The following are images of sample output after successful dataset training
using the classifiers:
[Figures: sample console output screenshots]
[Figure: bar graph showing the frequency of each rating in the dataset]
[Figure: bar graph showing the score of each classifier after successful training; the
parameters are F1 score, accuracy, precision, recall, and ROC AUC]
9. Conclusion

Sentiment analysis deals with the classification of texts based on the sentiments they
contain. This report focuses on a typical sentiment analysis model consisting of three
core steps, namely data preparation, review analysis and sentiment classification, and
describes representative techniques involved in those steps.
Appendix
Code:
Loading the dataset:
import json
import pickle

import numpy as np
from matplotlib import pyplot as plt
from textblob import TextBlob

# The JSON-parsing pass that built reviewText/reviewRating was run once and
# its results pickled; the commented lines below are kept from that pass.
# fileHandler.close()
# saveReviewText = open('review_text.pkl', 'wb')
# saveReviewRating = open('review_rating.pkl', 'wb')
# pickle.dump(reviewText, saveReviewText)
# pickle.dump(reviewRating, saveReviewRating)

# Load the pickled review texts and ratings.
reviewTextFile = open('review_text.pkl', 'rb')
reviewRatingFile = open('review_rating.pkl', 'rb')
reviewText = pickle.load(reviewTextFile)
reviewRating = pickle.load(reviewRatingFile)
# print(len(reviewText))
# print(reviewText[0])
# print(reviewRating[0])

# Plot a histogram of the ratings. This line was commented out in the original
# listing although the plotting code depends on it; cast in case the ratings
# were pickled as strings.
ratings = np.array(reviewRating, dtype=int)
plt.hist(ratings, bins=np.arange(ratings.min(), ratings.max() + 2) - 0.5, rwidth=0.7)
plt.xlabel('Rating', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.title('Histogram of Ratings', fontsize=18)
plt.show()

# Detect non-English reviews with TextBlob and record their indices by
# language (detect_language() needs a network connection and is deprecated
# in recent TextBlob versions).
lang = {}
for i, review in enumerate(reviewText):
    tb = TextBlob(review)
    l = tb.detect_language()
    if l != 'en':
        lang.setdefault(l, [])
        lang[l].append(i)
        print(i, l)
print(lang)
Scraping data:
from bs4 import BeautifulSoup
import openpyxl
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class Review():
    def __init__(self):
        self.rating = ""
        self.info = ""
        self.review = ""


def scrape():
    options = Options()
    options.add_argument("--headless")       # Runs Chrome in headless mode.
    options.add_argument('--no-sandbox')     # Bypass OS security model.
    options.add_argument('start-maximized')
    options.add_argument('disable-infobars')
    options.add_argument("--disable-extensions")
    # The options object must be passed to the driver, otherwise the arguments
    # above have no effect (a bug in the original listing).
    driver = webdriver.Chrome(executable_path=r'C:\chromedriver\chromedriver.exe',
                              options=options)
    url = ('https://www.amazon.com/Moto-PLUS-5th-Generation-Exclusive/product-'
           'reviews/B0785NN142/ref=cm_cr_arp_d_paging_btm_2?ie=UTF8'
           '&reviewerType=all_reviews&pageNumber=5')
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    ul = soup.find_all('div', class_='a-section review')
    review_list = []
    for d in ul:
        a = d.find('div', class_='a-row')
        sib = a.findNextSibling()
        b = d.find('div', class_='a-row a-spacing-medium review-data')
        new_r = Review()
        new_r.rating = a.text
        new_r.info = sib.text
        new_r.review = b.text
        review_list.append(new_r)
    driver.quit()
    return review_list


def main():
    m = scrape()
    # Load the workbook once, write all rows, then save once (the original
    # listing reloaded and resaved the file on every iteration).
    book = openpyxl.load_workbook('Sample.xlsx')
    sheet = book['Sample Sheet']
    center = openpyxl.styles.Alignment(horizontal='center', vertical='center',
                                       wrap_text=True)
    for i, r in enumerate(m, start=1):
        sheet.cell(row=i, column=1).value = r.rating
        sheet.cell(row=i, column=1).alignment = center
        sheet.cell(row=i, column=3).value = r.info
        sheet.cell(row=i, column=3).alignment = center
        # Store the text as str; encoding to bytes is unnecessary in Python 3.
        sheet.cell(row=i, column=5).value = r.review
        sheet.cell(row=i, column=5).alignment = center
    book.save('Sample.xlsx')


if __name__ == '__main__':
    main()
Preprocessing Data:
import string

from nltk import pos_tag
from nltk import sent_tokenize
from nltk import wordpunct_tokenize
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords as sw
from nltk.corpus import wordnet as wn


class NltkPreprocessor:
    def __init__(self, stopwords=None, punct=None, lower=True, strip=True):
        self.lower = lower
        self.strip = strip
        self.stopwords = stopwords or set(sw.words('english'))
        self.punct = punct or set(string.punctuation)
        self.lemmatizer = WordNetLemmatizer()

    def tokenize(self, document):
        # The body of this method was truncated in the report; the version
        # below is a plausible reconstruction around the surviving lines.
        tokenized_doc = []
        for sent in sent_tokenize(document):
            for token, tag in pos_tag(wordpunct_tokenize(sent)):
                token = token.lower() if self.lower else token
                token = token.strip() if self.strip else token
                if token in self.stopwords:
                    continue
                if all(char in self.punct for char in token):
                    continue
                # Map the Penn Treebank tag to a WordNet POS for lemmatization.
                pos = {'N': wn.NOUN, 'V': wn.VERB,
                       'R': wn.ADV, 'J': wn.ADJ}.get(tag[0], wn.NOUN)
                tokenized_doc.append(self.lemmatizer.lemmatize(token, pos))
        return tokenized_doc
Sentiment Analysis:
import ast
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, chi2, SelectPercentile, f_classif
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.svm import LinearSVC
# from textblob import TextBlob
from time import time
def getInitialData(data_file):
    print('Fetching initial data...')
    t = time()
    i = 0
    df = {}
    with open(data_file, 'r') as file_handler:
        for review in file_handler.readlines():
            df[i] = ast.literal_eval(review)
            i += 1
    # Build a DataFrame from the parsed records and pickle it for later runs;
    # this step is inferred from the read_pickle() call further below.
    reviews_df = pd.DataFrame.from_dict(df, orient='index')
    reviews_df.to_pickle('reviews_digital_music.pickle')
    print('Fetching data completed!')
    print('Fetching time: ', round(time() - t, 3), 's\n')
# def filterLanguage(text):
# text_blob = TextBlob(text)
# return text_blob.detect_language()
def prepareData(reviews_df):
    print('Preparing data...')
    t = time()
    stemmer = SnowballStemmer('english')
    stop_words = stopwords.words('english')
    # print(len(reviews_df.reviewText))
    # filterLanguage = lambda text: TextBlob(text).detect_language()
    # reviews_df = reviews_df[reviews_df['reviewText'].apply(filterLanguage) == 'en']
    # print(len(reviews_df.reviewText))
    # The cleaning below reconstructs the truncated body (column names follow
    # the McAuley Amazon dataset, where 'overall' is the star rating): drop
    # neutral 3-star reviews, binarize the rating, strip non-letters, then
    # remove stop words and stem.
    reviews_df = reviews_df[reviews_df['overall'] != 3]
    reviews_df['sentiment'] = (reviews_df['overall'] > 3).astype(int)
    clean = lambda text: ' '.join(stemmer.stem(w)
                                  for w in re.sub('[^a-zA-Z]', ' ', text).lower().split()
                                  if w not in stop_words)
    reviews_df['reviewText'] = reviews_df['reviewText'].apply(clean)
    # Label second-to-last, text last: the positions preprocessData() reads.
    reviews_df[['sentiment', 'reviewText']].to_pickle('reviews_digital_music_preprocessed.pickle')
    print('Preparing data completed!')
    print('Preparing time: ', round(time() - t, 3), 's\n')
def preprocessData(reviews_df_preprocessed):
    print('Preprocessing data...')
    t = time()
    # Text is in the last column, the binary label in the second-to-last.
    X = reviews_df_preprocessed.iloc[:, -1].values
    y = reviews_df_preprocessed.iloc[:, -2].values
    print('Preprocessing completed!')
    print('Preprocessing time: ', round(time() - t, 3), 's\n')
    return X, y
# getInitialData('datasets/reviews_digital_music.json')
# reviews_df = pd.read_pickle('reviews_digital_music.pickle')
# prepareData(reviews_df)
reviews_df_preprocessed = pd.read_pickle('reviews_digital_music_preprocessed.pickle')
# print(reviews_df_preprocessed.isnull().values.sum()) # Check for any null values
X, y = preprocessData(reviews_df_preprocessed)

# 80/20 split; the original listing omitted this step, so the ratio is an assumption.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

print('Training data...')
t = time()
pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 2), stop_words='english',
                             sublinear_tf=True)),
    ('chi', SelectKBest(score_func=chi2, k=50000)),
    ('clf', LinearSVC(C=1.0, penalty='l1', max_iter=3000, dual=False,
                      class_weight='balanced'))
])
model = pipeline.fit(X_train, y_train)
print('Training completed!')
print('Training time: ', round(time() - t, 3), 's\n')


def evaluate(y_true, y_pred):
    # Reconstructed helper: the original evaluate() definition was missing.
    print('Accuracy: ', accuracy_score(y_true, y_pred))
    print('Precision: ', precision_score(y_true, y_pred))
    print('Recall: ', recall_score(y_true, y_pred))
    print('F1: ', f1_score(y_true, y_pred))
    print('ROC AUC: ', roc_auc_score(y_true, y_pred))
    print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))


t = time()
prediction = model.predict(X_test)
print('Prediction completed!')
print('Prediction time: ', round(time() - t, 3), 's\n')
evaluate(y_test, prediction)
print()
l = (y_test == 0).sum() + (y_test == 1).sum()
s = y_test.sum()
print('Total number of observations: ' + str(l))
print('Positives in observation: ' + str(s))
print('Negatives in observation: ' + str(l - s))
# Share of positive labels; positives form the majority class in this dataset.
print('Majority class is: ' + str(s / l * 100) + '%')
# The setup lines below (figure, bar positions, and per-classifier score lists)
# were missing from the report; the score values are placeholders to be
# replaced with the measured results, not our actual numbers.
import numpy as np
from matplotlib import pyplot as plt

score_MNB = [80.0, 81.0, 80.0, 80.0, 80.0]   # placeholder values
score_LR = [85.0, 86.0, 85.0, 85.0, 85.0]    # placeholder values
score_LSVC = [87.0, 88.0, 87.0, 87.0, 87.0]  # placeholder values
score_RF = [83.0, 84.0, 83.0, 83.0, 83.0]    # placeholder values

fig, ax = plt.subplots()
index = np.arange(len(score_MNB))
bar_width = 0.2
opacity = 0.8
error_config = {'ecolor': '0.3'}

rects1 = ax.bar(index, score_MNB, bar_width,
                alpha=opacity, color='b',
                error_kw=error_config,
                label='Multinomial Naive Bayes')
z = index + bar_width
rects2 = ax.bar(z, score_LR, bar_width,
                alpha=opacity, color='r',
                error_kw=error_config,
                label='Logistic Regression')
z = z + bar_width
rects3 = ax.bar(z, score_LSVC, bar_width,
                alpha=opacity, color='y',
                error_kw=error_config,
                label='Linear SVM')
z = z + bar_width
rects4 = ax.bar(z, score_RF, bar_width,
                alpha=opacity, color='g',
                error_kw=error_config,
                label='Random Forest')

ax.set_xlabel('Score Parameters')
ax.set_ylabel('Scores (in %)')
ax.set_title('Scores of Classifiers')
# Center the tick under the middle of each group of four bars.
ax.set_xticks(index + 1.5 * bar_width)
ax.set_xticklabels(('F1', 'Accuracy', 'Precision', 'Recall', 'ROC AUC'))
ax.legend(bbox_to_anchor=(1, 1.02), loc=5, borderaxespad=0)
fig.tight_layout()
plt.show()
References