"Sentiment Analysis of Survey Comments: Animesh Tilak

Download as docx, pdf, or txt
Download as docx, pdf, or txt
You are on page 1of 12

“SENTIMENT ANALYSIS OF SURVEY COMMENTS”

BY

ANIMESH TILAK
(CSE)

ABSTRACT

This project performs Sentiment Analysis of survey comments using Natural
Language Processing techniques and analyzes the performance of various
Machine Learning algorithms on those comments, after converting the
unstructured data into structured data for ease of analysis.

Survey comments are classified into binary categories, i.e. positive
comments or negative comments, on the basis of the words used in them.

Machine Learning classifiers are used to categorize these comments as
accurately as possible, and the performance of the classifiers is compared
on the same dataset.

Introduction
The inflow of unstructured data is increasing rapidly day by day, and it must be
classified to extract meaningful insight from it. Sentiment Analysis can be
used in many fields: product performance analysis in the market, training
chatbots to respond with specific sentiments, content rating for blogs,
posts and videos, and story summarization. Sentiment Analysis is also used
in page-ranking systems for various search engines.

A dataset of survey comments labeled as negative or positive is taken;
it contains many comments of each class.

These unstructured comments are converted into structured data as vectors.
These vectors, labeled as negative or positive, train the model to classify
test comments into the positive or negative category.
Software Requirements Specification
Tools Used

1. Anaconda Navigator
2. Jupyter Notebook

Language and Libraries Used

1. Python 3.6
2. Numpy
3. Pandas
4. Sklearn
Project Planning & Implementation
Converting unstructured data into structured data

Survey comments are imported from a text file in which the comments are
line-separated and labeled as negative or positive. A dictionary
(vocabulary) is then created from all the words in both the negative and
positive comments.
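
The report does not include the loading code; the sketch below shows one way this step could look. The file name survey_comments.txt and the tab-separated "comment<TAB>label" layout are assumptions, since the report only states that the comments are line-separated and labeled.

# Minimal loading sketch. File name and "comment<TAB>label" layout are
# assumptions; the report only says comments are line-separated and labeled.
comments, labels = [], []
with open("survey_comments.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        text, label = line.rsplit("\t", 1)
        comments.append(text)
        labels.append(int(label))  # 1 = positive, 0 = negative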

Using Count-Vectorizer

A Count-Vectorizer object is initialized with English stop words removed
and is fed the created dictionary. The Count-Vectorizer object assigns a
unique index to each word in the dictionary.

Each single-line comment is then passed to the Count-Vectorizer object,
which converts it from unstructured English text into a 1-D vector of word
counts: if a comment contains a word, the position given by that word's
index holds the frequency of the word in the comment, and all other
positions are 0.
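
A minimal sketch of this step with scikit-learn's CountVectorizer follows; the toy comments stand in for the actual survey data.

from sklearn.feature_extraction.text import CountVectorizer

# Toy comments standing in for the survey data.
comments = ["the product is great",
            "the service was terrible",
            "great support and great price"]

# stop_words="english" drops common English stop words, as described above.
vectorizer = CountVectorizer(stop_words="english")
X_counts = vectorizer.fit_transform(comments)   # sparse matrix of word counts

print(vectorizer.vocabulary_)   # word -> unique column index
print(X_counts.toarray())       # each row: word frequencies for one comment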

Extracting Features by Tf-idf Transformer


Tf means term frequency, while idf means inverse document frequency. This
is a common term-weighting scheme in information retrieval that has also
been found useful in document classification.
The Tf-idf Transformer takes the count vectors and gives each word a weight
according to its importance for the classification.
The formula used to compute the tf-idf for a term t of a document d in a
document set is tf-idf(t, d) = tf(t, d) * idf(t), and with smooth_idf=False
the idf is computed as idf(t) = log [ n / df(t) ] + 1, where n is the total
number of documents in the document set and df(t) is the document frequency
of t, i.e. the number of documents in the set that contain the term t. The
effect of adding “1” to the idf is that terms with zero idf, i.e. terms
that occur in all documents in the training set, are not entirely ignored.
(Note that this idf formula differs from the standard textbook notation,
which defines the idf as idf(t) = log [ n / (df(t) + 1) ].)

If smooth_idf=True (the default), the constant “1” is added to the numerator
and denominator of the idf, as if an extra document containing every term
in the collection exactly once had been seen, which prevents zero divisions:
idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.
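
As a rough sketch of this step with scikit-learn (toy counts, not the project's data):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

comments = ["good product", "bad product", "good service"]  # toy data
X_counts = CountVectorizer().fit_transform(comments)

# smooth_idf=True (the default) uses the smoothed idf formula shown above.
tfidf = TfidfTransformer(smooth_idf=True)
X_tfidf = tfidf.fit_transform(X_counts)

print(tfidf.idf_)         # per-term idf weights
print(X_tfidf.toarray())  # L2-normalized tf-idf vectors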

Classification using Machine Learning Algorithms


Classification is done with the help of classifiers like Logistic Regression,
Support Vector Machine, Gaussian Naive Bayes, Multinomial Naive Bayes
and K-Nearest Neighbors.
Logistic Regression
Logistic Regression is well suited to the classification of binary
categorical data. The comment vectors are classified as positive or
negative by this classifier.

Logistic Regression relies on the sigmoid (logistic) function,
sigmoid(z) = 1 / (1 + e^(-z)), which squashes any real value into a value
between 0 and 1 that can be read as a class probability.
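
A quick illustration of this squashing behaviour in NumPy:

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real value into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # approx. [0.018, 0.5, 0.982]
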
GAUSSIAN NAIVE BAYES
Naive Bayes classifiers work well on text data because they treat each word
as independent of the others; the Gaussian variant additionally assumes the
feature values follow a normal distribution. Words become features and
contribute, according to their weights, to classifying the comments that
have been converted into vectors.
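
A minimal sketch with scikit-learn's GaussianNB on toy data (note that GaussianNB expects dense arrays, so the sparse counts are densified):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

comments = ["great product", "awful service", "great support", "awful price"]
labels = [1, 0, 1, 0]  # toy labels: 1 = positive, 0 = negative

vec = CountVectorizer()
X = vec.fit_transform(comments).toarray()  # GaussianNB needs dense input

clf = GaussianNB().fit(X, labels)
print(clf.predict(vec.transform(["great price"]).toarray()))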

MULTINOMIAL NAIVE BAYES


It works much like Gaussian Naive Bayes; both classifiers use likelihood
estimates to calculate probabilities. Gaussian Naive Bayes, however, has a
limitation: when an unseen word appears that is not in the created
dictionary, it assigns a probability of zero, which is not the right
decision. Multinomial Naive Bayes overcomes this limitation with smoothing.

The smoothed probability of feature i given class k is estimated as

θki = (Nki + α) / (Nk + α·n)

where Nki is the number of times feature i appears in samples of class k in
the training set T, Nk is the total count of all features for class k, and
n is the number of features. The smoothing prior α ≥ 0 accounts for
features not present in the learning samples and prevents zero probabilities
in further computations. Setting α = 1 is called Laplace smoothing,
while α < 1 is called Lidstone smoothing.
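
A minimal sketch with scikit-learn's MultinomialNB; alpha=1.0 applies the Laplace smoothing described above (toy data again):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

comments = ["great product", "awful service", "great support", "awful price"]
labels = [1, 0, 1, 0]  # toy labels: 1 = positive, 0 = negative

vec = CountVectorizer()
X = vec.fit_transform(comments)

# alpha=1.0 is Laplace smoothing: unseen words no longer zero out probabilities.
clf = MultinomialNB(alpha=1.0).fit(X, labels)
print(clf.predict(vec.transform(["great price"])))
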
SUPPORT VECTOR MACHINE
A Support Vector Machine (SVM) is a discriminative classifier formally
defined by a separating hyperplane. In other words, given labelled training
data (supervised learning), the algorithm outputs an optimal hyperplane
that categorizes new examples. In two-dimensional space this hyperplane is
a line dividing the plane into two parts, with each class lying on one side.
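
A minimal sketch with scikit-learn's LinearSVC (a linear kernel suits sparse text vectors; toy data again):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

comments = ["great product", "awful service", "great support", "awful price"]
labels = [1, 0, 1, 0]  # toy labels: 1 = positive, 0 = negative

vec = TfidfVectorizer()
X = vec.fit_transform(comments)

# LinearSVC fits the separating hyperplane described above.
clf = LinearSVC().fit(X, labels)
print(clf.predict(vec.transform(["awful product"])))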

K-NEAREST NEIGHBORS
K-Nearest Neighbors is one of the most basic yet essential classification
algorithms in Machine Learning. It belongs to the supervised learning
domain and finds wide application in pattern recognition, data mining
and intrusion detection.
It is widely applicable in real-life scenarios because it is
non-parametric: it makes no underlying assumptions about the distribution
of the data (as opposed to algorithms that assume, for example, a Gaussian
distribution of the given data).
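
A minimal sketch with scikit-learn's KNeighborsClassifier (toy data; k=3 means a majority vote among the 3 nearest training comments):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

comments = ["great product", "awful service", "great support", "awful price"]
labels = [1, 0, 1, 0]  # toy labels: 1 = positive, 0 = negative

vec = TfidfVectorizer()
X = vec.fit_transform(comments)

# Each test comment is labelled by a majority vote of its 3 nearest neighbours.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
print(clf.predict(vec.transform(["awful product"])))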

ACCURACY SCORE
To measure the performance of each classifier we use the accuracy score,
which is calculated by comparing the predicted labels with the actual labels.

Accuracy = (No. of correct Predictions) / (No. of total Predictions)
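
With scikit-learn this is a one-liner; a toy illustration:

from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0]  # actual labels (toy example)
y_pred = [1, 0, 0, 1, 0]  # predicted labels

print(accuracy_score(y_true, y_pred))  # 4 correct out of 5 -> 0.8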


Screenshots of Project

[Screenshot: Data loading]
[Screenshot: Data pre-processing]
[Screenshot: Model training]
[Screenshot: Model training & accuracy score]
Conclusion and Future Scope
Sentiment analysis works better when the unstructured data is converted
into structured data, because machine learning models handle numerical data
better than categorical or natural-language data. After applying the
different classifiers, it is observed that Logistic Regression, Multinomial
Naive Bayes and the Support Vector Machine perform very well at classifying
binary data. An accuracy of 75% is a good result given that the dataset was
small; it is hard for classifiers to learn when only a small dataset is
available for training.

Performance increased because, during data pre-processing, the Tf-idf
Transformer gave each word a weight according to its importance, which made
classification easier for the classifiers. Therefore, by changing the data
pre-processing, feature selection and feature engineering methods, higher
performance can be achieved.

Maximum accuracy is achieved by Logistic Regression, followed by the
Support Vector Machine (SVM).
