Sentiment Analysis Using Deep Learning Technique CNN With Kmeans


International Journal of Pure and Applied Mathematics

Volume 114 No. 11 2017, 47-57


ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu
Special Issue

Sentiment Analysis Using Deep Learning Technique CNN with KMeans
B. Swathi Lakshmi, P. Sini Raj and R. Raj Vikram
Department of Computer Science and Engineering, Amrita School of
Engineering, Amrita Vishwa Vidyapeetham, Ettimadai, Coimbatore.

Abstract
Sentiment analysis already plays a vital role across social media. Whether on
social networking sites or in video- or audio-based systems, people want to
know the sentiment expressed, that is, whether a word is positive, negative or
neutral. Sentiment analysis in turn helps to detect emotions. In the proposed
work, sentiment analysis is used to assess the reviews of a particular movie by
applying a novel combination of the deep learning technique CNN and the
unsupervised learning method K-means to movie reviews, which gives a better
estimate of the sentiments than the existing methods. The improvement in
accuracy is minimal here, but it is expected to become more significant when
the method is applied to a larger corpus of big data.
Key Words: CNN, deep learning, K means, sentiment analysis.


1. Introduction
Sentiment analysis, also known as opinion mining, is a way to find the opinions
of a user and to categorize them as positive, negative or neutral. Nowadays
sentiment analysis has shown its significance in almost all fields of media,
and it is closely tied to natural language processing. When users express their
views, it is important for an organization to identify their requirements
correctly in order to retain them as customers. For that, a deep understanding
of the customers' opinions [1][3] is important. By analyzing customers' product
reviews, a company can more easily decide the future of a product. In the same
way, it is very important to analyze the comments posted on social media [1].
Twitter analytics has become a field in its own right, and studies show the
impact of tweets even on sensitive fields such as market prediction [9]. The
applications of sentiment analysis are diverse, ranging from centrifugal pumps
to social media [5].

The centrifugal pump is used in many applications. In a centrifugal pump,
including the monoblock type, monitoring is essential so that any fault is
detected. Neural network algorithms such as ANN (Artificial Neural Network)
have been used for this [6], but the accuracy produced by these algorithms is
not satisfactory, although they gave a better outcome in the monoblock case. In
the algorithmic process, features are extracted and used for training, and the
resulting fault classifications are compared. The major advantage is that the
operator can be informed about the status of the pump well in advance. If the
status is negative, the necessary precautions can be taken. In contrast,
Support Vector Machine (SVM) and Proximal Support Vector Machine (PSVM) provide
a better outcome under both good and faulty conditions of a monoblock
centrifugal pump. In this machine learning process, a decision tree is used for
feature extraction. The extracted features are fed as inputs to SVM and PSVM,
which are trained and tested, and their fault classifications are compared [11].

Raymond Hsu et al. proposed an experimental model which was implemented at
Stanford University [8]. The process involves raw data, a parser, a spell
checker, synset features and LIWC features. This helped in identifying the
sentiments of the data [10].

In stock market prediction, it was very difficult to collect the enormous
amount of data in a short span of time, so sentiment analysis plays a vital
part. The prediction was done using two algorithms: the genetic algorithm (GA)
and the support vector machine (SVM). Hybrid systems were proposed in order to
avoid regression problems and to handle the existing problem with satisfactory
accuracy. The algorithm is applied to the previous day's records so that the
targets can be achieved. The decision tree parameters were optimized by the GA
and SVM to improve accuracy. Once trading has begun for the day, trades must be
carried out so as to obtain the highest possible profit [11]. Basically, the
next day's prediction is made by keeping track of the previous records. This
helps in achieving the fixed target, and it is implemented as a part of the
hybrid systems.

Classifying tweets [2] accurately as positive, negative or neutral is
noteworthy. Research on sentiment analysis of blogs has grown rapidly. As the
population grows, the number of blog and microblog users has also increased in
a short span of time. This leads to a lot of unformatted, bulky and imprecise
text. To overcome this, sentiment analysis is widely used and is considered one
of the most efficient parts of the deep learning process. Various methods are
involved in sentiment analysis, of which feature extraction is the most
efficient part. Opinion mining, by contrast, has many drawbacks [4]: it
concentrates on only a one-dimensional feature, unlike sentiment analysis. To
avoid these problems, Jeong et al. proposed FEROM (Feature Extraction and
Refinement Method), which extracts the appropriate grammars and features by
scanning the whole blog content. This method checks each grammar, and features
are extracted by merging exactly matching words. This raised a challenge for
keyword extraction, which was proposed by Fan and Chang from the concept of
contextual advertising, in relation to the advertisements on a blog page.
Traditional keyword extraction can be used only for searching or featuring
formal documents such as traditional blogs, newspapers or scientific papers. In
addition to traditional keyword extraction, frequency-based extraction was
introduced for extracting features from microblogs [4]. In addition to
frequency-based extraction, graphical model extraction was also introduced [12].

The words in sentiment analysis are classified on the basis of semantic
orientation (SO); that is, a word is classified using its weight, polarity and
strength. Semantic orientation is extremely helpful in determining marketing
reviews, compiling reviews, etc. In general, semantic orientation refers to the
strength of words, phrases or texts, in addition to the sentiment analysis
which is the main goal of our process [16]. Semantic orientation involves
adjectives, adverbs, verbs, nouns, words, phrases and texts.

We start with each tweet. Then, for each word, the sentiment dictionary is
consulted: if an emoticon [12] is found, it is scored as positive, negative or
neutral; else if a contextual word, i.e. a Contextual Valence Shifter [9], is
found, its valence is calculated; otherwise, if a sentiment word is found, its
positive or negative valence is calculated. Finally, all positive, negative and
neutral values are summed for each sentence.
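
The scoring procedure above can be sketched in a few lines of Python. This is
only a minimal illustration: the three toy lexicons, the names add_valence and
score_tweet, and the rule that a valence shifter multiplies the next sentiment
word are assumptions made for the example, not the dictionaries or rules of the
cited work.

```python
# Hypothetical toy lexicons; real systems use much larger dictionaries.
EMOTICONS = {":)": 1.0, ":(": -1.0, ":|": 0.0}
VALENCE_SHIFTERS = {"not": -1.0, "never": -1.0, "very": 1.5}
SENTIMENT_WORDS = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}


def add_valence(totals, value):
    """Add one valence to the positive, negative or neutral total."""
    if value > 0:
        totals["positive"] += value
    elif value < 0:
        totals["negative"] += -value
    else:
        totals["neutral"] += 1.0


def score_tweet(tweet):
    """Sum positive, negative and neutral valences over the tokens of a tweet."""
    totals = {"positive": 0.0, "negative": 0.0, "neutral": 0.0}
    shift = 1.0  # assumed rule: a shifter scales the next sentiment word
    for token in tweet.lower().split():
        if token in EMOTICONS:
            add_valence(totals, EMOTICONS[token])
        elif token in VALENCE_SHIFTERS:
            shift *= VALENCE_SHIFTERS[token]
        elif token in SENTIMENT_WORDS:
            add_valence(totals, shift * SENTIMENT_WORDS[token])
            shift = 1.0  # the shifter is consumed by this word
    return totals


print(score_tweet("not good :)"))
```

In this sketch the tweet "not good :)" contributes one negative valence (the
shifted word) and one positive valence (the emoticon), so a final decision would
still need a rule for ties.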

Sentiment analysis is also used in an interesting application: when the user is
talking, it analyzes whether the situation or action has actually occurred or
not. Such terms are called “Irrealis” and are applied in non-factual contexts.
They are a set of grammatical moods which indicate whether or not an event
occurs. Here the imperative mood plays the major role in irrealis blocking [13].
An instance taken here is the validation of a dictionary, where the data set
uses the granularity of the dictionary to provide evidence for the dictionary
rankings. The intuitions of English-speaking people are also valuable in
comparison with automatically generated ones. Granularity of the scales is
expected in data sets, so as to increase efficiency [13].

2. Machine Learning Algorithms used in Sentiment Analysis
Machine learning algorithms play an important role in sentiment analysis.
Specifically, many works in sentiment analysis use classification algorithms
such as Support Vector Machine (SVM), the kernel trick and KNN (K-Nearest
Neighbor) to detect positive, negative or neutral sentiments.

A. Support Vector Machine

A Support Vector Machine (SVM) is a supervised learning model that analyzes
data for classification and regression analysis. In SVM, the examples are
represented as points in space, mapped so that the examples of the two separate
categories are divided by a clear gap; a new example is assigned to a category
according to the side of the gap on which it falls. SVM also has the special
advantage that it can perform non-linear classification using the kernel
trick [7], by implicitly mapping the inputs into high-dimensional feature
spaces. SVM is applicable to supervised (labelled) data sets.
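
As a rough illustration of the separating gap, the sketch below trains a linear
SVM by sub-gradient descent on the regularized hinge loss. The toy points, the
learning rate, the regularization constant and the function names are all
assumptions made for this example; no kernel is used, and a practical system
would rely on an established library instead.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(points, labels, epochs=200, lr=0.1, reg=0.01):
    """Minimize the regularized hinge loss by sub-gradient descent."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            if y * (dot(w, x) + b) < 1:
                # Inside the margin or misclassified: move the plane.
                w = [wi - lr * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Outside the margin: only the regularizer acts.
                w = [wi - lr * reg * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 according to the side of the plane x falls on."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical 2-D feature vectors: +1 = positive review, -1 = negative review.
points = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
print([predict(w, b, x) for x in points])
```

The hinge condition y(w.x + b) < 1 is what enforces the gap: points are pushed
not merely to the correct side of the plane but at least a unit margin away
from it.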

B. Proximal Support Vector Machine

Instead of a standard support vector machine, which classifies points by
assigning them to one of two disjoint half-spaces, PSVM classifies points by
assigning them to the closer of two parallel planes.

C. Kernel Trick

It is a set of algorithms designed for pattern analysis. This method is used to
find general types of relations such as clusters, rankings, components,
correlations and classifications in data sets [8]. Kernel functions work by
computing the inner products between the images of all pairs of data points in
a feature space.

This is more efficient than explicitly computing the coordinates in that space.
Kernel method algorithms are capable of operating with Support Vector Machines.
Kernel functions are used for graphs, text, images and vectors. Basically,
kernel algorithms are based on convex optimization or eigenvalue problems.
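
A small worked example may make the efficiency argument concrete. The sketch
below compares a degree-2 polynomial kernel, computed directly from the inputs,
with the same inner product computed after mapping the inputs explicitly into
the feature space. The choice of kernel and the names poly2_kernel and
feature_map are assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2, no feature map needed."""
    return dot(x, y) ** 2


def feature_map(x):
    """Explicit feature map of the same kernel for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, 4.0)
implicit = poly2_kernel(x, y)                    # one dot product in 2-D
explicit = dot(feature_map(x), feature_map(y))   # dot product in 3-D
print(implicit, explicit)
```

Both routes give the same value up to rounding, but the kernel never builds the
higher-dimensional coordinates, which is the saving the kernel trick exploits.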


D. K Nearest Neighbor (KNN)

KNN is one of the simplest and most commonly used classification algorithms. It
is extremely simple and usually gives accurate and competitive results. Here
the whole data set needs to be classified as positive, negative or neutral.
This is done by considering the k nearest neighbors and their closeness.
Closeness can be measured by any distance measure; the Euclidean distance is
mainly used. This classifier works best on smaller data sets.
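
The KNN rule described above can be sketched as follows. The training vectors,
their labels and the function name knn_classify are hypothetical; the sketch
uses the Euclidean distance and a simple majority vote among the k nearest
neighbors.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label query by majority vote among its k nearest training examples.

    train is a list of (feature_vector, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D feature vectors with known sentiment labels.
train = [
    ((0.9, 0.1), "positive"),
    ((0.8, 0.2), "positive"),
    ((0.7, 0.3), "positive"),
    ((0.1, 0.9), "negative"),
    ((0.2, 0.8), "negative"),
    ((0.3, 0.7), "negative"),
]
print(knn_classify(train, (0.75, 0.2)))
```

Because every query is compared with every stored example, the cost grows with
the size of the training set, which is one reason the method suits smaller data
sets.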

E. Hybrid model K Nearest Neighbor (KNN) and SVM

Various works use individual methods for classification. Other works use a
hybrid model in which KNN-SVM is used for better classification [15]. This
method also shows an improvement in identifying sentiments.

3. Deep Learning Technique–Convolution Neural Networks (CNN)
A Convolution Neural Network is a type of feed-forward network which consists
of two or more convolution layers followed by fully connected layers, as in a
multilayer neural network. From the perspective of sentiment analysis, a CNN
works by giving each word a weight in the hidden layer. Each word is then
checked for an exact match, and the process is repeated. A CNN also works on
the logic of a sliding window. For instance, given an image, filters are chosen
and passed over the image as a sliding window. Each window position yields a
value, and the values are stored as a matrix; thus a matrix is calculated for
the entire image. In text classification, every word is given as an input and
the sentence is finally represented in matrix format, as shown in Figure 1.
Feature detection is done by the convolution layers.

Figure 1: Convolution Works
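
The sliding-window idea for text can be sketched without any framework. In the
example below each word is a row of a sentence matrix, and a filter spanning
two consecutive words is moved down the matrix one word at a time. The word
vectors, the filter values and the function name convolve_sentence are
hypothetical, and the final max pooling step is an assumption that is not
stated in the text above.

```python
def convolve_sentence(sentence, filt):
    """Slide a filter of h rows over a sentence matrix (one row per word)."""
    h = len(filt)
    feature_map = []
    for start in range(len(sentence) - h + 1):
        window = sentence[start:start + h]
        # Element-wise product of the window and the filter, summed to one value.
        value = sum(
            w * f
            for word_row, filt_row in zip(window, filt)
            for w, f in zip(word_row, filt_row)
        )
        feature_map.append(value)
    return feature_map


# Hypothetical 2-dimensional word vectors for a four-word sentence.
sentence = [[1, 0], [0, 1], [1, 1], [0, 0]]
filt = [[1, 0], [0, 1]]  # a filter spanning two consecutive words

feature_map = convolve_sentence(sentence, filt)
feature = max(feature_map)  # max pooling (an assumption, see above)
print(feature_map, feature)
```

A sentence of n words and a filter of height h give n - h + 1 window positions,
so the four-word sentence here produces three values per filter.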


4. Sentiment Analysis using Movie Reviews


A. Existing Method

One such method, discussed in the paper by Kim Yong et al., uses a combination
of CNN and KNN to identify the sentiments in movie reviews. The data file is
loaded, and pre-processing is done so that as much noise as possible is
removed. The existing method uses the deep learning technique Convolution
Neural Network to train on and learn the positive and negative sentiments from
the movie review data sets. A sentence from a movie review is taken as input
and separated into words. It is then passed through the convolution layers;
multiple layers are set up using the filters. The features are extracted after
the convolution layers [12]. These features are fed to a KNN classifier, which
identifies whether the reviews are categorized as positive or negative
sentiments, as shown in Figure 2. The paper also suggests converting words into
numeric vectors using the word2vec library or any other word embedding
technique.

Figure 2: Sentiment Analysis using CNN – KNN

B. Proposed Method

There are various unsupervised learning algorithms, such as k-means,
hierarchical and agglomerative clustering. As a deviation from the existing
work, the experiment is carried out with a combination of the deep learning
technique CNN and the unsupervised K-means clustering method.

All unlabelled data sets come under unsupervised learning. In K-means
clustering, no labels are known. The number of clusters, k, has to be decided
in advance according to the application. Once k has been decided, each new data
point is placed in a cluster according to the centroid values: the distance of
the point from each centroid is calculated, and the point is put into the
cluster whose centroid is nearest. Unsupervised learning is very useful for
data sets where the labels are not reliable, so it shows better results in the
case of novel and unknown data.
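
The assignment and update steps of K-means can be sketched as below. This is a
minimal illustration under stated assumptions: the 2-D points, the initial
centroids, the fixed number of iterations and the names nearest and k_means are
chosen for the example, and real feature vectors would have many more
dimensions.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, centroids, iterations=10):
    """Alternate assignment and centroid update for a fixed number of rounds."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # New centroid = mean of its members; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids


points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids = k_means(points, [(0, 0), (10, 10)])  # k = 2, chosen in advance
print(centroids)
print(nearest((8, 8), centroids))  # cluster index for a new, unseen point
```

Note that the clusters carry no names: deciding which cluster is "positive" and
which is "negative" needs a separate step, for example inspecting a few members
of each cluster.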


Unsupervised learning methods have the advantage of discovering hidden patterns
and groupings. In our proposed model, a movie review data set is used which
contains a mix of positive and negative reviews. The deep learning technique
CNN is used to train the system. The input to the proposed system is again
sentences, which are converted to a matrix using multilevel convolutions. The
features extracted by the CNN are in turn fed to a K-means setup, where the
reviews are grouped into positive or negative clusters. Thus the complete data
set is grouped accordingly.

Whenever a novel, unknown movie review arrives, it is passed through the
trained CNN and, after feature extraction, the K-means clustering algorithm
groups the review into the positive or negative cluster, as in Figure 3. The
proposed method works better and gives a minimal improvement in accuracy on the
movie review data set. However, this data set is not large in comparison with
others, as it consists of only 10,662 instances [15].

Figure 3: Sentiment Analysis using CNN – K Means

5. Experiments and Results


In this paper, a comparative study is made of supervised learning, using the
combination of CNN and KNN, and unsupervised learning, using the combination of
CNN and K-means. This is implemented in the TensorFlow framework. TensorFlow is
one of the trending frameworks for working with convolution neural networks and
other techniques in the field of deep learning. The existing system, using CNN
and KNN, provides better results for smaller data sets. This is evaluated using
the metrics accuracy and precision. As this is supervised learning, accuracy is
higher for smaller data sets. Since all the positive and negative sentiments
are trained on and learned by the CNN, KNN then classifies the reviews as
positive or negative sentiments with a lower error rate [15].
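
For reference, the two metrics can be computed as in the sketch below.
Accuracy is the fraction of reviews labelled correctly, and precision is the
fraction of reviews predicted positive that are truly positive. The label
strings and the example predictions are hypothetical and are not results from
the experiments reported here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive="pos"):
    """True positives divided by all predicted positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fp) if tp + fp else 0.0


# Hypothetical labels for five reviews.
y_true = ["pos", "pos", "neg", "neg", "pos"]
y_pred = ["pos", "neg", "neg", "pos", "pos"]
print(accuracy(y_true, y_pred), precision(y_true, y_pred))
```

For a clustering method such as K-means, each cluster must first be mapped to a
sentiment label before these metrics can be computed.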

The proposed work uses unsupervised learning, and when this is combined with
CNN the accuracy and precision are seen to improve. TensorFlow usually runs
faster on a GPU (Graphics Processing Unit) setup. If the system needs to work
on larger corpora, a normal CPU may not suffice; GPUs are then suggested so
that space and time consumption can be reduced. Thus our system shows that
CNN-KNN works better for a smaller data set, and for a larger data set CNN-K
Means is suggested. The comparison of both algorithms is depicted below. The
graph shows the sentiment analysis done for various real movies: the review
comments of these movies were taken and the positive and negative comments
analyzed. The result is plotted in Figure 4.
[Bar chart: Pos and Neg counts (axis 0 to 300) for the movies FoodFight,
The God Father, House of the Dead and BlackHat]

Figure 4: Sentiment analysis for different movies

The following graph shows the loss and accuracy for various movies. It is also
observed that when the filters in the convolution neural network are changed,
for a few movies the accuracy is higher and the loss is lower, which is the
desired outcome. This is achieved with CNN-KNN for smaller data sets, and the
same is achieved with CNN-K Means for larger data sets. Figure 5 below shows
that accuracy is attained with little loss, i.e. a lower error rate, when our
proposed method is used.

[Chart: Loss and Accuracy (axis 0 to 1) plotted over points 1 to 5]

Figure 5: Loss and accuracy trade off


6. Conclusion
Sentiment analysis is essential in our daily routine. It has diverse
applications, from social media, such as the analysis of Twitter data, to
mechanical applications such as centrifugal pumps with the help of Support
Vector Machine. Sentiment analysis supports marketing strategy, campaign
success, improved product messaging and other areas. In this paper we have
proposed an approach based on the K-means algorithm, which is also effective
for larger sets of data. Sentiment analysis has been effective in all the cases
in which it has been implemented. CNN filters, a deep learning technique, are
also used as a part of sentiment analysis. All these factors make a difference
in learning and improve on the existing work. Algorithms such as hierarchical
and agglomerative clustering are also useful for data prediction. These can
also be applied to larger data sets, improving efficiency and accuracy.

References
[1] Varghese R., Jayasree M., A survey on sentiment analysis and
opinion mining, International Journal of Research in Engineering
and Technology 2(11) (2013), 312-317.
[2] Agarwal A., Xie B., Vovsha I., Rambow O., Passonneau R.,
Sentiment analysis of twitter data, Proceedings of the workshop
on languages in social media, Association for Computational
Linguistics (2011), 30-38.
[3] Vinita Sharma, Literature Survey (2014).
[4] Sahayak V., Shete V., Pathan A., Sentiment Analysis on Twitter
Data, International Journal of Innovative Research in Advanced
Engineering (IJIRAE) 2(1) (2015), 178-183.
[5] Singh R., Kaur R., Sentiment Analysis on Social Media and
Online Review, International Journal of Computer Applications
121(20) (2015).
[6] Medhat W., Hassan A., Korashy H., Sentiment analysis
algorithms and applications: A survey, Ain Shams Engineering
Journal 5(4) (2014), 1093-1113.
[7] Sources from Wikipedia, Kernel Methods.
[8] Sindhwani V., Melville P., Document-word co-regularization for
semi-supervised sentiment analysis, Eighth IEEE International
Conference on Data Mining (2008), 1025-1030.
[9] Nair B.B., Mohandas V.P., Sakthivel N.R., A genetic algorithm
optimized decision tree-SVM based stock market trend prediction
system, International Journal on Computer Science and
Engineering 2(9) (2010), 2981-2988.


[10] Nanli Z., Ping Z., Weiguo L., Meng C., Sentiment analysis: A
literature review, International Symposium on Management of
Technology (ISMOT) (2012), 572-576.
[11] Taboada M., Brooke J., Tofiloski M., Voll K., Stede M.,
Lexicon-based methods for sentiment analysis, Computational
Linguistics 37(2) (2011), 267-307.
[12] Vaitheeswaran G., Arockiam L., A Novel Lexicon Based
Approach to Enhance the Accuracy of Sentiment Analysis on Big
Data, International Journal of Emerging Research in
Management and Technology (IJERMT) 5(2) (2016).
[13] Sivakumar P.B., Mohandas V.P., Sobh T., Evaluating the
predictability of financial time series, A case study on SENSEX
data, Innovations and Advanced Techniques in Computer and
Information Sciences and Engineering (2007), 99–104.
[14] Padmavathi S., Rajalaxmi C., Soman K.P., Texel identification
using K-Means clustering method, Advances in Computer
Science, Engineering & Applications (2012), 285-294.
[15] Abarna K., Rajamani M., Vasudevan S.K., Big data analytics: A
detailed gaze and a technical review, International Journal of
Applied Engineering Research 9(9) (2014).
[16] Geethan P., Jithin P., Naveen T., Padminy K.V., Shruthi Krithika
J., Vasudevan S.K., Augmented reality X-ray vision with gesture
interaction, Indian Journal of Science and Technology 8 (2015),
43-47.
[17] Sankar A., Suresh A., Varun Babu P., Baskar A., Vasudevan
S.K., An in-depth analysis of applications of object recognition,
Research Journal of Applied Sciences, Engineering and
Technology 10(1) (2015), 1-14.
[18] Rajendran A., Kiran M.V.K., Vasudevan S.K., Baskar A., An
exhaustive survey on human computer interaction’s past, present
and future, International Journal of Applied Engineering
Research 10(2) (2015), 5091-5105.
[19] Gaurangi Patil, Varsha Galande, Vedant Kekan, Kalpana Dange,
Sentiment Analysis Using Support Vector Machine, International
Journal of Innovative Research in Computer and Communication
Engineering 2(1) (2014).
[20] Yong Yang, Chun Xu, Ge Ren, Sentiment Analysis of Text Using
SVM, Electrical, Information Engineering and Mechatronics of the
series Lecture Notes in Electrical Engineering 138 (2012), 1133-
1139.
