Sentiment Analysis Using Deep Learning Technique CNN With Kmeans
Abstract
Sentiment analysis already plays a vital role across most social media.
Whether on social networking sites or on video- or audio-based systems,
people want to know the sentiment behind a piece of content, that is,
whether it is positive, negative or neutral. Sentiment analysis in turn
helps to detect emotions. In the proposed work, sentiment analysis is
applied to movie reviews using a novel combination of a deep learning
technique, the convolutional neural network (CNN), and an unsupervised
learning method, K-means, which gives a better estimate of sentiment than
the existing methods. The improvement in accuracy is minimal on this
dataset, but it is expected to grow, and to show its significance, when the
method is applied to a larger big-data corpus.
Key Words: CNN, deep learning, K means, sentiment analysis.
International Journal of Pure and Applied Mathematics Special Issue
1. Introduction
Sentiment analysis, also known as opinion mining, is a way to find the
opinions of users and to categorize them as positive, negative or neutral.
Sentiment analysis has shown its significance in almost all fields of media,
and it is closely tied to natural language processing. When users express
their views, it is important for an organization to correctly identify their
requirements in order to retain them as customers. That requires a deep
understanding of customer opinion [1][3]. By analyzing customers' product
reviews, a company can more easily decide the future of a product. In the
same way, it is important to analyze the comments posted on social media [1].
Twitter analytics has become a separate field in itself, and studies even
show the impact of tweets on sensitive fields such as market prediction [9].
Sentiment analysis has diverse applications, ranging from centrifugal pumps
to social media [5].
Centrifugal pumps are used in many applications. When a fault occurs in a
centrifugal pump, such as a monoblock pump, monitoring is essential. Neural
network algorithms such as the ANN (Artificial Neural Network) have been
used for this purpose [6], but the accuracy they produce is not satisfactory.
It provided a better outcome in the monoblock case. In this process, features
are extracted and trained on, and the resulting fault classifications are
compared. The major advantage is that the operator can be informed of the
pump's status well in advance; if the status is negative, the necessary
precautions can be taken. In contrast, the Support Vector Machine (SVM) and
the Proximal Support Vector Machine (PSVM) give a better outcome under both
good and faulty conditions of a monoblock centrifugal pump. In this machine
learning process, a decision tree is used for feature extraction. The
extracted features are fed as inputs to the SVM and PSVM, which are trained
and tested, and their fault classifications are compared [11].
In stock market prediction, it is very difficult to collect the enormous
volume of data in a short span of time, so sentiment analysis plays a vital
part. The prediction was done with two algorithms, the genetic algorithm (GA)
and the support vector machine (SVM). Hybrid systems were proposed to avoid
regression problems and to handle the existing problem with satisfactory
accuracy. The algorithm can be applied to the previous day's records to
reach successful targets. The decision tree parameters were optimized by the
GA and SVM for accuracy. Once the trade has begun for
that day, trade must be carried in order to obtain a highest possible
The analysis of tweets [2] to classify them correctly as positive, negative
or neutral is noteworthy. Research on sentiment analysis of blogs has grown
rapidly. As the population grows, the number of blog and microblog users has
also increased in a short span of time. This produces a large amount of
unformatted, bulky and imprecise text. To cope with it, sentiment analysis is
widely used and is considered the most efficient part of the deep learning
process. Sentiment analysis involves various methods, of which feature
extraction is the most efficient part. In contrast to opinion mining, it has
a number of drawbacks [4]. Opinion mining concentrated on only a
one-dimensional feature, unlike sentiment analysis. To avoid these problems,
Jeong et al. proposed FEROM (Feature Extraction and Refinement Method), which
extracts the appropriate grammars and features by scanning the whole blog
content. This method checks each grammar, and features are extracted by
merging exact-matching words. This raised the challenge of keyword
extraction, which Fan and Chang addressed from the concept of contextual
advertising, in relation to the advertisements on a blog page. Traditional
keyword extraction can be used only for searching or featuring formal
documents such as traditional blogs, newspapers or scientific papers. In
addition to traditional keyword extraction, frequency-based extraction was
introduced for extracting features from microblogs [4]. In addition to
frequency, graphical model extraction was also introduced [12].
We start with each tweet. Each word is looked up in the sentiment
dictionary: if an emoticon [12] is found, it is counted as positive, negative
or neutral; else if a contextual word, a Contextual Valence Shifter [9], is
found, its valence is calculated; otherwise, if a sentiment word is found,
its positive and negative valences are calculated. Finally, all positive,
negative and neutral values are summed for each sentence.
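
The scoring loop described above can be sketched as follows. This is a
minimal illustration, not the authors' implementation: the three small
dictionaries, the function name score_tweet, and the rule that a valence
shifter multiplies the next sentiment word are all assumptions made for the
example.

```python
# Hypothetical mini-lexicons; a real system would load a full sentiment dictionary.
EMOTICONS = {":)": 1, ":(": -1, ":|": 0}
SHIFTERS = {"not": -1.0, "never": -1.0, "very": 1.5}  # contextual valence shifters
SENTIMENT_WORDS = {"good": 1.0, "great": 2.0, "bad": -1.0, "boring": -1.5}


def score_tweet(tweet):
    """Return (positive, negative, neutral) totals for one tweet."""
    positive = negative = 0.0
    neutral = 0
    shift = 1.0  # assumed rule: a shifter scales the next sentiment word
    for token in tweet.lower().split():
        if token in EMOTICONS:
            value = float(EMOTICONS[token])
        elif token in SHIFTERS:
            shift = SHIFTERS[token]
            continue
        elif token in SENTIMENT_WORDS:
            value = SENTIMENT_WORDS[token] * shift
            shift = 1.0
        else:
            continue  # token is in none of the dictionaries
        if value > 0:
            positive += value
        elif value < 0:
            negative += -value
        else:
            neutral += 1
    return positive, negative, neutral


print(score_tweet("not good but very great :)"))
```

The order of the branches mirrors the text: emoticons are checked first, then
valence shifters, then ordinary sentiment words.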
occurrence of an event or not. Here the imperative mood plays the major role
in irrealis blocking [13]. One instance is the validation of the dictionary,
where the data set uses the granularity of the dictionary and provides
evidence for the dictionary rankings. The rankings also predict the
intuitions of English-speaking people, which are valuable in comparison with
automatically generated ones. Granular scales are expected in datasets so as
to increase efficiency [13].
The Support Vector Machine (SVM) is a supervised learning model that
analyzes data for classification and regression. In an SVM, the examples are
represented as points in space, mapped so that the examples of the separate
categories are divided by a clear gap; new examples are then assigned to a
category according to the side of the gap on which they fall. SVMs also have
the special advantage that they can perform non-linear classification using
the kernel trick [7], which implicitly maps the inputs into high-dimensional
feature spaces. An SVM is applicable only to labelled data sets, as
supervised learning requires.
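
To make the kernel trick concrete, the sketch below checks one standard
identity: for 2-D inputs, the degree-2 polynomial kernel computed in the
input space equals an ordinary dot product in an explicit 3-D feature space.
The choice of kernel and the function names are assumptions made for
illustration; the paper does not state which kernel was used.

```python
import math


def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2 for 2-D inputs."""
    inner = x[0] * y[0] + x[1] * y[1]
    return inner ** 2


def feature_map(x):
    """Explicit 3-D feature map whose dot product reproduces poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)
implicit = poly_kernel(x, y)                    # computed in the 2-D input space
explicit = dot(feature_map(x), feature_map(y))  # computed in the 3-D feature space
print(implicit, explicit)
```

The two values agree up to floating-point rounding, which is why a kernel
SVM never needs to build the high-dimensional vectors explicitly.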
C. Kernel Trick
KNN (k-nearest neighbors) is one of the simplest and most commonly used
classification algorithms. Despite its simplicity, it usually gives accurate
and competitive results. Here the whole data set needs to be classified as
positive, negative or neutral. This is done by considering the k nearest
neighbors of each instance and their closeness. Closeness can be measured by
any distance measure; the Euclidean distance is mainly used. This classifier
works better on smaller datasets.
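
A minimal sketch of KNN classification with the Euclidean distance follows.
The toy 2-D feature vectors, the labels and the default k = 3 are made up for
illustration and are not taken from the paper's data.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_predict(train, query, k=3):
    """train is a list of (feature_vector, label); return the majority label
    among the k training points nearest to query."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D feature vectors, made up for illustration.
train = [
    ((0.9, 0.8), "positive"), ((0.8, 0.9), "positive"), ((0.7, 0.7), "positive"),
    ((0.1, 0.2), "negative"), ((0.2, 0.1), "negative"), ((0.3, 0.3), "negative"),
]
print(knn_predict(train, (0.85, 0.75)))
print(knn_predict(train, (0.15, 0.25)))
```

Every prediction scans the whole training set, which is one reason the
method suits smaller datasets.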
Various works use individual methods for classification. Others use a hybrid
model, in which KNN-SVM has been used for better classification [15]; this
also shows an improvement in sentiment identification.
One such method, discussed in the paper by Kim Yong et al., uses a
combination of CNN and KNN to identify the sentiments in movie reviews. The
data file is loaded and pre-processed so that as much noise as possible is
removed. The existing method uses a deep learning technique, the
convolutional neural network, to train on and learn the positive and
negative sentiments from the movie review data sets. A sentence from a movie
review is given as input and split into words. It is then passed through the
convolution layers; multiple layers are set up using filters. Features are
extracted after the convolution layers [12]. These features are fed to a KNN
classifier, which categorizes the reviews as positive or negative, as shown
in Figure 2. The paper also suggests converting words into numeric values
using the word2vec library or another word embedding technique.
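
The feature-extraction step can be sketched in plain Python as a convolution
over word windows followed by max-over-time pooling. Everything concrete here
is an assumption made for illustration: the 2-D embeddings, the two hand-set
filters (a trained CNN would learn them), the ReLU activation, and the
function names.

```python
# Hypothetical 2-D word embeddings; a real system would use word2vec or similar.
EMBEDDINGS = {
    "great": [1.0, 0.5], "movie": [0.0, 0.1], "boring": [-1.0, -0.5],
    "a": [0.0, 0.0], "plot": [0.1, 0.0],
}
UNKNOWN = [0.0, 0.0]


def embed(sentence):
    """Split a sentence into words and look up each word's vector."""
    return [EMBEDDINGS.get(w, UNKNOWN) for w in sentence.lower().split()]


def conv_max_pool(vectors, filt):
    """Slide a filter over windows of len(filt) words, apply ReLU,
    then keep the maximum activation (max-over-time pooling)."""
    width = len(filt)
    activations = []
    for start in range(len(vectors) - width + 1):
        window = vectors[start:start + width]
        total = sum(w * x for f_row, v_row in zip(filt, window)
                    for w, x in zip(f_row, v_row))
        activations.append(max(0.0, total))  # ReLU
    return max(activations) if activations else 0.0


def extract_features(sentence, filters):
    """One pooled feature per filter."""
    vectors = embed(sentence)
    return [conv_max_pool(vectors, f) for f in filters]


# Two hand-set width-2 filters (assumed, not learned).
FILTERS = [
    [[1.0, 1.0], [1.0, 1.0]],      # responds to positive-valued windows
    [[-1.0, -1.0], [-1.0, -1.0]],  # responds to negative-valued windows
]
print(extract_features("a great movie", FILTERS))
print(extract_features("a boring plot", FILTERS))
```

The resulting fixed-length feature vector is what a downstream classifier
such as KNN would receive, regardless of the sentence length.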
B. Proposed Method
Unlabelled data sets fall under unsupervised learning. In K-means
clustering, no labels are known. The number of clusters, k, has to be
decided in advance according to the application. Once k has been decided,
each new data point is assigned to a cluster according to its distance from
the calculated centroid values: the point is placed in the cluster whose
centroid is nearest. Unsupervised learning is very useful for datasets whose
labels are unreliable, and so it shows better results on novel and unknown
data.
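
A minimal sketch of K-means (Lloyd's algorithm) and of assigning a new point
to its nearest centroid is given here. The toy 2-D points, the fixed initial
centroids and the function names are assumptions made for illustration; in
the proposed method the points would be CNN feature vectors.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm with caller-supplied initial centroids,
    so k = len(centroids) is fixed in advance."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


# Toy data: two well-separated groups, made up for illustration.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(nearest_centroid((7.5, 8.5), centroids))  # cluster index for a new point
```

K-means only returns cluster indices; deciding which cluster corresponds to
positive and which to negative sentiment is a separate step.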
Whenever a novel, unseen movie review arrives, it is passed through the
trained CNN and, after feature extraction, the K-means clustering algorithm
groups the review into the positive or negative cluster, as in Figure 3. The
proposed method works better and gives a minimal improvement in accuracy on
the movie review dataset. However, this dataset is not large in comparison
with others, as it consists of only 10,662 instances [15].
the system needs to work on a larger corpus, then a normal CPU may not
suffice. In that case GPUs are suggested, so that space and time consumption
can be reduced. Thus our system shows that CNN-KNN works better for a
smaller dataset, while CNN-K-means is suggested for a larger dataset. A
comparison of the two algorithms is given below. The graph shows the
sentiment analysis done for various real-world movies: the review comments
for these movies were collected and the positive and negative comments were
counted. The result is plotted in Figure 4.
[Figure 4: Number of positive (Pos) and negative (Neg) review comments for
the movies FoodFight, The God Father, House of the Dead and BlackHat;
vertical axis from 0 to 300.]
The next graph shows the loss and accuracy for various movies. It is also
observed that when the filters in the convolutional neural network are
changed, for a few movies the accuracy is higher and the loss is lower,
which is the required behaviour. This is achieved with CNN-KNN for smaller
datasets, and with CNN-K-means for larger datasets. Figure 5 shows that
accuracy is attained with little loss, that is, a lower error rate, when the
proposed method is used.
[Figure 5: Loss and Accuracy on a vertical scale of 0 to 1, plotted against
horizontal values 1 to 5.]
6. Conclusion
Sentiment analysis is essential in daily life. It has diverse applications,
from social media, such as the analysis of Twitter data, to mechanical
systems such as the centrifugal pump with the help of the Support Vector
Machine. Sentiment analysis also supports marketing strategy, campaign
success, improved product messaging and other areas. In this paper we have
proposed an approach based on the K-means algorithm, which is also effective
for larger data sets. Sentiment analysis has been effective in all the cases
in which it has been implemented. Deep learning techniques such as the CNN,
with its filters, are also used as part of sentiment analysis. All these
factors contribute to learning and help to improve on the existing work.
Algorithms such as hierarchical and agglomerative clustering are also useful
for data prediction, and may likewise be applicable to larger datasets,
improving efficiency and accuracy.
References
[1] Varghese R., Jayasree M., A survey on sentiment analysis and
opinion mining, International Journal of Research in Engineering
and Technology 2(11) (2013), 312-317.
[2] Agarwal A., Xie B., Vovsha I., Rambow O., Passonneau R.,
Sentiment analysis of twitter data, Proceedings of the workshop
on languages in social media, Association for Computational
Linguistics (2011), 30-38.
[3] Vinita Sharma, Literature Survey (2014).
[4] Sahayak V., Shete V., Pathan A., Sentiment Analysis on Twitter
Data, International Journal of Innovative Research in Advanced
Engineering (IJIRAE) 2(1) (2015), 178-183.
[5] Singh R., Kaur R., Sentiment Analysis on Social Media and
Online Review, International Journal of Computer Applications
121(20) (2015).
[6] Medhat W., Hassan A., Korashy H., Sentiment analysis
algorithms and applications: A survey, Ain Shams Engineering
Journal 5(4) (2014), 1093-1113.
[7] Sources from Wikipedia, Kernel Methods.
[8] Sindhwani V., Melville P., Document-word co-regularization for
semi-supervised sentiment analysis, Eighth IEEE International
Conference on Data Mining (2008), 1025-1030.
[9] Nair B.B., Mohandas V.P., Sakthivel N.R., A genetic algorithm
optimized decision tree-SVM based stock market trend prediction
system, International Journal on Computer Science and
Engineering 2(9) (2010), 2981-2988.
[10] Nanli Z., Ping Z., Weiguo L., Meng C., Sentiment analysis: A
literature review, International Symposium on Management of
Technology (ISMOT) (2012), 572-576.
[11] Taboada M., Brooke J., Tofiloski M., Voll K., Stede M.,
Lexicon-based methods for sentiment analysis, Computational Linguistics
37(2) (2011), 267-307.
[12] Vaitheeswaran G., Arockiam L., A Novel Lexicon Based
Approach to Enhance the Accuracy of Sentiment Analysis on Big
Data, International Journal of Emerging Research in
Management and Technology (IJERMT) 5(2) (2016).
[13] Sivakumar P.B., Mohandas V.P., Sobh T., Evaluating the
predictability of financial time series, A case study on SENSEX
data, Innovations and Advanced Techniques in Computer and
Information Sciences and Engineering (2007), 99–104.
[14] Padmavathi S., Rajalaxmi C., Soman K.P., Texel identification
using K-Means clustering method, Advances in Computer
Science, Engineering & Applications (2012), 285-294.
[15] Abarna K., Rajamani M., Vasudevan S.K., Big data analytics: A
detailed gaze and a technical review, International Journal of
Applied Engineering Research 9(9) (2014).
[16] Geethan P., Jithin P., Naveen T., Padminy K.V., Shruthi Krithika
J., Vasudevan S.K., Augmented reality X-ray vision with gesture
interaction, Indian Journal of Science and Technology 8 (2015),
43-47.
[17] Sankar A., Suresh A., Varun Babu P., Baskar A., Vasudevan
S.K., An in-depth analysis of applications of object recognition,
Research Journal of Applied Sciences, Engineering and
Technology 10(1) (2015), 1-14.
[18] Rajendran A., Kiran M.V.K., Vasudevan S.K., Baskar A., An
exhaustive survey on human computer interaction’s past, present
and future, International Journal of Applied Engineering
Research 10(2) (2015), 5091-5105.
[19] Gaurangi Patil, Varsha Galande, Vedant Kekan, Kalpana Dange,
Sentiment Analysis Using Support Vector Machine, International
Journal of Innovative Research in Computer and Communication
Engineering 2(1), (2014).
[20] Yong Yang, Chun Xu, Ge Ren, Sentiment Analysis of Text Using
SVM, Electrical, Information Engineering and Mechatronics of the
series Lecture Notes in Electrical Engineering 138 (2012), 1133-
1139.