Polarity Detection of Kannada Documents: Deepamala. N Dr. Ramakanth Kumar. P


Polarity Detection of Kannada documents

Deepamala. N
Assistant Professor, Dept. of Computer Science,
R. V. College of Engineering
Bangalore, India
deepamalan@rvce.edu.in

Dr. Ramakanth Kumar. P
Professor, Dept. of Information Science,
R. V. College of Engineering
Bangalore, India
ramakanthkp@rvce.edu.in

Abstract— Document polarity detection is a part of sentiment analysis in which a document is classified as a positive polarity document or a negative polarity document. The applications of polarity detection are content filtering and opinion mining. Content filtering of negative polarity documents is an important application to protect children from negativity and can be used in the security filters of organizations. In this paper, a dictionary based method using a polarity lexicon and machine learning algorithms are applied for polarity detection of Kannada language documents. In the dictionary method, a manually created polarity lexicon of 5043 Kannada words is used and compared with the machine learning algorithms Naïve Bayes and Maximum Entropy. It is observed that Naïve Bayes and Maximum Entropy perform better than the dictionary based method, with accuracies of 0.90, 0.93 and 0.78 respectively.

Keywords— sentiment analysis, polarity detection, Natural language processing, Kannada language, Naïve Bayes, Maximum Entropy.

I. INTRODUCTION
Sentiment analysis refers to identifying the sentiment, polarity, opinion or emotion of a text. The Web contains enormous data in the form of product reviews, news, blogs, internet forums etc. The terms opinion mining and sentiment analysis are used interchangeably; both involve finding the polarity and emotion of documents. Sentiment analysis can be performed using keyword spotting, lexicon analysis, statistical methods and machine learning algorithms. Document level sentiment classification refers to determining whether a given document expresses a positive or negative opinion. This can be performed using a binary text classification method where the classes are positive and negative [1]. Sentence level subjectivity or sentiment classification can be performed using supervised methods like Naïve Bayes, Support Vector Machines, keyword spotting etc. Opinion words such as beautiful, lovely, nice represent positive polarity and bad, ugly, horrible represent negative polarity. A collection of opinion words or phrases used in sentiment classification is called an opinion lexicon. WordNet [4] and SentiWordNet [5] are some sources for an opinion lexicon. For the Kannada language, there is no opinion lexicon, SentiWordNet or WordNet available. To find the polarity of a Kannada text document, a polarity lexicon is required. Hence, a lexicon is manually created for the Kannada language in which 5043 words are assigned a polarity score between +5 (very positive) and -5 (very negative). The document polarity is detected based on the number of positive and negative sentences in the document. Machine learning algorithms like Naïve Bayes and Maximum Entropy are also used for polarity detection of a Kannada document as part of this work.

II. RELATED WORK
Sentiment analysis is a widely researched topic at present. In [2], an unsupervised algorithm is used to classify a review as recommended or not recommended. As a first step, the text is POS tagged to identify the adjectives and adverbs, and then the semantic orientation of these words is identified. The PMI-IR algorithm is used for calculation of the semantic orientation between words, given by:

$$\mathrm{PMI}(word_1, word_2) = \log_2\left[\frac{p(word_1 \,\&\, word_2)}{p(word_1)\,p(word_2)}\right] \tag{1}$$

Here $p(word_1 \,\&\, word_2)$ is the probability that word1 and word2 co-occur. An accuracy of 74% was achieved for Epinions, 66% for movie reviews and 80-84% in travel reviews. In [3], Pang et al. proposed classification of text based on sentiment instead of topic. They created a word list of positive and negative words used in movie reviews. Naïve Bayes, Maximum Entropy and Support Vector Machine algorithms were used for classification. Twitter, a microblogging site, is used for sentiment analysis in [8]. A corpus is extracted automatically from Twitter and a classifier is built to determine positive, negative and neutral sentiments. A lexicon based approach to extract sentiment from text is described in [9]. SentiStrength [6] is a tool which uses lexicon based sentiment analysis of short text. Each word in the lexicon is assigned a score between +5 (very positive) and -5 (very negative).
Sentiment analysis in Indian languages is a new topic for research. WordNet, POS tagger, SentiWordNet and other resources are available for very few languages. Development of a SentiWordNet for Indian languages like Bengali, Hindi and Telugu is attempted using WordNet, dictionary based and corpus based methods in [10]. In [11], a Hindi sentiment lexicon called Hindi-SentiWordNet (H-SWN) was developed using English SentiWordNet (SWN) and English-Hindi WordNet linking. The sentiment analysis of documents was performed using sentiment annotated corpora, machine translation and resource based sentiment analysis methods. In [12], the opinion polarity of a given document was predicted using a classifier trained on an annotated corpus of another language. The classifier trained using linear SVM for Hindi

978-1-4799-8047-5/15/$31.00 © 2015 IEEE
and Marathi achieved an accuracy of 72% and 84% respectively.

III. CURRENT WORK
In the current work, polarity detection of Kannada text documents is performed using the polarity lexicon method and also the machine learning algorithms Naïve Bayes and Maximum Entropy.

A. Polarity Lexicon
A Kannada polarity lexicon was manually created in which each word is assigned a polarity score between -5 (very negative) and +5 (very positive). A total of 5043 Kannada words were assigned a polarity score in the lexicon. Table I shows a few sample entries in the polarity lexicon file.

TABLE I. Sample entries in the Kannada polarity lexicon

Word            Score
ShDĦ*           -2
ShľÄĦ*          -2
VÎĦBi^*         -2
£eLĔs[Ÿl*       -2
¤lÎsZ*          3
ZhShÅ*          3
®eQÕ^Í*         3
®eÍūsd*         3
4B´ªl*          3
4UhªeD*         3

It can be observed that the words ¤lÎsZ, 4UhªeD have positive polarity and ShľÄĦ, ShDĦ have negative polarity. Words like Q¬lŸl’l, c`£ea, li­lQh£e\h have a score of -5 to indicate very negative. Words like 4ŸeÍU2S, c2ŸlisbYĸQ, dĽQ have a polarity score of 5 to indicate very positive. The * indicates that the word is in its stem or root form. For example, the word ¤lÎsZ can take different forms like ¤lÎsZĨ2S, ¤lÎsZe´Ė, ¤lÎsZli´c´\ etc. The entry in the lexicon should match all the inflections of the stem word. Hence, a rule based stemmer is used to perform stemming of the words before the polarity lexicon lookup.

B. Rule Based Kannada Stemmer
The rule based stemmer works on the Paice method; a list of 326 different suffixes is stored in a file. The suffixes of words are matched against the list and the words are stemmed if inflected. For example, DhPD_UhÇ is stemmed as DhP. To improve the performance of the stemmer, a stem word dictionary of 18,805 stem words is added to the stemmer module. For 6,500 random words from the EMILLE corpus [14], the stemmer gave 67% accuracy. The words in the sentence are first stemmed using the rule based stemmer and then the polarity lexicon is searched for a match.

C. Polarity Negation
Another important aspect to be considered while deciding the polarity is negation. In the Kannada language, addition of a suffix like DĺĖ^Ð, ©eĖ^Ð, Ė^Ð, `2Ħ^Ð negates the sentiment of the word. For example, ShľÄĦ is negative whereas ShľÄĦ[^Ð toggles the meaning, making it positive. Similarly, 4B´ªl has positive polarity whereas 4B´ªl[^Ð is negative. A list of such suffixes is maintained in a file, and the polarity of the word found in the lexicon is toggled if the suffix of the word matches an entry in the suffix list.

D. Sentence Boundary Detection
For a given Kannada text document, the polarity of each sentence is found using the polarity lexicon. Before that, sentence boundary detection has to be performed, as a period may not always represent the end of a sentence. For example, in œe., Ħs.U2.ļÎsB2M[ÍU`\h a period does not indicate the end of a sentence. For sentence boundary detection, rule based sentence boundary detection using abbreviation and verb suffix lists, as discussed in [13], is applied.

E. Lexicon Based Document Polarity Detection Algorithm
For every sentence in a document,
Step 1: Extract a word from the sentence, stem the word and match it with the polarity lexicon word list.
Step 2: If the word is found, check whether the suffix of the word matches an entry in the negation suffix list.
Step 3: If the suffix is found in the negation suffix list, the polarity score of the word is negated.
Step 4: Go to Step 1 for the next word in the given sentence.
Step 5: At the end of all words in the sentence, the sum of the polarity scores of all words found in the sentence is recorded.
Step 6: After summation, if the polarity of the sentence is negative, the sentence is labelled as a negative sentence; else, it is labelled as a positive sentence.

The polarity of the document is based on the number of positive sentences and negative sentences. If the total number of positive sentences is more than the number of negative sentences, the document is considered a positive document; else it is a negative polarity document.

F. Polarity Detection as Classification Task
The polarity detection of Kannada documents can also be viewed as a problem of binary text classification with the classes positive and negative. Machine learning algorithms like Naïve Bayes and Maximum Entropy were tested for polarity detection.
In the Naïve Bayes method, the probability of a document $d$ belonging to class $c_i$ is given by

2015 IEEE International Advance Computing Conference (IACC) 765


$$P(c_i \mid d) = \frac{P(d \mid c_i)\,P(c_i)}{P(d)} \tag{2}$$

The document is represented as a bag-of-words, which contains words with their corresponding frequencies. The class of the document is the one with the maximum posterior probability, given by:

$$C_{MAP} = \underset{c \in C}{\arg\max}\; P(c \mid d) \tag{3}$$

Replacing $P(c \mid d)$ using the Bayes rule:

$$= \underset{c \in C}{\arg\max}\; \frac{P(d \mid c)\,P(c)}{P(d)} \tag{4}$$

The denominator is the probability of the document, which is identical for all classes. Hence, the denominator can be removed and the expression still gives the most likely class.

$$= \underset{c \in C}{\arg\max}\; P(d \mid c)\,P(c) \tag{5}$$

$P(d \mid c)$ is the likelihood and $P(c)$ is the prior probability.

$$= \underset{c \in C}{\arg\max}\; P(x_1, x_2, \ldots, x_n \mid c)\,P(c) \tag{6}$$

Here $x_1, x_2, \ldots, x_n$ are features. The bag-of-words assumption is that the position of the words is not considered. The feature probabilities are assumed to be independent given a class $c$.
The Maximum Entropy method is a probabilistic classifier that belongs to the class of exponential models. Unlike Naïve Bayes, Maximum Entropy does not assume that the features are conditionally independent of each other. It considers the contextual information about the document to find its class. The steps to build a model using Maximum Entropy are:

• Convert the training data into samples of the form $(x_i, y_i)$, where $x_i$ is the contextual information of the document and $y_i$ is its class.
• Summarize the training sample using the empirical probability distribution:
$$\tilde{p}(x, y) = \frac{1}{N} \times \text{number of times } (x, y) \text{ occurs in the training sample}$$
where $N$ is the number of training documents. This assigns the text to a particular class based on the contextual information. A feature is represented as follows:
$$f_j(x, y) = \begin{cases} 1 & \text{if } y = c_i \text{ and } x \text{ contains } w_k \\ 0 & \text{otherwise} \end{cases}$$
This binary valued function returns 1 if the class of the document is $c_i$ and the document contains the word $w_k$.
• The expected value of feature $f_j$ w.r.t. the empirical distribution $\tilde{p}(x, y)$ is equal to:
$$\tilde{p}(f_j) \equiv \sum_{x,y} \tilde{p}(x, y)\, f_j(x, y) \tag{7}$$
• The expected value of feature $f_j$ w.r.t. the model $p(y \mid x)$ is equal to:
$$p(f_j) \equiv \sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_j(x, y) \tag{8}$$
where $\tilde{p}(x)$ is the empirical distribution of $x$ in the training dataset and is set to $1/N$.
• By constraining the expected value to be equal to the empirical value, the equation is:
$$\sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_j(x, y) = \sum_{x,y} \tilde{p}(x, y)\, f_j(x, y) \tag{9}$$
• Select the model with Maximum Entropy, given that:
  1. $p(y \mid x) \ge 0$ for all $x, y$
  2. $\sum_{y} p(y \mid x) = 1$ for all $x$
  3. $\sum_{x,y} \tilde{p}(x)\, p(y \mid x)\, f_j(x, y) = \sum_{x,y} \tilde{p}(x, y)\, f_j(x, y)$ for $j \in \{1, 2, 3, \ldots, n\}$
• If parameters $\{\lambda_1, \ldots, \lambda_n\}$ are found which maximize the dual problem, the probability that a given document $x$ belongs to class $y$ is equal to:
$$p^{*}(y \mid x) = \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}{\sum_{y'} \exp\left(\sum_i \lambda_i f_i(x, y')\right)} \tag{10}$$

IV. RESULTS
The polarity detection of Kannada text documents was conducted on a corpus created with positive and negative polarity documents. Negative documents contained text related to violence, accidents etc.; positive documents contained text explaining arts and education. The total document count in the corpus is given in Table II.

TABLE II. Kannada documents used for training and testing.

Class       No. of documents
Positive    213
Negative    131

The results of polarity classification of Kannada documents using the polarity lexicon method and the machine learning algorithms Naïve Bayes and Maximum Entropy are shown in Tables IV and V. The machine learning algorithms use 5-fold cross validation on the corpus mentioned in Table II, as the number of documents for training and testing is limited. Figure 1 shows a comparison of accuracies using the polarity lexicon, Naïve Bayes and Maximum Entropy methods. The confusion matrix is represented as shown in Table III.

TABLE III. Confusion matrix

                    Detected Positive      Detected Negative
Actual Positive     True Positive (TP)     False Negative (FN)
Actual Negative     False Positive (FP)    True Negative (TN)

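The 5-fold cross validation mentioned above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the procedure actually used in this work: the helper names `k_fold_indices` and `mean_cv_accuracy` are hypothetical, the split is a plain shuffled (non-stratified) partition, and the classifier is passed in as two caller-supplied functions. The returned value corresponds to the "overall test accuracy", i.e. the mean accuracy over the folds.

```python
import random
from typing import Any, Callable, List, Sequence, Tuple


def k_fold_indices(n_items: int, k: int = 5, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs covering all items once as test data."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)        # fixed seed keeps the split reproducible
    folds = [indices[i::k] for i in range(k)]   # k nearly equal-sized, disjoint folds
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


def mean_cv_accuracy(
    docs: Sequence[Any],
    labels: Sequence[str],
    train_fn: Callable[[List[Any], List[str]], Any],   # hypothetical: builds a model
    predict_fn: Callable[[Any, Any], str],             # hypothetical: labels one document
    k: int = 5,
    seed: int = 0,
) -> float:
    """Train on k-1 folds, test on the held-out fold, and average the fold accuracies."""
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(docs), k, seed):
        model = train_fn([docs[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, docs[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)
```

With the 344 documents of Table II, such a split yields four test folds of 69 documents and one of 68, so every document is used for testing exactly once.
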
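The four cells of the confusion matrix in Table III are sufficient to compute accuracy together with per-class precision, recall and F-measure. The following is a minimal Python sketch under stated assumptions: the names `ConfusionMatrix`, `f_measure` and `evaluate` are hypothetical, the negative-class scores are obtained by treating Negative as the class of interest, and the counts in the usage lines are made-up illustration values, since the raw confusion matrices are not reported in this paper.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts laid out as in Table III (rows: actual class, columns: detected class)."""
    tp: int  # actual positive, detected positive
    fn: int  # actual positive, detected negative
    fp: int  # actual negative, detected positive
    tn: int  # actual negative, detected negative


def _ratio(numerator: float, denominator: float) -> float:
    # Return 0.0 instead of raising when a class is never present or never predicted.
    return numerator / denominator if denominator else 0.0


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return _ratio(2 * precision * recall, precision + recall)


def evaluate(cm: ConfusionMatrix) -> Dict[str, float]:
    """Accuracy plus precision, recall and F-measure for both classes."""
    precision_pos = _ratio(cm.tp, cm.tp + cm.fp)
    recall_pos = _ratio(cm.tp, cm.tp + cm.fn)
    # For the negative class the roles of the cells are mirrored.
    precision_neg = _ratio(cm.tn, cm.tn + cm.fn)
    recall_neg = _ratio(cm.tn, cm.tn + cm.fp)
    return {
        "accuracy": _ratio(cm.tp + cm.tn, cm.tp + cm.fn + cm.fp + cm.tn),
        "precision_pos": precision_pos,
        "recall_pos": recall_pos,
        "f_measure_pos": f_measure(precision_pos, recall_pos),
        "precision_neg": precision_neg,
        "recall_neg": recall_neg,
        "f_measure_neg": f_measure(precision_neg, recall_neg),
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    for name, value in evaluate(ConfusionMatrix(tp=8, fn=2, fp=1, tn=9)).items():
        print(f"{name}: {value:.2f}")
```

Reporting both classes separately, as in the result tables of this paper, makes it visible when a method does well on one class and poorly on the other even though the overall accuracy looks acceptable.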


The accuracy, precision, recall and F-measure are calculated from the confusion matrix as shown below:

• $\text{Accuracy} = \dfrac{TP + TN}{TP + FN + FP + TN}$
• $\text{Precision} = \dfrac{TP}{TP + FP}$
• $\text{Recall} = \dfrac{TP}{TP + FN}$
• $\text{F-Measure} = 2 \times \dfrac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$

For the polarity lexicon method, the precision, recall and F-measure for the corpus shown in Table II are as shown in Table IV.

TABLE IV. Results for polarity detection of Kannada documents using polarity lexicon

Method             Precision            Recall               F-Measure            Accuracy
                   Positive  Negative   Positive  Negative   Positive  Negative
Polarity Lexicon   0.78      0.77       0.89      0.60       0.82      0.67       0.78

The results of polarity detection using Naïve Bayes and Maximum Entropy for the corpus shown in Table II with 5-fold cross validation are as shown in Table V. Overall test accuracy is the mean accuracy of the 5-fold cross validation.

TABLE V. Results for polarity detection of Kannada documents

Method             Precision            Recall               F-Measure            Overall test
                   Positive  Negative   Positive  Negative   Positive  Negative   accuracy
Naïve Bayes        0.90      0.90       0.94      0.84       0.92      0.87       0.90
Maximum Entropy    0.93      0.92       0.95      0.89       0.94      0.90       0.93

[Figure: bar chart of accuracy (axis from 0.7 to 1) for Maximum Entropy, Naïve Bayes and Polarity Lexicon]
Fig. 1. Accuracy of experiments on polarity detection

From the results it can be observed that Maximum Entropy performs better than Naïve Bayes and the polarity lexicon lookup method. This is because the Maximum Entropy method takes context into consideration: each word, which is a feature, is not treated as independent as it is in Naïve Bayes. The polarity lexicon is not extensive and hence gives lower performance.

V. CONCLUSION
Polarity detection has an important application in content filtering and opinion mining. With so much negativity everywhere, this is a very important topic of research. The applications developed for the English language cannot be directly used for languages like Kannada. In this paper, a polarity lexicon is created for a resource deprived language, Kannada. The results show that Naïve Bayes and Maximum Entropy perform better than the dictionary method using the polarity lexicon. With the addition of more words to the lexicon, however, the performance of the dictionary method can be further improved.

REFERENCES
[1] Liu, Bing. "Sentiment analysis and subjectivity." Handbook of natural language processing, 2, 2010, pp. 627-666.
[2] P. Turney, "Thumbs up or thumbs down? Semantic orientation Applied to Unsupervised Classification of Reviews," in Proc. of the Association for Computational Linguistics, Philadelphia, 2002, pp. 417-424.
[3] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment Classification using Machine Learning Techniques," in Proc. of the Empirical Methods on Natural Language Processing, Pennsylvania, 2002, pp. 79-86.
[4] C. Fellbaum, ed., "Wordnet: An Electronic Lexical Database". MIT Press, 1998.
[5] Esuli, Andrea, and Fabrizio Sebastiani. "Sentiwordnet: A publicly available lexical resource for opinion mining." In Proceedings of LREC, vol. 6, 2006, pp. 417-422.
[6] M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, A. Kappas, "Sentiment strength detection in short informal text". Journal of the American Society for Information Science and Technology, vol. 61(12), 2010, pp. 2544–2558.
[7] Hatzivassiloglou, Vasileios, and Janyce M. Wiebe. "Effects of adjective orientation and gradability on sentence subjectivity." Proceedings of the 18th conference on Computational linguistics-Volume 1. Association for Computational Linguistics, 2000, pp. 299-305.
[8] A. Pak, P. Paroubek, "Twitter as a Corpus for Sentiment Analysis and Opinion Mining". In LREC, vol. 10, 2010, pp. 1320-1326.
[9] Taboada, Maite, et al. "Lexicon-based methods for sentiment analysis." Computational linguistics 37.2, 2011, pp. 267-307.
[10] A. Das and S. Bandyopadhyay, "SentiWordNet for Indian Languages", Asian Federation for Natural Language Processing (COLING), China, 2010, pp. 56-63.
[11] A. Joshi, B. A. R, and P. Bhattacharyya, "A fall-back strategy for sentiment analysis in Hindi: a case study". In Proc. of International Conference On Natural Language Processing (ICON), 2010.
[12] Balamurali A R, Aditya Joshi, Pushpak Bhattacharyya, "Cross-Lingual Sentiment Analysis for Indian Languages using Linked WordNets". Proceedings of COLING 2012: Posters, COLING 2012, Mumbai, December 2012, pp. 73–82.
[13] N. Deepamala, P. Ramakanth Kumar, "Kannada Sentence Boundary Detection using Rule based and Maximum Entropy Methods", Advanced Research in Engineering and Technology, vol. 2, 2013, pp. 510-512.
[14] McEnery, A., Baker, P., Gaizauskas, R., Cunningham, H.: EMILLE: Building a corpus of South Asian languages. Vivek-Bombay 13(3), 2000, pp. 22-28.
