Embarrassingly Simple Unsupervised Aspect Extraction

Stéphan Tulkens
CLiPS, University of Antwerp, Belgium
[email protected]

Andreas van Cranenburgh
Department of Information Science, University of Groningen, The Netherlands
[email protected]

Abstract

We present a simple but effective method for aspect identification in sentiment analysis. Our unsupervised method only requires word embeddings and a POS tagger, and is therefore straightforward to apply to new domains and languages. We introduce Contrastive Attention (CAt), a novel single-head attention mechanism based on an RBF kernel, which gives a considerable boost in performance and makes the model interpretable. Previous work relied on syntactic features and complex neural models. We show that given the simplicity of current benchmark datasets for aspect extraction, such complex models are not needed. The code to reproduce the experiments reported in this paper is available at https://github.com/clips/cat.

Figure 1: An example of a sentence expressing two aspects (red) on a target (italics): "The two things that really drew me to vinyl were the expense and the inconvenience." Source: https://www.newyorker.com/cartoon/a19180

Figure 2: An overview of our aspect extraction model. A sentence S (a matrix of word vectors) is compared to a set of aspect vectors A through an RBF kernel to produce an attention vector att; the attention-weighted sentence summary d is then matched against the aspect labels (food, staff, ambience).
1 Introduction

We consider the task of unsupervised aspect extraction from text. In sentiment analysis, an aspect can intuitively be defined as a dimension on which an entity is evaluated (see Figure 1). While aspects can be concrete (e.g., a laptop battery), they can also be subjective (e.g., the loudness of a motorcycle). Aspect extraction is an important subtask of aspect-based sentiment analysis. However, most existing systems are supervised (for an overview, cf. Zhang et al., 2018). As aspects are domain-specific, supervised systems that rely on strictly lexical cues to differentiate between aspects are unlikely to transfer well between different domains (Rietzler et al., 2019). Another reason to consider the unsupervised extraction of aspect terms is the scarcity of training data for many domains (e.g., books), and, more importantly, the complete lack of training data for many languages. Unsupervised aspect extraction has previously been attempted with topic models (Mukherjee and Liu, 2012), topic model hybrids (García-Pablos et al., 2018), and restricted Boltzmann machines (Wang et al., 2015), among others. Recently, autoencoders using attention mechanisms (He et al., 2017; Luo et al., 2019) have also been proposed as a method for aspect extraction, and have reached state-of-the-art performance on a variety of datasets. These models are unsupervised in the sense that they do not require labeled data, although they do rely on unlabeled data to learn relevant patterns. In addition, these are complex neural models with a large number of parameters. We show that a much simpler model suffices for this task.

We present a simple unsupervised method for aspect extraction which only requires a POS tagger and in-domain word embeddings, trained on a small set of documents. We introduce a novel single-head attention mechanism, Contrastive Attention (CAt), based on Radial Basis Function (RBF) kernels. Compared to conventional attention mechanisms (Weston et al., 2014; Sukhbaatar et al., 2015), CAt captures more relevant information from a sentence. Our method outperforms more complex methods, e.g., attention-based neural networks (He et al., 2017; Luo et al., 2019). In addition, our method automatically assigns aspect labels, while in previous work, labels are manually assigned to aspect clusters. Finally, we present an analysis of the limitations of our model, and propose some directions for future research.
2 Method

Like previous methods (Hu and Liu, 2004; Xu et al., 2013), our method (see Figure 2) consists of two steps: extraction of candidate aspect terms and assigning aspect labels to instances. Both steps assume a set of in-domain word embeddings, which we train using word2vec (Mikolov et al., 2013). We use a small set of in-domain documents, containing about 4 million tokens for the restaurant domain.
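As a concrete illustration, the following is a minimal sketch of this embedding step in Python. The use of gensim, the function name, and the hyperparameters are our assumptions; the paper only specifies word2vec and the approximate corpus size, not the training code.

    from gensim.models import Word2Vec

    # corpus: an iterable of tokenized, lowercased sentences, e.g.
    # [["the", "bread", "is", "top", "notch"], ...]
    def train_embeddings(corpus, dim=200):
        # gensim >= 4.0 API; all hyperparameter values here are assumptions
        model = Word2Vec(
            sentences=corpus,
            vector_size=dim,  # embedding dimensionality
            window=5,         # context window
            min_count=2,      # discard very rare words
            sg=1,             # skip-gram
            workers=4,
        )
        return model.wv       # KeyedVectors: word -> vector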
Step 1: aspect term extraction. In previous work (Hu and Liu, 2004; Xu et al., 2013), the main assumption has been that nouns that are frequently modified by sentiment-bearing adjectives (e.g., good, bad, ugly) are likely to be aspect nouns. We experimented with this notion and devised a labeling strategy in which aspects are extracted based on their co-occurrence with seed adjectives. However, during experimentation we found that for the datasets in this paper, the most frequent nouns were already good aspects; any further constraint led to far worse performance on the development set. This means that our method only needs a POS tagger to recognize nouns, not a full-fledged parser. Throughout this paper, we use spaCy (Honnibal and Montani, 2017) for tokenization and POS tagging. In Section 5, we investigate how these choices impact performance.
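A minimal sketch of this step follows; the function name, its parameters, and the choice of spaCy model are ours, not taken from the reference implementation at https://github.com/clips/cat.

    from collections import Counter

    import spacy

    nlp = spacy.load("en_core_web_sm")  # an off-the-shelf English model (assumed)

    def extract_aspect_candidates(texts, top_n=200):
        # Count every token tagged as a noun; the most frequent nouns
        # serve directly as candidate aspect terms (no parser needed).
        counts = Counter()
        for doc in nlp.pipe(texts):
            counts.update(t.lower_ for t in doc if t.pos_ == "NOUN")
        return [noun for noun, _ in counts.most_common(top_n)]

The cut-off top_n corresponds to the number of aspect terms tuned in Section 4 (200 for the RBF attention).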
Step 2: aspect selection using Contrastive Attention. We use a simple form of attention, similar to the attention mechanism used in memory networks (Weston et al., 2014; Sukhbaatar et al., 2015). With an attention mechanism, a sequence of words, e.g., a sentence or a document, is embedded into a matrix S, which is operated on with an aspect a to produce a probability distribution, att. Schematically:

    att = softmax(aS)    (1)

att is then multiplied with S to produce an informative summary with respect to the aspect a:

    d = Σᵢ attᵢ Sᵢ    (2)

where d is the weighted sentence summary. There is no reason to restrict a to be a single vector: when replaced by a matrix of queries, A, the equation above gives a separate attention distribution for each aspect, which can then be used to create different summaries, thereby keeping track of different pieces of information. In our specific case, however, we are interested in tracking which words elicit aspects, regardless of the aspect to which they belong. We address this by introducing Contrastive Attention (CAt), a way of calculating attention that integrates a set of query vectors into a single attention distribution. It uses an RBF kernel, which is defined as follows:

    rbf(x, y, γ) = exp(−γ ‖x − y‖₂²)    (3)

where x and y are vectors, and γ is a scaling factor, which we treat as a hyperparameter. An important property of the RBF kernel is that it turns an arbitrary unbounded distance, the squared euclidean distance in this case, into a bounded similarity. For example, regardless of γ, if x and y have a distance of 0, their RBF response will be 1. As their distance increases, their similarity decreases, and will eventually asymptote towards 0, depending on γ. Given the RBF kernel, a matrix S, and a set of aspect vectors A, the attention for a word w in S is calculated as follows:

    att(w) = Σ_{a∈A} rbf(w, a, γ) / Σ_{w′∈S} Σ_{a∈A} rbf(w′, a, γ)    (4)

The attention for a given word is thus the sum of its RBF responses to all vectors in A, divided by the sum of the RBF responses of all words in S to all vectors in A. This defines a probability distribution over words in the sentence or document, where words that are, on average, more similar to aspects get assigned a higher score.

Figure 3: Examples of Contrastive Attention (γ=.03): "the bread is top notch as well." / "best spicy tuna roll, great asian salad." / "also get the onion rings – best we've ever had."
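Equations (2)–(4) translate directly into a few lines of numpy. The sketch below is our transcription of the formulas, not the authors' code (which is available at https://github.com/clips/cat):

    import numpy as np

    def rbf(x, y, gamma):
        # Equation (3): a bounded similarity in (0, 1] derived from the
        # squared euclidean distance between x and y.
        return np.exp(-gamma * np.sum((x - y) ** 2))

    def contrastive_attention(S, A, gamma=0.03):
        # Equation (4): each word's weight is its summed RBF response
        # to all aspect vectors, normalized over the whole sentence.
        scores = np.array([sum(rbf(w, a, gamma) for a in A) for w in S])
        att = scores / scores.sum()
        # Equation (2): the attention-weighted sentence summary.
        d = att @ S
        return att, d

Here S is the matrix of word vectors for one sentence and A contains the vectors of the extracted aspect terms; γ=.03 is the value tuned in Section 4.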
Step 3: assigning aspect labels. After reweighting the word vectors, we label each document based on the cosine similarity between the weighted document vector d and the label vector:

    ŷ = argmax_{c∈C} cos(d, c⃗)    (5)

where C is the set of labels, i.e., {FOOD, AMBIENCE, STAFF}. In the current work, we use the word embeddings of the labels as the targets. This avoids the inherent subjectivity of manually assigning aspect labels, the strategy employed in previous work (He et al., 2017; Luo et al., 2019).
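A sketch of this final step, again a direct transcription of equation (5) rather than the reference implementation; building the label-to-vector mapping from the embeddings of the label names is the convention described above.

    import numpy as np

    def assign_label(d, label_vectors):
        # label_vectors: e.g. {"food": wv["food"], "staff": wv["staff"],
        #                      "ambience": wv["ambience"]}
        def cos(a, b):
            return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        # Equation (5): pick the label whose embedding is most similar
        # to the weighted document summary d.
        return max(label_vectors, key=lambda c: cos(d, label_vectors[c]))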
3 Datasets

We use several English datasets of restaurant reviews for the aspect extraction task. All datasets have been annotated with one or more sentence-level labels, indicating the aspect expressed in that sentence (e.g., the sentence "The sushi was great" would be assigned the label FOOD). We evaluate our approach on the Citysearch dataset (Ganu et al., 2009), which uses the same labels as the SemEval datasets. To avoid optimizing for a single corpus, we use the restaurant subsets of the SemEval 2014 (Pontiki et al., 2014) and SemEval 2015 (Pontiki et al., 2015) datasets as development data. Note that, even though our method is completely unsupervised, we explicitly allocate test data to ensure proper methodological soundness, and do not optimize any models on the test set. Following previous work (He et al., 2017; Ganu et al., 2009), we restrict ourselves to sentences that express exactly one aspect; sentences that express more than one aspect, or no aspect at all, are discarded. Additionally, we restrict ourselves to three labels: FOOD, STAFF, and AMBIENCE. We adopt these restrictions in order to compare to other systems. Moreover, previous work (Brody and Elhadad, 2010) reported that the other labels, ANECDOTES and PRICE, were not reliably annotated. Table 1 shows statistics of the datasets.

Table 1: The number of sentences in each of the datasets after removing sentences that did not express exactly one aspect in our set of aspects.

    Dataset              Train    Test
    Citysearch (2009)        —   1,490
    SemEval (2014)       3,041     402
    SemEval (2015)       1,315     250
4 Evaluation

We optimize all our models on the SemEval '14 and '15 training data; the scores on the Citysearch dataset do not reflect any form of optimization with regard to performance. We optimize the hyperparameters of each model separately (i.e., the number of aspect terms and the γ of the RBF kernel), leading to the following settings: for the regular attention, we select the top 980 nouns as aspect candidates; for the RBF attention, we use the top 200 nouns and a γ of .03.
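The tuning procedure itself is not spelled out in the paper; one straightforward realization is a small grid search against weighted F1 on the development sets, as in the hypothetical sketch below (both candidate grids and the evaluate callback are our assumptions; only the winning values, 980/200 nouns and γ=.03, are reported).

    from itertools import product

    def tune(evaluate, top_ns=(100, 200, 500, 1000), gammas=(0.01, 0.03, 0.1)):
        # evaluate(top_n, gamma) should return weighted macro F1 on the
        # SemEval development data; both grids here are hypothetical.
        return max(
            product(top_ns, gammas),
            key=lambda cfg: evaluate(top_n=cfg[0], gamma=cfg[1]),
        )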


We compare our system to four other systems. W2VLDA (García-Pablos et al., 2018) is a topic modeling approach that biases word-aspect associations by computing the similarity from a word to a set of aspect terms. SERBM (Wang et al., 2015) is a restricted Boltzmann machine (RBM) that learns topic distributions and assigns individual words to these distributions; in doing so, it learns to assign words to aspects. We also compare our system to two attention-based systems. First, ABAE (He et al., 2017) is an auto-encoder that learns an attention distribution over the words in a sentence by simultaneously considering the global context and aspect vectors; in doing so, ABAE learns an attention distribution as well as appropriate aspect vectors. Second, AE-CSA (Luo et al., 2019) is a hierarchical model similar to ABAE; in addition to word vectors and aspect vectors, it also considers sense and sememe (Bloomfield, 1926) vectors in computing the attention distribution. Note that all these systems, although unsupervised, do require training data and need to be fit to a specific domain. Hence, all these systems rely on the existence of in-domain training data on which to learn reconstructions and/or topic distributions. Furthermore, much like our approach, ABAE, AE-CSA, and W2VLDA rely on the availability of pre-trained word embeddings. Additionally, AE-CSA needs a dictionary of senses and sememes, which might not be available for all languages or domains. Compared to these systems, our system does require a UD POS tagger to extract frequent nouns. However, this can be an off-the-shelf POS tagger, since it does not need to be trained on domain-specific data.

We also compare our system to a baseline based on the mean of word embeddings, to a version of our system using regular attention, and to a version using Contrastive Attention (CAt).

The results are shown in Table 3. Because of class imbalance (60% of instances are labeled FOOD), the F-scores in Table 3 do not give a representative picture of model performance. Therefore, we also report weighted macro-averaged scores in Table 2. Our system outperforms ABAE, AE-CSA, and the other systems, both in weighted macro-average F1 score and on the individual aspects. In addition, Table 2 shows that the difference between ABAE and SERBM is smaller than one would expect based on the per-label F1 scores, on which ABAE outperforms SERBM on STAFF and AMBIENCE. The Mean model still performs well on this dataset, even though it does not use any attention or knowledge of aspects. This implies that aspect knowledge is probably not required to perform well on this dataset; focusing on lexical semantics is enough.

Table 2: Weighted macro averages across all aspects on the test set of the Citysearch dataset.

    Method            P      R      F
    SERBM (2015)   86.0   74.6   79.5
    ABAE (2017)    89.4   73.0   79.6
    W2VLDA (2018)  80.8   70.0   75.8
    AE-CSA (2019)  85.6   86.0   85.8
    Mean           78.9   76.9   77.2
    Attention      80.5   80.7   80.6
    CAt            86.5   86.4   86.4

Table 3: Precision, recall, and F-scores on the test set of the Citysearch dataset.

    Aspect: FOOD
    Method            P      R      F
    SERBM (2015)   89.1   85.4   87.2
    ABAE (2017)    95.3   74.1   82.8
    W2VLDA (2018)  96.0   69.0   81.0
    AE-CSA (2019)  90.3   92.6   91.4
    Mean           92.4   73.5   85.6
    Attention      86.7   89.5   88.1
    CAt            91.8   92.4   92.1

    Aspect: STAFF
    Method            P      R      F
    SERBM (2015)   81.9   58.2   68.0
    ABAE (2017)    80.2   72.8   75.7
    W2VLDA (2018)  61.0   86.0   71.0
    AE-CSA (2019)  92.6   75.6   77.3
    Mean           55.8   85.7   67.5
    Attention      74.4   69.3   71.8
    CAt            82.4   75.6   78.8

    Aspect: AMBIENCE
    Method            P      R      F
    SERBM (2015)   80.5   59.2   68.2
    ABAE (2017)    81.5   69.8   74.0
    W2VLDA (2018)  55.0   75.0   64.0
    AE-CSA (2019)  91.4   77.9   77.0
    Mean           58.7   56.1   57.4
    Attention      67.1   65.7   66.4
    CAt            76.6   80.1   76.6
5 Analysis

We perform an ablation study to see the influence of each component of our system; specifically, we look at the effect of POS tagging, of in-domain word embeddings, and of the amount of training data on performance.

Only selecting the most frequent words as aspects, regardless of their POS tag, had a detrimental effect on performance, giving an F-score of 64.5 (∆ −21.9), while selecting nouns based on adjective-noun co-occurrence had a smaller detrimental effect, giving an F-score of 84.4 (∆ −2.2), still higher than ABAE and SERBM.

Replacing the in-domain word embeddings trained on the training set with pretrained GloVe embeddings (Pennington et al., 2014; specifically, the glove.6B.200D vectors from https://nlp.stanford.edu/projects/glove/) had a large detrimental effect on performance, dropping the F-score to 54.4 (∆ −32); this shows that in-domain data is important.

To investigate how much in-domain data is required to achieve good performance, we perform a learning curve experiment (Figure 4). We increase the training data in 10% increments, training five word2vec models at each increment. As the figure shows, only a modest amount of data (about 260k sentences) is needed to tackle this specific dataset.

Figure 4: A learning curve on the restaurant data, averaged over 5 embedding models (weighted F1 against the percentage of training data used; 326k sentences in total).

To further investigate the limits of our model, we perform a simple error analysis on our best performing model. Table 4 shows a manual categorization of observed error types. Several of the errors relate to out-of-vocabulary (OOV) or low-frequency items, such as the words 'Somosas' (OOV) and 'Dhal' (low frequency). Since our model is purely based on lexical similarity, homonyms and polysemous words can lead to errors. An example of this is the word 'course,' which our model interprets as being about food. As the aspect terms we use are restricted to nouns, the model also misses aspects expressed in verbs, such as "waited for food." Finally, discourse context and implicatures often lead to errors. The model does not capture enough context or world knowledge to infer that 'no free drink' does not express an opinion about drinks, but about service.

Table 4: A categorization of observed error types.

    Phenomenon      Example
    OOV             "I like the Somosas"
    Data sparsity   "great Dhal"
    Homonymy        "Of course"
    Verb > Noun     "Waited for food"
    Discourse       "She didn't offer dessert"
    Implicature     "No free drink"

Given these errors, we surmise that our model will perform less well in domains in which aspects are expressed in a less overt way. For example, consider the following sentence from a book review (Kirkus Reviews, 2019):

(1) As usual, Beaton conceals any number of surprises behind her trademark wry humor.

This sentence touches on a range of aspects, including writing style, plot, and a general opinion on the book that is being reviewed. Such domains might also require the use of more sophisticated aspect term extraction methods.

However, it is not the case that our model necessarily overlooks implicit aspects. For example, the word "cheap" often signals an opinion about the price of something. As the embedding of the word "cheap" is highly similar to that of "price," our model will attend to "cheap" as long as enough price-related terms are in the set of extracted aspect terms of the model.

In the future, we would like to address the limitations of the current method, and apply it to datasets with other domains and languages. Such datasets exist, but we have not yet evaluated our system on them due to the lack of sufficient unannotated in-domain data in addition to the annotated data.

Given the performance of CAt, especially compared to regular dot-product attention, it would be interesting to see how it performs as a replacement for regular attention in supervised models, e.g., memory networks (Weston et al., 2014; Sukhbaatar et al., 2015). Additionally, it would be interesting to investigate why CAt outperforms regular dot-product attention. Currently, our understanding is that dot-product attention places a high emphasis on words with a higher vector norm: words with a higher norm have, on average, a higher inner product with other vectors. As the norm of a word embedding directly relates to the frequency of the word in the training corpus, regular dot-product attention naturally attends to more frequent words. In a network with trainable parameters, such as ABAE (He et al., 2017), this effect can be mitigated by finetuning the embeddings or by other weighting mechanisms. In our system, no such training is available, which can explain the suitability of CAt as an unsupervised aspect extraction mechanism.
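This norm effect is easy to observe in a toy setting. In the sketch below (entirely synthetic vectors, for illustration only, not an experiment from the paper), scaling up one word vector inflates its softmax dot-product attention, while its RBF response, being bounded, cannot grow without limit:

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.normal(size=8)                   # a toy aspect vector
    w = 0.2 * a + 0.1 * rng.normal(size=8)   # a word loosely aligned with a
    others = rng.normal(size=(3, 8))         # three other words

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    for scale in (1.0, 2.0, 4.0):            # simulate an increasing norm
        S = np.vstack([scale * w, others])
        dot_att = softmax(S @ a)[0]          # dot-product attention on w grows
        rbf = np.exp(-0.03 * np.sum((scale * w - a) ** 2))  # equation (3), stays in (0, 1]
        print(f"scale={scale}: dot-att={dot_att:.3f}, rbf={rbf:.3f}")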
6 Conclusion

We present a simple model of aspect extraction that uses a frequency threshold for candidate selection, a novel attention mechanism based on RBF kernels, and an automated aspect assignment method. We show that for the task of assigning aspects to sentences in the restaurant domain, the RBF kernel attention mechanism outperforms a regular attention mechanism, as well as more complex models based on auto-encoders and topic models.

Acknowledgments

We are grateful to the three reviewers for their feedback. The first author was sponsored by a Fonds Wetenschappelijk Onderzoek (FWO) aspirantschap.

References
Leonard Bloomfield. 1926. A set of postulates for the science of language. Language, 2(3):153–164.

Samuel Brody and Noemie Elhadad. 2010. An unsupervised aspect-sentiment model for online reviews. In Proceedings of NAACL-HLT, pages 804–812.

Gayatree Ganu, Noemie Elhadad, and Amélie Marian. 2009. Beyond the stars: Improving rating predictions using review text content. In Proceedings of WebDB, volume 9, pages 1–6.

Aitor García-Pablos, Montse Cuadros, and German Rigau. 2018. W2VLDA: Almost unsupervised system for aspect based sentiment analysis. Expert Systems with Applications, 91:127–137.

Ruidan He, Wee Sun Lee, Hwee Tou Ng, and Daniel Dahlmeier. 2017. An unsupervised neural attention model for aspect extraction. In Proceedings of ACL, pages 388–397.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. Software package.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of ACM SIGKDD, pages 168–177.

Kirkus Reviews. 2019. Beating about the bush.

Ling Luo, Xiang Ao, Yan Song, Jinyao Li, Xiaopeng Yang, Qing He, and Dong Yu. 2019. Unsupervised neural aspect extraction with sememes. In Proceedings of IJCAI, pages 5123–5129.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In ICLR Workshop Papers.

Arjun Mukherjee and Bing Liu. 2012. Aspect extraction through semi-supervised modeling. In Proceedings of ACL, pages 339–348.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.

Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 task 12: Aspect based sentiment analysis. In Proceedings of SemEval, pages 486–495.

Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Haris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 task 4: Aspect based sentiment analysis. In Proceedings of SemEval.

Alexander Rietzler, Sebastian Stabinger, Paul Opitz, and Stefan Engl. 2019. Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification. arXiv preprint arXiv:1908.11860.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Proceedings of NIPS, pages 2440–2448.

Linlin Wang, Kang Liu, Zhu Cao, Jun Zhao, and Gerard de Melo. 2015. Sentiment-aspect extraction based on restricted Boltzmann machines. In Proceedings of ACL-IJCNLP, pages 616–625.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.

Liheng Xu, Kang Liu, Siwei Lai, Yubo Chen, and Jun Zhao. 2013. Mining opinion words and opinion targets in a two-stage framework. In Proceedings of ACL, pages 1764–1773.

Lei Zhang, Shuai Wang, and Bing Liu. 2018. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1253.
