
Full Paper Track CIKM '20, October 19–23, 2020, Virtual Event, Ireland

SenticNet 6: Ensemble Application of Symbolic and Subsymbolic AI for Sentiment Analysis
Erik Cambria, Nanyang Technological University, Singapore, [email protected]
Yang Li, Nanyang Technological University, Singapore, [email protected]
Frank Z. Xing, Nanyang Technological University, Singapore, [email protected]
Soujanya Poria, Singapore University of Technology and Design, Singapore, [email protected]
Kenneth Kwok, Agency for Science, Technology and Research (A*STAR), Singapore, [email protected]

ABSTRACT
Deep learning has unlocked new paths towards the emulation of the peculiarly human capability of learning from examples. While this kind of bottom-up learning works well for tasks such as image classification or object detection, it is not as effective when it comes to natural language processing. Communication is much more than learning a sequence of letters and words: it requires a basic understanding of the world and social norms, cultural awareness, commonsense knowledge, etc.; all things that we mostly learn in a top-down manner. In this work, we integrate top-down and bottom-up learning via an ensemble of symbolic and subsymbolic AI tools, which we apply to the interesting problem of polarity detection from text. In particular, we integrate logical reasoning within deep learning architectures to build a new version of SenticNet, a commonsense knowledge base for sentiment analysis.

KEYWORDS
Knowledge representation and reasoning; Sentiment analysis

ACM Reference format:
Erik Cambria, Yang Li, Frank Z. Xing, Soujanya Poria, and Kenneth Kwok. 2020. SenticNet 6: Ensemble Application of Symbolic and Subsymbolic AI for Sentiment Analysis. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19–23, 2020 (CIKM ’20), 10 pages. https://doi.org/10.1145/3340531.3412003

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CIKM ’20, October 19–23, 2020, Virtual Event, Ireland
© 2020 Association for Computing Machinery.
ACM ISBN 978-1-4503-6859-9/20/10...$15.00
https://doi.org/10.1145/3340531.3412003

1 INTRODUCTION
The AI gold rush has become increasingly intense for the huge potential AI offers for human development and growth. Most of what is considered AI today is actually subsymbolic AI, i.e., machine learning: an extremely powerful tool for exploring large amounts of data and, for instance, making predictions, suggestions, and categorizations based on them. All such classifications are made by transforming real items that need to be classified into numbers or features in order to later calculate distances between them. While this is good for comparing such items and clustering them accordingly, it does not tell us much about the items themselves. Thanks to machine learning, we may find out that apples are similar to oranges but this information is only useful to cluster oranges and apples together: it does not actually tell us what an apple is, what it is usually used for, where it is usually found, how it tastes, etc. Throughout the span of our lives, we learn a lot of things by example but many others are learnt via our own personal (kinaesthetic) experience of the world and taught to us by our parents, mentors, and friends. If we want to replicate human intelligence in a machine, we cannot avoid implementing this kind of top-down learning.

Integrating logical reasoning within deep learning architectures has been a major goal of modern AI systems [19, 61, 65]. Most such systems, however, merely transform symbolic logic into a high-dimensional vector space using neural networks. In this work, instead, we do the opposite: we employ subsymbolic AI for recognizing meaningful patterns in natural language text and, hence, represent these in a knowledge base, termed SenticNet 6, using symbolic logic. In particular, we use deep learning to generalize words and multiword expressions into primitives, which are later defined in terms of superprimitives. For example, expressions like shop_for_iphone11, purchase_samsung_galaxy_S20 or buy_huawei_mate are all generalized as BUY(PHONE) and later reduced to smaller units thanks to definitions such as BUY(x) = GET(x) ∧ GIVE($), where GET(x) for example is defined in terms of the superprimitive HAVE as !HAVE(x) → HAVE(x).

While this does not solve the symbol grounding problem, it helps reduce it to a great degree and, hence, improves the accuracy of natural language processing (NLP) tasks for which statistical analysis alone is usually not enough, e.g., narrative understanding, dialogue systems and sentiment analysis. In this work, we focus on sentiment analysis, where this ensemble application of symbolic and subsymbolic AI is superior to both purely symbolic representations and purely subsymbolic approaches.
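The reduction of BUY(PHONE) to superprimitives described in the introduction can be read as a term-rewriting step. The following is a minimal sketch of that reading, not the SenticNet 6 implementation: the tuple encoding, the DEFINITIONS table and the helper names expand and show are assumptions made for illustration, and only the two rules for BUY and GET come from the text.

```python
# Hypothetical rule table: only the BUY and GET definitions come from the
# text; the nested-tuple encoding of logical terms is an assumption.
DEFINITIONS = {
    "BUY": lambda x: ("AND", ("GET", x), ("GIVE", "$")),
    "GET": lambda x: ("THEN", ("NOT", ("HAVE", x)), ("HAVE", x)),
}

def expand(term):
    """Recursively rewrite a term until no defined primitive is left."""
    if not isinstance(term, tuple):
        return term                      # an argument such as "PHONE" or "$"
    head, *args = term
    args = [expand(a) for a in args]
    if head in DEFINITIONS and len(args) == 1:
        return expand(DEFINITIONS[head](args[0]))
    return (head, *args)

def show(term):
    """Pretty-print a term in the notation used in the text."""
    if not isinstance(term, tuple):
        return term
    head, *args = term
    if head == "AND":
        return " ∧ ".join(show(a) for a in args)
    if head == "THEN":
        return f"({show(args[0])} → {show(args[1])})"
    if head == "NOT":
        return "!" + show(args[0])
    return f"{head}({', '.join(show(a) for a in args)})"

reduced = expand(("BUY", "PHONE"))
print(show(reduced))
```

Under these assumptions, the only symbols left after expansion are HAVE, GIVE and the logical connectives, which is the sense in which a polarity lexicon only needs to cover superprimitives.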


Figure 1: An example of sentic algebra.

By deconstructing multiword expressions into primitives and superprimitives, in fact, there is no need to build a lexicon that assigns polarity to thousands of words and multiword expressions: all we need is the polarity of superprimitives. For example, expressions like grow_profit, enhance_reward or intensify_benefit are all generalized as INCREASE(GAIN) and, hence, classified as positive (Fig. 1). Likewise, this approach is also superior to most subsymbolic approaches that simply classify text based on word occurrence frequencies. For example, a purely statistical approach would classify expressions like lessen_agony, reduce_affliction or diminish_suffering as negative because of the statistically negative words that compose them. In SenticNet 6, however, such expressions are all generalized as DECREASE(PAIN) and thus correctly classified (Fig. 1).

The remainder of the paper is organized as follows: Section 2 briefly discusses related works in the field of sentiment analysis; Section 3 describes in detail how to discover affect-bearing primitives for this task; Section 4 explains how to define such primitives in terms of denotative and connotative information; Section 5 presents experimental results on 9 different datasets; finally, Section 6 provides concluding remarks.

2 RELATED WORK
Sentiment analysis is an NLP task that has raised growing interest within both the scientific community, for the many exciting open challenges, and the business world, due to the remarkable benefits to be had from marketing and financial prediction. While most works approach it as a simple categorization problem, sentiment analysis is actually a complex research problem that requires tackling many NLP tasks, including subjectivity detection, anaphora resolution, word sense disambiguation, sarcasm detection, aspect extraction, and more.

Sentiment analysis research can be broadly categorized into symbolic approaches (i.e., ontologies and lexica) and subsymbolic approaches (i.e., statistical NLP). The former school of thought focuses on the construction of knowledge bases for the identification of polarity in text, e.g., WordNet-Affect [55], SentiWordNet [3], and SenticNet [10]. The latter school of thought leverages statistics-based approaches for the same task, with a special focus on supervised statistical methods. Pang et al. [43] pioneered this trend by comparing the performance of different machine learning algorithms on a movie review dataset and obtained 82% accuracy for polarity detection. Later, Socher et al. [53] obtained 85% accuracy on the same dataset using a recursive neural tensor network (NTN).

With the advent of Web 2.0, researchers started exploiting microblogging text or Twitter-specific features such as emoticons, hashtags, URLs, @symbols, capitalizations, and elongations to enhance the accuracy of social media sentiment analysis. For example, Tang et al. [58] used a convolutional neural network (CNN) to obtain word embeddings for words frequently used in tweets and dos Santos and Gatti [17] employed a deep CNN for sentiment detection in short texts. More recent approaches have been focusing on the development of sentiment-specific word embeddings [44], which are able to encode more affective clues than regular word vectors, and on the use of context-aware subsymbolic approaches such as attention modeling [32, 33] and capsule networks [13, 66].

3 PRIMITIVE DISCOVERY
While the bag-of-words model is good enough for simple NLP tasks such as autocategorization of documents, it does not work well for complex NLP tasks such as sentiment analysis, for which context awareness is often required. Extracting concepts or multiword expressions from text has always been a “pain in the neck for NLP” [49]. Semantic parsing and n-gram models have taken a bottom-up approach to solve this issue by automatically extracting concepts from raw data. The resulting multiword expressions, however, are prone to errors due to both richness and ambiguity of natural language. A more effective way to overcome this hurdle is to take a top-down approach by generalizing semantically-related concepts (e.g., sell_pizza, offer_noodles_for_sale and vend_ice_cream) via a set of primitives, i.e., a set of ontological parents or more general terms (e.g., SELL_FOOD). In this way, most concept inflections can be captured by SenticNet 6: noun concepts like pasta, cheese_cake, steak are replaced with the primitive FOOD while verb concepts like offer_for_sale, put_on_sale, and vend are all represented as the primitive SELL, which is later deconstructed into simpler primitives, e.g., SELL(x) = BARTER(x,$), where BARTER(x,y) = GIVE(x) ∧ GET(y).

The main goal of this generalization is to get away from associating polarity with a static list of affect keywords or multiword expressions by letting SenticNet 6 figure out such polarity on the fly based on the building blocks of meaning. This way, SenticNet 6 reduces the symbol grounding problem and, hence, gets one step closer to natural language understanding. As preached by the field of semiotics, in fact, words are “completely arbitrary signs” [18] that we automatically and almost instinctively connect to semantic representations in our mind. Such a process is far from being automatic for an AI, since it never got the chance to learn a language or experience the world the way we did during the first years of our existence. In order to bridge this huge gap between symbols and meaning, we need to ground words (and their associations) into some form of semantic representation, e.g., a structure of semantic features in the Katz-Fodor semantics [28] or in Jackendoff’s conceptual structure [26].

While this would be a formidable task for NLP research, it is still manageable in the context of sentiment analysis because, in this domain, the description of such features would be more connotative than denotative. In other words, we do not need to define what a concept really is but simply what kind of emotions it generates or evokes.
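To make the idea of deriving polarity from building blocks concrete, the sentic algebra examples of Fig. 1 can be mimicked with two small lookup tables. This is only a sketch under stated assumptions: the dictionaries stand in for the learned generalization step, the function names generalize and polarity are hypothetical, and the sign-multiplication rule is one simple way to reproduce the two outcomes quoted in the text (INCREASE(GAIN) and DECREASE(PAIN) both positive), not the rule SenticNet 6 necessarily uses.

```python
# Hypothetical stand-ins for the learned word-to-primitive generalization;
# the entries mirror the example expressions quoted in the text.
VERB_PRIMITIVE = {
    "grow": "INCREASE", "enhance": "INCREASE", "intensify": "INCREASE",
    "lessen": "DECREASE", "reduce": "DECREASE", "diminish": "DECREASE",
}
NOUN_PRIMITIVE = {
    "profit": "GAIN", "reward": "GAIN", "benefit": "GAIN",
    "agony": "PAIN", "affliction": "PAIN", "suffering": "PAIN",
}

# Assumed polarity algebra: a verb primitive either keeps (+1) or flips (-1)
# the polarity of its noun primitive.
VERB_SIGN = {"INCREASE": +1, "DECREASE": -1}
NOUN_POLARITY = {"GAIN": +1, "PAIN": -1}

def generalize(concept):
    """Map a verb_noun concept to its (verb primitive, noun primitive) pair."""
    verb, noun = concept.split("_", 1)
    return VERB_PRIMITIVE[verb], NOUN_PRIMITIVE[noun]

def polarity(concept):
    """Polarity is computed from the primitives, never from the raw words."""
    verb_primitive, noun_primitive = generalize(concept)
    sign = VERB_SIGN[verb_primitive] * NOUN_POLARITY[noun_primitive]
    return "positive" if sign > 0 else "negative"

for concept in ["grow_profit", "intensify_benefit", "lessen_agony", "diminish_suffering"]:
    print(concept, generalize(concept), polarity(concept))
```

Only four polarity entries (two verb signs, two noun polarities) are needed to classify all twelve surface words in this toy setting, which illustrates why the lexicon does not have to grow with the vocabulary.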


While the set of mental primitives and the principles of mental combination governing their interaction are potentially infinite for NLP, in the context of sentiment analysis these are bounded by a finite set of emotion categories and much simpler interaction principles that lead to an either positive or negative outcome. Thus, in this work, we leverage subsymbolic AI to automatically discover the primitives that can better generalize SenticNet’s commonsense knowledge. This generalization is inspired by different theories on conceptual primitives, including Roger Schank’s conceptual dependency theory [51], Ray Jackendoff’s work on explanatory semantic representation [25], and Anna Wierzbicka’s book on primes and universals [62], but also theoretical studies on knowledge representation [37, 48]. All such theories claim that a decompositional method is necessary to explore conceptualization.

In the same manner as a physical scientist understands matter by breaking it down into progressively smaller parts, a scientific study of conceptualization proceeds by decomposing meaning into smaller parts. Clearly, this decomposition cannot go on forever: at some point we must find semantic atoms that cannot be further decomposed. In SenticNet 6, this ‘decomposition’ translates into the generalization of words and multiword expressions into primitives and subsequently superprimitives, from which they inherit a specific set of emotions and, hence, a particular polarity.

One of the main reasons why conceptual dependency theory, and many other symbolic methods, were abandoned in favor of subsymbolic techniques was the amount of time and effort required to come up with a comprehensive set of rules. Subsymbolic techniques do not require much time or effort to perform classification but they are data-dependent and function in a black-box manner (i.e., we do not really know how and why classification labels are produced). In this work, we leverage the representation learning power of long short-term memory (LSTM) networks to automatically discover primitives for sentiment analysis. The deconstruction of primitives into superprimitives is currently a manual process: we leave the automatic (or semi-automatic) discovery of superprimitives to future work.

A sentence $S$ can be represented as a sequence of words, i.e., $S = [w_1, w_2, \ldots, w_n]$ where $n$ is the number of words in the sentence. The sentence can be split into sections such that the prefix $[w_1, \ldots, w_{i-1}]$ forms the left context sentence with $l$ words and the suffix $[w_{i+1}, \ldots, w_n]$ forms the right context sentence with $r$ words. Here, $c = w_i$ is the target word. In the first step, we represent these words in a low-dimensional distributed representation, i.e., word embeddings. Specifically, we use the pre-trained 300-dimensional word2vec embeddings [36] trained on the 3-billion-word Google News corpus. The context sentences and target concept can now be represented as a sequence of word vectors, thus constituting matrices $L \in \mathbb{R}^{d_w \times l}$, $R \in \mathbb{R}^{d_w \times r}$ and $C \in \mathbb{R}^{d_w \times 1}$ ($d_w = 300$) for left context, right context and target word, respectively.

3.1 biLSTM
To extract the contextual features from these subsentences, we use the biLSTM model on $L$ and $R$ independently. Given that we represent the word vector for the $t$-th word in a sentence as $x_t$, the LSTM transformation can be performed as:

$$X = \begin{bmatrix} h_{t-1} \\ x_t \end{bmatrix} \tag{1}$$
$$f_t = \sigma(W_f X + b_f) \tag{2}$$
$$i_t = \sigma(W_i X + b_i) \tag{3}$$
$$o_t = \sigma(W_o X + b_o) \tag{4}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c X + b_c) \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$

where $d$ is the dimension of the hidden representations and $W_i, W_f, W_o, W_c \in \mathbb{R}^{d \times (d + d_w)}$, $b_i, b_f, b_o \in \mathbb{R}^{d}$ are parameters to be learnt during the training (Table 1). $\sigma$ is the sigmoid function and $\odot$ is element-wise multiplication. The optimal values of $d$ and $k$ were set to 300 and 100, respectively (based on experiment results on the validation dataset). We used 10 negative samples.

When a biLSTM is employed, these operations are applied in both directions of the sequence and the outputs for each timestep are merged to form the overall representation for that word. Thus, for each sentence matrix, after applying biLSTM, we get the recurrent representation feature matrices $H_{LC} \in \mathbb{R}^{2d \times l}$ and $H_{RC} \in \mathbb{R}^{2d \times r}$.

3.2 Target Word Representation
The final feature vector $c$ for the target word is generated by passing $C$ through a multilayer neural network. The equations are as follows:

$$C^{*} = \tanh(W_a C + b_a) \tag{7}$$
$$c = \tanh(W_b C^{*} + b_b) \tag{8}$$

where $W_a \in \mathbb{R}^{d \times d_w}$, $W_b \in \mathbb{R}^{k \times d}$, $b_a \in \mathbb{R}^{d}$ and $b_b \in \mathbb{R}^{k}$ are parameters (Table 1) and $c \in \mathbb{R}^{k}$ is the final target word vector.

3.3 Sentential Context Representation
For our model to be able to attend to subphrases which are important in providing contexts, we incorporate an attention module on top of our biLSTM for our context sentences. The attention module consists of an augmented neural network having a hidden layer followed by a softmax output (Fig. 2).

Figure 2: Overall framework for context and word embedding generation.
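As a reading aid for Section 3.1, one LSTM step (Eqs. 1 to 6) and the merging of both directions can be written out in plain Python. This is a didactic sketch rather than the training code: the parameter dictionary, the random initialisation range, the helper names, and the choice of list concatenation as the way the two directions are "merged" are all assumptions (concatenation is what yields the 2d rows of H_LC and H_RC).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, b):
    """Return W.x + b for a matrix W stored as a list of rows."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_k for row, b_k in zip(W, b)]

def lstm_step(h_prev, c_prev, x_t, p):
    """One LSTM step; p holds Wf, Wi, Wo, Wc and the matching biases."""
    X = h_prev + x_t                                              # Eq. (1): stack h_{t-1} over x_t
    f = [sigmoid(z) for z in affine(p["Wf"], X, p["bf"])]         # Eq. (2)
    i = [sigmoid(z) for z in affine(p["Wi"], X, p["bi"])]         # Eq. (3)
    o = [sigmoid(z) for z in affine(p["Wo"], X, p["bo"])]         # Eq. (4)
    g = [math.tanh(z) for z in affine(p["Wc"], X, p["bc"])]
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]  # Eq. (5)
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]          # Eq. (6)
    return h, c

def init_params(d, d_w, seed=0):
    """Hypothetical initialisation: small uniform weights, zero biases."""
    rng = random.Random(seed)
    p = {}
    for gate in "fioc":
        p["W" + gate] = [[rng.uniform(-0.1, 0.1) for _ in range(d + d_w)] for _ in range(d)]
        p["b" + gate] = [0.0] * d
    return p

def run_lstm(xs, d, p):
    """Run the cell over a sequence of word vectors, starting from zeros."""
    h, c, states = [0.0] * d, [0.0] * d, []
    for x in xs:
        h, c = lstm_step(h, c, x, p)
        states.append(h)
    return states

def run_bilstm(xs, d, p_fwd, p_bwd):
    """Concatenate forward and backward states: one 2d-vector per word."""
    forward = run_lstm(xs, d, p_fwd)
    backward = run_lstm(xs[::-1], d, p_bwd)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]

# Toy left context with l = 3 words, d_w = 4 and d = 2 (the paper uses 300 and 300).
words = [[0.1, 0.0, -0.2, 0.3], [0.0, 0.5, 0.1, -0.1], [0.2, -0.3, 0.0, 0.4]]
H_LC = run_bilstm(words, 2, init_params(2, 4, seed=1), init_params(2, 4, seed=2))
print(len(H_LC), len(H_LC[0]))
```

The sketch keeps separate parameter sets for the two directions, which is the usual biLSTM convention; the paper does not spell this detail out.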


The attention module generates a vector which provides weights corresponding to the relevance of the underlying context across the sentence. Below, we describe the attention formulation applied on the left context sentence. $H_{LC}$ can be represented as a sequence of $[h_t]$ where $t \in [1, l]$. Let $A$ denote the attention network for this sentence. The attention mechanism of $A$ produces an attention weight vector $\alpha$ and a weighted hidden representation $r$ as follows:

$$P = \tanh(W_h H_{LC}) \tag{9}$$
$$\alpha = \mathrm{softmax}(w^{T} P) \tag{10}$$
$$r = H_{LC}\, \alpha^{T} \tag{11}$$

where $P \in \mathbb{R}^{d \times l}$, $\alpha \in \mathbb{R}^{l}$, $r \in \mathbb{R}^{2d}$, and $W_h \in \mathbb{R}^{d \times 2d}$, $w \in \mathbb{R}^{d}$ are projection parameters (Table 1). Finally, the sentence representation is generated as:

$$r^{*} = \tanh(W_p r) \tag{12}$$

Here, $r^{*} \in \mathbb{R}^{2d}$ and $W_p \in \mathbb{R}^{d \times 2d}$ is the weight to be learnt while training. This generates the overall sentential context representation for the left context sentence: $E_{LC} = r^{*}$. Similarly, attention is also applied to the right context sentence to get the right context representation $E_{RC}$. To get a comprehensive feature representation of the context for a particular concept, we fuse the two sentential context representations, $E_{LC}$ and $E_{RC}$, using an NTN [52]. It involves a neural tensor $T \in \mathbb{R}^{2d \times 2d \times k}$ which performs a bilinear fusion across $k$ dimensions. Along with a single layer neural model, the overall fusion can be shown as:

$$v = \tanh\left(E_{LC}^{T}\, T^{[1:k]}\, E_{RC} + W \begin{bmatrix} E_{LC} \\ E_{RC} \end{bmatrix} + b\right) \tag{13}$$

Here, the tensor product $E_{LC}^{T}\, T^{[1:k]}\, E_{RC}$ is calculated to get a vector $v^{*} \in \mathbb{R}^{k}$ such that each entry in the vector $v^{*}$ is calculated as $v^{*}_i = E_{LC}^{T}\, T^{[i]}\, E_{RC}$, where $T^{[i]}$ is the $i$-th slice of the tensor $T$. $W \in \mathbb{R}^{k \times 4d}$ and $b \in \mathbb{R}^{k}$ are the parameters (Table 1). The tensor fusion network thus finally provides the sentential context representation $v$.

3.4 Negative Sampling
To learn the appropriate representation of sentential context and target word, we use word2vec’s negative sampling objective function. Here, a positive pair is described as a valid context and word pair and the negative pairs are created by sampling random words from a unigram distribution. Formally, our aim is to maximize the following objective function:

$$Obj = \sum_{c,v} \left( \log \sigma(c \cdot v) + \sum_{i=1}^{z} \log \sigma(-c_i \cdot v) \right) \tag{14}$$

Here, the overall objective is calculated across all the valid word and context pairs. We choose $z$ invalid word-context pairs where each $-c_i$ refers to an invalid word with respect to a context.

3.5 Context embedding using BERT
We leverage the BERT architecture [16] to obtain the sentential context embedding of a word. BERT utilizes a transformer network to pre-train a language model for extracting contextual word embeddings. Unlike ELMo and OpenAI-GPT, BERT uses different pre-training tasks for language modeling. In one of the tasks, BERT randomly masks a percentage of words in the sentences and only predicts those masked words. In the other task, BERT predicts the next sentence given a sentence. This task, in particular, tries to model the relationship between two sentences, which is supposedly not captured by traditional bidirectional language models.

Algorithm 1 Context and target word embedding generation

procedure TrainEmbeddings
    Given sentence S = [w_1, w_2, ..., w_n] s.t. w_i is the target word.
    L ← E([w_1, w_2, ..., w_{i−1}])        ▷ E(): word2vec embedding
    R ← E([w_{i+1}, w_{i+2}, ..., w_n])
    C ← E(w_i)
    c ← TargetWordEmbedding(C)
    v ← ContextEmbedding(L, R)
    NegativeSampling(c, v)

procedure TargetWordEmbedding(C)
    C* = tanh(W_a . C + b_a)
    c = tanh(W_b . C* + b_b)
    return c

procedure ContextEmbedding(L, R)
    H_LC ← ∅
    h_{t−1} ← 0
    for t in [1, i − 1] do
        h_t ← LSTM(h_{t−1}, L_t)
        H_LC ← H_LC ∪ h_t
        h_{t−1} ← h_t
    H_RC ← ∅
    h_{t−1} ← 0
    for t in [i + 1, n] do
        h_t ← LSTM(h_{t−1}, R_t)
        H_RC ← H_RC ∪ h_t
        h_{t−1} ← h_t
    E_LC ← Attention(H_LC)
    E_RC ← Attention(H_RC)
    v ← NTN(E_LC, E_RC)
    return v

procedure LSTM(h_{t−1}, x_t)
    X = [h_{t−1}; x_t]
    f_t = σ(W_f . X + b_f)
    i_t = σ(W_i . X + b_i)
    o_t = σ(W_o . X + b_o)
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c . X + b_c)
    h_t = o_t ⊙ tanh(c_t)
    return h_t

procedure Attention(H)
    P = tanh(W_h . H)
    α = softmax(w^T . P)
    r = H . α^T
    return r

procedure NTN(E_LC, E_RC)
    v = tanh(E_LC^T . T^[1:k] . E_RC + W . [E_LC; E_RC] + b)
    return v
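Two steps of Algorithm 1 are compact enough to spell out directly: the Attention procedure (Eqs. 9 to 11) and the per-pair term of the negative sampling objective (Eq. 14). The sketch below is illustrative only; storing H as a list of per-word hidden states, the function names, and the numerically stable log-sigmoid are choices made here, not details given in the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(H, Wh, w):
    """H: list of l hidden states (each of size 2d); Wh: d rows of size 2d; w: size d."""
    P = [[math.tanh(dot(row, h)) for row in Wh] for h in H]    # Eq. (9), one column per word
    alpha = softmax([dot(w, p) for p in P])                    # Eq. (10)
    r = [sum(a * h[j] for a, h in zip(alpha, H))               # Eq. (11): weighted sum of states
         for j in range(len(H[0]))]
    return alpha, r

def log_sigmoid(z):
    """log(sigma(z)) without overflow for large |z|."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def pair_objective(c, v, negatives):
    """Eq. (14) for one valid (c, v) pair and its z sampled invalid words."""
    return log_sigmoid(dot(c, v)) + sum(log_sigmoid(-dot(c_i, v)) for c_i in negatives)

# Toy check: with zero projection weights every word gets the same attention weight.
alpha, r = attention([[1.0, 0.0], [0.0, 1.0]], Wh=[[0.0, 0.0]], w=[1.0])
print(alpha, r)
```

Summing pair_objective over all valid word-context pairs gives the quantity that training maximizes; the text reports z = 10 sampled negatives per pair.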


Figure 3: An example of primitive specification.

Consequently, this particular pre-training scheme helps BERT to outperform state-of-the-art techniques by a large margin on key NLP tasks such as question answering and natural language inference where understanding the relation between two sentences is very important. In SenticNet 6, we utilize BERT as follows:

• First, we fine-tune the pre-trained BERT network on the ukWaC corpus [4].
• Next, we calculate the embedding for the context v. For this, we first remove the target word c, i.e., either the verb or noun from the sentence. The remainder of the sentence is then fed to the BERT architecture which returns the context embedding.
• Finally, we adopt a new similarity measure in order to find the replacement of the word. For this, we need the embedding of the target word which we obtain by simply feeding the word to the pre-trained BERT network. Given a target word c and its sentential context v, we calculate the cosine distance of all the other words in the embedding hyperspace with both c and v. If b is a candidate word, the distance is then calculated as:

  $$dist(b, (c, v)) = \cos(b, c) + \cos(b, v) + \cos(BERT(v, b), BERT(v, c)) \tag{15}$$

  where $BERT(v, b)$ is the BERT-produced embedding of the sentence formed by replacing word c with the candidate word b in the sentence. Similarly, $BERT(v, c)$ is the embedding of the original sentence which consists of word c. A stricter rule to ensure high similarity between the target and candidate word is to apply multiplication instead of addition:

  $$dist(b, (c, v)) = \cos(b, c) \cdot \cos(b, v) \cdot \cos(BERT(v, b), BERT(v, c)) \tag{16}$$

  We rank the candidates as per their cosine distance and generate the list of possible lexical substitutes.

First, we extract all the concepts of the form verb-noun and adjective-noun present in ConceptNet 5 [54]. An example sentence for each of these concepts is also extracted. Then, we take one word from the concept (either a verb/adjective or a noun) to be the target word and the remaining sentence serves as the context.

The goal now is to find a substitute for the target word having the same part of speech in the given context. To achieve this, we obtain the context and target word embeddings (v and c) from the joint hyperspace of the network. For all possible substitute words b, we then calculate the cosine similarity using equation 16 and rank them using this metric for possible substitutes. This substitution leads to new verb-noun or adjective-noun pairs which bear the same conceptual meaning in the given context. The context2vec code for primitive discovery is available on our github¹.

4 PRIMITIVE SPECIFICATION
The deep learning framework described in the previous section allows for the automatic discovery of concept clusters that are semantically related and share a similar lexical function. The label of each such cluster is a primitive and it is assigned by selecting the most typical of the terms. In the verb cluster {increase, enlarge, intensify, grow, expand, strengthen, extend, widen, build_up, accumulate...}, for example, the term with the highest occurrence frequency in text (the one people most commonly use in conversation) is increase.

Hence, the cluster is named after it, i.e., labeled by the primitive INCREASE and later defined either via symbolic logic, e.g., INCREASE(x) = x + a(x), where a(x) is an undefined quantity related to x, or in terms of polar transitions, e.g., INCREASE: LESS → MORE (Fig. 3). Symbolic logic is usually used to define superprimitives or neutral primitives. Polar transitions are used to define polarity-bearing verb primitives in terms of polar state change (from positive to negative and vice versa) via a yin-yang kind of clustering [64].

In both cases, the goal is to define the connotative information associated with primitives and, hence, associate a polarity to them (explained in the next section). Such a polarity is later transferred to words and multiword expressions via a four-layered knowledge representation (Fig. 4).

¹ http://github.com/senticnet/context2vec

Table 1: Summary of notations used in Algorithm 1. Note: $d_w$ is the word embedding size. All the hyperparameters were set using random search [5].

Parameters
  Weights                                                      Bias
  $W_i, W_f, W_o, W_c \in \mathbb{R}^{d \times (d + d_w)}$      $b_i, b_f, b_o \in \mathbb{R}^{d}$
  $W_p \in \mathbb{R}^{d \times 2d}$                            $b_a \in \mathbb{R}^{d}$
  $W_b \in \mathbb{R}^{k \times d}$                             $b \in \mathbb{R}^{k}$
  $W_a \in \mathbb{R}^{d \times d_w}$                           $b_b \in \mathbb{R}^{k}$
  $T \in \mathbb{R}^{2d \times 2d \times k}$
  $W_h \in \mathbb{R}^{d \times 2d}$
  $W \in \mathbb{R}^{k \times 4d}$
  $w \in \mathbb{R}^{d}$

Hyperparameters
  d    dimension of LSTM hidden unit
  k    NTN tensor dimension
  z    negative sampling invalid pairs
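The difference between the additive rule (Eq. 15) and the stricter multiplicative rule (Eq. 16) of Section 3.5 is easiest to see on toy vectors. The sketch below assumes that all embeddings have already been computed and are passed in as plain lists; the two-dimensional vectors, the candidate words and the function names are invented for illustration, and no BERT model is called.

```python
import math

def cos(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def score_additive(b, c, v, sent_b, sent_c):
    """Eq. (15): sent_b stands for BERT(v, b), sent_c for BERT(v, c)."""
    return cos(b, c) + cos(b, v) + cos(sent_b, sent_c)

def score_multiplicative(b, c, v, sent_b, sent_c):
    """Eq. (16): any single low factor pulls the whole score down."""
    return cos(b, c) * cos(b, v) * cos(sent_b, sent_c)

def rank(candidates, c, v, sent_c, score):
    """candidates maps a word to (word vector, sentence vector); best first."""
    scored = {word: score(b, c, v, sent_b, sent_c)
              for word, (b, sent_b) in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Invented toy embeddings for a target word c in context v.
c, v, sent_c = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
candidates = {
    "grow": ([1.0, 1.0], [1.0, 1.0]),   # close to both the target and the context
    "fix":  ([1.0, 0.0], [1.0, 1.0]),   # close to the target, unrelated to the context
}
print(rank(candidates, c, v, sent_c, score_additive))
print(rank(candidates, c, v, sent_c, score_multiplicative))
```

In this toy case both rules prefer the same candidate, but the additive score of the context-unrelated word stays high (about 2 out of 3) whereas its multiplicative score drops to zero, which is the sense in which Eq. 16 is stricter. Higher values mean greater similarity under both rules.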


Figure 4: SenticNet 6’s dependency graph structure.

In this representation, in particular, named entities are linked to commonsense concepts by IsA relationships from IsaCore [11], a large subsumption knowledge base mined from 1.68 billion webpages. Commonsense concepts are later generalized into primitives by means of deep learning (as explained in the previous section). Primitives are finally deconstructed into superprimitives, basic states and actions that are defined by means of first order logic, e.g., HAVE(subj,obj) = ∃ obj @ subj.

4.1 Key Polar State Specification
In order to automatically discover words and multiword expressions that are both semantically and affectively related to key polar states such as EASY versus HARD or STABLE versus UNSTABLE, we use AffectiveSpace [7], a vector space of affective commonsense knowledge built by means of semantic multidimensional scaling. By exploiting the information sharing property of random projections, AffectiveSpace maps a dataset of high-dimensional semantic and affective features into a much lower-dimensional subspace in which concepts conveying the same polarity and similar meaning fall near each other. In past works, this vector space model has been used to classify concepts as positive or negative by calculating the dot product between new concepts and prototype concepts.

In this case, rather than a distance, we need a discrete path between a key polar state and its opposite (e.g., CLEAN and DIRTY) throughout the vector space manifolds. While the shortest path (in a k-means sense) between two polar states in AffectiveSpace risks including many irrelevant concepts, in fact, a path that follows the topological structure of the vector space from one state to its antithetic partner is more likely to contain concepts that are both semantically and affectively relevant. To calculate such a path, we use regularized k-means (RKM) [20], a novel algorithm that finds a morphism between a given point set and two reference points in a vector space $X \subseteq \mathbb{R}^{d}$, where $d \in \mathbb{N}^{+}$, by exploiting the information provided by the available data.

Such a morphism is described as a discrete path, composed of a set of prototypes selected based on the data manifolds. Consider a set of points $X = \{x_j \in \mathbb{R}^{d}\}$, $j = 1, \ldots, N$ and two points $w_0$ and $w_{N_c+1} \in \mathbb{R}^{d}$. The path connecting the two points $w_0$ and $w_{N_c+1}$ is described as an ordered set $W$ of $N_c$ prototypes $w \in \mathbb{R}^{d}$. Such a path is found by minimizing the standard k-means cost function with the addition of a regularization term that considers the distance between ordered centroids.

The cost function can be formalized as:

$$\min_{W}\; \frac{\gamma}{2} \sum_{i=1}^{N} \sum_{j=1}^{N_c} \lVert x_i - w_j \rVert^{2}\, \delta(u_i, j) + \frac{\lambda}{2} \sum_{i=0}^{N_c} \lVert w_{i+1} - w_i \rVert^{2} \tag{17}$$

where $u_i$ is the datum cluster, i.e., the index of the prototype to which $x_i$ is assigned, so that $\delta(u_i, j)$ selects only that prototype.

The novel cost function is composed of two terms weighted by the hyper-parameters $\gamma$ and $\lambda$:

$$\Omega(W, u, X, \gamma, \lambda) = \gamma\, \Omega_X(W, u, X) + \lambda\, \Omega_W(W). \tag{18}$$

The first term coincides with the standard k-means cost function while the second one induces a path topology based on the centroids ordering and controls the level of smoothness of the path.


This way, key polar states get mapped to emotion categories


of the Hourglass model and, by the transitive property, all the
concepts connected to such states inherit the same emotion and
polarity classification (Fig. 7).

5 EXPERIMENTS
In this section, we evaluate the performance of both the subsymbolic
and symbolic segments of SenticNet 6 (the former being the deep
learning framework for primitive discovery, the latter being the
logic framework for primitive specification) on 9 different datasets.

5.1 Subsymbolic Evaluation


Figure 5: Hyper-parameters influence in the shape of the In order to evaluate the performance of our context2vec framework
path. for primitive discovery, we employed it to solve the problem of
lexical substitution. We used ukWaC as the training corpus. We
removed sentences with length greater than 80 (which resulted
Fig. 5 proposes a graphical example of the algorithm’s behavior in a 7% reduction of the corpus), lower-cased text, and removed
for different values of the regularization hyper-parameters: data tokens with low occurrence. Finally, we were left with a corpus of
are represented as blue dots and centroids as crosses; the blue line 173,000 words. As for lexical substitution evaluation datasets, we
refers to a configuration in which the first cost function term is used the LST-07 dataset from the lexical substitution task of the
prominent; the green one to a configuration where the second term 2007 Semantic Evaluation (SemEval) challenge [34] and the 15,000
of the cost function is preponderant; finally, the red line refers to a target word all-words LST-14 dataset from SemEval-2014 [30].
configuration with a good trade-off between the two.
In our case, let C be the set of N concepts belonging to a specific primitive cluster and let x_1, ..., x_N ∈ R^d be their projections induced by embedding F. Additionally, let p_start, p_end ∈ C be the two key polar states corresponding to the two extremes of the path under analysis. Accordingly, RKM is used to identify the path that connects p_start with p_end in AffectiveSpace. Thus, the algorithm’s output is the list of intermediate concepts that characterize the transition induced by the data distribution.

Because positive and negative concepts are found in diametrically opposite zones of the space, we expect the paths calculated by means of RKM to traverse AffectiveSpace from one end to the other. This ensures the discovery of enough concepts that are both semantically and affectively related to both polar states. Towards the center of the space, however, there are many low-intensity (almost neutral) concepts. Hence, we only consider the first 20 nearest concepts to each polar state within the discovered morphism. If we set p_start = CLEAN and p_end = DIRTY, for example, we only assign the first 20 concepts of the path (e.g., cleaned, spotless, and immaculate) to p_start and the last 20 concepts of the path (e.g., filthy, stained, and soiled) to p_end.

We also use this morphism to assign emotion labels to key polar states, based on the average distance (dot product) between the concepts of the path (the first 20 and the last 20, respectively) and the key concepts in AffectiveSpace that represent emotion labels (positive and negative, respectively) of the Hourglass of Emotions [56], an emotion categorization model for sentiment analysis consisting of 24 basic emotions organized around four independent but concomitant affective dimensions (Fig. 6).

In the previous example, for instance, CLEAN would be assigned the label pleasantness because it is the nearest emotion concept to cleaned, spotless, immaculate, etc. on average. Likewise, DIRTY would be assigned the label disgust because it is the nearest emotion concept to filthy, stained, soiled, etc. on average.

Figure 6: The Hourglass of Emotions (affective dimensions: INTROSPECTION, TEMPER, ATTITUDE, SENSITIVITY).
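The labeling step described above can be illustrated with a minimal sketch. It assumes that the path is already available as an ordered list of concepts, that a higher dot product means "nearer", and that the vectors are toy 2-dimensional stand-ins for AffectiveSpace embeddings. The function names (`split_path`, `nearest_emotion`) and all numeric values are hypothetical and chosen only for illustration; the real setting uses the first and last 20 concepts rather than 3.

```python
from statistics import mean


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def split_path(path, k=20):
    """Return the first k and the last k concepts of an ordered path."""
    return path[:k], path[-k:]


def nearest_emotion(concepts, emotion_labels, vectors):
    """Pick the emotion label with the highest average dot product
    to the given concepts (assumption: higher means nearer)."""
    def avg_similarity(label):
        return mean(dot(vectors[c], vectors[label]) for c in concepts)
    return max(emotion_labels, key=avg_similarity)


# Toy 2-d vectors (hypothetical values, not taken from AffectiveSpace).
vectors = {
    "cleaned": (0.9, 0.1), "spotless": (0.8, 0.2), "immaculate": (0.95, 0.05),
    "stained": (-0.7, 0.2), "soiled": (-0.8, 0.1), "filthy": (-0.9, 0.1),
    "pleasantness": (1.0, 0.0), "joy": (0.3, 0.9),
    "disgust": (-1.0, 0.0), "fear": (-0.2, 0.9),
}

# Ordered path from p_start = CLEAN to p_end = DIRTY (toy example).
path = ["cleaned", "spotless", "immaculate", "stained", "soiled", "filthy"]
start_concepts, end_concepts = split_path(path, k=3)

start_label = nearest_emotion(start_concepts, ["pleasantness", "joy"], vectors)
end_label = nearest_emotion(end_concepts, ["disgust", "fear"], vectors)
print(start_label, end_label)
```

With these toy vectors, the concepts near CLEAN align best with pleasantness and those near DIRTY with disgust, mirroring the example in the text.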


Figure 7: A sketch of SenticNet 6’s semantic network.

The first one comes with a 300-sentence dev set and a 1710-sentence test set split; the second one comes with a 35% and 65% split, which we used as the dev set and test set, respectively. The performance is measured using generalized average precision in which we rank the lexical substitutes of a word based on the cosine similarity score calculated between a substitution and the context embedding. This ranking is compared to the gold standard lexical substitution ranking provided in the dataset.

Model          LST-07 [34]   LST-14 [30]
Baseline 1     52.35%        50.05%
Baseline 2     55.10%        53.60%
Context2vec    59.48%        57.32%
Table 2: Comparison between our approach and two baselines on two datasets for lexical substitution.

The performance of this approach is shown in Table 2, in which we compare it with two baselines. Baseline 1 has been implemented by training the skipgram model on the learning corpus and then simply taking the average of the words present in the context as context representation. The cosine similarity between this context representation and the target word embeddings is calculated to find a match for the lexical substitution. Baseline 2 is a model proposed by [35] to find lexical substitution of a target based on skipgram word embeddings and incorporating syntactic relations in the skipgram model.

5.2 Symbolic Evaluation

As mentioned earlier, the deconstruction of primitives into superprimitives is currently performed manually and, hence, it does not require evaluation. Therefore, we only evaluate the quality of key polar state specification using RKM (as shown in Table 3) in comparison with k-means and sentic medoids [8] on a LiveJournal corpus of 5,000 concepts (LJ-5k).

Model             LJ-5k
K-means           77.91%
Sentic medoids    82.76%
RKM               91.54%
Table 3: Comparison between RKM and two baselines on a dataset for concept polarity detection.

5.3 Ensemble Evaluation

We tested SenticNet 6 (available both as a standalone XML repository² and as an API³) against six commonly used benchmarks for sentence-level sentiment analysis, namely: STS [50], an evaluation dataset for Twitter sentiment analysis developed in 2013 consisting of 1,402 negative tweets and 632 positive ones; SST [53], a dataset built in 2013 consisting of 11,855 movie reviews and containing 4,871 positive sentences and 4,650 negative ones; SemEval-2013 [40], a dataset consisting of 2,186 negative and 5,349 positive tweets constructed for the Twitter sentiment analysis task (Task 2) in the 2013 SemEval challenge; SemEval-2015 [47], a dataset built for Task 10 of SemEval 2015 consisting of 15,195 tweets and containing 5,809 positive sentences and 2,407 negative ones; SemEval-2016 [39], a dataset constructed in 2016 for Task 4 of the SemEval challenge consisting of 17,639 tweets about 100 topics and containing 13,942 positive sentences and 3,697 negative ones; finally, Sanders [2], a dataset consisting of 5,512 tweets on four different topics of which 654 are negative and 570 positive.

We used these six datasets to compare SenticNet 6 with 15 popular sentiment lexica, namely: ANEW [6], a list of 1,030 words created in 1999; WordNet-Affect [55], an extension of WordNet made of 4,787 words developed in 2004; Opinion Lexicon [22], a lexicon of 6,789 words built in the same year by means of opinion word extraction from product reviews; Opinion Finder [63], a lexicon of 8,221 words created in 2005 using a polarity classifier; Micro WNOp [12], a lexicon of 5,636 words created in 2007; Sentiment140 [21], a lexicon of 62,466 words developed in 2009; SentiStrength [59] and SentiWordNet [3], two lexica created in 2010 consisting of 2,546 and 23,089 words, respectively; General Inquirer [57], a lexicon of 8,639 words with 1,916 of them containing polarity built in 2011; AFINN [41], a lexicon of 2,477 words constructed in the same year; EmoLex [38], a lexicon of 5,636 words built in 2013; NRC HS Lexicon [67] and VADER [23], two lexica developed in 2014 containing 54,128 and 7,503 words, respectively; MPQA [15], a lexicon of 8,222 words built in 2015; finally, SenticNet 5, the predecessor of SenticNet 6, a knowledge base of 100,000 commonsense concepts.

We set the experiment as a binary classification problem so the labels of both datasets and lexica were reduced to simply positive versus negative. To be fair to all lexica, two basic linguistic patterns [45] were used, namely: negation and adversative patterns. If we do not apply such patterns, in fact, sentences like “The car is very old but rather not expensive” would be wrongly classified by all lexica although most of them correctly list both ‘old’ and ‘expensive’ as negative (Fig. 8).

² http://sentic.net/downloads
³ http://sentic.net/api
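The effect of the negation and adversative patterns used in the ensemble evaluation can be illustrated with a minimal sketch. This is a deliberately simplified stand-in, not the pattern engine of [45]: it assumes that a negator flips the polarity of the next sentiment word and that the clause after the last adversative conjunction determines the sentence polarity. The word lists, the toy lexicon, and the function names are hypothetical.

```python
NEGATORS = {"not", "no", "never"}      # assumed negation cues
ADVERSATIVES = {"but", "however"}      # assumed adversative cues


def clause_polarity(tokens, lexicon):
    """Sum lexicon scores, flipping the sign of the first
    sentiment word that follows a negator."""
    score, negate = 0.0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
        elif tok in lexicon:
            score += -lexicon[tok] if negate else lexicon[tok]
            negate = False
    return score


def sentence_polarity(sentence, lexicon, use_patterns=True):
    """Classify a sentence as positive, negative, or neutral."""
    tokens = sentence.lower().split()
    if not use_patterns:
        # Plain lexicon lookup: add up the scores of all known words.
        score = sum(lexicon.get(tok, 0.0) for tok in tokens)
    else:
        # Adversative pattern: the clause after the last "but" dominates.
        cut = max((i for i, tok in enumerate(tokens) if tok in ADVERSATIVES),
                  default=-1)
        score = clause_polarity(tokens[cut + 1:], lexicon)
        if score == 0.0 and cut >= 0:
            # Fall back to the first clause if the second carries no polarity.
            score = clause_polarity(tokens[:cut], lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Toy lexicon in which both words are negative, as in the paper's example.
toy_lexicon = {"old": -1.0, "expensive": -1.0}
sentence = "The car is very old but rather not expensive"

without_patterns = sentence_polarity(sentence, toy_lexicon, use_patterns=False)
with_patterns = sentence_polarity(sentence, toy_lexicon, use_patterns=True)
print(without_patterns, with_patterns)
```

Without the patterns, the two negative entries add up and the sentence is labeled negative; with them, "not expensive" becomes positive and, being in the clause after "but", decides the overall polarity.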


Model Year SST Dataset [53] STS Dataset [50] SemEval-2013 [40] SemEval-2015 [47] SemEval-2016 [39] Sanders [2]
ANEW [6] 1999 31.21% 36.77% 42.72% 33.13% 42.20% 27.70%
WordNet-Affect [55] 2004 04.51% 11.98% 03.82% 03.27% 03.53% 05.64%
Opinion Lexicon [22] 2004 54.21% 60.72% 41.00% 43.15% 37.83% 54.33%
Opinion Finder [63] 2005 53.60% 55.71% 47.50% 43.97% 46.75% 46.98%
Micro WNOp [12] 2007 15.45% 18.94% 19.13% 16.97% 17.85% 15.36%
Sentiment140 [21] 2009 55.75% 67.69% 45.67% 50.92% 41.70% 64.95%
SentiStrength [59] 2010 36.76% 51.53% 37.28% 41.51% 33.97% 44.85%
SentiWordNet [3] 2010 50.19% 48.75% 50.15% 50.31% 49.62% 43.55%
General Inquirer [57] 2011 25.91% 11.14% 16.06% 12.47% 16.78% 10.29%
AFINN [41] 2011 44.81% 58.50% 43.82% 44.99% 40.13% 53.19%
EmoLex [38] 2013 46.94% 47.63% 45.12% 42.33% 42.38% 44.12%
NRC HS Lexicon [67] 2014 47.90% 49.86% 28.56% 42.54% 25.28% 54.33%
VADER [23] 2014 50.72% 64.90% 50.36% 49.08% 45.93% 57.27%
MPQA [15] 2015 53.71% 55.43% 46.75% 43.97% 45.42% 46.57%
SenticNet 5 [10] 2018 53.61% 55.71% 68.17% 56.03% 70.80% 48.37%
SenticNet 6 2020 75.43% 83.82% 81.79% 80.19% 82.23% 77.62%
Table 4: Comparison with 15 popular lexica on 6 benchmark datasets for sentiment analysis (top 3 results in bold).

Since most of the datasets we used are for Twitter sentiment analysis, initially we also wanted to apply microtext normalization to all sentences before processing them through the lexica. Had we done that, however, we would also have had to apply many other NLP tasks required for proper polarity detection [9], e.g., anaphora resolution and sarcasm detection, so eventually we refrained from doing so.

Classification results are shown in Table 4. SenticNet 6 was the best-performing lexicon mostly because of its bigger size (200,000 words and multiword expressions). Most of the classification errors made by other lexica, in fact, were due to a missing entry in the knowledge base. Most of the sentences misclassified by SenticNet 6, instead, were using sarcasm or contained microtext.

Figure 8: Sentiment data flow for the sentence “The car is very old but rather not expensive” using linguistic patterns.

6 CONCLUSION

In the past, SenticNet has been employed for many different tasks other than polarity detection, e.g., recommendation systems [24], stock market prediction [31], political forecasting [46], irony detection [60], drug effectiveness measurement [42], depression detection [14], mental health triage [1], vaccination behavior detection [27], psychological studies [29], and more.

To enhance the accuracy of all such tasks, we propose a new version of SenticNet built using an approach to knowledge representation that is both top-down and bottom-up: top-down for the fact that it leverages symbolic models (i.e., logic and semantic networks) to encode meaning; bottom-up because it uses subsymbolic methods (i.e., biLSTM and BERT) to implicitly learn syntactic patterns from data. We believe that coupling symbolic and subsymbolic AI is key for stepping forward in the path from NLP to natural language understanding. Machine learning is only useful to make a ‘good guess’ based on past experience because it simply encodes correlation and its decision-making process is merely probabilistic. As professed by Noam Chomsky, natural language understanding requires much more than that: “you do not get discoveries in the sciences by taking huge amounts of data, throwing them into a computer and doing statistical analysis of them: that’s not the way you understand things, you have to have theoretical insights”.

ACKNOWLEDGMENTS

This research is supported by the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (Project #A18A2b0046).

REFERENCES
[1] Hayda Almeida, Marc Queudot, and Marie-Jean Meurs. 2016. Automatic triage of mental health online forum posts: CLPsych 2016 system description. In Workshop on Computational Linguistics and Clinical Psychology. 183–187.
[2] Sanders Analytics. 2015. Sanders Dataset. (2015). http://sananalytics.com/lab
[3] Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In LREC. 2200–2204.
[4] Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language resources and evaluation 43, 3 (2009), 209–226.
[5] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. The Journal of Machine Learning Research 13, 1 (2012), 281–305.
[6] Margaret Bradley and Peter Lang. 1999. Affective Norms for English Words (ANEW): Stimuli, Instruction Manual and Affective Ratings. Technical Report. The Center for Research in Psychophysiology, University of Florida.
[7] Erik Cambria, Jie Fu, Federica Bisio, and Soujanya Poria. 2015. AffectiveSpace 2: Enabling Affective Intuition for Concept-Level Sentiment Analysis. In AAAI. 508–514.


[8] Erik Cambria, Thomas Mazzocco, Amir Hussain, and Chris Eckl. 2011. Sentic Medoids: Organizing Affective Common Sense Knowledge in a Multi-Dimensional Vector Space. In LNCS 6677. 601–610.
[9] Erik Cambria, Soujanya Poria, Alexander Gelbukh, and Mike Thelwall. 2017. Sentiment Analysis is a Big Suitcase. IEEE Intelligent Systems 32, 6 (2017), 74–80.
[10] Erik Cambria, Soujanya Poria, Devamanyu Hazarika, and Kenneth Kwok. 2018. SenticNet 5: Discovering conceptual primitives for sentiment analysis by means of context embeddings. In AAAI. 1795–1802.
[11] Erik Cambria, Yangqiu Song, Haixun Wang, and Newton Howard. 2014. Semantic Multi-Dimensional Scaling for Open-Domain Sentiment Analysis. IEEE Intelligent Systems 29, 2 (2014), 44–51.
[12] Sabrina Cerini, Valentina Compagnoni, Alice Demontis, Maicol Formentelli, and Caterina Gandini. 2007. Micro-WNOp: A gold standard for the evaluation of automatically compiled lexical resources for opinion mining. Language resources and linguistic theory: Typology, Second Language Acquisition, English linguistics (2007), 200–210.
[13] Zhuang Chen and Tieyun Qian. 2019. Transfer Capsule Network for Aspect Level Sentiment Classification. In ACL. 547–556.
[14] Ting Dang, Brian Stasak, Zhaocheng Huang, Sadari Jayawardena, Mia Atcheson, Munawar Hayat, Phu Le, Vidhyasaharan Sethu, Roland Goecke, and Julien Epps. 2017. Investigating word affect features and fusion of probabilistic predictions incorporating uncertainty in AVEC 2017. In Workshop on Audio/Visual Emotion Challenge. 27–35.
[15] Lingjia Deng and Janyce Wiebe. 2015. MPQA 3.0: An entity/event-level sentiment corpus. In NAACL. 1323–1328.
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT. 4171–4186.
[17] Cícero Nogueira dos Santos and Maíra Gatti. 2014. Deep convolutional neural networks for sentiment analysis of short texts. In COLING. 69–78.
[18] Umberto Eco. 1984. Semiotics and Philosophy of Language. Indiana University Press.
[19] Richard Evans and Edward Grefenstette. 2018. Learning explanatory rules from noisy data. Journal of Artificial Intelligence Research 61 (2018), 1–64.
[20] Marco Ferrarotti, Sergio Decherchi, and Walter Rocchia. 2019. Finding Principal Paths in Data Space. IEEE Transactions on Neural Networks and Learning Systems 30, 8 (2019), 2449–2462.
[21] Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. CS224N project report, Stanford 1, 12 (2009).
[22] Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In SIGKDD. 168–177.
[23] Clayton J Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In ICWSM. 216–225.
[24] Muhammad Ibrahim, Imran Sarwar Bajwa, Riaz Ul-Amin, and Bakhtiar Kasi. 2019. A neural network-inspired approach for improved and true movie recommendations. Computational intelligence and neuroscience (2019), 4589060.
[25] Ray Jackendoff. 1976. Toward an explanatory semantic representation. Linguistic Inquiry 7, 1 (1976), 89–150.
[26] Ray Jackendoff. 1983. Semantics and cognition. MIT Press.
[27] Aditya Joshi, Xiang Dai, Sarvnaz Karimi, Ross Sparks, Cecile Paris, and C Raina MacIntyre. 2018. Shot or not: Comparison of NLP approaches for vaccination behaviour detection. In SMM4H@EMNLP. 43–47.
[28] Jerrold Katz and Jerry Fodor. 1963. The structure of a Semantic Theory. Language 39 (1963), 170–210.
[29] Megan O Kelly and Evan F Risko. 2019. The Isolation Effect When Offloading Memory. Journal of Applied Research in Memory and Cognition 8, 4 (2019), 471–480.
[30] Gerhard Kremer, Katrin Erk, Sebastian Padó, and Stefan Thater. 2014. What Substitutes Tell Us - Analysis of an "All-Words" Lexical Substitution Corpus. In EACL. 540–549.
[31] Xiaodong Li, Haoran Xie, Raymond YK Lau, Tak-Lam Wong, and Fu-Lee Wang. 2018. Stock prediction via sentimental transfer learning. IEEE Access 6 (2018), 73110–73118.
[32] Qiao Liu, Haibin Zhang, Yifu Zeng, Ziqi Huang, and Zufeng Wu. 2018. Content Attention Model for Aspect Based Sentiment Analysis. In WWW. 1023–1032.
[33] Yukun Ma, Haiyun Peng, and Erik Cambria. 2018. Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM. In AAAI. 5876–5883.
[34] Diana McCarthy and Roberto Navigli. 2007. SemEval-2007 task 10: English lexical substitution task. In SemEval. 48–53.
[35] Oren Melamud, Omer Levy, Ido Dagan, and Israel Ramat-Gan. 2015. A Simple Word Embedding Model for Lexical Substitution. In VS@HLT-NAACL. 1–7.
[36] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS. 3111–3119.
[37] Marvin Minsky. 1975. A framework for representing knowledge. In The psychology of computer vision, Patrick Winston (Ed.). McGraw-Hill, New York.
[38] Saif M Mohammad and Peter D Turney. 2013. Crowdsourcing a word–emotion association lexicon. Computational Intelligence 29, 3 (2013), 436–465.
[39] Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment Analysis in Twitter. In SemEval.
[40] Preslav Nakov, Sara Rosenthal, Zornitsa Kozareva, Veselin Stoyanov, Alan Ritter, and Theresa Wilson. 2013. SemEval-2013 Task 2: Sentiment Analysis in Twitter. In SemEval. 312–320.
[41] Finn Nielsen. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. CoRR abs/1103.2903 (2011).
[42] Samira Noferesti and Mehrnoush Shamsfard. 2015. Using Linked Data for polarity classification of patients’ experiences. Journal of biomedical informatics 57 (2015), 6–19.
[43] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up?: Sentiment classification using machine learning techniques. In EMNLP. 79–86.
[44] Soujanya Poria, Erik Cambria, and Alexander Gelbukh. 2016. Aspect Extraction for Opinion Mining with a Deep Convolutional Neural Network. Knowledge-Based Systems 108 (2016), 42–49.
[45] Soujanya Poria, Erik Cambria, Alexander Gelbukh, Federica Bisio, and Amir Hussain. 2015. Sentiment Data Flow Analysis by Means of Dynamic Linguistic Patterns. IEEE Computational Intelligence Magazine 10, 4 (2015), 26–36.
[46] Lei Qi, Chuanhai Zhang, Adisak Sukul, Wallapak Tavanapong, and David Peterson. 2016. Automated coding of political video ads for political science research. In IEEE International Symposium on Multimedia. 7–13.
[47] Sara Rosenthal, Preslav Nakov, Svetlana Kiritchenko, Saif Mohammad, Alan Ritter, and Veselin Stoyanov. 2015. SemEval-2015 Task 10: Sentiment Analysis in Twitter. In SemEval. 451–463.
[48] David Rumelhart and Andrew Ortony. 1977. The representation of knowledge in memory. In Schooling and the acquisition of knowledge. Erlbaum, Hillsdale, NJ.
[49] Ivan Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In CICLing. 1–15.
[50] Hassan Saif, Miriam Fernandez, Yulan He, and Harith Alani. 2013. Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold. In AI*IA.
[51] Roger Schank. 1972. Conceptual dependency: A theory of natural language understanding. Cognitive Psychology 3 (1972), 552–631.
[52] Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In NIPS. 926–934.
[53] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP. 1631–1642.
[54] Robert Speer and Catherine Havasi. 2012. ConceptNet 5: A Large Semantic Network for Relational Knowledge. In Theory and Applications of Natural Language Processing. Chapter 6.
[55] Carlo Strapparava and Alessandro Valitutti. 2004. WordNet-Affect: An Affective Extension of WordNet. In LREC. 1083–1086.
[56] Yosephine Susanto, Andrew Livingstone, Bee Chin Ng, and Erik Cambria. 2020. The Hourglass Model Revisited. IEEE Intelligent Systems 35, 5 (2020).
[57] Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. Lexicon-based methods for sentiment analysis. Computational linguistics 37, 2 (2011), 267–307.
[58] Duyu Tang, Furu Wei, Bing Qin, Ting Liu, and Ming Zhou. 2014. Coooolll: A deep learning system for Twitter sentiment classification. In SemEval. 208–212.
[59] Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010. Sentiment strength detection in short informal text. Journal of the American society for information science and technology 61, 12 (2010), 2544–2558.
[60] Cynthia Van Hee, Els Lefever, and Véronique Hoste. 2018. We usually don’t like going to the dentist: Using common sense to detect irony on Twitter. Computational Linguistics 44, 4 (2018), 793–832.
[61] Po-Wei Wang, Priya Donti, Bryan Wilder, and Zico Kolter. 2019. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In ICML. 6545–6554.
[62] Anna Wierzbicka. 1996. Semantics: Primes and Universals. Oxford University Press.
[63] Theresa Wilson, Paul Hoffmann, Swapna Somasundaran, Jason Kessler, Janyce Wiebe, Yejin Choi, Claire Cardie, Ellen Riloff, and Siddharth Patwardhan. 2005. OpinionFinder: A system for subjectivity analysis. In HLT/EMNLP. 34–35.
[64] Lei Xu. 1997. Bayesian Ying–Yang machine, clustering and number of clusters. Pattern Recognition Letters 18, 11 (1997), 1167–1178.
[65] Fan Yang, Zhilin Yang, and William Cohen. 2017. Differentiable learning of logical rules for knowledge base reasoning. In NIPS. 2319–2328.
[66] Wei Zhao, Haiyun Peng, Steffen Eger, Erik Cambria, and Min Yang. 2019. Towards scalable and reliable capsule networks for challenging NLP applications. In ACL. 1549–1559.
[67] Xiaodan Zhu, Svetlana Kiritchenko, and Saif Mohammad. 2014. NRC-canada-2014: Recent improvements in the sentiment analysis of tweets. In SemEval. 443–447.
