Detection and Aptness: A Study in Metaphor
Yuri Bizzoni
December 2018
Distribution: Department of Philosophy, Linguistics and Theory of Science, Box 200, SE-405 30 Gothenburg
Abstract
Metaphor is one of the most prominent, and most studied, figures of speech.
While it is considered an element of great interest in several branches of linguistics, such as
semantics, pragmatics and stylistics, its automatic processing remains an open challenge. First of all,
the semantic complexity of the concept of metaphor itself creates a range of theoretical complications.
Secondly, the practical lack of large scale resources forces researchers to work under conditions of data
scarcity.
This compilation thesis provides a set of experiments to (i) automatically detect metaphors and
(ii) assess a metaphor’s aptness with respect to a given literal equivalent. The first task has already
been tackled by a number of studies. I approach it as a way to assess the potential and limitations of our approach before dealing with the second task. For metaphor detection I was able to use existing
resources, while I created my own dataset to explore metaphor aptness assessment. In all of the studies
presented here, I have used a combination of word embeddings and neural networks.
To deal with metaphor aptness assessment, I framed the problem as a case of paraphrase identification. Given a sentence containing a metaphor, the task is to find the best literal paraphrase from a set of candidates. I built a dataset designed for this task, which allows a gradient scoring of various paraphrases with respect to a reference sentence, so that paraphrases are ordered according to their degree of aptness. I could therefore use it both for binary classification and for ordering tasks. This dataset was annotated through crowdsourcing by an average of 20 annotators for each pair. I then designed a deep neural network to be trained on this dataset, which achieves encouraging levels of performance.
In the final experiment of this compilation, more context is added to a subsection of the dataset in order to study the effect of extended context on metaphor aptness ratings. I show that extended context changes human perception of metaphor aptness and that this effect is reproduced by my neural classifier. The conclusion of the last study is that extended context compresses aptness scores towards the center of the scale, raising ratings that were low out of context and lowering ratings that were high out of context.
Acknowledgments
I want to thank Shalom Lappin for more than three years of helpful supervision, as well as my CLASP
friends and colleagues for their guidance and endurance. I am grateful to Beata Beigman Klebanov for
her thorough reading of my thesis and for her extensive comments. I am also obliged to my colleagues at the University of Pisa for their collaboration on some of my work. Finally, I am thankful to the Swedish Research Council for providing the funds necessary for my paycheck, and to all the members of the FLoV department for their help.
Without their support this thesis would have been different, so they should share part of the
blame.
Contents
I THESIS FRAME 11
1 Introduction 13
1.1 My Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Metaphor Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2.1 Out-of-context metaphor detection . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2.2 Idioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2.3 In-context metaphor detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.3 Metaphor Aptness Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3.1 Dataset and architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.3.2 Out-of-context Metaphor Aptness Assessment . . . . . . . . . . . . . . . . . . . . 18
1.3.3 In-context Metaphor Aptness Assessment . . . . . . . . . . . . . . . . . . . . . . 19
1.4 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 Vector space lexical embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.6 Structure of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5 Conclusions 45
5.1 My Research Answers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 Nuanced properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3 Sequentiality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.4 Data Scarcity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.5 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.6 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.7 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6 Appendix 55
II STUDIES 81
List of Studies
STUDY I
Bizzoni, Y., Chatzikyriakidis, S., Ghanimifard, M. (2017). “Deep” Learning: Detecting Metaphoricity in Adjective-Noun Pairs. In Proceedings of the Workshop on Stylistic Variation (pp. 43-52).
STUDY II
Bizzoni, Y., Senaldi, M. S. G., Lenci, A. (2018). Finding the Neural Net: Deep-learning Idiom Type Identification from Distributional Vectors. To appear in Italian Journal of Computational Linguistics.
STUDY III
Bizzoni, Y., Ghanimifard, M. (2018). Bigrams and BiLSTMs: Two Neural Networks for Sequential Metaphor Detection. In Proceedings of the Workshop on Figurative Language Processing (pp. 91-101).
STUDY IV
Bizzoni, Y., Lappin, S. (2017). Deep Learning of Binary and Gradient Judgements for Semantic Paraphrase. In IWCS 2017—12th International Conference on Computational Semantics—Short papers.
STUDY V
Bizzoni, Y., Lappin, S. (2018). Predicting Human Metaphor Paraphrase Judgments with Deep
Neural Networks. In Proceedings of the Workshop on Figurative Language Processing (pp. 45-55).
STUDY VI
Bizzoni, Y., Lappin, S. (2018). The Effect of Context on Metaphor Paraphrase Aptness Judgments. arXiv preprint arXiv:1809.01060.
Part I
THESIS FRAME
Chapter 1
Introduction
Figurative language is an umbrella term comprising several different phenomena, and covering linguistic behavior that involves pragmatics, semantics, syntax and even phonetics.1 Figures of speech are often framed as language patterns that are perceived by speakers as “unusual” or improbable.
Statistical and prosodic patterns are essential in the way we learn and update language (Kuhl; 2004), and it is not surprising that we develop a high sensitivity towards them. It can be argued that since we learn language mainly through positive examples, we don't learn what is wrong. Rather, we develop a sensitivity towards what sounds unusual. These are the patterns we classify as 'ungrammatical', or as having a lower degree of acceptability (Lau; 2015; Lau et al.; 2017), as 'slips of the tongue' (Dell; 1985), or, to some extent, as 'figures of speech'.
The human ability to constantly learn patterns, and to detect what seems to deviate from the acquired schemes - phonetic, syntactic, semantic, to name just the most obvious ones in language - may be the reason why figurative language is both effective in communication and difficult to model. We also get used to new patterns easily, which is why our perception of figures changes over time. For example, novel metaphors can be characterized as a way of forcing a word or an expression out of its conventional use, stretching its meaning in order to better convey an experience or an idea. But the metaphoric usage of a word can itself become a conventional pattern. This, for example, creates “dead” metaphors: metaphors that are no longer perceived as such by speakers.
The most studied of these figures in linguistics, including natural language processing, are the
semantic-pragmatic tropes of irony, sarcasm, metaphor (together with simile), and metonymy. Metaphor
is often considered one of the most important and widespread figures of speech, and an important
phenomenon to analyze in several areas of linguistics, from stylistics (Goodman; 1975; Semino and
1 Non-verbal figurativity also exists (Migliore; 2007). In this thesis I limit my discussion to verbal figurativity.
Culpeper; 2002; Simpson; 2004; Leech and Short; 2007; Fahnestock; 2009; Steen; 2014) to dialogue
studies (Pollio et al.; 1990; Corts and Pollio; 1999; Kintsch; 2000; Cameron; 2008). Metaphor is also
relevant for pragmatics, and it has been associated with an increase in “emotionality" in communica-
tion (Fussell and Moss; 1998; Gibbs et al.; 2002), and with an attempt to improve the clarity of an
explanation (Sell et al.; 1997; Darian; 2000).
From the perspective of computational linguistics, figurative language in general, and metaphor in particular, are of both theoretical and practical interest.
Theoretically, metaphor is worth modelling because its broad function seems to consist in stretching the expressive power of natural language.
For example, the synesthetic expression A cold voice draws its meaning both from tactile perception (as a source domain) and auditory perception (as a target domain) to convey a particular characteristic of a voice. But how and why this semantic shift operates is far from being a solved problem.
Practically, figurative language can be a source of problems if it is not understood (or at least
recognized) by language processing systems.
Given the widespread presence of metaphor in everyday language (Deignan; 2007; Lakoff and Johnson; 2008a; Sikos et al.; 2008) and its proven importance in communication and the transmission of knowledge (Salager-Meyer; 1990; Rodriguez; 2003; Littlemore; 2004; Baumer and Tomlinson; 2008; Kokkinakis; 2013; Laranjeira; 2013), automatic metaphor processing can open new possibilities for a number of computational linguistics applications such as machine translation, dialogue management, information retrieval, opinion mining, sentiment analysis and author profiling.2 Effective detection of figures such as irony and sarcasm is already deemed “immensely useful for reducing incorrect classification of consumer sentiment” (Mukherjee and Bala; 2017) and is often applied to the analysis of social media language (Reyes et al.; 2012).
In the rest of this chapter I will detail the research questions that have led to this compilation.
They revolve around two main topics: metaphor detection and metaphor aptness assessment.
I will give a short explanation of what each of these themes involves and how I tried to explore
them.
I will also briefly discuss the two main tools I have used throughout all my experiments: vector space lexical embeddings and neural networks.
Finally, I will present the structure of my thesis frame.
2 I am thinking here of figurative uses that are perceived as such by speakers: conventional metaphors, for example, can be treated with general statistical tools.
1.1 My Research Questions
1. To what extent can we exploit vector space semantic models through artificial neural networks
to perform supervised metaphor detection?
2. Is it possible to use the same approach to go beyond metaphor detection, and tackle metaphor
aptness assessment as a natural language processing task?
3. What is the best way of dealing with metaphor aptness assessment, in terms of both dataset
structure and task design?
It is possible to read these questions from a more “linguistic” angle, highlighting the theoretical, rather than technical, interest of my research.
In this sense, my research questions could be reformulated in the following way:
1. To what extent can we exploit the distributional profile of single words to detect their metaphoric usage in text through a supervised machine learning approach? To what extent would such a process be compositional?
2. Is it possible to exploit the distributional profile of single words to go beyond metaphor detection,
and tackle metaphor aptness assessment as a natural language processing task?
A way to break the problem down into a more manageable task is to work on lists of out-of-context metaphors and literal expressions. While it is true that every piece of text could be interpreted metaphorically or literally depending on a larger context, an average reader is usually able to express a tentative judgment of metaphoricity on an isolated expression. For example, the expression a bright color will most likely be interpreted as literal. On the other hand, the expression a bright person will most likely be interpreted as metaphorical. While it may seem counter-intuitive, developing a system to detect metaphoric expressions in isolation presents several advantages over a system designed to find metaphors in unconstrained text.
The main advantages of such a framework are that there is no need to parse or process a larger text, that the boundaries of the expression to be judged are already established in the dataset, and that there is no risk that a confusing larger context causes trouble for the detector.
This is the approach we adopt in the first paper of this compilation. It will also introduce the
reader to the two tools that we use throughout the following studies: lexical embeddings and neural
networks.
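To make this concrete, below is a minimal sketch of the kind of classifier such a framework implies: the vectors of the adjective and the noun are concatenated and passed to a small feed-forward network that predicts metaphorical versus literal. This is an illustration, not the actual system of Paper 1; the dimensions, the toy labelled pairs and the random vectors standing in for real pre-trained embeddings are all assumptions.

import numpy as np
from tensorflow.keras import layers, Model

EMB_DIM = 300                                   # assumed embedding dimensionality
rng = np.random.default_rng(0)
# Stand-in lookup: in a real setting these would be pre-trained word embeddings.
lookup = {w: rng.normal(size=EMB_DIM) for w in ["bright", "color", "person"]}

# Toy labelled bigrams: 0 = literal, 1 = metaphorical.
pairs = [("bright", "color", 0), ("bright", "person", 1)]
X = np.array([np.concatenate([lookup[adj], lookup[noun]]) for adj, noun, _ in pairs])
y = np.array([label for _, _, label in pairs])

# Small feed-forward classifier over the concatenated adjective-noun vectors.
inp = layers.Input(shape=(2 * EMB_DIM,))
hidden = layers.Dense(64, activation="relu")(inp)
out = layers.Dense(1, activation="sigmoid")(hidden)   # probability of metaphoricity
model = Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X, verbose=0))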
1.2.2 Idioms
As a proof of concept, I also present a study on idiom identification. In this study we first show that
a similar combination of neural classifier and lexical embeddings can be applied to figures of speech
beyond metaphor. Second, we discuss the compositional nature of metaphor, showing that, when
dealing with essentially non-compositional figures such as idioms (Vietri; 2014), the same approach
used in Paper 1 works poorly. On the other hand, treating idioms as unitary tokens brings the performance of our model up to a high level.
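As a rough illustration of the unitary-token idea (a toy sketch, not the pipeline of Paper 2; the idiom inventory, the two-sentence corpus and the gensim 4.x hyperparameters are invented for the example), multi-word idioms can be merged into single tokens before training embeddings, so that the idiom acquires its own distributional profile instead of being composed from its parts:

from gensim.models import Word2Vec

# Hypothetical idiom inventory: token span -> merged token.
IDIOMS = {("kick", "the", "bucket"): "kick_the_bucket"}

def merge_idioms(tokens):
    """Replace known idiom spans with a single unitary token."""
    out, i = [], 0
    while i < len(tokens):
        for span, joined in IDIOMS.items():
            if tuple(tokens[i:i + len(span)]) == span:
                out.append(joined)
                i += len(span)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

corpus = [merge_idioms("the old man may kick the bucket soon".split()),
          merge_idioms("she ran to kick the ball".split())]
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=50)
print(model.wv["kick_the_bucket"][:5])   # the idiom now has one vector of its own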
After combining neural networks and word embeddings on out-of-context expressions, I present a study
on metaphor detection in context. The complexity of detecting multi-word metaphors in unconstrained
text made it necessary to adopt deeper and more sophisticated neural architectures. We consider two competitive approaches to the problem, discussing the challenges posed by the corpus's annotation system, and the benefits of enriching semantic vector spaces with diverse non-distributional features.
paraphrase but has a narrower focus of interpretation; at the same time, My lawyer likes money will be perceived as a more fitting paraphrase than My lawyer is a strange person, and so on.
We thus decided to frame the problem as an ordering task, rather than as a binary classification problem. For a metaphor-literal pair of sentences, we want to generate both a binary judgment and a gradient score. We prefer the method used in semantic similarity tasks (Xu et al.; 2015; Agirre et al.; 2016) to the one used in traditional paraphrase detection. In this respect, we follow the dominant view in the current literature on the related task of automatic metaphor paraphrasing (Bollegala and Shutova; 2013).
This is the general concern that binds together the last three papers of this compilation.
While Paper 4 discusses this approach, applying it to a “traditional” paraphrase detection task, Paper 5 deals with metaphor aptness as a problem of metaphor paraphrase grading. We present a dataset for metaphor and simile aptness assessment, together with a number of machine learning experiments. Our dataset contains metaphors and similes of various degrees of salience (Giora; 1999; Giora and Fein; 1999; Laurent et al.; 2006; Giora; 2002), together with candidate literal paraphrases. Each sentence
appears in isolation, devoid of a more general context.
It can be argued that a reader’s assessment of acceptability might change depending on a larger
context (as is explored in Paper 6). We tried to create sentences that were not too ambiguous in this
sense, and we relied on the annotators’ “common sense". As Paper 5 will show, the agreement between
the majority of annotators and my own (“golden") judgment is very high: the potential confusion due
to divergent interpretations doesn’t seem to affect the results.
This task falls midway between metaphor comprehension and appreciation (Gerrig; 1989; Cornelissen; 2004). The possibility of ranking the candidates for aptness is of particular importance. While several sentences can be seen as a reasonable interpretation of a metaphor, the aptness of the metaphor to express their meaning can vary (Tourangeau and Sternberg; 1982; Camac and Glucksberg; 1984).
It is important to note that we are treating aptness as a symmetrical phenomenon.
Our annotators were presented with a pair of sentences and were asked to rate the pair for paraphrase aptness. But they always saw the metaphor first, then its candidate paraphrase.
In other words, given a metaphor (the first element) they were asked to determine whether a
given sentence (the second element) was a good paraphrase. We considered this judgment as holding
true also in the opposite direction: it also tells us whether the given metaphor is a good paraphrase of
the literal sentence, and it thus becomes a metaphor aptness judgment.
So if
My lawyer is greedy and aggressive
is judged to be a good paraphrase of
My lawyer is a shark
with an average human score of 3.4/4.0, we maintain that the reverse is also true:
My lawyer is a shark
is an equally good paraphrase of
My lawyer is greedy and aggressive
with the same average score, and thus we say that My lawyer is a shark has an aptness of 3.4/4.0 in expressing the meaning My lawyer is greedy and aggressive.
There is some possibility that, if presented with the opposite frame - the literal sentence first, and then the metaphor - annotators' aptness judgments might change.
Thus, there is some possibility that metaphor aptness is not a symmetric phenomenon.
I consider this an interesting direction to explore in future studies.
Paper 6 offers a concluding study of metaphor aptness. In this case, we use a subset of the previous
corpus to explore the perception of metaphor aptness in extended context, a topic already studied in
cognitive linguistics (Gildea and Glucksberg; 1983). We show the effect of extended context on human
ratings and we try to replicate it by means of a neural architecture.
A Convolutional Neural Network (CNN) is an architecture inspired by the neural patterns observed in the animal visual cortex. It was primarily used for image recognition, but has often been applied to language processing tasks as well (Collobert and Weston; 2008a).
Both LSTM and CNN are able to capture elements of both semantics and syntax (Sboev et al.;
2016; Kim; 2014).
The single architecture most used throughout this series of studies is a composite structure I designed, used to produce a continuous score for a pair of sentences. This architecture consists of two encoders that take the sentences as input, and a final series of fully connected layers that merge the encoders' output and produce the value. Each encoder is composed of a CNN,3 an LSTM and a series of fully connected layers.
3 I actually used a specific kind of CNN, the so-called “atrous CNN”, particularly useful when the input's information is scarce and risks being excessively reduced in the pooling stage.
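The sketch below shows, with assumed layer sizes, hyperparameters and input shapes (it is not the exact model of Papers 4-6), how such a two-encoder structure can be assembled: each encoder stacks a dilated ("atrous") convolution, an LSTM and dense layers, and the two encoder outputs are merged by concatenation into fully connected layers that emit a single continuous score for a sentence pair.

from tensorflow.keras import layers, Model

MAX_LEN, EMB_DIM = 30, 300           # assumed sentence length and embedding size

def build_encoder(name):
    """One encoder: atrous Conv1D -> dropout -> LSTM -> dense layer."""
    inp = layers.Input(shape=(MAX_LEN, EMB_DIM), name=f"{name}_input")
    x = layers.Conv1D(64, 3, dilation_rate=2, activation="relu")(inp)  # "atrous" CNN
    x = layers.Dropout(0.5)(x)        # strong dropout to limit overfitting
    x = layers.LSTM(64)(x)            # sequential modelling over the local features
    x = layers.Dense(32, activation="relu")(x)
    return inp, x

in_a, enc_a = build_encoder("metaphor")
in_b, enc_b = build_encoder("candidate_paraphrase")

merged = layers.Concatenate()([enc_a, enc_b])          # merge the two encoders
merged = layers.Dense(32, activation="relu")(merged)
score = layers.Dense(1, activation="sigmoid")(merged)  # continuous score for the pair

model = Model([in_a, in_b], score)
model.compile(optimizer="adam", loss="mse")
model.summary()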
I am not the first one to combine CNNs and LSTMs: CNN+LSTM structures have proved fruitful
in a number of studies.
Sainath et al. (2015) explore the advantages of a composite architecture very similar to that of
my encoders, formed by a concatenation of CNN, LSTM and fully connected layers, to show that these
three models can complement each other effectively on language processing tasks.
Wang et al. (2016) use a CNN to extract relevant features for sentiment analysis from single sentences, and then feed the output to an LSTM to create a sentiment representation of the full text. Vosoughi et al. (2016) show the advantages of combining CNNs and LSTMs for tweet semantic similarity and sentiment analysis tasks, and Zhang et al. (2017) use a CNN-LSTM combination to predict the intensity of the emotion expressed in a tweet.
Similar combinations are also used in extra-linguistic tasks (Bae et al.; 2016; Wang et al.; 2017) and hybrid tasks like visual question answering (Agrawal et al.; 2016; Johnson et al.; 2017; Santoro et al.; 2017).
The bottom line is that, rather than modelling the input’s sequentiality as a very first step, it
seems a better idea to first use a CNN to extract local features useful for the task at hand, and then
use an LSTM to model the sequential patterns formed by such features.
I chose this specific architecture quite empirically, as it performed best among several competing
models I designed. In Paper 4, I report the performance of a number of potential variants of my
architecture, showing that the presented structure appears to work best.
Such “ablation experiments” between competing models, as presented in Paper 4, show that the single most relevant component of my network is the LSTM (since the entire model's performance drops most sharply when the LSTM is removed or when its size is reduced below a critical threshold), confirming the importance of sequentiality for this kind of task. The second most important
component appears to be the ACNN. The network's other features, such as the dense layers, the dropouts and the merging through concatenation, are also important for the architecture's performance. Finally, the least important variation proved to be the substitution of the ACNN with a normal CNN - which still resulted in a performance drop of a couple of points.
This architecture naturally has a high number of parameters, which can be difficult to control in conditions of data scarcity. In Papers 4 and 5 I report the results of a 12-fold cross-validation to ensure that our architecture is not overfitting (or is not overfitting too much) on the presented training set. The rather strong dropout I inserted in my encoders also has the role of preventing excessive overfitting on the dataset.
The reader can refer to Papers 4 and 5 for a more detailed discussion of the model’s architecture.
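The cross-validation check itself is routine; schematically, a 12-fold loop might look like the following, where an off-the-shelf regressor and random data merely stand in for the actual network and dataset:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Toy stand-ins for the real feature matrix and gradient scores.
X, y = np.random.rand(240, 20), np.random.rand(240)

cv = KFold(n_splits=12, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=cv, scoring="r2")
# Comparable scores across the 12 folds suggest the model is not simply
# memorizing one particular training partition.
print(scores.mean(), scores.std())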
The most general definition of metaphor, also shared by other figures of speech like idioms and
metonymy, is that of an expression that is contextually used outside of its literal meaning (Frege;
1892; Gibbs et al.; 1997; Cacciari and Papagno; 2012). In this sense, metaphor in its basic form is
an intentional form of mis-categorization. If a speaker refers to a person saying He is an elephant,
the interpretation of such a sentence is usually not the literal meaning of the word elephant. Rather,
the sentence’s interpretation resides in a series of prototypical properties that, to an extent, both an
elephant and a person could share, such as having a relatively huge bodily mass, being known to have
an impressive memory, or being perceived as awkward and graceless in movement (Cacciari; 2014).
In this sense, a metaphor implies a process of re-categorization - we can categorize a person as an
elephant under some circumstances (Glucksberg et al.; 1997) - and an analogical shift. To interpret
a non-conventional or novel metaphor, it is necessary to understand, or induce, the shared properties
linking the source and the target domain (Gentner; 1983).
Another essential characteristic of metaphor underlined in the linguistic literature is its compositionality. Metaphors are compositional in nature. A speaker can encounter a new metaphor and try to
decipher its meaning by composing its constituents. When this compositionality is impossible, we are
instead dealing with an idiomatic expression (Sag et al.; 2002; Fraser; 1970; Cruse; 1986; Frege; 1892;
Nunberg et al.; 1994; Liu; 2003; Bohrn et al.; 2012; Cacciari; 2014; Gibbs; 1993, 1994; Torre; 2014;
Geeraert et al.; 2017).1
1 This is of course something of an overstatement. As observed throughout this thesis, the compositionality of both metaphors and idioms is actually a gradient phenomenon (Nunberg et al.; 1994; Wulff; 2008). Different idioms have different degrees of opacity, and different metaphors have different degrees of transparency (Titone and Libben; 2014). A gradient approach to idiomaticity is also assumed in Senaldi et al. (2016) and Senaldi et al. (2017), constituting the basis of the second paper of this compilation.
Many linguistic models of metaphoricity have been produced over the years (Morgan; 1980;
Sweetser; 1991; Vogel; 2001; Wilson; 2011; Romero and Soria; 2014). One of the most influential
of these models is arguably Lakoff’s Conceptual Metaphor Theory (Lakoff; 1989, 1993; Lakoff and
Johnson; 2008b,a; McGlone; 1996). This theory postulates that several metaphors used in natural
language derive from “seed" conceptual metaphors. For example, according to this theory expressions
like Saving my time and Spending some time derive from the conceptual metaphor “Time is Money".
According to this view, many of the metaphors we use are implementations of an implicit analogy
between two general topics or concepts. This is a view often advocated in computational linguistics
approaches to metaphor, and relatively easy to adjust to NLP tools like ontologies and semantic spaces.
Depending on how we draw the line between metaphorical and literal uses of a word or expression,
the number of metaphors found in everyday language is more or less striking, but there is general
consensus on the idea, also supported by corpus-based studies, that it is a pervasive phenomenon in
natural language (Cameron; 2003).
2.1 Metaphor Detection in NLP
Unsupervised approaches tend to use large knowledge bases (Li et al.; 2013), ontologies (Krishnakumaran and Zhu; 2007), semantic similarity graphs (Li and Sporleder; 2010a) or vector space semantic models (Shutova et al.; 2010a; Gutiérrez et al.; 2016) to define words' “standard” meanings. A minority
of studies also use measures from information theory such as pointwise mutual information (Mohler
et al.; 2013), entropy (Schlechtweg et al.; 2017) or Jensen-Shannon divergence (Pernes; 2016) to detect
metaphoricity.
The advantage of using ontologies in metaphor detection is similar to the advantage offered by
vector space semantics. If it is possible to recognize a novel metaphor as a word used out of its usual
context, it is possible to use lexicographic resources like WordNet (Miller; 1995) or MultiWordNet
(Pianta et al.; 2002) to build a simple metaphor detection algorithm. For example, WordNet's definitions, together with the network's lexical structure, can be used to detect a sudden shift in topic in a sentence, which might be due to the presence of a metaphor. Similar ontologies also offer the possibility of moving along the hyponym/hypernym axis, which, in some cases, helps to identify metaphors
and metonymies (Schlechtweg et al.; 2017). They can also be used to attempt an interpretation of the
detected metaphors.
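To give a very rough flavour of this idea (an illustrative toy heuristic, not a method proposed in this thesis, assuming NLTK with the WordNet data installed), one can flag the word whose WordNet senses are, on average, least similar to those of the other words in the sentence:

from nltk.corpus import wordnet as wn   # requires nltk.download('wordnet')

def max_similarity(w1, w2):
    """Best path similarity over any pair of senses of the two words."""
    scores = [s1.path_similarity(s2) or 0.0
              for s1 in wn.synsets(w1)
              for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)

def most_shifted_word(words):
    """The word least similar, on average, to the rest of the sentence."""
    def avg_sim(w):
        others = [o for o in words if o != w]
        return sum(max_similarity(w, o) for o in others) / max(len(others), 1)
    return min(words, key=avg_sim)

# Among content words drawn from a legal context plus one animal term,
# the animal term is likely to stand out as the topic shift.
print(most_shifted_word(["lawyer", "shark", "court", "client"]))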
Such approaches often treat metaphor detection as a special case of Word Sense Disambiguation
(Banerjee and Pedersen; 2002). Although figurative language is not limited to the problem of word
sense disambiguation, this view of metaphoricity has its own motivations. In many respects, the
problems presented by metaphor detection2 are similar to the problems of “new sense detection”, the task of automatically recognizing whether, in a corpus, a word appears used in a new sense. Novel
sense detection techniques work on contextual information, which is also the case for figurative language
processing. Word sense induction overlaps with the common interpretation of figurative language in
its concern with the employment of a word or expression beyond its usual meaning.
In this frame, we can imagine figurative language as the emergence of a “new sense” of a word that can be discovered by analyzing the word's context. This is also the perspective adopted by studies using topic modelling to perform metaphor detection (Li and Sporleder; 2010b; Schulder and Hovy; 2015). A problem with these approaches is that topic modelling is usually based on the presence of topic-specific terminology - in other words, it is based on how the words were linked to specific topics in the training corpus.3
While this can account for several metaphors, we can easily imagine a new metaphor arising
in the form of topic-related terminology used in a new way. Some studies have also noted that words
that strongly represent a specific topic are less likely to be used metaphorically (Klebanov et al.; 2009).
Unsupervised approaches involving vector space semantic models are usually concerned with designing the best operation to apply to the word vectors composing a metaphor in order to cluster
them away from literal expressions (Mohler et al.; 2014; Gutiérrez et al.; 2016; Gong et al.; 2017).
In many cases, these studies still follow the line of new word sense detection. They try to
induce the new sense of a word from the context it occurs in. A radically new context, measured
as distributional distance, becomes the strongest hint for the metaphoric use of a term. Similar
approaches have also been applied in the unsupervised detection of idiomatic expressions (Lin; 1999;
Lenci; 2008; Fazly et al.; 2009; Lenci; 2018; Turney and Pantel; 2010; Fazly and Stevenson; 2008;
Mitchell and Lapata; 2010; Krčmář et al.; 2013; Senaldi et al.; 2017). For a more detailed discussion
of unsupervised detection of figures the reader can refer to Paper 2.
In supervised approaches, machine learning algorithms are trained on annotated corpora to detect
metaphoricity patterns. Feature-based classifiers constitute one of the main tools used in this category
of studies (Turney et al.; 2011; Hovy et al.; 2013; Tsvetkov et al.; 2014a).
If unsupervised approaches to metaphor detection have the practical advantage of avoiding the
data scarcity bottleneck, feature-based classifiers can give interesting insights into the combinations of
features - often psycholinguistic properties of words - that appear to be useful for the detection of
metaphors (Köper and im Walde; 2016b; Köper and Im Walde; 2016a; Köper and im Walde; 2017).
These sets of features can include several different dimensions of a word’s meaning and perception,
such as its syntactic role, its “conceptual" nature, the semantic class it belongs to (Klebanov et al.;
2016), its affective valence (Rai et al.; 2016), as well as more subtle characteristics such as imageability
(Broadwell et al.; 2013).
A feature often used in similar experiments is a word's degree of concreteness (Klebanov et al.; 2015), whose importance for metaphor processing also emerges in cognitive linguistics (Forgács et al.; 2015). Shutova et al. (2010b) present an approach to metaphor detection that we could call “minimally supervised”: a small set of manually annotated metaphors is used to harvest, through distributional semantic similarity, several related metaphorical expressions from a text. In more recent years, Shutova et al. (2016) have tried to use visual features in combination with word embeddings for a supervised metaphor detection system.
Combinations of different models, such as unigrams, part-of-speech and topic models, are often
explored to find the best pipelines to detect metaphoricity (Klebanov et al.; 2014). Selectional preference violation has also often been used as a feature for metaphor detection, both in unsupervised and supervised frames (Haagsma and Bjerva; 2016).
The difference between novel and conventionalized metaphors is often felt in this field, since systems to detect novel metaphors (Schulder and Hovy; 2014) tend to be different - in terms of input features and machine learning classifiers - from systems aiming at the detection of conventionalized metaphors (Mohler et al.; 2013).
In general, many supervised approaches to metaphor detection resort to a large number of different
features. Using a large set of features requires a large set of resources, which can appear as a drawback of traditional classifiers (Schulder and Hovy; 2014). This is one of the reasons why, as in many other sectors of computational linguistics, neural networks have recently become more popular than traditional classifiers for approaching metaphor detection.4
4 Also, the rise of very large corpora for training - or pre-training, as in the case of semantic spaces used in pipelines - has contributed to making neural networks fairly competitive.
Many of the recent supervised experiments in metaphor detection employ deep neural networks. To
name a few recent papers, Rei et al. (2017) present a task-specific neural architecture to predict
metaphoricity in Verb-Noun and Adjective-Noun bigrams, Do Dinh and Gurevych (2016) use a multi-
layered fully connected network to identify token-level metaphors in sentences - an approach later
used in Gutierrez et al. (2017) to predict first-episode schizophrenia in patients through an automatic
analysis of their use of language - and Sun and Xie (2017) apply bi-LSTMs to the task of verb metaphor
detection on unconstrained text.
Nonetheless, the mechanisms that lead to metaphor processing are still an object of investigation in
both artificial and human neural networks (see for example Lacey et al. (2017)). While the application
of more sophisticated network architectures to metaphor detection has led to encouraging results,
explaining the way machine learning algorithms model metaphoricity in text has become increasingly
difficult. Visualizing and understanding the networks themselves is a difficult task (Karpathy et al.;
2015; Dai et al.; 2017) and it is a current concern of the NLP community to create systems to make
the networks’ inner workings more transparent (Ancona et al.; 2017; Lake et al.; 2017).
The idea that we can infer the meaning of a word from its “neighbours” is the basis of vector space semantic models. In a vector space model, each word is associated with a point in a multi-dimensional space that models its contextual distribution in a large corpus. Such points can either represent sparse count-based vectors, reporting the number of times a word appears in a given context, or dense embeddings that maximize the probability of a word being in a given context.
Distributional semantic vector spaces are useful to model several aspects of word meaning and to
quantify the semantic relationships between words and expressions in corpora, such as the semantic
similarity between words in a given text.
Another important characteristic of semantic spaces is that they represent words on a multi-
dimensional continuum. This perspective allows a high degree of flexibility when operating with
elements larger than single terms. For example, it is possible to compute the mean vector of a sentence
as the mean of the vectors of its individual words. In this way, it is possible to 'locate' the words composing said sentence at different distances from the sentence's mean semantic value. This is a simple method that is sometimes used to spot “unfitting” terms in a sequence. At the same time, semantic spaces can, to an extent, reproduce the same valuable word relations encoded in traditional ontologies, such as hyponymy, hypernymy, co-hyponymy and antonymy (Lenci et al.; 2015).
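A minimal sketch of this 'distance from the mean vector' idea (with random vectors standing in for real pre-trained embeddings; this is not code from the studies) could look as follows:

import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_by_distance_from_centroid(sentence, embeddings):
    """Rank words by cosine distance from the sentence's mean vector."""
    vectors = [embeddings[w] for w in sentence if w in embeddings]
    centroid = np.mean(vectors, axis=0)
    scored = [(w, 1.0 - cosine(embeddings[w], centroid))
              for w in sentence if w in embeddings]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy usage: random vectors stand in for pre-trained embeddings.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["my", "lawyer", "is", "a", "shark"]}
print(rank_by_distance_from_centroid(["my", "lawyer", "is", "a", "shark"], emb))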
Vector space semantic models are also one of the most used resources in both supervised and
unsupervised metaphor detection (Haagsma and Bjerva; 2016; Kesarwani et al.; 2017), since they can
provide the “fingerprint” of a word's semantic profile without the need to manually craft a set of ad hoc features.
The major advantage of using distributional semantic spaces in metaphor detection - and also in
the detection of other figures of speech, like metonymy (Nastase and Strube; 2009) - lies in the fact
that metaphors are a contextual phenomenon. A metaphor can be seen as fundamentally composed
of two different semantic domains: one domain acts as a source - and the words related to it have a
literal meaning - while the other domain acts as a target - and the words related to it have a figurative
meaning. The apparent mismatch between source and target domain can, at least in theory, appear
through the difference between the semantic vectors of the words used literally and of those used
metaphorically in a sentence or expression (Steen et al.; 2014; Gutiérrez et al.; 2016). In this frame, semantic spaces appear as a very flexible and powerful tool to model such semantic domains in terms of word clustering and distributional similarity (Mohler et al.; 2014).
Another advantage of using semantic spaces is that they can be trained on very large unannotated corpora. Being pre-trained on substantial amounts of text, semantic spaces can provide classifiers trained on the limited metaphor detection datasets with useful information drawn from big data.
Studies using different resources for their classifiers also report that distributional vectors are the best performing single resource for tackling metaphor detection (Köper and im Walde; 2016b).
It is worth underlining that approaches using vector space semantic models tend to advocate the idea, widely maintained in linguistics, that metaphors are compositional. While they
are not the only linguistic tool to work with compositionality (Lappin and Zadrozny; 2000), vector
space semantic models have proven useful to deal with basic forms of word composition in a number
of cases (Lenci and Zamparelli; 2010; Ferrone and Zanzotto; 2015). Given the contextual nature of
this figure of speech, in some cases basic metaphor detection has been used as a way to test the
quality of semantic spaces themselves (Srivastava and Hovy; 2014; Köper and im Walde; 2016b), and
ad hoc semantic spaces have been designed to deal with metaphor detection (Bulat et al.; 2017).
The presence of analogical symmetries between concepts, something not too far from Lakoff’s idea
of conceptual metaphors, has also been presented as one of the most interesting features of neural
word embeddings (Mikolov, Sutskever, Chen, Corrado and Dean; 2013b). It is also possible to use
vector space semantic models together with other features, such as visual or psycholinguistic features
(Tsvetkov et al.; 2014b; Rai et al.; 2016; Do Dinh and Gurevych; 2016; Shutova et al.; 2016). This is
the same kind of hybrid approach detailed in the third study of this compilation.
In the first three papers of this compilation we also provide analyses of the performance of our
models on different distributional semantic models.
5 There are exceptions. Some studies in textual entailment have shown interest in the peculiarities of figurative
language processing (Agerri; 2008; Turney; 2013).
Chapter 3
Data Collection and Resources
Collecting data for metaphor processing is a demanding task. One problem that arises when dealing
with annotated data for metaphor detection is the definition of metaphor itself. While many cases will
appear metaphorical or literal to the vast majority of annotators (The street was a river of people vs. The Nile is a river), every real document contains a number of less clear-cut elements.
Annotators' sensitivity, background and linguistic training can play a major role in cases of metaphors ingrained in everyday language. Depending on how fine-grained and linguistically aware the annotators are, the number and types of metaphors detected in a corpus can vary. In part due to these obstacles, the scientific community has produced only a small number of open-source annotated corpora for figurative language studies, and major, standard datasets are still lacking (Shutova; 2011).
It is possible to divide the available resources into three main categories.
1. The first category comprises conceptual or ontological resources, such as the list presented in Lakoff et al. (1991). It includes handcrafted, theory-driven constructions, like catalogs of widespread metaphors. While these resources can be of great utility for other approaches, they are of little interest for the main goals of this research.
2. The second category includes datasets annotated for metaphoricity, like the ones I have been
using for my own research. These datasets usually consist of selected text extracts where words
or sentences are annotated as “figurative” or “metaphorical”. Sometimes, these datasets can be
created as an expansion of “first category" resources. For example, the metaphor dataset created
by Hovy et al. (2013) contains 3879 sentences generated by bootstrapping from lists of classical
metaphors. Amazon Mechanical Turk users later annotated these sentences as metaphorical
or literal. Seven different annotators labeled each sentence, and the sentences that the majority
found impossible to judge were discarded. These resources are often centered on a specific part
of speech used metaphorically. For example, Dunn (2013a) produced and publicly released a
corpus of 500 sentences centered on a set of 25 verbs, annotated as metaphorical or literal. The
largest existing dataset of this kind is probably the VU Amsterdam Metaphor Corpus (Steen et al.; 2010), which I describe in more detail in Section 3.3.
3. The last category includes datasets that divide figurative language into types (for example,
sentences can be linked to the general kind of metaphor they implement, and so on). These
annotated datasets usually present sets of sentences annotated with some kind of conceptual
mapping or metaphorical interpretation. An open source example is Gordon et al. (2015)’s
corpus, containing 1771 metaphorical sentences. This corpus contains only examples of figurative
language (thus there is no ”literal” counterpart in the corpus), annotated for source and target
domain. The corpus is partly made up of hand-annotated sentences, and partly of automatically annotated sentences that were manually corrected.
Cases of “composite" annotation of figures also exist. Dunn (2013b) for example released a re-
annotated subset of the VU Amsterdam Corpus where sentences are annotated not only as metaphorical
or literal, but also as “humorous” or not, constituting one of the rare corpora where figurativity is
divided into categories. They used this subset to compare four metaphor recognition systems.
Most of these resources suffer from the second major problem of metaphor corpora: data scarcity.
The community has often applied traditional ways to deal with scarcity of data, like artificially
bootstrapping the dataset (He and Liu; 2017) and looking for complementary resources that could
allow the employment of richer feature sets (Tanguy et al.; 2012).
An effect of data scarcity is that many studies tend to create their own resources to train and test their models. It is a common complaint that different systems in figurative language processing are tested on different datasets (Shutova; 2011).
In the first three publications of this thesis, we used pre-existing resources for metaphor detection.
In the first and third papers, such resources present a binary view of figurativity: words and expressions
are labelled as either figurative (1) or literal (0). The dataset used in the second paper presents instead
a nuanced view of figurativity: the expressions in the dataset have a continuous score of idiomaticity,
going from completely literal to unmistakably idiomatic (Senaldi et al.; 2016, 2017). We used the same
approach for metaphor aptness judgments in Papers 4 to 6.
In general, we believe that continuous scores are more suitable than binary scores when dealing
with figurative language, both for detection and aptness assessment. Not all figurative expressions
have the same level of figurativity. Figurative expressions tend to become more and more conventional
precisely through their usage in everyday communication (Bowdle and Gentner; 2005) - thus follow
the law might be perceived as having a somewhat lower level of figurativity than the winds of change. I discuss this aspect in more detail in my conclusions. To conclude this chapter, I will now provide a
cursory presentation of the corpora used in this thesis.
as an ordering task. This dataset contains 250 sets of 5 sentences each, labeled on a 1-5 scale of
paraphrasehood, in analogy with semantic similarity datasets (Xu et al.; 2015; Agirre et al.; 2016).
Every group of five sentences contains one reference sentence and four candidate paraphrases. This corpus was annotated by me. Since this was a proof-of-concept study aimed at finding a new frame for metaphor paraphrase processing, I did not run a crowd-sourced annotation, as I did in the following works. While its annotation scheme supports graded semantic similarity labels, it also provides sets of related elements. It is thus possible to score or order each pair of elements independently of the others. For a detailed discussion of its characteristics and design the reader can refer to Paper 4.
1 This dataset contains both examples of metaphors and similes. Metaphors and similes cannot always be treated
as equivalent elements in linguistics (Sam and Catrinel; 2006; Glucksberg; 2008), but for the purposes of our study we
considered them as belonging to the same category.
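Purely for illustration, a single group from a dataset with this structure could be represented along the following lines; three of the candidate sentences echo examples used elsewhere in this thesis frame, but the fourth sentence, the scores and the binary cut-off are invented here rather than taken from the actual corpus:

# One group: a reference metaphor plus candidate paraphrases with 1-5 scores.
group = {
    "reference": "My lawyer is a shark",
    "candidates": [
        {"text": "My lawyer is greedy and aggressive", "score": 4.6},
        {"text": "My lawyer likes money", "score": 3.2},
        {"text": "My lawyer is a strange person", "score": 2.1},
        {"text": "My lawyer was late this morning", "score": 1.3},
    ],
}

# Ordering task: rank the candidates by their gradient score.
ranked = sorted(group["candidates"], key=lambda c: c["score"], reverse=True)

# Binary task: threshold the same scores (the 3.0 cut-off is an assumption).
binary = [(c["text"], int(c["score"] >= 3.0)) for c in group["candidates"]]

print([c["text"] for c in ranked])
print(binary)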
Chapter 4
Summary of the Studies
The series of studies I present in this compilation follows, and partly answers, the main questions I outlined in the Introduction.
This is the logic within which they should be read.
To what extent can we exploit distributional semantic spaces through artificial neural
networks to perform supervised metaphor detection?
2. Paper 2 - Finding the Neural Net: Deep-learning Idiom Type Identification from Distributional Vectors. We explore the importance of individual words' distributional profile in metaphor detection and the compositionality of the approach presented in Paper 1 through a comparison with the twin task of idiom identification.
3. Paper 3 - Bigrams and BiLSTMs: Two neural networks for sequential metaphor detection. We ap-
ply a combination of vector space lexical embeddings and neural networks to in-context metaphor
detection.
Is it possible to use the same approach to go beyond metaphor detection, and tackle
metaphor aptness assessment as a natural language processing task? What is the best
way of dealing with metaphor aptness assessment, in terms of both dataset structure and
task design?
These are the leading questions for the second part of my thesis. As before, the three papers represent
a unitary development.
1. Paper 4 - Deep Learning of Binary and Gradient Judgements for Semantic Paraphrase. We define
the dataset structure and neural architecture we will use for metaphor aptness assessment. As
a middle ground step, we define them through a task of “general" paraphrase detection and
ranking.
2. Paper 5 - Predicting Human Metaphor Paraphrase Judgments with Deep Neural Networks. We ap-
ply the dataset structure and neural architecture defined in our previous publication to metaphor
aptness assessment.
3. Paper 6 - The Effect of Context on Metaphor Paraphrase Aptness Judgments. As with the last
paper of the previous triad, we explore automatic metaphor aptness assessment in a frame of
extended context.
Both groups of studies are related, as are the research questions that dictated them. The first group
of studies applies a specific combination of tools, and a specific view of metaphoricity, to a relatively traditional task - metaphor detection. The second group represents an attempt to bring such a combination of tools, and view of metaphoricity, beyond the limits of metaphor detection into a less studied,
and even more challenging, task.
4.1 Study I
I am responsible for the central idea of the paper, together with the basic implementation of our
experiment. Mehdi Ghanimifard developed and analyzed most of the ablation trials presented in the
study, while Stergios Chatzikyriakidis focused on the theoretical background and selected the dataset.
4.2 Study II
The main structure of this research was discussed and elaborated by Marco Senaldi, Alessandro Lenci
and me. I developed and implemented the neural classifier used in this study and I both designed and
performed all of the experiments. Marco Senaldi provided all the datasets and produced most of the
theoretical background and analytical discussion of the paper.
test the model’s performance, we used the training and test sets defined by the Workshop’s task. Our
results were also reproduced independently in order to assess our performance. This paper represents
our attempt to deal with metaphor detection on real text. The corpus we use to both train and test
our models is the VU Amsterdam Corpus, a subsection of the British National Corpus annotated
for metaphoricity (see also Chapter 3). We provide an analysis and critical discussion of the corpus,
together with the results of our experiments.
The task of detecting metaphors in unconstrained text proves to be more difficult than the task
of finding metaphors in a list of expressions. To attempt this task, architectures more complex than a
basic Perceptron become necessary.
We compare the performance of a Bi-LSTM and a composition-based hierarchical network, using both GloVe (Pennington et al.; 2014) and Word2Vec (Mikolov, Sutskever, Chen, Corrado and Dean; 2013a) pre-trained embeddings. We find that the Bi-LSTM achieves the best individual performance, while a combination of both networks yields the best overall performance.
Also, enriching the input’s word embeddings with explicit features, such as words’ concreteness
scores as given in Brysbaert et al. (2014), and manipulating the input to break the length of sentences
prove to be useful strategies. In this sense, to improve our performance we had to abandon the
only-distributional approach assumed in the other papers and recur to systems previously adapted in
“standard" feature based studies (Köper and im Walde; 2017).
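A schematic version of such a token-level tagger (with assumed dimensions; an illustrative sketch rather than the exact system of Paper 3) makes the enrichment step concrete: a per-token concreteness score is simply concatenated to each word embedding before the Bi-LSTM.

from tensorflow.keras import layers, Model

MAX_LEN, EMB_DIM = 50, 300      # assumed maximum sentence length and embedding size
N_EXTRA = 1                     # e.g. one concreteness score per token

tokens = layers.Input(shape=(MAX_LEN, EMB_DIM), name="word_embeddings")
extra = layers.Input(shape=(MAX_LEN, N_EXTRA), name="concreteness")

x = layers.Concatenate()([tokens, extra])    # enrich the distributional vectors
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
labels = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)  # per-token metaphoricity

model = Model([tokens, extra], labels)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()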
We compare our results with those of Do Dinh and Gurevych (2016), showing that increasing the architecture's depth beyond a certain point does not provide significant improvements. Our system scored
second best in performance when trained on the training partition of the shared task (Leong et al.;
2018).
Mehdi Ghanimifard and I developed the main structure, the background research and the analytical
discussion of the paper. I took care of the Bi-LSTM implementation, while Ghanimifard designed
and tested the competing architecture. I also developed the part of our study dealing with input
manipulation and feature enrichment.
4.4 Study IV
1 We measured ranking in terms of both Pearson’s and Spearman’s correlation. Since both correlations were very
similar, we only reported Pearson’s correlations in the paper.
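For reference, these are the standard textbook definitions of the two measures (not formulas reproduced from the paper): Pearson's correlation is computed between model scores x_i and human scores y_i, while Spearman's correlation is Pearson's r computed over ranks; in the common closed form, which assumes no ties,

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}},
\qquad
\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n(n^{2}-1)},

where d_i is the difference between the ranks of x_i and y_i.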
This paper presents the dataset structure and the machine learning approach that we apply to metaphor aptness in the remaining studies. The pipeline described above is also the same one used in Papers 5 and 6.
Shalom Lappin and I discussed and developed the main structure of this study. I created and annotated
the dataset and I designed and tested the neural classifier, under Lappin’s supervision.
4.5 Study V
We also argue that this task can in fact be summarized in the following question: Given
that X is a metaphor, which one of the candidates would be its best literal interpretation? This creates
apparent paradoxes with respect to a traditional paraphrase task. When confronted with two sentences
like The candidate is a fox and The candidate is cunning, a typical paraphrase model should return
a low score, while our model interprets the first sentence as an apt metaphor for the second sentence,
and assigns the pair a high score. When presented with a pair like The Council is on fire and The
Council is burning, a typical paraphrase model should return a high paraphrase score, while our model
gives the pair a low grade. In other terms, classical paraphrasehood and metaphor aptness are not
completely overlapping.
Shalom Lappin and I discussed and developed the main structure of the study. I created the dataset
and tested the neural classifier under Lappin's supervision. I also took care of the crowd-sourced annotation.
4.6 Study VI
Our results contradict in part the expectations expressed by previous cognitive studies on metaphor
aptness (Chiappe et al.; 2003), and are confirmed by independent findings on the effect of extended
context on acceptability judgments published in Bernardy et al. (2018).
We also provide a review of cognitive linguistic studies on both out of context (Tourangeau and
Sternberg; 1981; Fainsilber and Kogan; 1984; Tourangeau and Rips; 1991; Blasko; 1999; Chiappe et al.;
2003) and in context (McCabe; 1983) metaphor aptness assessment.
This paper is available on arXiv and closes my compilation.
I am responsible for the central idea of the study, which I developed from Lappin's contemporaneous work on acceptability in context. I took care of the creation of the new dataset, of its crowd-sourced annotation and of the neural classifier's training, under Lappin's supervision.
Chapter 5
Conclusions
The aim of this thesis is to provide a conceptually related set of experiments on metaphor processing,
starting from the most “simple" cases of metaphoric bigrams and escalating to complex problems that
go beyond metaphor detection.
In this thesis I used a combination of semantic spaces and neural networks to explore metaphor detection and metaphor aptness both in and out of context. This combination provides a flexible framework for machine learning: it does not require feature engineering and it is, as shown in Paper 2, language-independent, a characteristic appreciated in metaphor studies (Kövecses; 2003; Deignan; 2003; Tsvetkov et al.; 2014b; Shutova; 2010c).
While the set of distributional spaces used throughout my research has been more or less constant, I used different neural architectures for different tasks. The architectures increase in complexity with the task at hand. While I tackled the “simpler” detection tasks with rather shallow, fully connected networks, I had to deal with metaphor detection in unconstrained text through a Bi-LSTM, which has a far more sophisticated structure. To approach metaphor aptness assessment I designed a deeper, composite architecture featuring a combination of CNNs and LSTMs. This escalation was not decided a priori. I tested simpler models on every task and turned to more complex architectures when it proved necessary.
An important contribution of this thesis should be the new approach we propose to deal with
metaphor aptness assessment, a rarely studied topic in computational linguistics. I propose a new
dataset to assess metaphor aptness, annotated through crowd sourcing by a large number of humans.
I consider this part of my compilation the most original but, at the same time, the most prob-
lematic. The conceptual difficulties of going beyond metaphor detection in computational linguistics
are notorious (Sculley and Pasanek; 2008). In some sense, to shift towards aptness assessment we had
to start from scratch. While cognitive literature on metaphor aptness is relatively abundant (Johnson
and Malgady; 1979; Marschark et al.; 1983; Blasko and Connine; 1993), natural language processing has left this topic almost untouched.1
1. To what extent can we exploit vector space lexical embeddings through artificial
neural networks to perform supervised metaphor detection?
Vector space lexical embeddings proved to be highly effective in detecting metaphoricity in constrained datasets (out of context, representing specific kinds of metaphors). A simple neural network trained purely on vector space models outperformed traditional supervised systems by a wide margin on an out-of-context metaphor dataset and reached an accuracy above 90% (Paper 1).
This approach clearly relies on the distributional profiles of the single words composing the
metaphors. It is able to use the semantic signature of words learned from very large corpora such
as itWaC (Baroni et al.; 2009) or Google News (Mikolov, Sutskever, Chen, Corrado and Dean;
2013d) in order to learn a specific semantic task like metaphor detection from small supervised
training sets. Our pipeline resembles a transfer learning paradigm: the embeddings learned in
an unsupervised way from very large corpora are used for a supervised learning task on small
datasets.
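To make this pipeline concrete, the following is a minimal sketch of such a transfer-style setup, assuming pretrained word2vec vectors loaded with gensim and a PyTorch classifier; the file name, layer sizes and optimiser are illustrative placeholders rather than the exact configuration of Paper 1.

```python
# A minimal sketch (not the exact code of Paper 1): frozen pretrained word2vec
# vectors represent an adjective-noun bigram, and a shallow fully connected
# network is trained on a small labelled set.
import numpy as np
import torch
import torch.nn as nn
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors.bin", binary=True)  # placeholder path

def encode_pair(adjective, noun):
    """Concatenate the two pretrained word vectors of an adjective-noun bigram."""
    return np.concatenate([vectors[adjective], vectors[noun]])

# Shallow, fully connected binary classifier over the concatenated embeddings.
classifier = nn.Sequential(
    nn.Linear(2 * vectors.vector_size, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid()
)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

def train_step(bigrams, labels):
    """One supervised update on a small labelled set of (adjective, noun) pairs."""
    x = torch.tensor(np.stack([encode_pair(a, n) for a, n in bigrams]), dtype=torch.float32)
    y = torch.tensor(labels, dtype=torch.float32).unsqueeze(1)
    optimizer.zero_grad()
    loss = loss_fn(classifier(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```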
Being based on the distributional information of single words, this approach performs poorly
when applied to non-compositional multi-word figures like idioms.
When idioms are instead treated as unitary tokens, their distributional profile in large corpora
can be exploited by a neural network to learn idiomaticity from very small training data. As
for other non-compositional expressions (Loukachevitch and Parkhomenko; 2018a,b), the dis-
tributional profile of idioms appears to contain enough information for our network to learn
idiomaticity, but the same information cannot be found in the combination of the distributional
profiles of the single words composing the idioms. The opposite holds for metaphoric (non-idiomatic) expressions.²

² Our findings about idioms' distributional properties have been recently confirmed by Peng et al. (2018).
Distributionality and compositionality are thus two of the main strengths of our framework.
Also, this approach appears to be language-independent (Paper 2).
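The contrast between unitary and compositional representations can be illustrated with a small sketch, under the assumption that known idioms were pre-joined into single tokens (e.g. "kick_the_bucket") before the embedding space was trained; the file name and function names are hypothetical.

```python
# Illustration of the unitary vs. compositional contrast discussed above.
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("corpus-with-joined-idioms.bin", binary=True)  # placeholder path

def unitary_representation(expression):
    """The expression treated as one token: its own distributional profile."""
    return vectors[expression.replace(" ", "_")]

def compositional_representation(expression):
    """The averaged vectors of the component words: informative for metaphoric
    expressions, much less so for non-compositional idioms."""
    return np.mean([vectors[word] for word in expression.split()], axis=0)
```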
Nonetheless, switching from out-of-context, constrained datasets to contextual and unconstrained
metaphor detection brings the difficulty of the task to a higher level.
We resorted to a Bi-LSTM architecture to tackle this task, comparing (and ultimately combining)
it with a simpler hierarchical model. In general, Bi-LSTMs are proving increasingly useful for
tasks of sequential annotation on unconstrained text, such as multi-word expression detection
(Berk et al.; 2018).
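As a point of reference, a minimal Bi-LSTM sequence tagger of this general kind can be sketched as follows; the dimensions and the use of PyTorch are assumptions, not the settings of Paper 3.

```python
# A minimal Bi-LSTM sequence tagger: every token of a sentence receives a
# metaphorical/literal label. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, emb_dim=300, hidden=128, n_labels=2):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)

    def forward(self, embedded_sentence):
        # embedded_sentence: (batch, sentence_length, emb_dim) pretrained word vectors
        states, _ = self.lstm(embedded_sentence)
        return self.out(states)  # one label distribution per token

tagger = BiLSTMTagger()
logits = tagger(torch.randn(1, 12, 300))  # a dummy 12-token "sentence"
```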
Bi-LSTMs and vector space lexical embeddings perform encouragingly, but there is
still room for improvement. Non-distributional features, often used in metaphor detection (Rai
and Chakraverty; 2017), also help improve the accuracy of our models. Vector space lexical
embeddings and neural networks can learn metaphor detection on unconstrained text to a good
extent, but they cannot (yet) account for all dimensions of metaphoricity (Paper 3).
While leaving room for improvement, our model was nonetheless quite successful: we were the
second best performing group in the 2018 Metaphor Detection Shared Task. The best performing
group also used sequential deep neural networks. Applying Bi-LSTMs to metaphor detection
is an active trend in the field, and new applications improving their performance are being
published (Gao et al.; 2018).
2. What is the best way of dealing with metaphor aptness assessment, in terms of
dataset structure and task design?
We designed a new kind of dataset to deal with metaphor aptness as a form of metaphor para-
phrase. This dataset allows the user to deal with aptness as both a binary task and an ordering
task. It is composed of groups of five sentences, where the first one contains a metaphor and the
remaining four are candidate paraphrases annotated by degree of aptness.
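To illustrate this structure, one item of such a dataset can be pictured as follows; the aptness scores and the last three candidate sentences are invented placeholders (the first two sentences echo the example given in the Appendix), and the 2.5 threshold used for the binary view is an assumption.

```python
# Illustrative structure of one dataset item (scores and most sentences invented).
item = {
    "reference": "The crowd was a roaring river.",
    "candidates": [
        {"text": "The crowd was huge and noisy.",        "aptness": 3.4},
        {"text": "The crowd was moving quickly.",        "aptness": 2.1},
        {"text": "The crowd was a large body of water.", "aptness": 1.6},
        {"text": "The river was full of people.",        "aptness": 1.2},
    ],
}

# Binary view: a candidate counts as a paraphrase above a threshold on the 1-4 scale.
binary_labels = [int(c["aptness"] >= 2.5) for c in item["candidates"]]

# Ordering view: candidates ranked by graded aptness.
ranking = sorted(item["candidates"], key=lambda c: c["aptness"], reverse=True)
```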
This frame, while not yet applied to aptness, is presented in Paper 4 for a paraphrase scoring
task, together with a neural architecture designed to be trained on it. We did not choose
a paraphrase detection task merely as a half-way step towards metaphor aptness: we
think that this way of dealing with paraphrases could actually be beneficial in itself, and indeed several
studies are combining paraphrase detection and text similarity (Soleymanzadeh et al.; 2018). The
possibility of interpreting paraphrasehood in a non-binary way, and of ranking the degree to which
two paraphrases are close to or far from one another, could be particularly useful for plagiarism
detection and literary scholarship (Moritz et al.; 2018). At the same time, our approach appears
simpler, in terms of pipeline and resources employed, than several paraphrase detection or text
similarity experiments proposed in the literature (Mohamed and Oussalah; n.d.).
3. Is it possible to use the same approach to go beyond metaphor detection, and tackle
metaphor aptness assessment as a natural language processing task?
It is indeed possible to apply a combination of vector space lexical embeddings and deep neural
networks to tackle, to a degree, metaphor aptness assessment (Paper 5).
As for metaphor detection, the dimension of contextuality, that is, the amount of extended context
provided in the dataset, appears to play an important role in aptness assessment (Paper 6).
This task and topic should be considered a first step in the rarely studied field of automatic
metaphor aptness assessment.
While neural networks are widely used in the neighbouring fields of metaphor detection, paraphrase
rating and sentence representation (Chen, Hu, Huang and He; 2018; Tang and de Sa; 2018; Chen,
Guo, Chen, Sun and You; 2018), this is to the best of my knowledge the first application of a
neural architecture to the task of metaphor aptness assessment. Similarly, word embeddings have
previously been used to explore metaphor paraphrasing and metaphor interpretation (Utsumi;
2011), but, to the best of my knowledge, this is the first application of distributional semantic
spaces to metaphor aptness assessment.
A solid development of this line of work could be of interest for metaphor generation systems
(Veale; 2014, 2015), which are sometimes used for artificial tutors and conversational systems (Rzepka
et al.; 2013; Dybala et al.; 2012).
On a merely “technological” level, I have explored a variety of experimental frames, neural architectures,
semantic spaces and corpora throughout my studies. This variety can be confusing. In Table 5.1 I present
a synthesis of the main frames presented in each experiment. This table is not a comprehensive
overview of all the combinations explored in my work, but rather a conclusive summary of the most
important ones. Colors and line breaks should help distinguish between the detection-oriented and
the paraphrase-oriented studies, and point to the shift between out-of-context and in-context datasets.
This can be seen as a very short summary of my findings and considerations about metaphor
processing. Still, I think this does not exhaust the conclusions I can draw from my experience.
There are some recurrent topics that also informed my line of research in metaphor processing: the
nuanced approach to both metaphoricity and aptness, the necessity of dealing with the sequentiality of the input,
the problem of data scarcity and the matter of context in metaphor datasets.
I will briefly analyze them in the rest of this chapter.
Task | Architecture | Semantic spaces | Dataset | Results
AN metaphoricity | Single fully connected neural layer | Various different distributional spaces | AN corpus annotated for metaphoricity | F1 higher than 0.9 in most settings
Short phrases idiomaticity | Three-layered fully connected network | Various different distributional spaces | Datasets of expressions annotated for idiomaticity | F1 of 0.85 (VN) and 0.89 (AN)
Word metaphoricity in context | Multi-layered fully connected network and a Bi-LSTM | GloVe and W2V spaces enriched with explicit features | VU Amsterdam Metaphor Corpus | F1 of 0.63
Paraphrase quality assessment | Deep NNs combining ACNN, LSTMs and fully connected layers | W2V space | Manually crafted corpus for paraphrase ranking | F1 of 0.76, Pearson correlation of 0.61
Metaphor paraphrase (aptness) assessment | Deep NNs combining ACNN, LSTMs and fully connected layers | W2V space | Manually crafted corpus for metaphor paraphrase ranking | F1 of 0.67, Pearson correlation of 0.55
Metaphor paraphrase (aptness) assessment in extended context | Deep NNs combining ACNN, LSTMs and fully connected layers | W2V space | Manually crafted corpus for metaphor paraphrase ranking in context | F1 of 0.72, Pearson correlation of 0.3

Table 5.1: Summary of the experimental settings presented in this compilation. The first three studies
are about detection, the last three about paraphrase. In both sets, the last experiment includes
extensive context.
5.3 Sequentiality
In all my experiments, I took the sequentiality of the input into account.
While systems that do not consider sequentiality can still perform encouragingly on some datasets
for metaphor detection, word order is often an important dimension to consider when dealing
with figurative language. The importance of sequentiality in metaphor detection appears most clearly
in the third paper of this compilation, where two types of neural network - a Bi-LSTM and a simpler,
multi-layered fully connected network - are compared on a metaphor detection dataset. The Bi-LSTM,
the more sequence-oriented of the two models, outperformed the other architecture. At the
same time, however, the best overall performance was achieved by a combination of the two systems,
hinting at the need for a more complex approach to the matter.
To deal with metaphor aptness assessment I also used sequential models (LSTMs) together with
models able to detect non-sequential, far-reaching patterns in the data (CNNs).
As this task is somewhat close to paraphrase detection, sequentiality becomes even more essential.
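A minimal sketch of what combining the two kinds of model over a sentence pair can look like is given below; layer sizes, pooling choices and the scoring head are assumptions and do not reproduce the exact architecture used in the paraphrase studies.

```python
# Sketch: an LSTM captures the sequential structure of each sentence, a CNN
# captures local patterns; both summaries are concatenated and the two
# sentence representations are scored jointly.
import torch
import torch.nn as nn

class PairEncoder(nn.Module):
    def __init__(self, emb_dim=300, hidden=128, n_filters=64):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size=3, padding=1)
        self.score = nn.Linear(2 * (2 * hidden + n_filters), 1)

    def encode(self, x):
        # x: (batch, length, emb_dim) pretrained word vectors
        lstm_out, _ = self.lstm(x)
        seq = lstm_out.mean(dim=1)                       # sequential summary
        conv = torch.relu(self.conv(x.transpose(1, 2)))  # local pattern detectors
        loc = conv.max(dim=2).values                     # max-pooled CNN features
        return torch.cat([seq, loc], dim=1)

    def forward(self, sent_a, sent_b):
        return self.score(torch.cat([self.encode(sent_a), self.encode(sent_b)], dim=1))

model = PairEncoder()
score = model(torch.randn(1, 10, 300), torch.randn(1, 9, 300))  # dummy sentence pair
```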
5.5 Context
The presence of extended context in training and test sets is another relevant line of my work.
For both metaphor detection and metaphor aptness assessment, I started with out-of-context
datasets and then moved on to more contextualized corpora.
In metaphor detection, it was possible to use existing resources for both cases: lists of metaphorical
(or idiomatic) expressions for the first stage, and annotated corpora for the second stage.
For metaphor aptness, both frames lacked data. I thus had to first create a corpus for the out-of-
context stage, and then expand a part of that corpus to move to the contextualized frame.
When dealing with context in metaphor aptness, I was able to observe a consistent shift in human
judgments. Our annotators' ratings were compressed towards the mean: aptness ratings increased for
low-rated pairs and decreased for high-rated pairs.
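One simple way to quantify this compression is to compare how far the mean ratings sit from the midpoint of the 1-4 scale with and without extended context, as in the following sketch; the score arrays are invented for illustration and are not the actual annotation data.

```python
# Sketch of a compression check: ratings far from the midpoint of the 1-4
# scale out of context should move towards it once extended context is added.
import numpy as np

midpoint = 2.5
out_of_context = np.array([1.2, 1.8, 2.4, 3.1, 3.7])  # hypothetical mean ratings per pair
in_context     = np.array([1.6, 2.0, 2.4, 2.9, 3.3])  # same pairs, rated with extended context

spread_before = np.mean(np.abs(out_of_context - midpoint))
spread_after = np.mean(np.abs(in_context - midpoint))
print(spread_before > spread_after)  # True if scores are compressed towards the centre
```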
This different perception of aptness was detectable only thanks to the gradient approach we had
chosen for human annotation. If we had gone for a binary classification task, this compression towards
the mean would have gone largely unnoticed. This seems to be another dimension of figurativity
studied more in cognitive than in computational linguistics (Inhoff et al.; 1984).
It would be interesting in the future to explore this contextual effect on detection-oriented
datasets as well.
5.6 Limitations
I am aware of several limitations of my work.
The notorious opacity of neural networks is a drawback to overcome in future studies. The features that
drive learning are not as transparent as they were in traditional machine learning approaches (see
for example Li and Sporleder (2010c)). Where possible, I tried to infer some of the mechanisms
behind the networks' performance through ablation experiments. For example, in Paper 1 we show
that our model side-learns an abstract-concrete semantic continuum in order to distinguish between
figurative and literal expressions.
When dealing with more complex kinds of input, like sentences or pairs of sentences, clarifying
the networks' inner workings became harder. As a result, the performance of the model often
remains unexplained from a linguistic point of view.
Another limitation lies in the size of some of the presented datasets, especially the datasets
we created for metaphor aptness assessment. These datasets are necessarily small: they are hard to
produce and have to be built in completely non-automated ways. In other words, each example
has to be produced from scratch.
In short, my work in metaphor processing, and especially in aptness assessment, suffers from the
same limitations Tony Veale described when talking about Figurative Language Processing in general:
it is “neither scalable nor robust, and not yet practical enough to migrate beyond the lab” (Veale;
2011).
Appendix
This is a short appendix detailing how the main annotation task for paraphrase ranking was
carried out.
The metaphor aptness dataset (Paper 5) is composed of groups of five sentences, each consisting of a
reference sentence and four candidate paraphrases.
The annotation of this corpus was carried out through Amazon Mechanical Turk.
Sentences were presented to the anonymous annotators in the form of groups of pairs (an example of
the annotation frame is shown in Figure 6.2).
The annotators could score each pair from 1 (completely unrelated) to 4 (strong paraphrase). An
average of 20 human annotators scored each pair. Two filters were applied to identify unreliable
(“rogue”) annotators:
1. Some “trap” elements (sentences that were completely unrelated) were inserted in the task.
Annotators who did not give the minimum score to these elements were discarded as rogues.
2. Annotators who gave very high or very low scores to the vast majority of pairs were also discarded
as rogues.
After filtering out the rogues, we were left with an average of 15 annotations per pair. These annotations
were then averaged and compared with my own annotation of the corpus, showing a high
correlation with my judgment.¹ The mean human judgments for each pair were then used as gold
labels to train and test my models.

¹ This was intended only as a sanity check: a low correlation with my judgment would not have automatically resulted
in the elimination of the whole annotation set.
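The filtering and aggregation logic just described can be sketched as follows; the function names and the threshold for "polarised" annotators are illustrative assumptions, while the rule that trap items must receive the minimum score follows the procedure above.

```python
# Sketch of rogue filtering and aggregation. Each annotation set is a dict
# mapping a pair id to the 1-4 score given by one annotator; the 0.9 threshold
# for "almost only extreme scores" is an assumption.
import numpy as np

def is_rogue(annotation, trap_ids, extreme_share=0.9):
    """Flag annotators who miss the trap items or give almost only extreme scores."""
    if any(annotation.get(trap, 1) != 1 for trap in trap_ids):  # traps must get the minimum score
        return True
    extreme = sum(1 for score in annotation.values() if score in (1, 4))
    return extreme / len(annotation) >= extreme_share

def gold_labels(annotations, trap_ids):
    """Average the scores of the remaining annotators for every sentence pair."""
    kept = [a for a in annotations if not is_rogue(a, trap_ids)]
    pair_ids = {p for a in kept for p in a if p not in trap_ids}
    return {p: float(np.mean([a[p] for a in kept if p in a])) for p in pair_ids}
```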
We also used Amazon Mechanical Turk for the extended context dataset (Paper 6). In
this case, the annotators were presented with pairs of three-sentence paragraphs, in which the central
sentence (the “relevant” one) was highlighted:
They had arrived in the capital city. The crowd was a roaring river. It was glorious.
They had arrived in the capital city. The crowd was huge and noisy. It was glorious.
Figure 6.2: An example of annotation frame for the metaphor aptness dataset.
The same procedures were applied as for the previous annotation in terms of scoring system, rogue
filtering and handling of the results.
For more detailed information, the reader can refer to Paper 5 and Paper 6, where the annotation
logic and results for each dataset are discussed in more detail.
Figure 6.3: An example of annotation frame for the in-context metaphor aptness dataset.
Bibliography
Abe, K., Sakamoto, K. and Nakagawa, M. (2006). A computational model of the metaphor generation
process, Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 28.
Agerri, R. (2008). Metaphor in textual entailment, COLING 2008, 22nd International Conference on
Computational Linguistics, Posters Proceedings, 18-22 August 2008, Manchester, UK, pp. 3–6.
URL: http://www.aclweb.org/anthology/C08-2001
Agirre, E., Banea, C., Cer, D. M., Diab, M. T., Gonzalez-Agirre, A., Mihalcea, R., Rigau, G.
and Wiebe, J. (2016). Semeval-2016 task 1: Semantic textual similarity, monolingual and
cross-lingual evaluation, Proceedings of the 10th International Workshop on Semantic Evaluation,
SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16-17, 2016, pp. 497–511.
URL: http://aclweb.org/anthology/S/S16/S16-1081.pdf
Agrawal, A., Batra, D. and Parikh, D. (2016). Analyzing the behavior of visual question answering
models, arXiv preprint arXiv:1606.07356 .
Ancona, M., Ceolini, E., Öztireli, C. and Gross, M. (2017). A unified view of gradient-based attribution
methods for deep neural networks, arXiv preprint arXiv:1711.06104 .
Bae, S. H., Choi, I. and Kim, N. S. (2016). Acoustic scene classification using parallel combination of
lstm and cnn, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016
Workshop (DCASE2016), pp. 11–15.
Bambini, V., Bertini, C., Schaeken, W., Stella, A. and Di Russo, F. (2016). Disentangling metaphor
from context: an erp study, Frontiers in psychology 7: 559.
Bambini, V., Canal, P., Resta, D. and Grimaldi, M. (2018). Time course and neurophysiological
underpinnings of metaphor in literary context, Discourse Processes pp. 1–21.
Bambini, V., Gentili, C., Ricciardi, E., Bertinetto, P. M. and Pietrini, P. (2011). Decomposing
metaphor processing at the cognitive and neural level through functional magnetic resonance imag-
ing, Brain Research Bulletin 86(3-4): 203–216.
Banerjee, S. and Pedersen, T. (2002). An adapted Lesk algorithm for word sense disambiguation using
WordNet.
Baroni, M., Bernardini, S., Ferraresi, A. and Zanchetta, E. (2009). The WaCky wide web: a collec-
tion of very large linguistically processed web-crawled corpora, Language Resources and Evaluation
43(3): 209–226.
URL: http://dx.doi.org/10.1007/s10579-009-9081-4
Baroni, M., Dinu, G. and Kruszewski, G. (2014). Don’t count, predict! a systematic comparison of
context-counting vs. context-predicting semantic vectors., Proceedings of the 52nd Annual Meeting
of the Association for Computational Linguistics, pp. 238–247.
Basile, P., Caputo, A. and Semeraro, G. (2015). Temporal random indexing, Italian Journal of Com-
putational Linguistics 1.
Berk, G., Erden, B. and Güngör, T. (2018). Deep-bgt at parseme shared task 2018: Bidirectional
lstm-crf model for verbal multiword expression identification, Proceedings of the Joint Workshop on
Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), pp. 248–
253.
Bernardy, J.-P., Lappin, S. and Lau, J. H. (2018). The influence of context on sentence acceptability
judgements, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics
(Volume 2: Short Papers), Vol. 2, pp. 456–461.
Birke, J. and Sarkar, A. (2006). A clustering approach for nearly unsupervised recognition of nonliteral
language, 11th Conference of the European Chapter of the Association for Computational Linguistics.
Blacoe, W. and Lapata, M. (2012). A comparison of vector-based representations for semantic compo-
sition, Proceedings of the 2012 joint conference on empirical methods in natural language processing
and computational natural language learning, Association for Computational Linguistics, pp. 546–
556.
Blasko, D. G. (1999). Only the tip of the iceberg: Who understands what about metaphor?, Journal
of Pragmatics 31(12): 1675–1683.
Blasko, D. G. and Connine, C. M. (1993). Effects of familiarity and aptness on metaphor processing.,
Journal of experimental psychology: Learning, memory, and cognition 19(2): 295.
Bohrn, I. C., Altmann, U. and Jacobs, A. M. (2012). Looking at the brains behind figurative language:
a quantitative meta-analysis of neuroimaging studies on metaphor, idiom, and irony processing,
Neuropsychologia 50(11): 2669–2683.
Bollegala, D. and Shutova, E. (2013). Metaphor interpretation using paraphrases extracted from the
web, PloS one 8(9): e74304.
Bookheimer, S. (2002). Functional mri of language: new approaches to understanding the cortical
organization of semantic processing, Annual review of neuroscience 25(1): 151–188.
Bowdle, B. F. and Gentner, D. (2005). The career of metaphor., Psychological review 112(1): 193.
Broadwell, G. A., Boz, U., Cases, I., Strzalkowski, T., Feldman, L., Taylor, S., Shaikh, S., Liu,
T., Cho, K. and Webb, N. (2013). Using imageability and topic chaining to locate metaphors in
linguistic corpora, International Conference on Social Computing, Behavioral-Cultural Modeling,
and Prediction, Springer, pp. 102–110.
Brysbaert, M., Warriner, A. B. and Kuperman, V. (2014). Concreteness ratings for 40 thousand
generally known english word lemmas, Behavior research methods 46(3): 904–911.
Bulat, L., Clark, S. and Shutova, E. (2017). Modelling metaphor with attribute-based semantics,
Proceedings of the 15th Conference of the European Chapter of the Association for Computational
Linguistics: Volume 2, Short Papers, Vol. 2, pp. 523–528.
Cacciari, C. (2014). Processing multiword idiomatic strings: Many words in one?, The Mental Lexicon
9(2): 267–293.
Cacciari, C. and Papagno, C. (2012). Neuropsychological and neurophysiological correlates of idiom un-
derstanding: How many hemispheres are involved, The handbook of the neuropsychology of language
pp. 368–385.
Camac, M. K. and Glucksberg, S. (1984). Metaphors do not use associations between concepts, they
are used to create them, Journal of Psycholinguistic Research 13(6): 443–455.
Cameron, L. (2008). Metaphor and talk, The Cambridge handbook of metaphor and thought pp. 197–
211.
Cardillo, E. R., Watson, C. E., Schmidt, G. L., Kranjec, A. and Chatterjee, A. (2012). From novel to
familiar: tuning the brain for metaphors, Neuroimage 59(4): 3212–3221.
Chen, P., Guo, W., Chen, Z., Sun, J. and You, L. (2018). Gated convolutional neural network for
sentence matching, memory 1: 3.
Chen, Q., Hu, Q., Huang, J. X. and He, L. (2018). Can: Enhancing sentence similarity modeling with
collaborative and adversarial network, The 41st International ACM SIGIR Conference on Research
& Development in Information Retrieval, ACM, pp. 815–824.
Chiappe, D. L., Kennedy, J. M. and Chiappe, P. (2003). Aptness is more important than comprehen-
sibility in preference for metaphors and similes, Poetics 31(1): 51–68.
Collobert, R. and Weston, J. (2008a). A unified architecture for natural language processing: Deep
neural networks with multitask learning, Proceedings of the 25th International Conference on Ma-
chine Learning, ICML ’08, ACM, New York, NY, USA, pp. 160–167.
URL: http://doi.acm.org/10.1145/1390156.1390177
Collobert, R. and Weston, J. (2008b). A unified architecture for natural language processing: Deep
neural networks with multitask learning, Proceedings of the 25th international conference on Machine
learning, ACM, pp. 160–167.
Colton, S., Goodwin, J. and Veale, T. (2012). Full-face poetry generation., ICCC, pp. 95–102.
Conklin, K. and Schmitt, N. (2008). Formulaic sequences: Are they processed more quickly than
nonformulaic language by native and nonnative speakers?, Applied linguistics 29(1): 72–89.
Consortium, B. N. C. et al. (2007). British national corpus version 3 (bnc xml edition), Distributed
by Oxford University Computing Services on behalf of the BNC Consortium. Retrieved February
13, 2012.
Cordeiro, S., Ramisch, C., Idiart, M. and Villavicencio, A. (2016). Predicting the compositionality of
nominal compounds: Giving word embeddings a hard time, Proceedings of the 54th Annual Meeting
of the Association for Computational Linguistics, Vol. 1, pp. 1986–1997.
Cornelissen, J. P. (2004). What are we playing at? theatre, organization, and the use of metaphor,
Organization Studies 25(5): 705–726.
Corts, D. P. and Pollio, H. R. (1999). Spontaneous production of figurative language and gesture in
college lectures, Metaphor and Symbol 14(2): 81–100.
Dai, D., Tan, W. and Zhan, H. (2017). Understanding the feedforward artificial neural network model
from the perspective of network flow, arXiv preprint arXiv:1704.08068 .
Darian, S. (2000). The role of figurative language in introductory science texts, International Journal
of applied linguistics 10(2): 163–186.
Davidov, D., Tsur, O. and Rappoport, A. (2010). Semi-supervised recognition of sarcastic sentences
in twitter and amazon, Proceedings of the fourteenth conference on computational natural language
learning, Association for Computational Linguistics, pp. 107–116.
Deignan, A. (2003). Metaphorical expressions and culture: An indirect link, Metaphor and symbol
18(4): 255–271.
Deignan, A. (2007). “image” metaphors and connotations in everyday language, Annual Review of
Cognitive Linguistics 5(1): 173–192.
Do Dinh, E.-L. and Gurevych, I. (2016). Token-level metaphor detection using neural networks,
Proceedings of the Fourth Workshop on Metaphor in NLP, pp. 28–33.
Dolan, B., Quirk, C. and Brockett, C. (2004). Unsupervised construction of large paraphrase corpora:
Exploiting massively parallel news sources, Proceedings of the 20th International Conference on
Computational Linguistics, COLING ’04, Association for Computational Linguistics, Stroudsburg,
PA, USA.
URL: https://doi.org/10.3115/1220355.1220406
Dunn, J. (2013a). Evaluating the premises and results of four metaphor identification systems, Proceed-
ings of CICLing'13 .
Dunn, J. (2013b). What metaphor identification systems can tell us about metaphor-in-language,
Proceedings of the First Workshop on Metaphor in NLP, pp. 1–10.
Dunn, J., De Heredia, J. B., Burke, M., Gandy, L., Kanareykin, S., Kapah, O., Taylor, M., Hines, D.,
Frieder, O., Grossman, D. et al. (2014). Language-independent ensemble approaches to metaphor
identification, 28th AAAI Conference on Artificial Intelligence, AAAI 2014, AI Access Foundation.
Dybala, P., Ptaszynski, M., Rzepka, R., Araki, K. and Sayama, K. (2012). Beyond conventional
recognition: Concept of a conversational system utilizing metaphor misunderstanding as a source of
humor, Proceedings of The 26th Annual Conference of The Japanese Society for Artificial Intelligence
(JSAI 2012), Alan Turing Year Special Session on AI Research That Can Change The World.
Erk, K. and Padó, S. (2010). Exemplar-based models for word meaning in context, Proceedings of the
acl 2010 conference short papers, Association for Computational Linguistics, pp. 92–97.
Fahnestock, J. (2009). Quid pro nobis. rhetorical stylistics for argument analysis, Examining argumen-
tation in context. Fifteen studies on strategic maneuvering pp. 131–152.
Fainsilber, L. and Kogan, N. (1984). Does imagery contribute to metaphoric quality?, Journal of
psycholinguistic research 13(5): 383–391.
Fazly, A., Cook, P. and Stevenson, S. (2009). Unsupervised type and token identification of idiomatic
expressions, Computational Linguistics 1(35): 61–103.
Fazly, A. and Stevenson, S. (2008). A distributional account of the semantics of multiword expressions,
Italian Journal of Linguistics 1(20): 157–179.
Ferrone, L. and Zanzotto, F. (2015). Distributed smoothed tree kernel, Italian Journal of Computa-
tional Linguistics 1.
Filice, S., Da San Martino, G. and Moschitti, A. (2015). Structural representations for learning relations
between pairs of texts, Vol. 1, Association for Computational Linguistics (ACL), pp. 1003–1013.
Forgács, B., Bardolph, M. D., Amsel, B. D., DeLong, K. A. and Kutas, M. (2015). Metaphors are physical and
abstract: Erps to metaphorically modified nouns resemble erps to abstract language, Front. Hum.
Neurosci. 9(28).
Fraser, B. (1970). Idioms within a transformational grammar, Foundations of language pp. 22–42.
Frege, G. (1892). Über sinn und bedeutung, Zeitschrift für Philosophie und philosophische Kritik
100: 25–50.
Gao, G., Choi, E., Choi, Y. and Zettlemoyer, L. (2018). Neural metaphor detection in context, arXiv
preprint arXiv:1808.09653 .
Geeraert, K., Baayen, R. H. and Newman, J. (2017). Understanding idiomatic variation, MWE 2017
p. 80.
Gentner, D. and Bowdle, B. F. (2001). Convention, form, and figurative language processing, Metaphor
and Symbol 16(3-4): 223–247.
Ghosh, A., Li, G., Veale, T., Rosso, P., Shutova, E., Barnden, J. and Reyes, A. (2015). Semeval-2015
task 11: Sentiment analysis of figurative language in twitter, Proceedings of the 9th International
Workshop on Semantic Evaluation (SemEval 2015), pp. 470–478.
Gibbs, R. W. (1993). Why idioms are not dead metaphors, Idioms: Processing, structure, and inter-
pretation pp. 57–77.
Gibbs, R. W. (1994). The poetics of mind: Figurative thought, language, and understanding, Cambridge
University Press.
Gibbs, R. W., Bogdanovich, J. M., Sykes, J. R. and Barr, D. J. (1997). Metaphor in idiom compre-
hension, Journal of memory and language 37(2): 141–154.
Gibbs, R. W., Leggitt, J. S. and Turner, E. A. (2002). What’s special about figurative language
in emotional communication, The verbal communication of emotions: Interdisciplinary perspectives
pp. 125–149.
Gildea, P. and Glucksberg, S. (1983). On understanding metaphor: The role of context, Journal of
Memory and Language 22(5): 577.
Giora, R. (1997). Understanding figurative and literal language: The graded salience hypothesis,
Cognitive Linguistics (includes Cognitive Linguistic Bibliography) 8(3): 183–206.
Giora, R. (1999). On the priority of salient meanings: Studies of literal and figurative language, Journal
of pragmatics 31(7): 919–929.
Giora, R. (2002). Literal vs. figurative language: Different or equal?, Journal of pragmatics 34(4): 487–
506.
Giora, R. (2003). On our mind: Salience, context, and figurative language, Oxford University Press.
Giora, R. and Fein, O. (1999). Irony: Context and salience, Metaphor and Symbol 14(4): 241–257.
Glucksberg, S. (2008). How metaphors create categories–quickly, The Cambridge handbook of metaphor
and thought pp. 67–83.
Glucksberg, S., Gildea, P. and Bookin, H. B. (1982). On understanding nonliteral speech: Can people
ignore metaphors?, Journal of verbal learning and verbal behavior 21(1): 85–98.
Glucksberg, S., McGlone, M. S. and Manfredi, D. (1997). Property attribution in metaphor compre-
hension, Journal of memory and language 36(1): 50–67.
Gong, H., Bhat, S. and Viswanath, P. (2017). Geometry of compositionality., AAAI, pp. 3202–3208.
Gordon, J., Hobbs, J., May, J., Mohler, M., Morbini, F., Rink, B., Tomlinson, M. and Wertheim, S.
(2015). A corpus of rich metaphor annotation, Proc. Workshop on Metaphor in NLP.
URL: http://www.isi.edu/ jgordon/papers/gordon-et-al.a-corpus-of-rich-metaphor-annotation.pdf
Gutierrez, E. D., Cecchi, G., Corcoran, C. and Corlett, P. (2017). Using automated metaphor iden-
tification to aid in detection and prediction of first-episode schizophrenia, Proceedings of the 2017
Conference on Empirical Methods in Natural Language Processing, pp. 2923–2930.
Gutiérrez, E. D., Shutova, E., Marghetis, T. and Bergen, B. K. (2016). Literal and metaphorical senses
in compositional distributional semantic models, Proceedings of the 54th Meeting of the Association
for Computational Linguistics, pp. 160–170.
Haagsma, H. and Bjerva, J. (2016). Detecting novel metaphor using selectional preference information,
Proceedings of the Fourth Workshop on Metaphor in NLP, pp. 10–17.
He, H., Gimpel, K. and Lin, J. (2015). Multi-perspective sentence similarity modeling with con-
volutional neural networks, Proceedings of the 2015 Conference on Empirical Methods in Natural
Language Processing, Association for Computational Linguistics, Lisbon, Portugal, pp. 1576–1586.
URL: https://aclweb.org/anthology/D/D15/D15-1181
He, X. and Liu, Y. (2017). Not enough data?: Joint inferring multiple diffusion networks via network
generation priors, Proceedings of the Tenth ACM International Conference on Web Search and Data
Mining, ACM, pp. 465–474.
Hovy, D., Shrivastava, S., Jauhar, S. K., Sachan, M., Goyal, K., Li, H., Sanders, W. and Hovy, E.
(2013). Identifying metaphorical word use with tree kernels, Proceedings of the First Workshop on
Metaphor in NLP.
Inhoff, A. W., Lima, S. D. and Carroll, P. J. (1984). Contextual effects on metaphor comprehension
in reading, Memory & Cognition 12(6): 558–567.
Jang, H., Piergallini, M., Wen, M. and Rose, C. (2014). Conversational metaphors in use: Exploring the
contrast between technical and everyday notions of metaphor, Proceedings of the Second Workshop
on Metaphor in NLP, pp. 1–10.
Jiang, N. A. and Nekrasova, T. M. (2007). The processing of formulaic sequences by second language
speakers, The Modern Language Journal 91(3): 433–445.
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C. L. and Girshick, R. (2017).
Clevr: A diagnostic dataset for compositional language and elementary visual reasoning, Computer
Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE, pp. 1988–1997.
Johnson, M. G. and Malgady, R. G. (1979). Some cognitive aspects of figurative language: Association
and metaphor, Journal of Psycholinguistic Research 8(3): 249–265.
Jurgens, D. and Stevens, K. (2009). Event detection in blogs using temporal random indexing, Pro-
ceedings of the Workshop on Events in Emerging Text Types.
Karpathy, A., Johnson, J. and Fei-Fei, L. (2015). Visualizing and understanding recurrent networks,
arXiv preprint arXiv:1506.02078 .
Kesarwani, V., Inkpen, D., Szpakowicz, S. and Tanasescu, C. (2017). Metaphor detection in a poetry
corpus, Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural
Heritage, Social Sciences, Humanities and Literature, pp. 1–9.
Keysar, B. (1989). On the functional equivalence of literal and metaphorical interpretations in dis-
course, Journal of memory and language 28(4): 375–385.
Kim, Y. (2014). Convolutional neural networks for sentence classification, CoRR abs/1408.5882.
URL: http://arxiv.org/abs/1408.5882
Kintsch, W. (2000). Metaphor comprehension: A computational theory, Psychonomic bulletin & review
7(2): 257–266.
Klebanov, B. B., Beigman, E. and Diermeier, D. (2009). Discourse topics and metaphors, Proceedings of
the Workshop on Computational Approaches to Linguistic Creativity, Association for Computational
Linguistics, pp. 1–8.
Klebanov, B. B., Diermeier, D. and Beigman, E. (2008). Lexical cohesion analysis of political speech,
Political Analysis 16(4): 447–463.
Klebanov, B. B., Leong, B., Heilman, M. and Flor, M. (2014). Different texts, same metaphors:
Unigrams and beyond, Proceedings of the Second Workshop on Metaphor in NLP, pp. 11–17.
Klebanov, B. B., Leong, C. W. and Flor, M. (2015). Supervised word-level metaphor detection:
Experiments with concreteness and reweighting of examples, Proceedings of the Third Workshop on
Metaphor in NLP, pp. 11–20.
Klebanov, B. B., Leong, C. W., Gutierrez, E. D., Shutova, E. and Flor, M. (2016). Semantic classifi-
cations for detection of verb metaphors, Proceedings of the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short Papers), Vol. 2, pp. 101–106.
Kokkinakis, D. (2013). Figurative language in swedish clinical texts, Proceedings of the IWCS 2013
Workshop on Computational Semantics in Clinical Text (CSCT 2013), pp. 17–22.
Köper, M. and im Walde, S. S. (2016b). Distinguishing literal and non-literal usage of german particle
verbs., HLT-NAACL, pp. 353–362.
Köper, M. and im Walde, S. S. (2017). Improving verb metaphor detection by propagating abstractness
to words, phrases and individual senses, Proceedings of the 1st Workshop on Sense, Concept and
Entity Representations and their Applications, pp. 24–30.
Kövecses, Z. (2003). Language, figurative thought, and cross-cultural comparison, Metaphor and
symbol 18(4): 311–320.
Kozareva, Z. (2015). Multilingual affect polarity and valence prediction in metaphors, Proceedings of the
6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis,
WASSA@EMNLP 2015, 17 September 2015, Lisbon, Portugal, p. 1.
URL: http://aclweb.org/anthology/W/W15/W15-2901.pdf
Krennmayr, T. (2015). What corpus linguistics can tell us about metaphor use in newspaper texts,
Journalism Studies 16(4): 530–546.
Krennmayr, T. and Steen, G. (2017). Vu amsterdam metaphor corpus, Handbook of Linguistic Anno-
tation, Springer, pp. 1053–1071.
Krishnakumaran, S. and Zhu, X. (2007). Hunting elusive metaphors using lexical resources, Pro-
ceedings of the Workshop on Computational Approaches to Figurative Language, FigLanguages ’07,
Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 13–20.
URL: http://dl.acm.org/citation.cfm?id=1611528.1611531
Krčmář, L., Ježek, K. and Pecina, P. (2013). Determining Compositionality of Expressions Using
Various Word Space Models and Measures, Proceedings of the Workshop on Continuous Vector
Space Models and their Compositionality, pp. 64–73.
Kuhl, P. K. (2004). Early language acquisition: cracking the speech code, Nature reviews neuroscience
5(11): 831–843.
Lacey, S., Stilla, R., Deshpande, G., Zhao, S., Stephens, C., McCormick, K., Kemmerer, D. and
Sathian, K. (2017). Engagement of the left extrastriate body area during body-part metaphor
comprehension, Brain and language 166: 1–18.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B. and Gershman, S. J. (2017). Building machines that
learn and think like people, Behavioral and Brain Sciences 40.
Lakoff, G. (1989). Some empirical results about the nature of concepts, Mind & Language 4(1-2): 103–
129.
Lakoff, G., Espenson, J. and Schwartz, A. (1991). Master metaphor list, University of California at
Berkeley, Cognitive Linguistics Group .
Lakoff, G. and Johnson, M. (2008a). Metaphors we live by, University of Chicago press.
Lakoff, G. and Johnson, M. (2008b). Metaphors we live by, University of Chicago press.
Lappin, S. and Zadrozny, W. (2000). Compositionality, synonymy, and the systematic representation
of meaning, CoRR cs.CL/0001006.
URL: http://arxiv.org/abs/cs.CL/0001006
Laranjeira, C. (2013). The role of narrative and metaphor in the cancer life story: a theoretical
analysis, Medicine, Health Care and Philosophy 16(3): 469–481.
Lau, J. H., Clark, A. and Lappin, S. (2017). Grammaticality, acceptability, and probability: a proba-
bilistic view of linguistic knowledge, Cognitive Science 41(5): 1202–1241.
Laurent, J.-P., Denhières, G., Passerieux, C., Iakimova, G. and Hardy-Baylé, M.-C. (2006). On under-
standing idiomatic language: The salience hypothesis assessed by erps, Brain Research 1068(1): 151–
160.
Leech, G. N. and Short, M. (2007). Style in fiction: A linguistic introduction to English fictional prose,
number 13, Pearson Education.
Lenci, A. (2008). Distributional semantics in linguistic and cognitive research, Italian Journal of
Linguistics 20(1): 1–31.
Lenci, A. (2018). Distributional Models of Word Meaning, Annual Review of Linguistics 4: 151–171.
Lenci, A., Santus, E., Lu, Q. and Huang, C.-R. (2015). When similarity becomes opposition: Synonyms and
antonyms discrimination in dsms, Italian Journal of Computational Linguistics 1.
Leong, C. W. B., Klebanov, B. B. and Shutova, E. (2018). A report on the 2018 vua metaphor detection
shared task, Proceedings of the Workshop on Figurative Language Processing, pp. 56–66.
Levy, O., Goldberg, Y. and Dagan, I. (2015). Improving distributional similarity with lessons learned
from word embeddings, Transactions of the Association for Computational Linguistics 3: 211–225.
Li, H., Zhu, K. Q. and Wang, H. (2013). Data-driven metaphor recognition and explanation, Trans-
actions of the Association for Computational Linguistics 1: 379–390.
Li, L., Roth, B. and Sporleder, C. (2010). Topic models for word sense disambiguation and token-
based idiom detection, Proceedings of the 48th Annual Meeting of the Association for Computational
Linguistics, Association for Computational Linguistics, pp. 1138–1147.
Li, L. and Sporleder, C. (2010a). Linguistic cues for distinguishing literal and non-literal usages,
Proceedings of the 23rd International Conference on Computational Linguistics: Posters, Association
for Computational Linguistics, pp. 683–691.
Li, L. and Sporleder, C. (2010b). Using gaussian mixture models to detect figurative language in con-
text, Human Language Technologies: The 2010 Annual Conference of the North American Chapter
of the Association for Computational Linguistics, HLT ’10, Association for Computational Linguis-
tics, Stroudsburg, PA, USA, pp. 297–300.
URL: http://dl.acm.org/citation.cfm?id=1857999.1858038
Li, L. and Sporleder, C. (2010c). Using gaussian mixture models to detect figurative language in con-
text, Human Language Technologies: The 2010 Annual Conference of the North American Chap-
ter of the Association for Computational Linguistics, Association for Computational Linguistics,
pp. 297–300.
Lin, D. (1999). Automatic identification of non-compositional phrases, Proceedings of the 37th Annual
Meeting of the Association for Computational Linguistics, pp. 317–324.
Liu, D. (2003). The most frequently used spoken american english idioms: A corpus analysis and its
implications, Tesol Quarterly 37(4): 671–700.
Loukina, A., Zechner, K., Bruno, J. and Beigman Klebanov, B. (2018). Using exemplar responses for
training and evaluating automated speech scoring systems, Proceedings of the Thirteenth Workshop
on Innovative Use of NLP for Building Educational Applications, pp. 1–12.
Madnani, N., Tetreault, J. and Chodorow, M. (2012). Re-examining machine translation metrics for
paraphrase identification, Proceedings of the 2012 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, NAACL HLT ’12,
Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 182–190.
URL: http://dl.acm.org/citation.cfm?id=2382029.2382055
Marschark, M., Katz, A. N. and Paivio, A. (1983). Dimensions of metaphor, Journal of Psycholinguistic
Research 12(1): 17–40.
McCabe, A. (1983). Conceptual similarity and the quality of metaphor in isolated sentences versus
extended contexts, Journal of Psycholinguistic Research 12(1): 41–68.
McGlone, M. S. (1996). Conceptual metaphors and figurative language interpretation: Food for
thought?, Journal of memory and language 35(4): 544–565.
Migliore, T. (2007). Gruppo µ, Trattato del segno visivo. Per una retorica dell'immagine.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013a). Distributed representations
of words and phrases and their compositionality, Advances in neural information processing systems,
pp. 3111–3119.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013b). Distributed representations
of words and phrases and their compositionality, Proceedings of the 26th International Conference
on Neural Information Processing System, pp. 3111–3119.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013c). Distributed representa-
tions of words and phrases and their compositionality, in C. J. C. Burges, L. Bottou, M. Welling,
Z. Ghahramani and K. Q. Weinberger (eds), Advances in Neural Information Processing Systems
26, Curran Associates, Inc., pp. 3111–3119.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013d). Distributed representations
of words and phrases and their compositionality, Advances in neural information processing systems,
pp. 3111–3119.
Mikolov, T., Yih, W.-t. and Zweig, G. (2013). Linguistic regularities in continuous space word rep-
resentations., Human Language Technologies: Conference of the North American Chapter of the
Association of Computational Linguistics, Vol. 13, pp. 746–751.
Miller, G. A. (1995). Wordnet: a lexical database for english, Communications of the ACM 38(11): 39–
41.
Mohamed, M. and Oussalah, M. (n.d.). A hybrid approach for paraphrase identification based on
knowledge-enriched semantic heuristics.
Mohammad, S., Shutova, E. and Turney, P. D. (2016). Metaphor as a medium for emotion: An
empirical study, Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics,
*SEM@ACL 2016, Berlin, Germany, 11-12 August 2016.
URL: http://aclweb.org/anthology/S/S16/S16-2003.pdf
Mohler, M., Bracewell, D., Tomlinson, M. and Hinote, D. (2013). Semantic signatures for example-
based linguistic metaphor detection, Proceedings of the First Workshop on Metaphor in NLP, pp. 27–
35.
Mohler, M., Rink, B., Bracewell, D. B. and Tomlinson, M. T. (2014). A novel distributional approach
to multilingual conceptual metaphor recognition., COLING, pp. 1752–1763.
Morgan, G. (1980). Paradigms, metaphors, and puzzle solving in organization theory, Administrative
science quarterly pp. 605–622.
Moritz, M., Hellrich, J. and Buechel, S. (2018). A method for human-interpretable paraphrasticality
prediction, Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for
Cultural Heritage, Social Sciences, Humanities and Literature, pp. 113–118.
Mukherjee, S. and Bala, P. K. (2017). Detecting sarcasm in customer tweets: an nlp based approach,
Industrial Management & Data Systems 117(6): 1109–1126.
Nastase, V. and Strube, M. (2009). Combining collocations, lexical and encyclopedic knowledge for
metonymy resolution, Proceedings of the 2009 Conference on Empirical Methods in Natural Language
Processing: Volume 2-Volume 2, Association for Computational Linguistics, pp. 910–918.
Neuman, Y., Assaf, D., Cohen, Y., Last, M., Argamon, S., Howard, N. and Frieder, O. (2013).
Metaphor identification in large texts corpora, PloS one 8(4): e62343.
Niculae, V. and Yaneva, V. (2013). Computational considerations of comparisons and similes, 51st
Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research
Workshop, pp. 89–95.
Nunberg, G., Sag, I. and Wasow, T. (1994). Idioms, Language 70(3): 491–538.
Peng, J., Aharodnik, K. and Feldman, A. (2018). A distributional semantics model for idiom detection-
the case of english and russian., ICAART (2), pp. 675–682.
Pennington, J., Socher, R. and Manning, C. D. (2014). Glove: Global vectors for word representation,
Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
URL: http://www.aclweb.org/anthology/D14-1162
Pernes, S. (2016). Metaphor mining in historical german novels: Using unsupervised learning to
uncover conceptual systems in literature., DH, pp. 651–653.
Pianta, E., Bentivogli, L. and Girardi, C. (2002). MultiWordNet: Developing an Aligned Multilingual
Database, Proceedings of the First International Conference on Global WordNet, pp. 293–302.
Pollio, H. R., Smith, M. K. and Pollio, M. R. (1990). Figurative language and cognitive psychology,
Language and Cognitive Processes 5(2): 141–167.
Pramanick, M., Gupta, A. and Mitra, P. (2018). An lstm-crf based approach to token-level metaphor
detection, Proceedings of the Workshop on Figurative Language Processing, pp. 67–75.
Rai, S. and Chakraverty, S. (2017). Metaphor detection using fuzzy rough sets, International Joint
Conference on Rough Sets, Springer, pp. 271–279.
Rai, S., Chakraverty, S. and Tayal, D. K. (2016). Supervised metaphor detection using conditional
random fields, Proceedings of the Fourth Workshop on Metaphor in NLP, pp. 18–27.
Rei, M., Bulat, L., Kiela, D. and Shutova, E. (2017). Grasping the finer point: A supervised similarity
network for metaphor detection, arXiv preprint arXiv:1709.00575 .
Rentoumi, V., Giannakopoulos, G., Karkaletsis, V. and Vouros, G. A. (2009). Sentiment analysis of figurative language
using word sense disambiguation approach, International Conference RANLP 2009.
Reyes, A., Rosso, P. and Buscaldi, D. (2012). From humor recognition to irony detection: The figurative
language of social media, Data & Knowledge Engineering 74: 1–12.
Rimell, L., Maillard, J., Polajnar, T. and Clark, S. (2016). Relpron: A relative clause evaluation data
set for compositional distributional semantics, Computational Linguistics 42(4): 661–701.
Rodriguez, M. C. (2003). How to talk shop through metaphor: bringing metaphor research to the esp
classroom, English for Specific Purposes 22(2): 177–194.
Romero, E. and Soria, B. (2014). Relevance theory and metaphor, Linguagem em (Dis) curso
14(3): 489–509.
Rosso, P., Rangel, F., Farías, I. H., Cagnina, L., Zaghouani, W. and Charfi, A. (2018). A survey on
author profiling, deception, and irony detection for the arabic language, Language and Linguistics
Compass 12(4): e12275.
Rzepka, R., Dybala, P., Sayama, K. and Araki, K. (2013). Semantic clues for novel metaphor generator,
Proceedings of 2nd international workshop of computational creativity, concept invention, and general
intelligence, C3GI.
Sag, I. A., Baldwin, T., Bond, F., Copestake, A. and Flickinger, D. (2002). Multiword Expressions:
A Pain in the Neck for NLP, Proceedings of the 3rd International Conference on Intelligent Text
Processing and Computational Linguistics, pp. 1–15.
Sainath, T. N., Vinyals, O., Senior, A. and Sak, H. (2015). Convolutional, long short-term memory,
fully connected deep neural networks, Acoustics, Speech and Signal Processing (ICASSP), 2015
IEEE International Conference on, IEEE, pp. 4580–4584.
Salager-Meyer, F. (1990). Metaphors in medical english prose: A comparative study with french and
spanish, English for Specific Purposes 9(2): 145–159.
Glucksberg, S. and Haught, C. (2006). On the relation between metaphor and simile: When comparison
fails, Mind & Language 21(3): 360–378.
Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P. and Lillicrap, T.
(2017). A simple neural network module for relational reasoning, Advances in neural information
processing systems, pp. 4967–4976.
Sboev, A., Litvinova, T., Gudovskikh, D., Rybka, R. and Moloshnikov, I. (2016). Machine learning
models of text categorization by author gender using topic-independent features, Procedia Computer
Science 101: 135 – 142.
URL: http://www.sciencedirect.com/science/article/pii/S1877050916326849
Schlechtweg, D., Eckmann, S., Santus, E., Walde, S. S. i. and Hole, D. (2017). German in flux:
Detecting metaphoric change via word entropy, arXiv preprint arXiv:1706.04971 .
Schmidt, G. L. and Seger, C. A. (2009). Neural correlates of metaphor processing: the roles of
figurativeness, familiarity and difficulty, Brain and cognition 71(3): 375–386.
Schulder, M. and Hovy, E. (2014). Metaphor detection through term relevance, Proceedings of the
Second Workshop on Metaphor in NLP, pp. 18–26.
Schulder, M. and Hovy, E. (2015). Metaphor detection through term relevance, Proceedings of the
First Workshop on Metaphor in NLP.
Sculley, D. and Pasanek, B. M. (2008). Meaning and mining: the impact of implicit assumptions in
data mining for the humanities, Literary and Linguistic Computing 23(4): 409–424.
Sell, M. A., Kreuz, R. J. and Coppenrath, L. (1997). Parents’ use of nonliteral language with preschool
children, Discourse Processes 23(2): 99–118.
Semino, E. and Culpeper, J. (2002). Cognitive stylistics: Language and cognition in text analysis,
Vol. 1, John Benjamins Publishing.
Senaldi, M. S. G., Lebani, G. E. and Lenci, A. (2016). Lexical variability and compositionality:
Investigating idiomaticity with distributional semantic models, Proceedings of the 12th Workshop on
Multiword Expressions, pp. 21–31.
Senaldi, M. S. G., Lebani, G. E. and Lenci, A. (2017). Determining the compositionality of noun-
adjective pairs with lexical variants and distributional semantics, Italian Journal of Computational
Linguistics 3(1): 43–58.
Shutova, E. (2010c). Models of metaphor in nlp, Proceedings of the 48th annual meeting of the asso-
ciation for computational linguistics, Association for Computational Linguistics, pp. 688–697.
Shutova, E., Kiela, D. and Maillard, J. (2016). Black holes and white rabbits: Metaphor identification
with visual features, Proceedings of the 2016 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, pp. 160–170.
Shutova, E., Sun, L. and Korhonen, A. (2010a). Metaphor identification using verb and noun cluster-
ing, Proceedings of the 23rd International Conference on Computational Linguistics, Association for
Computational Linguistics, pp. 1002–1010.
Shutova, E., Sun, L. and Korhonen, A. (2010b). Metaphor identification using verb and noun cluster-
ing, Proceedings of the 23rd International Conference on Computational Linguistics, Association for
Computational Linguistics, pp. 1002–1010.
Shutova, E., Teufel, S. and Korhonen, A. (2013). Statistical metaphor processing, Computational
Linguistics 39(2): 301–353.
Sikos, L., Brown, S. W., Kim, A. E., Michaelis, L. A. and Palmer, M. (2008). Figurative language:
“meaning” is often more than just a sum of the parts., AAAI Fall Symposium: Biologically Inspired
Cognitive Architectures, pp. 180–185.
Socher, R., Huang, E. H., Pennington, J., Ng, A. Y. and Manning, C. D. (2011). Dynamic Pooling
and Unfolding Recursive Autoencoders for Paraphrase Detection, Advances in Neural Information
Processing Systems 24.
Soleymanzadeh, K., Karaoğlan, B., Metin, S. K. and Kişla, T. (2018). Combining machine translation
and text similarity metrics to identify paraphrases in turkish, 2018 26th Signal Processing and
Communications Applications Conference (SIU), IEEE, pp. 1–4.
Sporleder, C. and Li, L. (2009). Unsupervised recognition of literal and non-literal use of idiomatic
expressions, Proceedings of the 12th Conference of the European Chapter of the Association for
Computational Linguistics, Association for Computational Linguistics, pp. 754–762.
Srivastava, S. and Hovy, E. (2014). Vector space semantics with frequency-driven motifs, Proceedings
of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), Vol. 1, pp. 634–643.
Steen, G. (2010). A method for linguistic metaphor identification: From MIP to MIPVU, Vol. 14, John
Benjamins Publishing.
Steen, G. (2014). Metaphor and style, The Cambridge handbook of Stylistics pp. 315–328.
Steen, G., Dorst, A., Herrmann, B., Kaal, A., Krennmayr, T. and Pasma, T. (2010). A Method for
Linguistic Metaphor Identification: From MIP to MIPVU.
Steen, G. J., Dorst, A. G., Herrmann, J. B., Kaal, A., Krennmayr, T. and Pasma, T. (2014). A
method for linguistic metaphor identification: From mip to mipvu., Metaphor and the Social World
4(1): 138–146.
Su, C., Huang, S. and Chen, Y. (2017). Automatic detection and interpretation of nominal metaphor
based on the theory of meaning, Neurocomputing 219: 300–311.
Sun, S. and Xie, Z. (2017). Bilstm-based models for metaphor detection, National CCF Conference
on Natural Language Processing and Chinese Computing, Springer, pp. 431–442.
Sweetser, E. (1991). From etymology to pragmatics: Metaphorical and cultural aspects of semantic
structure, Vol. 54, Cambridge University Press.
Tai, K. S., Socher, R. and Manning, C. D. (2015). Improved semantic representations from tree-
structured long short-term memory networks, CoRR abs/1503.00075.
URL: http://arxiv.org/abs/1503.00075
Tang, S. and de Sa, V. R. (2018). Exploiting invertible decoders for unsupervised sentence represen-
tation learning, arXiv preprint arXiv:1809.02731 .
Tanguy, L., Sajous, F., Calderone, B. and Hathout, N. (2012). Authorship attribution: Using rich
linguistic features when training data is scarce., PAN Lab at CLEF.
Titone, D. and Libben, M. (2014). Time-dependent effects of decomposability, familiarity and literal
plausibility on idiom priming: A cross-modal priming investigation, The Mental Lexicon 9(3): 473–
496.
Torre, E. (2014). The emergent patterns of Italian idioms: A dynamic-systems approach, PhD thesis,
Lancaster University.
Tourangeau, R. and Rips, L. (1991). Interpreting and evaluating metaphors, Journal of Memory and
Language 30(4): 452–472.
Tourangeau, R. and Sternberg, R. J. (1981). Aptness in metaphor, Cognitive psychology 13(1): 27–55.
Tsvetkov, Y., Boytsov, L., Gershman, A., Nyberg, E. and Dyer, C. (2014a). Metaphor detection with
cross-lingual model transfer.
Tsvetkov, Y., Boytsov, L., Gershman, A., Nyberg, E. and Dyer, C. (2014b). Metaphor detection
with cross-lingual model transfer, Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 248–258.
Turney, P. D. (2013). Distributional semantics beyond words: Supervised learning of analogy and
paraphrase, CoRR abs/1310.5042.
URL: http://arxiv.org/abs/1310.5042
Turney, P. D., Neuman, Y., Assaf, D. and Cohen, Y. (2011). Literal and metaphorical sense identifica-
tion through concrete and abstract context, Proceedings of the Conference on Empirical Methods in
Natural Language Processing, EMNLP ’11, Association for Computational Linguistics, Stroudsburg,
PA, USA, pp. 680–690.
URL: http://dl.acm.org/citation.cfm?id=2145432.2145511
Turney, P. D. and Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics,
Journal of Artificial Intelligence Research 37: 141–188.
Underwood, G., Schmitt, N. and Galpin, A. (2004). The eyes have it, Formulaic sequences: Acquisition,
processing, and use 9: 153.
Van Hee, C., Lefever, E. and Hoste, V. (2018a). Exploring the fine-grained analysis and automatic
detection of irony on twitter, Language Resources and Evaluation pp. 1–25.
Van Hee, C., Lefever, E. and Hoste, V. (2018b). Semeval-2018 task 3: Irony detection in english tweets,
Proceedings of The 12th International Workshop on Semantic Evaluation, pp. 39–50.
Veale, T. (2011). Creative language retrieval: A robust hybrid of information retrieval and linguistic
creativity, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:
Human Language Technologies-Volume 1, Association for Computational Linguistics, pp. 278–287.
Veale, T. (2012). Exploding the creativity myth: The computational foundations of linguistic creativity,
A&C Black.
Veale, T. (2014). A service-oriented architecture for metaphor processing, Proceedings of the Second
Workshop on Metaphor in NLP, pp. 52–60.
Veale, T. (2015). Game of tropes: Exploring the placebo effect in computational creativity., ICCC,
pp. 78–85.
Veale, T. (2017). Metaphor and metamorphosis, Metaphor in Communication, Science and Education
36: 43.
Veale, T. and Hao, Y. (2007a). Comprehending and generating apt metaphors: a web-driven, case-
based approach to figurative language, AAAI, Vol. 2007, pp. 1471–1476.
Veale, T. and Hao, Y. (2007b). Learning to understand figurative language: from similes to metaphors
to irony, Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 29.
Veale, T. and Li, G. (2012). Specifying viewpoint and information need with affective metaphors:
a system demonstration of the metaphor magnet web app/service, Proceedings of the ACL 2012
System Demonstrations, Association for Computational Linguistics, pp. 7–12.
Veale, T., Shutova, E. and Klebanov, B. B. (2016a). Metaphor: A Computational Perspective, Synthesis
Lectures on Human Language Technologies, Morgan & Claypool Publishers.
URL: https://doi.org/10.2200/S00694ED1V01Y201601HLT031
Veale, T., Shutova, E. and Klebanov, B. B. (2016b). Metaphor: A computational perspective, Synthesis
Lectures on Human Language Technologies 9(1): 1–160.
Vietri, S. (2014). Idiomatic constructions in Italian: a lexicon-grammar approach, Vol. 31, John
Benjamins Publishing Company.
Vogel, C. (2001). Dynamic semantics for metaphor, Metaphor and Symbol 16(1-2): 59–74.
Vosoughi, S., Vijayaraghavan, P. and Roy, D. (2016). Tweet2vec: Learning tweet embeddings us-
ing character-level cnn-lstm encoder-decoder, Proceedings of the 39th International ACM SIGIR
conference on Research and Development in Information Retrieval, ACM, pp. 1041–1044.
Wang, J., Yu, L.-C., Lai, K. R. and Zhang, X. (2016). Dimensional sentiment analysis using a regional
cnn-lstm model, Proceedings of the 54th Annual Meeting of the Association for Computational Lin-
guistics (Volume 2: Short Papers), Vol. 2, pp. 225–230.
Wang, X., Gao, L., Song, J. and Shen, H. (2017). Beyond frame-level cnn: saliency-aware 3-d cnn with
lstm for video action recognition, IEEE Signal Processing Letters 24(4): 510–514.
Wilson, D. (2011). Parallels and differences in the treatment of metaphor in relevance theory and
cognitive linguistics, Intercultural Pragmatics 8(2): 177–196.
Xu, W., Callison-Burch, C. and Dolan, B. (2015). Semeval-2015 task 1: Paraphrase and semantic
similarity in twitter (pit), Proceedings of the 9th International Workshop on Semantic Evaluation
(SemEval 2015), Association for Computational Linguistics, Denver, Colorado, pp. 1–11.
URL: http://www.aclweb.org/anthology/S15-2001
Yin, W. and Schütze, H. (2015). Convolutional neural network for paraphrase identification, NAACL
HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015,
pp. 901–911.
URL: http://aclweb.org/anthology/N/N15/N15-1091.pdf
Zhang, Y., Yuan, H., Wang, J. and Zhang, X. (2017). Ynu-hpcc at emoint-2017: Using a cnn-lstm model
for sentiment intensity prediction, Proceedings of the 8th Workshop on Computational Approaches
to Subjectivity, Sentiment and Social Media Analysis, pp. 200–204.
Part II
STUDIES
Study I
“Deep” Learning: Detecting Metaphoricity in Adjective-Noun Pairs
Table 1: The reported accuracy from previous works on AN metaphor detection. The first two studies
used different datasets. We are using larger pre-trained vectors than Gutiérrez et al. (2016); at the same
time, we don't need a parsed corpus to build our vectors and we don't use adjectival matrices. Given
these differences, this comparison should not be considered a "competition".
p = W_adj^T u + W_noun^T v + b    (9)

W = [ W_adj ; W_noun ]    (10)

where [ · ; · ] denotes vertical stacking and the composition function in equation (4) now has θ = (W, b).

This formulation is very similar to the composition model in (Socher et al., 2011) without the syntactic tree parametrization. As such, instead of the non-linearity function we have the linear identity:

p = f_θ(u, v) = W^T [ u ; v ] + b    (11)

In practice, this approach represents a simple merging through concatenation: given two words' vectors, we concatenate them before feeding them to a single-layered, fully connected neural network.

As a consequence, the network learns a weight matrix that represents the AN combination linearly. To visualize this concept, we could say that, since our pairs always have the same internal structure (adjective in first position and noun in second position), the first half of the weight matrix is trained on adjectives and the second half on nouns.

Using 300-dimensional pre-trained word vectors, the parameter space of this composition function is W ∈ IR^{300×600} and b ∈ IR^{300}.

3.3 Second architecture

The second architecture we describe has the advantage of training a smaller set of parameters with respect to the first. In this model, the weight matrix is shared between the noun and the adjective:

p = f_θ(u, v) = W^T u + W^T v + b    (12)

Notice that in the case of comparing the vector representations of two different AN phrases, b will be essentially redundant. An advantage of this formulation is that the shared matrix transforms each input vector separately, so that the composition can be rewritten as the sum of the two transformed vectors:

u' = W^T u    (13)
v' = W^T v    (14)
p = u' + v'    (15)

Compared to the first architecture, in this architecture we do not assume the need to distinguish the weight matrix for the adjectives from the weight matrix for the nouns.

It is rather interesting, then, that this architecture does not present significant differences in performance with respect to the first one. The number of parameters, however, is smaller: W ∈ IR^{300×300} and b ∈ IR^{300}.

3.4 Third architecture

The third architecture, similarly to the second, features a composition matrix of weights shared between the noun and the adjective, but we perform elementwise multiplication between the two vectors:

p = f_θ(u, v) = (u × v) W + b    (16)

The number of parameters in this case is similar to the previous architecture: W ∈ IR^{300×300} and b ∈ IR^{300}.

3.5 Other architectures

In all three previous architectures we saw that a weight matrix W can be learned as part of the composition function. Throughout our exploration, we found that W can be a random, constant uniform matrix (not trained in the network) and the model is still able to learn, unless we use a non-linear activation function over the AN compositions:

p = g(f_θ(u, v))    (17)

An intuition is that, taking W as an identity matrix in the second architecture, the network will take the sum of the pre-trained vectors as features and learn how to predict metaphoricity. A fixed uniform W basically keeps the information in the input vectors. For a short overview of all these alternative architectures see Table 2.
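As an informal illustration of the three composition functions above (equations 11, 12 and 16), the following is a minimal numpy sketch; it is not the authors' released code (linked in Section 4), and the random initialisation and example vectors are placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 300                                   # dimensionality of the pre-trained embeddings

    # Illustrative stand-ins for the trainable parameters, not learned weights.
    W_concat = rng.normal(scale=0.01, size=(d, 2 * d))   # first architecture (concatenation)
    W_shared = rng.normal(scale=0.01, size=(d, d))       # second and third architectures
    b = np.zeros(d)

    def compose_concat(u, v):
        # First architecture: one linear map over the concatenated [u; v], as in eq. (11).
        return W_concat @ np.concatenate([u, v]) + b

    def compose_shared_sum(u, v):
        # Second architecture: the same matrix applied to each word, then summed (eqs. 12-15).
        return W_shared.T @ u + W_shared.T @ v + b

    def compose_elementwise(u, v):
        # Third architecture: elementwise product of the two vectors, then a linear map (eq. 16).
        return W_shared.T @ (u * v) + b

    # Example with random stand-ins for the adjective and noun embeddings.
    u, v = rng.normal(size=d), rng.normal(size=d)
    for f in (compose_concat, compose_shared_sum, compose_elementwise):
        print(f.__name__, f(u, v).shape)      # each composition returns a 300-dimensional vector

In the full models, the composed vector p then feeds a final classification layer that produces the metaphorical/literal decision.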
4 Evaluation

Our classifier achieved 91.5% accuracy when trained on 500 labeled AN phrases out of the 8592 in the corpus and tested on the rest. Training on 8000 and testing on the rest gave us an accuracy of 98.5% [3].

We tested several combinations of the architectures we described in the paper. For each of the three architectures, we also tested the Rectified Linear Unit (ReLU) as the non-linearity mentioned in Section 3.5. Our tests also show that a random constant matrix W is enough to train the rest of the parameters (reported in Table 2). In general, the best performing combinations involve the use of concatenation (the first architecture), while multiplication led to the lowest results. In any case, all experiments returned accuracies above 75% [4].

To test the robustness of our approach, we have evaluated our model's performance under several constraints:

• Total separation of vocabulary in train and test sets (Table 3), in case of out-of-vocabulary words.
• Use of different pretrained word embeddings (Figure 3).
• Cross validation (Figure 1).
• Qualitative selection of the training data based on the semantic categories of adjectives (Figure 2).

Finally, we will provide some qualitative insights on how the model works.

Our model is based on the idea of transfer learning: using the learned representation for a new task, in this case metaphor detection. Our model should generalize very fast with a small set of samples as training data. In order to test this, we have to train and test on totally different samples so that the vocabulary does not overlap. The splitting of the 8592 labeled phrases based on vocabulary gives us uneven sizes of training and test phrases [5]. In Table 3, using the pretrained Word2Vec embeddings trained on Google News (Mikolov et al., 2013), we examined the accuracy, precision and recall of our trained classifier.

We have used three different word embeddings: Word2Vec embeddings trained on Google News (Mikolov et al., 2013), GloVe embeddings (Pennington et al., 2014) and Levy-Goldberg embeddings (Levy and Goldberg, 2014). These embeddings are not updated during the training process. Thus, the classification task is always performed by learning weights for the pre-existing vectors. The results of this experiment can be seen in Figure 3. All these embeddings returned similar accuracies both when trained on scarce data (100 phrases) and when trained on half of the dataset (4000 phrases).

Training on 100 phrases indicates the ability of our model to learn from scarce data. One way of checking the consistency of our model under data scarcity is to perform flipped cross-validation: a cross-validation where, instead of training our model on 90% of the data and testing it on the remaining 10%, we flip the sizes and train it on 10% of the data, testing it on the remaining 90%. Results for both classic cross-validation and flipped cross-validation can be seen in Figure 1. Training on 10% of the data proved to consistently achieve accuracies not much lower than 90%. In other terms, a model trained on 90% of the data does not do much better than a model trained on 10%.

Finally, we tried training our model on only one of the semantic categories we introduced at the beginning of the paper and testing it on the rest of the dataset. Results can be seen in Figure 2.

We can wonder "why" our system is working: with respect to more traditional machine learning approaches, there is no direct way to evaluate which features contribute most to the success of our system. One way to get an idea of what is happening in the model is to use the "metaphoricity vector" we discussed in Section 3. Such a vector represents what is learned by our model and can help make it less opaque to us.

If we compute the cosine similarity between all the nouns in our dataset and this learned vector, we can see that nouns tend to polarize on an abstract/concrete axis: abstract nouns tend to be more similar to the learned vector than concrete nouns.

[3] These results are based on the first architecture; the performance of the other architectures is not very different in this simple test. The sample code is available at https://gu-clasp.github.io/anvec-metaphor/
[4] The number of parameters in the case of concatenation (as in the first architecture) is 180,601; for the other compositions, including addition and multiplication, the number of parameters is almost half: 90,601.
[5] We chose the vocabulary splitting points at every 10% from 10% to 90%, then applied the splitting separately on nouns and adjectives.
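As a rough sketch of the flipped cross-validation protocol described in this section, the fragment below simply swaps the usual train/test proportions; scikit-learn's KFold and the generic train_and_score callable are assumptions made for the example, not the original experimental code.

    import numpy as np
    from sklearn.model_selection import KFold

    def flipped_cross_validation(X, y, train_and_score, n_splits=10):
        # Standard k-fold CV trains on k-1 folds; here we train on ONE fold
        # (10% of the data) and test on the remaining nine (90%).
        accuracies = []
        for test_idx, train_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
            # Note the swapped unpacking above: the large split becomes the test set.
            # `train_and_score` is a hypothetical callable: fit on the small split,
            # return accuracy on the large one.
            accuracies.append(train_and_score(X[train_idx], y[train_idx], X[test_idx], y[test_idx]))
        return np.mean(accuracies)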
Test    Train   Accuracy   Precision   Recall
6929    72      0.83       0.89        0.77
5561    299     0.89       0.86        0.93
4406    643     0.91       0.92        0.90

Table 3: This table shows consistent results in accuracy, precision and recall of the classifier trained with different split points of the vocabulary instead of the phrases. Splitting the vocabulary creates different sizes of training and test phrases.

[Figure 2: bar plot of accuracies (roughly 0.88–0.92) obtained when training on a single adjective category: substance, strength, clarity, light, temperature, texture, taste, depth.]
Figure 2: Accuracy training on different categories of adjectives. In this experiment, we train on just one category of the dataset and test on all the others. In general, training on just one category (e.g. temperature) and testing on all other categories still yields high accuracy. While the power of generalization of our model is still unclear, we can see that it can detect similar semantic mechanisms even without any vocabulary overlap. The category taste is a partial exception: this category seems to be a relative "outlier".

[Figure 1: line plot of per-fold accuracies (roughly 0.92–0.96) over 10 folds for CV and Flipped-CV.]
Figure 1: Accuracies for each fold over two complementary approaches: cross-validation (CV) and flipped cross-validation ("flipped-CV"). Flipped cross-validation uses only 10% of our dataset for training and the remaining 90% for testing. The graph shows that both methods yield good results: in other words, training on just 10% of the dataset yields results that are just a few points lower than normal cross-validation.

Figure 3: Accuracy on different kinds of embeddings, both training on 100 phrases and on 4000 phrases.

It is likely that our model is learning nouns' level of abstractness as a means to determine phrase metaphoricity. In Table 4 we show the 10 most similar and the 10 least similar nouns obtained with this approach. As can be seen, a concrete-abstract polarity is apparently learned in training. This factor was amply noted and even used in some feature-based metaphor classifiers, as we discussed in the beginning: the advantage of using continuous semantic spaces probably lies in the possibility of having a more nuanced and complex polarization of nouns along the concrete/abstract axis than hand-annotated resources allow.
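The noun ranking behind Table 4 below can be sketched as follows; this is a minimal illustration, assuming the learned "metaphoricity vector" lives in the 600-dimensional concatenated input space of the first architecture, with variable names and data structures chosen for the example.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def rank_nouns(metaphoricity_vec, noun_vectors, top_k=10):
        # Rank nouns by cosine similarity between the learned vector and the
        # concatenation [all-zeros adjective slot ; noun vector], as in Table 4.
        # Assumes metaphoricity_vec has dimensionality 2*d (concatenated space).
        d = len(next(iter(noun_vectors.values())))        # 300 in the paper
        scored = []
        for noun, vec in noun_vectors.items():
            padded = np.concatenate([np.zeros(d), vec])   # zero out the adjective slot
            scored.append((cosine(metaphoricity_vec, padded), noun))
        scored.sort(reverse=True)
        top = [n for _, n in scored[:top_k]]              # most similar (abstract) nouns
        bottom = [n for _, n in scored[-top_k:]]          # least similar (concrete) nouns
        return top, bottom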
Top ten:    reluctance, reprisal, resignation, response, rivalry, satisfaction, storytelling, supporter, surveillance, vigilance
Bottom ten: saucepan, flour, skillet, chimney, jar, tub, fuselage, pellet, pouch, cupboard

Table 4: 10 most similar and 10 least similar terms with respect to the "metaphoricity vector", concatenated using an all-zeros vector for the adjective. In practice, this is a way to explore which semantic dimensions are particularly useful to the classifier. A concrete/abstract polarity on the nouns was apparently derived.

5 Discussion and future work

In this paper we have presented an approach for detecting metaphoricity in AN pairs that outperforms the state of the art without using human-annotated data or external resources beyond pre-trained word embeddings. We exploited the information captured by Word2Vec vectors through a fully connected neural network able to filter out the "noise" of the original semantic space. We have presented a series of alternative variations of this approach and evaluated its performance under several conditions - different word embeddings, different training data and different training sizes - showing that our model can generalize efficiently and obtain solid results on scarce training data. We think that this is one of the central findings of this paper, since many semantic phenomena similar to metaphor (for example other figures of speech) are under-represented in current NLP resources, and their study through supervised classifiers would require systems able to work on small datasets.

The possibility of detecting metaphors and assigning a degree of "metaphoricity" to a snippet of text is essential to automatic stylistic programs designed to go beyond "shallow features" such as sentence length, functional word counts, etc. While such metrics have already allowed powerful studies, the lack of tools to quantify more complex stylistic phenomena is evident (Hughes et al., 2012; Gibbs Jr, 2017). Naturally, this work is intended as a first step: the "metaphoricity" degree our system is learning mirrors the kinds of combination present in this specific dataset, which represents a very specific type of metaphor.

It can be argued that we are not really learning the defining ambiguities of an adjective (e.g. the double meaning of "bright"), but that we are probably side-learning nouns' degree of abstraction. This would be in harmony with psycholinguistic findings, since detecting nouns' abstraction seems to be one of the main mechanisms we resort to when we have to judge the metaphoricity of an expression (Forgács et al., 2015), and it is used as a main feature in traditional machine learning approaches to this problem. In other terms, our system seems to detect when the same adjective is used with different categories of words (abstract or concrete) and to generalize over this distinction; a behavior that might not be too far from the way a human learns to distinguish different senses of a word.

An issue that we would like to further test in the future is metaphoricity detection on different datasets, to explore the generalization ability of our models. Research on different datasets could also help us gain a better insight into the model's learning.

An obvious option is to test verb-adverb pairs (VA, e.g. think deeply) using the same approach discussed in this paper. It would then be interesting to see whether having a common training set for both the AN and the VA pairs will allow the model to generalize to both cases, or whether two separate training sets, one for AN and one for VA, will be needed. Other cases to test include N-N compounds or proposition/sentence level pairs.

Another way such an approach can be extended is to investigate whether reasoning tasks typically associated with different classes of adjectives can be performed. One task might be to distinguish adjectives that are intersective, subsective or neither. In the first case, from "x is A N" one should infer that x is both an A and an N (something that is a black table is both black and a table); in the second case one should infer that x is an N only (for example someone who is a skillful surgeon is only a surgeon, but we do not know if s/he is skillful in general); and in the third case neither of the two should be inferred. However, this task is not as simple as providing a training set of AN pairs and recognizing which class novel AN pairs belong to. Going beyond logical approaches by having the ability to recognize different
uses of an adjective requires a richer notion of context which extends well beyond the AN pairs.

A further idea we want to pursue in the future is the development of more fine-grained datasets, where metaphoricity is not represented as a binary feature but as a gradient property. This means that a classifier should have the ability to predict a degree of metaphoricity and thus allow more fine-grained distinctions to be captured. This is a theoretically interesting issue and definitely something that has to be tested, since not much literature is available (if any at all) on gradient metaphoricity. It seems to us that similar approaches, quantifying a text's metaphoricity and framing it as a supervised learning task, could help to obtain a clearer view of the influence of metaphor on style.

References

Beata Beigman Klebanov, Chee Wee Leong, and Michael Flor. 2015. Supervised word-level metaphor detection: Experiments with concreteness and reweighting of examples. In Proceedings of the Third Workshop on Metaphor in NLP, pages 11–20.

Saisuresh Krishnakumaran and Xiaojin Zhu. 2007. Hunting elusive metaphors using lexical resources. In Proceedings of the Workshop on Computational Approaches to Figurative Language, FigLanguages '07, Association for Computational Linguistics, Stroudsburg, PA, USA, pages 13–20. http://dl.acm.org/citation.cfm?id=1611528.1611531.

George Lakoff. 1989. Some empirical results about the nature of concepts. Mind & Language 4(1-2):103–129.

George Lakoff. 1993. The contemporary theory of metaphor.
Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35(8):1798–1828.

Lynne Cameron. 2003. Metaphor in educational discourse. A&C Black.

Jeanne Fahnestock. 2009. Quid pro nobis. Rhetorical stylistics for argument analysis. Examining argumentation in context. Fifteen studies on strategic maneuvering, pages 131–152.

Balint Forgács, Megan D. Bardolph, Amsel B.D., DeLong K.A., and M. Kutas. 2015. Metaphors are physical and abstract: ERPs to metaphorically modified nouns resemble ERPs to abstract language. Front. Hum. Neurosci. 9(28).

Raymond W Gibbs Jr. 2017. Metaphor Wars. Cambridge University Press.

Nelson Goodman. 1975. The status of style. Critical Inquiry 1(4):799–811.

E. Darío Gutiérrez, Ekaterina Shutova, Tyler Marghetis, and Benjamin K Bergen. 2016. Literal and metaphorical senses in compositional distributional semantic models. In Proceedings of the 54th Meeting of the Association for Computational Linguistics, pages 160–170.

James M Hughes, Nicholas J Foti, David C Krakauer, and Daniel N Rockmore. 2012. Quantitative patterns of stylistic influence in the evolution of literature. Proceedings of the National Academy of Sciences 109(20):7682–7686.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Geoffrey N Leech and Mick Short. 2007. Style in fiction: A linguistic introduction to English fictional prose. 13. Pearson Education.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pages 2177–2185.

Linlin Li and Caroline Sporleder. 2010a. Linguistic cues for distinguishing literal and non-literal usages. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, Association for Computational Linguistics, pages 683–691.

Linlin Li and Caroline Sporleder. 2010b. Using gaussian mixture models to detect figurative language in context. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, Association for Computational Linguistics, Stroudsburg, PA, USA, pages 297–300. http://dl.acm.org/citation.cfm?id=1857999.1858038.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive science 34(8):1388–1429.

Yair Neuman, Dan Assaf, Yohai Cohen, Mark Last, Shlomo Argamon, Newton Howard, and Ophir Frieder. 2013. Metaphor identification in large texts corpora. PloS one 8(4):e62343.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. http://www.aclweb.org/anthology/D14-1162.
Esther Romero and Belén Soria. 2014. Relevance
theory and metaphor. Linguagem em (Dis) curso
14(3):489–509.
Ekaterina Shutova, Lin Sun, and Anna Korhonen.
2010. Metaphor identification using verb and noun
clustering. In Proceedings of the 23rd International
Conference on Computational Linguistics. Associ-
ation for Computational Linguistics, pages 1002–
1010.
Paul Simpson. 2004. Stylistics: A resource book for
students. Psychology Press.
Richard Socher, Jeffrey Pennington, Eric H Huang,
Andrew Y Ng, and Christopher D Manning. 2011.
Semi-supervised recursive autoencoders for predict-
ing sentiment distributions. In Proceedings of the
conference on empirical methods in natural lan-
guage processing. Association for Computational
Linguistics, pages 151–161.
Gerard Steen. 2014. Metaphor and style. The Cam-
bridge handbook of Stylistics pages 315–328.
Lisa Torrey and Jude Shavlik. 2009. Transfer learn-
ing. Handbook of Research on Machine Learning
Applications and Trends: Algorithms, Methods, and
Techniques 1:242.
Yulia Tsvetkov, Leonid Boytsov, Anatole Gershman,
Eric Nyberg, and Chris Dyer. 2014. Metaphor de-
tection with cross-lingual model transfer.
Peter D. Turney, Yair Neuman, Dan Assaf, and
Yohai Cohen. 2011. Literal and metaphorical
sense identification through concrete and abstract
context. In Proceedings of the Conference on Em-
pirical Methods in Natural Language Processing.
Association for Computational Linguistics, Strouds-
burg, PA, USA, EMNLP ’11, pages 680–690.
http://dl.acm.org/citation.cfm?id=2145432.2145511.
Carl Vogel. 2001. Dynamic semantics for metaphor.
Metaphor and Symbol 16(1-2):59–74.
Deirdre Wilson. 2011. Parallels and differences
in the treatment of metaphor in relevance theory
and cognitive linguistics. Intercultural Pragmatics
8(2):177–196.
Study II
Finding the Neural Net: Deep-learning Idiom
Type Identification from Distributional
Vectors
Alessandro Lenci†
University of Pisa, Italy
The present work aims at automatically classifying Italian idiomatic and non-idiomatic phrases
with a neural network model under constrains of data scarcity. Results are discussed in com-
parison with an existing unsupervised model devised for idiom type detection and a similar
supervised classifier previously trained to detect metaphorical bigrams. The experiments suggest
that the distributional context of a given phrase is sufficient to carry out idiom type identifi-
cation to a satisfactory degree, with an increase in performance when input phrases are filtered
according to human-elicited idiomaticity ratings collected for the same expressions. Crucially,
employing concatenations of single word vectors rather than whole-phrase vectors as training
input results in the worst performance for our models, differently from what was previously
registered in metaphor detection tasks.
1. Introduction
Studies on metaphor production strategies indeed show a large ability of language users to gen-
eralize and create new metaphors on the fly from existing ones, allowing researchers to
hypothesize recurrent semantic mechanisms underlying a large number of productive
metaphors (McGlone, 1996; Lakoff and Johnson, 2008). For example, starting from the
clean performance metaphor above, we could also say the delivered performance was
neat, spick-and-span and crystal-clear by sticking to the same conceptual domain of clean-
liness. On the other hand, although most idioms originate as metaphors (Cruse, 1986),
they have undergone a crystallization process in diachrony, whereby they now appear
as conventionalized and (mostly) fixed combinations that form a finite repository in a
given language (Nunberg et al., 1994). From a formal standpoint, though some idioms
allow for restricted lexical variability (e.g., the concept of getting crazy can be conveyed
both by to go nuts and to go bananas), this kind of variation is not as free and systematic
as with metaphors and literal language (e.g., transforming the take the cake idiom above
into take the candy would hinder a possible idiomatic reading) (Fraser, 1970; Geeraert
et al., 2017). From the semantic point of view, it is interesting to observe how speakers
can correctly use the most semantically opaque idioms in discourse without necessarily
being aware of their actual metaphorical origin or anyway having contrasting intuitions
about it. For example, Gibbs (1994) reports that many English speakers explain the
idiom kick the bucket ‘to die’ as someone kicking a bucket to hang themselves, while it
actually originates from a corruption of the French word buquet indicating the wooden
framework that slaughtered hogs kicked in their death struggles. Secondly, metaphor-
ical expressions can receive varying interpretations according to the context at hand:
saying that John is a shark could mean that he’s ruthless on his job, that he’s aggressive
or that he attacks people suddenly (Cacciari, 2014). Contrariwise, idiomatic expressions
always keep the same meaning: saying that John kicked the bucket can only be used to
state that he passed away. Finally, idioms and metaphors differ in the mechanisms they
recruit in language processing: while metaphors seem to bring into play categorization
(Glucksberg et al., 1997) or analogical (Gentner, 1983) processes between the vehicle and
the topic (e.g., shark and John respectively in the sentence above), idioms by and large
call for lexical access mechanisms (Cacciari, 2014). Nevertheless, it is crucial to under-
line that idiomaticity itself is a multidimensional and gradient phenomenon (Nunberg
et al., 1994; Wulff, 2008) with different idioms showing varying degrees of semantic
transparency, formal versatility, proverbiality and affective valence. All this variance
within the class of idioms themselves has been demonstrated to affect the processing of
such expressions in different ways (Cacciari, 2014; Titone and Libben, 2014).
The aim of this work is to focus on the fuzzy boundary between idiomatic and
metaphorical expressions from a computational viewpoint, by applying a supervised
method previously designed to discriminate metaphorical vs. literal usages of input
constructions to the task of distinguishing idiomatic from compositional expressions.
Our starting point is the work of Bizzoni et al. (2017), who managed to classify adjective-
noun pairs where the same adjectives were used both in a metaphorical and a literal
sense (e.g., clean performance vs. clean floor) by means of a neural classifier trained
on a composition of the words’ embeddings (Mikolov et al., 2013a). As the authors
found out, the neural network succeeded in the task because it was able to detect the
abstract/concrete semantic shift undergone by the nouns when used with the same
adjective in figurative and literal compositions respectively. In our attempt, we will use a
relatively similar approach to classify idiomatic expressions by training a three-layered
neural network on a set of Italian idioms (e.g. gettare la spugna ‘to throw in the towel’, lit.
‘to throw the sponge’) and non-idioms (e.g. vedere una partita ‘to watch a match’). The
performance of the network will be compared when trained with constructions belong-
2. Related Work
Previous computational research has exploited different methods to perform idiom type
detection (i.e., automatically telling apart potential idioms like to get the sack from only
literal combinations like to kill a man). For example, Lin (1999) and Fazly et al. (2009)
label a given word combination as idiomatic if the Pointwise Mutual Information (PMI)
(Church and Hanks, 1991) between its constituents is higher than the PMIs between the
components of a set of lexical variants of this combination obtained by replacing the
component words of the original expressions with semantically related words. Other
studies have resorted to Distributional Semantics (Lenci, 2008, 2018; Turney and Pantel,
2010) by measuring the cosine between the vector of a given phrase and the single
vectors of its components (Fazly and Stevenson, 2008) or between the phrase vector
and the sum or product vector of its components (Mitchell and Lapata, 2010; Krčmář
et al., 2013). Senaldi et al. (2016) and Senaldi et al. (2017) combine insights from both
these approaches. They start from two lists of 90 VN and 26 AN constructions, the
former composed of 45 idioms (e.g., gettare la spugna) and 45 non-idioms (e.g., vedere una
partita), the latter comprising 13 idioms (e.g., filo rosso ‘common thread’, lit. ‘red thread’)
and 13 non-idioms (e.g., lungo periodo ‘long period’). For each of these constructions,
a series of lexical variants are generated distributionally or via MultiWordNet (Pianta
et al., 2002) by replacing the subparts of the constructions with semantically related
words (e.g. from filo rosso, variants like filo nero ‘black thread’, cavo rosso ‘red cable’ and
cavo nero ‘black cable’ are generated). What comes to the fore is that the vectors of the
idiomatic expressions are less similar to the vectors of their lexical variants than the vector
of a literal construction is to the vectors of its lexical
alternatives. To provide an example, the cosine similarity between the vector of an idiom
like filo rosso and the vectors of its lexical variants like filo nero and cavo rosso was found
to be smaller than the cosine similarity between the vector of a literal phrase like lungo
periodo and the vectors of its variants like interminabile tempo ‘endless time’ and breve
periodo ‘short period’.
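As a minimal sketch of this variant-based comparison (not the original implementation of Senaldi et al.), assuming phrase vectors have already been built from a corpus, the test can be expressed as follows; the toy vectors and variant lists are placeholders.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def mean_variant_similarity(phrase, variants, vectors):
        # Average cosine similarity between a target phrase vector and the vectors of its
        # lexical variants; idioms are expected to score lower than literal combinations.
        target = vectors[phrase]
        return float(np.mean([cosine(target, vectors[v]) for v in variants]))

    # Toy stand-ins for corpus-derived phrase vectors.
    rng = np.random.default_rng(0)
    vectors = {p: rng.random(300) for p in
               ["filo rosso", "filo nero", "cavo rosso", "lungo periodo", "breve periodo"]}
    idiom_score = mean_variant_similarity("filo rosso", ["filo nero", "cavo rosso"], vectors)
    literal_score = mean_variant_similarity("lungo periodo", ["breve periodo"], vectors)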
Moving to the methodology exploited in the current study, to the best of our
knowledge, neural networks have been previously adopted to perform MWE detection
in general (Legrand and Collobert, 2016; Klyueva et al., 2017), but not idiom identifica-
tion specifically. As mentioned in the Introduction, in Bizzoni et al. (2017), pre-trained
noun and adjective vector embeddings are fed to a single-layered neural network to
disambiguate metaphorical and literal AN combinations. Several combination algo-
rithms are experimented with to concatenate adjective and noun embeddings. All in
all, the method is shown to outperform the state of the art, presumably leveraging
the abstractness degree of the noun as a clue to figurativeness and basically treating
the noun as the “context” to discriminate the metaphoricity of the adjective (cf. clean
performance vs clean floor, where performance is more abstract than floor and therefore the
mentioned cleanliness is to be intended metaphorically).
Besides Bizzoni et al. (2017), using neural networks for metaphor detection with
pretrained word embeddings initialization has been tried in a small number of recent
works, proving that this is a valuable strategy to predict metaphoricity in datasets. Rei
et al. (2017) present an ad-hoc neural design able to compose and detect metaphoric
bigrams in two different datasets. Do Dinh and Gurevych (2016) apply a series of
perceptrons to the VU Amsterdam Metaphor Corpus (Steen et al., 2014) combined
with word embeddings and part-of-speech tagging. Finally, a similar approach - a
combination of fully connected networks and pre-trained word embeddings - has also
been used as a pre-processing step to metaphor detection, in order to learn word and
sense abstractness scores to be used as features in a metaphor identification pipeline
(Köper and Schulte im Walde, 2017).
3. Method
In this work we carried out a supervised idiom type identification task by resorting to a
three-layered neural network classifier. After selecting our dataset of VN and AN target
expressions (Section 4.1), for which gold standard idiomaticity ratings had already been
collected (Section 4.2), we built count vector representations for them (Section 4.3) from
the itWaC corpus (Baroni et al., 2009) and fed them to our classifier (Section 5) with
different training splits (Section 6). The network returned a binary output, whereby
idioms were taken as our positive examples and non-idioms as our negative ones. Dif-
ferently from Bizzoni et al. (2017), for each idiom or non-idiom we initially built a count-
based vector (Turney and Pantel, 2010) of the expression as a whole, taken as a single
token. We then compared this approach with a model trained on the concatenation of
the individual words of an expression, but the latter turned out to be less effective for
idioms than for metaphors. Each model was finally evaluated in terms of classification
accuracy, ranking performance and correlation between its continuous scores and the
human-elicited idiomaticity judgments (Section 7).
Since we mostly worked with vectors that took our target expressions as unana-
lyzed wholes, as if they were single tokens, we were not concerned with the fact that
some verbs were shared by more than one idiom (e.g., lasciare il campo ‘to leave the field’
and lasciare il segno ‘to leave one’s mark’) or non-idiom (e.g., andare a casa ‘to go home’
and andare all’estero ‘to go abroad’) at once, given that our network could not access this
information.
4. Dataset
The two datasets we employed in the current study come from Senaldi et al. (2016) and
Senaldi et al. (2017). The first one is composed of 45 idiomatic Italian V-NP and V-PP
constructions (e.g., tagliare la corda ‘to flee’ lit. ‘to cut the rope’) that were selected from
an Italian idiom dictionary (Quartu, 1993) and extracted from the itWaC corpus (Baroni
et al. 2009, 1,909M tokens ca.) and whose frequency spanned from 364 (ingannare il tempo
‘to while away the time’) to 8294 (andare in giro ‘to get about’), plus other 45 Italian non-
idiomatic V-NP and V-PP constructions of comparable frequencies (e.g., leggere un libro
‘to read a book’). The latter dataset comprises 13 idiomatic and 13 non-idiomatic AN
constructions (e.g., punto debole ‘weak point’ and nuova legge ‘new law’) that were still
extracted from itWaC and whose frequency varied from 21 (alte sfere ‘high places’, lit.
‘high spheres’) to 194 (punto debole).
Senaldi et al. (2016) and Senaldi et al. (2017) collected gold standard idiomaticity judg-
ments for the 26 AN and 90 VN target constructions in their datasets. Nine linguistics
students were presented with a list of the 26 AN constructions and were asked to
evaluate how idiomatic each expression was from 1 to 7, with 1 standing for ‘totally
compositional’ and 7 standing for ‘totally idiomatic’. Inter-coder agreement, measured
with Krippendorff’s α (Krippendorff, 2012), was equal to 0.76. The same procedure was
repeated for the 90 VN constructions, but in this case the initial list was split into 3
sublists of 30 expressions, each one to be rated by 3 subjects. Krippendorff’s α was
0.83 for the first sublist and 0.75 for the other two. These inter-coder agreement scores
were taken as a confirmation of reliability for the collected ratings (Artstein and Poesio,
2008). As will become clear in Section 6, these judgments served the twofold purpose
of evaluating the classification performance of our neural network and filtering the
expressions to use as training input for our models.
Count-based Distributional Semantic Models (DSMs) (Turney and Pantel, 2010) allow
for representing words and expressions as high-dimensionality vectors, where the vec-
tor dimensions register the co-occurrence of the target words or expressions with some
contextual features, e.g. the content words that linearly precede and follow the target
element within a fixed contextual window. We trained two DSMs on itWaC, where
our target AN and VN idioms and non-idioms were represented as target vectors and
co-occurrence statistics counted how many times each target construction occurred in
the same sentence with each of the 30,000 top content words in the corpus. Differently
from Bizzoni et al. (2017), we did not opt for prediction-based vector representations
(Mikolov et al., 2013a). Although some studies have brought out that context-predicting
models fare better than count-based ones on a variety of semantic tasks (Baroni et al.,
2014), including compositionality modeling (Rimell et al., 2016), others (Blacoe and
Lapata, 2012; Cordeiro et al., 2016) have shown them to perform comparably. In phrase
similarity and paraphrase tasks, Blacoe and Lapata (2012) find count vectors to score
better than or comparably to predict vectors built following Collobert and Weston
(2008)’s neural language model. Cordeiro et al. (2016) show PPMI-weighted count-
based models to perform comparably to word2vec (Mikolov et al., 2013b) in predicting
nominal compound compositionality. Moreover, Levy et al. (2015) highlight that much
of the superiority in performance exhibited by word embeddings is actually due to
hyperparameter optimizations, which, if applied to traditional models as well, can lead
to equivalent outcomes. Therefore, we felt confident in resorting to count-based vectors
as an equally reliable representation for the task at hand.
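A rough sketch of how such count vectors can be assembled is given below; it assumes the corpus has already been tokenised into sentences and that each target expression has been merged into a single token, and it ignores the part-of-speech filtering of content words, so it illustrates the general recipe rather than the actual itWaC pipeline.

    from collections import Counter

    def build_count_vectors(sentences, targets, vocab_size=30000):
        # Sentence-level co-occurrence counts between whole target expressions
        # (treated as single tokens) and the `vocab_size` most frequent words.
        word_freq = Counter(w for s in sentences for w in s)
        context_words = [w for w, _ in word_freq.most_common(vocab_size)]
        col = {w: i for i, w in enumerate(context_words)}
        vectors = {t: [0] * len(context_words) for t in targets}
        for s in sentences:
            for t in targets:
                if t in s:                       # the target occurs in this sentence
                    for w in s:
                        if w in col and w != t:
                            vectors[t][col[w]] += 1
        return vectors, context_words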
We built a neural network composed of three “dense” or fully connected hidden layers.
The input layer has the same dimensionality of the original vectors and the output
layer has dimensionality 1. The other two hidden layers have dimensionality 12 and
8 respectively. Our network takes in input a single vector at a time, which can be a
word embedding, a count-based distributional vector or a composition of several word
vectors. For the core part of our experiment we used as input single distributional
vectors of two-word expressions. As we discussed in the previous section, these vec-
tors have 30,000 dimensions each and represent the distributional behavior of a full
expression rather than that of the individual words composing such expression. Given
this distributional matrix, we defined idioms as positive examples and non-idioms as
negative examples of our training set. Due to the magnitude of our input, the most
important reduction of data dimensionality is carried out by the first hidden layer of
our model. The last layer applies a sigmoid activation function on the output in order
to produce a binary judgment. While binary scores are necessary to compute the model
classification accuracy and will be evaluated in terms of F1, our model’s continuous
scores can be retrieved and will be used to perform an ordering task on the test set, that
we will evaluate in terms of Interpolated Average Precision (IAP) and Spearman's ρ
with the human-elicited idiomaticity judgments. IAP and ρ, therefore, will be useful to
investigate how good our model is in ranking idioms before non-idioms.
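A minimal sketch of such a network in Keras is given below; the library itself is an assumption (the reference list cites TensorFlow), and the ReLU activations on the hidden layers, the optimiser and the loss are illustrative choices - only the layer sizes and the final sigmoid are stated above.

    from keras.models import Sequential
    from keras.layers import Dense

    def build_classifier(input_dim=30000):
        # Input matching the vector dimensionality, hidden layers of size 12 and 8,
        # and a sigmoid output for the binary idiom / non-idiom decision.
        # Hidden-layer activations are an illustrative choice, not stated in the text.
        model = Sequential([
            Dense(12, activation="relu", input_shape=(input_dim,)),   # main dimensionality reduction
            Dense(8, activation="relu"),
            Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        return model

The continuous scores returned by the trained network can be used directly for the ranking evaluation (IAP and Spearman's ρ), while thresholding them yields the binary labels evaluated with F1.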
The scarcity of our training sets constitutes a challenge for neural models, typically
designed to deal with massive amounts of data. The typical effect of such scarcity is
a fluctuation in performance: training our model on two different sections of the same
dataset is likely to result in quite different F-scores.
Unless otherwise specified, the IAP, Spearman’s ρ and F1 scores reported in Table
1 are averaged on 5 runs of each model on the same datasets: at each run, the training
split is randomly selected. We found that some samples of the training set seemingly
make it harder for the model to learn idiom detection. When such runs are included in
the mean, the performance is drastically lowered.
In our attempt to understand whether we could find a rationale behind this phe-
nomenon or it was instead completely unpredictable, in some versions of our models
we have tried to filter our training sets according to the idiomaticity judgments we
elicited from speakers (Section 4.2) to assess which composition of our training sets
made our algorithm more effective. In the first approach, which we will label as High-
to-Low (HtL henceforth), the network was trained on the idioms receiving the highest
idiomaticity ratings (and symmetrically on the compositional expressions having the
lowest idiomaticity ratings) and was therefore tested on the intermediate cases. In
the second approach, which we called Low-to-High (LtH), the model was trained on
more borderline exemplars, i.e. the idioms having the lowest idiomaticity ratings and
the compositional expressions having the highest ones, and then tested on the most
polarized cases of idioms and non-idioms.
For example, in the HtL setting, the AN bigrams we selected for the training set
included idioms like testa calda ‘hothead’ and faccia tosta ‘brazen person’ (lit. ‘tough
face’), that reported an average idiomaticity rating of 6.8 and 6.6 out of 7 respectively,
and non-idioms like famoso scrittore ‘famous writer’ and nuovo governo ‘new govern-
ment’ that elicited an average idiomaticity rating of 1.2 and 1.1 out of 7. In the case
of VN bigrams, we selected idioms like andare a genio ‘to sit well’ (lit. ‘to go to genius’)
(mean idiomaticity rating of 7) and non-idioms like vendere un libro ‘to sell a book’ (mean
idiomaticity rating of 1). The neural network was thus trained only on elements that our
annotators had judged as clearly positive and clearly negative examples.
To provide examples on the LtH training sets, for the VN data, we selected idioms
like lasciare il campo (mean rating = 3.6) and cambiare colore ‘to change color (in face)’
(mean rating = 3.6) against non-idiomatic expressions like prendere un caffè ‘to grab a
coffee’ (3.3) and lasciare un incarico ‘to leave a job’ (2.3). For the AN data, we selected
idioms like prima serata ‘prime time’ (lit. ‘first evening’) (mean rating = 4 out of 7) and
compositional expressions like proposta concreta ‘concrete proposal’ (2.7). The neural
network was in this case trained only on elements that our annotators had judged as
borderline cases.
The results of these different filtering procedures can be found in Table 1.
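A compact sketch of the two filtering strategies follows, assuming a dictionary mapping each expression to its mean idiomaticity rating and a gold idiom/non-idiom label; the function name and the half-and-half split are illustrative choices, not the exact selection script.

    def filtered_training_set(ratings, is_idiom, n_train, mode="HtL"):
        # Sort expressions by mean idiomaticity rating (ascending).
        idioms = sorted((e for e in ratings if is_idiom[e]), key=ratings.get)
        non_idioms = sorted((e for e in ratings if not is_idiom[e]), key=ratings.get)
        k = n_train // 2                      # illustrative: half idioms, half non-idioms
        if mode == "HtL":
            # High-to-Low: train on the most clear-cut items.
            train = idioms[-k:] + non_idioms[:k]
        else:
            # Low-to-High: train on the borderline items instead.
            train = idioms[:k] + non_idioms[-k:]
        test = [e for e in ratings if e not in train]
        return train, test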
7. Evaluation
Once the training sets were established, a variety of transformations were tried on our
VN and AN distributional vectors before giving them as input to our network. Some
models were trained on the raw 30,000 dimensional distributional vectors of VN and
AN expressions; other models used the concatenation of the vectors of the individual
components of the expressions; finally, other models employed PPMI (Positive Point-
wise Mutual Information) (Church and Hanks, 1991) and SVD (Singular Value De-
composition) transformed (Deerwester et al., 1990) vectors of 150 and 300 dimensions.
Details of both classification and ordering tasks are shown in Table 1. Qualitative details
about the results will be given in Section 8.
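The PPMI weighting and SVD reduction mentioned above can be sketched as follows; scikit-learn's TruncatedSVD is used here as an assumption for the dimensionality-reduction step, since the text only states that SVD-reduced vectors of 150 and 300 dimensions were tried.

    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    def ppmi(counts):
        # Positive Pointwise Mutual Information weighting of a count matrix
        # (rows = target expressions, columns = context words).
        counts = np.asarray(counts, dtype=float)
        total = counts.sum()
        row_p = counts.sum(axis=1, keepdims=True) / total
        col_p = counts.sum(axis=0, keepdims=True) / total
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log2((counts / total) / (row_p * col_p))
        pmi[~np.isfinite(pmi)] = 0.0
        return np.maximum(pmi, 0.0)

    def reduce_svd(matrix, n_components=150):
        # Project the (PPMI-weighted) vectors down to 150 or 300 dimensions;
        # TruncatedSVD is an assumed stand-in for the SVD step described in the text.
        return TruncatedSVD(n_components=n_components, random_state=0).fit_transform(matrix)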
7.1 Verb-Noun
We ran our model on the VN dataset, composed of 90 elements, namely 45 idioms and 45
non-idiomatic expressions. This is the largest of the two datasets. We trained our model
on 30 and 40 elements for 20 epochs and tested it on the remaining 60 and 50 elements respectively. [3]

[3] When we report the number of training and test items in Table 1 as 15+15, for instance, we mean 15 idioms + 15 non-idioms. The same applies to all the other listed models.

Table 1
Interpolated Average Precision (IAP), Spearman's ρ correlation with the human judgments and F-measure (F1) for Verb-Noun training (VN), Adjective-Noun training (AN), joint (VN+AN) training and training through vector concatenation. High-to-Low (HtL) models were trained on clear-cut cases, while Low-to-High (LtH) models were trained on borderline cases. As for the other models, the average performance over 5 runs with randomly selected training sets is reported. Training and test sets are expressed as the sum of positive and negative examples.

The models that best succeeded at classifying our phrases into idioms and
non-idioms were trained with 40 PPMI-transformed vectors, reaching an average F1
score of .77 on the randomized iterations and an F1 score of .85, with a Spearman’s ρ
correlation of .68, when the training set was composed of borderline cases and the model
was then tested on more clear-cut exemplars (LtH). As for the rest of the F1 scores,
what comes to light from our results is that increasing the number of training vectors
generally leads to better results, except for models fed with SVD-transformed vectors
of 300 dimensions, which seem to be insensitive to the size of our training data. Quite
interestingly, SVD-reduced vectors appear to perform worse in general than raw ones
and just PPMI-transformed ones. Due to space limitations, raw-frequency VN models
are not reported in Table 1 since they were comparable to just PPMI-weighted ones.
This same pattern is encountered when evaluating the ability of our algorithm to
rank idioms before non-idioms (IAP). The models with the highest scores employ 40
PPMI training vectors and reach .73 on the randomized training, .79 on the HtL training
and .77 on the LtH ones, while SVD training vectors generally lead to poorer ranking
performances. Despite these IAP scores being encouraging, they are anyway lower than
those obtained by Senaldi et al. (2016), who reach a maximum IAP of 0.91. This drop in
performance could point to the fact that resorting to distributional information only
to carry out idiom identification overlooks some aspects of the behavior of idiomatic
constructions (e.g., formal rigidity) that are to be taken into account to arrive at a more
satisfactory classification. Concerning the correlation between the continuous score of
the neural net and the human idiomaticity ratings presented in Section 4.2, the best
model also employed 40 PPMI vectors of borderline expressions (.68), followed by
the model using 40 PPMI vectors of clear-cut cases (.65). These correlation values are
quite comparable to the maximum of -0.67 obtained in Senaldi et al. (2016) [4] in High-
to-Low and Low-to-High ordered models, while they are lower in randomized models,
especially SVD-reduced ones.
All in all, both HtL and LtH experimental settings result in IAP, correlation and F1
scores that are higher than what we get from averaging over randomly selected training
sets. More precisely, the strategy of training only on borderline examples (LtH) appears
to be the most effective. This can intuitively make sense: once a network has learned to
discriminate between borderline cases, detecting clear-cut elements should be relatively
easy. The opposite strategy also seems to bring some benefits, possibly because training
on clear negative and positive examples provides the network with a data set which is
easier to generalize from. In any case, it seems clear that selecting our training set with the
help of human ratings allows us to significantly increase the performance of our models.
We can see this as another proof that human continuous scores on idiomaticity - and not
only binary judgments - are mirrored in the distributional pattern of these expressions.
As for the influence of the training set size on IAP and ρ, all in all it seems that the best
results are reached with 40 training vectors, both on the randomized training sets and
on the ordered training sets.
The general trend we can abstract from these results is that our neural network does
a good job in telling apart idioms and non-idioms by just relying on raw-frequency
and PPMI-transformed distributional information. Performing dimensionality reduc-
tion apparently deprives the model of useful information, which makes the overall
performance plummet to lower levels.
7.2 Adjective-Noun
Our model was also run on the AN dataset, composed of 26 elements (13 idioms and
13 non-idiomatic expressions). We empirically found that our network was able to
perform some generalization on the data when the training set contained at least 14
elements, evenly balanced between positive and negative examples. We trained our
model on 16 elements for 30 epochs and tested on the remaining 10 elements. As
happened with VN vectors, performing SVD worsened the performance of the model.
While the exact F1 value can undergo fluctuations when a model is trained on very small
sets, we always registered accuracies higher than 70% for the ordered training sets. In
this case even more than in the Verb-Noun frame, the difference between randomizing
the training set and selecting it using human idiomaticity ratings appears to be very
evident, possibly due to the extremely small size of this specific dataset, which makes the qualitative selection of the training data particularly important.

[4] Please keep in mind that the correlation values in Senaldi et al. (2016) and Senaldi et al. (2017) are negative, since the less similar a target vector is to the vectors of its variants, the more idiomatic the target.

Once again
the highest Spearman’s ρ correlation (.93) was reached when using a Low-to-High set
trained on borderline cases, although it is important to keep in mind that such scores
are computed on a very restricted test set. The same reasoning applies to IAP scores,
which all reach the top value, though we must consider the very small test set. Senaldi
et al. (2017) instead reached a maximum IAP of .85 and a maximum ρ of -.68 in AN
idiom identification. When the training size was under the critical threshold, accuracy
dropped significantly. With training sets of 10 or 12 elements, our model naturally went
into overfitting, quickly reaching 100% accuracy on the training set and failing to correctly
classify unseen expressions. In these cases a partial learning was still visible in the
ordering task, where most idioms, even if labeled incorrectly, received higher scores
than non-idioms.
Our last experiment consisted in training our model on a mixed dataset of both VN
and AN expressions, to check to what extent it would be able to recognize the same
underlying semantic phenomenon across different syntactic constructions. In these
models as well as in those described in Section 7.4, PPMI and SVD transformations were
not tested any further, since they had already been shown to yield generally comparable or
even worse outcomes when tried on the VN and the AN datasets singularly. Concerning
the structure of our training and test sets, two approaches were experimented with. We
first tried to train our model on one pair type, e.g. the AN pairs, and then tested on
the other, but we saw this required more epochs overall (more than 100) to stabilize
and resulted in a poorer performance. When training our model on a mixed dataset
containing the elements of both pair types, our model employed 20 epochs to reach an
F-measure of 66% on the mixed training set when the set was ordered Low-to-High
(i.e., it was composed of borderline cases only) and a comparable F-score of 65% when
using clear-cut training input (HtL). Anyway, we also noticed that VN expressions were
learned better than AN expressions. It’s also worth considering that, although the F-
scores of the LtH and HtL models were higher, the IAP and Spearman’s ρ were lower
than in the unordered input model. In other words, while ordering the input led to a
better binary classification, the continuous scores returned a less precise ranking.
Our model was able to generalize over the two datasets, but this involved a loss in
accuracy with respect to the only-VN and only-AN ordered training sets. It can be seen
in Table 1 that a loss in accuracy is also evident for joint training on the randomized
frame, although in this case the model seems hardly able to generalize at all.
Training the model on the concatenation of the individual words' vectors resulted in the worst
performance, differently from what happened with metaphors (Bizzoni et al., 2017).
Although all correlations are low and not statistically significant, it is still worth
pointing out that not all the results are completely random: with an F1 of 59%
for the LtH training set and an IAP of .61 for the HtL set, the model seems able to learn
idiomaticity to a lower, but not null, degree; these findings would be in line with the
claim that the meaning of the subparts of several idioms, while less important than in
metaphors, is not completely obliterated (McGlone et al., 1994). Another hint in this
direction is the difference in performance between randomized and ordered training
that we can observe for concatenation: if human idiomaticity ratings were completely
independent of the composition of the individual subparts of our idioms, such an effect
should not be present at all. Anyway, similarly to what happened with the joint models,
ordering the training input led to higher F-scores and comparable IAPs, but returned a
worse correlation with human judgments with respect to the models with a randomized
training input.
8. Error Analysis
We noticed that the more frequent an item, be it an idiom or a literal, the more the network tended to consider it as literal (i.e.,
it gave it a lower idiomaticity score). This tendency could be explained if we consider
that some of our most frequent idioms were actually quite ambiguous (e.g., aprire gli
occhi ‘to open one’s eyes’ occurred 6306 times in the corpus and bussare alla porta 3303
times) and most of their corpus occurrences could be literal uses.
The experiments we have presented show that the distribution of idiomatic and
compositional expressions in large corpora can suffice for a supervised classifier to learn
the difference between the two kinds of linguistic elements from small training sets and
with a good level of accuracy. Specifically, we have observed that human continuous
ratings of idiomaticity can be used to select a better training set for our models: training
on cases deemed borderline by our annotators allows the models to learn and perform
better than if they were fed with randomized input, and training only on clear-cut cases
also increases performance. In general, we can see from these phenomena that human
continuous ratings of idiomaticity seem to be mirrored in the distributional structure of
our data.
Unlike with metaphors (Bizzoni et al., 2017), feeding the classifier with a composition
of the individual word vectors of such expressions performs rather poorly and can be
used to detect only some idioms. This takes us back to the core difference that, while
metaphors are more compositional and preserve a transparent source domain to target
domain mapping, idioms are by and large non-compositional. Since our classifiers rely
only on contextual features, their classification ability must stem from a difference in
distribution between idioms and non-idioms. A possible explanation is that while the
literal expressions we selected, like vedere un film 'to watch a film' or ascoltare un
discorso 'to listen to a speech', tend to be used with animate subjects and thus to appear
in more concrete contexts, most of our idioms (e.g. cadere dal cielo 'to fall from the sky'
or lasciare il segno 'to leave a mark') allow for varying degrees of animacy or
concreteness of the subject, so that their contexts can easily become more diverse. At the
same time, the drop in performance we observe in the joint models seems to indicate
that the different parts of speech composing our elements entail a significant contextual
difference between the two groups, which introduces a considerable amount of
uncertainty into our model.
It is also possible that other contextual elements we did not consider played a role in
the learning process of our models, such as the ambiguity between idiomatic and literal
meanings that some potentially idiomatic strings possess (e.g. to leave the field), which
would lead their contextual distribution to be more varied than that of only-literal
combinations. We intend to investigate this aspect further in future work.
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis,
A., Dean, J., Devin, M., et al. (2016). Tensorflow: Large-scale machine learning on
heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
Artstein, R. and Poesio, M. (2008). Inter-coder agreement for computational linguistics.
Computational Linguistics, 34(4):555–596.
Baroni, M., Bernardini, S., Ferraresi, A., and Zanchetta, E. (2009). The WaCky wide
web: a collection of very large linguistically processed web-crawled corpora. Language
Resources and Evaluation, 43(3):209–226.
Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don’t count, predict! a systematic com-
parison of context-counting vs. context-predicting semantic vectors. In Proceedings of
the 52nd Annual Meeting of the Association for Computational Linguistics, pages 238–247.
Bizzoni, Y., Chatzikyriakidis, S., and Ghanimifard, M. (2017). “Deep” learning: Detecting
metaphoricity in adjective-noun pairs. In EMNLP 2017.
Blacoe, W. and Lapata, M. (2012). A comparison of vector-based representations for
semantic composition. In Proceedings of the 2012 joint conference on empirical methods in
natural language processing and computational natural language learning, pages 546–556.
Association for Computational Linguistics.
Bohrn, I. C., Altmann, U., and Jacobs, A. M. (2012). Looking at the brains behind figu-
rative language: a quantitative meta-analysis of neuroimaging studies on metaphor,
idiom, and irony processing. Neuropsychologia, 50(11):2669–2683.
Cacciari, C. (2014). Processing multiword idiomatic strings: Many words in one? The
Mental Lexicon, 9(2):267–293.
Cacciari, C. and Papagno, C. (2012). Neuropsychological and neurophysiological corre-
lates of idiom understanding: How many hemispheres are involved. The handbook of
the neuropsychology of language, pages 368–385.
Church, K. W. and Hanks, P. (1991). Word association norms, mutual information, and
lexicography. Computational Linguistics, 16(1):22–29.
Collobert, R. and Weston, J. (2008). A unified architecture for natural language pro-
cessing: Deep neural networks with multitask learning. In Proceedings of the 25th
international conference on Machine learning, pages 160–167. ACM.
Cordeiro, S., Ramisch, C., Idiart, M., and Villavicencio, A. (2016). Predicting the com-
positionality of nominal compounds: Giving word embeddings a hard time. In
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics,
volume 1, pages 1986–1997.
Cruse, D. A. (1986). Lexical semantics. Cambridge University Press.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990).
Indexing by latent semantic analysis. Journal of the American society for information
science, 41(6):391.
Do Dinh, E.-L. and Gurevych, I. (2016). Token-level metaphor detection using neural
networks. In Proceedings of the Fourth Workshop on Metaphor in NLP, pages 28–33.
Fazly, A., Cook, P., and Stevenson, S. (2009). Unsupervised type and token identification
of idiomatic expressions. Computational Linguistics, 1(35):61–103.
Fazly, A. and Stevenson, S. (2008). A distributional account of the semantics of multi-
word expressions. Italian Journal of Linguistics, 1(20):157–179.
Fraser, B. (1970). Idioms within a transformational grammar. Foundations of language,
pages 22–42.
Frege, G. (1892). Über Sinn und Bedeutung. Zeitschrift für Philosophie und philosophische
Kritik, 100:25–50.
Geeraert, K., Baayen, R. H., and Newman, J. (2017). Understanding idiomatic variation.
MWE 2017, page 80.
Gentner, D. (1983). Structure-mapping: A theoretical framework for analogy. Cognitive
science, 7(2):155–170.
Gibbs, R. W. (1993). Why idioms are not dead metaphors. Idioms: Processing, structure,
and interpretation, pages 57–77.
Gibbs, R. W. (1994). The poetics of mind: Figurative thought, language, and understanding.
Cambridge University Press.
Gibbs, R. W., Bogdanovich, J. M., Sykes, J. R., and Barr, D. J. (1997). Metaphor in idiom
comprehension. Journal of memory and language, 37(2):141–154.
Rimell, L., Maillard, J., Polajnar, T., and Clark, S. (2016). Relpron: A relative clause eval-
uation data set for compositional distributional semantics. Computational Linguistics,
42(4):661–701.
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., and Flickinger, D. (2002). Multiword
Expressions: A Pain in the Neck for NLP. In Proceedings of the 3rd International
Conference on Intelligent Text Processing and Computational Linguistics, pages 1–15.
Senaldi, M. S. G., Lebani, G. E., and Lenci, A. (2016). Lexical variability and composition-
ality: Investigating idiomaticity with distributional semantic models. In Proceedings
of the 12th Workshop on Multiword Expressions, pages 21–31.
Senaldi, M. S. G., Lebani, G. E., and Lenci, A. (2017). Determining the compositionality
of noun-adjective pairs with lexical variants and distributional semantics. Italian
Journal of Computational Linguistics, 3(1):43–58.
Steen, G. J., Dorst, A. G., Herrmann, J. B., Kaal, A., Krennmayr, T., and Pasma, T. (2014).
A method for linguistic metaphor identification: From mip to mipvu. Metaphor and
the Social World, 4(1):138–146.
Tanguy, L., Sajous, F., Calderone, B., and Hathout, N. (2012). Authorship attribution:
Using rich linguistic features when training data is scarce. In PAN Lab at CLEF.
Titone, D. and Libben, M. (2014). Time-dependent effects of decomposability, familiarity
and literal plausibility on idiom priming: A cross-modal priming investigation. The
Mental Lexicon, 9(3):473–496.
Torre, E. (2014). The emergent patterns of Italian idioms: A dynamic-systems approach. PhD
thesis, Lancaster University.
Turney, P. D. and Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of
Semantics. Journal of Artificial Intelligence Research, 37:141–188.
Wulff, S. (2008). Rethinking Idiomaticity: A Usage-based Approach. Continuum.
Study III
Bigrams and BiLSTMs
Two neural networks for sequential metaphor detection
Yuri Bizzoni and Mehdi Ghanimifard
Centre for Linguistic Theory and Studies in Probability (CLASP),
Department of Philosophy, Linguistics and Theory of Science,
University of Gothenburg.
[email protected] [email protected]
Table 2: Performance of different models compared to the score reported by two relevant works in the literature.
We report the performance of simpler models and their combinations as baselines. We used some abbreviations to
describe the models in the table. For example, Dense(1) represents a single, fully connected layer of output length
of 1, LSTM(32) is an LSTM with an output length of 32 and Concat represents our compositional model. Thus,
Concat(n=2)+Dense(300) represents the bigram composition model with a concatenation window of 2 combined
with a fully connected layer of 300 output units.
Table 4: Parameter tuning, testing both deeper and wider settings of the model. We write in parentheses the
dimensions of each layer: for example, Dense(20) is a fully connected layer with an output space of dimensionality
20.
N Precision Recall F1
Dense(300)+Bi-LSTM(32)+Dense(20) .642 .498 .561
Dense(301)+Bi-LSTM(32)+Dense(20)+Conc .580 .491 .530
Concat(n=2)+Dense(300)+Conc .554 .570 .562
Concat(n=3)+Dense(300)+Conc .567 .593 .580
Table 5: Results for different models using embeddings enriched with explicit information regarding word
concreteness. The first line serves as a baseline, showing a model without input manipulation. Concat(n=) represents
our compositional model, with n= representing the composition window length. Conc signifies the usage of
concreteness scores. So, for example, Concat(n=2)+Dense(300)+Conc represents our compositional model with a
concatenation window of 2, combined with a fully connected layer of 300 output units and using the concreteness
scores as additional information.
N Precision Recall F1
Dense(300)+Bi-LSTM(32)+Dense(20) .642 .498 .561
Dense(300)+Bi-LSTM(32)+Dense(20)+Chunk .671 .570 .621
Concat(n=2)+Dense(300)+Chunk .571 .561 .560
Concat(n=3)+Dense(300)+Chunk .611 .400 .491
Table 6: Results for different models using sentence breaking at 20 tokens (any sentence longer than 20 tokens is
split into two parts treated as completely different sentences). The first line serves as a baseline, showing a model
without input manipulation. Concat(n=) represents our compositional model; Chunk signifies the usage of sentence
breaking.
N Precision Recall F1
Dense(300)+Bi-LSTM(32)+Dense(20) .642 .498 .561
Dense(300)+Bi-LSTM(32)+Dense(20)+Chunk .670 .571 .620
Dense(301)+Bi-LSTM(32)+Dense(20)+Conc .581 .490 .531
Dense(301)+Bi-LSTM(32)+Dense(20)+Conc+Chunk .649 .624 .636
Concat(n=3)+Dense(300)+Conc+Chunk .632 .446 .523
Table 7: Results for different models using embeddings enriched with explicit information regarding word
concreteness and sentence breaking at 20 tokens (any sentence longer than 20 tokens is split into two parts treated
as completely different sentences). The first lines serve as baselines, showing the performance of previous models
(without any input manipulation, only chunking, only concreteness scores). Concat(n=) represents our compositional
model; Chunk signifies the usage of sentence breaking; Conc represents the usage of concreteness scores.
N Precision Recall F1
Dense(300)+Bi-LSTM(32)+Dense(20) .638 .593 .615
Concat(n=2)+Dense(300) .642 .498 .561
Combined results .595 .680 .635
Table 8: Results for the evaluation set from the shared dataset competition (NAACL 2018). We used sentence
breaking and concreteness information.
Deep Learning of Binary and Gradient Judgements for Semantic
Paraphrase
Yuri Bizzoni and Shalom Lappin
University of Gothenburg
[email protected] [email protected]
Abstract
We treat paraphrase identification as an ordering task. We construct a corpus of 250 sets of five
sentences, with each set containing a reference sentence and four paraphrase candidates, which are
annotated on a scale of 1 to 5 for paraphrase proximity. We partition this corpus into 1000 pairs of
sentences in which the first is the reference sentence and the second is a paraphrase candidate. We
then train a DNN encoder for sentence pair inputs. It consists of parallel CNNs that feed parallel
LSTM RNNs, followed by fully connected NNs, and finally a dense merging layer that produces
a single output. We test it for both binary and graded predictions. The latter are generated as a
by-product of training the former (the binary classifier). It reaches 70% accuracy on the binary
classification task. It achieves a Pearson correlation of .59-.61 with the annotated gold standard for
the gradient ranking candidate sets.
1 Introduction
Paraphrase identification is an area of research with a long history. Approaches to the task can be di-
vided into supervised methods, such as (Madnani et al., 2012), currently the most commonly used, and
unsupervised techniques (Socher et al., 2011).
While many approaches of both types use carefully selected features to determine similarity, such
as string edit distance (Dolan et al., 2004) or longest common subsequence (Fernando and Stevenson,
2008), several recent supervised approaches apply Neural Networks to the task (Filice et al., 2015; He
et al., 2015), often linking it to the related issue of semantic similarity (Tai et al., 2015; Yin and Schütze,
2015).
Traditionally, paraphrase detection has been formulated as a binary problem. Corpora employed in
this work contain pairs of sentences labeled as paraphrase or non-paraphrase. The most representative
of these corpora, such as the Microsoft Paraphrase Corpus (Dolan et al., 2004), conform to this paradigm.
This approach is different from the one adopted in semantic similarity datasets, where a pair of
words or sentences is labeled on a gradient classification system. In some cases, semantic similarity
tasks overlap with paraphrase detection, as in Xu et al. (2015) and in Agirre et al. (2016). Xu et al.
(2015) is one of the first works that tries to connect paraphrase identification with semantic similarity.
They define a task where the system generates both a binary judgment and a gradient score for sentence
pairs.
We present a new dataset for paraphrase identification which is built on two main ideas: (i) Para-
phrase recognition is a gradient classification task. (ii) Paraphrase recognition is an ordering problem,
where sets of sentences are ranked by similarity with respect to a reference sentence.
While the first assumption is shared by some of the work we have cited here, our corpus is, to the
best of our knowledge, the first one constructed on the basis of the second claim.
We believe that annotating sets of sentences for similarity with respect to a reference sentence can
help with both the learning and the testing processes in paraphrase identification.
We use this corpus to test a neural network architecture formed by a combination of Convolutional
Neural Networks (CNNs) and Long Short Term Memory Recurrent Neural Networks (LSTM RNNs). We
test this model on two classification problems: (i) binary paraphrase classification, and (ii) paraphrase
ranking. We show that our system can achieve a significant correlation to human paraphrase judgments
on the ranking task as a by-product of supervised binary learning.
While the extremes of our scale (1 and 5) are relatively rare in our corpus, we focus on the interme-
diate cases of paraphrase, from non-paraphrases with some semantic similarity (2) to non type-identical
strong paraphrases (4).
We believe that this annotation scheme is particularly useful. While it sustains graded semantic
similarity labels, it also provides sets of semantically related elements, each one of which can be scored
or ordered independently from the others. Therefore, the reference sentence can be tested separately for
each sentence in the set in a binary classification task. In the test phase, this annotation schema allows
us to observe how a system represents the similarity between two sentences by taking the scores of two
candidates as points of relative proximity to the reference sentence.
Our examples above indicate that a binary classification can be misleading because it conceals the
different levels of similarity between competing candidates.
We find instead that framing paraphrase recognition as an ordering problem allows a more flexible
evaluation of a model. It permits us to evaluate the relative proximity of several candidate paraphrases
to the reference sentence independently of the particular paraphrase score that the model assigns to each
candidate in the set.
For example, the sentence A person feeds an animal can be considered to be a loose paraphrase of
the sentence A woman feeds a cat, or alternatively, as a semantically related non-paraphrase. Which
of these conclusions we adopt depends on our decision concerning how much content sentences need
to share in order to be classified as paraphrases. By contrast, it would be far-fetched to suggest that A
woman kicks a cat is a better or even equally strong paraphrase for A woman feeds a cat. Similarly, the
sentences I have a black hat and My hat is night black can be considered to be loose paraphrases, or
semantically related non-paraphrases. But I have a red hat cannot plausibly be taken as more similar in
meaning to I have a black hat than My hat is night black.
The core of this dataset was built from various parts of the Brown Corpus (Francis and Kucera, 1979),
mainly from the news and narrative sections. For each sentence, we introduced raw paraphrases by round
trip machine translation from English through Swedish, German, Spanish and Japanese, back to English.
This process yielded paraphrases, looser relations of semantic relatedness, and non-paraphrases.
One of the authors then manually annotated each set of five sentences and corrected grammatical infe-
licities. We also introduced more interesting syntactic and semantic variation. For example we manually
constructed many cases of negation and passive/active mood switch. This allows us to test paraphrase
over a wider range of syntactic and lexical semantic constructions. Similar manually generated elements
were often substituted as candidate paraphrases to round-trip generated candidates judged to be of little
interest for the task. So, for example, we frequently had several strong paraphrases produced by round-
trip translation, resulting in groups of three or four strong candidates for a reference sentence, and we
replaced several of these with our own alternatives.
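The round-trip generation step can be sketched as follows; translate is a hypothetical stand-in for whatever machine translation system or service is used, not an actual API.

```python
# Hypothetical sketch of round-trip paraphrase generation as described above.
# `translate` is a placeholder for an MT system or service; it is not a real API call.
def translate(text: str, source: str, target: str) -> str:
    raise NotImplementedError("plug in a machine translation system or service here")

def round_trip(sentence: str, pivots=("sv", "de", "es", "ja")) -> list:
    """Return one raw candidate paraphrase per pivot language (Swedish, German, Spanish, Japanese)."""
    candidates = []
    for pivot in pivots:
        foreign = translate(sentence, source="en", target=pivot)
        back = translate(foreign, source=pivot, target="en")
        candidates.append(back)
    return candidates
```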
A number of shorter examples produced by the authors were also added to the corpus. These are
intended to test the performance of the system for specific semantic relations, such as antonymy (I have
a new car – I have an old car), expansion (His car is red – His car has a characteristic red colour) and
subject–object permutation (A white blanket covered her mouth – Her mouth covered a white blanket).
One of the authors assigned the 1-5 ratings for each sentence in a reference set. We naturally regard
this as a ”weak” point in our dataset. As we discuss in the Conclusion, we intend to use crowd sourcing
to obtain more broadly based and reliable speaker annotation for our examples.
Our corpus has the advantage of being suitable for both training a binary classifier and developing a
model to predict gradient paraphrase judgments. For the former, we simply map every score above a
given threshold to 1, and every score below it to 0. For gradient classification we
use all the scoring labels to test the correlation between a system’s ordering performance and our human
judgments. We will show how, once a model has been trained for a binary detection task, we can check
its performance on the gradient ordering task.
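In code, the binarisation step described here reduces to a simple thresholding of the annotation scores; the threshold value in this sketch is a parameter rather than a figure fixed by the text.

```python
# Minimal sketch of the label mapping described above: gradient scores above a
# chosen threshold become 1 (paraphrase), the rest become 0 (non-paraphrase).
# The threshold value is a free parameter, not a figure fixed by the text here.
def binarize(gradient_labels, threshold):
    return [1 if score > threshold else 0 for score in gradient_labels]

# Example: 1-5 annotations for four candidate paraphrases of one reference sentence.
print(binarize([4, 3, 2, 1], threshold=2))  # -> [1, 1, 0, 0]
```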
Our DNN architecture consists of three main components:
1. Two encoders that learn the representation of the two sentences separately;
2. A unified layer that merges the output of the encoders;
3. A final set of fully connected layers that work on the merged representation of the two sentences
to generate a judgment.
The encoder for each pair of sentences taken as input is composed of two parallel Convolutional
Neural Networks and LSTM RNNs, feeding two sequenced fully connected layers.
The first layer of our encoders is a CNN with 50 filters of length 5. CNNs have been successfully
applied to problems in computational semantics, such as text classification and sentiment analysis (Lai
et al., 2015), as well as to paraphrase recognition (Socher et al., 2011). In this part of our model, the
encoder learns a more compact representation of the sentence, with reduced vector space dimensions and
features. This permits the NN to focus on the information most relevant to paraphrase identification.
We use an ”Atrous” Convolutional Neural Network (Giusti et al., 2013; Chen et al., 2016). An
”Atrous” CNN is a modified form of Convolutional Network designed to reduce the risk of losing impor-
tant information in max pooling. In the case of a standard CNN, max pooling will perform a reduction
of the output of the convolutional layer, selecting only some information contained in it. In the case of
image processing, for example, a 2x2 max pooling on the so-called ”map” returned by the convolutional
layer will create a smaller map that does not contain information from the entire original map, but only
from a specific region of such map, or mirroring a specific pattern in the original image: for example, all
the patches whose upper left corner lies on even coordinates on the map (Giusti et al., 2013). This way
of processing information can undermine the results when complex inputs are involved. An Atrous net-
work fragments the map returned by the max pooling layer, so that each fragment contains information
independent of the other fragments, and each reduced map contains information from all the patches of
the input. This is a good strategy for speeding up processing time by avoiding redundant computation.
The output of each CNN is passed through a max pooling layer to an LSTM RNN. Since the CNN
and the max pooling layer perform discriminative reduction of the input dimensionality, we can run
a large LSTM RNN model (50 smart cells) without substantial computational cost. In this phase of
processing, the vector dimensions of the sentence representation are further reduced, with relevant infor-
mation (hopefully) conserved and highlighted, particularly for the sequential structure of the data. Each
encoder is completed by two successive fully connected layers of dimensions 50 and 300, respectively,
that produce a vector representation for an input sentence in the pair. The first one has a .5 dropout rate.
The 300 dimensional outputs of the two encoders are then passed to a layer that merges them into a
single vector. We found that simple vector concatenation was the best option for performing this merge.
To measure the similarity of two sequences our model only makes use of the information contained in
the merged version of the encoders’ output. We did not use a device in the merging phase to assess
similarity between two sequences. The merging layer feeds the concatenated input to a series of five
fully connected layers. The last layer applies a sigmoid function to produce the classifier judgment.
While the sigmoid function performs well for binary classification, it returns a gradient over its input,
thus generating an ordering of values for the ranking task.
These three kinds of Neural Network capture information in different ways. They can be combined
to achieve a better global representation of sentence input. Specifically, while a CNN can reduce the
spectral variance of input, an LSTM RNN is designed to model its sequential dimension over time. The
CNN manages to reduce the input’s dimensionality while keeping the ordering information of the original
sentence. This information will then be processed by the LSTM RNN, which is particularly well suited
for handling words sequenced through time.
Also, an LSTM RNN’s performance can be strongly improved by providing it with better features
(Pascanu et al., 2014). In our case this is accomplished by the CNN. The densely connected layers create
clearer, more separable final vector representations of the data. To encode the original sentences we used
Word2Vec embeddings pre-trained on Google News (Mikolov et al., 2013).
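The following Keras-style sketch reconstructs the encoder-plus-merge architecture described above from the prose alone; it is not the original implementation, and details such as the padded sequence length, the dilated convolution standing in for the "Atrous" CNN, and the width of the post-merge dense layers are assumptions.

```python
# Keras sketch of the two-encoder DNN described above (a reconstruction from the
# prose, not the original code). Sentences are assumed to arrive as padded
# sequences of 300-dimensional Word2Vec vectors; MAX_LEN is an assumption.
from tensorflow.keras import layers, models

MAX_LEN, EMB_DIM = 30, 300  # assumed padding length; Word2Vec dimensionality

def build_encoder():
    inp = layers.Input(shape=(MAX_LEN, EMB_DIM))
    # an ordinary dilated convolution stands in here for the "Atrous" CNN
    x = layers.Conv1D(50, 5, dilation_rate=2, padding="same", activation="relu")(inp)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.LSTM(50)(x)
    x = layers.Dense(50, activation="relu")(x)
    x = layers.Dropout(0.5)(x)          # .5 dropout on the first dense layer
    x = layers.Dense(300, activation="relu")(x)
    return inp, x

in_a, enc_a = build_encoder()
in_b, enc_b = build_encoder()
merged = layers.Concatenate()([enc_a, enc_b])   # simple vector concatenation
x = merged
for _ in range(4):                              # post-merge dense stack; width 100 is an assumption
    x = layers.Dense(100, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)  # binary judgment; its value also orders candidates

model = models.Model([in_a, in_b], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```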
Table 1 gives the binary accuracy, and ranked ordering Pearson correlation performance of our model,
over 10 fold validation, after 200 epochs.
Table 2 presents accuracy and F1 for different versions of our model. The baseline is the model’s per-
formance without any training. We compute the baseline by relying solely on the pre-loaded Word2Vec
lexical embedding content of the words’ distributional vectors to obtain a semantic similarity judgment.
No learning from our corpus annotation is involved. The sentence’s vectors are still reduced to a single
vector through the LSTM layer, but this is done without corpus based supervision or training.
Our model reaches an average accuracy of 70.0 over 10 fold cross-validation. We see that our architecture learned to recognize different
semantic and syntactic phenomena with a promising level of accuracy, although it is not state of the art
in paraphrase recognition for systems trained on large corpora, such as the Microsoft Paraphrase Corpus
(Ji and Eisenstein, 2013). 1
A small corpus may cause instability in results. Interestingly, we found that our DNN is able to
generalize consistently on the following patterns:
• Negation. This is a rich man’s world – This is not a rich man’s world. Non-Paraphrase;
• Subject–Object permutation. The man follows the wolf – The wolf follows the man. Non-
Paraphrase;
• Active–Passive relation. A white blanket covered her mouth – Her mouth was covered with a
white blanket. Paraphrase;
• Various cases of loose paraphrase The man follows the wolf – The person follows the animal.
Paraphrase.
However, our model had trouble with several other cases, some due to its lack of relevant world
knowledge, and others because of its limited capacity for semantically driven inference. These include:
• Time expressions. It was morning – It was noon. Non-Paraphrase;
• Some cases of antonymy. This is not good – This is bad. Paraphrase;
• Space expressions. Some years ago I was going to school when I met a man – Some years ago I
was going to church when I met a man. Non-paraphrase.
Predictably, the model has difficulty in learning a pattern or a phrase when it is underrepresented in
the training data. In some cases, the effect of data scarcity can be observed in an "overfit weighting" of
specific words. We believe that these idiosyncrasies can be overcome through training on a larger set.
1 This is to be expected, given the specific nature of the task and the small dimensions of our dataset. It is also worth noting
that, while sentences in the Microsoft Paraphrase Corpus are generally longer, our corpus contains a much larger variety of
syntactic and semantic patterns, including "more difficult" cases, like passive-active change and negation.
Figure 2: A more abstract representation of our full model. Sequential 2 and sequential 3 are encoders
of the kind specified in Figure 1. Their outputs are concatenated in merge 1 and fed to a series of dense
layers. Dropout 3 has a rate of 0.2
We observe that, on occasion, the model’s errors are in the gray area between clear paraphrase and
clear non-paraphrase. Here the correctness of a label is not obvious. For example, the pair I am so sleepy
I can barely stand – I am sleep deprived can be considered to be a loose paraphrase pair, or they can be
taken as an instance of non-paraphrase.
Table 1: Accuracy (on the binary task) and Pearson Correlation (on the ordering task) Over Ten Fold
Validation Testing after 200 epochs. The accuracy reported in the paper is an average over these results.
Model Accuracy F1
Baseline (without training) 42.1 59.3
Our model 78.0 74.6
Encoders without LSTM 65.9 68.9
Encoders without ACNN 69.5 50.8
Just one layer after concatenation 73.0 70.0
Using CNN instead of ACNN 76.6 76.0
ACNN with 10 filters 70.4 68.1
LSTM with 10 filters 69.0 71.3
Without dropouts 72.6 71.0
Merging via multiplication 72.6 71.1
Encoders without dense layers 72.2 71.7
Table 2: Accuracy for different versions of the model after 200 epochs. Each model ran on our standard
train and test data, without our performing cross-validation.
Since we use representations learned for one task in order to perform a new task, we consider it a form of transfer learning from a supervised binary
context (assigning a 0/1 value to a pair of sentences) to an unsupervised ordering problem (ranking a set
of sentences). In this case, our corpus allowed us to perform double transfer learning. First, we use word
embeddings trained to maximize single words’ contextual similarity, in order to train on a supervised
binary paraphrase dataset. Then, we use the representations acquired in this way to perform an ordering
task for which the DNN has not been trained.
The fact that ranked correlations are sustained through binary paraphrase classification is not an obvi-
ous result. A model trained on {0,1} labels could ”polarize” its scores to the point where no meaningful
ordering would be available. Had this happened, a good performance in a binary task would actually con-
ceal the loss of important semantic information. Xu et al. (2015), discussing the relation of paraphrase
identification to the recognition of semantic similarity, observe that there is no necessary connection be-
tween binary classification and prediction of gradient labels, and that an increase in one can even produce
a loss in the other.
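A simple way to verify that the sigmoid scores have not been polarized in this way is to correlate them directly with the gradient annotations, as in the sketch below (the values shown are illustrative, not our results).

```python
# Illustrative check that binary training has not "polarized" the scores:
# correlate the sigmoid outputs with the original gradient annotations.
from scipy.stats import pearsonr

gold_gradient = [5, 4, 2, 1, 3, 4]                      # 1-5 human labels for candidate paraphrases
sigmoid_scores = [0.93, 0.71, 0.35, 0.08, 0.44, 0.66]   # model outputs for the same pairs

r, p = pearsonr(gold_gradient, sigmoid_scores)
print("Pearson r:", r)  # a meaningful r indicates an informative ordering survives binary training
```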
7 Acknowledgments
We are grateful to three anonymous reviewers for helpful comments and suggestions on an earlier draft
of this paper. The research reported in this paper was supported by a grant from the Swedish Research
Council for the establishment of the Centre for Linguistic Theory and Studies in Probability (CLASP) at
the University of Gothenburg. We would also like to thank our colleagues at CLASP at the University of
Gothenburg for useful discussion of many of the ideas presented here. We are solely responsible for any
errors which may remain in this paper.
References
Agirre, E., C. Banea, D. M. Cer, M. T. Diab, A. Gonzalez-Agirre, R. Mihalcea, G. Rigau, and J. Wiebe
(2016). Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation.
In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT
2016, San Diego, CA, USA, June 16-17, 2016, pp. 497–511.
Chen, L., G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2016). Deeplab: Seman-
tic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.
CoRR abs/1606.00915.
Dolan, B., C. Quirk, and C. Brockett (2004). Unsupervised construction of large paraphrase corpora:
Exploiting massively parallel news sources. In Proceedings of the 20th International Conference
on Computational Linguistics, COLING ’04, Stroudsburg, PA, USA. Association for Computational
Linguistics.
Fernando, S. and M. Stevenson (2008). A semantic similarity approach to paraphrase detection. Com-
putational Linguistics UK (CLUK 2008) 11th Annual Research Colloquium.
Filice, S., G. Da San Martino, and A. Moschitti (2015). Structural representations for learning relations
between pairs of texts, Volume 1, pp. 1003–1013. Association for Computational Linguistics (ACL).
Francis, W. N. and H. Kucera (1979). Brown corpus manual. Technical report, Department of Linguistics,
Brown University, Providence, Rhode Island, US.
Giusti, A., D. C. Ciresan, J. Masci, L. M. Gambardella, and J. Schmidhuber (2013). Fast image scanning
with deep max-pooling convolutional neural networks. In Image Processing (ICIP), 2013 20th IEEE
International Conference on, pp. 4034–4038. IEEE.
He, H., K. Gimpel, and J. Lin (2015, September). Multi-perspective sentence similarity modeling with
convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in
Natural Language Processing, Lisbon, Portugal, pp. 1576–1586. Association for Computational Lin-
guistics.
Lai, S., L. Xu, K. Liu, and J. Zhao (2015). Recurrent convolutional neural networks for text classification.
In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pp. 2267–
2273. AAAI Press.
Madnani, N., J. Tetreault, and M. Chodorow (2012). Re-examining machine translation metrics for
paraphrase identification. In Proceedings of the 2012 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, NAACL HLT ’12,
Stroudsburg, PA, USA, pp. 182–190. Association for Computational Linguistics.
Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013). Distributed representations of
words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahra-
mani, and K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26, pp. 3111–
3119. Curran Associates, Inc.
Pascanu, R., C. Gulcehre, K. Cho, and Y. Bengio (2014). How to construct deep recurrent neural
networks.
Socher, R., E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning (2011). Dynamic Pooling
and Unfolding Recursive Autoencoders for Paraphrase Detection. In Advances in Neural Information
Processing Systems 24.
Tai, K. S., R. Socher, and C. D. Manning (2015). Improved semantic representations from tree-structured
long short-term memory networks. CoRR abs/1503.00075.
Xu, W., C. Callison-Burch, and B. Dolan (2015, June). Semeval-2015 task 1: Paraphrase and semantic
similarity in twitter (pit). In Proceedings of the 9th International Workshop on Semantic Evaluation
(SemEval 2015), Denver, Colorado, pp. 1–11. Association for Computational Linguistics.
Yin, W. and H. Schütze (2015). Convolutional neural network for paraphrase identification. In NAACL
HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pp.
901–911.
Study V
Predicting Human Metaphor Paraphrase Judgments with Deep Neural
Networks
Abstract

We propose a new annotated corpus for metaphor interpretation by paraphrase, and a novel DNN model for performing this task. Our corpus consists of 200 sets of 5 sentences, with each set containing one reference metaphorical sentence, and four ranked candidate paraphrases. Our model is trained for a binary classification of paraphrase candidates, and then used to predict graded paraphrase acceptability. It reaches an encouraging 75% accuracy on the binary classification task, and high Pearson (.75) and Spearman (.68) correlations on the gradient judgment prediction task.

1 Introduction

Metaphor is an increasingly studied phenomenon in computational linguistics. But while metaphor detection has received considerable attention in the NLP literature (Dunn et al., 2014; Veale et al., 2016) and in corpus linguistics (Krennmayr, 2015) in recent years, not much work has focused on the task of metaphor paraphrasing - assigning an appropriate interpretation to a metaphorical expression. Moreover, there are few (if any) annotated corpora of metaphor paraphrases (Shutova and Teufel, 2010). The main papers in this area are Shutova (2010), and Bollegala and Shutova (2013). The first applies a supervised method combining WordNet and distributional word vectors to produce the best paraphrase of a single verb used metaphorically in a sentence. The second approach, conceptually related to the first, builds an unsupervised system that, given a sentence with a single metaphorical verb and a set of potential paraphrases, selects the most accurate candidate through a combination of mutual information scores and distributional similarity.

Despite the computational and linguistic interest of this task, little research has been devoted to it. Some quantitative analyses of figurative language have involved metaphor interpretation and paraphrasing. These focus on integrating paraphrase into automatic Textual Entailment frames (Agerri, 2008), to explore the properties of distributional semantics in larger-than-word structures (Turney, 2013). Alternatively, they study the sentiment features of metaphor usage (Mohammad et al., 2016; Kozareva, 2015). This last aspect of figurative interpretation is considered a particularly hard task and has generated several approaches.

The task of metaphor interpretation is a particular case of paraphrase detection, although this characterization is not unproblematic, as we will see in Section 6.

In Bollegala and Shutova (2013), metaphor paraphrase is treated as a ranking problem. Given a metaphorical usage of a verb in a short sentence, several candidate literal sentences are retrieved from the Web and ranked. This approach requires the authors to create a gradient score to label their paraphrases, a perspective that is now gaining currency in broader semantic similarity tasks (Xu et al., 2015; Agirre et al., 2016).

Mohammad et al. (2016) resort to metaphor paraphrasing in order to perform a quantitative study on the emotions associated with the usage of metaphors. They create a small corpus of paraphrase pairs formed from a metaphorical expression and a literal equivalent. They ask candidates to judge the degree of "emotionality" conveyed by the metaphorical and the literal expressions. While the study has shown that metaphorical paraphrases are generally perceived as more emotionally charged than their literal equivalents, a corpus of this kind has not been used to train a computational model for metaphor paraphrase scoring.

In this paper we present a new dataset for
metaphor paraphrase identification and ranking. In our corpus, paraphrase recognition is treated as an ordering problem, where sets of sentences are ranked with respect to a reference metaphor sentence.

The main difference with respect to existing work in this field consists in the syntactic and semantic diversity covered by our dataset. The metaphors in our corpus are not confined to a single part of speech. We introduce metaphorical examples of nouns, adjectives, verbs and a number of multi-word metaphors.

Our corpus is, to the best of our knowledge, the largest existing dataset for metaphor paraphrase detection and ranking.

As we describe in Section 2, it is composed of groups of five sentences: one metaphor, and four candidates that can be ranked as its literal paraphrases.

The inspiration for the structure of our dataset comes from a recent work on paraphrase (Bizzoni and Lappin, 2017), where a similarly organized dataset was introduced to deal with paraphrase detection.

In our work, we use an analogous structure to model metaphor paraphrase. Also, while Bizzoni and Lappin (2017) present a corpus annotated by a single human, each paraphrase set in our corpus was judged by 20 different Amazon Mechanical Turk (AMT) annotators, making the grading of our sentences more robust and reliable (see Section 2.1).

We use this corpus to test a neural network model formed by a combination of Convolutional Neural Networks (CNNs) and Long Short Term Memory Recurrent Neural Networks (LSTM RNNs). We test this model on two classification problems: (i) binary paraphrase classification and (ii) paraphrase ranking. We show that our system can achieve significant correlation with human judgments on the ranking task as a by-product of supervised binary learning. To the best of our knowledge, this is the first work in metaphor paraphrasing to use supervised gradient representations.

2 A New Corpus for Metaphor Paraphrase Evaluation

We present a dataset for metaphor paraphrase designed to allow users to rank non-metaphorical candidates as paraphrases of a metaphorical sentence or expression. Our corpus is formed of 200 sets of five sentence paraphrase candidates for a metaphorical sentence or expression.1

1 Our annotated data set and the code for our model is available at https://github.com/yuri-bizzoni/Metaphor-Paraphrase .

In each set, the first sentence contains a metaphor, and it provides the reference sentence to be paraphrased. The remaining four sentences are labeled on a 1-4 scale based on the degree to which they paraphrase the reference sentence. This is on analogy with the annotation frame used for SemEval Semantic Similarity tasks (Agirre et al., 2016). Broadly, our labels represent the following categories:

1 Two sentences cannot be considered paraphrases.

2 Two sentences cannot be considered paraphrases, but they show a degree of semantic similarity.

3 Two sentences could be considered paraphrases, although they present some important difference in style or content (they are not strong paraphrases).

4 Two sentences are strong paraphrases.

On average, every group of five sentences contains a strong paraphrase, a loose paraphrase and two non-paraphrases, one of which may use some relevant words from the metaphor in question.2

2 Some of the problems raised by the concept of paraphrase in figurative language are discussed in Section 6.

The following examples illustrate these ranking labels.

• Metaphor: The crowd was a river in the street
  – The crowd was large and impetuous in the street. Score: 4
  – There were a lot of people in the street. Score: 3
  – There were few people in the street. Score: 2
  – We reached a river at the end of the street. Score: 1

We believe that this annotation scheme is useful. While it sustains graded semantic similarity labels, it also provides sets of semantically related
elements, each one of which can be scored or ordered independently of the others. Therefore, the metaphorical sentence can be tested separately for each literal candidate in the set in a binary classification task.

In the test phase, the annotation scheme allows us to observe how a system represents the similarity between a metaphorical and a literal sentence by taking the scores of two candidates as points of relative proximity to the metaphor.

It can be argued that a good literal paraphrase of a metaphor needs to compensate to some extent for the expressive or sentimental bias that a metaphor usually supplies, as argued in Mohammad et al. (2016). In general a binary classification can be misleading because it conceals the different levels of similarity between competing candidates.

For example, the literal sentence Republican candidates during the convention were terrible can be considered to be a loose paraphrase of the metaphor The Republican convention was a horror show, or alternatively, as a semantically related non-paraphrase. Which of these conclusions we adopt depends on our decision concerning how much interpretative content a literal sentence needs to provide in order to qualify as a valid paraphrase of a metaphor. The question whether the two sentences are acceptable paraphrases or not can be hard to answer. By contrast, it would be far-fetched to suggest that The Republican convention was a joy to follow is a better or even equally strong literal paraphrase for The Republican convention was a horror show.

In this sense, the sentences Her new occupation was a dream come true and She liked her new occupation can be considered to be loose paraphrases, in that the term liked can be judged an acceptable, but not ideal interpretation of the more intense metaphorical expression a dream come true. By contrast, She hated her new occupation cannot be plausibly regarded as more similar in meaning than She liked her new occupation to Her new occupation was a dream come true.

Our training dataset is divided into four main sections:

1. Noun phrase Metaphors: My lawyer is an angel.

2. Adjective Metaphors: The rich man had a cold heart.

3. Verb Metaphors: She cut him down with her words.

4. Multi-word Metaphors: The seeds of change were planted in 1943.

All these sentences and their candidates were manually produced to ensure that for each group we have a strong literal paraphrase, a loose literal paraphrase and two semantically related non-paraphrases. Here "semantically related" can indicate either a re-use of the metaphorical words to express a different meaning, or an unacceptable interpretation of the reference metaphor.

Although the paraphrases were generated freely and cover a number of possible (mis)interpretations, we did take several issues into account. For example, for sentiment-related metaphors two opposite interpretations are often proposed, forcing the system to make a choice between two sentiment poles when ranking the paraphrases (I love my job – I hate my job for My job is a dream). In general, antonymous interpretations (Time passes very fast – Time is slow for Time flies) are listed, when possible, among the four competing choices.

Our corpus has the advantage of being suitable for both binary classification and gradient paraphrase judgment prediction. For the former, we map every score above a given threshold to 1, and scores below that threshold to 0. For gradient classification, we use all the scoring labels to test the correlation between the system's ordered predictions and human judgments. We will show how, once a model has been trained for a binary detection task, we can evaluate its performance on the gradient ordering task.

We stress that our corpus is under development. As far as we know it is unique for the kind of task we are discussing. The main difficulty in building this corpus is that there is no obvious way to collect the data automatically. Even if there were a procedure to extract pairs of paraphrases containing a metaphoric element semi-automatically, it does not seem possible to generate alternative paraphrase candidates automatically.

The reference sentences we chose were either selected from published sources or created manually by the authors. In all cases, the paraphrase candidates had to be crafted manually. We tried to keep a balanced diversity inside the corpus. The dataset is divided among metaphorically used Nouns, Adjectives and Verbs, plus a section of
Multi Word metaphors. The corpus is an attempt to represent metaphor in different parts of speech.

A native speaker of English independently checked all the sentences for acceptability.

2.1 Collecting judgments through AMT

Originally, one author individually annotated the entire corpus. The difference between strong and loose literal paraphrases can be a matter of individual sensibility.

While such annotations could be used as the basis for a preliminary study, we needed more judgments to build a statistically reliable annotated dataset. Therefore we used crowd sourcing to solicit judgments from large numbers of annotators. We collected human judgments on the degree of paraphrasehood for each pair of sentences in a set (with the reference metaphor sentence in the pair) through Amazon Mechanical Turk (AMT).

Annotators were presented with four metaphor - candidate paraphrase pairs, all relating to the same metaphor. They were asked to express a judgment between 1 and 4, according to the scheme given above.

We collected 20 human judgments for each metaphor - candidate paraphrase pair. Analyzing individual annotators' response patterns, we were able to filter out a small number of "rogue" annotators (less than 10%). This filtering process was based on annotators' answers to some control elements inserted in the corpus, and evaluation of their overall performance. For example, an annotator who consistently assigned the same score to all sentences is classified as "rogue".

We then computed the mean judgment for each sentence pair and compared it with the original judgments expressed by one of the authors. We found a high Pearson correlation between the annotators' mean judgments and the author's judgment of close to 0.93.

The annotators' understanding of the problem and their evaluation of the sentence pairs seem, on average, to correspond very closely to that of our original single annotator. The high correlation also suggests a small level of variation from the mean across AMT annotators. Finally, a similar correlation strengthens the hypothesis that paraphrase detection is better modeled as an ordering, rather than a binary, task. If this had not been the case, we would expect more polarized judgments tending towards the highest and lowest scores, instead of the more evenly distributed judgment patterns that we observed.

These mean judgments appear to provide reliable data for supervision of a machine learning model. We thus set the upper bound for the performance of a machine learning algorithm trained on this data to be around .9, on the basis of the Pearson correlation with the original single annotator scores. In what follows, we refer to the mean judgments of AMT annotators as our gold standard when evaluating our results, unless otherwise indicated.

3 A DNN for Metaphor Paraphrase Classification

For classification and gradient judgment prediction we constructed a deep neural network. Its architecture consists of three main components:

1. Two encoders that learn the representation of the two sentences separately

2. A unified layer that merges the output of the encoders

3. A final set of fully connected layers that operate on the merged representation of the two sentences to generate a judgment.

The encoder for each pair of sentences taken as input is composed of two parallel Convolutional Neural Networks (CNNs) and LSTM RNNs, feeding two sequenced fully connected layers. We use an "Atrous" CNN (Chen et al., 2016). Interestingly, classical CNNs only decrease our accuracy by approximately two points and reach a good F1 score, as Table 1 indicates.

Using a CNN (we apply 25 filters of length 5) as a first layer proved to be an efficient strategy. While CNNs were originally introduced in the field of computer vision, they have been successfully applied to problems in computational semantics, such as text classification and sentiment analysis (Lai et al., 2015), as well as to paraphrase recognition (Socher et al., 2011). In NLP applications, CNNs usually abstract over a series of word- or character-level embeddings, instead of pixels. In this part of our model, the encoder learns a more compact representation of the sentence, with reduced vector space dimensions and features. This permits the entire DNN to focus on the information most relevant to paraphrase identification.
The output of each CNN is passed through a max pooling layer to an LSTM RNN. Since the CNN and the max pooling layer perform discriminative reduction of the input's dimensions, we can run a relatively small LSTM RNN model (20 hidden units). In this phase, the vector dimensions of the sentence representation are further reduced, with relevant information conserved and highlighted, particularly for the sequential structure of the data. Each encoder is completed by two successive fully connected layers, of dimensions 15 and 10 respectively, the first one having a 0.5 dropout rate.

Figure 1: Example of an encoder. Input is passed to a CNN, a max pooling layer, an LSTM RNN, and finally two fully connected layers, the first having a dropout rate of .5. The input's and output's shape is indicated in brackets for each layer.

Each sentence is thus transformed to a 10 dimensional vector. To perform the final comparison, these two low dimensional vectors are passed to a layer that merges them into a single vector. We tried several ways of merging the encoders' outputs, and we found that simple vector concatenation was the best option. We produce a 20 dimensional two-sentence vector as the final output of the DNN.

We do not apply any special mechanism for "comparison" or "alignment" in this phase. To measure the similarity of two sequences our model makes use only of the information contained in the merged vector that the encoders produce. We did not use a device in the merging phase to assess similarity between the two sequences. This allows a high degree of freedom in the interpretation patterns we are trying to model, but it also involves a fair amount of noise, which increases the risk of error.

The merging layer feeds the concatenated input to a final fully connected layer. The last layer applies a sigmoid function to produce the judgments. The advantage of using a sigmoid function in this case is that, while it performs well for binary classification, it returns a gradient over its input, thus generating an ordering of values appropriate for the ranking task. The combination of these three kinds of Neural Networks in this order (CNN, LSTM RNN and fully connected layers) has been explored in other works, with interesting results (Sainath et al., 2015). This research has indicated that these architectures can complement each other in complex semantic tasks, such as sentiment analysis (Wang et al., 2016) and text representation (Vosoughi et al., 2016).

The fundamental idea here is that these three kinds of Neural Network capture information in different ways that can be combined to achieve a better global representation of sentence input. While a CNN can reduce the spectral variance of input, an LSTM RNN is designed to model its sequential temporal dimension. At the same time, an LSTM RNN's performance can be strongly improved by providing it with better features (Pascanu et al., 2014), such as the ones produced by a CNN, as happens in our case. The densely connected layers contribute a clearer, more separable final vector representation of one sentence.

To encode the original sentences we used Word2Vec embeddings pre-trained on the very large Google News dataset (Mikolov et al., 2013). We used these embeddings to create the input sequences for our model.

We take as a baseline for evaluating our model the cosine similarity of the sentence vectors, obtained through combining their respective pre-trained lexical embeddings. This baseline gives very low accuracy and F1 scores.

4 Binary Classification Task

As discussed above, our corpus can be applied to model two sub-problems: binary classification and paraphrase ordering.

To use our corpus for a binary classification task
Model Accuracy F1 ranking (Bollegala and Shutova, 2013), where
Baseline (cosine similarity) 50.8 10.1
Our model 75.2 74.6 the metaphorical element is explicitly identified,
Encoders without LSTM 64.4 64.9 and the candidates don’t contain any syntactic-
Encoders without ACNN 62.6 61.5 semantic expansion, our results are encouraging.3
Using CNN instead of ACNN 61.0 61.6
ACNN with 10 filters 73.4 71.7 Although a small corpus may cause instability
LSTM with 10 filters 72.3 70.6 in results, our DNN seems able to generalize with
Merging via multiplication 53.4 69.6
Aligner 49.4 61.6
relative consistency on the following patterns:
Aligner + our model 73.4 75. • Sentiment. My life in California was a night-
Table 1: Accuracy for different versions of the model, mare – My life in California was terrible. Our
and the baseline. Each version ran on our standard system seems able to discriminate the right
train and test data, without performing cross-validation. sentiment polarity of a metaphor by picking
We use as a baseline the cosine similarity between the the right paraphrase, even when some can-
mean of the word vectors composing each sentence. didates contain sentiment words of opposite
polarity, which are usually very similar in a
To use our corpus for a binary classification task, we map each set of five sentences into a series of pairs, where the first element is the metaphor we want to interpret and the second element is one of its four literal candidates.

Gradient labels are then replaced by binary ones. We consider all labels higher than 2 as positive judgments (Paraphrase) and all labels less than or equal to 2 as negative judgments (Non-Paraphrase), reflecting the ranking discussed in Section 2. We train our model with these labels for a binary metaphor paraphrase detection task.

Keeping the order of the input fixed (we will discuss this issue below), we ran the training phase for 15 epochs.
We reached an average accuracy of 67% with 12-fold cross-validation. Interestingly, when trained on the pre-defined training set only, our model reaches the higher accuracy of 75%. We strongly suspect that this discrepancy in performance is due to the small training and test sets created by the partitions of the 12-fold cross-validation process.

In general, this task is particularly hard, both because of the complexity of the semantic properties involved in accurate paraphrase (see Section 4.1), and because of the limited size of the training set. It seems to us that an average accuracy of 67% on a 12-fold partitioning of training and test sets is a reasonable result, given the size of our corpus.
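The evaluation protocol described here can be sketched as follows. The binarization threshold and the 15 training epochs follow the text above; the variable names, and the assumption that the model exposes Keras-style fit and predict methods, are ours.

```python
import numpy as np
from sklearn.model_selection import KFold

def binarize(labels):
    """Labels > 2 count as Paraphrase (1), labels <= 2 as Non-Paraphrase (0)."""
    return (np.asarray(labels) > 2).astype(int)

def twelve_fold_accuracy(x_metaphor, x_candidate, gradient_labels, build_model, epochs=15):
    """Train a fresh model on each of 12 folds and return the mean test accuracy."""
    y = binarize(gradient_labels)
    accuracies = []
    for train, test in KFold(n_splits=12, shuffle=True).split(np.arange(len(y))):
        model = build_model()                      # fresh, untrained DNN per fold
        model.fit([x_metaphor[train], x_candidate[train]], y[train],
                  epochs=epochs, verbose=0)
        preds = (model.predict([x_metaphor[test], x_candidate[test]]) >= 0.5).astype(int)
        accuracies.append(float(np.mean(preds.ravel() == y[test])))
    return float(np.mean(accuracies))
```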
We observe that our architecture learned to recognize different semantic phenomena related to metaphor interpretation with a promising level of accuracy, but such phenomena need to be represented in the training set. In light of the fact that previous work in this field is concerned with single-verb paraphrase ranking (Bollegala and Shutova, 2013), where the metaphorical element is explicitly identified and the candidates don't contain any syntactic-semantic expansion, our results are encouraging.[3]

[3] It should be noted that Bollegala and Shutova (2013) employ an unsupervised approach.

Although a small corpus may cause instability in results, our DNN seems able to generalize with relative consistency on the following patterns:

• Sentiment. My life in California was a nightmare – My life in California was terrible. Our system seems able to discriminate the right sentiment polarity of a metaphor by picking the right paraphrase, even when some candidates contain sentiment words of opposite polarity, which are usually very similar in a distributional space.

• Non-metaphorical word re-use. Our system seems able, in several cases, to discriminate the correct paraphrase for a metaphor, even when some candidates re-use the words of the metaphor to convey a (wrong) literal meaning. My life in California was a dream – I lived in California and had a dream.

• Multi-word metaphors. Although well represented in our corpus, multi-word metaphors are in some respects the most difficult to paraphrase correctly, since the interpretation has to be extended over a number of words. Nonetheless, our model was able to handle these correctly in a number of situations. You can plant the seeds of anger – You can act in a way that will engender rage.

However, our model had trouble with several other cases. It seems to have particular difficulty in discriminating sentiment intensity: higher scores are sometimes assigned to paraphrases that do not preserve the sentiment intensity of the metaphor, which creates problems in several instances. Also, cases of metaphoric exaggeration (My roommate is a sport maniac – My roommate is a sport person), negation (My roommate was not an eagle – My roommate was dumb) and syntactic inversions pose difficulties for our models.

We found that our model is able to abstract over specific patterns, but, predictably, it has difficulty in learning when the semantic focus of an interpretation consists in a phrase that is under-represented in the training data.
In some cases, the effect of data scarcity can be observed in an "overfit weighting" of specific terms. Some words that were seen in the data only once are associated with a high or low score independently of their context, degrading the overall performance of the model. We believe that these idiosyncrasies can be overcome through training on a larger data set.

4.1 The gray areas of interpretation

We observe that, on occasion, the model's errors fall into a gray area between clear paraphrase and clear non-paraphrase. Here the correctness of a label is not obvious. These cases are particularly important in metaphor paraphrasing, since this task requires an interpretative leap from the metaphor to its literal equivalent. For example, the pair I was home watching the days slip by from my window – I was home thinking about the time I was wasting can be considered as a loose paraphrase pair. Alternatively, it can be regarded as a case of non-paraphrase, since the second element introduces some interpretative elements (I was thinking about the time) that are not in the original. In our test set we labeled it as 3 (loose paraphrase), but if our system fails to label it correctly in a binary task, it is not entirely clear that it is making an error. For these cases, the approach presented in the next section is particularly useful.

5 Paraphrase Ordering Task

The high degree of correlation we found between the AMT annotations and our single annotator's judgments indicates that we can use this dataset for an ordering task as well. Since the human judgments we collected about the "degree of paraphrasehood" are quite consistent, it is reasonable to pursue a non-binary approach.

Once the DNN has learned representations for binary classification, we can apply it to rank the sentences of the test set in order of similarity. We apply the sigmoid value distribution for the candidate sentences in a set of five (the reference and four candidates) to determine the ranking. To do this we use the original structure of our dataset, composed of sets of five sentences. First, we assign a similarity score to all pairs of sentences (reference sentence and candidate paraphrase) in a set. This is the similarity score learned in the binary task, so it is determined by the sigmoid function applied to the output.

The following is an example of an ordered set with strong correlation between the model's predictions (the score at the start of each line) and our annotations (the rank in parentheses):

• The candidate is a fox
  – 0.13 (rank 1) The candidate owns a fox
  – 0.30 (rank 2) The candidate is stupid
  – 0.41 (rank 3) The candidate is intelligent
  – 0.64 (rank 4) The candidate is a cunning person

We compute the average Pearson and Spearman correlations on all sets of the test corpus, to check the extent to which the ranking that our DNN produces matches our mean crowd source human annotations. While Pearson correlation measures the relationship between two continuous variables, Spearman correlation evaluates the monotonic relation between two variables, continuous or ordinal. Since the first of our variables, the model's judgment, is continuous, while the second one, the human labels, is ordinal, both measures are of interest.
We found comparable and meaningful correlations between mean AMT rankings and the ordering that our model predicts, on both metrics. On the balanced training and test set, we achieve an average Pearson correlation of 0.75 and an average Spearman correlation of 0.68. On a twelve-fold cross-validation frame, we achieve an average Pearson correlation of 0.55 and an average Spearman correlation of 0.54. We chose twelve-fold cross-validation because it is the smallest partition we can use to get meaningful results. We conjecture that the average cross-fold validation performance is lower because of the small size of the training data in each fold. These results are displayed in Table 2.[4]

[4] As discussed above, the upper bound for our model's performance can be set at 0.9, the correlation between our single annotator's and the mean crowd sourced judgments.

These correlations indicate that our model achieves an encouraging level of accuracy in predicting our gradient annotations for the candidate sentences in a set when trained for a binary classification task.
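A sketch of this evaluation with SciPy is given below. The data layout (four scored candidates per reference metaphor) follows the description above, while the function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def set_correlations(model_scores, human_means):
    """model_scores, human_means: four values each, one per candidate in a set.
    Returns (Pearson, Spearman) between the model's sigmoid scores and the
    mean crowd-sourced ratings for that set."""
    return pearsonr(model_scores, human_means)[0], spearmanr(model_scores, human_means)[0]

def average_correlations(all_sets):
    """all_sets: iterable of (model_scores, human_means) tuples, one per reference metaphor."""
    rs = np.array([set_correlations(m, h) for m, h in all_sets])
    return rs[:, 0].mean(), rs[:, 1].mean()

# Example: the ordered set discussed above ("The candidate is a fox")
print(set_correlations([0.13, 0.30, 0.41, 0.64], [1, 2, 3, 4]))
```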
This task differs from the binary classification task in several important respects. In one way, it is easier. A non-paraphrase can be misjudged as a paraphrase and still appear in the right order within a ranking. In another sense, it is more difficult. Strict paraphrases, loose paraphrases, and various kinds of semantically similar non-paraphrases have to be ordered in accord with human judgment patterns, which is a more complex task than simple binary classification.

Measure                12-fold value   Baseline
Accuracy               67              51
Pearson correlation    0.553           0.151
Spearman correlation   0.545           0.113

Table 2: Accuracy and ranking correlation for twelve-fold cross-validation. It can be seen that the simple cosine similarity between the mean vectors of the two sentences, which we use as the baseline, returns a low correlation with human judgments.
We should consider to what extent this task is different from a multi-class categorization problem. Broadly, multi-class categorization requires a system for linking a pair of sentences to a specific class of similarity. This is dependent upon the classes defined by the annotator and presented in the training phase. In several cases determining these ranked categories might be problematic. A class corresponding to our label "3", for example, could contain many different phenomena related to metaphor paraphrase: expansions, reformulations, reduction in the expressivity of the sentence, or particular interpretations of the metaphor's meaning. Our way of formulating the ordering task allows us to overcome this problem. A paraphrase containing an expansion and a paraphrase involving some information loss, both labeled as "3", might have quite different scoring, but they still fall between all "2" elements and all "4" elements in a ranking.

We can see that our gradient ranking system provides a more nuanced view of the paraphrase relation than a binary classification. Consider the following example:
• My life in California was a dream
  – 0.03 (rank 1) I had a dream once
  – 0.05 (rank 2) While living in California I had a dream
  – 0.11 (rank 3) My life in California was nice, I enjoyed it
  – 0.58 (rank 4) My life in California was absolutely great

The human annotators consider the pair My life in California was a dream – My life in California was nice, I enjoyed it as loose paraphrases, while the model scored it very low. But the difference in sentiment intensity between the metaphor and the literal candidate renders the semantic relation between the two sentences less than perspicuous. Such intensity is instead present in My life in California was absolutely great, marked as a more valid paraphrase (score 4). On the other hand, it is clear that in the choice between While living in California I had a dream and My life in California was nice, I enjoyed it, the latter is a more reasonable interpretation of the metaphor. The annotators' relative mean ranking has been sustained by our model, even if its absolute scoring involves an error in binary classification.

The correlation between AMT annotation ordering and our model's predictions is a by-product of supervised binary learning. Since we are re-using the predictions of a binary classification task, we consider it a form of transfer learning from a supervised binary context to an unsupervised ordering task. In this case, our corpus allows us to perform double transfer learning. First, we used pretrained word embeddings, trained to maximize single words' contextual similarity, in order to train on a supervised binary paraphrase dataset. Then, we use the representations acquired in this way to perform an ordering task for which the DNN had not been trained.

The fact that ranked correlations are sustained through binary paraphrase classification is not an obvious result. In principle, a model trained on {0,1} labels could "polarize" its scores to the point where no meaningful ordering would be available. Had this happened, a good performance in a binary task would actually conceal the loss of important semantic information. The fact that there is no necessary connection between binary classification and prediction of gradient labels, and that an increase in one can even produce a loss in the other, is pointed out in Xu et al. (2015), who discuss the relation of paraphrase identification to the recognition of semantic similarity.
6 The Nature of the Metaphor Interpretation Task

Although this task resembles a particular case of paraphrase detection, in many respects it is something different. While paraphrase detection concerns learning content identity or strong cases of semantic similarity, our task involves the interpretation of figurative language.

In a traditional paraphrase task, we should maintain that "The candidate is a fox" and "The candidate is cunning" are invalid paraphrases. First, the superficial informational content of the two sentences is different. Second, without further context we might assume that the candidate is an actual fox. We ignore the context of the phrase.

In this task the frame is different. We assume that the first sentence contains a metaphor. We summarize this task by the following question: given that X is a metaphor, which one of the given candidates would be its best literal interpretation?

We trained our model to move along a similar learning pattern. This training frame can produce the apparent, but false, paradox that two acceptable paraphrases such as The Council is on fire and The Council is burning are assigned a low score by our model. If the first element is a metaphor, the second element is, in fact, a bad literal interpretation. A higher score is correctly assigned to the candidate People in the Council are very excited.
7 Conclusions

We present a new kind of corpus to evaluate metaphor paraphrase detection, following the approach presented in Bizzoni and Lappin (2017) for paraphrase grading, and we construct a novel type of DNN architecture for a set of metaphor interpretation tasks. We show that our model learns an effective representation of sentences, starting from the distributional representations of their words. Using word embeddings trained on very large corpora proved to be a fruitful strategy. Our model is able to retrieve from the original semantic spaces not only the primary meaning or denotation of words, but also some of the more subtle semantic aspects involved in the metaphorical use of terms.

We based our corpus' design on the view that paraphrase ranking is a useful way to approach the metaphor interpretation problem. We show how this kind of corpus can be used both for supervised learning of binary classification, and for gradient judgment prediction.

The neural network architecture that we propose encodes each sentence in a 10-dimensional vector representation, combining a CNN, an LSTM RNN, and two densely connected neural layers. The two input representations are merged through concatenation and fed to a series of densely connected layers. We show that such an architecture is able, to an extent, to learn metaphor-to-literal paraphrase.

While binary classification is learned in the training phase, it yields a robust correlation in the ordering task through the sigmoid distributions generated for binary classification. The model learns to classify a sentence as a valid or invalid literal interpretation of a given metaphor, and it retains enough information to assign a gradient value to sets of sentences in a way that correlates with our crowd source annotation.

Our model doesn't use any "alignment" of the data. The encoders' representations are simply concatenated. This gives our DNN considerable flexibility in modeling interpretation patterns. It can also create complications where a simple alignment of two sentences might suffice to identify a similarity. We have considered several possible alternative versions of this model to tackle this issue.

In future we will expand the size and variety of our corpus. We will perform a detailed error analysis of our model's predictions, and we will further explore different kinds of neural network designs for paraphrase detection and ordering. Finally, we intend to study this task "the other way around", by detecting the most appropriate metaphor to paraphrase a literal reference sentence or phrase.

Acknowledgments

We are grateful to our colleagues in the Centre for Linguistic Theory and Studies in Probability (CLASP), FLoV, at the University of Gothenburg for useful discussion of some of the ideas presented in this paper, and to three anonymous reviewers for helpful comments on an earlier draft. The research reported here was done at CLASP, which is supported by a 10 year research grant (grant 2014-39) from the Swedish Research Council.
References

Rodrigo Agerri. 2008. Metaphor in textual entailment. In COLING 2008, 22nd International Conference on Computational Linguistics, Posters Proceedings, 18-22 August 2008, Manchester, UK, pages 3–6. http://www.aclweb.org/anthology/C08-2001.

Eneko Agirre, Carmen Banea, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16-17, 2016, pages 497–511. http://aclweb.org/anthology/S/S16/S16-1081.pdf.

Yuri Bizzoni and Shalom Lappin. 2017. Deep learning of binary and gradient judgments for semantic paraphrase. Proceedings of IWCS 2017.

Danushka Bollegala and Ekaterina Shutova. 2013. Metaphor interpretation using paraphrases extracted from the web. PloS ONE 8(9):e74304.

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2016. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. CoRR abs/1606.00915. http://arxiv.org/abs/1606.00915.

Jonathan Dunn, Jon Beitran De Heredia, Maura Burke, Lisa Gandy, Sergey Kanareykin, Oren Kapah, Matthew Taylor, Dell Hines, Ophir Frieder, David Grossman, et al. 2014. Language-independent ensemble approaches to metaphor identification. In 28th AAAI Conference on Artificial Intelligence, AAAI 2014. AI Access Foundation.

Zornitsa Kozareva. 2015. Multilingual affect polarity and valence prediction in metaphors. In Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA@EMNLP 2015, 17 September 2015, Lisbon, Portugal, page 1. http://aclweb.org/anthology/W/W15/W15-2901.pdf.

Tina Krennmayr. 2015. What corpus linguistics can tell us about metaphor use in newspaper texts. Journalism Studies 16(4):530–546.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI Press, AAAI'15, pages 2267–2273. http://dl.acm.org/citation.cfm?id=2886521.2886636.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, Curran Associates, Inc., pages 3111–3119.

Saif Mohammad, Ekaterina Shutova, and Peter D. Turney. 2016. Metaphor as a medium for emotion: An empirical study. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, *SEM@ACL 2016, Berlin, Germany, 11-12 August 2016. http://aclweb.org/anthology/S/S16/S16-2003.pdf.

Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. How to construct deep recurrent neural networks. Proceedings of the Second International Conference on Learning Representations (ICLR 2014).

Tara N. Sainath, Oriol Vinyals, Andrew W. Senior, and Hasim Sak. 2015. Convolutional, long short-term memory, fully connected deep neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pages 4580–4584. https://doi.org/10.1109/ICASSP.2015.7178838.

Ekaterina Shutova. 2010. Automatic metaphor interpretation as a paraphrasing task. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, PA, USA, HLT '10, pages 1029–1037. http://dl.acm.org/citation.cfm?id=1857999.1858145.

Ekaterina Shutova and Simone Teufel. 2010. Metaphor corpus annotated for source-target domain mappings. In LREC, volume 2, pages 2–2.

Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems 24.

Peter D. Turney. 2013. Distributional semantics beyond words: Supervised learning of analogy and paraphrase. CoRR abs/1310.5042. http://arxiv.org/abs/1310.5042.

Tony Veale, Ekaterina Shutova, and Beata Beigman Klebanov. 2016. Metaphor: A Computational Perspective. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers. https://doi.org/10.2200/S00694ED1V01Y201601HLT031.

Soroush Vosoughi, Prashanth Vijayaraghavan, and Deb Roy. 2016. Tweet2Vec: Learning tweet embeddings using character-level CNN-LSTM encoder-decoder. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, New York, NY, USA, SIGIR '16, pages 1041–1044. https://doi.org/10.1145/2911451.2914762.

Jin Wang, Liang-Chih Yu, K. Robert Lai, and Xuejie Zhang. 2016. Dimensional sentiment analysis using a regional CNN-LSTM model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers. http://aclweb.org/anthology/P/P16/P16-2037.pdf.
The Effect of Context on Metaphor Paraphrase Aptness Judgments
Abstract

[...] crowd-source task in which speakers rank metaphor-paraphrase candidate sentence pairs in short document contexts for paraphrase aptness. In the second we train a composite DNN to predict these human judgments, first in binary classifier mode, and then as gradient ratings. We found that for both mean human judgments and our DNN's predictions, adding document context compresses the aptness scores towards the center of the scale, raising low out-of-context ratings and decreasing high out-of-context scores. We offer a provisional explanation for this compression effect.

1 Introduction

A metaphor is a way of forcing the normal boundaries of a word's meaning in order to better express an experience, a concept or an idea. To a native speaker's ear some metaphors sound more conventional (like the usage of the words ear and sound in this sentence), others more original. This is not the only dimension along which to judge a metaphor. One of the most important qualities of a metaphor is its appropriateness, its aptness: how good a metaphor is for conveying a given experience or concept. While a metaphor's degree of conventionality can be measured through probabilistic methods, like language models, it is harder to represent its aptness. Chiappe et al. (2003) define aptness as "the extent to which a comparison captures important features of the topic".

It is possible to express an opinion about some metaphors' and similes' aptness (at least to a degree) without previously knowing what they are trying to convey, or the context in which they appear.[1] For example, we don't need a particular [...] The reason why the simile in the second sentence works best is intuitive: a salient characteristic of a banshee is a powerful scream. Turtles are not known for screaming, and so it is harder to define the quality of a scream through such a comparison, except as a form of irony.[2] Other cases are more complicated to decide upon. The simile crying like a fire in the sun ("It's All Over Now, Baby Blue", Bob Dylan) is powerfully apt for many readers, but simply odd for others. Fire and sun are not known to cry in any way. But at the same time the simile can capture the association we draw between something strong and intense in other senses - vision, touch, etc. - and a loud cry.

Nonetheless, most metaphors and similes need some kind of context, or external reference point, to be interpreted. The sentence The old lady had a heart of stone is apt if the old lady is cruel or indifferent, but it is inappropriate as a description of a situation in which the old lady is kind and caring. We assume that, to an average reader's sensibility, the sentence models the situation in a satisfactory way only in the first case.

This is the approach to metaphor aptness that we assume in this paper. Following Bizzoni and Lappin (2018), we treat a metaphor as apt in relation to a literal expression that it paraphrases.[3]

[1] While it can be argued that metaphors and similes at some level work differently and cannot always be considered as variations of the same phenomenon (Sam and Catrinel, 2006; Glucksberg, 2008), for this study we treat them as belonging to the same category of figurative language.
[2] It is important not to confuse aptness with transparency. The latter measures how easy it is to understand a comparison. Chiappe et al. (2003) claim, for example, that many literary or poetic metaphors score high on aptness and low on transparency, in that they capture the nature of the topic very well, but it is not always clear why they work.
[3] Bizzoni and Lappin (2018) apply Bizzoni and Lappin (2017)'s modeling work on general paraphrase to metaphor.
If the metaphor is judged to be a good paraphrase, then it closely expresses the core information of the literal sentence through its metaphorical shift. We refer to the prediction of readers' judgments on the aptness candidates for the literal paraphrase of a metaphor as the metaphor paraphrase aptness task (MPAT). Bizzoni and Lappin (2018) address the MPAT by using Amazon Mechanical Turk (AMT) to obtain crowd sourced annotations of metaphor-paraphrase candidate pairs. They train a composite Deep Neural Network (DNN) on a portion of their annotated corpus, and test it on the remaining part. Testing involves using the DNN as a binary classifier on paraphrase candidates. They derive predictions of gradient paraphrase aptness for their test set, and assess them by Pearson coefficient correlation to the mean judgments of their crowd sourced annotation of this set. Both training and testing are done independently of any document context for the metaphorical sentence and its literal paraphrase candidates.

In this paper we study the role of context in readers' judgments concerning the aptness of metaphor paraphrase candidates. We look at the accuracy of Bizzoni and Lappin (2018)'s DNN when trained and tested on contextually embedded metaphor-paraphrase pairs for the MPAT. In Section 2 we describe an AMT experiment in which annotators judge metaphors and paraphrases embodied in small document contexts, and in Section 3 we discuss the results of this experiment. In Section 4 we describe our MPAT modeling experiment, and in Section 5 we discuss the results of this experiment. Section 6 briefly surveys some related work. In Section 7 we draw conclusions from our study, and we indicate directions for future work in this area.

2 Annotating Metaphor-Paraphrase Pairs in Contexts

Bizzoni and Lappin (2018) have recently produced a dataset of paraphrases containing metaphors, designed to allow both supervised binary classification and gradient ranking. This dataset contains several pairs of sentences, where in each pair the first sentence contains a metaphor, and the second is a literal paraphrase candidate.

This corpus was constructed with a view to representing a large variety of syntactic structures and semantic phenomena in metaphorical sentences. Many of these structures and phenomena do not occur as metaphorical expressions, with any frequency, in natural text and were therefore introduced through hand crafted examples.

Each pair of sentences in the corpus has been rated by AMT annotators for paraphrase aptness on a scale of 1-4, with 4 being the highest degree of aptness. In Bizzoni and Lappin (2018)'s dataset, sentences come in groups of five, where the first element is the "reference element" with a metaphorical expression, and the remaining four sentences are "candidates" that stand in a degree of paraphrasehood to the reference.

Here is an example of a metaphor-paraphrase candidate pair.

1a. The crowd was a roaring river.
 b. The crowd was huge and noisy.

The average AMT paraphrase score for this pair is 4.0, indicating a high degree of aptness.

We extracted 200 sentence pairs from Bizzoni and Lappin (2018)'s dataset and provided each pair with a document context consisting of a preceding and a following sentence,[4] as in the following example.

[4] Our annotated data set and the code for our model are available at https://github.com/yuri-bizzoni/Metaphor-Paraphrase.

2a. They had arrived in the capital city. The crowd was a roaring river. It was glorious.
 b. They had arrived in the capital city. The crowd was huge and noisy. It was glorious.

One of the authors constructed most of these contexts by hand. In some cases, it was possible to locate the original metaphor in an existing document. This was the case for

(i) literary metaphors extracted from poetry or novels, and
(ii) short conventional metaphors (The President brushed aside the accusations, Time flies) that can be found, with small variations, in a number of texts.
For these cases, a variant of the existing context was added to both the metaphorical and the literal sentences. We introduced small modifications to keep the context short and clear, and to avoid copyright issues. We lightly modified the contexts of metaphors extracted from corpora when the original context was too long, i.e. when the contextual sentences of the selected metaphor were longer than the maximum length we specified for our corpus. In such cases we reduced the length of the sentence, while sustaining its meaning.

The context was designed to sound as natural as possible. Since the same context is used for metaphors and their literal candidate paraphrases, we tried to design short contexts that make sense for both the figurative and the literal sentences, even when the pair had been judged as non-paraphrases. We kept the context as neutral as possible in order to avoid a distortion in crowd source ratings.

For example, in the following pair of sentences, the literal sentence is not a good paraphrase of the figurative one (a simile).

3a. He is grinning like an ape.
 b. He is smiling in a charming way. (average score: 1.9)

We opted for a context that is natural for both sentences.

4a. Look at him. He is grinning like an ape. He feels so confident and self-assured.
 b. Look at him. He is smiling in a charming way. He feels so confident and self-assured.

We sought to avoid, whenever possible, an incongruous context for one of the sentences that could influence our annotators' ratings.

We collected a sub-corpus of 200 contextually embedded pairs of sentences. We tried to keep our data as balanced as possible, drawing from all four rating classes of paraphrase aptness (between 1 and 4) that Bizzoni and Lappin (2018) obtained. We selected 44 pairs rated 1, 51 pairs rated 2, 43 pairs rated 3 and 62 pairs rated 4.
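A contextually embedded item of this sub-corpus can be pictured as a simple record like the following; the field names are our own illustration and not the format of the released dataset.

```python
# A hypothetical record for one contextualised metaphor-paraphrase pair.
contextualised_pair = {
    "metaphor":  "They had arrived in the capital city. "
                 "The crowd was a roaring river. It was glorious.",
    "candidate": "They had arrived in the capital city. "
                 "The crowd was huge and noisy. It was glorious.",
    "out_of_context_rating": 4.0,   # mean AMT rating without the context sentences
    "in_context_rating": None,      # to be filled by the new AMT annotation round
}
```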
We then used AMT crowd sourcing to rate the contextualized paraphrase pairs, so that we could observe the effect of document context on assessments of metaphor paraphrase aptness.

To test the reproducibility of Bizzoni and Lappin (2018)'s ratings, we launched a pilot study for 10 original non-contextually embedded pairs, selected from all four classes of aptness. We observed that the annotators provided mean ratings very similar to those reported in Bizzoni and Lappin (2018). The Pearson coefficient correlation between the mean judgments of our out-of-context pilot annotations and Bizzoni and Lappin (2018)'s annotations for the same pairs was over 0.9. We then conducted an AMT annotation task for the 200 contextualised pairs. On average, 20 different annotators rated each pair. We considered as "rogue" those annotators who rated the large majority of pairs with very high or very low scores, and those who responded inconsistently to two "trap" pairs. After filtering out the rogues, we had an average of 14 annotators per pair.

3 Annotation Results

We found a Pearson correlation of 0.81 between the in-context and out-of-context mean human paraphrase ratings for our two corpora. This correlation is virtually identical to the one that Bernardy et al. (2018) report for mean acceptability ratings of out-of-context to in-context sentences in their crowd source experiment. It is interesting that a relatively high level of ranking correspondence should occur in mean judgments for sentences presented out of and within document contexts, for two entirely distinct tasks.

Our main result concerns the effect of context on mean paraphrase judgment. We observed that it tends to flatten aptness ratings towards the center of the rating scale. 71.1% of the metaphors that had been considered highly apt (average rounded score of 4) in the context-less pairs received a more moderate judgment in context (average rounded score of 3), but the reverse movement was rare. Only 5% of pairs rated 3 out of context (2 pairs) were boosted to a mean rating of 4 in context. At the other end of the scale, 68.2% of the metaphors judged at 1 category of aptness out of context were raised to a mean of 2 in context, while only 3.9% of pairs rated 2 out of context were lowered to 1 in context.

Ratings at the middle of the scale - 2 (defined as semantically related non-paraphrases) and 3 (imperfect or loose paraphrases) - remained largely stable, with little movement in either direction. 9.8% of pairs rated 2 were re-ranked as 3 when presented in context, and 10% of pairs ranked at 3 changed to 2. The division between 2 and 3 separates paraphrases from non-paraphrases. Our results suggest that this binary rating of paraphrase aptness was not strongly affected by context.
Context operates at the extremes of our scale, raising low aptness ratings and lowering high aptness ratings. This effect is clearly indicated in the regression chart in Fig 1.
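The category movements reported above can be recomputed from the two sets of mean ratings with a few lines of pandas; the column names are assumed.

```python
import pandas as pd

def transition_table(df):
    """df has one row per pair with mean ratings 'ooc' (out of context) and
    'ic' (in context).  Returns, for each rounded out-of-context category,
    the percentage of pairs landing in each rounded in-context category."""
    ooc = df["ooc"].round().astype(int)
    ic = df["ic"].round().astype(int)
    counts = pd.crosstab(ooc, ic)
    return counts.div(counts.sum(axis=1), axis=0) * 100

# e.g. transition_table(ratings).loc[4, 3] should be close to the 71.1% reported above.
```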
This effect of context on human ratings is very similar to the one reported in Bernardy et al. (2018). They find that sentences rated as ill formed out of context are improved when they are presented in their document contexts. However, the mean ratings for sentences judged to be highly acceptable out of context declined when assessed in context. Bernardy et al. (2018)'s linear regression chart for the correlation between out-of-context and in-context acceptability judgments looks remarkably like our Fig 1. There is, then, a striking parallel in the compression pattern that context appears to exert on human judgments for two entirely different linguistic properties.

This pattern requires an explanation. Bernardy et al. (2018) suggest that adding context causes speakers to focus on broader semantic and pragmatic issues of discourse coherence, rather than simply judging syntactic well-formedness (measured as naturalness) when a sentence is considered in isolation. On this view, compression of ratings results from a pressure to construct a plausible interpretation for any sentence within its context.

If this is the case, an analogous process may generate the same compression effect for metaphor aptness assessment of sentence pairs in context. Speakers may attempt to achieve broader discourse coherence when assessing the metaphor-paraphrase aptness relation in a document context. Out of context they focus more narrowly on the semantic relations between a metaphorical sentence and its paraphrase candidate. Therefore, this relation is at the centre of a speaker's concern, and it receives more fine-grained assessment when considered out of context than in context. This issue clearly requires further research.
4 Modelling Paraphrase Judgments in Context

We use the DNN model described in Bizzoni and Lappin (2018) to predict aptness judgments for in-context paraphrase pairs. It has three main components:

1. Two encoders that learn the representations of the two sentences separately;
2. A unified layer that merges the output of the encoders; and
3. A final set of fully connected layers that operate on the merged representation of the two sentences to generate a judgment.

The encoder for each pair of sentences taken as input is composed of two parallel "Atrous" Convolutional Neural Networks (CNNs) and LSTM RNNs, feeding two sequenced fully connected layers. The encoder is preloaded with the lexical embeddings from Word2vec (Mikolov et al., 2013). The sequences of word embeddings that we use as input provide the model with dense word-level information, while the model tries to generalize over these embedding patterns.

The combination of a CNN and an LSTM allows us to capture both long-distance syntactic and semantic relations, best identified by a CNN, and the sequential nature of the input, most efficiently identified by an LSTM. Several existing studies, cited in Bizzoni and Lappin (2017), demonstrate the advantages of combining CNNs and LSTMs to process texts.

The model produces a single classifier value between 0 and 1. We transform this score into a binary output of 0 or 1 by applying a threshold of 0.5 for assigning 1. The architecture of the model is given in Fig 2.

We use the same general protocol as Bizzoni and Lappin (2018) for training with supervised learning and testing the model. Using Bizzoni and Lappin (2018)'s out-of-context metaphor dataset and our contextualized extension of this set, we apply four variants of the training and testing protocol (sketched in code below):

1. Training and testing on the in-context dataset.
2. Training on the out-of-context dataset, and testing on the in-context dataset.
3. Training on the in-context dataset, and testing on the out-of-context dataset.
4. Training and testing on the out-of-context dataset (Bizzoni and Lappin (2018)'s original experiment provides the results for out-of-context training and testing).

When we train or test the model on the out-of-context dataset, we use Bizzoni and Lappin (2018)'s original annotated corpus of 800 metaphor-paraphrase pairs. The in-context dataset contains 200 annotated pairs.
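The four regimens can be written compactly as train/test pairs; the evaluation sketch below (F-score on outputs binarized at the 0.5 threshold) assumes Keras-style model methods and illustrative dataset variables.

```python
from sklearn.metrics import f1_score

REGIMENS = [
    ("in_context",     "in_context"),      # 1. (10-fold cross-validation in practice)
    ("out_of_context", "in_context"),      # 2.
    ("in_context",     "out_of_context"),  # 3.
    ("out_of_context", "out_of_context"),  # 4. Bizzoni and Lappin (2018)'s original setting
]

def evaluate(model, datasets, train_name, test_name, threshold=0.5):
    """datasets maps a name to ((x_metaphor, x_candidate), binary_labels)."""
    (x_tr, y_tr), (x_te, y_te) = datasets[train_name], datasets[test_name]
    model.fit(list(x_tr), y_tr, epochs=15, verbose=0)
    preds = (model.predict(list(x_te)) >= threshold).astype(int).ravel()
    return f1_score(y_te, preds)
```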
Figure 1: In-context and out-of-context mean ratings. Points above the broken diagonal line represent
sentence pairs which received a higher rating when presented in context. The total least-square linear
regression is shown as the second line.
5 MPAT Modelling Results

We use the model both to predict binary classification of a metaphor paraphrase candidate, and to generate gradient aptness ratings on the 4-category scale (see Bizzoni and Lappin (2018) for details). A positive binary classification is counted as accurate if it corresponds to a mean human rating ≥ 2.5. The gradient predictions are derived from the softmax distribution of the output layer of the model. The results of our modelling experiments are given in Table 1.

The main result that we obtain from these experiments is that the model learns binary classification to a reasonable extent on the in-context dataset, both when trained on the same kind of data (in-context pairs), and when trained on Bizzoni and Lappin (2018)'s original dataset (out-of-context pairs). However, the model does not perform well in predicting gradient in-context judgments when trained on in-context pairs. It improves slightly for this task when trained on out-of-context pairs. By contrast, it does well in predicting both binary and gradient ratings when trained and tested on out-of-context data sets.

Bernardy et al. (2018) also note a decline in Pearson correlation for their DNN models on the task of predicting human in-context acceptability judgments, but it is less drastic. They attribute this decline to the fact that the compression effect renders the gradient judgments less separable, and so harder to predict. A similar, but more pronounced, version of this effect may account for the difficulty that our model encounters in predicting gradient in-context ratings. The binary classifier achieves greater success for these cases because its training tends to polarise the data in one direction or the other.

We also observe that the best combination seems to consist in training our model on the original out-of-context dataset and testing it on the in-context pairs. In this configuration we reach an F-score (0.72) only slightly lower than the one reported in Bizzoni and Lappin (2018) (0.74), and we record the highest Pearson correlation, 0.3 (which is still not strong, compared to Bizzoni and Lappin (2018)'s best run, 0.75[5]). This result may partly be an artifact of the larger amount of training data provided by the out-of-context pairs.

[5] It is also important to consider that their ranking scheme is different from ours: the Pearson correlation reported there is the average of the correlations over all groups of 5 sentences present in the dataset.

We can use this variant (out-of-context training and in-context testing) to perform a fine-grained comparison of the model's predicted ratings for the same sentences in and out of context. When we do this, we observe that out of 200 sentence pairs, our model scores the majority (130 pairs) higher when processed in context than out of context.
Figure 2: DNN encoder for predicting metaphorical paraphrase aptness from Bizzoni and Lappin (2018).
Each encoder represents a sentence as a 10-dimensional vector. These vectors are concatenated to com-
pute a single score for the pair of input sentences.
Table 1: F-score binary classification accuracy and Pearson correlation for three different regimens of
supervised learning. The * indicates results for a set of 10-fold cross-validation runs. This was necessary
in the first case, when training and testing are both on our small corpus of in-context pairs. In the second
and third rows, since we are using the full out-of-context and in-context dataset, we report single-run
results. The fourth row is Bizzoni and Lappin (2018)’s best run result. (Our single-run best result for the
first row is an F-score of 0.8 and a Pearson correlation of 0.16.)
A smaller but significant group (70 pairs) receives a lower score when processed in context. The first group's average score before adding context (0.48) is consistently lower than that of the second group (0.68). Also, as Table 2 indicates, the pairs that our model rated out of context with a score lower than 0.5 (on the model's softmax distribution) received on average a higher rating in context, while the opposite is true for the pairs rated with a score higher than 0.5. In general, sentence pairs that were rated highly out of context receive a lower score in context, and vice versa. When we performed linear regression on the DNN's in-context and out-of-context predicted scores, we observed substantially the same compression pattern exhibited by our AMT mean human judgments. Figure 3 plots this regression graph.
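The regression shown in Figure 3 can be reproduced, under the obvious assumptions about variable names, with scipy.stats.linregress; a slope below 1 combined with a positive intercept is the numerical signature of the compression effect.

```python
from scipy.stats import linregress

def compression_regression(ooc_scores, ic_scores):
    """Least-squares fit of in-context model scores against out-of-context scores.
    A slope < 1 together with an intercept > 0 means that low out-of-context
    scores are raised and high ones lowered, i.e. the compression effect."""
    fit = linregress(ooc_scores, ic_scores)
    return fit.slope, fit.intercept, fit.rvalue

# Toy example using the group means of Table 2: (0.42 -> 0.54) and (0.67 -> 0.64)
print(compression_regression([0.42, 0.67], [0.54, 0.64]))
```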
OOC score   Number of elements   OOC Mean   OOC Std   IC Mean   IC Std
0.0-0.5     112                  0.42       0.09      0.54      0.10
0.5-1.0     88                   0.67       0.07      0.64      0.07

Table 2: We show the number of pairs that received a low score out of context (first row) and the number of pairs that received a high score out of context (second row). We report the mean score and standard deviation (Std) of the two groups when judged out of context (OOC) and when judged in context (IC) by our model. The model's scores range between 0 and 1. As can be seen, the mean of the low-scoring group rises in context, and the mean of the high-scoring group decreases in context.

6 Related Cognitive Work on Metaphor Aptness

Tourangeau and Sternberg (1981) present ratings of aptness and comprehensibility for 64 metaphors from two groups of subjects. They note that metaphors were perceived as more apt and more comprehensible to the extent that their terms occupied similar positions within dissimilar domains. Interestingly, Fainsilber and Kogan (1984) also present experimental results to claim that imagery does not clearly correlate with metaphor aptness. Aptness judgments also appear to be subject to individual differences.

Blasko (1999) points to such individual differences in metaphor processing. She asked 27 participants to rate 37 metaphors for difficulty, aptness and familiarity, and to write one or more interpretations of the metaphor. Subjects with higher working memory span were able to give more detailed and elaborate interpretations of metaphors. Familiarity and aptness correlated for both high and low span subjects. For high span subjects aptness of metaphor positively correlated with the number of interpretations, while for low span subjects the opposite was true.
Figure 3: In-context and out-of-context ratings assigned by our trained model. Points above the broken
diagonal line represent sentence pairs which received a higher rating when presented in context. The
total least-square linear regression is shown as the second line.
McCabe (1983) analyses the aptness of metaphors both with and without extended context. She finds that domain similarity correlates with aptness judgments in isolated metaphors, but not in contextualized metaphors. She also reports that there is no clear correlation between metaphor aptness ratings in isolated and in contextualized examples.

Chiappe et al. (2003) study the relation between aptness and comprehensibility in metaphors and similes. They provide experimental results indicating that aptness is a better predictor than comprehensibility for the "transformation" of a simile into a metaphor. Subjects tended to remember similes as metaphors (i.e. to remember the dancer's arms moved like startled rattlesnakes as the dancer's arms were startled rattlesnakes) if they were judged to be particularly apt, rather than particularly comprehensible. They claim that context might play an important role in this process. They suggest that context should ease the transparency and increase the aptness of metaphors and similes.

Tourangeau and Rips (1991) present a series of experiments indicating that metaphors tend to be interpreted through emergent features that were not rated as particularly relevant either for the tenor or for the vehicle of the metaphor. The number of emergent features that subjects were able to draw from a metaphor seems to correlate with their aptness judgments.

Bambini et al. (2018) use Event-Related Brain Potentials (ERPs) to study the temporal dynamics of metaphor processing in reading literary texts. They emphasize the influence of context on the ability of a reader to smoothly interpret an unusual metaphor.
Bambini et al. (2016) use electrophysiological experiments to try to disentangle the effect of a metaphor from that of its context. They find that de-contextualized metaphors elicited two different brain responses, N400 and P600, while contextualized metaphors only produced the P600 effect. They attribute the N400 effect, often observed in neurological studies of metaphors, to expectations about upcoming words in the absence of a predictive context that "prepares" the reader for the metaphor. They suggest that the P600 effect reflects the actual interpretative processing of the metaphor.

This view is supported by several neurological studies showing that the N400 effect arises with unexpected elements, like new presuppositions introduced into a text in a way not implied by the context (Masia et al., 2017), or unexpected associations with a noun-verb combination not indicated by the previous context (for example preceded by a neutral context, as in Cosentino et al. (2017)).
7 Conclusions and Future Work

We have observed that embedding metaphorical sentences and their paraphrase candidates in a document context generates a compression effect in human metaphor aptness ratings. Context seems to mitigate the perceived aptness of metaphors in two ways. Those metaphor-paraphrase pairs given very low scores out of context receive increased scores in context, while those with very high scores out of context decline in rating when presented in context. At the same time, the demarcation line between paraphrase and non-paraphrase is not particularly affected by the introduction of extended context.

As previously observed by McCabe (1983), we found that context has an influence on human aptness ratings for metaphors, although, unlike her results, we did find a correlation between the two sets of ratings. Chiappe et al. (2003)'s expectation that context should facilitate a metaphor's aptness was supported only in one sense: aptness increases for low-rated pairs, but it decreases for high-rated pairs.

We applied Bizzoni and Lappin (2018)'s DNN for the MPAT to an in-context test set, experimenting with both out-of-context and in-context training corpora. We obtained reasonable results for binary classification of paraphrase candidates for aptness, but the performance of the model declined sharply for the prediction of human gradient aptness judgments, relative to its performance on a corresponding out-of-context test set. This appears to be the result of the increased difficulty in separating rating categories introduced by the compression effect.

Strikingly, the linear regression analyses of human aptness judgments for in- and out-of-context paraphrase pairs, and of our DNN's predictions for these pairs, reveal similar compression patterns. These patterns produce ratings that cannot be clearly separated along a linear ranking scale.

To the best of our knowledge, ours is the first study of the effect of context on metaphor aptness on a corpus of this dimension, using crowd sourced human judgments as the gold standard for assessing the predictions of a computational model of paraphrase. We also present the first comparative study of both human and model judgments of metaphor paraphrase for in-context and out-of-context variants of metaphorical sentences.

Finally, the compression effect that context induces on paraphrase judgments corresponds closely to the one observed independently in another task, which is reported in Bernardy et al. (2018). We regard this effect as a significant discovery that increases the plausibility and the interest of our results. The fact that it appears clearly with two tasks involving different sorts of DNNs and distinct learning regimes (unsupervised learning with neural network language models for the acceptability prediction task, as opposed to supervised learning with our composite DNN for paraphrase prediction) reduces the likelihood that this effect is an artefact of our experimental design.

While our dataset is still small, we are presenting an initial investigation of a phenomenon which is, to date, little studied. We are working to enlarge our dataset and in future work we will expand both our in- and out-of-context annotated metaphor-paraphrase corpora.

While the corpus we used contains a number of hand crafted examples, it would be preferable to find these example types in natural corpora, and we are currently working on this. We will be extracting a dataset of completely natural (corpus-driven) examples. We are seeking to expand the size of the data set to improve the reliability of our modelling experiments.

We will also experiment with alternative DNN architectures for the MPAT. We will conduct qualitative analyses on the kinds of metaphors and similes that are more prone to a context-induced rating switch.

One of our main concerns in future research will be to achieve a better understanding of the compression effect of context on human judgments and DNN models.
References

Valentina Bambini, Chiara Bertini, Walter Schaeken, Alessandra Stella, and Francesco Di Russo. 2016. Disentangling metaphor from context: An ERP study. Frontiers in Psychology, 7:559.

Valentina Bambini, Paolo Canal, Donatella Resta, and Mirko Grimaldi. 2018. Time course and neurophysiological underpinnings of metaphor in literary context. Discourse Processes, pages 1–21.

Jean-Philippe Bernardy, Shalom Lappin, and Jey Han Lau. 2018. The influence of context on sentence acceptability judgments. Proceedings of ACL 2018, Melbourne, Australia.

Yuri Bizzoni and Shalom Lappin. 2017. Deep learning of binary and gradient judgements for semantic paraphrase. In IWCS 2017 - 12th International Conference on Computational Semantics - Short papers, Montpellier, France, September 19-22, 2017.

Viviana Masia, Paolo Canal, Irene Ricci, Edoardo Lombardi Vallauri, and Valentina Bambini. 2017. Presupposition of new information as a pragmatic garden path: Evidence from event-related brain potentials. Journal of Neurolinguistics, 42:31–48.

Allyssa McCabe. 1983. Conceptual similarity and the quality of metaphor in isolated sentences versus extended contexts. Journal of Psycholinguistic Research, 12(1):41–68.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Glucksberg Sam and Haught Catrinel. 2006. On the relation between metaphor and simile: When comparison fails. Mind & Language, 21(3):360–378.

Roger Tourangeau and Lance Rips. 1991. Interpreting and evaluating metaphors. Journal of Memory and Language, 30(4):452–472.

Roger Tourangeau and Robert J. Sternberg. 1981. Aptness in metaphor. Cognitive Psychology, 13(1):27–55.