
Detection and Aptness: A study in metaphor detection and aptness assessment
through neural networks and distributional semantic spaces

Yuri Bizzoni

December 2018

Doctoral dissertation in computational linguistics, University of Gothenburg


21 February 2019

©Yuri Bizzoni, 2019


Cover: Stefanie Kaech and Gary Banker
Printed by Repro Lorensberg,
University of Gothenburg
Gothenburg 2019
Publisher: University of Gothenburg (Dissertations)
ISBN 978-91-7833-310-3 (print)
ISBN 978-91-7833-311-0 (pdf)

Distribution
Department of Philosophy, Linguistics and Theory of Science,
Box 200, SE-405 30 Gothenburg
Abstract

Metaphor is one of the most prominent, and most studied, figures of speech.
While it is considered an element of great interest in several branches of linguistics, such as
semantics, pragmatics and stylistics, its automatic processing remains an open challenge. First, the semantic complexity of the concept of metaphor itself creates a range of theoretical complications. Second, the practical lack of large-scale resources forces researchers to work under conditions of data
scarcity.
This compilation thesis provides a set of experiments to (i) automatically detect metaphors and
(ii) assess a metaphor’s aptness with respect to a given literal equivalent. The first task has already
been tackled by a number of studies. I treat it as a way to assess the potential and limitations of our approach, before dealing with the second task. For metaphor detection I was able to use existing
resources, while I created my own dataset to explore metaphor aptness assessment. In all of the studies
presented here, I have used a combination of word embeddings and neural networks.
To deal with metaphor aptness assessment, I framed the problem as a case of paraphrase identification. Given a sentence containing a metaphor, the task is to find the best literal paraphrase from a set of candidates. I built a dataset designed for this task, which allows a gradient scoring of various paraphrases with respect to a reference sentence, so that paraphrases are ordered according to their degree of aptness. I could therefore use it both for binary classification and for ordering tasks. This dataset was annotated through crowdsourcing by an average of 20 annotators for each pair. I then designed a deep neural network to be trained on this dataset, which is able to achieve encouraging levels
of performance.
In the final experiment of this compilation, more context is added to a sub-section of the dataset
in order to study the effect of extended context on metaphor aptness rating. I show that extended
context changes human perception of metaphor aptness and that this effect is reproduced by my neural
classifier. The conclusion of the last study is that extended context compresses aptness scores towards the center of the scale: ratings that are low out of context rise, while ratings that are high out of context decrease.

Acknowledgments

I want to thank Shalom Lappin for more than three years of helpful supervision, as well as my CLASP
friends and colleagues for their guidance and endurance. I am grateful to Beata Beigman Klebanov for
her thorough reading of my thesis and for her extensive comments. I am also obliged to my colleagues at the University of Pisa for their collaboration in some of my work. Finally I am thankful to the Swedish Research Council for providing the funds necessary for my paycheck, and to all the members of the FLoV department for their help.
Without their support this thesis would have been different, so they should share part of the
blame.

Contents

I THESIS FRAME

1 Introduction
   1.1 My Research Questions
   1.2 Metaphor Detection
       1.2.1 Out-of-context metaphor detection
       1.2.2 Idioms
       1.2.3 In-context metaphor detection
   1.3 Metaphor Aptness Assessment
       1.3.1 Dataset and architecture
       1.3.2 Out-of-context Metaphor Aptness Assessment
       1.3.3 In-context Metaphor Aptness Assessment
   1.4 Neural Networks
   1.5 Vector space lexical embeddings
   1.6 Structure of the thesis

2 Theoretical background and framework: key concepts
   2.1 Metaphor detection in NLP
       2.1.1 Unsupervised approaches
       2.1.2 Supervised approaches
       2.1.3 Neural Networks
       2.1.4 Vector space semantic models
   2.2 Beyond detection

3 Data collection and resources
   3.1 Adjective-Noun Dataset
   3.2 Idiom dataset
   3.3 VU Amsterdam Corpus
   3.4 Paraphrase corpus
   3.5 Metaphor Aptness Corpus
   3.6 Contextual Metaphor Aptness Corpus

4 Summary of the studies
   4.1 Study I
       4.1.1 My contribution to the paper
   4.2 Study II
       4.2.1 My contribution to the paper
   4.3 Study III
       4.3.1 My contribution to the paper
   4.4 Study IV
       4.4.1 My contribution to the paper
   4.5 Study V
       4.5.1 My contribution to the paper
   4.6 Study VI
       4.6.1 My contribution to the paper

5 Conclusions
   5.1 My Research Answers
   5.2 Nuanced properties
   5.3 Sequentiality
   5.4 Data Scarcity
   5.5 Context
   5.6 Limitations
   5.7 Future Works

6 Appendix

II STUDIES

List of Studies
STUDY I
Bizzoni, Y., Chatzikyriakidis, S., Ghanimifard, M. (2017). "Deep" Learning: Detecting Metaphoricity in Adjective-Noun Pairs. In Proceedings of the Workshop on Stylistic Variation (pp. 43-52).

STUDY II
Bizzoni, Y., Senaldi, M. S. G., Lenci, A. (2018). Finding the Neural Net: Deep-learning Idiom Type Identification from Distributional Vectors. To appear in Italian Journal of Computational Linguistics.

STUDY III
Bizzoni, Y., Ghanimifard, M. (2018). Bigrams and BiLSTMs: Two Neural Networks for Sequential Metaphor Detection. In Proceedings of the Workshop on Figurative Language Processing (pp. 91-101).

STUDY IV
Bizzoni, Y., Lappin, S. (2017). Deep Learning of Binary and Gradient Judgements for Seman-
tic Paraphrase. In IWCS 2017—12th International Conference on Computational Semantics—Short
papers.

STUDY V
Bizzoni, Y., Lappin, S. (2018). Predicting Human Metaphor Paraphrase Judgments with Deep
Neural Networks. In Proceedings of the Workshop on Figurative Language Processing (pp. 45-55).

STUDY VI
Bizzoni, Y., Lappin, S. (2018). The Effect of Context on Metaphor Paraphrase Aptness Judg-
ments. arXiv preprint arXiv:1809.01060.
Part I

THESIS FRAME

Chapter 1

Introduction

Figurative Language is an umbrella term encompassing several different phenomena, and covering linguistic behavior that involves pragmatics, semantics, syntax and even phonetics.1 Figures of speech are often framed as language patterns that are perceived by speakers as "unusual" or improbable.

1 Non-verbal figurativity also exists (Migliore; 2007). In this thesis I limit my discussion to verbal figurativity.

Statistical and prosodic patterns are essential in the way we learn and update language (Kuhl; 2004), and it is not surprising that we develop a high sensitivity towards them. It can be argued that since we learn language mainly through positive examples, we don't learn what is wrong; we rather develop a sensitivity towards what sounds unusual. These are the patterns we classify as 'ungrammatical', or as having a lower degree of acceptability (Lau; 2015; Lau et al.; 2017), as 'slips of the tongue' (Dell; 1985), or, to some extent, as 'figures of speech'.
The human ability to constantly learn patterns, and to detect what seems to deviate from the
acquired schemes - phonetic, syntactic, semantic, just to name the most obvious ones in language -
can be the reason why figurative language is both effective in communication and difficult to model.
We also get easily used to new patterns, which is why our perception of figures changes over time. For example, novel metaphors can be characterized as a way of forcing a word or an expression out of its conventional use, and stretching its meaning in order to better convey an experience or an idea. But the metaphoric usage of a word can itself become a conventional pattern. This, for example, creates "dead" metaphors: metaphors that are no longer perceived as such by speakers.
The most studied of these figures in linguistics, including natural language processing, are the
semantic-pragmatic tropes of irony, sarcasm, metaphor (together with simile), and metonymy. Metaphor
is often considered one of the most important and widespread figures of speech, and an important
phenomenon to analyze in several areas of linguistics, from stylistics (Goodman; 1975; Semino and Culpeper; 2002; Simpson; 2004; Leech and Short; 2007; Fahnestock; 2009; Steen; 2014) to dialogue
studies (Pollio et al.; 1990; Corts and Pollio; 1999; Kintsch; 2000; Cameron; 2008). Metaphor is also
relevant for pragmatics, and it has been associated with an increase in “emotionality" in communica-
tion (Fussell and Moss; 1998; Gibbs et al.; 2002), and with an attempt to improve the clarity of an
explanation (Sell et al.; 1997; Darian; 2000).
From the perspective of computational linguistics, figurative language in general, and metaphor
in particular, are of both theoretical and practical interest.
Theoretically it is worth modelling, because its broad function seems to consist in stretching the
expressive power of natural language.
For example, the synesthetic expression A cold voice draws its meaning both from tactile perception (as a source domain) and auditory perception (as a target domain) to convey a particular characteristic of a voice. But how and why this semantic shift operates is far from being a solved problem.
Practically, figurative language can be a source of problems if it is not understood (or at least
recognized) by language processing systems.
Given the widespread presence of metaphor in everyday language (Deignan; 2007; Lakoff and Johnson; 2008a; Sikos et al.; 2008) and its proven importance in communication and in the transmission of knowledge (Salager-Meyer; 1990; Rodriguez; 2003; Littlemore; 2004; Baumer and Tomlinson; 2008; Kokkinakis; 2013; Laranjeira; 2013), automatic metaphor processing can open new possibilities for a number of computational linguistics applications like machine translation, dialogue management, information retrieval, opinion mining, sentiment analysis and author profiling.2 Effective detection of figures such as irony and sarcasm is already deemed "immensely useful for reducing incorrect classification of consumer sentiment" (Mukherjee and Bala; 2017) and is often applied to the analysis of social media language (Reyes et al.; 2012).

In the rest of this chapter I will detail the research questions that have led to this compilation.
They revolve around two main topics: metaphor detection and metaphor aptness assessment.
I will give a short explanation of what each of these themes involves and how I tried to explore
them.
I will also briefly discuss the two main tools I have used throughout all my experiments: vector
space lexical embeddings and neural networks.
Finally, I will present the structure of my thesis’ frame.

2 I am thinking here of figurative uses that are perceived as such by speakers: conventional metaphors, for example, can be treated with general statistical tools.

1.1 My Research Questions


The main questions that led to the development of my research are the following:

1. To what extent can we exploit vector space semantic models through artificial neural networks
to perform supervised metaphor detection?

2. Is it possible to use the same approach to go beyond metaphor detection, and tackle metaphor
aptness assessment as a natural language processing task?

3. What is the best way of dealing with metaphor aptness assessment, in terms of both dataset
structure and task design?

It is possible to read these questions from a more "linguistic" angle, highlighting the theoretical, rather
than technical, interest of my research.
In this sense, my research questions could be reformulated in the following way:

1. To what extent can we exploit the distributional profile of single words to detect their metaphoric
usage in text through a supervised machine learning approach? To what extent would such a process be compositional?

2. Is it possible to exploit the distributional profile of single words to go beyond metaphor detection,
and tackle metaphor aptness assessment as a natural language processing task?

1.2 Metaphor Detection


Metaphor is increasingly an object of study in computational linguistics and natural language pro-
cessing, where automatic metaphor processing remains a serious challenge. Even the "basic" task of automatically detecting metaphors in a text is far from solved (Neuman et al.; 2013; Klebanov et al.; 2015). Conceptual problems - the diversity and complexity of metaphors - and practical obstacles - data
scarcity - both contribute to the difficulty of the task.
Data scarcity in particular is an important limitation, since the number of annotated resources
for metaphor processing is relatively low, and their dimensions tend to be small - from a few hundred elements to a few thousand for the richest resources.
In this compilation, I use distributional semantic information derived from very large corpora
and encoded in lexical embeddings to partially overcome this problem. This semantic information can
be exploited by a classifier to deal with the words and expressions contained in the relatively small
annotated datasets for metaphor processing.

1.2.1 Out-of-context metaphor detection

A way to break the problem down into a more manageable task is to work on lists of out-of-
context metaphors and literal expressions. While it is true that every piece of text could be interpreted
metaphorically or literally depending on a larger context, an average reader is usually able to express
a tentative judgment of metaphoricity on an isolated expression. For example, the expression a bright
color will most likely be interpreted as literal. On the other hand, the expression a bright person will
most likely be interpreted as metaphorical. While it may seem counter-intuitive, developing a system
to detect metaphoric expressions in isolation presents several advantages with respect to a system
designed to find metaphors in unconstrained text.
The main advantages of such a framework are that there is no need to parse or process a larger
text, the boundaries of the expression to be judged are already established in the dataset, and there is no risk of encountering a confusing larger context that causes trouble for the detector.
This is the approach we adopt in the first paper of this compilation. It will also introduce the
reader to the two tools that we use throughout the following studies: lexical embeddings and neural
networks.
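To make the framing concrete, the sketch below illustrates the kind of classifier this setup calls for: the pre-trained embeddings of the adjective and the noun are concatenated and fed to a small fully connected network that outputs a metaphoricity judgment. It is a purely illustrative sketch written against a Keras-style API, not the exact model of Paper 1; the embedding lookup emb, the layer sizes and the training data are assumptions made only for the example.

    # Minimal sketch of an out-of-context metaphoricity classifier for
    # adjective-noun pairs (illustrative only, not the model from Paper 1).
    # `emb` is an assumed lookup from words to 300-dimensional vectors.
    import numpy as np
    from tensorflow.keras import layers, models

    EMB_DIM = 300  # matches common pre-trained embedding spaces

    def pair_vector(adjective, noun, emb):
        """Concatenate the embeddings of the two words of the pair."""
        return np.concatenate([emb[adjective], emb[noun]])

    model = models.Sequential([
        layers.Input(shape=(2 * EMB_DIM,)),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(32, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # 1 = metaphorical, 0 = literal
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    # Training would then take the form:
    # X = np.stack([pair_vector(a, n, emb) for a, n in annotated_pairs])
    # model.fit(X, labels, epochs=20, validation_split=0.1)

Under this framing, a bright color and a bright person simply become two points in the same input space, and the network learns to separate them from the annotated examples.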

1.2.2 Idioms

As a proof of concept, I also present a study on idiom identification. In this study we first show that
a similar combination of neural classifier and lexical embeddings can be applied to figures of speech
beyond metaphor. Second, we discuss the compositional nature of metaphor, showing that, when
dealing with essentially non-compositional figures such as idioms (Vietri; 2014), the same approach
used in Paper 1 works poorly. On the other hand, treating idioms as unitary tokens brings the
performance of our model up to a high level.

1.2.3 In-context metaphor detection

After combining neural networks and word embeddings on out-of-context expressions, I present a study
on metaphor detection in context. The complexity of detecting multi-word metaphors in unconstrained
text made it necessary to adopt deeper and more sophisticated neural architectures. We consider two
competitive approaches to the problem, discussing the challenges posed by the corpus' annotation
system, and the benefits of enriching semantic vector spaces with diverse non-distributional features.

1.3 Metaphor Aptness Assessment


Automatic metaphor identification is far from a solved task, and it remains an area to which many
researchers are currently devoting significant effort. Combinations of neural networks and vector space
semantics, like the ones that we explore, are increasingly popular in this domain, and the performance
of metaphor detection models is steadily improving (Leong et al.; 2018; Pramanick et al.; 2018; Gao
et al.; 2018). But automatic identification is only one issue that a computational study of metaphor
can take up.
The second part of this thesis discusses a far less studied aspect of metaphor processing: automatic
aptness assessment.
Metaphor aptness refers to a metaphor’s efficacy in expressing a concept: how “good" a metaphor
is at conveying a given meaning. If a metaphor is a form of analogy, aptness defines how fitting such an analogy is (Chiappe et al.; 2003). Different metaphors can be perceived as more or less fitting. This
does not imply that an apt metaphor has to be completely transparent. The shared properties of
source and target domain need not be entirely obvious. A metaphor can be apt even without it being
clear. Aptness consists in the effectiveness of a metaphor, which does not necessarily correspond to its
level of transparency.
Metaphor aptness is rarely addressed in computational linguistics. For a more detailed discussion
of aptness, see the last paper of this compilation.

1.3.1 Dataset and architecture

To tackle Metaphor Aptness Assessment, we decided to frame it as a paraphrase identification problem.


The basic idea of this approach is that a metaphor can be seen as an effective way of paraphrasing a
literal equivalent, and, conversely, a metaphor can be paraphrased through a more or less fitting literal
interpretation. Another reason to choose this approach is that when a literal interpretation is given it
is easier to determine the aptness of a metaphor, since its intended meaning is made explicit. Working
on metaphor aptness without a literal paraphrase could be a riskier approach, since different readers
may interpret the metaphor’s meaning differently. Such potential confusion would add to the already
subjective quality of metaphor aptness evaluation.
Metaphor interpretation has an evocative and elusive nature (Cornelissen; 2005). My lawyer is a
shark can be paraphrased as My lawyer is greedy, My lawyer is ambitious, My lawyer is dangerous...
Many acceptable partial paraphrases of a given metaphor can exist, since the metaphor itself can give rise to several nuances of interpretation.
At the same time, though, we argue that a sentence like My lawyer is greedy and ambitious will
be perceived as a more fitting paraphrase than My lawyer likes money, which can be an acceptable paraphrase but has a narrower focus of interpretation; at the same time, My lawyer likes money
will be perceived as more fitting a paraphrase than My lawyer is a strange person, and so on.
We thus decided to frame the problem as an ordering task, rather than as a binary classification
problem. For a metaphor-literal pair of sentences, we want to generate both a binary judgment and
a gradient score. We prefer the method used in semantic similarity tasks (Xu et al.; 2015; Agirre
et al.; 2016), rather than in traditional paraphrase detection. In this respect, we are following
the dominant view in the current literature on the related task of automatic metaphor paraphrasing
(Bollegala and Shutova; 2013).
This is the general concern that binds together the last three papers of this compilation.

1.3.2 Out-of-context Metaphor Aptness Assessment

While Paper 4 discusses this approach, applying it to a “traditional" paraphrase detection task, Paper
5 deals with metaphor aptness as a problem of metaphor paraphrase grading. We present a dataset
for metaphor and simile aptness assessment, together with a number of machine learning experiments.
Our dataset contains metaphors and similes at various degrees of salience (Giora; 1999; Giora and Fein;
1999; Laurent et al.; 2006; Giora; 2002), together with candidate literal paraphrases. Each sentence
appears in isolation, devoid of a more general context.
It can be argued that a reader’s assessment of acceptability might change depending on a larger
context (as is explored in Paper 6). We tried to create sentences that were not too ambiguous in this
sense, and we relied on the annotators’ “common sense". As Paper 5 will show, the agreement between
the majority of annotators and my own (“golden") judgment is very high: the potential confusion due
to divergent interpretations doesn’t seem to affect the results.
This task falls midway between metaphor comprehension and appreciation (Gerrig; 1989; Cornelissen; 2004). The possibility of ranking the candidates for aptness is of particular importance.
While several sentences can be seen as a reasonable interpretation of a metaphor, the aptness of such
metaphor to express their meaning can vary (Tourangeau and Sternberg; 1982; Camac and Glucksberg;
1984).
It is important to note that we are treating aptness as a symmetrical phenomenon.
Our annotators were presented with a pair of sentences and were asked to rate the pair for paraphrase aptness. But they always saw the metaphor first, then its candidate paraphrase.
In other words, given a metaphor (the first element) they were asked to determine whether a
given sentence (the second element) was a good paraphrase. We considered this judgment as holding
true also in the opposite direction: it also tells us whether the given metaphor is a good paraphrase of
the literal sentence, and it thus becomes a metaphor aptness judgment.

So if

My lawyer is a shark

can be paraphrased with

My lawyer is greedy and aggressive

with an average human score of 3.4/4.0, we maintain that the reverse is also true:

My lawyer is greedy and aggressive

can be paraphrased with the metaphor

My lawyer is a shark

with the same average score, and thus we say that My lawyer is a shark has an aptness of 3.4/4.0
in expressing the meaning My lawyer is greedy and aggressive.
There is some possibility that, if presented with the opposite frame - the literal sentence first, and
then the metaphor - their aptness judgment might change.
Thus, there is some possibility that metaphor aptness is not a symmetric phenomenon.
I consider this an interesting direction to explore in future studies.

1.3.3 In-context Metaphor Aptness Assessment

Paper 6 offers a concluding study of metaphor aptness. In this case, we use a subset of the previous
corpus to explore the perception of metaphor aptness in extended context, a topic already studied in
cognitive linguistics (Gildea and Glucksberg; 1983). We show the effect of extended context on human
ratings and we try to replicate it by means of a neural architecture.

1.4 Neural Networks


As my research questions show, throughout my thesis I have constantly used neural networks, together
with vector space lexical embeddings, to model metaphoricity. Beyond the “simple", fully connected
neural layers I use in Papers 1 and 2, the neural architectures I have applied in my research are LSTM
and CNN.
An LSTM, or Long Short-Term Memory network, is a recurrent neural network considered well suited to sequence learning tasks, involving inputs such as video, time series, and natural language sentences (Hochreiter and Schmidhuber; 1997).

A Convolutional Neural Network (CNN) is an architecture inspired by the neural patterns observed in the animal visual cortex. It was primarily used for image recognition, but has often been applied to language processing tasks as well (Collobert and Weston; 2008a).
Both LSTM and CNN are able to capture elements of both semantics and syntax (Sboev et al.;
2016; Kim; 2014).
The single architecture most used through this series of studies is a composite structure designed
by me, used to produce a continuous score on a pair of sentences. This architecture consists of two
encoders that take the sentences as input, and a final series of fully connected layers that merge the
encoders' output and produce the value. Each encoder is composed of a CNN (specifically an "Atrous" CNN, a variant particularly useful when the input's information is scarce and could be excessively reduced in the pooling stage), an LSTM and a series of fully connected layers.
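As a purely illustrative rendering of this design, the sketch below writes the composite structure in a Keras-style notation. The layer sizes, filter count and dilation rate are placeholders and do not reproduce the exact configuration reported in Papers 4 and 5.

    # Sketch of the dual-encoder architecture described above: each sentence is
    # encoded by an (atrous) CNN followed by an LSTM and dense layers, and the two
    # encodings are merged to produce a continuous score. All sizes are illustrative.
    from tensorflow.keras import layers, models

    MAX_LEN, EMB_DIM = 30, 300  # assumed maximum sentence length and embedding size

    def make_encoder():
        inp = layers.Input(shape=(MAX_LEN, EMB_DIM))
        x = layers.Conv1D(64, kernel_size=3, dilation_rate=2, padding="same",
                          activation="relu")(inp)   # dilated ("atrous") convolution
        x = layers.LSTM(64)(x)                       # models the sequence of local features
        x = layers.Dropout(0.5)(x)                   # strong dropout against overfitting
        x = layers.Dense(32, activation="relu")(x)
        return models.Model(inp, x)

    sent_a = layers.Input(shape=(MAX_LEN, EMB_DIM))  # e.g. the metaphorical sentence
    sent_b = layers.Input(shape=(MAX_LEN, EMB_DIM))  # e.g. the candidate paraphrase
    merged = layers.Concatenate()([make_encoder()(sent_a), make_encoder()(sent_b)])
    merged = layers.Dense(32, activation="relu")(merged)
    score = layers.Dense(1, activation="sigmoid")(merged)  # gradient aptness score

    model = models.Model([sent_a, sent_b], score)
    model.compile(optimizer="adam", loss="mse")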
I am not the first one to combine CNNs and LSTMs: CNN+LSTM structures have proved fruitful
in a number of studies.
Sainath et al. (2015) explore the advantages of a composite architecture very similar to that of
my encoders, formed by a concatenation of CNN, LSTM and fully connected layers, to show that these
three models can complement each other effectively on language processing tasks.
Wang et al. (2016) use a CNN to extract relevant features for sentiment analysis from single
sentences, and then feed the output to an LSTM to create a sentiment representation of the full text. Vosoughi et al. (2016) show the advantages of combining CNNs and LSTMs for the tasks of tweet semantic similarity and sentiment analysis, and Zhang et al. (2017) use a CNN-LSTM combination to
predict the intensity of the emotion expressed in a tweet.
Similar combinations are also used in extra-linguistic tasks (Bae et al.; 2016; Wang et al.; 2017)
and hybrid tasks like visual question answering (Agrawal et al.; 2016; Johnson et al.; 2017; Santoro
et al.; 2017).
The bottom line is that, rather than modelling the input’s sequentiality as a very first step, it
seems a better idea to first use a CNN to extract local features useful for the task at hand, and then
use an LSTM to model the sequential patterns formed by such features.
I chose this specific architecture quite empirically, as it performed best among several competing
models I designed. In Paper 4, I report the performance of a number of potential variants of my
architecture, showing that the presented structure appears to work best.
Such “ablation experiments" between competing models, as presented in Paper 4, show that the
single most relevant component of my network is the LSTM (since the entire model’s performance
drops deepest when the LSTM is removed or when the number of its filters is reduced below a critical threshold), confirming the importance of sequentiality for this kind of task. The second most important
component appears to be the ACNN. The network’s other features, such as the dense layers, the
dropouts and the merging through concatenation are also important for the architecture’s performance.
Finally, the least important variation proved to be the substitution of the ACNN with a normal CNN -
still resulting in a drop of performance of a couple of points.
This architecture naturally presents a high number of parameters, which can be difficult to control
in conditions of data scarcity. In Papers 4 and 5 I present the results of a 12-fold cross-validation to
ensure that our architecture is not overfitting (or is not overfitting too much) on the presented training
set. The quite strong dropouts I inserted in my encoders also have the role of preventing excessive
overfitting on the dataset.
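For illustration, a k-fold protocol of this kind takes only a few lines; the fold count below matches the 12-fold setup just mentioned, while build_model and the input arrays X_a, X_b and y are placeholders assumed to exist.

    # Sketch of a 12-fold cross-validation loop over a sentence-pair dataset.
    import numpy as np
    from sklearn.model_selection import KFold

    folds = KFold(n_splits=12, shuffle=True, random_state=0)
    losses = []
    for train_idx, test_idx in folds.split(y):
        model = build_model()  # e.g. the dual-encoder sketched earlier in this section
        model.fit([X_a[train_idx], X_b[train_idx]], y[train_idx],
                  epochs=20, verbose=0)
        losses.append(model.evaluate([X_a[test_idx], X_b[test_idx]],
                                     y[test_idx], verbose=0))
    print("mean loss over the 12 folds:", np.mean(losses))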
The reader can refer to Papers 4 and 5 for a more detailed discussion of the model’s architecture.

1.5 Vector space lexical embeddings


The main advantage of using distributional vectors lies in the possibility of exploiting effective semantic
information drawn from large corpora. This appears to be of particular importance when working
under constraints of data scarcity. I mainly used pre-trained vector space semantic models - in the
form of both count-based word vectors (Turney and Pantel; 2010) and of neural embeddings (Mikolov,
Sutskever, Chen, Corrado and Dean; 2013a) - learned from large general corpora.
Recent literature (Collobert and Weston; 2008b; Blacoe and Lapata; 2012; Mikolov, Yih and
Zweig; 2013; Baroni et al.; 2014; Levy et al.; 2015; Rimell et al.; 2016; Cordeiro et al.; 2016) has
widely discussed the advantages and limitations of using count-based vectors versus neural embeddings.
In Paper 1 and Paper 3, we compare the performance of different spaces on the same task. No
major differences in performance appear when using sparse, count-based vectors instead of dense word
embeddings. In the second paper of this series we use count-based vectors. In the last three papers,
instead, we resort to word embeddings only. The vector space semantic model I use most often in
this thesis is a Gensim implementation of Word2Vec (Mikolov, Sutskever, Chen, Corrado and Dean;
2013a), pre-trained on GoogleNews corpus of 3 billion tokens. The space contains 300-dimensional
vectors for 3 million types (words and phrases).
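For illustration, a pre-trained space of this kind can be loaded and queried with Gensim roughly as follows, assuming the standard GoogleNews binary file has been downloaded locally.

    # Loading and querying the pre-trained GoogleNews Word2Vec space with Gensim.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    print(vectors["shark"].shape)                  # (300,): one dense vector per word
    print(vectors.similarity("lawyer", "shark"))   # cosine similarity between two words
    print(vectors.most_similar("metaphor", topn=3))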

1.6 Structure of the thesis


I organized this thesis’ frame in order to give the reader an overview of the general background
informing my papers.
In Chapter 2 I provide a partial summary of the existing literature on metaphor processing in
computational linguistics, together with some more general considerations about the most prominent
views on metaphoricity in linguistics.
Chapter 3 presents the datasets and corpora I used in my studies.
Chapter 4 summarizes the studies of my compilation, discussing how they interconnect and high-
lighting the line of research they represent.
In Chapter 5 I draw some conclusions about my study, provide some concise answers to my original research questions, discuss some characteristics of my own line of work, and briefly outline directions for future research.
Chapter 2

Theoretical background and framework: key concepts

The most general definition of metaphor, also shared by other figures of speech like idioms and
metonymy, is that of an expression that is contextually used outside of its literal meaning (Frege;
1892; Gibbs et al.; 1997; Cacciari and Papagno; 2012). In this sense, metaphor in its basic form is
an intentional form of mis-categorization. If a speaker refers to a person saying He is an elephant,
the interpretation of such a sentence is usually not the literal meaning of the word elephant. Rather,
the sentence’s interpretation resides in a series of prototypical properties that, to an extent, both an
elephant and a person could share, such as having a relatively huge bodily mass, being known to have
an impressive memory, or being perceived as awkward and graceless in movement (Cacciari; 2014).
In this sense, a metaphor implies a process of re-categorization - we can categorize a person as an
elephant under some circumstances (Glucksberg et al.; 1997) - and an analogical shift. To interpret
a non-conventional or novel metaphor, it is necessary to understand, or induce, the shared properties
linking the source and the target domain (Gentner; 1983).
Another essential characteristic of metaphor underlined in the linguistic literature is its compositionality. Metaphors are compositional in nature. A speaker can encounter a new metaphor and try to
decipher its meaning by composing its constituents. When this compositionality is impossible, we are
instead dealing with an idiomatic expression (Sag et al.; 2002; Fraser; 1970; Cruse; 1986; Frege; 1892;
Nunberg et al.; 1994; Liu; 2003; Bohrn et al.; 2012; Cacciari; 2014; Gibbs; 1993, 1994; Torre; 2014;
Geeraert et al.; 2017).1

1 This is of course something of an overstatement. As observed throughout this thesis, compositionality of both metaphors and idioms is actually a gradient phenomenon (Nunberg et al.; 1994; Wulff; 2008). Different idioms have different degrees of opacity, and different metaphors have different degrees of transparency (Titone and Libben; 2014). A gradient approach to idiomaticity is also assumed in Senaldi et al. (2016) and Senaldi et al. (2017), constituting the basis of the second paper of this compilation.

Many linguistic models of metaphoricity have been produced over the years (Morgan; 1980;
Sweetser; 1991; Vogel; 2001; Wilson; 2011; Romero and Soria; 2014). One of the most influential
of these models is arguably Lakoff’s Conceptual Metaphor Theory (Lakoff; 1989, 1993; Lakoff and
Johnson; 2008b,a; McGlone; 1996). This theory postulates that several metaphors used in natural
language derive from “seed" conceptual metaphors. For example, according to this theory expressions
like Saving my time and Spending some time derive from the conceptual metaphor “Time is Money".
According to this view, many of the metaphors we use are implementations of an implicit analogy
between two general topics or concepts. This is a view often advocated in computational linguistics
approaches to metaphor, and relatively easy to adjust to NLP tools like ontologies and semantic spaces.
Depending on how we draw the line between metaphorical and literal uses of a word or expression,
the number of metaphors found in everyday language is more or less striking, but there is general
consensus on the idea, also supported by corpus-based studies, that it is a pervasive phenomenon in
natural language (Cameron; 2003).

Structure of the chapter


As I describe in the Introduction, my research revolves around two main aims or tasks: (i)
performing metaphor detection and (ii) going beyond metaphor detection to tackle metaphor aptness
assessment.
Following this scheme, I will divide my review of previous and current works into two main parts.
In the first part I will discuss works on automatic metaphor detection. In the second part I will discuss
works that push automatic metaphor processing beyond the field of mere detection.
These two sections will have different lengths: the amount of work devoted in computational
linguistics to metaphor detection is more significant than the amount of work devoted to any other
kind of metaphor processing.
For the sake of clarity, I have also organized my review of metaphor detection studies along the
lines that I deemed most important for my own research:

1. The comparison between unsupervised and supervised approaches to detection.

2. The use of neural networks to learn metaphoricity.

3. The role of vector space semantic models.


2.1 Metaphor detection in NLP


Many Natural Language Processing studies tend to deal with metaphor as a sort of semantic anomaly.
In part for this reason, the main focus of metaphor research in NLP until now has been detection
(Dunn et al.; 2014; Veale et al.; 2016a), also due to its usefulness in fields like quantitative corpus
linguistic studies (Krennmayr; 2015), automatic text evaluation (Klebanov et al.; 2008) and rating
(Klebanov and Flor; 2013), and so on.
Automatic metaphor detection can be the first step towards more extensive figurative language
processing in computational linguistics. But it is quite a difficult task in its own right. One possible reason for its difficulty is the complexity of the topic, since metaphor is not a straightforward
phenomenon. There are several ways in which we can use a word as a metaphor. For example, some
metaphors turn on the concrete/abstract polarity (A strong democracy), others on the concept of
animacy (The flowers were dancing), others on other semantic dimensions. The conventional/novel
continuum also plays an important role, since many conventional metaphors are not perceived as
metaphoric by most readers, and it is debatable whether they can still be considered metaphors
at all (see cases like the legs of a chair, a program’s bug, etc.).
This complexity also makes it tricky to create human-annotated datasets of metaphors.
Despite its difficulty, metaphor detection has been attempted through several different technolo-
gies. Shutova (2011) and Veale et al. (2016b) offer a good review of traditional methods for metaphor
processing.
It is possible to draw a rough division of metaphor detection studies into unsupervised and supervised approaches.

2.1.1 Unsupervised approaches

Unsupervised approaches tend to use large knowledge bases (Li et al.; 2013), ontologies (Krishnaku-
maran and Zhu; 2007), semantic similarity graphs (Li and Sporleder; 2010a) or vector space semantic
models (Shutova et al.; 2010a; Gutiérrez et al.; 2016) to define words' "standard" meanings. A minority
of studies also use measures from information theory such as pointwise mutual information (Mohler
et al.; 2013), entropy (Schlechtweg et al.; 2017) or Jensen-Shannon divergence (Pernes; 2016) to detect
metaphoricity.
The advantage of using ontologies in metaphor detection is similar to the advantage offered by
vector space semantics. If it is possible to recognize a novel metaphor as a word used out of its usual
context, it is possible to use lexicographic resources like WordNet (Miller; 1995) or MultiWordNet
(Pianta et al.; 2002) to build a simple metaphor detection algorithm. For example, WordNet’s defini-
tions, together with the network’s lexical structure, can be used to detect a sudden shift in topic in

a sentence, which might be due to the presence of a metaphor. Similar ontologies also offer the possibility of moving along the hyponym/hypernym axis, which, in some cases, helps to identify metaphors
and metonymies (Schlechtweg et al.; 2017). They can also be used to attempt an interpretation of the
detected metaphors.
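As a deliberately naive illustration of this idea (and not of any specific published system), the sketch below uses NLTK's WordNet interface to flag the content word whose senses are, on average, least similar to the senses of the other content words in a sentence. The choice of path similarity and of a simple average are assumptions made only for the example.

    # Naive sketch of WordNet-based anomaly spotting: the content word whose
    # synsets are least similar to those of the other words is a candidate for a
    # topical shift, and hence possibly a metaphor. Requires the WordNet data
    # (nltk.download("wordnet")).
    from nltk.corpus import wordnet as wn

    def max_similarity(word_a, word_b):
        """Best path similarity between any pair of synsets of the two words."""
        best = 0.0
        for s1 in wn.synsets(word_a):
            for s2 in wn.synsets(word_b):
                sim = s1.path_similarity(s2)
                if sim is not None and sim > best:
                    best = sim
        return best

    def least_fitting_word(content_words):
        """Return the word with the lowest average similarity to the others."""
        scores = {}
        for w in content_words:
            others = [o for o in content_words if o != w]
            scores[w] = sum(max_similarity(w, o) for o in others) / len(others)
        return min(scores, key=scores.get)

    # e.g. least_fitting_word(["lawyer", "court", "shark"]) is likely to single
    # out "shark" as the word most distant from the rest of the sentence.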
Such approaches often treat metaphor detection as a special case of Word Sense Disambiguation
(Banerjee and Pedersen; 2002). Although figurative language is not limited to the problem of word
sense disambiguation, this view of metaphoricity has its own motivations. In many respects, the
problems presented by metaphor detection2 are similar to the problems of "new sense detection", the
task of automatically recognizing whether, in a corpus, a word appears used in a new sense. Novel
sense detection techniques work on contextual information, which is also the case for figurative language
processing. Word sense induction overlaps with the common interpretation of figurative language in
its concern with the employment of a word or expression beyond its usual meaning.
In this frame, we can imagine figurative language as the emergence of a "new sense" of a word that can be discovered by analyzing the word's context. This is also the perspective adopted by studies using
topic modelling to perform metaphor detection (Li and Sporleder; 2010b; Schulder and Hovy; 2015).
A problem of these approaches is that topic modeling is usually based on the presence of topic-specific
terminology - in other words, it is based on how the words were linked to specific topics in the training
corpus.3
While this can account for several metaphors, we can easily imagine a new metaphor arising
in the form of topic-related terminology used in a new way. Some studies have also noted that words
that strongly represent a specific topic are less likely to be used metaphorically (Klebanov et al.; 2009).
Unsupervised approaches involving vector space semantic models are usually preoccupied with
designing the best operation to apply to the word vectors composing a metaphor in order to cluster
them away from literal expressions (Mohler et al.; 2014; Gutiérrez et al.; 2016; Gong et al.; 2017).
In many cases, these studies still follow the line of new word sense detection. They try to
induce the new sense of a word from the context it occurs in. A radically new context, measured
as distributional distance, becomes the strongest hint for the metaphoric use of a term. Similar
approaches have also been applied in the unsupervised detection of idiomatic expressions (Lin; 1999;
Lenci; 2008; Fazly et al.; 2009; Lenci; 2018; Turney and Pantel; 2010; Fazly and Stevenson; 2008;
Mitchell and Lapata; 2010; Krčmář et al.; 2013; Senaldi et al.; 2017). For a more detailed discussion
of unsupervised detection of figures the reader can refer to Paper 2.

2 With the usual exception of completely conventionalized metaphors.


3 There are, however, more sophisticated topic models that partly overcome this limitation. For example, distributional
Random Indexing (Jurgens and Stevens; 2009) is a topic modelling technique to apply when there isn’t a “standard
corpus" to model a basic meaning from. Random Indexing has been used to track different word senses over time (Basile
et al.; 2015).

2.1.2 Supervised approaches

In supervised approaches, machine learning algorithms are trained on annotated corpora to detect
metaphoricity patterns. Feature-based classifiers constitute one of the main tools used in this category
of studies (Turney et al.; 2011; Hovy et al.; 2013; Tsvetkov et al.; 2014a).
If unsupervised approaches to metaphor detection have the practical advantage of avoiding the
data scarcity bottleneck, feature-based classifiers can give interesting insights into the combinations of
features - often psycholinguistic properties of words - that appear to be useful for the detection of
metaphors (Köper and im Walde; 2016b; Köper and Im Walde; 2016a; Köper and im Walde; 2017).
These sets of features can include several different dimensions of a word’s meaning and perception,
such as its syntactic role, its “conceptual" nature, the semantic class it belongs to (Klebanov et al.;
2016), its affective valence (Rai et al.; 2016), as well as more subtle characteristics such as imageability
(Broadwell et al.; 2013).
A feature often used in similar experiments is a word’s degree of concreteness (Klebanov et al.;
2015), whose importance for metaphor processing is also attested in cognitive linguistics (Forgács et al.; 2015). Shutova et al. (2010b) present an approach that we could call "minimally supervised" to
metaphor detection: a small set of manually annotated metaphors is used to harvest, through dis-
tributional semantic similarity, several related metaphorical expressions from a text. In more recent
years, Shutova et al. (2016) have tried to use visual features in combination with word embeddings for
a supervised metaphor detection system.
Combinations of different models, such as unigrams, part-of-speech and topic models, are often
explored to find the best pipelines to detect metaphoricity (Klebanov et al.; 2014). Selectional prefer-
ence violation has also often been used as a feature for metaphor detection, both in unsupervised and
supervised frames (Haagsma and Bjerva; 2016).
The difference between novel and conventionalized metaphors is often felt in this field, since
systems to detect novel metaphors (Schulder and Hovy; 2014) tend to be different - in terms of input
features and machine learning classifiers - from systems aimed at the detection of conventionalized
metaphors (Mohler et al.; 2013).
In general, many supervised approaches to metaphor detection resort to a large number of different
features. Using a large set of features requires a large set of resources, which can appear as a drawback
of traditional classifiers (Schulder and Hovy; 2014). This is one of the reasons why, as in many
other sectors of computational linguistics, neural networks have recently become more popular than
traditional classifiers to approach metaphor detection. 4

4 Also, the rise of very large corpora for training - or pre-training, as in the case of semantic spaces used in pipelines
- has contributed to making neural networks fairly competitive.

2.1.3 Neural Networks

Many of the recent supervised experiments in metaphor detection employ deep neural networks. To
name a few recent papers, Rei et al. (2017) present a task-specific neural architecture to predict
metaphoricity in Verb-Noun and Adjective-Noun bigrams, Do Dinh and Gurevych (2016) use a multi-
layered fully connected network to identify token-level metaphors in sentences - an approach later
used in Gutierrez et al. (2017) to predict first-episode schizophrenia in patients through an automatic
analysis of their use of language - and Sun and Xie (2017) apply bi-LSTMs to the task of verb metaphor
detection on unconstrained text.
Nonetheless, the mechanisms that lead to metaphor processing are still an object of investigation in
both artificial and human neural networks (see for example Lacey et al. (2017)). While the application
of more sophisticated network architectures to metaphor detection has led to encouraging results,
explaining the way machine learning algorithms model metaphoricity in text has become increasingly
difficult. Visualizing and understanding the networks themselves is a difficult task (Karpathy et al.;
2015; Dai et al.; 2017) and it is a current concern of the NLP community to create systems to make
the networks’ inner workings more transparent (Ancona et al.; 2017; Lake et al.; 2017).

2.1.4 Vector space semantic models

The idea that we can infer the meaning of a word from its "neighbours" is the basis of vector space semantic models. In a vector space model, each word is associated with a point in a multi-dimensional space that models its contextual distribution in a large corpus. Such points can either be sparse count-based vectors, reporting the number of times a word appears in a given context, or dense embeddings that maximize the probability of a word appearing in a given context.
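As a toy illustration of the first kind of representation, a sparse count-based vector can be built by counting co-occurrences within a symmetric window; the corpus and window size below are obviously placeholders.

    # Toy sketch of sparse, count-based word vectors: co-occurrence counts
    # collected within a symmetric window over a tiny illustrative corpus.
    from collections import Counter, defaultdict

    corpus = [
        "my lawyer is a shark".split(),
        "the shark swims in the sea".split(),
        "my lawyer is greedy and aggressive".split(),
    ]
    WINDOW = 2
    cooc = defaultdict(Counter)
    for sentence in corpus:
        for i, word in enumerate(sentence):
            for j in range(max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)):
                if j != i:
                    cooc[word][sentence[j]] += 1

    # Each row of `cooc` is a sparse count-based vector over context words:
    print(cooc["shark"])  # Counter({'is': 1, 'a': 1, 'the': 1, 'swims': 1, 'in': 1})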
Distributional semantic vector spaces are useful to model several aspects of word meaning and to
quantify the semantic relationships between words and expressions in corpora, such as the semantic
similarity between words in a given text.
Another important characteristic of semantic spaces is that they represent words on a multi-
dimensional continuum. This perspective allows a high degree of flexibility when operating with
elements larger than single terms. For example, it is possible to compute the mean vector of a sentence
as the mean of the vectors of its individual words. In this way, it is possible to 'locate' the words composing said sentence at different distances from the sentence's mean semantic value. This is a simple method that is sometimes used to spot "unfitting" terms in a sequence. At the same time, semantic spaces can, to an extent, reproduce the same valuable word relations encoded in traditional ontologies, such as hyponymy, hypernymy, co-hyponymy and antonymy (Lenci et al.; 2015).
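A minimal sketch of the mean-vector heuristic mentioned above is given below; emb is an assumed lookup from words to dense vectors (for instance the Word2Vec space described in Section 1.5), and cosine distance is used, as one common choice, to measure distance from the sentence centroid.

    # Sketch of the mean-vector heuristic: average the vectors of a sentence and
    # measure how far each word lies from that centroid. A word that is unusually
    # distant is a candidate "unfitting" (possibly figurative) term.
    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def distances_from_centroid(words, emb):
        vectors = np.stack([emb[w] for w in words])
        centroid = vectors.mean(axis=0)
        return {w: 1.0 - cosine(v, centroid) for w, v in zip(words, vectors)}

    # e.g. distances_from_centroid("the flowers were dancing".split(), emb)
    # returns one distance per word; the largest one points to the word that sits
    # least comfortably in the sentence's average semantic neighbourhood.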

Vector space semantic models are also one of the most used resources in both supervised and
unsupervised metaphor detection (Haagsma and Bjerva; 2016; Kesarwani et al.; 2017), since they can
provide the "fingerprint" of a word's semantic profile without the need to manually craft a set of ad hoc features.
The major advantage of using distributional semantic spaces in metaphor detection - and also in
the detection of other figures of speech, like metonymy (Nastase and Strube; 2009) - lies in the fact
that metaphors are a contextual phenomenon. A metaphor can be seen as fundamentally composed
of two different semantic domains: one domain acts as a target - and the words related to it have a literal meaning - while the other domain acts as a source - and the words drawn from it have a figurative meaning. The apparent mismatch between source and target domain can, at least in theory, appear
through the difference between the semantic vectors of the words used literally and of those used
metaphorically in a sentence or expression (Steen et al.; 2014; Gutiérrez et al.; 2016). In this frame, semantic spaces appear as a very flexible and powerful tool to model such semantic domains in terms of word clustering and distributional similarity (Mohler et al.; 2014).
Another advantage of using semantic spaces is that they can be trained on very large unannotated
corpora. Being pre-trained on substantial amounts of text, semantic spaces can provide the classifiers trained on small metaphor detection datasets with useful information drawn from big data.
Studies using different resources for their classifiers also report that distributional vectors are the
best performing single resource to tackle metaphor detection (Köper and im Walde; 2016b).
It is worth underlining that approaches using vector space semantic models tend to
advocate the idea, widely maintained in linguistics, that metaphors are compositional. While they
are not the only linguistic tool to work with compositionality (Lappin and Zadrozny; 2000), vector
space semantic models have proven useful to deal with basic forms of word composition in a number
of cases (Lenci and Zamparelli; 2010; Ferrone and Zanzotto; 2015). Given the contextual nature of
this figure of speech, in some cases basic metaphor detection has been used as a way to test the
quality of semantic spaces themselves (Srivastava and Hovy; 2014; Köper and im Walde; 2016b), and
ad hoc semantic spaces have been designed to deal with metaphor detection (Bulat et al.; 2017).
The presence of analogical symmetries between concepts, something not too far from Lakoff’s idea
of conceptual metaphors, has also been presented as one of the most interesting features of neural
word embeddings (Mikolov, Sutskever, Chen, Corrado and Dean; 2013b). It is also possible to use
vector space semantic models together with other features, such as visual or psycholinguistic features
(Tsvetkov et al.; 2014b; Rai et al.; 2016; Do Dinh and Gurevych; 2016; Shutova et al.; 2016). This is
the same kind of hybrid approach detailed in the third study of this compilation.
In the first three papers of this compilation we also provide analyses of the performance of our
models on different distributional semantic models.

2.2 Beyond detection


The largest percentage of studies in automatic metaphor processing deals with metaphor detection. It
is relatively rare to come across studies that tackle more complex aspects of metaphor, like metaphor
understanding and evaluation. For example, a relatively small number of studies within computational linguistics have dealt with automatic metaphor interpretation, in contrast with the wide
literature on the topic developed in other branches of linguistics (Veale and Hao; 2007b; Shutova;
2010a; Shutova et al.; 2010b; Shutova; 2010b; Bollegala and Shutova; 2013; Shutova et al.; 2013; Su
et al.; 2017).
The relationship between figurative language and sentiment analysis in corpora is another dimension that goes beyond mere detection, and some systems to identify sentiment polarity in figurative sentences exist (Rentoumi et al.; 2009; Ghosh et al.; 2015; Kozareva; 2015; Mohammad et al.; 2016). Still, sentiment analysis of metaphoric sentences has not been widely ex-
plored in natural language processing yet. Interpreting the motivation behind the use of a metaphor
is also a problem that few pioneers have addressed (Jang et al.; 2014).
At the same time, fields that could provide valuable contributions to a deeper analysis of metaphors’
power of expression, such as paraphrase identification (Socher et al.; 2011; Madnani et al.; 2012), don’t
usually take figurative language into account.5 As an obvious effect of this lack of research, datasets
for automatic metaphor processing going beyond metaphor detection are virtually non-existent.
Finally, an aspect of figurative language processing which is more talked about than tackled is
generation. While several papers maintain the importance of figurative language processing to build
better interactive systems - like, for example, dialogue systems (Birke and Sarkar; 2006; Davidov et al.;
2010) and artistic text generation (Colton et al.; 2012) - few efforts have actually been made in the
direction of figurative language generation. For the pioneers in this specific field, also often interested
in metaphor aptness, see for example Veale and Hao (2007a); Veale and Li (2012); Veale (2012, 2017).

5 There are exceptions. Some studies in textual entailment have shown interest in the peculiarities of figurative
language processing (Agerri; 2008; Turney; 2013).
Chapter 3

Data collection and resources

Collecting data for metaphor processing is a demanding task. One problem that arises when dealing
with annotated data for metaphor detection is the definition of metaphor itself. While many cases will
appear metaphorical or literal to the vast majority of annotators (The street was a river of people vs.
The Nile is a river ), every real document contains a number of less clear-cut elements.
Annotators’ sensibility, background and linguistic training can play a major role in cases of
metaphors ingrained in everyday language. Depending on how fine-grained and linguistically aware
the annotators are, the number and types of metaphors detected in a corpus can vary. In part due
to these obstacles, the scientific community has produced a small amount of open source annotated
corpora for figurative language studies, and major, standard datasets are still wanted (Shutova; 2011).
It is possible to divide the available resources into three main categories.

1. The first category represents conceptual or ontological resources, such as the list presented in Lakoff
et al. (1991). It includes handcrafted, theory-driven constructions, like catalogs of widespread
metaphors. While these resources can be of great utility for other approaches, they are of little
interest for the main goals of this research.

2. The second category includes datasets annotated for metaphoricity, like the ones I have been
using for my own research. These datasets usually consist of selected text extracts where words
or sentences are annotated as “figurative” or “metaphorical”. Sometimes, these datasets can be
created as an expansion of “first category” resources. For example, the metaphor dataset created by Hovy et al. (2013) contains 3879 sentences generated by bootstrapping from lists of classical metaphors. Amazon Mechanical Turk users later annotated these sentences as metaphorical
or literal. Seven different annotators labeled each sentence, and the sentences that the majority
found impossible to judge were discarded. These resources are often centered on a specific part
of speech used metaphorically. For example, Dunn (2013a) produced and publicly released a


corpus of 500 sentences centered on a set of 25 verbs, annotated as metaphorical or literal. The largest existing dataset of this kind is probably the VU Amsterdam Metaphor Corpus (Steen et al.; 2010), which I describe in more detail in Section 3.3.

3. The last category includes datasets that divide figurative language into types (for example,
sentences can be linked to the general kind of metaphor they implement, and so on). These
annotated datasets usually present sets of sentences annotated with some kind of conceptual
mapping or metaphorical interpretation. An open source example is Gordon et al. (2015)'s corpus, containing 1771 metaphorical sentences. This corpus contains only examples of figurative language (thus there is no “literal” counterpart in the corpus), annotated for source and target domain. The corpus consists partly of hand-annotated sentences and partly of automatically annotated sentences that were manually corrected.

Cases of “composite” annotation of figures also exist. Dunn (2013b), for example, released a re-annotated subset of the VU Amsterdam Corpus where sentences are annotated not only as metaphorical or literal, but also as “humorous” or not, constituting one of the rare corpora where figurativity is divided into categories. Dunn used this subset to compare four metaphor recognition systems.
Most of these resources suffer from the second major problem of metaphor corpora: data scarcity. The community has often applied traditional remedies for data scarcity, such as artificially bootstrapping the dataset (He and Liu; 2017) and looking for complementary resources that allow the use of richer feature sets (Tanguy et al.; 2012).
An effect of data scarcity is that many studies tend to create their own resource to train and test their models. A common complaint is that different systems in figurative language processing are tested on different datasets (Shutova; 2011).
In the first three publications of this thesis, we used pre-existing resources for metaphor detection. In the first and third papers, these resources present a binary view of figurativity: words and expressions
are labelled as either figurative (1) or literal (0). The dataset used in the second paper presents instead
a nuanced view of figurativity: the expressions in the dataset have a continuous score of idiomaticity,
going from completely literal to unmistakably idiomatic (Senaldi et al.; 2016, 2017). We used the same
approach for metaphor aptness judgments in papers 4 to 6.
In general, we believe that continuous scores are more suitable than binary scores when dealing
with figurative language, both for detection and aptness assessment. Not all figurative expressions
have the same level of figurativity. Figurative expressions tend to become more and more conventional precisely through their usage in everyday communication (Bowdle and Gentner; 2005) - thus follow the law might be perceived as having a somewhat lower level of figurativity than the winds of change. I discuss this aspect in more detail in my conclusions. To conclude this chapter, I will now provide a cursory presentation of the corpora used in this thesis.

3.1 Adjective-Noun Dataset


This dataset was published by Gutiérrez et al. (2016) and contains more than eight thousand Adjective-
Noun combinations. Each combination is annotated as metaphoric or literal. The metaphoric element
is usually the adjective (cf. clean performance vs clean floor ).

3.2 Idiom dataset


This dataset was collected by Senaldi et al. (2016) and Senaldi et al. (2017). It is composed of lists of
Italian idioms and literal expressions annotated by nine linguistics students. Annotators were asked
to assign to each expression an idiomaticity score between 1 and 7.

3.3 VU Amsterdam Corpus


The VU Amsterdam Corpus is one of the largest existing corpora for metaphor detection (Steen
et al.; 2010; Krennmayr and Steen; 2017). This corpus is a subsection of the British National Corpus (Consortium et al.; 2007), divided into four sections: news texts, fiction, academic texts and conversations. It contains about 190,000 lexical units, annotated by 4 experts looking for fine-grained metaphors. Even so-called dead metaphors and widespread metaphoric usages are marked as figurative, for example follow the law or we observed the target. Each word is annotated as
figurative (“metaphoric”) or literal, following a standardized metaphor identification procedure called
MIP (Steen; 2010) that was meant to increase inter-annotator agreement. The MIP procedure instructs
annotators to determine whether each lexical unit can have a more basic contemporary meaning than
the one given by the context. This procedure defines a word’s “more basic meaning” as: more concrete,
more related to bodily action, more precise, or historically older than the salient meaning of the word
in the annotated context. In this frame, the most basic meaning of a word is not necessarily the most
frequent one. This corpus has been used in several studies about metaphor (Dunn; 2013b; Niculae and
Yaneva; 2013) and is probably the most widespread benchmark for metaphor detection. In the third
paper of this compilation, we provide some discussion about what we consider the strong and weak
aspects of this dataset.

3.4 Paraphrase corpus


We built this corpus to provide a proof of concept on paraphrase detection, with the idea of using it
as a framework to deal with automatic metaphor aptness assessment. While traditional paraphrase
detection resources adopt binary scores (Dolan et al.; 2004), we treated paraphrase identification

as an ordering task. This dataset contains 250 sets of 5 sentences each, labeled on a 1-5 scale of
paraphrasehood, in analogy with semantic similarity datasets (Xu et al.; 2015; Agirre et al.; 2016).
Every group of five sentences contains 1 reference sentence and 4 candidate paraphrases. This corpus was annotated by me. Since this was a proof-of-concept study meant to establish a new frame for metaphor paraphrase processing, I did not run a crowd-sourced annotation at this stage, as I did in the following works. While its annotation scheme supports graded semantic similarity labels, it also provides sets of related elements. It is thus possible to score or order each pair of elements independently from the others. For a detailed discussion of its characteristics and design the reader can refer to Paper 4.

3.5 Metaphor Aptness Corpus


This corpus' structure mirrors the Paraphrase dataset. It contains 250 groups of five sentences annotated on a 1-4 scale of aptness. The first sentence of each group contains a metaphor, and the remaining 4 sentences are candidate literal interpretations of the reference.1 As far as I know, this is the only existing resource of this kind. We annotated this dataset through crowd sourcing. We collected human aptness judgments through Amazon Mechanical Turk (AMT). 20 annotators rated each pair of sentences. The correlation between the annotators' mean judgments and my own annotation of the corpus was close to 0.93. We consider this dataset as currently under development.
An extensive discussion is provided in Paper 5.

3.6 Contextual Metaphor Aptness Corpus


Rather than a dataset in its own right, this is a subsection of our resource for Metaphor Aptness Assessment. It contains 200 pairs of sentences drawn from the aforementioned dataset. I provided these pairs with extended context - 2 sentences of context for each original sentence - and I re-annotated each sentence for aptness through crowd sourcing, following the same procedure used for the main dataset.
An average of 14 annotators (after filtering out the rogues) annotated each sentence. The agreement
between the annotators’ judgments and my own judgments, calculated as a Pearson correlation, is
close to 0.9. For an extensive discussion about this small dataset the reader can refer to Paper 6.

1 This dataset contains both examples of metaphors and similes. Metaphors and similes cannot always be treated
as equivalent elements in linguistics (Sam and Catrinel; 2006; Glucksberg; 2008), but for the purposes of our study we
considered them as belonging to the same category.
Chapter 4

Summary of the studies

The series of studies I present in this compilation follows, and partly answers, the main questions I outlined in the Introduction. This is the logic within which they should be read.

To what extent can we exploit distributional semantic spaces through artificial neural
networks to perform supervised metaphor detection?

As I discussed in the Introduction, it is possible to reformulate this question in a somewhat more “linguistic” frame as two different interrogatives: To what extent can we exploit the distributional profile of single words to detect their metaphoric usage in text through a supervised machine learning approach? To what extent would such a process be compositional?
This is the leading aim of the first three papers of my compilation. Each paper develops on the
previous one.

1. Paper 1 - “Deep" Learning: Detecting Metaphoricity in Adjective-Noun Pairs. We use a combina-


tion of vector space lexical embeddings and neural networks to perform out-of-context metaphor
detection.

2. Paper 2 - Finding the Neural net: Deep-learning Idiom Type Identification from Distributional
Vectors. We explore the importance of individual words’ distributional profile in metaphor
detection and the compositionality of the approach presented in Paper 1 through a comparison
with the twin-task of idiom identification.

3. Paper 3 - Bigrams and BiLSTMs: Two neural networks for sequential metaphor detection. We ap-
ply a combination of vector space lexical embeddings and neural networks to in-context metaphor
detection.


Is it possible to use the same approach to go beyond metaphor detection, and tackle
metaphor aptness assessment as a natural language processing task? What is the best
way of dealing with metaphor aptness assessment, in terms of both dataset structure and
task design?

These are the leading questions for the second part of my thesis. As before, the three papers represent
a unitary development.

1. Paper 4 - Deep Learning of Binary and Gradient Judgements for Semantic Paraphrase. We define
the dataset structure and neural architecture we will use for metaphor aptness assessment. As
a middle ground step, we define them through a task of “general" paraphrase detection and
ranking.

2. Paper 5 - Predicting Human Metaphor Paraphrase Judgments with Deep Neural Networks. We ap-
ply the dataset structure and neural architecture defined in our previous publication to metaphor
aptness assessment.

3. Paper 6 - The Effect of Context on Metaphor Paraphrase Aptness Judgments. As with the last
paper of the previous triad, we explore automatic metaphor aptness assessment in a frame of
extended context.

Both groups of studies are related, as are the research questions that dictated them. The first group of studies applies a specific combination of tools, and a specific view of metaphoricity, to a relatively traditional task - metaphor detection. The second group represents an attempt to bring such a combination of tools, and view of metaphoricity, beyond the limits of metaphor detection into a less studied, and even more challenging, task.

4.1 Study I

“Deep" Learning: Detecting Metaphoricity in Adjective-Noun Pairs


In the first study of this compilation we use a single-layered, fully connected neural network to detect
out-of-context metaphors in a set of expressions compiled and annotated by Gutiérrez et al. (2016).
These are simple Adjective-Noun expressions annotated for metaphoricity: sweet drink (literal) versus
sweet word (metaphorical). Given its scope, this dataset is large: 8592 phrases. We extracted the
distributional word vectors of each expression before feeding them to the neural network. We then
trained our network on the concatenated vectors of each expression. We show that the neural network
is able to learn the difference between metaphors and literal expressions with a high degree of precision.
We also show that our network needs little training data (down to 10% of the corpus) to reach a high
level of accuracy (up to 94%).
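To make the general setup concrete, the following is a minimal sketch of this kind of classifier: a single fully connected layer over the concatenated adjective and noun vectors, trained to output a metaphoricity probability. The PyTorch formulation, the dimensions and the toy training loop are illustrative assumptions, not the implementation used in the paper.

# A minimal sketch, assuming PyTorch and 300-dimensional pre-trained vectors:
# one fully connected layer over concatenated adjective-noun embeddings.
import torch
import torch.nn as nn

EMB_DIM = 300  # assumed dimensionality of the pre-trained word vectors

class ANClassifier(nn.Module):
    def __init__(self, emb_dim=EMB_DIM):
        super().__init__()
        # a single fully connected layer over the [adjective; noun] vectors
        self.out = nn.Linear(2 * emb_dim, 1)

    def forward(self, adj_vec, noun_vec):
        pair = torch.cat([adj_vec, noun_vec], dim=-1)
        return torch.sigmoid(self.out(pair))  # probability of being metaphorical

# toy usage with random vectors standing in for real embeddings
model = ANClassifier()
adj = torch.randn(4, EMB_DIM)   # e.g. vectors for "sweet", "clean", ...
noun = torch.randn(4, EMB_DIM)  # e.g. vectors for "word", "floor", ...
labels = torch.tensor([[1.], [0.], [1.], [0.]])  # 1 = metaphorical

loss_fn = nn.BCELoss()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(10):
    optim.zero_grad()
    loss = loss_fn(model(adj, noun), labels)
    loss.backward()
    optim.step()

In practice the embeddings would be looked up in a pre-trained distributional space rather than sampled at random; the point of the sketch is only the overall shape of the model and its input.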
We argue that the neural network is effective in using the pre-encoded knowledge present in the
distributional space to learn from scarce annotated data. This is not a secondary benefit: being able
to perform supervised learning on small training sets can be of great help for metaphor detection.
We present various ways of testing our model, and different kinds of cross-validation, training and testing our model on different partitions of the dataset. As an ablation experiment, we also train and test our network on 5 different semantic spaces. Finally, we try to analyze the network's performance to understand which semantic features it exploits to learn metaphoricity. Our conclusion is that the classifier is detecting the abstract/concrete semantic shift present in adjectives used
metaphorically (A heavy table vs A heavy feeling). While our results were good from a technical point
of view, this study focuses on a narrow task. Also, the abstract gradient accounts only for some kinds
of metaphor.
This study presents the essential combination of technical tools (neural networks) and semantic
resources (distributional spaces) that I use throughout the rest of the thesis. For more details about
this aspect the reader can refer to Chapter 1.

4.1.1 My contribution to the paper

I am responsible for the central idea of the paper, together with the basic implementation of our
experiment. Mehdi Ghanimifard developed and analyzed most of the ablation trials presented in the
study, while Stergios Chatzikyriakidis focused on the theoretical background and selected the dataset.

4.2 Study II

Finding the Neural net: Deep-learning Idiom Type Identification from Distributional Vectors

This study can be seen as a proof of concept. We trained a neural classifier on various datasets of AN
and VN idioms and literal expressions both in English and Italian. Each expression in the datasets
was rated by nine linguistics students on an “idiomaticity” scale of 1 to 7. Count-based distributional spaces are the only resource we used to provide the network with semantic information. While this study is carried out on very small datasets, I think its results help trace the line between metaphors and idioms. It is also a way to prove the importance of single words' distributional profiles in metaphor
detection.
In our experiments, the essentially compositional nature of metaphors with respect to idioms appears clearly. When applied to idiom identification, combining the word vectors of an expression does not lead to effective learning. For the classifier to distinguish idiomatic expressions, it is instead necessary to train it on the contextual vector of the whole expression. In other words, only when idioms are treated as single units is the classifier able to learn idiomaticity. Idioms act like unitary linguistic tokens, while metaphors share the compositional nature of their literal equivalents. Our results confirm the view that metaphors mirror a transparent mapping from source to target domain, while idioms exhibit forms of semantic non-compositionality and morphosyntactic rigidity.
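To make the contrast between the two input strategies concrete, the following is a minimal sketch of count-based context vectors built either for the component words of an expression or for the whole expression treated as a single token. The toy corpus, window size and function names are assumptions for illustration; the spaces used in Paper 2 were built from much larger corpora, with appropriate weighting.

# A minimal sketch, assuming whitespace tokenisation: co-occurrence counts
# for (a) the component words of an expression and (b) the expression as one unit.
from collections import Counter

corpus = [
    "he decided to break the ice with a joke",
    "she had to break the ice before the meeting",
    "the ship could not break the thick ice",
]

def context_counts(sentences, target, window=2):
    """Co-occurrence counts of `target` (a tuple of tokens) within a window."""
    counts = Counter()
    n = len(target)
    for sent in sentences:
        toks = sent.split()
        for i in range(len(toks) - n + 1):
            if tuple(toks[i:i + n]) == target:
                left = toks[max(0, i - window):i]
                right = toks[i + n:i + n + window]
                counts.update(left + right)
    return counts

# (a) compositional input: one vector per component word, to be concatenated
v_break = context_counts(corpus, ("break",))
v_ice = context_counts(corpus, ("ice",))

# (b) unitary input: one vector for "break the ice" treated as a single token
v_break_the_ice = context_counts(corpus, ("break", "the", "ice"))
print(v_break_the_ice)

In this toy example the unitary vector only collects contexts of the idiomatic occurrences, while the component-word vectors mix literal and idiomatic uses; this is, in miniature, the distinction that the classifier in Paper 2 could exploit only when idioms were represented as single units.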

4.2.1 My contribution to the paper

The main structure of this research was discussed and elaborated by Marco Senaldi, Alessandro Lenci
and me. I developed and implemented the neural classifier used in this study and I both designed and
performed all of the experiments. Marco Senaldi provided all the datasets and produced most of the
theoretical background and analytical discussion of the paper.

4.3 Study III

Bigrams and BiLSTMs: Two neural networks for sequential metaphor detection

This study was carried out as a contribution to the shared task on metaphor detection organized by NAACL 2018's First Workshop on Figurative Language Processing (Leong et al.; 2018). To

test the model’s performance, we used the training and test sets defined by the Workshop’s task. Our
results were also reproduced independently in order to assess our performance. This paper represents
our attempt to deal with metaphor detection on real text. The corpus we use to both train and test
our models is the VU Amsterdam Corpus, a subsection of the British National Corpus annotated
for metaphoricity (see also Chapter 3). We provide an analysis and critical discussion of the corpus,
together with the results of our experiments.
The task of detecting metaphors in unconstrained text proves to be more difficult than the task
of finding metaphors in a list of expressions. To attempt this task, architectures more complex than a
basic Perceptron become necessary.
We compare the performance of a Bi-LSTM and a composition-based hierarchical network, using both GloVe (Pennington et al.; 2014) and Word2Vec (Mikolov, Sutskever, Chen, Corrado and Dean; 2013a) pre-trained embeddings. We find that the Bi-LSTM achieves the best individual performance, while a combination of both networks yields the best overall performance.
Also, enriching the input word embeddings with explicit features, such as the concreteness scores given in Brysbaert et al. (2014), and manipulating the input to break up long sentences prove to be useful strategies. In this sense, to improve our performance we had to move beyond the purely distributional approach assumed in the other papers and resort to features previously adopted in “standard” feature-based studies (Köper and im Walde; 2017).
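For illustration, a hedged sketch of a Bi-LSTM sequence labeller of this general kind follows. It is not the architecture of Paper 3: the layer sizes, the PyTorch formulation and the way a concreteness score is appended to each embedding are assumptions made only to show the overall design.

# A minimal sketch, assuming PyTorch: a Bi-LSTM producing one metaphoricity
# score per token from word embeddings enriched with a concreteness feature.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, emb_dim=300, extra_feats=1, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim + extra_feats, hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)  # per-token metaphoricity score

    def forward(self, embeddings, concreteness):
        # embeddings: (batch, seq_len, emb_dim); concreteness: (batch, seq_len, 1)
        x = torch.cat([embeddings, concreteness], dim=-1)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h)).squeeze(-1)  # (batch, seq_len)

# toy forward pass over a 12-token "sentence"
model = BiLSTMTagger()
emb = torch.randn(1, 12, 300)
conc = torch.rand(1, 12, 1)   # e.g. rescaled concreteness ratings
scores = model(emb, conc)     # one metaphoricity probability per token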
We compare our results with Do Dinh and Gurevych (2016)'s results, showing that increasing the architecture's depth beyond a certain point does not provide significant improvements. Our system scored
second best in performance when trained on the training partition of the shared task (Leong et al.;
2018).

4.3.1 My contribution to the paper

Mehdi Ghanimifard and I developed the main structure, the background research and the analytical
discussion of the paper. I took care of the Bi-LSTM implementation, while Ghanimifard designed
and tested the competing architecture. I also developed the part of our study dealing with input
manipulation and feature enrichment.

4.4 Study IV

Deep Learning of Binary and Gradient Judgements for Semantic Paraphrase

This paper works as an “introduction" to the task of Metaphor Aptness Assessment.
We present both a new paraphrase dataset and a new deep learning architecture. The dataset
is composed of groups of five sentences. The first sentence is the reference and the remaining four
sentences are candidate paraphrases. In this way, it is possible to both order and categorize paraphrases
with respect to a main sentence, making paraphrase detection closer to the related task of semantic
similarity assessment (Tai et al.; 2015; Yin and Schütze; 2015).
Paraphrase detection approaches have recently shifted from feature-based similarity detection
(Dolan et al.; 2004) and longest common subsequence (Fernando and Stevenson; 2008) to neural
classification (Filice et al.; 2015; He et al.; 2015). Moving in the same direction, this paper presents a
neural architecture designed to tackle this specific task.
Our network consists of two encoders that learn the representations of the two sentences separately, a unified layer that merges the output of the encoders, and a final set of fully connected layers that operates on the merged representation of the two sentences to generate a judgment. Each encoder produces a 10-dimensional representation of the input sentence. Each encoder is composed of a Convolutional Neural Network (CNN), a Long Short Term Memory (LSTM) and several fully connected neural layers. We train it on lexical word embeddings from Word2Vec (Mikolov, Sutskever, Chen, Corrado and Dean; 2013c).
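A rough sketch of this kind of sentence-pair architecture follows. It should be read as an illustration of the general design (CNN and LSTM encoders producing 10-dimensional sentence representations, merged and passed to fully connected layers), not as the exact implementation used in the paper; all layer sizes, the choice of PyTorch and the details of the merging layer are assumptions.

# A minimal sketch, assuming PyTorch: two CNN+LSTM encoders, a merging step,
# and fully connected layers producing a score in [0, 1] for a sentence pair.
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, emb_dim=300, conv_channels=64, hidden=64, out_dim=10):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, conv_channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_channels, hidden, batch_first=True)
        self.dense = nn.Sequential(nn.Linear(hidden, 32), nn.ReLU(),
                                   nn.Linear(32, out_dim))

    def forward(self, emb):                       # emb: (batch, seq_len, emb_dim)
        c = torch.relu(self.conv(emb.transpose(1, 2))).transpose(1, 2)
        _, (h, _) = self.lstm(c)                  # final hidden state
        return self.dense(h.squeeze(0))           # (batch, out_dim)

class PairScorer(nn.Module):
    def __init__(self):
        super().__init__()
        # one encoder per sentence; whether weights are shared is left open here
        self.enc_ref = SentenceEncoder()
        self.enc_cand = SentenceEncoder()
        self.judge = nn.Sequential(nn.Linear(20, 16), nn.ReLU(),
                                   nn.Linear(16, 1), nn.Sigmoid())

    def forward(self, emb_ref, emb_cand):
        merged = torch.cat([self.enc_ref(emb_ref), self.enc_cand(emb_cand)], dim=-1)
        return self.judge(merged)                 # continuous score in [0, 1]

# toy usage: two reference/candidate pairs of different lengths
score = PairScorer()(torch.randn(2, 15, 300), torch.randn(2, 12, 300))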
We use this architecture to perform both binary classification and ranking.1 We use the same
architecture in Papers 5 and 6.
The pipeline we present works in five stages. (i) We convert the mean human judgments into
binary judgments. (ii) We feed the model with the binary version of the dataset (paraphrase vs non-
paraphrase). (iii) The model learns to assign to each pair of sentences a continuous score between
0 and 1. (iv) Rounding the assigned scores, we evaluate the model's binary performance. This
is measured in terms of accuracy and F-score. (v) Keeping the scores continuous, we evaluate the
model’s ranking performance against the original mean human judgments. This is measured in terms
of Pearson correlation. Thus, the ordering gradients are actually by-products of the binary learning.

1 We measured ranking in terms of both Pearson’s and Spearman’s correlation. Since both correlations were very
similar, we only reported Pearson’s correlations in the paper.
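The following sketch illustrates stages (i), (iv) and (v) of this pipeline in code. The binarisation threshold, the rating scale assumed for the mean judgments and the helper names are illustrative assumptions rather than the exact choices made in the paper.

# A minimal sketch, assuming numpy, scipy and scikit-learn are available,
# of converting mean ratings to binary labels and scoring a model both as a
# binary classifier and as a ranker.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

def evaluate(mean_judgments, model_scores, threshold=2.5):
    mean_judgments = np.asarray(mean_judgments, dtype=float)
    model_scores = np.asarray(model_scores, dtype=float)

    # (i) convert mean human judgments (e.g. on a 1-4 scale) into binary gold labels
    gold_binary = (mean_judgments >= threshold).astype(int)

    # (iv) round the model's continuous scores (0-1) for binary evaluation
    pred_binary = (model_scores >= 0.5).astype(int)
    acc = accuracy_score(gold_binary, pred_binary)
    f1 = f1_score(gold_binary, pred_binary)

    # (v) keep the scores continuous and correlate them with the mean ratings
    r, _ = pearsonr(model_scores, mean_judgments)
    return acc, f1, r

# toy usage with four sentence pairs
print(evaluate([1.2, 3.8, 2.9, 1.7], [0.1, 0.9, 0.7, 0.3]))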

This paper basically presents the dataset structure and the machine learning approach that we
will apply to metaphor aptness in the remaining studies. Also, the pipeline described above is the
same that will be used in Papers 5 and 6.

4.4.1 My contribution to the paper

Shalom Lappin and I discussed and developed the main structure of this study. I created and annotated
the dataset and I designed and tested the neural classifier, under Lappin’s supervision.

4.5 Study V

Predicting Human Metaphor Paraphrase Judgments with Deep Neural Networks

In this paper we finally tackle Metaphor Aptness Assessment. We model metaphor aptness in the
frame of metaphor paraphrase detection. We apply the ranking approach presented in the previous
study to a metaphor paraphrase dataset. The dataset is composed of groups of five sentences. The
first element, or reference sentence, contains a metaphor or simile, and the remaining four sentences
are candidate literal paraphrases. In this way, it is possible to rank metaphor-literal pairs based on
their aptness.
Due to the difficulty of creating such a resource, the dimensions of our dataset are still small. At the same time, this resource is, to the best of my knowledge, the largest existing dataset devoted to metaphor paraphrase and aptness assessment. Unlike other resources, our dataset covers a broad
semantic and syntactic variety. Metaphors are not relegated to a single part of speech, nor to a single
word. Multi-word metaphors constitute a large part of the corpus. We also tried to cover different
registers: commonly used metaphors, literary metaphors and “in-betweens" are present in the dataset.
We annotated the dataset through crowd-sourcing, using Amazon Mechanical Turk. An average
of 20 Turkers annotated each metaphor-literal pair of sentences on an aptness scale of 1 to 4. The
correlation of the annotators’ mean judgments with my judgment, calculated as a Pearson correlation,
is 0.93. We train the previously discussed neural network on this new dataset. Again, we test both a
binary classification and a ranking task.
In this paper we also discuss the conceptual difference between ordering 4 candidates and ordering
lists of pairs. The relative aptness of each candidate with respect to the reference sentence distinguishes
this benchmark from gradient semantic similarity lists of pairs. We argue that this approach is of
particular importance when dealing with figurative paraphrase tasks.

We also argue that this task can in fact be summarized with the following question: Given that X is a metaphor, which one of the candidates would be its best literal interpretation? This creates
apparent paradoxes with respect to a traditional paraphrase task. When confronted with two sentences
like The candidate is a fox and The candidate is cunning, a typical paraphrase model should return
a low score, while our model interprets the first sentence as an apt metaphor for the second sentence,
and assigns the pair a high score. When presented with a pair like The Council is on fire and The
Council is burning, a typical paraphrase model should return a high paraphrase score, while our model
gives the pair a low grade. In other terms, classical paraphrasehood and metaphor aptness are not
completely overlapping.

4.5.1 My contribution to the paper

Shalom Lappin and I discussed and developed the main structure of the study. I created the dataset
and tested the neural classifier under Lappin’s supervision. I also took care of the crowd sourcing
annotation.

4.6 Study VI

The Effect of Context on Metaphor Paraphrase Aptness Judgments

The central focus of the paper is the study of metaphor aptness in extended context. We provided a
subsection of our metaphor dataset with document context. We obtained new annotations through
Amazon Mechanical Turk and tested the described DNN on the dataset.
We observe that adding context to the sentences has a consistent effect on human judgments of
metaphor aptness.
This effect was already observed in previous experiments (McCabe; 1983) and has been an object
of research in a series of neurological experiments on metaphor processing (Bambini et al.; 2016,
2018). Nonetheless, the effect that we notice is somewhat unexpected: pairs of sentences that our annotators rated poorly (not apt) in the previous experiment receive a higher grade when seen in extended context. At the same time, pairs that our annotators rated highly (very apt) in the previous experiment get a lower grade when seen in extended context. In other words, we observe that context bends
aptness judgments towards the mean. We also show that it is possible, to an extent, to reproduce this
effect with our neural classifier.

Our results contradict in part the expectations expressed by previous cognitive studies on metaphor
aptness (Chiappe et al.; 2003), and are confirmed by independent findings on the effect of extended
context on acceptability judgments published in Bernardy et al. (2018).
We also provide a review of cognitive linguistic studies on both out of context (Tourangeau and
Sternberg; 1981; Fainsilber and Kogan; 1984; Tourangeau and Rips; 1991; Blasko; 1999; Chiappe et al.;
2003) and in context (McCabe; 1983) metaphor aptness assessment.
This paper is available on arXiv and closes my compilation.

4.6.1 My contribution to the paper

I am responsible for the central idea of the study, which I developed from Lappin's contemporary works
on acceptability in context. I took care of the creation of the new dataset, of its crowd sourcing
annotation and of the neural classifier’s training, under Lappin’s supervision.
Chapter 5

Conclusions

The aim of this thesis is to provide a conceptually related set of experiments on metaphor processing,
starting from the most “simple" cases of metaphoric bigrams and escalating to complex problems that
go beyond metaphor detection.
In this thesis I used a combination of semantic spaces and neural networks to explore metaphor detection and metaphor aptness both in and out of context. This combination provides a flexible framework for machine learning: it does not require feature engineering and it is, as shown in Paper 2, language-independent, a characteristic appreciated in metaphor studies (Kövecses; 2003; Deignan; 2003; Tsvetkov et al.; 2014b; Shutova; 2010c).
While the set of distributional spaces used throughout my research has been more or less constant, I used different neural architectures for different tasks. The architectures increase in complexity with the task at hand. While I tackled the “simpler” detection tasks through rather shallow, fully connected networks, I had to deal with metaphor detection in unconstrained text through a Bi-LSTM, which has a far more sophisticated structure. To approach metaphor aptness assessment I designed a deeper, composite architecture featuring a combination of CNNs and LSTMs. This escalation was not decided a priori. I tested simpler models on every task and turned to more complex architectures when it proved necessary.
An important contribution of this thesis should be the new approach we propose to deal with
metaphor aptness assessment, a rarely studied topic in computational linguistics. I propose a new
dataset to assess metaphor aptness, annotated through crowd sourcing by a large number of humans.
I consider this part of my compilation the most original but, at the same time, the most prob-
lematic. The conceptual difficulties of going beyond metaphor detection in computational linguistics
are notorious (Sculley and Pasanek; 2008). In some sense, to shift towards aptness assessment we had
to start from scratch. While cognitive literature on metaphor aptness is relatively abundant (Johnson


and Malgady; 1979; Marschark et al.; 1983; Blasko and Connine; 1993), natural language processing has left this topic almost untouched.1

5.1 My Research Answers


In the Introduction, I enumerated three main questions that guided my research in metaphor processing. I will use them now to outline a very concise conclusion of my research on metaphor detection and metaphor aptness assessment.

1. To what extent can we exploit vector space lexical embeddings through artificial
neural networks to perform supervised metaphor detection?

Vector space lexical embeddings proved to be highly effective in detecting metaphoricity in constrained datasets (out of context, representing specific kinds of metaphors). A simple neural network trained purely on vector space models largely outperformed traditional supervised systems on an out-of-context metaphor dataset and reached an accuracy above 90% (Paper 1).

This approach clearly employs the distributional profiles of the single words composing the
metaphors. It is able to use the semantic signature of words learned from very large corpora such
as itWaC (Baroni et al.; 2009) or Google News (Mikolov, Sutskever, Chen, Corrado and Dean;
2013d) in order to learn a specific semantic task like metaphor detection from small supervised
training sets. Our pipeline looks like a transfer learning paradigm: the embeddings learned in
an unsupervised way from very large corpora are used for a supervised learning task on small
datasets.

Being based on the distributional information of single words, this approach performs poorly
when applied to non-compositional multi-word figures like idioms.

When idioms are instead treated like unitary tokens, their distributional profile in large corpora
can be exploited by a neural network to learn idiomaticity from very small training data. As
for other non-compositional expressions (Loukachevitch and Parkhomenko; 2018a,b), the dis-
tributional profile of idioms appears to contain enough information for our network to learn
idiomaticity, but the same information cannot be found in the combination of the distributional

1 Some interesting studies in metaphor generation have dealt with the problem of metaphor aptness. Abe et al. (2006)
explore the possibility of a generator able to create apt metaphors computing the probabilistic relationship between
source and target domains. Veale and Hao (2007a) describe a system that tries to generate apt metaphors for a target
on demand from apt similes.

profile of the single words composing the idioms. The opposite holds for metaphoric (non-idiomatic) expressions.2

Distributionality and compositionality are thus two of the main strengths of our framework.
Also, this approach appears to be language-independent (Paper 2).

Nonetheless, switching from out of context, constrained datasets to contextual and unconstrained
metaphor detection brings the difficulty of the task to a higher level.

We resorted to a Bi-LSTM architecture to tackle this task, comparing (and ultimately combining)
it with a simpler hierarchical model. In general, Bi-LSTMs are proving increasingly useful for
tasks of sequential annotation on unconstrained text, such as multi-word expressions detection
(Berk et al.; 2018).

Bi-LSTMs and vector space lexical embeddings perform encouragingly, but there is still room for improvement. Non-distributional features, often used in metaphor detection (Rai and Chakraverty; 2017), also help improve the accuracy of our models. Vector space lexical embeddings and neural networks can learn metaphor detection on unconstrained text to a good extent, but they cannot (yet) account for all dimensions of metaphoricity (Paper 3).

While leaving room for improvement, our model was nonetheless quite successful: we were the second best performing group in the 2018 Metaphor Detection shared task. The best performing group also used sequential deep neural networks. Applying Bi-LSTMs to metaphor detection is an active trend in the field, and new applications to improve their performance are being published (Gao et al.; 2018).

2. What is the best way of dealing with metaphor aptness assessment, in terms of
dataset structure and task design?

We designed a new kind of dataset to deal with metaphor aptness as a form of metaphor para-
phrase. This dataset allows the user to deal with aptness as both a binary task and an ordering
task. It is composed of groups of five sentences, where the first one contains a metaphor and the
remaining four are candidate paraphrases annotated by degree of aptness.

This frame, while not yet applied to aptness, is presented in Paper 4 for a task of paraphrase scoring, together with a neural structure designed to be trained on it. We did not choose a paraphrase detection task exclusively as a half-way step towards metaphor aptness: we think that this way of dealing with paraphrases could actually be beneficial in itself, and indeed several studies are combining paraphrase detection and text similarity (Soleymanzadeh et al.; 2018). The possibility of interpreting paraphrasehood in a non-binary way, and ranking the degree to which

2 Our findings about idioms’ distributional properties have been recently confirmed by Peng et al. (2018).

two paraphrases are close or far from one another, could be particularly useful for plagiarism detection and literary scholarship (Moritz et al.; 2018). At the same time, our approach appears simpler, in terms of pipeline and resources employed, than several paraphrase detection or text similarity experiments proposed in the literature (Mohamed and Oussalah; n.d.).

3. Is it possible to use the same approach to go beyond metaphor detection, and tackle
metaphor aptness assessment as a natural language processing task?

It is indeed possible to apply a combination of vector space lexical embeddings and deep neural
networks to tackle, to a degree, metaphor aptness assessment (Paper 5).

As for metaphor detection, the dimension of contextuality - the amount of extended context provided in the dataset - appears to play an important role in aptness assessment (Paper 6).

This task and topic are to be considered as a first step in the rarely studied field of automatic
metaphor aptness assessment.

While neural networks are amply used in the bordering fields of metaphor detection, paraphrase
rating and sentence representation (Chen, Hu, Huang and He; 2018; Tang and de Sa; 2018; Chen,
Guo, Chen, Sun and You; 2018), this is to the best of my knowledge the first application of a
neural architecture to the task of metaphor aptness assessment. Similarly, word embeddings have been previously used to explore metaphor paraphrasing and metaphor interpretation (Utsumi; 2011), but, to the best of my knowledge, this is the first application of distributional semantic spaces to metaphor aptness assessment.

A solid development of this line of work could be of interest for metaphor generation systems (Veale; 2014, 2015), sometimes used for artificial tutors and conversational systems (Rzepka et al.; 2013; Dybala et al.; 2012).

On a merely “technological” level, I have explored a variety of experimental frames, neural architectures, semantic spaces and corpora throughout my studies. This variety can be confusing. In Table 5.1 I present a synthesis of the main frames presented in each experiment. This table is not a comprehensive overview of all the combinations explored in my work, but rather a conclusive summary of the most important ones. The grouping of rows should help distinguish between the detection-oriented and the paraphrase-oriented studies, and point to the shift between out-of-context and in-context datasets.
This can be seen as a very short summary of my findings and considerations about metaphor
processing. Still, I think this does not exhaust the conclusions I can draw from my experience.
There are some recurrent topics that also informed my line of research in metaphor processing: the
nuanced approach to both metaphoricity and aptness, the necessity of dealing with input’s sequentiality,
the main problem of data scarcity and the matter of context in metaphor datasets.
I will briefly analyze them in the rest of this chapter.

Study | Model | Word representation | Corpus | Best results
AN metaphoricity | Single fully connected neural layer | Various distributional spaces | AN corpus annotated for metaphoricity | F1 higher than 0.9 in most settings
Short phrases idiomaticity | Three-layered fully connected network | Various distributional spaces | Datasets of expressions annotated for idiomaticity | F1 of 0.85 (VN) and 0.89 (AN)
Word metaphoricity in context | Multi-layered fully connected network and a Bi-LSTM | GloVe and W2V spaces enriched with explicit features | VU Amsterdam Metaphor Corpus | F1 of 0.63

Paraphrase quality assessment | Deep NNs combining a CNN, LSTMs and fully connected layers | W2V space | Manually crafted corpus for paraphrase ranking | F1 of 0.76, Pearson correlation of 0.61
Metaphor paraphrase (aptness) assessment | Deep NNs combining a CNN, LSTMs and fully connected layers | W2V space | Manually crafted corpus for metaphor paraphrase ranking | F1 of 0.67, Pearson correlation of 0.55
Metaphor paraphrase (aptness) assessment in extended context | Deep NNs combining a CNN, LSTMs and fully connected layers | W2V space | Manually crafted corpus for metaphor paraphrase ranking in context | F1 of 0.72, Pearson correlation of 0.3

Table 5.1: Summary of the experimental settings presented in this compilation. The first three studies are about detection, the last three about paraphrase. In both sets, the last experiment includes extensive context.

5.2 Nuanced properties


In many of the presented studies I tried to go beyond binary frames of classification to explore more
nuanced takes on metaphoricity.
Most of the existing figurative language datasets, and most figurative language processing studies,
view figurativity as a binary property, practically equating the figurative usage of an expression with
the simple shift to a new or alternative sense (Birke and Sarkar; 2006; Sporleder and Li; 2009; Li et al.;
2010; Erk and Padó; 2010).
I argue throughout most of these papers for a different take on the matter. Metaphor is a rather gradient phenomenon, and approaches able to return gradient judgments of metaphoricity should be encouraged.
In Paper 2, I partly address this point in the experimental frame itself. A dataset of this kind - annotating figurativity as a gradient property of expressions - is actually used to train and test a classifier. We show that this kind of annotation not only allows a more accurate modelling of the phenomenon, but can also be exploited to improve the classifier's performance. In other words, human continuous ratings of figurativity seem to be mirrored in the distribution of expressions in large-scale data.
This perspective is also in line with cognitive studies of figurative processing: while distinct neurological processes do exist for detecting and understanding literal and figurative language, it is the “novelty” of a figure - or the contextual necessity of giving a word or expression a new interpretation - that determines which processing pattern is taken, rather than an expression's inherent figurativity (Glucksberg et al.; 1982; Keysar; 1989; Giora; 1997; Gentner and Bowdle; 2001; Bookheimer; 2002; Giora; 2003; Underwood et al.; 2004; Jiang and Nekrasova; 2007; Conklin and Schmitt; 2008; Schmidt and Seger; 2009; Bambini et al.; 2011; Bohrn et al.; 2012; Cardillo et al.; 2012).
I adopted a gradient approach with respect to metaphor aptness as well. Rather than considering
a literal sentence as a “good" or “bad" paraphrase of a metaphor, several sentences are ordered for
degree of aptness with respect to a single metaphorical element. In this way, metaphor aptness can
be re-formulated as a ranking problem. This led to a fundamentally hybrid approach when using our
dataset for training: our model was trained on binary judgments, and tested on both a binary and an
ordering task.
In Papers 5 and 6 we discuss more extensively the importance of testing our network’s ability to
rank, and not only classify, candidate sentences.

5.3 Sequentiality
In all my experiments, I took the sequentiality of input into account.
While systems that don't consider sequentiality can still perform encouragingly on some datasets for metaphor detection, word order is often an important dimension to consider when dealing with figurative language. The importance of sequentiality in metaphor detection appears most clearly in the third paper of this compilation, where two types of neural network - a Bi-LSTM and a simpler, multi-layered fully connected network - are compared on a metaphor detection dataset. The Bi-LSTM, which is the more sequence-oriented model of the two, outperformed the other architecture. At the same time, however, the best overall performance was achieved by a combination of the two systems, hinting at the need for a more complex approach to the matter.
To deal with metaphor aptness assessment I also used sequential models (LSTMs) together with models able to detect non-sequential, far-reaching patterns in data (CNNs). As this task is somewhat close to paraphrase detection, sequentiality becomes even more essential.

5.4 Data Scarcity


Since the first experiment, I have observed how the large-scale information encoded in vector space semantic models can be exploited by neural networks to compensate for the severe constraints of data scarcity that characterize research in figurative language processing.
The distribution of words in large-scale corpora can suffice to learn their metaphorical use in adjective-noun pairs from relatively small annotated data, and the distribution of non-compositional expressions can suffice to learn their figurative or literal nature from extremely scarce datasets. It appears quite clearly that, for example, the distributional difference between idioms and literal expressions is distinct enough to be learned by a neural classifier from datasets of a few dozen examples. Only in one case (Paper 3) did we decide to explore features beyond distributional vectors to enrich our classifiers' input.
This aspect can be considered particularly important since figurative annotation of corpora is a challenging task. When dealing with metaphor aptness, the information provided by pre-trained neural embeddings was essential to allow our models to generalize on a small and rather difficult corpus. The performance of our classifier on this task further underlines the importance of vector space semantic models under data scarcity. As we comment in Paper 5, our model is able to retrieve from the original semantic spaces not only the primary meaning or denotation of words, but also some of the more subtle semantic aspects involved in the metaphorical use of terms.

5.5 Context
The presence of extended context in training and test sets is another relevant line of my work.
For both metaphor detection and metaphor aptness assessment, I started by dealing with out-of-context datasets and then moved on to more contextualized corpora.
In metaphor detection, it was possible to use existing resources for both cases: lists of metaphorical
(or idiomatic) expressions for the first stage, and annotated corpora for the second stage.
For metaphor aptness, both frames lacked data. I thus had to first create a corpus for the out-of-
context stage, and then expand a part of that corpus to move to the contextualized frame.
When dealing with context in metaphor aptness, I was able to observe a consistent shift in human
judgments. Our annotators’ ratings were compressed towards the mean: aptness increased for low-
rated pairs and decreased for high-rated pairs.
This different perception of aptness was detectable only thanks to the gradient approach we had
chosen for human annotation. If we had gone for a binary classification task, this compression towards
the mean would have gone largely unnoticed. This seems to be another dimension of figurativity
studied more in cognitive than in computational linguistics (Inhoff et al.; 1984).
It would be interesting in the future to explore this contextual effect on detection-oriented datasets as well.

5.6 Limitations
I am aware of several flaws in my work.
Neural networks' notorious opacity is a drawback to overcome in future studies. The features that lead to learning are not as clear as they used to be in traditional machine learning approaches (see for example Li and Sporleder (2010c)). Where possible, I tried to infer some of the mechanisms behind the networks' performance through ablation experiments. For example, in Paper 1 we show that our model side-learns an abstract-concrete semantic continuum in order to distinguish between figurative and literal expressions.
When dealing with more complex kinds of inputs, like sentences or pairs of sentences, clarifying the networks' inner workings became harder. The result is that the performance of the model often remains unexplained from a linguistic point of view.
Another limitation lies in the size of some of the presented datasets, especially the datasets
we created for metaphor aptness assessment. These datasets are necessarily small. They are hard to
produce and have to be dealt with in completely non-automated ways. In other words, each example
has to be produced from scratch.

In short, my work in metaphor processing, and especially in aptness assessment, suffers from the same limitations Tony Veale described when talking about Figurative Language Processing in general: it is “neither scalable nor robust, and not yet practical enough to migrate beyond the lab” (Veale; 2011).

5.7 Future Works


Metaphor detection is still an open problem. While I find the results of the proposed models encouraging, only a deeper understanding of their workings and limitations can lead to improvements in performance.
This is even more important for metaphor aptness. Our dataset for metaphor aptness assessment, being the first of its kind, is limited in size and variety. Larger datasets, with a more systematic
division of kinds and types of metaphors, would be a necessary step to continue this line of work. To
further explore the effect that extended context seemingly has on human perception of metaphoricity,
as proposed in Paper 6, an enlargement of the dataset would also be highly beneficial. Finding ways
to automatically boost the dimensions of metaphor datasets would be an excellent first step in this
direction.
Also, evaluating a model on a ranking problem is a tricky task. Other ways than the ones proposed
in this compilation can be explored to test a model’s ability to rank more or less clear cases of metaphor
paraphrase or aptness. The influence that evaluation datasets have on a model’s ranking performance
has been explored in various studies (see for example Loukina et al. (2018), where the effect of different evaluation sets for ranking algorithms is examined). Different approaches could be undertaken to test
my models’ effectiveness.
The gradient nature of metaphoricity is another matter I would like to address in future work. While I have mentioned the problem throughout this thesis, most of my detection experiments were necessarily based on binary data for both training and evaluation, with the exception of Paper 2.
Applying the same frame I present in Paper 2 - figurativity as a ranking problem - to the datasets in
Paper 1 and Paper 3 might prove a fruitful undertaking.
Finally, metaphor is the tip of the iceberg. Metaphor and irony (Rosso et al.; 2018; Van Hee et al.;
2018b,a) are by and large the most studied figures, but the world of figurative language is diverse and
approaches that are, to an extent, successful on metaphor could be applied to other elements as well,
at least tentatively. Both detection and aptness assessment could be performed in future on more
extensive datasets containing less studied figures of speech.
Chapter 6

Appendix

This is a short appendix detailing the way the main annotation task for paraphrase ranking was carried out.
The metaphor aptness dataset (Paper 5) is composed of groups of five sentences, each consisting of a reference sentence and four candidate paraphrases, like the following.

I was alone in a sea of unknown faces.

1. I was alone in a circle of unknown faces.

2. I was alone in a large number of unknown faces.

3. I was alone in an exclusive club of unknown faces.

4. I was among many people I did not know, all alone.

The annotation of this corpus was carried out through Amazon Mechanical Turk. Sentences were presented to the anonymous annotators in the form of pairs, like the following.

I was alone in a sea of unknown faces.


I was alone in a circle of unknown faces.

I was alone in a sea of unknown faces.


I was among many people I did not know, all alone.

The annotators could score each pair from 1 (completely unrelated) to 4 (strong paraphrase). An
average of 20 human annotators scored each pair.


Figure 6.1: Annotators were presented with this page of instructions.

Filtering of rogue annotators was carried out along two main lines:

1. Some “trap" elements were inserted in the task (sentences that were completely unrelated).
Annotators who did not give the minimum score to these elements were discarded as rogue.

2. Annotators who gave very high or very low scores to the vast majority of pairs were discarded
as rogues.

After filtering out the rogues, we were left with an average of 15 annotations per pair. These annotations were then averaged and compared with my own annotation of the corpus, showing a high correlation with my judgment.1 The mean human judgments for each pair were then used as gold labels to train and test my models.
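A minimal sketch of this filtering-and-averaging procedure is given below. The data layout, the treatment of trap items and the threshold for "extreme scorers" are assumptions for illustration only; they do not reproduce the exact parameters used for the dataset.

# A minimal sketch, assuming numpy, of rogue filtering and per-item averaging.
import numpy as np

def filter_rogues(annotations, trap_items, extreme_ratio=0.9):
    """annotations: {annotator_id: {item_id: score on a 1-4 scale}}."""
    kept = {}
    for annotator, scores in annotations.items():
        # rule 1: must give the minimum score to every trap (unrelated) item
        if any(scores.get(t, 1) != 1 for t in trap_items):
            continue
        # rule 2: drop annotators whose ratings are almost all extreme
        vals = np.array(list(scores.values()))
        if np.mean((vals == 1) | (vals == 4)) > extreme_ratio:
            continue
        kept[annotator] = scores
    return kept

def mean_judgments(kept_annotations):
    """Average the surviving annotations per item to obtain gold labels."""
    per_item = {}
    for scores in kept_annotations.values():
        for item, score in scores.items():
            per_item.setdefault(item, []).append(score)
    return {item: float(np.mean(v)) for item, v in per_item.items()}

# toy usage with two annotators and one trap item ("trap_1")
ann = {
    "a1": {"pair_1": 3, "pair_2": 1, "trap_1": 1},
    "a2": {"pair_1": 4, "pair_2": 4, "trap_1": 3},  # fails the trap check
}
print(mean_judgments(filter_rogues(ann, trap_items=["trap_1"])))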

We also used Amazon Mechanical Turk for the extended context dataset (Paper 6). In this case, the annotators were presented with pairs of three-sentence paragraphs, in which the central (“relevant”) sentence was highlighted:

They had arrived in the capital city. The crowd was a roaring river. It was glorious.
They had arrived in the capital city. The crowd was huge and noisy. It was glorious.

1 This was intended only as a sanity check. A low correlation with my judgment would not have automatically resulted
in the elimination of the whole annotation set.

Figure 6.2: An example of annotation frame for the metaphor aptness dataset.

The same procedures as for the previous annotation were applied in terms of scoring system, rogue filtering and handling of the results.
For more detailed information, the reader can refer to Paper 5 and Paper 6, where the annotation's logic and results for each dataset are discussed in more detail.

Figure 6.3: An example of annotation frame for the in-context metaphor aptness dataset.
Bibliography

Abe, K., Sakamoto, K. and Nakagawa, M. (2006). A computational model of the metaphor generation
process, Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 28.

Agerri, R. (2008). Metaphor in textual entailment, COLING 2008, 22nd International Conference on
Computational Linguistics, Posters Proceedings, 18-22 August 2008, Manchester, UK, pp. 3–6.
URL: http://www.aclweb.org/anthology/C08-2001

Agirre, E., Banea, C., Cer, D. M., Diab, M. T., Gonzalez-Agirre, A., Mihalcea, R., Rigau, G.
and Wiebe, J. (2016). Semeval-2016 task 1: Semantic textual similarity, monolingual and
cross-lingual evaluation, Proceedings of the 10th International Workshop on Semantic Evaluation,
SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16-17, 2016, pp. 497–511.
URL: http://aclweb.org/anthology/S/S16/S16-1081.pdf

Agrawal, A., Batra, D. and Parikh, D. (2016). Analyzing the behavior of visual question answering
models, arXiv preprint arXiv:1606.07356 .

Ancona, M., Ceolini, E., Öztireli, C. and Gross, M. (2017). A unified view of gradient-based attribution
methods for deep neural networks, arXiv preprint arXiv:1711.06104 .

Bae, S. H., Choi, I. and Kim, N. S. (2016). Acoustic scene classification using parallel combination of
lstm and cnn, Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016
Workshop (DCASE2016), pp. 11–15.

Bambini, V., Bertini, C., Schaeken, W., Stella, A. and Di Russo, F. (2016). Disentangling metaphor
from context: an erp study, Frontiers in psychology 7: 559.

Bambini, V., Canal, P., Resta, D. and Grimaldi, M. (2018). Time course and neurophysiological
underpinnings of metaphor in literary context, Discourse Processes pp. 1–21.

Bambini, V., Gentili, C., Ricciardi, E., Bertinetto, P. M. and Pietrini, P. (2011). Decomposing
metaphor processing at the cognitive and neural level through functional magnetic resonance imag-
ing, Brain Research Bulletin 86(3-4): 203–216.


Banerjee, S. and Pedersen, T. (2002). An adapted Lesk algorithm for word sense disambiguation with WordNet.

Baroni, M., Bernardini, S., Ferraresi, A. and Zanchetta, E. (2009). The WaCky wide web: a collec-
tion of very large linguistically processed web-crawled corpora, Language Resources and Evaluation
43(3): 209–226.
URL: http://dx.doi.org/10.1007/s10579-009-9081-4

Baroni, M., Dinu, G. and Kruszewski, G. (2014). Don’t count, predict! a systematic comparison of
context-counting vs. context-predicting semantic vectors., Proceedings of the 52nd Annual Meeting
of the Association for Computational Linguistics, pp. 238–247.

Basile, P., Caputo, A. and Semeraro, G. (2015). Temporal random indexing, Italian Journal of Com-
putational Linguistics 1.

Baumer, E. and Tomlinson, B. (2008). Computational metaphor identification in communities of blogs, ICWSM.

Berk, G., Erden, B. and Güngör, T. (2018). Deep-bgt at parseme shared task 2018: Bidirectional
lstm-crf model for verbal multiword expression identification, Proceedings of the Joint Workshop on
Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), pp. 248–
253.

Bernardy, J.-P., Lappin, S. and Lau, J. H. (2018). The influence of context on sentence acceptability
judgements, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics
(Volume 2: Short Papers), Vol. 2, pp. 456–461.

Birke, J. and Sarkar, A. (2006). A clustering approach for nearly unsupervised recognition of nonliteral
language, 11th Conference of the European Chapter of the Association for Computational Linguistics.

Blacoe, W. and Lapata, M. (2012). A comparison of vector-based representations for semantic compo-
sition, Proceedings of the 2012 joint conference on empirical methods in natural language processing
and computational natural language learning, Association for Computational Linguistics, pp. 546–
556.

Blasko, D. G. (1999). Only the tip of the iceberg: Who understands what about metaphor?, Journal
of Pragmatics 31(12): 1675–1683.

Blasko, D. G. and Connine, C. M. (1993). Effects of familiarity and aptness on metaphor processing.,
Journal of experimental psychology: Learning, memory, and cognition 19(2): 295.

Bohrn, I. C., Altmann, U. and Jacobs, A. M. (2012). Looking at the brains behind figurative language:
a quantitative meta-analysis of neuroimaging studies on metaphor, idiom, and irony processing,
Neuropsychologia 50(11): 2669–2683.

Bollegala, D. and Shutova, E. (2013). Metaphor interpretation using paraphrases extracted from the
web, PloS one 8(9): e74304.

Bookheimer, S. (2002). Functional mri of language: new approaches to understanding the cortical
organization of semantic processing, Annual review of neuroscience 25(1): 151–188.

Bowdle, B. F. and Gentner, D. (2005). The career of metaphor., Psychological review 112(1): 193.

Broadwell, G. A., Boz, U., Cases, I., Strzalkowski, T., Feldman, L., Taylor, S., Shaikh, S., Liu,
T., Cho, K. and Webb, N. (2013). Using imageability and topic chaining to locate metaphors in
linguistic corpora, International Conference on Social Computing, Behavioral-Cultural Modeling,
and Prediction, Springer, pp. 102–110.

Brysbaert, M., Warriner, A. B. and Kuperman, V. (2014). Concreteness ratings for 40 thousand
generally known english word lemmas, Behavior research methods 46(3): 904–911.

Bulat, L., Clark, S. and Shutova, E. (2017). Modelling metaphor with attribute-based semantics,
Proceedings of the 15th Conference of the European Chapter of the Association for Computational
Linguistics: Volume 2, Short Papers, Vol. 2, pp. 523–528.

Cacciari, C. (2014). Processing multiword idiomatic strings: Many words in one?, The Mental Lexicon
9(2): 267–293.

Cacciari, C. and Papagno, C. (2012). Neuropsychological and neurophysiological correlates of idiom un-
derstanding: How many hemispheres are involved, The handbook of the neuropsychology of language
pp. 368–385.

Camac, M. K. and Glucksberg, S. (1984). Metaphors do not use associations between concepts, they
are used to create them, Journal of Psycholinguistic Research 13(6): 443–455.

Cameron, L. (2003). Metaphor in educational discourse, A&C Black.

Cameron, L. (2008). Metaphor and talk, The Cambridge handbook of metaphor and thought pp. 197–
211.

Cardillo, E. R., Watson, C. E., Schmidt, G. L., Kranjec, A. and Chatterjee, A. (2012). From novel to
familiar: tuning the brain for metaphors, Neuroimage 59(4): 3212–3221.

Chen, P., Guo, W., Chen, Z., Sun, J. and You, L. (2018). Gated convolutional neural network for
sentence matching, memory 1: 3.

Chen, Q., Hu, Q., Huang, J. X. and He, L. (2018). Can: Enhancing sentence similarity modeling with
collaborative and adversarial network, The 41st International ACM SIGIR Conference on Research
& Development in Information Retrieval, ACM, pp. 815–824.

Chiappe, D. L., Kennedy, J. M. and Chiappe, P. (2003). Aptness is more important than comprehen-
sibility in preference for metaphors and similes, Poetics 31(1): 51–68.

Collobert, R. and Weston, J. (2008a). A unified architecture for natural language processing: Deep
neural networks with multitask learning, Proceedings of the 25th International Conference on Ma-
chine Learning, ICML ’08, ACM, New York, NY, USA, pp. 160–167.
URL: http://doi.acm.org/10.1145/1390156.1390177

Collobert, R. and Weston, J. (2008b). A unified architecture for natural language processing: Deep
neural networks with multitask learning, Proceedings of the 25th international conference on Machine
learning, ACM, pp. 160–167.

Colton, S., Goodwin, J. and Veale, T. (2012). Full-face poetry generation., ICCC, pp. 95–102.

Conklin, K. and Schmitt, N. (2008). Formulaic sequences: Are they processed more quickly than
nonformulaic language by native and nonnative speakers?, Applied linguistics 29(1): 72–89.

Consortium, B. N. C. et al. (2007). British national corpus version 3 (bnc xml edition), Distributed
by Oxford University Computing Services on behalf of the BNC Consortium. Retrieved February 13, 2012.

Cordeiro, S., Ramisch, C., Idiart, M. and Villavicencio, A. (2016). Predicting the compositionality of
nominal compounds: Giving word embeddings a hard time, Proceedings of the 54th Annual Meeting
of the Association for Computational Linguistics, Vol. 1, pp. 1986–1997.

Cornelissen, J. P. (2004). What are we playing at? theatre, organization, and the use of metaphor,
Organization Studies 25(5): 705–726.

Cornelissen, J. P. (2005). Beyond compare: Metaphor in organization theory, Academy of Management Review 30(4): 751–764.

Corts, D. P. and Pollio, H. R. (1999). Spontaneous production of figurative language and gesture in
college lectures, Metaphor and Symbol 14(2): 81–100.

Cruse, D. A. (1986). Lexical semantics, Cambridge University Press.

Dai, D., Tan, W. and Zhan, H. (2017). Understanding the feedforward artificial neural network model
from the perspective of network flow, arXiv preprint arXiv:1704.08068 .

Darian, S. (2000). The role of figurative language in introductory science texts, International Journal
of applied linguistics 10(2): 163–186.

Davidov, D., Tsur, O. and Rappoport, A. (2010). Semi-supervised recognition of sarcastic sentences
in twitter and amazon, Proceedings of the fourteenth conference on computational natural language
learning, Association for Computational Linguistics, pp. 107–116.

Deignan, A. (2003). Metaphorical expressions and culture: An indirect link, Metaphor and symbol
18(4): 255–271.

Deignan, A. (2007). “image” metaphors and connotations in everyday language, Annual Review of
Cognitive Linguistics 5(1): 173–192.

Dell, G. S. (1985). Positive feedback in hierarchical connectionist models: Applications to language production, Cognitive Science 9(1): 3–23.

Do Dinh, E.-L. and Gurevych, I. (2016). Token-level metaphor detection using neural networks,
Proceedings of the Fourth Workshop on Metaphor in NLP, pp. 28–33.

Dolan, B., Quirk, C. and Brockett, C. (2004). Unsupervised construction of large paraphrase corpora:
Exploiting massively parallel news sources, Proceedings of the 20th International Conference on
Computational Linguistics, COLING ’04, Association for Computational Linguistics, Stroudsburg,
PA, USA.
URL: https://doi.org/10.3115/1220355.1220406

Dunn, J. (2013a). Evaluating the premises and results of four metaphor identification systems, Proceedings of CICLing’13.

Dunn, J. (2013b). What metaphor identification systems can tell us about metaphor-in-language,
Proceedings of the First Workshop on Metaphor in NLP, pp. 1–10.

Dunn, J., De Heredia, J. B., Burke, M., Gandy, L., Kanareykin, S., Kapah, O., Taylor, M., Hines, D.,
Frieder, O., Grossman, D. et al. (2014). Language-independent ensemble approaches to metaphor
identification, 28th AAAI Conference on Artificial Intelligence, AAAI 2014, AI Access Foundation.

Dybala, P., Ptaszynski, M., Rzepka, R., Araki, K. and Sayama, K. (2012). Beyond conventional
recognition: Concept of a conversational system utilizing metaphor misunderstanding as a source of
humor, Proceedings of The 26th Annual Conference of The Japanese Society for Artificial Intelligence
(JSAI 2012), Alan Turing Year Special Session on AI Research That Can Change The World.

Erk, K. and Padó, S. (2010). Exemplar-based models for word meaning in context, Proceedings of the
acl 2010 conference short papers, Association for Computational Linguistics, pp. 92–97.

Fahnestock, J. (2009). Quid pro nobis. rhetorical stylistics for argument analysis, Examining argumen-
tation in context. Fifteen studies on strategic maneuvering pp. 131–152.

Fainsilber, L. and Kogan, N. (1984). Does imagery contribute to metaphoric quality?, Journal of
psycholinguistic research 13(5): 383–391.

Fazly, A., Cook, P. and Stevenson, S. (2009). Unsupervised type and token identification of idiomatic
expressions, Computational Linguistics 1(35): 61–103.

Fazly, A. and Stevenson, S. (2008). A distributional account of the semantics of multiword expressions,
Italian Journal of Linguistics 1(20): 157–179.

Fernando, S. and Stevenson, M. (2008). A semantic similarity approach to paraphrase detection, Computational Linguistics UK (CLUK 2008) 11th Annual Research Colloquium.

Ferrone, L. and Zanzotto, F. (2015). Distributed smoothed tree kernel, Italian Journal of Computa-
tional Linguistics 1.

Filice, S., Da San Martino, G. and Moschitti, A. (2015). Structural representations for learning relations
between pairs of texts, Vol. 1, Association for Computational Linguistics (ACL), pp. 1003–1013.

Forgács, B., Bardolph, M. D., Amsel, B. D., DeLong, K. A. and Kutas, M. (2015). Metaphors are physical and
abstract: Erps to metaphorically modified nouns resemble erps to abstract language, Front. Hum.
Neurosci. 9(28).

Fraser, B. (1970). Idioms within a transformational grammar, Foundations of language pp. 22–42.

Frege, G. (1892). Über Sinn und Bedeutung, Zeitschrift für Philosophie und philosophische Kritik
100: 25–50.

Fussell, S. R. and Moss, M. M. (1998). Figurative language in emotional communication, Human-Computer Interaction Institute p. 82.

Gao, G., Choi, E., Choi, Y. and Zettlemoyer, L. (2018). Neural metaphor detection in context, arXiv
preprint arXiv:1808.09653 .

Geeraert, K., Baayen, R. H. and Newman, J. (2017). Understanding idiomatic variation, MWE 2017
p. 80.

Gentner, D. (1983). Structure-mapping: A theoretical framework for analogy, Cognitive science 7(2): 155–170.

Gentner, D. and Bowdle, B. F. (2001). Convention, form, and figurative language processing, Metaphor
and Symbol 16(3-4): 223–247.

Gerrig, R. J. (1989). Empirical constraints on computational theories of metaphor: Comments on Indurkhya, Cognitive Science 13(2): 235–241.

Ghosh, A., Li, G., Veale, T., Rosso, P., Shutova, E., Barnden, J. and Reyes, A. (2015). Semeval-2015
task 11: Sentiment analysis of figurative language in twitter, Proceedings of the 9th International
Workshop on Semantic Evaluation (SemEval 2015), pp. 470–478.

Gibbs, R. W. (1993). Why idioms are not dead metaphors, Idioms: Processing, structure, and inter-
pretation pp. 57–77.

Gibbs, R. W. (1994). The poetics of mind: Figurative thought, language, and understanding, Cambridge
University Press.

Gibbs, R. W., Bogdanovich, J. M., Sykes, J. R. and Barr, D. J. (1997). Metaphor in idiom compre-
hension, Journal of memory and language 37(2): 141–154.

Gibbs, R. W., Leggitt, J. S. and Turner, E. A. (2002). What’s special about figurative language
in emotional communication, The verbal communication of emotions: Interdisciplinary perspectives
pp. 125–149.

Gildea, P. and Glucksberg, S. (1983). On understanding metaphor: The role of context, Journal of
Memory and Language 22(5): 577.

Giora, R. (1997). Understanding figurative and literal language: The graded salience hypothesis,
Cognitive Linguistics (includes Cognitive Linguistic Bibliography) 8(3): 183–206.

Giora, R. (1999). On the priority of salient meanings: Studies of literal and figurative language, Journal
of pragmatics 31(7): 919–929.

Giora, R. (2002). Literal vs. figurative language: Different or equal?, Journal of pragmatics 34(4): 487–
506.

Giora, R. (2003). On our mind: Salience, context, and figurative language, Oxford University Press.

Giora, R. and Fein, O. (1999). Irony: Context and salience, Metaphor and Symbol 14(4): 241–257.

Glucksberg, S. (2008). How metaphors create categories–quickly, The Cambridge handbook of metaphor
and thought pp. 67–83.

Glucksberg, S., Gildea, P. and Bookin, H. B. (1982). On understanding nonliteral speech: Can people
ignore metaphors?, Journal of verbal learning and verbal behavior 21(1): 85–98.

Glucksberg, S., McGlone, M. S. and Manfredi, D. (1997). Property attribution in metaphor compre-
hension, Journal of memory and language 36(1): 50–67.

Gong, H., Bhat, S. and Viswanath, P. (2017). Geometry of compositionality., AAAI, pp. 3202–3208.

Goodman, N. (1975). The status of style, Critical Inquiry 1(4): 799–811.

Gordon, J., Hobbs, J., May, J., Mohler, M., Morbini, F., Rink, B., Tomlinson, M. and Wertheim, S.
(2015). A corpus of rich metaphor annotation, Proc. Workshop on Metaphor in NLP.
URL: http://www.isi.edu/ jgordon/papers/gordon-et-al.a-corpus-of-rich-metaphor-annotation.pdf

Gutierrez, E. D., Cecchi, G., Corcoran, C. and Corlett, P. (2017). Using automated metaphor iden-
tification to aid in detection and prediction of first-episode schizophrenia, Proceedings of the 2017
Conference on Empirical Methods in Natural Language Processing, pp. 2923–2930.

Gutiérrez, E. D., Shutova, E., Marghetis, T. and Bergen, B. K. (2016). Literal and metaphorical senses
in compositional distributional semantic models, Proceedings of the 54th Meeting of the Association
for Computational Linguistics, pp. 160–170.

Haagsma, H. and Bjerva, J. (2016). Detecting novel metaphor using selectional preference information,
Proceedings of the Fourth Workshop on Metaphor in NLP, pp. 10–17.

He, H., Gimpel, K. and Lin, J. (2015). Multi-perspective sentence similarity modeling with con-
volutional neural networks, Proceedings of the 2015 Conference on Empirical Methods in Natural
Language Processing, Association for Computational Linguistics, Lisbon, Portugal, pp. 1576–1586.
URL: https://aclweb.org/anthology/D/D15/D15-1181

He, X. and Liu, Y. (2017). Not enough data?: Joint inferring multiple diffusion networks via network
generation priors, Proceedings of the Tenth ACM International Conference on Web Search and Data
Mining, ACM, pp. 465–474.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory, Neural Computation 9(8): 1735–1780.

Hovy, D., Shrivastava, S., Jauhar, S. K., Sachan, M., Goyal, K., Li, H., Sanders, W. and Hovy, E.
(2013). Identifying metaphorical word use with tree kernels, Proceedings of the First Workshop on
Metaphor in NLP.

Inhoff, A. W., Lima, S. D. and Carroll, P. J. (1984). Contextual effects on metaphor comprehension
in reading, Memory & Cognition 12(6): 558–567.

Jang, H., Piergallini, M., Wen, M. and Rose, C. (2014). Conversational metaphors in use: Exploring the
contrast between technical and everyday notions of metaphor, Proceedings of the Second Workshop
on Metaphor in NLP, pp. 1–10.

Jiang, N. A. and Nekrasova, T. M. (2007). The processing of formulaic sequences by second language
speakers, The Modern Language Journal 91(3): 433–445.

Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C. L. and Girshick, R. (2017).
Clevr: A diagnostic dataset for compositional language and elementary visual reasoning, Computer
Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE, pp. 1988–1997.

Johnson, M. G. and Malgady, R. G. (1979). Some cognitive aspects of figurative language: Association
and metaphor, Journal of Psycholinguistic Research 8(3): 249–265.

Jurgens, D. and Stevens, K. (2009). Event detection in blogs using temporal random indexing, Pro-
ceedings of the Workshop on Events in Emerging Text Types.

Karpathy, A., Johnson, J. and Fei-Fei, L. (2015). Visualizing and understanding recurrent networks,
arXiv preprint arXiv:1506.02078 .

Kesarwani, V., Inkpen, D., Szpakowicz, S. and Tanasescu, C. (2017). Metaphor detection in a poetry
corpus, Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural
Heritage, Social Sciences, Humanities and Literature, pp. 1–9.

Keysar, B. (1989). On the functional equivalence of literal and metaphorical interpretations in dis-
course, Journal of memory and language 28(4): 375–385.

Kim, Y. (2014). Convolutional neural networks for sentence classification, CoRR abs/1408.5882.
URL: http://arxiv.org/abs/1408.5882

Kintsch, W. (2000). Metaphor comprehension: A computational theory, Psychonomic bulletin & review
7(2): 257–266.

Klebanov, B. B., Beigman, E. and Diermeier, D. (2009). Discourse topics and metaphors, Proceedings of
the Workshop on Computational Approaches to Linguistic Creativity, Association for Computational
Linguistics, pp. 1–8.

Klebanov, B. B., Diermeier, D. and Beigman, E. (2008). Lexical cohesion analysis of political speech,
Political Analysis 16(4): 447–463.

Klebanov, B. B. and Flor, M. (2013). Argumentation-relevant metaphors in test-taker essays, Proceedings of the First Workshop on Metaphor in NLP, pp. 11–20.

Klebanov, B. B., Leong, B., Heilman, M. and Flor, M. (2014). Different texts, same metaphors:
Unigrams and beyond, Proceedings of the Second Workshop on Metaphor in NLP, pp. 11–17.

Klebanov, B. B., Leong, C. W. and Flor, M. (2015). Supervised word-level metaphor detection:
Experiments with concreteness and reweighting of examples, Proceedings of the Third Workshop on
Metaphor in NLP, pp. 11–20.

Klebanov, B. B., Leong, C. W., Gutierrez, E. D., Shutova, E. and Flor, M. (2016). Semantic classifi-
cations for detection of verb metaphors, Proceedings of the 54th Annual Meeting of the Association
for Computational Linguistics (Volume 2: Short Papers), Vol. 2, pp. 101–106.

Kokkinakis, D. (2013). Figurative language in swedish clinical texts, Proceedings of the IWCS 2013
Workshop on Computational Semantics in Clinical Text (CSCT 2013), pp. 17–22.

Köper, M. and Im Walde, S. S. (2016a). Automatically generated affective norms of abstractness, arousal, imageability and valence for 350 000 german lemmas., LREC.

Köper, M. and im Walde, S. S. (2016b). Distinguishing literal and non-literal usage of german particle
verbs., HLT-NAACL, pp. 353–362.

Köper, M. and im Walde, S. S. (2017). Improving verb metaphor detection by propagating abstractness
to words, phrases and individual senses, Proceedings of the 1st Workshop on Sense, Concept and
Entity Representations and their Applications, pp. 24–30.

Kövecses, Z. (2003). Language, figurative thought, and cross-cultural comparison, Metaphor and
symbol 18(4): 311–320.

Kozareva, Z. (2015). Multilingual affect polarity and valence prediction in metaphors, Proceedings of the
6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis,
WASSA@EMNLP 2015, 17 September 2015, Lisbon, Portugal, p. 1.
URL: http://aclweb.org/anthology/W/W15/W15-2901.pdf

Krennmayr, T. (2015). What corpus linguistics can tell us about metaphor use in newspaper texts,
Journalism Studies 16(4): 530–546.

Krennmayr, T. and Steen, G. (2017). Vu amsterdam metaphor corpus, Handbook of Linguistic Anno-
tation, Springer, pp. 1053–1071.

Krishnakumaran, S. and Zhu, X. (2007). Hunting elusive metaphors using lexical resources, Pro-
ceedings of the Workshop on Computational Approaches to Figurative Language, FigLanguages ’07,
Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 13–20.
URL: http://dl.acm.org/citation.cfm?id=1611528.1611531

Krčmář, L., Ježek, K. and Pecina, P. (2013). Determining Compositionality of Expressions Using
Various Word Space Models and Measures, Proceedings of the Workshop on Continuous Vector
Space Models and their Compositionality, pp. 64–73.

Kuhl, P. K. (2004). Early language acquisition: cracking the speech code, Nature reviews neuroscience
5(11): 831–843.

Lacey, S., Stilla, R., Deshpande, G., Zhao, S., Stephens, C., McCormick, K., Kemmerer, D. and
Sathian, K. (2017). Engagement of the left extrastriate body area during body-part metaphor
comprehension, Brain and language 166: 1–18.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B. and Gershman, S. J. (2017). Building machines that
learn and think like people, Behavioral and Brain Sciences 40.

Lakoff, G. (1989). Some empirical results about the nature of concepts, Mind & Language 4(1-2): 103–
129.

Lakoff, G. (1993). The contemporary theory of metaphor.

Lakoff, G., Espenson, J. and Schwartz, A. (1991). Master metaphor list, University of California at
Berkeley, Cognitive Linguistics Group.

Lakoff, G. and Johnson, M. (2008a). Metaphors we live by, University of Chicago press.

Lakoff, G. and Johnson, M. (2008b). Metaphors we live by, University of Chicago press.

Lappin, S. and Zadrozny, W. (2000). Compositionality, synonymy, and the systematic representation
of meaning, CoRR cs.CL/0001006.
URL: http://arxiv.org/abs/cs.CL/0001006

Laranjeira, C. (2013). The role of narrative and metaphor in the cancer life story: a theoretical
analysis, Medicine, Health Care and Philosophy 16(3): 469–481.

Lau, J. H., Clark, A. and Lappin, S. (2017). Grammaticality, acceptability, and probability: a proba-
bilistic view of linguistic knowledge, Cognitive Science 41(5): 1202–1241.

Lau, J. H., Clark, A. and Lappin, S. (2015). Unsupervised prediction of acceptability judgements, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pp. 26–31.

Laurent, J.-P., Denhières, G., Passerieux, C., Iakimova, G. and Hardy-Baylé, M.-C. (2006). On under-
standing idiomatic language: The salience hypothesis assessed by erps, Brain Research 1068(1): 151–
160.

Leech, G. N. and Short, M. (2007). Style in fiction: A linguistic introduction to English fictional prose,
number 13, Pearson Education.

Lenci, A. (2008). Distributional semantics in linguistic and cognitive research, Italian Journal of
Linguistics 20(1): 1–31.

Lenci, A. (2018). Distributional Models of Word Meaning, Annual Review of Linguistics 4: 151–171.

Lenci, A., Santus, E., Lu, Q. and Huang, C.-R. (2015). When similarity becomes opposition: Synonyms and antonyms discrimination in dsms, Italian Journal of Computational Linguistics 1.

Lenci, A. and Zamparelli, R. (2010). Compositionality and distributional semantic models.

Leong, C. W. B., Klebanov, B. B. and Shutova, E. (2018). A report on the 2018 vua metaphor detection
shared task, Proceedings of the Workshop on Figurative Language Processing, pp. 56–66.

Levy, O., Goldberg, Y. and Dagan, I. (2015). Improving distributional similarity with lessons learned
from word embeddings, Transactions of the Association for Computational Linguistics 3: 211–225.

Li, H., Zhu, K. Q. and Wang, H. (2013). Data-driven metaphor recognition and explanation, Trans-
actions of the Association for Computational Linguistics 1: 379–390.

Li, L., Roth, B. and Sporleder, C. (2010). Topic models for word sense disambiguation and token-
based idiom detection, Proceedings of the 48th Annual Meeting of the Association for Computational
Linguistics, Association for Computational Linguistics, pp. 1138–1147.

Li, L. and Sporleder, C. (2010a). Linguistic cues for distinguishing literal and non-literal usages,
Proceedings of the 23rd International Conference on Computational Linguistics: Posters, Association
for Computational Linguistics, pp. 683–691.

Li, L. and Sporleder, C. (2010b). Using gaussian mixture models to detect figurative language in con-
text, Human Language Technologies: The 2010 Annual Conference of the North American Chapter
of the Association for Computational Linguistics, HLT ’10, Association for Computational Linguis-
tics, Stroudsburg, PA, USA, pp. 297–300.
URL: http://dl.acm.org/citation.cfm?id=1857999.1858038

Li, L. and Sporleder, C. (2010c). Using gaussian mixture models to detect figurative language in con-
text, Human Language Technologies: The 2010 Annual Conference of the North American Chap-
ter of the Association for Computational Linguistics, Association for Computational Linguistics,
pp. 297–300.

Lin, D. (1999). Automatic identification of non-compositional phrases, Proceedings of the 37th Annual
Meeting of the Association for Computational Linguistics, pp. 317–324.

Littlemore, J. (2004). Item-based and cognitive-style-based variation in students’ abilities to use metaphoric extension strategies, Ibérica: Revista de la Asociación Europea de Lenguas para Fines Específicos (AELFE) (7): 5–31.

Liu, D. (2003). The most frequently used spoken american english idioms: A corpus analysis and its
implications, Tesol Quarterly 37(4): 671–700.

Loukachevitch, N. and Parkhomenko, E. (2018a). Evaluating distributional features for multiword expression recognition, International Workshop on Temporal, Spatial, and Spatio-Temporal Data Mining, Springer, pp. 126–134.

Loukachevitch, N. and Parkhomenko, E. (2018b). Recognition of multiword expressions using word embeddings, Russian Conference on Artificial Intelligence, Springer, pp. 112–124.

Loukina, A., Zechner, K., Bruno, J. and Beigman Klebanov, B. (2018). Using exemplar responses for
training and evaluating automated speech scoring systems, Proceedings of the Thirteenth Workshop
on Innovative Use of NLP for Building Educational Applications, pp. 1–12.

Madnani, N., Tetreault, J. and Chodorow, M. (2012). Re-examining machine translation metrics for
paraphrase identification, Proceedings of the 2012 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, NAACL HLT ’12,
Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 182–190.
URL: http://dl.acm.org/citation.cfm?id=2382029.2382055

Marschark, M., Katz, A. N. and Paivio, A. (1983). Dimensions of metaphor, Journal of Psycholinguistic
Research 12(1): 17–40.

McCabe, A. (1983). Conceptual similarity and the quality of metaphor in isolated sentences versus
extended contexts, Journal of Psycholinguistic Research 12(1): 41–68.

McGlone, M. S. (1996). Conceptual metaphors and figurative language interpretation: Food for
thought?, Journal of memory and language 35(4): 544–565.

Migliore, T. (2007). Gruppo µ. Trattato del segno visivo. Per una retorica dell’immagine.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013a). Distributed representations
of words and phrases and their compositionality, Advances in neural information processing systems,
pp. 3111–3119.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013b). Distributed representations
of words and phrases and their compositionality, Proceedings of the 26th International Conference
on Neural Information Processing System, pp. 3111–3119.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013c). Distributed representa-
tions of words and phrases and their compositionality, in C. J. C. Burges, L. Bottou, M. Welling,
Z. Ghahramani and K. Q. Weinberger (eds), Advances in Neural Information Processing Systems
26, Curran Associates, Inc., pp. 3111–3119.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. and Dean, J. (2013d). Distributed representations
of words and phrases and their compositionality, Advances in neural information processing systems,
pp. 3111–3119.

Mikolov, T., Yih, W.-t. and Zweig, G. (2013). Linguistic regularities in continuous space word rep-
resentations., Human Language Technologies: Conference of the North American Chapter of the
Association of Computational Linguistics, Vol. 13, pp. 746–751.

Miller, G. A. (1995). Wordnet: a lexical database for english, Communications of the ACM 38(11): 39–
41.

Mitchell, J. and Lapata, M. (2010). Composition in Distributional Models of Semantics, Cognitive Science 34(8): 1388–1429.

Mohamed, M. and Oussalah, M. (n.d.). A hybrid approach for paraphrase identification based on
knowledge-enriched semantic heuristics.

Mohammad, S., Shutova, E. and Turney, P. D. (2016). Metaphor as a medium for emotion: An
empirical study, Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics,
*SEM@ACL 2016, Berlin, Germany, 11-12 August 2016.
URL: http://aclweb.org/anthology/S/S16/S16-2003.pdf

Mohler, M., Bracewell, D., Tomlinson, M. and Hinote, D. (2013). Semantic signatures for example-
based linguistic metaphor detection, Proceedings of the First Workshop on Metaphor in NLP, pp. 27–
35.

Mohler, M., Rink, B., Bracewell, D. B. and Tomlinson, M. T. (2014). A novel distributional approach
to multilingual conceptual metaphor recognition., COLING, pp. 1752–1763.

Morgan, G. (1980). Paradigms, metaphors, and puzzle solving in organization theory, Administrative
science quarterly pp. 605–622.

Moritz, M., Hellrich, J. and Buechel, S. (2018). A method for human-interpretable paraphrasticality
prediction, Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for
Cultural Heritage, Social Sciences, Humanities and Literature, pp. 113–118.

Mukherjee, S. and Bala, P. K. (2017). Detecting sarcasm in customer tweets: an nlp based approach,
Industrial Management & Data Systems 117(6): 1109–1126.

Nastase, V. and Strube, M. (2009). Combining collocations, lexical and encyclopedic knowledge for
metonymy resolution, Proceedings of the 2009 Conference on Empirical Methods in Natural Language
Processing: Volume 2-Volume 2, Association for Computational Linguistics, pp. 910–918.

Neuman, Y., Assaf, D., Cohen, Y., Last, M., Argamon, S., Howard, N. and Frieder, O. (2013).
Metaphor identification in large texts corpora, PloS one 8(4): e62343.

Niculae, V. and Yaneva, V. (2013). Computational considerations of comparisons and similes, 51st
Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research
Workshop, pp. 89–95.

Nunberg, G., Sag, I. and Wasow, T. (1994). Idioms, Language 70(3): 491–538.

Peng, J., Aharodnik, K. and Feldman, A. (2018). A distributional semantics model for idiom detection-
the case of english and russian., ICAART (2), pp. 675–682.

Pennington, J., Socher, R. and Manning, C. D. (2014). Glove: Global vectors for word representation,
Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
URL: http://www.aclweb.org/anthology/D14-1162

Pernes, S. (2016). Metaphor mining in historical german novels: Using unsupervised learning to
uncover conceptual systems in literature., DH, pp. 651–653.

Pianta, E., Bentivogli, L. and Girardi, C. (2002). MultiWordNet: Developing an Aligned Multilingual
Database, Proceedings of the First International Conference on Global WordNet, pp. 293–302.

Pollio, H. R., Smith, M. K. and Pollio, M. R. (1990). Figurative language and cognitive psychology,
Language and Cognitive Processes 5(2): 141–167.

Pramanick, M., Gupta, A. and Mitra, P. (2018). An lstm-crf based approach to token-level metaphor
detection, Proceedings of the Workshop on Figurative Language Processing, pp. 67–75.

Rai, S. and Chakraverty, S. (2017). Metaphor detection using fuzzy rough sets, International Joint
Conference on Rough Sets, Springer, pp. 271–279.

Rai, S., Chakraverty, S. and Tayal, D. K. (2016). Supervised metaphor detection using conditional
random fields, Proceedings of the Fourth Workshop on Metaphor in NLP, pp. 18–27.

Rei, M., Bulat, L., Kiela, D. and Shutova, E. (2017). Grasping the finer point: A supervised similarity
network for metaphor detection, arXiv preprint arXiv:1709.00575 .

Rentoumi, V., Giannakopoulos, G., Karkaletsis, V. and Vouros, G. (2009). Sentiment analysis of figurative language using word sense disambiguation approach, International Conference RANLP 2009.

Reyes, A., Rosso, P. and Buscaldi, D. (2012). From humor recognition to irony detection: The figurative
language of social media, Data & Knowledge Engineering 74: 1–12.

Rimell, L., Maillard, J., Polajnar, T. and Clark, S. (2016). Relpron: A relative clause evaluation data
set for compositional distributional semantics, Computational Linguistics 42(4): 661–701.

Rodriguez, M. C. (2003). How to talk shop through metaphor: bringing metaphor research to the esp
classroom, English for Specific Purposes 22(2): 177–194.

Romero, E. and Soria, B. (2014). Relevance theory and metaphor, Linguagem em (Dis) curso
14(3): 489–509.

Rosso, P., Rangel, F., Farías, I. H., Cagnina, L., Zaghouani, W. and Charfi, A. (2018). A survey on
author profiling, deception, and irony detection for the arabic language, Language and Linguistics
Compass 12(4): e12275.

Rzepka, R., Dybala, P., Sayama, K. and Araki, K. (2013). Semantic clues for novel metaphor generator,
Proceedings of 2nd international workshop of computational creativity, concept invention, and general
intelligence, C3GI.

Sag, I. A., Baldwin, T., Bond, F., Copestake, A. and Flickinger, D. (2002). Multiword Expressions:
A Pain in the Neck for NLP, Proceedings of the 3rd International Conference on Intelligent Text
Processing and Computational Linguistics, pp. 1–15.

Sainath, T. N., Vinyals, O., Senior, A. and Sak, H. (2015). Convolutional, long short-term memory,
fully connected deep neural networks, Acoustics, Speech and Signal Processing (ICASSP), 2015
IEEE International Conference on, IEEE, pp. 4580–4584.

Salager-Meyer, F. (1990). Metaphors in medical english prose: A comparative study with french and
spanish, English for Specific Purposes 9(2): 145–159.

Glucksberg, S. and Haught, C. (2006). On the relation between metaphor and simile: When comparison fails, Mind & Language 21(3): 360–378.

Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P. and Lillicrap, T.
(2017). A simple neural network module for relational reasoning, Advances in neural information
processing systems, pp. 4967–4976.

Sboev, A., Litvinova, T., Gudovskikh, D., Rybka, R. and Moloshnikov, I. (2016). Machine learning
models of text categorization by author gender using topic-independent features, Procedia Computer
Science 101: 135 – 142.
URL: http://www.sciencedirect.com/science/article/pii/S1877050916326849

Schlechtweg, D., Eckmann, S., Santus, E., Walde, S. S. i. and Hole, D. (2017). German in flux:
Detecting metaphoric change via word entropy, arXiv preprint arXiv:1706.04971 .

Schmidt, G. L. and Seger, C. A. (2009). Neural correlates of metaphor processing: the roles of
figurativeness, familiarity and difficulty, Brain and cognition 71(3): 375–386.

Schulder, M. and Hovy, E. (2014). Metaphor detection through term relevance, Proceedings of the
Second Workshop on Metaphor in NLP, pp. 18–26.

Schulder, M. and Hovy, E. (2015). Metaphor detection through term relevance, Proceedings of the
First Workshop on Metaphor in NLP.

Sculley, D. and Pasanek, B. M. (2008). Meaning and mining: the impact of implicit assumptions in
data mining for the humanities, Literary and Linguistic Computing 23(4): 409–424.

Sell, M. A., Kreuz, R. J. and Coppenrath, L. (1997). Parents’ use of nonliteral language with preschool
children, Discourse Processes 23(2): 99–118.

Semino, E. and Culpeper, J. (2002). Cognitive stylistics: Language and cognition in text analysis,
Vol. 1, John Benjamins Publishing.

Senaldi, M. S. G., Lebani, G. E. and Lenci, A. (2016). Lexical variability and compositionality:
Investigating idiomaticity with distributional semantic models, Proceedings of the 12th Workshop on
Multiword Expressions, pp. 21–31.

Senaldi, M. S. G., Lebani, G. E. and Lenci, A. (2017). Determining the compositionality of noun-
adjective pairs with lexical variants and distributional semantics, Italian Journal of Computational
Linguistics 3(1): 43–58.

Shutova, E. (2010a). Automatic metaphor interpretation as a paraphrasing task, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT ’10, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 1029–1037.
URL: http://dl.acm.org/citation.cfm?id=1857999.1858145

Shutova, E. (2010b). Automatic metaphor interpretation as a paraphrasing task, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, pp. 1029–1037.

Shutova, E. (2010c). Models of metaphor in nlp, Proceedings of the 48th annual meeting of the asso-
ciation for computational linguistics, Association for Computational Linguistics, pp. 688–697.

Shutova, E. (2011). Computational approaches to figurative language, Technical report.

Shutova, E., Kiela, D. and Maillard, J. (2016). Black holes and white rabbits: Metaphor identification
with visual features, Proceedings of the 2016 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, pp. 160–170.

Shutova, E., Sun, L. and Korhonen, A. (2010a). Metaphor identification using verb and noun cluster-
ing, Proceedings of the 23rd International Conference on Computational Linguistics, Association for
Computational Linguistics, pp. 1002–1010.

Shutova, E., Sun, L. and Korhonen, A. (2010b). Metaphor identification using verb and noun cluster-
ing, Proceedings of the 23rd International Conference on Computational Linguistics, Association for
Computational Linguistics, pp. 1002–1010.

Shutova, E., Teufel, S. and Korhonen, A. (2013). Statistical metaphor processing, Computational
Linguistics 39(2): 301–353.

Sikos, L., Brown, S. W., Kim, A. E., Michaelis, L. A. and Palmer, M. (2008). Figurative language: “meaning” is often more than just a sum of the parts., AAAI Fall Symposium: Biologically Inspired Cognitive Architectures, pp. 180–185.

Simpson, P. (2004). Stylistics: A resource book for students, Psychology Press.

Socher, R., Huang, E. H., Pennington, J., Ng, A. Y. and Manning, C. D. (2011). Dynamic Pooling
and Unfolding Recursive Autoencoders for Paraphrase Detection, Advances in Neural Information
Processing Systems 24.

Soleymanzadeh, K., Karaoğlan, B., Metin, S. K. and Kişla, T. (2018). Combining machine translation
and text similarity metrics to identify paraphrases in turkish, 2018 26th Signal Processing and
Communications Applications Conference (SIU), IEEE, pp. 1–4.

Sporleder, C. and Li, L. (2009). Unsupervised recognition of literal and non-literal use of idiomatic
expressions, Proceedings of the 12th Conference of the European Chapter of the Association for
Computational Linguistics, Association for Computational Linguistics, pp. 754–762.

Srivastava, S. and Hovy, E. (2014). Vector space semantics with frequency-driven motifs, Proceedings
of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), Vol. 1, pp. 634–643.

Steen, G. (2010). A method for linguistic metaphor identification: From MIP to MIPVU, Vol. 14, John
Benjamins Publishing.

Steen, G. (2014). Metaphor and style, The Cambridge handbook of Stylistics pp. 315–328.

Steen, G., Dorst, A., Herrmann, B., Kaal, A., Krennmayr, T. and Pasma, T. (2010). A Method for
Linguistic Metaphor Identification: From MIP to MIPVU.

Steen, G. J., Dorst, A. G., Herrmann, J. B., Kaal, A., Krennmayr, T. and Pasma, T. (2014). A
method for linguistic metaphor identification: From mip to mipvu., Metaphor and the Social World
4(1): 138–146.

Su, C., Huang, S. and Chen, Y. (2017). Automatic detection and interpretation of nominal metaphor
based on the theory of meaning, Neurocomputing 219: 300–311.

Sun, S. and Xie, Z. (2017). Bilstm-based models for metaphor detection, National CCF Conference
on Natural Language Processing and Chinese Computing, Springer, pp. 431–442.

Sweetser, E. (1991). From etymology to pragmatics: Metaphorical and cultural aspects of semantic
structure, Vol. 54, Cambridge University Press.

Tai, K. S., Socher, R. and Manning, C. D. (2015). Improved semantic representations from tree-
structured long short-term memory networks, CoRR abs/1503.00075.
URL: http://arxiv.org/abs/1503.00075

Tang, S. and de Sa, V. R. (2018). Exploiting invertible decoders for unsupervised sentence represen-
tation learning, arXiv preprint arXiv:1809.02731 .

Tanguy, L., Sajous, F., Calderone, B. and Hathout, N. (2012). Authorship attribution: Using rich
linguistic features when training data is scarce., PAN Lab at CLEF.

Titone, D. and Libben, M. (2014). Time-dependent effects of decomposability, familiarity and literal
plausibility on idiom priming: A cross-modal priming investigation, The Mental Lexicon 9(3): 473–
496.

Torre, E. (2014). The emergent patterns of Italian idioms: A dynamic-systems approach, PhD thesis,
Lancaster University.

Tourangeau, R. and Rips, L. (1991). Interpreting and evaluating metaphors, Journal of Memory and
Language 30(4): 452–472.

Tourangeau, R. and Sternberg, R. J. (1981). Aptness in metaphor, Cognitive psychology 13(1): 27–55.

Tourangeau, R. and Sternberg, R. J. (1982). Understanding and appreciating metaphors, Cognition 11(3): 203–244.

Tsvetkov, Y., Boytsov, L., Gershman, A., Nyberg, E. and Dyer, C. (2014a). Metaphor detection with
cross-lingual model transfer.

Tsvetkov, Y., Boytsov, L., Gershman, A., Nyberg, E. and Dyer, C. (2014b). Metaphor detection
with cross-lingual model transfer, Proceedings of the 52nd Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 248–258.

Turney, P. D. (2013). Distributional semantics beyond words: Supervised learning of analogy and
paraphrase, CoRR abs/1310.5042.
URL: http://arxiv.org/abs/1310.5042

Turney, P. D., Neuman, Y., Assaf, D. and Cohen, Y. (2011). Literal and metaphorical sense identifica-
tion through concrete and abstract context, Proceedings of the Conference on Empirical Methods in
Natural Language Processing, EMNLP ’11, Association for Computational Linguistics, Stroudsburg,
PA, USA, pp. 680–690.
URL: http://dl.acm.org/citation.cfm?id=2145432.2145511

Turney, P. D. and Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of Semantics,
Journal of Artificial Intelligence Research 37: 141–188.

Underwood, G., Schmitt, N. and Galpin, A. (2004). The eyes have it, Formulaic sequences: Acquisition,
processing, and use 9: 153.

Utsumi, A. (2011). Computational exploration of metaphor comprehension processes using a semantic space model, Cognitive science 35(2): 251–296.

Van Hee, C., Lefever, E. and Hoste, V. (2018a). Exploring the fine-grained analysis and automatic
detection of irony on twitter, Language Resources and Evaluation pp. 1–25.

Van Hee, C., Lefever, E. and Hoste, V. (2018b). Semeval-2018 task 3: Irony detection in english tweets,
Proceedings of The 12th International Workshop on Semantic Evaluation, pp. 39–50.

Veale, T. (2011). Creative language retrieval: A robust hybrid of information retrieval and linguistic
creativity, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:
Human Language Technologies-Volume 1, Association for Computational Linguistics, pp. 278–287.

Veale, T. (2012). Exploding the creativity myth: The computational foundations of linguistic creativity,
A&C Black.

Veale, T. (2014). A service-oriented architecture for metaphor processing, Proceedings of the Second
Workshop on Metaphor in NLP, pp. 52–60.

Veale, T. (2015). Game of tropes: Exploring the placebo effect in computational creativity., ICCC,
pp. 78–85.

Veale, T. (2017). Metaphor and metamorphosis, Metaphor in Communication, Science and Education
36: 43.

Veale, T. and Hao, Y. (2007a). Comprehending and generating apt metaphors: a web-driven, case-
based approach to figurative language, AAAI, Vol. 2007, pp. 1471–1476.

Veale, T. and Hao, Y. (2007b). Learning to understand figurative language: from similes to metaphors
to irony, Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 29.

Veale, T. and Li, G. (2012). Specifying viewpoint and information need with affective metaphors:
a system demonstration of the metaphor magnet web app/service, Proceedings of the ACL 2012
System Demonstrations, Association for Computational Linguistics, pp. 7–12.

Veale, T., Shutova, E. and Klebanov, B. B. (2016a). Metaphor: A Computational Perspective, Synthesis
Lectures on Human Language Technologies, Morgan & Claypool Publishers.
URL: https://doi.org/10.2200/S00694ED1V01Y201601HLT031

Veale, T., Shutova, E. and Klebanov, B. B. (2016b). Metaphor: A computational perspective, Synthesis
Lectures on Human Language Technologies 9(1): 1–160.

Vietri, S. (2014). Idiomatic constructions in Italian: a lexicon-grammar approach, Vol. 31, John
Benjamins Publishing Company.

Vogel, C. (2001). Dynamic semantics for metaphor, Metaphor and Symbol 16(1-2): 59–74.

Vosoughi, S., Vijayaraghavan, P. and Roy, D. (2016). Tweet2vec: Learning tweet embeddings us-
ing character-level cnn-lstm encoder-decoder, Proceedings of the 39th International ACM SIGIR
conference on Research and Development in Information Retrieval, ACM, pp. 1041–1044.

Wang, J., Yu, L.-C., Lai, K. R. and Zhang, X. (2016). Dimensional sentiment analysis using a regional
cnn-lstm model, Proceedings of the 54th Annual Meeting of the Association for Computational Lin-
guistics (Volume 2: Short Papers), Vol. 2, pp. 225–230.

Wang, X., Gao, L., Song, J. and Shen, H. (2017). Beyond frame-level cnn: saliency-aware 3-d cnn with
lstm for video action recognition, IEEE Signal Processing Letters 24(4): 510–514.

Wilson, D. (2011). Parallels and differences in the treatment of metaphor in relevance theory and
cognitive linguistics, Intercultural Pragmatics 8(2): 177–196.

Wulff, S. (2008). Rethinking Idiomaticity: A Usage-based Approach, Continuum.

Xu, W., Callison-Burch, C. and Dolan, B. (2015). Semeval-2015 task 1: Paraphrase and semantic
similarity in twitter (pit), Proceedings of the 9th International Workshop on Semantic Evaluation
(SemEval 2015), Association for Computational Linguistics, Denver, Colorado, pp. 1–11.
URL: http://www.aclweb.org/anthology/S15-2001

Yin, W. and Schütze, H. (2015). Convolutional neural network for paraphrase identification, NAACL
HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015,
pp. 901–911.
URL: http://aclweb.org/anthology/N/N15/N15-1091.pdf

Zhang, Y., Yuan, H., Wang, J. and Zhang, X. (2017). Ynu-hpcc at emoint-2017: Using a cnn-lstm model
for sentiment intensity prediction, Proceedings of the 8th Workshop on Computational Approaches
to Subjectivity, Sentiment and Social Media Analysis, pp. 200–204.
Part II

STUDIES

Study I

“Deep” Learning: Detecting Metaphoricity in Adjective-Noun Pairs

Yuri Bizzoni and Stergios Chatzikyriakidis and Mehdi Ghanimifard


yuri.bizzoni,stergios.chatzikyriakidis,[email protected]

This research is funded by the Centre of Linguistic Theory and Studies in Probability at the University of Gothenburg.

Abstract

Metaphor is one of the most studied and widespread figures of speech and an essential element of individual style. In this paper we look at metaphor identification in Adjective-Noun pairs. We show that using a single neural network combined with pre-trained vector embeddings can outperform the state of the art in terms of accuracy. Specifically, the approach presented in this paper is based on two ideas: a) transfer learning via pre-trained vectors representing adjective-noun pairs, and b) a neural network as a model of composition that predicts a metaphoricity score as output. We present several different architectures for our system and evaluate their performances. Variations on dataset size and on the kinds of embeddings are also investigated. We show considerable improvement over the previous approaches both in terms of accuracy and w.r.t. the size of annotated training data.

1 Introduction

The importance of metaphor to characterize both individual and genre-related style has been underlined in several works (Leech and Short, 2007; Simpson, 2004; Goodman, 1975). Studying the kinds of metaphors used in a text can contribute to differentiate between poetic and prosaic style, etc. In literary studies, metaphor analysis is often undertaken from a stylistic perspective: “after all, metaphor in literature is a stylistic device and its forms, meanings and use all fall within the remit of stylistics” (Steen, 2014). Metaphor is thus often taken into consideration in qualitative stylistic analyses (Fahnestock, 2009). Nonetheless, it is still very difficult to take metaphors into account in computational stylistics due to the complexity of automatic metaphor identification (Neuman et al., 2013; Klebanov et al., 2015), which is the task of identifying metaphorical usages of text, sentences or subsentential fragments.

This paper’s focus of interest is the automatic detection of adjective-noun (AN) pairs like the following:

(1) Clean floor / clean performance
(2) Bright painting / bright idea
(3) Heavy table / heavy feeling

The above examples illustrate that adjectives “normally” used to describe physical characteristics, e.g. a feature that can be perceived through senses like size or weight, are reused to describe more abstract properties. Thus, both a painting and an idea can be bright, both a table and a feeling can be heavy. We will not provide a means to retrieve AN metaphors in unconstrained texts (e.g. we won’t focus on segmentation) but we will study ways to detect metaphoricity in given pairs. Theoretical work on metaphor in the linguistics literature goes back a long way and spans different theoretical paradigms. One of the earliest and most influential works is Conceptual Metaphor Theory (CMT) (Lakoff and Johnson, 2008) (originally published in 1981), subsequently elaborated in a couple of papers (Lakoff, 1989, 1993). According to CMT, metaphors in natural language can be seen as instances of conceptual metaphors. A conceptual metaphor roughly corresponds to understanding a concept or an idea via association or relation with another idea or concept. Other influential linguistic approaches to metaphor include pragmatic approaches cast within frameworks like relevance theory (Romero and Soria, 2014; Wilson, 2011), and also approaches where some sort of formal semantics is used (Vogel, 2001). The common denominator in all these approaches is the recognition that there is systematicity in the way metaphorical meanings arise and also that the process of metaphor construction is extremely productive. Thus, given these properties, one would expect metaphors to be quite common in Natural Language (NL). Evidence from corpus linguistics seems to support this claim (Cameron, 2003).
Metaphor detection in statistical NLP has been attempted through several different frames, such as topic modeling (Li and Sporleder, 2010b), semantic similarity graphs (Li and Sporleder, 2010a), distributional clustering (Shutova et al., 2010), vector space based learning (Gutiérrez et al., 2016) and, most of all, feature-based classifiers (Tsvetkov et al., 2014). In the latter case, the challenge consists in selecting the right features to annotate the training data with, and to review their “importance” or weight based on machine learning results.

In this paper we show how using a single-layered neural network combined with pre-trained distributional embeddings can outperform the state of the art in an AN metaphor detection task. More specifically, this paper’s contributions are the following:

• We introduce a system to predict AN metaphoricity and test it on the corpus introduced by (Gutiérrez et al., 2016), showing a significant improvement in accuracy.

• We explore different variations of this model based on ideas found in the literature for composing distributional meaning and we evaluate them under different constraints.

The paper is structured as follows: in Section 2 we present the background on AN metaphor detection and we detail the dataset we use to train our model. In Section 3 we describe our approach, giving a general overview and further describing three alternative architectures on the same model. In Section 4 we present several evaluations of our model. Table 1 and Table 2 synthesize some of our findings. In Section 5 we discuss our findings and possible future applications of the work described in this paper.

2 Background

In the specific task of detecting metaphoricity for AN pairs we find four relevant works that seem to represent the main stages in figurative language detection until now.

The oldest work of the series, (Krishnakumaran and Zhu, 2007), strongly relies on external resources. They adopt a WordNet based approach to recognize Noun-Noun (NN), Noun-Verb (NV) and AN metaphors. Their work is mainly based on qualitative analyses of specific examples and shows that, while they can be useful in such a task, hyponym/hypernym relations are not enough to distinguish metaphors from literal expressions.

More recently, Turney et al. (2011) adopt a two-stage machine learning approach. They first try to learn the words’ degree of concreteness and then use this knowledge to detect whether an AN couple is metaphorical or not. They measure their performance on 100 phrases involving 5 adjectives and reach an accuracy of 0.79. It is worth noting that this choice is not random: the authors select the abstract/concrete polarity based on psycholinguistic findings that seem to validate the hypothesis that some kinds of metaphorical expressions are processed as abstract elements.1

1 For a more recent study on this issue see (Forgács et al., 2015).

These results were outperformed by Tsvetkov et al. (2014) through a random forest classifier using DSM vectors, WordNet senses and several accurately selected features, such as abstractness. They also introduce a new set of 200 phrases, on which they declare an F-score of 0.85.

Finally, Gutiérrez et al. (2016) train a distributional model on a corpus of 4.58 billion tokens and test it on an annotated dataset they introduce consisting of 8592 AN phrases. This is the same dataset we are using in this paper and the largest available to date.

They first train distributional vectors for the words in the dataset using positive pointwise mutual information. Then, for each adjective present in the dataset, they divide the literal phrases the adjective occurs in from the metaphorical phrases the same adjective appears in. Then, three different adjective matrices are trained: one to model the adjective’s literal sense, one to model its metaphorical sense, and one trained on all the phrases containing this adjective, both literal and metaphorical. They then develop a system to “decide” whether a particular occurrence of an adjective is more likely to relate to the “literal matrix” or the “metaphorical matrix”. It is shown that, although such matrices are trained on relatively few examples, they can reach an accuracy of over 0.78.
                           Accuracy   Feature engineering   Annotated dataset   Embedding
(Turney et al., 2011)      0.79       Yes                   100                 LSA
(Tsvetkov et al., 2014)    0.85       Yes                   200                 -
(Gutiérrez et al., 2016)   0.81       No                    8592                DSM
Our model                  0.91       No                    8592                Word2Vec

Table 1: The reported accuracy from previous works on AN metaphor detection. The first two studies used different datasets. We are using larger pre-trained vectors than Gutiérrez et al. (2016); at the same time, we don’t need a parsed corpus to build our vectors and we don’t use adjectival matrices. Given these differences, this comparison should not be considered a “competition”.

             Random W   Trained W
cat-linear   0.8973     0.9153
cat-relu     0.8763     0.9228
sum-linear   0.8815     0.9068
sum-relu     0.8597     0.9150
mul-linear   0.7858     0.8066
mul-relu     0.7795     0.8186

Table 2: The accuracy results after training the model based on each architecture. In all setups, we trained on 500 samples in 20 epochs. Using a random W is equivalent to preventing our network from learning any form of compositionality (we could consider it as a baseline for models with trained W). As we discuss in the paper, the difference in accuracies with the “baseline” (not training W) shows that training W is helpful.
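The row labels of Table 2 combine a composition operation (cat, sum, mul) over the adjective and noun vectors with a linear or relu activation. The sketch below is only one plausible reading of those labels, added for illustration; the exact placement of the trained matrix W and the function signature are assumptions, not a description of the system’s actual code.

```python
# Illustrative sketch of the composition variants suggested by Table 2's row labels
# (cat / sum / mul, each followed by a trained matrix W and a linear or ReLU
# activation). Hypothetical reading, not the authors' implementation.
import numpy as np

def compose(u, v, W, mode="cat", activation="relu"):
    """Compose two 300-d word vectors u and v into a phrase vector p = g(W x)."""
    if mode == "cat":      # concatenation: x = [u; v], so W has shape (300, 600)
        x = np.concatenate([u, v])
    elif mode == "sum":    # elementwise sum: x = u + v, so W has shape (300, 300)
        x = u + v
    elif mode == "mul":    # elementwise product: x = u * v, so W has shape (300, 300)
        x = u * v
    else:
        raise ValueError(f"unknown composition mode: {mode}")
    p = W @ x
    return np.maximum(p, 0.0) if activation == "relu" else p
```

Under this reading, the “Random W” column of Table 2 corresponds to leaving W at its random initialization instead of training it.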
2.1 Corpus/Experimental Data

The dataset we are using comes from (Gutiérrez et al., 2016).2 It contains 8592 annotated AN pairs, 3991 being literal and 4601 being metaphorical. The dataset focuses on a set of 23 adjectives that: a) can potentially have both metaphorical and literal meanings, and b) are fairly productive.

The choice of adjectives was based on the test set of (Tsvetkov et al., 2014) and focuses on 23 adjectives.

In detail, all adjectives belong to one of the following categories:

1. temperature adjectives (e.g. cold)
2. light adjectives (e.g. bright)
3. texture adjectives (e.g. rough)
4. substance adjectives (e.g. dense)
5. clarity adjectives (e.g. clean)
6. taste adjectives (e.g. bitter)
7. strength adjectives (e.g. strong)
8. depth adjectives (e.g. deep)

The corpus was carefully built in order to avoid ambiguous elements: all the AN phrases present in this dataset were extracted from large corpora, and all phrases that seemed to require a larger context for their interpretation were filtered out in order to eliminate potentially ambiguous idiomatic expressions such as bright side.

In other terms, the corpus was designed to contain elements whose metaphoricity could be deduced by a human annotator without the need of a larger context.

More details about the construction of the dataset and annotation methodology can be found in (Gutiérrez et al., 2016).

2 The dataset is publicly available here: http://bit.ly/1TQ5czN
phrases with vectors, then based on this representation predicts a metaphoricity score as output. Although we are going to present several variations of this framework, it's important to remember that the basic model is always a standard NN with a single fully connected hidden layer we will call p.

Our approach is thus based on the idea that well-trained distributional vectors contain more valuable information than their reciprocal similarity and, furthermore, that it is possible to treasure such information through machine learning in different tasks. We use 300-dimensional word vectors trained on different corpora (see Evaluation for more details). Our approach can be considered as a way of transferring the learned representation from one task to another. Although it is not possible to point out an explicit mapping between the word-vector learning task (e.g. Word2Vec model) and our metaphoricity task, as is pointed out by Torrey and Shavlik (2009), we use neural networks which automatically learn how to adapt the feature representations between two tasks (Bengio et al., 2013). In this way we stretch the original embeddings, trained in order to learn lexical similarity, to identify AN metaphors.

Our neural network, being a parameterized function, follows the generalized architecture of word-vector composition similar to (Mitchell and Lapata, 2010):

    p = f(u, v; θ)    (4)

where u and v are two word vector representations to be composed, while p is the vector representation of their composition with the same dimensions. The function f in our model is parameterized by θ, a list of parameters to be learned as part of our neural network architecture.
Based on the argument by (Mitchell and Lapata, 2010), parameters such as θ encode the knowledge required by the compositional process. In our case, the gradient based learning in neural networks will find these parameters as an optimization problem where p is just an intermediate representation in the pipeline of the neural network, which ends with a prediction of a metaphoricity score.
In other words, in order to predict the degree of metaphoricity, we end up learning a specific semantic space for phrase representations p and a vector q which actually does not represent a phrase itself, but rather the maximal possible level of metaphoricity given our training set.
The degree of metaphoricity of a phrase can thus be directly computed as the cosine similarity between this vector and the phrase vector. However, in the network we used a sigmoid function to produce the measure:

    ŷ = σ(p·q + b1) = 1 / (1 + e^−(p·q + b1))    (5)

where q and b1 are parameters of the final layer and work as metaphoricity indicators, while ŷ is the predicted score (metaphoric or literal) for the composition p. Given a dataset D = {(xt, yt)}, t ∈ {1, ..., T}, the composition p can be formalized as a model for a Bernoulli distribution:

    yt = Pr(xt being metaphorical | D) ∈ {0, 1}
    ŷt = σ(pt·q + b1) ≈ Pr(xt being metaphorical) ∈ (0, 1)    (6)

where each xt is an AN pair in the training dataset labeled with a binary value yt (0 or 1). Given the labels in D, we interpret yt as a categorical probability score: the probability of a given phrase being metaphorical. Then, for each pair of words in xt, we use pre-trained word-vector representations such as ut and vt in Equation 4 to produce pt and, consequently, the score ŷt.
In this formulation, the objective is to minimize the binary cross entropy distance between the estimated ŷt and the given annotation yt. Adding q and b1 to the list of parameters Θ, we fit all parameters with a small annotated data size T:

    x = (x1, ..., xT)
    y = (y1, ..., yT)    (7)
    Θ = (θ, q, b1)

    L(Θ; x, y) = − Σ_{t=1}^{T} [ yt log(ŷt) + (1 − yt) log(1 − ŷt) ]    (8)

where, on each iteration, we update the parameters in Θ using Adam stochastic gradient descent (Kingma and Ba, 2014), with a fixed number of iterations over x and y to minimize L.
In this paper, we describe three alternative architectures to implement this framework. All three, with small variations, show a robust ability to generalize on the dataset and perform correct predictions.
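As an illustration of this framework, a minimal Keras sketch is given below. The 300-dimensional hidden layer, the sigmoid output, the binary cross entropy loss and the Adam optimizer follow the description above; the variable names, the random placeholder data and the batch size are illustrative assumptions rather than the authors' released code.

```python
import numpy as np
from tensorflow import keras

# Input: the concatenated adjective and noun embeddings (600-d for two 300-d vectors).
inputs = keras.Input(shape=(600,))
p = keras.layers.Dense(300, activation="linear", name="composition")(inputs)   # p = f(u, v; theta)
y_hat = keras.layers.Dense(1, activation="sigmoid", name="metaphoricity")(p)   # sigma(p.q + b1)
model = keras.Model(inputs, y_hat)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# x_train: concatenated [u; v] pairs, y_train: 0/1 metaphoricity labels (placeholders here).
x_train = np.random.rand(500, 600).astype("float32")
y_train = np.random.randint(0, 2, size=(500, 1))
model.fit(x_train, y_train, epochs=20, batch_size=32)
```

In this sketch the layer named composition plays the role of p in Equation 4, while the weights of the final layer correspond to q and b1 in Equation 5.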
3.2 First Architecture

One possible formulation of this frame is similar to additive composition as described in (Mitchell and Lapata, 2010), but instead of performing a scalar modification of each vector, a weight matrix modifies all feature dimensions at once:

    p = Wadj^T u + Wnoun^T v + b    (9)

    W = [ Wadj ; Wnoun ]    (10)

where the composition function in equation (4) now has θ = (W, b).
This formulation is very similar to the composition model in (Socher et al., 2011) without the syntactic tree parametrization. As such, instead of the non-linearity function we have linear identity:

    p = fθ(u, v) = W^T [ u ; v ] + b    (11)

In practice, this approach represents a simple merging through concatenation: given two words' vectors, we concatenate them before feeding them to a single-layered, fully connected Neural Network.
As a consequence, the network learns a weight matrix that represents linearly the AN combination. To visualize this concept, we could say that, since our pairs always hold the same internal structure (adjective in first position and noun in second position), the first half of the weight matrix is trained on adjectives and the second half of the weight matrix is trained on nouns.
By using 300-dimension pre-trained word vectors, the parameter space for this composition function will be as follows: W ∈ ℝ^(300×600) and b ∈ ℝ^300.

3.3 Second architecture

The second architecture we describe has the advantage of training a smaller set of parameters with respect to the first. In this model, the weight matrix is shared between the noun and the adjective:

    p = fθ(u, v) = W^T u + W^T v + b    (12)

Notice that in the case of comparing the vector representations of two different AN phrases, b will be essentially redundant. An advantage of this model is that the learned composition function f can also map all words' vectors, regardless of the part of speech these words belong to, in the new vector space without losing accuracy in the original task. In this new vector space, a simple addition operator composes two vectors:

    u′ = W^T u    (13)
    v′ = W^T v    (14)
    p = u′ + v′    (15)

Compared to the first architecture, in this architecture we don't assume the need of distinguishing the weight matrix for the adjectives from the weight matrix for the nouns.
It is rather interesting, then, that this architecture doesn't present significant differences in performance with respect to the first one. The number of parameters, however, is smaller: W ∈ ℝ^(300×300) and b ∈ ℝ^300.

3.4 Third Architecture

The third architecture, similarly to the second, features a shared composition matrix of weights between the noun and the adjective, but we perform elementwise multiplication between the two vectors:

    p = fθ(u, v) = (u × v) W + b    (16)

The number of parameters in this case is similar to the previous architecture: W ∈ ℝ^(300×300) and b ∈ ℝ^300.

3.5 Other Architectures

In all three previous architectures we saw that a weight matrix W can be learned as part of the composing function. Throughout our exploration, we found that W can be a random, constant uniform matrix (not trained in the network) and the model is still able to learn q, unless we use a non-linear activation function over the AN compositions:

    p = g(fθ(u, v))    (17)

An intuition is to take W as an identity matrix in the second architecture: the network will then take the sum of the pre-trained vectors as features and learn how to predict metaphoricity. A fixed uniform W basically keeps the information in the input vectors. For a short overview of all these alternative architectures see Table 2.
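To summarize the three composition functions side by side, a small NumPy sketch is given below; the matrix shapes follow the parameter spaces stated above, while the random placeholders merely stand in for the learned parameters.

```python
import numpy as np

def compose_concat(u, v, W, b):
    # First architecture: p = W^T [u; v] + b, with W of shape (600, 300) in this layout.
    return np.concatenate([u, v]) @ W + b

def compose_sum(u, v, W, b):
    # Second architecture: one matrix shared by adjective and noun, p = W^T u + W^T v + b.
    return u @ W + v @ W + b

def compose_mult(u, v, W, b):
    # Third architecture: elementwise multiplication, p = (u * v) W + b.
    return (u * v) @ W + b

# Tiny smoke test with random stand-ins for the learned parameters.
u, v = np.random.rand(300), np.random.rand(300)
W600, W300, b = np.random.rand(600, 300), np.random.rand(300, 300), np.random.rand(300)
for p in (compose_concat(u, v, W600, b), compose_sum(u, v, W300, b), compose_mult(u, v, W300, b)):
    assert p.shape == (300,)
```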
4 Evaluation

Our classifier achieved 91.5% accuracy trained on 500 labeled AN-phrases out of 8592 in the corpus and tested on the rest. Training on 8000 and testing on the rest gave us an accuracy of 98.5%.3

3 These results are based on the first architecture; the performance of the other architectures is not very different in this simple test. The sample code is available on https://gu-clasp.github.io/anvec-metaphor/

We tested several combinations of the architectures we described in the paper. For each of the three architectures, we also tested the Rectified Linear Unit (ReLU) as the non-linearity mentioned in Section 3.5. Our test also shows that a random constant matrix W is enough to train the rest of the parameters (reported in Table 2). In general, the best performing combinations involve the use of concatenation (the first architecture), while multiplication led to the lowest results. In any case, all experiments returned accuracies above 75%.4

4 The number of parameters in case of using concatenation (as in the first architecture) is 180,601; for the other compositions, including addition and multiplication, the number of parameters is almost half: 90,601.

To test the robustness of our approach, we have evaluated our model's performance under several constraints:

• Total separation of vocabulary in train and test sets (Table 3), in case of out-of-vocabulary words.
• Use of different pretrained word embeddings (Figure 3).
• Cross validation (Figure 1).
• Qualitative selection of the training data based on the semantic categories of adjectives (Figure 2).

Finally, we will provide some qualitative insights on how the model works.
Our model is based on the idea of transfer learning: using the learned representation for a new task, in this case metaphor detection. Our model should generalize very fast with a small set of samples as training data. In order to test this matter, we have to train and test on totally different samples so vocabulary doesn't overlap. The splitting of the 8592 labeled phrases based on vocabulary gives us uneven sizes of training and test phrases.5 In Table 3, using the pretrained Word2Vec embeddings trained on Google News (Mikolov et al., 2013), we examined the accuracy, precision and recall of our trained classifier.

5 We chose the vocabulary splitting points for every 10% from 10% to 90%, then we applied the splitting separately on nouns and adjectives.

We have used three different word embeddings: Word2Vec embeddings trained on Google News (Mikolov et al., 2013), GloVe embeddings (Pennington et al., 2014) and Levy-Goldberg embeddings (Levy and Goldberg, 2014). These embeddings are not updated during the training process. Thus, the classification task is always performed by learning weights for the pre-existing vectors.
The results of our experiment can be seen in Figure 3. All these embeddings have returned similar accuracies both when trained on scarce data (100 phrases) and when trained on half of the dataset (4000 phrases).
Training on 100 phrases indicates the ability of our model to learn from scarce data. One way of checking the consistency of our model under data scarcity is to perform flipped cross-validation: this is a cross-validation where, instead of training our model on 90% of the data and testing it on the remaining 10%, we flip the sizes and train it on 10% of the data and test it on the remaining 90%. Results for both classic cross-validation and flipped cross-validation can be seen in Figure 1. Training on 10% of the data proved to consistently achieve accuracies not much lower than 90%. In other terms, a model trained on 90% of the data does not do much better than a model trained on 10%.
Finally, we tried training our model on only one of the semantic categories we introduced at the beginning of the paper and testing it on the rest of the dataset. Results can be seen in Figure 2.
We can wonder "why" our system is working: with respect to more traditional machine learning approaches, there is no direct way to evaluate which features mostly contribute to the success of our system. One way to have an idea of what is happening in the model is to use the "metaphoricity vector" we discussed in Section 3. Such a vector represents what is learned by our model and can help making it less opaque for us.
If we compute the cosine similarity between all the nouns in our dataset and this learned vector, we can see that nouns tend to polarize on an abstract/concrete axis: abstract nouns tend to be more similar to the learned vector than concrete nouns.
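A minimal sketch of this probing step is given below, following the zero-adjective composition used for Table 4; the saved files, their names and the loading code are hypothetical placeholders for the learned parameters of the first (concatenation) architecture and for the Word2Vec vectors of the nouns in the dataset.

```python
import numpy as np

# Hypothetical artifacts saved after training the concatenation architecture:
#   W: (600, 300) composition weights, b: (300,) bias, q: (300,) "metaphoricity vector".
W = np.load("W.npy"); b = np.load("b.npy"); q = np.load("q.npy")
data = np.load("noun_vectors.npz")                    # 300-d Word2Vec vectors per noun
noun_vectors = {name: data[name] for name in data.files}

def cosine(a, c):
    return float(a @ c / (np.linalg.norm(a) * np.linalg.norm(c)))

def phrase_vector(noun_vec):
    # Compose an all-zeros "adjective" with the noun, as done for Table 4.
    x = np.concatenate([np.zeros(300), noun_vec])
    return x @ W + b

scores = {noun: cosine(phrase_vector(vec), q) for noun, vec in noun_vectors.items()}
for noun, score in sorted(scores.items(), key=lambda kv: -kv[1])[:10]:
    print(f"{noun:15s} {score:.3f}")   # abstract nouns are expected to dominate the top
```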
It is likely that our model is learning nouns' level of abstractness as a means to determine phrase metaphoricity. In Table 4 we show the 10 most similar and the 10 least similar nouns obtained with this approach. As can be seen, a concrete-abstract polarity is apparently learned in training. This factor was amply noted and even used in some feature-based metaphor classifiers, as we discussed in the beginning: the advantage of using continuous semantic spaces probably relies on the possibility of having a more nuanced and complex polarization of nouns along the concrete/abstract axes than using hand-annotated resources.

Test   Train   Accuracy   Precision   Recall
6929   72      0.83       0.89        0.77
5561   299     0.89       0.86        0.93
4406   643     0.91       0.92        0.90
3239   1203    0.90       0.91        0.88
2253   1961    0.91       0.92        0.92
1568   2763    0.89       0.90        0.90
707    4291    0.91       0.94        0.91
313    5494    0.93       0.92        0.95
148    6282    0.93       0.94        0.92

Table 3: This table shows consistent results in accuracy, precision and recall of the classifier trained with different split points of vocabulary instead of phrases. Splitting the vocabulary creates different sizes of training phrases and test phrases.

Figure 1: Accuracies for each fold over two complementary approaches: cross-validation (CV) and flipped cross-validation ("flipped-CV"). Flipped cross-validation uses only 10% of our dataset for training. The graph shows that both methods yield good results: in other words, training on just 10% of the dataset yields results that are just a few points lower than normal cross-validation.

Figure 2: Accuracy training on different categories of adjectives. In this experiment, we train on just one category of the dataset and test on all the others. In general, training on just one category (e.g. temperature) and testing on all other categories still yields high accuracy. While the power of generalization of our model is still unclear, we can see that it can detect similar semantic mechanisms even without any vocabulary overlap. The category taste is a partial exception: this category seems to be a relative "outlier".

Figure 3: Accuracy on different kinds of embeddings, both training on 100 phrases and 4000 phrases.

5 Discussion and future work

In this paper we have presented an approach for detecting metaphoricity in AN pairs that outperforms the state of the art without using human annotated data or external resources beyond pre-trained word embeddings. We treasured the information captured by Word2Vec vectors through a fully connected neural network able to filter out the "noise" of the original semantic space. We have presented a series of alternative variations of this approach and evaluated its performance under several conditions - different word embeddings, different training data and different training sizes - showing that our model can generalize efficiently and obtain solid results over scarce training data. We think that this is one of the central findings in this paper, since many semantic phenomena similar to metaphor (for example other figures of speech) are under-represented in current NLP resources and their study through supervised classifiers would require systems able to work on small datasets.
The possibility of detecting metaphors and assigning a degree of "metaphoricity" to a snippet of text is essential to automatic stylistic programs designed to go beyond "shallow features" such as sentence length, functional word counting etc.
Top ten:     reluctance, reprisal, resignation, response, rivalry, satisfaction, storytelling, supporter, surveillance, vigilance
Bottom ten:  saucepan, flour, skillet, chimney, jar, tub, fuselage, pellet, pouch, cupboard

Table 4: 10 most similar and 10 least similar terms with respect to the "metaphoricity vector", concatenated using an all-zeros vector for the adjective. In practice, this is a way to explore which semantic dimensions are particularly useful to the classifier. A concrete/abstract polarity on the nouns was apparently derived.

While such metrics have already allowed powerful studies, the lack of tools to quantify more complex stylistic phenomena is evident (Hughes et al., 2012; Gibbs Jr, 2017). Naturally, this work is intended as a first step: the "metaphoricity" degree our system is learning would mirror the kinds of combination present in this specific dataset, which represents a very specific type of metaphor.
It can be argued that we are not really learning the defining ambiguities of an adjective (e.g. the double meaning of "bright") but that we are probably side-learning nouns' degree of abstraction. This would be in harmony with psycholinguistic findings, since detecting nouns' abstraction seems to be one of the main mechanisms we recur to when we have to judge the metaphoricity of an expression (Forgács et al., 2015), and is used as a main feature in traditional Machine Learning approaches to this problem. In other terms, our system seems to detect when the same adjective is used with different categories of words (abstract or concrete) and generalize over this distinction; a behavior that might not be too far from the way a human learns to distinguish different senses of a word.
An issue that we would like to further test in the future is metaphoricity detection on different datasets, to explore the ability of generalization of our models. Researching on different datasets could also help us gaining a better insight about the model's learning.
An obvious option is to test verb-adverb pairs (VA, e.g. think deeply) using the same approach discussed in this paper. It would then be interesting to see whether having a common training set for both the AN and the VA pairs will allow the model to generalize for both cases, or different training on two training sets, one for AN and one for VA, will be needed. Other cases to test include N-N compounds or proposition/sentence level pairs.
Another way such an approach can be extended is to investigate whether reasoning tasks typically associated with different classes of adjectives can be performed. One task might be to distinguish adjectives that are intersective, subsective or none of the two. In the first case, from A N x one should infer that x is both an A and an N (something that is a black table is both black and a table), in the second case one should infer that x is N only (for example someone who is a skillful surgeon is only a surgeon but we do not know if s/he is skillful in general), and in the third case neither of the two should be inferred. However, this task is not as simple as giving a training set with instances of AN pairs, to recognize where novel instances of AN pairs belong to. Going beyond logical approaches by having the ability to recognize different uses of an adjective requires a richer notion of context which extends way beyond the AN-pairs.
A further idea we want to pursue in the future is the development of more fine grained datasets, where metaphoricity is not represented as a binary feature but as a gradient property. This means that a classifier should have the ability to predict a degree of metaphoricity and thus allow more fine-grained distinctions to be captured. This is a theoretically interesting side and definitely something that has to be tested, since not much literature is available (if at all) on gradient metaphoricity. It seems to us that similar approaches, quantifying a text's metaphoricity and framing it as a supervised learning task, could help having a clear view on the influence of metaphor on style.

References

Beata Beigman Klebanov, Chee Wee Leong, and Michael Flor. 2015. Supervised word-level metaphor detection: Experiments with concreteness and reweighting of examples. In Proceedings of the Third Workshop on Metaphor in NLP. pages 11–20.
Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35(8):1798–1828.

Lynne Cameron. 2003. Metaphor in educational discourse. A&C Black.

Jeanne Fahnestock. 2009. Quid pro nobis. Rhetorical stylistics for argument analysis. Examining argumentation in context. Fifteen studies on strategic maneuvering, pages 131–152.

Balint Forgács, Megan D. Bardolph, Amsel B.D., DeLong K.A., and M. Kutas. 2015. Metaphors are physical and abstract: ERPs to metaphorically modified nouns resemble ERPs to abstract language. Front. Hum. Neurosci. 9(28).

Raymond W Gibbs Jr. 2017. Metaphor Wars. Cambridge University Press.

Nelson Goodman. 1975. The status of style. Critical Inquiry 1(4):799–811.

E. Darío Gutiérrez, Ekaterina Shutova, Tyler Marghetis, and Benjamin K Bergen. 2016. Literal and metaphorical senses in compositional distributional semantic models. In Proceedings of the 54th Meeting of the Association for Computational Linguistics. pages 160–170.

James M Hughes, Nicholas J Foti, David C Krakauer, and Daniel N Rockmore. 2012. Quantitative patterns of stylistic influence in the evolution of literature. Proceedings of the National Academy of Sciences 109(20):7682–7686.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Saisuresh Krishnakumaran and Xiaojin Zhu. 2007. Hunting elusive metaphors using lexical resources. In Proceedings of the Workshop on Computational Approaches to Figurative Language. Association for Computational Linguistics, Stroudsburg, PA, USA, FigLanguages '07, pages 13–20. http://dl.acm.org/citation.cfm?id=1611528.1611531.

George Lakoff. 1989. Some empirical results about the nature of concepts. Mind & Language 4(1-2):103–129.

George Lakoff. 1993. The contemporary theory of metaphor.

George Lakoff and Mark Johnson. 2008. Metaphors we live by. University of Chicago press.

Geoffrey N Leech and Mick Short. 2007. Style in fiction: A linguistic introduction to English fictional prose. 13. Pearson Education.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems. pages 2177–2185.

Linlin Li and Caroline Sporleder. 2010a. Linguistic cues for distinguishing literal and non-literal usages. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters. Association for Computational Linguistics, pages 683–691.

Linlin Li and Caroline Sporleder. 2010b. Using gaussian mixture models to detect figurative language in context. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, HLT '10, pages 297–300. http://dl.acm.org/citation.cfm?id=1857999.1858038.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems. pages 3111–3119.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive science 34(8):1388–1429.

Yair Neuman, Dan Assaf, Yohai Cohen, Mark Last, Shlomo Argamon, Newton Howard, and Ophir Frieder. 2013. Metaphor identification in large texts corpora. PloS one 8(4):e62343.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP). pages 1532–1543. http://www.aclweb.org/anthology/D14-1162.
Esther Romero and Belén Soria. 2014. Relevance
theory and metaphor. Linguagem em (Dis) curso
14(3):489–509.
Ekaterina Shutova, Lin Sun, and Anna Korhonen.
2010. Metaphor identification using verb and noun
clustering. In Proceedings of the 23rd International
Conference on Computational Linguistics. Associ-
ation for Computational Linguistics, pages 1002–
1010.
Paul Simpson. 2004. Stylistics: A resource book for
students. Psychology Press.
Richard Socher, Jeffrey Pennington, Eric H Huang,
Andrew Y Ng, and Christopher D Manning. 2011.
Semi-supervised recursive autoencoders for predict-
ing sentiment distributions. In Proceedings of the
conference on empirical methods in natural lan-
guage processing. Association for Computational
Linguistics, pages 151–161.
Gerard Steen. 2014. Metaphor and style. The Cam-
bridge handbook of Stylistics pages 315–328.
Lisa Torrey and Jude Shavlik. 2009. Transfer learn-
ing. Handbook of Research on Machine Learning
Applications and Trends: Algorithms, Methods, and
Techniques 1:242.
Yulia Tsvetkov, Leonid Boytsov, Anatole Gershman,
Eric Nyberg, and Chris Dyer. 2014. Metaphor de-
tection with cross-lingual model transfer.
Peter D. Turney, Yair Neuman, Dan Assaf, and
Yohai Cohen. 2011. Literal and metaphorical
sense identification through concrete and abstract
context. In Proceedings of the Conference on Em-
pirical Methods in Natural Language Processing.
Association for Computational Linguistics, Strouds-
burg, PA, USA, EMNLP ’11, pages 680–690.
http://dl.acm.org/citation.cfm?id=2145432.2145511.
Carl Vogel. 2001. Dynamic semantics for metaphor.
Metaphor and Symbol 16(1-2):59–74.
Deirdre Wilson. 2011. Parallels and differences
in the treatment of metaphor in relevance theory
and cognitive linguistics. Intercultural Pragmatics
8(2):177–196.
Study II

Finding the Neural Net: Deep-learning Idiom
Type Identification from Distributional
Vectors

Yuri Bizzoni∗ Marco S. G. Senaldi∗∗


University of Gothenburg, Sweden Scuola Normale Superiore di Pisa, Italy

Alessandro Lenci†
University of Pisa, Italy

The present work aims at automatically classifying Italian idiomatic and non-idiomatic phrases
with a neural network model under constraints of data scarcity. Results are discussed in com-
parison with an existing unsupervised model devised for idiom type detection and a similar
supervised classifier previously trained to detect metaphorical bigrams. The experiments suggest
that the distributional context of a given phrase is sufficient to carry out idiom type identifi-
cation to a satisfactory degree, with an increase in performance when input phrases are filtered
according to human-elicited idiomaticity ratings collected for the same expressions. Crucially,
employing concatenations of single word vectors rather than whole-phrase vectors as training
input results in the worst performance for our models, differently from what was previously
registered in metaphor detection tasks.

1. Introduction

Generally speaking, figurativeness has to do with pointing at a contextual interpretation


for a given expression that goes beyond its mere literal meaning (Frege, 1892; Gibbs
et al., 1997; Cacciari and Papagno, 2012). Let’s imagine a commentator that, referring to
an athlete, says She’s always delivered clean performances but this one really took the cake! In
this sentence, clean performances is an example of metaphorical expression that, according to
the model proposed by Lakoff and Johnson (2008), reflects a rather transparent mapping
between an abstract concept in a target domain (e.g., the flawlessness of a performance)
and a concrete example taken from a source domain (e.g., the cleanliness of a surface).
On the other hand, take the cake is an idiom, i.e. a lexicosyntactically rigid multiword
unit (Sag et al., 2002) that is entirely non-compositional, since its meaning of ‘being
outstanding’ is not accessible by simply composing the meanings of take and cake and
must therefore be learnt by heart by speakers (Frege, 1892; Cacciari, 2014).
Important differences have been stressed between metaphors and idioms in theoret-
ical (Gibbs, 1993; Torre, 2014), neurocognitive (Bohrn et al., 2012) and corpus linguistic
(Liu, 2003) studies. First of all, metaphors represent a productive phenomenon: studies

∗ Department of Philosophy, Linguistics, Theory of Science - Dicksonsgatan 4, 41256, Göteborg, Sweden.


E-mail: [email protected]
∗∗ Scuola Normale Superiore - Piazza dei Cavalieri 7, I-56126 Pisa, Italy. E-mail: [email protected]
† CoLing Lab, Department of Philology, Literature and Linguistics - Via S. Maria 36, I-56126 Pisa, Italy.
E-mail: [email protected]

© 2015 Associazione Italiana di Linguistica Computazionale



on metaphor production strategies indeed show a large ability of language users to gen-
eralize and create new metaphors on the fly from existing ones, allowing researchers to
hypothesize recurrent semantic mechanisms underlying a large number of productive
metaphors (McGlone, 1996; Lakoff and Johnson, 2008). For example, starting from the
clean performance metaphor above, we could also say the delivered performance was
neat, spick-and-span and crystal-clear by sticking to the same conceptual domain of clean-
liness. On the other hand, although most idioms originate as metaphors (Cruse, 1986),
they have undergone a crystallization process in diachrony, whereby they now appear
as conventionalized and (mostly) fixed combinations that form a finite repository in a
given language (Nunberg et al., 1994). From a formal standpoint, though some idioms
allow for restricted lexical variability (e.g., the concept of getting crazy can be conveyed
both by to go nuts and to go bananas), this kind of variation is not as free and systematic
as with metaphors and literal language (e.g., transforming the take the cake idiom above
into take the candy would hinder a possible idiomatic reading) (Fraser, 1970; Geeraert
et al., 2017). From the semantic point of view, it is interesting to observe how speakers
can correctly use the most semantically opaque idioms in discourse without necessarily
being aware of their actual metaphorical origin or anyway having contrasting intuitions
about it. For example, Gibbs (1994) reports that many English speakers explain the
idiom kick the bucket ‘to die’ as someone kicking a bucket to hang themselves, while it
actually originates from a corruption of the French word buquet indicating the wooden
framework that slaughtered hogs kicked in their death struggles. Secondly, metaphor-
ical expressions can receive varying interpretations according to the context at hand:
saying that John is a shark could mean that he’s ruthless on his job, that he’s aggressive
or that he attacks people suddenly (Cacciari, 2014). Contrariwise, idiomatic expressions
always keep the same meaning: saying that John kicked the bucket can only be used to
state that he passed away. Finally, idioms and metaphors differ in the mechanisms they
recruit in language processing: while metaphors seem to bring into play categorization
(Glucksberg et al., 1997) or analogical (Gentner, 1983) processes between the vehicle and
the topic (e.g., shark and John respectively in the sentence above), idioms by and large
call for lexical access mechanisms (Cacciari, 2014). Nevertheless, it is crucial to under-
line that idiomaticity itself is a multidimensional and gradient phenomenon (Nunberg
et al., 1994; Wulff, 2008) with different idioms showing varying degrees of semantic
transparency, formal versatility, proverbiality and affective valence. All this variance
within the class of idioms themselves has been demonstrated to affect the processing of
such expressions in different ways (Cacciari, 2014; Titone and Libben, 2014).
The aim of this work is to focus on the fuzzy boundary between idiomatic and
metaphorical expressions from a computational viewpoint, by applying a supervised
method previously designed to discriminate metaphorical vs. literal usages of input
constructions to the task of distinguishing idiomatic from compositional expressions.
Our starting point is the work of Bizzoni et al. (2017), who managed to classify adjective-
noun pairs where the same adjectives were used both in a metaphorical and a literal
sense (e.g., clean performance vs. clean floor) by means of a neural classifier trained
on a composition of the words’ embeddings (Mikolov et al., 2013a). As the authors
found out, the neural network succeeded in the task because it was able to detect the
abstract/concrete semantic shift undergone by the nouns when used with the same
adjective in figurative and literal compositions respectively. In our attempt, we will use a
relatively similar approach to classify idiomatic expressions by training a three-layered
neural network on a set of Italian idioms (e.g. gettare la spugna ‘to throw in the towel’, lit.
‘to throw the sponge’) and non-idioms (e.g. vedere una partita ‘to watch a match’). The
performance of the network will be compared when trained with constructions belong-

ing to different syntactic patterns, namely Adjective-Noun and Verb-Noun expressions


(AN and VN henceforth). Noteworthily, the abstract/concrete polarity the network was
able to learn in Bizzoni et al. (2017) will not be available this time: while the nouns in the
dataset of Bizzoni et al. (2017) were used in their literal sense, idioms are entirely non-
compositional, so none of their constituents is employed literally inside the expressions,
independently of their concreteness (e.g., spugna ‘sponge’ in gettare la spugna vs numeri
‘numbers’ in dare i numeri ‘to lose it’, lit. ‘to give the numbers’). What we want to find
out is whether the sole information captured by the distributional vector of a given
expression is sufficient for the network to learn its potential idiomaticity. The idiom
classification scores of our models will be compared with those obtained by Senaldi
et al. (2016) and Senaldi et al. (2017), who propose a distributional semantic algorithm
for idiom type detection. Our study employs their small datasets. Therefore, the training
sets we will operate on will be very scarce. Traditional ways to deal with data scarcity in
computational linguistics resort to a wide number of different features to annotate the
training set (see for example Tanguy et al. (2012)) or rely on artificial bootstrapping of
the training set (He and Liu, 2017). In our case, we test the performance of our classifier
on scarce data without bootstrapping the dataset and relying only on the information
provided by the distributional semantic space, showing that the distribution of an
expression in large corpora can provide enough information to learn idiomaticity from
few examples with a satisfactory degree of accuracy.
This paper is structured as follows: after reviewing in Section 2 the existing lit-
erature on idiom and metaphor processing, in Section 3 we will briefly outline the
experimental design and in Section 4 we will provide details about the dataset we
used and the human ratings we collected to validate our algorithms; in Section 5 we
will go through the structure and functioning of our classifier and in Section 7 we will
evaluate the performance of our models. Section 8 presents a qualitative error analysis,
then followed by a discussion of the results (Section 9).

2. Related Work

Previous computational research has exploited different methods to perform idiom type
detection (i.e., automatically telling apart potential idioms like to get the sack from only
literal combinations like to kill a man). For example, Lin (1999) and Fazly et al. (2009)
label a given word combination as idiomatic if the Pointwise Mutual Information (PMI)
(Church and Hanks, 1991) between its constituents is higher than the PMIs between the
components of a set of lexical variants of this combination obtained by replacing the
component words of the original expressions with semantically related words. Other
studies have resorted to Distributional Semantics (Lenci, 2008, 2018; Turney and Pantel,
2010) by measuring the cosine between the vector of a given phrase and the single
vectors of its components (Fazly and Stevenson, 2008) or between the phrase vector
and the sum or product vector of its components (Mitchell and Lapata, 2010; Krčmář
et al., 2013). Senaldi et al. (2016) and Senaldi et al. (2017) combine insights from both
these approaches. They start from two lists of 90 VN and 26 AN constructions, the
former composed of 45 idioms (e.g., gettare la spugna) and 45 non-idioms (e.g., vedere una
partita), the latter comprising 13 idioms (e.g., filo rosso ‘common thread’, lit. ‘red thread’)
and 13 non-idioms (e.g., lungo periodo ‘long period’). For each of these constructions,
a series of lexical variants are generated distributionally or via MultiWordNet (Pianta
et al., 2002) by replacing the subparts of the constructions with semantically related
words (e.g. from filo rosso, variants like filo nero ‘black thread’, cavo rosso ‘red cable’ and
cavo nero ‘black cable’ are generated). What comes to the fore is that the vectors of the

idiomatic expressions are less similar to the vectors of their lexical variants with respect
to the similarity between the vector of a literal construction and the vectors of its lexical
alternatives. To provide an example, the cosine similarity between the vector of an idiom
like filo rosso and the vectors of its lexical variants like filo nero and cavo rosso was found
to be smaller than the cosine similarity between the vector of a literal phrase like lungo
periodo and the vectors of its variants like interminabile tempo ‘endless time’ and breve
periodo ‘short period’.
Moving to the methodology exploited in the current study, to the best of our
knowledge, neural networks have been previously adopted to perform MWE detection
in general (Legrand and Collobert, 2016; Klyueva et al., 2017), but not idiom identifica-
tion specifically. As mentioned in the Introduction, in Bizzoni et al. (2017), pre-trained
noun and adjective vector embeddings are fed to a single-layered neural network to
disambiguate metaphorical and literal AN combinations. Several combination algo-
rithms are experimented with to concatenate adjective and noun embeddings. All in
all, the method is shown to outperform the state of the art, presumably leveraging
the abstractness degree of the noun as a clue to figurativeness and basically treating
the noun as the “context” to discriminate the metaphoricity of the adjective (cf. clean
performance vs clean floor, where performance is more abstract than floor and therefore the
mentioned cleanliness is to be intended metaphorically).
Besides Bizzoni et al. (2017), using neural networks for metaphor detection with
pretrained word embeddings initialization has been tried in a small number of recent
works, proving that this is a valuable strategy to predict metaphoricity in datasets. Rei
et al. (2017) present an ad-hoc neural design able to compose and detect metaphoric
bigrams in two different datasets. Do Dinh and Gurevych (2016) apply a series of
perceptrons to the VU Amsterdam Metaphor Corpus (Steen et al., 2014) combined
with word embeddings and part-of-speech tagging. Finally, a similar approach - a
combination of fully connected networks and pre-trained word embeddings - has also
been used as a pre-processing step to metaphor detection, in order to learn word and
sense abstractness scores to be used as features in a metaphor identification pipeline
(Köper and Schulte im Walde, 2017).

3. Method

In this work we carried out a supervised idiom type identification task by resorting to a
three-layered neural network classifier. After selecting our dataset of VN and AN target
expressions (Section 4.1), for which gold standard idiomaticity ratings had already been
collected (Section 4.2), we built count vector representations for them (Section 4.3) from
the itWaC corpus (Baroni et al., 2009) and fed them to our classifier (Section 5) with
different training splits (Section 6). The network returned a binary output, whereby
idioms were taken as our positive examples and non-idioms as our negative ones. Dif-
ferently from Bizzoni et al. (2017), for each idiom or non-idiom we initially built a count-
based vector (Turney and Pantel, 2010) of the expression as a whole, taken as a single
token. We then compared this approach with a model trained on the concatenation of
the individual words of an expression, but the latter turned out to be less effective for
idioms than for metaphors. Each model was finally evaluated in terms of classification
accuracy, ranking performance and correlation between its continuous scores and the
human-elicited idiomaticity judgments (Section 7).
Since we mostly worked with vectors that took our target expressions as unana-
lyzed wholes, as if they were single tokens, we were not concerned with the fact that
some verbs were shared by more than one idiom (e.g., lasciare il campo ‘to leave the field’

and lasciare il segno ‘to leave one’s mark’) or non-idiom (e.g., andare a casa ‘to go home’
and andare all’estero ‘to go abroad’) at once, given that our network could not access this
information.

4. Dataset

4.1 Target expressions selection

The two datasets we employed in the current study come from Senaldi et al. (2016) and
Senaldi et al. (2017). The first one is composed of 45 idiomatic Italian V-NP and V-PP
constructions (e.g., tagliare la corda ‘to flee’ lit. ‘to cut the rope’) that were selected from
an Italian idiom dictionary (Quartu, 1993) and extracted from the itWaC corpus (Baroni
et al. 2009, 1,909M tokens ca.) and whose frequency spanned from 364 (ingannare il tempo
‘to while away the time’) to 8294 (andare in giro ‘to get about’), plus other 45 Italian non-
idiomatic V-NP and V-PP constructions of comparable frequencies (e.g., leggere un libro
‘to read a book’). The latter dataset comprises 13 idiomatic and 13 non-idiomatic AN
constructions (e.g., punto debole ‘weak point’ and nuova legge ‘new law’) that were still
extracted from itWaC and whose frequency varied from 21 (alte sfere ‘high places’, lit.
‘high spheres’) to 194 (punto debole).

4.2 Gold standard idiomaticity judgments

Senaldi et al. (2016) and Senaldi et al. (2017) collected gold standard idiomaticity judg-
ments for the 26 AN and 90 VN target constructions in their datasets. Nine linguistics
students were presented with a list of the 26 AN constructions and were asked to
evaluate how idiomatic each expression was from 1 to 7, with 1 standing for ‘totally
compositional’ and 7 standing for ‘totally idiomatic’. Inter-coder agreement, measured
with Krippendorff’s α (Krippendorff, 2012), was equal to 0.76. The same procedure was
repeated for the 90 VN constructions, but in this case the initial list was split into 3
sublists of 30 expressions, each one to be rated by 3 subjects. Krippendorff’s α was
0.83 for the first sublist and 0.75 for the other two. These inter-coder agreement scores
were taken as a confirmation of reliability for the collected ratings (Artstein and Poesio,
2008). As will become clear in Section 6, these judgments served the twofold purpose
of evaluating the classification performance of our neural network and filtering the
expressions to use as training input for our models.
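Agreement scores of this kind can be reproduced, for instance, with the third-party krippendorff Python package; the toy reliability matrix below and the choice of the ordinal level of measurement are illustrative assumptions, not the authors' actual data or script.

```python
import numpy as np
import krippendorff  # third-party package: pip install krippendorff

# Toy reliability matrix: one row per rater, one column per expression,
# np.nan for missing ratings; values are 1-7 idiomaticity judgments.
ratings = np.array([
    [7, 6, 2, 1, np.nan],
    [7, 7, 3, 1, 2],
    [6, 6, 2, 2, 1],
])
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```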

4.3 Building target vectors

Count-based Distributional Semantic Models (DSMs) (Turney and Pantel, 2010) allow
for representing words and expressions as high-dimensionality vectors, where the vec-
tor dimensions register the co-occurrence of the target words or expressions with some
contextual features, e.g. the content words that linearly precede and follow the target
element within a fixed contextual window. We trained two DSMs on itWaC, where
our target AN and VN idioms and non-idioms were represented as target vectors and
co-occurrence statistics counted how many times each target construction occurred in
the same sentence with each of the 30,000 top content words in the corpus. Differently
from Bizzoni et al. (2017), we did not opt for prediction-based vector representations
(Mikolov et al., 2013a). Although some studies have brought out that context-predicting
models fare better than count-based ones on a variety of semantic tasks (Baroni et al.,
2014), including compositionality modeling (Rimell et al., 2016), others (Blacoe and

Lapata, 2012; Cordeiro et al., 2016) have shown them to perform comparably. In phrase
similarity and paraphrase tasks, Blacoe and Lapata (2012) find count vectors to score
better than or comparably to predict vectors built following Collobert and Weston
(2008)’s neural language model. Cordeiro et al. (2016) show PPMI-weighted count-
based models to perform comparably to word2vec (Mikolov et al., 2013b) in predicting
nominal compound compositionality. Moreover, Levy et al. (2015) highlight that much
of the superiority in performance exhibited by word embeddings is actually due to
hyperparameter optimizations, which, if applied to traditional models as well, can bring
to equivalent outcomes. Therefore, we felt confident in resorting to count-based vectors
as an equally reliable representation for the task at hand.

5. The neural network classifier

We built a neural network composed of three “dense” or fully connected hidden layers.1
The input layer has the same dimensionality of the original vectors and the output
layer has dimensionality 1. The other two hidden layers have dimensionality 12 and
8 respectively. Our network takes in input a single vector at a time, which can be a
word embedding, a count-based distributional vector or a composition of several word
vectors. For the core part of our experiment we used as input single distributional
vectors of two-word expressions. As we discussed in the previous section, these vec-
tors have 30,000 dimensions each and represent the distributional behavior of a full
expression rather than that of the individual words composing such expression. Given
this distributional matrix, we defined idioms as positive examples and non-idioms as
negative examples of our training set. Due to the magnitude of our input, the most
important reduction of data dimensionality is carried out by the first hidden layer of
our model. The last layer applies a sigmoid activation function on the output in order
to produce a binary judgment. While binary scores are necessary to compute the model
classification accuracy and will be evaluated in terms of F1, our model’s continuous
scores can be retrieved and will be used to perform an ordering task on the test set, that
we will evaluate in terms of Interpolated Average Precision (IAP) 2 and Spearman’s ρ
with the human-elicited idiomaticity judgments. IAP and ρ, therefore, will be useful to
investigate how good our model is in ranking idioms before non-idioms.
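A sketch of such a network in Keras (the library mentioned in footnote 1) could look like the following; the 30,000-dimensional input, the 12- and 8-unit hidden layers and the sigmoid output follow the description above, while the size of the first hidden layer and the ReLU activations are not reported in the text and are assumptions here.

```python
from tensorflow import keras

INPUT_DIM = 30000   # dimensionality of the count-based phrase vectors
FIRST_HIDDEN = 128  # size of the first hidden layer: not reported, assumed here

model = keras.Sequential([
    # The first dense layer performs the main dimensionality reduction.
    keras.layers.Dense(FIRST_HIDDEN, activation="relu", input_shape=(INPUT_DIM,)),
    keras.layers.Dense(12, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # idiom (1) vs non-idiom (0) score
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```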

6. Choosing the training set

The scarcity of our training sets constitutes a challenge for neural models, typically
designed to deal with massive amounts of data. The typical effect of such scarcity is
a fluctuation in performance: training our model on two different sections of the same
dataset is likely to result in quite different F-scores.
Unless otherwise specified, the IAP, Spearman’s ρ and F1 scores reported in Table
1 are averaged on 5 runs of each model on the same datasets: at each run, the training
split is randomly selected. We found that some samples of the training set seemingly
make it harder for the model to learn idiom detection. When such runs are included in
the mean, the performance is drastically lowered.
In our attempt to understand whether we could find a rationale behind this phe-
nomenon or it was instead completely unpredictable, in some versions of our models

1 We used Keras, a library running on TensorFlow (Abadi et al., 2016).


2 Following Fazly et al. (2009), IAP was computed at recall levels of 20%, 50% and 80%.

we have tried to filter our training sets according to the idiomaticity judgments we
elicited from speakers (Section 4.2) to assess which composition of our training sets
made our algorithm more effective. In the first approach, which we will label as High-
to-Low (HtL henceforth), the network was trained on the idioms receiving the highest
idiomaticity ratings (and symmetrically on the compositional expressions having the
lowest idiomaticity ratings) and was therefore tested on the intermediate cases. In
the second approach, which we called Low-to-High (LtH), the model was trained on
more borderline exemplars, i.e. the idioms having the lowest idiomaticity ratings and
the compositional expressions having the highest ones, and then tested on the most
polarized cases of idioms and non-idioms.
For example, in the HtL setting, the AN bigrams we selected for the training set
included idioms like testa calda ‘hothead’ and faccia tosta ‘brazen person’ (lit. ‘tough
face’), that reported an average idiomaticity rating of 6.8 and 6.6 out of 7 respectively,
and non-idioms like famoso scrittore ‘famous writer’ and nuovo governo ‘new govern-
ment’ that elicited an average idiomaticity rating of 1.2 and 1.1 out of 7. In the case
of VN bigrams, we selected idioms like andare a genio ‘to sit well’ (lit. ‘to go to genius’)
(mean idiomaticity rating of 7) and non-idioms like vendere un libro ‘to sell a book’ (mean
idiomaticity rating of 1). The neural network was thus trained only on elements that our
annotators had judged as clearly positive and clearly negative examples.
To provide examples on the LtH training sets, for the VN data, we selected idioms
like lasciare il campo (mean rating = 3.6) and cambiare colore ‘to change color (in face)’
(mean rating = 3.6) against non-idiomatic expressions like prendere un caffè ‘to grab a
coffee’ (3.3) and lasciare un incarico ‘to leave a job’ (2.3). For the AN data, we selected
idioms like prima serata ‘prime time’ (lit. ‘first evening’) (mean rating = 4 out of 7) and
compositional expressions like proposta concreta ‘concrete proposal’ (2.7). The neural
network was in this case trained only on elements that our annotators had judged as
borderline cases.
The results of these different filtering procedures can be found in Table 1.
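The two filtering strategies can be sketched as a simple selection function; the expressions and ratings below are only a toy subset used to illustrate the mechanism.

```python
def select_training_items(expressions, ratings, n_pos, n_neg, mode="HtL"):
    """Pick idioms/non-idioms for training based on mean idiomaticity ratings.
    HtL trains on the clearest cases, LtH on the most borderline ones."""
    idioms = sorted((e for e in expressions if e["idiom"]),
                    key=lambda e: ratings[e["form"]], reverse=(mode == "HtL"))
    non_idioms = sorted((e for e in expressions if not e["idiom"]),
                        key=lambda e: ratings[e["form"]], reverse=(mode == "LtH"))
    return idioms[:n_pos] + non_idioms[:n_neg]

# Toy data: HtL picks 'testa calda' (6.8) and 'nuovo governo' (1.1) first.
expressions = [
    {"form": "testa calda", "idiom": True}, {"form": "prima serata", "idiom": True},
    {"form": "nuovo governo", "idiom": False}, {"form": "proposta concreta", "idiom": False},
]
ratings = {"testa calda": 6.8, "prima serata": 4.0, "nuovo governo": 1.1, "proposta concreta": 2.7}
print([e["form"] for e in select_training_items(expressions, ratings, 1, 1, mode="HtL")])
```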

7. Evaluation

Once the training sets were established, a variety of transformations were tried on our
VN and AN distributional vectors before giving them as input to our network. Some
models were trained on the raw 30,000 dimensional distributional vectors of VN and
AN expressions; other models used the concatenation of the vectors of the individual
components of the expressions; finally, other models employed PPMI (Positive Point-
wise Mutual Information) (Church and Hanks, 1991) and SVD (Singular Value De-
composition) transformed (Deerwester et al., 1990) vectors of 150 and 300 dimensions.
Details of both classification and ordering tasks are shown in Table 1. Qualitative details
about the results will be given in Section 8.
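The two ranking metrics can be computed as in the following sketch; the IAP implementation follows footnote 2 (recall levels of 20%, 50% and 80%), while the scores, labels and human ratings are toy placeholders.

```python
import numpy as np
from scipy.stats import spearmanr

def interpolated_average_precision(scores, labels, recall_levels=(0.2, 0.5, 0.8)):
    """IAP: rank items by model score (idioms = positive class) and average the
    best precision achievable at or beyond each chosen recall level."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    return float(np.mean([precision[recall >= r].max() for r in recall_levels]))

# Toy example: continuous model scores, binary idiom labels and human 1-7 ratings.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
human = [6.8, 6.0, 3.6, 3.3, 1.2, 1.1]
print("IAP:", interpolated_average_precision(scores, labels))
print("Spearman rho:", spearmanr(scores, human).correlation)
```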

7.1 Verb-Noun

We ran our model on the VN dataset, composed of 90 elements, namely 45 idioms and 45
non-idiomatic expressions. This is the larger of the two datasets. We trained our model
on 30 and 40 elements for 20 epochs and tested it on the remaining 60 and 50 elements

3 When we report the number of training and test items in Table 1 as 15+15, for instance, we mean 15
idioms + 15 non-idioms. The same applies to all the other listed models.

Vectors PPMI SVD Training Test IAP ρ F1


VN Yes No 15+15 30+30 .72 .48 .67
VN Yes No 20+20 25+25 .73 .52 .77
VN Yes 150 15+15 30+30 .63 .35 .48
VN Yes 150 20+20 25+25 .61 .33 .63
VN Yes 300 15+15 30+30 .67 .33 .64
VN Yes 300 20+20 25+25 .65 .3 .57
AN No No 8+8 6+4 .72 .19 .40
AN Yes No 8+8 6+4 .70 .06 .60
AN Yes 150 8+8 6+4 .65 .11 .32
AN Yes 300 8+8 6+4 .88 .51 .10
VN (HtL) Yes No 15+15 30+30 .71 .62 .77
VN (HtL) Yes No 20+20 25+25 .79 .65 .84
VN (LtH) Yes No 15+15 30+30 .71 .58 .80
VN (LtH) Yes No 20+20 25+25 .77 .68 .85
AN (HtL) No No 8+8 6+4 1 .8 .71
AN (HtL) Yes No 8+8 6+4 1 .71 .78
AN (LtH) No No 8+8 6+4 1 .93 .89
AN (LtH) Yes No 8+8 6+4 1 .84 .88
VN+AN No No 23+23 36+34 (joint) .80 .64 .46
VN+AN (HtL) No No 23+23 36+34 (joint) .63 .41 .65
VN+AN (LtH) No No 23+23 36+34 (joint) .68 .51 .66
Conc. VN No No 20+20 24+24 .59 .34 .40
Conc. VN (HtL) No No 20+20 24 +24 .61 .07 .46
Conc. VN (LtH) No No 20+20 24+24 .57 .31 .59

Table 1
Interpolated Average Precision (IAP), Spearman’s ρ correlation with the human judgments and
F-measure (F1) for Verb-Noun training (VN), Adjective-Noun training (AN), joint (VN+AN)
training and training through vector concatenation. High-to-Low (HtL) models were trained on
clear-cut cases, while Low-to-High (LtH) models were trained on borderline cases. As for the
other models, the average performance over 5 runs with randomly selected training sets is
reported. Training and test set are expressed as the sum of positive and negative examples.

respectively. The models that best succeeded at classifying our phrases into idioms and
non-idioms were trained with 40 PPMI-transformed vectors, reaching an average F1
score of .77 on the randomized iterations and an F1 score of .85, with a Spearman’s ρ
correlation of .68, when the training set was composed of borderline cases and the model
was then tested on more clear-cut exemplars (LtH). As for the rest of the F1 scores,
what comes to light from our results is that increasing the number of training vectors
generally leads to better results, except for models fed with SVD-transformed vectors
of 300 dimensions, which seem to be insensitive to the size of our training data. Quite
interestingly, SVD-reduced vectors appear to perform worse in general than raw ones
and just PPMI-transformed ones. Due to space limitations, raw-frequency VN models
are not reported in Table 1 since they were comparable to just PPMI-weighted ones.
This same pattern is encountered when evaluating the ability of our algorithm to
rank idioms before non-idioms (IAP). The models with the highest scores employ 40
PPMI training vectors and reach .73 on the randomized training, .79 on the HtL training

and .77 on the LtH ones, while SVD training vectors generally lead to poorer ranking
performances. Despite these IAP scores being encouraging, they are anyway lower than
those obtained by Senaldi et al. (2016), who reach a maximum IAP of 0.91. This drop in
performance could point to the fact that resorting to distributional information only
to carry out idiom identification overlooks some aspects of the behavior of idiomatic
constructions (e.g., formal rigidity) that is to be taken into account to arrive at a more
satisfactory classification. Concerning the correlation between the continuous score of
the neural net and the human idiomaticity ratings presented in Section 4.2, the best
model also employed 40 PPMI vectors of borderline expressions (.68), followed by
the model using 40 PPMI vectors of clear-cut cases (.65). These correlation values are
quite comparable to the maximum of -0.67 obtained in Senaldi et al. (2016)4 in High-
to-Low and Low-to-High ordered models, while they are lower in randomized models,
especially SVD-reduced ones.
All in all, both HtL and LtH experimental settings result in IAP, correlation and F1
scores that are higher than what we get from averaging over randomly selected training
sets. More precisely, the strategy of training only on borderline examples (LtH) appears
to be the most effective. This can intuitively make sense: once a network has learned to
discriminate between borderline cases, detecting clear-cut elements should be relatively
easy. The opposite strategy also seems to bring some benefits, possibly because training
on clear negative and positive examples provides the network with a data set that is
easier to generalize from. In any case, it seems clear that selecting our training set with the
help of human ratings allows us to significantly increase the performance of our models.
We can see this as further evidence that human continuous scores on idiomaticity - and not
only binary judgments - are mirrored in the distributional patterns of these expressions.
As for the influence of the training set size on IAP and ρ, all in all it seems that the best
results are reached with 40 training vectors, both on the randomized training sets and
on the ordered training sets.
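As an illustration, a minimal sketch of how such a rating-based selection of the training set could be implemented is given below. This is not the procedure actually used in the experiments: the rating scale, its midpoint, the toy ratings and all names are assumptions made for the example, and the sketch ignores the positive/negative balancing applied in the real setup.

```python
# Hypothetical sketch of High-to-Low / Low-to-High training-set selection
# based on mean human idiomaticity ratings.

def ordered_split(items, n_train, borderline_first=True, midpoint=4.0):
    """items: list of (expression, label, mean_rating), label 1 = idiom, 0 = non-idiom.
    Orders items by how clear-cut they are (distance from the assumed scale midpoint)
    and returns (training set, test set)."""
    by_clarity = sorted(items, key=lambda it: abs(it[2] - midpoint),
                        reverse=not borderline_first)
    return by_clarity[:n_train], by_clarity[n_train:]

# Toy data: expressions taken from this study, ratings invented for illustration.
toy = [("tagliare la corda", 1, 6.5), ("passare alla storia", 1, 5.9),
       ("abbassare la guardia", 1, 4.3), ("sentire una voce", 0, 3.8),
       ("ascoltare una canzone", 0, 1.2), ("andare in vacanza", 0, 1.5)]

# LtH setting: train on borderline cases, test on the clear-cut ones.
train, test = ordered_split(toy, n_train=2, borderline_first=True)
```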
The general trend we can abstract from these results is that our neural network does
a good job of telling idioms and non-idioms apart by relying only on raw-frequency
and PPMI-transformed distributional information. Performing dimensionality reduc-
tion apparently deprives the model of useful information, noticeably lowering its
overall performance.
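For reference, the two vector transformations compared here - PPMI weighting of a raw target-by-context co-occurrence matrix and SVD reduction to 300 dimensions - can be sketched as follows. This is a generic illustration with numpy and scikit-learn, not the exact pipeline used in the experiments; the toy matrix is much smaller than the real 30,000-dimensional context space.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def ppmi(counts):
    """Positive Pointwise Mutual Information weighting of a raw
    target-by-context co-occurrence matrix."""
    total = counts.sum()
    p_rows = counts.sum(axis=1, keepdims=True) / total
    p_cols = counts.sum(axis=0, keepdims=True) / total
    joint = counts / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint / (p_rows * p_cols))
    pmi[~np.isfinite(pmi)] = 0.0          # zero out log(0) and 0/0 cells
    return np.maximum(pmi, 0.0)           # keep only positive associations

# Toy matrix: one context vector per target expression (real vectors had 30,000 dims).
counts = np.random.poisson(0.05, size=(400, 5000)).astype(float)
weighted = ppmi(counts)                                   # PPMI-transformed vectors
reduced = TruncatedSVD(n_components=300).fit_transform(weighted)  # SVD-reduced vectors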

7.2 Adjective-Noun

Our model was also run on the AN dataset, composed of 26 elements (13 idioms and
13 non-idiomatic expressions). We empirically found that our network was able to
perform some generalization on the data when the training set contained at least 14
elements, evenly balanced between positive and negative examples. We trained our
model on 16 elements for 30 epochs and tested on the remaining 10 elements. As
happened with VN vectors, performing SVD worsened the performance of the model.
While the exact F1 value can fluctuate when a model is trained on very small
sets, we always registered accuracies higher than 70% for the ordered training sets. In
this case even more than in the Verb-Noun frame, the difference between randomizing
the training set and selecting it using human idiomaticity ratings appears to be very
evident, possibly due to the extremely small size of this specific dataset, which makes
the qualitative selection of the training data particularly important.

4 Please keep in mind that the correlation values in Senaldi et al. (2016) and Senaldi et al. (2017) are
negative, since the less similar a target vector is to the vectors of its variants, the more idiomatic the target.

Once again
the highest Spearman's ρ correlation (.93) was reached when using a Low-to-High training
set composed of borderline cases, although it is important to keep in mind that such scores
are computed on a very restricted test set. The same reasoning applies to IAP scores,
which all reach the top value, though we must consider the very small test set. Senaldi
et al. (2017) instead reached a maximum IAP of .85 and a maximum ρ of -.68 in AN
idiom identification. When the training size was under the critical threshold, accuracy
dropped significantly. With training sets of 10 or 12 elements, our model naturally
overfitted, quickly reaching 100% accuracy on the training set and failing to correctly
classify unseen expressions. In these cases a partial learning was still visible in the
ordering task, where most idioms, even if labeled incorrectly, received higher scores
than non-idioms.

7.3 Joint training

Our last experiment consisted in training our model on a mixed dataset of both VN
and AN expressions, to check to what extent it would be able to recognize the same
underlying semantic phenomenon across different syntactic constructions. In these
models, as well as in those described in Section 7.4, PPMI and SVD transformations were
not tested again, since they had already been shown to yield generally comparable or
even worse outcomes when applied to the VN and AN datasets individually. Concerning
the structure of our training and test sets, two approaches were experimented with. We
first tried to train our model on one pair type, e.g. the AN pairs, and then test on
the other, but we found this required more epochs overall (more than 100) to stabilize
and resulted in a poorer performance. When training our model on a mixed dataset
containing the elements of both pair types, our model employed 20 epochs to reach an
F-measure of 66% on the mixed training set when the set was ordered Low-to-High
(i.e., it was composed of borderline cases only) and a comparable F-score of 65% when
using clear-cut training input (HtL). However, we also noticed that VN expressions were
learned better than AN expressions. It is also worth noting that, although the F-
scores of the LtH and HtL models were higher, the IAP and Spearman's ρ were lower
than in the unordered input model. In other words, while ordering the input led to a
better binary classification, the continuous scores returned a less precise ranking.
Our model was able to generalize over the two datasets, but this involved a loss in
accuracy with respect to the only-VN and only-AN ordered training sets. It can be seen
in Table 1 that a loss in accuracy is also evident for joint training on the randomized
frame, although in this case the model seems hardly able to generalize at all.

7.4 Vector concatenation

In addition to using the vector of an expression as a whole, we tried to feed our
model with the concatenation of the vectors of the single words in an expression, as
in Bizzoni et al. (2017). For example, instead of using the 30,000 dimensional vector
of the expression tagliare la corda, we used the 60,000 dimensional vector resulting
from the concatenation of tagliare and corda. This approach mimics the one adopted for
metaphoric pairs and concludes our set of experiments, providing us with comparable
results obtained from a compositionality-based approach to the same problem. We ran
this experiment only on the VN dataset, as it is the largest and the one that yielded the
best results in the previous settings. We used 40 elements in training and 48 in testing
and trained our model for 30 epochs overall. Predictably enough, vector composition
resulted in the worst performance, unlike what happened with metaphors
(Bizzoni et al., 2017).
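For clarity, the difference between the two input-construction strategies can be sketched as follows. The snippet is purely illustrative: `vectors` is assumed to be a dictionary mapping both whole expressions and single words to their distributional vectors.

```python
import numpy as np

def expression_input(vectors, expression):
    # Strategy used in the previous sections: the 30,000-dimensional
    # distributional vector of the whole expression, e.g. "tagliare la corda".
    return vectors[expression]

def concatenation_input(vectors, verb, noun):
    # Strategy tested here, as in Bizzoni et al. (2017): concatenating the
    # vectors of the content words, e.g. tagliare + corda, giving a
    # 60,000-dimensional input.
    return np.concatenate([vectors[verb], vectors[noun]])
```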
Although all correlations are low and not statistically significant, it is still worth
pointing out that not all the results are completely random: with an F1 of 59%
for the LtH training set and an IAP of .61 for the HtL set, the model seems able to learn
idiomaticity to a lower, but not null, degree; these findings would be in line with the
claim that the meaning of the subparts of several idioms, while less important than in
metaphors, is not completely obliterated (McGlone et al., 1994). Another hint in this
direction is the difference in performance between randomized and ordered training
that we can observe for concatenation: if human idiomaticity ratings were completely
independent of the composition of the individual subparts of our idioms, such an effect
should not be present at all. In any case, similarly to what happened with the joint models,
ordering the training input led to higher F-scores and comparable IAPs, but returned a
worse correlation with human judgments with respect to the models with a randomized
training input.

8. Error Analysis

As we mentioned in Section 1, idiomaticity is not a black-or-white phenomenon and
idioms are rather spread on a continuum of semantic transparency and formal rigidity,
which makes some exemplars harder to classify. In our models we can find some
“prototypical” cases of idioms which were always labeled correctly, like toccare il fondo
‘to hit rock bottom’, lasciare il campo and passare alla storia ‘to go down in history’ and
also some cases of unambiguously classified non-idioms, like andare in vacanza ‘to go
on holiday’, ascoltare una canzone ‘to listen to a song’ and prendere un caffè. On the other
hand, we have some ambiguous expressions like abbassare la guardia ‘to let down one’s
guard’ and sentire una voce ‘to hear a voice’, which, despite being compositional and
potentially literal, can be very often used figuratively, i.e. if someone were referring to
guardia as a metaphorical defense or to voce as a rumor. In such cases, it might be the
case that the evidence available in the chosen corpus privileged just one of the two
possible readings, leading to labeling issues. By the same token, the expression bussare
alla porta (di qualcuno) 'to go ask for (someone's) help' (lit. 'to knock at the door'), which
we initially labeled as idiomatic, can have a literal reading as well and that is why it
was often labeled as non-idiomatic. Finally, as happened in Senaldi et al. (2016), some
false positives like chiedere le dimissioni ‘to demand the resignation’ and entrare in crisi ‘to
get into a crisis’ are compositional expressions which nonetheless display collocational
behavior, since they represent very common and fixed expressions in the Italian lan-
guage. Interestingly, while Senaldi et al. (2016) could explain these false positives
by noting that the variant-based model likely took their lexical fixedness as a clue to their
idiomatic status, our neural net relies on distributional semantic information only. What
this suggests is that not only a semantic phenomenon like compositionality, but even a
shallower one like collocability, which does not always and straightforwardly go hand
in hand with non-compositionality, can be detected just by looking at contextual
distribution.
As mentioned in Section 4.1, our target idioms and non-idioms varied considerably
in frequency. We therefore conducted some correlation analyses to check for a possible
relationship between the scores returned by our network and the frequency of our items.
All in all, we can conclude that in most of our models frequency and the continuous
idiomaticity scores were negatively correlated, though such a correlation did not show
up systematically and was not always significant. In other words, the more frequent an
item, be it an idiom or a literal, the more the network tended to consider it as literal (i.e.,
it gave it a lower idiomaticity score). This tendency could be explained if we consider
that some of our most frequent idioms were actually quite ambiguous (e.g., aprire gli
occhi ‘to open one’s eyes’ occurred 6306 times in the corpus and bussare alla porta 3303
times) and most of their corpus occurrences could be literal uses.
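The frequency analyses mentioned here amount to a simple rank-correlation test of the following kind (a sketch with toy numbers: only the two frequencies are taken from the text, the network scores are invented for illustration):

```python
from scipy.stats import spearmanr

# Corpus frequencies of some target expressions and the continuous
# idiomaticity scores the network assigned to them (toy values).
frequencies = [6306, 3303, 412, 95, 1210]
network_scores = [0.31, 0.44, 0.78, 0.91, 0.55]

rho, p = spearmanr(frequencies, network_scores)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```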

9. Discussion and Conclusions

The experiments we have presented show that the distribution of idiomatic and com-
positional expressions in large corpora can suffice for a supervised classifier to learn
the difference between the two linguistic elements from small training sets and with a
good level of accuracy. Specifically, we have observed that human continuous ratings
of idiomaticity can be useful to select a better training set for our models, and that
training our models on cases deemed by our annotators as borderline allows them to
learn and perform better than if they were fed with randomized input. Training our
models only on clear-cut cases also increases performance. In general, we can see from
these phenomena that human continuous ratings of idiomaticity seem to be mirrored in
the distributional structure of our data.
Unlike with metaphors (Bizzoni et al., 2017), feeding the classifier with a composi-
tion of the individual words' vectors of such expressions yields rather poor performance and
can be used to detect only some idioms. This takes us back to the core difference that
while metaphors are more compositional and preserve a transparent source domain to
target domain mapping, idioms are by and large non-compositional. Since our classi-
fiers rely only on contextual features, their classification ability must stem from a
difference in distribution between idioms and non-idioms. A possible explanation is
that while the literal expressions we selected, like vedere un film or ascoltare un discorso,
tend to be used with animate subjects and thus to appear in more concrete contexts,
most of our idioms (e.g. cadere dal cielo or lasciare il segno) allow for varying degrees
of animacy or concreteness of the subject, and thus their context can easily get more
diverse. At the same time, the drop in performance we observe in the joint models seems
to indicate that the different parts of speech composing our elements entail a significant
contextual difference between the two groups, which introduces a considerable amount
of uncertainty in our model.
It is also possible that other contextual elements we did not consider have played
a role in the learning process of our models, like the ambiguity between idiomatic and
literal meaning that some potentially idiomatic strings possess (e.g. to leave the field)
and that would lead their contextual distribution to be more variegated with respect to
only-literal combinations. We intend to investigate this aspect further in future work.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis,
A., Dean, J., Devin, M., et al. (2016). Tensorflow: Large-scale machine learning on
heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
Artstein, R. and Poesio, M. (2008). Inter-coder agreement for computational linguistics.
Computational Linguistics, 34(4):555–596.
Baroni, M., Bernardini, S., Ferraresi, A., and Zanchetta, E. (2009). The WaCky wide
web: a collection of very large linguistically processed web-crawled corpora. Language
Resources and Evaluation, 43(3):209–226.
Baroni, M., Dinu, G., and Kruszewski, G. (2014). Don’t count, predict! a systematic com-
parison of context-counting vs. context-predicting semantic vectors. In Proceedings of
the 52nd Annual Meeting of the Association for Computational Linguistics, pages 238–247.
Bizzoni, Y., Chatzikyriakidis, S., and Ghanimifard, M. (2017). “deep” learning: Detecting
metaphoricity in adjective-noun pairs. In EMNLP 2017.
Blacoe, W. and Lapata, M. (2012). A comparison of vector-based representations for
semantic composition. In Proceedings of the 2012 joint conference on empirical methods in
natural language processing and computational natural language learning, pages 546–556.
Association for Computational Linguistics.
Bohrn, I. C., Altmann, U., and Jacobs, A. M. (2012). Looking at the brains behind figu-
rative language: a quantitative meta-analysis of neuroimaging studies on metaphor,
idiom, and irony processing. Neuropsychologia, 50(11):2669–2683.
Cacciari, C. (2014). Processing multiword idiomatic strings: Many words in one? The
Mental Lexicon, 9(2):267–293.
Cacciari, C. and Papagno, C. (2012). Neuropsychological and neurophysiological corre-
lates of idiom understanding: How many hemispheres are involved. The handbook of
the neuropsychology of language, pages 368–385.
Church, K. W. and Hanks, P. (1991). Word association norms, mutual information, and
lexicography. Computational Linguistics, 16(1):22–29.
Collobert, R. and Weston, J. (2008). A unified architecture for natural language pro-
cessing: Deep neural networks with multitask learning. In Proceedings of the 25th
international conference on Machine learning, pages 160–167. ACM.
Cordeiro, S., Ramisch, C., Idiart, M., and Villavicencio, A. (2016). Predicting the com-
positionality of nominal compounds: Giving word embeddings a hard time. In
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics,
volume 1, pages 1986–1997.
Cruse, D. A. (1986). Lexical semantics. Cambridge University Press.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990).
Indexing by latent semantic analysis. Journal of the American society for information
science, 41(6):391.
Do Dinh, E.-L. and Gurevych, I. (2016). Token-level metaphor detection using neural
networks. In Proceedings of the Fourth Workshop on Metaphor in NLP, pages 28–33.
Fazly, A., Cook, P., and Stevenson, S. (2009). Unsupervised type and token identification
of idiomatic expressions. Computational Linguistics, 1(35):61–103.
Fazly, A. and Stevenson, S. (2008). A distributional account of the semantics of multi-
word expressions. Italian Journal of Linguistics, 1(20):157–179.
Fraser, B. (1970). Idioms within a transformational grammar. Foundations of language,
pages 22–42.
Frege, G. (1892). Über sinn und bedeutung. Zeitschrift für Philosophie und philosophische
Kritik, 100:25–50.
Geeraert, K., Baayen, R. H., and Newman, J. (2017). Understanding idiomatic variation.
MWE 2017, page 80.
Gentner, D. (1983). Structure-mapping: A theoretical framework for analogy. Cognitive
science, 7(2):155–170.
Gibbs, R. W. (1993). Why idioms are not dead metaphors. Idioms: Processing, structure,
and interpretation, pages 57–77.
Gibbs, R. W. (1994). The poetics of mind: Figurative thought, language, and understanding.
Cambridge University Press.
Gibbs, R. W., Bogdanovich, J. M., Sykes, J. R., and Barr, D. J. (1997). Metaphor in idiom
comprehension. Journal of memory and language, 37(2):141–154.
Glucksberg, S., McGlone, M. S., and Manfredi, D. (1997). Property attribution in
metaphor comprehension. Journal of memory and language, 36(1):50–67.
He, X. and Liu, Y. (2017). Not enough data?: Joint inferring multiple diffusion networks
via network generation priors. In Proceedings of the Tenth ACM International Conference
on Web Search and Data Mining, pages 465–474. ACM.
Klyueva, N., Doucet, A., and Straka, M. (2017). Neural networks for multi-word
expression detection. MWE 2017, page 60.
Köper, M. and Schulte im Walde, S. (2017). Improving verb metaphor detection by
propagating abstractness to words, phrases and individual senses. In Proceedings of the
1st Workshop on Sense, Concept and Entity Representations and their Applications, pages
24–30.
Krippendorff, K. (2012). Content analysis: An introduction to its methodology. Sage.
Krčmář, L., Ježek, K., and Pecina, P. (2013). Determining Compositionality of Expres-
sions Using Various Word Space Models and Measures. In Proceedings of the Workshop
on Continuous Vector Space Models and their Compositionality, pages 64–73.
Lakoff, G. and Johnson, M. (2008). Metaphors we live by. University of Chicago press.
Legrand, J. and Collobert, R. (2016). Phrase representations for multiword expressions.
In Proceedings of the 12th Workshop on Multiword Expressions, number EPFL-CONF-
219842.
Lenci, A. (2008). Distributional semantics in linguistic and cognitive research. Italian
Journal of Linguistics, 20(1):1–31.
Lenci, A. (2018). Distributional Models of Word Meaning. Annual Review of Linguistics,
4:151–171.
Levy, O., Goldberg, Y., and Dagan, I. (2015). Improving distributional similarity with
lessons learned from word embeddings. Transactions of the Association for Computa-
tional Linguistics, 3:211–225.
Lin, D. (1999). Automatic identification of non-compositional phrases. In Proceedings of
the 37th Annual Meeting of the Association for Computational Linguistics, pages 317–324.
Liu, D. (2003). The most frequently used spoken american english idioms: A corpus
analysis and its implications. Tesol Quarterly, 37(4):671–700.
McGlone, M. S. (1996). Conceptual metaphors and figurative language interpretation:
Food for thought? Journal of memory and language, 35(4):544–565.
McGlone, M. S., Glucksberg, S., and Cacciari, C. (1994). Semantic productivity and
idiom comprehension. Discourse Processes, 17(2):167–190.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013a). Distributed
representations of words and phrases and their compositionality. In Proceedings of the
26th International Conference on Neural Information Processing Systems, pages 3111–3119.
Mikolov, T., Yih, W.-t., and Zweig, G. (2013b). Linguistic regularities in continuous
space word representations. In Human Language Technologies: Conference of the North
American Chapter of the Association of Computational Linguistics, volume 13, pages 746–
751.
Mitchell, J. and Lapata, M. (2010). Composition in Distributional Models of Semantics.
Cognitive Science, 34(8):1388–1429.
Nunberg, G., Sag, I., and Wasow, T. (1994). Idioms. Language, 70(3):491–538.
Pianta, E., Bentivogli, L., and Girardi, C. (2002). MultiWordNet: Developing an
Aligned Multilingual Database. In Proceedings of the First International Conference on
Global WordNet, pages 293–302.
Quartu, M. B. (1993). Dizionario dei modi di dire della lingua italiana. RCS Libri.
Rei, M., Bulat, L., Kiela, D., and Shutova, E. (2017). Grasping the finer point: A super-
vised similarity network for metaphor detection. arXiv preprint arXiv:1709.00575.
Rimell, L., Maillard, J., Polajnar, T., and Clark, S. (2016). Relpron: A relative clause eval-
uation data set for compositional distributional semantics. Computational Linguistics,
42(4):661–701.
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., and Flickinger, D. (2002). Multiword
Expressions: A Pain in the Neck for NLP. In Proceedings of the 3rd International
Conference on Intelligent Text Processing and Computational Linguistics, pages 1–15.
Senaldi, M. S. G., Lebani, G. E., and Lenci, A. (2016). Lexical variability and composition-
ality: Investigating idiomaticity with distributional semantic models. In Proceedings
of the 12th Workshop on Multiword Expressions, pages 21–31.
Senaldi, M. S. G., Lebani, G. E., and Lenci, A. (2017). Determining the compositionality
of noun-adjective pairs with lexical variants and distributional semantics. Italian
Journal of Computational Linguistics, 3(1):43–58.
Steen, G. J., Dorst, A. G., Herrmann, J. B., Kaal, A., Krennmayr, T., and Pasma, T. (2014).
A method for linguistic metaphor identification: From mip to mipvu. Metaphor and
the Social World, 4(1):138–146.
Tanguy, L., Sajous, F., Calderone, B., and Hathout, N. (2012). Authorship attribution:
Using rich linguistic features when training data is scarce. In PAN Lab at CLEF.
Titone, D. and Libben, M. (2014). Time-dependent effects of decomposability, familiarity
and literal plausibility on idiom priming: A cross-modal priming investigation. The
Mental Lexicon, 9(3):473–496.
Torre, E. (2014). The emergent patterns of Italian idioms: A dynamic-systems approach. PhD
thesis, Lancaster University.
Turney, P. D. and Pantel, P. (2010). From Frequency to Meaning: Vector Space Models of
Semantics. Journal of Artificial Intelligence Research, 37:141–188.
Wulff, S. (2008). Rethinking Idiomaticity: A Usage-based Approach. Continuum.
Study III

Bigrams and BiLSTMs
Two neural networks for sequential metaphor detection
Yuri Bizzoni and Mehdi Ghanimifard
Centre for Linguistic Theory and Studies in Probability (CLASP),
Department of Philosophy, Linguistics and Theory of Science,
University of Gothenburg
[email protected] [email protected]

Abstract

We present and compare two alternative deep neural architectures to perform word-level metaphor detection on text: a bi-LSTM model and a new structure based on recursive feed-forward concatenation of the input. We discuss different versions of such models and the effect that input manipulations - specifically, reducing the length of sentences and introducing concreteness scores for words - have on their performance.1

1 The model produced in this paper competed in the Workshop on Figurative Language Processing's Shared Task with team name OCOTA.

1 Paper's contribution

This paper describes our contribution to the shared task on metaphor detection published by NAACL 2018's First Workshop on Figurative Language Processing.
In this paper, we will:

1. Present and compare two neural network models: (1) a bidirectional recurrent neural network for long-distance compositions and (2) a novel bigram-based model for local compositions.

2. Show the results of ablation experiments on these two models.

3. Present some input manipulations and feature enrichment to improve their performance.

The implementation code and additional supplementary material are available here: https://github.com/GU-CLASP/ocota

2 Introduction

Automatic metaphor detection is the task of automatically identifying metaphors in a text or dataset (Veale et al., 2016). Traditionally, the main approaches to this problem have been of two kinds: either a set of manually crafted rules was applied to a text, or a machine learning algorithm was trained on a source dataset to identify patterns of features identifying metaphoricity. In the latter case, typically used features were "psycholinguistic" features such as abstractness or imageability;2 hypernym-hyponym coercions as modeled by resources like WordNet; sequence probabilities as given by language models; and semantic spaces or word embeddings. Similar trends can also be observed in works dealing with other figures of speech (Zhang and Gelernter, 2015).

2 Recent trends tend to see metaphoricity as a nuanced rather than binary property, and to take into consideration the correlation between figurativity and affective scoring (Köper and im Walde, 2016), an umbrella term usually including four psycholinguistic properties: abstractness, arousal, imageability and valence (Köper and Im Walde, 2016).

The use of word embeddings in metaphor processing - both in detection and interpretation - is particularly widespread, and distributional semantic spaces may represent the single most consistently used "tool" in this task. Su et al. (2017) combine word embeddings and WordNet hypernym/hyponym information to detect nominal and predicative metaphors of the kind "X is Y" and to select a more literal target, thus producing a paraphrase of the metaphor.
Shutova et al. (2017) use unsupervised and weakly supervised learning to detect metaphors, exploiting syntax-aware distributional word vectors.
Gong et al. (2017) use figurative language detection - sarcasm and metaphor - as a way to explore word vector compositionality and try to use simple cosine distance to tell metaphoric from literal sentences: a word being out of context in a sentence has a likelihood of being metaphoric.
The reason why semantic spaces are consistently
used in metaphor detection lies in the conception that metaphor, like metonymy and other figures of speech (Nastase and Strube, 2009), is a mainly contextual phenomenon. In this view, a metaphor is fundamentally composed of two different semantic domains, in which one domain acts as source - and is used literally - while the other acts as target - and is used figuratively.
In this frame, semantic spaces appear to be a very flexible and powerful framework to model such semantic domains in terms of words' clustering and distributional similarity (Mohler et al., 2014). Also, semantic spaces are relatively easy to build and handle, giving them an advantage over more time-consuming resources, such as very large knowledge bases and "is A" bases from web corpora, as in Li et al. (2013).
Gutierrez et al. (2016) use the flexibility of word vectors to study the compositional nature of metaphors and the possibility of modeling it in a semantic space.
Tsvetkov et al. (2014) use distributional spaces, together with several other resources such as imageability scores and abstractness, to detect metaphors in English and apply a transfer learning system through pivoting on bilingual dictionaries to detect metaphors in multiple languages.
A composite approach using both distributional features and psycholinguistic scores for lexical items is also used by Rai et al. (2016) to perform metaphor detection using conditional random fields.
Metaphor detection with semantic spaces has also been explored in a multimodal frame by Shutova et al. (2016), where systems using only text-based distributional vectors are compared against systems using distributional vectors enriched with visual information.
The link between distributional information and metaphors appears so relevant that some studies presenting new general distributional approaches have elected metaphor detection as a benchmark to test their models (Srivastava and Hovy, 2014), and studies using diversified sets of resources for their classifiers report that distributional vectors are the best performing single device to tackle metaphor detection (Köper and im Walde, 2016).
Finally, Bulat et al. (2017) present a different kind of semantic space, not context-based but attribute-based, to detect and generalize over metaphoric patterns. In such spaces, words are represented by the attributes of the concepts they represent, so that for example ant is represented by elements such as an insect, is black, etc. The authors describe a system to map conventional distributional spaces to pre-existent attribute-based spaces and show that such an approach helps detect metaphoric bigrams.
A recent approach is that of using neural networks for metaphor detection with pretrained word embedding initialization. Bizzoni et al. (2017) and Rei et al. (2017) proved that this is a valuable strategy to predict metaphoricity in datasets of bigrams without any extra contextual or explicit world knowledge representations. While Bizzoni et al. (2017) show how a simple fully connected neural network is able to learn a pre-existing dataset of metaphoric bigrams with high accuracy and to achieve a better performance than previous approaches, Rei et al. (2017) present an ad-hoc neural design able to compose and detect metaphoric bigrams in two different datasets.
Do Dinh and Gurevych (2016) apply a series of perceptrons to the Amsterdam Corpus combined with word embeddings and part-of-speech tagging, reaching an F-score of .56.
Interestingly, a similar approach - a combination of fully connected networks and pre-trained word embeddings - has also been used as a pre-processing step to metaphor detection, in order to learn word and sense abstractness scores to be used as features in a metaphor identification pipeline (Köper and im Walde, 2017).

3 Corpus

Metaphor processing suffers from a problem of data scarcity: annotated corpora for metaphor detection are relatively rare and of modest proportions.
In this work we use the VU Amsterdam Metaphor Corpus (Krennmayr and Steen, 2017) to train and test our models. To date, the VU Amsterdam Metaphor Corpus (VUAMC) is the largest publicly available annotated corpus for metaphor detection.
Metaphor corpora in other languages do exist, but, to the best of our knowledge, they suffer from the same problem of data scarcity.
The VUAMC is divided into four sub-categories representing four different genres: news texts, fiction, academic texts and conversations. Every word in the corpus is manually annotated by
several annotators for metaphoricity. In the corpus, metaphor, simile and personification are equated, while implicit metaphors are also taken into consideration. For example, in the sentence To embark on such a step is not necessarily to succeed immediately in realizing it, the word it is considered an implicit metaphor since it refers to the word step, which was used metaphorically.
The corpus covers about 190,000 lexical units, randomly selected from the BNC Baby corpus. According to Krennmayr and Steen (2017), the genre with the highest percentage of manually detected metaphors is academic texts (18.5%), followed by news (16.4%), fiction ("only" 11.9%) and conversation (7.7%). Given the very fine-grained nature of metaphor annotation applied to the corpus, the authors also find that the parts of speech that tend to be used metaphorically most often are prepositions and verbs, followed by adjectives and nouns.
Due to its dimensions, diversity and accessibility, the VU Amsterdam Metaphor Corpus has been used in a number of studies. Using it can provide a direct comparison to important previous works and proposed models. This makes the VUAMC a valuable resource for metaphor detection and processing.
Nonetheless, the VU Amsterdam Metaphor Corpus presents some difficulties: the semantic annotation of metaphor can be extremely fine-grained and cross the boundaries with word sense disambiguation.
For example, in the sentence:

The 63-year-old head of Pembridge Investments, through which the bid is being mounted says, 'rule number one in this business is: the more luxurious the luncheon rooms at headquarters, the more inefficient the business'. [a1e-fragment01-5]

the following words were annotated as metaphoric: head, through, mounted, rule, in, this and headquarters.
Sometimes the annotation itself can be puzzling or questionable. In the sentence:

There are other things he has, on his own admission, not fully investigated, like the value of the DRG properties, or which part of the DRG business he would keep after the break up. [a1e-fragment01-7]

the following words are annotated as metaphoric: things, on, admission, part, keep and after.
While the very fine-grained metaphoricity of things, part and keep is to some extent still understandable - these terms are not used in their physical sense to indicate material objects, such as a concrete slice of something, or the act of physically keeping something with oneself - the metaphoric nature of admission remains quite opaque. At the same time, it is not clear why the annotators ignored the metaphoric interpretation of the break up.
There are also harder-to-explain examples, at least from our perspective. The sentence

Going to bed with Jean fucking, fucking shite! [kbd-fragment07-2586]

is annotated as completely literal - no metaphoric usage is detected by the annotators.
In the sentence

Take that fucking urbane look off your face and face reality, Adam [fpb-fragment01-1343]

the following words are annotated as metaphoric: take, that, off, face.
All the remaining terms have to be considered as literal, which looks slightly incoherent with the previous fine-grained metaphoricity annotations.

4 Models

4.1 Architectures

In this work we present two alternative neural architectures to process sentences as input and predict words' metaphoricity as output.
The first model we discuss is composed of a bi-directional LSTM (Schuster and Paliwal, 1997) and two fully connected or dense layers, having dimensionalities of 32, 20 and 1, respectively. We will also show results for deeper and more shallow alternative versions of this model.
Sun and Xie (2017) recently tried to tackle verb metaphor detection on the TroFi corpus (Birke and Sarkar, 2006) using Bi-LSTMs with word embeddings. For their study they tried different kinds of input: using the whole sentence; using a sub-sequence composed of the target verb and all its dependents; and using a sub-sequence composed of the target verb, its subject and its object. Interestingly, they show that the simplest approach -
taking into consideration the whole sentence - re-
turns the best results, with an F score only slightly
lower than that achieved by a composite approach
taking into consideration all of the previous differ-
ent inputs together.
The main difference with respect to our architecture is the
presence of the final Perceptrons (fully connected
networks). Sun and Xie (2017) don’t mention fur-
ther hidden layers beyond the bi-LSTM.
We also don’t have any form of syntactic pre-
processing and we only use the sequence of the
standard word embeddings to represent the whole
sentence. Finally, we are interested in considering
the different performances of bi-LSTMs on dif-
ferent part-of-speech elements: metaphor recog-
nition on functional words is supposedly harder,
since these words have a more complex semantic
signature in distributional spaces.
In this spirit, we find it worthwhile to approach the
problem with a relatively "standard" neural frame-
work.
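A minimal sketch of this first architecture, assembled in Keras as described above, is given below. It is not the authors' released code: the layer sizes (a token-wise dense layer of 300 units before the Bi-LSTM, Bi-LSTM of 32 units, dense layers of 20 and 1 units), the maximum sentence length of 50 and the 300-dimensional embeddings follow the text, while the activation functions and the sentinel label used in the masked loss (which mirrors the zero-loss treatment of unannotated tokens described in Section 4.4) are assumptions.

```python
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input, Dense, Bidirectional, LSTM
from tensorflow.keras.models import Model

MAX_LEN, EMB_DIM = 50, 300            # sentence length cap and embedding dimensionality

def masked_binary_crossentropy(y_true, y_pred):
    # Tokens without a metaphoric/literal annotation are assumed to carry the
    # sentinel label -1; they contribute zero loss.
    mask = K.cast(K.not_equal(y_true, -1.0), K.floatx())
    loss = K.binary_crossentropy(K.clip(y_true, 0.0, 1.0), y_pred)
    return K.sum(loss * mask) / K.maximum(K.sum(mask), 1.0)

tokens = Input(shape=(MAX_LEN, EMB_DIM))                # pre-embedded sentence
x = Dense(300, activation="relu")(tokens)                # dense layer before the Bi-LSTM
x = Bidirectional(LSTM(32, return_sequences=True))(x)
x = Dense(20, activation="relu")(x)
scores = Dense(1, activation="sigmoid")(x)               # one metaphoricity score per token

model = Model(tokens, scores)
model.compile(optimizer="adam", loss=masked_binary_crossentropy)
```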
The second model we discuss is a simple se- Figure 1: Bigram composition networks with depth
n = 2.
quence of fully connected neural networks.
We present the design of this architecture in Fig-
ure 1. GloVe (Pennington et al., 2014) (2) Word2Vec
This model is a generalization of neural ar- (Mikolov et al., 2013). Since these vector
chitectures for bigram phrase compositions as spaces are trained on different corpora, there are
tested on Adjective-Noun phrases in Bizzoni et al. some out-of-vocabulary words, we represent these
(2017). While a similar approach is already at- words with zero vectors. Additionally, Word2Vec
tempted in Do Dinh and Gurevych (2016), we is using a sub-sampling technique for more effi-
introduce a recursive variant which can make ciency which consequently it doesn’t cover most
the compositions deeper and while allowing wide frequent words. In order to expand the word-
window sizes. There have been more sophisticated coverage, we also trained GloVe embeddings on
architectures such as Kalchbrenner et al. (2014), British National Corpus (Consortium et al., 2007)
which take a similar approach for sentence repre- from which the VUAMC corpus was sampled, and
sentation with convolutional neural networks, but compared it with both pre-trained Word2Vec em-
we propose a simpler method only using dense beddings on Google News corpus and standard
compositions. GloVe embeddings trained on Common Crawl
We built our architecture using the Python li- corpus.
brary Keras (Chollet et al., 2015). Explicit features It has been observed in sev-
For both our models we used Adam optimizer. eral works that metaphoricity judgments are par-
tially related to a gap in concreteness between the
4.2 Input manipulation target word and its context. Köper and im Walde
We compare two different features representa- (2017) try detecting all metaphoric verbs in the
tions: 1. different word embeddings, 2. concrete- Amsterdam corpus using this single feature. Biz-
ness scores as word representations. In addition zoni et al. (2017) show how a network trained for
to ablation test for feature representations, we ex- metaphor detection on pairs of word embeddings
amined the effect of breaking sentences in shorter can “side-learn” noun abstractness.
sequences. A metaphor functioning on this axis is com-
Embeddings We tried two types of pre-trained posed of an abstract and a concrete element: in
word embeddings both with 300 dimensions: (1) such case, usually, the concrete element is the
metaphoric one. The expression “In a window of Concreteness score window number of words
5 years, between 2011 and 2016” could be consid- 1-2 38 262
ered a metaphor playing on this level, where the 2-3 36 730
more concrete word ”window” has a metaphoric 3-4 28 664
sense. 4-5 14 473
There are kinds of metaphors functioning at dif-
ferent semantic levels: for example a synesthesia, Table 1: Concrete and abstract tokens in VUAM corpus
which can be considered a sub-type of metaphor, according to Brysbaert et al. (2014) dataset.
is an expression where a word linked to a senso-
rial field is used to refer to a term that pertains to
another sensorial field. ern day annotators’ minds. We discussed various
In this case, the features used metaphorically cases of this problem in the section about the cor-
are usually on a similar level of abstractness. pus: words that have gradually assumed a new and
However, for our purposes the abstract-concrete main sense in the English language are often anno-
features may be among the most important to take tated as metaphors in the VUAMC.
into consideration. Nonetheless, the abstract-concrete polarity re-
While the abstract-concrete polarity is repre- mains one of the main semantic dimensions to in-
sented in distributional embeddings, it is possible terpret and understand metaphors and has been ex-
that taking such features more explicitly into con- plicitly used in several metaphor detection tasks
sideration would help a neural classifier. Brysbaert with promising results.
et al. (2014) released a list of almost forty thou- We can thus partly revert to feature engineer-
sand English words annotated along the concrete- ing and see whether adding this dimension can im-
abstract axis, annotated by over four thousand par- prove the performance of our models.
ticipants.
Sentence breaking Including long sentences in
We try using such scores as an extra dimen-
our training dataset makes it necessary to consis-
sion for the distributional embeddings: we thus
tently pad short sentences with zero-vectors. In
obtain sequences of 301-dimensional embeddings,
our experiments we have seen that this seems to
the last dimension being the human rating of con-
slow down and harm training for our models, since
creteness. For the out-of-vocabulary words we use
they will try to learn both patterns for sequences of
the average concreteness value of 2.5.
pre-trained embeddings and patterns for long se-
This resource allows us to assign to (almost) ev-
quences of vectors filled with 0s.
ery word in the dataset an explicit concreteness
score. When a word might have more than one To partly avoid this problem, we can break long
sense, the annotations seem to use the most con- sentences into two or more shorter elements. We
crete one: for example the word “node” has a con- assume that long distance information is not par-
creteness score of 4 out of 5. For comparison the ticularly important here to detect metaphoricity,
words “output” and “literally” have a score of 2.48 while long padding can affect performance.
and the word “being” has a score of 1.93.
It must be noted that the abstract-concrete gap 4.3 Preprocessing
is not necessarily the best way to describe the kind
of metaphors represented in this specific corpus. We chose a maximum sentence length of 50: while
The network should be able to mark as metaphoric the longest sentence in the dataset is 87 words, the
words in this dataset that have a low level of con- vast majority of the elements in the dataset is less
creteness, such as “approach” (2.76), in equally than 50 words long. Out of vocabulary words,
abstract contexts, such as “latest corporate reveals which are words that did not have a correspond-
laid-back approach” (here “approach” was marked ing vector in our embedding space, were replaced
as metaphoric in VUAMC). by a mock vector of all zeros. After shuffling the
Many of the metaphoric uses outlined here are dataset, we use the first 1000 sentences of the cor-
so ingrained in language that their actual con- pus as test, and the rest of the data for training
crete origins may be under-represented not only (11122 sentences). We used the same training and
in modern day corpora, but even in many mod- test data for all reported results.
4.4 Loss function models. Without external features such as con-
The design of the models is to predict the creteness or POS tagging, composing the input im-
metaphoricity of each word in a sentence. The pre- proves the model’s performance up to a window of
dicted value from a final layer with sigmoid acti- 3. Larger windows reduce the performance of the
vation is compared with the labeled data and usual model.
logarithmic loss is used. However, most words In Table 4 we report the tests with different set-
do not have specified metaphoric or literal anno- tings on depth and width of each layer.
tations in the dataset. Instead of assigning a non- It seems that widening the dimensionality of the
metaphor value to unspecified tokens in a string, Bi-LSTM itself beyond a certain limit does not
we modified the loss function in order to generate improve - and rather harms - the model’s perfor-
zero loss for these tokens. mance in classification.
Regarding our first model, completely relying
4.5 Training on the power of the Bi-LSTM architecture is not
After shuffling the training data, 1000 samples are enough, and deeper fully connected layers are
taken as holdout to find the overfitting point. With clearly playing a role.
batch size 64 and and early stopping patience 3 We can also see that inserting a fully connected
based on validation loss we trained each model up layer before the Bi-LSTM returns better results.
to 15 epochs. This layer has a number of nodes as large as the
number of dimensions of the input token embed-
5 Results dings. It can be another clue that the most rele-
vant information for this task has to be searched
5.1 Embeddings in the word embeddings composing the sentence
Through a comparison of different semantic and their immediate surrounding, rather than in the
spaces, we found that the best performing space structure of the whole sequence.
was GloVe trained on 42B Common Crawl, of di- In conclusion, our results show that a quite
mensionality 300. standard deep neural architecture fed with good
For the rest of our experiments we used these word embeddings can return promising results in
embeddings. metaphor detection. The “compositional” archi-
tecture also achieves comparable results, with an
5.2 Baseline F score only a couple of points lower than that of
In Table 2, we compare the results obtained from the Bi-LSTM, indicating that “forcing” a network
previous works on this task, and the performance to give particular attention to the short or immedi-
of the “vanilla” settings of our model including a ate context of each word in the data can improve
simple LSTM as our baselines. The comparison its performance all the while reducing its depth,
with Do Dinh and Gurevych (2016) shows that de- complexity and number of parameters. While this
ploying deeper and more complex architectures on approach is not the one returning the absolute best
this set does not return particularly large improve- F score, we consider the trade-off between its sim-
ments: we achieve an F1-score one point higher plicity and its performance worth noting.
than Do Dinh and Gurevych (2016)’s results on Our results also show a negative aspect: while
a setting enriched with POS tags, and two points we consider our models’ performances encourag-
higher than the simplest model proposed in the pa- ing, there is an ample room for improvement.
per.
5.3 Feature experiments
It can be observed that our bigram composition
architecture seems to produce comparable results Interestingly adding explicit semantic information
considering the previous works. The influence of such as concreteness ratings in our input - which
LSTM architectures appears thus further dimin- means, somehow, reverting to feature engineering
ished. - did produce better results for the composition ar-
Table 3 presents precision, recall and F-score chitecture, but not yet for our Bi-LSTM.
values for several concatenation windows of our Table 5 show the results of our best perform-
composition model. These results can be com- ing models when the concreteness of the individ-
pared to the ones we obtain with deep Bi-LSTM ual token was explicitly added to the embeddings.
Architecture F1
Haagsma and Bjerva (2016) .53
Do Dinh and Gurevych (2016)3 .56
Dense(1) .22
LSTM(32) .43
Bi-LSTM(32) .46
Bi-LSTM(32)+Dense(20) .50
Dense(300)+Bi-LSTM(32)+Dense(20) .56
Concat(n=2)+Dense(300) .55

Table 2: Performance of different models compared to the scores reported by two relevant works in the literature.
We report the performance of simpler models and their combinations as baselines. We used some abbreviations to
describe the models in the table. For example, Dense(1) represents a single, fully connected layer of output length
of 1, LSTM(32) is an LSTM with an output length of 32 and Concat represents our compositional model. Thus,
Concat(n=2)+Dense(300) represents the bigram composition model with a concatenation window of 2 combined
with a fully connected layer of 300 output units.

N Precision Recall F1 score the set.


1 .627 .459 .530 Not surprisingly, a combination of these two
2 .588 .504 .543 methods - adding explicit concreteness informa-
3 .571 .531 .550 tion and breaking long sentences - returns the best
4 .649 .402 .497 overall results, as can be seen in Table 7.
Finally, since these experiments were originally
Table 3: F1 for different windows of concatenation (N) designed for the shared task in metaphor detec-
in the composition model. N=1 is equivalent to no con- tion of the First Workshop in Figurative Language
catenation. (NAACL 2018), in Table 8 we report our best per-
forming models’ results on the evaluation set pro-
The results are higher than those returned by the vided in the task.
same models trained and tested on the same sen- The last line reports the result from using both
tences only with pre-trained distributional embed- models together: as can be seen, the F score we get
dings. It appears that simply adding the concrete- from taking into consideration the output of both
ness feature returns a better performance on the architectures together is higher than the F score of
whole dataset. It is worth noting that in this case, the single models.
and only in this case, the “compositional” archi- We can suppose that the two models are learn-
tecture is the best performing, while the bi-LSTM ing to detect slightly different kinds of metaphors -
has a harder time detecting metaphors in the tex- their true positives are not completely overlapping
tual data. - and they can thus complement each other.
Finally, we try to break long sentences into
6 Conclusions
shorter sequences, as we discussed in 4.2. The
metaphors identified in the VUAM corpus do not In the frame of NAACL 2018’s shared task on
generally require long-distance information to be metaphor detection, we explored two main ap-
detected. We can observe that this method im- proaches to detect metaphoricity through deep
proves the performance of our models: this is learning and compared their performances with
probably because the “noise” due to long padding different kinds of inputs. The overall single best
of short sentences is reduced. Having less contex- performing system is a deep neural network com-
tual information for words tagged as metaphoric or posed of a bi-LSTM preceded and followed by
literal does not seem to have a real negative impact fully connected layers, having access to concrete-
on the learning process. ness scores for each token and running on rela-
As we show in Table 6, breaking sentences tively short sequences - thus reducing the effects
longer than 20 tokens into several short sequences of sentence padding.
reduces the number of misclassified elements in We show that adding such features, our model is
Architecture F1
Bi-LSTM(32) .46
Bi-LSTM(32)+Dense(20) .50
Bi-LSTM(400)+Dense(20) .47
Bi-LSTM(32)+LSTM(32)+Dense(20) .35
Bi-LSTM(400)+LSTM(32)+Dense(20) .43
Dense(300)+Bi-LSTM(32)+Dense(20) .56
Dense(300)+Bi-LSTM(300)+Dense(20) .56
Dense(300)+Bi-LSTM(300)+LSTM(20)+Dense(20) .57
Dense(300)+Bi-LSTM(300)+LSTM(100)+Dense(20) .40

Table 4: Parameter tuning, testing both deeper and wider settings of the model. We write in parentheses the
dimensions of each layer: for example Dense(20) is a fully connected layer with an output space of dimensionality
20.

N Precision Recall F1
Dense(300)+Bi-LSTM(32)+Dense(20) .642 .498 .561
Dense(301)+Bi-LSTM(32)+Dense(20)+Conc .580 .491 .530
Concat(n=2)+Dense(300)+Conc .554 .570 .562
Concat(n=3)+Dense(300)+Conc .567 .593 .580

Table 5: Results for different models using embeddings enriched with explicit information regarding word con-
creteness. The first line works as baseline showing a model without input manipulation. Concat(n=) represents
our compositional model, with n= representing the composition window length. Conc signifies the usage of
concreteness scores. So for example Concat(n=2)+Dense(300)+Conc represents our compositional model with
concatenation window of 2 combined with a fully connected layer of 300 output units and using the concreteness
scores as additional information.

N Precision Recall F1
Dense(300)+Bi-LSTM(32)+Dense(20) .642 .498 .561
Dense(300)+Bi-LSTM(32)+Dense(20)+Chunk .671 .570 .621
Concat(n=2)+Dense(300)+Chunk .571 .561 .560
Concat(n=3)+Dense(300)+Chunk .611 .400 .491

Table 6: Results for different models using sentence breaking at 20 tokens (any sentence longer than 20 tokens is split into
two parts treated as completely separate sentences). The first line works as a baseline showing a model without input
manipulation. Concat(n=) represents our compositional model, Chunk signifies the usage of sentence breaking.

N Precision Recall F1
Dense(300)+Bi-LSTM(32)+Dense(20) .642 .498 .561
Dense(300)+Bi-LSTM(32)+Dense(20)+Chunk .670 .571 .620
Dense(301)+Bi-LSTM(32)+Dense(20)+Conc .581 .490 .531
Dense(301)+Bi-LSTM(32)+Dense(20)+Conc+Chunk .649 .624 .636
Concat(n=3)+Dense(300)+Conc+Chunk .632 .446 .523

Table 7: Results for different models using embeddings enriched with explicit information regarding word con-
creteness and sentence breaking at 20 tokens (any sentence longer than 20 tokens is split into two parts treated as completely
separate sentences). The first lines work as baselines showing the performance of previous models (without any
input manipulation, only chunking, only concreteness scores). Concat(n=) represents our compositional model,
Chunk signifies the usage of sentence breaking, Conc represents the usage of concreteness scores.
N Precision Recall F1
Dense(300)+Bi-LSTM(32)+Dense(20) .638 .593 .615
Concat(n=2)+Dense(300) .642 .498 .561
Combined results .595 .680 .635

Table 8: Results for the evaluation set from the shared dataset competition (NAACL 2018). We used sentence
breaking and concreteness information.

able to slightly outperform two baselines recently Acknowledgments


published.
We are grateful to our colleagues in the Centre
We also found that combining these two sys-
for Linguistic Theory and Studies in Probability
tems gave the best results on the test set provided
(CLASP), FLoV, at the University of Gothenburg
by the shared task.
for useful discussion of some of the ideas pre-
Considering the difficult nature of the original sented in this paper.
annotations, we judge this a promising result. It We are also grateful to three anonymous review-
could be the case that adding more explicit fea- ers for their several helpful comments on our ear-
tures further helps reduce the number of inconsis- lier draft.
tent detections on the corpus, but one of the goals The research reported here was done at CLASP,
of these experiments was that of keeping the fea- which is supported by a 10 year research grant
ture engineering as contained as possible, reduc- (grant 2014-39) from the Swedish Research Coun-
ing the number of external resources used to en- cil.
rich the input.
We also explored a simpler neural architecture
based on the recursive composition of word em- References
beddings. Yielding a slightly worse performance
Julia Birke and Anoop Sarkar. 2006. A clustering ap-
than the Bi-LSTM architecture, this model still proach for nearly unsupervised recognition of non-
shows that a much simpler architecture can reach literal language. In 11th Conference of the Euro-
interesting results. pean Chapter of the Association for Computational
Linguistics.

7 Future Works Yuri Bizzoni, Stergios Chatzikyriakidis, and Mehdi


Ghanimifard. 2017. ” deep” learning: Detecting
metaphoricity in adjective-noun pairs. In Proceed-
We think that an in depth error analysis of our ings of the Workshop on Stylistic Variation, pages
models’ shortcomings might represent an interest- 43–52.
ing contribution in order to better understand what
neural networks are learning when they are learn- Marc Brysbaert, Amy Beth Warriner, and Victor Ku-
perman. 2014. Concreteness ratings for 40 thousand
ing metaphor detection. In future we would like to generally known english word lemmas. Behavior
perform a systematic analysis of the errors of our research methods, 46(3):904–911.
networks both when used alone and when used in
combination. Luana Bulat, Stephen Clark, and Ekaterina Shutova.
2017. Modelling metaphor with attribute-based se-
We would also like to extend the range of our mantics. In Proceedings of the 15th Conference of
comparisons to different, and simpler, machine the European Chapter of the Association for Com-
learning algorithms to see to what extent the in- putational Linguistics: Volume 2, Short Papers, vol-
ume 2, pages 523–528.
formation provided in input - in terms of distri-
butional information and explicit lexical scores François Chollet et al. 2015. Keras. https://
- contributes to the performance of our models. github.com/keras-team/keras.
While a consistent body of works on metaphor de-
tection with “traditional” machine learning means British National Corpus Consortium et al. 2007.
British national corpus version 3 (bnc xml edition).
already exists, we think that a direct comparison of Distributed by Oxford University Computing Ser-
our networks with other systems might help clari- vices on behalf of the BNC Consortium. Retrieved
fying the contribution of deep learning to this task. February, 13:2012.
Erik-Lân Do Dinh and Iryna Gurevych. 2016. Token-level metaphor detection using neural networks. In Proceedings of the Fourth Workshop on Metaphor in NLP, pages 28–33.

Hongyu Gong, Suma Bhat, and Pramod Viswanath. 2017. Geometry of compositionality. In AAAI, pages 3202–3208.

E. Dario Gutierrez, Ekaterina Shutova, Tyler Marghetis, and Benjamin Bergen. 2016. Literal and metaphorical senses in compositional distributional semantic models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 183–193.

Hessel Haagsma and Johannes Bjerva. 2016. Detecting novel metaphor using selectional preference information. In Proceedings of the Fourth Workshop on Metaphor in NLP, pages 10–17.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 655–665.

Maximilian Köper and Sabine Schulte im Walde. 2016. Automatically generated affective norms of abstractness, arousal, imageability and valence for 350 000 German lemmas. In LREC.

Maximilian Köper and Sabine Schulte im Walde. 2016. Distinguishing literal and non-literal usage of German particle verbs. In HLT-NAACL, pages 353–362.

Maximilian Köper and Sabine Schulte im Walde. 2017. Improving verb metaphor detection by propagating abstractness to words, phrases and individual senses. In Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pages 24–30.

Tina Krennmayr and Gerard Steen. 2017. VU Amsterdam Metaphor Corpus. In Handbook of Linguistic Annotation, pages 1053–1071. Springer.

Hongsong Li, Kenny Q. Zhu, and Haixun Wang. 2013. Data-driven metaphor recognition and explanation. Transactions of the Association for Computational Linguistics, 1:379–390.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Michael Mohler, Bryan Rink, David B. Bracewell, and Marc T. Tomlinson. 2014. A novel distributional approach to multilingual conceptual metaphor recognition. In COLING, pages 1752–1763.

Vivi Nastase and Michael Strube. 2009. Combining collocations, lexical and encyclopedic knowledge for metonymy resolution. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, pages 910–918. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Sunny Rai, Shampa Chakraverty, and Devendra K. Tayal. 2016. Supervised metaphor detection using conditional random fields. In Proceedings of the Fourth Workshop on Metaphor in NLP, pages 18–27.

Marek Rei, Luana Bulat, Douwe Kiela, and Ekaterina Shutova. 2017. Grasping the finer point: A supervised similarity network for metaphor detection. arXiv preprint arXiv:1709.00575.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Ekaterina Shutova, Douwe Kiela, and Jean Maillard. 2016. Black holes and white rabbits: Metaphor identification with visual features. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 160–170.

Ekaterina Shutova, Lin Sun, Elkin Dario Gutierrez, Patricia Lichtenstein, and Srini Narayanan. 2017. Multilingual metaphor processing: Experiments with semi-supervised and unsupervised learning. Computational Linguistics.

Shashank Srivastava and Eduard Hovy. 2014. Vector space semantics with frequency-driven motifs. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 634–643.

Chang Su, Shuman Huang, and Yijiang Chen. 2017. Automatic detection and interpretation of nominal metaphor based on the theory of meaning. Neurocomputing, 219:300–311.

Shichao Sun and Zhipeng Xie. 2017. BiLSTM-based models for metaphor detection. In National CCF Conference on Natural Language Processing and Chinese Computing, pages 431–442. Springer.

Yulia Tsvetkov, Leonid Boytsov, Anatole Gershman, Eric Nyberg, and Chris Dyer. 2014. Metaphor detection with cross-lingual model transfer. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 248–258.

Tony Veale, Ekaterina Shutova, and Beata Beigman Klebanov. 2016. Metaphor: A computational perspective. Synthesis Lectures on Human Language Technologies, 9(1):1–160.

Wei Zhang and Judith Gelernter. 2015. Exploring metaphorical senses and word representations for identifying metonyms. arXiv preprint arXiv:1508.04515.
Study IV

Deep Learning of Binary and Gradient Judgements for Semantic
Paraphrase
Yuri Bizzoni Shalom Lappin
University of Gothenburg University of Gothenburg
[email protected] [email protected]

Abstract
We treat paraphrase identification as an ordering task. We construct a corpus of 250 sets of five
sentences, with each set containing a reference sentence and four paraphrase candidates, which are
annotated on a scale of 1 to 5 for paraphrase proximity. We partition this corpus into 1000 pairs of
sentences in which the first is the reference sentence and the second is a paraphrase candidate. We
then train a DNN encoder for sentence pair inputs. It consists of parallel CNNs that feed parallel
LSTM RNNs, followed by fully connected NNs, and finally a dense merging layer that produces
a single output. We test it for both binary and graded predictions. The latter are generated as a
by-product of training the former (the binary classifier). It reaches 70% accuracy on the binary
classification task. It achieves a Pearson correlation of .59-.61 with the annotated gold standard for
the gradient ranking candidate sets.

1 Introduction
Paraphrase identification is an area of research with a long history. Approaches to the task can be di-
vided into supervised methods, such as (Madnani et al., 2012), currently the most commonly used, and
unsupervised techniques (Socher et al., 2011).
While many approaches of both types use carefully selected features to determine similarity, such
as string edit distance (Dolan et al., 2004) or longest common subsequence (Fernando and Stevenson,
2008), several recent supervised approaches apply Neural Networks to the task (Filice et al., 2015; He
et al., 2015), often linking it to the related issue of semantic similarity (Tai et al., 2015; Yin and Schütze,
2015).
Traditionally, paraphrase detection has been formulated as a binary problem. Corpora employed in
this work contain pairs of sentences labeled as paraphrase or non-paraphrase. The most representative
of these corpora, such as the Microsoft Paraphrase Corpus (Dolan et al., 2004), conform to this paradigm.
This approach is different from the one adopted in semantic similarity datasets, where a pair of
words or sentences is labeled on a gradient classification system. In some cases, semantic similarity
tasks overlap with paraphrase detection, as in Xu et al. (2015) and in Agirre et al. (2016). Xu et al.
(2015) is one of the first works that tries to connect paraphrase identification with semantic similarity.
They define a task where the system generates both a binary judgment and a gradient score for sentence
pairs.
We present a new dataset for paraphrase identification which is built on two main ideas: (i) Para-
phrase recognition is a gradient classification task. (ii) Paraphrase recognition is an ordering problem,
where sets of sentences are ranked by similarity with respect to a reference sentence.
While the first assumption is shared by some of the work we have cited here, our corpus is, to the
best of our knowledge, the first one constructed on the basis of the second claim.
We believe that annotating sets of sentences for similarity with respect to a reference sentence can
help with both the learning and the testing processes in paraphrase identification.
We use this corpus to test a neural network architecture formed by a combination of Convolutional
Neural Networks (CNNs) and Long Short Term Memory Recurrent Neural Networks (LSTM RNNs). We
test this model on two classification problems: (i) binary paraphrase classification, and (ii) paraphrase
ranking. We show that our system can achieve a significant correlation to human paraphrase judgments
on the ranking task as a by-product of supervised binary learning.

2 A New Type of Corpus for Paraphrase Recognition


At this stage our corpus is formed of 250 sets of five sentences. In each set, the first sentence is the
reference sentence, while the remaining four sentences are labeled on a 1-5 scale, based on their degree of
paraphrase similarity with respect to the reference sentence. This is on analogy with the annotation frame
used for SemEval Semantic Similarity tasks (Agirre et al., 2016). Every group of 5 sentences illustrates
(possibly different) graduated degrees of paraphrasehood relative to the reference sentence. Broadly, our
labels represent the following categories: (1) Two sentences are completely unrelated. (2) Two sentences
are semantically related, but they are not paraphrases. (3) Two sentences are weak paraphrases. (4) Two
sentences are strong paraphrases. (5) Two sentences are (type) identical.
The following example illustrates these ranking labels.

• Ref. sent: A woman feeds a cat

– A woman kicks a cat. Score: 2
– A person feeds an animal. Score: 3
– A woman is feeding a cat. Score: 4
– A woman feeds a cat. Score: 5

• Ref. sent: I have a black hat

– Larry teaches plants to grow. Score: 1
– I have a red hat. Score: 2
– My hat is night black; pitch black. Score: 3
– My hat's color is black. Score: 4
While the extremes of our scale (1 and 5) are relatively rare in our corpus, we focus on the interme-
diate cases of paraphrase, from non-paraphrases with some semantic similarity (2) to non type-identical
strong paraphrases (4).
We believe that this annotation scheme is particularly useful. While it sustains graded semantic
similarity labels, it also provides sets of semantically related elements, each one of which can be scored
or ordered independently from the others. Therefore, the reference sentence can be tested separately for
each sentence in the set in a binary classification task. In the test phase, this annotation schema allows
us to observe how a system represents the similarity between two sentences by taking the scores of two
candidates as points of relative proximity to the reference sentence.
Our examples above indicate that a binary classification can be misleading because it conceals the
different levels of similarity between competing candidates.
We find instead that framing paraphrase recognition as an ordering problem allows a more flexible
evaluation of a model. It permits us to evaluate the relative proximity of several candidate paraphrases
to the reference sentence independently of the particular paraphrase score that the model assigns to each
candidate in the set.
For example, the sentence A person feeds an animal can be considered to be a loose paraphrase of
the sentence A woman feeds a cat, or alternatively, as a semantically related non-paraphrase. Which
of these conclusions we adopt depends on our decision concerning how much content sentences need
to share in order to be classified as paraphrases. By contrast, it would be far fetched to suggest that A
woman kicks a cat is a better or even equally strong paraphrase for A woman feeds a cat. Similarly, the
sentences I have a black hat and My hat is night black can be considered to be loose paraphrases, or
semantically related non-paraphrases. But I have a red hat cannot plausibly be taken as more similar in
meaning to I have a black hat than My hat is night black.
The core of this dataset was built from various parts of the Brown Corpus (Francis and Kucera, 1979),
mainly from the news and narrative sections. For each sentence, we introduced raw paraphrases by round
trip machine translation from English through Swedish, German, Spanish and Japanese, back to English.
This process yielded paraphrases, looser relations of semantic relatedness, and non-paraphrases.
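The round-trip pipeline itself is simple enough to sketch. The snippet below is only an illustration of the procedure described above: the text does not specify which translation engine was used, so translate is a hypothetical placeholder for whatever MT backend is available; only the chain of pivot languages comes from the description.

# Illustrative sketch only: `translate` is a hypothetical stand-in for an MT backend,
# not a real API used in this work.
def translate(text, source, target):
    """Hypothetical wrapper around a machine translation service."""
    raise NotImplementedError("plug in an actual MT backend here")

def round_trip_paraphrase(sentence, pivots=("sv", "de", "es", "ja")):
    """Translate an English sentence through a chain of pivot languages and back."""
    current, source = sentence, "en"
    for pivot in pivots:
        current = translate(current, source, pivot)
        source = pivot
    return translate(current, source, "en")

# e.g. round_trip_paraphrase("A woman feeds a cat") would yield a raw candidate
# that is then manually corrected and scored against the reference sentence.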
One of the authors then manually annotated each set of five sentences and corrected grammatical infe-
licities. We also introduced more interesting syntactic and semantic variation. For example we manually
constructed many cases of negation and passive/active voice switch. This allows us to test paraphrase
over a wider range of syntactic and lexical semantic constructions. Similar manually generated elements
were often substituted, as candidate paraphrases, for round-trip generated candidates judged to be of little
interest for the task. So, for example, we frequently had several strong paraphrases produced by round-
trip translation, resulting in groups of three or four strong candidates for a reference sentence, and we
replaced several of these with our own alternatives.
A number of shorter examples produced by the authors were also added to the corpus. These are
intended to test the performance of the system for specific semantic relations, such as antonymy (I have
a new car – I have an old car), expansion (His car is red – His car has a characteristic red colour) and
subject–object permutation (A white blanket covered her mouth – Her mouth covered a white blanket).
One of the authors assigned the 1-5 ratings for each sentence in a reference set. We naturally regard
this as a ”weak” point in our dataset. As we discuss in the Conclusion, we intend to use crowd sourcing
to obtain more broadly based and reliable speaker annotation for our examples.
Our corpus has the advantage of being suitable for both training a binary classifier and developing a
model to predict gradient paraphrase judgments. For the former, we simply consider every score over a
given gradient threshold label as 1, and scores below that threshold as 0. For gradient classification we
use all the scoring labels to test the correlation between a system’s ordering performance and our human
judgments. We will show how, once a model has been trained for a binary detection task, we can check
its performance on the gradient ordering task.

3 A DNN for Paraphrase Classification


For classification and gradient judgment prediction we constructed a deep neural network. Its architecture
consists of three main components:
1. Two encoders that learn the representation of two sentences separately

2. A unified layer that merges the output of the encoders

3. A final set of fully connected layers that work on the merged representation of the two sentences
to generate a judgment.
The encoder for each pair of sentences taken as input is composed of two parallel Convolutional
Neural Networks and LSTM RNNs, feeding two sequenced fully connected layers.
The first layer of our encoders is a CNN with 50 filters of length 5. CNNs have been successfully
applied to problems in computational semantics, such as text classification and sentiment analysis (Lai
et al., 2015), as well as to paraphrase recognition (Socher et al., 2011). In this part of our model, the
encoder learns a more compact representation of the sentence, with reduced vector space dimensions and
features. This permits the NN to focus on the information most relevant to paraphrase identification.
We use an ”Atrous” Convolutional Neural Network (Giusti et al., 2013; Chen et al., 2016). An
”Atrous” CNN is a modified form of Convolutional Network designed to reduce the risk of losing impor-
tant information in max pooling. In the case of a standard CNN, max pooling will perform a reduction
of the output of the convolutional layer, selecting only some information contained in it. In the case of
image processing, for example, a 2x2 max pooling on the so-called ”map” returned by the convolutional
layer will create a smaller map that does not contain information from the entire original map, but only
from a specific region of such map, or mirroring a specific pattern in the original image: for example, all
the patches whose upper left corner lies on even coordinates on the map (Giusti et al., 2013). This way
of processing information can undermine the results when complex inputs are involved. An Atrous net-
work fragments the map returned by the max pooling layer, so that each fragment contains information
independent of the other fragments, and each reduced map contains information from all the patches of
the input. This is a good strategy for speeding up processing time by avoiding redundant computation.
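As an illustration of the underlying idea, the sketch below contrasts a plain convolution with a dilated ("atrous") convolution over a sequence of word embeddings, using the dilation mechanism of Chen et al. (2016) as exposed by Keras through the dilation_rate argument. It is a minimal sketch, not necessarily the implementation used in these experiments: the 50 filters of length 5 come from the text, while the maximum sentence length of 30 tokens and the 300-dimensional embeddings are assumptions.

# Minimal sketch of a standard vs. dilated ("atrous") 1D convolution over word
# embeddings; sentence length and embedding size are assumed values.
from tensorflow.keras import layers, models

MAX_LEN, EMB_DIM = 30, 300

standard_cnn = models.Sequential([
    layers.Conv1D(50, 5, activation="relu",
                  input_shape=(MAX_LEN, EMB_DIM)),   # ordinary convolution
    layers.MaxPooling1D(2),
])

atrous_cnn = models.Sequential([
    layers.Conv1D(50, 5, dilation_rate=2, padding="same", activation="relu",
                  input_shape=(MAX_LEN, EMB_DIM)),   # dilated filters skip positions,
    layers.MaxPooling1D(2),                          # widening the receptive field
])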
The output of each CNN is passed through a max pooling layer to an LSTM RNN. Since the CNN
and the max pooling layer perform discriminative reduction of the input dimensionality, we can run
a large LSTM RNN model (50 memory cells) without substantial computational cost. In this phase of
processing, the vector dimensions of the sentence representation are further reduced, with relevant infor-
mation (hopefully) conserved and highlighted, particularly for the sequential structure of the data. Each
encoder is completed by two successive fully connected layers of dimensions 50 and 300, respectively,
which produce a vector representation for an input sentence in the pair. The first one has a .5 dropout rate.
The 300 dimensional outputs of the two encoders are then passed to a layer that merges them into a
single vector. We found that simple vector concatenation was the best option for performing this merge.
To measure the similarity of two sequences our model only makes use of the information contained in
the merged version of the encoders’ output. We did not use a device in the merging phase to assess
similarity between two sequences. The merging layer feeds the concatenated input to a series of five
fully connected layers. The last layer applies a sigmoid function to produce the classifier judgment.
While the sigmoid function performs well for binary classification, it returns a gradient over its input,
thus generating an ordering of values for the ranking task.
These three kinds of Neural Network capture information in different ways. They can be combined
to achieve a better global representation of sentence input. Specifically, while a CNN can reduce the
spectral variance of input, an LSTM RNN is designed to model its sequential dimension over time. The
CNN manages to reduce the input’s dimensionality while keeping the ordering information of the original
sentence. This information will then be processed by the LSTM RNN, which is particularly well suited
for handling words sequenced through time.
Also, an LSTM RNN’s performance can be strongly improved by providing it with better features
(Pascanu et al., 2014). In our case this is accomplished by the CNN. The densely connected layers create
clearer, more separable final vector representations of the data. To encode the original sentences we used
Word2Vec embeddings pre-trained on Google News (Mikolov et al., 2013).
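A minimal Keras sketch of this architecture is given below. It follows the description above where the text is explicit (50 convolutional filters of length 5, a 50-unit LSTM, dense layers of 50 and 300 with a .5 dropout, concatenation of the two encoder outputs, five fully connected layers ending in a sigmoid), but the maximum sentence length, the sizes of the post-merge layers, and the hidden activations are assumptions, so it should be read as an approximation rather than the exact model.

# Minimal sketch of the paraphrase classifier, assuming a maximum sentence length of 30
# and 300-dimensional Word2Vec inputs; post-merge layer sizes are illustrative guesses.
from tensorflow.keras import layers, models

MAX_LEN, EMB_DIM = 30, 300

def build_encoder():
    inp = layers.Input(shape=(MAX_LEN, EMB_DIM))
    x = layers.Conv1D(50, 5, activation="relu")(inp)   # convolution over word embeddings
    x = layers.MaxPooling1D(2)(x)                      # discriminative reduction of the input
    x = layers.LSTM(50)(x)                             # sequential modelling of the reduced input
    x = layers.Dense(50, activation="relu")(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(300, activation="relu")(x)        # 300-dimensional sentence representation
    return models.Model(inp, x)

enc_ref, enc_cand = build_encoder(), build_encoder()   # one encoder per sentence in the pair
merged = layers.Concatenate()([enc_ref.output, enc_cand.output])
h = merged
for units in (300, 150, 80, 40):                       # assumed sizes for the post-merge layers
    h = layers.Dense(units, activation="relu")(h)
h = layers.Dropout(0.2)(h)
out = layers.Dense(1, activation="sigmoid")(h)         # binary judgment / gradient score

model = models.Model([enc_ref.input, enc_cand.input], out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])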
Table 1 gives the binary accuracy, and ranked ordering Pearson correlation performance of our model,
over 10 fold validation, after 200 epochs.
Table 2 presents accuracy and F1 for different versions of our model. The baseline is the model’s per-
formance without any training. We compute the baseline by relying solely on the pre-loaded Word2Vec
lexical embedding content of the words’ distributional vectors to obtain a semantic similarity judgment.
No learning from our corpus annotation is involved. The sentence’s vectors are still reduced to a single
vector through the LSTM layer, but this is done without corpus based supervision or training.

4 Binary Classification Task


To use our corpus for a binary classification task we map each set of five sentences into a series of pairs,
where the first element is the reference sentence and the second element is one of the four remaining
sentences. Gradient labels are then replaced by binary ones. We consider all labels higher than 2 as
positive judgments (Paraphrase) and all labels equal to or lower than 2 as negative judgments (Non-
Paraphrase). We train our model with these labels for a binary classification task.
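The mapping from gradient annotations to binary labels is straightforward; the sketch below assumes the annotated sets are held as (reference, candidate, score) tuples, which is an illustrative format rather than the corpus' actual file layout.

# Sketch of the label binarization described above; the list-of-tuples format is assumed.
def binarize(pairs, threshold=2):
    """Map each (reference, candidate, score) item to a binary paraphrase label."""
    labeled = []
    for reference, candidate, score in pairs:
        label = 1 if score > threshold else 0   # scores of 3-5 count as paraphrases
        labeled.append((reference, candidate, label))
    return labeled

# e.g. binarize([("A woman feeds a cat", "A person feeds an animal", 3)])
# -> [("A woman feeds a cat", "A person feeds an animal", 1)]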
We split our corpus into a training and a test set, making sure that the two sets contained completely
distinct reference-candidate pairs. While a small minority of reference sentences is the same in train and
test, their candidate paraphrases are always different.
We ran the training phase for 200 epochs, keeping the order of the input fixed (due to curriculum
learning issues). Training on 761 pairs of sentences and testing on 239 pairs, we reached an average accuracy of 70.0 over 10 fold cross-validation.

Figure 1: Example of an encoder. A padded input of fixed length is passed to a CNN, a max pooling layer, a single LSTM RNN, and finally two fully connected layers separated by a dropout layer of 0.5. The input's and output's shape is indicated in brackets for each layer.

We see that our architecture learned to recognize different
semantic and syntactic phenomena with a promising level of accuracy, although it is not state of the art
in paraphrase recognition for systems trained on large corpora, such as the Microsoft Paraphrase Corpus
(Ji and Eisenstein, 2013). 1
A small corpus may cause instability in results. Interestingly, we found that our DNN is able to
generalize consistently on the following patterns:
• Negation. This is a rich man’s world – This is not a rich man’s world. Non-Paraphrase;
• Subject–Object permutation. The man follows the wolf – The wolf follows the man. Non-
Paraphrase;
• Active–Passive relation. A white blanket covered her mouth – Her mouth was covered with a
white blanket. Paraphrase;
• Various cases of loose paraphrase The man follows the wolf – The person follows the animal.
Paraphrase.
However, our model had trouble with several other cases, some due to its lack of relevant world
knowledge, and others because of its limited capacity for semantically driven inference. These include:
• Time expressions. It was morning – It was noon. Non-Paraphrase;
• Some cases of antonymy. This is not good – This is bad. Paraphrase;
• Space expressions. Some years ago I was going to school when I met a man – Some years ago I
was going to church when I met a man. Non-paraphrase.
Predictably, the model has difficulty in learning a pattern or a phrase when it is under represented in
the training data. In some cases, the effect of data scarcity can be observed in an ”overfit weighting” of
specific words. We believe that these idiosyncrasies can be overcome through training on a larger set.
1 This is to be expected, given the specific nature of the task and the small dimensions of our dataset. It is also worth noting that, while sentences in the Microsoft Paraphrase Corpus are generally longer, our corpus contains a much larger variety of syntactic and semantic patterns, including "more difficult" cases, like passive-active change and negation.
Figure 2: A more abstract representation of our full model. Sequential 2 and sequential 3 are encoders
of the kind specified in Figure 1. Their outputs are concatenated in merge 1 and fed to a series of dense
layers. Dropout 3 has a rate of 0.2

We observe that, on occasion, the model’s errors are in the gray area between clear paraphrase and
clear non-paraphrase. Here the correctness of a label is not obvious. For example, the pair I am so sleepy
I can barely stand – I am sleep deprived can be considered to be a loose paraphrase pair, or they can be
taken as an instance of non-paraphrase.

5 Paraphrase Ordering Task


Once the DNN has learned representations for binary classification, we can use it to rank the sentences
of the test set in order of paraphrase proximity. We apply the sigmoid value distribution for the candidate
sentences in a set of five (the reference and four candidates) to determine the ranking. To do this we use
the original structure of our dataset, composed of sets of five sentences.
First, we attribute a similarity score to all pairs of sentences (reference sentence and candidate para-
phrase) in a set. This is the similarity score learned in the binary task, which is determined by the sigmoid
function applied on the output of our DNN. In this case, we don’t ”round” the judgment, as we are not
seeking a 0,1 output.
We compute the average Pearson correlation on all sets of the test corpus to check the extent to which
the ranking that our algorithm produces matches our gradient annotation. We found comparable and
meaningful correlations between our ranking and the ordering that our model predicts. These correlations
indicate that our model achieves an encouraging level of accuracy in predicting our gradient annotations
for the candidate sentences in a set, using the weights learned for a binary classification task.
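Concretely, the per-set evaluation can be sketched as follows. The snippet assumes a trained Keras model with two sentence inputs, a hypothetical encode helper that turns a sentence into its padded embedding matrix, and candidate sets stored as (reference, candidate, gold score) tuples; none of these reflects the actual code or data format.

# Illustrative evaluation loop for the ordering task; `encode` is a hypothetical helper.
import numpy as np
from scipy.stats import pearsonr

def set_correlation(model, encode, candidate_sets):
    """Average Pearson correlation between sigmoid scores and gold scores, per set."""
    correlations = []
    for items in candidate_sets:                           # one set = one reference + 4 candidates
        refs = np.stack([encode(r) for r, _, _ in items])
        cands = np.stack([encode(c) for _, c, _ in items])
        gold = [score for _, _, score in items]
        predicted = model.predict([refs, cands]).ravel()   # raw sigmoid values, not rounded
        correlations.append(pearsonr(predicted, gold)[0])
    return float(np.mean(correlations))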
This task differs from the binary classification task in several important respects. In one way, it
is easier. A non-paraphrase can be misjudged as a paraphrase and still fall in the right order within a
ranking. In another sense, it is more difficult. Strict paraphrases, loose paraphrases, and semantically
similar non-paraphrases have to be ordered in accord with human judgment patterns, which is a more
complex task than simple binary classification.
Our gradient ranking system allows us to have a more nuanced view concerning some of the issues
that arise in pairwise paraphrase labeling that we pointed out at the end of the previous section.
k Accuracy Pearson
1 70.10 .51
2 67.01 .63
3 79.38 .59
4 73.20 .62
5 67.01 .61
6 72.92 .72
7 66.67 .59
8 75.79 .67
9 64.21 .54
10 73.68 .67

Table 1: Accuracy (on the binary task) and Pearson correlation (on the ordering task) over ten fold validation testing after 200 epochs. The accuracy reported in the paper is an average over these results.

Model Accuracy F1
Baseline (without training) 42.1 59.3
Our model 78.0 74.6
Encoders without LSTM 65.9 68.9
Encoders without ACNN 69.5 50.8
Just one layer after concatenation 73.0 70.0
Using CNN instead of ACNN 76.6 76.0
ACNN with 10 filters 70.4 68.1
LSTM with 10 filters 69.0 71.3
Without dropouts 72.6 71.0
Merging via multiplication 72.6 71.1
Encoders without dense layers 72.2 71.7

Table 2: Accuracy for different versions of the model after 200 epochs. Each model ran on our standard train and test data, without our performing cross-validation.

The existence of a correlation between our annotation ordering and our model's predictions is a by-product of supervised binary learning. Since we are re-using the representations learned for the binary task in order to perform a new task, we consider it a form of transfer learning from a supervised binary context (assigning a 0/1 value to a pair of sentences) to an unsupervised ordering problem (ranking a set
of sentences). In this case, our corpus allowed us to perform double transfer learning. First, we use word
embeddings trained to maximize single words’ contextual similarity, in order to train on a supervised
binary paraphrase dataset. Then, we use the representations acquired in this way to perform an ordering
task for which the DNN has not been trained.
The fact that ranked correlations are sustained through binary paraphrase classification is not an obvi-
ous result. A model trained on {0,1} labels could ”polarize” its scores to the point where no meaningful
ordering would be available. Had this happened, a good performance in a binary task would actually con-
ceal the loss of important semantic information. Xu et al. (2015), discussing the relation of paraphrase
identification to the recognition of semantic similarity, observe that there is no necessary connection be-
tween binary classification and prediction of gradient labels, and that an increase in one can even produce
a loss in the other.

6 Conclusions and Future Work


We present a new kind of corpus to evaluate paraphrase identification and we construct a novel type of
DNN architecture for a set of paraphrase classification tasks. We show that our model learns an effective
representation of sentences for such paraphrase tasks.
Our corpus’ design is based on the assumption that paraphrase ranking is a useful way to approach
the paraphrase identification problem. We show how this kind of corpus can be used for supervised
learning of binary classification, for multi-class classification, and for gradient judgment prediction.
The neural network architecture that we propose encodes each sentence in a low dimensional repre-
sentation, combining a CNN, an LSTM RNN, and two densely connected neural layers. The two output
representations of the encoders are then merged through concatenation, and fed to a series of densely
connected layers.
While binary classification is directly learned in the training phase, our model also yields a robust
correlation to human judgments in the ordering task through the sigmoid distributions gener-
ated for binary classification. While the model learns to classify two sentences as paraphrases or non-
paraphrases, it retains enough information to assign gradient values to members of sets of sentences in a
way that correlates significantly with our annotation.
Our model doesn’t use any ”alignment” of the data. The encoders’ representations are simply con-
catenated. This gives our DNN considerable flexibility in modeling patterns such as subject–object
permutation (The man follows the wolf – The wolf follows the man), and sentence expansions (A man
eats the food – There is a man and he eats the food). It can also create complications where a simple
alignment of two sentences might suffice to identify a similarity. We will experiment with the addition
of some form of alignment to our model in future work.
We will be experimenting with crowd sourcing to obtain more reliable annotation of our corpus. We
will also be expanding the corpus to encompass a wider range of syntactic and semantic patterns, and
to include a significantly larger number of reference + candidate sets. Finally, we will be looking at
alternative DNN architectures, particularly those with attentional components, in an effort to improve
the performance of our models for both the binary classification and gradient judgment prediction tasks.

7 Acknowledgments
We are grateful to three anonymous reviewers for helpful comments and suggestions on an earlier draft
of this paper. The research reported in this paper was supported by a grant from the Swedish Research
Council for the establishment of the Centre for Linguistic Theory and Studies in Probability (CLASP) at
the University of Gothenburg. We would also like to thank our colleagues at CLASP at the University of
Gothenburg for useful discussion of many of the ideas presented here. We are solely responsible for any
errors which may remain in this paper.

References
Agirre, E., C. Banea, D. M. Cer, M. T. Diab, A. Gonzalez-Agirre, R. Mihalcea, G. Rigau, and J. Wiebe
(2016). Semeval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation.
In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT
2016, San Diego, CA, USA, June 16-17, 2016, pp. 497–511.

Chen, L., G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2016). Deeplab: Seman-
tic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs.
CoRR abs/1606.00915.

Dolan, B., C. Quirk, and C. Brockett (2004). Unsupervised construction of large paraphrase corpora:
Exploiting massively parallel news sources. In Proceedings of the 20th International Conference
on Computational Linguistics, COLING ’04, Stroudsburg, PA, USA. Association for Computational
Linguistics.

Fernando, S. and M. Stevenson (2008). A semantic similarity approach to paraphrase detection. Com-
putational Linguistics UK (CLUK 2008) 11th Annual Research Colloquium.

Filice, S., G. Da San Martino, and A. Moschitti (2015). Structural representations for learning relations
between pairs of texts, Volume 1, pp. 1003–1013. Association for Computational Linguistics (ACL).
Francis, W. N. and H. Kucera (1979). Brown corpus manual. Technical report, Department of Linguistics,
Brown University, Providence, Rhode Island, US.

Giusti, A., D. C. Ciresan, J. Masci, L. M. Gambardella, and J. Schmidhuber (2013). Fast image scanning
with deep max-pooling convolutional neural networks. In Image Processing (ICIP), 2013 20th IEEE
International Conference on, pp. 4034–4038. IEEE.

He, H., K. Gimpel, and J. Lin (2015, September). Multi-perspective sentence similarity modeling with
convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in
Natural Language Processing, Lisbon, Portugal, pp. 1576–1586. Association for Computational Lin-
guistics.

Ji, Y. and J. Eisenstein (2013). Discriminative improvements to distributional sentence similarity. In
EMNLP, pp. 891–896.

Lai, S., L. Xu, K. Liu, and J. Zhao (2015). Recurrent convolutional neural networks for text classification.
In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, pp. 2267–
2273. AAAI Press.

Madnani, N., J. Tetreault, and M. Chodorow (2012). Re-examining machine translation metrics for
paraphrase identification. In Proceedings of the 2012 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, NAACL HLT ’12,
Stroudsburg, PA, USA, pp. 182–190. Association for Computational Linguistics.

Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013). Distributed representations of
words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahra-
mani, and K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26, pp. 3111–
3119. Curran Associates, Inc.

Pascanu, R., C. Gulcehre, K. Cho, and Y. Bengio (2014). How to construct deep recurrent neural
networks.

Socher, R., E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning (2011). Dynamic Pooling
and Unfolding Recursive Autoencoders for Paraphrase Detection. In Advances in Neural Information
Processing Systems 24.

Tai, K. S., R. Socher, and C. D. Manning (2015). Improved semantic representations from tree-structured
long short-term memory networks. CoRR abs/1503.00075.

Xu, W., C. Callison-Burch, and B. Dolan (2015, June). Semeval-2015 task 1: Paraphrase and semantic
similarity in twitter (pit). In Proceedings of the 9th International Workshop on Semantic Evaluation
(SemEval 2015), Denver, Colorado, pp. 1–11. Association for Computational Linguistics.

Yin, W. and H. Schütze (2015). Convolutional neural network for paraphrase identification. In NAACL
HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pp.
901–911.
Study V

Predicting Human Metaphor Paraphrase Judgments with Deep Neural
Networks

Yuri Bizzoni and Shalom Lappin


University of Gothenburg
[email protected]

Abstract

We propose a new annotated corpus for metaphor interpretation by paraphrase, and a novel DNN model for performing this task. Our corpus consists of 200 sets of 5 sentences, with each set containing one reference metaphorical sentence, and four ranked candidate paraphrases. Our model is trained for a binary classification of paraphrase candidates, and then used to predict graded paraphrase acceptability. It reaches an encouraging 75% accuracy on the binary classification task, and high Pearson (.75) and Spearman (.68) correlations on the gradient judgment prediction task.

1 Introduction

Metaphor is an increasingly studied phenomenon in computational linguistics. But while metaphor detection has received considerable attention in the NLP literature (Dunn et al., 2014; Veale et al., 2016) and in corpus linguistics (Krennmayr, 2015) in recent years, not much work has focused on the task of metaphor paraphrasing - assigning an appropriate interpretation to a metaphorical expression. Moreover, there are few (if any) annotated corpora of metaphor paraphrases (Shutova and Teufel, 2010). The main papers in this area are Shutova (2010), and Bollegala and Shutova (2013). The first applies a supervised method combining WordNet and distributional word vectors to produce the best paraphrase of a single verb used metaphorically in a sentence. The second approach, conceptually related to the first, builds an unsupervised system that, given a sentence with a single metaphorical verb and a set of potential paraphrases, selects the most accurate candidate through a combination of mutual information scores and distributional similarity.
Despite the computational and linguistic interest of this task, little research has been devoted to it.
Some quantitative analyses of figurative language have involved metaphor interpretation and paraphrasing. These focus on integrating paraphrase into automatic Textual Entailment frames (Agerri, 2008), or on exploring the properties of distributional semantics in larger-than-word structures (Turney, 2013). Alternatively, they study the sentiment features of metaphor usage (Mohammad et al., 2016; Kozareva, 2015). This last aspect of figurative interpretation is considered a particularly hard task and has generated several approaches.
The task of metaphor interpretation is a particular case of paraphrase detection, although this characterization is not unproblematic, as we will see in Section 6.
In Bollegala and Shutova (2013), metaphor paraphrase is treated as a ranking problem. Given a metaphorical usage of a verb in a short sentence, several candidate literal sentences are retrieved from the Web and ranked. This approach requires the authors to create a gradient score to label their paraphrases, a perspective that is now gaining currency in broader semantic similarity tasks (Xu et al., 2015; Agirre et al., 2016).
Mohammad et al. (2016) resort to metaphor paraphrasing in order to perform a quantitative study on the emotions associated with the usage of metaphors. They create a small corpus of paraphrase pairs formed from a metaphorical expression and a literal equivalent. They ask candidates to judge the degree of "emotionality" conveyed by the metaphorical and the literal expressions. While the study has shown that metaphorical paraphrases are generally perceived as more emotionally charged than their literal equivalents, a corpus of this kind has not been used to train a computational model for metaphor paraphrase scoring.
In this paper we present a new dataset for
metaphor paraphrase identification and ranking. In our corpus, paraphrase recognition is treated as an ordering problem, where sets of sentences are ranked with respect to a reference metaphor sentence.
The main difference with respect to existing work in this field consists in the syntactic and semantic diversity covered by our dataset. The metaphors in our corpus are not confined to a single part of speech. We introduce metaphorical examples of nouns, adjectives, verbs and a number of multi-word metaphors.
Our corpus is, to the best of our knowledge, the largest existing dataset for metaphor paraphrase detection and ranking.
As we describe in Section 2, it is composed of groups of five sentences: one metaphor, and four candidates that can be ranked as its literal paraphrases.
The inspiration for the structure of our dataset comes from a recent work on paraphrase (Bizzoni and Lappin, 2017), where a similarly organized dataset was introduced to deal with paraphrase detection.
In our work, we use an analogous structure to model metaphor paraphrase. Also, while Bizzoni and Lappin (2017) present a corpus annotated by a single human, each paraphrase set in our corpus was judged by 20 different Amazon Mechanical Turk (AMT) annotators, making the grading of our sentences more robust and reliable (see Section 2.1).
We use this corpus to test a neural network model formed by a combination of Convolutional Neural Networks (CNNs) and Long Short Term Memory Recurrent Neural Networks (LSTM RNNs). We test this model on two classification problems: (i) binary paraphrase classification and (ii) paraphrase ranking. We show that our system can achieve significant correlation with human judgments on the ranking task as a by-product of supervised binary learning. To the best of our knowledge, this is the first work in metaphor paraphrasing to use supervised gradient representations.

2 A New Corpus for Metaphor Paraphrase Evaluation

We present a dataset for metaphor paraphrase designed to allow users to rank non-metaphorical candidates as paraphrases of a metaphorical sentence or expression. Our corpus is formed of 200 sets of five sentence paraphrase candidates for a metaphorical sentence or expression.1

1 Our annotated data set and the code for our model is available at https://github.com/yuri-bizzoni/Metaphor-Paraphrase.

In each set, the first sentence contains a metaphor, and it provides the reference sentence to be paraphrased. The remaining four sentences are labeled on a 1-4 scale based on the degree to which they paraphrase the reference sentence. This is on analogy with the annotation frame used for SemEval Semantic Similarity tasks (Agirre et al., 2016). Broadly, our labels represent the following categories:

1 Two sentences cannot be considered paraphrases.
2 Two sentences cannot be considered paraphrases, but they show a degree of semantic similarity.
3 Two sentences could be considered paraphrases, although they present some important difference in style or content (they are not strong paraphrases).
4 Two sentences are strong paraphrases.

On average, every group of five sentences contains a strong paraphrase, a loose paraphrase and two non-paraphrases, one of which may use some relevant words from the metaphor in question.2

2 Some of the problems raised by the concept of paraphrase in figurative language are discussed in Section 6.

The following examples illustrate these ranking labels.

• Metaphor: The crowd was a river in the street

– The crowd was large and impetuous in the street. Score: 4
– There were a lot of people in the street. Score: 3
– There were few people in the street. Score: 2
– We reached a river at the end of the street. Score: 1

We believe that this annotation scheme is useful. While it sustains graded semantic similarity labels, it also provides sets of semantically related
elements, each one of which can be scored or ordered independently of the others. Therefore, the metaphorical sentence can be tested separately for each literal candidate in the set in a binary classification task.
In the test phase, the annotation scheme allows us to observe how a system represents the similarity between a metaphorical and a literal sentence by taking the scores of two candidates as points of relative proximity to the metaphor.
It can be argued that a good literal paraphrase of a metaphor needs to compensate to some extent for the expressive or sentimental bias that a metaphor usually supplies, as argued in Mohammad et al. (2016). In general a binary classification can be misleading because it conceals the different levels of similarity between competing candidates.
For example, the literal sentence Republican candidates during the convention were terrible can be considered to be a loose paraphrase of the metaphor The Republican convention was a horror show, or alternatively, as a semantically related non-paraphrase. Which of these conclusions we adopt depends on our decision concerning how much interpretative content a literal sentence needs to provide in order to qualify as a valid paraphrase of a metaphor. The question whether the two sentences are acceptable paraphrases or not can be hard to answer. By contrast, it would be far fetched to suggest that The Republican convention was a joy to follow is a better or even equally strong literal paraphrase for The Republican convention was a horror show.
In this sense, the sentences Her new occupation was a dream come true and She liked her new occupation can be considered to be loose paraphrases, in that the term liked can be judged an acceptable, but not ideal interpretation of the more intense metaphorical expression a dream come true. By contrast, She hated her new occupation cannot be plausibly regarded as more similar in meaning than She liked her new occupation to Her new occupation was a dream come true.
Our training dataset is divided into four main sections:

1. Noun phrase Metaphors: My lawyer is an angel.
2. Adjective Metaphors: The rich man had a cold heart.
3. Verb Metaphors: She cut him down with her words.
4. Multi-word Metaphors: The seeds of change were planted in 1943.

All these sentences and their candidates were manually produced to insure that for each group we have a strong literal paraphrase, a loose literal paraphrase and two semantically related non-paraphrases. Here "semantically related" can indicate either a re-use of the metaphorical words to express a different meaning, or an unacceptable interpretation of the reference metaphor.
Although the paraphrases were generated freely and cover a number of possible (mis)interpretations, we did take several issues into account. For example, for sentiment related metaphors two opposite interpretations are often proposed, forcing the system to make a choice between two sentiment poles when ranking the paraphrases (I love my job – I hate my job for My job is a dream). In general, antonymous interpretations (Time passes very fast – Time is slow for Time flies) are listed, when possible, among the four competing choices.
Our corpus has the advantage of being suitable for both binary classification and gradient paraphrase judgment prediction. For the former, we map every score over a given gradient threshold label to 1, and scores below that threshold to 0. For gradient classification, we use all the scoring labels to test the correlation between the system's ordered predictions and human judgments. We will show how, once a model has been trained for a binary detection task, we can evaluate its performance on the gradient ordering task.
We stress that our corpus is under development. As far as we know it is unique for the kind of task we are discussing. The main difficulty in building this corpus is that there is no obvious way to collect the data automatically. Even if there were a procedure to extract pairs of paraphrases containing a metaphoric element semi-automatically, it does not seem possible to generate alternative paraphrase candidates automatically.
The reference sentences we chose were either selected from published sources or created manually by the authors. In all cases, the paraphrase candidates had to be crafted manually. We tried to keep a balanced diversity inside the corpus. The dataset is divided among metaphorically used Nouns, Adjectives and Verbs, plus a section of
Multi Word metaphors. The corpus is an attempt to represent metaphor in different parts of speech.
A native speaker of English independently checked all the sentences for acceptability.

2.1 Collecting judgments through AMT

Originally, one author individually annotated the entire corpus. The difference between strong and loose literal paraphrases can be a matter of individual sensibility.
While such annotations could be used as the basis for a preliminary study, we needed more judgments to build a statistically reliable annotated dataset. Therefore we used crowd sourcing to solicit judgments from large numbers of annotators.
We collected human judgments on the degree of paraphrasehood for each pair of sentences in a set (with the reference metaphor sentence in the pair) through Amazon Mechanical Turk (AMT).
Annotators were presented with four metaphor - candidate paraphrase pairs, all relating to the same metaphor. They were asked to express a judgment between 1 and 4, according to the scheme given above.
We collected 20 human judgments for each metaphor - candidate paraphrase pair. Analyzing individual annotators' response patterns, we were able to filter out a small number of "rogue" annotators (less than 10%). This filtering process was based on annotators' answers to some control elements inserted in the corpus, and evaluation of their overall performance. For example, an annotator who consistently assigned the same score to all sentences is classified as "rogue".
We then computed the mean judgment for each sentence pair and compared it with the original judgments expressed by one of the authors. We found a high Pearson correlation between the annotators' mean judgments and the author's judgment of close to 0.93.
The annotators' understanding of the problem and their evaluation of the sentence pairs seem, on average, to correspond very closely to that of our original single annotator. The high correlation also suggests a small level of variation from the mean across AMT annotators. Finally, a similar correlation strengthens the hypothesis that paraphrase detection is better modeled as an ordering, rather than a binary, task. If this had not been the case, we would expect more polarized judgments tending towards the highest and lowest scores, instead of the more evenly distributed judgment patterns that we observed.
These mean judgments appear to provide reliable data for supervision of a machine learning model. We thus set the upper bound for the performance of a machine learning algorithm trained on this data to be around .9, on the basis of the Pearson correlation with the original single annotator scores. In what follows, we refer to the mean judgments of AMT annotators as our gold standard when evaluating our results, unless otherwise indicated.
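The aggregation step can be sketched roughly as follows; the pandas DataFrame layout (columns annotator, pair_id, score) is an assumption for illustration, and the actual filtering also relied on control items rather than on response variance alone.

# Minimal sketch of annotator filtering and mean-judgment computation, under the
# assumed DataFrame layout described above.
import pandas as pd

def aggregate_judgments(responses: pd.DataFrame) -> pd.Series:
    """Drop 'rogue' annotators and return the mean 1-4 score for every sentence pair."""
    per_annotator = responses.groupby("annotator")["score"]
    rogue = per_annotator.std().fillna(0.0) == 0.0          # e.g. same score for every item
    kept = responses[~responses["annotator"].isin(rogue[rogue].index)]
    return kept.groupby("pair_id")["score"].mean()          # gold standard used for evaluation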
3 A DNN for Metaphor Paraphrase Classification

For classification and gradient judgment prediction we constructed a deep neural network. Its architecture consists of three main components:

1. Two encoders that learn the representation of two sentences separately
2. A unified layer that merges the output of the encoders
3. A final set of fully connected layers that operate on the merged representation of the two sentences to generate a judgment.

The encoder for each pair of sentences taken as input is composed of two parallel Convolutional Neural Networks (CNNs) and LSTM RNNs, feeding two sequenced fully connected layers. We use an "Atrous" CNN (Chen et al., 2016). Interestingly, classical CNNs only decrease our accuracy by approximately two points and reach a good F1 score, as Table 1 indicates.
Using a CNN (we apply 25 filters of length 5) as a first layer proved to be an efficient strategy. While CNNs were originally introduced in the field of computer vision, they have been successfully applied to problems in computational semantics, such as text classification and sentiment analysis (Lai et al., 2015), as well as to paraphrase recognition (Socher et al., 2011). In NLP applications, CNNs usually abstract over a series of word- or character-level embeddings, instead of pixels. In this part of our model, the encoder learns a more compact representation of the sentence, with reduced vector space dimensions and features. This permits the entire DNN to focus on the information most relevant to paraphrase identification.
The output of each CNN is passed through a max pooling layer to an LSTM RNN. Since the CNN and the max pooling layer perform discriminative reduction of the input's dimensions, we can run a relatively small LSTM RNN model (20 hidden units). In this phase, the vector dimensions of the sentence representation are further reduced, with relevant information conserved and highlighted, particularly for the sequential structure of the data. Each encoder is completed by two successive fully connected layers, of dimensions 15 and 10 respectively, the first one having a 0.5 dropout rate.

Figure 1: Example of an encoder. Input is passed to a CNN, a max pooling layer, an LSTM RNN, and finally two fully connected layers, the first having a dropout rate of .5. The input's and output's shape is indicated in brackets for each layer.

Each sentence is thus transformed to a 10 dimensional vector. To perform the final comparison, these two low dimensional vectors are passed to a layer that merges them into a single vector. We tried several ways of merging the encoders' outputs, and we found that simple vector concatenation was the best option. We produce a 20 dimensional two-sentence vector as the final output of the DNN.
We do not apply any special mechanism for "comparison" or "alignment" in this phase. To measure the similarity of two sequences our model makes use only of the information contained in the merged vector that the encoders produce. We did not use a device in the merging phase to assess similarity between the two sequences. This allows a high degree of freedom in the interpretation patterns we are trying to model, but it also involves a fair amount of noise, which increases the risk of error.
The merging layer feeds the concatenated input to a final fully connected layer. The last layer applies a sigmoid function to produce the judgments. The advantage of using a sigmoid function in this case is that, while it performs well for binary classification, it returns a gradient over its input, thus generating an ordering of values appropriate for the ranking task. The combination of these three kinds of Neural Networks in this order (CNN, LSTM RNN and fully connected layers) has been explored in other works, with interesting results (Sainath et al., 2015). This research has indicated that these architectures can complement each other in complex semantic tasks, such as sentiment analysis (Wang et al., 2016) and text representation (Vosoughi et al., 2016).
The fundamental idea here is that these three kinds of Neural Network capture information in different ways that can be combined to achieve a better global representation of sentence input. While a CNN can reduce the spectral variance of input, an LSTM RNN is designed to model its sequential temporal dimension. At the same time, an LSTM RNN's performance can be strongly improved by providing it with better features (Pascanu et al., 2014), such as the ones produced by a CNN, as happens in our case. The densely connected layers contribute a clearer, more separable final vector representation of one sentence.
To encode the original sentences we used Word2Vec embeddings pre-trained on the very large Google News dataset (Mikolov et al., 2013). We used these embeddings to create the input sequences for our model.
We take as a baseline for evaluating our model the cosine similarity of the sentence vectors, obtained through combining their respective pre-trained lexical embeddings. This baseline gives very low accuracy and F1 scores.
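The baseline can be sketched in a few lines; this assumes the gensim KeyedVectors interface, the standard GoogleNews-vectors-negative300.bin file, and simple whitespace tokenization, none of which is necessarily how the baseline was actually computed.

# Sketch of the sentence-vector baseline: mean of Word2Vec vectors, cosine similarity.
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def sentence_vector(sentence):
    vectors = [w2v[w] for w in sentence.split() if w in w2v]
    return np.mean(vectors, axis=0) if vectors else np.zeros(w2v.vector_size)

def cosine_baseline(metaphor, candidate):
    a, b = sentence_vector(metaphor), sentence_vector(candidate)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# e.g. cosine_baseline("My lawyer is an angel", "My lawyer is a good person")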
4 Binary Classification Task

As discussed above, our corpus can be applied to model two sub-problems: binary classification and paraphrase ordering.
4 Binary Classification Task

As discussed above, our corpus can be applied to model two sub-problems: binary classification and paraphrase ordering.

To use our corpus for a binary classification task, we map each set of five sentences into a series of pairs, where the first element is the metaphor we want to interpret and the second element is one of its four literal candidates. Gradient labels are then replaced by binary ones: we consider all labels higher than 2 as positive judgments (Paraphrase) and all labels less than or equal to 2 as negative judgments (Non-Paraphrase), reflecting the ranking discussed in Section 2. We train our model with these labels for a binary metaphor paraphrase detection task.
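The mapping from gradient ratings to binary labels described above is straightforward to implement. The sketch below is illustrative; the function and field names are ours, not those of the released corpus.

```python
# Sketch of the pair construction and label binarization described above.
def set_to_binary_pairs(metaphor, candidates, mean_ratings):
    """Turn one set (a metaphor and its four literal candidates, each with a
    mean gradient rating on the 1-4 scale) into binary-labelled pairs."""
    pairs = []
    for candidate, rating in zip(candidates, mean_ratings):
        label = 1 if rating > 2 else 0   # >2 = Paraphrase, <=2 = Non-Paraphrase
        pairs.append(((metaphor, candidate), label))
    return pairs

example = set_to_binary_pairs(
    "My life in California was a dream",
    ["I had a dream once",
     "While living in California I had a dream",
     "My life in California was nice, I enjoyed it",
     "My life in California was absolutely great"],
    [1.0, 2.0, 3.0, 4.0],
)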
Keeping the order of the input fixed (we will discuss this issue below), we ran the training phase for 15 epochs. We reached an average accuracy of 67% with 12-fold cross-validation. Interestingly, when trained on the pre-defined training set only, our model reaches a higher accuracy of 75%. We strongly suspect that this discrepancy in performance is due to the small training and test sets created by the partitions of the 12-fold cross-validation process.

Model | Accuracy | F1
Baseline (cosine similarity) | 50.8 | 10.1
Our model | 75.2 | 74.6
Encoders without LSTM | 64.4 | 64.9
Encoders without ACNN | 62.6 | 61.5
Using CNN instead of ACNN | 61.0 | 61.6
ACNN with 10 filters | 73.4 | 71.7
LSTM with 10 filters | 72.3 | 70.6
Merging via multiplication | 53.4 | 69.6
Aligner | 49.4 | 61.6
Aligner + our model | 73.4 | 75.

Table 1: Accuracy for different versions of the model, and the baseline. Each version ran on our standard train and test data, without performing cross-validation. We use as a baseline the cosine similarity between the mean of the word vectors composing each sentence.

In general, this task is particularly hard, both because of the complexity of the semantic properties involved in accurate paraphrase (see 4.1) and because of the limited size of the training set. It seems to us that an average accuracy of 67% on a 12-fold partitioning of training and test sets is a reasonable result, given the size of our corpus. We observe that our architecture learned to recognize different semantic phenomena related to metaphor interpretation with a promising level of accuracy, but such phenomena need to be represented in the training set.

In light of the fact that previous work in this field is concerned with single-verb paraphrase ranking (Bollegala and Shutova, 2013), where the metaphorical element is explicitly identified and the candidates don't contain any syntactic-semantic expansion, our results are encouraging.3

Although a small corpus may cause instability in results, our DNN seems able to generalize with relative consistency on the following patterns:

• Sentiment. My life in California was a nightmare – My life in California was terrible. Our system seems able to discriminate the right sentiment polarity of a metaphor by picking the right paraphrase, even when some candidates contain sentiment words of opposite polarity, which are usually very similar in a distributional space.

• Non-metaphorical word re-use. Our system seems able, in several cases, to discriminate the correct paraphrase for a metaphor even when some candidates re-use the words of the metaphor to convey a (wrong) literal meaning. My life in California was a dream – I lived in California and had a dream.

• Multi-word metaphors. Although well represented in our corpus, multi-word metaphors are in some respects the most difficult to paraphrase correctly, since the interpretation has to be extended over a number of words. Nonetheless, our model was able to handle these correctly in a number of situations. You can plant the seeds of anger – You can act in a way that will engender rage.

However, our model had trouble with several other cases. It seems to have particular difficulty in discriminating sentiment intensity, with higher scores assigned to paraphrases that value the sentiment intensity of the metaphor, which creates problems in several instances. Also, cases of metaphoric exaggeration (My roommate is a sport maniac – My roommate is a sport person), negation (My roommate was not an eagle – My roommate was dumb.) and syntactic inversion pose difficulties for our model.

We found that our model is able to abstract over specific patterns but, predictably, it has difficulty learning when the semantic focus of an interpretation consists in a phrase that is under-represented in the training data.

3 It should be noted that Bollegala and Shutova (2013) employ an unsupervised approach.
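A 12-fold evaluation of the kind reported above could be organized as in the following sketch, where train_and_evaluate stands in for a full training run of the DNN; the helper is an assumption for illustration, not our actual training script.

```python
# Sketch of the 12-fold cross-validation protocol described above.
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(pairs, labels, train_and_evaluate, n_folds=12, seed=0):
    """`pairs` and `labels` are the binary-labelled metaphor-paraphrase pairs;
    `train_and_evaluate` is a stand-in for training the DNN and returning accuracy."""
    pairs, labels = np.asarray(pairs, dtype=object), np.asarray(labels)
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True,
                                     random_state=seed).split(pairs):
        acc = train_and_evaluate(pairs[train_idx], labels[train_idx],
                                 pairs[test_idx], labels[test_idx])
        scores.append(acc)
    return float(np.mean(scores))   # average accuracy over the 12 folds
```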
In some cases, the effect of data scarcity can be observed in an "overfit weighting" of specific terms. Some words that were seen in the data only once are associated with a high or low score independently of their context, degrading the overall performance of the model. We believe that these idiosyncrasies can be overcome through training on a larger data set.

4.1 The gray areas of interpretation
We observe that, on occasion, the model's errors fall into a gray area between clear paraphrase and clear non-paraphrase. Here the correctness of a label is not obvious.

These cases are particularly important in metaphor paraphrasing, since this task requires an interpretative leap from the metaphor to its literal equivalent. For example, the pair I was home watching the days slip by from my window – I was home thinking about the time I was wasting can be considered as a loose paraphrase pair. Alternatively, it can be regarded as a case of non-paraphrase, since the second element introduces some interpretative elements (I was thinking about the time) that are not in the original.

In our test set we labeled it as 3 (loose paraphrase), but if our system fails to label it correctly in a binary task, it is not entirely clear that it is making an error. For these cases, the approach presented in the next section is particularly useful.
5 Paraphrase Ordering Task

The high degree of correlation we found between the AMT annotations and our single annotator's judgments indicates that we can use this dataset for an ordering task as well. Since the human judgments we collected about the "degree of paraphrasehood" are quite consistent, it is reasonable to pursue a non-binary approach.

Once the DNN has learned representations for binary classification, we can apply it to rank the sentences of the test set in order of similarity. We apply the sigmoid value distribution for the candidate sentences in a set of five (the reference and four candidates) to determine the ranking. To do this we use the original structure of our dataset, composed of sets of five sentences. First, we assign a similarity score to all pairs of sentences (reference sentence and candidate paraphrase) in a set. This is the similarity score learned in the binary task, so it is determined by the sigmoid function applied to the output.

The following is an example of an ordered set with strong correlation between the model's predictions and our annotations (the model's score is given first, followed by the human annotation label):

• The candidate is a fox
– 0.13 (1) The candidate owns a fox
– 0.30 (2) The candidate is stupid
– 0.41 (3) The candidate is intelligent
– 0.64 (4) The candidate is a cunning person

We compute the average Pearson and Spearman correlations over all sets of the test corpus, to check the extent to which the ranking that our DNN produces matches our mean crowd-sourced human annotations. While Pearson correlation measures the relationship between two continuous variables, Spearman correlation evaluates the monotonic relation between two variables, continuous or ordinal. Since the first of our variables, the model's judgment, is continuous, while the second, the human labels, is ordinal, both measures are of interest.

We found comparable and meaningful correlations between mean AMT rankings and the ordering that our model predicts, on both metrics. On the balanced training and test set, we achieve an average Pearson correlation of 0.75 and an average Spearman correlation of 0.68. On a twelve-fold cross-validation frame, we achieve an average Pearson correlation of 0.55 and an average Spearman correlation of 0.54. We chose twelve-fold cross-validation because it is the smallest partition we can use to get meaningful results. We conjecture that the average cross-fold validation performance is lower because of the small size of the training data in each fold. These results are displayed in Table 2.4

These correlations indicate that our model achieves an encouraging level of accuracy in predicting our gradient annotations for the candidate sentences in a set when trained for a binary classification task.

4 As discussed above, the upper bound for our model's performance can be set at 0.9, the correlation between our single annotator's and the mean crowd-sourced judgments.
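The ranking evaluation described above amounts to computing, for each set of four candidates, the Pearson and Spearman correlations between the model's sigmoid scores and the mean human ratings, and then averaging over sets. A minimal sketch, with assumed variable names, is given below.

```python
# Sketch of the per-set correlation evaluation described above.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def average_set_correlations(sets):
    """`sets` is a list of (model_scores, human_ratings) pairs, one per
    group of four candidate sentences."""
    pearsons, spearmans = [], []
    for model_scores, human_ratings in sets:
        pearsons.append(pearsonr(model_scores, human_ratings)[0])
        spearmans.append(spearmanr(model_scores, human_ratings)[0])
    return float(np.mean(pearsons)), float(np.mean(spearmans))

# Example with the ordered set shown above ("The candidate is a fox"):
r, rho = average_set_correlations([([0.13, 0.30, 0.41, 0.64], [1, 2, 3, 4])])
```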
Measure | 12-fold value | Baseline
Accuracy | 67 | 51
Pearson correlation | 0.553 | 0.151
Spearman correlation | 0.545 | 0.113

Table 2: Accuracy and ranking correlation for twelve-fold cross-validation. It can be seen that the simple cosine similarity between the mean vectors of the two sentences, which we use as a baseline, returns a low correlation with human judgments.

This task differs from the binary classification task in several important respects. In one way, it is easier: a non-paraphrase can be misjudged as a paraphrase and still appear in the right order within a ranking. In another sense, it is more difficult: strict paraphrases, loose paraphrases, and various kinds of semantically similar non-paraphrases have to be ordered in accord with human judgment patterns, which is a more complex task than simple binary classification.
We should consider to what extent this task is different from a multi-class categorization problem. Broadly, multi-class categorization requires a system for linking a pair of sentences to a specific class of similarity. This is dependent upon the classes defined by the annotator and presented in the training phase. In several cases determining these ranked categories might be problematic. A class corresponding to our label "3", for example, could contain many different phenomena related to metaphor paraphrase: expansions, reformulations, reduction in the expressivity of the sentence, or particular interpretations of the metaphor's meaning. Our way of formulating the ordering task allows us to overcome this problem. A paraphrase containing an expansion and a paraphrase involving some information loss, both labeled "3", might receive quite different scores, but they still fall between all "2" elements and all "4" elements in a ranking.

We can see that our gradient ranking system provides a more nuanced view of the paraphrase relation than a binary classification. Consider the following example (the model's score is given first, followed by the human annotation label):

• My life in California was a dream
– 0.03 (1) I had a dream once
– 0.05 (2) While living in California I had a dream
– 0.11 (3) My life in California was nice, I enjoyed it
– 0.58 (4) My life in California was absolutely great

The human annotators consider the pair My life in California was a dream – My life in California was nice, I enjoyed it as loose paraphrases, while the model scored it very low. But the difference in sentiment intensity between the metaphor and the literal candidate renders the semantic relation between the two sentences less than perspicuous. Such intensity is instead present in My life in California was absolutely great, marked as a more valid paraphrase (score 4). On the other hand, it is clear that in the choice between While living in California I had a dream and My life in California was nice, I enjoyed it, the latter is a more reasonable interpretation of the metaphor. The annotators' relative mean ranking has been sustained by our model, even if its absolute scoring involves an error in binary classification.

The correlation between AMT annotation ordering and our model's predictions is a by-product of supervised binary learning. Since we are re-using the predictions of a binary classification task, we consider it a form of transfer learning from a supervised binary context to an unsupervised ordering task. In this case, our corpus allows us to perform double transfer learning. First, we used pretrained word embeddings, trained to maximize single words' contextual similarity, in order to train on a supervised binary paraphrase dataset. Then we use the representations acquired in this way to perform an ordering task for which the DNN had not been trained.

The fact that ranked correlations are sustained through binary paraphrase classification is not an obvious result. In principle, a model trained on {0,1} labels could "polarize" its scores to the point where no meaningful ordering would be available. Had this happened, good performance in a binary task would actually conceal the loss of important semantic information. The fact that there is no necessary connection between binary classification and prediction of gradient labels, and that an increase in one can even produce a loss in the other, is pointed out in Xu et al. (2015), who discuss the relation of paraphrase identification to the recognition of semantic similarity.
6 The Nature of the Metaphor Interpretation Task

Although this task resembles a particular case of paraphrase detection, in many respects it is something different. While paraphrase detection concerns learning content identity or strong cases of semantic similarity, our task involves the interpretation of figurative language.

In a traditional paraphrase task, we should maintain that "The candidate is a fox" and "The candidate is cunning" are invalid paraphrases. First, the superficial informational content of the two sentences is different. Second, without further context we might assume that the candidate is an actual fox. We ignore the context of the phrase.

In this task the frame is different. We assume that the first sentence contains a metaphor. We summarize this task by the following question: given that X is a metaphor, which one of the given candidates would be its best literal interpretation?

We trained our model to move along a similar learning pattern. This training frame can produce the apparent, but false, paradox that two acceptable paraphrases such as The Council is on fire and The Council is burning are assigned a low score by our model. If the first element is a metaphor, the second element is, in fact, a bad literal interpretation. A higher score is correctly assigned to the candidate People in the Council are very excited.

7 Conclusions

We present a new kind of corpus to evaluate metaphor paraphrase detection, following the approach presented in Bizzoni and Lappin (2017) for paraphrase grading, and we construct a novel type of DNN architecture for a set of metaphor interpretation tasks. We show that our model learns an effective representation of sentences, starting from the distributional representations of their words. Using word embeddings trained on very large corpora proved to be a fruitful strategy. Our model is able to retrieve from the original semantic spaces not only the primary meaning or denotation of words, but also some of the more subtle semantic aspects involved in the metaphorical use of terms.

We based our corpus design on the view that paraphrase ranking is a useful way to approach the metaphor interpretation problem. We show how this kind of corpus can be used both for supervised learning of binary classification and for gradient judgment prediction.

The neural network architecture that we propose encodes each sentence in a 10-dimensional vector representation, combining a CNN, an LSTM RNN, and two densely connected neural layers. The two input representations are merged through concatenation and fed to a series of densely connected layers. We show that such an architecture is able, to an extent, to learn metaphor-to-literal paraphrase.

While binary classification is learned in the training phase, it yields a robust correlation in the ordering task through the softmax sigmoid distributions generated for binary classification. The model learns to classify a sentence as a valid or invalid literal interpretation of a given metaphor, and it retains enough information to assign a gradient value to sets of sentences in a way that correlates with our crowd-sourced annotation.

Our model doesn't use any "alignment" of the data. The encoders' representations are simply concatenated. This gives our DNN considerable flexibility in modeling interpretation patterns. It can also create complications where a simple alignment of two sentences might suffice to identify a similarity. We have considered several possible alternative versions of this model to tackle this issue.

In future work we will expand the size and variety of our corpus. We will perform a detailed error analysis of our model's predictions, and we will further explore different kinds of neural network designs for paraphrase detection and ordering. Finally, we intend to study this task "the other way around", by detecting the most appropriate metaphor to paraphrase a literal reference sentence or phrase.

Acknowledgments

We are grateful to our colleagues in the Centre for Linguistic Theory and Studies in Probability (CLASP), FLoV, at the University of Gothenburg for useful discussion of some of the ideas presented in this paper, and to three anonymous reviewers for helpful comments on an earlier draft. The research reported here was done at CLASP, which is supported by a 10 year research grant (grant 2014-39) from the Swedish Research Council.
References

Rodrigo Agerri. 2008. Metaphor in textual entailment. In COLING 2008, 22nd International Conference on Computational Linguistics, Posters Proceedings, 18-22 August 2008, Manchester, UK, pages 3–6. http://www.aclweb.org/anthology/C08-2001.

Eneko Agirre, Carmen Banea, Daniel M. Cer, Mona T. Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, San Diego, CA, USA, June 16-17, 2016, pages 497–511. http://aclweb.org/anthology/S/S16/S16-1081.pdf.

Yuri Bizzoni and Shalom Lappin. 2017. Deep learning of binary and gradient judgments for semantic paraphrase. In Proceedings of IWCS 2017.

Danushka Bollegala and Ekaterina Shutova. 2013. Metaphor interpretation using paraphrases extracted from the web. PloS ONE 8(9):e74304.

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. 2016. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. CoRR abs/1606.00915. http://arxiv.org/abs/1606.00915.

Jonathan Dunn, Jon Beitran De Heredia, Maura Burke, Lisa Gandy, Sergey Kanareykin, Oren Kapah, Matthew Taylor, Dell Hines, Ophir Frieder, David Grossman, et al. 2014. Language-independent ensemble approaches to metaphor identification. In 28th AAAI Conference on Artificial Intelligence, AAAI 2014. AI Access Foundation.

Zornitsa Kozareva. 2015. Multilingual affect polarity and valence prediction in metaphors. In Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, WASSA@EMNLP 2015, 17 September 2015, Lisbon, Portugal, page 1. http://aclweb.org/anthology/W/W15/W15-2901.pdf.

Tina Krennmayr. 2015. What corpus linguistics can tell us about metaphor use in newspaper texts. Journalism Studies 16(4):530–546.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 2267–2273. AAAI Press. http://dl.acm.org/citation.cfm?id=2886521.2886636.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Saif Mohammad, Ekaterina Shutova, and Peter D. Turney. 2016. Metaphor as a medium for emotion: An empirical study. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, *SEM@ACL 2016, Berlin, Germany, 11-12 August 2016. http://aclweb.org/anthology/S/S16/S16-2003.pdf.

Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. How to construct deep recurrent neural networks. In Proceedings of the Second International Conference on Learning Representations (ICLR 2014).

Tara N. Sainath, Oriol Vinyals, Andrew W. Senior, and Hasim Sak. 2015. Convolutional, long short-term memory, fully connected deep neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pages 4580–4584. https://doi.org/10.1109/ICASSP.2015.7178838.

Ekaterina Shutova. 2010. Automatic metaphor interpretation as a paraphrasing task. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 1029–1037. Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=1857999.1858145.

Ekaterina Shutova and Simone Teufel. 2010. Metaphor corpus annotated for source-target domain mappings. In LREC, volume 2, pages 2–2.

Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems 24.

Peter D. Turney. 2013. Distributional semantics beyond words: Supervised learning of analogy and paraphrase. CoRR abs/1310.5042. http://arxiv.org/abs/1310.5042.

Tony Veale, Ekaterina Shutova, and Beata Beigman Klebanov. 2016. Metaphor: A Computational Perspective. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers. https://doi.org/10.2200/S00694ED1V01Y201601HLT031.

Soroush Vosoughi, Prashanth Vijayaraghavan, and Deb Roy. 2016. Tweet2vec: Learning tweet embeddings using character-level CNN-LSTM encoder-decoder. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '16, pages 1041–1044. ACM, New York, NY, USA. https://doi.org/10.1145/2911451.2914762.

Jin Wang, Liang-Chih Yu, K. Robert Lai, and Xuejie Zhang. 2016. Dimensional sentiment analysis using a regional CNN-LSTM model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers. http://aclweb.org/anthology/P/P16/P16-2037.pdf.

Wei Xu, Chris Callison-Burch, and Bill Dolan. 2015. SemEval-2015 task 1: Paraphrase and semantic similarity in Twitter (PIT). In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 1–11. Association for Computational Linguistics, Denver, Colorado. http://www.aclweb.org/anthology/S15-2001.
Study VI

The Effect of Context on Metaphor Paraphrase Aptness Judgments

Yuri Bizzoni and Shalom Lappin
University of Gothenburg
[email protected], [email protected]
Abstract

We conduct two experiments to study the effect of context on metaphor paraphrase aptness judgments. The first is an AMT crowd source task in which speakers rank metaphor-paraphrase candidate sentence pairs in short document contexts for paraphrase aptness. In the second we train a composite DNN to predict these human judgments, first in binary classifier mode, and then as gradient ratings. We found that for both mean human judgments and our DNN's predictions, adding document context compresses the aptness scores towards the center of the scale, raising low out-of-context ratings and decreasing high out-of-context scores. We offer a provisional explanation for this compression effect.
1 Introduction

A metaphor is a way of forcing the normal boundaries of a word's meaning in order to better express an experience, a concept or an idea. To a native speaker's ear some metaphors sound more conventional (like the usage of the words ear and sound in this sentence), others more original. This is not the only dimension along which to judge a metaphor. One of the most important qualities of a metaphor is its appropriateness, its aptness: how good a metaphor is for conveying a given experience or concept. While a metaphor's degree of conventionality can be measured through probabilistic methods, like language models, it is harder to represent its aptness. Chiappe et al. (2003) define aptness as "the extent to which a comparison captures important features of the topic".

It is possible to express an opinion about some metaphors' and similes' aptness (at least to a degree) without previously knowing what they are trying to convey, or the context in which they appear.1 For example, we don't need a particular context or frame of reference to construe the simile She was screaming like a turtle as strange, and less apt for expressing the quality of a scream than She was screaming like a banshee. In this case, the reason why the simile in the second sentence works best is intuitive. A salient characteristic of a banshee is a powerful scream. Turtles are not known for screaming, and so it is harder to define the quality of a scream through such a comparison, except as a form of irony.2 Other cases are more complicated to decide upon. The simile crying like a fire in the sun (It's All Over Now, Baby Blue, Bob Dylan) is powerfully apt for many readers, but simply odd for others. Fire and sun are not known to cry in any way. But at the same time the simile can capture the association we draw between something strong and intense in other senses - vision, touch, etc. - and a loud cry.

Nonetheless, most metaphors and similes need some kind of context, or external reference point, to be interpreted. The sentence The old lady had a heart of stone is apt if the old lady is cruel or indifferent, but it is inappropriate as a description of a situation in which the old lady is kind and caring. We assume that, to an average reader's sensibility, the sentence models the situation in a satisfactory way only in the first case.

This is the approach to metaphor aptness that we assume in this paper. Following Bizzoni and Lappin (2018), we treat a metaphor as apt in relation to a literal expression that it paraphrases.3 If the metaphor is judged to be a good paraphrase, then it closely expresses the core information of the literal sentence through its metaphorical shift.

1 While it can be argued that metaphors and similes at some level work differently and cannot always be considered as variations of the same phenomenon (Sam and Catrinel, 2006; Glucksberg, 2008), for this study we treat them as belonging to the same category of figurative language.
2 It is important not to confuse aptness with transparency. The latter measures how easy it is to understand a comparison. Chiappe et al. (2003) claim, for example, that many literary or poetic metaphors score high on aptness and low on transparency, in that they capture the nature of the topic very well, but it is not always clear why they work.
3 Bizzoni and Lappin (2018) apply Bizzoni and Lappin (2017)'s modeling work on general paraphrase to metaphor.
We refer to the prediction of readers' judgments on the aptness candidates for the literal paraphrase of a metaphor as the metaphor paraphrase aptness task (MPAT). Bizzoni and Lappin (2018) address the MPAT by using Amazon Mechanical Turk (AMT) to obtain crowd sourced annotations of metaphor-paraphrase candidate pairs. They train a composite Deep Neural Network (DNN) on a portion of their annotated corpus, and test it on the remaining part. Testing involves using the DNN as a binary classifier on paraphrase candidates. They derive predictions of gradient paraphrase aptness for their test set, and assess them by Pearson coefficient correlation to the mean judgments of their crowd sourced annotation of this set. Both training and testing are done independently of any document context for the metaphorical sentence and its literal paraphrase candidates.

In this paper we study the role of context on readers' judgments concerning the aptness of metaphor paraphrase candidates. We look at the accuracy of Bizzoni and Lappin (2018)'s DNN when trained and tested on contextually embedded metaphor-paraphrase pairs for the MPAT. In Section 2 we describe an AMT experiment in which annotators judge metaphors and paraphrases embodied in small document contexts, and in Section 3 we discuss the results of this experiment. In Section 4 we describe our MPAT modeling experiment, and in Section 5 we discuss the results of this experiment. Section 6 briefly surveys some related work. In Section 7 we draw conclusions from our study, and we indicate directions for future work in this area.

2 Annotating Metaphor-Paraphrase Pairs in Contexts

Bizzoni and Lappin (2018) have recently produced a dataset of paraphrases containing metaphors, designed to allow both supervised binary classification and gradient ranking. This dataset contains several pairs of sentences, where in each pair the first sentence contains a metaphor and the second is a literal paraphrase candidate.

This corpus was constructed with a view to representing a large variety of syntactic structures and semantic phenomena in metaphorical sentences. Many of these structures and phenomena do not occur as metaphorical expressions with any frequency in natural text, and were therefore introduced through hand-crafted examples.

Each pair of sentences in the corpus has been rated by AMT annotators for paraphrase aptness on a scale of 1-4, with 4 being the highest degree of aptness. In Bizzoni and Lappin (2018)'s dataset, sentences come in groups of five, where the first element is the "reference element" with a metaphorical expression, and the remaining four sentences are "candidates" that stand in a degree of paraphrasehood to the reference.

Here is an example of a metaphor-paraphrase candidate pair.

1a. The crowd was a roaring river.
b. The crowd was huge and noisy.

The average AMT paraphrase score for this pair is 4.0, indicating a high degree of aptness.

We extracted 200 sentence pairs from Bizzoni and Lappin (2018)'s dataset and provided each pair with a document context consisting of a preceding and a following sentence,4 as in the following example.

2a. They had arrived in the capital city. The crowd was a roaring river. It was glorious.
b. They had arrived in the capital city. The crowd was huge and noisy. It was glorious.

One of the authors constructed most of these contexts by hand. In some cases, it was possible to locate the original metaphor in an existing document. This was the case for

(i) literary metaphors extracted from poetry or novels, and

(ii) short conventional metaphors (The President brushed aside the accusations, Time flies) that can be found, with small variations, in a number of texts.

4 Our annotated data set and the code for our model are available at https://github.com/yuri-bizzoni/Metaphor-Paraphrase.
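The sub-corpus just described pairs each metaphor-paraphrase candidate with a preceding and a following sentence. A minimal sketch of how one such contextualized entry might be represented is given below; the field names are illustrative assumptions and do not reflect the exact format of the released dataset.

```python
# Illustrative structure for one contextually embedded pair (field names are assumptions).
contextualized_pair = {
    "preceding_context": "They had arrived in the capital city.",
    "metaphor":          "The crowd was a roaring river.",
    "literal_candidate": "The crowd was huge and noisy.",
    "following_context": "It was glorious.",
    "out_of_context_rating": 4.0,   # mean AMT score for the bare pair
    "in_context_rating":     None,  # to be filled by the new annotation task
}
```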
For these cases, a variant of the existing context was added to both the metaphorical and the literal sentences. We introduced small modifications to keep the context short and clear, and to avoid copyright issues. We lightly modified the contexts of metaphors extracted from corpora when the original context was too long, i.e. when the contextual sentences of the selected metaphor were longer than the maximum length we specified for our corpus. In such cases we reduced the length of the sentence, while sustaining its meaning.

The context was designed to sound as natural as possible. Since the same context is used for metaphors and their literal candidate paraphrases, we tried to design short contexts that make sense for both the figurative and the literal sentences, even when the pair had been judged as non-paraphrases. We kept the context as neutral as possible in order to avoid a distortion in crowd source ratings.

For example, in the following pair of sentences, the literal sentence is not a good paraphrase of the figurative one (a simile).

3a. He is grinning like an ape.
b. He is smiling in a charming way. (average score: 1.9)

We opted for a context that is natural for both sentences.

4a. Look at him. He is grinning like an ape. He feels so confident and self-assured.
b. Look at him. He is smiling in a charming way. He feels so confident and self-assured.

We sought to avoid, whenever possible, an incongruous context for one of the sentences that could influence our annotators' ratings.
We collected a sub-corpus of 200 contextually embedded pairs of sentences. We tried to keep our data as balanced as possible, drawing from all four classes of paraphrase aptness ratings (1 to 4) that Bizzoni and Lappin (2018) obtained. We selected 44 pairs rated 1, 51 pairs rated 2, 43 pairs rated 3, and 62 pairs rated 4. We then used AMT crowd sourcing to rate the contextualized paraphrase pairs, so that we could observe the effect of document context on assessments of metaphor paraphrase aptness.

To test the reproducibility of Bizzoni and Lappin (2018)'s ratings, we launched a pilot study for 10 original non-contextually embedded pairs, selected from all four classes of aptness. We observed that the annotators provided mean ratings very similar to those reported in Bizzoni and Lappin (2018). The Pearson coefficient correlation between the mean judgments of our out-of-context pilot annotations and Bizzoni and Lappin (2018)'s annotations for the same pairs was over 0.9.

We then conducted an AMT annotation task for the 200 contextualised pairs. On average, 20 different annotators rated each pair. We considered as "rogue" those annotators who rated the large majority of pairs with very high or very low scores, and those who responded inconsistently to two "trap" pairs. After filtering out the rogues, we had an average of 14 annotators per pair.
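The rogue-filtering step described above can be sketched as follows. The extremeness threshold (90% of an annotator's ratings at 1 or 4) and the exact consistency check on the two trap pairs are assumptions for illustration; the criteria actually applied in the study may differ.

```python
# Sketch of the rogue-annotator filter described above (thresholds are assumed values).
def is_rogue(ratings, trap_answers, extreme_share=0.9):
    """`ratings` is the list of 1-4 scores given by one annotator;
    `trap_answers` holds the two responses to the trap pairs."""
    extremes = sum(1 for r in ratings if r in (1, 4)) / len(ratings)
    inconsistent_traps = trap_answers[0] != trap_answers[1]
    return extremes >= extreme_share or inconsistent_traps

def filter_rogues(annotations):
    """`annotations` maps annotator id -> (ratings, trap_answers)."""
    return {a: v for a, v in annotations.items() if not is_rogue(*v)}
```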
3 Annotation Results

We found a Pearson correlation of 0.81 between the in-context and out-of-context mean human paraphrase ratings for our two corpora. This correlation is virtually identical to the one that Bernardy et al. (2018) report for mean acceptability ratings of out-of-context to in-context sentences in their crowd source experiment. It is interesting that a relatively high level of ranking correspondence should occur in mean judgments for sentences presented out of and within document contexts, for two entirely distinct tasks.

Our main result concerns the effect of context on mean paraphrase judgment. We observed that it tends to flatten aptness ratings towards the center of the rating scale. 71.1% of the metaphors that had been considered highly apt (average rounded score of 4) in the context-less pairs received a more moderate judgment (average rounded score of 3), but the reverse movement was rare: only 5% of pairs rated 3 out of context (2 pairs) were boosted to a mean rating of 4 in context. At the other end of the scale, 68.2% of the metaphors judged at 1 category of aptness out of context were raised to a mean of 2 in context, while only 3.9% of pairs rated 2 out of context were lowered to 1 in context.

Ratings at the middle of the scale - 2 (defined as semantically related non-paraphrases) and 3 (imperfect or loose paraphrases) - remained largely stable, with little movement in either direction. 9.8% of pairs rated 2 were re-ranked as 3 when presented in context, and 10% of pairs ranked at 3 changed to 2. The division between 2 and 3 separates paraphrases from non-paraphrases. Our results suggest that this binary rating of paraphrase aptness was not strongly affected by context. Context operates at the extremes of our scale, raising low aptness ratings and lowering high aptness ratings. This effect is clearly indicated in the regression chart in Fig 1.

This effect of context on human ratings is very similar to the one reported in Bernardy et al. (2018). They find that sentences rated as ill-formed out of context are improved when they are presented in their document contexts. However, the mean ratings for sentences judged to be highly acceptable out of context declined when assessed in context. Bernardy et al. (2018)'s linear regression chart for the correlation between out-of-context and in-context acceptability judgments looks remarkably like our Fig 1. There is, then, a striking parallel in the compression pattern that context appears to exert on human judgments for two entirely different linguistic properties.

This pattern requires an explanation. Bernardy et al. (2018) suggest that adding context causes speakers to focus on broader semantic and pragmatic issues of discourse coherence, rather than simply judging syntactic well-formedness (measured as naturalness) when a sentence is considered in isolation. On this view, compression of ratings results from a pressure to construct a plausible interpretation for any sentence within its context.

If this is the case, an analogous process may generate the same compression effect for metaphor aptness assessment of sentence pairs in context. Speakers may attempt to achieve broader discourse coherence when assessing the metaphor-paraphrase aptness relation in a document context. Out of context they focus more narrowly on the semantic relations between a metaphorical sentence and its paraphrase candidate. Therefore, this relation is at the centre of a speaker's concern, and it receives more fine-grained assessment when considered out of context than in context. This issue clearly requires further research.
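The comparison between in-context and out-of-context mean ratings reported above can be reproduced with a short script of the following kind; the variable names are assumptions.

```python
# Sketch of the in-context vs. out-of-context comparison described above.
# `ooc` and `ic` are arrays of mean ratings for the same 200 pairs.
import numpy as np
from scipy.stats import pearsonr

def compression_summary(ooc, ic):
    ooc, ic = np.asarray(ooc, dtype=float), np.asarray(ic, dtype=float)
    r = pearsonr(ooc, ic)[0]                   # overall IC/OOC correlation
    ooc_cat, ic_cat = np.rint(ooc), np.rint(ic)  # rounded 1-4 categories
    movement = {}
    for k in (1, 2, 3, 4):
        in_k = ic_cat[ooc_cat == k]
        if in_k.size:
            movement[k] = {"up": float(np.mean(in_k > k)),
                           "down": float(np.mean(in_k < k))}
    return r, movement
```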
4 Modelling Paraphrase Judgments in Context

We use the DNN model described in Bizzoni and Lappin (2018) to predict aptness judgments for in-context paraphrase pairs. It has three main components:

1. Two encoders that learn the representations of two sentences separately
2. A unified layer that merges the output of the encoders
3. A final set of fully connected layers that operate on the merged representation of the two sentences to generate a judgment.

The encoder for each pair of sentences taken as input is composed of two parallel "Atrous" Convolutional Neural Networks (CNNs) and LSTM RNNs, feeding two sequenced fully connected layers.

The encoder is preloaded with the lexical embeddings from Word2vec (Mikolov et al., 2013). The sequences of word embeddings that we use as input provide the model with dense word-level information, while the model tries to generalize over these embedding patterns.

The combination of a CNN and an LSTM allows us to capture both long-distance syntactic and semantic relations, best identified by a CNN, and the sequential nature of the input, most efficiently identified by an LSTM. Several existing studies, cited in Bizzoni and Lappin (2017), demonstrate the advantages of combining CNNs and LSTMs to process texts.

The model produces a single classifier value between 0 and 1. We transform this score into a binary output of 0 or 1 by applying a threshold of 0.5 for assigning 1. The architecture of the model is given in Fig 2.

We use the same general protocol as Bizzoni and Lappin (2018) for training with supervised learning, and testing the model. Using Bizzoni and Lappin (2018)'s out-of-context metaphor dataset and our contextualized extension of this set, we apply four variants of the training and testing protocol:

1. Training and testing on the in-context dataset.
2. Training on the out-of-context dataset, and testing on the in-context dataset.
3. Training on the in-context dataset, and testing on the out-of-context dataset.
4. Training and testing on the out-of-context dataset (Bizzoni and Lappin (2018)'s original experiment provides the results for out-of-context training and testing).

When we train or test the model on the out-of-context dataset, we use Bizzoni and Lappin (2018)'s original annotated corpus of 800 metaphor-paraphrase pairs. The in-context dataset contains 200 annotated pairs.
Figure 1: In-context and out-of-context mean ratings. Points above the broken diagonal line represent
sentence pairs which received a higher rating when presented in context. The total least-square linear
regression is shown as the second line.

5 MPAT Modelling Results

We use the model both to predict binary classification of a metaphor paraphrase candidate, and to generate gradient aptness ratings on the 4 category scale (see Bizzoni and Lappin (2018) for details). A positive binary classification is accurate if the pair has a mean human rating of ≥ 2.5. The gradient predictions are derived from the softmax distribution of the output layer of the model. The results of our modelling experiments are given in Table 1.

The main result that we obtain from these experiments is that the model learns binary classification to a reasonable extent on the in-context dataset, both when trained on the same kind of data (in-context pairs), and when trained on Bizzoni and Lappin (2018)'s original dataset (out-of-context pairs). However, the model does not perform well in predicting gradient in-context judgments when trained on in-context pairs. It improves slightly for this task when trained on out-of-context pairs. By contrast, it does well in predicting both binary and gradient ratings when trained and tested on out-of-context data sets.

Bernardy et al. (2018) also note a decline in Pearson correlation for their DNN models on the task of predicting human in-context acceptability judgments, but it is less drastic. They attribute this decline to the fact that the compression effect renders the gradient judgments less separable, and so harder to predict. A similar, but more pronounced, version of this effect may account for the difficulty that our model encounters in predicting gradient in-context ratings. The binary classifier achieves greater success for these cases because its training tends to polarise the data in one direction or the other.

We also observe that the best combination seems to consist in training our model on the original out-of-context dataset and testing it on the in-context pairs. In this configuration we reach an F-score (0.72) only slightly lower than the one reported in Bizzoni and Lappin (2018) (0.74), and we record the highest Pearson correlation, 0.3 (which is still not strong, compared to Bizzoni and Lappin (2018)'s best run of 0.75).5 This result may partly be an artifact of the larger amount of training data provided by the out-of-context pairs.

5 It is also important to consider that their ranking scheme is different from ours: the Pearson correlation reported there is the average of the correlations over all groups of 5 sentences present in the dataset.
Figure 2: DNN encoder for predicting metaphorical paraphrase aptness from Bizzoni and Lappin (2018). Each encoder represents a sentence as a 10-dimensional vector. These vectors are concatenated to compute a single score for the pair of input sentences.

Training set | Test set | F-score | Correlation
With-context* | With-context* | 0.68 | -0.01
Without-context | With-context | 0.72 | 0.3
With-context | Without-context | 0.6 | 0.02
Without-context | Without-context | 0.74 | 0.75

Table 1: F-score binary classification accuracy and Pearson correlation for three different regimens of supervised learning. The * indicates results for a set of 10-fold cross-validation runs. This was necessary in the first case, when training and testing are both on our small corpus of in-context pairs. In the second and third rows, since we are using the full out-of-context and in-context dataset, we report single-run results. The fourth row is Bizzoni and Lappin (2018)'s best run result. (Our single-run best result for the first row is an F-score of 0.8 and a Pearson correlation of 0.16.)

We can use this variant (out-of-context training and in-context testing) to perform a fine-grained comparison of the model's predicted ratings for the same sentences in and out of context. When we do this, we observe that out of 200 sentence pairs, our model scores the majority (130 pairs) higher when processed in context than out of context. A smaller but significant group (70 pairs) receives a lower score when processed in context. The first group's average score before adding context (0.48) is consistently lower than that of the second group (0.68). Also, as Table 2 indicates, the pairs that our model rated out of context with a score lower than 0.5 (on the model's softmax distribution) received on average a higher rating in context, while the opposite is true for the pairs rated with a score higher than 0.5. In general, sentence pairs that were rated highly out of context receive a lower score in context, and vice versa. When we did linear regression on the DNN's in- and out-of-context predicted scores, we observed substantially the same compression pattern exhibited by our AMT mean human judgments. Figure 3 plots this regression graph.

OOC score | Number of elements | OOC Mean | OOC Std | IC Mean | IC Std
0.0-0.5 | 112 | 0.42 | 0.09 | 0.54 | 0.1
0.5-1.0 | 88 | 0.67 | 0.07 | 0.64 | 0.07

Table 2: We show the number of pairs that received a low score out of context (first row) and the number of pairs that received a high score out of context (second row). We report the mean score and standard deviation (Std) of the two groups when judged out of context (OOC) and when judged in context (IC) by our model. The model's scores range between 0 and 1. As can be seen, the mean of the low-scoring group rises in context, and the mean of the high-scoring group decreases in context.

Figure 3: In-context and out-of-context ratings assigned by our trained model. Points above the broken
diagonal line represent sentence pairs which received a higher rating when presented in context. The
total least-square linear regression is shown as the second line.
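The grouping reported in Table 2 and the least-squares regression plotted in Figure 3 can be computed as in the following sketch; the 0.5 threshold comes from the table above, while the variable names are assumptions.

```python
# Sketch of the analysis behind Table 2 and Figure 3: group the model's
# out-of-context scores at the 0.5 threshold, compare group means in and out
# of context, and fit a least-squares regression of in-context on out-of-context scores.
import numpy as np

def group_means_and_regression(ooc_scores, ic_scores):
    ooc, ic = np.asarray(ooc_scores, float), np.asarray(ic_scores, float)
    low, high = ooc < 0.5, ooc >= 0.5
    table = {
        "0.0-0.5": (int(low.sum()), ooc[low].mean(), ooc[low].std(),
                    ic[low].mean(), ic[low].std()),
        "0.5-1.0": (int(high.sum()), ooc[high].mean(), ooc[high].std(),
                    ic[high].mean(), ic[high].std()),
    }
    slope, intercept = np.polyfit(ooc, ic, 1)   # least-squares regression line
    return table, slope, intercept
```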

6 Related Cognitive Work on Metaphor Aptness

Tourangeau and Sternberg (1981) present ratings of aptness and comprehensibility for 64 metaphors from two groups of subjects. They note that metaphors were perceived as more apt and more comprehensible to the extent that their terms occupied similar positions within dissimilar domains. Interestingly, Fainsilber and Kogan (1984) also present experimental results to claim that imagery does not clearly correlate with metaphor aptness. Aptness judgments are also subject to individual differences.

Blasko (1999) points to such individual differences in metaphor processing. She asked 27 participants to rate 37 metaphors for difficulty, aptness and familiarity, and to write one or more interpretations of the metaphor. Subjects with higher working memory span were able to give more detailed and elaborate interpretations of metaphors. Familiarity and aptness correlated for both high and low span subjects. For high span subjects aptness of metaphor positively correlated with number of interpretations, while for low span subjects the opposite was true.

McCabe (1983) analyses the aptness of metaphors with and without extended context. She finds that domain similarity correlates with aptness judgments in isolated metaphors, but not in contextualized metaphors. She also reports that there is no clear correlation between metaphor aptness ratings in isolated and in contextualized examples.

Chiappe et al. (2003) study the relation between aptness and comprehensibility in metaphors and similes. They provide experimental results indicating that aptness is a better predictor than comprehensibility for the "transformation" of a simile into a metaphor. Subjects tended to remember similes as metaphors (i.e. remember the dancer's arms moved like startled rattlesnakes as the dancer's arms were startled rattlesnakes) if they were judged to be particularly apt, rather than particularly comprehensible. They claim that context might play an important role in this process. They suggest that context should ease the transparency and increase the aptness of metaphors and similes.

Tourangeau and Rips (1991) present a series of experiments indicating that metaphors tend to be interpreted through emergent features that were not rated as particularly relevant, either for the tenor or for the vehicle of the metaphor. The number of emergent features that subjects were able to draw from a metaphor seems to correlate with their aptness judgments.

Bambini et al. (2018) use Event-Related Brain Potentials (ERPs) to study the temporal dynamics of metaphor processing in reading literary texts. They emphasize the influence of context on the ability of a reader to smoothly interpret an unusual metaphor.
Bambini et al. (2016) use electrophysiological experiments to try to disentangle the effect of a metaphor from that of its context. They find that de-contextualized metaphors elicited two different brain responses, N400 and P600, while contextualized metaphors only produced the P600 effect. They attribute the N400 effect, often observed in neurological studies of metaphors, to expectations about upcoming words in the absence of a predictive context that "prepares" the reader for the metaphor. They suggest that the P600 effect reflects the actual interpretative processing of the metaphor.

This view is supported by several neurological studies showing that the N400 effect arises with unexpected elements, like new presuppositions introduced into a text in a way not implied by the context (Masia et al., 2017), or unexpected associations with a noun-verb combination not indicated by previous context (for example preceded by neutral context, as in Cosentino et al. (2017)).
7 Conclusions and Future Work

We have observed that embedding metaphorical sentences and their paraphrase candidates in a document context generates a compression effect in human metaphor aptness ratings. Context seems to mitigate the perceived aptness of metaphors in two ways: those metaphor-paraphrase pairs given very low scores out of context receive increased scores in context, while those with very high scores out of context decline in rating when presented in context. At the same time, the demarcation line between paraphrase and non-paraphrase is not particularly affected by the introduction of extended context.

As previously observed by McCabe (1983), we found that context has an influence on human aptness ratings for metaphors, although, unlike her results, we did find a correlation between the two sets of ratings. Chiappe et al. (2003)'s expectation that context should facilitate a metaphor's aptness was supported only in one sense: aptness increases for low-rated pairs, but it decreases for high-rated pairs.

We applied Bizzoni and Lappin (2018)'s DNN for the MPAT to an in-context test set, experimenting with both out-of-context and in-context training corpora. We obtained reasonable results for binary classification of paraphrase candidates for aptness, but the performance of the model declined sharply for the prediction of human gradient aptness judgments, relative to its performance on a corresponding out-of-context test set. This appears to be the result of the increased difficulty in separating rating categories introduced by the compression effect.

Strikingly, the linear regression analyses of human aptness judgments for in- and out-of-context paraphrase pairs, and of our DNN's predictions for these pairs, reveal similar compression patterns. These patterns produce ratings that cannot be clearly separated along a linear ranking scale.

To the best of our knowledge ours is the first study of the effect of context on metaphor aptness on a corpus of this dimension, using crowd sourced human judgments as the gold standard for assessing the predictions of a computational model of paraphrase. We also present the first comparative study of both human and model judgments of metaphor paraphrase for in-context and out-of-context variants of metaphorical sentences.

Finally, the compression effect that context induces on paraphrase judgments corresponds closely to the one observed independently in another task, which is reported in Bernardy et al. (2018). We regard this effect as a significant discovery that increases the plausibility and the interest of our results. The fact that it appears clearly with two tasks involving different sorts of DNNs and distinct learning regimes (unsupervised learning with neural network language models for the acceptability prediction task, as opposed to supervised learning with our composite DNN for paraphrase prediction) reduces the likelihood that this effect is an artefact of our experimental design.

While our dataset is still small, we are presenting an initial investigation of a phenomenon which is, to date, little studied. We are working to enlarge our dataset and in future work we will expand both our in- and out-of-context annotated metaphor-paraphrase corpora. While the corpus we used contains a number of hand-crafted examples, it would be preferable to find these example types in natural corpora, and we are currently working on this. We will be extracting a dataset of completely natural (corpus-driven) examples. We are seeking to expand the size of the data set to improve the reliability of our modelling experiments.

We will also experiment with alternative DNN architectures for the MPAT. We will conduct qualitative analyses on the kinds of metaphors and similes that are more prone to a context-induced rating switch. One of our main concerns in future research will be to achieve a better understanding of the compression effect of context on human judgments and DNN models.
References

Valentina Bambini, Chiara Bertini, Walter Schaeken, Alessandra Stella, and Francesco Di Russo. 2016. Disentangling metaphor from context: An ERP study. Frontiers in Psychology, 7:559.

Valentina Bambini, Paolo Canal, Donatella Resta, and Mirko Grimaldi. 2018. Time course and neurophysiological underpinnings of metaphor in literary context. Discourse Processes, pages 1–21.

Jean-Philippe Bernardy, Shalom Lappin, and Jey Han Lau. 2018. The influence of context on sentence acceptability judgments. Proceedings of ACL 2018, Melbourne, Australia.

Yuri Bizzoni and Shalom Lappin. 2017. Deep learning of binary and gradient judgements for semantic paraphrase. In IWCS 2017 - 12th International Conference on Computational Semantics - Short papers, Montpellier, France, September 19-22, 2017.

Yuri Bizzoni and Shalom Lappin. 2018. Predicting human metaphor paraphrase judgments with deep neural networks. Proceedings of the Workshop on Figurative Language Processing, NAACL 2018, New Orleans, LA.

Dawn G. Blasko. 1999. Only the tip of the iceberg: Who understands what about metaphor? Journal of Pragmatics, 31(12):1675–1683.

Dan L. Chiappe, John M. Kennedy, and Penny Chiappe. 2003. Aptness is more important than comprehensibility in preference for metaphors and similes. Poetics, 31(1):51–68.

Erica Cosentino, Giosuè Baggio, Jarmo Kontinen, and Markus Werning. 2017. The time-course of sentence meaning composition. N400 effects of the interaction between context-induced and lexically stored affordances. Frontiers in Psychology, 8:813.

Lynn Fainsilber and Nathan Kogan. 1984. Does imagery contribute to metaphoric quality? Journal of Psycholinguistic Research, 13(5):383–391.

Sam Glucksberg. 2008. How metaphors create categories–quickly. The Cambridge Handbook of Metaphor and Thought, pages 67–83.

Viviana Masia, Paolo Canal, Irene Ricci, Edoardo Lombardi Vallauri, and Valentina Bambini. 2017. Presupposition of new information as a pragmatic garden path: Evidence from event-related brain potentials. Journal of Neurolinguistics, 42:31–48.

Allyssa McCabe. 1983. Conceptual similarity and the quality of metaphor in isolated sentences versus extended contexts. Journal of Psycholinguistic Research, 12(1):41–68.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Glucksberg Sam and Haught Catrinel. 2006. On the relation between metaphor and simile: When comparison fails. Mind & Language, 21(3):360–378.

Roger Tourangeau and Lance Rips. 1991. Interpreting and evaluating metaphors. Journal of Memory and Language, 30(4):452–472.

Roger Tourangeau and Robert J. Sternberg. 1981. Aptness in metaphor. Cognitive Psychology, 13(1):27–55.
