
Recent Trends in Word Sense Disambiguation: A Survey

Michele Bevilacqua¹, Tommaso Pasini², Alessandro Raganato³ and Roberto Navigli¹
¹Sapienza NLP Group, Department of Computer Science, Sapienza University of Rome
²Department of Computer Science, University of Copenhagen
³Department of Digital Humanities, University of Helsinki

Abstract

Word Sense Disambiguation (WSD) aims at making explicit the semantics of a word in context by identifying the most suitable meaning from a predefined sense inventory. Recent breakthroughs in representation learning have fueled intensive WSD research, resulting in considerable performance improvements, breaching the 80% glass ceiling set by the inter-annotator agreement. In this survey, we provide an extensive overview of current advances in WSD, describing the state of the art in terms of i) resources for the task, i.e., sense inventories and reference datasets for training and testing, as well as ii) automatic disambiguation approaches, detailing their peculiarities, strengths and weaknesses. Finally, we highlight the current limitations of the task itself, but also point out recent trends that could help expand the scope and applicability of WSD, setting up new promising directions for the future.

1 Introduction

Word Sense Disambiguation (WSD) is a historical task in Natural Language Processing (NLP) and Artificial Intelligence (AI) which, in its essence, dates back to Weaver [1949], who recognized the problem of polysemous words in the context of Machine Translation. Even today, word polysemy remains one of the most challenging and pervasive linguistic phenomena in NLP. For example, the ambiguous word bass refers to two completely disjoint classes of objects in the following sentences: i) “I can hear bass sounds”, ii) “They like grilled bass”. NLP research has long sought ways to tackle this phenomenon, with the task of WSD being at the forefront of the automatic resolution of polysemy. In WSD, ambiguity is addressed by mapping a target expression to one (or potentially more) of its possible senses, depending on the surrounding context. Indeed, a model should map the word bass to the meanings of low-frequency tones and type of fish, in the respective sentences above. WSD systems use the senses that are enumerated by a static, predefined, machine-readable dictionary, i.e., a sense inventory. Sense inventories are mostly concerned with open-class words (nouns, verbs, adjectives and adverbs), as these are the words carrying most of a sentence’s meaning. In WSD, the sense inventory for a language can be very large, i.e., in the order of hundreds of thousands of concepts, but also very sparse, in that each lexeme¹ is associated with only a small subset of the sense inventory.

¹A ⟨lemma, part of speech⟩ pair.

Predefined inventories define the output space for most varieties of past and modern approaches. These exist in many flavors, ranging from purely supervised [Hadiwinoto et al., 2019; Bevilacqua and Navigli, 2019] to knowledge-based [Moro et al., 2014; Agirre et al., 2014; Scozzafava et al., 2020], to hybrid supervised and knowledge-based approaches [Kumar et al., 2019; Bevilacqua and Navigli, 2020; Blevins and Zettlemoyer, 2020; Conia and Navigli, 2021; Barba et al., 2021]. Supervised models, today based on neural architectures, frame the task as a classification problem and take advantage of annotated data to learn the association between words² in context and senses. Knowledge-based approaches, instead, often employ graph algorithms on a semantic network, in which senses are connected through semantic relations and are described with definitions and usage examples. Their independence from labeled training data, however, comes at the expense of performing worse than supervised models [Pilehvar and Navigli, 2014; Raganato et al., 2017a; Pasini et al., 2021] which, benefiting from pretrained language models, can now also nimbly scale across different languages. Nonetheless, information in semantic networks, be it unstructured (e.g., definitions) or structured (e.g., relational information), still remains highly relevant. This is demonstrated by hybrid approaches, which, reporting the highest results in the literature, are currently attested as the best solution [Barba et al., 2021].

²For ease of reading, we use word to refer to both words and multiword expressions.

Considering the fast pace at which the field is moving, together with the fact that reference WSD surveys [Nancy and Jean, 1998; Agirre and Edmonds, 2007; Navigli, 2009] are now more than 10 years old, it is hard to have a clear picture of which of the innovations introduced in the last few years have been the most successful. In this survey paper we thus provide a comprehensive overview of the literature, summarizing the most effective contributions proposed so far. Specifically, we focus on the most recent and significant models for the task, highlighting their strengths and weaknesses, while, at the same time, outlining possible fruitful directions that lie ahead.


2 Resources for WSD

WSD is a knowledge-intensive task, which needs data of two different kinds: i) sense inventories, i.e., reference computational lexicons which enumerate possible meanings; and ii) annotated corpora, in which a subset of words are tagged with one or more possible meanings drawn from the given inventory. In the following subsections, we review the most popular sense inventories (§2.1) and annotated corpora (§2.2) used for training and testing WSD systems.
2.1 Sense Inventories

Sense inventories enumerate the set of possible senses for a given lexeme. The most popular ones are:

• Princeton WordNet [Miller et al., 1990], a large, manually-curated lexicographic database of English and the de facto standard inventory for WSD. It is organized into a graph, where nodes are synsets, i.e., groups of contextual synonyms. Each synonym in a synset represents a sense of a word. Synsets and senses are linked to each other through edges representing lexical-semantic relations, such as hypernymy (is-a), and meronymy (part-of), among others. For each synset, WordNet also provides other forms of lexical knowledge, such as definitions (glosses) and usage examples. Most recent works in English WSD use the 3.0 version (released in 2006), containing 117,659 synsets. Recently, English WordNet 2020 [McCrae et al., 2020] extended the original Princeton WordNet by introducing approximately 3,000 new synsets, including slang and neologisms.

• BabelNet [Navigli and Ponzetto, 2012], a multilingual dictionary with coverage of both lexicographic and encyclopedic terms obtained by semi-automatically mapping various resources, such as WordNet, multilingual versions of WordNet and Wikipedia, among others. BabelNet is structured as a semantic network where nodes are multilingual synsets, i.e., groups of synonyms lexicalized in several languages, and edges are semantic relations between them. The latest 2021 release, i.e., version 5.0, covers 500 languages and contains more than 20M synsets [Navigli et al., 2021].

Another inventory that has recently been gaining interest [Blevins et al., 2021] is Wiktionary,³ a collaborative project designed to create a dictionary for each language separately.

³https://en.wiktionary.org/wiki/Wiktionary:Statistics
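To make the shape of such inventories concrete, the following minimal sketch queries Princeton WordNet through NLTK (assuming the nltk package and its wordnet data have been installed) and prints the senses of the noun bass from §1, together with their glosses and is-a relations:

```python
from nltk.corpus import wordnet as wn

# List the WordNet senses of the ambiguous noun "bass" from the Introduction.
for synset in wn.synsets("bass", pos=wn.NOUN):
    print(synset.name())                                           # e.g., "bass.n.01"
    print("  gloss:", synset.definition())                         # human-readable definition
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])  # is-a relations
    print("  lemmas:", [l.name() for l in synset.lemmas()])        # synonyms in the synset
```

The handful of synsets returned for a single ⟨lemma, part of speech⟩ pair, out of more than 117,000 overall, illustrates the sparsity discussed above.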
Each of these inventories suffers from the so-called fine-granularity problem, that is, different meanings of the same lexeme are, sometimes, hard to discriminate between even for humans. For example, WordNet enumerates 29 senses for the noun line, two of which distinguish between a set of things laid out horizontally and one laid out vertically. To cope with the excessive granularity of word senses and simplify the WSD task, different coarser-grained inventories have been proposed [Hovy et al., 2006; Lacerra et al., 2020], but their use has not yet become mainstream, also due to limited coverage.

Another significant issue is the fact that sense inventories assume that, at least for practical purposes, word meaning can be enumerated in a finite list. However, this also implicitly assumes that language is static and does not change much over time. Unfortunately, this is not the real-case scenario, especially considering how fast new words and senses are introduced online. Alternative approaches like the generative lexicon [Pustejovsky, 1998], which provides a general framework in which word meaning can be produced online, have been proposed in the past, but no large-scale experiments have yet been carried out on them.

2.2 Sense-Annotated Data

As new annotated data are continuously created, in this Section we only describe the standard benchmarks used in WSD, and refer the reader to a recent survey on corpora tagged with sense annotations [Pasini, 2020].

Data for Training

SemCor [Miller et al., 1993] is the largest manually annotated dataset, comprising 200,000 sense annotations using the WordNet sense inventory. Despite the remarkable effort, it only covers 22% of the almost 118,000 WordNet synsets, and, being a subset of the English Brown Corpus from the 1960s, it features a different distribution of senses compared to that of contemporary texts, with numerous meanings that are now commonplace, such as computer mouse, being completely absent. To increase the annotation coverage, several works [Vial et al., 2019; Bevilacqua and Navigli, 2020] have recently started using the English Princeton WordNet Gloss Corpus (WNG)⁴ as additional data. WNG comprises sense definitions and examples in WordNet, annotated both manually and semi-automatically, covering more than 59,000 WordNet senses.

⁴https://wordnetcode.princeton.edu/glosstag.shtml
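For a concrete look at such annotations, the sketch below iterates over the copy of SemCor shipped with NLTK (assuming the semcor data package has been downloaded); sense-tagged chunks are trees whose labels are WordNet lemmas, from which the annotated synset can be recovered, while chunks that NLTK cannot resolve to a lemma are skipped:

```python
from nltk.corpus import semcor
from nltk.tree import Tree

# Each sentence is a list of chunks; sense-tagged chunks are Trees whose
# label is a WordNet Lemma object (unresolved labels are plain strings).
for sent in semcor.tagged_sents(tag="sem")[:2]:
    for chunk in sent:
        if isinstance(chunk, Tree) and hasattr(chunk.label(), "synset"):
            words = " ".join(chunk.leaves())
            print(f"{words} -> {chunk.label().synset().name()}")
```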
While English training data is widely available, unfortunately the same does not hold for other languages. Although hand-labeled data are notoriously difficult to obtain on a large scale for many languages, some efforts in the past were directed towards creating manually-translated versions of SemCor [Petrolito and Bond, 2014], but many of these are no longer available. Therefore, several subsequent works proposed automatic methods for producing high-quality sense-annotated data both in English [Taghipour and Ng, 2015; Loureiro and Camacho-Collados, 2020] and other languages by leveraging: information from Wikipedia [Scarlini et al., 2019], the Personalized PageRank algorithm [Pasini and Navigli, 2020], label propagation over comparable texts [Barba et al., 2020] or automatic translations [Pasini et al., 2021].

Data for Testing

Evaluation in WSD is usually carried out using the manually annotated datasets from the Senseval and SemEval evaluation campaigns. English WSD benefits from the evaluation suite of Raganato et al. [2017a], which combines together five all-words gold-standard datasets: Senseval-2 [Edmonds and Cotton, 2001, S2], Senseval-3 [Snyder and Palmer, 2004, S3], SemEval-2007 Task 17 [Pradhan et al., 2007, S7], SemEval-2013 Task 12 [Navigli et al., 2013, S13] and SemEval-2015 Task 13 [Moro and Navigli, 2015, S15].


This framework standardized the evaluation in English WSD with the WordNet sense inventory, making it easier to compare systems in a general domain, helping the field to develop increasingly better-performing models. In an attempt to investigate the most common weaknesses among WSD approaches, i.e., poor performance on infrequent senses, Blevins et al. [2021] introduced FEWS, an English benchmark where Wiktionary examples are annotated with Wiktionary definitions.

For non-English languages, instead, WSD evaluation datasets have received less attention, as they are often annotated with diverse and outdated inventories. Only very recently, a comprehensive benchmark has been put forward to standardize the evaluation in this setting too [Pasini et al., 2021, XL-WSD].⁵ XL-WSD extends the English evaluation framework of Raganato et al. [2017a] and introduces test data for 18 languages: Basque, Bulgarian, Catalan, Chinese, Croatian, Danish, Dutch, English, Estonian, French, Galician, German, Hungarian, Italian, Japanese, Korean, Slovenian, and Spanish, resulting in more than 99K gold annotations. This benchmark includes training and testing data annotated with BabelNet 4.0 senses, enabling, for the first time, a large-scale monolingual and multilingual evaluation of WSD models, including the cross-lingual zero-shot setting, e.g., training on English and testing on other languages.

⁵https://sapienzanlp.github.io/xl-wsd/
3 Main Approaches to WSD

In the next two subsections we overview different kinds of system, ranging from those which do not require training data (§3.1), to models which are data-driven (§3.2).

3.1 Knowledge-Based WSD

Knowledge-based approaches leverage computational lexicons, such as WordNet or BabelNet, especially their graph structure, in which synsets act as nodes and the relations between them as edges. Successful approaches of this kind employ graph algorithms such as random walks [Agirre et al., 2014, UKB], clique approximation [Moro et al., 2014, Babelfy], or game theory [Tripodi and Navigli, 2019]. The richness and quality of the information encoded within their underlying knowledge bases crucially determine the performance of such approaches [Pilehvar and Navigli, 2014; Maru et al., 2019].

The highest-scoring methods are two very different models: SyntagRank [Scozzafava et al., 2020] and SREF_KB [Wang and Wang, 2020]. SyntagRank is purely graph-based and applies the Personalized PageRank algorithm [Page et al., 1999] on both the WordNet portion of BabelNet augmented with relations from the WNG corpus, and SyntagNet [Maru et al., 2019], a resource providing manually curated relations between synsets whose senses form a collocation. SREF_KB, instead, is a vector-based approach leveraging contextualized word representations and sense embeddings to perform disambiguation. Sense vectors are computed by applying BERT [Devlin et al., 2019] on examples and definitions from WordNet, as well as on automatically retrieved contexts from the Web. Thanks to BabelNet, SyntagRank showed itself to be able to scale across many different languages, while SREF_KB has so far been tested on English only. In addition, SREF_KB does also make use of manually-created usage examples from WordNet, which arguably amounts to a form of stronger supervision.
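To illustrate the graph-based recipe, the following minimal sketch runs Personalized PageRank with networkx over a toy semantic network; the graph, the candidate senses and the context synsets are illustrative stand-ins for a WordNet/BabelNet-scale resource, not the actual SyntagRank pipeline:

```python
import networkx as nx

# Toy semantic network: nodes are synsets, edges are semantic relations.
G = nx.Graph()
G.add_edges_from([
    ("bass.n.01", "sound.n.01"),   # the low-frequency-tone sense relates to sound
    ("bass.n.02", "fish.n.01"),    # the fish sense
    ("grill.v.01", "fish.n.01"),   # cooking relates to fish
])

def disambiguate(target_candidates, context_synsets):
    # Restart mass is spread uniformly over the synsets evoked by the context.
    personalization = {node: 0.0 for node in G}
    for s in context_synsets:
        personalization[s] = 1.0 / len(context_synsets)
    scores = nx.pagerank(G, alpha=0.85, personalization=personalization)
    # Pick the candidate sense of the target word with the highest PPR score.
    return max(target_candidates, key=lambda s: scores.get(s, 0.0))

# "They like grilled bass": the cooking context pushes PPR mass to the fish sense.
print(disambiguate(["bass.n.01", "bass.n.02"], ["grill.v.01"]))
```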
3.2 Supervised WSD

The most successful approaches to WSD are the so-called supervised methods. In abstract terms, these aim to learn a parameterized function $f_\Theta$ mapping a word $w$ in a context $c$ to a sense $s \in V$ (the vocabulary of senses) using the supervision of a dataset $\mathcal{D}$ of word-context-sense triplets $\langle w, c, s \rangle$. In what follows, we focus mainly on neural supervised systems, which over recent years have consistently obtained the best overall results. Most of the methods we discuss exploit transfer learning, with the use of pretrained Transformers being required for state-of-the-art performance.

As the most meaningful classification of the approaches concerns not so much the architecture, but what kind of additional information the model is able to exploit, we group them into (i) purely data-driven models, (ii) supervised models exploiting glosses, (iii) supervised models exploiting relations in a knowledge graph, and (iv) supervised approaches using other sources of knowledge. In what follows we highlight different families of supervised approaches in boldface.

Purely Data-Driven WSD

Most supervised WSD models are trained with gradient descent to minimize a cost function $\mathcal{L}(w, c, s)$ over all $\langle w, c, s \rangle \in \mathcal{D}$ with respect to the parameters $\Theta$. A popular baseline model, in this case, would be a token tagger, which for each word $w$ in a context $c$ produces a probability distribution $P_w$ over all $s' \in V$, i.e., over all senses in the vocabulary. Token tagger models for WSD make use of a pretrained embedder, which is usually kept frozen, feed the contextualized representations to either a feedforward network [Hadiwinoto et al., 2019] (Eq. 1 below) or a stack of Transformer layers [Bevilacqua and Navigli, 2019; Vial et al., 2019] (Eq. 2), and then multiply the output by a classification layer $O$:

$$E_c = \mathrm{Embed}(c), \quad H_{c,w} = \mathrm{FFN}(E_{c,w}), \quad P_{c,w} = \mathrm{Softmax}(H_{c,w} O) \tag{1}$$

$$E_c = \mathrm{Embed}(c), \quad H_{c,w} = \mathrm{Transformer}(E_c)_w, \quad P_{c,w} = \mathrm{Softmax}(H_{c,w} O) \tag{2}$$

where the subscript $c,w$ selects the component that corresponds to the target word $w$ in $c$. At inference time, rather than predicting the most likely sense across the whole vocabulary, one predicts the highest among those possible for the given word:

$$\hat{s} = \operatorname*{argmax}_{s' \in V(w)} P_{c,w,s'} \tag{3}$$

where $V(w) \subset V$ is the set of possible meanings that $w$ can take according to the reference sense inventory.
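Eqs. 1-3 translate almost directly into code. Below is a minimal PyTorch sketch of the feedforward variant (Eq. 1) together with the restricted argmax of Eq. 3; the frozen pretrained embedder is assumed to be provided externally, and all dimensions are placeholders rather than the configuration of any specific published system:

```python
import torch
import torch.nn as nn

class TokenTaggerWSD(nn.Module):
    """Eq. 1: H_{c,w} = FFN(E_{c,w}); P_{c,w} = Softmax(H_{c,w} O)."""

    def __init__(self, embed_dim: int, hidden_dim: int, num_senses: int):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.O = nn.Linear(hidden_dim, num_senses, bias=False)  # classification layer O

    def forward(self, embeddings: torch.Tensor, target_idx: int) -> torch.Tensor:
        # embeddings: (seq_len, embed_dim), produced by a frozen pretrained embedder.
        h = self.ffn(embeddings[target_idx])         # H_{c,w}
        return torch.log_softmax(self.O(h), dim=-1)  # log P_{c,w} over all senses

def predict(log_probs: torch.Tensor, candidate_senses: list[int]) -> int:
    # Eq. 3: restrict the argmax to V(w), the senses admissible for the target word.
    candidates = torch.tensor(candidate_senses)
    return candidate_senses[int(log_probs[candidates].argmax())]
```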


These simple approaches already produce a large improvement over previous mostly randomly-initialized models [Raganato et al., 2017b]. Nevertheless, performances are – at least partially – limited by the categorical cross-entropy that is often used for training. In fact, the binary cross-entropy loss has been shown to be more effective [Conia and Navigli, 2021], as it allows multiple annotations for a single instance that are available in the training set to be taken into account, rather than having to use a single ground-truth sense only.

A simpler approach compared to token taggers is that of the 1-nn vector-based methods [Peters et al., 2018]. This approach creates sense embeddings by averaging the contextual vectors of instances within the training set that were tagged with the same sense:

$$v^{(c,w)} = \mathrm{Embed}(c)_w, \quad v^{(s)} = \frac{1}{|\mathcal{D}_{(w,s)}|} \sum_{c' \in \mathcal{D}_{(w,s)}} \mathrm{Embed}(c')_w \tag{4}$$

where $v^{(c,w)}$ and $v^{(s)}$ are the representations for, respectively, a word in context and a sense, and $\mathcal{D}_{(w,s)}$ is the set of contexts where $w$ appears associated with a sense $s$ in the dataset $\mathcal{D}$. The predicted sense $\hat{s}$ is selected as the one with the highest cosine similarity:

$$\hat{s} = \operatorname*{argmax}_{s' \in V(w)} \mathrm{sim}_{\cos}(v^{(c,w)}, v^{(s')}) \tag{5}$$

The approaches presented so far assume that each sense is an opaque class, and the classification architecture cannot exploit any knowledge beyond what can be inferred through the supervision from the training corpus. This issue is not only theoretical but also practical, as many senses do not actually occur in training corpora (§2.2) owing to the extreme class imbalance.
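A minimal sketch of the 1-nn method of Eqs. 4-5, assuming the contextual vectors of the tagged training instances have already been extracted with a frozen embedder (plain numpy arrays stand in for any specific released sense embeddings):

```python
from collections import defaultdict
import numpy as np

def build_sense_embeddings(tagged_instances):
    # tagged_instances: iterable of (sense_id, context_vector) pairs, where each
    # vector is Embed(c')_w for an occurrence of w tagged with that sense (Eq. 4).
    buckets = defaultdict(list)
    for sense, vec in tagged_instances:
        buckets[sense].append(vec)
    return {sense: np.mean(vecs, axis=0) for sense, vecs in buckets.items()}

def predict(context_vec, candidate_senses, sense_embeddings):
    # Eq. 5: cosine similarity between v^(c,w) and each v^(s'), s' in V(w).
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(candidate_senses, key=lambda s: cos(context_vec, sense_embeddings[s]))
```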
Supervised WSD Exploiting Glosses

One conspicuous source of information in sense inventories consists of textual definitions (also known as glosses). Definitions, mirroring the format of traditional dictionaries, provide a simple human-readable way of clarifying sense distinctions. For example, the concept of nostalgia is defined in WordNet as longing for something past. Glosses have proven themselves quite useful for increasing WSD performances, with multiple ways to exploit them being explored in the literature. Glosses can be encoded as vectors by averaging their tokens’ contextualized representations and easily incorporated into both 1-nn approaches and token tagging architectures. Specifically, 1-nn approaches have been shown to benefit greatly from concatenating gloss vectors to the “supervised” representations (see Eq. 4) [Loureiro and Jorge, 2019, LMMS]. Indeed, glosses are also used in the same manner by more sophisticated 1-nn approaches, such as SensEmBERT [Scarlini et al., 2020a], ARES [Scarlini et al., 2020b] and SREF [Wang and Wang, 2020]. They differ substantially in their approach to automatically retrieving additional contexts in order to build the supervised part of the sense embedding, with ARES attaining the highest performance by leveraging collocational relations between senses to retrieve new example sentences from Wikipedia. Berend [2020] has shown that existing sense embeddings can also be made sparse by applying sparse coding.

Another use of sense embeddings (including gloss information) is in providing the weights for the classification layer (the matrix $O$ in Eq. 1) of token-tagging architectures. EWISE [Kumar et al., 2019] creates sense representations by training a gloss encoder by means of a triplet loss on WordNet (§3.2); EWISER [Bevilacqua and Navigli, 2020], instead, finetunes off-the-shelf sense embeddings based on pretrained language models, i.e., SensEmBERT and LMMS, attaining results close to the state of the art. Finally, BEM [Blevins and Zettlemoyer, 2020] fully embraces the idea of jointly training text and sense representations, and puts it into practice by leveraging two separate Transformer models to encode the target word context and its candidate definitions.
Glosses have also been exploited in sequence-tagging approaches [Huang et al., 2019; Yap et al., 2020]. These reframe the WSD task as a sequence classification problem where, given a word $w$ in a context $c$, they score the triplet $\langle w, c, \mathcal{G}(s') \rangle$ for each $s' \in V(w)$, and select the sense $\hat{s}$ with the highest score:

$$\hat{s} = \operatorname*{argmax}_{s' \in V(w)} \Gamma(c, w, \mathcal{G}(s')) \tag{6}$$

where $\Gamma$ is a scoring function typically implemented as a finetuned Transformer and $\mathcal{G}(s')$ is the gloss of sense $s'$. While attaining competitive performance (§4.2), models of this kind are less efficient than token classifiers since they need to process the same sentence for each content word and for each of its possible definitions.
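In the spirit of GlossBERT [Huang et al., 2019], the scoring function $\Gamma$ of Eq. 6 can be implemented as a cross-encoder over context-gloss pairs. The sketch below shows the inference loop with the Hugging Face transformers sequence-classification API; the checkpoint path is hypothetical (a model actually finetuned on context-gloss pairs is assumed), as are the toy glosses:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "path/to/context-gloss-cross-encoder"  # hypothetical finetuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def disambiguate(context: str, candidate_glosses: dict[str, str]) -> str:
    # Eq. 6: score Gamma(c, w, G(s')) for every s' in V(w); keep the best sense.
    scores = {}
    for sense, gloss in candidate_glosses.items():
        inputs = tokenizer(context, gloss, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        scores[sense] = logits[0, -1].item()  # assuming the last label is "matching"
    return max(scores, key=scores.get)

glosses = {"bass.n.01": "the lowest adult male singing voice or musical range",
           "bass.n.02": "the lean flesh of a saltwater fish"}  # toy glosses
print(disambiguate("They like grilled bass", glosses))
```

The one-forward-pass-per-candidate-definition pattern visible in the loop is precisely what makes this family slower than token classifiers.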


Barba et al. [2021, ESCHER] mitigate this issue by framing the WSD problem as a span extraction problem, where, given a target word in a sentence concatenated with all its possible definitions, a model has to find the span that best fits the target word use within the sentence. This approach allows a BART-based model [Lewis et al., 2020] to attain state-of-the-art results on the standard English benchmarks while also being able to scale over vocabularies with different granularities. However, the model is still less efficient than regular token-tagging alternatives, since it needs to run as many forward passes as there are targets to classify in the input sequence.

Finally, a generative variant of the sequence classification approach has been introduced by Bevilacqua et al. [2020] to tackle WSD as a Natural Language Generation (NLG) problem where, given $c$ and $w$, the model has to generate $\mathcal{G}(s)$, thus reducing WSD to the task of definition modeling (§5). While not using the definition as part of the input, this approach has obtained results in the same ballpark as sequence classifiers, e.g., GlossBERT, disposing of the need for predefined sense inventories and with the added flexibility of handling neologisms, compound words and slang terms, which are virtually absent from standard inventories for WSD.

Supervised WSD Exploiting Relations

WordNet offers another rich source of knowledge in the edges that interweave its senses and synsets. Traditionally, this information is exploited by graph knowledge-based systems, for example, those based on Personalized PageRank [Scozzafava et al., 2020]. Nevertheless, many recent supervised systems – either 1-nn or token taggers – also draw benefit from using WordNet as a graph. For example, Loureiro and Jorge [2019, LMMS] create representations for those senses not appearing in SemCor by averaging the embeddings of their neighbours in WordNet; Wang and Wang [2020, SREF] employ WordNet hypernymy and hyponymy relations to devise a try-again mechanism that refines the prediction of the WSD model, and Vial et al. [2019] reduce the number of output classes by mapping each sense to an ancestor in the WordNet taxonomy. Among the token-tagger models, EWISE [Kumar et al., 2019] uses the WordNet graph structure to train the gloss embedder offline, while EWISER [Bevilacqua and Navigli, 2020] shows that with a simple modification to Eq. 1 the full graph of WordNet can be directly incorporated into the architecture:

$$P_{c,w} = \mathrm{Softmax}(H_{c,w} O A) \tag{7}$$

where $A$ is a sparse adjacency matrix. A different way to use the same information is proposed by Conia and Navigli [2021], who replace the whole adjacency matrix multiplication with a binary cross-entropy loss where all senses related to the gold one are also considered as relevant.

In general, using relational knowledge is becoming commonplace in supervised WSD, with a gradual hybridization with knowledge-based methods. However, relational knowledge is easily exploited only by token classification and 1-nn approaches, while its integration into sequence classification methods has not yet been investigated.
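The modification of Eq. 7 amounts to a single extra (sparse) matrix multiplication on the logits. A minimal sketch with a toy adjacency matrix standing in for the full WordNet graph; self-loops are included so that each sense retains its own score, while the actual EWISER formulation treats $A$ more elaborately:

```python
import torch

num_senses, hidden_dim = 5, 8
H = torch.randn(hidden_dim)              # H_{c,w}: hidden state of the target word
O = torch.randn(hidden_dim, num_senses)  # classification layer O of Eq. 1

# Sparse adjacency A over senses: self-loops plus one WordNet-style relation
# between senses 0 and 3 (toy graph; EWISER uses the full WordNet graph).
rows = torch.tensor([0, 1, 2, 3, 4, 0, 3])
cols = torch.tensor([0, 1, 2, 3, 4, 3, 0])
A = torch.sparse_coo_tensor(torch.stack([rows, cols]), torch.ones(7),
                            (num_senses, num_senses))

logits = H @ O                                                   # plain token-tagger logits
logits = torch.sparse.mm(A.t(), logits.unsqueeze(1)).squeeze(1)  # Eq. 7: H O A
P = torch.softmax(logits, dim=-1)                                # P_{c,w}
```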
best knowledge-based method, i.e., SREFKB , and the best
Supervised WSD Exploiting Other Knowledge supervised system, i.e., ESCHER. The same trend appears
WSD models also prove to benefit from using additional in a recent multilingual benchmark as well [Pasini et al.,
sources of knowledge, both internal and external to the knowl- 2021]. Nevertheless, information within knowledge bases
edge base itself. Luan et al. [2020] leverage translations in remains valuable and many successful supervised methods
BabelNet to refine the output of any arbitrary WSD system by are effectively hybridized with knowledge-based methods
comparing the translation of the output senses with the target’s (§3.2).
translations provided by an NMT system. Is it worth it to include other kinds of knowledge? Ad-
In a different direction, Calabrese et al. [2020a] leverage ditional information is beneficial to boosting the results,
images from the BabelPic dataset [Calabrese et al., 2020b] with most token classification and 1-nn approaches exploit-
to build multimodal gloss vectors, which are shown to be ing knowledge graph information in order to reach compet-
stronger than text-only vectors when used to initialize the itive performances. We note that different kinds of knowl-
weights of the classification matrix (𝑂 in Eq. 1). Wikipedia edge are orthogonal to each other and can be exploited in
and Web search contexts are also used as additional data to conjunction. For example, token classification models benefit
create sense embeddings [Scarlini et al., 2020a; Scarlini et al., from the logits-adjacency matrix multiplication [Bevilacqua
2020b; Wang and Wang, 2020] and as an alternative source et al., 2020], binary cross-entropy training [Conia and Nav-
in order to propagate vectors through the WordNet network, igli, 2021], translation-based refinement [Luan et al., 2020]
showing higher performance and better representations for and visual information [Calabrese et al., 2020a].
rare senses.
4 Taking Stock of WSD

In this Section, we review the performance figures of recent WSD models, with details reported in §4.1. In §4.2, we put forward a few high-level guidelines that are meant to help the community to navigate current trends in the field.

4.1 Evaluation Setting

The performance of WSD systems is usually assessed in terms of F1 score over held-out test sets. As a point of comparison, a typical upper bound is given by the inter-annotator agreement (IAA), i.e., the percentage of words tagged with the same sense by two or more human annotators. The IAA over a fine-grained sense inventory is estimated to be around 67-80% accuracy [Navigli, 2009]; these figures, however, call for further studies so as to obtain more centered estimates of human performance, e.g., on up-to-date benchmarks. We report results (collected from the literature) on the English WSD benchmark of Raganato et al. [2017a] in Table 1. All supervised models therein are trained on SemCor (§2.2). Additionally, we report in Table 2 results on the recent XL-WSD multilingual benchmark [Pasini et al., 2021], including i) crosslingual zero-shot token-classification baselines (exploiting XLM-R) trained on (English) SemCor, ii) the same baselines trained on the automatically translated silver corpora provided as part of XL-WSD, and iii) the best knowledge-based multilingual system, i.e., SyntagRank [Scozzafava et al., 2020].
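For reference, the F1 reported in these evaluations is the harmonic mean of precision (correct predictions over attempted instances) and recall (correct predictions over all gold instances), which reduces to accuracy when a system answers every instance. A minimal scorer, written to allow several acceptable gold senses per instance:

```python
def wsd_f1(gold: dict[str, set[str]], predictions: dict[str, str]) -> float:
    # gold: instance id -> set of acceptable gold senses;
    # predictions: instance id -> predicted sense (unanswered instances omitted).
    correct = sum(1 for i, s in predictions.items() if s in gold.get(i, set()))
    precision = correct / len(predictions) if predictions else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```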
4.2 Discussion

Pretrained language models. The use of pretrained language models plays a crucial role in achieving high performance, for both knowledge-based and supervised approaches [Wang and Wang, 2020; Blevins and Zettlemoyer, 2020]. The simple model of Hadiwinoto et al. [2019] results in a 2-point improvement over the best model without pretrained contextualized embeddings, i.e., EWISE [Kumar et al., 2019].

Are knowledge-based methods still relevant? Pure knowledge-based methods are completely outperformed on English WSD, with a gap of 7.2 points between the best knowledge-based method, i.e., SREF_KB, and the best supervised system, i.e., ESCHER. The same trend appears in a recent multilingual benchmark as well [Pasini et al., 2021]. Nevertheless, information within knowledge bases remains valuable and many successful supervised methods are effectively hybridized with knowledge-based methods (§3.2).

Is it worth it to include other kinds of knowledge? Additional information is beneficial to boosting the results, with most token classification and 1-nn approaches exploiting knowledge graph information in order to reach competitive performances. We note that different kinds of knowledge are orthogonal to each other and can be exploited in conjunction. For example, token classification models benefit from the logits-adjacency matrix multiplication [Bevilacqua et al., 2020], binary cross-entropy training [Conia and Navigli, 2021], translation-based refinement [Luan et al., 2020] and visual information [Calabrese et al., 2020a].

Training data. The addition of more training data, e.g., the WNG corpus (§2.2), increases performance significantly, even though this corpus contains a significant amount of noisy silver annotations. Indeed, multiple works [Bevilacqua and Navigli, 2020; Conia and Navigli, 2021] report that concatenating WNG to SemCor increases the performance of their systems from 1.8 to 2.6 F1 points. This makes it worthwhile investigating whether more advanced techniques for the automatic creation of training corpora can be exploited for further gains.


Kind    System                                    ALL    S2     S3     S7     S13    S15
KB      SyntagRank [Scozzafava et al., 2020]      71.7   71.6   72.0   59.3   72.2   75.8
KB      SREF_KB [Wang and Wang, 2020]             73.5   72.7   71.5   61.5   76.4   79.5
1-nn    LMMS [Loureiro and Jorge, 2019]           75.4   76.3   75.6   68.1   75.1   77.0
1-nn    [Berend, 2020]                            76.8   77.9   77.8   68.8   76.1   77.5
1-nn    ARES [Scarlini et al., 2020b]             77.9   78.0   77.1   71.0   77.3   83.2
1-nn    Conception [Conia and Navigli, 2020]      76.4   77.1   76.4   70.3   76.2   77.2
1-nn    [Luan et al., 2020]                       76.4   77.2   77.1   69.2   76.1   77.2
1-nn    SensEmBERT [Scarlini et al., 2020a]       -      -      -      -      78.7   -
1-nn    SREF [Wang and Wang, 2020]                77.8   78.6   76.6   72.1   78.0   80.5
Token   GLU [Hadiwinoto et al., 2019]             74.1   75.5   73.6   68.1   71.1   76.2
Token   SVC [Vial et al., 2019]                   76.7   76.5   77.4   69.5   76.0   78.3
Token   EWISE [Kumar et al., 2019]                71.8   73.8   71.1   67.3   69.4   74.5
Token   BEM [Blevins and Zettlemoyer, 2020]       79.0   79.4   77.4   74.5   79.7   81.7
Token   EViLBERT [Calabrese et al., 2020a]        75.1   -      -      -      -      -
Token   EWISER [Bevilacqua and Navigli, 2020]     78.3   78.9   78.4   71.0   78.9   79.3
Token   [Conia and Navigli, 2021]                 77.6   78.4   77.8   72.2   76.7   78.2
Seq.    GlossBERT [Huang et al., 2019]            77.0   77.7   75.2   72.5   76.1   80.4
Seq.    Generationary [Bevilacqua et al., 2020]   76.7   78.0   75.4   71.9   77.0   77.6
Seq.    [Yap et al., 2020]                        78.7   79.9   77.4   73.0   78.2   81.8
Seq.    ESCHER [Barba et al., 2021]               80.7   81.7   77.8   76.3   82.2   83.2

Table 1: F1 performance figures of recent WSD systems in the literature. We consider results on the evaluation sets (S)enseval-(2)/(3), (S)emEval 200(7)/20(13)/20(15), and on the concatenation of all of them (ALL). All supervised systems (bottom three blocks) use SemCor only as training corpus. The leftmost column indicates the kind of system, i.e., knowledge-based (KB), vector-based 1-nn classifier (1-nn), token tagger (Token) or sequence tagger (Seq.).

What is the best model? In the standard configuration, i.e., trained on SemCor only and tested in terms of F1 over the Raganato et al. [2017a] English benchmark, the best result is achieved by ESCHER [Barba et al., 2021]. As we recall, ESCHER performs WSD by concatenating all glosses and the input context together, extracting the indices corresponding to the predicted definition. Like ESCHER, most of the best performing approaches not only utilize gloss information to represent word senses [Bevilacqua et al., 2020], but do so by encoding it as a sequence rather than directly as a vector [Yap et al., 2020; Blevins and Zettlemoyer, 2020], which appears to be most beneficial for the WSD task. Other models also prove to achieve somewhat lower performances than ESCHER, while bringing distinct advantages. Sequence classification models, especially generative ones, offer zero-shot capabilities over a changing sense inventory [Bevilacqua et al., 2020; Blevins et al., 2021], while 1-nn and token classification approaches are more flexible in terms of integrating task-specific biases, and also more efficient, being able to classify many contexts at once with a single forward pass.

Multilingual WSD. In the past, one of the main arguments in favor of knowledge-based WSD was that of scalability. However, as Table 2 shows, this seems no longer to be the case. Overall, thanks to the availability of pretrained multilingual contextualised embeddings, one can train a simple supervised model on just English and get much higher performances compared to a knowledge-based system, even on languages that are very different, such as Basque, Chinese, Hungarian and Korean. In fact, the crosslingual setting works so well that it outperforms language-specific models trained on silver data, which are probably hampered by noise and distribution skewing effects related to the data creation procedure. However, performance figures are still mostly underwhelming compared to those for English WSD, where supervised results on the concatenation of all datasets start from around 74 F1 points (see Table 1).

Language     XLMR-L (zero-shot)   XLMR-L (T-SC+WNG)   SyntagRank
Basque       47.15                41.96               42.91
Bulgarian    72.00                58.18               61.10
Catalan      49.97                36.00               43.98
Chinese      51.62                -                   41.23
Croatian     72.29                63.15               68.35
Danish       80.61                78.67               72.93
Dutch        59.20                57.27               56.00
Estonian     66.13                50.78               56.31
French       83.88                71.38               69.57
Galician     66.28                56.18               67.56
German       83.18                73.78               75.99
Hungarian    67.64                52.60               57.98
Italian      77.66                77.70               69.57
Japanese     61.87                50.55               57.46
Korean       64.20                -                   50.29
Slovenian    68.36                51.13               52.25
Spanish      75.85                77.26               68.58
Micro F1     65.66                -                   57.68

Table 2: F1 scores of supervised and knowledge-based approaches on XL-WSD test sets. XLMR-L (zero-shot) has been trained and tuned on the English SemCor only. XLMR-L (T-SC+WNG) has been trained and tuned on automatically-translated versions of the SemCor and WNG corpora.
5 Beyond Word Sense Disambiguation

While the WSD task has benefited from recent breakthroughs in transfer learning, even surpassing its expected upper bound, there are certain limits intrinsic to the task itself. The choice of using a discrete sense inventory, while it is computationally convenient, prevents scaling to newer and more creative uses of words, and constrains systems to a given sense granularity, which may be suboptimal for the chosen application.

For these reasons, Pilehvar and Camacho-Collados [2019] chose to eschew discrete meanings altogether by putting forward the Word-in-Context (WiC) task, a tool for evaluating the semantic competence of models without the need for predefined inventories. WiC requires a model to take as input two contexts featuring the same target words, and to predict whether those words are used with the same meaning. Building WiC datasets is easier than building ones for WSD, and indeed large-scale benchmarks are also available for non-English languages [Raganato et al., 2020; Martelli et al., 2021].
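A simple baseline for WiC-style evaluation, sketched below under the assumption that some contextual embedder is available (the embed function is a placeholder), is to compare the target word's contextual vectors in the two sentences and threshold their cosine similarity, tuning the threshold on development data:

```python
import numpy as np

def wic_predict(embed, context_a, context_b, idx_a, idx_b, threshold=0.6):
    # embed(context) -> (seq_len, dim) array of contextual vectors (placeholder).
    va = embed(context_a)[idx_a]  # vector of the target word in the first context
    vb = embed(context_b)[idx_b]  # vector of the target word in the second context
    cos = va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))
    return cos >= threshold       # True: same meaning; False: different meanings
```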


In a different direction, but with the same purpose of dropping the need for predefined inventories, the task of lexical substitution [McCarthy and Navigli, 2009] requires models to disambiguate a word in context by searching for meaning-preserving substitutes. For example, given the context “Would you give me a lift?”, lift would be disambiguated by proposing ride as a candidate for substitution. Lexical substitution can deal with an evolving lexicon, and has straightforward application in, e.g., data augmentation [Kobayashi, 2018], but it suffers from circularity, and a lack of explicitness; also, sometimes non-convoluted substitutes are simply lacking, as for, e.g., gear in “my car doesn’t have a sixth gear”.

More recently, the task of definition modeling [Noraset et al., 2017] has reframed the disambiguation task from Natural Language Understanding (NLU) to NLG: instead of selecting the most relevant sense class, a system generates a description of its meaning. This approach is not limited by sense inventories, as one can generate a definition for basically anything, be it a word in its ordinary meaning, a novel word, a metaphor, or an arbitrarily-sized expression, with obvious applications for language learners. Interestingly, definition modeling can be used to perform WSD by using its beam search output to select the most suitable definitions among those of a predefined inventory [Bevilacqua et al., 2020]. We think definition modeling is a promising way forward for the task, expanding the scope of WSD without big sacrifices as a trade-off.
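As a sketch of this generative formulation, the snippet below uses the Hugging Face seq2seq API to generate a gloss for a target word marked in its context; both the checkpoint path and the marking convention are hypothetical stand-ins for a model finetuned for definition generation, in the spirit of Generationary [Bevilacqua et al., 2020]:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "path/to/definition-generation-model"  # hypothetical finetuned seq2seq
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Mark the target word in its context (the marking convention is an assumption).
context = "my car doesn't have a sixth <define> gear </define>"
inputs = tokenizer(context, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# The beam-search output can also be matched against the glosses of a predefined
# inventory to recover a standard WSD prediction [Bevilacqua et al., 2020].
```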
6 Conclusion (and What’s Next)

In this paper we surveyed recent research on WSD, providing an overview of sense inventories and sense-annotated data, and categorizing and describing current automatic approaches. We discussed different methodologies, pointing out the best practices for reaching competitive performance. The best models for English WSD attain results that are close to or superior to the human upper bound, posing the question of how to interpret such performance. While on some datasets models reach top performance, the WSD task is still not solved [Navigli, 2018; Blevins et al., 2021] and this opens up new exciting directions.

With the breaching of this glass ceiling, current benchmarks are really starting to show their inadequacy. This calls for the construction of new challenging test sets (possibly through adversarial techniques) to shed light on what remains problematic for WSD. Indeed, the behavior of current models in out-of-domain sense distributions should be studied further in the near future, in order to build WSD approaches that are more robust to domain shift and reliable with Web text, e.g., from social media. Moreover, multilingual WSD lacks a comprehensive investigation to assess model capabilities in non-English languages. While the recent cross-lingual evaluation suite, i.e. XL-WSD [Pasini et al., 2021], is a first step towards a large-scale multilingual WSD benchmark, more effort is needed to create training or testing data for as many languages as possible in the coming years.

An additional avenue for research is the integration of WSD with the related task of Entity Linking [Sevgili et al., 2021], in which the model is required to associate mentions with entities in a knowledge base such as Wikipedia. While the existence of BabelNet provides a unified repository that allows one to perform both tasks [Moro et al., 2014], the recent literature has not taken up this path. It is worth exploring whether recent approaches which efficiently classify over the huge output space of Entity Linking [Cao et al., 2021] can be combined with the techniques for the exploitation of glosses and relations developed within the WSD community.

Since WSD systems now work fairly well, it is time to employ them in other applications too, e.g., boosting semantic-intensive downstream tasks such as Machine Translation, Semantic Role Labeling, and Question Answering. Finally, WSD could help pretrained language models to ground word representations onto a knowledge base [Pappas et al., 2020], providing the semantics they seem to lack [Bender and Koller, 2020], and a gateway to other information sources and perceptive domains, such as vision: a whole new realm that NLP, with approaches such as Vokenizer [Tan and Bansal, 2020], is just now starting to exploit, and in doing so may finally break out of its sandbox!

Acknowledgments

The authors gratefully acknowledge the support of the ERC Consolidator Grants MOUSSE No. 726487, and FoTran No. 771113 under the European Union’s Horizon 2020 research and innovation programme. This work was supported in part by the MIUR under the grant “Dipartimenti di eccellenza 2018-2022” of the Department of Computer Science of Sapienza University and by the Innovation Fund Denmark under the LEGALESE project.

References

[Agirre and Edmonds, 2007] Eneko Agirre and Philip Edmonds. Word Sense Disambiguation: Algorithms and applications, volume 33. Springer Science & Business Media, 2007.


[Agirre et al., 2014] Eneko Agirre, Oier Lopez de Lacalle, and Aitor Soroa. Random walks for knowledge-based Word Sense Disambiguation. Computational Linguistics, pages 57–84, 2014.
[Barba et al., 2020] Edoardo Barba, Luigi Procopio, Niccolò Campolungo, Tommaso Pasini, and Roberto Navigli. MuLaN: Multilingual label propagation for Word Sense Disambiguation. In Proc. of IJCAI, pages 3837–3844, 2020.
[Barba et al., 2021] Edoardo Barba, Tommaso Pasini, and Roberto Navigli. ESC: Redesigning WSD with extractive sense comprehension. In Proc. of NAACL, 2021.
[Bender and Koller, 2020] Emily M. Bender and Alexander Koller. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proc. of ACL, pages 5185–5198, 2020.
[Berend, 2020] Gábor Berend. Sparsity makes sense: Word Sense Disambiguation using sparse contextualized word representations. In Proc. of EMNLP, pages 8498–8508, 2020.
[Bevilacqua and Navigli, 2019] Michele Bevilacqua and Roberto Navigli. Quasi bidirectional encoder representations from Transformers for Word Sense Disambiguation. In Proc. of RANLP, pages 122–131, 2019.
[Bevilacqua and Navigli, 2020] Michele Bevilacqua and Roberto Navigli. Breaking through the 80% glass ceiling: Raising the state of the art in Word Sense Disambiguation by incorporating knowledge graph information. In Proc. of ACL, pages 2854–2864, 2020.
[Bevilacqua et al., 2020] Michele Bevilacqua, Marco Maru, and Roberto Navigli. Generationary or “How we went beyond word sense inventories and learned to gloss”. In Proc. of EMNLP, pages 7207–7221, 2020.
[Blevins and Zettlemoyer, 2020] Terra Blevins and Luke Zettlemoyer. Moving down the long tail of Word Sense Disambiguation with gloss informed bi-encoders. In Proc. of ACL, 2020.
[Blevins et al., 2021] Terra Blevins, Mandar Joshi, and Luke Zettlemoyer. FEWS: Large-scale, low-shot Word Sense Disambiguation with the dictionary. In Proc. of EACL, 2021.
[Calabrese et al., 2020a] Agostina Calabrese, Michele Bevilacqua, and Roberto Navigli. EViLBERT: Learning task-agnostic multimodal sense embeddings. In Proc. of IJCAI, 2020.
[Calabrese et al., 2020b] Agostina Calabrese, Michele Bevilacqua, and Roberto Navigli. Fatality killed the cat or: BabelPic, a multimodal dataset for non-concrete concepts. In Proc. of ACL, pages 4680–4686, 2020.
[Cao et al., 2021] Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. Autoregressive entity retrieval. In Proc. of ICLR, 2021.
[Conia and Navigli, 2020] Simone Conia and Roberto Navigli. Conception: Multilingually-enhanced, human-readable concept vector representations. In Proc. of COLING, 2020.
[Conia and Navigli, 2021] Simone Conia and Roberto Navigli. Framing Word Sense Disambiguation as a multi-label problem for model-agnostic knowledge integration. In Proc. of EACL, 2021.
[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional Transformers for language understanding. In Proc. of NAACL, pages 4171–4186, 2019.
[Edmonds and Cotton, 2001] Philip Edmonds and Scott Cotton. SENSEVAL-2: Overview. In Proc. of SENSEVAL-2, 2001.
[Hadiwinoto et al., 2019] Christian Hadiwinoto, Hwee Tou Ng, and Wee Chung Gan. Improved Word Sense Disambiguation using pre-trained contextualized word representations. In Proc. of EMNLP, pages 5297–5306, 2019.
[Hovy et al., 2006] Eduard Hovy, Mitch Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. OntoNotes: The 90% solution. In Proc. of NAACL, pages 57–60, 2006.
[Huang et al., 2019] Luyao Huang, Chi Sun, Xipeng Qiu, and Xuanjing Huang. GlossBERT: BERT for Word Sense Disambiguation with gloss knowledge. In Proc. of EMNLP, 2019.
[Kobayashi, 2018] Sosuke Kobayashi. Contextual augmentation: Data augmentation by words with paradigmatic relations. In Proc. of NAACL, pages 452–457, 2018.
[Kumar et al., 2019] Sawan Kumar, Sharmistha Jat, Karan Saxena, and Partha Talukdar. Zero-shot Word Sense Disambiguation using sense definition embeddings. In Proc. of ACL, 2019.
[Lacerra et al., 2020] Caterina Lacerra, Michele Bevilacqua, Tommaso Pasini, and Roberto Navigli. CSI: A coarse sense inventory for 85% Word Sense Disambiguation. In Proc. of AAAI, pages 8123–8130, 2020.
[Lewis et al., 2020] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proc. of ACL, pages 7871–7880, 2020.
[Loureiro and Camacho-Collados, 2020] Daniel Loureiro and Jose Camacho-Collados. Don’t neglect the obvious: On the role of unambiguous words in Word Sense Disambiguation. In Proc. of EMNLP, pages 3514–3520, 2020.
[Loureiro and Jorge, 2019] Daniel Loureiro and Alípio Jorge. Language modelling makes sense: Propagating representations through WordNet for full-coverage Word Sense Disambiguation. In Proc. of ACL, pages 5682–5691, 2019.
[Luan et al., 2020] Yixing Luan, Bradley Hauer, Lili Mou, and Grzegorz Kondrak. Improving Word Sense Disambiguation with translations. In Proc. of EMNLP, pages 4055–4065, 2020.
[Martelli et al., 2021] Federico Martelli, Najla Kalach, Gabriele Tola, and Roberto Navigli. SemEval-2021 task 2: Multilingual and cross-lingual word-in-context disambiguation (MCL-WiC). In Proc. of SemEval, 2021.
[Maru et al., 2019] Marco Maru, Federico Scozzafava, Federico Martelli, and Roberto Navigli. SyntagNet: Challenging supervised Word Sense Disambiguation with lexical-semantic combinations. In Proc. of EMNLP, pages 3534–3540, 2019.
[McCarthy and Navigli, 2009] Diana McCarthy and Roberto Navigli. The English lexical substitution task. Language Resources and Evaluation, 43(2):139–159, 2009.
[McCrae et al., 2020] John Philip McCrae, Alexandre Rademaker, Ewa Rudnicka, and Francis Bond. English WordNet 2020: Improving and extending a WordNet for English using an open-source methodology. In Proc. of MMW, 2020.
[Miller et al., 1990] George A. Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. Introduction to WordNet: An on-line lexical database. International Journal of Lexicography, pages 235–244, 1990.
[Miller et al., 1993] George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. A semantic concordance. In Human Language Technology, 1993.


[Moro and Navigli, 2015] Andrea Moro and Roberto Navigli. SemEval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. In Proc. of SemEval, 2015.
[Moro et al., 2014] Andrea Moro, Alessandro Raganato, and Roberto Navigli. Entity linking meets Word Sense Disambiguation: A unified approach. TACL, pages 231–244, 2014.
[Nancy and Jean, 1998] Ide Nancy and Veronis Jean. Word Sense Disambiguation: The state of the art. Computational Linguistics, pages 1–40, 1998.
[Navigli and Ponzetto, 2012] Roberto Navigli and Simone Paolo Ponzetto. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, pages 217–250, 2012.
[Navigli et al., 2013] Roberto Navigli, David Jurgens, and Daniele Vannella. SemEval-2013 task 12: Multilingual Word Sense Disambiguation. In Proc. of SemEval, 2013.
[Navigli et al., 2021] Roberto Navigli, Michele Bevilacqua, Simone Conia, Dario Montagnini, and Francesco Cecconi. Ten years of BabelNet: A survey. In Proc. of IJCAI, 2021.
[Navigli, 2009] Roberto Navigli. Word Sense Disambiguation: A survey. ACM Computing Surveys (CSUR), pages 1–69, 2009.
[Navigli, 2018] Roberto Navigli. Natural language understanding: Instructions for (present and future) use. In Proc. of IJCAI, pages 5697–5702, 2018.
[Noraset et al., 2017] Thanapon Noraset, Chen Liang, Larry Birnbaum, and Doug Downey. Definition modeling: Learning to define word embeddings in natural language. In Proc. of AAAI, 2017.
[Page et al., 1999] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, 1999.
[Pappas et al., 2020] Nikolaos Pappas, Phoebe Mulcaire, and Noah A. Smith. Grounded compositional outputs for adaptive language modeling. In Proc. of EMNLP, pages 1252–1267, 2020.
[Pasini and Navigli, 2020] Tommaso Pasini and Roberto Navigli. Train-O-Matic: Supervised Word Sense Disambiguation with no (manual) effort. Artificial Intelligence, 2020.
[Pasini et al., 2021] Tommaso Pasini, Alessandro Raganato, and Roberto Navigli. XL-WSD: An extra-large and cross-lingual evaluation framework for Word Sense Disambiguation. In Proc. of AAAI, 2021.
[Pasini, 2020] Tommaso Pasini. The knowledge acquisition bottleneck problem in multilingual Word Sense Disambiguation. In Proc. of IJCAI, 2020.
[Peters et al., 2018] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proc. of NAACL, pages 2227–2237, 2018.
[Petrolito and Bond, 2014] Tommaso Petrolito and Francis Bond. A survey of WordNet annotated corpora. In Proc. of GWNC, 2014.
[Pilehvar and Camacho-Collados, 2019] Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: The word-in-context dataset for evaluating context-sensitive meaning representations. In Proc. of NAACL, pages 1267–1273, 2019.
[Pilehvar and Navigli, 2014] Mohammad Taher Pilehvar and Roberto Navigli. A large-scale pseudoword-based evaluation framework for state-of-the-art Word Sense Disambiguation. Computational Linguistics, pages 837–881, 2014.
[Pradhan et al., 2007] Sameer Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer. SemEval-2007 task-17: English lexical sample, SRL and all words. In Proc. of SemEval, 2007.
[Pustejovsky, 1998] James Pustejovsky. The Generative Lexicon. MIT Press, 1998.
[Raganato et al., 2017a] Alessandro Raganato, Jose Camacho-Collados, and Roberto Navigli. Word Sense Disambiguation: A unified evaluation framework and empirical comparison. In Proc. of EACL, pages 99–110, 2017.
[Raganato et al., 2017b] Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. Neural sequence learning models for Word Sense Disambiguation. In Proc. of EMNLP, 2017.
[Raganato et al., 2020] Alessandro Raganato, Tommaso Pasini, Jose Camacho-Collados, and Mohammad Taher Pilehvar. XL-WiC: A multilingual benchmark for evaluating semantic contextualization. In Proc. of EMNLP, 2020.
[Scarlini et al., 2019] Bianca Scarlini, Tommaso Pasini, and Roberto Navigli. Just “OneSeC” for producing multilingual sense-annotated data. In Proc. of ACL, pages 699–709, 2019.
[Scarlini et al., 2020a] Bianca Scarlini, Tommaso Pasini, and Roberto Navigli. SensEmBERT: Context-enhanced sense embeddings for multilingual Word Sense Disambiguation. In Proc. of AAAI, pages 8758–8765, 2020.
[Scarlini et al., 2020b] Bianca Scarlini, Tommaso Pasini, and Roberto Navigli. With more contexts comes better performance: Contextualized sense embeddings for all-round Word Sense Disambiguation. In Proc. of EMNLP, pages 3528–3539, 2020.
[Scozzafava et al., 2020] Federico Scozzafava, Marco Maru, Fabrizio Brignone, Giovanni Torrisi, and Roberto Navigli. Personalized PageRank with syntagmatic information for multilingual Word Sense Disambiguation. In Proc. of ACL (demos), 2020.
[Sevgili et al., 2021] Ozge Sevgili, Artem Shelmanov, Mikhail Arkhipov, Alexander Panchenko, and Chris Biemann. Neural entity linking: A survey of models based on deep learning, 2021.
[Snyder and Palmer, 2004] Benjamin Snyder and Martha Palmer. The English all-words task. In Proc. of Senseval, 2004.
[Taghipour and Ng, 2015] Kaveh Taghipour and Hwee Tou Ng. One million sense-tagged instances for Word Sense Disambiguation and induction. In Proc. of CoNLL, pages 338–344, 2015.
[Tan and Bansal, 2020] Hao Tan and Mohit Bansal. Vokenization: Improving language understanding with contextualized, visual-grounded supervision. In Proc. of EMNLP, 2020.
[Tripodi and Navigli, 2019] Rocco Tripodi and Roberto Navigli. Game theory meets embeddings: A unified framework for Word Sense Disambiguation. In Proc. of EMNLP, pages 88–99, 2019.
[Vial et al., 2019] Loic Vial, Benjamin Lecouteux, and Didier Schwab. Sense vocabulary compression through the semantic knowledge of WordNet for neural Word Sense Disambiguation. In Proc. of GWNC, 2019.
[Wang and Wang, 2020] Ming Wang and Yinglin Wang. A synset relation-enhanced framework with a try-again mechanism for Word Sense Disambiguation. In Proc. of EMNLP, 2020.
[Weaver, 1949] Warren Weaver. Translation. Machine Translation of Languages: Fourteen Essays, 1949.
[Yap et al., 2020] Boon Peng Yap, Andrew Koh, and Eng Siong Chng. Adapting BERT for Word Sense Disambiguation with gloss selection objective and example sentences. In Findings of EMNLP, pages 41–46, 2020.

