Label Verbalization and Entailment for Effective Zero- and Few-Shot Relation Extraction


Oscar Sainz Oier Lopez de Lacalle Gorka Labaka
Ander Barrena Eneko Agirre

HiTZ Basque Center for Language Technologies - Ixa NLP Group


University of the Basque Country (UPV/EHU)
{oscar.sainz, oier.lopezdelacalle, gorka.labaka, ander.barrena, e.agirre}@ehu.eus

Abstract

Relation extraction systems require large amounts of labeled examples which are costly to annotate. In this work we reformulate relation extraction as an entailment task, with simple, hand-made verbalizations of relations produced in less than 15 minutes per relation. The system relies on a pretrained textual entailment engine which is run as-is (no training examples, zero-shot) or further fine-tuned on labeled examples (few-shot or fully trained). In our experiments on TACRED we attain 63% F1 zero-shot, 69% with 16 examples per relation (17 F1 points better than the best supervised system under the same conditions), and only 4 points short of the state of the art (which uses 20 times more training data). We also show that the performance can be improved significantly with larger entailment models, up to 12 points in zero-shot, giving the best results to date on TACRED when fully trained. The analysis shows that our few-shot systems are especially effective when discriminating between relations, and that the performance difference in low data regimes comes mainly from identifying no-relation cases.

1 Introduction

Given a context where two entities appear, the Relation Extraction (RE) task aims to predict the semantic relation (if any) holding between the two entities. Methods that fine-tune large pretrained language models (LMs) with large amounts of labelled data have established the state of the art (Yamada et al., 2020). Nevertheless, due to differing languages, domains and the cost of human annotation, there is typically a very small number of labelled examples in real-world applications, and such models perform poorly (Schick and Schütze, 2021).

As an alternative, methods that only need a few examples (few-shot) or no examples (zero-shot) have emerged. For instance, prompt-based learning proposes hand-made or automatically learned task and label verbalizations (Puri and Catanzaro, 2019; Schick and Schütze, 2021; Schick and Schütze, 2020) as an alternative to standard fine-tuning (Gao et al., 2020; Scao and Rush, 2021). In these methods, the prompts are input to the LM together with the example, and the language modelling objective is used in learning and inference. In a different direction, some authors reformulate the target task (e.g. document classification) as a pivot task (typically question answering or textual entailment), which allows the use of readily available question answering (or entailment) training data (Yin et al., 2019; Levy et al., 2017). In all cases, the underlying idea is to cast the target task into a formulation which allows us to exploit the knowledge implicit in pre-trained LMs (prompt-based) or general-purpose question answering or entailment engines (pivot tasks).

Prompt-based approaches are very effective when the label verbalization is given by one or two words (e.g. text classification), as these can be easily predicted by language models, but they struggle in cases where the label requires a more elaborate description, as in RE. We thus propose to reformulate RE as an entailment problem, where the verbalizations of the relation label are used to produce a hypothesis to be confirmed by an off-the-shelf entailment engine.

In our work [1] we have manually constructed verbalization templates for a given set of relations. Given that some verbalizations might be ambiguous (between city of birth and country of birth, for instance), we complemented them with entity type constraints. In order to ensure that the manual work involved is limited and practical in real-world applications, we allowed at most 15 minutes of manual labor per relation. The verbalizations are used as-is for zero-shot RE, but we also recast labelled RE examples as entailment pairs and fine-tune the entailment engine for few-shot RE.

[1] Code and splits available at: https://github.com/osainz59/Ask2Transformers
The results on the widely used TACRED (Zhang et al., 2017) RE dataset in zero- and few-shot scenarios are excellent, well over state-of-the-art systems using the same amount of data. In addition, our method scales well with large pre-trained LMs and large amounts of training data, reporting the best results on TACRED to date.

2 Related Work

Textual Entailment. It was first presented by Dagan et al. (2006) and further developed by Bowman et al. (2015), who called it Natural Language Inference (NLI). Given a textual premise and hypothesis, the task is to decide whether the premise entails or contradicts (or is neutral to) the hypothesis. The current state of the art uses large pre-trained LMs fine-tuned on NLI datasets (Lan et al., 2020; Liu et al., 2019; Conneau et al., 2020; Lewis et al., 2020; He et al., 2021).

Relation Extraction. The best results to date on RE are obtained by fine-tuning large pre-trained language models equipped with a classification head. Joshi et al. (2020) pretrain a masked language model on random contiguous spans to learn span boundaries and predict the entire masked span. LUKE (Yamada et al., 2020) further pretrains a LM predicting entities from Wikipedia, and uses entity information as an additional input embedding layer. K-Adapter (Wang et al., 2020) fixes the parameters of the pretrained LM and uses Adapters to infuse factual and linguistic knowledge from Wikipedia and dependency parsing.

TACRED (Zhang et al., 2017) is the largest and most widely used dataset for RE in English. It is derived from the TAC-KBP relation set, with labels obtained via crowdsourcing. Although alternative versions of TACRED have been published recently (Alt et al., 2020; Stoica et al., 2021), the state of the art is mainly tested on the original version.

Zero-Shot and Few-Shot learning. Brown et al. (2020) showed that task descriptions (prompts) can be fed into LMs for task-agnostic and few-shot performance. In addition, Schick and Schütze (2020, 2021) and Tam et al. (2021) extend the method and allow fine-tuning of LMs on a variety of tasks. Prompt-based prediction treats the downstream task as a (masked) language modeling problem, where the model directly generates a textual response to a given prompt. The manual generation of effective prompts is costly and requires domain expertise. Gao et al. (2020) provide an effective way to generate prompts for text classification tasks that surpasses the performance of hand-picked ones. The approach uses few-shot training with a generative T5 model (Raffel et al., 2020) to learn to decode effective prompts. Similarly, Liu et al. (2021) automatically search prompts in an embedding space which can be simultaneously fine-tuned along with the pre-trained language model. Note that previous prompt-based models run their zero-shot models in a semi-supervised setting in which some amount of labeled data is given in training. Prompts can be easily generated for text classification; other tasks require more elaborate templates (Goswami et al., 2020; Li et al., 2021), and currently no effective prompt-based methods for RE exist.

Besides prompt-based methods, the use of pivot tasks has been widely used for few/zero-shot learning. For instance, relation and event extraction have been cast as a question answering problem (Levy et al., 2017; Du and Cardie, 2020), associating each slot label to at least one natural language question. Closer to our work, NLI has also been shown to be a successful pivot task for text classification (Yin et al., 2019, 2020; Wang et al., 2021; Sainz and Rigau, 2021). These works verbalize the labels, and apply an entailment engine to check whether the input text entails the label description.

In work similar to ours, the relation between entailment and RE was explored by Obamuyide and Vlachos (2018). They present some preliminary experiments where they cast RE as entailment, but only evaluate performance as binary entailment, not as a RE task. As a consequence they do not have competing positive labels and avoid RE inference and the issue of detecting no-relation.

Partially vs. fully unseen labels in RE. Existing zero/few-shot RE models usually see some labels during training (label partially unseen), which helps generalize to the unseen labels (Levy et al., 2017; Obamuyide and Vlachos, 2018; Han et al., 2018; Chen and Li, 2021). These approaches do not fully address the data scarcity problem. In this work we address the more challenging label fully unseen scenario.
Figure 1: General workflow of our entailment-based RE approach.

3 Entailment for RE

In this section we describe our models for zero- and few-shot RE.

3.1 Zero-shot relation extraction

We reformulate RE as an entailment task: given the input text containing the two entity mentions as the premise and the verbalized description of a relation as the hypothesis, the task is to infer whether the premise entails the hypothesis according to the NLI model. Figure 1 illustrates the three main steps of our system. The first step is focused on relation verbalization to generate the set of hypotheses. In the second we run the NLI model [2] and obtain the entailment probability for each hypothesis. Finally, based on the probabilities and the entity types, we return the relation label that maximizes the probability of the hypothesis, including the NO-RELATION label.

Verbalizing relations as hypotheses. The hypotheses are automatically generated using a set of templates. Each template verbalizes the relation holding between two entity mentions. For instance, the relation PER:DATE_OF_BIRTH can be verbalized with the following template: {subj}'s birthday is on {obj}. More formally, given the text x that contains the mentions of two entities (x_e1, x_e2) and a template t, the hypothesis h is generated by VERBALIZE(t, x_e1, x_e2), which substitutes the subj and obj placeholders in t with the entities x_e1 and x_e2, respectively [3]. Figure 1 shows four verbalizations for the given entity pair.

[2] We describe the NLI models in Section 4.3.
[3] Note that the entities are given in a fixed order, that is, the relation needs to hold between x_e1 and x_e2 in that order; the reverse (x_e2 and x_e1) would be a different example.

A relation label can be verbalized by one or more templates. For instance, in addition to the previous template, PER:DATE_OF_BIRTH is also verbalized with {subj} was born on {obj}. At the same time, a template can verbalize more than one relation label. For example, {subj} was born in {obj} verbalizes PER:COUNTRY_OF_BIRTH and PER:CITY_OF_BIRTH. In order to cope with such ambiguous verbalizations, we added the entity type information to each relation, e.g. COUNTRY and CITY for each of the relations in the previous example [4].

[4] Alternatively, one could think of more specific verbalizations, such as {subj} was born in the city of {obj} for PER:CITY_OF_BIRTH. In the checks done within the available 15 minutes, such specific verbalizations had very low recall and were not finally selected.

We defined a function \delta_r for every relation r \in R that checks the entity coherence between the template and the current relation label:

    \delta_r(e_1, e_2) = \begin{cases} 1 & \text{if } e_1 \in E_r^1 \wedge e_2 \in E_r^2 \\ 0 & \text{otherwise} \end{cases}

where e_1 and e_2 are the entity types of the first and second arguments, and E_r^1 and E_r^2 are the sets of allowed types for the first and second entities in relation r. This function is used at inference time to discard relations that do not match the given types. Appendix C lists all templates and entity type restrictions used in this work.
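To make the verbalization step and the entity-type filter concrete, the following is a minimal Python sketch of VERBALIZE and \delta_r. It is not the authors' released implementation (see the Ask2Transformers repository for that): the function names are ours, and only two relations from Appendix C are shown.

```python
# Illustrative sketch of VERBALIZE(t, x_e1, x_e2) and delta_r; templates and
# allowed argument types follow Appendix C, everything else is illustrative.
RELATION_TEMPLATES = {
    "per:date_of_birth": ["{subj}'s birthday is on {obj}",
                          "{subj} was born on {obj}"],
    "per:city_of_birth": ["{subj} was born in {obj}"],
}
# Allowed (subject types, object types) per relation, used by delta_r.
RELATION_TYPES = {
    "per:date_of_birth": ({"PERSON"}, {"DATE"}),
    "per:city_of_birth": ({"PERSON"}, {"CITY", "LOCATION"}),
}

def verbalize(template: str, subj: str, obj: str) -> str:
    """Fill the {subj}/{obj} placeholders with the two entity mentions."""
    return template.format(subj=subj, obj=obj)

def delta(relation: str, subj_type: str, obj_type: str) -> int:
    """delta_r(e1, e2): 1 if both entity types are allowed for the relation."""
    allowed_subj, allowed_obj = RELATION_TYPES[relation]
    return int(subj_type in allowed_subj and obj_type in allowed_obj)

hypotheses = [verbalize(t, "John Smith", "May 25, 1992")
              for t in RELATION_TEMPLATES["per:date_of_birth"]]
print(hypotheses)
print(delta("per:city_of_birth", "PERSON", "DATE"))  # 0: a DATE cannot be a city
```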
NLI for inferring relations. In the second step we make use of the NLI model to infer the relation label. Given the text x containing two entities x_e1 and x_e2, the system returns the relation \hat{r} from the set of possible relation labels R with the highest entailment probability as follows:

    \hat{r} = \arg\max_{r \in R} P_r(x, x_{e_1}, x_{e_2})    (1)

The probability of each relation P_r is computed as the probability of the hypothesis that yields the maximum entailment probability (Eq. 2), among the set of possible hypotheses. In case the two entities do not match the required entity types, the probability is zero:

    P_r(x, x_{e_1}, x_{e_2}) = \delta_r(e_1, e_2) \max_{t \in T_r} P_{NLI}(x, \mathrm{hyp}), \quad \text{where } \mathrm{hyp} = \mathrm{VERBALIZE}(t, x_{e_1}, x_{e_2})    (2)

where P_NLI is the entailment probability between the input text and the hypothesis generated by the template verbalizer. Although entailment models return probabilities for entailment, contradiction and neutral, P_NLI just makes use of the entailment probability [5]. The right-hand side of Figure 1 shows the application of the NLI models and how the probability for each relation, P_r, is computed.

[5] The probabilities for relations P_r defined in Eq. 2 are independent from each other, which means they could be easily extended to a multi-label classification task.

Detection of no-relation. In supervised RE, the NO-RELATION case is taken as an additional label. In our case we examined two approaches.

In template-based detection we propose an additional template as if it were yet another relation label, and treat it as another positive relation in Eq. 1. The template for NO-RELATION is: {subj} and {obj} are not related.

In threshold-based detection we apply a threshold T to P_r in Eq. 2. If none of the relations surpasses the threshold, then our system returns NO-RELATION. Otherwise, the model returns the relation label of highest probability (Eq. 1). When no development data is available, the threshold T is set to 0.5. Alternatively, we estimate T using the available development dataset, as described in the experimental part.
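The inference step of Eq. (1)-(2), combined with threshold-based no-relation detection, can be sketched as follows. The sketch reuses RELATION_TEMPLATES, verbalize and delta from the previous code block; the model choice, helper names and the defensive entailment-label lookup are our assumptions, not the paper's released code.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # any MNLI checkpoint from Appendix A
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()
# Index of the entailment class; checkpoints differ in how they order labels.
ENTAILMENT = [i for i, lab in model.config.id2label.items()
              if "entail" in lab.lower()][0]

def entailment_prob(premise: str, hypothesis: str) -> float:
    """P_NLI(x, hyp): probability of the entailment class only."""
    enc = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**enc).logits.softmax(dim=-1)[0]
    return probs[ENTAILMENT].item()

def relation_prob(text, subj, obj, subj_type, obj_type, relation):
    """Eq. (2): P_r = delta_r(e1, e2) * max over the relation's templates."""
    if not delta(relation, subj_type, obj_type):
        return 0.0
    return max(entailment_prob(text, verbalize(t, subj, obj))
               for t in RELATION_TEMPLATES[relation])

def predict_relation(text, subj, obj, subj_type, obj_type, threshold=0.5):
    """Eq. (1) plus threshold-based no-relation detection."""
    scores = {r: relation_prob(text, subj, obj, subj_type, obj_type, r)
              for r in RELATION_TEMPLATES}
    best, score = max(scores.items(), key=lambda kv: kv[1])
    return best if score >= threshold else "no_relation"
```

For template-based no-relation detection one would instead add the {subj} and {obj} are not related. template as one more entry of RELATION_TEMPLATES and drop the threshold.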
3.2 Few-Shot relation extraction

Our system is based on an NLI model which has been pretrained on annotated entailment pairs. When labeled relation examples exist, we can reformulate them as labelled NLI pairs and use them to fine-tune the NLI model to the task at hand, that is, assigning the highest entailment probability to the verbalizations of the correct relation, and assigning low entailment probabilities to the rest of the hypotheses (see Eq. 2).

Given a set of labelled relation examples, we use the following steps to produce labelled entailment pairs for fine-tuning the NLI model, as sketched in the code below. 1) For each positive relation example we generate at least one entailment instance with the templates that describe the current relation; that is, we generate one or several premise-hypothesis pairs labelled as entailment. 2) For each positive relation example we generate one neutral premise-hypothesis instance, taken at random from the templates that do not represent the current relation. 3) For each negative relation example we generate one contradiction example, taken at random from the templates of the rest of the relations.

If a template is used for the no-relation case, we do the following: first, for each no-relation example we generate one entailment example with the no-relation template; then, for each positive relation example we generate one contradiction example using the no-relation template.
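The three rules above can be turned into entailment training pairs roughly as follows. RELATION_TEMPLATES and verbalize are the helpers from the Section 3.1 sketch, the NLI label strings follow the usual MNLI convention, and the optional no-relation-template variant is omitted; the sampling details are our reading of the rules, not the authors' exact implementation.

```python
import random

def nli_pairs(example, relations, no_relation="no_relation"):
    """Turn one labelled RE example into (premise, hypothesis, label) pairs."""
    text, subj, obj, rel = (example["text"], example["subj"],
                            example["obj"], example["relation"])
    others = [r for r in relations if r not in (rel, no_relation)]
    pairs = []
    if rel != no_relation:
        # 1) entailment pairs from the templates of the gold relation
        for t in RELATION_TEMPLATES[rel]:
            pairs.append((text, verbalize(t, subj, obj), "entailment"))
        # 2) one neutral pair from a template of a different relation
        t = random.choice(RELATION_TEMPLATES[random.choice(others)])
        pairs.append((text, verbalize(t, subj, obj), "neutral"))
    else:
        # 3) one contradiction pair per negative (no-relation) example
        t = random.choice(RELATION_TEMPLATES[random.choice(others)])
        pairs.append((text, verbalize(t, subj, obj), "contradiction"))
    return pairs
```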
4 Experimental Setup

In this section we describe the dataset and scenarios we have used for evaluation, how we performed the verbalization process, the different pre-trained NLI models we have used, and the state-of-the-art baselines that we compare with.

4.1 Dataset and scenarios

We designed three different low-resource scenarios based on the large-scale TACRED (Zhang et al., 2017) dataset. The full dataset consists of 42 relation labels, including the NO-RELATION label, and each example contains information about the entity types, among other linguistic information. The scenarios are described in Table 1 and are formed by different splits of the original dataset. We applied a stratified sampling method to keep the original label distribution.

                          | Train (Gold)             | Train (Silver)           | Development
Scenario        Split     | Pos/rel  Pos     Neg     | Pos/rel  Pos     Neg     | Pos/rel  Pos    Neg
Full training   100%      | 317.4    13013   55112   | -        -       -       | 132.6    5436   17195
Zero-Shot       No Dev    | -        -       -       | -        -       -       | 0        0      0
Zero-Shot       1% Dev    | -        -       -       | -        -       -       | 1.9      54     173
Few-Shot        1%        | 3.6      130     552     | -        -       -       | 1.9      54     173
Few-Shot        5%        | 16.3     651     2756    | -        -       -       | 7.0      272    861
Few-Shot        10%       | 32.6     1302    5513    | -        -       -       | 13.6     544    1721
Data Augment.   0%        | 0        0       0       | 246.3    9850    41205   | 1.9      54     173
Data Augment.   1%        | 3.6      130     552     | 246.3    9850    41205   | 1.9      54     173
Data Augment.   5%        | 16.3     651     2756    | 246.3    9850    41205   | 7.0      272    861
Data Augment.   10%       | 32.6     1302    5513    | 246.3    9850    41205   | 13.6     544    1721

Table 1: Statistics about the dataset scenarios based on TACRED used in the paper: mean positive examples per relation (Pos/rel), total amount of positive examples (Pos) and total amount of negative (no-relation) examples (Neg).
Zero-Shot. The aim of this scenario is the evaluation of the models when no data is available for training. We present two different situations in this scenario: 1) no data is available for development (0% split), and 2) a small development set is available with around 2 examples per relation (1% split) [6]. In this scenario the models are not allowed to train their own parameters, but development data is used to adjust the hyperparameters.

[6] This setting is comparable to one where the examples in the guidelines are used as development.

Few-Shot. This scenario presents the challenge of solving the RE task with just a few examples per relation. We present three settings commonly used in few-shot learning (Gao et al., 2020) [7]: around 4 examples per relation (1% of the training data in TACRED), around 16 examples per relation (5%), and around 32 examples per relation (10%). We reduced the development set following the same ratio.

[7] The commonly reported value in few-shot scenarios is 16 examples per label. We also added the 3-8 and 32 examples settings in the evaluation.

Full Training. In this setting we use all available training and development data.

Data Augmentation. In this scenario we want to test whether a silver dataset produced by running our systems on untagged data can be used to train a supervised relation extraction system (cf. Section 3). In this scenario 75% of the training data in TACRED is set aside as unlabeled data [8], and the rest of the training data is used in different splits (ranging from 1% to 10%). Under this setting we carried out two types of experiments. In the zero-shot experiments (0% in the table) we use our NLI-based model to annotate the silver data and then fine-tune the RE model exclusively on the silver data. In the few-shot experiments the NLI model is first fine-tuned with the gold data, then used to annotate the silver data, and finally the RE model is fine-tuned over both silver and gold annotations.

[8] We use part of the original TACRED dataset to produce silver data in order not to introduce noise coming from different documents and/or pre-processing steps.

4.2 Hand-crafted relation templates

We manually created the templates to verbalize relation labels, based on the TAC-KBP guidelines which underlie the TACRED dataset. We limited the time for creating the templates of each relation to less than 15 minutes. Overall, we created 1-8 templates per relation (2 on average) (cf. Appendix C for the full list).

The verbalization process consists of generating one or more templates that describe the relation and contain the placeholders {subj} and {obj}. The developer building the templates was given the task guidelines (a brief description of the relation, including one or two examples and the types of the entities) and an NLI model (the roberta-large-mnli checkpoint). For a given relation, he/she would create a template (or set of templates) and check whether the NLI model is able to output a high entailment probability for the template when applied to the guideline example(s), as in the sketch below. He/she could run this process for any new template that he/she could come up with. There was no strict threshold involved for selecting the templates, just the intuition of the developer. The spirit was to come up with simple templates quickly, and not to build numerous complex templates or to optimize entailment probabilities.
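One convenient way to reproduce this kind of check is the zero-shot-classification pipeline from Transformers. The guideline-style sentence below is invented, the two candidate verbalizations are the per:schools_attended templates from Appendix C, and note that the pipeline normalizes scores slightly differently from the raw entailment probability used in Section 3.1.

```python
from transformers import pipeline

nli = pipeline("zero-shot-classification", model="roberta-large-mnli")

# Invented guideline-style example for per:schools_attended.
premise = "John Smith graduated from Harvard University in 1990."
candidates = [
    "John Smith studied in Harvard University",
    "John Smith graduated from Harvard University",
]

# hypothesis_template="{}" passes each candidate verbalization through as-is;
# multi_label=True scores every hypothesis independently (entailment vs.
# contradiction), which approximates, but is not identical to, the raw
# entailment probability used elsewhere in the paper.
result = nli(premise, candidate_labels=candidates,
             hypothesis_template="{}", multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{score:.3f}  {label}")
```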
NLI Model         # Param.   MNLI Acc.  | No Dev (T = 0.5)        | 1% Dev
                                        | Pr.    Rec.   F1        | Pr.    Rec.   F1
ALBERT xxLarge    223M       90.8       | 32.6   79.5   46.2      | 55.2   58.1   56.6 ±1.4
RoBERTa           355M       90.2       | 32.8   75.5   45.7      | 58.5   53.1   55.6 ±1.3
BART              406M       89.9       | 39.0   63.1   48.2      | 60.7   46.0   52.3 ±1.8
DeBERTa xLarge    900M       91.7       | 40.3   77.7   53.0      | 66.3   59.7   62.8 ±1.7
DeBERTa xxLarge   1.5B       91.7       | 46.6   76.1   57.8      | 63.2   59.8   61.4 ±1.0

Table 2: Zero-Shot scenario results (Precision, Recall and F1) for our system using several pre-trained NLI models in two settings: no development (default threshold T = 0.5), and small development (1% Dev) for setting T. In the leftmost columns we report the number of parameters and the accuracy on MNLI. For the 1% setting we report the median measures along with the F1 standard deviation over 100 runs.

4.3 Pre-Trained NLI models

For our experiments we tried different NLI models that are publicly available through the Hugging Face Transformers (Wolf et al., 2020) Python library. We tested the following models, which implement different architectures, sizes and pre-training objectives, and were fine-tuned mainly on the MNLI (Williams et al., 2018) dataset [9]: ALBERT (Lan et al., 2020), RoBERTa (Liu et al., 2019), BART (Lewis et al., 2020) and DeBERTa v2 (He et al., 2021). Table 2 reports the number of parameters of these models. Further details on the models can be found in Appendix A.

[9] ALBERT was trained on some additional NLI datasets.

For each of the scenarios we have tested different models. In the zero-shot and full training scenarios we compare all the pre-trained models using the templates described in Section 4.2. For few-shot we used RoBERTa for comparability, as it was used in state-of-the-art systems (cf. Section 4.4), and DeBERTa, which is the largest NLI model available on the HUB [10]. Finally, we only tested RoBERTa in the data-augmentation experiments.

[10] https://huggingface.co/models

We ran 3 different runs of each of the experiments using different random seeds. In order to make a fair comparison with state-of-the-art systems (cf. Section 4.4), we performed a hyperparameter exploration in the full training scenario, using the resulting configuration also in the zero/few-shot scenarios. We fixed the batch size at 32 for both RoBERTa and DeBERTa, and searched for the optimal learning rate among {1e-6, 4e-6, 1e-5} on the development set. The best results were obtained using 4e-6 as the learning rate. For more detailed information refer to Appendix B.

4.4 State-of-the-art RE models

We compared the NLI approach with the systems reporting the best results to date on TACRED: SpanBERT (Joshi et al., 2020), K-Adapter (Wang et al., 2020) and LUKE (Yamada et al., 2020) (cf. Section 2). In addition, we also report the results obtained by the vanilla RoBERTa baseline proposed by Wang et al. (2020), which serves as a reference for the improvements. We re-trained the different systems on each scenario setting using their publicly available implementations and the best performing hyperparameters reported by the authors. All these models have a comparable number of parameters.

Figure 2: Zero-shot scenario results. Mean F1 and standard error scores when setting T on an increasing number of development examples.

5 Results

5.1 Zero-Shot

Table 2 shows the results for different pre-trained NLI models, as well as the number of parameters and the MNLI matched accuracy. These results were obtained by using the threshold for negative relations, as we found that it works substantially better than the no-relation template alternative (cf. Section 3.1).
                     | 1%                      | 5%                       | 10%
Model                | Pr.    Rec.   F1        | Pr.    Rec.   F1         | Pr.    Rec.   F1
SpanBERT             | 0.0    0.0    0.0 ±0.0  | 36.3   23.9   28.8 ±13.5 | 3.2    1.1    1.6 ±20.7
RoBERTa              | 56.8   4.1    7.7 ±3.6  | 52.8   34.6   41.8 ±3.3  | 61.0   50.3   55.1 ±0.8
K-Adapter            | 73.8   7.6    13.8 ±3.4 | 56.4   37.6   45.1 ±0.1  | 62.3   50.9   56.0 ±1.3
LUKE                 | 61.5   9.9    17.0 ±5.9 | 57.1   47.0   51.6 ±0.4  | 60.6   60.6   60.6 ±0.4
NLI_RoBERTa (ours)   | 56.6   55.6   56.1 ±0.0 | 60.4   68.3   64.1 ±0.2  | 65.8   69.9   67.8 ±0.2
NLI_DeBERTa (ours)   | 59.5   68.5   63.7 ±0.0 | 64.1   74.8   69.0 ±0.2  | 62.4   74.4   67.9 ±0.5

Table 3: Few-shot scenario results with 1%, 5% and 10% of training data. Precision, Recall and F1 score (standard deviation) of the median of 3 different runs are reported. The top four rows are third-party RE systems run by us.

For instance, RoBERTa yields an F1 of 30.1 [11], well below the 45.7 obtained when using the default threshold (T = 0.5). Overall we see excellent zero-shot performance across all the models and settings, proving that the approach is robust and model agnostic.

[11] Results omitted from Table 2 for brevity.

Regarding pre-trained models, the best F1 scores are obtained by the two DeBERTa v2 models, which also score the best on the MNLI dataset. Note that all the models achieve similar scores on MNLI, but small differences in MNLI result in large performance gaps when it comes to RE, e.g. the 1.5 point difference in MNLI between RoBERTa and DeBERTa becomes 7 points in No Dev and 1% Dev. We think the larger differences in RE are due to the generalization ability of some of the larger models to domain and task differences.

The table includes the results for different values of the T hyperparameter. In the most challenging setting, with default T, the results are worst, with at most 57.8 F1. However, using as few as 2 examples per relation on average (1% Dev setting), the results improve significantly.

We performed further experiments using larger amounts of development data to tune T. Figure 2 shows that, for all models, the most significant improvement occurs in the interval [0%, 1%) and that the interval [1%, 100%] is almost flat. The best result with all development data is 63.4%, only 0.6 points better than using 1% of the development data. These results show clearly that a small number of examples suffices to set an optimal threshold.
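A small sketch of how T can be estimated on a development split: sweep candidate thresholds and keep the best-scoring one. predict_relation is the inference sketch from Section 3.1 and score_f1 stands for any micro-F1 implementation; both names are ours and the sweep granularity is arbitrary.

```python
import numpy as np

def tune_threshold(dev_examples, predict_relation, score_f1):
    """Pick the no-relation threshold T that maximizes F1 on development data."""
    gold = [ex["relation"] for ex in dev_examples]
    best_t, best_f1 = 0.5, -1.0
    for t in np.arange(0.05, 1.0, 0.05):
        preds = [predict_relation(ex["text"], ex["subj"], ex["obj"],
                                  ex["subj_type"], ex["obj_type"], threshold=t)
                 for ex in dev_examples]
        f1 = score_f1(gold, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```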
5.2 Few-Shot

Table 3 shows the results of the competing RE systems and of our systems in the few-shot scenario. We report the median and standard deviation across 3 different runs. The competing RE methods suffer a large performance drop, especially in the smallest training setting. For instance, the SpanBERT system (Joshi et al., 2020) has difficulties converging, even in the 10% data setting. Both K-Adapter (Wang et al., 2020) and LUKE (Yamada et al., 2020) improve over the RoBERTa system (Wang et al., 2020) in all three settings, but they are well below our NLI_RoBERTa system, which improves over the baseline by 48, 22 and 13 points in the respective settings. We also report our method based on DeBERTa xLarge, which is especially effective in the smaller settings.

We would like to note that the zero-shot NLI_RoBERTa system (1% Dev) is comparable in terms of F1 score to a vanilla RoBERTa trained with 10% of the training data. That is, 54 templates (10.5 hours) plus 23 development examples are roughly equivalent to 6800 annotated examples [12] for training (plus 2265 for development).

[12] Unfortunately we could not find the time estimates for annotating examples.

Model                | Pr.    Rec.   F1
SpanBERT             | 70.8   70.9   70.8
RoBERTa              | 70.2   72.4   71.3
K-Adapter            | 70.1   74.0   72.0
LUKE                 | 70.4   75.1   72.7
NLI_RoBERTa (ours)   | 71.6   70.4   71.0
NLI_DeBERTa (ours)   | 72.5   75.3   73.9

Table 4: Full training results (TACRED). The top four rows are third-party RE systems as reported by the authors.
5.3 Full training

Some zero-shot and few-shot systems are not able to improve results when larger amounts of training data are available. Table 4 reports the results when the whole train and development datasets are used, which is comparable to the official results on TACRED. Focusing on our NLI_RoBERTa system, and comparing it to the results in Table 3, we can see that it is able to effectively use the additional training data, improving from 67.9 to 71.0. When compared to a traditional RE system, it performs on a par with RoBERTa, and a little behind K-Adapter and LUKE, probably due to the infused knowledge which our model is not using. These results show that our model keeps improving with additional data and that it is competitive when larger amounts of training data are available. The results of NLI_DeBERTa show that our model can benefit from larger and more effective pre-trained NLI systems even in full training scenarios, and in fact achieves the best results to date on the TACRED dataset.

5.4 Data augmentation results

In this section we explore whether our NLI-based system can produce high-quality silver data which can be added to a small amount of gold data when training a traditional supervised RE system, e.g. the RoBERTa baseline (Wang et al., 2020). Table 5 reports the F1 results on the data augmentation scenario for different amounts of gold training data. Overall, we can see that both our zero-shot and few-shot methods [13] provide good quality silver data, as they improve significantly over the baseline in all settings. Although the zero-shot and few-shot methods yield the same result with 1% of training data, the few-shot model is better in the rest of the training regimes, showing that it can effectively use the available training data in each case to provide better quality silver data. If we compare the results in this table with those of the respective NLI-based system with the same amount of gold training instances (Tables 2 and 3), we can see that the results are comparable, showing that our NLI-based system and a traditional RE system trained with silver annotations have comparable performance. A practical advantage of a traditional RE system trained with our silver data is that it is easier to integrate into available pipelines, as one just needs to download the trained Transformer model. It also makes it easy to check additive improvements in the RE method.

Model             | 0%     1%     5%     10%
RoBERTa           | -      7.7    41.8   55.1
 + Zero-Shot DA   | 56.3   58.4   58.8   59.7
 + Few-Shot DA    | -      58.4   64.9   67.7

Table 5: Data Augmentation scenario results (F1) for different gold training sizes. Silver annotations by the zero-shot and few-shot NLI_RoBERTa model.

[13] The zero-shot 1% Dev model is used in all data augmentation experiments, while the few-shot method changes to use the available data at each run (1%, 5% and 10%), both with RoBERTa.
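The data-augmentation loop itself is simple; the sketch below assumes the predict_relation function from Section 3.1 (or its fine-tuned few-shot counterpart) and a standard RE training routine, both of which are placeholders rather than the paper's actual pipeline.

```python
def build_silver_data(unlabeled_examples, predict_relation):
    """Annotate an untagged pool with the zero-/few-shot NLI-based predictor."""
    silver = []
    for ex in unlabeled_examples:
        label = predict_relation(ex["text"], ex["subj"], ex["obj"],
                                 ex["subj_type"], ex["obj_type"])
        silver.append({**ex, "relation": label})
    return silver

# The silver annotations are then concatenated with the available gold split
# and used to fine-tune a standard RE classifier (e.g. the RoBERTa baseline):
# train_re_model(gold_examples + build_silver_data(unlabeled_pool, predict_relation))
```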
6 Analysis

Relation extraction can be analysed according to two auxiliary metrics: the binary task of detecting a positive relation vs. no-relation, and the multi-class problem of detecting which relation holds among positive cases (that is, discarding no-relation instances from the test data). Table 6 shows the results of a selection of systems and scenarios. The first rows compare the performance of our best system, NLI_DeBERTa, across four scenarios, while the last two rows show the results for LUKE in two scenarios. The zero-shot No Dev system is very effective when discriminating the relation among positive examples (P column), only 7 points below the fully trained system, while it lags well behind when discriminating positive vs. negative, by 18 points. The use of a small development set for tuning the T threshold closes the gap in PvsN, as expected, but the difference is still 10 points. All in all, these numbers show that our zero-shot system is very effective at discriminating among positive examples, but that it still lags behind when detecting no-relation cases. Overall, the figures show the effectiveness of our methods in low data scenarios on both metrics.

Model         Scenario            | P      PvsN
NLI_DeBERTa   Zero-Shot, No Dev   | 85.6   59.5
              Zero-Shot, 1% Dev   | 85.6   67.7
              Few-Shot 5%         | 89.7   74.5
              Full train          | 92.2   77.8
LUKE          Few-Shot 5%         | 69.3   63.4
              Full train          | 90.2   77.3

Table 6: Performance of selected systems and scenarios on two metrics: the binary task of detecting a positive relation vs. no-relation (PvsN column, F1) and detecting the correct relation among positive cases (P, F1).
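The two auxiliary metrics can be computed along the following lines; the paper does not spell out the exact averaging, so the micro-F1 choice below is an assumption.

```python
from sklearn.metrics import f1_score

def auxiliary_metrics(gold, pred, no_rel="no_relation"):
    """Compute the PvsN and P metrics of Table 6 (our reading of them)."""
    # PvsN: binary detection of "some relation" vs. no-relation.
    pvsn = f1_score([g != no_rel for g in gold], [p != no_rel for p in pred])
    # P: correct relation among gold-positive cases only (no-relation
    # instances are discarded from the test data).
    positives = [(g, p) for g, p in zip(gold, pred) if g != no_rel]
    p = f1_score([g for g, _ in positives], [p for _, p in positives],
                 average="micro")
    return p, pvsn
```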
Figure 3: Confusion matrix of our NLI_DeBERTa zero-shot system on the development dataset. The rows represent the true labels and the columns the predictions. The matrix is row-wise normalized (recall on the diagonal).

Confusion analysis. In supervised models some classes (relations) are better represented in training than others, usually due to data imbalance. Our system instead represents each relation as a set of templates, which, at least in a zero-shot scenario, should not be affected by data imbalance. The strong diagonal in the confusion matrix (Fig. 3) shows that our model is able to discriminate properly between most of the relations (after all, it achieves 85.6% accuracy, cf. Table 6), with the exception of the no-relation column, which was expected. Regarding the confusion between actual relations, most of it involves overlapping relations, as expected. For instance, ORG:MEMBER_OF and ORG:PARENTS both involve some organization A being part or member of some other organization B, where ORG:MEMBERS is different from ORG:PARENTS in that correct fillers are distinct entities that are generally capable of autonomously ending their membership with the assigned organization [14]. Something similar occurs between ORG:MEMBERS and ORG:SUBSIDIARIES. Another source of confusion arises when two or more relations hold concurrently, as in PER:ORIGIN, PER:COUNTRY_OF_BIRTH and PER:COUNTRY_OF_RESIDENCE. Finally, the model scores low on PER:OTHER_FAMILY, which is a bucket of many specific relations of which only a handful were actually covered by the templates.

[14] Description extracted from the guidelines.

7 Conclusions

In this work we reformulate relation extraction as an entailment problem, and explore to what extent simple hand-made verbalizations are effective. The creation of templates is limited to 15 minutes per relation, and yet allows for excellent results in zero- and few-shot scenarios. Our method makes effective use of available labeled examples, and together with larger LMs produces the best results on TACRED to date. Our analysis indicates that the main performance difference against supervised models comes from discriminating no-relation examples, as the performance among positive examples equals that of the best supervised system using the full training data. We also show that our method can be used effectively as a data-augmentation method to provide additional labeled examples. In the future we would like to investigate better methods for detecting no-relation in zero-shot settings.

Acknowledgements

Oscar is funded by a PhD grant from the Basque Government (PRE_2020_1_0246). This work is based upon work partially supported via the IARPA BETTER Program contract No. 2019-19051600006 (ODNI, IARPA), and by the Basque Government (IXA excellence research group IT1343-19).
References

Christoph Alt, Aleksandra Gabryszak, and Leonhard Hennig. 2020. TACRED revisited: A thorough evaluation of the TACRED relation extraction task. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1558–1569, Online. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Chih-Yao Chen and Cheng-Te Li. 2021. ZS-BERT: Towards zero-shot relation extraction with attribute representation learning. In Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2021).

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 177–190, Berlin, Heidelberg. Springer Berlin Heidelberg.

Xinya Du and Claire Cardie. 2020. Event extraction by answering (almost) natural questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 671–683, Online. Association for Computational Linguistics.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners.

Ankur Goswami, Akshata Bhat, Hadar Ohana, and Theodoros Rekatsinas. 2020. Unsupervised relation extraction from language models using constrained cloze completion. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1263–1276, Online. Association for Computational Linguistics.

Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4803–4809, Brussels, Belgium. Association for Computational Linguistics.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-enhanced BERT with disentangled attention.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Sha Li, Heng Ji, and Jiawei Han. 2021. Document-level event argument extraction by conditional generation.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. GPT understands, too.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Abiola Obamuyide and Andreas Vlachos. 2018. Zero-shot relation classification as textual entailment. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 72–78, Brussels, Belgium. Association for Computational Linguistics.

Raul Puri and Bryan Catanzaro. 2019. Zero-shot text classification with generative language models.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Oscar Sainz and German Rigau. 2021. Ask2Transformers: Zero-shot domain labelling with pretrained language models. In Proceedings of the 11th Global Wordnet Conference, pages 44–52, University of South Africa (UNISA). Global Wordnet Association.

Teven Le Scao and Alexander M. Rush. 2021. How many data points is a prompt worth? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2020. It's not just size that matters: Small language models are also few-shot learners. Computing Research Repository, arXiv:2009.07118.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

George Stoica, Emmanouil Antonios Platanios, and Barnabás Póczos. 2021. Re-TACRED: Addressing shortcomings of the TACRED dataset. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence 2021.

Derek Tam, Rakesh R Menon, Mohit Bansal, Shashank Srivastava, and Colin Raffel. 2021. Improving and simplifying pattern exploiting training.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. 2020. K-Adapter: Infusing knowledge into pre-trained models with adapters.

Sinong Wang, Han Fang, Madian Khabsa, Hanzi Mao, and Hao Ma. 2021. Entailment as few-shot learner.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online. Association for Computational Linguistics.

Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3914–3923, Hong Kong, China. Association for Computational Linguistics.

Wenpeng Yin, Nazneen Fatema Rajani, Dragomir Radev, Richard Socher, and Caiming Xiong. 2020. Universal natural language processing with limited annotations: Try few-shot textual entailment as a start. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8229–8239, Online. Association for Computational Linguistics.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 35–45.
A Pre-Trained models

The pre-trained NLI models we have tested from the Transformers library are the following:

• ALBERT: ynie/albert-xxlarge-v2-snli_mnli_fever_anli_R1_R2_R3-nli
• RoBERTa: roberta-large-mnli
• BART: facebook/bart-large-mnli
• DeBERTa v2 xLarge: microsoft/deberta-v2-xlarge-mnli
• DeBERTa v2 xxLarge: microsoft/deberta-v2-xxlarge-mnli

B Experimental details

We carried out all the experiments on a single Titan V (16GB), except for the fine-tuning of DeBERTa, which was done on a cluster of 4 Titan V100 (32GB). The average inference time for the zero- and few-shot experiments is between 1h and 1.5h. The time needed for fine-tuning the NLI systems was at most 2.5h for RoBERTa and 5h for DeBERTa. All the experiments were done with mixed precision to speed up the overall runtime. The hyperparameter settings used for fine-tuning NLI_RoBERTa and NLI_DeBERTa are listed below:

• Train epochs: 2
• Warmup steps: 1000
• Learning-rate: 4e-6
• Batch-size: 32
• FP16 training
• Seeds: {0, 24, 42}

Note that we are fine-tuning an already trained NLI system, so we kept the number of epochs and the learning rate low. The rest of the state-of-the-art systems were trained using the hyperparameters reported by the authors.
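For reference, the hyperparameters above map onto the Hugging Face Trainer roughly as follows; the actual training loop and dataset construction are not shown, and the output directory name is arbitrary.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "roberta-large-mnli"  # or one of the DeBERTa v2 MNLI checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="nli-re-finetuned",       # arbitrary
    num_train_epochs=2,                  # Train epochs: 2
    warmup_steps=1000,                   # Warmup steps: 1000
    learning_rate=4e-6,                  # Learning-rate: 4e-6
    per_device_train_batch_size=32,      # Batch-size: 32
    fp16=True,                           # FP16 training (requires a GPU)
    seed=0,                              # one of {0, 24, 42}
)

# `train_dataset` would hold the premise-hypothesis pairs built as in
# Section 3.2, tokenized with `tokenizer(premise, hypothesis, ...)`:
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```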

C TACRED templates

This section describes the templates used in the TACRED experiments. We performed all the experiments using the templates shown in Table 1 (for PERSON relations) and Table 2 (for ORGANIZATION relations). These templates were manually created based on the TAC KBP Slot Descriptions [15] (annotation guidelines). Besides the templates, we also report the valid argument types that are accepted for each relation.

[15] https://tac.nist.gov/2014/KBP/ColdStart/guidelines/TAC_KBP_2014_Slot_Descriptions_V1.4.pdf
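If the templates are to be used programmatically (as in the sketches of Section 3), the two tables below can be encoded as a simple dictionary; the entry below shows only two relations and is our own encoding, with the subject type implied by the per:/org: prefix and the object types taken from the "Valid argument types" column.

```python
# Possible encoding of the appendix tables for programmatic use (two relations
# shown); subject types follow from the per:/org: prefix, object types come
# from the "Valid argument types" column.
RELATION_INFO = {
    "per:date_of_birth": {
        "templates": ["{subj}'s birthday is on {obj}",
                      "{subj} was born on {obj}"],
        "subj_types": {"PERSON"},
        "obj_types": {"DATE"},
    },
    "org:founded_by": {
        "templates": ["{subj} was founded by {obj}",
                      "{obj} founded {subj}"],
        "subj_types": {"ORGANIZATION"},
        "obj_types": {"PERSON"},
    },
}
```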
Relation Templates Valid argument types
per:alternate_names {subj} is also known as {obj} PERSON, MISC
per:date_of_birth {subj}’s birthday is on {obj} DATE
{subj} was born on {obj}
per:age {subj} is {obj} years old NUMBER, DURATION
per:country_of_birth {subj} was born in {obj} COUNTRY
per:stateorprovince_of_birth {subj} was born in {obj} STATE_OR_PROVINCE
per:city_of_birth {subj} was born in {obj} CITY, LOCATION
per:origin {obj} is the nationality of {subj} NATIONALITY, COUNTRY, LOCATION
per:date_of_death {subj} died in {obj} DATE
per:country_of_death {subj} died in {obj} COUNTRY
per:stateorprovince_of_death {subj} died in {obj} STATE_OR_PROVINCE
per:city_of_death {subj} died in {obj} CITY, LOCATION
per:cause_of_death {obj} is the cause of {subj}’s death CAUSE_OF_DEATH
per:countries_of_residence {subj} lives in {obj} COUNTRY, NATIONALITY
{subj} has a legal order to stay in {obj}
per:statesorprovinces_of_residence {subj} lives in {obj} STATE_OR_PROVINCE
{subj} has a legal order to stay in {obj}
per:city_of_residence {subj} lives in {obj} CITY, LOCATION
{subj} has a legal order to stay in {obj}
per:schools_attended {subj} studied in {obj} ORGANIZATION
{subj} graduated from {obj}
per:title {subj} is a {obj} TITLE
per:employee_of {subj} is a member of {obj} ORGANIZATION
per:religion {subj} belongs to {obj} RELIGION
{obj} is the religion of {subj}
{subj} believe in {obj}
per:spouse {subj} is the spouse of {obj} PERSON
{subj} is the wife of {obj}
{subj} is the husband of {obj}
per:children {subj} is the parent of {obj} PERSON
{subj} is the mother of {obj}
{subj} is the father of {obj}
{obj} is the son of {subj}
{obj} is the daughter of {subj}
per:parents {obj} is the parent of {subj} PERSON
{obj} is the mother of {subj}
{obj} is the father of {subj}
{subj} is the son of {obj}
{subj} is the daughter of {obj}
per:siblings {subj} and {obj} are siblings PERSON
{subj} is brother of {obj}
{subj} is sister of {obj}
per:other_family {subj} and {obj} are family PERSON
{subj} is a brother in law of {obj}
{subj} is a sister in law of {obj}
{subj} is the cousin of {obj}
{subj} is the uncle of {obj}
{subj} is the aunt of {obj}
{subj} is the grandparent of {obj}
{subj} is the grandmother of {obj}
{subj} is the grandson of {obj}
{subj} is the granddaughter of {obj}
per:charges {subj} was convicted of {obj} CRIMINAL_CHARGE
{obj} are the charges of {subj}

Table 1: Templates and valid arguments for PERSON relations.

Relation Templates Valid argument types
org:alternate_names {subj} is also known as {obj} ORGANIZATION, MISC
org:political/religious_affiliation {subj} has political affiliation with {obj} RELIGION, IDEOLOGY
{subj} has religious affiliation with {obj}
org:top_members/employees {obj} is a high level member of {subj} PERSON
{obj} is chairman of {subj}
{obj} is president of {subj}
{obj} is director of {subj}
org:number_of_employees/members {subj} employs nearly {obj} people NUMBER
{subj} has about {obj} employees
org:members {obj} is member of {subj} ORGANIZATION, COUNTRY
{obj} joined {subj}
org:subsidiaries {obj} is a subsidiary of {subj} ORGANIZATION, LOCATION
{obj} is a branch of {subj}
org:parents {subj} is a subsidiary of {obj} ORGANIZATION, COUNTRY
{subj} is a branch of {obj}
org:founded_by {subj} was founded by {obj} PERSON
{obj} founded {subj}
org:founded {subj} was founded in {obj} DATE
{subj} was formed in {obj}
org:dissolved {subj} existed until {obj} DATE
{subj} disbanded in {obj}
{subj} dissolved in {obj}
org:country_of_headquarters {subj} has its headquarters in {obj} COUNTRY
{subj} is located in {obj}
org:stateorprovince_of_headquarters {subj} has its headquarters in {obj} STATE_OR_PROVINCE
{subj} is located in {obj}
org:city_of_headquarters {subj} has its headquarters in {obj} CITY, LOCATION
{subj} is located in {obj}
org:shareholders {obj} holds shares in {subj} ORGANIZATION, PERSON
org:website {obj} is the URL of {subj} URL
{obj} is the website of {subj}

Table 2: Templates and valid arguments for ORGANIZATION relations.
