Zero-Shot and Few-Shot learning. Brown et al. (2020) showed that task descriptions (prompts) can be fed into LMs for task-agnostic and few-shot performance. In addition, Schick and Schütze (2020, 2021) and Tam et al. (2021) extend the method and allow finetuning of LMs on a variety of tasks. Prompt-based prediction treats the downstream task as a (masked) language modeling problem, where the model directly generates a tex-

Partially vs. fully unseen labels in RE. Existing zero/few-shot RE models usually see some labels during training (label partially unseen), which helps generalize to the unseen label (Levy et al., 2017; Obamuyide and Vlachos, 2018; Han et al., 2018; Chen and Li, 2021). These approaches do not fully address the data scarcity problem. In this work we address the more challenging label fully unseen scenario.
Figure 1: General workflow of our entailment-based RE approach.
Table 1: Statistics about the dataset scenarios based on TACRED used in the paper, including positive examples
per relation, total amount of positive examples and the total amount of negative (no-relation) examples.
split)[6]. In this scenario the models are not allowed to train their own parameters but development data is used to adjust the hyperparameters.

Few-Shot. This scenario presents the challenge of solving the RE task with just a few examples per relation. We present three settings commonly used in few-shot learning (Gao et al., 2020)[7]: around 4 examples per relation (1% of the training data in TACRED), around 16 examples per relation (5%) and around 32 examples per relation (10%). We reduced the development set following the same ratio.

Full Training. In this setting we use all available training and development data.

Data Augmentation. In this scenario we want to test whether a silver dataset produced by running our systems on untagged data can be used to train a supervised relation extraction system (cf. Section 3). Here, 75% of the training data in TACRED is set aside as unlabeled data[8], and the rest of the training data is used in different splits (ranging from 1% to 10%). Under this setting we carried out two types of experiments. In the zero-shot experiments (0% in the table) we use our NLI-based model to annotate the silver data and then fine-tune the RE model exclusively on the silver data. In the few-shot experiments the NLI model is first fine-tuned with the gold data, then used to annotate the silver data, and finally the RE model is fine-tuned over both silver and gold annotations.

[6] This setting is comparable to one where the examples in the guidelines are used as development.
[7] The commonly reported value in few-shot scenarios is 16 examples per label. We also added the 3-8 and 32 examples settings in the evaluation.
[8] We use part of the original TACRED dataset to produce silver data in order not to introduce noise coming from different documents and/or pre-processing steps.

4.2 Hand-crafted relation templates
We manually created the templates to verbalize relation labels, based on the TAC-KBP guidelines which underlie the TACRED dataset. We limited the time for creating the templates of each relation to less than 15 minutes. Overall, we created 1-8 templates per relation (2 on average) (cf. Appendix C for the full list).

The verbalization process consists of generating one or more templates that describe the relation and contain the placeholders {subj} and {obj}. The developer building the templates was given the task guidelines (a brief description of the relation, including one or two examples and the type of the entities) and an NLI model (roberta-large-mnli checkpoint). For a given relation, he/she would create a template (or set of templates) and check whether the NLI model is able to output a high entailment probability for the template when applied to the guideline example(s). He/she could run this process for any new template that he/she could come up with. There was no strict threshold involved for selecting the templates, just the intuition of the developer. The spirit was to come up with simple templates quickly, and not to build numerous complex templates or to optimize entailment probabilities.

4.3 Pre-Trained NLI models
For our experiments we tried different NLI models that are publicly available with the Hugging Face Transformers (Wolf et al., 2020) Python library.
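The template check described in Section 4.2 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the three relations are copied from Appendix C, and `toy_entail_prob` is a crude word-overlap stand-in so that the example runs without downloading a model. In practice it would be replaced by the entailment probability returned by an NLI model such as roberta-large-mnli. Scoring a relation by the maximum over its templates is also an assumption.

```python
from typing import Callable, Dict, List

# A few templates copied from Appendix C (PERSON and ORGANIZATION tables).
TEMPLATES: Dict[str, List[str]] = {
    "per:city_of_birth": ["{subj} was born in {obj}"],
    "per:schools_attended": ["{subj} studied in {obj}", "{subj} graduated from {obj}"],
    "org:founded_by": ["{subj} was founded by {obj}", "{obj} founded {subj}"],
}


def verbalize(template: str, subj: str, obj: str) -> str:
    """Fill the {subj} and {obj} placeholders to obtain an NLI hypothesis."""
    return template.format(subj=subj, obj=obj)


def score_relations(
    premise: str,
    subj: str,
    obj: str,
    templates: Dict[str, List[str]],
    entail_prob: Callable[[str, str], float],
) -> Dict[str, float]:
    """Score each relation by its best-scoring template (assumed aggregation)."""
    scores = {}
    for relation, rel_templates in templates.items():
        hypotheses = [verbalize(t, subj, obj) for t in rel_templates]
        scores[relation] = max(entail_prob(premise, h) for h in hypotheses)
    return scores


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    """Stand-in scorer (NOT an NLI model): share of hypothesis tokens in the premise."""
    premise_tokens = set(premise.lower().split())
    hyp_tokens = hypothesis.lower().split()
    return sum(tok in premise_tokens for tok in hyp_tokens) / len(hyp_tokens)


premise = "Marie Curie was born in Warsaw in 1867"
scores = score_relations(premise, "Marie Curie", "Warsaw", TEMPLATES, toy_entail_prob)
best_relation = max(scores, key=scores.get)
```

A developer iterating on templates would inspect `scores` for the guideline examples and keep the templates whose hypotheses receive a high entailment probability.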
                            MNLI   No Dev (T = 0.5)     1% Dev
NLI Model        # Param.   Acc.   Pr.    Rec.   F1     Pr.    Rec.   F1
ALBERTxxLarge    223M       90.8   32.6   79.5   46.2   55.2   58.1   56.6 ±1.4
RoBERTa          355M       90.2   32.8   75.5   45.7   58.5   53.1   55.6 ±1.3
BART             406M       89.9   39.0   63.1   48.2   60.7   46.0   52.3 ±1.8
DeBERTaxLarge    900M       91.7   40.3   77.7   53.0   66.3   59.7   62.8 ±1.7
DeBERTaxxLarge   1.5B       91.7   46.6   76.1   57.8   63.2   59.8   61.4 ±1.0
Table 2: Zero-Shot scenario results (Precision, Recall and F1) for our system using several pre-trained NLI models
in two settings: no development (default threshold T =0.5), and small development (1% Dev.) for setting T . In the
leftmost columns we report the number of parameters and the accuracy in MNLI. For the 1% setting we report the
median measures along with the F1 standard deviation in 100 runs.
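The role of the threshold T in Table 2 can be illustrated with a small sketch. It assumes two things that are not spelled out in this section: that an example is labelled no-relation whenever the best entailment probability falls below T, and that F1 is micro-averaged over positive relations only. The function names and the toy development set are hypothetical.

```python
from typing import List, Sequence, Tuple

NO_RELATION = "no_relation"


def apply_threshold(relation: str, prob: float, threshold: float) -> str:
    """Keep the top relation only if its probability reaches the threshold."""
    return relation if prob >= threshold else NO_RELATION


def micro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Micro F1 over positive relations (no_relation is not counted as a class)."""
    correct = sum(g == p and g != NO_RELATION for g, p in zip(gold, pred))
    n_pred = sum(p != NO_RELATION for p in pred)
    n_gold = sum(g != NO_RELATION for g in gold)
    if correct == 0:
        return 0.0
    precision = correct / n_pred
    recall = correct / n_gold
    return 2 * precision * recall / (precision + recall)


def tune_threshold(
    dev: List[Tuple[str, float, str]], candidates: Sequence[float]
) -> Tuple[float, float]:
    """Pick the candidate T with the highest dev F1 (first one wins on ties).

    Each dev item is (top relation, its entailment probability, gold label).
    """
    gold = [g for _, _, g in dev]
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        pred = [apply_threshold(r, p, t) for r, p, _ in dev]
        f1 = micro_f1(gold, pred)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy development set: two no_relation examples receive moderately high scores.
dev_set = [
    ("per:title", 0.95, "per:title"),
    ("org:founded_by", 0.90, "org:founded_by"),
    ("per:spouse", 0.70, NO_RELATION),
    ("per:employee_of", 0.60, NO_RELATION),
    ("per:age", 0.85, "per:age"),
]
best_t, best_f1 = tune_threshold(dev_set, [0.5, 0.6, 0.7, 0.8, 0.9])
```

On this toy data the default T = 0.5 accepts every prediction, so precision suffers from the two no-relation examples, while a higher threshold filters them out. This mirrors the precision/recall shift between the No Dev and 1% Dev columns.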
Table 3: Few-shot scenario results with 1%, 5% and 10% of training data. Precision, Recall and F1 score (standard
deviation) of the median of 3 different runs are reported. Top four rows for third-party RE systems run by us.
F1 of 30.1[11], well below the 45.7 when using the default threshold (T = 0.5). Overall we see an excellent zero-shot performance across all the models and settings, proving that the approach is robust and model agnostic.

[11] Results omitted from Table 2 for brevity.

Regarding pre-trained models, the best F1 scores are obtained by the two DeBERTa v2 models, which also score the best on the MNLI dataset. Note that all the models achieve similar scores on MNLI, but small differences in MNLI result in large performance gaps when they come to RE, e.g. the 1.5 difference in MNLI between RoBERTa and DeBERTa becomes 7 points in No Dev. and 1% Dev. We think the larger differences in RE are due to the generalization ability of some of the larger models to domain and task differences.

The table includes the results for different values of the T hyperparameter. In the most challenging setting, with default T, the results are the worst, with at most 57.8 F1. However, using as few as 2 examples per relation on average (1% Dev. setting) the results improve significantly.

We performed further experiments using larger amounts of development data to tune T. Figure 2 shows that, for all models, the most significant improvement occurs in the interval [0%, 1%) and that the interval [1%, 100%] is almost flat. The best result with all development data is 63.4%, only 0.6 points better than using 1% of development. These results show clearly that a small number of examples suffices to set an optimal threshold.

5.2 Few-Shot
Table 3 shows the results of competing RE systems and our systems on the few-shot scenario. We report the median and standard deviation across 3 different runs. The competing RE methods suffer a large performance drop, especially for the smallest training setting. For instance, the SpanBERT system (Joshi et al., 2020) has difficulty converging, even with the 10% of data setting. Both K-Adapter (Wang et al., 2020) and LUKE (Yamada et al., 2020) improve over the RoBERTa system (Wang et al., 2020) in all three settings, but they are well below our NLIRoBERTa system, with improvements of 48, 22 and 13 points against the baseline in each setting. We also report our method based on DeBERTaxLarge, which is especially effective in the smaller settings.

We would like to note that the zero-shot NLIRoBERTa system (1% Dev) is comparable in terms of F1 score to a vanilla RoBERTa trained with 10% of the training data. That is, 54 templates (10.5 hours), plus 23 development examples, are roughly equivalent to 6800 annotated examples[12] for training (plus 2265 development).

[12] Unfortunately we could not find the time estimates for annotating examples.

Model               Pr.    Rec.   F1
SpanBERT            70.8   70.9   70.8
RoBERTa             70.2   72.4   71.3
K-Adapter           70.1   74.0   72.0
LUKE                70.4   75.1   72.7
NLIRoBERTa (ours)   71.6   70.4   71.0
NLIDeBERTa (ours)   72.5   75.3   73.9
Table 4: Full training results (TACRED). Top four rows for third-party RE systems as reported by authors.

5.3 Full training
Some zero-shot and few-shot systems are not able to improve results when larger amounts of training data are available. Table 4 reports the results when the whole train and development datasets are used, which is comparable to official results
on TACRED. Focusing on our NLIRoBERTa system, and comparing it to the results in Table 3, we can see that it is able to effectively use the additional training data, improving from 67.9 to 71.0. When compared to a traditional RE system, it performs on a par with RoBERTa, and a little behind K-Adapter and LUKE, probably due to the infused knowledge which our model is not using. These results show that our model keeps improving with additional data and that it is competitive when larger amounts of training data are available. The results of NLIDeBERTa show that our model can benefit from larger and more effective pre-trained NLI systems even in full training scenarios, and in fact achieves the best results to date on the TACRED dataset.

5.4 Data augmentation results
In this section we explore whether our NLI-based system can produce high-quality silver data which can be added to a small amount of gold data when training a traditional supervised RE system, e.g. the RoBERTa baseline (Wang et al., 2020). Table 5 reports the F1 results on the data augmentation scenario for different amounts of gold training data. Overall, we can see that both our zero-shot and few-shot methods[13] provide good quality silver data, as they improve significantly over the baseline in all settings. Although the zero-shot and few-shot methods yield the same result with 1% of training data, the few-shot model is better in the rest of the training regimes, showing that it can effectively use the available training data in each case to provide better quality silver data. If we compare the results in this table with those of the respective NLI-based system with the same amount of gold training instances (Tables 2 and 3) we can see that the results are comparable, showing that our NLI-based system and a traditional RE system trained with silver annotations have comparable performance. A practical advantage of a traditional RE system trained with our silver data is that it is easier to integrate into existing pipelines, as one just needs to download the trained Transformer model. It also makes it easy to check additive improvements in the RE method.

[13] The zero-shot 1% Dev model is used in all data augmentation experiments, while the few-shot method changes to use the available data at each run (1%, 5% and 10%), both with RoBERTa.

Model            0%     1%     5%     10%
RoBERTa          -      7.7    41.8   55.1
+ Zero-Shot DA   56.3   58.4   58.8   59.7
+ Few-Shot DA    -      58.4   64.9   67.7
Table 5: Data Augmentation scenario results (F1) for different gold training sizes. Silver annotations by the zero-shot and few-shot NLIRoBERTa model.

6 Analysis
Relation extraction can be analysed according to two auxiliary metrics: the binary task of detecting a positive relation vs. no-relation, and the multi-class problem of detecting which relation holds among positive cases (that is, discarding no-relation instances from test data). Table 6 shows the results of a selection of systems and scenarios. The first rows compare the performance of our best system, NLIDeBERTa, across four scenarios, while the last two rows show the results for LUKE in two scenarios. The zero-shot No dev. system is very effective when discriminating the relation among positive examples (P column), only 7 points below the fully trained system, while it lags well behind when discriminating positive vs. negative, by 18 points. The use of a small development set for tuning the T threshold closes the gap in PvsN, as expected, but the difference is still 10 points. All in all, these numbers show that our zero-shot system is very effective at discriminating among positive examples, but that it still lags behind when detecting no-relation cases. Overall, the figures show the effectiveness of our methods in low data scenarios on both metrics.

Model        Scenario       P      PvsN
NLIDeBERTa   No Dev         85.6   59.5
             1% Dev         85.6   67.7
             Few-Shot 5%    89.7   74.5
             Full train -   92.2   77.8
LUKE         Few-Shot 5%    69.3   63.4
             Full train -   90.2   77.3
Table 6: Performance of selected systems and scenarios on two metrics: the binary task of detecting a positive relation vs. no-relation (PvsN column, F1) and detecting the correct relation among positive cases (P, F1).

Confusion analysis. In supervised models some classes (relations) are better represented in training than others, usually due to data imbalance. Our system, instead, represents each relation as a set of templates, which at least on a zero-shot
Figure 3: Confusion matrix of our NLIDeBERTa zero-shot system on the development dataset. The rows represent
the true labels and the columns the predictions. The matrix is row-wise normalized (recall on the diagonal).
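The row-wise normalization mentioned in the Figure 3 caption can be sketched as below: each row (true label) is divided by its total, so the diagonal entry is the recall of that relation. This is a minimal illustration with a hypothetical function name and invented toy labels, not the code used to produce the figure.

```python
from collections import Counter
from typing import List, Sequence


def row_normalized_confusion(
    gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]
) -> List[List[float]]:
    """Confusion matrix with rows = true labels, columns = predictions.

    Each row is divided by its own total, so matrix[i][i] is the recall of
    labels[i]. Predictions outside `labels` are ignored in the row total.
    """
    counts = Counter(zip(gold, pred))
    matrix = []
    for g in labels:
        row_total = sum(counts[(g, p)] for p in labels)
        matrix.append(
            [counts[(g, p)] / row_total if row_total else 0.0 for p in labels]
        )
    return matrix


labels = ["org:members", "org:parents", "no_relation"]
gold = ["org:members"] * 4 + ["org:parents"] * 2 + ["no_relation"] * 2
pred = [
    "org:members", "org:members", "org:members", "org:parents",  # true org:members
    "org:parents", "org:members",                                # true org:parents
    "no_relation", "org:parents",                                # true no_relation
]
matrix = row_normalized_confusion(gold, pred, labels)
```

In this toy example the off-diagonal mass between the two organization relations is the kind of confusion between overlapping relations that a row-normalized matrix makes visible.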
scenario, should not be affected by data imbalance. The strong diagonal in the confusion matrix (Fig. 3) shows that our model is able to discriminate properly between most of the relations (after all it achieves 85.6% accuracy, cf. Table 6), with the exception of the no-relation column, which was expected. Regarding the confusion between actual relations, most of it involves overlapping relations, as expected. For instance, ORG:MEMBER_OF and ORG:PARENTS both involve some organization A being part or member of some other organization B, where ORG:MEMBERS is different from ORG:PARENTS in that correct fillers are distinct entities that are generally capable of autonomously ending their membership with the assigned organization[14]. Something similar occurs between ORG:MEMBERS and ORG:SUBSIDIARIES. Another source of confusion arises when two or more relations exist concurrently, as in PER:ORIGIN, PER:COUNTRY_OF_BIRTH and PER:COUNTRY_OF_RESIDENCE. Finally, the model scores low on PER:OTHER_FAMILY, which is a bucket of many specific relations where only a handful were actually covered by the templates.

[14] Description extracted from the guidelines.

7 Conclusions
In this work we reformulate relation extraction as an entailment problem, and explore to what extent simple hand-made verbalizations are effective. The creation of templates is limited to 15 minutes per relation, and yet allows for excellent results in zero- and few-shot scenarios. Our method makes effective use of available labeled examples, and together with larger LMs produces the best results on TACRED to date. Our analysis indicates that the main performance difference against supervised models comes from discriminating no-relation examples, as the performance among positive examples equals that of the best supervised system using the full training data. We also show that our method can be used effectively as a data-augmentation method to provide additional labeled examples. For the future we would like to investigate better methods for detecting no-relation in zero-shot settings.

Acknowledgements
Oscar is funded by a PhD grant from the Basque Government (PRE_2020_1_0246). This work is based upon work partially supported via the IARPA BETTER Program contract No. 2019-19051600006 (ODNI, IARPA), and by the Basque Government (IXA excellence research group IT1343-19).
A Pre-Trained models
The pre-trained NLI models we have tested from the Transformers library are the following:
• ALBERT: ynie/albert-xxlarge-v2-snli_mnli
• RoBERTa: roberta-large-mnli
• BART: facebook/bart-large-mnli

B Experimental details
We carried out all the experiments on a single Titan V (16GB) except for the fine-tuning of DeBERTa, which was done on a cluster of 4 Titan V100 (32GB). The average inference time for the zero and few-shot experiments is between 1h and 1.5h. The time needed for fine-tuning the NLI systems was at most 2.5h for RoBERTa and 5h for DeBERTa. All the experiments were done with mixed precision to speed up the overall runtime. The hyperparameter settings used for fine-tuning NLIRoBERTa and NLIDeBERTa are listed below:
• Train epochs: 2
• Learning-rate: 4e-6
• Batch-size: 32
• FP16 training

C TACRED templates
This section describes the templates used in the TACRED experiments. We performed all the experiments using the templates shown in Tables 1 (for PERSON relations) and 2 (for ORGANIZATION relations). These templates were manually created based on the TAC KBP Slot Descriptions[15] (annotation guidelines). Besides the templates, we also report the valid argument types that are accepted on each relation.

[15] ColdStart/guidelines/TAC_KBP_2014_Slot_Descriptions_V1.4.pdf
Relation Templates Valid argument types
per:alternate_names {subj} is also known as {obj} PERSON, MISC
per:date_of_birth {subj}’s birthday is on {obj} DATE
{subj} was born on {obj}
per:age {subj} is {obj} years old NUMBER, DURATION
per:country_of_birth {subj} was born in {obj} COUNTRY
per:stateorprovince_of_birth {subj} was born in {obj} STATE_OR_PROVINCE
per:city_of_birth {subj} was born in {obj} CITY, LOCATION
per:origin {obj} is the nationality of {subj} NATIONALITY, COUNTRY, LOCATION
per:date_of_death {subj} died in {obj} DATE
per:country_of_death {subj} died in {obj} COUNTRY
per:stateorprovince_of_death {subj} died in {obj} STATE_OR_PROVINCE
per:city_of_death {subj} died in {obj} CITY, LOCATION
per:cause_of_death {obj} is the cause of {subj}’s death CAUSE_OF_DEATH
per:countries_of_residence {subj} lives in {obj} COUNTRY, NATIONALITY
{subj} has a legal order to stay in {obj}
per:statesorprovinces_of_residence {subj} lives in {obj} STATE_OR_PROVINCE
{subj} has a legal order to stay in {obj}
per:city_of_residence {subj} lives in {obj} CITY, LOCATION
{subj} has a legal order to stay in {obj}
per:schools_attended {subj} studied in {obj} ORGANIZATION
{subj} graduated from {obj}
per:title {subj} is a {obj} TITLE
per:employee_of {subj} is a member of {obj} ORGANIZATION
per:religion {subj} belongs to {obj} RELIGION
{obj} is the religion of {subj}
{subj} believe in {obj}
per:spouse {subj} is the spouse of {obj} PERSON
{subj} is the wife of {obj}
{subj} is the husband of {obj}
per:children {subj} is the parent of {obj} PERSON
{subj} is the mother of {obj}
{subj} is the father of {obj}
{obj} is the son of {subj}
{obj} is the daughter of {subj}
per:parents {obj} is the parent of {subj} PERSON
{obj} is the mother of {subj}
{obj} is the father of {subj}
{subj} is the son of {obj}
{subj} is the daughter of {obj}
per:siblings {subj} and {obj} are siblings PERSON
{subj} is brother of {obj}
{subj} is sister of {obj}
per:other_family {subj} and {obj} are family PERSON
{subj} is a brother in law of {obj}
{subj} is a sister in law of {obj}
{subj} is the cousin of {obj}
{subj} is the uncle of {obj}
{subj} is the aunt of {obj}
{subj} is the grandparent of {obj}
{subj} is the grandmother of {obj}
{subj} is the grandson of {obj}
{subj} is the granddaughter of {obj}
per:charges {subj} was convicted of {obj} CRIMINAL_CHARGE
{obj} are the charges of {subj}
Relation Templates Valid argument types
org:alternate_names {subj} is also known as {obj} ORGANIZATION, MISC
org:political/religious_affiliation {subj} has political affiliation with {obj} RELIGION, IDEOLOGY
{subj} has religious affiliation with {obj}
org:top_members/employees {obj} is a high level member of {subj} PERSON
{obj} is chairman of {subj}
{obj} is president of {subj}
{obj} is director of {subj}
org:number_of_employees/members {subj} employs nearly {obj} people NUMBER
{subj} has about {obj} employees
org:members {obj} is member of {subj} ORGANIZATION, COUNTRY
{obj} joined {subj}
org:subsidiaries {obj} is a subsidiary of {subj} ORGANIZATION, LOCATION
{obj} is a branch of {subj}
org:parents {subj} is a subsidiary of {obj} ORGANIZATION, COUNTRY
{subj} is a branch of {obj}
org:founded_by {subj} was founded by {obj} PERSON
{obj} founded {subj}
org:founded {subj} was founded in {obj} DATE
{subj} was formed in {obj}
org:dissolved {subj} existed until {obj} DATE
{subj} disbanded in {obj}
{subj} dissolved in {obj}
org:country_of_headquarters {subj} has its headquarters in {obj} COUNTRY
{subj} is located in {obj}
org:stateorprovince_of_headquarters {subj} has its headquarters in {obj} STATE_OR_PROVINCE
{subj} is located in {obj}
org:city_of_headquarters {subj} has its headquarters in {obj} CITY, LOCATION
{subj} is located in {obj}
org:shareholders {obj} holds shares in {subj} ORGANIZATION, PERSON
org:website {obj} is the URL of {subj} URL
{obj} is the website of {subj}
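The rows of the two template tables can be read as a small data structure: a relation label, its templates, and the argument types it accepts. The sketch below shows one plausible way to use the type column to narrow down candidate relations before any entailment scoring. It rests on assumptions not stated in this appendix: that the listed types constrain the {obj} argument, and that the subject type follows from the per:/org: prefix. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class RelationSpec:
    relation: str
    templates: Tuple[str, ...]
    valid_obj_types: Tuple[str, ...]  # assumed to constrain the {obj} argument


# Four rows copied from the tables above.
SPECS = [
    RelationSpec("per:country_of_birth", ("{subj} was born in {obj}",), ("COUNTRY",)),
    RelationSpec(
        "per:stateorprovince_of_birth",
        ("{subj} was born in {obj}",),
        ("STATE_OR_PROVINCE",),
    ),
    RelationSpec(
        "per:city_of_birth", ("{subj} was born in {obj}",), ("CITY", "LOCATION")
    ),
    RelationSpec(
        "org:founded_by",
        ("{subj} was founded by {obj}", "{obj} founded {subj}"),
        ("PERSON",),
    ),
]


def candidate_relations(
    subj_type: str, obj_type: str, specs: Sequence[RelationSpec]
) -> List[str]:
    """Relations whose prefix matches the subject type and whose accepted
    argument types include the object type."""
    prefix = "per:" if subj_type == "PERSON" else "org:"
    return [
        spec.relation
        for spec in specs
        if spec.relation.startswith(prefix) and obj_type in spec.valid_obj_types
    ]


city_candidates = candidate_relations("PERSON", "CITY", SPECS)
```

The three birth-place relations share the identical template "{subj} was born in {obj}", so in this sketch only the argument type tells them apart.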