2021 emnlp-main 92 paper
Zero-Shot and Few-Shot learning. Brown et al. (2020) showed that task descriptions (prompts) can be fed into LMs for task-agnostic and few-shot performance. In addition, Schick and Schütze (2020, 2021) and Tam et al. (2021) extend the method and allow finetuning of LMs on a variety of tasks. Prompt-based prediction treats the downstream task as a (masked) language modeling problem, where the model directly generates a tex-

Partially vs. fully unseen labels in RE. Existing zero/few-shot RE models usually see some labels during training (label partially unseen), which helps generalize to the unseen label (Levy et al., 2017; Obamuyide and Vlachos, 2018; Han et al., 2018; Chen and Li, 2021). These approaches do not fully address the data scarcity problem. In this work we address the more challenging label fully unseen scenario.
Figure 1: General workflow of our entailment-based RE approach.
Table 1: Statistics about the dataset scenarios based on TACRED used in the paper, including positive examples
per relation, total amount of positive examples and the total amount of negative (no-relation) examples.
split).⁶ In this scenario the models are not allowed to train their own parameters, but development data is used to adjust the hyperparameters.

Few-Shot. This scenario presents the challenge of solving the RE task with just a few examples per relation. We present three settings commonly used in few-shot learning (Gao et al., 2020):⁷ around 4 examples per relation (1% of the training data in TACRED), around 16 examples per relation (5%) and around 32 examples per relation (10%). We reduced the development set following the same ratio.

Full Training. In this setting we use all available training and development data.

Data Augmentation. In this scenario we want to test whether a silver dataset produced by running our systems on untagged data can be used to train a supervised relation extraction system (cf. Section 3). In this scenario 75% of the training data in TACRED is set aside as unlabeled data,⁸ and the rest of the training data is used in different splits (ranging from 1% to 10%). Under this setting we carried out two types of experiments. In the zero-shot experiments (0% in the table) we use our NLI-based model to annotate the silver data and then fine-tune the RE model exclusively on the silver data. In the few-shot experiments the NLI model is first fine-tuned with the gold data, then used to annotate the silver data, and finally the RE model is fine-tuned over both silver and gold annotations.

⁶ This setting is comparable to one where the examples in the guidelines are used as development.
⁷ The commonly reported value in few-shot scenarios is 16 examples per label. We also added the 3-8 and 32 examples settings in the evaluation.
⁸ We use part of the original TACRED dataset to produce silver data in order not to introduce noise coming from different documents and/or pre-processing steps.

4.2 Hand-crafted relation templates

We manually created the templates to verbalize relation labels, based on the TAC-KBP guidelines which underlie the TACRED dataset. We limited the time for creating the templates of each relation to less than 15 minutes. Overall, we created 1-8 templates per relation (2 on average) (cf. Appendix C for the full list).

The verbalization process consists of generating one or more templates that describe the relation and contain the placeholders {subj} and {obj}. The developer building the templates was given the task guidelines (a brief description of the relation, including one or two examples and the type of the entities) and an NLI model (roberta-large-mnli checkpoint). For a given relation, the developer would create a template (or set of templates) and check whether the NLI model is able to output a high entailment probability for the template when applied to the guideline example(s). The developer could repeat this process for any new template they came up with. There was no strict threshold involved for selecting the templates, just the intuition of the developer. The spirit was to come up with simple templates quickly, and not to build numerous complex templates or to optimize entailment probabilities.

4.3 Pre-Trained NLI models

For our experiments we tried different NLI models that are publicly available with the Hugging Face Transformers (Wolf et al., 2020) Python library.
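The template-checking loop described in Section 4.2 can be sketched as follows. This is only an illustration under stated assumptions: the names `verbalize`, `check_templates` and `toy_scorer` are hypothetical, the example sentence is invented, and the NLI model is replaced by an injected callable so that the sketch runs without downloading a checkpoint. In practice the scorer would wrap an NLI model such as roberta-large-mnli and return its entailment probability.

```python
from typing import Callable, List, Tuple

# An entailment scorer maps (premise, hypothesis) to a probability in [0, 1].
# Assumption: a real scorer would wrap an NLI checkpoint; here it is injected.
EntailmentScorer = Callable[[str, str], float]


def verbalize(template: str, subj: str, obj: str) -> str:
    """Fill the {subj} and {obj} placeholders of a relation template."""
    return template.format(subj=subj, obj=obj)


def check_templates(
    templates: List[str],
    premise: str,
    subj: str,
    obj: str,
    entail_prob: EntailmentScorer,
) -> List[Tuple[str, float]]:
    """Score each candidate template on one guideline example.

    Returns (hypothesis, probability) pairs sorted from most to least
    entailed, so a developer can inspect which templates the model accepts.
    """
    scored = []
    for template in templates:
        hypothesis = verbalize(template, subj, obj)
        scored.append((hypothesis, entail_prob(premise, hypothesis)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_scorer(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for an NLI model (NOT a real entailment model):
    the fraction of hypothesis words that also occur in the premise."""
    premise_words = set(premise.lower().replace(".", "").split())
    hypothesis_words = hypothesis.lower().replace(".", "").split()
    return sum(w in premise_words for w in hypothesis_words) / len(hypothesis_words)


if __name__ == "__main__":
    example = "Ada Lovelace was born in London in 1815."  # invented example
    candidates = ["{subj} died in {obj}", "{subj} was born in {obj}"]
    for hypothesis, prob in check_templates(
        candidates, example, "Ada Lovelace", "London", toy_scorer
    ):
        print(f"{prob:.2f}  {hypothesis}")
```

Keeping the scorer as a parameter mirrors the informal procedure in the text: the developer looks at the ranked probabilities and decides by intuition, with no fixed acceptance threshold.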
                              MNLI   No Dev (T = 0.5)      1% Dev
NLI Model         # Param.    Acc.   Pr.    Rec.   F1      Pr.    Rec.   F1
ALBERT_xxLarge    223M        90.8   32.6   79.5   46.2    55.2   58.1   56.6 ±1.4
RoBERTa           355M        90.2   32.8   75.5   45.7    58.5   53.1   55.6 ±1.3
BART              406M        89.9   39.0   63.1   48.2    60.7   46.0   52.3 ±1.8
DeBERTa_xLarge    900M        91.7   40.3   77.7   53.0    66.3   59.7   62.8 ±1.7
DeBERTa_xxLarge   1.5B        91.7   46.6   76.1   57.8    63.2   59.8   61.4 ±1.0
Table 2: Zero-Shot scenario results (Precision, Recall and F1) for our system using several pre-trained NLI models
in two settings: no development (default threshold T =0.5), and small development (1% Dev.) for setting T . In the
leftmost columns we report the number of parameters and the accuracy in MNLI. For the 1% setting we report the
median measures along with the F1 standard deviation in 100 runs.
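Setting the threshold T on a small development set, as in the 1% Dev columns of Table 2, can be sketched as a simple sweep. The decision rule below (predict the top-scoring relation when its entailment probability reaches T, otherwise no-relation), the micro-averaged F1 over positive relations, and the names `apply_threshold`, `micro_f1` and `tune_threshold` are assumptions made for illustration; they are not taken from the authors' code.

```python
from typing import Sequence, Tuple

NO_RELATION = "no_relation"

# One scored example: (top-scoring relation, its entailment probability, gold label).
Scored = Tuple[str, float, str]


def apply_threshold(relation: str, prob: float, threshold: float) -> str:
    """Keep the predicted relation only if its probability reaches the threshold."""
    return relation if prob >= threshold else NO_RELATION


def micro_f1(examples: Sequence[Scored], threshold: float) -> float:
    """Micro F1 over positive relations; no_relation is not counted as a class."""
    correct = predicted = gold_positive = 0
    for relation, prob, gold in examples:
        pred = apply_threshold(relation, prob, threshold)
        if pred != NO_RELATION:
            predicted += 1
            if pred == gold:
                correct += 1
        if gold != NO_RELATION:
            gold_positive += 1
    if correct == 0:
        return 0.0
    precision = correct / predicted
    recall = correct / gold_positive
    return 2 * precision * recall / (precision + recall)


def tune_threshold(
    dev: Sequence[Scored], candidates: Sequence[float]
) -> Tuple[float, float]:
    """Return (best_threshold, best_f1); ties keep the earliest candidate."""
    best_t, best_f1 = candidates[0], micro_f1(dev, candidates[0])
    for t in candidates[1:]:
        f1 = micro_f1(dev, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


if __name__ == "__main__":
    # Invented development examples, for illustration only.
    dev = [
        ("per:title", 0.95, "per:title"),
        ("per:employee_of", 0.90, "per:employee_of"),
        ("per:spouse", 0.70, NO_RELATION),
        ("org:founded", 0.60, NO_RELATION),
        ("per:age", 0.85, "per:age"),
        ("per:origin", 0.55, NO_RELATION),
    ]
    print(tune_threshold(dev, [0.5, 0.6, 0.7, 0.8, 0.9]))
```

With the default T = 0.5 every invented example above is predicted as positive, which hurts precision; raising T trades recall for precision, which is the same trade-off visible between the No Dev and 1% Dev columns.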
Table 3: Few-shot scenario results with 1%, 5% and 10% of training data. Precision, Recall and F1 score (standard
deviation) of the median of 3 different runs are reported. Top four rows for third-party RE systems run by us.
F1 of 30.1,¹¹ well below the 45.7 obtained when using the default threshold (T = 0.5). Overall we see an excellent zero-shot performance across all the models and settings, which shows that the approach is robust and model agnostic.

Regarding pre-trained models, the best F1 scores are obtained by the two DeBERTa v2 models, which also score the best on the MNLI dataset. Note that all the models achieve similar scores on MNLI, but small differences in MNLI result in large performance gaps when it comes to RE, e.g. the 1.5-point difference in MNLI between RoBERTa and DeBERTa becomes 7 points in No Dev. and 1% Dev. We think the larger differences in RE are due to the generalization ability of some of the larger models to domain and task differences.

The table includes the results for different values of the T hyperparameter. In the most challenging setting, with default T, the results are worse, with at most 57.8 F1. However, using as few as 2 examples per relation on average (1% Dev. setting) the results improve significantly.

We performed further experiments using larger amounts of development data to tune T. Figure 2 shows that, for all models, the most significant improvement occurs in the interval [0%, 1%) and that the interval [1%, 100%] is almost flat. The best result with all development data is 63.4%, only 0.6 points better than using 1% of development. These results show clearly that a small number of examples suffices to set an optimal threshold.

5.2 Few-Shot

Table 3 shows the results of competing RE systems and our systems on the few-shot scenario. We report the median and standard deviation across 3 different runs. The competing RE methods suffer a large performance drop, especially for the smallest training setting. For instance, the SpanBERT system (Joshi et al., 2020) has difficulties converging, even with the 10% of data setting. Both K-Adapter (Wang et al., 2020) and LUKE (Yamada et al., 2020) improve over the RoBERTa system (Wang et al., 2020) in all three settings, but they are well below our NLI_RoBERTa system, which improves over the baseline by 48, 22 and 13 points in the respective settings. We also report our method based on DeBERTa_xLarge, which is especially effective in the smaller settings.

We would like to note that the zero-shot NLI_RoBERTa system (1% Dev) is comparable in terms of F1 score to a vanilla RoBERTa trained with 10% of the training data. That is, 54 templates (10.5 hours), plus 23 development examples, are roughly equivalent to 6800 annotated examples¹² for training (plus 2265 development).

¹¹ Results omitted from Table 2 for brevity.
¹² Unfortunately we could not find the time estimates for annotating examples.

Model                 Pr.    Rec.   F1
SpanBERT              70.8   70.9   70.8
RoBERTa               70.2   72.4   71.3
K-Adapter             70.1   74.0   72.0
LUKE                  70.4   75.1   72.7
NLI_RoBERTa (ours)    71.6   70.4   71.0
NLI_DeBERTa (ours)    72.5   75.3   73.9
Table 4: Full training results (TACRED). Top four rows for third-party RE systems as reported by authors.

5.3 Full training

Some zero-shot and few-shot systems are not able to improve results when larger amounts of training data are available. Table 4 reports the results when the whole train and development datasets are used, which is comparable to official results
on TACRED. Focusing on our NLI_RoBERTa system, and comparing it to the results in Table 3, we can see that it is able to effectively use the additional training data, improving from 67.9 to 71.0. When compared to a traditional RE system, it performs on a par with RoBERTa, and a little behind K-Adapter and LUKE, probably due to the infused knowledge which our model is not using. These results show that our model keeps improving with additional data and that it is competitive when larger amounts of training data are available. The results of NLI_DeBERTa show that our model can benefit from larger and more effective pre-trained NLI systems even in full training scenarios, and in fact achieves the best results to date on the TACRED dataset.

5.4 Data augmentation results

In this section we explore whether our NLI-based system can produce high-quality silver data which can be added to a small amount of gold data when training a traditional supervised RE system, e.g. the RoBERTa baseline (Wang et al., 2020). Table 5 reports the F1 results on the data augmentation scenario for different amounts of gold training data. Overall, we can see that both our zero-shot and few-shot methods¹³ provide good quality silver data, as they improve significantly over the baseline in all settings. Although the zero-shot and few-shot methods yield the same result with 1% of training data, the few-shot model is better in the rest of the training regimes, showing that it can effectively use the available training data in each case to provide better quality silver data. If we compare the results in this table with those of the respective NLI-based system with the same amount of gold training instances (Tables 2 and 3) we can see that the results are comparable, showing that our NLI-based system and a traditional RE system trained with silver annotations have comparable performance. A practical advantage of a traditional RE system trained with our silver data is that it is easier to integrate into available pipelines, as one just needs to download the trained Transformer model. It also makes it easy to check additive improvements in the RE method.

Model              0%     1%     5%     10%
RoBERTa            -      7.7    41.8   55.1
+ Zero-Shot DA     56.3   58.4   58.8   59.7
+ Few-Shot DA      -      58.4   64.9   67.7
Table 5: Data Augmentation scenario results (F1) for different gold training sizes. Silver annotations by the zero-shot and few-shot NLI_RoBERTa model.

¹³ The zero-shot 1% Dev model is used in all data augmentation experiments, while the few-shot method changes to use the available data at each run (1%, 5% and 10%), both with RoBERTa.

6 Analysis

Relation extraction can be analysed according to two auxiliary metrics: the binary task of detecting a positive relation vs. no-relation, and the multi-class problem of detecting which relation holds among positive cases (that is, discarding no-relation instances from test data). Table 6 shows the results of a selection of systems and scenarios. The first rows compare the performance of our best system, NLI_DeBERTa, across four scenarios, while the last two rows show the results for LUKE in two scenarios. The zero-shot No Dev. system is very effective when discriminating the relation among positive examples (P column), only 7 points below the fully trained system, while it lags well behind when discriminating positive vs. negative, by 18 points. The use of a small development set for tuning the T threshold narrows the gap in PvsN, as expected, but the difference is still 10 points. All in all, these numbers show that our zero-shot system is very effective at discriminating among positive examples, but that it still lags behind when detecting no-relation cases. Overall, the figures show the effectiveness of our methods in low data scenarios on both metrics.

Model          Scenario             P      PvsN
NLI_DeBERTa    Zero-Shot, No Dev    85.6   59.5
               Zero-Shot, 1% Dev    85.6   67.7
               Few-Shot, 5%         89.7   74.5
               Full train           92.2   77.8
LUKE           Few-Shot, 5%         69.3   63.4
               Full train           90.2   77.3
Table 6: Performance of selected systems and scenarios on two metrics: the binary task of detecting a positive relation vs. no-relation (PvsN column, F1) and detecting the correct relation among positive cases (P, F1).

Confusion analysis. In supervised models some classes (relations) are better represented in training than others, usually due to data imbalance. Our system instead represents each relation as a set of templates, which at least on a zero-shot
scenario, should not be affected by data imbalance. The strong diagonal in the confusion matrix (Fig. 3) shows that our model is able to discriminate properly between most of the relations (after all it achieves 85.6% accuracy, cf. Table 6), with the exception of the no-relation column, which was expected. Regarding the confusion between actual relations, most of it involves overlapping relations, as expected. For instance, ORG:MEMBER_OF and ORG:PARENTS both involve some organization A being part or member of some other organization B, where ORG:MEMBERS is different from ORG:PARENTS in that correct fillers are distinct entities that are generally capable of autonomously ending their membership with the assigned organization.¹⁴ Something similar occurs between ORG:MEMBERS and ORG:SUBSIDIARIES. Another source of confusion arises when two or more relations hold concurrently, as in PER:ORIGIN, PER:COUNTRY_OF_BIRTH and PER:COUNTRY_OF_RESIDENCE. Finally, the model scores low on PER:OTHER_FAMILY, which is a bucket of many specific relations where only a handful were actually covered by the templates.

Figure 3: Confusion matrix of our NLI_DeBERTa zero-shot system on the development dataset. The rows represent the true labels and the columns the predictions. The matrix is row-wise normalized (recall in the diagonal).

¹⁴ Description extracted from the guidelines.

7 Conclusions

In this work we reformulate relation extraction as an entailment problem, and explore to what extent simple hand-made verbalizations are effective. The creation of templates is limited to 15 minutes per relation, and yet allows for excellent results in zero- and few-shot scenarios. Our method makes effective use of available labeled examples, and together with larger LMs produces the best results on TACRED to date. Our analysis indicates that the main performance difference against supervised models comes from discriminating no-relation examples, as the performance among positive examples equals that of the best supervised system using the full training data. We also show that our method can be used effectively as a data-augmentation method to provide additional labeled examples. For the future we would like to investigate better methods for detecting no-relation in zero-shot settings.

Acknowledgements

Oscar is funded by a PhD grant from the Basque Government (PRE_2020_1_0246). This work is based upon work partially supported via the IARPA BETTER Program contract No. 2019-19051600006 (ODNI, IARPA), and by the Basque Government (IXA excellence research group IT1343-19).
References

Christoph Alt, Aleksandra Gabryszak, and Leonhard Hennig. 2020. TACRED revisited: A thorough evaluation of the TACRED relation extraction task. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1558–1569, Online. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners.

Chih-Yao Chen and Cheng-Te Li. 2021. ZS-BERT: Towards zero-shot relation extraction with attribute representation learning. In Proceedings of 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-2021).

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pages 177–190, Berlin, Heidelberg. Springer Berlin Heidelberg.

Xinya Du and Claire Cardie. 2020. Event extraction by answering (almost) natural questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 671–683, Online. Association for Computational Linguistics.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2020. Making pre-trained language models better few-shot learners.

Ankur Goswami, Akshata Bhat, Hadar Ohana, and Theodoros Rekatsinas. 2020. Unsupervised relation extraction from language models using constrained cloze completion. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1263–1276, Online. Association for Computational Linguistics.

Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2018. FewRel: A large-scale supervised few-shot relation classification dataset with state-of-the-art evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4803–4809, Brussels, Belgium. Association for Computational Linguistics.

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. DeBERTa: Decoding-enhanced BERT with disentangled attention.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations.

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Sha Li, Heng Ji, and Jiawei Han. 2021. Document-level event argument extraction by conditional generation.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021. GPT understands, too.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Abiola Obamuyide and Andreas Vlachos. 2018. Zero-shot relation classification as textual entailment. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 72–78, Brussels, Belgium. Association for Computational Linguistics.

Raul Puri and Bryan Catanzaro. 2019. Zero-shot text classification with generative language models.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Oscar Sainz and German Rigau. 2021. Ask2Transformers: Zero-shot domain labelling with pretrained language models. In Proceedings of the 11th Global Wordnet Conference, pages 44–52, University of South Africa (UNISA). Global Wordnet Association.

Teven Le Scao and Alexander M. Rush. 2021. How many data points is a prompt worth? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2020. It's not just size that matters: Small language models are also few-shot learners. Computing Research Repository, arXiv:2009.07118.

George Stoica, Emmanouil Antonios Platanios, and Barnabás Póczos. 2021. Re-TACRED: Addressing shortcomings of the TACRED dataset. In Proceedings of the Thirty-fifth AAAI Conference on Artificial Intelligence 2021.

Derek Tam, Rakesh R Menon, Mohit Bansal, Shashank Srivastava, and Colin Raffel. 2021. Improving and simplifying pattern exploiting training.

Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. 2020. K-Adapter: Infusing knowledge into pre-trained models with adapters.

Sinong Wang, Han Fang, Madian Khabsa, Hanzi Mao, and Hao Ma. 2021. Entailment as few-shot learner.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity-aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6442–6454, Online. Association for Computational Linguistics.

Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3914–3923, Hong Kong, China. Association for Computational Linguistics.

Wenpeng Yin, Nazneen Fatema Rajani, Dragomir Radev, Richard Socher, and Caiming Xiong. 2020. Universal natural language processing with limited annotations: Try few-shot textual entailment as a start. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8229–8239, Online. Association for Computational Linguistics.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 35–45.
A Pre-Trained models

The pre-trained NLI models we have tested from the Transformers library are the following:

• ALBERT: ynie/albert-xxlarge-v2-snli_mnli_fever_anli_R1_R2_R3-nli
• RoBERTa: roberta-large-mnli
• BART: facebook/bart-large-mnli

B Experimental details

We carried out all the experiments on a single Titan V (16GB) except for the fine-tuning of DeBERTa, which was done on a cluster of 4 Titan V100 (32GB). The average inference time for the zero- and few-shot experiments is between 1h and 1.5h. The time needed for fine-tuning the NLI systems was at most 2.5h for RoBERTa and 5h for DeBERTa. All the experiments were done with mixed precision to speed up the overall runtime. The full hyperparameter settings used for fine-tuning NLI_RoBERTa and NLI_DeBERTa are listed below:

• Train epochs: 2
• Learning-rate: 4e-6
• Batch-size: 32
• FP16 training

C TACRED templates

This section describes the templates used in the TACRED experiments. We performed all the experiments using the templates shown in Tables 1 (for PERSON relations) and 2 (for ORGANIZATION relations). These templates were manually created based on the TAC KBP Slot Descriptions¹⁵ (annotation guidelines). Besides the templates, we also report the valid argument types that are accepted on each relation.

¹⁵ https://tac.nist.gov/2014/KBP/ColdStart/guidelines/TAC_KBP_2014_Slot_Descriptions_V1.4.pdf
Relation Templates Valid argument types
per:alternate_names {subj} is also known as {obj} PERSON, MISC
per:date_of_birth {subj}’s birthday is on {obj} DATE
{subj} was born on {obj}
per:age {subj} is {obj} years old NUMBER, DURATION
per:country_of_birth {subj} was born in {obj} COUNTRY
per:stateorprovince_of_birth {subj} was born in {obj} STATE_OR_PROVINCE
per:city_of_birth {subj} was born in {obj} CITY, LOCATION
per:origin {obj} is the nationality of {subj} NATIONALITY, COUNTRY, LOCATION
per:date_of_death {subj} died in {obj} DATE
per:country_of_death {subj} died in {obj} COUNTRY
per:stateorprovince_of_death {subj} died in {obj} STATE_OR_PROVINCE
per:city_of_death {subj} died in {obj} CITY, LOCATION
per:cause_of_death {obj} is the cause of {subj}’s death CAUSE_OF_DEATH
per:countries_of_residence {subj} lives in {obj} COUNTRY, NATIONALITY
{subj} has a legal order to stay in {obj}
per:statesorprovinces_of_residence {subj} lives in {obj} STATE_OR_PROVINCE
{subj} has a legal order to stay in {obj}
per:city_of_residence {subj} lives in {obj} CITY, LOCATION
{subj} has a legal order to stay in {obj}
per:schools_attended {subj} studied in {obj} ORGANIZATION
{subj} graduated from {obj}
per:title {subj} is a {obj} TITLE
per:employee_of {subj} is a member of {obj} ORGANIZATION
per:religion {subj} belongs to {obj} RELIGION
{obj} is the religion of {subj}
{subj} believe in {obj}
per:spouse {subj} is the spouse of {obj} PERSON
{subj} is the wife of {obj}
{subj} is the husband of {obj}
per:children {subj} is the parent of {obj} PERSON
{subj} is the mother of {obj}
{subj} is the father of {obj}
{obj} is the son of {subj}
{obj} is the daughter of {subj}
per:parents {obj} is the parent of {subj} PERSON
{obj} is the mother of {subj}
{obj} is the father of {subj}
{subj} is the son of {obj}
{subj} is the daughter of {obj}
per:siblings {subj} and {obj} are siblings PERSON
{subj} is brother of {obj}
{subj} is sister of {obj}
per:other_family {subj} and {obj} are family PERSON
{subj} is a brother in law of {obj}
{subj} is a sister in law of {obj}
{subj} is the cousin of {obj}
{subj} is the uncle of {obj}
{subj} is the aunt of {obj}
{subj} is the grandparent of {obj}
{subj} is the grandmother of {obj}
{subj} is the grandson of {obj}
{subj} is the granddaughter of {obj}
per:charges {subj} was convicted of {obj} CRIMINAL_CHARGE
{obj} are the charges of {subj}
Relation Templates Valid argument types
org:alternate_names {subj} is also known as {obj} ORGANIZATION, MISC
org:political/religious_affiliation {subj} has political affiliation with {obj} RELIGION, IDEOLOGY
{subj} has religious affiliation with {obj}
org:top_members/employees {obj} is a high level member of {subj} PERSON
{obj} is chairman of {subj}
{obj} is president of {subj}
{obj} is director of {subj}
org:number_of_employees/members {subj} employs nearly {obj} people NUMBER
{subj} has about {obj} employees
org:members {obj} is member of {subj} ORGANIZATION, COUNTRY
{obj} joined {subj}
org:subsidiaries {obj} is a subsidiary of {subj} ORGANIZATION, LOCATION
{obj} is a branch of {subj}
org:parents {subj} is a subsidiary of {obj} ORGANIZATION, COUNTRY
{subj} is a branch of {obj}
org:founded_by {subj} was founded by {obj} PERSON
{obj} founded {subj}
org:founded {subj} was founded in {obj} DATE
{subj} was formed in {obj}
org:dissolved {subj} existed until {obj} DATE
{subj} disbanded in {obj}
{subj} dissolved in {obj}
org:country_of_headquarters {subj} has its headquarters in {obj} COUNTRY
{subj} is located in {obj}
org:stateorprovince_of_headquarters {subj} has its headquarters in {obj} STATE_OR_PROVINCE
{subj} is located in {obj}
org:city_of_headquarters {subj} has its headquarters in {obj} CITY, LOCATION
{subj} is located in {obj}
org:shareholders {obj} holds shares in {subj} ORGANIZATION, PERSON
org:website {obj} is the URL of {subj} URL
{obj} is the website of {subj}
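One way the templates and valid argument types listed above could be combined at inference time is sketched below. Several points are assumptions rather than details stated in this appendix: the valid argument types are read as constraints on the object entity, the subject type is taken from the relation prefix (per: or org:), a relation is scored by its best template, and no-relation is returned when no candidate reaches the threshold T. The names `TEMPLATES`, `candidate_relations` and `predict` are hypothetical, and only a small excerpt of the tables is included.

```python
from typing import Callable, Dict, List, Set, Tuple

NO_RELATION = "no_relation"

# Excerpt of the tables above: relation -> (templates, valid object types).
TEMPLATES: Dict[str, Tuple[List[str], Set[str]]] = {
    "per:date_of_birth": (["{subj} was born on {obj}"], {"DATE"}),
    "per:city_of_birth": (["{subj} was born in {obj}"], {"CITY", "LOCATION"}),
    "per:city_of_death": (["{subj} died in {obj}"], {"CITY", "LOCATION"}),
    "org:founded": (
        ["{subj} was founded in {obj}", "{subj} was formed in {obj}"],
        {"DATE"},
    ),
}

# Assumption: the subject entity type selects the relation prefix.
PREFIX_BY_SUBJECT_TYPE = {"PERSON": "per:", "ORGANIZATION": "org:"}


def candidate_relations(subj_type: str, obj_type: str) -> List[str]:
    """Relations whose prefix matches the subject type and whose valid
    argument types include the object type."""
    prefix = PREFIX_BY_SUBJECT_TYPE.get(subj_type)
    if prefix is None:
        return []
    return [
        relation
        for relation, (_, valid_types) in TEMPLATES.items()
        if relation.startswith(prefix) and obj_type in valid_types
    ]


def predict(
    premise: str,
    subj: str,
    subj_type: str,
    obj: str,
    obj_type: str,
    entail_prob: Callable[[str, str], float],
    threshold: float = 0.5,
) -> str:
    """Return the best type-compatible relation, or no_relation if its
    entailment probability stays below the threshold."""
    best_relation, best_prob = NO_RELATION, 0.0
    for relation in candidate_relations(subj_type, obj_type):
        for template in TEMPLATES[relation][0]:
            hypothesis = template.format(subj=subj, obj=obj)
            prob = entail_prob(premise, hypothesis)
            if prob > best_prob:
                best_relation, best_prob = relation, prob
    return best_relation if best_prob >= threshold else NO_RELATION
```

Filtering by argument type before scoring keeps templates that are identical across relations, such as "{subj} was born in {obj}" for country, state and city of birth, from competing with each other when the object type already disambiguates them.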