Efficient Attentions for Long Document Summarization

Abstract

The quadratic time and memory complexities of Transformer attentions limit their scalability for long document summarization. In this paper, we propose HEPOS, a novel efficient encoder-decoder attention with head-wise positional strides to effectively pinpoint salient information from the source. We further conduct a systematic study of existing efficient self-attentions. Combined with HEPOS, we are able to process ten times more tokens than existing models that use full attentions. For evaluation, we present a new dataset, GOVREPORT, with significantly longer documents and summaries. Results show that our models produce significantly higher ROUGE scores than competitive comparisons, including new state-of-the-art results on PubMed. Human evaluation also shows that our models generate more informative summaries with fewer unfaithful errors.

1 Introduction

Long documents, such as scientific papers and government reports, often discuss substantial issues at length, and thus are time-consuming to read, let alone to comprehend. Generating abstractive summaries can help readers quickly grasp the main topics, yet prior work has mostly focused on short texts (containing hundreds of words), e.g., news articles (Gehrmann et al., 2018; Liu and Lapata, 2019; Zhang et al., 2019).

Model training efficiency and summary quality present a pair of challenges for long document summarization. State-of-the-art systems (Lewis et al., 2020; Zhang et al., 2019) are built upon Transformer (Vaswani et al., 2017), which uses attentions to compute pairwise relations between tokens. Such a framework has quadratic time and memory complexities, and is too costly for long documents.1 Solutions have been proposed to reduce the cost of encoder self-attentions (e.g., Beltagy et al., 2020; Tay et al., 2020a). Yet, these methods do not apply to encoder-decoder attentions in summarization models, since those attentions collaborate and dynamically pinpoint salient content in the source as the summary is decoded. Truncation is commonly used to circumvent the issue. However, training on curtailed content further aggravates "hallucination" in existing abstractive models (Maynez et al., 2020).

We argue that summarizing long documents (e.g., with thousands of words or more) requires efficient handling of both types of attentions. To this end, we propose an efficient encoder-decoder attention with head-wise positional strides (HEPOS), where the attention heads follow a strided pattern and have varying starting positions. HEPOS reduces computational and memory costs while (1) maintaining the power of emphasizing important tokens, and (2) preserving the global context per head. HEPOS successfully doubles the processed input sequence size when combined with any encoder. To the best of our knowledge, we are the first to study efficient encoder-decoder attentions and provide a systematic comparison of diverse encoder attentions for the task of summarization.2

For evaluation, we collect a new large-scale dataset, GOVREPORT, consisting of about 19.5k U.S. government reports with expert-written abstractive summaries.3 GOVREPORT has two important features: (1) It contains significantly longer documents (9.4k words) and summaries (553 words) than existing datasets, such as PubMed and arXiv (Cohan et al., 2018) (see Table 2); (2) Salient content is spread throughout the documents, as opposed to cases where summary-worthy words are more heavily concentrated in specific parts of the document. These properties make GOVREPORT an important benchmark for producing long document summaries with multiple paragraphs.

We conduct experiments on GOVREPORT and scientific papers in PubMed and arXiv. First, when summarizing documents of the same length, HEPOS attention yields significantly better ROUGE scores than a non-trivial comparison that projects attentions into low-rank space (Wang et al., 2020c). Second, when trained on the same GPU, HEPOS attention, combined with sparse encoder attentions, is able to read more than 10K words and obtains significantly higher ROUGE scores on GOVREPORT and new state-of-the-art results on PubMed, compared with full encoder-decoder attention models, which can process at most 5K input words. Human judges further rate the summaries generated by our models to be more informative and faithful.

Model                     Complexity     # New Param.
Full                      O(n²)          —
Encoder Self-attentions
I. Fixed Patterns
  Sliding Window (2020)   O(nw)          0
  Adaptive Span (2019)    O(nŵ)          O(1)
  Global Tokens (2020)    O(2ng)         0
  Stride (2019)           O(n²/s)        0
  Random (2020)           O(nr)          0
II. Low-rank
  Linformer (2020c)       O(nk)          O(n)
III. Learnable Patterns
  LSH (2020)              O(l·n·b_l)     0
  Sinkhorn (2020a)        O(2n·b_s)      0
Encoder-decoder Attentions
  HEPOS (ours)            O(mn/s_h)      0
  Linformer               O(mk)          O(n)

Table 1: Summary of efficient Transformer attentions on memory complexity and newly learned parameters compared with full attentions at each layer. n and m are the lengths of the input and the output. See § 2 and § 3 for model-specific hyperparameters.

1 For instance, to fine-tune BART on documents of 10K tokens with a batch size of 1, 70GB of memory is needed for encoder attentions, and 8GB for encoder-decoder attentions.
2 Our code is released at https://github.com/luyang-huang96/LongDocSum.
3 GOVREPORT can be downloaded from https://gov-report-data.github.io.
We further propose a new evaluation metric for faithfulness, inspired by APES (Eyal et al., 2019), a fill-in-the-blank QA metric for summary evaluation. With questions generated from references, our metric, APESsrc, compares QA answers obtained by reading the source and the system summary. It is shown to be better correlated with human judgment than the original metric and an entailment-based scorer (Kryscinski et al., 2020).
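For illustration only, the following minimal sketch shows the comparison APESsrc performs; the cloze-question generation and the QA model are abstracted behind a hypothetical `answer(question, text)` callable, and this is not the evaluation code used for the reported results (details follow in § 5).

```python
from typing import Callable, List

def apes_src_score(questions: List[str],
                   source: str,
                   system_summary: str,
                   answer: Callable[[str, str], str]) -> float:
    """Toy APESsrc-style comparison: cloze questions generated from the
    reference are answered twice, once over the source document and once
    over the system summary; the score is the fraction of agreements.
    `answer` is a placeholder for any extractive QA model."""
    if not questions:
        return 0.0
    agree = 0
    for q in questions:
        from_source = answer(q, source)            # answer grounded in the source
        from_summary = answer(q, system_summary)   # answer read off the summary
        agree += int(from_source.strip().lower() == from_summary.strip().lower())
    return agree / len(questions)
```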
The rest of the paper is organized as follows. We describe efficient encoder attentions in prior work in § 2, and formulate our proposed encoder-decoder attention in § 3. The GOVREPORT data is presented in § 4. We then share details on evaluation metrics (§ 5) and experimental results (§ 6). Additional related work is listed in § 7, with the conclusion in § 8.

2 Prior Work on Efficient Encoder Attentions

Transformer models are built upon multi-head attentions in multiple layers. The attention is calculated as Attention(Q, K, V) = softmax(QK^T / √d_k) V, where Q, K, and V are query, key, and value matrices, each consisting of n vectors for a document with n tokens, thus the quadratic memory footprint. Here, we present an overview of representative methods for efficient encoder self-attentions (henceforth "encoder attentions") that can be built upon large pre-trained seq2seq models, e.g., BART (Lewis et al., 2020). We follow the naming convention of Tay et al. (2020b), and summarize their memory complexities and numbers of newly learned parameters in Table 1.
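To make the quadratic footprint concrete, here is a minimal single-head version of the formula above in PyTorch (an illustrative sketch, not the models' implementation); the n × n score matrix is what becomes prohibitive for long inputs.

```python
import math
import torch

def full_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Dense scaled dot-product attention for a single head.
    Q, K, V have shape [n, d_k]; `scores` has shape [n, n], hence the
    quadratic memory cost in the sequence length n."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # [n, n]
    return torch.softmax(scores, dim=-1) @ V           # [n, d_k]
```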
2.1 Fixed Patterns

Fixed patterns are used to limit the scope of attentions. In our experiments, in addition to window-based attentions, we also combine them with global tokens, stride patterns, or random attentions.

Sliding window attentions (Beltagy et al., 2020) aim to capture the local context, which is critical for language understanding (Liu et al., 2018; Child et al., 2019). Concretely, each query token attends to w/2 neighboring tokens on both left and right, yielding a memory complexity of O(nw).
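As a minimal sketch of the windowed pattern (the mask is materialized explicitly here only for clarity; efficient implementations such as Longformer compute the banded attention directly rather than building an n × n matrix):

```python
import torch

def sliding_window_mask(n: int, w: int) -> torch.Tensor:
    """Boolean [n, n] mask in which query i may attend to the w/2
    neighboring tokens on each side, giving O(nw) allowed entries."""
    idx = torch.arange(n)
    return (idx[None, :] - idx[:, None]).abs() <= w // 2

# Example: sliding_window_mask(8, 4)[0] allows positions 0, 1, and 2.
```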
Adaptive span is proposed by Sukhbaatar et al. (2019) to learn attention windows at different layers. This is implemented by learning a masking function for each head independently. In practice, the adaptive span attention has a complexity of O(nŵ), where ŵ is the maximum value of the predicted spans over all heads. Besides, it introduces O(1) new parameters for learning spans.

Global tokens (Beltagy et al., 2020) are often added to sliding windows to let pre-selected tokens attend to the full sequence, to build global representations. Importantly, global attention operations are symmetric, i.e., a global token is also attendable to all tokens in the sequence. We select the first g tokens as global tokens, as leading sentences are often important for summarization. Memory complexity is O(2ng) due to the symmetric attentions.

Stride patterns are proposed by Child et al. (2019) to capture long-term interactions, where each query attends to every s-th token, with s as the stride size. It thus has a complexity of O(n²/s).

Random attention is motivated by the fact that randomly constructed graphs with Θ̃(n) edges can approximate the complete graphs spectrally (Zaheer et al., 2020). Zaheer et al. (2020) propose to allow each query to attend to r random keys, resulting in a complexity of O(nr). For efficient implementation, input tokens are first segmented into blocks. Tokens in the same block attend to tokens in another randomly selected block.
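For illustration, the sketch below combines three of the fixed patterns above (window, global tokens, and random keys) into a single boolean mask. It is again materialized explicitly only for clarity; the parameters n, w, g, and r follow the notation above, and the function itself is not from the released code.

```python
import torch

def combined_fixed_pattern_mask(n: int, w: int, g: int, r: int,
                                seed: int = 0) -> torch.Tensor:
    """Union of a sliding window (width w), g leading global tokens,
    and r random keys per query, as a boolean [n, n] mask."""
    idx = torch.arange(n)
    mask = (idx[None, :] - idx[:, None]).abs() <= w // 2  # local window
    mask[:, :g] = True  # every token attends to the g global tokens
    mask[:g, :] = True  # global tokens attend to the whole sequence
    gen = torch.Generator().manual_seed(seed)
    random_keys = torch.randint(0, n, (n, r), generator=gen)
    mask[idx[:, None], random_keys] = True  # r random keys per query
    return mask
```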
2.2 Low-rank Methods

Wang et al. (2020c) show that self-attention matrices are low-rank. They propose Linformer, which linearly projects key and value matrices into a low-dimensional space, e.g., from n to k, to achieve an O(nk) complexity. It also introduces O(n) new parameters for learning the projection matrices.
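A minimal single-head sketch of this low-rank idea (shapes only; the class and its names are illustrative, not Wang et al.'s code): keys and values of length n are compressed to k rows with learned projections before the attention, so the score matrix is n × k rather than n × n.

```python
import math
import torch
import torch.nn as nn

class LowRankSelfAttention(nn.Module):
    """Linformer-style single-head self-attention: K and V ([n, d]) are
    compressed along the sequence dimension by learned n -> k projections,
    giving O(nk) attention scores and O(n) extra parameters."""

    def __init__(self, n: int, d: int, k: int):
        super().__init__()
        self.proj_k = nn.Linear(n, k, bias=False)  # compresses the keys
        self.proj_v = nn.Linear(n, k, bias=False)  # compresses the values
        self.d = d

    def forward(self, Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
        # Q, K, V: [n, d]
        K_low = self.proj_k(K.transpose(0, 1)).transpose(0, 1)  # [k, d]
        V_low = self.proj_v(V.transpose(0, 1)).transpose(0, 1)  # [k, d]
        scores = Q @ K_low.transpose(0, 1) / math.sqrt(self.d)  # [n, k]
        return torch.softmax(scores, dim=-1) @ V_low            # [n, d]
```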
2.3 Learnable Patterns

Recently, learnable sparse attentions have been proposed to better capture both local and global contexts than attentions based on fixed patterns.

Locality-sensitive hashing (LSH) attentions use a random-projection hashing function to hash similar queries and keys into the same buckets in l rounds (Kitaev et al., 2020). Attentions are then computed among tokens within each bucket. For a bucket size of b_l, the complexity of LSH attention is O(l·n·b_l).

Sinkhorn attentions first segment a sequence into blocks, which are then arranged by a learned Sinkhorn sorting network (Tay et al., 2020a). Given the new permutation, each query attends to b_s tokens within the same block to maintain the local context and another b_s tokens in a neighboring block to capture global interactions. Its complexity is O(2n·b_s).

2.4 Other Attentions

We also describe several notable methods that are not suitable for our experiments and are excluded from this study: recurrence over input segments is tailored for an autoregressive decoder only (Dai et al., 2019); memory methods use a separate memory module to attend to full sequences (Lee et al., 2019), and share a similar theoretical foundation with global tokens; and kernel methods over attentions require training models from scratch (Choromanski et al., 2020; Katharopoulos et al., 2020).

3 Encoder-decoder Attention with Head-wise Positional Strides (HEPOS)

The efficient design of encoder-decoder attentions with head-wise positional strides (HEPOS) allows models to consume longer sequences. Concretely, our design is motivated by two observations: (1) attention heads are redundant (Voita et al., 2019); (2) any individual head rarely attends to several tokens in a row (Clark et al., 2019). Therefore, as illustrated in Fig. 1, HEPOS uses separate encoder-decoder heads on the same layer to cover different subsets of source tokens at fixed intervals. Each head starts at a different position, and all heads collectively attend to the full sequence.

Figure 1: A toy example of our HEPOS attention, with a stride of 2 and four attention heads. Dark colors indicate that heads 1 and 3 attend to the first and third tokens ("Job" and "home") in the input, while heads 2 and 4 look at the second and fourth words ("in" and "care").

Given a stride size of s_h, for the h-th head, its attention value between decoder query q_j (at step j) and encoder key vector k_i (for the i-th input token) can be formulated as:

a^h_ji = softmax(q_j k_i)  if (i − h) mod s_h = 0,  and  a^h_ji = 0  otherwise.   (1)

In HEPOS attention, each query token attends to n/s_h tokens per head, yielding a memory complexity of O(mn/s_h), where m is the output length.

For comparison, Linformer (§ 2.2) can be straightforwardly adapted for encoder-decoder attentions by instead using decoder queries for the attention calculation. We do not adapt pattern-based attentions (§ 2.1 and § 2.3), since they rely on local token grouping, which makes it difficult to pinpoint salient content.
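As a minimal sketch of Eq. (1) (an illustrative re-implementation, not the released code): each head h only scores the encoder positions i with (i − h) mod s_h = 0, so a single head sees n/s_h keys while the heads together still cover every source token.

```python
import math
import torch

def hepos_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor,
                    stride: int) -> torch.Tensor:
    """Head-wise positional strides for encoder-decoder attention.
    Q: [n_heads, m, d]    decoder queries (m = output length)
    K, V: [n_heads, n, d] encoder keys/values (n = input length)
    Head h attends only to positions i with (i - h) % stride == 0,
    so memory per head is O(m * n / stride)."""
    n_heads, m, d = Q.shape
    n = K.size(1)
    outputs = []
    for h in range(n_heads):
        keep = torch.arange(h % stride, n, stride)  # this head's source positions
        scores = Q[h] @ K[h, keep].transpose(0, 1) / math.sqrt(d)  # [m, n/stride]
        attn = torch.softmax(scores, dim=-1)
        outputs.append(attn @ V[h, keep])  # [m, d]
    return torch.stack(outputs)            # [n_heads, m, d]
```

In the full model, a layer's cross-attention heads would share this pattern while the encoder uses one of the efficient self-attentions from § 2; the loop over heads is written out here only for readability.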
4 GOVREPORT Dataset

We introduce a new large-scale dataset, GOVREPORT, containing 19,466 long reports published by the U.S. Government Accountability Office (GAO)4 to fulfill requests by congressional members, and by the Congressional Research Service (CRS)5, covering (...)

Dataset        # Doc      Summary             Doc       Comp.   Den.
                          # word    # sent    # word
PubMed         133,215    202.4     6.8       3049.0    16.2    5.8
arXiv          215,913    272.7     9.6       6029.9    39.8    3.8
BillSum         23,455    207.7     7.2       1813.0    13.6    4.1
BigPatent    1,341,362    116.5     3.7       3573.2    36.3    2.4
GovReport       19,466    553.4     17.8      9409.4    19.0    7.3

Table 2: Statistics of GOVREPORT and existing long document summarization datasets (Comp.: compression ratio; Den.: extractive fragment density).

(...)

(...) summary covers important information of an aspect when compared with the reference. All system summaries for different aspects are included in Appendix D.

(...)
Encoder variants w/ full enc-dec attn.
FULL (1024)       3.27    20.1%    2.8%    14.3%
SINKHORN (5120)   3.94     4.8%    1.6%     9.6%
Encoder variants w/ HEPOS enc-dec attn. (ours)
(...)

[Figure 5: aspect-level human evaluation results on PubMed, by aspect (Introduction, Methods, Results, Conclusion); chart omitted.]

Figure 6: Aspect-level informativeness and percentages of sentences with unfaithful errors on GovReport. [Chart omitted; legend: Full(1k)+Full, Sinkhorn(5k)+Full, Sinkhorn(10k)+Hepos; aspects: Why GAO did this study, What GAO found, What GAO recommends.]

(...) encoder, obtains better informativeness scores than comparisons that read in less text on both datasets. This echoes results from automatic evaluation in the previous section. Moreover, both models that use efficient attentions reduce unfaithfulness, especially hallucination errors, when compared with the full attention model, which only reads 1024 tokens. As the models read more content, they learn to surface more factual and richer content in the summaries, as seen in Fig. 3. In particular, we find that the full attention model tends to produce fabricated numbers in the resultant summaries, whereas our models are able to correct them.
Next, we explore if reading more helps correctly reflect the content in documents' later sections. We plot aspect-level human ratings of informativeness and unfaithful errors on PubMed and GovReport in Fig. 5 and Fig. 6. We report percentages of sentences with unfaithful errors by majority voting (i.e., at least one error is found by both annotators in the sentence). As can be seen, our models consistently improve informativeness and reduce errors across sections, especially for "Results" and "Conclusions" on PubMed and "What GAO recommends" on GovReport; these sections often appear in the later part of the source documents.

Lastly, we report the entailment-based FactCC and the QA scores APES and APESsrc for top performing models in Table 7. The results again show that consuming longer input leads to more faithful summaries, though the differences are less pronounced.

6.5 Correlations between Human and Automatic Metrics

Finally, we study whether the faithfulness evaluation metrics correlate with human judgment. As shown in Table 8, on both government reports and scientific papers, QA metrics are better correlated with human ratings, with our newly proposed APESsrc being the stronger of the two. After inspection, we find that human-written summaries contain paraphrases or acronyms that APES cannot capture via strict lexical matching. For instance, for the question "Diabetes may worsen ___ in patients", the reference answer is "death rate", whereas the answers from the source and the system summary are both "mortality". APESsrc captures this, but not APES.
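For reference, correlations of this kind can be computed directly from per-summary scores; a minimal sketch with SciPy is shown below (the significance comparison reported in Table 8 additionally relies on Williams' test, which is omitted here, and the variable names are illustrative).

```python
from scipy.stats import pearsonr

def metric_correlation(human_ratings, metric_scores):
    """Pearson correlation between human ratings (e.g., informativeness,
    or counts of unfaithful errors) and an automatic metric, computed
    over the same set of system summaries."""
    r, p_value = pearsonr(human_ratings, metric_scores)
    return r, p_value

# Example with aligned per-summary lists (names are placeholders):
# r, p = metric_correlation(informativeness_ratings, apes_src_scores)
```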
                       GovReport                  PubMed
System (MaxLen)        F.     APES   APESsrc      F.     APES   APESsrc
FULL (1024)            58.9   42.7   42.7         74.6   43.2   31.5
Encoder variants w/ full enc-dec attn.
STRIDE (4096)          55.3   43.1   42.5         72.7   43.8   31.9
LIN. (3072)            48.4   35.7   36.3         67.7   39.3   29.5
LSH (4096)             55.7   44.0   43.6         73.2   46.7   35.1
SINKHORN (5120)        57.0   43.6   42.1         72.9   46.8   35.4
Encoder variants w/ HEPOS enc-dec attn. (ours)
LSH (7168)             59.6   44.0   44.2         73.3   47.5   35.6
SINKHORN (10240)       60.1   44.0   44.3         71.9   46.2   34.8

Table 7: Evaluation with FactCC (F.), APES, and the new APESsrc metric, with higher numbers indicating more faithful summaries.

                GovReport              PubMed
Metric          Inf.↑    Err.↓         Inf.↑    Err.↓
FactCC          0.07     -0.08         0.10     -0.14
APES            0.16     -0.15         0.25     -0.31
APESsrc         0.21     -0.23*        0.32*    -0.32

Table 8: Pearson correlation between human ratings and metrics. We use aggregated unfaithful errors (Err.). *: significantly better than other metrics based on Williams' test (Williams, 1959) (p < 0.05).

7 Related Work

Summarizing long inputs has been investigated in many domains, including books (Mihalcea and Ceylan, 2007), patents (Trappey et al., 2009), movie scripts (Gorinski and Lapata, 2015), and scientific publications (Qazvinian and Radev, 2008). However, the datasets are often too small to train neural models. Cohan et al. (2018) publish two large-scale datasets by collecting articles from arXiv and PubMed. Popular methods rely on extractive summarizers that identify salient sentences based on positional information (Dong et al., 2020) or combined global and local contexts (Xiao and Carenini, 2019), where each sentence is represented as aggregated word embeddings. However, extractive summaries are often redundant and incoherent, highlighting the need for handling long documents via abstractive summarization.

To that end, extract-then-abstract methods have been proposed. For example, Pilault et al. (2020) first extract relevant sentences and then rewrite them into paper abstracts. Our work is in line with building end-to-end abstractive summarization models for long input. Cohan et al. (2018) design a hierarchical encoder to read different sections separately, and then use combined attentions over words and sections to generate the summary. Multiple agents are created to read segments separately, and then collaboratively write an abstract (Celikyilmaz et al., 2018). However, both works truncate articles to 2K words. Although efficient encoder attentions have been studied by Zaheer et al. (2020) for abstractive summarization, at most 3K tokens can be consumed by their models. Our HEPOS encoder-decoder attention is able to process more than 10K tokens, significantly improving summary informativeness and faithfulness.
8 Conclusion

We investigate efficient attentions for long document summarization. We propose a novel encoder-decoder attention, HEPOS, based on head-wise positional strides that can effectively identify salient content. Models based on HEPOS attention can process at least twice as many words and produce more informative summaries with fewer unfaithful errors, according to both automatic and human evaluation. We further show that our new cloze QA metric better correlates with human judgment than prior faithfulness evaluation metrics.
Acknowledgments

This research is supported in part by Oracle for Research Cloud Credits, National Science Foundation through Grant IIS-1813341, and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract # FA8650-17-C-9116. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. We thank three anonymous reviewers for their valuable suggestions and comments.
References

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer.

BioNLP. 2011. Genia event extraction (genia).

BioNLP. 2013. Genia event extraction for nfkb knowledge base.

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1662–1675.

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. 2020. Rethinking attention with performers.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana. Association for Computational Linguistics.

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Yue Dong, Andrei Romascanu, and Jackie CK Cheung. 2020. HipoRank: Incorporating hierarchical and positional information into graph-based unsupervised long document extractive summarization. arXiv preprint arXiv:2005.00513.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, Online. Association for Computational Linguistics.

Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3938–3948, Minneapolis, Minnesota. Association for Computational Linguistics.

Sebastian Gehrmann, Yuntian Deng, and Alexander M. Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109.

Alexios Gidiotis and Grigorios Tsoumakas. 2020. A divide-and-conquer approach to the summarization of long documents. arXiv: Computation and Language.

Philip John Gorinski and Mirella Lapata. 2015. Movie script summarization as graph-based scene extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1066–1076, Denver, Colorado. Association for Computational Linguistics.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719, New Orleans, Louisiana. Association for Computational Linguistics.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast autoregressive transformers with linear attention.

Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In International Conference on Learning Representations.

Anastassia Kornilova and Vladimir Eidelman. 2019. BillSum: A corpus for automatic summarization of US legislation. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 48–56, Hong Kong, China. Association for Computational Linguistics.

Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346, Online. Association for Computational Linguistics.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. 2019. Set Transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3744–3753, Long Beach, California, USA. PMLR.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Ying Lin, Heng Ji, Fei Huang, and Lingfei Wu. 2020. A joint neural model for information extraction with global features. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Peter J. Liu*, Mohammad Saleh*, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. In International Conference on Learning Representations.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.

Rada Mihalcea and Hakan Ceylan. 2007. Explorations in automatic book summarization. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 380–389, Prague, Czech Republic. Association for Computational Linguistics.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc.

Jonathan Pilault, Raymond Li, Sandeep Subramanian, and Chris Pal. 2020. On extractive and abstractive neural document summarization with transformer language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9308–9319, Online. Association for Computational Linguistics.

Vahed Qazvinian and Dragomir R. Radev. 2008. Scientific paper summarization using citation summary networks. arXiv preprint arXiv:0807.1560.

Eva Sharma, Chen Li, and Lu Wang. 2019. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2204–2213, Florence, Italy. Association for Computational Linguistics.

Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604, Stockholmsmässan, Stockholm, Sweden. PMLR.

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 331–335, Florence, Italy. Association for Computational Linguistics.

Yi Tay, Dara Bahri, Liu Yang, Donald Metzler, and Da-Cheng Juan. 2020a. Sparse Sinkhorn attention.

Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. 2020b. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732.

Amy J. C. Trappey, Charles V. Trappey, and Chun-Yi Wu. 2009. Automatic patent document summarization for collaborative knowledge systems and services. Journal of Systems Science and Systems Engineering, 18(1):71–94.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Wen Xiao and Giuseppe Carenini. 2019. Extractive summarization of long documents by combining global and local context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3011–3021, Hong Kong, China. Association for Computational Linguistics.

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. CoRR, abs/1912.08777.

Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. SEAL: Segment-wise extractive-abstractive long-form text summarization.
Figure 7: Sample summaries for a government report. The model with truncated input generates unfaithful content. Our HEPOS encoder-decoder attention with Sinkhorn encoder attention covers more salient information in the "What GAO found" aspect.
Original Paper:
(...) anatomical abnormalities included complex pelvicalyceal anatomy in 98, pelviureteric junction obstruc-
tion (pujo) in 35, ectopic, horseshoe, cross-fused ectopic kidneys in 40, and duplex system with nonfunc-
tioning upper or lower moiety in 14 patients. (...)
of the 1404 patients from rural area, 547 (39%) were managed by open surgery as compared to 362 (22%)
of the patients from urban areas. (...)
the mean size of the kidney stone was 5.05 ± 5.88 cm, ureter 1.95 ± 1.33 cm, and bladder 7.6 ± 3.2 cm. (...)
overall success rate was 90% in pcnl, 96% in urs, 100% in cystolithotripsy, and 81% in swl.
Summary by Human:
Results
Results of 3969 surgeries, 2794 (70%) were minimally invasive surgery (mis) techniques to include swl 19%, pcnl 16%, urs 18.9%, and pucl + pccl 16% and 1175 (30%) were open surgeries.
the main factors necessitating open surgery were large stone burden 37%, anatomical abnormalities 16%,
stones with renal failure 34%, gross hydronephrosis with thin cortex 58% , urinary tract infection (uti)
25%, and failed mis 18%. nearly 50% of the surgeries were necessitated by economic constraints and long
distance from center where one-time treatment was preferred by the patient. stone-free rates by open surgeries were pyelolithotomy 91%, ureterolithotomy 100%, and cystolithotomy 100% with complication rate
of up to 3%.
Model w/ full attn.:
Results
results: the mean age of the patients was 6.5 years. the mean follow-up period was 12.5 months. mean stone
size was 1.5cm. the majority of the stones were lager stones (n=454). the stone free rate was 100%. open
surgery was performed in children with large bulk of stones, anatomical abnormalities, marked obstructive
cortical atrophy and scarring, gross hydronephrosis, or urinary tract infection (uti).
Model w/ HEPOS enc-dec attn.:
Results
the mean age of the patients was 6.35, 3.7 years with a male-to-female ratio of 2.8:1, and a range of 25
days (<1 month) to 15 years. anatomical abnormalities included complex pelvicalyceal anatomy in 98,
pelviureteric junction obstruction (pujo) in 35, ectopic, horseshoe, cross-fused ectopic kidneys in 40, and
duplex system with nonfunctioning upper or lower moiety in 14 patients. the mean size of the kidney stone
was 5.05 ± 5.88 cm3. of the 1404 patients from rural areas, 547 (39%) were managed by surgery as compared to 362 (22%) patients from urban areas. overall success rate was 90% in pcnl, 96% in urs, 100% in cystolithotripsy, and 81% in swl.
Figure 8: Sample summaries for a scientific paper. The model with truncated input generates fabricated facts. Our HEPOS encoder-decoder attention with LSH encoder attention is more faithful for the "results" aspect.
Aspect Example
Why GAO Did This Study To protect data that are shared with state government agencies, federal agencies
have established cybersecurity requirements and related compliance assessment
programs. Specifically, they have numerous cybersecurity requirements for states
to follow when accessing, storing, and transmitting federal data. GAO was asked
to evaluate federal agencies’ cybersecurity requirements and related assessment
programs for state agencies. The objectives were to determine the extent to which
(...)
What GAO Found Although the Centers for Medicare and Medicaid Services (CMS), Federal Bu-
reau of Investigation (FBI), Internal Revenue Service (IRS), and Social Security
Administration (SSA) each established requirements to secure data that states re-
ceive, these requirements often had conflicting parameters. Such parameters in-
volve agencies defining specific values like the number of consecutive unsuccess-
ful logon attempts prior to locking out the user. Among the four federal agencies,
the percentage of total requirements with conflicting parameters ranged from 49
percent to 79 percent. Regarding variance with National Institute of Standards
and Technology guidance, GAO found that the extent to which the four agencies
did not fully address guidance varied from 9 percent to 53 percent of total re-
quirements. The variances were due in part to the federal agencies’ insufficient
coordination in establishing requirements. (...)
What GAO Recommends GAO is making 12 recommendations to the four selected agencies and to OMB.
Three agencies agreed with the recommendations and one agency (IRS) partially
agreed or disagreed with them. OMB did not provide comments. GAO continues
to believe all recommendations are warranted.
Table 11: Sample reference summary with aspects labeled for a GovReport document. Keywords are used to match different parts of the summaries to the aspects.