Efficient Inference for Neural Machine Translation
Yi-Te Hsu1∗  Sarthak Garg2  Yi-Hsiu Liao2  Ilya Chatsviorkin2
1Johns Hopkins University
2Apple Inc.
[email protected], {sarthak_garg, yihsiu_liao, ilych}@apple.com
2 Efficient Inference for Neural Machine Translation

This section presents the proposed efficient inference architecture for neural machine translation. First, we outline the overall procedure of building an efficient inference architecture. Then, we detail each step in the process.

For noisy backward-forward translation, we use the forward T-MADL model to, again, re-decode the original bitext, but with more variance on the source side. Finally, we use the above generated synthetic data along with the original bitext to train our student model.

We use interpolated sequence-level knowledge distillation (Kim and Rush, 2016) in most of the described re-decoding runs, except the noisy backward-forward translation, where sampling is used in the reverse direction. More details about model training and architecture can be found in Kim et al. (2019).
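
For illustration, the sketch below assembles student training data from a forward and a backward teacher. The StubTeacher class and its translate() method are placeholders for the actual T-MADL teachers, and the n-best interpolation step of sequence-level knowledge distillation is omitted; only the data flow is meant to be representative.

```python
# Sketch of building distillation data for the student model.
# The "teachers" are trivial stand-ins; real teachers would run beam search
# (forward direction) or sampling (reverse direction).

class StubTeacher:
    """Placeholder translator; not part of the paper's released code."""
    def __init__(self, tag):
        self.tag = tag

    def translate(self, sentences, sampling=False):
        mode = "sample" if sampling else "beam"
        return [f"<{self.tag}:{mode}> {s}" for s in sentences]

forward_teacher = StubTeacher("en-de")   # source -> target teacher
backward_teacher = StubTeacher("de-en")  # target -> source teacher

src = ["a small test sentence", "another sentence"]
tgt = ["ein kleiner testsatz", "noch ein satz"]

# 1) Forward translation: re-decode the source side with the forward teacher.
forward_data = list(zip(src, forward_teacher.translate(src)))

# 2) Noisy backward-forward translation: sample source-side variants with the
#    backward teacher, then re-decode them with the forward teacher.
noisy_src = backward_teacher.translate(tgt, sampling=True)
noisy_data = list(zip(noisy_src, forward_teacher.translate(noisy_src)))

# 3) Train the student on the synthetic data together with the original bitext.
student_training_data = forward_data + noisy_data + list(zip(src, tgt))
print(len(student_training_data))
```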

decoder; however, we found that it can be removed entirely from the decoder without hurting the translation quality with our implementation of SSRU (Section 3.1).

2.4 Deep Encoder, Shallow Decoder

In order to further reduce the decoder computation, we decrease the number of decoder layers. In line with the work done by Kasai et al. (2020), we increase the number of encoder layers to maintain the same model capacity. We explore the speed-accuracy trade-off while varying the depth of both components in Section 3.2, and find that using 12 encoder layers and 1 decoder layer gives a significant speedup without losing translation quality.
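
As an illustration (not the fairseq configuration used in our experiments), the following PyTorch sketch instantiates a Transformer-base-sized model with 12 encoder layers and a single decoder layer. Since the encoder runs once per source sentence while the decoder layer runs at every generated token, shifting capacity to the encoder keeps model size roughly constant while cutting per-token decoding cost.

```python
import torch
import torch.nn as nn

# Deep-encoder, shallow-decoder Transformer with base dimensions
# (512 model size, 8 heads, 2048 FFN); illustrative only.
d_model, n_heads, ffn_dim = 512, 8, 2048

encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, ffn_dim, batch_first=True)
decoder_layer = nn.TransformerDecoderLayer(d_model, n_heads, ffn_dim, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)  # deep encoder
decoder = nn.TransformerDecoder(decoder_layer, num_layers=1)   # shallow decoder

# Dummy embedded inputs: batch of 2, source length 7, target length 5.
src = torch.randn(2, 7, d_model)
tgt = torch.randn(2, 5, d_model)
memory = encoder(src)        # run once per sentence, parallel over positions
out = decoder(tgt, memory)   # the single decoder layer dominates per-step cost
print(out.shape)             # torch.Size([2, 5, 512])
```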

2.5 Pruning Attention Heads

Adopting a deep encoder, shallow decoder architecture achieves a good speed-quality tradeoff; however, it increases the number of parameters in the encoder. To further improve efficiency and reduce parameters, we apply the multi-head attention pruning proposed by Voita et al. (2019) to our architecture. The output of each head h_i across all attention layers is multiplied by a learnable gate g_i before it is passed to subsequent layers of the network. To switch off less informative heads (i.e. g_i = 0), we apply L0 regularization to the gates. The L0 norm is the number of non-zero gates across the model. However, because of the non-differentiable property of the L0 norm, a differentiable approximation is used. Each gate g_i is modeled as a random variable sampled from a Hard Concrete distribution (Louizos et al., 2018) parameterized by φ_i, and takes values in the range [0, 1]. We then minimize the differentiable approximation of the L0 regularization loss, L_c:

L_c(φ) = Σ_{i=1}^{h} (1 − P(g_i = 0 | φ_i)),   (3)

where h denotes the total number of heads, φ is the set of gate parameters, and P(g_i = 0 | φ_i) is computed according to the Hard Concrete distribution. The model is initially trained with the standard cross-entropy loss L_xent and then fine-tuned with the additional regularization loss as follows:

L(θ, φ) = L_xent(θ, φ) + λ L_c(φ),   (4)

where θ denotes the set of original model parameters, and λ is a hyperparameter which controls how aggressively the attention heads are pruned. During inference time, all heads h_j where P(g_j = 0 | φ_j) = 1 are completely removed from the network. Our experiments in Section 3.3 show that we can effectively prune out a large portion of redundant self-attention heads from the deep encoder.
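
The sketch below illustrates this gating with the Hard Concrete relaxation, assuming the commonly used stretch limits γ = −0.1, ζ = 1.1 and temperature 2/3 from Louizos et al. (2018). Class and variable names are illustrative, and the cross-entropy term is replaced by a stand-in, so this is not the implementation used in our experiments.

```python
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """One relaxed L0 gate per attention head (Louizos et al., 2018).

    Assumes stretch limits gamma=-0.1, zeta=1.1 and temperature 2/3;
    log_alpha plays the role of the gate parameters phi in Eq. (3)/(4).
    """

    def __init__(self, n_heads, gamma=-0.1, zeta=1.1, temperature=2.0 / 3.0):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(n_heads))
        self.gamma, self.zeta, self.temperature = gamma, zeta, temperature

    def forward(self):
        if self.training:
            # Reparameterized sample from the Hard Concrete distribution.
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.temperature)
        else:
            # Deterministic gate at inference; heads with P(g=0)=1 can be dropped.
            s = torch.sigmoid(self.log_alpha)
        stretched = s * (self.zeta - self.gamma) + self.gamma
        return stretched.clamp(0.0, 1.0)  # gate values g_i in [0, 1]

    def l0_penalty(self):
        # Eq. (3): sum_i (1 - P(g_i = 0 | phi_i)) = sum_i P(g_i != 0 | phi_i).
        shift = self.temperature * torch.log(torch.tensor(-self.gamma / self.zeta))
        return torch.sigmoid(self.log_alpha - shift).sum()

# Gating the per-head outputs of one attention layer (batch, time, heads, dim).
gates = HardConcreteGate(n_heads=8)
head_outputs = torch.randn(2, 5, 8, 64)
gated_outputs = head_outputs * gates()[None, None, :, None]  # g_i scales head h_i

lam = 0.01                                    # lambda in Eq. (4), illustrative value
xent_stand_in = gated_outputs.pow(2).mean()   # stand-in for the cross-entropy term
loss = xent_stand_in + lam * gates.l0_penalty()
loss.backward()
```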

3 Experiments

We use the Transformer base model (Vaswani et al., 2017) trained on teacher-decoded data as our baseline. All the described methods are stacked on top of this baseline model. Following Kim et al. (2019), we use 4 million bitext sentence pairs from the WMT'14 English-German news translation task. All sentences are encoded with 32K subword units using SentencePiece (Kudo and Richardson, 2018). We report BLEU on newstest2014 in all the experiments and use newstest2015 for the final evaluation in Section 3.4.

All experiments are implemented in fairseq (Ott et al., 2019). The configuration of teacher-student training follows the settings in Kim et al. (2019). We use an effective batch size of 458k words and 16 GPUs for training. The Adam optimizer is applied with β = (0.9, 0.98). We use label smoothing with ε = 0.1 and an inverse square root learning rate schedule with 2500 warmup steps and a peak learning rate of 0.0007. The models are trained for 50k updates, except for the models with pruning, where additional fine-tuning with 100-150k updates is applied. We use a beam size of 5 during inference.
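
For reference, a plain PyTorch sketch of these optimization settings is given below (our experiments use the corresponding fairseq options); the linear module stands in for the actual Transformer.

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the Transformer student

peak_lr, warmup = 7e-4, 2500
optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr, betas=(0.9, 0.98))

# Inverse square root schedule: linear warmup to peak_lr over 2500 steps,
# then decay proportional to 1/sqrt(step).
def inv_sqrt(step):
    step = max(step, 1)
    return min(step / warmup, (warmup / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inv_sqrt)

# Label smoothing with epsilon = 0.1.
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)

for step in range(5):  # training loop skeleton with dummy data
    optimizer.zero_grad()
    logits = model(torch.randn(8, 512))
    loss = criterion(logits, torch.randint(0, 512, (8,)))
    loss.backward()
    optimizer.step()
    scheduler.step()
```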

We evaluate the inference speed with a batch size of 128 sentences on GPU and a batch size of 1 on CPU, and report speed in words per second (wps), averaged over 10 decoding runs.

Hardware: We evaluate our performance on 1 GPU (NVIDIA Tesla V100-SXM2-32GB) and 1 CPU core (Intel Xeon E5-2640 v4 @ 2.40GHz).
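
A minimal sketch of this measurement is shown below; the decode callable and the toy test set are placeholders for an actual translation model and evaluation data.

```python
import time

def words_per_second(decode, test_set, runs=10):
    """Average decoding throughput in words per second over `runs` runs."""
    wps_per_run = []
    for _ in range(runs):
        start = time.perf_counter()
        hypotheses = decode(test_set)          # translate the whole test set
        elapsed = time.perf_counter() - start
        n_words = sum(len(h.split()) for h in hypotheses)
        wps_per_run.append(n_words / elapsed)
    return sum(wps_per_run) / len(wps_per_run)

# Example with a trivial stand-in "translator" over a 128-sentence batch:
print(words_per_second(lambda batch: [s.upper() for s in batch],
                       ["ein kleiner test satz"] * 128))
```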

3.1 Replacing Self-Attention with RNN

From Table 1, we can observe that replacing the self-attention with lightweight recurrent units gives significant speed improvements (18-25%) without any impact on BLEU score.

Removing the feed-forward network in the decoder leads to an additional 10-13% speedup for both AAN and SSRU, but results in 0.9 BLEU degradation for AAN. Therefore, we use SSRU as our main architecture in further experiments.

Model          BLEU   wps    speedup
Baseline       28.9   4510   -
AAN            28.9   5323   18%
SSRU           28.7   5629   25%
AAN w/o ffn    28.0   5915   31%
SSRU w/o ffn   28.5   6079   35%

Table 1: Results of replacing self-attention with lightweight recurrent units and removing the feed-forward network (ffn) in the decoder. Decoding on a GPU with batch-size 128.

Model          attention heads (enc/enc-dec/dec)   BLEU
Baseline       96/8/8                              29.2
 + pruned      22/7/8                              29.0
SSRU w/o ffn   96/8/-                              28.9
 + pruned      18/8/-                              28.6

Table 2: Head pruning through L0 regularization on the 12-1 layer (encoder-decoder) structure. The (enc/enc-dec/dec) refers to the total number of attention heads in encoder self-attention, encoder-decoder attention and decoder self-attention respectively.
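
The speedup column in Table 1 follows directly from the wps column relative to the baseline, e.g.:

```python
baseline_wps = 4510
for name, wps in [("AAN", 5323), ("SSRU", 5629),
                  ("AAN w/o ffn", 5915), ("SSRU w/o ffn", 6079)]:
    print(f"{name}: {wps / baseline_wps - 1:.0%}")  # 18%, 25%, 31%, 35%
```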

Acknowledgments

We would like to thank Andrew Finch, Stephan Peitz, Udhay Nallasamy, Matthias Paulik and Russ Webb for their helpful comments and reviews. Many thanks to the rest of the Machine Translation Team for interesting discussions and support.

References

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium.

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems 32, pages 103–112, Vancouver, Canada.

Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah A. Smith. 2020. Deep encoder, shallow decoder: Reevaluating the speed-quality tradeoff in machine translation. arXiv preprint arXiv:2006.10369.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.

Young Jin Kim, Marcin Junczys-Dowmunt, Hany Hassan, Alham Fikri Aji, Kenneth Heafield, Roman Grundkiewicz, and Nikolay Bogoychev. 2019. From research to production and back: Ludicrously fast neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 280–288, Hong Kong. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium.

Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4470–4481, Brussels, Belgium.

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.

Christos Louizos, Max Welling, and Diederik P. Kingma. 2018. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, Vancouver, Canada.

Antonio Valerio Miceli Barone, Jindřich Helcl, Rico Sennrich, Barry Haddow, and Alexandra Birch. 2017. Deep architectures for neural machine translation. In Proceedings of the Second Conference on Machine Translation, pages 99–107, Copenhagen, Denmark. Association for Computational Linguistics.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems 32, pages 14014–14024, Vancouver, Canada.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, Minneapolis, USA.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010, Red Hook, NY, USA. Curran Associates Inc.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019a. Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1810–1822, Florence, Italy. Association for Computational Linguistics.

Yiren Wang, Yingce Xia, Tianyu He, Fei Tian, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. 2019b. Multi-agent dual learning. In International Conference on Learning Representations, New Orleans, Louisiana, United States.
Biao Zhang, Deyi Xiong, and Jinsong Su. 2018. Accelerating neural transformer via an average attention network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia. Association for Computational Linguistics.

Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tieyan Liu. 2020. Incorporating BERT into neural machine translation. In International Conference on Learning Representations, Addis Ababa, Ethiopia.