Efficient Inference For Neural Machine Translation

Yi-Te Hsu1*, Sarthak Garg2, Yi-Hsiu Liao2, Ilya Chatsviorkin2
1 Johns Hopkins University, 2 Apple Inc.
[email protected], {sarthak garg, yihsiu liao, ilych}@apple.com
* Work done during an internship at Apple Inc.

Abstract

Large Transformer models have achieved state-of-the-art results in neural machine translation and have become standard in the field. In this work, we look for the optimal combination of known techniques to optimize inference speed without sacrificing translation quality. We conduct an empirical study that stacks various approaches and demonstrates that a combination of replacing decoder self-attention with simplified recurrent units, adopting a deep encoder and shallow decoder architecture, and multi-head attention pruning can achieve up to 109% and 84% speedup on CPU and GPU respectively and reduce the number of parameters by 25%, while maintaining the same translation quality in terms of BLEU.

1 Introduction and Related Work

Transformer models (Vaswani et al., 2017) have outperformed previously used RNN models and traditional statistical MT techniques. This improvement, though, comes at the cost of higher computational complexity. The decoder computation often remains the bottleneck due to its autoregressive nature, large depth and self-attention structure.

There has been a recent trend towards making models larger and ensembling multiple models to achieve the best possible translation quality (Lepikhin et al., 2020; Huang et al., 2019). Leading solutions on common benchmarks (Zhu et al., 2020; Brown et al., 2020) usually use an ensemble of Transformer big models, which combined can have more than 1 billion parameters.

Previous works suggest replacing the expensive self-attention layer in the decoder with simpler alternatives like the Average Attention Network (AAN) (Zhang et al., 2018), the Simple Recurrent Unit (SRU) (Lei et al., 2018) and the Simpler Simple Recurrent Unit (SSRU) (Kim et al., 2019). The AAN is a simpler version of the self-attention layer which places equal attention weights on all previously decoded words instead of dynamically computing them. SRU and SSRU are lightweight recurrent networks, with the SSRU consisting of only 2 matrix multiplications per decoded token.

Because of the autoregressive property of the decoder in a standard Transformer model, reducing computation cost in the decoder is much more important than in the encoder. Recent publications (Miceli Barone et al., 2017; Wang et al., 2019a; Kasai et al., 2020) thus suggest that a deep encoder, shallow decoder architecture can speed up inference while maintaining a similar BLEU score.

Another line of research focuses on model pruning techniques to make NMT models smaller and more efficient. In this paper, we only explore structured pruning methods, in which smaller components of the network are pruned away. Applications of structured pruning to NMT include the works of Voita et al. (2019) and Michel et al. (2019), which show that most of the attention heads in the network learn redundant information and can be pruned. Michel et al. (2019) proposed pruning heads based on head importance scoring, while Voita et al. (2019) use a relaxation of L0 regularization (Louizos et al., 2018) to prune the attention heads.

All of the above-mentioned methods use the vanilla Transformer architecture as their baseline, so it is not clear whether these approaches give complementary results when combined. In this work, we explore and benchmark combining all of the above techniques, with the goal of maximizing inference speed without hurting translation quality. After carefully stacking the approaches, our proposed architecture achieves a significant speed improvement of 84% on GPU and 109% on CPU without any degradation of translation quality in terms of BLEU.

2 Efficient Inference for Neural Machine Translation

This section presents the proposed efficient inference architecture for neural machine translation. First, we outline the overall procedure of building an efficient inference architecture. Then, we detail each step in the process.

Figure 1: Efficient Transformer architecture.

First, we use sequence-level knowledge distillation (Kim and Rush, 2016) to transfer knowledge from a strong teacher model to a smaller student model. This approach allows the student model to learn from a simpler target distribution and therefore enables us to use a simpler architecture.

Then, to simplify the decoder of the student model, the self-attention mechanism is replaced by lightweight recurrent units (Kim et al., 2019), and the feed-forward network is removed. To further reduce the decoder computation, we adopt the deep encoder, shallow decoder architecture (Kasai et al., 2020). Lastly, we prune redundant attention heads through L0 regularization (Voita et al., 2019). Each architecture modification is performed by retraining the student model. Figure 1 shows the proposed efficient Transformer architecture.

2.1 Teacher-student Training

We follow the procedure described in Kim et al. (2019) to train an ensemble of 8 Transformer-big models, 4 in the forward and 4 in the reverse direction, as the first round of teacher models (T). Without the help of extra monolingual corpora, we apply multi-agent dual learning (MADL) (Wang et al., 2019b) to train another 8 Transformer-big teacher models (T-MADL) on bitext re-decoded with the ensemble of teacher models (T) in both directions. Then we use noisy backward-forward translation (Edunov et al., 2018) with the T-MADL model to, again, re-decode the original bitext, but with more variance on the source side. Finally, we use the synthetic data generated above along with the original bitext to train our student model.

We use interpolated sequence-level knowledge distillation (Kim and Rush, 2016) in most of the described re-decoding runs, except the noisy backward-forward translation, where sampling is used in the reverse direction. More details about model training and architecture can be found in Kim et al. (2019).
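As a concrete illustration of the re-decoding step, the sketch below generates sequence-level distillation targets by replacing each reference translation with a teacher's beam output. It uses fairseq's public torch.hub interface with a publicly released WMT'19 checkpoint as a stand-in teacher; the checkpoint, file names and beam size are illustrative assumptions, not the teacher ensemble described in this paper.

```python
# Minimal sketch of sequence-level knowledge distillation data generation.
# The hub checkpoint, file names and beam size are illustrative placeholders,
# not the 8-model teacher ensemble used in this paper.
import torch

teacher = torch.hub.load(
    "pytorch/fairseq",
    "transformer.wmt19.en-de",        # any pretrained en-de teacher works here
    checkpoint_file="model1.pt",
    tokenizer="moses",
    bpe="fastbpe",
)
teacher.eval()

with open("train.en") as src, open("train.distilled.de", "w") as out:
    for line in src:
        # Replace the reference target with the teacher's beam output so the
        # student learns from a simpler target distribution.
        out.write(teacher.translate(line.strip(), beam=5) + "\n")
```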

2.2 Replacing Self-attention with Lightweight Recurrent Units

Inspired by Kim et al. (2019), we replace the decoder self-attention with an RNN, reducing its time complexity from O(N^2) to O(N), where N is the length of the output sentence. We compare replacing self-attention with two lightweight layers, SSRU and AAN, in Section 3.1. The SSRU layer is as follows:

    f_t = σ(W_f x_t + b_f)
    c_t = f_t ⊙ c_{t-1} + (1 − f_t) ⊙ W x_t        (1)
    o_t = ReLU(c_t)

where ⊙ is element-wise multiplication, and x_t, o_t, f_t and c_t are the input, output, forget gate and cell state, respectively. We optimized the SSRU by combining the two matrix multiplications, W_f x_t and W x_t, into one. We find this simple trick can improve speed by 6% on GPU.
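A minimal PyTorch sketch of such an SSRU layer is shown below. It is our own illustrative implementation rather than the module used in the experiments, but it reflects Eq. (1) and the fused projection trick: the gate projection W_f x_t and the cell candidate W x_t come out of a single matrix multiplication.

```python
import torch
import torch.nn as nn

class SSRU(nn.Module):
    """Simpler Simple Recurrent Unit: a sketch of Eq. (1), with the two input
    projections (forget gate and cell candidate) fused into one matmul."""

    def __init__(self, d_model: int):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)  # [W_f x_t ; W x_t] in one shot

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), decoded left to right
        gate_in, cand = self.in_proj(x).chunk(2, dim=-1)
        f = torch.sigmoid(gate_in)                # f_t = sigma(W_f x_t + b_f)
        c = torch.zeros_like(x[:, 0])             # c_0 = 0
        outputs = []
        for t in range(x.size(1)):                # O(N) recurrence over positions
            c = f[:, t] * c + (1.0 - f[:, t]) * cand[:, t]
            outputs.append(torch.relu(c))         # o_t = ReLU(c_t)
        return torch.stack(outputs, dim=1)

# usage sketch
layer = SSRU(d_model=512)
y = layer(torch.randn(2, 7, 512))                 # (batch, seq_len, d_model)
```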
For the AAN, we found that removing the gating layer does not degrade the translation quality while reducing the computation. In our experiments, we use the following implementation of the AAN (without a gating layer):

    o_t = FFN( (1/t) Σ_{k=1}^{t} x_k )        (2)

where FFN(·) is a position-wise two-layer feed-forward network, and t, o_t and x_k denote the current position, the output at position t and the input at position k, respectively.
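A gate-free AAN layer following Eq. (2) can be sketched as follows (again an illustrative implementation under our own assumptions, not the exact module used in the experiments); the running average over previous positions is computed with a cumulative sum.

```python
import torch
import torch.nn as nn

class GatelessAAN(nn.Module):
    """Average Attention Network without the gating layer, as in Eq. (2):
    o_t = FFN(mean(x_1, ..., x_t))."""

    def __init__(self, d_model: int, d_ff: int = 2048):
        super().__init__()
        self.ffn = nn.Sequential(                 # position-wise two-layer FFN
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        t = torch.arange(1, x.size(1) + 1, device=x.device).view(1, -1, 1)
        cum_avg = x.cumsum(dim=1) / t             # (1/t) * sum_{k<=t} x_k
        return self.ffn(cum_avg)

# usage sketch
aan = GatelessAAN(d_model=512)
y = aan(torch.randn(2, 7, 512))
```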
2.3 Removing the Feed-forward Layer

Each decoder layer consists of a lightweight recurrent unit, followed by an encoder-decoder multi-head attention component and a pointwise feed-forward layer. The feed-forward sub-layer is responsible for 33% of the parameters within the 6-layer decoder; however, we found that it can be removed entirely from the decoder without hurting the translation quality with our implementation of the SSRU (Section 3.1).
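Reusing the SSRU sketch from Section 2.2, the resulting slimmed-down decoder layer can be illustrated as below. This is a sketch under our own assumptions (residual connections and layer norm placement are not spelled out in the paper), not the exact fairseq layer used in the experiments.

```python
import torch
import torch.nn as nn

class SlimDecoderLayer(nn.Module):
    """Decoder layer with the feed-forward sub-layer removed: a lightweight
    recurrent unit followed by encoder-decoder attention. Residual connections
    and layer norm placement are assumptions, not taken from the paper."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.ssru = SSRU(d_model)                  # SSRU class from the sketch above
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
        x = self.norm1(x + self.ssru(x))           # recurrent unit replaces self-attention
        attn_out, _ = self.cross_attn(x, enc_out, enc_out)
        return self.norm2(x + attn_out)            # note: no feed-forward sub-layer

# usage sketch
layer = SlimDecoderLayer()
y = layer(torch.randn(2, 7, 512), torch.randn(2, 11, 512))
```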
2.4 Deep Encoder, Shallow Decoder

In order to further reduce the decoder computation, we decrease the number of decoder layers. In line with the work of Kasai et al. (2020), to maintain the same model capacity, we increase the number of encoder layers. We explore the speed-accuracy trade-off while varying the depth of both components in Section 3.2, and find that using 12 encoder layers and 1 decoder layer gives a significant speedup without losing translation quality.
2.5 Pruning Attention Heads

Adopting a deep encoder, shallow decoder architecture achieves a good speed-quality trade-off; however, it increases the number of parameters in the encoder. To further improve efficiency and reduce parameters, we apply the multi-head attention pruning proposed by Voita et al. (2019) to our architecture. The output of each head h_i across all attention layers is multiplied by a learnable gate g_i before it is passed to subsequent layers of the network. To switch off less informative heads (i.e. g_i = 0), we apply L0 regularization to the gates. The L0 norm is the number of non-zero gates across the model. However, because of the non-differentiable property of the L0 norm, a differentiable approximation is used. Each gate g_i is modeled as a random variable sampled from a Hard Concrete distribution (Louizos et al., 2018) parameterized by φ_i, and takes values in the range [0, 1]. We then minimize the differentiable approximation of the L0 regularization loss, L_c:

    L_c(φ) = Σ_{i=1}^{h} (1 − P(g_i = 0 | φ_i))        (3)

where h denotes the total number of heads, φ is the set of gate parameters, and P(g_i = 0 | φ_i) is computed according to the Hard Concrete distribution. The model is initially trained with the standard cross-entropy loss L_xent and then fine-tuned with the additional regularization loss as follows:

    L(θ, φ) = L_xent(θ, φ) + λ L_c(φ)        (4)

where θ denotes the set of original model parameters, and λ is a hyperparameter which controls how aggressively the attention heads are pruned. During inference, all heads h_j with P(g_j = 0 | φ_j) = 1 are completely removed from the network. Our experiments in Section 3.3 show that we can effectively prune out a large portion of redundant self-attention heads from the deep encoder.
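The sketch below illustrates this gating mechanism: one learnable hard concrete gate per head, a stochastic sample during training, and the expected-L0 penalty of Eq. (3) added to the loss as in Eq. (4). The temperature and stretch constants are the commonly used defaults from Louizos et al. (2018), and the λ value is a hypothetical choice; neither is taken from this paper.

```python
import math
import torch
import torch.nn as nn

class HeadGates(nn.Module):
    """Learnable hard concrete gates over attention heads (Louizos et al., 2018).
    The constants (beta, gamma, zeta) are common defaults, assumed here."""

    def __init__(self, n_heads: int, beta: float = 2.0 / 3.0,
                 gamma: float = -0.1, zeta: float = 1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(n_heads))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self) -> torch.Tensor:
        if self.training:
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        # Stretch to (gamma, zeta) and clamp to [0, 1]; exact zeros switch heads off.
        return torch.clamp(s * (self.zeta - self.gamma) + self.gamma, 0.0, 1.0)

    def l0_penalty(self) -> torch.Tensor:
        # Eq. (3): expected number of non-zero gates, sum_i (1 - P(g_i = 0 | phi_i)).
        return torch.sigmoid(
            self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        ).sum()

# usage sketch: scale each head's output by its gate, add lambda * L_c to the loss
gates = HeadGates(n_heads=96)
head_outputs = torch.randn(96, 2, 7, 64)            # (heads, batch, seq, head_dim)
gated = gates().view(-1, 1, 1, 1) * head_outputs    # pruned heads contribute nothing
loss_l0 = 0.05 * gates.l0_penalty()                 # Eq. (4), with a hypothetical lambda
```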
3 Experiments

We use the Transformer base model (Vaswani et al., 2017) trained on teacher-decoded data as our baseline. All the described methods are stacked on top of this baseline model. Following Kim et al. (2019), we use 4 million bitext sentence pairs from the WMT'14 English-German news translation task. All sentences are encoded with 32K subword units using SentencePiece (Kudo and Richardson, 2018). We report BLEU on newstest2014 in all experiments and use newstest2015 for the final evaluation in Section 3.4.

All experiments are implemented in fairseq (Ott et al., 2019). The configuration of teacher-student training follows the settings in Kim et al. (2019). We use an effective batch size of 458k words and 16 GPUs for training. The Adam optimizer is used with β = (0.9, 0.98). We use label smoothing with ε = 0.1 and an inverse square root learning rate schedule with 2500 warmup steps and a peak learning rate of 0.0007. The models are trained for 50k updates, except for the models with pruning, where additional fine-tuning with 100-150k updates is applied. We use a beam size of 5 during inference. We evaluate the inference speed with a batch size of 128 sentences on GPU and a batch size of 1 on CPU, and report speed in words per second (wps), averaged over 10 decoding runs.

Hardware: We evaluate our performance on 1 GPU (NVIDIA Tesla V100-SXM2-32GB) and 1 CPU core (Intel Xeon E5-2640 v4 @ 2.40GHz).
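The wps numbers can be reproduced with a timing loop of the following form. This is a generic sketch: `decode` stands for whatever decoding routine is being benchmarked (for example a wrapper around the model's beam search) and is not a specific fairseq call.

```python
import time
from statistics import mean

def words_per_second(decode, batches, n_runs: int = 10) -> float:
    """Average decoding speed in words per second over n_runs decoding runs.
    `decode` maps a batch of source sentences to a list of translated strings."""
    speeds = []
    for _ in range(n_runs):
        start, n_words = time.perf_counter(), 0
        for batch in batches:
            for hyp in decode(batch):
                n_words += len(hyp.split())
        speeds.append(n_words / (time.perf_counter() - start))
    return mean(speeds)

# usage sketch with a trivial stand-in for the real decoder
dummy_decode = lambda batch: [s.upper() for s in batch]
print(words_per_second(dummy_decode, [["ein test satz"] * 128], n_runs=10))
```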
3.1 Replacing Self-Attention with RNN

From Table 1, we can observe that replacing the self-attention with lightweight recurrent units gives significant speed improvements (18-25%) without any impact on the BLEU score.

Removing the feed-forward network in the decoder leads to an additional 10-13% speedup for both AAN and SSRU, but results in a 0.9 BLEU degradation for AAN. Therefore, we use the SSRU as our main architecture in further experiments.

              BLEU   wps    speedup
Baseline      28.9   4510   -
AAN           28.9   5323   18%
SSRU          28.7   5629   25%
AAN w/o ffn   28.0   5915   31%
SSRU w/o ffn  28.5   6079   35%

Table 1: Results of replacing self-attention with lightweight recurrent units and removing the feed-forward network (ffn) in the decoder. Decoding on a GPU with batch size 128.

3.2 Number of Layers

We evaluate different combinations of depths in the encoder and decoder. In the decoder, the self-attention mechanism is replaced by the SSRU, and the feed-forward network is removed.

Figure 2: Translation quality vs. inference speed trade-off over different numbers of encoder and decoder layers.

From Figure 2, removing one decoder layer at a time from the baseline model increases wps by 10% at the cost of some BLEU degradation, since model capacity goes down. As we increase the number of encoder layers to 12 or more, we observe up to a 45% speedup and a better BLEU score, but a higher number of parameters than the original 6-6 structure.

3.3 Pruning Attention Heads

Pruning allows us to remove up to 75% of the attention heads with only slight BLEU degradation. We observe from the remaining heads that for the pruned baseline (22/7/8) model, the self-attention heads are more important in the deeper layers rather than the lower layers. On the other hand, in our best configuration (SSRU 18/8/-), there is no clear pattern among the remaining heads.

               attention heads
               (enc/enc-dec/dec)   BLEU
Baseline       96/8/8              29.2
+ pruned       22/7/8              29.0
SSRU w/o ffn   96/8/-              28.9
+ pruned       18/8/-              28.6

Table 2: Head pruning through L0 regularization on the 12-1 layer (encoder-decoder) structure. (enc/enc-dec/dec) refers to the total number of attention heads in encoder self-attention, encoder-decoder attention and decoder self-attention, respectively.

3.4 Combined Results

We combine all of the methods and evaluate our model on the newstest2015 test set.

                 BLEU   speedup (GPU/CPU)   #params
Baseline         31.1   -                   61M
SSRU             31.1   14/12%              57M
+ Remove ffn     31.0   28/49%              45M
+ 12-1           31.5   82/103%             56M
+ Prune heads    31.4   84/109%             46M

Table 3: Decoding on a GPU with batch size 128, and a single CPU core with batch size 1. [12-1] refers to the number of layers in the encoder and the decoder.

Table 3 shows that by using all of the techniques in combination, the model achieves 84% and 109% speed improvement on GPU and CPU, respectively, compared to the baseline model (Transformer base). Only 25% of the heads remain in the deep encoder after pruning, and the total number of parameters is 25% lower.

4 Conclusion

In this paper we explored a combination of techniques aimed at improving inference speed, which led to the discovery of a very efficient architecture. The best architecture has a deep 12-layer encoder and a shallow decoder with only a single lightweight recurrent unit layer and one encoder-decoder attention mechanism. 75% of the encoder heads were pruned, giving rise to a model with 25% fewer parameters than the baseline Transformer. In terms of inference speed, the proposed architecture is 84% faster on a GPU and 109% faster on a CPU.

Acknowledgments

We would like to thank Andrew Finch, Stephan Peitz, Udhay Nallasamy, Matthias Paulik and Russ Webb for their helpful comments and reviews. Many thanks to the rest of the Machine Translation Team for interesting discussions and support.

References

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium.

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, and Zhifeng Chen. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems 32, pages 103–112, Vancouver, Canada.

Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah A. Smith. 2020. Deep encoder, shallow decoder: Reevaluating the speed-quality tradeoff in machine translation. arXiv preprint arXiv:2006.10369.

Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.

Young Jin Kim, Marcin Junczys-Dowmunt, Hany Hassan, Alham Fikri Aji, Kenneth Heafield, Roman Grundkiewicz, and Nikolay Bogoychev. 2019. From research to production and back: Ludicrously fast neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 280–288, Hong Kong. Association for Computational Linguistics.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, Brussels, Belgium.

Tao Lei, Yu Zhang, Sida I. Wang, Hui Dai, and Yoav Artzi. 2018. Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4470–4481, Brussels, Belgium.

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668.

Christos Louizos, Max Welling, and Diederik P. Kingma. 2018. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, Vancouver, Canada.

Antonio Valerio Miceli Barone, Jindřich Helcl, Rico Sennrich, Barry Haddow, and Alexandra Birch. 2017. Deep architectures for neural machine translation. In Proceedings of the Second Conference on Machine Translation, pages 99–107, Copenhagen, Denmark. Association for Computational Linguistics.

Paul Michel, Omer Levy, and Graham Neubig. 2019. Are sixteen heads really better than one? In Advances in Neural Information Processing Systems 32, pages 14014–14024, Vancouver, Canada.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, Minneapolis, USA.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010, Red Hook, NY, USA. Curran Associates Inc.

Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy. Association for Computational Linguistics.

Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019a. Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1810–1822, Florence, Italy. Association for Computational Linguistics.

Yiren Wang, Yingce Xia, Tianyu He, Fei Tian, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. 2019b. Multi-agent dual learning. In International Conference on Learning Representations, New Orleans, Louisiana, United States.

Biao Zhang, Deyi Xiong, and Jinsong Su. 2018. Accelerating neural transformer via an average attention network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia. Association for Computational Linguistics.

Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tieyan Liu. 2020. Incorporating BERT into neural machine translation. In International Conference on Learning Representations, Addis Ababa, Ethiopia.
