A Survey of Multilingual Neural Machine Translation: Raj Dabre, Chenhui Chu, Anoop Kunchukuttan
RAJ DABRE, National Institute of Information and Communications Technology (NICT), Kyoto, Japan
CHENHUI CHU, Osaka University, Osaka, Japan
ANOOP KUNCHUKUTTAN, Microsoft, Hyderabad, India
We present a survey on multilingual neural machine translation (MNMT), which has gained a lot of traction
in recent years. MNMT has been useful in improving translation quality as a result of translation knowledge
transfer (transfer learning). MNMT is more promising and interesting than its statistical machine translation
counterpart, because end-to-end modeling and distributed representations open new avenues for research
on machine translation. Many approaches have been proposed to exploit multilingual parallel corpora for
improving translation quality. However, the lack of a comprehensive survey makes it difficult to determine
which approaches are promising and, hence, deserve further exploration. In this article, we present an in-
depth survey of existing literature on MNMT. We first categorize various approaches based on their central
use-case and then further categorize them based on resource scenarios, underlying modeling principles, core-
issues, and challenges. Wherever possible, we address the strengths and weaknesses of several techniques
by comparing them with each other. We also discuss the future directions for MNMT. This article is aimed
towards both beginners and experts in NMT. We hope this article will serve as a starting point as well as a
source of new ideas for researchers and engineers interested in MNMT.
CCS Concepts: • Computing methodologies → Machine translation;
Additional Key Words and Phrases: Neural machine translation, survey, multilingualism, low-resource, zero-
shot, multi-source
ACM Reference format:
Raj Dabre, Chenhui Chu, and Anoop Kunchukuttan. 2020. A Survey of Multilingual Neural Machine Trans-
lation. ACM Comput. Surv. 53, 5, Article 99 (September 2020), 38 pages.
https://doi.org/10.1145/3406095
1 INTRODUCTION
Neural machine translation (NMT) [8, 24, 140] has become the dominant paradigm for MT in aca-
demic research as well as commercial use [161]. NMT has shown state-of-the-art performance
for many language pairs [14, 15]. Its success can be mainly attributed to the use of distributed
1 Using additional languages can help in word sense disambiguation, which can help improve translation quality.
2 Relatively modest increase in model parameters despite an increase in the number of languages, as a result of parameter sharing.
Fig. 1. MNMT research categorized according to use-cases, core-issues, and the challenges involved. Note
that the focus is on use-cases and an approach for one use-case can be naturally adapted to another use-case
and thus can face similar core issues.
Multi-source Translation. Documents that have been translated into more than one language
might, in the future, be required to be translated into another language. In this scenario, existing
multilingual complementary content on the source side can be exploited for multi-source trans-
lation [168]. Multilingual complementary content can help in better disambiguation of content to
be translated, leading to an improvement in translation quality.
We believe that the biggest benefit of doing MNMT research by exploring these scenarios is that
we might gain insights and an answer to an important question in natural language processing
(NLP):
Q. How can we leverage multilingual data effectively to learn distributions across multiple lan-
guages to improve MT (NLP) performance across all languages?
This question can be decomposed into a number of smaller questions, and in this article, we try
to answer two of them as follows:
Q1. Is it possible to have a one-model-for-all-languages solution to MT (NLP) applications?
Q2. Can shared multilingual distributed representations help MT (NLP) for low-resource languages?
Given these benefits, scenarios, and the tremendous increase in the work on MNMT in recent
years, we write this survey article on MNMT to systematically organize the work in this area. To
the best of our knowledge, no such comprehensive survey on MNMT exists. Our goal is to shed
light on various MNMT scenarios, fundamental questions in MNMT, basic principles, architectures,
and datasets for MNMT systems. The remainder of this article is structured as follows: We present
a systematic categorization of different approaches to MNMT in each of the above-mentioned sce-
narios to help understand the array of design choices available while building MNMT systems
(Sections 3, 4, 5, and 6). The top-level organization of the survey is use-case-scenario-based: mul-
tiway NMT (Section 3), low-resource NMT (Sections 4 and 5), and multi-source NMT (Section 6).
where m is the number of words in y, y_j is the currently generated word, and y_{<j} are the previ-
ously generated words. At inference time, beam search is typically used to find the translation
that maximizes the above probability.
The most commonly used NMT approach is the Embed - Encode - Attend - Decode paradigm.
Figure 2 shows an overview of this paradigm. The encoder first converts words in the source sen-
tence into word embeddings. These word embeddings are then processed by neural layers and
converted to representations that capture contextual information about these words. We call these contextual representations the encoder representations. The decoder uses an attention mecha-
nism, the encoder representations, and previously generated words to generate what we call the
decoder representations (states), which in turn are used to generate the next target word. The
encoder and decoder can be RNN [8], CNN [52], or self-attention and feed-forward [148] layers.
Among these, the self-attention layers are the most widely used. It is a common practice to stack
multiple layers, which leads to an improvement in translation quality. The attention mechanism
is calculated between the decoder and encoder as:
e_{ji} = a(s_{j-1}, h_i), (2)
a_{ji} = \frac{\exp(e_{ji})}{\sum_{k=1}^{m} \exp(e_{ki})},
where e_{ji} is an alignment score, a is an alignment model that scores the match level of the inputs around position i and the output at position j, s_{j-1} is the decoder hidden state of the previously generated word, and h_i is the encoder hidden state at position i. The calculated attention vector is then used to weight the encoder hidden states to obtain a context vector as:
c_j = \sum_{i=1}^{n} a_{ji} h_i.
This context vector is fed to the decoder along with the previously generated word and its hidden state to produce a representation for generating the current word. The decoder hidden state for the current word, s_j, is computed by:
s_j = g(s_{j-1}, y_{j-1}, c_j),
where g is the decoder activation function, s_{j-1} is the previous decoder hidden state, and y_{j-1} is the embedding of the previous word. The current decoder hidden state s_j, the previous word embedding, and the context vector are fed to a feed-forward layer f and a softmax layer to compute a score for generating a target word as output:
P(y_j | y_{<j}, x) = softmax(f(s_j, y_{j-1}, c_j)).
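To make the attention computation above concrete, the following sketch implements additive (Bahdanau-style) attention and a single decoder step in PyTorch. The module names, dimensions, and the choice of a GRU cell are our own illustrative assumptions and are not tied to any specific system described in this survey.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Computes a(s_{j-1}, h_i): scores each encoder state against the previous decoder state."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, enc_states):
        # enc_states: (batch, src_len, enc_dim); s_prev: (batch, dec_dim)
        scores = self.v(torch.tanh(self.W_h(enc_states) + self.W_s(s_prev).unsqueeze(1)))
        a = F.softmax(scores.squeeze(-1), dim=-1)              # attention weights a_{ji}
        c = torch.bmm(a.unsqueeze(1), enc_states).squeeze(1)   # c_j = sum_i a_{ji} h_i
        return c, a

class DecoderStep(nn.Module):
    """One decoder step: s_j = g(s_{j-1}, y_{j-1}, c_j); P(y_j | y_{<j}, x) = softmax(f(s_j, y_{j-1}, c_j))."""
    def __init__(self, vocab_size, emb_dim, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attn = AdditiveAttention(enc_dim, dec_dim, attn_dim)
        self.cell = nn.GRUCell(emb_dim + enc_dim, dec_dim)             # plays the role of g
        self.out = nn.Linear(dec_dim + emb_dim + enc_dim, vocab_size)  # plays the role of f

    def forward(self, y_prev, s_prev, enc_states):
        e = self.embed(y_prev)                # embedding of the previously generated word
        c, a = self.attn(s_prev, enc_states)  # context vector c_j
        s = self.cell(torch.cat([e, c], dim=-1), s_prev)
        probs = F.softmax(self.out(torch.cat([s, e, c], dim=-1)), dim=-1)
        return probs, s, a
```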
Training NMT models. The parallel corpus used to train the NMT model is first subjected to
pre-processing, where it is sufficiently cleaned to remove noisy training examples. A vocabulary
of the N most frequent words is then created and the remaining words are treated as unknown
words mapped to a single token designated by “UNK.” To overcome the problem of unknown
words, the most common practice involves subword tokenization using methods such as byte-pair
encoding (BPE) [131], word-piece model (WPM) [127], or sentence-piece model (SPM) [82]. This
enables the use of an open vocabulary. To train an NMT model, we typically minimize the cross-
entropy3 (loss) between the predicted target words and the actual target words in the reference.
This loss minimization is an optimization problem and gradient descent methods such as SGD,
ADAM, ADAGRAD, Adafactor, and so on, can be used [122]. ADAM is widely used in MT because it trains models quickly, but it may fail to converge sufficiently; SGD, in contrast, converges better but requires long training times. Designing a learning schedule
that combines several optimizers can help train a model with high performance [161]. However,
following the popular work on the Transformer model [148], most recent research focuses on us-
ing a single optimizer with a learning rate schedule where the learning rate is altered depending
on the training iteration. Training is either done for a large number of iterations or till the model
converges sufficiently. Typically, a model is said to converge when its evaluation on a development
set does not change by a significant amount over several iterations. We refer readers to previous
works on NMT regarding this topic. Another consideration during training is the tuning of hyper-
parameters such as learning rate, hidden dimension size, number of layers, and so on. Researchers
often train a wide variety of models, an approach known as hyperparameter search, and choose
a model with the best performance. People who are familiar with deep learning might notice that
this basic model can be naturally extended to involve multiple language pairs.
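Returning to the learning rate schedule mentioned above, the short sketch below implements the inverse-square-root schedule with linear warmup popularized by the Transformer work [148]. The model dimension and warmup length shown here are common illustrative defaults, not values prescribed by this survey.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Inverse-square-root learning rate with linear warmup.

    The rate increases linearly for `warmup_steps` updates and then
    decays proportionally to 1/sqrt(step).
    """
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak learning rate is reached at the end of warmup and decays afterwards.
print(transformer_lr(1), transformer_lr(4000), transformer_lr(100000))
```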
Decoding NMT models. Once an NMT model has been trained, it can be used to translate a
sentence into another language. This process is also known as translation, decoding, or inference.
3 This also implies maximizing the likelihood of the predicted sequence of words in the target language.
Fig. 3. An overview of multiway NMT from the perspective of the level of sharing, along with the features and limitations of each sharing approach. All types of MNMT models have to deal with complex training issues ranging from batching to language grouping to knowledge distillation. Additionally, it is important to address language divergence and to find the right balance of language-specific and language-agnostic representations.
The simplest decoding algorithm is beam search decoding. Typically, the development set source
sentences are decoded using combinations of beam size and a length penalty (also referred to as
alpha) and the combination that gives the best automatic evaluation metric score is chosen to
decode the test set.
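To illustrate how the beam size and length penalty interact, the sketch below re-ranks finished beam hypotheses with a GNMT-style length penalty. The alpha value and hypothesis format are assumptions made for this example; in practice they are tuned on the development set as described above.

```python
def length_penalty(length, alpha=0.6):
    # GNMT-style penalty: ((5 + |y|) / 6) ** alpha
    return ((5.0 + length) / 6.0) ** alpha

def rank_hypotheses(hypotheses, alpha=0.6):
    """hypotheses: list of (token_list, sum_of_log_probs) produced by beam search."""
    scored = [(tokens, logp / length_penalty(len(tokens), alpha))
              for tokens, logp in hypotheses]
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Sweeping beam sizes and alpha values on the development set and keeping the
# best-scoring combination is the usual tuning procedure.
best_tokens, best_score = rank_hypotheses([(["ein", "Haus", "</s>"], -2.1),
                                           (["ein", "kleines", "Haus", "</s>"], -2.6)])[0]
```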
3 MULTIWAY NMT
The primary goal of MNMT is a model that can support translation between more than one lan-
guage pair. Refer to Figure 3 for an overview of the multiway NMT paradigm. We use the term
multiway NMT models to denote such models. Formally, a single model can support translation
for L language pairs (src_l, tgt_l) ∈ L, l = 1 to L, where L ⊂ S × T, and S and T are sets of X source and Y target languages, respectively. S and T need not be mutually exclusive. Parallel corpora are available for all of these L language pairs as C(src_l) and C(tgt_l). Note that our objective in this
specific scenario is to train a translation system between all language pairs. Refer to Figure 4 for
the two prototypical MNMT approaches with minimal and complete sharing of components. Most
existing works are variations of these models. Particularly, one-to-many [42], many-to-one [93]
and many-to-many [48] NMT models are specific instances of this general framework. The train-
ing objective for multiway NMT is maximization of the log-likelihood of all training data jointly
for all language pairs, where different weights may be assigned to the likelihoods of different pairs.
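In symbols, this objective can be written as follows. This is a sketch of the standard formulation rather than a formula taken from a specific system; w_l denotes an optional per-pair weight and C_l the parallel corpus for language pair l.

```latex
\mathcal{L}(\theta) = \sum_{l=1}^{L} w_l \sum_{(x, y) \in C_l} \log P_{\theta}(y \mid x),
\qquad w_l \ge 0 \ \text{(often } w_l = 1 \text{ for all pairs)}.
```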
Fig. 4. Part (a) of the figure shows an X sources (X encoders) and Y targets (Y decoders) MNMT model.
The encoders contain N layers (L_1 to L_N) and the decoders contain M layers (L_1 to L_M) that can be stacked.
Each layer can be recurrent, convolutional, or feed-forward. The decoders contain a softmax layer (S) to
predict target language words. A single attention mechanism is shared among all encoders and decoders. In
the figure, the blue arrows indicate the flow of processing to translate an English sentence into its Italian
translation. Given the vast number of components, they can be initialized by pre-trained models such as
BERT to perform transfer learning. Post-training, this model can be decomposed into individual bilingual
models. Part (b) of the figure shows a fully shared MNMT model for all language pairs. Here, the “language tag” trick is used, where a token like “<2xx>” is prefixed to each source sentence to indicate the desired target language. All existing MNMT models are adaptations of these two fundamental architectures. Depending
on the language pairs involved, the size of the data, the languages of focus, and deployment conditions, one
should consider choosing the components that should be shared.
[70]. Analyzing multiway NMT systems could also provide an understanding of the relationship
between languages from a statistical and linguistic point of view [36, 105].
Multiway translation systems follow the standard embed-encode-attend-decode paradigm that
is common for existing NMT systems. However, the architecture is adapted to support multiple
languages. This involves addressing issues related to vocabularies and associated embeddings, net-
work layers (RNN/CNN/Feed-Forward), parameter sharing, training protocols and language diver-
gence. We address each issue in this section.
pair in Equation (2), the attention score in Firat et al. [48] is calculated from multiple encoders and
decoders as:
e_{ji}^{vu} = a(s_{j-1}^{v}, h_i^{u}),
where u is the u-th encoder and v is the v-th decoder. By sharing attention across languages, it
was hypothesized that transfer learning could take place and such a model was able to outper-
form bilingual models, especially when the target language was English. However, this model has
a large number of parameters, usually around 270M or more. Furthermore, the number of param-
eters only grows linearly with the number of languages, while it grows quadratically for bilingual
systems spanning all the language pairs in the multiway system. Another problem is that the
shared attention mechanism has to bear the burden of connecting different language pairs and
this can introduce a representational bottleneck where a model cannot learn the necessary repre-
sentations for the best translation quality. However, we emphasize that minimally shared models
are quite flexible because it is possible to treat individual languages with special processing mech-
anisms especially suited to maximize performance for that language. However, after Johnson et al. [70] showed that such models with a large number of parameters are not required, research efforts started focusing on a middle ground where the amount of sharing is controlled.
Complete Parameter Sharing. Johnson et al. [70] proposed a highly compact model where all
languages share the same embeddings, encoder, decoder, and attention mechanism. Typically, a
common vocabulary across all languages is first generated using a subword-level encoding mech-
anism such as BPE, WPM, or SPM. Thereafter, all corpora are concatenated4 and the input se-
quences are prefixed with a special token (called the language tag) to indicate the target language
(see Figure 4(b)). This enables the decoder to correctly generate the target language despite all tar-
get languages sharing the same decoder parameters. Note that the embedding and softmax layers
are shared across all languages. Ha et al. [58] proposed a similar model, but they maintained
separate vocabularies for each language. While this might help in faster inference due to smaller
softmax layers, the possibility of cognate sharing is lower, especially for linguistically close lan-
guages sharing a common script. In our view, the approach of Johnson et al. [70] should be particularly useful for related languages, especially those that use the same script, because such languages have a high degree of lexical and syntactic similarity [123]. The approach of Ha et al. [58], in contrast, can be used for unrelated languages.
In practice, we recommend experimenting with both Ha et al. [58] and Reference [70] and then
choosing the appropriate model. We believe that an empirical analysis of these architectures in a
variety of language and corpora settings will be helpful. A few techniques through which lexical
similarity can be further leveraged are:
• representing all languages in a common script using script conversion [38, 93] or translit-
eration (Nakov and Ng [103] for multilingual SMT);
• using a common subword-vocabulary across all languages, e.g., character [93] and BPE
[106];
• representing words by both character encoding and a latent embedding space shared by all
languages [155].
Concerning the choice of non-embedding or softmax layers, Rikters et al. [121] and Lakew et al.
[89] have compared RNN, CNN, and the self-attention-based architectures for MNMT. They show
that self-attention-based architectures outperform the other architectures in many cases. The most
4 It is common to oversample smaller corpora so all language pairs are equally represented in the model regardless of the corpus sizes.
common hypothesis is that self-attention enables random access to all words in a sentence, which
leads to better word and sentence representations.
The complete sharing approach treats the NMT system as a black box to train a multilingual sys-
tem. We would like to emphasize that this type of model has maximum simplicity and minimal parameter size, usually around 30M (similar to bilingual models), as all languages share the same parameters, and it achieves comparable or better results than bilingual systems. Work on massively
multilingual NMT [2, 5, 11] pushes the idea of completely shared models for all language pairs to
its limits. Aharoni et al. [2] and Arivazhagan et al. [5] trained single models for 59 and 103 languages, respectively, and explored a wide range of model configurations focusing on data selection,5 corpora balancing,
vocabulary, deep stacking, training, and decoding approaches. While massively multilingual mod-
els have a strong positive impact on low-resource languages, they tend to benefit translation into
English a lot more than from English. Furthermore, the gains in translation quality tend to taper
off when using more than 50 languages. These works also study translation between language pairs for which there was no training data, but we discuss that scenario separately in Section 5.
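The corpora balancing mentioned above (see footnote 5) is commonly implemented with temperature-based sampling, where the probability of drawing a sentence pair from a given language pair is its data fraction raised to the power 1/T. The sketch below shows this computation; the corpus sizes and temperature are illustrative only.

```python
def sampling_probabilities(corpus_sizes, temperature=5.0):
    """Temperature-based sampling over language pairs.

    corpus_sizes: dict mapping a language pair to its number of sentence pairs.
    T = 1 reproduces proportional sampling; larger T flattens the distribution
    towards uniform, thereby up-sampling low-resource pairs.
    """
    total = sum(corpus_sizes.values())
    weights = {pair: (size / total) ** (1.0 / temperature)
               for pair, size in corpus_sizes.items()}
    norm = sum(weights.values())
    return {pair: w / norm for pair, w in weights.items()}

# Illustrative corpus sizes (in sentence pairs).
print(sampling_probabilities({"en-fr": 40_000_000, "en-hi": 1_500_000, "en-gu": 150_000}))
```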
However, a massively multilingual system also runs into representation bottlenecks [2, 135],
where not all translation directions show improved performance despite a massive amount of data
being fed to a model with a massive number of parameters. Future research should focus on better
addressing the bottleneck issues, issues with deeply stacked models, and designing new modular
layers that can handle a wide variety and a large number of languages. We strongly recommend that readers read the paper by Arivazhagan et al. [5], which is itself a survey of multilingual NMT on
web-scale data. From the perspective of understanding the working of such models, a toolkit for
visualization and inspection of multilingual models should be extremely valuable and Reference
[83] is a good start in this direction.
Controlled Parameter Sharing. In between the extremities of parameter sharing exemplified
by the above-mentioned models lies an array of choices. The degree of parameter sharing can be
controlled at various layers of the MNMT system. A major factor driving the degree of parameter
sharing is the divergence between the languages involved [123] and the desired simplicity as well
as the flexibility of modeling.
Sharing encoders among multiple languages is very effective and is widely used [93, 123]. The burden of generation falls mainly on the decoder, while the encoder's job is relatively simpler, which means that sharing an encoder between many languages leads to better parameter utilization; the decoder and its attention mechanism, however, should be as robust as possible, which is why keeping decoders separate is important. Blackwood et al. [13] explored target language, source
language, and pair-specific attention parameters. They showed that target language-specific atten-
tion performs better than other attention-sharing configurations, thus highlighting that designing
a strong decoder is extremely important. For self-attention-based NMT models, Sachan and Neu-
big [123] explored various parameter-sharing strategies. They showed that sharing the decoder
self-attention and encoder-decoder cross-attention parameters is useful for linguistically dissimi-
lar languages. By sharing self- and cross-attention mechanisms in the decoder, the decoder most
likely learns target language representations that are better aligned with source language represen-
tations. Wang et al. [157] further proposed a mechanism to generate a universal representation in-
stead of separate encoders and decoders to maximize parameter sharing. They also used language-
sensitive embedding, attention, and discriminator for different languages. This helps control the
amount of sharing in an indirect fashion. Bapna and Firat [11] also extend a fully shared model
with language-pair-specific adaptor layers that are fine-tuned for those pairs. After training a fully
5 In particular, they focused on the effects of temperature-based data sampling on the multilingual models.
shared model, additional adaptor layers are inserted into the model and only those layers are fine-
tuned, which requires significantly less computation cost. This does lead to an increase in the
number of parameters but it is modest compared to a minimally shared system in which we would
have to introduce many more parameters due to the addition of more encoders and decoders.
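A rough sketch of such an adaptor layer is given below: a small bottleneck feed-forward block with a residual connection that is attached to a frozen pre-trained model and fine-tuned per language pair. The bottleneck size, layer-norm placement, and the assumed model.layers interface are our own simplifications and do not exactly reproduce the configuration of Bapna and Firat [11].

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adaptor: layer norm, project down, non-linearity, project up, residual."""
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(self.norm(x))))

def add_adapters_and_freeze(model, language_pairs, d_model=512):
    """Freeze the pre-trained multilingual model and create one adaptor per
    language pair and per layer; only the adaptor parameters are fine-tuned.
    The model is assumed to expose its stacked layers as model.layers."""
    for p in model.parameters():
        p.requires_grad = False
    adapters = nn.ModuleDict({
        pair: nn.ModuleList([Adapter(d_model) for _ in model.layers])
        for pair in language_pairs
    })
    return adapters  # trained on the parallel corpus of the corresponding pair
```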
Fixing sharing configurations prior to training is ill-advised, because sharing one set of pa-
rameters might be optimal for one language pair but not another. To this end, Zaremoodi et al.
[164] proposed a routing network to dynamically control parameter sharing where the parts to be
shared depend on the parallel corpora used for training. Alternatively, Platanios et al. [115] learned the degree of parameter sharing from the training data. This is achieved by defining the language-
specific model parameters as a function of global parameters and language embeddings. A base
set of parameters is transformed into another set of parameters for a specific language using lin-
ear projections. If θ is the set of base parameters, then the parameters for a specific language pair
src i and tдt j can be obtained via a linear transformation F (θ ). This linear transformation involves
learning some additional parameters for projection, but this involves far fewer parameters com-
pared to modeling complete sets of parameters for each new language and thus this approach is
quite attractive. It will be interesting to determine whether using non-linear projections is bet-
ter than linear ones. This approach reduces the number of language-specific parameters (only
language embeddings), while still allowing each language to have its own unique parameters for
different network layers. In fact, the number of parameters is only a small multiple of the compact
model (the multiplication factor accounts for the language embedding size) [70], but the language
embeddings can directly impact the model parameters instead of the weak influence that language
tags have.
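The parameter-generation idea can be sketched as follows: a (source, target) pair of language embeddings is linearly projected to the weights of a network layer. Generating a single linear layer and the chosen dimensions are simplifications we introduce for illustration; they do not reproduce the exact setup of Platanios et al. [115].

```python
import torch
import torch.nn as nn

class ContextualParameterGenerator(nn.Module):
    """Generates the weights of one linear layer from language embeddings."""
    def __init__(self, num_languages, lang_dim, in_dim, out_dim):
        super().__init__()
        self.lang_emb = nn.Embedding(num_languages, lang_dim)
        # Linear projection from the concatenated (src, tgt) language embeddings
        # to the flattened parameters (weights and bias) of the generated layer.
        self.proj = nn.Linear(2 * lang_dim, in_dim * out_dim + out_dim)
        self.in_dim, self.out_dim = in_dim, out_dim

    def forward(self, src_lang, tgt_lang, x):
        ctx = torch.cat([self.lang_emb(src_lang), self.lang_emb(tgt_lang)], dim=-1)
        params = self.proj(ctx)
        W = params[: self.in_dim * self.out_dim].view(self.out_dim, self.in_dim)
        b = params[self.in_dim * self.out_dim :]
        return x @ W.t() + b  # apply the generated layer to the input

gen = ContextualParameterGenerator(num_languages=10, lang_dim=8, in_dim=16, out_dim=16)
y = gen(torch.tensor(0), torch.tensor(3), torch.randn(4, 16))  # language IDs 0 and 3
```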
Designing the right sharing strategy is important to maintaining a balance between model com-
pactness, translation accuracy, simplicity, and flexibility. There are not many works that address
the representation bottlenecks in these types of models. However, we should note that work by
Huang et al. [65] focuses on efficient training mechanisms for extremely large multilingual models
with over 128 layers so they can learn high-quality representations. Although they do not actually analyze the learned representations, it could be beneficial to analyze such models to understand
how representation bottlenecks can be overcome. It will be interesting to see more approaches that
use the training data itself to enable a model to increase or decrease its own complexity or capac-
ity. Reinforcement learning and genetic algorithms applied to neural architecture search (NAS)
[169] can be one of the ways to achieve this. We also refer the readers to work on conditional
computation [12] in which parts of the model to be computed are determined by the data.
models. This work was conducted in a massively multilingual setting, and we believe that further
studies and innovation are required.
The Nature of Multilingual Representations. Since MNMT systems share network compo-
nents across languages, they induce a relationship among representations of sentences across
languages. Understanding the nature of these multilingual representations can help get insights
into the working of multilingual models. Some works that visualize multilingual model embed-
dings suggest that the encoder learns similar representations for similar sentences across lan-
guages [36, 70]. Since these visualizations are done in very low dimensions (2–3 dimensions), they
might not convey the right picture about the language-invariance of multilingual representations.
Kudugunta et al. [83] do a systematic study of representations generated from a massively mul-
tilingual system using SVCCA [118], a framework for comparing representations across different
languages, models, and layers. Their study brings out the following observations on the nature of
multilingual embeddings from a compact MNMT system:
• While encoder representations for similar sentences are similar across languages, there is
a fine-grained clustering based on language similarity. This explains why transfer learning
works better with related languages (as discussed in Section 4).
• The boundary between the encoder and decoder is blurry, and the source language representations depend on the target language and vice versa.
• Representation similarity varies across layers. The invariance increases in higher layers on
the encoder side. However, the invariance decreases in higher layers on the decoder side.
This is expected, since the decoder is sensitive to the target language to be generated. The
decoder has to achieve the right balance between language-agnostic and language-aware
representations.
able to generate language-specific translations is crucial. If several parallel corpora for different
language pairs are simply concatenated and then fed to a standard NMT model, then the NMT
model might end up generating a mixed language translation as a result of vocabulary leakage
due to language-invariant representations. The language tag trick has been very effective in
preventing vocabulary leakage [70], because it enables the decoder to clearly distinguish between
different languages.6 Further, Blackwood et al. [13] added the language tag to the beginning
as well as end of sequence to enhance its effect on the sentence representations learned by a
left-to-right encoder. This shows that dedicating a few parameters to learn language tokens can
help a decoder maintain a balance between language-agnostic and language-distinct features.
Hokamp et al. [64] showed that more often than not, using separate decoders and attention
mechanisms gives better results as compared to a shared decoder and attention mechanism. This
work implies that the best way to handle language divergence would be to use a shared encoder
for source languages and different decoders for target languages. We expect that the balance
between language-agnostic and language-specific representations should depend on the language
pairs. Prasanna [117] and Tan et al. [141] are some of the works that cluster languages into language families and train separate MNMT models per family. Language families can be decided by
using linguistic knowledge7 [117] or by using embedding similarities where the embeddings are
obtained from a multilingual word2vec model [141]. Comparing language-family-specific models
and mixed-language-family models shows that the former models outperform the latter models.
In the future, when training a model on a large number of languages, researchers could consider
different decoders for different language families and each decoder in turn can use the language
token trick to generate specific languages in the language family assigned to the decoder.
Impact of Language Tag. There are some works that explore the effects of the language tag trick
on the shared decoder, its language-agnostic (or lack thereof) representations, and the final transla-
tion quality. Wang et al. [156] explored multiple methods for supporting multiple target languages:
(a) adding a target language tag at the beginning of the decoder, (b) target-language-dependent positional embeddings, and (c) dividing the hidden units of each decoder layer into shared and language-dependent ones. Each of these methods provides gains over Johnson et al. [70], and combining all methods
gave the best results. Hokamp et al. [64] showed that in a shared decoder setting, using a task-
specific (language pair to be translated) embedding works better than using language tokens. We
expect that this is because learning task-specific embeddings needs more parameters and helps the
decoder learn better features to distinguish between tasks. In the related sub-area of pre-training
using MASS [138] and mBERT/XLM [32], it was shown that using language or task tags is sufficient for distinguishing between the languages used for pre-training large models. Given that massive
pre-training is relatively new, further exploration is much needed.
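As a concrete illustration of the language tag trick discussed above and in Section 3, the snippet below prefixes a target-language token to each source sentence before the corpora are concatenated for training. The tag format follows the “<2xx>” convention of Figure 4(b); the data structures are our own simplification.

```python
def add_language_tags(parallel_corpora):
    """parallel_corpora: dict mapping (src_lang, tgt_lang) to a list of (src, tgt) pairs.

    Returns a single mixed training set in which every source sentence is
    prefixed with a token identifying the desired target language, so that one
    shared encoder/decoder can serve all translation directions.
    """
    mixed = []
    for (src_lang, tgt_lang), pairs in parallel_corpora.items():
        tag = f"<2{tgt_lang}>"
        for src, tgt in pairs:
            mixed.append((f"{tag} {src}", tgt))
    return mixed

corpora = {
    ("en", "fr"): [("how are you ?", "comment allez-vous ?")],
    ("en", "de"): [("how are you ?", "wie geht es dir ?")],
}
print(add_language_tags(corpora))
# [('<2fr> how are you ?', 'comment allez-vous ?'), ('<2de> how are you ?', 'wie geht es dir ?')]
```

Variants such as Blackwood et al. [13] additionally append the tag at the end of the sequence; the same preprocessing function can be extended accordingly.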
In the above discussion, we have not considered the possibility of task interference and negative
language transfer. Previous research has shown that certain languages help improve translation
quality when grouped together [36, 170]. This implies the possibility of language pairs that may in-
terfere with each other (task interference in the multi-task setting) and cause unhelpful knowledge
to be transferred (negative transfer). Deeper research on this is pending.
6 Our own experiments in extremely low-resource scenarios show that it is impossible to completely avoid vocabulary
leakage, especially when the same word is present in different languages. Employing a special bias vector helps mitigate
this issue.
7 https://en.wikipedia.org/wiki/List_of_language_families.
on first training bilingual models and then gradually increasing their capacity to include more
languages. To address representation bottlenecks, Bapna and Firat [11] proposed expanding the
capacities of pre-trained MNMT models (especially those trained on massive amounts of multilin-
gual data) using tiny feed-forward components that they call adaptors. For each language pair in
a pre-trained (multilingual) model, they add adaptors at each layer and fine-tune them on parallel
corpora for that direction. These modular adaptor layers can be considered as experts that special-
ize for specific language pairs and can be added incrementally. They showed that this approach
can help boost the performance of massively multilingual models trained by Arivazhagan et al. [5]
and Aharoni et al. [2]. Note that this model can be used to incorporate new data, but new language
pairs cannot be added. For multiway models, it is critical for incremental training to deal with catastrophic forgetting, where previously learned knowledge is forgotten when an additional language pair is included or additional data is used for training. While the previously mentioned work on adaptor layers
[11] is an effective solution, further research will be beneficial.
A major criticism of all these approaches is that the MNMT models are trained in the same
way as a regular NMT model. Most researchers tend to treat all language pairs equally, with the
exception of oversampling smaller corpora to match the sizes of the larger corpora, and tend to
ignore the fact that NMT might be able to handle some language pairs better than others. There are
works that propose to focus on scaling learning rates or gradients differently for high-resource and
low-resource language pairs [67]. In addition, Kiperwasser and Ballesteros [74] proposed a multi-task learning model for learning syntax and translation, where they showed different effects of
their model for high-resource and low-resource language pairs. Furthermore, different language
pairs might be learned at different rates, and the optimal stopping criterion is also heavily under-researched. Currently, model selection is based on an automatic evaluation metric score on the dev
set composed of multiple language pairs. As the metric is an average score for multiple language
pairs, the chosen model might not be optimal for individual pairs. Therefore, we suppose that
the MNMT models in the vast majority of papers are sub-optimally trained even if they improve
the translation quality for low-resource languages. In hindsight, this particular aspect of MNMT
training deserves more attention.
Although multiway MNMT modeling has been thoroughly explored, there are still a number
of open questions, especially the representation bottleneck and parameter-sharing protocols. The
next section will pay special attention to MNMT in resource-constrained scenarios.
Fig. 5. An overview of MNMT for low-resource language pairs depicting various aspects of proposed solutions.
4.1 Training
Most studies have explored transfer learning on the source-side: The high-resource and low-
resource language pairs share the same target language. The simplest approach is jointly training
both language pairs [70]. However, the final model may not be optimally tuned for the child lan-
guage pair, whose performance we are interested in.
A better approach is to fine-tune the parent model with data from the child language pair. Zoph
et al. [170] showed that such transfer learning can benefit low-resource language pairs. First, they
trained a parent model on a high-resource language pair. The child model is initialized with the par-
ent’s parameters wherever possible and trained on the small parallel corpus for the low-resource
pair. They also studied the effect of fine-tuning only a subset of the child model’s parameters
(source and target embeddings, RNN layers, and attention). They observed that fine-tuning all pa-
rameters except the input and output embeddings was optimal in one setting. However, further
experiments are required for drawing strong conclusions.
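In code, this kind of parent-child transfer amounts to initializing the child model from the parent checkpoint and optionally freezing some parameter groups during fine-tuning. The sketch below assumes a PyTorch model whose parameter names contain substrings such as "embed" and a checkpoint storing a state_dict; the naming convention and the frozen groups are illustrative and not those used by Zoph et al. [170].

```python
import torch

def init_child_from_parent(child_model, parent_checkpoint_path, freeze_substrings=("embed",)):
    """Copy parent parameters into the child wherever names and shapes match,
    then freeze parameter groups whose names contain any of `freeze_substrings`.
    `parent_checkpoint_path` is assumed to store the parent model's state_dict."""
    parent_state = torch.load(parent_checkpoint_path, map_location="cpu")
    child_state = child_model.state_dict()
    transferable = {k: v for k, v in parent_state.items()
                    if k in child_state and v.shape == child_state[k].shape}
    child_state.update(transferable)
    child_model.load_state_dict(child_state)

    for name, param in child_model.named_parameters():
        if any(s in name for s in freeze_substrings):
            param.requires_grad = False  # e.g., keep embeddings fixed during fine-tuning
    trainable = [p for p in child_model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-4)
```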
Training the parent model to optimality may not be the best objective for child tasks. It may
be preferable that parent model parameters are amenable to fast adaptation/fine-tuning on child
tasks. Learning such a parent model is referred to as meta-learning and can generalize well to
child tasks. Gu et al. [55] used the model-agnostic meta-learning (MAML) framework [47] to learn
appropriate parameter initialization from the parent pair(s) by taking the child pair into consider-
ation. The meta-learning-based approach significantly outperforms simple fine-tuning. They also
show that having more parent language pairs (jointly trained) also improves performance on the
child language pairs.
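A rough sketch of the meta-learning idea is given below, using a first-order approximation (in the spirit of Reptile) rather than full MAML so that the example stays short. The task sampler, loss function, and learning rates are placeholders introduced for illustration and do not reproduce the setup of Gu et al. [55].

```python
import copy
import random
import torch

def meta_train_step(model, parent_tasks, loss_fn, inner_steps=3,
                    inner_lr=1e-4, meta_lr=1e-3):
    """One first-order meta-update: adapt a copy of the model to a sampled
    parent language pair, then move the original parameters towards the
    adapted ones so that later fine-tuning on a child pair is fast.
    Each task is assumed to expose `sample_batches(k)` yielding k batches."""
    task = random.choice(parent_tasks)
    adapted = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for batch in task.sample_batches(inner_steps):
        inner_opt.zero_grad()
        loss_fn(adapted, batch).backward()
        inner_opt.step()
    with torch.no_grad():  # first-order outer update towards the adapted parameters
        for p, p_adapted in zip(model.parameters(), adapted.parameters()):
            p.add_(meta_lr * (p_adapted - p))
```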
Transfer learning on the target-side has been more challenging than transfer learning on
the source-side. Distinct target languages require target-language-specific representations, while
transfer learning prefers target-language-invariant representations. The success of transfer learn-
ing relies on achieving the right balance between these factors. Johnson et al. [70] showed that joint
training does not provide any significant benefit. Fine-tuning is beneficial in very low-resource
scenarios [37], but gains may be limited due to catastrophic forgetting. Dabre et al. [37] showed
that a multi-stage fine-tuning process is beneficial when multiple target languages are involved.
They do not focus on language divergence during their multilingual multi-stage tuning but show
that the size of helping data matters. Following their work, it may be worthwhile to involve multiple medium-sized helping corpora (a few hundred thousand lines each) covering a variety of languages. This multilingual multi-stage transfer learning scenario requires further research.
nearby regions to Estonia. Most of the literature suggests that language relatedness is beneficial. It
is probably easier to overcome language divergence when the languages are related, but a deeper
study of how to uncover different factors is warranted.
Further, language relatedness can be explicitly utilized to improve transfer learning. Language
relatedness is typically exploited by using shared subword BPE vocabulary and BPE embed-
dings between the parent and child languages [106]. Maimaiti et al. [95] used a unified translit-
eration scheme at the character level. This utilizes the lexical similarity between the languages
and shows significant improvements in translation quality. Neubig and Hu [105] used “similar
language regularization” to prevent overfitting when rapidly adapting a pre-trained, massively
multilingual NMT model (universal model) for low-resource languages. While fine-tuning the uni-
versal model for a low-resource pair, overfitting is avoided by using a subset of the training data
for a related high-resource pair along with the low-resource pair. Chaudhary et al. [19] used this
approach to translate 1,095 languages to English. Further, not all parallel data from the parent task
may be useful in improving the child task. Wang and Neubig [154] proposed selection of sen-
tence pairs from the parent task based on the similarity of the parent’s source sentences to the
child’s source sentences. The significant gains from the simple methods described above point to the value of utilizing language relatedness. Further methods should be explored to create language-invariant
representations specifically designed for related languages.
While low-resource translation is hard in itself, an even more extreme scenario is where no
direct data exists between language pairs of interest. The next section discusses literature related
to this scenario.
8 N-way translations of the Bible may be among the rare sources of parallel corpora across arbitrary language pairs, but they are not sufficient for training a general-purpose MT system.
Fig. 6. An overview of MNMT for unseen language pairs. There are three broad approaches: pivot, zero-shot, and zero-resource approaches.
has to be run twice. Sometimes, more than one pivot may be required to translate between the source and target languages. Increased pipeline length exacerbates the above-mentioned problems.
The quality of the source-pivot translation is a bottleneck to the system. A variant of this approach
extracts n-best translations from the S-P system. For each pivot translation, the P-T system can
generate m-best translations. The n × m translation candidates can be re-ranked using scores from
both systems and external features to generate the final translation. This approach improves the
robustness of the pipeline to translation errors.
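The n × m candidate re-ranking described above can be sketched as follows; the sp_translate_nbest and pt_translate_nbest callables stand in for the S-P and P-T systems and are assumptions of this illustration.

```python
def pivot_translate(src_sentence, sp_translate_nbest, pt_translate_nbest, n=4, m=4):
    """Two-stage pivot translation with n x m candidate re-ranking.

    sp_translate_nbest(src, n) -> list of (pivot_sentence, sp_log_prob)   # S-P system
    pt_translate_nbest(piv, m) -> list of (target_sentence, pt_log_prob)  # P-T system
    """
    candidates = []
    for pivot, sp_lp in sp_translate_nbest(src_sentence, n):
        for target, pt_lp in pt_translate_nbest(pivot, m):
            # Combine the scores of both systems; external features could be added here.
            candidates.append((target, sp_lp + pt_lp))
    return max(candidates, key=lambda c: c[1])[0]
```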
Some researchers have analyzed zero-shot translation to understand its underperformance. The
following reasons have been suggested as limitations of zero-shot translation:
Spurious correlations between input and output language. During training, the network is not
exposed to the unseen pairs. In its quest to capture all correlations in the training data, the model
learns associations between the input representations and the target language for the observed
language pairs. At test time, irrespective of the target language tag, the network will tend to output
a language it has already observed with the source language in the training data [56]. Arivazhagan
et al. [4] showed that translation quality is closer to pivot systems if evaluation is restricted to
sentences where the correct target language is generated.
Language variant encoder representations. The encoder representations generated by the model
for equivalent source and pivot languages are not similar. Due to this discrepancy, the output
generated by the decoder will be different for the source and pivot representations [4, 83].
To address these limitations, the following methods have been proposed for improving zero-shot
translation:
Minimize divergence between encoder representations. During training, additional objec-
tives ensure that the source and pivot encoder representations are similar. Arivazhagan et al. [4]
suggested an unsupervised approach to align the source and pivot vector spaces by minimizing a
domain adversarial loss [50]—a discriminator is trained to distinguish between different encoder
languages using representations from an adversarial encoder. Since S-P parallel corpora are also available, supervised loss terms that penalize divergence in source and pivot representations for each sentence pair have also been explored. Different loss functions such as cosine distance [4], Eu-
clidean distance [113], and correlation distance [124] have been shown to be beneficial in reducing
the source/pivot divergence. Ji et al. [69] proposed to use pre-trained cross-lingual encoders trained
using multilingual MLM, XLM, and BRLM objectives to obtain language-invariant encoder repre-
sentations. Sen et al. [129] used denoising autoencoding and back-translation to obtain language-
invariant encoder representations.
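A minimal sketch of such a supervised agreement term is shown below: in addition to the usual translation loss, a cosine-distance penalty pulls pooled encoder representations of a source sentence and its pivot-side equivalent together. Mean pooling and the weighting coefficient are our own simplifications; the cited works explore several variants.

```python
import torch
import torch.nn.functional as F

def alignment_loss(src_enc_states, piv_enc_states):
    """Cosine distance between mean-pooled encoder representations of a source
    sentence and the corresponding pivot (e.g., English) sentence.

    src_enc_states, piv_enc_states: (batch, seq_len, hidden) tensors.
    """
    src_vec = src_enc_states.mean(dim=1)
    piv_vec = piv_enc_states.mean(dim=1)
    return (1.0 - F.cosine_similarity(src_vec, piv_vec, dim=-1)).mean()

def total_loss(translation_loss, src_enc_states, piv_enc_states, lam=1.0):
    # The auxiliary term encourages language-invariant encoder representations.
    return translation_loss + lam * alignment_loss(src_enc_states, piv_enc_states)
```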
Encourage output agreement. Al-Shedivat and Parikh [3] incorporated additional terms in the
training objective to encourage source and pivot representations of parallel sentences to generate
similar output sentences (synthetic) in an auxiliary language (possibly an unseen pair). This also
avoids spurious correlations, since the network learns to generate unseen pairs. Similar considera-
tions motivated Pham et al. [113] to add a pivot auto-encoding task in addition to the source-pivot
translation task. They incorporate additional loss terms that encourage the attention-context vec-
tors as well as decoder output representations to agree while generating the same pivot output.
Xu et al. [163] considered different translation paths among multiple languages in unsupervised
NMT by designing training objectives for these paths to achieve the same goal.
Effect of corpus size and number of languages. Aharoni et al. [2] suggested that the zero-shot
performance of a multilingual NMT system increases with the number of languages incorporated in
the model. It is not clear if the approaches mentioned above to address zero-shot NMT limitations
can scale to a large number of languages. Arivazhagan et al. [4] showed that cosine distance-based
alignment can be scaled to a small set of languages. Some studies suggest that zero-shot translation
works reasonably well only when the multilingual parallel corpora are large [91, 97].
Addressing wrong language generation. To address the problem of generation of words in the
wrong language, Ha et al. [59] proposed to filter the output of the softmax, forcing the model to
translate into the desired language. This method is rather effective despite its simplicity.
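The filtering idea can be sketched as masking the output distribution so that only tokens of the desired target language (plus shared special symbols such as the end-of-sentence token) can be generated. How per-language vocabularies are obtained is left open here, and the details differ from the original implementation of Ha et al. [59].

```python
import torch

def restrict_to_language(logits, allowed_token_ids):
    """Mask decoder logits so that only tokens of the desired target language
    (and shared special symbols) can be produced.

    logits: (batch, vocab_size); allowed_token_ids: iterable of token indices.
    """
    mask = torch.full_like(logits, float("-inf"))
    idx = torch.tensor(sorted(allowed_token_ids), device=logits.device)
    mask[:, idx] = 0.0
    return logits + mask  # a softmax over the masked logits stays in-language
```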
The zero-shot approaches discussed above can complement multiway NMT systems described
in Section 3 to support translation between the language pairs the model has not observed during
training. Further work is needed to establish if these methods can scale to massively multilingual
models.
6 MULTI-SOURCE NMT
If a source sentence has already been translated into multiple languages, then these sentences can
be used together to improve the translation into the target language. This technique is known as multi-source MT [168].
Table 1. Overview of Multi-source NMT Approaches Based on the Availability of N-way Data, the Training Concerns, and the Benefits
using each individual source and then averaged. A combination of late and early averaging was
shown to give the best results.
A simpler approach concatenated multiple source sentences and fed them to a standard NMT
model [36], with performance comparable to Zoph and Knight [168]. This model was obviously
inspired by the success of fully shared MNMT models [70]. A single encoder is responsible for
encoding a long multilingual sentence.9 Interestingly, this model could automatically identify the
boundaries between different source languages and simplify the training process for multi-source
NMT. Dabre et al. [36] also showed that it is better to use linguistically similar source languages,
especially in low-resource scenarios. Both studies showed that the attention mechanisms tend to prefer some languages over others. In particular, linguistically distant languages are practically ignored when computing the context for decoding. It is possible to speed up encoding by parallelizing
the encoders.
Ensembling of individual source-target models is another beneficial approach, for which
Garmash and Monz [51] proposed several methods with different degrees of parameterization.
They proposed to learn an ensembling function to combine the softmaxes of several bilingual
models. This approach requires a smaller N-lingual corpus, but training an ensembling function
can be costly in terms of parameters. However, Dabre et al. [36] have shown that it is possible
to perform naive ensembling10 and still get reasonable improvements in translation quality. Note
that the ensembling method used by Dabre et al. [36] is the same as the late averaging technique
proposed by Firat et al. [49]. Although the black-box method by Dabre et al. [36] is significantly
simpler and more elegant than the method in Reference [168], most works on multi-source NMT
use the latter method.
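Naive (late) ensembling of this kind simply averages the per-step output distributions of the individual source-target models, as sketched below; the decoding loop is omitted, and the step interface of the models is an assumption of this illustration.

```python
import torch

def averaged_distribution(models, sources, prev_tokens):
    """Uniformly average the softmax outputs of several bilingual models.

    models:      list of NMT models, one per source language.
    sources:     list of encoded source inputs, aligned with `models`.
    prev_tokens: the target prefix generated so far (shared by all models).
    Each model is assumed to expose `step(source, prev_tokens)` returning a
    (batch, vocab) tensor of probabilities for the next target word.
    """
    probs = [model.step(src, prev_tokens) for model, src in zip(models, sources)]
    return torch.stack(probs, dim=0).mean(dim=0)  # uniform weights over models
```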
Missing Source Sentences. There can be missing source sentences in multi-source corpora and
during translation. Nishimura et al. [108] extended Reference [168] by representing each “missing”
source language with a dummy token. The NMT model adapts to missing sentences and thus gives improved translations when all source sentences are available. Choi et al.
[25] and Nishimura et al. [107] proposed to use MT-generated synthetic sentences, instead of a
dummy token for the missing source languages. NMT models are used to translate sentences from
the available source languages into the missing ones. When sentences for all source languages
are ready, any standard multi-source approach may be used. The fact that such a simple “trick”
works so well means that combining all these methods allows the training of a single model with
both single as well as multi-source capabilities. Therefore, a single model can be used in different
deployment conditions. If low-latency translation services are desired, then the model can be used
in single-source mode. In moderate-latency scenarios, only a few source languages could be used.
In offline translation scenarios, all source languages could be used. Future work could focus on
dynamically determining which source languages are useful and which are not. Furthermore, a
multiway, multi-source model might be an interesting challenge to tackle.
Post-editing. Instead of having a translator translate from scratch, multi-source NMT can be used
to generate high-quality translations. The translations can then be post-edited, a process that is
less labor-intensive and cheaper compared to translating from scratch. Multi-source NMT has been
used for post-editing where the translated sentence is used as an additional source, leading to im-
provements [18]. Multi-source NMT has also been used for system combination, which combines
NMT and SMT outputs to improve translation performance [166].
9 Note that the order of the input sentences should be the same during training and testing.
10 This means uniform averaging of softmaxes of different models as opposed to weighted averaging.
In general, multi-source NMT does not receive much attention, because it cannot be used for
real-time translation, as it involves additional intermediate translation steps. However, it should
be possible to develop a system that can flexibly perform multi-source, multi-target as well as
single source and single target translation. The system by Firat et al. [49] is such a system, but
the adaptability to language pairs other than European languages and multiple (more than two)
source languages has not been verified. Such a model can be flexibly used during online as well as
offline translation scenarios.
[10] or transliteration [103]. Lexical similarity was utilized using subword-level translation mod-
els [85, 86, 144, 150]. Combining subword-level representation and pivoting for translation among
related languages has been explored [62, 88, 144]. Most of the above-mentioned multilingual sys-
tems involved either decoding-time operations, chaining black-box systems, or composing new
phrase-tables from existing ones.
Comparison with MNMT. While symbolic representations constrain a unified multilingual rep-
resentation, distributed universal language representation using real-valued vector spaces makes
multilingualism easier to implement in NMT. As no language-specific feature engineering is re-
quired for NMT, it is possible to easily scale to multiple languages. Neural networks provide flexi-
bility in experimenting with a wide variety of architectures, while advances in optimization tech-
niques and availability of deep learning toolkits make prototyping faster.
11 http://opus.nlpl.eu.
12 https://github.com/indicnlpweb/indicnlp_catalog.
Exploring Pre-trained Models. Pre-training embeddings, encoders, and decoders has been
shown to be useful for NMT [119]. Most works rely on fine-tuning but do not address techniques to
maximize the impact of transfer. Furthermore, how pre-training can be incorporated into different
MNMT architectures is important as well. Recent advances in cross-lingual word [6, 17, 33, 66, 75,
100] and sentence embeddings13 [7, 20, 34, 46] could provide directions for this line of investiga-
tion. Currently, transfer learning through unsupervised pre-training on extremely large corpora
and unsupervised NMT is gaining momentum and we believe that investing in these two topics
or a merger between them will yield powerful insights into ways to incorporate large amounts
of knowledge into translation systems. Recently, this research topic has shown a lot of promise for improving bilingual MT quality and should be beneficial for MNMT as well.
Unseen Language Pair Translation. Most work on unseen language pair translation has only
addressed cases where the pivot language is related to or shares the same script with the source
language. In many cases, the pivot language (mostly English) is unlikely to be related to the source
and target languages, and this scenario requires further investigation (especially for zero-shot
translation). Moreover, new approaches need to be explored to significantly improve over the
simple pivot baseline.
Fast Multi-source NMT. While multi-source NMT gives good results, it is limited by the lack of
multiway data during decoding. In a low-latency setting, the only acceptable solution would be
a way to rapidly generate the missing source sentences from the available source language. We
suppose that a form of simultaneous multi-target NMT solution might be a starting place for an
effective technique. The main idea would be to generate additional sources simultaneously along
with the target language and then at each decoding step the partially synthesized source sentences
can be used to provide additional contexts.
Related Languages, Language Registers, and Dialects. Translation involving related lan-
guages, language registers, and dialects can be further explored given the existing evidence about
the importance of language relatedness for improving MNMT quality. For dialects, currently the
focus is on modern standard dialects of languages, but most people feel comfortable speaking in
their local dialect and, hence, it would be valuable to transfer translation knowledge obtained
for modern standard languages to their dialects. In our opinion, unsupervised MT methods using
monolingual data should be extremely effective at handling dialect translation.
13 https://engineering.fb.com/ai-research/laser-multilingual-sentence-embeddings/.
multilingual ASR and TTS. An interesting research direction would be to explore multilingual
speech translation, where the ASR, translation, and TTS modules can be multilingual. Interesting
challenges and opportunities may arise in the quest to compose all these multilingual systems in
an end-to-end method. Multilingual end-to-end speech-to-speech translation would also be a fu-
ture challenging scenario. Some datasets are already available for such research [41]. We believe
that this will be one of the hardest challenges in the field of multilingual translation.
11 CONCLUSION
MNMT has made rapid progress in recent years. In this survey, we have covered literature pertain-
ing to the major scenarios we identified for multilingual NMT: multiway, low- or zero-resource
(transfer learning, pivoting, and zero-shot approaches), and multi-source translation. We have sys-
tematically compiled the principal design approaches and their variants, central MNMT issues and
their proposed solutions, along with their strengths and weaknesses. We have put MNMT in a his-
torical perspective with respect to work on multilingual RBMT and SMT systems. At the outset,
we raised two important research questions that we feel can now be answered, at least to a certain
extent, as follows:
Q1. Is it possible to have a one-model-for-all-languages solution to MT (NLP) applications?
A1. Yes. While it is possible to fit all language pairs into a single model, existing deep learning
methodologies suffer from representation-learning bottlenecks and limited generalization
capabilities, which cap the gains that multilingualism brings to translation quality. Further research
into better data selection and representation, network architectures, and learning algorithms is
needed.
Q2. Can shared multilingual distributed representations help MT (NLP) for low-resource languages?
A2. Yes. Low-resource language translation improves significantly, but further research on bal-
ancing the language-agnostic and language-specific nature of representations can help push
performance even further.
It is clear that, when leveraged, multilingualism can not only improve translation quality but
also provide a number of insights into the relationships between languages. Most works lack
human evaluation of multilingual models; such evaluation would help in better understanding the
impact of multilingualism. We have also suggested promising and important directions for future
work. We hope that this survey article will give researchers and practitioners a deeper understanding
of the MNMT landscape, enable them to choose engaging research topics, and help promote and
accelerate MNMT research.
ACKNOWLEDGMENTS
We thank the anonymous reviewers for their insightful comments.
REFERENCES
[1] Željko Agić and Ivan Vulić. 2019. JW300: A wide-coverage parallel corpus for low-resource languages. In Proceedings
of the 57th Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 3204–
3210. DOI:https://doi.org/10.18653/v1/P19-1310
[2] Roee Aharoni, Melvin Johnson, and Orhan Firat. 2019. Massively multilingual neural machine translation. In Pro-
ceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 3874–3884.
Retrieved from https://www.aclweb.org/anthology/N19-1388.
[3] Maruan Al-Shedivat and Ankur Parikh. 2019. Consistency by agreement in zero-shot neural machine translation. In
Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 1184–1197.
Retrieved from https://www.aclweb.org/anthology/N19-1121.
[4] Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Roee Aharoni, Melvin Johnson, and Wolfgang Macherey. 2019.
The missing ingredient in zero-shot neural machine translation. CoRR abs/1903.07091 (2019).
[5] Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen,
Yuan Cao, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen, and Yonghui Wu. 2019. Massively
multilingual neural machine translation in the wild: Findings and challenges. CoRR abs/1907.05019 (2019).
[6] Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings
while preserving monolingual invariance. In Proceedings of the Conference on Empirical Methods in Natural Language
Processing. Association for Computational Linguistics, 2289–2294. DOI:https://doi.org/10.18653/v1/D16-1250
[7] Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual
transfer and beyond. Trans. Assoc. Comput. Ling. 7 (2019), 597–610.
[8] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to
align and translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR’15). Retrieved
from http://arxiv.org/abs/1409.0473.
[9] Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp
Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract meaning representation for sembanking. In Proceed-
ings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse. Association for Computational
Linguistics, 178–186. Retrieved from http://www.aclweb.org/anthology/W13-2322.
[10] Tamali Banerjee, Anoop Kunchukuttan, and Pushpak Bhattacharyya. 2018. Multilingual Indian language translation
system at WAT 2018: Many-to-one phrase-based SMT. In Proceedings of the 32nd Pacific Asia Conference on Lan-
guage, Information and Computation: 5th Workshop on Asian Translation. Association for Computational Linguistics.
Retrieved from https://www.aclweb.org/anthology/Y18-3013.
[11] Ankur Bapna and Orhan Firat. 2019. Simple, scalable adaptation for neural machine translation. In Proceedings
of the Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference
on Natural Language Processing (EMNLP-IJCNLP’19). Association for Computational Linguistics, 1538–1548. DOI:
https://doi.org/10.18653/v1/D19-1165
[12] Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. 2016. Conditional computation in neural
networks for faster models. In Proceedings of the International Conference on Learning Representations (ICLR’16)
Workshop Track.
[13] Graeme Blackwood, Miguel Ballesteros, and Todd Ward. 2018. Multilingual neural machine translation with task-
specific attention. In Proceedings of the 27th International Conference on Computational Linguistics. Association for
Computational Linguistics, 3112–3122. Retrieved from http://aclweb.org/anthology/C18-1263.
[14] Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias
Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia
Specia, and Marco Turchi. 2017. Findings of the 2017 conference on machine translation (WMT’17). In Proceedings
of the 2nd Conference on Machine Translation. Association for Computational Linguistics, 169–214. Retrieved from
http://www.aclweb.org/anthology/W17-4717.
[15] Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Philipp Koehn, and Christof Monz.
2018. Findings of the 2018 conference on machine translation (WMT’18). In Proceedings of the 3rd Conference
on Machine Translation: Shared Task Papers. Association for Computational Linguistics, 272–303. Retrieved from
http://aclweb.org/anthology/W18-6401.
[16] Mauro Cettolo, Marcello Federico, Luisa Bentivogli, Jan Niehues, Sebastian Stüker, Katsuhito Sudoh, Koichiro
Yoshino, and Christian Federmann. 2017. Overview of the IWSLT 2017 evaluation campaign. In Proceedings of the
14th International Workshop on Spoken Language Translation. 2–14.
[17] Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C. Raykar, and
Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Proceedings of the Con-
ference on Advances in Neural Information Processing Systems. 1853–1861.
[18] Rajen Chatterjee, M. Amin Farajian, Matteo Negri, Marco Turchi, Ankit Srivastava, and Santanu Pal. 2017. Multi-
source neural automatic post-editing: FBK’s participation in the WMT 2017 APE shared task. In Proceedings of the
2nd Conference on Machine Translation. Association for Computational Linguistics. 630–638. DOI:https://doi.org/10.
18653/v1/W17-4773
[19] Aditi Chaudhary, Siddharth Dalmia, Junjie Hu, Xinjian Li, Austin Matthews, Aldrian Obaja Muis, Naoki Otani,
Shruti Rijhwani, Zaid Sheikh, Nidhi Vyas, Xinyi Wang, Jiateng Xie, Ruochen Xu, Chunting Zhou, Peter J. Jansen,
Yiming Yang, Lori Levin, Florian Metze, Teruko Mitamura, David R. Mortensen, Graham Neubig, Eduard Hovy,
Alan W. Black, Jaime Carbonell, Graham V. Horwood, Shabnam Tafreshi, Mona Diab, Efsun S. Kayi, Noura Farra,
and Kathleen McKeown. 2019. The ARIEL-CMU systems for LoReHLT18. CoRR abs/1902.08899 (2019).
[20] Xilun Chen, Ahmed Hassan Awadallah, Hany Hassan, Wei Wang, and Claire Cardie. 2019. Multi-source cross-lingual
model transfer: Learning what to share. In Proceedings of the 57th Meeting of the Association for Computational
Linguistics. Association for Computational Linguistics, 3098–3112. DOI:https://doi.org/10.18653/v1/P19-1299
[21] Yun Chen, Yang Liu, Yong Cheng, and Victor O. K. Li. 2017. A teacher-student framework for zero-resource neural
machine translation. In Proceedings of the 55th Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers). Association for Computational Linguistics, 1925–1935. DOI:https://doi.org/10.18653/v1/P17-1176
[22] Yun Chen, Yang Liu, and Victor O. K. Li. 2018. Zero-resource neural machine translation with multi-agent commu-
nication game. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence. AAAI Press, 5086–5093.
[23] Yong Cheng, Qian Yang, Yang Liu, Maosong Sun, and Wei Xu. 2017. Joint training for pivot-based neural machine
translation. In Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI’17). 3974–3980.
DOI:https://doi.org/10.24963/ijcai.2017/555
[24] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural
machine translation: Encoder–decoder approaches. In Proceedings of the 8th Workshop on Syntax, Semantics and
Structure in Statistical Translation (SSST’14). Association for Computational Linguistics, 103–111. DOI:https://doi.
org/10.3115/v1/W14-4012
[25] Gyu Hyeon Choi, Jong Hun Shin, and Young Kil Kim. 2018. Improving a multi-source neural machine translation
model with corpus extension for low-resource languages. In Proceedings of the 11th International Conference on
Language Resources and Evaluation (LREC’18). European Language Resource Association, 900–904. Retrieved from
http://aclweb.org/anthology/L18-1144.
[26] Christos Christodouloupoulos and Mark Steedman. 2015. A massively parallel corpus: The Bible in 100 languages.
Lang. Resour. Eval. 49, 2 (2015), 375–395.
[27] Chenhui Chu and Raj Dabre. 2018. Multilingual and multi-domain adaptation for neural machine translation. In
Proceedings of the 24th Meeting of the Association for Natural Language Processing (NLP’18). 909–912.
[28] Chenhui Chu and Raj Dabre. 2019. Multilingual multi-domain adaptation approaches for neural machine translation.
CoRR abs/1906.07978 (2019).
[29] Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of domain adaptation methods for
neural machine translation. In Proceedings of the 55th Meeting of the Association for Computational Linguistics (Volume
2: Short Papers). Association for Computational Linguistics, 385–391. DOI:https://doi.org/10.18653/v1/P17-2061
[30] Chenhui Chu and Rui Wang. 2018. A survey of domain adaptation for neural machine translation. In Proceedings
of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, 1304–
1319. Retrieved from http://aclweb.org/anthology/C18-1111.
[31] Michael Collins, Philipp Koehn, and Ivona Kučerová. 2005. Clause restructuring for statistical machine translation.
In Proceedings of the 43rd Meeting of the Association for Computational Linguistics (ACL’05). Association for Compu-
tational Linguistics, 531–540. DOI:https://doi.org/10.3115/1219840.1219906
[32] Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Proceedings of the 32nd
Conference on Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F.
d’Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 7059–7069. Retrieved from http://papers.nips.cc/
paper/8928-cross-lingual-language-model-pretraining.pdf.
[33] Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word transla-
tion without parallel data. In Proceedings of the International Conference on Learning Representations. Retrieved from
https://github.com/facebookresearch/MUSE.
[34] Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin
Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the Conference on Empir-
ical Methods in Natural Language Processing. Association for Computational Linguistics. 2475–2485. Retrieved from
https://www.aclweb.org/anthology/D18-1269.
[35] Anna Currey and Kenneth Heafield. 2019. Zero-resource neural machine translation with monolingual pivot data.
In Proceedings of the 3rd Workshop on Neural Generation and Translation. Association for Computational Linguistics,
99–107. DOI:https://doi.org/10.18653/v1/D19-5610
[36] Raj Dabre, Fabien Cromieres, and Sadao Kurohashi. 2017. Enabling multi-source neural machine translation by
concatenating source sentences in multiple languages. In Proceedings of the Machine Translation Summit XVI, Vol.1:
Research Track. 96–106.
[37] Raj Dabre, Atsushi Fujita, and Chenhui Chu. 2019. Exploiting multilingualism through multistage fine-tuning for
low-resource neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP’19). Association
for Computational Linguistics, 1410–1416. DOI:https://doi.org/10.18653/v1/D19-1146
[38] Raj Dabre, Anoop Kunchukuttan, Atsushi Fujita, and Eiichiro Sumita. 2018. NICT’s participation in WAT 2018: Ap-
proaches using multilingualism and recurrently stacked layers. In Proceedings of the 5th Workshop on Asian Language
Translation.
[39] Raj Dabre and Sadao Kurohashi. 2017. MMCR4NLP: Multilingual multiway corpora repository for natural language
processing. arXiv preprint arXiv:1710.01025 (2017).
[40] Raj Dabre, Tetsuji Nakagawa, and Hideto Kazawa. 2017. An empirical study of language relatedness for transfer
learning in neural machine translation. In Proceedings of the 31st Pacific Asia Conference on Language, Information and
Computation. The National University (Philippines), 282–286. Retrieved from http://aclweb.org/anthology/Y17-1038.
[41] Mattia A. Di Gangi, Roldano Cattoni, Luisa Bentivogli, Matteo Negri, and Marco Turchi. 2019. MuST-C: A multilin-
gual speech translation corpus. In Proceedings of the Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Com-
putational Linguistics. 2012–2017. Retrieved from https://www.aclweb.org/anthology/N19-1202.
[42] Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language
translation. In Proceedings of the 53rd Meeting of the Association for Computational Linguistics and the 7th International
Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics,
1723–1732. DOI:https://doi.org/10.3115/v1/P15-1166
[43] Bonnie J. Dorr. 1987. UNITRAN: An interlingua approach to machine translation. In Proceedings of the 6th Conference
of the American Association of Artificial Intelligence.
[44] Kevin Duh, Graham Neubig, Katsuhito Sudoh, and Hajime Tsukada. 2013. Adaptation data selection using neural
language models: Experiments in machine translation. In Proceedings of the 51st Meeting of the Association for Com-
putational Linguistics (Volume 2: Short Papers). 678–683. Retrieved from http://www.aclweb.org/anthology/P13-2119.
[45] Carlos Escolano, Marta R. Costa-jussà, and José A. R. Fonollosa. 2019. From bilingual to multilingual neural ma-
chine translation by incremental training. In Proceedings of the 57th Meeting of the Association for Computational
Linguistics.
[46] Cristina España-Bonet, Ádám Csaba Varga, Alberto Barrón-Cedeño, and Josef van Genabith. 2017. An empirical
analysis of NMT-derived interlingual embeddings and their use in parallel sentence identification. IEEE J. Select.
Topics Sig. Proc. 11, 8 (Dec. 2017), 1340–1350. DOI:https://doi.org/10.1109/JSTSP.2017.2764273
[47] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep
networks. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning
Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. 1126–1135. Retrieved from http://proceedings.mlr.press/
v70/finn17a.html.
[48] Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with
a shared attention mechanism. In Proceedings of the Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 866–875.
DOI:https://doi.org/10.18653/v1/N16-1101
[49] Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T. Yarman Vural, and Kyunghyun Cho. 2016. Zero-resource
translation with multi-lingual neural machine translation. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing. Association for Computational Linguistics, 268–277. DOI:https://doi.org/10.18653/v1/
D16-1026
[50] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario
Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17, 1
(2016), 2096–2030.
[51] Ekaterina Garmash and Christof Monz. 2016. Ensemble learning for multi-source neural machine translation. In
Proceedings of the 26th International Conference on Computational Linguistics: Technical Papers (COLING’16). The
COLING 2016 Organizing Committee, 1409–1418. Retrieved from http://aclweb.org/anthology/C16-1133.
[52] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to
sequence learning. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine
Learning Research), Doina Precup and Yee Whye Teh (Eds.), Vol. 70. 1243–1252. Retrieved from http://proceedings.
mlr.press/v70/gehring17a.html.
[53] Adrià De Gispert and José B. Mariño. 2006. Catalan-English statistical machine translation without parallel corpus:
Bridging through Spanish. In Proceedings of the 5th International Conference on Language Resources and Evaluation
(LREC’06). 65–68.
[54] Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O. K. Li. 2018. Universal neural machine translation for extremely
low resource languages. In Proceedings of the Conference of the North American Chapter of the Association for Com-
putational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Lin-
guistics, 344–354. DOI:https://doi.org/10.18653/v1/N18-1032
[55] Jiatao Gu, Yong Wang, Yun Chen, Victor O. K. Li, and Kyunghyun Cho. 2018. Meta-learning for low-resource neural
machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Associ-
ation for Computational Linguistics, 3622–3631. Retrieved from http://aclweb.org/anthology/D18-1398.
[56] Jiatao Gu, Yong Wang, Kyunghyun Cho, and Victor O. K. Li. 2019. Improved zero-shot neural machine translation
via ignoring spurious correlations. In Proceedings of the 57th Meeting of the Association for Computational Linguistics.
Association for Computational Linguistics, 1258–1268. DOI:https://doi.org/10.18653/v1/P19-1121
[57] Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and
Marc’Aurelio Ranzato. 2019. The FLORES evaluation datasets for low-resource machine translation: Nepali–English
and Sinhala–English. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the
9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP’19). Association for Computa-
tional Linguistics, 6098–6111. DOI:https://doi.org/10.18653/v1/D19-1632
[58] Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. 2016. Toward multilingual neural machine translation with
universal encoder and decoder. In Proceedings of the 13th International Workshop on Spoken Language Translation.
1–7.
[59] Thanh-Le Ha, Jan Niehues, and Alexander H. Waibel. 2017. Effective strategies in zero-shot neural machine trans-
lation. In Proceedings of the 14th International Workshop on Spoken Language Translation. 105–112.
[60] Barry Haddow and Faheem Kirefu. 2020. PMIndia—A collection of parallel corpora of languages of India. arxiv
2001.09907 (2020).
[61] Junxian He, Jiatao Gu, Jiajun Shen, and Marc’Aurelio Ranzato. 2020. Revisiting self-training for neural sequence
generation. In Proceedings of the 8th International Conference on Learning Representations (ICLR’20). Retrieved from
https://openreview.net/forum?id=SJgdnAVKDH.
[62] Carlos Henríquez, Marta R. Costa-jussà, Rafael E. Banchs, Lluis Formiga, and José B. Mariño. 2011. Pivot strategies
as an alternative for statistical machine translation tasks involving Iberian languages. In Proceedings of the Workshop
on Iberian Cross-language Natural Language Processing Tasks (ICL’11). 22–27.
[63] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2014. Distilling the knowledge in a neural network. In Proceedings of
the Conference on Advances in Neural Information Processing Systems (NIPS’14) Deep Learning Workshop.
[64] Chris Hokamp, John Glover, and Demian Gholipour Ghalandari. 2019. Evaluating the supervised and zero-shot
performance of multi-lingual translation models. In Proceedings of the 4th Conference on Machine Translation (Volume
2: Shared Task Papers, Day 1). Association for Computational Linguistics, 209–217. DOI:https://doi.org/10.18653/v1/
W19-5319
[65] Yanping Huang, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen.
2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Proceedings of the Conference
on Neural Information Processing Systems (NeurIPS’19).
[66] Pratik Jawanpuria, Arjun Balgovind, Anoop Kunchukuttan, and Bamdev Mishra. 2019. Learning multilingual word
embeddings in latent metric space: A geometric approach. Trans. Assoc. Comput. Ling. 7 (2019), 107–120. Retrieved
from https://www.aclweb.org/anthology/Q19-1007.
[67] Sébastien Jean, Orhan Firat, and Melvin Johnson. 2019. Adaptive scheduling for multi-task learning. In Proceedings
of the Continual Learning Workshop at NeurIPS’18.
[68] Girish Nath Jha. 2010. The TDIL program and the Indian language Corpora Intitiative (ILCI). In Proceedings of the
7th Conference on International Language Resources and Evaluation (LREC’10). European Languages Resources As-
sociation (ELRA). Retrieved from http://www.lrec-conf.org/proceedings/lrec2010/pdf/874_Paper.pdf.
[69] Baijun Ji, Zhirui Zhang, Xiangyu Duan, Min Zhang, Boxing Chen, and Weihua Luo. 2020. Cross-lingual pre-training
based transfer for zero-shot neural machine translation. In Proceedings of the 34th AAAI Conference on Artificial
Intelligence.
[70] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda
Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural
machine translation system: Enabling zero-shot translation. Trans. Assoc. Comput. Ling. 5 (2017), 339–351. Retrieved
from http://aclweb.org/anthology/Q17-1024.
[71] Yunsu Kim, Yingbo Gao, and Hermann Ney. 2019. Effective cross-lingual transfer of neural machine transla-
tion models without shared vocabularies. In Proceedings of the 57th Meeting of the Association for Computational
Linguistics.
[72] Yunsu Kim, Petre Petrov, Pavel Petrushkov, Shahram Khadivi, and Hermann Ney. 2019. Pivot-based transfer learn-
ing for neural machine translation between non-English languages. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Process-
ing (EMNLP-IJCNLP’19). Association for Computational Linguistics, 866–876. DOI:https://doi.org/10.18653/v1/D19-
1080
[73] Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the Conference
on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1317–1327. DOI:
https://doi.org/10.18653/v1/D16-1139
[74] Eliyahu Kiperwasser and Miguel Ballesteros. 2018. Scheduled multi-task learning: From syntax to translation. Trans.
Assoc. Comput. Ling. 6 (2018), 225–240. DOI:https://doi.org/10.1162/tacl_a_00017
[75] Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of
words. In Proceedings of the International Conference on Computational Linguistics (COLING’12). The COLING 2012
Organizing Committee, 1459–1474. Retrieved from https://www.aclweb.org/anthology/C12-1089.
[76] Tom Kocmi and Ondřej Bojar. 2018. Trivial transfer learning for low-resource neural machine translation. In Pro-
ceedings of the Third Conference on Machine Translation, Volume 1: Research Papers. Association for Computational
Linguistics, 244–252. Retrieved from http://www.aclweb.org/anthology/W18-6325.
[77] Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the 10th Machine
Translation Summit. AAMT, 79–86. Retrieved from http://mt-archive.info/MTS-2005-Koehn.pdf.
[78] Philipp Koehn. 2017. Neural machine translation. CoRR abs/1709.07809 (2017).
[79] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke
Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan
Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Meeting of the
Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions. Association
for Computational Linguistics, 177–180. Retrieved from http://www.aclweb.org/anthology/P/P07/P07-2045.
[80] Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the
1st Workshop on Neural Machine Translation. Association for Computational Linguistics, 28–39. Retrieved from
http://www.aclweb.org/anthology/W17-3204.
[81] Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Hu-
man Language Technology Conference of the North American Chapter of the Association for Computational Linguistics.
127–133. Retrieved from https://www.aclweb.org/anthology/N03-1017.
[82] Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and
detokenizer for neural text processing. In Proceedings of the Conference on Empirical Methods in Natural Language
Processing: System Demonstrations. Association for Computational Linguistics, 66–71. DOI:https://doi.org/10.18653/
v1/D18-2012
[83] Sneha Kudugunta, Ankur Bapna, Isaac Caswell, and Orhan Firat. 2019. Investigating multilingual NMT represen-
tations at scale. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing (EMNLP-IJCNLP’19). Association for Computational
Linguistics, 1565–1575. DOI:https://doi.org/10.18653/v1/D19-1167
[84] Anoop Kunchukuttan. 2020. IndoWordnet Parallel Corpus. Retrieved from https://github.com/anoopkunchukuttan/
indowordnet_parallel.
[85] Anoop Kunchukuttan and Pushpak Bhattacharyya. 2016. Orthographic syllable as basic unit for SMT between re-
lated languages. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association
for Computational Linguistics, 1912–1917. DOI:https://doi.org/10.18653/v1/D16-1196
[86] Anoop Kunchukuttan and Pushpak Bhattacharyya. 2017. Learning variable length units for SMT between related
languages via byte pair encoding. In Proceedings of the 1st Workshop on Subword and Character Level Models in NLP.
Association for Computational Linguistics, 14–24. DOI:https://doi.org/10.18653/v1/W17-4102
[87] Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, and Pushpak Bhattacharyya. 2014. Shata-
Anuvadak: Tackling multiway translation of Indian languages. In Proceedings of the 9th International Conference
on Language Resources and Evaluation (LREC’14). European Language Resources Association (ELRA), 1781–1787.
Retrieved from http://www.lrec-conf.org/proceedings/lrec2014/pdf/414_Paper.pdf.
[88] Anoop Kunchukuttan, Maulik Shah, Pradyot Prakash, and Pushpak Bhattacharyya. 2017. Utilizing lexical similarity
between related, low-resource languages for pivot-based SMT. In Proceedings of the 8th International Joint Conference
on Natural Language Processing (Volume 2: Short Papers). Asian Federation of Natural Language Processing, 283–289.
Retrieved from http://aclweb.org/anthology/I17-2048.
[89] Surafel Melaku Lakew, Mauro Cettolo, and Marcello Federico. 2018. A comparison of transformer and recurrent
neural networks on multilingual neural machine translation. In Proceedings of the 27th International Conference on
Computational Linguistics. Association for Computational Linguistics, 641–652. Retrieved from http://aclweb.org/
anthology/C18-1054.
[90] Surafel Melaku Lakew, Aliia Erofeeva, Matteo Negri, Marcello Federico, and Marco Turchi. 2018. Transfer learning in
multilingual neural machine translation with dynamic vocabulary. In Proceedings of the 15th International Workshop
on Spoken Language Translation (IWSLT’18). 54–61.
[91] Surafel Melaku Lakew, Quintino F. Lotito, Matteo Negri, Marco Turchi, and Marcello Federico. 2017. Improving zero-
shot translation of low-resource languages. In Proceedings of the 14th International Workshop on Spoken Language
Translation. 113–119.
[92] Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Unsupervised machine
translation using monolingual corpora only. In Proceedings of the International Conference on Learning Represen-
tations. Retrieved from https://openreview.net/forum?id=rkYTTf-AZ.
[93] Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without
explicit segmentation. Trans. Assoc. Comput. Ling. 5 (2017), 365–378. Retrieved from http://aclweb.org/anthology/
Q17-1026.
[94] Yichao Lu, Phillip Keung, Faisal Ladhak, Vikas Bhardwaj, Shaonan Zhang, and Jason Sun. 2018. A neural interlingua
for multilingual machine translation. In Proceedings of the 3rd Conference on Machine Translation: Research Papers.
Association for Computational Linguistics, 84–92. Retrieved from http://aclweb.org/anthology/W18-6309.
[95] Mieradilijiang Maimaiti, Yang Liu, Huanbo Luan, and Maosong Sun. 2019. Multi-round transfer learning for low-
resource NMT using multiple high-resource languages. ACM Trans. Asian Low-Resour. Lang. Inf. Proc. 18, 4 (May
2019). DOI:https://doi.org/10.1145/3314945
[96] Chaitanya Malaviya, Graham Neubig, and Patrick Littell. 2017. Learning language representations for typology
prediction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for
Computational Linguistics, 2529–2535. DOI:https://doi.org/10.18653/v1/D17-1268
[97] Giulia Mattoni, Pat Nagle, Carlos Collantes, and Dimitar Shterionov. 2017. Zero-shot translation for Indian languages
with sparse data. In Proceedings of Machine Translation Summit XVI, Vol. 2: Users and Translators Track. 1–10.
[98] Evgeny Matusov, Nicola Ueffing, and Hermann Ney. 2006. Computing consensus translation for multiple machine
translation systems using enhanced hypothesis alignment. In Proceedings of the 11th Conference of the European
Chapter of the Association for Computational Linguistics. 33–40. Retrieved from https://www.aclweb.org/anthology/
E06-1005.
[99] Mauro Cettolo, Christian Girardi, and Marcello Federico. 2012. WIT3: Web inventory of transcribed and translated
talks. In Proceedings of the 16th Conference of European Association for Machine Translation. 261–268.
[100] Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine transla-
tion. CoRR abs/1309.4168 (2013).
[101] Rudra Murthy, Anoop Kunchukuttan, and Pushpak Bhattacharyya. 2019. Addressing word-order divergence in
multilingual neural machine translation for extremely low resource languages. In Proceedings of the Confer-
ence of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo-
gies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 3868–3873. Retrieved from
https://www.aclweb.org/anthology/N19-1387.
[102] Toshiaki Nakazawa, Shohei Higashiyama, Chenchen Ding, Raj Dabre, Anoop Kunchukuttan, Win Pa Pa, Isao Goto,
Hideya Mino, Katsuhito Sudoh, and Sadao Kurohashi. 2018. Overview of the 5th workshop on Asian translation. In
Proceedings of the 5th Workshop on Asian Translation (WAT’18). 1–41.
[103] Preslav Nakov and Hwee Tou Ng. 2009. Improved statistical machine translation for resource-poor languages using
related resource-rich languages. In Proceedings of the Conference on Empirical Methods in Natural Language Process-
ing. Association for Computational Linguistics, 1358–1367. Retrieved from https://www.aclweb.org/anthology/D09-
1141.
[104] Graham Neubig. 2017. Neural machine translation and sequence-to-sequence models: A tutorial. CoRR
abs/1703.01619 (2017).
[105] Graham Neubig and Junjie Hu. 2018. Rapid adaptation of neural machine translation to new languages. In Pro-
ceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational
Linguistics, 875–880. Retrieved from http://aclweb.org/anthology/D18-1103.
[106] Toan Q. Nguyen and David Chiang. 2017. Transfer learning across low-resource, related languages for neural
machine translation. In Proceedings of the 8th International Joint Conference on Natural Language Processing (Vol-
ume 2: Short Papers). Asian Federation of Natural Language Processing, 296–301. Retrieved from http://aclweb.org/
anthology/I17-2050.
[107] Yuta Nishimura, Katsuhito Sudoh, Graham Neubig, and Satoshi Nakamura. 2018. Multi-source neural machine trans-
lation with missing data. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation. Associ-
ation for Computational Linguistics, 92–99. Retrieved from http://aclweb.org/anthology/W18-2711.
[108] Yuta Nishimura, Katsuhito Sudoh, Graham Neubig, and Satoshi Nakamura. 2018. Multi-source neural machine trans-
lation with data augmentation. In Proceedings of the 15th International Workshop on Spoken Language Translation
(IWSLT’18). 48–53. Retrieved from https://arxiv.org/abs/1810.06826.
[109] Eric Nyberg, Teruko Mitamura, and Jaime Carbonell. 1997. The KANT machine translation system: From R&D to
initial deployment. In Proceedings of the LISA Workshop on Integrating Advanced Translation Technology. 1–7.
[110] Franz Josef Och and Hermann Ney. 2001. Statistical multi-source translation. In Proceedings of the Machine Transla-
tion Summit, Vol. 8. 253–258.
[111] Robert Östling and Jörg Tiedemann. 2017. Continuous multilinguality with language vectors. In Proceedings of the
15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers.
Association for Computational Linguistics, 644–649. Retrieved from https://www.aclweb.org/anthology/E17-2102.
[112] Sinno Jialin Pan and Qiang Yang. 2010. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22, 10 (Oct.
2010), 1345–1359. DOI:https://doi.org/10.1109/TKDE.2009.191
[113] Ngoc-Quan Pham, Jan Niehues, Thanh-Le Ha, and Alexander Waibel. 2019. Improving zero-shot translation with
language-independent constraints. In Proceedings of the 4th Conference on Machine Translation (Volume 1: Research
Papers). Association for Computational Linguistics, 13–23. DOI:https://doi.org/10.18653/v1/W19-5202
[114] Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the
57th Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 4996–5001.
DOI:https://doi.org/10.18653/v1/P19-1493
[115] Emmanouil Antonios Platanios, Mrinmaya Sachan, Graham Neubig, and Tom Mitchell. 2018. Contextual param-
eter generation for universal neural machine translation. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing. Association for Computational Linguistics, 425–435. Retrieved from http://aclweb.org/
anthology/D18-1039.
[116] Matt Post, Chris Callison-Burch, and Miles Osborne. 2012. Constructing parallel corpora for six Indian languages via
crowdsourcing. In Proceedings of the 7th Workshop on Statistical Machine Translation. Association for Computational
Linguistics, 401–409.
[117] Raj Noel Dabre Prasanna. 2018. Exploiting Multilingualism and Transfer Learning for Low Resource Machine Transla-
tion. Ph.D. Dissertation. Kyoto University. Retrieved from http://hdl.handle.net/2433/232411.
[118] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. 2017. SVCCA: Singular vector canonical
correlation analysis for deep learning dynamics and interpretability. In Proceedings of the 30th Conference on Ad-
vances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vish-
wanathan, and R. Garnett (Eds.). Curran Associates, Inc., 6076–6085. Retrieved from http://papers.nips.cc/paper/
7188-svcca-singular-vector-canonical-correlation-analysis-for-deep-learning-dynamics-and-interpretability.pdf.
[119] Prajit Ramachandran, Peter Liu, and Quoc Le. 2017. Unsupervised pretraining for sequence to sequence learning. In
Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational
Linguistics, 383–391. DOI:https://doi.org/10.18653/v1/D17-1039
[120] Ananthakrishnan Ramanathan, Jayprasad Hegde, Ritesh Shah, Pushpak Bhattacharyya, and M. Sasikumar. 2008.
Simple syntactic and morphological processing can help English-Hindi statistical machine translation. In Proceedings
of the International Joint Conference on Natural Language Processing.
[121] Matīss Rikters, Mārcis Pinnis, and Rihards Krišlauks. 2018. Training and adapting multilingual NMT for less-
resourced and morphologically rich languages. In Proceedings of the 11th International Conference on Language Re-
sources and Evaluation (LREC’18). European Language Resources Association (ELRA), 3766–3773.
[122] Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. CoRR abs/1609.04747 (2016).
[123] Devendra Sachan and Graham Neubig. 2018. Parameter sharing methods for multilingual self-attentional translation
models. In Proceedings of the 3rd Conference on Machine Translation: Research Papers. Association for Computational
Linguistics, 261–271. Retrieved from http://aclweb.org/anthology/W18-6327.
[124] Amrita Saha, Mitesh M. Khapra, Sarath Chandar, Janarthanan Rajendran, and Kyunghyun Cho. 2016. A correlational
encoder decoder architecture for pivot based sequence generation. In Proceedings of the 26th International Conference
on Computational Linguistics: Technical Papers (COLING’16). The COLING 2016 Organizing Committee, 109–118.
Retrieved from https://www.aclweb.org/anthology/C16-1011.
[125] Peter H. Schönemann. 1966. A generalized solution of the orthogonal procrustes problem. Psychometrika 31, 1 (1966),
1–10.
[126] Josh Schroeder, Trevor Cohn, and Philipp Koehn. 2009. Word lattices for multi-source translation. In Proceedings
of the 12th Conference of the European Chapter of the ACL (EACL’09). Association for Computational Linguistics,
719–727. Retrieved from https://www.aclweb.org/anthology/E09-1082.
[127] Mike Schuster and Kaisuke Nakajima. 2012. Japanese and Korean voice search. In Proceedings of the International
Conference on Acoustics, Speech, and Signal Processing (ICASSP’12). IEEE, 5149–5152. Retrieved from http://dblp.uni-
trier.de/db/conf/icassp/icassp2012.html#SchusterN12.
[128] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong, and Francisco Guzmán. 2019. WikiMatrix: Mining
135M parallel sentences in 1620 language pairs from Wikipedia. CoRR abs/1907.05791 (2019).
[129] Sukanta Sen, Kamal Kumar Gupta, Asif Ekbal, and Pushpak Bhattacharyya. 2019. Multilingual unsupervised NMT
using shared encoder and language-specific decoders. In Proceedings of the 57th Meeting of the Association for Com-
putational Linguistics.
[130] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with mono-
lingual data. In Proceedings of the 54th Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers). Association for Computational Linguistics, 86–96. Retrieved from http://www.aclweb.org/anthology/P16-
1009.
[131] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword
units. In Proceedings of the 54th Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Association for Computational Linguistics, 1715–1725. Retrieved from http://www.aclweb.org/anthology/P16-1162.
[132] Lierni Sestorain, Massimiliano Ciaramita, Christian Buck, and Thomas Hofmann. 2018. Zero-shot dual machine
translation. CoRR abs/1805.10338 (2018).
[133] Petr Sgall and Jarmila Panevová. 1987. Machine translation, linguistics, and interlingua. In Proceedings of the 3rd
Conference on European Chapter of the Association for Computational Linguistics (EACL’87). Association for Compu-
tational Linguistics, 99–103. DOI:https://doi.org/10.3115/976858.976876
[134] Itamar Shatz. 2016. Native language influence during second language acquisition: A large-scale learner corpus
analysis. In Proceedings of the Pacific Second Language Research Forum (PacSLRF’16). 175–180.
[135] Aditya Siddhant, Melvin Johnson, Henry Tsai, Naveen Arivazhagan, Jason Riesa, Ankur Bapna, Orhan Firat, and
Karthik Raman. 2020. Evaluating the cross-lingual effectiveness of massively multilingual neural machine transla-
tion. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI’20).
[136] Shashank Siripragrada, Jerin Philip, Vinay P. Namboodiri, and C. V. Jawahar. 2020. A multilingual parallel corpora
collection effort for Indian languages. In Proceedings of the 12th Language Resources and Evaluation Conference. Eu-
ropean Language Resources Association, 3743–3751. Retrieved from https://www.aclweb.org/anthology/2020.lrec-
1.462.
[137] Anders Søgaard, Sebastian Ruder, and Ivan Vulić. 2018. On the limitations of unsupervised bilingual dictionary
induction. In Proceedings of the 56th Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Association for Computational Linguistics, 778–788. DOI:https://doi.org/10.18653/v1/P18-1072
[138] Kai Song, Yue Zhang, Heng Yu, Weihua Luo, Kun Wang, and Min Zhang. 2019. Code-switching for enhancing NMT
with pre-specified translation. In Proceedings of the Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for
Computational Linguistics, 449–459. Retrieved from https://www.aclweb.org/anthology/N19-1044.
[139] Ralf Steinberger, Mohamed Ebrahim, Alexandros Poulis, Manuel Carrasco-Benitez, Patrick Schlüter, Marek
Przybyszewski, and Signe Gilbro. 2014. An overview of the European Union’s highly multilingual parallel corpora.
Lang. Resour. Eval. 48, 4 (2014), 679–707.
[140] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Pro-
ceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’14). The MIT Press,
3104–3112. Retrieved from http://dl.acm.org/citation.cfm?id=2969033.2969173.
[141] Xu Tan, Jiale Chen, Di He, Yingce Xia, Tao Qin, and Tie-Yan Liu. 2019. Multilingual neural machine translation with
language clustering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing (EMNLP-IJCNLP’19). Association for Computational
Linguistics, 963–973. DOI:https://doi.org/10.18653/v1/D19-1089
[142] Xu Tan, Yi Ren, Di He, Tao Qin, and Tie-Yan Liu. 2019. Multilingual neural machine translation with knowledge
distillation. In Proceedings of the International Conference on Learning Representations (ICLR’19). Retrieved from
http://arxiv.org/abs/1902.10461.
[143] Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew M. Finch, and Eiichiro Sumita. 2016. Introducing the Asian
language treebank (ALT). In Proceedings of the 10th International Conference on Language Resources and Evaluation
(LREC’16). European Language Resources Association (ELRA), 1574–1578.
[144] Jörg Tiedemann. 2012. Character-based pivot translation for under-resourced languages and domains. In Proceed-
ings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for
Computational Linguistics, 141–151. Retrieved from https://www.aclweb.org/anthology/E12-1015.
[145] Jörg Tiedemann. 2012. Parallel data, tools, and interfaces in OPUS. In Proceedings of the 8th International Conference
on Language Resources and Evaluation (LREC’12). European Language Resources Association (ELRA), 2214–2218.
Retrieved from http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.
[146] Hiroshi Uchida. 1996. UNL: Universal networking language—An electronic language for communication, un-
derstanding, and collaboration. In UNU/IAS/UNL Center. Retrieved from https://www.semanticscholar.org/paper/
UNL%3A-Universal-Networking-Language-An-Electronic-Uchida/f281c6a61ee69e4fa0f15f3f6d03faeee7a74e10.
[147] Masao Utiyama and Hitoshi Isahara. 2007. A comparison of pivot methods for phrase-based statistical machine
translation. In Proceedings of the Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies. Association for Computational Linguistics, 484–491. Retrieved from
https://www.aclweb.org/anthology/N07-1061.
[148] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia
Polosukhin. 2017. Attention is all you need. In Proceedings of the 30th Conference on Advances in Neural Information
Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.).
Curran Associates, Inc., 5998–6008. Retrieved from http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
[149] Raúl Vázquez, Alessandro Raganato, Jörg Tiedemann, and Mathias Creutz. 2018. Multilingual NMT with a language-
independent attention bridge. CoRR abs/1811.00498 (2018).
[150] David Vilar, Jan-Thorsten Peter, and Hermann Ney. 2007. Can we translate letters? In Proceedings of the 2nd
Workshop on Statistical Machine Translation. Association for Computational Linguistics, 33–39. Retrieved from
https://www.aclweb.org/anthology/W07-0705.
[151] Karthik Visweswariah, Rajakrishnan Rajkumar, Ankur Gandhe, Ananthakrishnan Ramanathan, and Jiri Navratil.
2011. A word reordering model for improved machine translation. In Proceedings of the Conference on Empiri-
cal Methods in Natural Language Processing. Association for Computational Linguistics. 486–496. Retrieved from
https://www.aclweb.org/anthology/D11-1045.
[152] Rui Wang, Andrew Finch, Masao Utiyama, and Eiichiro Sumita. 2017. Sentence embedding for neural machine trans-
lation domain adaptation. In Proceedings of the 55th Meeting of the Association for Computational Linguistics (Volume
2: Short Papers). Association for Computational Linguistics, 560–566. Retrieved from http://aclweb.org/anthology/
P17-2089.
[153] Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen, and Eiichiro Sumita. 2017. Instance weighting for neural machine
translation domain adaptation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Association for Computational Linguistics, 1482–1488. DOI:https://doi.org/10.18653/v1/D17-1155
[154] Xinyi Wang and Graham Neubig. 2019. Target conditioned sampling: Optimizing data selection for multilingual
neural machine translation. In Proceedings of the 57th Meeting of the Association for Computational Linguistics. As-
sociation for Computational Linguistics, 5823–5828. DOI:https://doi.org/10.18653/v1/P19-1583
[155] Xinyi Wang, Hieu Pham, Philip Arthur, and Graham Neubig. 2019. Multilingual neural machine translation with soft
decoupled encoding. In Proceedings of the International Conference on Learning Representations (ICLR’19). Retrieved
from https://arxiv.org/abs/1902.03499.
[156] Yining Wang, Jiajun Zhang, Feifei Zhai, Jingfang Xu, and Chengqing Zong. 2018. Three strategies to improve one-to-
many multilingual translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.
Association for Computational Linguistics, 2955–2960. Retrieved from http://aclweb.org/anthology/D18-1326.
[157] Yining Wang, Long Zhou, Jiajun Zhang, Feifei Zhai, Jingfang Xu, and Chengqing Zong. 2019. A compact and
language-sensitive multilingual translation method. In Proceedings of the 57th Meeting of the Association for Computa-
tional Linguistics. Association for Computational Linguistics, 1213–1223. DOI:https://doi.org/10.18653/v1/P19-1117
[158] Toon Witkam. 2006. History and heritage of the DLT (Distributed Language Translation) project. In Utrecht, The
Netherlands: Private Publication. 1–11. Retrieved from http://www.mt-archive.info/Witkam-2006.pdf.
[159] Hua Wu and Haifeng Wang. 2007. Pivot language approach for phrase-based statistical machine translation. Mach.
Translat. 21, 3 (2007), 165–181.
[160] Hua Wu and Haifeng Wang. 2009. Revisiting pivot language approach for machine translation. In Proceedings of
the Joint Conference of the 47th Meeting of the ACL and the 4th International Joint Conference on Natural Language
Processing of the AFNLP. Association for Computational Linguistics, 154–162. Retrieved from https://www.aclweb.
org/anthology/P09-1018.
[161] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun,
Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser,
Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei
Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and
Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine
translation. CoRR abs/1609.08144 (2016).
[162] Fei Xia and Michael McCord. 2004. Improving a statistical MT system with automatically learned rewrite patterns.
In Proceedings of the 20th International Conference on Computational Linguistics (COLING’04). COLING, 508–514.
Retrieved from https://www.aclweb.org/anthology/C04-1073.
[163] Chang Xu, Tao Qin, Gang Wang, and Tie-Yan Liu. 2019. Polygon-Net: A general framework for jointly boosting
multiple unsupervised neural machine translation models. In Proceedings of the 28th International Joint Conference
on Artificial Intelligence (IJCAI’19). International Joint Conferences on Artificial Intelligence Organization, 5320–
5326. DOI:https://doi.org/10.24963/ijcai.2019/739
[164] Poorya Zaremoodi, Wray Buntine, and Gholamreza Haffari. 2018. Adaptive knowledge sharing in multi-task learn-
ing: Improving low-resource neural machine translation. In Proceedings of the 56th Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, 656–661. Retrieved
from http://aclweb.org/anthology/P18-2104.
[165] Yang Zhao, Jiajun Zhang, and Chengqing Zong. 2018. Exploiting pre-ordering for neural machine translation. In
Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC’18). European Language
Resources Association (ELRA). Retrieved from https://www.aclweb.org/anthology/L18-1143.
[166] Long Zhou, Wenpeng Hu, Jiajun Zhang, and Chengqing Zong. 2017. Neural system combination for machine trans-
lation. In Proceedings of the 55th Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
Association for Computational Linguistics, 378–384. DOI:https://doi.org/10.18653/v1/P17-2060
[167] Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations parallel corpus v1.0. In
Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC’16). European Language
Resources Association (ELRA), 3530–3534. Retrieved from https://www.aclweb.org/anthology/L16-1561.
[168] Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of the Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for
Computational Linguistics, 30–34. DOI:https://doi.org/10.18653/v1/N16-1004
[169] Barret Zoph and Quoc V. Le. 2017. Neural architecture search with reinforcement learning. In Proceedings of the
5th International Conference on Learning Representations (ICLR’17). Retrieved from https://openreview.net/forum?
id=r1Ue8Hcxg.
[170] Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine
translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for
Computational Linguistics, 1568–1575. DOI:https://doi.org/10.18653/v1/D16-1163