A Brief Survey of Multilingual Neural Machine Translation: Equal Contribution
Figure 1: MNMT research categorized according to resource scenarios and underlying modeling principles.
strained environment like mobile phones or IoT devices. It can also simplify the large-scale deployment of MT systems. Most importantly, we believe that the biggest benefit of doing MNMT research is getting better insights into and answers to an important question in natural language processing: how do we build distributed representations such that similar text across languages has similar representations?

There are multiple MNMT scenarios based on available resources, and studies have been conducted for the following scenarios (Figure 1; please see the supplementary material for papers related to each category):

Multiway Translation. The goal is constructing a single NMT system for one-to-many (Dong et al., 2015), many-to-one (Lee et al., 2017) or many-to-many (Firat et al., 2016a) translation using parallel corpora for more than one language pair.

Low or Zero-Resource Translation. For most of the language pairs in the world, there are small or no parallel corpora, and three main directions have been studied for this scenario. Transfer learning: Transferring translation knowledge from a high-resource language pair to improve the translation of a low-resource language pair (Zoph et al., 2016). Pivot translation: Using a high-resource language (usually English) as a pivot to translate between a language pair (Firat et al., 2016a). Zero-shot translation: Translating between language pairs without parallel corpora (Johnson et al., 2017).

Multi-Source Translation. Documents that have been translated into more than one language might, in the future, be required to be translated into another language. In this scenario, existing multilingual redundancy in the source side can be exploited for multi-source translation (Zoph and Knight, 2016).

Given these benefits, scenarios and the tremendous increase in the work on MNMT in recent years, we undertake this survey paper on MNMT to systematically organize the work in this area. To the best of our knowledge, no such comprehensive survey on MNMT exists. Our goal is to shed light on various MNMT scenarios, fundamental questions in MNMT, basic principles, architectures, and datasets of MNMT systems. The remainder of this paper is structured as follows: We present a systematic categorization of different approaches to MNMT in each of the above-mentioned scenarios to help understand the array of design choices available while building MNMT systems (Sections 2, 3, and 4). We put the work in MNMT into a historical perspective with respect to multilingual MT in older MT paradigms (Section 5). We also describe popular multilingual datasets and the shared tasks that focus on multilingualism (Section 6). In addition, we compare MNMT with domain adaptation for NMT, which tackles the problem of improving low-resource in-domain translation (Section 7). Finally, we share our opinions on future research directions in MNMT (Section 8) and conclude this paper (Section 9).

2 Multiway NMT

The goal is learning a single model for l language pairs (s_i, t_i) ∈ L (i = 1 to l), where L ⊂ S × T, and S, T are sets of source and target languages respectively. S and T need not be mutually exclusive. Parallel corpora are available for these l language pairs. One-many, many-one and many-many NMT models have been explored in this framework. Multiway translation systems follow the standard paradigm in popular NMT systems. However, this architecture is adapted to support multiple languages. The wide range of possible architectural choices is exemplified by two highly contrasting prototypical approaches.

2.1 Prototypical Approaches

Complete Sharing. Johnson et al. (2017) proposed a highly compact model where all languages share the same embeddings, encoder, decoder, and attention mechanism. A common vocabulary, typically subword-level like byte pair encoding (BPE) (Sennrich et al., 2016b), is defined across all languages. The input sequence includes a special token (called the language tag) to indicate the target language. This enables the decoder to correctly generate the target language, though all target languages share the same decoder parameters. The model has minimal parameter size, as all languages share the same parameters, and achieves comparable or better results than bilingual systems. However, a massively multilingual system can run into capacity bottlenecks (Aharoni et al., 2019). This is a black-box model, which can use an off-the-shelf NMT system to train a multilingual system. Ha et al. (2016) proposed a similar model, but they maintained different vocabularies for each language.

This architecture is particularly useful for related languages, because they have a high degree of lexical and syntactic similarity (Sachan and Neubig, 2018). Lexical similarity can be further utilized by (a) representing all languages in a common script using script conversion (Dabre et al., 2018; Lee et al., 2017) or transliteration (Nakov and Ng (2009) for multilingual SMT), (b) using a common subword-vocabulary across all languages, e.g. character (Lee et al., 2017) and BPE (Nguyen and Chiang, 2017), (c) representing words by both character encoding and a latent embedding space shared by all languages (Wang et al., 2019).

Pinnis et al. (2018) and Lakew et al. (2018a) have compared RNN, CNN and the self-attention based architectures for MNMT. They show that self-attention based architectures outperform the other architectures in many cases.

Minimal Sharing. On the other hand, Firat et al. (2016a) proposed a model comprised of separate embeddings, encoders and decoders for each language. By sharing attention across languages, they show improvements over bilingual models. However, this model has a large number of parameters. Nevertheless, the number of parameters only grows linearly with the number of languages, while it grows quadratically for bilingual systems spanning all the language pairs in the multiway system.

2.2 Controlling Parameter Sharing

Between the extremes of parameter sharing exemplified by the above-mentioned models lies an array of choices. The degree of parameter sharing depends on the divergence between the languages involved (Sachan and Neubig, 2018) and can be controlled at various layers of the MNMT system. Sharing encoders among multiple languages is very effective and is widely used (Lee et al., 2017; Sachan and Neubig, 2018). Blackwood et al. (2018) explored target language, source language and pair specific attention parameters. They showed that target language specific attention performs better than other attention sharing configurations. For self-attention based NMT models, Sachan and Neubig (2018) explored various parameter sharing strategies. They showed that sharing the decoder self-attention and encoder-decoder inter-attention parameters is useful for linguistically dissimilar languages. Zaremoodi et al. (2018) further proposed a routing network to dynamically control parameter sharing learned from the data. Designing the right sharing strategy is important to maintaining a balance between model compactness and translation accuracy.

Dynamic Parameter or Representation Generation. Instead of defining the parameter sharing protocol a priori, Platanios et al. (2018) learned the degree of parameter sharing from the data. This is achieved by defining the language specific model parameters as a function of global parameters and language embeddings. This approach also reduces the number of language specific parameters (only language embeddings), while still allowing each language to have its own unique parameters for different network layers. In fact, the number of parameters is only a small multiple of the compact model (the multiplication factor accounts for the language embedding size) (Johnson et al., 2017), but the language embeddings can directly impact the model parameters instead of the weak influence that language tags have.

Universal Encoder Representation. Ideally, multiway systems should generate encoder representations that are language agnostic. However, the attention mechanism sees a variable number of encoder representations depending on the sentence length (this could vary for translations of the same sentence). To overcome this, an attention bridge network generates a fixed number of contextual representations that are input to the attention network (Lu et al., 2018; Vázquez et al., 2018). Murthy et al. (2018) pointed out that the contextualized embeddings are word order dependent, hence not language agnostic.

Multiple Target Languages. This is a challenging scenario because parameter sharing has to be balanced with the capability to generate sentences in each target language. Blackwood et al. (2018) added the language tag to the beginning as well as the end of the sequence to avoid its attenuation in a left-to-right encoder. Wang et al. (2018) explored multiple methods for supporting target languages: (a) a target language tag at the beginning of the decoder, (b) target language dependent positional embeddings, and (c) dividing the hidden units of each decoder layer into shared and language-dependent ones. Each of these methods provides gains over Johnson et al. (2017), and combining all gave the best results.

2.3 Training Protocols

Joint Training. All the available language pairs are trained jointly to minimize the mean negative log-likelihood for each language pair. As some language pairs would have more data than others, the model may be biased. To avoid this, sentence pairs from different language pairs are sampled to maintain a healthy balance. Mini-batches can be comprised of a mix of samples from different language pairs (Johnson et al., 2017) or the training schedule can cycle through mini-batches consisting of a language pair only (Firat et al., 2016a). For architectures with language specific layers, the latter approach is convenient to implement.

Knowledge Distillation. In this approach suggested by Tan et al. (2019), bilingual models are first trained for all language pairs involved. These bilingual models are used as teacher models to train a single student model for all language pairs. The student model is trained using a linear interpolation of the standard likelihood loss and a distillation loss that captures the distance between the output distributions of the student and teacher models. The distillation loss is applied for a language pair only if the teacher model shows better translation accuracy than the student model on the validation set. This approach shows better results than joint training of a black-box model, but training time increases significantly because bilingual models also have to be trained.

3 Low or Zero-Resource MNMT

An important motivation for MNMT is to improve or support translation for language pairs with scarce or no parallel corpora, by utilizing training data from high-resource language pairs. In this section, we will discuss the MNMT approaches that specifically address the low or zero-resource scenario.

3.1 Transfer Learning

Transfer learning (Pan and Yang, 2010) has been widely explored to address low-resource translation, where knowledge learned from a high-resource language pair is used to improve the NMT performance on a low-resource pair.

Training. Most studies have explored the following setting: the high-resource and low-resource language pairs share the same target language. Zoph et al. (2016) first showed that transfer learning can benefit low-resource language pairs. First, they trained a parent model on a high-resource language pair. The child model is initialized with the parent’s parameters wherever possible and trained on the small parallel corpus for the low-resource pair. This process is known as fine-tuning. They also studied the effect of fine-tuning only a subset of the child model’s parameters (source and target embeddings, RNN layers and attention). The initialization has a strong regularization effect in training the child model. Gu et al. (2018b) used the model agnostic meta learning (MAML) framework (Finn et al., 2017) to learn appropriate parameter initialization from the parent pair(s) by taking the child pair into consideration. Instead of fine-tuning, both language pairs can also be jointly trained (Gu et al., 2018a).

Language Relatedness. Zoph et al. (2016) and Dabre et al. (2017b) have empirically shown that language relatedness between the parent and child source languages has a big impact on the possible gains from transfer learning. Kocmi and Bojar (2018) showed that transfer learning improves low-resource language translation, even
when neither the source nor the target languages are shared between the resource-rich and resource-poor language pairs. Further investigation is needed to understand the gains in translation quality in this scenario. Neubig and Hu (2018) used language relatedness to prevent overfitting when rapidly adapting a pre-trained MNMT model for low-resource scenarios. Chaudhary et al. (2019) used this approach to translate 1,095 languages to English.

Lexical Transfer. Zoph et al. (2016) randomly initialized the word embeddings of the child source language, because those could not be transferred from the parent. Gu et al. (2018a) improved on this simple initialization by mapping pre-trained monolingual embeddings of the parent and child sources to a common vector space. On the other hand, Nguyen and Chiang (2017) utilized the lexical similarity between related source languages using a small subword vocabulary. Lakew et al. (2018b) dynamically updated the vocabulary of the parent model with the low-resource language pair before transferring parameters.

Syntactic Transfer. Gu et al. (2018a) proposed to encourage better transfer of contextual representations from parents using a mixture of language experts network. Murthy et al. (2018) showed that reducing the word order divergence between source languages via pre-ordering is beneficial in extremely low-resource scenarios.

3.2 Pivoting

Zero-resource NMT was first explored by Firat et al. (2016a), where a multiway NMT model was used to translate from Spanish to French using English as a pivot language. This pivoting was done either at run time or during pre-training.

Run-Time Pivoting. Firat et al. (2016a) used a pipeline through paths in the multiway model, which first translates from French to English and then from English to Spanish. They also experimented with using the intermediate English translation as an additional source for the second stage.

Pivoting during Pre-Training. Firat et al. (2016b) used the MNMT model to first translate the Spanish side of the training corpus to English, which in turn is translated into French. This gives a pseudo-parallel French-Spanish corpus where the source is synthetic and the target is original. The MNMT model is fine-tuned on this synthetic data, and this enables direct French to Spanish translation. Firat et al. (2016b) also showed that a small clean parallel corpus between French and Spanish can be used for fine-tuning and can have the same effect as a pseudo-parallel corpus which is two orders of magnitude larger. Pivoting models can be improved if they are jointly trained, as shown by Cheng et al. (2017). Joint training was achieved by either forcing the pivot language’s embeddings to be similar or maximizing the likelihood of the cascaded model on a small source-target parallel corpus. Chen et al. (2017) proposed teacher-student learning for pivoting where they first trained a pivot-target NMT model and used it as a teacher to guide the behaviour of a source-target NMT model.

3.3 Zero-Shot

The approaches proposed so far involve pivoting or synthetic corpus generation, which is a slow process due to its two-step nature. It is more interesting, and challenging, to enable translation between a zero-resource pair without explicitly involving a pivot language during decoding or for generating pseudo-parallel corpora. This scenario is known as zero-shot NMT. Zero-shot NMT also requires a pivot language, but it is only used during training, without the need to generate pseudo-parallel corpora.

Training. Zero-shot NMT was first demonstrated by Johnson et al. (2017). However, this zero-shot translation method is inferior to pivoting. They showed that the context vectors (from attention) for unseen language pairs differ from the seen language pairs, possibly explaining the degradation in translation quality. Lakew et al. (2017) tried to overcome this limitation by augmenting the training data with the pseudo-parallel unseen pairs generated by iterative application of the same zero-shot translation. Arivazhagan et al. (2018) included explicit language invariance losses in the optimization function to encourage parallel sentences to have the same representation. Reinforcement learning for zero-shot learning was explored by Sestorain et al. (2018), where the dual learning framework was combined with rewards from language models.

Corpus Size. Work on translation for Indian languages showed that zero-shot works well only when the training corpora are extremely large (Mattoni et al., 2017). As the corpora for most Indian languages contain fewer than 100k sentences, the zero-shot approach is rather infeasible despite linguistic similarity. Lakew et al. (2017) confirmed this in the case of European languages
where small training corpora were used. Mattoni et al. (2017) also showed that zero-shot translation works well only when the training corpora are large, while Aharoni et al. (2019) show that massively multilingual models are beneficial for zero-shot translation.

Language Control. Zero-shot NMT tends to translate into the wrong language at times, and Ha et al. (2017) proposed to filter the output of the softmax so as to force the model to translate into the desired language.

4 Multi-Source NMT

If the same source sentence is available in multiple languages, then these sentences can be used together to improve the translation into the target language. This technique is known as multi-source MT (Och and Ney, 2001). Approaches for multi-source NMT can be extremely useful for creating N-lingual (N > 3) corpora such as Europarl (Koehn, 2005) and UN (Ziemski et al., 2016b). The underlying principle is to leverage redundancy in terms of source side linguistic phenomena expressed in multiple languages.

Multi-Source Available. Most studies assume that the same sentence is available in multiple languages. Zoph and Knight (2016) showed that a multi-source NMT model using separate encoders and attention networks for each source language outperforms single source models. A simpler approach concatenated multiple source sentences and fed them to a standard NMT model (Dabre et al., 2017a), with performance comparable to that of Zoph and Knight (2016). Interestingly, this model could automatically identify the boundaries between different source languages and simplify the training process for multi-source NMT. Dabre et al. (2017a) also showed that it is better to use linguistically similar source languages, especially in low-resource scenarios. Ensembling of individual source-target models is another beneficial approach, for which Garmash and Monz (2016) proposed several methods with different degrees of parameterization.

Missing Source Sentences. There can be missing source sentences in multi-source corpora. Nishimura et al. (2018b) extended Zoph and Knight (2016) by representing each “missing” source language with a dummy token. Choi et al. (2018) and Nishimura et al. (2018a) further proposed to use MT generated synthetic sentences, instead of a dummy token for the missing source languages.

Post-Editing. Instead of having a translator translate from scratch, multi-source NMT can be used to generate high quality translations. The translations can then be post-edited, a process that is less labor intensive and cheaper compared to translating from scratch. Multi-source NMT has been used for post-editing where the translated sentence is used as an additional source, leading to improvements (Chatterjee et al., 2017).

5 Multilingualism in Older Paradigms

One of the long term goals of the MT community is the development of architectures that can handle more than two languages.

RBMT. To this end, rule-based systems (RBMT) using an interlingua were explored widely in the past. The interlingua is a symbolic semantic, language-independent representation for natural language text (Sgall and Panevová, 1987). Two popular interlinguas are UNL (Uchida, 1996) and AMR (Banarescu et al., 2013). Different interlinguas have been proposed in various systems like KANT (E. H. Nyberg and Carbonell, 1997), UNL, UNITRAN (Dorr, 1987) and DLT (Witkam, 2006). Language specific analyzers converted language input to interlingua, while language specific decoders converted the interlingua into another language. To achieve an unambiguous semantic representation, a lot of linguistic analysis had to be performed and many linguistic resources were required. Hence, in practice, most interlingua systems were limited to research systems or translation in specific domains and could not scale to many languages. Over time most MT research focused on building bilingual systems.

SMT. Phrase-based SMT (PBSMT) systems (Koehn et al., 2003), a very successful MT paradigm, were also bilingual for the most part. Compared to RBMT, PBSMT requires fewer linguistic resources and instead requires parallel corpora. However, like RBMT, they work with symbolic, discrete representations, making multilingual representation difficult. Moreover, the central unit in PBSMT is the phrase, an ordered sequence of words (not in the linguistic sense). Given its arbitrary structure, it is not clear how to build a common symbolic representation for phrases across
languages. Nevertheless, some shallow forms of multilingualism have been explored in the context of: (a) pivot-based SMT, (b) multi-source PBSMT, and (c) SMT involving related languages.

Pivoting. Popular solutions are: chaining source-pivot and pivot-target systems at decoding (Utiyama and Isahara, 2007), training a source-target system using synthetic data generated using target-pivot and pivot-source systems (Gispert and Marino, 2006), and phrase-table triangulation pivoting source-pivot and pivot-target phrase tables (Utiyama and Isahara, 2007; Wu and Wang, 2007).

Multi-source. Typical approaches are: re-ranking outputs from independent source-target systems (Och and Ney, 2001), composing a new output from independent source-target outputs (Matusov et al., 2006), and translating a combined input representation of multiple sources using lattice networks over multiple phrase tables (Schroeder et al., 2009).

Related languages. For multilingual translation with multiple related source languages, the typical approaches involved script unification by mapping to a common script such as Devanagari (Banerjee et al., 2018) or transliteration (Nakov and Ng, 2009). Lexical similarity was utilized using subword-level translation models (Vilar et al., 2007; Tiedemann, 2012a; Kunchukuttan and Bhattacharyya, 2016, 2017). Combining subword-level representation and pivoting for translation among related languages has been explored (Henríquez et al., 2011; Tiedemann, 2012a; Kunchukuttan et al., 2017). Most of the above-mentioned multilingual systems involved either decoding-time operations, chaining black-box systems or composing new phrase-tables from existing ones.

Comparison with MNMT. While symbolic representations constrain a unified multilingual representation, distributed universal language representation using real-valued vector spaces makes multilingualism easier to implement in NMT. As no language specific feature engineering is required for NMT, it is possible to scale to multiple languages. Neural networks provide flexibility in experimenting with a wide variety of architectures, while advances in optimization techniques and availability of deep learning toolkits make prototyping faster.

6 Datasets and Resources

MNMT requires parallel corpora in similar domains across multiple languages.

Multiway. Commonly used publicly available multilingual parallel corpora are the TED corpus (Mauro et al., 2012), UN Corpus (Ziemski et al., 2016a) and those from the European Union like Europarl, JRC-Acquis, DGT-Acquis, DGT-TM, ECDC-TM, EAC-TM (Steinberger et al., 2014). While these sources are primarily comprised of European languages, parallel corpora for some Asian languages are accessible through the WAT shared task (Nakazawa et al., 2018). Only small amounts of parallel corpora are available for many languages, primarily from movie subtitles and software localization strings (Tiedemann, 2012b).

Low or Zero-Resource. For low or zero-resource NMT translation tasks, good test sets are required for evaluating translation quality. The above-mentioned multilingual parallel corpora can be a source for such test sets. In addition, there are other small parallel datasets like the FLORES dataset for English-{Nepali,Sinhala} (Guzmán et al., 2019), the XNLI test set spanning 15 languages (Conneau et al., 2018b) and the Indic parallel corpus (Birch et al., 2011). The WMT shared tasks (Bojar et al., 2018) also provide test sets for some low-resource language pairs.

Multi-Source. The corpora for multi-source NMT have to be aligned across languages. Multi-source corpora can be extracted from some of the above-mentioned sources. The following are widely used for evaluation in the literature: Europarl (Koehn, 2005), TED (Tiedemann, 2012b), UN (Ziemski et al., 2016b). The Indian Language Corpora Initiative (ILCI) corpus (Jha, 2010) is an 11-way parallel corpus of Indian languages along with English. The Asian Language Treebank (Thu et al., 2016) is a 9-way parallel corpus of South-East Asian languages along with English, Japanese and Bengali. The MMCR4NLP project (Dabre and Kurohashi, 2017) compiles language family grouped multi-source corpora and provides standard splits.

Shared Tasks. Recently, shared tasks with a focus on multilingual translation have been conducted at IWSLT (Cettolo et al., 2017), WAT (Nakazawa et al., 2018) and WMT (Bojar et al., 2018); so common benchmarks are available.
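As an illustration of the requirement in Section 6 that corpora for multi-source NMT be aligned across languages, the following is a minimal sketch of one way an N-way aligned corpus might be assembled from several bilingual corpora that share one side (for example, English). The function name `build_multiway_corpus`, the exact-match intersection on the shared side, and the `"pivot"` key are assumptions made for illustration; they are not taken from any of the cited corpora or toolkits, and real pipelines typically need normalization and fuzzy matching.

```python
from typing import Dict, List, Tuple


def build_multiway_corpus(
    bitexts: Dict[str, List[Tuple[str, str]]],
) -> List[Dict[str, str]]:
    """Intersect several (pivot, translation) bitexts on the pivot sentence.

    `bitexts` maps a language code to a list of (pivot_sentence, translation)
    pairs. The result has one row per pivot sentence found in every bitext.
    Hypothetical helper: exact string match on the pivot side is assumed.
    """
    # Index each bitext by its pivot-side sentence (first occurrence wins).
    indexed: Dict[str, Dict[str, str]] = {}
    for lang, pairs in bitexts.items():
        table: Dict[str, str] = {}
        for pivot, other in pairs:
            table.setdefault(pivot.strip(), other.strip())
        indexed[lang] = table

    if not indexed:
        return []

    # Keep only pivot sentences that have a translation in every language.
    shared = set.intersection(*(set(table) for table in indexed.values()))

    rows: List[Dict[str, str]] = []
    for pivot in sorted(shared):
        row = {"pivot": pivot}
        for lang, table in indexed.items():
            row[lang] = table[pivot]
        rows.append(row)
    return rows
```

Calling the function with, say, an English-French and an English-Spanish bitext returns only the sentences present in both, each with its French and Spanish translations. Sentences missing from one of the bitexts are dropped here; keeping them would instead lead to the missing-source setting discussed for multi-source NMT.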
7 Connections with Domain Adaptation

High quality parallel corpora are limited to specific domains. Both vanilla SMT and NMT perform poorly for domain specific translation in low-resource scenarios (Duh et al., 2013; Koehn and Knowles, 2017). Leveraging out-of-domain parallel corpora and in-domain monolingual corpora for in-domain translation is known as domain adaptation for MT (Chu and Wang, 2018).

As we can treat each domain as a language, there are many similarities and common approaches between MNMT and domain adaptation for NMT. Therefore, similar to MNMT, when using out-of-domain parallel corpora for domain adaptation, multi-domain NMT and transfer learning based approaches (Chu et al., 2017) have been proposed for domain adaptation. When using in-domain monolingual corpora, a typical way of doing domain adaptation is generating a pseudo-parallel corpus by back-translating target in-domain monolingual corpora (Sennrich et al., 2016a), which is similar to the pseudo-parallel corpus generation in MNMT (Firat et al., 2016b).

There are also many differences between MNMT and domain adaptation for NMT. While pivoting is a popular approach for MNMT (Cheng et al., 2017), it is unsuitable for domain adaptation. As there are always vocabulary overlaps between different domains, there are no zero-shot translation (Johnson et al., 2017) settings in domain adaptation. In addition, it is not uncommon to write domain specific sentences in different styles, and so multi-source approaches (Zoph and Knight, 2016) are not applicable either. On the other hand, data selection approaches in domain adaptation that select out-of-domain sentences which are similar to in-domain sentences (2017a) have not been applied to MNMT. In addition, instance weighting approaches (Wang et al., 2017b) that interpolate in-domain and out-of-domain models have not been studied for MNMT. However, with the development of cross-lingual sentence embeddings, data selection and instance weighting approaches might be applicable for MNMT in the near future.

8 Future Research Directions

While exciting advances have been made in MNMT in recent years, there are still many interesting directions for exploration.

Language Agnostic Representation Learning. A core question that needs further investigation is: how do we build encoder and decoder representations that are language agnostic? Particularly, the questions of word-order divergence between the source languages and variable length encoder representations have received little attention.

Multiple Target Language MNMT. Most current efforts address multiple source languages. Multiway systems for multiple low-resource target languages need more attention. The right balance between sharing representations vs. maintaining the distinctiveness of the target language for generation needs exploring.

Explore Pre-training Models. Pre-training embeddings, encoders and decoders have been shown to be useful for NMT (Ramachandran et al., 2017). How pre-training can be incorporated into different MNMT architectures is an important question as well. Recent advances in cross-lingual word (Klementiev et al., 2012; Mikolov et al., 2013; Chandar et al., 2014; Artetxe et al., 2016; Conneau et al., 2018a; Jawanpuria et al., 2019) and sentence embeddings (Conneau et al., 2018b; Chen et al., 2018a; Artetxe and Schwenk, 2018) could provide directions for this line of investigation.

Related Languages, Language Registers and Dialects. Translation involving related languages, language registers and dialects can be further explored given the importance of this use case.

Code-Mixed Language. Addressing intra-sentence multilingualism, i.e. code-mixed input and output, creoles and pidgins, is an interesting research direction. The compact MNMT models can handle code-mixed input, but code-mixed output remains an open problem (Johnson et al., 2017).

Multilingual and Multi-Domain NMT. Jointly tackling multilingual and multi-domain translation is an interesting direction with many practical use cases. When extending an NMT system to a new language, the parallel corpus in the domain of interest may not be available. Transfer learning in this case has to span languages and domains.

9 Conclusion

MNMT has made rapid progress in the recent past. In this survey, we have covered literature pertaining to the major scenarios we identified for multilingual NMT: multiway, low or zero-resource (transfer learning, pivoting, and zero-shot approaches) and multi-source translation. We have systematically compiled the principal design approaches and their variants, central MNMT issues and their proposed solutions along with their
