Multimodal Pretraining From Monolingual To Multilingual Language Acquisition
Abstract

...number of languages, M-VLP models often require huge computing resources and cannot be flexibly extended to new languages. In this work, we propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into multilingual. Specifically, we design a lightweight language acquisition encoder based on state-of-the-art monolingual VLP models. We further propose a two-stage training strategy to optimize the language acquisition encoder, namely the Native Language Transfer stage and the Language Exposure stage. With much less multilingual training data and computing resources, our model achieves state-of-the-art performance on multilingual image-text and video-text retrieval benchmarks.

Figure 1: Comparison of data usage between M-VLP and MLA: (a) Multilingual Vision-Language Pre-training; (b) MultiLingual Acquisition. The size of a circle reflects the amount of training data. M-VLP models learn from vision-language data in multiple languages simultaneously. Instead, MLA generalizes a monolingual VLP into multilingual with much less training data.

1. Introduction

We are living in a multimodal and multilingual world. The information we receive in our daily lives may come from different modalities and languages. Therefore, building multimodal and multilingual models that effectively understand such information has attracted much research attention (Gella et al., 2017; Wehrmann et al., 2019; Kim et al., 2020; Burns et al., 2020). Recently, Multilingual Vision-Language Pre-training (M-VLP) has achieved convincing performance on various cross-lingual cross-modal tasks such as multilingual image-text retrieval (Ni et al., 2021; Zhou et al., 2021; Fei et al., 2021; Huang et al., 2021; Jain et al., 2021) and multimodal machine translation (Song et al., 2021). As shown in Figure 1(a), M-VLP models handle multiple languages and modalities simultaneously during pre-training. Despite their successes, M-VLP models suffer from two problems. First, pre-training on vision and multilingual data consumes huge computing resources. For example, the state-of-the-art M-VLP model MURAL (Jain et al., 2021) is pre-trained on 128 Cloud TPUv3 for four days. It can support multimodal tasks in 100+ languages; however, considering that there are 6,900+ languages worldwide (Zhou et al., 2021), building such a single model to handle all languages would be highly expensive. Second, M-VLP models cannot be flexibly extended to new languages. Additional training is required for an M-VLP model to achieve satisfactory performance on a new language, but this training causes performance degeneration on the original languages due to the limited model capacity. In fact, the limited model capacity even results in M-VLP models performing worse than their monolingual counterparts on English (Ni et al., 2021; Zhou et al., 2021).

To build multimodal and multilingual models with low cost and high flexibility, we take inspiration from how humans acquire new languages. Humans normally learn their native language during childhood and practice it through interactions with the multimodal living environment. When learning a new language, we initially tend to align it with the native language, as we can easily map words in the native language to real-world objects and concepts. After building a certain language foundation, we can further master the new language by interacting with the environment directly in that language; this is known as language exposure (Castello, 2015). The whole learning process rarely degrades our native language capability.
Inspired by this, we propose a new framework, MultiLingual Acquisition (MLA), which constructs multimodal and multilingual models based on monolingual VLPs. The topology of an MLA-based multimodal and multilingual model is illustrated in Figure 1(b). Unlike M-VLPs, which handle data from multiple languages and modalities in a single model, MLA generalizes monolingual VLPs into multilingual with much less training data through a language acquisition encoder. The language acquisition encoder is realized by inserting our proposed lightweight language acquirers into the pre-trained monolingual encoder of the VLP model. During training, the original parameters of the pre-trained monolingual encoder are fixed; only the multilingual embeddings and the language acquirers for each new language are optimized. Following human language learning habits, we propose a two-stage training strategy for the language acquisition encoder. In the Native Language Transfer (NLT) stage, the model is optimized to establish the correspondence between the new languages and the native language. In the Language Exposure (LE) stage, the model is optimized to build cross-modal alignment between the new languages and images. We apply the proposed MLA to the monolingual VLP model CLIP (Radford et al., 2021) and achieve state-of-the-art results on both multilingual image-text and video-text retrieval benchmarks with much less training data and computing resources. Ablation studies demonstrate the effectiveness of our training strategy. Owing to the independence of the language acquirers, MLA-based models can easily be extended to new languages without compromising performance on the original languages. The main contributions of our work are as follows:

• We propose a lightweight MultiLingual Acquisition (MLA) framework that can easily generalize monolingual VLPs into multilingual.
• We propose a two-stage training strategy, inspired by the language learning habits of humans, to optimize MLA-based models. Ablation studies prove the effectiveness of the strategy.
• We apply MLA to the monolingual VLP model CLIP and achieve new state-of-the-art results on both multilingual image-text and video-text retrieval benchmarks with much less training data and fewer parameters.

2. Related Work

Vision-Language Pre-training: There is increasing interest in building Vision-Language Pre-training (VLP) models. From the perspective of how the vision and language modalities interact, existing models can be divided into two categories: single-stream and dual-stream models. Single-stream models let image and text interact directly through a cross-modal transformer (Chen et al., 2020; Li et al., 2020b; Kim et al., 2021). In contrast, dual-stream models encode image and text with two independent encoders and are optimized with simple objectives such as image-text contrastive learning (Radford et al., 2021; Jia et al., 2021; Yuan et al., 2021). Compared with single-stream models, dual-stream models are more efficient at exploiting noisy image-text data harvested from the web (Huo et al., 2021), and thus achieve better performance and transferability across downstream tasks. Meanwhile, dual-stream models are more flexible to extend: since they process images and text through independent encoders, we can fix the vision encoder and focus on extending the text encoder to support new languages. Therefore, we focus on generalizing dual-stream VLPs into multilingual in this work.

Multilingual Vision-Language Pre-training: To achieve both multilingual and multimodal capability, many works try to learn the relationships between multiple languages and modalities simultaneously through pre-training. M3P (Ni et al., 2021) introduces a multimodal code-switched training method to enhance multilingual transferability. UC2 (Zhou et al., 2021) augments English image-text data to other languages through machine translation and proposes the MRTM and VTLM objectives to encourage fine-grained alignment between images and multiple languages. More recently, MURAL (Jain et al., 2021) adopts the dual-stream structure and is pre-trained with image-text and text-text contrastive objectives on multilingual image-text pairs and translation pairs. M-VLP models significantly outperform previous non-pretraining models (Gella et al., 2017; Wehrmann et al., 2019; Kim et al., 2020; Burns et al., 2020) on multilingual image-text retrieval. Despite their success, these models typically consume huge computing resources and large-scale multilingual training data. Moreover, they fail to take full advantage of the cross-modal knowledge learned in monolingual VLP, and building cross-modal cross-lingual representations from scratch can be very hard. In contrast, our MLA framework aims to generalize VLP models into multilingual, and it builds multimodal and multilingual models with much less data and computing cost.

Multilingual Extension: Some works explore making pre-trained monolingual language models multilingual. Reimers et al. extend sentence embeddings from monolingual to multilingual via Multilingual Knowledge Distillation (MKD) (Reimers & Gurevych, 2020): given translation pairs, MKD optimizes a multilingual student model to produce sentence embeddings similar to those of a monolingual teacher model. Artetxe et al. extend monolingual models by training additional word embeddings (Artetxe et al., 2020). MAD-X (Pfeiffer et al., 2020) extends multilingual pre-trained models to support low-resource languages through adapters (Houlsby et al., 2019). By extending state-of-the-art pre-trained language models, these works have achieved impressive results on NLP tasks such as bitext retrieval (Reimers & Gurevych, 2020).
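To make the language acquirer design concrete, the sketch below shows a bottleneck adapter of the kind MLA inserts into the frozen text encoder, together with the parameter-freezing step described in the Introduction. This is a minimal PyTorch-style illustration under our own naming: the hidden size of 256 matches Sec. 4.2, but the residual wiring, GELU activation, and class names are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class LanguageAcquirer(nn.Module):
    """Lightweight bottleneck module inserted into each frozen encoder layer.

    Illustrative sketch only: hidden size 256 follows Sec. 4.2, while the
    residual connection and GELU activation are assumptions.
    """
    def __init__(self, d_model: int = 512, d_hidden: int = 256):
        super().__init__()
        self.down = nn.Linear(d_model, d_hidden)   # project to the small bottleneck
        self.up = nn.Linear(d_hidden, d_model)     # project back to the model width
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual form: with near-zero initialization of `up`, the acquirer
        # initially leaves the frozen layer's output unchanged.
        return x + self.up(self.act(self.down(x)))


def trainable_parameters(text_encoder: nn.Module,
                         acquirers: nn.ModuleDict,
                         non_native_embedding: nn.Embedding):
    """Freeze the pre-trained encoder; only the per-language acquirers and the
    non-native embedding block receive gradients."""
    for p in text_encoder.parameters():
        p.requires_grad = False
    return list(acquirers.parameters()) + list(non_native_embedding.parameters())
```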
Figure 2: Model illustration: (a) the overview of the MLA framework; (b) the structure of a language acquirer.
in a non-native language L, the sentence representation $t_i = \Phi'(T_i; \theta_\Phi, \theta_L, \theta_{emb})$ should be closer to the aligned image representation $v_i = \Psi(V_i; \theta_\Psi)$ and farther from misaligned ones $v_j = \Psi(V_j; \theta_\Psi)$, $j \neq i$. This can be achieved by performing contrastive learning between non-native languages and images. For a non-native sentence $T_i$, we treat the corresponding image $V_i$ as a positive sample and the other images in the same batch, $V_j$ with $j \neq i$, as negative samples, and vice versa for images. The objective in the LE stage is to minimize the NCE loss defined as follows:

$$\mathcal{L}_{LE} = \frac{1}{2}\left(\mathcal{L}_{v2t} + \mathcal{L}_{t2v}\right) \qquad (13)$$

$$\mathcal{L}_{v2t} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{k=1}^{B} \exp(\mathrm{sim}(v_i, t_k)/\tau)} \qquad (14)$$

$$\mathcal{L}_{t2v} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{k=1}^{B} \exp(\mathrm{sim}(v_k, t_i)/\tau)} \qquad (15)$$

where $B$ is the batch size, $\mathrm{sim}(x, y) = \frac{x^\top y}{\|x\|\|y\|}$ is the cosine similarity between two vectors, and $\tau$ is a temperature hyper-parameter that scales the logits. Note that although the image-to-text loss $\mathcal{L}_{v2t}$ is optimized, the pre-trained vision encoder is kept frozen during training. As in NLT, the trainable parameters in LE are those of the language acquirers and the non-native embedding block.
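A minimal PyTorch sketch of the symmetric NCE objective in Eqs. (13)-(15) is given below, assuming batched image features v and non-native text features t whose matching pairs share the same batch index; the function name and the use of cross-entropy over a similarity matrix are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def le_loss(v: torch.Tensor, t: torch.Tensor, tau: float = 0.01) -> torch.Tensor:
    """Symmetric NCE loss of Eqs. (13)-(15).

    v: (B, D) image features from the frozen vision encoder.
    t: (B, D) non-native sentence features from the language acquisition encoder.
    The i-th image and i-th sentence form a positive pair; all other in-batch
    combinations serve as negatives.
    """
    v = F.normalize(v, dim=-1)                      # cosine similarity = dot product
    t = F.normalize(t, dim=-1)                      # of L2-normalized vectors
    logits = v @ t.t() / tau                        # (B, B) scaled similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, labels)      # image-to-text, Eq. (14)
    loss_t2v = F.cross_entropy(logits.t(), labels)  # text-to-image, Eq. (15)
    return 0.5 * (loss_v2t + loss_t2v)              # Eq. (13)
```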
4. Experiments

In this section, we first introduce the datasets used in this paper and then present detailed experiments to evaluate the proposed MLA framework.

4.1. Dataset Description

We train our model with the Conceptual Captions (CC) dataset (Sharma et al., 2018) and two translation-enhanced versions of CC (Zhou et al., 2021; Carlsson, 2021). We use Multi30K (Elliott et al., 2016), MSCOCO (Chen et al., 2015; Li et al., 2019; Yoshikawa et al., 2017) and XTD (Aggarwal & Kale, 2020) for multilingual image-text retrieval evaluation, and MSRVTT (Xu et al., 2016; Huang et al., 2021) for multilingual video-text retrieval evaluation.

Conceptual Captions (CC) (Sharma et al., 2018) contains 3.3 million image-text pairs in English crawled from the Web¹. We also randomly select 300K image-text pairs, denoted as CC300K, for training our model to show the low-cost merit of MLA. For multilingual sentences, we leverage two translation-augmented CC datasets: (1) CC6L (Zhou et al., 2021), which translates all English captions of CC into five languages (German(de), French(fr), Czech(cs), Japanese(ja) and Chinese(zh)); and (2) CC69L (Carlsson, 2021), which contains 27K captions in each of 68 languages translated from English². Considering the languages of the downstream datasets, we train the model with CC6L for multilingual image-text retrieval and with CC69L for multilingual video-text retrieval.

Multi30K (Elliott et al., 2016) is built upon Flickr30K (Young et al., 2014). The English(en) captions are manually translated into German(de), French(fr) and Czech(cs). It contains 31K images paired with 5 captions per image in English and German, and 1 caption per image in French and Czech. We use the standard train, dev and test splits defined in (Young et al., 2014).

MSCOCO (Chen et al., 2015) contains 123K images with 5 English captions per image. (Yoshikawa et al., 2017) annotates 5 Japanese captions per image, and (Li et al., 2019) extends MSCOCO with Chinese captions for 20K images. We follow the standard train, dev and test splits for English and Japanese as in (Karpathy & Fei-Fei, 2015). For Chinese, we can only perform zero-shot evaluation on the test split defined in (Li et al., 2019), as its full splits overlap with the English and Japanese splits.

XTD (Aggarwal & Kale, 2020) provides captions in 11 languages (English(en), German(de), French(fr), Chinese(zh), Japanese(ja), Italian(it), Spanish(es), Russian(ru), Polish(pl), Turkish(tr), Korean(ko)) for 1K MSCOCO images. Except for Japanese, all non-English captions are translated directly from the English captions. We use this dataset for zero-shot image-text retrieval evaluation only.

MSRVTT (Xu et al., 2016) is a video captioning dataset with 10K videos, where each video is annotated with 20 English captions. Huang et al. translate the English captions into 8 languages (German(de), French(fr), Russian(ru), Spanish(es), Czech(cs), Swahili(sw), Chinese(zh) and Vietnamese(vi)) via a machine translation service (Huang et al., 2021). We follow the standard train/dev splits in (Xu et al., 2016) and evaluate on the 1K test split as described in (Yu et al., 2018).

¹ We can only access ∼2.5 million images due to broken URLs.
² We remove captions of inaccessible images, leaving ∼20K captions for each language.

4.2. Implementation Details

We apply MLA to two VLP models, CLIP-ViT-B-32 and CLIP-ViT-B-16 (Radford et al., 2021), denoted as MLACLIP and MLACLIP16, respectively. The hidden dimension of the language acquirers is set to 256, and all language acquirers for a single non-native language amount to only 3.14 MB of parameters. The non-native embedding matrix is initialized with M-BERT (Devlin et al., 2019); it costs 92.2 MB and is shared across all non-native languages. We train two separate models for multilingual image-text retrieval and video-text retrieval. For the image model, we train with CC6L (Zhou et al., 2021); for the video model, we use multilingual captions from CC69L (Carlsson, 2021).
For both models, we optimize the language acquirers for the different languages iteratively with a batch size of 128. The NLT stage runs 117,150 steps with a learning rate of 1e-4, and the LE stage runs 11,715 steps with a learning rate of 3e-6. The temperature τ is set to 0.01. For both stages, we use the Adam optimizer (Kingma & Ba, 2015) with a linear warm-up over the first 10% of steps. The whole training process takes about 12 hours to converge on a single Nvidia V100 GPU.
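The reported schedule can be summarized in the following sketch. The numeric values come from the paragraph above; the Adam/LambdaLR wiring, the constant rate after warm-up, and all names are illustrative assumptions rather than the released training code.

```python
import torch

# Hyper-parameters from Sec. 4.2.
BATCH_SIZE = 128
TEMPERATURE = 0.01
NLT_STEPS, NLT_LR = 117_150, 1e-4
LE_STEPS, LE_LR = 11_715, 3e-6
WARMUP_FRACTION = 0.10

def build_optimizer(trainable_params, lr: float, total_steps: int):
    """Adam with a linear warm-up over the first 10% of steps.

    The paper does not state what happens after warm-up, so the learning
    rate is simply held constant here (an assumption).
    """
    optimizer = torch.optim.Adam(trainable_params, lr=lr)
    warmup_steps = max(1, int(WARMUP_FRACTION * total_steps))
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=lambda step: min(1.0, step / warmup_steps),
    )
    return optimizer, scheduler
```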
Method            Trainable Params   Computing Costs
M3P               566 M              4×V100×7d
UC2               478 M              8×V100×4d
MURAL             300 M              128×TPUv3×4d
Ours (MLACLIP)    108 M              1×V100×0.5d

Table 2: Comparison of trainable parameters and computing costs between MLA and M-VLPs.

Method        Data                   AR on Multi30K and MSCOCO
MURAL         TrTrain(CC12M)+EOBT    91.0  87.3  86.4  82.4  89.4  87.4  73.7  71.9
MURAL†        AT+MBT                 92.2  88.6  87.6  84.2  88.6  88.4  75.4  74.9
MLACLIP       TrTrain(CC300K)        92.0  86.8  85.4  82.3  89.3  88.1  75.7  73.2
MLACLIP16     TrTrain(CC300K)        94.5  89.7  89.2  85.9  91.3  90.4  79.4  76.5

Table 1: Multilingual image-text retrieval results on Multi30K and MSCOCO. TrTrain: Translate-train, FT-En: Fine-tune on English, FT-All: Fine-tune on All. †: Models trained with publicly unavailable datasets. ‡: Models fine-tuned on COCO-CN (Li et al., 2019), whose train split overlaps with the English and Japanese test splits. Best results are in bold and second best are underlined.

4.3. Evaluation on Multilingual Image-Text Retrieval

In multilingual image-text retrieval, models are given a sentence in a certain language and must find the most semantically relevant image in an image database, and vice versa. We compare our model with state-of-the-art multilingual vision-language pre-training methods under three settings:

• Zero-shot: we directly evaluate the model without fine-tuning on downstream datasets.
• Fine-tune on English: we first fine-tune the VLP model on downstream English data. We then insert the language acquirers and non-native embedding block into the fine-tuned model and evaluate on other languages directly.
• Fine-tune on All: after Fine-tune on English, we further fine-tune the language acquirers and non-native embedding block while freezing the other parts of the model.

Following previous works (Ni et al., 2021; Zhou et al., 2021; Jain et al., 2021), we report Average Recall (AR), the average over Recall@1, Recall@5, and Recall@10 in the two retrieval directions (image→text, text→image). The results are shown in Table 1, and a comparison of computing costs and parameters can be found in Table 2.

Under the Zero-shot setting, we observe that MLACLIP performs significantly better than state-of-the-art M-VLP models on English. This is because MLACLIP completely preserves the strong English performance of CLIP. In contrast, M-VLP models typically perform worse than their monolingual counterparts on English (M3P 57.9 vs. Unicoder-VL (Li et al., 2020a) 72.0; MURAL 80.9 vs. ALIGN (Jia et al., 2021) 84.3). MLACLIP also outperforms M-VLP models on the other languages. For example, MLACLIP achieves a 78.7 average recall score on German, outperforming MURAL by 2.7%. Note that the pre-training dataset of MURAL contains 12 million image-text pairs for each language, while MLACLIP uses only 300K training image-text pairs. This demonstrates that MLA is a highly data-efficient way to equip monolingual VLP models with multilingual capability. Under the Fine-tune on English setting, MLA shows strong cross-lingual transfer capability. Under the Fine-tune on All setting, MLACLIP performs slightly worse than MURAL, which was pre-trained on the publicly unavailable dataset AT+MBT (Jain et al., 2021). We believe the reason is that MURAL has more trainable parameters than MLACLIP for fine-tuning (300M vs. 108M, as shown in Table 2), which makes it easier to fit downstream datasets of a certain scale such as Multi30K and MSCOCO.
MLACLIP16 achieves state-of-the-art results on all languages under all three settings. This indicates that if stronger VLP models such as ALIGN-L2 (Jia et al., 2021) or Florence (Yuan et al., 2021) are provided, even better performance on multilingual image-text retrieval could be reached through MLA.
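For reference, the Average Recall metric reported above can be computed as sketched below. The sketch assumes a precomputed image-to-text similarity matrix with a one-to-one ground-truth mapping along the diagonal; the multi-caption handling of Multi30K/MSCOCO is omitted for brevity, and the function names are our own.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries (rows) whose ground-truth item (same column index)
    appears among the top-k retrieved columns."""
    ranking = (-sim).argsort(axis=1)
    ground_truth = np.arange(sim.shape[0])[:, None]
    return float((ranking[:, :k] == ground_truth).any(axis=1).mean())

def average_recall(sim_img2txt: np.ndarray) -> float:
    """AR: mean of R@1/5/10 over image-to-text and text-to-image retrieval."""
    scores = []
    for sim in (sim_img2txt, sim_img2txt.T):
        scores.extend(recall_at_k(sim, k) for k in (1, 5, 10))
    return float(np.mean(scores))
```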
4.4. Evaluation on Multilingual Video-Text Retrieval

In multilingual video-text retrieval, the model searches for the most semantically relevant videos given a text query in a certain language. Following (Luo et al., 2021), we first uniformly sample 12 frames from each video and use the pre-trained vision encoder to extract a representation for each frame. We then mean-pool the frame representations to obtain the video representation.
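A sketch of this video encoding procedure is shown below, assuming a frozen CLIP-style vision encoder that maps a batch of frames to per-frame embeddings; the helper names are illustrative.

```python
import torch

def uniform_frame_indices(num_frames_total: int, num_samples: int = 12) -> torch.Tensor:
    """Indices of uniformly spaced frames across the whole clip."""
    return torch.linspace(0, num_frames_total - 1, num_samples).long()

@torch.no_grad()
def encode_video(frames: torch.Tensor, vision_encoder: torch.nn.Module) -> torch.Tensor:
    """frames: (12, C, H, W) uniformly sampled frames.
    Returns a single (D,) video representation by mean pooling the frozen
    per-frame features, as described above."""
    frame_features = vision_encoder(frames)   # (12, D)
    return frame_features.mean(dim=0)         # mean pooling over frames
```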
We also evaluate the models under the three settings described in Sec. 4.3 and report the text→video Recall@1 scores in Table 3.

Setting  Method                            en    de    fr    cs    zh    ru    vi    sw    es    mean
ZS       Ours (MLACLIP w/o LE)             30.8  18.3  18.9  14.5  18.6  12.6  7.2   10.2  19.3  16.7
ZS       Ours (MLACLIP)                    30.8  20.1  22.0  15.7  18.3  14.4  8.2   10.7  20.2  17.8
FT-En    XLM-R-MMP (Huang et al., 2021)    23.8  19.4  20.7  19.3  18.2  19.1  8.2   8.4   20.4  17.5
FT-En    Ours (MLACLIP)                    42.5  26.1  26.7  20.5  25.3  18.9  12.9  12.6  27.2  23.6
FT-All   XLM-R-MMP (Huang et al., 2021)    23.1  21.1  21.8  20.7  20.0  20.5  10.9  14.4  21.9  19.4
FT-All   Ours (MLACLIP)                    42.5  33.1  34.5  30.5  31.6  28.9  16.9  24.3  33.5  30.6

Table 3: Multilingual video-text retrieval results on MSRVTT. ZS: Zero-shot, FT-En: Fine-tune on English, FT-All: Fine-tune on All.

Under the Zero-shot setting, MLACLIP, which is trained on CC69L without using any video data, achieves comparable or even better results than the fine-tuned state-of-the-art M-VLP model XLM-R-MMP (Huang et al., 2021) on several languages (de: 20.1 vs. 21.1; fr: 22.0 vs. 21.8; es: 20.2 vs. 21.9). Under the Fine-tune on English and Fine-tune on All settings, MLACLIP also outperforms XLM-R-MMP significantly. We attribute this convincing performance to two factors: 1) CLIP is a strong VLP model that generalizes well to video data; 2) the proposed MLA framework transfers the open-domain knowledge learned by CLIP to other languages well. These results suggest that MLA can maintain the open-domain capability of the VLP model, which generalizes well to different downstream data.

4.5. Ablation Study

A. Training Strategy

We conduct an ablation study in Table 4 to validate the effectiveness of the proposed MLA training strategy. For settings with NLT and LE at the same stage, we add the losses of the two objectives together during training. Comparing row 1 to rows 2 and 3, we observe that applying LE at stage one leads to poor performance. This indicates that aligning with the native language is more important for the VLP model when acquiring new languages at an early stage, which is consistent with human learning habits. Comparing row 1 and row 4, we see that LE at stage two brings improvements on the new languages. Additionally, comparing row 4 and row 5 suggests that optimizing the model with NLT and LE together at stage two does not bring further improvements.

Row   Stage one        Stage two        Multi30K             MSCOCO 1K
      NLT     LE       NLT     LE       de     fr     cs     ja     zh
1     X                                 76.3   74.2   67.2   72.1   75.7
2             X                         68.2   67.7   58.6   65.9   71.7
3     X       X                         71.1   69.7   59.8   67.6   73.9
4     X                        X        78.7   77.7   70.8   74.9   78.5
5     X                X       X        78.4   77.3   69.9   74.2   78.1

Table 4: Ablation study on the training strategy.

B. Language Acquirers and Embedding Initialization

To validate the effectiveness of the proposed language acquirers, we remove the language acquirers and the M-BERT embedding initialization from the model, respectively, and evaluate on zero-shot multilingual image-text retrieval. As shown in Table 5, the performance on all languages drops significantly without language acquirers, while initializing the embedding with M-BERT (Devlin et al., 2019) brings only incremental improvements. This indicates that the language acquirers contribute most of the performance, and that MLA does not depend much on the initialization of the non-native embeddings.

Methods             Multi30K             MSCOCO 1K
                    de     fr     cs     ja     zh
MLACLIP             78.7   77.7   70.8   74.9   78.5
MLACLIP w/o LA      76.1   74.9   65.7   70.3   76.5
MLACLIP w/o EI      77.9   76.2   69.4   74.6   78.1

Table 5: Ablation study on language acquirers and embedding initialization. LA: Language Acquirers, EI: M-BERT Embedding Initialization.

C. Low-resource Languages

Image-text pairs may be rare for low-resource languages. To explore the performance of MLA in this situation, we simulate a low-resource scenario using the XTD dataset. We fine-tune MLACLIP and UC2 (pre-trained on CC6L) with a small amount of data from XTD in an unseen language. We randomly sample 600 pairs for fine-tuning, and the remaining 400 samples are evenly divided for validation and testing. Korean is chosen for this simulation, as its script and language family are not covered by CC6L.
Experimental results in Table 7 show that MLA can achieve competitive results with only a very small number of text-text pairs (row 2), and that adding image-text pairs brings further improvement (row 3). This demonstrates that MLA remains an attractive method for low-resource languages even without any image-text pairs.

      Methods    Data      Training samples: 100 / 200 / 600
1     UC2        Img-Txt   47.0 / 60.1 / 78.3
2     MLACLIP    Txt-Txt   51.7 / 62.8 / 78.7
3     MLACLIP    Both      56.7 / 66.9 / 80.1

Table 7: Low-resource performance on image-Korean retrieval.

D. Amount of Training Data

Multilingual image-text pairs may be rare in practice. To explore the performance of MLA under low-resource conditions, we conduct experiments that control the number of image-text pairs used for each language. We train the models with CC6L and evaluate on MSCOCO 1K and Multi30K under the zero-shot setting. The corresponding mean AR over the non-English languages (de, fr, cs, ja, zh) is drawn in Figure 3. We observe that MLA performs significantly better than MKD (Reimers & Gurevych, 2020) in all cases. When the amount of training data is small, the advantage of MLA is more obvious: it outperforms MKD even without the LE training stage. Additionally, when training with only 30K image-text pairs per language, MLA outperforms UC2, which is pre-trained with 3M pairs per language. MLA is thus a data-efficient method to build multilingual and multimodal models.

Figure 3: Mean Average Recall over non-English languages under different amounts of training data per language.

E. Language Extensibility

Multilingual models often need to support new languages that were not seen during training. We conduct language extension experiments to compare MLACLIP with the M-VLP model UC2 (Zhou et al., 2021) on the XTD dataset (Aggarwal & Kale, 2020). XTD covers 11 languages; 5 of them (en, de, fr, zh, ja) are seen in the pre-training stage of UC2, while the other 6 languages (it, es, ru, pl, tr, ko) are unseen. To make a fair comparison, we first train MLACLIP with the same data as UC2 and then train both models on the unseen languages with CC69L. The zero-shot image-text retrieval results on XTD are shown in Table 6. We observe a significant performance degeneration on the seen languages for UC2 when training solely on the unseen languages (row 1 vs. row 2). Even when training continues on the seen languages, the performance is still significantly reduced due to the limited model capacity (row 1 vs. row 3). In contrast, since MLA decouples the languages through separate acquirers, the performance on the seen languages is barely affected (row 4 vs. row 5). This suggests that the MLA framework can build multimodal multilingual models that are suitable for supporting an increasing number of languages.

5. Conclusion

In this paper, we propose the MultiLingual Acquisition (MLA) framework, which can generalize monolingual Vision-Language Pre-training models into multilingual with low cost and high flexibility. MLA injects language acquirers and a non-native embedding block into VLPs to support new languages. Inspired by the language learning habits of humans, we propose a two-stage training strategy to optimize the language acquirers and the non-native embedding block. Applied to CLIP, MLA achieves state-of-the-art performance on multilingual image-text and video-text retrieval benchmarks.
References

Artetxe, M., Ruder, S., and Yogatama, D. On the cross-lingual transferability of monolingual representations. In 58th Annual Meeting of the Association for Computational Linguistics, pp. 4623–4637. Association for Computational Linguistics, 2020.

Burns, A., Kim, D., Wijaya, D., Saenko, K., and Plummer, B. A. Learning to scale multilingual representations for vision-language tasks. In European Conference on Computer Vision, pp. 197–213. Springer, 2020.

Carlsson, F. Multilingual CLIP. https://github.com/FreddeFrallan/Multilingual-CLIP, 2021.

Castello, D. First language acquisition and classroom language learning: Similarities and differences. ELAL College of Arts & Law, pp. 1–18, 2015.

Chen, X., Fang, H., Lin, T., Vedantam, R., Gupta, S., Dollár, P., and Zitnick, C. L. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325, 2015.

Chen, Y.-C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., and Liu, J. UNITER: Universal image-text representation learning. In European Conference on Computer Vision. Springer, 2020.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), 2019.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.

Elliott, D., Frank, S., Sima'an, K., and Specia, L. Multi30K: Multilingual English-German image descriptions. In VL@ACL, 2016.

Fei, H., Yu, T., and Li, P. Cross-lingual cross-modal pre-training for multimodal retrieval. In 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3644–3650, 2021.

Gella, S., Sennrich, R., Keller, F., and Lapata, M. Image pivoting for learning multilingual multimodal representations. In EMNLP 2017: Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021a.

Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., and Song, D. Natural adversarial examples. In CVPR, 2021b.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for NLP. In ICML, 2019.

Huang, P.-Y., Patrick, M., Hu, J., Neubig, G., Metze, F., and Hauptmann, A. G. Multilingual multimodal pre-training for zero-shot cross-lingual transfer of vision-language models. In NAACL, 2021.

Huo, Y., Zhang, M., Liu, G., Lu, H., Gao, Y., et al. WenLan: Bridging vision and language by large-scale multi-modal pre-training, 2021.

Jain, A., Guo, M., Srinivasan, K., Chen, T., Kudugunta, S., Jia, C., Yang, Y., and Baldridge, J. MURAL: Multimodal, multitask representations across languages. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 3449–3463, 2021.

Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pp. 4904–4916. PMLR, 2021.

Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

Kim, D., Saito, K., Saenko, K., Sclaroff, S., and Plummer, B. MULE: Multimodal universal language embedding. In AAAI Conference on Artificial Intelligence, volume 34, pp. 11254–11261, 2020.

Kim, W., Son, B., and Kim, I. ViLT: Vision-and-language transformer without convolution or region supervision. In 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research. PMLR, 2021.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.

Krizhevsky, A. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.

Li, G., Duan, N., Fang, Y., Gong, M., and Jiang, D. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 11336–11344, 2020a.

Li, X., Xu, C., Wang, X., Lan, W., Jia, Z., Yang, G., and Xu, J. COCO-CN for cross-lingual image tagging, captioning, and retrieval. IEEE Transactions on Multimedia, 21(9):2347–2360, 2019.

Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision. Springer, 2020b.

Luo, H., Ji, L., Zhong, M., Chen, Y., Lei, W., Duan, N., and Li, T. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval. arXiv preprint arXiv:2104.08860, 2021.

Ni, M., Huang, H., Su, L., Cui, E., Bharti, T., Wang, L., Zhang, D., and Duan, N. M3P: Learning universal representations via multitask multilingual multimodal pre-training. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3977–3986, 2021.

Pfeiffer, J., Vulić, I., Gurevych, I., and Ruder, S. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7654–7673, 2020.

Pfeiffer, J., Geigle, G., Kamath, A., Steitz, J.-M. O., Roth, S., Vulić, I., and Gurevych, I. xGQA: Cross-lingual visual question answering. arXiv e-prints, 2021.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In 38th International Conference on Machine Learning, volume 139, pp. 8748–8763, 2021.

Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning. PMLR, 2019.

Reimers, N. and Gurevych, I. Making monolingual sentence embeddings multilingual using knowledge distillation. In 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4512–4525, 2020.

Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, 2016.

Sharma, P., Ding, N., Goodman, S., and Soricut, R. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565, 2018.

Song, Y., Chen, S., Jin, Q., Luo, W., Xie, J., and Huang, F. Product-oriented machine translation with cross-modal cross-lingual pre-training. In 29th ACM International Conference on Multimedia, 2021.

Van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Wehrmann, J., Souza, D. M., Lopes, M. A., and Barros, R. C. Language-agnostic visual-semantic embeddings. In IEEE/CVF International Conference on Computer Vision, pp. 5804–5813, 2019.

Xu, J., Mei, T., Yao, T., and Rui, Y. MSR-VTT: A large video description dataset for bridging video and language. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

Yoshikawa, Y., Shigeto, Y., and Takeuchi, A. STAIR Captions: Constructing a large-scale Japanese image caption dataset. In 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017.

Young, P., Lai, A., Hodosh, M., and Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.

Yu, Y., Kim, J., and Kim, G. A joint sequence fusion model for video question answering and retrieval. In European Conference on Computer Vision (ECCV), 2018.

Yuan, L., Chen, D., Chen, Y.-L., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.

Zhou, M., Zhou, L., Wang, S., Cheng, Y., Li, L., Yu, Z., and Liu, J. UC2: Universal cross-lingual cross-modal vision-and-language pre-training. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4155–4165, June 2021.
A. Qualitative Analysis
A.1. Case study
In Figure 4, we visualize the top-1 retrieved images for given text queries in 11 languages on the XTD dataset (Aggarwal &
Kale, 2020). Compared with the multilingual vision-language pre-training model UC2 (Zhou et al., 2021), MLA can better
capture entities, attributes, and actions to retrieve the correct image. Specifically, given simple queries that contain few
entities such as Query #1 or Query #2, the images retrieved by MLA show high consistency across languages, since the
representations of non-English queries are aligned to English in the NLT stage. For the more complex queries such as Query
#3 or Query #4, MLA also shows better fidelity to all entities in most cases.
Figure 4: Top-1 retrieved images by UC2 and MLA for given text queries in 11 languages on the XTD dataset. Only English queries are shown in this figure; the correct images are bordered in green. Query #3: "a woman sitting on a bed while a dog laying on the bed also and a cat is laying on a chair". Query #4: "a pile of bananas, apples, potatoes, and yams on a white background".
Figure 5: Representation visualization with t-SNE for the ten categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). The categories are color coded. '•' denotes an image representation, and '×' denotes a class label representation in a certain language.
Methods     Component   Multi30K             MSCOCO 1K
                        de     fr     cs     ja     zh
MLACLIP     Linear      78.2   77.6   69.3   74.6   78.0
MLACLIP     MLP         78.7   77.7   70.8   74.9   78.5
be closer than negative ones. We conduct experiments using different objectives in the two stages. As shown in Table 9, we observe that the MSE objective is more suitable for the NLT stage (row 1 vs. row 2, row 7 vs. row 8), while the NCE objective performs better for the LE stage (row 3 vs. row 4, row 5 vs. row 6). We consider the reason to be that in the NLT stage, we leverage translation pairs to build alignment between languages. Since the two sentences of a translation pair are highly semantically related, their representations can be very similar, so optimizing a strong objective like MSE during the NLT stage is feasible. During the LE stage, however, the optimization is conducted with image-text pairs. Although the image and text are semantically related, one sentence can hardly describe all the information in the image. Therefore, a weak objective like NCE is more suitable for the LE stage.
Table 9: Ablation study on objectives in the two training stages. mse: MSE objective, nce: NCE objective
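As a companion to the LE loss sketched earlier, a minimal illustration of the MSE objective used for the NLT stage is shown below. The exact NLT formulation of Sec. 3 is not reproduced in this excerpt, so pairing a non-native sentence with the frozen encoder's representation of its English translation is stated here as an assumption, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def nlt_loss(non_native_feats: torch.Tensor, native_feats: torch.Tensor) -> torch.Tensor:
    """MSE objective for the NLT stage over translation pairs.

    non_native_feats: (B, D) sentence features from the language acquisition
                      encoder for the new language.
    native_feats:     (B, D) features of the English translations from the
                      frozen native encoder; detached so that only the
                      acquirers and non-native embeddings are updated.
    """
    return F.mse_loss(non_native_feats, native_feats.detach())
```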
Methods     Multi30K                      MSCOCO 1K
            en     de     fr     cs       en     ja     zh
CMACLIP     80.2   73.9   72.8   67.0     76.3   69.8   75.1
MLACLIP     84.4   78.7   77.7   70.8     79.4   74.9   78.5
open-domain multimodal knowledge from large-scale pre-training. In contrast, MLA keeps the original text encoder fixed
and thus could maintain the open-domain capability of the pre-training model.