Multimodal Pretraining From Monolingual To Multilingual Language Acquisition
Abstract

...number of languages, M-VLP models often require huge computing resources and cannot be flexibly extended to new languages. In this work, we propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into multilingual. Specifically, we design a lightweight language acquisition encoder based on state-of-the-art monolingual VLP models. We further propose a two-stage training strategy to optimize the language acquisition encoder, namely the Native Language Transfer stage and the Language Exposure stage. With much less multilingual training data and computing resources, our model achieves state-of-the-art performance on multilingual image-text and video-text retrieval benchmarks.

Figure 1: Comparison of data usage between M-VLP and MLA: (a) Multilingual Vision-Language Pre-training; (b) MultiLingual Acquisition. The size of a circle reflects the amount of training data. M-VLP models learn from vision-language data in multiple languages simultaneously. Instead, MLA generalizes a monolingual VLP into multilingual with much less training data.

1. Introduction

We are living in a multimodal and multilingual world. The information we receive in our daily lives may come from different modalities and languages. Therefore, building multimodal and multilingual models that effectively understand such information has attracted much research attention (Gella et al., 2017; Wehrmann et al., 2019; Kim et al., 2020; Burns et al., 2020). Recently, Multilingual Vision-Language Pre-training (M-VLP) has achieved convincing performance on various cross-lingual cross-modal tasks such as multilingual image-text retrieval (Ni et al., 2021; Zhou et al., 2021; Fei et al., 2021; Huang et al., 2021; Jain et al., 2021) and multimodal machine translation (Song et al., 2021). As shown in Figure 1(a), M-VLP models handle multiple languages and modalities simultaneously during pre-training. Despite their successes, M-VLP models suffer from two problems. First, pre-training on vision and multilingual data consumes huge computing resources. For example, the state-of-the-art M-VLP model MURAL (Jain et al., 2021) is pre-trained on 128 Cloud TPUv3 for four days. It can support multimodal tasks in 100+ languages; however, considering that there are 6,900+ languages worldwide (Zhou et al., 2021), building such a single model to handle all languages would be highly expensive. Second, M-VLP models cannot be flexibly extended to new languages. Additional training is required for an M-VLP model to achieve satisfactory performance on a new language, but this training causes performance degeneration on the original languages due to the limited model capacity. In fact, the limited model capacity even results in M-VLP models performing worse than their monolingual counterparts on English (Ni et al., 2021; Zhou et al., 2021).

To build multimodal and multilingual models with low cost and high flexibility, we take inspiration from how humans acquire new languages. Humans normally learn their native language during childhood and practice it through interactions with the multimodal living environment. When learning a new language, we initially tend to align it with the native language, as we can easily map words in the native language to real-world objects and concepts. After building a certain language foundation, we can further master the new language by interacting with the environment directly in that language; this is known as language exposure (Castello, 2015). The whole learning process rarely degrades our native language capability.
Inspired by this, we propose a new framework, MultiLingual Acquisition (MLA), which constructs multimodal and multilingual models based on monolingual VLPs. The topology of an MLA-based multimodal and multilingual model is illustrated in Figure 1(b). Unlike M-VLPs, which handle data from multiple languages and modalities in a single model, MLA generalizes monolingual VLPs into multilingual with much less training data through a language acquisition encoder. The language acquisition encoder is realized by inserting our proposed lightweight language acquirers into the pre-trained monolingual encoder of the VLP model. During training, the original parameters of the pre-trained monolingual encoder are fixed; only the multilingual embeddings and the language acquirers for each new language are optimized. Following human language learning habits, we propose a two-stage training strategy for the language acquisition encoder. In the Native Language Transfer (NLT) stage, the model is optimized to establish the correspondence between the new languages and the native language. In the Language Exposure (LE) stage, the model is optimized to build cross-modal alignment between the new languages and images. We apply the proposed MLA to the monolingual VLP model CLIP (Radford et al., 2021) and achieve state-of-the-art results on both multilingual image-text and video-text retrieval benchmarks with much less training data and computing resources. Ablation studies demonstrate the effectiveness of our training strategy. Owing to the independence of the language acquirers, MLA-based models can easily be extended to new languages without compromising performance on the original languages. The main contributions of our work are as follows:

• We propose a lightweight MultiLingual Acquisition (MLA) framework that can easily generalize monolingual VLPs into multilingual.
• We propose a two-stage training strategy, inspired by the language learning habits of humans, to optimize MLA-based models. Ablation studies prove the effectiveness of the strategy.
• We apply MLA to the monolingual VLP model CLIP and achieve new state-of-the-art results on both multilingual image-text and video-text retrieval benchmarks with much less training data and fewer parameters.

2. Related Work

Vision-Language Pre-training: There is increasing interest in building Vision-Language Pre-training (VLP) models. From the perspective of how the vision and language modalities interact, existing models can be divided into two categories: single-stream and dual-stream models. Single-stream models let image and text interact directly through a cross-modal transformer (Chen et al., 2020; Li et al., 2020b; Kim et al., 2021). In contrast, dual-stream models encode image and text with two independent encoders and are optimized with simple objectives such as image-text contrastive learning (Radford et al., 2021; Jia et al., 2021; Yuan et al., 2021). Compared with single-stream models, dual-stream models are more efficient at exploiting noisy image-text data harvested from the web (Huo et al., 2021), and thus achieve better performance and transferability across downstream tasks. Meanwhile, dual-stream models are more flexible to extend: since they process images and text through independent encoders, we can fix the vision encoder and focus on extending the text encoder to support new languages. Therefore, we focus on generalizing dual-stream VLPs into multilingual in this work.

Multilingual Vision-Language Pre-training: To achieve both multilingual and multimodal capability, many works try to learn the relationships between multiple languages and modalities simultaneously through pre-training. M3P (Ni et al., 2021) introduces a multimodal code-switched training method to enhance multilingual transferability. UC2 (Zhou et al., 2021) augments English image-text data to other languages through machine translation and proposes the MRTM and VTLM objectives to encourage fine-grained alignment between images and multiple languages. More recently, MURAL (Jain et al., 2021) adopts the dual-stream structure and is pre-trained with image-text and text-text contrastive objectives on multilingual image-text pairs and translation pairs. M-VLP models significantly outperform previous non-pretraining models (Gella et al., 2017; Wehrmann et al., 2019; Kim et al., 2020; Burns et al., 2020) on multilingual image-text retrieval. Despite their success, these models typically consume huge computing resources and large-scale multilingual training data. Moreover, they fail to take full advantage of the cross-modal knowledge learned in monolingual VLP, and building cross-modal cross-lingual representations from scratch can be very hard. In contrast, our MLA framework aims to generalize VLP models into multilingual, and it builds multimodal and multilingual models with much less data and computing cost.

Multilingual Extension: Some works explore making pre-trained monolingual language models multilingual. Reimers et al. extend sentence embeddings from monolingual to multilingual via Multilingual Knowledge Distillation (MKD) (Reimers & Gurevych, 2020): given translation pairs, MKD optimizes a multilingual student model to produce sentence embeddings similar to those of a monolingual teacher model. Artetxe et al. extend monolingual models by training additional word embeddings (Artetxe et al., 2020). MAD-X (Pfeiffer et al., 2020) extends multilingual pre-trained models to support low-resource languages through adapters (Houlsby et al., 2019). By extending state-of-the-art pre-trained language models, these works have achieved impressive results on NLP tasks such as bitext retrieval (Reimers & Gurevych, 2020).
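To make the language acquirer design concrete, the sketch below shows a bottleneck adapter of the kind MLA inserts into the frozen text encoder, together with the parameter-freezing step described in the Introduction. This is a minimal PyTorch-style illustration under our own naming: the hidden size of 256 matches Sec. 4.2, but the residual wiring, GELU activation, and class names are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class LanguageAcquirer(nn.Module):
    """Lightweight bottleneck module inserted into each frozen encoder layer.

    Illustrative sketch only: hidden size 256 follows Sec. 4.2, while the
    residual connection and GELU activation are assumptions.
    """
    def __init__(self, d_model: int = 512, d_hidden: int = 256):
        super().__init__()
        self.down = nn.Linear(d_model, d_hidden)   # project to the small bottleneck
        self.up = nn.Linear(d_hidden, d_model)     # project back to the model width
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual form: with near-zero initialization of `up`, the acquirer
        # initially leaves the frozen layer's output unchanged.
        return x + self.up(self.act(self.down(x)))


def trainable_parameters(text_encoder: nn.Module,
                         acquirers: nn.ModuleDict,
                         non_native_embedding: nn.Embedding):
    """Freeze the pre-trained encoder; only the per-language acquirers and the
    non-native embedding block receive gradients."""
    for p in text_encoder.parameters():
        p.requires_grad = False
    return list(acquirers.parameters()) + list(non_native_embedding.parameters())
```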
Figure 2: Model illustration: (a) the overview of the MLA framework; (b) the structure of a language acquirer.
in a non-native language L, the sentence representation $t_i = \Phi'(T_i; \theta_\Phi, \theta_L, \theta_{emb})$ should be closer to the aligned image representation $v_i = \Psi(V_i; \theta_\Psi)$ and farther from misaligned ones $v_j = \Psi(V_j; \theta_\Psi)$, $j \neq i$. This can be achieved by performing contrastive learning between non-native languages and images. For a non-native sentence $T_i$, we treat the corresponding image $V_i$ as a positive sample and the other images in the same batch, $V_j$ with $j \neq i$, as negative samples, and vice versa for images. The objective in the LE stage is to minimize the NCE loss defined as follows:

$$\mathcal{L}_{LE} = \frac{1}{2}\left(\mathcal{L}_{v2t} + \mathcal{L}_{t2v}\right) \qquad (13)$$

$$\mathcal{L}_{v2t} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{k=1}^{B} \exp(\mathrm{sim}(v_i, t_k)/\tau)} \qquad (14)$$

$$\mathcal{L}_{t2v} = -\frac{1}{B}\sum_{i=1}^{B} \log \frac{\exp(\mathrm{sim}(v_i, t_i)/\tau)}{\sum_{k=1}^{B} \exp(\mathrm{sim}(v_k, t_i)/\tau)} \qquad (15)$$

where $B$ is the batch size, $\mathrm{sim}(x, y) = \frac{x^\top y}{\|x\|\|y\|}$ is the cosine similarity between two vectors, and $\tau$ is a temperature hyper-parameter that scales the logits. Note that although the image-to-text loss $\mathcal{L}_{v2t}$ is optimized, the pre-trained vision encoder is kept frozen during training. As in NLT, the trainable parameters in LE are those of the language acquirers and the non-native embedding block.
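A minimal PyTorch sketch of the symmetric NCE objective in Eqs. (13)-(15) is given below, assuming batched image features v and non-native text features t whose matching pairs share the same batch index; the function name and the use of cross-entropy over a similarity matrix are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def le_loss(v: torch.Tensor, t: torch.Tensor, tau: float = 0.01) -> torch.Tensor:
    """Symmetric NCE loss of Eqs. (13)-(15).

    v: (B, D) image features from the frozen vision encoder.
    t: (B, D) non-native sentence features from the language acquisition encoder.
    The i-th image and i-th sentence form a positive pair; all other in-batch
    combinations serve as negatives.
    """
    v = F.normalize(v, dim=-1)                      # cosine similarity = dot product
    t = F.normalize(t, dim=-1)                      # of L2-normalized vectors
    logits = v @ t.t() / tau                        # (B, B) scaled similarity matrix
    labels = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, labels)      # image-to-text, Eq. (14)
    loss_t2v = F.cross_entropy(logits.t(), labels)  # text-to-image, Eq. (15)
    return 0.5 * (loss_v2t + loss_t2v)              # Eq. (13)
```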
4. Experiments

In this section, we first introduce the datasets used in this paper and then present detailed experiments to evaluate the proposed MLA framework.

4.1. Dataset Description

We train our model with the Conceptual Captions (CC) dataset (Sharma et al., 2018) and two translation-enhanced versions of CC (Zhou et al., 2021; Carlsson, 2021). We use Multi30K (Elliott et al., 2016), MSCOCO (Chen et al., 2015; Li et al., 2019; Yoshikawa et al., 2017) and XTD (Aggarwal & Kale, 2020) for multilingual image-text retrieval evaluation, and MSRVTT (Xu et al., 2016; Huang et al., 2021) for multilingual video-text retrieval evaluation.

Conceptual Captions (CC) (Sharma et al., 2018) contains 3.3 million image-text pairs in English crawled from the Web¹. We also randomly select 300K image-text pairs, denoted as CC300K, for training our model to show the low-cost merit of MLA. For multilingual sentences, we leverage two translation-augmented CC datasets: (1) CC6L (Zhou et al., 2021), which translates all English captions of CC into five languages (German(de), French(fr), Czech(cs), Japanese(ja) and Chinese(zh)); and (2) CC69L (Carlsson, 2021), which contains 27K captions in each of 68 languages translated from English². Considering the languages of the downstream datasets, we train the model with CC6L for multilingual image-text retrieval and with CC69L for multilingual video-text retrieval.

Multi30K (Elliott et al., 2016) is built upon Flickr30K (Young et al., 2014). The English(en) captions are manually translated into German(de), French(fr) and Czech(cs). It contains 31K images paired with 5 captions per image in English and German, and 1 caption per image in French and Czech. We use the standard train, dev and test splits defined in (Young et al., 2014).

MSCOCO (Chen et al., 2015) contains 123K images with 5 English captions per image. (Yoshikawa et al., 2017) annotates 5 Japanese captions per image, and (Li et al., 2019) extends MSCOCO with Chinese captions for 20K images. We follow the standard train, dev and test splits for English and Japanese as in (Karpathy & Fei-Fei, 2015). For Chinese, we can only perform zero-shot evaluation on the test split defined in (Li et al., 2019), as its full splits overlap with the English and Japanese splits.

XTD (Aggarwal & Kale, 2020) provides captions in 11 languages (English(en), German(de), French(fr), Chinese(zh), Japanese(ja), Italian(it), Spanish(es), Russian(ru), Polish(pl), Turkish(tr), Korean(ko)) for 1K MSCOCO images. Except for Japanese, all non-English captions are translated directly from the English captions. We use this dataset for zero-shot image-text retrieval evaluation only.

MSRVTT (Xu et al., 2016) is a video captioning dataset with 10K videos, where each video is annotated with 20 English captions. Huang et al. translate the English captions into 8 languages (German(de), French(fr), Russian(ru), Spanish(es), Czech(cs), Swahili(sw), Chinese(zh) and Vietnamese(vi)) via a machine translation service (Huang et al., 2021). We follow the standard train/dev splits in (Xu et al., 2016) and evaluate on the 1K test split as described in (Yu et al., 2018).

¹ We can only access ∼2.5 million images due to broken URLs.
² We remove captions of inaccessible images, leaving ∼20K captions for each language.

4.2. Implementation Details

We apply MLA to two VLP models, CLIP-ViT-B-32 and CLIP-ViT-B-16 (Radford et al., 2021), denoted as MLACLIP and MLACLIP16, respectively. The hidden dimension of the language acquirers is set to 256, and all language acquirers for a single non-native language amount to only 3.14 MB of parameters. The non-native embedding matrix is initialized with M-BERT (Devlin et al., 2019); it costs 92.2 MB and is shared across all non-native languages. We train two separate models for multilingual image-text retrieval and video-text retrieval. For the image model, we train with CC6L (Zhou et al., 2021); for the video model, we use multilingual captions from CC69L (Carlsson, 2021).
For both models, we optimize the language acquirers for the different languages iteratively with a batch size of 128. The NLT stage runs 117,150 steps with a learning rate of 1e-4, and the LE stage runs 11,715 steps with a learning rate of 3e-6. The temperature τ is set to 0.01. For both stages, we use the Adam optimizer (Kingma & Ba, 2015) with a linear warm-up over the first 10% of steps. The whole training process takes about 12 hours to converge on a single Nvidia V100 GPU.
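The reported schedule can be summarized in the following sketch. The numeric values come from the paragraph above; the Adam/LambdaLR wiring, the constant rate after warm-up, and all names are illustrative assumptions rather than the released training code.

```python
import torch

# Hyper-parameters from Sec. 4.2.
BATCH_SIZE = 128
TEMPERATURE = 0.01
NLT_STEPS, NLT_LR = 117_150, 1e-4
LE_STEPS, LE_LR = 11_715, 3e-6
WARMUP_FRACTION = 0.10

def build_optimizer(trainable_params, lr: float, total_steps: int):
    """Adam with a linear warm-up over the first 10% of steps.

    The paper does not state what happens after warm-up, so the learning
    rate is simply held constant here (an assumption).
    """
    optimizer = torch.optim.Adam(trainable_params, lr=lr)
    warmup_steps = max(1, int(WARMUP_FRACTION * total_steps))
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=lambda step: min(1.0, step / warmup_steps),
    )
    return optimizer, scheduler
```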
Method            Trainable Params   Computing Costs
M3P               566 M              4×V100×7d
UC2               478 M              8×V100×4d
MURAL             300 M              128×TPUv3×4d
Ours (MLACLIP)    108 M              1×V100×0.5d

Table 2: Comparison of trainable parameters and computing costs between MLA and M-VLPs.

Method        Data                   AR on Multi30K and MSCOCO
MURAL         TrTrain(CC12M)+EOBT    91.0  87.3  86.4  82.4  89.4  87.4  73.7  71.9
MURAL†        AT+MBT                 92.2  88.6  87.6  84.2  88.6  88.4  75.4  74.9
MLACLIP       TrTrain(CC300K)        92.0  86.8  85.4  82.3  89.3  88.1  75.7  73.2
MLACLIP16     TrTrain(CC300K)        94.5  89.7  89.2  85.9  91.3  90.4  79.4  76.5

Table 1: Multilingual image-text retrieval results on Multi30K and MSCOCO. TrTrain: Translate-train, FT-En: Fine-tune on English, FT-All: Fine-tune on All. †: Models trained with publicly unavailable datasets. ‡: Models fine-tuned on COCO-CN (Li et al., 2019), whose train split overlaps with the English and Japanese test splits. Best results are in bold and second best are underlined.

4.3. Evaluation on Multilingual Image-Text Retrieval

In multilingual image-text retrieval, models are given a sentence in a certain language and must find the most semantically relevant image in an image database, and vice versa. We compare our model with state-of-the-art multilingual vision-language pre-training methods under three settings:

• Zero-shot: we directly evaluate the model without fine-tuning on downstream datasets.
• Fine-tune on English: we first fine-tune the VLP model on downstream English data. We then insert the language acquirers and non-native embedding block into the fine-tuned model and evaluate on other languages directly.
• Fine-tune on All: after Fine-tune on English, we further fine-tune the language acquirers and non-native embedding block while freezing the other parts of the model.

Following previous works (Ni et al., 2021; Zhou et al., 2021; Jain et al., 2021), we report Average Recall (AR), the average over Recall@1, Recall@5, and Recall@10 in the two retrieval directions (image→text, text→image). The results are shown in Table 1, and a comparison of computing costs and parameters can be found in Table 2.

Under the Zero-shot setting, we observe that MLACLIP performs significantly better than state-of-the-art M-VLP models on English. This is because MLACLIP completely preserves the strong English performance of CLIP. In contrast, M-VLP models typically perform worse than their monolingual counterparts on English (M3P 57.9 vs. Unicoder-VL (Li et al., 2020a) 72.0; MURAL 80.9 vs. ALIGN (Jia et al., 2021) 84.3). MLACLIP also outperforms M-VLP models on the other languages. For example, MLACLIP achieves a 78.7 average recall score on German, outperforming MURAL by 2.7%. Note that the pre-training dataset of MURAL contains 12 million image-text pairs for each language, while MLACLIP uses only 300K training image-text pairs. This demonstrates that MLA is a highly data-efficient way to equip monolingual VLP models with multilingual capability. Under the Fine-tune on English setting, MLA shows strong cross-lingual transfer capability. Under the Fine-tune on All setting, MLACLIP performs slightly worse than MURAL, which was pre-trained on the publicly unavailable dataset AT+MBT (Jain et al., 2021). We believe the reason is that MURAL has more trainable parameters than MLACLIP for fine-tuning (300M vs. 108M, as shown in Table 2), which makes it easier to fit downstream datasets of a certain scale such as Multi30K and MSCOCO.
MLACLIP16 achieves state-of-the-art results on all languages under all three settings. This indicates that if stronger VLP models such as ALIGN-L2 (Jia et al., 2021) or Florence (Yuan et al., 2021) are provided, even better performance on multilingual image-text retrieval could be reached through MLA.
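For reference, the Average Recall metric reported above can be computed as sketched below. The sketch assumes a precomputed image-to-text similarity matrix with a one-to-one ground-truth mapping along the diagonal; the multi-caption handling of Multi30K/MSCOCO is omitted for brevity, and the function names are our own.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries (rows) whose ground-truth item (same column index)
    appears among the top-k retrieved columns."""
    ranking = (-sim).argsort(axis=1)
    ground_truth = np.arange(sim.shape[0])[:, None]
    return float((ranking[:, :k] == ground_truth).any(axis=1).mean())

def average_recall(sim_img2txt: np.ndarray) -> float:
    """AR: mean of R@1/5/10 over image-to-text and text-to-image retrieval."""
    scores = []
    for sim in (sim_img2txt, sim_img2txt.T):
        scores.extend(recall_at_k(sim, k) for k in (1, 5, 10))
    return float(np.mean(scores))
```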
4.4. Evaluation on Multilingual Video-Text Retrieval

In multilingual video-text retrieval, the model searches for the most semantically relevant videos given a text query in a certain language. Following (Luo et al., 2021), we first uniformly sample 12 frames from each video and use the pre-trained vision encoder to extract a representation for each frame. We then mean-pool the frame representations to obtain the video representation.
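A sketch of this video encoding procedure is shown below, assuming a frozen CLIP-style vision encoder that maps a batch of frames to per-frame embeddings; the helper names are illustrative.

```python
import torch

def uniform_frame_indices(num_frames_total: int, num_samples: int = 12) -> torch.Tensor:
    """Indices of uniformly spaced frames across the whole clip."""
    return torch.linspace(0, num_frames_total - 1, num_samples).long()

@torch.no_grad()
def encode_video(frames: torch.Tensor, vision_encoder: torch.nn.Module) -> torch.Tensor:
    """frames: (12, C, H, W) uniformly sampled frames.
    Returns a single (D,) video representation by mean pooling the frozen
    per-frame features, as described above."""
    frame_features = vision_encoder(frames)   # (12, D)
    return frame_features.mean(dim=0)         # mean pooling over frames
```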
We also evaluate the models under the three settings described in Sec. 4.3 and report the text→video Recall@1 scores in Table 3.

Setting  Method                            en    de    fr    cs    zh    ru    vi    sw    es    mean
ZS       Ours (MLACLIP w/o LE)             30.8  18.3  18.9  14.5  18.6  12.6  7.2   10.2  19.3  16.7
ZS       Ours (MLACLIP)                    30.8  20.1  22.0  15.7  18.3  14.4  8.2   10.7  20.2  17.8
FT-En    XLM-R-MMP (Huang et al., 2021)    23.8  19.4  20.7  19.3  18.2  19.1  8.2   8.4   20.4  17.5
FT-En    Ours (MLACLIP)                    42.5  26.1  26.7  20.5  25.3  18.9  12.9  12.6  27.2  23.6
FT-All   XLM-R-MMP (Huang et al., 2021)    23.1  21.1  21.8  20.7  20.0  20.5  10.9  14.4  21.9  19.4
FT-All   Ours (MLACLIP)                    42.5  33.1  34.5  30.5  31.6  28.9  16.9  24.3  33.5  30.6

Table 3: Multilingual video-text retrieval results on MSRVTT. ZS: Zero-shot, FT-En: Fine-tune on English, FT-All: Fine-tune on All.

Under the Zero-shot setting, MLACLIP, which is trained on CC69L without using any video data, achieves comparable or even better results than the fine-tuned state-of-the-art M-VLP model XLM-R-MMP (Huang et al., 2021) on several languages (de: 20.1 vs. 21.1; fr: 22.0 vs. 21.8; es: 20.2 vs. 21.9). Under the Fine-tune on English and Fine-tune on All settings, MLACLIP also outperforms XLM-R-MMP significantly. We attribute this convincing performance to two factors: 1) CLIP is a strong VLP model that generalizes well to video data; 2) the proposed MLA framework transfers the open-domain knowledge learned by CLIP to other languages well. These results suggest that MLA can maintain the open-domain capability of the VLP model, which generalizes well to different downstream data.

4.5. Ablation Study

A. Training Strategy

We conduct an ablation study in Table 4 to validate the effectiveness of the proposed MLA training strategy. For settings with NLT and LE at the same stage, we add the losses of the two objectives together during training. Comparing row 1 to rows 2 and 3, we observe that applying LE at stage one leads to poor performance. This indicates that aligning with the native language is more important for the VLP model when acquiring new languages at an early stage, which is consistent with human learning habits. Comparing row 1 and row 4, we see that LE at stage two brings improvements on the new languages. Additionally, comparing row 4 and row 5 suggests that optimizing the model with NLT and LE together at stage two does not bring further improvements.

Row   Stage one        Stage two        Multi30K             MSCOCO 1K
      NLT     LE       NLT     LE       de     fr     cs     ja     zh
1     X                                 76.3   74.2   67.2   72.1   75.7
2             X                         68.2   67.7   58.6   65.9   71.7
3     X       X                         71.1   69.7   59.8   67.6   73.9
4     X                        X        78.7   77.7   70.8   74.9   78.5
5     X                X       X        78.4   77.3   69.9   74.2   78.1

Table 4: Ablation study on the training strategy.

B. Language Acquirers and Embedding Initialization

To validate the effectiveness of the proposed language acquirers, we remove the language acquirers and the M-BERT embedding initialization from the model, respectively, and evaluate on zero-shot multilingual image-text retrieval. As shown in Table 5, the performance on all languages drops significantly without language acquirers, while initializing the embedding with M-BERT (Devlin et al., 2019) brings only incremental improvements. This indicates that the language acquirers contribute most of the performance, and that MLA does not depend much on the initialization of the non-native embeddings.

Methods             Multi30K             MSCOCO 1K
                    de     fr     cs     ja     zh
MLACLIP             78.7   77.7   70.8   74.9   78.5
MLACLIP w/o LA      76.1   74.9   65.7   70.3   76.5
MLACLIP w/o EI      77.9   76.2   69.4   74.6   78.1

Table 5: Ablation study on language acquirers and embedding initialization. LA: Language Acquirers, EI: M-BERT Embedding Initialization.

C. Low-resource Languages

Image-text pairs may be rare for low-resource languages. To explore the performance of MLA in this situation, we simulate a low-resource scenario using the XTD dataset. We fine-tune MLACLIP and UC2 (pre-trained on CC6L) with a small amount of data from XTD in an unseen language. We randomly sample 600 pairs for fine-tuning, and the remaining 400 samples are evenly divided for validation and testing. Korean is chosen for this simulation, as its script and language family are not covered by CC6L.
Experimental results in Table 7 show that MLA can achieve competitive results with only a very small number of text-text pairs (row 2), and that adding image-text pairs brings further improvement (row 3). This demonstrates that MLA remains an attractive method for low-resource languages even without any image-text pairs.

      Methods    Data      Training samples: 100 / 200 / 600
1     UC2        Img-Txt   47.0 / 60.1 / 78.3
2     MLACLIP    Txt-Txt   51.7 / 62.8 / 78.7
3     MLACLIP    Both      56.7 / 66.9 / 80.1

Table 7: Low-resource performance on image-Korean retrieval.

D. Amount of Training Data

Multilingual image-text pairs may be rare in practice. To explore the performance of MLA under low-resource conditions, we conduct experiments that control the number of image-text pairs used for each language. We train the models with CC6L and evaluate on MSCOCO 1K and Multi30K under the zero-shot setting. The corresponding mean AR over the non-English languages (de, fr, cs, ja, zh) is drawn in Figure 3. We observe that MLA performs significantly better than MKD (Reimers & Gurevych, 2020) in all cases. When the amount of training data is small, the advantage of MLA is more obvious: it outperforms MKD even without the LE training stage. Additionally, when training with only 30K image-text pairs per language, MLA outperforms UC2, which is pre-trained with 3M pairs per language. MLA is thus a data-efficient method to build multilingual and multimodal models.

Figure 3: Mean Average Recall over non-English languages under different amounts of training data per language.

E. Language Extensibility

Multilingual models often need to support new languages that were not seen during training. We conduct language extension experiments to compare MLACLIP with the M-VLP model UC2 (Zhou et al., 2021) on the XTD dataset (Aggarwal & Kale, 2020). XTD covers 11 languages; 5 of them (en, de, fr, zh, ja) are seen in the pre-training stage of UC2, while the other 6 languages (it, es, ru, pl, tr, ko) are unseen. To make a fair comparison, we first train MLACLIP with the same data as UC2 and then train both models on the unseen languages with CC69L. The zero-shot image-text retrieval results on XTD are shown in Table 6. We observe a significant performance degeneration on the seen languages for UC2 when training solely on the unseen languages (row 1 vs. row 2). Even when training continues on the seen languages, the performance is still significantly reduced due to the limited model capacity (row 1 vs. row 3). In contrast, since MLA decouples the languages through separate acquirers, the performance on the seen languages is barely affected (row 4 vs. row 5). This suggests that the MLA framework can build multimodal multilingual models that are suitable for supporting an increasing number of languages.

5. Conclusion

In this paper, we propose the MultiLingual Acquisition (MLA) framework, which can generalize monolingual Vision-Language Pre-training models into multilingual with low cost and high flexibility. MLA injects language acquirers and a non-native embedding block into VLPs to support new languages. Inspired by the language learning habits of humans, we propose a two-stage training strategy to optimize the language acquirers and the non-native embedding block. Applied to CLIP, MLA achieves state-of-the-art performance on multilingual image-text and video-text retrieval benchmarks.
References

Artetxe, M., Ruder, S., and Yogatama, D. On the cross-lingual transferability of monolingual representations. In 58th Annual Meeting of the Association for Computational Linguistics, pp. 4623–4637. Association for Computational Linguistics, 2020.

Burns, A., Kim, D., Wijaya, D., Saenko, K., and Plummer, B. A. Learning to scale multilingual representations for vision-language tasks. In European Conference on Computer Vision, pp. 197–213. Springer, 2020.

Carlsson, F. Multilingual CLIP. https://github.com/FreddeFrallan/Multilingual-CLIP, 2021.

Castello, D. First language acquisition and classroom language learning: Similarities and differences. ELAL College of Arts & Law, pp. 1–18, 2015.

Chen, X., Fang, H., Lin, T., Vedantam, R., Gupta, S., Dollár, P., and Zitnick, C. L. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325, 2015.

Chen, Y.-C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., and Liu, J. UNITER: Universal image-text representation learning. In European Conference on Computer Vision. Springer, 2020.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), 2019.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.

Elliott, D., Frank, S., Sima'an, K., and Specia, L. Multi30K: Multilingual English-German image descriptions. In VL@ACL, 2016.

Fei, H., Yu, T., and Li, P. Cross-lingual cross-modal pre-training for multimodal retrieval. In 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3644–3650, 2021.

Gella, S., Sennrich, R., Keller, F., and Lapata, M. Image pivoting for learning multilingual multimodal representations. In EMNLP 2017: Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021a.

Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., and Song, D. Natural adversarial examples. In CVPR, 2021b.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for NLP. In ICML, 2019.

Huang, P.-Y., Patrick, M., Hu, J., Neubig, G., Metze, F., and Hauptmann, A. G. Multilingual multimodal pre-training for zero-shot cross-lingual transfer of vision-language models. In NAACL, 2021.

Huo, Y., Zhang, M., Liu, G., Lu, H., Gao, Y., et al. WenLan: Bridging vision and language by large-scale multi-modal pre-training, 2021.

Jain, A., Guo, M., Srinivasan, K., Chen, T., Kudugunta, S., Jia, C., Yang, Y., and Baldridge, J. MURAL: Multimodal, multitask representations across languages. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 3449–3463, 2021.

Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pp. 4904–4916. PMLR, 2021.

Karpathy, A. and Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

Kim, D., Saito, K., Saenko, K., Sclaroff, S., and Plummer, B. MULE: Multimodal universal language embedding. In AAAI Conference on Artificial Intelligence, volume 34, pp. 11254–11261, 2020.

Kim, W., Son, B., and Kim, I. ViLT: Vision-and-language transformer without convolution or region supervision. In 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research. PMLR, 2021.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.

Krizhevsky, A. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.

Li, G., Duan, N., Fang, Y., Gong, M., and Jiang, D. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 11336–11344, 2020a.

Li, X., Xu, C., Wang, X., Lan, W., Jia, Z., Yang, G., and Xu, J. COCO-CN for cross-lingual image tagging, captioning, and retrieval. IEEE Transactions on Multimedia, 21(9):2347–2360, 2019.

Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision. Springer, 2020b.

Luo, H., Ji, L., Zhong, M., Chen, Y., Lei, W., Duan, N., and Li, T. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval. arXiv preprint arXiv:2104.08860, 2021.

Ni, M., Huang, H., Su, L., Cui, E., Bharti, T., Wang, L., Zhang, D., and Duan, N. M3P: Learning universal representations via multitask multilingual multimodal pre-training. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3977–3986, 2021.

Pfeiffer, J., Vulić, I., Gurevych, I., and Ruder, S. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7654–7673, 2020.

Pfeiffer, J., Geigle, G., Kamath, A., Steitz, J.-M. O., Roth, S., Vulić, I., and Gurevych, I. xGQA: Cross-lingual visual question answering. arXiv e-prints, 2021.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. In 38th International Conference on Machine Learning, volume 139, pp. 8748–8763, 2021.

Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning. PMLR, 2019.

Reimers, N. and Gurevych, I. Making monolingual sentence embeddings multilingual using knowledge distillation. In 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4512–4525, 2020.

Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, 2016.

Sharma, P., Ding, N., Goodman, S., and Soricut, R. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565, 2018.

Song, Y., Chen, S., Jin, Q., Luo, W., Xie, J., and Huang, F. Product-oriented machine translation with cross-modal cross-lingual pre-training. In 29th ACM International Conference on Multimedia, 2021.

Van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Wehrmann, J., Souza, D. M., Lopes, M. A., and Barros, R. C. Language-agnostic visual-semantic embeddings. In IEEE/CVF International Conference on Computer Vision, pp. 5804–5813, 2019.

Xu, J., Mei, T., Yao, T., and Rui, Y. MSR-VTT: A large video description dataset for bridging video and language. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

Yoshikawa, Y., Shigeto, Y., and Takeuchi, A. STAIR Captions: Constructing a large-scale Japanese image caption dataset. In 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017.

Young, P., Lai, A., Hodosh, M., and Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.

Yu, Y., Kim, J., and Kim, G. A joint sequence fusion model for video question answering and retrieval. In European Conference on Computer Vision (ECCV), 2018.

Yuan, L., Chen, D., Chen, Y.-L., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.

Zhou, M., Zhou, L., Wang, S., Cheng, Y., Li, L., Yu, Z., and Liu, J. UC2: Universal cross-lingual cross-modal vision-and-language pre-training. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4155–4165, June 2021.
A. Qualitative Analysis
A.1. Case study
In Figure 4, we visualize the top-1 retrieved images for given text queries in 11 languages on the XTD dataset (Aggarwal &
Kale, 2020). Compared with the multilingual vision-language pre-training model UC2 (Zhou et al., 2021), MLA can better
capture entities, attributes, and actions to retrieve the correct image. Specifically, given simple queries that contain few
entities such as Query #1 or Query #2, the images retrieved by MLA show high consistency across languages, since the
representations of non-English queries are aligned to English in the NLT stage. For the more complex queries such as Query
#3 or Query #4, MLA also shows better fidelity to all entities in most cases.
Figure 4: Top-1 retrieved images by UC2 and MLA for given text queries in 11 languages on the XTD dataset. Only English queries are shown in this figure; the correct images are bordered in green. Query #3: "a woman sitting on a bed while a dog laying on the bed also and a cat is laying on a chair". Query #4: "a pile of bananas, apples, potatoes, and yams on a white background".
Figure 5: Representation visualization with t-SNE for the ten categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). The categories are color coded. '•' denotes an image representation, and '×' denotes a class label representation in a certain language.
Methods     Component   Multi30K             MSCOCO 1K
                        de     fr     cs     ja     zh
MLACLIP     Linear      78.2   77.6   69.3   74.6   78.0
MLACLIP     MLP         78.7   77.7   70.8   74.9   78.5
be closer than negative ones. We conduct experiments using different objectives in the two stages. As shown in Table 9, we observe that the MSE objective is more suitable for the NLT stage (row 1 vs. row 2, row 7 vs. row 8), while the NCE objective performs better for the LE stage (row 3 vs. row 4, row 5 vs. row 6). We consider the reason to be that in the NLT stage, we leverage translation pairs to build alignment between languages. Since the two sentences of a translation pair are highly semantically related, their representations can be very similar, so optimizing a strong objective like MSE during the NLT stage is feasible. During the LE stage, however, the optimization is conducted with image-text pairs. Although the image and text are semantically related, one sentence can hardly describe all the information in the image. Therefore, a weak objective like NCE is more suitable for the LE stage.
Table 9: Ablation study on objectives in the two training stages. mse: MSE objective, nce: NCE objective
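As a companion to the LE loss sketched earlier, a minimal illustration of the MSE objective used for the NLT stage is shown below. The exact NLT formulation of Sec. 3 is not reproduced in this excerpt, so pairing a non-native sentence with the frozen encoder's representation of its English translation is stated here as an assumption, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def nlt_loss(non_native_feats: torch.Tensor, native_feats: torch.Tensor) -> torch.Tensor:
    """MSE objective for the NLT stage over translation pairs.

    non_native_feats: (B, D) sentence features from the language acquisition
                      encoder for the new language.
    native_feats:     (B, D) features of the English translations from the
                      frozen native encoder; detached so that only the
                      acquirers and non-native embeddings are updated.
    """
    return F.mse_loss(non_native_feats, native_feats.detach())
```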
Methods     Multi30K                      MSCOCO 1K
            en     de     fr     cs       en     ja     zh
CMACLIP     80.2   73.9   72.8   67.0     76.3   69.8   75.1
MLACLIP     84.4   78.7   77.7   70.8     79.4   74.9   78.5
open-domain multimodal knowledge from large-scale pre-training. In contrast, MLA keeps the original text encoder fixed
and thus could maintain the open-domain capability of the pre-training model.