AI Open
journal homepage: www.keaipublishing.com/en/journals/ai-open
Keywords: Vision-language pre-training models; Prompt tuning

Abstract

Vision-Language Pre-training (VLP) models have shown promising capabilities in grounding natural language in image data, facilitating a broad range of cross-modal tasks. However, we note that there exists a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for large amounts of labeled data to stimulate the visual grounding capability of VLP models for downstream tasks. To address the challenge, we present Color-based Prompt Tuning (CPT), a novel paradigm for tuning VLP models, which reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. In this way, CPT enables strong few-shot and even zero-shot visual grounding capabilities of VLP models. Comprehensive experimental results show that CPT achieves state-of-the-art performance on zero/few-shot visual grounding (e.g., 75.1 zero-shot accuracy in RefCOCO evaluation), outperforming fine-tuned and other prompt-tuned models by a large margin. Moreover, CPT can also be easily extended to achieve promising zero/few-shot performance on other vision-language tasks, such as visual relation detection, visual commonsense reasoning and visual question answering. We make the data and codes publicly available at https://github.com/thunlp/CPT.
∗ Corresponding author.
E-mail address: [email protected] (Z. Liu).
1 Indicates equal contribution.
https://doi.org/10.1016/j.aiopen.2024.01.004
Received 6 August 2023; Received in revised form 14 December 2023; Accepted 29 January 2024
Available online 1 February 2024
2666-6510/© 2024 The Authors. Publishing services by Elsevier B.V. on behalf of KeAi Communications Co. Ltd. This is an open access article under the CC
BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Fig. 1. Illustration of (a) pre-training for VLP models with masked language modeling (MLM) head, (b) vanilla fine-tuning with new classification (CLS) head, and (c) our
color-based prompt tuning (CPT) framework that reformulates visual grounding into a fill-in-the-blank problem with reused MLM head.
other prompt-tuned and even task-specific models by a large margin. The consistent tuning approach of CPT can also bring more stable adaptation performance. For example, CPT achieves a 73.8% reduction of standard deviation over fine-tuning with one shot in RefCOCO evaluation. Moreover, for other VL tasks that require or benefit from grounded inputs, CPT can also be useful in prompting VLP models to explicitly indicate object positions. We show that CPT can be easily extended to achieve promising zero/few-shot performance on other VL tasks, such as visual relation detection, visual commonsense reasoning and visual question answering.

Our contributions are summarized as threefold: (1) We present a novel color-based prompt tuning framework for VLP models, which reformulates visual grounding into a fill-in-the-blank problem using color-based co-referential markers. (2) We present a principled approach to search for high-quality cross-modal prompt configurations. (3) We conduct comprehensive experiments which demonstrate the effectiveness of the proposed model.

2. Preliminary

In this work, we adopt VinVL (Zhang et al., 2021) as the backbone, which is a representative VLP model that achieves strong performance on various tasks. We briefly introduce the pre-training and vanilla fine-tuning procedure of the model.

Vision-language Pre-training. Given an image-text pair (I, t), a set of objects {v_1, v_2, ..., v_n} is first detected from the image via object detectors. Then image and text are transformed into a sequence of tokens {[IMG], v_1, v_2, ..., v_n, [CLS], w_1, w_2, ..., w_m, [SEP]}, where {w_1, w_2, ..., w_m} are text tokens of t, and [IMG], [CLS] and [SEP] are special tokens. The input representations are fed into Transformers (Vaswani et al., 2017) to produce the hidden representations {h_[IMG], h_v^1, h_v^2, ..., h_v^n, h_[CLS], h_w^1, h_w^2, ..., h_w^m, h_[SEP]}.

To mine self-supervised signals, there are two widely adopted pre-training tasks, namely masked language modeling (MLM) and image-text matching (ITM). The MLM pre-training task randomly replaces some text tokens with a special [MASK] token, and recovers the masked token from h_[MASK] using an MLM head. The ITM pre-training task discriminates whether a given image and text pair matches based on h_[CLS] using an ITM head.

Vanilla Fine-tuning. Here we take visual grounding as an example to illustrate the fine-tuning procedure. Given an image I and a query text q, visual grounding aims to locate the corresponding region in I. A common practice for the task is to first detect a set of region proposals via object detectors, and then classify or rank the proposals to select the target region (Lu et al., 2019; Chen et al., 2020). Specifically, the image and text inputs are fed into the pre-trained Transformers, and then the hidden representations of region proposals are optimized via a classification or ranking loss, where new task-specific parameters are introduced. As a result, fine-tuned VLP models need large amounts of labeled data to stimulate the visual grounding capability.

3. Cross-modal prompt tuning (CPT)

To establish fine-grained connections between image regions and text in a data-efficient way, a good cross-modal prompt tuning framework should take full advantage of co-referential signals from both modalities, and prevent the gap between pre-training and tuning. To this end, CPT reformulates visual grounding into a fill-in-the-blank problem, as shown in Fig. 1. Specifically, CPT consists of two components: (1) a visual sub-prompt that uniquely marks the image regions with colored blocks, and (2) a textual sub-prompt that puts the query text into a color-based template. Equipped with CPT, it is straightforward for VLP models to ground the query text by filling in the mask with the color text of the target image region, where the objective form is identical to pre-training.

Visual Sub-prompt. Given an image and its region proposals R = {v_1, v_2, ..., v_n}, the visual sub-prompt aims to uniquely mark the image regions with natural visual markers. Interestingly, we note that colored bounding boxes are widely used to mark objects in images for visualization in the literature. Inspired by this, we bridge the image regions and the query text through a set of colors C, where each color c_i = (c_v^i, c_w^i) in C is defined by its visual appearance c_v^i (e.g., RGB (255, 0, 0)) and color text c_w^i (e.g., red). Then we mark each region proposal v_i in the image with a unique color c_v^i, resulting in a set of colored image proposals Psi(R; C), where Psi(.) denotes the visual sub-prompt.

In principle, there are multiple plausible choices to mark the regions with colors, including bounding boxes, solid blocks, or solid segmentation masks. In our experiments, we find that coloring the objects with solid blocks and segmentation masks yields better results than bounding boxes, since solid colors are more obvious prompting signals for VLP models. Note that the addition of the visual sub-prompt to the raw image does not change the architecture or parameters of VLP models.
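To make the construction concrete, the following is a minimal Pillow-based sketch of the visual sub-prompt, not the released implementation; the candidate colors follow the "Ours" row of Table 4, and the default transparency value is an assumption for illustration.

```python
# Minimal sketch of the visual sub-prompt: overlay each region proposal with a
# semi-transparent solid colored block. Illustrative only; color appearances
# follow the "Ours" row of Table 4, and alpha=0.5 is an assumed default.
from PIL import Image, ImageDraw

# Color set C: (visual appearance c_v, color text c_w)
COLORS = [((240, 0, 30), "red"), ((0, 10, 255), "blue"), ((0, 255, 0), "green")]

def add_visual_subprompt(image_path, region_boxes, alpha=0.5):
    """Mark each region (x0, y0, x1, y1) with a unique semi-transparent color."""
    image = Image.open(image_path).convert("RGBA")
    overlay = Image.new("RGBA", image.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    for box, (rgb, _color_text) in zip(region_boxes, COLORS):
        # Solid colored block over the proposal; alpha keeps the raw content visible.
        draw.rectangle(box, fill=(*rgb, int(255 * alpha)))
    return Image.alpha_composite(image, overlay).convert("RGB")

# Usage (illustrative): prompted = add_visual_subprompt("example.jpg", [(30, 40, 120, 200)])
```

The same routine applies unchanged when solid segmentation masks are available; only the filled shape differs.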
Fig. 2. CPT framework for other vision-language tasks with reused pre-trained heads. ITM: image–text matching.
Textual Sub-prompt. Given the image regions marked by the visual sub-prompt, the textual sub-prompt aims to prompt VLP models to resolve the query text. Specifically, the query text q (e.g., "the horse watched by the woman") is transformed into a fill-in-the-blank query using a color-based template T(.) that contains a [MASK] token; based on the visual and textual sub-prompts, the target region is predicted by filling the mask with the color text of a candidate region, as formalized in Eq. (1).
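As a minimal sketch of this formulation (the exact template wording is an assumption for illustration), the textual sub-prompt and the prediction of Eq. (1) can be written with the reused MLM head as:

```latex
% Sketch only: the template wording "q is in [MASK] color" is assumed for illustration.
T(q) = \text{``}\, q \text{ is in [MASK] color''}
\qquad
P(v = v_i \mid R, q) \;\propto\;
P_{\mathrm{MLM}}\!\left(\text{[MASK]} = c_w^{i} \,\middle|\, \Psi(R; C),\, T(q)\right)
\tag{1}
```

In the zero-shot case, the region whose color text receives the highest mask-filling probability is returned directly; with a few labeled examples, the same probability is supervised with the negative log-likelihood in Algorithm 1.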
Algorithm 1 Cross-modal Prompt Tuning
Require: C: set of colors
1: Visual sub-prompt: mark the image regions with the color set C as Psi(R; C)
2: Textual sub-prompt: put the query q into the template as T(q)
3: Infer the prediction P(v = v_i | R, q) from the visual and textual sub-prompts as in Eq. (1)
4: if zero-shot then
5:     Take P(v = v_i | R, q) as the final prediction
6: else
7:     Supervise the model using labeled data points D: L = - sum_{(R, q, v*) in D} log P(v* | R, q)

A straightforward way to construct the color set C is to use the most frequent color names in text together with their standard RGB values (e.g., c_i = ((255, 0, 0), red)). However, this solution is sub-optimal, since it determines the color text without considering its visual appearance, and the visual appearance of a color in real-world images often differs from its standard RGB.

To address the challenge, we present a principled algorithm that probes the strongly activated cross-modal signals in VLP models for prompt construction. Specifically, we first identify a candidate set of color texts C_w and visual appearances C_v. For each visual appearance candidate c_v in C_v, we feed into VLP models a pseudo-data instance consisting of a pure colored block of c_v and a text: "[CLS] a photo in [MASK] color [SEP]". Then we compute the decoding score s(c_v, c_w) for each color text candidate c_w in C_w as in Eq. (1), where a larger decoding score indicates a higher correlation between c_v and c_w. Finally, the color set is obtained from the visual appearance and text pairs that achieve the highest correlation scores. We refer readers to the appendix for the pseudo-code. In practice, to make the raw content of the colored image regions available to VLP models, a transparency hyperparameter alpha in (0, 1) is applied to color visual appearances.
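Since the pseudo-code is deferred to the appendix, the following is a rough sketch of the search loop under stated assumptions: mlm_color_score stands in for the MLM-head decoding score s(c_v, c_w) on the pseudo-instance described above, and the greedy de-duplication of colors is an illustrative choice rather than the reported selection rule.

```python
# Minimal sketch of cross-modal prompt search (CPS). `mlm_color_score(rgb, text)`
# is a hypothetical stand-in for the VLP model's MLM-head decoding score on a
# pure colored block paired with "a photo in [MASK] color".
from itertools import product

def cross_modal_prompt_search(candidate_rgbs, candidate_texts, mlm_color_score, k=6):
    """Return the top-k (visual appearance, color text) pairs by decoding score."""
    scored = []
    for rgb, text in product(candidate_rgbs, candidate_texts):
        # s(c_v, c_w): score of decoding `text` at [MASK] given a block of `rgb`.
        scored.append(((rgb, text), mlm_color_score(rgb, text)))
    scored.sort(key=lambda item: item[1], reverse=True)

    colors, used_rgbs, used_texts = [], set(), set()
    for (rgb, text), _score in scored:
        # Greedily keep the highest-correlation pairs with distinct appearances/texts.
        if rgb in used_rgbs or text in used_texts:
            continue
        colors.append((rgb, text))
        used_rgbs.add(rgb)
        used_texts.add(text)
        if len(colors) == k:
            break
    return colors
```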
Table 1
Accuracies of grounding referring expressions. Ext.: extra data augmentation or heuristic rules. FT: vanilla fine-tuning, FT-ATT: attention weights of fine-tuned VLP models, Blk:
colored block, Seg: colored segmentation mask, Aug: extra data augmentation. We report mean (and standard deviation) performance over 5 random splits.
Columns: Model | Ext. | RefCOCO (val, testA, testB) | RefCOCO+ (val, testA, testB) | RefCOCOg (val, test). Rows are grouped by shot.
Shot 0:
Random 15.9 (0.2) 19.4 (0.6) 13.4 (0.4) 16.1 (0.1) 13.3 (0.6) 20.0 (0.2) 18.8 (0.4) 19.2 (0.3)
FT-ATT (Cao et al., 2020) 26.9 26.1 30.6 27.1 26.7 30.8 36.6 36.3
VCb (Zhang et al., 2018) ✓ – 33.3 30.1 – 34.6 31.6 – –
ARNb (Liu et al., 2019a) ✓ 34.3 36.4 33.1 34.5 36.0 33.8 – –
KPRNb (Liu et al., 2019b) ✓ 35.0 34.7 37.0 36.0 35.2 37.0 – –
DTWREGb (Sun et al., 2021) ✓ 39.2 41.1 37.7 39.2 40.1 38.1 – –
GPV (Gupta et al., 2022) ✓ 44.6 41.4 47.1 42.9 39.1 47.6 52.1 52.3
ReCLIPb (Subramanian et al., 2022) ✓ 45.8 46.1 47.1 47.9 50.1 45.1 59.3 59.0
Pseudo-Qb (Jiang et al., 2022) ✓ 56.0 58.3 54.1 38.9 45.1 32.1 46.3 47.4
CPT-Blk (ours) 26.9 27.5 27.4 25.4 25.0 27.0 32.1 32.3
CPT-Seg (ours) 32.2 36.1 30.3 31.9 35.2 28.8 36.7 36.5
CPT-Aug (ours) ✓ 69.8 75.1 62.6 57.7 65.0 48.2 63.9 63.3
Shot 1:
FT (Zhang et al., 2021) 16.5 (4.9) 12.0 (6.6) 23.5 (5.7) 22.2 (7.6) 20.6 (9.3) 25.7 (5.2) 26.9 (8.4) 26.9 (8.1)
FT-ATT (Cao et al., 2020) 26.9 (0.6) 26.3 (0.7) 30.9 (0.7) 26.7 (0.3) 26.1 (0.7) 30.6 (0.1) 36.1 (0.2) 36.6 (0.2)
GPV (Gupta et al., 2022) ✓ 48.7 (3.7) 47.6 (5.8) 49.0 (1.7) 47.1 (2.6) 45.0 (3.7) 49.3 (1.1) 54.6 (2.4) 55.0 (2.1)
CPT-Blk (ours) 34.1 (1.3) 37.7 (1.7) 32.2 (1.5) 35.9 (4.1) 40.4 (5.4) 32.2 (2.6) 39.7 (3.4) 39.9 (3.0)
CPT-Seg (ours) 37.2 (0.9) 41.5 (1.5) 33.2 (1.7) 37.9 (4.0) 42.3 (5.9) 33.9 (2.4) 43.1 (2.9) 43.4 (3.1)
CPT-Aug (ours) ✓ 70.2 (0.6) 75.5 (0.7) 63.7 (0.7) 57.6 (0.4) 65.2 (0.2) 48.5 (0.6) 63.7 (0.4) 63.9 (0.4)
Shot 16:
FT (Zhang et al., 2021) 39.8 (4.2) 45.5 (5.0) 34.9 (3.0) 41.8 (3.0) 47.3 (3.1) 36.2 (2.3) 47.5 (4.1) 47.8 (4.7)
FT-ATT (Cao et al., 2020) 29.8 (0.5) 31.4 (1.1) 32.1 (0.1) 30.3 (1.3) 32.2 (1.9) 32.2 (0.6) 37.7 (0.8) 38.1 (0.6)
GPV (Gupta et al., 2022) ✓ 56.5 (0.5) 58.7 (1.5) 53.0 (0.6) 57.1 (1.4) 59.8 (1.7) 53.4 (1.1) 60.2 (0.5) 60.4 (0.5)
CPT-Blk (ours) 44.8 (3.3) 51.4 (4.1) 38.2 (2.3) 41.5 (1.3) 48.2 (2.1) 34.7 (0.9) 47.8 (2.1) 48.2 (2.8)
CPT-Seg (ours) 45.3 (1.8) 53.3 (3.0) 37.5 (1.3) 44.8 (0.9) 52.5 (1.2) 36.6 (1.2) 51.0 (2.6) 51.4 (2.8)
CPT-Aug (ours) ✓ 71.2 (0.7) 76.8 (1.0) 64.9 (0.5) 58.2 (0.3) 65.7 (0.4) 49.0 (0.3) 64.1 (0.6) 64.0 (0.6)
Oracle (full)^a (Zhang et al., 2021) 81.8 87.2 74.3 74.5 80.8 64.3 74.6 75.7
^a Fine-tuned on the full training set.
^b Task-specific models.
text with the textual sub-prompt, where each object in the text is referred to by the corresponding color (e.g., in red), as shown in Fig. 2.

Most VLP models include an image-text matching (ITM) pre-training task, where an ITM head is used to discriminate whether an image matches a given text (see Section 2). Intuitively, a question concatenated with the correct answer can better describe the image than the question concatenated with a wrong answer, and therefore should be assigned a higher ITM score. Therefore, the ITM head can be used in the same form as pre-training to judge whether the image and question-answer text match (i.e., correctly answer) for zero/few-shot VCR. We find that this ITM-based approach yields better performance in dealing with sentence-level VCR answers than the MLM head in experiments. When a few training instances are available, we can further optimize the VLP model to classify the image-text pair into the ITM vocabulary {matched, not matched} with the identical objective as pre-training.
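The following is a minimal sketch of this zero-shot answer selection; itm_match_prob is a hypothetical stand-in for the VLP model's ITM head applied to the prompted image and text, and is not part of the released code.

```python
# Minimal sketch of zero-shot VCR answer selection with the reused ITM head.
# `itm_match_prob(image, text)` is a hypothetical stand-in returning the ITM
# head's matching probability; in the few-shot case the same head is supervised
# with the {matched, not matched} labels, keeping the pre-training objective.
def select_answer(itm_match_prob, prompted_image, question, candidate_answers):
    """Return the candidate whose concatenation with the question best matches the image."""
    def itm_score(answer):
        return itm_match_prob(prompted_image, f"{question} {answer}")
    return max(candidate_answers, key=itm_score)
```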
Visual Question Answering. We further investigate whether CPT can benefit tasks that do not require explicit object modeling. We are interested in the question: given a grounded question, where the object positions are provided, can CPT stimulate pre-trained capabilities of VLP models for zero/few-shot VQA? To obtain high-quality object grounding results for CPT, we use a strong visual grounding model pre-trained on a large collection of labeled datasets (Yao et al., 2022). Then we mark the image regions and object text using the visual and textual sub-prompts. Finally, we concatenate the question with "[SEP][MASK]", and reuse the MLM head to predict answers from the mask.

5. Experiments

We empirically evaluate CPT on different VL tasks in zero/few-shot scenarios. In our experiments, we use the same color configurations for different tasks. We refer readers to the appendix for implementation and dataset details.

5.1. Visual grounding experiments

Datasets. We adopt three widely used visual grounding datasets, including RefCOCO (Yu et al., 2016), RefCOCO+ (Yu et al., 2016) and RefCOCOg (Mao et al., 2016). To better approximate the few-shot scenario where only a few labeled instances are available, following Gao et al. (2021a), we use a few-shot validation set (consisting of 16 instances) for all experiments.

Evaluation Metrics. Following Lu et al. (2019), we adopt the accuracy of grounding results as the evaluation metric. An expression is considered correctly grounded if the IoU of the top predicted region and the ground truth is greater than 0.5. Moreover, since model training on limited data can suffer from instability, following Dodge et al. (2020) and Gao et al. (2021a), we report average results over 5 random training set splits, as well as the standard deviation. For fair comparisons, the training and validation sets are identical for our baselines and CPT.

Baselines. We compare our model with a series of strong baselines. (1) Vanilla fine-tuning (FT) for VinVL (Zhang et al., 2021). This model adopts the same backbone as CPT, and serves as the most direct baseline. (2) Attention weights of the fine-tuned VinVL model (FT-ATT). Previous works show that the attention weights of VLP models are strong grounding indicators (Cao et al., 2020; Chen et al., 2020). Following these works, we score each image region using the average text-to-image attention weights from all text tokens in the query across all attention heads.² (3) Task-specific visual grounding models. We compare with state-of-the-art models tailored for zero-shot visual grounding (Zhang et al., 2018; Liu et al., 2019a,b; Sun et al., 2021; Subramanian et al., 2022; Jiang et al., 2022). These works typically utilize extra data augmentation, such as image-level ground-truth referring expressions (Zhang et al., 2018; Liu et al., 2019a,b; Sun

² We also experiment with the maximum attention score from text tokens or attention heads, or image-to-text attentions, and adopt the best practice.
Table 2
Results of VRD on the Visual Genome test set.
Shot Model VRD
R@50 R@100 mR@50 mR@100
Random 1.5 (0.0) 1.8 (0.1) 1.2 (0.1) 1.6 (0.1)
0 GPV (Gupta et al., 2022) 2.2 2.9 1.4 2.1
CPT 29.3 30.5 13.0 14.5
FT (Zhang et al., 2021) 4.1 (0.1) 4.7 (0.0) 6.7 (0.3) 7.6 (0.4)
1 GPV (Gupta et al., 2022) 11.6 (1.2) 16.2 (2.9) 7.1 (1.1) 10.5 (1.7)
CPT 18.0 (2.8) 20.0 (3.0) 23.9 (0.3) 26.3 (0.3)
FT (Zhang et al., 2021) 7.3 (1.5) 7.9 (1.7) 11.8 (1.0) 13.2 (0.9)
4 GPV (Gupta et al., 2022) 12.1 (2.3) 16.7 (3.3) 10.3 (0.8) 15.6 (2.1)
CPT 17.7 (0.6) 19.3 (0.6) 28.5 (1.5) 32.1 (1.0)
FT (Zhang et al., 2021) 10.4 (0.7) 11.2 (0.8) 19.7 (0.1) 21.7 (0.1)
16 GPV (Gupta et al., 2022) 13.7 (1.0) 19.2 (1.4) 15.3 (1.0) 24.1 (1.3)
CPT 18.4 (1.0) 20.0 (1.1) 32.5 (0.5) 36.1 (0.6)
FT (Zhang et al., 2021) 11.7 (0.2) 12.4 (0.3) 22.0 (0.1) 24.1 (0.0)
32 GPV (Gupta et al., 2022) 16.7 (0.6) 23.0 (0.8) 16.7 (1.6) 24.8 (2.0)
CPT 20.8 (0.1) 22.3 (0.1) 34.0 (0.1) 37.7 (0.3)
Oracle (full) (Zhang et al., 2021) 65.1 67.4 20.6 22.5
et al., 2021), generated pseudo-queries (Jiang et al., 2022), or heuristic rules for resolving referring expressions (Subramanian et al., 2022). For example, Pseudo-Q (Jiang et al., 2022) generates pseudo-queries using templates according to the detected nouns, attributes and spatial relations. (4) General prompt-based models. GPV (Gupta et al., 2022) is a strong general-purpose model that can output object positions and text tokens to address various tasks in a prompting fashion. GPV is pre-trained with augmented pseudo-queries which are generated in a similar way to Pseudo-Q (Jiang et al., 2022). We evaluate three variants of our model: CPT-Blk uses colored blocks as the visual sub-prompt, and CPT-Seg leverages colored segmentation masks. CPT-Aug further pre-trains CPT-Seg using pseudo-queries from Pseudo-Q (Jiang et al., 2022) with Eq. (1).

Results. From Table 1 we observe that: (1) CPT consistently outperforms the fine-tuning baseline by a large margin across different datasets and shot settings. For example, using colored blocks as visual sub-prompts, CPT-Blk achieves a 17.3 absolute accuracy point improvement on average with one shot in RefCOCO evaluation. This indicates that CPT can effectively improve sample efficiency in tuning VLP models. (2) Coloring objects with segmentation masks (CPT-Seg) achieves even better results than blocks. The reason is that solid colors that fit the outlines of objects are more common in real-world images, making CPT-Seg a more natural visual sub-prompt. (3) CPT achieves substantially more stable performance than fine-tuning. For example, CPT-Seg achieves a 76.2% reduction of standard deviation with one shot in RefCOCO. This shows that a coherent tuning approach from pre-training can lead to substantially more stable few-shot adaptation. (4) With simple data augmentation, CPT-Aug achieves state-of-the-art performance on zero/few-shot visual grounding, outperforming strong prompt-tuned and task-specific models. Notably, CPT-Aug achieves 75.1 zero-shot grounding accuracy on the RefCOCO testA set, outperforming the previous state-of-the-art by 16.8 accuracy points. The reason is that CPT more naturally connects visual and text signals with color-based prompts, and therefore maximally stimulates the visual grounding capabilities of VLP models.

5.2. Experiments on other VL tasks

We further evaluate CPT on visual relation detection, visual commonsense reasoning and visual question answering.

Visual Relation Detection. We adopt the widely used Visual Genome dataset (Krishna et al., 2017), which contains 50 visual relation types. During training, K labeled instances are provided for each relation. Since Visual Genome does not provide segmentation masks, we use colored blocks in the visual sub-prompt. For GPV, following Subramanian et al. (2022) and Yao et al. (2021), the query subject and object are cropped in the image to indicate their positions, and the model is prompted with "what is the relation between s_w and o_w" to decode the relation. FT-ATT cannot handle VRD, which requires producing relation labels. Following Chen et al. (2019), we use recall@N (R@N) and mean recall@N (mR@N) over different relations as evaluation metrics.

From Table 2 we observe that: (1) CPT outperforms baselines in different shot settings and metrics. For example, using one shot, CPT outperforms fine-tuning by 15.3 points on R@100 and 18.7 points on mR@100, showing reasonable performance on both common and long-tail relations. (2) We note that while the macro performance (mR@N) of CPT monotonically increases as the shot number grows, the micro results (R@N) drop first in the 1- and 4-shot settings. This is due to the distribution gap between the balanced training set (i.e., K shots for each relation) and the long-tail test set. Since the relations in the pre-training data also follow a long-tail distribution, CPT can achieve a high starting point for micro performance.

Visual Commonsense Reasoning and Visual Question Answering. We adopt the popular VCR dataset (Zellers et al., 2019) for visual commonsense reasoning, and the GQA dataset (Hudson and Manning, 2019) for visual question answering. We report the accuracy of selecting answers and rationales for VCR, and the accuracy of answers for GQA. More shots are provided due to the difficulty of the tasks. For visual sub-prompts we adopt segmentation masks provided by the VCR dataset, and bounding boxes from Yao et al. (2022) on GQA. In experiments, we find that while providing reasoning clues, colors in prompt templates can sometimes disturb image understanding. To address the issue, we simply use the weighted average of the scores given by CPT equipped with and without colors. GPV is prompted with the question concatenated with each candidate answer sentence, and decodes yes/no for answer selection in VCR. Since the answers of GQA are typically short, GPV directly decodes the answers based on the prompting question.

From Table 3 we can see that CPT can significantly improve the data efficiency of VLP models for visual commonsense reasoning and visual question answering. Notably, the zero-shot performance of CPT even surpasses vanilla fine-tuning trained with 128 shots on both datasets. This shows that CPT can effectively prompt VLP models to handle both sentence-level and token-level answers. In comparison, it can be challenging for GPV to deal with sentence-level answers in question answering. In addition, ensembling color-free prompt templates helps alleviate the color disturbance problem. Removing the color-free prompt templates leads to 2.3 and 1.8 points degradation in the 16-shot setting on VCR (Q -> AR) and GQA respectively.
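A minimal sketch of the weighted colored/color-free ensemble described above follows; score_with_colors and score_without_colors stand in for CPT scoring with and without the color-based prompts, and the equal weight is an assumption rather than the reported configuration.

```python
# Minimal sketch of the colored / color-free score ensemble for VCR and GQA.
# The two scoring callables are hypothetical stand-ins; weight=0.5 is assumed.
def select_with_ensemble(score_with_colors, score_without_colors, candidates, weight=0.5):
    """Pick the candidate with the highest weighted-average score over the
    colored and color-free prompt variants."""
    def score(candidate):
        return weight * score_with_colors(candidate) + (1 - weight) * score_without_colors(candidate)
    return max(candidates, key=score)
```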
Table 3
Results on VCR validation set and GQA test-dev set.
Shot Model VCR GQA
Q → A QA → R Q → AR test-dev
Random 25.0 25.0 6.3 0.1
0 GPV (Gupta et al., 2022) 29.1 26.8 8.1 34.6
CPT 43.8 39.0 17.8 36.0
FT (Zhang et al., 2021) 31.5 (5.6) 30.2 (6.0) 10.4 (3.0) 12.2 (4.4)
4 GPV (Gupta et al., 2022) 29.9 (0.3) 28.3 (0.3) 8.8 (0.1) 34.6 (0.0)
CPT 44.4 (0.3) 41.4 (1.0) 18.1 (1.6) 36.7 (5.9)
FT (Zhang et al., 2021) 32.1 (7.9) 35.7 (1.7) 12.8 (4.0) 17.5 (2.7)
16 GPV (Gupta et al., 2022) 29.6 (0.5) 28.0 (0.4) 8.7 (0.2) 34.6 (0.0)
CPT 45.3 (1.6) 41.2 (1.8) 19.4 (1.3) 43.6 (3.4)
FT (Zhang et al., 2021) 41.1 (2.6) 38.8 (2.0) 14.6 (2.0) 22.1 (1.2)
64 GPV (Gupta et al., 2022) 30.3 (0.7) 28.4 (0.4) 8.8 (0.3) 34.6 (0.0)
CPT 45.7 (0.8) 42.5 (0.7) 19.2 (1.4) 50.9 (1.1)
FT (Zhang et al., 2021) 43.0 (2.5) 39.7 (4.8) 14.6 (1.5) 23.7 (0.7)
128 GPV (Gupta et al., 2022) 30.7 (0.3) 28.6 (0.4) 9.2 (0.2) 34.8 (0.4)
CPT 45.7 (0.9) 44.5 (0.9) 20.1 (0.6) 51.0 (0.7)
Oracle (full) (Zhang et al., 2021) 63.9 68.3 48.3 65.1
Table 4
Top 6 colors from the frequency-based baseline and our cross-modal prompt search method.
Model Color #1 Color #2 Color #3 Color #4 Color #5 Color #6
Freq (255,0,0), red (0,0,0), black (0,0,255), blue (0,255,0), green (255,255,0), yellow (165,42,42), brown
Ours (240,0,30), red (155,50,210), purple (255,255,25), yellow (0,10,255), blue (255,170,230), pink (0,255,0), green
Fig. 3. Results of utilizing different colors for visual grounding, including (a) an overall evaluation of top 6 colors from different models, and (b) a zoom-in study of individual
colors.
5.3. Influence of prompt configurations

We investigate the influence of colors, the key ingredients in the visual grounding of CPT. Specifically, we compare colors obtained from the frequency-based baseline (Freq), which uses the most frequent color names in text and their standard RGB values (see Section 3), and our Cross-modal Prompt Search (CPS) method in two dimensions, including an overall evaluation of the top N colors and a zoom-in study of individual colors. The analysis is conducted based on CPT-Blk on the RefCOCO validation set.

Overall Evaluation of Top N Colors. We first show the top 6 colors recommended by each approach in Table 4. To evaluate the overall performance of the top colors from different models, we evaluate CPT equipped with each recommended color and report the mean accuracy and standard deviation over different colors. From the results in Fig. 3(a), we observe that the top colors produced by CPS achieve both higher mean accuracy and lower standard deviation than the baseline method in different shot settings. The reason is that CPS probes sensitive colors in VLP models for prompt construction, and therefore is able to effectively select the colors for better visual grounding.

Zoom-In Study of Individual Colors. To investigate the fine-grained influence of specific colors in CPT's visual grounding, we further perform a zoom-in study of individual colors. To align the colors for comparison, we merge the top 6 colors from the baseline and CPS, and remove the colors that are not included in the models' complete color sets (e.g., black is not in C for CPS). We report the accuracies in Fig. 3(b), from which we observe that: (1) The performance of different colors varies greatly in prompting VLP models in the same shot settings, and the optimal colors are different in different shot settings. The results indicate the large influence of cross-modal prompt configurations, consistent with the findings from recent studies in textual prompt tuning (Jiang et al., 2020; Gao et al., 2021a). (2) Colors produced by CPS achieve comparable or superior performance compared with the baseline on individual colors. The results show that given the color texts, CPS can properly adjust the color visual appearance to improve the visual grounding performance. (3) We note that in some cases, colors produced by CPS slightly underperform the baseline. We hypothesize the reason is that CPS uses a single textual template to compute the decoding scores for color adjustment, which can be biased. The problem can potentially be addressed by ensembling templates as in Qin and Eisner (2021), which we leave for future work.

Performance on Color-involved Instances. Despite the effectiveness, adding colors in prompts might also disturb the understanding of raw images and text. We empirically assess the performance of CPT on color-involved instances. For the 2262 referring expressions containing color texts on the RefCOCO+ testA set, CPT can achieve a reasonable 42.3% grounding accuracy with one shot, as compared with 28.9% of fine-tuning. The reason is that establishing strong cross-modal connections is more important in zero/few-shot scenarios, and a capable model can easily learn to distinguish the colors of raw objects and artificial markers.
Fig. 4. Case study. The bounding boxes given by image region proposals (olive), ground-truth annotation (pink), CPT (light green), and fine-tuning baseline (yellow) are highlighted
accordingly.
5.4. Case study

To provide a more intuitive understanding of CPT, we conduct a case study in the few-shot setting. From Fig. 4 we can observe that: (1) CPT enables VLP models to distinguish targets distracted by the same type of objects using only a few instances, while fine-tuning struggles to succeed (left two figures). (2) CPT can be distracted by hard candidates (e.g., objects of the same type that require complex reasoning), but typically produces reasonable predictions. In the right figure, CPT predicts a nearby apple while fine-tuning predicts a bowl. The reason is that CPT reuses the pre-trained head of VLP models, which helps prevent outrageous results that typically happen in few-shot fine-tuning.

6. Related work

Prompt tuning for pre-trained models is a rapidly emerging field (Petroni et al., 2019; Brown et al., 2020; Schick and Schütze, 2021; Lester et al., 2021). Existing works for VLP models can be roughly divided into two categories, including natural language prompts and embedding-based prompts. We refer readers to the appendix for more related works.

Natural Language Prompts. To avoid the gap between pre-training and tuning, VLP models can be prompted with natural language templates, and produce the answer by filling in the mask (Radford et al., 2021; Liu et al., 2022; Zeng, 2022; Li et al., 2022a) or causal language generation (Wang et al., 2021b; Tsimpoukelli et al., 2021; Gupta et al., 2022; Kamath et al., 2022; Alayrac et al., 2022). To deal with object positions as task inputs/outputs, existing natural language prompt-based models either utilize extensive human annotations (Yang et al., 2022a; Wang et al., 2022; Yao et al., 2022; Li et al., 2022b; Kamath et al., 2021) or generate heuristic pseudo-data with external tools (Gupta et al., 2022; Kamath et al., 2022; Cho et al., 2021) to learn specialized position embeddings. Some works also generate text prompts that are related to the image content to enhance image representations (Rao et al., 2021; Wang et al., 2021a; Lin et al., 2022). Natural language prompts can improve the data efficiency of VLP models, enabling strong zero/few-shot capabilities. Moreover, different tasks can also be handled in a unified language modeling framework.

Embedding-based Prompts. To facilitate automatic prompt template search, some works learn prompts as new parameters. The parameters can be static pseudo text representations (Zhou et al., 2021; Ju et al., 2021; Sun et al., 2022; Zhu et al., 2022), disturbance vectors on images (Bahng et al., 2022; Liang et al., 2022), dynamic conditional representations (Zhou et al., 2022; Han et al., 2022), or lightweight additional modules (Gao et al., 2021b; Zhang and Ré, 2022; Jia et al., 2022; Lüddecke and Ecker, 2022; Yang et al., 2022b). When only new parameters are tuned, embedding-based prompts can be parameter-efficient. However, it can be difficult for embedding-based prompts to perform zero-shot tasks.

Most existing prompt tuning methods cannot establish fine-grained connections between text and image regions in zero/few-shot scenarios. In comparison, CPT prompts VLP models with natural co-referential markers in both image and text, which enables zero/few-shot fine-grained capabilities in locating and indicating objects for various VL tasks.

Vision-language Pre-training Models. Existing VLP models can be roughly divided into three categories: (1) Masked language modeling based VLP models are mainly pre-trained to recover the masked tokens (Lu et al., 2019; Su et al., 2019; Tan and Bansal, 2019; Li et al., 2020; Yu et al., 2021); (2) Auto-regressive language modeling based VLP models model image and text tokens with Transformer decoders auto-regressively (Ramesh et al., 2021; Wang et al., 2021b; Alayrac et al., 2022); (3) Contrastive learning based VLP models are pre-trained to holistically match image-text pairs (Radford et al., 2021; Li et al., 2021). In this work, we focus on prompting masked language modeling based VLP models due to their prevalence and superior performance, while applying CPT to other VLP models should also be feasible.

Visual Grounding. Most existing works on visual grounding learn to classify or rank image region candidates based on the expressions in a fully supervised fashion (Mao et al., 2016; Zhang et al., 2018; Lu et al., 2019; Chen et al., 2020), requiring large amounts of costly human-annotated data. To alleviate the reliance on human annotation, some works have investigated zero/few-shot grounding of new object types (Sadhu et al., 2019; Blukis et al., 2020), whereas large amounts of training data are still needed for existing object types. Jiang et al. (2022) generate pseudo referring queries for data augmentation. Subramanian et al. (2022) enhance the spatial reasoning capability of CLIP via heuristic rules. To refer to objects in images, Zellers et al. (2021) and Hessel et al. (2022) highlight objects with colored blocks, but require labeled data to learn the correlation between colors and object texts. Rohrbach et al. (2016) and Chen et al. (2018) explore weakly supervised approaches but are limited to dataset-specific in-domain training. In comparison, CPT prompts general VLP models for zero/few-shot visual grounding in a fill-in-the-blank fashion independent of specific object types, and can also be extended to indicate object positions for other VL tasks.

7. Conclusion and future work

In this work, we present a novel color-based prompt tuning framework for VLP models. To facilitate prompt construction, we present a principled approach to search for cross-modal prompt configurations. Comprehensive experimental results demonstrate the effectiveness of CPT on zero/few-shot VL tasks. In principle, color is one of the prominent attributes that can serve as natural prompts. In the future, we will explore prompting VLP models with other cross-modal attributes, such as shape, size and status, for more expressive, efficient and robust prompt tuning.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment

This work is supported by the National Natural Science Foundation of China (No. 62236004).

Appendix A. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.aiopen.2024.01.004.

References

Alayrac, Jean-Baptiste, Donahue, Jeff, Luc, Pauline, Miech, Antoine, Barr, Iain, Hasson, Yana, Lenc, Karel, Mensch, Arthur, Millican, Katie, Reynolds, Malcolm, et al., 2022. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198.
Anderson, Peter, He, Xiaodong, Buehler, Chris, Teney, Damien, Johnson, Mark, Gould, Stephen, Zhang, Lei, 2018a. Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of CVPR. pp. 6077–6086.
Anderson, Peter, Wu, Qi, Teney, Damien, Bruce, Jake, Johnson, Mark, Sünderhauf, Niko, Reid, Ian, Gould, Stephen, van den Hengel, Anton, 2018b. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: Proceedings of CVPR.
Antol, Stanislaw, Agrawal, Aishwarya, Lu, Jiasen, Mitchell, Margaret, Batra, Dhruv, Zitnick, C Lawrence, Parikh, Devi, 2015. VQA: Visual question answering. In: Proceedings of ICCV. pp. 2425–2433.
Bahng, Hyojin, Jahanian, Ali, Sankaranarayanan, Swami, Isola, Phillip, 2022. Visual prompting: Modifying pixel space to adapt pre-trained models. arXiv preprint arXiv:2203.17274.
Blukis, Valts, Knepper, Ross A., Artzi, Yoav, 2020. Few-shot object grounding and mapping for natural language robot instruction following. arXiv preprint arXiv:2011.07384.
Brown, Tom B, Mann, Benjamin, Ryder, Nick, Subbiah, Melanie, Kaplan, Jared, Dhariwal, Prafulla, Neelakantan, Arvind, Shyam, Pranav, Sastry, Girish, Askell, Amanda, et al., 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
Cao, Jize, Gan, Zhe, Cheng, Yu, Yu, Licheng, Chen, Yen-Chun, Liu, Jingjing, 2020. Behind the scene: Revealing the secrets of pre-trained vision-and-language models. In: Proceedings of ECCV. Springer, pp. 565–580.
Chen, Kan, Gao, Jiyang, Nevatia, Ram, 2018. Knowledge aided consistency for weakly supervised phrase grounding. In: Proceedings of CVPR. pp. 4042–4050.
Chen, Yen-Chun, Li, Linjie, Yu, Licheng, El Kholy, Ahmed, Ahmed, Faisal, Gan, Zhe, Cheng, Yu, Liu, Jingjing, 2020. UNITER: Universal image-text representation learning. In: Proceedings of ECCV. Springer, pp. 104–120.
Chen, Tianshui, Yu, Weihao, Chen, Riquan, Lin, Liang, 2019. Knowledge-embedded routing network for scene graph generation. In: Proceedings of CVPR. pp. 6163–6171.
Cho, Jaemin, Lei, Jie, Tan, Hao, Bansal, Mohit, 2021. Unifying vision-and-language tasks via text generation. In: Proceedings of ICML. PMLR, pp. 1931–1942.
Das, Abhishek, Kottur, Satwik, Gupta, Khushi, Singh, Avi, Yadav, Deshraj, Moura, José MF, Parikh, Devi, Batra, Dhruv, 2017. Visual dialog. In: Proceedings of CVPR. pp. 326–335.
Dodge, Jesse, Ilharco, Gabriel, Schwartz, Roy, Farhadi, Ali, Hajishirzi, Hannaneh, Smith, Noah, 2020. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305.
Gao, Tianyu, Fisch, Adam, Chen, Danqi, 2021a. Making pre-trained language models better few-shot learners. In: Proceedings of ACL.
Gao, Peng, Geng, Shijie, Zhang, Renrui, Ma, Teli, Fang, Rongyao, Zhang, Yongfeng, Li, Hongsheng, Qiao, Yu, 2021b. CLIP-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544.
Gupta, Tanmay, Kamath, Amita, Kembhavi, Aniruddha, Hoiem, Derek, 2022. Towards general purpose vision systems: An end-to-end task-agnostic vision-language architecture. In: Proceedings of CVPR. pp. 16399–16409.
Han, Guangxing, Ma, Jiawei, Huang, Shiyuan, Chen, Long, Chellappa, Rama, Chang, Shih-Fu, 2022. Multimodal few-shot object detection with meta-learning based cross-modal prompting. arXiv preprint arXiv:2204.07841.
Hessel, Jack, Hwang, Jena D, Park, Jae Sung, Zellers, Rowan, Bhagavatula, Chandra, Rohrbach, Anna, Saenko, Kate, Choi, Yejin, 2022. The abduction of Sherlock Holmes: A dataset for visual abductive reasoning. arXiv preprint arXiv:2202.04800.
Hudson, Drew A., Manning, Christopher D., 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In: Proceedings of CVPR. pp. 6700–6709.
Jia, Menglin, Tang, Luming, Chen, Bor-Chun, Cardie, Claire, Belongie, Serge, Hariharan, Bharath, Lim, Ser-Nam, 2022. Visual prompt tuning. arXiv preprint arXiv:2203.12119.
Jiang, Haojun, Lin, Yuanze, Han, Dongchen, Song, Shiji, Huang, Gao, 2022. Pseudo-Q: Generating pseudo language queries for visual grounding. In: Proceedings of CVPR. pp. 15513–15523.
Jiang, Zhengbao, Xu, Frank F., Araki, Jun, Neubig, Graham, 2020. How can we know what language models know? TACL 8, 423–438.
Ju, Chen, Han, Tengda, Zheng, Kunhao, Zhang, Ya, Xie, Weidi, 2021. Prompting visual-language models for efficient video understanding. arXiv preprint arXiv:2112.04478.
Kamath, Amita, Clark, Christopher, Gupta, Tanmay, Kolve, Eric, Hoiem, Derek, Kembhavi, Aniruddha, 2022. Webly supervised concept expansion for general purpose vision models. arXiv preprint arXiv:2202.02317.
Kamath, Aishwarya, Singh, Mannat, LeCun, Yann, Misra, Ishan, Synnaeve, Gabriel, Carion, Nicolas, 2021. MDETR: Modulated detection for end-to-end multi-modal understanding. arXiv preprint arXiv:2104.12763.
Krishna, Ranjay, Zhu, Yuke, Groth, Oliver, Johnson, Justin, Hata, Kenji, Kravitz, Joshua, Chen, Stephanie, Kalantidis, Yannis, Li, Li-Jia, Shamma, David A, et al., 2017. Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV 123 (1), 32–73.
Lester, Brian, Al-Rfou, Rami, Constant, Noah, 2021. The power of scale for parameter-efficient prompt tuning. In: Proceedings of EMNLP. pp. 3045–3059.
Li, Wei, Gao, Can, Niu, Guocheng, Xiao, Xinyan, Liu, Hao, Liu, Jiachen, Wu, Hua, Wang, Haifeng, 2021. UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning. In: Proceedings of ACL. Association for Computational Linguistics, pp. 2592–2607.
Li, Bin, Weng, Yixuan, Sun, Bin, Li, Shutao, 2022a. Towards visual-prompt temporal answering grounding in medical instructional video. arXiv preprint arXiv:2203.06667.
Li, Xiujun, Yin, Xi, Li, Chunyuan, Zhang, Pengchuan, Hu, Xiaowei, Zhang, Lei, Wang, Lijuan, Hu, Houdong, Dong, Li, Wei, Furu, et al., 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In: Proceedings of ECCV. Springer, pp. 121–137.
Li, Liunian Harold, Zhang, Pengchuan, Zhang, Haotian, Yang, Jianwei, Li, Chunyuan, Zhong, Yiwu, Wang, Lijuan, Yuan, Lu, Zhang, Lei, Hwang, Jenq-Neng, et al., 2022b. Grounded language-image pre-training. In: Proceedings of CVPR. pp. 10965–10975.
Liang, Sheng, Zhao, Mengjie, Schütze, Hinrich, 2022. Modular and parameter-efficient multimodal fusion with prompting. In: Findings of ACL. pp. 2976–2985.
Lin, Bingqian, Zhu, Yi, Chen, Zicong, Liang, Xiwen, Liu, Jianzhuang, Liang, Xiaodan, 2022. ADAPT: Vision-language navigation with modality-aligned action prompts. In: Proceedings of CVPR. pp. 15396–15406.
Liu, Xuejing, Li, Liang, Wang, Shuhui, Zha, Zheng-Jun, Meng, Dechao, Huang, Qingming, 2019a. Adaptive reconstruction network for weakly supervised referring expression grounding. In: Proceedings of ICCV. pp. 2611–2620.
Liu, Xuejing, Li, Liang, Wang, Shuhui, Zha, Zheng-Jun, Su, Li, Huang, Qingming, 2019b. Knowledge-guided pairwise reconstruction network for weakly supervised referring expression grounding. In: Proceedings of ACM MM. pp. 539–547.
Liu, Yuhang, Wei, Wei, Peng, Daowan, Zhu, Feida, 2022. Declaration-based prompt tuning for visual question answering. arXiv e-prints.
Liu, Pengfei, Yuan, Weizhe, Fu, Jinlan, Jiang, Zhengbao, Hayashi, Hiroaki, Neubig, Graham, 2021. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.
Lu, Jiasen, Batra, Dhruv, Parikh, Devi, Lee, Stefan, 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Proc. NeurIPS 32, 13–23.
Lüddecke, Timo, Ecker, Alexander, 2022. Image segmentation using text and image prompts. In: Proceedings of CVPR. pp. 7086–7096.
Mao, Junhua, Huang, Jonathan, Toshev, Alexander, Camburu, Oana, Yuille, Alan L, Murphy, Kevin, 2016. Generation and comprehension of unambiguous object descriptions. In: Proceedings of CVPR. pp. 11–20.
Petroni, Fabio, Rocktäschel, Tim, Riedel, Sebastian, Lewis, Patrick, Bakhtin, Anton, Wu, Yuxiang, Miller, Alexander, 2019. Language models as knowledge bases? In: Proceedings of EMNLP-IJCNLP. pp. 2463–2473.
Qin, Guanghui, Eisner, Jason, 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In: Proceedings of NAACL. pp. 5203–5212.
Radford, Alec, Kim, Jong Wook, Hallacy, Chris, Ramesh, Aditya, Goh, Gabriel, Agarwal, Sandhini, Sastry, Girish, Askell, Amanda, Mishkin, Pamela, Clark, Jack, et al., 2021. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020.
Ramesh, Aditya, Pavlov, Mikhail, Goh, Gabriel, Gray, Scott, Voss, Chelsea, Radford, Alec, Chen, Mark, Sutskever, Ilya, 2021. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092.
Rao, Yongming, Zhao, Wenliang, Chen, Guangyi, Tang, Yansong, Zhu, Zheng, Huang, Guan, Zhou, Jie, Lu, Jiwen, 2021. DenseCLIP: Language-guided dense prediction with context-aware prompting. arXiv preprint arXiv:2112.01518.
Rohrbach, Anna, Rohrbach, Marcus, Hu, Ronghang, Darrell, Trevor, Schiele, Bernt, 2016. Grounding of textual phrases in images by reconstruction. In: Proceedings of ECCV. Springer, pp. 817–834.
Sadhu, Arka, Chen, Kan, Nevatia, Ram, 2019. Zero-shot grounding of objects from natural language queries. In: Proceedings of ICCV. pp. 4694–4703.
Schick, Timo, Schütze, Hinrich, 2021. It's not just size that matters: Small language models are also few-shot learners. In: Proceedings of NAACL. pp. 2339–2352.
Su, Weijie, Zhu, Xizhou, Cao, Yue, Li, Bin, Lu, Lewei, Wei, Furu, Dai, Jifeng, 2019. VL-BERT: Pre-training of generic visual-linguistic representations. In: Proceedings of ICLR.
Subramanian, Sanjay, Merrill, William, Darrell, Trevor, Gardner, Matt, Singh, Sameer, Rohrbach, Anna, 2022. ReCLIP: A strong zero-shot baseline for referring expression comprehension. In: Proceedings of ACL. pp. 5198–5215.
Sun, Ximeng, Hu, Ping, Saenko, Kate, 2022. DualCoOp: Fast adaptation to multi-label recognition with limited annotations. arXiv preprint arXiv:2206.09541.
Sun, Mingjie, Xiao, Jimin, Lim, Eng Gee, Liu, Si, Goulermas, John Y, 2021. Discriminative triad matching and reconstruction for weakly referring expression grounding. TPAMI 43 (11), 4189–4195.
Tan, Hao, Bansal, Mohit, 2019. LXMERT: Learning cross-modality encoder representations from transformers. In: Proceedings of EMNLP-IJCNLP. pp. 5100–5111.
Tellex, Stefanie, Kollar, Thomas, Dickerson, Steven, Walter, Matthew, Banerjee, Ashis, Teller, Seth, Roy, Nicholas, 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In: Proceedings of AAAI. volume 25.
Tsimpoukelli, Maria, Menick, Jacob, Cabi, Serkan, Eslami, SM, Vinyals, Oriol, Hill, Felix, 2021. Multimodal few-shot learning with frozen language models. arXiv preprint arXiv:2106.13884.
Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gomez, Aidan N, Kaiser, Łukasz, Polosukhin, Illia, 2017. Attention is all you need. In: Proceedings of NeurIPS. pp. 5998–6008.
Wang, Mengmeng, Xing, Jiazheng, Liu, Yong, 2021a. ActionCLIP: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472.
Wang, Peng, Yang, An, Men, Rui, Lin, Junyang, Bai, Shuai, Li, Zhikang, Ma, Jianxin, Zhou, Chang, Zhou, Jingren, Yang, Hongxia, 2022. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In: Proceedings of ICML.
Wang, Zirui, Yu, Jiahui, Yu, Adams Wei, Dai, Zihang, Tsvetkov, Yulia, Cao, Yuan, 2021b. SimVLM: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904.
Yang, Zhengyuan, Gan, Zhe, Wang, Jianfeng, Hu, Xiaowei, Ahmed, Faisal, Liu, Zicheng, Lu, Yumao, Wang, Lijuan, 2022a. UniTAB: Unifying text and box outputs for grounded vision-language modeling. In: Proceedings of ECCV.
Yang, Hao, Lin, Junyang, Yang, An, Wang, Peng, Zhou, Chang, Yang, Hongxia, 2022b. Prompt tuning for generative multimodal pretrained models. arXiv preprint arXiv:2208.02532.
Yao, Yuan, Chen, Qianyu, Zhang, Ao, Ji, Wei, Liu, Zhiyuan, Chua, Tat-Seng, Sun, Maosong, 2022. PEVL: Position-enhanced pre-training and prompt tuning for vision-language models. arXiv preprint arXiv:2205.11169.
Yao, Yuan, Zhang, Ao, Han, Xu, Li, Mengdi, Weber, Cornelius, Liu, Zhiyuan, Wermter, Stefan, Sun, Maosong, 2021. Visual distant supervision for scene graph generation. In: Proceedings of ICCV. pp. 15816–15826.
Yu, Licheng, Poirson, Patrick, Yang, Shan, Berg, Alexander C, Berg, Tamara L, 2016. Modeling context in referring expressions. In: Proceedings of ECCV. Springer, pp. 69–85.
Yu, Fei, Tang, Jiji, Yin, Weichong, Sun, Yu, Tian, Hao, Wu, Hua, Wang, Haifeng, 2021. ERNIE-ViL: Knowledge enhanced vision-language representations through scene graphs. In: Proceedings of AAAI. volume 35, pp. 3208–3216.
Zellers, Rowan, Bisk, Yonatan, Farhadi, Ali, Choi, Yejin, 2019. From recognition to cognition: Visual commonsense reasoning. In: Proceedings of CVPR. pp. 6720–6731.
Zellers, Rowan, Lu, Ximing, Hessel, Jack, Yu, Youngjae, Park, Jae Sung, Cao, Jize, Farhadi, Ali, Choi, Yejin, 2021. MERLOT: Multimodal neural script knowledge models. In: Proceedings of NeurIPS. volume 34.
Zeng, Yawen, 2022. Point prompt tuning for temporally language grounding. In: Proceedings of SIGIR. pp. 2003–2007.
Zhang, Pengchuan, Li, Xiujun, Hu, Xiaowei, Yang, Jianwei, Zhang, Lei, Wang, Lijuan, Choi, Yejin, Gao, Jianfeng, 2021. VinVL: Revisiting visual representations in vision-language models. In: Proceedings of CVPR. pp. 5579–5588.
Zhang, Hanwang, Niu, Yulei, Chang, Shih-Fu, 2018. Grounding referring expressions in images by variational context. In: Proceedings of CVPR. pp. 4158–4166.
Zhang, Michael, Ré, Christopher, 2022. Contrastive adapters for foundation model group robustness. arXiv preprint arXiv:2207.07180.
Zhou, Kaiyang, Yang, Jingkang, Loy, Chen Change, Liu, Ziwei, 2021. Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134.
Zhou, Kaiyang, Yang, Jingkang, Loy, Chen Change, Liu, Ziwei, 2022. Conditional prompt learning for vision-language models. In: Proceedings of CVPR. pp. 16816–16825.
Zhu, Beier, Niu, Yulei, Han, Yucheng, Wu, Yue, Zhang, Hanwang, 2022. Prompt-aligned gradient for prompt tuning. arXiv preprint arXiv:2205.14865.