
CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Jack Hessel† Ari Holtzman‡ Maxwell Forbes‡ Ronan Le Bras† Yejin Choi†‡

†Allen Institute for AI
‡Paul G. Allen School of Computer Science & Engineering, University of Washington
{jackh,ronanlb}@allenai.org {ahai,mbforbes,yejin}@cs.washington.edu

arXiv:2104.08718v3 [cs.CV] 23 Mar 2022

Abstract

Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image–text compatibility, is complementary to existing reference-based metrics that emphasize text–text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker in comparison to reference-based metrics, e.g., news captions that require richer contextual knowledge.

Figure 1: Top: CLIPScore uses CLIP to assess image–caption compatibility without using references, just like humans. Bottom: This frees CLIPScore from the well-known shortcomings of n-gram matching metrics, which disfavor good captions with new words (top) and favor any captions with familiar words (bottom). Attribution: Paperclip, robot icons by Hasanudin, Adiyogi (resp.) from the Noun Project.
1 Introduction

For most text generation tasks, reference-based n-gram overlap methods are still the dominant means of automatic evaluation. For image caption generation, recent reference-based metrics have sought to transcend overlap by considering richer models of reference–candidate similarity: e.g., approximate scene graphs (Anderson et al., 2016), or allowing reference-based methods to incorporate the image (Jiang et al., 2019; Lee et al., 2020). But references can be expensive to collect, and comparing against even multiple human-authored captions for each image is often insufficient (see Figure 1). As a result, for many corpora, a significant gap remains between reference-based scoring and human quality judgments.[1]

Should we need references for the evaluation of image captions? After all, when humans assess the appropriateness of an image caption, we do so just by looking at the image and reading the candidate's text.

[1] See Elliott and Keller (2014) and Kilickaya et al. (2017) for thorough comparisons of caption generation metrics.
A recent trend in machine translation serves as inspiration: there, a key hurdle for reference-free evaluation (sometimes called quality estimation) has been estimating cross-lingual similarity between source+candidate pairs (Blatz et al., 2004; Specia et al., 2010; Mehdad et al., 2012; Specia and Shah, 2018). But recent work (Lo, 2019; Yankovskaya et al., 2019; Zhao et al., 2020) has improved correlation with human judgment not by gathering more monolingual references, but instead by utilizing cross-lingual representations learned by large-scale, pre-trained, multilingual models, e.g., LASER (Artetxe and Schwenk, 2019) or M-BERT (Devlin et al., 2019).[2]

We hypothesize that the relationships learned by pretrained vision+language models (e.g., ALIGN (Jia et al., 2021) and CLIP (Radford et al., 2021)) could similarly support reference-free evaluation in the image captioning case. Indeed, they can: we show that a relatively direct application of CLIP to (image, generated caption) pairs results in surprisingly high correlation with human judgments on a suite of standard image description benchmarks (e.g., MSCOCO (Lin et al., 2014)). We call this process CLIPScore (abbreviated to CLIP-S). Beyond direct correlation with human judgments, an information gain analysis reveals that CLIP-S is complementary both to commonly reported metrics (like BLEU-4, SPICE, and CIDEr) and to newly proposed reference-based metrics (e.g., ViLBERTScore-F (Lee et al., 2020)).

We additionally (1) propose a reference-augmented version of CLIPScore, RefCLIPScore, that achieves even higher human correlation; (2) verify that CLIP-S is sensitive to adversarially constructed image captions, where one noun phrase has been swapped for a plausible (but incorrect) distractor; and (3) construct a corpus of images that have never been posted publicly online to verify that CLIP-S is able to reconstruct human judgments on never-before-seen images.

Finally, we assess CLIP-S in the context of four case studies that diverge from context-free, literal photograph description. In two cases, CLIP-S works well: it achieves high correlation with alt-text quality rating on Twitter, and demonstrates surprising capacity to reason about clipart images+captions. For news caption generation, reference-based methods correlate best with human judgments. And, for emotive captions inspired by language use on social media, even reference-based metrics fall short.

[2] K et al. (2020), Pires et al. (2019), and Wu and Dredze (2019) explore how M-BERT learns and utilizes cross-lingual information.

2 Related Work

Reference-only image caption evaluation
In general, image caption generation models are evaluated by a suite of 5 reference-based metrics: BLEU-4 (Papineni et al., 2002) (which measures a version of precision between a candidate and the references), ROUGE-L (Lin, 2004) (which measures a version of recall), METEOR (Banerjee and Lavie, 2005) (which computes a word-level alignment), CIDEr (Vedantam et al., 2015) (which combines n-gram tf-idf weighting and stemming), and SPICE (Anderson et al., 2016) (which applies a semantic parser to a set of references, and computes similarity using the predicted scene graph).[3] Yi et al. (2020) give a method for re-weighting BERTScore (Zhang et al., 2020) specifically tuned to the image caption generation domain (we refer to their method as BERT-S++).

[3] For comparison with these metrics, we use the standard COCO evaluation tools available at https://github.com/tylin/coco-caption.

Reference+image caption evaluation
Recent metrics incorporate image–text grounding features in addition to references: TIGEr (Jiang et al., 2019) uses a pretrained SCAN model (Lee et al., 2018), and ViLBERTScore-F (Lee et al., 2020) uses a pretrained ViLBERT model (Lu et al., 2019) that is also fine-tuned on 12 downstream vision and language tasks (Lu et al., 2020). Our work provides perspective on the next logical extension: instead of incorporating visual–textual interactions in addition to references, can we ignore references entirely?

Self-retrieval for image captioning
Prior works have proposed incorporating a self-retrieval loss into caption generation, with the intuition that good captions should be able to uniquely identify their images with high accuracy (Dai and Lin, 2017; Luo et al., 2018; Liu et al., 2018); monitoring this type of loss can provide insight into how distinctive the captions are according to the model itself. CLIP-S is similar in spirit, but distinct for its utility as an extrinsic evaluation metric like BLEU-4 or CIDEr.

Reference-free evaluation
In addition to the machine translation cases highlighted in the introduction, reference-free evaluations have been proposed for other generation tasks, including summarization
(Louis and Nenkova, 2013; Peyrard and Gurevych, 2018; Sun and Nenkova, 2019) and dialogue (Tao et al., 2018; Mehri and Eskenazi, 2020). These metrics can be supervised, relying on human judgments for quality estimation, or less-supervised, relying on pre-trained model representations. For image captioning, a version of VIFIDEL (Madhyastha et al., 2019) was proposed for reference-free evaluation; however, VIFIDEL, computed based on a list of detected objects in the image from a fixed object vocabulary, generally produces less correlation with human ratings vs. reference-based metrics.

3 CLIPScore

Model Details.
CLIP (Radford et al., 2021) is a cross-modal retrieval model trained on 400M (image, caption) pairs gathered from the web. 500K search queries, consisting of common unigrams/bigrams, named entities, etc., were executed on a search engine. For each query, up to 20K (image, caption) pairs were collected.

The model we use is the ViT-B/32 version.[4] It represents images via a Vision Transformer (Vaswani et al., 2017; Dosovitskiy et al., 2021), which forgoes convolutional filters in favor of self-attention maps computed between a 7 by 7 grid of image patches, which evenly divides a 224 by 224 pixel input image. This model has 12 transformer layers and 86M parameters. The text is similarly represented by a 12-layer transformer trained over a vocab of 49K BPE token types (Sennrich et al., 2016) (and is more fully described in Radford et al. (2019)). Both the text and image networks output a single vector; these vectors aim to represent the content of an input caption or an image, respectively. In the case of ViT-B/32, these vectors are 512-D. The model's weights are trained to maximize the scaled cosine similarity of truly corresponding image/caption pairs while simultaneously minimizing the similarity of mismatched image/caption pairs using InfoNCE (Sohn, 2016; Oord et al., 2018). We hold this set of weights fixed for our experiments.

[4] We expect that more powerful, larger versions of the model, if released at a later date, could perform better.

Evaluating Caption Generations with CLIP.
To assess the quality of a candidate generation, we pass both the image and the candidate caption through their respective feature extractors. Then, we compute the cosine similarity of the resultant embeddings.[5] We found that prefixing candidates with the prompt "A photo depicts" improved correlations slightly (and is our recommended/standard configuration), though "A photo of", the recommended prompt from Radford et al. (2021), worked well too. Following Zhang et al. (2020), we perform a re-scaling operation.[6] For an image with visual CLIP embedding v and a candidate caption with textual CLIP embedding c, we set w = 2.5 and compute CLIP-S as:

    CLIP-S(c, v) = w · max(cos(c, v), 0)

To compute corpus-level CLIP-S, we simply average over (candidate, image) pairs. Note that this evaluation does not depend on underlying references. The runtime of CLIP-S with the ViT-B/32 backbone is fast: on our single consumer GPU and hard drive, roughly 4K image–candidate pairings can be processed per minute.

[5] More sophisticated CLIP configurations, e.g., region-level/token-level correspondence models, did not achieve better performance.
[6] While the cosine similarity can, in theory, range over [−1, 1], (1) we never observed a negative cosine similarity; and (2) we generally observe values ranging from roughly zero to roughly .4. The particular value of w we advocate for, w = 2.5, attempts to stretch the range of the score distribution to [0, 1]. For more details and justification for our re-scaling, including a demonstration of generality across several corpora, see Appendix B.
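To make the scoring procedure concrete, below is a minimal sketch of caption-level CLIP-S, assuming the open-source openai/CLIP package with the ViT-B/32 weights; the function name and prompt handling follow the description above, but details such as batching in the released evaluation code may differ.

```python
# Minimal CLIP-S sketch, assuming the openai/CLIP package
# (pip install git+https://github.com/openai/CLIP.git) and PyTorch.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_s(image_path: str, candidate: str, w: float = 2.5) -> float:
    """CLIP-S(c, v) = w * max(cos(c, v), 0), using the "A photo depicts" prompt."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(["A photo depicts " + candidate]).to(device)
    with torch.no_grad():
        v = model.encode_image(image).float()
        c = model.encode_text(text).float()
    # cosine similarity of the two 512-D embeddings, floored at zero
    v = v / v.norm(dim=-1, keepdim=True)
    c = c / c.norm(dim=-1, keepdim=True)
    return w * max((v * c).sum().item(), 0.0)
```

Corpus-level CLIP-S is then simply the mean of clip_s over all (image, candidate) pairs.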
RefCLIPScore
CLIP-S can additionally be extended to incorporate references, if they are available. We extract vector representations of each available reference by passing them through CLIP's text transformer; the result is the set of vector representations of all references, R. Then, RefCLIPScore is computed as a harmonic mean of CLIP-S and the maximal reference cosine similarity, i.e.,

    RefCLIP-S(c, R, v) = H-Mean(CLIP-S(c, v), max(max_{r ∈ R} cos(c, r), 0))
4 Benchmark Captioning Evaluations

We first evaluate on a set of literal description corpora. Broadly, the captions in these corpora aim to identify and highlight the literal, salient objects/actions in a photographic image, presented without additional context.[7]

[7] See Berg et al. (2012) for a statistical exploration of salience in such a corpus.
                                          τc
    BLEU-1                               32.3
    BLEU-4                               30.8
    ROUGE-L                              32.3
    BERT-S (RoBERTa-F)                   39.2
    METEOR                               41.8
    CIDEr                                43.9
    SPICE                                44.9
    LEIC (τb)* (Cui et al., 2018)        46.6
    BERT-S++ (Yi et al., 2020)           46.7
    TIGEr (Jiang et al., 2019)           49.3
    NUBIA* (Kane et al., 2020)           49.5
    ViLBERTScore-F (Lee et al., 2020)    50.1
    CLIP-S (no refs)                     51.2
    RefCLIP-S                            53.0

Table 1: Flickr8K-Expert correlations with human judgment. All metrics use 4-5 ground truth references, except for CLIP-S (which uses none). * indicates a result reported in prior work.

                                          τb
    BLEU-4                               16.9
    CIDEr                                24.6
    METEOR                               22.2
    ROUGE-L                              19.9
    SPICE                                24.4
    BERT-S (RoBERTa-F)                   22.8
    LEIC*                                29.5
    CLIP-S (no refs)                     34.4
    RefCLIP-S                            36.4

Table 2: Flickr8K-CF correlations with human judgment. * indicates a result reported in prior work.
4.1 Caption-level likert judgments

We first explore three corpora consisting of human likert-scale judgments at the level of individual image/caption pairs. Flickr8K-Expert (Hodosh et al., 2013) contains 17K "expert" human judgments over 5664 images: humans graded captions on a scale of 1 to 4 (4 = "caption describes the image without any errors"; 1 = "caption is unrelated to the image"). Flickr8K-CF is a set of 145K binary quality judgments gathered from CrowdFlower over 48K (image, caption) pairs (1K unique images). Each pair has at least 3 binary judgments, and we take the mean proportion of "yes" annotations as a score for each pair to compute correlations.

Composite (Aditya et al., 2015) contains 12K human judgments over images from MSCOCO (2007 images), Flickr8k (997 images), and Flickr30k (Young et al., 2014) (991 images). Each image originally has five references, but one of the references was selected to be rated by humans in the set (and so we remove it from the reference set when computing metrics; this differs from some prior work, see Appendix A for why we consider the more difficult setting). For Composite and Flickr8K judgments, we compute correlation between each metric and the human ratings using Kendall τ.
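As a concrete, hypothetical-data sketch of this protocol, the caption-level correlations can be computed with an off-the-shelf Kendall τ implementation; we assume scipy >= 1.7, which exposes the τ-b/τ-c choice via the variant argument.

```python
# Sketch of the caption-level correlation computation over hypothetical scores;
# assumes scipy >= 1.7 for the `variant` argument of kendalltau.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
metric_scores = rng.random(100)                # one metric score per (image, candidate) pair
human_ratings = rng.integers(1, 5, size=100)   # e.g., Flickr8K-Expert Likert ratings

tau_c, _ = kendalltau(metric_scores, human_ratings, variant="c")  # Flickr8K-Expert, Composite
tau_b, _ = kendalltau(metric_scores, human_ratings, variant="b")  # Flickr8K-CF, following Cui et al. (2018)
print(round(100 * tau_c, 1), round(100 * tau_b, 1))
```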
Results
The results for Flickr8K-Expert are given in Table 1, for Flickr8K-CF in Table 2 (in τb, following Cui et al. (2018)), and for Composite in Table 3. For the caption-level corpora we consider, CLIP-S without references achieves higher correlation with human judgment compared to previously proposed metrics that rely on references. Additionally, in all cases, RefCLIP-S improves correlation even further. This provides strong evidence that, in terms of correlating with human judgment at the caption level for these literal photographic image description tasks, a relatively direct application of CLIP can serve as a strong automatic evaluation metric.

4.2 Pairwise ranking on Pascal-50S

In Pascal-50S (Vedantam et al., 2015), raters made pairwise preference judgments between pairs of sentences. There are 4K sentence pairs total, split evenly across four categories, e.g., two human captions, two machine captions, etc. For each pair, 48 human pairwise judgments were gathered.[8] Following prior work, instead of computing correlation coefficients, we compute accuracy, i.e., we consider the caption preferred by a majority of annotators to be correct, and measure how often the evaluation metric assigns a higher score to that member of the pair. Ties are broken randomly. Due to random selection of 5 references among the 48 candidates to serve as ground truth for the reference-based metrics, the results may differ slightly from prior work (we average over 5 random draws of references).

The results are given in Table 4. Evaluation is split across four categories of caption pairs (detailed in the table caption). CLIP-S and RefCLIP-S generally achieve high performance in all categories.

[8] Instead of being presented with the image, annotators were presented only with a reference (and the two candidates to rank).
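The pairwise accuracy computation itself is simple; here is a small sketch under the protocol described above, with hypothetical score arrays and random tie-breaking.

```python
# Sketch of the Pascal-50S pairwise protocol: a metric is credited when it scores
# the majority-preferred caption higher; ties are broken randomly.
import random

def pairwise_accuracy(scores_a, scores_b, majority_prefers_a, seed=0):
    rng = random.Random(seed)
    correct = 0
    for a, b, prefers_a in zip(scores_a, scores_b, majority_prefers_a):
        pick_a = rng.random() < 0.5 if a == b else a > b   # random tie-breaking
        correct += int(pick_a == prefers_a)
    return 100.0 * correct / len(scores_a)

# hypothetical usage: metric scores for each member of the caption pairs,
# plus booleans indicating which member the majority of annotators preferred
print(pairwise_accuracy([0.8, 0.5, 0.6], [0.3, 0.5, 0.9], [True, False, False]))
```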
                               τc
    BLEU-1                    31.3
    BLEU-4                    30.6
    ROUGE-L                   32.4
    BERT-S (RoBERTa-F)        30.1
    METEOR                    38.9
    CIDEr                     37.7
    SPICE                     40.3
    BERT-S++ *                44.9
    TIGEr                     45.4
    ViLBERTScore-F            52.4
    CLIP-S (no refs)          53.8
    RefCLIP-S                 55.4

Table 3: Composite correlations with human judgment. All metrics use between 4 and 5 ground truth references, except for CLIP-S (which uses none). In contrast to some prior work, we consider a harder setting, and remove the candidate from the reference set (see Appendix A for details; for comparison purposes, RefCLIP-S achieves τc = 60.0 in the easier setting). * indicates a result reported in prior work.

                              HC    HI    HM    MM   Mean
    length                   51.7  52.3  63.6  49.6  54.3
    BLEU-4                   60.4  90.6  84.9  54.7  72.6
    SPICE                    63.6  96.3  86.7  68.3  78.7
    METEOR                   63.8  97.7  93.7  65.4  80.1
    ROUGE-L                  63.7  95.3  92.3  61.2  78.1
    CIDEr                    65.1  98.1  90.5  64.8  79.6
    BERT-S (RoBERTa-F)       65.4  96.2  93.3  61.4  79.1
    TIGEr *                  56.0  99.8  92.8  74.2  80.7
    ViLBERTScore-F *         49.9  99.6  93.1  75.8  79.6
    BERT-S++ *               65.4  98.1  96.4  60.3  80.1
    CLIP-S (no refs)         56.5  99.3  96.4  70.4  80.7
    RefCLIP-S                64.5  99.6  95.4  72.8  83.1

Table 4: Pascal50S accuracy results (5 references). HC = two human correct captions; HI = both captions are human written, but one is wrong; HM = both captions are for the image, but one is written by a human, one by an algorithm; MM = both captions are for the image, and both are written by an algorithm. * indicates a result reported in prior work: the comparability of our results to *-rows is subject to the (arbitrary) sample of references. We average our results over 5 random samples (but CLIP-S doesn't change because it doesn't use references).
4.3 System-level correlation for MSCOCO


CLIP-S achieves high correlation with human judgments at the system level as well: we evaluate the outputs of systems submitted to the 2015 MSCOCO Image Captioning Challenge (Vinyals et al., 2016). We have some concerns with the standard evaluation setup on this corpus, mostly related to the fact that it consists of only 12 datapoints (see supplementary for more discussion). Nonetheless, following the standard procedure, we correlate CLIP-S and RefCLIP-S with two metrics: "the percentage of captions that are evaluated as better or equal to a human caption (M1)" and the percentage of captions that pass the "Turing Test" (M2), respectively. CLIP-S achieves Spearman ρ_M1/ρ_M2 = .59/.63 and RefCLIP-S achieves ρ_M1/ρ_M2 = .69/.74 (all p < .05) with these system-level metrics.

4.4 Sensitivity of CLIP-S to hallucination

Prior work has demonstrated that, for many literal description tasks, humans often prefer correctness in captions over specificity (Rohrbach et al., 2018, 2017).[9] Thus, understanding if and how evaluation metrics handle image captions that contain incorrect "hallucinations," e.g., references to objects that are not depicted, is important. We use a sample of image captions from the FOIL dataset, constructed by Shekhar et al. (2017), to test how sensitive CLIP-S is to detecting potentially subtle inaccurate details in descriptions. This corpus consists of modified reference captions from MSCOCO that have a single noun phrase adversarially swapped out to make the FOIL caption incorrect, e.g., switching "motorcycle" for "bicycle".

To adapt the corpus to our setting, for each of the 32K test images, we sample a (FOIL, true) pair, and compute the accuracy of each evaluation metric in its capacity to assign a higher score to the true candidate versus the FOIL. To compute reference-based metrics, we give access to the MSCOCO reference captions for the image (excluding the true candidate being assessed against the FOIL). While the paired setting we consider isn't identical, Shekhar et al. (2017) estimate roughly 92% human agreement on the unpaired version of the task, relative to a 50/50 random guessing baseline.

[9] This is not always the case: MacLeod et al. (2017) show there is a range of opinion among a sample of low vision and blind users of social media.
Table 5 contains the results. In this setting, having access to more annotation is quite helpful for reference-based metrics, e.g., the accuracies of SPICE and BLEU-4 increase by over ten points when shifting from one to four references. But in the reference-limited setting, CLIP-S, without any references, outperforms all metrics except for BERT-S (RoBERTa-F). And, RefCLIP-S works best in all cases. Overall, we corroborate Rohrbach et al. (2018)'s finding that "object hallucination can not be always predicted based on the traditional sentence metrics" using a corpus derived from Shekhar et al. (2017), particularly in the case where there are few references available. However, CLIP-S and RefCLIP-S offer a performance improvement in the pairwise setting.

                            1-ref   4-ref
    length                   50.2    50.2
    BLEU-4                   66.5    82.6
    METEOR                   78.8    85.4
    ROUGE-L                  71.7    79.3
    CIDEr                    82.5    90.6
    SPICE                    75.5    86.1
    BERT-S (RoBERTa-F)       88.6    92.1
    CLIP-S (no refs)         87.2    87.2
    RefCLIP-S                91.0    92.6

Table 5: Accuracy of evaluation metrics in the pairwise FOIL hallucination detection setting. All reference-based metrics are given access to either one or four references.
4.5 Sensitivity of CLIP-S to memorization

One concern with model-based scoring methods is memorization, i.e., if a model's weights are pretrained using a large corpus, there's a risk that data used at evaluation time have already been seen at pretraining time. While Radford et al. (2021) conduct a train-test overlap analysis and find that CLIP is unlikely to succeed because of memorization, we nonetheless conduct an experiment with images CLIP has never seen before.

The authors of this work created a set of 250 images that have never been posted to the Internet by aggregating personal photographs. The set contains a variety of Flickr-like situations, e.g., nature scenes, animals, city streets, objects, etc. For each image, we collect two automatically generated captions: one from a commercial API, Microsoft Azure Cognitive Services (v 3.1),[10] and one from Luo et al. (2018)'s pretrained model, which is trained to maximize CIDEr score with a self-critical baseline.[11] Then, for each image, three authors of this work independently selected which caption described the image content more accurately. Relative to a 50% random baseline (and a 72% length baseline of selecting the shorter caption), CLIP-S correctly recovers the majority human preference in 86% of cases. Human agreement for this corpus is 93%.[12]

While this setup cannot definitively refute the notion that CLIP works well because it has memorized images, we hope the results here contribute to the evolving discussion about the nature of generalization for web-scale pretrained models.

[10] https://azure.microsoft.com/en-us/services/cognitive-services/
[11] We use the ResNet101 pretrained version, which achieves 1.05 CIDEr and 0.19 SPICE on the COCO validation set.
[12] Raters preferred the Microsoft captions to the ResNet101 model 81% of the time.

4.6 Which metrics should I report?

Most caption generation works report multiple metrics, each of which (presumably) correlates with human judgment to different degrees. But it's not always clear if individual metrics capture distinct or redundant dimensions of human judgment. For example, while CLIP-S and ViLBERTScore-F both produce high correlations, are they redundant or complementary?

We seek a (minimal) set of metrics that explains the most variance in human judgment. To find this set, we undertake a forward selection on a set of ten candidate metrics comprising six widely-reported metrics,[13] and four newer metrics: BERT-S (RoBERTa-F), TIGEr, ViLBERTScore-F, and CLIP-S (we also include experiments starting with RefCLIP-S instead of CLIP-S). Starting from an empty set, we perform an iterative greedy selection by picking the most informative additional metric to add.[14] To estimate variance, we repeat the forward-selection process 10 times with bootstrap re-sampled versions of the corpus.

[13] BLEU-1, BLEU-4, METEOR, CIDEr, ROUGE-L, SPICE.
[14] Our criterion is how much additional R^2 correlation with human judgment a metric adds according to a linear regression. We use sklearn (Pedregosa et al., 2011)'s forward selection, which applies 5-fold cross-validation at each step.
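A small sketch of this selection procedure, under the assumption that scikit-learn's SequentialFeatureSelector (>= 0.24) matches the forward-selection-with-5-fold-CV setup described in footnote [14]; the data here are placeholders, and the per-iteration ordering is recovered by growing the selected set one metric at a time.

```python
# Sketch of greedy forward selection of metrics with bootstrap re-sampling.
# Assumes scikit-learn >= 0.24; X/y are hypothetical stand-ins for the ten
# per-caption metric scores and the human Likert ratings.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((2000, 10))      # columns: BLEU-1, BLEU-4, ..., CLIP-S (hypothetical)
y = rng.random(2000)            # human ratings (hypothetical)

for b in range(10):                                  # 10 bootstrap re-samples
    idx = rng.integers(0, len(y), size=len(y))
    selected, order = set(), []
    for k in range(1, 6):                            # grow the set to recover selection order
        sfs = SequentialFeatureSelector(
            LinearRegression(), n_features_to_select=k,
            direction="forward", cv=5, scoring="r2")
        sfs.fit(X[idx], y[idx])
        new = set(np.flatnonzero(sfs.get_support())) - selected
        order += sorted(new)
        selected |= new
    print(f"bootstrap {b}: selection order {order}")
```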
Figure 2: R^2 for the forward-selection regression of metrics on human Likert ratings for two corpora. Forward selection tends to identify both CLIP-S and RefCLIP-S early on; other informative and complementary metrics include ViLBERTScore-F and SPICE. Panels: (a) Composite, (b) Flickr8k-Expert; x-axis: importance rank (#1–#5).

Figure 2 shows the information gain that results from running this experiment on the Composite and Flickr8K-Expert corpora; we also show which metric is most commonly selected at each iteration (earlier = more information gain). For Composite, CLIP-S (or RefCLIP-S) is always selected first, followed by ViLBERTScore-F, and then (most commonly) BERT-S (RoBERTa-F). For Flickr8k-Expert, the top three choices are always CLIP-S (or RefCLIP-S), ViLBERTScore-F, and SPICE. While CLIP-S and ViLBERTScore-F tend to be the most informative metrics, (1) while they are correlated, they are not purely redundant; and (2) image-unaware, reference-based metrics like SPICE can still be useful.

In summary, these results suggest that evaluation metrics like CLIP-S, which take into account visual content, indeed capture axes of human judgment not currently covered by text-only reference-based metrics. For the literal image description evaluation settings we consider, a reasonable mix of metrics to report is at least one image-aware metric (e.g., CLIP-S) plus a strong reference-only metric (e.g., SPICE).

5 Case Studies Using CLIPScore

Our results thus far have demonstrated that CLIP encodes information useful for evaluating literal image description tasks. But, reference-based metrics may a priori seem more adaptable versus CLIP-S. Does CLIP-S correlate with human judgment beyond cases like MSCOCO and Flickr8K?

To address this question, we consider four case studies, exploring the correlation between CLIP-S and human judgment across "divergent" image description datasets. These corpora qualitatively differ from the more popular domains explored in §4, either because the images are not "everyday" images from Flickr, or because the captions are not literal descriptions (Figure 3 illustrates).

Figure 3: Instances from our four case-study corpora (AltText, Personality Captions, Abstract-50S, and GoodNews).

5.1 Alt-Text ratings from Twitter

When uploading an image alongside a tweet, users of Twitter have the option of providing alternative text: while few use this feature (Gleason et al. (2019) find that fewer than .1% of image tweets have alt-text), its broader adoption might someday make social media more accessible for low vision and blind users. We measure CLIP-S's capacity to reconstruct a set of 2.8K human judgments of alt-text quality. This corpus was collected and rated by the authors of Gleason et al. (2019, 2020). Each alt-text was rated on a scale of 0 to 3 in terms of its probable utility as an alt-text. While the human raters themselves are sighted and thus cannot directly assess the utility of a given alt-text to a low vision or blind user, they are experts in designing and evaluating alt-text systems. Tweets were sampled from a mix of the Twitter FireHose API and the timelines of low vision and blind users of the site. The images, qualitatively, are a broader mix of web content in comparison to Flickr-like domains, e.g., screenshots, memes, etc. Alt-text candidates are a mix of user-uploaded and machine-generated. The corpus contains no references, but for the purposes of comparison to reference-based metrics, we (programmatically) treat any textual context of the tweet as a reference.

CLIP-S achieves 48.4 τc correlation with the human judgements. In contrast, likely due to the unreliability of tweet texts as viable alt-texts, reference-based methods struggle: the best-performing purely reference-based metric, BERT-S (RoBERTa-F) (which achieves 15 τc), under-performs relative to the length baseline (which achieves 25 τc). While gathering high-quality, contextual reference alt-texts is a promising avenue for future work,[15] CLIP-S offers a promising evaluation metric candidate in this domain.

[15] See Stangl et al. (2020), who conducted user studies across six domains.

5.2 Abstract-50S

We assess CLIP-S's capacity to generalize to abstract, non-photographic clip-art images using Abstract-50S (Vedantam et al., 2015). This dataset pairs clip-art images (originally constructed by Zitnick and Parikh (2013)) with 48 human-written reference captions.
15
We use sklearn (Pedregosa et al., 2011)’s forward selection, See Stangl et al. (2020), who conducted user-studies
which applies 5-fold cross-validation at each step. across six domains.
pairs clip-art images (originally constructed by Zit- of the time.16 Our takeaway: when given a direct
nick and Parikh (2013)) with 48 human-written ref- description and a more engaging, non-literal cap-
erence captions. These images depict two cartoon tion, CLIP-S will generally prefer the literal.
characters, Mike and Jenny, in various outdoor situ- For (2): CLIP-S performs slightly better than ran-
ations, e.g., playing sports, having a picnic, etc. For dom, e.g., 57% over 2.5K human pairwise judg-
400 human-written candidate caption pairs (200 ments comparing two neural generator models:
pairs are from the same image, 200 are from dif- TransResNet (ResNeXt-IG-3.5B) vs. TransRes-
ferent images), human judgments were collected: Net (ResNet-152) (see Shuster et al. (2019) Table
annotators were instructed to choose which of the 7, Row 5), but no better than a length-only base-
paired captions were more similar to each reference line (also 57%). Notably, even reference-based
caption, so 48 judgments were collected for each metrics fail to provide correlation with pairwise
candidate pair (for a total of 19200). human judgment of engagingness on this corpus:
We compare CLIP-S to several reference-based e.g., BLEU-4, CIDEr, and SPICE agree with human
metrics when given access to a random sample of judgment 52%/53%/51% when provided with one
five reference captions. Following our procedure personality-primed reference. Our takeaway: when
for Pascal-50S, we randomly re-sample 5 times, given two engaging, non-literal descriptions, both
and report average pairwise accuracy. Two base- CLIP-S and traditional reference-based metrics fail
lines (BL) both achieve 53: length-only (i.e., saying to predict which humans will judge to be more
the longer caption is better); and randomly shuf- engaging.
fling images as input to CLIP-S (so that it cannot
rely on meaningful visual-textual interactions). 5.4 News image captioning
Biten et al. (2019) consider caption generation for
BL BLEU-4 CIDEr METEOR BERT-S CLIP-S (no refs)
images from New York Times articles; their task
53 71 79 79 73 68
differs from MSCOCO because 1) 95% of captions
Overall, while CLIP-S underperforms relative to contain at least one named entity, e.g., a politician,
the reference-based metrics, it outperforms the celebrity, or place; and 2) captions generally “do
baselines by a wide margin. This result suggests not describe scene objects, but rather offer a contex-
that CLIP-S is capable of reasoning about visual- tualized interpretation of the scene." They collected
textual interactions, even in non-photographic im- 2.1K pairwise human judgments over 106 images
ages. that compare the performance of two news image
captioning models. For each image, 20 annotators
were instructed to pick which of two model genera-
5.3 Personality Captions tions was closer to the ground-truth caption (they
Inspired by language use on social media, Shuster were also presented with the image itself). We com-
et al. (2019) collected image captions by prompt- pare metrics in terms of their accuracy in matching
ing annotators with a “personality" (e.g., dramatic, human judgment between the two candidates.
sympathetic, sad, etc.) and asking them to “write Reference-based metrics dominate: METEOR and
a comment in the context of [a] given personality BLEU-4 achieve the highest accuracies of 93 and 91
trait... about an image that someone else would find respectively, whereas CLIP-S achieves only slightly
engaging." To evaluate their models, the authors above random at 65. Qualitatively, CLIP-S succeeds
collected pairwise human judgments, where evalu- when there are visually-verifiable content, e.g.,
ators were instructed to “to pick which comment is matching black-and-white photos to older dates
the most engaging". We assess CLIP-S in two capac- (e.g., picking 1933 vs. 1977, in one case), and
ities: (1) does it prefer literal descriptions, or the matching particularly iconic celebrities (e.g., it con-
less-literal, more engaging, personality captions?; fidently identifies Muhammad Ali boxing).17 But,
and (2) if it is given two personality captions, can it its most common failure case are captions that may
predict which humans judge to be more engaging? 16
Preliminary prompt-engineering experiments (e.g., “when
For (1): Over a set of 2.4K “traditional" vs. per- I look at this photo, I feel [PERSONALITY] and think [CAP-
sonality captions pairwise ratings, humans rate the TION]") could not overcome this.
17
Luo et al. (2021)’s recent experiments quantitatively
personality captions to be more engaging 65% of demonstrate that CLIP is capable of reasoning about real-
the time, whereas CLIP-S prefers the traditional 80% world entities within news images.
simply be unverifiable given only the image content. For example: CLIP-S selects "The dining room at Elle Decor" for an image of a room, but annotators preferred a caption that mentioned "the Junior League of New York;" the ground truth caption reveals why the image was pictured in the first place: "A Manhattan home on a May 7 tour by the Junior League of New York."

[17] Luo et al. (2021)'s recent experiments quantitatively demonstrate that CLIP is capable of reasoning about real-world entities within news images.

Overall, we do not advocate for reference-free evaluation in this case, especially because our results suggest that (at least for this particular set of annotations) reference-based n-gram overlap metrics achieve high correlation with human judgment.

6 Conclusion

For literal image description tasks, CLIPScore achieves high correlation with human judgments of caption quality without references when used in an off-the-shelf fashion. Additional experiments in divergent domains suggest that CLIP can also reason about non-photographic clip-art, and serves as a reasonable option for reference-free evaluation in the alt-text case. Promising future work includes exploring 1) CLIP-S as a reinforcement learning reward for literal caption generators; and 2) whether a small amount of labelled human rating data could help CLIP-S adapt to domains where it struggles, e.g., engagingness prediction. We hope our work can contribute to the ongoing discussion about the role of pretrained models in generation evaluation.

Reference-free evaluation runs some risks. Much like BERTScore, model-based metrics like CLIP-S reflect the biases of the pre-training data. While we believe that using CLIP-S as an offline evaluation metric for literal caption quality accords with the recommendations of CLIP's model card[18] (Mitchell et al., 2019), Agarwal et al. (2021)'s study demonstrates that CLIP can make disproportionate incorrect classifications of people, e.g., "male images were misclassified into classes related to crime." Exploring potential social biases of candidate generations (as in, e.g., Hendricks et al. (2018)) remains paramount, particularly if a system is to be deployed.

[18] https://github.com/openai/CLIP/blob/main/model-card.md

Contemporaneous work
While this work was under submission, two alternate reference-free evaluation metrics for image caption generation were introduced: FAIEr (Wang et al., 2021) (based on a pretrained object detector, and fine-tuned on MSCOCO) and UMIC (Lee et al., 2021) (based on UNITER (Chen et al., 2020)). UMIC, in particular, produces similar correlations with human judgment on the literal image description tasks (§4) compared to CLIP-S, but with the complementary approach of fine-tuning on synthetic negative captions. Future work would be well-suited to explore if the textual data augmentations proposed by Lee et al. (2021) (1) result in a metric that complements or overlaps with the non-finetuned CLIP-S (§4.6); and (2) could be extended beyond cases of literal description (§5).

Acknowledgements

This research is supported in part by the DARPA MCS program through NIWC Pacific (N66001-19-2-4031), the DARPA SemaFor program, and the Allen Institute for AI. We additionally thank Ximing Lu, Swabha Swayamdipta, Youngjae Yu, and the anonymous EMNLP reviewers for the helpful comments, thoughts, and discussions. Finally, we thank Jin-Hwa Kim, who in March 2022 helped discover a now-fixed discrepancy for the Pascal-50S results; see Appendix A.

References

Somak Aditya, Yezhou Yang, Chitta Baral, Cornelia Fermuller, and Yiannis Aloimonos. 2015. From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv preprint arXiv:1511.03292.

Sandhini Agarwal, Gretchen Krueger, Jack Clark, Alec Radford, Jong Wook Kim, and Miles Brundage. 2021. Evaluating CLIP: Towards characterization of broader capabilities and downstream implications. arXiv preprint arXiv:2108.02818.

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In ECCV. Springer.

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. TACL, 7:597–610.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL Workshop on Evaluation Measures for MT and Summarization.

Alexander C. Berg, Tamara L. Berg, Hal Daumé III, Jesse Dodge, Amit Goyal, Xufeng Han, Alyssa Mensch, Margaret Mitchell, Aneesh Sood, Karl Stratos, and Kota Yamaguchi. 2012. Understanding and predicting importance in images. In CVPR.
main/model-card.md dicting importance in images. In CVPR.
Ali Furkan Biten, Lluis Gomez, Marçal Rusinol, and Ming Jiang, Qiuyuan Huang, Lei Zhang, Xin Wang,
Dimosthenis Karatzas. 2019. Good news, everyone! Pengchuan Zhang, Zhe Gan, Jana Diesner, and Jian-
context driven entity-aware captioning for news im- feng Gao. 2019. TIGEr: text-to-image grounding
ages. In CVPR. for image caption evaluation. In EMNLP.

John Blatz, Erin Fitzgerald, George Foster, Simona Karthikeyan K, Zihan Wang, Stephen Mayhew, and
Gandrabur, Cyril Goutte, Alex Kulesza, Alberto San- Dan Roth. 2020. Cross-lingual ability of multilin-
chis, and Nicola Ueffing. 2004. Confidence estima- gual BERT: An empirical study. In ICLR.
tion for machine translation. In COLING.
Hassan Kane, Muhammed Yusuf Kocyigit, Ali Abdalla,
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Pelkins Ajanoh, and Mohamed Coulibali. 2020. NU-
Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and BIA: NeUral based interchangeability assessor for
Jingjing Liu. 2020. Uniter: Universal image-text text generation. In 1st Workshop on Evaluating NLG
representation learning. In ECCV. Evaluation.

Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis,
and Serge Belongie. 2018. Learning to evaluate im- and Erkut Erdem. 2017. Re-evaluating automatic
age captioning. In CVPR. metrics for image captioning. In EACL.

Bo Dai and Dahua Lin. 2017. Contrastive learning for Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt,
image captioning. In NeurIPS. Trung Bui, and Kyomin Jung. 2021. UMIC: an
unreferenced metric for image captioning via con-
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and trastive learning. In ACL.
Kristina Toutanova. 2019. BERT: pre-training of
deep bidirectional transformers for language under- Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt,
standing. In NAACL. Doo Soon Kim, Trung Bui, and Kyomin Jung. 2020.
Vilbertscore: Evaluating image caption using vision-
Alexey Dosovitskiy, Lucas Beyer, Alexander and-language bert. In First Workshop on Evaluation
Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, and Comparison of NLP Systems.
Thomas Unterthiner, Mostafa Dehghani, Matthias
Minderer, Georg Heigold, Sylvain Gelly, Jakob Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu,
Uszkoreit, and Neil Houlsby. 2021. An image and Xiaodong He. 2018. Stacked cross attention for
is worth 16x16 words: Transformers for image image-text matching. In ECCV.
recognition at scale. In ICLR. Chin-Yew Lin. 2004. Rouge: A package for auto-
Desmond Elliott and Frank Keller. 2014. Comparing matic evaluation of summaries. Text Summarization
automatic evaluation measures for image descrip- Branches Out.
tion. In ACL. Tsung-Yi Lin, Michael Maire, Serge Belongie, James
Hays, Pietro Perona, Deva Ramanan, Piotr Dollár,
Cole Gleason, Patrick Carrington, Cameron Cassidy,
and C Lawrence Zitnick. 2014. Microsoft COCO:
Meredith Ringel Morris, Kris M Kitani, and Jef-
Common objects in context. In ECCV. Springer.
frey P Bigham. 2019. “it’s almost like they’re trying
to hide it": How user-provided image descriptions Xihui Liu, Hongsheng Li, Jing Shao, Dapeng Chen,
have failed to make twitter accessible. In WWW. and Xiaogang Wang. 2018. Show, tell and discrim-
inate: Image captioning by self-retrieval with par-
Cole Gleason, Amy Pavel, Emma McCamey, Christina tially labeled data. In ECCV.
Low, Patrick Carrington, Kris M Kitani, and Jef-
frey P Bigham. 2020. Twitter a11y: A browser ex- Chi-kiu Lo. 2019. Yisi-a unified semantic mt quality
tension to make twitter images accessible. In CHI. evaluation and estimation metric for languages with
different levels of available resources. In Fourth
Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Conference on Machine Translation.
Trevor Darrell, and Anna Rohrbach. 2018. Women
also snowboard: Overcoming bias in captioning Annie Louis and Ani Nenkova. 2013. Automatically
models. In Proceedings of the European Conference assessing machine summary content without a gold
on Computer Vision (ECCV), pages 771–787. standard. Computational Linguistics, 39(2):267–
300.
Micah Hodosh, Peter Young, and Julia Hockenmaier.
2013. Framing image description as a ranking task: Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee.
Data, models and evaluation metrics. JAIR, 47:853– 2019. ViLBERT: Pretraining task-agnostic visi-
899. olinguistic representations for vision-and-language
tasks. In NeurIPS.
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana
Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi
Zhen Li, and Tom Duerig. 2021. Scaling up visual Parikh, and Stefan Lee. 2020. 12-in-1: Multi-task
and vision-language representation learning with vision and language representation learning. In
noisy text supervision. In ICML. CVPR.
Grace Luo, Trevor Darrell, and Anna Rohrbach. Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns,
2021. NewsCLIPpings: automatic generation of Trevor Darrell, and Kate Saenko. 2018. Object hal-
out-of-context multimodal media. arXiv preprint lucination in image captioning. In EMNLP.
arXiv:2104.05893.
Anna Rohrbach, Atousa Torabi, Marcus Rohrbach,
Ruotian Luo, Brian Price, Scott Cohen, and Gregory Niket Tandon, Christopher Pal, Hugo Larochelle,
Shakhnarovich. 2018. Discriminability objective for Aaron Courville, and Bernt Schiele. 2017. Movie
training descriptive captions. In CVPR. description. IJCV.

Haley MacLeod, Cynthia L Bennett, Meredith Ringel Rico Sennrich, Barry Haddow, and Alexandra Birch.
Morris, and Edward Cutrell. 2017. Understanding 2016. Neural machine translation of rare words with
blind people’s experiences with computer-generated subword units. In ACL.
captions of social media images. In CHI.
Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Au-
Pranava Madhyastha, Josiah Wang, and Lucia Specia. rélie Herbelot, Moin Nabi, Enver Sangineto, and
2019. VIFIDEL: Evaluating the visual fidelity of Raffaella Bernardi. 2017. FOIL it! find one mis-
image descriptions. In ACL. match between image and language caption. In
ACL.
Yashar Mehdad, Matteo Negri, and Marcello Federico.
2012. Match without a referee: evaluating mt ad- Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine
equacy without reference translations. In Seventh Bordes, and Jason Weston. 2019. Engaging image
Workshop on Statistical Machine Translation. captioning via personality. In CVPR.
Shikib Mehri and Maxine Eskenazi. 2020. USR: An Kihyuk Sohn. 2016. Improved deep metric learning
unsupervised and reference free evaluation metric with multi-class n-pair loss objective. In NeurIPS.
for dialog generation. In ACL.
Lucia Specia, Dhwaj Raj, and Marco Turchi. 2010. Ma-
Margaret Mitchell, Simone Wu, Andrew Zaldivar, chine translation evaluation versus quality estima-
Parker Barnes, Lucy Vasserman, Ben Hutchinson, tion. Machine translation, 24(1):39–50.
Elena Spitzer, Inioluwa Deborah Raji, and Timnit
Gebru. 2019. Model cards for model reporting. In Lucia Specia and Kashif Shah. 2018. Machine transla-
FAccT. tion quality estimation: Applications and future per-
spectives. In Translation Quality Assessment, pages
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 201–235. Springer.
2018. Representation learning with contrastive pre-
dictive coding. arXiv preprint arXiv:1807.03748. Abigale Stangl, Meredith Ringel Morris, and Danna
Gurari. 2020. “person, shoes, tree. is the person
Kishore Papineni, Salim Roukos, Todd Ward, and Wei- naked?" what people with vision impairments want
Jing Zhu. 2002. Bleu: a method for automatic eval- in image descriptions. In CHI.
uation of machine translation. In ACL.
Simeng Sun and Ani Nenkova. 2019. The feasibility
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, of embedding based automatic evaluation for single
B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, document summarization. In EMNLP.
R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,
D. Cournapeau, M. Brucher, M. Perrot, and E. Duch- Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui
esnay. 2011. Scikit-learn: Machine learning in Yan. 2018. Ruber: An unsupervised method for au-
Python. JMLR, 12. tomatic evaluation of open-domain dialog systems.
In AAAI.
Maxime Peyrard and Iryna Gurevych. 2018. Objec-
tive function learning to match human judgements Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
for optimization-based summarization. In NAACL. Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. you need. In NeurIPS.
How multilingual is multilingual BERT? In ACL.
Ramakrishna Vedantam, C Lawrence Zitnick, and Devi
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Parikh. 2015. Cider: Consensus-based image de-
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish scription evaluation. In CVPR.
Sastry, Amanda Askell, Pamela Mishkin, Jack Clark,
Gretchen Krueger, and Ilya Sutskever. 2021. Learn- Oriol Vinyals, Alexander Toshev, Samy Bengio, and
ing transferable visual models from natural language Dumitru Erhan. 2016. Show and tell: Lessons
supervision. learned from the 2015 mscoco image captioning
challenge. TPAMI, 39(4):652–663.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan,
Dario Amodei, and Ilya Sutskever. 2019. Language Sijin Wang, Ziwei Yao, Ruiping Wang, Zhongqin Wu,
models are unsupervised multitask learners. OpenAI and Xilin Chen. 2021. FAIEr: Fidelity and adequacy
blog, 1(8):9. ensured image caption evaluation. In CVPR.
Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT. In EMNLP.

Elizaveta Yankovskaya, Andre Tättar, and Mark Fishel. 2019. Quality estimation and translation metrics via pre-trained word and sentence embeddings. In Fourth Conference on Machine Translation.

Yanzhi Yi, Hangyu Deng, and Jinglu Hu. 2020. Improving image captioning evaluation by considering inter references variance. In ACL.

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2:67–78.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In ICLR.

Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, and Steffen Eger. 2020. On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In ACL.

C Lawrence Zitnick and Devi Parikh. 2013. Bringing semantics into focus using visual abstraction. In CVPR.
A Evaluation and Replication Details

Anderson et al. (2016) introduced a set of corpora, metrics, and experimental settings for comparing image caption generation evaluation metrics. Perhaps unwittingly, their introduced protocols have become the accepted standard for evaluation of new caption generation metrics. However, seemingly innocuous preprocessing+reporting choices can significantly impact correlations with human judgment on these corpora. In what follows, we detail our replication efforts. Our goal was to make the experimental comparisons involving CLIPScore reported in the main paper as fair as possible. We hope it can be useful for researchers reporting metrics on this setup going forward.

Flickr8K details
We contacted the authors of some prior work, and did our best to re-create their evaluation settings. We uncovered two types of discrepancies when reporting on this corpus. The first discrepancy is that prior work has mixed evaluating rank correlations with kendall-C and kendall-B. These metrics handle ties differently, and ties are frequent because human Likert judgements are discretized. The second discrepancy is the method of aggregation of human ratings. Three human ratings were gathered for 5664 (image, candidate) pairs. The majority of prior works flatten all human judgments to a single list, and report rank correlation over 5664 * 3 = 16992 instances (method A). However, another (possibly more defensible) evaluation choice is to average human ratings for each pair, and report rank correlation instead over 5664 instances (method B). The choice of aggregation method has a significant impact on correlations. For example, when we use aggregation method A and τc for SPICE, we can exactly replicate the correlation, 44.9, originally reported in Anderson et al. (2016). But, if we use τc and instead use aggregation method B, the correlation increases to 52.9: this inflation occurs with other metrics, too.
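For concreteness, the two aggregation choices differ only in how the three ratings per pair are combined before computing the rank correlation; a minimal sketch over hypothetical ratings (again assuming scipy >= 1.7):

```python
# Sketch contrasting aggregation method A (flatten all judgments) with
# method B (average ratings per pair); data are hypothetical stand-ins.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_pairs = 5664
metric = rng.random(n_pairs)                        # one metric score per (image, candidate) pair
ratings = rng.integers(1, 5, size=(n_pairs, 3))     # three human ratings per pair

# Method A: 5664 * 3 = 16992 instances, metric score repeated per judgment
tau_A, _ = kendalltau(np.repeat(metric, 3), ratings.ravel(), variant="c")

# Method B: 5664 instances, ratings averaged per pair
tau_B, _ = kendalltau(metric, ratings.mean(axis=1), variant="c")
```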
For our results, we do our best to report all results for the most common setting: using τc correlation, and using aggregation method A. Thus, the results we report may differ slightly from the results reported in prior work.

Composite details
For this corpus too, prior work has mixed evaluating with kendall-C and kendall-B correlations, which can have an impact, e.g., for CIDEr in our setting, switching from τb to τc results in an increase from 35 to 38 rank correlation. But perhaps the most impactful decision for this corpus relates to the references: each image originally has (roughly) five references. But when gathering human judgments, one of the candidate captions that was rated by humans was sampled from the references. For Flickr8k, Anderson et al. (2016) "exclude 158 correct image-caption pairs where the candidate caption appears in the reference set;" this curation choice has become standard for Flickr8k. But for Composite, it's not clear if they repeated this curation choice, or not. And because of this ambiguity, it's not obvious which standard each prior work followed, either. For fair comparison, in an effort to reconstruct Anderson et al. (2016), we tried both ways: removing the ground truth candidate reference, and not.

Our efforts to replicate the exact values of Anderson et al. (2016) are in Table 6. We suspect the discrepancy in BLEU-4 likely results from a smoothing issue related to the application of BLEU-4 to individual captions vs. the whole corpus (as mentioned in Kane et al. (2020)). Based on these replication efforts, it's likely that the original evaluations for this corpus were computed using τc with GT references removed. We agree that the fairest analysis on this corpus should not include a reference that is also a candidate. And while we didn't go through all prior works and recompute their metrics with this change, we did compute ViLBERTScore-F in this setting, because it was, before CLIPScore, the state-of-the-art for this corpus. If it's helpful for future reporting: in the setting where all references (including the GT reference) are used, RefCLIP-S gets τc = 60.0.

                Original   τb no GT   τb w/ GT   τc no GT   τc w/ GT
    BLEU-1         26          29         45         31         49
    BLEU-4         18          31         46         31         50
    ROUGE-L        28          30         48         32         49
    METEOR         35          36         49         39         50
    CIDEr          36          35         48         38         52
    SPICE          39          39         51         40         53

Table 6: Attempts at replicating Anderson et al. (2016)'s results on the Composite corpus.

MSCOCO system-level details
The MSCOCO 2015 image captioning challenge is a standard corpus for evaluating the system-level correlation between new evaluation metrics and human judgments on the MSCOCO test set. To our knowledge, this evaluation was first conducted by Anderson et al. (2016) using a random sample of 1K test set submissions from 15 teams. But because the test set predictions are not public, more recent work (e.g., Cui et al. (2018); Zhang et al. (2020)) has evaluated using dev set predictions from systems, and assuming dev set results correlate with test set results (12 teams submitted dev predictions). However, there are some potential problems with this setup:

1. There's reason to believe that some teams give dev set predictions with different models vs. test set predictions. For example, the dev set predictions are identical between the two submissions m-RNN and m-RNN (Baidu/UCLA), but the test set predictions differ (and achieve significantly different scores).

2. Correlations are reported over 12 (or possibly only 11, given the duplicate predictions) systems. But spearman/pearson correlation over only 12 observations is unfortunately simple to (accidentally) "game" due to the low statistical power of the comparison (see Card et al. (2020) for an overview of statistical power in NLP). Consider a (nonsense) evaluation metric that assigns a random uniform [0, 1) "score" to systems without examining outputs, and consider applying this metric, e.g., N = 10 times to the 12 systems and taking the best performing run as the final metric (simulating either a single researcher developing a new evaluation metric and/or the community's collective trials). We ran a simulation of this process 1000 times: the average spearman/pearson correlations between human judgments and our bogus metric were r/ρ = .91, due to repeated evaluation and low sample size.
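The simulation in point 2 is easy to reproduce in spirit; a rough sketch with hypothetical system-level human scores (the actual M1/M2 values are not reproduced here):

```python
# Sketch of the "bogus metric" simulation from point 2: random scores for 12
# systems, best of N=10 evaluation attempts kept, repeated 1000 times.
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
human = rng.random(12)            # hypothetical stand-in for the human M1/M2 scores

best_r, best_rho = [], []
for _ in range(1000):             # 1000 simulated "metric development" runs
    rs, rhos = [], []
    for _ in range(10):           # N = 10 tries with a purely random metric
        bogus = rng.random(12)
        rs.append(abs(pearsonr(bogus, human)[0]))
        rhos.append(abs(spearmanr(bogus, human)[0]))
    best_r.append(max(rs))
    best_rho.append(max(rhos))
print(np.mean(best_r), np.mean(best_rho))   # both high despite the metric being noise
```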
Thus, while the intent of this evaluation is understandable, and it may be possible to garner some insight if relatively few evaluations are conducted, this specific setup as a fine-grained comparison between new evaluation metrics for caption generation has likely outlived its utility.

Pascal-50S Setup Erratum
In March 2022, Jin-Hwa Kim reported some small discrepancies in a replication effort for the Pascal-50S corpus. Upon further investigation, it was discovered that the original version of this work was using a different set of human judgments than the usual setup. In particular, the Pascal-50S corpus contains two types of human judgments: 11 human judgments per pair (located in a file named pair_pascal.mat); and 48 human judgments per pair (located in a file named consensus_pascal.mat). The 48 judgments are intended to be used, and the results in the main paper have been updated accordingly. For reproducibility's sake, in case future work utilizes the 11 judgments, we have included those results in Table 7.

                              HC    HI    HM    MM   Mean
    length                   65.4  52.4  63.0  42.3  55.8
    BLEU-4                   52.5  90.4  84.9  55.3  70.8
    SPICE                    56.9  96.3  87.1  66.4  76.7
    METEOR                   59.0  97.7  93.9  62.0  78.2
    ROUGE-L                  55.0  95.3  93.1  58.7  75.5
    CIDEr                    53.7  98.1  90.8  63.7  76.6
    BERT-S (RoBERTa-F)       54.4  96.1  94.3  56.4  75.3
    CLIP-S (no refs)         60.3  99.4  97.9  77.3  83.7
    RefCLIP-S                57.9  99.5  96.1  80.8  83.6

Table 7: Pascal50S-11-judgment accuracy results (5 references, non-standard 11 human judgment version). HC = two human correct captions; HI = both captions are human written, but one is wrong; HM = both captions are for the image, but one is written by a human, one by an algorithm; MM = both captions are for the image, and both are written by an algorithm. We average our results over 5 random samples (but CLIP-S doesn't change because it doesn't use references).

B Rescaling CLIPScore

For readability purposes, as in Zhang et al. (2020), we sought to re-scale the raw cosine similarities computed by CLIP ViT-B/32. While such a monotonic rescaling operation doesn't affect ranking results, for reporting purposes, it can be easier to compare raw values if they are on a scale more closely aligned with other evaluation metrics (e.g., from roughly zero to roughly one). Figure 4 shows the raw candidate-reference and candidate-image cosine similarities for four corpora. (Many "reference"-candidate similarities for the Twitter corpus are 1.0 because users frequently use the text of their tweet as the AltText.) Across all of these cases, we never observed a negative cosine similarity. But, to be safe, we take a maximum between the cosine similarity and zero because the harmonic mean used to compute RefCLIPScore would be undefined for negative values.
Figure 4: Distributions of raw cosine similarities between candidate and references (cos(c, r)) and candidate and visual content (cos(c, v)) from CLIP ViT-B/32, for (a) Flickr8K, (b) Composite, (c) Pascal50S, and (d) Twitter AltText.
Multiplying by 2.5 has the effect of "stretching" the CLIPScore distribution to more uniformly span between zero and one, though CLIPScore can be greater than 1. Furthermore, when computing RefCLIPScore, we maintain this weighting, because it has the effect of mapping the visual-textual cosine similarity distribution to more closely match the reference-candidate distribution: this provides a roughly equal importance weighting between the image-candidate and reference-candidate similarity factors.

We note that the exact parameters of our rescaling method only apply to CLIP ViT-B/32. If future, bigger models are released, e.g., the presently unreleased ViT-L/14 CLIP variant, they could exhibit a different cosine similarity distribution.
