CLIPScore
Jack Hessel† Ari Holtzman‡ Maxwell Forbes‡ Ronan Le Bras† Yejin Choi†‡
Allen Institute for AI
Paul G. Allen School of Computer Science & Engineering, University of Washington
{jackh,ronanlb} {ahai,mbforbes,yejin}
Table 5: Accuracy of evaluation metrics in the pairwise FOIL hallucination detection setting. All reference-based metrics are given access to either one or four references.

Metric                1 ref    4 refs
length                50.2     50.2
BLEU-4                66.5     82.6
METEOR                78.8     85.4
ROUGE-L               71.7     79.3
CIDEr                 82.5     90.6
SPICE                 75.5     86.1
BERT-S (RoBERTa-F)    88.6     92.1
CLIP-S (no refs)      87.2     87.2
RefCLIP-S             91.0     92.6

[...] reference outperforms all metrics except for BERT-S (RoBERTa-F). And, RefCLIP-S works best in all cases. Overall, we corroborate Rohrbach et al. (2018)’s finding that “object hallucination can not be always predicted based on the traditional sentence metrics” using a corpus derived from Shekhar et al. (2017), particularly in the case where there are few references available. However, CLIP-S and RefCLIP-S offer a performance improvement in the pairwise setting.

4.5 Sensitivity of CLIP-S to memorization

One concern with model-based scoring methods is memorization, i.e., if a model’s weights are pretrained using a large corpus, there’s a risk that data used at evaluation time have already been seen at pretraining time. While Radford et al. (2021) conduct a train-test overlap analysis and find that CLIP’s success is unlikely to be due to memorization, we nonetheless conduct an experiment with images CLIP has never seen before.

The authors of this work created a set of 250 images that have never been posted to the Internet by aggregating personal photographs. The set contains a variety of Flickr-like situations, e.g., nature scenes, animals, city streets, objects, etc. For each image, we collect two automatically generated captions: one from a commercial API, Microsoft Azure Cognitive Services (v 3.1),¹⁰ and one from Luo et al. (2018)’s pretrained model, which is trained to maximize CIDEr score with a self-critical baseline.¹¹ Then, for each image, three authors of this work independently selected which caption described the image content more accurately. Relative to a 50% random baseline (and a 72% length baseline of selecting the shorter caption), CLIP-S correctly recovers majority human preference in 86% of cases. Human agreement for this corpus is 93%.¹²

While this setup cannot definitively refute the notion that CLIP works well because it has memorized images, we hope the results here contribute to the evolving discussion about the nature of generalization for web-scale pretrained models.

4.6 Which metrics should I report?

Most caption generation works report multiple metrics, each of which (presumably) correlates with human judgment to different degrees. But it’s not always clear if individual metrics capture distinct or redundant dimensions of human judgment. For example, while CLIP-S and ViLBERTScore-F both produce high correlations, are they redundant or complementary?

We seek a (minimal) set of metrics that explains the most variance in human judgment. To find this set, we undertake a forward selection on a set of ten candidate metrics comprising six widely-reported metrics¹³ and four newer metrics: BERT-S (RoBERTa-F), TIGEr, ViLBERTScore-F, and CLIP-S (we also include experiments starting with RefCLIP-S instead of CLIP-S). Starting from an empty set, we perform an iterative greedy selection by picking the most informative additional metric to add.¹⁴ To estimate variance, we repeat the forward-selection process 10 times with bootstrap re-sampled ver- [...]

Figure 2: R² for the forward-selection regression of metrics on human Likert ratings for two corpora: (a) Composite, (b) Flickr8k-Expert. Forward-selection tends to identify both CLIP-S and [...] complementary metrics include ViLBERTScore-F and SPICE. [Plot residue: x-axis “Importance Rank” (#1, #3, #5); series RefCLIPScore and CLIPScore.]

[Figure residue: example captions. AltText: “The logo for OPEC, the Organization of the Petroleum Exporting Countries, is shown against a background of flags.” Personality: “[Miserable:] I'm ready to get to shore now. I hate the waves. Nothing to do on shore either.”]

¹⁰ services/cognitive-services/
¹¹ We use the ResNet101 pretrained version, which achieves 1.05 CIDEr and 0.19 SPICE on the COCO validation set.
¹² Raters preferred the Microsoft captions to the ResNet101 [...]
¹³ BLEU-1, BLEU-4, METEOR, CIDEr, ROUGE-L, SPICE
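The greedy forward selection described in Section 4.6 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the R² criterion follows the Figure 2 caption, but the ordinary least-squares model, the function names (`forward_select`, `bootstrap_ranks`, `toy_data`), and the synthetic data are all assumptions made for the example.

```python
import random


def solve_linear(a, b):
    """Solve the square system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear metrics?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(columns, y):
    """R^2 of a least-squares fit of y on the given columns plus an intercept."""
    rows = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * t for r, t in zip(rows, y)) for p in range(k)]
    beta = solve_linear(xtx, xty)
    pred = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def forward_select(metrics, y, steps):
    """Greedily add, at each step, the metric that yields the highest R^2."""
    chosen, r2_path = [], []
    remaining = list(metrics)
    for _ in range(min(steps, len(remaining))):
        scored = [(r_squared([metrics[m] for m in chosen + [c]], y), c)
                  for c in remaining]
        best_r2, best = max(scored)
        chosen.append(best)
        remaining.remove(best)
        r2_path.append(best_r2)
    return chosen, r2_path


def bootstrap_ranks(metrics, y, steps, rounds=10, seed=0):
    """Repeat forward selection on bootstrap resamples; collect selection ranks."""
    rng = random.Random(seed)
    n = len(y)
    ranks = {name: [] for name in metrics}
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        sample = {name: [col[i] for i in idx] for name, col in metrics.items()}
        order, _ = forward_select(sample, [y[i] for i in idx], steps)
        for rank, name in enumerate(order, start=1):
            ranks[name].append(rank)
    return ranks


def toy_data(n=200, seed=0):
    """Synthetic stand-in for metric scores and human ratings (illustration only)."""
    rng = random.Random(seed)
    metrics = {name: [rng.gauss(0, 1) for _ in range(n)]
               for name in ("metric_a", "metric_b", "metric_c")}
    y = [2.0 * a + 0.5 * b + 0.1 * rng.gauss(0, 1)
         for a, b in zip(metrics["metric_a"], metrics["metric_b"])]
    return metrics, y


if __name__ == "__main__":
    metrics, ratings = toy_data()
    print(forward_select(metrics, ratings, steps=3))
    print(bootstrap_ranks(metrics, ratings, steps=3))
```

In the toy data, `metric_a` dominates the ratings and `metric_c` is pure noise, so the selection order reflects how much additional variance each metric explains once the earlier picks are in the model. The spread of each metric's rank across the bootstrap rounds gives a rough picture of how stable that ordering is.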
A Evaluation and Replication Details

Anderson et al. (2016) introduced a set of corpora, metrics, and experimental settings for comparing image caption generation evaluation metrics. Perhaps unwittingly, their introduced protocols have [...]

Metric     Original   τb no GT   τb w/ GT   τc no GT   τc w/ GT
BLEU-1     26         29         45         31         49
BLEU-4     18         31         46         31         50
ROUGE-L    28         30         48         32         49
METEOR     35         36         49         39         50
CIDEr      36         35         48         38         52
SPICE      39         39         51         40         53
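The table above reports two variants of Kendall's τ. As a reminder of how the variants differ, here is a small sketch using the standard textbook definitions of τb and τc; it is not taken from this text or from the authors' evaluation code, and the function name `kendall_tau` and the example values are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kendall_tau(x, y, variant="b"):
    """Kendall rank correlation between x and y (variant 'b' or 'c')."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1  # pairs tied in x or y count as neither
    if variant == "b":
        # tau-b: normalize by the pairs that are not tied in x and in y
        n0 = n * (n - 1) // 2
        ties_x = sum(c * (c - 1) // 2 for c in Counter(x).values())
        ties_y = sum(c * (c - 1) // 2 for c in Counter(y).values())
        return (concordant - discordant) / sqrt((n0 - ties_x) * (n0 - ties_y))
    if variant == "c":
        # tau-c: normalize by table shape, m = smaller number of distinct values
        m = min(len(set(x)), len(set(y)))
        return 2 * (concordant - discordant) / (n * n * (m - 1) / m)
    raise ValueError("variant must be 'b' or 'c'")


if __name__ == "__main__":
    metric_scores = [1, 2, 3, 4, 5]   # hypothetical metric outputs
    human_ratings = [1, 1, 2, 2, 3]   # hypothetical ratings with ties
    print(kendall_tau(metric_scores, human_ratings, "b"))
    print(kendall_tau(metric_scores, human_ratings, "c"))
```

On this example τb ≈ 0.894 and τc = 0.96, so the same ranking yields different values depending on the variant, which is why the two need to be distinguished when comparing reported correlations. The sketch does not guard against inputs where every value is tied, in which case both denominators are zero.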
[Figure residue: two panels, x-axis “Cosine Sim” from 0 to 1.]
Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. Spice: Semantic propositional image caption evaluation. In ECCV. Springer.
Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. With little power comes great responsibility. In
Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. 2018. Learning to evaluate image captioning. In CVPR.
Hassan Kane, Muhammed Yusuf Kocyigit, Ali Abdalla, Pelkins Ajanoh, and Mohamed Coulibali. 2020. NU-