CLIPScore: A Reference-free Evaluation Metric for Image Captioning

Jack Hessel†   Ari Holtzman‡   Maxwell Forbes‡   Ronan Le Bras†   Yejin Choi†‡
†Allen Institute for AI
‡Paul G. Allen School of Computer Science & Engineering, University of Washington
{jackh,ronanlb}@allenai.org   {ahai,mbforbes,yejin}@cs.washington.edu
Table 5: Accuracy of evaluation metrics in the pairwise FOIL hallucination detection setting. All reference-based metrics are given access to either one or four references.

                       1 ref    4 refs
length                 50.2     50.2
BLEU-4                 66.5     82.6
METEOR                 78.8     85.4
ROUGE-L                71.7     79.3
CIDEr                  82.5     90.6
SPICE                  75.5     86.1
BERT-S (RoBERTa-F)     88.6     92.1
CLIP-S (no refs)       87.2     87.2
RefCLIP-S              91.0     92.6

Figure 2: R^2 for the forward-selection regression of metrics on human Likert ratings for two corpora, (a) Composite and (b) Flickr8k-Expert (x-axis: Importance Rank; y-axis: R^2). Forward-selection tends to identify both CLIP-S and RefCLIP-S early on: other informative and complementary metrics include ViLBERTScore-F and SPICE.
CLIP-S, despite using no references, outperforms all metrics except for BERT-S (RoBERTa-F). And, RefCLIP-S works best in all cases. Overall, we corroborate Rohrbach et al. (2018)'s finding that "object hallucination can not be always predicted based on the traditional sentence metrics" using a corpus derived from Shekhar et al. (2017), particularly in the case where there are few references available. However, CLIP-S and RefCLIP-S offer a performance improvement in the pairwise setting.
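The pairwise protocol behind Table 5 reduces to a simple per-image check: a metric is counted as correct if it scores the true caption strictly higher than its FOIL (one-object-swapped) counterpart. A minimal sketch of that accuracy computation, assuming hypothetical per-caption score arrays that any metric from the table could supply:

```python
# Pairwise FOIL accuracy: fraction of images where the metric prefers the
# true caption over the foil. Score arrays are hypothetical inputs.
import numpy as np

def pairwise_foil_accuracy(true_scores, foil_scores):
    """Return the fraction of instances where the true caption outscores the foil."""
    true_scores = np.asarray(true_scores, dtype=float)
    foil_scores = np.asarray(foil_scores, dtype=float)
    return float(np.mean(true_scores > foil_scores))

# Example with made-up scores for three images:
# pairwise_foil_accuracy([0.81, 0.74, 0.69], [0.55, 0.77, 0.40])  # -> 0.666...
```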
4.5 Sensitivity of CLIP-S to memorization
One concern with model-based scoring methods is memorization, i.e., if a model's weights are pretrained using a large corpus, there's a risk that data used at evaluation time have already been seen at pretraining time. While Radford et al. (2021) conduct a train-test overlap analysis and find that CLIP is unlikely to succeed because of memorization, we nonetheless conduct an experiment with images CLIP has never seen before.

The authors of this work created a set of 250 images that have never been posted to the Internet by aggregating personal photographs. The set contains a variety of Flickr-like situations, e.g., nature scenes, animals, city streets, objects, etc. For each image, we collect two automatically generated captions: one from a commercial API, Microsoft Azure Cognitive Services (v 3.1),[10] and one from Luo et al. (2018)'s pretrained model, which is trained to maximize CIDEr score with a self-critical baseline.[11] Then, for each image, three authors of this work independently selected which caption described the image content more accurately. Relative to a 50% random baseline (and a 72% length baseline of selecting the shorter caption), CLIP-S correctly recovers the majority human preference in 86% of cases. Human agreement for this corpus is 93%.[12]

While this setup cannot definitively refute the notion that CLIP works well because it has memorized images, we hope the results here contribute to the evolving discussion about the nature of generalization for web-scale pretrained models.

[10] https://azure.microsoft.com/en-us/services/cognitive-services/
[11] We use the ResNet101 pretrained version, which achieves 1.05 CIDEr and 0.19 SPICE on the COCO validation set.
[12] Raters preferred the Microsoft captions to the ResNet101 model 81% of the time.
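For concreteness, the evaluation just described can be sketched as below. This is not the authors' released code; the inputs (scores for the Azure caption vs. the Luo et al. caption, three binary rater votes per image, and caption lengths) are assumed data structures for illustration:

```python
# Sketch of the pairwise preference evaluation on the 250-image set.
# votes has shape [n_images, 3], with 1 meaning the rater preferred caption A.
import numpy as np

def preference_accuracy(score_a, score_b, votes):
    """Fraction of images where the metric's preferred caption matches the
    majority human preference."""
    metric_prefers_a = np.asarray(score_a) > np.asarray(score_b)
    humans_prefer_a = np.asarray(votes).sum(axis=1) >= 2  # majority of 3 raters
    return float(np.mean(metric_prefers_a == humans_prefer_a))

def shorter_caption_baseline(len_a, len_b, votes):
    """The length baseline: always prefer the shorter caption."""
    baseline_prefers_a = np.asarray(len_a) < np.asarray(len_b)
    humans_prefer_a = np.asarray(votes).sum(axis=1) >= 2
    return float(np.mean(baseline_prefers_a == humans_prefer_a))
```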
[Figure: example captions from two corpora. AltText: "The logo for OPEC, the Organization of the Petroleum Exporting Countries, is shown against a background of flags." Personality Captions: "[Miserable:] I'm ready to get to shore now. I hate the waves. Nothing to do on shore either."]

4.6 Which metrics should I report?

Most caption generation works report multiple metrics, each of which (presumably) correlates with human judgment to different degrees. But it's not always clear if individual metrics capture distinct or redundant dimensions of human judgment. For example, while CLIP-S and ViLBERTScore-F both produce high correlations, are they redundant or complementary?

We seek a (minimal) set of metrics that explains the most variance in human judgment. To find this set, we undertake a forward selection on a set of ten candidate metrics comprising six widely-reported metrics[13] and four newer metrics: BERT-S (RoBERTa-F), TIGEr, ViLBERTScore-F, and CLIP-S (we also include experiments starting with RefCLIP-S instead of CLIP-S). Starting from an empty set, we perform an iterative greedy selection by picking the most informative additional metric to add.[14] To estimate variance, we repeat the forward-selection process 10 times with bootstrap re-sampled versions of the corpus.

[13] BLEU-1, BLEU-4, METEOR, CIDEr, ROUGE-L, SPICE.
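A sketch of this greedy forward-selection loop is given below, assuming an ordinary least-squares fit and in-sample R^2 as the selection criterion (the paper's exact regression details are in its footnote 14, which is not reproduced in this excerpt). `metric_scores` and `human` are hypothetical inputs:

```python
# Sketch of greedy forward selection over candidate metrics, assuming OLS
# regression and in-sample R^2 as the criterion. metric_scores is a
# hypothetical dict mapping metric name -> per-caption score array; human
# holds the human Likert ratings.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def forward_select(metric_scores, human, rng=None):
    """Greedily add the metric that most improves R^2 against human ratings,
    on one bootstrap re-sample of the corpus."""
    if rng is None:
        rng = np.random.default_rng(0)
    y_all = np.asarray(human, dtype=float)
    idx = rng.integers(0, len(y_all), size=len(y_all))  # bootstrap re-sample
    y = y_all[idx]

    def r2_for(cols):
        X = np.column_stack([np.asarray(metric_scores[c], dtype=float)[idx] for c in cols])
        model = LinearRegression().fit(X, y)
        return r2_score(y, model.predict(X))

    selected, remaining, trajectory = [], set(metric_scores), []
    while remaining:
        best = max(remaining, key=lambda m: r2_for(selected + [m]))
        selected.append(best)
        remaining.remove(best)
        trajectory.append((best, r2_for(selected)))
    return trajectory  # [(metric name, cumulative R^2), ...] in selection order

# Variance estimate: repeat over several bootstrap re-samples, e.g.
# runs = [forward_select(scores, ratings, np.random.default_rng(s)) for s in range(10)]
```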
References

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: Semantic propositional image caption evaluation. In ECCV. Springer.
John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2004. Confidence estimation for machine translation. In COLING.
Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. With little power comes great responsibility. In EMNLP.
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: Universal image-text representation learning. In ECCV.
Yin Cui, Guandao Yang, Andreas Veit, Xun Huang, and Serge Belongie. 2018. Learning to evaluate image captioning. In CVPR.
Bo Dai and Dahua Lin. 2017. Contrastive learning for image captioning. In NeurIPS.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR.
Desmond Elliott and Frank Keller. 2014. Comparing automatic evaluation measures for image description. In ACL.
Cole Gleason, Patrick Carrington, Cameron Cassidy, Meredith Ringel Morris, Kris M Kitani, and Jeffrey P Bigham. 2019. "It's almost like they're trying to hide it": How user-provided image descriptions have failed to make Twitter accessible. In WWW.
Cole Gleason, Amy Pavel, Emma McCamey, Christina Low, Patrick Carrington, Kris M Kitani, and Jeffrey P Bigham. 2020. Twitter A11y: A browser extension to make Twitter images accessible. In CHI.
Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, and Anna Rohrbach. 2018. Women also snowboard: Overcoming bias in captioning models. In ECCV, pages 771–787.
Micah Hodosh, Peter Young, and Julia Hockenmaier. 2013. Framing image description as a ranking task: Data, models and evaluation metrics. JAIR, 47:853–899.
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML.
Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. 2020. Cross-lingual ability of multilingual BERT: An empirical study. In ICLR.
Hassan Kane, Muhammed Yusuf Kocyigit, Ali Abdalla, Pelkins Ajanoh, and Mohamed Coulibali. 2020. NUBIA: NeUral Based Interchangeability Assessor for text generation. In 1st Workshop on Evaluating NLG Evaluation.
Mert Kilickaya, Aykut Erdem, Nazli Ikizler-Cinbis, and Erkut Erdem. 2017. Re-evaluating automatic metrics for image captioning. In EACL.
Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Trung Bui, and Kyomin Jung. 2021. UMIC: An unreferenced metric for image captioning via contrastive learning. In ACL.
Hwanhee Lee, Seunghyun Yoon, Franck Dernoncourt, Doo Soon Kim, Trung Bui, and Kyomin Jung. 2020. ViLBERTScore: Evaluating image caption using vision-and-language BERT. In First Workshop on Evaluation and Comparison of NLP Systems.
Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. In ECCV.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV. Springer.
Xihui Liu, Hongsheng Li, Jing Shao, Dapeng Chen, and Xiaogang Wang. 2018. Show, tell and discriminate: Image captioning by self-retrieval with partially labeled data. In ECCV.
Chi-kiu Lo. 2019. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Fourth Conference on Machine Translation.
Annie Louis and Ani Nenkova. 2013. Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2):267–300.
Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS.
Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 2020. 12-in-1: Multi-task vision and language representation learning. In CVPR.
Grace Luo, Trevor Darrell, and Anna Rohrbach. 2021. NewsCLIPpings: Automatic generation of out-of-context multimodal media. arXiv preprint arXiv:2104.05893.
Ruotian Luo, Brian Price, Scott Cohen, and Gregory Shakhnarovich. 2018. Discriminability objective for training descriptive captions. In CVPR.
Haley MacLeod, Cynthia L Bennett, Meredith Ringel Morris, and Edward Cutrell. 2017. Understanding blind people's experiences with computer-generated captions of social media images. In CHI.
Pranava Madhyastha, Josiah Wang, and Lucia Specia. 2019. VIFIDEL: Evaluating the visual fidelity of image descriptions. In ACL.
Yashar Mehdad, Matteo Negri, and Marcello Federico. 2012. Match without a referee: Evaluating MT adequacy without reference translations. In Seventh Workshop on Statistical Machine Translation.
Shikib Mehri and Maxine Eskenazi. 2020. USR: An unsupervised and reference free evaluation metric for dialog generation. In ACL.
Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model cards for model reporting. In FAccT.
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. JMLR, 12.
Maxime Peyrard and Iryna Gurevych. 2018. Objective function learning to match human judgements for optimization-based summarization. In NAACL.
Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In ACL.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object hallucination in image captioning. In EMNLP.
Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. Movie description. IJCV.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
Ravi Shekhar, Sandro Pezzelle, Yauhen Klimovich, Aurélie Herbelot, Moin Nabi, Enver Sangineto, and Raffaella Bernardi. 2017. FOIL it! Find one mismatch between image and language caption. In ACL.
Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, and Jason Weston. 2019. Engaging image captioning via personality. In CVPR.
Kihyuk Sohn. 2016. Improved deep metric learning with multi-class n-pair loss objective. In NeurIPS.
Lucia Specia, Dhwaj Raj, and Marco Turchi. 2010. Machine translation evaluation versus quality estimation. Machine Translation, 24(1):39–50.
Lucia Specia and Kashif Shah. 2018. Machine translation quality estimation: Applications and future perspectives. In Translation Quality Assessment, pages 201–235. Springer.
Abigale Stangl, Meredith Ringel Morris, and Danna Gurari. 2020. "Person, shoes, tree. Is the person naked?" What people with vision impairments want in image descriptions. In CHI.
Simeng Sun and Ani Nenkova. 2019. The feasibility of embedding based automatic evaluation for single document summarization. In EMNLP.
Chongyang Tao, Lili Mou, Dongyan Zhao, and Rui Yan. 2018. RUBER: An unsupervised method for automatic evaluation of open-domain dialog systems. In AAAI.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS.
Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In CVPR.
Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2016. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. TPAMI, 39(4):652–663.
Sijin Wang, Ziwei Yao, Ruiping Wang, Zhongqin Wu, and Xilin Chen. 2021. FAIEr: Fidelity and adequacy ensured image caption evaluation. In CVPR.
Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas:
The surprising cross-lingual effectiveness of BERT.
In EMNLP.
Elizaveta Yankovskaya, Andre Tättar, and Mark Fishel.
2019. Quality estimation and translation metrics
via pre-trained word and sentence embeddings. In
Fourth Conference on Machine Translation.
Yanzhi Yi, Hangyu Deng, and Jinglu Hu. 2020. Im-
proving image captioning evaluation by considering
inter references variance. In ACL.
Peter Young, Alice Lai, Micah Hodosh, and Julia Hock-
enmaier. 2014. From image descriptions to visual
denotations: New similarity metrics for semantic in-
ference over event descriptions. TACL, 2:67–78.
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q
Weinberger, and Yoav Artzi. 2020. BERTScore:
Evaluating text generation with BERT. In ICLR.
Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao,
Robert West, and Steffen Eger. 2020. On the lim-
itations of cross-lingual encoders as exposed by
reference-free machine translation evaluation. In
ACL.
C Lawrence Zitnick and Devi Parikh. 2013. Bring-
ing semantics into focus using visual abstraction. In
CVPR.
A Evaluation and Replication Details

Anderson et al. (2016) introduced a set of corpora, metrics, and experimental settings for comparing image caption generation evaluation metrics. Perhaps unwittingly, their introduced protocols have

            Original   τb no GT   τb w/ GT   τc no GT   τc w/ GT
BLEU-1      26         29         45         31         49
BLEU-4      18         31         46         31         50
ROUGE-L     28         30         48         32         49
METEOR      35         36         49         39         50
CIDEr       36         35         48         38         52
SPICE       39         39         51         40         53

[Figure: two panels plotted over cosine similarity (x-axis: Cosine Sim, 0 to 1).]
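The τb and τc columns in the table above are the two Kendall correlation variants; a minimal sketch of computing both with scipy follows, assuming hypothetical arrays of per-caption metric scores and human ratings (the tabulated values appear to be tau scaled by 100):

```python
# Kendall tau-b and tau-c between metric scores and human judgments.
# `metric` and `human` are hypothetical 1-D arrays of equal length.
from scipy.stats import kendalltau

def kendall_b_and_c(metric, human):
    tau_b, _ = kendalltau(metric, human, variant="b")
    tau_c, _ = kendalltau(metric, human, variant="c")
    return 100 * tau_b, 100 * tau_c  # scale to match the table's reporting
```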