
Int J Comput Vis (2017) 123:94–120

DOI 10.1007/s11263-016-0987-1

Movie Description
Anna Rohrbach1 · Atousa Torabi3 · Marcus Rohrbach2 · Niket Tandon1 ·
Christopher Pal4 · Hugo Larochelle5,6 · Aaron Courville7 · Bernt Schiele1

Received: 10 May 2016 / Accepted: 23 December 2016 / Published online: 25 January 2017
© The Author(s) 2017. This article is published with open access at Springerlink.com

Abstract Audio description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full length movies. In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions. We introduce the Large Scale Movie Description Challenge (LSMDC) which contains a parallel corpus of 128,118 sentences aligned to video clips from 200 movies (around 150 h of video in total). The goal of the challenge is to automatically generate descriptions for the movie clips. First we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in the challenges organized in the context of two workshops at ICCV 2015 and ECCV 2016.

Keywords Movie description · Video description · Video captioning · Video understanding · Movie description dataset · Movie description challenge · Long short-term memory network · Audio description · LSMDC

Communicated by Margaret Mitchell, John Platt and Kate Saenko.

✉ Anna Rohrbach
[email protected]

1 Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
2 ICSI and EECS, UC Berkeley, Berkeley, CA, USA
3 Disney Research, Pittsburgh, PA, USA
4 École Polytechnique de Montréal, Montreal, Canada
5 Université de Sherbrooke, Sherbrooke, Canada
6 Twitter, Cambridge, ON, USA
7 Université de Montréal, Montreal, Canada

1 Introduction

Audio descriptions (ADs) make movies accessible to millions of blind or visually impaired people.1 AD—sometimes also referred to as descriptive video service (DVS)—provides an audio narrative of the "most important aspects of the visual information" (Salway 2007), namely actions, gestures, scenes, and character appearance, as can be seen in Figs. 1 and 2. AD is prepared by trained describers and read by professional narrators. While more and more movies are audio transcribed, it may take up to 60 person-hours to describe a 2-h movie (Lakritz and Salway 2006), resulting in the fact that today only a small subset of movies and TV programs are available for the blind. Consequently, automating this process has the potential to greatly increase accessibility to this media content.

In addition to the benefits for the blind, generating descriptions for video is an interesting task in itself, requiring the combination of core techniques from computer vision and computational linguistics. To understand the visual input one has to reliably recognize scenes, human activities, and participating objects.

1 In this work we refer for simplicity to "the blind" to account for all blind and visually impaired people which benefit from AD, knowing of the variety of visually impaired and that AD is not accessible to all.


To generate a good description one has to decide what part of the visual information to verbalize, i.e. recognize what is salient.

Large datasets of objects (Deng et al. 2009) and scenes (Xiao et al. 2010; Zhou et al. 2014) have had an important impact in computer vision and have significantly improved our ability to recognize objects and scenes. The combination of large datasets and convolutional neural networks (CNNs) has been particularly potent (Krizhevsky et al. 2012). To be able to learn how to generate descriptions of visual content, parallel datasets of visual content paired with descriptions are indispensable (Rohrbach et al. 2013). While recently several large datasets have been released which provide images with descriptions (Young et al. 2014; Lin et al. 2014; Ordonez et al. 2011), video description datasets focus on short video clips with single sentence descriptions and have a limited number of video clips (Xu et al. 2016; Chen and Dolan 2011) or are not publicly available (Over et al. 2012). TACoS Multi-Level (Rohrbach et al. 2014) and YouCook (Das et al. 2013) are exceptions as they provide multiple sentence descriptions and longer videos. While these corpora pose challenges in terms of fine-grained recognition, they are restricted to the cooking scenario. In contrast, movies are open domain and realistic, even though, as any other video source (e.g. YouTube or surveillance videos), they have their specific characteristics. ADs and scripts associated with movies provide rich multiple sentence descriptions. They even go beyond this by telling a story, which means they facilitate the study of how to extract plots, the understanding of long term semantic dependencies and human interactions from both visual and textual data.

Fig. 1 Audio description (AD) and movie script samples from the movie "Ugly Truth".
AD: Abby gets in the basket. | Mike leans over and sees how high they are. | Abby clasps her hands around his face and kisses him passionately.
Script: After a moment a frazzled Abby pops up in his place. | Mike looks down to see – they are now fifteen feet above the ground. | For the first time in her life, she stops thinking and grabs Mike and kisses the hell out of him.

Fig. 2 Audio description (AD) and movie script samples from the movies "Harry Potter and the Prisoner of Azkaban", "This is 40", and "Les Miserables". Typical mistakes contained in scripts are marked in red italic


Fig. 3 Some of the diverse verbs/actions present in our Large Scale Movie Description Challenge (LSMDC)

Figures 1 and 2 show examples of ADs and compare them to movie scripts. Scripts have been used for various tasks (Cour et al. 2008; Duchenne et al. 2009; Laptev et al. 2008; Liang et al. 2011; Marszalek et al. 2009), but so far not for video description. The main reason for this is that automatic alignment frequently fails due to the discrepancy between the movie and the script. As scripts are produced prior to the shooting of the movie they are frequently not as precise as the AD (Fig. 2 shows some typical mistakes marked in red italic). A common case is that part of the sentence is correct, while another part contains incorrect/irrelevant information. As can be seen in the examples, AD narrations describe key visual elements of the video such as changes in the scene, people's appearance, gestures, actions, and their interaction with each other and the scene's objects in concise and precise language. Figure 3 shows the variability of AD data w.r.t. verbs (actions) and corresponding scenes from the movies.

In this work we present a dataset which provides transcribed ADs, aligned to full length movies. AD narrations are carefully positioned within movies to fit in the natural pauses in the dialogue and are mixed with the original movie soundtrack by professional post-production. To obtain ADs we retrieve audio streams from DVDs/Blu-ray disks, segment out the sections of the AD audio and transcribe them via a crowd-sourced transcription service. The ADs provide an initial temporal alignment, which however does not always cover the full activity in the video. We discuss a way to fully automate both audio-segmentation and temporal alignment, but also manually align each sentence to the movie for all the data. Therefore, in contrast to Salway (2007) and Salway et al. (2007), our dataset provides alignment to the actions in the video, rather than just to the audio track of the description.

In addition we also mine existing movie scripts, pre-align them automatically, similar to Cour et al. (2008) and Laptev et al. (2008), and then manually align the sentences to the movie.

As a first study on our dataset we benchmark several approaches for movie description. We first examine nearest neighbor retrieval using diverse visual features which do not require any additional labels, but retrieve sentences from the training data. Second, we adapt the translation approach of Rohrbach et al. (2013) by automatically extracting an intermediate semantic representation from the sentences using semantic parsing. Third, based on the success of long short-term memory networks (LSTMs) (Hochreiter and Schmidhuber 1997) for the image captioning problem (Donahue et al. 2015; Karpathy and Fei-Fei 2015; Kiros et al. 2015; Vinyals et al. 2015), we propose our approach Visual-Labels. It first builds robust visual classifiers which distinguish verbs, objects, and places extracted from weak sentence annotations. Then the visual classifiers form the input to an LSTM for generating movie descriptions.

The main contribution of this work is the Large Scale Movie Description Challenge (LSMDC)2 which provides transcribed and aligned AD and script data sentences. The LSMDC was first presented at the Workshop "Describing and Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", collocated with ICCV 2015. The second edition, LSMDC 2016, was presented at the "Joint Workshop on Storytelling with Images and Videos and Large Scale Movie Description and Understanding Challenge", collocated with ECCV 2016. Both challenges include the same public and blind test sets with an evaluation server3 for automatic evaluation. LSMDC is based on the MPII Movie Description dataset (MPII-MD) and the Montreal Video Annotation Dataset (M-VAD) which were initially collected independently but are presented jointly in this work. We detail the data collection and dataset properties in Sect. 3, which includes our approach to automatically collect and align AD data.

2 https://sites.google.com/site/describingmovies/.
3 https://competitions.codalab.org/competitions/6121.


In Sect. 4 we present several benchmark approaches for movie description, including our Visual-Labels approach which learns robust visual classifiers and generates descriptions using an LSTM. In Sect. 5 we present an evaluation of the benchmark approaches on the M-VAD and MPII-MD datasets, analyzing the influence of the different design choices. Using automatic and human evaluation, we also show that our Visual-Labels approach outperforms prior work on both datasets. In Sect. 5.5 we perform an analysis of prior work and our approach to understand the challenges of the movie description task. In Sect. 6 we present and discuss the results of the LSMDC 2015 and LSMDC 2016.

This work is partially based on the original publications from Rohrbach et al. (2015c, b) and the technical report from Torabi et al. (2015). Torabi et al. (2015) collected M-VAD, Rohrbach et al. (2015c) collected the MPII-MD dataset and presented the translation-based description approach. Rohrbach et al. (2015b) proposed the Visual-Labels approach.

2 Related Work

We discuss recent approaches to image and video description including existing work using movie scripts and ADs. We also discuss works which build on our dataset. We compare our proposed dataset to related video description datasets in Table 3 (Sect. 3.5).

2.1 Image Description

Prior work on image description includes Farhadi et al. (2010), Kulkarni et al. (2011), Kuznetsova et al. (2012, 2014), Li et al. (2011), Mitchell et al. (2012) and Socher et al. (2014). Recently image description has gained increased attention with work such as that of Chen and Zitnick (2015), Donahue et al. (2015), Fang et al. (2015), Karpathy and Fei-Fei (2015), Kiros et al. (2014, 2015), Mao et al. (2015), Vinyals et al. (2015) and Xu et al. (2015a). Much of the recent work has relied on Recurrent Neural Networks (RNNs) and in particular on long short-term memory networks (LSTMs). New datasets have been released, such as the Flickr30k (Young et al. 2014) and MS COCO Captions (Chen et al. 2015), where Chen et al. (2015) also presents a standardized protocol for image captioning evaluation. Other work has analyzed the performance of recent methods, e.g. Devlin et al. (2015) compare them with respect to the novelty of generated descriptions, while also exploring a nearest neighbor baseline that improves over recent methods.

2.2 Video Description

In the past video description has been addressed in controlled settings (Barbu et al. 2012; Kojima et al. 2002), on a small scale (Das et al. 2013; Guadarrama et al. 2013; Thomason et al. 2014) or in single domains like cooking (Rohrbach et al. 2014, 2013; Donahue et al. 2015). Donahue et al. (2015) first proposed to describe videos using an LSTM, relying on precomputed CRF scores from Rohrbach et al. (2014). Later Venugopalan et al. (2015c) extended this work to extract CNN features from frames which are max-pooled over time. Pan et al. (2016b) propose a framework that consists of a 2-/3-D CNN and LSTM trained jointly with a visual-semantic embedding to ensure better coherence between video and text. Xu et al. (2015b) jointly address the language generation and video/language retrieval tasks by learning a joint embedding for a deep video model and a compositional semantic language model. Li et al. (2015) study the problem of summarizing a long video to a single concise description by using ranking based summarization of multiple generated candidate sentences.

Concurrent and Consequent Work To handle the challenging scenario of movie description, Yao et al. (2015) propose a soft-attention based model which selects the most relevant temporal segments in a video, incorporates 3-D CNN and generates a sentence using an LSTM. Venugopalan et al. (2015b) propose S2VT, an encoder–decoder framework, where a single LSTM encodes the input video frame by frame and decodes it into a sentence. Pan et al. (2016a) extend the video encoding idea by introducing a second LSTM layer which receives input of the first layer, but skips several frames, reducing its temporal depth. Venugopalan et al. (2016) explore the benefit of pre-trained word embeddings and language models for generation on large external text corpora. Shetty and Laaksonen (2015) evaluate different visual features as input for an LSTM generation framework. Specifically they use dense trajectory features (Wang et al. 2013) extracted for the clips and CNN features extracted at center frames of the clip. They find that training concept classifiers on MS COCO with the CNN features, combined with dense trajectories, provides the best input for the LSTM. Ballas et al. (2016) leverage multiple convolutional maps from different CNN layers to improve the visual representation for activity and video description. To model multi-sentence description, Yu et al. (2016a) propose to use two stacked RNNs where the first one models words within a sentence and the second one, sentences within a paragraph. Yao et al. (2016) have conducted an interesting study on performance upper bounds for both image and video description tasks on available datasets, including the LSMDC dataset.

2.3 Movie Scripts and Audio Descriptions


Movie scripts have been used for automatic discovery and annotation of scenes and human actions in videos (Duchenne et al. 2009; Laptev et al. 2008; Marszalek et al. 2009), as well as a resource to construct activity knowledge bases (Tandon et al. 2015; de Melo and Tandon 2016). We rely on the approach presented by Laptev et al. (2008) to align movie scripts using subtitles.

Bojanowski et al. (2013) approach the problem of learning a joint model of actors and actions in movies using weak supervision provided by scripts. They rely on the semantic parser SEMAFOR (Das et al. 2012) trained on the FrameNet database (Baker et al. 1998), however, they limit the recognition only to two frames. Bojanowski et al. (2014) aim to localize individual short actions in longer clips by exploiting the ordering constraints as weak supervision. Bojanowski et al. (2013, 2014), Duchenne et al. (2009), Laptev et al. (2008), Marszalek et al. (2009) proposed datasets focused on extracting several activities from movies. Most of them are part of the "Hollywood2" dataset (Marszalek et al. 2009) which contains 69 movies and 3669 clips. Another line of work (Cour et al. 2009; Everingham et al. 2006; Ramanathan et al. 2014; Sivic et al. 2009; Tapaswi et al. 2012) proposed datasets for character identification targeting TV shows. All the mentioned datasets rely on alignments to movie/TV scripts and none uses ADs.

ADs have also been used to understand which characters interact with each other (Salway et al. 2007). Other prior work has looked at supporting AD production using scripts as an information source (Lakritz and Salway 2006) and automatically finding scene boundaries (Gagnon et al. 2010). Salway (2007) analyses the linguistic properties on a non-public corpus of ADs from 91 movies. Their corpus is based on the original sources to create the ADs and contains different kinds of artifacts not present in actual description, such as dialogs and production notes. In contrast, our text corpus is much cleaner as it consists only of the actual ADs.

2.4 Works Building on Our Dataset

Interestingly, other works, datasets, and challenges are already building upon our data. Zhu et al. (2015b) learn a visual-semantic embedding from our clips and ADs to relate movies to books. Bruni et al. (2016) also learn a joint embedding of videos and descriptions and use this representation to improve activity recognition on the Hollywood 2 dataset (Marszalek et al. 2009). Tapaswi et al. (2016) use our AD transcripts for building their MovieQA dataset, which asks natural language questions about movies, requiring an understanding of visual and textual information, such as dialogue and AD, to answer the question. Zhu et al. (2015a) present a fill-in-the-blank challenge for audio description of the current, previous, and next sentence description for a given clip, requiring to understand the temporal context of the clips.

3 Datasets for Movie Description

In the following, we present how we collect our data for movie description and discuss its properties. The Large Scale Movie Description Challenge (LSMDC) is based on two datasets which were originally collected independently. The MPII Movie Description Dataset (MPII-MD), initially presented by Rohrbach et al. (2015c), was collected from Blu-ray movie data. It consists of AD and script data and uses sentence-level manual alignment of transcribed audio to the actions in the video (Sect. 3.1). In Sect. 3.2 we discuss how to fully automate AD audio segmentation and alignment for the Montreal Video Annotation Dataset (M-VAD), initially presented by Torabi et al. (2015). M-VAD was collected with DVD data quality and only relies on AD. Section 3.3 details the Large Scale Movie Description Challenge (LSMDC) which is based on M-VAD and MPII-MD, but also contains additional movies, and was set up as a challenge. It includes a submission server for evaluation on public and blind test sets. In Sect. 3.4 we present the detailed statistics of our datasets, also see Table 1. In Sect. 3.5 we compare our movie description data to other video description datasets.

3.1 The MPII Movie Description (MPII-MD) Dataset

In the following we describe our approach behind the collection of ADs (Sect. 3.1.1) and script data (Sect. 3.1.2). Then we discuss how to manually align them to the video (Sect. 3.1.3) and which visual features we extracted from the video (Sect. 3.1.4).

3.1.1 Collection of ADs

We search for Blu-ray movies with ADs in the "Audio Description" section of the British Amazon4 and select 55 movies of diverse genres (e.g. drama, comedy, action). As ADs are only available in audio format, we first retrieve the audio stream from the Blu-ray HD disks. We use MakeMKV5 to extract a Blu-ray in the .mkv file format, and then XMediaRecode6 to select and extract the audio streams from it. Then we semi-automatically segment out the sections of the AD audio (which is mixed with the original audio stream) with the approach described below. The audio segments are then transcribed by a crowd-sourced transcription service7 that also provides us the time-stamps for each spoken sentence.

4 www.amazon.co.uk.
5 https://www.makemkv.com/.
6 https://www.xmedia-recode.de/.
7 CastingWords transcription service, http://castingwords.com/.


Table 1 Movie description dataset statistics, see discussion in Sect. 3.4; for average/total length we report the "2-seconds-expanded" alignment, used in this work, and an actual manual alignment in brackets

                               Unique movies  Words      Sentences  Clips    Average length (s)  Total length (h)
MPII-MD (AD)                   55             330,086    37,272     37,266   4.2 (4.1)           44.0 (42.5)
MPII-MD (movie script)         50             317,728    31,103     31,071   3.9 (3.6)           33.8 (31.1)
MPII-MD (total)                94             647,814    68,375     68,337   4.1 (3.9)           77.8 (73.6)
M-VAD (AD)                     92             502,926    55,904     46,589   6.2                 84.6
LSMDC 15 training              153            914,327    91,941     91,908   4.9 (4.8)           124.9 (121.4)
LSMDC 15 validation            12             63,789     6542       6542     5.3 (5.2)           9.6 (9.4)
LSMDC 15 and 16 public test    17             87,150     10,053     10,053   4.2 (4.1)           11.7 (11.3)
LSMDC 15 and 16 blind test     20             83,766     9578       9578     4.5 (4.4)           12.0 (11.8)
LSMDC 15 (total)               200            1,149,032  118,114    118,081  4.8 (4.7)           158.1 (153.9)
LSMDC 16 training              153            922,918    101,079    101,046  4.1 (3.9)           114.9 (109.7)
LSMDC 16 validation            12             63,321     7408       7408     4.1 (3.9)           8.4 (8.0)
LSMDC 15 and 16 public test    17             87,150     10,053     10,053   4.2 (4.1)           11.7 (11.3)
LSMDC 15 and 16 blind test     20             83,766     9578       9578     4.5 (4.4)           12.0 (11.8)
LSMDC 16 (total)               200            1,157,155  128,118    128,085  4.1 (4.0)           147.0 (140.8)

Semi-automatic Segmentation of ADs We are given two audio streams: the original audio and the one mixed with the AD. We first estimate the temporal alignment between the two, as there might be a few time frames difference. The precise alignment is important to compute the similarity of both streams. Both steps (alignment and similarity) are estimated using the spectrograms of the audio streams, which are computed using a Fast Fourier Transform (FFT). If the difference between the two audio streams is larger than a given threshold we assume the mixed stream contains AD at that point in time. We smooth this decision over time using a minimum segment length of 1 s. The threshold was picked on a few sample movies, but had to be adjusted for each movie due to different mixing of the AD stream, different narrator voice level, and movie sound. While we found this semi-automatic approach sufficient when using a further manual alignment, we describe a fully automatic procedure in Sect. 3.2.
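A minimal sketch of this spectrogram-difference segmentation is given below, assuming the two streams are already temporally aligned; the window size, the helper name and the per-movie threshold value are our own placeholders, not values from the paper.

```python
import numpy as np
from scipy.signal import spectrogram

def segment_ad(original, mixed, sr, threshold, min_segment_sec=1.0):
    """Return (start, end) times in seconds where the mixed track likely
    contains AD narration, based on spectrogram differences."""
    # Spectrograms of both (already aligned) audio streams.
    _, t, s_orig = spectrogram(original, fs=sr, nperseg=1024)
    _, _, s_mix = spectrogram(mixed, fs=sr, nperseg=1024)
    n = min(s_orig.shape[1], s_mix.shape[1])
    # Mean spectral difference per time frame.
    diff = np.abs(s_mix[:, :n] - s_orig[:, :n]).mean(axis=0)
    active = diff > threshold  # frames assumed to contain AD
    # Merge consecutive active frames, keep segments of at least 1 s.
    segments, start = [], None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = t[i]
        elif not flag and start is not None:
            if t[i] - start >= min_segment_sec:
                segments.append((start, t[i]))
            start = None
    if start is not None and t[n - 1] - start >= min_segment_sec:
        segments.append((start, t[n - 1]))
    return segments
```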
3.1.2 Collection of Script Data

In addition to the ADs we mine script web resources8 and select 39 movie scripts. As a starting point we use the movie scripts from "Hollywood2" (Marszalek et al. 2009) that have the highest alignment scores to their movie. We are also interested in comparing the two sources (movie scripts and ADs), so we are looking for the scripts labeled as "Final", "Shooting", or "Production Draft" where ADs are also available. We found that the "overlap" is quite narrow, so we analyze 11 such movies in our dataset. This way we end up with 50 movie scripts in total. We follow existing approaches (Cour et al. 2008; Laptev et al. 2008) to automatically align scripts to movies. First we parse the scripts, extending the method of Laptev et al. (2008) to handle scripts which deviate from the default format. Second, we extract the subtitles from the Blu-ray disks with SubtitleEdit,9 which also allows for subtitle alignment and spellchecking. Then we use the dynamic programming method of Laptev et al. (2008) to align scripts to subtitles and infer the time-stamps for the description sentences. We select the sentences with a reliable alignment score (the ratio of matched words in the near-by monologues) of at least 0.5. The obtained sentences are then manually aligned to video in-house.
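As a rough illustration of the filtering step above, the snippet below computes a word-overlap alignment score and keeps sentences reaching 0.5; the tokenization and the notion of "near-by" subtitle text are our simplifying assumptions, not the exact procedure of Laptev et al. (2008).

```python
import re

def alignment_score(script_text, subtitle_text):
    """Ratio of script words that also occur in the near-by subtitle
    monologues (case-insensitive word matching)."""
    tokenize = lambda s: re.findall(r"[a-z']+", s.lower())
    script_words = tokenize(script_text)
    subtitle_words = set(tokenize(subtitle_text))
    if not script_words:
        return 0.0
    matched = sum(1 for w in script_words if w in subtitle_words)
    return matched / len(script_words)

def filter_sentences(candidates, threshold=0.5):
    # candidates: (sentence, near-by subtitles, inferred time-stamp) triples
    return [(sent, time) for sent, subs, time in candidates
            if alignment_score(sent, subs) >= threshold]
```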
3.1.3 Manual Sentence-Video Alignment

As the AD is added to the original audio stream between the dialogs, there might be a small misalignment between the time of speech and the corresponding visual content. Therefore, we manually align each sentence from ADs and scripts to the movie in-house. During the manual alignment we also filter out: (a) sentences describing movie introduction/ending (production logo, cast, etc); (b) texts read from the screen; (c) irrelevant sentences describing something not present in the video; (d) sentences related to audio/sounds/music. For the movie scripts, the reduction in number of words is about 19%, while for ADs it is under 4%. In the case of ADs, filtering mainly happens due to initial/ending movie intervals and transcribed dialogs (when shown as text). For the scripts, it is mainly attributed to irrelevant sentences. Note that we retain the sentences that are "alignable" but contain minor mistakes.

8 http://www.weeklyscript.com, http://www.simplyscripts.com, http://www.dailyscript.com, http://www.imsdb.com.
9 www.nikse.dk/SubtitleEdit/.


If the manually aligned video clip is shorter than 2 s, we symmetrically expand it (from beginning and end) to be exactly 2 s long. In the following we refer to the obtained alignment as a "2-seconds-expanded" alignment.

3.1.4 Visual Features

We extract video clips from the full movie based on the aligned sentence intervals. We also uniformly extract 10 frames from each video clip. As discussed earlier, ADs and scripts describe activities, objects and scenes (as well as emotions which we do not explicitly handle with these features, but they might still be captured, e.g. by the context or activities). In the following we briefly introduce the visual features computed on our data which are publicly available.10

IDT We extract the improved dense trajectories compensated for camera motion (Wang and Schmid 2013). For each feature (Trajectory, HOG, HOF, MBH) we create a codebook with 4,000 clusters and compute the corresponding histograms. We apply L1 normalization to the obtained histograms and use them as features.

LSDA We use the recent large scale object detection CNN (Hoffman et al. 2014) which distinguishes 7604 ImageNet (Deng et al. 2009) classes. We run the detector on every second extracted frame (due to computational constraints). Within each frame we max-pool the network responses for all classes, then do mean-pooling over the frames within a video clip and use the result as a feature.

PLACES and HYBRID Finally, we use the recent scene classification CNNs (Zhou et al. 2014) featuring 205 scene classes. We use both available networks, Places-CNN and Hybrid-CNN, where the first is trained on the Places dataset (Zhou et al. 2014) only, while the second is additionally trained on the 1.2 million images of ImageNet (ILSVRC 2012) (Russakovsky et al. 2015). We run the classifiers on all the extracted frames of our dataset. We mean-pool over the frames of each video clip, using the result as a feature.
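The pooling used for the CNN-based features can be sketched as follows: responses are max-pooled within each frame and then mean-pooled over the frames of a clip. The array shapes and the function name are our assumptions for illustration only.

```python
import numpy as np

def pool_clip_feature(frame_scores):
    """frame_scores: list of arrays, one per sampled frame, each of shape
    (num_responses, num_classes), e.g. 7604 LSDA classes or 205 scene
    classes. Returns a clip-level feature of shape (num_classes,)."""
    # Max-pool the responses for each class within every frame ...
    per_frame = [scores.max(axis=0) for scores in frame_scores]
    # ... then mean-pool over the frames of the video clip.
    return np.mean(per_frame, axis=0)
```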
3.2 The Montreal Video Annotation Dataset (M-VAD)

One of the main challenges in automating the construction of a video annotation dataset derived from AD audio is accurately segmenting the AD output, which is mixed with the original movie soundtrack. In Sect. 3.1.1 we have introduced a way of semi-automatic AD segmentation. In this section we describe a fully automatic method for AD narration isolation and video alignment. AD narrations are typically carefully placed within key locations of a movie and edited by a post-production supervisor for continuity. For example, when a scene changes rapidly, the narrator will speak multiple sentences without pauses. Such content should be kept together when describing that part of the movie. If a scene changes slowly, the narrator will instead describe the scene in one sentence, then pause for a moment, and later continue the description. By detecting those short pauses, we are able to align a movie with video descriptions automatically.

In the following we describe how we select the movies with AD for our dataset (Sect. 3.2.1) and detail our automatic approach to AD segmentation (Sect. 3.2.2). In Sect. 3.2.3 we discuss how to align AD to the video and obtain high quality AD transcripts.

3.2.1 Collection of ADs

To search for movies with AD we use the movie lists provided on the "An Initiative of the American Council of the Blind"11 and "Media Access Group at WGBH"12 websites, and buy them based on their availability and price. To extract video and audio from the DVDs we use the DVDfab13 software.

3.2.2 AD Narrations Segmentation Using Vocal Isolation

Despite the advantages offered by AD, creating a completely automated approach for extracting the relevant narration or annotation from the audio track and refining the alignment of the annotation with the video still poses some challenges. In the following, we discuss our automatic solution for AD narrations segmentation. We use two audio tracks included in DVDs: (1) the standard movie audio signal and (2) the standard movie audio mixed with the AD narrations signal.

Vocal isolation techniques boost vocals, including dialogues and AD narrations, while suppressing background movie sound in stereo signals. This technique is used widely in karaoke machines for stereo signals to remove the vocal track by reversing the phase of one channel to cancel out any signal perceived to come from the center while leaving the signals that are perceived as coming from the left or the right. The main reason for using vocal isolation for AD segmentation is the fact that AD narration is mixed into natural pauses in the dialogue. Hence, AD narration can only be present when there is no dialogue. In vocal isolated signals, whenever the narrator speaks, the movie signal is almost a flat line relative to the AD signal, allowing us to cleanly separate the narration by comparing the two signals. Figure 4 illustrates an example from the movie "Life of Pi", where in the original movie soundtrack there are sounds of ocean waves in the background.

Our approach has three main steps. First we isolate vocals, including dialogues and AD narrations. Second, we separate the AD narrations from dialogues. Finally, we apply a simple thresholding method to extract AD segment audio tracks.

10 mpii.de/movie-description.
11 http://www.acb.org/adp/movies.html.
12 http://main.wgbh.org/wgbh/pages/mag/dvsondvd.html.
13 http://www.dvdfab.cn/.


Fig. 4 AD dataset collection, from the movie "Life of Pi". The second and third rows show the movie and AD audio signals after vocal isolation. The two circles show the AD segments on the AD mono channel track. A pause (flat signal) between two AD narration parts marks the natural AD narration segmentation, where the narrator stops and then continues describing the movie. We automatically segment the AD audio based on these natural pauses. In the first row, the transcriptions of the first and second AD narration parts are shown on top of the second and third image shots

We isolate vocals using Adobe Audition's center channel extractor14 implementation to boost AD narrations and movie dialogues while suppressing movie background sounds on both the AD and movie audio signals. We align the movie and AD audio signals by taking an FFT of the two audio signals, computing the cross-correlation, measuring similarity for different offsets and selecting the offset which corresponds to the peak cross-correlation. After alignment, we apply Least Mean Square (LMS) noise cancellation and subtract the AD mono squared signal from the movie mono squared signal in order to suppress dialogue in the AD signal. For the majority of movies on the market (among the 104 movies that we purchased, 12 movies have been mixed to the center of the audio signal, therefore we were not able to automatically align them), applying LMS results in cleaned AD narrations for the AD audio signal. Even in cases where the shapes of the standard movie audio signal and the standard movie audio mixed with AD signal are very different—due to the AD mixing process—our procedure is sufficient for automatic segmentation of AD narration.

Finally, we extract the AD audio tracks by detecting the beginning and end of AD narration segments in the AD audio signal (i.e. where the narrator starts and stops speaking) using a simple thresholding method that we applied to all DVDs without changing the threshold value. This is in contrast to the semi-automatic approach presented in Sect. 3.1.1, which requires individual adjustment of a threshold for each movie.

14 creative.adobe.com/products/audition.
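The offset estimation and the subsequent activity detection can be sketched as follows; this is a minimal illustration using FFT-based cross-correlation in scipy, not the Adobe Audition / LMS pipeline itself, and all names and the threshold are our placeholders.

```python
import numpy as np
from scipy.signal import fftconvolve

def estimate_offset(movie_mono, ad_mono, sr):
    """Estimate the temporal offset (in seconds) between the standard
    movie track and the AD-mixed track via FFT cross-correlation."""
    corr = fftconvolve(ad_mono, movie_mono[::-1], mode="full")
    lag = np.argmax(corr) - (len(movie_mono) - 1)
    return lag / sr

def ad_activity_mask(movie_mono, ad_mono, threshold):
    """After vocal isolation and alignment, the movie signal is nearly
    flat wherever the narrator speaks; mark samples where the squared
    AD signal clearly exceeds the squared movie signal (a crude stand-in
    for the LMS-based subtraction described above)."""
    diff = ad_mono ** 2 - movie_mono ** 2
    return diff > threshold
```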


Table 2 Vocabulary and POS statistics (after word stemming) for our movie description datasets, see discussion in Sect. 3.4

Dataset    Vocab. size  Nouns   Verbs  Adjectives  Adverbs
MPII-MD    18,871       10,558  2933   4239        1141
M-VAD      17,609       9512    2571   3560        857
LSMDC 15   22,886       12,427  3461   5710        1288
LSMDC 16   22,500       12,181  3394   5633        1292

3.2.3 Movie/AD Alignment and Professional Transcription

AD audio narration segments are time-stamped based on our automatic AD narration segmentation. In order to compensate for the potential 1–2 s misalignment between the AD narrator speaking and the corresponding scene in the movie, we automatically add 2 s to the end of each video clip. We also discard all the transcriptions related to the movie introduction/ending which are located at the beginning and the end of movies.

In order to obtain high quality text descriptions, the AD audio segments were transcribed with more than 98% transcription accuracy, using a professional transcription service.15 These services use a combination of automatic speech recognition techniques and human transcription to produce a high quality transcription. Our audio narration isolation technique allows us to process the audio into small, well defined time segments and reduce the overall transcription effort and cost.

3.3 The Large Scale Movie Description Challenge (LSMDC)

To build our Large Scale Movie Description Challenge (LSMDC), we combine the M-VAD and MPII-MD datasets. We first identify the overlap between the two, so that the same movie does not appear in the training and test set of the joined dataset. We also exclude script-based movie alignments from the validation and test sets of MPII-MD. The datasets are then joined by combining the corresponding training, validation and test sets, see Table 1 for detailed statistics. The combined test set is used as a public test set of the challenge. We additionally acquired 20 more movies where we only release the video clips, but not the aligned sentences. They form the blind test set of the challenge and are only used for evaluation. We rely on the respective best aspects of M-VAD and MPII-MD for the public and blind test sets: we provide Blu-ray quality for them, use the automatic alignment/transcription described in Sect. 3.2 and clean them using a manual alignment as in Sect. 3.1.3. For the second edition of our challenge, LSMDC 2016, we also manually align the M-VAD validation and training sets and release them with Blu-ray quality. The manual alignment results in many multi-sentence descriptions being split. Also, the more precise alignment reduces the average clip length.

We set up the evaluation server3 for the challenge using the Codalab16 platform. The challenge data is available online.2 We provide more information about the challenge setup and results in Sect. 6.

In addition to the description task, LSMDC 2016 includes three additional tracks, not discussed in this work. There is a movie annotation track which asks to select the correct sentence out of five in a multiple-choice test, a retrieval track which asks to retrieve the correct test clip for a given sentence, and a fill-in-the-blank track which requires to predict a missing word in a given description and the corresponding clip. The data and more details can be found on our web site2; Torabi et al. (2016) provide more details about the annotation and the retrieval tasks.

3.4 Movie Description Dataset Statistics

Table 1 presents statistics for the number of words, sentences and clips in our movie description corpora. We also report the average/total length of the annotated time intervals. We report both the "2-seconds-expanded" clip alignment (see Sect. 3.1.3) and the actual clip alignment in brackets. In total MPII-MD contains 68,337 clips and 68,375 sentences (rarely, multiple sentences might refer to the same video clip), while M-VAD includes 46,589 clips and 55,904 sentences.

Our combined LSMDC 2015 dataset contains over 118 K sentence-clip pairs and 158 h of video. The training/validation/public-/blind-test sets contain 91,908, 6542, 10,053 and 9578 video clips respectively. This split balances movie genres within each set, which is motivated by the fact that the vocabulary used to describe, say, an action movie could be very different from the vocabulary used in a comedy movie. After manual alignment of the training/validation sets, the new LSMDC 2016 contains 101,046 training clips, 7408 validation clips and 128 K clips in total.

Table 2 illustrates the vocabulary size, number of nouns, verbs, adjectives, and adverbs in each respective dataset.

15 TranscribeMe professional transcription, http://transcribeme.com.
16 https://codalab.org/.


Table 3 Comparison of video description datasets; for discussion see Sect. 3.5

Dataset                                    Multi-sentence  Domain   Sentence source  Videos  Clips    Sentences  Length (h)
YouCook (Das et al. 2013)                  x               Cooking  Crowd            88      −        2668       2.3
TACoS (Regneri et al. 2013)                x               Cooking  Crowd            127     7206     18,227     10.1
TACoS Multi-Level (Rohrbach et al. 2014)   x               Cooking  Crowd            185     24,764   74,828     15.8
MSVD (Chen and Dolan 2011)                                 Open     Crowd            −       1970     70,028     5.3
TGIF (Li et al. 2016)                                      Open     Crowd            −       100,000  125,781    ≈86.1
MSR-VTT (Xu et al. 2016)                                   Open     Crowd            7180    10,000   200,000    41.2
VTW (Zeng et al. 2016)                     x               Open     Crowd/profess.   18,100  −        44,613     213.2
M-VAD (ours)                               x               Open     Professional     92      46,589   55,904     84.6
MPII-MD (ours)                             x               Open     Professional     94      68,337   68,375     77.8
LSMDC 15 (ours)                            x               Open     Professional     200     118,081  118,114    158.1
LSMDC 16 (ours)                            x               Open     Professional     200     128,085  128,118    147.0

To compute the part of speech statistics for our corpora we tag and stem all words in the datasets with the Stanford Part-Of-Speech (POS) tagger and stemmer toolbox (Toutanova et al. 2003), then we compute the frequency of stemmed words in the corpora. Note that in this computation each word and its variations are counted once, since we apply stemming. An interesting observation is that the number of adjectives is larger than the number of verbs, which shows that the ADs describe the characteristics of visual elements in the movie in great detail.
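A simplified version of this counting, using NLTK's tagger and Porter stemmer as stand-ins for the Stanford toolbox, might look as follows; the coarse tag-to-category mapping is our assumption.

```python
import nltk
from nltk.stem import PorterStemmer

def pos_statistics(sentences):
    """Count unique stems per coarse POS category over a corpus."""
    stemmer = PorterStemmer()
    stems_by_pos = {"noun": set(), "verb": set(), "adj": set(), "adv": set()}
    coarse = {"NN": "noun", "VB": "verb", "JJ": "adj", "RB": "adv"}
    for sentence in sentences:
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
            category = coarse.get(tag[:2])
            if category:
                # Each word and its inflections count once via its stem.
                stems_by_pos[category].add(stemmer.stem(word.lower()))
    return {pos: len(stems) for pos, stems in stems_by_pos.items()}
```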
3.5 Comparison to Other Video Description Datasets

We compare our corpus to other existing parallel video corpora in Table 3. We look at the following properties: availability of multi-sentence descriptions (long videos described continuously with multiple sentences), data domain, source of descriptions and dataset size. The main limitations of prior datasets include the coverage of a single domain (Das et al. 2013; Regneri et al. 2013; Rohrbach et al. 2014) and having a limited number of video clips (Chen and Dolan 2011). Recently, a few video description datasets have been proposed, namely MSR-VTT (Xu et al. 2016), TGIF (Li et al. 2016) and VTW (Zeng et al. 2016). Similar to the MSVD dataset (Chen and Dolan 2011), MSR-VTT is based on YouTube clips. While it has a large number of sentence descriptions (200 K) it is still rather small in terms of the number of video clips (10 K). TGIF is a large dataset of 100 K image sequences (GIFs) with associated descriptions. VTW is a dataset which focuses on longer YouTube videos (1.5 min on average) and aims to generate concise video titles from user provided descriptions as well as editor provided titles. All these datasets are similar in that they contain web videos, while our proposed dataset focuses on movies. Similar to e.g. VTW, our dataset has a "multi-sentence" property, making it possible to study multi-sentence description or understanding of stories and plots.

4 Approaches for Movie Description

Given a training corpus of aligned videos and sentences we want to describe a new unseen test video. In this section we discuss two approaches to the video description task that we benchmark on our proposed datasets. Our first approach (Sect. 4.1) is based on the statistical machine translation (SMT) approach of Rohrbach et al. (2013). Our second approach (Sect. 4.2) learns to generate descriptions using a long short-term memory network (LSTM). For the first step both approaches rely on visual classifiers learned on annotations (labels) extracted from natural language descriptions using our semantic parser (Sect. 4.1.1). While the first approach does not differentiate which features to use for different labels, our second approach defines different semantic groups of labels and uses the most relevant visual features for each group. For this reason we refer to this approach as Visual-Labels. Next, the first approach uses the classifier scores as input to a CRF to predict a semantic representation (SR) (SUBJECT, VERB, OBJECT, LOCATION), and then translates it into a sentence with SMT. On the other hand, our second approach directly provides the classifier scores as input to an LSTM which generates a sentence based on them. Figure 5 shows an overview of the two discussed approaches.

4.1 Semantic Parsing + SMT

As our first approach we adapt the two-step translation approach of Rohrbach et al. (2013). As a first step it trains the visual classifiers based on manually annotated tuples, e.g. ⟨cut, knife, tomato⟩, provided with the video. Then it trains a CRF which aims to predict such a tuple, or semantic representation (SR), from a video clip. At a second step, Statistical Machine Translation (SMT) (Koehn et al. 2007) is used to translate the obtained SR into a natural language sentence, e.g. "The person cuts a tomato with a knife", see Fig. 5a.


Fig. 5 Overview of our movie description approaches: a SMT-based approach, adapted from Rohrbach et al. (2013), b our proposed LSTM-based approach

While we cannot rely on a manually annotated SR as in Rohrbach et al. (2013), we automatically mine the SR from sentences using semantic parsing, which we introduce in this section.

4.1.1 Semantic Parsing

Learning from a parallel corpus of videos and natural language sentences is challenging when no annotated intermediate representation is available. In this section we introduce our approach to exploit the sentences using semantic parsing. The proposed method automatically extracts intermediate semantic representations (SRs) from the natural sentences.

Approach We lift the words in a sentence to a semantic space of roles and WordNet (Fellbaum 1998) senses by performing SRL (Semantic Role Labeling) and WSD (Word Sense Disambiguation). For an example, refer to Table 4, where the desired outcome of SRL and WSD on the input sentence "He shot a video in the moving bus" is "Agent: man1n, Action: shoot4v, Patient: video2n, Location: bus1n". Here, e.g. shoot4v refers to the fourth verb sense of shoot in WordNet.17 This is similar to the semantic representation of Rohrbach et al. (2013), except that those semantic frames were constructed manually while we construct them automatically, and our role fillers are additionally sense disambiguated. As verbs are known to have high ambiguity, the disambiguation step will provide clearer representations (the corresponding WordNet sense) of a large set of verbs present in movie descriptions.

We start by decomposing the typically long sentences present in movie descriptions into smaller clauses using the ClausIE tool (Del Corro and Gemulla 2013). For example, "he shot and modified the video" is split into two clauses, "he shot the video" and "he modified the video". We then use the OpenNLP tool suite18 to chunk every clause into phrases. These chunks are disambiguated to their WordNet senses17 by enabling a state-of-the-art WSD system called IMS (Zhong and Ng 2010) to additionally disambiguate phrases that are not present in WordNet and thus out of reach for IMS. We identify and disambiguate the head word of an out-of-WordNet phrase, e.g. the moving bus to the proper WordNet sense bus1n via IMS. In this way we extend IMS so it works for phrases and not just words. We link verb phrases to the proper sense of their head word in WordNet (e.g. begin to shoot to shoot4v). Phrasal verbs such as "pick up" or "turn off" are preserved as long as they exist in WordNet.

Having estimated WordNet senses for the words and phrases, we need to assign semantic role labels to them. Typical SRL systems require large amounts of training data, which we do not possess for the movie domain. Therefore, we propose leveraging VerbNet (Kipper et al. 2006; Schuler et al. 2009), a manually curated high-quality linguistic resource for English verbs that supplements WordNet verb senses with syntactic frames and semantic roles, as a distant signal to assign role labels. Every VerbNet verb sense comes with a syntactic frame, e.g. for shoot4v the syntactic frame is NP V NP. VerbNet also provides a role restriction on the arguments of the roles.

17 The WordNet senses for shoot and video are:
• shoot1v: hit with missile …  video1n: picture in TV
• shoot2v: kill by missile …  video2n: a recording …
• …
• shoot4v: make a film …  video4n: broadcasting …
where shoot1v refers to the first verb (v) sense of shoot.
18 OpenNLP tool suite: http://opennlp.sourceforge.net/.


Table 4 Semantic parse for "He began to shoot a video in the moving bus"; for discussion, see Sect. 4.1.1

Phrase           WordNet mapping  VerbNet mapping     Desired frame
the man          man1n            Agent.animate       Agent: man1n
begin to shoot   shoot4v          shoot4v             Action: shoot4v
a video          video2n          Patient.inanimate   Patient: video2n
in               in               PP.in
the moving bus   bus1n            NP.Location.solid   Location: moving bus1n

For example, for shoot3v (sense killing), the role restriction is Agent.animate V Patient.animate PP Instrument.solid. For another sense, shoot4v (sense film), the semantic restriction is Agent.animate V Patient.inanimate. We ensure that the selected WordNet verb sense adheres to both the syntactic frame and the semantic role restriction provided by VerbNet. For example, in Table 4, because video2n is a type of inanimate object (inferred through the WordNet noun taxonomy), this sense correctly adheres to the VerbNet role restriction. We can now simply apply the VerbNet-suggested role Patient to video2n.
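As an illustration of such an animacy test, the snippet below walks the WordNet noun hypernym hierarchy via NLTK; treating descendants of person.n.01 and animal.n.01 as animate is our simplification, not necessarily the exact rule used by the parser.

```python
from nltk.corpus import wordnet as wn

ANIMATE_ROOTS = {wn.synset("person.n.01"), wn.synset("animal.n.01")}

def is_animate(synset):
    """A WordNet noun synset is treated as animate if any of its
    hypernym paths passes through a known 'animate' root."""
    closure = set(synset.closure(lambda s: s.hypernyms()))
    return bool(ANIMATE_ROOTS & (closure | {synset}))

# video.n.02 ("a recording ...") is inanimate, so it satisfies the
# restriction Patient.inanimate of shoot4v (sense "make a film").
print(is_animate(wn.synset("video.n.02")))  # -> False
print(is_animate(wn.synset("man.n.01")))    # -> True
```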
Semantic Representation Although VerbNet is helpful as a distant signal to disambiguate and perform semantic role labeling, VerbNet contains over 20 roles and not all of them are general or can be recognized reliably. Therefore, for simplicity, we generalize and group them to get the SUBJECT, VERB, OBJECT, LOCATION roles. For example, the roles patient, recipient, and beneficiary are generalized to OBJECT. We explore two approaches to obtain the labels based on the output of the semantic parser. The first is to use the extracted text chunks directly as labels. The second is to use the corresponding senses as labels (and therefore group multiple text labels). In the following we refer to these as text- and sense-labels. Thus from each sentence we extract a semantic representation in the form (SUBJECT, VERB, OBJECT, LOCATION).

4.1.2 SMT

For the sentence generation we build on the two-step translation approach of Rohrbach et al. (2013). As the first step it learns a mapping from the visual input to the semantic representation (SR), modeling pairwise dependencies in a CRF using visual classifiers as unaries. The unaries are trained using an SVM on dense trajectories (Wang and Schmid 2013). In the second step it translates the SR to a sentence using Statistical Machine Translation (SMT) (Koehn et al. 2007). For this the approach uses a concatenated SR as the input language, e.g. cut knife tomato, and the natural sentence as the output language, e.g. The person slices the tomato. We obtain the SR automatically from the semantic parser, as described above in Sect. 4.1.1. In addition to dense trajectories we use the features described in Sect. 3.1.4.

4.2 Visual Labels + LSTM

Next we present our two-step LSTM-based approach. The first step performs visual recognition using visual classifiers which we train according to the labels' semantics and "visuality". The second step generates textual descriptions using an LSTM network (see Fig. 5b). We explore various design choices for building and training the LSTM.

4.2.1 Robust Visual Classifiers

For training we rely on a parallel corpus of videos and weak sentence annotations. As before (see Sect. 4.1) we parse the sentences to obtain a set of labels (single words or short phrases, e.g. look up) to train visual classifiers. However, this time we aim to select the most visual labels which can be robustly recognized. In order to do that we take three steps.

Avoiding Parser Failure Not all sentences can be parsed successfully, as e.g. some sentences are incomplete or grammatically incorrect. To avoid losing the potential labels in these sentences, we match our set of initial labels to the sentences which the parser failed to process. Specifically, we do a simple word matching, i.e. if the label is found in the sentence, we consider this sentence as a positive for the label.

Semantic Groups Our labels correspond to different semantic groups. In this work we consider the three most important groups: verbs, objects and places. We propose to treat each label group independently. First, we rely on a different representation for each semantic group, which is targeted to the specific group. Namely we use the activity recognition features Improved Dense Trajectories (DT) for verbs, LSDA scores for objects and PLACES-CNN scores for places. Second, we train one-vs-all SVM classifiers for each group separately. The intuition behind this is to avoid "wrong negatives" (e.g. using object "bed" as a negative for place "bedroom").
123

Content courtesy of Springer Nature, terms of use apply. Rights reserved.


106 Int J Comput Vis (2017) 123:94–120

input-lang input-lang input-lang input-lang input-vis


LSTM LSTM LSTM LSTM LSTM
input-vis input-vis input-vis
lang-drop concat vis-drop
input-lang input-lang input-lang
LSTM LSTM LSTM LSTM LSTM concat-drop
input-vis input-vis input-vis

input-lang input-lang input-lang LSTM


LSTM LSTM LSTM LSTM LSTM
input-vis input-vis input-vis
lstm-drop
1 layer 2 layers unfactored 2 layers factored dropouts
(a) (b) (c) (d)

Fig. 6 a–c LSTM architectures, d variants of placing the dropout layer

Visual Labels Now, how do we select visual labels for our semantic groups? In order to find the verbs among the labels we rely on our semantic parser (Sect. 4.1.1). Next, we look up the list of "places" used in Zhou et al. (2014) and search for corresponding words among our labels. We look up the object classes used in Hoffman et al. (2014) and search for these "objects", as well as their base forms (e.g. "domestic cat" and "cat"). We discard all the labels that do not belong to any of our three groups of interest as we assume that they are likely not visual and thus are difficult to recognize. Finally, we discard labels which the classifiers could not learn reliably, as these are likely noisy or not visual. For this we require the classifiers to have a certain minimum area under the ROC curve (Receiver Operating Characteristic). We estimate a threshold for the ROC values on a validation set. We empirically evaluate this as well as all other design choices of our approach in Sect. 5.4.2.
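A sketch of this label selection with scikit-learn is shown below; the ROC threshold of 0.7 is an arbitrary placeholder (the paper estimates it on a validation set), and the per-group feature matrices are assumed to be precomputed.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score

def train_and_filter_labels(features, label_matrix, label_names,
                            val_features, val_label_matrix, min_auc=0.7):
    """Train one-vs-all linear SVMs for one semantic group (verbs,
    objects or places) and keep only labels whose classifier reaches a
    minimum ROC AUC on the validation set."""
    kept = {}
    for j, name in enumerate(label_names):
        y_train = label_matrix[:, j]
        y_val = val_label_matrix[:, j]
        if y_train.sum() == 0 or len(np.unique(y_val)) < 2:
            continue  # not trainable or not measurable
        clf = LinearSVC(C=1.0).fit(features, y_train)
        auc = roc_auc_score(y_val, clf.decision_function(val_features))
        if auc >= min_auc:  # discard noisy / non-visual labels
            kept[name] = clf
    return kept
```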
4.2.2 LSTM for Sentence Generation

We rely on the basic LSTM architecture proposed in Donahue et al. (2015) for video description. At each time step an LSTM generates a word and receives the visual classifiers (input-vis) as well as the previously generated word (input-lang) as input (see Fig. 6a). We encode each word with a one-hot vector according to its index in a dictionary and project it into a lower dimensional embedding. The embedding is jointly learned during training of the LSTM. We feed in the classifier scores as input to the LSTM, which is equivalent to the best variant proposed in Donahue et al. (2015).
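A minimal sketch of this input scheme is given below (written in PyTorch purely as an illustration, not the authors' original implementation; the 500-dimensional hidden state follows Sect. 5.4.2, while the embedding size and the class name are assumptions):

    import torch
    import torch.nn as nn

    class VisualLabelsDecoder(nn.Module):
        """1-layer variant: at every step the LSTM receives the embedded previous
        word concatenated with the (per-clip constant) visual classifier scores."""

        def __init__(self, vocab_size, num_labels, embed_dim=500, hidden_dim=500):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)  # learned jointly
            self.lstm = nn.LSTM(embed_dim + num_labels, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, prev_words, visual_scores):
            # prev_words: (batch, steps) word indices; visual_scores: (batch, num_labels)
            emb = self.embed(prev_words)                                  # (B, T, E)
            vis = visual_scores.unsqueeze(1).expand(-1, emb.size(1), -1)  # repeat per step
            hidden, _ = self.lstm(torch.cat([emb, vis], dim=2))
            return self.out(hidden)                                       # next-word logits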
We analyze the following aspects for this architecture:

Layer Structure We compare a 1-layer architecture with a 2-layer architecture. In the 2-layer architecture, the output of the first layer is used as input for the second layer (Fig. 6b); this variant was used by Donahue et al. (2015) for video description. Additionally we also compare to a 2-layer factored architecture of Donahue et al. (2015), where the first layer only gets the language as input and the second layer gets the output of the first as well as the visual input.

Dropout Placement To learn a more robust network which is less likely to overfit we rely on dropout (Hinton et al. 2012), i.e. a ratio r of randomly selected units is set to 0 during training (while all others are multiplied with 1/r). We explore different ways to place dropout in the network, i.e. either for the language input only (lang-drop), for the visual input only (vis-drop), for both inputs (concat-drop), or for the LSTM output (lstm-drop), see Fig. 6d.
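The four placements can be summarized in a small sketch (again PyTorch for illustration; the wrapper class and argument names are hypothetical):

    import torch
    import torch.nn as nn

    class DropoutPlacement(nn.Module):
        """Sketch of the four dropout placements compared in Fig. 6d."""

        def __init__(self, lstm, drop_ratio=0.5, where="lstm-drop"):
            super().__init__()
            self.lstm, self.drop, self.where = lstm, nn.Dropout(drop_ratio), where

        def forward(self, lang, vis):
            # lang: (B, T, E) embedded words; vis: (B, T, L) classifier scores per step;
            # self.lstm is assumed to be a batch_first nn.LSTM over the concatenation.
            if self.where == "lang-drop":      # dropout on the language input only
                lang = self.drop(lang)
            if self.where == "vis-drop":       # dropout on the visual input only
                vis = self.drop(vis)
            x = torch.cat([lang, vis], dim=2)
            if self.where == "concat-drop":    # dropout on the concatenated inputs
                x = self.drop(x)
            h, _ = self.lstm(x)
            if self.where == "lstm-drop":      # dropout on the LSTM output
                h = self.drop(h)               # (the best variant, see Table 10)
            return h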
5 Evaluation on MPII-MD and M-VAD

In this section we evaluate and provide more insights about our movie description datasets MPII-MD and M-VAD. We compare ADs to movie scripts (Sect. 5.1), present a short evaluation of our semantic parser (Sect. 5.2), present the automatic and human evaluation metrics for description (Sect. 5.3) and then benchmark the approaches to video description introduced in Sect. 4 as well as other related work. We conclude this section with an analysis of the different approaches (Sect. 5.5).
In Sect. 6 we will extend this discussion to the results of the Large Scale Movie Description Challenge.

5.1 Comparison of AD Versus Script Data

We compare the AD and script data using 11 movies from the MPII-MD dataset where both are available (see Sect. 3.1.2). For these movies we select the overlapping time intervals with an intersection over union overlap of at least 75%, which results in 279 sentence pairs; we remove 2 pairs which have identical sentences. We ask humans via Amazon Mechanical Turk (AMT) to compare the sentences with respect to their correctness and relevance to the video, using both video intervals as a reference (one at a time).


Table 5 Human evaluation of movie scripts and ADs: which sentence is more correct/relevant with respect to the video (forced choice); majority vote of 5 judges in %. In brackets: at least 4 out of 5 judges agree; see also Sect. 5.1

               Correctness   Relevance
Movie scripts  33.9 (11.2)   33.4 (16.8)
ADs            66.1 (35.7)   66.6 (44.9)

Table 6 Semantic parser accuracy on MPII-MD; discussion in Sect. 5.2

Corpus    Clause  NLP   Roles  WSD
MPII-MD   0.89    0.62  0.86   0.7
Each task was completed by 5 different human subjects, covering 2770 tasks done in total. Table 5 presents the results of this evaluation. AD is ranked as more correct and relevant in about 2/3 of the cases (i.e. there is a margin of about 33%). Looking at the more strict evaluation where at least 4 out of 5 judges agree (in brackets in Table 5) there is still a significant margin of 24.5% between ADs and movie scripts for Correctness, and 28.1% for Relevance. One can assume that in the cases of lower agreement the descriptions are probably of similar quality. This evaluation supports our intuition that scripts contain mistakes and irrelevant content even after being cleaned up and manually aligned.
5.2 Semantic Parser Evaluation

We empirically evaluate the various components of the semantic parsing pipeline, namely clause splitting (Clause), POS tagging and chunking (NLP), semantic role labeling (Roles), and word sense disambiguation (WSD). We randomly sample 101 sentences from the MPII-MD dataset over which we perform semantic parsing and log the outputs at various stages of the pipeline (similar to Table 4). We let three human judges evaluate the results for every token in the clause (similar to evaluating every row in Table 4) with a correct/incorrect label. From this data, we consider the majority vote for every token in the sentence (i.e. at least 2 out of 3 judges must agree). For a given clause, we assign a score of 1 to a component if the component made no mistake for the entire clause. For example, "Roles" gets a score of 1 if, according to the majority vote from the judges, we correctly estimate all semantic roles in the clause. Table 6 reports the average accuracy of the components over 130 clauses (generated from the 101 sentences).
It is evident that the poorest performing parts are the NLP and the WSD components. Some of the NLP mistakes arise due to incorrect POS tagging. WSD is considered a hard problem and when the dataset contains rare words, the performance is severely affected.
Relevance “Rank relevance of sentences: Which sentence
contains the more salient (i.e. relevant, important) events/
5.3 Evaluation Metrics for Description objects of the video?”

In this section we describe how we evaluate the generated


descriptions using automatic and human evaluation. 19 https://github.com/tylin/coco-caption.
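For illustration, scoring a set of generated sentences against references with these tools can look roughly as follows (a sketch only; the import paths assume the pycocoevalcap package layout of the linked repository, and the inputs are dictionaries mapping clip ids to lists of tokenized sentences):

    from pycocoevalcap.bleu.bleu import Bleu
    from pycocoevalcap.meteor.meteor import Meteor
    from pycocoevalcap.rouge.rouge import Rouge
    from pycocoevalcap.cider.cider import Cider

    def evaluate(references, hypotheses):
        # references / hypotheses: {clip_id: [sentence, ...]}
        results = {}
        for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()),
                             ("ROUGE_L", Rouge()), ("CIDEr", Cider())]:
            score, _ = scorer.compute_score(references, hypotheses)
            results[name] = score  # BLEU returns a list with BLEU-1..4
        return results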


5.3.2 Human Evaluation

For the human evaluation we rely on a ranking approach, i.e. human judges are given multiple descriptions from different systems and are asked to rank them with respect to the following criteria: correctness, relevance, and grammar, motivated by prior work (Rohrbach et al. 2013). In addition, we asked human judges to rank sentences for "how helpful they would be for a blind person to understand what is happening in the movie". The AMT workers are given randomized sentences and, in addition to some general instruction, the following definitions:

Grammar "Rank grammatical correctness of sentences: Judge the fluency and readability of the sentence (independently of the correctness with respect to the video)."

Correctness "Rank correctness of sentences: For which sentence is the content more correct with respect to the video (independent if it is complete, i.e. describes everything), independent of the grammatical correctness."

Relevance "Rank relevance of sentences: Which sentence contains the more salient (i.e. relevant, important) events/objects of the video?"

Helpful for the Blind In the LSMDC evaluation we introduce a new measure, which should capture how useful a description would be for blind people: "Rank the sentences according to how useful they would be for a blind person which would like to understand/follow the movie without seeing it."
5.4 Movie Description Evaluation

As the collected text data comes from the movie context, it contains a lot of information specific to the plot, such as names of the characters. We pre-process each sentence in the corpus, transforming the names to "Someone" or "people" (in case of plural).
We first analyze the performance of the proposed approaches on the MPII-MD dataset, and then evaluate the best version on the M-VAD dataset. For MPII-MD we split the 11 movies with associated scripts and ADs (in total 22 alignments, see Sect. 3.1.2) into validation set (8) and test set (14). The other 83 movies are used for training. On M-VAD we use 10 movies for testing, 10 for validation and 72 for training.
tance of the label discrimination (i.e. using different features
5.4.1 Semantic Parsing + SMT for different semantic groups of labels), we propose another
baseline (line 9). Here we use the same set of 263 labels but
Table 7 summarizes results of multiple variants of the SMT provide the same feature for all of them, namely the best per-
approach when using the SR from our semantic parser. forming combination IDT + LSDA + PLACES. As we see,
“Combi” refers to combining IDT, HYBRID, and PLACES this results in an inferior performance.
as unaries in the CRF. We did not add LSDA as we found We make several observations from Table 8 which lead to
that it reduces the performance of the CRF. After extract- robust visual classifiers from the weak sentence annotations.
ing the labels we select the ones which appear at least 30 or (a) It is beneficial to select features based on the label seman-
100 times as our visual attributes. Overall, we observe sim- tics. (b) Training one-vs-all SVMs for specific label groups
ilar performance in all cases, with slightly better results for consistently improves the performance as it avoids “wrong”
text-labels than sense-labels. This can be attributed to sense negatives. (c) Focusing on more “visual” labels helps: we
disambiguation errors of the semantic parser. In the follow- reduce the LSTM input dimensionality to 263 while improv-
ing we use the “IDT 30” model, which achieves the highest ing the performance.
score of 5.59, and denote it as “SMT-Best”.20

20 We also evaluated the “Semantic parsing+SMT” approach on a cor- (Rohrbach et al. 2014), and showed the comparable performance to
pus where annotated SRs are available, namely TACoS Multi-Level manually annotated SRs, see Rohrbach et al. (2015c).


Table 8 Comparison of different choices of labels and visual classifiers; all results reported on the validation set of MPII-MD; for discussion see Sect. 5.4.2

Approach                                        Labels   Classifiers (METEOR in %)
                                                         Retrieved   Trained
Baseline: all labels treated the same way
(1) IDT                                         1263     −           6.73
(2) LSDA                                        1263     −           7.07
(3) PLACES                                      1263     −           7.10
(4) IDT+LSDA+PLACES                             1263     −           7.24
Visual labels
(5) Verbs(IDT), Others(LSDA)                    1328     7.08        7.27
(6) Verbs(IDT), Places(PLACES), Others(LSDA)    1328     7.09        7.39
(7) Verbs(IDT), Places(PLACES), Objects(LSDA)   913      7.10        7.48
(8) + restriction to labels with ROC ≥ 0.7      263      7.41        7.54
Baseline: all labels treated the same way, labels from (8)
(9) IDT+LSDA+PLACES                             263      7.16        7.20

Bold value indicates the best performing variant in the table

Table 9 LSTM architectures (fixed parameters: LSTM-drop, dropout 0.5), MPII-MD val set; labels, classifiers as Table 8, line (8); for discussion see Sect. 5.4.2

Architecture        METEOR
1 layer             7.54
2 layers unfact.    7.54
2 layers fact.      7.41

Bold value indicates the best performing variant in the table

Table 10 Dropout strategies (fixed parameters: 1-layer, dropout 0.5), MPII-MD val set; labels, classifiers as Table 8, line (8); for discussion see Sect. 5.4.2

Dropout       METEOR
No dropout    7.19
Lang-drop     7.13
Vis-drop      7.34
Concat-drop   7.29
LSTM-drop     7.54

Bold value indicates the best performing variant in the table
Table 11 Dropout ratios (fixed parameters: 1-layer, LSTM-drop), MPII-MD val set; labels, classifiers as Table 8, line (8); for discussion see Sect. 5.4.2

Dropout ratio   METEOR
r = 0.1         7.22
r = 0.25        7.42
r = 0.5         7.54
r = 0.75        7.46

Bold value indicates the best performing variant in the table

LSTM Architectures Now, as described in Sect. 4.2.2, we look at different LSTM architectures and training configurations. In the following we use the best performing "Visual Labels" approach, Table 8, line (8).
We start with examining the architecture, where we explore different configurations of LSTM and dropout layers. Table 9 shows the performance of three different networks: "1 layer", "2 layers unfactored" and "2 layers factored", introduced in Sect. 4.2.2. As we see, the "1 layer" and "2 layers unfactored" networks perform equally well, while "2 layers factored" is inferior to them. In the following experiments we use the simpler "1 layer" network. We then compare different dropout placements, as illustrated in Table 10. We obtain the best result when applying dropout after the LSTM layer ("lstm-drop"), while having no dropout or applying it only to the language input leads to stronger over-fitting to the visual features. Putting dropout after the LSTM (and prior to a final prediction layer) makes the entire system more robust. As for the best dropout ratio, we find that 0.5 works best with lstm-dropout (Table 11).
In most of the experiments we trained our networks for 25,000 iterations. After looking at the METEOR scores for intermediate iterations we found that at iteration 15,000 we achieve the best performance overall. Additionally we train multiple LSTMs with different random orderings of the training data. In our experiments we combine three in an ensemble, averaging the resulting word predictions.
To summarize, the most important aspects that decrease over-fitting and lead to better sentence generation are: (a) a correct learning rate and step size, (b) dropout after the LSTM layer, (c) choosing the training iteration based on the METEOR score as opposed to only looking at the LSTM accuracy/loss, which can be misleading, and (d) building ensembles of multiple networks with different random initializations.21

21 More details can be found in our corresponding arXiv version (Rohrbach et al. 2015a).


Table 12 Test set of MPII-MD: Comparison of our proposed methods to baselines and prior work: S2VT (Venugopalan et al. 2015a), Temporal Attention (Yao et al. 2015); human eval ranked 1–3, lower is better; for discussion see Sect. 5.4.3

Approach               METEOR   CIDEr   Human evaluation: rank
                       in %     in %    Correct.   Grammar   Relev.
NN baselines
  IDT                  4.87     2.77    −          −         −
  LSDA                 4.45     2.84    −          −         −
  PLACES               4.28     2.73    −          −         −
  HYBRID               4.34     3.29    −          −         −
SMT-Best (ours)        5.59     8.14    2.11       2.39      2.08
S2VT                   6.27     9.00    2.02       1.67      2.06
Visual-Labels (ours)   7.03     9.98    1.87       1.94      1.86
NN METEOR upperbound   19.43    −       −          −         −

Bold values indicate the best performing variant per measure/column

Table 13 Test set of M-VAD: Comparison of our proposed methods to prior work: S2VT (Venugopalan et al. 2015a), Temporal Attention (Yao et al. 2015); human eval ranked 1–3, lower is better; for discussion see Sect. 5.4.3

Approach               METEOR in %   CIDEr in %
Temporal Attention     4.33          5.55
S2VT                   5.62          7.22
Visual-Labels (ours)   6.36          7.48

Bold values indicate the best performing variant per measure/column
5.4.3 Comparison to Related Work

Experimental Setup In this section we perform the evaluation on the test set of the MPII-MD dataset (6578 clips) and the M-VAD dataset (4951 clips). We use METEOR and CIDEr for automatic evaluation and we perform a human evaluation on a random subset of 1300 video clips, see Sect. 5.3 for details. For M-VAD experiments we train our method on M-VAD and use the same LSTM architecture and parameters as for MPII-MD, but select the number of iterations on the M-VAD validation set.

Results on MPII-MD Table 12 summarizes the results on the test set of MPII-MD. Here we additionally include the results from a nearest neighbor baseline, i.e. we retrieve the closest sentence from the training corpus using L1-normalized visual features and the intersection distance. Our SMT-Best approach clearly improves over the nearest neighbor baselines. With our Visual-Labels approach we significantly improve the performance, specifically by 1.44 METEOR points and 1.84 CIDEr points. Moreover, we improve over the recent approach of Venugopalan et al. (2015a), which also uses an LSTM to generate video descriptions. Exploring different strategies for label selection and classifier training, as well as various LSTM configurations, allows us to obtain better results than prior work on the MPII-MD dataset. Human evaluation mainly agrees with the automatic measures. Visual-Labels outperforms both other methods in terms of Correctness and Relevance, however it loses to S2VT in terms of Grammar. This is due to the fact that S2VT produces overall shorter (7.4 vs. 8.7 words per sentence) and simpler sentences, while our system generates longer sentences and therefore has higher chances to make mistakes. We also propose a retrieval upperbound: for every test sentence we retrieve the closest training sentence according to the METEOR score. The rather low METEOR score of 19.43 reflects the difficulty of the dataset. We show some qualitative results in Fig. 7.
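One plausible reading of this nearest neighbor baseline is sketched below (NumPy; the exact definition of the intersection distance used here is not given in the text, so the histogram-intersection similarity on L1-normalized features is an assumption):

    import numpy as np

    def nearest_neighbor_caption(query_feat, train_feats, train_sentences):
        def l1_normalize(x):
            return x / np.maximum(np.abs(x).sum(axis=-1, keepdims=True), 1e-12)

        q = l1_normalize(query_feat)                 # (dim,)
        t = l1_normalize(train_feats)                # (n_train, dim)
        similarity = np.minimum(t, q).sum(axis=1)    # histogram intersection
        return train_sentences[int(np.argmax(similarity))]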
Results on M-VAD Table 13 shows the results on the test set of the M-VAD dataset. Our Visual-Labels method outperforms S2VT (Venugopalan et al. 2015a) and Temporal Attention (Yao et al. 2015) in METEOR and CIDEr score. As we see, the results agree with Table 12, but are consistently lower, suggesting that M-VAD is more challenging than MPII-MD. We attribute this to the more precise manual alignment of the MPII-MD dataset.

5.5 Movie Description Analysis

Despite the recent advances in the video description task, the performance on the movie description datasets (MPII-MD and M-VAD) remains rather low. In this section we want to look closer at three methods, SMT-Best, S2VT and Visual-Labels, in order to understand where these methods succeed and where they fail. In the following we evaluate all three methods on the MPII-MD test set.

5.5.1 Difficulty Versus Performance

As a first study we sort the test reference sentences by difficulty, where difficulty is defined in multiple ways.21


Approach: Sentence

SMT-Best (ours): Someone is a man, someone is a man.
S2VT: Someone looks at him, someone turns to someone.
Visual-Labels (ours): Someone is standing in the crowd, a little man with a little smile.
Reference: Someone, back in elf guise, is trying to calm the kids.

SMT-Best (ours): The car is a water of the water.
S2VT: On the door, opens the door opens.
Visual-Labels (ours): The fellowship are in the courtyard.
Reference: They cross the quadrangle below and run along the cloister.

SMT-Best (ours): Someone is down the door, someone is a back of the door, and someone is a door.
S2VT: Someone shakes his head and looks at someone.
Visual-Labels (ours): Someone takes a drink and pours it into the water.
Reference: Someone grabs a vodka bottle standing open on the counter and liberally pours some on the hand.

Fig. 7 Qualitative comparison of our proposed methods to prior work: S2VT (Venugopalan et al. 2015a). Examples from the test set of MPII-MD. Visual-Labels identifies activities, objects, and places better than the other two methods. See Sect. 5.4.3

Fig. 8 Y-axis: METEOR score per sentence. X-axis: MPII-MD test sentences 1–6578 sorted by a length (increasing); b word frequency (decreasing). Shown values are smoothed with a mean filter of size 500. For discussion see Sect. 5.5.1

Table 14 Entropy and top 3 frequent verbs of each WordNet topic; for discussion see Sect. 5.5.2

Topic           Entropy   Top-1     Top-2      Top-3
Motion          7.05      Turn      Walk       Shake
Contact         7.10      Open      Sit        Stand
Perception      4.83      Look      Stare      See
Stative         4.84      Be        Follow     Stop
Change          6.92      Reveal    Start      Emerge
Communication   6.73      Look up   Nod        Face
Body            5.04      Smile     Wear       Dress
Social          6.11      Watch     Join       Do
Cognition       5.21      Look at   See        Read
Possession      5.29      Give      Take       Have
None            5.04      Throw     Hold       Fly
Creation        5.69      Hit       Make       Do
Competition     5.19      Drive     Walk over  Point
Consumption     4.52      Use       Drink      Eat
Emotion         6.19      Draw      Startle    Feel
Weather         3.93      Shine     Blaze      Light up
Sentence Length and Word Frequency Some of the intuitive sentence difficulty measures are its length and the average frequency of its words. When sorting the data by difficulty (increasing sentence length or decreasing average word frequency), we find that all three methods have the same tendency to obtain a lower METEOR score as the difficulty increases. Fig. 8a shows the performance of the compared methods w.r.t. the sentence length. For the word frequency the correlation is even stronger, see Fig. 8b. Visual-Labels consistently outperforms the other two methods, most notably as the difficulty increases.
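The curves of Fig. 8 can be reproduced in spirit with a few lines of NumPy (a sketch; the word-frequency table and any details of the original plotting beyond the stated mean filter of size 500 are assumptions):

    import numpy as np

    def difficulty_curve(meteor_scores, sentences, word_freq, by="length", window=500):
        if by == "length":                       # short -> long sentences
            key = np.array([len(s.split()) for s in sentences])
        else:                                    # frequent -> rare words
            key = -np.array([np.mean([word_freq.get(w, 0) for w in s.split()])
                             for s in sentences])
        ordered = np.asarray(meteor_scores)[np.argsort(key, kind="stable")]
        kernel = np.ones(window) / window
        return np.convolve(ordered, kernel, mode="valid")  # smoothed METEOR curve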
5.5.2 Semantic Analysis

WordNet Verb Topics Next we analyze the test reference sentences w.r.t. verb semantics. We rely on WordNet Topics (high level entries in the WordNet ontology), e.g. "motion", "perception", defined for most synsets in WordNet (Fellbaum 1998). Sense information comes from our automatic semantic parser, thus it might be noisy. We showcase the 3 most frequent verbs for each Topic in Table 14. We select sentences with a single verb, group them according to the verb Topic and compute an average METEOR score for each Topic, see Fig. 9. We find that Visual-Labels is best for all Topics except "communication", where SMT-Best wins. The most frequent verbs there are "look up" and "nod", which are also frequent in the dataset and in the sentences produced by SMT-Best. The best performing Topic, "cognition", is highly biased to the verb "look at".


Fig. 9 Average METEOR score for WordNet verb Topics (SMT-Best, S2VT, Visual-Labels). Selected sentences with a single verb, number of sentences in brackets: motion (960), contact (562), perception (492), stative (346), change (235), communication (197), body (192), social (139), cognition (120), possession (101), none (80), competition (54), creation (47), consumption (33), emotion (29), weather (7). For discussion see Sect. 5.5.2

“motion” and “contact”, which are also visual (e.g. “turn”, the first phase of the challenge the participants could evaluate
“walk”, “sit”), are nevertheless quite challenging, which we the outputs of their system on the public test set. In the second
attribute to their high diversity (see their entropy w.r.t. dif- phase of the challenge the participants were provided with the
ferent verbs and their frequencies in Table 14). Topics with videos from the blind test set (without textual descriptions).
more abstract verbs (e.g. “be”, “have”, “start”) get lower These were used for the final evaluation. To measure per-
scores. formance of the competing approaches we performed both
automatic and human evaluation. The submission format was
Top 100 Best and Worst Sentences We look at 100 test ref- similar to the MS COCO Challenge (Chen et al. 2015) and
erence sentences, where Visual-Labels obtains highest and we also used the identical automatic evaluation protocol. The
lowest METEOR scores. Out of 100 best sentences 44 con- challenge winner was determined based on the human eval-
tain the verb “look” (including phrases such as “look at”). uation. In the following we review the participants and their
The other frequent verbs are “walk”, “turn”, “smile”, “nod”, results for both LSMDC 15 and LSMDC 16. As they share
“shake”, i.e. mainly visual verbs. Overall the sentences are the same public and blind test sets, as described in Sect. 3.3,
simple. Among the worst 100 sentences we observe more we can also compare the submissions to both challenges with
diversity: 12 contain no verb, 10 mention unusual words (spe- each other.
cific to the movie), 24 have no subject, 29 have a non-human
subject. This leads to a lower performance, in particular, as 6.1 LSMDC Participants
most training sentences contain “Someone” as subject and
generated sentences are biased towards it. We received 4 submissions to LSMDC 15, including our
Visual-Labels approach. The other submissions are S2VT
Summary (a) The test reference sentences that mention verbs (Venugopalan et al. 2015b), Temporal Attention (Yao et al.
like “look” get higher scores due to their high frequency 2015) and Frame-Video-Concept Fusion (Shetty and Laak-
in the dataset. (b) The sentences with more “visual” verbs sonen 2015). For LSMDC 16 we received 6 new submissions.
tend to get higher scores. (c) The sentences without verbs As the blind test set is not changed between LSMDC 2015
(e.g. describing a scene), without subjects or with non-human to LSMDC 2016, we look at all the submitted results jointly.
subjects get lower scores, which can be explained by dataset In the following we summarize the submissions based on
biases. the (sometimes very limited) information provided by the
authors.

6.1.1 LSMDC 15 Submissions


6 The Large Scale Movie Description Challenge
S2VT (Venugopalan et al. 2015b) Venugopalan et al. (2015b)
The Large Scale Movie Description Challenge (LSMDC) propose S2VT, an encoder–decoder framework, where a sin-
was held twice, first in conjunction with ICCV 2015 gle LSTM encodes the input video, frame by frame, and
(LSMDC 15) and then at ECCV 2016 (LSMDC 16). For the decodes it into a sentence. We note that the results to LSMDC
automatic evaluation we set up an evaluation server3 . During


We note that the results submitted to LSMDC were obtained with a different set of hyper-parameters than the results discussed in the previous section. Specifically, S2VT was optimized w.r.t. METEOR on the validation set, which resulted in significantly longer but also noisier sentences.

Frame-Video-Concept Fusion (Shetty and Laaksonen 2015) Shetty and Laaksonen (2015) evaluate diverse visual features as input for an LSTM generation framework. Specifically they use dense trajectory features (Wang et al. 2013) extracted for the entire clip and VGG (Simonyan and Zisserman 2015) and GoogleNet (Szegedy et al. 2015) CNN features extracted at the center frame of each clip. They find that training 80 concept classifiers on MS COCO with the CNN features, combined with dense trajectories, provides the best input for the LSTM.

Temporal Attention (Yao et al. 2015) Yao et al. (2015) propose a soft-attention model based on Xu et al. (2015a) which selects the most relevant temporal segments in a video, incorporates a 3-D CNN and generates a sentence using an LSTM.

6.1.2 LSMDC 16 Submissions

Tel Aviv University This submission retrieves a nearest neighbor from the training set, learning a unified space using Canonical Correlation Analysis (CCA) over textual and visual features. For the textual representation it relies on the Word2Vec representation using a Fisher Vector encoding with a Hybrid Gaussian-Laplacian Mixture Model (Klein et al. 2015), and for the visual representation it uses an RNN Fisher Vector (Lev et al. 2015), encoding video frames with the 19-layer VGG.

Aalto University (Shetty and Laaksonen 2016) Shetty and Laaksonen (2016) rely on an ensemble of four models which were trained on the MSR-VTT dataset (Xu et al. 2016) without additional training on the LSMDC dataset. The four models were trained with different combinations of key-frame based GoogLeNet features and segment based dense trajectory and C3D features. A separately trained evaluator network was used to predict the result of the ensemble.

Seoul NU This work relies on temporal and attribute attention.

SNUVL (Yu et al. 2016b) Yu et al. (2016b) first learn a set of semantic attribute classifiers. To generate a description for a video clip, they rely on Temporal Attention and attention over semantic attributes.

IIT Kanpur This submission uses an encoder–decoder framework with 2 LSTMs, one LSTM used to encode the frame sequence of the video and another to decode it into a sentence.

VD-ivt (BUPT CIST AI lab) According to the authors, their VD-ivt model consists of three parallel channels: a basic video description channel, a sentence-to-sentence channel for language learning, and a channel to fuse visual and textual information.

6.2 LSMDC Quantitative Results

We first discuss the submissions w.r.t. the automatic measures and then discuss the human evaluations, which determined the winners of the challenges.

6.2.1 Automatic Evaluation

We first look at the results of the automatic evaluation on the blind test set of LSMDC in Table 15. In the first edition of the challenge, LSMDC 15, our Visual-Labels approach obtains the highest scores in all evaluation measures except BLEU-1,-2, where S2VT wins. One reason for the lower scores of Frame-Video-Concept Fusion and Temporal Attention appears to be the generated sentence length, which is much smaller compared to the reference sentences, as we discuss below (see also Table 16). When extending to the LSMDC 16 submissions, we observe that most approaches perform below S2VT/Visual-Labels, except for VD-ivt, which achieves METEOR 8.0. Surprisingly, but confirmed with the authors, VD-ivt predicts only a single sentence, "Someone is in the front of the room.", which seems to be optimized w.r.t. the METEOR score, while e.g. the CIDEr score shows that this sentence is not good for most video clips. While most approaches generate novel descriptions, Tel Aviv University is the only retrieval-based approach among the submissions. It takes second place w.r.t. the CIDEr score, while not achieving particularly high scores in other measures.


Table 15 Automatic evaluation on the blind test set of the LSMDC, in %; for discussion see Sect. 6.2

Approach                                                   BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  ROUGE  CIDEr  SPICE
Submissions to LSMDC 15
Visual-Labels (ours)                                       16.1    5.2     2.1     0.9     7.1     16.4   11.2   13.2
S2VT (Venugopalan et al. 2015b)                            17.4    5.3     1.8     0.7     7.0     16.1   9.1    11.4
Frame-Video-Concept Fusion (Shetty and Laaksonen 2015)     11.0    3.4     1.3     0.6     6.1     15.6   9.0    13.4
Temporal Attention (Yao et al. 2015)                       5.6     1.5     0.6     0.3     5.2     13.4   6.2    14.3
Submissions to LSMDC 16
Tel Aviv University                                        14.5    4.1     1.4     0.6     5.8     13.4   10.1   7.7
Aalto University (Shetty and Laaksonen 2016)               6.9     1.6     0.5     0.2     3.4     7.0    3.5    2.6
Seoul NU                                                   9.2     2.9     1.0     0.4     4.0     9.6    7.6    4.8
SNUVL (Yu et al. 2016b)                                    15.6    4.4     1.4     0.4     7.1     14.7   7.0    11.5
IIT Kanpur                                                 11.8    3.6     1.3     0.5     7.4     14.2   4.7    7.2
VD-ivt (BUPT CIST AI lab)                                  15.9    4.3     1.0     0.3     8.0     15.0   4.8    10.6

Bold values indicate the best performing approach per measure/column for LSMDC 2015, and LSMDC 2016, if it improved over LSMDC 2015

Table 16 Description statistics for different methods and reference sentences on the blind test set of the LSMDC; for discussion see Sect. 6.2

Approach                                                   Avg. sentence length  Vocabulary size  % Unique sentences  % Novel sentences
Submissions to LSMDC 15
Visual-Labels (ours)                                       7.47                  525              45.11               66.76
S2VT (Venugopalan et al. 2015b)                            8.77                  663              30.17               72.10
Frame-Video-Concept Fusion (Shetty and Laaksonen 2015)     5.16                  401              9.09                30.81
Temporal Attention (Yao et al. 2015)                       3.63                  117              1.39                6.48
Submissions to LSMDC 16
Tel Aviv University                                        9.34                  5530             58.35               0.00
Aalto University (Shetty and Laaksonen 2016)               6.83                  651              24.39               94.09
Seoul NU                                                   6.16                  459              24.26               52.78
SNUVL (Yu et al. 2016b)                                    8.53                  756              41.54               76.03
IIT Kanpur                                                 16.2                  1172             39.37               100.00
VD-ivt (BUPT CIST AI lab)                                  8.00                  7                0.01                100.00
Reference                                                  8.75                  6820             97.19               92.63

We analyze the outputs of the compared approaches more closely in Table 16, providing detailed statistics over the generated descriptions. Among the LSMDC 15 submissions, with respect to the sentence length, Visual-Labels and S2VT demonstrate similar properties to the reference descriptions, while the approaches Frame-Video-Concept Fusion and Temporal Attention generate much shorter sentences (5.16 and 3.63 words on average vs. 8.74 of the references). In terms of vocabulary size all approaches fall far below the reference descriptions. This large gap indicates a problem in that all the compared approaches focus on a rather small set of visual and language concepts, ignoring the long tail of the distribution. The number of unique sentences confirms the previous finding, showing slightly higher numbers for Visual-Labels and S2VT, while the other two tend to frequently generate the same description for different clips. Finally, the percentage of novel sentences (not present among the training descriptions) highlights another aspect, namely the amount of novel vs. retrieved descriptions. As we see, all the methods "retrieve" some amount of descriptions from the training data, while the approach Temporal Attention produces only 7.36% novel sentences. Looking at the LSMDC 16 submissions, we see, not surprisingly, that the Tel Aviv University retrieval approach achieves the highest diversity among all approaches. Most other submissions have statistics similar to the LSMDC 15 submissions. Interestingly, Shetty and Laaksonen (2016) generate many novel sentences, as they are not trained on LSMDC, but on the MSR-VTT dataset. Two outliers are IIT Kanpur, which generates very long and noisy descriptions, and VD-ivt, which, as mentioned above, generates the same sentence for all video clips.

6.2.2 Human Evaluation

We performed separate human evaluations for LSMDC 15 and LSMDC 16.

LSMDC 15 The results of the human evaluation are shown in Table 17. The human evaluation was performed over 1,200 randomly selected clips from the blind test set of LSMDC. We follow the evaluation protocol defined in Sect. 5.3.2. As known from the literature (Chen et al. 2015; Elliott and Keller 2013; Vedantam et al. 2015), automatic evaluation measures do not always agree with the human evaluation.


Table 17 Human evaluation on the blind test set of the LSMDC; human eval ranked 1–5, lower is better; for discussion see Sect. 6.2

Approach                                                   Correctness  Grammar  Relevance  Helpful for blind
Visual-Labels (ours)                                       3.32         3.37     3.32       3.26
S2VT (Venugopalan et al. 2015a)                            3.55         3.09     3.53       3.42
Frame-Video-Concept Fusion (Shetty and Laaksonen 2015)     3.10         2.70     3.29       3.29
Temporal Attention (Yao et al. 2015)                       3.14         2.71     3.31       3.36
Reference                                                  1.88         3.13     1.56       1.57

Bold values indicate the best performing approach per measure/column

Table 18 LSMDC 16; human evaluation; ratio of sentences which are judged better or equal compared to the reference description, with at least two out of three judges agreeing (in %); for discussion see Sect. 6.2

Approach                                                   Better or equal than reference
Submissions to LSMDC 15
Visual-Labels (ours)                                       18.8
S2VT (Venugopalan et al. 2015b)                            15.6
Frame-Video-Concept Fusion (Shetty and Laaksonen 2015)     15.2
Temporal Attention (Yao et al. 2015)                       16.8
Submissions to LSMDC 16
Tel Aviv University                                        22.4
Aalto University (Shetty and Laaksonen 2016)               16.4
Seoul NU                                                   14.4
SNUVL (Yu et al. 2016b)                                     8.8
IIT Kanpur                                                  7.2
VD-ivt (BUPT CIST AI lab)                                   1.6

Bold value indicates the best performing approach in the table

Here we see that human judges prefer the descriptions from the Frame-Video-Concept Fusion approach in terms of correctness, grammar and relevance. In our alternative evaluation, in terms of being helpful for the blind, Visual-Labels wins. A possible explanation is that under this evaluation criterion human judges penalized errors in the descriptions less and rather looked at their overall informativeness. In general, the gap between the different approaches is not large. Based on the human evaluation, the winner of the LSMDC 15 challenge is the Frame-Video-Concept Fusion approach of Shetty and Laaksonen (2015).
LSMDC 16 For LSMDC 16 the evaluation protocol is different from the one above. As we have to compare more approaches, ranking becomes unfeasible. Additionally, we would like to capture the human agreement in this evaluation. This leads us to the following evaluation protocol, which is inspired by the human evaluation metric "M1" in the MS COCO Challenge (Chen et al. 2015). The humans are provided with randomized pairs (reference, generated sentence) from each system and asked to decide, in terms of being helpful for a blind person, (a) if sentence 1 is better, (b) both are similar, or (c) sentence 2 is better. Each pair is judged by 3 humans. For an approach to get a point, at least 2 out of 3 humans should agree that the generated sentence is better than or equal to the reference. The results of the human evaluation on 250 randomly selected sentence pairs are presented in Table 18.
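The resulting score per approach is then simply the fraction of pairs won under this 2-out-of-3 agreement; a small sketch (with hypothetical judgment labels) of the computation:

    def better_or_equal_ratio(judgments):
        # judgments: one triple of judge decisions per pair, each decision in
        # {"better", "similar", "worse"} for the generated sentence vs. the reference.
        wins = sum(1 for triple in judgments
                   if sum(d in ("better", "similar") for d in triple) >= 2)
        return 100.0 * wins / len(judgments)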
Tel Aviv University is ranked best by the human judges and thus wins the LSMDC 16 challenge. Visual-Labels takes the second place, followed by Temporal Attention and Aalto University. The VD-ivt submission with identical descriptions is ranked worst. Additionally, we measure the correlation between the automatic and human evaluation in Fig. 10. We compare BLEU@4, METEOR, CIDEr and SPICE and find that the CIDEr score provides the highest and a reasonable (0.61) correlation with human judgments. SPICE shows no correlation, and METEOR demonstrates a negative correlation. We attribute this to the fact that the approaches generate very different types of descriptions (long/short, simple/retrieved from the training data, etc.), as discussed above, and that we only have a single reference to compute these metrics. While we believe that these metrics can still provide reasonable scores for similar models, comparing very diverse methods and results requires human evaluation. However, also for human evaluation, further studies are needed in the future to determine the best evaluation protocols.
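Such a correlation can be computed per metric over the per-system scores, e.g. as below (a sketch with SciPy; the paper does not state which correlation coefficient Fig. 10 uses, so Pearson is assumed here):

    from scipy.stats import pearsonr

    def metric_human_correlation(auto_scores, human_scores):
        # auto_scores / human_scores: one value per system, e.g. CIDEr from
        # Table 15 vs. the "better or equal" percentage from Table 18.
        r, p_value = pearsonr(auto_scores, human_scores)
        return r, p_value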


Fig. 10 LSMDC 16: We plot the correlation between the human evaluation score (x axis) and 4 automatic measures (y axis)

Approach: Sentence

Visual-Labels (ours): Someone lies on the bed.
S2VT: Someone lies asleep on his bed.
Frame-Video-Concept Fusion: Someone lies on the bed.
Temporal Attention: Someone lies in bed.
Reference: Someone lies on her side facing her new friend.

Visual-Labels (ours): Someone sits down.
S2VT: Someone sits on the couch and looks at the tv.
Frame-Video-Concept Fusion: Someone sits at the table.
Temporal Attention: Someone looks at someone.
Reference: Someone takes a seat and someone moves to the stove.

Visual-Labels (ours): Someone walks to the front of the house.
S2VT: Someone looks at the house.
Frame-Video-Concept Fusion: Someone walks up to the house.
Temporal Attention: Someone looks at someone.
Reference: Someone sets down his young daughter then moves to a small wooden table.

Visual-Labels (ours): Someone turns to someone.
S2VT: Someone looks at someone.
Frame-Video-Concept Fusion: Someone turns to someone.
Temporal Attention: Someone stands alone.
Reference: Someone dashes for the staircase.

Visual-Labels (ours): Someone takes a deep breath and takes a deep breath.
S2VT: Someone looks at someone and looks at him.
Frame-Video-Concept Fusion: Someone looks up at the ceiling.
Temporal Attention: Someone stares at someone.
Reference: Someone digs out her phone again, eyes the display, and answers the call.

Fig. 11 Qualitative comparison of our approach Visual-Labels, S2VT (Venugopalan et al. 2015b), Frame-Video-Concept Fusion (Shetty and Laaksonen 2015) and Temporal Attention (Yao et al. 2015) on the blind test set of the LSMDC. Discussion see Sect. 6.3

6.3 LSMDC Qualitative Results

Figure 11 shows qualitative results from the competing approaches submitted to LSMDC 15. The first two examples are success cases, where most of the approaches are able to describe the video correctly. The third example is an interesting case where visually relevant descriptions, provided by most approaches, do not match the reference description, which focuses on an action happening in the background of the scene ("Someone sets down his young daughter then moves to a small wooden table."). The last two rows contain partial and complete failures. In one, all approaches fail to recognize the person running away, only capturing the "turning" action which indeed happened before running. In the other, all approaches fail to recognize that the woman interacts with the small object (phone).
Figure 12 compares all LSMDC 15 approaches with the LSMDC 16 winner, Tel Aviv University, on a sequence of 5 consecutive clips. We can make the following observations from these examples. Although Tel Aviv University is a retrieval-based approach, it does very well in many cases, providing an added benefit of fluent and grammatically correct descriptions. One side-effect of retrieval is that when it fails, it produces a completely irrelevant description, e.g. in the second example. Tel Aviv University and Visual-Labels are able to capture important details, such as sipping a drink, which the other methods fail to recognize. Descriptions generated by Visual-Labels and S2VT tend to be longer and noisier than the ones by Frame-Video-Concept Fusion and Temporal Attention, while Temporal Attention tends to produce generally applicable sentences, e.g. "Someone looks at someone".


Approach: Sentence

Visual-Labels (ours): Someone takes a seat on the table and takes a seat on his desk.
S2VT: Someone looks at someone and smiles.
Frame-Video-Concept Fusion: Someone looks at someone.
Temporal Attention: Someone gets up.
Tel Aviv University: Farther along, the mustached stranger sits on a bench.
Reference: Later, someone sits with someone and someone.

Visual-Labels (ours): Someone gets out of the car and walks off.
S2VT: Someone walks up to the front of the house.
Frame-Video-Concept Fusion: Someone walks up to the front door.
Temporal Attention: Someone gets out of the car.
Tel Aviv University: He sees a seated man on the TV gesturing.
Reference: Now someone steps out of the carriage with his new employers.

Visual-Labels (ours): Someone walks up to the street, and someone is walking to the other side of.
S2VT: Someone walks over to the table and looks at the other side of the house.
Frame-Video-Concept Fusion: Someone walks away.
Temporal Attention: Someone gets out of the car.
Tel Aviv University: Later smiling, the two walk hand in hand down a busy sidewalk noticing every hat-wearing man they pass.
Reference: The trio starts across a bustling courtyard.

Visual-Labels (ours): Someone sips his drink.
S2VT: Someone sits at the table and looks at someone.
Frame-Video-Concept Fusion: Someone sits up.
Temporal Attention: Someone looks at someone.
Tel Aviv University: Someone sits at a table sipping a drink.
Reference: As the men drink red wine, someone and someone watch someone take a sip.

Visual-Labels (ours): Someone takes a bite.
S2VT: Someone sits at the table.
Frame-Video-Concept Fusion: Someone looks at someone.
Temporal Attention: Someone looks at someone.
Tel Aviv University: Later at the dinner table.
Reference: Someone tops off someone's glass.

Fig. 12 Qualitative comparison of our approach Visual-Labels, S2VT (Venugopalan et al. 2015b), Frame-Video-Concept Fusion (Shetty and Laaksonen 2015), Temporal Attention (Yao et al. 2015), and Tel Aviv University on 5 consecutive clips from the blind test set of the LSMDC. Discussion see Sect. 6.3

7 Conclusion

In this work we present the Large Scale Movie Description Challenge (LSMDC), a novel dataset of movies with aligned descriptions sourced from movie scripts and ADs (audio descriptions for the blind, also referred to as DVS). Altogether the dataset is based on 200 movies and has 128,118 sentences with aligned clips. We compare AD with previously used script data and find that AD tends to be more correct and relevant to the movie than script sentences.
Our approach to automatic movie description, Visual-Labels, trains visual classifiers and uses their scores as input to an LSTM. To handle the weak sentence annotations we rely on three ingredients. (1) We distinguish three semantic groups of labels (verbs, objects, and places). (2) We train them separately, removing the noisy negatives. (3) We select only the most reliable classifiers. For sentence generation we show the benefits of exploring different LSTM architectures and learning configurations.
To evaluate different approaches for movie description, we organized a challenge at ICCV 2015 (LSMDC 15) where we evaluated submissions using automatic and human evaluation criteria. We found that the approaches S2VT and our Visual-Labels generate longer and more diverse descriptions than the other submissions but are also more susceptible to content or grammatical errors. This consequently leads to worse human rankings with respect to correctness and grammar. In contrast, Frame-Video-Concept Fusion wins the challenge by predicting medium length sentences with intermediate diversity, which are rated best in the human evaluation for correctness, grammar, and relevance. When ranking sentences with respect to the criterion "helpful for the blind", our Visual-Labels is well received by human judges, likely because it includes important aspects provided by the strong visual labels. Overall, all approaches have problems with the challenging long-tail distributions of our data. Additional training data cannot fully ameliorate this problem because a new movie might always contain novel parts. We expect new techniques, including relying on different modalities, see e.g. Hendricks et al. (2016), to overcome this challenge.
The second edition of our challenge (LSMDC 16) was held at ECCV 2016. This time we introduced a new human evaluation protocol to allow comparison of a large number of approaches. We found that the best approach in the new evaluation with the "helpful for the blind" criterion is a retrieval-based approach from Tel Aviv University.


Likely, human judges prefer the rich while also grammatically correct descriptions provided by this method. In future work the movie description approaches should aim to achieve rich yet correct and fluent descriptions. Our evaluation server will continue to be available for automatic evaluation.
Our dataset has already been used beyond description, e.g. for learning video-sentence embeddings or for movie question answering. Beyond our current challenge on single sentences, the dataset opens new possibilities to understand stories and plots across multiple sentences in an open domain scenario on a large scale.

Acknowledgements Open access funding provided by Max Planck Society. Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD).

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References

Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). Spice: Semantic propositional image caption evaluation. In European conference on computer vision (ECCV).
Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The berkeley framenet project. In Proceedings of the annual meeting of the association for computational linguistics (ACL).
Ballas, N., Yao, L., Pal, C., & Courville, A. (2016). Delving deeper into convolutional networks for learning video representations. In International conference on learning representations (ICLR).
Barbu, A., Bridge, A., Burchill, Z., Coroian, D., Dickinson, S., Fidler, S. et al. (2012). Video in sentences out. In Proceedings of the conference on uncertainty in artificial intelligence (UAI).
Bojanowski, P., Bach, F., Laptev, I., Ponce, J., Schmid, C., & Sivic, J. (2013). Finding actors and actions in movies. In International conference on computer vision (ICCV).
Bojanowski, P., Lajugie, R., Bach, F., Laptev, I., Ponce, J., Schmid, C., & Sivic, J. (2014). Weakly supervised action labeling in videos under ordering constraints. In European conference on computer vision (ECCV).
Bruni, M., Uricchio, T., Seidenari, L., & Del Bimbo, A. (2016). Do textual descriptions help action recognition? In Proceedings of the ACM on multimedia conference (MM), pp. 645–649.
Chen, D., & Dolan, W. (2011). Collecting highly parallel data for paraphrase evaluation. In Proceedings of the annual meeting of the association for computational linguistics (ACL).
Chen, X., & Zitnick, C. L. (2015). Mind's eye: A recurrent visual representation for image caption generation. In Conference on computer vision and pattern recognition (CVPR).
Chen, X., Fang, H., Lin, T.-Y., Vedantam, R., Gupta, S., Dollár, P., & Zitnick, C. L. (2015). Microsoft coco captions: Data collection and evaluation server. arXiv:1504.00325.
Cour, T., Jordan, C., Miltsakaki, E., & Taskar, B. (2008). Movie/script: Alignment and parsing of video and text transcription. In European conference on computer vision (ECCV).
Cour, T., Sapp, B., Jordan, C., & Taskar, B. (2009). Learning from ambiguously labeled images. In Conference on computer vision and pattern recognition (CVPR).
Das, D., Martins, A. F. T., & Smith, N. A. (2012). An exact dual decomposition algorithm for shallow semantic parsing with constraints. In Proceedings of the annual meeting of the association for computational linguistics (ACL).
Das, P., Xu, C., Doell, R., & Corso, J. (2013). Thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Conference on computer vision and pattern recognition (CVPR).
de Melo, G., & Tandon, N. (2016). Seeing is believing: The quest for multimodal knowledge. SIGWEB Newsletter, (Spring). doi:10.1145/2903513.2903517.
Del Corro, L., & Gemulla, R. (2013). Clausie: Clause-based open information extraction. In Proceedings of the international world wide web conference (WWW).
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Conference on computer vision and pattern recognition (CVPR).
Denkowski, M., & Lavie, A. (2014). Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation.
Devlin, J., Cheng, H., Fang, H., Gupta, S., Deng, L., He, X. et al. (2015). Language models for image captioning: The quirks and what works. In Proceedings of the annual meeting of the association for computational linguistics (ACL).
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K. et al. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Conference on computer vision and pattern recognition (CVPR).
Duchenne, O., Laptev, I., Sivic, J., Bach, F., & Ponce, J. (2009). Automatic annotation of human actions in video. In International conference on computer vision (ICCV).
Elliott, D., & Keller, F. (2013). Image description using visual dependency representations. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp. 1292–1302.
Everingham, M., Sivic, J., & Zisserman, A. (2006). "Hello! My name is... Buffy"—Automatic naming of characters in tv video. In Proceedings of the british machine vision conference (BMVC).
Fang, H., Gupta, S., Iandola, F. N., Srivastava, R., Deng, L., Dollár, P. et al. (2015). From captions to visual concepts and back. In Conference on computer vision and pattern recognition (CVPR).
Farhadi, A., Hejrati, M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J. et al. (2010). Every picture tells a story: Generating sentences from images. In European conference on computer vision (ECCV).
Fellbaum, C. (1998). WordNet: An electronic lexical database. Cambridge: The MIT Press.
Gagnon, L., Chapdelaine, C., Byrns, D., Foucher, S., Heritier, M., & Gupta, V. (2010). A computer-vision-assisted system for videodescription scripting. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (CVPR workshops).
Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T. et al. (2013). Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In International conference on computer vision (ICCV).
Hendricks, L. A., Venugopalan, S., Rohrbach, M., Mooney, R., Saenko, K., & Darrell, T. (2016). Deep compositional captioning: Describing novel object categories without paired training data. In Conference on computer vision and pattern recognition (CVPR).
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.


Hoffman, J., Guadarrama, S., Tzeng, E., Donahue, J., Girshick, R., Darrell, T., & Saenko, K. (2014). LSDA: Large scale detection through adaptation. In Conference on neural information processing systems (NIPS).
Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Conference on computer vision and pattern recognition (CVPR).
Kipper, K., Korhonen, A., Ryant, N., & Palmer, M. (2006). Extending verbnet with novel verb classes. In Proceedings of the international conference on language resources and evaluation (LREC).
Kiros, R., Salakhutdinov, R., & Zemel, R. (2014). Multimodal neural language models. In International conference on machine learning (ICML).
Kiros, R., Salakhutdinov, R., & Zemel, R. S. (2015). Unifying visual-semantic embeddings with multimodal neural language models. Transactions of the Association for Computational Linguistics (TACL), 14, 595–603.
Klein, B., Lev, G., Sadeh, G., & Wolf, L. (2015). Associating neural word embeddings with deep image representations using fisher vectors. In Conference on computer vision and pattern recognition (CVPR).
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N. et al. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the annual meeting of the association for computational linguistics (ACL).
Kojima, A., Tamura, T., & Fukunaga, K. (2002). Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision (IJCV), 50(2), 171–184.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Conference on neural information processing systems (NIPS).
Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A. C., & Berg, T. L. (2011). Baby talk: Understanding and generating simple image descriptions. In Conference on computer vision and pattern recognition (CVPR).
Kuznetsova, P., Ordonez, V., Berg, A. C., Berg, T. L., & Choi, Y. (2012). Collective generation of natural image descriptions. In Proceedings of the annual meeting of the association for computational linguistics (ACL).
Kuznetsova, P., Ordonez, V., Berg, T. L., & Choi, Y. (2014). Treetalk: Composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics (TACL).
Lakritz, J., & Salway, A. (2006). The semi-automatic generation of audio description from screenplays. Technical report, Department of Computing, University of Surrey.
Laptev, I., Marszalek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In Conference on computer vision and pattern recognition (CVPR).
Lev, G., Sadeh, G., Klein, B., & Wolf, L. (2015). RNN fisher vectors for action recognition and image annotation. In European conference on computer vision (ECCV).
Li, G., Ma, S., & Han, Y. (2015). Summarization-based video caption via deep neural networks. In Proceedings of the 23rd annual ACM conference on multimedia.
Li, S., Kulkarni, G., Berg, T. L., Berg, A. C., & Choi, Y. (2011). Composing simple image descriptions using web-scale N-grams. In Proceedings of the fifteenth conference on computational natural language learning (CoNLL). Association for Computational Linguistics.
Li, Y., Song, Y., Cao, L., Tetreault, J., Goldberg, L., Jaimes, A., & Luo, J. (2016). TGIF: A new dataset and benchmark on animated GIF description. In Conference on computer vision and pattern recognition (CVPR).
Liang, C., Xu, C., Cheng, J., & Lu, H. (2011). Tvparser: An automatic tv video parsing method. In Conference on computer vision and pattern recognition (CVPR).
Lin, C.-Y. (2004). Rouge: A package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL-04 workshop, pp. 74–81.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D. et al. (2014). Microsoft coco: Common objects in context. In European conference on computer vision (ECCV).
Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., & Yuille, A. (2015). Deep captioning with multimodal recurrent neural networks (m-rnn). In International conference on learning representations (ICLR).
Marszalek, M., Laptev, I., & Schmid, C. (2009). Actions in context. In Conference on computer vision and pattern recognition (CVPR).
Mitchell, M., Dodge, J., Goyal, A., Yamaguchi, K., Stratos, K., Han, X. et al. (2012). Midge: Generating image descriptions from computer vision detections. In Proceedings of the conference of the European chapter of the association for computational linguistics (EACL).
Ordonez, V., Kulkarni, G., & Berg, T. L. (2011). Im2text: Describing images using 1 million captioned photographs. In Conference on neural information processing systems (NIPS).
Over, P., Awad, G., Michel, M., Fiscus, J., Sanders, G., Shaw, B., Smeaton, A. F., & Quénot, G. (2012). Trecvid 2012—An overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2012. NIST, USA.
Pan, P., Xu, Z., Yang, Y., Wu, F., & Zhuang, Y. (2016a). Hierarchical recurrent neural encoder for video representation with application to captioning. In Conference on computer vision and pattern recognition (CVPR).
Pan, Y., Mei, T., Yao, T., Li, H., & Rui, Y. (2016b). Jointly modeling embedding and translation to bridge video and language. In Conference on computer vision and pattern recognition (CVPR).
Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the annual meeting of the association for computational linguistics (ACL).
Ramanathan, V., Joulin, A., Liang, P., & Fei-Fei, L. (2014). Linking people in videos with “their” names using coreference resolution. In European conference on computer vision (ECCV).
Regneri, M., Rohrbach, M., Wetzel, D., Thater, S., Schiele, B., & Pinkal, M. (2013). Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics (TACL), 1, 25–36.
Rohrbach, A., Rohrbach, M., Qiu, W., Friedrich, A., Pinkal, M., & Schiele, B. (2014). Coherent multi-sentence video description with variable level of detail. In Proceedings of the German conference on pattern recognition (GCPR).
Rohrbach, A., Rohrbach, M., & Schiele, B. (2015a). The long-short story of movie description. arXiv:1506.01698.
Rohrbach, A., Rohrbach, M., & Schiele, B. (2015b). The long-short story of movie description. In Proceedings of the German conference on pattern recognition (GCPR).
Rohrbach, A., Rohrbach, M., Tandon, N., & Schiele, B. (2015c). A dataset for movie description. In Conference on computer vision and pattern recognition (CVPR).
Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., & Schiele, B. (2013). Translating video content to natural language descriptions. In International conference on computer vision (ICCV).
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Salway, A. (2007). A corpus-based analysis of audio description. In Media for all: Subtitling for the deaf, audio description and sign language (pp. 151–174).


Salway, A., Lehane, B., & O’Connor, N. E. (2007). Associating characters with events in films. In Proceedings of the ACM international conference on image and video retrieval (CIVR).
Schuler, K. K., Korhonen, A., & Brown, S. W. (2009). Verbnet overview, extensions, mappings and applications. In Proceedings of the conference of the North American chapter of the association for computational linguistics (NAACL).
Shetty, R., & Laaksonen, J. (2015). Video captioning with recurrent networks based on frame- and video-level features and visual content classification. arXiv:1512.02949.
Shetty, R., & Laaksonen, J. (2016). Frame- and segment-level features and candidate pool evaluation for video caption generation. In Proceedings of the ACM on multimedia conference (MM), pp. 1073–1076.
Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations (ICLR).
Sivic, J., Everingham, M., & Zisserman, A. (2009). “Who are you?”-Learning person specific classifiers from video. In Conference on computer vision and pattern recognition (CVPR).
Socher, R., Karpathy, A., Le, Q. V., Manning, C. D., & Ng, A. Y. (2014). Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics (TACL), 2, 207–218.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Conference on computer vision and pattern recognition (CVPR).
Tandon, N., de Melo, G., De, A., & Weikum, G. (2015). Knowlywood: Mining activity knowledge from hollywood narratives. In Proceedings of CIKM.
Tapaswi, M., Baeuml, M., & Stiefelhagen, R. (2012). “Knock! Knock! Who is it?” Probabilistic person identification in TV-series. In Conference on computer vision and pattern recognition (CVPR).
Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., & Fidler, S. (2016). Movieqa: Understanding stories in movies through question-answering. In Conference on computer vision and pattern recognition (CVPR).
Thomason, J., Venugopalan, S., Guadarrama, S., Saenko, K., & Mooney, R. J. (2014). Integrating language and vision to generate natural language descriptions of videos in the wild. In Proceedings of the international conference on computational linguistics (COLING).
Torabi, A., Pal, C., Larochelle, H., & Courville, A. (2015). Using descriptive video services to create a large data source for video annotation research. arXiv:1503.01070v1.
Torabi, A., Tandon, N., & Sigal, L. (2016). Learning language-visual embedding for movie understanding with natural-language. arXiv:1609.08124.
Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL ’03: Proceedings of the 2003 conference of the North American chapter of the association for computational linguistics on human language technology. Association for Computational Linguistics.
Vedantam, R., Zitnick, C. L., & Parikh, D. (2015). Cider: Consensus-based image description evaluation. In Conference on computer vision and pattern recognition (CVPR).
Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., & Saenko, K. (2015a). Sequence to sequence—video to text. arXiv:1505.00487v2.
Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., & Saenko, K. (2015b). Sequence to sequence—video to text. In International conference on computer vision (ICCV).
Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., & Saenko, K. (2015c). Translating videos to natural language using deep recurrent neural networks. In Proceedings of the conference of the North American chapter of the association for computational linguistics (NAACL).
Venugopalan, S., Hendricks, L. A., Mooney, R., & Saenko, K. (2016). Improving LSTM-based video description with linguistic knowledge mined from text. arXiv:1604.01729.
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Conference on computer vision and pattern recognition (CVPR).
Wang, H., & Schmid, C. (2013). Action recognition with improved trajectories. In International conference on computer vision (ICCV).
Wang, H., Kläser, A., Schmid, C., & Liu, C. L. (2013). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision (IJCV), 103(1), 60–79.
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). Sun database: Large-scale scene recognition from abbey to zoo. In Conference on computer vision and pattern recognition (CVPR).
Xu, J., Mei, T., Yao, T., & Rui, Y. (2016). Msr-vtt: A large video description dataset for bridging video and language. In Conference on computer vision and pattern recognition (CVPR).
Xu, K., Ba, J., Kiros, R., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015a). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (ICML).
Xu, R., Xiong, C., Chen, W., & Corso, J. J. (2015b). Jointly modeling deep video and compositional text to bridge vision and language in a unified framework. In Conference on artificial intelligence (AAAI).
Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., & Courville, A. (2015). Describing videos by exploiting temporal structure. In International conference on computer vision (ICCV).
Yao, L., Ballas, N., Cho, K., Smith, J. R., & Bengio, Y. (2016). Empirical performance upper bounds for image and video captioning. In International conference on learning representations (ICLR).
Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics (TACL), 2, 67–78.
Yu, H., Wang, J., Huang, Z., Yang, Y., & Xu, W. (2016a). Video paragraph captioning using hierarchical recurrent neural networks. In Conference on computer vision and pattern recognition (CVPR).
Yu, Y., Ko, H., Choi, J., & Kim, G. (2016b). Video captioning and retrieval models with semantic attention. arXiv:1610.02947.
Zeng, K.-H., Chen, T.-H., Niebles, J. C., & Sun, M. (2016). Title generation for user generated videos. In European conference on computer vision (ECCV).
Zhong, Z., & Ng, H. T. (2010). It makes sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of the ACL 2010 system demonstrations.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In Conference on neural information processing systems (NIPS).
Zhu, L., Xu, Z., Yang, Y., & Hauptmann, A. G. (2015a). Uncovering temporal context for video question and answering. arXiv:1511.04670.
Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015b). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In International conference on computer vision (ICCV).



