1 s2.0 S0925231220308110 Main

Neurocomputing 405 (2020) 247–258
Contents lists available at ScienceDirect
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Reducing human efforts in video segmentation annotation with

reinforcement learning
} rincz a,⇑
Viktor Varga a, András Lo
a
Faculty of Informatics, Eötvös Loránd University, Pázmány Péter sétány 1/A, 1117 Budapest, Hungary
a r t i c l e i n f o a b s t r a c t
Article history: Manual annotation of video segmentation datasets requires an immense amount of human effort, thus,
Received 26 September 2019 reduction of human annotation costs is an active topic of research. While many papers deal with the
Revised 18 December 2019 propagation of masks through frames of a video, only a few results attempt to optimize annotation task
Accepted 22 February 2020
selection. In this paper we present a deep learning based solution to the latter problem and train it using
Available online 8 May 2020
Reinforcement Learning. Our approach utilizes a modified version of the Dueling Deep Q-Network shar-
Communicated by Junwei Han
ing weight parameters across the temporal axis of the video. This technique enables the trained agent to
select annotation tasks from the whole video. We evaluate our annotation task selection method by
Keywords:
Video segmentation
means of a hierarchical supervoxel segmentation based mask propagation algorithm.
Interactive annotation Ó 2020 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://
Reinforcement learning creativecommons.org/licenses/by/4.0/).
1. Introduction In this paper, we attempt to learn annotation task selection

policies for human annotators to reach optimal segmentation qual-
Over the last few years, many large scale computer vision data- ity from a given budget of human effort in dense video segmenta-
sets [1,2] were released catalyzing the evolution of research areas, tion labeling. To this end, we present a reinforcement learning
like image classification and object localization in images. How- based solution. We evaluate our method on the DAVIS and Seg-
ever, video datasets with dense, pixelwise segmentation annota- Track datasets and compare it to other approaches in a quantitative
tion are still rare and most of them are small scale or lacking manner. Our contribution is a modification of the Deep-Q-Network
variations. The recently published Youtube-VOS [3] dataset is the architecture which learns superior annotation task selection policy
first large-scale video segmentation annotation dataset, but the compared to the state-of-the-art. The software code and the
majority of clips included in it do not exceed a few seconds in trained model will be released1.
length. The Densely Annotated VIdeo Segmentation (DAVIS) [4,5]
dataset is a very popular database containing close to a hundred
short clips. Many previous papers evaluate different models on 2. Related work
the SegTrack dataset [6]. SegTrack consists of six very short
sequences. 2.1. Video object segmentation
A key factor in the production of annotated video datasets is the
budget for human annotation costs. Although numerous video Approaches in video segmentation can be categorized into
propagation techniques automatically generate annotation for either (semi-) supervised or unsupervised methods. Supervised
the whole sequence from a few annotated images only [7,8], in and semi-supervised methods are usually trained on part of the
case of dense, pixelwise annotation, automatically propagated target dataset or pretrained on closely related datasets in terms
masks degrade quickly, thus human supervision is essential. Also, of content and image features. A majority of recent results utilize
the cost of human annotation for drawing bounding boxes or deep learning as well [7,9–11]. However, depending on the imple-
points in an image are orders of magnitude lower than the costs mentation, deep learning based image segmentation models may
of providing exact pixelwise labeling. Consequently, reducing the easily become semantically biased and might exhibit significant
need for human effort in video segmentation annotation is highly performance loss when evaluated on datasets of different nature.
relevant. Unsupervised methods include the partitioning of images or videos
to superpixels, or supervoxels respectively [12].
⇑ Corresponding author.
}rincz).
E-mail address: [email protected] (A. Lo 1
https://github.com/vvarga90/rlseg
https://doi.org/10.1016/j.neucom.2020.02.127
0925-2312/Ó 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
248 }rincz / Neurocomputing 405 (2020) 247–258
V. Varga, A. Lo
Stutz et al. [13] present a comprehensive analysis and a set of 3. Method

benchmarks for the superpixel literature. Chang et al. [14] solve
temporal propagation of superpixels by modeling flow vectors of In our work, we utilize deep learning to reveal potential errors
regions with bilateral Gaussian processes. Grundmann et al. [15] in segmentation and label predictions based on image features,
utilize a minimum spanning tree algorithm to build a tree of voxels optical flow estimation and changes in occlusion. We formulate
to obtain a hierarchical segmentation. Although these classical the learning in the Markov Decision Process (MDP) framework, in
superpixel/supervoxel methods are surpassed by deep learning order to find cost effective strategies in annotation task selection
based solutions in terms of performance, still, they are less suscep- for the human annotator. We employ a hierarchical supervoxel
tible to overfit to high-level semantic content of the training data. based segmentation label prediction algorithm to predict a com-
We note that hierarchical superpixel/supervoxel algorithms are plete annotation of the video from partial user input and use it
ideal to form the base of interactive segmentation annotation, with our task selection agent to achieve a significant reduction of
since they provide image/video partitioning at an arbitrary level human effort in the dense annotation of videos.
of granularity. Fig. 1 depicts our novel scheme that reduces human effort dur-
ing interactive video segmentation annotation. Below, we outline
2.2. Interactive segmentation our proposed procedure. We carry out the three steps repeatedly.
1. Annotation prediction: Initially, ‘background’ labels are
Numerous papers discuss the potential of hybrid annotation assigned to all pixels. After correcting annotation is added by the
systems where human and machine work together to speed up user, the label propagation algorithm re-estimates its label predic-
the annotation process. A well known example is the GrabCut algo- tions for the whole video. Details of this step are elaborated in
rithm [16] which performs image segmentation by means of Section 3.1.
energy minimization and iterative corrections of the user. Several 2. Agent selects the next annotation task: State representation
works aim to exploit temporal context in image sequence annota- of the MDP is made up of features extracted from the video and the
tion. Nagaraja et al. [17] compute temporal point trajectories from updated annotation prediction. Given the state, the RL agent
optical flow and assign partial labeling to trajectories intersecting selects the next annotation task to be given to the human annota-
user inputs in the form of scribbles. They estimate dense segmen- tor: the index of the next frame to annotate (f t ) and a hierarchical
tation labels using energy minimization. level (ht ) corresponding to the underlying hierarchical supervoxel
Some results deal with the selection of the most rewarding segmentation.
annotation task. Fathi et al. [18] compute a superpixel-graph based 3. Annotation correction by user: The user is presented the
representation of videos and estimates a complete labeling from level ht superpixel segmentation of frame f t . They are asked to
partial annotations by finding the harmonic solution. They define make corrections of the predicted superpixel labels in the given
an uncertainty measure to select the next annotation task for the frame: the user may change the label of each superpixel by clicking
user, iteratively. Vijayanarasimhan and Grauman [19] analyze sev- on it. The user continues to make corrections in the frame until
eral algorithms based on optical flow and Markov Random Fields to they consider the superpixel labeling correct. The annotation pro-
propagate partial annotations, estimate propagation errors tocol is detailed in Section 3.2. Examples of label correction are
between frames and select the frame with the most expected shown in Fig. 2.
improvement to be the next annotation task. Luo et al. [20] gener- In the process described above, the annotation prediction algo-
ate segmentation proposals and estimate the label change caused rithm is responsible for the estimation of a complete dense anno-
by correctly annotating them. They search for an optimal query tation of the whole video by propagating the partial annotations
strategy to annotate proposals using energy minimization. and corrections provided by the user. The RL agent is trained to
select optimal annotation tasks for the human at any time, by esti-
2.3. Reinforcement learning mating the scale of potential discrepancies between the predicted
annotation and image-derived features.
Reinforcement learning (RL) has been gaining strength in recent
years due to the successful joining of deep networks and RL by
3.1. Hierarchical supervoxel based segmentation propagation
Mnih et al. [21]. Beyond the popular RL benchmark Atari game
environment, numerous computer vision problems have been
We formulate superpixel and supervoxel hierarchies. To start
tackled by RL. These include, but are not limited to, video content
with, it is important to clarify that we refer to two independent con-
summarization [22], object localization [23], tracking [8,24] and
cepts of time in the following sections. We differentiate between
video object segmentation [25].
the temporal axis of a video (corresponding symbols include f ; F,
Several papers deal with the optimization of human-computer
etc.) and the iterations of the RL framework (with corresponding
collaboration via the interactive annotation of images or videos
symbols t; T, etc.). A video can be represented as X ¼ ðP1 ; . . . ; PF Þ
using RL. Russakovsky et al. [26] selects tasks in a crowdsourced
– a 3-dimensional grid, where Pf stands for the representation of
image annotation setting. Song et al. [27] improve the GrabCut
frame f, a single image – a 2-dimensional grid of pixels.
algorithm by generating some of the seed points with a convolu-
We define the hierarchical supervoxel segmentation as
tional net based Deep-Q-Network (DQN) agent. Acuna et al. [28] n o
evaluate their image segmentation network in a human-in-the- V ¼ V 0 ; V 1 ; . . . ; V H , where V h is the level h hierarchical segmen-
loop setup, too. Rudovic et al. [29] integrate reinforcement learning tation of the video. The supervoxel segmentation of a video in a
into an active learning framework. Their agent decides whether single hierarchical level can be then formulated as V h ¼
labeling of a data point is requested from the user or not. Similarly, n o
our approach utilizes reinforcement learning to select annotation v h1 ; v h2 ; . . . ; v hNh , a partition of the set of voxels in the video. We
tasks, but our agent is not restricted to proceed through each frame also define superpixel hierarchies in the following way:
in a sequential order. On the contrary, the modified Deep-Q- U f ;h ¼ V h \ Pf denotes the level h superpixel segmentation of
Network presented in this paper, estimates state-action values frame with index f. Similarly to supervoxels, we define a superpixel
for each frame at once. Therefore, the selection of each annotation segmentation as a partition of the set of pixels in frame
n o
task is carried out with a single inference phase of our network. For
f : U f ;h ¼ u1f ;h ; uf2;h ; . . . ; ufM;hf ;h .
details, see Section 3.4.
}rincz / Neurocomputing 405 (2020) 247–258
V. Varga, A. Lo 249
Fig. 1. Overview of our method. 1. A complete labeling of the video is predicted from the partial labeling provided by the user in past iterations. 2. RL agent selects the next
annotation task for the user. 3. User executes the assigned task by correcting mislabeled segments of the label predictions.
Fig. 2. Label propagation exploiting hierarchical supervoxel structure. The two rows show label propagation to different frames of the video after user input was given at a
high (top row) and at a lower hierarchical level (bottom row) in frame 20 of the soapbox DAVIS sequence. The first column contains the actual click points of the simulated
user (zoomed in). Note, that the number of click points form the cost in our procedure. Human work may be spared by allowing annotation with scribbles, as shown in the
second column. The remaining two columns to the right display the effect of propagating the user annotation to frames 22 and 80. It is clearly visible that user input in higher
hierarchical levels comes with significantly lower cost and is propagated further in time. However, lower hierarchical level inputs increase granularity of the propagation
locally. Best viewed in color.
The hierarchical nature of the superpixel and supervoxel seg- tively merges adjacent spatio-temporal regions using a sorted glo-
0
mentations is formulated in the following way: let h; h be bal edge weight list. The edge weights are a function of color and
0 optical flow histogram distances. The result of GBH is a minimum
indices of hierarchical levels, with h P h. Then,
F F
9k1 ; . . . ; km : v hj ¼ v hki , where denotes the disjoint union spanning tree constructed from the region adjacency graph. Opti-
0
i¼1;...;m cal flow is estimated with Flownet 2 [31].

of the family of its sets. Similarly for superpixels: Since GBH was designed to handle pixel-level adjacency graphs,
0 F an obvious choice would be to obtain the base level segmentation
9k1 ; . . . ; kn : ufj ;h ¼ ufk;h
i
for any frame f. We will refer to pixel
i¼1;...;n of the video with this algorithm as well. However, we found that
level segmentation as segmentation level 0. The distinction the low-level segmentations computed by CCS are much better
between pixels and voxels is important, since our algorithm is in terms of boundary accuracy at similar supervoxel counts. Never-
based on supervoxels, whereas user actions concern superpixels theless, the exact implementation of the hierarchical supervoxel
of a frame chosen by the RL agent. However, to be able to use method utilized is indifferent to the main thesis of our paper:
superpixels and supervoxels in the same framework, we handle the configuration presented here may be replaced by other hierar-
superpixels as 3D regions in X with an extent of 1 frames along chical supervoxel segmentation techniques.
the temporal axis. Therefore, a supervoxel or superpixel is an CCS is parametrized by the approximate number of superpixels
ancestor of another supervoxel or superpixel if the former one in a single frame. We set this parameter to 2048, for which super-
contains the latter one as a subset. voxel counts in the resulting video segmentation are between 3000
In our approach, we use Contour-Constrained Superpixels (CCS) and 20,000 depending on the length and the amount of changes in
by Lee et al. [30] to acquire a base level supervoxel segmentation of the video. In the following sections of the paper, we will refer to
the video. We build a hierarchical segmentation of the video by the voxel level segmentation of the video (where each voxel forms
applying the method proposed in Efficient Hierarchical Graph- an individual segment) as hierarchical level 0. We define hierarchi-
Based Video Segmentation (GBH) by Grundmann et al. [15] on the cal level 1 as the base level segmentation produced by CCS. Higher
output of the CCS segmentation method. The GBH algorithm itera- hierarchical segmentation levels are created from level 1 segmen-
V. Varga, A. Lo
tation by the iterative merging procedure with GBH. We refer to label of the most frequent pixel label found in the given superpixel
hierarchical level h P 2 as the segmentation given by the iterative (i.e. plurality voting). In case this does not hold for a superpixel, the
merging algorithm stopped at segment count jV h j ¼ bjV h1 j=cc. The simulated human annotator corrects the label. We utilize the
number of supervoxels at hierarchical level i is denoted by jV i j. We ground truth labels in order to simulate user input. However,
found c ¼ 3:7 to be ideal for the DAVIS and SegTrack datasets. Note real-life human annotators cannot be expected to perform label
that the ideal value of this constant is highly dependent on the selection based on plurality voting of pixel labels with absolutely
temporal extent of supervoxels present in the base level zero error. For this reason, we also carry out experiments simulat-
segmentation. ing the noisy selection of superpixel labels in Section 4.4 and show
Next, we formulate the labeling of a video. We follow the label- that our approach is robust to minor human errors.
ing conventions of the DAVIS dataset: our label set contains a sin- In several works dealing with human segmentation annota-
gle background label LBG and multiple foreground labels. Let tions, other ways of human input are investigated, such as drawing
scribbles to mark coherent regions in one action [17,32]. While
L ¼ LBG ; L1 ; . . . ; LjLj1 be the set of labels. Our task is to give a
results show that in some cases segmentation annotation using
complete labeling of the video, by assigning a label to each pixel:
scribbles requires less effort, we chose to stick with simple mouse
l : V 0 ! L. clicks instead, which may serve quantitative comparisons in the
Annotations already given by the user in an episode can be for-
n o future.
malized as a set of triplets: K ¼ f t ; ht ; lf t ;ht : U f t ;ht ! L . As
described in the previous section, the user is assigned the annota- 3.3. Markov decision process
tion task ðf t ; ht Þ by the RL agent, for which the user provides the
labeling of all superpixels in level ht of frame f t : the lf t ;ht mapping. In our work, an annotation episode is modelled with a Markov
The annotation propagation algorithm can be given in the follow- Decision Process (MDP). The MDP is defined by states st 2 S,
ing way: actions at 2 A, the state transition function P : S A S ! R
Let ufj ;0 be a pixel in frame f. If no supervoxel ancestor of the and the reward function R : S A ! R. In each step, the agent
given pixel exist that contains user-labeled superpixels as a subset, receives the current state, which stores downsampled pixelwise
video features, the current state of annotation and the time index
then l ufj ;0 ¼ LBG . Otherwise, let v hk be the lowest level ancestor of
in the episode, t. The agent then takes an action, selecting the sub-
ufj ;0 which contains a user-labeled superpixel as a subset. In this sequent annotation task to be queried from the user. The action is
case, we choose the temporally nearest of such labelled superpix- represented by ðf t ; ht Þ where f t is the frame index of the video and
0 0
0 0 ht is the hierarchical level in which the user carries out the labeling
els. Let this superpixel be ufi ;h . Therefore l ufj ;0 ¼ l ufi ;h , in other
task. In other words, the algorithm may ask for annotations at
words we assign the label of the selected superpixel to the pixel specific levels in the hierarchy. During the training process, user
ufj ;0 being discussed. The effects of propagating user input given labeling is computed from the ground truth segmentation labels
in different hierarchical levels are shown in Fig. 2. in accordance with the human annotation protocol, detailed in
the previous subsection. Finally, the complete labeling of the video
3.2. Human annotation protocol is estimated from the partial annotations already present. The pre-
dicted annotation is compared to the ground truth labeling using
The human annotator is asked to make corrections to the pre- the mean Jaccard-index (a.k.a. intersection-over-union or IoU).
dicted annotation at a given frame f t , in a given hierarchical level
ht . The annotator may carry out the label corrections by modifying 3.4. Model architecture
the assigned labels to each superpixel at level ht if they are
believed to be mislabeled. We decided to simulate the role of the To predict state-action values for each frame and each hierar-
human annotator during the training and evaluation of our chical level, we use a somewhat modified version of the Dueling
approach. The protocol of our simulated human annotator can be Double Deep Q-Network (DDDQN) [33]. Since we intend to learn
specified in the following way: each superpixel should carry the to approximate the same Q-values independently from the actual
Fig. 3. Architecture of the modified Dueling Deep Q-Network. F designates the length of the temporal dimension of the video, i.e. the number of frames: the action-advantage
prediction branch of the network shares its weights along this axis. C; C 0 ; C V ; C A denote the number of image channels. H stands for the number of actions available in each
frame. The index of time step (t) in the MDP episode is fed into both branches of the network.
V. Varga, A. Lo 251
temporal position in the video, we utilize layers with translation while H stands for the number of actions per frame, i.e. for the
invariance along the temporal axis in the action-advantage estima- number of hierarchical levels. H is chosen to be 4 in our experi-
tion branch of the network. While the state value prediction ments. The structure of our action-advantage prediction branch
branch does not share weights along the temporal axis, we only bears a resemblance to the Deep Recurrent Q-Network described
use this branch during the training of the network, as having the in [34], but in our case all weights that take part in the estimation
action-advantage estimation Aðs; aÞ is enough to select the optimal of Aðs; aÞ are shared along the frames of the video. see Fig. 4.
action. This procedure enables us to use the weights of the trained The state value estimation branch of the network is similar to
model to initialize differently shaped models to work with video the original DDDQN, featuring two 3D max-pool and one 3D con-
clips of arbitrary length. The architecture of our modified Deep volutional layer. The following two layers are ordinary fully con-
Q-network can be seen in Fig. 3. The input of our model is the state nected layers with ReLU activations.
representation, which contains multiple channels of 3D pixel Additionally, both branches of the network receive the index of
(voxel) data. Consequently, at the front part of our model, multiple the time step in the episode as input.
3D convolutional and max-pooling layers are placed alternately. The predictions of the two branches: Aðs; aÞ and V ðsÞ are merged
The pooling layers of the shared, frontal part of the network only following [33]:
reduce the dimensionality along the spatial axes and leave the !
temporal axis untouched. The pooling layers in the action- 1 X
Q ðs; aÞ ¼ V ðsÞ þ Aðs; aÞ Aðs; a0 Þ ð1Þ
advantage prediction branch are of similar nature. The output of jAj a0 2A
the last pooling layer in this branch is reshaped to two axes, keep-
ing the temporal dimension unchanged. The following layer is a
bidirectional LSTM layer having recurrent connections along the 3.5. State representation
temporal axis of the video. This layer provides long-term informa-
tion propagation in time. The outputs of the bidirectional recurrent The state consists of image features and information about the
layer are concatenated for each frame independently. The final progression of the annotation process. The state comprises the
layer is fully connected within individual frames of the video and index of the current time step in the MDP episode and features
learns the same transformations with tied weights along the tem- derived from the video in the shape of F X Y C, where F
poral axis. ReLU activations are placed after each convolutional denotes the number of frames in the video, X; Y denote the size
layer. The output of the action-advantage prediction branch is of the downscaled frame, while C designates the number of image
shaped F H, where F denote the number of frames in the video, channels. For subsequent episodes, F may vary, since all weights of
Fig. 4. Qualitative results on the DAVIS breakdance and soapbox sequences showing error in label prediction and correcting user input after 1, 2, 3, 4, 5 and 10 iterations in
frames selected for user annotation. Colored dots represent user clicks. Error rates for both sequences were above the dataset average. Note, that the number of click points
form the cost in our procedure. Human work may be spared by allowing annotation with scribbles, as shown in Fig. 2. Progression of labeling in consecutive frames of the
sequences are shown in Figs. 5 and 6. Best viewed in color.
V. Varga, A. Lo
our neural network are shared along this dimension. X and Y are between rewarding the improvement of segmentation accuracy
chosen to be 32 and 56 respectively. During our experiments C (improvement in Rexp ) and penalizing annotation cost (Costt ).
was 9. The details of the individual Ic image channels are the (
following: 1; if frame index f t has already been selected by the agent
R¼
I0 ; I1 ; I2 ; I3 store the downscaled optical flow images: dx; dy com- werr Rexp ðMt ; GÞ Rexp ðM t1 ; GÞ wac Cost
j ;
t
otherwise;
ponents for forward and backward flow, respectively. We obtained ð4Þ
the optical flow predictions from Flownet v2. I4 ; I5 contain the bin-
ary images of the estimated changes in occlusion computed from where
the optical flow predictions. A pixel is considered occluded/disoc- expxq 1 Jmean ðM; GÞ
cluded when the forward/backward optical flow have no vectors Rexp ðM; GÞ ¼ ; x¼ : ð5Þ
expq 1 Jmean G0 ; G
pointing to that pixel location or near it. I6 contains the boundaries
between similarly labeled regions. A pixel is considered a boundary In the definition of the reward function, Jmean stands for the
pixel, if it shares a side with at least one differently labeled pixel. mean Jaccard-index (also known as the mean intersection-over-
The image channels specified above, may give the neural network union for pairs of masks for each label present). M t ; G denote the
information about the scale of potential deterioration during anno- predicted mask for the whole video in time step t of the MDP epi-
tation propagation, even when annotation input is not yet present sode and the ground truth mask respectively. Costt stands for the
in the temporal vicinity of a certain frame. annotation cost of the selected annotation task in time step t.
In case the video lacks fast movements and quick changes in The reward for a given improvement in segmentation accuracy
occlusion, the annotation propagation may go on successfully for increases exponentially as accuracy gets close to the upper limit.
several seconds. Although the temporal recurrent connections in With this, we model the exponential increase in human efforts,
the action-advantage prediction branch of the network enable as the labeling quality improves during the annotation process.
long-term temporal information propagation, still, the introduc- The exponential dependence can be modulated by parameter q.
tion of features that encode the distance of user annotation In our studies, typical q values are between 3 and 10. Depending
improved our results. Hence, we specify channels I7 and I8 as on the border accuracy and the average temporal extent of the
follows. supervoxel segmentation, the maximum achievable Jmean and
I7 encodes the hierarchical distance dhier of a pixel from user the annotation costs may greatly vary between videos. The expo-

annotations. Let dhier ufj ;0 be the level of the lowest level ancestor nential nature of the reward function further amplifies the vari-
ance in expected rewards which impacts the stability of the
supervoxel of ufj ;0 which contains user labelled superpixels as a training. To fight off the instability, we introduce G0 and j which
subset. may differ from video to video as detailed below. G0 stands for
Finally, I8 encodes the level 1 temporal distance dtemp from user the predicted mask when all frames of the video are annotated at

annotations. Let dtemp ufj ;0 be the temporal distance of the tempo- the base hierarchical level (level 1). Thus, Jmean G0 ; G determines
the minimum achievable annotation error for a specific video.
rally closest labelled level 1 superpixel from the ufj ;0 pixel which
The role of x is to normalize Rexp to 1 when the minimum achiev-
share a level 1 supervoxel ancestor.
able error is reached for a specific video. Similarly, j designates the
number of supervoxels in the base level segmentation (level 1). As
3.6. Training the agent with deep reinforcement learning
mentioned before, the number of supervoxels in CCS generated
segmentations vary significantly depending on the magnitude of
To train an agent, we utilize the modified DDDQN described in
movement in the videos and j compensates for this.
Section 3.4. Our network learns to estimate the discounted sum of
the remaining rewards in the episode, i.e. the Q-values, for each
possible action. The action space can be factorized into two inde- 4. Experiments and evaluations
pendent dimensions: index of frame and the hierarchical level of
the annotation task. The training of the network is executed in a In this section, first, we present details about the training of our
similar fashion to ordinary Deep Q-Networks with the Double network. Later, we train and evaluate our method on the DAVIS
b targetNN ðs; aÞ and Q
b onlineNN ðs; aÞ denote datasets and compare it to the semi-supervised state-of-the-art.
DQN technique [35]. Let Q
Additionally, we evaluate our trained model on the unseen Seg-
the Q values estimated by the target network and the network
Track dataset and compare its accuracy to the state-of-the-art in
being trained. The Q-learning target and the DQN loss function
interactive video segmentation as well. Furthermore, we compare
can be expressed with s; a; s0 as:
the policy learned by the reinforcement learning agent to baseline
b targetNN s0 ; arg min a0 Q
Q ðs; aÞ ¼ r ðs; aÞ þ c Q b onlineNN ðs0 ; a0 Þ ð2Þ policies to validate our RL based approach. Finally, we simulate
potential mistakes made by human annotators and show that
2 our approach is robust to minor errors in following the annotation
b ðs; aÞ
J ðhÞ ¼ E Q ðs; aÞ Q ð3Þ protocol appropriately. We use the mean Jaccard-index to measure
our prediction for a given annotation budget.
where h denotes the parameters of the networks.
The gradient backpropagation is analogous to that of regular 4.1. Details of the training and the prediction process
Deep Q-Networks: gradients are propagated backwards only from
the output corresponding to selected action a. The network esti- We train our network on the training set of the DAVIS 2017
mates a state-action value for each action, from which an e- dataset. We split the training set into two parts, containing 40
greedy policy selects the action to be taken. To avoid training our and 20 videos respectively. We use the second part as a validation
network with correlated batches, we use an experience replay buf- set while training on the first part of the dataset. The subset of 20
fer to store past experiences of the agent: sampling is prioritized by videos of the DAVIS training set we use for validation should not be
the magnitude of the TD-error [36]. confused with the DAVIS validation set, which we use only for final
The reward function we use was inspired by the exponential evaluation purposes. We initialize a network with an input video
reward function suggested by Song et al. [27]. We aim to balance length F ¼ 40 and use random slices of videos for each episode
V. Varga, A. Lo 253
as the network input. The training process starts with an experi- Table 1
ence collection phase of 20,000 steps using a random policy with- Comparison of our method with state-of-the-art semi-supervised methods on the
DAVIS validation sets. Binary and multiclass segmentation results were reached on
out training the network. Following this, the network is trained the DAVIS 2016 and 2017 validation sets respectively. Mean Jaccard-index was used
while generating experiences. The policy is parametrized with an as metric. Note, that all cited methods receive the first frame ground truth annotation
e value gradually decreasing from 0:5 to a minimal value of 0:2. as input, while our method continuously improve the prediction using approximate
The discout rate c is set to 0:9. Episode length is fixed to 20 anno- annotations from user input. The budget for our method is given in number of clicks
per total frame count of all videos.
tation task selection iterations. The network is trained using an
RMSprop [37] optimizer with a learning rate set to 5 104 . The Method Binary Multiclass
segmentation segmentation
weights of the target network are updated with a soft update rate
(mean J) (mean J)
of 104 .
PReMVOS [38] 0.849 0.739
The number of hierarchical levels used (H) was set to 4. The fol- OSVOS-S [39] 0.856 0.647
lowing reward parameters were found to be optimal: OnAVOS [40] 0.861 0.616
werr ¼ 10; wac ¼ 70; q ¼ 8. Further analysis of these parameter val- CINM [41] 0.834 0.672
ues can be found in Section 4.3. OSVOS [7] 0.798 0.566
While the training process is carried out using random slices of Ours, 1 clicks per fr. budget 0.661 0.594
videos with fixed length, the prediction phase does not have such Ours, 2 clicks per fr. budget 0.710 0.685
Ours, 5 clicks per fr. budget 0.793 0.781
restriction on the input. During inference, the branch of the net-
Ours, 10 clicks per fr. budget 0.852 0.837
work responsible for the state-value estimation can be removed.
The remaining part of the network shares all its weights along
the temporal axis of the video, therefore the temporal extent of
the input is only limited by hardware capacities.
We measured the execution time of our method using a GTX to speed up the annotation process. Although, labeling multiple
1080 graphics card and a Xeon E5-2620 v4 CPU. While the prepro- adjacent regions by a single mouse gesture is indeed time-
cessing steps to generate the hierarchical segmentation and to efficient, the lack of a formal definition of a scribble makes compar-
compute the optical flow may take several seconds per frame, isons hard. A few approaches try to quantify annotation costs by
these computations can be done in advance. During the actual measuring the number of frames annotated in a video. However,
interactive annotation procedure, the human annotator have to in practice, the accurate pixelwise labeling of a frame could take
wait after completing the labeling of a frame, before being assigned several hours, while for the majority of applications, reaching
the consecutive task. The waiting time in this case is on the order 100% IoU is needless. Therefore, annotation costs of labeling a
of a few hundred milliseconds. The graphics card could load video frame may vary several magnitudes depending on the required
clips of more than 3000 frames at once. Longer videos may be sim- level of accuracy.
ply cut to slices of appropriate length. To facilitate the comparison of our approach with future results,
we chose an annotation technique popular in image segmentation
since the introduction of the GrabCut algorithm [16]: we provide
4.2. Interactive segmentation results hints (or seeds) to the segmentation algorithm by means of sug-
gested mouse clicks in images. The number of mouse clicks needed
4.2.1. Comparison with the state-of-the-art to reach certain accuracy levels is an unambigous measure of
First, we compare our method quantitatively to the state-of- human effort in annotation tasks.
the-art on the semi-supervised benchmarks of the DAVIS dataset We note that none of the papers available on interactive seg-
in Table 1. We cannot offer a straightforward comparison between mentation feature the metric described above, we compare our
our method and semi-supervised methods, due to the underlying results to the state-of-the-art measuring annotation costs by
methodological differences: the semi-supervised protocol of the counting the number of annotated frames. We evaluated our
DAVIS benchmarks reveal the ground truth labeling of the first method on the SegTrack dataset in order to study database depen-
frame to the evaluated methods, while our method iteratively dence when training on one set and testing on both. Results can be
receives approximate corrections of the predicted labels, without seen in Table 2. Videos in the SegTrack dataset have relatively low
ever obtaining the exact ground truth labeling of a full frame. Still, resolution and some sequences, e.g. cheetah, girl and monkeydog
some comparisons seem worth as they give insights about the use- contain a significant amount of compression related artifacts,
fulness of our method. which may give rise to erroneous object boundary prediction for
In Table 1, we show our results in two benchmarks. The values some algorithms and may spoil performance.
in the binary segmentation task correspond to the mean Jaccard- Qualitative results of interactive annotation episodes for binary
index of the predicted binary masks over all frames of the DAVIS and multilabel annotation tasks are shown in Fig. 5 and Fig. 6
2016 validation set. The multiclass segmentation task refers to respectively. Ground truth masks are shown in the leftmost col-
the evaluation of the models on the DAVIS 2017 validation set umns. Columns to the right show the progression of the improve-
using the mean Jaccard-index over all frames of the dataset and ment of predicted labels throughout multiple frames of the video.
with all the different labels present in each frame. Our results In all illustrations presented in Figs. 5 and 6 prediction error rate
improve as the budget for user annotation actions increases. was above the dataset average.
Clearly, state-of-the-art, deep convolutional network based semi-
supervised methods learn quickly to segment a single moving
object in front of a more or less static background. Nonetheless, 4.2.2. Evaluation of learned policy
their performance plummets as multiple moving objects appear In the previous section, we compared our method to other
with individual labels. On the contrary, our method presents an approaches in terms of the quality of the obtained segmentation
approximately similar accuracy both for binary and multilabel seg- annotation. However, utilizing a random task selection policy to
mentation tasks. We attribute this to the main difference in our assign annotation tasks to the human participant also leads to ade-
training scheme compared to other approaches. In our case, we quate annotation quality after a few iterations. In view of this, we
use a trained model to estimate the label prediction error, while compare the policy learnt by the RL agent to baseline policies on
the label prediction itself is computed by a black box algorithm. the DAVIS 2017 validation set.
V. Varga, A. Lo
Table 2
Results on the unseen SegTrack dataset after a specific number of frames were annotated by the user. Note, that in contrast to [20], our method is optimized on the cost of
annotating each frame as well. Annotation costs are given in number of clicks per total frame count of all videos.
SegTrack sequence [20] (5 fr.) Ours (5 fr.) Ours (10 fr.) Ours (20 fr.)
birdfall 0.593 0.711 0.729 0.747
cheetah 0.502 0.348 0.448 0.711
girl 0.739 0.463 0.537 0.671
monkeydog 0.454 0.631 0.729 0.774
parachute 0.821 0.897 0.898 0.901
penguin – 0.941 0.946 0.947
Mean IoU w/o penguin 0.604 0.651 0.711 0.782
Mean IoU – 0.701 0.751 0.810
Annot. cost per frame – 2.028 4.282 8.545
Fig. 5. Qualitative results on videos with above average error from the DAVIS 2016 dataset. Rows represent frames number 0, 40 and 80 of the breakdance sequence and
frames number 0, 15, 30 of the motocrossjump sequence in this order. First column contains the ground truth masks in the specific frames, while following columns contain
the predicted masks after 1, 2, 5 and 10 iterations of user annotation inputs. Best viewed in color.
The results of the comparison are shown in Fig. 7. Annotation ht , we evaluated all possible monotonically decreasing policies
costs are given in number of clicks per total number of frames in up to an episode length of 20 steps on a subset of the DAVIS
the dataset to reach a specific level of predicted label quality. 2017 training set combined with the frame selection policy
Random policy selects ðf t ; ht Þ annotation tasks randomly, but specified above. We evaluated those performing the best, again,
avoids redundant queries: in case a certain frame was already on the full DAVIS 2017 training set to acquire the final policy.
annotated in hierarchical level h, queries targeting this frame with The hierarchical level selection policy giving the best results
0
hierarchical level h P h are ignored. In Fig. 7, the average results was combined with the frame selection strategy mentioned
for 10 such random policies are shown. before produced Best constant policy. Results are depicted in
Best constant policy selects frame f t to be the most distant Fig. 7.
from all previously queried frames, the beginning and the MRF baseline policy is generated by the Markov Random Field
end of the video. We experimented with other frame selection (MRF) and is driven by an active learning approach. Our baseline
strategies. We found those to be inferior, including sequential measure is derived from the paper of Luo et al. [20]. We made
equidistant sampling and a keyframe based frame selection the cited method compatible with our hierarchical supervoxel
policy where keyframes were chosen according to the Keyframe framework by modifying their approach at a few points. We con-
policy in [19]. To obtain a strategy to select hierarchical level tinue to use the notation defined in Section 3.
V. Varga, A. Lo 255
Fig. 6. Qualitative results on videos with above average error from the DAVIS 2017 dataset. Rows represent frames number 0, 15 and 30 of the shooting sequence and frames
number 0, 40, 80 of the soapbox sequence in this order. First column contains the ground truth masks in the specific frames, while following columns contain the predicted
masks after 1, 2, 5 and 10 iterations of user annotation inputs. Best viewed in color.
Fig. 7. Comparison of the learned policy with baseline policies on the DAVIS 2017 validation set. The figure shows annotation costs required to reach certain Jmean levels with
the specific task selection policy. Annotation costs are given in number of clicks per total number of frames in all videos of the dataset. Lower values are better.
0
To predict the expected effect of the user labeling of a certain receiving label l when annotated by the user. ELC ð f ; hÞ is the
frame at a certain hierarchical level, we utilize the Expected Label expected label change by annotating frame f at hierarchical level
Change defined in Eq. 3. in Ref. [20] as follows. h. We compute ELC ð f ; hÞ over k superpixels sampled randomly
X from the segmentation. In our baseline we also include the esti-
0 0
ELC ð f ; hÞ ¼ P l ufj ;h ¼ l C l ufj ;h l ð6Þ mation of the cost of annotating a frame at a specific hierarchical
f ;h
n j;o
j21...jU level. This cost is given by the sum of the estimated probabilities
f ;h
l0 2Ln l uj that the user will change the labels of the sampled superpixels.
We normalize the cost with the sum of the areas of the sampled

0 superpixels. Eventually, we maximize the expected label change
Here, C l ufj ;h l denotes the total pixelwise label change in
0
divided by the estimated cost of the task to get the most efficient
the video in case superpixel ufj ;h was assigned label l . labeling task. Algorithm 1 summarizes the task selection

0
P l ufj ;h ¼ l denotes the estimated probability of the superpixel procedure.
V. Varga, A. Lo
First, we experimented with modifications to the reinforcement

Algorithm 1:MRF baseline policy: Annotation task selection learning setup. Setting either of the x or j parameters of the
reward function to 1 had a significant effect on the results: the high
variances of the total expected sum of rewards from video to video
destabilized the training process and the model failed to converge.
Removing the explicit penalty of 1 for repetitive actions had a
major effect on the length of the training: although the training
converged, but during evaluations, the learned model often fell
into a cycle of selecting repetitive actions after 5 to 10 steps of
the evaluation episode.
Next, we tried to use different configurations of image channels
in our MDP state representation. As expected, removing any of the
image channels encoding the state of the current annotation, i.e.
removing I7 or I8 , prevents the learned policy to surpass the perfor-
mance of the MRF baseline policy detailed in Section 4.2.2. Remov-
ing time index t from the state representation decreased
performance on a similar scale. The removal of optical flow and
occlusion data from the state representation increased the annota-
tion costs needed to reach a mean Jaccard-index of 0:8 by 17% and
5% respectively, while removing both resulted in a more signifi-
We construct a graph from the superpixel segmentation: each cant increase of 25% in the annotation costs.
node represents a level 1 superpixel. Superpixels of consecutive Besides these ablation studies, we conducted experiments to
frames belonging to the same level 1 supervoxel are connected measure the effect of the addition of color information to the state
by edges. Neighboring superpixels in the same frame are also con- representation. Downscaled color information was added to the
nected. In the definition of the energy function, we follow [20] (see state representation in RGB and CIELAB color formats, but the
Eqs. 5, 7 and 8 in [20]): unary potentials are derived from predicted results did not improve. Replacing optical flow or information
probabilities of Gaussian Mixture Model (GMM) based classifiers. about the occlusion with color data caused minor deterioration.
Individual GMMs are trained to estimate distribution of CIELAB A state representation with a higher, 112 64 spatial resolution
color space histograms in each user-labelled frame. GMMs for was also tested. While the results were approximately similar to
unlabeled frames are computed by interpolating between GMM the smaller representation of 56 32 spatial resolution, the size
models of the nearest annotated frames (see Eq. 5. in [20]). We also of the experience replay buffer was now limited by our hardware
use the smoothing term as given by Eq. 8 in Ref. [20]. We minimize capacities and made experiments with the larger model impracti-
the energy function with the help of graph cuts. The baseline algo- cal. An increased granularity of the base level supervoxel segmen-
rithm was implemented with the help of the gco library [42–44]. tation is expected to demonstrate the advantage of the high
Fig. 8 shows the estimated action-advantage values by our resolution representation.
model at different time steps during an annotation episode on We found that a higher number of hierarchical levels did not
the unseen DAVIS soapbox sequence. The figure demonstrates that improve the results. For H > 5, the agent tended to completely
the network learnt to extract meaningful information from the omit the selection of specific hierarchical levels. Modifying the
structure of the video. At t ¼ 0 the estimated maximum advantage ratio of reward parameters werr ; wac will lead to different priorities
corresponds to the frame where the soapbox cart passes in front of in the learned policy. A higher werr pushes the agent towards
the camera in the video. At this point the apparent size of the cart selecting lower level, more laborous tasks in the beginning.
is the largest and a high level, approximate labeling can be propa-
gated in both directions across the entire sequence effectively. In
the rightmost plot, it can be seen that seven of the first nine 4.4. Simulating human experiments
selected tasks were located in the second part of the video, where
the intense movement of the camera spoiled the quality of the pre- As mentioned earlier in the paper, we trained and evaluated the
dicted border quickly. network with simulated human input. Details of the human anno-
tation protocol can be found in Section 3.2. Although, this protocol
4.3. Ablation studies is quite straightforward, we expect the human annotator may
make mistakes. To test the robustness of our approach to human
We performed an ablation study to analyze the relevance of error, we conducted experiments with noisy input as well. Results
some of the components of the system. are presented in Table 3. We evaluated our model in the following
Fig. 8. Action-advantage values (A-values) predicted by our model on the unseen soapbox sequence of the DAVIS validation set after t iterations of simulated user inputs. First
ten actions selected with greedy policy were as follows: ð67; 4Þ; ð97; 4Þ; ð8; 4Þ; ð88; 2Þ; ð60; 2Þ; ð83; 1Þ; ð53; 1Þ; ð46; 1Þ; ð76; 1Þ; ð33; 1Þ; format: ðf t ; ht Þ.
V. Varga, A. Lo 257
Table 3
Effects of noisy simulated human input on predicted label quality. The table shows the mean Jaccard-index achieved from a 10 clicks per frame annotation budget when specific
levels of noise were added to the simulated user input. Binary and multiclass segmentation results were reached on the DAVIS 2016 and 2017 validation sets respectively.
Dataset r¼0 r ¼ 0:05 r ¼ 0:1 r ¼ 0:2 r ¼ 0:3

DAVIS 2016 (binary) 0.852 0.845 0.831 0.808 0.760
DAVIS 2017 (multilabel) 0.837 0.823 0.804 0.762 0.715
way: let pi be the relative frequency of label Li among the pixel Declaration of Competing Interest
labels of a superpixel ufj ;h . Then, we select the label assigned to
The authors declare that they have no known competing finan-
ufj ;h in this way: l ufj ;h ¼ Lk , where k ¼ argmax ðpi þ ni Þ;
i¼1::jLj ^ pi >0 cial interests or personal relationships that could have appeared
ni Nð0; rÞ. Setting r to 0 eliminates the error. We found that to influence the work reported in this paper.
even at high noise levels (r 0:3), the method still gives accept-
able results. As an example, in the case of r ¼ 0:3, the error of
the annotation model means that the less frequent label is selected Acknowledgments
with 12% probability when two pixel labels are present with rela-
tive frequencies of 1=4 and 3=4. Human annotators are not The authors would like to thank Ádám Fodor for his cooperation
expected to have such high error rates. in the setup of the supervoxel algorithms utilized in the paper and
Márton Véges for his helpful comments. V.V. is supported by the
European Union and co-financed by the European Social Fund
4.5. Limits of our implementation (EFOP-3.6.3-16-2017-00002). Project No. ED-18-1-2019-0030
(Application-specific highly reliable IT solutions) of the National
In this paper we modeled clicking as the annotation method Research, Development and Innovation Fund of Hungary, financed
and left aside other options, such as drawing polygons or scribbles under the Thematic Excellence Programme funding scheme sup-
and alike. They would certainly affect annotation time. In addition, ported the work of A.L.
the method to be chosen may depend on the annotation task that
can be subject of usability studies. Our work emphasizes the
advantages of the RL framework and not the specific choice of References
the annotation procedure itself.
In our implementation, the finest segmentation granularity was [1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale
hierarchical image database, in: CVPR09, 2009..
limited by the superpixel size of the CCS algorithm. In our setting –
[2] M. Everingham, L. Gool, C.K. Williams, J. Winn, A. Zisserman, The pascal visual
detailed in Section 3.1 – a maximum of 91:4% mean Jaccard-index object classes (VOC) challenge, Int. J. Comput. Vision 88 (2) (2010) 303–338,
can be reached on the DAVIS 2017 validation set. However, several https://doi.org/10.1007/s11263-009-0275-4.
hierarchical superpixel/supervoxel methods are able to go down to [3] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, T. Huang, YouTube-VOS: A large-
scale video object segmentation benchmark, arXiv preprint arXiv:1809.03327..
pixel/voxel levels (e.g. GBH) and full precision may be reached by [4] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, A. Sorkine-
choosing a different method. That would also require a higher res- Hornung, A benchmark dataset and evaluation methodology for video object
olution state representation. segmentation, Comput. Vis. Pattern Recogn. (2016).
[5] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, L. Van Gool,
The 2017 DAVIS Challenge on video object segmentation, arXiv:1704.00675..
[6] D. Tsai, M. Flagg, A. Nakazawa, J.M. Rehg, Motion coherent tracking using
5. Conclusion and future work multi-label mrf optimization, Int. J. Comput. Vis. 100 (2) (2012) 190–202.
[7] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, L. Van Gool,
One-shot video object segmentation, in: Proceedings of the IEEE Conference on
In this paper, we have introduced a novel approach to reduce Computer Vision and Pattern Recognition, 2017, pp. 221–230.
human effort in interactive video segmentation annotation with [8] S. Yun, J. Choi, Y. Yoo, K. Yun, J. Young Choi, Action-decision networks for visual
tracking with deep reinforcement learning, in: Proceedings of the IEEE
reinforcement learning. We presented a modified version of the
Conference on Computer Vision and Pattern Recognition, 2017, pp. 2711–
Dueling DDQN architecture, that shares its parameters across the 2720.
temporal axis of its video input. This way, our network leverages [9] W.-D. Jang, C.-S. Kim, Online video object segmentation via convolutional
the power of deep learning to select optimal annotation tasks dur- trident network, in: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2017, pp. 5849–5858.
ing the labeling of longer video clips in a single inference phase. [10] L. Yang, J. Han, D. Zhang, N. Liu, D. Zhang, Segmentation in weakly labeled
Potential future directions include the improvement of the label videos via a semantic ranking and optical warping network, IEEE Trans. Image
prediction step. More sophisticated algorithms and trained models Process. 27 (8) (2018) 4025–4037.
[11] D. Zhang, J. Han, L. Yang, D. Xu, Spftn: a joint learning framework for localizing
seem promising to us. In addition, deep reinforcement learning has and segmenting objects in weakly labeled videos, IEEE Trans. Pattern Anal.
been extended to multi-objective optimization (e.g. [45]) offering Mach. Intelligence..
further options for improving performance. The choice of clicking, [12] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk, Slic superpixels,
Tech. rep. (2010)..
drawing polygons, or scribbles depending on the situation may [13] D. Stutz, A. Hermans, B. Leibe, Superpixels: an evaluation of the state-of-the-
also improve quality, not mentioning usability. art, Comput. Vis. Image Underst. 166 (2018) 1–27.
[14] J. Chang, D. Wei, J.W. Fisher, A video representation using temporal
superpixels, in: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2013, pp. 2051–2058.
CRediT authorship contribution statement [15] M. Grundmann, V. Kwatra, M. Han, I. Essa, Efficient hierarchical graph-based
video segmentation, in ieee computer society conference on computer vision
Viktor Varga: Conceptualization, Methodology, Software, Writ- and pattern recognition, IEEE 2010 (2010) 2141–2148.
[16] C. Rother, V. Kolmogorov, A. Blake, GrabCut: interactive foreground extraction
}rincz: Supervision,
ing - original draft, Visualization. András Lo using iterated graph cuts, in: ACM transactions on graphics (TOG), vol. 23,
Writing - review & editing. ACM, 2004, pp. 309–314..
V. Varga, A. Lo
[17] N. Shankar Nagaraja, F.R. Schmidt, T. Brox, Video segmentation with just a few [38] J. Luiten, P. Voigtlaender, B. Leibe, Premvos: proposal-generation, refinement
strokes, in: Proceedings of the IEEE International Conference on Computer and merging for the DAVIS challenge on video object segmentation 2018, in:
Vision, 2015, pp. 3235–3243. The 2018 DAVIS Challenge on Video Object Segmentation-CVPR Workshops,
[18] A. Fathi, M.F. Balcan, X. Ren, J.M. Rehg, Combining self training and active 2018..
learning for video segmentation, in: BMVC, Georgia Institute of Technology, [39] K.-K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, L. Van
2011.. Gool, Video object segmentation without temporal information, IEEE
[19] S. Vijayanarasimhan, K. Grauman, Active frame selection for label propagation transactions on pattern analysis and machine intelligence..
in videos, in: European Conference on Computer Vision, Springer, 2012, pp. [40] P. Voigtlaender, B. Leibe, Online adaptation of convolutional neural networks
496–509. for the 2017 DAVIS challenge on video object segmentation, in: The 2017
[20] B. Luo, H. Li, F. Meng, Q. Wu, C. Huang, Video object segmentation via global DAVIS Challenge on Video Object Segmentation-CVPR Workshops, vol. 5,
consistency aware query strategy, IEEE Trans. Multimedia 19 (7) (2017) 1482– 2017..
1493. [41] L. Bao, B. Wu, W. Liu, CNN in MRF: video object segmentation via inference in a
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. CNN-based higher-order spatio-temporal mrf, in: Proceedings of the IEEE
Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, et al., Human-level control Conference on Computer Vision and Pattern Recognition, 2018, pp. 5977–
through deep reinforcement learning, Nature 518 (7540) (2015) 529. 5986.
[22] K. Zhou, Y. Qiao, T. Xiang, Deep reinforcement learning for unsupervised video [42] Y. Boykov, O. Veksler, R. Zabih, Fast approximate energy minimization via
summarization with diversity-representativeness reward, in: Thirty-Second graph cuts, IEEE Trans. Pattern Anal. Mach. Intell. 23 (11) (2001) 1222–1239.
AAAI Conference on Artificial Intelligence, 2018. [43] Y. Boykov, V. Kolmogorov, An experimental comparison of min-cut/max-flow
[23] J.C. Caicedo, S. Lazebnik, Active object localization with deep reinforcement algorithms for energy minimization in vision, IEEE Trans. Pattern Anal. Mach.
learning, in: Proceedings of the IEEE International Conference on Computer Intell. 9 (2004) 1124–1137.
Vision, 2015, pp. 2488–2496. [44] V. Kolmogorov, R. Zabih, What energy functions can be minimizedvia graph
[24] Y. Xiang, A. Alahi, S. Savarese, Learning to track: online multi-object tracking cuts?, IEEE Trans. Pattern Anal. Mach. Intell. 2 (2004) 147–159.
by decision making, in: Proceedings of the IEEE International Conference on [45] H. Mossalam, Y.M. Assael, D.M. Roijers, S. Whiteson, Multi-objective deep
Computer Vision, 2015, pp. 4705–4713. reinforcement learning, arXiv preprint arXiv:1610.02707..
[25] J. Han, L. Yang, D. Zhang, X. Chang, X. Liang, Reinforcement cutting-agent
learning for video object segmentation, in: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2018, pp. 9080–9089.
[26] O. Russakovsky, L.-J. Li, L. Fei-Fei, Best of both worlds: human-machine András Lörincz graduated from the Eötvös Loránd
collaboration for object annotation, in: Proceedings of the IEEE Conference on University. His background is in physics. He habilitated
Computer Vision and Pattern Recognition, 2015, pp. 2121–2131. in laser physics and in computer science, too. His
[27] G. Song, H. Myeong, K. Mu Lee, Seednet: automatic seed generation with deep research interest lies at the crossroad of artificial gen-
reinforcement learning for robust interactive segmentation, in: Proceedings of eral intelligence, cognition, and neuroscience, with
the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. principal areas including human-computer interaction
1760–1768. and applications. He has different research projects
[28] D. Acuna, H. Ling, A. Kar, S. Fidler, Efficient interactive annotation of supported by the government and by companies. He
segmentation datasets with polygon-rnn++, in: Proceedings of the IEEE supervises more than 10 Ph.D. and M.Sc. students and
Conference on Computer Vision and Pattern Recognition, 2018, pp. 859–868. has published over 250 peer reviewed publications.
[29] O. Rudovic, M. Zhang, B. Schuller, R.W. Picard, Multi-modal active learning
from human data: a deep reinforcement learning approach, arXiv preprint
arXiv:1906.03098..
[30] S.-H. Lee, W.-D. Jang, C.-S. Kim, Contour-constrained superpixels for image and
video processing, in: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2017, pp. 2443–2451.
[31] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, T. Brox, Flownet 2.0: Viktor Varga received his B.Sc. and M.Sc. degrees in
evolution of optical flow estimation with deep networks, in: Proceedings of Computer Science from Eötvös Loránd University of
the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. Budapest in 2014 and 2017. He is currently a Ph.D.
2462–2470. student working in the Neural Information Processing
[32] A. Benard, M. Gygli, Interactive video object segmentation in the wild, arXiv Group at the university. His research interests include
preprint arXiv:1801.00269.. reinforcement learning and computer vision.
[33] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, N. De Freitas, Dueling
network architectures for deep reinforcement learning, in: International
Conference on Machine Learning, 2016.
[34] M. Hausknecht, P. Stone, Deep recurrent q-learning for partially observable
mdps, in: 2015 AAAI Fall Symposium Series, 2015..
[35] H. Van Hasselt, A. Guez, D. Silver, Deep reinforcement learning with double q-
learning, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[36] T. Schaul, J. Quan, I. Antonoglou, D. Silver, Prioritized experience replay, arXiv
preprint arXiv:1511.05952..
[37] G. Hinton, N. Srivastava, K. Swersky, Neural networks for machine learning
lecture 6a overview of mini-batch gradient descent, Cited on 14..

1 s2.0 S0925231220308110 Main

Uploaded by

Copyright:

Available Formats

1 s2.0 S0925231220308110 Main

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

1 s2.0 S0925231220308110 Main

Uploaded by

Copyright:

Available Formats

Neurocomputing 405 (2020) 247–258

Contents lists available at ScienceDirect

Reducing human efforts in video segmentation annotation with

1. Introduction In this paper, we attempt to learn annotation task selection

Stutz et al. [13] present a comprehensive analysis and a set of 3. Method

i¼1;...;m cal flow is estimated with Flownet 2 [31].

First, we experimented with modifications to the reinforcement

Dataset r¼0 r ¼ 0:05 r ¼ 0:1 r ¼ 0:2 r ¼ 0:3

You might also like