V. Varga, A. Lőrincz / Neurocomputing 405 (2020) 247–258
Journal homepage: www.elsevier.com/locate/neucom
https://doi.org/10.1016/j.neucom.2020.02.127

Article info
Article history: Received 26 September 2019; Revised 18 December 2019; Accepted 22 February 2020; Available online 8 May 2020.
Communicated by Junwei Han.
Keywords: Video segmentation; Interactive annotation; Reinforcement learning.

Abstract
Manual annotation of video segmentation datasets requires an immense amount of human effort; thus, the reduction of human annotation costs is an active topic of research. While many papers deal with the propagation of masks through the frames of a video, only a few results attempt to optimize annotation task selection. In this paper we present a deep learning based solution to the latter problem and train it using reinforcement learning. Our approach utilizes a modified version of the Dueling Deep Q-Network, sharing weight parameters across the temporal axis of the video. This technique enables the trained agent to select annotation tasks from the whole video. We evaluate our annotation task selection method by means of a hierarchical supervoxel segmentation based mask propagation algorithm.

© 2020 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Fig. 1. Overview of our method. 1. A complete labeling of the video is predicted from the partial labeling provided by the user in past iterations. 2. The RL agent selects the next annotation task for the user. 3. The user executes the assigned task by correcting mislabeled segments of the label predictions.
Fig. 2. Label propagation exploiting the hierarchical supervoxel structure. The two rows show label propagation to different frames of the video after user input was given at a high (top row) and at a lower (bottom row) hierarchical level in frame 20 of the soapbox DAVIS sequence. The first column contains the actual click points of the simulated user (zoomed in). Note that the number of click points forms the cost in our procedure. Human work may be spared by allowing annotation with scribbles, as shown in the second column. The remaining two columns to the right display the effect of propagating the user annotation to frames 22 and 80. It is clearly visible that user input at higher hierarchical levels comes with significantly lower cost and is propagated further in time, whereas input at lower hierarchical levels increases the granularity of the propagation locally. Best viewed in color.
GBH iteratively merges adjacent spatio-temporal regions using a sorted global edge weight list. The edge weights are a function of color and optical flow histogram distances. The result of GBH is a minimum spanning tree constructed from the region adjacency graph.

The hierarchical nature of the superpixel and supervoxel segmentations is formulated in the following way: let h, h' be indices of hierarchical levels, with h' >= h. Then \exists k_1, \ldots, k_m : v_j^{h'} = \bigsqcup_{i=1}^{m} v_{k_i}^{h}, where \bigsqcup denotes the disjoint union. The higher levels of the segmentation are produced by the iterative merging procedure with GBH. We refer to hierarchical level h >= 2 as the segmentation given by the iterative merging algorithm stopped at segment count |V_h| = \lfloor |V_{h-1}| / c \rfloor. The number of supervoxels at hierarchical level i is denoted by |V_i|. We found c = 3.7 to be ideal for the DAVIS and SegTrack datasets. Note that the ideal value of this constant is highly dependent on the temporal extent of the supervoxels present in the base level segmentation.
Next, we formulate the labeling of a video. We follow the labeling conventions of the DAVIS dataset: our label set contains a single background label L_BG and multiple foreground labels. Let L = {L_BG, L_1, \ldots, L_{|L|-1}} be the set of labels. Our task is to give a complete labeling of the video by assigning a label to each pixel: l : V_0 \to L.

Annotations already given by the user in an episode can be formalized as a set of triplets: K = {(f_t, h_t, l_{f_t,h_t} : U_{f_t,h_t} \to L)}. As described in the previous section, the user is assigned the annotation task (f_t, h_t) by the RL agent, for which the user provides the labeling of all superpixels in level h_t of frame f_t: the l_{f_t,h_t} mapping. The annotation propagation algorithm can be given in the following way:
Let u_j^{f,0} be a pixel in frame f. If no supervoxel ancestor of the given pixel exists that contains user-labeled superpixels as a subset, then l(u_j^{f,0}) = L_BG. Otherwise, let v_k^h be the lowest level ancestor of u_j^{f,0} which contains a user-labeled superpixel as a subset. In this case, we choose the temporally nearest of such labelled superpixels. Let this superpixel be u_i^{f',h'}. Then l(u_j^{f,0}) = l(u_i^{f',h'}); in other words, we assign the label of the selected superpixel to the pixel u_j^{f,0} in question. The effects of propagating user input given at different hierarchical levels are shown in Fig. 2.
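A minimal Python sketch of the propagation rule above. It assumes a toy representation in which each base-level pixel stores its chain of supervoxel ancestors ordered from lowest to highest level, and each ancestor keeps the (frame index, label) pairs of the user-labeled superpixels it contains; all data structures and names are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass, field

L_BG = 0  # background label

@dataclass
class Supervoxel:
    level: int
    # user-labeled superpixels contained in this supervoxel:
    # (frame_index, label) pairs collected from past annotation tasks
    labeled_members: list = field(default_factory=list)

def propagate_label(pixel_frame: int, ancestors: list) -> int:
    """Label a level-0 pixel from the hierarchy, following the rule above.

    `ancestors` lists the pixel's supervoxel ancestors ordered from the
    lowest to the highest hierarchical level.
    """
    for sv in ancestors:                       # lowest-level ancestor first
        if sv.labeled_members:
            # temporally nearest user-labeled superpixel inside this ancestor
            frame, label = min(sv.labeled_members,
                               key=lambda fl: abs(fl[0] - pixel_frame))
            return label
    return L_BG                                # no labeled ancestor anywhere
```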
the previous subsection. Finally, the complete labeling of the video
3.2. Human annotation protocol is estimated from the partial annotations already present. The pre-
dicted annotation is compared to the ground truth labeling using
The human annotator is asked to make corrections to the pre- the mean Jaccard-index (a.k.a. intersection-over-union or IoU).
dicted annotation at a given frame f t , in a given hierarchical level
ht . The annotator may carry out the label corrections by modifying 3.4. Model architecture
the assigned labels to each superpixel at level ht if they are
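A minimal sketch of the simulated annotation protocol for one frame at one hierarchical level, assuming dense per-pixel arrays of predicted labels, ground truth labels and superpixel ids; the function name and array layout are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def simulate_annotation(pred_labels, gt_labels, superpixel_ids):
    """Simulated annotator: plurality vote over ground truth pixel labels.

    pred_labels, gt_labels, superpixel_ids: integer arrays of the same shape
    holding, for every pixel, the predicted label, the ground truth label and
    the id of the superpixel it belongs to.  Returns the corrected label per
    superpixel and the number of corrections (clicks), i.e. the cost.
    """
    corrected, clicks = {}, 0
    for sp in np.unique(superpixel_ids):
        mask = superpixel_ids == sp
        # plurality vote of the ground truth pixel labels inside the superpixel
        values, counts = np.unique(gt_labels[mask], return_counts=True)
        majority = int(values[np.argmax(counts)])
        predicted = int(np.bincount(pred_labels[mask]).argmax())
        if predicted != majority:
            clicks += 1        # the user clicks to correct this superpixel
        corrected[int(sp)] = majority
    return corrected, clicks
```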
3.3. Markov decision process

In our work, an annotation episode is modelled with a Markov Decision Process (MDP). The MDP is defined by states s_t \in S, actions a_t \in A, the state transition function P : S \times A \times S \to R and the reward function R : S \times A \to R. In each step, the agent receives the current state, which stores downsampled pixelwise video features, the current state of annotation and the time index in the episode, t. The agent then takes an action, selecting the subsequent annotation task to be queried from the user. The action is represented by (f_t, h_t), where f_t is the frame index of the video and h_t is the hierarchical level in which the user carries out the labeling task. In other words, the algorithm may ask for annotations at specific levels in the hierarchy. During the training process, user labeling is computed from the ground truth segmentation labels in accordance with the human annotation protocol detailed in the previous subsection. Finally, the complete labeling of the video is estimated from the partial annotations already present. The predicted annotation is compared to the ground truth labeling using the mean Jaccard-index (a.k.a. intersection-over-union or IoU).

3.4. Model architecture
Fig. 3. Architecture of the modified Dueling Deep Q-Network. F designates the length of the temporal dimension of the video, i.e. the number of frames; the action-advantage prediction branch of the network shares its weights along this axis. C, C', C_V, C_A denote the numbers of image channels. H stands for the number of actions available in each frame. The index of the time step (t) in the MDP episode is fed into both branches of the network.
To predict state-action values for each frame and each hierarchical level, we use a somewhat modified version of the Dueling Double Deep Q-Network (DDDQN) [33]. Since we intend to learn to approximate the same Q-values independently of the actual temporal position in the video, we utilize layers with translation invariance along the temporal axis in the action-advantage estimation branch of the network. The state value prediction branch does not share weights along the temporal axis, but we only use this branch during the training of the network, as the action-advantage estimate A(s, a) is enough to select the optimal action. This procedure enables us to use the weights of the trained model to initialize differently shaped models that work with video clips of arbitrary length. The architecture of our modified Deep Q-network can be seen in Fig. 3.

The input of our model is the state representation, which contains multiple channels of 3D pixel (voxel) data. Consequently, at the front part of our model, multiple 3D convolutional and max-pooling layers are placed alternately. The pooling layers of the shared, frontal part of the network only reduce the dimensionality along the spatial axes and leave the temporal axis untouched. The pooling layers in the action-advantage prediction branch are of a similar nature. The output of the last pooling layer in this branch is reshaped to two axes, keeping the temporal dimension unchanged. The following layer is a bidirectional LSTM layer with recurrent connections along the temporal axis of the video. This layer provides long-term information propagation in time. The outputs of the bidirectional recurrent layer are concatenated for each frame independently. The final layer is fully connected within individual frames of the video and learns the same transformation with tied weights along the temporal axis. ReLU activations are placed after each convolutional layer. The output of the action-advantage prediction branch is shaped F x H, where F denotes the number of frames in the video, while H stands for the number of actions per frame, i.e. for the number of hierarchical levels. H is chosen to be 4 in our experiments. The structure of our action-advantage prediction branch bears a resemblance to the Deep Recurrent Q-Network described in [34], but in our case all weights that take part in the estimation of A(s, a) are shared along the frames of the video; see Fig. 4.

The state value estimation branch of the network is similar to that of the original DDDQN, featuring two 3D max-pooling and one 3D convolutional layer. The following two layers are ordinary fully connected layers with ReLU activations. Additionally, both branches of the network receive the index of the time step in the episode as input.

The predictions of the two branches, A(s, a) and V(s), are merged following [33]:

Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|A|} \sum_{a' \in A} A(s, a') \right)    (1)
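A minimal sketch of the merge in Eq. (1), assuming the action-advantage branch outputs a tensor of shape (F, H) and the state value is a scalar; all names are illustrative.

```python
import numpy as np

def dueling_merge(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) as in Eq. (1).

    `advantages` has shape (F, H): one advantage value per (frame, hierarchical
    level) action.  The mean is taken over the whole action space (all F * H
    actions), and `state_value` is the scalar V(s) of the state.
    """
    advantages = np.asarray(advantages, dtype=np.float64)
    return state_value + (advantages - advantages.mean())

# Greedy action selection; the argmax is unaffected by V(s) and by the
# subtracted mean, which is why the V(s) branch can be dropped at inference.
adv = np.random.randn(40, 4)          # F = 40 frames, H = 4 hierarchical levels
q = dueling_merge(0.3, adv)
f_t, h_t = np.unravel_index(np.argmax(q), q.shape)
```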
3.5. State representation

The state consists of image features and information about the progression of the annotation process. The state comprises the index of the current time step in the MDP episode and features derived from the video in the shape of F x X x Y x C, where F denotes the number of frames in the video, X and Y denote the size of the downscaled frame, and C designates the number of image channels. For subsequent episodes, F may vary, since all weights of our neural network are shared along this dimension.
Fig. 4. Qualitative results on the DAVIS breakdance and soapbox sequences showing the error in label prediction and the correcting user input after 1, 2, 3, 4, 5 and 10 iterations in the frames selected for user annotation. Colored dots represent user clicks. Error rates for both sequences were above the dataset average. Note that the number of click points forms the cost in our procedure. Human work may be spared by allowing annotation with scribbles, as shown in Fig. 2. The progression of labeling in consecutive frames of the sequences is shown in Figs. 5 and 6. Best viewed in color.
X and Y are chosen to be 32 and 56, respectively. During our experiments, C was 9. The details of the individual image channels I_c are the following:

I_0, I_1, I_2, I_3 store the downscaled optical flow images: dx and dy components for the forward and the backward flow, respectively. We obtained the optical flow predictions from FlowNet v2. I_4, I_5 contain the binary images of the estimated changes in occlusion computed from the optical flow predictions. A pixel is considered occluded/disoccluded when the forward/backward optical flow has no vectors pointing to that pixel location or near it. I_6 contains the boundaries between similarly labeled regions. A pixel is considered a boundary pixel if it shares a side with at least one differently labeled pixel.

The image channels specified above may give the neural network information about the scale of potential deterioration during annotation propagation, even when annotation input is not yet present in the temporal vicinity of a certain frame.
in the temporal vicinity of a certain frame. annotation cost of the selected annotation task in time step t.
In case the video lacks fast movements and quick changes in The reward for a given improvement in segmentation accuracy
occlusion, the annotation propagation may go on successfully for increases exponentially as accuracy gets close to the upper limit.
several seconds. Although the temporal recurrent connections in With this, we model the exponential increase in human efforts,
the action-advantage prediction branch of the network enable as the labeling quality improves during the annotation process.
long-term temporal information propagation, still, the introduc- The exponential dependence can be modulated by parameter q.
tion of features that encode the distance of user annotation In our studies, typical q values are between 3 and 10. Depending
improved our results. Hence, we specify channels I7 and I8 as on the border accuracy and the average temporal extent of the
follows. supervoxel segmentation, the maximum achievable Jmean and
I7 encodes the hierarchical distance dhier of a pixel from user the annotation costs may greatly vary between videos. The expo-
annotations. Let dhier ufj ;0 be the level of the lowest level ancestor nential nature of the reward function further amplifies the vari-
ance in expected rewards which impacts the stability of the
supervoxel of ufj ;0 which contains user labelled superpixels as a training. To fight off the instability, we introduce G0 and j which
subset. may differ from video to video as detailed below. G0 stands for
Finally, I8 encodes the level 1 temporal distance dtemp from user the predicted mask when all frames of the video are annotated at
annotations. Let dtemp ufj ;0 be the temporal distance of the tempo- the base hierarchical level (level 1). Thus, Jmean G0 ; G determines
the minimum achievable annotation error for a specific video.
rally closest labelled level 1 superpixel from the ufj ;0 pixel which
The role of x is to normalize Rexp to 1 when the minimum achiev-
share a level 1 supervoxel ancestor.
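A minimal sketch of assembling the F x X x Y x C state tensor from the nine channels described above. The computation of the individual channels (optical flow, occlusion maps, boundary map, distance maps) is abstracted away, and all argument names are illustrative assumptions.

```python
import numpy as np

def build_state(flow_fwd, flow_bwd, occl_fwd, occl_bwd,
                boundaries, d_hier, d_temp, t_index):
    """Stack the nine image channels I_0..I_8 into an (F, X, Y, 9) tensor.

    flow_fwd / flow_bwd: (F, X, Y, 2) downscaled forward / backward optical flow.
    occl_fwd / occl_bwd: (F, X, Y) binary occlusion / disocclusion maps.
    boundaries:          (F, X, Y) binary map of label boundaries.
    d_hier, d_temp:      (F, X, Y) hierarchical and temporal distances from
                         the user annotations.
    Returns the feature tensor plus the episode time index t, which is fed to
    both branches of the network separately.
    """
    channels = [
        flow_fwd[..., 0], flow_fwd[..., 1],     # I_0, I_1
        flow_bwd[..., 0], flow_bwd[..., 1],     # I_2, I_3
        occl_fwd, occl_bwd,                     # I_4, I_5
        boundaries,                             # I_6
        d_hier, d_temp,                         # I_7, I_8
    ]
    features = np.stack(channels, axis=-1).astype(np.float32)  # (F, X, Y, 9)
    return features, t_index
```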
3.6. Training the agent with deep reinforcement learning

To train an agent, we utilize the modified DDDQN described in Section 3.4. Our network learns to estimate the discounted sum of the remaining rewards in the episode, i.e. the Q-values, for each possible action. The action space can be factorized into two independent dimensions: the index of the frame and the hierarchical level of the annotation task. The training of the network is executed in a similar fashion to ordinary Deep Q-Networks with the Double DQN technique [35]. Let \hat{Q}_{targetNN}(s, a) and \hat{Q}_{onlineNN}(s, a) denote the Q-values estimated by the target network and the network being trained. The Q-learning target and the DQN loss function can be expressed with (s, a, s') as:

Q(s, a) = r(s, a) + \gamma \, \hat{Q}_{targetNN}\left(s', \arg\max_{a'} \hat{Q}_{onlineNN}(s', a')\right)    (2)

J(\theta) = \mathbb{E}\left[\left(Q(s, a) - \hat{Q}(s, a)\right)^2\right]    (3)

where \theta denotes the parameters of the networks. The gradient backpropagation is analogous to that of regular Deep Q-Networks: gradients are propagated backwards only from the output corresponding to the selected action a. The network estimates a state-action value for each action, from which an epsilon-greedy policy selects the action to be taken. To avoid training our network with correlated batches, we use an experience replay buffer to store past experiences of the agent; sampling is prioritized by the magnitude of the TD-error [36].
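A minimal sketch of the Double DQN target of Eq. (2) and the masked squared TD error of Eq. (3), written with plain NumPy over flattened Q-value vectors; the online/target networks are stand-ins for the actual model and all names are illustrative.

```python
import numpy as np

def ddqn_target(reward, q_next_online, q_next_target, gamma=0.9, done=False):
    """Q-learning target of Eq. (2): the online network picks the next action,
    the target network evaluates it."""
    if done:
        return reward
    best_next = int(np.argmax(q_next_online))
    return reward + gamma * q_next_target[best_next]

def td_loss_and_grad(q_pred, action, target):
    """Squared TD error of Eq. (3); the gradient flows only through the
    output of the selected action, all other outputs receive zero gradient."""
    error = target - q_pred[action]
    grad = np.zeros_like(q_pred)
    grad[action] = -2.0 * error        # d/dq_pred[action] of (target - q)^2
    return error ** 2, grad
```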
The reward function we use was inspired by the exponential reward function suggested by Song et al. [27]. We aim to balance between rewarding the improvement of segmentation accuracy (improvement in R_exp) and penalizing annotation cost (Cost_t):

R_t = \begin{cases} -1, & \text{if frame index } f_t \text{ has already been selected by the agent} \\ w_{err}\left(R_{exp}(M_t, G) - R_{exp}(M_{t-1}, G)\right) - w_{ac}\,\dfrac{Cost_t}{\kappa}, & \text{otherwise} \end{cases}    (4)

where

R_{exp}(M, G) = \frac{e^{\omega q} - 1}{e^{q} - 1}, \qquad \omega = \frac{J_{mean}(M, G)}{J_{mean}(G_0, G)}.    (5)

In the definition of the reward function, J_mean stands for the mean Jaccard-index (also known as the mean intersection-over-union computed for pairs of masks, for each label present). M_t and G denote the predicted mask for the whole video in time step t of the MDP episode and the ground truth mask, respectively. Cost_t stands for the annotation cost of the annotation task selected in time step t.

The reward for a given improvement in segmentation accuracy increases exponentially as the accuracy gets close to the upper limit. With this, we model the exponential increase in human effort as the labeling quality improves during the annotation process. The exponential dependence can be modulated by the parameter q. In our studies, typical q values are between 3 and 10. Depending on the border accuracy and the average temporal extent of the supervoxel segmentation, the maximum achievable J_mean and the annotation costs may vary greatly between videos. The exponential nature of the reward function further amplifies the variance in expected rewards, which impacts the stability of the training. To fight off the instability, we introduce G_0 and \kappa, which may differ from video to video, as detailed below. G_0 stands for the predicted mask when all frames of the video are annotated at the base hierarchical level (level 1). Thus, J_mean(G_0, G) determines the minimum achievable annotation error for a specific video. The role of \omega is to normalize R_exp to 1 when the minimum achievable error is reached for a specific video. Similarly, \kappa designates the number of supervoxels in the base level segmentation (level 1). As mentioned before, the number of supervoxels in CCS generated segmentations varies significantly depending on the magnitude of movement in the videos, and \kappa compensates for this.
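A minimal sketch of the reward computation in Eqs. (4)-(5), following the reconstruction above (including the flat penalty for re-selecting an already annotated frame); the function names and signatures are illustrative assumptions, not the paper's code.

```python
import numpy as np

def r_exp(j_mean, j_mean_max, q=8.0):
    """Exponential accuracy term of Eq. (5); omega normalizes by the best
    J_mean reachable for the video (all frames annotated at level 1)."""
    omega = j_mean / j_mean_max
    return (np.exp(omega * q) - 1.0) / (np.exp(q) - 1.0)

def step_reward(j_prev, j_curr, cost, kappa, j_mean_max,
                frame_already_selected,
                w_err=10.0, w_ac=70.0, q=8.0):
    """Reward of Eq. (4): improvement in R_exp minus the click cost normalized
    by kappa (the number of base-level supervoxels)."""
    if frame_already_selected:
        return -1.0
    gain = r_exp(j_curr, j_mean_max, q) - r_exp(j_prev, j_mean_max, q)
    return w_err * gain - w_ac * cost / kappa
```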
4. Experiments and evaluations

In this section, first, we present details about the training of our network. Later, we train and evaluate our method on the DAVIS datasets and compare it to the semi-supervised state-of-the-art. Additionally, we evaluate our trained model on the unseen SegTrack dataset and compare its accuracy to the state-of-the-art in interactive video segmentation as well. Furthermore, we compare the policy learned by the reinforcement learning agent to baseline policies to validate our RL based approach. Finally, we simulate potential mistakes made by human annotators and show that our approach is robust to minor errors in following the annotation protocol. We use the mean Jaccard-index to measure our predictions for a given annotation budget.

4.1. Details of the training and the prediction process

We train our network on the training set of the DAVIS 2017 dataset. We split the training set into two parts, containing 40 and 20 videos respectively. We use the second part as a validation set while training on the first part of the dataset. The subset of 20 videos of the DAVIS training set we use for validation should not be confused with the DAVIS validation set, which we use only for final evaluation purposes. We initialize a network with an input video length of F = 40 and use random slices of videos for each episode as the network input. The training process starts with an experience collection phase of 20,000 steps using a random policy without training the network. Following this, the network is trained while generating experiences. The policy is parametrized with an epsilon value gradually decreasing from 0.5 to a minimal value of 0.2. The discount rate gamma is set to 0.9. Episode length is fixed to 20 annotation task selection iterations. The network is trained using the RMSprop [37] optimizer with a learning rate set to 5 x 10^-4. The weights of the target network are updated with a soft update rate of 10^-4.

The number of hierarchical levels used (H) was set to 4. The following reward parameters were found to be optimal: w_err = 10, w_ac = 70, q = 8. Further analysis of these parameter values can be found in Section 4.3.

While the training process is carried out using random slices of videos with a fixed length, the prediction phase has no such restriction on the input. During inference, the branch of the network responsible for the state-value estimation can be removed. The remaining part of the network shares all its weights along the temporal axis of the video; therefore, the temporal extent of the input is only limited by hardware capacities.

Table 1. Comparison of our method with state-of-the-art semi-supervised methods on the DAVIS validation sets. Binary and multiclass segmentation results were obtained on the DAVIS 2016 and 2017 validation sets, respectively. The mean Jaccard-index was used as the metric. Note that all cited methods receive the first frame ground truth annotation as input, while our method continuously improves the prediction using approximate annotations from user input. The budget for our method is given in number of clicks per total frame count of all videos.

Method | Binary segmentation (mean J) | Multiclass segmentation (mean J)
PReMVOS [38] | 0.849 | 0.739
OSVOS-S [39] | 0.856 | 0.647
OnAVOS [40] | 0.861 | 0.616
CINM [41] | 0.834 | 0.672
OSVOS [7] | 0.798 | 0.566
Ours, 1 click per frame budget | 0.661 | 0.594
Ours, 2 clicks per frame budget | 0.710 | 0.685
Ours, 5 clicks per frame budget | 0.793 | 0.781
Ours, 10 clicks per frame budget | 0.852 | 0.837
We measured the execution time of our method using a GTX 1080 graphics card and a Xeon E5-2620 v4 CPU. While the preprocessing steps that generate the hierarchical segmentation and compute the optical flow may take several seconds per frame, these computations can be done in advance. During the actual interactive annotation procedure, the human annotator has to wait after completing the labeling of a frame before being assigned the consecutive task. The waiting time in this case is on the order of a few hundred milliseconds. The graphics card could load video clips of more than 3000 frames at once; longer videos may simply be cut into slices of appropriate length.
since the introduction of the GrabCut algorithm [16]: we provide
4.2. Interactive segmentation results hints (or seeds) to the segmentation algorithm by means of sug-
gested mouse clicks in images. The number of mouse clicks needed
4.2.1. Comparison with the state-of-the-art to reach certain accuracy levels is an unambigous measure of
First, we compare our method quantitatively to the state-of- human effort in annotation tasks.
the-art on the semi-supervised benchmarks of the DAVIS dataset We note that none of the papers available on interactive seg-
in Table 1. We cannot offer a straightforward comparison between mentation feature the metric described above, we compare our
our method and semi-supervised methods, due to the underlying results to the state-of-the-art measuring annotation costs by
methodological differences: the semi-supervised protocol of the counting the number of annotated frames. We evaluated our
DAVIS benchmarks reveal the ground truth labeling of the first method on the SegTrack dataset in order to study database depen-
frame to the evaluated methods, while our method iteratively dence when training on one set and testing on both. Results can be
receives approximate corrections of the predicted labels, without seen in Table 2. Videos in the SegTrack dataset have relatively low
ever obtaining the exact ground truth labeling of a full frame. Still, resolution and some sequences, e.g. cheetah, girl and monkeydog
some comparisons seem worth as they give insights about the use- contain a significant amount of compression related artifacts,
fulness of our method. which may give rise to erroneous object boundary prediction for
In Table 1, we show our results in two benchmarks. The values some algorithms and may spoil performance.
in the binary segmentation task correspond to the mean Jaccard- Qualitative results of interactive annotation episodes for binary
index of the predicted binary masks over all frames of the DAVIS and multilabel annotation tasks are shown in Fig. 5 and Fig. 6
2016 validation set. The multiclass segmentation task refers to respectively. Ground truth masks are shown in the leftmost col-
the evaluation of the models on the DAVIS 2017 validation set umns. Columns to the right show the progression of the improve-
using the mean Jaccard-index over all frames of the dataset and ment of predicted labels throughout multiple frames of the video.
with all the different labels present in each frame. Our results In all illustrations presented in Figs. 5 and 6 prediction error rate
improve as the budget for user annotation actions increases. was above the dataset average.
Clearly, state-of-the-art, deep convolutional network based semi-
supervised methods learn quickly to segment a single moving
object in front of a more or less static background. Nonetheless, 4.2.2. Evaluation of learned policy
their performance plummets as multiple moving objects appear In the previous section, we compared our method to other
with individual labels. On the contrary, our method presents an approaches in terms of the quality of the obtained segmentation
approximately similar accuracy both for binary and multilabel seg- annotation. However, utilizing a random task selection policy to
mentation tasks. We attribute this to the main difference in our assign annotation tasks to the human participant also leads to ade-
training scheme compared to other approaches. In our case, we quate annotation quality after a few iterations. In view of this, we
use a trained model to estimate the label prediction error, while compare the policy learnt by the RL agent to baseline policies on
the label prediction itself is computed by a black box algorithm. the DAVIS 2017 validation set.
Table 2. Results on the unseen SegTrack dataset after a specific number of frames were annotated by the user. Note that, in contrast to [20], our method is also optimized for the cost of annotating each frame. Annotation costs are given in number of clicks per total frame count of all videos.

SegTrack sequence | [20] (5 fr.) | Ours (5 fr.) | Ours (10 fr.) | Ours (20 fr.)
birdfall | 0.593 | 0.711 | 0.729 | 0.747
cheetah | 0.502 | 0.348 | 0.448 | 0.711
girl | 0.739 | 0.463 | 0.537 | 0.671
monkeydog | 0.454 | 0.631 | 0.729 | 0.774
parachute | 0.821 | 0.897 | 0.898 | 0.901
penguin | – | 0.941 | 0.946 | 0.947
Mean IoU w/o penguin | 0.604 | 0.651 | 0.711 | 0.782
Mean IoU | – | 0.701 | 0.751 | 0.810
Annot. cost per frame | – | 2.028 | 4.282 | 8.545
Fig. 5. Qualitative results on videos with above-average error from the DAVIS 2016 dataset. Rows represent frames 0, 40 and 80 of the breakdance sequence and frames 0, 15, 30 of the motocrossjump sequence, in this order. The first column contains the ground truth masks in the specific frames, while the following columns contain the predicted masks after 1, 2, 5 and 10 iterations of user annotation inputs. Best viewed in color.
The results of the comparison are shown in Fig. 7. Annotation costs are given in the number of clicks per total number of frames in the dataset required to reach a specific level of predicted label quality.

The Random policy selects (f_t, h_t) annotation tasks randomly, but avoids redundant queries: in case a certain frame has already been annotated at hierarchical level h, queries targeting this frame with hierarchical level h' >= h are ignored. In Fig. 7, the average results of 10 such random policies are shown.

The Best constant policy selects frame f_t to be the most distant from all previously queried frames, the beginning and the end of the video. We experimented with other frame selection strategies and found them to be inferior, including sequential equidistant sampling and a keyframe based frame selection policy where keyframes were chosen according to the Keyframe policy in [19]. To obtain a strategy for selecting the hierarchical level h_t, we evaluated all possible monotonically decreasing policies up to an episode length of 20 steps on a subset of the DAVIS 2017 training set, combined with the frame selection policy specified above. We then evaluated the best performing ones again on the full DAVIS 2017 training set to acquire the final policy. The hierarchical level selection policy giving the best results, combined with the frame selection strategy mentioned before, produced the Best constant policy. Results are depicted in Fig. 7.
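A minimal sketch of the frame selection heuristic of the Best constant policy described above; the treatment of the video boundaries as anchors and the tie-breaking rule are assumptions of this illustration.

```python
def most_distant_frame(num_frames, queried):
    """Pick the frame farthest from every previously queried frame and from
    both ends of the video (Best constant baseline frame selection)."""
    anchors = set(queried) | {-1, num_frames}   # video boundaries act as anchors
    best_f, best_d = None, -1
    for f in range(num_frames):
        d = min(abs(f - a) for a in anchors)
        if d > best_d:
            best_f, best_d = f, d
    return best_f

# e.g. for an 80-frame clip with frames 0 and 40 already queried:
# most_distant_frame(80, [0, 40]) -> 20 (midway between the queried frames 0 and 40)
```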
The MRF baseline policy is generated by a Markov Random Field (MRF) and is driven by an active learning approach. Our baseline measure is derived from the paper of Luo et al. [20]. We made the cited method compatible with our hierarchical supervoxel framework by modifying their approach at a few points. We continue to use the notation defined in Section 3.
Fig. 6. Qualitative results on videos with above-average error from the DAVIS 2017 dataset. Rows represent frames 0, 15 and 30 of the shooting sequence and frames 0, 40, 80 of the soapbox sequence, in this order. The first column contains the ground truth masks in the specific frames, while the following columns contain the predicted masks after 1, 2, 5 and 10 iterations of user annotation inputs. Best viewed in color.
Fig. 7. Comparison of the learned policy with baseline policies on the DAVIS 2017 validation set. The figure shows annotation costs required to reach certain Jmean levels with
the specific task selection policy. Annotation costs are given in number of clicks per total number of frames in all videos of the dataset. Lower values are better.
To predict the expected effect of the user labeling of a certain frame at a certain hierarchical level, we utilize the Expected Label Change defined in Eq. 3 of Ref. [20] as follows:

ELC(f, h) = \sum_{j \in 1..|U^{f,h}|} \; \sum_{l' \in L \setminus \{l(u_j^{f,h})\}} P\left(l(u_j^{f,h}) = l'\right) \, C\left(l(u_j^{f,h}) \to l'\right)    (6)

Here, C(l(u_j^{f,h}) \to l') denotes the total pixelwise label change in the video in case superpixel u_j^{f,h} was assigned label l'. P(l(u_j^{f,h}) = l') denotes the estimated probability of the superpixel receiving label l' when annotated by the user. ELC(f, h) is the expected label change caused by annotating frame f at hierarchical level h. We compute ELC(f, h) over k superpixels sampled randomly from the segmentation. In our baseline we also include the estimation of the cost of annotating a frame at a specific hierarchical level. This cost is given by the sum of the estimated probabilities that the user will change the labels of the sampled superpixels. We normalize the cost with the sum of the areas of the sampled superpixels. Eventually, we maximize the expected label change divided by the estimated cost of the task to get the most efficient labeling task. Algorithm 1 summarizes the task selection procedure.
Fig. 8. Action-advantage values (A-values) predicted by our model on the unseen soapbox sequence of the DAVIS validation set after t iterations of simulated user inputs. The first ten actions selected with the greedy policy were: (67, 4), (97, 4), (8, 4), (88, 2), (60, 2), (83, 1), (53, 1), (46, 1), (76, 1), (33, 1); format: (f_t, h_t).
Table 3. Effects of noisy simulated human input on the predicted label quality. The table shows the mean Jaccard-index achieved with a 10 clicks per frame annotation budget when specific levels of noise were added to the simulated user input. Binary and multiclass segmentation results were obtained on the DAVIS 2016 and 2017 validation sets, respectively.
We simulate the noisy selection of superpixel labels in the following way: let p_i be the relative frequency of label L_i among the pixel labels of a superpixel u_j^{f,h}. Then, we select the label assigned to u_j^{f,h} as l(u_j^{f,h}) = L_k, where k = \arg\max_{i = 1..|L|,\; p_i > 0} (p_i + n_i) and n_i \sim N(0, \sigma). Setting \sigma to 0 eliminates the error. We found that even at high noise levels (\sigma \approx 0.3), the method still gives acceptable results. As an example, in the case of \sigma = 0.3, the error of the annotation model means that the less frequent label is selected with 12% probability when two pixel labels are present with relative frequencies of 1/4 and 3/4. Human annotators are not expected to have such high error rates.
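A minimal sketch of the noisy label selection described above; the function name is illustrative and the arrays are assumed to hold the ground truth pixel labels of one superpixel.

```python
import numpy as np

def noisy_superpixel_label(pixel_labels, sigma=0.3, rng=np.random):
    """Noisy simulated annotator: relative label frequencies are perturbed with
    Gaussian noise and the argmax is taken over the labels actually present."""
    labels, counts = np.unique(pixel_labels, return_counts=True)
    p = counts / counts.sum()                     # relative frequencies p_i > 0
    noise = rng.normal(0.0, sigma, size=p.shape)  # n_i ~ N(0, sigma)
    return int(labels[np.argmax(p + noise)])

# sigma = 0 reproduces the error-free plurality vote; with sigma = 0.3 and two
# labels at frequencies 1/4 and 3/4 the rarer label wins about 12% of the time.
```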
4.5. Limits of our implementation

In this paper we modeled clicking as the annotation method and left aside other options, such as drawing polygons, scribbles and the like. They would certainly affect annotation time. In addition, the method to be chosen may depend on the annotation task, which can be the subject of usability studies. Our work emphasizes the advantages of the RL framework and not the specific choice of the annotation procedure itself.
In our implementation, the finest segmentation granularity was limited by the superpixel size of the CCS algorithm. In our setting (detailed in Section 3.1), a maximum of 91.4% mean Jaccard-index can be reached on the DAVIS 2017 validation set. However, several hierarchical superpixel/supervoxel methods are able to go down to pixel/voxel levels (e.g. GBH), and full precision may be reached by choosing a different method. That would also require a higher resolution state representation.
5. Conclusion and future work

In this paper, we have introduced a novel approach to reduce human effort in interactive video segmentation annotation with reinforcement learning. We presented a modified version of the Dueling DDQN architecture that shares its parameters across the temporal axis of its video input. This way, our network leverages the power of deep learning to select optimal annotation tasks during the labeling of longer video clips in a single inference phase. Potential future directions include the improvement of the label prediction step; more sophisticated algorithms and trained models seem promising to us. In addition, deep reinforcement learning has been extended to multi-objective optimization (e.g. [45]), offering further options for improving performance. The choice of clicking, drawing polygons, or scribbles depending on the situation may also improve quality, not to mention usability.
CRediT authorship contribution statement

Viktor Varga: Conceptualization, Methodology, Software, Writing - original draft, Visualization. András Lőrincz: Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank Ádám Fodor for his cooperation in the setup of the supervoxel algorithms utilized in the paper and Márton Véges for his helpful comments. V.V. is supported by the European Union and co-financed by the European Social Fund (EFOP-3.6.3-16-2017-00002). Project No. ED-18-1-2019-0030 (Application-specific highly reliable IT solutions) of the National Research, Development and Innovation Fund of Hungary, financed under the Thematic Excellence Programme funding scheme, supported the work of A.L.
References

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in: CVPR, 2009.
[2] M. Everingham, L. Van Gool, C.K. Williams, J. Winn, A. Zisserman, The Pascal visual object classes (VOC) challenge, Int. J. Comput. Vision 88 (2) (2010) 303–338, https://doi.org/10.1007/s11263-009-0275-4.
[3] N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, T. Huang, YouTube-VOS: a large-scale video object segmentation benchmark, arXiv preprint arXiv:1809.03327.
[4] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, A. Sorkine-Hornung, A benchmark dataset and evaluation methodology for video object segmentation, in: Computer Vision and Pattern Recognition, 2016.
[5] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, L. Van Gool, The 2017 DAVIS Challenge on video object segmentation, arXiv preprint arXiv:1704.00675.
[6] D. Tsai, M. Flagg, A. Nakazawa, J.M. Rehg, Motion coherent tracking using multi-label MRF optimization, Int. J. Comput. Vis. 100 (2) (2012) 190–202.
[7] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, L. Van Gool, One-shot video object segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 221–230.
[8] S. Yun, J. Choi, Y. Yoo, K. Yun, J. Young Choi, Action-decision networks for visual tracking with deep reinforcement learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2711–2720.
[9] W.-D. Jang, C.-S. Kim, Online video object segmentation via convolutional trident network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5849–5858.
[10] L. Yang, J. Han, D. Zhang, N. Liu, D. Zhang, Segmentation in weakly labeled videos via a semantic ranking and optical warping network, IEEE Trans. Image Process. 27 (8) (2018) 4025–4037.
[11] D. Zhang, J. Han, L. Yang, D. Xu, SPFTN: a joint learning framework for localizing and segmenting objects in weakly labeled videos, IEEE Trans. Pattern Anal. Mach. Intell.
[12] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, S. Süsstrunk, SLIC superpixels, Tech. rep., 2010.
[13] D. Stutz, A. Hermans, B. Leibe, Superpixels: an evaluation of the state-of-the-art, Comput. Vis. Image Underst. 166 (2018) 1–27.
[14] J. Chang, D. Wei, J.W. Fisher, A video representation using temporal superpixels, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2051–2058.
[15] M. Grundmann, V. Kwatra, M. Han, I. Essa, Efficient hierarchical graph-based video segmentation, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 2141–2148.
[16] C. Rother, V. Kolmogorov, A. Blake, GrabCut: interactive foreground extraction using iterated graph cuts, in: ACM Transactions on Graphics (TOG), vol. 23, ACM, 2004, pp. 309–314.
[17] N. Shankar Nagaraja, F.R. Schmidt, T. Brox, Video segmentation with just a few strokes, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3235–3243.
[18] A. Fathi, M.F. Balcan, X. Ren, J.M. Rehg, Combining self training and active learning for video segmentation, in: BMVC, Georgia Institute of Technology, 2011.
[19] S. Vijayanarasimhan, K. Grauman, Active frame selection for label propagation in videos, in: European Conference on Computer Vision, Springer, 2012, pp. 496–509.
[20] B. Luo, H. Li, F. Meng, Q. Wu, C. Huang, Video object segmentation via global consistency aware query strategy, IEEE Trans. Multimedia 19 (7) (2017) 1482–1493.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529.
[22] K. Zhou, Y. Qiao, T. Xiang, Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[23] J.C. Caicedo, S. Lazebnik, Active object localization with deep reinforcement learning, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2488–2496.
[24] Y. Xiang, A. Alahi, S. Savarese, Learning to track: online multi-object tracking by decision making, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4705–4713.
[25] J. Han, L. Yang, D. Zhang, X. Chang, X. Liang, Reinforcement cutting-agent learning for video object segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 9080–9089.
[26] O. Russakovsky, L.-J. Li, L. Fei-Fei, Best of both worlds: human-machine collaboration for object annotation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2121–2131.
[27] G. Song, H. Myeong, K. Mu Lee, SeedNet: automatic seed generation with deep reinforcement learning for robust interactive segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1760–1768.
[28] D. Acuna, H. Ling, A. Kar, S. Fidler, Efficient interactive annotation of segmentation datasets with Polygon-RNN++, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 859–868.
[29] O. Rudovic, M. Zhang, B. Schuller, R.W. Picard, Multi-modal active learning from human data: a deep reinforcement learning approach, arXiv preprint arXiv:1906.03098.
[30] S.-H. Lee, W.-D. Jang, C.-S. Kim, Contour-constrained superpixels for image and video processing, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2443–2451.
[31] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, T. Brox, FlowNet 2.0: evolution of optical flow estimation with deep networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2462–2470.
[32] A. Benard, M. Gygli, Interactive video object segmentation in the wild, arXiv preprint arXiv:1801.00269.
[33] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, N. De Freitas, Dueling network architectures for deep reinforcement learning, in: International Conference on Machine Learning, 2016.
[34] M. Hausknecht, P. Stone, Deep recurrent Q-learning for partially observable MDPs, in: 2015 AAAI Fall Symposium Series, 2015.
[35] H. Van Hasselt, A. Guez, D. Silver, Deep reinforcement learning with double Q-learning, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[36] T. Schaul, J. Quan, I. Antonoglou, D. Silver, Prioritized experience replay, arXiv preprint arXiv:1511.05952.
[37] G. Hinton, N. Srivastava, K. Swersky, Neural networks for machine learning, Lecture 6a: overview of mini-batch gradient descent.
[38] J. Luiten, P. Voigtlaender, B. Leibe, PReMVOS: proposal-generation, refinement and merging for the DAVIS challenge on video object segmentation 2018, in: The 2018 DAVIS Challenge on Video Object Segmentation, CVPR Workshops, 2018.
[39] K.-K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, L. Van Gool, Video object segmentation without temporal information, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[40] P. Voigtlaender, B. Leibe, Online adaptation of convolutional neural networks for the 2017 DAVIS challenge on video object segmentation, in: The 2017 DAVIS Challenge on Video Object Segmentation, CVPR Workshops, vol. 5, 2017.
[41] L. Bao, B. Wu, W. Liu, CNN in MRF: video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5977–5986.
[42] Y. Boykov, O. Veksler, R. Zabih, Fast approximate energy minimization via graph cuts, IEEE Trans. Pattern Anal. Mach. Intell. 23 (11) (2001) 1222–1239.
[43] Y. Boykov, V. Kolmogorov, An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision, IEEE Trans. Pattern Anal. Mach. Intell. 26 (9) (2004) 1124–1137.
[44] V. Kolmogorov, R. Zabih, What energy functions can be minimized via graph cuts?, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2) (2004) 147–159.
[45] H. Mossalam, Y.M. Assael, D.M. Roijers, S. Whiteson, Multi-objective deep reinforcement learning, arXiv preprint arXiv:1610.02707.

András Lőrincz graduated from the Eötvös Loránd University. His background is in physics. He habilitated in laser physics and in computer science, too. His research interest lies at the crossroad of artificial general intelligence, cognition, and neuroscience, with principal areas including human-computer interaction and applications. He has different research projects supported by the government and by companies. He supervises more than 10 Ph.D. and M.Sc. students and has published over 250 peer reviewed publications.

Viktor Varga received his B.Sc. and M.Sc. degrees in Computer Science from Eötvös Loránd University of Budapest in 2014 and 2017. He is currently a Ph.D. student working in the Neural Information Processing Group at the university. His research interests include reinforcement learning and computer vision.