Superframes, A Temporal Video Segmentation

Hajar Sadeghi Sokeh∗, Vasileios Argyriou∗, Dorothy Monekosso†, Paolo Remagnino∗

∗ The Robot Vision Team, Faculty of Science, Engineering, and Computing,
Kingston University, London, UK
{h.sadeghisokeh, vasileios.argyriou, p.remagnino}@kingston.ac.uk

† The Robot Vision Team, Faculty of Science, Engineering, and Computing,
Leeds Beckett University, Leeds, UK
{d.n.monekosso}@leedsbeckett.ac.uk

Abstract—The goal of video segmentation is to turn video data into a set of concrete motion clusters that can be easily interpreted as building blocks of the video. There is some work on similar topics, such as detecting scene cuts in a video, but little specific research on clustering video data into a desired number of compact segments. It would be more intuitive, and more efficient, to work with perceptually meaningful entities obtained from a low-level grouping process, which we call 'superframes'. This paper presents a new, simple and efficient technique to detect superframes of similar content patterns in videos. We calculate the similarity of motion content to obtain the strength of change between consecutive frames. With the help of an existing optical flow technique based on deep models, the proposed method is able to perform accurate motion estimation efficiently. We also propose two criteria for measuring and comparing the performance of different algorithms on various databases. Experimental results on videos from benchmark databases demonstrate the effectiveness of the proposed method.

I. INTRODUCTION

In computer vision, many existing algorithms for video analysis use a fixed number of frames for processing, for example optical flow or motion estimation techniques [1] and human activity recognition [2], [3]. However, it would be more intuitive, and more efficient, to work with perceptually meaningful entities obtained from a low-level grouping process, which we call 'superframes'.

Similar to superpixels [4], which are key building blocks of many algorithms and significantly reduce the number of image primitives compared to pixels, superframes do the same in the time domain. They can be used in many different applications such as video segmentation [5], video summarization [6], and video saliency detection [7]. They are also useful in the design of a video database management system [8] that manages a collection of video data and provides content-based access to users [9]. Video data modeling, insertion, storage organization and management, and video data retrieval are among the basic problems addressed in a video database management system, and they can be solved more efficiently by using superframes. By temporally clustering the video, it is easier to identify the significant segments of the video and achieve better representation, indexing, storage, and retrieval of the video data.

The main goal of this work is an automatic temporal clustering of a video by analyzing its visual content and partitioning it into a set of units called superframes. This process can also be referred to as video data segmentation. Each segment is defined as a continuous sequence of video frames which have no significant inter-frame difference in terms of their motion content. Motion is the main criterion we use in this work; we therefore assume all the videos are taken from a single fixed camera.

There is little literature specifically on temporal segmentation of video, but some work on related areas. In this section, we briefly discuss the techniques most relevant to this work: temporal superpixels, scene cut detection, and video segmentation.

The main idea of using superpixels as primitives in image processing was introduced by Ren and Malik in [10]. Using superpixels instead of raw pixel data is also beneficial for video applications. Although superpixel algorithms were until recently applied mainly to still images, researchers have started to apply them to video sequences, and there is some recent work exploiting the temporal connection between consecutive frames. Reso et al. [11] proposed a new method for generating superpixels in a video with temporal consistency. Their approach performs an energy-minimizing clustering using a hybrid clustering strategy for a multi-dimensional feature space. This space is separated into a global color subspace and multiple local spatial subspaces. A sliding window consisting of multiple consecutive frames is used, which is suitable for processing arbitrarily long video sequences.

A generative probabilistic model is proposed by Chang et al. in [12] for temporally consistent superpixels in video sequences; it uses past and current frames and scales linearly with video length. They presented a low-level video representation which is closely related to volumetric voxels [13], but still different in that temporal superpixels are designed mainly for video data, whereas supervoxels are for 3D volumetric data.

Lee et al. in [14] developed a temporal superpixel algorithm based on proximity-weighted patch matching. They estimated superpixel motion vectors by combining the patch matching distances of neighboring superpixels and the target superpixel. Then, they initialized the superpixel label of each pixel in a frame by mapping the superpixel labels of the previous frames using the motion vectors. Next, they refined the initial superpixels by updating the superpixel labels of boundary pixels iteratively based on a cost function. Finally, they performed some postprocessing including superpixel splitting, merging, and relabeling.
Fig. 1. An example of superframes in a video from MAD database

In video indexing, archiving, and video communication tasks such as rate control, scene change detection plays an important role. It can be very challenging when scene changes are small, and other changes such as brightness variation may cause false detections. Many methods have been proposed for scene change detection. Yi and Ling [15] proposed a simple technique to detect sudden and unexpected scene changes based only on pixel values, without any motion estimation. They first screen out many non-scene-change frames and then normalize the remaining frames using a histogram equalization process.

A fully convolutional neural network has been used for the shot boundary detection task in [16] by Gygli. He treated the problem as binary classification: predicting whether a frame belongs to the same shot as the previous frame or not. He also created a new dataset of synthetic data with one million frames to train this network.

Video segmentation aims to group perceptually and visually similar video frames into spatio-temporal regions, a method applicable to many higher-level tasks in computer vision such as activity recognition, object tracking, content-based retrieval, and visual enhancement. In [5], Grundmann et al. presented a technique for spatio-temporal segmentation of long video sequences. Their work is a generalization of Felzenszwalb and Huttenlocher's graph-based image segmentation technique [17]. They use a hierarchical graph-based algorithm to make an initial over-segmentation of the video volume into relatively small space-time regions, and they use optical flow as a region descriptor for graph nodes.

Kotsia et al. in [18] proposed using the 3D gradient correlation function operating in the frequency domain for action spotting in a video sequence. They used the 3D Fourier transform, which is invariant to spatiotemporal changes and frame recording. In that work, the estimation of motion relies on detecting the maximum of the cross-correlation function between two blocks of video frames [19].

Similar to all these tasks, in this work we propose a simple and efficient motion-based method to segment a video over time into compact episodes. This grouping leads to increased computational efficiency for subsequent processing steps and allows more complex algorithms that would be computationally infeasible at the frame level. Our algorithm detects major changes in video streams, such as when an action begins and ends.

The only inputs to our method are a video and a number specifying the desired number of clusters in that video. The output is the set of frame numbers marking the boundaries between the uniform clusters. Visual features such as colors and textures per frame are of low importance compared to motion features. This motivates us to employ motion information to detect big changes over the video frames. We use FlowNet-2 [1], a very accurate optical flow estimation method based on deep networks, to extract the motion between every pair of subsequent frames. We then use both the average and the histogram of flow over video frames and compare our results against a baseline.

II. THE PROPOSED SUPERFRAME TECHNIQUE

Our superframe segmentation algorithm detects the boundaries between temporal clusters in video frames. The algorithm takes the number of desired clusters, K, as input, and generates superframes based on motion similarity and proximity in the video frames. In other words, the histogram of magnitude and the histogram of direction of motion per frame, together with the frame position in the video, are used as features to cluster video frames. Our superframe technique is motivated by SLIC [4], a superpixel algorithm for images, which we generalize to video data (see Figure 1).

A. The proposed algorithm

Our model segments the video using motion cues. We assume the videos are taken with a fixed camera, so one can represent the type of motion of the foreground object by computing features from the optical flow. We first apply FlowNet-2 [1] to get the flow per video frame, and then we compute a histogram of magnitude (HOM) and a histogram of direction (HOD) of flow per frame to initialize K cluster centers C_1, ..., C_K. Each cluster center therefore has 20 feature values, C_k = [HOM_{1..11}, HOD_{1..8}, fr]^T with k = 1, ..., K, where HOM_{1..11} denotes the histogram of magnitude, HOD_{1..8} denotes the histogram of direction over 8 different directions, and fr is the frame index in the video.
For a video with N frames, in the initialization step there are K equally-sized superframes with approximately N/K frames each. Since the initial length of each superframe is S = N/K, following [4] we assume that the search area for the best position of a cluster center is a 2S × 2S area around each cluster center over the location of the video frames.

Following the cluster center initialization, a distance measure is used to specify which cluster each frame belongs to. We use D_s as the distance measure, defined as follows; the distance between cluster k and frame i is calculated by:

    d_c = \sqrt{\sum (X_k - X_i)^2}, \qquad X = (x_1, \ldots, x_f)
    d_s = fr_k - fr_i                                                    (1)
    D_s = \sqrt{ (d_c / m)^2 + (d_s / S)^2 }

where x is a feature value and X is the feature vector of 19 values per video frame, comprising 11 values for the histogram of the magnitude of flow and 8 values for the histogram of the direction of flow. We consider the frame location separately as fr. S is the interval and m is a measure of compactness of a superframe; based on our experiments we choose it as 10% of the input number of clusters, i.e. m = 0.1 ∗ K, to make the comparison of results easier.
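A minimal sketch of this distance measure, under the same descriptor layout as above; the 19 motion features are scaled by the compactness m and the frame index by the interval S, as in Eq. 1.

```python
import numpy as np

def superframe_distance(center, frame_feat, m, S):
    """Distance D_s of Eq. 1 between a cluster center and a frame.

    Both inputs are 20-dimensional vectors [HOM_1..11, HOD_1..8, fr];
    the first 19 entries are motion features and the last is the frame index.
    """
    d_c = np.linalg.norm(center[:19] - frame_feat[:19])   # motion distance
    d_s = center[19] - frame_feat[19]                      # temporal distance
    return np.sqrt((d_c / m) ** 2 + (d_s / S) ** 2)
```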
After the initialization of the K cluster centers, we move each of them to the lowest-gradient position in a neighbourhood of 3 frames. The neighbourhood of 3 frames is chosen arbitrarily but is reasonable; this step is done to avoid choosing noisy frames as centers. The gradient for frame i is computed as in Eq. 2:

    G(i) = \sqrt{\sum (X(i+1) - X(i-1))^2}                               (2)

We then associate each frame in the video with the nearest cluster center within the search area of 2S × 2S. When all frames have been associated with their nearest cluster centers, a new center is computed for each cluster as the average of the feature values of all frames belonging to that cluster. We repeat this process until convergence, when the error drops below a threshold.

At the end of this process, we may have a few clusters whose length is very short, so we apply a postprocessing step that merges these very small clusters into the closer of their left or right neighbours. The whole procedure is summarized in Algorithm 1.

Algorithm 1 Video superframe clustering algorithm
 1: Inputs:
       Number of clusters: K
 2: Initialize:
       m = 0.1 ∗ K                                      ▷ compactness
       Cluster centers C_k = [x_1 … x_f]^T at regular steps S
 3: Perturb cluster centers within a neighbourhood to the lowest-gradient position
 4: repeat
 5:    S_t ← S_{t−1}
 6:    for each cluster center C_k do
 7:       Assign the best-matching frames from a 2S neighbourhood around the cluster center according to the distance measure (Eq. 1)
 8:    end for
 9:    Compute new cluster centers and the error E (L1 distance between previous and recomputed centers)
10: until E ≤ threshold
11: Postprocessing to remove very short clusters
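Putting the pieces together, the sketch below follows Algorithm 1 using the frame_descriptor and superframe_distance helpers from the previous snippets; the convergence threshold, the iteration cap, and the minimum cluster length used in the postprocessing step are illustrative values, not taken from the paper.

```python
import numpy as np

def superframe_clustering(features, K, threshold=1e-3, max_iter=50,
                          min_len_ratio=0.2):
    """SLIC-like temporal clustering following Algorithm 1.

    features: (N, 20) array of per-frame descriptors [HOM, HOD, fr].
    Returns an array of N superframe labels.
    """
    N = len(features)
    S = max(N // K, 1)                  # initial superframe length
    m = 0.1 * K                         # compactness

    # Initialize centers at regular steps, then perturb each to the
    # lowest-gradient frame in a 3-frame neighbourhood (Eq. 2).
    grad = np.linalg.norm(features[2:, :19] - features[:-2, :19], axis=1)
    centers_idx = []
    for c in range(S // 2, N, S):
        lo, hi = max(c - 1, 1), min(c + 2, N - 1)
        centers_idx.append(lo + np.argmin(grad[lo - 1:hi - 1]))
    centers = features[centers_idx].astype(float)

    labels = np.zeros(N, dtype=int)
    for _ in range(max_iter):
        dist = np.full(N, np.inf)
        # Assignment step: each center claims frames in a 2S window.
        for k, ck in enumerate(centers):
            fr = int(ck[19])
            lo, hi = max(fr - S, 0), min(fr + S + 1, N)
            for i in range(lo, hi):
                d = superframe_distance(ck, features[i], m, S)
                if d < dist[i]:
                    dist[i], labels[i] = d, k
        # Update step: recompute centers as the mean of assigned frames.
        new_centers = centers.copy()
        for k in range(len(centers)):
            members = features[labels == k]
            if len(members):
                new_centers[k] = members.mean(axis=0)
        error = np.abs(new_centers - centers).sum()     # L1 distance
        centers = new_centers
        if error <= threshold:
            break

    # Postprocessing: merge very short clusters into a temporal neighbour.
    for k in np.unique(labels):
        idx = np.where(labels == k)[0]
        if len(idx) < min_len_ratio * S and idx[0] > 0:
            labels[idx] = labels[idx[0] - 1]
    return labels
```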
B. Evaluation criteria for the performance of the algorithm

Superpixel algorithms are usually assessed using two error metrics for evaluating segmentation quality: boundary recall and under-segmentation error. Boundary recall measures how well a superframe segmentation adheres to the ground-truth boundaries; higher boundary recall therefore indicates better adherence to video segment boundaries. Let S = {S_1, ..., S_H} be a superframe segmentation with H clusters and G = {G_1, ..., G_L} be a ground-truth segmentation with L clusters. Boundary recall Rec(S, G) is then defined as:

    Rec(S, G) = \frac{TP(S, G)}{TP(S, G) + FN(S, G)}                     (3)

where TP(S, G) is the number of boundary frames in G for which there is a boundary frame in S within range r, and FN(S, G) is the number of boundary frames in G for which there is no boundary frame in S within range r. In simple words, boundary recall, Rec, is the fraction of boundary frames in G which are correctly detected in S. In this work, the range r, a tolerance parameter, is set to 0.008 times the video length in frames, based on our experiments.
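A small sketch of boundary recall as defined in Eq. 3, assuming each segmentation is given as a list of boundary frame indices; the tolerance r is derived from the video length as described above.

```python
import numpy as np

def boundary_recall(pred_boundaries, gt_boundaries, n_frames, ratio=0.008):
    """Fraction of ground-truth boundary frames matched by a predicted
    boundary within +/- r frames (Eq. 3)."""
    r = ratio * n_frames
    pred = np.asarray(pred_boundaries, dtype=float)
    tp = sum(np.any(np.abs(pred - g) <= r) for g in gt_boundaries)
    fn = len(gt_boundaries) - tp
    return tp / (tp + fn) if (tp + fn) else 1.0
```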
Under-segmentation error is a second error metric, which measures the leakage of superframes with respect to the ground-truth segmentation: the lower the under-segmentation error, the better the match between the superframes and the ground-truth segments. We define the under-segmentation error UE(S, G) as follows:

    UE(S, G) = \frac{1}{N} \sum_{i=1}^{L} \Big( \sum_{j : |s_j \cap g_i| > \beta} \min\{ |s_j \cap g_i|,\; |s_j - g_i| \} \Big)        (4)

where N is the number of frames in the video, L is the number of ground-truth superframes, |·| denotes the length of a segment in frames, and s_j − g_i = {x ∈ s_j | x ∉ g_i}. Based on our experiments, we found that the best value for the overlap threshold β is 0.25 of the length of each superframe s_j.
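A corresponding sketch of the under-segmentation error of Eq. 4, here taking segmentations as per-frame label arrays; the overlap threshold β is taken as 0.25 of each superframe's length, as stated above.

```python
import numpy as np

def under_segmentation_error(pred_labels, gt_labels, beta_ratio=0.25):
    """Under-segmentation error (Eq. 4) for per-frame label arrays."""
    pred_labels = np.asarray(pred_labels)
    gt_labels = np.asarray(gt_labels)
    N = len(gt_labels)
    error = 0.0
    for g in np.unique(gt_labels):                      # ground-truth segments g_i
        g_mask = gt_labels == g
        for s in np.unique(pred_labels):                # superframes s_j
            s_mask = pred_labels == s
            overlap = np.sum(s_mask & g_mask)
            if overlap > beta_ratio * np.sum(s_mask):   # |s_j ∩ g_i| > beta
                leak = np.sum(s_mask & ~g_mask)         # |s_j - g_i|
                error += min(overlap, leak)
    return error / N
```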
III. EXPERIMENTS

The proposed algorithm requires two initializations: K (the number of desired clusters) and the compactness. In our work, we initialize the compactness to 0.1 ∗ K to make the evaluation of results easier.

Experiments are carried out using the MAD¹ and UMN databases. The MAD² database [20] is recorded using a Microsoft Kinect sensor in an indoor environment, with a total of 40 video sequences of 20 subjects. Each subject performs 35 sequential actions twice, and each video is about 4000–7000 frames long. We also labelled superframes manually for each video; these labels indicate the frames at which there is a big change of motion, and thus delimit the video segments with similar motion, the superframes.

¹ Multimodal Action Database
² Videos available from www.humansensing.cs.cmu.edu/mad/

UMN³ is an unusual crowd activity dataset [21], a staged dataset that depicts sparsely populated areas. Normal crowd activity is observed until a specified point in time, at which behavior rapidly evolves into an escape scenario in which each individual runs out of camera view to simulate panic. The dataset comprises 11 separate video samples that start by depicting normal behavior before changing to abnormal. The panic scenario is filmed in three different locations, one indoors and two outdoors. All footage is recorded at a frame rate of 30 frames per second at a resolution of 640×480 using a static camera.

³ Videos available from mha.cs.umn.edu/proj_events.shtml

Each frame is represented by 20 features: 11 values for the HOM features, 8 values for the HOD features, and one value for the frame location in the video. The HOF features provide a normalized histogram for each frame of the video.

A dense optical flow [1] is used in this work to detect motion for each video frame. We employed FlowNet-2, which has improved quality and speed compared to other optical flow algorithms. On the MAD and UMN databases it took only about 50 ms and 40 ms per frame, respectively, to extract the flow using FlowNet-2. Figure 2 illustrates the results of FlowNet-2 on both databases.

Fig. 2. Examples of optical flow results using FlowNet-2 on a frame of a video from the MAD and UMN databases. Panels: (a) MAD database, (b) UMN database, (c) flow for frame #326, (d) flow for frame #257.

To compare our results against a baseline, we use the phase correlation between two groups of frames to determine the relative translative movement between them, as proposed in [18]. This method relies on estimating the maximum of the phase correlation, which is defined as the inverse Fourier transform of the normalized cross-spectrum between two space-time volumes in the video sequence. We refer to this phase-correlation technique as PC-clustering in this work. We sampled every 2nd frame in each video and chose a sub-section of each frame: we use the 240 × 240 pixels in the middle of each frame to carry out the phase correlation. Each space-time volume is therefore 240 × 240 × 30, where 30 is the number of frames and denotes the temporal length of each volume.

Fig. 3. The relation between the threshold and the number of clusters in one of the videos of the MAD database for the PC-clustering technique [18].

Given two space-time volumes v_a and v_b, we calculate the normalized cross-correlation using the Fourier transform, taking the complex conjugate of the second result (denoted by ∗), and then find the location of the peak using Eq. 6:

    F_a = \mathcal{F}(v_a), \quad F_b = \mathcal{F}(v_b), \quad R = \frac{F_a \circ F_b^{*}}{|F_a \circ F_b^{*}|}        (5)

where ◦ is the element-wise product, and

    r = \mathcal{F}^{-1}\{R\}, \quad corr = \arg\max_{v}\{r\}                                                          (6)

where corr is the correlation between the two space-time volumes v_a and v_b. We calculate this correlation for every 2nd frame of each video, and we conclude that there is a big change in the video when the cross-correlation falls below a threshold (see Figure 3).
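As a rough sketch of this PC-clustering baseline (Eqs. 5 and 6), assuming the video has been loaded as a grayscale frame array; the volume size, the centre crop, and the every-2nd-frame sampling follow the description above, while the pairing of consecutive volumes and the threshold value are our assumptions.

```python
import numpy as np

def phase_correlation_peak(vol_a, vol_b):
    """Peak of the normalized cross-power spectrum between two
    space-time volumes (Eqs. 5 and 6)."""
    Fa = np.fft.fftn(vol_a)
    Fb = np.fft.fftn(vol_b)
    cross = Fa * np.conj(Fb)
    R = cross / np.maximum(np.abs(cross), 1e-12)    # normalized cross-spectrum
    r = np.fft.ifftn(R).real
    return r.max()

def pc_clustering_boundaries(frames, vol_len=30, stride=2, threshold=0.2):
    """Detect change points where the phase correlation between consecutive
    240x240x`vol_len` volumes (sampled every `stride` frames) drops below a
    threshold. `frames` is an (N, H, W) grayscale array; the threshold value
    here is illustrative, not taken from the paper.
    """
    N, H, W = frames.shape
    top, left = (H - 240) // 2, (W - 240) // 2
    crop = frames[:, top:top + 240, left:left + 240][::stride]

    boundaries = []
    for start in range(0, len(crop) - 2 * vol_len, vol_len):
        vol_a = crop[start:start + vol_len]
        vol_b = crop[start + vol_len:start + 2 * vol_len]
        if phase_correlation_peak(vol_a, vol_b) < threshold:
            boundaries.append((start + vol_len) * stride)   # original frame index
    return boundaries
```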
Figure 4 illustrates how boundary recall and under-segmentation error change with the number of clusters.

Fig. 4. Boundary recall and under-segmentation error as a function of the number of clusters, for one of the videos of the MAD database, using the proposed algorithm.

To test the performance of our method for clustering videos into superframes, we consider two groups of features: first, the averaged value of the optical flow over u and v, i.e. the x and y components of the flow; and second, the histogram of magnitude and direction of flow. Figure 5 shows the boundary recall with respect to the input number of desired superframes for one of the videos (sub01_seq01) in the MAD dataset. For this video, the number of superframes in the ground truth is 71, and for K > 70 the HOF features give an outstanding improvement in boundary recall over the averaged features.

Fig. 5. Comparison of results when using the averaged flow components and the histogram of flow. The former is shown with pink triangles and the latter with green circles.
As discussed before, the output number of clusters is usually smaller than the input number of desired clusters K, since in the postprocessing step some clusters are merged with other clusters when their length is too short. Figure 6 illustrates the relation between the output number of clusters and the boundary recall for a video from the MAD database with 71 ground-truth clusters. The ground-truth number of clusters is shown using a red horizontal line in the figure. It shows that when the output number of clusters for this video is 71 or more, K is more than 115 and the boundary recall is greater than 91%.

Fig. 6. Boundary recall and the number of output clusters for a video from the MAD database, shown in blue circles and orange squares, respectively. The number of superframes in the ground truth for this video is 71, which is shown by a red line.

The ground-truth and the resulting boundaries for a video of the MAD database are shown in Figure 7. The difference between them is also shown in this figure; it is almost zero except for clusters 50 to 62, between frames 1600 and 2100.

Fig. 7. Comparison of results when using the averaged magnitude/direction and HOF. The number of ground-truth clusters for this video is 71 and the boundary recall is 0.91. The averaged flow results are shown with pink triangles and the HOF results with green circles.

As stated before, a combination of under-segmentation error and boundary recall is a good evaluation metric for the algorithm. Under-segmentation error accurately evaluates the quality of a superframe segmentation by penalizing superframes that overlap with other superframes, while higher boundary recall indicates that fewer true boundaries are missed. Figure 8 shows the experimental results for both boundary recall and under-segmentation error, averaged over all 40 videos in the MAD database. According to this figure, for K larger than about 122, the boundary recall is 100% and the under-segmentation error is less than 50%.

Fig. 8. Quantitative evaluation of our algorithm by averaging over the 40 videos in the MAD dataset.

A quantitative evaluation over all videos in each dataset was also carried out, and the results are shown in Table I. These results are calculated by averaging over all the videos in each dataset when the number of clusters is about 115. The experimental results show that our model works quite well on both datasets.

TABLE I
QUANTITATIVE EVALUATION FOR VIDEO CLUSTERING

                                    MAD database        UMN database
                                    UE       Rec        UE       Rec
    PC-clustering [18]              0.54     0.64       0.27     0.25
    Averaged flow                   0.52     0.77       0.19     0.63
    Histogram of flow (proposed)    0.31     0.99       0.17     0.71

Regarding this table, the histogram of flow works better as a feature than simply averaging the flow or using phase correlation between volumes in the video. There is a 22% improvement in boundary recall for the MAD dataset and 8% for the UMN dataset. The boundary recall is quite low for the baseline on the UMN dataset, although its under-segmentation error is still very low. An interesting point in this table is that all methods achieve higher boundary recall on the MAD dataset, but lower under-segmentation error on the UMN dataset.

IV. CONCLUSION

In this paper, we proposed using Histogram of Optical Flow (HOF) features to cluster video frames. These features are independent of the scale of the moving objects. Using such atomic video segments to speed up later-stage visual processing has recently been explored in several works. The number of desired clusters, K, is the input to our algorithm; this parameter is very important and may cause over- or under-segmentation of the video.

Our superframe method divides a video into superframes by minimizing a distance-based cost function, which makes each superframe belong to a single motion pattern and not overlap with others. We used FlowNet-2 to extract motion information per video frame to describe video sequences, and quantitatively evaluated our method on two databases, MAD and UMN.

One interesting direction for future work is to estimate a saliency score for each of these superframes. This would allow us to rank the different episodes of a video by their saliency scores and to detect abnormalities in videos as the most salient segments.

This work was supported in part by the European Union's Horizon 2020 Programme for Research and Innovation Actions within IoT (2016): Large Scale Pilots: Wearables for smart ecosystem (MONICA).

REFERENCES

[1] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of optical flow estimation with deep networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[2] V. Bloom, D. Makris, and V. Argyriou, "Clustered spatio-temporal manifolds for online action recognition," in International Conference on Pattern Recognition (ICPR), 2014, pp. 3963–3968.
[3] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles, "ActivityNet: A large-scale video benchmark for human activity understanding," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 961–970.
[4] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 34, no. 11, pp. 2274–2282, 2012.
[5] M. Grundmann, V. Kwatra, M. Han, and I. Essa, "Efficient hierarchical graph-based video segmentation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[6] W.-S. Chu, Y. Song, and A. Jaimes, "Video co-summarization: Video summarization by visual co-occurrence," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3584–3592.
[7] Q. Tu, A. Men, Z. Jiang, F. Ye, and J. Xu, "Video saliency detection incorporating temporal information in compressed domain," Image Communication, vol. 38, no. C, pp. 32–44, 2015.
[8] W. G. Aref, A. C. Catlin, J. Fan, A. K. Elmagarmid, M. A. Hammad, I. F. Ilyas, M. S. Marzouk, and X. Zhu, "A video database management system for advancing video database research," in International Workshop on Management Information Systems, 2002, pp. 8–17.
[9] A. Hampapur, "Designing video data management systems," Ph.D. dissertation, The University of Michigan, Ann Arbor, MI, USA, 1995.
[10] X. Ren and J. Malik, "Learning a classification model for segmentation," in IEEE International Conference on Computer Vision (ICCV), 2003, pp. 10–17.
[11] M. Reso, J. Jachalsky, B. Rosenhahn, and J. Ostermann, "Temporally consistent superpixels," in IEEE International Conference on Computer Vision (ICCV), 2013, pp. 385–392.
[12] J. Chang, D. Wei, and J. W. Fisher III, "A video representation using temporal superpixels," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2051–2058.
[13] C. Xu and J. J. Corso, "Evaluation of super-voxel methods for early video processing," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 1202–1209.
[14] S.-H. Lee, W.-D. Jang, and C.-S. Kim, "Temporal superpixels based on proximity-weighted patch matching," in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3610–3618.
[15] X. Yi and N. Ling, "Fast pixel-based video scene change detection," in IEEE International Symposium on Circuits and Systems (ISCAS), 2005.
[16] M. Gygli, "Ridiculously fast shot boundary detection with fully convolutional neural networks," Computing Research Repository (CoRR), vol. abs/1705.08214, 2017.
[17] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient graph-based image segmentation," International Journal of Computer Vision (IJCV), vol. 59, no. 2, pp. 167–181, 2004.
[18] I. Kotsia and V. Argyriou, "Action spotting exploiting the frequency domain," in CVPR Workshops, 2011, pp. 43–48.
[19] V. Argyriou and T. Vlachos, "Quad-tree motion estimation in the frequency domain using gradient correlation," IEEE Transactions on Multimedia, vol. 9, no. 6, pp. 1147–1154, 2007.
[20] D. Huang, S. Yao, Y. Wang, and F. De La Torre, "Sequential max-margin event detectors," in European Conference on Computer Vision (ECCV), 2014, pp. 410–424.
[21] R. Mehran, A. Oyama, and M. Shah, "Abnormal crowd behavior detection using social force model," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 935–942.
