Superframes, A Temporal Video Segmentation
Abstract—The goal of video segmentation is to turn video data into a set of concrete motion clusters that can be easily interpreted as building blocks of the video. There is some work on related topics, such as detecting scene cuts in a video, but little specific research on clustering video data into a desired number of compact segments. It would be more intuitive, and more efficient, to work with perceptually meaningful entities obtained from a low-level grouping process, which we call 'superframes'. This paper presents a new, simple, and efficient technique to detect superframes of similar content patterns in videos. We calculate content-motion similarity to obtain the strength of change between consecutive frames. With the help of an existing optical flow technique using deep models, the proposed method is able to perform accurate motion estimation efficiently. We also propose two criteria for measuring and comparing the performance of different algorithms on various databases. Experimental results on videos from benchmark databases demonstrate the effectiveness of the proposed method.

I. INTRODUCTION

In computer vision, many existing algorithms for video analysis use a fixed number of frames for processing, for example optical flow or motion estimation techniques [1] and human activity recognition [2], [3]. However, it would be more intuitive, and more efficient, to work with perceptually meaningful entities obtained from a low-level grouping process, which we call 'superframes'.

Similar to superpixels [4], which are key building blocks of many algorithms and significantly reduce the number of image primitives compared to pixels, superframes do the same in the time domain. They can be used in many different applications such as video segmentation [5], video summarization [6], and video saliency detection [7]. They are also useful in the design of a video database management system [8] that manages a collection of video data and provides content-based access to users [9]. Video data modeling, insertion, storage organization and management, and video data retrieval are among the basic problems addressed in a video database management system, and they can be solved more efficiently by using superframes. By temporal clustering of the video, it is easier to identify the significant segments of the video and thereby achieve better representation, indexing, storage, and retrieval of the video data.

The main goal of this work is automatic temporal clustering of a video by analyzing its visual content and partitioning it into a set of units called superframes. This process can also be referred to as video data segmentation. Each segment is defined as a continuous sequence of video frames which have no significant inter-frame difference in terms of their motion content. Motion is the main criterion we use in this work; we therefore assume all the videos are taken from a single fixed camera.

There is little literature specifically on temporal segmentation of video, but there is work in related areas. In this section, we briefly discuss the techniques most relevant to this work: temporal superpixels, scene cut detection, and video segmentation.

The main idea of using superpixels as primitives in image processing was introduced by Ren and Malik in [10]. Using superpixels instead of raw pixel data is beneficial for video applications as well. Although superpixel algorithms were until recently applied mainly to still images, researchers have started to apply them to video sequences, and there are some recent works that use the temporal connection between consecutive frames. Reso et al. [11] proposed a new method for generating superpixels in a video with temporal consistency. Their approach performs energy-minimizing clustering using a hybrid clustering strategy for a multi-dimensional feature space. This space is separated into a global color subspace and multiple local spatial subspaces. A sliding window consisting of multiple consecutive frames is used, which makes the method suitable for processing arbitrarily long video sequences.

A generative probabilistic model is proposed by Chang et al. in [12] for temporally consistent superpixels in video sequences; it uses past and current frames and scales linearly with video length. They presented a low-level video representation which is closely related to volumetric voxels [13], but still different in that temporal superpixels are designed mainly for video data, whereas supervoxels are for 3D volumetric data.

Lee et al. in [14] developed a temporal superpixel algorithm based on proximity-weighted patch matching. They estimated superpixel motion vectors by combining the patch matching distances of neighboring superpixels and the target superpixel. Then, they initialized the superpixel label of each pixel in a frame by mapping the superpixel labels in the previous frames using the motion vectors. Next, they refined the initial superpixels by updating the superpixel labels of boundary pixels iteratively based on a cost function. Finally, they performed some postprocessing including superpixel splitting, merging, and relabeling.
Fig. 1. An example of superframes in a video from MAD database
In video indexing, archiving, and video communication tasks such as rate control, scene change detection plays an important role. It can be very challenging when scene changes are small, and other changes such as brightness variation may cause false detections. Various methods have been proposed for scene change detection. Yi and Ling in [15] proposed a simple technique to detect sudden and unexpected scene changes based only on pixel values, without any motion estimation. They first screen out many non-scene-change frames and then normalize the remaining frames using a histogram equalization process.

A fully convolutional neural network has been used for the shot boundary detection task in [16] by Gygli. He treated this task as a binary classification problem: correctly predicting whether a frame is part of the same shot as the previous frame or not. He also created a new dataset of synthetic data with one million frames to train this network.

Video segmentation aims to group perceptually and visually similar video frames into spatio-temporal regions, a method applicable to many higher-level tasks in computer vision such as activity recognition, object tracking, content-based retrieval, and visual enhancement. In [5], Grundmann et al. presented a technique for spatio-temporal segmentation of long video sequences. Their work is a generalization of Felzenszwalb and Huttenlocher's [17] graph-based image segmentation technique. They use a hierarchical graph-based algorithm to make an initial over-segmentation of the video volume into relatively small space-time regions, using optical flow as a region descriptor for graph nodes.

Kotsia et al. in [18] proposed using the 3D gradient correlation function operating in the frequency domain for action spotting in a video sequence. They used the 3D Fourier transform, which is invariant to spatiotemporal changes and frame recording. In their work, the estimation of motion relies on detecting the maximum of the cross-correlation function between two blocks of video frames [19].

Similar to all these tasks, in this work we propose a simple and efficient motion-based method to segment a video over time into compact episodes. This grouping leads to increased computational efficiency for subsequent processing steps and allows for more complex algorithms that are computationally infeasible at the frame level. Our algorithm detects major changes in video streams, such as when an action begins and ends. The only inputs to our method are a video and a number giving the desired number of clusters in that video. The output is the set of frame numbers marking the boundaries between the uniform clusters. Visual features like color and texture per frame are of low importance compared to motion features. This motivates us to employ motion information to detect big changes over the video frames. We use FlowNet-2 [1], a very accurate deep-network optical flow estimator, to extract the motion between consecutive frames. We then use both the average and the histogram of flow over video frames and compare our results against a baseline.

II. THE PROPOSED SUPERFRAME TECHNIQUE

Our superframe segmentation algorithm detects the boundaries between temporal clusters in video frames. The algorithm takes the number of desired clusters, K, as input and generates superframes based on motion similarity and proximity in the video frames. In other words, the histograms of motion magnitude and direction per frame, together with the frame position in the video, are used as features to cluster video frames. Our superframe technique is motivated by SLIC [4], a superpixel algorithm for images, which we generalize to video data (see Figure 1).

A. The proposed algorithm

Our model segments the video using motion cues. We assume the videos are taken with a fixed camera; therefore, one can represent the type of motion of the foreground object by computing features from the optical flow. We first apply FlowNet-2 [1] to get the flow for each video frame, and then build a histogram of magnitude (HOM) and a histogram of direction (HOD) of the flow per frame to initialize K cluster centers C_1, ..., C_K. Each cluster center therefore has 20 feature values, C_k = [HOM_{1..11}, HOD_{1..8}, fr]^T with k = [1, K], where HOM_{1..11} is the 'Histogram of Magnitude' (11 bins), HOD_{1..8} is the 'Histogram of Direction' over 8 different directions, and fr is the frame index in the video.
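To make the feature construction concrete, the sketch below builds the 20-dimensional per-frame vector [HOM_{1..11}, HOD_{1..8}, fr] from a dense flow field. This is our illustration, not the authors' code: the magnitude range and clipping value max_mag are assumptions, while the bin counts (11 and 8) and the per-frame normalization follow the text.

```python
import numpy as np

def frame_features(flow, frame_idx, n_mag_bins=11, n_dir_bins=8, max_mag=20.0):
    """Build the 20-D feature vector [HOM_1..11, HOD_1..8, fr] for one frame.

    flow: (H, W, 2) array of per-pixel (dx, dy) flow, e.g. from FlowNet-2.
    max_mag is a hypothetical clipping value; the paper does not specify one.
    """
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx)  # direction in [-pi, pi]

    # Histogram of Magnitude (HOM), normalized per frame as stated in Sec. III
    hom, _ = np.histogram(np.clip(mag, 0, max_mag), bins=n_mag_bins,
                          range=(0, max_mag))
    hom = hom / max(hom.sum(), 1)

    # Histogram of Direction (HOD) over 8 directions, also normalized
    hod, _ = np.histogram(ang, bins=n_dir_bins, range=(-np.pi, np.pi))
    hod = hod / max(hod.sum(), 1)

    return np.concatenate([hom, hod, [frame_idx]])
```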
For a video with N frames, in the initialization step there are K equally sized superframes with approximately N/K frames each. Since the initial length of each superframe is S = N/K, following [4] we can safely assume that the search range for the best position of a cluster center is 2S frames around each cluster center along the video timeline.

Following cluster center initialization, a distance measure is needed to determine which cluster each frame belongs to. We use the distance measure D_s, defined as follows; the distance between cluster k and frame i is calculated by

$$d_c = \sqrt{\sum (X_k - X_i)^2}, \qquad X = (x_1, \ldots, x_f)$$
$$d_s = fr_k - fr_i \tag{1}$$
$$D_s = \sqrt{\left(\frac{d_c}{m}\right)^2 + \left(\frac{d_s}{S}\right)^2}$$

where x is a feature value and X is a feature vector of 19 values per video frame: 11 values for the histogram of flow magnitude and 8 values for the histogram of flow direction. The frame location is considered separately as fr. S is the interval, and m is a measure of superframe compactness, which, based on our experiments, we set to 10% of the input number of clusters (i.e., 0.1 ∗ K) to make the comparison of results easier.
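Eq. (1) can be transcribed directly. The sketch below is ours, with hypothetical argument names, assuming each frame is described by the 19 flow-histogram values followed by its frame index:

```python
import numpy as np

def superframe_distance(center, frame, m, S):
    """Eq. (1): center and frame are 20-D vectors [19 flow features, frame index].

    m is the compactness weight (0.1 * K in this paper), S the initial
    superframe length N / K.
    """
    d_c = np.linalg.norm(center[:19] - frame[:19])  # flow-feature distance
    d_s = center[19] - frame[19]                    # temporal distance
    return np.sqrt((d_c / m) ** 2 + (d_s / S) ** 2)
```

Dividing d_s by S makes the temporal term scale-free, so m alone controls the trade-off between motion similarity and temporal compactness.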
After the initialization of the K cluster centers, we move each of them to the lowest gradient position within a neighborhood of 3 frames. This neighborhood size is chosen arbitrarily but is reasonable; the step is done to avoid choosing noisy frames as centers. The gradient for frame i is computed as in Eq. (2):

$$G(i) = \sqrt{\sum \left(X(i+1) - X(i-1)\right)^2} \tag{2}$$

We then associate each frame in the video with the nearest cluster center within the 2S search range. When all frames have been associated with their nearest cluster centers, a new center is computed for each cluster as the average of the flow features of all frames belonging to that cluster. We repeat this process until convergence, when the error falls below a threshold.

At the end of this process, we may have a few clusters whose length is very short. We therefore apply a postprocessing step that merges these very small clusters into the closest left or right cluster. The whole procedure is summarized in Algorithm 1.

Algorithm 1 Video superframe clustering algorithm
1: Inputs: Number of clusters: K
2: Initialize: m = 0.1 ∗ K (compactness); cluster centers Ck = [x1 . . . xf]^T at regular steps S
3: Perturb cluster centers in a neighborhood, to the lowest gradient position
4: repeat
5:   St ← St−1
6:   for each cluster center Ck do
7:     Assign the best-matching frames from a 2S neighbourhood around the cluster center according to the distance measure (Eq. 1)
8:   end for
9:   Compute new cluster centers and error E (L1 distance between previous and recomputed centers)
10: until E ≤ threshold
11: Postprocessing to remove very short clusters
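To make the procedure concrete, here is a minimal end-to-end sketch of Algorithm 1; it is our reconstruction under the assumptions already noted (feature layout with the frame index last, regular-step initialization, an L1 convergence test), not the authors' released code, and it omits the gradient perturbation step (Eq. 2) and the final merge of short clusters for brevity.

```python
import numpy as np

def superframe_clustering(features, K, tol=1e-3, max_iter=100):
    """features: (N, 20) array, one row per frame: [19 flow features, frame index].

    Returns an (N,) array of superframe labels.
    """
    N = features.shape[0]
    S = N / K
    m = 0.1 * K
    # Initialize centers at regular steps of S along the timeline
    centers = features[(np.arange(K) * S + S / 2).astype(int)].astype(float)
    labels = np.zeros(N, dtype=int)

    for _ in range(max_iter):
        dist = np.full(N, np.inf)
        for k in range(K):
            # Only frames within a 2S window around the center are candidates
            lo = max(0, int(centers[k, 19] - S))
            hi = min(N, int(centers[k, 19] + S) + 1)
            cand = features[lo:hi]
            d_c = np.linalg.norm(cand[:, :19] - centers[k, :19], axis=1)
            d_s = cand[:, 19] - centers[k, 19]
            d = np.sqrt((d_c / m) ** 2 + (d_s / S) ** 2)  # Eq. (1)
            better = d < dist[lo:hi]
            dist[lo:hi] = np.where(better, d, dist[lo:hi])
            labels[lo:hi] = np.where(better, k, labels[lo:hi])
        # Recompute centers as the mean of assigned frames; E is the L1 change
        new_centers = np.array([features[labels == k].mean(axis=0)
                                if np.any(labels == k) else centers[k]
                                for k in range(K)])
        E = np.abs(new_centers - centers).sum()
        centers = new_centers
        if E <= tol:
            break
    return labels
```

The per-cluster 2S window mirrors the locality constraint of SLIC, keeping each iteration linear in the number of frames.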
B. Evaluation criteria for the performance of the algorithm

Superpixel algorithms are usually assessed using two error metrics for segmentation quality: boundary recall and under-segmentation error. Boundary recall measures how well a superframe segmentation adheres to the ground-truth boundaries; higher boundary recall therefore indicates better adherence to video segment boundaries. Let S = {S1, . . . , SH} be a superframe segmentation with H clusters and G = {G1, . . . , GL} be a ground-truth segmentation with L clusters. Boundary recall Rec(S, G) is then defined as:

$$Rec(S, G) = \frac{TP(S, G)}{TP(S, G) + FN(S, G)} \tag{3}$$

where TP(S, G) is the number of boundary frames in G for which there is a boundary frame in S within range r, and FN(S, G) is the number of boundary frames in G for which there is no boundary frame in S within range r. In simple words, boundary recall is the fraction of boundary frames in G which are correctly detected in S. In this work, the range r, a tolerance parameter, is set to 0.008 times the video length in frames, based on our experiments.

Under-segmentation error is another error metric; it measures the leakage of superframes with respect to the ground-truth segmentation. The lower the under-segmentation error, the better the match between the superframes and the ground-truth segments. We define the under-segmentation error UE(S, G) as follows:

$$UE(S, G) = \frac{1}{N} \sum_{i=1}^{L} \sum_{j\,:\,|s_j \cap g_i| > \beta} \min\{\,|s_j \cap g_i|,\ |s_j - g_i|\,\} \tag{4}$$

where N is the number of frames in the video, L is the number of ground-truth superframes, |·| indicates the length of a segment in frames, and $s_j - g_i = \{x \in s_j \mid x \notin g_i\}$. Through experiments, we found that the best value of the overlap threshold β is 0.25 of the length of each superframe sj.
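Both metrics are easy to transcribe from Eqs. (3) and (4). The sketch below is our illustration, not the authors' code; representing segments as (start, end) frame intervals and boundaries as frame indices is an assumption.

```python
def boundary_recall(pred_bounds, gt_bounds, r):
    """Eq. (3): fraction of ground-truth boundary frames matched by a
    predicted boundary within +/- r frames."""
    tp = sum(any(abs(g - p) <= r for p in pred_bounds) for g in gt_bounds)
    fn = len(gt_bounds) - tp
    return tp / (tp + fn) if (tp + fn) else 1.0

def under_segmentation_error(pred_segs, gt_segs, n_frames, beta=0.25):
    """Eq. (4): pred_segs and gt_segs are lists of (start, end) frame
    intervals, end exclusive. beta is the overlap threshold, taken as
    0.25 of each superframe's length."""
    err = 0
    for gs, ge in gt_segs:
        for ss, se in pred_segs:
            overlap = max(0, min(ge, se) - max(gs, ss))
            if overlap > beta * (se - ss):      # |s_j ∩ g_i| > β·|s_j|
                leak = (se - ss) - overlap      # |s_j − g_i|
                err += min(overlap, leak)
    return err / n_frames
```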
III. EXPERIMENTS

The proposed algorithm requires two initializations: K (the number of desired clusters) and the 'compactness'. In our work, we initialize the compactness to 0.1 ∗ K to make the evaluation of results easier.

Experiments are carried out on the 'MAD' (Multimodal Action Database) and 'UMN' databases. The MAD database [20] (videos available from www.humansensing.cs.cmu.edu/mad/) is recorded using a Microsoft Kinect sensor in an indoor environment, with a total of 40 video sequences of 20 subjects. Each subject performs 35 sequential actions twice, and each video is about 4000–7000 frames long. We also labelled the superframes manually for each video, marking the frames where there is a big change of motion; these labels delineate the video segments with similar motion, i.e., the superframes.

UMN is an unusual crowd activity dataset [21]: a staged dataset that depicts sparsely populated areas. Normal crowd activity is observed until a specified point in time, after which the behavior rapidly evolves into an escape scenario in which each individual runs out of camera view to simulate panic. The dataset comprises 11 separate video samples that start by depicting normal behavior before changing to abnormal. The panic scenario is filmed in three different locations, one indoors and two outdoors. All footage is recorded at a frame rate of 30 frames per second at a resolution of 640×480 using a static camera.

Each frame is represented by 20 features: 11 values for the HOM features, 8 values for the HOD features, and one value for the frame location in the video. The histogram-of-flow (HOF) features provide a normalized histogram for each frame of the video.
A dense optical flow [1] is used in this work to detect motion in each video frame. We employ FlowNet-2, which has improved quality and speed compared to other optical flow algorithms. With the MAD and UMN databases it took only about 50 ms and 40 ms per frame, respectively, to extract the flow using FlowNet-2. Figure 2 illustrates the results of FlowNet-2 on both databases.

As a baseline, we detect changes from the peak of the normalized cross-spectrum between two space-time volumes in the video sequence. We call this phase-correlation technique PC-clustering in this work. We sampled every 2nd frame in each video and chose a sub-section of each frame: the 240 × 240 pixels in the middle of each frame are used to carry out the phase correlation. Each space-time volume is therefore 240 × 240 × 30, where 30 is the number of frames, i.e., the temporal length of each volume.

Fig. 3. The relation between the threshold and the number of clusters in one of the videos of the MAD database with the PC-clustering technique [18].

Given two space-time volumes va and vb, we calculate the normalized cross-correlation using the Fourier transform, taking the complex conjugate of the second result (denoted by ∗), and then find the location of the peak using Eq. (6):

$$F_a = \mathcal{F}(v_a), \qquad F_b = \mathcal{F}(v_b), \qquad R = \frac{F_a \circ F_b^{*}}{\left| F_a \circ F_b^{*} \right|} \tag{5}$$

where ◦ is the element-wise product.
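As an illustration of this baseline, the following sketch computes the normalized cross-spectrum of Eq. (5) and recovers the correlation peak via an inverse FFT. This is our code, not the authors'; recovering the peak with an inverse transform is an assumption based on standard phase correlation, since Eq. (6) is not reproduced here.

```python
import numpy as np

def phase_correlation_peak(va, vb):
    """Normalized cross-spectrum of two space-time volumes (Eq. 5) and the
    location of the correlation peak (the step Eq. 6 refers to).

    va, vb: (H, W, T) arrays, e.g. 240 x 240 x 30 volumes as in the paper.
    """
    Fa = np.fft.fftn(va)
    Fb = np.fft.fftn(vb)
    cross = Fa * np.conj(Fb)                      # F_a ∘ F_b*
    R = cross / np.maximum(np.abs(cross), 1e-12)  # normalize magnitudes
    r = np.fft.ifftn(R).real                      # phase-correlation surface
    peak = np.unravel_index(np.argmax(r), r.shape)
    return peak, r[peak]
```

A high, sharp peak indicates that the two volumes are related by a simple shift (similar motion content), while a flat surface signals a change between them.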