A method for cut detection based on visual rhythm
SILVIO JAMIL FERZOLI GUIMARÃES¹ ², MICHEL COUPRIE²,
NEUCIMAR JERÔNIMO LEITE³, ARNALDO DE ALBUQUERQUE ARAÚJO¹
¹ NPDI/DCC - Universidade Federal de Minas Gerais, Caixa Postal 702, 30161-970, Belo Horizonte, MG, Brasil
[email protected]
² A2SI/ESIEE - Cité Descartes, BP 99, 93162, Noisy le Grand, France
[email protected]
³ Instituto de Computação - UNICAMP, Caixa Postal 6176, 13083-970, Campinas, SP, Brasil
[email protected]
Abstract. The visual rhythm is a simplification of the video content, represented by a 2D image. In this work,
the video segmentation problem is transformed into a problem of pattern detection, where each video effect is
transformed into a different pattern on the visual rhythm. To detect sharp video transitions (cuts), we use topological
and morphological tools instead of a dissimilarity measure. Thus, we propose a method to detect sharp video
transitions between two consecutive shots. We present a comparative analysis of our method with respect to several
other methods. We also propose a variant of this method to detect the position of flashes in a video.
1 Introduction
The video segmentation problem can be considered as a
problem of dissimilarity between images (or frames). Usually, the common approach to cope with this problem is
based on the use of a dissimilarity measure which allows us to
identify the boundary between consecutive shots. The simplest transitions between two consecutive shots are sharp
and gradual transitions [1]. A sharp transition (cut) is simply a concatenation of two consecutive shots. When there
is a gradual transition between two shots, new frames are
created from these shots [1]. In the literature, we can find different types of dissimilarity measures used for video segmentation, such as pixel-wise comparison, histogram-wise
comparison, etc. If two frames belong to the same shot,
then their dissimilarity measure should be small, and if two
frames belong to different shots, this measure should be
high; however, in the presence of effects like zoom, pan,
tilt and flash, this measure can be affected. So, the choice of a
good measure is essential for the quality of the segmentation results.
Another approach to the video segmentation problem
is to transform the video into a 2D image, and to apply methods of image processing to extract the different patterns
related to each transition. This approach can be found in [2,
3], where the transformed image is called visual rhythm [2]
or spatio-temporal slice [3]. Informally, the visual rhythm
is a simplification of the video content represented by a 2D
image. This simplification can be obtained from a systematic sampling of points of the video, such as the extraction of
the diagonal points of each frame. In Fig. 1, we illustrate an
example of point sampling from a video. Thus, the video segmentation problem is transformed into an image segmentation problem. In this work, we propose a method for cut
detection based on the analysis of the visual rhythm. We also propose a variant of this method for flash detection.
A comparative analysis involving our method and several
other methods shows that the proposed method for
cut detection presents the best results.
Figure 1: Visual rhythm.
This paper is organized as follows. In Sec. 2, we describe the basic tool used in this work, the visual rhythm.
In Sec. 3, we describe some related works. In Sec. 4, we
propose a method for cut detection based on the analysis of
the visual rhythm. In Sec. 5, we show a variant of our method
for flash detection. In Sec. 6, we present a comparative analysis involving our method and several other methods,
using quality measures. Finally, some conclusions and a
summary of future works are given in Sec. 7.
2 Visual rhythm
Let D ⊂ Z², D = {0, ..., M − 1} × {0, ..., N − 1}, where
M and N are the width and the height of each frame, respectively.
Definition 2.1 (Frame) A frame f_t is a function from D
to Z where, for each spatial position (x, y) in D, f_t(x, y)
represents the grayscale value of the pixel (x, y).

Definition 2.2 (Video) A video V, in domain D × Z⁺ (2D + t), can
be seen as a sequence of frames f_t and can be described by

    V = (f_t)_{t ∈ [0, T−1]}    (1)

where T is the number of frames contained in the video.
When we work directly on the video, we have to cope
with two main problems: the processing time and the choice
of a dissimilarity measure. In order to reduce the processing time and to use tools for 2D image segmentation
instead of a dissimilarity measure (as we will see in Sec.
4), we transform the video into a two-dimensional image,
called the visual rhythm [2, 3].
Definition 2.3 (Visual rhythm (Spatio-temporal slice))
Let V = (f_t)_{t ∈ [0, T−1]} be an arbitrary video, in domain
D × Z⁺. The visual rhythm #, in domain 1D × Z⁺, is a simplification of the video where each frame f_t is transformed
into a vertical line of the visual rhythm, defined by

    #(t, z) = f_t(r_x z + a, r_y z + b)    (2)

where z ∈ {0, ..., M_# − 1} and t ∈ {0, ..., N_# − 1}, M_#
and N_# are the height and the width of the visual rhythm,
respectively, r_x and r_y are ratios of pixel sampling, and a and b
are shifts on each frame. Thus, according to these parameters, different pixel samplings can be considered. For example, if r_x = r_y = 1, a = b = 0 and M_# = M = N, then we
obtain all pixels of the principal diagonal. If r_x = −1 and
r_y = 1, a = M − 1 and b = 0, and M_# = M = N, then we obtain
all pixels of the secondary diagonal. If r_x = 1 and r_y = 0,
a = 0 and b = N/2, then we obtain all pixels of a central horizontal line. If r_x = 0 and r_y = 1, a = M/2
and b = 0, then we obtain all pixels of a central vertical
line.
Fig. 2 shows two visual rhythms obtained from the same video with different pixel samplings: the principal diagonal (Fig. 2-top) and the central
vertical axis (Fig. 2-bottom). We can observe that
there are "vertical lines" in Fig. 2-bottom that do not correspond to sharp video transitions, but all sharp video transitions correspond to "vertical lines" on the visual rhythm.
3 Related works

In the literature, we find different approaches for cut detection,
amongst them those that are applied directly to the video
and those that are applied to a simplification of the video,
called the visual rhythm. In [4], we can find some methods for
cut detection.
3.1 Methods applied to the video
These methods represent the most common approach for
cut detection and are associated with dissimilarity measures.
In general, dissimilarity measures (calculated between each
pair of consecutive frames) are compared to a threshold to
detect a transition, but the choice of a good threshold is a problem, because the video segmentation result is highly dependent on the threshold value.
In [5], a methodology for cut detection is proposed, considering the mean of the pixel difference between
two consecutive frames as the dissimilarity measure. Afterwards, a morphological filter is applied to the one-dimensional signal computed from the dissimilarity measure.
Finally, a thresholding with a value of 20% of the maximum value of this signal is applied. Another approach is
to consider the histogram intersection [4] as the dissimilarity
measure. In theory, histograms of frames of the same shot
should be similar, that is, their dissimilarity measure should
be small.
3.2 Visual rhythm-based
Figure 2: Visual rhythm obtained from a real video using
different pixel samplings: principal diagonal (top) and central vertical line (bottom). The temporal positions of sharp
video transitions are indicated in the middle image.
The choice of the pixel sampling is a problem because
different samplings produce different visual rhythms with
different patterns. [2] presents some pixel samplings with
their corresponding visual rhythms, and it is claimed that the best
results are obtained when a sampling based on a diagonal
is used, because it contains both horizontal and vertical features.
In Fig. 2, we illustrate two visual rhythms obtained from
the same video with different pixel samplings.
On the visual rhythm # obtained from the principal diagonal sampling, the cuts correspond to horizontal intensity
discontinuities that are vertically aligned. These discontinuities may be easily observed in Fig. 2. In [2], a statistical approach based on the visual rhythm is defined for cut detection. This approach considers the local mean and variance
of the horizontal gradient, and an adaptive thresholding is applied to detect sharp video transitions. In [3], we can find
another visual rhythm-based method, built on Markov-model concepts for image segmentation.
4 A method for cut detection

Usually, shot detection is the first step in automatically
segmenting a video, and it is associated with the detection of
sharp and gradual transitions between two different shots
[1]. In this work, we consider only the sharp transition, which
is simply a concatenation of two consecutive shots.
With the aim of realizing a video segmentation without
defining a dissimilarity measure, we can use a simplifica-
tion of the video content, the visual rhythm, where the video
segmentation problem, in domain 2D + t, is transformed
into a problem of pattern detection, in domain 1D + t. So,
we can apply methods of 2D image processing to identify
different patterns on the visual rhythm, because each video
effect corresponds to a pattern in this image; for example, each
sharp video transition is transformed into a "vertical line"
on the visual rhythm. Unfortunately, this correspondence is
not a one-to-one relation, i.e., a sharp video transition corresponds to a vertical line, but a vertical line is not necessarily a sharp video transition. This problem can be resolved
by considering visual rhythms obtained from different pixel
samplings; a simple intersection operation between these results may then be used to correctly identify the
sharp video transitions.

Fortunately, we can in general use only the visual rhythm obtained from the principal diagonal sampling, because this
problem rarely occurs in practice. Furthermore, this visual
rhythm represents the best simplification of the video content, according to [2]. In the following, we define a method
for cut detection based on the visual rhythm.
4.1 Steps of our method

Let V be an arbitrary video as defined in Sec. 2. To facilitate
the description of our method, we describe each step
separately.

Step 0. Visual rhythm creation In this work, we use the
principal diagonal pixel sampling, as described in Sec. 2,
to create the visual rhythm # from the video V.

Step 1. Visual rhythm filtering In this step, we eliminate the noise of the visual rhythm using mathematical
morphology filters. The filtered image is denoted by #_F.
We apply an opening (closing) by reconstruction to eliminate the small light (dark) components. The reader is
referred to [6, 7] for more details about mathematical
morphology. We choose this filtering method because
it preserves the sharp contours of the image.

Step 2. Horizontal gradient calculation The aim of this
step is to detect the horizontal boundary between two consecutive regions. This boundary (sharp contour), when vertically aligned, can represent a sharp video transition. So, we
calculate the norm of the horizontal gradient ∇_h of the filtered image by

    |∇_h #_F(t, z)| = |#_F(t, z) − #_F(t − 1, z)|    (3)

Other derivative operators could be considered here;
we will discuss this point in Sec. 7.

Step 3. Thinning operation Intuitively, a horizontal transition between two consecutive regions corresponds to a
"peak" in the horizontal gradient of each line. In the case
of a cut, the maximum of this peak is generally reduced to
only one pixel, but in the case of a gradual video transition, for
example, the maximum of a peak may consist of several
neighboring pixels. In such cases, a simple maximum detection would result in multiple responses for a single transition. This is why we introduce the thinning step, with the
aim of reducing every peak to a one-pixel-thin maximum
and thus simplifying peak detection.

Let us consider a point x in a 1D image (or signal) g.
We say that a point x is destructible for g if one neighbor of
x has a value greater than or equal to g(x) and the other neighbor
has a value strictly smaller than g(x). The thinning procedure consists in repeating the following steps until stability: i) select a destructible point x; ii) lower the value of x
down to the value of its lowest neighbor.

The selection of destructible points must be done in increasing order of value, so that each point is modified at
most once. Points having the same value are scheduled with
a FIFO policy, which guarantees that, in the case of large flat maxima, the thinned signal is "well centered" with respect to
the original one. This procedure is in fact a particular case,
in the 1D domain, of a topological operator introduced in [8].
Topological operators aim to simplify the image while
maintaining its topology. [8] presents operators for image
segmentation based upon topology which generalize to 2D
grayscale images the notions of binary digital topology [9].

Figure 3: Example of thinning of a 1D image: (a) original; (b) result. The dotted line in (b) represents the original image.

This operator is applied to all horizontal lines of the
gradient of the filtered visual rhythm, producing a new image I_T. In Fig. 3, we illustrate the thinning of a 1D image.
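The thinning procedure above can be sketched as follows. This is our own sketch of the 1D operator, using a priority queue ordered by grey value with a FIFO tiebreak among equal values; it is not the implementation of [8], and the helper names are ours.

```python
import heapq

def thin1d(g):
    """1D thinning sketch: reduce every peak to a one-pixel maximum.

    A point is destructible when one neighbour is >= g[x] and the other is
    strictly < g[x]; it is lowered to its lowest neighbour. Points are
    processed in increasing grey value, FIFO among equal values.
    """
    g = list(g)
    n = len(g)

    def destructible(x):
        if x == 0 or x == n - 1:
            return False                      # border points are kept
        l, r = g[x - 1], g[x + 1]
        return (l >= g[x] and r < g[x]) or (r >= g[x] and l < g[x])

    counter = 0                               # FIFO tiebreak for equal values
    heap = []
    for x in range(n):
        if destructible(x):
            heapq.heappush(heap, (g[x], counter, x)); counter += 1
    while heap:
        v, _, x = heapq.heappop(heap)
        if v != g[x] or not destructible(x):
            continue                          # stale queue entry
        g[x] = min(g[x - 1], g[x + 1])        # lower to the lowest neighbour
        for y in (x - 1, x, x + 1):           # neighbours may become destructible
            if destructible(y):
                heapq.heappush(heap, (g[y], counter, y)); counter += 1
    return g

print(thin1d([0, 2, 5, 5, 5, 2, 0]))   # [0, 0, 0, 5, 0, 0, 0]
```

Note how the flat maximum of width 3 is reduced to a single, well-centered pixel, as described above.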
Step 4. Detection of the maxima points After the thinning operation, we have a new image I_T in which each horizontal
peak is represented by a single point, called a maximum point.
A point x in a 1D image g is a maximum if its two neighbors
have values strictly smaller than g(x). So, we must find all
maxima points of the image I_T to identify the center points
of the transitions. This operation produces a new binary image M, defined by

    M(t, z) = 1, if I_T(t, z) > max(I_T(t − 1, z), I_T(t + 1, z)); 0, otherwise.    (4)
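Eq. (4) amounts to a strict local-maximum test along each line, which can be written directly with array slicing. A minimal sketch (our own naming; rows are assumed to hold the lines z, columns the frame indices t):

```python
import numpy as np

def maxima_image(it):
    """Binary image M of Eq. (4): M = 1 where I_T is a strict maximum with
    respect to its two horizontal (temporal) neighbours."""
    it = np.asarray(it, dtype=float)
    m = np.zeros(it.shape, dtype=np.uint8)
    # border columns keep M = 0: one of their neighbours is missing
    m[:, 1:-1] = (it[:, 1:-1] > np.maximum(it[:, :-2], it[:, 2:])).astype(np.uint8)
    return m

print(maxima_image([[0, 5, 0, 3, 0]]))   # [[0 1 0 1 0]]
```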
Step 5. Maxima point filtering If we observe the image
M, the location of each sharp video transition is represented by a vertical line on M. Unfortunately, irrelevant
components (noise) are also present in this image, and
considering that only the relevant vertical components are
desired, we use a morphological filter to eliminate the
noise. This filter is an opening by reconstruction with a vertical structuring element of size 7, defined empirically.
The filtered maxima image is denoted by M_F.
Step 6. Calculation of the number of maxima points
From the filtered maxima image M_F, we create a 1D image
N where each point t has a value N(t) which represents
the number of maxima points on the vertical line t of M_F.
Thus, this 1D image is given by

    N(t) = Σ_{z=0}^{M_# − 1} M_F(t, z)    (5)

Step 7. Detection of the sharp transitions Finally, we
can detect the sharp video transitions from the one-dimensional image N by comparing the value of each point to a
threshold, i.e., when the value of a point N(t) is greater
than or equal to a threshold T, a sharp video transition is
detected.

In Fig. 4, we illustrate the results of some steps of
our algorithm when we apply it to a visual rhythm obtained
from a real video.
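Steps 5-7 can be condensed into a short sketch. Note one simplification: the opening by reconstruction of Step 5 is replaced here by a plain vertical opening (keeping, in each column, only runs of maxima at least as long as the structuring element), which is a stricter filter than the one used in the paper; all names and defaults are our own.

```python
import numpy as np

def detect_cuts(m, se_size=7, thresh_ratio=0.5):
    """Sketch of Steps 5-7 on the binary maxima image m (rows = z, cols = t).

    Step 5 stand-in: plain vertical opening, i.e. keep only vertical runs of
    1s at least `se_size` long. Step 6: sum each column (Eq. 5). Step 7:
    threshold at `thresh_ratio` of the maximum count (50% in Fig. 4).
    """
    m = np.asarray(m, dtype=np.uint8)
    mf = np.zeros_like(m)
    for t in range(m.shape[1]):                 # column-by-column opening
        col = m[:, t]
        run = 0
        for z in range(len(col) + 1):           # sentinel pass flushes last run
            if z < len(col) and col[z]:
                run += 1
            else:
                if run >= se_size:
                    mf[z - run:z, t] = 1        # keep long vertical runs
                run = 0
    counts = mf.sum(axis=0)                     # Eq. (5): maxima per frame t
    if counts.max() == 0:
        return []
    return list(np.flatnonzero(counts >= thresh_ratio * counts.max()))
```

For example, a full column of maxima at t = 3 survives the vertical opening and is reported as a cut, while a short 3-pixel run elsewhere is removed.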
5 Flash detection
The presence of flashes is very common in digital videos, mainly
in television news videos. When a camera flash occurs,
the luminosity of a few frames increases,
as illustrated in Fig. 5, and when we calculate a dissimilarity measure, like a pixel-wise measure, we can see that for
the frames affected by a flash the dissimilarity measure is
very high. In fact, a flash can be confused with a sharp video
transition. In the literature, we can find some methods for
flash detection, like shot-reverse-shot detection [4]; in these cases, it
is necessary to define a dissimilarity measure. In this work,
we propose two methods for flash detection from the visual
rhythm, without defining a dissimilarity measure. The first
is a variant of the proposed method for cut detection, and
the second considers a filtering of the component tree calculated from statistical measures computed for each frame
(or frame sub-sampling).
Figure 4: Sharp video detection: (a) visual rhythm; (b) horizontal gradient of the filtered visual rhythm; (c) maxima points of the thinned horizontal gradient; (d) filtering; (e) number of maxima points in each vertical line; (f) detected sharp transitions superimposed on the visual rhythm. The threshold is equal to 50% of the maximum value.

5.1 Filtering by top-hat

On the visual rhythm, we can observe that video flashes
are transformed into thin light vertical lines, as shown in
Fig. 6a. So, we can easily extract these lines with a white
top-hat by reconstruction. The white top-hat by reconstruction is a mathematical morphology operator and represents
the difference between the original image g and the opening
by reconstruction of g [6, 7]. Informally, this operator detects light regions according to the shape and the size specifications of the structuring element. The method for flash
detection can be described as follows.
1. Calculate the visual rhythm from the principal diagonal pixel sampling;
2. Apply the white top-hat by reconstruction with a square
structuring element of size 5; this size is associated with the potential duration of a flash;
3. Apply a 1D thinning to each horizontal line;
4. Find the maxima points;
5. Apply an opening by reconstruction with a vertical structuring element of size 7, defined empirically;
6. Calculate the number of maxima points in each vertical line;
7. Apply a detection by thresholding.

We can observe that this method is very similar to the
proposed method for cut detection. The difference here is
the substitution of the morphological filter and horizontal
gradient by the white top-hat by reconstruction. Like the
method for cut detection, this methodology detects the center of the regions of interest, in this case, regions with peak
luminosity. Thus, we can have false detections in regions
of high luminosity change that do not represent a flash.
Usually, this method produces good results when the flash
appears in the middle of the shot.
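The listed steps can be condensed as follows. This sketch is not the full method: it replaces the white top-hat by reconstruction with a plain white top-hat (original minus a flat grey-level opening) and skips the thinning/maxima steps, directly counting the bright residue per frame; all names and defaults are our own assumptions.

```python
import numpy as np

def grey_open(img, k):
    """Flat grey-level opening with a k x k square: erosion (local min)
    followed by dilation (local max). Pure-NumPy stand-in."""
    def local(a, reduce):
        pad = k // 2
        p = np.pad(a, pad, mode='edge')
        out = np.empty_like(a)
        for i in range(a.shape[0]):
            for j in range(a.shape[1]):
                out[i, j] = reduce(p[i:i + k, j:j + k])
        return out
    return local(local(img, np.min), np.max)

def flash_candidates(vr, k=5, thresh_ratio=0.5):
    """Sketch of Sec. 5.1: thin bright vertical lines (flashes) survive the
    white top-hat (original minus opening); wide bright regions do not."""
    vr = np.asarray(vr, dtype=float)
    tophat = vr - grey_open(vr, k)            # bright structures thinner than k
    counts = (tophat > 0).sum(axis=0)         # bright residue per frame t
    if counts.max() == 0:
        return []
    return list(np.flatnonzero(counts >= thresh_ratio * counts.max()))
```

On a toy rhythm where one column is much brighter than its surroundings, only that column survives the top-hat and is reported as a flash candidate.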
Experiment   Videos   Cuts   Dissolves   Flashes   Frames
Cut          32       778    46          14        29933
Flash        10       -      -           23        8392

Table 1: Chosen video features for the experiments.
Figure 5: Some frames of a sequence with the flash presence.
5.2 Component tree filtering

Usually, the frames affected by a flash are visually similar
to their neighbors but with a higher luminosity. The analysis of the flash presence can thus be realized by the computation of
some statistical measures, like the mean and the median, since the
frames affected by a flash present higher mean and median
values with respect to their neighbors. From the computation of
these statistical measures for all frames of the video, we
create a 1D image to facilitate the flash detection. From
this 1D image, we need to find the "peaks" with a "height"
greater than a value H, and with a "basis area" less than or equal
to a value A that corresponds to the duration of the flash. In
this work, we consider that the maximum flash duration is
5 frames, so A = 5. The parameter H influences the sensitivity of the method and has a role similar to the threshold
in Sec. 5.1. The notions of peak, height and basis area can be
precisely defined thanks to a data structure called max-tree
[10] or component tree [11] (refer to these papers for more
details on definitions and implementation).
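Under the stated criterion (height at least H over the local baseline, basis area at most A), a direct scan over the 1D statistic can stand in for the tree filtering; this is our own simplification, not the component-tree algorithm of [10, 11].

```python
def flash_peaks(signal, H, A):
    """Sketch of the Sec. 5.2 criterion on a 1D statistic (e.g. frame means):
    report positions inside peaks that rise at least H above the local
    baseline over a support of at most A samples."""
    n = len(signal)
    hits = []
    for i in range(n):
        # local baseline: minimum value in a window of radius A around i
        base = min(signal[max(0, i - A):i + A + 1])
        if signal[i] - base < H:
            continue
        lo = i                              # extend the peak support leftwards
        while lo > 0 and signal[lo - 1] - base >= H:
            lo -= 1
        hi = i                              # ... and rightwards
        while hi < n - 1 and signal[hi + 1] - base >= H:
            hi += 1
        if hi - lo + 1 <= A:                # "basis area" at most A samples
            hits.append(i)
    return hits
```

A short bright spike of two frames is reported, while a bright plateau wider than A (e.g. a bright shot) is rejected, matching the intent of the flash-duration constraint.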
6 Experimental results
In this section, we show the experimental results for cut detection and flash detection. Our video database currently
contains 150 videos, but we use only 32 videos for the cut detection experiments and 10 videos for the flash detection experiment. The sequences were chosen for
the presence of different characteristics, such as cuts,
dissolves, wipes, flashes, zoom-in, zoom-out, pan, tilt, object
motion, camera motion and computer effects. In Table 1, we
show some features of the chosen videos. To compare the
different methods, we define quality measures in the next
section.
6.1 Quality measures

We denote by #Cut the number of sharp (cut) transitions,
by #Correct the number of cuts correctly detected, by
#False the number of detected frames that do not represent a cut, and by #Miss the number of cuts that are not
detected, defined by #Miss = #Cut − #Correct. From
these numbers we can define two basic quality measures.
Definition 6.1 (Recall and error rates) The recall and error rates represent the percentages of correct and false
detection, respectively, and are given by

    ν = #Correct / #Cut    (recall)    (6)
    ε = #False / #Cut    (error)    (7)

Figure 6: Flash video detection. Visual rhythm (left), white
top-hat by reconstruction (middle) and detected flashes (right).
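Eqs. (6) and (7) can be computed directly from the ground-truth and detected cut positions. A small sketch (the matching tolerance `tol` is our assumption; the paper does not state one):

```python
def recall_error(true_cuts, detected, tol=0):
    """Eqs. (6)-(7): recall = #Correct/#Cut, error = #False/#Cut.
    A detection within `tol` frames of a true cut counts as correct.
    Note the error rate can exceed 1, since false detections are divided
    by #Cut rather than by the number of detections."""
    correct = sum(1 for d in detected
                  if any(abs(d - c) <= tol for c in true_cuts))
    false = len(detected) - correct
    return correct / len(true_cuts), false / len(true_cuts)
```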
Let λ be the threshold used for cut detection, in the
range [0, 1]. If we consider that for each threshold λ we
obtain different values of the recall and error rates, we can represent these
relations as functions ν(λ) and ε(λ), respectively. A new
measure can be created to relate the ranges of λ in which ν and ε
are adequate, according to the percentage of misses and the
percentage of false detections that are permitted.

Definition 6.2 (Robustness) Let ν(λ) and ε(λ) be the
functions that relate the threshold λ to the recall and error rates,
respectively. Let m and p be the percentages of misses and
false detections that are permitted. The robustness ρ is a
measure related to the interval of thresholds where the recall
rate is greater than or equal to (1 − m) and the error rate is smaller than
p. This measure is in the range [0, 1] and is given by

    ρ(m, p) = ε⁻¹(p) − ν⁻¹(1 − m)    (8)

where ν⁻¹ and ε⁻¹ are the inverses of the functions ν
and ε, respectively. In Fig. 7, we illustrate the robustness measure obtained from the functions ν and ε.

Next, we define two other measures, Em and Rf, that
are associated with the absence of misses and false detections,
respectively.
Definition 6.3 ("Missless" error) The missless error Em
is associated with the percentage of false detection when the
results contain no miss (a small percentage of misses Pm
can be permitted, like 3%). The missless error is given by

    Em(Pm) = ε(max{λ = ν⁻¹(q) | 1 − q ≤ Pm})    (9)

Definition 6.4 ("Falseless" recall) The falseless recall Rf
is associated with the percentage of correct detection when
the results contain no false detection (a small percentage of
false detections Pf can be permitted, like 1%). The falseless
recall is given by

    Rf(Pf) = ν(min{λ = ε⁻¹(p) | p ≤ Pf})    (10)
Figure 7: Robustness (ρ) measure.

When we use methods for cut detection, we expect the
recall rate to be as high as possible with the smallest possible error rate. To find a
compromise between these two requirements, we must define a "reward function" combining ν(λ) and ε(λ). Since
high values of ν and low values of ε have to be rewarded,
the function ν(λ) · (1 − ε(λ)) is a natural choice.

Definition 6.5 (Gamma measure) The gamma measure γ
represents the maximal value of the reward function defined
above over all possible values of λ:

    γ = max{ν(λ) · (1 − ε(λ)) | λ ∈ [0, 1]}    (11)

The quality of the results is associated with the values
of the measures defined above. The highest values of robustness, falseless recall and gamma measure represent the
best results of a method; the lowest values of missless error
represent the best results of a method.

In the next sections, we describe the experiments for
cut detection and flash detection.
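Eqs. (8) and (11) can be evaluated on a sampled threshold sweep. A sketch under our own assumptions: thresholds are discretized, and the robustness is taken as the width of the acceptable threshold interval, standing in for the difference of inverse functions.

```python
def gamma_measure(nu, eps, thresholds):
    """Eq. (11): gamma = max over lambda of nu(l) * (1 - eps(l)),
    with nu = recall and eps = error given as callables."""
    return max(nu(l) * (1 - eps(l)) for l in thresholds)

def robustness(nu_vals, eps_vals, thresholds, m, p):
    """Eq. (8) on sampled curves: width of the threshold interval where
    recall >= 1 - m and error <= p (discrete stand-in for
    eps^{-1}(p) - nu^{-1}(1 - m), assuming monotone curves)."""
    ok = [l for l, n, e in zip(thresholds, nu_vals, eps_vals)
          if n >= 1 - m and e <= p]
    return (max(ok) - min(ok)) if ok else 0.0
```

For instance, on a sweep where the recall stays above 0.8 and the error below 0.3 only for thresholds 0.25-0.5, the robustness is 0.25.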
6.2 Experiments for cut detection

In these experiments, we implemented three methods described in the literature: a variant of pixel-wise comparison,
histogram intersection, and a statistical technique based on
the visual rhythm. We chose these methods due to their simplicity and their good results, according to [5], [12]
and [2], respectively. We also implemented the proposed
method with some variants.

In the next sections we describe all experiments, and in
Sec. 6.2.1 we present a global analysis of their results.
Experiment 1 This experiment uses the pixel difference (defined in Sec. 3) as the dissimilarity measure. A 1D signal is created from the dissimilarity values
calculated over the video. According to [5], we apply a mathematical morphology operator, called the inf top-hat operator,
to this signal, and finally, we use a threshold to detect the
cuts, i.e., if the result of the inf top-hat operator is greater
than a threshold, then a cut is detected.
Experiment 2 This experiment uses the histogram intersection (defined in Sec. 3) as the dissimilarity measure: if
the dissimilarity value is greater than a threshold, then a cut
is detected. With the aim of improving the results, we subdivide each frame, according to [12]: each
frame contains 9 subframes, the dissimilarity measure
is applied to all corresponding subframes of consecutive
frames, and the mean of these measures is taken.
Compared to the previous experiment, this experiment
produces worse results for the robustness, falseless
recall and gamma measures, but better results for the missless
error.
Experiment 3 This experiment uses the visual rhythm for cut
detection, based on the statistical method described in
[2]. Here, the parameters are different from those used in the
other methods, in particular the threshold: while in this
method the threshold is locally adaptive and related to a
parameter that varies from 1 to 10, in the other methods the
threshold is fixed and global.

This method presents the best values of falseless recall
in these experiments, but the other quality measures of the proposed methods are better. In particular, this method has a
very bad missless error rate.
Experiment 4 In this experiment, we compute a 1D image associated with the mean of the difference between pixels of consecutive frames. We apply the following algorithm to this image: i) apply a white top-hat by reconstruction with a flat structuring element of size 3; ii) apply a
thinning; and iii) apply a thresholding. Step i) eliminates
noise from the 1D signal, and step ii) reduces the number
of false detections according to the quality measures. This
method can be seen as a hybrid between the method
described in Experiment 1 and the proposed method described in Sec. 4.

The quality measures of this method give the best results
when compared to the previous experiments, with the exception of the falseless recall rate of Experiment 3.
Experiment 5 In this variant of our method introduced in
Sec. 4, instead of computing the sum of the number
of maxima points in each vertical line, we use the filtered
maxima image as a mask to read the grayscale value associated with each maximum point. Afterwards, we compute the
mean of these grayscale values in each vertical line. Then,
a thresholding is applied to these results: if the mean
is greater than a threshold, then a cut is detected. The falseless recall presents the second best result of
these experiments, but the other measures are worse when
compared to the next experiment.
Experiment 6 This experiment corresponds to the method
defined in Sec. 4. In general, the robustness, the missless
error and the gamma measure show the best results when compared
to the other experiments, and the falseless recall presents
the third best value of all experiments.
6.2.1 Analysis of the results

In Fig. 8, we show graphically the experimental results for
each experiment previously described. These graphs relate the threshold (except for Experiment 3) to the recall and
error rates. From the functions illustrated in these graphs, it is possible to compute the robustness, missless error rate,
falseless recall rate and gamma measure, which are outlined
in Table 2.
              ρ(0.10, 0.30)   Em(0.03)   Rf(0.01)   γ
Experiment 1  0.01            0.80       0.10       0.77
Experiment 2  0.00            0.51       0.00       0.68
Experiment 3  0.00            1.20       0.51       0.72
Experiment 4  0.06            0.49       0.21       0.80
Experiment 5  0.01            0.48       0.44       0.78
Experiment 6  0.11            0.37       0.35       0.80

Table 2: Quality measures ρ(0.10, 0.30), Em(0.03), Rf(0.01) and γ.
From these experiments, we can verify that the proposed method generally produces the best results, mainly
according to the robustness and the missless error rate. The
robustness result means that the proposed method is
not very sensitive to small variations around an "optimal
value". Another good point of our method is related to the
missless error rate: generally, we want results without misses
and with the smallest possible percentage of false detections,
so that the latter can be eliminated afterwards. Indeed, a post-processing step is essential to increase the quality of the results,
because many false detections are due to the presence of
effects like flashes, pans and zooms.

Also, we can observe that the processing time of the experiments based on the visual rhythm is significantly lower than
that of the experiments applied directly to the video.
6.3 Experiments for flash detection

In these experiments, we apply the methods described in
Sec. 5.1 and in Sec. 5.2. In Fig. 9, we illustrate some experimental results. Considering two statistical measures,
the mean and the median, we compute a component tree for each
measure. The quality measures for the intersection of the
filterings of the component trees and for the top-hat filtering are
outlined in Table 3.

                ρ(0.40, 0.30)   Em(0.05)   Rf(0.01)   γ
Top-hat         0.05            0.61       0.26       0.56
Component tree  0.11            0.67       0.43       0.69

Table 3: Quality measures ρ(0.40, 0.30), Em(0.05), Rf(0.01) and γ.
7 Conclusions

Figure 8: Experimental results.

In this work, we transformed the video segmentation problem
into a 2D image segmentation problem, and we proposed
a method for cut detection based on a simplification of the
video content, called the visual rhythm.
Acknowledgements
The authors are grateful to FAPEMIG, CAPES/COFECUB,
CNPq and the SIAM DCC/PRONEX Project for the financial support of this work.
References
[1] A. Hampapur, R. Jain, and T. E. Weymouth. Production model based digital video. Multimedia Tools and
Applications, 1:9–46, 1995.
[2] M. G. Chung et al. Automatic video segmentation
based on spatio-temporal features. Korea Telecom
Journal, 4(1):4–14, 1999.
Figure 9: Experimental results for flash detection.
The main originality of our method is the thinning step, which decreases the number of false detections with respect to the number of correct detections.
The method is sensitive to the filtering step, since the size of
the structuring element can eliminate some small regions: if the shot size (number of frames of the shot)
is smaller than the size of the structuring element, then a
miss occurs. To realize a comparative analysis between different methods for cut detection, we defined four quality
measures: robustness, missless error, falseless recall and
gamma. According to these quality measures, we verified
that the proposed method has the best values of robustness,
missless error and gamma measure when compared experimentally to the other methods. Except for two methods, it
also has the best falseless recall.
Another problem that we studied is related to the presence of flashes. In fact, due to the dissimilarity values, a flash
can be confused with a sharp video transition, and with the
aim of eliminating the choice of a dissimilarity measure,
we proposed two methods for flash detection. One is a
variant of our cut detection method that uses a white top-hat
by reconstruction, and the other is based on the filtering of a statistical
measure.

From this work, we observed that the visual rhythm
is an adequate simplification of the video content,
which can be the basis for future developments: i) identifying some
video effects, like pan, zoom and camera motion, from the detection of their corresponding patterns; ii) modifying the proposed method to detect gradual video transitions, using
Canny's filter [13] to compute the horizontal gradient.
We can also remark that, considering the video sequence
as a three-dimensional image, we could apply a variant of
our method directly on the video data. We have yet to verify
whether the additional computational effort would be rewarded by a better segmentation quality.
[3] C. W. Ngo, T. C. Pong, and R. T. Chin. Detection of
gradual transitions through temporal slice analysis. In
IEEE CVPR, pages 36–41, 1999.
[4] A. Del Bimbo. Visual Information Retrieval. Morgan
Kaufmann, 1999.
[5] C.-H. Demarty. Segmentation et Structuration d'un Document Vidéo pour la Caractérisation et l'Indexation de son Contenu Sémantique. PhD thesis, École Nationale Supérieure des Mines de Paris,
January 2000.
[6] J. Serra. Image Analysis and Mathematical Morphology: Theoretical Advances. Academic Press, 1988.
[7] P. Soille. Morphological Image Analysis. Springer-Verlag, 1999.
[8] G. Bertrand, J.-C. Everat, and M. Couprie. Image
segmentation through operators based upon topology.
Journal of Electronic Imaging, 6:395–405, 1997.
[9] T. Y. Kong and A. Rosenfeld. Digital topology: Introduction and survey. CVGIP, 48:357–393, 1989.
[10] P. Salembier et al. Antiextensive connected operators
for image and sequence processing. IEEE Trans. on
Image Processing, 7(4):555–570, 1998.
[11] E. J. Breen and R. Jones. Attribute openings, thinnings and granulometries. Computer Vision and Image Understanding, 64(3):377–389, 1996.
[12] A. Del Bimbo et al. Retrieval of commercials based on
dynamics of color flows. Journal of Visual Languages
and Computing, 11:273–285, 2000.
[13] J. Canny. A computational approach to edge detection. Pattern Analysis and Machine Intelligence,
8(6):679–698, 1986.