Tracking Groups of Pedestrians in Video Sequences∗
Jorge S. Marques (IST / ISR, Lisbon, Portugal)
Pedro M. Jorge (ISEL / IST, Lisbon, Portugal)
Arnaldo J. Abrantes (ISEL, Lisbon, Portugal)
J. M. Lemos (INESC-ID / IST, Lisbon, Portugal)
Abstract
This paper describes an algorithm for tracking groups of
objects in video sequences. The main difficulties addressed
in this work concern total occlusions of the objects to be
tracked as well as group merging and splitting. A two-layer
solution is proposed to overcome these difficulties. The first
layer produces a set of spatio-temporal strokes based on
low-level operations which manage to track the active regions most of the time. The second layer performs a consistent labeling of the detected segments using a statistical
model based on Bayesian networks. The Bayesian network
is recursively computed during the tracking operation and
allows the tracker results to be updated every time new information is available. Experimental tests are included to
show the performance of the algorithm in ambiguous situations.
∗ This work was partially supported by FCT and POCTI in the scope of project LTT 37844.
1 Introduction
Video surveillance aims to track and classify human activities, providing the human operator with augmented perception capability [9]. This task is usually divided into several processing layers, e.g., region detection, object tracking and activity classification.
Efficient algorithms have been developed to detect active regions in video sequences, e.g., using statistical models of the background image [11, 10]. Nonlinear operations are then used to discard small regions and cluster neighboring pixels into connected active regions. Higher level operations use the output of the low level layer to track moving objects present in the scene and to classify their behavior. This is done by associating regions with similar properties detected in consecutive frames (region tracking). Region sequences are then classified using statistical pattern recognition techniques, e.g., Hidden Markov Models [8].
Most region trackers perform well in simple cases. However, in practice, the objects to be tracked interact, forming groups: a single track may correspond to several objects. Therefore the tracking algorithm must be able to cope with group formation and splitting. Splitting is a very difficult problem since the tracker must determine which objects belong to each subgroup. A second difficulty concerns temporary occlusions produced either by static or by interacting objects. In this case, it is not possible to detect the object trajectories but only a set of non-connected segments. The estimation of the object trajectories from the observed segments is a difficult problem since several hypotheses have to be considered.
Figure 1: Group formation
Tracking with occlusions has been thoroughly studied by several authors (e.g., see [11, 4]). On the other hand, tracking groups of objects has been much less studied and is still an open issue (e.g., see [2, 6]). Most tracking systems solve both difficulties (occlusions and groups) using hard decisions based on instantaneous rules. This approach works well in simple cases but the performance of these systems is poor in large problems with complex interactions. For example, when an object is occluded for a long time or when a group is split, a large ambiguity is created which cannot be instantaneously removed.
A different approach was recently proposed in [1] by formulating trajectory estimation as a labeling problem, modeled by Bayesian networks. Bayesian networks are able to incorporate the appearance information extracted from the video stream and to propagate the uncertainty associated with the labeling decisions. This makes it possible to deal with ambiguous situations in which multiple interpretations have to be considered.
Figure 2: Detected strokes (segments s1 to s6 over time t)
This paper extends the ideas proposed in [1] by addressing the problem of tracking groups of objects with Bayesian
networks, i.e., each active region detected in the image may
contain several objects. The main question which has to be
considered is: which objects are associated with each segment detected in the video stream? The system must be able
to provide the most probable labeling sequences as well as
their confidence degrees.
A second contribution concerns the way splitting trajectories are modeled. Instead of using competitive links, this
paper proposes the use of restriction nodes. This simplifies
the probabilistic models, which become symmetric, and
reduces the memory and computational effort associated with the inference task.
2 System Overview
Given a video sequence, low level processing provides a set of segments, each of them associated with the evolution of an active region in the image [10]. We wish to link segments in order to estimate the trajectory of each object present in the scene. If a given segment corresponds to a group, it will belong to several trajectories.
Every time new segments are detected (e.g., due to occlusions, new objects or group splitting), we need to know which objects belong to the new segments (Fig. 1). This ambiguity is solved by building a model for each object previously observed by the system. The likelihood of the new segment is computed considering all the admissible models. The Bayesian network is used to model object interaction as well as their visual appearance.
To obtain the object trajectories, a label xk must be assigned to each segment sk. Each label identifies all the objects in the segment, i.e., if the segment corresponds to a single object, the label is the object identifier. If the segment corresponds to a group, the label is the set of identifiers of all the objects inside the group.
Let s1, ..., sN be the sequence of detected segments. A set of probabilistic labels x = (x1, ..., xN) is estimated from the object trajectories and visual features y = (y1, ..., yN) detected in the video stream, e.g., using the dominant colors to characterize each detected segment. The most probable labels are obtained using the MAP method
\[ \hat{x} = \arg\max_{x} p(y, x) \tag{1} \]
Bayesian networks are used in this paper to represent the joint probability distribution p(x, y). It is assumed that each variable xi is a hidden node of a Bayesian network and each observation yi is a visible node connected to xi. Object interaction (trajectory geometry) is encoded in the network topology. Two hidden nodes xi, xj are connected if the j-th segment starts after the end of the i-th segment. Additional restrictions are used to reduce the number of connections, as discussed in Section 4.
A third type of node (restriction nodes rij) is included every time a hidden node has more than one hidden child. These nodes are used to model the competitive interaction among the hidden children (the same label cannot be assigned to more than one child). Fig. 3 shows the Bayesian network associated with the example of Fig. 2, showing the three types of nodes.
Figure 3: Bayesian network
Three issues have to be considered in order to specify a Bayesian network for a tracking problem:
• computation of the network architecture: nodes and links;
• choice of the admissible labels Li associated with each hidden node;
• the conditional distribution of each variable given its parents.
The last two items depend on the type of application. Different solutions must be adopted if one wants to track isolated objects or groups of objects. Group tracking leads to more complex networks since each segment represents multiple objects.
These topics are addressed in the next sections. Section 3 describes low level processing. Section 4 describes the network architecture. Section 5 describes the Bayesian network for tracking multiple isolated objects. Section 6 describes the Bayesian networks for group tracking. Section 7 presents experimental results and Section 8 presents the conclusions.
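As a concrete, simplified illustration of the MAP criterion in equation (1) and of the role of the restriction nodes, the sketch below enumerates all label assignments for two sibling segments and keeps the most probable one in which the siblings receive different labels. It is a brute-force stand-in for proper Bayesian network inference: the evidence values, the node indices and the function name `map_labeling` are hypothetical, and the conditional distributions between parents and children are ignored.

```python
from itertools import product

def map_labeling(evidence, conflicts):
    """Brute-force MAP search over label assignments.

    evidence:  {node: {label: score}}, a hypothetical soft evidence per node.
    conflicts: pairs of nodes that must not share a label (the situation
               in which a restriction node would take the value 0).
    Returns (assignment, score), or (None, 0.0) if nothing is admissible.
    """
    nodes = sorted(evidence)
    best, best_score = None, 0.0
    # Iterate over every combination of one label per node.
    for combo in product(*(evidence[n] for n in nodes)):
        assignment = dict(zip(nodes, combo))
        if any(assignment[a] == assignment[b] for a, b in conflicts):
            continue  # labeling conflict: configuration is discarded
        score = 1.0
        for n in nodes:
            score *= evidence[n][assignment[n]]
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score

# Two sibling segments (5 and 6) that both look most like object 1.
evidence = {5: {1: 0.6, 2: 0.4}, 6: {1: 0.7, 2: 0.3}}
free, _ = map_labeling(evidence, conflicts=[])
constrained, score = map_labeling(evidence, conflicts=[(5, 6)])
```

Without the restriction both segments would receive label 1; with it, the search is forced to choose the best assignment in which the two siblings differ.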
Figure 4: Basic structures (a)-(g) (grey circles represent restriction nodes).
3 Low level processing
The algorithm described in this paper was used for long
term tracking of groups of pedestrians in the presence of
occlusions. The video sequence is first pre-processed to detect the active regions in every new frame. A background
subtraction method is used to perform this task followed by
morphological operations to remove small regions [10].
Then region linking is performed to associate corresponding regions in consecutive frames. A simple method
is used in this step: two regions are associated if each of
them selects the other as the best candidate for matching
[12]. The output of this step is a set of strokes in the spatio-temporal domain describing the evolution of the region
centroids during the observation interval.
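The mutual best-match rule can be illustrated with a minimal Python sketch. It assumes that each region is summarized by its centroid and that the best candidate is the nearest centroid in the other frame; the paper does not specify the matching score, so the Euclidean distance and the function name `mutual_best_matches` are illustrative assumptions.

```python
import math

def mutual_best_matches(prev, curr):
    """Pair regions of two consecutive frames by mutual nearest centroid.

    prev, curr: lists of (row, col) centroids (assumed representation).
    Returns (i, j) pairs such that curr[j] is the nearest centroid to prev[i]
    and prev[i] is the nearest centroid to curr[j].
    """
    if not prev or not curr:
        return []

    def nearest(point, candidates):
        # Index of the candidate closest to `point`.
        return min(range(len(candidates)),
                   key=lambda k: math.dist(point, candidates[k]))

    best_for_prev = [nearest(p, curr) for p in prev]
    best_for_curr = [nearest(c, prev) for c in curr]
    return [(i, j) for i, j in enumerate(best_for_prev)
            if best_for_curr[j] == i]

# Two isolated regions plus a newly appeared one: the new region stays unmatched.
simple = mutual_best_matches([(10, 10), (50, 50)],
                             [(12, 11), (48, 52), (200, 200)])

# Two regions merge into one: only one of them is matched.
merged = mutual_best_matches([(0, 0), (3, 0)], [(2, 0)])
```

In the second call two regions compete for the same region in the next frame, so only one association is kept and the other stroke ends. This is the kind of discontinuity that the Bayesian network is later asked to resolve.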
Every time there is a conflict between two neighboring regions in the image domain, the low level matcher is not able to perform a reliable association of the regions and the corresponding strokes end. A similar effect is observed when a region is occluded by the background. Both cases lead to discontinuities and the creation of new strokes.
The role of the Bayesian network is to perform a consistent labeling of the strokes detected in the image, i.e., to associate strokes using high level information when the simple heuristic methods fail. Every time a stroke begins, a new node is created and the inference procedure is applied to determine the most probable label configuration as well as the associated uncertainty.
4 Network Architecture
The network architecture is specified by a graph, i.e., a set of nodes and corresponding links. Three types of nodes are used in this paper: the hidden nodes xi representing the label of the i-th segment, the observation nodes yi which represent the features extracted from the i-th segment, and binary restriction nodes rij which are used to avoid labeling conflicts. The restriction node rij is created only if xi and xj share a common parent.
A link is created from a hidden node xi to xj if xj can inherit the label of xi. Physical constraints are used to determine if two nodes are linked (e.g., the second segment must start after the end of the first and the average speed during the occlusion gap must be smaller than the maximum velocity specified by the user).
Furthermore, we assume that the number of parents as well as the number of hidden children of each node is limited to 2. Therefore, seven basic structures must be considered (see Fig. 4). These structures show the restriction nodes rij but the visible nodes yi are omitted for the sake of simplicity.
When the number of parents or children is higher than two, the network is pruned using link elimination techniques. Simple criteria are used to perform this task: we prefer the connections which correspond to small spatial gaps.
5 Tracking Isolated Objects
A stroke si is either the continuation of a previous stroke or a new object. The set of admissible labels Li is then the union of the admissible labels Lj of all previous strokes which can be assigned to si, plus a new label corresponding to the appearance of a new object in the field of view. Therefore,
\[ L_i = \bigcup_{j \in I_i} L_j \cup \{ l_{new} \} \tag{2} \]
where Ii denotes the set of indices of the parents of xi. Table 1 shows the labels associated with the hidden nodes
of the Bayesian network of Fig. 3.
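The recursion of equation (2) can be sketched in a few lines of Python. The parent sets used below are not listed in the text; they are an assumption inferred from the example in which two persons meet, are tracked as one region, and separate again. The sketch also assumes that the new label of node i is the index i itself, which is what Table 1 appears to do.

```python
def admissible_labels(parents):
    """Apply equation (2): L_i is the union of L_j over parents j, plus a new label.

    parents: dict mapping node index i to the list I_i of its parent indices.
    Assumes node indices are ordered so that parents precede their children,
    and uses the node index itself as the new label l_new.
    """
    labels = {}
    for i in sorted(parents):
        labels[i] = {i}                  # l_new for node i
        for j in parents[i]:
            labels[i] |= labels[j]       # inherit the labels of parent j
    return labels

# Hypothetical parent sets for the six-segment example.
parents = {1: [], 2: [], 3: [1, 2], 4: [3], 5: [4], 6: [4]}
table = admissible_labels(parents)
for k in sorted(table):
    print(k, sorted(table[k]))
```

With these assumed parents, the computed sets coincide with the rows of Table 1.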
The Bayesian network becomes defined once we know
the graph (see Section 4) and the conditional distributions
p(xi | pi) for all the nodes, where pi are the parents of xi.
Seven cases have to be considered (see Fig. 4). The distribution p(xi | pi) for each of these cases is defined following
a few rules. It is assumed that the probability of assigning
a new label to xi is a constant Pnew defined by the user.
Table 1: Admissible labels (isolated objects)
k    Lk
1    1
2    2
3    1 2 3
4    1 2 3 4
5    1 2 3 4 5
6    1 2 3 4 6

Table 2: Admissible labels (groups of objects)
k    Lk
1    1
2    2
3    1 2 (1,2) 3
4    1 2 (1,2) 3 4
5    1 2 (1,2) 3 4 5
6    1 2 (1,2) 3 4 6
Therefore,
\[ p(x_i = l_{new} \mid x_j = k) = P_{new} \tag{3} \]
All the other cases are treated on the basis of a uniform probability assignment. For example, in the case of Fig. 4c, xi inherits the label of each parent with equal probability
\[ p(x_i \mid x_p, x_q) = (1 - P_{new})/2, \tag{4} \]
for xi = xp or xi = xq.
Every time two nodes xi, xj have a common parent, a binary node rij is included to avoid conflicts, i.e., to avoid assigning common labels to both nodes. The conditional probability table of the restriction node is defined by
\[ p(r_{ij} = 1 \mid x_i \cap x_j = \emptyset) = 1, \qquad p(r_{ij} = 1 \mid x_i \cap x_j \neq \emptyset) = 0. \tag{5} \]
It is assumed that rij = 0 if there is a labeling conflict, i.e., if the children nodes xi, xj have a common label; rij = 1 otherwise. To avoid conflicts we assume that rij is observed and equal to 1.
Inference methods are used to compute the most probable configuration (label assignment) as well as the probability of the admissible labels associated with each node. This task is performed using the Bayes Net Matlab toolbox [7].
Each stroke detected in the image is characterized by a vector of measurements yj. In this paper yj is a set of dominant colors. The dominant colors are computed by applying the LBG algorithm to the pixels of the active region being tracked in each segment. A probabilistic model of the active colors is used to provide soft evidence about each node [3].
Each label is also characterized by a set of dominant colors. This information is computed as follows. The first time a new label is created and associated with a segment, a set of dominant colors is assigned to the label.
The probability of label xj ∈ Lj given the observation yj is defined by
\[ P(x_j \mid y_j) = \binom{N}{n} P^{n} (1 - P)^{N - n} \tag{6} \]
where n is the number of matched colors, N is the total number of colors (N = 5 in this paper) and P is the matching probability for one color.
6 Group Model
This section addresses group modeling. Three cases have to be considered: group occlusions, merging and splitting.
Fig. 2 shows a simple example in which two persons meet, walk together for a while and separate. This example shows three basic mechanisms: group merging, occlusion and group splitting. These mechanisms allow us to model more complex situations in which a large number of objects interact forming groups. After detecting the segments using image processing operations, each segment is characterized by a group label xi. A group label is a sequence of labels of the objects present in the group. A Bayesian network is then built using the seven basic structures of Fig. 4.
Let us now consider the computation of the admissible labels. The set of admissible labels Lk of the k-th node is recursively computed from the sets of admissible labels of its parents Li, Lj, starting from the root nodes. This operation depends on the type of connections as follows:
occlusion:
\[ L_k = L_i \cup l_{new} \tag{7} \]
merging:
\[ L_k = L_i \cup L_j \cup L_{merge} \cup l_{new} \tag{8} \]
\[ L_{merge} = \{ a \cup b : a \subset L_i,\ b \subset L_j,\ a \cap b = \emptyset \} \tag{9} \]
splitting:
\[ L_k = L_j = \mathcal{P}(L_i) \cup l_{new} \tag{10} \]
where P(Li) is the partition of the set Li, excluding the empty set. In all these examples, lnew stands for a new label, corresponding to a new track.
Table 2 shows the set of admissible labels for the example of Fig. 3. Labels 1, 2 correspond to the objects detected in the first frame and labels 3-6 correspond to new objects which may have appeared.
Conditional probability distributions must be defined for all the network nodes, assuming that the parent labels are known. Simple expressions for these distributions are used, based on four parameters chosen by the user:
• Poccl - occlusion probability
• Pmerge - merging probability
• Psplit - splitting probability
• Pnew - probability of a new track
These parameters are free except in the case of the occlusion (Fig. 4b). In this case, the conditional probability of xk given xi is given by
\[ P(x_k \mid x_i) = \begin{cases} 1 - P_{new} & x_k = x_i \\ P_{new} & x_k = l_{new} \end{cases} \tag{11} \]
The splitting and merging conditional distributions are defined in the Appendix.
Figure 5: Stroke detection and the most probable labels (axes: Column versus Time (s))
Figure 6: Tracking results at times 42, 45, 54, 56, 68 and 70 s, panels (a)-(f)
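The admissible-label rules of equations (7)-(10) can be sketched as follows. This is a minimal illustration under stated assumptions: each group label is represented as a `frozenset` of object identifiers, P(Li) is read as the collection of all non-empty subsets of every label in Li, and the new label identifier is passed in explicitly. The helper names (`label`, `occlusion`, `merging`, `splitting`) are hypothetical.

```python
from itertools import combinations

def label(*ids):
    """A group label: the set of object identifiers it contains."""
    return frozenset(ids)

def occlusion(L_i, new_id):
    # Equation (7): inherit the parent's labels or start a new track.
    return set(L_i) | {label(new_id)}

def merging(L_i, L_j, new_id):
    # Equations (8)-(9): parents' labels, unions of disjoint labels, new track.
    merged = {a | b for a in L_i for b in L_j if not (a & b)}
    return set(L_i) | set(L_j) | merged | {label(new_id)}

def splitting(L_i, new_id):
    # Equation (10): every non-empty subset of each parent label, or a new track.
    subsets = set()
    for a in L_i:
        members = sorted(a)
        for r in range(1, len(members) + 1):
            subsets.update(frozenset(c) for c in combinations(members, r))
    return subsets | {label(new_id)}

# Example of two persons who meet, are occluded and then separate.
L1 = {label(1)}
L2 = {label(2)}
L3 = merging(L1, L2, 3)      # segments 1 and 2 merge into segment 3
L4 = occlusion(L3, 4)        # segment 4 follows segment 3 after a gap
L5 = splitting(L4, 5)        # segment 4 splits into segments 5 and 6
L6 = splitting(L4, 6)
```

Under these assumptions the sets L1 to L6 match the rows of Table 2, with `label(1, 2)` playing the role of the group label (1,2).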
7 Results
The proposed algorithm was used for long term tracking of multiple objects in video sequences. The tests were performed using the PETS 2001 database as well as video sequences of a university campus. In both cases the sequences were digitized at 25 fps and contain fewer than 20 active regions (pedestrians and vehicles) to be tracked.
These tests made it possible to evaluate the performance of the proposed algorithms in the presence of occlusions and small groups of 2 to 4 objects. Similar results were obtained in both cases. The Bayesian network managed to correctly solve most ambiguous situations.
Fig. 5 shows the segments detected in one of the PETS sequences. Each object is easily tracked if it appears isolated in the image but the trajectories are broken every time there is an occlusion or a group change (merging or splitting). Fig. 6 illustrates some of these difficulties. Figs. 6a,b correspond to a merge. Figs. 6b,c show a group split. Figs. 6d,e show a merge and a split. These events can also be observed in Fig. 5.
The Bayesian network is recursively updated from the output of the low level module. Every time a new segment is
created, the network is updated with a new hidden node. Fig. 7 shows the evolution of the Bayesian network at two time instants. The most probable labels obtained by the MAP method are displayed in Fig. 5. A consistent interpretation of the data was achieved by the tracker.
Fig. 8 shows tracking results obtained at the university campus. The figure displays the evolution of the active regions (column of the mass center) and the labels obtained using the Bayesian network. The Bayesian network associated with this example was automatically built from the video stream as before and it is shown in Fig. 9. Fig. 10 shows four frames which illustrate the performance of the system in the tracking of multiple objects with occlusions and groups.
Figure 7: Bayesian network at two time instants
Figure 8: Stroke detection and the most probable labels (axes: column versus time (s))
Figure 9: Bayesian network
Figure 10: Tracking results in the time interval shown in Fig. 8
8 Conclusion
This paper presents a system for long term tracking of multiple objects in the presence of occlusions and group merging and splitting. The system tries to follow all moving objects present in the scene by performing a low level detection of spatio-temporal segments followed by a labeling procedure which attempts to assign consistent labels to all the segments associated with the same object. The interaction among the objects is modeled using a Bayesian network which is automatically built during the surveillance task. This makes it possible to formulate the labeling problem as an inference task which integrates all the available information extracted from the video stream, so that the interpretation of the detected tracks can be updated every time new information is available. This is a useful feature to solve ambiguous situations such as group splitting and occlusions in which long term memory is needed.
Appendix
Fig. 4 depicts the set of network configurations which have to be considered to define all the conditional probability distributions in the Bayesian network. The first structure (Fig. 4a) corresponds to the case of an isolated track. This corresponds to a new object in the scene. A new label is assigned to this object.
The second case (occlusion), shown in Fig. 4b, was already considered in equation (11).
The remaining cases correspond to situations involving merging and splitting; the conditional distributions are defined as follows.
1. Two merging parents (Fig. 4c):
\[ P(x_k \mid x_i, x_j) = \begin{cases} P_{merge} & x_k = x_i \cup x_j \\ P_{occl_i} & x_k = x_i \\ P_{occl_j} & x_k = x_j \\ P_{new} & x_k = l_{new} \end{cases} \]
2. A single parent with a split (Fig. 4d):
\[ P(x_k \mid x_i) = \begin{cases} P_{split}/(2^{N_i} - 2) & x_k \subset \mathcal{P}(x_i) \setminus x_i \\ P_{occl_i} & x_k = x_i \\ P_{new} & x_k = l_{new} \end{cases} \]
where Ni is the cardinality of xi (number of objects in segment si). The expression for Ni = 1 has some minor modifications.
3. Two parents, but only one of them may have a split (Fig. 4e):
\[ P(x_k \mid x_i, x_j) = \begin{cases} P_{split}/(2^{N_j} - 2) & x_k \subset \mathcal{P}(x_j) \setminus x_j \\ P_{merge}/(2^{N_j} - 1) & x_k \subset M_{ij} \\ P_{occl_i} & x_k = x_i \\ P_{occl_j} & x_k = x_j \\ P_{new} & x_k = l_{new} \end{cases} \]
where Mij = {a ∪ b : a = xi, b ⊂ P(xj), a ∩ b = ∅}.
4. Two parents, both with splits (Figs. 4f-g):
\[ P(x_k \mid x_i, x_j) = \begin{cases} P_{split}/(2^{N_i} - 2) & x_k \subset \mathcal{P}(x_i) \setminus x_i \\ P_{split}/(2^{N_j} - 2) & x_k \subset \mathcal{P}(x_j) \setminus x_j \\ P_{merge}/\big((2^{N_i} - 1)(2^{N_j} - 1)\big) & x_k \subset M_{ij} \\ P_{occl_i} & x_k = x_i \\ P_{occl_j} & x_k = x_j \\ P_{new} & x_k = l_{new} \end{cases} \]
where Mij = {a ∪ b : a ⊂ P(xi), b ⊂ P(xj), a ∩ b = ∅}.
References
[1] A. Abrantes, J. Marques, J. Lemos, Long Term Tracking Using Bayesian Networks, IEEE Int. Conf. Image Processing, 609-612, 2002.
[2] F. Bremond, M. Thonnat, Tracking Multiple Nonrigid Objects in Video Sequences, IEEE Trans. Circuits, Systems and Video Technology, 8, 585-591, 1998.
[3] F. Jensen, Bayesian Networks and Decision Graphs, Springer, 2001.
[4] I. Haritaoglu, D. Harwood, L. Davis, W4: Real-Time Surveillance of People and Their Activities, IEEE Trans. PAMI, 22, 809-830, 2000.
[5] M. Jordan, Learning in Graphical Models, MIT Press, 1998.
[6] S. McKenna, S. Jabri, Z. Duric, A. Rosenfeld, H. Wechsler, Tracking Groups of People, Journal of Comp. Vision Image Understanding, 80, 42-56, 2000.
[7] K. Murphy, The Bayes Net Toolbox for Matlab, Computing Science and Statistics, 33, 2001.
[8] N. Oliver, B. Rosario, A. Pentland, A Bayesian Computer Vision System for Modeling Human Interactions, IEEE Trans. Pattern Analysis and Machine Intelligence, 22, 8, 831-843, 2000.
[9] C. Regazzoni, P. Varshney, Multi-Sensor Surveillance Systems, IEEE Int. Conf. Image Processing, 497-500, 2002.
[10] C. Stauffer, W. Grimson, Learning Patterns of Activity Using Real-Time Tracking, IEEE Trans. Pattern Analysis and Machine Intelligence, 22, 8, 747-757, 2000.
[11] C. Wren, A. Azarbayejani, T. Darrell, A. Pentland, Pfinder: Real-Time Tracking of the Human Body, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 780-785, 1997.
[12] S. Ullman, R. Basri, Recognition by Linear Combination of Models, IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 992-1006, 1991.