Real-Time People Tracking in a Camera Network
Wasit Limprasert, Andrew Wallace, and Greg Michaelson
I. INTRODUCTION
Processing was viewed as a pipeline running from segmentation, using background subtraction, color, and motion, through to person detection and classification as people or other objects. However, this was based only on silhouettes and is unlikely to be robust. Sinha [19] implemented a parallel feature detector and tracker on a GPU, achieving speedups of approximately 10 times compared to an optimized CPU implementation. This is comparable to the performance improvement we expect in our work, but we address the rather different problem of detecting and tracking extended bodies.
In this paper, we describe a Bayesian approach to track multiple human subjects within a camera network that can have both overlapping and nonoverlapping fields of view. The contributions are as follows. First, we introduce a new form of appearance model, based on a colored, textured 3-D ellipsoid that is progressively learned as the subject moves through the network. Second, we address camera handoff, that is, the tracking of multiple subjects between camera views when there is dead ground between those views. Third, we encode this on a GPU co-processor for real-time tracking of up to 10 subjects at up to 10 frames per second.
II. DETECTION OF A SUBJECT
Detection, illustrated in Fig. 1, is the initiation of a subject track, performed independently in each camera view. An initial state vector is established for subsequent tracking. Referring to Fig. 2, a uniform grid is generated on the ground plane. At each grid point, an ellipsoid of vertical diameter 1.7 m and horizontal diameter 0.5 m (an average for human subjects) is placed on the ground plane and projected into the camera view, from which a rectangular bounding box is generated. This accelerates detection, since the evidence for a subject at a grid point reduces to a count of foreground pixels within its bounding box.
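A minimal sketch of this projection step is given below. It assumes a calibrated pinhole camera with a 3x4 projection matrix P and approximates the projected outline of the ellipsoid by projecting sampled surface points rather than computing the exact conic; the grid extent and 0.25 m spacing are illustrative values, not those used in the paper.

```python
import numpy as np

def ellipsoid_bounding_box(P, centre, height=1.7, width=0.5, n=200):
    """Approximate image bounding box of an upright ellipsoid.

    P      : 3x4 camera projection matrix (assumed calibrated).
    centre : (x, y) ground-plane position of the grid point.
    Returns (umin, vmin, umax, vmax) in pixel coordinates.
    """
    # Sample points on the ellipsoid surface (vertical diameter 1.7 m,
    # horizontal diameter 0.5 m, resting on the ground plane z = 0).
    theta = np.random.uniform(0.0, np.pi, n)        # polar angle
    phi = np.random.uniform(0.0, 2.0 * np.pi, n)    # azimuth
    x = centre[0] + (width / 2.0) * np.sin(theta) * np.cos(phi)
    y = centre[1] + (width / 2.0) * np.sin(theta) * np.sin(phi)
    z = (height / 2.0) + (height / 2.0) * np.cos(theta)
    pts = np.stack([x, y, z, np.ones(n)])           # 4 x n homogeneous points

    # Project into the image and take the axis-aligned extent.
    uvw = P @ pts
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
    return u.min(), v.min(), u.max(), v.max()

# A uniform ground-plane grid of candidate positions (spacing assumed 0.25 m).
grid = [(gx, gy) for gx in np.arange(0.0, 10.0, 0.25)
                 for gy in np.arange(0.0, 10.0, 0.25)]
```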
At each frame and for each camera, background segmentation using a mixture of Gaussians (MoG) model [20] extracts active pixels that differ from the background. A kernel integral over each bounding box is then computed efficiently from the integral image [21].
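A sketch of this per-frame scoring is shown below, using OpenCV's MOG2 background subtractor as a stand-in for the MoG model of [20]; the score here is simply the foreground fraction inside each candidate bounding box, read from the integral image with four lookups. Box clipping and thresholds are omitted.

```python
import cv2
import numpy as np

subtractor = cv2.createBackgroundSubtractorMOG2()  # MoG background model (stand-in)

def score_boxes(frame, boxes):
    """Return the foreground fraction inside each (u0, v0, u1, v1) box."""
    fg = subtractor.apply(frame)                        # per-pixel foreground mask
    integral = cv2.integral((fg == 255).astype(np.uint8))  # summed-area table
    scores = []
    for u0, v0, u1, v1 in boxes:
        u0, v0, u1, v1 = int(u0), int(v0), int(u1), int(v1)
        # Sum of foreground pixels inside the box via four lookups.
        count = (integral[v1, u1] - integral[v0, u1]
                 - integral[v1, u0] + integral[v0, u0])
        area = max((u1 - u0) * (v1 - v0), 1)
        scores.append(count / area)
    return scores
```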
Fig. 4. Single frame from a single camera for the EM331 sequence, showing
a superimposition of the current state of the textured, colored ellipsoid for the
viewed subject.
Fig. 5. Evaluation with the PETS09 dataset. The first and second columns are result sequences from camera 1 and camera 3. From top to bottom are frames 700, 720, and 740. The numbers over each ellipse show the ID and height in metres.
TABLE I
ERROR AND THE NUMBER OF PARTICLES
We repeated each experiment 10 times to estimate the statistical average. Table I shows error rates as a ratio of the number
of the errors to the total number of subjects. The estimated number of trajectories in the datasets exceeds the true number of trajectories because our framework always terminates a tracker before it is distracted by other subjects; this fragmentation also gives a low switch-id rate. Arguably, this causes an artificial lowering of the MOTA and leads to an unfair comparison. However, the effect is not significant, less than 0.6%. Comparing our evaluation with the official PETS09 results, they allowed either a path to be broken on exit and reentry or the path at reentry to have the same id. We have followed the former protocol.
We compare these results with those reported for the same dataset [27] in Table II. Unofficially, the MOTA of our tracking algorithm is higher than that of the best reported method. First, the 3-D ellipsoid model allows observations from multiple cameras to be integrated effectively. The advantage of a calibrated over a
TABLE II
MOTA COMPARISON WITH RESULTS IN [27]
Of course, there are well-established problems in histogram comparison arising from variations in color, caused either by illumination changes or by different color balances between cameras. In our approach the color quantization is coarse, 1:16 of the color range, which allows for these variations at the expense of some loss of subtle color discrimination.
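To make the coarse quantization concrete, the sketch below bins each color channel into 16 levels (1:16 of the 0-255 range) and compares two appearance histograms with the Bhattacharyya coefficient. The comparison metric and the color-only histogram (the paper also uses texture) are our assumptions for illustration.

```python
import numpy as np

def coarse_histogram(pixels, bins=16):
    """Joint RGB histogram with `bins` levels per channel (16 = 1:16 of range)."""
    q = (np.asarray(pixels, dtype=np.uint16) * bins) // 256   # quantize each channel
    idx = q[:, 0] * bins * bins + q[:, 1] * bins + q[:, 2]    # joint bin index
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / max(hist.sum(), 1.0)

def bhattacharyya(h1, h2):
    """Similarity in [0, 1]; coarse bins make it tolerant to small color shifts."""
    return float(np.sum(np.sqrt(h1 * h2)))
```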
C. The Combined Cost Function
This is computed from the product of the spatial and texture cost functions in (17). There are two parameters: the blending ratio β, which determines the relative importance of the spatial term compared to the texture term, and a threshold τ, which controls the breakpoint of the ramp function. When the evidence to associate a subject with a trajectory is low, the ramp function returns zero:
f(i, j) = ramp( C_s(i, j)^β · C_t(i, j)^(1−β) ),  where ramp(x) = x if x ≥ τ and 0 otherwise.   (17)
Once we compute the combined cost of association of every
subject with every trajectory, we store these in a matrix. We
use the Hungarian method [32] to find the optimum matching
configuration. The numbers of subjects and trajectories are not
necessarily equal.
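A minimal sketch of this assignment step, using SciPy's Hungarian-method implementation, is given below. The cost matrix here holds association evidence (higher is better), so it is negated for the minimizing solver; the matrix may be rectangular, and the rejection threshold is a placeholder standing in for the ramp breakpoint.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(scores, min_score=0.3):
    """Match subjects (rows) to trajectories (columns).

    scores : (n_subjects, n_trajectories) array of association evidence;
             higher means a better match. The matrix may be rectangular.
    Returns a list of (subject_index, trajectory_index) pairs.
    """
    rows, cols = linear_sum_assignment(-np.asarray(scores, dtype=float))
    # Reject pairings whose evidence falls below the (assumed) threshold,
    # mirroring the ramp function that zeroes weak associations.
    return [(r, c) for r, c in zip(rows, cols) if scores[r][c] >= min_score]
```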
D. Evaluation in Nonoverlapping Camera Networks
We use the F-measure, defined in (20), which combines the recall and precision rates. It is computed from the number of correct assignments estimated by our algorithm and the total number of ground-truth assignments. We use both the public (Fig. 5) and our private (Fig. 6) datasets, each with camera calibration.
recall = N_c / N_g   (18)
precision = N_c / N_e   (19)
F = 2 · precision · recall / (precision + recall)   (20)

where N_c is the number of correct assignments, N_e is the total number of assignments made by the algorithm, and N_g is the total number of ground-truth assignments.
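A small helper, assuming the standard precision/recall definitions sketched above (the variable names are ours):

```python
def f_measure(n_correct, n_estimated, n_ground_truth):
    """Harmonic mean of precision and recall for identity assignments."""
    precision = n_correct / n_estimated if n_estimated else 0.0
    recall = n_correct / n_ground_truth if n_ground_truth else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Example: 17 of 20 estimated assignments correct, 22 ground-truth assignments.
print(f_measure(17, 20, 22))  # ~0.81
```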
Combining spatial and texture data association, there are three tuning parameters: the standard deviation of the motion model, the ramp threshold, and the blending factor. The effectiveness of camera handoff is critically dependent on the uniqueness of the attire (which affects the discriminative ability of texture comparison) and the regularity of movement (which affects spatial association). We used eight different subjects for the PETS09 data, and five for the DEC11 data, shown in Fig. 6(d).
TABLE III
CPU COMPUTATION TIME IN MILLISECONDS
Fig. 6. Example frames in the DEC11 dataset. The upper images are overlapping camera views with quite different scales whereas the lower view is disjoint
with no overlap. Five subjects with their labels are also shown. For the corresponding PETS09 dataset we used eight different subjects.
However, for DEC11, the best result was 71.4%. Using a low value of the blending factor reduces the effect of the spatial association, and the best results are indeed found at low values. This emphasizes the stronger visual cues that come from the textured and colored data, as motion in our dataset is much more irregular than in the PETS09 data.
Fig. 7. F-measure for spatial-texture data association, PETS09 (upper) and DEC11 (lower).
V. GPU IMPLEMENTATION
TABLE V
GPU ACCELERATION
A. Parallel Implementation
Fig. 8 shows schematically the activation of multiple threads
on a GeForce GTS250 with 128 cores of clock frequency
1.6 GHz. The loading of image data is still performed by the
host because direct memory transfer to the GPU is not possible.
The clock frequency of the host CPU is 3.1 GHz or about twice
the GPU clock frequency.
The Detection function operates on a fixed grid for each camera, and the projection of ellipsoids and the subsequent comparison with input data are independent. This suggests ready data parallelism, which has been implemented for foreground/background segmentation and computation of the integral image. The Likelihood function has been made parallel by exploiting the naturally independent structure of likelihood evaluation on each particle. Reduction is employed to select the maximum grid response in Detection and to perform the summation in Resampling. We have also used skip-ahead to split and accelerate random number generation, using multiple parallel chains to generate properly distributed samples in Transition. This employs the Sobol sequence [34] with the CURAND library [35].
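The sketch below illustrates the per-particle structure that makes this parallelization natural, written as a vectorized NumPy step rather than the actual CUDA kernels: transition and likelihood are evaluated independently for every particle (one thread per particle on the GPU), and resampling needs only a sum/cumulative-sum reduction. The Gaussian motion noise and the likelihood function are placeholders.

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood, motion_std=0.1, rng=None):
    """One predict-weigh-resample step over all particles at once.

    particles  : (N, D) state vectors (e.g., position, velocity, height).
    likelihood : function mapping (N, D) states to (N,) observation likelihoods;
                 each evaluation is independent, hence one GPU thread per particle.
    """
    if rng is None:
        rng = np.random.default_rng()

    # Transition: independent per particle (data-parallel on the GPU).
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)

    # Likelihood: also evaluated independently for every particle.
    weights = weights * likelihood(particles)

    # Resampling: the only coupling is a sum / cumulative-sum reduction.
    weights = weights / weights.sum()
    cdf = np.cumsum(weights)
    idx = np.searchsorted(cdf, rng.uniform(size=len(weights)))
    return particles[idx], np.full(len(weights), 1.0 / len(weights))
```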
The performance of any shared memory algorithm depends
critically on the memory hierarchy. Table IV shows the size and
between the CPU host and GPU, and so is not a simple linear
relationship.
VI. CONCLUSION
We have described a new approach to tracking human subjects in video sequences. First, we have introduced a parametric
ellipsoid model in both detection and visual tracking. For detection, this is projected at static grid positions to find intersections between potential subject positions and foreground image
data, as determined by mixture-of-Gaussians segmentation. For tracking, the ellipsoid is parameterized by position, velocity, and
height as part of the state vectors of a particle filter. As the subject moves, a 3-D appearance description using texture and color
is learned progressively. This allows us to integrate observations
from multiple cameras into the likelihood function.
We have shown that the texture and color signature can be
used for effective tracking of subjects, with multiple object
tracking accuracies on PETS benchmarks of greater than 90%.
Further, we have combined this signature with spatial data association to achieve F-measures (combining recall and precision) of between 60% and 85% when handing off between cameras with nonoverlapping views, depending on the nature
of the data sets. This measure is further improved by use of the
bijection property of the Hungarian method for assignment of
identities to subjects.
Detection and tracking are both computationally expensive.
Therefore, we have implemented a many-core version of our algorithms on an NVIDIA GTS250 GPU to achieve real-time tracking of up to 10 subjects at 10 frames per second. The speedup achieved with the GPU implementation is greatest for detection, peaking at 5.8 times. Overall, the speedup is 3.5 times for one benchmark dataset, but this figure does not allow for the lower clock speed of the GPU cores, so the algorithm performs better than the raw ratio suggests. Nevertheless, the GPU implementation does allow us to achieve the necessary real-time performance.
REFERENCES
[1] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 603–619, May 2002.
[2] A. Jepson, D. Fleet, and T. El-Maraghi, "Robust online appearance models for visual tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp. 1296–1311, Oct. 2003.
[3] L. Ma, J. Liu, J. Wang, J. Cheng, and H. Lu, "An improved silhouette tracking approach integrating particle filter with graph cuts," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Mar. 2010, pp. 1142–1145.
[4] G. Li, W. Qu, and Q. Huang, "A multiple targets appearance tracker based on object interaction models," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 3, pp. 450–464, Mar. 2012.
[5] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, "Multicamera people tracking with a probabilistic occupancy map," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 267–282, Feb. 2008.
[6] Z. Husz, A. Wallace, and P. Green, "Tracking with a hierarchical partitioned particle filter and movement modelling," IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 41, no. 6, pp. 1571–1584, Dec. 2011.
[7] J. Brown and D. Capson, "A framework for 3D model-based visual tracking using a GPU-accelerated particle filter," IEEE Trans. Vis. Comput. Graphics, vol. 18, no. 1, pp. 68–80, Jan. 2012.
[8] F. Yan, A. Kostin, W. Christmas, and J. Kittler, "A novel data association algorithm for object tracking in clutter with application to tennis video analysis," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Washington, DC, 2006, vol. 1, pp. 634–641.
[9] Z. Kalal, K. Mikolajczyk, and J. Matas, "Face-TLD: Tracking-learning-detection applied to faces," in Proc. 17th IEEE Int. Conf. Image Process., Sep. 2010, pp. 3789–3792.
[10] G. Cielniak and T. Duckett, "People recognition by mobile robots," J. Intell. Fuzzy Syst., vol. 15, pp. 21–27, 2004.
[11] H.-B. Kim and K.-B. Sim, "A particular object tracking in an environment of multiple moving objects," in Proc. Int. Conf. Control Autom. Syst., Oct. 2010, pp. 1053–1056.
[12] Y. Tang, Y. Li, T. Bai, X. Zhou, and Z. Li, "Human tracking in thermal catadioptric omnidirectional vision," in Proc. IEEE Int. Conf. Inf. Autom., Jun. 2011, pp. 97–102.
[13] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005, vol. 1, pp. 886–893.
[14] H. Ma, C. Zeng, and C. X. Ling, "A reliable people counting system via multiple cameras," ACM Trans. Intell. Syst. Technol., vol. 3, no. 2, pp. 31:1–31:22, Feb. 2012.
[15] J. Yao and J. M. Odobez, Multi-Person Bayesian Tracking With Multiple Cameras. New York: Elsevier, 2009, ch. 15, p. 363.
[16] K. Pauwels, M. Tomasi, J. Diaz, E. Ros, and M. M. Van Hulle, "A comparison of FPGA and GPU for real-time phase-based optical flow, stereo, and local image features," IEEE Trans. Comput., vol. 61, no. 7, pp. 999–1011, Jul. 2012.
[17] M. Wojcikowski, R. Zaglewski, and B. Pankiewicz, "FPGA-based real-time implementation of detection algorithm for automatic traffic surveillance sensor network," J. Signal Process. Syst., vol. 68, no. 1, pp. 1–18, Jan. 2012.
[18] J. Sanchez, G. Benet, and J. Simo, "Video sensor architecture for surveillance applications," Sensors, vol. 12, pp. 1509–1528, 2012.
[19] S. N. Sinha, J.-M. Frahm, M. Pollefeys, and Y. Genc, "Feature tracking and matching in video using programmable graphics hardware," Mach. Vis. Appl., vol. 22, pp. 207–217, 2011.
[20] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 1999, vol. 2, p. 252.
[21] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2001, vol. 1, pp. 511–518.
[22] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter: Particle Filters for Tracking Applications. London, U.K.: Artech House, 2004.
[23] P. J. Schneider and D. H. Eberly, Geometric Tools for Computer Graphics. New York: Elsevier, 2003.
[24] J. M. Ferryman, "PETS: Performance evaluation of tracking and surveillance," Univ. Reading, 2007 [Online]. Available: http://www.cvg.rdg.ac.uk/slides/pets.html
[25] J. Ferryman and A. Shahrokni, "PETS2009: Dataset and challenge," in Proc. 12th IEEE Int. Workshop Performance Eval. Track. Surveill., 2009, pp. 1–6.
[26] K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: The CLEAR MOT metrics," J. Image Video Process., vol. 2008, pp. 1–10, 2008.
[27] A. Ellis, A. Shahrokni, and J. Ferryman, "PETS2009 and Winter-PETS 2009 results: A combined evaluation," in Proc. 12th IEEE Int. Workshop Performance Eval. Track. Surveill. (PETS-Winter), Dec. 2009, pp. 1–8.
[28] J. Berclaz, A. Shahrokni, F. Fleuret, J. Ferryman, and P. Fua, "Evaluation of probabilistic occupancy map people detection for surveillance systems," in Proc. IEEE Int. Workshop Performance Eval. Track. Surveill., 2009.
[29] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, "Robust tracking-by-detection using a detector confidence particle filter," in Proc. IEEE 12th Int. Conf. Comput. Vis., 2009, pp. 1515–1522.
[30] D. Arsic, A. Lyutskanov, G. Rigoll, and B. Kwolek, "Multi camera person tracking applying a graph-cuts based foreground segmentation in a homography framework," in Proc. 12th IEEE Int. Workshop Performance Eval. Track. Surveill. (PETS-Winter), 2009, pp. 1–8.
[31] J. Berclaz, F. Fleuret, and P. Fua, "Multiple object tracking using flow linear programming," in Proc. 12th IEEE Int. Workshop Performance Eval. Track. Surveill. (PETS-Winter), Dec. 2009, pp. 1–8.
[32] J. Munkres, "Algorithms for the assignment and transportation problems," J. Soc. Ind. Appl. Math., vol. 5, no. 1, pp. 32–38, 1957.
[33] CUDA C Programming Guide, ver. 3.2, NVIDIA, 2010.
[34] I. Sobol, "The distribution of points in a cube and the approximate evaluation of integrals," USSR Comput. Math. Math. Phys., vol. 7, pp. 86–112, 1967.
[35] CURAND Guide, NVIDIA, 2012.
Andrew Wallace received the B.Sc. and Ph.D. degrees from the University of Edinburgh, Edinburgh, U.K., in 1972 and 1975, respectively.
He is a Professor of Signal and Image Processing at Heriot-Watt University. His research interests include vision, image and signal processing, and parallel, many-core architectures. He has published extensively, receiving a number of best paper and other awards, and has secured funding from EPSRC, the EU, and other industrial and government sponsors.
Dr. Wallace is a Chartered Engineer and a Fellow of the Institution of Engineering and Technology.

Greg Michaelson received the B.A. degree in computer science from the University of Essex, Essex, U.K., in 1973, the M.Sc. degree in computational science from the University of St. Andrews, St. Andrews, U.K., in 1982, and the Ph.D. degree from Heriot-Watt University, Edinburgh, U.K., in 1993.
He is Professor of Computer Science at Heriot-Watt University. His research expertise is in programming language design, implementation, and analysis for multi-processor systems. He has enjoyed EPSRC, DTC, and EU support for his research and has published widely.
Dr. Michaelson is a Chartered Engineer, a Fellow of the British Computer Society, and an elected board member of the European Association for Programming Languages and Systems.