
IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 3, NO. 2, JUNE 2013

Real-Time People Tracking in a Camera Network


Wasit Limprasert, Andrew Wallace, and Greg Michaelson

Abstract: We present an approach to track several subjects


from video sequences acquired by multiple cameras in real time.
We address the key concerns of real time performance and continuity of tracking in overlapping and nonoverlapping fields of
view. Each human subject is represented by a parametric ellipsoid
having a state vector that encodes its position, velocity and height.
We also encode visibility and persistence to tackle problems of
distraction and short-period occlusion. To improve likelihood
computation from different viewpoints, including the relocation
of subjects after network blind spots, the colored and textured
surface of each ellipsoid is learned progressively as the subject
moves through the scene. This is combined with the information
about subject position and velocity to perform camera handoff.
For real time performance, the boundary of the ellipsoid can be
projected several hundred times per frame for comparison with
the observation image. Further, our implementation employs a
particle filter, developed for parallel implementation on a graphics
processing unit. We have evaluated our algorithm on standard
data sets using metrics for multiple object tracking accuracy
(MOTA) and speed of processing, and can show significant improvements in comparison with published work.
Index Terms: Camera network, graphics processing unit (GPU)
implementation, particle filter, visual tracking.

I. INTRODUCTION

VIDEO analytics includes the observation and tracking of


human (and other) subjects as they move within the field
of view of camera networks. For example, CCTV surveillance
is used to record and counteract criminal acts in town centres,
public buildings and transport termini, for traffic monitoring to
apply congestion charging, for road planning, and to observe
shopping patterns in a supermarket. In the context of CCTV, we
aim to deploy economically distributed sensing and computation with cooperative communication, for example to track an
identified individual from camera to camera in an urban setting.
Real time processing is a priority if responsive action is required
before rather than after an event.
A. Detection and Tracking
In general, there are two necessary activities, detection and
tracking. It is necessary to detect a new presence, caused for example by entry into the field of view or reemergence from occlusion or blind areas. Detection must search the entire observable
Manuscript received January 29, 2013; accepted March 18, 2013. Date of
publication April 24, 2013; date of current version June 07, 2013. This work
was supported by the Royal Thai Government and the European Community
(Framework 7, Collaborative under Project 260101, Locobot). This paper was
recommended by Guest Editor H. Aghajan.
W. Limprasert and G. Michaelson are with the School of Mathematical and
Computer Sciences, Heriot-Watt University, EH14 4AS Edinburgh, U.K.
A. Wallace is with the School of Engineering and Physical Sciences, Heriot-Watt University, EH14 4AS Edinburgh, U.K.
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JETCAS.2013.2256820

space, and thus can have considerable complexity. Tracking is


an estimation process, which includes prediction and evaluation. Prediction includes a motion model that limits the search
space for evaluation of subject presence in succeeding video
frames. Evaluation requires a comparison between the prediction of position, velocity and appearance, and the observed data.
State estimators can be classified as maximum likelihood (ML), maximum a posteriori (MaP), Bayesian estimation, or data association methods. The ML method estimates the
state from the current observations without prior knowledge
e.g., [1]. The state vector is estimated by finding the optimal
state which maximizes a likelihood function which includes the
appearance descriptor. The MaP method, e.g., [2], [3], is similar to ML but the estimated state is computed by optimization,
and prior knowledge of the state is used to calculate the posterior
state variables. Bayesian estimation preserves both the prior and
posterior probability distributions, and is better able to predict
future probability distributions and sampling strategies. This has
been used extensively in visual tracking, e.g., [4]-[6], notably
through particle filtering for which GPU implementations are
already available, e.g., [7].
B. Multiple Subjects and Camera Handoff
For multiple subject tracking, there is the additional problem
of data association, both within and between camera views,
linking detected subjects with associated state vectors to trajectories. Detections in space and time can be linked to previously
known subjects using trajectories by the shortest-path, e.g.,
[8], or by appearance, e.g., [9], [10]. There are many forms
of appearance descriptor, based for example on silhouettes,
edges, and boundaries, e.g., [11], [12], pattern descriptors,
e.g., [13], [14], or color features and histograms, e.g., [1]. 3-D
descriptors allow the system to better combine likelihoods from
multiple cameras, e.g., [7], [15], but require more memory and
computation.
C. Hardware Acceleration Using GPUs
Compared with FPGAs, GPUs are more restricted in their capability for deployment in safety-critical areas, and they have relatively high power consumption. However, they do have the advantages of a greatly reduced
development time and higher levels of abstraction, offering
many more cores than a general multicore architecture [16].
Wojcikowski et al. [17] have implemented an FPGA detection algorithm for traffic surveillance as part of a sensor network. Vehicle motion is much more regular, and the consequent
solution is implemented as repeated detection using static data
structures with a high degree of pipelined data parallelism. Occlusion is not considered. Sanchez et al. [18] used a hybrid architecture of DSP and FPGA devices for video surveillance. The

2156-3357/$31.00 2013 IEEE

264

IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, VOL. 3, NO. 2, JUNE 2013

Fig. 2. Detection grid and ellipsoid projection.

Fig. 1. Schematic of Detection module.

processing was viewed as a pipeline of processes from segmentation using background subtraction, color, and motion through
to person detection and classification as people or other objects.
However this was based only on silhouettes and is unlikely to be
robust. Sinha [19] implemented a parallel feature detector and
tracker on a GPU, achieving speedups of approximately 10 times compared with an optimized CPU implementation. This is comparable
to the performance improvement we expect in our work, but we
address a rather different problem of detection and tracking of
extended bodies.
In this paper, we describe a Bayesian approach to track multiple human subjects within a camera network that can have
both overlapping and nonoverlapping fields of view. The contributions are as follows. First, we introduce a new form of appearance model, based on a colored, textured 3-D ellipsoid that
is progressively learned as the subject moves through the network. Second, we address camera handoff, that is the tracking
of multiple subjects between camera views when there is dead ground between them. Third, we implement this on a GPU co-processor for real-time tracking of up to 10 subjects at up to 10 frames per second.
II. DETECTION OF A SUBJECT
Detection, illustrated in Fig. 1, is the initiation of a subject track and is performed independently in each camera view. An initial state vector is established for subsequent tracking. Referring to Fig. 2, a uniform grid is generated on the ground plane. At each grid point, an ellipsoid of vertical diameter 1.7 m and horizontal diameter 0.5 m (an average for human subjects) is placed and projected into the camera image, and a rectangular bounding box is generated around the projection. This accelerates detection, reducing it to a computation over the foreground pixels within each box.
At each frame and for each camera, background segmentation using a mixture of Gaussians (MoG) model [20] extracts active pixels that differ from the background. A kernel integral over each bounding box is computed from the integral image [21].

The detection response at each grid point is the ratio between the foreground area and the total box area. A maximum response that is also above a heuristic threshold of 0.65 indicates a new subject.
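As a concrete illustration of this step, the sketch below scores each grid point from an integral image of the binary foreground mask and applies the 0.65 threshold described above. The Box structure, the data layout, and the function names are our own illustrative choices, not the paper's code.

```cpp
#include <cstdint>
#include <vector>

// Axis-aligned bounding box of a projected ellipsoid, in pixel coordinates.
struct Box { int x0, y0, x1, y1; };  // inclusive-exclusive corners

// Sum of foreground pixels inside a box, from an integral image of the
// binary foreground mask. The integral image is (w+1) x (h+1), row-major.
inline uint32_t boxSum(const std::vector<uint32_t>& integral, int w, const Box& b) {
  auto at = [&](int x, int y) { return integral[y * (w + 1) + x]; };
  return at(b.x1, b.y1) - at(b.x0, b.y1) - at(b.x1, b.y0) + at(b.x0, b.y0);
}

// Detection response per grid point: ratio of foreground pixels to box area.
// Returns the index of the best grid point, or -1 if no response exceeds
// the heuristic threshold of 0.65.
int detectSubject(const std::vector<uint32_t>& integral, int w,
                  const std::vector<Box>& gridBoxes) {
  const double threshold = 0.65;
  int best = -1;
  double bestResponse = threshold;
  for (size_t g = 0; g < gridBoxes.size(); ++g) {
    const Box& b = gridBoxes[g];
    double area = double(b.x1 - b.x0) * double(b.y1 - b.y0);
    double response = boxSum(integral, w, b) / area;
    if (response > bestResponse) { bestResponse = response; best = int(g); }
  }
  return best;
}
```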
III. TRACKING IN AN OVERLAPPING CAMERA NETWORK
When a subject tracker is activated, new state variables encode position, velocity, visibility, and persistence. The prior density of position is assumed to be a uniform distribution within a 1 m radius on the ground plane. The prior density of velocity is a uniform distribution with magnitude between 0 and 2 m/s in all directions. Visibility is a boolean variable
that expresses whether the subject is observed by any camera.
Persistence is the continuity of tracking measured in time; this
determines whether to terminate or continue tracking a subject.
On activation, visibility is set to 1 and the persistence level is
maximum.
Tracking (Fig. 3) employs a sequential importance resampling (SIR) particle filter (PF) [22]. This expresses the prior
probability distribution by a set of particles, each indexed by its particle number and its tracker number. The collection of all particles represents the probability density of the state variables for all subjects.
The SIR-PF has transition, likelihood, and resampling functions. The transition function generates prior state particles from the previous iteration as shown in (1)-(4), where the position and velocity components are perturbed by a Wiener process with given standard deviations, and the visibility component is drawn from a Bernoulli process that takes one of its two values with probability 0.25 and 0.75, respectively

(1)
(2)
(3)
(4)
The likelihood is based on the similarity between a synthetic
appearance generated from a particle state using the ellipsoid
model and the observation image. Resampling draws samples
based on importance sampling and is implemented by inversion
sampling [22].
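A minimal sketch of one SIR iteration in the spirit of (1)-(4) and the inversion resampling of [22] is given below. The state layout, the time step, and the noise standard deviations are illustrative assumptions; the paper's exact transition equations are not reproduced.

```cpp
#include <algorithm>
#include <random>
#include <vector>

struct Particle {
  double x, y;      // ground-plane position (m)
  double vx, vy;    // velocity (m/s)
  double h;         // height (m)
  bool   visible;   // visibility flag
  double weight;    // importance weight
};

// Transition: propagate each particle with its velocity and add process noise.
void transition(std::vector<Particle>& ps, double dt, std::mt19937& rng) {
  std::normal_distribution<double> posNoise(0.0, 0.1);   // illustrative sigmas
  std::normal_distribution<double> velNoise(0.0, 0.05);
  std::bernoulli_distribution flip(0.25);                // flip visibility w.p. 0.25
  for (auto& p : ps) {
    p.x += p.vx * dt + posNoise(rng);
    p.y += p.vy * dt + posNoise(rng);
    p.vx += velNoise(rng);
    p.vy += velNoise(rng);
    if (flip(rng)) p.visible = !p.visible;
  }
}

// Resampling by inversion of the cumulative weight distribution.
void resample(std::vector<Particle>& ps, std::mt19937& rng) {
  std::vector<double> cdf(ps.size());
  double sum = 0.0;
  for (size_t i = 0; i < ps.size(); ++i) { sum += ps[i].weight; cdf[i] = sum; }
  std::uniform_real_distribution<double> u(0.0, sum);
  std::vector<Particle> out(ps.size());
  for (auto& o : out) {
    size_t i = std::lower_bound(cdf.begin(), cdf.end(), u(rng)) - cdf.begin();
    o = ps[i];
    o.weight = 1.0 / ps.size();   // equal weights after resampling
  }
  ps.swap(out);
}
```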


Fig. 3. Schematic of tracking module.

A. Computing the Likelihood


We aim to maintain identity of tracked subjects within and
between views, even when these do not overlap. Therefore,
we require a component of likelihood that can be learned and
maintained. The likelihood and detection functions are the
most time-complex components of our system; therefore we
use a methodology that is suitable for hardware acceleration.
Likelihood is based on the correspondence between projected
ellipses [23] and foreground pixels (silhouette) determined by
the MoG model and a comparison of the mapped surface color
or texture (hereafter texture) with the image pixels. An additional term suppresses distraction when two or more subjects
are in close proximity, perhaps occluded.
This use of 3-D projection of a parameterized model, combined with the adaptive learning of the surface texture and color
are key contributions of the work. The parameterized model allows faster computation of likelihood than is possible by vertex
projection, for example. Combined with GPU implementation
it gives us real time capability. Adaptive learning on the ellipsoid surface means that we can compute likelihood from any
viewpoint, i.e., views are not all treated identically, as is common with ellipse models. For camera handoff, a subject
appearing or reappearing in any camera view can be compared
against any known model. This is added to the trajectory data to
improve correspondence in disjoint networks.
1) Silhouette Likelihood: The ellipsoid is a closed surface in 3-D space. Any surface coordinate is mapped to a cylindrical coordinate system, so that the ellipse boundary projected into each camera image is defined by the cylinder azimuth and elevation together with the camera calibration. The rays from all active foreground pixels are traced and their intersections with the ellipsoid surface are computed quickly. The number of intersections as a fraction of the ellipse area forms the silhouette likelihood.
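The geometric core of this computation can be sketched as a ray-ellipsoid intersection test: scale space so the ellipsoid becomes a unit sphere and check the discriminant of the resulting quadratic. The cylindrical-coordinate mapping and the camera calibration used in the paper are abstracted away here; the types and names are illustrative only.

```cpp
#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

// Ray-ellipsoid intersection test for an axis-aligned ellipsoid centred at c
// with horizontal semi-axis a and vertical semi-axis b. The ray starts at the
// camera centre o and has direction d (not necessarily normalized).
bool hitsEllipsoid(const Vec3& o, const Vec3& d, const Vec3& c, double a, double b) {
  // Scale space so the ellipsoid becomes a unit sphere, then solve the
  // quadratic |o' + t d'|^2 = 1 and check the discriminant.
  Vec3 os{(o.x - c.x) / a, (o.y - c.y) / a, (o.z - c.z) / b};
  Vec3 ds{d.x / a, d.y / a, d.z / b};
  double A = ds.x * ds.x + ds.y * ds.y + ds.z * ds.z;
  double B = 2.0 * (os.x * ds.x + os.y * ds.y + os.z * ds.z);
  double C = os.x * os.x + os.y * os.y + os.z * os.z - 1.0;
  return B * B - 4.0 * A * C >= 0.0;
}

// Silhouette likelihood: fraction of the projected ellipse area covered by
// foreground pixels whose back-projected rays hit the ellipsoid.
double silhouetteLikelihood(const std::vector<Vec3>& foregroundRays,
                            const Vec3& camCentre, const Vec3& centre,
                            double a, double b, double ellipseAreaPx) {
  double hits = 0.0;
  for (const auto& d : foregroundRays)
    if (hitsEllipsoid(camCentre, d, centre, a, b)) hits += 1.0;
  return hits / ellipseAreaPx;
}
```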
2) Texture Likelihood: The texture distribution is an accumulated average over time, acquired by mapping from the image to the ellipsoid surface coordinates. The facing angle of the subject is assumed to be in the direction of velocity, which defines the azimuth. Hence, every foreground pixel gives an observation of the surface used to update the texture model. Similar to adaptive background learning [20], the texture is represented by a multiple-Gaussian distribution of each color channel, with a per-channel mean and variance. As an example, Fig. 4 shows one frame of a video sequence in which the current ellipsoid model for the viewed subject is superimposed.

Fig. 4. Single frame from a single camera for the EM331 sequence, showing a superimposition of the current state of the textured, colored ellipsoid for the viewed subject.

A chi-squared test checks for agreement between the color values of an observed pixel and the corresponding color values of the current texture model, represented by the per-channel mean and variance. The test has three degrees of freedom and a p-value of 50%

(5)
The texture matching ratio is the summation of the number of matched pixels. To compute it, every foreground pixel is transformed to ellipsoid surface coordinates by using the ellipsoid projection and the particle state
(6)
At detection, the texture model is initialized with high color
variances so that all observed pixels match the texture model.
As the model is learned, the variance is reduced and mismatched
observations contribute to the likelihood.
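A simplified sketch of the per-pixel test and the running texture update is shown below, assuming a single Gaussian per color channel rather than the full multiple-Gaussian model of [20]. With three channels (three degrees of freedom) and a p-value of 50%, the acceptance threshold is the chi-squared median, approximately 2.37; the learning rate alpha is an assumed parameter.

```cpp
#include <cmath>

// Per-channel running Gaussian model of one cell of the ellipsoid surface texture.
struct TexCell {
  double mean[3];   // per-channel mean (e.g., RGB)
  double var[3];    // per-channel variance
};

// Chi-squared agreement test between an observed pixel and the texture cell.
// Three color channels give three degrees of freedom; at a p-value of 50%
// the acceptance threshold is the chi-squared median, roughly 2.37.
bool matchesTexture(const TexCell& cell, const double pixel[3]) {
  const double kChi2Median3 = 2.366;
  double stat = 0.0;
  for (int c = 0; c < 3; ++c) {
    double d = pixel[c] - cell.mean[c];
    stat += d * d / cell.var[c];
  }
  return stat <= kChi2Median3;
}

// Running update of the texture model, in the spirit of adaptive background
// learning: blend the observation into the mean and variance with rate alpha.
void updateTexture(TexCell& cell, const double pixel[3], double alpha = 0.05) {
  for (int c = 0; c < 3; ++c) {
    double d = pixel[c] - cell.mean[c];
    cell.mean[c] += alpha * d;
    cell.var[c] = (1.0 - alpha) * cell.var[c] + alpha * d * d;
  }
}
```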
3) Distraction Suppression: Distraction is caused by occlusion in one or more camera views when projected subject ellipses intersect in the image plane. An ownership probability, defined by the number of ellipsoids that meet the ray generated from a foreground pixel, determines the contribution of that pixel to the subject likelihood by sharing the observation equally among the subjects of interest. The silhouette filling ratio is computed from

(7)

We also consider the existence of foreground pixels outside the ellipse, as these nonaligned pixels indicate an error in a predicted particle state. A nonaligned particle state is penalized by the negative filling ratio, computed from

(8)


4) Combined Likelihood for a Single Camera: The combined likelihood is a linear combination of the silhouette and texture likelihoods, incorporating distraction suppression, as defined in (9) with its associated weighting factors

(9)

5) Combined Likelihood for Multiple Cameras: The likelihoods from individual cameras are evaluated for reliability by (11) and (12) before forming a combined likelihood by (10)

(10)
(11)
(12)
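Since (9)-(12) are not reproduced above, the sketch below shows only one plausible reading: a weighted linear blend of the silhouette and texture terms with the distraction penalty per camera, followed by a reliability-weighted average across cameras. All weights, names, and the exact combination rule are assumptions for illustration, not the paper's definitions.

```cpp
#include <vector>

// Per-camera likelihood terms for one particle (names are ours).
struct CamTerms {
  double silhouette;   // silhouette filling ratio
  double texture;      // texture matching ratio
  double negFill;      // negative (nonaligned) filling ratio penalty
  double reliability;  // per-camera reliability weight, cf. (11)-(12)
};

// Single-camera combination: a weighted linear blend of silhouette and
// texture terms with the distraction penalty, in the spirit of (9).
double singleCameraLikelihood(const CamTerms& t, double wSil, double wTex, double wNeg) {
  return wSil * t.silhouette + wTex * t.texture - wNeg * t.negFill;
}

// Multi-camera combination: reliability-weighted average across cameras,
// one plausible reading of (10); the paper's exact weighting is not shown here.
double multiCameraLikelihood(const std::vector<CamTerms>& cams, double wSil,
                             double wTex, double wNeg) {
  double num = 0.0, den = 0.0;
  for (const auto& c : cams) {
    num += c.reliability * singleCameraLikelihood(c, wSil, wTex, wNeg);
    den += c.reliability;
  }
  return den > 0.0 ? num / den : 0.0;
}
```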

B. Evaluation: Overlapping Network


We have measured the accuracy, speed of computation and
execution profile. The sequential implementation is coded in
C++ and tested on a single core of an Intel Core i7 950 CPU, clock frequency 3.1 GHz. We have used several public datasets
including PETS03 and PETS06 [24], PETS09 [25] and two of
our own datasets, EM330 and DEC11. As an example, Fig. 5
shows three frames from two views in the PETS09 data set. To
set the several weighting factors in (10), we used the PETS09
data as a training set, choosing the optimal set to obtain the
highest MOTA, but then used these parameters without change across all the datasets in Table I. The other necessary factors in (1)-(8) were estimated from the same trial data [e.g., the noise values in (1) and (2)] or from the known calibration data [e.g., the ellipse parameters in (6)].
During tracking, we generate a history of the position of
each subject for comparison with ground-truth, created manually by mapping foot locations from pixel to ground plane
coordinates. We use the standard multiple object tracking accuracy (MOTA) [26], computed by summing the numbers of false alarms, misdetections, and identity switches, dividing by the total number of subjects in each frame, and expressing the result as a percentage

(13)

Fig. 5. Evaluation with the PETS09 dataset. The first and second columns are result sequences from camera 1 and camera 3. From top to bottom are frames 700, 720, and 740. The numbers over each ellipse show the ID and height in metres.

TABLE I
ERROR AND THE NUMBER OF PARTICLES
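For reference, the CLEAR MOT accuracy from [26], which (13) instantiates, can be written in its standard form as follows; the symbols here are generic, not the paper's own notation.

```latex
% CLEAR MOT accuracy [26]: fp_t = false alarms, m_t = misdetections,
% sw_t = identity switches, g_t = ground-truth subjects in frame t.
\mathrm{MOTA} = \left(1 - \frac{\sum_t \left(fp_t + m_t + sw_t\right)}{\sum_t g_t}\right) \times 100\%
```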

We repeated each experiment 10 times to estimate the statistical average. Table I shows error rates as a ratio of the number
of the errors to the total number of subjects, together with the estimated and true numbers of trajectories in each dataset. Our framework always terminates a tracker before it is distracted by other subjects. This inflates the estimated number of trajectories relative to the true number, but gives a low switch-id rate. Arguably,
this causes an artificial lowering of the MOTA and leads to unfair comparison. However, the effect is not significant, less than
0.6%. Comparing our evaluation with the official PETS09 results, they allowed either a path to be broken on exit and reentry

or the path at reentry to have the same id. We have followed the
former protocol.
We compare these results with those reported for the same
dataset [27] in Table II. Unofficially, the MOTA of our tracking
algorithm is higher than that of the best reported method. First, the 3-D ellipsoid model allows observations from multiple cameras to be integrated effectively. The advantage of a calibrated over a noncalibrated system is shown in Table II (comparison between [28] and [29]). Second, our method uses detection to activate the PF and then the 3-D ellipsoid with an evolving texture signature to estimate the state variables. Thus, the state variables and texture signature are combined into a single subject representation for likelihood calculation, unlike [29], for example, where state variables and appearance are considered separately. Other authors have also linked detected subjects to trajectories based on 2-D features, which is difficult to extend from a single camera to multiple cameras. Here, using the ellipsoidal texture learning method, the signatures are defined consistently across all cameras, resulting in the higher MOTA values.

TABLE II
MOTA COMPARISON WITH RESULTS IN [27]
IV. CAMERA HANDOFF: TRACKING IN A DISJOINT NETWORK
Whereas most authors consider continued viewing, we now
address the camera handoff problem, tracking and identifying a
subject when he/she has disappeared from sight to reappear in
at least one camera view. When that subject reappears, this is
registered as a detection, and so the subject must be identified
either as a new or a previously viewed subject. We use spatial and texture information to link the broken trajectory where possible and to assign an existing or a new ID.
A. The Spatial Cost Function
This is defined by a 2-D normal distribution in position, in which the variances are proportional to the elapsed time, following the stochastic motion model of (1) and (2). The elapsed time is the period between disappearance and reappearance. The spatial cost function to associate a detected subject with a trajectory is defined by (14); there is a single user-defined parameter, the standard deviation of the motion model

(14)

B. The Texture Cost Function

This is computed from the similarity between a detected subject and each previously known subject. Since the facing angle estimation is not maintained outside the network field of view, we consider only the vertical information: the texture pixels at the same height are combined into a 32-row histogram. The histogram signature represents the probability distribution of the texture in the color and vertical space, and the texture similarity between the subject and the trajectory is the inner product defined by (15)

(15)

Since there is only a single view of a new subject, this signature cannot be complete along one azimuthal row of the ellipsoid, so the similarity is approximated by the importance sampling method in (16)

(16)

Of course, there are well-established problems in histogram comparison caused by variations in color due either to illumination changes or to different color balances between cameras. In our approach the color quantization is coarse, 1:16 of the color range, which allows for these variations at the expense of some loss of subtle color discrimination.
C. The Combined Cost Function
This is computed from the product of the spatial and texture cost functions in (17). There are two parameters: a blending ratio that determines the relative importance of the spatial term compared with the texture term, and a threshold that controls the breakpoint of a ramp function. When the evidence to associate a subject and a trajectory is low, the ramp function returns zero

(17)
Once we compute the combined cost of association of every
subject with every trajectory, we store these in a matrix. We
use the Hungarian method [32] to find the optimum matching
configuration. The numbers of subjects and trajectories are not
necessarily equal.
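To make the handoff pipeline concrete, the sketch below scores subject-trajectory pairs with a Gaussian spatial term, a histogram inner-product texture term, and a blended product with a ramp threshold, then performs the assignment. The exact forms of (14)-(17) are not reproduced: the blending rule, the ramp, and the parameter names are assumptions, and a simple greedy pass stands in for the Hungarian method [32] purely to keep the example self-contained.

```cpp
#include <cmath>
#include <vector>

// Spatial association score: a 2-D Gaussian in position whose variance grows
// with the elapsed time dt; sigma is the single user-defined parameter.
// Only a sketch of the idea behind (14), not its exact form.
double spatialScore(double dx, double dy, double dt, double sigma) {
  double var = sigma * sigma * dt;
  return std::exp(-(dx * dx + dy * dy) / (2.0 * var));
}

// Texture association score: inner product of two normalized 32-row color
// histograms, in the spirit of (15).
double textureScore(const std::vector<double>& a, const std::vector<double>& b) {
  double s = 0.0;
  for (size_t i = 0; i < a.size() && i < b.size(); ++i) s += a[i] * b[i];
  return s;
}

// Combined score with blending ratio alpha and ramp threshold tau, cf. (17):
// a blended product clamped to zero when the evidence is weak.
double combinedScore(double spatial, double texture, double alpha, double tau) {
  double c = std::pow(spatial, alpha) * std::pow(texture, 1.0 - alpha);
  return c < tau ? 0.0 : c;
}

// Assignment over the subject-by-trajectory score matrix. In the paper this
// matrix is passed to the Hungarian method [32]; a greedy pass stands in here.
std::vector<int> assignGreedy(const std::vector<std::vector<double>>& score) {
  std::vector<int> match(score.size(), -1);  // trajectory index per subject
  std::vector<bool> used(score.empty() ? 0 : score[0].size(), false);
  for (size_t s = 0; s < score.size(); ++s) {
    double best = 0.0; int bestJ = -1;
    for (size_t j = 0; j < score[s].size(); ++j)
      if (!used[j] && score[s][j] > best) { best = score[s][j]; bestJ = int(j); }
    if (bestJ >= 0) { match[s] = bestJ; used[bestJ] = true; }
  }
  return match;  // -1 means the subject receives a new ID
}
```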
D. Evaluation in Nonoverlapping Camera Networks
We use the F-measure, which combines the recall and precision rates as defined in (20). This is computed from the number of correct assignments estimated by our algorithm and the total number of ground-truth assignments. We use both the public (Fig. 5) and our private datasets (Fig. 6), with camera calibration

(18)

(19)
(20)
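In their standard form, which (18)-(20) presumably instantiate with the paper's own symbols, the recall, precision, and F-measure are:

```latex
% Standard recall, precision, and F-measure; N_c = correct assignments found
% by the algorithm, N_g = ground-truth assignments, N_e = estimated assignments.
\mathrm{recall} = \frac{N_c}{N_g}, \qquad
\mathrm{precision} = \frac{N_c}{N_e}, \qquad
F = \frac{2\,\mathrm{precision}\cdot\mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```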
Combining spatial and texture data association, there are three tuning parameters: the standard deviation of the motion model, the threshold, and the blending factor. The
effectiveness of camera handoff is critically dependent on the
uniqueness of the attire (which affects the discriminative ability
of texture comparison) and the regularity of movement (which
affects spatial association). We used eight different subjects
for the PETS09 data, and five for the DEC11 data, shown in
Fig. 6(d).


TABLE III
CPU COMPUTATION TIME IN MILLISECONDS

Fig. 6. Example frames in the DEC11 dataset. The upper images are overlapping camera views with quite different scales, whereas the lower view is disjoint with no overlap. Five subjects with their labels are also shown. For the corresponding PETS09 dataset we used eight different subjects.

Fig. 7. F-measure for spatial-texture data association, PETS09 (upper). F-measure for spatial-texture data association, DEC11 (lower).

Using the cost function in (17), we obtain the results shown in Fig. 7. For PETS09, the F-measure was highest at 84.6%; for DEC11, the best result was 71.4%. Using a low value of the blending ratio reduces the effect of the spatial association, and the best results are indeed found at low values. This emphasizes the stronger visual cues that come from the textured and colored data, as motion in our data set is much more irregular than in the PETS09 data.
V. GPU IMPLEMENTATION


We have employed a graphics processing unit (GPU) and the CUDA development platform [33] to accelerate the
multi-camera tracking process for the overlapping case. First,
we needed to identify the computationally demanding portions
of the sequential code. We used several datasets and varied the
number of particles as shown in Table III. The Detection and
Likelihood functions, the latter one of the three key processes
for the particle filter, are by far the most demanding. As expected, the likelihood computation is an increasing function of
the number of particles.


TABLE V
GPU ACCELERATION

Fig. 8. GPU implementation; this shows multi-threading of the Detection and Tracking algorithms.

TABLE IV
GPU MEMORY TYPES AND CHARACTERISTICS


A. Parallel Implementation
Fig. 8 shows schematically the activation of multiple threads
on a GeForce GTS250 with 128 cores of clock frequency
1.6 GHz. The loading of image data is still performed by the
host because direct memory transfer to the GPU is not possible.
The clock frequency of the host CPU is 3.1 GHz or about twice
the GPU clock frequency.
The Detection function operates on a fixed grid for each
camera, and the projection of ellipsoids and subsequent comparison with input data is independent. This suggests ready
data parallelism, which has been implemented for foreground-background segmentation and computation of the integral
image. The Likelihood function has been made parallel by
using the natural independent structure of likelihood evaluation
on each particle. Reduction is employed to select the maximum
grid response in Detection and the summation in Resampling.
We have also used Skip ahead to split and accelerate random
number generation using multiple parallel chains to generate
properly distributed samples in Transition. This employs
Sobol's sequence [34] with the CURAND library [35].
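The per-particle independence described above maps naturally onto one CUDA thread per particle. The kernel below is only a structural sketch: the particle layout, the launch configuration, and the placeholder likelihood body (which merely probes the foreground mask) are our own assumptions, standing in for the full ellipsoid projection and texture comparison.

```cpp
#include <cuda_runtime.h>

// Device-side particle layout (illustrative only).
struct DeviceParticle { float x, y, vx, vy, h; };

// One thread per particle: each thread evaluates the likelihood of its own
// particle independently, which is the data parallelism described above.
__global__ void likelihoodKernel(const DeviceParticle* particles,
                                 const unsigned char* foreground,  // binary mask
                                 int width, int height,
                                 float* likelihood, int numParticles) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= numParticles) return;

  // Placeholder likelihood: the real system would project the ellipsoid for
  // particle i, trace the foreground rays, and compare the surface texture.
  const DeviceParticle p = particles[i];
  float score = 0.0f;
  int px = static_cast<int>(p.x), py = static_cast<int>(p.y);
  if (px >= 0 && px < width && py >= 0 && py < height)
    score = foreground[py * width + px] ? 1.0f : 0.0f;
  likelihood[i] = score;
}

// Host-side launch: block size of 128 threads is an assumed configuration.
void evaluateLikelihoods(const DeviceParticle* dParticles,
                         const unsigned char* dForeground, int w, int h,
                         float* dLikelihood, int n) {
  int threads = 128;
  int blocks = (n + threads - 1) / threads;
  likelihoodKernel<<<blocks, threads>>>(dParticles, dForeground, w, h,
                                        dLikelihood, n);
  cudaDeviceSynchronize();
}
```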
The performance of any shared memory algorithm depends
critically on the memory hierarchy. Table IV shows the size and access time for each type of GPU memory. We use constant memory to store static data, e.g., the rectangular boundaries of the detection process, the parameters of the camera calibration models, and the particle filter transition parameters. Shared memory is fast memory located near the cores, used with Reduction to combine data quickly without reference to global memory. Shared memory is used similarly when computing the integral image. Texture memory is fast-access, read-only memory used by threaded processes to store the raw image data and the foreground and integral images, and slightly reduces the time to load these data.

B. Comparative Execution on the GPU and Host CPU
Computational times of all the functions are collected and


shown in Table V, considering the PETS09 dataset and 512
particles. The overall speed-up ratio of 3.5 can be considered
as equivalent to 7 times algorithmic speed-up when we take
into account the different clock frequencies. As expected, the
most computationally demanding processes, Detection, Likelihood and to a much lesser extent Transition, give better speedup
and contribute most to the overall reduction in computation time
from 82.3 to 23.2 ms. The CUDA profiler identifies the system
bottlenecks, examining the computation times and memory use
of all kernels. Examination of the detailed results shows that
the current implementation suffers from cache misses in the critical Detection and Likelihood functions, which
require access to significant amounts of image and particle data.
One way to address this latter problem would be to partition
the likelihood calculation into smaller functions to reduce the
size of the memory used and hence cache misses. There is a
further problem of instruction serialization caused by memory
conflicts and control branching. When different cores execute
different control paths, the master controller has to send instruction branches sequentially rather than concurrently. Nevertheless, the final processing time for the GPU implementation is
just below 25 ms, which corresponds to 40 frames per second.
Taking additional factors into account, including the time to load live video from the host, we process live video from two cameras with 5-10 subjects at a final frame rate of 10 frames per second. In general there is no simple answer to how the frame rate varies as
a function of the number of cameras within the network and
the number of subjects. This depends on load balancing and the
use of cache memory within the GPU and on communication


between the CPU host and GPU, and so is not a simple linear
relationship.
VI. CONCLUSION
We have described a new approach to tracking human subjects in video sequences. First, we have introduced a parametric
ellipsoid model in both detection and visual tracking. For detection, this is projected at static grid positions to find intersections between potential subject positions and foreground image
data, as determined by mixture of Gaussian segmentation. For
tracking, the ellipsoid is parameterized by position, velocity and
height as part of the state vectors of a particle filter. As the subject moves, a 3-D appearance description using texture and color
is learned progressively. This allows us to integrate observations
from multiple cameras into the likelihood function.
We have shown that the texture and color signature can be
used for effective tracking of subjects, with multiple object
tracking accuracies on PETS benchmarks of greater than 90%.
Further, we have combined this signature with spatial data
association to achieve F-measures (combining recall and precision) rates of between 60% and 85% when handing off between
cameras with nonoverlapping views, depending on the nature
of the data sets. This measure is further improved by use of the
bijection property of the Hungarian method for assignment of
identities to subjects.
Detection and tracking are both computationally expensive.
Therefore we have implemented a many-core version of our algorithms on an NVIDIA GeForce GTS250 GPU to achieve real-time tracking of up to 10 subjects at 10 frames per second.
The speedup achieved with the GPU implementation shows the
most improvement in the case of detection, peaking at 5.8 times.
Overall, the speedup is 3.5 times for one benchmark dataset,
but this does not allow for the lower clock speed of the GPU
cores, so the algorithm performs better than this. Nevertheless,
the GPU implementation does allow us to achieve the necessary
real time performance.
REFERENCES
[1] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 603-619, May 2002.
[2] A. Jepson, D. Fleet, and T. El-Maraghi, "Robust online appearance models for visual tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 10, pp. 1296-1311, Oct. 2003.
[3] L. Ma, J. Liu, J. Wang, J. Cheng, and H. Lu, "An improved silhouette tracking approach integrating particle filter with graph cuts," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., Mar. 2010, pp. 1142-1145.
[4] G. Li, W. Qu, and Q. Huang, "A multiple targets appearance tracker based on object interaction models," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 3, pp. 450-464, Mar. 2012.
[5] F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, "Multicamera people tracking with a probabilistic occupancy map," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 267-282, Feb. 2008.
[6] Z. Husz, A. Wallace, and P. Green, "Tracking with a hierarchical partitioned particle filter and movement modelling," IEEE Trans. Syst., Man, Cybern., Part B: Cybern., vol. 41, no. 6, pp. 1571-1584, Dec. 2011.
[7] J. Brown and D. Capson, "A framework for 3D model-based visual tracking using a GPU-accelerated particle filter," IEEE Trans. Vis. Comput. Graphics, vol. 18, no. 1, pp. 68-80, Jan. 2012.
[8] F. Yan, A. Kostin, W. Christmas, and J. Kittler, "A novel data association algorithm for object tracking in clutter with application to tennis video analysis," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Washington, DC, 2006, vol. 1, pp. 634-641.
[9] Z. Kalal, K. Mikolajczyk, and J. Matas, "Face-TLD: Tracking-learning-detection applied to faces," in Proc. 17th IEEE Int. Conf. Image Process., Sep. 2010, pp. 3789-3792.
[10] G. Cielniak and T. Duckett, "People recognition by mobile robots," J. Intell. Fuzzy Syst., vol. 15, pp. 21-27, 2004.
[11] H.-B. Kim and K.-B. Sim, "A particular object tracking in an environment of multiple moving objects," in Proc. Int. Conf. Control Autom. Syst., Oct. 2010, pp. 1053-1056.
[12] Y. Tang, Y. Li, T. Bai, X. Zhou, and Z. Li, "Human tracking in thermal catadioptric omnidirectional vision," in Proc. IEEE Int. Conf. Inf. Autom., Jun. 2011, pp. 97-102.
[13] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005, vol. 1, pp. 886-893.
[14] H. Ma, C. Zeng, and C. X. Ling, "A reliable people counting system via multiple cameras," ACM Trans. Intell. Syst. Technol., vol. 3, no. 2, pp. 31:1-31:22, Feb. 2012.
[15] J. Yao and J. M. Odobez, Multi-Person Bayesian Tracking With Multiple Cameras. New York: Elsevier, 2009, ch. 15, p. 363.
[16] K. Pauwels, M. Tomasi, J. Diaz, E. Ros, and M. V. Hulle, "A comparison of FPGA and GPU for real-time phase-based optical flow, stereo, and local image features," IEEE Trans. Comput., vol. 61, no. 7, pp. 999-1011, Jul. 2012.
[17] R. Z. M. Wojcikowski and B. Pankiewicz, "FPGA-based real-time implementation of detection algorithm for automatic traffic surveillance sensor network," J. Signal Process. Syst., vol. 68, no. 1, pp. 1-18, Jan. 2012.
[18] G. B. J. Sanchez and J. Simo, "Video sensor architecture for surveillance applications," Sensors, vol. 12, pp. 1509-1528, 2012.
[19] M. P. S. N. Sinha, J.-M. Frahm, and Y. Genc, "Feature tracking and matching in video using programmable graphics hardware," Mach. Vis. Appl., vol. 22, pp. 207-217, 2011.
[20] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 1999, vol. 2, p. 252.
[21] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2001, vol. 1, pp. 511-518.
[22] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter: Particle Filters for Tracking Applications. London, U.K.: Artech House, 2004.
[23] P. J. Schneider and D. H. Eberly, Geometric Tools for Computer Graphics. New York: Elsevier, 2003.
[24] J. M. Ferryman, "PETS: Performance evaluation of tracking and surveillance," Univ. Reading, 2007 [Online]. Available: http://www.cvg.rdg.ac.uk/slides/pets.html
[25] J. Ferryman and A. Shahrokni, "PETS2009: Dataset and challenge," in Proc. 12th IEEE Int. Workshop Performance Eval. Track. Surveill., 2009, pp. 1-6.
[26] K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: The CLEAR MOT metrics," J. Image Video Process., vol. 2008, pp. 1-10, 2008.
[27] A. Ellis, A. Shahrokni, and J. Ferryman, "PETS2009 and Winter-PETS 2009 results: A combined evaluation," in Proc. 12th IEEE Int. Workshop Performance Eval. Track. Surveill. (PETS-Winter), Dec. 2009, pp. 1-8.
[28] J. Berclaz, A. Shahrokni, F. Fleuret, J. Ferryman, and P. Fua, "Evaluation of probabilistic occupancy map people detection for surveillance systems," in Proc. IEEE Int. Workshop Performance Eval. Track. Surveill., 2009.
[29] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, "Robust tracking-by-detection using a detector confidence particle filter," in Proc. IEEE 12th Int. Conf. Comput. Vis., 2009, pp. 1515-1522.
[30] D. Arsic, A. Lyutskanov, G. Rigoll, and B. Kwolek, "Multi camera person tracking applying a graph-cuts based foreground segmentation in a homography framework," in Proc. 12th IEEE Int. Workshop Performance Eval. Track. Surveill. (PETS-Winter), 2009, pp. 1-8.
[31] J. Berclaz, F. Fleuret, and P. Fua, "Multiple object tracking using flow linear programming," in Proc. 12th IEEE Int. Workshop Performance Eval. Track. Surveill. (PETS-Winter), Dec. 2009, pp. 1-8.
[32] J. Munkres, "Algorithms for the assignment and transportation problems," J. Soc. Ind. Appl. Math., vol. 5, no. 1, pp. 32-38, 1957.
[33] CUDA C Programming Guide, ver. 3.2, NVIDIA, 2010.
[34] I. Sobol, "The distribution of points in a cube and the approximate evaluation of integrals," USSR Comput. Math. Math. Phys., vol. 7, pp. 86-112, 1967.
[35] CURAND Guide, NVIDIA, 2012.


Wasit Limprasert received the B.Sc. degree in


physics from Mahidol University, Bangkok, Thailand, the M.Sc. degree in micro-electronics from
the Asian Institute of Technology, Pathum Thani,
Thailand, and the Ph.D. degree in computer science
from Heriot-Watt University, Edinburgh, U.K.
He is currently a researcher at NECTEC Thailand.
His research interests are object tracking, visualization, and system modeling.

Greg Michaelson received the B.A. degree in computer science from the University of Essex, Essex,
U.K., in 1973, the M.Sc. degree in computational
science from the University of St. Andrews, St.
Andrews, U.K. in 1982, and the Ph.D. degree from
Heriot-Watt University, Edinburgh, U.K. in 1993.
He is Professor of Computer Science at Heriot-Watt University. His research expertise is in programming language design, implementation, and analysis for multi-processor systems. He has enjoyed EPSRC, DTC, and EU support for his research and has published widely.
Dr. Michaelson is a Chartered Engineer, a Fellow of the British Computer Society, and an elected board member of the European Association for Programming Languages and Systems.

Andrew Wallace received the B.Sc. and Ph.D. degrees from the University of Edinburgh, Edinburgh, U.K., in 1972 and 1975, respectively.
He is a Professor of Signal and Image Processing at Heriot-Watt University. His research interests include vision, image and signal processing, and parallel, many-core architectures. He has published extensively, receiving a number of best paper and other awards, and has secured funding from EPSRC, the EU, and other industrial and government sponsors.
Dr. Wallace is a chartered engineer and a Fellow of the Institution of Engineering and Technology.
