Robust Multi-Resolution Pedestrian Detection
Abstract
The serious performance decline with decreasing resolution is the major bottleneck for current pedestrian detection
techniques [14, 23]. In this paper, we take pedestrian detection in different resolutions as different but related problems, and propose a Multi-Task model to jointly consider
their commonness and differences. The model contains resolution aware transformations to map pedestrians in different resolutions to a common space, where a shared detector
is constructed to distinguish pedestrians from background.
For model learning, we present a coordinate descent procedure to learn the resolution aware transformations and deformable part model (DPM) based detector iteratively. In
traffic scenes, there are many false positives located around
vehicles, therefore, we further build a context model to suppress them according to the pedestrian-vehicle relationship.
The context model can be learned automatically even when
the vehicle annotations are not available. Our method reduces the mean miss rate to 60% for pedestrians taller than
30 pixels on the Caltech Pedestrian Benchmark, which noticeably outperforms previous state-of-the-art (71%).
Figure 1. Examples of multi-resolution pedestrian detection results of our method on the Caltech Pedestrian Benchmark [14].
1. Introduction
Pedestrian detection has been an active research topic in computer vision for decades, owing to its importance in real applications such as driving assistance and video surveillance. In recent years, especially due to the popularity of gradient features, the field of pedestrian detection has achieved impressive progress in both effectiveness [6, 31, 43, 41, 19, 33] and efficiency [25, 11, 18, 4, 10]. The leading detectors achieve satisfactory performance on high resolution benchmarks (e.g. INRIA [6]); however, they encounter difficulties with low resolution pedestrians (e.g. 30-80 pixels tall, Fig. 1) [14, 23]. Unfortunately, low resolution pedestrians are often very important in real applications. For example, driver assistance systems need to detect
tures of samples from different resolutions, thus the structural commonness is preserved. Particularly, we extend the
popular deformable part model (DPM) [19] to multi-task
DPM (MT-DPM), which aims to find an optimal combination of DPM detector and resolution aware transformations.
We prove that when the resolution aware transformations are fixed, the multi-task problem can be transformed into a Latent-SVM optimization problem, and that when the DPM detector in the mapped space is fixed, the problem reduces to a standard SVM problem. We divide the complex non-convex problem into these two sub-problems and optimize them alternately.
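The alternation can be illustrated with a toy example. The appearance score is bilinear in the detector and the transformation, so the joint objective is non-convex but convex in each block. The sketch below is an assumption-laden illustration, not the paper's solver: a squared loss replaces the hinge loss so that each sub-problem has a closed-form least-squares solution, and all data is synthetic.

```python
import numpy as np

# Toy illustration of coordinate descent on a bilinear score w^T P x:
# alternately solve for the "shared detector" w and the "resolution aware
# transformation" P, each step being a convex least-squares problem.
rng = np.random.default_rng(0)
d, k, n = 12, 4, 200                       # feature dim, mapped dim, samples
X = rng.normal(size=(n, d))                # synthetic raw features
y = rng.normal(size=n)                     # synthetic targets

P = rng.normal(size=(k, d))                # plays the role of P_H / P_L
w = rng.normal(size=k)                     # plays the role of the detector

def loss(w, P):
    return float(np.mean((X @ P.T @ w - y) ** 2))

history = [loss(w, P)]
for _ in range(10):
    # Step 1: fix P, solve exactly for w (convex least squares).
    Z = X @ P.T                            # samples mapped to the common space
    w = np.linalg.lstsq(Z, y, rcond=None)[0]
    # Step 2: fix w, solve for P; w^T P x is linear in vec(P), with
    # design row kron(w, x) matching the row-major layout of P.ravel().
    K = np.stack([np.kron(w, x) for x in X])
    P = np.linalg.lstsq(K, y, rcond=None)[0].reshape(k, d)
    history.append(loss(w, P))

# Each step exactly minimizes one block, so the objective never increases.
assert all(b <= a + 1e-9 for a, b in zip(history, history[1:]))
```

Because each sub-problem is solved exactly, the objective is monotonically non-increasing, which is the property that makes the alternation in the paper well behaved.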
In addition, we propose a new context model to improve detection performance in traffic scenes. Quite a large number of detections (33.19% for MT-DPM in our experiments) occur around vehicles. Vehicle localization is much easier than pedestrian localization, which motivates us to employ the pedestrian-vehicle relationship as an additional cue to judge whether a detection is a true or false positive. We build an energy model to jointly encode the pedestrian-vehicle and geometry contexts, and infer the labels of detections by maximizing the energy function over the whole image. Since vehicle annotations are often not available in pedestrian benchmarks, we further present a method to learn the context model from ground truth pedestrian annotations and noisy vehicle detections.
We conduct experiments on the challenging Caltech Pedestrian Benchmark [14], and achieve significant improvement over previous state-of-the-art methods on all 9 sub-experiments advised in [14]. For pedestrians taller than 30 pixels, our MT-DPM reduces the mean miss rate by 8%, and our context model reduces it by a further 3%, relative to previous state-of-the-art performance.
The rest of the paper is organized as follows: Section 2
reviews the related work. The multi-task DPM detector and
pedestrian-vehicle context model are discussed in Section
3 and Section 4, respectively. Section 5 presents the experiments, and Section 6 concludes the paper.
2. Related work
There is a long history of research on pedestrian detection. Most modern detectors are based on statistical learning and sliding-window scanning, popularized by [32] and [40]. Large improvements came from robust features, such as [6, 12, 25, 3]. Some papers fused HOG with other features [43, 7, 45, 41] to improve performance. Other papers focused on special problems in pedestrian detection, including occlusion handling [46, 43, 38, 2], speed [25, 11, 18, 4, 10], and detector transfer to new scenes [42, 27]. We refer readers to [21, 14] for detailed surveys on pedestrian detection.
Resolution related problems have attracted attention in
recent evaluations. [16] found that the pedestrian detection
(Figure: comparison of (a) a single resolution detector, (b) multi-resolution detectors, and (c) the proposed multi-task detector, in which resolution aware transformations (an HR transformation and an LR transformation) map high resolution pedestrians, low resolution pedestrians and backgrounds into a resolution invariant feature space shared by one detector.)
background. The first two terms regularize the detector parameters, and the last term is the hinge loss in DPM detection. L_n is the part configuration that maximizes the detection score of I_n. In the learning phase, the part locations are taken as latent variables, and the problem can be optimized by Latent-SVM [19].
For multi-task learning, the relationship between different tasks should be considered. In analogy to the original DPM, MT-DPM is formulated as:

\arg\min_{W_a, w_s, P_H, P_L} \frac{1}{2} w_s^T w_s + f_{I_H}(W_a, w_s, P_H) + f_{I_L}(W_a, w_s, P_L),   (4)

where I_H and I_L denote the high and low resolution training sets, including both pedestrians and background. Since the spatial term w_s is applied directly to the data from different resolutions, it can be regularized independently. f_{I_H} and f_{I_L} are used to consider the detection loss and regularize the parameters P_H, P_L and W_a. f_{I_H} and f_{I_L} are of the same form, so here we take f_{I_H} as an example. It can be written as:

f_{I_H}(W_a, w_s, P_H) = \frac{1}{2} \|P_H^T W_a\|_F^2 + C \sum_{n=1}^{N_H} \max[0, 1 - y_n (\mathrm{Tr}(W_a^T P_H \phi_a(I_{H_n}, L_n)) + w_s^T \phi_s(L_n))],   (5)

where the regularization term P_H^T W_a is an n_f \times n_c dimensional matrix, of the same dimension as the original feature matrix. Since P_H and W_a are applied to the original appearance feature integrally in calculating the appearance score \mathrm{Tr}((P_H^T W_a)^T \phi_a(I, L)), we take them as an ensemble and regularize them together. The second term is the detection loss for resolution aware detection, corresponding to the detection model in Eq. 2. The parameters W_a and w_s are shared between f_{I_H} and f_{I_L}. Note that more partitions of resolutions can be handled naturally in Eq. 4.

In Eq. 4, we need to find an optimal combination of W_a, w_s, P_H, and P_L. However, Eq. 4 is not convex when all of them are free. Fortunately, we show that given the two transformations, the problem can be transformed into a standard DPM problem, and given the DPM detector, it can be transformed into a standard SVM problem. We conduct a coordinate descent procedure to optimize the two sub-problems iteratively.

3.2.1 Optimize W_a and w_s

When P_H and P_L are fixed, denote P_H P_H^T + P_L P_L^T as A, A^{1/2} W_a as \tilde{W}_a, and A^{-1/2} P_H \phi_a(I_n, L_n) (A^{-1/2} P_L \phi_a(I_n, L_n) for low resolution samples) as \tilde{\phi}_a(I_n, L_n). Eq. 4 can be reformulated as:

\arg\min_{\tilde{W}_a, w_s} \frac{1}{2} \|\tilde{W}_a\|_F^2 + \frac{1}{2} w_s^T w_s + C \sum_{n=1}^{N_H + N_L} \max[0, 1 - y_n (\mathrm{Tr}(\tilde{W}_a^T \tilde{\phi}_a(I_n, L_n)) + w_s^T \phi_s(L_n))],   (6)

which is a standard Latent-SVM problem in (\tilde{W}_a, w_s).

3.2.2 Optimize P_H and P_L

When W_a and w_s are fixed, P_H and P_L are independent, thus the optimization problem can be divided into two sub-problems: \arg\min_{P_H} f_{I_H}(W_a, w_s, P_H) and \arg\min_{P_L} f_{I_L}(W_a, w_s, P_L). Since they are of the same form, here we only give the details for optimizing P_H. Given W_a and w_s, we first infer the part locations of every training sample by finding the part configuration L_n that maximizes Eq. 2. Denoting W_a W_a^T as A, A^{1/2} P_H as \tilde{P}_H, and A^{-1/2} W_a \phi_a(I_{H_n}, L_n) as \tilde{\phi}_a(I_{H_n}, L_n), the problem of Eq. 4 equals:

\arg\min_{\tilde{P}_H} \frac{1}{2} \|\tilde{P}_H\|_F^2 + C \sum_{n=1}^{N_H} \max[0, 1 - y_n (\mathrm{Tr}(\tilde{P}_H^T \tilde{\phi}_a(I_{H_n}, L_n)) + w_s^T \phi_s(L_n))],   (7)

which is a standard SVM problem in \tilde{P}_H.
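The substitution \tilde{P}_H = A^{1/2} P_H works because the coupled regularizer reduces to a plain Frobenius norm: with A = W_a W_a^T, we have ||A^{1/2} P_H||_F^2 = ||P_H^T W_a||_F^2. A quick numerical check of this identity (random matrices stand in for the learned parameters; the dimensions are arbitrary):

```python
import numpy as np

# Check ||A^(1/2) P_H||_F^2 == ||P_H^T W_a||_F^2 when A = W_a W_a^T.
rng = np.random.default_rng(1)
d, n_c, n_f = 6, 3, 4
W_a = rng.normal(size=(d, n_c))
P_H = rng.normal(size=(d, n_f))

A = W_a @ W_a.T                            # positive semi-definite, d x d
vals, V = np.linalg.eigh(A)                # symmetric square root via eigh
A_half = (V * np.sqrt(np.clip(vals, 0.0, None))) @ V.T

lhs = np.linalg.norm(A_half @ P_H, "fro") ** 2
rhs = np.linalg.norm(P_H.T @ W_a, "fro") ** 2
assert np.isclose(lhs, rhs)
```

The identity follows from Tr(P_H^T A P_H) = Tr(P_H^T W_a W_a^T P_H); the eigendecomposition route is used here because A is rank-deficient when n_c < d, which a generic matrix square root may handle poorly.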
Training Details
Figure 4. Examples of original detection, and the detection optimized by the context model.

The context energy of pedestrian hypotheses P = {p_1, ..., p_n} and vehicle hypotheses V = {v_1, ..., v_m} is defined as:

E(P, V) = \sum_{i=1}^{n} t_{p_i} w_p^T g(p_i) + \sum_{i=1}^{n} \sum_{j=1}^{m} t_{p_i} t_{v_j} w_v^T g(p_i, v_j),   (8)

and the labels are inferred by:

\arg\max_{t_{p_i}, t_{v_j}} \sum_{i=1}^{n} t_{p_i} w_p^T g(p_i) + \sum_{i=1}^{n} \sum_{j=1}^{m} t_{p_i} t_{v_j} w_v^T g(p_i, v_j),   (9)
where t_{p_i} and t_{v_j} are binary labels: 0 indicates a false positive and 1 a true positive. Eq. 9 is an integer programming problem, but it becomes trivial when the labels of V are fixed, since it then reduces to labeling every pedestrian independently. In typical traffic scenes, the number of vehicles is limited. For example, in the Caltech Pedestrian Benchmark there are no more than 8 vehicles in an image, so the problem can be solved via no more than 2^8 trivial sub-problems, which is very efficient in real applications.
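This inference scheme can be sketched in a few lines: enumerate all 2^m vehicle labelings, and for each fixed labeling keep exactly the pedestrians whose total score is positive. In the sketch below, random values stand in for the learned terms w_p^T g(p_i) and w_v^T g(p_i, v_j); only the inference logic is faithful to the text.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n_ped, n_veh = 5, 3
unary = rng.normal(size=n_ped)             # stand-in for w_p . g(p_i)
pair = rng.normal(size=(n_ped, n_veh))     # stand-in for w_v . g(p_i, v_j)

def infer(unary, pair):
    best_score, best_tp, best_tv = -np.inf, None, None
    # Enumerate all 2^m vehicle labelings (m is small in traffic scenes).
    for tv in itertools.product([0, 1], repeat=pair.shape[1]):
        tv = np.array(tv)
        # With vehicle labels fixed, each pedestrian is independent:
        # keep p_i iff its total score is positive.
        per_ped = unary + pair @ tv
        tp = (per_ped > 0).astype(int)
        score = float(tp @ per_ped)
        if score > best_score:
            best_score, best_tp, best_tv = score, tp, tv
    return best_score, best_tp, best_tv

score, tp, tv = infer(unary, pair)

# Sanity check against exhaustive search over all joint labelings.
brute = max(
    float(np.array(tp_) @ (unary + pair @ np.array(tv_)))
    for tp_ in itertools.product([0, 1], repeat=n_ped)
    for tv_ in itertools.product([0, 1], repeat=n_veh)
)
assert np.isclose(score, brute)
```

The exhaustive check confirms that fixing the vehicle labels and thresholding each pedestrian's score at zero recovers the global maximum of the energy.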
By linearity, Eq. 9 is equal to:

\arg\max_{t_{p_i}, t_{v_j}} [w_p, w_v] \Big[ \sum_{i=1}^{n} t_{p_i} g(p_i), \; \sum_{i=1}^{n} \sum_{j=1}^{m} t_{p_i} t_{v_j} g(p_i, v_j) \Big]^T.   (10)
Eq. 10 provides a natural way for max-margin learning. We
use wc to denote [wp , wv ]. Given the ground truth hypotheses of vehicles and pedestrians, a standard structural SVM
[39] can be used here to discriminatively learn wc by solving the following problem:
\min_{w_c, \xi_k} \frac{1}{2} \|w_c\|_2^2 + \sum_{k=1}^{K} \xi_k,   (11)

\mathrm{s.t.} \;\; \forall k, \forall (P'_k, V'_k): \; w_c^T \Phi(P_k, V_k) - w_c^T \Phi(P'_k, V'_k) \ge L(P_k, P'_k) - \xi_k,

with \Phi(\cdot, \cdot) denoting the joint feature vector of Eq. 10,
where P'_k and V'_k are arbitrary pedestrian and vehicle hypotheses in the k-th image, and P_k and V_k are the ground truth. L(P_k, P'_k) is the Hamming loss between the pedestrian detection hypothesis P'_k and the ground truth P_k. The difficulty in pedestrian based applications is that only the pedestrian ground
truth P_k is available in public pedestrian databases, and the vehicle annotation V_k is unknown. To address this problem, we use the noisy vehicle detection results as an initial estimate of V_k, and jointly learn the context model and infer whether each vehicle detection is a true or false positive, by optimizing the following problem:
(Figure 5: mean miss rate of MT-DPM with different subspace dimensions, dim = 8, 10, 12, 14, 16, 18; the miss rate decreases from 68.4% at dim = 8 to 63.1% at dim = 18.)

\min_{w_c, \xi_k} \frac{1}{2} \|w_c\|_2^2 + \sum_{k=1}^{K} \xi_k,   (12)

subject to the margin constraints of Eq. 11 with the ground truth vehicle labels V_k replaced by the inferred V̂_k,
where V̂_k is a subset of V_k, reflecting the current inference of the vehicle detections obtained by maximizing the overall context score. Eq. 12 can be solved by optimizing the model parameters w_c and the vehicle labels V̂_k iteratively. In the learning phase, the initial P'_k is the pedestrian detection result of MT-DPM.
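The alternation can be sketched in pseudocode (the helper names `fit_structural_svm` and `infer_labels` are hypothetical placeholders for a structural SVM solver [39] and the inference of Eq. 9; they are not defined in the paper):

```
# Pseudocode: learning the context model without vehicle ground truth.
def learn_context_model(images, n_rounds):
    # Initialization: noisy vehicle detections stand in for ground truth V_k;
    # P'_k is initialized with the MT-DPM pedestrian detections.
    V_hat = [vehicle_detections(img) for img in images]
    for _ in range(n_rounds):
        # Step 1: treat the current V_hat as ground truth and fit
        # w_c = [w_p, w_v] with a structural SVM (Eq. 11 / Eq. 12).
        w_c = fit_structural_svm(images, pedestrian_ground_truth, V_hat)
        # Step 2: re-infer which vehicle detections are true positives by
        # maximizing the overall context score (Eq. 9) under the new w_c.
        V_hat = [infer_labels(w_c, img) for img in images]
    return w_c
```

Each round either improves the structural SVM objective or keeps it fixed, so the procedure behaves like the latent-variable alternations used elsewhere in the paper.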
5. Experiments
Experiments are conducted on the Caltech Pedestrian
Benchmark [14]2 . Following the experimental protocol, the
set00-set05 are used for training and set06-set10 are used
for test. We use the ROC or the mean miss rate3 to compare methods as advised in [14]. For more details of the
benchmark, please refer to [14]. There are various subexperiments on the benchmark to compare detectors in different conditions. Due to the space limitation, we only report the most relevant and leave results of other subexperiments in the supplemental material. We emphasize
that our method outperforms all the 17 methods evaluated
in [14] on the 9 sub-experiments significantly.
In the following experiments, we examine the influence
of the subspace dimension in MT-DPM, then compare it
with other strategies for low resolution detection. The contribution of context model is also validated at different FPPI. Finally we compare the performance with other stateof-the-art detectors.
(Figure: miss rate of the original MT-DPM detection and of the detection with the context model, evaluated at FPPI = 0.01, 0.1 and 1.)
(Figure 8 plots ROC curves for three sub-experiments; the legend reports each method's mean miss rate. The best results per panel: MultiResC 71%, MTDPM 63%, MTDPM+Context 60% in the first; MultiResC 73%, MTDPM 67%, MTDPM+Context 64% in the second; and MultiResC 48%, MTDPM 41%, MTDPM+Context 38% in the third.)
Figure 8. Quantitative results of MT-DPM, MT-DPM+Context and other methods on the Caltech Pedestrian Benchmark.
(Fig. 8(c), taller than 50 pixels). Our MT-DPM significantly outperforms previous state-of-the-art methods, by at least a 6% margin in mean miss rate on all three experiments. The proposed context model further improves the performance by about 3%. Because the ROC of [9] is not available, its performance is not shown here; as reported in [9], it achieves a 48% mean miss rate in the reasonable condition, while our method reduces it to 41%. The most related method is MultiResC [33], where a multi-resolution model is also used. Our method outperforms it by an 11% margin for multi-resolution detection, which demonstrates the advantage of the proposed method.
6. Conclusion
In this paper, we propose a Multi-Task DPM detector
to jointly encode the commonness and differences between
pedestrians from different resolutions, and achieve robust
performance for multi-resolution pedestrian detection. The
pedestrian-vehicle relationship is modeled to infer the true or false positives in traffic scenes, and we show how to learn it automatically from the data. Experiments on the challenging Caltech Pedestrian Benchmark show significant improvement over state-of-the-art performance. Our future work is to explore spatial-temporal information and extend the proposed models to general object detection tasks.
Figure 9. Qualitative results of the proposed method on Caltech Pedestrian Benchmark (the threshold corresponds to 0.1 FPPI).
Acknowledgement
We thank the anonymous reviewers for their valuable feedback. This work was supported by the Chinese
National Natural Science Foundation Project #61070146,
#61105023, #61103156, #61105037, #61203267, National
IoT R&D Project #2150510, National Science and Technology Support Program Project #2013BAK02B01, Chinese
Academy of Sciences Project No. KGZD-EW-102-2, European Union FP7 Project #257289 (TABULA RASA), and
AuthenMetric R&D Funds.
References
[1] A. Bar-Hillel, D. Levi, E. Krupka, and C. Goldberg. Part-based feature synthesis for human detection. In ECCV, 2010.
[2] O. Barinova, V. Lempitsky, and P. Kholi. On detection of multiple object instances using hough transforms. PAMI, 2012.
[3] C. Beleznai and H. Bischof. Fast human detection in crowded scenes by contour integration and local shape estimation. In CVPR, 2009.
[4] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool. Pedestrian detection at 100 frames per second. In CVPR, 2012.
[5] S. Biswas, K. W. Bowyer, and P. J. Flynn. Multidimensional scaling for matching low-resolution face images. PAMI, 2012.
[6] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[7] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, 2006.
[8] C. Desai, D. Ramanan, and C. Fowlkes. Discriminative models for multi-class object layout. IJCV, 2011.
[9] Y. Ding and J. Xiao. Contextual boost for pedestrian detection. In CVPR, 2012.
[10] P. Dollar, R. Appel, and W. Kienzle. Crosstalk cascades for frame-rate pedestrian detection. In ECCV, 2012.
[11] P. Dollar, S. Belongie, and P. Perona. The fastest pedestrian detector in the west. In BMVC, 2010.
[12] P. Dollar, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In BMVC, 2009.
[13] P. Dollar, Z. Tu, H. Tao, and S. Belongie. Feature mining for image classification. In CVPR, 2007.
[14] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. TPAMI, 2012.
[15] C. Dubout and F. Fleuret. Exact acceleration of linear object detectors. In ECCV, 2012.
[16] M. Enzweiler and D. Gavrila. Monocular pedestrian detection: Survey and experiments. TPAMI, 2009.
[17] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL VOC 2012 results.
[18] P. Felzenszwalb, R. Girshick, and D. McAllester. Cascade object detection with deformable part models. In CVPR, 2010.
[19] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.
[20] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.
[21] D. Geronimo, A. Lopez, A. Sappa, and T. Graf. Survey of pedestrian detection for advanced driver assistance systems. PAMI, 2010.
[22] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/~rbg/latent-release5/.
[23] D. Hoiem, Y. Chodpathumwan, and Q. Dai. Diagnosing error in object detectors. In ECCV, 2012.
[24] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. IJCV, 2008.
[25] C. Huang and R. Nevatia. High performance object detection by collaborative learning of joint ranking of granules features. In CVPR, 2010.
[26] A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba. Undoing the damage of dataset bias. In ECCV, 2012.
[27] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In CVPR, 2011.
[28] Z. Lei and S. Z. Li. Coupled spectral regression for matching heterogeneous faces. In CVPR, 2009.
[29] C. Li, D. Parikh, and T. Chen. Automatic discovery of groups of objects for scene understanding. In CVPR, 2012.
[30] Z. Lin and L. Davis. A pose-invariant descriptor for human detection and segmentation. In ECCV, 2008.
[31] S. Maji, A. Berg, and J. Malik. Classification using intersection kernel support vector machines is efficient. In CVPR, 2008.
[32] C. Papageorgiou and T. Poggio. A trainable system for object detection. IJCV, 2000.
[33] D. Park, D. Ramanan, and C. Fowlkes. Multiresolution models for object detection. In ECCV, 2010.
[34] H. Pirsiavash and D. Ramanan. Steerable part models. In CVPR, 2012.
[35] H. Pirsiavash, D. Ramanan, and C. Fowlkes. Bilinear classifiers for visual recognition. In NIPS, 2009.
[36] M. Sadeghi and A. Farhadi. Recognition using visual phrases. In CVPR, 2011.
[37] W. Schwartz, A. Kembhavi, D. Harwood, and L. Davis. Human detection using partial least squares analysis. In ICCV, 2009.
[38] S. Tang, M. Andriluka, and B. Schiele. Detection and tracking of occluded people. In BMVC, 2012.
[39] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. JMLR, 2006.
[40] P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. IJCV, 2005.
[41] S. Walk, N. Majer, K. Schindler, and B. Schiele. New features and insights for pedestrian detection. In CVPR, 2010.
[42] M. Wang, W. Li, and X. Wang. Transferring a generic pedestrian detector towards specific scenes. In CVPR, 2012.
[43] X. Wang, T. Han, and S. Yan. An HOG-LBP human detector with partial occlusion handling. In ICCV, 2009.
[44] C. Wojek and B. Schiele. A performance evaluation of single and multi-feature people detection. In DAGM, 2008.
[45] C. Wojek, S. Walk, and B. Schiele. Multi-cue onboard pedestrian detection. In CVPR, 2009.
[46] J. Yan, Z. Lei, D. Yi, and S. Z. Li. Multi-pedestrian detection in crowded scenes: A global view. In CVPR, 2012.