Scale Match For Tiny Person Detection
To detect the tiny persons, we propose a simple yet effective approach, named Scale Match. The intuition of our approach is to align the object scales of the dataset for pre-training with those of the dataset for detector training. The nature behind Scale Match is that it can better investigate and utilize the information at tiny scales, and make convolutional neural networks (CNNs) more sophisticated for tiny object representation. The main contributions of our work include:

1. We introduce TinyPerson, under the background of maritime quick rescue, and raise a grand challenge about tiny object detection in the wild. To our best knowledge, this is the first benchmark for person detection at a long distance and with massive backgrounds. The train/val. annotations will be made public and an online benchmark will be set up for algorithm evaluation.

2. We comprehensively analyze the challenges of tiny persons and propose the Scale Match approach, with the purpose of aligning the feature distribution between the dataset for network pre-training and the dataset for detector learning.

3. The proposed Scale Match approach improves the detection performance over the state-of-the-art detector (FPN) by a significant margin (about 5%).

2. Related Work

Dataset for person detection: Pedestrian detection has always been a hot issue in computer vision. Larger capacity, richer scenes and better annotated pedestrian datasets, such as INRIA [2], ETH [6], TudBrussels [24], Daimler [5], Caltech-USA [4], KITTI [8] and CityPersons [27], represent the pursuit of more robust algorithms and better datasets. The data in some of these datasets were collected in city scenes and sampled from annotated frames of video sequences. Since the pedestrians in those datasets are in a relatively high resolution and large in size, those datasets are not suitable for tiny object detection.

TinyPerson represents persons at a quite low resolution, mainly less than 20 pixels, in maritime and beach scenes. Such diversity enables models trained on TinyPerson to generalize well to more scenes, e.g., long-distance human target detection and subsequent rescue.

Several small target datasets, including WiderFace [25] and TinyNet [19], have been reported. TinyNet involves remote sensing target detection at a long distance; however, the dataset is not publicly available. WiderFace mainly focuses on face detection. Its faces hold a similar distribution of absolute size to the persons in TinyPerson, but have a higher resolution and larger relative sizes, as shown in Figure 1.

CNN-based person detection: In recent years, with the development of convolutional neural networks (CNNs), the performance of classification, detection and segmentation on some classical datasets, such as ImageNet [3], Pascal VOC [7] and MS COCO [16], has far exceeded that of traditional machine learning algorithms. The region convolutional neural network (R-CNN) [10] has become the popular detection architecture. OverFeat adopted a ConvNet as a sliding window detector on an image pyramid. R-CNN adopted a region proposal-based method built on selective search and then used a ConvNet to classify the scale-normalized proposals. Spatial pyramid pooling (SPP) [11] applied R-CNN on feature maps extracted at a single image scale, which demonstrated that such region-based detectors could be applied much more efficiently. Fast R-CNN [9] and Faster R-CNN [21] built a unified object detector in a multi-task manner. Dai et al. [1] proposed R-FCN, which uses position-sensitive RoI pooling to get a faster and better detector.

While the region-based methods are complex and time-consuming, single-stage detectors, such as YOLO [20] and SSD [17], were proposed to accelerate processing, but with a performance drop, especially on tiny objects.

Tiny object detection: Along with the rapid development of CNNs, researchers have searched for frameworks specifically for tiny object detection. Lin et al. [14] proposed feature pyramid networks, which use a top-down architecture with lateral connections as an elegant multi-scale feature warping method. Zhang et al. [28] proposed a scale-equitable face detection framework to handle faces of different scales well. Then Li et al. [13] proposed DSFD for face detection, which is a SOTA open-source face detector. Hu et al. [12] showed that context is crucial and defined templates that make use of massively large receptive fields. Zhao et al. [30] proposed a pyramid scene parsing network that exploits context reasonably. Shrivastava et al. [22] proposed an online hard example mining method that can significantly improve the performance on small objects.

3. Tiny Person Benchmark

In this paper, the size of an object is defined as the square root of the object's bounding box area. We use Gij = (xij, yij, wij, hij) to describe the j-th object's bounding box in the i-th image Ii of a dataset, where (xij, yij) denotes the coordinate of the left-top point, and wij, hij are the width and height of the bounding box. Wi, Hi denote the width and height of Ii, respectively. Then the absolute size and relative size of an object are calculated as:

$AS(G_{ij}) = \sqrt{w_{ij} \, h_{ij}}$.  (1)

$RS(G_{ij}) = \sqrt{\frac{w_{ij} \, h_{ij}}{W_i \, H_i}}$.  (2)

When we mention the size of objects in the following, we refer to the objects' absolute size by default.
dataset        absolute size   relative size   aspect ratio
TinyPerson     18.0 ± 17.4     0.012 ± 0.010   0.676 ± 0.416
COCO           99.5 ± 107.5    0.190 ± 0.203   1.214 ± 1.339
WIDER Face     32.8 ± 52.7     0.036 ± 0.052   0.801 ± 0.168
CityPersons    79.8 ± 67.5     0.055 ± 0.046   0.410 ± 0.008
(mean ± standard deviation per dataset)
dataset                 MR_50^tiny   AP_50^tiny
tiny CityPersons        75.44        19.08
3*3 tiny CityPersons    45.49        35.39
TinyPerson              85.71        47.29
3*3 TinyPerson          83.21        52.47
Table 3. The performance on tiny CityPersons, TinyPerson and their 3*3 up-sampled datasets. (Due to out-of-memory errors caused by a 4*4 up-sampling strategy for TinyPerson, we use the 3*3 up-sampling strategy as an alternative.)

Figure 3. IOU (intersection over union) and IOD (intersection over detection). IOD is used for ignored regions in evaluation. The outlined (violet) box represents a labeled ignored region and the dashed boxes are unlabeled and ignored persons. The red box is a detection result box that has a high IOU with one of the ignored persons.
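For concreteness, a small sketch of the two overlap criteria from Figure 3 (a helper of ours, taking IOD as intersection over the detection's own area, as the name suggests):

```python
def iou_iod(det, region):
    """IOU = intersection / union; IOD = intersection / area(det).

    Boxes are (x1, y1, x2, y2). IOD is the criterion used against ignored
    regions: a detection mostly covered by an ignored region should not be
    counted as a false positive, even if its IOU with the region is low."""
    ix1, iy1 = max(det[0], region[0]), max(det[1], region[1])
    ix2, iy2 = min(det[2], region[2]), min(det[3], region[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_det = (det[2] - det[0]) * (det[3] - det[1])
    area_reg = (region[2] - region[0]) * (region[3] - region[1])
    return inter / (area_det + area_reg - inter), inter / area_det
```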
subsets, while images from the same video cannot be split into the same subset.

Focusing on the person detection task, we treat "sea person" and "earth person" as one and the same class (person). And for the detection task, we only use those images which have fewer than 200 valid persons. What's more, TinyPerson can be used for more tasks, as mentioned before, based on different manual configurations of TinyPerson.

3.2. Dataset Challenges
Tiny absolute size: For a tiny object dataset, extremely small size is one of the key characteristics and one of the main challenges. To quantify the effect of absolute size reduction on performance, we down-sample CityPersons by 4*4 to construct tiny CityPersons, whose mean absolute object size is the same as that of TinyPerson. Then we train a detector for CityPersons and tiny CityPersons, respectively; the performance is shown in Table 4. The performance drops significantly when the objects' size becomes tiny: in Table 4, the MR_50^tiny of tiny CityPersons is about 40 points worse than that of CityPersons. The tiny object size really brings a great challenge to detection, which is also the main concern of this paper.

The FPN pre-trained on MS COCO can learn much about objects of the sizes representative of MS COCO; however, it is not sophisticated with objects of tiny size. The big difference in size distribution brings a significant reduction in performance. In addition, a tiny object becomes blurry, resulting in poor semantic information of the object, which further greatly affects the performance of deep neural networks.

Tiny relative size: Although tiny CityPersons holds a similar absolute size to TinyPerson, the relative size remains unchanged under down-sampling, since the whole image is reduced. Different from tiny CityPersons, the images in TinyPerson are captured from far away in real scenes, and the objects' relative size in TinyPerson is smaller than that in CityPersons, as shown in the bottom-right of Figure 1.

To better quantify the effect of the tiny relative size, we obtain two new datasets, 3*3 tiny CityPersons and 3*3 TinyPerson, by directly 3*3 up-sampling tiny CityPersons and TinyPerson, respectively. Then FPN detectors are trained on 3*3 tiny CityPersons and 3*3 TinyPerson. The results are shown in Table 3. For tiny CityPersons, simple up-sampling improves MR_50^tiny and AP_50^tiny by 29.95 and 16.31 points respectively, which is much closer to the original CityPersons' performance. However, for TinyPerson, the same up-sampling strategy obtains only a limited performance improvement. The tiny relative size results in more false positives and a serious positive/negative imbalance, because massive and complex backgrounds are introduced in the real scenario. The tiny relative size thus also greatly challenges the detection task.

4. Tiny Person Detection

It is known that the more data is used for training, the better the performance will be. However, the cost of collecting data for a specified task is very high. A common approach is to train a model on extra datasets as a pre-trained model, and then fine-tune it on the task-specified dataset. Due to the huge data volume of these extra datasets, the pre-trained model sometimes boosts the performance to some extent. However, the performance improvement is limited when the domain of the extra datasets differs greatly from that of the task-specified dataset. How can we use extra public datasets with lots of data to help train models for specified tasks, e.g., tiny person detection?

The publicly available datasets are quite different from TinyPerson in object type and scale distribution, as shown in Figure 1. Inspired by the human cognitive process, in which humans become more sophisticated at scale-related tasks as they learn more about objects of a similar scale, we propose an easy but efficient scale transformation approach for tiny person detection that keeps the scale consistency between TinyPerson and the extra dataset.
Figure 4. The framework of Scale Match for detection. Given the size distributions of the extra dataset E and of Dtrain, the proposed Scale Match T(·) is adopted to adjust Psize(s; E) to Psize(s; Dtrain). Various training policies can be used here, such as joint training or pre-training.
For a dataset X, we define the probability density function of the objects' size s in X as Psize(s; X). Then we define a scale transform T, which is used to transform the probability distribution of the objects' size in the extra dataset E to that in the targeted dataset D (TinyPerson), given in Eq. (3):

$P_{size}(s; T(E)) \approx P_{size}(s; D)$.  (3)

In this paper, without loss of generality, MS COCO is used as the extra dataset, and Scale Match is used for the scale transformation T.

Algorithm 1 Scale Match (SM) for Detection
INPUT: Dtrain (train set of D)
INPUT: K (integer, number of bins in the histogram used to estimate Psize(s; Dtrain))
INPUT: E (extra labeled dataset)
OUTPUT: Ê (denoted T(E) above)
NOTE: H is the histogram for estimating Psize(s; Dtrain); R is the size range of each histogram bin; Ii is the i-th image in dataset E; Gi represents the set of all ground-truth boxes in Ii; ScaleImage is a function that resizes an image and its ground-truth boxes by a given scale.
Psize(s; Dtrain) is estimated with the histogram H, where N is the number of objects in Dtrain, Gij(Dtrain) is the j-th object in the i-th image of dataset Dtrain, and H[k] is the probability of the k-th bin, given in Eq. (4):

$H[k] = \frac{\big|\{G_{ij}(D_{train}) \mid R[k]^- \le AS(G_{ij}(D_{train})) < R[k]^+\}\big|}{N}$.  (4)

However, the long tail of the dataset distribution (shown in Figure 4) makes histogram fitting inefficient, which means that many bins' probability is close to 0. Therefore, a more efficient rectified histogram (shown in Algorithm 2) is proposed. And SR (sparse rate), counting how many of all bins have probability close to 0, is defined as the measure of H's fitting effectiveness:

$SR = \frac{\big|\{k \mid H[k] \le 1/(\alpha K),\ k = 1, 2, \ldots, K\}\big|}{K}$,  (5)

where K is defined as the number of bins of H and is set to 100, α is set to 10 for SR, and 1/(α·K) is used as the threshold. With the rectified histogram, SR goes down from 0.67 to 0.33 for TinyPerson. The rectified histogram H pays less attention to the long-tail part, which contributes little to the distribution.

Algorithm 2 Rectified Histogram
INPUT: Dtrain (train dataset of D)
INPUT: K (integer, K > 2)
OUTPUT: H (probability of each bin in the histogram for estimating Psize(s; Dtrain))
OUTPUT: R (size range of each bin in the histogram)
NOTE: N is the number of objects in dataset D; Gij(Dtrain) is the j-th object in the i-th image of dataset Dtrain.
1: Array R[K], H[K]
2: // collect all boxes' sizes in Dtrain as Sall
3: Sall ← (..., AS(Gij(Dtrain)), ...)
4: // ascending sort
5: Ssort ← sorted(Sall)
6:
7: // rectified part to handle the long tail
8: p ← 1/K
9: N ← |Ssort|
10: // the first tail small boxes' sizes are merged into the first bin
11: tail ← ceil(N · p)
12: R[1]− ← min(Ssort)
13: R[1]+ ← Ssort[tail + 1]
14: H[1] ← tail/N
15: // the last tail big boxes' sizes are merged into the last bin
16: R[K]− ← Ssort[N − tail]
17: R[K]+ ← max(Ssort)
18: H[K] ← tail/N
19:
20: Smiddle ← Ssort[tail + 1 : N − tail]
21: // compute a histogram with uniform size step and K − 2 bins over Smiddle to obtain H[2], H[3], ..., H[K − 1] and R[2], R[3], ..., R[K − 1]
22: d ← (max(Smiddle) − min(Smiddle)) / (K − 2)
23: for k in 2, 3, ..., K − 1 do
24:   R[k]− ← min(Smiddle) + (k − 2) · d
25:   R[k]+ ← min(Smiddle) + (k − 1) · d
26:   H[k] ← |{Gij(Dtrain) | R[k]− ≤ AS(Gij(Dtrain)) < R[k]+}| / N
27: end for

Image-level scaling: For the objects in the extra dataset E, we need to sample a target size ŝ with respect to Psize(s; Dtrain) and resize the objects to ŝ. In this paper, instead of resizing the objects themselves, we resize the image which holds the objects so that the objects' size reaches ŝ, because resizing only the objects would destroy the image structure. However, there may be more than one object with different sizes in one image; we therefore sample one ŝ per image and guarantee that the mean size of the objects in this image becomes ŝ.

Sample ŝ: We first sample a bin index k with respect to the probabilities H, and then sample ŝ from a uniform distribution with minimum and maximum equal to R[k]− and R[k]+. The first step ensures that the distribution of ŝ is close to Psize(s; Dtrain); for the second step, a uniform sampling algorithm is used.
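To make the pipeline concrete, below is a minimal NumPy sketch of the procedure described above (rectified histogram, per-image sampling of ŝ, image-level scaling). It is an illustration under our own naming and assumes N >> K; it is not the paper's released implementation.

```python
import numpy as np

def rectified_histogram(sizes, K=100):
    """Rectified histogram of object sizes (cf. Algorithm 2): the smallest
    and largest ~N/K sizes are merged into the first/last bin, and the
    K-2 middle bins uniformly cover the remaining size range."""
    s = np.sort(np.asarray(sizes, dtype=float))
    N = len(s)
    tail = int(np.ceil(N / K))
    H = np.zeros(K)
    R = np.zeros((K, 2))                                # R[k] = (lower, upper) of bin k
    R[0], H[0] = (s[0], s[tail]), tail / N              # first (small-size) tail
    R[-1], H[-1] = (s[N - tail - 1], s[-1]), tail / N   # last (big-size) tail
    mid = s[tail:N - tail]
    edges = np.linspace(mid.min(), mid.max(), K - 1)    # K-2 uniform middle bins
    H[1:-1] = np.histogram(mid, bins=edges)[0] / N
    R[1:-1, 0], R[1:-1, 1] = edges[:-1], edges[1:]
    return H / H.sum(), R

def sample_s_hat(H, R, rng):
    # Step 1: draw a bin index with probability H; step 2: uniform within the bin.
    k = rng.choice(len(H), p=H)
    return rng.uniform(R[k, 0], R[k, 1])

def scale_factor_for_image(boxes_wh, H, R, rng):
    """One s_hat is sampled per image of E; the returned factor resizes the
    image (and its boxes) so the MEAN object size in it becomes s_hat."""
    s_hat = sample_s_hat(H, R, rng)
    mean_size = np.sqrt(boxes_wh[:, 0] * boxes_wh[:, 1]).mean()
    return s_hat / mean_size

# Usage sketch: rng = np.random.default_rng(0)
# H, R = rectified_histogram(tinyperson_sizes)
# for each COCO image: f = scale_factor_for_image(boxes_wh, H, R, rng); resize by f.
```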
4.2. Monotone Scale Match (MSM) for Detection

Scale Match can transform the size distribution to that of the task-specified dataset, as shown in Figure 5. Nevertheless, Scale Match may put the original sizes out of order: a very small object could sample a very big size, and vice versa. Monotone Scale Match, which keeps the monotonicity of sizes, is therefore further proposed for the scale transformation.

It is known that the histogram equalization and matching algorithms for image enhancement keep the changes of pixel values monotonic. We follow this idea and change the sizes monotonically, as shown in Figure 6. Mapping an object's size s in dataset E to ŝ with a monotone function f makes the distribution of ŝ the same as Psize(ŝ; Dtrain). For any s0 ∈ [min(s), max(s)], f is determined by:

$\int_{\min(s)}^{s_0} P_{size}(s; E)\, ds = \int_{f(\min(s))}^{f(s_0)} P_{size}(\hat{s}; D_{train})\, d\hat{s}$,  (6)

where min(s) and max(s) represent the minimum and maximum size of objects in E, respectively.
5. Experiments

5.1. Experiments Setting

Ignore region: In TinyPerson, we must handle the ignore regions in the training set. Since an ignore region is always a group of persons (not a single person) or something else which can neither be treated as foreground (positive sample) nor as background (negative sample), there are two ways of processing the ignore regions during training: 1) replace the ignore region with the mean value of the images in the training set; 2) do not back-propagate the gradient that comes from the ignore region. In this paper, we simply adopt the first way.

Image cutting: Most images in TinyPerson are of large size, which results in GPU out-of-memory errors. Therefore, we cut the original images into sub-images with overlap during training and test. Then an NMS strategy is used to merge the results of all sub-images of one image for evaluation. Although image cutting makes better use of GPU resources, there are two flaws: 1) for FPN, pure background images (images with no objects) will not be used for training; due to image cutting, many sub-images become pure background images, which are thus not well utilized; 2) in some cases, NMS cannot merge the results in the overlapping regions well.
Training detail: The code is based on Facebook's maskrcnn-benchmark. We choose ResNet-50 as the backbone. Unless otherwise specified, Faster RCNN-FPN is chosen as the detector. We train for 12 epochs with a base learning rate of 0.01, decayed by a factor of 0.1 after 6 epochs and again after 10 epochs. We train and evaluate on two 2080Ti GPUs. The anchor sizes are set to (8.31, 12.5, 18.55, 30.23, 60.41) and the aspect ratios to (0.5, 1.3, 2), obtained by clustering (a sketch follows below). Since some images in TinyPerson contain dense objects, DETECTIONS_PER_IMG (the maximum number of detector output boxes per image) is set to 200.

Data Augmentation: Only horizontal flipping is adopted to augment the training data. Different from other FPN-based detectors, which resize all images to the same size, we use the original image/sub-image size without any zooming.
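The paper does not spell out the clustering step; as a plausible illustration only, a plain 1-D k-means over the training objects' sizes (and, analogously, over their aspect ratios) might look like:

```python
import numpy as np

def kmeans_1d(values, k, iters=100, seed=0):
    """Plain 1-D k-means, e.g. over AS(G) of all training boxes to pick k
    anchor sizes (our guess at the clustering used; run it likewise on
    h/w ratios for the aspect ratios)."""
    rng = np.random.default_rng(seed)
    v = np.asarray(values, dtype=float)
    centers = rng.choice(v, size=k, replace=False)
    for _ in range(iters):
        assign = np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if (assign == j).any():          # keep old center if cluster empties
                centers[j] = v[assign == j].mean()
    return np.sort(centers)

# e.g. anchor_sizes = kmeans_1d(train_sizes, k=5)
```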
Figure 5. Psize(s; X) of COCO, TinyPerson, and COCO after Scale Match to TinyPerson. For better viewing, we limit the maximum object size to 200 instead of 500.

Figure 6. The flowchart of Monotone Scale Match, mapping an object's size s in E to ŝ in Ê with a monotone function.

5.2. Baseline for TinyPerson Detection

For TinyPerson, RetinaNet [15], FCOS [23] and Faster RCNN-FPN, which are representatives of one-stage anchor-based detectors, anchor-free detectors and two-stage anchor-based detectors, respectively, are selected for experimental comparison. To guarantee convergence, we use half the learning rate of Faster RCNN-FPN for RetinaNet and a quarter for FCOS. For Adaptive FreeAnchor [29], we use the same learning rate and backbone setting as Adaptive RetinaNet, and keep everything else the same as FreeAnchor's default setting.

In Figure 1, WIDER Face holds a similar absolute scale distribution to TinyPerson. Therefore, the state-of-the-art DSFD detector [13], which is specialized for tiny face detection, has been included for comparison on TinyPerson.

Poor localization: As shown in Table 5 and Table 6, the performance drops significantly when the IOU threshold changes from 0.25 to 0.75. It is hard to achieve high localization precision on TinyPerson due to the tiny objects' absolute and relative size.

Spatial information: Due to the size of the tiny objects, spatial information may be more important than a deeper network model. Therefore, we use P2, P3, P4, P5, P6 of the FPN instead of P3, P4, P5, P6, P7 for RetinaNet, which is similar to Faster RCNN-FPN. We name the adjusted version Adaptive RetinaNet. It achieves better performance (a 10.43% improvement of AP_50^tiny) than RetinaNet.
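For reference, the standard FPN level strides make the effect of this swap plain (a note of ours, not the paper's configuration):

```python
# Feature-pyramid strides: using P2-P6 (Adaptive RetinaNet) instead of the
# default P3-P7 gives the head a 4-pixel anchor grid at its finest level
# instead of an 8-pixel one, which matters for objects under 20 pixels.
FPN_STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32, "P6": 64, "P7": 128}
RETINANET_LEVELS = ["P3", "P4", "P5", "P6", "P7"]           # default
ADAPTIVE_RETINANET_LEVELS = ["P2", "P3", "P4", "P5", "P6"]  # tiny-friendly
```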
detector                  MR_50^tiny1  MR_50^tiny2  MR_50^tiny3  MR_50^tiny  MR_50^small  MR_25^tiny  MR_75^tiny
FCOS [23]                 99.10        96.39        91.31        96.12       84.14        89.56       99.56
RetinaNet [15]            95.05        88.34        86.04        92.40       81.75        81.56       99.11
DSFD [13]                 96.41        88.02        86.84        93.47       78.72        78.02       99.48
Adaptive RetinaNet        89.48        82.29        82.40        89.19       74.29        77.83       98.63
Adaptive FreeAnchor [29]  90.26        82.01        81.74        88.97       73.67        77.62       98.70
Faster RCNN-FPN [14]      88.40        81.99        80.17        87.78       71.31        77.35       98.40
Table 5. Comparisons of MRs on TinyPerson.
Best detector: On MS COCO, RetinaNet and FreeAnchor achieve better performance than Faster RCNN-FPN; a one-stage detector can go beyond a two-stage detector if the sample imbalance is well handled [15], and the anchor-free detector FCOS achieves even better performance than RetinaNet and Faster RCNN-FPN. However, when the objects' size becomes tiny, as in TinyPerson, the performance of all detectors drops a lot, and RetinaNet and FCOS perform worse, as shown in Table 5 and Table 6. For tiny objects, two-stage detectors show advantages over one-stage detectors. Li et al. [13] proposed DSFD for face detection, which is one of the SOTA face detectors with code available, but it obtains poor performance on TinyPerson, due to the great difference in relative scale and aspect ratio; this further demonstrates the great challenge posed by the proposed TinyPerson. Based on this performance comparison, Faster RCNN-FPN is chosen as the baseline for the experiments and the benchmark.

5.3. Analysis of Scale Match

TinyPerson. In general, for detection, pre-training on MS COCO often gives better performance than pre-training on ImageNet, although ImageNet holds more data. However, a detector pre-trained on MS COCO improves very little on TinyPerson, since the object sizes in MS COCO are quite different from those in TinyPerson. We therefore obtain a new dataset, COCO100, by setting the shorter edge of each image to 100 and keeping the height-width ratio unchanged. The mean of the objects' size in COCO100 is almost equal to that of TinyPerson. However, the detector pre-trained on COCO100 performs even worse, as shown in Table 7: transforming only the mean of the objects' size to that of TinyPerson is inefficient. We then construct SM COCO by transforming the whole size distribution of MS COCO to that of TinyPerson with Scale Match. With the detector pre-trained on SM COCO, we obtain a 3.22-point improvement of AP_50^tiny (Table 7). Finally, we construct MSM COCO using Monotone Scale Match for the transformation of MS COCO. With MSM COCO as the pre-training dataset, the performance further improves to 47.29 AP_50^tiny (Table 7).

pretrain dataset   MR_50^tiny   AP_50^tiny
ImageNet           87.78        43.55
COCO               86.58        43.38
COCO100            87.67        43.03
SM COCO            86.30        46.77
MSM COCO           85.71        47.29
Table 7. Comparisons on TinyPerson. COCO100 holds a similar mean of box sizes to TinyPerson; SM COCO uses Scale Match on COCO for pre-training, while MSM COCO uses Monotone Scale Match on COCO for pre-training.

Tiny Citypersons. To further validate the effectiveness of the proposed Scale Match on other datasets, we conducted experiments on tiny CityPersons and obtained a similar performance gain (Table 8).

pretrain dataset   MR_50^tiny   AP_50^tiny
ImageNet           75.44        19.08
COCO               74.15        20.74
COCO100            74.92        20.57
SM COCO            73.87        21.18
MSM COCO           72.41        21.56
Table 8. Comparisons on tiny CityPersons. COCO100 holds a similar mean of box sizes to tiny CityPersons.

6. Conclusion

In this paper, a new dataset (TinyPerson) is introduced for detecting tiny objects, particularly tiny persons of less than 20 pixels in large-scale images. The extremely small objects raise a grand challenge for existing person detectors.

We build a baseline for tiny person detection and experimentally find that the scale mismatch can deteriorate the feature representation and the detectors. We thereby propose an easy but efficient approach, Scale Match, for tiny person detection. Our approach is inspired by the human cognitive process, and Scale Match can better utilize the existing annotated data and make the detector more sophisticated. Scale Match is designed as a plug-and-play universal block for object scale processing, which provides a fresh insight for general object detection tasks.
References

[1] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[4] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):743–761, 2011.
[5] M. Enzweiler and D. M. Gavrila. Monocular pedestrian detection: Survey and experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12):2179–2195, 2008.
[6] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. A mobile vision system for robust multi-person tracking. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
[7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
[8] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.
[9] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015.
[12] P. Hu and D. Ramanan. Finding tiny faces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 951–959, 2017.
[13] J. Li, Y. Wang, C. Wang, Y. Tai, J. Qian, J. Yang, C. Wang, J. Li, and F. Huang. DSFD: Dual shot face detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5060–5069, 2019.
[14] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[15] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
[16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[18] J. Mao, T. Xiao, Y. Jiang, and Z. Cao. What can help pedestrian detection? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3127–3136, 2017.
[19] J. Pang, C. Li, J. Shi, Z. Xu, and H. Feng. R2-CNN: Fast tiny object detection in large-scale remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 2019.
[20] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
[21] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[22] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.
[23] Z. Tian, C. Shen, H. Chen, and T. He. FCOS: Fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355, 2019.
[24] C. Wojek, S. Walk, and B. Schiele. Multi-cue onboard pedestrian detection. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 794–801. IEEE, 2009.
[25] S. Yang, P. Luo, C.-C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5525–5533, 2016.
[26] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele. Towards reaching human performance in pedestrian detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):973–986, 2017.
[27] S. Zhang, R. Benenson, and B. Schiele. CityPersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3221, 2017.
[28] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S3FD: Single shot scale-invariant face detector. In Proceedings of the IEEE International Conference on Computer Vision, pages 192–201, 2017.
[29] X. Zhang, F. Wan, C. Liu, R. Ji, and Q. Ye. FreeAnchor: Learning to match anchors for visual object detection. arXiv preprint arXiv:1909.02466, 2019.
[30] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.