Scale Match For Tiny Person Detection


Scale Match for Tiny Person Detection

Xuehui Yu Yuqi Gong Nan Jiang Qixiang Ye Zhenjun Han∗


University of Chinese Academy of Sciences, Beijing, China
{yuxuehui17,gongyuqi18,jiangnan18}@mails.ucas.ac.cn {qxye,hanzhj}@ucas.ac.cn

Abstract

Visual object detection has achieved unprecedented advances with the rise of deep convolutional neural networks. However, detecting tiny objects (for example, tiny persons of less than 20 pixels) in large-scale images remains not well investigated. The extremely small objects raise a grand challenge about feature representation, while the massive and complex backgrounds aggravate the risk of false alarms. In this paper, we introduce a new benchmark, referred to as TinyPerson, opening up a promising direction for tiny object detection in a long distance and with massive backgrounds. We experimentally find that the scale mismatch between the dataset for network pre-training and the dataset for detector learning could deteriorate the feature representation and the detectors. Accordingly, we propose a simple yet effective Scale Match approach to align the object scales between the two datasets for favorable tiny-object representation. Experiments show the significant performance gain of our proposed approach over state-of-the-art detectors, and the challenging aspects of TinyPerson related to real-world scenarios. The TinyPerson benchmark and the code for our approach will be publicly available¹.

1. Introduction

Person/pedestrian detection is an important topic in the computer vision community [5] [4] [8] [27] [18] [26], with wide applications including surveillance, driving assistance, mobile robotics, and maritime quick rescue. With the rise of deep convolutional neural networks, pedestrian detection has achieved unprecedented progress. Nevertheless, detecting tiny persons remains far from well explored.

The reason for the delay of tiny-person detection research is the lack of significant benchmarks. The scenarios of existing person/pedestrian benchmarks [2][6][24][5][4][8], e.g., CityPersons [27], are mainly in a near or middle distance. They are not applicable to the scenarios where persons are in a large area and in a very long distance, e.g., marine search and rescue on a helicopter platform.

Different from objects in proper scales, tiny objects are much more challenging due to the extremely small object size and low signal-to-noise ratio, as shown in Figure 1. After the video encoding/decoding procedure, the image blur causes the tiny objects to mix with the backgrounds, so that preparing the benchmark requires great human effort. The low signal-to-noise ratio can seriously deteriorate the feature representation and thereby challenges the state-of-the-art object detectors.

Figure 1. Comparison of TinyPerson with other datasets. Top: Image examples from TinyPerson show the great challenges (please zoom in for details). Bottom: Statistics about absolute size and relative size of objects. Panel labels: CityPersons, COCO, TinyPerson, WIDER Face.

∗ Corresponding author.
¹ https://github.com/ucas-vg/TinyBenchmark

To detect the tiny persons, we propose a simple yet effective approach, named Scale Match. The intuition of our approach is to align the object scales of the dataset for pre-training and the one for detector training. The nature behind Scale Match is that it can better investigate and utilize the information in tiny scale, and make the convolutional neural networks (CNNs) more sophisticated for tiny object representation. The main contributions of our work include:

1. We introduce TinyPerson, under the background of maritime quick rescue, and raise a grand challenge about tiny object detection in the wild. To our best knowledge, this is the first benchmark for person detection in a long distance and with massive backgrounds. The train/val annotations will be made publicly available and an online benchmark will be set up for algorithm evaluation.
2. We comprehensively analyze the challenges about tiny persons and propose the Scale Match approach, with the purpose of aligning the feature distribution between the dataset for network pre-training and the dataset for detector learning.
3. The proposed Scale Match approach improves the detection performance over the state-of-the-art detector (FPN) with a significant margin (∼5%).

2. Related Work

Dataset for person detection: Pedestrian detection has always been a hot issue in computer vision. Larger capacity, richer scenes and better annotated pedestrian datasets, such as INRIA [2], ETH [6], TudBrussels [24], Daimler [5], Caltech-USA [4], KITTI [8] and CityPersons [27], represent the pursuit of more robust algorithms and better datasets. The data in some datasets were collected in city scenes and sampled from annotated frames of video sequences. However, the pedestrians in those datasets are in a relatively high resolution and large in size, so these datasets are not suitable for tiny object detection.

TinyPerson represents the person in a quite low resolution, mainly less than 20 pixels, in maritime and beach scenes. Such diversity enables models trained on TinyPerson to generalize well to more scenes, e.g., long-distance human target detection and then rescue.

Several small target datasets including WiderFace [25] and TinyNet [19] have been reported. TinyNet involves remote sensing target detection in a long distance. However, the dataset is not publicly available. WiderFace mainly focuses on face detection, as shown in Figure 1. The faces hold a similar distribution of absolute size with TinyPerson, but have a higher resolution and larger relative sizes, as shown in Figure 1.

CNN-based person detection: In recent years, with the development of convolutional neural networks (CNNs), the performance of classification, detection and segmentation on some classical datasets, such as ImageNet [3], Pascal [7], MS COCO [16], has far exceeded that of traditional machine learning algorithms. Region convolutional neural network (R-CNN) [10] has become the popular detection architecture. OverFeat adopted a Conv-Net as a sliding window detector on an image pyramid. R-CNN adopted a region proposal-based method based on selective search and then used a Conv-Net to classify the scale normalized proposals. Spatial pyramid pooling (SPP) [11] adopted R-CNN on feature maps extracted on a single image scale, which demonstrated that such region-based detectors could be applied much more efficiently. Fast R-CNN [9] and Faster R-CNN [21] made a unified object detector in a multitask manner. Dai et al. [1] proposed R-FCN, which uses position-sensitive RoI pooling to get a faster and better detector.

While the region-based methods are complex and time-consuming, single-stage detectors, such as YOLO [20] and SSD [17], are proposed to accelerate the processing speed but with a performance drop, especially on tiny objects.

Tiny object detection: Along with the rapid development of CNNs, researchers search for frameworks specifically for tiny object detection. Lin et al. [14] proposed feature pyramid networks that use the top-down architecture with lateral connections as an elegant multi-scale feature warping method. Zhang et al. [28] proposed a scale-equitable face detection framework to handle different scales of faces well. Then J. Li et al. [13] proposed DSFD for face detection, which is a SOTA open-source face detector. Hu et al. [12] showed that the context is crucial and defined templates that make use of massively large receptive fields. Zhao et al. [30] proposed a pyramid scene-parsing network that employs context reasonably. Shrivastava et al. [22] proposed an online hard example mining method that can improve the performance on small objects significantly.

3. Tiny Person Benchmark

In this paper, the size of an object is defined as the square root of the object's bounding box area. We use $G_{ij} = (x_{ij}, y_{ij}, w_{ij}, h_{ij})$ to describe the j-th object's bounding box of the i-th image $I_i$ in a dataset, where $(x_{ij}, y_{ij})$ denotes the coordinate of the left-top point, and $w_{ij}$, $h_{ij}$ are the width and height of the bounding box. $W_i$, $H_i$ denote the width and height of $I_i$, respectively. Then the absolute size and relative size of an object are calculated as:

$$AS(G_{ij}) = \sqrt{w_{ij} * h_{ij}}. \tag{1}$$

$$RS(G_{ij}) = \sqrt{\frac{w_{ij} * h_{ij}}{W_i * H_i}}. \tag{2}$$

For the size of objects mentioned in the following, we use the objects' absolute size by default.
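The two size measures of Eq. (1) and Eq. (2) can be illustrated with a minimal sketch. The function names and the example numbers are hypothetical and chosen only for illustration; boxes use the (x, y, w, h) layout of $G_{ij}$.

```python
import math


def absolute_size(w, h):
    """Absolute size of a box, Eq. (1): sqrt(w * h)."""
    return math.sqrt(w * h)


def relative_size(w, h, img_w, img_h):
    """Relative size of a box, Eq. (2): sqrt((w * h) / (W * H))."""
    return math.sqrt((w * h) / (img_w * img_h))


if __name__ == "__main__":
    # Hypothetical 16 x 25 pixel box inside a 1600 x 2500 pixel image.
    print(absolute_size(16, 25))
    print(relative_size(16, 25, 1600, 2500))
```

Note that the relative size is simply the absolute size divided by the square root of the image area, so down-sampling a whole image changes the first measure but not the second.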

dataset        absolute size    relative size    aspect ratio
TinyPerson     18.0±17.4        0.012±0.010      0.676±0.416
COCO           99.5±107.5       0.190±0.203      1.214±1.339
Wider Face     32.8±52.7        0.036±0.052      0.801±0.168
CityPersons    79.8±67.5        0.055±0.046      0.410±0.008

Table 1. Mean and standard deviation of absolute size, relative size and aspect ratio of the datasets: TinyPerson, MS COCO, Wider Face and CityPersons.

3.1. Benchmark description

Dataset Collection: The images in TinyPerson are collected from the Internet. Firstly, videos with a high resolution are collected from different websites. Secondly, we sample images from the videos every 50 frames. Then we delete images with a certain repetition (homogeneity). We annotate 72651 objects with bounding boxes by hand.

Dataset Properties: 1) The persons in TinyPerson are quite tiny compared with other representative datasets, as shown in Figure 1 and Table 1, which is the main characteristic of TinyPerson; 2) The aspect ratio of persons in TinyPerson has a large variance, as given in Table 1. The various poses and viewpoints of persons in TinyPerson bring more complex diversity of the persons and make detection more difficult. In addition, TinyPerson can also be an effective supplement to the existing datasets in the diversity of poses and views; 3) In TinyPerson, we mainly focus on persons around the seaside, which can be used for quick maritime rescue and defense around the sea; 4) There are many images with dense objects (more than 200 persons per image) in TinyPerson. Therefore, TinyPerson can also be used for other tasks, e.g., person counting.

Annotation rules: In TinyPerson, we classify persons as “sea person” (persons in the sea) or “earth person” (persons on the land). We define four rules to determine which label a person belongs to: 1) Persons on a boat are treated as “sea person”; 2) Persons lying in the water are treated as “sea person”; 3) Persons with more than half of the body in water are treated as “sea person”; 4) Others are treated as “earth person”. In TinyPerson, there are three conditions where persons are labeled as “ignore”: 1) Crowds, which we can recognize as persons but which are hard to separate one by one when labeled with standard rectangles; 2) Ambiguous regions, in which it is hard to clearly distinguish whether there is one or more persons; and 3) Reflections in water. In TinyPerson, some objects are hard to recognize as human beings; we directly label them as “uncertain”. Some annotation examples are given in Figure 2.

Figure 2. The annotation examples. “sea person”, “earth person”, “uncertain sea person”, “uncertain earth person” and ignore region are represented with red, green, blue, yellow and purple rectangles, respectively. The regions are zoomed in and shown on the right.

TinyPerson      train set    valid set    sum
#image          794          816          1610
#annotations    42197        30454        72651
#normal         18433        13787        32120
#ignore         3369         1989         5358
#uncertain      3486         2832         6318
#dense          16909        11946        28855
#sea            26331        15925        42256
#earth          15867        14530        30397
#ignore         3369         1989         5358

Table 2. Detailed statistics for TinyPerson. TinyPerson can be divided into “normal”, “ignore”, “uncertain” and “dense” based on the attributes, and into “sea”, “earth” and “ignore” by the classes, as described in the annotation rules in Section 3.1.

Evaluation: We use both AP (average precision) and MR (miss rate) for performance evaluation. For more detailed experimental comparisons, the size range is divided into 3 intervals: tiny[2, 20], small[20, 32] and all[2, inf]. And tiny[2, 20] is partitioned into 3 sub-intervals: tiny1[2, 8], tiny2[8, 12], tiny3[12, 20]. The IOU threshold is set to 0.5 for performance evaluation. Because many applications of tiny person detection are concerned more with finding persons than with locating them precisely (e.g., shipwreck search and rescue), the IOU threshold 0.25 is also used for evaluation.

For Caltech or CityPersons, the IOU criterion is adopted for performance evaluation. The size of most ignore regions in Caltech and CityPersons is the same as that of a pedestrian. However, in TinyPerson, most ignore regions are much larger than a person. Therefore, we change the IOU criterion to IOD for ignore regions (the IOD criterion only applies to ignore regions; other classes still use the IOU criterion), as shown in Figure 3. In this paper, we also treat uncertain the same as ignore while training and testing.

Training&Test Set: The training and test sets are constructed by randomly splitting the images equally into two
subsets, while images from the same video cannot be split into the same subset.

Focusing on the person detection task, we treat “sea person” and “earth person” as one same class (person). And for the detection task, we only use those images which have less than 200 valid persons. What's more, TinyPerson can be used for more tasks, as mentioned before, based on different manual configurations of TinyPerson.

Figure 3. IOU (intersection over union) and IOD (intersection over detection). IOD is used for ignored regions in evaluation. The outlined (violet) box represents a labeled ignored region and the dashed boxes are unlabeled and ignored persons. The red box is a detection result box that has high IOU with one of the ignored persons.

3.2. Dataset Challenges

Tiny absolute size: For a tiny object dataset, extremely small size is one of the key characteristics and one of the main challenges. To quantify the effect of absolute size reduction on performance, we down-sample CityPersons by 4*4 to construct tiny CityPersons, where the mean of objects' absolute size is the same as that of TinyPerson. Then we train a detector for CityPersons and tiny CityPersons, respectively; the performance is shown in Table 4. The performance drops significantly when the objects' size becomes tiny. In Table 4, the $MR_{50}^{tiny}$ of tiny CityPersons is about 40 points higher (worse) than that of CityPersons (75.44 vs. 35.65). Tiny object size really brings a great challenge in detection, which is also the main concern of this paper.

The FPN pre-trained with MS COCO can learn more about the objects with the representative size in MS COCO; however, it is not sophisticated with objects of tiny size. The big difference of the size distribution brings a significant reduction in performance. In addition, a tiny object becomes blurry, resulting in poor semantic information of the object. The performance of the deep neural network is further greatly affected.

Tiny relative size: Tiny CityPersons holds a similar absolute size to TinyPerson. However, because the whole image is reduced, the relative size does not change when down-sampling. Different from tiny CityPersons, the images in TinyPerson are captured far away in the real scene. The objects' relative size of TinyPerson is smaller than that of CityPersons, as shown in the bottom-right of Figure 1.

To better quantify the effect of the tiny relative size, we obtain two new datasets, 3*3 tiny CityPersons and 3*3 TinyPerson, by directly 3*3 up-sampling tiny CityPersons and TinyPerson, respectively. Then FPN detectors are trained for 3*3 tiny CityPersons and 3*3 TinyPerson. The performance results are shown in Table 3. For tiny CityPersons, simply up-sampling improved $MR_{50}^{tiny}$ and $AP_{50}^{tiny}$ by 29.95 and 16.31 points respectively, which are closer to the original CityPersons's performance. However, for TinyPerson, the same up-sampling strategy obtains limited performance improvement. The tiny relative size results in more false positives and a serious imbalance of positives/negatives, because massive and complex backgrounds are introduced in a real scenario. The tiny relative size also greatly challenges the detection task.

dataset                  MR_50^tiny    AP_50^tiny
tiny CityPersons         75.44         19.08
3*3 tiny CityPersons     45.49         35.39
TinyPerson               85.71         47.29
3*3 TinyPerson           83.21         52.47

Table 3. The performance of tiny CityPersons, TinyPerson and their 3*3 up-sampled datasets (the 4*4 up-sampling strategy causes out-of-memory for TinyPerson, so here we use the 3*3 up-sampling strategy as an alternative).

4. Tiny Person Detection

It is known that the more data used for training, the better the performance will be. However, the cost of collecting data for a specified task is very high. A common approach is training a model on extra datasets as a pre-trained model, and then fine-tuning it on a task-specified dataset. Due to the huge data volume of these datasets, the pre-trained model sometimes boosts the performance to some extent. However, the performance improvement is limited when the domain of these extra datasets differs greatly from that of the task-specified dataset. How can we use extra public datasets with lots of data to help train a model for specified tasks, e.g., tiny person detection?

The publicly available datasets are quite different from TinyPerson in object type and scale distribution, as shown in Figure 1. Inspired by the Human Cognitive Process, in which humans become sophisticated with some scale-related tasks when they learn more about objects with a similar scale, we propose an easy but efficient scale transformation approach for tiny person detection by keeping the scale consistency between TinyPerson and the extra dataset.

For dataset X, we define the probability density function of objects' size s in X as $P_{size}(s; X)$. Then we define a

[Figure 4 diagram. Labels: $P_{size}(s; E)$, $P_{size}(s; T(E))$, $P_{size}(s; D_{train})$, $P_{size}(s; D_{test})$; Scale Match $T$: $P_{size}(s; T(E)) \approx P_{size}(s; D)$; Model, Train Policy, Evaluation.]

Figure 4. The framework of Scale Match for detection. With the distributions of the E and $D_{train}$ datasets, the proposed Scale Match $T(\cdot)$ is adopted to adjust $P_{size}(s; E)$ to $P_{size}(s; D_{train})$. Various training policies can be used here, such as joint training or pre-training.
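The transformation summarized in Figure 4 can be sketched in a few lines, following the image-level procedure of Algorithm 1 and the rectified histogram of Algorithm 2. This is a minimal sketch under stated assumptions, not the authors' released code: the function names, the 0-based indexing, the choice to put the largest middle size into the last middle bin, and the use of Python's `random` module are all assumptions made for illustration. Resizing the image itself is left out; only the scale ratio is computed.

```python
import math
import random


def rectified_histogram(sizes, K):
    """Estimate P_size(s; D_train) with K bins; both tails are merged into one bin each."""
    s = sorted(sizes)
    n = len(s)
    tail = -(-n // K)  # integer ceil(n * (1 / K))
    if K <= 2 or n <= 2 * tail:
        raise ValueError("need K > 2 and more than 2 * ceil(n / K) sizes")
    middle = s[tail:n - tail]
    lo, hi = middle[0], middle[-1]
    d = (hi - lo) / (K - 2)  # uniform step of the K - 2 middle bins
    counts = [0] * (K - 2)
    for v in middle:
        # Assumption: the largest middle size is counted in the last middle bin.
        idx = min(int((v - lo) / d), K - 3) if d > 0 else 0
        counts[idx] += 1
    H = [tail / n] + [c / n for c in counts] + [tail / n]
    R = [(s[0], lo)]
    R += [(lo + k * d, lo + (k + 1) * d) for k in range(K - 2)]
    R.append((hi, s[-1]))
    return H, R


def sample_size(H, R, rng=random):
    """Sample a bin index by its probability, then a size uniformly inside that bin."""
    k = rng.choices(range(len(H)), weights=H, k=1)[0]
    lo, hi = R[k]
    return rng.uniform(lo, hi)


def scale_match_ratio(boxes, H, R, rng=random):
    """Scale ratio c for one image: sampled size divided by the mean box size."""
    mean_size = sum(math.sqrt(w * h) for (_, _, w, h) in boxes) / len(boxes)
    return sample_size(H, R, rng) / mean_size


if __name__ == "__main__":
    # Hypothetical target sizes and one hypothetical extra-dataset image.
    H, R = rectified_histogram(range(1, 101), K=10)
    c = scale_match_ratio([(0, 0, 50, 50)], H, R, random.Random(0))
    print(len(H), c)
```

Sampling one ratio per image rather than per object keeps the image structure intact, at the cost of matching only the mean object size of each image to the sampled size.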

dataset             MR_50^tiny1    MR_50^tiny2    MR_50^tiny3    MR_50^tiny    MR_50^small
CityPersons         56.40          24.29          8.93           35.65         7.43
tiny CityPersons    94.04          72.56          49.37          75.44         23.70

Table 4. The performance of CityPersons and tiny CityPersons. To guarantee objectivity and fairness, $MR_{50}^{tiny}$, $MR_{50}^{tiny1}$, $MR_{50}^{tiny2}$, $MR_{50}^{tiny3}$, $MR_{50}^{small}$ are calculated with size in range [2, 20], [2, 8], [8, 12], [12, 20], [20, 32] for tiny CityPersons and in range [8, 80], [8, 32], [32, 48], [48, 80], [80, 128] for CityPersons, respectively.
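The size ranges in the caption of Table 4 are the TinyPerson intervals multiplied by the 4x down-sampling factor. A minimal sketch of that bookkeeping follows; the names are hypothetical, and treating each interval as half-open (lower bound included, upper bound excluded) is an assumption, since the text writes the ranges with shared endpoints.

```python
# Size intervals used for evaluation, in pixels of absolute size.
TINY_INTERVALS = {
    "tiny1": (2, 8),
    "tiny2": (8, 12),
    "tiny3": (12, 20),
    "tiny": (2, 20),
    "small": (20, 32),
}


def scaled_intervals(factor, intervals=TINY_INTERVALS):
    """Intervals for a dataset whose objects are `factor` times larger."""
    return {name: (lo * factor, hi * factor) for name, (lo, hi) in intervals.items()}


def size_groups(size, intervals=TINY_INTERVALS):
    """Names of all intervals that contain `size` (assumed half-open: lo <= size < hi)."""
    return [name for name, (lo, hi) in intervals.items() if lo <= size < hi]


if __name__ == "__main__":
    print(scaled_intervals(4))  # ranges applied to the original CityPersons
    print(size_groups(10))      # groups for a 10-pixel object in tiny CityPersons
```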

scale transform $T$, which is used to transform the probability distribution of objects' size in the extra dataset E to that in the targeted dataset D (TinyPerson), as given in Eq. (3):

$$P_{size}(s; T(E)) \approx P_{size}(s; D). \tag{3}$$

In this paper, without loss of generality, MS COCO is used as the extra dataset, and Scale Match is used for the scale transformation $T$.

4.1. Scale Match

$G_{ij} = (x_{ij}, y_{ij}, w_{ij}, h_{ij})$ represents the j-th object in image $I_i$ of dataset E. The Scale Match approach can be simply described as three steps:

1. Sample $\hat{s} \sim P_{size}(s; D)$;
2. Calculate the scale ratio $c = \hat{s} / AS(G_{ij})$;
3. Resize the object with scale ratio $c$, then $\hat{G}_{ij} \leftarrow (x_{ij} * c, y_{ij} * c, w_{ij} * c, h_{ij} * c)$;

where $\hat{G}_{ij}$ is the result after Scale Match. Scale Match is applied to all objects in E to get $T(E)$; when there is a large number of targets in E, $P_{size}(s; T(E))$ will be close to $P_{size}(s; D)$. Details of the Scale Match algorithm are shown in Algorithm 1.

Algorithm 1 Scale Match (SM) for Detection
INPUT: D_train (train set of D)
INPUT: K (integer, number of bins in the histogram used to estimate P_size(s; D_train))
INPUT: E (extra labeled dataset)
OUTPUT: Ê (noted as T(E) before)
NOTE: H is the histogram for estimating P_size(s; D_train); R is the size range of each histogram bin; I_i is the i-th image in dataset E; G_i represents the set of all ground-truth boxes in I_i; ScaleImage is a function to resize an image and its ground-truth boxes with a given scale.
  Ê ← ∅
  // obtain an approximate P_size(s; D_train)
  (H, R) ← RectifiedHistogram(D_train, K)
  for (I_i, G_i) in E do
    // calculate the mean size of the boxes in G_i
    s ← Mean(G_i)
    // sample a bin index of H by the probability value
    sample k ∼ H
    sample ŝ ∼ Uniform(R[k]−, R[k]+)
    c ← ŝ / s
    Î_i, Ĝ_i ← ScaleImage(I_i, G_i, c)
    Ê ← Ê ∪ (Î_i, Ĝ_i)
  end for

Estimate $P_{size}(s; D)$: In Scale Match, we first estimate $P_{size}(s; D)$, following a basic assumption in machine learning: the distribution of a randomly sampled training dataset is close to the actual distribution. Therefore, the training set distribution $P_{size}(s; D_{train})$ is used to approximate $P_{size}(s; D)$.

Rectified Histogram: The discrete histogram (H, R) is used to approximate $P_{size}(s; D_{train})$ for calculation, $R[k]^-$ and $R[k]^+$ are the size boundaries of the k-th bin in the histogram, K is the number of bins in the histogram, N is the
number of objects in $D_{train}$, $G_{ij}(D_{train})$ is the j-th object in the i-th image of dataset $D_{train}$, and $H[k]$ is the probability of the k-th bin, given in Eq. (4):

$$H[k] = \frac{|\{G_{ij}(D_{train}) \mid R[k]^- \le AS(G_{ij}(D_{train})) < R[k]^+\}|}{N}. \tag{4}$$

However, the long tail of the dataset distribution (shown in Figure 4) makes histogram fitting inefficient, which means that many bins' probability is close to 0. Therefore, a more efficient rectified histogram (as shown in Algorithm 2) is proposed. SR (sparse rate), which measures how many of the bins have a probability close to 0, is defined as the measure of H's fitting effectiveness:

$$SR = \frac{|\{k \mid H[k] \le 1/(\alpha * K),\ k = 1, 2, \ldots, K\}|}{K}. \tag{5}$$

where K is the number of bins of H and is set to 100, $\alpha$ is set to 10 for SR, and $1/(\alpha * K)$ is used as a threshold. With the rectified histogram, SR goes down to 0.33 from 0.67 for TinyPerson. The rectified histogram H pays less attention to the long tail part, which contributes less to the distribution.

Algorithm 2 Rectified Histogram
INPUT: D_train (train dataset of D)
INPUT: K (integer, K > 2)
OUTPUT: H (probability of each bin in the histogram for estimating P_size(s; D_train))
OUTPUT: R (size range of each bin in the histogram)
NOTE: N is the number of objects in dataset D_train; G_ij(D_train) is the j-th object in the i-th image of dataset D_train.
  Array R[K], H[K]
  // collect all boxes' sizes in D_train as S_all
  S_all ← (..., AS(G_ij(D_train)), ...)
  // ascending sort
  S_sort ← sorted(S_all)
  // rectify part to solve the long tail
  p ← 1 / K
  N ← |S_sort|
  // the first tail small boxes' sizes are merged into the first bin
  tail ← ceil(N ∗ p)
  R[1]− ← min(S_sort)
  R[1]+ ← S_sort[tail + 1]
  H[1] ← tail / N
  // the last tail big boxes' sizes are merged into the last bin
  R[K]− ← S_sort[N − tail]
  R[K]+ ← max(S_sort)
  H[K] ← tail / N
  S_middle ← S_sort[tail + 1 : N − tail]
  // calculate a histogram with uniform size step and K − 2 bins for S_middle
  // to get H[2], H[3], ..., H[K − 1] and R[2], R[3], ..., R[K − 1]
  d ← (max(S_middle) − min(S_middle)) / (K − 2)
  for k in 2, 3, ..., K − 1 do
    R[k]− ← min(S_middle) + (k − 2) ∗ d
    R[k]+ ← min(S_middle) + (k − 1) ∗ d
    H[k] ← |{G_ij(D_train) | R[k]− ≤ AS(G_ij(D_train)) < R[k]+}| / N
  end for

Image-level scaling: For all objects in the extra dataset E, we need to sample a $\hat{s}$ with respect to $P_{size}(s; D_{train})$ and resize the object to $\hat{s}$. In this paper, instead of resizing the object, we resize the image which holds the object to make the object's size reach $\hat{s}$, because resizing only the objects would destroy the image structure. However, there may be more than one object with different sizes in one image. We thus sample one $\hat{s}$ per image and guarantee that the mean size of the objects in this image equals $\hat{s}$.

Sample $\hat{s}$: We first sample a bin index with respect to the probability of H, and then sample $\hat{s}$ with respect to a uniform probability distribution with min and max size equal to $R[k]^-$ and $R[k]^+$. The first step ensures that the distribution of $\hat{s}$ is close to that of $P_{size}(s; D_{train})$. For the second step, a uniform sampling algorithm is used.

4.2. Monotone Scale Match (MSM) for Detection

Scale Match can transform the distribution of size to that of the task-specified dataset, as shown in Figure 5. Nevertheless, Scale Match may make the original sizes out of order: a very small object could sample a very big size and vice versa. The Monotone Scale Match, which keeps the monotonicity of size, is further proposed for scale transformation.

It is known that the histogram equalization and matching algorithms for image enhancement keep the monotonic changes of pixel values. We follow this idea to monotonically change the size, as shown in Figure 6. Mapping an object's size s in dataset E to $\hat{s}$ with a monotone function $f$ makes the distribution of $\hat{s}$ the same as $P_{size}(\hat{s}; D_{train})$. For any $s_0 \in [\min(s), \max(s)]$, it is calculated as:

$$\int_{\min(s)}^{s_0} P_{size}(s; E)\,ds = \int_{f(\min(s))}^{f(s_0)} P_{size}(\hat{s}; D_{train})\,d\hat{s}. \tag{6}$$

where $\min(s)$ and $\max(s)$ represent the minimum and maximum size of objects in E, respectively.

5. Experiments

5.1. Experiments Setting

Ignore region: In TinyPerson, we must handle ignore regions in the training set. Since the ignore region is always a group of persons (not a single person) or something else
which can neither be treated as foreground (positive sample) nor background (negative sample). There are two ways of processing the ignore regions while training: 1) Replace the ignore region with the mean value of the images in the training set; 2) Do not back-propagate the gradient which comes from the ignore region. In this paper, we simply adopt the first way for ignore regions.

Figure 5. $P_{size}(s; X)$ of COCO, TinyPerson and COCO after Scale Match to TinyPerson. For a better view, we limit the max object size to 200 instead of 500.

Figure 6. The flowchart of Monotone Scale Match, mapping the object's size s in E to $\hat{s}$ in Ê with a monotone function.

Image cutting: Most images in TinyPerson are of large size, which results in GPU out-of-memory. Therefore, we cut the original images into overlapping sub-images during training and test. Then the NMS strategy is used to merge all results of the sub-images of one same image for evaluation. Although the image cutting can make better use of GPU resources, there are two flaws: 1) For FPN, pure background images (no object in the image) will not be used for training. Due to image cutting, many sub-images become pure background images, which are not well utilized; 2) In some conditions, NMS cannot merge the results in overlapping regions well.

Training detail: The code is based on Facebook maskrcnn-benchmark. We choose ResNet50 as the backbone. If not specified, Faster RCNN-FPN is chosen as the detector. We train for 12 epochs with a base learning rate of 0.01, decayed by 0.1 after 6 epochs and 10 epochs. We train and evaluate on two 2080Ti GPUs. Anchor size is set to (8.31, 12.5, 18.55, 30.23, 60.41), and aspect ratio is set to (0.5, 1.3, 2) by clustering. Since some images in TinyPerson contain dense objects, DETECTIONS_PER_IMG (the max number of result boxes the detector outputs per image) is set to 200.

Data Augmentation: Only horizontal flip is adopted to augment the data for training. Different from other FPN-based detectors, which resize all images to the same size, we use the original image/sub-image size without any zooming.

5.2. Baseline for TinyPerson Detection

For TinyPerson, RetinaNet [15], FCOS [23] and Faster RCNN-FPN, which are the representatives of one-stage anchor-based detectors, anchor-free detectors and two-stage anchor-based detectors respectively, are selected for experimental comparisons. To guarantee convergence, we use half the learning rate of Faster RCNN-FPN for RetinaNet and a quarter for FCOS. For Adaptive FreeAnchor [29], we use the same learning rate and backbone setting as Adaptive RetinaNet, and the others are kept the same as FreeAnchor's default setting.

In Figure 1, WIDER Face holds a similar absolute scale distribution to TinyPerson. Therefore, the state-of-the-art DSFD detector [13], which is specified for tiny face detection, has been included for comparison on TinyPerson.

Poor localization: As shown in Table 5 and Table 6, the performance drops significantly when the IOU threshold changes from 0.25 to 0.75. It is hard to obtain high localization precision in TinyPerson due to the tiny objects' absolute and relative size.

Spatial information: Due to the size of the tiny objects, spatial information may be more important than a deeper network model. Therefore, we use P2, P3, P4, P5, P6 of FPN instead of P3, P4, P5, P6, P7 for RetinaNet, which is similar to Faster RCNN-FPN. We name the adjusted version Adaptive RetinaNet. It achieves better performance (10.43% improvement of $AP_{50}^{tiny}$) than RetinaNet.

Best detector: With MS COCO, RetinaNet and FreeAnchor achieve better performance than Faster RCNN-FPN. A one-stage detector can also go beyond a two-stage detector if sample imbalance is well solved [15]. The anchor-free detector FCOS achieves better performance compared with RetinaNet and Faster RCNN-FPN. However, when objects' size becomes tiny, such as for objects in TinyPerson, the performance of all detectors drops a lot. And RetinaNet and FCOS perform worse, as shown in Table 5 and Table 6. For tiny objects, the two-stage detector shows advantages over one-stage detectors. Li et al. [13] proposed

detector                     MR_50^tiny1   MR_50^tiny2   MR_50^tiny3   MR_50^tiny   MR_50^small   MR_25^tiny   MR_75^tiny
FCOS [23]                    99.10         96.39         91.31         96.12        84.14         89.56        99.56
RetinaNet [15]               95.05         88.34         86.04         92.40        81.75         81.56        99.11
DSFD [13]                    96.41         88.02         86.84         93.47        78.72         78.02        99.48
Adaptive RetinaNet           89.48         82.29         82.40         89.19        74.29         77.83        98.63
Adaptive FreeAnchor [29]     90.26         82.01         81.74         88.97        73.67         77.62        98.7
Faster RCNN-FPN [14]         88.40         81.99         80.17         87.78        71.31         77.35        98.40

Table 5. Comparisons of MRs on TinyPerson.

detector                     AP_50^tiny1   AP_50^tiny2   AP_50^tiny3   AP_50^tiny   AP_50^small   AP_25^tiny   AP_75^tiny
FCOS [23]                    3.39          12.39         29.25         16.9         35.75         40.49        1.45
RetinaNet [15]               11.47         36.36         43.32         30.82        43.38         57.33        2.64
DSFD [13]                    14.00         35.12         46.30         31.15        51.64         59.58        1.99
Adaptive RetinaNet           25.49         47.89         51.28         41.25        53.64         63.94        4.22
Adaptive FreeAnchor [29]     24.92         48.01         51.23         41.36        53.36         63.73        4.0
Faster RCNN-FPN [14]         29.21         48.26         53.48         43.55        56.69         64.07        5.35

Table 6. Comparisons of APs on TinyPerson.
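Tables 5 and 6 report MR and AP at IOU thresholds of 0.25, 0.5 and 0.75, with the IOD criterion applied to ignore regions. The two overlap measures can be sketched as follows. This is a minimal illustration, assuming boxes in the (x, y, w, h) layout of $G_{ij}$ and assuming IOD is the intersection area divided by the area of the detection box, as its name suggests; the function names are hypothetical.

```python
def _intersection_area(a, b):
    """Overlap area of two boxes given as (x, y, w, h) with a left-top origin."""
    iw = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    ih = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(0.0, iw) * max(0.0, ih)


def iou(det, gt):
    """Intersection over union, used for normal ground-truth boxes."""
    inter = _intersection_area(det, gt)
    union = det[2] * det[3] + gt[2] * gt[3] - inter
    return inter / union if union > 0 else 0.0


def iod(det, ignore_region):
    """Intersection over detection area, used for ignore regions (assumed definition)."""
    det_area = det[2] * det[3]
    return _intersection_area(det, ignore_region) / det_area if det_area > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical small detection fully inside a large ignore region.
    detection = (10, 10, 5, 5)
    region = (0, 0, 100, 100)
    print(iou(detection, region), iod(detection, region))
```

The example shows why IOD suits large ignore regions: a small detection lying entirely inside the region has a tiny IOU with it but an IOD of 1, so it can be matched to the ignore region instead of being counted as a false positive.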

tiny tiny
DSFD for face detection, which is one of the SOTA face pretrain dataset M R50 AP50
detectors with code available. But it obtained poor perfor- ImageNet 87.78 43.55
mance on TinyPerson, due to the great difference between COCO 86.58 43.38
relative scale and aspect ratio, which also further demon- COCO100 87.67 43.03
strates the great chanllange of the proposed TinyPerson. SM COCO 86.30 46.77
With performance comparison, Faster RCNN-FPN is cho- MSM COCO 85.71 47.29
Table 7. Comparisons on TinyPerson. COCO100 holds the sim-
sen as the baseline of experiment and the benchmark.
ilar mean of the boxes’ size with TinyPerson, SM COCO uses
Scale Match on COCO for pre-training, while MSM COCO uses
Monotonous Scale Match on COCO for pre-training.
5.3. Analysis of Scale Match
TinyPerson. In general, for detection, pre-training on MS COCO often yields better performance than pre-training on ImageNet, although ImageNet holds more data. However, a detector pre-trained on MS COCO brings very limited improvement on TinyPerson, since the object sizes in MS COCO are quite different from those in TinyPerson. We then obtain a new dataset, COCO100, by setting the shorter edge of each image to 100 pixels and keeping the height-width ratio unchanged. The mean object size in COCO100 is almost equal to that of TinyPerson. However, the detector pre-trained on COCO100 performs even worse, as shown in Table 7: matching only the mean object size to that of TinyPerson is ineffective. We then construct SM COCO by transforming the whole size distribution of MS COCO to that of TinyPerson based on Scale Match. With the detector pre-trained on SM COCO, we obtain a 3.22% improvement in $AP_{50}^{tiny}$ (from 43.55 with ImageNet pre-training to 46.77, Table 7). Finally, we construct MSM COCO using Monotone Scale Match for the transformation of MS COCO. With MSM COCO as the pre-training dataset, the performance further improves to 47.29% $AP_{50}^{tiny}$ (Table 7).

Tiny Citypersons. To further validate the effectiveness of the proposed Scale Match on other datasets, we conducted experiments on Tiny Citypersons and obtained a similar performance gain (Table 8).

pretrain dataset   $MR_{50}^{tiny}$   $AP_{50}^{tiny}$
ImageNet           75.44   19.08
COCO               74.15   20.74
COCO100            74.92   20.57
SM COCO            73.87   21.18
MSM COCO           72.41   21.56
Table 8. Comparisons on Tiny Citypersons. COCO100 has a mean box size similar to that of Tiny Citypersons.

6. Conclusion

In this paper, a new dataset (TinyPerson) is introduced for detecting tiny objects, particularly tiny persons of less than 20 pixels in large-scale images. The extremely small objects raise a grand challenge for existing person detectors.

We build the baseline for tiny person detection and experimentally find that the scale mismatch could deteriorate the feature representation and the detectors. We therefore propose a simple yet effective approach, Scale Match, for tiny person detection. Our approach is inspired by the Human Cognition Process, while Scale Match can better utilize the existing annotated data and make the detector more sophisticated. Scale Match is designed as a plug-and-play universal block for object scale processing, which provides a fresh insight for general object detection tasks.
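
The two pre-training transformations contrasted in Section 5.3 can be illustrated with a rough sketch. COCO100 applies one rule to every image (shorter edge set to 100, aspect ratio unchanged), whereas a distribution-level transformation picks a different target object size per image. The code below is a minimal sketch under stated assumptions, not the authors' implementation: boxes are assumed to be in `(x, y, w, h)` form, object size is assumed to be the square root of the box area, and `sampled_scale` simply draws a target size from a list of sizes observed in the target dataset. All function names are hypothetical.

```python
import random


def coco100_scale(width, height, target_edge=100):
    """Scale factor that maps the shorter image edge to target_edge pixels."""
    return target_edge / min(width, height)


def resize_annotations(width, height, boxes, scale):
    """Apply one scale factor to the image size and to every (x, y, w, h) box."""
    new_size = (round(width * scale), round(height * scale))
    new_boxes = [tuple(v * scale for v in box) for box in boxes]
    return new_size, new_boxes


def box_size(box):
    """Object size taken as sqrt(w * h) (an assumption of this sketch)."""
    _, _, w, h = box
    return (w * h) ** 0.5


def sampled_scale(boxes, target_sizes, rng):
    """Per-image scale factor: a target size drawn from the target dataset's
    observed sizes, divided by the mean object size in this image."""
    if not boxes:
        raise ValueError("image has no annotated boxes")
    mean_size = sum(box_size(b) for b in boxes) / len(boxes)
    return rng.choice(target_sizes) / mean_size


if __name__ == "__main__":
    boxes = [(100.0, 50.0, 64.0, 128.0)]

    # COCO100-style: the same rule for every image.
    s = coco100_scale(640, 480)
    print(resize_annotations(640, 480, boxes, s))

    # Distribution-level: the target size varies from image to image.
    rng = random.Random(0)
    hypothetical_target_sizes = [10.0, 18.0, 25.0]  # made-up values for illustration
    s = sampled_scale(boxes, hypothetical_target_sizes, rng)
    print(resize_annotations(640, 480, boxes, s))
```

With the fixed rule, every image is shrunk by a factor that ignores how large its objects are, so only the average object size moves toward the target. With per-image sampling, the resulting object sizes follow the sampled target sizes themselves.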

References

[1] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.
[2] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. 2005.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[4] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. IEEE transactions on pattern analysis and machine intelligence, 34(4):743–761, 2011.
[5] M. Enzweiler and D. M. Gavrila. Monocular pedestrian detection: Survey and experiments. IEEE transactions on pattern analysis and machine intelligence, 31(12):2179–2195, 2008.
[6] A. Ess, B. Leibe, K. Schindler, and L. Van Gool. A mobile vision system for robust multi-person tracking. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
[7] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
[8] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.
[9] R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 37(9):1904–1916, 2015.
[12] P. Hu and D. Ramanan. Finding tiny faces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 951–959, 2017.
[13] J. Li, Y. Wang, C. Wang, Y. Tai, J. Qian, J. Yang, C. Wang, J. Li, and F. Huang. Dsfd: dual shot face detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5060–5069, 2019.
[14] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
[15] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
[16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
[18] J. Mao, T. Xiao, Y. Jiang, and Z. Cao. What can help pedestrian detection? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3127–3136, 2017.
[19] J. Pang, C. Li, J. Shi, Z. Xu, and H. Feng. R2-cnn: Fast tiny object detection in large-scale remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 2019.
[20] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
[21] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
[22] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 761–769, 2016.
[23] Z. Tian, C. Shen, H. Chen, and T. He. Fcos: Fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355, 2019.
[24] C. Wojek, S. Walk, and B. Schiele. Multi-cue onboard pedestrian detection. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 794–801. IEEE, 2009.
[25] S. Yang, P. Luo, C.-C. Loy, and X. Tang. Wider face: A face detection benchmark. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5525–5533, 2016.
[26] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele. Towards reaching human performance in pedestrian detection. IEEE transactions on pattern analysis and machine intelligence, 40(4):973–986, 2017.
[27] S. Zhang, R. Benenson, and B. Schiele. Citypersons: A diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3221, 2017.
[28] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S3fd: Single shot scale-invariant face detector. In Proceedings of the IEEE International Conference on Computer Vision, pages 192–201, 2017.
[29] X. Zhang, F. Wan, C. Liu, R. Ji, and Q. Ye. Freeanchor: Learning to match anchors for visual object detection. arXiv preprint arXiv:1909.02466, 2019.
[30] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
