An Analysis of Scale Invariance in Object Detection
1. Introduction
[6, 7, 13] or inference is performed on multiple scales of an image pyramid and predictions are combined using non-maximum suppression [6, 7, 2, 31].

While these architectural innovations have significantly helped to improve object detection, many important issues related to training remain unaddressed:

• Is it critical to upsample images for obtaining good performance for object detection? Even though the typical size of images in detection datasets is 480x640, why is it a common practice to up-sample them to 800x1200? Can we pre-train CNNs with smaller strides on low resolution images from ImageNet and then fine-tune them on detection datasets for detecting small object instances?

• When fine-tuning an object detector from a pre-trained image classification model, should the resolution of the training object instances be restricted to a tight range (from 64x64 to 256x256) after appropriately re-scaling the input images, or should all object resolutions (from 16x16 to 800x1000, in the case of COCO) participate in training after up-sampling input images?

We design controlled experiments on ImageNet and COCO to seek answers to these questions. In Section 3, we study the effect of scale variation by examining the performance of existing networks for ImageNet classification when images of different scales are provided as input. We also make minor modifications to the CNN architecture for classifying images of different scales. These experiments reveal the importance of up-sampling for small object detection. To analyze the effect of scale variation on object detection, we train and compare the performance of scale-specific and scale-invariant detector designs in Section 5. For scale-specific detectors, variation in scale is handled by training separate detectors, one for each scale range. Moreover, training the detector on object instances of a scale similar to that of the pre-trained classification network helps to reduce the domain shift for the detector backbone. But scale-specific designs also reduce the number of training samples per scale, which degrades performance. On the other hand, training a single object detector with all training samples makes the learning task significantly harder, because the network needs to learn filters for detecting object instances over a wide range of scales.

Based on these observations, in Section 6 we present a novel training paradigm, which we refer to as Scale Normalization for Image Pyramids (SNIP), that benefits from reducing scale variation during training without paying the penalty of reduced training samples. Scale invariance is achieved using an image pyramid (instead of a scale-invariant detector), which contains normalized input representations of object instances in one of the scales in the image pyramid. To minimize the domain shift for the backbone CNN, we only back-propagate gradients for RoIs/anchors that have a resolution close to that of the pre-training dataset. Since we train on each scale in the pyramid with the above constraint, SNIP effectively utilizes all the object instances available during training. The proposed approach is generic and can be plugged into the training pipeline of different problems like instance segmentation, pose estimation and spatio-temporal action detection, wherever the "objects" of interest manifest large scale variations.

Contrary to the popular belief that deep neural networks can learn to cope with large variations in scale given enough training data, we show that SNIP offers significant improvements (3.5%) over traditional object detection training paradigms. Our ensemble of Image Pyramid Networks with a Deformable-RFCN backbone obtains an mAP of 69.7% at 50% overlap, which is an improvement of 7.4% over the state-of-the-art on the COCO dataset.

2. Related Work

Scale space theory [34, 24] advocates learning representations that are invariant to scale, and the theory has been applied to many problems in the history of computer vision [4, 28, 26, 19, 12, 5, 21]. For problems like object detection, pose estimation and instance segmentation, learning scale-invariant representations is critical for recognizing and localizing objects. Many solutions have been proposed to detect objects at multiple scales.

The deeper layers of modern CNNs have large strides (32 pixels) that lead to a very coarse representation of the input image, which makes small object detection very challenging. To address this problem, modern object detectors [30, 6, 5] employ dilated/atrous convolutions to increase the resolution of the feature map. Dilated/deformable convolutions also preserve the weights and receptive fields of the pre-trained network and do not suffer from degraded performance on large objects. Up-sampling the image by a factor of 1.5 to 2 during training and up to 4 during inference is also a common practice to increase the final feature map resolution [7, 6, 13]. Since feature maps of layers closer to the input are of higher resolution and often contain complementary information (w.r.t. conv5), these features are either combined with shallower layers (like conv4, conv3) [21, 29, 1] or independent predictions are made at layers of different resolutions [35, 25, 3]. Methods like SDP [35], SSH [27] or MS-CNN [3], which make independent predictions at different layers, also ensure that smaller objects are trained on higher resolution layers (like conv3) while larger objects are trained on lower resolution layers (like conv5). This approach offers better resolution at the cost of high-level semantic features, which can hurt performance.
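As a toy illustration of such size-based assignment, the sketch below routes each ground-truth box to a feature level according to the square root of its area; the thresholds, level count and function names are illustrative assumptions and are not taken from any of the cited methods.

```python
import math

# Illustrative only: route each box to a feature level by its size, in the
# spirit of methods that train small objects on finer layers (e.g. conv3)
# and large objects on coarser layers (e.g. conv5). The thresholds are
# placeholders chosen for the example.
def assign_level(box_w, box_h, thresholds=(64, 128)):
    """Return 0 (finest), 1, or 2 (coarsest) based on sqrt(box area)."""
    size = math.sqrt(box_w * box_h)
    for level, t in enumerate(thresholds):
        if size < t:
            return level
    return len(thresholds)

boxes = [(20, 30), (90, 110), (400, 350)]       # (width, height) in pixels
print([assign_level(w, h) for w, h in boxes])   # -> [0, 1, 2]
```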
Figure 2. The same layer convolutional features at different scales of the image are different and map to different semantic regions in the image at different scales.

Figure 3. Both CNN-B and CNN-B-FT are provided an upsampled low resolution image as input. CNN-S is provided a low resolution image as input. CNN-B is trained on high resolution images. CNN-S is trained on low resolution images. CNN-B-FT is pre-trained on high resolution images and fine-tuned on upsampled low-resolution images.
Methods like FPN, Mask-RCNN and RetinaNet [21, 11, 22], which use a pyramidal representation and combine features of shallow layers with deeper layers, at least have access to higher-level semantic information. However, if the size of an object is 25x25 pixels, then even an up-sampling factor of 2 during training will scale the object to only 50x50 pixels. Note that the network is typically pre-trained on images of resolution 224x224. Therefore, the high-level semantic features (at conv5) generated even by feature pyramid networks will not be useful for classifying small objects (a similar argument can be made for large objects in high resolution images). Hence, combining them with features from shallow layers would not be good for detecting small objects; see Fig. 2. Although feature pyramids efficiently exploit features from all the layers in the network, they are not an attractive alternative to an image pyramid for detecting very small/large objects.

Recently, a pyramidal approach was proposed for detecting faces [15], where the gradients of all objects were back-propagated after max-pooling the responses from each scale. Different filters were used in the classification layers for faces at different scales. This approach has limitations for object detection because training data per class in object detection is limited and the variations in appearance, pose etc. are much larger compared to face detection. We, on the other hand, selectively back-propagate gradients for each scale and use the same filters irrespective of the scale of the object, thereby making better use of the training data. We observe that adding scale-specific filters in R-FCN for each class hurts performance for object detection. In [31], an image pyramid was generated and maxout [10] was used to select features from a pair of scales closer to the resolution of the pre-trained dataset during inference; however, standard multi-scale training (described in Section 5) was used.

3. Image Classification at Multiple Scales

In this section we study the effect of domain shift, which is introduced when different resolutions of images are provided as input during training and testing. We perform this analysis because state-of-the-art detectors are typically trained at a resolution of 800x1200 pixels (the original image resolution is typically 480x640), but inference is performed at a higher resolution of 1400x2000 for detecting small objects [7, 6, 2].

Firstly, we obtain images at different resolutions, 48x48, 64x64, 80x80, 96x96 and 128x128, by down-sampling the original ImageNet database. These are then up-sampled to 224x224 and provided as input to a CNN architecture trained on 224x224 size images, referred to as CNN-B (see Fig. 3). Fig. 4 (a) shows the top-1 accuracy of CNN-B with a ResNet-101 backbone. We observe that as the difference in resolution between training and testing images increases, so does the drop in performance. Hence, testing on resolutions on which the network was not trained is clearly sub-optimal, at least for image classification.

Based on this observation, a simple solution for improving the performance of detectors on smaller objects is to pre-train classification networks with a different stride on ImageNet. After all, the network architectures which obtain the best performance on CIFAR10 [17] (which contains small objects) are different from those used for ImageNet. The first convolution layer in ImageNet classification networks has a stride of 2, followed by a max pooling layer of stride 2, which can potentially wipe out most of the image signal present in a small object. Therefore, we train ResNet-101 with a stride of 1 and 3x3 convolutions in the first layer for 48x48 images (CNN-S, see Fig. 3), a typical architecture used for CIFAR. Similarly, for 96x96 size images, we use a kernel of size 5x5 and stride of 2.
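A minimal sketch of this kind of stem modification is given below, assuming torchvision's ResNet-101 implementation rather than the paper's own training code: the 7x7, stride-2 first convolution is replaced by a 3x3, stride-1 convolution and the initial max pooling is removed, so that a 48x48 input is not immediately reduced.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet101

# Sketch only: adapt a standard ResNet-101 stem for low-resolution (48x48)
# inputs, in the spirit of the CNN-S architecture described above.
model = resnet101()

# Replace the 7x7 stride-2 convolution with a 3x3 stride-1 convolution and
# drop the stride-2 max pooling, so the early layers no longer discard most
# of the signal in a small image. The remaining layers are left unchanged.
model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
model.maxpool = nn.Identity()

x = torch.randn(2, 3, 48, 48)      # a batch of 48x48 images
print(model(x).shape)              # torch.Size([2, 1000])
```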
Figure 4. All figures report accuracy on the validation set of the ImageNet classification dataset. We upsample images of resolution 48, 64, 80, etc. and plot the top-1 accuracy of the pre-trained ResNet-101 classifier in figure (a). Figures (b) and (c) show results for different CNNs when the original image resolution is 48 and 96 pixels respectively.
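The protocol behind Fig. 4 (a) amounts to down-sampling a validation image, up-sampling it back to 224x224, and classifying it with a network trained on 224x224 images (CNN-B). The sketch below keeps only that skeleton; the interpolation mode, the absence of ImageNet normalisation and the randomly initialised torchvision model are assumptions made to keep the example self-contained, not the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet101

# Sketch of the resolution-mismatch test: degrade an image to a low
# resolution, up-sample it back to 224x224, and classify it with a network
# trained on 224x224 inputs.
cnn_b = resnet101().eval()            # stands in for a 224x224-trained classifier
image = torch.randn(1, 3, 224, 224)   # placeholder for a preprocessed image

for res in (48, 64, 80, 96, 128):
    low = F.interpolate(image, size=res, mode='bilinear', align_corners=False)
    up = F.interpolate(low, size=224, mode='bilinear', align_corners=False)
    with torch.no_grad():
        pred = cnn_b(up).argmax(dim=1)
    print(res, pred.item())           # compare predictions across input resolutions
```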
Standard data augmentation techniques such as random cropping and color augmentation (with color augmentation disabled after 70 epochs) are used to train these networks (CNN-S). As seen in Fig. 4, these networks perform significantly better than CNN-B. Therefore, it is tempting to pre-train classification networks with different architectures for low resolution images and use them in object detection for low resolution objects.

Yet another simple solution for small object detection would be to fine-tune CNN-B on up-sampled low resolution images to yield CNN-B-FT (Fig. 3). The performance of CNN-B-FT on up-sampled low-resolution images is better than that of CNN-S (Fig. 4). This result empirically demonstrates that the filters learned on high-resolution images can be useful for recognizing low-resolution images as well. Therefore, instead of reducing the stride by 2, it is better to up-sample images 2 times and then fine-tune the network pre-trained on high-resolution images.

While training object detectors, we can either use different network architectures for classifying objects of different resolutions or use a single architecture for all resolutions. Since pre-training on ImageNet (or other large classification datasets) is beneficial, and filters learned on larger object instances help to classify smaller object instances, upsampling images and using the network pre-trained on high resolution images should be better than a specialized network for classifying small objects. Fortunately, existing object detectors up-sample images for detecting smaller objects instead of using a different architecture. Our analysis supports this practice and compares it with other alternatives to emphasize the difference.

4. Background

In the next section, we discuss a few baselines for detecting small objects. Here we briefly describe the Deformable-RFCN [7] detector, which will be used in the following analysis. D-RFCN obtains the best single-model results on COCO and is publicly available, so we use this detector.

Deformable-RFCN is based on the R-FCN detector [6]. It adds deformable convolutions in the conv5 layers to adaptively change the receptive field of the network for creating scale-invariant representations for objects of different scales. At each convolutional feature map, a lightweight network predicts offsets on the 2D grid, which are the spatial locations at which the spatial sub-filters of the convolution kernel are applied. The second change is in Position Sensitive RoI Pooling: instead of pooling from a fixed set of bins on the convolutional feature map (for an RoI), a network predicts offsets for each position-sensitive filter (depending on the feature map) on which PSRoI-Pooling is performed.

For our experiments, proposals are extracted at a single resolution (after upsampling) of 800x1200 using a publicly available Deformable-RFCN detector. It has a ResNet-101 backbone and is trained at a resolution of 800x1200. Five anchor scales are used in RPN for generating proposals [2]. For classifying these proposals, we use Deformable-RFCN with a ResNet-50 backbone without the deformable Position Sensitive RoIPooling. We use Position Sensitive RoIPooling with bilinear interpolation, as it reduces the number of filters by a factor of 3 in the last layer. NMS with a threshold of 0.3 is used. Not performing end-to-end training along with RPN, using ResNet-50 and eliminating deformable PSRoI filters reduces training time by a factor of 3 and also saves GPU memory.

5. Data Variation or Correct Scale?

The study in Section 3 confirms that a difference in resolution between the training and testing phases leads to a significant drop in performance. Unfortunately, this difference in resolution is part of the current object detection pipeline: due to GPU memory constraints, training is performed at a lower resolution (800x1200) than testing (1400x2000) (note that the original resolution is typically 640x480). This section analyses the effect of image resolution, the scale of object instances and variation in data on the performance of an object detector. We train detectors under different settings and evaluate them on 1400x2000 images for detecting small objects (less than 32x32 pixels in the COCO dataset) only, to tease apart the factors that affect performance. The results are reported in Table 1. We start by training detectors that use all the object instances on two different resolutions.
Figure 5. Different approaches for providing input for training the classifier of a proposal based detector.
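The per-scale selection that SNIP performs, keeping only instances whose resolution falls within a valid range for a given pyramid scale during training and discarding out-of-range detections at inference, can be sketched as follows. The range values and helper names are illustrative placeholders, not the values used in the paper.

```python
# Illustrative sketch of SNIP-style selection: at each image-pyramid scale,
# only instances whose size falls inside a valid range contribute to
# training (gradients for the rest are not back-propagated), and detections
# outside the range are discarded at inference. Ranges are placeholders.
VALID_RANGE = {
    (480, 800):   (120, 10_000),   # coarse scale: keep only large instances
    (800, 1200):  (40, 160),
    (1400, 2000): (0, 80),         # fine scale: keep only small instances
}

def is_valid(box_size, scale):
    lo, hi = VALID_RANGE[scale]
    return lo <= box_size < hi

def select_rois(rois, scale):
    """Keep RoIs whose size is valid for this pyramid scale; the rest are
    ignored so that their gradients are not back-propagated."""
    return [r for r in rois if is_valid(r['size'], scale)]

rois = [{'size': 30}, {'size': 100}, {'size': 500}]
print(select_rois(rois, (1400, 2000)))   # -> [{'size': 30}]
```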
RoIs/anchors that have an overlap greater than 0.3 with an invalid ground truth box are excluded during training. During inference, we generate proposals using RPN for each resolution and classify them independently at each resolution, as shown in Fig. 6. Similar to training, we do not select detections (not proposals) which fall outside a specified range at each resolution. After classification and bounding-box regression, we use Soft-NMS [2] to combine detections from multiple resolutions and obtain the final detection boxes (refer to Fig. 6).

The resolution of the RoIs after pooling matches the pre-trained network, so it is easier for the network to learn during fine-tuning. For methods like R-FCN, which divide RoIs into sub-parts and use position-sensitive filters, this becomes even more important. For example, if the size of an RoI is 48 pixels (3 pixels in the conv5 feature map) and there are 7 filters along each axis, the positional correspondence between features and filters would be lost.

6.2. Sampling Sub-Images

Training on high resolution images with deep networks like ResNet-101 or DPN-92 [38] requires more GPU memory. Therefore, we crop images so that they fit in GPU memory. Our aim is to generate the minimum number of chips (sub-images) of size 1000x1000 which cover all the small objects in the image. This helps in accelerating training, as no computation is needed where there are no small objects. For this, we generate 50 randomly positioned chips of size 1000x1000 per image. The chip which covers the maximum number of objects is selected and added to our set of training images. Until all objects in the image are covered, we repeat the sampling and selection process on the remaining objects. Since chips are randomly generated and proposal boxes often have a side on the image boundary, we snap the chips to the image boundaries to speed up the sampling process. We found that, on average, 1.7 chips of size 1000x1000 are generated for images of size 1400x2000. This sampling step is not needed when the image resolution is 800x1200 or 480x640, or when an image does not contain small objects. Random cropping is not the reason why we observe an improvement in performance for our detector: to verify this, we trained ResNet-50 (as it requires less memory) using un-cropped high-resolution images (1400x2000) and did not observe any change in mAP.

7. Datasets and Evaluation

We evaluate our method on the COCO dataset. COCO contains 123,000 images for training, and evaluation is performed on 20,288 images in test-dev. Since recall for proposals is not provided by the evaluation server on COCO, we train on 118,000 images and report recall on the remaining 5,000 images (commonly referred to as the minival set). Unless specifically mentioned, the area of small objects is less than 32x32, medium objects range from 32x32 to 96x96 and large objects are greater than 96x96.

7.1. Training Details

We train Deformable-RFCN [7] as our detector with 3 resolutions, (480, 800), (800, 1200) and (1400, 2000), where the first value is for the shorter side of the image and the second one is the limit on the maximum size of a side. Training is performed for 7 epochs for the classifier, while RPN is trained for 6 epochs. Although it is possible to combine RPN and RCN using alternating training, which leads to a slight improvement in accuracy [21], we train separate models for RPN and RCN and evaluate their performance.
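The greedy chip sampling of Section 6.2 can be written as a short procedure. The sketch below is a simplification under stated assumptions: it tests coverage by box-center containment, omits the boundary snapping described above, and uses made-up image and box sizes.

```python
import random

# Simplified sketch of the chip sampling in Section 6.2: repeatedly draw
# random 1000x1000 chips and greedily keep the chip that covers the most
# still-uncovered objects, until every object is covered.
CHIP, W, H = 1000, 2000, 1400   # chip side, image width, image height

def random_chip():
    x = random.randint(0, W - CHIP)
    y = random.randint(0, H - CHIP)
    return (x, y, x + CHIP, y + CHIP)

def covers(chip, box):
    # Simplified criterion: the box center lies inside the chip.
    x0, y0, x1, y1 = chip
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    return x0 <= cx <= x1 and y0 <= cy <= y1

def sample_chips(boxes, n_candidates=50):
    remaining, chips = list(boxes), []
    while remaining:
        candidates = [random_chip() for _ in range(n_candidates)]
        best = max(candidates, key=lambda c: sum(covers(c, b) for b in remaining))
        covered = [b for b in remaining if covers(best, b)]
        if not covered:          # none of the random chips hit an object; retry
            continue
        chips.append(best)
        remaining = [b for b in remaining if b not in covered]
    return chips

boxes = [(100, 100, 150, 160), (1800, 1200, 1850, 1260)]   # (x0, y0, x1, y1)
print(len(sample_chips(boxes)))   # typically 2 chips for these far-apart boxes
```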
Method          AP    APS   APM   APL
Single scale    34.5  16.3  37.2  47.6
MS Test         35.9  19.5  37.3  48.5
MS Train/Test   35.6  19.5  37.5  47.3
SNIP            37.8  21.4  40.4  50.1
Table 2. MS denotes multi-scale. Single scale is (800,1200).

Method       AR    AR50  AR75  0-25  25-50  50-100
Baseline     57.6  88.7  67.9  67.5  90.1   95.6
+ Improved   61.3  89.2  69.8  68.1  91.0   96.7
+ SNIP       64.0  92.1  74.7  74.4  95.1   98.0
DPN-92       65.7  92.8  76.3  76.7  95.7   98.2
Table 3. For individual ranges (like 0-25), recall at 50% overlap is reported, because minor localization errors can be fixed in the second stage. The first three rows use ResNet-50 as the backbone. Recall is for 900 proposals, as the top 300 are taken from each scale.
... assigns an anchor as positive only if overlap with a ground ... the maximum overlap with ground truth bounding box as positive.
Method                  Backbone                                      AP    AP50  AP75  APS   APM   APL
IPN, No SNIP            DPN-98 (3 scales, DPN-92 proposals)           41.2  63.5  45.9  25.7  43.9  52.8
IPN, No SNIP in RPN     DPN-98 (3 scales, DPN-92 proposals)           44.2  65.6  49.7  27.4  47.8  55.8
IPN, With SNIP          DPN-98 (3 scales, DPN-92 proposals)           44.7  66.6  50.2  28.5  47.8  55.9
D-RFCN [7, 2]           ResNet-101                                    38.4  60.1  41.6  18.5  41.6  52.5
FCIS [36]               Ensemble (seg)                                39.7  61.6  42.6  22.3  43.2  52.9
Mask-RCNN [11]          ResNext-101 (seg)                             39.8  62.3  43.4  22.1  43.2  51.2
D-RFCN [7, 2]           ResNet-101 (6 scales)                         40.9  62.8  45.0  23.3  43.6  53.3
G-RMI [16]              Ensemble                                      41.6  62.3  45.6  24.0  43.9  55.2
IPN (D-RFCN Detector)   ResNet-101 (3 scales, ResNet-101 proposals)   43.4  65.5  48.4  27.2  46.5  54.9
IPN (D-RFCN Detector)   DPN-92 (3 scales, DPN-92 proposals)           43.8  66.1  49.0  27.3  46.9  55.5
IPN (D-RFCN Detector)   DPN-98 (3 scales, DPN-92 proposals)           44.7  66.6  50.2  28.5  47.8  55.9
IPN (D-RFCN Detector)   DPN-98 (3 scales, DPN-92 proposals, flip)     45.7  67.3  51.1  29.3  48.8  57.1
IPN (D-RFCN Detector)   Ensemble (DPN-92 proposals)                   48.3  69.7  53.7  31.4  51.6  60.7
Table 4. Comparison of IPN with state-of-the-art methods. (seg) denotes that segmentation masks were also used for training.
For RPN, a baseline with the ResNet-50 network was trained on the conv4 feature map. The top 300 proposals are selected from each scale and all these 900 proposals are used for computing recall. Average recall (averaged over multiple overlap thresholds, just like mAP) is better for our improved RPN, as seen in Table 3. This is because for large objects (> 100 pixels), average recall improves by 10% (not shown in the table) for the improved baseline. Although the improved version improves average recall, it does not have much effect at 50% overlap. Recall at 50% overlap is most important for object proposals, because bounding box regression can correct minor localization errors, but if an object is not covered at all by proposals, it will clearly lead to a miss. Recall for objects greater than 100 pixels at 50% overlap is already close to 100%, so improving average recall for large objects is not that valuable for a detector. Note that SNIP improves recall at 50% overlap by 2.9% compared to our improved baseline. For objects smaller than 25 pixels, the improvement in recall is 6.3%. Using a stronger classification network like DPN-92 also improves recall. In the last two rows of Table 4, we perform an ablation study with our best model, which uses a DPN-98 classifier and DPN-92 proposals. If we train our improved RPN without SNIP, mAP drops by 1.1% on small objects and 0.5% overall. Note that the AP of large objects is not affected, as we still use the classification model trained with SNIP.

Finally, we compare IPN with state-of-the-art detectors in Table 4. For these experiments, we use the deformable position-sensitive filters and Soft-NMS. Compared to the single-scale deformable R-FCN baseline shown in the first line of Table 4, IPN improves overall results by 5% and results for small objects by 8.7%! This shows the importance of an image pyramid for object detection. Compared to the best single-model method (which uses 6 instead of 3 scales in IPN and is also trained end-to-end) based on ResNet-101, IPN improves performance by 2.5% overall and 3.9% for small objects. We observe that using better backbone architectures further improves the performance of the detector. When SNIP is not used for both the proposals and the classifier (MST is used at the same scales), mAP drops by 3.5% for the DPN-98 classifier, as shown in the first three rows. Other than the 3 networks mentioned in Table 4, we also trained a DPN-92 and ResNet-101 network which was trained jointly. Classification scores were averaged while bounding-box regression was only performed on the DPN-92 network. This network obtained an mAP of 45.2% after flipping. For the ensemble, DPN-92 proposals are used for all the networks (including ResNet-101). Since proposals are shared across all networks, we average the scores and box predictions for each RoI. During flipping we average the detection scores and bounding box predictions. Finally, Soft-NMS is used to obtain the final detections. Iterative bounding-box regression is not used. All pre-trained models are trained on ImageNet-1000 and COCO segmentation masks are not used. Still, our overall mAP is 6.7% better. At a 50% overlap and for small objects, it is 7.4% better. For results shown with a single model, we improve the state-of-the-art by 4.9%. On 100 images, it takes 90 seconds for IPN to perform detection on a Titan X GPU using a ResNet-101 backbone. Speed can be improved with end-to-end training.

8. Conclusion

We presented an analysis of different techniques for recognizing and detecting objects under extreme scale variation, which exposed shortcomings of the current object detection training pipeline. Based on the analysis, a training scheme (SNIP) was proposed to tackle the wide scale spectrum of object instances which participate in training and to reduce the domain shift for the pre-trained classification network. Compared to a single-scale detector, SNIP obtains a 5% improvement in mAP, which highlights the importance of scale and image pyramids in object detection.
References

[1] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2874–2883, 2016.
[2] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-NMS – improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
[3] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision, pages 354–370. Springer, 2016.
[4] J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):679–698, 1986.
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
[6] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pages 379–387, 2016.
[7] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. arXiv preprint arXiv:1703.06211, 2017.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[9] S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware CNN model. In IEEE International Conference on Computer Vision (ICCV), December 2015.
[10] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.
[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. arXiv preprint arXiv:1703.06870, 2017.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361. Springer, 2014.
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[14] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
[15] P. Hu and D. Ramanan. Finding tiny faces. arXiv preprint arXiv:1612.04402, 2016.
[16] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016.
[17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[19] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 2169–2178. IEEE, 2006.
[20] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan. Scale-aware fast R-CNN for pedestrian detection. arXiv preprint arXiv:1510.08160, 2015.
[21] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. arXiv preprint arXiv:1612.03144, 2016.
[22] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017.
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[24] T. Lindeberg. Scale-Space Theory in Computer Vision. 1993.
[25] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[26] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[27] M. Najibi, P. Samangouei, R. Chellappa, and L. Davis. SSH: Single stage headless face detector. In Proceedings of the International Conference on Computer Vision (ICCV), 2017.
[28] P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629–639, 1990.
[29] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In European Conference on Computer Vision, pages 75–91. Springer, 2016.
[30] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[31] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7):1476–1481, 2017.
[32] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.
[33] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851, 2016.
[34] A. Witkin. Scale-space filtering: A new approach to multi-scale description. In Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '84, volume 9, pages 150–153. IEEE, 1984.
[35] F. Yang, W. Choi, and Y. Lin. Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2129–2137, 2016.
[36] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[37] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[38] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. arXiv preprint arXiv:1707.01629, 2017.
[39] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár. A multipath network for object detection. arXiv preprint arXiv:1604.02135, 2016.
[40] X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou, B. Yang, Z. Wang, et al. Crafting GBD-Net for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.