
An Analysis of Scale Invariance in Object Detection – SNIP

Bharat Singh Larry S. Davis


University of Maryland, College Park
{bharat,lsd}@cs.umd.edu

Abstract

An analysis of different techniques for recognizing and detecting objects under extreme scale variation is presented. Scale specific and scale invariant design of detectors are compared by training them with different configurations of input data. To examine if upsampling images is necessary for detecting small objects, we evaluate the performance of different network architectures for classifying small objects on ImageNet. Based on this analysis, we propose a deep end-to-end trainable Image Pyramid Network for object detection which operates on the same image scales during training and inference. Since small and large objects are difficult to recognize at smaller and larger scales respectively, we present a novel training scheme called Scale Normalization for Image Pyramids (SNIP) which selectively back-propagates the gradients of object instances of different sizes as a function of the image scale. On the COCO dataset, our single model performance is 45.7% and an ensemble of 3 networks obtains an mAP of 48.3%. We use ImageNet-1000 pre-trained models and only train with bounding box supervision. Our submission won the Best Student Entry in the COCO 2017 challenge. Code will be made available at http://bit.ly/2yXVg4c.

Figure 1. Fraction of RoIs in the dataset vs. scale of RoIs relative to the image.

1. Introduction

Deep learning has fundamentally changed how computers perform image classification and object detection. In less than five years since AlexNet [18] was proposed, the top-5 error on ImageNet classification [8] has dropped from 15% to 2% [14]. This is super-human level performance for image classification with 1000 classes. On the other hand, the mAP of the best performing detector [16] (which is only trained to detect 80 classes) on COCO [23] is only 62%, even at 50% overlap. Why is object detection so much harder than image classification?

Large scale variation across object instances, and especially the challenge of detecting very small objects, stands out as one of the factors behind the difference in performance. Interestingly, the median scales of object instances relative to the image in ImageNet (classification) and COCO (detection) are 0.554 and 0.106 respectively. Therefore, most object instances in COCO are smaller than 1% of the image area! To make matters worse, the scales of the smallest and largest 10% of object instances in COCO are 0.024 and 0.472 respectively (a scale variation of almost 20 times); see Fig. 1. This variation in scale which a detector needs to handle is enormous and represents an extreme challenge to the scale invariance properties of convolutional neural networks. Moreover, differences in the scale of object instances between classification and detection datasets also result in a large domain shift while fine-tuning from a pre-trained classification network. In this paper, we first provide evidence of these problems and then propose a training scheme called Scale Normalization for Image Pyramids which leads to a state-of-the-art object detector on COCO.

To alleviate the problems arising from scale variation and small object instances, multiple solutions have been proposed. For example, features from the layers near to the input, referred to as shallow(er) layers, are combined with deeper layers for detecting small object instances [21, 33, 1, 11, 25], dilated/deformable convolution is used to increase receptive fields for detecting large objects [30, 6, 37, 7], independent predictions at layers of different resolutions are used to capture object instances of different scales [35, 3, 20], context is employed for disambiguation [39, 40, 9],
training is performed over a range of scales [6, 7, 13], or inference is performed on multiple scales of an image pyramid and predictions are combined using non-maximum suppression [6, 7, 2, 31].

While these architectural innovations have significantly helped to improve object detection, many important issues related to training remain unaddressed:

• Is it critical to upsample images for obtaining good performance for object detection? Even though the typical size of images in detection datasets is 480x640, why is it a common practice to up-sample them to 800x1200? Can we pre-train CNNs with smaller strides on low resolution images from ImageNet and then fine-tune them on detection datasets for detecting small object instances?

• When fine-tuning an object detector from a pre-trained image classification model, should the resolution of the training object instances be restricted to a tight range (from 64x64 to 256x256) after appropriately re-scaling the input images, or should all object resolutions (from 16x16 to 800x1000, in the case of COCO) participate in training after up-sampling input images?

We design controlled experiments on ImageNet and COCO to seek answers to these questions. In Section 3, we study the effect of scale variation by examining the performance of existing networks for ImageNet classification when images of different scales are provided as input. We also make minor modifications to the CNN architecture for classifying images of different scales. These experiments reveal the importance of up-sampling for small object detection. To analyze the effect of scale variation on object detection, we train and compare the performance of scale-specific and scale-invariant detector designs in Section 5. For scale-specific detectors, variation in scale is handled by training separate detectors, one for each scale range. Moreover, training the detector on object instances of similar scale as the pre-trained classification network helps to reduce the domain shift for the detector backbone. But scale-specific designs also reduce the number of training samples per scale, which degrades performance. On the other hand, training a single object detector with all training samples makes the learning task significantly harder, because the network needs to learn filters for detecting object instances over a wide range of scales.

Based on these observations, in Section 6 we present a novel training paradigm, which we refer to as Scale Normalization for Image Pyramids (SNIP), that benefits from reducing scale variation during training without paying the penalty of reduced training samples. Scale invariance is achieved using an image pyramid (instead of a scale-invariant detector), which contains normalized input representations of object instances at one of the scales in the image pyramid. To minimize the domain shift for the backbone CNN, we only back-propagate gradients for RoIs/anchors that have a resolution close to that of the pre-training dataset. Since we train on each scale in the pyramid with this constraint, SNIP effectively utilizes all the object instances available during training. The proposed approach is generic and can be plugged into the training pipeline of different problems like instance segmentation, pose estimation and spatio-temporal action detection - wherever the "objects" of interest manifest large scale variations. Contrary to the popular belief that deep neural networks can learn to cope with large variations in scale given enough training data, we show that SNIP offers significant improvements (3.5%) over traditional object detection training paradigms. Our ensemble of Image Pyramid Networks with a Deformable-RFCN backbone obtains an mAP of 69.7% at 50% overlap, which is an improvement of 7.4% over the state-of-the-art on the COCO dataset.

2. Related Work

Scale space theory [34, 24] advocates learning representations that are invariant to scale, and the theory has been applied to many problems in the history of computer vision [4, 28, 26, 19, 12, 5, 21]. For problems like object detection, pose estimation and instance segmentation, learning scale-invariant representations is critical for recognizing and localizing objects. To detect objects at multiple scales, many solutions have been proposed.

The deeper layers of modern CNNs have large strides (32 pixels) that lead to a very coarse representation of the input image, which makes small object detection very challenging. To address this problem, modern object detectors [30, 6, 5] employ dilated/atrous convolutions to increase the resolution of the feature map. Dilated/deformable convolutions also preserve the weights and receptive fields of the pre-trained network and do not suffer from degraded performance on large objects. Up-sampling the image by a factor of 1.5 to 2 during training and up to 4 during inference is also a common practice to increase the final feature map resolution [7, 6, 13]. Since feature maps of layers closer to the input are of higher resolution and often contain complementary information (w.r.t. conv5), these features are either combined with those of shallower layers (like conv4, conv3) [21, 29, 1] or independent predictions are made at layers of different resolutions [35, 25, 3]. Methods like SDP [35], SSH [27] or MS-CNN [3], which make independent predictions at different layers, also ensure that smaller objects are trained on higher resolution layers (like conv3) while larger objects are trained on lower resolution layers (like conv5). This approach offers better resolution at the cost of high-level semantic features, which can hurt performance.
Figure 2. The same layer convolutional features at different scales of the image are different and map to different semantic regions in the image at different scales.

Figure 3. Both CNN-B and CNN-B-FT are provided an upsampled low resolution image as input. CNN-S is provided a low resolution image as input. CNN-B is trained on high resolution images. CNN-S is trained on low resolution images. CNN-B-FT is pre-trained on high resolution images and fine-tuned on upsampled low-resolution images.

Methods like FPN, Mask-RCNN and RetinaNet [21, 11, 22], which use a pyramidal representation and combine features of shallow layers with deeper layers, at least have access to higher level semantic information. However, if the size of an object is 25x25 pixels, then even an up-sampling factor of 2 during training will scale the object to only 50x50 pixels. Note that the network is typically pre-trained on images of resolution 224x224. Therefore, the high level semantic features (at conv5) generated even by feature pyramid networks will not be useful for classifying small objects (a similar argument can be made for large objects in high resolution images). Hence, combining them with features from shallow layers would not be good for detecting small objects; see Fig. 2. Although feature pyramids efficiently exploit features from all the layers in the network, they are not an attractive alternative to an image pyramid for detecting very small/large objects.

Recently, a pyramidal approach was proposed for detecting faces [15] where the gradients of all objects were back-propagated after max-pooling the responses from each scale. Different filters were used in the classification layers for faces at different scales. This approach has limitations for object detection because training data per class in object detection is limited and the variations in appearance, pose etc. are much larger compared to face detection. We, on the other hand, selectively back-propagate gradients for each scale and use the same filters irrespective of the scale of the object, thereby making better use of training data. We observe that adding scale-specific filters in R-FCN for each class hurts performance for object detection. In [31], an image pyramid was generated and maxout [10] was used to select features from a pair of scales closer to the resolution of the pre-training dataset during inference; however, standard multi-scale training (described in Section 5) was used.

3. Image Classification at Multiple Scales

In this section we study the effect of domain shift, which is introduced when different resolutions of images are provided as input during training and testing. We perform this analysis because state-of-the-art detectors are typically trained at a resolution of 800x1200 pixels (the original image resolution is typically 480x640), but inference is performed at a higher resolution of 1400x2000 for detecting small objects [7, 6, 2].

First, we obtain images at different resolutions, 48x48, 64x64, 80x80, 96x96 and 128x128, by down-sampling the original ImageNet database. These are then up-sampled to 224x224 and provided as input to a CNN architecture trained on 224x224 size images, referred to as CNN-B (see Fig. 3). Fig. 4 (a) shows the top-1 accuracy of CNN-B with a ResNet-101 backbone. We observe that as the difference in resolution between training and testing images increases, so does the drop in performance. Hence, testing on resolutions on which the network was not trained is clearly sub-optimal, at least for image classification.
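As a concrete illustration, the following is a minimal sketch of this evaluation protocol, not the original experimental code: a torchvision ResNet-101 pre-trained at 224x224 stands in for CNN-B, and `val_loader_at` is a hypothetical helper that builds an ImageNet validation loader with the given transform.

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet101

# CNN-B protocol sketch: ImageNet val images are down-sampled to a low
# resolution (48x48, 64x64, ...) and then up-sampled back to 224x224 before
# being fed to a classifier that was trained only on 224x224 images.
def make_transform(low_res):
    return T.Compose([
        T.Resize((low_res, low_res)),   # simulate a low-resolution input
        T.Resize((224, 224)),           # up-sample back to the pre-training resolution
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

@torch.no_grad()
def top1_accuracy(model, loader, device="cuda"):
    model.eval()
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds.cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total

cnn_b = resnet101(pretrained=True).cuda()            # trained on 224x224 images only
for low_res in (48, 64, 80, 96, 128):
    loader = val_loader_at(make_transform(low_res))  # hypothetical data loader
    print(low_res, top1_accuracy(cnn_b, loader))
```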
Figure 4. All figures report accuracy on the validation set of the ImageNet classification dataset. We upsample images of resolution 48, 64, 80, etc. and plot the top-1 accuracy of the pre-trained ResNet-101 classifier in figure (a). Figures (b) and (c) show results for different CNNs when the original image resolution is 48 and 96 pixels respectively.

Based on this observation, a simple solution for improving the performance of detectors on smaller objects is to pre-train classification networks with a different stride on ImageNet. After all, the network architectures which obtain the best performance on CIFAR-10 [17] (which contains small objects) are different from those for ImageNet. The first convolution layer in ImageNet classification networks has a stride of 2, followed by a max pooling layer of stride 2, which can potentially wipe out most of the image signal present in a small object. Therefore, we train ResNet-101 with a stride of 1 and 3x3 convolutions in the first layer for 48x48 images (CNN-S, see Fig. 3), a typical architecture used for CIFAR. Similarly, for 96x96 size images, we use a kernel of size 5x5 and a stride of 2. Standard data augmentation techniques such as random cropping, color augmentation and disabling color augmentation after 70 epochs are used to train these networks. As seen in Fig. 4, these networks (CNN-S) perform significantly better than CNN-B. Therefore, it is tempting to pre-train classification networks with different architectures for low resolution images and use them for object detection on low resolution objects.
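The stem change described above can be sketched as follows. This is an illustration, not the released CNN-S model; whether the subsequent stride-2 max pooling is retained is not stated in the text, so it is dropped here, as CIFAR-style networks typically do.

```python
import torch.nn as nn
from torchvision.models import resnet101

# CNN-S-style stem sketch: replace the stride-2 7x7 first convolution with a
# stride-1 3x3 convolution for 48x48 inputs (or a 5x5, stride-2 kernel for
# 96x96 inputs), keeping early feature maps at higher resolution.
def make_cnn_s(low_res=48):
    model = resnet101(num_classes=1000)
    if low_res == 48:
        model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    elif low_res == 96:
        model.conv1 = nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2, bias=False)
    model.maxpool = nn.Identity()   # assumption: drop the stride-2 max pooling as well
    return model
```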
Yet another simple solution for small object detection would be to fine-tune CNN-B on up-sampled low resolution images, yielding CNN-B-FT (Fig. 3). The performance of CNN-B-FT on up-sampled low-resolution images is better than that of CNN-S (Fig. 4). This result empirically demonstrates that the filters learned on high-resolution images can be useful for recognizing low-resolution images as well. Therefore, instead of reducing the stride by 2, it is better to up-sample images 2 times and then fine-tune the network pre-trained on high-resolution images.
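A sketch of the CNN-B-FT recipe under the same assumptions as the sketches above: the optimizer hyper-parameters and the `train_loader_at` helper are illustrative, and `make_transform` is the up-sampling transform from the first sketch.

```python
import torch
from torchvision.models import resnet101

# CNN-B-FT sketch: start from the 224x224 pre-trained model (CNN-B) and
# fine-tune it on low-resolution images up-sampled back to 224x224.
model = resnet101(pretrained=True).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for images, labels in train_loader_at(make_transform(48)):  # hypothetical loader, 48x48 -> 224x224
    optimizer.zero_grad()
    loss = criterion(model(images.cuda()), labels.cuda())
    loss.backward()
    optimizer.step()
```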
While training object detectors, we can either use different network architectures for classifying objects of different resolutions or use a single architecture for all resolutions. Since pre-training on ImageNet (or other larger classification datasets) is beneficial and filters learned on larger object instances help to classify smaller object instances, upsampling images and using the network pre-trained on high resolution images should be better than a specialized network for classifying small objects. Fortunately, existing object detectors up-sample images for detecting smaller objects instead of using a different architecture. Our analysis supports this practice and compares it with other alternatives to emphasize the difference.

4. Background

In the next section, we discuss a few baselines for detecting small objects. We briefly describe the Deformable-RFCN [7] detector which will be used in the following analysis. D-RFCN obtains the best single model results on COCO and is publicly available, so we use this detector.

Deformable-RFCN is based on the R-FCN detector [6]. It adds deformable convolutions in the conv5 layers to adaptively change the receptive field of the network for creating scale invariant representations for objects of different scales. At each convolutional feature map, a lightweight network predicts offsets on the 2D grid, which are the spatial locations at which the spatial sub-filters of the convolution kernel are applied. The second change is in Position Sensitive RoI Pooling: instead of pooling from a fixed set of bins on the convolutional feature map (for an RoI), a network predicts offsets for each position sensitive filter (depending on the feature map) on which PSRoI-Pooling is performed.

For our experiments, proposals are extracted at a single resolution (after upsampling) of 800x1200 using a publicly available Deformable-RFCN detector. It has a ResNet-101 backbone and is trained at a resolution of 800x1200. 5 anchor scales are used in RPN for generating proposals [2]. For classifying these proposals, we use Deformable-RFCN with a ResNet-50 backbone without the Deformable Position Sensitive RoIPooling; we use Position Sensitive RoIPooling with bilinear interpolation, as it reduces the number of filters by a factor of 3 in the last layer. NMS with a threshold of 0.3 is used. Not performing end-to-end training along with RPN, using ResNet-50 and eliminating deformable PSRoI filters reduces training time by a factor of 3 and also saves GPU memory.

5. Data Variation or Correct Scale?

The study in Section 3 confirms that a difference in resolution between the training and testing phases leads to a significant drop in performance. Unfortunately, this difference in resolution is part of the current object detection pipeline - due to GPU memory constraints, training is performed at a lower resolution (800x1200) than testing (1400x2000) (note that the original resolution is typically 640x480). This section analyses the effect of image resolution, the scale of object instances and variation in data on the performance of an object detector. We train detectors under different settings and evaluate them on 1400x2000 images for detecting small objects (less than 32x32 pixels in the COCO dataset) only, to tease apart the factors that affect the performance. The results are reported in Table 1.
Figure 5. Different approaches for providing input for training the classifier of a proposal based detector.

1400<80px   800all   1400all   MST    SNIP
16.4        19.6     19.9      19.5   21.4
Table 1. mAP on small objects under different training protocols. MST denotes multi-scale training as shown in Fig. 5.3. Small objects are those which are smaller than 32x32 pixels in COCO.

We start by training detectors that use all the object instances at two different resolutions, 800x1200 and 1400x2000, referred to as 800all and 1400all respectively. As expected, 1400all outperformed 800all, because the former is trained and tested on the same resolution, i.e. 1400x2000. However, the improvement is only marginal. Why? To answer this question we consider what happens to the medium-to-large object instances while training at such a large resolution. They become too big to be correctly classified! Therefore, training at higher resolutions scales up small objects for better classification, but blows up the medium-to-large objects, which degrades performance. We therefore trained another detector (1400<80px) at a resolution of 1400x2000 while ignoring all the medium-to-large objects (> 80 pixels in the original image) to eliminate the deleterious effects of extremely large objects. Unfortunately, it performed significantly worse than even 800all. What happened? We lost a significant source of variation in appearance and pose by ignoring medium-to-large objects (about 30% of the total object instances), which hurt performance more than it helped by eliminating extreme scale objects. Lastly, we evaluated the common practice of obtaining scale-invariant detectors by using randomly sampled images at multiple resolutions during training, referred to as MST (MST also uses a resolution of 480x800). It ensures training instances are observed at many different resolutions, but its performance also degraded because of extremely small and large objects. It performed similarly to 800all. We conclude that it is important to train a detector with appropriately scaled objects while capturing as much variation across the object instances as possible. In the next section we describe our proposed solution that achieves exactly this, and show that it outperforms current training pipelines.

6. Object Detection on an Image Pyramid

Our goal is to combine the best of both approaches, i.e. to train with maximal variations in appearance and pose while restricting scale to a reasonable range. We achieve this with a novel construct that we refer to as Scale Normalization for Image Pyramids (SNIP). We also discuss details of training object detectors on an image pyramid within the memory limits of current GPUs.

6.1. Scale Normalization for Image Pyramids

SNIP is a modified version of MST where only the object instances that have a resolution close to the pre-training dataset, which is typically 224x224, are used for training the detector. In multi-scale training (MST), each image is observed at different resolutions; therefore, at a high resolution (like 1400x2000) large objects are hard to classify and at a low resolution (like 480x800) small objects are hard to classify. Fortunately, each object instance appears at several different scales and some of those appearances fall in the desired scale range. In order to eliminate extreme scale objects, either too large or too small, training is only performed on objects that fall in the desired scale range and the remainder are simply ignored during back-propagation. Effectively, SNIP uses all the object instances during training, which helps capture all the variations in appearance and pose, while reducing the domain shift in the scale space for the pre-trained network. The result of evaluating the detector trained using SNIP is reported in Table 1 - it outperforms all the other approaches. This experiment demonstrates the effectiveness of SNIP for detecting small objects. Below we discuss the implementation of SNIP in detail.
Figure 6. SNIP training and inference for IPN is shown. Invalid RoIs which fall outside the specified range at each scale are shown in
purple. These are discarded during training and inference. Each batch during training consists of images sampled from a particular scale.
Invalid GT boxes are used to invalidate anchors in RPN. Detections from each scale are rescaled and combined using NMS.

For training the classifier, all ground truth boxes are used to assign labels to proposals. We do not select proposals and ground truth boxes which are outside a specified size range at a particular resolution during training. At a particular resolution i, if the area ar(r) of an RoI falls within a range [s_i^c, e_i^c], it is marked as valid, else it is invalid. Similarly, RPN training also uses all ground truth boxes to assign labels to anchors. Finally, those anchors which have an overlap greater than 0.3 with an invalid ground truth box are excluded during training.
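A sketch of this validity bookkeeping at a single pyramid scale is given below. The helper names are ours, and the object-size measure (square root of the box area, expressed in original-image pixels, matching the ranges quoted in Section 7.1) is one reasonable reading of the notation above rather than the authors' exact implementation.

```python
import math

def obj_size(box, image_scale):
    """Box is (x1, y1, x2, y2) in the resized image; return sqrt of its area
    measured in pixels of the original image."""
    w, h = box[2] - box[0], box[3] - box[1]
    return math.sqrt(w * h) / image_scale

def is_valid(box, valid_range, image_scale):
    lo, hi = valid_range
    return lo <= obj_size(box, image_scale) <= hi

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def snip_masks(rois, anchors, gt_boxes, valid_range, image_scale):
    """Which RoIs contribute to the classification loss, and which anchors are
    ignored in the RPN loss, at this resolution."""
    roi_is_valid = [is_valid(r, valid_range, image_scale) for r in rois]
    invalid_gt = [g for g in gt_boxes if not is_valid(g, valid_range, image_scale)]
    # anchors overlapping an invalid ground-truth box by more than 0.3 are ignored
    anchor_ignored = [any(iou(a, g) > 0.3 for g in invalid_gt) for a in anchors]
    return roi_is_valid, anchor_ignored

# Example classifier ranges from Sec. 7.1 (original-image pixels):
# (1400, 2000) -> (0, 80), (800, 1200) -> (40, 160), (480, 800) -> (120, inf).
```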
During inference, we generate proposals using RPN for each resolution and classify them independently at each resolution, as shown in Fig. 6. Similar to training, we do not select detections (not proposals) which fall outside the specified range at each resolution. After classification and bounding-box regression, we use Soft-NMS [2] to combine detections from multiple resolutions and obtain the final detection boxes; refer to Fig. 6.
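A sketch of this merging step, under the assumption that each scale's detections are first mapped back to original-image coordinates. Plain NMS from torchvision is used here as a stand-in for the Soft-NMS [2] employed in the actual pipeline.

```python
import torch
from torchvision.ops import nms   # stand-in; the paper combines scales with Soft-NMS [2]

def merge_scales(per_scale_dets, valid_ranges, iou_thresh=0.5):
    """per_scale_dets: list of (boxes[N,4], scores[N], scale_factor), one per resolution.
    valid_ranges: list of (lo, hi) size ranges in original-image pixels, one per resolution."""
    all_boxes, all_scores = [], []
    for (boxes, scores, scale), (lo, hi) in zip(per_scale_dets, valid_ranges):
        boxes = boxes / scale                                   # back to original-image pixels
        wh = (boxes[:, 2:] - boxes[:, :2]).clamp(min=0)
        size = (wh[:, 0] * wh[:, 1]).sqrt()                     # sqrt of area
        keep = (size >= lo) & (size <= hi)                      # drop out-of-range detections
        all_boxes.append(boxes[keep])
        all_scores.append(scores[keep])
    boxes = torch.cat(all_boxes)
    scores = torch.cat(all_scores)
    keep = nms(boxes, scores, iou_thresh)
    return boxes[keep], scores[keep]
```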
The resolution of the RoIs after pooling matches the pre-trained network, so it is easier for the network to learn during fine-tuning. For methods like R-FCN, which divide RoIs into sub-parts and use position sensitive filters, this becomes even more important. For example, if the size of an RoI is 48 pixels (3 pixels in the conv5 feature map) and there are 7 filters along each axis, the positional correspondence between features and filters would be lost.
6.2. Sampling Sub-Images

Training on high resolution images with deep networks like ResNet-101 or DPN-92 [38] requires more GPU memory. Therefore, we crop images so that they fit in GPU memory. Our aim is to generate the minimum number of chips (sub-images) of size 1000x1000 which cover all the small objects in the image. This helps in accelerating training, as no computation is needed where there are no small objects. For this, we generate 50 randomly positioned chips of size 1000x1000 per image. The chip which covers the maximum number of objects is selected and added to our set of training images. Until all objects in the image are covered, we repeat the sampling and selection process on the remaining objects. Since chips are randomly generated and proposal boxes often have a side on the image boundary, to speed up the sampling process we snap the chips to the image boundaries. We found that, on average, 1.7 chips of size 1000x1000 are generated for images of size 1400x2000. This sampling step is not needed when the image resolution is 800x1200 or 480x640, or when an image does not contain small objects. Random cropping is not the reason why we observe an improvement in performance for our detector: to verify this, we trained ResNet-50 (as it requires less memory) using un-cropped high-resolution images (1400x2000) and did not observe any change in mAP.
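A sketch of this greedy chip-sampling procedure; the helper names and the fallback used to guarantee termination are ours, not the authors' implementation.

```python
import random

def sample_chips(img_w, img_h, small_boxes, chip=1000, n_candidates=50):
    """Greedily select 1000x1000 chips so that every small box is fully covered."""
    def chip_at(x, y):
        # snap the chip to the image boundary
        x = min(max(0, x), max(0, img_w - chip))
        y = min(max(0, y), max(0, img_h - chip))
        return (x, y, min(x + chip, img_w), min(y + chip, img_h))

    def covered(b, c):
        return b[0] >= c[0] and b[1] >= c[1] and b[2] <= c[2] and b[3] <= c[3]

    # only boxes that can fit inside a chip are considered
    remaining = [b for b in small_boxes if b[2] - b[0] <= chip and b[3] - b[1] <= chip]
    chips = []
    while remaining:
        candidates = [chip_at(random.randint(0, img_w), random.randint(0, img_h))
                      for _ in range(n_candidates)]
        best = max(candidates, key=lambda c: sum(covered(b, c) for b in remaining))
        left = [b for b in remaining if not covered(b, best)]
        if len(left) == len(remaining):
            # no candidate helped: centre a chip on one uncovered object instead
            b = remaining[0]
            best = chip_at((b[0] + b[2]) // 2 - chip // 2, (b[1] + b[3]) // 2 - chip // 2)
            left = [x for x in remaining if not covered(x, best)]
        chips.append(best)
        remaining = left
    return chips
```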
7. Datasets and Evaluation

We evaluate our method on the COCO dataset. COCO contains 123,000 images for training, and evaluation is performed on 20,288 images in test-dev. Since recall for proposals is not provided by the evaluation server on COCO, we train on 118,000 images and report recall on the remaining 5,000 images (commonly referred to as the minival set). Unless specifically mentioned, the area of small objects is less than 32x32, medium objects range from 32x32 to 96x96 and large objects are greater than 96x96.

7.1. Training Details

We train Deformable-RFCN [7] as our detector with 3 resolutions, (480, 800), (800, 1200) and (1400, 2000), where the first value is for the shorter side of the image and the second one is the limit on the maximum size of a side. Training is performed for 7 epochs for the classifier, while RPN is trained for 6 epochs.

Method          AP    APS   APM   APL
Single scale    34.5  16.3  37.2  47.6
MS Test         35.9  19.5  37.3  48.5
MS Train/Test   35.6  19.5  37.5  47.3
SNIP            37.8  21.4  40.4  50.1
Table 2. MS denotes multi-scale. Single scale is (800,1200).

Method      AR    AR50  AR75  0-25  25-50  50-100
Baseline    57.6  88.7  67.9  67.5  90.1   95.6
+ Improved  61.3  89.2  69.8  68.1  91.0   96.7
+ SNIP      64.0  92.1  74.7  74.4  95.1   98.0
DPN-92      65.7  92.8  76.3  76.7  95.7   98.2
Table 3. For individual ranges (like 0-25 etc.) recall at 50% overlap is reported, because minor localization errors can be fixed in the second stage. The first three rows use ResNet-50 as the backbone. Recall is for 900 proposals, as the top 300 are taken from each scale.

Although it is possible to combine RPN and RCN using alternating training, which leads to a slight improvement in accuracy [21], we train separate models for RPN and RCN and evaluate their performance independently. This is because it is faster to experiment with different classification architectures after proposals are extracted. We use a warmup learning rate of 0.00005 for 1000 iterations, after which it is increased to 0.0005. Step-down is performed at 4.33 epochs for RPN and at 5.33 epochs otherwise. For training RCN, we use online hard example mining [32] as performed in [7]. Our implementation is in MXNet and training is performed on 8 Nvidia P6000 GPUs. Batch size is 1 per GPU and we use synchronous SGD. For efficient utilization of multiple GPUs in parallel, images of only one resolution are included in a mini-batch, so an image may be forward propagated multiple times per epoch. Note that if there are no ground truth boxes within the valid range at a particular resolution in an image, that image-resolution pair is ignored during training. For our baselines which did not involve SNIP, we also evaluated their performance after 8 or 9 epochs, but observed that results after 7 epochs were best.

For the classifier (RCN), on images of resolution (1400, 2000) the valid range in the original image (without up/down sampling) is [0, 80], at a resolution of (800, 1200) it is [40, 160], and at a resolution of (480, 800) it is [120, ∞]. Notice that we have an overlap of 40 pixels over adjacent ranges; this is because it is not clear which resolution is correct at the boundary. These ranges were design decisions made during training, based on the consideration that after re-scaling, the resolution of the valid RoIs does not differ significantly from the resolution on which the backbone CNN was trained. Since in RPN even a one pixel feature map can generate a proposal (unlike PSRoI filters, which should ideally map to a 7x7 feature map), we use a validity range of [0, 160] at (800, 1200) for valid ground truths for RPN. For inference, the validity range for each resolution in RCN is obtained using the minival set. Training RPN is fast as it does not have Position Sensitive Filters, so we enable SNIP after the first epoch. SNIP doubles the training time per epoch, so we enable it after 3 epochs for training RCN.
7.2. Improving RPN

In detectors like Faster-RCNN, R-FCN and Deformable R-FCN, RPN is used for generating region proposals. RPN assigns an anchor as positive only if its overlap with a ground truth bounding box is greater than 0.7 (if no such anchor exists, RPN assigns the anchor with the maximum overlap with the ground truth bounding box as positive). We found that when using RPN at conv4 with 15 anchors (5 scales - 32, 64, 128, 256, 512, at stride 16 - and 3 aspect ratios), only 30% of the ground truth boxes match this criterion when the image resolution is 800x1200 in COCO. Even if this threshold is changed to 0.5, only 58% of the ground truth boxes have an anchor which matches this criterion. Therefore, for more than 40% of the ground truth boxes, an anchor which has an overlap less than 0.5 is assigned as a positive (or ignored). Methods which use a feature pyramid, like FPN and Mask-RCNN, also employ RPN at finer resolutions like conv3, so this problem is alleviated to some extent. However, the higher level features at conv4/conv5 may not capture the desired semantic representation unless the image is sampled at multiple resolutions.

Since we sample the image at multiple resolutions and back-propagate gradients at the relevant resolution only, this problem is alleviated to some extent. We also concatenate the output of conv4 and conv5 to capture diverse features, and use 7 anchor scales. A more careful combination of features with predictions at multiple layers, like [21, 11], should provide a further boost in performance (at a significant computational burden for the deformable R-FCN detector).
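A sketch of what such a modified RPN head could look like, assuming (as in D-RFCN) that conv4 and conv5 share the same stride so their feature maps can be concatenated directly; the channel widths are illustrative and this is not the released model.

```python
import torch
import torch.nn as nn

class ImprovedRPNHead(nn.Module):
    """RPN head over concatenated conv4/conv5 features with 7 scales x 3 ratios = 21 anchors."""
    def __init__(self, c4_ch=1024, c5_ch=2048, mid_ch=512, num_anchors=21):
        super().__init__()
        self.reduce = nn.Conv2d(c4_ch + c5_ch, mid_ch, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(mid_ch, num_anchors, kernel_size=1)        # objectness per anchor
        self.reg = nn.Conv2d(mid_ch, num_anchors * 4, kernel_size=1)    # box deltas per anchor

    def forward(self, conv4_feat, conv5_feat):
        x = torch.relu(self.reduce(torch.cat([conv4_feat, conv5_feat], dim=1)))
        return self.cls(x), self.reg(x)
```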

7.3. Experiments

First, we evaluate the performance of SNIP on classification (RCN) under the same settings as described in Section 4. Table 2 shows the performance of the single scale model, multi-scale testing, and multi-scale training followed by multi-scale testing. We use the best possible validity ranges at each resolution for each of these methods when multi-scale testing is performed. Multi-scale testing improves performance by 1.4%. The performance of the detector deteriorates for large objects when we add multi-scale training; this is because at extreme resolutions the receptive field of the network is not sufficient to classify them. SNIP improves performance by 1.9% compared to standard multi-scale testing. Note that we only use single scale proposals, common across all three scales, during classification for this experiment.
Method                  Backbone                                      AP    AP50  AP75  APS   APM   APL
IPN, No SNIP            DPN-98 (3 scales, DPN-92 proposals)           41.2  63.5  45.9  25.7  43.9  52.8
IPN, No SNIP in RPN     DPN-98 (3 scales, DPN-92 proposals)           44.2  65.6  49.7  27.4  47.8  55.8
IPN, With SNIP          DPN-98 (3 scales, DPN-92 proposals)           44.7  66.6  50.2  28.5  47.8  55.9
D-RFCN [7, 2]           ResNet-101                                    38.4  60.1  41.6  18.5  41.6  52.5
FCIS [36]               Ensemble (seg)                                39.7  61.6  42.6  22.3  43.2  52.9
Mask-RCNN [11]          ResNext-101 (seg)                             39.8  62.3  43.4  22.1  43.2  51.2
D-RFCN [7, 2]           ResNet-101 (6 scales)                         40.9  62.8  45.0  23.3  43.6  53.3
G-RMI [16]              Ensemble                                      41.6  62.3  45.6  24.0  43.9  55.2
IPN (D-RFCN Detector)   ResNet-101 (3 scales, ResNet-101 proposals)   43.4  65.5  48.4  27.2  46.5  54.9
IPN (D-RFCN Detector)   DPN-92 (3 scales, DPN-92 proposals)           43.8  66.1  49.0  27.3  46.9  55.5
IPN (D-RFCN Detector)   DPN-98 (3 scales, DPN-92 proposals)           44.7  66.6  50.2  28.5  47.8  55.9
IPN (D-RFCN Detector)   DPN-98 (3 scales, DPN-92 proposals, flip)     45.7  67.3  51.1  29.3  48.8  57.1
IPN (D-RFCN Detector)   Ensemble (DPN-92 proposals)                   48.3  69.7  53.7  31.4  51.6  60.7
Table 4. Comparison of IPN with state-of-the-art methods. (seg) denotes that segmentation masks were also used for training.

For RPN, a baseline with the ResNet-50 network was trained on the conv4 feature map. The top 300 proposals are selected from each scale and all these 900 proposals are used for computing recall. Average recall (averaged over multiple overlap thresholds, just like mAP) is better for our improved RPN, as seen in Table 3. This is because for large objects (> 100 pixels), average recall improves by 10% (not shown in the table) for the improved baseline. Although the improved version increases average recall, it does not have much effect at 50% overlap. Recall at 50% is most important for object proposals because bounding box regression can correct minor localization errors, but if an object is not covered at all by proposals, it will clearly lead to a miss. Recall for objects greater than 100 pixels at 50% overlap is already close to 100%, so improving average recall for large objects is not that valuable for a detector. Note that SNIP improves recall at 50% overlap by 2.9% compared to our improved baseline. For objects smaller than 25 pixels, the improvement in recall is 6.3%. Using a stronger classification network like DPN-92 also improves recall. In the last two rows of Table 4, we perform an ablation study with our best model, which uses a DPN-98 classifier and DPN-92 proposals. If we train our improved RPN without SNIP, mAP drops by 1.1% on small objects and 0.5% overall. Note that the AP of large objects is not affected, as we still use the classification model trained with SNIP.

Finally, we compare IPN with state-of-the-art detectors in Table 4. For these experiments, we use the deformable position sensitive filters and Soft-NMS. Compared to the single scale deformable R-FCN baseline shown in the first line of Table 4, IPN improves overall results by 5% and results for small objects by 8.7%! This shows the importance of an image pyramid for object detection. Compared to the best single model method (which uses 6 instead of 3 scales in IPN and is also trained end-to-end) based on ResNet-101, IPN improves performance by 2.5% overall and by 3.9% for small objects. We observe that using better backbone architectures further improves the performance of the detector. When SNIP is not used for both the proposals and the classifier (MST is used at the same scales), mAP drops by 3.5% for the DPN-98 classifier, as shown in the first three rows. Other than the 3 networks mentioned in Table 4, we also trained a DPN-92 and a ResNet-101 network jointly: classification scores were averaged, while bounding-box regression was only performed on the DPN-92 network. This network obtained an mAP of 45.2% after flipping. For the ensemble, DPN-92 proposals are used for all the networks (including ResNet-101). Since proposals are shared across all networks, we average the scores and box predictions for each RoI. During flipping, we average the detection scores and bounding box predictions. Finally, Soft-NMS is used to obtain the final detections. Iterative bounding-box regression is not used. All pre-trained models are trained on ImageNet-1000 and COCO segmentation masks are not used. Still, our overall mAP is 6.7% better; at a 50% overlap and for small objects, it is 7.4% better. For results shown with a single model, we improve the state-of-the-art by 4.9%. On 100 images, it takes 90 seconds for IPN to perform detection on a Titan X GPU using a ResNet-101 backbone. Speed can be improved with end-to-end training.

8. Conclusion

We presented an analysis of different techniques for recognizing and detecting objects under extreme scale variation, which exposed shortcomings of the current object detection training pipeline. Based on the analysis, a training scheme (SNIP) was proposed to tackle the wide scale spectrum of object instances which participate in training and to reduce the domain-shift for the pre-trained classification network. Compared to a single-scale detector, SNIP obtains a 5% improvement in mAP, which highlights the importance of scale and image pyramids in object detection.
References

[1] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2874–2883, 2016. 1, 2
[2] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-nms – improving object detection with one line of code. Proceedings of the IEEE International Conference on Computer Vision, 2017. 2, 3, 4, 6, 8
[3] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision, pages 354–370. Springer, 2016. 1, 2
[4] J. Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986. 2
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915, 2016. 2
[6] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016. 1, 2, 3, 4
[7] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. arXiv preprint arXiv:1703.06211, 2017. 1, 2, 3, 4, 6, 7, 8
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009. 1
[9] S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware cnn model. In The IEEE International Conference on Computer Vision (ICCV), December 2015. 1
[10] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013. 3
[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. arXiv preprint arXiv:1703.06870, 2017. 1, 2, 7, 8
[12] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361. Springer, 2014. 2
[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 1, 2
[14] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017. 1
[15] P. Hu and D. Ramanan. Finding tiny faces. arXiv preprint arXiv:1612.04402, 2016. 3
[16] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016. 1, 8
[17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009. 3
[18] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. 1
[19] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer vision and pattern recognition, 2006 IEEE computer society conference on, volume 2, pages 2169–2178. IEEE, 2006. 2
[20] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan. Scale-aware fast r-cnn for pedestrian detection. arXiv preprint arXiv:1510.08160, 2015. 1
[21] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. arXiv preprint arXiv:1612.03144, 2016. 1, 2, 6, 7
[22] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002, 2017. 2
[23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 1
[24] T. Lindeberg. Scale-space theory in computer vision, 1993. 2
[25] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016. 1, 2
[26] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004. 2
[27] M. Najibi, P. Samangouei, R. Chellappa, and L. Davis. SSH: Single stage headless face detector. In Proceedings of the International Conference on Computer Vision (ICCV), 2017. 2
[28] P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on pattern analysis and machine intelligence, 12(7):629–639, 1990. 2
[29] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In European Conference on Computer Vision, pages 75–91. Springer, 2016. 2
[30] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015. 1, 2
[31] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun. Object detection networks on convolutional feature maps. IEEE transactions on pattern analysis and machine intelligence, 39(7):1476–1481, 2017. 2, 3
[32] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016. 7
[33] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta. Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851, 2016. 1
[34] A. Witkin. Scale-space filtering: A new approach to multi-
scale description. In Acoustics, Speech, and Signal Pro-
cessing, IEEE International Conference on ICASSP’84., vol-
ume 9, pages 150–153. IEEE, 1984. 2
[35] F. Yang, W. Choi, and Y. Lin. Exploit all the layers: Fast
and accurate cnn object detector with scale dependent pool-
ing and cascaded rejection classifiers. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recogni-
tion, pages 2129–2137, 2016. 1, 2
[36] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017. 8
[37] F. Yu and V. Koltun. Multi-scale context aggregation by di-
lated convolutions. arXiv preprint arXiv:1511.07122, 2015.
1
[38] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path
networks. arXiv preprint arXiv:1707.01629, 2017. 6
[39] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross,
S. Chintala, and P. Dollár. A multipath network for object
detection. arXiv preprint arXiv:1604.02135, 2016. 1
[40] X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu,
Y. Zhou, B. Yang, Z. Wang, et al. Crafting gbd-net for ob-
ject detection. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 2017. 1

