Focal Loss For Dense Object Detection
Abstract—The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier
is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling
of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In
this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered
during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross
entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse
set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the
effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the
focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing
state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.
Index Terms—Computer vision, object detection, machine learning, convolutional neural networks
1 INTRODUCTION
Fig. 1. We propose a novel loss we term the Focal Loss that adds a factor (1 - p_t)^γ to the standard cross entropy criterion. Setting γ > 0 reduces the relative loss for well-classified examples (p_t > .5), putting more focus on hard, misclassified examples. As our experiments will demonstrate, the proposed focal loss enables training highly accurate dense object detectors in the presence of vast numbers of easy background examples.

Fig. 2. Speed (ms) versus accuracy (AP) on COCO test-dev. Enabled by the focal loss, our simple one-stage RetinaNet detector outperforms all previous one-stage and two-stage detectors, including the best reported Faster R-CNN [3] system from [4]. We show variants of RetinaNet with ResNet-50-FPN (blue circles) and ResNet-101-FPN (orange diamonds) at five scales (400-800 pixels). Ignoring the low-accuracy regime (AP < 25), RetinaNet forms an upper envelope of all current detectors, and an improved variant (not shown) achieves 40.8 AP. Details are given in Section 5.
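For reference, the loss described in the caption of Fig. 1 can be written out as follows (a reconstruction from the caption, in the paper's notation; the α-balanced form is the one the experiments in Section 5 refer to):

    \[
      p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise,} \end{cases}
      \qquad
      \mathrm{CE}(p_t) = -\log(p_t),
      \qquad
      \mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \log(p_t),
    \]

where p is the model's estimated probability for the class with label y = 1 and α_t is defined analogously to p_t. With γ = 0 and α_t = 1 this reduces to standard cross entropy.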
pyramid and use of anchor boxes. It draws on a variety of recent ideas from [3], [4], [9], [21]. RetinaNet is efficient and accurate; our best model, based on a ResNet-101-FPN backbone, achieves a COCO test-dev AP of 39.1 while running at 5 fps, surpassing the previously best published single-model results from both one and two-stage detectors, see Fig. 2.

2 RELATED WORK

Classic Object Detectors. The sliding-window paradigm, in which a classifier is applied on a dense image grid, has a long and rich history. One of the earliest successes is the classic work of LeCun et al. who applied convolutional neural networks to handwritten digit recognition [22], [23]. Viola and Jones [19] used boosted object detectors for face detection, leading to widespread adoption of such models. The introduction of HOG [24] and integral channel features [25] gave rise to effective methods for pedestrian detection. DPMs [20] helped extend dense detectors to more general object categories and had top results on PASCAL [26] for many years. While the sliding-window approach was the leading detection paradigm in classic computer vision, with the resurgence of deep learning [27], two-stage detectors, described next, quickly came to dominate object detection.

Two-Stage Detectors. The dominant paradigm in modern object detection is based on a two-stage approach. As pioneered in the Selective Search work [12], the first stage generates a sparse set of candidate proposals that should contain all objects while filtering out the majority of negative locations [28], and the second stage classifies the proposals into foreground classes/background. R-CNN [1] upgraded the second-stage classifier to a convolutional network, yielding large gains in accuracy and ushering in the modern era of object detection. R-CNN was improved over the years, both in terms of speed [2], [29] and by using learned object proposals [3], [14], [21]. Region Proposal Networks (RPN) integrated proposal generation with the second-stage classifier into a single convolutional network, forming the Faster R-CNN framework [3]. Numerous extensions to this framework have been proposed, e.g., [4], [5], [16], [30], [31].

One-Stage Detectors. OverFeat [32] was one of the first modern one-stage object detectors based on deep networks. More recently SSD [9], [10] and YOLO [7], [8] have renewed interest in one-stage methods. These detectors have been tuned for speed but their accuracy trails that of two-stage methods. SSD has a 10-20 percent lower AP, while YOLO focuses on an even more extreme speed/accuracy trade-off. See Fig. 2. Recent work showed that two-stage detectors can be made fast simply by reducing input image resolution and the number of proposals, but one-stage methods trailed in accuracy even with a larger compute budget [33]. In contrast, the aim of this work is to understand if one-stage detectors can match or surpass the accuracy of two-stage detectors while running at similar or faster speeds.

The design of our RetinaNet detector shares many similarities with previous dense detectors, in particular the concept of 'anchors' introduced by RPN [3] and the use of feature pyramids as in SSD [9] and FPN [4]. We emphasize that our simple detector achieves top results not based on innovations in network design but due to our novel loss.

Class Imbalance. Both classic one-stage object detection methods, like boosted detectors [19], [25] and DPMs [20], and more recent methods, like SSD [9], face a large class imbalance during training. These detectors evaluate 10^4-10^5 candidate locations per image but only a few locations contain objects. This imbalance causes two problems: (1) training is inefficient as most locations are easy negatives that contribute no useful learning signal; (2) en masse, the easy negatives can overwhelm training and lead to degenerate models. A common solution is to perform some form of hard negative mining [9], [16], [17], [19], [20] that samples hard examples during training, or more complex sampling/reweighting schemes [34]. In contrast, we show that our proposed focal loss naturally handles the class imbalance faced by a one-stage detector and allows us to efficiently train on all examples without sampling and without easy negatives overwhelming the loss and computed gradients.

Robust Estimation. There has been much interest in designing robust loss functions (e.g., Huber loss [35]) that reduce the contribution of outliers by down-weighting the
Fig. 5. The one-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) [4] backbone on top of a feedforward ResNet architecture [31] (a) to generate a rich, multi-scale convolutional feature pyramid (b). To this backbone RetinaNet attaches two subnetworks, one for classifying anchor boxes (c) and one for regressing from anchor boxes to ground-truth object boxes (d). The network design is intentionally simple, which enables this work to focus on a novel focal loss function that eliminates the accuracy gap between our one-stage detector and state-of-the-art two-stage detectors like Faster R-CNN with FPN [4] while running at faster speeds.
choices are not crucial, we emphasize the use of the FPN backbone is; preliminary experiments using features from only the final ResNet layer yielded low AP.

Anchors. We use translation-invariant anchor boxes similar to those in the RPN variant in [4]. The anchors have areas of 32^2 to 512^2 on pyramid levels P3 to P7, respectively. As in [4], at each pyramid level we use anchors at three aspect ratios {1:2, 1:1, 2:1}. For denser scale coverage than in [4], at each level we add anchors of sizes {2^0, 2^(1/3), 2^(2/3)} of the original set of 3 aspect ratio anchors. This improves AP in our setting. In total there are A = 9 anchors per level and across levels they cover the scale range 32-813 pixels with respect to the network's input image.
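To make the anchor configuration concrete, the following sketch (ours, not the released implementation; the function name and the (width, height) return format are assumptions) enumerates the A = 9 anchor shapes of one pyramid level:

    import math

    def anchor_shapes(base_size, aspect_ratios=(0.5, 1.0, 2.0),
                      scale_octaves=(0.0, 1.0 / 3.0, 2.0 / 3.0)):
        """Return (width, height) pairs for the A = 9 anchors of one pyramid level.

        base_size is 32 for P3, 64 for P4, ..., 512 for P7, so anchor areas
        span 32^2 to 512^2 across levels.
        """
        shapes = []
        for octave in scale_octaves:          # sizes 2^0, 2^(1/3), 2^(2/3) of the base
            size = base_size * (2.0 ** octave)
            for ratio in aspect_ratios:       # aspect ratios 1:2, 1:1, 2:1 (ratio = h/w)
                shapes.append((size / math.sqrt(ratio), size * math.sqrt(ratio)))
        return shapes

    # Example: the 9 anchors of P3. The largest anchor over all levels is
    # 512 * 2^(2/3), roughly 813 pixels, matching the 32-813 range quoted above.
    print(anchor_shapes(32))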
Each anchor is assigned a length K one-hot vector of classification targets, where K is the number of object classes, and a 4-vector of box regression targets. We use the assignment rule from RPN [3] but modified for multi-class detection and with adjusted thresholds. Specifically, anchors are assigned to ground-truth object boxes using an intersection-over-union (IoU) threshold of 0.5; and to background if their IoU is in [0, 0.4). As each anchor is assigned to at most one object box, we set the corresponding entry in its length K label vector to 1 and all other entries to 0. If an anchor is unassigned, which may happen with overlap in [0.4, 0.5), it is ignored during training. Box regression targets are computed as the offset between each anchor and its assigned object box, or omitted if there is no assignment.
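The assignment rule above can be sketched as follows (a simplified NumPy illustration, not the released code; the IoU matrix is assumed to be precomputed):

    import numpy as np

    def assign_anchors(iou, fg_thresh=0.5, bg_thresh=0.4):
        """Label anchors from an IoU matrix of shape (num_anchors, num_gt).

        Returns, per anchor, the index of its assigned ground-truth box,
        -1 for background (IoU in [0, 0.4)), or -2 for anchors that are
        ignored during training (IoU in [0.4, 0.5))."""
        best_gt = iou.argmax(axis=1)        # each anchor is assigned to at most one box
        best_iou = iou.max(axis=1)
        labels = np.full(iou.shape[0], -2, dtype=np.int64)   # default: ignore
        labels[best_iou < bg_thresh] = -1                    # background
        fg = best_iou >= fg_thresh                           # foreground at IoU >= 0.5
        labels[fg] = best_gt[fg]
        return labels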
Classification Subnet. The classification subnet predicts the probability of object presence at each spatial position for each of the A anchors and K object classes. This subnet is a small FCN attached to each FPN level; parameters of this subnet are shared across all pyramid levels. Its design is simple. Taking an input feature map with C channels from a given pyramid level, the subnet applies four 3 x 3 conv layers, each with C filters and each followed by ReLU activations, followed by a 3 x 3 conv layer with KA filters. Finally sigmoid activations are attached to output the KA binary predictions per spatial location, see Fig. 5c. We use C = 256 and A = 9 in most experiments.
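A minimal PyTorch sketch of this head (ours, for illustration; the class name and constructor arguments are assumptions, and the released Detectron implementation differs in details):

    import torch.nn as nn

    class ClassificationSubnet(nn.Module):
        """Per-level classification head; parameters are shared across FPN levels."""

        def __init__(self, num_classes, num_anchors=9, channels=256):
            super().__init__()
            layers = []
            for _ in range(4):                    # four 3x3 convs with C filters, each + ReLU
                layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
            # final 3x3 conv with K*A filters: K*A binary predictions per location
            layers.append(nn.Conv2d(channels, num_classes * num_anchors, 3, padding=1))
            self.net = nn.Sequential(*layers)

        def forward(self, feature_map):
            return self.net(feature_map).sigmoid()   # sigmoid gives per-anchor class probabilities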
In contrast to RPN [3], our object classification subnet is deeper, uses only 3 x 3 convs, and does not share parameters with the box regression subnet (described next). We found these higher-level design decisions to be more important than specific values of hyperparameters.

Box Regression Subnet. In parallel with the object classification subnet, we attach another small FCN to each pyramid level for the purpose of regressing the offset from each anchor box to a nearby ground-truth object, if one exists. The design of the box regression subnet is identical to the classification subnet except that it terminates in 4A linear outputs per spatial location, see Fig. 5d. For each of the A anchors per spatial location, these 4 outputs predict the relative offset between the anchor and the ground-truth box (we use the standard box parameterization from R-CNN [1]). We note that unlike most recent work, we use a class-agnostic bounding box regressor which uses fewer parameters and we found to be equally effective. The object classification subnet and the box regression subnet, though sharing a common structure, use separate parameters.

4.1 Inference and Training

Inference. RetinaNet forms a single FCN comprised of a ResNet-FPN backbone, a classification subnet, and a box regression subnet, see Fig. 5. As such, inference involves simply forwarding an image through the network. To improve speed, we only decode box predictions from at most 1k top-scoring predictions per FPN level, after thresholding detector confidence at 0.05. The top predictions from all levels are merged and greedy non-maximum suppression with a threshold of 0.5 is applied to get the final detections.
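The decoding and merging steps just described look roughly as follows (a simplified sketch, not the released implementation; the per-level output layout and the decode_boxes helper are assumptions, and for brevity only the best-scoring class per anchor is kept):

    import torch
    from torchvision.ops import nms

    def detect(per_level_outputs, decode_boxes, score_thresh=0.05, topk=1000, nms_thresh=0.5):
        """per_level_outputs: list of (probs, deltas, anchors) per FPN level, where
        probs is (num_anchors, num_classes) and deltas is (num_anchors, 4)."""
        boxes, scores, labels = [], [], []
        for probs, deltas, anchors in per_level_outputs:
            level_scores, level_labels = probs.max(dim=1)
            keep = (level_scores > score_thresh).nonzero(as_tuple=True)[0]   # confidence > 0.05
            top = level_scores[keep].topk(min(topk, keep.numel())).indices   # at most 1k per level
            idx = keep[top]
            boxes.append(decode_boxes(anchors[idx], deltas[idx]))   # decode only the survivors
            scores.append(level_scores[idx])
            labels.append(level_labels[idx])
        boxes, scores, labels = torch.cat(boxes), torch.cat(scores), torch.cat(labels)
        keep = nms(boxes, scores, nms_thresh)                        # merge levels, greedy NMS at 0.5
        return boxes[keep], scores[keep], labels[keep]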
Focal Loss. We use the focal loss introduced in this work as the loss on the output of the classification subnet. As we will show in Section 5, we find that γ = 2 works well in practice and the RetinaNet is relatively robust to γ ∈ [0.5, 5]. We emphasize that when training RetinaNet, the focal loss is applied to all 100k anchors in each sampled image. This stands in contrast to common practice of using heuristic sampling (RPN) or hard example mining (OHEM, SSD) to select a small set of anchors (e.g., 256) for each minibatch. The total focal loss of an image is computed as the sum of the focal loss over all 100k anchors, normalized by the number of anchors assigned to a ground-truth box. We perform the normalization by the number of assigned anchors, not total anchors, since the vast majority of anchors are easy negatives and receive negligible loss values under the focal loss. Finally we note that α, the weight assigned to the rare class, also has a stable range, but it interacts with γ making it necessary to select the two together (see Table 1a and 1b). In general α should be decreased slightly as γ is increased (for γ = 2, α = 0.25 works best).
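As a concrete illustration of this loss, here is a minimal PyTorch sketch of the α-balanced focal loss with the normalization described above; it is our own sketch, not the Detectron implementation, and it assumes ignored anchors have already been filtered out:

    import torch
    import torch.nn.functional as F

    def retina_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        """logits, targets: (num_anchors, num_classes); targets is one-hot (float),
        with all-zero rows for background anchors."""
        p = torch.sigmoid(logits)
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
        p_t = p * targets + (1.0 - p) * (1.0 - targets)
        alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
        loss = alpha_t * (1.0 - p_t) ** gamma * ce        # FL = -alpha_t (1 - p_t)^gamma log(p_t)
        # Sum over all anchors, normalized by the number of anchors assigned to a
        # ground-truth box (the number of ones in the one-hot targets), not by the
        # total number of anchors.
        num_assigned = targets.sum().clamp(min=1.0)
        return loss.sum() / num_assigned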
TABLE 1
Ablation Experiments for RetinaNet and Focal Loss (FL)

All models are trained on trainval35k and tested on minival unless noted. If not specified, default values are: γ = 2; anchors for 3 scales and 3 aspect ratios; ResNet-50-FPN backbone; and a 600 pixel train and test image scale. (a) RetinaNet with α-balanced CE achieves at most 31.1 AP. (b) In contrast, using FL with the same exact network gives a 2.9 AP gain and is fairly robust to exact γ/α settings. (c) Using 2-3 scale and 3 aspect ratio anchors yields good results after which point performance saturates. (d) FL outperforms the best variants of online hard example mining (OHEM) [9], [16] by over 3 points AP. (e) Accuracy/Speed trade-off of RetinaNet on test-dev for various network depths and image scales (see also Fig. 2).
Initialization. We experiment with ResNet-50-FPN and ResNet-101-FPN backbones [4]. The base ResNet-50 and ResNet-101 models are pre-trained on ImageNet1k; we use the models released by [31]. New layers added for FPN are initialized as in [4]. All new conv layers except the final one in the RetinaNet subnets are initialized with bias b = 0 and a Gaussian weight fill with σ = 0.01. For the final conv layer of the classification subnet, we set the bias initialization to b = -log((1 - π)/π), where π specifies that at the start of training every anchor should be labeled as foreground with confidence of π. We use π = .01 in all experiments, although results are robust to the exact value. As explained in Section 3.3, this initialization prevents the large number of background anchors from generating a large, destabilizing loss value in the first iteration of training.
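In code, this initialization amounts to roughly the following (a PyTorch-flavored sketch; the argument names are our own):

    import math
    import torch.nn as nn

    def init_retinanet_head(subnet_convs, final_cls_conv, prior=0.01, std=0.01):
        """Gaussian weight fill (sigma = 0.01) and zero bias for the new conv layers,
        plus the prior-probability bias b = -log((1 - prior) / prior) on the final
        classification conv, so every anchor starts near `prior` foreground confidence."""
        for conv in subnet_convs:
            nn.init.normal_(conv.weight, mean=0.0, std=std)
            nn.init.constant_(conv.bias, 0.0)
        nn.init.constant_(final_cls_conv.bias, -math.log((1.0 - prior) / prior))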
Optimization. RetinaNet is trained with stochastic gradient descent (SGD). We use synchronized SGD over 8 Nvidia M40 GPUs with a total of 16 images per minibatch (2 images per GPU). Unless otherwise specified, all models are trained for 90k iterations with an initial learning rate of 0.01, which is then divided by 10 at 60k and again at 80k iterations. We use horizontal image flipping as the only form of data augmentation unless otherwise noted. Weight decay of 0.0001 and momentum of 0.9 are used. The training loss is the sum of the focal loss and the standard smooth L1 loss used for box regression [2]. Training time ranges between 10 and 35 hours for the models in Table 1e.
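For reference, this recipe corresponds to roughly the following PyTorch configuration (a sketch; `model` is assumed, and the scheduler is stepped once per iteration):

    import torch

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=0.0001)
    # Divide the learning rate by 10 at 60k and again at 80k of the 90k iterations.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[60000, 80000],
                                                     gamma=0.1)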
5 EXPERIMENTS

We present experimental results on the bounding box detection track of the challenging COCO benchmark [6]. For training, we follow common practice [4], [37] and use the COCO trainval35k split (union of 80k images from train and a random 35k subset of images from the 40k image val split). We report lesion and sensitivity studies by evaluating on the minival split (the remaining 5k images from val). For our main results, we report COCO AP on the test-dev split, which has no public labels and requires use of the evaluation server.

5.1 Training Dense Detection

We run numerous experiments to analyze the behavior of the loss function for dense detection along with various optimization strategies. For all experiments we use depth 50 or 101 ResNets [31] with a Feature Pyramid Network [4] constructed on top. For all ablation studies we use an image scale of 600 pixels for training and testing.

Network Initialization. Our first attempt to train RetinaNet uses standard cross entropy loss without any modifications to the initialization or learning strategy. This fails quickly, with the network diverging during training. However, simply initializing the last layer of our model such that the prior probability of detecting an object is π = .01 (see Section 4.1) enables effective learning. Training RetinaNet with ResNet-50 and this initialization already yields a respectable AP of 30.2 on COCO. Results are insensitive to the exact value of π so we use π = .01 for all experiments.

Balanced Cross Entropy. Our next attempt to improve learning involved using the α-balanced CE loss described in Section 3.1. Results for various α are shown in Table 1a. Setting α = .75 gives a gain of 0.9 points AP.

Focal Loss. Results using our proposed focal loss are shown in Table 1b. The focal loss introduces one new hyperparameter, the focusing parameter γ, that controls the strength of the modulating term.
Fig. 6. Cumulative distribution functions of the normalized loss for positive and negative samples for different values of γ for a converged model. The effect of changing γ on the distribution of the loss for positive examples is minor. For negatives, however, increasing γ heavily concentrates the loss on hard examples, focusing nearly all attention away from easy negatives.
When γ = 0, our loss is equivalent to the CE loss. As γ increases, the shape of the loss changes so that "easy" examples with low loss get further discounted, see Fig. 1. FL shows large gains over CE as γ is increased. With γ = 2, FL yields a 2.9 AP improvement over the α-balanced CE loss.

For the experiments in Table 1b, for a fair comparison we find the best α for each γ. We observe that lower α's are selected for higher γ's (as easy negatives are down-weighted, less emphasis needs to be placed on the positives). Overall, however, the benefit of changing γ is much larger, and indeed the best α's ranged in just [.25, .75] (we tested α ∈ [.01, .999]). We use γ = 2.0 with α = .25 for all experiments but α = .5 works nearly as well (.4 AP lower).
Analysis of the Focal Loss. To understand the focal loss better, we analyze the empirical distribution of the loss of a converged model. For this, we take our default ResNet-101 600-pixel model trained with γ = 2 (which has 36.0 AP). We apply this model to a large number of random images and sample the predicted probability for 10^7 negative windows and 10^5 positive windows. Next, separately for positives and negatives, we compute FL for these samples, and normalize the loss such that it sums to one. Given the normalized loss, we can sort the loss from lowest to highest and plot its cumulative distribution function (CDF) for both positive and negative samples and for different settings for γ (even though the model was trained with γ = 2).
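The normalized-loss CDFs plotted in Fig. 6 can be computed with a few lines of NumPy (a sketch; per_sample_loss is assumed to hold one focal-loss value per sampled window):

    import numpy as np

    def loss_cdf(per_sample_loss):
        """Sort losses from lowest to highest, normalize them to sum to one, and
        return the cumulative fraction of total loss versus fraction of samples."""
        loss = np.sort(np.asarray(per_sample_loss))
        loss = loss / loss.sum()
        return np.cumsum(loss)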
Cumulative distribution functions for positive and negative samples are shown in Fig. 6. If we observe the positive samples, we see that the CDF looks fairly similar for different values of γ. For example, approximately 20 percent of the hardest positive samples account for roughly half of the positive loss; as γ increases more of the loss gets concentrated in the top 20 percent of examples, but the effect is minor.

The effect of γ on negative samples is dramatically different. For γ = 0, the positive and negative CDFs are quite similar. However, as γ increases, substantially more weight becomes concentrated on the hard negative examples. In fact, with γ = 2 (our default setting), the vast majority of the loss comes from a small fraction of samples. As can be seen, FL can effectively discount the effect of easy negatives, focusing all attention on the hard negative examples.

Online Hard Example Mining (OHEM). Shrivastava et al. [16] proposed to improve training of two-stage detectors by constructing minibatches using high-loss examples. Specifically, in OHEM each example is scored by its loss, non-maximum suppression (nms) is then applied, and a minibatch is constructed with the highest-loss examples. The nms threshold and batch size are tunable parameters. Like the focal loss, OHEM puts more emphasis on misclassified examples, but unlike FL, OHEM completely discards easy examples. We also implement a variant of OHEM used in SSD [9]: after applying nms to all examples, the minibatch is constructed to enforce a 1:3 ratio between positives and negatives to help ensure each minibatch has enough positives.

We test both OHEM variants in our setting of one-stage detection which has large class imbalance. Results for the original OHEM strategy and the 'OHEM 1:3' strategy for selected batch sizes and nms thresholds are shown in Table 1d. These results use ResNet-101; our baseline trained with FL achieves 36.0 AP for this setting. In contrast, the best setting for OHEM (no 1:3 ratio, batch size 128, nms of .5) achieves 32.8 AP. This is a gap of 3.2 AP, showing FL is more effective than OHEM for training dense detectors. We note that we tried other parameter settings and variants for OHEM but did not achieve better results.

Hinge Loss. Finally, in early experiments, we attempted to train with the hinge loss [35] on p_t, which sets the loss to 0 above a certain value of p_t. However, this was unstable and we did not manage to obtain meaningful results. Results exploring alternate loss functions are discussed next.
5.2 Variants of Focal Loss

We trained RetinaNet-50-600 using identical settings as before but we swap out FL for FL* with the selected parameters (see Section 3.5). These models achieve nearly the same AP as those trained with FL, see Table 2. In other words, FL* is a reasonable alternative for the FL that works well in practice.

TABLE 2
Results of FL and FL* versus CE for Select Settings

loss   γ     β     AP    AP50   AP75
CE     -     -     31.1  49.4   33.0
FL     2.0   -     34.0  52.5   36.5
FL*    2.0   1.0   33.8  52.7   36.3
FL*    4.0   0.0   33.9  51.8   36.4

We found that various γ and β settings gave good results. In Fig. 7 we show results for RetinaNet-50-600 with FL* for a wide set of parameters. The loss plots are color coded such that effective settings (models converged and with AP over 33.5) are shown in blue. We used α = .25 in all
TABLE 3
Object Detection Single-Model Results (Bounding Box AP), versus State-of-the-Art on COCO test-dev

We show results for our RetinaNet-101-800 model, trained with scale jitter and for 1.5x longer than the same model from Table 1e. Our model achieves top results, outperforming both one-stage and two-stage models. For a detailed breakdown of speed versus accuracy see Table 1e and Fig. 2.
further improves results another 1.7 AP, surpassing 40 AP on COCO.

6 CONCLUSION

In this work, we identify class imbalance as the primary obstacle preventing one-stage object detectors from surpassing top-performing, two-stage methods. To address this, we propose the focal loss which applies a modulating term to the cross entropy loss in order to focus learning on hard negative examples. Our approach is simple and highly effective. We demonstrate its efficacy by designing a fully convolutional one-stage detector and report extensive experimental analysis showing that it achieves state-of-the-art accuracy and speed. Source code is available at https://github.com/facebookresearch/Detectron [39].
REFERENCES

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 580-587.
[2] R. Girshick, "Fast R-CNN," in Proc. Int. Conf. Comput. Vis., 2015.
[3] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Neural Inf. Process. Syst., 2015.
[4] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[5] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017.
[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740-755.
[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[8] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[9] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis., 2016.
[10] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: Deconvolutional single shot detector," arXiv:1701.06659, 2016.
[11] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Proc. Neural Inf. Process. Syst., 2016.
[12] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, pp. 154-171, 2013.
[13] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 391-405.
[14] P. O. Pinheiro, R. Collobert, and P. Dollár, "Learning to segment object candidates," in Proc. Neural Inf. Process. Syst., 2015.
[15] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár, "Learning to refine object segments," in Proc. Eur. Conf. Comput. Vis., 2016.
[16] A. Shrivastava, A. Gupta, and R. Girshick, "Training region-based object detectors with online hard example mining," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[17] K.-K. Sung and T. Poggio, "Learning and example selection for object and pattern detection," in MIT A.I. Memo No. 1521, 1994.
[18] H. Rowley, S. Baluja, and T. Kanade, "Human face detection in visual scenes," Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-CS-95-158R, 1995.
[19] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2001, pp. I-511-I-518.
[20] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester, "Cascade object detection with deformable part models," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 2241-2248.
[21] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, "Scalable object detection using deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014.
[22] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Comput., vol. 1, pp. 541-551, 1989.
[23] R. Vaillant, C. Monrocq, and Y. LeCun, "Original approach for the localisation of objects in images," Proc. IEEE Vis. Image Signal Process., vol. 141, pp. 245-250, 1994.
[24] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005, pp. 886-893.
[25] P. Dollár, Z. Tu, P. Perona, and S. Belongie, "Integral channel features," in Proc. British Mach. Vis. Conf., 2009.
[26] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes (VOC) Challenge," Int. J. Comput. Vis., vol. 88, pp. 303-338, 2010.
[27] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Neural Inf. Process. Syst., 2012, pp. 1097-1105.
[28] J. Hosang, R. Benenson, P. Dollár, and B. Schiele, "What makes for effective detection proposals?" IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 4, pp. 814-830, Apr. 2016.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 346-361.
[30] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta, "Beyond skip connections: Top-down modulation for object detection," arXiv:1612.06851, 2016.
[31] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[32] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," in Proc. Int. Conf. Learn. Representations, 2014.
[33] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, "Speed/accuracy trade-offs for modern convolutional object detectors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[34] S. R. Bulò, G. Neuhold, and P. Kontschieder, "Loss max-pooling for semantic image segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[35] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Berlin, Germany: Springer, 2008.
[36] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
[37] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick, "Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[38] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in Proc. AAAI Conf. Artif. Intell., 2017.
[39] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He, "Detectron," 2018. [Online]. Available: https://github.com/facebookresearch/detectron
[40] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.

Tsung-Yi Lin received the master's degree from the University of California, San Diego, in 2013 and the PhD degree in electrical and computer engineering from Cornell, in 2017. He was the recipient of the Marr Prize Best Student Paper Award at ICCV 2017. He joined Google Brain as a research scientist, in 2017. His research interests include computer vision and machine learning. In particular, he is interested in learning representations for object detection and segmentation.
Priya Goyal received the master's degree in mathematics and computing from the Indian Institute of Technology, Kanpur, in 2015. She is currently a research engineer with Facebook AI Research (FAIR) working on computer vision, particularly object detection and object segmentation, high performance computing with applications to computer vision, and compilers for Deep Learning for accelerating neural networks. She co-authored a paper which was recipient of the Marr Prize Best Student Paper Award at ICCV 2017.

Ross Girshick received the PhD degree in computer science from the University of Chicago under Pedro Felzenszwalb, in 2012. He is a research scientist with Facebook AI Research (FAIR), working on computer vision and machine learning. Prior, he was a researcher with Microsoft Research and a postdoc with the University of California, Berkeley, where he was advised by Jitendra Malik and Trevor Darrell. He received the 2017 PAMI Young Researcher Award and the Marr Prize at ICCV 2017 for "Mask R-CNN".

Kaiming He joined Facebook AI Research (FAIR), in 2016 as a research scientist. Formerly he was with Microsoft Research Asia (MSRA), which he joined in 2011 after receiving the PhD degree. His research interests include computer vision and deep learning. He has received the Best Paper Award in CVPR 2009, CVPR 2016, and ICCV 2017.

Piotr Dollár received the PhD degree from UCSD under the guidance of Serge Belongie, in 2007 and has continued doing research in vision and learning since. He is a research scientist with Facebook AI Research (FAIR) with a focus on computer vision and machine learning. Prior, he spent three years with Microsoft Research (MSR). He helped cofound Anchovi Labs (acquired by Dropbox in 2012) and before that was a postdoc with the Computational Vision Lab at Caltech until 2011.