Focal Loss For Dense Object Detection
Abstract—The highest accuracy object detectors to date are based on a two-stage approach popularized by R-CNN, where a classifier
is applied to a sparse set of candidate object locations. In contrast, one-stage detectors that are applied over a regular, dense sampling
of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. In
this paper, we investigate why this is the case. We discover that the extreme foreground-background class imbalance encountered
during training of dense detectors is the central cause. We propose to address this class imbalance by reshaping the standard cross
entropy loss such that it down-weights the loss assigned to well-classified examples. Our novel Focal Loss focuses training on a sparse
set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training. To evaluate the
effectiveness of our loss, we design and train a simple dense detector we call RetinaNet. Our results show that when trained with the
focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing
state-of-the-art two-stage detectors. Code is at: https://github.com/facebookresearch/Detectron.
Index Terms—Computer vision, object detection, machine learning, convolutional neural networks
1 INTRODUCTION
Fig. 1. We propose a novel loss we term the Focal Loss that adds a factor (1 - p_t)^γ to the standard cross entropy criterion. Setting γ > 0 reduces the relative loss for well-classified examples (p_t > .5), putting more focus on hard, misclassified examples. As our experiments will demonstrate, the proposed focal loss enables training highly accurate dense object detectors in the presence of vast numbers of easy background examples.

Fig. 2. Speed (ms) versus accuracy (AP) on COCO test-dev. Enabled by the focal loss, our simple one-stage RetinaNet detector outperforms all previous one-stage and two-stage detectors, including the best reported Faster R-CNN [3] system from [4]. We show variants of RetinaNet with ResNet-50-FPN (blue circles) and ResNet-101-FPN (orange diamonds) at five scales (400-800 pixels). Ignoring the low-accuracy regime (AP < 25), RetinaNet forms an upper envelope of all current detectors, and an improved variant (not shown) achieves 40.8 AP. Details are given in Section 5.
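For reference, the loss described in the caption of Fig. 1 can be written out as follows (a reconstruction from the caption, in the paper's notation; the α-balanced form is the one the experiments in Section 5 refer to):

    \[
      p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise,} \end{cases}
      \qquad
      \mathrm{CE}(p_t) = -\log(p_t),
      \qquad
      \mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \log(p_t),
    \]

where p is the model's estimated probability for the class with label y = 1 and α_t is defined analogously to p_t. With γ = 0 and α_t = 1 this reduces to standard cross entropy.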
pyramid and use of anchor boxes. It draws on a variety of recent ideas from [3], [4], [9], [21]. RetinaNet is efficient and accurate; our best model, based on a ResNet-101-FPN backbone, achieves a COCO test-dev AP of 39.1 while running at 5 fps, surpassing the previously best published single-model results from both one and two-stage detectors, see Fig. 2.

2 RELATED WORK

Classic Object Detectors. The sliding-window paradigm, in which a classifier is applied on a dense image grid, has a long and rich history. One of the earliest successes is the classic work of LeCun et al. who applied convolutional neural networks to handwritten digit recognition [22], [23]. Viola and Jones [19] used boosted object detectors for face detection, leading to widespread adoption of such models. The introduction of HOG [24] and integral channel features [25] gave rise to effective methods for pedestrian detection. DPMs [20] helped extend dense detectors to more general object categories and had top results on PASCAL [26] for many years. While the sliding-window approach was the leading detection paradigm in classic computer vision, with the resurgence of deep learning [27], two-stage detectors, described next, quickly came to dominate object detection.

Two-Stage Detectors. The dominant paradigm in modern object detection is based on a two-stage approach. As pioneered in the Selective Search work [12], the first stage generates a sparse set of candidate proposals that should contain all objects while filtering out the majority of negative locations [28], and the second stage classifies the proposals into foreground classes/background. R-CNN [1] upgraded the second-stage classifier to a convolutional network, yielding large gains in accuracy and ushering in the modern era of object detection. R-CNN was improved over the years, both in terms of speed [2], [29] and by using learned object proposals [3], [14], [21]. Region Proposal Networks (RPN) integrated proposal generation with the second-stage classifier into a single convolutional network, forming the Faster R-CNN framework [3]. Numerous extensions to this framework have been proposed, e.g., [4], [5], [16], [30], [31].

One-Stage Detectors. OverFeat [32] was one of the first modern one-stage object detectors based on deep networks. More recently SSD [9], [10] and YOLO [7], [8] have renewed interest in one-stage methods. These detectors have been tuned for speed but their accuracy trails that of two-stage methods. SSD has a 10-20 percent lower AP, while YOLO focuses on an even more extreme speed/accuracy trade-off. See Fig. 2. Recent work showed that two-stage detectors can be made fast simply by reducing input image resolution and the number of proposals, but one-stage methods trailed in accuracy even with a larger compute budget [33]. In contrast, the aim of this work is to understand if one-stage detectors can match or surpass the accuracy of two-stage detectors while running at similar or faster speeds.

The design of our RetinaNet detector shares many similarities with previous dense detectors, in particular the concept of 'anchors' introduced by RPN [3] and the use of feature pyramids as in SSD [9] and FPN [4]. We emphasize that our simple detector achieves top results not based on innovations in network design but due to our novel loss.

Class Imbalance. Both classic one-stage object detection methods, like boosted detectors [19], [25] and DPMs [20], and more recent methods, like SSD [9], face a large class imbalance during training. These detectors evaluate 10^4-10^5 candidate locations per image but only a few locations contain objects. This imbalance causes two problems: (1) training is inefficient as most locations are easy negatives that contribute no useful learning signal; (2) en masse, the easy negatives can overwhelm training and lead to degenerate models. A common solution is to perform some form of hard negative mining [9], [16], [17], [19], [20] that samples hard examples during training, or more complex sampling/reweighting schemes [34]. In contrast, we show that our proposed focal loss naturally handles the class imbalance faced by a one-stage detector and allows us to efficiently train on all examples without sampling and without easy negatives overwhelming the loss and computed gradients.

Robust Estimation. There has been much interest in designing robust loss functions (e.g., Huber loss [35]) that reduce the contribution of outliers by down-weighting the
Fig. 5. The one-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) [4] backbone on top of a feedforward ResNet architecture [31] (a) to generate a rich, multi-scale convolutional feature pyramid (b). To this backbone RetinaNet attaches two subnetworks, one for classifying anchor boxes (c) and one for regressing from anchor boxes to ground-truth object boxes (d). The network design is intentionally simple, which enables this work to focus on a novel focal loss function that eliminates the accuracy gap between our one-stage detector and state-of-the-art two-stage detectors like Faster R-CNN with FPN [4] while running at faster speeds.
choices are not crucial, we emphasize the use of the FPN backbone is; preliminary experiments using features from only the final ResNet layer yielded low AP.

Anchors. We use translation-invariant anchor boxes similar to those in the RPN variant in [4]. The anchors have areas of 32^2 to 512^2 on pyramid levels P3 to P7, respectively. As in [4], at each pyramid level we use anchors at three aspect ratios {1:2, 1:1, 2:1}. For denser scale coverage than in [4], at each level we add anchors of sizes {2^0, 2^(1/3), 2^(2/3)} of the original set of 3 aspect ratio anchors. This improves AP in our setting. In total there are A = 9 anchors per level and across levels they cover the scale range 32-813 pixels with respect to the network's input image.
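To make the anchor configuration concrete, the following sketch (ours, not the released implementation; the function name and the (width, height) return format are assumptions) enumerates the A = 9 anchor shapes of one pyramid level:

    import math

    def anchor_shapes(base_size, aspect_ratios=(0.5, 1.0, 2.0),
                      scale_octaves=(0.0, 1.0 / 3.0, 2.0 / 3.0)):
        """Return (width, height) pairs for the A = 9 anchors of one pyramid level.

        base_size is 32 for P3, 64 for P4, ..., 512 for P7, so anchor areas
        span 32^2 to 512^2 across levels.
        """
        shapes = []
        for octave in scale_octaves:          # sizes 2^0, 2^(1/3), 2^(2/3) of the base
            size = base_size * (2.0 ** octave)
            for ratio in aspect_ratios:       # aspect ratios 1:2, 1:1, 2:1 (ratio = h/w)
                shapes.append((size / math.sqrt(ratio), size * math.sqrt(ratio)))
        return shapes

    # Example: the 9 anchors of P3. The largest anchor over all levels is
    # 512 * 2^(2/3), roughly 813 pixels, matching the 32-813 range quoted above.
    print(anchor_shapes(32))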
Each anchor is assigned a length K one-hot vector of classification targets, where K is the number of object classes, and a 4-vector of box regression targets. We use the assignment rule from RPN [3] but modified for multi-class detection and with adjusted thresholds. Specifically, anchors are assigned to ground-truth object boxes using an intersection-over-union (IoU) threshold of 0.5; and to background if their IoU is in [0, 0.4). As each anchor is assigned to at most one object box, we set the corresponding entry in its length K label vector to 1 and all other entries to 0. If an anchor is unassigned, which may happen with overlap in [0.4, 0.5), it is ignored during training. Box regression targets are computed as the offset between each anchor and its assigned object box, or omitted if there is no assignment.
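The assignment rule above can be sketched as follows (a simplified NumPy illustration, not the released code; the IoU matrix is assumed to be precomputed):

    import numpy as np

    def assign_anchors(iou, fg_thresh=0.5, bg_thresh=0.4):
        """Label anchors from an IoU matrix of shape (num_anchors, num_gt).

        Returns, per anchor, the index of its assigned ground-truth box,
        -1 for background (IoU in [0, 0.4)), or -2 for anchors that are
        ignored during training (IoU in [0.4, 0.5))."""
        best_gt = iou.argmax(axis=1)        # each anchor is assigned to at most one box
        best_iou = iou.max(axis=1)
        labels = np.full(iou.shape[0], -2, dtype=np.int64)   # default: ignore
        labels[best_iou < bg_thresh] = -1                    # background
        fg = best_iou >= fg_thresh                           # foreground at IoU >= 0.5
        labels[fg] = best_gt[fg]
        return labels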
Classification Subnet. The classification subnet predicts the probability of object presence at each spatial position for each of the A anchors and K object classes. This subnet is a small FCN attached to each FPN level; parameters of this subnet are shared across all pyramid levels. Its design is simple. Taking an input feature map with C channels from a given pyramid level, the subnet applies four 3 x 3 conv layers, each with C filters and each followed by ReLU activations, followed by a 3 x 3 conv layer with KA filters. Finally sigmoid activations are attached to output the KA binary predictions per spatial location, see Fig. 5c. We use C = 256 and A = 9 in most experiments.
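A minimal PyTorch sketch of this head (ours, for illustration; the class name and constructor arguments are assumptions, and the released Detectron implementation differs in details):

    import torch.nn as nn

    class ClassificationSubnet(nn.Module):
        """Per-level classification head; parameters are shared across FPN levels."""

        def __init__(self, num_classes, num_anchors=9, channels=256):
            super().__init__()
            layers = []
            for _ in range(4):                    # four 3x3 convs with C filters, each + ReLU
                layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
            # final 3x3 conv with K*A filters: K*A binary predictions per location
            layers.append(nn.Conv2d(channels, num_classes * num_anchors, 3, padding=1))
            self.net = nn.Sequential(*layers)

        def forward(self, feature_map):
            return self.net(feature_map).sigmoid()   # sigmoid gives per-anchor class probabilities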
In contrast to RPN [3], our object classification subnet is deeper, uses only 3 x 3 convs, and does not share parameters with the box regression subnet (described next). We found these higher-level design decisions to be more important than specific values of hyperparameters.

Box Regression Subnet. In parallel with the object classification subnet, we attach another small FCN to each pyramid level for the purpose of regressing the offset from each anchor box to a nearby ground-truth object, if one exists. The design of the box regression subnet is identical to the classification subnet except that it terminates in 4A linear outputs per spatial location, see Fig. 5d. For each of the A anchors per spatial location, these 4 outputs predict the relative offset between the anchor and the ground-truth box (we use the standard box parameterization from R-CNN [1]). We note that unlike most recent work, we use a class-agnostic bounding box regressor which uses fewer parameters and we found to be equally effective. The object classification subnet and the box regression subnet, though sharing a common structure, use separate parameters.

4.1 Inference and Training

Inference. RetinaNet forms a single FCN comprised of a ResNet-FPN backbone, a classification subnet, and a box regression subnet, see Fig. 5. As such, inference involves simply forwarding an image through the network. To improve speed, we only decode box predictions from at most 1k top-scoring predictions per FPN level, after thresholding detector confidence at 0.05. The top predictions from all levels are merged and greedy non-maximum suppression with a threshold of 0.5 is applied to get the final detections.
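The decoding and merging steps just described look roughly as follows (a simplified sketch, not the released implementation; the per-level output layout and the decode_boxes helper are assumptions, and for brevity only the best-scoring class per anchor is kept):

    import torch
    from torchvision.ops import nms

    def detect(per_level_outputs, decode_boxes, score_thresh=0.05, topk=1000, nms_thresh=0.5):
        """per_level_outputs: list of (probs, deltas, anchors) per FPN level, where
        probs is (num_anchors, num_classes) and deltas is (num_anchors, 4)."""
        boxes, scores, labels = [], [], []
        for probs, deltas, anchors in per_level_outputs:
            level_scores, level_labels = probs.max(dim=1)
            keep = (level_scores > score_thresh).nonzero(as_tuple=True)[0]   # confidence > 0.05
            top = level_scores[keep].topk(min(topk, keep.numel())).indices   # at most 1k per level
            idx = keep[top]
            boxes.append(decode_boxes(anchors[idx], deltas[idx]))   # decode only the survivors
            scores.append(level_scores[idx])
            labels.append(level_labels[idx])
        boxes, scores, labels = torch.cat(boxes), torch.cat(scores), torch.cat(labels)
        keep = nms(boxes, scores, nms_thresh)                        # merge levels, greedy NMS at 0.5
        return boxes[keep], scores[keep], labels[keep]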
Focal Loss. We use the focal loss introduced in this work as the loss on the output of the classification subnet. As we will show in Section 5, we find that γ = 2 works well in practice and the RetinaNet is relatively robust to γ ∈ [0.5, 5]. We emphasize that when training RetinaNet, the focal loss is applied to all 100k anchors in each sampled image. This stands in contrast to common practice of using heuristic sampling (RPN) or hard example mining (OHEM, SSD) to select a small set of anchors (e.g., 256) for each minibatch. The total focal loss of an image is computed as the sum of the focal loss over all 100k anchors, normalized by the number of anchors assigned to a ground-truth box. We perform the normalization by the number of assigned anchors, not total anchors, since the vast majority of anchors are easy negatives and receive negligible loss values under the focal loss. Finally we note that α, the weight assigned to the rare class, also has a stable range, but it interacts with γ making it necessary to select the two together (see Table 1a and 1b). In general α should be decreased slightly as γ is increased (for γ = 2, α = 0.25 works best).
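As a concrete illustration of this loss, here is a minimal PyTorch sketch of the α-balanced focal loss with the normalization described above; it is our own sketch, not the Detectron implementation, and it assumes ignored anchors have already been filtered out:

    import torch
    import torch.nn.functional as F

    def retina_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        """logits, targets: (num_anchors, num_classes); targets is one-hot (float),
        with all-zero rows for background anchors."""
        p = torch.sigmoid(logits)
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
        p_t = p * targets + (1.0 - p) * (1.0 - targets)
        alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
        loss = alpha_t * (1.0 - p_t) ** gamma * ce        # FL = -alpha_t (1 - p_t)^gamma log(p_t)
        # Sum over all anchors, normalized by the number of anchors assigned to a
        # ground-truth box (the number of ones in the one-hot targets), not by the
        # total number of anchors.
        num_assigned = targets.sum().clamp(min=1.0)
        return loss.sum() / num_assigned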
TABLE 1
Ablation Experiments for RetinaNet and Focal Loss (FL)

All models are trained on trainval35k and tested on minival unless noted. If not specified, default values are: γ = 2; anchors for 3 scales and 3 aspect ratios; ResNet-50-FPN backbone; and a 600 pixel train and test image scale. (a) RetinaNet with α-balanced CE achieves at most 31.1 AP. (b) In contrast, using FL with the same exact network gives a 2.9 AP gain and is fairly robust to exact γ/α settings. (c) Using 2-3 scale and 3 aspect ratio anchors yields good results after which point performance saturates. (d) FL outperforms the best variants of online hard example mining (OHEM) [9], [16] by over 3 points AP. (e) Accuracy/Speed trade-off of RetinaNet on test-dev for various network depths and image scales (see also Fig. 2).
Initialization. We experiment with ResNet-50-FPN and ResNet-101-FPN backbones [4]. The base ResNet-50 and ResNet-101 models are pre-trained on ImageNet1k; we use the models released by [31]. New layers added for FPN are initialized as in [4]. All new conv layers except the final one in the RetinaNet subnets are initialized with bias b = 0 and a Gaussian weight fill with σ = 0.01. For the final conv layer of the classification subnet, we set the bias initialization to b = -log((1 - π)/π), where π specifies that at the start of training every anchor should be labeled as foreground with confidence of π. We use π = .01 in all experiments, although results are robust to the exact value. As explained in Section 3.3, this initialization prevents the large number of background anchors from generating a large, destabilizing loss value in the first iteration of training.
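In code, this initialization amounts to roughly the following (a PyTorch-flavored sketch; the argument names are our own):

    import math
    import torch.nn as nn

    def init_retinanet_head(subnet_convs, final_cls_conv, prior=0.01, std=0.01):
        """Gaussian weight fill (sigma = 0.01) and zero bias for the new conv layers,
        plus the prior-probability bias b = -log((1 - prior) / prior) on the final
        classification conv, so every anchor starts near `prior` foreground confidence."""
        for conv in subnet_convs:
            nn.init.normal_(conv.weight, mean=0.0, std=std)
            nn.init.constant_(conv.bias, 0.0)
        nn.init.constant_(final_cls_conv.bias, -math.log((1.0 - prior) / prior))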
Optimization. RetinaNet is trained with stochastic gradient descent (SGD). We use synchronized SGD over 8 Nvidia M40 GPUs with a total of 16 images per minibatch (2 images per GPU). Unless otherwise specified, all models are trained for 90k iterations with an initial learning rate of 0.01, which is then divided by 10 at 60k and again at 80k iterations. We use horizontal image flipping as the only form of data augmentation unless otherwise noted. Weight decay of 0.0001 and momentum of 0.9 are used. The training loss is the sum of the focal loss and the standard smooth L1 loss used for box regression [2]. Training time ranges between 10 and 35 hours for the models in Table 1e.
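For reference, this recipe corresponds to roughly the following PyTorch configuration (a sketch; `model` is assumed, and the scheduler is stepped once per iteration):

    import torch

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=0.0001)
    # Divide the learning rate by 10 at 60k and again at 80k of the 90k iterations.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[60000, 80000],
                                                     gamma=0.1)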
5 EXPERIMENTS

We present experimental results on the bounding box detection track of the challenging COCO benchmark [6]. For training, we follow common practice [4], [37] and use the COCO trainval35k split (union of 80k images from train and a random 35k subset of images from the 40k image val split). We report lesion and sensitivity studies by evaluating on the minival split (the remaining 5k images from val). For our main results, we report COCO AP on the test-dev split, which has no public labels and requires use of the evaluation server.

5.1 Training Dense Detection

We run numerous experiments to analyze the behavior of the loss function for dense detection along with various optimization strategies. For all experiments we use depth 50 or 101 ResNets [31] with a Feature Pyramid Network [4] constructed on top. For all ablation studies we use an image scale of 600 pixels for training and testing.

Network Initialization. Our first attempt to train RetinaNet uses standard cross entropy loss without any modifications to the initialization or learning strategy. This fails quickly, with the network diverging during training. However, simply initializing the last layer of our model such that the prior probability of detecting an object is π = .01 (see Section 4.1) enables effective learning. Training RetinaNet with ResNet-50 and this initialization already yields a respectable AP of 30.2 on COCO. Results are insensitive to the exact value of π so we use π = .01 for all experiments.

Balanced Cross Entropy. Our next attempt to improve learning involved using the α-balanced CE loss described in Section 3.1. Results for various α are shown in Table 1a. Setting α = .75 gives a gain of 0.9 points AP.

Focal Loss. Results using our proposed focal loss are shown in Table 1b. The focal loss introduces one new hyperparameter, the focusing parameter γ, that controls the strength of the modulating term.
Fig. 6. Cumulative distribution functions of the normalized loss for positive and negative samples for different values of γ for a converged model. The effect of changing γ on the distribution of the loss for positive examples is minor. For negatives, however, increasing γ heavily concentrates the loss on hard examples, focusing nearly all attention away from easy negatives.
When γ = 0, our loss is equivalent to the CE loss. As γ increases, the shape of the loss changes so that "easy" examples with low loss get further discounted, see Fig. 1. FL shows large gains over CE as γ is increased. With γ = 2, FL yields a 2.9 AP improvement over the α-balanced CE loss.

For the experiments in Table 1b, for a fair comparison we find the best α for each γ. We observe that lower α's are selected for higher γ's (as easy negatives are down-weighted, less emphasis needs to be placed on the positives). Overall, however, the benefit of changing γ is much larger, and indeed the best α's ranged in just [.25, .75] (we tested α ∈ [.01, .999]). We use γ = 2.0 with α = .25 for all experiments but α = .5 works nearly as well (.4 AP lower).
Analysis of the Focal Loss. To understand the focal loss better, we analyze the empirical distribution of the loss of a converged model. For this, we take our default ResNet-101 600-pixel model trained with γ = 2 (which has 36.0 AP). We apply this model to a large number of random images and sample the predicted probability for 10^7 negative windows and 10^5 positive windows. Next, separately for positives and negatives, we compute FL for these samples, and normalize the loss such that it sums to one. Given the normalized loss, we can sort the loss from lowest to highest and plot its cumulative distribution function (CDF) for both positive and negative samples and for different settings for γ (even though the model was trained with γ = 2).
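The normalized-loss CDFs plotted in Fig. 6 can be computed with a few lines of NumPy (a sketch; per_sample_loss is assumed to hold one focal-loss value per sampled window):

    import numpy as np

    def loss_cdf(per_sample_loss):
        """Sort losses from lowest to highest, normalize them to sum to one, and
        return the cumulative fraction of total loss versus fraction of samples."""
        loss = np.sort(np.asarray(per_sample_loss))
        loss = loss / loss.sum()
        return np.cumsum(loss)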
Cumulative distribution functions for positive and negative samples are shown in Fig. 6. If we observe the positive samples, we see that the CDF looks fairly similar for different values of γ. For example, approximately 20 percent of the hardest positive samples account for roughly half of the positive loss; as γ increases more of the loss gets concentrated in the top 20 percent of examples, but the effect is minor.

The effect of γ on negative samples is dramatically different. For γ = 0, the positive and negative CDFs are quite similar. However, as γ increases, substantially more weight becomes concentrated on the hard negative examples. In fact, with γ = 2 (our default setting), the vast majority of the loss comes from a small fraction of samples. As can be seen, FL can effectively discount the effect of easy negatives, focusing all attention on the hard negative examples.

Online Hard Example Mining (OHEM). Shrivastava et al. [16] proposed to improve training of two-stage detectors by constructing minibatches using high-loss examples. Specifically, in OHEM each example is scored by its loss, non-maximum suppression (nms) is then applied, and a minibatch is constructed with the highest-loss examples. The nms threshold and batch size are tunable parameters. Like the focal loss, OHEM puts more emphasis on misclassified examples, but unlike FL, OHEM completely discards easy examples. We also implement a variant of OHEM used in SSD [9]: after applying nms to all examples, the minibatch is constructed to enforce a 1:3 ratio between positives and negatives to help ensure each minibatch has enough positives.

We test both OHEM variants in our setting of one-stage detection which has large class imbalance. Results for the original OHEM strategy and the 'OHEM 1:3' strategy for selected batch sizes and nms thresholds are shown in Table 1d. These results use ResNet-101; our baseline trained with FL achieves 36.0 AP for this setting. In contrast, the best setting for OHEM (no 1:3 ratio, batch size 128, nms of .5) achieves 32.8 AP. This is a gap of 3.2 AP, showing FL is more effective than OHEM for training dense detectors. We note that we tried other parameter settings and variants for OHEM but did not achieve better results.

Hinge Loss. Finally, in early experiments, we attempted to train with the hinge loss [35] on p_t, which sets the loss to 0 above a certain value of p_t. However, this was unstable and we did not manage to obtain meaningful results. Results exploring alternate loss functions are discussed next.
5.2 Variants of Focal Loss

We trained RetinaNet-50-600 using identical settings as before but we swap out FL for FL* with the selected parameters (see Section 3.5). These models achieve nearly the same AP as those trained with FL, see Table 2. In other words, FL* is a reasonable alternative for the FL that works well in practice.

TABLE 2
Results of FL and FL* versus CE for Select Settings

loss   γ     β     AP    AP50   AP75
CE     -     -     31.1  49.4   33.0
FL     2.0   -     34.0  52.5   36.5
FL*    2.0   1.0   33.8  52.7   36.3
FL*    4.0   0.0   33.9  51.8   36.4

We found that various γ and β settings gave good results. In Fig. 7 we show results for RetinaNet-50-600 with FL* for a wide set of parameters. The loss plots are color coded such that effective settings (models converged and with AP over 33.5) are shown in blue. We used α = .25 in all
TABLE 3
Object Detection Single-Model Results (Bounding Box AP), versus State-of-the-Art on COCO test-dev

We show results for our RetinaNet-101-800 model, trained with scale jitter and for 1.5x longer than the same model from Table 1e. Our model achieves top results, outperforming both one-stage and two-stage models. For a detailed breakdown of speed versus accuracy see Table 1e and Fig. 2.
further improves results another 1.7 AP, surpassing 40 AP on COCO.

6 CONCLUSION

In this work, we identify class imbalance as the primary obstacle preventing one-stage object detectors from surpassing top-performing, two-stage methods. To address this, we propose the focal loss which applies a modulating term to the cross entropy loss in order to focus learning on hard negative examples. Our approach is simple and highly effective. We demonstrate its efficacy by designing a fully convolutional one-stage detector and report extensive experimental analysis showing that it achieves state-of-the-art accuracy and speed. Source code is available at https://github.com/facebookresearch/Detectron [39].
REFERENCES

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 580-587.
[2] R. Girshick, "Fast R-CNN," in Proc. Int. Conf. Comput. Vis., 2015.
[3] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. Neural Inf. Process. Syst., 2015.
[4] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[5] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017.
[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740-755.
[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[8] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[9] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vis., 2016.
[10] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, "DSSD: Deconvolutional single shot detector," arXiv:1701.06659, 2016.
[11] J. Dai, Y. Li, K. He, and J. Sun, "R-FCN: Object detection via region-based fully convolutional networks," in Proc. Neural Inf. Process. Syst., 2016.
[12] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, pp. 154-171, 2013.
[13] C. L. Zitnick and P. Dollár, "Edge boxes: Locating object proposals from edges," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 391-405.
[14] P. O. Pinheiro, R. Collobert, and P. Dollár, "Learning to segment object candidates," in Proc. Neural Inf. Process. Syst., 2015.
[15] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár, "Learning to refine object segments," in Proc. Eur. Conf. Comput. Vis., 2016.
[16] A. Shrivastava, A. Gupta, and R. Girshick, "Training region-based object detectors with online hard example mining," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[17] K.-K. Sung and T. Poggio, "Learning and example selection for object and pattern detection," in MIT A.I. Memo No. 1521, 1994.
[18] H. Rowley, S. Baluja, and T. Kanade, "Human face detection in visual scenes," Carnegie Mellon Univ., Pittsburgh, PA, USA, Tech. Rep. CMU-CS-95-158R, 1995.
[19] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2001, pp. I-511-I-518.
[20] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester, "Cascade object detection with deformable part models," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2010, pp. 2241-2248.
[21] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, "Scalable object detection using deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014.
[22] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Comput., vol. 1, pp. 541-551, 1989.
[23] R. Vaillant, C. Monrocq, and Y. LeCun, "Original approach for the localisation of objects in images," Proc. IEEE Vis. Image Signal Process., vol. 141, pp. 245-250, 1994.
[24] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2005, pp. 886-893.
[25] P. Dollár, Z. Tu, P. Perona, and S. Belongie, "Integral channel features," in Proc. British Mach. Vis. Conf., 2009.
[26] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes (VOC) Challenge," Int. J. Comput. Vis., vol. 88, pp. 303-338, 2010.
[27] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Neural Inf. Process. Syst., 2012, pp. 1097-1105.
[28] J. Hosang, R. Benenson, P. Dollár, and B. Schiele, "What makes for effective detection proposals?" IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 4, pp. 814-830, Apr. 2016.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 346-361.
[30] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta, "Beyond skip connections: Top-down modulation for object detection," arXiv:1612.06851, 2016.
[31] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[32] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," in Proc. Int. Conf. Learn. Representations, 2014.
[33] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, "Speed/accuracy trade-offs for modern convolutional object detectors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[34] S. R. Bulò, G. Neuhold, and P. Kontschieder, "Loss max-pooling for semantic image segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.
[35] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Berlin, Germany: Springer, 2008.
[36] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015.
[37] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick, "Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
[38] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, inception-resnet and the impact of residual connections on learning," in Proc. AAAI Conf. Artif. Intell., 2017.
[39] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He, "Detectron," 2018. [Online]. Available: https://github.com/facebookresearch/detectron
[40] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017.

Tsung-Yi Lin received the master's degree from the University of California, San Diego, in 2013 and the PhD degree in electrical and computer engineering from Cornell, in 2017. He was the recipient of the Marr Prize Best Student Paper Award at ICCV 2017. He joined Google Brain as a research scientist, in 2017. His research interests include computer vision and machine learning. In particular, he is interested in learning representations for object detection and segmentation.
Priya Goyal received the master's degree in mathematics and computing from the Indian Institute of Technology, Kanpur, in 2015. She is currently a research engineer with Facebook AI Research (FAIR) working on computer vision, particularly object detection and object segmentation, high performance computing with applications to computer vision, and compilers for Deep Learning for accelerating neural networks. She co-authored a paper which was recipient of the Marr Prize Best Student Paper Award at ICCV 2017.

Ross Girshick received the PhD degree in computer science from the University of Chicago under Pedro Felzenszwalb, in 2012. He is a research scientist with Facebook AI Research (FAIR), working on computer vision and machine learning. Prior, he was a researcher with Microsoft Research and a postdoc with the University of California, Berkeley, where he was advised by Jitendra Malik and Trevor Darrell. He received the 2017 PAMI Young Researcher Award and the Marr Prize at ICCV 2017 for "Mask R-CNN".

Kaiming He joined Facebook AI Research (FAIR), in 2016 as a research scientist. Formerly he was with Microsoft Research Asia (MSRA), which he joined in 2011 after receiving the PhD degree. His research interests include computer vision and deep learning. He has received the Best Paper Award in CVPR 2009, CVPR 2016, and ICCV 2017.

Piotr Dollár received the PhD degree from UCSD under the guidance of Serge Belongie, in 2007 and has continued doing research in vision and learning since. He is a research scientist with Facebook AI Research (FAIR) with a focus on computer vision and machine learning. Prior, he spent three years with Microsoft Research (MSR). He helped cofound Anchovi Labs (acquired by Dropbox in 2012) and before that was a postdoc with the Computational Vision Lab at Caltech until 2011.