
Scale-Balanced Real-Time Object Detection With Varying Input-Image Resolution

Longbin Yan, Student Member, IEEE, Yunxiao Qin, Member, IEEE, and Jie Chen, Senior Member, IEEE

Abstract— Current object-detection methods for small-scale objects are often marred by poor performance. Using relatively high-resolution input images can be considered a remedy for this issue, but it usually leads to performance degeneration for large-scale objects. We define this problem as the imbalance of detection performance for multi-scale objects when the resolution of input images varies. In addition, the use of high-resolution images results in significant computational resource consumption and inference-speed impairment. In this paper, we propose a varying-resolution-friendly object-detection method for multi-scale objects. We analyze in detail the reasons leading to the performance degradation in the detection of large-scale objects with increasing input-image resolution, and propose a novel lightweight bidirectional feature-flow module to enhance the performance of multi-scale object detection in high-resolution images, especially for large-scale objects. The proposed approach can also ease the problems of computational resource consumption and inference-speed impairment caused by high-resolution images. Additionally, a decoupled detection head is designed to further improve performance by separating the classification and regression sub-tasks, and an adaptive feature-fusion module is designed to better fuse different feature levels. The proposed scheme alleviates the negative effects of using high-resolution input images and achieves an excellent balance between inference speed and precision. Experiments on the MS COCO dataset show that the scheme achieves 44.6 AP at 42.6 FPS and 47 AP at 26.7 FPS, showing significant advantages over the methods to which it is compared.

Index Terms— Deep convolution neural network (CNN), object detection, multi-scale features fusion.

Manuscript received 3 May 2022; revised 14 July 2022; accepted 27 July 2022. Date of publication 11 August 2022; date of current version 6 January 2023. The work of Jie Chen was supported in part by NSFC under Grant 62192713 and Grant 62171380, in part by the Key Industrial Innovation Chain Project in Industrial Domain of Shaanxi under Grant 2022ZDLGY01-02, and in part by the Technology Industrialization Plan of Xi'an under Grant XA2020-RGZNTJ-0076. This article was recommended by Associate Editor F. M. Zhu. (Corresponding author: Jie Chen.)
Longbin Yan and Jie Chen are with the School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an 710072, China (e-mail: [email protected]; [email protected]).
Yunxiao Qin is with the Neuroscience and Intelligent Media Institute, Communication University of China, Beijing 100024, China (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSVT.2022.3198329.
Digital Object Identifier 10.1109/TCSVT.2022.3198329

I. INTRODUCTION

DEEP convolutional neural networks (CNNs) have achieved significant performance breakthroughs in image object detection [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]. In practice, detection methods must be robust to a large range of object scales, which, however, is non-trivial for most detectors. It is generally challenging to detect small objects. The average precision (AP) of most popular detectors [1], [2], [3], [4], [7], [8] for small objects is only about half of that for large-scale objects, or even worse, on the commonly used evaluation dataset MS COCO [16]. The performance of detecting small objects deteriorates due to both data properties and method-design reasons: (i) small objects contain few pixels and thus provide insufficient semantic information to be accurately detected, and (ii) existing detectors generally utilize a deep backbone network to extract the semantic features of objects. The receptive field gradually becomes larger at later layers, and the feature information of small objects is consequently overwhelmed by that of large objects and the background.

Therefore, several types of methods have been developed to enhance performance in small-object detection by either augmenting the data or improving the network design. These methods can be divided into three categories as follows.

• Image pyramid-based methods [17], [18], which simultaneously feed multi-scale images to detectors;
• Feature pyramid-based methods [2], [3], [19], [20], which fuse high-level semantic features and low-level features to enhance performance;
• Data augmentation-based methods [21], [22], which augment the data specifically for small objects.

While significantly improving the performance of small-object detection, the above-mentioned methods are restricted in different aspects. The image pyramid-based methods are computationally costly as they conduct detection separately on different-scale inputs. The feature pyramid-based methods can improve the detection precision of small objects, but the improvement is limited. In addition, these methods must assign objects with different scales to multi-level features, which requires the application of more expert experience. The data-augmentation-based methods usually require significant manual intervention to tune the data-augmentation policy.

In this work, we take the novel approach of converting the task of improving the precision of small objects to another task, thereby avoiding the above-mentioned dilemmas to a certain extent. Specifically, we directly use high-resolution input images to enhance performance in small-object detection while focusing on minimizing the consequent adverse effects. We first evaluate the effectiveness of increasing the resolution of input images, which is the most intuitive way of enhancing small-object detection performance.

TABLE I: Evaluation results on the MS COCO dataset of TTFNet [8] with different resolution inputs. AP represents average precision, and AP_S, AP_M, and AP_L are the precision of small (area < 32² pixels), medium (32² ≤ area ≤ 96² pixels), and large objects (area > 96² pixels). M and FLOPs represent memory consumption and floating point operations.

Considering our motivation for achieving a superior balance between inference speed, detection precision, and training time compared with other existing state-of-the-art object-detection methods, we choose TTFNet [8] as our baseline. As illustrated in Table I, with increasing input resolutions, small-object detection performance is continuously improved, whereas the precision of detecting large objects remains almost unchanged or even decreases. Another non-negligible problem manifests, namely, that the memory consumption and computational cost of the model increase almost quadratically with the size of the input image, which severely restricts the deployment of the model on certain edge devices, such as cellphones.

Therefore, according to the above discussion, we convert the task of improving the small-scale-object detection precision to that of determining how to improve the performance of large-scale-object detection using relatively high-resolution input images. Meanwhile, we also find it necessary to determine how to achieve efficient detection by alleviating the problem of excessive resource consumption caused by high-resolution input images. This can be summarized as the scale-balance problem of small, medium, and large objects caused by varying input-image resolution.

To this end, we propose a novel framework that can simultaneously improve the detection performance of multi-scale objects. In addition, the proposed method is able to reduce the training and inference computational costs, and thus achieves a superior trade-off between inference speed and precision. We summarize our main contributions as follows:

• We meticulously analyze the factors that lead to the limited performance of the baseline method for the detection of large objects and find that deep features play a crucial role in center-based object-detection methods. Then, we design a lightweight bidirectional feature flow module (LBFFM) to fuse features of different levels. The LBFFM notably boosts the performance of multi-scale-object detection while keeping the model as lightweight as possible.
• Most detectors employ two heads on the features extracted from the backbone network for the separate classification and location-regression tasks; hence, the features fed to the two heads are the same. However, the features required by these two sub-tasks are often inconsistent [23], [24], [25]. Therefore, a decoupled detection head (DDH) is designed to provide separate features for both sub-tasks.
• An adaptive feature-fusion module (AFFM) is designed to mitigate the performance degradation caused by directly summing or concatenating features from different levels. To be specific, two- and three-way AFFMs, aiming at feature fusion with both or either of the up- and down-sampling paths, are designed separately.

Relying on the above proposed modules, the proposed method significantly improves object-detection performance on high-resolution images, and it maintains a good balance of inference speed and precision on the MS COCO dataset, on which 44.6 average precision (AP) at 42.6 frames per second (FPS) and 47 AP at 26.7 FPS are achieved.

II. RELATED WORK

A. General Object Detectors

Common object detection methods can be roughly divided into two categories, namely, the two-stage and one-stage frameworks. The two-stage methods originated from the seminal Faster RCNN [1]. Later works in this category include Mask RCNN [5], Cascade RCNN [26], etc. These methods first determine coarse bounding boxes through region proposal networks, followed by refining these boxes in the second stage. The two-stage methods feature higher precision, but they have a lower inference speed. One-stage frameworks mainly include SSD [2] and YOLO [3], which utilize an end-to-end strategy to avoid intermediate steps, and thus the forward propagation can be carried out more efficiently. In more detail, the one-stage methods include an anchor-free sub-branch. These methods do not suffer from the problem of manually setting anchor hyperparameters. This property endows the network with a stronger generalization ability and allows it to deal with objects with extreme shapes that cannot be easily covered by anchors. Early methods include DenseBox [27] and YOLOv1 [28], and later FCOS [7], FSAF [29], etc. Another type of anchor-free method is inspired by human pose estimation and predicts objects by the corner or center point of the bounding box. This type of method includes CornerNet [30], CenterNet [4], TTFNet [8], etc. Among these detectors, CenterNet has a good balance between inference speed and precision. However, the CenterNet model has slow convergence as it calculates the regression loss of the size head only at the center point. To remedy this problem, TTFNet calculates the loss for all points near the center point, weighted by a Gaussian function, which thus provides more supervision signals and greatly accelerates the convergence of the model learning process. To summarize, TTFNet [8] can be used as a superior baseline method due to its fast convergence and excellent real-time performance over most classic real-time detectors such as YOLO [3], SSD [2], CenterNet [4], and FCOS [7]. However, lacking a well-designed feature-fusion structure, TTFNet is not sufficiently flexible in dealing with multi-scale objects, especially large-scale objects in high-resolution images. In addition, the coupled features fed to the output head impair the detection accuracy due to the conflict between the classification and bounding box regression tasks. To remedy these issues, we propose a lightweight bidirectional feature flow module (LBFFM) that considers the redundancy among multi-scale features and effectively fuses multi-scale features with a lightweight structure.


Fig. 1. Overall structure of our proposed detector. The parts with dashed outlines from left to right represent the backbone, neck, and detection head of the object detector. The neck contains an LBFFM, which consists of three cascaded BFFMs, and the head part contains the DDH. AFFM2 and AFFM3 denote 2-way and 3-way AFFMs, respectively. The detailed structures of the CBA, Refinement, UP, and Down blocks are also illustrated.

The LBFFM significantly boosts the detection performance of multi-scale objects without excessive computational cost. Besides, we propose two useful designs that further improve the performance of the detector: First, an adaptive feature-fusion module (AFFM) is proposed to adaptively fuse features from different levels based on the attention mechanism, which outperforms feature-fusion methods such as vanilla summing or concatenating operations. Second, a decoupled detection head (DDH) module is proposed to decouple the classification and bounding box regression tasks.

B. Feature Fusion Methods

As a seminal feature fusion method, FPN [19] is widely used in most object detectors. It transmits high-level semantic information to low-level features through an extra top-down path, thus enabling low-level features to have enough high-level semantic information. PANet [31] proposes a two-way feature fusion on the basis of FPN, which improves FPN by adding an extra bottom-up path to shorten the distance from low-level to high-level features. BiFPN [6] is further optimized on the basis of PANet, by removing some unnecessary connections and repeating the pyramid as a module several times to obtain better feature-fusion capabilities.

C. Task Decoupling

IoU-Net [23] is the earliest work that studies the inconsistency of the features required for the classification and bounding box regression tasks. It uses an additional head to predict IoU and merges the acquired scores with the classification scores to obtain the final results. Double-Head RCNN [24] uses two independent heads for classification and regression, but the two heads are still fed with features that belong to the same proposal regions. TSD [25] uses two sub-modules to perform second regressions on the proposal regions generated by the region proposal network, and then obtains independent regions for the classification and regression tasks, respectively. The two sub-tasks are better decoupled in this way.

III. METHODS

Here, we introduce the overall structure of the proposed framework and then present the details of each module.

A. Overall Architecture

The main goal of our method is to improve the detection performance of multi-scale objects in high-resolution images. Specifically, high detection performance on small-scale objects is achieved by using high-resolution input images, while the performance on large-scale objects is preserved or improved, in contrast to regular detectors that encounter difficulties with the straightforward use of large images. Consequently, the method achieves superior overall detection performance on multi-scale objects. Fig. 1 demonstrates the overall structure of the proposed method. Similar to typical object-detector architectures, the proposed structure consists of three parts, namely, a backbone network for extracting features from input images, a neck for re-processing the preceding features, and detection heads for predicting the bounding boxes and classes of objects. Specifically, the left-hand part of the figure represents the backbone network. The middle part represents the proposed LBFFM, which is a lightweight feature-fusion module serving to improve the detection performance of large-scale objects in high-resolution images. Besides, the AFFM embedded in the LBFFM is proposed to adaptively fuse features at different scales based on an attention mechanism, which enables the LBFFM to enhance useful and suppress harmful information. The right-hand part shows the decoupled class prediction and bounding box regression heads, named DDH. It independently extracts features from the main information pipeline for conducting the class prediction and bounding box regression tasks in a decoupled manner. The useful designs of AFFM and DDH can further enhance the detection performance.
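To make the three-part organization concrete, the following is a minimal PyTorch sketch of how such a detector can be wired together. The module names (SimpleNeck, DecoupledHeads), the channel widths, and the plain summation used for fusion are our own illustrative assumptions and not the authors' released implementation.

import torch
import torch.nn as nn
import torchvision

class SimpleNeck(nn.Module):
    """Illustrative stand-in for the LBFFM: fuses multi-level backbone
    features into a single-level feature map (here with total stride 8)."""
    def __init__(self, in_channels=(128, 256, 512), out_channels=128):
        super().__init__()
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.refine = nn.Conv2d(out_channels, out_channels, 3, padding=1)

    def forward(self, feats):
        # Resize every level to the resolution of the shallowest level and sum
        # them (a plain stand-in for the bidirectional feature flows).
        target_size = feats[0].shape[-2:]
        fused = sum(
            nn.functional.interpolate(l(f), size=target_size, mode="nearest")
            for l, f in zip(self.laterals, feats))
        return self.refine(fused)

class DecoupledHeads(nn.Module):
    """Separate classification and box-regression branches (DDH-style)."""
    def __init__(self, channels=128, num_classes=80):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, num_classes, 1))
        self.reg_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 4, 1))

    def forward(self, x):
        return self.cls_branch(x), self.reg_branch(x)

class Detector(nn.Module):
    def __init__(self, num_classes=80):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1)
        self.stage2, self.stage3, self.stage4 = (backbone.layer2,
                                                 backbone.layer3,
                                                 backbone.layer4)
        self.neck = SimpleNeck((128, 256, 512), 128)
        self.heads = DecoupledHeads(128, num_classes)

    def forward(self, images):
        x = self.stem(images)
        c3 = self.stage2(x)   # total stride 8
        c4 = self.stage3(c3)  # total stride 16
        c5 = self.stage4(c4)  # total stride 32
        fused = self.neck([c3, c4, c5])
        return self.heads(fused)

if __name__ == "__main__":
    cls_map, box_map = Detector()(torch.randn(1, 3, 768, 768))
    print(cls_map.shape, box_map.shape)  # (1, 80, 96, 96) and (1, 4, 96, 96)

The point of the sketch is only the data flow: one backbone, one single-level neck output, and two decoupled prediction branches sharing that output.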


Fig. 2. The left picture depicts the attention heat map of the ResNet-18 classification network, and the right picture depicts the output activation map of the classification branch of TTFNet-ResNet-18 [8].

B. LBFFM

In this section, we first elaborate on the structure of the LBFFM. As shown in the middle part of Fig. 1, the LBFFM is a lightweight structure consisting of cascaded BFFMs for finely fusing features from different levels. The design of this module aims at improving the detection performance of multi-scale objects, especially large objects, in high-resolution images, and achieves a lightweight architecture by virtue of the scale-shifting-down and slimming designs. Then, we detail the design of the AFFM.

1) BFFM: In this subsection, we first analyze in detail the cause of the weak detection performance of the baseline method [8] for large-scale objects with high-resolution images, and propose specific designs, namely the Refinement and Up-flow structures, to alleviate this problem. Then, the complete BFFM, including the Down-flow and Up-flow structures, is presented to boost the overall detection performance for multi-scale objects.

Typical image recognition tasks do not require consideration of the global information of objects [32]. In contrast, in a center-based detector, the activation area of the classification branch is primarily concentrated at the center of the objects, which is where the global information of the objects must be examined. Figure 2 contrasts the attention heat map of a general image recognition network with the activation heat map of the classification branch of a center-based detector. For an image containing a large-scale object (an elephant in this case) that nearly fills the entire image area, it is evident that the former only focuses on the elephant's head, whereas the activation area of the latter is primarily concentrated in the elephant's body center. The former captures specific features to generate prediction results, whereas the latter requires a complete view of the objects to accurately locate their centers. In other words, a large receptive field is essential for the center-point-based object detector, particularly for large-scale objects. As shown in Table I, we believe that insufficient receptive fields are the primary reason why the baseline method has limited performance for detecting large-scale objects when using high-resolution input images. To address this issue, we discuss and propose enhancement designs based on the following three aspects:

a) Refinement for fused features: At present, the feature pyramid network widely used in detectors primarily focuses on the feature fusion of adjacent levels, but less study is conducted on fusion in a more refined manner. We propose to thicken the lateral connection layers of the feature-fusion module. Two advantages are gained in this approach: i) thickening the lateral connection layer can significantly enhance the receptive fields of the model, thereby improving the detection performance of large-scale objects; and ii) the adjacent levels' features can be fully fused, and thus refined features are acquired. The effectiveness of refinement for fused features is evaluated in Sec. IV-B1a.

Fig. 3. Up-flow structure. The green dotted lines in the figure represent the information pipelines from the high-level semantic features of the backbone to the detection head.

b) Up-flow for large-scale object detection: Generally, the features in the deep layers of the backbone network have larger receptive fields. However, most common detectors have the problem that the path from the tail end of the backbone to the detection head is "narrow", which restricts the transmission of the high-level semantic features to the detection head, thus leading to a notable degradation in detection performance, especially for large-scale objects. To enable the detector to more accurately locate the center point of a large-scale object, the deep semantic features with a large receptive field must transfer smoothly to the detection head. Therefore, as shown in Fig. 3, we design an Up-flow architecture with multiple transmission pipelines to broaden the connection between the high-level semantic information and the detection head, which can effectively enhance the information flow from the backbone to the detection head and enable the detector to see the panorama of the objects as much as possible. We verify the effectiveness of the Up-flow design in Sec. IV-B1b.

c) Complete BFFM structure: In the above discussion, the Refinement and Up-flow structures are proposed to improve the performance of large-scale object detection in high-resolution images. To further promote the performance, we additionally propose a Down-flow design to introduce spatial detail information from low-level features into high-level semantic features, forming a complete BFFM structure. The Down-flow structure further enhances large-object localization and small-object detection. The complete BFFM structure with the Up- and Down-flow designs can effectively integrate multi-scale information and thus enables the model to acquire features with diverse receptive fields to enhance the representation capacity for multi-scale objects. These performance boosts are confirmed in Sec. IV-B1c and the experimental results are presented in Table III.
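As a rough illustration of these ideas (not the authors' implementation), the sketch below wires two adjacent feature levels with a thickened lateral refinement stack, an up-flow path, a down-flow path, and a convolution-plus-sigmoid gate standing in for the 2-way AFFM. All layer widths and the number of refinement blocks are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RefineBlock(nn.Module):
    """One 'refinement' unit on a lateral connection (3x3 conv + BN + ReLU)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
    def forward(self, x):
        return self.body(x)

class Fuse2Way(nn.Module):
    """2-way adaptive fusion: element-wise gate from a conv + sigmoid."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, 3, padding=1)
    def forward(self, a, b):
        w = torch.sigmoid(self.gate(torch.cat([a, b], dim=1)))
        return w * a + (1.0 - w) * b

class BFFMBlock(nn.Module):
    """Fuses a shallow (stride s) and a deep (stride 2s) feature map."""
    def __init__(self, channels, num_refine=3):
        super().__init__()
        self.lateral_lo = nn.Sequential(*[RefineBlock(channels) for _ in range(num_refine)])
        self.lateral_hi = nn.Sequential(*[RefineBlock(channels) for _ in range(num_refine)])
        self.up = nn.Conv2d(channels, channels, 3, padding=1)              # up-flow: conv, then upsample
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # down-flow: stride-2 conv
        self.fuse_lo = Fuse2Way(channels)
        self.fuse_hi = Fuse2Way(channels)
    def forward(self, lo, hi):
        lo_r, hi_r = self.lateral_lo(lo), self.lateral_hi(hi)
        up = F.interpolate(self.up(hi_r), size=lo_r.shape[-2:], mode="nearest")
        out_lo = self.fuse_lo(lo_r, up)                # deep semantics flow upward
        out_hi = self.fuse_hi(hi_r, self.down(lo_r))   # spatial detail flows downward
        return out_lo, out_hi

if __name__ == "__main__":
    block = BFFMBlock(channels=64)
    lo, hi = torch.randn(1, 64, 96, 96), torch.randn(1, 64, 48, 48)
    out_lo, out_hi = block(lo, hi)
    print(out_lo.shape, out_hi.shape)  # (1, 64, 96, 96) (1, 64, 48, 48)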


Fig. 4. Visualization of pyramid features. P1-P4 represent features with a total stride of 4, 8, 16 and 32. As shown in the red ellipse areas connected by
arrows, we observe that the features between different levels have a certain similarity, especially in shallow adjacent levels, indicating redundant information
between features at different levels of the traditional feature pyramid.

2) Lightweight Architecture: The FPN-based object detection methods usually have a detection head after each level of features. For example, in the FCOS detector [7], a five-level feature pyramid with total strides of 8, 16, 32, 64, and 128 is used. Each level of the feature pyramid corresponds to a detection head, which is responsible for detecting objects of the corresponding scales. The final result is obtained by fusing the detection bounding boxes of the multi-scale detection heads through a non-maximum suppression (NMS) operation. Although this design improves performance, it is not optimal and efficient enough.

First, as shown in Fig. 4, we observe that there is a large amount of redundant information between features at different levels of the backbone network, especially at adjacent levels. This reveals that treating the multi-level features extracted from the backbone network uniformly in the design of the neck is not efficient enough. Besides, the design of using multiple heads not only degrades the real-time performance of the detector, but also requires a significant amount of expert experience to set the parameters of the multi-scale label assignment. Finally, the object detector often achieves better performance when inputting relatively high-resolution images, especially for small objects, but the thorny problem of increasing memory consumption and floating-point operations (FLOPs) in this case is not negligible. As a rough estimate, the increases in memory consumption and the number of FLOPs are quadratically related to the size of the input images, which also seriously impairs the real-time performance of the detector.

To remedy the problems discussed above, we propose a lightweight bidirectional feature flow module, which adopts the design philosophy of detecting objects of all scales based on single-level features. Besides, the additionally proposed structure allows the detection head to obtain high-quality fused features, while reducing the memory occupation and calculation cost as much as possible. We propose the following two designs to achieve this goal.

a) Scale shifting down design: Table I shows that, as the resolution of the input image increases, the detection performance of the model is consistently improved, but the required memory and calculation cost grow approximately quadratically with the resolution. We observe that most existing backbone networks follow the design methodology of setting the channel number and the spatial size of the feature map of the next stage to 2 and 0.5 times those of the previous stage, respectively, which keeps the amount of information contained in each stage's features roughly unchanged. Assuming that the feature map size of stage_i is C × H × W, then for stage_{i+1} the size is 2C × H/2 × W/2. Because the memory consumption is directly related to the scale of the feature map, we have

    M_stage_i = C H W m,                                       (1)

where M_stage_i represents the memory consumption of stage_i and m is the memory consumption of one pixel. Then for stage_{i+1}, we have

    M_stage_{i+1} = 2C · (H/2) · (W/2) · m = C H W m / 2.      (2)

We can approximately conclude that the memory consumption of a later stage is about half that of the previous stage. Therefore, to reduce the memory consumption when feeding high-resolution input images to the detector, we propose to build the detection head on the feature map with a larger total stride.
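The arithmetic behind Eqs. (1)-(2) and the scale-shifting-down idea can be checked with a few lines of plain Python; the bytes-per-element value and the channel widths below are illustrative assumptions, not measured values.

def stage_memory(channels, height, width, bytes_per_elem=4):
    """Eq. (1): memory of one feature map, M = C * H * W * m."""
    return channels * height * width * bytes_per_elem

# Each backbone stage doubles the channels and halves the spatial size (Eq. (2)).
C, H, W = 64, 192, 192
for i in range(4):
    print(f"stage {i}: {stage_memory(C, H, W) / 2**20:.1f} MiB")
    C, H, W = 2 * C, H // 2, W // 2
# Each later stage needs roughly half the memory of the previous one.

def prediction_cells(input_size, total_stride):
    """Number of cells in the detection head's output grid."""
    return (input_size // total_stride) ** 2

# Scale shifting down: doubling the input resolution while moving the head
# from stride 4 to stride 8 keeps the number of prediction cells unchanged.
print(prediction_cells(512, 4), prediction_cells(1024, 8))  # 16384 16384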


Fig. 5. Scale shifting down design. We use high-resolution images, but use feature maps with a larger stride to keep the number of detection boxes unchanged.

Fig. 6. Slimming design. We remove the adjacent embedded refinement modules (the detailed structure can be seen in Fig. 1) in the lateral connections, giving the module an alternating arrangement. The black dashed boxes represent the refinement modules that we removed; the black dotted arrow line illustrates that the slimming design can enlarge the receptive-field gap between different levels' features by virtue of the difference in feature-flow path length.

At first glance, building detection heads on a feature map with a larger total stride will reduce the number of prediction boxes, which thus leads to a decrease in detection performance. Fortunately, it is possible to make up for this shortcoming with higher-resolution input images. As shown in Fig. 5, we use input images with a larger resolution and build the detection head on a feature map with a larger total stride, while the number of prediction boxes is the same as before. This not only improves the detection performance of small objects by increasing the resolution of input images, but also greatly alleviates the consequent problem of a dramatic increase in memory consumption and computational cost.

b) Slimming design: Here, we elaborate on the slimming design, which makes it possible to minimize the consumption of computing resources while preserving detection accuracy. As shown in Fig. 4, there are numerous redundant components in the multi-level features extracted from the backbone, so treating each level's features uniformly is inefficient. To further improve performance, as depicted in Fig. 1 and Fig. 6, we propose a slimming design that alternately removes the refinement modules embedded in the lateral connections after careful consideration. With this design, the middle slimmed path (i.e., the 16× lateral path in Fig. 1) can absorb the requisite features from the routes on both sides, compensating for the information loss caused by the slimming design. On this basis, we can achieve comparable performance while minimizing the consumption of computing resources.

It is worth noting that the proposed slimming design provides an added benefit. Facing objects of varying sizes in real-world scenes, if the receptive fields of two adjacent-level features to be fused are similar, the fusion may be inefficient due to the lack of input-feature diversity. Therefore, a sufficiently large receptive-field gap between different levels of features is advantageous for enhancing the detection performance of multi-scale objects. Without the slimming design, the features of each level pass through the same number of refinement modules, and thus the receptive fields of each level's features increase synchronously, which reduces the receptive-field gap between different levels of features. Here, we illustrate this point with the structure enclosed by the green dashed box in the lower-left part of Figure 6. Denote the receptive fields of the 16× and 32× features by RF_16× and RF_32×, respectively, and the receptive field of a refinement block by RF_block. For structures without the slimming design, the receptive fields of the two-way features to be fused by the AFFM can be expressed as

    RF_fuse-16×-without = RF_16× + ⌊RF_block/2⌋,          (3)
    RF_fuse-32×-without = RF_32× + ⌊RF_block/2⌋.          (4)

In contrast, with the slimming design, the corresponding receptive fields are given by

    RF_fuse-16×-with = RF_16×,                             (5)
    RF_fuse-32×-with = RF_32× + ⌊RF_block/2⌋,              (6)

where ⌊·⌋ denotes the floor function. The receptive-field gaps between the two-way features to be fused (i.e., feature 16× and feature 32× here) without/with the slimming design are given by

    GAP_without = RF_fuse-32×-without − RF_fuse-16×-without = RF_32× − RF_16×,                   (7)
    GAP_with = RF_fuse-32×-with − RF_fuse-16×-with = RF_32× − RF_16× + ⌊RF_block/2⌋.             (8)

We can observe from (7) and (8) that the two-way features to be fused with the slimming design have a larger receptive-field gap than those without the slimming design. A larger receptive-field gap between the features to be fused implies a more significant difference in perceiving ability, thus enabling the fused feature to have more diverse receptive fields and improving the detection performance for multi-scale objects.

In summary, the model with the lightweight architecture obviously reduces the training memory consumption, calculation cost, number of parameters, and inference time, with only a slight performance degradation compared with the model without the lightweight architecture. This will be confirmed by the experimental results listed in Table V.
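A quick numerical check of Eqs. (3)-(8); the receptive-field values used here are placeholders chosen only to show that the slimming design widens the gap by ⌊RF_block/2⌋.

RF_16x, RF_32x, RF_block = 300, 600, 40  # illustrative receptive fields (pixels)

# Without slimming: both lateral paths gain the same refinement receptive field.
fuse_16_without = RF_16x + RF_block // 2          # Eq. (3)
fuse_32_without = RF_32x + RF_block // 2          # Eq. (4)
gap_without = fuse_32_without - fuse_16_without   # Eq. (7) -> RF_32x - RF_16x

# With slimming: the 16x path carries no extra refinement before fusion.
fuse_16_with = RF_16x                              # Eq. (5)
fuse_32_with = RF_32x + RF_block // 2              # Eq. (6)
gap_with = fuse_32_with - fuse_16_with             # Eq. (8) -> gap_without + RF_block // 2

print(gap_without, gap_with)  # 300 320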


Fig. 7. Subfigure (a) illustrates the 3-way feature fusion module. A convolutional layer and a sigmoid function are used to compute the attention weights, which are used to fuse the features based on element-wise operations (⊕ and ⊗ are element-wise addition and multiplication). The 3-way fusion is used in the layers that contain both the Up- and Down-sampling operations (the AFFM3 module in Fig. 1; Input1 and Input3 are features from the Up- and Down-sampling modules, and Input2 is the lateral connection feature). Subfigure (b) illustrates the 2-way feature fusion module. It is used for layers that contain only the Up- or Down-sampling operation (the AFFM2 module in Fig. 1; Input1 is the feature from the Up- or Down-sampling module, and Input2 is the lateral connection feature).

Fig. 8. Illustration of our proposed DDH module. For the classification and regression branches, the model can adaptively extract the features that benefit each of them via independent deformable convolutions. The small red square in the figure represents the convolution kernel, and the arrows on it represent the direction offsets.

3) AFFM: Most existing object detectors fuse low- and high-level features by direct addition or concatenation. However, directly fusing the features in this manner results in mutual repulsion when predicting objects of different scales, leading to a decline in performance [33]. To address this issue, we propose an adaptive feature fusion module that fuses features of different levels adaptively via attention modules. Current soft attention includes spatial and channel attention. The former [34] obtains a mask with the same spatial size as the feature map and then applies element-wise multiplication as spatial-domain weights. Using global average pooling and fully connected layers on the features, the latter [35] obtains a vector with the same length as the number of feature channels, and then applies channel-wise multiplication to weight the features. In the proposed framework, we employ both spatial and channel attention in a single module to improve the detector's information-capturing ability. As illustrated in Fig. 7, three-way and two-way adaptive feature-fusion modules are designed separately. Notably, the proposed design directly employs a convolutional layer and sigmoid functions to obtain element-level weights with the same dimensions as the feature, which simultaneously (1) mitigates the performance reduction caused by the inconsistency of receptive fields and semantic information across different feature levels in the spatial domain, and (2) fuses, in the channel domain, the features obtained by different convolution kernels for different objects or parts.

C. Decoupled Detection Head

The output features from the neck are directly connected to the classification and regression heads in the majority of object detection methods. Nevertheless, the features required for these two tasks are inconsistent [25]. The regression task relies primarily on the features of object edge areas, whereas the classification task likely relies more on the features of object interiors. In the proposed framework, the classification and regression heads are therefore decoupled. The path indicated by the red arrow in Fig. 8 is the primary information-flow pipeline, through which both heads independently extract features for themselves. Specifically, three groups of deformable convolution [36] and upsampling layers are used to progressively fuse the features from top to bottom, with the final features then being fed to the corresponding detection head. The structure is depicted in detail in Fig. 1. The Split is a module for separating features based on two deformable convolutional layers on either side of the main feature flow. Deformable convolution has a strong capacity for capturing features at non-fixed locations; consequently, it can effectively extract features that are advantageous for particular tasks.

D. Loss Function

Our framework uses the same loss function as TTFNet [8], which is briefly described below.

1) Classification Loss: Let the output size of the classification head be C × H × W, where C represents the number of categories, and H and W represent the height and width of the feature map, respectively. A 2D Gaussian kernel is used to establish the heat-map optimization target of the c-th channel at pixel (x, y):

    H_xyc = exp( −(x − x_0)² / (2σ_x²) − (y − y_0)² / (2σ_y²) ),      (9)

where σ_x = (w_box · α)/6 and σ_y = (h_box · α)/6. Given an object bounding box, (x_0, y_0) represents the coordinates of its center point, h_box and w_box represent its height and width, and α is a regulatory factor. The classification loss, modified based on the focal loss [37], can be expressed as

    L_cls = −(1/M) Σ_{i,j,c} f_ijc,  where
    f_ijc = (1 − Ĥ_ijc)^{α_f} log(Ĥ_ijc)                          if H_ijc = 1,
    f_ijc = (1 − H_ijc)^{β_f} (Ĥ_ijc)^{α_f} log(1 − Ĥ_ijc)        otherwise,       (10)

where Ĥ_ijc represents the prediction of the network and M represents the number of ground-truth boxes. The parameters α_f and β_f are set to 2 and 4 [8].


2) Bounding Box Regression Loss: The bounding box regression loss, based on GIoU [38], can be expressed as follows:

    L_reg = (1/N_reg) Σ_{(i,j)∈A_m} GIoU(B̂_ij, B_m) × W_ij,                          (11)

where B̂_ij and B_m represent the prediction box at pixel (i, j) and its target box, N_reg represents the number of pixels used to compute the regression loss, and A_m represents the region of the m-th box. W_ij is a weight based on the Gaussian distribution, given by

    W_ij = log(a_m) × G_m(i, j) / Σ_{(x,y)∈A_m} G_m(x, y),    if (i, j) ∈ A_m,
    W_ij = 0,                                                  if (i, j) ∉ A_m,        (12)

where G_m(i, j) is the value of the heat map calculated by (9), and a_m is the area of A_m.

3) Total Loss: Combining L_cls and L_reg, the total loss is finally given by

    L_total = L_cls + λ L_reg,                                                         (13)

where λ is set to 5 [8].
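The sketch below mirrors Eqs. (11)-(13) under the reading given above, i.e., per-pixel GIoU loss terms weighted by the normalized Gaussian weights of Eq. (12); the helper giou_loss and all tensor shapes are illustrative assumptions rather than the authors' code.

import torch

def giou_loss(pred_boxes, target_box):
    """1 - GIoU between N predicted boxes (x1, y1, x2, y2) and one target box."""
    px1, py1, px2, py2 = pred_boxes.unbind(dim=1)
    tx1, ty1, tx2, ty2 = target_box
    inter_w = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0)
    inter_h = (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    inter = inter_w * inter_h
    area_p = (px2 - px1) * (py2 - py1)
    area_t = (tx2 - tx1) * (ty2 - ty1)
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=1e-6)
    hull = ((torch.max(px2, tx2) - torch.min(px1, tx1)) *
            (torch.max(py2, ty2) - torch.min(py1, ty1))).clamp(min=1e-6)
    return 1.0 - (iou - (hull - union) / hull)

def reg_loss(pred_boxes, target_box, gaussian_in_box, box_area):
    """Eqs. (11)-(12): Gaussian-weighted regression loss inside one box region A_m."""
    w = torch.log(torch.tensor(float(box_area))) * gaussian_in_box / gaussian_in_box.sum()
    return (giou_loss(pred_boxes, target_box) * w).sum() / pred_boxes.shape[0]

def total_loss(l_cls, l_reg, lam=5.0):
    """Eq. (13)."""
    return l_cls + lam * l_reg

if __name__ == "__main__":
    preds = torch.tensor([[10., 10., 50., 40.], [12., 8., 52., 44.]])
    target = torch.tensor([8., 8., 48., 40.])
    gauss = torch.tensor([0.9, 0.6])  # G_m values at the two sampled pixels
    print(total_loss(torch.tensor(1.2), reg_loss(preds, target, gauss, box_area=40 * 32)))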
IV. EXPERIMENTS

Experiments are mainly carried out on the MS COCO dataset [16], which contains 80 categories with 118k training images, 5k validation images, and 20k test images. All of our evaluations are conducted on the validation set, as in the vast majority of other similar works. The performance indices AP (0.5:0.05:0.95), AP_0.5, AP_0.75, AP_S, AP_M, and AP_L are reported. To further verify the effectiveness of the proposed method, we also conduct experiments on the PASCAL VOC dataset [39].

A. Training Settings

The experiments are based on the object detection toolbox MMDetection [40]. Unless otherwise specified, all hyper-parameters are consistent with those in TTFNet [8]. We train the models on a server with eight 2080Ti GPUs, and the inference speed is measured on a Tesla V100. To evaluate the effectiveness of the proposed method, we conduct experiments on multiple backbone networks, including ResNet [32], Res2Net [41], DarkNet [3], and the Swin transformer [42]. The mini-batch size of a single GPU is uniformly set to 16 and the learning rate is set to 0.016 (corresponding to 8 GPUs). Using the method described in [43], we simultaneously scale the mini-batch size and learning rate for high-resolution inputs. We use the SGD optimizer to train the model for 120 epochs. The momentum and weight decay are set to 0.9 and 0.004, and the learning rate is multiplied by a coefficient of 0.1 at the 90th and 110th epochs, respectively. To facilitate comparison with the baseline [8], the experiments, with the exception of those in Table VIII, are based on the ResNet-18 backbone, the number of training epochs is 12, and the resolution of the input images is 768 × 768.

B. Effectiveness of LBFFM

In this section, we first evaluate the performance enhancement associated with the cascaded BFFM structure and then demonstrate the benefits of the lightweight architecture for lowering training and testing expenses.

1) Validity of BFFM: In this subsection, we incrementally validate the effectiveness of each key component of the BFFM. Specifically, we first verify the performance improvement, especially for large-scale objects, resulting from the refinement module and Up-flow structures, and then assess the improvement resulting from the complete BFFM. In addition, we investigate the impact of BFFM structures with sampling rates of 4× and 8× on detection performance.

a) Experiments of the refinement module: To refine the fused features, we embed cascading refinement modules within the lateral connections. The specific structure of the refinement module is depicted in Fig. 1. In Table II, the experimental results with increasing numbers of cascaded refinement modules are displayed. It can be observed that the embedded refinement modules boost the detector's AP in comparison to the baseline method. Note that the AP can be significantly increased from 37.8 to 43.1 (+5.3 AP), particularly for large-scale objects. Nonetheless, as the number of embedded refinement modules increases, the performance gain reaches a bottleneck. This demonstrates that a reasonable number of refinement modules is sufficient.

b) Experiments of the Up-flow for large-scale object detection: We evaluate the significance of the Up-flow in enhancing deep features with larger receptive fields and higher-level semantic information, which is directly related to the enhancement of large-scale object detection performance. As depicted in Fig. 3, we uniformly embed three cascaded refinement modules in each level of the lateral connection during the experiments. We construct the information flows from the top layer to the bottom layer at intervals in order to bridge a multi-branch information path, thereby enhancing the transfer of deep features from the backbone to the detection head. The experimental results are presented in Table III, labelled "Up-flow group". Observations indicate that with the Up-flow, the detector can achieve an additional 1.5 AP gain, and for medium and large objects there are still 1.5 and 1.4 AP gains, respectively, over the previous performance boost (i.e., compared to the "No-flow group"). This demonstrates that the Up-flow design can overcome the bottleneck caused by stacking cascaded refinement modules alone and further enhance performance.

c) Effectiveness of the complete BFFM: The complete BFFM consists of both the Up-flow and Down-flow modules. In Fig. 1, the default Up-flow block is a 2× up-sampling module that first compresses the number of channels through a 3 × 3 convolutional layer and then performs an interpolation operation. The Down-flow block consists of a 2× down-sampling module implemented by a 3 × 3 convolutional layer with a stride of 2.

TABLE II: Performance comparisons with different numbers of cascaded refinement modules in the lateral connections.

TABLE III: Performance improvements with the Up and Down feature flows.

TABLE IV: Performance analysis of using BFFMs with different sampling structures. 2×, 4×, and 8× represent sampling ratios of 2, 4, and 8, respectively.

TABLE V: Comparative experiments with and without the slimming architecture. The resolution of the input image is 768 × 768 and the mini-batch size is 8. a, b, and c represent the three different slimming design patterns illustrated in Fig. 10.

TABLE VI: Ablation experiments on the LBFFM, AFFM, and DDH.

Fig. 9. Three kinds of BFFM using different sample ratios for feature fusion. (a) Default BFFM with a 2× sample ratio; its feature connections are used only for adjacent-level features. (b) BFFM with a 4× sample ratio; feature connections are additionally added between cross-layers. (c) BFFM with an 8× sample ratio.

Fig. 10. Subfigures (a), (b), and (c) illustrate three different patterns of the slimming design, namely, the up-to-down, down-to-up, and random slimming styles. Solid and open circles represent locations corresponding to retained and removed refinement modules, respectively.

In this experiment, the neck in Fig. 1 contained three cascading BFFMs. The experimental results are reported in Table III. We find that the BFFM improves detection performance by 2.3 AP compared to the baseline without the flow module. The AP gains for small, medium, and large objects are 2, 2.2, and 3.1, respectively.

The preceding experiments convincingly confirm the performance enhancement of the BFFM designs. As shown in Table III, a final improvement of 5 AP over the baseline method is achieved, with a significant improvement of 8.4 AP for large-scale objects.

d) Do 4× and 8× BFFMs benefit the model?: In order to evaluate the efficacy of the Up- and Down-flow modules beyond 2×, we add 4× and 8× connections to the BFFM. The flow structures utilizing 4× and 8× connections are depicted in Figs. 9(b) and (c), respectively. The 4× and 8× BFFM structures fuse not only the features of adjacent levels, but also those of cross levels. The number of refinement modules is uniformly set to three, and the experimental results of the 2×, 4×, and 8× BFFMs are presented in Table IV. Unfortunately, these experiments reveal that the 4× and 8× flows are incapable of further enhancing detection performance.

2) Effectiveness of the Lightweight Architecture: We next evaluate the benefit introduced by the proposed lightweight architecture. We utilize the scale shifting down design, based on the analysis presented in Sec. III-B2a, to redesign the original 4× down-sampling head into an 8× down-sampling head, while simultaneously removing the corresponding lateral connections of the 4× branch. In addition, we use the slimming design depicted in Fig. 10 to remove the refinement modules from the lateral connections at intervals, based on the analysis presented in Sec. III-B2b. We analyze the impact of the lightweight architecture on computing resource demands from the four perspectives of training memory consumption, FLOPs, parameters (Param), and inference time (Inf Time). The experimental results are presented in Table V. Compared to the experiment without the lightweight architecture, it can be observed that the slimming design patterns reduce memory usage, FLOPs, parameters, and inference time by 53%, 38%, 20%, and 44%, respectively, while precision decreases by only 0.8∼1.1 AP. This inarguably allows us to achieve an optimal balance between precision and inference speed.


Fig. 11. Visual comparison of our proposed method and the TTFNet baseline on the COCO dataset with inputs of varying resolutions. All experiments are
performed utilising the ResNet-18 backbone. Images can be zoomed in to see details. GT stands for ground truth. The red ovals represent the difference in
small-scale object detection results between the proposed method and baseline. In addition, it can be observed that our method significantly outperforms the
baseline for large-scale object detection with high-resolution input images.

C. AFFM and DDH Experiments

In addition, we conduct experiments to confirm that the AFFM and DDH improve performance. Specifically, as shown in Fig. 1, we employ the AFFM in the inner regions of the LBFFM and DDH, where the features are fused. Table VI demonstrates that the AFFM results in an additional improvement of 0.7 AP over the LBFFM-based results.


TABLE VII: Performance comparison of our method and the baseline [8] with different input-image resolutions. The backbone is ResNet-18 and the number of training epochs is 12.

Separately for the classification and bounding box regression branches, we add a structure with three layers of deformable convolution for the DDH. It is found that the DDH improves detection performance by 0.4 AP compared to the baseline with the LBFFM and AFFM. Using all of the proposed modules, the experimental performance is improved by 5.1 AP over the baseline.

D. Scale Balancing With Varying Resolution

We also evaluate the performance of the proposed method with varying input-image resolutions. Based on the LBFFM, AFFM, and DDH, we train the model with 384 × 384, 512 × 512, 768 × 768, and 1024 × 1024 resolutions using ResNet-18 to evaluate the performance in detecting multi-scale objects. Experimental results are shown in Table VII, from which it can be observed that, when the input resolution increases from 384 × 384 to 1024 × 1024, the performance of the proposed method improves significantly and uniformly for small, medium, and large objects. Specifically, for small-object detection, the proposed LBFFM smoothly transfers bottom-level detail features into the detection head, which is beneficial for small-object detection, especially for precise localization. Besides, the AFFM can further improve the detection performance of small objects by strengthening useful features while suppressing harmful information based on the soft-attention mechanism. We can observe from Table VI that the LBFFM improves the AP_S of small objects from 13.1 to 14.1, and the AFFM can further provide a 0.7 AP_S gain. Then, we conduct experiments to evaluate the effectiveness of increasing the input-image resolution on the detection performance of small-scale objects. With the high-resolution input, the semantic features of small objects are significantly enhanced, thus subsequently improving the recognition ability for small objects. As reported in Table VII, we can observe that with the input resolution increasing from 384² to 1024², the AP_S and AR_S of small-scale object detection are boosted from 8.1 to 17.5 and from 16.9 to 32.3, respectively. In Figure 11, we visualize some detection results of our method and the baseline. From the areas circled by the red ellipses in the images and the results in Table VII, it is evident that our method is superior to the baseline in terms of small-object detection, i.e., our method has higher recall (e.g., the "stop sign" in the 2nd-row, 3rd- and 5th-column images), fewer false positives (e.g., objects that do not belong to the "person" class in the 8th-row, 5th-column image), and more precise box regression (e.g., the "chair" in the 7th-row, 5th-column image).

It is especially worth noting that, for large objects, the proposed method obtains approximately 10 AP_L and 7.4 AR_L gains. In addition, compared with the baseline, the proposed method exhibits a better AP/AR growth trend for medium- and large-scale objects when the input-image resolution increases. These results solidly prove that the proposed method has a better balance for objects of different scales when the resolution of the input image varies. In Figure 11, we can clearly observe that our method has obvious advantages in large-scale object detection. In comparison, the baseline method is susceptible to missed detections (e.g., the "umbrella" in the 3rd-row, 4th-column images), low confidence scores (e.g., the "bus" in the 2nd-row, 2nd- and 4th-column images), and multiple detection boxes (e.g., the "bus" in the 1st-row, 2nd-column image) on large-scale objects.

E. Comparison With State-of-the-Art Methods

Finally, we report the performance comparison between the proposed method and existing state-of-the-art object-detection frameworks in Table VIII. It is clearly observed that our method has significant advantages in terms of accuracy, inference speed, and FLOPs compared to most other competitors. Specifically, compared with superior multi-stage methods, such as Cascade RCNN (with ResNeXt-101) and HTC (with ResNet-101), our model based on ResNet-50 or Darknet-53 is about 30∼35 FPS faster in inference speed and requires 70% fewer FLOPs with comparable AP. Compared with the state-of-the-art TOOD, our method is still about 17 FPS faster with 65% fewer FLOPs and comparable AP.

On the other hand, compared with several superior real-time detectors, the proposed method also surpasses most algorithms in both precision and inference speed. Compared with TTFNet, our method has advantages in both AP and FPS, while it also has a higher upper limit with the same backbone. Compared with EfficientDet, the proposed method can achieve an inference-speed improvement of approximately 15∼60 FPS with comparable AP, while the number of training epochs for the proposed method is only about a quarter of that for EfficientDet (120 vs. 500). Note that EfficientDet has an edge over our method in terms of FLOPs due to its extensive use of depth-wise convolutions. However, the excessive use of depth-wise convolutions seriously reduces the computational density of the model, making it difficult to effectively exploit the parallel computing capability of the chip (e.g., GPU), thereby impairing the inference speed of the detectors [75], [76]. In pursuit of faster inference speed, our model extensively employs standard 3 × 3 convolutions with higher computational density. As a result, it has a significant benefit in inference speed, although it has a disadvantage in FLOPs compared to detectors that rely heavily on depth-wise convolutions.
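To make the FLOPs argument concrete, the following back-of-the-envelope calculation compares a standard 3 × 3 convolution with a depth-wise separable one on an assumed 256-channel, 96 × 96 feature map; the roughly 8-9× FLOPs reduction is exactly the kind of gap that does not automatically translate into faster GPU inference because of the lower computational density.

def conv_flops(cin, cout, h, w, k=3):
    """Multiply-accumulate count of a standard k x k convolution."""
    return cin * cout * h * w * k * k

def depthwise_separable_flops(cin, cout, h, w, k=3):
    """Depth-wise k x k convolution followed by a 1 x 1 point-wise convolution."""
    return cin * h * w * k * k + cin * cout * h * w

C, H, W = 256, 96, 96
std = conv_flops(C, C, H, W)
dws = depthwise_separable_flops(C, C, H, W)
print(f"standard: {std / 1e9:.2f} GMACs, depth-wise separable: {dws / 1e9:.2f} GMACs, "
      f"ratio: {std / dws:.1f}x")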


TABLE VIII: Comparison of the proposed method with other state-of-the-art methods. The results of our methods are measured in the single-model, single-scale setting. The FPS values are calculated with batch size 1 on a Tesla V100 GPU, and the values marked with ∗ represent our re-test results based on [44] in the same testing environment. The FLOPs are calculated with the MMDetection toolbox [40]. The methods marked with + denote the model with the DDH and AFFM. Optimal and suboptimal results are shown in bold. We annotate the year on some recent works (2020∼2022).

Compared with ASFF, our method has a higher AP and faster inference speed with the same backbone, while the number of training epochs is only 40% of ASFF's (120 vs. 300).


TABLE IX the experimental results are shown in Table X. We can observe


C OMPARISONS OF O UR M ETHOD W ITH THE BASELINE OF TTFN ET [8] IN that the proposed method can also achieve superior results on
T ERMS OF T RAINING M EMORY C ONSUMPTION , PARAMETERS , FLOP S
AND FPS. T HE I NPUT I MAGE S IZES OF O UR M ETHOD AND B ASE -
the PASCAL VOC dataset.
LINE A RE S ET TO 768 AND 512, R ESPECTIVELY
Furthermore, we perform extensive experiments to compare the performance of our method with several different superior detectors, including YOLOv3, FCOS, Cascade R-CNN, DETR, deformable DETR, and ViDT, based on lightweight backbones. The backbones include MobileNet-v2 [77] and vision-transformer-based networks including Swin-nano [42], [66] and DeiT-tiny [78]. It can be observed from Table VIII that our ResNet-18 based model outperforms most lightweight-backbone-based competitors in terms of accuracy or inference speed, while not being significantly disadvantaged in FLOPs.
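For context on what swapping in such a lightweight backbone involves, the sketch below (ours, assuming torchvision's MobileNet-v2 implementation rather than the authors' code) reuses the convolutional trunk as a feature extractor; the 1 × 1 convolution merely stands in for the detection neck and head, which are omitted here.

```python
# Sketch of using MobileNet-v2 as a drop-in backbone (assumes torchvision is installed).
# The neck/head are represented by a placeholder 1x1 conv for illustration only.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class MobileNetV2Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = mobilenet_v2().features   # convolutional trunk, random weights

    def forward(self, x):
        return self.features(x)                   # stride-32 feature map, 1280 channels

backbone = MobileNetV2Backbone()
neck_head = nn.Conv2d(1280, 256, kernel_size=1)   # placeholder for the neck and head

x = torch.randn(1, 3, 512, 512)
feat = backbone(x)                                 # -> (1, 1280, 16, 16)
print(feat.shape, neck_head(feat).shape)
```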
F. Experiments on PASCAL VOC

To further evaluate the effectiveness of the proposed method, we conduct experiments on the PASCAL VOC dataset [39]. We adopt the commonly used “07+12” protocol [1], i.e., the training set includes the trainval splits of VOC2007 and VOC2012, with a total of 16551 images, and the test set is the test split of VOC2007, with a total of 4952 images. The number of object categories is 20. The mean average precision at an intersection-over-union threshold of 0.5 is used as the evaluation metric [39], and the experimental results are shown in Table X. We can observe that the proposed method also achieves superior results on the PASCAL VOC dataset.

TABLE X
Experiments on the PASCAL VOC dataset.
V. CONCLUSION

Using high-resolution images is an effective way to improve small-object detection performance. However, directly using high-resolution images with the majority of existing methods typically leads to an imbalance in detection performance across multi-scale objects. To address this dilemma, we design a novel framework and obtain a real-time detector that is better balanced across input images with varying resolutions. Utilizing the lightweight bidirectional feature-flow module, the decoupled detection head, and the adaptive feature-fusion module, the proposed method achieves superior detection performance and outperforms the vast majority of existing detectors in terms of both inference speed and precision. To achieve a good balance between inference speed and detection precision, our method employs a single-level detection head, which simplifies the structure at the expense of a potential loss of detection precision. Note that some existing detectors utilize detection heads with multiple detection levels. Allocating multi-scale objects to different levels improves detection accuracy but introduces new challenges, such as the hyperparameter settings of multi-level detection heads, and slows inference speed due to more complex post-processing. Future research will attempt to implement multi-level detection heads to improve detection accuracy while maintaining inference speed.

REFERENCES

[1] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proc. Adv. Neural Inform. Process. Syst., 2015, pp. 91–99.
[2] W. Liu et al., “SSD: Single shot multibox detector,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 21–37.
[3] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” 2018, arXiv:1804.02767.
[4] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” 2019, arXiv:1904.07850.
[5] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” in Proc. ICCV, Oct. 2017, pp. 2961–2969.
[6] M. Tan, R. Pang, and Q. V. Le, “EfficientDet: Scalable and efficient object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10781–10790.
[7] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully convolutional one-stage object detection,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9627–9636.
[8] Z. Liu, T. Zheng, G. Xu, Z. Yang, H. Liu, and D. Cai, “Training-time-friendly network for real-time object detection,” in Proc. AAAI Conf. Artif. Intell., 2020, pp. 11685–11692.
[9] J. Deng, Y. Pan, T. Yao, W. Zhou, H. Li, and T. Mei, “Single shot video object detector,” IEEE Trans. Multimedia, vol. 23, pp. 846–858, 2021.
[10] J. Li et al., “Multistage object detection with group recursive learning,” IEEE Trans. Multimedia, vol. 20, no. 7, pp. 1645–1655, Jul. 2018.
[11] X. Chen, H. Li, Q. Wu, K. N. Ngan, and L. Xu, “High-quality R-CNN object detection using multi-path detection calibration network,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 2, pp. 715–727, Feb. 2021.
[12] S. J. Lee, S. Lee, S. I. Cho, and S. Kang, “Object detection-based video retargeting with spatial–temporal consistency,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 12, pp. 4434–4439, Dec. 2020.
[13] J. U. Kim, J. Kwon, H. G. Kim, and Y. M. Ro, “BBC Net: Bounding-box critic network for occlusion-robust object detection,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 4, pp. 1037–1050, Apr. 2020.


[14] X. Liang, J. Zhang, L. Zhuo, Y. Li, and Q. Tian, “Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 6, pp. 1758–1770, Jun. 2020.
[15] X. Chen, J. Yu, S. Kong, Z. Wu, and L. Wen, “Joint anchor-feature refinement for real-time accurate object detection in images and videos,” IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 2, pp. 594–607, Feb. 2021.
[16] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, pp. 740–755.
[17] B. Singh and L. S. Davis, “An analysis of scale invariance in object detection–SNIP,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3578–3587.
[18] B. Singh, M. Najibi, and L. S. Davis, “SNIPER: Efficient multi-scale training,” in Proc. Adv. Neural Inform. Process. Syst., 2018, pp. 9310–9320.
[19] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125.
[20] C. Deng, M. Wang, L. Liu, Y. Liu, and Y. Jiang, “Extended feature pyramid network for small object detection,” IEEE Trans. Multimedia, vol. 24, pp. 1968–1979, 2022.
[21] M. Kisantal, Z. Wojna, J. Murawski, J. Naruniec, and K. Cho, “Augmentation for small object detection,” 2019, arXiv:1902.07296.
[22] Y. Chen et al., “Stitcher: Feedback-driven data provider for object detection,” 2020, arXiv:2004.12432.
[23] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang, “Acquisition of localization confidence for accurate object detection,” in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 784–799.
[24] Y. Wu et al., “Rethinking classification and localization for object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10186–10195.
[25] G. Song, Y. Liu, and X. Wang, “Revisiting the sibling head in object detector,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11563–11572.
[26] Z. Cai and N. Vasconcelos, “Cascade R-CNN: Delving into high quality object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6154–6162.
[27] L. Huang, Y. Yang, Y. Deng, and Y. Yu, “DenseBox: Unifying landmark localization with end to end object detection,” 2015, arXiv:1509.04874.
[28] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[29] C. Zhu, Y. He, and M. Savvides, “Feature selective anchor-free module for single-shot object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 840–849.
[30] H. Law and J. Deng, “CornerNet: Detecting objects as paired keypoints,” in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 734–750.
[31] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8759–8768.
[32] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[33] S. Liu, D. Huang, and Y. Wang, “Learning spatial fusion for single-shot object detection,” 2019, arXiv:1911.09516.
[34] S. Woo, J. Park, J. Y. Lee, and I. S. Kweon, “CBAM: Convolutional block attention module,” in Proc. Eur. Conf. Comput. Vis. (ECCV), Sep. 2018, pp. 3–19.
[35] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7132–7141.
[36] J. Dai et al., “Deformable convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 764–773.
[37] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[38] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 658–666.
[39] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Sep. 2010.
[40] K. Chen et al., “MMDetection: Open MMLab detection toolbox and benchmark,” 2019, arXiv:1906.07155.
[41] S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. Torr, “Res2Net: A new multi-scale backbone architecture,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 2, pp. 652–662, Feb. 2021.
[42] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 10012–10022.
[43] P. Goyal et al., “Accurate, large minibatch SGD: Training ImageNet in 1 hour,” 2017, arXiv:1706.02677.
[44] Zylo117. Accessed: Apr. 2020. [Online]. Available: https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch
[45] J. Wang, K. Chen, R. Xu, Z. Liu, C. C. Loy, and D. Lin, “CARAFE: Content-aware ReAssembly of FEatures,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 3007–3016.
[46] T. Chen, S. Saxena, L. Li, D. J. Fleet, and G. Hinton, “Pix2seq: A language modeling framework for object detection,” in Proc. Int. Conf. Learn. Represent., 2022, pp. 1–17.
[47] K. Chen et al., “Hybrid task cascade for instance segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 4974–4983.
[48] X. Chen, H. Li, Q. Wu, F. Meng, and H. Qiu, “Bal-R2CNN: High quality recurrent object detection with balance optimization,” IEEE Trans. Multimedia, vol. 24, pp. 1558–1569, 2021.
[49] T. Kong, F. Sun, H. Liu, Y. Jiang, L. Li, and J. Shi, “FoveaBox: Beyound anchor-based object detection,” IEEE Trans. Image Process., vol. 29, pp. 7389–7398, 2020.
[50] T. Vu, K. Haeyong, and C. D. Yoo, “SCNet: Training inference sample consistency for instance segmentation,” in Proc. AAAI Conf. Artif. Intell., 2021, pp. 2701–2709.
[51] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu, “GCNet: Non-local networks meet squeeze-excitation networks and beyond,” in Proc. IEEE Int. Conf. Comput. Vis. Worksh., Oct. 2019, pp. 1971–1980.
[52] S. Qiao, H. Wang, C. Liu, W. Shen, and A. Yuille, “Micro-batch training with batch-channel normalization and weight standardization,” 2019, arXiv:1903.10520.
[53] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, “Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 9759–9768.
[54] X. Zhang, F. Wan, C. Liu, R. Ji, and Q. Ye, “FreeAnchor: Learning to match anchors for visual object detection,” in Proc. Adv. Neural Inform. Process. Syst., 2019, pp. 147–155.
[55] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin, “RepPoints: Point set representation for object detection,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9657–9666.
[56] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin, “Region proposal by guided anchoring,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2965–2974.
[57] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin, “Libra R-CNN: Towards balanced learning for object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 821–830.
[58] Y. Luo et al., “CE-FPN: Enhancing channel information for object detection,” Multimedia Tools Appl., vol. 4, pp. 1–20, Apr. 2022.
[59] S. Wu, J. Yang, X. Wang, and X. Li, “IoU-balanced loss functions for single-stage object detection,” Pattern Recognit. Lett., vol. 156, pp. 96–103, Apr. 2022.
[60] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang, “Mask scoring R-CNN,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 6409–6418.
[61] B. Li, Y. Liu, and X. Wang, “Gradient harmonized single-stage detector,” in Proc. AAAI Conf. Artif. Intell., 2019, pp. 8577–8584.
[62] X. Zhu, H. Hu, S. Lin, and J. Dai, “Deformable ConvNets V2: More deformable, better results,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 9308–9316.
[63] C. Feng, Y. Zhong, Y. Gao, M. R. Scott, and W. Huang, “TOOD: Task-aligned one-stage object detection,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 3490–3499.
[64] Z. Dong, G. Li, Y. Liao, F. Wang, P. Ren, and C. Qian, “CentripetalNet: Pursuing high-quality keypoint pairs for object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10519–10528.
[65] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: Deformable transformers for end-to-end object detection,” in Proc. Int. Conf. Learn. Represent., 2020, pp. 1–16.


[66] H. Song et al., “VIDT: An efficient and effective fully transformer-based object detector,” in Proc. Int. Conf. Learn. Represent., 2022, pp. 1–18.
[67] Y. Fang et al., “You only look at one sequence: Rethinking transformer in vision through object detection,” in Proc. Adv. Neural Inform. Process. Syst., vol. 34, 2021, pp. 26183–26197.
[68] Z. Tan et al., “GiraffeDet: A heavy-neck paradigm for object detection,” in Proc. Int. Conf. Learn. Represent., 2022, pp. 1–17.
[69] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020, pp. 213–229.
[70] X. Li et al., “Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection,” in Proc. Adv. Neural Inform. Process. Syst., 2020, pp. 21002–21012.
[71] X. Zhu, D. Cheng, Z. Zhang, S. Lin, and J. Dai, “An empirical study of spatial attention mechanisms in deep networks,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 6688–6697.
[72] Z. Chen, C. Yang, Q. Li, F. Zhao, Z.-J. Zha, and F. Wu, “Disentangle your dense object detector,” in Proc. ACM Int. Conf. Multimedia, 2021, pp. 4939–4948.
[73] X. Dai et al., “Dynamic head: Unifying object detection heads with attentions,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 7373–7382.
[74] H. Zhang, Y. Wang, F. Dayoub, and N. Sunderhauf, “VarifocalNet: An IoU-aware dense object detector,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 8514–8523.
[75] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “ShuffleNet V2: Practical guidelines for efficient CNN architecture design,” in Proc. Eur. Conf. Comput. Vis., Sep. 2018, pp. 116–131.
[76] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, “RepVGG: Making VGG-style ConvNets great again,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 13733–13742.
[77] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4510–4520.
[78] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 10347–10357.
[79] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object detection via region-based fully convolutional networks,” in Proc. Adv. Neural Inform. Process. Syst., vol. 29, 2016, pp. 379–387.

Longbin Yan (Student Member, IEEE) received the B.S. and M.S. degrees from the School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China, in 2013 and 2016, respectively, where he is currently pursuing the Ph.D. degree with the Laboratory of Centre of Intelligent Acoustics and Immersive Communications (CIAIC). His research interests include hyperspectral image analysis, image super-resolution, image recognition, and object detection.

Yunxiao Qin (Member, IEEE) received the Ph.D. degree in control science and engineering from the School of Automation, Northwestern Polytechnical University, Xi'an, China, in 2021. He is currently working as a Lecturer with the Neuroscience and Intelligent Media Institute, Communication University of China. His current research interests include meta-learning, adversarial attack, continual learning, and brain-inspired artificial intelligence.

Jie Chen (Senior Member, IEEE) received the B.S. degree from Xi'an Jiaotong University, Xi'an, China, in 2006, the Dipl.-Ing. degree in information and telecommunication engineering from the University of Technology of Troyes (UTT), Troyes, France, in 2009, the M.S. degree in information and telecommunication engineering from Xi'an Jiaotong University in 2009, and the Ph.D. degree in systems optimization and security from UTT in 2013. From 2013 to 2014, he was with the Lagrange Laboratory, University of Nice Sophia Antipolis, Nice, France. From 2014 to 2015, he was with the Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA. He is currently a Professor with the Center of Intelligent Acoustics and Immersive Communications, Northwestern Polytechnical University, Xi'an. His current research interests include adaptive signal processing, distributed optimization, hyperspectral image analysis, and acoustic signal processing. Dr. Chen was the Technical Co-Chair of IWAENC'16 held in Xi'an. He also serves as the Co-Chair for the IEEE Signal Processing Society Summer School 2019 and the IEEE International Workshop on Machine Learning for Signal Processing 2022. He was a Distinguished Lecturer for the Asia–Pacific Signal and Information Processing Association from 2018 to 2019.
