Article
Lightweight Attention Pyramid Network for Object
Detection and Instance Segmentation
Jiwei Zhang 1 , Yanyu Yan 2 , Zelei Cheng 3 and Wendong Wang 1,2, *
1 School of Software Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China;
[email protected]
2 State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and
Telecommunications, Beijing 100876, China; [email protected]
3 The Center for Education and Research in Information Assurance and Security (CERIAS), Purdue University,
West Lafayette, IN 47906, USA; [email protected]
* Correspondence: [email protected]
Received: 7 December 2019; Accepted: 13 January 2020; Published: 28 January 2020
1. Introduction
With the popularization of artificial intelligence systems [1,2] and the Internet of Things (IoT) [3], and the
accumulation of image data [4], automatic object detection is increasingly used in video
surveillance and robot vision. Compared with manual inspections, computer vision technology can
effectively improve the inspection efficiency and avoid the influence of subjectivity on accuracy. Object
detection is a fundamental computer-vision task [5], and the presence of multiple scales and aspect ratios is among
its most challenging problems. More and more attention is being paid to this problem,
and various detection methods have emerged. In image pyramid methods [6,7], as shown in
Figure 1a, images are resized to multiple scales (keeping the same aspect ratio) for training
and inference. Because of the large number of resized images, these methods are computationally expensive.
By using pyramids of reference boxes (called anchors), multiple scales and ratios
are handled within a single feature map size, without incurring extra computational cost for
multiple feature map sizes [8]. As shown in Figure 1b, a trade-off among the anchors is made.
The anchor described above is adopted by methods such as Faster R-CNN [8], strengthened RPN [9],
and region-based fully convolutional networks [10]. Anchors with different scales and ratios are
created in order to predict candidate bounding boxes for multi-scale objects. However, the topmost
feature map is limited by its fixed receptive field size, so this approach struggles to detect objects
that are too large or too small.
Recent object detection methods, such as the single shot multibox detector (SSD) [11] and the multi-scale
deep convolutional neural network (MS-CNN) [12], address object detection at multiple
scales and ratios. As shown in Figure 1c, a bottom-up feature pyramid is adopted;
dedicated classifiers and bounding box regressors [13] for multiple scales/sizes are trained
to handle objects of different scales and ratios. However, because there are multiple types
of bounding box regressors, each must be fitted with its own training data. This
necessitates dividing the original training data into multiple subsets, so each regressor is fed
much less training data, and adequate training data for each regressor cannot be guaranteed. Besides,
each regressor requires different proposal sizes, which adds implementation complexity and
training burden.
As shown in Figure 1d, using sideways predictions with deep supervision is another sound
way to exploit feature maps from different stages. Such methods generate the final prediction by
fusing all of the sideways predictions [14,15]. A top-down structure, such as that in Figure 1e, has
recently become popular. This structure has been proven to work well in feature pyramid networks
(FPN) [16], the deconvolutional single shot detector (DSSD) [17], and Mask R-CNN [18]. However, fusing feature
maps layer by layer is neither efficient nor effective enough when many layers must be combined.
The FPN develops a top-down pathway with lateral connections to build high-level semantic
feature maps at all scales. This architecture shows significant improvements as a generic feature
extractor in several applications, such as object detection and instance segmentation. However,
the FPN module only takes neighboring feature maps into consideration; adding the feature maps
from top to bottom, level by level, weakens the influence of the top feature map on the bottom
feature map.
However, these methods seldom exploit both the relationships among feature channels and the fusion
of low-level and high-level features [19]. In this paper, we propose an upgraded method,
the Attention Pyramid Network (APN), which utilizes both multiscale feature fusion and a
channel-wise attention mechanism to generate new feature pyramids. Instead of fusing only the
neighboring feature maps, we designed an efficient module that takes all levels of the feature
maps into consideration through an adaptive transformation, which shortens the information propagation
path across the feature maps when generating a new feature pyramid. Our network then applies
feature attention [20], which reweights the different channels of the feature maps. Unlike a plain
adaptive transformation or feature attention block used in isolation, the proposed APN combines
multiscale feature fusion with channel-wise attention to generate new feature pyramids.
Moreover, the proposed APN is a lightweight network. The modification is independent of the
backbone convolutional architecture, and, in this paper, we present our results based on ResNets [21].
We evaluated our method on the 2018 CVPR WAD dataset and the MS COCO dataset [22]. Without
any bells and whistles, we achieve better results than the baseline methods. Moreover, our module
achieves this improvement with very little extra detection time, and our code and models will be made
publicly available. The contributions of this paper are summarized as follows:
• The adaptive transformation module can increase the information interaction between multiscale
features and improve the performance of feature pyramids.
• The feature attention block adopts a channel-wise attention mechanism to generate new feature
pyramids and makes full use of the relationships between channels. In addition, it strengthens
important channel information and weakens unimportant channel information.
• A novel APN is proposed, and it is lightweight. The performance of the proposed APN was
evaluated via extensive simulations using the MS COCO and the WAD datasets; the results show
the effectiveness of our approach.
2. Related Work
Feature fusion is often used for image classification, visual tracking, object detection, etc. A single
classifier usually cannot handle complex data classification, so feature fusion is used to combine
different visual features for image classification [23]. In visual tracking, lighting, occlusion, and other
factors often degrade recognition performance; association models based on sparse representation [24,25]
have been proposed for feature fusion and show a certain robustness in visual tracking. In recent years,
some deep-learning algorithms have adopted the idea of feature fusion. Song et al. proposed a deep
feature fusion network for the classification of hyperspectral images [26]; by using residual learning to
optimize the convolutional layers and fusing the features of different layers, the method achieved better
results than other classifiers. To handle difficult cases such as small objects in aerial images, a feature
fusion deep network was proposed [27]; fusing features across network layers captures the spatial
relationships between objects, yielding accurate detection results. Facial expressions carry very important
information for human behavior research. To recognize three-dimensional faces, Tian et al. proposed a
deep feature fusion convolutional neural network (CNN) [28], which combines different types of
two-dimensional face information to fine-tune the network model and achieves effective recognition
results. In addition, Starzacher et al. combined artificial neural networks and support vector machines
to achieve feature fusion and applied it to embedded devices [29]. Their approach can be used for vehicle
classification and tracking in traffic monitoring, and achieves good execution times and classification
rates on embedded platforms.
Attention was first introduced in the NLP area. Generally, attention is intended to choose
significant feature representations at some specific locations. Luong et al. [30] proposed global and
local attention approaches for neural machine translation. The global approach pays attention to all
source words while the local one only focuses on the subset of source words at a time. Vaswani et al. [31]
proposed a self-attention method for machine translation; the encoder and decoder can be connected
via an attention mechanism. Attention has since gained great popularity in the computer vision area.
Wang et al. [32] proposed residual attention network where attention modules are stacked to generate
attention-aware features; attention residual learning was also proposed to train very deep networks.
Fan et al. [33] developed a few-shot object detection model with an attention module on RPN and a
detector. Thus, the irrelevant background boxes can be filtered out. Wang et al. [34] proposed a novel
domain-attention mechanism for a universal object detection model. A domain-attention module is
leveraged and enables adapters to specialize on some individual domains. Zhang et al. [35] proposed
a backward attention filter to help the region proposal network generate reasonable regions of interest
(ROIs). The informative features are emphasized and, by utilizing the semantic features from the deep
layers, it can suppress the distractive features. Li et al. [36] sequentially integrated different kinds
of soft attention into CNNs for single-stage object detection. Huang et al. [37] designed Inverted
Attention to improve object detection. It inverts attention to feature maps and assigns more attention
to complementary object parts.
Figure 2. The architecture of our proposed Attention Pyramid Network (APN). A building block
illustrates the feature attention fusion block, which takes the P5 generation process as an example.
First, we take a single-scale image of arbitrary size as input. Then, we generate proportionally
sized feature maps using a backbone convolutional architecture. We use ResNets [21] as the
backbone to compute a feature hierarchy that consists of feature maps at several scales with a scaling
step of 2. In the feedforward computation of the ConvNet, there are often many layers that produce output
maps of the same size; we say these layers belong to the same network stage. As shown
in Figure 2, we use the feature activations output by the last residual block of each stage, denoted
as {C2 , C3 , C4 , C5 } for the outputs of conv2, conv3, conv4, and conv5 with the corresponding strides
of {4, 8, 16, 32} pixels with respect to the input images, respectively. However, we do not integrate
conv1 into the pyramid because of its large memory consumption. As shown in Figure 2, we define
the relative concept of the bottom feature maps and the top feature maps.
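As a concrete illustration of how this backbone hierarchy can be collected (a minimal sketch, not the authors' released code, assuming a torchvision ResNet-50 and a 512 × 512 input):

```python
import torch
from torchvision.models import resnet50

class ResNetFeatureHierarchy(torch.nn.Module):
    """Sketch: collect {C2, C3, C4, C5} from a ResNet-50 backbone.
    conv1 is excluded from the pyramid because of its large memory footprint."""
    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)
        self.stem = torch.nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        # layer1..layer4 correspond to the conv2..conv5 stages (strides 4, 8, 16, 32).
        self.stages = torch.nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)          # last residual block output of each stage
        return feats                 # [C2, C3, C4, C5]

c2, c3, c4, c5 = ResNetFeatureHierarchy()(torch.randn(1, 3, 512, 512))
print(c2.shape, c3.shape, c4.shape, c5.shape)   # spatial sizes 128, 64, 32, 16
```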
To model the channel-wise interaction, we allow multiple channels to be emphasized rather than enforcing a
one-hot activation. To learn such a non-mutually-exclusive relationship, we apply a sigmoid nonlinearity.
The output tensor is given the same shape as the input vector s. This process is denoted as F_ex.
Figure 3. Illustration of our feature attention block. Note that we display our feature map without the
dimension of batch size for easier understanding.
In this paper, we use the 1 × 1 convolutional layer to replace the fully-connected layers—instead of
using the complex SENet [39] structure—to implement the channel-wise interaction. Our implementation
is a lighter module that proved effective in the experiments presented in Section 4.
After obtaining the vector s, we apply a rescaling to the concatenated feature maps, implemented by
adopting channel-wise multiplication between the concatenated feature maps and the vector. In this
way, we obtain the channel-wise attention feature maps.
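A minimal sketch of such a feature attention block, assuming the concatenated feature maps have C channels and using a 1 × 1 convolution bottleneck in place of fully-connected layers; the reduction ratio and class name are our own assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAttentionBlock(nn.Module):
    """Sketch: global pooling -> 1x1 conv bottleneck -> sigmoid -> channel-wise rescaling."""
    def __init__(self, channels, reduction=16):          # reduction ratio is an assumption
        super().__init__()
        self.squeeze = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.excite = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, x):                                 # x: concatenated maps (N, C, H, W)
        s = F.adaptive_avg_pool2d(x, 1)                   # per-channel descriptor (N, C, 1, 1)
        s = torch.sigmoid(self.excite(F.relu(self.squeeze(s))))   # F_ex: non-exclusive weights
        return x * s                                      # channel-wise multiplication (rescaling)
```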
$$L(p_k, b_k) = \frac{1}{N_{gtbox}} \sum_k L_{clas}(p_k, p_k^*) + \frac{\lambda}{N_{anchor}} \sum_k p_k^* L_{reg}(b_k, b_k^*) \quad (1)$$
where k is the index of an anchor in the mini-batch, N_gtbox denotes the number of ground-truth boxes,
and N_anchor denotes the number of anchors. p_k is the predicted probability that anchor k contains the
target object, and p_k^* is a binary label indicating whether the anchor is positive. b_k is the vector
representing the predicted bounding box, while b_k^* denotes the ground-truth bounding box.
The multi-task loss has two parts. The first is the classification loss; here, we use a cross-entropy
function, which, for the binary object/non-object label, takes the form

$$L_{clas}(p_k, p_k^*) = -\left[\, p_k^* \log p_k + (1 - p_k^*) \log(1 - p_k) \,\right] \quad (2)$$
The second part is the regression loss. We want the predicted bounding box to be as close as possible
to the ground truth. Here, we measure the distance with a smooth L1 norm:
$$L_{reg}(b_k, b_k^*) = \mathrm{smooth}_{L1}(b_k - b_k^*) = \begin{cases} |b_k - b_k^*|, & \text{if } |b_k - b_k^*| < 1 \\ (b_k - b_k^*)^2, & \text{otherwise} \end{cases} \quad (3)$$
The term p_k^* L_reg(b_k, b_k^*) means that the regression loss is activated only for positive anchors,
i.e., p_k^* = 1. The outputs of the classification and regression layers are p_k and b_k, respectively;
they are normalized by N_gtbox and N_anchor and balanced by the weight λ.
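To make the composition of Equation (1) concrete, the sketch below evaluates the multi-task loss using a binary cross-entropy for L_clas and the piecewise distance of Equation (3) for L_reg; the tensor shapes and the clamp on the ground-truth count are our assumptions:

```python
import torch
import torch.nn.functional as F

def regression_loss(b, b_star):
    """Piecewise distance of Equation (3), applied coordinate-wise and summed per box."""
    diff = (b - b_star).abs()
    return torch.where(diff < 1.0, diff, diff ** 2).sum(dim=-1)

def detection_loss(p, p_star, b, b_star, lam=1.0):
    """Sketch of Equation (1).
    p:      (K,) predicted object probabilities for the anchors in the mini-batch
    p_star: (K,) binary labels (1 = positive anchor), as float
    b, b_star: (K, 4) predicted and ground-truth box parameterizations
    """
    n_gtbox = p_star.sum().clamp(min=1.0)     # N_gtbox (clamp avoids division by zero)
    n_anchor = float(p.numel())               # N_anchor
    l_clas = F.binary_cross_entropy(p, p_star, reduction="sum") / n_gtbox
    l_reg = (p_star * regression_loss(b, b_star)).sum() / n_anchor
    return l_clas + lam * l_reg
```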
For regression, the parameterizations of the four coordinates (top-left corner, width, and height)
are defined as:

$$b_x = \frac{x - x_a}{w_a}, \quad b_y = \frac{y - y_a}{h_a}, \quad b_w = \log\!\left(\frac{w}{w_a}\right), \quad b_h = \log\!\left(\frac{h}{h_a}\right),$$
$$b_x^* = \frac{x^* - x_a}{w_a}, \quad b_y^* = \frac{y^* - y_a}{h_a}, \quad b_w^* = \log\!\left(\frac{w^*}{w_a}\right), \quad b_h^* = \log\!\left(\frac{h^*}{h_a}\right) \quad (4)$$
Here, x, y, w, h denote the coordinates of the predicted box; x_a, y_a, w_a, h_a denote those of the
anchor box; and x^*, y^*, w^*, h^* denote those of the ground-truth box.
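As a usage illustration of this parameterization (a sketch assuming boxes and anchors are stored row-wise as (x, y, w, h) tensors):

```python
import torch

def encode_boxes(boxes, anchors):
    """Equation (4): parameterize boxes relative to anchors.
    boxes, anchors: (K, 4) tensors laid out as (x, y, w, h)."""
    x, y, w, h = boxes.unbind(dim=1)
    xa, ya, wa, ha = anchors.unbind(dim=1)
    bx = (x - xa) / wa
    by = (y - ya) / ha
    bw = torch.log(w / wa)
    bh = torch.log(h / ha)
    return torch.stack([bx, by, bw, bh], dim=1)

# Applying the same function to the ground-truth boxes yields (b_x*, b_y*, b_w*, b_h*).
```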
$$\mathrm{IoU} = \frac{\mathrm{area}(B_p \cap B_{GT})}{\mathrm{area}(B_p \cup B_{GT})}, \quad (5)$$
where B_p is the predicted region from the detection/segmentation network and B_GT is the
ground-truth region. We used the mean IoU calculated over all test images.
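A small sketch of Equation (5) for axis-aligned boxes, assuming the (x1, y1, x2, y2) corner convention:

```python
def iou(box_p, box_gt):
    """Equation (5): intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```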
The AP is defined as follows:
$$AP = \frac{1}{N} \sum_{\text{all}} \frac{tp}{tp + fp}, \quad (6)$$
where tp denotes the number of true positives, fp the number of false positives, and N the number of
detection/segmentation results. AP in Equation (6) is averaged over all categories; traditionally, this is
called "mean average precision" (mAP), but the MS COCO dataset makes no distinction between AP and
mAP. Specifically, the MS COCO dataset uses 10 IoU thresholds of 0.50:0.05:0.95, a break from the
tradition of computing AP at a single IoU of 0.5 (which corresponds to our metric AP@0.5).
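The sketch below illustrates this evaluation protocol by averaging the per-image precision of Equation (6) over the 10 IoU thresholds; it is a simplified illustration of the definition above, not the official COCO evaluation code:

```python
import numpy as np

def precision(matched_ious, num_detections, thr):
    """Equation (6) on one image: tp / (tp + fp), where a detection counts as a
    true positive if its best-match IoU with a ground-truth box is >= thr."""
    tp = sum(1 for v in matched_ious if v >= thr)
    return tp / num_detections if num_detections else 0.0

def coco_style_ap(per_image):
    """Average precision over images and over the IoU thresholds 0.50:0.05:0.95.
    per_image: list of (matched_ious, num_detections) pairs, one per image."""
    thresholds = np.linspace(0.50, 0.95, 10)
    return float(np.mean([
        np.mean([precision(m, n, t) for m, n in per_image]) for t in thresholds
    ]))

print(coco_style_ap([([0.9, 0.6, 0.3], 3), ([0.55], 1)]))
```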
We used a batch of two images for training on two NVIDIA TITAN X GPUs (one image per GPU).
The shorter and longer edges of the images were 800 and 1333 pixels, unless otherwise noted.
For object detection, we trained our model with a learning rate that starts from 0.01, which decreased by
a factor of 0.1 after 960 k and 1280 k iterations and finally terminated at 1440 k iterations. This training
schedule resulted in 12.17 epochs over the 118,287 images in the MS COCO 2017 training dataset.
The rest of the hyperparameters remained the same as the Mask R-CNN.
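A brief sketch of this learning-rate schedule, assuming an SGD optimizer with momentum (the usual Mask R-CNN setup, and an assumption on our part):

```python
import torch

model = torch.nn.Conv2d(3, 64, kernel_size=3)          # stand-in for the detection model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Learning rate 0.01, multiplied by 0.1 after 960k and 1280k iterations;
# training stops at 1440k iterations (the loop would call optimizer.step()
# followed by scheduler.step() once per iteration).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[960_000, 1_280_000], gamma=0.1)
```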
We evaluated the performance of the proposed APN on the MS COCO dataset and compared
the test-dev results with recent state-of-the-art models that use different feature map fusion methods.
The results of our proposed ResNet-50 based APN are presented in Table 1. As shown in Table 1,
our model, trained and tested on single-scale images, outperforms the baselines by a large margin over
all of the compared evaluation metrics. In particular, it improves the AP by 1.8 points compared with
the original module.
Table 1. Comparisons of single model results on the MS COCO object detection benchmark.
In Table 1, the FPN applied to Mask R-CNN achieves the best results among the related baseline
methods. That network uses ResNet-101 as its backbone, and ResNet-101 usually achieves better
recognition than ResNet-50 [21]. The backbone of our APN is ResNet-50, yet it still achieves better
results than the FPN applied to Mask R-CNN with a ResNet-101 backbone. Therefore, our APN is
more effective than the FPN for object recognition. In addition, we plotted performance comparison
curves for the APN and the baseline methods. In Figure 4, the abscissa is the IoU threshold and the
ordinate is the AP value of the APN and the baseline methods; different colored curves distinguish
the methods. The proposed APN attains higher AP than all the other methods over the entire
threshold range.
Figure 4. Mean average precision of the proposed APN-based model and the FPN-based models at
different IoU thresholds.
Figure 5 compares the methods based on our model and the FPN. The qualitative results show that
our improved model detects more objects; in particular, it can detect objects that highly overlap
but belong to different individuals. The design of our model takes advantage of contextual information
from the whole feature pyramid to perform object detection.
Figure 5. Selected examples from the MS COCO 2017 test dataset using our method (Mask R-CNN + APN)
and the FPN-based methods (Mask R-CNN + FPN, YOLOv3 + FPN, and RetinaNet + FPN).
Then, we performed the instance segmentation experiments on the COCO dataset. We report
the performance of the proposed APN on the COCO dataset for comparison. As shown in Table 2,
the APN applied to Mask R-CNN trained and tested on single-scale images already outperforms the
FPN applied to Mask R-CNN, with both using ResNet-50 as the backbone network. All instantiations of
our proposed model outperform the corresponding baseline variants by nearly 2.0 points on average.
Table 2. The value of instance segmentation results for the COCO dataset. All entries are
single-model results.
                     AP^M    AP^M@0.5   AP^M@0.75   AP^M_S   AP^M_M   AP^M_L
Mask R-CNN + FPN     46.5    72.4       51.0        29.2     50.6     59.7
Mask R-CNN + APN     48.3    74.7       53.1        31.2     52.4     61.4
                     +1.8    +2.3       +2.1        +2.0     +2.4     +1.7
Moreover, Figure 6 shows some examples of the segmentation results obtained using the APN.
In Figure 6, we can see that different classes of objects are accurately segmented.
Figure 6. Selected examples of the test results on the COCO test-dev images, using the proposed
APN and running at 5 fps.
In Table 3, we compare two different designs of our fusion block. The simple channel-wise
addition fusion design results in a severe decrease in bounding box AP (2.4 points). Compared
with the simple channel-wise addition strategy, the convolutional fusion strategy takes all pixels
of the feature maps into the calculation. This suggests that the more spatial information the fusion
block considers, the better the results it obtains. The results were obtained on the MS COCO val-2017
set with a ResNet-50 backbone. Thus, adding a convolutional layer to fuse the reweighted feature
maps yields greater gains than simple channel-wise fusion.
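To make the two designs concrete, the following sketch contrasts the channel-wise addition fusion and the convolutional fusion compared in Table 3; the channel count, kernel size, and class names are our assumptions:

```python
import torch
import torch.nn as nn

class AdditionFusion(nn.Module):
    """Channel-wise addition fusion: element-wise sum of the reweighted feature maps."""
    def forward(self, feature_maps):                     # list of (N, C, H, W) tensors
        return torch.stack(feature_maps, dim=0).sum(dim=0)

class ConvFusion(nn.Module):
    """Convolutional fusion: concatenate the maps and let a conv layer mix all
    pixels and channels, which Table 3 shows gives the larger gain."""
    def __init__(self, channels, num_maps, kernel_size=3):   # kernel size is an assumption
        super().__init__()
        self.fuse = nn.Conv2d(channels * num_maps, channels,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, feature_maps):
        return self.fuse(torch.cat(feature_maps, dim=1))
```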
Table 3. The test results of the APN when the channel-wise addition fusion and convolutional fusion
are used.
We further tested the efficiency of the APN for object detection on the COCO test-dev set.
The comparison in Table 4 reveals that, with far fewer parameters, the APN still achieves better
performance than the FPN with a ResNet-101 backbone. With the same ResNet-50 backbone, the
proposed APN achieves much better detection results: as shown in Table 4, the APN based on
ResNet-50 is 2.8 points better than the FPN based on ResNet-101, with nearly 20 million fewer
parameters and the smaller ResNet-50 backbone. Since ResNeXt-101 performs better than ResNet-50,
the APN can achieve even better results with this larger backbone. To test the efficiency of our
design, we carried out further ablation studies on the individual modules of the APN. As shown in
Figure 2, the proposed feature attention fusion block is composed of the adaptive transformation
module, the feature attention block, and the fusion block. We therefore tested the detection
performance when these sub-modules are assembled separately and in combination. In Table 4,
the third row corresponds to the APN containing only the adaptive transformation module, and the
fourth row to the APN containing both the adaptive transformation module and the feature attention
block; the fusion block in these two cases is channel-wise addition fusion. The APN in the last row
uses convolutional fusion, which means that it uses the complete feature attention fusion
block structure.
As shown in Table 4, when the APN contains only the adaptive transformation module, its AP is
0.1 point higher than that of the FPN with a ResNet-50 backbone. Thus, the adaptive transformation
of feature maps from all stages marginally improves performance. However, comparing the
channel-wise addition fusion strategy and the convolutional fusion in the last two rows of Table 4,
the AP of the APN in the last row, which uses convolutional fusion, is 0.7 points higher than that of
the APN in the fourth row, which uses channel-wise addition fusion. Therefore, the convolutional
fusion strategy gains more from fusing feature maps than the channel-wise addition strategy.
In addition, from the third and fourth rows of Table 4, we can see that, when the APN uses the
feature attention block, the AP is 2.3 points higher than when the APN contains only the adaptive
transformation module. Rather than simply giving all channels of the feature map the same weight,
as the FPN does, our feature attention block learns to reweight the different channels from all
network stages. Such a reweighting strategy helps the network learn, from the training data, to take
advantage of feature maps from different network stages to construct a better fused feature map.
The APN utilizes both multiscale feature fusion and the channel-wise attention mechanism to
generate new feature pyramids, and it can be shown that the FPN is a special case of the APN.
We look forward to seeing more applications replace the FPN with our proposed APN and achieve
better performance. In addition, we analyzed the inference time of the proposed module. The APN
adds 4 ms on average at inference time, which is negligible compared with the inference time of the
backbone ConvNet. The experiment was performed with the ResNet-50 model on a single P40 GPU
with an input image size of 512 × 512.
As shown in Table 5, the AP of each separate class also improves significantly, which indicates the
benefit of using a fusion module with contextual information. Therefore, the proposed APN is better
than the FPN at recognizing both the individual categories and the overall set.
Figure 7. Each row shows the object detection and instance segmentation results.
Table 5. Object detection results using Mask R-CNN [18] evaluated on the 2018 WAD test-dev set:
(a) per-class object detection results on the 2018 WAD test-dev; and (b) overall object detection
results on the 2018 WAD test-dev.
(a)
               AP      AP@0.5   AP@0.75   AP_S    AP_M    AP_L
FPN (car)       35.7    59.8     36.8      11.7    48.1    77.5
APN (car)       39.8    61.3     42.4      15.7    53.5    79.2
FPN (moto)      23.5    46.8     20.3      7.1     26.2    47.1
APN (moto)      28.1    51.6     27.9      10.0    31.3    52.0
FPN (bike)      12.6    30.5     8.2       1.6     14.5    31.1
APN (bike)      15.7    34.0     11.1      3.5     17.5    36.2
FPN (person)    16.4    36.2     12.1      4.9     31.6    58.2
APN (person)    18.3    38.3     14.7      6.3     35.3    57.0
FPN (truck)     44.7    69.3     50.6      17.0    50.1    65.4
APN (truck)     48.8    73.4     58.1      21.9    54.9    66.0
FPN (bus)       41.0    66.7     44.3      10.7    47.4    67.2
APN (bus)       46.0    71.7     51.3      17.3    56.4    68.8
FPN (tricycle)  34.7    61.6     36.4      29.2    36.2    47.7
APN (tricycle)  37.3    65.3     40.1      31.9    37.8    50.9
(b)
               AP      AP@0.5   AP@0.75   AP_S    AP_M    AP_L
FPN (all)       30.7    49.7     29.2      12.8    34.4    55.7
APN (all)       32.4    54.7     33.8      15.4    39.4    59.2
                +1.7    +5.0     +4.6      +2.6    +5.0    +3.5
Similar to the object detection results on the Mask R-CNN, we also report the instance
segmentation performance of our proposed module on the WAD test-dev for comparison. As shown in
Table 6, compared with the baseline, our APN consistently improves the performance across different
evaluation metrics.
Across the evaluation metrics, the segmentation results show an improvement of 1.3 points in AP
compared with the original FPN. Figure 7 shows examples of the instance segmentation results.
Table 6. Instance segmentation proposals evaluated on the first 1k 2018 WAD test-dev images,
denoted as AP^M. All models were trained on the train set.
               AP^M    AP^M@0.5   AP^M@0.75   AP^M_S   AP^M_M   AP^M_L
FPN (all)       38.2    46.9       28.1        11.7     32.8     53.7
APN (all)       39.5    51.7       30.0        12.7     36.3     55.1
                +1.3    +4.8       +1.9        +1.0     +3.5     +1.4
5. Conclusions
In this paper, we propose the APN, which utilizes both the multiscale feature fusion and the
multiscale-aware channel-wise attention mechanism to generate new feature pyramids for the task of
object detection. We reshape the features from all feature levels and shorten the distance between the
lower and topmost feature levels to enable reliable information propagation. The APN is a lightweight
network and the efficient feature attention block can enhance important feature channels. In addition,
the APN adopts both convolutional and channel-wise addition feature fusion strategies, which help
to improve the recognition performance of the whole network. The additional running time of the
APN is 4 ms, which is negligible compared with that of the backbone network. Experiments on the
MS COCO and 2018 WAD datasets show that the APN significantly improves performance compared
with the baseline methods, without any bells and whistles. In the future, we will consider the utilization of video
and RGB-D data for the task of object detection and extend our method to support more general
detection models.
Author Contributions: Conceptualization, J.Z. and W.W.; Formal analysis, J.Z. and Z.C.; Investigation, J.Z. and
Z.C.; Methodology, J.Z.; Project administration, W.W.; Validation, Y.Y.; Writing—original draft, J.Z. and Y.Y.; and
Writing—review and editing, J.Z. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Yan, X.; Cui, B.; Xu, Y.; Shi, P.; Wang, Z. A Method of Information Protection for Collaborative Deep Learning
under GAN Model Attack. IEEE/ACM Trans. Comput. Biol. Bioinform. 2019. [CrossRef] [PubMed]
2. Sugimori, H.; Kawakami, M. Automatic Detection of a Standard Line for Brain Magnetic Resonance Imaging
Using Deep Learning. Appl. Sci. 2019, 9, 3849. [CrossRef]
3. Xu, Y.; Ren, J.; Wang, G.; Zhang, C.; Yang, J.; Zhang, Y. A Blockchain-based Nonrepudiation Network
Computing Service Scheme for Industrial IoT. IEEE Trans. Industr. Inform. 2019, 15, 3632–3641. [CrossRef]
4. Xu, Y.; Ren, J.; Zhang, Y.; Zhang, C.; Shen, B.; Zhang, Y. Blockchain Empowered Arbitrable Data Auditing
Scheme for Network Storage as a Service. IEEE Trans. Serv. Comput. 2019. [CrossRef]
5. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale
dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983.
6. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; pp. 886–893.
7. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
[CrossRef]
8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal
networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada,
7–12 December 2015; ACM: New York, NY, USA, 2015; pp. 91–99.
9. Wang, G.; Guo, J.; Chen, Y.; Li, Y.; Xu, Q. A PSO and BFO-Based Learning Strategy Applied to Faster R-CNN
for Object Detection in Autonomous Driving. IEEE Access 2019, 7, 18840–18859. [CrossRef]
10. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object detection via region-based fully convolutional networks.
In Proceedings of the Advances in Neural Information Processing Systems, (NIPS 2016), Barcelona, Spain,
5–10 December 2016; ACM: New York, NY, USA, 2016; pp. 379–387.
11. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox
detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands,
8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
12. Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A unified multi-scale deep convolutional neural network
for fast object detection. In Proceedings of the European Conference on Computer Vision, Amsterdam,
The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 354–370.
13. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision,
Araucano Park, Chile, 11–18 December 2015; pp. 1440–1448.
14. Xie, S.; Tu, Z. Holistically-nested edge detection. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1395–1403.
15. Wang, X.; Ma, H.; Chen, X.; You, S. Edge preserving and multi-scale contextual neural network for salient
object detection. IEEE Trans. Image Process. 2018, 27, 121–134. [CrossRef] [PubMed]
16. Lin, T.Y.; Dollár, P.; Girshick, R.; Kaiming, H.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for
Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
17. Fu, C.Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional single shot detector. arXiv 2017,
arXiv:1701.06659.
18. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International
Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
19. Lee, G.; Tai, Y.W.; Kim, J. Deep saliency with encoded low level distance map and high level features.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA,
27–30 June 2016; pp. 660–668.
20. Fu, J.; Zheng, H.; Mei, T. Look closer to see better: Recurrent attention convolutional neural network for
fine-grained image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4438–4446.
21. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
22. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco:
Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich,
Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
23. Mangai, U.G.; Samanta, S.; Das, S.; Chowdhury, P.R. A survey of decision fusion and feature fusion strategies
for pattern classification. IETE Tech. Rev. 2010, 27, 293–307. [CrossRef]
24. Lan, X.; Ma, A.J.; Yuen, P.C. Multi-cue visual tracking using robust feature-level fusion based on joint
sparse representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Columbus, OH, USA, 23–28 June 2014; pp. 1194–1201.
25. Lan, X.; Ma, A.J.; Yuen, P.C.; Chellappa, R. Joint sparse representation and robust feature-level fusion for
multi-cue visual tracking. IEEE Trans. Image Process. 2015, 24, 5826–5841. [CrossRef] [PubMed]
26. Song, W.; Li, S.; Fang, L.; Lu, T. Hyperspectral image classification with deep feature fusion network. IEEE
Trans. Geosci. Remote Sens. 2018, 56, 3173–3184. [CrossRef]
27. Long, H.; Chung, Y.; Liu, Z.; Bu, S. Object Detection in Aerial Images Using Feature Fusion Deep Networks.
IEEE Access 2019, 7, 30980–30990. [CrossRef]
28. Tian, K.; Zeng, L.; McGrath, S.; Yin, Q.; Wang, W. 3D Facial Expression Recognition Using Deep Feature
Fusion CNN. In Proceedings of the 2019 30th Irish Signals and Systems Conference (ISSC), Maynooth,
Ireland, 17–18 June 2019; pp. 1–6.
29. Starzacher, A.; Rinner, B. Embedded realtime feature fusion based on ANN, SVM and NBC. In Proceedings of
the 2009 12th International Conference on Information Fusion, Seattle, WA, USA, 6–9 July 2009; pp. 482–489.
30. Luong, T.; Pham, H.; Manning, C.D. Effective Approaches to Attention-based Neural Machine Translation.
In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon,
Portugal, 17–21 September 2015; pp. 1412–1421.
31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I.
Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems,
Long Beach, CA, USA, 4–9 December 2017; ACM: New York, NY, USA, 2017; pp. 5998–6008.
32. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for
image classification. arXiv 2017, arXiv:1704.06904.
33. Fan, Q.; Zhuo, W.; Tai, Y.W. Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector.
arXiv 2019, arXiv:1908.01998.
34. Wang, X.; Cai, Z.; Gao, D.; Vasconcelos, N. Towards universal object detection by domain attention.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA,
16–20 June 2019; pp. 7289–7298.
35. Zhang, C.; Kim, J. Object Detection With Location-Aware Deformable Convolution and Backward Attention
Filtering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach,
CA, USA, 16–20 June 2019; pp. 9452–9461.
36. Li, Y.L.; Wang, S. HAR-Net: Joint Learning of Hybrid Attention for Single-stage Object Detection. arXiv
2019, arXiv:1904.11141.
37. Huang, Z.; Ke, W.; Huang, D. Improving Object Detection with Inverted Attention. arXiv 2019, arXiv:1903.12255.
38. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. Squeezenet: Alexnet-level
accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360.
39. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1492–1500.
40. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.;
Isard, M.; et al. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th
USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA,
2–4 November 2016; pp. 265–283.
41. Girshick, R. Fast R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1440–1448.
42. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
43. Redmon, J.; Farhadi, A. YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
44. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings
of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017;
pp. 2980–2988.
45. Huang, J.; Rathod, V.; Sun, C.; Zhu, M.; Korattikara, A.; Fathi, A.; Fischer, I.; Wojna, Z.; Song, Y.;
Guadarrama, S.; et al. Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017;
pp. 7310–7311.
46. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-resnet and the impact of residual
connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco,
CA, USA, 4–10 February 2017; ACM: New York, NY, USA, 2017; pp. 500–515.
47. Shrivastava, A.; Sukthankar, R.; Malik, J.; Gupta, A. Beyond skip connections: Top-down modulation for
object detection. arXiv 2016, arXiv:1612.06851.
48. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA,
21–26 July 2017; pp. 1492–1500.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).