
Digital Signal Processing 132 (2023) 103812

Contents lists available at ScienceDirect

Digital Signal Processing


journal homepage: www.elsevier.com/locate/dsp

A comprehensive review of object detection with deep learning


Ravpreet Kaur ∗ , Sarbjeet Singh
Computer Science and Engineering, UIET, Panjab University, Chandigarh, 160014, India

Article info

Article history:
Available online 8 November 2022

Keywords:
Computer vision
Deep convolutional neural network
Object detection
Deep learning
Conventional methods

Abstract

In the realm of computer vision, Deep Convolutional Neural Networks (DCNNs) have demonstrated excellent performance. Video processing, object detection, image segmentation, image classification, speech recognition and natural language processing are some of the application areas of CNNs. Object detection is the most crucial and challenging task of computer vision. It has numerous applications in the fields of security, military, transportation and medical sciences. In this review, object detection and its different aspects are covered in detail. With the gradual evolution of deep learning algorithms for detecting objects, a significant improvement in the performance of object detection models has been observed. However, this does not imply that the conventional object detection methods, which had been evolving for decades prior to the emergence of deep learning, have become outdated. There are some cases where conventional methods with global features are the superior choice. This review paper starts with a quick overview of object detection, followed by object detection frameworks, backbone convolutional neural networks, and an overview of common datasets along with the evaluation metrics. Object detection problems and applications are also studied in detail. Some future research challenges in designing deep neural networks are discussed. Lastly, the performance of object detection models on the PASCAL VOC and MS COCO datasets is compared and conclusions are drawn.

© 2022 Elsevier Inc. All rights reserved.

* Corresponding author.
E-mail addresses: [email protected] (R. Kaur), [email protected] (S. Singh).

https://doi.org/10.1016/j.dsp.2022.103812
1051-2004/© 2022 Elsevier Inc. All rights reserved.

1. Introduction

With the evolution of Deep Convolutional Neural Networks (DCNNs) and the rise in the computational power of GPUs, deep learning models are being extensively used today in the domain of computer vision [9]. The primary objective of object detection is to detect visual objects of certain classes like tv/monitor, books, cats, humans, etc., locate them using bounding boxes, and then classify them into the categories of that particular object [1–4].

Generic object detection is also known by several other terms, for instance, generic object category detection, object category detection, category level object detection, and object class detection. It focuses on recognizing instances of some preset categories [2,3].

The problem of object detection is described as the task of detecting and classifying a varied number of objects in an image. It aims to detect where the object is located in an image, creates a bounding box around that object and then identifies to which category it belongs.

Deep learning is currently being applied in diverse areas of computer vision, like image classification, image retrieval, object detection and semantic segmentation [5]. The progress of object detection is usually separated into two historical phases. The phase before 2014 was that of traditional methods; after 2014, deep learning based methods took over [4]. This paper focuses on deep learning based methods, which make use of CNNs, as CNNs play a significant role in the implementation of object detection algorithms. The architectures of the two phases differ with respect to accuracy, speed, and hardware resources. Compared to traditional techniques, CNNs have a better architecture and are substantially more expressive [6,7].

Before discussing deep learning based object detection algorithms, it is important to understand the working of traditional techniques and to know why the deep learning based methods are much superior. This will help researchers to better comprehend the modern object detection methods.

a. Conventional methods
There are three phases [6,8] in traditional object detection methods. These phases are described with their respective drawbacks below:
i. Selection of region – As objects have different magnitudes and aspect ratios, they may occur at distinct regions of an image. So, at the first stage, it is essential to identify the region of an object. As a result, an entire image is inspected using a multi-scale sliding window approach to detect objects. However, this approach has a high computational cost and also produces a large number of non-essential candidates.

ii. Extraction of features – After locating an object, feature extraction is carried out to provide a robust representation. Methods such as HOG [9], Haar-like features [10] and SIFT [11] are used to extract features for object recognition and provide a meaningful representation. However, because of contrasting backgrounds, lighting environments and perspective variances, it is extremely hard to manually build a comprehensive feature descriptor that correctly identifies all kinds of objects.
iii. Classification – At this stage, a classifier such as AdaBoost [12] is used to identify the target objects and to make the models more organized and meaningful for visual perception.

It is clear from the above points that in traditional methods, handcrafted features are not always adequate to correctly represent the objects. Along with this, the sliding window approach used for generating bounding boxes is computationally expensive and ineffective. The traditional techniques include HOG [9], SIFT [11], Haar [10], the VJ detector [13,14] and other algorithms such as [15,16]. HOG [9] takes a long time to recognize an object since it employs a sliding window approach to extract features [17]. The SIFT [11] algorithm is extremely slow, has a high computational cost and is also not good at handling illumination changes [18]. In the VJ detector [13], the training duration is very long and it is limited to binary classification only [19]. Therefore, deep learning techniques are being utilized to overcome the problems of traditional methods.

b. Deep learning based methods
The advent of deep learning has the potential to address a few limitations of conventional techniques. Lately, deep learning methods have become prominent for learning feature representations from data automatically. These approaches have significantly improved object detection. The deep learning based approaches include Faster RCNN [20,21], SSD [22], YOLO [23] and many more (refer to Section 2).

The major strengths of the paper are as follows:

1. The study examines the state-of-the-art object detection models, providing an in-depth analysis of major object detectors along with their characteristics.
2. The work provides a detailed explanation of backbone architectures. Furthermore, benchmark datasets and evaluation criteria are discussed and challenges are explored.
3. A comprehensive performance comparison of different object detectors is provided on two popular datasets, namely the PASCAL VOC dataset and the COCO dataset.

The rest of this paper is organized as follows.

• Section 2 provides extensive details about object detection frameworks, two-stage detectors and one-stage detectors, along with their characteristics in tabular form.
• Backbone architectures are described in Section 3 and their performance is compared and analyzed.
• Section 4 discusses the popular datasets and criteria for assessing the performance of object detection algorithms.
• Section 5 and Section 6 elaborate various object detection problems and their applications.
• Section 7 covers the future research areas.
• Comparative results are presented in Section 8.
• Finally, Section 9 draws the conclusion.

Fig. 1. Classification of generic object detection models (a) Two-stage detectors from period 2014 to 2017 [20,21,25–29]. (b) One-stage detectors from period 2013 to 2020 [22,23,30–37].

2. Object detection frameworks

Considerable advancement has been made in the domain of generic object detection with the evolution of deep learning networks [24].

Object detection is a fusion of object localization and object classification tasks. Because deep CNNs have high feature representation power, they are used in object detection architectures. The classification of object detection models is depicted in Fig. 1. There are two types of detectors: two-stage and one-stage detectors [1].

2.1. Two-stage object detectors: region based

The two-stage object detection framework divides the task of object localization and object classification. In simpler terms, first the region proposals are generated where the object is localized, and then each region is classified according to its particular category. This is the reason why it is called two-stage. Fig. 1(a) shows various two-stage object detectors. These architectures are also called region-based frameworks [2]. The main advantage of two-stage object detectors is their high detection accuracy; their disadvantage is slow detection speed. The features and characteristics of these detectors are explained below.

2.1.1. RCNN

The Region-based convolutional neural network (RCNN) proposed by [25] was an advanced piece of research in using deep learning methods for the detection of objects [38]. Its architecture is shown in Fig. 2. The process of RCNN is explained below in four stages [2,6,39]:

1st stage – Region proposals are extracted using the selective search method. Selective search identifies these regions based on varying scales, enclosures, textures, and color patterns. It extracts around 2000 regions from each image [39].
2
R. Kaur and S. Singh Digital Signal Processing 132 (2023) 103812

Fig. 2. Architecture of RCNN [25].

2nd stage – All region proposals are rescaled to the same image size to match the CNN input size, since the fully connected layers require fixed-length input vectors. The features from each candidate region are extracted using a CNN.
3rd stage – After the extraction of features, an SVM classifier is used to detect whether an object is present within each region.
4th stage – Finally, for each identified object in an image, a tighter bounding box is generated around it using a linear regression model.

Although RCNN has shown great improvement in object detection, it still has some limitations, such as slow object detection, multi-stage pipeline training, and the rigidness of the selective search method.
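The four stages above can be summarized in code. The following is a minimal sketch of the RCNN-style pipeline, not the authors' implementation: `selective_search`, `cnn_features`, `svm_scores` and `regress_box` are hypothetical stand-ins for the real components (selective search, the CNN feature extractor, the per-class SVMs and the bounding-box regressor).

```python
import numpy as np

def rcnn_detect(image, selective_search, cnn_features, svm_scores, regress_box,
                score_thresh=0.5):
    """Sketch of the four-stage RCNN pipeline (hypothetical component functions)."""
    detections = []
    # Stage 1: ~2000 class-agnostic region proposals from selective search.
    for box in selective_search(image):
        # Stage 2: crop the region (it would be warped to the fixed CNN input
        # size) and extract a fixed-length feature vector, e.g. a 4096-d fc7.
        x0, y0, x1, y1 = box
        feat = cnn_features(image[y0:y1, x0:x1])
        # Stage 3: score the region with per-class SVMs.
        scores = svm_scores(feat)                 # one score per class
        cls = int(np.argmax(scores))
        if scores[cls] >= score_thresh:
            # Stage 4: class-specific bounding-box regression tightens the box.
            detections.append((cls, scores[cls], regress_box(feat, box, cls)))
    return detections
```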

2.1.2. SPP-Net

As RCNN generates 2000 region proposals per image, CNN feature extraction from these regions was the main barrier. The constraint of a fixed input size exists only because of the fully connected layers [40]. To overcome this difficulty, [26] brought in a new technique called the Spatial Pyramid Pooling network (SPP-Net). The SPP layer is added on top of the final convolutional layer to produce fixed-length features for the fully connected layers, irrespective of the size of the RoI (Region of Interest) and without rescaling it, which could lead to information loss [4,40].

By using the SPP layer, a great improvement in the speed of RCNN was seen without any loss in detection quality. This is because the convolutional layers need to be run only once on the complete test image to create fixed-length features for region proposals of arbitrary size.

The network structure of SPP-Net is demonstrated in Fig. 3. Here the output of the SPP layer is a 256×M-d vector, where 256 is the number of convolutional filters and M is the number of bins. The fully connected layer receives this fixed-length vector [2,26].

Fig. 3. Spatial Pyramid Pooling [26].

2.1.3. Fast RCNN

Although SPP-Net outperforms RCNN in terms of efficiency and accuracy, it still has some problems: it roughly follows the same procedure as RCNN, which includes fine-tuning of the network, extraction of features, and bounding box regression [6]. Girshick, R. showed further improvement over RCNN and SPP-Net, and put forward a new detector named Fast RCNN [27]. It allows end-to-end training of the detector, which learns a softmax classifier and class-specific bounding box regression concurrently with a multi-task loss, rather than training them separately as in RCNN and SPP-Net.

In Fast RCNN, rather than executing the CNN 2000 times per image, it is run only once to obtain all the regions of interest. An RoI pooling layer was added between the final convolutional layer and the initial fully connected layer so that a fixed-length feature vector gets extracted for every region proposal [2,4,39]. The working of the Fast RCNN detector is as follows:

1st – Fast RCNN takes a complete input image and passes it to a CNN to produce a feature map.
2nd – Regions of Interest (RoIs) are generated using the selective search method.
3rd – The RoI pooling layer is applied to each extracted RoI to generate a feature vector of fixed length. It assures that all the regions have the same magnitude.
4th – The extracted features are then sent to fully connected layers for categorization and localization, using a softmax layer and a linear regression layer at the same time.

Fast RCNN consumes less computational time and has improved detection accuracy. However, it was still based on a traditional region proposal method, the selective search method, which makes it time consuming.

2.1.4. Faster RCNN

Although Fast RCNN showed considerable advancement in speed and accuracy, it uses the selective search method to generate 2000 region proposals, which was a very slow process. Ren, S. et al. [20,21] worked on this issue and developed a new detector named Faster RCNN, the first end-to-end deep learning detector [41]. It improves the detection speed of Fast RCNN by replacing the traditional region proposal algorithms, such as selective search [42], multiscale combinatorial grouping [43] or edge boxes [44], with a CNN called the Region Proposal Network (RPN). The procedure for Faster RCNN is as follows:

a) A CNN takes an image as input and provides the feature maps of the image as output.
b) The RPN is applied to the generated feature maps, returning the object proposals (RoIs) as well as their objectness scores.
c) Once the RoIs are extracted, an RoI pooling layer is applied to bring all the proposals to a fixed dimension.
d) The derived feature vectors are supplied to a succession of fully connected layers with a softmax layer and a regression layer at the top, to classify the objects and output their bounding boxes [39].
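Faster RCNN is available off-the-shelf in common libraries. As an illustration, a minimal inference sketch using torchvision's pretrained implementation is shown below (assuming torchvision ≥ 0.13 for the `weights` argument; the image path is a placeholder). This is an example of using the detector, not the original authors' code.

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pretrained Faster RCNN with a ResNet-50 FPN backbone (COCO weights).
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

img = read_image("street.jpg").float() / 255.0   # CHW tensor scaled to [0, 1]
with torch.no_grad():
    # The model takes a list of images and returns one dict per image
    # with 'boxes' (x1, y1, x2, y2), 'labels' and 'scores'.
    out = model([img])[0]

keep = out["scores"] > 0.8                        # confidence threshold
for box, label, score in zip(out["boxes"][keep], out["labels"][keep],
                             out["scores"][keep]):
    print(label.item(), round(score.item(), 3), box.tolist())
```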

3
R. Kaur and S. Singh Digital Signal Processing 132 (2023) 103812

Fig. 4. Region Proposal Network [20].

Fig. 5. Feature Pyramid Network [28]. (For interpretation of the colors in the figure(s), the reader is referred to the web version of this article.)

Working of RPN – The Region Proposal Network is a fully convolutional network that is attached to the final convolutional layer of the backbone network [1]. It receives the feature map and slides a window over it to output multiple object proposals. At each window position, the network produces k anchor boxes (also called reference boxes) of different sizes and aspect ratios. Only the features obtained from the anchor boxes, not the position of the anchor, are class-specific. Each object proposal consists of 4 coordinates and a score that determines whether an object is present or not. Each anchor is mapped to a low-dimensional vector and passed to two fully connected layers: an object category classification layer and a box regression layer [2,39,40,45]. This working process of the RPN is shown in Fig. 4 (a minimal anchor-generation sketch is also given at the end of Section 2.1).

2.1.5. Feature pyramid network

Lin, T. Y. et al. presented the Feature Pyramid Network (FPN) [28], an intrinsic multi-scale, pyramidal hierarchy of DCNNs, to build feature pyramids at low cost. It takes an image of any size as input and outputs proportionally sized feature maps at multiple levels. This method shows considerable enhancement in many applications. FPN is not an object detector by itself; it is a feature extractor that is used in conjunction with object detectors. The architecture of FPN combines semantically strong low-resolution features with semantically weak high-resolution features using a top-down pathway and lateral connections [40,46]. Using the sequence of the CNN architecture, FPN builds a bottom-up path and a top-down path with lateral connections.

In the bottom-up pathway (in red), an image is passed as input to the CNN, which uses pooling layers to progressively bring the feature maps to smaller sizes. For each stage of FPN (i.e. for each resolution level), one pyramid level is defined [40,47].

In the top-down pathway (shown in blue), features of higher resolution are produced by up-sampling the feature maps back to the same sizes as in the bottom-up part. Then, using lateral connections, these features are augmented with features from the bottom-up pathway. Each lateral connection combines feature maps of the same size from the bottom-up and the top-down pathways [28].

Fig. 5 depicts the fundamental structure of FPN. (a) Image pyramids are used to build feature pyramids, resulting in a slow process. (b) Single-scale features are employed for fast detection. (c) The pyramidal feature hierarchy is reused similarly to an image pyramid, as in SSD. (d) FPN is designed with more accuracy and is faster than the previous methods [28].

The process of FPN yields an extensive solution for generating multi-scale feature maps with rich semantic content. FPN is not dependent on the architecture of the CNN and can be applied to different phases of object detection, such as the RPN and Fast RCNN [27], and several other computer vision tasks like instance segmentation [6,40].

Although DCNNs have tremendous representational capability, it is still necessary to solve multi-scale challenges via pyramid representation [28].

2.1.6. Mask RCNN

He, K. et al. [29] designed an object detector named Mask RCNN, an augmentation of Faster RCNN to solve the instance segmentation problem, in which object detection and semantic segmentation jobs are carried out. These two tasks are self-reliant processes [6]. The goal of Mask RCNN is to perform pixel-level segmentation. Mask RCNN inspects each pixel and estimates whether or not it is a part of the object. Mask R-CNN follows the architecture of Faster R-CNN; both use the same RPN, but the difference is that Mask RCNN has three outputs for each object proposal, i.e. a class label, a bounding box offset, and an object mask [7,40]. In Mask RCNN, the RoIAlign layer is used to associate the extracted features with the object's input position. The purpose of the RoIAlign layer is to fix the misalignment issues of the RoI pooling layer. It eliminates the quantization of the RoI boundaries and instead uses bilinear interpolation to evaluate the exact feature values at each sampling point. Mask RCNN achieved state-of-the-art performance on instance segmentation [47,48].

As discussed above, region-proposal based frameworks consist of various phases which are connected to each other and are trained separately: region proposal generation, extraction of features using a CNN, classification, and bounding box regression [6]. Even though these methods are able to achieve high accuracy, there are some issues related to real-time speed. This problem can be overcome by a unified one-stage detector that removes the region proposal phase and implements feature extraction, proposal regression, and prediction in a single CNN [38].

The characteristics of two-stage object detection models are summarized in Table 1. It concisely gives the details for each object detector in terms of the size of the input image taken, the backbone CNN, the method used for region proposals, the optimization method, and the loss function. In addition, the strengths and shortcomings are also discussed for each object detection model. An optimization algorithm (learning method) like SGD [49] minimizes the error by determining the values of the weight parameters; it has a substantial influence on the model's accuracy and training speed. The loss or cost function, such as hinge loss [50], L1 and L2 loss [51] or log loss [52], is a measure of the difference between the expected and the predicted output. Readers are advised to refer to the respective object detector papers for additional information.
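To make the anchor mechanism described under "Working of RPN" concrete, the following is a minimal NumPy sketch, not the original implementation, that generates the k = scales × ratios anchor boxes for every position of a feature map, assuming a fixed feature stride:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchor boxes as (x1, y1, x2, y2)."""
    base = []
    for s in scales:
        for r in ratios:
            # Keep the anchor area ~s*s while changing its aspect ratio r = h/w.
            w, h = s / np.sqrt(r), s * np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                                    # (k, 4)

    # Centers of all sliding-window positions, mapped back to image coordinates.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    centers = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)

    return (centers + base).reshape(-1, 4)                   # (N * k, 4)

anchors = generate_anchors(feat_h=38, feat_w=50)             # e.g. ~600x800 input
print(anchors.shape)                                         # (17100, 4) = 38*50*9
```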


Table 1
Characteristics of Two-Stage Object Detection Models.

RCNN [25] (2014)
  Input size: Fixed. Backbone DCNN: AlexNet. Region proposal: Selective Search. Optimization: SGD, BP. Loss/cost function: Hinge loss, bounding box regressor loss.
  Strengths: 1. First neural network based on region proposals for higher detection quality. 2. An increase in performance is seen over traditional state-of-the-art methods.
  Shortcomings: 1. Training is costly, as a huge amount of space and time is required. 2. Multi-stage pipeline training is used. 3. The CNN is applied to about 2000 image regions, so feature extraction is the main time constraint in testing. 4. Extracting 2000 image regions is a difficult task, as features are extracted for every image region.

SPP-Net [26] (2014)
  Input size: Arbitrary. Backbone DCNN: ZFNet. Region proposal: Selective Search. Optimization: SGD. Loss/cost function: Hinge loss, bounding box regressor loss.
  Strengths: 1. Extracts the features of the entire image at once. 2. Outputs fixed-length features regardless of image size. 3. Faster than RCNN.
  Shortcomings: 1. The architecture is identical to RCNN, so it has the same drawbacks as RCNN. 2. No end-to-end training.

Fast RCNN [27] (2015)
  Input size: Arbitrary. Backbone DCNN: AlexNet, VGGM, VGG16. Region proposal: Selective Search. Optimization: SGD. Loss/cost function: Classification loss, bounding box regression loss.
  Strengths: 1. First end-to-end detector training. 2. Faster and more accurate than RCNN and SPP-Net. 3. Single-stage training network. 4. RoI pooling layer used.
  Shortcomings: 1. Sluggish for real-time applications because of selective search. 2. Region proposal computation is a bottleneck. 3. No end-to-end training.

Faster RCNN [20,21] (2015)
  Input size: Arbitrary. Backbone DCNN: ZFNet, VGG16. Region proposal: Region Proposal Network. Optimization: SGD. Loss/cost function: Classification loss (class log loss), bounding box regression loss.
  Strengths: 1. Introduces the RPN, which generates nearly cost-free region proposals. 2. Established translation-invariant and multi-scale anchors. 3. An integrated network comprising the RPN and Fast RCNN is designed with shared convolutional layers. 4. Provides end-to-end training.
  Shortcomings: 1. Training is complex; inefficient for real-time applications. 2. Lack of performance for small and multi-scale objects. 3. Speed is slow.

Feature Pyramid Network [28] (2017)
  Input size: Arbitrary. Backbone DCNN: ResNet50, ResNet101. Region proposal: Region Proposal Network. Optimization: Synchronized SGD. Loss/cost function: Classification loss (class log loss), bounding box regression loss.
  Strengths: 1. A multi-level feature fusion FPN is designed. 2. Accurate solution to multi-scale object detection. 3. Follows a top-down structure with lateral connections.
  Shortcomings: 1. It is still required to use pyramid representation to tackle multi-scale challenges. 2. Speed is still the bottleneck for detection; cannot fulfill real-time needs.

Mask RCNN [29] (2017)
  Input size: Arbitrary. Backbone DCNN: ResNet101, ResNeXt101. Region proposal: Region Proposal Network. Optimization: SGD. Loss/cost function: Classification loss, bounding box regression loss, mask loss (average binary cross-entropy loss).
  Strengths: 1. The RoIAlign pooling layer is used rather than the RoI pooling layer, thus increasing detection accuracy. 2. Simple and flexible architecture for object instance segmentation. 3. Pixel-to-pixel alignment is carried out.
  Shortcomings: 1. Detection speed is too low to satisfy real-time requirements.

2.2. One-stage object detectors: regression/classification based

One-stage object detection frameworks locate and categorize objects simultaneously using DCNNs, without partitioning the task into two portions. These are also called region-proposal-free frameworks. Several one-stage detectors are shown in Fig. 1(b).

Here, only one pass through a neural network is needed. A feed-forward neural network predicts all the bounding boxes at one time [7], mapping image pixels directly to bounding box coordinates and class probabilities [1,6]. One-stage object detectors are described below.

2.2.1. DetectorNet

Szegedy, C. et al. [30] implemented the DetectorNet framework as a regression problem. It is capable of learning features for classification and acquiring some geometric information. It uses AlexNet as a backbone network, and the softmax layer is replaced with a regression layer. To predict the foreground pixels, DetectorNet splits the input image into a coarse grid. It has a very slow training process, as the network has to be trained for each object type and mask type. Also, DetectorNet cannot handle multiple objects of a similar class. When it is used in conjunction with a multi-scale coarse-to-fine method, DNN-based object mask regression produces excellent results [2,30,45].

2.2.2. OverFeat

Sermanet, P. et al. [31] presented a unified structure for using convolutional networks for localization, classification and detection using a multi-scale sliding window approach. It is one of the most powerful object detection frameworks; it was applied to the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC) and ranked first in detection and localization [31]. It is the first fully convolutional deep network based one-stage detector that detects objects using a single forward pass through fully convolutional layers. OverFeat acts as a base model for algorithms that emerged later, namely YOLO and its versions, SSD etc. The primary difference is that the training of the classifiers and regressors is done in succession in OverFeat [2].

2.2.3. YOLO

You Only Look Once (YOLO) is a single-stage object detector designed by Redmon, J. et al. [23], where object detection is carried out as a regression problem. It predicts the coordinates of the bounding boxes for the objects and determines the likelihood of the category to which each belongs. Due to the use of only a single network, end-to-end optimization can be achieved [53]. It predicts the detections directly using a limited selection of candidate regions. Unlike region based approaches, which employ features from a specific region, YOLO uses features broadly from the whole image [2].

In YOLO object detection, an image is divided into an S × S grid; each grid cell comprises a five-tuple (x, y, w, h and a confidence score). The confidence score of an individual object is based on the probability. This score is given to every class, and whichever class has the highest probability is given precedence. The parameters width (w) and height (h) of the bounding box are predicted with respect to the size of the object. From the overlapping bounding boxes, the box having the highest IoU is selected and the remaining boxes are removed [45] (a minimal sketch of this suppression step is given at the end of Section 2.2).

2.2.4. SSD

SSD, a fast single-shot multi-box detector for several classes, was implemented by Liu, W. et al. [22]. It builds a unified detector framework which is as fast as YOLO and as accurate as Faster RCNN. The design of SSD combines the idea of regression from YOLO's model and the anchor procedure from Faster R-CNN's algorithm. By using YOLO's regression, SSD reduces the computational complexity of the neural network to assure real-time performance. With the anchor procedure, SSD is capable of extracting features of various sizes and aspect ratios to ensure detection accuracy [54]. SSD uses VGG-16 as its backbone.

The process of SSD is based on a feed-forward CNN that generates bounding boxes of fixed size and objectness scores for the existence of object class instances in those boxes, and then applies NMS (non-maximum suppression) to produce the final detections [22]. It also uses the concept of the RPN to attain fast detection speed while maintaining high detection quality [2]. With some auxiliary data augmentation and hard negative mining approaches, SSD accomplished state-of-the-art performance on various benchmark datasets [47].

2.2.5. YOLOv2

YOLOv2, an enhanced version of YOLOv1 [23], was given by Redmon, J. et al. [32]. In this version, different ideas such as batch normalization, convolution with anchor boxes, a high-resolution classifier, fine-grained features, and multi-scale training are applied to improve YOLO's performance. It uses Darknet-19 as the backbone classification network, containing 19 convolutional layers and 5 max-pooling layers, which requires fewer operations to analyze an image while achieving the best accuracy [24].

2.2.6. YOLOv3

YOLOv3 [33] is a gradual evolution of YOLOv2 [32] that uses logistic regression to estimate an objectness score for each bounding box. There can be multiple classes contained in a bounding box, and to predict those classes, multi-label classification is used. It also uses binary cross-entropy loss, data augmentation techniques, and batch normalization. YOLOv3 uses a robust feature extractor called Darknet-53 [24,33,47].

2.2.7. YOLOv4

YOLOv4 [36] is a state-of-the-art object detector that is more accurate and faster than all the previous versions of YOLO [23,32,33]. It includes a set of methods called the "Bag of freebies", which increase the training time without influencing the inference time. These exploit data augmentation techniques, self-adversarial training, Cross mini-Batch Normalization (CmBN), CIoU loss [55], DropBlock regularization [56] and a cosine annealing scheduler [57] to improve training. YOLOv4 also incorporates methods which solely impact the inference time, known as the "Bag of specials"; these include the Mish activation, multi-input weighted residual connections (MiWRC), the SPP block [26], the PAN path-aggregation block [58], cross-stage partial connections (CSP) [59] and the Spatial Attention Module block. YOLOv4 can be trained on a single GPU and uses a genetic algorithm to select hyper-parameters [36].

2.2.8. YOLOv5

Soon after the release of YOLOv4, the Ultralytics company launched the YOLOv5 repository with considerable enhancements over the previous YOLO models [60]. As YOLOv5 was not published as peer-reviewed research, there have been many debates about its legitimacy [34]; still, it is being used in various applications and is giving effective results, along with establishing the reliability of the model. It operates at an inference speed of 140 fps. YOLOv5 uses PyTorch, which makes the deployment of the model faster, easier and more accurate [60]. The YOLOv4 and YOLOv5 frameworks are similar, so comparing the difference between them is hard, but YOLOv5 has since gained higher performance than YOLOv4 under certain situations. There are five types of YOLOv5 models: nano, small, medium, large, and extra-large. The type of model is chosen according to the dataset. Further, a lightweight YOLOv5 model was released with version 6.0, with an improved inference speed of 1666 fps [35,60].

The characteristics of one-stage object detection models are described in Table 2. It provides concise details for each object detector, giving information for the same parameters as mentioned in Table 1, except the region proposal method.

Finally, it can be concluded that the YOLOv5 model acts as a good object detector for detecting small objects, and it is the fastest model compared to the other object detectors. For the detection of large objects, any object detector can be used. If results are required in real time, then any one-stage object detector can be used, but if accuracy is the main concern, then Faster RCNN (a two-stage object detector) is a good choice.


Table 2
Characteristics of One-Stage Object Detection Models.

DetectorNet [30] (2013)
  Input size: Arbitrary. Backbone DCNN: AlexNet. Optimization: Stochastic gradient using ADAGRAD. Loss/cost function: Least square error (L2 loss).
  Strengths: 1. Multi-scale inference method which produces object detections of high resolution. 2. Represents strong geometric information. 3. Simple model, because it has a higher detection rate over a large number of objects and can be conveniently applied to an ample variety of classes.
  Shortcomings: 1. Training is expensive. 2. It cannot deal with multiple objects of the same class type.

OverFeat [31] (2013)
  Input size: Arbitrary. Backbone DCNN: AlexNet. Optimization: SGD. Loss/cost function: Least square error (L2 loss).
  Strengths: 1. Multi-scale, sliding window approach used for classification, localization and detection. 2. Winner of the ILSVRC2013 competition for the localization task. 3. Faster due to sharing of convolutional features.
  Shortcomings: 1. Single bounding box regressor per class. 2. Unable to deal with multiple instances of the same class. 3. Multi-stage pipeline trained sequentially.

YOLO v1 [23] (2016)
  Input size: Fixed. Backbone DCNN: GoogLeNet. Optimization: SGD. Loss/cost function: Sum squared error (classification loss, localization loss, confidence loss).
  Strengths: 1. First unified end-to-end framework. 2. Completely removes the concept of region proposals. 3. Real-time object detection.
  Shortcomings: 1. Difficult to localize low-resolution objects. 2. Less flexible. 3. Cannot predict more than one box for a particular region without anchor boxes.

SSD [22] (2016)
  Input size: Fixed. Backbone DCNN: VGG-16. Optimization: SGD. Loss/cost function: Confidence loss (categorical cross-entropy loss) + localization loss (regression loss).
  Strengths: 1. Multi-scale feature maps enhance object detection at spatial levels. 2. Faster than YOLO and on par with Faster RCNN.
  Shortcomings: 1. Performs poorly when detecting small objects. 2. Small objects can only be identified in higher-resolution layers; however, these layers incorporate low-level features, such as edges, that are not very effective for classification.

YOLO v2 [32] (2017)
  Input size: Fixed. Backbone DCNN: Darknet-19. Optimization: SGD. Loss/cost function: Sum squared error.
  Strengths: 1. Faster and stronger than YOLO v1. 2. Batch normalization. 3. Use of a high-resolution classifier aims to increase accuracy. 4. The k-means clustering algorithm is used to yield anchor boxes. 5. Multi-scale training.
  Shortcomings: 1. Difficulty in detecting small objects. 2. Complex training.

YOLO v3 [33] (2018)
  Input size: Fixed. Backbone DCNN: Darknet-53. Optimization: SGD. Loss/cost function: Binary cross-entropy.
  Strengths: 1. To boost the multi-scale detection accuracy, it makes use of multi-level feature fusion. 2. Detections are done at feature maps of different sizes to detect features at different scales.
  Shortcomings: 1. YOLOv3 may not be ideal for niche models where large datasets can be hard to obtain. 2. Not suitable for detecting small objects.

YOLO v4 [36] (2020)
  Input size: Fixed. Backbone DCNN: CSPDarknet-53. Optimization: SGD. Loss/cost function: Binary cross-entropy.
  Strengths: 1. Introduces Mosaic data augmentation. 2. Bag of Freebies (BoF) and Bag of Specials (BoS) are used for the backbone and for detection purposes. 3. Hyper-parameters are selected using genetic algorithms.
  Shortcomings: none listed.

YOLO v5 [34,35] (2020)
  Input size: Fixed. Backbone DCNN: Focus structure, CSP network. Optimization: SGD. Loss/cost function: Binary cross-entropy with logits loss.
  Strengths: 1. Faster than YOLOv4. 2. Detects objects in real time with great accuracy.
  Shortcomings: none listed.
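The YOLOv5 entry in Table 2 can be tried directly, since Ultralytics publishes the models through the PyTorch Hub. A minimal usage sketch follows (it assumes internet access to the ultralytics/yolov5 repository and uses its hub API as documented there; the image path is a placeholder):

```python
import torch

# Load the small YOLOv5 model from the Ultralytics hub (downloads on first use).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

results = model("street.jpg")           # accepts paths, URLs, PIL images, arrays
results.print()                         # per-class counts and speed summary
print(results.xyxy[0])                  # tensor of (x1, y1, x2, y2, conf, cls)
```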

3. Backbone networks

DCNNs serve as the backbone networks for object detection models. To ameliorate the feature representation behavior, the structure of the network gets more complex, which means the network gets deeper and its parameters increase. A backbone CNN is used to extract features in DCNN-based object detection systems [1,38].

The backbone network acts as the primary feature extractor for an object detection method, taking images as input and generating feature maps as output for each input image. According to the needs of accuracy and efficiency, densely connected backbones such as ResNet [61], ResNeXt [62] etc. can be used. Complex backbones are required when there is a need for high precision and to build accurate applications [24].

Before the paradigm of deep learning, constructing feature descriptors required extensive effort and expertise. In contrast, CNNs incorporate the capability of learning the features through their abstract hierarchical layers. In this section, some common backbone CNN architectures are discussed [45].

3.1. AlexNet

AlexNet [63] is an important CNN architecture consisting of five convolutional layers and three fully connected layers. After giving an input image of fixed size (224 × 224), the network convolves repeatedly and pools the activations, then the result is transmitted to the fully connected layers. The network was trained on ImageNet and combines several methods of regularization, such as data augmentation, dropout etc. In order to accelerate the data processing and increase the convergence speed, the ReLU activation function and GPUs were used for the first time. It ultimately paid off, and AlexNet turned out to be the first CNN to win the ILSVRC2012 competition, with great accuracy and a huge drop in error rate [45,63]. The triumph of the AlexNet architecture is based on the following mechanics [1]:

• The Rectified Linear Unit (ReLU) activation function is used instead of sigmoid and tanh.
• Multi-GPU processing is used to speed up network training.
• To enlarge the dataset, some techniques are used to augment the data, such as random clipping, transformation with color illumination etc.
• The dropout regularization method is used during training to remove a part of the neurons. It brings down the chances of overfitting.

3.2. ZFNet

After the success of AlexNet, researchers wanted to know the mechanism behind the visualization of the convolutional layers, to see how a CNN learns the features and how to examine the differences in the image feature maps at each layer. So, a method was designed by Zeiler, M. D. et al. [64] to visualize the feature maps using deconvolutional layers, unpooling layers and ReLU non-linearities. In AlexNet, the filter size of the first layer is 11×11 with a stride of 4, but in ZFNet it is reduced to 7×7 and the stride is set to 2 instead of 4. The reason behind this was that the filters of the first layer contain variations in frequency information; it can be high or low, with a very small percentage of mid frequencies. This method performs better than AlexNet and proved that the depth of the network influences the performance of deep learning models [1,64,65].

3.3. VGGNet

VGG [66] further enlarges the depth of AlexNet to 16-19 layers, which refines the feature representation of the network. VGG16 and VGG19 are the two popular VGG network architectures. In each layer, it employs a kernel of size 3×3 with a stride of 1. A small kernel and stride are more favorable for extracting the details of the object's location in the image. It has the benefit of expanding the network's depth by incorporating additional convolutional layers. Minimizing the parameters leads to an improved feature representation ability of the network [1,5].

3.4. GoogLeNet or Inception v1

The main aim of GoogLeNet [67], a.k.a. the Inception v1 architecture, was to achieve high accuracy while decreasing the computational cost. By adding 1×1 convolutional layers to the network, there is an increase in its depth. This filter size was first used in the technique named Network-in-Network [68], and is mainly used for dimensionality reduction to remove computational bottlenecks while increasing the width and depth of the network [67]. GoogLeNet is a 22-layer deep architecture and is the winner of the ILSVRC 2014 competition. Based on this idea, the authors developed an inception module [67] with dimensionality reductions. By using the inception modules, the number of GoogLeNet parameters is decreased, in contrast to [63,64,66]. The Inception module comprises 1×1, 3×3, and 5×5 convolution layers and max-pooling layers assembled in parallel with one another. The Inception v2 series was the first network to propose batch normalization [69], resulting in speedy training [2,45,47,70].
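The parallel-branch idea of the Inception module can be sketched in a few lines of PyTorch. This is a simplified illustration of the module described above (the branch widths below are arbitrary example values), not the exact GoogLeNet configuration:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Simplified Inception block: parallel 1x1, 3x3, 5x5 convs and max-pooling,
    with 1x1 "bottleneck" convolutions used for dimensionality reduction."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, c3_red, 1),
                                nn.ReLU(inplace=True),
                                nn.Conv2d(c3_red, c3, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, c5_red, 1),
                                nn.ReLU(inplace=True),
                                nn.Conv2d(c5_red, c5, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1))

    def forward(self, x):
        # All branches preserve spatial size, so outputs concatenate on channels.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
y = InceptionModule(192, 64, 96, 128, 16, 32, 32)(x)
print(y.shape)   # torch.Size([1, 256, 28, 28]): 64 + 128 + 32 + 32 channels
```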


Table 3
Summary of DCNN architectures.

AlexNet [63] (2012)
  Depth: 8 layers. Parameters: 60M. Dataset: ImageNet. Top-5 test error: 15.3%. Top-5 accuracy: 84.7%. Category: Spatial exploitation.
  Highlights: 1. First deep CNN architecture. 2. ReLU activation function used instead of sigmoid and tanh. 3. Multi-GPU parallel computing technology is used. 4. Shift from hand feature engineering to deep convolutional neural networks.

ZFNet [64] (2014)
  Depth: 8 layers. Parameters: 60M. Dataset: ImageNet. Top-5 test error: 14.8%. Top-5 accuracy: 85.2%. Category: Spatial exploitation.
  Highlights: 1. Introduced a visualization technique that gives insights into the intermediate layers. 2. Analogous to the AlexNet architecture, with a small difference in filter size, number of filters and stride for convolution.

VGGNet [66] (2014)
  Depth: 16 layers. Parameters: 138M. Dataset: ImageNet. Top-5 test error: 6.8%. Top-5 accuracy: 93.2%. Category: Spatial exploitation.
  Highlights: 1. Increases the depth of the network using very small 3×3 convolution filters.

GoogLeNet [67] (2015)
  Depth: 22 layers. Parameters: 6M. Dataset: ImageNet. Top-5 test error: 6.67%. Top-5 accuracy: 93.3%. Category: Spatial exploitation.
  Highlights: 1. Increased the depth and width without raising the computational requirements. 2. Uses the Inception module consisting of conv layers with different filter sizes. 3. Makes use of global average pooling. 4. First bottleneck architecture.

ResNet50 [61] (2016)
  Depth: 50 layers. Parameters: 25.6M. Dataset: ImageNet. Top-5 test error: 3.57%. Top-5 accuracy: 96.43%. Category: Depth + multi-path.
  Highlights: 1. Using identity mappings, deeper networks can be learned to a great extent. 2. Skip connections are used. 3. Increases the accuracy by preserving the gradient in the deeper layers.

ResNet101 [61] (2016)
  Depth: 101 layers. Parameters: 44.5M. Dataset: ImageNet. Top-5 test error: not reported. Top-5 accuracy: not reported. Category: Depth + multi-path.
  Highlights: 1. Performance is identical to VGG with a lesser number of parameters. 2. Uses the bottleneck and global average pooling introduced in GoogLeNet.

DenseNet [71] (2017)
  Depth: 201 layers. Parameters: 20M. Dataset: not reported. Top-5 test error: not reported. Top-5 accuracy: not reported. Category: Multi-path.
  Highlights: 1. The framework uses dense blocks. 2. Every layer is linked to the following layers in a feed-forward manner. 3. Reduces the problem of vanishing gradients.

3.5. ResNet

With the rise in a network's depth, there can be a situation where accuracy drops after reaching a saturation point. This is known as the degradation problem, and to solve it, a residual learning (ResNet) module was proposed by [61]. It has less computational complexity than earlier designed architectures like AlexNet [63] and VGGNet [66]. Generally, ResNet backbone networks with 50 and 101 layers are used [1,70].

In ResNet50, skip connections were used to preserve the gradient in the deeper layers, and a rise in accuracy was seen. In ResNet101, the module performs identically to the VGG network with a smaller number of parameters, following the global average pooling and bottleneck design introduced in GoogLeNet [45].

3.6. DenseNet

Huang, G. et al. [71] presented the DenseNet architecture, composed of dense blocks that link each layer to every other layer in a feed-forward manner, giving rise to benefits like the reuse of features, parameter effectiveness and implicit deep supervision. DenseNet reduces the problem of vanishing gradients [2,45].

Table 3 shows performance comparisons of the various backbone architectures discussed above. It gives a brief description of the number of layers and parameters used, the benchmark dataset, the top-5 test error, the top-5 accuracy and the category to which the corresponding architecture belongs. The highlights show the main features of the DCNN architectures. The top-5 error rate is the percentage of test images for which the correct label is not one of the five labels considered most likely by the model; top-5 accuracy indicates the corresponding classification accuracy. CNNs can be divided into different categories, such as spatial exploitation based, depth based or multi-path based [70]. Spatial exploitation based CNNs adjust the spatial filters such that they perform well on both coarse-grained information (extracted by large filters) and fine-grained information (extracted by small filters). In depth based CNNs, deeper networks perform better compared to shallow ones, as they manage the network's learning capability and can regulate complex tasks effectively. Multi-path based CNNs bridge one layer to another while skipping a few intermediary layers, so that information flows across all layers; this also attempts to work out the problem of vanishing gradients. Readers can follow the survey [70] for more details.


Fig. 6. Sample of annotated images taken from commonly used datasets.

4. Datasets for object detection and performance assessment

4.1. Datasets

Datasets play a very important part in research. Due to the outstanding accomplishments enabled by image datasets, they are used in image classification, object detection and segmentation tasks [1,65]. There are many object detection datasets in the research domain, such as LISA [72], CIFAR-10 [73], PASCAL VOC [74], CIFAR-100 [73], MS COCO [75], ImageNet [76], Tiny Images [77], SUN [78], Open Images v5 [79] etc. Fig. 6 shows some sample images from commonly used datasets. A brief description of these datasets is as follows:

4.1.1. PASCAL VOC

The PASCAL VOC [74] datasets are extensively used for object detection tasks. Having good quality images and corresponding labels for each image, the evaluation of algorithms becomes easy. It was launched in 2005 with four classes, and with time this increased to 20 classes in 2007. These 20 classes are divided into four primary sections: vehicles, people, household objects and animals. PASCAL VOC 2007 and 2012 are the two most used versions of the PASCAL dataset. The 2007 version also contains some imbalanced classes; for example, there are more instances of the class person than of the class sheep [2,5,24,74].

4.1.2. MS-COCO

The Microsoft Common Objects in Context (MS COCO) [75] dataset has 91 common object categories found in everyday life for detecting and segmenting objects. Out of the 91 categories, 20 are from the PASCAL VOC dataset. The dataset has more than 2,500,000 labeled instances in 328,000 images in total. The MS COCO dataset contains diverse viewpoints and is rich in contextual information. It is a more challenging dataset than PASCAL VOC, containing a large number of small objects with huge scale variation [6,24,60].

4.1.3. ImageNet

ImageNet [76] is an extensive and diverse image dataset for assessing the performance of algorithms. Complex datasets can drive improvement in practical applications and computer vision tasks. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [80] is derived from ImageNet [2]. The ILSVRC object detection challenge assesses an algorithm's ability to categorize and locate all target objects in an image. It has become the benchmark dataset, containing 1000 object classes with millions of images [1,24].

4.1.4. OpenImages

The Open Images dataset [79] is one of the largest publicly available datasets, containing 9.2 million images annotated with object bounding boxes, image-level labels and segmentation masks. Open Images v5 is a standard dataset comprising 1.9 million images with 16 million annotated bounding boxes for 600 object categories. The images in this dataset are heterogeneous in nature and contain complicated scenes with various objects (on average, there are 8.3 object categories per image) [24,60].

The most famous object detection datasets are given in Table 4. It gives details about the year in which the dataset was launched, the number of classes in each dataset, and the number of images and objects (bounding boxes) in the training and validation sets. The objects/image column gives the number of bounding boxes per image. A reference link is also provided for each dataset.

4.2. Evaluation metrics

There are several parameters that can be used to measure the effectiveness of object detectors, such as accuracy, precision, IoU, recall, the PR curve, average precision etc. [1,2,24,45,81–83]. Average Precision (AP) is the most often used metric, obtained using recall and precision.

The goal of object detectors is to predict the object location by placing a bounding box over the object of a given class in an image/video with a high confidence score. An overall detection can be considered as a collection of three elements: the object class, the bounding box (BB) around that object, and the confidence score [81]. The metrics terminology used in assessing the performance of object detection algorithms is explained below.

Table 4
Statistics for well known object detection datasets.

PASCAL VOC [74] (launched 2005)
  VOC 2007 challenge: 20 classes; 2,501 train / 2,510 val images; 6,301 train / 6,307 val objects; 2.5 objects/image.
  http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html
  VOC 2012 challenge: 20 classes; 5,717 train / 5,823 val images; 13,609 train / 13,841 val objects; 2.4 objects/image.
  http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html

ImageNet [76] (launched 2009)
  ILSVRC 2013 challenge: 200 classes; 395,909 train / 20,121 val images; 345,854 train / 55,502 val objects; 1.0 objects/image.
  https://image-net.org/challenges/LSVRC/2013/index.php
  ILSVRC 2017 challenge: 200 classes; 456,567 train / 20,121 val images; 478,807 train / 55,502 val objects; 1.1 objects/image.
  https://image-net.org/challenges/LSVRC/2017/index.php#det

MS COCO [75] (launched 2014)
  COCO 2017 challenge: 80 classes; 118,287 train / 5,000 val images; 860,001 train / 36,781 val objects; 7.3 objects/image.
  https://cocodataset.org/

Open Images [79] (launched 2016)
  OpenImages 2018 challenge: 600 classes; 1,743,042 train / 41,620 val images; 14,610,229 train / 303,980 val objects; 8.3 objects/image.
  https://g.co/dataset/openimages/
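The datasets in Table 4 are available through common library wrappers. As an example, a minimal sketch for loading the COCO annotations via torchvision follows (assuming pycocotools is installed and that the placeholder paths point to a local copy of the COCO 2017 validation split):

```python
from torchvision.datasets import CocoDetection

# Hypothetical local paths to the COCO 2017 validation split.
ds = CocoDetection(root="coco/val2017",
                   annFile="coco/annotations/instances_val2017.json")

img, anns = ds[0]                       # PIL image + list of annotation dicts
print(len(ds), anns[0]["bbox"])         # bbox is in (x, y, width, height) format
```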

4.2.1. IoU (Intersection over Union)

IoU is the ratio of the overlap area between the predicted BB ($BB_{predict}$) and the ground truth BB ($BB_{ground}$) to the area of their union. It uses the concept of the Jaccard index, which calculates the similarity between the above two sets. Fig. 7 shows the concept of IoU. IoU is calculated as follows:

Fig. 7. Demonstration of IoU.

$$IoU = J(BB_{predict}, BB_{ground}) = \frac{\text{Area of intersection of predicted and ground truth boxes}}{\text{Area of union of predicted and ground truth boxes}} \tag{1}$$

Values of IoU range between 0 and 1. The closer it is to 1, the more accurate the detection. If the areas of the predicted BB and the ground truth BB overlap each other perfectly, then the value of IoU is 1; if they do not intersect each other at all, then the IoU is 0. If the IoU of the two BBs is larger than a predefined threshold (mostly 0.5), the object is considered recognized properly [1,2,4,45,81,84].

For each object detection task, precision and recall are evaluated using the IoU value for a given threshold (t). If IoU ≥ t, then a prediction is considered correct, and if IoU < t, then the prediction is considered incorrect. To compute the values of precision and recall, every BB must be classified as [45,83]:

• TP (True Positive) - The model has predicted positive, and in actuality it is true.
• TN (True Negative) - The model has predicted negative, and in actuality it is true.
• FP (False Positive) - The model has predicted positive, and in actuality it is false.
• FN (False Negative) - The model has predicted negative, and in actuality it is false.

The variables used for the calculation of the metrics are:

$C_{mn}$ = the category $C_m$ in the nth instance or image.
$i$ = number of instances in a class.
$j$ = number of categories.

4.2.2. Accuracy

Accuracy defines the performance of the model across all classes. It is computed as the ratio of the total number of samples classified correctly to the total sample count. The formula is defined below [45,83]:

$$Accuracy_{C_{mn}} = \frac{TP_{C_{mn}} + TN_{C_{mn}}}{TP_{C_{mn}} + FP_{C_{mn}} + TN_{C_{mn}} + FN_{C_{mn}}} \tag{2}$$

4.2.3. Precision

Precision measures how many positive identifications were actually correct. In other words, it is computed as the ratio between the number of accurately identified positive samples and the total count of positive samples [81,83]. It is given by:

$$Precision_{C_{mn}} = \frac{TP_{C_{mn}}}{TP_{C_{mn}} + FP_{C_{mn}}} \tag{3}$$

4.2.4. Recall

Recall is the measure of how many actual positives were identified correctly. It is evaluated as the proportion of the number of correctly identified positive samples to the total number of actual positive samples. Recall is also known as sensitivity [45,82,83]. It is given by:

$$Recall_{C_{mn}} = \frac{TP_{C_{mn}}}{TP_{C_{mn}} + FN_{C_{mn}}} \tag{4}$$

4.2.5. Average Precision (AP)

To compute the accuracy of detections, the most general metric used is Average Precision (AP). It is calculated independently for each object category $C_m$ [4,45,81], averaging precision over the $i$ instances of that category:

$$AP_{C_m} = \frac{1}{i} \sum_{n=1}^{i} Precision_{C_{mn}} \tag{5}$$

4.2.6. Mean Average Precision (mAP)

The mAP is calculated by taking the average over all object categories, and thereby evaluates the overall performance of an object detector [45,81,82]. The formula is defined below:

$$mAP = \frac{1}{j} \sum_{m=1}^{j} AP_{C_m} \tag{6}$$

4.2.7. F1-score

The F1 metric estimates the balance between recall and precision. It is the harmonic mean of the two fractions and is frequently used for imbalanced class distributions. If both precision and recall are high, the F1-score is high [45,82,85]; a low F1 value shows a significant imbalance between the two. The F1-score is calculated as follows:

$$F1\text{-}score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \tag{7}$$
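The box-level metrics above are straightforward to compute. The following minimal NumPy sketch implements IoU (Eq. (1)) and precision/recall/F1 from TP/FP/FN counts (Eqs. (3), (4) and (7)); it illustrates the definitions only, and is not the evaluation code of any particular benchmark:

```python
import numpy as np

def iou(box_a, box_b):
    """Eq. (1): intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def precision_recall_f1(tp, fp, fn):
    """Eqs. (3), (4) and (7) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A prediction counts as a TP when its IoU with the ground truth >= threshold t.
print(round(iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))    # 0.333
print(precision_recall_f1(tp=8, fp=2, fn=4))            # (0.8, 0.667, 0.727)
```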

4.2.8. PR curve (Precision Recall curve) semantic characteristics are represented differently, various feature
The precision-recall curve depicts the tradeoff between recall maps can be utilized to detect objects of varying sizes and reso-
and precision for distinct thresholds. A wide area under the curve lutions at different layers. Representative methods include Multi-
indicates high recall and precision. The Precision-Recall plot gives scale deep CNN [93], Deeply supervised object detection (DSOD)
more details as compared to ROC (Receiver Operating Characteris- [94] and SSD [22]. To increase the reliability of multi-scale object
tics) curve plot in case of evaluation of binary classifiers on uneven detection, multi-layer feature fusion and multi-layer detection can
distribution of datasets. As recall value starts increasing and corre- be merged. This includes Feature Pyramid Network (FPN) [28], De-
spondingly if precision is maintaining a higher value, it indicates convolutional single-shot detector (DSSD) [95], Scale transferrable
that detector has good performance. However if the value of recall detection network (STDN) [41], Reverse connection with objectness
starts declining and at the same time high precision is attained, prior networks (RON) [96], Top down modulation (TDM) [97] as a
then the detector has to keep the precision at a certain level to few representative frameworks.
keep recall at high level [1,86,87].
Since PR curves provide the positive prediction cases, thus it is
5.3. Intraclass variation
used in many research analysis.

4.2.9. AUC-ROC curve Intraclass variation refers to the variation that occurs between
AUC stands for Area under the ROC Curve, that measures the different images of the same class. They vary in shape, size, color,
performance of classification problems at various thresholds. The material, texture etc. Object instances appear to be flexible and
ROC (Receiver Operating Characteristics) curve is a probability can be easily transformed in terms of scaling and rotation. These
curve related to Precision-Recall curve. The distinction is that the are called intrinsic factors. Some noticeable effects are also expe-
ROC employs TPR (True Positive Rate) and FPR (False Positive Rate). rienced by external factors. It includes improper lighting, weather
The area under AUC curve indicates high precision and high recall. conditions, illuminations, low-quality camera etc. This difference
More closer is the ROC curve to the upper left co-ordinate (0,1), could be caused by a variety of factors such as occlusion, light-
better the performance is. AUC value is the magnitude of the area ing, position, perspective, and so on [2,45,60]. This problem can be
beneath the ROC curve whose value ranges from 0.5 to 1; greater overcome by verifying that the training data has good amount of
the AUC value, more accurate is the performance of the detector variety including all the factors mentioned above [98].
[1,45,88].
5.4. Efficiency and scalability
5. Problems of object detection and its solutions
As the number of object classes increases, there is rise in com-
Even though object detection has achieved remarkable perfor-
putational complexity, hence there is a demand for high compu-
mance in computer vision, still it is a complicated task and has
tation resources with a huge number of locations inside a single
some challenges. Some of these fundamental challenges that net-
image. The scalability of the detector ensures that it can recognize
works encounter in real-world applications and solutions to over-
unseen objects. It is impractical to annotate the images manually
come them are discussed as below.
with the increasing number of images and categories, so weakly
supervised techniques are used [2].
5.1. Small object detection

Detecting small objects is among the most difficult problems in object detection. Object detection algorithms such as Faster RCNN [20,21] and YOLO [23] are inadequate at detecting small objects. Because such objects occupy only a few pixels in the actual image, the independent feature layers of a deep convolutional neural network carry too little knowledge about them, and low-resolution small objects are hard to detect since they provide only limited contextual detail [1,47]. To overcome this issue, more data can be generated through augmentation, or the model's input resolution can be increased [89].
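
One simple remedy in the spirit of [89] is to run the detector over overlapping tiles of the full-resolution image instead of a single downscaled frame, so small objects keep their pixels. A minimal sketch follows; the tile size and overlap are illustrative assumptions:

```python
import numpy as np

def tile_image(image: np.ndarray, tile: int = 640, overlap: int = 128):
    """Yield (x, y, crop) windows covering the image with overlap,
    so small objects are not destroyed by global downscaling."""
    h, w = image.shape[:2]
    step = tile - overlap
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            yield x, y, image[y:y + tile, x:x + tile]

# Each crop is passed to the detector at native resolution; the
# resulting boxes are shifted back by (x, y) and merged with NMS.
```
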
5.2. Multi-scale object detection

Multi-scale object detection is a challenging task in the area of object detection. Each layer of a deep CNN generates feature maps, and the information generated by these feature maps is independent of the others. Discriminative details for multi-scale objects can appear in any layer of the backbone network; for small-scale objects they emerge in the preliminary layers and dissipate in the later ones. In both one-stage and two-stage object detection algorithms, predictions are carried out from the topmost layer, which hinders the detection of multi-scale, and especially small, objects. To overcome this difficulty, multi-layer detection and feature fusion have been proposed, combining information fusion with the hierarchical structure of DCNNs [1,45].

Multiple layers can be combined for detection purposes using backbone networks such as Inside-Outside Network (ION) [90], HyperNet [91] and Hypercolumns [92]. Because each layer's feature map captures objects at a different scale, predictions can also be made from several layers independently, as in Multi-scale deep CNN [93], Deeply supervised object detection (DSOD) [94] and SSD [22]. To increase the reliability of multi-scale object detection, multi-layer feature fusion and multi-layer detection can be merged; Feature Pyramid Network (FPN) [28], Deconvolutional single-shot detector (DSSD) [95], Scale transferrable detection network (STDN) [41], Reverse connection with objectness prior networks (RON) [96] and Top-down modulation (TDM) [97] are a few representative frameworks.

5.3. Intraclass variation

Intraclass variation refers to the variation that occurs between different images of the same class. Instances vary in shape, size, color, material, texture etc., and object instances are flexible: they can easily be transformed in terms of scaling and rotation. These are called intrinsic factors. Noticeable effects are also caused by external factors, including improper lighting, weather conditions, illumination and low-quality cameras, and the differences may further stem from occlusion, position, perspective, and so on [2,45,60]. This problem can be overcome by verifying that the training data has a good amount of variety covering all the factors mentioned above [98].

5.4. Efficiency and scalability

As the number of object classes increases, the computational complexity rises, so high computation resources are demanded to handle the huge number of candidate locations inside a single image. The scalability of the detector ensures that it can recognize unseen objects. Since it is impractical to annotate images manually as the number of images and categories grows, weakly supervised techniques are used [2].

5.5. Generalization issues

Generalization problems in object detection emerge when the model either underfits or overfits. Underfitting can be identified in the preliminary stages of the training phase, and this problem can be fixed by increasing the number of training epochs or the complexity of the model. For overfitting, significant methods such as an increase in the training data, early stopping, regularization (L1, L2), or dropout layers can be used [45].

5.6. Class imbalance

The irregular data distribution between classes is referred to as class imbalance; in simple terms, one class contains a disproportionate number of instances, i.e. far more specimens than another. From the viewpoint of object detection, class imbalance can be of two types: foreground-background imbalance and foreground-foreground imbalance. The former occurs during the training process and is independent of the number of categories in the dataset. The latter refers to the imbalance, at the batch level, in the number of samples concerning positive classes. Generally, one-stage object detectors have lower accuracy than two-stage object detectors, and one of the reasons behind this is class imbalance [99]. To solve this issue, the classes can be upsampled or downsampled, or synthetic data can be generated using the Synthetic Minority Oversampling Technique (SMOTE) etc. [100,101].
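
The oversampling route can be sketched with the imbalanced-learn library, assuming per-instance feature vectors and class labels have already been extracted (the data below is synthetic for illustration):

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

# Hypothetical feature vectors and labels with a 9:1 class skew.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
y = np.array([0] * 900 + [1] * 100)

# SMOTE synthesizes new minority-class samples by interpolating
# between existing minority neighbours, balancing the classes.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```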


6. Applications of object detection

Object detection has an extensive scope in real-time systems. It is utilized in various image processing applications such as monitoring systems, robotics, vehicle detection and autonomous driving. Important applications [1,4,102] of object detection are explained as follows:

6.1. Self-driving cars

Self-driving cars are a distinctive application of object detection. A self-driving car can travel safely on the road only if it can detect the objects around it, such as persons, cars and road signs, and thereby determine the next activity to perform, whether to apply the brakes, accelerate or take a turn; for this purpose, the car is trained to perform object detection [102].
6.2. Remote sensing target detection

With the speedy development of remote sensing automation, remote sensing is used in many application areas such as the military field, urban planning, traffic navigation and disaster rescue. In the last few years, the remote sensing detection of targets such as ships, aircraft and roads has become a current research trend, and DCNN-based object detection frameworks such as Faster RCNN [20,21] and SSD [22] are gaining popularity in the remote sensing field.

However, there are challenges in this field. It is difficult to detect remote sensing targets correctly and quickly because remote sensing images carry an immense volume of data, and their huge, intricate backgrounds lead to many wrong detections. Images captured by different sensors also present a high degree of variation, and small object detection can be difficult, making the detection process slow. To rectify this, the resolution of the feature map is increased, and attention mechanisms and feature fusion procedures have also been utilized to enhance small target identification [4,24].

Datasets used for remote sensing target detection are LEVIR (LEarning, VIsion and Remote sensing laboratory) [103], DOTA (Dataset for Object deTection in Aerial images) [104], xView [105], VeDAI (Vehicle Detection in Aerial Imagery) [106], TAS (Things and Stuff) [107] etc.
6.3. Pedestrian detection

Pedestrian detection is a critical application of object detection that is commonly used in video surveillance, autonomous driving etc. Traditional methods of pedestrian detection rely on hand-crafted features such as Histogram of Oriented Gradients (HOG) [9] and Integral Channel Features (ICF) [108], which built a powerful base for object detection. With time, DCNNs have taken their place and become more appropriate for pedestrian detection; the classical HOG pipeline is sketched below for comparison.
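
OpenCV ships a HOG descriptor with a pre-trained linear SVM for people, essentially the Dalal-Triggs pipeline of [9]. The sketch below assumes a local frame named street.jpg, which is a hypothetical input:

```python
import cv2

# HOG descriptor plus OpenCV's default pedestrian SVM.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

image = cv2.imread("street.jpg")          # hypothetical input frame
boxes, weights = hog.detectMultiScale(
    image, winStride=(8, 8), scale=1.05   # sliding-window parameters
)
for (x, y, w, h) in boxes:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
```
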
Difficulties such as dense and occluded pedestrians, small pedestrians, and hard negatives pose great challenges in real applications. Several methods can ameliorate these difficulties: semantic segmentation [109] and the integration of boosted decision trees [110] help with the problem of hard negatives; feature fusion [110] is used for small pedestrian detection; and an ensemble of part detectors [111,112] together with attention mechanisms [113] improves performance under occlusion [4,6,24].

To detect pedestrians, various datasets come into use, like Caltech [114], INRIA [9], KITTI [115], CityPersons [116] etc.

6.4. Event detection

Due to the ubiquitous use of social media, multimedia content grows continuously, and real-world incidents can be discovered through its online availability. Methods such as multimodal graphs [117], multi-domain learning [118] and social interaction graph modeling [119] are used for event detection. The objective of the multimodal graph approach is to identify and detect an event from a collection of 100 million photos or videos and briefly summarize it for consumers. In [119], online social interaction features are integrated using the social affinity of two photos, which helps in detecting events; social affinity uses the interaction graph to estimate the similarity between two pictures in the graph. Multi-domain event detection collects heterogeneous data from multiple domains, such as social media and news media, to detect real-world incidents.

6.5. Medical detection

The task of medical object detection is to identify medical objects within an image. CNN-based algorithms play a key role in medical image classification: they can help doctors analyze the exact area of a wound, thus enhancing the accuracy of medical diagnosis [1].

In [120], a CNN is combined with a Long Short-Term Memory (LSTM) recurrent neural network (RNN) to detect end-systolic and end-diastolic frames in MRI images. To classify skin lesions, a multi-stream CNN is designed by [121], extracting information from images of different resolutions. A challenge for melanoma detection was also organized by [122]. Li, L. et al. [123] proposed an attention mechanism for glaucoma detection, and for automated detection of synapses and automated neuron reconstruction, [124] introduced cellular morphology neural networks (CMNs) [1,24].

6.6. Face detection and face recognition

The objective of face detection is to detect and localize face regions in an image. Every face has a unique structure and attributes. The most popular detector in the early days was the Viola-Jones detector [13,14], which showed remarkable performance in the field of object detection by detecting human faces for the first time while attaining real-time efficiency [13,14].

Face detection has various problems: occlusion, illumination and resolution variations, and multi-scale detection, as some faces may be tiny and others large; human faces also exhibit heterogeneous expressions, poses and skin colors. To solve these problems, various methods have been designed, such as face calibration to improve multi-pose detection [125,126]. Attention mechanisms [127] and part-based detection [128] are used to improve detection of occluded faces, while multi-scale feature fusion and multi-resolution detection are used to enhance multi-scale face detection [4,24].

Several datasets are used for face detection, such as WiderFace [129], FDDB (Face Detection Data set and Benchmark) [130], AFLW (Annotated Facial Landmarks in the Wild) [131], UFDD (Unconstrained Face Detection Dataset) [132] and many more.
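
The Viola-Jones approach [13,14] mentioned above survives in OpenCV's bundled Haar cascades; a minimal sketch, assuming a local image named portrait.jpg (hypothetical):

```python
import cv2

# Haar cascade face detector implementing the Viola-Jones approach.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

image = cv2.imread("portrait.jpg")          # hypothetical input image
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
print("faces found:", len(faces))
```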


6.7. Text detection

Text detection aims to determine whether an image or video contains text and, if so, to recognize and localize it. Text detection has gained much significance in recent years, as it helps visually impaired persons read street signs; it is also utilized in classification, video analysis etc. [4,24].

Text detection faces many problems: text comes in different fonts and languages, perspective distortion or discrete orientations may be present, characters in street images can be blurred, and lighting can be irregular. The problem of blurred text can be addressed using word-level and sentence-level recognition [133], and the problem of font size can be rectified by training on synthetic samples [134].

Datasets such as COCOText [135], the synthetic dataset Syn90k [134], and ICDAR [136] are used for text detection.

6.8. Traffic sign/light detection

In the past few years, the automated detection of traffic lights and traffic signs has drawn a lot of attention, yet it remains a challenging recognition task with many difficulties in the detection process.

Bad weather is the main cause of false detections, as it affects image quality; real-time detection and illumination changes are also challenging. Techniques such as adversarial training [137] and attention mechanisms [138] have been used to refine the detection process in difficult traffic scenes, and CNN-based detectors such as Faster RCNN and the Single Shot Detector (SSD) are used for traffic sign and light detection [4,139–141].

Some popular traffic light and sign datasets are LISA [72], TT100K (Tsinghua-Tencent 100K) [139], GTSDB (German Traffic Sign Detection Benchmark) [142] etc.

7. Future research challenges

Despite the rapid evolution of object detection, there are still many areas where research needs to be done. In this section, various research directions are discussed.

7.1. Weakly supervised detection

The state-of-the-art object detectors use supervised learning frameworks that rely on a huge amount of annotated data, and manually drawing bounding boxes in large numbers is an inefficient and time-consuming process. Weakly supervised learning instead learns detection models from training data in which only a few images are annotated. Methods used for weakly supervised object detection [143,144] include Feedback CNN [145], multiple instance learning (MIL) with a non-convex loss function [146], the min-entropy latent model (MELM) [147] and semantic image segmentation [148].

7.2. RGB-D detection

Due to the popularity of research in autonomous driving, depth information has been added to images to understand them better. A LIDAR point cloud localizes the position of objects accurately in 3D space by using depth information. To correctly place ground-truth 3D bounding boxes around objects, the 3D object proposal network of [149] can be referred to [6,24].

7.3. Video object detection

Detecting objects in real-time video, such as surveillance or autonomous-driving footage, is of great importance at present. It faces difficulties such as poor image quality, which leads to low accuracy. To associate objects across different frames and understand an object's actions, several video detectors incorporate temporal factors. These include deep feature flow [150], flow-guided feature aggregation (FGFA) [151], spatial-temporal memory networks (STMN), and a tubelet proposal network [152] for spatiotemporal proposals, whose temporal information is integrated using an LSTM.

7.4. Automatic Neural Architecture Search (NAS)

The utilization of deep learning models is becoming more popular day by day. Backbone architectures can be designed with AutoML (Automated Machine Learning), which is already being used in object detection for specific purposes; NAS is a part of AutoML, alongside transfer learning and feature engineering. Using NAS to reduce human involvement when outlining a model makes AutoML a promising future research direction [2,4,153,154].

7.5. Scale adaption

Objects generally vary across scales, as can be seen in face and pedestrian detection. To train multi-scale detectors, Feature Pyramid Networks (FPN) [28] and Generative Adversarial Networks (GAN) [155] produce feature pyramids with deep understanding. For scale-invariant detectors, robust backbone architectures, RON [96] and Online hard example mining (OHEM) [97] methods can be used [6].
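
FPN-style multi-scale detection is readily available off the shelf; as one possible illustration (assuming a recent torchvision, not an API prescribed by this survey), a Faster R-CNN with a ResNet-50 FPN backbone can be loaded and run as follows:

```python
import torch
import torchvision

# Faster R-CNN with a ResNet-50 FPN backbone: the feature pyramid
# lets one detection head serve several object scales.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# One 3xHxW image with values in [0, 1]; the model returns a dict of
# boxes, labels and confidence scores per image.
image = torch.rand(3, 480, 640)
with torch.no_grad():
    prediction = model([image])[0]
print(prediction["boxes"].shape, prediction["scores"].shape)
```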

Table 5
Performance comparison on the PASCAL VOC 2007 and 2012 test datasets.

Type | Method | Model used | No. of proposals generated | FPS | Training data (VOC 2007) | mAP@0.5 (VOC 2007 test) | Training data (VOC 2012) | mAP@0.5 (VOC 2012 test)
2-stage | RCNN [25] | AlexNet | 2000 | 0.03 | 07 | 58.5 | 12 | 53.3
2-stage | SPP-Net [26] | ZFNet | 2000 | 0.44 | 07 | 59.2 | - | -
2-stage | Fast RCNN [27] | VGG16 | 2000 | 0.5 | 07 | 66.9 | 12 | 65.7
2-stage | Fast RCNN [27] | VGG16 | 2000 | 0.5 | 07+12 | 70.0 | 07++12 | 68.4
2-stage | Faster RCNN [20,21] | VGG16 | 300 | 5 | 07 | 69.9 | 12 | 67.0
2-stage | Faster RCNN [20,21] | VGG16 | 300 | 5 | 07+12 | 73.2 | 07++12 | 70.4
2-stage | Faster RCNN [20,21] | VGG16 | 300 | 5 | COCO+07+12 | 78.8 | 07++12+COCO | 75.9
1-stage | YOLO [23] | - | 98 | 45 | 07+12 | 63.4 | 07++12 | 57.9
1-stage | SSD300 [22,34] | VGG16 | 8732 | 46 | 07 | 68.0 | 07++12 | 72.4
1-stage | SSD300 [22,34] | VGG16 | 8732 | 46 | 07+12 | 74.3 | 07++12+COCO | 77.5
1-stage | SSD300 [22,34] | VGG16 | 8732 | 46 | 07+12+COCO | 79.6 | - | -
1-stage | SSD512 [22] | VGG16 | 24564 | 19 | 07 | 71.6 | 07++12 | 74.9
1-stage | SSD512 [22] | VGG16 | 24564 | 19 | 07+12 | 76.8 | 07++12+COCO | 80.0
1-stage | SSD512 [22] | VGG16 | 24564 | 19 | 07+12+COCO | 81.6 | - | -
1-stage | YOLOv2 [32] | Darknet19 | - | 40 | 07+12 | 78.6 | 07++12 | 73.4
1-stage | YOLOv5x 692 [34,35,37] | CSPDarknet | - | 140 | 07+12 | 91.0 | - | -


Table 6
Description of the training data used in Table 5.

Training data | Description
07 | VOC 2007 trainval.
07+12 | Union of VOC 2007 trainval and VOC 2012 trainval.
07+12+COCO | First trained on COCO trainval35k, then fine-tuned on 07+12.
07++12 | Union of VOC 2007 trainval+test and VOC 2012 trainval.
07++12+COCO | First trained on COCO trainval35k, then fine-tuned on 07++12.

Table 7
Performance comparison on the MS COCO 2015 and 2017 test-dev datasets.

MS COCO test-dev 2015:

Type | Method | Model used | No. of proposals generated | FPS | Training data | mAP@0.5 | mAP@[.5,.95]
2-stage | Fast RCNN [27] | VGG16 | 2000 | 0.03 | train | 35.9 | 19.7
2-stage | Faster RCNN [20,21] | VGG16 | 300 | 5 | trainval | 42.7 | 21.9
2-stage | Mask RCNN [29] | ResNeXt-101-FPN | - | - | trainval35k | 62.3 | 39.8
1-stage | SSD300 [22] | VGG16 | 8732 | 46 | trainval35k | 41.2 | 23.2
1-stage | SSD512 [22] | VGG16 | 24564 | 19 | trainval35k | 46.5 | 26.8
1-stage | YOLOv2 [32] | Darknet19 | - | 40 | trainval35k | 44.0 | 21.6

MS COCO test-dev 2017:

Type | Method | Model used | No. of proposals generated | FPS | Training data | mAP@0.5 | mAP@[.5,.95]
1-stage | YOLOv3 320 [33] | Darknet53 | - | 45 | trainval | 51.5 | 28.2
1-stage | YOLOv4 512 [36] | CSPDarknet53 | - | 31 | trainval | 64.9 | 43.0
1-stage | YOLOv5x 640 [34,35,37] | CSPDarknet | - | 140 | trainval | 68.9 | 50.7

7.6. Optimization

The structure of DCNNs can be optimized using various meta-heuristic optimization algorithms. These algorithms can improve convolutional neural networks in diverse research tasks and applications, such as fine-tuning DCNN hyperparameters or training the DCNN itself, so the applicability of meta-heuristic techniques deserves further exploration. Optimization techniques such as [156–160] can be used; readers can also refer to [161] for more details.
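
The cited works use far more sophisticated meta-heuristics, but all share the same evaluate-and-select loop. As a toy illustration only (the search space and the evaluation stub below are invented for the sketch), even a plain random search over DCNN hyperparameters follows that loop:

```python
import random

# Hypothetical search space for DCNN hyperparameters.
space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [8, 16, 32],
    "dropout": [0.1, 0.3, 0.5],
}

def evaluate(config):
    """Placeholder: train briefly and return validation mAP."""
    return random.random()  # stand-in for a real training run

# Sample 20 random configurations and keep the best-scoring one;
# meta-heuristics replace the blind sampling with guided updates.
best = max(
    ({k: random.choice(v) for k, v in space.items()} for _ in range(20)),
    key=evaluate,
)
print("best config:", best)
```
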
8. Comparative results and discussion

In this section, various object detection algorithms are compared on two popular datasets, the PASCAL VOC dataset [74] and the MS COCO dataset [75]. The comparison is based on the results reported in each detector's respective paper, and the models are compared using mean average precision (mAP). The choice of backbone network used to extract features has a great impact on the performance of the models.

Table 5 compares the performance of object detectors on the PASCAL VOC 2007 and 2012 test sets. It gives brief details about the backbone model used, the number of region proposals and the frames per second (fps), all of which affect detector performance. PASCAL VOC reports mAP@0.5, where 0.5 is the threshold (t); as discussed in section 4.2.1, IoU ≥ 0.5 denotes a correctly identified prediction. The table shows that YOLOv5x performs better than the others on the VOC 2007 test set with an accuracy of 91%, while on the VOC 2012 test set SSD512 achieves the highest performance with an accuracy of 80%.
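
For reference, the IoU test underlying these comparisons can be written in a few lines; this is a minimal sketch with boxes given as (x1, y1, x2, y2) corners:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction counts as correct at threshold t = 0.5 when
# iou(pred, gt) >= 0.5, as in PASCAL VOC's mAP@0.5.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.1428
```
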
The training data listed in Table 5 is described in Table 6.

In Table 7, the performance comparison is evaluated on the COCO 2015 and 2017 test-dev datasets. The COCO dataset uses the metric mAP@[0.5,0.95], with thresholds ranging from 0.5 to 0.95 in steps of 0.05. Here again YOLOv5x outperforms all other models on the COCO 2017 test-dev dataset, with an mAP@[0.5,0.95] of 50.7. The mAP values for YOLOv5x are taken from its official github repository [37], as no formal paper is available for it.

9. Conclusion

Deep learning based CNNs have accomplished great development in recent years, and object detection progressed quickly following the introduction of deep learning. This review paper provides a thorough analysis of state-of-the-art object detection models (one-stage and two-stage) and backbone architectures, and evaluates the performance of models using standard datasets and metrics. Challenges of object detection are also discussed, along with applications and future research directions, to provide in-depth coverage of object detection. It is clear from the results that, even after achieving remarkable performance in detecting objects, there is still considerable scope for improvement.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

References

[1] Y. Xiao, Z. Tian, J. Yu, Y. Zhang, S. Liu, S. Du, X. Lan, A review of object detection based on deep learning, Multimed. Tools Appl. 79 (33) (2020) 23729–23791.
[2] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, M. Pietikäinen, Deep learning for generic object detection: a survey, Int. J. Comput. Vis. 128 (2) (2020) 261–318.
[3] X. Zhang, Y.-H. Yang, Z. Han, H. Wang, C. Gao, Object class detection: a survey, ACM Comput. Surv. 46 (1) (2013) 1–53.


[4] Z. Zou, Z. Shi, Y. Guo, J. Ye, Object detection in 20 years: a survey, arXiv preprint, arXiv:1905.05055, 2019.
[5] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M.S. Lew, Deep learning for visual understanding: a review, Neurocomputing 187 (2016) 27–48.
[6] Z.-Q. Zhao, P. Zheng, S.-t. Xu, X. Wu, Object detection with deep learning: a review, IEEE Trans. Neural Netw. Learn. Syst. 30 (11) (2019) 3212–3232.
[7] A.K. Shetty, I. Saha, R.M. Sanghvi, S.A. Save, Y.J. Patel, A review: object detection models, in: 2021 6th International Conference for Convergence in Technology (I2CT), IEEE, 2021, pp. 1–8.
[8] S. Mohan, 6 different types of object detection algorithms in nutshell, https://machinelearningknowledge.ai/different-types-of-object-detection-algorithms/, Jun. 2020. (Accessed 11 February 2022).
[9] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, IEEE, 2005, pp. 886–893.
[10] R. Lienhart, J. Maydt, An extended set of Haar-like features for rapid object detection, in: Proceedings. International Conference on Image Processing, vol. 1, IEEE, 2002.
[11] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110.
[12] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. 55 (1) (1997) 119–139.
[13] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, vol. 1, IEEE, 2001.
[14] P. Viola, M.J. Jones, Robust real-time face detection, Int. J. Comput. Vis. 57 (2) (2004) 137–154.
[15] H. Bay, T. Tuytelaars, L.V. Gool, Surf: speeded up robust features, in: European Conference on Computer Vision, Springer, 2006, pp. 404–417.
[16] P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, in: 2008 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2008, pp. 1–8.
[17] W.Y. Kyaw, Histogram of oriented gradients, https://waiyankyawmc.medium.com/histogram-of-oriented-gradients-90567ea6490a, May 2021. (Accessed 9 April 2022).
[18] D.S. Aljutaili, R.A. Almutlaq, S.A. Alharbi, D.M. Ibrahim, A speeded up robust scale-invariant feature transform currency recognition algorithm, Int. J. Comput. Inf. Eng. 12 (6) (2018) 365–370.
[19] AaronWard, Facial detection — understanding viola Jones' algorithm, https://medium.com/@aaronward6210/facial-detection-understanding-viola-jones-algorithm-116d1a9db218, Jan. 2020. (Accessed 29 January 2022).
[20] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: towards real-time object detection with region proposal networks, Adv. Neural Inf. Process. Syst. 28 (2015).
[21] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell. 39 (6) (2017) 1137–1149.
[22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A.C. Berg, Ssd: single shot multibox detector, in: European Conference on Computer Vision, Springer, 2016, pp. 21–37.
[23] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[24] L. Jiao, F. Zhang, F. Liu, S. Yang, L. Li, Z. Feng, R. Qu, A survey of deep learning-based object detection, IEEE Access 7 (2019) 128837–128868.
[25] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[26] K. He, X. Zhang, S. Ren, J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. 37 (9) (2015) 1904–1916.
[27] R. Girshick, Fast r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[28] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, S. Belongie, Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[29] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[30] C. Szegedy, A. Toshev, D. Erhan, Deep neural networks for object detection, Adv. Neural Inf. Process. Syst. 26 (2013).
[31] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, Overfeat: integrated recognition, localization and detection using convolutional networks, arXiv preprint, arXiv:1312.6229, 2013.
[32] J. Redmon, A. Farhadi, Yolo9000: better, faster, stronger, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[33] J. Redmon, A. Farhadi, Yolov3: an incremental improvement, arXiv preprint, arXiv:1804.02767, 2018.
[34] J. Solawetz, YOLOv5 new version - improvements and evaluation, https://blog.roboflow.com/yolov5-improvements-and-evaluation/, Jun. 2020. (Accessed 1 April 2022).
[35] D. Thuan, Evolution of yolo algorithm and yolov5: the state-of-the-art object detection algorithm, 2021.
[36] A. Bochkovskiy, C.-Y. Wang, H.-Y.M. Liao, Yolov4: optimal speed and accuracy of object detection, arXiv preprint, arXiv:2004.10934, 2020.
[37] Yolov5, https://github.com/ultralytics/yolov5. (Accessed 6 March 2022).
[38] A. Boukerche, Z. Hou, Object detection using deep learning methods in traffic scenarios, ACM Comput. Surv. 54 (2) (2021) 1–35.
[39] PulkitS, Introduction to object detection algorithms, https://www.analyticsvidhya.com/blog/2018/10/a-step-by-step-introduction-to-the-basic-object-detection-algorithms-part-1/, Oct. 2018. (Accessed 6 March 2022).
[40] S. Park, A guide to two-stage object detection: R-CNN, FPN, mask R-CNN, https://medium.com/codex/a-guide-to-two-stage-object-detection-r-cnn-fpn-mask-r-cnn-and-more-54c2e168438c, Jul. 2021. (Accessed 15 March 2022).
[41] P. Zhou, B. Ni, C. Geng, J. Hu, Y. Xu, Scale-transferrable object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 528–537.
[42] J.R. Uijlings, K.E. Van De Sande, T. Gevers, A.W. Smeulders, Selective search for object recognition, Int. J. Comput. Vis. 104 (2) (2013) 154–171.
[43] P. Arbeláez, J. Pont-Tuset, J.T. Barron, F. Marques, J. Malik, Multiscale combinatorial grouping, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 328–335.
[44] C.L. Zitnick, P. Dollár, Edge boxes: locating object proposals from edges, in: European Conference on Computer Vision, Springer, 2014, pp. 391–405.
[45] E. Arulprakash, M. Aruldoss, A study on generic object detection with emphasis on future research directions, J. King Saud Univ., Comput. Inf. Sci. (2021).
[46] J. Hui, Understanding feature pyramid networks for object detection (FPN), https://jonathan-hui.medium.com/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c, Mar. 2018. (Accessed 21 February 2022).
[47] Y. Liu, P. Sun, N. Wergeles, Y. Shang, A survey and performance evaluation of deep learning methods for small object detection, Expert Syst. Appl. 172 (2021) 114602.
[48] F. Sultana, A. Sufian, P. Dutta, A review of object detection models based on convolutional neural network, in: Intelligent Computing: Image Processing Based Applications, 2020, pp. 1–16.
[49] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Comput. 1 (4) (1989) 541–551.
[50] C. Gentile, M.K. Warmuth, Linear hinge loss and average margin, Adv. Neural Inf. Process. Syst. 11 (1998).
[51] K. Janocha, W.M. Czarnecki, On loss functions for deep neural networks in classification, arXiv preprint, arXiv:1702.05659, 2017.
[52] P.-T. De Boer, D.P. Kroese, S. Mannor, R.Y. Rubinstein, A tutorial on the cross-entropy method, Ann. Oper. Res. 134 (1) (2005) 19–67.
[53] J. Shetty, P.S. Jogi, Study on different region-based object detection models applied to live video stream and images using deep learning, in: International Conference on ISMAC in Computational Vision and Bio-Engineering, Springer, 2018, pp. 51–60.
[54] C. Tang, Y. Feng, X. Yang, C. Zheng, Y. Zhou, The object detection based on deep learning, in: 2017 4th International Conference on Information Science and Control Engineering (ICISCE), IEEE, 2017, pp. 723–728.
[55] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, D. Ren, Distance-iou loss: faster and better learning for bounding box regression, in: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, 2020, pp. 12993–13000.
[56] G. Ghiasi, T.-Y. Lin, Q.V. Le, Dropblock: a regularization method for convolutional networks, Adv. Neural Inf. Process. Syst. 31 (2018).
[57] I. Loshchilov, F. Hutter, Sgdr: stochastic gradient descent with warm restarts, arXiv preprint, arXiv:1608.03983, 2016.
[58] S. Liu, L. Qi, H. Qin, J. Shi, J. Jia, Path aggregation network for instance segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759–8768.
[59] C.-Y. Wang, H.-Y.M. Liao, Y.-H. Wu, P.-Y. Chen, J.-W. Hsieh, I.-H. Yeh, Cspnet: a new backbone that can enhance learning capability of cnn, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 390–391.
[60] S.S.A. Zaidi, M.S. Ansari, A. Aslam, N. Kanwal, M. Asghar, B. Lee, A survey of modern deep learning based object detection models, Digit. Signal Process. (2022) 103514.
[61] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[62] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500.
[63] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. 25 (2012).
[64] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European Conference on Computer Vision, Springer, 2014, pp. 818–833.
[65] A.R. Pathak, M. Pandey, S. Rautaray, Application of deep learning for object detection, Proc. Comput. Sci. 132 (2018) 1706–1717.


[66] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint, arXiv:1409.1556, 2014.
[67] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[68] M. Lin, Q. Chen, S. Yan, Network in network, arXiv preprint, arXiv:1312.4400, 2013.
[69] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, PMLR, 2015, pp. 448–456.
[70] A. Khan, A. Sohail, U. Zahoora, A.S. Qureshi, A survey of the recent architectures of deep convolutional neural networks, Artif. Intell. Rev. 53 (8) (2020) 5455–5516.
[71] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[72] A. Mogelmose, M.M. Trivedi, T.B. Moeslund, Vision-based traffic sign detection and analysis for intelligent driver assistance systems: perspectives and survey, IEEE Trans. Intell. Transp. Syst. 13 (4) (2012) 1484–1497.
[73] A. Krizhevsky, G. Hinton, et al., Learning multiple layers of features from tiny images, 2009.
[74] M. Everingham, L. Van Gool, C.K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, Int. J. Comput. Vis. 88 (2) (2010) 303–338.
[75] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft coco: common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755.
[76] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: a large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[77] A. Torralba, R. Fergus, W.T. Freeman, 80 million tiny images: a large data set for nonparametric object and scene recognition, IEEE Trans. Pattern Anal. Mach. Intell. 30 (11) (2008) 1958–1970.
[78] J. Xiao, K.A. Ehinger, J. Hays, A. Torralba, A. Oliva, Sun database: exploring a large collection of scene categories, Int. J. Comput. Vis. 119 (1) (2016) 3–22.
[79] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al., The open images dataset v4, Int. J. Comput. Vis. 128 (7) (2020) 1956–1981.
[80] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., Imagenet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (3) (2015) 211–252.
[81] R. Padilla, W.L. Passos, T.L. Dias, S.L. Netto, E.A. da Silva, A comparative analysis of object detection metrics with a companion open-source toolkit, Electronics 10 (3) (2021) 279.
[82] A. Gad, Evaluating object detection models using mean average precision, https://www.kdnuggets.com/2021/03/evaluating-object-detection-models-using-mean-average-precision.html. (Accessed 7 August 2022).
[83] A. Gad, Evaluating deep learning models: the confusion matrix, accuracy, precision, and recall, https://www.kdnuggets.com/2021/02/evaluating-deep-learning-models-confusion-matrix-accuracy-precision-recall.html. (Accessed 2 August 2022).
[84] R. Padilla, S.L. Netto, E.A. Da Silva, A survey on performance metrics for object-detection algorithms, in: 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), IEEE, 2020, pp. 237–242.
[85] J. Brownlee, How to calculate precision, recall, and f-measure for imbalanced classification, https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/, Jan. 2020. (Accessed 1 May 2022).
[86] Precision-Recall, https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html. (Accessed 17 April 2022).
[87] J. Brownlee, How to use ROC curves and precision-recall curves for classification in python, https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/, Aug. 2018. (Accessed 17 April 2022).
[88] S. Narkhede, Understanding AUC - ROC curve, https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5, Jun. 2018. (Accessed 12 April 2022).
[89] J. Solawetz, Small object detection guide, https://blog.roboflow.com/detect-small-objects/, Aug. 2020. (Accessed 7 August 2022).
[90] S. Bell, C.L. Zitnick, K. Bala, R. Girshick, Inside-outside net: detecting objects in context with skip pooling and recurrent neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2874–2883.
[91] T. Kong, A. Yao, Y. Chen, F. Sun, Hypernet: towards accurate region proposal generation and joint object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 845–853.
[92] B. Hariharan, P. Arbelaez, R. Girshick, J. Malik, Object instance segmentation and fine-grained localization using hypercolumns, IEEE Trans. Pattern Anal. Mach. Intell. 39 (4) (2016) 627–639.
[93] Z. Cai, Q. Fan, R.S. Feris, N. Vasconcelos, A unified multi-scale deep convolutional neural network for fast object detection, in: European Conference on Computer Vision, Springer, 2016, pp. 354–370.
[94] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, X. Xue, Dsod: learning deeply supervised object detectors from scratch, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1919–1927.
[95] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, A.C. Berg, Dssd: deconvolutional single shot detector, arXiv preprint, arXiv:1701.06659, 2017.
[96] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, Y. Chen, Ron: reverse connection with objectness prior networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5936–5944.
[97] A. Shrivastava, R. Sukthankar, J. Malik, A. Gupta, Beyond skip connections: top-down modulation for object detection, arXiv preprint, arXiv:1612.06851, 2016.
[98] B. Dipert, Overcome these 6 problems with object detection, https://www.edge-ai-vision.com/2022/02/overcome-these-6-problems-with-object-detection/, Feb. 2022. (Accessed 24 April 2022).
[99] K. Oksuz, B.C. Cam, S. Kalkan, E. Akbas, Imbalance problems in object detection: a review, IEEE Trans. Pattern Anal. Mach. Intell. 43 (10) (2020) 3388–3415.
[100] S. Mazumder, 5 techniques to handle imbalanced data for a classification problem, https://www.analyticsvidhya.com/blog/2021/06/5-techniques-to-handle-imbalanced-data-for-a-classification-problem/, Jun. 2021. (Accessed 25 April 2022).
[101] S. Kumar, 5 techniques to work with imbalanced data in machine learning, https://towardsdatascience.com/5-techniques-to-work-with-imbalanced-data-in-machine-learning-80836d45d30c, Sep. 2021. (Accessed 25 April 2022).
[102] A. Vahab, M.S. Naik, P.G. Raikar, S. Prasad, Applications of object detection system, Int. J. Res. Eng. Technol. 6 (4) (2019) 4186–4192.
[103] Z. Zou, Z. Shi, Random access memories: a new paradigm for target detection in high resolution aerial remote sensing images, IEEE Trans. Image Process. 27 (3) (2017) 1100–1111.
[104] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, L. Zhang, Dota: a large-scale dataset for object detection in aerial images, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3974–3983.
[105] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, Y. Bulatov, B. McCord, xview: objects in context in overhead imagery, arXiv preprint, arXiv:1802.07856, 2018.
[106] S. Razakarivony, F. Jurie, Vehicle detection in aerial imagery: a small target detection benchmark, J. Vis. Commun. Image Represent. 34 (2016) 187–203.
[107] G. Heitz, D. Koller, Learning spatial context: using stuff to find things, in: European Conference on Computer Vision, Springer, 2008, pp. 30–43.
[108] P. Dollár, Z. Tu, P. Perona, S. Belongie, Integral channel features, 2009.
[109] Y. Tian, P. Luo, X. Wang, X. Tang, Pedestrian detection aided by deep learning semantic tasks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5079–5087.
[110] L. Zhang, L. Lin, X. Liang, K. He, Is faster r-cnn doing well for pedestrian detection?, in: European Conference on Computer Vision, Springer, 2016, pp. 443–457.
[111] Y. Tian, P. Luo, X. Wang, X. Tang, Deep learning strong parts for pedestrian detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1904–1912.
[112] W. Ouyang, H. Zhou, H. Li, Q. Li, J. Yan, X. Wang, Jointly learning deep features, deformable parts, occlusion and classification for pedestrian detection, IEEE Trans. Pattern Anal. Mach. Intell. 40 (8) (2017) 1874–1887.
[113] S. Zhang, J. Yang, B. Schiele, Occluded pedestrian detection through guided attention in cnns, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6995–7003.
[114] P. Dollar, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: an evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. 34 (4) (2011) 743–761.
[115] A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The kitti vision benchmark suite, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012, pp. 3354–3361.
[116] S. Zhang, R. Benenson, B. Schiele, Citypersons: a diverse dataset for pedestrian detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3213–3221.
[117] M. Schinas, S. Papadopoulos, G. Petkos, Y. Kompatsiaris, P.A. Mitkas, Multimodal graph-based event detection and summarization in social media streams, in: Proceedings of the 23rd ACM International Conference on Multimedia, 2015, pp. 189–192.
[118] Z. Yang, Q. Li, W. Liu, J. Lv, Shared multi-view data representation for multi-domain event detection, IEEE Trans. Pattern Anal. Mach. Intell. 42 (5) (2019) 1243–1256.
[119] Y. Wang, H. Sundaram, L. Xie, Social event detection with interaction graph modeling, in: Proceedings of the 20th ACM International Conference on Multimedia, 2012, pp. 865–868.
[120] B. Kong, Y. Zhan, M. Shin, T. Denny, S. Zhang, Recognizing end-diastole and end-systole frames via deep temporal regression network, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2016, pp. 264–272.
[121] J. Kawahara, G. Hamarneh, Multi-resolution-tract cnn with hybrid pretrained and skin-lesion trained layers, in: International Workshop on Machine Learning in Medical Imaging, Springer, 2016, pp. 164–171.
[122] N.C. Codella, D. Gutman, M.E. Celebi, B. Helba, M.A. Marchetti, S.W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler, et al., Skin lesion analysis toward melanoma detection: a challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic), in: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), IEEE, 2018, pp. 168–172.
[123] L. Li, M. Xu, X. Wang, L. Jiang, H. Liu, Attention based glaucoma detection: a large-scale database and cnn model, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10571–10580.
[124] P.J. Schubert, S. Dorkenwald, M. Januszewski, V. Jain, J. Kornfeld, Learning cellular morphology with neural networks, Nat. Commun. 10 (1) (2019) 1–12.
[125] X. Shi, S. Shan, M. Kan, S. Wu, X. Chen, Real-time rotation-invariant face detection with progressive calibration networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2295–2303.
[126] D. Chen, G. Hua, F. Wen, J. Sun, Supervised transformer network for efficient face detection, in: European Conference on Computer Vision, Springer, 2016, pp. 122–138.
[127] J. Wang, Y. Yuan, G. Yu, Face attention network: an effective face detector for the occluded faces, arXiv preprint, arXiv:1711.07246, 2017.
[128] S. Yang, P. Luo, C.C. Loy, X. Tang, Faceness-net: face detection through deep facial part responses, IEEE Trans. Pattern Anal. Mach. Intell. 40 (8) (2017) 1845–1859.
[129] S. Yang, P. Luo, C.-C. Loy, X. Tang, Wider face: a face detection benchmark, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5525–5533.
[130] V. Jain, E. Learned-Miller, Fddb: a benchmark for face detection in unconstrained settings, Tech. Rep., UMass Amherst technical report, 2010.
[131] M. Koestinger, P. Wohlhart, P.M. Roth, H. Bischof, Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization, in: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), IEEE, 2011, pp. 2144–2151.
[132] H. Nada, V.A. Sindagi, H. Zhang, V.M. Patel, Pushing the limits of unconstrained face detection: a challenge dataset and baseline results, in: 2018 IEEE 9th International Conference on Biometrics Theory, Applications and Systems (BTAS), IEEE, 2018, pp. 1–10.
[133] Z. Wojna, A.N. Gorban, D.-S. Lee, K. Murphy, Q. Yu, Y. Li, J. Ibarz, Attention-based extraction of structured information from street view imagery, in: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, IEEE, 2017, pp. 844–850.
[134] M. Jaderberg, K. Simonyan, A. Vedaldi, A. Zisserman, Synthetic data and artificial neural networks for natural scene text recognition, arXiv preprint, arXiv:1406.2227, 2014.
[135] A. Veit, T. Matera, L. Neumann, J. Matas, S. Belongie, Coco-text: dataset and benchmark for text detection and recognition in natural images, arXiv preprint, arXiv:1601.07140, 2016.
[136] S. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, R. Young, Icdar 2003 robust reading competitions, in: Seventh International Conference on Document Analysis and Recognition, 2003, Proceedings, 2003, pp. 682–687.
[137] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, S. Yan, Perceptual generative adversarial networks for small object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1222–1230.
[138] Y. Lu, J. Lu, S. Zhang, P. Hall, Traffic signal detection and classification in street views using an attention model, Comput. Vis. Media 4 (3) (2018) 253–266.
[139] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, S. Hu, Traffic-sign detection and classification in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2110–2118.
[140] K. Behrendt, L. Novak, R. Botros, A deep learning approach to traffic lights: detection, tracking, and classification, in: 2017 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2017, pp. 1370–1377.
[141] D. Li, D. Zhao, Y. Chen, Q. Zhang, Deepsign: deep learning based traffic sign recognition, in: 2018 International Joint Conference on Neural Networks (IJCNN), IEEE, 2018, pp. 1–6.
[142] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, C. Igel, Detection of traffic signs in real-world images: the German traffic sign detection benchmark, in: The 2013 International Joint Conference on Neural Networks (IJCNN), IEEE, 2013, pp. 1–8.
[143] H. Bilen, A. Vedaldi, Weakly supervised deep detection networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2846–2854.
[144] A. Diba, V. Sharma, A. Pazandeh, H. Pirsiavash, L. Van Gool, Weakly supervised cascaded convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 914–922.
[145] C. Cao, Y. Huang, Y. Yang, L. Wang, Z. Wang, T. Tan, Feedback convolutional neural network for visual localization and segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 41 (7) (2018) 1627–1640.
[146] F. Wan, C. Liu, W. Ke, X. Ji, J. Jiao, Q. Ye, C-mil: continuation multiple instance learning for weakly supervised object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2199–2208.
[147] F. Wan, P. Wei, J. Jiao, Z. Han, Q. Ye, Min-entropy latent model for weakly supervised object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1297–1306.
[148] H. Noh, S. Hong, B. Han, Learning deconvolution network for semantic segmentation, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1520–1528.
[149] X. Chen, K. Kundu, Y. Zhu, A.G. Berneshawi, H. Ma, S. Fidler, R. Urtasun, 3d object proposals for accurate object class detection, Adv. Neural Inf. Process. Syst. 28 (2015).
[150] X. Zhu, Y. Xiong, J. Dai, L. Yuan, Y. Wei, Deep feature flow for video recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2349–2358.
[151] X. Zhu, Y. Wang, J. Dai, L. Yuan, Y. Wei, Flow-guided feature aggregation for video object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 408–417.
[152] K. Kang, H. Li, T. Xiao, W. Ouyang, J. Yan, X. Liu, X. Wang, Object detection in videos with tubelet proposal networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 727–735.
[153] M. Heller, What is neural architecture search? AutoML for deep learning, https://www.infoworld.com/article/3648408/what-is-neural-architecture-search.html, Jan. 2022. (Accessed 26 February 2022).
[154] Everything you need to know about AutoML and neural architecture search, https://www.kdnuggets.com/2018/09/everything-need-know-about-automl-neural-architecture-search.html. (Accessed 26 February 2022).
[155] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Adv. Neural Inf. Process. Syst. 27 (2014).
[156] S. Mahajan, A.K. Pandit, Hybrid method to supervise feature selection using signal processing and complex algebra techniques, Multimed. Tools Appl. (2021) 1–22.
[157] S. Mahajan, L. Abualigah, A.K. Pandit, M. Altalhi, Hybrid aquila optimizer with arithmetic optimization algorithm for global optimization tasks, Soft Comput. 26 (10) (2022) 4863–4881.
[158] S. Mahajan, L. Abualigah, A.K. Pandit, A. Nasar, M. Rustom, H.A. Alkhazaleh, M. Altalhi, Fusion of modern meta-heuristic optimization methods using arithmetic optimization algorithm for global optimization tasks, Soft Comput. (2022) 1–15.
[159] S. Mahajan, L. Abualigah, A.K. Pandit, Hybrid arithmetic optimization algorithm with hunger games search for global optimization, Multimed. Tools Appl. (2022) 1–24.
[160] S. Mahajan, A.K. Pandit, Image segmentation and optimization techniques: a short overview, Medicon Eng. Themes 2 (2) (2022) 47–49.
[161] M. Abd Elaziz, A. Dahou, L. Abualigah, L. Yu, M. Alshinwan, A.M. Khasawneh, S. Lu, Advanced metaheuristic optimization techniques in applications of deep neural networks: a review, Neural Comput. Appl. 33 (21) (2021) 14079–14099.

Ravpreet Kaur received her B.Tech degree from Chandigarh University, India and M.Tech degree from CGC Landran, India in Computer Science and Engineering. Currently she is pursuing a Ph.D from UIET, Panjab University, Chandigarh. Her areas of interest include Deep Learning and Machine Learning.

Sarbjeet Singh is a Professor at University Institute of Engineering and Technology, Panjab University, India. He received his B.Tech degree in Computer Science and Engineering from Punjab Technical University, Jalandhar, India, in 2001 and Ph.D. degree in Computer Science and Engineering from Thapar University, Patiala, India, in 2009. His research areas include Machine Learning, Deep Learning, Object Detection, Activity Recognition, Cloud Computing, Social Network Analysis and Sentiment Analysis.
