
YMER || ISSN: 0044-0477 || Volume 23, Issue 01 (Jan 2024) || http://ymerdigital.com

Object Detection in Remote Sensing Images Using YOLOv4

Amir Mokhtar Hannane*, Faiza Oudjedi Damerdji, Mohammed Amine Baïch, Mohamed Anis Benallal
Department of Computer Science, Faculty of Mathematics and Computer Science, University of Sciences and Technology of Oran Mohamed Boudiaf (USTO-MB), Oran, Algeria.
[email protected]

Abstract
This article assesses object detection performance in satellite images using the YOLOv4 network. As the quantity and quality of satellite images increase, intelligent observation methods become crucial. Deep learning, particularly Convolutional Neural Networks (CNNs), has excelled in computer vision, prompting its exploration in remote sensing imagery. The study evaluates the effectiveness of YOLOv4 in detecting objects through tests on the DIOR and HRRSD datasets. YOLOv4 outperforms other CNN models in detection accuracy, showcasing its potential for efficient object detection in satellite images. The evaluation involves data preprocessing, model training, and a comprehensive analysis of the results from both datasets. YOLOv4's strengths lie in handling diverse scenarios and rapid learning, as identified through literature analysis. The study demonstrates YOLOv4's applicability and superiority in satellite image object detection, offering more accurate and efficient methods for remote sensing applications. The insights gained can guide future studies and applications in remote sensing and computer vision, contributing to improved observation techniques for satellite imagery.

Keywords: Object Detection, Deep Learning, Convolutional Neural Network (CNN), YOLOv4, Remote Sensing Images

1. Introduction
The human visual system possesses remarkable speed and accuracy, enabling us to
effortlessly perform complex visual tasks by unconsciously recognizing objects, their
spatial relationships, and interactions. However, machines, despite recent advancements
in hardware and machine learning, still require extensive time and training examples to
achieve comparable object identification capabilities. The field of computer vision has
witnessed significant progress, making it more accessible and intuitive than ever before.
This paper addresses the critical task of object detection in satellite images. To tackle
this challenge, fast and accurate convolutional neural networks (CNNs) have been
developed [1].

These CNNs effectively combine localization and classification tasks by drawing bounding boxes around objects in images and assigning corresponding class labels. Some
notable architectures in this domain include AlexNet [2], ZFNet [3], VGGNet [4],
GoogLeNet [5], Inception series [6, 7, 8], ResNet [9], DenseNet [10], and SENet [11].

Object detectors can be categorized into two types: region proposal-based methods and regression-based methods. The former, exemplified by R-CNN [12], Fast R-CNN [13], and Faster R-CNN [14], follows a two-step process: it first generates candidate region proposals that potentially contain objects and then classifies these proposals into specific object classes. Regression-based methods, which are the focus of this article, simplify detection by treating it as a single regression problem, making them more efficient.

In this study, our primary objective is to explore regression-based methods for object
detection in satellite images. We aim to demonstrate that these methods offer simplicity
and improved efficiency compared to region proposal-based approaches. We focus on the
You Only Look Once (YOLO) method, which adopts a single CNN backbone to directly
predict bounding boxes and class probabilities for all objects within an image in real-time
[15]. To further enhance both the speed and accuracy of YOLO, we investigate the Single
Shot MultiBox Detector (SSD), known for its ability to detect and locate small objects
effectively using a default box mechanism and multi-scale feature maps [16]. Additionally,
we explore the RetinaNet detector, which introduces a pyramidal network of features with
the novel focal loss to significantly increase accuracy [17].

To conduct a comprehensive evaluation, we compare the performance of the YOLOv4 model [19] with several other deep learning-based object detection methods. Specifically,
we employ two publicly available datasets, DIOR [20] and HRRSD [21], to assess
YOLOv4's capabilities for object detection in remote sensing images. By analyzing and
comparing the results, we aim to contribute valuable insights to the field of object detection
in satellite imagery. Our research holds the potential to benefit diverse applications, such
as environmental monitoring, urban planning, and disaster response, where accurate and
efficient object detection plays a crucial role.

Figure 1: YOLOv4 object detection model


2. Material and Methods


2.1. YOLOv4
YOLOv4 is considered one of the most influential object detectors. It was developed by Bochkovskiy et al. in 2020 [19], building on the previous work of Redmon et al. [18]. The YOLOv4 architecture adopts a CSPDarknet53 backbone with Mish activation [22] for feature extraction, a spatial pyramid pooling (SPP) neck and PANet with Leaky-ReLU activation [23] for feature aggregation, and a YOLOv3 head for object detection and class prediction. YOLOv4 achieves accuracies comparable to other state-of-the-art detectors at roughly double the inference speed.
The general scheme of the YOLOv4 object detection algorithm is illustrated in Figure 1. It can be divided into four main modules: Input, Backbone, Neck, and Head (dense prediction). In the following, we detail each module and the improvements made to it:

Figure 2: YOLOv4 features.


Input: represents the input image. This pre-processing phase involves resizing the image to the input size of the network and performing operations such as normalization. During training, YOLOv4 uses the Mosaic data augmentation technique to improve training speed and network accuracy, and CmBN (Cross mini-Batch Normalization) and Self-Adversarial Training to improve the network's generalization performance.
Backbone: a high-performance classification network used to extract feature maps from the image. YOLOv4 uses CSPDarknet53 as its backbone; Mish activation replaces the original ReLU activation function, and DropBlock is added to further improve the generalizability of the model.
Neck: usually located between the Backbone and Head, it further improves feature diversity and robustness. YOLOv4 uses the SPP module to merge feature maps of different sizes, and it combines a top-down FPN (Feature Pyramid Network) with a bottom-up path aggregation network (PAN) to improve the feature extraction capability of the network.
Head: produces the final detection output, which contains the class of each detected object as well as its location (a minimal inference sketch covering these four modules follows).
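
To make these four modules concrete, the following sketch runs a pretrained YOLOv4 through OpenCV's DNN module, which can read Darknet-format networks. It is an illustration, not the authors' code; the file names (yolov4.cfg, yolov4.weights, satellite_tile.jpg) are placeholder assumptions for the standard Darknet release files and an input image.

    import cv2
    import numpy as np

    # Load the Darknet-format YOLOv4 network (placeholder file names).
    net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")

    # Input module: resize and normalize the image to the network input size.
    image = cv2.imread("satellite_tile.jpg")
    blob = cv2.dnn.blobFromImage(image, scalefactor=1 / 255.0,
                                 size=(608, 608), swapRB=True, crop=False)
    net.setInput(blob)

    # Backbone + Neck + Head: one forward pass returns the dense predictions
    # of all YOLO output layers.
    outputs = net.forward(net.getUnconnectedOutLayersNames())

    h, w = image.shape[:2]
    boxes, scores, class_ids = [], [], []
    for layer in outputs:
        for det in layer:  # det = [cx, cy, bw, bh, objectness, class scores...]
            class_probs = det[5:]
            class_id = int(np.argmax(class_probs))
            confidence = float(det[4] * class_probs[class_id])
            if confidence > 0.5:
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                scores.append(confidence)
                class_ids.append(class_id)

    # Post-processing: non-maximum suppression keeps one box per object.
    keep = cv2.dnn.NMSBoxes(boxes, scores, score_threshold=0.5, nms_threshold=0.4)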


The developers of YOLOv4 dedicated their efforts to enhancing the model's training accuracy and the post-processing of its outputs. It is important to note that the effectiveness of several advanced techniques on detection performance has been verified. These techniques are referred to as 'bag-of-freebies' and 'bag-of-specials' (Figure 2):
1. Bag-of-Freebies (BoF): specific improvements to the training process that increase accuracy with no impact on inference speed.
BoF for the backbone network:
Data augmentation: CutMix, Mosaic.
Regularisation: DropBlock, Label smoothing.
BoF for the detector: Mosaic, Self-Adversarial Training, CIoU loss, CmBN, Cosine annealing scheduler, Random training shapes, Optimal hyperparameters.
2. Bag-of-Specials (BoS): improvements that slightly increase inference time in exchange for a significant gain in performance.
BoS for the backbone network: Mish activation, Cross Stage Partial network (CSP), Multi-input weighted residual connections (MiWRC).
BoS for the detector: Mish activation, modified Spatial Pyramid Pooling layer (SPP), modified Spatial Attention Module (SAM) [24], modified Path Aggregation Network (PAN), DIoU-NMS. (A short sketch of two of these techniques, Mish activation and Mosaic augmentation, follows this list.)
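
The following sketch is our own illustration of the Mish activation and Mosaic augmentation named above (an assumption about a typical implementation, not the authors' training code):

    import random
    import cv2
    import numpy as np

    def mish(x):
        # Mish(x) = x * tanh(softplus(x)); logaddexp(0, x) = ln(1 + e^x),
        # computed stably. Unlike ReLU, Mish is smooth and keeps a small
        # gradient for negative inputs.
        return x * np.tanh(np.logaddexp(0.0, x))

    def mosaic(images, out_size=608):
        # Mosaic augmentation: tile four training images into one canvas
        # around a random split point, exposing the detector to varied
        # object scales and contexts in a single sample.
        canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
        cx = random.randint(out_size // 4, 3 * out_size // 4)
        cy = random.randint(out_size // 4, 3 * out_size // 4)
        regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
                   (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
        for img, (x1, y1, x2, y2) in zip(images, regions):
            canvas[y1:y2, x1:x2] = cv2.resize(img, (x2 - x1, y2 - y1))
        return canvas  # bounding-box labels must be rescaled and clipped accordingly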

Figure 3: DIOR dataset samples


2.2. Case Study


2.2.1. DIOR Dataset
The dataset comprises images with a resolution of 800 × 800 pixels and varying spatial
resolutions from 0.5 m to 30 m. Similar to other datasets, these images were collected from
Google Earth by experts specialized in earth observation interpretation (Figure 3). It
encompasses 20 distinct object classes, thoughtfully curated by analyzing object categories
from existing earth observation datasets. Initially, ten commonly used object categories
were identified, and then, an additional ten object categories were selected through a
meticulous search on Google Scholar and the Web of Science using keywords such as
'object detection', 'object recognition', 'Earth observation images', and 'remote sensing
images'.
The dataset includes the following classes: Airplane, Airport, Baseball Diamond, Basketball Court, Bridge, Chimney, Dam, Motorway Service Area, Highway Toll Station, Harbor, Golf Course, Ground Track Field, Overpass, Ship, Stadium, Storage Tank, Tennis Court, Train Station, Vehicle, and Wind Turbine.

Figure 4: HRRSD dataset samples


2.2.2. HRRSD Dataset


The HRRSD remote sensing image dataset [21, 25] was produced by the Optical Image Analysis and Learning Center of the Xi'an Institute of Optics and Precision Mechanics of the Chinese Academy of Sciences to study object detection in high-resolution satellite images (Figure 4). The dataset comprises a total of 21,761 image samples obtained from both Google Earth and Baidu Map, with spatial resolutions ranging from 0.15 m to 1.2 m. It contains 55,740 object instances across 13 classes: Airplane, Bridge, Intersection, Ship, Vehicle, Harbor, Ground Track Field, Storage Tank, Basketball Court, Parking Lot, Tennis Court, Baseball Diamond, and T Junction.

Figure 5: Experimental results on the DIOR dataset

3. Results and Discussion


This section compares already published results of several deep learning-based object detection methods with the YOLOv4 method applied to DIOR and HRRSD. We randomly divide each dataset into two subsets, 50% for training and 50% for testing, in order to make fair comparisons with other experiments that used the same split [26]. A detection is deemed correct if its bounding box overlaps the ground truth by more than 50%; any detection with less than 50% overlap is counted as a false positive.
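
This 50% criterion is the standard intersection-over-union (IoU) test with a 0.5 threshold; a minimal sketch of the check:

    def iou(box_a, box_b):
        # Boxes are (x1, y1, x2, y2). IoU = intersection area / union area.
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / float(area_a + area_b - inter)

    def is_true_positive(detection, ground_truth, threshold=0.5):
        # A detection counts as correct only if it overlaps a same-class
        # ground-truth box with IoU above the threshold.
        return iou(detection, ground_truth) > threshold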
Average Precision (AP) and mean Average Precision (mAP) were used as measures to evaluate detection performance.
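
For reference, AP summarizes the area under a class's precision-recall curve, and mAP averages AP over all classes. Below is a sketch of the widely used all-point interpolation, under the assumption that detections are already sorted by descending confidence and matched to ground truth as true/false positives:

    import numpy as np

    def average_precision(tp_flags, num_ground_truth):
        # tp_flags: 1 for a true positive, 0 for a false positive, one entry
        # per detection, sorted by descending confidence.
        tp_flags = np.asarray(tp_flags, dtype=float)
        tp = np.cumsum(tp_flags)
        fp = np.cumsum(1.0 - tp_flags)
        recall = tp / num_ground_truth
        precision = tp / (tp + fp)
        # Append sentinels, then make precision monotonically decreasing.
        mrec = np.concatenate(([0.0], recall, [1.0]))
        mpre = np.concatenate(([0.0], precision, [0.0]))
        for i in range(len(mpre) - 2, -1, -1):
            mpre[i] = max(mpre[i], mpre[i + 1])
        # Sum the area under the interpolated precision-recall curve.
        idx = np.where(mrec[1:] != mrec[:-1])[0]
        return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

    # mAP is then the mean of the per-class AP values:
    # map_value = np.mean([average_precision(f, n) for f, n in per_class_results])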


3.1. Experimental Results on DIOR Dataset


The existing results of twelve (12) object detection methods [20] applied to the DIOR dataset were selected for comparison with our experimental results using the YOLOv4 model. These methods include R-CNN, RICNN, RICAOD, RIFD-CNN, Faster R-CNN, and SSD using the VGG16 backbone. The YOLOv3 model uses the Darknet-53 backbone. Additionally, Faster R-CNN with FPN, Mask R-CNN with FPN, RetinaNet, and PANet use two backbone networks (ResNet-50 and ResNet-101), and the CornerNet network employs the Hourglass-104 backbone. For this paper, YOLOv4 was tested on the DIOR dataset using the CSPDarknet-53 backbone.
The YOLOv4 convolutional neural network (CNN) model stands out as the top performer among the compared models and backbones for object detection on the DIOR dataset (Figure 5). With a mean Average Precision of 81%, YOLOv4 demonstrates exceptional accuracy and efficiency. In comparison, the nearest mAP value, achieved by PANet and RetinaNet with the ResNet-101 backbone, is 66.1%. This difference of 14.9 percentage points highlights the substantial advantage of YOLOv4 over its closest competitors.
The disparity in mAP values underscores the superiority of YOLOv4 in accurately detecting and localizing objects within images. With this remarkable performance, YOLOv4 solidifies its position as a leading choice for object detection tasks on this benchmark.

Figure 6: Experimental results on the HRRSD dataset


3.2. Experimental Results on HRRSD Dataset


Existing results from twelve (12) object detection methods on the HRRSD dataset were compared with the results obtained from YOLOv4. The methods include: BOW [27], SSCBoW [28], FDDL [29], COPD [30], Transformed CNN, RICNN, YOLO, Fast R-CNN, Faster R-CNN, YOLOv2 with the Darknet-19 backbone, Fast R-CNN + GACL-Net [25], and Faster R-CNN + GACL-Net with the ResNet-50 backbone. In this study, YOLOv4 was tested on the HRRSD dataset using the CSPDarknet-53 backbone.
The experimental results depicted in Figure 6 are sorted in order of the methods' appearance in the literature, providing a visual representation of the progression of mean Average Precision values over time. Notably, YOLOv4 achieves the highest mAP value of all the techniques included in the figure. This finding highlights the superior accuracy and precision of YOLOv4, further solidifying its position as a leading model in the field. The incremental increase in mAP values between models reflects the continuous advancements in object detection and recognition algorithms.

4. Conclusion
The main objective of our research was to evaluate the performance of the YOLOv4 convolutional neural network (CNN) model for object detection on remotely sensed data. To accomplish this, we compared the results obtained by YOLOv4 with those of 12 other CNN object detection models on two remote sensing image datasets, DIOR and HRRSD. The experimental findings clearly indicate that YOLOv4 outperformed all the other techniques examined. These results serve as evidence that the new features integrated into YOLOv4 significantly enhance its performance compared to previous iterations, such as YOLOv3.
Moving forward, there are several promising directions for future research:
Validation on Additional Datasets: It would be valuable to further validate the YOLOv4
model by testing it on additional datasets from diverse sources. By evaluating its
performance on various datasets, we can assess the model's generalizability and robustness
across different remote sensing scenarios.
Performance in Challenging Contexts: Investigate how YOLOv4 performs in
challenging contexts, such as adverse weather conditions, occlusions, or rare object
classes. Understanding the model's behaviour in these scenarios can provide insights into
its limitations and potential areas for improvement.
Transfer Learning: Explore the applicability of transfer learning techniques to fine-tune
the YOLOv4 model for specific remote sensing tasks or domains. This could lead to
improved performance with reduced training data requirements.
Efficiency and Resource Optimization: As mentioned, YOLOv4 requires significant
computational resources for training and inference. Investigate methods to optimize the
model's architecture or develop lightweight versions for deployment on resource-
constrained platforms.


Overall, YOLOv4 offers enhanced accuracy, improved feature representation, and better handling of objects at different scales compared to several CNN object detection models. However, addressing these future research directions can lead to a more comprehensive understanding of the model's capabilities and potential areas for improvement in remote sensing applications.

References

[1] V. Yaloveha, A. Podorozhniak, H. Kuchuk, and N. Garashchuk, "Performance comparison of CNNs on high-resolution multispectral dataset applied to land cover classification problem", Radìoelektronnì Ì Komp'ûternì Sistemi, no. 2, (2023), pp. 107–118.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep
convolutional neural networks” Communications of the ACM. vol. 60, no. 6, (2017), pp.
84–90.
[3] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks”, in
Lecture Notes in Computer Science, (2014), pp. 818–833.
[4] K. Simonyan and A. Zisserman, “Very deep convolutional networks for Large-Scale image
recognition”, Computer Vision and Pattern Recognition, (2014).
[5] A. Abbasi et al., “Detecting prostate cancer using deep learning convolution neural
network with transfer learning approach”, Cognitive Neurodynamics, vol. 14, no. 4,
(2020), pp. 523–533.
[6] S. Ioffe, “Batch renormalization: towards reducing minibatch dependence in Batch-
Normalized models”, arXiv (Cornell University), vol. 30, (2017), pp. 1945–1953.
[7] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-V4, Inception-ResNet and
the impact of residual connections on learning”, Proceedings of the ... AAAI Conference
on Artificial Intelligence, vol. 31, no. 1, (2017).
[8] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception
Architecture for Computer Vision,” Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, vol. 2818–2826, (2016).
[9] M. Shafiq and Z. Gu, “Deep Residual Learning for Image Recognition: a survey”, Applied
Sciences, vol. 12, no. 18, p. 8972, (2022).
[10] H. Cai, T. Chen, R. Niu and A. Plaza, "Landslide Detection Using Densely Connected
Convolutional Networks and Environmental Conditions", IEEE Journal of Selected Topics
in Applied Earth Observations and Remote Sensing, vol. 14, pp. 5235-5247, (2021).
[11] D. Al-Alimi, Y. Shao, R. Feng, M. a. A. Al‐qaness, M. E. A. Elaziz, and S. H. Kim, “Multi-
Scale geospatial object detection based on Shallow-Deep feature extraction”, Remote
Sensing, vol. 11, no. 21, (2019), p. 2525.
[12] H. Kuchuk, A. Podorozhniak, N. Liubchenko, and D. Onischenko, “System of license plate
recognition considering large camera shooting angles”, Radìoelektronnì Ì Komp’ûternì
Sistemi, no. 4, (2021).


[13] M. Li, Z. Zhang, L. Lei, X. Wang, and X. Guo, “Agricultural Greenhouses Detection in
High-Resolution Satellite Images Based on Convolutional Neural Networks: Comparison
of Faster R-CNN, YOLO v3 and SSD”, Sensors, vol. 20, no. 17, (2020), p. 4938.
[14] A. A. J. Pazhani and C. Vasanthanayaki, “Object detection in satellite images by faster R-
CNN incorporated with enhanced ROI pooling (FrRNet-ERoI) framework”, Earth Science
Informatics, vol. 15, no. 1, (2022), pp. 553–561.
[15] Z. Liu, Y. Gao, Q. Du, M. Chen, and W. Lv, “YOLO-Extract: Improved YOLOV5 for
aircraft object detection in remote sensing images”, IEEE Access, vol. 11, (2023), pp.
1742–1751.
[16] A. Kumar, Z. Zhang, and H. Lyu, "Object detection in real time based on improved single shot multi-box detector algorithm", EURASIP Journal on Wireless Communications and Networking, vol. 2020, no. 1, (2020).
[17] M. Zhu et al., “Arbitrary-Oriented ship detection based on RetinaNet for remote sensing
images”, IEEE Journal of Selected Topics in Applied Earth Observations and Remote
Sensing, vol. 14, pp. 6694–6706, (2021).
[18] R. Luo et al., “Glassboxing Deep Learning to Enhance Aircraft Detection from SAR
Imagery”, Remote Sensing, vol. 13, no. 18, (2021), p. 3650.
[19] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “YOLOV4: Optimal speed and accuracy
of object detection”, arXiv (Cornell University), (2020).
[20] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, "Object detection in optical remote sensing images: A survey and a new benchmark", ISPRS Journal of Photogrammetry and Remote Sensing, vol. 159, (2020), pp. 296–307.
[21] Y. Zhang, Y. Yuan, Y. Feng, and X. Lu, “Hierarchical and robust convolutional neural
network for Very High-Resolution Remote Sensing object detection”, IEEE Transactions
on Geoscience and Remote Sensing, vol. 57, no. 8, (2019), pp. 5535–5548.
[22] A. Mondal and V. K. Shrivastava, “A novel Parametric Flatten-p Mish activation function
based deep CNN model for brain tumor classification”, Computers in Biology and
Medicine, vol. 150, (2022), p. 106183.
[23] J.-N. Lee, J.-W. Chae, and H.-C. Cho, “Improvement of colon polyp detection
performance by modifying the multi-scale network structure and data augmentation”,
Journal of Electrical Engineering & Technology, vol. 17, no. 5, (2022), pp. 3057–3065.
[24] S. Ari and S. Ari, “MU-NET: Modified U-Net architecture for automatic Ocean Eddy
Detection”, IEEE Geoscience and Remote Sensing Letters, vol. 19, (2022), pp. 1–5.
[25] X. Lu, Y. Zhang, Y. Yuan, and Y. Feng, “Gated and Axis-Concentrated localization
network for remote sensing object detection”, IEEE Transactions on Geoscience and
Remote Sensing, vol. 58, no. 1, (2020), pp. 179–192.
[26] G. Cheng and J. Han, "A survey on object detection in optical remote sensing images", ISPRS Journal of Photogrammetry and Remote Sensing, vol. 117, (2016), pp. 11–28.


[27] S. Xu, T. Fang, D. Li, and S. Wang, “Object classification of aerial images with Bag-of-
Visual words”, IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 2, (2010).
[28] H. Sun, X. Sun, H. Wang, Y. Li, and X. Li, “Automatic target detection in High-Resolution
remote sensing images using spatial sparse coding Bag-of-Words model,” IEEE
Geoscience and Remote Sensing Letters, vol. 9, no. 1, (2012), pp. 109–113.
[29] J. Han et al., "Efficient, simultaneous detection of multi-class geospatial targets based on visual saliency modeling and discriminative learning of sparse coding", ISPRS Journal of Photogrammetry and Remote Sensing, vol. 89, (2014), pp. 37–48.
[30] G. Cheng, J. Han, P. Zhou, and L. Guo, "Multi-class geospatial object detection and geographic image classification based on collection of part detectors", ISPRS Journal of Photogrammetry and Remote Sensing, vol. 98, (2014), pp. 119–132.

