Article
Detection of a Moving UAV Based on Deep
Learning-Based Distance Estimation
Ying-Chih Lai * and Zong-Ying Huang
Department of Aeronautics and Astronautics, National Cheng Kung University, Tainan 701, Taiwan;
[email protected]
* Correspondence: [email protected]; Tel.: +886-6-275-7575 (ext. 63648)
Received: 27 July 2020; Accepted: 14 September 2020; Published: 17 September 2020
Abstract: Distance information of an obstacle is important for obstacle avoidance in many applications,
and could be used to determine the potential risk of object collision. In this study, the detection of a
moving fixed-wing unmanned aerial vehicle (UAV) with deep learning-based distance estimation is
proposed to conduct a feasibility study of sense and avoid (SAA) and mid-air collision avoidance of
UAVs, using a monocular camera to detect and track an incoming UAV. A quadrotor is
regarded as an owned UAV, and it is able to estimate the distance of an incoming fixed-wing intruder.
The adopted object detection method is based on the you only look once (YOLO) object detector.
Deep neural network (DNN) and convolutional neural network (CNN) methods are applied to examine
their performance in the distance estimation of moving objects. The feature extraction of fixed-wing
UAVs is based on the VGG-16 model, and then its result is applied to the distance network to estimate
the object distance. The proposed model is trained by using synthetic images from animation software
and validated by using both synthetic and real flight videos. The results show that the proposed
active vision-based scheme is able to detect and track a moving UAV with high detection accuracy
and low distance errors.
Keywords: unmanned aerial vehicle (UAV); you only look once (YOLO); deep neural network
(DNN); convolutional neural network (CNN); object detection; sense and avoid (SAA); mid-air
collision avoidance
1. Introduction
With the advance of technology, unmanned aerial vehicles (UAVs) have become popular in the
past two decades due to their wide and various applications. The advantages of UAVs include low
cost, a less stressful working environment for operators, and long endurance. Most important of all, UAVs are
unmanned, so they can reduce the need for manpower, and thus reduce the number of casualties caused
by accidents. They also have many different applications including aerial photography, entertainment,
3D mapping [1], object detection for different usages [2–4], military use, and agriculture applications,
such as pesticide spraying and vegetation monitoring [5]. With the increasing number of UAVs,
more and more UAVs are flying in the same airspace. If there is no air traffic control and
management of UAVs, accidents and mid-air collisions may happen, which is one of the most
significant risks that UAVs are facing [6]. Thus, UAV sense and avoid (SAA) has become a critical issue.
A comprehensive review of the substantial breadth of SAA architectures, technologies, and algorithms
is presented in the tutorial [7], which concludes with a summary of the regulatory and technical
issues that continue to challenge the progress on SAA. Without a human pilot onboard, unmanned
aircraft systems (UASs) have to rely solely on SAA systems in dense UAS operations in urban
environments or when they are merged into the National Airspace System (NAS) [8]. Many factors
need to be considered for UAS traffic management (UTM), such as cost, the payload of the UAV,
and the accuracy of the sensor. Therefore, the determination of suitable sensors in UAV SAA of UTM for object
sensing is essential.
According to how the information is transmitted, current sensor technologies for SAA can be
classified as cooperative and non-cooperative methods [8]. For cooperative sensors, communication
devices need to be equipped to communicate with the aircraft in the same airspace, such as the traffic
alert and collision avoidance system (TCAS) and the automatic dependent surveillance-broadcast
(ADS-B), which have been widely used in commercial airlines. In contrast to cooperative sensors,
non-cooperative sensors do not need to be equipped with the same communication devices to exchange
data with the other aircraft sharing the same airspace. Moreover, non-cooperative sensors, such as
light detection and ranging (LIDAR), radar, and optical sensors (cameras), are able to detect not only
air objects but also ground targets. One drawback of small-scale UAVs is the limitation of their
payload capability. Therefore, the camera becomes an ideal sensor for object and target detection.
The camera has many advantages, such as its light weight, low cost, and ease of installation,
and it is also widely used in different applications.
Computer vision is one of the popular research topics for onboard systems of UAVs, as it makes the vehicles
able to “see” targets or objects. With the rapid development of computer vision, vision-based navigation
is now a promising technology for detecting potential threats [6]. For object sensing/detection,
many approaches have been proposed, such as the multi-stage detection pipeline [9–11], machine learning [12–15],
and deep learning [16]. Deep learning is widely used in machine vision for object detection, localization,
and classification. In contrast to traditional object detection methods, detectors using deep learning are
able to learn semantic, high-level, and deeper features to address the problems existing in traditional
architectures [17]. Detectors based on deep learning can be divided into two categories, one stage and two
stage. Two-stage detectors require a region proposal network (RPN) to generate regions of interests (ROI),
such as the faster region convolution neural network (R-CNN) or the mask R-CNN [18,19]. On the other
hand, the one-stage detector considers object detection as a single regression problem by taking an image as
input to learn class probabilities and bounding box coordinates, such as the single shot multi-box detector
(SSD) or you only look once (YOLO) [20,21]. Two-stage detectors have higher accuracy than one-stage
detectors, but their computational cost is also higher.
Vision-based object detection methods have been studied for many decades and applied in many
applications. In recent years, many studies have focused on UAV detection with vision-based
methods and deep learning [22–26]. These studies focus on the detection of quadrotor or multirotor
UAVs, commonly known as drones, but detectors for small fixed-wing UAVs are difficult to obtain;
fixed-wing UAVs have a higher flight speed than multirotors, which increases the challenge for
vision-based detectors. Moreover, most of these studies emphasized the development of object detectors,
and there is no vision-based distance estimation for the feasibility study of SAA and mid-air collision
avoidance of UAVs using a monocular camera to detect an incoming small fixed-wing UAV. Some vision-based
detection approaches for mid-air collision avoidance have been proposed for light fixed-wing aircraft.
For example, a multi-stage image processing pipeline based on the hidden Markov model (HMM)
has been utilized to detect aircraft with slow motion on the image plane [10]. The key stages of the
multi-stage pipeline are stabilized image input, image preprocessing, temporal filtering, and detection
logic. The advantage of this approach is that it can detect a Cessna 182 aircraft at long distance.
However, when the movement of the aircraft on the image plane is too fast, this algorithm will fail.
In [6], the proposed long-range vision-based SAA utilized the same multi-stage pipeline. Moreover,
instead of using only morphological image processing in the image processing stage, deep learning-based
pixel-wise image segmentation is also applied to increase the detection range of a Cessna 182 whilst
maintaining low false alarms. It classifies every pixel in the image into two classes, aircraft and non-aircraft.
Regarding UAVs, Li et al. proposed a new method to detect and track UAVs from a monocular
camera mounted on the owned aircraft [3]. The main idea of this approach is to adopt background
subtraction. The background motion is calculated via optical flow to obtain the background subtracted
images and to find the moving targets. This approach is able to detect moving objects without the
limitations of moving speed or visual size.
For obstacle avoidance, the distance information of the target object usually plays an important
role. However, it is difficult to estimate distance with only a monocular camera. Some approaches
exploit known information, such as the camera focal length and the height of the object, to calculate the
distance via the pinhole model, and usually assume that the height or width of the object is known [27,28].
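As a simple illustration of this pinhole relation (and not the deep learning approach developed later in this study), the distance can be computed from an assumed known object width; the numbers below are hypothetical:

```python
def pinhole_distance(focal_length_px: float, real_width_m: float, bbox_width_px: float) -> float:
    """Similar-triangles pinhole relation: distance = focal length * real width / image width."""
    return focal_length_px * real_width_m / bbox_width_px

# Hypothetical example: a 1.4 m wingspan spanning 60 px with an 800 px focal length
# gives roughly 18.7 m.
print(pinhole_distance(800.0, 1.4, 60.0))
```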
Deep learning-based distance estimation of objects on the ground has been proposed in
many studies, but deep learning-based object detection of UAVs for mid-air collision avoidance is
rare according to the literature survey. There are some studies focused on the monocular vision-based
SAA of UAVs [29,30]. In the study [29], an approach to deal with monocular image-based SAA
assuming constant aircraft velocities and straight flight paths was proposed and simulated in
software-in-the-loop simulation test runs. A nonlinear model predictive control scheme for a UAV
SAA scenario, which assumes that the intruder’s position is already confirmed as a real threat and
the host UAV is on the predefined trajectory at the beginning of the SAA process, was proposed and
verified through simulations [30]. However, in these two studies, there is no object detection method
and no real image data acquired from a monocular camera. For deep learning-based object detection,
most of the studies utilize the images acquired from UAVs or a satellite to detect and track the objects
on the ground, such as automatic vehicles, airplanes, and vessels [31–33]. For ground vehicles, Ponce et al.
proposed a monocular distance estimation system for neuro-robotics by using a CNN, taking the
horizontal and vertical image motion estimated via optical flow as inputs to the trained CNN model,
with the distance information from ultrasonic sensors as the reference [34]. The distance is successfully
estimated using only a camera, but the distance estimation results become worse when the velocity
of the robot increases. In [35], a deep neural network (DNN) named DisNet is proposed to estimate the
distance from a ground vehicle to objects, and it applied the bounding boxes of the objects detected by
YOLO and image information, such as width and height, as inputs to train DisNet. The results show
that DisNet is able to estimate the distance between the objects and the camera without either explicit
camera parameters or prior knowledge about the scene. However, the accuracy of the estimated distance
may be directly affected by the width and height of the bounding box.
With the rapid development in technology, UAVs have become an off-the-shelf consumer product.
However, if there is no traffic control or UTM system to manage UAVs when they fly in the same
airspace, it may cause mid-air collisions, property loss, or casualties. Therefore, SAA and mid-air
collision avoidance for UAVs have become an important issue. The goal of this study is to develop the
detection of a moving UAV based on deep learning distance estimation to conduct the feasibility study
of SAA and mid-air collision avoidance of UAVs. The adopted sensor for the detection of the moving
object is a monocular camera, and DNN and CNN were applied to estimate the distance between the
intruder and the owned UAV.
The rest of this study is organized as follows: In Section 2, the overview of this study is presented,
including the architecture of the proposed detection scheme and the methods to accomplish object
detection. The methods of the proposed distance estimation using deep learning are presented in
Section 3, and the introduction to model architecture and a proposed procedure to synthesize the
dataset for training the model are also presented. Section 4 presents the performance evaluation of
the proposed methods by using synthetic videos and real flight experiments. Results and discussions
of model evaluation and experiments are shown in Section 5. Finally, the conclusion of this study is
addressed in Section 6.
especially for aircraft moving at relatively high speed. In this study, since the camera is a passive
non-cooperative sensor, a monocular camera was selected to be the only sensor to detect the target
object in the airspace. A multi-stage object detection scheme is proposed to obtain the distance
estimation of the moving targets on the image plane at long and short distances. The background
subtraction method, based on the approach in [3], is applied to detect the long-range target and the
moving object with a moving background on the image plane. When the target object is approaching
the owned UAV, a deep learning-based model is trained to estimate the distance. Then, according to
the distance estimation of the detected object on the image plane and its dynamic motion, a risk
assessment of mid-air collision could be conducted to prevent a mid-air collision from occurring.
Figure 1 shows the flow chart of the research process of the proposed multi-stage target detection
and distance estimation using a deep learning-based approach.
Figure 1. Flow chart of the research process: object detection (background subtraction for long distance; deep learning YOLO detector for short distance), distance estimation (Method 1: CNN regression; Method 2: DNN regression), and risk assessment.
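To make the flow of Figure 1 concrete, the following minimal Python sketch wires the stages together; the callables, dictionary keys, and the pixel-size threshold are hypothetical placeholders, not the authors' implementation.

```python
def process_frame(frame, ownship_state, detectors, estimate_distance, assess_risk,
                  short_range_px=20):
    """High-level flow of Figure 1: detection, distance estimation, risk assessment.

    `detectors`, `estimate_distance`, and `assess_risk` are caller-supplied callables;
    their names and the `short_range_px` threshold are illustrative assumptions.
    """
    # Long range: moving-target detection against a moving background [3].
    detection = detectors["background_subtraction"](frame)
    if detection is None:
        return None  # nothing detected in this frame
    distance = None
    # Short range: the target is large enough on the image plane for the deep
    # learning (YOLO) detector and the CNN/DNN distance regression.
    if detection["size_px"] > short_range_px:
        detection = detectors["yolo"](frame) or detection
        distance = estimate_distance(frame, detection)
    return assess_risk(detection, distance, ownship_state)
```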
prediction and class probabilities [37]. The YOLO detector is well-known for its computational speed,
and it is a good choice for real-time applications. YOLOv3 is the third version of YOLO, which has a
deeper network for feature extraction, a different network architecture, and a new loss function [36].
The new architecture of YOLOv3 boasts residual skip connections and upsampling. The most significant
feature of v3 is that it makes detections at three different scales. The upsampled layers concatenated
with the previous layers help preserve the fine-grained features, which helps in detecting small objects.
More details of the different YOLO detectors are introduced in the literature [36,37].
Since the YOLOv3 detector is a high-speed detector, it is a good choice when real-time detection
with acceptable accuracy is required for the onboard computing system of small UAVs. Because the
purpose of this study is to conduct a feasibility study of active vision-based SAA for small UAVs using
a deep learning-based approach, YOLOv3 is selected to be the detector for detecting the fixed-wing
intruder. In order to perform the distance estimation with YOLOv3, the intruder distance is estimated
at short range, where the object appearance on the image plane is larger than a few pixels. Moreover,
the YOLOv3 detector was run on a personal computer to detect the object and to estimate the distance
between the intruder and the owned UAV by post processing the synthetic images acquired from
animation software and the videos from real flight tests. The computing power of the developed
vision-based SAA is still regarded as a limitation to improve on for future real-time onboard implementation.
2.2. Object Collection
In this study, a low-cost fixed-wing UAV, named Sky Surfer X8, with a wingspan of 1400 mm,
an overall length of 915 mm, and a flying weight of 1 kg was adopted to be the intruder. The real flight
tests were conducted by using a Pixhawk autopilot to perform waypoint tracking in auto mode.
In the training process, the proposed model was trained by using synthetic images of the Sky Surfer
from animation software. With the synthetic images, the YOLOv3 detector pre-trained with the
Microsoft COCO dataset [38] was used to train the feature extractor with the custom images of UAVs in
this study. To train the custom YOLOv3 detector, it is necessary to collect images with the target
fixed-wing UAV. The software named Blender, which is a free and open-source 3D creation suite,
was utilized to synthesize the custom images. It supports the entirety of the 3D pipeline, such as
modeling, animation, motion graphics, and rendering. Figure 2 shows one of the synthesized images
used to train the custom YOLOv3 detector, and the UAV in each image is composited with a real image
as the background.
Figure 2. Synthetic image made by Blender.
To train the model with the dataset, it is necessary to label the images in the training dataset with
a bounding box and a class, respectively. The outputs of YOLOv3 are the bounding box information
(coordinates) and classes. In this study, there is only one class, which is the fixed-wing UAV. Figure 3
shows the labeling process, and the adopted tool used to label the images is LabelImg, which is also
open-source software.
Figure 3. Labeling image of the training dataset.
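For reference, a custom single-class YOLOv3 model of this kind can be queried for bounding boxes with OpenCV's DNN module; the following is only a sketch, and the `yolov3_uav.cfg`/`yolov3_uav.weights` file names are hypothetical, not the authors' exact code.

```python
import cv2
import numpy as np

def detect_uav(frame, net, conf_thr=0.5, nms_thr=0.4):
    """Return [x, y, w, h] boxes for the single 'fixed-wing UAV' class."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes, scores = [], []
    for output in net.forward(net.getUnconnectedOutLayersNames()):
        for det in output:
            score = float(det[5])  # one class, so a single class score per detection
            if score > conf_thr:
                cx, cy, bw, bh = det[0:4] * np.array([w, h, w, h])
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                scores.append(score)
    if not boxes:
        return []
    keep = cv2.dnn.NMSBoxes(boxes, scores, conf_thr, nms_thr)
    return [boxes[i] for i in np.array(keep).reshape(-1)]

# net = cv2.dnn.readNetFromDarknet("yolov3_uav.cfg", "yolov3_uav.weights")  # hypothetical files
# uav_boxes = detect_uav(frame, net)
```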
Figure 4. Detection results of the custom you only look once (YOLO)v3 detector on synthetic images.
Figure 5. Detection results of the custom YOLOv3 detector from real images.
3. Distance Estimation
Since the detected objects on the 2D image plane could not provide the distance of the intruder,
the depth of the target object is required to obtain its movement in 3D space. In this study, the distance
between the ownship and intruder is estimated by deep learning-based methods to achieve the SAA of
UAVs. To obtain more accurate distance estimation results, two different deep learning methods are
used to compare their performance of distance estimation in this study. One is CNN and the other is
DNN with the DisNet regression model. From the comparison results, the better one will be applied to
the videos of real flight tests in this study.
3.1. Distance Estimation Using CNN
CNN is a powerful algorithm in deep learning, and it is able to extract the different features of
objects during the training process. In this study, the distance estimation is considered as a simple
CNN regression problem, and the images with the target object were cropped as the inputs of the
CNN distance regression model. As shown in Figure 6, the CNN distance regression model could be
separated into two parts, the feature extraction network and the distance network.
Figure 6. The architecture of the convolutional neural network (CNN) distance estimation system.
3.1.1. Model Architecture
Feature Extraction Network
As shown in Figure 7, the feature extraction network is based on VGG-16 [39], which contains five
convolution blocks, each followed by a max-pooling layer. The feature extraction network is initialized
with weights pre-trained on ImageNet. Then, the layers before the third pooling layer were frozen to
fine-tune the remaining layers. In the model evaluation, the results show that the model with no frozen
layers in the feature extraction network has a larger training loss (around 0.7 to 1.3) compared to that
with frozen layers in the feature extraction network (around 0.2 to 0.5). Therefore, the feature extraction
network with frozen layers was chosen in this study.
The reasons for freezing some layers are as follows:
1. It could reduce the number of trainable parameters of the model.
2. The weights (filters) are pre-trained with ImageNet, an image database, to improve the
performance of the filters in feature extraction.
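The description above can be reproduced with a short Keras sketch; freezing up to the third pooling block ("block3_pool") follows the text, while the unit counts of the fully connected layers are illustrative assumptions rather than values reported here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_distance_model(input_shape=(100, 100, 3)):
    # VGG-16 backbone pre-trained on ImageNet, without the classification head.
    backbone = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                           input_shape=input_shape)
    # Freeze the layers up to the third pooling layer; fine-tune the rest.
    trainable = False
    for layer in backbone.layers:
        layer.trainable = trainable
        if layer.name == "block3_pool":
            trainable = True
    # Distance network: three FC layers before the linear output (unit counts assumed).
    x = layers.Flatten()(backbone.output)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    distance = layers.Dense(1, activation="linear")(x)  # estimated distance in metres
    return models.Model(backbone.input, distance)
```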
Figure 8. The architecture of the distance regression network.
To decide how many FC layers, excluding the output layer, have to be used in the distance
network, and to discuss whether the distance network with different numbers of FC layers affects the
performance, two different architectures, with three FC layers and four FC layers, were compared in this
study. The evaluation results of the models with different numbers of FC layers are shown in Figure 9,
where GT represents the ground truth. Models 5 to 8 and Models 20 to 21 are the results with three
FC layers. Models 13 to 15 are the results with four FC layers. The training and validation losses of
all models are able to converge at around 0.2 to 0.5, and the results show that there is no significant
difference between the models with three FC layers and four FC layers. However, the models with three
FC layers are slightly more accurate than those with four FC layers, and the number of parameters of the
models with three FC layers is much smaller than that of the models with four FC layers, which can decrease
the training time.
Figure 9. Evaluation results of the models with different numbers of fully connected (FC) layers.
3.1.2. Data Collection
Because there is no existing dataset suitable for the CNN distance regression model, it is necessary to
build a dataset to train the model, which is able to estimate the distance between the ownship and
intruder UAVs using the deep learning-based approach. In order to obtain a dataset with a large number
of various cropped images that contain a UAV at various distances and orientations, a procedure to
synthesize this dataset is proposed in this study. In contrast to the approach in [35], which is a
ground-based distance estimation for railway obstacle avoidance, this study presents an air-to-air
obstacle avoidance scheme, in which it is more difficult to collect real scene images for training,
because the ground truth of the estimated distance needs to be determined rigorously.
Synthetic Images
To address the previously mentioned problem, Blender software was utilized to create the desired
synthetic images. For the training dataset, a small-scale UAV, the Sky Surfer X8, was imported into
Blender as the intruder, and then it was randomly rotated to obtain different orientations, and the
camera was adjusted to acquire various distances. In this study, scenes of a UAV flying toward the
camera were considered, and the scenarios of head-on and crossing encounters were conducted.
The rotation range of the UAV was also limited to prevent unusual attitudes and the overtaking case.
The information regarding the dataset built to train the CNN distance regression model is listed in
Table 1. Figure 10 shows the interface of Blender, which is able to change the location of the intruder
by setting the parameters in the red box and to change the attitude parameters in the yellow box.
Figure 11 shows one of the synthetic images produced by Blender, and Figure 12 shows some cropped
images of the developed training dataset.
Table 1. Information regarding the training dataset.
Information Size
Image shape before being cropped 3840 × 2160 × 3
Cropped image shape 100 × 100 × 3
Attitude Rotation Range
Roll angle range −15° ~ 15°
Pitch angle range −15° ~ 15°
Yaw angle range −75° ~ 75°
Figure 12. Examples of the cropped images with different distances and orientations for model training.
Image Augmentation
In order to create more data for model training, an image augmentation process, which randomly
changes the images before inputting them into the model according to the given parameters, was applied
during model training. Moreover, the image augmentation process can also prevent the trained model
from overfitting. The augmentation process used in this study includes rotations and translations of
the target object, which are performed by the image processing operations of width shifting and height
shifting. The parameters are listed in Table 2. For the translation process, the factor of 0.35 means
shifting at most 70 pixels for a target object with a size of 200 × 200 pixels, and this value changes based
on the size of the input images. For the rotation process, the maximum rotation angle is 3 degrees.
In the training process, the image augmentation process randomly selects a set of parameter
combinations of translation and rotation for each epoch.
Table 2. Image augmentation parameters.
Augmentation Parameter
Width shift range 0.35
Height shift range 0.35
Rotation range 3
Fill mode nearest
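One common way to realize the Table 2 policy is Keras' ImageDataGenerator; this sketch uses the listed values and is an assumption about tooling, not a statement of the authors' exact implementation.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation parameters from Table 2.
augmenter = ImageDataGenerator(
    width_shift_range=0.35,   # horizontal translation as a fraction of image width
    height_shift_range=0.35,  # vertical translation as a fraction of image height
    rotation_range=3,         # maximum rotation angle in degrees
    fill_mode="nearest",      # fill newly exposed pixels with the nearest valid value
)

# Hypothetical usage with cropped images and distance labels as NumPy arrays:
# train_flow = augmenter.flow(train_images, train_distances, batch_size=32)
```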
3.1.3. Training Result
In order to train the proposed model, the dataset was collected by using the proposed procedure
to synthesize the training data, as previously mentioned. The number of produced images—which are
cropped in RGB with a distance range from 30 m to 95 m—in the dataset for training is about 10,000.
First of all, the images were normalized to increase the training speed and model robustness, and then
split into 80% for training and 20% for validation. The mean square error (MSE) was chosen to be the
loss function, as shown in Equation (1), where y_i is the ground truth and ŷ_i is the prediction from the
proposed model. Adaptive moment estimation (Adam) with the learning rate decay shown in
Equation (2) was chosen to be the optimizer; the model training result is illustrated in Figure 13.
It took about 38 min to train the model with an NVIDIA GeForce GTX 1660 Graphics Processing
Unit (GPU) card.
MSE = (1/n) * sum_{i=1}^{n} (y_i − ŷ_i)^2   (1)
Learning Rate = 0.001 / Epoch   (2)
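A minimal training sketch matching Equations (1) and (2) is shown below; it reuses the build_cnn_distance_model() sketch from Section 3.1.1, shifts the epoch index by one to avoid division by zero, and the epoch count is an assumption.

```python
import tensorflow as tf

def lr_schedule(epoch, lr):
    # Equation (2): learning rate = 0.001 / epoch (epoch treated as 1-based here).
    return 0.001 / (epoch + 1)

model = build_cnn_distance_model()                      # sketch from Section 3.1.1
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="mse")                               # Equation (1)

# images: (N, 100, 100, 3) normalized cropped RGB images; distances: (N,) labels in metres.
# model.fit(images, distances, validation_split=0.2, epochs=50,
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```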
Figure 14. The architecture of the deep neural network (DNN) distance estimation system. (Blocks in the original figure: cropped image, CNN attitude model, bounding box rectification, DisNet regression model/DNN distance network, output distance.)
Figure 15. Process to rectify the bounding box given by the YOLOv3 detector.
3.2.3. DNN Architecture
Figure 16 shows the architecture of the DNN distance model, which consists of three hidden layers
with 100 hidden units each. The input vector is shown in Equation (3), and the output value is the
estimated distance of the object. The distance network is trained with the same loss function and
optimizer as in Section 3.1.3.
v = [ 1/B_h  1/B_w  1/B_d  φ  θ  ϕ ]   (3)
where
B_h: height of the object bounding box in pixels/image height in pixels;
B_w: width of the object bounding box in pixels/image width in pixels;
B_d: diagonal of the object bounding box in pixels/image diagonal in pixels;
φ: estimated roll angle;
θ: estimated pitch angle;
ϕ: estimated yaw angle.
Figure 16. Architecture of the DNN distance network.
Figure 17. Comparison of CNN and DNN distance regression models: (a,b) the distance range of the intruder flying from 60 to 30 m; (c) the distance ranges from 50 to 35 m.
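For completeness, the DNN branch of Section 3.2.3 compared above (Equation (3) plus three 100-unit hidden layers) can be sketched as follows; the ReLU activation is an assumption (cf. [40]), and the attitude angles are expected from the CNN attitude model of Figure 14.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def disnet_features(box_w, box_h, img_w, img_h, roll, pitch, yaw):
    """Input vector of Equation (3): inverse relative box dimensions plus attitude angles."""
    b_w = box_w / img_w
    b_h = box_h / img_h
    b_d = np.hypot(box_w, box_h) / np.hypot(img_w, img_h)
    return np.array([1.0 / b_h, 1.0 / b_w, 1.0 / b_d, roll, pitch, yaw], dtype=np.float32)

def build_dnn_distance_model():
    # Three hidden layers with 100 units each, one linear output (the distance).
    model = models.Sequential([
        layers.Input(shape=(6,)),
        layers.Dense(100, activation="relu"),
        layers.Dense(100, activation="relu"),
        layers.Dense(100, activation="relu"),
        layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")  # same loss and optimizer as Section 3.1.3
    return model
```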
In this study, only head-on and crossing cases are considered for the evaluation of synthetic and real
flight videos. The details about how to acquire these videos are presented in the following sections.
Figure 18. Synthetic videos for model evaluation. The red box indicates the crossing case, and the yellow box indicates the head-on case.
The results of the model evaluation with synthetic videos are given in Table 4 and Figure 19.
As shown in Table 4, the synthetic videos are grouped into two sets according to their distance.
Set I presents the shorter distance with a clear background, and Set II shows the longer distance with
a cloudy background. The root mean square error (RMSE) of each video was calculated to compare
the performance of the results. RMSE_K indicates the RMSE with the Kalman filter applied in the
distance estimation; the one-dimensional Kalman filter, which is adopted as a low-pass filter in this
study, is applied to smooth the output of the CNN distance regression model.
Figure 19 shows the estimated distance by the CNN distance regression model, where the green line
indicates the raw estimation from the model, the blue line indicates the estimation with the Kalman filter,
and the red line indicates the ground truth of the distance in each video frame. The ground truth is
determined by the positions of the intruder and the related frame with a timestamp. From Table 4
and Figure 19, it is obvious that the CNN distance regression model successfully estimated the
distance in each frame; the RMSEs are small for the different weather conditions and cases.
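The one-dimensional Kalman filter used here as a low-pass filter can be sketched as below; the process and measurement noise variances are assumed values, not those used by the authors.

```python
class ScalarKalmanFilter:
    """1-D Kalman filter smoothing the raw CNN distance estimates (constant-distance model)."""

    def __init__(self, process_var=0.05, meas_var=4.0, x0=0.0, p0=1.0):
        self.q, self.r = process_var, meas_var  # assumed noise variances
        self.x, self.p = x0, p0                 # state estimate and its variance

    def update(self, z):
        self.p += self.q                  # predict: variance grows by the process noise
        k = self.p / (self.p + self.r)    # Kalman gain
        self.x += k * (z - self.x)        # correct with the raw distance measurement z
        self.p *= (1.0 - k)
        return self.x

# kf = ScalarKalmanFilter(x0=60.0)
# smoothed = [kf.update(d) for d in raw_cnn_distances]  # raw_cnn_distances is hypothetical
```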
The real flight experiments and the performance of the CNN regression model in the real flight
test are given in the following sections.
4.2.1. Experiment 1
Experiment 1 is a head-on scenario with misty weather, and the flight trajectory is shown in
Figure 20. The yellow arrow is the flight direction, and the black arrow is the heading of the ownship.
Figure 21 shows the measurements of GPS data for model evaluation, and the distance range is from
62 m to 22 m. Figure 22 shows the results of the CNN regression model, and Table 6 shows the
information of Experiment 1 and the RMSE of the estimated distance. MEAS (Measurement) denotes
the measurements of GPS data and EST (Estimation) is the estimated distance.
Figure 21. Measurements of global positioning system (GPS) data to evaluate the model in Experiment 1.
Figure 24. Measurements of GPS data for model evaluation in Experiment 2.
Figure 25. Result of model evaluation in Experiment 2.
Figure 27. Measurements of GPS data for model evaluation in Experiment 3.
Figure 28. Result of model evaluation in Experiment 3.
1. The intruder is located at the center of the images in the training dataset. However, the intruder
in crossing cases is always far away from the image center, but the intruder in head-on cases is
close to the center of the images.
2. Most of the cropped images for the model training are in clear weather, but the synthetic videos
have a cloudy (noisy) background which may affect the accuracy.
6. Conclusions
In this work, vision-based distance estimation using a deep learning-based approach to
estimate the distance between the ownship and intruder UAVs was proposed for the feasibility study
of SAA and mid-air collision avoidance of small UAVs with a consumer-grade monocular camera. First,
the target object on the image plane was detected, classified, and located by YOLOv3, which is a popular
deep learning-based object detector. Then, the distance between the ownship and intruder UAVs was
estimated using a deep learning approach which only takes images as input. To verify the performance of
the CNN distance regression model, two types of videos were acquired in this study, synthetic and real
flight videos. The model evaluation results show that the performance of the proposed method is viable
for the SAA of a small UAV with only the onboard camera. The proposed model was evaluated with the
videos acquired from the real flight tests, and the results show that the RMSE in the head-on scenario
with clear weather condition is only 1.423 m, which is satisfactory for mid-air collision avoidance of
small UAVs. The major achievements are summarized as follows:
1. A custom YOLOv3 detector has been trained to detect a fixed-wing aircraft with high accuracy.
2. A vision-based distance estimation approach with monocular camera is proposed to verify the
feasibility of mid-air collision avoidance of small UAVs.
3. A CNN distance regression model has been trained and evaluated by using air-to-air videos
acquired from real flight tests.
4. A procedure to synthesize the dataset for training and testing of the deep learning-based approach
is proposed in this study.
5. The real flight experiments were conducted to evaluate the performance of the proposed approach
for the application of SAA and mid-air collision avoidance of small UAVs in the near future.
However, there are still some limitations of the proposed method in this study. One limitation
is that the model is very sensitive to the scale of the intruder. Therefore, the size of the intruder
should be similar to that used to train the model. The other is that the model is unable to estimate
the distance of the object at long range, since the pixels occupied by the intruder in the cropped image
show no significant change, and thus the distance of the intruder cannot be determined. Moreover, the real
flight experiments conducted in this study are limited to above-the-horizon scenarios. In the future,
below-the-horizon scenarios should be considered to prevent the mid-air collision of the intruder from
a lower altitude, and the long-distance estimation is also required to improve the distance estimation
model for high-speed UAVs.
Author Contributions: Conceptualization, Y.-C.L. and Z.-Y.H.; methodology, Y.-C.L.; software, Z.-Y.H.; validation,
Y.-C.L.; formal analysis, Y.-C.L.; investigation, Y.-C.L. and Z.-Y.H.; resources, Y.-C.L. and Z.-Y.H.; data curation,
Z.-Y.H.; writing—original draft preparation, Y.-C.L. and Z.-Y.H.; writing—review and editing, Y.-C.L.; visualization,
Z.-Y.H.; supervision, Y.-C.L. All authors have read and agreed to the published version of the manuscript.
Funding: This work is supported by Ministry of Science and Technology of Taiwan (MOST) under contract
MOST 108-2221-E-006-071-MY3 and, in part, the Ministry of Education, Taiwan, Headquarters of University
Advancement to the National Cheng Kung University (NCKU).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Nex, F.; Remondino, F. UAV for 3D mapping applications: A review. Appl. Geomat. 2014, 6, 1–15. [CrossRef]
2. Xu, S.; Savvaris, A.; He, S.; Shin, H.-S.; Tsourdos, A. Real-time implementation of YOLO + JPDA for small
scale UAV multiple object tracking. In Proceedings of the 2018 International Conference on Unmanned
Aircraft Systems (ICUAS), Dallas, TX, USA, 12–15 June 2018; pp. 1336–1341.
3. Li, J.; Ye, D.H.; Chung, T.; Kolsch, M.; Wachs, J.; Bouman, C. Multi-target detection and tracking from a single
camera in Unmanned Aerial Vehicles (UAVs). In Proceedings of the 2016 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016; pp. 4992–4997.
4. Ammour, N.; Alhichri, H.; Bazi, Y.; Benjdira, B.; Alajlan, N.; Zuair, M. Deep learning approach for car
detection in UAV imagery. Remote Sens. 2017, 9, 312. [CrossRef]
5. Uto, K.; Seki, H.; Saito, G.; Kosugi, Y. Characterization of rice paddies by a UAV-mounted miniature
hyperspectral sensor system. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 6, 851–860. [CrossRef]
6. James, J.; Ford, J.J.; Molloy, T.L. Learning to Detect Aircraft for Long-Range Vision-Based Sense-and-Avoid
Systems. IEEE Robot. Autom. Lett. 2018, 3, 4383–4390. [CrossRef]
7. Fasano, G.; Accado, D.; Moccia, A.; Moroney, D. Sense and avoid for unmanned aircraft systems. IEEE Aerosp.
Electron. Syst. Mag. 2016, 31, 82–110. [CrossRef]
8. Yu, X.; Zhang, Y. Sense and avoid technologies with applications to unmanned aircraft systems: Review and
prospects. Prog. Aerosp. Sci. 2015, 74, 152–166. [CrossRef]
9. Carnie, R.; Walker, R.; Corke, P. Image processing algorithms for UAV “sense and avoid”. In Proceedings
of the 2006 IEEE International Conference on Robotics and Automation (ICRA 2006), Orlando, FL, USA,
15–19 May 2006; pp. 2848–2853.
10. Lai, J.; Ford, J.J.; Mejias, L.; O’Shea, P. Characterization of Sky-region Morphological-temporal Airborne
Collision Detection. J. Field Robot. 2013, 30, 171–193. [CrossRef]
11. Nussberger, A.; Grabner, H.; Van Gool, L. Aerial object tracking from an airborne platform. In Proceedings of
the 2014 International Conference on Unmanned Aircraft Systems (ICUAS), Orlando, FL, USA, 27–30 May 2014;
pp. 1284–1293.
12. Zhu, Q.; Yeh, M.-C.; Cheng, K.-T.; Avidan, S. Fast human detection using a cascade of histograms of oriented
gradients. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; pp. 1491–1498.
13. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE
International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; pp. 1150–1157.
14. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the
2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai,
HI, USA, 8–14 December 2001; p. I-I.
15. Liu, C.; Chang, F.; Liu, C. Cascaded split-level colour Haar-like features for object detection. Electron. Lett.
2015, 51, 2106–2107. [CrossRef]
16. Ye, D.H.; Li, J.; Chen, Q.; Wachs, J.; Bouman, C. Deep Learning for Moving Object Detection and Tracking from a
Single Camera in Unmanned Aerial Vehicles (UAVs). Electron. Imaging 2018, 2018, 4661–4666. [CrossRef]
17. Zhao, Z.-Q.; Zheng, P.; Xu, S.-T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural
Netw. Learn. Syst. 2019, 30, 3212–3232. [CrossRef]
18. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal
networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada,
7–12 December 2015; pp. 91–99.
19. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the 2017 IEEE International
Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969.
20. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox
detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands,
11–14 October 2016; pp. 21–37.
21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA,
26 June–1 July 2016; pp. 779–788.
22. Saqib, M.; Khan, S.D.; Sharma, N.; Blumenstein, M. A study on detecting drones using deep convolutional
neural networks. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and
Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–5.
23. Schumann, A.; Sommer, L.; Klatte, J.; Schuchert, T.; Beyerer, J. Deep cross-domain flying object classification
for robust UAV detection. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video
and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6.
24. Opromolla, R.; Fasano, G.; Accardo, D. A vision-based approach to UAV detection and tracking in cooperative
applications. Sensors 2018, 18, 3391. [CrossRef] [PubMed]
25. Jin, R.; Jiang, J.; Qi, Y.; Lin, D.; Song, T. Drone detection and pose estimation using relational graph networks.
Sensors 2019, 19, 1479. [CrossRef] [PubMed]
26. Wu, M.; Xie, W.; Shi, X.; Shao, P.; Shi, Z. Real-time drone detection using deep learning approach.
In Proceedings of the International Conference on Machine Learning and Intelligent Communications,
Hangzhou, China, 6–8 July 2018; pp. 22–32.
27. Rezaei, M.; Terauchi, M.; Klette, R. Robust vehicle detection and distance estimation under challenging
lighting conditions. IEEE Trans. Intell. Transp. Syst. 2015, 16, 2723–2743. [CrossRef]
28. Monajjemi, M.; Mohaimenianpour, S.; Vaughan, R. UAV, come to me: End-to-end, multi-scale situated
HRI with an uninstrumented human and a distant UAV. In Proceedings of the 2016 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016; pp. 4410–4417.
29. Bauer, P.; Hiba, A.; Bokor, J.; Zarandy, A. Three dimensional intruder closest point of approach estimation based-on
monocular image parameters in aircraft sense and avoid. J. Intell. Robot. Syst. 2019, 93, 261–276. [CrossRef]
30. Zhang, Y.; Wang, W.; Huang, P.; Jiang, Z. Monocular Vision-based Sense and Avoid of UAV Using Nonlinear
Model Predictive Control. Robotica 2019, 37, 1582–1594. [CrossRef]
31. Wang, Y.; Wang, C.; Zhang, H.; Dong, Y.; Wei, S. Automatic ship detection based on RetinaNet using
multi-resolution Gaofen-3 imagery. Remote Sens. 2019, 11, 531. [CrossRef]
32. Luo, X.; Tian, X.; Zhang, H.; Hou, W.; Leng, G.; Xu, W.; Jia, H.; He, X.; Wang, M.; Zhang, J. Fast Automatic Vehicle
Detection in UAV Images Using Convolutional Neural Networks. Remote Sens. 2020, 12, 1994. [CrossRef]
33. Ophoff, T.; Puttemans, S.; Kalogirou, V.; Robin, J.-P.; Goedemé, T. Vehicle and Vessel Detection on Satellite
Imagery: A Comparative Study on Single-Shot Detectors. Remote Sens. 2020, 12, 1217. [CrossRef]
34. Ponce, H.; Brieva, J.; Moya-Albor, E. Distance estimation using a bio-inspired optical flow strategy applied to
neuro-robotics. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio,
Brazil, 8–13 July 2018; pp. 1–7.
35. Haseeb, M.A.; Guan, J.; Ristić-Durrant, D.; Gräser, A. DisNet: A novel method for distance estimation from
monocular camera. In Proceedings of the 10th Planning, Perception and Navigation for Intelligent Vehicles
(PPNIV18), Madrid, Spain, 1 October 2018.
36. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
37. Jensen, M.B.; Nasrollahi, K.; Moeslund, T.B. Evaluating state-of-the-art object detector on challenging traffic
light data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops,
Honolulu, HI, USA, 21–26 July 2017; pp. 9–15.
38. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft
coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich,
Switzerland, 5–12 September 2014; pp. 740–755.
39. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv
2014, arXiv:1409.1556.
40. Agarap, A.F. Deep learning using rectified linear units (relu). arXiv 2018, arXiv:1803.08375.
41. Opromolla, R.; Inchingolo, G.; Fasano, G. Airborne visual detection and tracking of cooperative UAVs
exploiting deep learning. Sensors 2019, 19, 4332. [CrossRef] [PubMed]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).