Article
Detection of a Moving UAV Based on Deep
Learning-Based Distance Estimation
Ying-Chih Lai * and Zong-Ying Huang
Department of Aeronautics and Astronautics, National Cheng Kung University, Tainan 701, Taiwan;
[email protected]
* Correspondence: [email protected]; Tel.: +886-6-275-7575 (ext. 63648)
Received: 27 July 2020; Accepted: 14 September 2020; Published: 17 September 2020
Abstract: Distance information of an obstacle is important for obstacle avoidance in many applications,
and could be used to determine the potential risk of object collision. In this study, the detection of a
moving fixed-wing unmanned aerial vehicle (UAV) with deep learning-based distance estimation is
proposed to conduct a feasibility study of sense and avoid (SAA) and mid-air collision avoidance of
UAVs, using a monocular camera to detect and track an incoming UAV. A quadrotor is
regarded as an owned UAV, and it is able to estimate the distance of an incoming fixed-wing intruder.
The adopted object detection method is based on the you only look once (YOLO) object detector.
Deep neural network (DNN) and convolutional neural network (CNN) methods are applied to examine
their performance in the distance estimation of moving objects. The feature extraction of fixed-wing
UAVs is based on the VGG-16 model, and then its result is applied to the distance network to estimate
the object distance. The proposed model is trained by using synthetic images from animation software
and validated by using both synthetic and real flight videos. The results show that the proposed
active vision-based scheme is able to detect and track a moving UAV with high detection accuracy
and low distance errors.
Keywords: unmanned aerial vehicle (UAV); you only look once (YOLO); deep neural network
(DNN); convolutional neural network (CNN); object detection; sense and avoid (SAA); mid-air
collision avoidance
1. Introduction
With the advance of technology, unmanned aerial vehicles (UAVs) have become popular in the
past two decades due to their wide and various applications. The advantages of UAVs include low
cost, a less stressful working environment for operators, and long endurance. Most important of all, UAVs are
unmanned, so they can reduce the need for manpower, and thus reduce the number of casualties caused
by accidents. They also have many different applications including aerial photography, entertainment,
3D mapping [1], object detection for different usages [2–4], military use, and agriculture applications,
such as pesticide spraying and vegetation monitoring [5]. With the increasing number of UAVs,
more and more UAVs are flying in the same airspace. If there is no air traffic control and
management of UAVs, accidents and mid-air collisions may happen, which is one of the most
significant risks that UAVs are facing [6]. Thus, UAV sense and avoid (SAA) has become a critical issue.
A comprehensive review of the substantial breadth of SAA architectures, technologies, and algorithms
is presented in the tutorial [7], which concludes with a summary of the regulatory and technical
issues that continue to challenge the progress on SAA. Without a human pilot onboard, unmanned
aircraft systems (UASs) have to rely solely on SAA systems in dense UAS operations in urban
environments or when they are merged into the National Airspace System (NAS) [8]. Many factors
need to be considered for UAS traffic management (UTM), such as cost, the payload of the UAV,
and the accuracy of the sensor. Therefore, the determination of suitable sensors in UAV SAA of UTM for object
sensing is essential.
According to how the information is transmitted, current sensor technologies for SAA can be
classified as cooperative and non-cooperative methods [8]. For cooperative sensors, communication
devices need to be equipped to communicate with the aircraft in the same airspace, such as the traffic
alert and collision avoidance system (TCAS) and the automatic dependent surveillance-broadcast
(ADS-B), which have been widely used in commercial airlines. In contrast to cooperative sensors,
non-cooperative sensors do not need to be equipped with the same communication devices to exchange
data with the other aircraft sharing the same airspace. Moreover, non-cooperative sensors, such as
light detection and ranging (LIDAR), radar, and optical sensors (cameras), are able to detect not only
air objects but also ground targets. One drawback of small-scale UAVs is the limitation of their
payload capability. Therefore, the camera becomes an ideal sensor for object and target detection.
The camera has many advantages, such as its light weight, low cost, and ease of installation,
and it is also widely used in different applications.
Computer vision is one of the popular research topics for onboard systems of UAVs, as it makes the vehicles
able to “see” targets or objects. With the rapid development of computer vision, vision-based navigation
is now a promising technology for detecting potential threats [6]. For object sensing/detection,
many approaches have been proposed, such as the multi-stage detection pipeline [9–11], machine learning [12–15],
and deep learning [16]. Deep learning is widely used in machine vision for object detection, localization,
and classification. In contrast to traditional object detection methods, detectors using deep learning are
able to learn semantic, high-level, and deeper features to address the problems existing in traditional
architectures [17]. Detectors based on deep learning can be divided into two categories, one stage and two
stage. Two-stage detectors require a region proposal network (RPN) to generate regions of interests (ROI),
such as the faster region convolution neural network (R-CNN) or the mask R-CNN [18,19]. On the other
hand, the one-stage detector considers object detection as a single regression problem by taking an image as
input to learn class probabilities and bounding box coordinates, such as the single shot multi-box detector
(SSD) or you only look once (YOLO) [20,21]. Two-stage detectors have higher accuracy than one-stage
detectors, but their computational cost is also higher.
Vision-based object detection methods have been studied for many decades and applied in many
applications. In recent years, many studies have focused on UAV detection with vision-based
methods and deep learning [22–26]. These studies focus on the detection of quadrotor or multirotor
UAVs, commonly known as drones, but detectors for small fixed-wing UAVs are difficult to obtain;
fixed-wing UAVs have a higher flight speed than multirotors, which increases the challenge for
vision-based detectors. Moreover, most of these studies emphasized the development of object detectors,
and there is no vision-based distance estimation for the feasibility study of SAA and mid-air collision
avoidance of UAVs using a monocular camera to detect an incoming small fixed-wing UAV. Some vision-based
detection approaches for mid-air collision avoidance have been proposed for light fixed-wing aircraft.
For example, a multi-stage image processing pipeline based on the hidden Markov model (HMM)
has been utilized to detect aircraft with slow motion on the image plane [10]. The key stages of the
multi-stage pipeline are stabilized image input, image preprocessing, temporal filtering, and detection
logic. The advantage of this approach is that it can detect a Cessna 182 aircraft at long distance.
However, when the movement of the aircraft on the image plane is too fast, this algorithm will fail.
In [6], the proposed long-range vision-based SAA utilized the same multi-stage pipeline. Moreover,
instead of using only morphological image processing in the image processing stage, deep learning-based
pixel-wise image segmentation is also applied to increase the detection range of a Cessna 182 whilst
maintaining low false alarms. It classifies every pixel in the image into two classes, aircraft and non-aircraft.
Regarding UAVs, Li et al. proposed a new method to detect and track UAVs from a monocular
camera mounted on the owned aircraft [3]. The main idea of this approach is to adopt background
subtraction. The background motion is calculated via optical flow to obtain the background subtracted
images and to find the moving targets. This approach is able to detect moving objects without the
limitations of moving speed or visual size.
For obstacle avoidance, the distance information of the target object usually plays an important
role. However, it is difficult to estimate distance with only a monocular camera. Some approaches
exploit known information, such as the camera focal length and the height of the object, to calculate the
distance via the pinhole model, and usually assume that the height or width of the object is known [27,28].
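As a simple illustration of this pinhole relation (and not the deep learning approach developed later in this study), the distance can be computed from an assumed known object width; the numbers below are hypothetical:

```python
def pinhole_distance(focal_length_px: float, real_width_m: float, bbox_width_px: float) -> float:
    """Similar-triangles pinhole relation: distance = focal length * real width / image width."""
    return focal_length_px * real_width_m / bbox_width_px

# Hypothetical example: a 1.4 m wingspan spanning 60 px with an 800 px focal length
# gives roughly 18.7 m.
print(pinhole_distance(800.0, 1.4, 60.0))
```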
Deep learning-based distance estimation of objects on the ground has been proposed in
many studies, but deep learning-based object detection of UAVs for mid-air collision avoidance is
rare according to the literature survey. There are some studies focused on the monocular vision-based
SAA of UAVs [29,30]. In the study [29], an approach to deal with monocular image-based SAA
assuming constant aircraft velocities and straight flight paths was proposed and simulated in
software-in-the-loop simulation test runs. A nonlinear model predictive control scheme for a UAV
SAA scenario, which assumes that the intruder’s position is already confirmed as a real threat and
the host UAV is on the predefined trajectory at the beginning of the SAA process, was proposed and
verified through simulations [30]. However, in these two studies, there is no object detection method
and no real image data acquired from a monocular camera. For deep learning-based object detection,
most of the studies utilize the images acquired from UAVs or a satellite to detect and track the objects
on the ground, such as automatic vehicles, airplanes, and vessels [31–33]. For ground vehicles, Ponce et al.
proposed a monocular distance estimation system for neuro-robotics by using a CNN, taking the
horizontal and vertical image motion estimated via optical flow as inputs to the trained CNN model,
with the distance information from ultrasonic sensors as the reference [34]. The distance is successfully
estimated using only a camera, but the distance estimation results become worse when the velocity
of the robot increases. In [35], a deep neural network (DNN) named DisNet is proposed to estimate the
distance from a ground vehicle to objects, and it applied the bounding boxes of the objects detected by
YOLO and image information, such as width and height, as inputs to train DisNet. The results show
that DisNet is able to estimate the distance between the objects and the camera without either explicit
camera parameters or prior knowledge about the scene. However, the accuracy of the estimated distance
may be directly affected by the width and height of the bounding box.
With the rapid development in technology, UAVs have become an off-the-shelf consumer product.
However, if there is no traffic control or UTM system to manage UAVs when they fly in the same
airspace, it may cause mid-air collisions, property loss, or casualties. Therefore, SAA and mid-air
collision avoidance for UAVs have become an important issue. The goal of this study is to develop the
detection of a moving UAV based on deep learning distance estimation to conduct the feasibility study
of SAA and mid-air collision avoidance of UAVs. The adopted sensor for the detection of the moving
object is a monocular camera, and DNN and CNN were applied to estimate the distance between the
intruder and the owned UAV.
The rest of this study is organized as follows: In Section 2, the overview of this study is presented,
including the architecture of the proposed detection scheme and the methods to accomplish object
detection. The methods of the proposed distance estimation using deep learning are presented in
Section 3, and the introduction to model architecture and a proposed procedure to synthesize the
dataset for training the model are also presented. Section 4 presents the performance evaluation of
the proposed methods by using synthetic videos and real flight experiments. Results and discussions
of model evaluation and experiments are shown in Section 5. Finally, the conclusion of this study is
addressed in Section 6.
especially for aircraft moving at relatively high speed. In this study, since the camera is a passive
non-cooperative sensor, a monocular camera was selected to be the only sensor to detect the target
object in the airspace. A multi-stage object detection scheme is proposed to obtain the distance
estimation of the moving targets on the image plane at long and short distances. The background
subtraction method, based on the approach in [3], is applied to detect the long-range target and the
moving object with a moving background on the image plane. When the target object is approaching
the owned UAV, a deep learning-based model is trained to estimate the distance. Then, according to
the distance estimation of the detected object on the image plane and its dynamic motion, a risk
assessment of mid-air collision could be conducted to prevent a mid-air collision from occurring.
Figure 1 shows the flow chart of the research process of the proposed multi-stage target detection
and distance estimation using a deep learning-based approach.
Figure 1. Flow chart of the research process: object detection (background subtraction for long distance; deep learning YOLO detector for short distance), distance estimation (Method 1: CNN regression; Method 2: DNN regression), and risk assessment.
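To make the flow of Figure 1 concrete, the following minimal Python sketch wires the stages together; the callables, dictionary keys, and the pixel-size threshold are hypothetical placeholders, not the authors' implementation.

```python
def process_frame(frame, ownship_state, detectors, estimate_distance, assess_risk,
                  short_range_px=20):
    """High-level flow of Figure 1: detection, distance estimation, risk assessment.

    `detectors`, `estimate_distance`, and `assess_risk` are caller-supplied callables;
    their names and the `short_range_px` threshold are illustrative assumptions.
    """
    # Long range: moving-target detection against a moving background [3].
    detection = detectors["background_subtraction"](frame)
    if detection is None:
        return None  # nothing detected in this frame
    distance = None
    # Short range: the target is large enough on the image plane for the deep
    # learning (YOLO) detector and the CNN/DNN distance regression.
    if detection["size_px"] > short_range_px:
        detection = detectors["yolo"](frame) or detection
        distance = estimate_distance(frame, detection)
    return assess_risk(detection, distance, ownship_state)
```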
prediction and class probabilities [37]. The YOLO detector is well-known for its computational speed,
and it is a good choice for real-time applications. YOLOv3 is the third version of YOLO, which has a
deeper network for feature extraction, a different network architecture, and a new loss function [36].
The new architecture of YOLOv3 boasts residual skip connections and upsampling. The most significant
feature of v3 is that it makes detections at three different scales. The upsampled layers concatenated
with the previous layers help preserve the fine-grained features, which helps in detecting small objects.
More details of the different YOLO detectors are introduced in the literature [36,37].
Since the YOLOv3 detector is a high-speed detector, it is a good choice when real-time detection
with acceptable accuracy is required for the onboard computing system of small UAVs. Because the
purpose of this study is to conduct a feasibility study of active vision-based SAA for small UAVs using
a deep learning-based approach, YOLOv3 is selected to be the detector for detecting the fixed-wing
intruder. In order to perform the distance estimation with YOLOv3, the intruder distance is estimated
at short range, where the object appearance on the image plane is larger than a few pixels. Moreover,
the YOLOv3 detector was run on a personal computer to detect the object and to estimate the distance
between the intruder and the owned UAV by post processing the synthetic images acquired from
animation software and the videos from real flight tests. The computing power of the developed
vision-based SAA is still regarded as a limitation to improve on for future real-time onboard implementation.
2.2. Object Collection
In this study, a low-cost fixed-wing UAV, named Sky Surfer X8, with a wingspan of 1400 mm,
an overall length of 915 mm, and a flying weight of 1 kg was adopted to be the intruder. The real flight
tests were conducted by using a Pixhawk autopilot to perform waypoint tracking in auto mode.
In the training process, the proposed model was trained by using synthetic images of the Sky Surfer
from animation software. With the synthetic images, the YOLOv3 detector pre-trained with the
Microsoft COCO dataset [38] was used to train the feature extractor with the custom images of UAVs in
this study. To train the custom YOLOv3 detector, it is necessary to collect images with the target
fixed-wing UAV. The software named Blender, which is a free and open-source 3D creation suite,
was utilized to synthesize the custom images. It supports the entirety of the 3D pipeline, such as
modeling, animation, motion graphics, and rendering. Figure 2 shows one of the synthesized images
used to train the custom YOLOv3 detector, and the UAV in each image is composited with a real image
as the background.
Figure 2. Synthetic image made by Blender.
To train the model with the dataset, it is necessary to label the images in the training dataset with
a bounding box and a class, respectively. The outputs of YOLOv3 are the bounding box information
(coordinates) and classes. In this study, there is only one class, which is the fixed-wing UAV. Figure 3
shows the labeling process, and the adopted tool used to label the images is LabelImg, which is also
open-source software.
Figure 3. Labeling image of the training dataset.
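For reference, a custom single-class YOLOv3 model of this kind can be queried for bounding boxes with OpenCV's DNN module; the following is only a sketch, and the `yolov3_uav.cfg`/`yolov3_uav.weights` file names are hypothetical, not the authors' exact code.

```python
import cv2
import numpy as np

def detect_uav(frame, net, conf_thr=0.5, nms_thr=0.4):
    """Return [x, y, w, h] boxes for the single 'fixed-wing UAV' class."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes, scores = [], []
    for output in net.forward(net.getUnconnectedOutLayersNames()):
        for det in output:
            score = float(det[5])  # one class, so a single class score per detection
            if score > conf_thr:
                cx, cy, bw, bh = det[0:4] * np.array([w, h, w, h])
                boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
                scores.append(score)
    if not boxes:
        return []
    keep = cv2.dnn.NMSBoxes(boxes, scores, conf_thr, nms_thr)
    return [boxes[i] for i in np.array(keep).reshape(-1)]

# net = cv2.dnn.readNetFromDarknet("yolov3_uav.cfg", "yolov3_uav.weights")  # hypothetical files
# uav_boxes = detect_uav(frame, net)
```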
Figure 4. Detection results of the custom you only look once (YOLO)v3 detector on synthetic images.
Figure 5. Detection results of the custom YOLOv3 detector from real images.
3. Distance Estimation
Since the detected objects on the 2D image plane could not provide the distance of the intruder,
the depth of the target object is required to obtain its movement in 3D space. In this study, the distance
between the ownship and intruder is estimated by deep learning-based methods to achieve the SAA of
UAVs. To obtain more accurate distance estimation results, two different deep learning methods are
used to compare their performance of distance estimation in this study. One is CNN and the other is
DNN with the DisNet regression model. From the comparison results, the better one will be applied to
the videos of real flight tests in this study.
3.1. Distance Estimation Using CNN
CNN is a powerful algorithm in deep learning, and it is able to extract the different features of
objects during the training process. In this study, the distance estimation is considered as a simple
CNN regression problem, and the images with the target object were cropped as the inputs of the
CNN distance regression model. As shown in Figure 6, the CNN distance regression model could be
separated into two parts, the feature extraction network and the distance network.
Figure 6. The architecture of the convolutional neural network (CNN) distance estimation system.
3.1.1. Model Architecture
Feature Extraction Network
As shown in Figure 7, the feature extraction network is based on VGG-16 [39], which contains five
convolution blocks, each followed by a max-pooling layer. The feature extraction network is initialized
with weights pre-trained on ImageNet. Then, the layers before the third pooling layer were frozen to
fine-tune the remaining layers. In the model evaluation, the results show that the model with no frozen
layers in the feature extraction network has a larger training loss (around 0.7 to 1.3) compared to that
with frozen layers in the feature extraction network (around 0.2 to 0.5). Therefore, the feature extraction
network with frozen layers was chosen in this study.
The reasons for freezing some layers are as follows:
1. It could reduce the number of trainable parameters of the model.
2. The weights (filters) are pre-trained with ImageNet, an image database, to improve the
performance of the filters in feature extraction.
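The description above can be reproduced with a short Keras sketch; freezing up to the third pooling block ("block3_pool") follows the text, while the unit counts of the fully connected layers are illustrative assumptions rather than values reported here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_distance_model(input_shape=(100, 100, 3)):
    # VGG-16 backbone pre-trained on ImageNet, without the classification head.
    backbone = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                           input_shape=input_shape)
    # Freeze the layers up to the third pooling layer; fine-tune the rest.
    trainable = False
    for layer in backbone.layers:
        layer.trainable = trainable
        if layer.name == "block3_pool":
            trainable = True
    # Distance network: three FC layers before the linear output (unit counts assumed).
    x = layers.Flatten()(backbone.output)
    x = layers.Dense(256, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    distance = layers.Dense(1, activation="linear")(x)  # estimated distance in metres
    return models.Model(backbone.input, distance)
```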
Figure 8. The architecture of the distance regression network.
To decide how many FC layers, excluding the output layer, have to be used in the distance
network, and to discuss whether the distance network with different numbers of FC layers affects the
performance, two different architectures, with three FC layers and four FC layers, were compared in this
study. The evaluation results of the models with different numbers of FC layers are shown in Figure 9,
where GT represents the ground truth. Models 5 to 8 and Models 20 to 21 are the results with three
FC layers. Models 13 to 15 are the results with four FC layers. The training and validation losses of
all models are able to converge at around 0.2 to 0.5, and the results show that there is no significant
difference between the models with three FC layers and four FC layers. However, the models with three
FC layers are slightly more accurate than those with four FC layers, and the number of parameters of the
models with three FC layers is much smaller than that of the models with four FC layers, which can decrease
the training time.
Figure 9. Evaluation results of the models with different numbers of fully connected (FC) layers.
3.1.2. Data Collection
Because there is no existing dataset suitable for the CNN distance regression model, it is necessary to
build a dataset to train the model, which is able to estimate the distance between the ownship and
intruder UAVs using the deep learning-based approach. In order to obtain a dataset with a large number
of various cropped images that contain a UAV at various distances and orientations, a procedure to
synthesize this dataset is proposed in this study. In contrast to the approach in [35], which is a
ground-based distance estimation for railway obstacle avoidance, this study presents an air-to-air
obstacle avoidance scheme, in which it is more difficult to collect real scene images for training,
because the ground truth of the estimated distance needs to be determined rigorously.
Synthetic Images
To address the previously mentioned problem, Blender software was utilized to create the desired
synthetic images. For the training dataset, a small-scale UAV, the Sky Surfer X8, was imported into
Blender as the intruder, and then it was randomly rotated to obtain different orientations, and the
camera was adjusted to acquire various distances. In this study, scenes of a UAV flying toward the
camera were considered, and the scenarios of head-on and crossing encounters were conducted.
The rotation range of the UAV was also limited to prevent unusual attitudes and the overtaking case.
The information regarding the dataset built to train the CNN distance regression model is listed in
Table 1. Figure 10 shows the interface of Blender, which is able to change the location of the intruder
by setting the parameters in the red box and to change the attitude parameters in the yellow box.
Figure 11 shows one of the synthetic images produced by Blender, and Figure 12 shows some cropped
images of the developed training dataset.
Table 1. Information regarding the training dataset.
Information Size
Image shape before being cropped 3840 × 2160 × 3
Cropped image shape 100 × 100 × 3
Attitude Rotation Range
Roll angle range −15° ~ 15°
Pitch angle range −15° ~ 15°
Yaw angle range −75° ~ 75°
Figure 12. Examples of the cropped images with different distances and orientations for model training.
Image Augmentation
In order to create more data for model training, an image augmentation process, which randomly
changes the images before inputting them into the model according to the given parameters, was applied
during model training. Moreover, the image augmentation process can also prevent the trained model
from overfitting. The augmentation process used in this study includes rotations and translations of
the target object, which are performed by the image processing operations of width shifting and height
shifting. The parameters are listed in Table 2. For the translation process, the factor of 0.35 means
shifting at most 70 pixels for a target object with a size of 200 × 200 pixels, and this value changes based
on the size of the input images. For the rotation process, the maximum rotation angle is 3 degrees.
In the training process, the image augmentation process randomly selects a set of parameter
combinations of translation and rotation for each epoch.
Table 2. Image augmentation parameters.
Augmentation Parameter
Width shift range 0.35
Height shift range 0.35
Rotation range 3
Fill mode nearest
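One common way to realize the Table 2 policy is Keras' ImageDataGenerator; this sketch uses the listed values and is an assumption about tooling, not a statement of the authors' exact implementation.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation parameters from Table 2.
augmenter = ImageDataGenerator(
    width_shift_range=0.35,   # horizontal translation as a fraction of image width
    height_shift_range=0.35,  # vertical translation as a fraction of image height
    rotation_range=3,         # maximum rotation angle in degrees
    fill_mode="nearest",      # fill newly exposed pixels with the nearest valid value
)

# Hypothetical usage with cropped images and distance labels as NumPy arrays:
# train_flow = augmenter.flow(train_images, train_distances, batch_size=32)
```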
3.1.3. Training Result
In order to train the proposed model, the dataset was collected by using the proposed procedure
to synthesize the training data, as previously mentioned. The number of produced images—which are
cropped in RGB with a distance range from 30 m to 95 m—in the dataset for training is about 10,000.
First of all, the images were normalized to increase the training speed and model robustness, and then
split into 80% for training and 20% for validation. The mean square error (MSE) was chosen to be the
loss function, as shown in Equation (1), where y_i is the ground truth and ŷ_i is the prediction from the
proposed model. Adaptive moment estimation (Adam) with the learning rate decay shown in
Equation (2) was chosen to be the optimizer; the model training result is illustrated in Figure 13.
It took about 38 min to train the model with an NVIDIA GeForce GTX 1660 Graphics Processing
Unit (GPU) card.
MSE = (1/n) * sum_{i=1}^{n} (y_i − ŷ_i)^2   (1)
Learning Rate = 0.001 / Epoch   (2)
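A minimal training sketch matching Equations (1) and (2) is shown below; it reuses the build_cnn_distance_model() sketch from Section 3.1.1, shifts the epoch index by one to avoid division by zero, and the epoch count is an assumption.

```python
import tensorflow as tf

def lr_schedule(epoch, lr):
    # Equation (2): learning rate = 0.001 / epoch (epoch treated as 1-based here).
    return 0.001 / (epoch + 1)

model = build_cnn_distance_model()                      # sketch from Section 3.1.1
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="mse")                               # Equation (1)

# images: (N, 100, 100, 3) normalized cropped RGB images; distances: (N,) labels in metres.
# model.fit(images, distances, validation_split=0.2, epochs=50,
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```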
Figure 14. The architecture of the deep neural network (DNN) distance estimation system. (Blocks in the original figure: cropped image, CNN attitude model, bounding box rectification, DisNet regression model/DNN distance network, output distance.)
Figure 15. Process to rectify the bounding box given by the YOLOv3 detector.
3.2.3. DNN Architecture
Figure 16 shows the architecture of the DNN distance model, which consists of three hidden layers
with 100 hidden units each. The input vector is shown in Equation (3), and the output value is the
estimated distance of the object. The distance network is trained with the same loss function and
optimizer as in Section 3.1.3.
v = [ 1/B_h  1/B_w  1/B_d  φ  θ  ϕ ]   (3)
where
B_h: height of the object bounding box in pixels/image height in pixels;
B_w: width of the object bounding box in pixels/image width in pixels;
B_d: diagonal of the object bounding box in pixels/image diagonal in pixels;
φ: estimated roll angle;
θ: estimated pitch angle;
ϕ: estimated yaw angle.
Figure 16. Architecture of the DNN distance network.
Figure 17. Comparison of CNN and DNN distance regression models: (a,b) the distance range of the intruder flying from 60 to 30 m; (c) the distance ranges from 50 to 35 m.
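For completeness, the DNN branch of Section 3.2.3 compared above (Equation (3) plus three 100-unit hidden layers) can be sketched as follows; the ReLU activation is an assumption (cf. [40]), and the attitude angles are expected from the CNN attitude model of Figure 14.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def disnet_features(box_w, box_h, img_w, img_h, roll, pitch, yaw):
    """Input vector of Equation (3): inverse relative box dimensions plus attitude angles."""
    b_w = box_w / img_w
    b_h = box_h / img_h
    b_d = np.hypot(box_w, box_h) / np.hypot(img_w, img_h)
    return np.array([1.0 / b_h, 1.0 / b_w, 1.0 / b_d, roll, pitch, yaw], dtype=np.float32)

def build_dnn_distance_model():
    # Three hidden layers with 100 units each, one linear output (the distance).
    model = models.Sequential([
        layers.Input(shape=(6,)),
        layers.Dense(100, activation="relu"),
        layers.Dense(100, activation="relu"),
        layers.Dense(100, activation="relu"),
        layers.Dense(1, activation="linear"),
    ])
    model.compile(optimizer="adam", loss="mse")  # same loss and optimizer as Section 3.1.3
    return model
```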
In this study, only head-on and crossing cases are considered for the evaluation of synthetic and real
flight videos. The details about how to acquire these videos are presented in the following sections.
Figure 18. Synthetic videos for model evaluation. The red box indicates the crossing case, and the yellow box indicates the head-on case.
The results of the model evaluation with synthetic videos are given in Table 4 and Figure 19.
As shown in Table 4, the synthetic videos are grouped into two sets according to their distance.
Set I presents the shorter distance with a clear background, and Set II shows the longer distance with
a cloudy background. The root mean square error (RMSE) of each video was calculated to compare
the performance of the results. RMSE_K indicates the RMSE with the Kalman filter applied in the
distance estimation; the one-dimensional Kalman filter, which is adopted as a low-pass filter in this
study, is applied to smooth the output of the CNN distance regression model.
Figure 19 shows the estimated distance by the CNN distance regression model, where the green line
indicates the raw estimation from the model, the blue line indicates the estimation with the Kalman filter,
and the red line indicates the ground truth of the distance in each video frame. The ground truth is
determined by the positions of the intruder and the related frame with a timestamp. From Table 4
and Figure 19, it is obvious that the CNN distance regression model successfully estimated the
distance in each frame; the RMSEs are small for the different weather conditions and cases.
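The one-dimensional Kalman filter used here as a low-pass filter can be sketched as below; the process and measurement noise variances are assumed values, not those used by the authors.

```python
class ScalarKalmanFilter:
    """1-D Kalman filter smoothing the raw CNN distance estimates (constant-distance model)."""

    def __init__(self, process_var=0.05, meas_var=4.0, x0=0.0, p0=1.0):
        self.q, self.r = process_var, meas_var  # assumed noise variances
        self.x, self.p = x0, p0                 # state estimate and its variance

    def update(self, z):
        self.p += self.q                  # predict: variance grows by the process noise
        k = self.p / (self.p + self.r)    # Kalman gain
        self.x += k * (z - self.x)        # correct with the raw distance measurement z
        self.p *= (1.0 - k)
        return self.x

# kf = ScalarKalmanFilter(x0=60.0)
# smoothed = [kf.update(d) for d in raw_cnn_distances]  # raw_cnn_distances is hypothetical
```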
The real flight experiments and the performance of the CNN regression model in the real flight
test are given in the following sections.
4.2.1. Experiment 1
Experiment 1 is a head-on scenario with misty weather, and the flight trajectory is shown in
Figure 20. The yellow arrow is the flight direction, and the black arrow is the heading of the ownship.
Figure 21 shows the measurements of GPS data for model evaluation, and the distance range is from
62 m to 22 m. Figure 22 shows the results of the CNN regression model, and Table 6 shows the
information of Experiment 1 and the RMSE of the estimated distance. MEAS (Measurement) denotes
the measurements of GPS data and EST (Estimation) is the estimated distance.
Figure 21. Measurements of global positioning system (GPS) data to evaluate the model in Experiment 1.
Figure 24. Measurements of GPS data for model evaluation in Experiment 2.
Figure 25. Result of model evaluation in Experiment 2.
Figure 27. Measurements of GPS data for model evaluation in Experiment 3.
Figure 28. Result of model evaluation in Experiment 3.
1. The intruder is located at the center of the images in the training dataset. However, the intruder
in crossing cases is always far away from the image center, but the intruder in head-on cases is
close to the center of the images.
2. Most of the cropped images for the model training are in clear weather, but the synthetic videos
have a cloudy (noisy) background which may affect the accuracy.
6. Conclusions
In this work, vision-based distance estimation using a deep learning-based approach to
estimate the distance between the ownship and intruder UAVs was proposed for the feasibility study
of SAA and mid-air collision avoidance of small UAVs with a consumer-grade monocular camera. First,
the target object on the image plane was detected, classified, and located by YOLOv3, which is a popular
deep learning-based object detector. Then, the distance between the ownship and intruder UAVs was
estimated using a deep learning approach which only takes images as input. To verify the performance of
the CNN distance regression model, two types of videos were acquired in this study, synthetic and real
flight videos. The model evaluation results show that the performance of the proposed method is viable
for the SAA of a small UAV with only the onboard camera. The proposed model was evaluated with the
videos acquired from the real flight tests, and the results show that the RMSE in the head-on scenario
with clear weather condition is only 1.423 m, which is satisfactory for mid-air collision avoidance of
small UAVs. The major achievements are summarized as follows:
1. A custom YOLOv3 detector has been trained to detect a fixed-wing aircraft with high accuracy.
2. A vision-based distance estimation approach with monocular camera is proposed to verify the
feasibility of mid-air collision avoidance of small UAVs.
3. A CNN distance regression model has been trained and evaluated by using air-to-air videos
acquired from real flight tests.
4. A procedure to synthesize the dataset for training and testing of the deep learning-based approach
is proposed in this study.
5. The real flight experiments were conducted to evaluate the performance of the proposed approach
for the application of SAA and mid-air collision avoidance of small UAVs in the near future.
However, there are still some limitations of the proposed method in this study. One limitation
is that the model is very sensitive to the scale of the intruder. Therefore, the size of the intruder
should be similar to that used to train the model. The other is that the model is unable to estimate
the distance of the object at long range, since the pixels occupied by the intruder in the cropped image
show no significant change, and thus the distance of the intruder cannot be determined. Moreover, the real
flight experiments conducted in this study are limited to above-the-horizon scenarios. In the future,
below-the-horizon scenarios should be considered to prevent the mid-air collision of the intruder from
a lower altitude, and the long-distance estimation is also required to improve the distance estimation
model for high-speed UAVs.
Author Contributions: Conceptualization, Y.-C.L. and Z.-Y.H.; methodology, Y.-C.L.; software, Z.-Y.H.; validation,
Y.-C.L.; formal analysis, Y.-C.L.; investigation, Y.-C.L. and Z.-Y.H.; resources, Y.-C.L. and Z.-Y.H.; data curation,
Z.-Y.H.; writing—original draft preparation, Y.-C.L. and Z.-Y.H.; writing—review and editing, Y.-C.L.; visualization,
Z.-Y.H.; supervision, Y.-C.L. All authors have read and agreed to the published version of the manuscript.
Funding: This work is supported by Ministry of Science and Technology of Taiwan (MOST) under contract
MOST 108-2221-E-006-071-MY3 and, in part, the Ministry of Education, Taiwan, Headquarters of University
Advancement to the National Cheng Kung University (NCKU).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Nex, F.; Remondino, F. UAV for 3D mapping applications: A review. Appl. Geomat. 2014, 6, 1–15. [CrossRef]
2. Xu, S.; Savvaris, A.; He, S.; Shin, H.-S.; Tsourdos, A. Real-time implementation of YOLO + JPDA for small
scale UAV multiple object tracking. In Proceedings of the 2018 International Conference on Unmanned
Aircraft Systems (ICUAS), Dallas, TX, USA, 12–15 June 2018; pp. 1336–1341.
3. Li, J.; Ye, D.H.; Chung, T.; Kolsch, M.; Wachs, J.; Bouman, C. Multi-target detection and tracking from a single
camera in Unmanned Aerial Vehicles (UAVs). In Proceedings of the 2016 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016; pp. 4992–4997.
4. Ammour, N.; Alhichri, H.; Bazi, Y.; Benjdira, B.; Alajlan, N.; Zuair, M. Deep learning approach for car
detection in UAV imagery. Remote Sens. 2017, 9, 312. [CrossRef]
5. Uto, K.; Seki, H.; Saito, G.; Kosugi, Y. Characterization of rice paddies by a UAV-mounted miniature
hyperspectral sensor system. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 6, 851–860. [CrossRef]
6. James, J.; Ford, J.J.; Molloy, T.L. Learning to Detect Aircraft for Long-Range Vision-Based Sense-and-Avoid
Systems. IEEE Robot. Autom. Lett. 2018, 3, 4383–4390. [CrossRef]
7. Fasano, G.; Accado, D.; Moccia, A.; Moroney, D. Sense and avoid for unmanned aircraft systems. IEEE Aerosp.
Electron. Syst. Mag. 2016, 31, 82–110. [CrossRef]
8. Yu, X.; Zhang, Y. Sense and avoid technologies with applications to unmanned aircraft systems: Review and
prospects. Prog. Aerosp. Sci. 2015, 74, 152–166. [CrossRef]
9. Carnie, R.; Walker, R.; Corke, P. Image processing algorithms for UAV “sense and avoid”. In Proceedings
of the 2006 IEEE International Conference on Robotics and Automation (ICRA 2006), Orlando, FL, USA,
15–19 May 2006; pp. 2848–2853.
10. Lai, J.; Ford, J.J.; Mejias, L.; O’Shea, P. Characterization of Sky-region Morphological-temporal Airborne
Collision Detection. J. Field Robot. 2013, 30, 171–193. [CrossRef]
11. Nussberger, A.; Grabner, H.; Van Gool, L. Aerial object tracking from an airborne platform. In Proceedings of
the 2014 International Conference on Unmanned Aircraft Systems (ICUAS), Orlando, FL, USA, 27–30 May 2014;
pp. 1284–1293.
12. Zhu, Q.; Yeh, M.-C.; Cheng, K.-T.; Avidan, S. Fast human detection using a cascade of histograms of oriented
gradients. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; pp. 1491–1498.
13. Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE
International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; pp. 1150–1157.
14. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the
2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai,
HI, USA, 8–14 December 2001; p. I-I.
15. Liu, C.; Chang, F.; Liu, C. Cascaded split-level colour Haar-like features for object detection. Electron. Lett.
2015, 51, 2106–2107. [CrossRef]
16. Ye, D.H.; Li, J.; Chen, Q.; Wachs, J.; Bouman, C. Deep Learning for Moving Object Detection and Tracking from a
Single Camera in Unmanned Aerial Vehicles (UAVs). Electron. Imaging 2018, 2018, 4661–4666. [CrossRef]
17. Zhao, Z.-Q.; Zheng, P.; Xu, S.-T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural
Netw. Learn. Syst. 2019, 30, 3212–3232. [CrossRef]
18. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal
networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada,
7–12 December 2015; pp. 91–99.
19. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the 2017 IEEE International
Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969.
20. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox
detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands,
11–14 October 2016; pp. 21–37.
21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA,
26 June–1 July 2016; pp. 779–788.
22. Saqib, M.; Khan, S.D.; Sharma, N.; Blumenstein, M. A study on detecting drones using deep convolutional
neural networks. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and
Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–5.
23. Schumann, A.; Sommer, L.; Klatte, J.; Schuchert, T.; Beyerer, J. Deep cross-domain flying object classification
for robust UAV detection. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video
and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6.
24. Opromolla, R.; Fasano, G.; Accardo, D. A vision-based approach to UAV detection and tracking in cooperative
applications. Sensors 2018, 18, 3391. [CrossRef] [PubMed]
25. Jin, R.; Jiang, J.; Qi, Y.; Lin, D.; Song, T. Drone detection and pose estimation using relational graph networks.
Sensors 2019, 19, 1479. [CrossRef] [PubMed]
26. Wu, M.; Xie, W.; Shi, X.; Shao, P.; Shi, Z. Real-time drone detection using deep learning approach.
In Proceedings of the International Conference on Machine Learning and Intelligent Communications,
Hangzhou, China, 6–8 July 2018; pp. 22–32.
27. Rezaei, M.; Terauchi, M.; Klette, R. Robust vehicle detection and distance estimation under challenging
lighting conditions. IEEE Trans. Intell. Transp. Syst. 2015, 16, 2723–2743. [CrossRef]
28. Monajjemi, M.; Mohaimenianpour, S.; Vaughan, R. UAV, come to me: End-to-end, multi-scale situated
HRI with an uninstrumented human and a distant UAV. In Proceedings of the 2016 IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016; pp. 4410–4417.
29. Bauer, P.; Hiba, A.; Bokor, J.; Zarandy, A. Three dimensional intruder closest point of approach estimation based-on
monocular image parameters in aircraft sense and avoid. J. Intell. Robot. Syst. 2019, 93, 261–276. [CrossRef]
30. Zhang, Y.; Wang, W.; Huang, P.; Jiang, Z. Monocular Vision-based Sense and Avoid of UAV Using Nonlinear
Model Predictive Control. Robotica 2019, 37, 1582–1594. [CrossRef]
31. Wang, Y.; Wang, C.; Zhang, H.; Dong, Y.; Wei, S. Automatic ship detection based on RetinaNet using
multi-resolution Gaofen-3 imagery. Remote Sens. 2019, 11, 531. [CrossRef]
32. Luo, X.; Tian, X.; Zhang, H.; Hou, W.; Leng, G.; Xu, W.; Jia, H.; He, X.; Wang, M.; Zhang, J. Fast Automatic Vehicle
Detection in UAV Images Using Convolutional Neural Networks. Remote Sens. 2020, 12, 1994. [CrossRef]
33. Ophoff, T.; Puttemans, S.; Kalogirou, V.; Robin, J.-P.; Goedemé, T. Vehicle and Vessel Detection on Satellite
Imagery: A Comparative Study on Single-Shot Detectors. Remote Sens. 2020, 12, 1217. [CrossRef]
34. Ponce, H.; Brieva, J.; Moya-Albor, E. Distance estimation using a bio-inspired optical flow strategy applied to
neuro-robotics. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio,
Brazil, 8–13 July 2018; pp. 1–7.
35. Haseeb, M.A.; Guan, J.; Ristić-Durrant, D.; Gräser, A. DisNet: A novel method for distance estimation from
monocular camera. In Proceedings of the 10th Planning, Perception and Navigation for Intelligent Vehicles
(PPNIV18), Madrid, Spain, 1 October 2018.
36. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
37. Jensen, M.B.; Nasrollahi, K.; Moeslund, T.B. Evaluating state-of-the-art object detector on challenging traffic
light data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops,
Honolulu, HI, USA, 21–26 July 2017; pp. 9–15.
38. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft
coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich,
Switzerland, 5–12 September 2014; pp. 740–755.
39. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv
2014, arXiv:1409.1556.
40. Agarap, A.F. Deep learning using rectified linear units (relu). arXiv 2018, arXiv:1803.08375.
41. Opromolla, R.; Inchingolo, G.; Fasano, G. Airborne visual detection and tracking of cooperative UAVs
exploiting deep learning. Sensors 2019, 19, 4332. [CrossRef] [PubMed]
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).