Vision-Based Autonomous Drone Control Using Supervised Learning in Simulation
Department of Computing
Author: Max Christl
Supervisor: Professor Wayne Luk
Submitted in partial fulfilment of the requirements for the MSc degree in
Computing Science of Imperial College London
4 September 2020
Acknowledgements
I would like to express my sincere gratitude to Professor Wayne Luk and Dr. Ce Guo
for their thought-provoking input and continuous support throughout the project.
Contents
1 Introduction
3 Simulation
3.1 Challenges
3.2 Architecture
3.3 Drone
3.3.1 Camera
3.3.2 Sensors
3.4 Movement
3.4.1 Simple Implementation
3.4.2 Velocity Implementation
3.5 Environment
3.6 Summary
5 Evaluation
5.1 Test Flights in Simulation
5.1.1 Successful Landing Analysis
5.1.2 Flight Path Analysis
5.2 Qualitative Comparison
5.3 Summary
6 Conclusion
Appendices
C Architecture Comparison
List of Figures
4.1 Histogram of 1 000 data sample labels produced by the naive data collection procedure in the non-obstructed simulation environment
4.2 Histogram of 1 000 data sample labels produced by the sophisticated data collection procedure in the non-obstructed simulation environment
4.3 CNN architecture for optimal camera set-up analysis
4.4 Confusion matrix front camera model (67.6% accuracy)
4.5 Confusion matrix bottom camera model (60.5% accuracy)
4.6 Confusion matrix diagonal camera model (83.3% accuracy)
4.7 Confusion matrix half front and half bottom camera model (80.8% accuracy)
4.8 Confusion matrix fisheye front camera model (87.9% accuracy)
4.9 Optimised neural network architecture
4.10 Performance evaluation with varying image resolutions
Chapter 1
Introduction
The autonomous control of unmanned aerial vehicles (UAVs) has gained increasing
importance due to the rise of consumer, military and smart delivery drones.
Navigation for large UAVs which have access to the Global Positioning System (GPS)
and a number of high-end sensors is already well understood [2, 3, 4, 5, 6]. These
navigation systems are, however, not applicable to micro aerial vehicles (MAVs), which
only have limited power and computing resources, aren't equipped with high-end
sensors and are frequently required to operate in GPS-denied indoor environments.
A solution to the autonomous control of MAVs is provided by vision-based systems
that rely only on data captured by a camera and basic onboard sensors.
An increasing number of publications have applied machine learning techniques for
vision-based MAV navigation. These can generally be differentiated into Supervised
Learning approaches that are trained based on labelled images captured in the real
world [7, 8] and Reinforcement Learning approaches which train a machine learning
model in simulation and subsequently transfer the learnings to the real world using
domain-randomisation techniques [9, 10, 11]. The former approach is limited by
the fact that the manual collection and labelling of images in the real world is a time
consuming task and the resulting datasets used for training are thus relatively small.
The latter approach shows very promising results both in simulation and in the real
world. A potential downside is, however, the slow convergence rate of Reinforcement
Learning algorithms, which requires the models to be trained for several days.
We are suggesting an approach that combines the strengths of Supervised Learning
(i.e. a quick convergence rate) with the advantages of training a model in sim-
ulation (i.e. automated data collection). The task which we attempt to solve is to
autonomously navigate a drone towards a visually marked landing platform. Our ap-
proach works as follows: We capture images (and sensor information) in simulation
and label them according to the optimal flight command that should be executed
next. The optimal flight command is computed using a path planning algorithm
that determines the optimal flight path towards the landing platform. The data col-
lection procedure is fully automated and thus allows for the creation of very large
datasets. These datasets can subsequently be used to train a convolutional neural
network (CNN) through Supervised Learning, which significantly reduces the training
time. These learnings can then be transferred into the real world under the assumption
that domain randomisation techniques are able to generalise the model such that it is
able to perform
in unseen environments. We have refrained from applying domain randomisation
techniques and only tested our approach in simulation due to restrictions imposed
by the COVID-19 pandemic. Previous research has shown that sim-to-real transfer is
possible using domain randomisation [9, 10, 11, 12, 13].
This report makes two principal contributions. First, we provide a simulation pro-
gram which can be used for simulating the control commands of the DJI Tello drone
in an indoor environment. This simulation program is novel as it is specifically tar-
geted at the control commands of the DJI Tello drone, provides the possibility of
capturing images from the perspective of the drone’s camera, can be used for gen-
erating labelled datasets and can simulate flights that are controlled by a machine
learning model. Second, we propose a vision-based approach for the autonomous
navigation of MAVs using input from a low-resolution monocular camera and ba-
sic onboard sensors. A novelty of our approach is the fact that we collect labelled
data samples in simulation using our own data collection procedure (Section 4.1).
These data samples can subsequently be used for training a CNN through Super-
vised Learning. Our approach thus manages to overcome the limitations of manual
data collection in the real world and the slow convergence rate of simulation-based
Reinforcement Learning in regards to the autonomous navigation of MAVs.
The report is structured in the following way: After giving an overview of previous
research in this area, we introduce the simulation program in Chapter 3 and describe
our vision-based control approach in Chapter 4. Chapter 5 gives an overview of the
test results that have been achieved by the model in our simulation environment as
well as a qualitative comparison with previous publications. The final chapter of this
report sets out the conclusion.
Chapter 2
Background and Related Literature
This chapter gives an overview of background literature and related research that
is relevant to this project. The first section introduces literature on the autonomous
control of UAVs, and the second section focuses on literature on path planning
techniques.
2.1 Autonomous UAV Control
system is capable of mapping obstacles in its surrounding via feature points and
straight architectural lines [16]. Grzonka et al. [17] introduced a multi-stage navi-
gation solution combining SLAM with path planning, height estimation and control
algorithms. A limitation of visual SLAM is the fact that it is computationally ex-
pensive and thus requires the UAV to have access to sufficient power sources and
high-performance processors [17, 18].
In recent years, an ever increasing number of research papers have emerged which
use machine learning systems for vision-based UAV navigation. Ross et al. [19]
developed a system based on the DAgger algorithm [20], an iterative training pro-
cedure, that can navigate a UAV through trees in a forest. Sadeghi and Levine [9]
proposed a learning method called CAD2 RL for collision avoidance in indoor en-
vironments. Using this method, they trained a CNN exclusively in a domain ran-
domised simulation environment using Reinforcement Learning and subsequently
transferred the learnings into the real world [9]. Loquercio et al. [10] further ex-
tended this idea and developed a modular system for autonomous drone racing that
combines model-based control with CNN-based perception awareness. They trained
their system in both static and dynamic simulation environments using Imitation
Learning and transferred the learnings into the real world through domain randomi-
sation techniques [10]. Polvara et al. [11] proposed a modular design for navigating
a UAV towards a marker that is also trained in a domain randomised simulation
environment. Their design combines two Deep-Q Networks (DQNs) with the first
one being used to align the UAV with the marker and the second one being used for
vertical descent [11]. The task which Polvara et al. [11] attempt to solve is similar to
ours, but instead of a Reinforcement Learning approach, we are using a Supervised
Learning approach.
There also exists research in which Supervised Learning has been applied for vision-
based autonomous UAV navigation. Kim and Chen [7] proposed a system made
up of a CNN that has been trained using a labelled dataset of real world images.
The images were manually collected in seven different locations and labelled with
the corresponding flight command executed by the pilot [7]. A similar strategy for
solving the task of flying through a straight corridor without colliding with the walls
has been suggested by Padhy et al. [8]. The authors also manually collected the
images in the real world by taking multiple images at different camera angles and
used them to train a CNN which outputs the desired control command [8]. Our
approach differs in that we collect the labelled data samples
in a fully automated way in simulation. This enables us to overcome the tedious task
of manually collecting and labelling images, allowing us to build significantly larger
datasets. Furthermore, our environment includes obstacles other than walls that
need to be avoided by the UAV in order to reach a visually marked landing platform.
2.2 Path Planning
Path planning is the task of finding a path from a given starting point to a given end
point which is optimal with respect to a given set of performance indicators [15]. It is
highly applicable to UAV navigation, for example when determining how to efficiently
approach a landing area while avoiding collisions with obstacles.
Lu et al. [15] divide the literature relating to path planning into two domains, global
and local path planning, depending on the amount of information utilised during
the planning process. Local path planning algorithms have only partial information
about the environment and can react to sudden changes, making them especially
suitable for real-time decision making in dynamic environments [15]. Global path
planning algorithms have in contrast knowledge about the entire environment such
as the position of the starting point, the position of the target point and the position
of all obstacles [15]. For this project, only global path planning is relevant since we
utilise it for labelling data samples during our data collection procedure and have a
priori knowledge about the simulation environment.
In regards to global path planning, there exist heuristic algorithms which try to find
the optimal path through iterative improvement, and intelligent algorithms which
use more experimental methods for finding the optimal path [15]. We will limit our
discussion to heuristic algorithms since these are more relevant to this project. A
common approach is to model the environment using a graph consisting of nodes
and edges. Based on this graph, several techniques can be applied for finding the op-
timal path. The simplest heuristic search algorithm is breadth-first search (BFS)
[21]. This algorithm traverses the graph from a given starting node in a breadth-
first fashion, thus visiting all nodes that are one edge away before going to nodes
that are two edges away [21]. This procedure is continued until the target node is
reached, returning the shortest path between these two nodes [21]. The Dijkstra
algorithm follows a similar strategy, but it also takes positive edge weights into con-
sideration for finding the shortest path [22]. The A* algorithm further optimised the
search for the shortest path by not only considering the edge weights between the
neighbouring nodes but also the distance towards the target node for the decision of
which paths to traverse first [23, 24]. The aforementioned algorithms are the basis
for many more advanced heuristic algorithms [15].
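To make the graph-traversal idea concrete, the following minimal Python sketch finds a shortest path in an unweighted graph with BFS; the example graph, node names and function name are purely illustrative and not taken from any of the cited works.

from collections import deque

def bfs_shortest_path(graph, start, goal):
    # Return the shortest path from start to goal in an unweighted graph.
    queue = deque([[start]])       # each queue entry is a path starting at the start node
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):   # expand nodes one edge further away
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

# Example: shortest path in a small illustrative graph.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}
print(bfs_shortest_path(graph, "A", "E"))   # ['A', 'B', 'D', 'E']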
Szczerba et al. [25] expanded the idea of the A* algorithm by proposing a sparse
A* search algorithm (SAS) which reduces the computational complexity by account-
ing for more constraints such as minimum route leg length, maximum turning an-
gle, route distance and a fixed approach vector to the goal position. Koenig and
Likhachev [26] suggested a grid-based path planning algorithm called D* Lite which
incorporates ideas from the A* and D* algorithms, and subdivides 2D space into
grids with edges that are either traversable or untraversable. Carsten et al. [27]
expanded this idea into 3D space by proposing the 3D Field D* algorithm. This al-
gorithm uses interpolation to find an optimal path which doesn’t necessarily need to
follow the edges of the grid [27]. Yan et al. [28] also approximate 3D space using
a grid-based approach which subdivides the environment into voxels. After creating
a graph representing the connectivity between free voxels using a probabilistic
roadmap method, Yan et al. [28] apply the A* algorithm to find an optimal path. Our
path planning approach, which will be described in Section 4.1.2, is similar to the two
previously mentioned algorithms and other grid-based algorithms in that we also
subdivide space into a 3D grid. However, instead of differentiating between traversable
and untraversable cells which are subsequently connected in some way to find the
optimal path, we use the cells of the grid to approximate the poses that can be reached
through the execution of a limited set of flight commands.
2.3 Ethical Considerations
This is a purely software-based project which doesn't involve human participants or
animals. We did not collect or use personal data in any form.
The majority of our experiments were performed in a simulated environment. For
the small number of experiments that were conducted on a real drone, we ensured
that all safety and environmental protection measures were met by restricting the
area, adding protectors to the rotors of the drone, removing fragile objects and other
potential hazards as well as implementing a kill switch that forces the drone to land
immediately.
Since our project is concerned with the autonomous control of MAVs, there is a
potential for military use. The nature of our system is passive and non-aggressive,
with the explicit objective of not causing harm to humans, nature or any other forms
of life, and we ensured compliance with the ethically aligned design stated in the
comprehensive AI guidelines [29].
This project uses open-source software and libraries that are licensed under the BSD
2-Clause licence [30]. Use of this software is permitted for academic purposes and we
made sure to credit the developers appropriately.
Chapter 3
Simulation
This chapter gives an overview of the drone simulator. The simulator has been specif-
ically developed for this project and is capable of simulating the control commands
and subsequent flight and image capturing behaviour of the DJI Tello drone. In the
following sections, after laying out the purpose of the simulator and the challenges
it needs to overcome, the implementation details of the simulation program are de-
scribed in regards to the architecture, the drone model, the simulation of movement
and the simulation environment.
3.1 Challenges
The reason for building a drone simulator is to automate and speed up the data
collection procedure. Collecting data in the real world is a tedious task which has
several limitations. These limitations include battery constraints, damage to the
physical drone, irregularities introduced by humans in the loop and high cost for
location rental as well as human labour and technology expenses. All these factors
make it infeasible to collect training data in the real world. In contrast to this, collect-
ing data samples in a simulated environment is cheap, quick and precise. It is cheap
since the cost factors are limited to computational costs, it is quick since everything
is automated and it is precise because the exact position of every simulated object
is given in the simulation environment. In addition, the model can be generalised
such that it is capable of performing in unseen environments and uncertain condi-
tions through domain randomisation. This helps to overcome the assumptions and
simplifications about the real world which are usually introduced by simulators. For
this project, we have however refrained from introducing domain randomisation
as we weren’t able to transfer the learnings into the real world due to restrictions
imposed by COVID-19. Without the possibility of validating a domain randomisation
approach, we agreed to focus on a non-domain randomised simulation environment
instead.
Because of all the above-mentioned advantages, data collection was performed in
simulation for this project. Commercial and open-source simulators already exist,
such as Microsoft AirSim [31] or the DJI Flight Simulator [32]. These are, however,
unsuitable for this project as they come with a large overhead and aren’t specifically
targeted at the control commands of the DJI Tello drone. It was therefore neces-
sary to develop a project specific drone simulator which overcomes the following
challenges:
1. The simulator needs to be specifically targeted at the DJI Tello drone. In other
words, the simulator must include a drone with correct dimensions, the ability
of capturing image/sensor information and the ability to execute flight com-
mands which simulate the movement of the DJI Tello flight controller.
2. The simulator needs to include a data collection procedure which can automat-
ically create a labelled dataset that can be used for training a machine learning
model via Supervised Learning.
3. The simulator should be kept as simple as possible to speed up the data collec-
tion procedure.
4. The simulator needs to include an indoor environment (i.e. floor, wall, ceiling,
obstacles) with collision and visual properties.
Solutions
Polvara et al. [11] also used a simulation program for training their model. It is
not mentioned in their paper whether they developed the simulation program them-
selves or whether they used an existing simulator. In contrast to us, their simulation
program includes domain randomisation features which allowed them to transfer
the learnings into the real world [11]. Our simulation program does, however, fea-
ture the possibility of generating a labelled dataset which Polvara et al. [11] did not
require as they were using a Reinforcement Learning approach.
Kim and Chen [7] and Padhy et al. [8] did not use a simulation program, but rather
manually collected labelled images in the real world.
3.2 Architecture
The drone simulation program which has been developed for this project is based
on PyBullet [33]. PyBullet is a Python module which wraps the underlying Bullet
physics engine through the Bullet C-API [33]. PyBullet [33] has been chosen for this
project because it integrates well with TensorFlow [34], is well documented, has an
active community, is open-source and its physics engine has a proven track record.
We have integrated the functions provided by the PyBullet packages into an object
oriented program written in Python. The program is made up of several classes
which are depicted in the class diagram in Figure 3.1.
Figure 3.1: Class diagram of the drone-ai simulation program (Simulation, Camera and Object classes)
The Simulation class manages the execution of the simulation. PyBullet provides
both a GUI mode, which creates a graphical user interface (GUI) with 3D OpenGL
and allows users to interact with the program through buttons and sliders, and a
DIRECT mode, which sends commands directly to the physics engine without ren-
dering a GUI [33]. DIRECT mode is quicker by two orders of magnitude and therefore
more suitable for training a machine learning model. GUI mode is more useful for
debugging the program or to visually retrace the flight path of the drone. Users can
switch between these two modes by setting the render variable of the Simulation
class. A user can create a simulation environment by calling the add_room function,
which adds walls and ceilings to the environment. Furthermore, using the add_drone,
add_landing_plattform and add_cuboids functions, a user can place a drone, a landing
platform and obstacles at specified positions in space. The Simulation class also
implements a function for generating a dataset, a function for calculating the shortest
path and a function for executing flights that are controlled by a machine learning
model. These functions will be discussed in the following chapters.
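To illustrate how these functions fit together, the following sketch sets up a small environment. The class and function names are the ones described above; the module path, signatures, return values and argument values are assumptions made for this example.

# Illustrative sketch only: module path, signatures and argument values are assumed.
from simulation import Simulation  # hypothetical import path

sim = Simulation(render=False)                   # DIRECT mode; render=True opens the GUI
sim.add_room()                                   # adds walls and ceilings to the environment
sim.add_drone(position=[0.0, 0.0, 0.0])
sim.add_landing_plattform(position=[2.0, 1.5, 0.0])
sim.add_cuboids(positions=[[1.0, 0.5, 0.0]],     # obstacles at specified positions
                dimensions=[[0.4, 0.4, 1.2]])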
3.3 Drone
The DJI Tello drone has been chosen for this project because it is purpose-built for
educational use and is shipped with a software development kit (SDK) which enables
a user to connect to the drone via a wifi UDP port, send control commands to the
drone and receive state and sensor information as well as a video stream from the
drone [35]. The flight control system of the DJI Tello drone can translate control
commands into rotor movements and thus enables the drone to execute the desired
movement. To autonomously navigate the drone, an autonomous system can send
control commands to the flight controller of the drone, and thus avoids the need to
directly interact with the rotors of the drone. This simplifies the task as the machine
learning model doesn’t need to learn how to fly. The construction of the autonomous
system and details of the machine learning model will be covered in Chapter 4.
The simulation program contains a model of the DJI Tello drone which we have built
in the Unified Robot Description Format (URDF). Figure 3.2 shows a rendering of the
URDF model and can be compared to an image of the physical drone in Figure 3.3.
The URDF model has been designed to replicate the dimensions of the drone as
closely as possible whilst keeping it simple enough for computational efficiency. The
hierarchical diagram depicted in Figure 3.4 illustrates the interconnections of the
URDF model’s links and joints. It includes a specific link to which the camera can
be attached. This link is placed at the front of the drone, just like the camera on
the real DJI Tello, and is highlighted in red in Figure 3.2. In the following, we will
discuss how images and sensor information are captured in simulation.
Figure 3.4: Hierarchical diagram of links (red) and joints (blue) of DJI Tello URDF
model
3.3.1 Camera
Tsai [36, p. 329] describes the extrinsic camera matrix as “the transformation from 3D object world coordinate
system to the camera 3D coordinate system centered at the optical center”. In order
to compute the extrinsic camera matrix, which is also called view matrix, we used
the function depicted in Listing A.1 in Appendix A.
The intrinsic camera matrix, on the other hand, is concerned with the properties
of the camera. It is used for “the transformation from 3D object coordinate in the
camera coordinate system to the computer image coordinate" [36, p. 329]. In other
words, it projects a three dimensional visual field onto a two dimensional image
which is why the intrinsic camera matrix is also referred to as projection matrix.
PyBullet provides the function computeProjectionMatrixFOV for computing the in-
trinsic camera matrix [33]. This function takes several parameters as inputs: Firstly,
it takes in the aspect ratio of the image which we have set to 4:3 as this corresponds
to the image dimensions of DJI Tello’s camera. Secondly, it takes in a near and a far
plane distance value. Anything that is closer than the near plane distance or farther
away than the far plane distance, is not displayed in the image. We have set these
values to 0.05 meters and 5 meters, respectively. Thirdly, it takes in the vertical field
of view (VFOV) of the camera. DJI, however, only advertises the diagonal field of
view (DFOV) of its cameras and we thus needed to translate it into VFOV. Using the
fundamentals of trigonometry we constructed the following equation to translate
DFOV into VFOV:
VFOV = 2 ∗ atan((h/d) ∗ tan(DFOV ∗ 0.5))    (3.1)

where h is the height of the image and d is the length of its diagonal (for the 4:3
aspect ratio used here, h/d = 3/5).
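As a minimal sketch of this conversion and of the subsequent call to PyBullet's computeProjectionMatrixFOV, using the aspect ratio and near/far plane values stated above:

import math
import pybullet as p

def vfov_from_dfov(dfov_deg, aspect=4/3):
    # Convert a diagonal field of view to a vertical one (Equation 3.1).
    # For an aspect ratio w:h, the diagonal is sqrt(w^2 + h^2), so h/d = 1/sqrt(aspect^2 + 1).
    h_over_d = 1.0 / math.sqrt(aspect**2 + 1.0)
    dfov = math.radians(dfov_deg)
    return math.degrees(2.0 * math.atan(h_over_d * math.tan(dfov / 2.0)))

# DJI advertises a DFOV of 82.6 degrees for the Tello camera.
vfov = vfov_from_dfov(82.6)

# Intrinsic (projection) matrix via PyBullet, using the near/far planes from the text.
projection_matrix = p.computeProjectionMatrixFOV(
    fov=vfov, aspect=4/3, nearVal=0.05, farVal=5.0)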
It is sufficient to calculate the intrinsic camera matrix only once at the start of the
simulation, since the properties of the camera don’t change. The extrinsic matrix, on
the other hand, needs to be calculated every time we capture an image, because the
drone might have changed its position and orientation.
Images can be captured at any time during the simulation by simply calling the
get_image function of the Drone class. This function calls the respective matrix
calculation functions of its Camera object and returns the image as a list of pixel
intensities.
3.3.2 Sensors
The DJI Tello drone is a non-GPS drone and features infra-red sensors and an altime-
ter in addition to its camera [35]. Its SDK provides several read commands which
can retrieve information captured by its sensors [35]. These include the drone’s cur-
rent speed, battery percentage, flight time, height relative to the take off position,
surrounding air temperature, attitude, angular acceleration and distance from take
off [35].
Out of these sensor readings, only the height relative to the take off position, the
distance from take off and the flight time are relevant for our simulation program.
We have implemented these commands in the Drone class. The get_height function
returns the vertical distance between the drone's current position and the take off
position in centimetres. The get_tof function returns the Euclidean distance between
the drone's current position and the take off position in centimetres. Instead of
measuring the flight time, we count the number of previously executed flight
commands when executing a flight in simulation.
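A minimal sketch of how these two read-outs can be computed from positions in the simulation, assuming positions are given in metres in world coordinates; the function names follow the text, while the implementation details are illustrative.

import math

def get_height(current_pos, takeoff_pos):
    # Vertical distance from the take off position, in centimetres.
    return (current_pos[2] - takeoff_pos[2]) * 100.0

def get_tof(current_pos, takeoff_pos):
    # Euclidean distance from the take off position, in centimetres.
    return math.dist(current_pos, takeoff_pos) * 100.0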
3.4 Movement
The physical DJI Tello drone can be controlled by sending SDK commands via a UDP
connection to the flight controller of the drone [35]. For controlling the movement
of the drone, DJI provides two different types of commands: control commands and
set commands [35]. Control commands direct the drone to fly to a given position
relative to its current position. For example, executing the command forward(100)
results in the drone flying one meter in the x-direction of the drone’s coordinate
system. In contrast, set commands attempt to set new sub-parameter values such as
the drone’s current speed [35]. Using set commands, a user can remotely control
the drone via four channels. The difference between these two control concepts
is the fact that control commands result in a non-interruptible movement towards a
specified target position, whereas the set commands enable the remote navigation of
the drone, resulting in a movement that can constantly be interrupted. We decided
on using control commands for the simulation program since these can be replicated
more accurately.
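For reference, a hedged sketch of how such control commands are sent to the physical drone over UDP; the address, port and command strings follow DJI's publicly documented Tello SDK and are included here purely for illustration.

import socket

TELLO_ADDRESS = ("192.168.10.1", 8889)   # documented Tello SDK command address/port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", 9000))                    # local port for receiving the drone's replies

def send(command):
    # Send an SDK command string and wait for the drone's 'ok'/'error' reply.
    sock.sendto(command.encode("utf-8"), TELLO_ADDRESS)
    response, _ = sock.recvfrom(1024)
    return response.decode("utf-8")

send("command")        # enter SDK mode
send("takeoff")
send("forward 100")    # non-interruptible control command: fly 100 cm forward
send("cw 10")          # rotate 10 degrees clockwise
send("land")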
The control commands have been implemented in two different variations. The first
variation is implemented such that only the position of the drone is updated without
simulating the velocity of the drone. This makes command executions quicker and
is thus more optimal for collecting data samples in simulation. The second variation
simulates the linear and angular velocity of the drone. This variation can be used
for visualising the movement and replicates how the real DJI Tello drone would fly
towards the target position. In the following, both variations will be discussed in
more detail.
3.4.1 Simple Implementation
Since control commands are non-interruptible and direct the drone to fly from its
current position to a relative target position, the simplest implementation is to
only update the drone’s position without simulating its movement. In other words,
if the drone is advised to fly 20cm forward, we calculate the position to which the
drone should fly and set the position of the drone accordingly. This causes the drone
to disappear from its current position and reappear at the target position. The DJI
Tello SDK provides both, linear movement commands (takeoff, land, up, down, for-
ward, backward, left, right) as well as angular movement commands (cw, ccw) [35].
Executing the cw command causes the drone to rotate clock-wise and executing the
ccw command causes the drone to rotate counter-clockwise by a given degree. In
order to simulate these two commands, it is necessary to update the drone’s orien-
tation instead of its position.
The above mentioned logic has been implemented in the simulation program in the
following way: Each control command has its own function in the Drone class. When
calling a linear control command, the user needs to provide an integer value corre-
sponding to the anticipated distance in cm that the drone should fly in the desired
direction. In a next step, it necessary to check whether the flight path towards the
target position is blocked by an obstacle. This is done by the path_clear function
of the Drone class which uses PyBullet’s rayTestBatch function in order to test the
flight path for collision properties of other objects in the simulation environment.
If the flight path is found to be obstructed, the control command function returns
False, implying that the drone would have crashed when executing the control com-
mand. If the flight path is clear, the function set_position_relative of the Drone
class is called. This function translates the target position from the drone’s coordi-
nate system into the world coordinate system and updates the position of the drone
accordingly.
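A hedged sketch of such a linear control command, using PyBullet calls that correspond to the steps above. The actual implementation uses rayTestBatch and the Drone class's helper functions; collision filtering and other details are simplified here.

import numpy as np
import pybullet as p

def forward(drone_id, distance_cm):
    # Simple (teleporting) forward command: check the path, then update the position.
    pos, orn = p.getBasePositionAndOrientation(drone_id)
    rot = np.array(p.getMatrixFromQuaternion(orn)).reshape(3, 3)
    # Translate the drone-frame displacement (x axis = forward) into world coordinates.
    target = np.array(pos) + rot @ np.array([distance_cm / 100.0, 0.0, 0.0])
    # Single ray from the current position to the target position; the thesis
    # implementation uses rayTestBatch instead of a single ray.
    object_id = p.rayTest(list(pos), target.tolist())[0][0]
    if object_id not in (-1, drone_id):
        return False            # flight path obstructed: the command is rejected
    p.resetBasePositionAndOrientation(drone_id, target.tolist(), orn)
    return True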
The execution of angular control commands works in a similar fashion. The func-
tions take in an integer corresponding to the degree by which the drone should
rotate in the desired direction. There is no need for collision checking, since the
drone doesn’t change its position when executing a rotation command. We can thus
directly call the set_orientation_relative function which translates the target ori-
entation into the world coordinate system and updates the simulation.
The simple control command implementation is used for this project since it is less
computationally expensive than simulating the actual movement and thus more suit-
able for the analysis conducted in the following chapters. We nevertheless also im-
plemented a second variation of the flight commands which simulates the linear and
angular velocity of the drone. This implementation can be used for future projects
that want to use the simulator for other purposes.
3.4.2 Velocity Implementation
When creating a drone object, a user has the option to specify whether the velocity
of the drone should be simulated. In this case, the control commands execute a
different set of private functions which are capable of simulating the linear and
angular velocity of the drone. For clarification, these functions do not simulate the
physical forces that are acting on the drone but rather the flight path and speed of
the drone. It is assumed that the linear and angular velocity of the drone at receipt of
a control command are both equal to zero, since control commands on the real DJI
Tello drone can only be executed when the drone is hovering in the air. Let’s again
discuss the implementation of the linear and angular control commands separately
beginning with the linear control commands.
If one of the linear control commands is called, the private function fly is executed
which takes the target position in terms of the drone’s coordinate system as input.
This function transforms the target position into the world coordinate system and
calculates the Euclidean distance from the target position as well as the distance
required to accelerate the drone to its maximum velocity. With these parameters,
a callback (called partial in Python) to the simulate_linear_velocity function is
created and set as a variable of the drone object. Once this has been done, everything
is prepared to simulate the linear velocity of the drone that results from the linear
control command. If the simulation proceeds to the next time step, the callback
function is executed which updates the linear velocity of the drone. When calculating
the velocity of a movement, it is assumed that the drone accelerates with constant
acceleration until the maximum velocity of the drone is reached. Deceleration at
the end of the movement is similar. If the flight path is too short to reach
the maximum velocity, the move profile follows a triangular shape as depicted in
Figure 3.5. If the flight path is long enough in order to reach the maximum velocity,
the velocity stays constant until the deceleration begins, resulting in a trapezoidal
move profile (see Figure 3.6).
Figure 3.5: Triangular move profile
Figure 3.6: Trapezoidal move profile
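A minimal sketch of such a move profile, assuming a constant acceleration a and a maximum speed v_max; the function name and parameters are illustrative.

def velocity_at(t, distance, v_max, a):
    # Speed along a move of the given length at time t, assuming constant acceleration
    # a up to v_max and a symmetric deceleration (triangular or trapezoidal profile).
    d_accel = v_max**2 / (2 * a)            # distance needed to reach v_max
    if 2 * d_accel >= distance:             # triangular profile: v_max is never reached
        t_half = (distance / a) ** 0.5
        return a * t if t <= t_half else max(0.0, a * (2 * t_half - t))
    t_accel = v_max / a
    t_cruise = (distance - 2 * d_accel) / v_max
    if t <= t_accel:
        return a * t                        # acceleration phase
    if t <= t_accel + t_cruise:
        return v_max                        # constant-velocity phase
    return max(0.0, v_max - a * (t - t_accel - t_cruise))   # deceleration phase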
A very similar approach has been chosen for the simulation of angular control com-
mands. When the cw or ccw commands are called, the rotation function is ex-
ecuted which takes the target orientation in terms of the drone’s coordinate sys-
tem as input. This function transforms the target orientation into the world coor-
dinate system and calculates the total displacement in radians as well as the dis-
placement required to accelerate the drone. With this information, a callback to the
simulate_angular_velocity function is created and set as an object variable. When
stepping through the simulation, the callback function is called which updates the
angular velocity of the drone. The move profile of the angular velocity also follows
a triangular or trapezoidal shape.
3.5 Environment
In addition to the drone and the landing platform, obstacles can be placed in the
simulation environment. The obstacles are modelled as cuboids and one can specify
their position, orientation, width, length, height and color. Obstructed simulation
environments can be generated manually, by adding the obstacles one by one, or
automatically, in which case the dimensions, positions and orientations are selected
using a pseudo-random uniform distribution. A sample simulation environment is
depicted in Figure 3.8.
3.6 Summary
This chapter gave an overview of the simulation program that has specifically been
developed for this project. It can be used for simulating the control commands and
subsequent flight behaviour of the DJI Tello drone. The drone is represented by
a URDF model in simulation which replicates the dimensions of the real DJI Tello
drone. Furthermore, the drone’s camera set-up as well as a selection of sensors has
been added for capturing images and reading sensor information during flight. In
order to simulate the execution of flight commands, we have introduced a simple
and computationally less expensive implementation as well as an implementation
in which linear and angular velocities are simulated. The simulation environment
consists of a room to which a drone, a landing platform and obstacles can be added.
The program is used for the analysis conducted in the following chapters.
Chapter 4
Machine Learning Model
This chapter gives an overview of the machine learning model that has been devel-
oped for this project. The task which this model attempts to solve is to autonomously
navigate a drone towards a visually marked landing platform.
We used a Supervised Learning approach in order to solve this task. Supervised
Learning requires human supervision which comes in the form of a labelled dataset
that is used to train the machine learning model [37]. The labels of the dataset
constitute the desired outcome that should be predicted by the model. There are
several machine learning systems which can be applied for Supervised Learning such
as Decision Trees, Support Vector Machines and Neural Networks. For this project
a CNN is used as this is the state-of-the-art machine learning system for image
classification [37].
This chapter is divided into three main parts. The first part describes the data collec-
tion procedure which is used to generate labelled datasets. The second part identifies
the optimal camera set-up for solving the task, and the third part discusses the ar-
chitectural design of the CNN as well as the optimisation techniques that have been
applied in order to find this design. The parts are logically structured and build
on top of each other. Each part solves individual challenges, but it is the combination
of all three parts which can be regarded as a coherent approach of solving the task
at hand.
4.1 Dataset
A dataset that is suitable for this project needs to fulfil very specific requirements.
Each data sample needs to include multiple features and one label. The features
correspond to the state of the drone, which is specified by the visual input of the drone’s
camera and the sensor information captured by its sensors. The label corresponds
to the desired flight command which the drone should execute from this state in
order to fly towards the landing platform. As a labelled dataset which fulfils these
requirements does not already exist, it was necessary to generate one in simulation.
When creating a new algorithm, a good practice is to begin with a very simple ver-
sion and expand its complexity depending on the challenges encountered. A
naive data collection procedure is depicted in Algorithm 1 and works as follows:
At first, the algorithm places the drone and the landing platform at random positions
in the simulation environment. The position of the drone can be any position in the
three dimensional simulation space and the landing platform is placed somewhere
on the ground. The algorithm then captures an image using the drone’s camera and
requests sensor information from the drone’s sensors. In a next step, the desired
flight command, which should be executed by the drone in order to fly towards the
landing platform, is computed. The desired flight command can be derived from the
optimal flight path which can be calculated using a breadth first search (BFS) algo-
rithm since both, the position of the drone and the position of the landing platform,
are known in simulation. The workings of the BFS algorithm will be discussed later
in Section 4.1.2. Finally, the image and sensor information, which make up the fea-
tures of the data sample, are stored together with the desired flight command, which
corresponds to the label of the data sample. Repeating this procedure generates a
dataset.
Figure 4.1: Histogram of 1 000 data sample labels produced by the naive data collection
procedure in the non-obstructed simulation environment
A downside of this procedure is, however, that it produces a very unbalanced dataset.
As depicted in Figure 4.1, the desired flight command is almost always to rotate
clockwise (cw) or counterclockwise (ccw). This makes sense, because we allow the
drone to only execute one of the following five commands: takeoff, land, fly forward
20cm, rotate cw 10° and rotate ccw 10°. Horizontal movement is therefore limited to
only flying forward because this is where the camera of the drone is facing. In order
to fly left, the drone first needs to rotate ccw multiple times and then fly forward.
Since the naive data collection procedure places the drone at a random position and
with a random orientation, it is most likely not facing towards the landing platform.
The desired flight command is thus in most cases to rotate towards the landing
platform which results in an unbalanced dataset.
In order to overcome this unbalanced data problem, a more sophisticated data col-
lection procedure has been developed for which the pseudo-code is shown in Al-
gorithm 2. This algorithm begins with placing both, the drone and the landing
platform, at random positions on the ground of the simulation environment using
a uniform distribution. Notice that the drone is now placed on the ground which
is different to the naive algorithm that places it somewhere in space. It then calcu-
lates the optimal flight path using a BFS algorithm. This path includes a series of
commands which the drone needs to execute in order to fly from its current position
to the landing platform. Beginning with the drone’s starting position, the algorithm
generates a data sample by capturing and storing the image and sensor information
as well as the command that should be executed next. It then advises the drone to
execute this command in simulation which consequently repositions the drone. Us-
ing the new state of the drone we can again capture camera and sensor information
and hereby generate a new data sample. These steps are repeated until the drone has
reached the landing platform. Once the drone has landed, both the drone and the
landing platform are repositioned and a new shortest path is calculated. Repeating
this procedure for a number of flights generates the dataset.
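A hedged sketch of this loop is given below. The method names place_drone_on_ground_randomly, place_landing_platform_randomly, shortest_path and execute are placeholders for the corresponding functionality described in the text; get_image, get_height and get_tof are the functions introduced in Chapter 3.

def collect_flight(sim, drone, dataset):
    # One iteration of the sophisticated data collection procedure (Algorithm 2).
    sim.place_drone_on_ground_randomly()          # drone starts on the ground
    sim.place_landing_platform_randomly()
    commands = sim.shortest_path()                # optimal command sequence via BFS
    for command in commands:
        image = drone.get_image()
        sensors = (drone.get_height(), drone.get_tof())
        dataset.append((image, sensors, command)) # features plus label
        drone.execute(command)                    # repositions the drone in simulation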
As shown in Figure 4.2, the dataset is now significantly more balanced compared to
the one produced by the naive data collection procedure. It includes sample labels
of all five flight commands and its label distribution reflects how often the com-
mands are executed during a typical flight in the simulated environment. Since the
dataset on which Figure 4.2 is based includes 54 flights, we can deduce that the
average flight path includes one takeoff command, nine rotational commands (cw
or ccw), seven forward commands and one land command. Using the sophisticated
data collection procedure it is thus possible to generate a dataset which is balanced
according to the number of times a drone usually executes the commands during
flight. As the dataset can be of arbitrary size it is also possible to adapt the label
distribution such that every label is represented equally in the dataset. We did, how-
ever, not apply this technique and instead decided on using class weights for the loss
function of the neural network. This is discussed further in Section 4.3.1.
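A hedged sketch of how such class weights might be derived from the label distribution and passed to the training routine; the actual procedure is described in Section 4.3.1, and the use of scikit-learn and Keras here is an assumption.

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)   # y_train holds integer-encoded command labels
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))
# Passed to Keras during training, e.g.:
# model.fit(x_train, y_train_onehot, class_weight=class_weight, ...)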
Figure 4.2: Histogram of 1 000 data sample labels produced by the sophisticated data
collection procedure in the non-obstructed simulation environment
The above mentioned data collection algorithms require information about the opti-
mal flight path. The optimal flight path is the one that requires the drone to execute
the least number of flight commands in order to fly from its current position to the
landing platform. Since both, the position of the drone and the position of the land-
ing platform, are known in simulation, it is possible to compute the optimal flight
path.
In order to find the optimal flight path it was necessary to develop an algorithm
that is specifically targeted at our use case. The general idea behind this algorithm
is to construct a graph that consists of nodes, representing the positions that the
drone can reach, and links, representing the flight commands that the drone can
execute. Using a BFS algorithm we can then compute the shortest path between
any two nodes of the graph. Since the shortest path represents the links that need
to be traversed in order to get from the start node to the target node, we can thus
derive the series of commands that the drone needs to execute in order to fly from
the position represented by the start node to the position represented by the target
node. This path is also the optimal path as it includes the least number of flight
commands.
Let’s translate this general idea into our specific use case. We can begin with the
construction of the graph by adding the drone’s current pose as the central node. The
pose of an object is the combination of its position and orientation. It is necessary to
use the drone’s pose and not simply its position because the drone can rotate along
its z-axis. From this starting pose, the drone can reach a new pose by executing one
of its five flight commands. We can therefore add five new nodes that represent the
five poses that can be reached from the drone’s current pose and connect them using
links which have the label of the respective flight command. From each of the new
poses, the drone can again execute five new flight commands which adds new poses
to the graph. Repeating this procedure, a graph is constructed which includes all the
poses that the drone can reach from its current pose. Since we are searching for the
shortest path between the drone and the landing platform, we can stop adding new
poses to the graph once a pose has been found which is within a certain threshold
of the position of the landing platform. Note that for the breaking condition only
the position of the drone and not its orientation is being considered, as it doesn’t
matter in which direction the drone is facing when it lands on the platform. When
we add the nodes to the graph in a BFS fashion we can ensure that we have found
the shortest path, once a pose is within the landing area.
One problem with this approach is that the graph grows far too quickly, result-
ing in too many poses. The upper bound time complexity of a BFS algorithm is O(b^d),
with branching factor b and search tree depth d [38]. Since a drone can execute five
different commands from every pose, the branching factor, meaning the number of
outgoing links of every node, is equal to five. The depth on the other hand depends
on the distance between the drone’s current pose and the landing platform. If, for
example, the landing platform is at minimum 20 flight commands away from the
drone, it would require up to 5^20 iterations to find the shortest path using this algo-
rithm. This is too computationally expensive and it was thus necessary to optimise
the BFS algorithm.
Algorithm 3 shows the pseudo-code of the optimised BFS algorithm. Three main
optimisation features have been implemented in order to speed up the algorithm.
• We make use of efficient data structures. A FIFO queue is used to store the
nodes that need to be visited next. Both queue operations, enqueue and dequeue
have a time complexity of O(1) [21]. A dictionary is used to store the nodes
that have already been visited. Since the dictionary is implemented using a
hash table, we can check whether a node is already included in the dictionary
(Line 20) with time complexity O(1) [21]. A list is used to keep track of the
path, to which a new flight command can be appended with time complexity
O(1) [21].
• The most drastic optimisation feature is the partitioning of space. The general
idea behind this feature is that it isn’t necessary to follow every branch of the
BFS tree, as most of these branches will lead to similar poses. For example, if
we find a new pose at depth ten of the BFS tree and have already found a
very similar pose at a smaller depth of the BFS tree, it is almost certain that the
new pose won’t help us find the shortest path and can thus be ignored. Our al-
gorithm implements this logic by subdividing the simulation environment into
cubes of equal sizes. Each position in the simulation environment is therefore
part of one cube. And since we also need the orientation in order to identify
the pose of the drone, every cube contains a limited number of yaw angles.
Note that only the yaw angle of the drone’s orientation is considered as it can
only rotate along its z-axis. Every drone pose that is discovered during the
search can thus be allocated to one cube and one (rounded) yaw angle. The
combination of cube and yaw angle needs to be stored in the visited dic-
tionary and if an entry of this combination is already included, we know that
a similar pose has already been discovered. We also know that the previously
discovered pose is fewer links away from the starting node and is therefore a
better candidate for the shortest path. We can thus ignore a pose if its cube
and yaw have already been visited. This significantly reduces the time com-
plexity of the algorithm with an upper bound of O(c ∗ y) where c is the number
of cubes and y is the number of yaw angles. Reducing the length of the cube
edges (which increases the number of cubes that fill up the environment) re-
sults in a worse upper bound time complexity but also increases the chances of
finding the true shortest path. Reducing the step size between the yaw angles
(which increases the number of yaw angles) has the same effect. On the other
hand, if the cubes are too large or there are too few yaw angles, the shortest
path might not be found. Since the forward command makes the drone fly
20 cm in the horizontal direction and the cw/ccw rotation commands rotate the
drone by 10°, the cube edge length has been set to 20/√2 cm (≈ 14.1 cm) and
the yaw angle step size has been set to 10°.
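A hedged sketch of the resulting search, using the cube edge length and yaw step chosen above; successor generation and the obstacle check are left abstract, and the helper names are illustrative.

from collections import deque
import math

CUBE_EDGE = 0.20 / math.sqrt(2)    # metres (≈ 14.1 cm, as chosen above)
YAW_STEP = 10                      # degrees

def cell_key(pose):
    # Map a continuous pose (x, y, z, yaw in degrees) to a (cube, yaw bucket) key.
    x, y, z, yaw = pose
    return (int(x // CUBE_EDGE), int(y // CUBE_EDGE), int(z // CUBE_EDGE),
            int(round(yaw / YAW_STEP)) % (360 // YAW_STEP))

def shortest_path(start_pose, reached_goal, successors):
    # BFS over poses: reached_goal(pose) tests whether the pose lies above the
    # landing platform; successors(pose) yields (command, next_pose) pairs for
    # the available flight commands. Obstacle checks are omitted in this sketch.
    queue = deque([(start_pose, [])])          # FIFO queue of (pose, commands so far)
    visited = {cell_key(start_pose)}           # poses already seen, keyed by (cube, yaw)
    while queue:
        pose, path = queue.popleft()
        if reached_goal(pose):
            return path + ["land"]
        for command, nxt in successors(pose):
            key = cell_key(nxt)
            if key not in visited:             # a similar pose has not been seen before
                visited.add(key)
                queue.append((nxt, path + [command]))
    return None                                # no path found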
The upper bound time complexity of the optimised BFS algorithm, when applied in
the simulation environment described in Section 3.5, is therefore:

O( (w ∗ d ∗ h) / c³ ∗ (360° / y) )    (4.1)

where w, d and h are the width, depth and height of the simulation environment, c is
the edge length of the cubes and y is the yaw angle step size.
Inserting the appropriate numeric values into Equation 4.1 reveals that in the worst
case, 135 252 iterations need to be performed to calculate the optimal flight path in
our simulation environment. This is over 25 orders of magnitude lower than if we
were to use a non-optimised BFS algorithm.
4.2 Optimal Camera Set-Up
The DJI Tello drone is shipped with a forward-facing camera which is tilted slightly
downward at a 10° angle and has a DFOV of 82.6°. We hypothesised that with
this camera set-up, (i) the drone would struggle to land vertically on the landing
platform (ii) because the landing platform goes out of sight once the drone gets too close.
Our analysis showed that the original camera set-up of the DJI Tello drone is sub-
optimal for our task at hand. We found a fisheye set-up with a DFOV of 150° to be
optimal.
In the following, we first discuss how we tested our hypothesis and then lay out
alternative camera set-ups.
In order to test our hypothesis, we generated a dataset with the original DJI Tello
camera set-up, used this dataset to train a CNN and evaluated its performance.
The dataset includes 10 000 data samples which have been collected in a non-
obstructed simulation environment using the previously described data collection
procedure. Each data sample is made up of a 280 x 210 image that has been taken
from a forward facing camera (tilted 10° downward with 82.6° DFOV) and a label
which corresponds to the desired flight command that should be executed. There
is no need for capturing sensor information as we are only focusing on the camera
set-up. The dataset has been partitioned into a training and a test set using a 70/30
split. There is no need for a validation set as we don’t require hyperparameter
tuning in order to test our hypothesis.
As illustrated in Figure 4.3, the CNN architecture is made up of an input layer taking
in an RGB image, two convolutional layers, two max-pooling layers, one fully
connected layer and an output layer with five neurons which correspond to the five
commands that the drone can execute. It uses ReLU activation functions, categorical
cross entropy loss and Adam optimisation.
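A hedged Keras sketch of this architecture is given below. Filter counts, kernel sizes and the width of the fully connected layer are not stated in the text and are therefore assumptions; only the overall layer structure, loss, optimiser and training settings follow the description.

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(210, 280, 3)),              # 280 x 210 RGB image
    layers.Conv2D(16, (3, 3), activation="relu"),     # filter counts/kernel sizes assumed
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),              # fully connected layer (width assumed)
    layers.Dense(5, activation="softmax"),            # five flight commands
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# Training set-up from the text:
# model.fit(x_train, y_train, epochs=20, batch_size=32)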
The model (hereafter referred to as “front camera model") has been trained over 20
epochs with a batch size of 32 and achieves a prediction accuracy of 67.6% on the
test set. The confusion matrix plotted in Figure 4.4 reveals that the performance
of the front camera model significantly varies across the desired flight commands.
On the one hand, it manages to correctly predict the takeoff, forward and rotation
commands in most cases. On the other hand, it misclassifies 63.6% of data samples
that are labelled as “land". This supports the first part of our hypothesis, because it
clearly shows that the front camera model struggles with landing the drone.
Figure 4.4: Confusion matrix front camera model (67.6% accuracy)
In order to test the second part of our hypothesis, we developed a second model
(hereafter referred to as “bottom camera model"). This model uses the same CNN
architecture and is trained for the same number of epochs, but the images in the
dataset have been taken using a downward facing camera (tilted 90° downward
with 82.6° DFOV). In fact, this dataset was created using the same flight paths as the
previous dataset which means that the images were taken from the same positions
and the label distribution is equivalent in both datasets. The only difference between
these two datasets is the camera angle at which the images were taken. The bottom
camera model achieves a 60.2% prediction accuracy on the test set which is worse
than the front camera model. But as we can see in the confusion matrix depicted in
Figure 4.5, the bottom camera model correctly predicts 83.8% of the land commands
in the test set. This performance variation makes sense, because with a downward
facing camera, the drone can only see the landing platform when it is hovering above
it. The bottom camera model therefore knows when to land, but doesn’t know how
to fly to the landing platform. This supports the second part of our hypothesis, since
we have shown that the bad landing performance of the front camera model is due
to the fact that the landing platform is out of sight once the drone gets too close to
it.
Figure 4.5: Confusion matrix bottom camera model
Since neither the front camera model nor the bottom camera model is capable of
predicting the comprehensive set of flight commands, it was necessary to come up
with a new camera set-up. Three different set-ups have been tested and are compared
against each other in the following.
Figure 4.6: Confusion matrix diagonal camera model (83.3% accuracy)
Since the two previously discussed models had opposing strengths and only differed
in the camera angle, one might expect that a camera angle between 10° and 90°
would combine the strengths of both models. We have therefore
created a dataset using a camera angle of 45° (and 82.6° DFOV) and trained a new
model (hereafter referred to as “diagonal camera model"). Evaluating its perfor-
mance on the test set revealed a prediction accuracy of 83.3% which is significantly
better than the performance of the two previous models. The confusion matrix in
Figure 4.6 illustrates that the performance improvement results from the fact that
the diagonal model is capable of both landing and flying towards the landing plat-
form.
Another attempt at combining the strengths of the front and bottom camera models
was to join the two images into one. On the real drone, such an image could be
created by attaching a mirror in front of the camera. This mirror splits the view of
the camera in half such that the lower half shows the forward view and the upper
half shows the downward view. We have again generated a dataset using this camera
set-up, trained a new model (hereafter referred to as “half front and half bottom
model”) which is based on the same CNN architecture as the previous models, and
evaluated its performance on the test set. It achieved a prediction accuracy of 80.8%
which is slightly worse than the performance of the diagonal model. Figure 4.7
illustrates that the model is also capable of predicting all flight commands with high
accuracy.
                      Predicted label
True label    takeoff   land   forward     cw    ccw
takeoff           173      0         0      0      2
land                0    157         9      3      4
forward             0      4      1019     67    114
cw                  2      1        90    482    115
ccw                 0      6        95     64    593
Figure 4.7: Confusion matrix half front and half bottom camera model (80.8% accuracy)
The last camera set-up that was tested uses a forward facing camera with a fisheye
lens. Fisheye lenses are intended to create panoramic images with an ultra-wide
DFOV. We have generated a dataset using a forward facing camera (tilted 10° down-
ward) with a DFOV of 150°. The model that has been trained on this dataset (here-
after referred to as “fisheye front camera model”) achieves a prediction accuracy of
87.9% which is higher than the accuracy of all other models. The confusion matrix
in Figure 4.8 illustrates that the model is capable of predicting all five flight com-
mands with high accuracy, just like the diagonal model and the half front and half
bottom model, but outperforms both in regards to the prediction of the forward and
rotational commands.
After having evaluated several camera set-ups, it was necessary to make an educated
decision on which set-up to choose. For this decision, the front camera model and
the bottom camera model were not considered as we have already seen that these
can’t predict the comprehensive set of flight commands. As depicted in Table 4.1, the
fisheye front camera model achieved the highest accuracy on the test set. But since
accuracy only captures the overall performance, we needed to analyse the different
types of misclassification in order to get a better picture of the model’s performance.
In regards to our task at hand, not all misclassifications are equally severe. For ex-
ample, under the assumption that a flight is terminated after landing, misclassifying
a cw command as a land command will most likely result in an unsuccessful flight
because the drone landed outside the landing platform. In contrast, misclassifying
                      Predicted label
True label    takeoff   land   forward     cw    ccw
takeoff           175      0         0      0      0
land                0    157        10      4      2
forward             0      4      1066     45     89
cw                  0      3        53    586     48
ccw                 0      5        64     36    653
Figure 4.8: Confusion matrix fisheye front camera model (87.9% accuracy)
a cw command as ccw might increase the flight time by a few seconds, but can still
lead to a successful landing.
Let’s have a look at the different types of misclassification, beginning with the take-
off command. Misclassifying the takeoff command is severe, because in that case
the drone wouldn’t be able to start the flight. This type of misclassification is, how-
ever, very rare across all models. In regards to the land command, both Recall and
Precision are important metrics to consider. If a flight command is falsely classified
as a land command, the drone will most likely land outside the landing platform.
In contrast, if the ground truth label is land, but the model predicts another flight
command, this might cause the drone to fly away from the platform. Recall and Pre-
cision metrics for the land command are all at a very similar level across the three
adjusted camera set-ups. Regarding the forward command, Recall is more important
than Precision. This is due to the fact that a falsely predicted forward command can
result in the drone crashing into an obstacle which is the worst possible outcome.
When comparing the Recall of the forward command amongst the three adjusted
camera set-ups, we find that the fisheye camera model has a forward command Re-
call of 89.35% which is roughly two percentage points higher than the metric of
the diagonal camera model and more than five percentage points higher than the
metric of the half front and half bottom model. Lastly, misclassifying a rotational
command is less severe, as this will most likely only prolong the flight. Neverthe-
less, the fisheye camera set-up also outperforms the other models in regards to the
rotational commands. Due to its simplicity and all the above mentioned factors, the
fisheye camera set-up has been chosen for the further course of this project.
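The Recall and Precision values used in this comparison can be read directly off a confusion matrix. The following sketch illustrates the computation with NumPy; the matrix entries are small placeholder values for illustration, not the actual test-set results reported above.

import numpy as np

# illustrative confusion matrix: rows = true labels, columns = predicted labels
# class order: takeoff, land, forward, cw, ccw (placeholder counts)
cm = np.array([
    [20,  0,  0,  0,  0],
    [ 0, 15,  3,  1,  1],
    [ 0,  1, 90,  5,  4],
    [ 0,  1,  6, 60,  3],
    [ 0,  1,  5,  4, 70],
])

classes = ["takeoff", "land", "forward", "cw", "ccw"]
for i, name in enumerate(classes):
    true_positives = cm[i, i]
    recall = true_positives / cm[i, :].sum()     # share of class i samples that were predicted correctly
    precision = true_positives / cm[:, i].sum()  # share of class i predictions that were correct
    print(f"{name:8s} recall={recall:.1%} precision={precision:.1%}")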
4.3 Neural Network Architecture
Now that we have determined a suitable data collection procedure and camera set-
up, we can construct our neural network. The task which this neural network at-
tempts to solve is to predict the optimal flight command which the drone should
execute next. We are therefore faced with a multi-class classification problem be-
cause the neural network needs to predict one out of five potential flight commands.
The inputs to the neural network are threefold: an image taken with the drone’s
camera, sensor information captured by the drone’s sensors and the previously exe-
cuted flight commands. Based on these input and output requirements we developed
an optimal neural network architecture, which is depicted in Figure 4.9.
\[
\text{macro F1-score} = \frac{1}{n} \sum_{x} \frac{2 P_x R_x}{P_x + R_x} \tag{4.2}
\]
where:
n = number of classes
P = precision
R = recall
We chose the macro F1-score because our dataset is slightly imbalanced and because
we regard each of the five flight commands as equally important. If we used accuracy
instead, the overrepresented classes in the dataset would have a higher impact on
the performance measure than the underrepresented ones. Since the macro F1-score
gives each class the same weight it can better capture the true performance of the
model and is thus a better metric than accuracy.
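As a minimal illustration of this point (the labels and counts below are made up and only mimic an imbalanced distribution), accuracy can look good even when a rare class is never predicted, whereas the macro F1-score exposes the failure:

from sklearn.metrics import accuracy_score, f1_score

# imbalanced toy example: "forward" dominates, "land" is rare
y_true = ["forward"] * 90 + ["land"] * 10
y_pred = ["forward"] * 100   # a model that never predicts "land"

print(accuracy_score(y_true, y_pred))                              # 0.9  -> looks acceptable
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.47 -> reveals the ignored class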
Furthermore, latency has been chosen as a second optimisation metric, because the
model needs to make predictions in real time. These predictions thus need to be
quick to make the drone control more efficient. One can reduce the prediction la-
tency of a model by using compact architectures with fewer parameters,
low resolution input images and low-precision number representations.
Optimising a CNN architecture is a complex task as there is a large number of param-
eters which can be tuned. In order to find an optimal neural network architecture,
we used the following strategy: At first, we laid out the foundations, including the
dataset as well as the structure of the input and output layers. We then compared
basic CNN architectures against each other and chose the best performing one. In a
next step, we tested whether we can further reduce the complexity of the network
by using lower resolution input images and low-precision number representations
without sacrificing performance. Thereafter, we introduced techniques to avoid over-
fitting. And in a final optimisation step, we performed hyper parameter tuning in
order to further increase the macro F1-score. The following subsections describe the
optimisation procedure in greater detail.
4.3.1 Foundations
Before we can compare different architectures, we first need to clarify the dataset,
the structure of the input layers and the structure of the output layer.
Dataset
The dataset that we used for optimising the model includes 100 000 data sam-
ples which have been collected using the previously described data collection pro-
cedure over a total of 5 125 flights. For each of these flights, the drone, the landing
platform and a number of obstacles have been randomly placed in the simulation
environment. The number of obstacles (between 0 and 10) and the edge lengths
of the obstacles (between 0cm and 200cm) have been randomly chosen using a
pseudo-random uniform distribution. Each flight was thus conducted in a different
simulation environment. The dataset has been partitioned into a train, validation
and test set using a 70/15/15 split, while samples captured during the same flight
were required to be in the same partition to avoid similarities across the partitions.
The label distributions for each of these datasets can be found in Appendix B.
Input Layers
As mentioned earlier, the model should take in an image, sensor information and the
previously executed flight commands. Let’s discuss each of these inputs so we get a
better understanding of them.
The image input layer takes in a gray scale image. We are using gray scale images
because we want to keep the model as simple as possible and because we aren’t
using any colors in the simulation environment. Pixel values are normalized in a
preprocessing step to a range between 0 and 1 in order to make them more suitable
for the model, because high input values can disrupt the learning process. Since
we have already determined that the fish-eye set-up is optimal for our purpose, the
images in the dataset were taken using a forward facing camera with a DFOV of
150° tilted 10° downward. We started our optimisation procedure using an image
dimension of 280 x 210 pixels.
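A minimal preprocessing sketch for this input is shown below; the file path and helper name are hypothetical, and the actual pipeline reads the images directly from the simulation program.

import numpy as np
from PIL import Image

def preprocess_image(path, size=(280, 210)):
    """Load an image, convert it to grayscale, resize it and normalise pixel values to [0, 1]."""
    img = Image.open(path).convert("L")              # "L" = 8-bit grayscale
    img = img.resize(size)                           # PIL expects (width, height)
    arr = np.asarray(img, dtype=np.float32) / 255.0  # scale 0..255 down to 0..1
    return arr[..., np.newaxis]                      # add a channel axis -> shape (210, 280, 1)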
The sensor input layer takes in three floating point values: the height relative to the
takeoff position in meters, the euclidean distance relative to the takeoff position in
meters and the number of flight commands executed since takeoff.
The third input layer takes in the previously executed flight commands. The intuition
behind this layer is to add a sort of memory to the neural network such that it can
consider historic data when making its prediction. Flight commands are categorical
values which can’t be interpreted directly by the neural network. Simply translating
the flight commands to integer values, like we are doing for the labels of the dataset,
also won’t work because it doesn’t make sense to rank the flight commands. Instead,
we use one-hot encoding to translate a flight command into five boolean values.
For example, if the previous flight command was the “forward” command, then the
boolean value for “forward” is set to 1 and all other boolean values, which represent
the remaining four flight commands, are set to 0. We could apply the same logic not
only to the flight command that was executed last, but also the second-last, third-last
and so forth. For each additional flight command, we need to add five additional
boolean values to the layer. This layout ensures that the neural network can make
sense of the previously executed flight commands. We started our optimisation pro-
cedure using the two previously executed flight commands, which translate into an
input layer of ten boolean values.
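A sketch of this encoding for the two previously executed flight commands (the command ordering used here is an assumption for illustration):

import numpy as np

COMMANDS = ["takeoff", "land", "forward", "cw", "ccw"]   # assumed ordering

def encode_previous_commands(history, n_previous=2):
    """One-hot encode the last n_previous flight commands into a flat vector of n_previous * 5 values."""
    encoding = np.zeros(n_previous * len(COMMANDS), dtype=np.float32)
    recent = history[-n_previous:]   # most recent commands; shorter at the start of a flight
    for slot, command in enumerate(recent):
        encoding[slot * len(COMMANDS) + COMMANDS.index(command)] = 1.0
    return encoding

# e.g. the drone took off and then flew forward
print(encode_previous_commands(["takeoff", "forward"]))
# [1. 0. 0. 0. 0. 0. 0. 1. 0. 0.]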
Output Layer
Since we are dealing with a multi-class classification problem, the final layer is made
up of five neurons (one for each class) and uses the Softmax activation function.
Softmax transforms real values into probabilities in a range [0,1] which sum up
to 1. The model thus outputs a probability for each class and using the argmax
function we can determine the index of the class with the highest probability. This
class corresponds to the flight command which the model has predicted.
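As a small illustration of this final step (the probability values are made up):

import numpy as np

COMMANDS = ["takeoff", "land", "forward", "cw", "ccw"]

probabilities = np.array([0.01, 0.02, 0.83, 0.09, 0.05])   # Softmax output, sums to 1
predicted_command = COMMANDS[int(np.argmax(probabilities))]
print(predicted_command)   # -> forward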
The default loss function for a multi-class classification problem is categorical cross-
entropy loss. This loss function evaluates how well the probability distribution pro-
duced by the model matches the one-hot encoded label of the data sample. Since the
label distribution of our dataset reflects how often each command is on average executed
during a flight, the classes are imbalanced and it makes sense to add class weights to
the loss function. These class weights need to be higher for underrepresented classes
(i.e. “takeoff” and “land”) such that the loss function penalises the model more if one of these classes is mis-
classified. Table 4.2 shows the weight for each class that is used for the cross-entropy
loss function. The weights have been calculated by dividing the share of the majority
class by the share of each class.
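A sketch of this weighting scheme is given below; the label counts are illustrative placeholders rather than the exact shares of the train set.

# illustrative label counts per class (placeholders, not the thesis dataset)
counts = {"takeoff": 3600, "land": 3500, "forward": 27000, "cw": 18000, "ccw": 17500}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
majority_share = max(shares.values())

# weight of a class = share of the majority class / share of that class
class_weights = {label: majority_share / share for label, share in shares.items()}
print(class_weights)   # e.g. forward -> 1.0, takeoff -> 7.5

In Keras, such a dictionary can be mapped to class indices and passed to model.fit via the class_weight argument.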
The state-of-the-art neural network architectures for image processing are CNNs.
CNNs typically begin with convolutional and pooling layers which extract informa-
tion from the image by interpreting pixel values in relation to their neighbouring
pixels. The deeper you go in these layers, the more high level pixel relationships can
be detected. Convolutional and pooling layers are then followed by fully connected
layers which classify the image. Since it doesn’t make sense to convolve sensor in-
formation or the previously executed flight commands, we can feed these two input
layers directly into the first fully connected layer.
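A minimal sketch of such a multi-input network in the Keras functional API is given below; the layer sizes and counts are placeholders and do not correspond to the specific architectures compared in the following.

import tensorflow as tf
from tensorflow.keras import layers, Model

# image branch: convolutional and pooling layers extract spatial features
image_in = layers.Input(shape=(210, 280, 1), name="image")   # 280 x 210 grayscale image
x = layers.Conv2D(32, 3, activation="relu")(image_in)
x = layers.MaxPooling2D()(x)
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)

# non-image inputs bypass the convolutions and join at the first fully connected layer
sensor_in = layers.Input(shape=(3,), name="sensors")            # height, distance, #commands
commands_in = layers.Input(shape=(10,), name="prev_commands")   # two one-hot encoded commands

merged = layers.Concatenate()([x, sensor_in, commands_in])
h = layers.Dense(32, activation="relu")(merged)
h = layers.Dense(32, activation="relu")(h)
out = layers.Dense(5, activation="softmax", name="flight_command")(h)

model = Model(inputs=[image_in, sensor_in, commands_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy")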
Now that we have clarified the overall structure of the model, we need to find a
combination of convolutional, pooling and fully connected layers which work best
for our task. We created eight different architectures which have been inspired by
LeNet, AlexNet and VGG-16 [40, 41, 42]. These architectures only differ in the
number and order of convolutional, pooling and fully connected layers. All other
parameters, such as learning rate, epochs, batch size, number of filters per convolu-
tion, stride, padding, kernel size, number of neurons per fully connected layer and
activation functions, are the same across all eight architectures. The values which
have been used for these parameters can be found in Appendix C.
For each architecture, we trained a model on the train set for ten epochs and eval-
uated its performance on the validation set after every epoch. Table 4.3 shows the
number of trainable parameters for each model as well as the highest accuracy and
the highest macro F1-score that have been achieved on the validation set. The num-
bers in brackets represent the epochs with the highest measurements. There are four
main takeaways from this table: Firstly, we can see that all architectures achieve
high macro F1-scores which vary only within a range of 0.6 percentage points. Nev-
ertheless, Architectures 6 and 8 have the highest macro F1-scores. Secondly, we
can see that the number of parameters significantly varies across the architectures,
with Architectures 5 and 6 being the smallest. Thirdly, we can see that models with
a higher number of parameters are very prone to overfitting since they reach their
highest macro F1-scores on the validation set after only a few epochs.
And fourthly, we can see that models with two fully connected layers are superior to
models with one fully connected layer.
Since we are optimising the model on macro F1-score and latency, we are interested
in a model with a high macro F1-score and a low number of parameters (since
this reduces the latency). Considering the main takeaways from Table 4.3, we can
conclude that Architecture 6 is the optimal choice because it not only has the highest
macro F1-score but also the second lowest number of parameters.
The next step in optimising the neural network concerns the input layers. Two of the
three input layers provide room for optimisation. Regarding the image input layer,
we can change the resolution of the images. And concerning the previously executed
flight command input layer, we can change how many flight commands are provided
to the network.
Let’s begin with optimising the image input layer. We have trained seven differ-
ent models that are all based on Architecture 6, but vary in the resolution of input
images. All models have been trained for ten epochs on the same train set (with
adjusted image resolutions) and validated after each epoch on the validation set.
Figure 4.10 provides the highest macro F1-score across the ten epochs as well as the
number of parameters for each of the seven models. The red line illustrates that
smaller image resolutions reduce the complexity of the model which is again bene-
ficial for the prediction latency. Reducing the resolution of the image, however, also
means that the model has fewer inputs on which it can base its prediction, resulting
in a lower macro F1-score. This relationship is illustrated by the blue line in Fig-
ure 4.10, but we can see a spike regarding the macro F1-score for the 160 x 120
image resolution. This macro F1-score is exactly the same as the macro F1-score of
the model with an image resolution of 280 x 210, but considering a reduction of
64% in the number of parameters, the model with an image resolution of 160 x 120
is the superior choice.
Figure 4.10: Macro F1-score and number of parameters for models trained with varying
image input resolutions
Now that we have determined the optimal image resolution, let’s have a look at
the input layer which captures the previously executed flight commands. We have
again trained six models for ten epochs and evaluated their performance on the
validation set after each epoch. All models are based on Architecture 6 and use
an image resolution of 160 x 120, but vary in the number of previously executed
flight commands that are taken as input. Remember that these flight commands are
one-hot encoded, so an input layer that takes in four flight commands consists of
4 ∗ 5 = 20 parameters. Figure 4.11 shows the highest macro F1-score of each model
across the ten epochs. We can clearly see that the models which receive information
about the previously executed flight commands perform significantly better than
the model that only uses image and sensor data as input. More specifically, having
knowledge about at least one previously executed flight command, increases the
macro F1-score by over 2.5 percentage points. We can also see that having too
many inputs decreases the performance which might be due to overfitting. Since the
macro F1-score is peaking at two flight commands, we have chosen this number as
the optimum for the input layer.
(The macro F1-scores range from 89.28% for the model without previous flight command
inputs to a peak of 93.18% for the model that uses the two previous flight commands.)
Figure 4.11: Performance evaluation with varying number of previously executed flight
commands as input
One last option for tuning the input parameters was to test whether feeding the
sensor and flight command input layers into their own fully connected layer (before
joining them with the convolution output) increases the performance. Using this ar-
chitecture we achieved a macro F1-score of 92.97% which is worse than without
an additional fully connected layer. We thus decided not to add this layer to the
architecture.
A further way of reducing the latency of the network is by using lower precision num-
ber representations for its weights. This correlation is due to the fact that smaller
number representations consume less memory and arithmetic operations can be ex-
ecuted more efficiently.
We have thus far been using 32-bit floats, but TensorFlow provides the option to
change this to 16-bit, 64-bit or a mixture of 16-bit and 32-bit floats [34]. Using
our current architecture, we have again trained four different models with varying
floating point number precisions and evaluated their performance on the validation
set. Figure 4.12 illustrates that the highest macro F1-score is achieved when using
a mix between 16-bit and 32-bit precision numbers. The model with a precision of
16-bit performs the worst, which can be attributed to numeric stability issues. Since
the mixed precision model has both a higher macro F1-score and a lower memory
usage compared to the 32-bit and 64-bit models, it has been chosen as the optimal
option.
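In TensorFlow [34], the mixed 16/32-bit option can be enabled globally before the model is built; a minimal sketch, assuming a recent TensorFlow 2 release (older versions expose an experimental variant of this API):

from tensorflow.keras import mixed_precision

# computations run in float16 where safe, while the variables stay in float32
mixed_precision.set_global_policy("mixed_float16")

# note: the final Softmax layer should produce float32 outputs to avoid numeric issues,
# e.g. layers.Dense(5, activation="softmax", dtype="float32")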
Floating point precision    Macro F1-score
16-bit                      92.85%
16/32-bit                   93.35%
32-bit                      93.18%
64-bit                      92.90%

Figure 4.12: Performance evaluation with varying floating point number precisions
4.3.5 Overfitting
The final step in optimising the model was to perform hyper parameter tuning. We
have split this up into two parts, beginning with tuning the convolutional and fully
connected layer parameters, for which we restricted each hyper parameter to a small
subset of candidate values.
But even with the limited number of hyper parameter values, there are still 1 048 576
possible hyper parameter combinations. We thus used random search as a second
technique to overcome this problem. We trained and evaluated 530 hyper parameter
combinations which were randomly selected from the subsets. The top five parame-
ter combinations, with the highest macro F1-score on the validation set, are listed in
Table 4.5. Here we can see that we managed to further improve the macro F1-score
by 0.2 percentage points compared to our earlier model. We can also make out a
trend amongst the top five performing models. The models have in common that
their kernel sizes are small for the first convolutional layer and get larger for the
convolutional layers that follow. This makes sense, because the deeper a convolu-
tional layer is located in the network architecture, the more high level features can
be detected. And in order to detect these high level features we need to consider a
larger neighbourhood around the pixels.
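A sketch of the random search loop over such value subsets is shown below; the parameter names, the candidate subsets and the train_and_evaluate helper are illustrative assumptions rather than the exact search space used here.

import random

# assumed subsets of candidate values per hyper parameter (illustrative only)
search_space = {
    "kernel_size_conv1": [3, 5, 7],
    "kernel_size_conv2": [3, 5, 7],
    "conv_filters": [16, 32, 64],
    "dense_units": [16, 32, 64],
}

def train_and_evaluate(params):
    """Hypothetical stand-in: build and train a model with params, return its validation macro F1-score."""
    return random.random()   # placeholder so the sketch runs end-to-end

best_score, best_params = 0.0, None
for _ in range(530):   # number of randomly sampled combinations
    params = {name: random.choice(values) for name, values in search_space.items()}
    score = train_and_evaluate(params)
    if score > best_score:
        best_score, best_params = score, params

print(best_score, best_params)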
Table 4.5: Top five best performing hyper parameter combinations found using random
search
Out of the parameter combinations listed in Table 4.5, we have selected the pa-
rameter combination with ID 2 as the optimal choice. This parameter combination
achieves the second highest macro F1-score and has the smallest number of param-
eters.
In a last step, we compared four different optimizers (Adam, Stochastic Gradient
Descent (SGD), Adagrad, RMSprop) and three different activation functions (Sigmoid,
ReLU, Leaky ReLU) against each other. We trained twelve different models that are
based on our previously determined optimal architecture and assigned each a different
combination of optimizer and activation function. Since the optimizers
are highly dependent on their learning rates, we used the optimizer specific default
learning rates that are provided by TensorFlow for our comparison. The results in
Table 4.6 show that in regards to the activation functions, Leaky ReLU and ReLU
both perform much better than Sigmoid. Regarding the optimizers, Adam and RM-
Sprop both outperform SGD and Adagrad. Based on this analysis we have decided
to keep the combination of Adam optimizer and ReLU activation functions which we
have used throughout this chapter. We have also tested the Adam optimizer with a
variation of learning rates but found the default learning rate of 0.001 to be optimal.
Table 4.6: Matrix of macro F1-scores using different optimizer and activation function
combinations
Now that we have optimised the architectural design of the CNN, we can evaluate
its performance on the test set. The test set has been set aside throughout the whole
optimisation procedure and thus allows us to test how well the model performs on
unseen data. We trained the final model on the train set using a batch size of 32, val-
idated its performance after every epoch on the validation set and saved the weights.
The training process was terminated once the macro F1-score on the validation set
didn’t improve for more than ten epochs and the weights which achieved the highest
macro F1-score were selected for the final model.
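A sketch of this training set-up with Keras callbacks is given below; it assumes a custom validation metric registered under the name val_macro_f1, since Keras provides no built-in macro F1 metric.

from tensorflow.keras import callbacks

cbs = [
    # stop once the monitored metric has not improved for ten epochs and restore the best weights
    callbacks.EarlyStopping(monitor="val_macro_f1", mode="max",
                            patience=10, restore_best_weights=True),
    # additionally persist the best weights to disk
    callbacks.ModelCheckpoint("best_model.h5", monitor="val_macro_f1",
                              mode="max", save_best_only=True),
]

# model.fit(train_inputs, train_labels, batch_size=32, epochs=1000,
#           validation_data=(val_inputs, val_labels), callbacks=cbs)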
The evaluation on the test set achieved a macro F1-score of 93.60% which is even
higher than the macro F1-score on the validation set. The confusion matrix in Fig-
ure 4.13 in combination with the class-based metrics in Table 4.7 reveal some further
insights. We can see that the model is capable of predicting all five flight commands,
but the performance does vary across them. Let’s take a closer look at the individual
flight commands. The model correctly predicts the takeoff command in all but two
                      Predicted label
True label    takeoff   land   forward     cw    ccw
takeoff           766      0         1      0      1
land                0    721        23      7      4
forward             0     37      5166    270    329
cw                  0     12       196   3603    104
ccw                 0      3       205     41   3511

Figure 4.13: Confusion matrix of the final model evaluated on the test set
instances. The corresponding F1-score of 99.87% confirms that the model performs
very well in regards to the takeoff command. The land command has the second
highest F1-score across the five classes. This suggests that the drone will likely be
able to descend once it is hovering above the landing platform. The confusion ma-
trix further illustrates that the model sometimes confuses the land with the forward
command. This is understandable, since it is difficult to determine if the drone is
already close enough to the landing platform in order to execute the land command.
The forward command, on the other hand, is most often confused with the rotation
commands. This might be caused by the fact that the landing platform can be hidden
by an obstacle. In those cases, the drone can’t simply rotate in order to find the plat-
form, but needs to fly around the obstacle which requires the execution of forward
commands. This explains the fact that the forward, cw and ccw flight commands
have lower F1-scores than the other two flight commands.
4.4 Summary
Chapter 5
Evaluation
Let’s recapitulate what we have done so far. In a first step, we created a simulation
environment for the DJI Tello drone. In a second step, we developed a machine
learning model which can be used for predicting flight commands based on visual
and sensor information captured by the drone. The only thing left to do, is to put the
model into action and test whether it manages to successfully navigate the drone to
the landing platform.
In this chapter, we will elaborate on the test results and qualitatively compare our
approach to previous research. We have decided to solely evaluate the model in the
simulation environment, and not in the real world, due to restrictions imposed by the
COVID-19 pandemic.
5.1 Test Flights in Simulation

In order to evaluate the model’s performance, we executed a total of 1 000 test flights
in the simulation environment. The flights were conducted in the following way: In
a first step, we randomly placed the drone, the landing platform and a number of
obstacles in the simulation environment. The number of obstacles (between 0 and
10) and the edge lengths of the obstacles (between 0cm and 200cm) have been
randomly chosen using a pseudo-random uniform distribution. In a second step, we
ensured that a potential flight path between the drone’s starting position and the
position of the landing platform exists. If this was not the case, which can happen if
the drone is locked in by the obstacles, we generated a new simulation environment.
In a third step, we let the machine learning model take over the control of the
drone. This is an iterative process in which the drone continuously captures visual
and sensor information, predicts a flight command based on this information and
executes the predicted flight command. The flight terminates once the drone has
landed somewhere, has crashed or, by default, after 100 flight command executions.
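The iterative control loop can be summarised by the following sketch; all drone and model methods are hypothetical names used for illustration, not the actual simulation interface.

MAX_COMMANDS = 100

def run_test_flight(drone, model):
    """Iteratively sense, predict and act until the drone lands, crashes or times out."""
    for _ in range(MAX_COMMANDS):
        image = drone.capture_image()              # hypothetical interface
        sensors = drone.read_sensors()
        history = drone.previous_commands(n=2)
        command = model.predict_command(image, sensors, history)
        drone.execute(command)
        if drone.has_landed() or drone.has_crashed():
            break
    return drone.flight_statistics()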
For each test flight, we captured several statistics such as whether the drone has landed
on the platform, the distance between the drone’s final position and the landing
platform, the number of executed flight commands and the number of obstacles
in the simulation environment. A test flight is classified as “Landed on Platform”
if the drone is within 30cm of the center of the landing platform as illustrated in
Figure 5.1.
In the following, we will analyse the outcome of these test flights. We begin our
analysis by looking at whether the model manages to successfully navigate the drone
to the landing platform. And in a second step, we will investigate whether the flight
paths that the drone has taken are efficient, by comparing them to the optimal flight
paths.
For our first set of test flights, we trained our model on a dataset of 100 000 data
samples (using a 70% train and 30% validation split) and kept the model with the
highest macro F1-score measured on the validation set. Using this model, the drone
managed to land on the landing platform in 852 out of 1 000 test flights. Out of
the remaining 148 test flights which didn’t make it onto the landing platform, the
drone crashed in 141 cases, landed outside the landing platform in five cases and
didn’t land at all in two cases. These outcomes prove that the model is capable
of successfully navigating the drone from a random starting position towards the
landing platform in the majority of simulation environments.
In order to investigate whether a larger training set improves the performance, we
trained two further models on datasets of 500 000 and 1 000 000 data samples,
respectively, again using a 70% train and 30% validation split. We deployed the two
models in the same 1 000 simulation environments and under the same conditions
(starting pose of drone, termination period, etc.) as the ones used for conducting
the previous test flights. The outcomes of these test flights are also depicted in
Figure 5.2 and can be compared against each other. We can see that the models
that were trained on the larger datasets managed to navigate the drone towards the
landing platform more often. This improvement is, however, not consistent since
the share of successful flights is higher for the 500 000 model than for the 1 000 000
model. Nevertheless, a consistent improvement can be found in regards to crashes.
Figure 5.2: Test flight outcomes depending on size of dataset used for training the
model (70% train and 30% validation split)
We can see that a larger training dataset significantly reduces the risk of crashing
the drone. Depending on the use case, having a lower crash rate might even be
considered to be of higher importance than making it to the landing platform more
often. The model which has been trained on the dataset of 1 000 000 data samples
is used for the subsequent analysis.
In a next step, we investigated the factors that impact whether a test flight is
successful or not. For this, we have conducted the following multiple linear regression:

\[
\text{LandedOnPlatform}_i = \beta_0 + \beta_1\,\text{NrCuboids}_i + \beta_2\,\text{TotalCuboidVolume}_i + \beta_3\,\text{DistanceStart}_i + \epsilon_i \tag{5.1}
\]

where:
β0 = 1.3329
β1 = -0.0315
β2 = 0.0023
β3 = -0.2101
The coefficient of the DistanceStart variable reveals that every additional meter which
the drone is placed away from the landing platform at the start reduces the chances of
making it to the landing platform by 21.01 percentage points. The TotalCuboidVolume
variable, on the other hand, is not statistically significant and we can thus conclude that
the size of the obstacles didn’t seem to impact whether the drone made it to the landing
platform.
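Such a linear probability model could, for example, be fitted with statsmodels; in the sketch below the file name and column names are assumptions chosen to match the variables in Equation 5.1.

import pandas as pd
import statsmodels.api as sm

# one row per test flight: binary outcome plus environment statistics (assumed column names)
flights = pd.read_csv("test_flights.csv")

X = sm.add_constant(flights[["NrCuboids", "TotalCuboidVolume", "DistanceStart"]])
y = flights["LandedOnPlatform"]   # 1 if the drone landed on the platform, 0 otherwise

result = sm.OLS(y, X).fit()
print(result.summary())   # coefficients, standard errors and p-values for each factor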
Figure 5.3: Relative share of flights that landed on the platform depending on number
of obstacles in the simulation environment
Let’s have a closer look at these two factors, beginning with the number of cuboids
in the simulation environment. The bar chart depicted in Figure 5.3 illustrates the
share of test flights in which the model successfully managed to navigate the drone
to the landing platform, depending on the number of obstacles in the simulation en-
vironment. There are two main takeaways from this figure: Firstly, we can see that
in a simulation environment without any obstacles, the drone made it to the landing
platform in every test flight. This corresponds to a success rate of 100% and we
can therefore conclude that the model achieves an optimal outcome in simulation
environments without obstacles. Secondly, we can see that the chances of making
it to the landing platform decrease with the number of obstacles, which confirms
our results from the multiple linear regression model. Figure 5.3 illustrates that the
model is 25% less likely to navigate the drone to the landing platform in a highly
obstructed simulation environment compared to a non-obstructed simulation envi-
ronment. Examples of non-obstructed and obstructed simulation environments are
given in Figure 5.4 and Figure 5.5, respectively.
The second factor, according to our multiple linear regression model, which impacted
the outcome of the test flights, was the distance between the starting position of the
drone and the position of the landing platform. We have plotted the distance at the
start against the distance at the end for each test flight in Figure 5.6. The respective
outcomes of the flights are color coded and reveal some interesting insights. First
of all, we can see that whenever the drone is closer than 1.5 meters away from the
landing platform at the start, it makes it to the landing platform in all but one of
the test flights. Once the start distance is greater than 1.5 meters, the number of
Figure 5.6: Scatter plot comparing distance between position of the drone and position
of the landing platform at the start and end of the test flights
unsuccessful flights increases. Let’s have a look at each of the four different outcome
types. It is obvious that the final distance of flights in which the drone landed on the
platform should be marginal. In fact, a flight is identified as “Landed on Platform”
when it lands within 30cm of the center of the landing platform. Looking at the
flights which landed outside the landing platform, we can see that in most cases the
drone landed far away from the platform. There are only three instances in which
the drone landed outside the platform but within 1.5 meters. This finding, paired
with the fact that the model achieved a 100% success rate in the non-obstructed
environments, suggests that whenever the landing platform is within the drone’s
visual field, it will successfully land on it. But whenever the drone needs to search
for the landing platform, it can happen that it gets lost and lands somewhere else.
Regarding the test flights which didn’t land at all, we can find that for most of these
test flights, the distance at the end is equivalent to the distance at the start. We
have observed that the drone sometimes takes off and begins rotating without ever
leaving its position. Lastly, regarding the test flights which crashed, we can see that
all but one of these flights crashed on their way towards the landing platform since
the distance at the end is smaller than the distance at the start.
In order to evaluate the performance of the model, it is important to not only look
at whether the drone made it to the platform, but also at the flight path which it has
taken. We are interested in a model which manages to find the most efficient flight
path and thus gets to the landing platform with as few flight command executions
as possible.
Figure 5.7: Scatter plot comparing BFS path length with flight path length for test flights
which landed on the platform
For a first analysis, we only considered the test flights in which the drone landed
on the platform. Figure 5.7 includes a scatter plot in which the flight path length
is plotted against the path length that has been determined by our BFS algorithm
described in Section 4.1.2. We can see that the majority of flight paths is close to the
BFS flight paths. In fact, 71% of the flight paths have a length which is less than 20%
longer than the corresponding BFS path. We can even see that the flight paths taken
by the model are in some cases marginally shorter than the BFS paths. This can
happen because we are using some simplifications when calculating the BFS path.
Figure 5.7 also illustrates that the variance of the flight path length increases with
the length of the corresponding BFS path. Longer BFS paths suggest that the drone
might not be able to see the landing platform from the beginning and first needs to manoeuvre
around some obstacles in order to find it. In these cases, the drone sometimes takes
detours which explains the increasing variance.
Using the scatter plot depicted in Figure 5.8, we can conduct the same analysis for
all test flights in which the drone didn’t land on the landing platform. Let’s again
go through each of the three different outcomes separately, beginning with the test
flights in which the drone crashed. We can see that the crashes didn’t occur right at
the start, but rather after at least twelve flight command executions. In four of these
flights, the drone even crashed after having executed over 50 flight commands al-
ready. Regarding the flights in which the drone landed outside the landing platform,
we can see that in the majority of these test flights, the drone landed after having
executed more than 60 flight commands. Since the model takes in the number of
previously executed flight commands as an input, it makes sense that it becomes
more likely for the model to predict the landing command after having executed
more flight commands even if the landing platform isn’t yet in sight. Nevertheless,
there were also test flights in which the drone didn’t land after 100 flight command
executions. In regards to the flights which didn’t land at all, Figure 5.8 doesn’t reveal
any insights except for the fact that this outcome is evenly distributed across the BFS
path lengths.
Figure 5.8: Scatter plot comparing BFS path length with flight path length for test flights
which did not land on the platform

5.2 Qualitative Comparison
After having evaluated the performance of our model, we will now qualitatively
compare our approach with the Reinforcement Learning approach of Polvara et al.
[11] and the Supervised Learning approaches of Kim and Chen [7] and Padhy et al.
[8].
All three papers [7, 8, 11] attempt to solve a task which is similar to ours (i.e. vision-
based autonomous navigation and landing of a drone) but differs in regards to some
aspects (e.g. Kim and Chen [7] and Padhy et al. [8] don’t use a landing platform and
also don’t include obstacles in the environment; Polvara et al. [11] use a downward
facing camera and fly the drone at higher altitudes).
In regards to how these papers [7, 8, 11] approach this task, we can find
some similarities but also some key differences. What our approach and all three pa-
pers [7, 8, 11] have in common, is the fact that we all use a CNN to map images that
are captured by the drone to high-level control commands. Let’s discuss the other
similarities and differences separately for the Reinforcement Learning approach of
Polvara et al. [11] and the Supervised Learning approaches of Kim and Chen [7]
and Padhy et al. [8]. The key points of our comparison are summarized in Table 5.1.
What the approach of Polvara et al. [11] has in common with our approach, is the
fact that we both use a simulation program for training the machine learning model.
Our approaches do however differ in regards to how we use the simulation program.
We use the simulation program to collect labelled data samples and subsequently
train our model via Supervised Learning. In contrast, Polvara et al. [11] apply an
enhanced reward method (partitioned buffer replay) for training their machine learn-
ing model via Reinforcement Learning in the simulator. A potential downside of
their approach is the low convergence rate. Polvara et al. [11] were in fact required
to train their model for 12.8 days in simulation. Using our approach in contrast, it
takes 15.6 hours to generate a dataset of 1 000 000 data samples using an Intel Core
i7-7700K CPU with 16 GB random access memory (RAM) and 1.2 hours to train
the model using a NVIDIA GeForce GTX 1080 GPU with 3268 MB RAM. The signifi-
cantly quicker convergence rate constitutes an advantage of our approach compared
to the approach of Polvara et al. [11]. An aspect in which the approach of Polvara
et al. [11] is more advanced than ours concerns sim-to-real transfer. Polvara et al.
[11] trained their model in domain randomised simulation environments with vary-
ing visual and physical properties. This allowed them to successfully transfer their
learnings into the real world [11]. We have not implemented domain randomisation
features yet, but our approach can be extended with them to allow for sim-to-real
transfer.
Kim and Chen [7] and Padhy et al. [8] use a Supervised Learning technique for
training their CNN. Their convergence rate is therefore similar to ours. The main
difference compared to our approach concerns the data collection procedure that is
used for generating a labelled dataset. Both, Kim and Chen [7] and Padhy et al. [8]
manually collect images in the real world. This is a time-consuming task as they
need to fly the real drone through various locations and label the captured images with
the corresponding flight command that is executed by the pilot [7, 8]. In order to increase the
size of their datasets, Kim and Chen [7] and Padhy et al. [8] further apply augmen-
tation techniques such as zooming, flipping and adding Gaussian white noise. We,
in contrast, have developed a fully automated procedure to collect data samples. In
our approach, images and sensor information are collected in a simulation program
and the labels, which correspond to the optimal flight commands, are computed
using a path planning algorithm. Under the assumption that sim-to-real transfer is
possible, which has been shown by several other publications [12, 13, 9, 10, 11],
our approach has two advantages: Firstly, since the data collection procedure is fully
automated, we can create large datasets with very little effort and thus generalize
the model more by training it on a greater variety of environments. Secondly, our
labelling is more accurate since we don’t have humans in the loop and our path plan-
ning algorithm ensures that the label corresponds to the optimal flight command.
5.3 Summary
This chapter gave an overview of the test results that were achieved in the simulation
environment, and provided a qualitative comparison to previous publications.
The test results revealed that the model managed to successfully navigate the drone
towards the landing platform in 87.8% of the test flights. We have shown that the
risk of crashing the drone can be reduced when training the model on larger datasets,
even though we couldn’t find a consistent improvement in regards to successful land-
ings. Since each test flight was performed in a different simulation environment, it
was also possible to observe the impact of environmental changes on the performance
of the model. The results illustrated that the success rate negatively correlates with the
number of obstacles in the environment as well as the distance between the starting
position of the drone and the position of the landing platform. The model achieved
a success rate of 100% in non-obstructed simulation environments.
In regards to the flight path, we have found that the flight path taken by the model is
for the majority of test flights very close to the optimal flight path. It can, however,
sometimes happen that the drone takes a detour especially when the simulation
environment is highly obstructed.
The qualitative comparison of our model with previous research has shown that our
model has the advantage of having shorter training times when being compared to a
similar Reinforcement Learning approach and overcomes the limitations of manual
data collection (under the assumption of successfully applying domain randomisa-
tion for sim-to-real transfer) when being compared to similar Supervised Learning
approaches.
Chapter 6
Conclusion
In this work, we have presented a vision-based control approach which maps low
resolution image and sensor input to high-level control commands. The task which
this model attempts to solve is the autonomous navigation of a MAV towards a visu-
ally marked landing platform in an obstructed indoor environment. Our approach is
practical since (i) our data collection procedure can be used to automatically gener-
ate large and accurately labelled datasets in simulation and (ii) because our model
has a quick convergence rate. We expect our approach to also be applicable to other
navigation tasks such as the last meter autonomous control of a delivery robot or
the navigation of underwater vehicles. Despite the simplicity of our approach, the
test flights in the simulation environment showed very promising results. There are
several possibilities for future work:
• Make the approach suitable for dynamic environments (e.g. extend the sim-
ulation program to also include dynamic features; adapt the path planning
algorithm to account for changes in the environment).
Bibliography
[1] Da-Jiang Innovations Science and Technology Co., Ltd. Dji tello prod-
uct image. https://product1.djicdn.com/uploads/photos/33900/large_
851441d0-f0a6-4fbc-a94a-a8fddcac149f.jpg, 2020. [Last Accessed:
01. September 2020]. pages iv, 11
[2] J. Zhang and H. Yuan. Analysis of unmanned aerial vehicle navigation and
height control system based on gps. Journal of Systems Engineering and Elec-
tronics, 21(4):643–649, 2010. pages 1, 3
[3] Am Cho, Jihoon Kim, Sanghyo Lee, Sujin Choi, Boram Lee, Bosung Kim, Noha
Park, Dongkeon Kim, and Changdon Kee. Fully automatic taxiing, takeoff and
landing of a uav using a single-antenna gps receiver only. In 2007 International
Conference on Control, Automation and Systems, pages 821–825, 2007. pages
1, 3
[6] Steven Langel, Samer Khanafseh, Fang-Cheng Chan, and Boris Pervan. Cycle
ambiguity reacquisition in uav applications using a novel gps/ins integration
algorithm. 01 2009. pages 1, 3
[7] Dong Ki Kim and Tsuhan Chen. Deep neural network for real-time autonomous
indoor navigation. CoRR, abs/1511.04668, 2015. URL http://arxiv.org/
abs/1511.04668. pages 1, 4, 9, 50, 51, 52
[8] Ram Prasad Padhy, Sachin Verma, Shahzad Ahmad, Suman Kumar Choud-
hury, and Pankaj Kumar Sa. Deep neural network for autonomous uav nav-
igation in indoor corridor environments. Procedia Computer Science, 133:643 –
650, 2018. ISSN 1877-0509. URL http://www.sciencedirect.com/science/
article/pii/S1877050918310524. International Conference on Robotics and
Smart Manufacturing (RoSMa2018). pages 1, 4, 9, 50, 51, 52
[9] Fereshteh Sadeghi and Sergey Levine. Cad2rl: Real single-image flight without
a single real image. 07 2017. doi: 10.15607/RSS.2017.XIII.034. pages 1, 2,
4, 52
[10] Antonio Loquercio, Elia Kaufmann, René Ranftl, Alexey Dosovitskiy, Vladlen
Koltun, and Davide Scaramuzza. Deep drone racing: From simulation to
reality with domain randomization. CoRR, abs/1905.09727, 2019. URL
http://arxiv.org/abs/1905.09727. pages 1, 2, 4, 52
[11] Riccardo Polvara, Massimiliano Patacchiola, Marc Hanheide, and Gerhard Neu-
mann. Sim-to-real quadrotor landing via sequential deep q-networks and do-
main randomization. Robotics, 9:8, 02 2020. doi: 10.3390/robotics9010008.
pages 1, 2, 4, 9, 17, 50, 51, 52
[13] Stephen James, Andrew J. Davison, and Edward Johns. Transferring end-
to-end visuomotor control from simulation to real world for a multi-stage
task. CoRR, abs/1707.02267, 2017. URL http://arxiv.org/abs/1707.02267.
pages 2, 52, 53
[14] Yang Gui, Pengyu Guo, Hongliang Zhang, Zhihui Lei, Xiang Zhou, Jing Du, and
Qifeng Yu. Airborne vision-based navigation method for uav accuracy landing
using infrared lamps. Journal of Intelligent & Robotic Systems, 72(2):197–218,
2013. doi: 10.1007/s10846-013-9819-5. pages 3
[15] Yuncheng Lu, Zhucun Xue, Gui-Song Xia, and Liangpei Zhang. A survey on
vision-based uav navigation. Geo-spatial Information Science, 21(1):21–32,
2018. doi: 10.1080/10095020.2017.1420509. pages 3, 5
[16] K. Celik, S. Chung, M. Clausman, and A. K. Somani. Monocular vision slam for
indoor aerial vehicles. In 2009 IEEE/RSJ International Conference on Intelligent
Robots and Systems, pages 1566–1573, 2009. pages 3, 4
[18] Josep Aulinas, Yvan Petillot, Joaquim Salvi, and Xavier Llado. The slam
problem: a survey. volume 184, pages 363–371, 01 2008. doi: 10.3233/
978-1-58603-925-7-363. pages 4
[20] Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. No-regret reduc-
tions for imitation learning and structured prediction. CoRR, abs/1011.0686,
2010. URL http://arxiv.org/abs/1011.0686. pages 4
[22] Ivin Amri Musliman, Alias Abdul Rahman, and Volker Coors. Implementing 3D
network analysis in 3D-GIS. 2008. pages 5
[23] P. E. Hart, N. J. Nilsson, and B. Raphael. A formal basis for the heuristic
determination of minimum cost paths. IEEE Transactions on Systems Science
and Cybernetics, 4(2):100–107, 1968. pages 5
[24] Luca De Filippis, Giorgio Guglieri, and Fulvia Quagliotti. Path planning strate-
gies for uavs in 3d environments. Journal of Intelligent and Robotic Systems,
65:247–264, 01 2012. doi: 10.1007/s10846-011-9568-2. pages 5
[26] S. Koenig and M. Likhachev. Improved fast replanning for robot navigation in
unknown terrain. In Proceedings 2002 IEEE International Conference on Robotics
and Automation (Cat. No.02CH37292), volume 1, pages 968–975 vol.1, 2002.
pages 5
[27] J. Carsten, D. Ferguson, and A. Stentz. 3d field d: Improved path planning and
replanning in three dimensions. In 2006 IEEE/RSJ International Conference on
Intelligent Robots and Systems, pages 3381–3386, 2006. pages 5
[28] Fei Yan, Yi-Sha Liu, and Ji-Zhong Xiao. Path planning in complex 3d envi-
ronments using a probabilistic roadmap method. International journal of au-
tomation and computing., 10(6):525–533, 2013. ISSN 1476-8186. pages 5,
6
[29] IEEE. Aligned design: A vision for prioritizing human well-being with au-
tonomous and intelligent systems, version 2. 2017. pages 6
[31] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. Airsim:
High-fidelity visual and physical simulation for autonomous vehicles. CoRR,
abs/1705.05065, 2017. URL http://arxiv.org/abs/1705.05065. pages 8
[32] Da-Jiang Innovations Science and Technology Co., Ltd. Dji flight simu-
lator - user manual (version 1.0). https://dl.djicdn.com/downloads/
simulator/20181031/DJI_Flight_Simulator_User_Manual_v1.0_EN.pdf,
2018. [Last Accessed: 01. September 2020]. pages 8
[33] Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simula-
tion for games, robotics and machine learning. http://pybullet.org, 2016–
2019. [Last Accessed: 01. September 2020]. pages 10, 12, 13
[34] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen,
Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, San-
jay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard,
Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Lev-
enberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris
Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal
Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas,
Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and
Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous
systems, 2015. URL https://www.tensorflow.org/. Software available from
tensorflow.org. pages 10, 38
[35] Shenzhen RYZE Tech Co., Ltd. and Da-Jiang Innovations Science and Tech-
nology Co., Ltd. Tello sdk (version 2.0). https://dl-cdn.ryzerobotics.
com/downloads/Tello/Tello%20SDK%202.0%20User%20Guide.pdf, 2018.
[Last Accessed: 22. June 2020]. pages 11, 13, 14, 15
[36] R. Tsai. A versatile camera calibration technique for high-accuracy 3d machine
vision metrology using off-the-shelf tv cameras and lenses. IEEE Journal on
Robotics and Automation, 3(4):323–344, 1987. pages 12, 13
[37] Aurélien Géron. Hands-on machine learning with scikit-learn, keras, and
tensorflow: concepts, tools, and techniques to build intelligent systems, 2019.
ISBN 9781492032649. pages 19
[38] Stuart Russell and Peter Norvig. Artificial intelligence: a modern approach.
Prentice Hall series in artificial intelligence. Pearson Education, 2016. ISBN
9781292153964. pages 24
[39] J. Opitz and S. Burst. Macro f1 and macro f1. ArXiv, abs/1911.03347, 2019.
pages 32
[40] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based
learning applied to document recognition. In Proceedings of the IEEE, pages
2278–2324, 1998. pages 35
[41] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Imagenet classification
with deep convolutional neural networks. Neural Information Processing Sys-
tems, 25, 01 2012. doi: 10.1145/3065386. pages 35
[42] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition. arXiv 1409.1556, 09 2014. pages 35
[43] Andrew Ng. Feature selection, l 1 vs. l 2 regularization, and rotational in-
variance. Proceedings of the Twenty-First International Conference on Machine
Learning, 09 2004. doi: 10.1145/1015330.1015435. pages 39
Appendices
Appendix A
Parameters
----------
no parameters

Returns
-------
view_matrix : list of 16 floats
    The 4x4 view matrix corresponding to the position and orientation of
    the camera
"""
# get current position and orientation of the camera link
camera_position, camera_orientation, _, _, _, _ = p.getLinkState(
    self.body_id, self.link_id, computeForwardKinematics=True)
# calculate rotational matrix from the orientation quaternion
rotational_matrix = p.getMatrixFromQuaternion(camera_orientation)
rotational_matrix = np.array(rotational_matrix).reshape(3, 3)
# transform camera vector into world coordinate system
camera_vector = rotational_matrix.dot(self.initial_camera_vector)
# transform up vector into world coordinate system
up_vector = rotational_matrix.dot(self.initial_up_vector)
# compute view matrix
view_matrix = p.computeViewMatrix(
    camera_position, camera_position + camera_vector, up_vector)
return view_matrix
Appendix B
Datasets used for Optimisation

Label distribution of the train, validation and test sets:

Dataset       takeoff    land   forward       cw      ccw
Train           3,608   3,524    27,220   18,129   17,509
Validation        749     733     5,884    3,862    3,772
Test              768     755     5,802    3,915    3,760
Appendix C
Architecture Comparison
Parameter Value
# Epochs 10
Batch Size 32
# Conv. Filters 32
Conv. Kernel Size 3x3
Conv. Padding valid
Conv. Stride 1
Conv. Activation Function ReLU
# Neurons 32
Fully Con. Layer Activation Function ReLU
Appendix D