
Ocean Engineering 262 (2022) 112182


Autonomous underwater vehicle formation control and obstacle avoidance using multi-agent generative adversarial imitation learning
Zheng Fang, Dong Jiang, Jie Huang, Chunxi Cheng, Qixin Sha, Bo He, Guangliang Li ∗
College of Electronic Engineering, Ocean University of China, Qingdao, China

ARTICLE INFO

Keywords: Multi-agent reinforcement learning; Formation control; Autonomous underwater vehicle; Imitation learning; Obstacle avoidance

ABSTRACT

Autonomous underwater vehicles (AUVs) are widely used in complex underwater missions such as bottom survey and data collection. Multiple AUVs can cooperatively complete tasks that a single AUV cannot accomplish. Recently, multi-agent reinforcement learning (MARL) has been introduced to improve multi-AUV control in uncertain marine environments. However, it is very difficult and often impractical to design effective and efficient reward functions for various tasks. In this paper, we implemented multi-agent generative adversarial imitation learning (MAGAIL) from expert demonstrated trajectories for formation control and obstacle avoidance of multi-AUV. In addition, a decentralized training with decentralized execution framework was adopted to alleviate the communication problem in underwater environments. Moreover, to help the discriminator accurately judge the quality of an AUV's trajectory in the two tasks and to increase the convergence speed, we improved upon MAGAIL by dividing the state–action pairs of the expert trajectory for each AUV into two groups and updating the discriminator by randomly selecting an equal number of state–action pairs from both groups. Our experimental results on a simulated AUV system modeling the Sailfish 210 of our lab in the Gazebo simulation environment show that MAGAIL allows control policies of multi-AUV to obtain a better performance than traditional multi-agent deep reinforcement learning from a fine-tuned reward function (IPPO). Moreover, control policies trained via MAGAIL in simple tasks can generalize better to complex tasks than those trained via IPPO.

∗ Corresponding author.
E-mail address: [email protected] (G. Li).
https://doi.org/10.1016/j.oceaneng.2022.112182
Received 28 January 2022; Received in revised form 16 July 2022; Accepted 31 July 2022
Available online 27 August 2022
0029-8018/© 2022 Elsevier Ltd. All rights reserved.

1. Introduction

Autonomous underwater vehicles (AUVs) can perform a variety of underwater tasks with their own control and navigation systems and are widely used in complex underwater missions such as bottom survey and data collection (Cheng et al., 2021). However, the detection range and energy storage of a single AUV are limited. With the increasing complexity of underwater missions, it is necessary to use multiple AUVs to cooperatively complete tasks that a single AUV cannot perform. Multi-AUV can perform tasks such as underwater detection, target search and object recognition in a collaborative manner, which not only improves the efficiency of task execution but also reduces the time and energy cost.

For many tasks, multi-AUV needs to move in formation and be able to avoid obstacles while performing the task (Yang et al., 2021). In other words, formation control and obstacle avoidance are core techniques for multi-AUV to complete complex tasks. Currently, most methods for formation control can be divided into four groups: the virtual structure method, the behavior-based method, the artificial potential field and the leader–follower method (Li et al., 2014). In the leader–follower method, a leader AUV usually tracks a predefined trajectory and the followers track transformed states of the leader AUV according to predefined schemes (Desai et al., 2001). The leader–follower method is easy to understand and implement, since the coordinated team members only need to maneuver according to the leader (Zhang et al., 2021a), and it has been widely used for multi-AUV control because of its simplicity and extensibility (Gao and Guo, 2018). However, most methods for leader–follower formation control are difficult to apply in uncertain environments.

Multi-agent reinforcement learning (MARL) has been introduced to improve multi-AUV control in uncertain marine environments (Qie et al., 2019; Zhang et al., 2021b). With MARL, AUVs can interact with the environment and learn optimal control policies to complete different underwater tasks.
For example, Xu et al. (2021) used deep reinforcement learning to deal with the cooperative decision-making problem of multi-AUV under limited perception and communication in attack-defense confrontation missions. They found that deep reinforcement learning can enable multi-AUV to complete the attack-defense confrontation missions excellently and generate interesting collaborative behaviors that were unseen previously. Wang et al. (2018) designed a reinforcement learning-based online learning algorithm to learn optimal trajectories in a constrained continuous space for multi-AUV to estimate a water parameter field of interest in the under-ice environment. However, it is very challenging to apply MARL to multi-AUV control, as the difficulty of designing an effective and efficient reward function for each agent increases with the number of agents in the group, and the relationship between agents will change from task to task (Yang and Gu, 2004).

As AUVs often work in complex and high-dimensional environments (Fang et al., 2019), it is easier to provide expert demonstrations on how to perform tasks than to design complex reward functions. Imitation learning from expert demonstrations was proposed and has been successfully applied to single robot control (Hussein et al., 2017). The simplest imitation learning method is behavior cloning (BC), which learns to map states of an agent to optimal actions via supervised learning (Ross and Bagnell, 2010). However, BC requires a large amount of reliable data to learn and cannot generalize to unseen states and different tasks. On the other hand, inverse reinforcement learning (IRL) allows an agent to first extract a cost function from expert demonstrations and then learn a control policy from the extracted cost function via reinforcement learning (Ng et al., 2000), which enables the agent to generalize to unseen states more effectively (Ho et al., 2016). However, many of the proposed IRL methods need a model to solve a sequence of planning or reinforcement learning problems in an inner loop (Ho and Ermon, 2016), which is extremely expensive to run and limits the use of IRL for robot control in large and complex tasks. Ho and Ermon (2016) tried to solve this problem by proposing a general model-free imitation learning method — generative adversarial imitation learning (GAIL). GAIL allows robots to directly learn policies from expert demonstrations in large and complex environments.

On the other hand, compared with other multi-agent formations, a unique characteristic of the multi-AUV formation is the poor communication conditions. Environmental factors such as water depth, water quality and light field can have negative impacts on communication devices, thus reducing the stability of the communication and the information quality (Xin et al., 2021). As radio or spread-spectrum communications and global positioning propagate only short distances in underwater environments, acoustic communication is the most popular method for multi-AUV control. However, acoustic communication also suffers from many shortcomings, such as small bandwidth and high latency (Paull et al., 2013). In order to solve the problem of multi-robot formation control caused by communication range and bandwidth limitations, Ren and Sorensen (2008) proposed a distributed formation control architecture that accommodates an arbitrary number of group leaders and arbitrary information flow among vehicles. The architecture requires only local neighbor-to-neighbor information exchange. By increasing the number of group leaders within the formation, it greatly reduces the failure of the entire formation caused by wrong decisions made by a single leader or follower. However, in their experiment, they failed to take into account the issue of time delay, the effect of robot dynamics and the consequences of data loss. To solve the problem caused by the underwater time-varying delay, Yan et al. (2017) used the consensus algorithm to solve the coordinate control problem of multiple unmanned underwater vehicles (UUVs) formation in a leader-following manner. Suryendu and Subudhi (2020) further proposed to use a gradient-descent based method to estimate the actual communication delay. However, acoustic communication still suffers from low data rate, variable sound speed due to fluctuating water conditions, etc., which greatly limits applying multi-AUV formation control in complex tasks.

In this paper, we implemented multi-agent generative adversarial imitation learning (MAGAIL) (2018) for formation control and obstacle avoidance of multi-AUV. MAGAIL allows multi-AUV to learn control policies from demonstrated trajectories instead of predefined reward functions. However, there are two challenges for implementing MAGAIL in formation control and obstacle avoidance of multi-AUV:

• How to deal with the communication constraints in underwater environments for multi-AUV formation control?
• As the formation control and obstacle avoidance sub-tasks were demonstrated in one expert trajectory and AUVs do not encounter and detect obstacles all the time, there are many more state–action pairs without detected obstacles than those with detected obstacles. How to train an unbiased discriminator to provide effective and distinctive rewards for learning the two sub-tasks in one trajectory?

The main contributions of our work are as below:

• We developed and implemented the MAGAIL method for formation control and obstacle avoidance of multi-AUV, which can learn control policies from expert trajectories instead of pre-defined reward functions;
• To solve the limited communication problem in underwater environments, we adopted a decentralized training with decentralized execution (DTDE) framework, in which each AUV only needs to behave based on its own local observations;
• We improved upon the original MAGAIL by distinguishing data in the expert trajectories for the formation control and obstacle avoidance tasks, so that the discriminator can accurately judge the quality of an AUV's trajectory and both AUVs can learn one policy for the two tasks at a faster speed.

We tested our method learning from both optimal and sub-optimal demonstrations for formation control and obstacle avoidance with multi-AUV simulators modeling our Sailfish 210 on the Gazebo simulation underwater environment extended from the Unmanned Underwater Vehicle (UUV) Simulator. AUVs trained via traditional deep reinforcement learning from fine-tuned reward functions – IPPO – were used for comparison. In addition, to evaluate the generalization of our method, we tested the trained policy of the previous task in two new and complex tasks with continuous obstacles and dense closure obstacles respectively. Moreover, the effect of time delay in communication on MAGAIL's performance was also studied in the original formation control and obstacle avoidance task. The experimental results show that multi-AUV trained via MAGAIL can achieve a better performance than those trained via traditional deep reinforcement learning from fine-tuned reward functions (IPPO). Moreover, control policies trained via MAGAIL in simple tasks can generalize better to complex tasks than those trained via IPPO.

2. Background

In this section, we provide background on multi-agent reinforcement learning, the decentralized partially observable Markov decision process (Dec-POMDP) and the dynamic model of AUV.

2.1. Multi-agent reinforcement learning

In single-agent reinforcement learning for AUV control (Sutton and Barto, 2018; Arulkumaran et al., 2017; Kober et al., 2013), an AUV learns a policy π via interacting with the underwater environment. Specifically, at time step t, the AUV selects an action a_t ∼ π(a|s_t) with the policy based on the detected current state s_t. Then the AUV will transition to the next state and receive a reward r_t from the environment. The goal of the AUV is to learn a policy π that maximizes the discounted accumulated return as follows:

V_π(s) = E[ Σ_{h≥0} γ^h r_t | a_t ∼ π(a|s_t) ],    (1)

where γ is the discount factor determining the value of future rewards and V_π(s) is the value of state s following policy π thereafter.

Multi-agent reinforcement learning for multi-AUV control involves multiple AUVs interacting with the underwater environment (Busoniu et al., 2008; Qie et al., 2019). In MARL, each AUV i has its own policy π_i and it can select an action a_{i,t} ∼ π_i(a_i|s_t) based on the observed current environmental state s_t at time step t. After the AUV transitions to another state in the underwater environment, it will receive a reward r_{i,t}. Similar to single-agent reinforcement learning for AUV control, the goal for each AUV is to learn a policy maximizing its discounted accumulated return as follows:

V_{π_1}(s) = E[ Σ_{h≥0} γ^h r_{1,t} | a_{1,t} ∼ π_1(a_1|s_t) ]
V_{π_2}(s) = E[ Σ_{h≥0} γ^h r_{2,t} | a_{2,t} ∼ π_2(a_2|s_t) ]    (2)
......
V_{π_i}(s) = E[ Σ_{h≥0} γ^h r_{i,t} | a_{i,t} ∼ π_i(a_i|s_t) ],

where π_i is the policy for AUV i. However, different from single-AUV reinforcement learning, in MARL the policy π_i of AUV i is influenced by the other AUVs' policies. The most common concept to solve this problem is the Nash Equilibrium (NE), which can be defined as:

V_{π_1^*}(s) ≥ V_{π_1}(s)  when AUVs 2, 3, …, i follow π_2^*, π_3^*, …, π_i^*
V_{π_2^*}(s) ≥ V_{π_2}(s)  when AUVs 1, 3, …, i follow π_1^*, π_3^*, …, π_i^*    (3)
......
V_{π_i^*}(s) ≥ V_{π_i}(s)  when AUVs 1, 2, …, i−1 follow π_1^*, π_2^*, …, π_{i−1}^*,

where V_{π_i^*}(s) represents the value of state s following the optimal policy π_i^* for AUV i. In an NE, AUV i will not try to change its policy π_i if the other AUVs do not change their policies, because its discounted accumulated return cannot continue to increase. That is to say, when the equilibrium state is reached, all AUVs' state values converge.

In MARL, the relationship between AUVs can be cooperative, competitive or a mixed setting, which is determined by the relationship between the reward functions of the AUVs (Zhang et al., 2021b). In a fully cooperative setting, all AUVs share the same reward function, i.e., R_1 = R_2 = ⋯ = R_i, because they perform the same task. They can even share a common policy π and value function V in this setting. In a fully competitive setting, there is a zero-sum relationship between the reward functions of the AUVs. If an AUV a and an AUV b are competitive, their corresponding reward functions can be set as R_a = −R_b. In other words, AUVs maximize their cumulative rewards by preventing each other from completing tasks. In a mixed setting, each AUV i is assigned its own task, i.e., each AUV has its own reward function R_i. At the same time, each AUV i can be cooperative or competitive with other agents. For example, in an attack-defense confrontation task, all AUVs are divided into two groups (Xu et al., 2021). AUVs in one group need to defend their own territory and occupy the territory defended by the other group of AUVs. In this case, AUVs within the same group can be considered cooperative and share a reward function, while AUVs across groups exhibit a competitive relationship. In the leader–follower formation control and obstacle avoidance of multi-AUV, the leader AUVs and follower AUVs can be divided into different groups, since they perform different tasks.

2.2. Dec-POMDPs

For AUVs in environments in which they have access to reliable state signals, methods based on the Markov decision process (MDP) have been shown to be able to successfully learn optimal policies on how to perform tasks. However, in many real tasks, an AUV usually suffers from limited sensing capability that precludes it from recovering a Markovian state signal with its perceptions. In this situation, the partially observable Markov decision process (POMDP) allows AUVs to make principled decisions under conditions of uncertain sensing (Spaan, 2012). As an extension of the POMDP, the decentralized partially observable Markov decision process (Dec-POMDP) (Fig. 1) models a number of AUVs that inhabit a particular underwater environment, which is considered at discrete time steps (Oliehoek et al., 2008).

Fig. 1. Decentralized partially observable Markov decision processes for multi-AUV control.

A decentralized partially observable Markov decision process is represented as a tuple ⟨n, S, A, T, O, f, h, b_0, R, π, γ⟩. n is the number of AUVs. The environment is represented by a finite set of states S. A = ×_i A_i is the set of joint actions, in which A_i is the set of actions available to AUV i. O = ×_i O_i is the set of joint observations of all AUVs, where O_i is the set of observations available to AUV i. For example, in formation control and obstacle avoidance of multi-AUV, at the current time step t, each AUV will observe its own state with limited perceptions, including the distances between AUVs and from obstacles, and the whole system will have a joint observation o = ⟨o_1, o_2, …, o_n⟩. One joint action a_t = ⟨a_{1,t}, a_{2,t}, …, a_{n,t}⟩ will be taken based on the joint observation, and each AUV only knows its own action. T is a transition function defined as T : S × A × S → P(S) ∈ [0, 1], so the probability of a particular next state can be obtained as P(s_{t+1}|s_t, a_t) based on the current state and joint action. f is the observation function, a mapping from the AUVs' joint actions and successor states to probability distributions over joint observations, f : A × S → P(O) ∈ [0, 1], so the probability of a joint observation can be taken as P(o_t|a_{t−1}, s_t). Similar to how the transition model T describes the stochastic influence of actions on the underwater environment, the observation model f describes how the state of the underwater environment is perceived by the AUVs. h is the horizon of the decision problem. b_0 ∈ P(S) is the initial state distribution at time step t = 0. For a fully cooperative setting, all AUVs share a joint reward function R defined as R : S × A × S → ℝ, and each AUV will get a reward r_t = R(s_t, a_t) based on the detected state and the joint action taken at time step t. The reward function can be different for each AUV in competitive settings. A tuple of policies π = ⟨π_1, π_2, …, π_n⟩ called a joint policy will be learned to maximize the expected cumulative return Σ_{t=0}^{∞} γ^t r_t, where γ ∈ (0, 1] is the discount factor. In the task of formation control and obstacle avoidance for multi-AUV, each AUV will learn a policy that can keep the formation and avoid obstacles at the same time.
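To make the Dec-POMDP notation above concrete, the sketch below shows how a decentralized step can be organized in code: each AUV holds its own policy and sees only its own local observation, while the environment consumes the joint action. This is a minimal illustration written for this article, not the authors' implementation; the names `AUVAgent` and `decentralized_step` and the random placeholder policies are assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Local observation o_i of one AUV: distance to route/leader, heading
# deviation and the q sector distances to obstacles (see Section 3.3).
Observation = List[float]
Action = int

@dataclass
class AUVAgent:
    """One AUV with its own decentralized policy pi_i(a_i | o_i)."""
    name: str
    policy: Callable[[Observation], Action]

def decentralized_step(agents: Sequence[AUVAgent],
                       local_obs: Dict[str, Observation]) -> Dict[str, Action]:
    """Each AUV picks an action from its OWN observation only (DTDE);
    the joint action <a_1, ..., a_n> is what the environment executes."""
    return {ag.name: ag.policy(local_obs[ag.name]) for ag in agents}

if __name__ == "__main__":
    # Two AUVs with placeholder random policies over 5 and 10 discrete actions.
    leader = AUVAgent("leader", policy=lambda o: random.randrange(5))
    follower = AUVAgent("follower", policy=lambda o: random.randrange(10))
    obs = {"leader": [0.0, 0.0] + [25.0] * 6, "follower": [15.0, 0.0] + [25.0] * 6}
    print(decentralized_step([leader, follower], obs))
```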

2.3. Model of AUV

The AUV simulator used in our experiments is modeled based on the physical AUV Sailfish 210 developed by our lab. The length of the AUV is 2.3 m and the diameter is 0.21 m. Its weight is 72 kg. The maximum speed of the AUV is 5 knots. It has two steering rudders and a thruster in its tail, as shown in Fig. 2. Sailfish 210 can be fitted with sensors such as sonar, camera, DVL, etc., to perform a variety of underwater missions, such as path tracking and obstacle avoidance.

Fig. 2. The thruster and rudders of the AUV simulator.

Two coordinate systems are introduced in the AUV model: the inertial coordinate system (ξ, η, ζ) and the motion coordinate system (x, y, z), as shown in Fig. 3. In the inertial coordinate system, the position P of the AUV can be expressed as follows:

P = [x, y, z, φ, θ, ψ]^T,    (4)

where x, y, z represent the barycentric coordinates of the AUV and φ, θ, ψ represent the attitude angles of the AUV (da Silva et al., 2007). The speed V of the AUV can be expressed as:

V = [u, v, w, p, q, r]^T,    (5)

where u is the longitudinal velocity, v is the lateral velocity, w is the vertical velocity, p is the roll angle velocity, q is the pitch angle velocity and r is the yaw angle velocity (da Silva et al., 2007).

Fig. 3. The inertial coordinate system and motion coordinate system of the AUV model.

The dynamic model of the AUV can be established (Fossen, 2011) as

M v̇ + C(v)v + D(v)v − g(η) = τ + g_0,    (6)

where M represents the inertia coefficient matrix of the system; C(v) is the Coriolis force coefficient matrix; D(v) is the viscous hydrodynamic coefficient matrix; g(η) represents the restoring force/torque vector; τ represents the control input vector; g_0 is the static load vector; v̇ is the time derivative of the velocity vector.

In this paper, we conducted experiments in the Gazebo simulation environment extended from the Unmanned Underwater Vehicle (UUV) Simulator (Manhães et al., 2016) by simulating our Sailfish 210. The above parameters and dynamics characteristics of the AUV are already integrated into the UUV Simulator. The thruster of the AUV is also modeled in the simulation platform. According to the characteristics of the thruster, the relationship between the speed of the thruster v_p and the AUV speed can be expressed as (Huang et al., 2021):

v_p = (1 − w)u / (D J_p),    (7)

where D is the diameter of the thruster, w represents the vertical velocity of the AUV, and J_p is the advance ratio, which represents the ratio of the thruster advance speed to the blade tip linear speed.

In addition, the rudder blades are also modeled for motion control of the AUV in the simulation platform. The dynamic model of the rudder blade (Zhang et al., 2017) can be defined as:

δ_{v,h,real} = { δ_{v,h},       −δ_{v,h,max} < δ_{v,h} < δ_{v,h,max}
                 δ_{v,h,max},    δ_{v,h,max} ≤ δ_{v,h}    (8)
                 −δ_{v,h,max},   δ_{v,h} ≤ −δ_{v,h,max}

where δ_{v,h,real} represents the actual angle value of each rudder; δ_{v,h} represents the input radian value for controlling the rudder; δ_{v,h,max} represents the maximum angle value of each rudder, which is π/4 in our setting.
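As a quick illustration of Eqs. (7) and (8), the snippet below computes the thruster speed from the advance-ratio relation and applies the rudder saturation, using δ_max = π/4 as in our setting. It is only a sketch of the two formulas; the function names and the numeric values passed in the example are illustrative assumptions, not parameters taken from the Sailfish 210 model.

```python
import math

DELTA_MAX = math.pi / 4  # maximum rudder angle (rad), as set in this paper

def thruster_speed(u: float, w: float, D: float, J_p: float) -> float:
    """Eq. (7): v_p = (1 - w) * u / (D * J_p).
    u: longitudinal velocity, w: vertical velocity,
    D: thruster diameter, J_p: advance ratio."""
    return (1.0 - w) * u / (D * J_p)

def rudder_angle(delta_cmd: float, delta_max: float = DELTA_MAX) -> float:
    """Eq. (8): saturate the commanded rudder angle to [-delta_max, delta_max]."""
    return max(-delta_max, min(delta_max, delta_cmd))

if __name__ == "__main__":
    # Hypothetical numbers purely for demonstration.
    print(thruster_speed(u=1.5, w=0.0, D=0.15, J_p=0.8))
    print(rudder_angle(0.9), rudder_angle(-1.2), rudder_angle(0.3))
```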

3. Approach

In this section, we present the implementation of MAGAIL for formation control and obstacle avoidance of multi-AUV.

3.1. Decentralized training with decentralized execution

There are three frameworks for training agents with multi-agent reinforcement learning: centralized training with centralized execution (CTCE), centralized training with decentralized execution (CTDE) and decentralized training with decentralized execution (DTDE) (Zhang et al., 2021b). For centralized training, there will be a central controller that collects global observations of all agents and makes unified decisions. The advantage of centralization is the good training effect, and the disadvantage is that agents need to communicate with the central controller when training or making decisions, which might cause time delay and thus affect the training speed (Sharma et al., 2021; Chen, 2019). Moreover, there are communication constraints and disturbances in some application domains, which might make it impossible for the central controller to get global observations in real time.

Different from the centralized setting, in the decentralized setting there is no need for a central controller to collect all agents' information. Instead, each agent will make decisions based on its local observation, without explicit information exchange with each other (Zhang et al., 2021b). This can greatly reduce the communication burden between agents and meet the real-time requirements of tasks, especially in environments where communication is constrained. For example, in order to solve the problem that traditional centralized sensor scheduling frameworks cannot meet the real-time requirements of vehicle tracking due to communication limitations, Liang et al. (2020) proposed a multi-agent distributed sensor scheduling framework and developed a fully decentralized (namely DTDE) multi-agent reinforcement learning algorithm to design their framework. Their simulation results show the convergence and the superiority of the proposed algorithm.

In underwater marine environments, it is very difficult for multi-AUV to communicate with each other because of limited communication bandwidth and disturbances. This means the multi-AUV system cannot obtain the complete environmental states of all AUVs. Moreover, the formation control and obstacle avoidance task requires AUVs to make decisions in real time. Therefore, a centralized setting is not appropriate for our task. In this case, in our implementation, we adopted the DTDE framework as in Zhang et al. (2021b), in which each AUV i only needs to obtain its own local observation o_i instead of all state information during training. With DTDE, the AUVs do not need to share information with each other but detect other AUVs with sensors, and the policy training of each AUV can be done locally, avoiding the communication problem.

3.2. MAGAIL for formation control and obstacle avoidance of multi-AUV

In our implementation of multi-agent generative adversarial imitation learning (MAGAIL) for formation control and obstacle avoidance of multi-AUV, as shown in Fig. 4, each AUV will learn a discriminator, an Actor network (i.e., control policy) and a Critic network (i.e., value function). The Actor network and Critic network are trained with Independent Proximal Policy Optimization (IPPO) (de Witt et al., 2020), which is an extension of Proximal Policy Optimization (PPO) (Schulman et al., 2017) to multi-agent systems. The discriminator is used to provide reward signals for updating the Actor network and Critic network. PPO restricts the update of the policy by ratio clipping to improve the stability of each update. In the multi-AUV system, each AUV uses PPO to learn a decentralized policy π_{θ_i} with individual policy clipping. Meanwhile, a decentralized value function V_{φ_i} will be updated with the local observable state o_i using generalized advantage estimation (GAE).

Fig. 4. Illustration of the implementation of multi-agent generative adversarial imitation learning for formation control and obstacle avoidance of multi-AUV.

Algorithm 1 Multi-agent generative adversarial imitation learning for formation control and obstacle avoidance of multi-AUV
Input:
    The local observation o_i of each AUV
    The action a_i of each AUV
    The expert trajectory τ_i^E ∼ π_i^E of each AUV
Output:
    Learned policy network π_{θ_i} of each AUV
Initialize parameters θ_i, φ_i, ω_i for π_{θ_i}, V_{φ_i} and D_{ω_i}
for e = 0, 1, 2, ... do:
    τ_i = []
    for t = 0, 1, 2, ..., T do:
        Each AUV detects its local observation o_{i,t}
        Select and execute an action a_{i,t} with policy π_{θ_i} taking the local observation o_{i,t} as input
        Store the state–action pair (o_{i,t}, a_{i,t}) in the policy trajectory τ_i
        The next local observation o_{i,t+1} is updated as the current local observation
    end for
    Update the discriminator parameter ω_i
    Get reward r_{D_i} from the discriminator
    Use IPPO to update the policy parameter θ_i and the value function parameter φ_i
end for

Specifically, as shown in Algorithm 1, each AUV will first get demonstrated expert trajectories as input. During each episode of training, at time step t, AUV i will obtain a local observation o_{i,t} and select an action a_{i,t} with its policy π_{θ_i}:

a_{i,t} ∼ π_{θ_i}(a_i | o_{i,t}).    (9)

After executing the selected action a_{i,t}, AUV i will move to the next state at time step t + 1. This cycle continues until time step T, i.e., the end of one episode. The state–action pair (o_{i,t}, a_{i,t}) at time step t of AUV i will be stored. All state–action pairs during each episode's interactions with the environment compose the AUV's trajectory τ_i = [(o_{i,1}, a_{i,1}), (o_{i,2}, a_{i,2}), …]. The AUV's trajectory τ_i ∼ π_i and the demonstrated expert trajectory τ_i^E ∼ π_{θ_i^E} generated by the expert policy of AUV i will be used to update the discriminator D_{ω_i} with the ADAM optimizer (Zhang, 2018) using the following loss:

E_{τ_i}[log(D_{ω_i}(o, a))] + E_{τ_i^E}[log(1 − D_{ω_i}(o, a))].    (10)

The loss function is analogous to that of GANs (Creswell et al., 2018), which draws an analogy between imitation learning and GANs.

The discriminator D_{ω_i} of AUV i can be used to provide a reward r_{D_i,t} for each state–action pair (o_{i,t}, a_{i,t}) at time step t:

r_{D_i,t} = −log(1 − D_{ω_i}(o_{i,t}, a_{i,t})).    (11)

Then, we can get the rewards r_{D_i} of AUV i for the whole trajectory τ_i as

r_{D_i} = D_{ω_i}(τ_i)
        = D_{ω_i}[(o_{i,0}, a_{i,0}), (o_{i,1}, a_{i,1}), …, (o_{i,T}, a_{i,T})]    (12)
        = [r_{D_i,0}, r_{D_i,1}, …, r_{D_i,T}].

The rewards r_{D_i} for the whole trajectory of AUV i will be used to update the Actor network – the policy π_{θ_i} – and the Critic network – the value function V_{φ_i} – with the IPPO algorithm using the loss:

L(θ_i, φ_i) = L_policy(θ_i) + c_1 L_value(φ_i) + c_2 H(π_{θ_i}),    (13)

where L_policy(θ_i) is the loss for updating the policy network π_{θ_i}, L_value(φ_i) is the loss to update the value function V_{φ_i}, H(π_{θ_i}) is the entropy regularization, and c_1 and c_2 are coefficients. For details about the IPPO algorithm, please refer to de Witt et al. (2020).

In addition, in the task, both AUVs need to perform two sub-tasks, formation control and obstacle avoidance, at the same time. In the expert trajectory, AUVs do not encounter and detect obstacles all the time (the detecting distance of the sonar sensor is set to be 25 m). Therefore, there are many more state–action pairs without detected obstacles than those with detected obstacles. If we randomly select state–action pairs (o, a) of the expert trajectory to train the discriminator as in the original MAGAIL algorithm, the discriminator may be biased during training and cannot provide effective and distinctive rewards for the AUVs to learn how to perform the two sub-tasks. The learning speed of an AUV with the original MAGAIL is very slow and its policy is difficult to converge. Therefore, different from the original MAGAIL algorithm, we divided the state–action pairs (o, a) of the expert trajectory for each AUV into two groups in our implementation: state–action pairs with detected obstacles and without detected obstacles. When training the discriminator, we selected an equal number of state–action pairs from both groups to update it. The learning speed of our method was experimentally shown to be much faster than the original MAGAIL.
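The sketch below illustrates the modification described above: expert state–action pairs are split into a "with detected obstacles" group and a "without detected obstacles" group, and each discriminator update draws the same number of samples from both groups before applying a binary cross-entropy whose optimum matches the adversarial objective in Eq. (10), together with the reward of Eq. (11). It is a minimal PyTorch sketch written for this article, not the authors' code; the network size, the batch size of 256 and the `obstacle_detected` helper are assumptions (only the 25 m sonar range follows the text).

```python
import random
import torch
import torch.nn as nn

def obstacle_detected(obs, sonar_range=25.0):
    # Assumes the last 6 entries of o_i are the sector distances (Section 3.3):
    # a pair counts as "with obstacle" if any sector reads below the sonar range.
    return min(obs[-6:]) < sonar_range

class Discriminator(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def update_discriminator(disc, opt, expert_pairs, policy_pairs, batch=256):
    """One balanced update: equal numbers of expert pairs with and without
    detected obstacles. Actions are assumed to be encoded as float vectors
    (e.g., one-hot over the discrete action sets in Tables 1 and 2)."""
    with_obs = [p for p in expert_pairs if obstacle_detected(p[0])]
    without_obs = [p for p in expert_pairs if not obstacle_detected(p[0])]
    half = batch // 2
    expert_batch = random.choices(with_obs, k=half) + random.choices(without_obs, k=half)
    policy_batch = random.choices(policy_pairs, k=batch)

    def to_tensors(pairs):
        o = torch.tensor([p[0] for p in pairs], dtype=torch.float32)
        a = torch.tensor([p[1] for p in pairs], dtype=torch.float32)
        return o, a

    bce = nn.BCELoss()
    eo, ea = to_tensors(expert_batch)
    po, pa = to_tensors(policy_batch)
    # Expert pairs labelled 1, the AUV's own pairs labelled 0: the optimum drives
    # D towards 1 on expert-like pairs, so the reward below grows with expert-likeness.
    loss = bce(disc(eo, ea), torch.ones(batch, 1)) + \
           bce(disc(po, pa), torch.zeros(batch, 1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def discriminator_reward(disc, obs, act):
    """Eq. (11): r_D = -log(1 - D(o, a))."""
    with torch.no_grad():
        d = disc(torch.tensor([obs], dtype=torch.float32),
                 torch.tensor([act], dtype=torch.float32))
    return float(-torch.log(1.0 - d + 1e-8))
```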

3.3. Observation and action representation

For simplification, we consider a system with a leader AUV and a follower AUV tracking a desired path and avoiding obstacles at the same time. This can also be extended to more AUVs and different tasks. The requirements to perform the task are as follows:

(1) The leader AUV should follow the desired route when there is no obstacle, and return to the route after avoiding obstacles;
(2) The follower AUV should avoid obstacles while following the leader AUV and continuously monitor the state of the leader AUV;
(3) All AUVs can only explore and perform the task in an area of limited range, or the task is considered as failed;
(4) The heading deviation of the AUVs should not be too large, otherwise the task is considered as failed.

Fig. 5. Illustration of the task for our multi-AUV system and the representations of the observation space for each AUV. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

As shown in Fig. 5, the task area is bounded by the orange line and the two axes. In the task area, the red line is the route that the leader AUV is going to follow and the solid red circles are obstacles to avoid. Each AUV has M rangefinder sensors with a total visual span of S radians. We divided the visual span of each AUV into q parts. Therefore, each part has m = M/q sensors and the AUV will take the shortest distance detected from all parts as the distance from an obstacle. d_L and d_F are the shortest distances from obstacles for the leader AUV and the follower AUV respectively. The local observation of the leader AUV is represented as o_1 = [g_L, a_L, d_{L1}, d_{L2}, …, d_{Lq}], where d_{Li} represents the distance from the obstacle detected by the sensors in part i of the divided visual span for the leader AUV, i = 1, 2, …, q. g_L represents the distance from the route. c is the expected heading point of the leader AUV, which is set to be 15 m ahead of the AUV's current position along the target route. a_L is the heading deviation of the leader AUV, which is the angle between the line from the leader AUV's current position to the expected heading point c and the leader AUV's moving direction.

For the follower AUV, the local observation is represented as o_2 = [g_F, a_F, d_{F1}, d_{F2}, …, d_{Fq}], where d_{Fi} represents the distance from the obstacle detected by the sensors in part i of the divided visual span for the follower AUV, i = 1, 2, …, q. g_F represents the distance from the follower AUV to the leader AUV. The heading deviation a_F is the angle between the line from the follower AUV to the leader AUV and the follower AUV's moving direction.

The action spaces for both AUVs are discrete. All actions for each AUV can be divided into three groups: turning left, turning right and going straight ahead. The number of actions available in each group can be defined differently for different tasks.
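A small sketch of how the range observation described in this section can be assembled: the M rangefinder returns are split into q equal sectors, each sector contributes the minimum distance it sees, and the overall minimum gives d_L (or d_F). The numbers M = 600 and q = 6 follow Section 4.2; the function name and the clipping to the 25 m detection range are assumptions of this sketch, not the authors' code.

```python
import numpy as np

def build_observation(goal_dist, heading_dev, ranges, q=6, max_range=25.0):
    """Return ([g, a, d_1, ..., d_q], overall minimum distance) for one AUV.
    goal_dist: distance from the route (leader) or from the leader AUV (follower)
    heading_dev: heading deviation a_L or a_F in radians
    ranges: array of M rangefinder readings (M divisible by q)."""
    ranges = np.clip(np.asarray(ranges, dtype=float), 0.0, max_range)
    sectors = ranges.reshape(q, -1)          # q parts with m = M/q sensors each
    d_parts = sectors.min(axis=1)            # shortest distance in each part
    return np.concatenate(([goal_dist, heading_dev], d_parts)), float(d_parts.min())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    readings = rng.uniform(5.0, 30.0, size=600)   # M = 600 simulated returns
    obs, d_min = build_observation(goal_dist=2.0, heading_dev=0.1, ranges=readings)
    print(obs.shape, d_min)                        # (8,), overall shortest distance
```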

4. Experiments and results

4.1. Experimental platform

We evaluate our method for formation control and obstacle avoidance of multi-AUV on a simulated AUV system of our lab in the Gazebo simulation environment extended from the Unmanned Underwater Vehicle (UUV) Simulator (Manhães et al., 2016). Gazebo is a visual robot simulation platform supported by ROS. It can simulate the real environment and the physical properties of a robot, build the dynamics model of the robot and simulate the sensor model, providing a simulation environment that is quite close to the real world. It also provides a visual interface for participants to observe the robot in the environment. This makes it convenient for researchers to test and observe the effect of algorithms before deploying them on a real robot in the environment.

The UUV Simulator extends Gazebo's capabilities with underwater robots, sensors, and environmental models implemented through new plug-ins that simulate the effects of underwater hydrostatics and hydrodynamics, thrusters, sensors, and external disturbances (Manhães et al., 2016). It can simulate multiple underwater robots and perform intervention tasks using the robot operator.

Fig. 6. A screenshot of the Gazebo simulation platform with simulators of a leader AUV and a follower AUV in the simulated underwater environment.

Fig. 6 shows a leader AUV and a follower AUV modeled on our Sailfish 210 in the simulated underwater environment in the UUV Simulator. Our AUV simulator relies on a rear thruster for propulsion. In the tail, there are four rudders: two vertical ones and two horizontal ones. The vertical rudders can be used to control the heading and steering of the AUV, while the horizontal rudders can be used to control the rising and diving and the pitch angle of the AUV. The speed of the AUV can be controlled by setting the thruster speed. In addition, the AUV can receive the data for perception by invoking the appropriate sensors, like sonar, DVL and so on. The underwater environment can be configured to simulate the real environment, e.g., by placing obstacles, setting the speed of the ocean current, etc. In our experiments, the wind speed coefficient is set to 12, the maximum flow speed coefficient is set to 5 and the minimum is set to 0.

4.2. Experimental setup

In our experiments, we set Ω = {(x, y) | 0 ≤ x ≤ 100, −30 ≤ y ≤ 30} as the task area, as shown in Fig. 7. The positions of the leader AUV and follower AUV are initialized at (1, 0) and (15, 0) respectively. Cylindrical obstacles are placed at (66, −2.5), (80, 20), (118, −20), (132, 2.5) with a radius of 1.5 m. The horizontal line y = 0 is set to be the target route for the leader AUV to follow.

Fig. 7. Illustration of the task configuration in the underwater environment for formation control and obstacle avoidance.

The task for the whole system will be suspended and a new episode starts under the following conditions:

(1) The leader AUV is more than 30 m away from the target route, i.e., |g_L| ≥ 30;
(2) The heading deviation of the leader AUV is too large, i.e., |a_L| ≥ π;
(3) The follower AUV is too far from or too close to the leader AUV, i.e., g_F < 5 m or g_F > 25 m;
(4) The heading deviation of the follower AUV is too large, i.e., |a_F| ≥ π/3;
(5) The shortest distance from an AUV to an obstacle is smaller than 2 m, i.e., d_L ≤ 2 or d_F ≤ 2.

In addition, each AUV has M = 600 rangefinder sensors and a visual span S = 2π/3. The detection range is 25 m. To ensure that all AUVs can accurately detect obstacles and avoid them smoothly, we divided the M = 600 rangefinder sensors of each AUV into 6 parts.

As the leader AUV only needs to track the target path and avoid obstacles, we assign 5 actions to it: two actions for turning left, one action for going straight and two actions for turning right, as shown in Table 1. This allows the leader AUV to follow the path and avoid obstacles smoothly. The follower AUV needs to track the leader AUV and keep a distance of 15 m from it, so it needs more actions to adjust its distance from the leader AUV while avoiding obstacles at the same time. Therefore, we assign 10 actions to the follower AUV: four actions for turning left, two actions for going straight and four actions for turning right, which allows the follower AUV to speed up or slow down and avoid being too close to or too far away from the leader AUV, as shown in Table 2.

Table 1
Actions available to the leader AUV (the units of Fin and Thruster are rad and r/s, respectively).

Action        Fin1    Fin2  Fin3    Fin4  Thruster
Turn left 1   −0.25   0     0.25    0     −300
Turn left 2   −0.364  0     0.364   0     −300
Go straight   0       0     0       0     −300
Turn right 1  0.25    0     −0.25   0     −300
Turn right 2  0.364   0     −0.364  0     −300

Table 2
Actions available to the follower AUV (the units of Fin and Thruster are rad and r/s, respectively).

Action         Fin1    Fin2  Fin3    Fin4  Thruster
Turn left 1    −0.25   0     0.25    0     −250
Turn left 2    −0.25   0     0.25    0     −350
Turn left 3    −0.364  0     0.364   0     −250
Turn left 4    −0.364  0     0.364   0     −350
Go straight 1  0       0     0       0     −250
Go straight 2  0       0     0       0     −350
Turn right 1   0.25    0     −0.25   0     −250
Turn right 2   0.25    0     −0.25   0     −350
Turn right 3   0.364   0     −0.364  0     −250
Turn right 4   0.364   0     −0.364  0     −350

All actions available to the two AUVs are controlled by setting the propeller speed and rudder angles. As shown in Tables 1 and 2, Thruster represents the propeller speed, which can be set to control the speed of the AUV. Fin1 and Fin3 represent the angles of the upper and lower rudders, which can be set to control the horizontal direction of the AUV. Fin2 and Fin4 represent the angles of the left and right rudders, which can control the AUV to float and dive in the underwater environment. In our experiments, Fin2 and Fin4 are set to 0, as the tasks are performed in a 2D space.

In the experiments, we set the discount factors γ = 0.99, λ = 1.0 and the clipping factor ε = 0.05 in the IPPO algorithm. For each AUV, the discriminator D_{ω_i} is updated 3 times within one episode, while the policy π_{θ_i} and the value function V_{φ_i} are updated 9 times. This was tested to be the best way to improve the policy and discriminator simultaneously. Each time, 256 local observable state–action pairs were randomly selected from both the AUV's trajectory τ_i and the expert trajectory τ_i^E respectively to update the discriminator of AUV i, and 128 local observable state–action pairs from the AUV's trajectory τ_i were randomly selected to update the policy of AUV i.

4.3. Evaluation metrics

As the rewards provided by the discriminator are subjective, to evaluate the learning performance of both AUVs via MAGAIL, we defined reward functions for both AUVs as evaluation metrics.

4.3.1. Leader AUV

In the task, the leader AUV needs to complete path tracking and obstacle avoidance at the same time. Therefore, its reward function r_L is composed of two parts, and can be defined as:

r_L = r_L^c + r_L^a,    (14)

where r_L^c represents rewards for path tracking and r_L^a represents rewards for obstacle avoidance. r_L^c provides rewards based on the AUV's heading deviation a_L and distance from the target route g_L, as:

r_L^c = −|a_L|/3.14 + |1 − g_L/30|.    (15)

r_L^a is defined based on the distance from obstacles, as:

r_L^a = { −2,  5 < d_L ≤ 8
          −4,  2 < d_L ≤ 5.    (16)

In addition, if the leader AUV meets a condition to suspend the current episode in the task, as described in Section 4.2, it will receive additional rewards, as:

r_L = { −200,  30 ≤ |g_L|
        −200,  π ≤ |a_L|    (17)
        −200,  d_L ≤ 2.

4.3.2. Follower AUV

The follower AUV needs to track the leader AUV at a certain distance and avoid obstacles at the same time. Therefore, its reward function is also composed of two parts and defined as:

r_F = r_F^c + r_F^a,    (18)

where r_F^c represents rewards for tracking the leader AUV and r_F^a represents rewards for obstacle avoidance. r_F^c provides rewards based on the follower AUV's heading deviation a_F and distance from the leader AUV g_F, as:

r_F^c = −3|a_F|/3.14 + |1 − |g_F − 15|/10|.    (19)

Same as for the leader AUV, r_F^a is defined based on its distance from obstacles, as:

r_F^a = { −2,  5 < d_F ≤ 8
          −4,  2 < d_F ≤ 5.    (20)

In addition, similar to the leader AUV, the follower AUV will receive additional rewards if it meets a condition to suspend the current episode in the task, as described in Section 4.2, as:

r_F = { −200,  g_F < 5 or 25 < g_F
        −200,  π/3 ≤ |a_F|    (21)
        −200,  d_F ≤ 2.

Note that the above defined reward functions for both the leader AUV and follower AUV are never used for training the control policy with the MAGAIL method, but only used for testing the learned policies from demonstrated trajectories. In addition, we used the traditional deep reinforcement learning method for multi-agent systems – IPPO (de Witt et al., 2020) – as the baseline for comparison, which learned from only the above defined reward functions for each AUV.
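For reference, the evaluation rewards of Eqs. (14)–(21) can be written compactly as below. This is a plain transcription of the formulas for illustration, assuming that the −200 terminal penalty is returned when a suspend condition is met (the text calls it an additional reward; the exact combination with the per-step terms is not spelled out). The function names are ours.

```python
import math

def obstacle_reward(d):
    """Eqs. (16)/(20): penalty that grows as the AUV gets close to an obstacle."""
    if 5.0 < d <= 8.0:
        return -2.0
    if 2.0 < d <= 5.0:
        return -4.0
    return 0.0

def leader_reward(a_L, g_L, d_L):
    """Eqs. (14)-(17) for the leader AUV."""
    if abs(g_L) >= 30.0 or abs(a_L) >= math.pi or d_L <= 2.0:
        return -200.0                                          # Eq. (17)
    r_track = -abs(a_L) / 3.14 + abs(1.0 - g_L / 30.0)         # Eq. (15)
    return r_track + obstacle_reward(d_L)                      # Eq. (14)

def follower_reward(a_F, g_F, d_F):
    """Eqs. (18)-(21) for the follower AUV."""
    if g_F < 5.0 or g_F > 25.0 or abs(a_F) >= math.pi / 3.0 or d_F <= 2.0:
        return -200.0                                          # Eq. (21)
    r_track = -3.0 * abs(a_F) / 3.14 + abs(1.0 - abs(g_F - 15.0) / 10.0)  # Eq. (19)
    return r_track + obstacle_reward(d_F)                      # Eq. (18)

if __name__ == "__main__":
    print(leader_reward(a_L=0.1, g_L=2.0, d_L=30.0))
    print(follower_reward(a_F=0.2, g_F=16.0, d_F=6.0))
```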

In our experiments, we trained the leader and follower AUV via our implementation of MAGAIL learning from optimal expert demonstrations and sub-optimal demonstrations respectively. That is to say, optimal or sub-optimal demonstrated trajectories are taken as input to the controllers of the leader AUV and follower AUV separately. Then a control policy is learned for each AUV via MAGAIL by interacting with the underwater environment. In general, the optimal and sub-optimal demonstrations can be obtained by recording human operators controlling a vehicle to complete tasks on a simulation platform or even in the real marine environment. For simplicity, in this paper, the optimal and sub-optimal demonstrations are generated by training both AUVs with IPPO and evaluated with the above defined reward functions for the leader AUV and follower AUV respectively. Specifically, the optimal and sub-optimal expert trajectories for the leader AUV are generated by running a policy trained via IPPO which can receive about 200 rewards and 180 rewards respectively within one episode, while the optimal and sub-optimal expert trajectories for the follower AUV are generated by running a policy trained via IPPO which can receive about 200 rewards and 150 rewards respectively within one episode.

4.4. Results and analysis

In this section, we present and analyze the experimental results by comparing the performance of learned control policies trained via MAGAIL with optimal expert demonstrations and sub-optimal expert demonstrations to the traditional reinforcement learning algorithm for multi-agent systems learning from predefined reward functions — IPPO, in the original task configuration described in Section 4.2. In addition, to evaluate the generalization of our method, we tested the trained control policies of the original task in two different and complex tasks with continuous obstacles and dense closure obstacles respectively.

4.4.1. Learning curves

Fig. 8. Cumulative rewards received by the leader AUV and follower AUV via MAGAIL with optimal demonstrations (MAGAIL Optimal), sub-optimal demonstrations (MAGAIL Suboptimal) and IPPO, in the original formation control and obstacle avoidance task (averaged over data collected in three trials). Note that the shaded area is the 0.95 confidence interval and the bold line is the mean performance. Two red lines show the performance of the expert optimal and sub-optimal demonstrations. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

We first evaluate and compare the learning curves of the leader and follower AUV trained via MAGAIL with optimal and sub-optimal demonstrations in the original formation control and obstacle avoidance task, measured by cumulative rewards received per episode according to the reward functions defined for the leader AUV and follower AUV in Section 4.3 respectively. As shown in Fig. 8, for both the leader AUV and follower AUV, the MAGAIL Optimal agent learned much faster and better than the IPPO agent. Even with sub-optimal demonstrated trajectories, the MAGAIL Suboptimal agent allows the leader AUV to learn faster and better than the IPPO agent, while the follower AUV can also obtain a performance comparable to that of the IPPO agent at a similar speed. However, the performance of the MAGAIL Optimal agent and the MAGAIL Suboptimal agent is limited by the performance of the respective expert demonstrations. Moreover, our results in Fig. 8 show that the better the provided expert demonstrations are, the higher the performance each AUV trained via MAGAIL can obtain.

4.4.2. Performance in the original task

We tested the final policies trained via MAGAIL and IPPO for the leader AUV and follower AUV in the original task. As the performance of the MAGAIL Suboptimal agent is similar to that of the IPPO agent, we only tested and compared the MAGAIL Optimal agent to the IPPO agent. Figs. 9 and 10 show the local observations, including the distance from the target route (or leader AUV), heading deviation and distance from obstacles, and the trajectories of the leader and follower AUV in the testing process.

As can be seen from Figs. 9 and 10, both the leader AUV and follower AUV trained via MAGAIL and IPPO can complete the original formation control and obstacle avoidance task. The leader AUV can generally track the target path and only moves away from the path when avoiding obstacles, as shown in Fig. 9(a). The follower AUV can generally follow the leader AUV and keep a distance of around 15 m, with the distance fluctuating around this value when avoiding obstacles, as shown in Fig. 9(d). When the follower AUV detected an obstacle, it prioritized obstacle avoidance and continued tracking the leader AUV after successfully avoiding the obstacle, which can be seen from the dramatically increased or decreased heading deviation close to obstacles and the small fluctuation around 0 at other places in Fig. 9(e). Moreover, the trajectory, distance from the target route (or leader AUV), heading deviation and distance from obstacles of both the leader AUV and follower AUV trained via MAGAIL are quite similar to the expert demonstrations. The MAGAIL agent even allows both the leader AUV and follower AUV to avoid obstacles at a closer distance than the expert demonstrations, as shown in Fig. 9(c) and (f).

Fig. 9. The distance from the target route (or leader AUV), heading deviation and distance from obstacles of the leader AUV (a,b,c) and follower AUV (d,e,f) during the process of testing the final trained policies via MAGAIL and IPPO respectively in the original task.

Fig. 10. Tested trajectories of the leader and follower AUV with final control policies trained via MAGAIL learning from demonstrations and IPPO in the original formation control and obstacle avoidance task.

In contrast, the leader and follower AUV trained via IPPO kept a farther distance from obstacles than the MAGAIL agent, which means the trajectory of the IPPO agent is not the shortest one, though a bit safer than that of the MAGAIL agent, as shown in Fig. 10. In addition, the starting and completion of obstacle avoidance by both the leader AUV and follower AUV trained via MAGAIL were earlier than those of both AUVs trained via IPPO. However, the follower AUV trained via IPPO can follow the leader AUV better than the one trained via MAGAIL, as shown by the smaller fluctuations of the heading deviation of the follower AUV trained via IPPO compared to that of the MAGAIL agent in Fig. 9(e). This is because the expert demonstrations changed in a similar way, and the follower AUV trained via MAGAIL is trying to imitate the behavior of the expert demonstrations. If better demonstrations could be provided, the follower AUV trained via MAGAIL could track the leader AUV in a better way.

In summary, these results suggest that both the leader and follower AUVs trained via MAGAIL with expert demonstrations can achieve a better performance than the ones trained via traditional deep reinforcement learning from a fine-tuned reward function (i.e., IPPO), and both AUVs trained via MAGAIL can finish the formation control and obstacle avoidance task with a shorter trajectory than those trained via IPPO.

4.4.3. Generalization to different and complex tasks

To evaluate the generalization of our implemented MAGAIL, we saved the final control policies of both the leader AUV and follower AUV trained with MAGAIL and IPPO in the previous original formation control and obstacle avoidance task, and tested them in two new and complex tasks: one with continuous obstacles and the other with dense closure obstacles on the target route, respectively. In the first new task, the obstacles are placed in a line on the target route to follow, and in the second new task, obstacles form a circle on the route, with only 2 relatively large openings left for the AUVs to enter and exit. Other settings for the two tasks are the same as the original one.

Figs. 11, 12 and 13 show the local observations, including the distance from the target route (or leader AUV), heading deviation and distance from obstacles, and the trajectories of the leader and follower AUV in the two new tasks respectively. Results in both tasks show that the control policies of the leader AUV and follower AUV trained via both MAGAIL and IPPO can generalize well to different and complex tasks. However, the MAGAIL agent allows both the leader and follower AUVs to obtain shorter trajectories than the IPPO agent. Moreover, trajectories generated by both AUVs trained via MAGAIL are closer to obstacles than those by the IPPO agent. In addition, both AUVs trained via MAGAIL completed the obstacle avoidance and returned to tracking the target route earlier than those trained via IPPO. Furthermore, with MAGAIL, the leader AUV kept a closer distance from the target route even when avoiding obstacles, and the follower AUV can keep a 15 m distance from the leader AUV better than the ones trained via IPPO.

In addition, we tested the success rates of completing the three tasks with the final control policies of both the leader AUV and follower AUV trained in the previous original formation control and obstacle avoidance task via MAGAIL and IPPO.

Fig. 11. The distance from the target route (or leader AUV), heading deviation and distance from obstacles of the leader AUV (a,b,c) and follower AUV (d,e,f) during evaluating the generalization of the final trained policies via MAGAIL and IPPO in the original task to a new and complex task with continuous obstacles.

Fig. 12. Tested trajectories of the leader and follower AUV with final control policies of the previous formation control and obstacle avoidance task trained via MAGAIL and IPPO in two new and complex tasks with continuous obstacles and dense closure obstacles respectively.

We tested the policies of both AUVs in each task for 100 episodes, and the number of times that both AUVs successfully finished the task was used to calculate the success rate. As shown in Table 3, the control policies of multi-AUV trained via MAGAIL can complete the original task and the new task with continuous obstacles with a 100% success rate, and the complex new task with dense closure obstacles with a 96% success rate. In contrast, although the success rates of completing the original task and the new task with continuous obstacles by AUVs trained via IPPO can also reach 100%, they can successfully finish the new and complex task with dense closure obstacles in only 43% of the total tests.

Table 3
Success rate of completing different tasks.

Task configuration           Algorithm  Success rate
Original                     MAGAIL     100%
                             IPPO       100%
Continuous obstacles         MAGAIL     100%
                             IPPO       100%
Dense and closure obstacles  MAGAIL     96%
                             IPPO       43%

In summary, these results indicate that multi-AUV trained via MAGAIL with demonstrated trajectories can generalize better to different and complex tasks than those trained with traditional deep reinforcement learning from a fine-tuned reward function (i.e., IPPO).

4.4.4. Effect of time delay in communication

The above three tasks require a 15 m distance between the leader AUV and follower AUV, which is within the detection range of the sensors (25 m) on each AUV. We assumed the follower AUV can obtain the position of the leader AUV with the detecting sensors immediately, which was used to calculate g_F — the distance from the follower AUV to the leader AUV — and the heading deviation a_F. The leader AUV does not need to receive the position information of the follower AUV, since it does not need to track the follower AUV and the DTDE framework was adopted.

Fig. 13. The distance from the target route (or leader AUV), heading deviation and distance from obstacles of the leader AUV (a,b,c) and follower AUV (d,e,f) during evaluating
the generalization of the final trained policies via MAGAIL and IPPO in the original task to a new and complex task with dense closure obstacles.

performance of the follower AUV, which can be expressed as follows:



( 𝑔𝐹 − 15 )2 ( 𝑎𝐹 )2
𝑒𝑡 = + , (22)
10 𝜋∕3
where 𝑔𝐹 is the distance from the follower AUV to the leader AUV,
𝑎𝐹 is the angle between the line from the follower AUV to the leader
AUV and the follower AUV’s moving direction. According to the task
requirements in Section 4.2, the closer 𝑔𝐹 is to 15 m and 𝑎𝐹 is to 0
radian, the smaller the tracking error 𝑒𝑡 is. Fig. 14 shows the tracking
error of the follower AUV with 0.2 s, 0.6 s and 1.0 s time delay
in communication, in comparison with 0s delay. 0s time delay in
communication is the same to previously obtaining the position of
the leader AUV immediately with detecting sensors. Results in Fig. 14
show that as the time delay of communication increased, the tracking
error of the follower AUV became larger. However, in all cases of
time delay, the follower AUV can still complete the formation control
and obstacle avoidance task. In future studies, since the only differ-
Fig. 14. Tracking error of the follower AUV with final control policy trained via ence between MAGAIL and MARL is that MAGAIL learns from expert
MAGAIL learning from demonstrations in the original task with different time delays. demonstrations while MARL learns from pre-defined reward functions,
methods to solve time delay of communication (Zhang et al., 2018;
Kim et al., 2021) and limited bandwidth (Kim et al., 2019; Zhang and
5. Conclusion

In this paper, we adopted the decentralized training with decentralized execution framework and implemented multi-agent generative adversarial imitation learning (MAGAIL) from expert demonstrated trajectories for formation control and obstacle avoidance of multi-AUV. We tested our method learning from both optimal and sub-optimal demonstrations in a task with scattered obstacles on the Gazebo platform with the AUV simulators of our lab and their extension to simulated
underwater environments, and compared to traditional deep reinforcement learning from fine-tuned reward functions (IPPO). In addition, to evaluate the generalization of our method, we saved the control policies trained in the original task and tested them in two new and complex tasks with continuous obstacles and dense closure obstacles, respectively. Our experimental results show that multi-AUV trained via MAGAIL can achieve better performance than those trained via traditional deep reinforcement learning from fine-tuned reward functions (IPPO). Moreover, control policies trained via MAGAIL in simple tasks can generalize better to complex tasks than those trained via IPPO.

Future work will focus on investigating methods for learning from imperfect demonstrations and on integrating our method with sim2real methods for multi-agent reinforcement learning, in order to transfer and verify the proposed method on physical AUVs in real marine environments (Schwab et al., 2020; Zhao et al., 2020; Juan et al., 2021). Moreover, we would like to take disturbances and communication constraints into account to facilitate multi-AUV in performing large-scale and complex tasks.

CRediT authorship contribution statement

Zheng Fang: Methodology, Investigation, Data curation, Formal analysis, Writing – original draft. Dong Jiang: Methodology. Jie Huang: Conceptualization. Chunxi Cheng: Validation. Qixin Sha: Software. Bo He: Writing – review & editing. Guangliang Li: Writing – review & editing, Supervision, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work was supported by the Natural Science Foundation of China (under grant No. 51809246).

References

Arulkumaran, K., Deisenroth, M.P., Brundage, M., Bharath, A.A., 2017. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 34 (6), 26–38.
Busoniu, L., Babuska, R., De Schutter, B., 2008. A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 38 (2), 156–172.
Chen, G., 2019. A new framework for multi-agent reinforcement learning – centralized training and exploration with decentralized execution via policy distillation. arXiv preprint arXiv:1910.09152.
Cheng, C., Sha, Q., He, B., Li, G., 2021. Path planning and obstacle avoidance for AUV: A review. Ocean Eng. 235, 109355.
Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., Bharath, A.A., 2018. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 35 (1), 53–65.
da Silva, J.E., Terra, B., Martins, R., de Sousa, J.B., 2007. Modeling and simulation of the LAUV autonomous underwater vehicle. In: 13th IEEE IFAC International Conference on Methods and Models in Automation and Robotics, Vol. 1. Szczecin, Poland.
de Witt, C.S., Gupta, T., Makoviichuk, D., Makoviychuk, V., Torr, P.H., Sun, M., Whiteson, S., 2020. Is independent learning all you need in the StarCraft multi-agent challenge? arXiv preprint arXiv:2011.09533.
Desai, J.P., Ostrowski, J.P., Kumar, V., 2001. Modeling and control of formations of nonholonomic mobile robots. IEEE Trans. Robot. Autom. 17 (6), 905–908.
Fang, B., Jia, S., Guo, D., Xu, M., Wen, S., Sun, F., 2019. Survey of imitation learning for robotic manipulation. Int. J. Intell. Robot. Appl. 3 (4), 362–369.
Fossen, T.I., 2011. Handbook of Marine Craft Hydrodynamics and Motion Control. John Wiley & Sons.
Gao, Z., Guo, G., 2018. Fixed-time leader-follower formation control of autonomous underwater vehicles with event-triggered intermittent communications. IEEE Access 6, 27902–27911.
Ho, J., Ermon, S., 2016. Generative adversarial imitation learning. Adv. Neural Inf. Process. Syst. 29, 4565–4573.
Ho, J., Gupta, J., Ermon, S., 2016. Model-free imitation learning with policy optimization. In: Proceedings of International Conference on Machine Learning (ICML). PMLR, pp. 2760–2769.
Huang, H., Sheng, C., Wu, J., Wu, G., Zhou, C., Wang, H., 2021. Hydrodynamic analysis and motion simulation of fin and propeller driven manta ray robot. Appl. Ocean Res. 108, 102528.
Hussein, A., Gaber, M.M., Elyan, E., Jayne, C., 2017. Imitation learning: A survey of learning methods. ACM Comput. Surv. 50 (2), 1–35.
Juan, R., Huang, J., Gomez, R., Nakamura, K., Sha, Q., He, B., Li, G., 2021. Shaping progressive net of reinforcement learning for policy transfer with human evaluative feedback. In: Proceedings of IEEE International Conference on Intelligent Robots and Systems (IROS), pp. 1281–1288.
Kim, D., Moon, S., Hostallero, D., Kang, W.J., Lee, T., Son, K., Yi, Y., 2019. Learning to schedule communication in multi-agent reinforcement learning. In: Proceedings of International Conference on Learning Representations (ICLR).
Kim, W., Park, J., Sung, Y., 2021. Communication in multi-agent reinforcement learning: Intention sharing. In: Proceedings of International Conference on Learning Representations (ICLR).
Kober, J., Bagnell, J.A., Peters, J., 2013. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 32 (11), 1238–1274.
Li, X., Zhu, D., Qian, Y., 2014. A survey on formation control algorithms for multi-AUV system. Unmanned Syst. 2 (04), 351–359.
Liang, T., Lin, Y., Shi, L., Li, J., Zhang, Y., Qian, Y., 2020. Distributed vehicle tracking in wireless sensor network: A fully decentralized multiagent reinforcement learning approach. IEEE Sensors Lett. 5 (1), 1–4.
Manhães, M.M.M., Scherer, S.A., Voss, M., Douat, L.R., Rauschenbach, T., 2016. UUV simulator: A Gazebo-based package for underwater intervention and multi-robot simulation. In: OCEANS 2016 MTS/IEEE Monterey. IEEE, pp. 1–8.
Ng, A.Y., Russell, S.J., 2000. Algorithms for inverse reinforcement learning. In: Proceedings of International Conference on Machine Learning (ICML), Vol. 1, p. 2.
Oliehoek, F.A., Spaan, M.T., Vlassis, N., 2008. Optimal and approximate Q-value functions for decentralized POMDPs. J. Artificial Intelligence Res. 32, 289–353.
Paull, L., Saeedi, S., Seto, M., Li, H., 2013. AUV navigation and localization: A review. IEEE J. Ocean. Eng. 39 (1), 131–149.
Qie, H., Shi, D., Shen, T., Xu, X., Li, Y., Wang, L., 2019. Joint optimization of multi-UAV target assignment and path planning based on multi-agent reinforcement learning. IEEE Access 7, 146264–146272.
Ren, W., Sorensen, N., 2008. Distributed coordination architecture for multi-robot formation control. Robot. Auton. Syst. 56 (4), 324–333.
Ross, S., Bagnell, D., 2010. Efficient reductions for imitation learning. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, pp. 661–668.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Schwab, D., Zhu, Y., Veloso, M., 2020. Tensor action spaces for multi-agent robot transfer learning. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, pp. 5380–5386.
Sharma, P.K., Fernandez, R., Zaroukian, E., Dorothy, M., Basak, A., Asher, D.E., 2021. Survey of recent multi-agent reinforcement learning algorithms utilizing centralized training. In: Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications III, Vol. 11746. International Society for Optics and Photonics, p. 117462K.
Song, J., Ren, H., Sadigh, D., Ermon, S., 2018. Multi-agent generative adversarial imitation learning. arXiv preprint arXiv:1807.09936.
Spaan, M.T., 2012. Partially observable Markov decision processes. In: Reinforcement Learning. Springer, pp. 387–414.
Suryendu, C., Subudhi, B., 2020. Formation control of multiple autonomous underwater vehicles under communication delays. IEEE Trans. Circuits Syst. II: Express Briefs 67 (12), 3182–3186.
Sutton, R., Barto, A., 2018. Reinforcement Learning: An Introduction. MIT Press.
Wang, C., Wei, L., Wang, Z., Song, M., Mahmoudian, N., 2018. Reinforcement learning-based multi-AUV adaptive trajectory planning for under-ice field estimation. Sensors 18 (11), 3859.
Xin, B., Zhang, J., Chen, J., Wang, Q., Qu, Y., 2021. Overview of research on transformation of multi-AUV formations. Complex Syst. Model. Simul. 1 (1), 1–14.
Xu, J., Huang, F., Wu, D., Cui, Y., Yan, Z., Zhang, K., 2021. Deep reinforcement learning based multi-AUVs cooperative decision-making for attack–defense confrontation missions. Ocean Eng. 239, 109794.
Yan, Z.-p., Liu, Y.-b., Yu, C.-b., Zhou, J.-j., 2017. Leader-following coordination of multiple UUVs formation under two independent topologies and time-varying delays. J. Central South Univ. 24 (2), 382–393.
Yang, E., Gu, D., 2004. Multiagent Reinforcement Learning for Multi-Robot Systems: A Survey. Technical Report.
Yang, Y., Xiao, Y., Li, T., 2021. A survey of autonomous underwater vehicle formation: Performance, formation control, and communication capability. IEEE Commun. Surv. Tutor. 23 (2), 815–841.
Zhang, C., Lesser, V., 2013. Coordinating multi-agent reinforcement learning with limited communication. In: Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems, pp. 1101–1108.
Zhang, G., Yu, W., Li, J., Zhang, X., 2021a. A novel event-triggered robust neural formation control for USVs with the optimized leader–follower structure. Ocean Eng. 235, 109390.
Zhang, K., Yang, Z., Başar, T., 2021b. Multi-agent reinforcement learning: A selective overview of theories and algorithms. In: Handbook of Reinforcement Learning and Control. Springer, pp. 321–384.
Zhang, K., Yang, Z., Liu, H., Zhang, T., Basar, T., 2018. Fully decentralized multi-agent reinforcement learning with networked agents. In: Proceedings of International Conference on Machine Learning (ICML). PMLR, pp. 5872–5881.
Zhang, Q., Lin, J., Sha, Q., He, B., Li, G., 2020. Deep interactive reinforcement learning for path following of autonomous underwater vehicle. IEEE Access 8, 24258–24268.
Zhang, Y., Li, Y., Sun, Y., Zeng, J., Wan, L., 2017. Design and simulation of X-rudder AUV's motion control. Ocean Eng. 137, 204–214.
Zhang, Z., 2018. Improved Adam optimizer for deep neural networks. In: Proceedings of IEEE/ACM 26th International Symposium on Quality of Service (IWQoS). IEEE, pp. 1–2.
Zhao, W., Queralta, J.P., Westerlund, T., 2020. Sim-to-real transfer in deep reinforcement learning for robotics: A survey. In: IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, pp. 737–744.
