Ocean Engineering
Keywords: Multi-agent reinforcement learning; Formation control; Autonomous underwater vehicle; Imitation learning; Obstacle avoidance

Abstract

Autonomous underwater vehicles (AUVs) are widely used in complex underwater missions such as bottom survey and data collection. Multiple AUVs can cooperatively complete tasks that a single AUV cannot accomplish. Recently, multi-agent reinforcement learning (MARL) has been introduced to improve multi-AUV control in uncertain marine environments. However, it is very difficult and even impractical to design effective and efficient reward functions for various tasks. In this paper, we implemented multi-agent generative adversarial imitation learning (MAGAIL) from expert demonstrated trajectories for formation control and obstacle avoidance of multi-AUV. In addition, a decentralized training with decentralized execution framework was adopted to alleviate the communication problem in underwater environments. Moreover, to help the discriminator accurately judge the quality of each AUV's trajectory in the two tasks and to increase the convergence speed, we improved upon MAGAIL by dividing the state–action pairs of the expert trajectory for each AUV into two groups and updating the discriminator by randomly selecting an equal number of state–action pairs from both groups. Our experimental results on a simulated AUV system modeling our lab's Sailfish 210 in the Gazebo simulation environment show that MAGAIL allows control policies of multi-AUV to obtain a better performance than traditional multi-agent deep reinforcement learning from a fine-tuned reward function (IPPO). Moreover, control policies trained via MAGAIL in simple tasks can generalize better to complex tasks than those trained via IPPO.
1. Introduction

Autonomous underwater vehicles (AUVs) can perform a variety of underwater tasks with their own control and navigation systems and are widely used in complex underwater missions such as bottom survey and data collection (Cheng et al., 2021). However, the detection range and energy storage of a single AUV are limited. With the increasing complexity of underwater missions, it is necessary to use multiple AUVs to cooperatively complete tasks that a single AUV cannot perform. Multi-AUV systems can perform tasks such as underwater detection, target search and object recognition in a collaborative manner, which not only improves the efficiency of task execution but also reduces the time and energy cost.

For many tasks, multi-AUV needs to move in formation and be able to avoid obstacles while performing the task (Yang et al., 2021). In other words, formation control and obstacle avoidance are core techniques for multi-AUV to complete complex tasks. Currently, most methods for formation control can be divided into four groups: the virtual structure method, the behavior-based method, the artificial potential field method and the leader–follower method (Li et al., 2014). In the leader–follower method, a leader AUV usually tracks a predefined trajectory and the followers track transformed states of the leader AUV according to predefined schemes (Desai et al., 2001). The leader–follower method is easy to understand and implement, since the coordinated team members only need to maneuver according to the leader (Zhang et al., 2021a), and it has been widely used for multi-AUV control because of its simplicity and extensibility (Gao and Guo, 2018). However, most methods for leader–follower formation control are difficult to apply in uncertain environments.
Multi-agent reinforcement learning (MARL) has been introduced to improve multi-AUV control in uncertain marine environments (Qie et al., 2019; Zhang et al., 2021b). With MARL, AUVs can interact with the environment and learn optimal control policies to complete different underwater tasks. For example, Xu et al. (2021) used deep reinforcement learning to deal with the cooperative decision-making problem of multi-AUV under limited perception and communication in attack-defense confrontation missions. They found that deep reinforcement learning can enable multi-AUV to complete the attack-defense confrontation missions excellently and generate interesting collaborative behaviors that were unseen previously. Wang et al. (2018) designed a reinforcement learning-based online learning algorithm to learn optimal trajectories in a constrained continuous space for multi-AUV to estimate a water parameter field of interest in the under-ice environment. However, it is very challenging to apply MARL to multi-AUV control, as the difficulty of designing an effective and efficient reward function for each agent increases with the number of agents in the group, and the relationship between agents changes from task to task (Yang and Gu, 2004).

As AUVs often work in complex and high-dimensional environments (Fang et al., 2019), it is easier to provide expert demonstrations of how to perform tasks than to design complex reward functions. Imitation learning from expert demonstrations was proposed for this purpose and has been successfully applied to single-robot control (Hussein et al., 2017). The simplest imitation learning method is behavior cloning (BC), which learns to map states of an agent to optimal actions via supervised learning (Ross and Bagnell, 2010). However, BC requires a large amount of reliable data to learn and cannot generalize to unseen states and different tasks. On the other hand, inverse reinforcement learning (IRL) allows an agent to first extract a cost function from expert demonstrations and then learn a control policy from the extracted cost function via reinforcement learning (Ng et al., 2000), which enables the agent to generalize to unseen states more effectively (Ho et al., 2016). However, many of the proposed IRL methods need a model to solve a sequence of planning or reinforcement learning problems in an inner loop (Ho and Ermon, 2016), which is extremely expensive to run and limits the use of IRL for robot control in large and complex tasks. Ho and Ermon (2016) tried to solve this problem by proposing a general model-free imitation learning method, generative adversarial imitation learning (GAIL). GAIL allows robots to directly learn policies from expert demonstrations in large and complex environments.
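To make the adversarial idea concrete, the sketch below shows the kind of surrogate reward a GAIL-style discriminator provides to the policy. The function name, the convention that the discriminator outputs the probability that a state–action pair came from the expert, and the specific -log(1 - D) form are illustrative assumptions and not necessarily the exact formulation used in this paper's approach.

```python
import math

def gail_reward(d_expert_prob, eps=1e-8):
    """Surrogate reward for one (state, action) pair in GAIL-style imitation.

    d_expert_prob is assumed to be the discriminator's estimated probability
    that the pair came from the expert demonstrations. The policy is then
    optimized with ordinary reinforcement learning on this signal instead of a
    hand-designed reward, so pairs that look expert-like are reinforced.
    """
    return -math.log(1.0 - d_expert_prob + eps)
```

With this convention the reward grows as the discriminator is fooled, which is what pushes the learned policy toward the demonstrated behavior.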
On the other hand, compared with other multi-agent formations, a unique characteristic of the multi-AUV formation is the poor communication conditions. Environmental factors such as water depth, water quality and light field can have negative impacts on communication devices, thus reducing the stability of the communication and the information quality (Xin et al., 2021). As radio or spread-spectrum communications and global positioning signals propagate only short distances in underwater environments, acoustic communication is the most popular method for multi-AUV control. However, acoustic communication also suffers from many shortcomings, such as small bandwidth and high latency (Paull et al., 2013). In order to solve the problem of multi-robot formation control caused by communication range and bandwidth limitations, Ren and Sorensen (2008) proposed a distributed formation control architecture that accommodates an arbitrary number of group leaders and arbitrary information flow among vehicles. The architecture requires only local neighbor-to-neighbor information exchange. By increasing the number of group leaders within the formation, it greatly reduces the failure of the entire formation caused by wrong decisions made by a single leader or follower. However, their experiments did not take into account the issue of time delay, the effect of robot dynamics or the consequences of data loss. To solve the problem caused by underwater time-varying delays, Yan et al. (2017) used a consensus algorithm to solve the coordinated control problem of a multiple unmanned underwater vehicle (UUV) formation in a leader-following manner. Suryendu and Subudhi (2020) further proposed to use a gradient-descent based method to estimate the actual communication delay. However, acoustic communication still suffers from a low data rate, variable sound speed due to fluctuating water conditions, etc., which greatly limits applying multi-AUV formation control in complex tasks.

In this paper, we implemented multi-agent generative adversarial imitation learning (MAGAIL) (Song et al., 2018) for formation control and obstacle avoidance of multi-AUV. MAGAIL allows multi-AUV to learn control policies from demonstrated trajectories instead of predefined reward functions. However, there are two challenges for implementing MAGAIL in formation control and obstacle avoidance of multi-AUV:

• How to deal with the communication constraints in underwater environments for multi-AUV formation control?
• As the formation control and obstacle avoidance sub-tasks were demonstrated in one expert trajectory and AUVs do not encounter and detect obstacles all the time, there are many more state–action pairs without detected obstacles than with detected obstacles. How to train an unbiased discriminator that provides effective and distinctive rewards for learning the two sub-tasks from one trajectory?

The main contributions of our work are as follows:

• We developed and implemented the MAGAIL method for formation control and obstacle avoidance of multi-AUV, which can learn control policies from expert trajectories instead of pre-defined reward functions;
• To solve the limited communication problem in underwater environments, we adopted a decentralized training with decentralized execution (DTDE) framework, in which each AUV only needs to act based on its own local observations;
• We improved upon the original MAGAIL by distinguishing data in the expert trajectories for the formation control and obstacle avoidance tasks, so that the discriminator can accurately judge the quality of each AUV's trajectory and both AUVs can learn one policy for the two tasks at a faster speed.

We tested our method learning from both optimal and sub-optimal demonstrations for formation control and obstacle avoidance with multi-AUV simulators modeling our Sailfish 210 in a Gazebo underwater simulation environment extended from the Unmanned Underwater Vehicle (UUV) Simulator. AUVs trained via traditional deep reinforcement learning from fine-tuned reward functions (IPPO) were used for comparison. In addition, to evaluate the generalization of our method, we tested the policy trained in the previous task in two new and complex tasks with continuous obstacles and dense closure obstacles respectively. Moreover, the effect of time delay in communication on MAGAIL's performance was also studied in the original formation control and obstacle avoidance task. The experimental results show that multi-AUV trained via MAGAIL can achieve a better performance than those trained via traditional deep reinforcement learning from fine-tuned reward functions (IPPO). Moreover, control policies trained via MAGAIL in simple tasks can generalize better to complex tasks than those trained via IPPO.

2. Background

In this section, we provide background on multi-agent reinforcement learning, decentralized partially observable Markov decision processes (Dec-POMDPs) and the dynamic model of the AUV.

2.1. Multi-agent reinforcement learning

In single-agent reinforcement learning for AUV control (Sutton and Barto, 2018; Arulkumaran et al., 2017; Kober et al., 2013), an AUV learns a policy π via interacting with the underwater environment. Specifically, at time step t, the AUV selects an action a_t ∼ π(a|s_t) with the policy based on the detected current state s_t. Then the AUV transitions to the next state and receives a reward r_t from the environment. The goal of the AUV is to learn a policy π that maximizes the discounted accumulated return as follows:

V_\pi(s) = \mathbb{E}\left[\sum_{h \ge 0} \gamma^h r_{t+h} \,\middle|\, a_t \sim \pi(a|s_t)\right],   (1)

where γ is the discount factor determining the value of future rewards and V_π(s) is the value of state s following policy π thereafter.
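As a small, generic illustration of Eq. (1), the per-step discounted return of a recorded episode can be computed as follows; the function is ours, and gamma = 0.99 matches the discount factor used later in the experiments.

```python
def discounted_returns(rewards, gamma=0.99):
    """Backward pass computing returns[t] = sum_h gamma**h * rewards[t+h] for one episode."""
    g = 0.0
    returns = []
    for r in reversed(rewards):
        g = r + gamma * g        # the return at t reuses the return at t+1
        returns.append(g)
    return returns[::-1]

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # -> [1.75, 1.5, 1.0]
```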
Multi-agent reinforcement learning for multi-AUV control involves multiple AUVs interacting with the underwater environment (Busoniu et al., 2008; Qie et al., 2019). In MARL, each AUV i has its own policy π_i and it can select an action a_{i,t} ∼ π_i(a_i|s_t) based on the observed current environmental state s_t at time step t. After the AUV transitions to another state in the underwater environment, it receives a reward r_{i,t}. Similar to single-agent reinforcement learning for AUV control, the goal for each AUV is to learn a policy maximizing its discounted accumulated return as follows:

V_{\pi_1}(s) = \mathbb{E}\left[\sum_{h \ge 0} \gamma^h r_{1,t+h} \,\middle|\, a_{1,t} \sim \pi_1(a_1|s_t)\right]
V_{\pi_2}(s) = \mathbb{E}\left[\sum_{h \ge 0} \gamma^h r_{2,t+h} \,\middle|\, a_{2,t} \sim \pi_2(a_2|s_t)\right]
......
V_{\pi_i}(s) = \mathbb{E}\left[\sum_{h \ge 0} \gamma^h r_{i,t+h} \,\middle|\, a_{i,t} \sim \pi_i(a_i|s_t)\right],   (2)

where π_i is the policy for AUV i. However, different from single-AUV reinforcement learning, in MARL the policy π_i of AUV i is influenced by the other AUVs' policies. The most common concept to solve this problem is the Nash Equilibrium (NE), which can be defined as:

V_{\pi_1^*}(s) \ge V_{\pi_1}(s) \quad \text{when the other AUVs follow } \pi_2^*, \pi_3^*, \ldots, \pi_i^*
V_{\pi_2^*}(s) \ge V_{\pi_2}(s) \quad \text{when the other AUVs follow } \pi_1^*, \pi_3^*, \ldots, \pi_i^*
......
V_{\pi_i^*}(s) \ge V_{\pi_i}(s) \quad \text{when the other AUVs follow } \pi_1^*, \pi_2^*, \ldots, \pi_{i-1}^*,   (3)

where V_{\pi_i^*}(s) represents the value of state s following the optimal policy π_i^* for AUV i. In NE, AUV i will not try to change its policy π_i if the other AUVs do not change their policies, because its discounted accumulated return cannot continue to increase. That is to say, once the equilibrium state is reached, all AUVs' state values converge.

In MARL, the relationship between AUVs can be cooperative, competitive or a mixed setting, which is determined by the relationship between the reward functions of the AUVs (Zhang et al., 2021b). In a fully cooperative setting, all AUVs share the same reward function, i.e., R_1 = R_2 = ⋯ = R_i, because they perform the same task. They can even share a common policy π and value function V in this setting. In a fully competitive setting, there is a zero-sum relationship between the reward functions of the AUVs. If an AUV a and an AUV b are competitive, their corresponding reward functions can be set as R_a = −R_b. In other words, the AUVs maximize their cumulative rewards by preventing each other from completing tasks. In a mixed setting, each AUV i is assigned its own task, i.e., each AUV has its own reward function R_i. At the same time, each AUV i can be cooperative or competitive with other agents. For example, in an attack-defense confrontation task, all AUVs are divided into two groups (Xu et al., 2021). AUVs in one group need to defend their own territory and occupy the territory defended by the other group of AUVs. In this case, AUVs within the same group can be considered cooperative and share a reward function, while AUVs across groups exhibit a competitive relationship. In the leader–follower formation control and obstacle avoidance of multi-AUV, the leader AUVs and follower AUVs can be divided into different groups, since they perform different tasks.

2.2. Dec-POMDPs

Fig. 1. Decentralized partially observable Markov decision processes for multi-AUV control.

A Dec-POMDP models a number of AUVs that inhabit a particular underwater environment, which is considered at discrete time steps (Oliehoek et al., 2008). A decentralized partially observable Markov decision process is represented as a tuple ⟨n, S, A, T, O, f, h, b_0, R, π, γ⟩. n is the number of AUVs. The environment is represented by a finite set of states S. A = ×_i A_i is the set of joint actions, in which A_i is the set of actions available to AUV i. O = ×_i O_i is the set of joint observations of all AUVs, where O_i is the set of observations available to AUV i. For example, in formation control and obstacle avoidance of multi-AUV, at the current time step t each AUV observes its own state with limited perception, including the distances between AUVs and from obstacles, and the whole system has a joint observation o = ⟨o_1, o_2, …, o_n⟩. One joint action a_t = ⟨a_{1,t}, a_{2,t}, …, a_{n,t}⟩ is taken based on the joint observation, and each AUV only knows its own action. T is a transition function defined as T : S × A × S → P(S) ∈ [0, 1], so the probability of a particular next state can be obtained as P(s_{t+1}|s_t, a_t) based on the current state and joint action. f is the observation function, a mapping from the AUVs' joint actions and successor states to probability distributions over joint observations, f : A × S → P(O) ∈ [0, 1], so the probability of a joint observation can be taken as P(o_t|a_{t−1}, s_t). Similar to how the transition model T describes the stochastic influence of actions on the underwater environment, the observation model f describes how the state of the underwater environment is perceived by the AUVs. h is the horizon of the decision problem. b_0 ∈ P(S) is the initial state distribution at time step t = 0. For a fully cooperative setting, all AUVs share a joint reward function R defined as R : S × A × S → ℝ, and each AUV gets a reward r_t = R(s_t, a_t) based on the detected state and the joint action taken at time step t. The reward function can be different for each AUV in competitive settings. A tuple of policies π = ⟨π_1, π_2, …, π_n⟩ called a joint policy is learned to maximize the expected cumulative return ∑_{t=0}^∞ γ^t r_t, where γ ∈ (0, 1] is the discount factor. In the task of formation control and obstacle avoidance for multi-AUV, each AUV learns a policy that can keep the formation and avoid obstacles at the same time.
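For readers who prefer code, the tuple of Section 2.2 can be mirrored by a small container plus a decentralized rollout loop, as sketched below. Representing the stochastic T, f and R by deterministic callables, the dummy initial joint action, and all names are simplifying assumptions of ours, not code from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

State = Tuple[float, ...]
Obs = Tuple[float, ...]
Action = int

@dataclass
class DecPOMDP:
    """Mirror of the tuple <n, S, A, T, O, f, h, b0, R, gamma> from Section 2.2."""
    n: int                                                           # number of AUVs
    action_spaces: Sequence[Sequence[Action]]                        # A_i for each AUV i
    transition: Callable[[State, Tuple[Action, ...]], State]         # T (deterministic here)
    observe: Callable[[Tuple[Action, ...], State], Tuple[Obs, ...]]  # f: joint observation
    reward: Callable[[State, Tuple[Action, ...], State], float]      # shared R
    horizon: int                                                     # h
    initial_state: Callable[[], State]                               # sample from b0
    gamma: float = 0.99

def decentralized_rollout(model: DecPOMDP,
                          policies: Sequence[Callable[[Obs], Action]]) -> float:
    """One episode in which each AUV i chooses its action from its own o_i only."""
    s = model.initial_state()
    no_op = tuple(0 for _ in range(model.n))        # placeholder joint action at t = 0
    obs = model.observe(no_op, s)
    ret, discount = 0.0, 1.0
    for _ in range(model.horizon):
        joint_a = tuple(policies[i](obs[i]) for i in range(model.n))
        s_next = model.transition(s, joint_a)
        ret += discount * model.reward(s, joint_a, s_next)
        discount *= model.gamma
        obs = model.observe(joint_a, s_next)
        s = s_next
    return ret
```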
2.3. Model of AUV
3. Approach
Fig. 7. Illustration of task configuration in the underwater environment for formation control and obstacle avoidance.

(4) The heading deviation of the follower AUV is too large, i.e., |a_F| ≥ π/3;
(5) The shortest distances from the AUVs to obstacles are smaller than 2 m, i.e., d_L ≤ 2 or d_F ≤ 2.

In addition, each AUV has M = 600 rangefinder sensors and a visual span S = 2π/3. The detection range is 25 m. To ensure that all AUVs can accurately detect obstacles and avoid them smoothly, we divided the M = 600 rangefinder sensors of each AUV into 6 parts.
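One plausible way to turn the 600 beams into a compact local observation is to keep the closest return in each of the 6 parts, as sketched below. Summarizing each part by its minimum reading and the function name are our assumptions; only the 600 beams, the 6 parts and the 25 m range come from the text.

```python
import numpy as np

def sector_ranges(scan, num_sectors=6, max_range=25.0):
    """Compress a 600-beam rangefinder scan into one distance per sector."""
    scan = np.clip(np.asarray(scan, dtype=float), 0.0, max_range)
    sectors = np.array_split(scan, num_sectors)      # 6 groups of 100 beams each
    return np.array([s.min() for s in sectors])      # closest detected return per part

# Example: an obstacle 7 m away seen by beams 230-259 shows up in the third sector.
scan = np.full(600, 25.0)
scan[230:260] = 7.0
print(sector_ranges(scan))   # -> [25. 25.  7. 25. 25. 25.]
```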
As the leader AUV only needs to track the target path and avoid obstacles, we assign 5 actions to it: two actions for turning left, one action for going straight and two actions for turning right, as shown in Table 1. This allows the leader AUV to follow the path and avoid obstacles smoothly. The follower AUV needs to track the leader AUV and keep a distance of 15 m from it, so it needs more actions to adjust its distance from the leader AUV while avoiding obstacles at the same time. Therefore, we assign 10 actions to the follower AUV: four actions for turning left, two actions for going straight and four actions for turning right, which allows the follower AUV to speed up or slow down and avoid being too close to or too far away from the leader AUV, as shown in Table 2.

Table 1
Actions available to the leader AUV (the units of Fin and Thruster are rad and r/s respectively).

Action          Fin1     Fin2   Fin3     Fin4   Thruster
Turn left 1     −0.25    0      0.25     0      −300
Turn left 2     −0.364   0      0.364    0      −300
Go straight      0       0      0        0      −300
Turn right 1     0.25    0     −0.25     0      −300
Turn right 2     0.364   0     −0.364    0      −300

Table 2
Actions available to the follower AUV (the units of Fin and Thruster are rad and r/s respectively).

Action          Fin1     Fin2   Fin3     Fin4   Thruster
Turn left 1     −0.25    0      0.25     0      −250
Turn left 2     −0.25    0      0.25     0      −350
Turn left 3     −0.364   0      0.364    0      −250
Turn left 4     −0.364   0      0.364    0      −350
Go straight 1    0       0      0        0      −250
Go straight 2    0       0      0        0      −350
Turn right 1     0.25    0     −0.25     0      −250
Turn right 2     0.25    0     −0.25     0      −350
Turn right 3     0.364   0     −0.364    0      −250
Turn right 4     0.364   0     −0.364    0      −350
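Written out as code, the two discrete action sets map an action name to the (Fin1, Fin2, Fin3, Fin4, Thruster) command sent to the vehicle (fins in rad, thruster in r/s). The dictionary form and key names are ours; the values are copied from Tables 1 and 2.

```python
# Leader AUV actions (Table 1) as (Fin1, Fin2, Fin3, Fin4, Thruster) tuples.
LEADER_ACTIONS = {
    "turn_left_1":  (-0.25,  0.0,  0.25,  0.0, -300),
    "turn_left_2":  (-0.364, 0.0,  0.364, 0.0, -300),
    "go_straight":  ( 0.0,   0.0,  0.0,   0.0, -300),
    "turn_right_1": ( 0.25,  0.0, -0.25,  0.0, -300),
    "turn_right_2": ( 0.364, 0.0, -0.364, 0.0, -300),
}

# Follower AUV actions (Table 2): two thruster speeds per steering command.
FOLLOWER_ACTIONS = {
    "turn_left_1":   (-0.25,  0.0,  0.25,  0.0, -250),
    "turn_left_2":   (-0.25,  0.0,  0.25,  0.0, -350),
    "turn_left_3":   (-0.364, 0.0,  0.364, 0.0, -250),
    "turn_left_4":   (-0.364, 0.0,  0.364, 0.0, -350),
    "go_straight_1": ( 0.0,   0.0,  0.0,   0.0, -250),
    "go_straight_2": ( 0.0,   0.0,  0.0,   0.0, -350),
    "turn_right_1":  ( 0.25,  0.0, -0.25,  0.0, -250),
    "turn_right_2":  ( 0.25,  0.0, -0.25,  0.0, -350),
    "turn_right_3":  ( 0.364, 0.0, -0.364, 0.0, -250),
    "turn_right_4":  ( 0.364, 0.0, -0.364, 0.0, -350),
}
```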
All actions available to the two AUVs are controlled by setting the propeller speed and the rudder angles. As shown in Tables 1 and 2, Thruster represents the propeller speed, which can be set to control the speed of the AUV. Fin1 and Fin3 represent the angles of the upper and lower rudders, which can be set to control the horizontal direction of the AUV. Fin2 and Fin4 represent the angles of the left and right rudders, which can control the AUV to float and dive in the underwater environment. In our experiments, Fin2 and Fin4 are set to 0 as the tasks are performed in a 2D space.

In the experiments, we set the discount factors γ = 0.99, λ = 1.0 and the clipping factor ε = 0.05 in the IPPO algorithm. For each AUV, the discriminator D_{ω_i} is updated 3 times within one episode, while the policy π_{θ_i} and the value function V_{φ_i} are updated 9 times. Testing showed this to be the best way to improve the policy and the discriminator simultaneously. Each time, 256 local observable state–action pairs were randomly selected from the AUV's trajectory τ_i and the expert trajectory τ_i^E respectively to update the discriminator of AUV i, and 128 local observable state–action pairs from the AUV's trajectory τ_i were randomly selected to update the policy of AUV i.
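The balanced discriminator update described in the abstract (expert state–action pairs split into a group with detected obstacles and a group without, with equal numbers drawn from each) can be sketched as follows. The state layout, the use of the 25 m detection range as the splitting criterion and the function name are our assumptions.

```python
import random

def balanced_expert_batch(expert_pairs, batch_size=256, detection_range=25.0):
    """Draw a 256-pair expert minibatch for one discriminator update of AUV i.

    expert_pairs: list of (state, action) tuples from the expert trajectory,
    where state[-1] is assumed to hold the shortest detected obstacle distance.
    Sampling equal numbers from the obstacle and obstacle-free groups keeps the
    rarer obstacle-avoidance steps from being drowned out.
    """
    with_obstacle = [p for p in expert_pairs if p[0][-1] < detection_range]
    without_obstacle = [p for p in expert_pairs if p[0][-1] >= detection_range]
    if not with_obstacle or not without_obstacle:
        # Degenerate trajectory: fall back to plain uniform sampling.
        return random.choices(expert_pairs, k=batch_size)
    half = batch_size // 2
    batch = random.choices(with_obstacle, k=half) + random.choices(without_obstacle, k=half)
    random.shuffle(batch)
    return batch
```

Each of the 3 discriminator updates per episode would pair one such expert batch with 256 pairs drawn from the AUV's own trajectory τ_i.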
4.3. Evaluation metrics

As the rewards provided by the discriminator are subjective, to evaluate the learning performance of both AUVs via MAGAIL we defined reward functions for both AUVs as evaluation metrics.

4.3.1. Leader AUV

In the task, the leader AUV needs to complete path tracking and obstacle avoidance at the same time. Therefore, its reward function r_L is composed of two parts and can be defined as:

r_L = r_L^c + r_L^a,   (14)

where r_L^c represents rewards for path tracking and r_L^a represents rewards for obstacle avoidance. r_L^c provides rewards based on the AUV's heading deviation a_L and distance from the target route g_L, as:

r_L^c = -|a_L|/3.14 + |1 - g_L/30|,   (15)

and r_L^a is defined based on the distance from obstacles, as:

r_L^a = \begin{cases} -2, & 5 < d_L \le 8 \\ -4, & 2 < d_L \le 5 \end{cases}.   (16)

In addition, if the leader AUV meets a condition to suspend the current episode in the task, as described in Section 4.2, it will receive additional rewards, as:

r_L = \begin{cases} -200, & 30 \le |g_L| \\ -200, & \pi \le |a_L| \\ -200, & d_L \le 2 \end{cases}.   (17)
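Transcribed into code, the leader AUV's evaluation reward of Eqs. (14)–(17) looks as follows. Treating the −200 terms as additional penalties, returning 0 for r_L^a when no obstacle is within 8 m, and the function name are our reading of the text.

```python
import math

def leader_reward(a_L, g_L, d_L):
    """Evaluation reward for the leader AUV (Eqs. (14)-(17)).

    a_L: heading deviation from the target path (rad)
    g_L: distance from the target route (m)
    d_L: shortest distance to an obstacle (m)
    """
    r_c = -abs(a_L) / 3.14 + abs(1 - g_L / 30)           # path tracking, Eq. (15)
    if 5 < d_L <= 8:                                     # obstacle avoidance, Eq. (16)
        r_a = -2.0
    elif 2 < d_L <= 5:
        r_a = -4.0
    else:
        r_a = 0.0
    r = r_c + r_a                                        # Eq. (14)
    # Additional penalty when an episode-suspension condition of Eq. (17) is met.
    if abs(g_L) >= 30 or abs(a_L) >= math.pi or d_L <= 2:
        r += -200.0
    return r
```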
4.3.2. Follower AUV

The follower AUV needs to track the leader AUV at a certain distance and avoid obstacles at the same time. Therefore, its reward function is also composed of two parts and is defined as:

r_F = r_F^c + r_F^a,   (18)

where r_F^c represents rewards for tracking the leader AUV and r_F^a represents rewards for obstacle avoidance. r_F^c provides rewards based on the follower AUV's heading deviation a_F and distance from the leader AUV g_F, as:

r_F^c = -3|a_F|/3.14 + |1 - |g_F - 15|/10|.   (19)

Same as for the leader AUV, r_F^a is defined based on its distance from obstacles, as:

r_F^a = \begin{cases} -2, & 5 < d_F \le 8 \\ -4, & 2 < d_F \le 5 \end{cases}.   (20)
In addition, similar to the leader AUV, the follower AUV will receive additional rewards if it meets a condition to suspend the current episode in the task, as described in Section 4.2:

r_F = \begin{cases} -200, & |g_F| < 5 \ \text{or} \ 25 < |g_F| \\ -200, & \pi/3 \le |a_F| \\ -200, & d_F \le 2 \end{cases}.   (21)

Note that the reward functions defined above for the leader AUV and follower AUV are never used for training the control policy with the MAGAIL method, but only for testing the policies learned from demonstrated trajectories. In addition, we used the traditional deep reinforcement learning method for multi-agent systems, IPPO (de Witt et al., 2020), as the baseline for comparison, which learned from only the above defined reward functions for each AUV.

In our experiments, we trained the leader and follower AUV via our implementation of MAGAIL learning from optimal expert demonstrations and sub-optimal demonstrations respectively. That is to say, optimal or sub-optimal demonstrated trajectories are taken as input to the controllers of the leader AUV and follower AUV separately. Then a control policy is learned for each AUV via MAGAIL by interacting with the underwater environment. In general, the optimal and sub-optimal demonstrations can be obtained by recording human operators controlling a vehicle to complete tasks on a simulation platform or even in the real marine environment. For simplicity, in this paper the optimal and sub-optimal demonstrations are generated by training both AUVs with IPPO and evaluating them with the above defined reward functions for the leader AUV and follower AUV respectively. Specifically, the optimal and sub-optimal expert trajectories for the leader AUV are generated by running policies trained via IPPO that receive about 200 and 180 rewards respectively within one episode, while the optimal and sub-optimal expert trajectories for the follower AUV are generated by running policies trained via IPPO that receive about 200 and 150 rewards respectively within one episode.
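A minimal sketch of how such demonstration sets can be assembled by filtering rollouts of a trained IPPO policy on their evaluation return is shown below. run_episode and the exact thresholds are illustrative assumptions; only the target return levels (about 200 for optimal and about 180/150 for sub-optimal) come from the text.

```python
def collect_demonstrations(run_episode, num_trajectories, min_return, max_return=float("inf")):
    """Keep trajectories of a trained policy whose evaluation return falls in a band.

    run_episode() is assumed to roll out the trained IPPO policy once and return
    (trajectory, episode_return), where the trajectory is a list of local
    (state, action) pairs and the return is scored with the rewards of Section 4.3.
    """
    demos = []
    while len(demos) < num_trajectories:
        trajectory, episode_return = run_episode()
        if min_return <= episode_return <= max_return:
            demos.append(trajectory)
    return demos

# Hypothetical usage for the leader AUV:
# optimal_demos    = collect_demonstrations(run_leader_episode, 10, min_return=195)
# suboptimal_demos = collect_demonstrations(run_leader_episode, 10, 175, 185)
```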
4.4. Results and analysis

In this section, we present and analyze the experimental results by comparing the performance of control policies trained via MAGAIL with optimal and sub-optimal expert demonstrations to the traditional reinforcement learning algorithm for multi-agent systems learning from predefined reward functions (IPPO), in the original task configuration described in Section 4.2. In addition, to evaluate the generalization of our method, we tested the trained control policies of the original task in two different and complex tasks with continuous obstacles and dense closure obstacles respectively.

4.4.1. Learning curves

Fig. 8. Cumulative rewards received by the leader AUV and follower AUV via MAGAIL with optimal demonstrations (MAGAIL Optimal), sub-optimal demonstrations (MAGAIL Suboptimal) and IPPO, in the original formation control and obstacle avoidance task (averaged over data collected in three trials). The shaded area is the 0.95 confidence interval and the bold line is the mean performance. Two red lines show the performance of the expert optimal and sub-optimal demonstrations.

We first evaluate and compare the learning curves of the leader and follower AUV trained via MAGAIL with optimal and sub-optimal demonstrations in the original formation control and obstacle avoidance task, measured by the cumulative reward received per episode according to the reward functions defined for the leader AUV and follower AUV in Section 4.3. As shown in Fig. 8, for both the leader AUV and the follower AUV, the MAGAIL Optimal agent learned much faster and better than the IPPO agent. Even with sub-optimal demonstrated trajectories, the MAGAIL Suboptimal agent allows the leader AUV to learn faster and better than the IPPO agent, while the follower AUV can also obtain a performance comparable to that of the IPPO agent at a similar speed. However, the performance of the MAGAIL Optimal agent and the MAGAIL Suboptimal agent is limited by the performance of the respective expert demonstrations. Moreover, our results in Fig. 8 show that the better the provided expert demonstrations are, the higher the performance each AUV trained via MAGAIL can obtain.

4.4.2. Performance in the original task

We tested the final policies trained via MAGAIL and IPPO for the leader AUV and follower AUV in the original task. As the performance of the MAGAIL Suboptimal agent is similar to that of the IPPO agent, we only tested and compared the MAGAIL Optimal agent to the IPPO agent. Figs. 9 and 10 show the local observations, including the distance from the target route (or leader AUV), heading deviation and distance from obstacles, and the trajectories of the leader and follower AUV in the testing process.

As can be seen from Figs. 9 and 10, both the leader AUV and the follower AUV trained via MAGAIL and IPPO can complete the original formation control and obstacle avoidance task. The leader AUV can generally track the target path and only moves away from the path when avoiding obstacles, as shown in Fig. 9(a). The follower AUV can generally follow the leader AUV and keep a distance of around 15 m, and the distance fluctuated around this value when avoiding obstacles, as shown in Fig. 9(d). When the follower AUV detected an obstacle, it prioritized obstacle avoidance and continued tracking the leader AUV after successfully avoiding the obstacle, which can be seen from the dramatically increased or decreased heading deviation close to obstacles and the small fluctuation around 0 at other places in Fig. 9(e). Moreover, the trajectory, distance from the target route (or leader AUV), heading deviation, and distance from obstacles of both the leader AUV and follower AUV trained via MAGAIL are quite similar to the expert demonstrations. The MAGAIL agent even allows both the leader AUV and follower AUV to avoid obstacles at a closer distance than the expert demonstrations, as shown in
Fig. 9. The distance from the target route (or leader AUV), heading deviation and distance from obstacles of the leader AUV (a,b,c) and follower AUV (d,e,f) during the process of testing the final trained policies via MAGAIL and IPPO respectively in the original task.
better performance than the ones trained via traditional deep reinforcement learning from a fine-tuned reward function (i.e., IPPO), and both AUVs trained via MAGAIL can finish the formation control and obstacle avoidance task with a shorter trajectory than those trained via IPPO.
Fig. 11. The distance from the target route (or leader AUV), heading deviation and distance from obstacles of the leader AUV (a,b,c) and follower AUV (d,e,f) during evaluating the generalization of the final trained policies via MAGAIL and IPPO in the original task to a new and complex task with continuous obstacles.
Fig. 12. Tested trajectories of the leader and follower AUV with the final control policies of the previous formation control and obstacle avoidance task trained via MAGAIL and IPPO in two new and complex tasks with continuous obstacles and dense closure obstacles respectively.
in each task for 100 episodes, and the number of times that both AUVs successfully finished the task was used to calculate the success rate. As shown in Table 3, control policies of multi-AUV trained via MAGAIL can complete the original task and the new task with continuous obstacles with a 100% success rate, and the complex new task with dense closure obstacles with a 96% success rate. In contrast, although the success rates of AUVs trained via IPPO in the original task and the new task with continuous obstacles also reach 100%, they successfully finish the new and complex task with dense closure obstacles in only 43% of the tests.

Table 3
Success rate of completing different tasks.

Task configuration            Algorithm   Success rate
Original                      MAGAIL      100%
                              IPPO        100%
Continuous obstacles          MAGAIL      100%
                              IPPO        100%
Dense and closure obstacles   MAGAIL      96%
                              IPPO        43%
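The success-rate figures of Table 3 correspond to a simple count over 100 test episodes in which both AUVs must finish the task, for example as below; run_task_episode and its boolean return values are our assumed test harness, not the paper's code.

```python
def success_rate(run_task_episode, episodes=100):
    """Fraction of test episodes in which both the leader and the follower AUV succeed.

    run_task_episode() is assumed to return a (leader_ok, follower_ok) pair of booleans.
    """
    successes = sum(1 for _ in range(episodes) if all(run_task_episode()))
    return successes / episodes
```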
In summary, these results indicate that multi-AUV trained via MAGAIL with demonstrated trajectories can generalize better to different and complex tasks than those trained with traditional deep reinforcement learning from a fine-tuned reward function (i.e., IPPO).

4.4.4. Effect of time delay in communication

The above three tasks all require a 15 m distance between the leader AUV and the follower AUV, which is within the detection range of the sensors (25 m) on each AUV. We assumed the follower AUV can obtain the position of the leader AUV with its detecting sensors immediately, which was used to calculate g_F, the distance from the follower AUV to the leader AUV, and the heading deviation a_F. The leader AUV does not need to receive the position information of the follower AUV, since it does not need to track the follower AUV and the DTDE framework was adopted. However, in complex tasks,

Fig. 13. The distance from the target route (or leader AUV), heading deviation and distance from obstacles of the leader AUV (a,b,c) and follower AUV (d,e,f) during evaluating the generalization of the final trained policies via MAGAIL and IPPO in the original task to a new and complex task with dense closure obstacles.
underwater environments, and compared to traditional deep reinforcement learning from fine-tuned reward functions (IPPO). In addition, to evaluate the generalization of our method, we saved the control policies trained in the original task and tested them in two new and complex tasks with continuous obstacles and dense closure obstacles respectively. Our experimental results show that multi-AUV trained via MAGAIL can achieve a better performance than those trained via traditional deep reinforcement learning from fine-tuned reward functions (IPPO). Moreover, control policies trained via MAGAIL in simple tasks can generalize better to complex tasks than those trained via IPPO.

Future work will focus on investigating methods for learning from imperfect demonstrations and on integrating our method with sim2real methods for multi-agent reinforcement learning, in order to transfer and verify the proposed method on physical AUVs in real marine environments (Schwab et al., 2020; Zhao et al., 2020; Juan et al., 2021). Moreover, we would like to take disturbances and communication constraints into account to facilitate multi-AUV performing large-scale and complex tasks.

CRediT authorship contribution statement

Zheng Fang: Methodology, Investigation, Data curation, Formal analysis, Writing – original draft. Dong Jiang: Methodology. Jie Huang: Conceptualization. Chunxi Cheng: Validation. Qixin Sha: Software. Bo He: Writing – review & editing. Guangliang Li: Writing – review & editing, Supervision, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work was supported by the Natural Science Foundation of China (under grant No. 51809246).
References

Arulkumaran, K., Deisenroth, M.P., Brundage, M., Bharath, A.A., 2017. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 34 (6), 26–38.
Busoniu, L., Babuska, R., De Schutter, B., 2008. A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 38 (2), 156–172.
Chen, G., 2019. A new framework for multi-agent reinforcement learning – centralized training and exploration with decentralized execution via policy distillation. arXiv preprint arXiv:1910.09152.
Cheng, C., Sha, Q., He, B., Li, G., 2021. Path planning and obstacle avoidance for AUV: A review. Ocean Eng. 235, 109355.
Creswell, A., White, T., Dumoulin, V., Arulkumaran, K., Sengupta, B., Bharath, A.A., 2018. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 35 (1), 53–65.
da Silva, J.E., Terra, B., Martins, R., de Sousa, J.B., 2007. Modeling and simulation of the LAUV autonomous underwater vehicle. In: 13th IEEE IFAC International Conference on Methods and Models in Automation and Robotics. Szczecin, Poland.
de Witt, C.S., Gupta, T., Makoviichuk, D., Makoviychuk, V., Torr, P.H., Sun, M., Whiteson, S., 2020. Is independent learning all you need in the StarCraft multi-agent challenge? arXiv preprint arXiv:2011.09533.
Desai, J.P., Ostrowski, J.P., Kumar, V., 2001. Modeling and control of formations of nonholonomic mobile robots. IEEE Trans. Robot. Autom. 17 (6), 905–908.
Fang, B., Jia, S., Guo, D., Xu, M., Wen, S., Sun, F., 2019. Survey of imitation learning for robotic manipulation. Int. J. Intell. Robot. Appl. 3 (4), 362–369.
Fossen, T.I., 2011. Handbook of Marine Craft Hydrodynamics and Motion Control. John Wiley & Sons.
Gao, Z., Guo, G., 2018. Fixed-time leader-follower formation control of autonomous underwater vehicles with event-triggered intermittent communications. IEEE Access 6, 27902–27911.
Ho, J., Ermon, S., 2016. Generative adversarial imitation learning. Adv. Neural Inf. Process. Syst. 29, 4565–4573.
Ho, J., Gupta, J., Ermon, S., 2016. Model-free imitation learning with policy optimization. In: Proceedings of International Conference on Machine Learning (ICML). PMLR, pp. 2760–2769.
Huang, H., Sheng, C., Wu, J., Wu, G., Zhou, C., Wang, H., 2021. Hydrodynamic analysis and motion simulation of fin and propeller driven manta ray robot. Appl. Ocean Res. 108, 102528.
Hussein, A., Gaber, M.M., Elyan, E., Jayne, C., 2017. Imitation learning: A survey of learning methods. ACM Comput. Surv. 50 (2), 1–35.
Juan, R., Huang, J., Gomez, R., Nakamura, K., Sha, Q., He, B., Li, G., 2021. Shaping progressive net of reinforcement learning for policy transfer with human evaluative feedback. In: Proceedings of IEEE International Conference on Intelligent Robots and Systems (IROS), pp. 1281–1288.
Kim, D., Moon, S., Hostallero, D., Kang, W.J., Lee, T., Son, K., Yi, Y., 2019. Learning to schedule communication in multi-agent reinforcement learning. In: Proceedings of International Conference on Learning Representations (ICLR).
Kim, W., Park, J., Sung, Y., 2021. Communication in multi-agent reinforcement learning: Intention sharing. In: Proceedings of International Conference on Learning Representations (ICLR).
Kober, J., Bagnell, J.A., Peters, J., 2013. Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 32 (11), 1238–1274.
Li, X., Zhu, D., Qian, Y., 2014. A survey on formation control algorithms for multi-AUV system. Unmanned Syst. 2 (04), 351–359.
Liang, T., Lin, Y., Shi, L., Li, J., Zhang, Y., Qian, Y., 2020. Distributed vehicle tracking in wireless sensor network: A fully decentralized multiagent reinforcement learning approach. IEEE Sensors Lett. 5 (1), 1–4.
Manhães, M.M.M., Scherer, S.A., Voss, M., Douat, L.R., Rauschenbach, T., 2016. UUV Simulator: A Gazebo-based package for underwater intervention and multi-robot simulation. In: OCEANS 2016 MTS/IEEE Monterey. IEEE, pp. 1–8.
Ng, A.Y., Russell, S.J., 2000. Algorithms for inverse reinforcement learning. In: Proceedings of International Conference on Machine Learning (ICML), Vol. 1, p. 2.
Oliehoek, F.A., Spaan, M.T., Vlassis, N., 2008. Optimal and approximate Q-value functions for decentralized POMDPs. J. Artificial Intelligence Res. 32, 289–353.
Paull, L., Saeedi, S., Seto, M., Li, H., 2013. AUV navigation and localization: A review. IEEE J. Ocean. Eng. 39 (1), 131–149.
Qie, H., Shi, D., Shen, T., Xu, X., Li, Y., Wang, L., 2019. Joint optimization of multi-UAV target assignment and path planning based on multi-agent reinforcement learning. IEEE Access 7, 146264–146272.
Ren, W., Sorensen, N., 2008. Distributed coordination architecture for multi-robot formation control. Robot. Auton. Syst. 56 (4), 324–333.
Ross, S., Bagnell, D., 2010. Efficient reductions for imitation learning. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, pp. 661–668.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Schwab, D., Zhu, Y., Veloso, M., 2020. Tensor action spaces for multi-agent robot transfer learning. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, pp. 5380–5386.
Sharma, P.K., Fernandez, R., Zaroukian, E., Dorothy, M., Basak, A., Asher, D.E., 2021. Survey of recent multi-agent reinforcement learning algorithms utilizing centralized training. In: Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications III, Vol. 11746. International Society for Optics and Photonics, p. 117462K.
Song, J., Ren, H., Sadigh, D., Ermon, S., 2018. Multi-agent generative adversarial imitation learning. arXiv preprint arXiv:1807.09936.
Spaan, M.T., 2012. Partially observable Markov decision processes. In: Reinforcement Learning. Springer, pp. 387–414.
Suryendu, C., Subudhi, B., 2020. Formation control of multiple autonomous underwater vehicles under communication delays. IEEE Trans. Circuits Syst. II: Express Briefs 67 (12), 3182–3186.
Sutton, R., Barto, A., 2018. Reinforcement Learning: An Introduction. MIT Press.
Wang, C., Wei, L., Wang, Z., Song, M., Mahmoudian, N., 2018. Reinforcement learning-based multi-AUV adaptive trajectory planning for under-ice field estimation. Sensors 18 (11), 3859.
Xin, B., Zhang, J., Chen, J., Wang, Q., Qu, Y., 2021. Overview of research on transformation of multi-AUV formations. Complex Syst. Model. Simul. 1 (1), 1–14.
Xu, J., Huang, F., Wu, D., Cui, Y., Yan, Z., Zhang, K., 2021. Deep reinforcement learning based multi-AUVs cooperative decision-making for attack–defense confrontation missions. Ocean Eng. 239, 109794.
Yan, Z.-p., Liu, Y.-b., Yu, C.-b., Zhou, J.-j., 2017. Leader-following coordination of multiple UUVs formation under two independent topologies and time-varying delays. J. Central South Univ. 24 (2), 382–393.
Yang, E., Gu, D., 2004. Multiagent Reinforcement Learning for Multi-Robot Systems: A Survey. Technical Report.
Yang, Y., Xiao, Y., Li, T., 2021. A survey of autonomous underwater vehicle formation: Performance, formation control, and communication capability. IEEE Commun. Surv. Tutor. 23 (2), 815–841.
Zhang, C., Lesser, V., 2013. Coordinating multi-agent reinforcement learning with limited communication. In: Proceedings of the International Conference on Autonomous Agents and Multi-Agent Systems, pp. 1101–1108.
Zhang, G., Yu, W., Li, J., Zhang, X., 2021a. A novel event-triggered robust neural formation control for USVs with the optimized leader–follower structure. Ocean Eng. 235, 109390.
Zhang, K., Yang, Z., Liu, H., Zhang, T., Basar, T., 2018. Fully decentralized multi-agent reinforcement learning with networked agents. In: Proceedings of International Conference on Machine Learning (ICML). PMLR, pp. 5872–5881.
Zhang, K., Yang, Z., Başar, T., 2021b. Multi-agent reinforcement learning: A selective overview of theories and algorithms. In: Handbook of Reinforcement Learning and Control. Springer, pp. 321–384.
Zhang, Q., Lin, J., Sha, Q., He, B., Li, G., 2020. Deep interactive reinforcement learning for path following of autonomous underwater vehicle. IEEE Access 8, 24258–24268.
Zhang, Y., Li, Y., Sun, Y., Zeng, J., Wan, L., 2017. Design and simulation of X-rudder AUV's motion control. Ocean Eng. 137, 204–214.
Zhang, Z., 2018. Improved Adam optimizer for deep neural networks. In: Proceedings of IEEE/ACM 26th International Symposium on Quality of Service (IWQoS). IEEE, pp. 1–2.
Zhao, W., Queralta, J.P., Westerlund, T., 2020. Sim-to-real transfer in deep reinforcement learning for robotics: A survey. In: IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, pp. 737–744.