Under review as a conference paper at ICLR 2020
STRUCTURAL MULTI-AGENT LEARNING
Anonymous authors
Paper under double-blind review
ABSTRACT
In this paper, we propose a multi-agent learning framework to model communication in complex multi-agent systems. Most existing multi-agent reinforcement learning methods require agents to exchange information with the environment or a global manager to achieve effective and efficient interaction. Instead, we model the multi-agent system with an online adaptive graph in which all agents communicate with each other through the edges. We update the graph network with a relation system that takes the current graph network and the hidden variables of the agents as input. Messages and rewards are shared through the graph network, and the whole system is optimized via the policy gradient algorithm. Experimental results on several multi-agent systems show the efficiency of the proposed method and its strength compared to existing methods in cooperative scenarios.
1 INTRODUCTION
Reinforcement learning (Shu & Tian, 2018) has been making significant progress in many applications, including playing video games (Mnih et al., 2013; Silver et al., 2016), text generation (Guo, 2015; Li et al., 2016), and robotic control (Tedrake et al., 2004; Levine et al., 2016). These improvements in single-agent learning have enabled agents to deal with tasks in high-dimensional spaces, where an agent can fulfill its goals by modeling the environment as a whole (Doya et al., 2002). However, numerous applications are concerned with the interaction of multiple agents (Matignon et al., 2012; Mordatch & Abbeel, 2018; Leibo et al., 2017). The relationships between agents are complicated, as cooperation (Peng et al., 2017; Olfati-Saber et al., 2007) and competition (Zheng et al., 2018) with each other bring uncertainty and complexity. The ideas of deep reinforcement learning have been explored for multi-agent systems (Lowe et al., 2017). Unfortunately, conventional reinforcement learning approaches such as Q-learning or policy gradient with independently learning agents are poorly suited to multi-agent environments (Matignon et al., 2012).
Most existing multi-agent learning methods focus on collaboration among learning agents. In some multi-agent systems, success has been achieved by introducing a centralized controller (Shu & Tian, 2018; Hong et al., 2018). When the number of agents increases, researchers reduce the computational complexity by approximating the average effect of the overall population or of neighboring agents (Yang et al., 2018). Others develop methods that enable agents to learn to coordinate on their own (Morcos et al., 2018; Sukhbaatar et al., 2017; Perolat et al., 2017). These methods, however, fail to characterize systems of diverse agents as a whole and can hardly reveal the relationships between agents without any prior information. Beyond these challenges, a general dilemma is that none of the existing algorithms deals with an indefinite number of agents, which is a typical attribute of most dynamic systems.
In this work, we propose a general structural multi-agent learning framework (SMAL) to address these challenges. As shown in Figure 1, we tackle multi-agent reinforcement learning under a setting where each agent directly interacts with a finite set of other agents. Through a chain of direct interactions, any pair of agents is interconnected globally. Unlike agents that only have access to local information, our agents update their policies by considering communication with each other. Each agent has a latent variable that stands for its role in the whole system, and the hidden variables of every pair of agents determine the online adaptive graph network that stores their relations.
The online graph explicitly represents the connections between agents through the weights of its edges and informs agents with whom they should collaborate.
Figure 1: The main framework of the proposed Structural Multi-Agent Learning. Given the states of the multi-agent system at frame t, all agents communicate with each other through the edges rather than uploading messages to the environment or a global manager, which achieves effective and efficient interaction. Then each agent makes a decision according to its current state and the received messages. Finally, we update the graph network with a relation system that takes the current graph network and the hidden variables as inputs.
Edges with positive weight indicate that the two agents should work together, while a negative weight suggests that they are competitors. As we use a relation network to infer the edges between agents, the graph adapts even when the number of agents fluctuates. Agents can only communicate with those adjacent on the graph, and their messages are propagated to connected agents. For highly complex systems with numerous agents, we reduce the complexity of the graph network by limiting the global communication and control cost. Finally, we empirically show the success of our approach compared to existing methods in various cooperative scenarios, where agents can discover optimal coordination strategies.
The main contributions of this paper are:
(1) We propose a general framework for multi-agent learning that models the relationships between agents with a graph network.
(2) Messages and rewards are shared through the graph to achieve efficient cooperation.
2 RELATED WORK
Multi-agent reinforcement learning. Single-agent reinforcement learning algorithms have achieved significant success in several applications. However, it is difficult to adapt these methods to multi-agent systems (Matignon et al., 2012): as the environments are non-stationary, independent-learning policy gradient methods perform poorly (Lowe et al., 2017). Under cooperative settings, a centralized critic has proved effective (Shu & Tian, 2018; Hong et al., 2018). Other methods encourage cooperation via the sharing of policy parameters (Gupta et al., 2017). The graph in our work is a variant of a centralized controller for the messages between agents; the difference lies in that the policies of the agents are learned separately.
Communication in multi-agent systems. Reinforcement learning provides a way to study how communication emerges in multi-agent systems. The pioneering work of Foerster et al. (2016) is considered a first attempt at learning communication and language with deep learning approaches. Under cooperative settings, several communication architectures (Cao et al., 2018; Das et al., 2018; Kluge, 1983; Singh et al., 2018; Sukhbaatar et al., 2016) allow agents to achieve targeted and profitable communication. Several approaches (Choi et al., 2018; Lazaridou et al., 2018; Bogin et al., 2018) train agents to learn from raw pixel data and communicate through discrete symbols. Social influence can also be interpreted as a kind of communication (Jaques et al., 2019). Other frameworks use early and curiosity-driven development to build a unified model of speech and tool use (Oudeyer & Smith, 2016; Foerster et al., 2016). Unlike prior approaches to targeted communication, we propose a structural communication system in which agents exchange messages with their targets through a graph.
Relational forward models. Relational reasoning has received considerable attention in recent years, and researchers have attempted to predict the forward dynamics of multi-agent systems with a rich relational structure (Zheng et al., 2016). Deep learning models that operate on graphs have been developed to provide insights into the relational structures behind the data. The relational forward model (Tacchetti et al., 2018), based on graph networks (Battaglia et al., 2018), embeds RFM modules inside agents and updates the graph network with supervised learning; their model takes the state of the environment as input and outputs either an action prediction for each agent or a prediction of the cumulative reward each agent will receive. The graph in our framework can be viewed as a relational forward model and is updated by policy gradient flows.
3 BACKGROUND
Multi-agent Markov games. In this paper, we consider multi-agent partially observable Markov games (Littman, 1994). Under multi-agent settings, a Markov game can be defined by $\langle \mathcal{S}, \mathcal{T}, \mathcal{A}, \gamma \rangle$. Given the environment state $s_t \in \mathcal{S}$, each agent chooses an action $a_t \in \mathcal{A}$. The actions of all agents are combined into a joint action $a_t = [a_t^0, \ldots, a_t^N]$, producing the next state $s_{t+1}$ according to the transition function $\mathcal{T}(s_{t+1} \mid a_t, s_t)$. Each agent also receives a reward $r(s_t, a_t)$, and its target is to maximize the accumulated expected return $R_i = \sum_{t=0}^{T} \gamma^t r_t^i(s_t, a_t)$, where $\gamma$ is the discount factor and $T$ is the time horizon. As the environment is partially observable, every agent chooses its action based on its observation $o_t$ rather than $s_t$.
Policy gradient algorithms. Policy gradient methods (Sutton et al., 2000) directly update the parameters of policies by computing the gradients of value functions. Given the objective $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[r(\tau)]$, where $r$ and $\tau$ refer to rewards and trajectories, the gradient of the policy can be written as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right] \qquad (1)$$

Here $\pi_\theta(a_t \mid s_t)$ is the policy, and $s_t$ and $a_t$ are states and actions. The advantage of equation 1 is that we do not need any prior knowledge about the initial distribution or the environment. However, the gradient estimates have high variance. Under multi-agent settings, an agent's reward usually depends on the actions of other agents, which leads to even more fluctuation in the estimation process.
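To make equation 1 concrete, the sketch below forms a Monte-Carlo estimate of the policy gradient from sampled trajectories; the tabular softmax policy, the trajectory format, and the use of the discounted return from the definition above are our own illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

def softmax_policy(theta, s):
    """Tabular softmax policy pi_theta(a|s); theta has shape [n_states, n_actions]."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_gradient(theta, trajectories, gamma=0.99):
    """Monte-Carlo estimate of eq. (1): E[(sum_t grad log pi)(return)]."""
    grad = np.zeros_like(theta)
    for traj in trajectories:                      # traj: list of (s, a, r) tuples
        ret = sum(gamma ** t * r for t, (_, _, r) in enumerate(traj))
        for s, a, _ in traj:
            p = softmax_policy(theta, s)
            g = -p                                  # grad of log softmax: one-hot(a) - pi(.|s)
            g[a] += 1.0
            grad[s] += g * ret
    return grad / len(trajectories)
```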
Graph neural networks. Graph neural networks (GNNs) (Defferrard et al., 2016) extend existing neural networks to data represented in graph domains. GNNs can be categorized into spectral and non-spectral approaches. Non-spectral methods define convolutions directly on the graph, operating on spatially close neighbors. Spectral approaches work with a spectral representation of the graph: the convolution operation is defined in the Fourier domain by computing the eigendecomposition of the graph Laplacian (Zhou et al., 2018). The operation can be defined as the multiplication of a signal $x \in \mathbb{R}^N$ (a scalar for each node) with a filter $g_\theta = \mathrm{diag}(\theta)$:

$$g_\theta \star x = U g_\theta(\Lambda) U^{T} x, \qquad (2)$$

where $U$ is the matrix of eigenvectors of the normalized graph Laplacian $L$ (Chung & Graham, 1997).
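For concreteness, the following sketch applies the spectral filter of equation 2 on the normalized graph Laplacian of a small graph; the toy path graph and the exponential filter are arbitrary choices for illustration only.

```python
import numpy as np

def spectral_conv(adj, x, g):
    """Compute g_theta * x = U g(Lambda) U^T x on the normalized graph Laplacian."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt   # normalized Laplacian L
    eigvals, U = np.linalg.eigh(lap)                          # L = U Lambda U^T
    return U @ np.diag(g(eigvals)) @ U.T @ x                  # filter acts on the spectrum

# Toy usage: a 4-node path graph and a low-pass style filter g(Lambda) = exp(-Lambda).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 0.0])
y = spectral_conv(adj, x, g=lambda lam: np.exp(-lam))
```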
4 APPROACH
In this section, we will detail the proposed structural multi-agent learning method. We first define
the structure of a multi-agent system with the graph network and the interactions between agents.
Then we provide the optimization process of the whole framework.
4.1 STRUCTURAL MULTI-AGENT MODELING
Consider a multi-agent system $\{A, G\}$ with $N$ agents, where $A_i$ is the $i$-th agent and $G = \{v(A_i, A_j) \in [-1, 1]\}$ is the graph that represents the relations between agents (note that $A$ and $a$ refer to agents and actions, respectively). A positive, negative, or zero weight $v_{ij} = v(A_i, A_j)$ means that $A_i$ and $A_j$ are cooperative, competitive, or independent, respectively.

Each agent can communicate directly with the agents adjacent to it on the graph, and its own message can propagate through the edges to connected agents. Thanks to the graph neural network, agent $A_i$ obtains at frame $t$ the encoded messages that its $d$-th order neighbors sent at frame $t - d + 1$. Specifically, $d = 1$ means that agents receive instant messages from adjacent neighbors, while messages from farther agents are propagated over frames.
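As a toy illustration of this delayed propagation, the sketch below simply forwards and reweights messages along edges once per frame, so that after d frames an agent holds contributions originating from its d-th order neighbors; the real propagation in SMAL goes through the graph neural network, so this is only an assumption-laden analogy.

```python
import numpy as np

def propagate(weights, msgs, frames):
    """One hop per frame: row i of the result mixes messages that have travelled
    `frames` edges, i.e. messages that originated at i's `frames`-th order neighbors.
    weights: [N, N] edge-weight matrix, msgs: [N, msg_dim] messages at frame 0."""
    out = msgs.copy()
    for _ in range(frames):
        out = weights @ out     # every agent forwards what it currently holds
    return out
```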
4.1.1 COOPERATIVE
First, we discuss cooperative settings. In this case, the edges of the graph satisfy $v_{ij} \in [0, 1]$. A single agent $A_i$ takes its observation, its hidden variable, and the messages from other agents as inputs and generates an action according to the policy $\pi$. We also exploit a model $f$ to predict its hidden variable and the message to be sent to other agents. For different agents, the dimensions of the messages must be the same so that they can communicate with each other. We have:

$$a_t \sim \pi(o_t, h_t, m_{-t}) \qquad (3)$$

$$\{h_{t+1}, m_{t+1}\} = f(h_t, m_t, o_t, a_t) \qquad (4)$$

$$m_{-t} = \sum_{j \neq i}^{N} v_{ij} m_t^{j} \qquad (5)$$

Here $o_t$ is the observation, $h_t$ is the hidden variable, $m_t$ is the message sent to other agents, and $a_t$ is the action. In equation 5, $m_t^{j}$ is the message given by agent $j$ at time $t$, so $m_{-t}$ is a sum of reweighted messages from the other agents. We omit the subscript $i$ for clarity. For the graph $G = \{v(A_i, A_j)\}$, we use a relation network $\phi$ to infer the edge between each pair of agents, which is defined as follows:

$$G_{t+1} \sim v_{ij} = \phi(h_t^{i}, h_t^{j}, \psi(G_t)) \qquad (6)$$

where $\psi$ is a graph neural network that encodes the current graph $G_t$ into a vector serving as the global representation of the current multi-agent system.
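A minimal sketch of one communication step (equations 3-6) is given below. The two-layer 128-unit ReLU MLPs and the length-10 hidden/message vectors follow the description in Appendix A.2, but the Gaussian action noise, the treatment of $\psi(G_t)$ as an externally supplied encoding, and all remaining details are our own assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class Agent(nn.Module):
    """One SMAL-style agent: policy pi (eq. 3) and predictor f (eq. 4)."""
    def __init__(self, obs_dim, act_dim, hid_dim=10, msg_dim=10):
        super().__init__()
        self.pi = nn.Sequential(nn.Linear(obs_dim + hid_dim + msg_dim, 128),
                                nn.ReLU(), nn.Linear(128, act_dim))
        self.f = nn.Sequential(nn.Linear(obs_dim + hid_dim + msg_dim + act_dim, 128),
                               nn.ReLU(), nn.Linear(128, hid_dim + msg_dim))
        self.hid_dim = hid_dim

    def act(self, o, h, m_in):
        mean = self.pi(torch.cat([o, h, m_in], dim=-1))
        return torch.distributions.Normal(mean, 1.0).sample()      # a_t ~ pi(o_t, h_t, m_-t)

    def step(self, o, h, m_out, a):
        out = self.f(torch.cat([o, h, m_out, a], dim=-1))
        return out[..., :self.hid_dim], out[..., self.hid_dim:]    # {h_{t+1}, m_{t+1}}

def aggregate_messages(V, msgs, i):
    """m_-t = sum_{j != i} v_ij m_t^j (eq. 5); V is the [N, N] edge-weight matrix."""
    w = V[i].clone()
    w[i] = 0.0
    return (w.unsqueeze(-1) * msgs).sum(dim=0)

class RelationNet(nn.Module):
    """phi(h_i, h_j, psi(G_t)) -> v_ij in [0, 1] (eq. 6); psi(G_t) is passed in as g_enc."""
    def __init__(self, hid_dim=10, g_dim=16):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(2 * hid_dim + g_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, h_i, h_j, g_enc):
        return self.phi(torch.cat([h_i, h_j, g_enc], dim=-1)).squeeze(-1)
```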
Adjusting the whole graph network requires traversing all possible active edges of the current multi-agent system, which can be computationally expensive when the system is relatively large. Since the relationships between agents may not change frequently, we can update the graph network every $T$ frames, or when the number of agents changes, to reduce the cost of graph adjustment.
4.1.2 COMPETITIVE
Under competitive settings, the edges of the graph satisfy $v_{ij} \in [-1, 1]$. We can change equation 5 as follows:

$$m_{-t} = \sum_{j \neq i}^{N} \max(v_{ij}, 0)\, m_t^{j} \qquad (7)$$
In this way, the messages from partners are considered valuable, while the messages from competitors are ignored. As competitive environments differ substantially from cooperative scenarios, we mainly focus on cooperative settings in our experiments and leave further discussion of competitive settings to the Appendix.
4.2 OPTIMIZATION
Different from traditional multi-agent learning methods, we consider the actions of each individual agent and the whole interaction network at the same time. The formulation of the proposed method contains two parts:
Algorithm 1: SMAL
Input: Training environment E, parameters λ and T, learning rate ρ, total number of iterations Γ, and convergence error ε.
Output: Parameters θ, ψ, φ.
Initialize θ, ψ, φ;
for t = 1, 2, ..., Γ do
    for every agent A_i do
        Receive observation o_t from E;
        Generate message m_t and send it to its neighbors;
        Receive messages m_{-t} from its neighbors;
        Generate action a_t;
        Interact with E according to a_t and obtain o_{t+1};
    end
    if t mod T = 0 then
        for each pair A_i, A_j in the system do
            Calculate v_ij according to (6);
        end
        Update the current graph G;
    end
    Compute J_t using (8);
    if t ≥ 1 and |J_t − J_{t−1}| < ε then
        go to Return;
    end
    Update θ, ψ, φ according to (12);
end
Return: θ, ψ, φ.
$$J = J_1 + \lambda J_2 \qquad (8)$$

where $J_1$ is the individual loss and $J_2$ is the structural loss; $\lambda$ balances the two parts. The individual loss $J_1$ is defined by the rewards of a single agent. In a multi-agent system, the purpose of each agent is to increase its own reward and that of its collaborators, so the policy gradient term needs to consider the rewards of its neighbors. We define the individual loss $J_1$ for every agent as follows:

$$J_1 = -\mathbb{E}\left[\pi(a_t \mid o_t, h_t, m_{-t})\left(R(s_t, a_t) + R_{-t}\right)\right] \qquad (9)$$

$$R_{-t} = \frac{1}{1 + \sum_{j \neq i}^{N} v_{ij}} \sum_{j \neq i}^{N} v_{ij} R(s_t^{j}, a_t^{j}) \qquad (10)$$

where $R(s_t, a_t)$ is the reward of a single agent and $R_{-t}$ is the shared reward from adjacent agents. The structural loss $J_2 = \mathbb{E}[L(G_{t+1})]$ of our method aims to minimize the global communication and control cost. In our method, $L(G_{t+1})$ is defined as follows:

$$L(G_{t+1}) = \sum_{i,j} I_{v \neq 0}(v_{ij}) \qquad (11)$$

where $I(\cdot)$ is the indicator function. With this loss, we can control the sparseness of the graph; in large-scale systems, we want to reduce unnecessary communication between agents.
Finally, we use the stochastic sub-gradient descent algorithm to update these neural networks with learning rate $\rho$. Here $\theta$ refers to the parameters of the policy $\pi$ and the predictor $f$ of every agent:

$$\theta = \theta - \rho \frac{\partial J}{\partial \theta}, \qquad \phi = \phi - \rho \frac{\partial J}{\partial \phi}, \qquad \psi = \psi - \rho \frac{\partial J}{\partial \psi}. \qquad (12)$$
Algorithm 1 summarizes the detailed procedure of our proposed SMAL method.
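As a sketch of how the objective in equation 8 could be assembled for one agent, the code below computes the shared reward of equation 10, a score-function surrogate for the individual loss of equation 9 (using log-probabilities, as in the policy gradient of Section 3), and an L1 relaxation of the indicator count in equation 11. The surrogate and the relaxation are our own simplifications for differentiability, not the paper's exact sub-gradient procedure.

```python
import torch

def shared_reward(V, rewards, i):
    """R_-t for agent i (eq. 10): reweighted rewards of its neighbors."""
    w = V[i].clone()
    w[i] = 0.0                                   # exclude the agent's own reward
    return (w * rewards).sum() / (1.0 + w.sum())

def individual_loss(log_pi, own_reward, V, rewards, i):
    """Surrogate for J1 (eq. 9), with log pi in place of pi (our assumption)."""
    target = (own_reward + shared_reward(V, rewards, i)).detach()
    return -(log_pi * target).mean()

def structural_loss(V):
    """L1 relaxation of J2 = sum_{i,j} 1[v_ij != 0] (eq. 11), kept differentiable."""
    return V.abs().sum()

def total_loss(log_pi, own_reward, V, rewards, i, lam=1.0):
    """J = J1 + lambda * J2 (eq. 8) for a single agent i."""
    return individual_loss(log_pi, own_reward, V, rewards, i) + lam * structural_loss(V)
```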
Figure 2: Illustrations of the experimental environments and the continuous tasks we consider: (a) Waterworld, (b) Multi-Walker, (c) Multi-Ant. The numbers of agents in the three environments are 4, 4, and 10, respectively.
5 EXPERIMENTS
5.1 ENVIRONMENTS
We conducted experiments on three multi-agent domains: Waterworld, Multi-Walker, and Multi-Ant (Gupta et al., 2017). All three tasks are continuous control problems with partial observation in which agents must cooperate to fulfill a certain goal. We provide details of each environment below.
Waterworld. Waterworld is a pursuit problem in the continuous domain. The environment is based on the single-agent Waterworld domain used by Ho et al. (2016). In this task, agents need to cooperate to capture moving food targets while avoiding poison targets. The agents move around by applying a two-dimensional force; they receive a positive reward for reaching a food target and a negative reward for capturing a poison target. The observation and action spaces are both continuous. Figure 2a shows an example of four agents moving toward food.
Multi-Walker. Multi-Walker is a locomotion control task based on the BipedalWalker environment from OpenAI Gym (Brockman et al., 2016). The environment contains multiple bipedal walkers that can actuate the joints in each of their legs. At the start of each simulation, a large package is placed on top of the walkers. The walkers must learn how to move forward and coordinate with the other agents to keep the package balanced while navigating complex terrain. Each agent receives a positive reward for carrying the package forward and a negative reward for falling or dropping the package. Figure 2b shows the scene with four walkers.
Multi-Ant. The multi-ant domain is a 3D locomotion task based on the robot models in (Schulman
et al., 2015b). In this domain, each leg of the ant is a separate agent. Legs can sense their positions
and velocities as well as those of their neighbors. Each leg is controlled by applying torque to its
two joints. The legs should cooperate with neighbors to help the ant move forward as quickly as
possible. The robot in Figure 2c is an ant with ten legs.
5.2 RESULTS AND ANALYSIS
Comparison with state-of-the-art methods. For comparison, we choose four baselines: Multi-agent Deep Deterministic Policy Gradient (MADDPG) (Lowe et al., 2017), Centralized MADDPG (C-MADDPG), the Individualized Controlled Continuous Communication Model (IC3Net) (Singh et al., 2018), and Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) to compare with our method, Structural Multi-agent Learning (SMAL). Among them, MADDPG learns coordinated behavior via centralized critics and decentralized executors, while Centralized MADDPG exploits joint observations and actions to train a centralized critic and a centralized actor. IC3Net is an efficient method that learns profitable communication by controlling continuous communication with a gating mechanism. TRPO is an effective single-agent reinforcement learning method. We list the implementation and comparison details in the Appendix.
We plot the learning curves for various approaches in Figure 3. The curves show that our method
(SMAL) can reach high rewards in early episodes and the optimal policies perform better than the
Figure 3: Total agent rewards on Waterworld
Figure 4: Comparison of normalized agent score
baselines. We observed that the agents trained by our method tend to stay close to nearby agents more frequently. In the Multi-Walker domain, the walkers learn to carry the package forward without letting it fall, using the messages from the other three walkers. In the Multi-Ant case, the legs learn to avoid colliding with each other. The histogram in Figure 4 shows the normalized agent reward (divided by the return of SMAL) in all three domains, which indicates that our method generalizes to these systems. Our method surpasses MADDPG in all three environments, indicating that the graph network is a better supervisor than a centralized critic. Our method also outperforms IC3Net, which focuses on efficient communication but does not include a reward-sharing mechanism. Compared to Centralized MADDPG, which trains the whole system jointly and optimizes the total reward directly, our method achieves a near-optimal result.
Ablation studies. To verify the effectiveness of sharing messages and rewards in our method,
we create two baselines that only allow agents to share either messages or rewards. To clarify
how messages and shared rewards help with the collaboration between agents, we evaluate their
effectiveness in the three domains.
Shared Messages   Shared Rewards   Waterworld   Multi-Walker   Multi-Ant
      ✗                 ✗            472.33        30.64         67.29
      ✓                 ✗            660.10        52.82        106.21
      ✗                 ✓            625.59        48.76         90.80
      ✓                 ✓            812.45        70.68        126.11

Table 1: Ablation studies for messages and shared rewards (total rewards per environment).
Table 1 shows how shared messages and rewards lead agents to cooperate with others. The model with only messages is obtained by removing the shared-reward term $R_{-t}$ defined in equation 10, while the model that only shares rewards is obtained by setting the messages to zero. When agents are not allowed to share messages or rewards, SMAL degenerates to a single-agent reinforcement learning method. The results demonstrate that the transmission of both messages and rewards is necessary for agents to achieve thorough cooperation. These results further show that when the rewards of other agents are taken into consideration, policy gradient methods can learn near-optimal policies with respect to the overall return.
Trade-off between communication costs and performance. To investigate the loss defined in equation 8 ($J = J_1 + \lambda J_2$), we alter the parameter $\lambda$ to adjust the sparseness of the graph. When $\lambda = 0$, the graph learned by SMAL is fully connected regardless of the communication cost; as $\lambda$ increases, communication between agents becomes more limited. All the experiments above are carried out with $\lambda = 0$ to achieve complete cooperation.
Figure 5 reveals the relationship between communication costs and agents' performance. When communication is limited, agents that should be collaborators are forced to act independently. Losing potential opportunities for cooperation leads to lower total rewards. How to achieve the best cooperation under limited communication is an interesting problem for future work.
Effect of the graph. To further analyze the effect of the graph in our SMAL method, we visualize how the graph changes during an episode. We use $\lambda = 1$ to ensure that some agents are independent.
Figure 5: Trade-off between communication costs and performance; the number of edges decreases as $\lambda$ increases. We choose ten values of $\lambda$ for every environment.
Figure 6 demonstrates the transformation of the graph inferred during one episode on Waterworld with four agents. Wider edges in the plot correspond to larger weights of the messages between agents; if there is no edge between two vertices, the inferred weight is zero. The top row shows the actual positions of the agents in Waterworld. Consider the case in the middle of Figure 6: since agents 1 and 4 are around agent 2, they are more likely to cooperate to capture the food targets, whereas agent 3 is far away from the other agents and keeps only a distant connection. The inferred graph is consistent with this situation, as the edges between agents 1, 2, and 4 are wider, while the edges between agent 3 and the other agents are narrower. These experimental results indicate that introducing a graph network to manage the message flows and share rewards leads to better cooperation.
Method   Waterworld   Multi-Walker   Multi-Ant
Graph      812.45        70.68         126.11
Mean       636.92        59.34         115.70

Table 2: Ablation study of the graph (total reward, $\lambda = 0$).

Figure 6: Graph inference
Quantitative improvements contributed by the graph are shown in Table 2. In the second experiment, all the weights in the graph are set to one: agents directly receive the messages from other agents and take their mean vector as the input message. With the graph modeling the proportions of messages and rewards, the agents choose their actions more intelligently.
6 CONCLUSIONS AND FUTURE WORK
In this paper, we have proposed a structural multi-agent learning framework. We tackle multi-agent reinforcement learning under a setting where each agent directly interacts with a finite set of other agents. Through a chain of direct interactions, any pair of agents is interconnected globally. Messages and rewards are shared through the global graph. Experimental results on several multi-agent systems show the efficiency of the proposed method and its strength compared to existing methods in cooperative scenarios. We also show that messages and rewards are both necessary for collaboration.
For competitive settings, agents should learn to generate distinct messages to inform their partners or to deceive their opponents. Also, when the number of agents changes dynamically, the initialization and removal of agents remain to be investigated. For many-agent systems, stability and computational complexity should be taken into consideration. We leave these investigations to future work. How to apply the proposed method to real group analysis, such as multi-object tracking, group action recognition, and autonomous driving, is an exciting future research direction.
REFERENCES
Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob
McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In NeurIPS, pp. 5048–5058, 2017.
Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi,
Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al.
Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261,
2018.
Ben Bogin, Mor Geva, and Jonathan Berant. Emergence of communication in an interactive world
with consistent speakers. arXiv preprint arXiv:1809.00549, 2018.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and
Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
Kris Cao, Angeliki Lazaridou, Marc Lanctot, Joel Z Leibo, Karl Tuyls, and Stephen Clark. Emergent
communication through negotiation. arXiv preprint arXiv:1804.03980, 2018.
Edward Choi, Angeliki Lazaridou, and Nando de Freitas. Compositional obverter communication
learning from raw visual input. arXiv preprint arXiv:1804.02341, 2018.
Fan RK Chung and Fan Chung Graham. Spectral graph theory. Number 92. American Mathematical
Soc., 1997.
Abhishek Das, Théophile Gervet, Joshua Romoff, Dhruv Batra, Devi Parikh, Michael Rabbat, and
Joelle Pineau. Tarmac: Targeted multi-agent communication. arXiv preprint arXiv:1810.11187,
2018.
Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on
graphs with fast localized spectral filtering. In NeurIPS, pp. 3844–3852, 2016.
Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri, and Mitsuo Kawato. Multiple model-based
reinforcement learning. Neural computation, 14(6):1347–1369, 2002.
Jakob Foerster, Ioannis Alexandros Assael, Nando de Freitas, and Shimon Whiteson. Learning to
communicate with deep multi-agent reinforcement learning. In NeurIPS, pp. 2137–2145, 2016.
H. Guo. Generating text with deep reinforcement learning. arXiv preprint, 2015.
Jayesh K Gupta, Maxim Egorov, and Mykel Kochenderfer. Cooperative multi-agent control using
deep reinforcement learning. In AAMAS, pp. 66–83. Springer, 2017.
Jonathan Ho, Jayesh Gupta, and Stefano Ermon. Model-free imitation learning with policy optimization. In ICML, pp. 2760–2769, 2016.
Zhang-Wei Hong, Shih-Yang Su, Tzu-Yun Shann, Yi-Hsiang Chang, and Chun-Yi Lee. A deep
policy inference q-network for multi-agent systems. In AAMAS, pp. 1388–1396. International
Foundation for Autonomous Agents and Multiagent Systems, 2018.
Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro Ortega, Dj Strouse,
Joel Z Leibo, and Nando De Freitas. Social influence as intrinsic motivation for multi-agent deep
reinforcement learning. In International Conference on Machine Learning, pp. 3040–3049, 2019.
Werner E. Kluge. Cooperating reduction machines. IEEE Transactions on Computers, 11, 1983.
Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. Emergence of linguistic communication from referential games with symbolic and pixel input. arXiv preprint
arXiv:1804.03984, 2018.
Joel Z Leibo, Vinicius Zambaldi, Marc Lanctot, Janusz Marecki, and Thore Graepel. Multi-agent
reinforcement learning in sequential social dilemmas. In AAMAS, pp. 464–473. IFAAMAS, 2017.
Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. JMLR, 17(1):1334–1373, 2016.
Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
Michael L Littman. Markov games as a framework for multi-agent reinforcement learning. In
Machine learning proceedings 1994, pp. 157–163. Elsevier, 1994.
Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent
actor-critic for mixed cooperative-competitive environments. In NeurIPS, pp. 6379–6390, 2017.
Laetitia Matignon, Guillaume J Laurent, and Nadine Le Fort-Piat. Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems. KER, 27(1):1–31,
2012.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint
arXiv:1312.5602, 2013.
Ari S Morcos, David GT Barrett, Neil C Rabinowitz, and Matthew Botvinick. On the importance of
single directions for generalization. arXiv preprint arXiv:1803.06959, 2018.
Igor Mordatch and Pieter Abbeel. Emergence of grounded compositional language in multi-agent
populations. In AAAI, 2018.
Reza Olfati-Saber, J Alex Fax, and Richard M Murray. Consensus and cooperation in networked
multi-agent systems. Proceedings of the IEEE, 95(1):215–233, 2007.
Pierre-Yves Oudeyer and Linda B Smith. How evolution may work through curiosity-driven developmental process. Topics in Cognitive Science, 8(2):492–502, 2016.
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, and Jun Wang.
Multiagent bidirectionally-coordinated nets for learning to play starcraft combat games. arXiv
preprint arXiv:1703.10069, 2, 2017.
Julien Perolat, Joel Z Leibo, Vinicius Zambaldi, Charles Beattie, Karl Tuyls, and Thore Graepel. A
multi-agent reinforcement learning model of common-pool resource appropriation. In NeurIPS,
pp. 3643–3652, 2017.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region
policy optimization. In ICML, pp. 1889–1897, 2015a.
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
Tianmin Shu and Yuandong Tian. M^3RL: Mind-aware multi-agent management reinforcement learning. arXiv preprint arXiv:1810.00147, 2018.
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
Amanpreet Singh, Tushar Jain, and Sainbayar Sukhbaatar. Learning when to communicate at scale
in multiagent cooperative and competitive tasks. arXiv preprint arXiv:1812.09755, 2018.
Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, pp. 2244–2252, 2016.
Sainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, and Rob
Fergus. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint
arXiv:1703.05407, 2017.
Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient
methods for reinforcement learning with function approximation. In NeurIPS, pp. 1057–1063,
2000.
Andrea Tacchetti, H Francis Song, Pedro AM Mediano, Vinicius Zambaldi, Neil C Rabinowitz,
Thore Graepel, Matthew Botvinick, and Peter W Battaglia. Relational forward models for multiagent learning. arXiv preprint arXiv:1809.11044, 2018.
Russ Tedrake, Teresa Weirui Zhang, and H Sebastian Seung. Stochastic policy gradient reinforcement learning on a simple 3d biped. In RSJ, volume 3, pp. 2849–2854. IEEE, 2004.
Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multiagent reinforcement learning. arXiv preprint arXiv:1802.05438, 2018.
Lianmin Zheng, Jiacheng Yang, Han Cai, Ming Zhou, Weinan Zhang, Jun Wang, and Yong Yu.
Magent: A many-agent reinforcement learning platform for artificial collective intelligence. In
AAAI, 2018.
Stephan Zheng, Yisong Yue, and Jennifer Hobbs. Generating long-term trajectories using deep
hierarchical networks. In NeurIPS, pp. 1543–1551, 2016.
Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural
networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.
A APPENDIX
A.1 COMPETITIVE SETTINGS
In this section, we provide an extended discussion of the competitive setting introduced in Section 4.1.2.
Under competitive settings, the edges of the graph satisfy $v_{ij} \in [-1, 1]$. The target of every agent is to maximize the total rewards of its collaborators and minimize the rewards of its competitors. We can change equation 5 as follows:

$$m_{-t} = \sum_{j \neq i}^{N} \max(v_{ij}, 0)\, m_t^{j} \qquad (13)$$
The equation indicates that messages from partners are considered valuable, while messages from competitors are ignored. Singh et al. (2018) also demonstrate that agents stop communicating under the competitive setting. In addition, equation 10 should be altered:

$$R_{-t} = \frac{1}{1 + \sum_{j \neq i}^{N} |v_{ij}|} \sum_{j \neq i}^{N} v_{ij} R(s_t^{j}, a_t^{j}) \qquad (14)$$
When $v_{ij}$ is negative, the equation is still a reweighted combination of the rewards of other agents. The revised policy gradient method aims to suppress the rewards of competitors and reinforce the rewards of cooperators.
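A small sketch of the competitive variants in equations 13 and 14 is given below, reusing the tensor conventions of the cooperative sketch in Section 4.1.1 (V is the N-by-N edge-weight matrix, msgs the per-agent message matrix, rewards the per-agent rewards); this is illustrative only.

```python
import torch

def aggregate_messages_competitive(V, msgs, i):
    """m_-t = sum_{j != i} max(v_ij, 0) m_t^j (eq. 13): competitors' messages are masked out."""
    w = V[i].clamp(min=0.0).clone()
    w[i] = 0.0
    return (w.unsqueeze(-1) * msgs).sum(dim=0)

def shared_reward_competitive(V, rewards, i):
    """R_-t of eq. 14: partners' rewards add, competitors' rewards enter with a negative sign."""
    w = V[i].clone()
    w[i] = 0.0
    return (w * rewards).sum() / (1.0 + w.abs().sum())
```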
However, messages in competitive settings can be more flexible. In cooperative settings, an agent sends the same message to all others. When competitors exist, agents should learn how to give out information selectively; they can even determine how to lie to competitors. How to generate signals in competitive scenarios is an interesting future direction.
A.2 IMPLEMENTATION DETAILS
We implement our structural multi-agent reinforcement learning algorithm and test its efficiency on the environments presented in Section 5.1. The number of agents in each of the three systems is four, and the maximum number of steps per episode is 500. We choose ten random seeds for every environment.
Our policies and graph inference network are both parameterized by a two-layer ReLU MLP with
128 units per layer. In cooperative environments, the output of the graph inference network is
limited to [0,1] with a sigmoid function. In all of our experiments, we use the Adam optimizer with
a learning rate of 0.001 to optimize the policy network and a learning rate of 0.01 to optimize the
graph network.
The latent variables and messages are initialized randomly with a length of 10. The edges inferred from the network serve as weights to calculate a weighted average of the messages for a single agent. We exploit a replay buffer (Andrychowicz et al., 2017) of size 500,000 with a batch size of 250. As the graph network does not need to change frequently, we update it every 10,000 steps.
For comparison, the hyperparameters of all algorithms are the same, except that for TRPO we use a batch size of 100, which leads to better performance. For IC3Net, we train with the setting the authors used for the StarCraft tasks (Singh et al., 2018).