Large-Scale Deep Reinforcement Learning
Abstract
We present synchronous GPU clusters of independently trained deep Q-networks
(DQNs), which are bootstrap aggregated (bagged) for action selection in Atari 2600
games. Our work extends the Torch/LuaJIT code released by Google DeepMind on
February 26, 2015 [1,2]. We made heavy use of Amazon Web Services g2.2xlarge GPU
instances for processing hundreds of gigabytes across hundreds of convolutional
neural networks (CNNs), with hundreds of thousands of CUDA cores, over hundreds of
hours; and we used a high-I/O, SSD-backed m3.2xlarge instance for master
orchestration of the CNN slaves. To scale our bagging architecture [3], we built on
CfnCluster-0.0.20, the high-performance computing framework from AWS Labs [4].
We tested architectures of up to 18 GPUs playing 2 games. We followed
Dean et al.'s warmstarting of large-scale deep neural networks [5] to radically
constrain the global space of exploration by asynchronous learners. This meant
pre-training a parent network for 24 hours, and then using its parameters to
initialize the weights of child networks, which were trained for a further 36 hours
without any communication of weight gradients to each other.
Thus far, the warmstart technique for constraining the exploration of asynchronous
learners has been too effective. The child DQNs learn enormously beyond their
parents, with nearly a 4x improvement in average score, but what they learn is
nearly identical: at any given point in time, ensembles of 1, 3, 6, 12, and 18 DQNs
always average out to the same actions, because together the children believe the
same things. This suggests that we should experiment with the following calibrations:
(i) pre-train the warmstart parents for less time, (ii) increase the learning
rates of the children, and (iii) increase the probability of epsilon-greedy exploration
by the children. These changes should cause the asynchronous learners to diverge and
discover unique features during training. Experiments are in progress to validate
this and our framework for large-scale deep reinforcement learning.
Introduction
For an input state $s_t$ at time $t$, the above term is maximized when the expected sum of estimated
future rewards $r_{t+i}$ is maximized, where rewards further into the future are exponentially discounted
in value by a factor $\gamma$ (e.g. 0.9). One of the innovations of the DQN framework is that the action-value
selector is approximated by the weight parameters $\theta$ of a neural network with a linear rectifier
output: $Q(s, a; \theta) \approx Q^{*}(s, a)$. This opens the problem up to broad solutions from deep learning
research, including the application of CNNs, which have dramatically increased the accuracy
of machine learning tasks in recent years; they are a natural fit as a state-space function approximator
for raw image data, which must be pre-processed in reinforcement learning applications like
Q-networks [7,8,9,10,11].
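Concretely, the quantity being maximized is the standard optimal action-value function of Q-learning; the form below follows Mnih et al. [2] and is included here as a point of reference:

$$Q^{*}(s, a) \;=\; \max_{\pi}\; \mathbb{E}\!\left[\, \sum_{i \ge 0} \gamma^{i}\, r_{t+i} \;\middle|\; s_t = s,\; a_t = a,\; \pi \right],$$

i.e. the maximum expected discounted return achievable by any policy $\pi$ after observing state $s$ and taking action $a$.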
In the DQN architecture introduced by Mnih et al. [2], images of the screen displayed in an Atari
video game are collected from the Arcade Learning Environment (ALE) [12]; the network then
processes each image and outputs an action that is sent back to the ALE game agent. For the
network to select an action, the raw image is first linearly downscaled from 210 × 160 pixels to
84 × 84 pixels. It is then processed by a CNN with 3 layers, followed by 2 fully connected linear
rectifier layers, the last of which outputs one of the 18 possible actions available to an Atari agent.
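For concreteness, a minimal Torch7/LuaJIT sketch of such a forward network follows. The filter counts and layer sizes are assumptions taken from the Nature DQN of Mnih et al. [2], not from the released DeepMind code, and only greedy action selection on a single preprocessed frame stack is shown.

require 'nn'

local n_actions = 18                                  -- full Atari 2600 action set

local net = nn.Sequential()
-- three convolutional layers over a stack of 4 preprocessed 84 x 84 frames
net:add(nn.SpatialConvolution(4, 32, 8, 8, 4, 4))     -- 4x84x84  -> 32x20x20
net:add(nn.ReLU())
net:add(nn.SpatialConvolution(32, 64, 4, 4, 2, 2))    -- 32x20x20 -> 64x9x9
net:add(nn.ReLU())
net:add(nn.SpatialConvolution(64, 64, 3, 3, 1, 1))    -- 64x9x9   -> 64x7x7
net:add(nn.ReLU())
-- two fully connected linear rectifier layers; the last outputs one Q-value per action
net:add(nn.View(64 * 7 * 7))
net:add(nn.Linear(64 * 7 * 7, 512))
net:add(nn.ReLU())
net:add(nn.Linear(512, n_actions))

-- greedy action selection for one (random, placeholder) frame stack
local q = net:forward(torch.rand(4, 84, 84))
local _, action = q:max(1)
print(action[1])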
Our contribution is to scale up this framework to benefit from large-scale distributed computing,
based on research showing that voting among ensembles of independently trained machine learning
models, also known as bootstrap aggregation or bagging, generally improves the performance and
stability of statistical classifiers such as deep networks [5,13,14,15].
Engineering of DQN Ensembles on GPU Clusters
To build GPU clusters for bagging ensembles of DQNs, we use GPU machines (G2 instances)
from Amazon Web Services [16]. Each instance provides up to 1,536 CUDA cores and 4 GB of
video memory, as well as at least 8 vCPUs and 60 GB of SSD storage. We installed NVIDIA
drivers to set up the machine image (i.e. development environment) for GPU computing [17].
Ultimately, allocating DQNs with the same parameters as Mnih et al. [2], we found that we
could safely place 2 isolated DQNs on the same GPU without (too often) encountering memory
access issues. To enable the master to quickly orchestrate connections across all DQNs, we
customized an instance (m3.2xlarge) with an SSD volume provisioned at 30 I/O operations per
second for each of its 24 GB, i.e. 720 IOPS. This was important, as was placing all machines
in the same subnet, which helped ensure they were as physically close together as possible
and thus minimized latency. To coordinate all of this in a secure and robust way, we turned
to CfnCluster-0.0.20 [4], the open-source framework being designed by AWS Labs specifically
for high-performance cluster computing on their services. We find this framework to be
promising, if still young, infrastructure to build upon.
With the computing infrastructure ready, we turned to the code that Google DeepMind released
for academic evaluation on February 26, 2015 [18]. Our current architecture built around this
code can be found on GitHub [3]. As required by the license for using this code, our application
and disclosure of work relating to it is only for academic evaluation. Meanwhile, we found no
strong open-source framework specifically for cluster computing of deep networks, so we take a
first step in releasing a general architecture that may eventually benefit the deep learning
R&D community.
Code for the DQN was written in the scientific computing framework Torch, in LuaJIT [1].
The use of Torch versus, e.g., Theano is a choice that generally makes sense for a deep learning
researcher in pursuit of the fastest run times: for the same computations, Torch has been shown
to be faster than Theano, sometimes by a factor of two or three, and increasingly so as training
set sizes grow [1]. This is mainly because Python has many inefficiencies in memory management,
including in Theano [19], which were designed to ease application development; Torch, by contrast,
is based on LuaJIT, a minimal just-in-time compiler for Lua with a lightweight C interface, giving
versatile and efficient extensibility across platforms.
However, we discovered that using LuaJIT comes at the cost of very limited documentation and
library maturity. In our particular case, we ultimately had to severely constrain the scalability
of our current architecture because LuaJIT lacks mature, effective libraries for true multi-threading
over dynamically generated data. This trade-off forces our master controller to query slaves for
their action votes sequentially rather than in a truly parallel way. Addressing this is a top
priority for ensuing work, because the constraint breaks real-time gameplay once the ensemble of
DQNs grows too large. In our currently limited architecture, run-time grows linearly with the
number of DQNs, instead of, e.g., logarithmically.
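As a rough illustration of that scaling (the quantities $t_{\text{pre}}$ for preprocessing, $t_{\text{query}}$ for a single slave round-trip, and $t_{\text{combine}}$ for merging two partial votes are notational placeholders, not measurements), the per-frame decision latency under sequential polling of $n$ slaves behaves as

$$T_{\text{frame}}(n) \;\approx\; t_{\text{pre}} + n\, t_{\text{query}} \;=\; \Theta(n),$$

whereas querying all slaves concurrently and combining votes with a tree-structured reduction would give roughly $t_{\text{pre}} + t_{\text{query}} + t_{\text{combine}} \lceil \log_2 n \rceil = \Theta(\log n)$.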
The high-level abstraction of how our ensemble architecture works is described in Figure 1. After
training several DQNs, we synchronize their beliefs across a cluster of computers. The basic idea is
to have a single master agent playing an Atari 2600 game; the master is responsible for actually
inputting the next action to ALE and getting the next state (image) back from ALE. For every state,
the master first preprocesses the image by downscaling it to 84 × 84 pixels, and then sends that
image out to all of its slaves. Each slave is a unique DQN with its own set of weight parameters.
When a slave receives an image, it forward-propagates it through its CNN and fully connected linear
rectifier units to output its belief of what the best action is. It then sends this action belief
back to the master, which collects the suggested actions from all of the DQNs. In our current
architecture, the master simply takes a majority vote to choose the most popular action. The testing
process repeats for a fixed number of frames, and the score after each frame is recorded to see how
the ensemble performs across a large number of episodes.
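A minimal LuaJIT sketch of this per-frame voting step is given below; the slave.query(frame) call that returns a slave's preferred action is a hypothetical stand-in for whatever transport the master uses to reach its slaves, not the interface of the released code.

-- The master's per-frame decision: query each slave DQN in sequence for its
-- preferred action, then take a simple majority vote.
local function majority_vote(slaves, frame)
  local counts, best_action, best_count = {}, nil, 0
  for _, slave in ipairs(slaves) do
    -- sequential query: the linear-time bottleneck discussed above
    local action = slave.query(frame)
    counts[action] = (counts[action] or 0) + 1
    if counts[action] > best_count then
      best_action, best_count = action, counts[action]
    end
  end
  return best_action
end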
We followed Dean et al.'s warmstarting of large-scale deep neural networks [5] to radically
constrain the global space of exploration by asynchronous learners. This meant pre-training a
parent network for 24 hours, and then using its parameters to initialize the weights of child
networks, which were trained for a further 36 hours without any communication of weight
gradients to each other.
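A minimal Torch7 sketch of this warmstart step, assuming the parent network is saved with torch.save after its 24-hour pre-training run (file names here are hypothetical):

require 'nn'

-- load the parent DQN saved after pre-training
local parent = torch.load('parent_dqn_24h.t7')
-- each child starts from an exact copy of the parent's weights...
local child = parent:clone()
-- ...and is then trained independently for a further 36 hours,
-- with no gradient exchange between children
torch.save('child_dqn_init.t7', child)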
[Figure: results for Breakout (92%, 96%, 93%) and PacMan (99%, 100%, 98%).]

Average score by ensemble size (identical across sizes, because the child DQNs vote for the same actions):

Ensemble size    1        3        6        12       18
Breakout         39.48    39.48    39.48    39.48    39.48
PacMan           74.09    74.09    74.09    74.09    74.09
[11] Sermanet, Pierre, et al. Overfeat: Integrated recognition, localization and detection using convolutional
networks. arXiv preprint arXiv:1312.6229 (2013).
[12] Bellemare, M. et al. The arcade learning environment: An evaluation platform for general agents. J.
Artif. Intell. Res. 47, 253-279 (2013).
[13] Breiman, Leo. Bagging predictors. Machine learning 24.2 (1996): 123-140.
[14] Schwenk, Holger, and Yoshua Bengio. Boosting neural networks. Neural Computation 12.8 (2000):
1869-1887.
[15] Ha, Kyoungnam, Sungzoon Cho, and Douglas MacLachlan. Response models based on bagging neural
networks. Journal of Interactive Marketing 19.1 (2005): 17-30.
[16] Amazon Web Services. Linux GPU Instances. Documentation on GPU machines.
[17] NVIDIA. Amazon Machine Image for GPU Computing. Development environment for NVIDIA GPUs on AWS.
[18] Google DeepMind. Code for Human-Level Control through Deep Reinforcement Learning. Deep Q-networks in LuaJIT.
[19] DeepLearning.net. Theano 0.7 Tutorial: Python Memory Management. Analysis of memory management by Python for Theano.