Marco Wiering and Martijn van Otterlo (Eds.)
Reinforcement Learning: State-of-the-Art
Adaptation, Learning, and Optimization, Volume 12
Series Editor-in-Chief
Meng-Hiot Lim
Nanyang Technological University, Singapore
E-mail: [email protected]
Yew-Soon Ong
Nanyang Technological University, Singapore
E-mail: [email protected]
Editors
Dr. Marco Wiering, University of Groningen, The Netherlands
Dr. ir. Martijn van Otterlo, Radboud University Nijmegen, The Netherlands
Reinforcement learning has been a subject of study for over fifty years, but its mod-
ern form—highly influenced by the theory of Markov decision processes—emerged
in the 1980s and became fully established in textbook treatments in the latter half
of the 1990s. In Reinforcement Learning: State-of-the-Art, Martijn van Otterlo and
Marco Wiering, two respected and active researchers in the field, have commis-
sioned and collected a series of eighteen articles describing almost all the major
developments in reinforcement learning research since the start of the new millen-
nium. The articles are surveys rather than novel contributions. Each authoritatively
treats an important area of Reinforcement Learning, broadly conceived as including
its neural and behavioral aspects as well as the computational considerations that
have been the main focus. This book is a valuable resource for students wanting to
go beyond the older textbooks and for researchers wanting to easily catch up with
recent developments.
As someone who has worked in the field for a long time, I find that two things stand
out regarding the authors of the articles. The first is their youth. Of the eighteen
articles, sixteen have as their first author someone who received their PhD within
the last seven years (or who is still a student). This is surely an excellent sign for
the vitality and renewal of the field. The second is that two-thirds of the authors hail
from Europe. This is only partly due to the editors being from there; it seems to
reflect a real shift eastward in the center of mass of reinforcement learning research,
from North America toward Europe. Vive le temps et les différences!
A decade ago, the answer to our leading question (what book would you recommend
to learn about current reinforcement learning?) would have been quite easy to give;
around that time two dominant books existed that were fully up-to-date. One is the excel-
lent introduction1 to reinforcement learning by Rich Sutton and Andy Barto from
1998. This book is written from an artificial intelligence perspective, has a great ed-
ucational writing style and is widely used (around ten thousand citations at the time
of writing). The other book was written by Dimitri Bertsekas and John Tsitsiklis
in 1996 and was titled Neuro-Dynamic Programming2. Written from the standpoint
of operations research, the book rigorously and in a mathematically precise way
describes dynamic programming and reinforcement learning with a particular em-
phasis on approximation architectures. Whereas Sutton and Barto always maximize
rewards, talk about value functions and rewards, and are biased toward the {V, Q, S, A, T, R}
part of the alphabet augmented with π, Bertsekas and Tsitsiklis talk about cost-
to-go functions, always minimize costs, and settle on the {J, G, I, U} part of the
alphabet augmented with the Greek symbol μ. Despite these superficial (notational)
differences, the distinct writing styles and backgrounds, and probably also the audi-
ences for which these books were written, both tried to give a thorough introduction
1 Sutton and Barto, (1998) Reinforcement Learning: An Introduction, MIT Press.
2 Bertsekas and Tsitsiklis (1996) Neuro-Dynamic Programming, Athena Scientific.
to this exciting new research field and succeeded in doing that. At that time, the big
merge of insights in both operations research and artificial intelligence approaches
to behavior optimization was still ongoing and much fruitful cross-fertilization hap-
pened. Powerful ideas and algorithms such as Q-learning and TD-learning had been
introduced quite recently and so many things were still unknown.
For example, questions about convergence of combinations of algorithms and
function approximators arose. Many theoretical and experimental questions about
convergence of algorithms, numbers of required samples for guaranteed perfor-
mance, and applicability of reinforcement learning techniques in larger intelligent
architectures were largely unanswered. In fact, many new issues came up and in-
troduced an ever increasing pile of research questions waiting to be answered by
bright, young PhD students. And even though both Sutton & Barto and Bertsekas
& Tsitsiklis were excellent at introducing the field and eloquently describing the
underlying methodologies and issues of it, at some point the field grew so large that
new texts were required to capture all the latest developments. Hence this book, as
an attempt to fill the gap.
This book is the first book about reinforcement learning featuring only state-
of-the-art surveys on the main subareas. However, we can mention several other
interesting books that introduce or describe various reinforcement learning topics
too. These include a collection3 edited by Leslie Kaelbling in 1996 and a new edi-
tion of the famous Markov decision process handbook4 by Puterman. Several other
books5,6 deal with the related notion of approximate dynamic programming. Re-
cently additional books have appeared on Markov decision processes7 , reinforce-
ment learning8, function approximation9 and relational knowledge representation
for reinforcement learning10. These books just represent a sample of a larger num-
ber of books relevant for those interested in reinforcement learning of course.
In the past decade and a half, the field of reinforcement learning has grown
tremendously. New insights from this recent period – having much to do with
richer and firmer theory, increased applicability, scaling up, and connections to
(probabilistic) artificial intelligence, brain theory and general adaptive systems –
are not reflected in any recent book. Richard Sutton, one of the founders of modern
reinforcement learning, described11 in 1999 three distinct periods in the development
of reinforcement learning: past, present and future.
The RL past encompasses the period until approximately 1985 in which the
idea of trial-and-error learning was developed. This period emphasized the use of
an active, exploring agent and developed the key insight of using a scalar reward
signal to specify the goal of the agent, termed the reward hypothesis. The methods
usually only learned policies and were generally incapable of dealing effectively
with delayed rewards.
The RL present was the period in which value functions were formalized. Value
functions are at the heart of reinforcement learning and virtually all methods focus
on approximations of value functions in order to compute (optimal) policies. The
value function hypothesis says that approximation of value functions is the dominant
purpose of intelligence.
At this moment, we are well underway in the reinforcement learning future. Sut-
ton made predictions about the direction of this period and wrote: "Just as rein-
forcement learning present took a step away from the ultimate goal of reward to
focus on value functions, so reinforcement learning future may take a further step
away to focus on the structures that enable value function estimation [...] In psy-
chology, the idea of a developing mind actively creating its representations of the
world is called constructivism. My prediction is that for the next tens of years rein-
forcement learning will be focused on constructivism." Indeed, as we can see in this
book, many new developments in the field have to do with new structures that enable
value function approximation. In addition, many developments are about properties,
capabilities and guarantees about convergence and performance of these new struc-
tures. Bayesian frameworks, efficient linear approximations, relational knowledge
representation and decompositions of hierarchical and multi-agent nature all consti-
tute new structures employed in the reinforcement learning methodology nowadays.
Reinforcement learning is currently an established field usually situated in
machine learning. However, given its focus on behavior learning, it has many con-
nections to other fields such as psychology, operations research, mathematical op-
timization and beyond. Within artificial intelligence, there are large overlaps with
probabilistic and decision-theoretic planning as it shares many goals with the plan-
ning community (e.g. the international conference on automated planning systems,
ICAPS). In recent editions of the international planning competition (IPC),
methods originating from the reinforcement learning literature have entered the
competition and performed very well, both on probabilistic planning problems and in a
recent "learning for planning" track.
Reinforcement learning research is published virtually everywhere in the broad
field of artificial intelligence, simply because it is both a general methodology for
behavior optimization as well as a set of computational tools to do so. All major
artificial intelligence journals feature articles on reinforcement learning nowadays,
and have been doing so for a long time. Application domains range from robotics
and computer games to network routing and natural language dialogue systems and
reinforcement learning papers appear at fora dealing with these topics. A large por-
tion of papers appears every year (or every two years) at the established top conferences
in artificial intelligence such as IJCAI, ECAI and AAAI, and many also at top con-
ferences with a particular focus on statistical machine learning such as UAI, ICML,
ECML and NIPS. In addition, conferences on artificial life (Alife), adaptive behav-
ior (SAB), robotics (ICRA, IROS, RSS) and neural networks and evolutionary com-
putation (e.g. IJCNN and ICANN) feature much reinforcement learning work. Last
but not least, in the past decade many specialized reinforcement learning workshops
and tutorials have appeared at all the major artificial intelligence conferences.
But even though the field has much to offer to many other fields, and reinforce-
ment learning papers appear everywhere, the current status of the field renders it
natural to introduce fora with a specific focus on reinforcement learning methods.
The European Workshop on Reinforcement Learning (EWRL) has gradually become
one such forum, growing considerably with each biennial edition, and most recently held in
Nancy (2008) and co-located with ECML (2011). Furthermore, the IEEE Sympo-
sium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)
has become yet another meeting point for researchers to present and discuss their
latest research findings. Together EWRL and ADPRL show that the field has pro-
gressed a lot and requires its own community and events.
Concerning practical aspects of reinforcement learning, and more importantly,
concerning benchmarking, evaluation and comparisons, much has happened. In ad-
dition to the planning competitions (such as the IPC), several editions of the re-
inforcement learning competitions12 have been held with great success. Contestants
competed in several classic domains (such as pole balancing) but also new and excit-
ing domains such as the computer games Tetris and Super Mario. Competitions can
promote code sharing and reuse, establish benchmarks for the field and be used to
evaluate and compare methods on challenging domains. Another initiative for pro-
moting more code and solution reuse is the RL-Glue framework13, which provides
an abstract reinforcement learning framework that can be used to share methods and
domains among researchers. RL-Glue can connect to most common programming
languages and thus provides a system- and language-independent software frame-
work for experimentation. The competitions and RL-Glue help to further mature the
field of reinforcement learning, and enable better scientific methods to test, compare
and reuse reinforcement learning methods.
12 http://www.rl-competition.org/
13 glue.rl-community.org/
As said before, we have tried to let this book be an answer to the question "what
book would you recommend to learn about current reinforcement learning?". Anyone
who might pose this question is part of the potential audience for this
book. This includes PhD and master students, researchers in reinforcement learn-
ing itself, and researchers in any other field who want to know about reinforcement
learning. Having a book with 17 surveys on the major areas in current reinforcement
learning provides an excellent starting point for researchers to continue expanding
the field, applying reinforcement learning to new problems and to incorporate prin-
cipled behavior learning techniques in their own intelligent systems and robots.
When we started the book project, we first created a long list of possible topics
and grouped them, which resulted in a list of almost twenty large subfields of rein-
forcement learning in which many new results were published over the last decade.
These include established subfields such as evolutionary reinforcement learning,
but also newer topics such as relational knowledge representation approaches and
Bayesian frameworks for learning and planning. Hierarchical approaches, to which
a chapter in this book is devoted, form the first subfield that basically
emerged14 right after the appearance of the two books mentioned above and, for that
reason, were not discussed at that time.
Our philosophy when coming up with this book was to let the pool of authors
reflect the youth and the active nature of the field. To that end, we selected and in-
vited mainly young researchers at the start of their careers. Many of them finished
their PhD studies in recent years, and that ensured that they were active and ex-
pert in their own sub-field of reinforcement learning, full of ideas and enthusiastic
about that sub-field. Moreover, it gave them an excellent opportunity to promote that
sub-field within the larger research area. In addition, we also invited several more
experienced researchers who are recognized for their advances in several subfields
of reinforcement learning. This all led to a good mix between different views on
the subject matter. The initial chapter submissions were of very high quality, as we
had hoped. To complete the quality assurance procedure, we – the editors –
together with a group of leading experts as reviewers, provided at least three re-
views for each chapter. As a result, the chapters were improved even further,
and the resulting book contains a large number of references to work in each of
the subfields.
The resulting book contains 19 chapters, of which one contains introductory ma-
terial on reinforcement learning, dynamic programming, Markov decision processes
and foundational algorithms such as Q-learning and value iteration. The last chapter
reflects on the material in the book, discusses things that were left out, and points
out directions for further research. In addition, this chapter contains personal reflec-
tions and predictions about the field. The 17 chapters that form the core of the book
are each self-contained introductions and overviews of subfields of reinforcement
14 That is not to say that there were no hierarchical approaches, but the large portion of
current hierarchical techniques appeared after the mid-nineties.
learning. In the next section we will give an overview of the structure of the book
and its chapters. In total, the book features 30 authors, from many different institutes
and different countries.
represent general knowledge about the world and are, because of that, good candi-
dates to be transferred to other, related tasks. More about the transfer of knowledge
in reinforcement learning is surveyed in the chapter Transfer in Reinforcement
Learning: a Framework and a Survey by Alessandro Lazaric. When confronted
with several related tasks, various things can, once learned, be reused in a subse-
quent task. For example, policies can be reused, but depending on whether the state
and/or action spaces of the two related tasks differ, other methods need to be ap-
plied. The chapter not only surveys existing approaches, but also tries to put them in
a more general framework. The remaining chapter in this part, Sample Complexity
Bounds of Exploration by Lihong Li surveys techniques and results concerning the
sample complexity of reinforcement learning. For all algorithms it is important to
know how many samples (examples of interactions with the world) are needed to
guarantee a minimal performance on a task. In the past decade many new results
have appeared that study this vital aspect in a rigorous and mathematical way and
this chapter provides an overview of them.
CONSTRUCTIVE-REPRESENTATIONAL DIRECTIONS
This part of the book contains several chapters in which either representations are
central, or their construction and use. As mentioned before, a major aspect of con-
structive techniques is the structures that enable value function approximation (or
policies for that matter). Several major new developments in reinforcement learning
are about finding new representational frameworks to learn behaviors in challenging
new settings.
In the chapter Reinforcement Learning in Continuous State and Action Spaces
by Hado van Hasselt many techniques are described for problem representations that
contain continuous variables. This has been a major component in reinforcement
learning for a long time, for example through the use of neural function approxi-
mators. However, several new developments in the field have tried to either more
rigorously capture the properties of algorithms dealing with continuous states and
actions or have applied such techniques in novel domains. Of particular interest are
new techniques for dealing with continuous actions, since this effectively renders the
number of applicable actions infinite and requires sophisticated techniques for com-
puting optimal policies. The second chapter, Solving Relational and First-Order
Logical Markov Decision Processes: A Survey by Martijn van Otterlo describes
a new representational direction in reinforcement learning which started around a
decade ago. It covers all representations strictly more powerful than propositional
(or attribute-value) representations of states and actions. These include modelings
as found in logic programming and first-order logic. Such representations can rep-
resent the world in terms of objects and relations and open up possibilities for re-
inforcement learning in a much broader set of domains than before. These enable
many new ways of generalization over value functions, policies and world models
and require methods from logical machine learning and knowledge representation to
do so. The next chapter, Hierarchical Approaches by Bernhard Hengst, also surveys
a representational direction, although here representation refers to the structural de-
composition of a task, and with that implicitly of the underlying Markov decision
processes. Many of the hierarchical approaches appeared at the end of the nineties,
and since then a large number of techniques has been introduced. These include new
decompositions of tasks, value functions and policies, and many techniques for au-
tomatically learning task decompositions from interaction with the world. The final
chapter in this part, Evolutionary Computation for Reinforcement Learning by
Shimon Whiteson surveys evolutionary search for good policy structures (and value
functions). Evolution has always been a good alternative to iterative, incremental
reinforcement learning approaches and both can be used to optimize complex be-
haviors. Evolution is particularly well suited for non-Markov problems and policy
structures for which gradients are unnatural or difficult to compute. In addition, the
chapter surveys evolutionary neural networks for behavior learning.
PROBABILISTIC MODELS OF SELF AND OTHERS
Current artificial intelligence has become more and more statistical and probabilis-
tic. Advances in the field of probabilistic graphical models are used virtually ev-
erywhere, and results for these models – both theoretical and computational – are
effectively used in many sub-fields. This is no different for reinforcement learning.
There are several large sub-fields in which the use of probabilistic models, such as
Bayesian networks, is common practice and the employment of such a universal
set of representations and computational techniques enables fruitful connections to
other research employing similar models.
The first chapter, Bayesian Reinforcement Learning by Nikos Vlassis, Moham-
mad Ghavamzadeh, Shie Mannor and Pascal Poupart surveys Bayesian techniques
for reinforcement learning. Learning sequential decision making under uncertainty
can be cast in a Bayesian universe where interaction traces provide samples (evi-
dence), and Bayesian inference and learning can be used to find optimal decision
strategies in a rigorous probabilistic fashion. The next chapter, Partially Observ-
able Markov Decision Processes by Matthijs Spaan surveys representations and
techniques for partially observable problems which are very often cast in a prob-
abilistic framework such as a dynamic Bayesian network, and where probabilis-
tic inference is needed to infer underlying hidden (unobserved) states. The chapter
surveys both model-based as well as model-free methods. Whereas POMDPs are
usually modeled in terms of belief states that capture some form of history (or mem-
ory), a more recent class of methods that focuses on the future is surveyed in the
chapter Predictively Defined Representations of State by David Wingate. These
techniques maintain a belief state used for action selection in terms of probabilistic
predictions about future events. Several techniques are described in which these pre-
dictions are represented compactly and where these are updated based on experience
in the world. So far, most methods focus on the prediction (or evaluation) problem,
and less on control. The fourth chapter, Game Theory and Multi-agent Reinforce-
ment Learning by Ann Nowé, Peter Vrancx and Yann-Michaël De Hauwere moves
to a more general set of problems in which multiple agents learn and interact. It
surveys game-theoretic and multi-agent approaches in reinforcement learning and
shows techniques used to optimize agents in the context of other (learning) agents.
The final chapter in this part, Decentralized POMDPs by Frans Oliehoek, surveys
this framework for decentralized decision making by teams of cooperating agents under uncertainty.
ACKNOWLEDGEMENTS
Crafting a book such as this cannot be done overnight. Many people have put a lot
of work in it to make it happen. First of all, we would like to give a big thanks to all
the authors who have put in all their expertise, time and creativity to write excellent
surveys of their sub-fields. Writing a survey usually takes some extra effort, since
it requires not only that you know much about a topic, but also that you can place all
relevant works in a more general framework. As editors, we are very happy with the
way the authors have accomplished this difficult, yet very useful, task.
A second group of people we would like to thank are the reviewers. They have
provided us with very thorough, and especially very constructive, reviews and these
have made the book even better. We thank these reviewers who agreed to put their
names in the book; thank you very much for all your help: Andrea Bonarini, Prasad
Tadepalli, Sarah Ostentoski, Rich Sutton, Daniel Kudenko, Jesse Hoey, Christopher
Amato, Damien Ernst, Remi Munos, Johannes Fuernkrantz, Juergen Schmidhuber,
Thomas Rückstiess, Joelle Pineau, Dimitri Bertsekas, John Asmuth, Lisa Torrey,
Yael Niv, Te Thamrongrattanarit, Michael Littman and Csaba Szepesvari.
Thanks also to Rich Sutton who was so kind to write the foreword to this book.
We both consider him as one of the main figures in reinforcement learning, and
in all respects we admire him for all the great contributions he has made to the
field. He was there in the beginning of modern reinforcement learning, but still he
continuously introduces novel, creative new ways to let agents learn. Thanks Rich!
Editing a book such as this is made much more convenient if you can fit it in your
daily scientific life. In that respect, Martijn would like to thank both the Katholieke
Universiteit Leuven (Belgium) as well as the Radboud University Nijmegen (The
Netherlands) for their support. Marco would like to thank the University of Gronin-
gen (The Netherlands) for the same kind of support.
Last but not least, we would like to thank you, the reader, for having picked up this
book and having started to read it. We hope it will be useful to you, and hope that
the work you are about to embark on will be incorporated in a subsequent book on
reinforcement learning.
List of Contributors
Robert Babuška
Delft Center for Systems and Control, Delft University of Technology,
The Netherlands
e-mail: [email protected]
Lucian Buşoniu
Research Center for Automatic Control (CRAN), University of Lorraine, France
e-mail: [email protected]
Thomas Gabel
Albert-Ludwigs-Universität Freiburg, Faculty of Engineering, Germany
e-mail: [email protected]
Mohammad Ghavamzadeh
Team SequeL, INRIA Lille-Nord Europe, France
e-mail: [email protected]
Hado van Hasselt
Centrum Wiskunde en Informatica (CWI, Center for Mathematics and Computer
Science), Amsterdam, The Netherlands
e-mail: [email protected]
Yann-Michaël De Hauwere
Vrije Universiteit Brussel, Belgium
e-mail: [email protected]
Bernhard Hengst
School of Computer Science and Engineering,
University of New South Wales, Sydney, Australia
e-mail: [email protected]
Todd Hester
Department of Computer Science, The University of Texas at Austin, USA
e-mail: [email protected]
Jens Kober
1) Intelligent Autonomous Systems Institute, Technische Universitaet Darmstadt,
Darmstadt, Germany; 2) Robot Learning Lab, Max-Planck Institute for Intelligent
Systems, Tübingen, Germany
e-mail: [email protected]
Sascha Lange
Albert-Ludwigs-Universität Freiburg, Faculty of Engineering, Germany
e-mail: [email protected]
Alessandro Lazaric
Team SequeL, INRIA Lille-Nord Europe, France
e-mail: [email protected]
Lihong Li
Yahoo! Research, Santa Clara, USA
e-mail: [email protected]
Shie Mannor
Technion, Haifa, Israel
e-mail: [email protected]
Rémi Munos
Team SequeL, INRIA Lille-Nord Europe, France
e-mail: [email protected]
Frans Oliehoek
CSAIL, Massachusetts Institute of Technology
e-mail: [email protected]
Ann Nowé
Vrije Universiteit Brussel, Belgium
e-mail: [email protected]
Martijn van Otterlo
Radboud University Nijmegen, The Netherlands
e-mail: [email protected]
Jan Peters
1) Intelligent Autonomous Systems Institute, Technische Universitaet Darmstadt,
Darmstadt, Germany; 2) Robot Learning Lab, Max-Planck Institute for Intelligent
Systems, Tübingen, Germany
e-mail: [email protected]
Pascal Poupart
University of Waterloo, Canada
e-mail: [email protected]
Martin Riedmiller
Albert-Ludwigs-Universität Freiburg, Faculty of Engineering, Germany
e-mail: [email protected]
Bart De Schutter
Delft Center for Systems and Control,
Delft University of Technology, The Netherlands
e-mail: [email protected]
Ashvin Shah
Department of Psychology, University of Sheffield, Sheffield, UK
e-mail: [email protected]
Matthijs Spaan
Institute for Systems and Robotics, Instituto Superior Técnico, Lisbon, Portugal
e-mail: [email protected]
Peter Stone
Department of Computer Science, The University of Texas at Austin, USA
e-mail: [email protected]
István Szita
University of Alberta, Canada
e-mail: [email protected]
Nikos Vlassis
(1) Luxembourg Centre for Systems Biomedicine, University of Luxembourg,
and (2) OneTree Luxembourg
e-mail: [email protected],[email protected]
Peter Vrancx
Vrije Universiteit Brussel, Belgium
e-mail: [email protected]
Shimon Whiteson
Informatics Institute, University of Amsterdam, The Netherlands
e-mail: [email protected]
Marco Wiering
Department of Artificial Intelligence, University of Groningen, The Netherlands
e-mail: [email protected]
David Wingate
Massachusetts Institute of Technology, Cambridge, USA
e-mail: [email protected]
Acronyms
AC Actor-Critic
AO Action-Outcome
BAC Bayesian Actor-Critic
BEETLE Bayesian Exploration-Exploitation Tradeoff in Learning
BG Basal Ganglia
BQ Bayesian Quadrature
BQL Bayesian Q-learning
BPG Bayesian Policy Gradient
BRM Bellman Residual Minimization (generic; BRM-Q for Q-functions;
BRM-V for V-functions)
CMA-ES Covariance Matrix Adaptation Evolution Strategy
CPPN Compositional Pattern Producing Network
CoSyNE Cooperative Synapse Neuroevolution
CR Conditioned Response
CS Conditioned Stimulus
DA Dopamine
DBN Dynamic Bayesian Network
DEC-MDP Decentralized Markov Decision Process
DFQ Deep Fitted Q iteration
DP Dirichlet process
DP Dynamic Programming
DTR Decision-Theoretic Regression
EDA Estimation of Distribution Algorithm
ESP Enforced SubPopulations
FODTR First-Order (Logical) Decision-Theoretic Regression
FQI Fitted Q Iteration
GP Gaussian Process
GPI Generalized Policy Iteration
GPTD Gaussian Process Temporal Difference
HBM Hierarchical Bayesian model
HRL Hierarchical Reinforcement Learning
1.1 Introduction
Markov Decision Processes (MDPs) Puterman (1994) are an intuitive and fundamen-
tal formalism for decision-theoretic planning (DTP) Boutilier et al (1999); Boutilier
(1999), reinforcement learning (RL) Bertsekas and Tsitsiklis (1996); Sutton and
Barto (1998); Kaelbling et al (1996) and other learning problems in stochastic do-
mains. In this model, an environment is modelled as a set of states and actions can
be performed to control the system’s state. The goal is to control the system in
such a way that some performance criterion is maximized. Many problems such as
(stochastic) planning problems, learning robot control and game playing problems
have successfully been modelled in terms of an MDP. In fact MDPs have become
the de facto standard formalism for learning sequential decision making.
DTP Boutilier et al (1999), i.e. planning using decision-theoretic notions to rep-
resent uncertainty and plan quality, is an important extension of the AI planning
paradigm, adding the ability to deal with uncertainty in action effects and the abil-
ity to deal with less-defined goals. Furthermore it adds a significant dimension in
that it considers situations in which factors such as resource consumption and un-
certainty demand solutions of varying quality, for example in real-time decision
situations. There are many connections between AI planning, research done in the
field of operations research Winston (1991) and control theory Bertsekas (1995), as
most work in these fields on sequential decision making can be viewed as instances
of MDPs. The notion of a plan in AI planning, i.e. a series of actions from a start
state to a goal state, is extended to the notion of a policy, which is a mapping from
all states to an (optimal) action, based on decision-theoretic measures of optimality
with respect to some goal to be optimized.
As an example, consider a typical planning domain, involving boxes to be moved
around and where the goal is to move some particular boxes to a designated area.
This type of problem can be solved using AI planning techniques. Consider now
a slightly more realistic extension in which some of the actions can fail, or have
uncertain side-effects that can depend on factors beyond the operator’s control, and
where the goal is specified by giving credit for how many boxes are put on the right
place. In this type of environment, the notion of a plan is less suitable, because a
sequence of actions can have many different outcomes, depending on the effects of
the operators used in the plan. Instead, the methods in this chapter are concerned
with policies that map states onto actions in such a way that the expected out-
come of the operators will have the intended effects. The expectation over actions
is based on a decision-theoretic expectation with respect to their probabilistic out-
comes and credits associated with the problem goals. The MDP framework allows
for online solutions that learn optimal policies gradually through simulated trials,
and additionally, it allows for approximated solutions with respect to resources such
as computation time. Finally, the model allows for numeric, decision-theoretic mea-
surement of the quality of policies and learning performance. For example, policies
can be ordered by how much credit they receive, or by how much computation is
needed for a particular performance.
This chapter will cover the broad spectrum of methods that have been devel-
oped in the literature to compute good or optimal policies for problems modelled
as an MDP. The term RL is associated with the more difficult setting in which no
(prior) knowledge about the MDP is presented. The task then of the algorithm is
to interact, or experiment with the environment (i.e. the MDP), in order to gain
knowledge about how to optimize its behavior, being guided by the evaluative feed-
back (rewards). The model-based setting, in which the full transition dynamics and
reward distributions are known, is usually characterized by the use of dynamic
programming (DP) techniques. However, we will see that the underlying basis is
very similar, and that mixed forms occur.

Fig. 1.1 Example of interaction between an agent and its environment, from an RL
perspective
boxes and one white one and you are standing next to a blue box. However, this
figure clearly shows the mechanism of sequential decision making.
There are several important aspects in learning sequential decision making which
we will describe in this section, after which we will describe formalizations in the
next sections.
There are several classes of algorithms that deal with the problem of sequential
decision making. In this book we deal specifically with the topic of learning, but
some other options exist.
The first solution is the programming solution. An intelligent system for sequen-
tial decision making can – in principle – be programmed to handle all situations. For
each possible state an appropriate or optimal action can be specified a priori. How-
ever, this puts a heavy burden on the designer or programmer of the system. All sit-
uations should be foreseen in the design phase and programmed into the agent. This
is a tedious and almost impossible task for most interesting problems, and it only
works for problems which can be modelled completely. In most realistic problems
this is not possible due to the sheer size of the problem, or the intrinsic uncertainty
in the system. A simple example is robot control in which factors such as lighting or
temperature can have a large, and unforeseen, influence on the behavior of camera
and motor systems. Furthermore, in situations where the problem changes, for ex-
ample due to new elements in the description of the problem or changing dynamics
of the system, a programmed solution will no longer work. Programmed solutions
are brittle in that they will only work for completely known, static problems with
fixed probability distributions.
A second solution uses search and planning for sequential decision making. The
successful chess program Deep Blue Schaeffer and Plaat (1997) was able to defeat
the human world champion Garry Kasparov by smart, brute force search algorithms
that used a model of the dynamics of chess, tuned to Kasparov’s style of playing.
When the dynamics of the system are known, one can search or plan from the cur-
rent state to a desirable goal state. However, when there is uncertainty about the
action outcomes standard search and planning algorithms do not apply. Admissible
heuristics can solve some problems concerning the reward-based nature of sequen-
tial decision making, but the probabilistic effects of actions pose a difficult problem.
Probabilistic planning algorithms exist, e.g. Kushmerick et al (1995), but their per-
formance is not as good as their deterministic counterparts. An additional problem
is that planning and search focus on specific start and goal states. In contrast, we are
looking for policies which are defined for all states, and are defined with respect to
rewards.
The third solution is learning, and this will be the main topic of this book.
Learning has several advantages in sequential decision making. First, it relieves the
designer of the system from the difficult task of deciding upon everything in the de-
sign phase. Second, it can cope with uncertainty, goals specified in terms of reward
measures, and with changing situations. Third, it is aimed at solving the problem
for every state, as opposed to a mere plan from one state to another. Additionally,
although a model of the environment can be used or learned, it is not necessary in or-
der to compute optimal policies, such as is exemplified by RL methods. Everything
can be learned from interaction with the environment.
One important aspect in the learning task we consider in this book is the distinction
between online and off-line learning. The difference between these two types is
influenced by factors such as whether one wants to control a real-world entity –
such as a robot playing robot soccer or a machine in a factory – or whether all
necessary information is available. Online learning performs learning directly on
the problem instance. Off-line learning uses a simulator of the environment as a
cheap way to get many training examples for safe and fast learning.
Learning the controller directly on the real task is often not possible. For ex-
ample, the learning algorithms in this chapter sometimes need millions of training
instances which can be too time-consuming to collect. Instead, a simulator is much
faster, and in addition it can be used to provide arbitrary training situations, includ-
ing situations that rarely happen in the real system. Furthermore, it provides a "safe"
training situation in which the agent can explore and make mistakes. Obtaining neg-
ative feedback in the real task in order to learn to avoid these situations, might entail
destroying the machine that is controlled, which is unacceptable. Often one uses a
simulation to obtain a reasonable policy for a given problem, after which some parts
of the behavior are fine-tuned on the real task. For example, a simulation might pro-
vide the means for learning a reasonable robot controller, but some physical factors
concerning variance in motor and perception systems of the robot might make addi-
tional fine-tuning necessary. A simulation is just a model of the real problem, such
that small differences between the two are natural, and learning might make up for
that difference. Many problems in the literature, however, are simulations of games
and optimization problems, so that the distinction disappears.
Credit Assignment
An important aspect of sequential decision making is the fact that whether
an action is "good" or "bad" cannot be decided upon right away. The appropriate-
ness of actions is completely determined by the goal the agent is trying to pursue.
The real problem is that the effect of actions with respect to the goal can be much
delayed. For example, the opening moves in chess have a large influence on win-
ning the game. However, between the first opening moves and receiving a reward
for winning the game, a couple of tens of moves might have been played. Decid-
ing how to give credit to the first moves – which did not get the immediate reward
for winning – is a difficult problem called the temporal credit assignment problem.
Each move in a winning chess game contributes more or less to the success of the
last move, although some moves along this path can be less optimal or even bad. A
related problem is the structural credit assignment problem, in which the problem is
to distribute feedback over the structure representing the agent’s policy. For exam-
ple, the policy can be represented by a structure containing parameters (e.g. a neural
network). Deciding which parameters have to be updated forms the structural credit
assignment problem.
Compared to supervised learning, the amount of feedback the learning system gets
in RL is much smaller. In supervised learning, for every learning sample the correct
output is given in a training set. The performance of the learning system can be mea-
sured relative to the number of correct answers, resulting in a predictive accuracy.
The difficulty lies in learning this mapping, and whether this mapping generalizes
to new, unclassified, examples. In unsupervised learning, the difficulty lies in con-
structing a useful partitioning of the data such that classes naturally arise. In rein-
forcement learning there is only some information available about performance, in
the form of one scalar signal. This feedback system is evaluative rather than being
instructive. Using this limited signal for feedback renders a need to put more effort
in using it to evaluate and improve behavior during learning.
A second aspect about feedback and performance is related to the stochastic na-
ture of the problem formulation. In supervised and unsupervised learning, the data
is usually considered static, i.e. a data set is given and performance can be measured
with respect to this data. The learning samples for the learner originate from a fixed
distribution, i.e. the data set. From an RL perspective, the data can be seen as a
moving target. The learning process is driven by the current policy, but this policy
will change over time. That means that the distribution over states and rewards will
change because of this. In machine learning the problem of a changing distribution
of learning samples is termed concept drift Maloof (2003) and it demands special
features to deal with it. In RL this problem is dealt with by exploration, a constant
interaction between evaluation and improvement of policies and additionally the use
of learning rate adaptation schemes.
A third aspect of feedback is the question "where do the numbers come from?".
In many sequential decision tasks, suitable reward functions present themselves
quite naturally. For games in which there are winning, losing and draw situations,
the reward function is easy to specify. In some situations special care has to be taken
in giving rewards for states or actions, and also their relative size is important. When
the agent will encounter a large negative reward before it finally gets a small posi-
tive reward, this positive reward might get overshadowed. All problems posed will
have some optimal policy, but it depends on whether the reward function is in ac-
cordance with the right goals, whether the policy will tackle the right problem. In
some problems it can be useful to provide the agent with rewards for reaching inter-
mediate subgoals. This can be helpful in problems which require very long action
sequences.
Representations
One of the most important aspects in learning sequential decision making is repre-
sentation. Two central issues are what should be represented, and how things should
be represented. The first issue is dealt with in this chapter. Key components that can
or should be represented are models of the dynamics of the environment, reward dis-
tributions, value functions and policies. For some algorithms all components are ex-
plicitly stored in tables, for example in classic DP algorithms. Actor-critic methods
keep separate, explicit representations of both value functions and policies. How-
ever, in most RL algorithms just a value function is represented whereas policy
decisions are derived from this value function online. Methods that search in policy
space do not represent value functions explicitly, but instead an explicitly repre-
sented policy is used to compute values when necessary. Overall, the choice for not
representing certain elements can influence the choice for a type of algorithm, and
its efficiency.
The question of how various structures can be represented is dealt with exten-
sively in this book, starting from the next chapter. Structures such as policies, tran-
sition functions and value functions can be represented in more compact form by
using various structured knowledge representation formalisms and this enables
much more efficient solution mechanisms and scaling up to larger domains.
The elements of the RL problem as described in the introduction to this chapter can
be formalized using the Markov decision process (MDP) framework. In this section
we will formally describe components such as states and actions and policies, as
well as the goals of learning using different kinds of optimality criteria. MDPs
are extensively described in Puterman (1994) and Boutilier et al (1999). They can
be seen as stochastic extensions of finite automata and also as Markov processes
augmented with actions and rewards.
Although general MDPs may have infinite (even uncountable) state and action
spaces, we limit the discussion to finite-state and finite-action problems. In the next
chapter we will encounter continuous spaces and in later chapters we will encounter
situations arising in the first-order logic setting in which infinite spaces can quite
naturally occur.
MDPs consist of states, actions, transitions between states and a reward function
definition. We consider each of them in turn.
States
The set of environmental states S is defined as the finite set {s1, . . . , sN} where the
size of the state space is N, i.e. |S| = N. A state is a unique characterization of all
that is important in a state of the problem that is modelled. For example, in chess
a complete configuration of board pieces of both black and white, is a state. In the
next chapter we will encounter the use of features that describe the state. In those
contexts, it becomes necessary to distinguish between legal and illegal states, for
some combinations of features might not result in an actually existing state in the
problem. In this chapter, we will confine ourselves to the discrete state set S in which
each state is represented by a distinct symbol, and all states s ∈ S are legal.
Actions
The set of actions A is defined as the finite set {a1, . . . , aK} where the
action space is K, i.e. |A| = K. Actions can be used to control the system state.
The set of actions that can be applied in some particular state s ∈ S, is denoted A(s),
where A(s) ⊆ A. In some systems, not all actions can be applied in every state, but in
general we will assume that A(s) = A for all s ∈ S. In more structured representations
(e.g. by means of features), the fact that some actions are not applicable in some
The idea of Markovian dynamics is that the current state s gives enough information
to make an optimal decision; it is not important which states and actions preceded s.
Another way of saying this, is that if you select an action a, the probability distribu-
tion over next states is the same as the last time you tried this action in the same state.
More general models can be characterized by being k-Markov, i.e. the last k states
are sufficient, such that Markov is actually 1-Markov. However, each k-Markov prob-
lem can be transformed into an equivalent Markov problem. The Markov property
forms a boundary between the MDP and more general models such as POMDPs.
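In symbols (a standard formulation, not a quotation from the text, written in terms of the transition function T of Definition 1.3.1 below), the Markov property can be stated as

P(s_{t+1} = s′ | s_t, a_t, s_{t−1}, a_{t−1}, . . . , s_0, a_0) = P(s_{t+1} = s′ | s_t, a_t) = T(s_t, a_t, s′),

i.e. the distribution over next states depends only on the current state and action.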
The reward function2 specifies rewards for being in a state, or doing some action
in a state. The state reward function is defined as R : S → ℝ, and it specifies the
1 Although this is the same, the explicit distinction between an action not being applicable
in a state and a zero probability for transitions with that action, is lost in this way.
2 Although we talk about rewards here, with the usual connotation of something positive, the
reward function merely gives a scalar feedback signal. This can be interpreted as negative
(punishment) or positive (reward). The various origins of work in MDPs in the literature
create additional confusion with the reward function. In the operations research litera-
ture, one usually speaks of a cost function instead and the goal of learning and optimization
is to minimize this function.
reward obtained in states. However, two other definitions exist. One can define either
R : S × A → ℝ or R : S × A × S → ℝ. The first one gives rewards for performing
an action in a state, and the second gives rewards for particular transitions between
states. All definitions are interchangeable though the last one is convenient in model-
free algorithms (see Section 1.7), because there we usually need both the starting
state and the resulting state in backing up values. Throughout this book we will
mainly use R(s, a, s′), but deviate from this when more convenient.
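As a small illustration of this interchangeability (a standard identity, not spelled out in the text), a state-action reward function can be recovered from the transition-based one by taking the expectation over successor states:

R(s, a) = ∑_{s′ ∈ S} T(s, a, s′) R(s, a, s′),

and a state-based reward follows analogously once the action choice is fixed by a policy.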
The reward function is an important part of the MDP that specifies implicitly the
goal of learning. For example, in episodic tasks such as in the games Tic-Tac-Toe
and chess, one can assign all states in which the agent has won a positive reward
value, all states in which the agent loses a negative reward value and a zero reward
value in all states where the final outcome of the game is a draw. The goal of the
agent is to reach positive valued states, which means winning the game. Thus, the
reward function is used to give direction in which way the system, i.e. the MDP,
should be controlled. Often, the reward function assigns non-zero reward to non-
goal states as well, which can be interpreted as defining sub-goals for learning.
Putting all elements together results in the definition of a Markov decision process,
which will be the base model for the large majority of methods described in this
book.
Definition 1.3.1. A Markov decision process is a tuple ⟨S, A, T, R⟩ in which S is
a finite set of states, A a finite set of actions, T a transition function defined as
T : S × A × S → [0, 1] and R a reward function defined as R : S × A × S → ℝ.
The transition function T and the reward function R together define the model of
the MDP. Often MDPs are depicted as a state transition graph where the nodes
correspond to states and (directed) edges denote transitions. A typical domain that
is frequently used in the MDP literature is the maze Matthews (1922), in which the
reward function assigns a positive reward for reaching the exit state.
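To make Definition 1.3.1 concrete, the tuple ⟨S, A, T, R⟩ can be written down directly as a small data structure. The following Python sketch is ours rather than the book's; the MDP class, the step method and the two-state toy domain ('start', 'exit', action 'go') are illustrative assumptions only.

import random

class MDP:
    def __init__(self, states, actions, T, R):
        self.states = states    # finite state set S
        self.actions = actions  # finite action set A
        self.T = T              # T[s][a] is a dict mapping next states s' to probabilities
        self.R = R              # R[(s, a, s')] is a scalar reward

    def step(self, s, a):
        """Sample a transition from state s under action a; return (s', reward)."""
        successors, probs = zip(*self.T[s][a].items())
        s_next = random.choices(successors, weights=probs)[0]
        return s_next, self.R[(s, a, s_next)]

# Toy two-state domain: 'go' reaches the absorbing state 'exit' with
# probability 0.8 (reward 1) and stays in 'start' otherwise (reward 0).
T = {'start': {'go': {'exit': 0.8, 'start': 0.2}},
     'exit':  {'go': {'exit': 1.0}}}
R = {('start', 'go', 'exit'): 1.0,
     ('start', 'go', 'start'): 0.0,
     ('exit', 'go', 'exit'): 0.0}
mdp = MDP({'start', 'exit'}, {'go'}, T, R)
print(mdp.step('start', 'go'))

The nested dictionary mirrors T : S × A × S → [0, 1]; for every state-action pair the inner probabilities sum to one.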
There are several distinct types of systems that can be modelled by this definition
of an MDP. In episodic tasks, there is the notion of episodes of some length, where
the goal is to take the agent from a starting state to a goal state. An initial state
distribution I : S → [0,1] gives for each state the probability of the system being
started in that state. Starting from a state s the system progresses through a sequence
of states, based on the actions performed. In episodic tasks, there is a specific subset
G ⊆ S, denoted goal state area containing states (usually with some distinct reward)
where the process ends. We can furthermore distinguish between finite, fixed horizon
tasks in which each episode consists of a fixed number of steps, indefinite horizon
tasks in which each episode can end but episodes can have arbitrary length, and
infinite horizon tasks where the system does not end at all. The last type of model is
usually called a continuing task.
Episodic tasks, i.e. in which there are so-called goal states, can be modelled using
the same model defined in Definition 1.3.1. This is usually modelled by means of
absorbing states or terminal states, e.g. states from which every action results in
a transition to that same state with probability 1 and reward 0. Formally, for an
absorbing state s, it holds that T(s, a, s) = 1 and R(s, a, s′) = 0 for all states s′ ∈ S and
actions a ∈ A. When entering an absorbing state, the process is reset and restarts in
a new starting state. Episodic tasks and absorbing states can in this way be elegantly
modelled in the same framework as continuing tasks.
1.3.2 Policies
Given an MDP ⟨S, A, T, R⟩, a policy is a computable function that outputs for each
state s ∈ S an action a ∈ A (or a ∈ A(s)). Formally, a deterministic policy π is
a function defined as π : S → A. It is also possible to define a stochastic policy
as π : S × A → [0, 1] such that for each state s ∈ S, it holds that π(s, a) ≥ 0 and
∑a∈A π(s, a) = 1. We will assume deterministic policies in this book unless stated
otherwise.
Application of a policy to an MDP is done in the following way. First, a start
state s0 from the initial state distribution I is generated. Then, the policy π sug-
gests the action a0 = π(s0) and this action is performed. Based on the transition
function T and reward function R, a transition is made to state s1, with probabil-
ity T(s0, a0, s1), and a reward r0 = R(s0, a0, s1) is received. This process continues,
producing s0, a0, r0, s1, a1, r1, s2, a2, . . .. If the task is episodic, the process ends in
state sgoal and is restarted in a new state drawn from I. If the task is continuing, the
sequence of states can be extended indefinitely.
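The interaction loop just described can be sketched in a few lines of Python. This is our illustration, not code from the book; it reuses the hypothetical MDP object from the earlier sketch, and the names rollout, I, pi and goal_states are assumptions.

import random

def rollout(mdp, I, pi, goal_states, max_steps=100):
    """Produce the sequence (s0, a0, r0), (s1, a1, r1), ... for one episode."""
    states, probs = zip(*I.items())                # initial state distribution I
    s = random.choices(states, weights=probs)[0]   # draw s0 from I
    trajectory = []
    for _ in range(max_steps):
        a = pi(s)                                  # deterministic policy pi: S -> A
        s_next, r = mdp.step(s, a)                 # transition via T, reward via R
        trajectory.append((s, a, r))
        s = s_next
        if s in goal_states:                       # episodic task: stop at a goal state
            break
    return trajectory

# With the toy MDP sketched earlier: rollout(mdp, {'start': 1.0}, lambda s: 'go', {'exit'})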
The policy is part of the agent and its aim is to control the environment modelled
as an MDP. A fixed policy induces a stationary transition distribution over the MDP
which can be transformed into a Markov system3 ⟨S′, T′⟩ where S′ = S and T′(s, s′) =
T(s, a, s′) whenever π(s) = a.
In the previous sections, we have defined the environment (the MDP) and the agent
(i.e. the controlling element, or policy). Before we can talk about algorithms for
computing optimal policies, we have to define what that means. That is, we have
to define what the model of optimality is. There are two ways of looking at opti-
mality. First, there is the aspect of what is actually being optimized, i.e. what is the
goal of the agent? Second, there is the aspect of how optimal the way is in which this
goal is being optimized. The first aspect is related to gathering reward and is
3 In other words, if π is fixed, the system behaves as a stochastic transition system with a
stationary distribution over states.
treated in this section. The second aspect is related to the efficiency and optimality
of algorithms, and this is briefly touched upon and dealt with more extensively in
Section 1.5 and further.
The goal of learning in an MDP is to gather rewards. If the agent were only con-
cerned with the immediate reward, a simple optimality criterion would be to opti-
mize E[rt]. However, there are several ways of taking the future into account when
deciding how to behave now. There are basically three models of optimality in the MDP, which
are sufficient to cover most of the approaches in the literature. They are strongly
related to the types of tasks that were defined in Section 1.3.1.
E[ ∑_{t=0}^{h} r_t ]          E[ ∑_{t=0}^{∞} γ^t r_t ]          lim_{h→∞} E[ (1/h) ∑_{t=0}^{h} r_t ]

Fig. 1.2 Optimality: a) finite horizon, b) discounted, infinite horizon, c) average reward
The finite horizon model simply takes a finite horizon of length h and states that the
agent should optimize its expected reward over this horizon, i.e. the next h steps (see
Figure 1.2a). One can think of this in two ways. The agent could in the first step
take the h-step optimal action, after this the (h − 1)-step optimal action, and so on.
Another way is that the agent will always take the h-step optimal action, which is
called receding-horizon control. The problem, however, with this model, is that the
(optimal) choice for the horizon length h is not always known.
In the infinite-horizon model, the long-run reward is taken into account, but the
rewards that are received in the future are discounted according to how far away
in time they will be received. A discount factor γ , with 0 ≤ γ < 1 is used for
this (see Figure 1.2b)). Note that in this discounted case, rewards obtained later
are discounted more than rewards obtained earlier. Additionally, the discount factor
ensures that – even with infinite horizon – the sum of the rewards obtained is finite.
In episodic tasks, i.e. in tasks where the horizon is finite, the discount factor is not
needed or can equivalently be set to 1. If γ = 0 the agent is said to be myopic, which
means that it is only concerned with immediate rewards. The discount factor can be interpreted in several ways: as an interest rate, as the probability of surviving another time step, or simply as a mathematical trick for bounding the infinite sum. The discounted, infinite-
horizon model is mathematically more convenient, but conceptually similar to the
finite horizon model. Most algorithms in this book use this model of optimality.
A third optimality model is the average-reward model, maximizing the long-run
average reward (see Figure 1.2c)). Sometimes this is called the gain optimal policy
and in the limit, as the discount factor approaches 1, it is equal to the infinite-horizon
discounted model. A difficult problem with this criterion is that for long (or infinite) episodes we cannot distinguish between two policies of which one receives a lot of reward in the initial phases and the other does not; this initial difference in reward is hidden in the long-run average. This problem can be solved by using a bias-optimal model, in which the long-run average is still being optimized, but policies are preferred if they additionally gain extra reward in the initial phases. See Mahadevan (1996)
for a survey on average reward RL.
Choosing between these optimality criteria can be related to the learning prob-
lem. If the length of the episode is known, the finite-horizon model is best. However,
often this is not known, or the task is continuing, and then the infinite-horizon model
is more suitable. Koenig and Liu (2002) give an extensive overview of different ways of modeling MDPs and their relationship with optimality.
The second kind of optimality in this section is related to the more general aspect
of the optimality of the learning process itself. We will encounter various concepts
in the remainder of this book. We will briefly summarize three important notions
here.
Learning optimality can be explained in terms of what the end result of learning
might be. A first concern is whether the agent is able to obtain optimal performance
in principle. For some algorithms there are proofs stating this, but for others there are not. In other words, is there a way to ensure that the learning process will reach a global optimum, or will it merely reach a local optimum, or even oscillate between policies of different quality?
A second kind of optimality is related to the speed of converging to a solution. We
can distinguish between two learning methods by looking at how many interactions
are needed, or how much computation is needed per interaction. And related to that,
what will the performance be after a certain period of time? In supervised learning
the optimality criterion is often defined in terms of predictive accuracy which is
different from optimality in the MDP setting. Also, it is important to look at how
much experimentation is necessary, or even allowed, for reaching optimal behavior.
For example, a learning robot or helicopter might not be allowed to make many
mistakes during learning. A last kind of optimality is related to how much reward is
not obtained by the learned policy, as compared to an optimal one. This is usually
called the regret of a learning system.
In the preceding sections we have defined MDPs and optimality criteria that can be
useful for learning optimal policies. In this section we define value functions, which
are a way to link the optimality criteria to policies. Most learning algorithms for
MDPs compute optimal policies by learning value functions. A value function represents an estimate of how good it is for the agent to be in a certain state (or how good it is to perform a certain action in that state). The notion of how good is expressed in
terms of an optimality criterion, i.e. in terms of the expected return. Value functions
are defined for particular policies.
The value of a state s under policy π, denoted V^π(s), is the expected return when starting in s and following π thereafter. We will use the infinite-horizon, discounted model of optimality, so that

V^π(s) = E_π[ ∑_{k=0}^{∞} γ^k r_{t+k} | s_t = s ]    (1.1)
One fundamental property of value functions is that they satisfy certain recursive
properties. For any policy π and any state s the expression in Equation 1.1 can
recursively be defined in terms of a so-called Bellman Equation Bellman (1957):
V^π(s) = E_π[ r_t + γ r_{t+1} + γ^2 r_{t+2} + … | s_t = s ]
       = E_π[ r_t + γ V^π(s_{t+1}) | s_t = s ]
       = ∑_{s'} T(s,π(s),s') ( R(s,π(s),s') + γ V^π(s') )    (1.2)
It states that the value of a state is defined in terms of the immediate reward and the values of possible next states, weighted by their transition probabilities and discounted by γ. V^π is the unique solution to this set of equations.
Note that multiple policies can have the same value function, but for a given policy
π , V π is unique.
The goal for any given MDP is to find a best policy, i.e. the policy that receives the most reward. This means maximizing the value function of Equation 1.1 for all states s ∈ S. An optimal policy, denoted π^*, is such that V^{π^*}(s) ≥ V^π(s) for all s ∈ S and all policies π. It can be proven that the optimal solution V^* = V^{π^*} satisfies the following equation:

V^*(s) = max_{a∈A} ∑_{s'∈S} T(s,a,s') ( R(s,a,s') + γ V^*(s') )    (1.3)
This expression is called the Bellman optimality equation. It states that the value
of a state under an optimal policy must be equal to the expected return for the best
action in that state. To select an optimal action given the optimal state value function
V ∗ the following rule can be applied:
π^*(s) = argmax_{a} ∑_{s'∈S} T(s,a,s') ( R(s,a,s') + γ V^*(s') )    (1.4)
We call this policy the greedy policy, denoted π_greedy(V), because it greedily selects the best action using the value function V. The analogous optimality equation for state-action values is:

Q^*(s,a) = ∑_{s'} T(s,a,s') ( R(s,a,s') + γ max_{a'} Q^*(s',a') )
Q-functions are useful because they make the weighted summation over different
alternatives (such as in Equation 1.4) using the transition function unnecessary. No
forward-reasoning step is needed to compute an optimal action in a state. This is the
reason that in model-free approaches, i.e. in case T and R are unknown, Q-functions
are learned instead of V -functions. The relation between Q∗ and V ∗ is given by
Q^*(s,a) = ∑_{s'∈S} T(s,a,s') ( R(s,a,s') + γ V^*(s') )    (1.5)

V^*(s) = max_{a} Q^*(s,a)    (1.6)
Now, analogously to Equation 1.4, optimal action selection can be simply put as:

π^*(s) = argmax_{a} Q^*(s,a)    (1.7)
That is, the best action is the action that has the highest expected utility based on
possible next states resulting from taking that action. One can, analogous to the
expression in Equation 1.4, define a greedy policy πgreedy (Q) based on Q. In contrast
to πgreedy (V ) there is no need to consult the model of the MDP; the Q-function
suffices.
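The following sketch contrasts the two forms of greedy action selection: selecting greedily from V (Equation 1.4) requires the model T and R for a one-step look-ahead, whereas selecting greedily from Q (Equation 1.7) does not. The tabular dictionary representations and the function names are illustrative assumptions, not part of this chapter.

def greedy_from_V(s, actions, T, R, V, gamma):
    """pi_greedy(V), cf. Equation 1.4: needs the model T and R for a one-step look-ahead."""
    def backup(a):
        return sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)].items())
    return max(actions, key=backup)

def greedy_from_Q(s, actions, Q):
    """pi_greedy(Q), cf. Equation 1.7: the Q-function already contains the look-ahead."""
    return max(actions, key=lambda a: Q[(s, a)])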
Now that we have defined MDPs, policies, optimality criteria and value functions, it
is time to consider the question of how to compute optimal policies. Solving a given
MDP means computing an optimal policy π ∗ . Several dimensions exist along which
algorithms have been developed for this purpose. The most important distinction is
that between model-based and model-free algorithms.
Model-based algorithms exist under the general name of DP. The basic assump-
tion in these algorithms is that a model of the MDP is known beforehand, and can be
used to compute value functions and policies using the Bellman equation (see Equa-
tion 1.3). Most methods are aimed at computing state value functions which can, in
the presence of the model, be used for optimal action selection. In this chapter we
will focus on iterative procedures for computing value functions and policies.
Model-free algorithms, under the general name of RL, do not rely on the avail-
ability of a perfect model. Instead, they rely on interaction with the environment, i.e. on simulating the policy and thereby generating samples of state transitions and rewards. These samples are then used to estimate state-action value functions.
Because a model of the MDP is not known, the agent has to explore the MDP to ob-
tain information. This naturally induces an exploration-exploitation trade-off which has to be balanced to obtain an optimal policy. In model-based reinforcement learning, the agent does not possess an a priori model of the environment, but estimates
it while it is learning. After inducing a reasonable model of the environment, the
agent can then apply dynamic programming-like algorithms to compute a policy.
A very important underlying mechanism, the so-called generalized policy itera-
tion (GPI) principle, present in all methods is depicted in Figure 1.3. This principle
consists of two interaction processes. The policy evaluation step estimates the utility
of the current policy π , that is, it computes V π . There are several ways for comput-
ing this. In model-based algorithms, one can use the model to compute it directly
or iteratively approximate it. In model-free algorithms, one can simulate the pol-
icy and estimate its utility from the sampled execution traces. The main purpose of
this step is to gather information about the policy for computing the second step,
the policy improvement step. In this step, the values of the actions are evaluated
for every state, in order to find possible improvements, i.e. possible other actions
in particular states that are better than the action the current policy proposes. This
step computes an improved policy π′ from the current policy π using the information in V^π. Both the evaluation and the improvement steps can be implemented in
various ways, and interleaved in several distinct ways. The bottom line is that there
is a policy that drives value learning, i.e. it determines the value function, but in
turn there is a value function that can be used by the policy to select good actions.
Note that it is also possible to have an implicit representation of the policy, which
means that only the value function is stored, and a policy is computed on-the-fly
for each state based on the value function when needed. This is common practice in
model-free algorithms (see Section 1.7). Vice versa, it is also possible to have implicit representations of value functions in the context of an explicit policy representation. Another interesting aspect is that in general a value function does not have to be perfectly accurate. In many cases it suffices that the value function distinguishes sufficiently between suboptimal and optimal actions, such that small errors in the values do not affect the optimality of the resulting policy. This is also important in approximation and abstraction methods.
Planning as an RL Problem
Fig. 1.3 a) The algorithms in Section 1.5 can be seen as instantiations of Generalized Policy
Iteration (GPI) Sutton and Barto (1998). The policy evaluation step estimates V π , the policy’s
performance. The policy improvement step improves the policy π based on the estimates in
V π . b) The gradual convergence of both the value function and the policy to optimal versions.
A classical planning task with a set of goal states G ⊆ S can be cast as an MDP in which the transition function is assumed to be deterministic, i.e. for all states s ∈ S and actions a ∈ A there exists only one state s′ ∈ S such that T(s,a,s′) = 1. All states in G are assumed to be absorbing. The only thing left is to specify the reward function. We can specify this in such a way that a positive reinforcement is received once a goal state is reached, and zero otherwise:

R(s_t, a_t, s_{t+1}) =  1, if s_t ∉ G and s_{t+1} ∈ G
                        0, otherwise
Now, depending on whether the transition function and reward function are known
to the agent, one can solve this planning task with either model-based or model-free
learning. The difference with classic planning is that the learned policy will apply
to all states.
The term DP refers to a class of algorithms that is able to compute optimal poli-
cies in the presence of a perfect model of the environment. The assumption that a
model is available will be hard to ensure for many applications. However, we will
see that from a theoretical viewpoint, as well as from an algorithmic viewpoint, DP
algorithms are very relevant because they define fundamental computational mech-
anisms which are also used when no model is available. The methods in this section
all assume a standard MDP ⟨S, A, T, R⟩, where the state and action sets are finite and
discrete such that they can be stored in tables. Furthermore, transition, reward and
value functions are assumed to store values for all states and actions separately.
Two core DP methods are policy iteration Howard (1960) and value iteration Bell-
man (1957). In the first, the GPI mechanism is clearly separated into two steps,
whereas the second represents a tight integration of policy evaluation and improve-
ment. We will consider both these algorithms in turn.
Policy iteration (PI) Howard (1960) iterates between the two phases of GPI. The
policy evaluation phase computes the value function of the current policy and the
policy improvement phase computes an improved policy by a maximization over the
value function. This is repeated until the process converges to an optimal policy.
A first step is to find the value function V π of a fixed policy π . This is called the pre-
diction problem. It is a part of the complete problem, that of computing an optimal
policy. Remember from the previous sections that for all s ∈ S,
V^π(s) = ∑_{s'∈S} T(s,π(s),s') ( R(s,π(s),s') + γ V^π(s') )    (1.8)
If the dynamics of the system are known, i.e. a model of the MDP is given, then
these equations form a system of |S| equations in |S| unknowns (the values of V π for
each s ∈ S). This can be solved by linear programming (LP). However, an iterative
procedure is possible, and in fact common in DP and RL. The Bellman equation is
transformed into an update rule which updates the current value function V^π_k into V^π_{k+1} by 'looking one step further into the future', thereby extending the planning horizon by one step:

V^π_{k+1}(s) = E_π[ r_t + γ V^π_k(s_{t+1}) | s_t = s ]
             = ∑_{s'} T(s,π(s),s') ( R(s,π(s),s') + γ V^π_k(s') )    (1.9)
The value function V^π of a fixed policy π is the fixed point of this backup operator B^π (the right-hand side of Equation 1.9 viewed as an operator on value functions), i.e. V^π = B^π V^π. A useful special case of this backup operator is defined with respect to a fixed action a:

(B^a V)(s) = ∑_{s'∈S} T(s,a,s') ( R(s,a,s') + γ V(s') )    (1.10)
The LP approach mentioned earlier can now be stated as follows. Computing V^π amounts to solving the system of linear Bellman equations (Equation 1.8) for all states. The optimal value function V^* can be found by using an LP solver that computes V^* = arg min_V ∑_s V(s) subject to V(s) ≥ (B^a V)(s) for all a and s.
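In practice, the iterative application of Equation 1.9 is the most common way to solve the prediction problem. A minimal tabular sketch is given below, using the same illustrative dictionary representations of T and R as in the earlier sketch; the function name and the stopping threshold theta are assumptions of this example, not the book's pseudocode.

def evaluate_policy(states, policy, T, R, gamma, theta=1e-8):
    """Iterative policy evaluation: apply Equation 1.9 until the largest change is below theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v_new = sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                                # in-place (Gauss-Seidel style) update
        if delta < theta:
            return V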
Policy Improvement
Now that we know the value function V π of a policy π as the outcome of the policy
evaluation step, we can try to improve the policy. First we identify the value of all
actions by using:
Q^π(s,a) = E_π[ r_t + γ V^π(s_{t+1}) | s_t = s, a_t = a ]    (1.11)
         = ∑_{s'} T(s,a,s') ( R(s,a,s') + γ V^π(s') )    (1.12)
If now Qπ (s,a) is larger than V π (s) for some a ∈ A then we could do better by
choosing action a instead of the current π (s). In other words, we can improve the
current policy by selecting a different, better action in a particular state. In fact, we can evaluate all actions in all states and choose the best action in every state. That is, we can compute the greedy policy π′ by selecting the best action in each state, based on the current value function V^π:

π′(s) = argmax_{a} Q^π(s,a)    (1.13)
Computing an improved policy by greedily selecting the best action with respect to
the value function of the original policy is called policy improvement. If the policy
cannot be improved in this way, it means that the policy is already optimal and its
value function satisfies the Bellman equation for the optimal value function. In a
similar way one can also perform these steps for stochastic policies by blending the
action probabilities into the expectation operator.
Summarizing, policy iteration Howard (1960) starts with an arbitrarily initialized policy π_0. Then a sequence of iterations follows in which the current policy is evaluated, after which it is improved. The first step, the policy evaluation step, computes V^{π_k}, making use of Equation 1.9 in an iterative way. The second step, the policy improvement step, computes π_{k+1} from π_k using V^{π_k}. For each state, the best action is determined using the one-step look-ahead of Equation 1.4 with V^{π_k}. If π_{k+1}(s) = π_k(s) for all states s, the policy is stable and the policy iteration algorithm can stop. Policy iteration generates a sequence of alternating policies and value functions

π_0 → V^{π_0} → π_1 → V^{π_1} → π_2 → V^{π_2} → π_3 → V^{π_3} → … → π^*
The complete algorithm can be found in Algorithm 1.
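A compact Python sketch of policy iteration is given below, reusing the evaluate_policy function from the previous sketch; as before, the tabular representations and names are illustrative assumptions rather than the book's Algorithm 1.

def policy_iteration(states, actions, T, R, gamma):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable."""
    policy = {s: actions[0] for s in states}            # arbitrarily initialized policy
    while True:
        V = evaluate_policy(states, policy, T, R, gamma)
        stable = True
        for s in states:
            q = {a: sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)].items())
                 for a in actions}
            best = max(q, key=q.get)
            if q[best] > q[policy[s]] + 1e-12:          # improve only on a strict gain
                policy[s] = best
                stable = False
        if stable:
            return policy, V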
For finite MDPs, i.e. state and action spaces are finite, policy iteration converges
after a finite number of iterations. Each policy πk+1 is a strictly better policy than
πk unless πk = π ∗ , in which case the algorithm stops. And because for a finite MDP,
the number of different policies is finite, policy iteration converges in finite time.
In practice, it usually converges after a small number of iterations. Although policy
iteration computes the optimal policy for a given MDP in finite time, it is relatively
inefficient. In particular the first step, the policy evaluation step, is computation-
ally expensive. Value functions for all intermediate policies π0 , . . . , πk , . . . , π ∗ are
computed, which involves multiple sweeps through the complete state space per it-
eration. A bound on the number of iterations is not known Littman et al (1995), and it depends on the MDP transition structure, although in practice the algorithm often converges after a few iterations.
The policy iteration algorithm completely separates the evaluation and improve-
ment phases. In the evaluation step, the value function must be computed in the
limit. However, it is not necessary to wait for full convergence, but it is possible to
stop evaluating earlier and improve the policy based on the evaluation so far. The
extreme point of truncating the evaluation step is the value iteration Bellman (1957)
algorithm. It breaks off evaluation after just one iteration. In fact, it immediately
blends the policy improvement step into its iterations, thereby focusing purely on directly estimating the value function. Necessary updates are computed on-the-fly.
In essence, it combines a truncated version of the policy evaluation step with the
policy improvement step, which is essentially Equation 1.3 turned into one update
rule:
V_{t+1}(s) = max_{a} ∑_{s'} T(s,a,s') ( R(s,a,s') + γ V_t(s') )    (1.14)
           = max_{a} Q_{t+1}(s,a)    (1.15)
Using Equations (1.14) and (1.15), the value iteration algorithm (see Algorithm 2) can be stated as follows: starting with a value function V_0 over all states, one iteratively updates the value of each state according to (1.14) to get the next value functions V_t (t = 1,2,3,…). It produces the following sequence of value functions:

V_0 → V_1 → V_2 → V_3 → V_4 → V_5 → V_6 → V_7 → … → V^*
Actually, in the way it is computed it also produces the intermediate Q-value func-
tions such that the sequence is
V_0 → Q_1 → V_1 → Q_2 → V_2 → Q_3 → V_3 → Q_4 → V_4 → … → V^*
Value iteration is guaranteed to converge in the limit towards V ∗ , i.e. the Bellman
optimality Equation (1.3) holds for each state. A deterministic policy π for all states
s ∈ S can be computed using Equation 1.4. If we use the same general backup operator mechanism as in the previous section, we can define value iteration by means of the Bellman optimality backup operator B^*:

(B^* φ)(s) = max_{a} ∑_{s'∈S} T(s,a,s') ( R(s,a,s') + γ φ(s') )    (1.16)
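A corresponding sketch of value iteration, applying the backup of Equation 1.14 until the largest change falls below a threshold and then extracting a greedy policy as in Equation 1.4; the tabular representations, names and threshold are again illustrative assumptions.

def value_iteration(states, actions, T, R, gamma, theta=1e-8):
    """Repeatedly apply the Bellman optimality backup of Equation 1.14, then extract a greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)].items())
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    policy = {s: max(actions, key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                                for s2, p in T[(s, a)].items()))
              for s in states}
    return V, policy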
The policy iteration and value iteration algorithms can be seen as spanning a spec-
trum of DP approaches. This spectrum ranges from complete separation of eval-
uation and improvement steps to a complete integration of these steps. Clearly, in between these extreme points there is much room for variations on these algorithms. Let us first
consider the computational complexity of the extreme points.
Complexity
Both policy iteration and value iteration require sweeps over the complete state space, and state spaces grow extremely large when the problem size grows. The state spaces of games such
as backgammon and chess consist of too many states to perform just one full sweep.
In this section we will describe some efficient variations on DP approaches. Detailed
coverage of complexity results for the solution of MDPs can be found in Littman
et al (1995); Bertsekas and Tsitsiklis (1996); Boutilier et al (1999).
The efficiency of DP can be improved along roughly two lines. The first is a
tighter integration of the evaluation and improvement steps of the GPI process.
We will discuss this issue briefly in the next section. The second is that of using
(heuristic) search algorithms in combination with DP algorithms. For example, us-
ing search as an exploration mechanism can highlight important parts of the state
space such that value backups can be concentrated on these parts. This is the under-
lying mechanism used in the methods discussed briefly in Section 1.6.2.2.
The full backup updates in DP algorithms can be performed in several ways. We have
assumed in the description of the algorithms that in each step an old and a new value
function are kept in memory. Each update puts a new value in the new table, based
on the information of the old. This is called synchronous, or Jacobi-style updating
Sutton and Barto (1998). This is useful for explanation of algorithms and theoretical
proofs of convergence. However, there are two more common ways for updates. One
can keep a single table and do the updating directly in there. This is called in-place
updating Sutton and Barto (1998) or Gauss-Seidel Bertsekas and Tsitsiklis (1996)
and usually speeds up convergence, because during one sweep of updates, some
updates use already newly updated values of other states. Another type of updating
is called asynchronous updating which is an extension of the in-place updates, but
here updates can be performed in any order. An advantage is that the updates may
be distributed unevenly throughout the state(-action) space, with more updates being
given to more important parts of this space. For all these methods convergence can be proven under the general condition that the values of all states are updated infinitely often.
Modified policy iteration (MPI) Puterman and Shin (1978) strikes a middle ground
between value and policy iteration. MPI maintains the two separate steps of GPI,
but neither step is necessarily computed in the limit. The key insight here is that one does not need an exactly evaluated policy in order to improve it. For example, the policy estimation step can be approximate, after which
a policy improvement step can follow. In general, both steps can be performed quite
independently by different means. For example, instead of iteratively applying the
Bellman update rule from Equation 1.15, one can perform the policy estimation step
by using a sampling procedure such as Monte Carlo estimation Sutton and Barto
(1998). These mixed forms of estimation and improvement are captured by the generalized policy iteration mechanism depicted in Figure 1.3. Policy iteration and value iteration are the two extreme cases of modified policy iteration, and MPI itself can be seen as a general method for asynchronous updating.
In many realistic problems, only a fraction of the state space is relevant to the prob-
lem of reaching the goal state from some state s. This has inspired a number of
algorithms that focus computation on states that seem most relevant for finding an
optimal policy from a start state s. These algorithms usually display good anytime
behavior, i.e. they produce good or reasonable policies fast, after which they are
gradually improved. In addition, they can be seen as implementing various ways of
asynchronous DP.
The previous section has reviewed several methods for computing an optimal policy
for an MDP assuming that a (perfect) model is available. RL is primarily concerned
with how to obtain an optimal policy when such a model is not available. RL adds
to MDPs a focus on approximation and incomplete information, and the need for
sampling and exploration. In contrast with the algorithms discussed in the previous
section, model-free methods do not rely on the availability of a priori known transition and reward models, i.e. a model of the MDP. The lack of a model generates a
need to sample the MDP to gather statistical knowledge about this unknown model.
Many model-free RL techniques exist that probe the environment by doing actions,
thereby estimating the same kind of state value and state-action value functions as
model-based techniques. This section will review model-free methods along with
several efficient extensions.
In model-free contexts one still has a choice between two options. The first is to learn the transition and reward model from interaction with the environment. After that, when the model is (approximately or sufficiently) correct, all the DP methods from the previous section apply. This type of learning is called indirect or model-based RL. The second option, called direct RL, is to step right into estimating values for actions, without even estimating the model of the MDP. Additionally, mixed forms between these two exist. For example, one can still do model-free estimation of action values, but use an approximate model to speed up value learning by performing additional (full) backups of values with this model (see Section 1.7.3). Most model-free methods, however, focus on direct
estimation of (action) values.
A second choice one has to make is what to do with the temporal credit assign-
ment problem. It is difficult to assess the utility of some action, if the real effects
of this particular action can only be perceived much later. One possibility is to wait
until the "end" (e.g. of an episode) and punish or reward specific actions along the path taken. However, this will take a lot of memory and often, with ongoing tasks, it is not known beforehand whether, or when, there will be an "end". Instead, one
can use similar mechanisms as in value iteration to adjust the estimated value of
a state based on the immediate reward and the estimated (discounted) value of the
next state. This is generally called temporal difference learning which is a general
mechanism underlying the model-free methods in this section. The main difference
with the update rules for DP approaches (such as Equation 1.14) is that the transition
function T and reward function R cannot appear in the update rules now. The gen-
eral class of algorithms that interact with the environment and update their estimates
after each experience is called online RL.
A general template for online RL is depicted in Algorithm 3. It shows an interaction
loop in which the agent selects an action (by whatever means) based on its current
state, gets feedback in the form of the resulting state and an associated reward, after
which it updates its estimated values stored in Ṽ and Q̃ and possibly statistics con-
cerning T̃ and R̃ (in case of some form of indirect learning). The selection of the
action is based on the current state s and the value function (either Q or V ). To solve
the exploration-exploitation problem, usually a separate exploration mechanism en-
sures that sometimes the best action (according to current estimates of action values)
is taken (exploitation) but sometimes a different action is chosen (exploration). Var-
ious choices for exploration, ranging from random to sophisticated, exist and we
will see some examples below and in Section 1.7.3.
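The following sketch captures such an online RL loop in generic form. The env.reset()/env.step() interface, the select_action and update callbacks and the function name are assumptions made for this illustration; any concrete algorithm from this section can be plugged in through the two callbacks.

def online_rl(env, select_action, update, num_steps):
    """Generic online RL loop: act, observe, update the estimates after every single step.

    Assumed interfaces: env.reset() -> s and env.step(a) -> (s_next, r, done);
    select_action(s) implements the exploration strategy, and update(s, a, r, s_next)
    adjusts the value estimates (and possibly a learned model).
    """
    s = env.reset()
    for _ in range(num_steps):
        a = select_action(s)
        s_next, r, done = env.step(a)
        update(s, a, r, s_next)
        s = env.reset() if done else s_next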
Exploration
One important aspect of model-free algorithms is that there is a need for explo-
ration. Because the model is unknown, the learner has to try out different actions to
see their results. A learning algorithm has to strike a balance between exploration
and exploitation, i.e. in order to gain a lot of reward the learner has to exploit its
current knowledge about good actions, although it sometimes must try out different actions to explore the environment in search of possibly better actions. The most
basic exploration strategy is the ε -greedy policy, i.e. the learner takes its current
best action with probability (1 − ε ) and a (randomly selected) other action with
probability ε . There are many more ways of doing exploration (see Wiering (1999);
Reynolds (2002); Ratitch (2005) for overviews) and in Section 1.7.3 we will see
some examples. One additional method that is often used in combination with the
algorithms in this section is the Boltzmann (or: softmax) exploration strategy. It
is only slightly more complicated than the ε -greedy strategy. The action selection
strategy is still random, but selection probabilities are weighted by their relative Q-
values. This makes it more likely for the agent to choose very good actions, whereas
two actions that have similar Q-values will have almost the same probability to get
selected. Its general form is
P(a) = e^{Q(s,a)/T} / ∑_i e^{Q(s,a_i)/T}    (1.17)
in which P(a) is the probability of selecting action a and T is the temperature pa-
rameter. Higher values of T will move the selection strategy more towards a purely
random strategy and lower values will move it towards a fully greedy strategy. A combination of ε-greedy and Boltzmann exploration can be obtained by taking the best action with probability (1 − ε) and otherwise selecting an action according to Equation 1.17 Wiering (1999).
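A small sketch of both exploration strategies is given below; the tabular Q dictionary and the function names are illustrative assumptions. Subtracting the maximum preference before exponentiating is a standard numerical-stability measure, not part of Equation 1.17 itself.

import math
import random

def epsilon_greedy(Q, s, actions, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann(Q, s, actions, temperature):
    """Softmax action selection according to Equation 1.17."""
    prefs = [Q[(s, a)] / temperature for a in actions]
    m = max(prefs)                                      # subtracting the maximum avoids overflow
    weights = [math.exp(p - m) for p in prefs]
    return random.choices(actions, weights=weights)[0]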
Another simple method to stimulate exploration is optimistic Q-value initialization; one can initialize all Q-values to high values – e.g. an a priori defined upper bound – at the start of learning. Because Q-values will decrease during learning, actions that have not been tried a sufficient number of times will retain a large enough value to get selected when using, for example, Boltzmann exploration. Another solution with
a similar effect is to keep counters on the number of times a particular state-action
pair has been selected.
estimate on arriving back home with 20 minutes less. However, once on your way
from the butcher to the wine seller, you see that there is quite some traffic along the
way and it takes you 30 minutes longer to get there. Finally you arrive 10 minutes
later than you predicted in the first place. The bottom line of this example is that
you can adjust your estimate about what time you will be back home every time you
have obtained new information about in-between steps. Each time you can adjust
your estimate on how long it will still take based on actually experienced times of
parts of your path. This is the main principle of TD learning: you do not have to
wait until the end of a trial to make updates along your path.
TD methods learn their value estimates based on estimates of other values, which
is called bootstrapping. They have an advantage over DP in that they do not require
a model of the MDP. Another advantage is that they are naturally implemented in
an online, incremental fashion such that they can be easily used in various circum-
stances. No full sweeps through the state space are needed; values are updated only along experienced paths, and updates are performed after each step.
TD(0)
The TD(0) algorithm updates the value of a state after each transition using the update rule

V(s) := V(s) + α ( r + γ V(s′) − V(s) )

where α ∈ [0,1] is the learning rate that determines by how much values get updated. This backup is performed after experiencing the transition from state s to s′
based on the action a, while receiving reward r. The difference with DP backups
such as used in Equation 1.14 is that the update is still done by using bootstrapping,
but it is based on an observed transition, i.e. it uses a sample backup instead of a
full backup. Only the value of one successor state is used, instead of a weighted av-
erage of all possible successor states. When using the value function V π for action
selection, a model is needed to compute an expected value over all action outcomes
(e.g. see Equation 1.4).
The learning rate α has to be decreased appropriately for learning to converge.
Sometimes the learning rate can be defined for states separately as in α (s), in which
case it can be dependent on how often the state is visited. The next two algorithms
learn Q-functions directly from samples, removing the need for a transition model
for action selection.
5 The learning parameter α should comply with some criteria on its value, and the way it
is changed. In the algorithms in this section, one often chooses a small, fixed learning
parameter, or it is decreased every iteration.
Q-learning
One of the most basic and popular methods to estimate Q-value functions in a
model-free fashion is the Q-learning algorithm Watkins (1989); Watkins and Dayan
(1992), see Algorithm 4.
The basic idea in Q-learning is to incrementally estimate Q-values for actions,
based on feedback (i.e. rewards) and the agent’s Q-value function. The update rule is
a variation on the theme of TD learning, using Q-values and a built-in max-operator
over the Q-values of the next state in order to update Q_k into Q_{k+1}:

Q_{k+1}(s_t,a_t) = Q_k(s_t,a_t) + α ( r_t + γ max_{a} Q_k(s_{t+1},a) − Q_k(s_t,a_t) )    (1.18)
The agent makes a step in the environment from state st to st+1 using action at while
receiving reward rt . The update takes place on the Q-value of action at in the state
st from which this action was executed.
Q-learning is exploration-insensitive: it will converge to the optimal policy regardless of the exploration policy being followed, under the assumption
that each state-action pair is visited an infinite number of times, and the learning
parameter α is decreased appropriately Watkins and Dayan (1992); Bertsekas and
Tsitsiklis (1996).
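A minimal tabular Q-learning sketch with ε-greedy exploration is given below, using the same assumed env interface as in the online-RL template above; the names and default parameter values are illustrative, not prescribed by the chapter.

import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.1, gamma=0.95, epsilon=0.1, episodes=500):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = defaultdict(float)                              # Q[(s, a)], initialized at zero
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # the update of Equation 1.18
            s = s_next
    return Q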
SARSA
SARSA is an on-policy variant of Q-learning Rummery and Niranjan (1994) that uses the update rule

Q_{k+1}(s_t,a_t) = Q_k(s_t,a_t) + α ( r_t + γ Q_k(s_{t+1},a_{t+1}) − Q_k(s_t,a_t) )

where the action a_{t+1} is the action that is executed by the current policy for state
st+1 . Note that the max-operator in Q-learning is replaced by the estimate of the
value of the next action according to the policy. This learning algorithm will still
converge in the limit to the optimal value function (and policy) under the condition
that all states and actions are tried infinitely often and the policy converges in the
limit to the greedy policy, i.e. such that exploration does not occur anymore Singh
et al (2000). SARSA is especially useful in non-stationary environments, in which one will never reach an optimal policy. It is also useful when function approximation is used, because off-policy methods can diverge in that case.
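The on-policy nature of SARSA shows up as a one-line change relative to the Q-learning sketch above: the bootstrap target uses the next action actually chosen by the behavior policy instead of the max over actions. Names, parameters and the env interface are again illustrative assumptions.

import random
from collections import defaultdict

def sarsa(env, actions, alpha=0.1, gamma=0.95, epsilon=0.1, episodes=500):
    """Tabular SARSA: the target bootstraps on the next action chosen by the behavior policy."""
    Q = defaultdict(float)
    def select(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda b: Q[(s, b)])
    for _ in range(episodes):
        s, done = env.reset(), False
        a = select(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = select(s_next)
            target = r if done else r + gamma * Q[(s_next, a_next)]   # no max-operator here
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q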
Actor-Critic Learning
Another class of algorithms, one that precedes Q-learning and SARSA, is that of actor–critic
methods Witten (1977); Barto et al (1983); Konda and Tsitsiklis (2003), which learn
on-policy. This branch of TD methods keeps a separate policy independent of the
value function. The policy is called the actor and the value function the critic. The
critic – typically a state-value function – evaluates, or: criticizes, the actions exe-
cuted by the actor. After action selection, the critic evaluates the action using the
TD-error:
δ_t = r_t + γ V(s_{t+1}) − V(s_t)
The purpose of this error is to strengthen or weaken the selection of this action in
this state. A preference for an action a in some state s can be represented as p(s,a)
such that this preference can be modified using:

p(s_t,a_t) := p(s_t,a_t) + β δ_t

where the parameter β determines the size of the update. There are other versions of
actor–critic methods, differing mainly in how preferences are changed, or experi-
ence is used (for example using eligibility traces, see next section). An advantage
of having a separate policy representation is that if there are many actions, or when
the action space is continuous, there is no need to consider all actions’ Q-values in
order to select one of them. A second advantage is that they can learn stochastic
policies naturally. Furthermore, a priori knowledge about policy constraints can be
used, e.g. see Främling (2005).
Extensions of these algorithms to the average-reward framework exist (see Mahadevan (1996) for an overview).
Other algorithms that use more unbiased estimates are Monte Carlo (MC) tech-
niques. They keep frequency counts on state-action pairs and future reward-sums
(returns) and base their values on these estimates. MC methods only require samples
to estimate average sample returns. For example, in MC policy evaluation, for each
state s ∈ S all returns obtained from s are kept and the value of a state s ∈ S is just
their average. In other words, MC algorithms treat the long-term reward as a random
variable and take as its estimate the sampled mean. In contrast with one-step TD
methods, MC estimates values based on averaging sample returns observed during
interaction. Especially for episodic tasks this can be very useful, because samples
from complete returns can be obtained. One way of using MC is by using it for the
evaluation step in policy iteration. However, because the sampling is dependent on
the current policy π , only returns for actions suggested by π are evaluated. Thus,
exploration is of key importance here, just as in other model-free methods.
A distinction can be made between every-visit MC, which averages over all visits
of a state s ∈ S in all episodes, and first-visit MC, which averages over just the re-
turns obtained from the first visit to a state s ∈ S for all episodes. Both variants will
converge to V π for the current policy π over time. MC methods can also be applied
to the problem of estimating action values. One way of ensuring enough explo-
ration is to use exploring starts, i.e. each state-action pair has a non-zero probability
of being selected as the initial pair. MC methods can be used for both on-policy
and off-policy control, and the general pattern complies with the generalized policy
iteration procedure. The fact that MC methods do not bootstrap makes them less de-
pendent on the Markov assumption. TD methods too focus on sampled experience,
although they do use bootstrapping.
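A sketch of first-visit MC policy evaluation is given below; it assumes episodes are given as lists of (state, reward) pairs generated by the policy under evaluation, which, like the function name, is an assumption of this illustration.

from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """First-visit Monte Carlo policy evaluation.

    'episodes' is a list of trajectories, each a list of (state, reward) pairs
    generated by the policy under evaluation.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        first_returns = {}
        for s, r in reversed(episode):                  # walk backwards to accumulate returns
            G = r + gamma * G
            first_returns[s] = G                        # overwritten until only the first visit remains
        for s, G_s in first_returns.items():
            returns_sum[s] += G_s
            returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}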
Learning a Model
We have described several methods for learning value functions. Indirect or model-
based RL can also be used to estimate a model of the underlying MDP. Averaging over the sample transitions experienced during interaction can be used to gradually estimate the transition probabilities. The same can be done for immedi-
ate rewards. Indirect RL algorithms make use of this to strike a balance between
model-based and model-free learning. They are essentially model-free, but learn a
transition and reward model in parallel with model-free RL, and use this model to
do more efficient value function learning (see also the next section). An example of
this is the DYNA model Sutton (1991a). Another method that often employs model
learning is prioritized sweeping Moore and Atkeson (1993). Learning a model can
also be very useful to learn in continuous spaces where the transition model is de-
fined over a discretized version of the underlying (infinite) state space Großmann
(2000).
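A minimal sketch of such maximum-likelihood model estimation from observed transitions; the class name and the dictionary-of-counts representation are illustrative assumptions.

from collections import defaultdict

class MLModel:
    """Maximum-likelihood estimates of T and R from observed transitions."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
        self.reward_sum = defaultdict(float)                  # (s, a, s') -> summed rewards

    def observe(self, s, a, r, s_next):
        self.counts[(s, a)][s_next] += 1
        self.reward_sum[(s, a, s_next)] += r

    def T(self, s, a, s_next):
        n = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s_next] / n if n else 0.0

    def R(self, s, a, s_next):
        n = self.counts[(s, a)][s_next]
        return self.reward_sum[(s, a, s_next)] / n if n else 0.0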
The methods in the previous section have shown that both prediction and control can
be learned using samples from interaction with the environment, without having ac-
cess to a model of the MDP. One problem with these methods is that they often need
a large number of experiences to converge. In this section we describe a number of
extensions used to speed up learning. One direction for improvement lies in the exploration. One can – in principle – use model estimation until everything about the MDP is known, but this simply takes too long. Using more information about the MDP enables more efficient, directed exploration.
Efficient Exploration
The E³ algorithm of Kearns and Singh (1998), for example, keeps such statistics during learning. The state space is divided into known and unknown parts. On every step
a decision is made whether the known part contains sufficient opportunities for get-
ting rewards or whether the unknown part should be explored to obtain possibly
more reward. An important aspect of this algorithm is that it was the first general
near-optimal (tabular) RL algorithm with provable bounds on computation time.
The approach was extended in Brafman and Tennenholtz (2002) into the more gen-
eral algorithm R-MAX. It too provides a polynomial bound on computation time
for reaching near-optimal policies. As a last example, Ratitch (2005) presents an
approach for efficient, directed exploration based on more sophisticated character-
istics of the MDP such as an entropy measure over state transitions. An interesting
feature of this approach is that these characteristics can be computed before learn-
ing and be used in combination with other exploration methods, thereby improving
their behavior.
For a more detailed coverage of exploration strategies we refer the reader to
Ratitch (2005) and Wiering (1999).
Exploration methods can be used to speed up learning and focus attention on relevant
areas in the state space. The exploration methods mainly use statistics derived from
the problem before or during learning. However, sometimes more information is
available that can be used to guide the learner. For example, if a reasonable policy for
a domain is available, it can be used to generate more useful learning samples than
(random) exploration could do. In fact, humans are usually very bad at specifying optimal policies, but quite good at specifying reasonable ones6.
The work in behavioral cloning Bain and Sammut (1995) takes an extreme point
on the guidance spectrum in that the goal is to replicate example behavior from ex-
pert traces, i.e. to clone this behavior. This type of guidance moves learning more
in the direction of supervised learning. Another way to help the agent is by shaping
Mataric (1994); Dorigo and Colombetti (1997); Ng et al (1999). Shaping pushes the
reward closer to the subgoals of behavior, and thus encourages the agent to incre-
mentally improve its behavior by searching the policy space more effectively. This
is also related to the general issue of giving rewards to appropriate subgoals, and
the gradual increase in difficulty of tasks. The agent can be trained on increasingly
more difficult problems, which can also be considered as a form of guidance.
Various other mechanisms can be used to provide guidance to RL algorithms,
such as decompositions Dixon et al (2000), heuristic rules for better exploration
Främling (2005) and various types of transfer in which knowledge learned in one
problem is transferred to other, related problems, e.g. see Konidaris (2006).
6 Quote taken from the invited talk by Leslie Kaelbling at the European Workshop on Rein-
forcement Learning EWRL, in Utrecht 2001.
Eligibility Traces
In MC methods, the updates are based on the entire sequence of observed rewards
until the end of an episode. In TD methods, the estimates are based on the samples
of immediate rewards and the next states. An intermediate approach is to use the n-step truncated return R_t^{(n)}, obtained from a whole sequence of rewards:

R_t^{(n)} = r_t + γ r_{t+1} + γ^2 r_{t+2} + … + γ^n V_t(s_{t+n})
With this, one can compute updates of values based on several n-step returns. The family of TD(λ) methods, with 0 ≤ λ ≤ 1, combines n-step returns weighted proportionally to λ^{n−1}, i.e. it uses the λ-return

R_t^λ = (1−λ) ∑_{n=1}^{∞} λ^{n−1} R_t^{(n)}

The problem with this is that we would have to wait indefinitely to compute R_t^{(∞)}.
This view is useful for theoretical analysis and understanding of n-step backups.
It is called the forward view of the TD(λ ) algorithm. However, the usual way to
implement this kind of updates is called the backward view of the TD(λ ) algorithm
and is done by using eligibility traces, which is an incremental implementation of
the same idea.
Eligibility traces are a way to perform some kind of n-step backups in an elegant
way. For each state s ∈ S, an eligibility trace e_t(s) is kept in memory. The traces are initialized at 0 and updated on every step according to:

e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
e_t(s) = γλ e_{t−1}(s) + 1    if s = s_t
where λ is the trace decay parameter. The trace for each state is increased every
time that state is visited and decreases exponentially otherwise. Now δt is the tem-
poral difference error at stage t:

δ_t = r_t + γ V(s_{t+1}) − V(s_t)

On every step, all states are updated in proportion to their eligibility traces as in:

V(s) := V(s) + α δ_t e_t(s)
The forward and backward view on eligibility traces can be proved equivalent Sut-
ton and Barto (1998). For λ = 1, TD(λ ) is essentially the same as MC, because
it considers the complete return, and for λ = 0, TD(λ ) uses just the immediate re-
turn as in all one-step RL algorithms. Eligibility traces are a general mechanism to
learn from n-step returns. They can be combined with all of the model-free meth-
ods we have described in the previous section. Watkins (1989) combined Q-learning
with eligibility traces in the Q(λ )-algorithm. Peng and Williams (1996) proposed
a similar algorithm, and Wiering and Schmidhuber (1998b) and Reynolds (2002)
both proposed efficient versions of Q(λ ). The problem with combining eligibility
traces with learning control is that special care has to be taken in case of exploratory
actions, which can break the intended meaning of the n-step return for the current
policy that is followed. In Watkins (1989)’s version, eligibility traces are reset ev-
ery time an exploratory action is taken. Peng and Williams (1996)’s version is, in that respect, more efficient, in that traces do not have to be reset every time. SARSA(λ) Sutton and Barto (1998) is safer in this respect, because action
selection is on-policy. Another recent on-policy learning algorithm is the QV(λ )
algorithm by Wiering (2005). In QV(λ )-learning two value functions are learned;
TD(λ ) is used for learning a state value function V and one-step Q-learning is used
for learning a state-action value function, based on V .
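A sketch of tabular TD(λ) policy evaluation with accumulating eligibility traces, implementing the backward view described above; the episode format (lists of (s, r, s′) transitions), names and parameters are illustrative assumptions.

from collections import defaultdict

def td_lambda(episodes, gamma, lam, alpha):
    """Tabular TD(lambda) policy evaluation with accumulating eligibility traces (backward view).

    'episodes' is a list of trajectories, each a list of (s, r, s_next) transitions
    generated by the policy under evaluation; terminal successor states keep value 0.
    """
    V = defaultdict(float)
    for episode in episodes:
        e = defaultdict(float)                          # eligibility traces, initialized at 0
        for s, r, s_next in episode:
            delta = r + gamma * V[s_next] - V[s]        # the TD error
            e[s] += 1.0                                 # increment the trace of the visited state
            for x in list(e):
                V[x] += alpha * delta * e[x]            # update all states in proportion to e
                e[x] *= gamma * lam                     # exponential decay of the traces
    return V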
Even though RL methods can function without a model of the MDP, such a model
can be useful to speed up learning, or bias exploration. A learned model can also be
useful to do more efficient value updating. A general guideline is that when experience is costly, it pays off to learn a model. In RL, model learning is usually targeted at the specific learning task defined by the MDP, i.e. determined by the rewards and the goal. More generally, learning a model is useful because it gives knowledge about the dynamics of the environment, such that it can be used for other tasks too
(see Drescher (1991) for extensive elaboration on this point).
The DYNA architecture Sutton (1990, 1991b,a); Sutton and Barto (1998) is a
simple way to use the model to amplify experiences. Algorithm 5 shows DYNA-
Q which combines Q-learning with planning. In a continuous loop, Q-learning is
interleaved with series of extra updates using a model that is constantly updated too.
DYNA needs less interactions with the environment, because it replays experience
to do more value updates.
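A minimal DYNA-Q sketch in the spirit of the loop described above: a Q-learning step, a model-learning step, and a number of planning steps that replay simulated transitions from the learned (here deterministic) model. Names, parameters and the env interface are illustrative assumptions, not the book's Algorithm 5.

import random
from collections import defaultdict

def dyna_q(env, actions, alpha=0.1, gamma=0.95, epsilon=0.1,
           planning_steps=10, episodes=200):
    """DYNA-Q: Q-learning interleaved with extra updates replayed from a learned model."""
    Q = defaultdict(float)
    model = {}                                          # (s, a) -> (r, s_next); deterministic model
    def select(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda b: Q[(s, b)])
    def backup(s, a, r, s_next):
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = select(s)
            s_next, r, done = env.step(a)
            backup(s, a, r, s_next)                     # direct RL update
            model[(s, a)] = (r, s_next)                 # model learning
            for _ in range(planning_steps):             # planning: replay simulated experience
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                backup(ps, pa, pr, ps_next)
            s = s_next
    return Q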
A related method that makes more use of experience using a learned model is
prioritized sweeping (PS) Moore and Atkeson (1993). Instead of selecting states to
be updated randomly (as in DYNA), PS prioritizes updates based on their change in
values. Once a state is updated, the PS algorithm considers all states that can reach
that state, by looking at the transition model, and sees whether these states will have
to be updated as well. The order of the updates is determined by the size of the
value updates. The general mechanism can be summarized as follows. In each step
i) one remembers the old value of the current state, ii) one updates the state value
with a full backup using the learned model, iii) one sets the priority of the current
state to 0, iv) one computes the change δ in value as the result of the backup, v)
one uses this difference to modify predecessors of the current state (determined by
the model); all states leading to the current state get a priority update of δ × T, where T is the transition probability with which such a predecessor state leads to the just-updated current state.
The number of value backups is a parameter to be set in the algorithm. Overall, PS
focuses the backups to where they are expected to most quickly reduce the error.
Another example of using planning in model-based RL is Wiering (2002).
1.8 Conclusions
This chapter has provided the necessary background of Markov decision processes,
dynamic programming and reinforcement learning. Core elements of many solution
algorithms which will be discussed in subsequent chapters are Bellman equations,
value updates, exploration and sampling.
References
Bain, M., Sammut, C.: A framework for behavioral cloning. In: Muggleton, S.H., Furakawa,
K., Michie, D. (eds.) Machine Intelligence, vol. 15, pp. 103–129. Oxford University Press
(1995)
Barto, A.G., Sutton, R.S., Anderson, C.W.: Neuronlike elements that can solve difficult learn-
ing control problems. IEEE Transactions on Systems, Man, and Cybernetics 13, 835–846
(1983)
Barto, A.G., Bradtke, S.J., Singh, S.: Learning to act using real-time dynamic programming.
Artificial Intelligence 72(1), 81–138 (1995)
Bellman, R.E.: Dynamic Programming. Princeton University Press, Princeton (1957)
Bertsekas, D.P.: Dynamic Programming and Optimal Control, vol. 1, 2. Athena Scientific,
Belmont (1995)
Bertsekas, D.P., Tsitsiklis, J.: Neuro-Dynamic Programming. Athena Scientific, Belmont
(1996)
Bonet, B., Geffner, H.: Faster heuristic search algorithms for planning with uncertainty and
full feedback. In: Proceedings of the International Joint Conference on Artificial Intelli-
gence (IJCAI), pp. 1233–1238 (2003a)
Bonet, B., Geffner, H.: Labeled RTDP: Improving the convergence of real-time dynamic
programming. In: Proceedings of the International Conference on Artificial Intelligence
Planning Systems (ICAPS), pp. 12–21 (2003b)
Boutilier, C.: Knowledge Representation for Stochastic Decision Processes. In: Veloso,
M.M., Wooldridge, M.J. (eds.) Artificial Intelligence Today. LNCS (LNAI), vol. 1600,
pp. 111–152. Springer, Heidelberg (1999)
Boutilier, C., Dean, T., Hanks, S.: Decision theoretic planning: Structural assumptions and
computational leverage. Journal of Artificial Intelligence Research 11, 1–94 (1999)
Brafman, R.I., Tennenholtz, M.: R-MAX - a general polynomial time algorithm for near-
optimal reinforcement learning. Journal of Machine Learning Research (JMLR) 3,
213–231 (2002)
Dean, T., Kaelbling, L.P., Kirman, J., Nicholson, A.: Planning under time constraints in
stochastic domains. Artificial Intelligence 76, 35–74 (1995)
Dixon, K.R., Malak, M.J., Khosla, P.K.: Incorporating prior knowledge and previously
learned information into reinforcement learning agents. Tech. rep., Institute for Complex
Engineered Systems, Carnegie Mellon University (2000)
Dorigo, M., Colombetti, M.: Robot Shaping: An Experiment in Behavior Engineering. The
MIT Press, Cambridge (1997)
Drescher, G.: Made-Up Minds: A Constructivist Approach to Artificial Intelligence. The MIT
Press, Cambridge (1991)
Ferguson, D., Stentz, A.: Focussed dynamic programming: Extensive comparative results.
Tech. Rep. CMU-RI-TR-04-13, Robotics Institute, Carnegie Mellon University, Pitts-
burgh, Pennsylvania (2004)
Främling, K.: Bi-memory model for guiding exploration by pre-existing knowledge. In:
Driessens, K., Fern, A., van Otterlo, M. (eds.) Proceedings of the ICML-2005 Workshop
on Rich Representations for Reinforcement Learning, pp. 21–26 (2005)
Großmann, A.: Adaptive state-space quantisation and multi-task reinforcement learning using
constructive neural networks. In: From Animals to Animats: Proceedings of The Interna-
tional Conference on Simulation of Adaptive Behavior (SAB), pp. 160–169 (2000)
Hansen, E.A., Zilberstein, S.: LAO*: A heuristic search algorithm that finds solutions with
loops. Artificial Intelligence 129, 35–62 (2001)
Howard, R.A.: Dynamic Programming and Markov Processes. The MIT Press, Cambridge
(1960)
Kaelbling, L.P.: Learning in Embedded Systems. The MIT Press, Cambridge (1993)
Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: A survey. Journal of
Artificial Intelligence Research 4, 237–285 (1996)
Kearns, M., Singh, S.: Near-optimal reinforcement learning in polynomial time. In: Proceed-
ings of the International Conference on Machine Learning (ICML) (1998)
Koenig, S., Liu, Y.: The interaction of representations and planning objectives for decision-
theoretic planning. Journal of Experimental and Theoretical Artificial Intelligence 14(4),
303–326 (2002)
Konda, V., Tsitsiklis, J.: Actor-critic algorithms. SIAM Journal on Control and Optimiza-
tion 42(4), 1143–1166 (2003)
Konidaris, G.: A framework for transfer in reinforcement learning. In: ICML-2006 Workshop
on Structural Knowledge Transfer for Machine Learning (2006)
Kushmerick, N., Hanks, S., Weld, D.S.: An algorithm for probabilistic planning. Artificial
Intelligence 76(1-2), 239–286 (1995)
Littman, M.L., Dean, T., Kaelbling, L.P.: On the complexity of solving Markov decision
problems. In: Proceedings of the National Conference on Artificial Intelligence (AAAI),
pp. 394–402 (1995)
Mahadevan, S.: Average reward reinforcement learning: Foundations, algorithms, and empir-
ical results. Machine Learning 22, 159–195 (1996)
Maloof, M.A.: Incremental rule learning with partial instance memory for changing concepts.
In: Proceedings of the International Joint Conference on Neural Networks, pp. 2764–2769
(2003)
Mataric, M.J.: Reward functions for accelerated learning. In: Proceedings of the International
Conference on Machine Learning (ICML), pp. 181–189 (1994)
Matthews, W.H.: Mazes and Labyrinths: A General Account of their History and Develop-
ments. Longmans, Green and Co., London (1922); Mazes & Labyrinths: Their History &
Development. Dover Publications, New York (reprinted in 1970)
McMahan, H.B., Likhachev, M., Gordon, G.J.: Bounded real-time dynamic programming:
RTDP with monotone upper bounds and performance guarantees. In: Proceedings of the
International Conference on Machine Learning (ICML), pp. 569–576 (2005)
Moore, A.W., Atkeson, C.G.: Prioritized sweeping: Reinforcement learning with less data
and less time. Machine Learning 13(1), 103–130 (1993)
Ng, A.Y., Harada, D., Russell, S.J.: Policy invariance under reward transformations: Theory
and application to reward shaping. In: Proceedings of the International Conference on
Machine Learning (ICML), pp. 278–287 (1999)
Peng, J., Williams, R.J.: Incremental multi-step Q-learning. Machine Learning 22, 283–290
(1996)
Puterman, M.L.: Markov Decision Processes—Discrete Stochastic Dynamic Programming.
John Wiley & Sons, Inc., New York (1994)
Puterman, M.L., Shin, M.C.: Modified policy iteration algorithms for discounted Markov
decision processes. Management Science 24, 1127–1137 (1978)
Ratitch, B.: On characteristics of Markov decision processes and reinforcement learning in
large domains. PhD thesis, The School of Computer Science, McGill University, Montreal
(2005)
Reynolds, S.I.: Reinforcement learning with exploration. PhD thesis, The School of
Computer Science, The University of Birmingham, UK (2002)
Rummery, G.A.: Problem solving with reinforcement learning. PhD thesis, Cambridge
University, Engineering Department, Cambridge, England (1995)
Rummery, G.A., Niranjan, M.: On-line Q-Learning using connectionist systems. Tech. Rep.
CUED/F-INFENG/TR 166, Cambridge University, Engineering Department (1994)
Russell, S.J., Norvig, P.: Artificial Intelligence: a Modern Approach, 2nd edn. Prentice Hall,
New Jersey (2003)
Schaeffer, J., Plaat, A.: Kasparov versus deep blue: The re-match. International Computer
Chess Association Journal 20(2), 95–101 (1997)
Schwartz, A.: A reinforcement learning method for maximizing undiscounted rewards. In:
Proceedings of the International Conference on Machine Learning (ICML), pp. 298–305
(1993)
Singh, S., Jaakkola, T., Littman, M., Szepesvari, C.: Convergence results for single-step
on-policy reinforcement-learning algorithms. Machine Learning 38(3), 287–308 (2000)
Sutton, R.S.: Learning to predict by the methods of temporal differences. Machine Learn-
ing 3, 9–44 (1988)
Sutton, R.S.: Integrated architectures for learning, planning, and reacting based on approx-
imating dynamic programming. In: Proceedings of the International Conference on Ma-
chine Learning (ICML), pp. 216–224 (1990)
Sutton, R.S.: DYNA, an integrated architecture for learning, planning and reacting. In:
Working Notes of the AAAI Spring Symposium on Integrated Intelligent Architectures,
pp. 151–155 (1991a)
Sutton, R.S.: Reinforcement learning architectures for animats. In: From Animals to Animats:
Proceedings of The International Conference on Simulation of Adaptive Behavior (SAB),
pp. 288–296 (1991b)
42 M. van Otterlo and M. Wiering
2.1 Introduction
constituents: all observed transitions are stored and updates occur synchronously on
the whole batch of transitions (‘fitting’). In particular, this allows for the definition
of ‘growing batch’ methods that are allowed to extend the set of sample experience
in order to incrementally improve their solution. From the interaction perspective,
the growing batch approach minimizes the difference between batch methods and
pure online learning methods.
The benefits that come with the batch idea—namely, stability and data-efficiency
of the learning process—account for the large interest in batch algorithms. Whereas
basic algorithms like Q-learning usually need many interactions until convergence
to good policies, thus often rendering a direct application to real-world problems impossible, methods including ideas from batch reinforcement learning usually converge
in a fraction of the time. A number of successful examples of applying ideas orig-
inating from batch RL to learning in the interaction with real-world systems have
recently been published (see sections 2.6.2 and 2.6.5).
In this chapter, we will first define the batch reinforcement learning problem and
its variants, which form the problem space treated by batch RL methods. We will
then give a brief historical recap of the development of the central ideas that, in
retrospect, built the foundation of all modern batch RL algorithms. On the basis of
the problem definition and the introduced ideas, we will present the most impor-
tant algorithms in batch RL. We will discuss their theoretical properties as well as
some variations that have a high relevance for practical applications. This includes a
treatment of Neural Fitted Q Iteration (NFQ) and some of its applications, as it has
proven a powerful tool for learning on real systems. Finally, we will briefly discuss on-going research on applying batch methods to visual learning of control policies and to distributed scheduling problems.
Batch reinforcement learning was historically defined as the class of algorithms de-
veloped for solving a particular learning problem—namely, the batch reinforcement
learning problem.
Instead of interacting with the environment—observing a state s, trying an action a, and adapting its policy according to the resulting next state s′ and reward r—the learner only receives a set F = {(s_t, a_t, r_{t+1}, s_{t+1}) | t = 1, . . . , p} of p transitions (s, a, r, s′) sampled from the environment1.
Fig. 2.1 The three distinct phases of the batch reinforcement learning process: 1: Collect-
ing transitions with an arbitrary sampling strategy. 2: Application of (batch) reinforcement
learning algorithms in order to learn the best possible policy from the set of transitions. 3:
Application of the learned policy. Exploration is not part of the batch learning task. During
the application phase, which is likewise not part of the learning task, the policy stays fixed and is not improved further.
In the most general case of this batch reinforcement learning problem the learner
cannot make any assumptions on the sampling procedure of the transitions. They
may be sampled by an arbitrary—even purely random—policy; they are not nec-
essarily sampled uniformly from the state-action space S × A; they need not even
be sampled along connected trajectories. Using only this information, the learner
has to come up with a policy that will then be used by the agent to interact with
the environment. During this application phase the policy is fixed and not further
improved as new observations come in. Since the learner itself is not allowed to
interact with the environment, and the given set of transitions is usually finite, the
learner cannot be expected to always come up with an optimal policy. The objec-
tive has therefore been changed from learning an optimal policy—as in the general
reinforcement learning case—to deriving the best possible policy from the given
data.
The distinct separation of the whole procedure into three phases—exploring the
environment and collecting state transitions and rewards, learning a policy, and ap-
plication of the learned policy—their sequential nature, and the data passed at the
interfaces is further clarified in figure 2.1. Obviously, handling the exploration–exploitation dilemma is not a concern for algorithms solving such a pure batch learning problem, as exploration is not part of the learning task at all.
1 The methods presented in this chapter all assume a Markovian state representation and
the transitions F to be sampled from a discrete-time Markov decision process (MDP, see
chapter 1). For a treatment of only partially observable decision processes see chapter 12.
Although this batch reinforcement learning problem was historically the starting point for the development of batch reinforcement learning algorithms, modern batch RL algorithms are seldom applied in this ‘pure’ batch setting. In practice, exploration has an important impact on the quality of the policies that can be learned.
Obviously, the distribution of transitions in the provided batch must resemble the
‘true’ transition probabilities of the system in order to allow the derivation of good
policies. The easiest way to achieve this is to sample the training examples from the
system itself, by simply interacting with it. But when sampling from the real system,
another aspect becomes important: the covering of the state space by the transitions
used for learning. If ‘important’ regions—e.g. states close to the goal state—are not
covered by any samples, then it is obviously not possible to learn a good policy from
the data, since important information is missing. This is a real problem because in
practice, a completely ‘uninformed’ policy—e.g. a purely random policy—is often
not able to achieve an adequate covering of the state space—especially in the case
of attractive starting states and hard-to-reach desirable states. It is often necessary to
already have a rough idea of a good policy in order to be able to explore interesting
regions that are not in the direct vicinity of the starting states.
Fig. 2.2 The growing batch reinforcement learning process has the same three phases as the
‘pure’ batch learning process depicted in figure 2.1. But differing from the pure batch pro-
cess, the growing batch learning process alternates several times between the exploration
and the learning phase, thus incrementally ‘growing’ the batch of stored transitions using
intermediate policies.
This is the main reason why a third variant of the reinforcement learning problem
became popular in practice, positioned somewhere between the pure online problem and the pure batch problem. Since the main idea of this third type of learning
problem is to alternate between phases of exploration, where a set of training ex-
amples is grown by interacting with the system, and phases of learning, where the
whole batch of observations is used (see fig. 2.2), we will refer to it as the ‘grow-
ing batch’ learning problem. In the literature, this growing batch approach can be
found in several different guises; the number of alternations between episodes of
exploration and episodes of learning can range from being as close to the pure batch approach as using only two iterations (Riedmiller et al, 2008) to recal-
culating the policy after every few interactions—e.g. after finishing one episode in
a shortest-path problem (Kalyanakrishnan and Stone, 2007; Lange and Riedmiller,
2010a). In practice, the growing batch approach is the setting of choice when
applying batch reinforcement learning algorithms to real systems. Since from the
interaction perspective the growing batch approach is very similar to the ‘pure’ online approach—the agent improves its policy while interacting with the system—the interaction perspective, with its distinction between ‘online’ and ‘offline’, is no longer very useful for identifying batch RL algorithms. When talking about ‘batch’ RL algorithms today, it is more important to look at the algorithms themselves and to search for the typical properties of their update rules.
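To make the growing batch scheme concrete, the following minimal sketch (ours, not taken from any of the cited systems) alternates between an exploration phase that grows the stored batch and a learning phase that re-fits the Q-function on all transitions collected so far. The helpers env, fit_q and greedy_action are assumed interfaces: env follows a reset()/step(a) protocol, fit_q is any batch RL routine such as FQI or NFQ, and greedy_action exploits the current Q-function.

import random

def growing_batch_rl(env, actions, fit_q, greedy_action,
                     n_rounds=10, episodes_per_round=5, epsilon=0.1):
    # F: the growing set of transitions (s, a, r, s_next, done)
    batch = []
    q = None                      # no Q-function before the first learning phase
    for _ in range(n_rounds):
        # 1. exploration phase: grow the batch with the current (epsilon-greedy) policy
        for _ in range(episodes_per_round):
            s, done = env.reset(), False
            while not done:
                if q is None or random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = greedy_action(q, s)
                s_next, r, done = env.step(a)
                batch.append((s, a, r, s_next, done))
                s = s_next
        # 2. learning phase: a batch update on all transitions collected so far
        q = fit_q(batch)
    # 3. application phase: the final Q-function (and its greedy policy) stays fixed
    return q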
Model-free online learning methods like Q-learning are appealing from a conceptual
point of view and have been very successful when applied to problems with small,
discrete state spaces. But when it comes to applying them to more realistic systems
with larger and, possibly, continuous state spaces, these algorithms come up against
limiting factors. In this respect, three independent problems can be identified:
1. the ‘exploration overhead’, causing slow learning in practice
2. inefficiencies due to the stochastic approximation
3. stability issues when using function approximation
A common factor in modern batch RL algorithms is that these algorithms typically
address all three issues and come up with specific solutions to each of them. In the
following, we will discuss these problems in more detail and present the proposed
solutions (in historical order) that now form the defining ideas behind modern batch
reinforcement learning.
(see chapter 1). In pure online Q-learning the agent alternates between learning and
exploring with practically every single time step: in state s the agent selects and
executes an action a, and then, when observing the subsequent state s′ and reward r, it immediately updates the value function (and thus the corresponding greedy policy) according to the familiar Q-learning update rule

Q(s,a) ← Q(s,a) + α [ r + γ max_{a′∈A} Q(s′,a′) − Q(s,a) ] .   (2.1)
In online RL it is common to use ‘asynchronous’ updates in the sense that the value
function after each observation is immediately updated locally for that particular
state, leaving all other states untouched. In the discrete case, this means updating
one single Q-value for a state-action pair (s,a) in Q-learning—thereby immediately
‘over-writing’ the value of the starting state of the transition. Subsequent updates
would use this updated value for their own update. This idea was also used with
function approximation, first performing a DP update like, for example,

q̄_{s,a} := r + γ max_{a′∈A} f(s′, a′) ,   (2.2)

recalculating the approximated value q̄_{s,a} of the present state-action pair (s,a) by,
e.g., adding immediate reward and approximated value of the subsequent state, and
then immediately ‘storing’ the new value in the function approximator in the sense
of moving the value of the approximation slightly towards the value of the new
estimate q̄_{s,a} for the state-action pair (s,a):

f(s,a) := (1 − α) f(s,a) + α q̄_{s,a} .   (2.3)
Please note that (2.2) and (2.3) are just re-arranged forms of equation (2.1), using
an arbitrary function approximation scheme for calculating an approximation f of
the ‘true’ Q-value function Q.
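As a small illustration of the scheme in (2.2) and (2.3)—not code from the chapter—the following sketch performs one such local, ‘asynchronous’ update for a linear approximator f(s,a) = wᵀφ(s,a); the feature map phi is an assumed input.

import numpy as np

def online_q_update(w, phi, transition, actions, gamma=0.99, alpha=0.1):
    # (2.2): DP target for the visited state-action pair (s, a)
    s, a, r, s_next = transition
    q_bar = r + gamma * max(np.dot(w, phi(s_next, b)) for b in actions)
    # (2.3): move the approximation slightly towards the new estimate q_bar,
    # touching only the local value of (s, a) and leaving everything else to the
    # generalization behavior of the approximator
    features = phi(s, a)
    return w + alpha * (q_bar - np.dot(w, features)) * features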
Baird (1995), Gordon (1996) and others have shown examples where particular
combinations of Q-learning and similar updates with function approximators behave
unstably or even provably diverge. Stable behavior can be proven only for par-
ticular instances of combinations of approximation schemes and update rules, or
under particular circumstances and assumptions on the system and reward structure
(Schoknecht and Merke, 2003). In practice, it required extensive experience on the part of the engineer to get a particular learning algorithm to work on a given system. The observed stability issues are related to the interdependency of errors made
in function approximation and deviations of the estimated value function from the
optimal value function. Whereas the DP update (2.2) tries to gradually decrease the
difference between Q(s,a) and the optimal Q-function Q∗ (s,a), storing the updated
value in the function approximator in step (2.3) might (re-)introduce an even larger
error. Moreover, this approximation error influences all subsequent DP updates and
may work against the contraction or even prevent it. The problem becomes even
worse when global function approximators—like, for example, multi-layer percep-
trons (Rumelhart et al, 1986; Werbos, 1974)—are used; improving a single Q-value
of a state-action pair might impair all other approximations throughout the entire
state space.
In this situation Gordon (1995a) came up with the compelling idea of slightly
modifying the update scheme in order to separate the dynamic programming step
from the function approximation step. The idea is to first apply a DP update to all
members of a set of so-called ‘supports’ (points distributed throughout the state
space) calculate new ‘target values’ for all supports (like in (2.2)), and then use
supervised learning to train (‘fit’) a function approximator on all these new target
values, thereby replacing the local updates of (2.3) (Gordon, 1995a). In this sense,
the estimated Q-function is updated ‘synchronously’, with updates occuring at all
supports at the same time. Although Gordon introduced this fitting idea within the
setting of model-based value iteration, it became the foundation, and perhaps even
the starting point, of all modern batch algorithms.
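A minimal sketch of this fitting scheme for a finite set of support states, assuming a known model (per-action transition matrices P[a] and reward vectors R[a]) and an averager given as a row-stochastic weighting matrix K; it only illustrates the separation of DP step and fitting step and is not Gordon's original formulation.

import numpy as np

def fitted_value_iteration(P, R, K, gamma=0.95, n_iterations=100):
    # P[a]: (n x n) transition matrix, R[a]: (n,) expected rewards for action a,
    # K: (n x n) averager weights (rows are non-negative and sum to one)
    n = K.shape[0]
    v = np.zeros(n)
    for _ in range(n_iterations):
        # DP step: compute new target values at all supports
        targets = np.max([R[a] + gamma * P[a] @ v for a in range(len(P))], axis=0)
        # fitting step: 'train' the averager on the targets (here a single matrix product)
        v = K @ targets
    return v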
In their work, Ormoneit and Sen proposed a ‘unified’ kernel-based algorithm that
could actually be seen as a general framework upon which several later algo-
rithms are based. Their ‘kernel-based approximate dynamic programming’ (KADP)
brought together the ideas of experience replay (storing and re-using experience),
fitting (separation of DP-operator and approximation), and kernel-based self-appro-
ximation (sample-based).
Instead of the exact Bellman equation

V = H V ,

KADP solves an approximate version that is expressed in

V̂ = Ĥ V̂ .
It not only uses an approximation V̂ of the ‘true’ state-value function V —as Gor-
don’s fitted value iteration did—but also uses an approximate version Ĥ of the exact
DP-operator H itself.
The KADP algorithm for solving this equation works as follows: starting from
an arbitrary initial approximation V̂ 0 of the state-value function, each iteration i of
the KADP algorithm consists of solving the equation

V̂^{i+1} = Ĥ V̂^i ,

where the operator

Ĥ = H_max Ĥ^a_dp

has been split into an exact part H_max maximizing over actions and an approximate random operator Ĥ^a_dp approximating the ‘true’ (model-based) DP-step for individual actions from the observed transitions. The first half of this equation is calculated according to the sample-based DP update

Q̂^{i+1}_a(σ) := (Ĥ^a_dp V̂^i)(σ) = ∑_{(s,a,r,s′)∈F_a} k(s,σ) [ r + γ V̂^i(s′) ] .   (2.4)
(please note the similarity to equation (2.2)) along all transitions (s,a,r,s′) ∈ F_a, where F_a ⊂ F is the subset of F that contains only the transitions (s,a,r,s′) ∈ F
that used the particular action a. The weighting kernel is chosen in such a way that
more distant samples have a smaller influence on the resulting sum than closer (=
more similar) samples.
The second half of the equation applies the maximizing operator H_max to the approximated Q-functions Q̂^{i+1}_a:

V̂^{i+1}(σ) := (H_max Q̂^{i+1})(σ) = max_{a∈A} Q̂^{i+1}_a(σ) .   (2.5)
Please note that this algorithm uses an individual approximation Q̂^{i+1}_a : S → ℝ for each action a ∈ A in order to approximate the Q-function Q^{i+1} : S × A → ℝ. Furthermore, a little bit counter-intuitively, in a practical implementation this last equation is actually evaluated and stored for all ending states s′ of all transitions (s,a,r,s′) ∈ F—not the starting states s. This decision to store the ending states is explained by noting that on the right-hand side of equation (2.4) we only query the present estimate of the value function at the ending states of the transitions—never at their starting states (see Ormoneit and Sen (2002), section 4). This becomes clearer in figure 2.3.
Fig. 2.3 Visualization of the kernel-based approximation in KADP. For computing the Q-value Q̂(σ,a) = (Ĥ^a_dp V̂)(σ) of (σ,a) at an arbitrary state σ ∈ S, KADP uses the starting states s of nearby transitions (s,a,r,s′) only for calculating the weighting factors k(s,σ); the calculated value, however, depends on the state values V̂(s′) of the ending states s′ (in the depicted example s = s_t and s′ = s_{t+1}).
The weighting kernel is required to satisfy the normalization and non-negativity conditions

∑_{(s,a,r,s′)∈F_a} k(s,σ) = 1   ∀σ ∈ S ,
k(s,σ) ≥ 0   ∀σ ∈ S, ∀(s,a,r,s′) ∈ F_a .
The decision to store and iterate over state-values instead of state-action values re-
sults in the necessity of applying another DP-step in order to derive a greedy policy
from the result of the KADP algorithm:
π(σ) = arg max_{a∈A} ∑_{(s,a,r,s′)∈F_a} k(s,σ) [ r + γ V̂(s′) ] .

This might appear to be a small drawback from an efficiency point of view, as the computations are more complex than in Q-learning and all transitions need to be remembered during the application phase, but it is not a theoretical problem. Applying the
DP-operator to the fixed point of V̂ = ĤV̂ does not change anything but results in
the same unique fixed point.
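The updates (2.4) and (2.5) can be written compactly as follows; this is our sketch (with a Gaussian weighting kernel and Euclidean distances as assumptions), not Ormoneit and Sen's implementation. As in the text, state values are stored at the ending states s′ of the transitions.

import numpy as np

def kadp(transitions, actions, gamma=0.95, bandwidth=0.5, n_iterations=50):
    # transitions: list of (s, a, r, s_next) with numpy-array states;
    # v[j] holds the current estimate of V at the ending state of transition j
    v = np.zeros(len(transitions))

    def q_hat(sigma, a, v):
        # (2.4): kernel-weighted, sample-based DP step over the subset F_a
        idx = [j for j, t in enumerate(transitions) if t[1] == a]
        if not idx:
            return -np.inf
        d = np.linalg.norm(np.array([transitions[j][0] for j in idx]) - sigma, axis=1)
        w = np.exp(-0.5 * (d / bandwidth) ** 2)
        w = w / w.sum()                   # non-negative weights summing to one
        return sum(w_k * (transitions[j][2] + gamma * v[j])
                   for w_k, j in zip(w, idx))

    ending_states = [t[3] for t in transitions]
    for _ in range(n_iterations):
        # (2.5): new value at every stored ending state = maximum over actions of Q-hat
        v = np.array([max(q_hat(sigma, a, v) for a in actions)
                      for sigma in ending_states])
    return v, q_hat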
Perhaps the most popular algorithm in batch RL is Damien Ernst’s ‘Fitted Q Itera-
tion’ (FQI, Ernst et al (2005a)). It can be seen as the ‘Q-Learning of batch RL’, as it
is actually a straightforward transfer of the basic Q-learning update rule to the batch case. Given a fixed set F = {(s_t, a_t, r_{t+1}, s_{t+1}) | t = 1, . . . , p} of p transitions (s,a,r,s′)
and an initial Q-value q̄0 (Ernst et al (2005a) used q̄0 = 0) the algorithm starts by
initializing an initial approximation Q̂0 of the Q-function Q0 with Q̂0 (s,a) = q̄0 for
all (s,a) ∈ S × A. It then iterates over the following two steps:
1. Start with an empty set P^{i+1} of patterns (s,a; q̄^{i+1}_{s,a}). Then, for each transition (s,a,r,s′) ∈ F calculate a new target Q-value q̄^{i+1}_{s,a} according to

   q̄^{i+1}_{s,a} = r + γ max_{a′∈A} Q̂^i(s′, a′)   (2.6)

   and add the pattern (s,a; q̄^{i+1}_{s,a}) to P^{i+1}.
2. Use supervised learning to train a function approximator on the pattern set P^{i+1}. The resulting function Q̂^{i+1} then is an approximation of the Q-function Q^{i+1} after i + 1 steps of dynamic programming.
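A compact sketch of these two steps; scikit-learn's ExtraTreesRegressor is used here as a stand-in in the spirit of Ernst's randomized trees, but any regressor with a fit/predict interface would do (terminal-state handling is omitted for brevity).

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(transitions, actions, n_iterations=50, gamma=0.95):
    # transitions: list of (s, a, r, s_next) with array-valued states and scalar actions
    X = np.array([np.append(s, a) for (s, a, _, _) in transitions])
    rewards = np.array([r for (_, _, r, _) in transitions])
    q = None
    for _ in range(n_iterations):
        if q is None:
            targets = rewards            # Q-hat_0 = 0, so the first targets are the rewards
        else:
            # step 1: DP update (2.6) -- a new target value for every stored transition
            next_values = np.array([
                max(q.predict([np.append(s_next, b)])[0] for b in actions)
                for (_, _, _, s_next) in transitions])
            targets = rewards + gamma * next_values
        # step 2: supervised fitting of a fresh approximator on the new pattern set
        q = ExtraTreesRegressor(n_estimators=50).fit(X, targets)
    return q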
Originally, Ernst proposed randomized trees for approximating the value function.
After fixing their structure, these trees can also be represented as kernel-based aver-
agers, thereby reducing step 2 to
Q̂^{i+1}_a(σ) = ∑_{(s,a;q̄^{i+1}_{s,a})∈P^{i+1}_a} k(s,σ) q̄^{i+1}_{s,a} ,   (2.7)
with the weights k(·, σ ) determined by the structure of the tree. This variant of FQI
constructs an individual approximation Q̂^{i+1}_a for each discrete action a ∈ A which, together, form the approximation Q̂^{i+1}(s,a) = Q̂^{i+1}_a(s) (Ernst et al, 2005a, section 3.4). Besides this variant of FQI, Ernst also proposed a variant with continuous actions. We refer the interested reader to Ernst et al (2005a) for a detailed description of this.
From a theoretical standpoint, Fitted Q Iteration is nevertheless based on Or-
moneit and Sen’s theoretical framework. The similarity between Fitted Q Iteration
and KADP becomes obvious when rearranging equations (2.6) and (2.7):
Q̂^{i+1}(σ,a) = Q̂^{i+1}_a(σ) = ∑_{(s,a;q̄^{i+1}_{s,a})∈P^{i+1}_a} k(s,σ) q̄^{i+1}_{s,a}   (2.8)
                            = ∑_{(s,a,r,s′)∈F_a} k(s,σ) [ r + γ max_{a′∈A} Q̂^i_{a′}(s′) ] .   (2.9)
Equation (2.8) is the original averaging step from equation (2.7) in FQI for discrete
actions. Inserting FQI’s DP-step (2.6) immediately yields (2.9). This result
(2.9) is practically identical to the update used in KADP, as can be seen by inserting
(2.5) into (2.4):
Q̂^{i+1}_a(σ) = ∑_{(s,a,r,s′)∈F_a} k(s,σ) [ r + γ V̂^i(s′) ]
             = ∑_{(s,a,r,s′)∈F_a} k(s,σ) [ r + γ max_{a′∈A} Q̂^i_{a′}(s′) ] .
Least-squares policy iteration (LSPI, Lagoudakis and Parr (2003)) is another early
example of a batch mode reinforcement learning algorithm. In contrast to the other
algorithms reviewed in this section, LSPI explicitly embeds the task of solving
control problems into the framework of policy iteration (Sutton and Barto, 1998),
thus alternating between policy evaluation and policy improvement steps. How-
ever, LSPI never stores a policy explicitly. Instead, it works solely on the basis of
a state-action value function Q from which a greedy policy is to be derived via
π (s) = argmaxa∈A Q(s,a). For the purpose of representing state-action value func-
tions, LSPI employs a parametric linear approximation architecture with a fixed set
of k pre-defined basis functions φ_j : S × A → ℝ and a weight vector w = (w_1, . . . , w_k)^T. Therefore, any approximated state-action value function Q̂ within the scope of LSPI takes the form

Q̂(s,a; w) = ∑_{j=1}^{k} φ_j(s,a) w_j = φ(s,a)^T w .
Its policy evaluation step employs a least-squares temporal difference learning al-
gorithm for the state-action value function (LSQ, Lagoudakis and Parr (2001), later
called LSTDQ, Lagoudakis and Parr (2003)). This algorithm takes as input the cur-
rent policy πm —as pointed out above, represented by a set of weights that deter-
mine a value function Q̂ from which πm is to be derived greedily—as well as a finite
set F of transitions (s,a,r,s′). From these inputs, LSTDQ analytically derives the state-action value function Q̂^{π_m} for the given policy under the state distribution determined by the transition set. Clearly, given the above-mentioned linear architecture, the value function Q̂^{π_m} returned by LSTDQ is, again, fully described by a weight vector w^{π_m}.
Generally, the sought value function is a fixed point of the Ĥ^{π_m} operator, i.e. Ĥ^{π_m} Q^{π_m} = Q^{π_m}. Thus, a good approximation Q̂^{π_m} should comply with Ĥ^{π_m} Q̂^{π_m} ≈ Q̂^{π_m} = Φ w^{π_m}. Practically speaking, LSTDQ aims at finding a vector w^{π_m} such that the approximation of the result of applying the Ĥ^{π_m} operator to Q̂^{π_m} is as near as possible to the true result (in an L2-norm minimizing manner). For this, LSTDQ employs an orthogonal projection and sets

Φ (Φ^T Φ)^{−1} Φ^T ( Ĥ^{π_m} Q̂^{π_m} ) = Q̂^{π_m} .   (2.11)

With the compactly written version of equation (2.10), Ĥ^{π_m} Q^{π_m} = R + γ P Π_{π_m} Q^{π_m}, equation (2.11) can be rearranged to

Φ^T ( Φ − γ P Π_{π_m} Φ ) w^{π_m} = Φ^T R .
Importantly, in this equation LSTDQ approximates the model of the system on the
basis of the given sample set F , i.e., P is a stochastic matrix of size |S||A| × |S| that
contains transition probabilities, as observed within the transition set according to
P((s,a), s′) = ( ∑_{(s,a,·,s′)∈F} 1 ) / ( ∑_{(s,a,·,·)∈F} 1 ) ≈ Pr(s,a,s′) .
Since LSPI never stores policies explicitly, but rather implicitly by the set of ba-
sis functions and corresponding weights—and thus, in fact, by a state-action value
function—the policy improvement step of LSPI merely consists of overwriting the
old weight vector w with the current weight vector found by a call to LSTDQ.
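A minimal sketch of LSTDQ and the surrounding LSPI loop for a linear architecture Q̂(s,a;w) = φ(s,a)ᵀw; the feature function phi and the small regularization term are assumptions of this illustration, not part of the original algorithm.

import numpy as np

def lstdq(transitions, phi, policy, k, gamma=0.95):
    # builds the sample-based linear system A w = b corresponding to
    # Phi^T (Phi - gamma P Pi_pi Phi) w = Phi^T R and solves it for w
    A = np.zeros((k, k))
    b = np.zeros(k)
    for (s, a, r, s_next) in transitions:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))   # action the current policy would take in s'
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A + 1e-6 * np.eye(k), b)   # tiny ridge term for invertibility

def lspi(transitions, phi, actions, k, n_iterations=20, gamma=0.95):
    w = np.zeros(k)
    for _ in range(n_iterations):
        # the policy is never stored explicitly: it is the greedy policy implied by w
        policy = lambda s, w=w: max(actions, key=lambda a: phi(s, a) @ w)
        # policy evaluation (LSTDQ); policy improvement is simply overwriting w
        w = lstdq(transitions, phi, policy, k, gamma)
    return w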
Valuable insight can be obtained by comparing a single iteration of the Fitted
Q Iteration algorithm with a single iteration (one policy evaluation and improve-
ment step) of LSPI. The main difference between LSPI and value function-centered
FQI algorithms is that, in a single iteration, LSPI determines an approximation of
the state-action value function Qπm for the current policy and batch of experience.
LSPI can do this analytically—i.e., without iterating the Ĥπm operator—because of
the properties of its linear function approximation architecture. By contrast, FQI
algorithms rely on a set of target values for the supervised fitting of a function ap-
proximator that are based on a single dynamic programming update step—i.e., on a
single application of the Ĥ operator. Consequently, if we interpret Fitted Q Iteration
algorithms from a policy iteration perspective, then this class of algorithms imple-
ments a batch variant of optimistic policy iteration, whereas LSPI realizes standard
(non-optimistic) policy iteration.
Whereas the algorithms described here could be seen as the foundation of mod-
ern batch reinforcement learning, several other algorithms have been referred to as
‘batch’ or ‘semi-batch’ algorithms in the past. Furthermore, the borders between
‘online’, ‘offline’, ‘semi-batch’, and ‘batch’ cannot be drawn distinctly; there are
at least two different perspectives to look at the problem. Figure 2.4 proposes an
ordering of online, semi-batch, growing batch, and batch reinforcement learning
algorithms. On one side of the tree we have pure online algorithms like classic
Fig. 2.4 Classification of batch vs. non-batch algorithms. With the interaction perspective and the data-usage perspective, there are at least two different ways to define the category borders.
Q-learning. On the opposite side of the tree we have pure batch algorithms that
work completely ‘offline’ on a fixed set of transitions. In between these extreme
positions are a number of other algorithms that, depending on the perspective, could
be classified as either online or (semi-)batch algorithms. For example, the growing
batch approach could be classified as an online method—it interacts with the sys-
tem like an online method and incrementally improves its policy as new experience
becomes available—as well as, from a data usage perspective, being seen as a batch-
algorithm, since it stores all experience and uses ‘batch methods’ to learn from these
observations. Although FQI—like KADP and LSPI—has been proposed by Ernst
as a pure batch algorithm working on a fixed set of samples, it can easily be adapted
to the growing batch setting, as, for example, shown by Kalyanakrishnan and Stone
(2007). This holds true for every ‘pure’ batch approach. On the other hand, NFQ
(see section 2.6.1), which has been introduced in a growing-batch setting, can also
be adapted to the pure batch setting in a straight-forward manner. Another class is
formed by the ‘semi-batch’ algorithms that were introduced in the 90’s (Singh et al,
1995) primarily for formal reasons. These algorithms make an aggregate update for
several transitions—so it is not pure online learning with immediate updates. What they do not do, however, is store and reuse the experience after making this update—so it is not a full batch approach either.
The compelling feature of the batch RL approach is that it guarantees stable behavior for Q-learning-like update rules and a whole class of function approximators (averagers) for a broad range of systems, independent of a particular modeling or
specific reward function. There are two aspects to discuss: a) stability, in the sense
of guaranteed convergence to a solution and b) quality, in the sense of the distance
of this solution to the true optimal value function.
Gordon (1995a,b) introduced the important notion of the ‘averager’ and proved
convergence of his model-based fitted value iteration for this class of function ap-
proximation schemes by first showing their non-expansive properties (in maximum
norm) and then relying on the classical contraction argument (Bertsekas and Tsit-
siklis, 1996) for MDPs with discounted rewards. For non-discounted problems he
identified a more restrictive class of compatible function approximators and proved
convergence for the ‘self-weighted’ averagers (Gordon, 1995b, section 4). Ormoneit
and Sen extended these proofs to the model-free case; their kernel-based approxi-
mators are equivalent to the ‘averagers’ introduced by Gordon (Ormoneit and Sen,
2002). Approximated values must be a weighted average of the samples, where
all weights are positive and add up to one (see section 2.4.1). These requirements
grant the non-expansive property in maximum norm. Ormoneit and Sen showed
that their random dynamic programming operator using kernel-based approximation
contracts the approximated function in maximum norm for any given set of samples
and, thus, converges to a unique fixed point in the space of possible approximated
functions. The proof has been carried out explicitly for MDPs with discounted re-
wards (Ormoneit and Sen, 2002) and average-cost problems (Ormoneit and Glynn,
2001, 2002).
Another important aspect is the quality of the solution found by the algorithms.
Gordon gave an absolute upper bound on the distance of the fixed point of his
fitted value iteration to the optimal value function (Gordon, 1995b). This bound
depends mainly on the expressiveness of the function approximator and its ‘com-
patibility’ with the optimal value function to approximate. Apart from the function
approximator, in model-free batch reinforcement learning the random sampling of
the transitions obviously is another aspect that influences the quality of the solution.
Therefore, for KADP, there is no absolute upper bound on the distance of the approximate solution to the optimal value function for a particular function approximator. Ormoneit and Sen
instead proved the stochastic consistency of their algorithm—actually, this could
be seen as an even stronger statement. Continuously increasing the number of samples guarantees, in the limit, stochastic convergence to the optimal value function under
certain assumptions (Ormoneit and Sen, 2002). These assumptions (Ormoneit and
Sen, 2002, appendix A)—besides other constraints on the sampling of transitions—
include smoothness constraints on the reward function (needs to be a Lipschitz con-
tinuous function of s, a and s′) and the kernel. A particular kernel used throughout
their experiments (Ormoneit and Glynn, 2001) that fulfills these constraints is de-
rived from the ‘mother kernel’.
k_{F_a,b}(s, σ) = φ⁺( ‖s − σ‖ / b ) / ∑_{(s_i,a_i,r_i,s_i′)∈F_a} φ⁺( ‖s_i − σ‖ / b )
with φ + being a univariate Gaussian function. The parameter b controls the ‘band-
width’ of the kernel—that is its region of influence, or, simply, its ‘resolution’. Re-
lying on such a kernel, the main idea of their consistency proof is to first define an
‘admissible’ reduction rate for the parameter b in dependence of the growing num-
ber of samples and then prove the stochastic convergence of the series of approx-
imations V̂ k under this reduction rate to the optimal value function. Reducing the
bandwidth parameter b can be interpreted as increasing the resolution of the approx-
imator. When reducing b, the expected deviation of the implicitly-estimated transi-
tion model from the true transition probabilities—in the limit—vanishes to zero as
more and more samples become available. It is important to note that increasing
the resolution of the approximator is only guaranteed to improve the approximation
for smooth reward functions and will not necessarily help in approximating step
functions, for example—thus, the Lipschitz constraint on the reward function.
Besides these results, which are limited to the usage of averagers within the batch
RL algorithms, there are promising new theoretical analyses by Antos, Munos, and Szepesvári that, while presently not covering more general function approximators, may lead to helpful results for non-averagers in the future (Antos et al, 2008).
The ability to approximate functions with high accuracy and to generalize well from
few training examples makes neural networks—in particular, multi-layer percep-
trons (Rumelhart et al, 1986; Werbos, 1974)—an attractive candidate to represent
value functions. However, in the classical online reinforcement learning setting, the
current update often has an unforeseeable influence on the approximation achieved so far. In con-
trast, batch RL changes the situation dramatically: by updating the value function
simultaneously at all transitions seen so far, the effect of destroying previous efforts
can be overcome. This was the driving idea behind the proposal of Neural Fitted Q
Iteration (NFQ, Riedmiller (2005)). As a second important consequence, the simul-
taneous update at all training instances makes the application of batch supervised
learning algorithms possible. In particular, within the NFQ framework, the adaptive
supervised learning algorithm Rprop (Riedmiller and Braun, 1993) is used as the
core of the fitting step.
NFQ main() {
   input: a set F of transition samples (s,a,r,s′) (the same set as used throughout the text)
   output: approximation Q̂^N of the Q-value function
   i = 0
   init MLP() → Q̂^0
   DO {
      generate pattern set P = {(input_t ; target_t), t = 1, . . . , #F}, where, for the t-th transition (s,a,r,s′) ∈ F:
         input_t = (s, a)
         target_t = r + γ max_{a′∈A} Q̂^i(s′, a′)
      add artificial patterns(P)
      normalize target values(P)
      scale pattern values(P)
      Rprop training(P) → Q̂^{i+1}
      i ← i + 1
   } WHILE (i < N)
}
Fig. 2.6 Brainstormers MidSize league robot. The difficulty of dribbling lies in the fact that, by the rules, at most one third of the ball may be covered by the robot. Not losing the ball while turning therefore requires sophisticated control of the robot's motion.
least one training pattern. This method has the advantage that no additional knowledge about states in the target region needs to be available in advance.
• using a smooth immediate cost-function (Hafner and Riedmiller, 2011). Since
multi-layer perceptrons basically realize a smooth mapping from inputs to out-
puts, it is reasonable to also use a smooth immediate cost function. As an ex-
ample, consider the immediate cost function that gives constant positive costs
outside the target region and 0 costs inside the target region. This leads to a
minimum-time control behavior, which is favorable in many applications. However, the resulting path costs then have rather crisp jumps, which are difficult to represent with a neural network. If this immediate cost function is replaced with a smoothed version, the main characteristics of the policy induced by the crisp cost function are largely preserved, while the value function approximation becomes much smoother (see the sketch below). For more details, see Hafner and Riedmiller (2011).
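As a toy illustration of this smoothing idea (ours; all constants are arbitrary), compare a crisp minimum-time cost with a sigmoid-smoothed variant defined on the distance to the target region:

import numpy as np

def crisp_cost(dist_to_target, radius=0.05, c=0.01):
    # constant positive cost outside the target region, zero inside (minimum-time setting)
    return 0.0 if dist_to_target < radius else c

def smooth_cost(dist_to_target, radius=0.05, c=0.01, steepness=50.0):
    # sigmoid-smoothed version: close to 0 inside, close to c outside, but continuous,
    # which makes the resulting path costs much easier for a multi-layer perceptron to represent
    return c / (1.0 + np.exp(-steepness * (dist_to_target - radius)))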
The following briefly describes the learning of a neural dribble controller for a
RoboCup MidSize League robot (for more details, see also Riedmiller et al (2009)).
The autonomous robot (figure 2.6) uses a camera as its main sensor and is fitted with
an omnidirectional drive. The control interval is 33 ms. Each motor command con-
sists of three values denoting v_y^target (the target forward speed relative to the coordinate system of the robot), v_x^target (the target lateral speed) and v_θ^target (the target rotation speed).
Dribbling means being able to keep the ball in front of the robot while turning to
a given target. Because the rules of the MidSize league forbid simply grabbing the
ball and only allow one-third of the ball to be covered by a dribbling device, this
is quite a challenging task: the dribbling behavior must carefully control the robot,
such that the ball does not get away from the robot when it changes direction.
The learning problem is modelled as a stochastic shortest path problem with both
a terminal goal state and terminal failure states. Intermediate steps are punished by
constant costs of 0.01.2 NFQ is used as the core learning algorithm. The computa-
tion of the target value for the batch training set thus becomes:
q̄^{i+1}_{s,a} :=  1.0                                  if s′ ∈ S−
                 0.01                                 if s′ ∈ S+          (2.12)
                 0.01 + min_{a′∈A} Q̂^i(s′, a′)        otherwise
where S− denotes the states at which the ball is lost, and S+ denotes the states at
which the robot has the ball and heads towards the target. State information contains
speed of the robot in relative x and y direction, rotation speed, x and y ball position
relative to the robot and, finally, the heading direction relative to the given target
direction. A failure state s ∈ S− is encountered if the ball’s relative x coordinate
is larger than 50 mm or less than -50 mm, or if the relative y coordinate exceeds
100 mm. A success state is reached whenever the absolute difference between the
heading angle and the target angle is less than 5 degrees.
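The target rule (2.12) together with the thresholds above translates directly into code; the field names ball_x, ball_y (ball position relative to the robot, in mm) and heading_error (in degrees) are hypothetical, chosen only for this sketch.

def dribble_target(s_next, q_values_next, step_cost=0.01):
    # q_values_next: the values Q_i(s', a') of the current approximation for all actions a'
    if abs(s_next["ball_x"]) > 50 or s_next["ball_y"] > 100:
        return 1.0                              # failure state in S-: ball lost
    if abs(s_next["heading_error"]) < 5:
        return step_cost                        # success state in S+
    return step_cost + min(q_values_next)       # (2.12): costs are minimized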
The robot is controlled by a three-dimensional action vector denoting target
translational and rotational speeds. A total of 5 different action triples are used, U =
{(2.0, 0.0, 2.0), (2.5, 0.0, 1.5), (2.5, 1.5, 1.5), (3.0, 1.0, 1.0), (3.0, −1.0, 1.0)}, where
each triple denotes (v_x^target, v_y^target, v_θ^target).
Input to the Neural Fitted Q Iteration method is a set of transition tuples of the
form (state, action, cost, successor state) where the cost has either been ‘observed’
(external source) or is calculated ‘on-the-fly’ with the help of a known cost function
c : S × A × S → R (internal source). A common procedure used to sample these
transitions is to alternate between training the Q-function and then sampling new
transitions episode-wise by greedily exploiting the current Q-function. However,
on the real robot this means that between each data collection phase one has to wait
until the new Q-function has been trained. This can be aggravating, since putting the
ball back on the play-field requires human interaction. Therefore, a batch-sampling
method is used, which collects data over multiple trials without relearning.
2 Please note: costs = negative rewards. In this technical setting it is more natural to minimize costs, which, in principle, is equivalent to maximizing (negative) rewards.
Fig. 2.7 Comparison of hand-coded (red) and neural dribbling behavior (blue) when requested
to make a U-turn. The data was collected on the real robot. When the robot gets the ball, it
typically has an initial speed of about 1.5 to 2 m/s in forward direction. The position of the
robots are displayed every 120 ms. The U-turn performed by the neural dribbling controller
is much sharper and faster.
While the previous two sections have pointed to the advantages of combining
the data-efficiency of batch-mode RL with neural network-based function approx-
imation schemes, this section elaborates on the benefits of batch methods for
cooperative multi-agent reinforcement learning. Assuming independently learning
agents, it is obvious that those transitions experienced by one agent are strongly
affected by the decisions concurrently made by other agents. This dependency of
single transitions on external factors, i.e. on other agents’ policies, gives rise to an-
other argument for batch training: While a single transition tuple probably contains
too little information for performing a reliable update, a rather comprehensive batch
of experience may contain sufficient information to apply value function-based RL
in a multi-agent context.
The framework of decentralized Markov decision processes (DEC-(PO)MDP,
see chapter 15 or Bernstein et al (2002)) is frequently used to address environments
populated with independent agents that have access to local state information only
and thus do not know about the full, global state. The agents are independent of
one another both in terms of acting as well as learning. Finding optimal solutions to
these types of problems is, generally, intractable, which is why a meaningful goal
is to find approximate joint policies for the ensemble of agents using model-free
reinforcement learning. To this end, a local state-action value function Q_k : S_k × A_k → ℝ
is defined for each agent k that it successively computes, improves, and then uses to
choose its local actions.
In a straightforward approach, a batch RL algorithm (in the following, the focus
is put on the use of NFQ) might be run independently by each of the learning agents,
thus disregarding the possible existence of other agents and making no attempts to
enforce coordination across them. This approach can be interpreted as an ‘averaging
projection’ with Q-values of state-action pairs collected from both cooperating and non-cooperating agents. As a consequence, the agents’ local Q_k functions underesti-
mate the optimal joint Q-function. The following briefly describes a batch RL-based
approach to sidestep that problem and points to a practical application where the re-
sulting multi-agent learning procedure has been employed successfully (for more
details, see also Gabel and Riedmiller (2008b)).
For a better estimation of the Qk values, the inter-agent coordination mechanism
introduced in Lauer and Riedmiller (2000) can be used and integrated within the
framework of Fitted Q Iteration. The basic idea here is that each agent always opti-
mistically assumes that all other agents behave optimally (though they often will not,
e.g. due to exploration). Updates to the value function and policy learned are only
performed when an agent is certain that a superior joint action has been executed.
The performance of that coordination scheme quickly degrades in the presence of
noise, which is why determinism in the DEC-MDP’s state transitions must be as-
sumed during the phase of collecting transitions. However, this assumption can be
dropped when applying the policies learned.
For the multi-agent case, step 1 of FQI (cf. equation (2.6)) is modified: Each
agent k collects its own transition set F_k with local transitions (s_k, a_k, r_k, s_k′). It
then creates a reduced (so-called ‘optimistic’) training pattern set Ok such that
|Ok | ≤ |Pk |. Given a deterministic environment and the ability to reset the system to
a specific initial state during data collection, the probability that agent k enters some
s_k more than once is greater than zero. Hence, if a certain action a_k ∈ A_k has been
When calculating the new target value q̄_{φ′(s),a} for the new feature vector φ′(s) of state s, this update uses the expected reward from the subsequent state s′ as given by the old approximation Q̂_φ using the feature vector φ(s′) in the old feature space. These target values are then used to calculate a new approximation Q̂_{φ′} for the new feature space.
We have already implemented this idea in a new algorithm named ‘Deep Fitted Q
Iteration’ (DFQ) (Lange and Riedmiller, 2010a,b). DFQ uses a deep auto-encoder
neural network (Hinton and Salakhutdinov, 2006) with up to millions of weights
for unsupervised learning of low-dimensional feature spaces from high dimensional
visual inputs. Training of these neural networks is embedded in a growing batch re-
inforcement learning algorithm derived from Fitted Q Iteration, thus enabling learn-
ing of feasible feature spaces and useful control policies at the same time (see figure
2.8). By relying on kernel-based averagers for approximating the value function in
the automatically constructed feature spaces, DFQ inherits the stable learning be-
havior from the batch methods. Extending the theoretical results of Ormoneit and
Sen, the inner loop of DFQ could be shown to converge to a unique solution for any
given set of samples of any MDP with discounted rewards (Lange, 2010).
The DFQ algorithm has been successfully applied to learning visual control poli-
cies in a grid-world benchmark problem—using synthesized (Lange and Riedmiller,
2010b) as well as screen-captured images (Lange and Riedmiller, 2010a)—and to
controlling a slot-car racer only on the basis of the raw image data captured by a
top-mounted camera (Lange, 2010).
Fig. 2.8 Schematic drawing of Deep Fitted Q Iteration. Raw visual inputs from the system are
fed into a deep auto-encoder neural network that learns to extract a low-dimensional encoding
of the relevant information in its bottle-neck layer. The resulting feature vectors are then used
to learn policies with the help of batch updates.
2.7 Summary
This chapter has reviewed both the historical roots and early algorithms as well as
contemporary approaches and applications of batch-mode reinforcement learning.
Research activity in this field has grown substantially in recent years, primarily due
to the central merits of the batch approach, namely, its efficient use of collected
data as well as the stability of the learning process caused by the separation of the
dynamic programming and value function approximation steps. Besides this, var-
ious practical implementations and applications for real-world learning tasks have
contributed to the increased interest in batch RL approaches.
References
Antos, A., Munos, R., Szepesvari, C.: Fitted Q-iteration in continuous action-space MDPs.
In: Advances in Neural Information Processing Systems, vol. 20, pp. 9–16 (2008)
Baird, L.: Residual algorithms: Reinforcement learning with function approximation. In:
Proc. of the Twelfth International Conference on Machine Learning, pp. 30–37 (1995)
Bernstein, D., Givan, R., Immerman, N., Zilberstein, S.: The Complexity of Decentralized
Control of Markov Decision Processes. Mathematics of Operations Research 27(4), 819–
840 (2002)
Bertsekas, D., Tsitsiklis, J.: Neuro-dynamic programming. Athena Scientific, Belmont (1996)
Bonarini, A., Caccia, C., Lazaric, A., Restelli, M.: Batch reinforcement learning for control-
ling a mobile wheeled pendulum robot. In: IFIP AI, pp. 151–160 (2008)
Brucker, P., Knust, S.: Complex Scheduling. Springer, Berlin (2005)
Deisenroth, M.P., Rasmussen, C.E., Peters, J.: Gaussian Process Dynamic Programming.
Neurocomputing 72(7-9), 1508–1524 (2009)
Ernst, D., Geurts, P., Wehenkel, L.: Tree-Based Batch Mode Reinforcement Learning. Journal
of Machine Learning Research 6(1), 503–556 (2005a)
Ernst, D., Glavic, M., Geurts, P., Wehenkel, L.: Approximate Value Iteration in the Reinforce-
ment Learning Context. Application to Electrical Power System Control. International
Journal of Emerging Electric Power Systems 3(1) (2005b)
Ernst, D., Glavic, M., Capitanescu, F., Wehenkel, L.: Reinforcement learning versus model
predictive control: a comparison on a power system problem. IEEE Transactions on Sys-
tems, Man, and Cybernetics, Part B 39(2), 517–529 (2009)
Gabel, T., Riedmiller, M.: Adaptive Reactive Job-Shop Scheduling with Reinforcement
Learning Agents. International Journal of Information Technology and Intelligent Com-
puting 24(4) (2008a)
Gabel, T., Riedmiller, M.: Evaluation of Batch-Mode Reinforcement Learning Methods for
Solving DEC-MDPs with Changing Action Sets. In: Girgin, S., Loth, M., Munos, R.,
Preux, P., Ryabko, D. (eds.) EWRL 2008. LNCS (LNAI), vol. 5323, pp. 82–95. Springer,
Heidelberg (2008)
Gabel, T., Riedmiller, M.: Reinforcement Learning for DEC-MDPs with Changing Action
Sets and Partially Ordered Dependencies. In: Proceedings of the 7th International Joint
Conference on Autonomous Agents and Multiagent Systems (AAMAS 2008), IFAA-
MAS, Estoril, Portugal, pp. 1333–1336 (2008)
Gordon, G.J.: Stable Function Approximation in Dynamic Programming. In: Proc. of the
Twelfth International Conference on Machine Learning, pp. 261–268. Morgan Kaufmann,
Tahoe City (1995a)
Gordon, G.J.: Stable function approximation in dynamic programming. Tech. rep., CMU-CS-
95-103, CMU School of Computer Science, Pittsburgh, PA (1995b)
Gordon, G.J.: Chattering in SARSA (λ ). Tech. rep. (1996)
Guez, A., Vincent, R.D., Avoli, M., Pineau, J.: Adaptive treatment of epilepsy via batch-mode
reinforcement learning. In: AAAI, pp. 1671–1678 (2008)
Hafner, R., Riedmiller, M.: Reinforcement Learning in Feedback Control — challenges and
benchmarks from technical process control. Machine Learning (accepted for publication,
2011), doi:10.1007/s10994-011-5235-x
Hinton, G., Salakhutdinov, R.: Reducing the Dimensionality of Data with Neural Networks.
Science 313(5786), 504–507 (2006)
Kalyanakrishnan, S., Stone, P.: Batch reinforcement learning in a complex domain. In: The
Sixth International Joint Conference on Autonomous Agents and Multiagent Systems, pp.
650–657. ACM, New York (2007)
Kietzmann, T., Riedmiller, M.: The Neuro Slot Car Racer: Reinforcement Learning in a Real
World Setting. In: Proceedings of the Int. Conference on Machine Learning Applications
(ICMLA 2009). Springer, Miami (2009)
Lagoudakis, M., Parr, R.: Model-Free Least-Squares Policy Iteration. In: Advances in Neural
Information Processing Systems, vol. 14, pp. 1547–1554 (2001)
Lagoudakis, M., Parr, R.: Least-Squares Policy Iteration. Journal of Machine Learning
Research 4, 1107–1149 (2003)
Lange, S.: Tiefes Reinforcement Lernen auf Basis visueller Wahrnehmungen. Dissertation,
Universität Osnabrück (2010)
Lange, S., Riedmiller, M.: Deep auto-encoder neural networks in reinforcement learning.
In: International Joint Conference on Neural Networks (IJCNN 2010), Barcelona, Spain
(2010a)
Lange, S., Riedmiller, M.: Deep learning of visual control policies. In: European Sympo-
sium on Artificial Neural Networks, Computational Intelligence and Machine Learning
(ESANN 2010), Brugge, Belgium (2010b)
Lauer, M., Riedmiller, M.: An Algorithm for Distributed Reinforcement Learning in Cooper-
ative Multi-Agent Systems. In: Proceedings of the Seventeenth International Conference
on Machine Learning (ICML 2000), pp. 535–542. Morgan Kaufmann, Stanford (2000)
Lin, L.: Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and
Teaching. Machine Learning 8(3), 293–321 (1992)
Ormoneit, D., Glynn, P.: Kernel-based reinforcement learning in average-cost problems: An
application to optimal portfolio choice. In: Advances in Neural Information Processing
Systems, vol. 13, pp. 1068–1074 (2001)
Ormoneit, D., Glynn, P.: Kernel-based reinforcement learning in average-cost problems.
IEEE Transactions on Automatic Control 47(10), 1624–1636 (2002)
Ormoneit, D., Sen, Ś.: Kernel-based reinforcement learning. Machine Learning 49(2), 161–
178 (2002)
Riedmiller, M.: Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural
Reinforcement Learning Method. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M.,
Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 317–328. Springer, Heidel-
berg (2005)
Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation learning: The
RPROP algorithm. In: Ruspini, H. (ed.) Proceedings of the IEEE International Conference
on Neural Networks (ICNN), San Francisco, pp. 586–591 (1993)
Riedmiller, M., Montemerlo, M., Dahlkamp, H.: Learning to Drive in 20 Minutes. In: Pro-
ceedings of the FBIT 2007 Conference. Springer, Jeju (2007)
Riedmiller, M., Hafner, R., Lange, S., Lauer, M.: Learning to dribble on a real robot by
success and failure. In: Proc. of the IEEE International Conference on Robotics and Au-
tomation, pp. 2207–2208 (2008)
Riedmiller, M., Gabel, T., Hafner, R., Lange, S.: Reinforcement Learning for Robot Soccer.
Autonomous Robots 27(1), 55–74 (2009)
Rumelhart, D., Hinton, G., Williams, R.: Learning representations by back-propagating
errors. Nature 323(6088), 533–536 (1986)
Schoknecht, R., Merke, A.: Convergent combinations of reinforcement learning with linear
function approximation. In: Advances in Neural Information Processing Systems, vol. 15,
pp. 1611–1618 (2003)
Singh, S., Jaakkola, T., Jordan, M.: Reinforcement learning with soft state aggregation. In:
Advances in Neural Information Processing Systems, vol. 7, pp. 361–368 (1995)
Sutton, R., Barto, A.: Reinforcement Learning. An Introduction. MIT Press/A Bradford
Book, Cambridge, USA (1998)
Timmer, S., Riedmiller, M.: Fitted Q Iteration with CMACs. In: Proceedings of the IEEE
International Symposium on Approximate Dynamic Programming and Reinforcement
Learning (ADPRL 2007), Honolulu, USA (2007)
Tognetti, S., Savaresi, S., Spelta, C., Restelli, M.: Batch reinforcement learning for semi-
active suspension control, pp. 582–587 (2009)
Werbos, P.: Beyond regression: New tools for prediction and analysis in the behavioral
sciences. PhD thesis, Harvard University (1974)
Chapter 3
Least-Squares Methods for Policy Iteration
Lucian Buşoniu
Research Center for Automatic Control (CRAN), University of Lorraine, France
e-mail: [email protected]
Alessandro Lazaric · Mohammad Ghavamzadeh · Rémi Munos
Team SequeL, INRIA Lille-Nord Europe, France
e-mail: {alessandro.lazaric,mohammad.ghavamzadeh}@inria.fr,
{remi.munos}@inria.fr
Robert Babuška · Bart De Schutter
Delft Center for Systems and Control, Delft University of Technology, The Netherlands
e-mail: {r.babuska,b.deschutter}@tudelft.nl
∗ This work was performed in part while Lucian Buşoniu was with the Delft Center for
Systems and Control and in part with Team SequeL at INRIA.
3.1 Introduction
Policy iteration is a core procedure for solving reinforcement learning problems,
which evaluates policies by estimating their value functions, and then uses these
value functions to find new, improved policies. In Chapter 1, the classical policy
iteration was introduced, which employs tabular, exact representations of the value
functions and policies. However, most problems of practical interest have state and
action spaces with a very large or even infinite number of elements, which precludes
tabular representations and exact policy iteration. Instead, approximate policy it-
eration must be used. In particular, approximate policy evaluation – constructing
approximate value functions for the policies considered – is the central, most chal-
lenging component of approximate policy iteration. While representing the policy
can also be challenging, an explicit representation is often avoided, by computing
policy actions on-demand from the approximate value function.
Some of the most powerful state-of-the-art algorithms for approximate policy
evaluation represent the value function using a linear parameterization, and obtain
a linear system of equations in the parameters, by exploiting the linearity of the
Bellman equation satisfied by the value function. Then, in order to obtain parameters
approximating the value function, this system is solved in a least-squares sample-
based sense, either in one shot or iteratively.
Since highly efficient numerical methods are available to solve such systems,
least-squares methods for policy evaluation are computationally efficient. Addition-
ally taking advantage of the generally fast convergence of policy iteration methods,
an overall fast policy iteration algorithm is obtained. More importantly, least-squares
methods are sample-efficient, i.e., they approach their solution quickly as the num-
ber of samples they consider increases. This is a crucial property in reinforcement
learning for real-life systems, as data obtained from such systems are very expen-
sive (in terms of time taken for designing and running data collection experiments,
of system wear-and-tear, and possibly even of economic costs).
In this chapter, we review least-squares methods for policy iteration: the class
of approximate policy iteration methods that employ least-squares techniques at the
policy evaluation step. The review is organized as follows. Section 3.2 provides a
quick recapitulation of classical policy iteration, and also serves to further clarify
the technical focus of the chapter. Section 3.3 forms the chapter’s core, thoroughly
introducing the family of least-squares methods for policy evaluation. Section 3.4
describes an application of a particular algorithm, called simply least-squares policy
iteration, to online learning. Section 3.5 illustrates the behavior of offline and online
least-squares policy iteration in an example. Section 3.6 reviews available theoreti-
cal guarantees about least-squares policy evaluation and the resulting approximate
policy iteration. Finally, Section 3.7 outlines several important extensions and im-
provements to least-squares methods, and mentions other reviews of reinforcement
learning that provide different perspectives on least-squares methods.
In this section, we revisit the classical policy iteration algorithm and some relevant
theoretical results from Chapter 1, adapting their presentation for the purposes of
the present chapter.
Recall first some notation. A Markov decision process with states s ∈ S and ac-
tions a ∈ A is given, governed by the stochastic dynamics s′ ∼ T(s, a, ·) and with the immediate performance described by the rewards r = R(s, a, s′), where T is the
transition function and R the reward function. The goal is to find an optimal pol-
icy π ∗ : S → A that maximizes the value function V π (s) or Qπ (s, a). For clarity,
in this chapter we refer to state value functions V as “V-functions”, thus achieving
consistency with the name “Q-functions” traditionally used for state-action value
functions Q. We use the name “value function” to refer collectively to V-functions
and Q-functions.
Policy iteration works by iteratively evaluating and improving policies. At the
policy evaluation step, the V-function or Q-function of the current policy is found.
Then, at the policy improvement step, a new, better policy is computed based on this
V-function or Q-function. The procedure continues afterward with the next iteration.
When the state-action space is finite and exact representations of the value function
and policy are used, policy improvement obtains a strictly better policy than the
previous one, unless the policy is already optimal. Since additionally the number of
possible policies is finite, policy iteration is guaranteed to find the optimal policy in
a finite number of iterations. Algorithm 6 shows the classical, offline policy iteration
for the case when Q-functions are used.
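As a point of reference, a minimal Python sketch in the same spirit as this classical scheme is given below. The tabular model arrays T and R, the exact linear-system policy evaluation, and all names are illustrative assumptions of this sketch, not a reproduction of the chapter's Algorithm 6.

```python
import numpy as np

def evaluate_policy_q(T, R, policy, gamma):
    """Solve the Bellman equation for Q^pi exactly in the tabular case.

    T[s, a, s2] : transition probabilities
    R[s, a, s2] : rewards
    policy[s]   : deterministic action chosen in each state
    """
    N, M, _ = T.shape
    # Expected immediate reward for every (s, a)
    r = np.einsum("ijk,ijk->ij", T, R)                # shape (N, M)
    # Transition matrix between state-action pairs under the policy
    P = np.zeros((N * M, N * M))
    for s in range(N):
        for a in range(M):
            for s2 in range(N):
                P[s * M + a, s2 * M + policy[s2]] = T[s, a, s2]
    q = np.linalg.solve(np.eye(N * M) - gamma * P, r.flatten())
    return q.reshape(N, M)

def policy_iteration_q(T, R, gamma):
    """Classical offline policy iteration with Q-functions."""
    N, M, _ = T.shape
    policy = np.zeros(N, dtype=int)                   # arbitrary initial policy
    while True:
        Q = evaluate_policy_q(T, R, policy, gamma)    # policy evaluation
        new_policy = Q.argmax(axis=1)                 # greedy policy improvement
        if np.array_equal(new_policy, policy):        # policy stable: optimal
            return policy, Q
        policy = new_policy
```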
At the policy evaluation step, the Q-function Qπ of policy π can be found using
the fact that it satisfies the Bellman equation:
Q^π = B^π_Q(Q^π)   (3.1)

where the Bellman mapping (also called backup operator) B^π_Q is defined for any
Q-function as follows:

[B^π_Q(Q)](s, a) = E_{s'∼T(s,a,·)}{ R(s, a, s') + γ Q(s', π(s')) }   (3.2)

Similarly, the V-function V^π of the policy π satisfies a Bellman equation:

V^π = B^π_V(V^π)   (3.3)

where the Bellman mapping for V-functions is defined by:

[B^π_V(V)](s) = E_{s'∼T(s,π(s),·)}{ R(s, π(s), s') + γ V(s') }   (3.4)
Note that both the Q-function and the V-function are bounded in absolute value by
V_max = ‖R‖_∞ / (1 − γ), where ‖R‖_∞ is the maximum absolute reward. A number of algorithms
are available for computing Qπ or V π , based, e.g., on directly solving the linear
system of equations (3.1) or (3.3) to obtain the Q-values (or V-values), on turning
the Bellman equation into an iterative assignment, or on temporal-difference, model-
free estimation – see Chapter 1 for details.
Once Qπ or V π is available, policy improvement can be performed. In this con-
text, an important difference between using Q-functions and V-functions arises.
When Q-functions are used, policy improvement involves only a maximization over
the action space:
π_{k+1}(s) ← arg max_{a∈A} Q^{π_k}(s, a)   (3.5)
(see again Algorithm 6), whereas policy improvement with V-functions additionally
requires a model, in the form of T and R, to investigate the transitions generated by
each action:
π_{k+1}(s) ← arg max_{a∈A} E_{s'∼T(s,a,·)}{ R(s, a, s') + γ V^{π_k}(s') }   (3.6)

Exact policy improvements such as (3.5) and (3.6) are typically possible only when the
action space is discrete and contains not too large a number of actions. In this case,
policy improvement can be performed by computing the value function for all the
discrete actions, and finding the maximum among these values using enumeration.1
In what follows, wherever possible, we will introduce the results and algorithms
in terms of Q-functions, motivated by the practical advantages they provide in the
context of policy improvement. However, most of these results and algorithms di-
rectly extend to the case of V-functions.
In this section we consider the problem of policy evaluation, and we introduce least-
squares methods to solve this problem. First, in Section 3.3.1, we discuss the high-
level principles behind these methods. Then, we progressively move towards the
methods’ practical implementation. In particular, in Section 3.3.2 we derive ideal-
ized, model-based versions of the algorithms for the case of linearly parameterized
approximation, and in Section 3.3.3 we outline their realistic, model-free implemen-
tations. To avoid detracting from the main line, we postpone the discussion of most
literature references until Section 3.3.4.
[Figure 3.1: the family of least-squares methods for policy evaluation — projected policy evaluation, solved either in one shot (least-squares temporal difference, LSTD) or iteratively (least-squares policy evaluation, LSPE), and Bellman residual minimization (BRM)]
1 In general, when the action space is large or continuous, the maximization problems (3.5)
or (3.6) may be too involved to solve exactly, in which case only approximate policy im-
provements can be made.
where Π : Q → Q̂ denotes projection from the space Q of all Q-functions onto the
space Q̂ of representable Q-functions. Solving this equation is the same as minimizing
the distance between Q̂ and Π(B^π_Q(Q̂)):

min_{Q̂∈Q̂} ‖Q̂ − Π(B^π_Q(Q̂))‖   (3.8)

Bellman residual minimization (BRM) instead only requires that the Bellman equation
holds approximately, Q̂ ≈ B^π_Q(Q̂), and minimizes the distance between Q̂ and B^π_Q(Q̂):

min_{Q̂∈Q̂} ‖Q̂ − B^π_Q(Q̂)‖   (3.9)
Until now, the approximator of the value function, as well as the norms in the min-
imizations (3.8) and (3.9) have been left unspecified. The most common choices
are, respectively, approximators linear in the parameters and (squared, weighted)
Euclidean norms, as defined below. With these choices, the minimization problems
solved by projected and BRM policy evaluation can be written in terms of matrices
and vectors, and have closed-form solutions that eventually lead to efficient algo-
rithms. The name “least-squares” for this class of methods comes from minimizing
the squared Euclidean norm.
To formally introduce these choices and the solutions they lead to, the state and
action spaces will be assumed finite, S = {s1 , . . . , sN }, A = {a1 , . . . , aM }. Never-
theless, the final, practical algorithms obtained in Section 3.3.3 below can also be
applied to infinite and continuous state-action spaces.
A linearly parameterized representation of the Q-function has the following form:
Q̂(s, a) = Σ_{l=1}^{d} ϕ_l(s, a) θ_l = φᵀ(s, a) θ   (3.10)
where θ ∈ Rd is the parameter vector and φ (s, a) = [ϕ1 (s, a), . . . , ϕd (s, a)] is a
vector of basis functions (BFs), also known as features (Bertsekas and Tsitsiklis,
1996).3 Thus, the space of representable Q-functions is the span of the BFs,
Q̂ = { φᵀ(·, ·)θ | θ ∈ R^d }.
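For concreteness, a hypothetical instance of the linear parameterization (3.10) is sketched below, using Gaussian radial basis functions over a one-dimensional state, replicated for each of M discrete actions. The centers, widths, and sizes are arbitrary illustrative choices and are not taken from the chapter.

```python
import numpy as np

# Hypothetical BFs: Gaussian RBFs over a 1-D state, replicated once per action.
CENTERS = np.linspace(-1.0, 1.0, 5)      # 5 RBF centers (illustrative)
WIDTH = 0.4
M = 2                                    # number of discrete actions

def phi(s, a):
    """BF vector phi(s, a) of dimension d = 5 * M, as in (3.10)."""
    rbf = np.exp(-((s - CENTERS) ** 2) / (2 * WIDTH ** 2))
    out = np.zeros(len(CENTERS) * M)
    out[a * len(CENTERS):(a + 1) * len(CENTERS)] = rbf
    return out

def q_hat(s, a, theta):
    """Approximate Q-function: Q_hat(s, a) = phi(s, a)^T theta."""
    return phi(s, a) @ theta
```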
Given a weight function ρ : S × A → [0, 1], the (squared) weighted Euclidean
norm of a Q-function is defined by:
‖Q‖²_ρ = Σ_{i=1,...,N, j=1,...,M} ρ(s_i, a_j) [Q(s_i, a_j)]²

Note that the norm itself, ‖Q‖_ρ, is the square root of this expression; we will mostly use
the squared variant in the sequel.
The corresponding weighted least-squares projection operator, used in projected
policy evaluation, is:

Π^ρ(Q) = arg min_{Q̂∈Q̂} ‖Q̂ − Q‖²_ρ
The weight function is interpreted as a probability distribution over the state-action
space, so it must sum up to 1. The distribution given by ρ will be used to generate
samples used by the model-free policy evaluation algorithms of Section 3.3.3 below.
Because the state-action space is discrete, the Bellman mapping can be written
as a sum:
[B^π_Q(Q)](s_i, a_j) = Σ_{i'=1}^{N} T(s_i, a_j, s_{i'}) [ R(s_i, a_j, s_{i'}) + γ Q(s_{i'}, π(s_{i'})) ]   (3.11)

                    = Σ_{i'=1}^{N} T(s_i, a_j, s_{i'}) R(s_i, a_j, s_{i'}) + γ Σ_{i'=1}^{N} T(s_i, a_j, s_{i'}) Q(s_{i'}, π(s_{i'}))   (3.12)
for any i, j. The two-sum expression leads us to a matrix form of this mapping:
B^π_Q(Q) = R + γ T^π Q   (3.13)

where B^π_Q : R^{NM} → R^{NM}. Denoting by [i, j] = i + (j − 1)N the scalar index corre-
sponding to i and j,4 the vectors and matrices on the right hand side of (3.13) are
defined as follows:5
• Q ∈ R^{NM} is a vector representation of the Q-function Q, with Q[i, j] = Q(s_i, a_j).
• R ∈ R^{NM} is a vector representation of the expectation of the reward function
R, where the element R[i, j] is the expected reward after taking action a_j in state
s_i, i.e., R[i, j] = Σ_{i'=1}^{N} T(s_i, a_j, s_{i'}) R(s_i, a_j, s_{i'}). So, for the first sum in (3.12), the
transition function T has been integrated into R (unlike for the second sum).
• T^π ∈ R^{NM×NM} is a matrix representation of the transition function combined
with the policy, with T^π_{[i,j],[i',j']} = T(s_i, a_j, s_{i'}) if π(s_{i'}) = a_{j'}, and 0 otherwise. A
useful way to think about T^π is as containing transition probabilities between
state-action pairs, rather than just states. In this interpretation, T^π_{[i,j],[i',j']} is the
probability of moving from an arbitrary state-action pair (s_i, a_j) to a next state-
action pair (s_{i'}, a_{j'}) that follows the current policy. Thus, if a_{j'} ≠ π(s_{i'}), then the
probability is zero, which is what the definition says. This interpretation also
indicates that stochastic policies can be represented with a simple modification:
in that case, T^π_{[i,j],[i',j']} = T(s_i, a_j, s_{i'}) · π(s_{i'}, a_{j'}), where π(s, a) is the probability
of taking a in s.
The next step is to rewrite the Bellman mapping in terms of the parameter vector, by
replacing the generic Q-vector in (3.13) by an approximate, parameterized Q-vector.
This is useful because in all the methods, the Bellman mapping is always applied to
approximate Q-functions. Using the following matrix representation of the BFs:

Φ ∈ R^{NM×d},  Φ_{[i,j],l} = ϕ_l(s_i, a_j),  l = 1, . . . , d   (3.14)

the Bellman mapping (3.13), applied to an approximate Q-vector Φθ, becomes:

B^π_Q(Φθ) = R + γ T^π Φθ   (3.15)

4 If the d elements of the BF vector were arranged into an N × M matrix, by first filling in
the first column with the first N elements, then the second column with the subsequent N
elements, etc., then the element at index [i, j] of the vector would be placed at row i and
column j of the matrix.
5 Boldface notation is used for vector or matrix representations of functions and mappings.
Ordinary vectors and matrices are displayed in normal font.
We are now ready to describe projected policy evaluation and BRM in the linear
case. Note that the matrices and vectors in (3.15) are too large to be used directly in
an implementation; however, we will see that starting from these large matrices and
vectors, both the projected policy evaluation and the BRM solutions can be written
in terms of smaller matrices and vectors, which can be stored and manipulated in
practical algorithms.
Under appropriate conditions on the BFs and the weights ρ (see Section 3.6.1 for a
discussion), the projected Bellman equation can be exactly solved in the linear case:
Q̂ = Π^ρ(B^π_Q(Q̂))

so that a minimum of 0 is attained in the problem min_{Q̂∈Q̂} ‖Q̂ − Π^ρ(B^π_Q(Q̂))‖²_ρ. In
matrix form, the projected Bellman equation is:

Φθ = Π^ρ(B^π_Q(Φθ)) = Π^ρ(R + γ T^π Φθ)   (3.16)

Writing out the weighted least-squares projection in closed form and left-multiplying by
Φᵀρ – where ρ now also denotes the NM × NM diagonal matrix with the weights ρ(s_i, a_j)
on its diagonal – leads to:

Φᵀ ρ Φ θ = γ Φᵀ ρ T^π Φ θ + Φᵀ ρ R
or in condensed form:
Aθ = γ Bθ + b   (3.19)

with the notations A = Φᵀ ρ Φ, B = Φᵀ ρ T^π Φ, and b = Φᵀ ρ R. The matrices A and
B are in R^{d×d}, while b is a vector in R^d. This is a crucial expression, highlighting
that the projected Bellman equation can be represented and solved using only small
matrices and vectors (of size d × d and d), instead of the large matrices and vectors
(of sizes up to NM × NM) that originally appeared in the formulas.
Next, two idealized algorithms for projected policy evaluation are introduced,
which assume knowledge of A, B, and b. The next section will show how this as-
sumption can be removed.
The idealized LSTD-Q belongs to the first (“one-shot”) subclass of methods for
projected policy evaluation in Figure 3.1. It simply solves the system (3.19) to ar-
rive at the parameter vector θ . This parameter vector provides an approximate Q-
function Q̂^π(s, a) = φᵀ(s, a)θ of the considered policy π. Note that because θ ap-
pears on both sides of (3.19), this equation can be simplified to:
(A − γ B)θ = b
The idealized LSPE-Q algorithm belongs to the second, iterative subclass of projected
policy evaluation methods in Figure 3.1. Rather than solving (3.19) in one shot, it
incrementally updates the parameter vector:

θ_{τ+1} = θ_τ + α (θ̄_{τ+1} − θ_τ),  where  A θ̄_{τ+1} = γ B θ_τ + b   (3.20)
starting from some initial value θ0 . In this update, α is a positive step size parameter.
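The idealized algorithms can be summarized in a few lines of code once A, B, and b are available. The sketch below is an illustration under that assumption (which, as discussed next, requires the model): LSTD-Q performs a single linear solve of (3.19), while LSPE-Q iterates the update (3.20). The step size, iteration count, and function names are assumptions of this sketch.

```python
import numpy as np

def lstd_q_idealized(A, B, b, gamma):
    """One-shot solution of (A - gamma*B) theta = b."""
    return np.linalg.solve(A - gamma * B, b)

def lspe_q_idealized(A, B, b, gamma, alpha=1.0, theta0=None, num_iter=100):
    """Iterative solution of (3.19): theta <- theta + alpha*(theta_bar - theta),
    where A theta_bar = gamma*B theta + b, as in (3.20)."""
    theta = np.zeros(len(b)) if theta0 is None else np.array(theta0, dtype=float)
    for _ in range(num_iter):
        theta_bar = np.linalg.solve(A, gamma * B @ theta + b)
        theta = theta + alpha * (theta_bar - theta)
    return theta
```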
Consider the problem solved by BRM-Q (3.9), specialized to the weighted Eu-
clidean norm used in this section:
min_{Q̂∈Q̂} ‖Q̂ − B^π_Q(Q̂)‖²_ρ   (3.21)

Using the matrix expression (3.15) for B^π_Q, this minimization problem can be rewrit-
ten in terms of the parameter vector as follows:

min_{θ∈R^d} ‖Φθ − R − γ T^π Φθ‖²_ρ = min_{θ∈R^d} ‖(Φ − γ T^π Φ)θ − R‖²_ρ = min_{θ∈R^d} ‖Cθ − R‖²_ρ

where C = Φ − γ T^π Φ. Setting the gradient of this quadratic objective to zero yields the
linear system:

Cᵀ ρ C θ = Cᵀ ρ R   (3.22)
The idealized BRM-Q algorithm consists of solving this equation to arrive at θ , and
thus to an approximate Q-function of the policy considered.
The BRM equation (3.22) can also be interpreted as a fixed-point condition that is similar
to the projected Bellman equation (3.16), except that the modified mapping B̄^π_Q is used:

B̄^π_Q(Φθ) = B^π_Q(Φθ) + γ (T^π)ᵀ [Φθ − B^π_Q(Φθ)]
To obtain the matrices and vectors appearing in their equations, the idealized algo-
rithms given above would require the transition and reward functions of the Markov
decision process, which are unknown in a reinforcement learning context. More-
over, the algorithms would need to iterate over all the state-action pairs, which is
impossible in the large state-action spaces they are intended for.
This means that, in practical reinforcement learning, sample-based versions of
these algorithms should be employed. Fortunately, thanks to the special struc-
ture of the matrices and vectors involved, they can be estimated from samples. In
this section, we derive practically implementable LSTD-Q, LSPE-Q, and BRM-Q
algorithms.
In the model-free case, a set of n transition samples (s_i, a_i, r_i, s'_i), i = 1, . . . , n, is used,
where the state-action pairs (s_i, a_i) are drawn according to ρ, s'_i ∼ T(s_i, a_i, ·), and
r_i = R(s_i, a_i, s'_i). From these samples, estimates of A, B, and b are built up incrementally:

Â_i = Â_{i−1} + φ(s_i, a_i) φᵀ(s_i, a_i)
B̂_i = B̂_{i−1} + φ(s_i, a_i) φᵀ(s'_i, π(s'_i))   (3.23)
b̂_i = b̂_{i−1} + φ(s_i, a_i) r_i

starting from zero initial values (Â_0 = 0, B̂_0 = 0, b̂_0 = 0).
LSTD-Q processes the n samples using (3.23) and then solves the equation:
(1/n) Â_n θ = γ (1/n) B̂_n θ + (1/n) b̂_n   (3.24)

or, equivalently:

(1/n)(Â_n − γ B̂_n) θ = (1/n) b̂_n
to find a parameter vector θ . Note that, because this equation is an approximation
of (3.19), the parameter vector θ is only an approximation to the solution of (3.19)
(however, for notational simplicity we denote it in the same way). The divisions by
n, while not mathematically necessary, increase the numerical stability of the algo-
rithm, by preventing the coefficients from growing too large as more samples are
processed. The composite matrix Â − γB̂ can also be updated as a single entity,
thereby eliminating the need to store in memory two potentially large
matrices Â and B̂.
Algorithm 7 summarizes this more memory-efficient variant of LSTD-Q. Note
that the update of Â − γB̂ simply accumulates the updates of Â and B̂ in (3.23),
where the term from B̂ is properly multiplied by −γ.
Algorithm 7. LSTD-Q
The estimate Â may not be invertible at the start of the learning process,
when only a few samples have been processed. A practical solution to this issue is
to initialize Â to a small multiple of the identity matrix.
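A minimal sketch of the sample-based LSTD-Q just described is given below. It accumulates the composite matrix Â − γB̂ and the vector b̂ from transition samples and solves the resulting system; the feature function phi, the policy interface, and the small identity-matrix initialization are assumptions of this sketch.

```python
import numpy as np

def lstd_q(samples, phi, policy, gamma, d, reg=1e-3):
    """Sample-based LSTD-Q.

    samples : iterable of (s, a, r, s_next) transition samples
    phi     : feature function phi(s, a) -> vector of length d
    policy  : policy being evaluated, policy(s) -> action
    reg     : small multiple of the identity used to initialize A_hat - gamma*B_hat
    """
    AB = reg * np.eye(d)              # accumulates A_hat - gamma * B_hat
    b = np.zeros(d)
    n = 0
    for (s, a, r, s_next) in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        AB += np.outer(f, f - gamma * f_next)
        b += f * r
        n += 1
    # Dividing both sides by n does not change the solution; it only helps conditioning.
    return np.linalg.solve(AB / n, b / n)
```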
A more flexible algorithm than the basic sample-based LSPE-Q update (3.25) can be obtained by (i) processing more
than one sample in-between consecutive updates of the parameter vector, as well as
by (ii) performing more than one parameter update after processing each (batch of)
samples, while holding the coefficient estimates constant. The former modification
may increase the stability of the algorithm, particularly in the early stages of learn-
ing, while the latter may accelerate its convergence, particularly in later stages as
the estimates Â, B̂, and b̂ become more precise.
Algorithm 8 shows this more flexible variant of LSPE-Q, which (i) processes
batches of n̄ samples in-between parameter update episodes (n̄ should preferably
be a divisor of n). At each such episode, (ii) the parameter vector is updated Nupd
times. Notice that unlike in (3.25), the parameter update index τ is different from
the sample index i, since the two indices are no longer advancing synchronously.
Algorithm 8. LSPE-Q
Because LSPE-Q must solve the system in (3.25) multiple times, it will require
more computational effort than LSTD-Q, which solves the similar system (3.24)
only once. On the other hand, the incremental nature of LSPE-Q can offer it advan-
tages over LSTD-Q. For instance, LSPE can benefit from a good initial value of the
parameter vector, and better flexibility can be achieved by controlling the step size.
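The flexible variant just described might be sketched as follows; the batch size, the number of updates per batch, and the feature and policy interfaces are assumptions of this illustration, which mirrors the description above rather than reproducing the algorithm box.

```python
import numpy as np

def lspe_q(samples, phi, policy, gamma, d,
           alpha=1.0, n_batch=100, n_upd=10, reg=1e-3):
    """Flexible sample-based LSPE-Q: process samples in batches, then perform
    several parameter updates while the estimates are held fixed."""
    A = reg * np.eye(d)
    B = np.zeros((d, d))
    b = np.zeros(d)
    theta = np.zeros(d)
    for i, (s, a, r, s_next) in enumerate(samples, start=1):
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f)
        B += np.outer(f, f_next)
        b += f * r
        if i % n_batch == 0:                      # end of a batch of samples
            for _ in range(n_upd):                # several parameter updates
                theta_bar = np.linalg.solve(A / i, (gamma * B @ theta + b) / i)
                theta = theta + alpha * (theta_bar - theta)
    return theta
```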
Next, sample-based BRM is briefly discussed. The matrix CᵀρC and the vector
CᵀρR appearing in the idealized BRM equation (3.22) can be estimated from sam-
ples. The estimation procedure requires double transition samples: for each
state-action pair (s_i, a_i), two independent next states s_{i,1}, s_{i,2} ∼ T(s_i, a_i, ·) and the
corresponding rewards r_{i,1}, r_{i,2} must be available. The estimates are then updated as:

(CᵀρC)_i = (CᵀρC)_{i−1} + [φ(s_i, a_i) − γ φ(s_{i,1}, π(s_{i,1}))] [φ(s_i, a_i) − γ φ(s_{i,2}, π(s_{i,2}))]ᵀ   (3.26)

(CᵀρR)_i = (CᵀρR)_{i−1} + [φ(s_i, a_i) − γ φ(s_{i,2}, π(s_{i,2}))] r_{i,1}

where the subscript i denotes the estimate after processing i samples; using two independent
next-state samples avoids the bias that a single next-state sample would introduce in the
quadratic term.
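Under the same assumptions as the earlier sketches (user-supplied feature function and policy), the double-sample estimates might be accumulated as follows. The pairing of the reward r1 with the features of the second next-state sample follows the reconstruction above and is an assumption of this sketch.

```python
import numpy as np

def brm_q(double_samples, phi, policy, gamma, d, reg=1e-3):
    """Sample-based BRM-Q with double transition samples.

    double_samples : iterable of (s, a, s1, r1, s2, r2), where s1 and s2 are two
                     independent next states drawn from T(s, a, .) and r1, r2
                     the corresponding rewards.
    """
    CC = reg * np.eye(d)          # estimate of C^T rho C
    Cr = np.zeros(d)              # estimate of C^T rho R
    for (s, a, s1, r1, s2, r2) in double_samples:
        f = phi(s, a)
        u = f - gamma * phi(s1, policy(s1))
        v = f - gamma * phi(s2, policy(s2))
        CC += np.outer(u, v)      # independent next states avoid the bias
        Cr += v * r1              # assumed pairing (see text above)
    return np.linalg.solve(CC, Cr)
```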
A crucial issue that arises in all the algorithms above is exploration: ensuring that
the state-action space is sufficiently covered by the available samples. Consider first
the exploration of the action space. The algorithms are typically used to evaluate
deterministic policies. If samples were only collected according to the current pol-
icy π , i.e., if all samples were of the form (s, π (s)), no information about pairs (s, a)
with a ≠ π(s) would be available. Therefore, the approximate Q-values of such pairs
would be poorly estimated and unreliable for policy improvement. To alleviate this
problem, exploration is necessary: sometimes, actions different from π (s) have to
be selected, e.g., in a random fashion. Looking now at the state space, exploration
plays another helpful role when samples are collected along trajectories of the sys-
tem. In the absence of exploration, areas of the state space that are not visited under
the current policy would not be represented in the sample set, and the value function
would therefore be poorly estimated in these areas, even though they may be im-
portant in solving the problem. Instead, exploration drives the system along larger
areas of the state space.
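A typical way to inject the exploration discussed here when collecting samples is ε-greedy action selection; the sketch below is a generic illustration under the same assumptions as before, not a prescription from the chapter.

```python
import random

def epsilon_greedy(s, q_values, actions, epsilon):
    """With probability epsilon pick a uniformly random action (exploration),
    otherwise the greedy action of the current approximate Q-function."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values(s, a))
```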
Computational Considerations
Updating the estimates costs a number of operations quadratic in the number of BFs d
per sample, while solving the linear system is cubic in d. Although the system is solved only
once, the time needed to process samples may become dominant if the number of
samples is very large.
The linear systems can be solved in several ways, e.g., by matrix inversion, by
Gaussian elimination, or by incrementally computing the inverse with the Sherman-
Morrison formula (see Golub and Van Loan, 1996, Chapters 2 and 3). Note also
that, when the BF vector φ is sparse, as in the often-encountered case of localized
BFs, this sparsity can be exploited to greatly improve the computational efficiency
of the matrix and vector updates in all the least-squares algorithms.
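As an illustration of the incremental-inverse option mentioned above, the inverse of the accumulated matrix Â − γB̂ can be maintained with the Sherman–Morrison formula, since each sample contributes a rank-one term φ(s,a)[φ(s,a) − γφ(s', π(s'))]ᵀ. The sample format and initialization below are the same illustrative assumptions as in the earlier sketches.

```python
import numpy as np

def sherman_morrison_update(M_inv, u, v):
    """Return the inverse of (M + u v^T), given M_inv = M^{-1}."""
    Mu = M_inv @ u
    vM = v @ M_inv
    return M_inv - np.outer(Mu, vM) / (1.0 + v @ Mu)

def lstd_q_recursive(samples, phi, policy, gamma, d, reg=1e-3):
    """Recursive LSTD-Q: maintain (A_hat - gamma*B_hat)^{-1} and b_hat directly."""
    M_inv = np.eye(d) / reg            # inverse of the initial reg * I
    b = np.zeros(d)
    for (s, a, r, s_next) in samples:
        f = phi(s, a)
        u, v = f, f - gamma * phi(s_next, policy(s_next))
        M_inv = sherman_morrison_update(M_inv, u, v)
        b += f * r
    return M_inv @ b
```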
The high-level introduction of Section 3.3.1 followed the line of (Farahmand et al,
2009). After that, we followed at places the derivations in Chapter 3 of (Buşoniu
et al, 2010a).
LSTD was introduced in the context of V-functions (LSTD-V) by Bradtke and
Barto (1996), and theoretically studied by, e.g., Boyan (2002); Konda (2002); Nedić
and Bertsekas (2003); Lazaric et al (2010b); Yu (2010). LSTD-Q, the extension to
the Q-function case, was introduced by Lagoudakis et al (2002); Lagoudakis and
Parr (2003a), who also used it to develop the LSPI algorithm. LSTD-Q was then
used and extended in various ways, e.g., by Xu et al (2007); Li et al (2009); Kolter
and Ng (2009); Buşoniu et al (2010d,b); Thiery and Scherrer (2010).
LSPE-V was introduced by Bertsekas and Ioffe (1996) and theoretically studied
by Nedić and Bertsekas (2003); Bertsekas et al (2004); Yu and Bertsekas (2009).
Its extension to Q-functions, LSPE-Q, was employed by, e.g., Jung and Polani
(2007a,b); Buşoniu et al (2010d).
The idea of minimizing the Bellman residual was proposed as early as in
(Schweitzer and Seidmann, 1985). BRM-Q and variants were studied for instance
by Lagoudakis and Parr (2003a); Antos et al (2008); Farahmand et al (2009), while
Scherrer (2010) recently compared BRM approaches with projected approaches. It
should be noted that the variants of Antos et al (2008) and Farahmand et al (2009),
called “modified BRM” by Farahmand et al (2009), eliminate the need for double
sampling by introducing a change in the minimization problem (3.9).
When the policy is improved before its evaluation has been completed accurately, using
only a limited number of samples, the approach is known as optimistic policy
iteration (Bertsekas and Tsitsiklis, 1996; Sutton, 1988; Tsitsiklis, 2002). In the ex-
treme, fully optimistic case, the policy is improved after every single transition.
Optimistic policy updates were combined with LSTD-Q – thereby obtaining opti-
mistic LSPI – in our works (Buşoniu et al, 2010d,b) and with LSPE-Q in (Jung
and Polani, 2007a,b). Li et al (2009) explored a non-optimistic, more computation-
ally involved approach to online policy iteration, in which LSPI is fully executed
between consecutive sample-collection episodes.
from being used too long. Note that, as in the offline case, improved policies do not
have to be explicitly computed in online LSPI, but can be computed on demand.
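To illustrate how these pieces fit together in an online setting, the sketch below combines ε-greedy sample collection, accumulation of the LSTD-Q estimates, and optimistic policy improvements every few samples. The environment interface (env.reset, env.step), the exploration schedule, and all parameter values are assumptions of this sketch, not the exact algorithm of (Buşoniu et al, 2010d,b).

```python
import numpy as np
import random

def online_lspi(env, phi, actions, gamma, d,
                n_steps=10_000, improve_every=10,
                epsilon0=1.0, epsilon_decay=0.999, reg=1e-3):
    """Online, optimistic LSPI with an implicitly represented policy."""
    AB = reg * np.eye(d)                     # accumulates A_hat - gamma*B_hat
    b = np.zeros(d)
    theta = np.zeros(d)

    def greedy(s):                           # policy computed on demand from theta
        return max(actions, key=lambda a: phi(s, a) @ theta)

    epsilon = epsilon0
    s = env.reset()
    for t in range(1, n_steps + 1):
        a = random.choice(actions) if random.random() < epsilon else greedy(s)
        s_next, r, done = env.step(a)        # assumed environment interface
        f = phi(s, a)
        f_next = np.zeros(d) if done else phi(s_next, greedy(s_next))
        AB += np.outer(f, f - gamma * f_next)
        b += f * r
        if t % improve_every == 0:           # optimistic policy improvement
            theta = np.linalg.solve(AB / t, b / t)
        epsilon *= epsilon_decay             # decaying exploration
        s = env.reset() if done else s_next
    return theta
```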
Fig. 3.2 Left: The car on the hill, with the “car” shown as a black bullet. Right: a near-optimal
policy (black means a = −4, white means a = +4, gray means both actions are equally good).
Denoting the horizontal position of the car by p, its continuous-time dynamics are given
(in the variant of Ernst et al, 2005) by an ordinary differential equation for p, labeled (3.27).
The discrete-time transitions are obtained by numerically integrating
(3.27) between consecutive time steps, using a discrete time step of 0.1 s. Thus, s_t
and pt are sampled versions of the continuous variables s and p. The state space is
S = [−1,1] × [−3,3] plus a terminal state that is reached whenever st+1 would fall
outside these bounds, and the discrete action space is A = {−4, 4}. The goal is to
drive past the top of the hill to the right with a speed within the allowed limits, while
reaching the terminal state in any other way is considered a failure. To express this
goal, the following reward is chosen:
R(s_t, a_t, s_{t+1}) =  −1  if p_{t+1} < −1 or |ṗ_{t+1}| > 3
                         1  if p_{t+1} > 1 and |ṗ_{t+1}| ≤ 3   (3.28)
                         0  otherwise
with the discount factor γ = 0.95. Figure 3.2, right shows a near-optimal policy for
this reward function.
To apply the algorithms considered, the Q-function is approximated over the state
space using bilinear interpolation on an equidistant 13 × 13 grid. To represent the
Q-function over the discrete action space, separate parameters are stored for the two
discrete actions, so the approximator can be written as:
Q̂(s, a_j) = Σ_{i=1}^{169} ϕ_i(s) θ_{i,j}
where the state-dependent BFs ϕi (s) provide the interpolation coefficients, with at
most 4 BFs being non-zero for any s, and j = 1, 2. By replicating the state-dependent
BFs for both discrete actions, this approximator can be written in the standard form
(3.10), and can therefore be used in least-squares methods for policy iteration.
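A sketch of this interpolating approximator is given below: the state-dependent BFs are the bilinear interpolation weights of s = (p, ṗ) on the 13 × 13 grid, replicated for each of the two discrete actions to obtain the standard form (3.10). The grid bounds match the state space given above; the implementation details are our own illustration.

```python
import numpy as np

P_GRID = np.linspace(-1.0, 1.0, 13)      # grid over position p
V_GRID = np.linspace(-3.0, 3.0, 13)      # grid over velocity p_dot
N_STATE_BFS = 13 * 13                    # 169 state-dependent BFs
N_ACTIONS = 2                            # actions -4 and +4

def state_bfs(p, pdot):
    """Bilinear interpolation coefficients on the 13 x 13 grid:
    at most 4 of the 169 coefficients are non-zero and they sum to 1."""
    i = np.clip(np.searchsorted(P_GRID, p) - 1, 0, 11)
    j = np.clip(np.searchsorted(V_GRID, pdot) - 1, 0, 11)
    tp = (p - P_GRID[i]) / (P_GRID[i + 1] - P_GRID[i])
    tv = (pdot - V_GRID[j]) / (V_GRID[j + 1] - V_GRID[j])
    w = np.zeros(N_STATE_BFS)
    w[i * 13 + j] = (1 - tp) * (1 - tv)
    w[(i + 1) * 13 + j] = tp * (1 - tv)
    w[i * 13 + j + 1] = (1 - tp) * tv
    w[(i + 1) * 13 + j + 1] = tp * tv
    return w

def phi(state, action_index):
    """Replicate the state BFs for each discrete action (standard form (3.10))."""
    out = np.zeros(N_STATE_BFS * N_ACTIONS)
    out[action_index * N_STATE_BFS:(action_index + 1) * N_STATE_BFS] = state_bfs(*state)
    return out
```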
First, we apply LSPI with this approximator to the car-on-the-hill problem. A
set of 10 000 random, uniformly distributed state-action samples are independently
generated; these samples are reused to evaluate the policy at each policy iteration.
With these settings, LSPI typically converges in 7 to 9 iterations (from 20 indepen-
dent runs, where convergence is considered achieved when the difference between
consecutive parameter vectors drops below 0.001). Figure 3.3, top illustrates a sub-
sequence of policies found during a representative run. Subject to the resolution
limitations of the chosen representation, a reasonable approximation of the near-
optimal policy in Figure 3.2, right is found.
For comparison purposes, we also run an approximate value iteration algorithm
using the same approximator and the same convergence threshold. (The actual algo-
rithm, called fuzzy Q-iteration, is described in Buşoniu et al (2010c); here, we are
only interested in the fact that it is representative for the approximate value itera-
tion class.) The algorithm converges after 45 iterations. This slower convergence of
value iteration compared to policy iteration is often observed in practice. Figure 3.3,
middle shows a subsequence of policies found by value iteration. Like for the PI al-
gorithms we considered, the policies are implicitly represented by the approximate
Q-function, in particular, actions are chosen by maximizing the Q-function as in
(3.5). The final solution is different from the one found by LSPI, and the algorithms
[Figure 3.3 panels: policy snapshots, with panels labeled iter = 1, 2, 20, 45 and t = 1 s, 2 s, 500 s, 1000 s]
Fig. 3.3 Representative subsequences of policies found by the algorithms considered. Top:
offline LSPI; middle: fuzzy Q-iteration; bottom: online LSPI. For axis and color meanings,
see Figure 3.2, right, additionally noting that here the negative (black) action is preferred
when both actions are equally good.
also converge differently: LSPI initially makes large steps in the policy space at each
iteration (so that, e.g., the structure of the policy is already visible after the second
iteration), whereas value iteration makes smaller, incremental steps.
Finally, online, optimistic LSPI is applied to the car on the hill. The experiment
is run for 1000 s of simulated time, so that in the end 10 000 samples have been
collected, like for the offline algorithm. This interval is split into separate learning
trials, initialized at random initial states and stopping when a terminal state has been
reached, or otherwise after 3 s. Policy improvements are performed once every 10
samples (i.e., every 1 s), and an ε -greedy exploration strategy is used (see Chap-
ter 1), with ε = 1 initially and decaying exponentially so that it reaches a value of
0.1 after 350 s. Figure 3.3, bottom shows a subsequence of policies found during
a representative run. Online LSPI makes smaller steps in the policy space than of-
fline LSPI because, in-between consecutive policy improvements, it processes fewer
samples, which come from a smaller region of the state space. In fact, at the end of
learning LSPI has processed each sample only once, whereas offline LSPI processes
all the samples once at every iteration.
Figure 3.4 shows the performance of policies found by online LSPI along the
online learning process, in comparison to the final policies found by offline LSPI.
The performance is measured by the average empirical return over an equidistant
grid of initial states, evaluated by simulation with a precision of 0.001; we call this
average return “score”. Despite theoretical uncertainty about its convergence and
near-optimality, online LSPI empirically reaches at least as good performance as the
offline algorithm (for encouraging results in several other problems, see our papers
(Buşoniu et al, 2010d,b) and Chapter 5 of (Buşoniu et al, 2010a)). For completeness,
we also report the score of the – deterministically obtained – value iteration solution:
0.219, slightly lower than that obtained by either version of LSPI.
Fig. 3.4 Evolution of the policy score in online LSPI, compared with offline LSPI. Mean
values with 95% confidence intervals are reported, from 20 independent experiments.
The execution time was around 34 s for LSPI, around 28 s for online LSPI, and
0.3 s for value iteration. This illustrates the fact that the convergence rate advantage
of policy iteration does not necessarily translate into computational savings – since
each policy evaluation can have a complexity comparable with the entire value it-
eration. In our particular case, LSPI requires building the estimates of A, B and b
and solving a linear system of equations, whereas for the interpolating approxima-
tor employed, each approximate value iteration reduces to a very simple update of
the parameters. For other types of approximators, value iteration algorithms will be
more computationally intensive, but still tend to require less computation per iter-
ation than PI algorithms. Note also that the execution time of online LSPI is much
smaller than the 1000 s simulated experiment duration.
While theoretical results are mainly given for V-functions in the literature, they di-
rectly extend to Q-functions, by considering the Markov chain of state-action pairs
under the current policy, rather than the Markov chain of states as in the case of V-
functions. We will therefore use our Q-function-based derivations above to explain
and exemplify the guarantees.
Throughout, we require that the BFs are linearly independent, implying that the
BF matrix Φ has full column rank. Intuitively, this means there are no redundant
BFs.
In the context of projected policy evaluation, one important difference between the
LSTD and LSPE families of methods is the following. LSTD-Q will produce a
meaningful solution whenever the equation (3.19) has a solution, which can hap-
pen for many weight functions ρ . In contrast, to guarantee the convergence of the
basic LSPE-Q iteration, one generally must additionally require that the samples
follow ρ π , the stationary distribution over state-action pairs induced by the pol-
icy considered (stationary distribution of π , for short). Intuitively, this means that
the weight of each state-action pair (s, a) is equal to the steady-state probability of
this pair along an infinitely-long trajectory generated with the policy π . The projec-
tion mapping⁶ Π^{ρ^π} is nonexpansive with respect to the norm ‖·‖_{ρ^π} weighted by
the stationary distribution ρ^π, and because additionally the original Bellman map-
ping B^π_Q is a contraction with respect to this norm, the projected Bellman mapping
Π^{ρ^π}(B^π_Q(·)) is also a contraction.⁷
Confirming that LSTD is not very dependent on using ρ π , Yu (2010) proved that
with a minor modification, the solution found by LSTD converges as n → ∞, even
when one policy is evaluated using samples from a different policy – the so-called
off-policy case. Even for LSPE, it may be possible to mitigate the destabilizing
effects of violating convergence assumptions, by controlling the step size α . Fur-
thermore, a modified LSPE-like update has recently been proposed that converges
without requiring that Π ρ (BπQ (·)) is a contraction, see Bertsekas (2011a,b). These
types of results are important in practice because they pave the way for reusing
samples to evaluate different policies, that is, at different iterations of the overall PI
algorithm. In contrast, if the stationary distribution must be followed, new samples
have to be generated at each iteration, using the current policy.
⁶ A projection mapping Π^{ρ^π} applied to a function f w.r.t. a space F returns the closest
element in F to the function f, where the distance is defined according to the L2 norm
and the measure ρ^π.
⁷ A mapping f(x) is a contraction with factor γ < 1 if for any x, x', ‖f(x) − f(x')‖ ≤
γ‖x − x'‖. The mapping is a nonexpansion (a weaker property) if the inequality holds
for γ = 1.
Assuming now, for simplicity, that the stationary distribution ρ π is used, and thus
that the projected Bellman equation (3.19) has a unique solution (the projected Bell-
man operator is a contraction and it admits one unique fixed point), the following
informal, but intuitive line of reasoning is useful to understand the convergence of
the sample-based LSTD-Q and LSPE-Q. Asymptotically, as n → ∞, it is true that
(1/n) Â_n → A, (1/n) B̂_n → B, and (1/n) b̂_n → b, because the empirical distribution of the state-
action samples converges to ρ, while the empirical distribution of next-state samples
s' for each pair (s, a) converges to T(s, a, ·). Therefore, the practical, sample-based
LSTD-Q converges to its idealized version (3.19), and LSTD-Q asymptotically finds
the solution of the projected Bellman equation. Similarly, the sample-based LSPE-
Q asymptotically becomes equivalent to its idealized version (3.20), which is just
an incremental variant of (3.19) and will therefore produce the same solution in the
end. In fact, it can additionally be shown that, as n grows, the solutions of LSTD-Q
and LSPE-Q converge to each other faster than they converge to their limit, see Yu
and Bertsekas (2009).
Let us investigate now the quality of the solution. Under the stationary distribu-
tion ρ π , we have (Bertsekas, 2011a; Tsitsiklis and Van Roy, 1997):
π π 1 π π
Q − Q ≤ Q − Π ρ (Qπ ) π (3.29)
ρπ 1 − γ2 ρ
where Q̂^π is the Q-function given by the parameter θ that solves the projected Bell-
man equation (3.19) for ρ = ρ^π. Thus, we describe the representation power of
the approximator by the distance ‖Q^π − Π^{ρ^π}(Q^π)‖_{ρ^π} between the true Q-function
Q^π and its projection Π^{ρ^π}(Q^π). As the approximator becomes more powerful, this
distance decreases. Then, projected policy evaluation leads to an approximate Q-
function Q̂^π with an error proportional to this distance. The proportionality constant
depends on the discount factor γ, and grows as γ approaches 1. Recently, efforts have been made
to refine this result in terms of properties of the dynamics and of the set Q̂ of repre-
sentable Q-functions (Scherrer, 2010; Yu and Bertsekas, 2010).
The following relationship holds for any Q-function Q, see e.g. Scherrer (2010):
‖Q^π − Q‖_{ρ^π} ≤ (1/(1 − γ)) ‖Q − B^π_Q(Q)‖_{ρ^π}   (3.30)

with ρ^π the stationary distribution of π. Consider now the on-policy BRM solution
– the Q-function Q̂^π given by the parameter that solves the BRM equation (3.22)
for ρ = ρ^π. Because this Q-function minimizes the right-hand side of the inequality
(3.30), the error ‖Q^π − Q̂^π‖_{ρ^π} of the solution found is also small.
A general result about policy iteration can be given in terms of the infinity norm,
as follows. If the policy evaluation error ‖Q̂^{π_k} − Q^{π_k}‖_∞ is upper-bounded by ε at
every iteration k ≥ 0 (see again Algorithm 6), and if policy improvements are exact
(according to our assumptions in Section 3.2), then policy iteration eventually pro-
duces policies with a performance (i.e., the corresponding value function) that lies
within a bounded distance from the optimal performance (Bertsekas and Tsitsiklis,
1996; Lagoudakis and Parr, 2003a) (i.e., the optimal value function):
lim sup_{k→∞} ‖Q^{π_k} − Q*‖_∞ ≤ (2γ / (1 − γ)²) · ε   (3.31)
Here, Q∗ is the optimal Q-function and corresponds to the optimal performance, see
Chapter 1. Note that if approximate policy improvements are performed, a similar
bound holds, but the policy improvement error must also be included in the right
hand side.
An important remark is that the sequence of policies is generally not guaran-
teed to converge to a fixed policy. For example, the policy may end up oscillating
along a limit cycle. Nevertheless, all the policies along the cycle will have a high
performance, in the sense of (3.31).
Note that (3.31) uses infinity norms, whereas the bounds (3.29) and (3.30) for the
policy evaluation component use Euclidean norms. The two types of bounds cannot
be easily combined to yield an overall bound for approximate policy iteration. Policy
iteration bounds for Euclidean norms, which we will not detail here, were developed
by Munos (2003).
Consider now optimistic variants of policy iteration, such as online LSPI. The
performance guarantees above rely on small policy evaluation errors, whereas in the
optimistic case, the policy is improved before an accurate value function is avail-
able, which means the policy evaluation error can be very large. For this reason, the
behavior of optimistic policy iteration is theoretically poorly understood at the mo-
ment, although the algorithms often work well in practice. See Bertsekas (2011a)
and Section 6.4 of Bertsekas and Tsitsiklis (1996) for discussions of the difficulties
involved.
The results reported in the previous section analyze the asymptotic performance of
policy evaluation methods when the number of samples tends to infinity. Nonethe-
less, they do not provide any guarantee about how the algorithms behave when only
a finite number of samples is available. In this section we report recent finite-sample
bounds for LSTD and BRM and we discuss how they propagate through iterations
in the policy iteration scheme.
While in the previous sections we focused on algorithms for the approximation
of Q-functions, for the sake of simplicity we report here the analysis for V-function ap-
proximation. The notation and the setting are exactly the same as in Section 3.3.2 and
we simply redefine them for V-functions. We use a linear approximation architecture
with parameters θ ∈ R^d and basis functions ϕ_i, i = 1, . . . , d, now defined as mappings
from the state space S to R. We denote by φ : S → R^d, φ(·) = [ϕ_1(·), . . . , ϕ_d(·)]ᵀ
the BF vector (feature vector), and by F the linear function space spanned by the
BFs ϕ_i, that is, F = { f_θ | θ ∈ R^d and f_θ(·) = φᵀ(·)θ }. We define F̃ as the space
obtained by truncating the functions in F at Vmax (recall that Vmax gives the max-
imum return and upper-bounds any value function). The truncated function f˜ is
equal to f (s) in all the states where | f (s)| ≤ Vmax and it is equal to sgn( f (s))Vmax
otherwise. Furthermore, let L be an upper bound for all the BFs, i.e., ‖ϕ_i‖_∞ ≤ L
for i = 1, . . . , d. In the following we report the finite-sample analysis of the policy
evaluation performance of LSTD-V and BRM-V, followed by the analysis of policy
iteration algorithms that use LSTD-V and BRM-V in the policy evaluation step. The
truncation to Vmax is used for LSTD-V, but not for BRM-V.
Pathwise LSTD
Let π be the current policy and V π its V-function. Let (st ,rt ) with t = 1, . . . ,n
be a sample path (trajectory) of size n generated by following the policy π and
Φ = [φᵀ(s_1); . . . ; φᵀ(s_n)] be the BF matrix defined at the encountered states, where
";" denotes a vertical stacking of the vectors φᵀ(s_t) in the matrix. Pathwise LSTD-
V is a version of LSTD-V obtained by defining an empirical transition matrix T̂ as
follows: T̂_{ij} = 1 if j = i + 1 (and i < n), otherwise T̂_{ij} = 0. When applied to a vector
s = [s_1, . . . , s_n]ᵀ, this transition matrix returns (T̂s)_t = s_{t+1} for 1 ≤ t < n and (T̂s)_n = 0.
The rest of the algorithm exactly matches the standard LSTD and returns a vector θ
as the solution of the linear system Aθ = γBθ + b, where A = ΦᵀΦ, B = ΦᵀT̂Φ,
and b = ΦᵀR, with R = [r_1, . . . , r_n]ᵀ the vector of observed rewards.
and b = Φ R . Similar to the arguments used in Section 3.6.1, it is easy to ver-
ify that the empirical transition matrix T̂ T results in an empirical Bellman operator
which is a contraction and thus that the previous system always admits at least one
solution. Although there exists a unique fixed point, there might be multiple solu-
tions θ . In the following, we use θ̂ to denote the solution with minimal norm, that is
3 Least-Squares Methods for Policy Iteration 99
Theorem 3.1. (Pathwise LSTD-V) Let ω > 0 be the smallest eigenvalue of the
Gram matrix G ∈ R^{d×d}, G_{ij} = ∫ ϕ_i(x) ϕ_j(x) ρ^π(dx). Assume that the policy
π induces a stationary β-mixing process (Meyn and Tweedie, 1993) on the MDP
at hand with stationary distribution ρ^π.⁹ Let (s_t, r_t) with t = 1, . . . , n be a path
generated by following policy π for n > nπ (ω ,δ ) steps, where nπ (ω ,δ ) is a suitable
number of steps depending on parameters ω and δ . Let θ̂ be the pathwise LSTD-V
solution and f˜θ̂ be the truncation of its corresponding function, then: 10
‖ f̃_θ̂ − V^π ‖_{ρ^π} ≤ (2 / √(1 − γ²)) [ 2√2 ‖V^π − Π^{ρ^π} V^π‖_{ρ^π} + Õ( L ‖θ*‖ √( d log(1/δ) / n ) ) ]
                     + (1 / (1 − γ)) Õ( V_max L √( d log(1/δ) / (ω n) ) )   (3.32)

with probability 1 − δ, where θ* is such that f_{θ*} = Π^{ρ^π} V^π.
We now consider the BRM-V algorithm, which finds V-functions. Similar to BRM-
Q (see Section 3.3.2) n samples (si , si,1 , ri,1 , si,2 , ri,2 ) are available, where si are
drawn i.i.d. from an arbitrary distribution μ , (si,1 , ri,1 ) and (si,2 , ri,2 ) are two indepen-
dent samples drawn from T (si ,π (si ), ·), and ri,1 , ri,2 are the corresponding rewards.
The algorithm works similarly to BRM-Q but now the BFs are functions of the state
only. Before reporting the bound on the performance of BRM-V, we introduce:
with probability 1 − δ, where ξ_π = ω (1 − γ)² / C^π(μ)². Furthermore, the approximation error
of V^π satisfies:

‖V^π − f_θ̂‖²_μ ≤ (C^π(μ) / (1 − γ)) (1 + γ ‖P^π‖_μ)² inf_{f∈F} ‖V^π − f‖²_μ
                 + Õ( (1/ξ_π²) L⁴ R²_max √( d log(1/δ) / n ) )
Policy Iteration
Theorem 3.3. For any δ > 0, whenever n ≥ n(δ/K), where n(δ/K) is a suitable num-
ber of samples, with probability 1 − δ , the empirical Bellman residual minimizer fθk
exists for all iterations 1 ≤ k < K, thus the BRM-PI algorithm is well defined, and
the performance V πK of the policy πK returned by the algorithm is such that:
‖V* − V^{π_K}‖_∞ ≤ Õ( (1 / (1 − γ)²) [ C^{3/2} E_BRM(F) + (√C / ξ) ( d log(K/δ) / n )^{1/4} + γ^{K/2} ] )
In this theorem, EBRM (F ) looks at the smallest approximation error for the V-
function of a policy π , and takes the worst case of this error across the set of policies
greedy in some approximate V-function from F . The second term is the estimation
error; it contains the same quantities as in Theorem 3.2 and it decreases as Õ((d/n)^{1/4}).
Finally, the last term γ K/2 simply accounts for the error due to the finite number of
iterations and it rapidly goes to zero as K increases.
Now we turn to pathwise-LSPI and report a performance bound for this algo-
rithm. At each iteration k of pathwise-LSPI, samples are collected by following a
single trajectory generated by the policy under evaluation, πk , and pathwise-LSTD
is used to compute an approximation of V πk . In order to plug in the pathwise-LSTD
bound (Theorem 3.1) in Eq. 3.34, one should first note that ‖V_k − B^{π_k}_V(V_k)‖_{ρ^{π_k}} ≤
(1 + γ) ‖V_k − V^{π_k}‖_{ρ^{π_k}}. This way, instead of the Bellman residual, we bound the per-
formance of the algorithm by using the approximation error Vk − V πk at each itera-
tion. It is also important to note that the pathwise-LSTD bound requires the samples
to be collected by following the policy under evaluation. This might introduce severe
problems in the sequence of iterations. In fact, if a policy concentrates too much on
a small portion of the state space, even if the policy evaluation is accurate on those
states, it might be arbitrarily bad on the states which are not covered by the current
policy. As a result, the policy improvement is likely to generate a bad policy which
could, in turn, lead to an arbitrarily bad policy iteration process. Thus, the following
assumption needs to be made here.
Assumption 2 (Lower-bounding distribution). Let μ be a distribution over the state
space. For any policy π that is greedy w.r.t. a function in the truncated space F˜ ,
μ (·) ≤ κρ π (·), where κ < ∞ is a constant and ρ π is the stationary distribution of
policy π .
It is also necessary to guarantee that with high probability a unique pathwise-LSTD
solution exists at each iteration of the pathwise-LSPI algorithm, thus, the following
assumption is needed.
Assumption 3 (Linear independent BFs). Let μ be the lower-bounding distribu-
tion from Assumption 2. We assume that the BFs φ (·) of the function space F are
linearly independent w.r.t. μ . In this case, the smallest eigenvalue ωμ of the Gram
matrix Gμ ∈ Rd×d w.r.t. μ is strictly positive.
Assumptions 2 and 3 (plus some minor assumptions on the characteristic parame-
ters of the β -mixing processes observed during the execution of the algorithm) are
sufficient to have a performance bound for pathwise-LSPI. However, in order to
make it easier to compare the bound with the one for BRM-PI, we also assume that
Assumption 1 holds for pathwise-LSPI.
Theorem 3.4. Let us assume that at each iteration k of the pathwise-LSPI algo-
rithm, a path of size n > n(ωμ ,δ ) is generated from the stationary β -mixing pro-
cess with stationary distribution ρk−1 = ρ πk−1 . Let V−1 ∈ F˜ be an arbitrary initial
V-function, V0 , . . . ,VK−1 (Ṽ0 , . . . , ṼK−1 ) be the sequence of V-functions (truncated
V-functions) generated by pathwise-LSPI after K iterations, and πK be the greedy
policy w.r.t. the truncated V-function Ṽ_{K−1}. Then, under Assumptions 1–3 and some
mild additional conditions, a performance bound on ‖V* − V^{π_K}‖_∞ analogous to that of
Theorem 3.3 holds, with an estimation error of order Õ((d/n)^{1/2}) and with the
approximation error E_LSTD(F) = sup_{π∈G(F̃)} inf_{f∈F} ‖f − V^π‖_{ρ^π}, where G(F̃) is the set of all greedy
policies w.r.t. the functions in F̃.
Note that the initial policy π0 is greedy in the V-function V−1 , rather than being
arbitrary; that is why we use the index −1 for the initial V-function.
Comparing the performance bounds of BRM-PI and pathwise-LSPI we first no-
tice that BRM-PI has a poorer estimation rate of O(n−1/4 ) instead of O(n−1/2 ).
We may also see that the approximation error term in BRM-PI, EBRM (F ), is less
complex than that for pathwise-LSPI, ELSTD (F ), as the norm in EBRM (F ) is only
w.r.t. the distribution μ while the one in ELSTD (F ) is w.r.t. the stationary distri-
bution of any policy in G (F˜ ). The assumptions used by the algorithms are also
different. In BRM-PI, it is assumed that a generative model is available, and thus,
the performance bounds may be obtained under any sampling distribution μ , specif-
ically the one for which Assumption 1 holds. On the other hand, at each iteration of
pathwise-LSPI, it is required to use a single trajectory generated by following the
policy under evaluation. This can provide performance bounds only under the sta-
tionary distribution of that policy, and accurately approximating the current policy
under the stationary distribution may not be enough in a policy iteration scheme, be-
cause the greedy policy w.r.t. that approximation may be arbitrarily poor. Therefore,
we may conclude that the performance of BRM-PI is better controlled than that of
pathwise-LSPI. This is reflected in the fact that the concentrability terms may be
controlled in the BRM-PI by only choosing a uniformly dominating distribution μ
(Assumption 1), such as a uniform distribution, while in pathwise-LSPI, we are re-
quired to make stronger assumptions on the stationary distributions of the policies
encountered at the iterations of the algorithm, such as being lower-bounded by a
uniform distribution (Assumption 2).
Many extensions and variations of the methods introduced in Section 3.3 have been
proposed, and we unfortunately do not have the space to describe them all. Instead,
in the present section, we will (non-exhaustively) touch upon some of the highlights
in this active field of research, providing pointers to the literature for the reader
interested in more details.
As previously mentioned in Footnote 2, variants of approximate policy eval-
uation that employ a multistep Bellman mapping can be used. This mapping is
parameterized by λ ∈ [0, 1), and is given, e.g., in the case of Q-functions by:
B^π_{Q,λ}(Q) = (1 − λ) Σ_{t=0}^{∞} λ^t (B^π_Q)^{t+1}(Q)
where (BπQ )t denotes the t-times composition of BπQ with itself. In this chapter, we
only considered the single-step case, in which λ = 0, but in fact approximate policy
evaluation is often discussed in the general λ ∈ [0, 1) case, see e.g. Nedić and Bert-
sekas (2003); Yu (2010); Bertsekas and Ioffe (1996); Yu and Bertsekas (2009) and
also the discussion of the so-called TD(λ ) algorithm in Chapter 1. Note that in the
original LSPI, a nonzero λ would prevent sample reuse; instead, at every iteration,
new samples would have to be generated with the current policy.
Nonparametric approximators are not predefined, but are automatically con-
structed from the data, so to a large extent they free the user from the difficult task of
designing the BFs. A prominent class of nonparametric techniques is kernel-based
approximation, which was combined, e.g., with LSTD by Xu et al (2005, 2007);
Jung and Polani (2007b); Farahmand et al (2009), with LSPE by Jung and Polani
(2007a,b), and with BRM by Farahmand et al (2009). The related framework of
Gaussian processes has also been used in policy evaluation (Rasmussen and Kuss,
2004; Engel et al, 2005; Taylor and Parr, 2009). In their basic form, the compu-
tational demands of kernel-based methods and Gaussian processes grow with the
number of samples considered. Since this number can be large in practice, many
of the approaches mentioned above employ kernel sparsification techniques to limit
the number of samples that contribute to the solution (Xu et al, 2007; Engel et al,
2003, 2005; Jung and Polani, 2007a,b).
Closely related to sparsification is the technique of regularization, which controls
the complexity of the solution (Farahmand et al, 2009; Kolter and Ng, 2009). For
instance, to obtain a regularized variant of LSTD, a penalty term can be added to
the projected policy evaluation problem (3.8) to obtain:
min_{Q̂∈Q̂} [ ‖Q̂ − Π(B^π_Q(Q̂))‖ + β ν(Q̂) ]

where ν(Q̂) penalizes the complexity of the solution and β > 0 is a regularization coefficient.
A related extension modifies the LSPE update with a scaling matrix Γ, iterating:

θ_{τ+1} = θ_τ − α Γ [(A − γB)θ_τ − b]
For appropriate choices of Γ and α , this algorithm converges under more general
conditions than the original LSPE.
In the context of BRM, Antos et al (2008) eliminated the requirement of double
sampling by modifying the minimization problem solved by BRM. They showed
that single-sample estimates of the coefficients in this modified problem are unbi-
ased, while the solution stays meaningful.
We also note active research into alternatives to least-squares methods for policy
evaluation and iteration, in particular, techniques based on gradient updates (Sutton
et al, 2009b,a; Maei et al, 2010) and on Monte Carlo simulations (Lagoudakis and
Parr, 2003b; Dimitrakakis and Lagoudakis, 2008; Lazaric et al, 2010a).
While an extensive tutorial such as this one, focusing on a unified view and
theoretical study of policy iteration with LSTD, LSPE, and BRM policy evalua-
tion, has not been available in the literature until now, these methods have been
treated in various recent books and surveys on reinforcement learning and dynamic
programming. For example, the interested reader should know that Chapter 3 of
(Buşoniu et al, 2010a) describes LSTD-Q and LSPE-Q as part of an introduction to
approximate reinforcement learning and dynamic programming, that (Munos, 2010)
touches upon LSTD and BRM, that Chapter 3 of (Szepesvári, 2010) outlines LSTD
and LSPE methods, and that (Bertsekas, 2011a, 2010) concern to a large extent these
two types of methods.
Chapter 7 of this book presents a more general view over the field of approximate
reinforcement learning, without focusing on least-squares methods.
References
Antos, A., Szepesvári, C., Munos, R.: Learning near-optimal policies with Bellman-residual
minimization based fitted policy iteration and a single sample path. Machine Learn-
ing 71(1), 89–129 (2008)
Baird, L.: Residual algorithms: Reinforcement learning with function approximation. In: Pro-
ceedings 12th International Conference on Machine Learning (ICML-1995), Tahoe City,
U.S, pp. 30–37 (1995)
Bertsekas, D.P.: A counterexample to temporal differences learning. Neural Computation 7,
270–279 (1995)
Bertsekas, D.P.: Approximate dynamic programming. In: Dynamic Programming and Opti-
mal Control, Ch. 6, vol. 2 (2010),
http://web.mit.edu/dimitrib/www/dpchapter.html
Bertsekas, D.P.: Approximate policy iteration: A survey and some new methods. Journal of
Control Theory and Applications 9(3), 310–335 (2011a)
Bertsekas, D.P.: Temporal difference methods for general projected equations. IEEE Trans-
actions on Automatic Control 56(9), 2128–2139 (2011b)
Bertsekas, D.P., Ioffe, S.: Temporal differences-based policy iteration and applications in
neuro-dynamic programming. Tech. Rep. LIDS-P-2349, Massachusetts Institute of Tech-
nology, Cambridge, US (1996),
http://web.mit.edu/dimitrib/www/Tempdif.pdf
Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific (1996)
Bertsekas, D.P., Borkar, V., Nedić, A.: Improved temporal difference methods with linear
function approximation. In: Si, J., Barto, A., Powell, W. (eds.) Learning and Approximate
Dynamic Programming. IEEE Press (2004)
Boyan, J.: Technical update: Least-squares temporal difference learning. Machine Learn-
ing 49, 233–246 (2002)
Bradtke, S.J., Barto, A.G.: Linear least-squares algorithms for temporal difference learning.
Machine Learning 22(1-3), 33–57 (1996)
Buşoniu, L., Babuška, R., De Schutter, B., Ernst, D.: Reinforcement Learning and Dynamic
Programming Using Function Approximators. In: Automation and Control Engineering.
Taylor & Francis, CRC Press (2010a)
Buşoniu, L., De Schutter, B., Babuška, R., Ernst, D.: Using prior knowledge to accelerate
online least-squares policy iteration. In: 2010 IEEE International Conference on Automa-
tion, Quality and Testing, Robotics (AQTR-2010), Cluj-Napoca, Romania (2010b)
Buşoniu, L., Ernst, D., De Schutter, B., Babuška, R.: Approximate dynamic programming
with a fuzzy parameterization. Automatica 46(5), 804–814 (2010c)
Buşoniu, L., Ernst, D., De Schutter, B., Babuška, R.: Online least-squares policy iteration
for reinforcement learning control. In: Proceedings 2010 American Control Conference
(ACC-2010), Baltimore, US, pp. 486–491 (2010d)
Dimitrakakis, C., Lagoudakis, M.: Rollout sampling approximate policy iteration. Machine
Learning 72(3), 157–171 (2008)
Engel, Y., Mannor, S., Meir, R.: Bayes meets Bellman: The Gaussian process approach to
temporal difference learning. In: Proceedings 20th International Conference on Machine
Learning (ICML-2003), Washington, US, pp. 154–161 (2003)
Engel, Y., Mannor, S., Meir, R.: Reinforcement learning with Gaussian processes. In:
Proceedings 22nd International Conference on Machine Learning (ICML-2005), Bonn,
Germany, pp. 201–208 (2005)
Ernst, D., Geurts, P., Wehenkel, L.: Tree-based batch mode reinforcement learning. Journal
of Machine Learning Research 6, 503–556 (2005)
Farahmand, A.M., Ghavamzadeh, M., Szepesvári, C.S., Mannor, S.: Regularized policy iter-
ation. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) Advances in Neural
Information Processing Systems, vol. 21, pp. 441–448. MIT Press (2009)
Geramifard, A., Bowling, M.H., Sutton, R.S.: Incremental least-squares temporal difference
learning. In: Proceedings 21st National Conference on Artificial Intelligence and 18th
Innovative Applications of Artificial Intelligence Conference (AAAI-2006), Boston, US,
pp. 356–361 (2006)
Geramifard, A., Bowling, M., Zinkevich, M., Sutton, R.S.: iLSTD: Eligibility traces & con-
vergence analysis. In: Schölkopf, B., Platt, J., Hofmann, T. (eds.) Advances in Neural
Information Processing Systems, vol. 19, pp. 440–448. MIT Press (2007)
Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins (1996)
Jung, T., Polani, D.: Kernelizing LSPE(λ ). In: Proceedings 2007 IEEE Symposium on
Approximate Dynamic Programming and Reinforcement Learning (ADPRL-2007),
Honolulu, US, pp. 338–345 (2007a)
Jung, T., Polani, D.: Learning RoboCup-keepaway with kernels. In: Gaussian Processes in
Practice, JMLR Workshop and Conference Proceedings, vol. 1, pp. 33–57 (2007b)
Kolter, J.Z., Ng, A.: Regularization and feature selection in least-squares temporal difference
learning. In: Proceedings 26th International Conference on Machine Learning (ICML-
2009), Montreal, Canada, pp. 521–528 (2009)
Konda, V.: Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology, Cam-
bridge, US (2002)
Lagoudakis, M., Parr, R., Littman, M.: Least-squares Methods in Reinforcement Learning
for Control. In: Vlahavas, I.P., Spyropoulos, C.D. (eds.) SETN 2002. LNCS (LNAI),
vol. 2308, pp. 249–260. Springer, Heidelberg (2002)
Lagoudakis, M.G., Parr, R.: Least-squares policy iteration. Journal of Machine Learning
Research 4, 1107–1149 (2003a)
Lagoudakis, M.G., Parr, R.: Reinforcement learning as classification: Leveraging modern
classifiers. In: Proceedings 20th International Conference on Machine Learning (ICML-
2003), Washington, US, pp. 424–431 (2003b)
Lazaric, A., Ghavamzadeh, M., Munos, R.: Analysis of a classification-based policy iteration
algorithm. In: Proceedings 27th International Conference on Machine Learning (ICML-
2010), Haifa, Israel, pp. 607–614 (2010a)
Lazaric, A., Ghavamzadeh, M., Munos, R.: Finite-sample analysis of LSTD. In: Proceed-
ings 27th International Conference on Machine Learning (ICML-2010), Haifa, Israel,
pp. 615–622 (2010b)
Li, L., Littman, M.L., Mansley, C.R.: Online exploration in least-squares policy iteration. In:
Proceedings 8th International Joint Conference on Autonomous Agents and Multiagent
Systems (AAMAS-2009), Budapest, Hungary, vol. 2, pp. 733–739 (2009)
Maei, H.R., Szepesvári, C., Bhatnagar, S., Sutton, R.S.: Toward off-policy learning control
with function approximation. In: Proceedings 27th International Conference on Machine
Learning (ICML-2010), Haifa, Israel, pp. 719–726 (2010)
Maillard, O.A., Munos, R., Lazaric, A., Ghavamzadeh, M.: Finite-sample analysis of Bellman
residual minimization. In: Proceedings 2nd Asian Conference on Machine Learning (ACML-2010),
JMLR Workshop and Conference Proceedings, vol. 13, pp. 299–314 (2010)
Meyn, S., Tweedie, R.L.: Markov chains and stochastic stability. Springer, Heidelberg (1993)
Moore, A.W., Atkeson, C.R.: The parti-game algorithm for variable resolution reinforcement
learning in multidimensional state-spaces. Machine Learning 21(3), 199–233 (1995)
Munos, R.: Error bounds for approximate policy iteration. In: Proceedings 20th International
Conference on Machine Learning (ICML-2003), Washington, US, pp. 560–567 (2003)
Munos, R.: Approximate dynamic programming. In: Markov Decision Processes in Artificial
Intelligence. Wiley (2010)
Munos, R., Szepesvári, C.S.: Finite time bounds for fitted value iteration. Journal of Machine
Learning Research 9, 815–857 (2008)
Nedić, A., Bertsekas, D.P.: Least-squares policy evaluation algorithms with linear func-
tion approximation. Discrete Event Dynamic Systems: Theory and Applications 13(1-2),
79–110 (2003)
Rasmussen, C.E., Kuss, M.: Gaussian processes in reinforcement learning. In: Thrun, S.,
Saul, L.K., Schölkopf, B. (eds.) Advances in Neural Information Processing Systems,
vol. 16. MIT Press (2004)
Scherrer, B.: Should one compute the Temporal Difference fix point or minimize the Bellman
Residual? the unified oblique projection view. In: Proceedings 27th International Confer-
ence on Machine Learning (ICML-2010), Haifa, Israel, pp. 959–966 (2010)
Schweitzer, P.J., Seidmann, A.: Generalized polynomial approximations in Markovian de-
cision processes. Journal of Mathematical Analysis and Applications 110(2), 568–582
(1985)
Sutton, R., Maei, H., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C.S., Wiewiora, E.:
Fast gradient-descent methods for temporal-difference learning with linear function ap-
proximation. In: Proceedings 26th International Conference on Machine Learning (ICML-
2009), Montreal, Canada, pp. 993–1000 (2009a)
Sutton, R.S.: Learning to predict by the method of temporal differences. Machine Learning 3,
9–44 (1988)
Sutton, R.S., Szepesvári, C.S., Maei, H.R.: A convergent O(n) temporal-difference algorithm
for off-policy learning with linear function approximation. In: Koller, D., Schuurmans, D.,
Bengio, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems, vol. 21,
pp. 1609–1616. MIT Press (2009b)
Szepesvári, C.S.: Algorithms for Reinforcement Learning. Morgan & Claypool Publishers
(2010)
Taylor, G., Parr, R.: Kernelized value function approximation for reinforcement learning.
In: Proceedings 26th International Conference on Machine Learning (ICML-2009), Mon-
treal, Canada, pp. 1017–1024 (2009)
Thiery, C., Scherrer, B.: Least-squares λ policy iteration: Bias-variance trade-off in control
problems. In: Proceedings 27th International Conference on Machine Learning (ICML-
2010), Haifa, Israel, pp. 1071–1078 (2010)
Tsitsiklis, J.N.: On the convergence of optimistic policy iteration. Journal of Machine Learn-
ing Research 3, 59–72 (2002)
Tsitsiklis, J.N., Van Roy, B.: An analysis of temporal difference learning with function
approximation. IEEE Transactions on Automatic Control 42(5), 674–690 (1997)
Xu, X., Xie, T., Hu, D., Lu, X.: Kernel least-squares temporal difference learning. Interna-
tional Journal of Information Technology 11(9), 54–63 (2005)
Xu, X., Hu, D., Lu, X.: Kernel-based least-squares policy iteration for reinforcement learning.
IEEE Transactions on Neural Networks 18(4), 973–992 (2007)
Yu, H.: Convergence of least squares temporal difference methods under general conditions.
In: Proceedings 27th International Conference on Machine Learning (ICML-2010), Haifa,
Israel, pp. 1207–1214 (2010)
Yu, H., Bertsekas, D.P.: Convergence results for some temporal difference methods based on
least squares. IEEE Transactions on Automatic Control 54(7), 1515–1531 (2009)
Yu, H., Bertsekas, D.P.: Error bounds for approximations from projected linear equations.
Mathematics of Operations Research 35(2), 306–329 (2010)
Chapter 4
Learning and Using Models
Todd Hester
Department of Computer Science, The University of Texas at Austin, 1616 Guadalupe,
Suite 2.408, Austin, TX 78701
e-mail: [email protected]
Peter Stone
Department of Computer Science, The University of Texas at Austin, 1616 Guadalupe,
Suite 2.408, Austin, TX 78701
e-mail: [email protected]
4.1 Introduction
The reinforcement learning (RL) methods described in the book thus far have been
model-free methods, where the algorithm updates its value function directly from
experience in the domain. Model-based methods (or indirect methods), however,
perform their updates from a model of the domain, rather than from experience
in the domain itself. Instead, the model is learned from experience in the domain,
and then the value function is updated by planning over the learned model. This
sequence is shown in Figure 4.1. This planning can take the form of simply running
a model-free method on the model, or it can be a method such as value iteration or
Monte Carlo Tree Search.
The models learned by these methods can vary widely. Models can be learned
entirely from scratch, the structure of the model can be given so that only parameters
need to be learned, or a nearly complete model can be provided. If the algorithm can
learn an accurate model quickly enough, model-based reinforcement learning can be
more sample efficient (take fewer actions to learn) than model-free methods. Once
an accurate model is learned, an optimal policy can be planned without requiring
any additional experiences in the world. For example, when an agent first discovers
a goal state, its value function can be updated immediately by planning over the
updated model that now represents that goal. A model-free method, by contrast, would have
to follow trajectories to the goal many times for the values to propagate all the way
back to the start state. This higher sample efficiency typically comes at the cost of
more computation for learning the model and planning a policy and more space to
represent the model.
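To make this learn-then-plan loop concrete, the following is a minimal sketch, in Python, of a tabular model-based agent: it keeps maximum-likelihood estimates of the transition and reward functions, re-plans with value iteration after every model update, and then acts (mostly) greedily. It is not any particular algorithm from this chapter; the environment interface (reset/step) and the ε-greedy action choice are illustrative assumptions.

```python
import random
from collections import defaultdict

class TabularModel:
    """Maximum-likelihood model: empirical transition counts and average rewards."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s2: count}
        self.rsum = defaultdict(float)                       # (s, a) -> summed reward
        self.n = defaultdict(int)                            # (s, a) -> visit count

    def update(self, s, a, r, s2):
        self.counts[(s, a)][s2] += 1
        self.rsum[(s, a)] += r
        self.n[(s, a)] += 1

    def transition(self, s, a):
        total = self.n[(s, a)]
        return {s2: c / total for s2, c in self.counts[(s, a)].items()} if total else {}

    def reward(self, s, a):
        return self.rsum[(s, a)] / self.n[(s, a)] if self.n[(s, a)] else 0.0

def value_iteration(model, states, actions, gamma=0.95, iters=200):
    """Plan on the learned model with standard value iteration."""
    V = {s: 0.0 for s in states}
    q = lambda s, a: model.reward(s, a) + gamma * sum(
        p * V[s2] for s2, p in model.transition(s, a).items())
    for _ in range(iters):
        for s in states:
            V[s] = max(q(s, a) for a in actions)
    return q

def model_based_agent(env, states, actions, episodes=50, eps=0.1):
    """Learn a model from real experience, re-plan on it, and act (mostly) greedily.
    `env` is assumed to expose reset() -> s and step(a) -> (s2, r, done)."""
    model = TabularModel()
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            q = value_iteration(model, states, actions)       # plan on the current model
            a = random.choice(actions) if random.random() < eps \
                else max(actions, key=lambda a_: q(s, a_))
            s2, r, done = env.step(a)
            model.update(s, a, r, s2)                         # learn the model from experience
            s = s2
    return model
```

Note that in this sketch all of the planning cost is paid on every real step, which is exactly the computation-for-samples trade-off discussed above.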
Fig. 4.1 Model-based RL agents use their experiences to first learn a model of the domain,
and then use this model to compute their value function

Another advantage of models is that they provide an opportunity for the agent to
perform targeted exploration. The agent can plan a policy using its model to drive
the agent to explore particular states; these states can be states it has not visited or is
uncertain about. A key to learning a model quickly is acquiring the right experiences
needed to learn the model (similar to active learning). Various methods for exploring
in this way exist, leading to fast learning of accurate models, and thus high sample
efficiency.
The main components of a model-based RL method are the model, which is de-
scribed in Section 4.2, and the planner which plans a policy on the model, described
in Section 4.3. Section 4.4 discusses how to combine models and planners into a
complete agent. The sample complexity of the algorithm (explained in Section 4.5)
is one of the main performance criteria that these methods are evaluated on. We
examine algorithms for factored domains in Section 4.6. Exploration is one of the
main focuses for improving sample complexity and is explained in Section 4.7. We
discuss extensions to continuous domains in Section 4.8, examine the performance
of some algorithms empirically in Section 4.9, and look at work on scaling up to
larger and more realistic problems in Section 4.10. Finally, we conclude the chapter
in Section 4.11.
Table 4.1 This table shows an example of the model learned after 50 experiences in the
domain shown in Figure 4.2
algorithms. Random forests are used to model the relative transition effects in (Hes-
ter and Stone, 2010). Jong and Stone (2007) make predictions for a given state-
action pair based on the average of the relative effects of its nearest neighbors.
One of the earliest model-based RL methods used locally weighted regression
(LWR) to learn models (Schaal and Atkeson, 1994; Atkeson et al, 1997). All ex-
periences of the agent were saved in memory, and when a given state-action was
queried, a locally weighted regression model was formed to provide a prediction
of the average next state for the queried state-action pair. Combined with a unique
exploration mechanism, this algorithm was able to learn to control a robot juggling.
4.3 Planning
Once the agent has learned an approximate model of the domain dynamics, the
model can be used to learn an improved policy. Typically, the agent would re-
plan a policy on the model each time it changes. Calculating a policy based on a
model is called planning. These methods can also be used for planning a policy on
a provided model, rather than planning while learning the model online. One option
for planning on the model is to use the dynamic programming methods described
in Chapter 1, such as Value Iteration or Policy Iteration. Another option is to use
Monte Carlo methods, and particularly Monte Carlo Tree Search (MCTS) methods,
described below. The main difference between the two classes of methods is that
the dynamic programming methods compute the value function for the entire state
space, while MCTS focuses its computation on the states that the agent is likely to
encounter soon.
Chapter 1 describes how to use Monte Carlo methods to compute a policy while
interacting with the environment. These methods can be used in a similar way on
experiences simulated using a learned model. The methods simulate a full trajectory
of experience until the end of an episode or to a maximum search depth. Each sim-
ulated trajectory is called a roll-out. They then update the values of states along that
trajectory towards the discounted sum of rewards received after visiting that state.
These methods require only a generative model of the environment rather than a full
distribution of next states. A variety of MCTS methods exist which vary in how they
choose actions at each state in their search.
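As a minimal sketch of this idea, the value of a state under a fixed roll-out policy can be estimated by averaging the discounted returns of simulated trajectories; the generative-model interface sample_next(s, a) -> (s2, r, done) used below is an illustrative assumption, not an interface defined in this chapter.

```python
def rollout_return(sample_next, policy, s, gamma=0.99, max_depth=200):
    """Simulate one roll-out from s on the (learned) generative model and
    return the discounted sum of rewards received along the trajectory."""
    ret, discount = 0.0, 1.0
    for _ in range(max_depth):
        a = policy(s)
        s, r, done = sample_next(s, a)     # one sampled next state per step is enough
        ret += discount * r
        discount *= gamma
        if done:
            break
    return ret

def monte_carlo_value(sample_next, policy, s, n_rollouts=100, gamma=0.99):
    """Estimate V(s) under `policy` by averaging independent roll-out returns."""
    return sum(rollout_return(sample_next, policy, s, gamma)
               for _ in range(n_rollouts)) / n_rollouts
```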
Monte Carlo Tree Search methods build a tree of visited state-action pairs out
from the current state, shown in Figure 4.3. This tree enables the algorithm to re-
use information at previously visited states. In vanilla MCTS, the algorithm takes
greedy actions to a specified depth in the tree, and then takes random actions from
there until the episode terminates. This tree search focuses value updates on the
states that MCTS visits between the agent’s current state and the end of the roll-out,
which can be more efficient than planning over the entire state space as dynamic
programming methods do. We will discuss a few variants of MCTS that vary in how
they select actions at each state to efficiently find a good policy.

Fig. 4.3 This figure shows a Monte Carlo Tree Search roll-out from the agent’s current state
out to a depth of 2. The MCTS methods simulate a trajectory forward from the agent’s current
state. At each state, they select some action, leading them to a next state one level deeper in
the tree. After rolling out to a terminal state or a maximum depth, the values of the actions
selected are updated towards the rewards received following them on that trajectory.
Sparse sampling (Kearns et al, 1999) was a pre-cursor to the MCTS planning
methods discussed in this chapter. The authors determine the number, C, of samples
of next states required to accurately estimate the value of a given state-action, (s,a),
based on the maximum reward in the domain, Rmax , and the discount factor, γ . They
also determine the horizon h to which the search must extend, given the discount factor γ.
The estimate of the value of a state-action pair at depth t is based on C samples of
the next states at depth t + 1. The value of each of these next states is based on C
samples of each of the actions at that state, and so on up to horizon h. Instead of
sampling trajectories one at a time like MCTS does, this method expands the tree
one level deeper each step until reaching the calculated horizon. The algorithm’s
running time is O((|A|C)^h). As a sampling-based planning method that is proven to
converge to accurate values, sparse sampling provides an important theoretical basis
for the MCTS methods that follow.
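A recursive sketch of the sparse sampling estimate is given below, again assuming a generative model sample_next(s, a) -> (s2, r, done); in the original algorithm the width C and horizon are derived from Rmax, γ, and the desired accuracy rather than chosen by hand.

```python
def sparse_sampling_q(sample_next, actions, s, a, depth, horizon, C, gamma=0.99):
    """Estimate Q(s, a) from C sampled next states, recursing one level deeper until
    the horizon. The number of generative-model calls grows as O((|A| * C) ** horizon)."""
    if depth >= horizon:
        return 0.0
    total = 0.0
    for _ in range(C):
        s2, r, done = sample_next(s, a)
        if done:
            total += r
            continue
        # Value of the sampled next state: max over actions, estimated one level deeper.
        total += r + gamma * max(
            sparse_sampling_q(sample_next, actions, s2, a2, depth + 1, horizon, C, gamma)
            for a2 in actions)
    return total / C

def sparse_sampling_action(sample_next, actions, s, horizon, C, gamma=0.99):
    """Greedy action at the root, using the recursive sparse sampling estimates."""
    return max(actions, key=lambda a: sparse_sampling_q(
        sample_next, actions, s, a, 0, horizon, C, gamma))
```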
UCT (Kocsis and Szepesvári, 2006) is an MCTS method that improves upon
the running time of Sparse Sampling by focusing its samples on the most promising
actions. UCT rolls out from the current state to the end of the episode or search depth,
selecting actions based on upper confidence bounds computed with the UCB1 algorithm
(Auer et al, 2002). The algorithm
maintains a count, C(s,d), of visits to each state at a given depth in the search, d,
as well as a count, C(s,a,d), of the number of times action a was taken from that
state at that depth. These counts are used to calculate the upper confidence bound
to select the action. The action selected at each step is calculated with the following
equation (where C_p is an appropriate constant based on the range of rewards in the
domain):

$$ a = \arg\max_{a} \left[ Q_d(s,a) + 2 C_p \sqrt{\frac{\log C(s,d)}{C(s,a,d)}} \right] $$
By selecting actions using the upper tail of the confidence interval, the algorithm
mainly samples good actions, while still exploring when other actions have a higher
upper confidence bound. Algorithm 10 shows pseudo-code for the UCT algorithm,
which is run from the agent’s current state, s, with a depth, d, of 0. The algorithm is
provided with a learning rate, α, and the range of one-step rewards in the domain,
r_range. Line 6 of the algorithm recursively calls the UCT method, to sample an action
at the next state one level deeper in the search tree. Modified versions of UCT have
had great success in the world of Go algorithms as a planner with the model of the
game already provided (Wang and Gelly, 2007). UCT is also used as the planner
inside several model-based reinforcement learning algorithms (Silver et al, 2008;
Hester and Stone, 2010).
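Since the pseudo-code referred to as Algorithm 10 is not reproduced here, the following is a generic UCT-style sketch along the lines of the description above: roll-outs are simulated on the model, actions are selected by the UCB1 rule, and Q_d(s, a) is maintained as a running average of roll-out returns (a simplification of the learning-rate update mentioned in the text). The generative-model interface is again an illustrative assumption.

```python
import math, random
from collections import defaultdict

class UCT:
    """UCT-style search: simulated roll-outs from the current state, UCB1 selection."""
    def __init__(self, sample_next, actions, gamma=0.99, cp=1.0, max_depth=50):
        self.sample_next, self.actions = sample_next, actions
        self.gamma, self.cp, self.max_depth = gamma, cp, max_depth
        self.Q = defaultdict(float)      # (s, a, d) -> action value at depth d
        self.Nsa = defaultdict(int)      # (s, a, d) -> visit count C(s, a, d)
        self.Ns = defaultdict(int)       # (s, d)    -> visit count C(s, d)

    def select(self, s, d):
        # Try each action once before trusting the confidence bounds.
        untried = [a for a in self.actions if self.Nsa[(s, a, d)] == 0]
        if untried:
            return random.choice(untried)
        return max(self.actions, key=lambda a: self.Q[(s, a, d)] +
                   2 * self.cp * math.sqrt(math.log(self.Ns[(s, d)]) / self.Nsa[(s, a, d)]))

    def rollout(self, s, d=0):
        if d >= self.max_depth:
            return 0.0
        a = self.select(s, d)
        s2, r, done = self.sample_next(s, a)          # simulate one step on the model
        ret = r if done else r + self.gamma * self.rollout(s2, d + 1)
        # Update counts and move Q towards the observed return (running average).
        self.Ns[(s, d)] += 1
        self.Nsa[(s, a, d)] += 1
        self.Q[(s, a, d)] += (ret - self.Q[(s, a, d)]) / self.Nsa[(s, a, d)]
        return ret

    def plan(self, s, n_rollouts=500):
        for _ in range(n_rollouts):
            self.rollout(s)
        return max(self.actions, key=lambda a: self.Q[(s, a, 0)])
```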
Having separately introduced the ideas of model learning (Section 4.2) and plan-
ning (Section 4.3), we now discuss the challenges that arise when combining these
two together into a full model-based method, and how these challenges have been
addressed.
There are a number of ways to combine model learning and planning. Typically, as
the agent interacts with the environment, its model gets updated at every time step
with the latest transition, ⟨s, a, r, s′⟩. Each time the model is updated, the algorithm
re-plans on it with its planner (as shown in Figure 4.4). This approach is taken by
many algorithms (Brafman and Tennenholtz, 2001; Hester and Stone, 2009; Degris
et al, 2006). However, due to the computational complexity of learning the model
and planning on it, it is not always feasible.
Another approach is to do model updates and planning in batch mode, only pro-
cessing them after every k actions, an approach taken in (Deisenroth and Rasmussen,
2011). However, this approach means that the agent takes long pauses between
some actions while performing batch updates, which may not be acceptable in some
problems.
DYNA (Sutton, 1990, 1991) is a reactive RL architecture. In it, each update starts
from either a real action in the world or a saved experience. Unlike value iteration,
where Bellman updates are performed on all the states by iterating over the state
space, here planning updates are performed on randomly selected state-action pairs.
The algorithm updates the action-values for the randomly selected state-action using
the Bellman equations, thus updating its policy. In this way, the real actions require
only a single action-value update, while many model-based updates can take place
in between the real actions. While the DYNA framework separates the model updates
from the real action loop, it still requires many model-based updates for the policy
to become optimal with respect to its model.
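A minimal Dyna-Q-style sketch of this interleaving is shown below; the last-outcome deterministic model and the env interface are simplifications for illustration rather than part of the original DYNA architecture.

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=100, n_planning=20, alpha=0.1, gamma=0.95, eps=0.1):
    """Dyna-style learning: each real step does one Q-learning update and stores the
    transition in a simple (last-seen, deterministic) model; n_planning simulated
    updates are then applied to randomly chosen previously seen state-action pairs."""
    Q = defaultdict(float)
    model = {}                                            # (s, a) -> (r, s')
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < eps else \
                max(actions, key=lambda a_: Q[(s, a_)])
            s2, r, done = env.step(a)
            # Single value update from the real experience.
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, a_)] for a_ in actions) - Q[(s, a)])
            model[(s, a)] = (r, s2)
            # Many cheap model-based updates in between real actions.
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, a_)] for a_ in actions)
                                        - Q[(ps, pa)])
            s = s2
    return Q
```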
Prioritized Sweeping (Moore and Atkeson, 1993) (described in Chapter 1) im-
proves upon the DYNA idea by selecting which state-action pairs to update based on
priority, rather than selecting them randomly. It updates the state-action pairs in or-
der, based on the expected change in their value. Instead of iterating over the entire
state space, prioritized sweeping updates values propagating backwards across the
state space from where the model changed.
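The following sketch illustrates the priority-queue idea behind prioritized sweeping, assuming a learned model of the form model[s][a] = (reward, {next_state: probability}) and a predecessor map; both data structures are illustrative assumptions.

```python
import heapq
from itertools import count

def prioritized_sweeping(model, V, predecessors, changed_state,
                         gamma=0.95, theta=1e-3, max_updates=200):
    """Propagate value changes backwards from `changed_state`: states whose backed-up
    value would change the most are updated first, instead of sweeping the whole
    state space as value iteration does."""
    def backup(s):
        return max(r + gamma * sum(p * V[s2] for s2, p in dist.items())
                   for (r, dist) in model[s].values())

    tie = count()   # tie-breaker so the heap never has to compare states directly
    queue = [(-abs(backup(changed_state) - V[changed_state]), next(tie), changed_state)]
    for _ in range(max_updates):
        if not queue:
            break
        _, _, s = heapq.heappop(queue)
        V[s] = backup(s)
        for (sp, _a) in predecessors[s]:          # state-actions that can lead into s
            delta = abs(backup(sp) - V[sp])
            if delta > theta:
                heapq.heappush(queue, (-delta, next(tie), sp))
    return V
```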
Fig. 4.4 Typically, model-based agents interleave model learning and planning sequentially,
first completing an update to the model, and then planning on the updated model to compute
a policy
Fig. 4.5 The parallel architecture for real-time model-based RL proposed by Hester et al
(2011). There are three separate parallel threads for model learning, planning, and acting.
Separating model learning and planning from action selection enables actions to be taken at
the desired rate, regardless of the time taken for model learning or planning.
2 Q - LEARNING , which was developed before E 3 , was not proved to converge in polynomial
time until after the development of E 3 (Even-dar and Mansour, 2001).
3 Source code for R - MAX is available at: http://www.ros.org/wiki/rl_agent
the KWIK framework, when queried about a particular state-action pair, it must al-
ways either make an accurate prediction, or reply “I don’t know” and request a label
for that example. KWIK algorithms can be used as the model learning methods in an
RL setting, as the agent can be driven to explore the states where the model replies
“I don’t know” to improve its model quickly. The drawback of KWIK algorithms
is that they often require a large number of experiences to guarantee an accurate
prediction when not saying “I don’t know.”
(a) Taxi Domain. (b) DBN Transition Structure.
Fig. 4.6 4.6a shows the Taxi domain. 4.6b shows the DBN transition model for this domain.
Here the x feature in the next state s′ depends only on the x and y features in state s, and the y
feature depends only on the previous y. The passenger’s destination depends only on
its previous destination, and her current location depends on her location at the previous step
as well as the taxi’s x,y location at the previous step.
dependent on some subset of features from the previous state and action. The fea-
tures that a given state feature are dependent on are called its parents. The maximum
number of parents that any of the state features has is called the maximum in-degree
of the DBN. When using a DBN transition model, it is assumed that each feature
transitions independently of the others. These separate transition probabilities can
be combined into a prediction of the entire state transition with the following
equation:

$$ P(s' \mid s,a) = T(s,a,s') = \prod_{i=0}^{n} P(x_i' \mid s,a) $$
Learning the structure of this DBN transition model is known as the structure learn-
ing problem. Once the structure of the DBN is learned, the conditional probabilities
for each edge must be learned. Typically these probabilities are stored in a condi-
tional probability table, or CPT.
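Concretely, given one learned predictor per state feature (a CPT lookup, a decision tree, or any other model), the joint transition probability is obtained by multiplying the per-feature predictions. The per_feature_models interface in the sketch below is an illustrative assumption.

```python
def factored_transition_prob(per_feature_models, s, a, s2):
    """DBN-style factored prediction: P(s'|s,a) is the product of each feature's
    independent prediction P(x_i'|s,a). Each per-feature model is assumed to map
    (s, a) to a dict {next_value: probability} for its own feature."""
    prob = 1.0
    for i, feature_model in enumerate(per_feature_models):
        dist = feature_model(s, a)      # e.g. a CPT lookup keyed by the feature's parents
        prob *= dist.get(s2[i], 0.0)
    return prob
```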
Figure 4.6 shows an example DBN for the Taxi domain (Dietterich, 1998). Here
the agent’s state is made up of four features: its x and y location, the passenger’s
location, and the passenger’s destination. The agent’s goal is to navigate the taxi to
the passenger, pick up the passenger, navigate to her destination, and drop off the
passenger. The y location of the taxi is only dependent on its previous y location,
and not its x location or the current location or destination of the passenger. Because
of the vertical walls in the domain, the x location of the taxi is dependent on both x
and y. If this structure is known, it makes the model learning problem much easier,
as the same model for the transition of the x and y variables can be used for any
possible value of the passenger’s location and destination.
Model-based RL methods for factored domains vary in the amount of informa-
tion and assumptions given to the agent. Most assume a DBN transition model,
and that the state features do transition independently. Some methods start with no
knowledge of the model and must first learn the structure of the DBN and then
learn the probabilities. Other methods are given the structure and must simply learn
the probabilities associated with each edge in the DBN. We discuss a few of these
variations below.
The DBN -E 3 algorithm (Kearns and Koller, 1999) extends the E 3 algorithm to
factored domains where the structure of the DBN model is known. With the structure
of the DBN already given, the algorithm must learn the probabilities associated
with each edge in the DBN. The algorithm is able to learn a near-optimal policy
in a number of actions polynomial in the number of parameters of the DBN-MDP,
which can be exponentially smaller than the number of total states.
Similar to the extension of E 3 to DBN -E 3 , R - MAX can be extended to FACTORED -
R - MAX for factored domains where the structure of the DBN transition model is
given (Guestrin et al, 2002). This method achieves the same sample complexity
bounds as the DBN -E 3 algorithm, while also maintaining the implementation and
simplicity advantages of R - MAX over E 3 .
Structure Learning Factored R - MAX (SLF - R - MAX) (Strehl et al, 2007) applies
an R - MAX type approach to factored domains where the structure of the DBN is
not known. It learns the structure of the DBN as well as the conditional probabil-
ities when given the maximum in-degree of the DBN. The algorithm enumerates
all possible combinations of input features as elements and then creates counters
to measure which elements are relevant. The algorithm makes predictions when a
relevant element is found for a queried state; if none is found, the state is considered
unknown. Similar to R - MAX, the algorithm gives a bonus of Rmax to unknown states
in value iteration to encourage the agent to explore them. The sample complexity of
the algorithm is highly dependent on the maximum in-degree, D, of the DBN. With
probability at least 1 − δ , the SLF - R - MAX algorithm’s policy is ε -optimal except for
at most k time steps, where:
$$ k = O\!\left( \frac{n^{3+2D}\, A\, D\, \ln\!\left(\frac{nA}{\delta}\right) \ln\!\left(\frac{1}{\delta}\right) \ln\!\left(\frac{1}{\varepsilon(1-\gamma)}\right)}{\varepsilon^{3}(1-\gamma)^{6}} \right) $$
Here, n is the number of factors in the domain, A is the number of actions, D is the
maximum in-degree of the DBN, and γ is the discount factor.
Diuk et al (2009) improve upon the sample complexity of SLF - R - MAX with the
k-Meteorologists R - MAX (MET- R - MAX) algorithm by introducing a more efficient
algorithm for determining which input features are relevant for its predictions. It
achieves this improved efficiency by using the mean squared error of the predictors
based on different DBNs. This improves the sample complexity for discovering the
structure of the DBN from O(n^{2D}) to O(n^D). The overall sample complexity bound
for this algorithm is the best known bound for factored domains.
Chakraborty and Stone (2011) present a similar approach that does not require
knowledge of the in-degree of the DBN called Learn Structure and Exploit with
R - MAX ( LSE - R - MAX ). It takes an alternative route to solving the structure learn-
ing problem in comparison to MET- R - MAX by assuming knowledge of a planning
horizon that satisfies certain conditions, rather than knowledge of the in-degree.
With this assumption, it solves the structure learning problem in sample complexity
bounds which are competitive with MET- R - MAX, and it performs better empirically
in two test domains.
Decision trees present another approach to the structure learning problem. They
are able to naturally learn the structure of the problem by using information gain
to determine which features are useful to split on to make predictions. In addition,
they can generalize more than strict DBN models can. Even for state features that
are parents of a given feature, the decision tree can decide that in portions of the
state space, that feature is not relevant. For example, in taxi, the location of the
passenger is only dependent on the location of the taxi when the pickup action is
being performed. In all other cases, the location of the passenger can be predicted
while ignoring the taxi’s current location.
Decision trees are used to learn models in factored domains in the SPITI algo-
rithm (Degris et al, 2006). The algorithm learns a decision tree to predict each
state feature in the domain. It plans on this model using Structured Value Itera-
tion (Boutilier et al, 2000) and uses ε -greedy exploration. The generalization in its
model gives it better sample efficiency than many methods using tabular or DBN
models in practice. However, there are no guarantees that the decision tree will fully
learn the correct transition model, and therefore no theoretical bounds have been
proven for its sample efficiency.
RL - DT is another approach using decision trees in its model that attempts to im-
prove upon SPITI by modeling the relative transitions of states and using a different
exploration policy (Hester and Stone, 2009). By predicting the relative change in
each feature, rather than its absolute value, the tree models are able to make better
predictions about the transition dynamics for unseen states. In addition, the algo-
rithm uses a more directed exploration scheme, following R - MAX type exploration
of driving the agent to states with few visits until the agent finds a state with reward
near Rmax , at which point it switches to exploiting the policy computed using its
model. This algorithm has been shown to be effective on gridworld tasks such as
Taxi, as well as on humanoid robots learning to score penalty kicks (Hester et al,
2010).
Figure 4.7 shows an example decision tree predicting the relative change in the
x variable of the agent in the given gridworld domain. The decision tree is able to
split on both the actions and the state of the agent, allowing it to split the state space
up into regions where the transition dynamics are the same. Each leaf of the tree
can make probabilistic predictions based on the ratio of experienced outcomes in
that leaf. The grid is colored to match the leaves on the left side of the tree, making
predictions for when the agent takes the east action. The tree is built on-line while
the agent is acting in the MDP. At the start, the tree will be empty, and will slowly
be refined over time. The tree will make predictions about broad parts of the state
space at first, such as what the EAST or WEST actions do, and eventually refine
itself to have leaves for individual states where the transition dynamics differ from
the global dynamics.
(a) Two room gridworld domain. (b) Decision tree model predicting the change
in the x feature (Δ x) based on the current state
and action.
Fig. 4.7 This figure shows the decision tree model learned to predict the change in the x
feature (or Δ x). The two room gridworld is colored to match the corresponding leaves of the
left side of the tree where the agent has taken the east action. Each rectangle represents a split
in the tree and each rounded rectangle represents a leaf of the tree, showing the probabilities
of a given value for Δ x. For example, if the action is east and x = 14 we fall into the green
leaf on the left, where the probability of Δ x being 0 is 1.
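The tree model of Figure 4.7 can be mimicked with an off-the-shelf batch decision tree, here scikit-learn's DecisionTreeClassifier, used as a stand-in for the incremental trees of SPITI and RL-DT; the training data, the feature encoding, and the action codes below are made up purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Train a tree to predict the *relative* change in the x feature (delta-x) from (state, action).
# X rows are [x, y, action]; y values are the observed delta-x for each experience.
X = np.array([[14, 3, 0],   # action 0 = east, blocked by the wall at x = 14
              [13, 3, 0],   # action 0 = east, free to move
              [ 5, 2, 1],   # action 1 = west
              [ 6, 2, 1]])
y = np.array([0, 1, -1, -1])            # observed delta-x outcomes

tree = DecisionTreeClassifier().fit(X, y)

# Each leaf stores the empirical outcome frequencies, so the model makes probabilistic
# predictions of delta-x for queried state-action pairs, including unseen y values.
query = np.array([[14, 7, 0]])          # east action at x = 14, an unseen y value
probs = dict(zip(tree.classes_, tree.predict_proba(query)[0]))
print(probs)   # e.g. delta-x = 0 with probability 1: the wall blocks eastward movement
```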
4.7 Exploration
A key component of the E 3 and R - MAX algorithms is how and when the agent
decides to take an exploratory (sub-optimal) action rather than exploit what it knows
in its model. One of the advantages of model-based methods is that they allow the
agent to perform directed exploration, planning out multi-step exploration policies
rather than the simple ε -greedy or softmax exploration utilized by many model-
free methods. Both the E 3 and R - MAX algorithms do so by tracking the number
of visits to each state and driving the agent to explore all states with fewer than a
given number of visits. However, if the agent can measure uncertainty in its model,
it can drive exploration without depending on visit counts. The methods presented
in this section follow this approach. They mainly vary in two dimensions: 1) how
they measure uncertainty in their model and 2) exactly how they use the uncertainty
to drive exploration.
Fig. 4.8 TEXPLORE’s Model Learning. This figure shows how the TEXPLORE algo-
rithm (Hester and Stone, 2010) learns a model of the domain. The agent calculates the dif-
ference between s and s′ as the transition effect. Then it splits up the state vector and learns
a random forest to predict the change in each state feature. Each random forest is made up
of stochastic decision trees, which are updated with each new experience with probability w.
The random forest’s predictions are made by averaging each tree’s predictions, and then the
predictions for each feature are combined into a complete model of the domain.
would be more efficient for the agent to use an approach similar to the Bayesian
methods such as BOSS described above.
The TEXPLORE algorithm (Hester and Stone, 2010) is a model-based method for
factored domains that attempts to accomplish this goal. It learns multiple possible
decision tree models of the domain, in the form of a random forest (Breiman, 2001).
Figure 4.8 shows the random forest model. Each random forest model is made up of
m possible decision tree models of the domain, each trained on a subset of the agent’s
experiences. The agent plans on the average of these m models and can include an
exploration bonus based on the variance of the models’ predictions.
The TEXPLORE agent builds k hypotheses of the transition model of the domain,
similar to the multiple sampled models of BOSS or MBBE. TEXPLORE combines the
models differently, however, creating a merged model that predicts next state dis-
tributions that are the average of the distributions predicted by each of the models.
Planning over this average distribution allows the agent to explore promising tran-
sitions when their probability is high enough (when many of the models predict the
promising outcome, or one of them predicts it with high probability). This is similar
to the approach of BOSS, but takes into account the number of sampled models that
predict the optimistic outcome. In TEXPLORE, the prediction of a given model has
probability 1/k, while the extra actions BOSS creates are always assumed to transition
as that particular model predicts. This difference becomes clear with an example.
If TEXPLORE’s models disagree and the average model predicts there is a small
chance of a particular negative outcome occurring, the agent will avoid it based on
the chance that it may occur. The BOSS agent, however, will simply select an action
from a different model and ignore the possibility of these negative rewards. On the
other hand, if TEXPLORE’s average model predicts a possibility of a high-valued
outcome occurring, it may be worth exploring if the value of the outcome is high
enough relative to its probability. TEXPLORE has been used in a gridworld with over
300,000 states (Hester and Stone, 2010) and run in real-time on an autonomous
vehicle (Hester et al, 2011)4.
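A sketch of the averaging step described here is given below, assuming each hypothesised model maps (s, a) to a next-state distribution; the disagreement measure is one possible exploration bonus, not the one used by TEXPLORE.

```python
def averaged_prediction(models, s, a):
    """Merge k hypothesised models by averaging their next-state distributions:
    an outcome predicted by only one of the k models keeps probability mass 1/k
    in the merged model, rather than being either fully trusted or ignored."""
    merged = {}
    for model in models:
        for s2, p in model(s, a).items():
            merged[s2] = merged.get(s2, 0.0) + p / len(models)
    return merged

def disagreement(models, s, a):
    """A simple disagreement measure (usable as an exploration bonus): the average
    total-variation distance of each model's prediction from the merged prediction."""
    merged = averaged_prediction(models, s, a)
    outcomes = set(merged)
    return sum(
        0.5 * sum(abs(model(s, a).get(s2, 0.0) - merged[s2]) for s2 in outcomes)
        for model in models) / len(models)
```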
Schmidhuber (1991) tries to drive the agent to where the model has been im-
proving the most, rather than trying to estimate where the model is poorest. The
author takes a traditional model-based RL method, and adds a confidence module,
which is trained to predict the absolute value of the error of the model. This module
could be used to create intrinsic rewards encouraging the agent to explore high-error
state-action pairs, but then the agent would be attracted to noisy states in addition
to poorly-modeled ones. Instead the author adds another module that is trained to
predict the changes in the confidence module outputs. Using this module, the agent
is driven to explore the parts of the state space that most improve the model’s pre-
diction error.
Baranes and Oudeyer (2009) present an algorithm based on a similar idea called
Robust Intelligent Adaptive Curiosity ( R - IAC), a method for providing intrinsic re-
ward to encourage a developing agent to explore. Their approach does not adopt the
RL framework, but is similar in many respects. In it, they split the state space into
regions and learn a model of the transition dynamics in each region. They maintain
an estimate of the prediction error for each region and use the gradient of this error
as the intrinsic reward for the agent, driving the agent to explore the areas where the
prediction errors are improving the most. Since this approach is not using the RL
framework, their algorithm selects actions only to maximize the immediate reward,
rather than the discounted sum of future rewards. Their method has no way of in-
corporating external rewards, but it could be used to provide intrinsic rewards to an
existing RL agent.
Most of the algorithms described to this point in the chapter assume that the agent
operates in a discrete state space. However, many real-world problems such as
robotic control involve continuously valued states and actions. These approaches
can be extended to continuous problems by quantizing the state space, but very
fine discretizations result in a very large number of states, and some information
is lost in the discretization. Model-free approaches can be extended fairly easily to
continuous domains through the use of function approximation. However, there are
multiple challenges that must be addressed to extend model-based methods to con-
tinuous domains: 1) learning continuous models, 2) planning in a continuous state
space, and 3) exploring a continuous state space. Continuous methods are described
further in Chapter 7.
Learning a model of a continuous domain requires predictions about a continuous
next-state and reward from a continuous state. Unlike the discrete case, one cannot
simply learn a tabular model for some discrete set of states. Some form of func-
tion approximation must be used, as many real-valued states may never be visited.
Common approaches are to use regression or instance-based techniques to learn a
continuous model.
A more difficult problem is to plan over a continuous state space. There are infinitely
many states for which the agent needs to know an optimal action. Again,
this can be done with some form of function approximation on the policy, or the
state space could be discretized for planning purposes (even if used as-is for learning
the model). In addition, many of the model-free approaches for continuous state
spaces discussed in Chapter 7, such as policy gradient methods (Sutton et al, 1999)
or Q - LEARNING with function approximation, could be used for planning a policy
on the model.
Fitted value iteration (FVI) (Gordon, 1995) adapts value iteration to continuous
state spaces. It iterates, updating the values of a finite set of states sampled from
the infinite state space and then fitting a function approximator to their values. If
the function approximator fits some contraction criteria, then fitted value iteration is
proven to converge.
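A sketch of fitted value iteration under these assumptions follows; k-nearest-neighbour regression is used as the function approximator because averagers of this kind satisfy the non-expansion condition behind the convergence result, and the generative-model interface sample_next(s, a) -> (s2, r) is an illustrative assumption.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def fitted_value_iteration(sampled_states, actions, sample_next,
                           gamma=0.99, iterations=50, n_samples=10, k=5):
    """Fitted value iteration sketch: back up the values of a finite set of sampled
    states using a generative model, then fit a function approximator to those
    values so that V generalises over the continuous state space."""
    X = np.asarray(sampled_states, dtype=float)
    V = KNeighborsRegressor(n_neighbors=k).fit(X, np.zeros(len(X)))
    for _ in range(iterations):
        targets = []
        for s in X:
            backups = []
            for a in actions:
                pairs = [sample_next(s, a) for _ in range(n_samples)]   # (s2, r) samples
                backups.append(np.mean([r + gamma * V.predict([s2])[0] for s2, r in pairs]))
            targets.append(max(backups))
        V = KNeighborsRegressor(n_neighbors=k).fit(X, targets)          # re-fit the approximator
    return V
```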
One of the earliest model-based RL algorithms for continuous state-spaces is the
PARTI - GAME algorithm (Moore and Atkeson, 1995). It does not work in the typical
RL framework, having deterministic dynamics and a goal region rather than a re-
ward function. The algorithm discretizes the state space for learning and planning,
adaptively increasing its resolution in interesting parts of the state space. When the
planner is unable to find a policy to the goal, cells on the border of ones that suc-
ceed and fail are split further to increase the resolution of the discretization. This
approach allows the agent to have a more accurate model of dynamics when needed
and a more general model elsewhere.
Unlike the partitioning approach of the PARTI - GAME algorithm, Ormoneit and
Sen (2002) use an instance-based model of the domain in their kernel-based RL algo-
rithm. The algorithm saves all the transitions it has experienced. When making a pre-
diction for a queried state-action, the model makes a prediction based on an average
of nearby transitions, weighted using the kernel function. This model is combined
with approximate dynamic programming to create a full model-based method.
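A sketch of such an instance-based prediction, using a Gaussian kernel over stored transitions, is shown below; the tuple format of the stored experiences and the bandwidth parameter are illustrative assumptions rather than details of the cited algorithm.

```python
import numpy as np

def kernel_prediction(transitions, s, a, bandwidth=1.0):
    """Predict the expected next state and reward for a queried (s, a) as a
    kernel-weighted average over stored transitions that used the same action.
    `transitions` is a list of (state, action, reward, next_state) tuples."""
    s = np.asarray(s, dtype=float)
    weights, rewards, next_states = [], [], []
    for (si, ai, ri, s2i) in transitions:
        if ai != a:
            continue
        w = np.exp(-np.sum((np.asarray(si, dtype=float) - s) ** 2) / (2 * bandwidth ** 2))
        weights.append(w)
        rewards.append(ri)
        next_states.append(np.asarray(s2i, dtype=float))
    r_hat = np.average(rewards, weights=weights)
    s2_hat = np.average(np.stack(next_states), axis=0, weights=weights)
    return s2_hat, r_hat
```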
Deisenroth and Rasmussen (2011) use Gaussian Process (GP) regression to learn
a model of the domain in their algorithm called Probabilistic Inference for Learning
Control (PILCO). The GP regression model generalizes to unseen states and pro-
vides confidence bounds for its predictions. The agent plans assuming the next state
distribution matches the confidence bounds, encouraging the agent to explore when
some states from the next state distribution are highly valued. The algorithm also
uses GP regression to represent its policy and it computes the policy with policy
iteration. It runs in batch mode, alternately taking batches of actions in the world
and then re-computing its model and policy. The algorithm learns to control a phys-
ical cart-pole device with few samples, but pauses for 10 minutes of computation
after every 2.5 seconds of action.
Jong and Stone (2007) present an algorithm called FITTED - R - MAX that extends
R - MAX to continuous domains using an instance-based model. When a state-action
is queried, the algorithm uses the nearest instances to the queried state-action. They
use the relative effects of the nearest instances to predict the relative change in state
that will occur for the queried state-action. Their algorithm can provide a distri-
bution over next states which is then used for planning with fitted value iteration.
The agent is encouraged to explore parts of the state space that do not have enough
instances with an R - MAX type exploration bonus.
Least-Squares Policy Iteration ( LSPI) (Lagoudakis and Parr, 2003) is a popular
method for planning with function approximation (described further in Chapter 3). It
performs approximate policy iteration when using linear function approximation. It
calculates the policy parameters that minimize the least-squares difference from the
Bellman equation for a given set of experiences. These experiences could come from
a generative model or from saved experiences of the agent. However, LSPI is usually
used for batch learning with experiences gathered through random walks because of
the expensive computation required. Li et al (2009) extend LSPI to perform online
exploration by providing exploration bonuses similar to FITTED - R - MAX.
Nouri and Littman (2010) take a different approach with a focus on exploration
in continuous domains. They develop an algorithm called Dimension Reduction
in Exploration (DRE) that uses a method for dimensionality reduction in learning
the transition function that automatically discovers the relevant state features for
prediction. They predict each feature independently, and they use a ’knownness’
criterion from their model to drive exploration. They combine this model with fitted
value iteration to plan every few steps.
While RL typically focuses on discrete actions, there are many control prob-
lems that require a continuous control signal. Trying to find the best action in this
case can be a difficult problem. Binary Action Search (Pazis and Lagoudakis, 2009)
provides a possible solution to this problem by discretizing the action space and
breaking down the continuous action selection problem into a series of binary ac-
tion selections, each one deciding one bit of the value of the continuous action to be
taken. Alternatively, one can represent the policy with a function approximator (for
example, in an actor-critic method) and update the function approximator appropri-
ately to output the best continuous action (Sutton et al, 1999; van Hasselt and Wier-
ing, 2007). Weinstein et al (2010) develop a different approach called HOOT, using
MCTS -type search to partition and search the continuous action space. The search
progresses through the continuous action space, selecting smaller and smaller par-
titions of the continuous space until reaching a leaf in the search tree and selecting
an action.
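A sketch of the binary search idea is shown below; prefer_upper is an assumed stand-in for the learned binary policy of the cited approach, and the example comparator simply seeks the action closest to a target value.

```python
def binary_action_search(prefer_upper, low, high, bits=10):
    """Select a continuous action in [low, high] by answering a sequence of binary
    questions, each halving the remaining interval (one 'bit' of the action per step).
    `prefer_upper(lo, hi)` is an assumed binary policy that says whether the better
    action lies in the upper half of the current interval."""
    for _ in range(bits):
        mid = (low + high) / 2.0
        if prefer_upper(low, high):
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

# Illustrative usage: search for the action closest to 3.7 in [-10, 10] by comparing
# the centres of the upper and lower halves of the current interval.
a = binary_action_search(
    lambda lo, hi: abs(3.7 - (3 * hi + lo) / 4) < abs(3.7 - (hi + 3 * lo) / 4),
    -10.0, 10.0)
```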
Having surveyed the model-based approaches along the dimensions of how they
combine planning and model learning, how they explore, and how they deal with
continuous spaces, we now present some brief representative experiments that illus-
trate their most important properties. Though model-based methods are most useful
in large domains with limited action opportunities (see Section 4.10), we can illus-
trate their properties on simple toy domains5.
We compared R - MAX (described in Section 4.5) with Q - LEARNING, a typical
model-free method, on the Taxi domain. R - MAX was run with the number of visits
required for a state to be considered known, M, set to 5. Q - LEARNING was run
with a learning rate of 0.3 and ε -greedy exploration with ε = 0.1. Both methods
were run with a discount factor of 0.99. The Taxi domain (Dietterich, 1998), shown
in Figure 4.9a, is a 5x5 gridworld with four landmarks that are labeled with one
of the following colors: red, green, blue or yellow. The agent’s state consists of
its location in the gridworld in x,y coordinates, the location of the passenger (at a
landmark or in the taxi), and the passenger’s destination (a landmark). The agent’s
goal is to navigate to the passenger’s location, pick the passenger up, navigate to the
passenger’s destination and drop the passenger off. The agent has six actions that
it can take. The first four (north, south, west, east) move the agent to the square in
that respective direction with probability 0.8 and in each of the two perpendicular directions with
probability 0.1. If the resulting move is blocked by a wall, the agent stays where
it is. The fifth action is the pickup action, which picks up the passenger if she is at the
taxi’s location. The sixth action is the putdown action, which attempts to drop off the
passenger. Each of the actions incurs a reward of −1, except for unsuccessful pickup
or putdown actions, which produce a reward of −10. The episode is terminated
by a successful putdown action, which provides a reward of +20. Each episode
starts with the passenger’s location and destination selected randomly from the four
landmarks and with the agent at a random location in the gridworld.
Figure 4.9b shows a comparison of the average reward accrued by Q - LEARNING
and R - MAX on the Taxi domain, averaged over 30 trials. R - MAX receives large neg-
ative rewards early as it explores all of its ’unknown’ states. This exploration, how-
ever, leads it to find the optimal policy faster than Q - LEARNING.
Next, we show the performance of a few model-based methods and Q - LEARNING
on the Cart-Pole Balancing task. Cart-Pole Balancing is a continuous task, shown
in Figure 4.10a, where the agent must keep the pole balanced while keeping the
cart on the track. The agent has two actions, which apply a force of 10 N to the
cart in either direction. Uniform noise between −5 and 5 N is added to this force.
The state is made up of four features: the pole’s position, the pole’s velocity, the
cart’s position, and the cart’s velocity. The agent receives a reward of +1 each time
step until the episode ends. The episode ends if the pole falls, the cart goes off the
track, or 1,000 time steps have passed. The task was simulated at 50 Hz. For the
Fig. 4.9 4.9a shows the Taxi domain. 4.9b shows the average reward per episode for Q-
LEARNING and R - MAX on the Taxi domain averaged over 30 trials.
discrete methods, we discretized each of the 4 dimensions into 10 values, for a total
of 10,000 states.
For model-free methods, we compared Q - LEARNING on the discretized domain
with Q - LEARNING using tile-coding for function approximation on the continuous
representation of the task. Both methods were run with a learning rate of 0.3 and
ε -greedy exploration with ε = 0.1. Q - LEARNING with tile coding was run with 10
conjunctive tilings, with each dimension split into 4 tiles. We also compared four
model-based methods: two discrete methods and two continuous methods. The dis-
crete methods were R - MAX (Brafman and Tennenholtz, 2001), which uses a tabular
model, and TEXPLORE (Hester and Stone, 2010). TEXPLORE uses decision trees to
model the relative effects of transitions and acts greedily with respect to a model
that is the average of multiple possible models. Both of these methods were run
on the discretized version of the domain. Here again, R - MAX was run with M = 5.
TEXPLORE was run with b = 0, w = 0.55, and f = 0.2. We also evaluated FITTED -
R - MAX (Jong and Stone, 2007), which is an extension of R - MAX for continuous
domains, and CONTINUOUS TEXPLORE, an extension of TEXPLORE to continuous
domains that uses regression trees to model the continuous state instead of discrete
decision trees. CONTINUOUS TEXPLORE was run with the same parameters as TEX -
PLORE and FITTED - R - MAX was run with a model breadth of 0.05 and a resolution
factor of 4.
The average rewards for the algorithms averaged over 30 trials are shown in Fig-
ure 4.10b. Both versions of R - MAX take a long time exploring and do not accrue
much reward. Q - LEARNING with tile coding out-performs discrete Q - LEARNING
because it is able to generalize values across states to learn faster. Meanwhile,
the generalization and exploration of the TEXPLORE methods give them superior
performance, as they accrue significantly more reward per episode than the other
methods after episode 6. For each method, the continuous version of the algorithm
out-performs the discrete version.
Fig. 4.10 4.10a shows the Cart-Pole Balancing task. 4.10b shows the average reward per
episode for the algorithms on Cart-Pole Balancing averaged over 30 trials.
4.10 Scaling Up
sample complexity are still valid when using a sample-based planner called Forward-
Search Sparse Sampling (FSSS), which is a more conservative version of UCT. FSSS
maintains statistical upper and lower bounds for the value of each node in the tree
and only explores sub-trees that have a chance of being optimal.
In order to guarantee that it will find the optimal policy, any model-based
algorithm must visit every state-action pair in the domain at least once. Even this number of
actions may be too large in some cases, whether because the domain is very
big, or because the actions are very expensive or dangerous. These bounds can be
improved in factored domains by assuming a DBN transition model. By assuming
knowledge of the DBN structure ahead of time, the DBN -E 3 and FACTORED - R - MAX
algorithms (Kearns and Koller, 1999; Guestrin et al, 2002) are able to learn a near-
optimal policy in a number of actions polynomial in the number of parameters of
the DBN-MDP, which can be exponentially smaller than the number of total states.
Another approach to this problem is to attempt to learn the structure of the DBN
model using decision trees, as in the TEXPLORE algorithm (Hester and Stone, 2010).
This approach gives up guarantees of optimality as the correct DBN structure may
not be learned, but it can learn high-rewarding (if not optimal) policies in many
fewer actions. The TEXPLORE algorithm models the MDP with random forests and
explores where its models are uncertain, but does not explore every state-action in
the domain.
Another approach to improving the sample efficiency of such algorithms is to
incorporate some human knowledge into the agent. One approach for doing so is
to provide trajectories of human generated experiences that the agent can use for
building its model. For example, in (Ng et al, 2003), the authors learn a dynamics
model of a remote control helicopter from data recorded from an expert user. Then
they use a policy search RL method to learn a policy to fly the helicopter using their
learned model.
One more issue with real-world decision making tasks is that they often involve
partially-observable state, where the agent cannot uniquely identify its state from its
observations. The U - TREE algorithm (McCallum, 1996) is one model-based algo-
rithm that addresses this issue. It learns a model of the domain using a tree that can
incorporate previous states and actions into the splits in the tree, in addition to the
current state and action. The historical states and actions can be used to accurately
determine the agent’s true current state. The algorithm then uses value iteration on
this model to plan a policy.
Another way to scale up to larger and more complex domains is to use relational
models (described further in Chapter 8). Here, the world is represented as a set of
literals and the agent can learn models in the form of STRIPS-like planning oper-
ators (Fikes and Nilsson, 1971). Because these planning operators are relational,
the agent can easily generalize the effects of its actions over different objects in
the world, allowing it to scale up to worlds with many more objects. Pasula et al
(2004) present a method for learning probabilistic relational models in the form of
sets of rules. These rules describe how a particular action affects the state. They start
with a set of rules based on the experiences the agent has seen so far, and perform
a search over the rule set to optimize a score promoting simple and general rules.
Table 4.2 This table presents a summary of the algorithms presented in this chapter
Algorithm Section Model Type Key Feature
DHP 4.2 Neural Network Uses NN model within actor-critic framework
LWR RL 4.2 Locally Weighted Regression One of the first methods to generalize model across states
DYNA 4.4 Tabular Bellman updates on actual actions and saved experiences
Prioritized Sweeping 4.4 Tabular Performs sweep of value updates backwards from changed state
DYNA -2 4.4 Tabular Use UCT for simulated updates to transient memory
Real-Time Architecture 4.4 Any Parallel architecture to allow real-time action
E3 4.5 Tabular Explicit decision to explore or exploit
R - MAX 4.5 Tabular Reward bonus given to ’unknown’ states
DBN -E 3 4.6 Learn probabilities for provided DBN model Learns in number of actions polynomial in # of DBN parameters
FACTORED - R - MAX 4.6 Learn probabilities for provided DBN model Learns in number of actions polynomial in # of DBN parameters
SLF - R - MAX 4.6 Learn models for all possible DBN structures Explore when any model is uncertain or models disagree
MET- R - MAX 4.6 Learn DBN structure and probabilities efficiently Explore when any model is uncertain or models disagree
LSE - R - MAX 4.6 Learn DBN efficiently without in-degree Explore when any model is uncertain or models disagree
SPITI 4.6 Decision Trees Model generalizes across states
RL - DT 4.6 Decision Trees with relative effects Model generalizes across states, R - MAX-like exploration
Optimal Probe 4.7 Maintain distribution over models Plan over augmented belief state space
BEETLE 4.7 Maintain distribution over parametrized models Plan over augmented belief state space
Bayesian DP 4.7 Maintain distribution over models Plan using model sampled from distribution
BOSS 4.7 Maintain distribution over models Plan using merged model created from sampled models
MBBE 4.7 Maintain distribution over models Use distribution over action-values to compute VPI
BEB 4.7 Maintain Dirichlet distribution over models Provide reward bonus to follow Bayesian policy
MBIE 4.7 Maintain distribution over transition probabilities Take max over transition probabilities as well as next state
TEXPLORE 4.7 Random Forest model for each feature Plan approximate optimal policy on average model
Intrinsic Curiosity 4.7 Supervised Learning method Provide intrinsic reward based on improvement in model
R - IAC 4.7 Supervised Learning method Select actions based on improvement in model
PILCO 4.8 Gaussian Process Regression Plan using uncertainty in next state predictions
PARTI - GAME 4.8 Split state-space based on transition dynamics Splits state space non-uniformly
Kernel-based RL 4.8 Instance-based model with kernel distance Use kernel to make predictions based on similar states
FITTED - R - MAX 4.8 Instance-based model Predict based on applying the relative effect of similar transitions
DRE 4.8 Dimensionality Reduction techniques Works in high-dimensional state spaces
U - TREE 4.10 Tree model that uses histories Makes predictions in partially observable domains
Relational 4.10 Relational rule sets Learns STRIPS-like relational operators
4.11 Conclusion
Model-based methods learn a model of the MDP on-line while interacting with the
environment, and then plan using their approximate model to calculate a policy. If
the algorithm can learn an accurate model quickly enough, model-based methods
can be more sample efficient than model-free methods. With an accurate learned
model, an optimal policy can be planned without requiring any additional experi-
ences in the world. In addition, these approaches can use their model to plan out
multi-step exploration policies, enabling them to perform more directed exploration
than model-free methods.
Table 4.2 shows a summary of the model-based algorithms described in this
chapter. R - MAX (Brafman and Tennenholtz, 2001) is one of the most commonly
used model-based methods because of its theoretical guarantees and ease of
use and implementation. However, MET- R - MAX (Diuk et al, 2009) and
LSE - R - MAX (Chakraborty and Stone, 2011) are the current state of the art in terms of
the bounds on sample complexity for factored domains. There are other approaches
that perform as well or better without such theoretical guarantees, such as Gaus-
sian Process RL (Deisenroth and Rasmussen, 2011; Rasmussen and Kuss, 2004) or
TEXPLORE (Hester and Stone, 2010).
Acknowledgements. This work has taken place in the Learning Agents Research Group
(LARG) at the Artificial Intelligence Laboratory, The University of Texas at Austin. LARG
research is supported in part by grants from the National Science Foundation (IIS-0917122),
ONR (N00014-09-1-0658), and the Federal Highway Administration (DTFH61-07-H-00030).
References
Asmuth, J., Li, L., Littman, M., Nouri, A., Wingate, D.: A Bayesian sampling approach to
exploration in reinforcement learning. In: Proceedings of the 25th Conference on Uncer-
tainty in Artificial Intelligence, UAI (2009)
Atkeson, C., Moore, A., Schaal, S.: Locally weighted learning for control. Artificial Intelli-
gence Review 11, 75–113 (1997)
Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem.
Machine Learning 47(2), 235–256 (2002)
Baranes, A., Oudeyer, P.Y.: R-IAC: Robust Intrinsically Motivated Exploration and Active
Learning. IEEE Transactions on Autonomous Mental Development 1(3), 155–169 (2009)
Boutilier, C., Dearden, R., Goldszmidt, M.: Stochastic dynamic programming with factored
representations. Artificial Intelligence 121, 49–107 (2000)
Brafman, R., Tennenholtz, M.: R-Max - a general polynomial time algorithm for near-optimal
reinforcement learning. In: Proceedings of the Seventeenth International Joint Conference
on Artificial Intelligence (IJCAI), pp. 953–958 (2001)
Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
Chakraborty, D., Stone, P.: Structure learning in ergodic factored MDPs without knowledge
of the transition function’s in-degree. In: Proceedings of the Twenty-Eighth International
Conference on Machine Learning, ICML (2011)
Dearden, R., Friedman, N., Andre, D.: Model based Bayesian exploration. In: Proceedings
of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI), pp. 150–159
(1999)
Degris, T., Sigaud, O., Wuillemin, P.H.: Learning the structure of factored Markov Decision
Processes in reinforcement learning problems. In: Proceedings of the Twenty-Third Inter-
national Conference on Machine Learning (ICML), pp. 257–264 (2006)
Deisenroth, M., Rasmussen, C.: PILCO: A model-based and data-efficient approach to pol-
icy search. In: Proceedings of the Twenty-Eighth International Conference on Machine
Learning, ICML (2011)
Dietterich, T.: The MAXQ method for hierarchical reinforcement learning. In: Proceedings of
the Fifteenth International Conference on Machine Learning (ICML), pp. 118–126 (1998)
Diuk, C., Cohen, A., Littman, M.: An object-oriented representation for efficient reinforce-
ment learning. In: Proceedings of the Twenty-Fifth International Conference on Machine
Learning (ICML), pp. 240–247 (2008)
Diuk, C., Li, L., Leffler, B.: The adaptive k-meteorologists problem and its application to
structure learning and feature selection in reinforcement learning. In: Proceedings of the
Twenty-Sixth International Conference on Machine Learning (ICML), p. 32 (2009)
Duff, M.: Design for an optimal probe. In: Proceedings of the Twentieth International
Conference on Machine Learning (ICML), pp. 131–138 (2003)
Even-dar, E., Mansour, Y.: Learning rates for q-learning. Journal of Machine Learning
Research, 1–25 (2001)
Fikes, R., Nilsson, N.: Strips: A new approach to the application of theorem proving to prob-
lem solving. Tech. Rep. 43r, AI Center, SRI International, 333 Ravenswood Ave, Menlo
Park, CA 94025, SRI Project 8259 (1971)
Gordon, G.: Stable function approximation in dynamic programming. In: Proceedings of the
Twelfth International Conference on Machine Learning, ICML (1995)
Guestrin, C., Patrascu, R., Schuurmans, D.: Algorithm-directed exploration for model-based
reinforcement learning in factored MDPs. In: Proceedings of the Nineteenth International
Conference on Machine Learning (ICML), pp. 235–242 (2002)
van Hasselt, H., Wiering, M.: Reinforcement learning in continuous action spaces. In: IEEE
International Symposium on Approximate Dynamic Programming and Reinforcement
Learning (ADPRL), pp. 272–279 (2007)
Hester, T., Stone, P.: Generalized model learning for reinforcement learning in factored do-
mains. In: Proceedings of the Eighth International Joint Conference on Autonomous Agents
and Multiagent Systems, AAMAS (2009)
Hester, T., Stone, P.: Real time targeted exploration in large domains. In: Proceedings of the
Ninth International Conference on Development and Learning, ICDL (2010)
Hester, T., Quinlan, M., Stone, P.: Generalized model learning for reinforcement learning on a
humanoid robot. In: Proceedings of the 2010 IEEE International Conference on Robotics
and Automation, ICRA (2010)
Hester, T., Quinlan, M., Stone, P.: A real-time model-based reinforcement learning architec-
ture for robot control. arXiv e-prints, arXiv:1105.1749 (2011)
Jong, N., Stone, P.: Model-based function approximation for reinforcement learning. In: Pro-
ceedings of the Sixth International Joint Conference on Autonomous Agents and Multia-
gent Systems, AAMAS (2007)
Kakade, S.: On the sample complexity of reinforcement learning. PhD thesis, University
College London (2003)
Kearns, M., Koller, D.: Efficient reinforcement learning in factored MDPs. In: Proceedings of
the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI), pp. 740–
747 (1999)
Kearns, M., Singh, S.: Near-optimal reinforcement learning in polynomial time. In: Proceed-
ings of the Fifteenth International Conference on Machine Learning (ICML), pp. 260–268
(1998)
Kearns, M., Mansour, Y., Ng, A.: A sparse sampling algorithm for near-optimal planning
in large Markov Decision Processes. In: Proceedings of the Sixteenth International Joint
Conference on Artificial Intelligence (IJCAI), pp. 1324–1331 (1999)
Kocsis, L., Szepesvári, C.: Bandit based Monte-Carlo planning. In: Fürnkranz, J., Scheffer, T.,
Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp. 282–293. Springer,
Heidelberg (2006)
Kolter, J.Z., Ng, A.: Near-Bayesian exploration in polynomial time. In: Proceedings of
the Twenty-Sixth International Conference on Machine Learning (ICML), pp. 513–520
(2009)
Lagoudakis, M., Parr, R.: Least-squares policy iteration. Journal of Machine Learning Re-
search 4, 1107–1149 (2003)
Li, L., Littman, M., Walsh, T.: Knows what it knows: a framework for self-aware learning. In:
Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML),
pp. 568–575 (2008)
Li, L., Littman, M., Mansley, C.: Online exploration in least-squares policy iteration. In:
Proceedings of the Eighth International Joint Conference on Autonomous Agents and Mul-
tiagent Systems (AAMAS), pp. 733–739 (2009)
McCallum, A.: Learning to use selective attention and short-term memory in sequential tasks.
In: From Animals to Animats 4: Proceedings of the Fourth International Conference on
Simulation of Adaptive Behavior (1996)
Moore, A., Atkeson, C.: Prioritized sweeping: Reinforcement learning with less data and less
real time. Machine Learning 13, 103–130 (1993)
Moore, A., Atkeson, C.: The parti-game algorithm for variable resolution reinforcement
learning in multidimensional state-spaces. Machine Learning 21, 199–233 (1995)
Ng, A., Kim, H.J., Jordan, M., Sastry, S.: Autonomous helicopter flight via reinforcement
learning. In: Advances in Neural Information Processing Systems (NIPS), vol. 16 (2003)
Nouri, A., Littman, M.: Dimension reduction and its application to model-based exploration
in continuous spaces. Mach. Learn. 81(1), 85–98 (2010)
Ormoneit, D., Sen, Ś.: Kernel-based reinforcement learning. Machine Learning 49(2), 161–
178 (2002)
Pasula, H., Zettlemoyer, L., Kaelbling, L.P.: Learning probabilistic relational planning rules.
In: Proceedings of the 14th International Conference on Automated Planning and Schedul-
ing, ICAPS (2004)
Pazis, J., Lagoudakis, M.: Binary action search for learning continuous-action control poli-
cies. In: Proceedings of the Twenty-Sixth International Conference on Machine Learning
(ICML), p. 100 (2009)
Poupart, P., Vlassis, N., Hoey, J., Regan, K.: An analytic solution to discrete Bayesian re-
inforcement learning. In: Proceedings of the Twenty-Third International Conference on
Machine Learning (ICML), pp. 697–704 (2006)
Prokhorov, D., Wunsch, D.: Adaptive critic designs. IEEE Transactions on Neural Net-
works 8, 997–1007 (1997)
Rasmussen, C., Kuss, M.: Gaussian processes in reinforcement learning. In: Advances in
Neural Information Processing Systems (NIPS), vol. 16 (2004)
Schaal, S., Atkeson, C.: Robot juggling: implementation of memory-based learning. IEEE
Control Systems Magazine 14(1), 57–71 (1994)
Schmidhuber, J.: Curious model-building control systems. In: Proceedings of the Interna-
tional Joint Conference on Neural Networks, pp. 1458–1463. IEEE (1991)
Silver, D., Sutton, R., Müller, M.: Sample-based learning and search with permanent and tran-
sient memories. In: Proceedings of the Twenty-Fifth International Conference on Machine
Learning (ICML), pp. 968–975 (2008)
Strehl, A., Littman, M.: A theoretical analysis of model-based interval estimation. In: Pro-
ceedings of the Twenty-Second International Conference on Machine Learning (ICML),
pp. 856–863 (2005)
Strehl, A., Diuk, C., Littman, M.: Efficient structure learning in factored-state MDPs. In:
Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, pp. 645–
650 (2007)
Strens, M.: A Bayesian framework for reinforcement learning. In: Proceedings of the Seven-
teenth International Conference on Machine Learning (ICML), pp. 943–950 (2000)
Sutton, R.: Integrated architectures for learning, planning, and reacting based on approximat-
ing dynamic programming. In: Proceedings of the Seventh International Conference on
Machine Learning (ICML), pp. 216–224 (1990)
Sutton, R.: Dyna, an integrated architecture for learning, planning, and reacting. SIGART
Bulletin 2(4), 160–163 (1991)
Sutton, R., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement
learning with function approximation. In: Advances in Neural Information Processing
Systems (NIPS), vol. 12, pp. 1057–1063 (1999)
Venayagamoorthy, G., Harley, R., Wunsch, D.: Comparison of heuristic dynamic program-
ming and dual heuristic programming adaptive critics for neurocontrol of a turbogenerator.
IEEE Transactions on Neural Networks 13(3), 764–773 (2002)
Walsh, T., Goschin, S., Littman, M.: Integrating sample-based planning and model-based re-
inforcement learning. In: Proceedings of the Twenty-Fifth AAAI Conference on Artificial
Intelligence (2010)
Wang, T., Lizotte, D., Bowling, M., Schuurmans, D.: Bayesian sparse sampling for on-line
reward optimization. In: Proceedings of the Twenty-Second International Conference on
Machine Learning (ICML), pp. 956–963 (2005)
Wang, Y., Gelly, S.: Modifications of UCT and sequence-like simulations for Monte-Carlo
Go. In: IEEE Symposium on Computational Intelligence and Games (2007)
Weinstein, A., Mansley, C., Littman, M.: Sample-based planning for continuous action
Markov Decision Processes. In: ICML 2010 Workshop on Reinforcement Learning and
Search in Very Large Spaces (2010)
Wiering, M., Schmidhuber, J.: Efficient model-based exploration. In: From Animals to An-
imats 5: Proceedings of the Fifth International Conference on Simulation of Adaptive
Behavior, pp. 223–228. MIT Press, Cambridge (1998)
Chapter 5
Transfer in Reinforcement Learning:
A Framework and a Survey
Alessandro Lazaric
5.1 Introduction
The idea of transferring knowledge across different but related tasks to improve the
performance of machine learning (ML) algorithms stems from psychology and cog-
nitive science research. A number of psychological studies (see e.g., Thorndike and
Woodworth, 1901; Perkins et al, 1992) show that humans are able to learn a task
better and faster by transferring the knowledge retained from solving similar tasks.
Transfer in machine learning has the objective of designing transfer methods that an-
alyze the knowledge collected from a set of source tasks (e.g., samples, solutions)
and transfer it so as to bias the learning process on a target task towards a set of
good hypotheses. If the transfer method successfully identifies the similarities be-
tween source and target tasks, then the transferred knowledge is likely to improve
the learning performance on the target task. The idea of retaining and reusing knowl-
edge to improve learning algorithms dates back to the early stages of ML. In fact,
it is widely recognized that a good representation is the most critical aspect of any
learning algorithm, and the development of techniques that automatically change
the representation according to the task at hand is one of the main objectives of
Alessandro Lazaric
INRIA Lille-Nord Europe, 40 Avenue Halley, 59650 Villeneuve d’Ascq, France
e-mail: [email protected]
large part of the research in ML. Most of the research in transfer learning (Fawcett
et al, 1994) identified the single-problem perspective usually adopted in ML as a limitation to the definition of effective methods for the inductive construction of good
representations. On the other hand, taking inspiration from studies in psychology
and neuroscience (Gentner et al, 2003; Gick and Holyoak, 1983), the transfer point
of view, where learning tasks are assumed to be related and knowledge is retained
and transferred, is considered the most suitable perspective for designing effective
techniques of inductive bias (Utgoff, 1986).
Transfer in reinforcement learning. Transfer algorithms have been successful in
improving the performance of learning algorithms in a number of supervised learn-
ing problems, such as recommender systems, medical decision making, text clas-
sification, and general game playing. In recent years, the research on transfer also
focused on the reinforcement learning (RL) paradigm and how RL algorithms could
benefit from knowledge transfer. In principle, traditional reinforcement learning
already provides mechanisms to learn solutions for any task without the need for
human supervision. Nonetheless, the number of samples needed to learn a nearly-
optimal solution is often prohibitive in real-world problems unless prior knowledge
from a domain expert is available. Furthermore, every time the task at hand changes, the learning process must be restarted from scratch even when similar problems have already been solved. Transfer algorithms automatically build prior knowledge from
the knowledge collected in solving a set of similar source tasks (i.e., training tasks)
and use it to bias the learning process on any new task (i.e., testing task). When the transfer is successful, the result is a dramatic reduction in the number of samples needed and a significant improvement in the accuracy of the learned solution.
Aim of the chapter. Unlike supervised learning, reinforcement learning problems
are characterized by a large number of elements such as the dynamics and the re-
ward function, and many different transfer settings can be defined depending on the
differences and similarities between the tasks. Although relatively recent, research
on transfer in reinforcement learning already includes a large number of works cover-
ing many different transfer problems. Nonetheless, it is often difficult to have a clear
picture of the current state-of-the-art in transfer in RL because of the very different
approaches and perspectives adopted in dealing with this complex and challenging
problem. The aim of this chapter is to formalize what the main transfer settings
are and to classify the algorithmic approaches according to the kind of knowledge
they transfer from source to target tasks. Taylor and Stone (2009) also provide a
thorough survey of transfer in reinforcement learning. While their survey provides a
very in-depth analysis of each transfer algorithm, the objective of this chapter is not
to review all the algorithms available in the literature but rather to identify the char-
acteristics shared by the different approaches of transfer in RL and classify them
into large families.
Structure of the chapter. The rest of the chapter is organized as follows. In Sec-
tion 5.2 we formalize the transfer problem and we identify three main dimensions
to categorize the transfer algorithms according to the setting, the transferred knowl-
edge, and the objective. Then we review the main approaches of transfer in RL in
Fig. 5.1 (top) In the standard learning process, the learning algorithm gets as input some form
of knowledge about the task (i.e., samples, structure of the solutions, parameters) and returns
a solution. (bottom) In the transfer setting, a transfer phase first takes as input the knowledge
retained from a set of source tasks and returns a new knowledge which is used as input for
the learning algorithm. The dashed line represents the possibility to define a continual process
where the experience obtained from solving a task is then reused in solving new tasks.
three different settings. In Section 5.3 we focus on the source-to-target setting where
transfer occurs from one single source task to one single target task. A more general
setting with a set of source tasks and one target task is studied in Section 5.4. Finally,
in Section 5.5 we discuss the general source-to-target setting when the state-action
spaces of source and target tasks are different. In Section 5.6 we conclude and we
discuss open questions.
In this section, we adapt the formalisms introduced by Baxter (2000) and Silver
(2000) for supervised learning to the RL paradigm and we introduce general defini-
tions and symbols used throughout the rest of the chapter.
As discussed in the introduction, transfer learning leverages the knowledge
collected from a number of different tasks to improve the learning performance in
new tasks. We define a task M as an MDP (Sutton and Barto, 1998) characterized
Symbol Meaning
M MDP
SM State space
AM Action space
TM Transition model (dynamics)
RM Reward function
M Task space (set of tasks M)
Ω Probability distribution over M
E Environment (task space and distribution)
H Hypothesis space (e.g., value functions, policies)
h∈H Hypothesis (e.g., one value function, one policy)
K Knowledge space (e.g., samples and basis functions)
K ∈K Knowledge (e.g., specific realization of samples)
Ks Knowledge space from a source task
Kt Knowledge space from a target task
Ktransfer Knowledge space returned by the transfer algorithm and used in learning
F Space of functions defined on a specific state-action space
φ State-action basis function
Alearn Learning algorithm
Atransfer Transfer algorithm
O Set of options
o∈O An option
by the tuple ⟨SM, AM, TM, RM⟩, where SM is the state space, AM is the action space,
TM is the transition function, and RM is the reward function. While the state-action
space SM × AM defines the domain of the task, the transition TM and reward function
RM define the objective of the task. The space of tasks involved in the transfer learn-
ing problem is denoted by M = {M}. Let Ω be a probability distribution over the
space of tasks M; then we denote by E = ⟨M, Ω⟩ the environment, which defines
the setting of the transfer problem. The tasks presented to the learner are drawn from
the task distribution (i.e., M ∼ Ω ). This general definition resembles the traditional
supervised learning setting where training samples are drawn from a given distribu-
tion. As a result, similar to classification and regression, transfer learning is based
on the idea that, since tasks are drawn from the same distribution, an algorithm that achieves a good performance on average on a finite number of source tasks (or training tasks) will also generalize well across the target tasks in M drawn from the same distribution Ω (or testing tasks).
A standard learning algorithm takes as input some form of knowledge of the task
at hand and returns a solution in a set of possible results. We use K to denote the
space of the knowledge used as input for the learning algorithm and H for the
space of hypotheses that can be returned. In particular, K refers to all the elements
used by the algorithm to compute the solution of a task, notably the instances (e.g.,
samples), the representation of the problem (e.g., set of options, set of features), and
parameters (e.g., learning rate) used by the algorithm. Notice that K includes prior
knowledge provided by an expert, transfer knowledge obtained from a transfer al-
gorithm, and direct knowledge collected from the task. A general learning algorithm
is defined as the mapping
Alearn : K → H . (5.1)
Example 1. Let us consider fitted Q-iteration (Ernst et al, 2005) with a linear func-
tion approximator. Fitted Q-iteration first collects N samples (the instances) and
through an iterative process returns an action-value function which approximates
the optimal action-value function of the task. In this case, the hypothesis space
H is the linear space spanned by a set of d features {ϕi : S × A → R}, i = 1, . . . , d, designed by a domain expert, that is H = {h(·,·) = ∑_{i=1}^{d} αi ϕi(·,·)}. Beside this prior knowledge, the algorithm also receives as input a set of N samples ⟨s, a, s', r⟩. As a result, the knowledge used by fitted Q-iteration can be formalized by the space K = ⟨(S × A × S × R)^N, F^d⟩, where any specific instance K ∈ K is K = ({⟨sn, an, rn, s'n⟩}_{n=1}^{N}, {ϕi}_{i=1}^{d}), with ϕi ∈ F. Given as input K ∈ K, the algorithm returns an action-value function h ∈ H (i.e., A_FQI(K) = h).
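The following is a minimal sketch, under illustrative assumptions about the sample format and the feature map, of how the knowledge K of Example 1 (samples plus basis functions) could be turned into a hypothesis h by fitted Q-iteration with a linear approximator; it is not the implementation of Ernst et al (2005).

```python
# A minimal sketch (not the authors' implementation) of fitted Q-iteration with
# a linear approximator: the input knowledge K is a set of samples plus a
# feature map, the output is a hypothesis h(s, a) = sum_i alpha_i * phi_i(s, a).
import numpy as np

def fitted_q_iteration(samples, features, actions, gamma=0.95, n_iterations=50):
    """samples: list of (s, a, s_next, r); features: callable (s, a) -> np.ndarray of size d."""
    d = features(samples[0][0], samples[0][1]).shape[0]
    alpha = np.zeros(d)                                          # weights of the linear hypothesis
    Phi = np.array([features(s, a) for (s, a, _, _) in samples])  # N x d design matrix
    for _ in range(n_iterations):
        # Bootstrapped targets: r + gamma * max_a' Q(s', a') under the current weights
        targets = np.array([
            r + gamma * max(features(s_next, a_next) @ alpha for a_next in actions)
            for (_, _, s_next, r) in samples
        ])
        # Regression step: project the targets back onto the linear hypothesis space
        alpha, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
    return lambda s, a: features(s, a) @ alpha                   # the returned hypothesis h = A_FQI(K)
```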
Given the previous definitions, we can now define the general shape of transfer
learning algorithms. In general, in single-task learning only the instances are directly
collected from the task at hand, while the representation of the problem and the
parameters are given as a prior by an expert. In transfer learning, the objective is
to reduce the need for instances from the target task and prior knowledge from a
domain expert by tuning and adapting the structure of the learning algorithm (i.e.,
the knowledge used as input) on the basis of the previous tasks observed so far.
Let E = ⟨M, Ω⟩ be the environment at hand and L the number of tasks drawn from M according to the distribution Ω and used as source tasks; a transfer learning algorithm is usually the result of a transfer phase followed by a learning phase. Let KsL be the knowledge collected from the L source tasks and Kt the knowledge available (if any) from the target task. The transfer phase is defined as

Atransfer : KsL × Kt → Ktransfer ,    (5.2)

where Ktransfer is the final knowledge transferred to the learning phase. In particular, the learning algorithm is now defined as

Alearn : Ktransfer × Kt → H .    (5.3)
[Figure 5.2 panels: (left) transfer from one source task to a target task with a fixed domain; (center) transfer from multiple source tasks to a target task with a fixed domain; (right) transfer from a source task to a target task with a different state-action space.]
Fig. 5.2 The three main transfer settings defined according to the number of source tasks and
their difference w.r.t. the target task
Although in the definition in Equation (5.2) Kt is present in both the transfer and
learning phase, in most of the transfer settings, no knowledge about the target is
available in the transfer phase. This formalization also shows that transfer algo-
rithms must be compatible with the specific learning algorithm employed in the
second phase, since Ktransfer is used as an additional source of knowledge for Alearn .
The performance of the transfer algorithm is usually compared to a learning algo-
rithm in Equation (5.1) which takes as input only Kt . As discussed in the next sec-
tion, the specific setting E , the knowledge spaces K , and the way the performance
is measured define the main categories of transfer problems and approaches.
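As a minimal illustration of the two-phase structure just formalized, the sketch below wires a transfer phase and a learning phase together; the concrete knowledge objects and algorithms are hypothetical placeholders, not a prescribed interface.

```python
# A minimal sketch of the transfer-then-learn pipeline described above: the
# transfer phase turns the source knowledge (plus any target knowledge) into
# K_transfer, which the learning phase then consumes together with the target
# knowledge. All arguments are placeholders for task-specific objects.

def transfer_then_learn(transfer_algorithm, learn_algorithm,
                        source_knowledge_list, target_knowledge):
    """A_transfer : K_s^L x K_t -> K_transfer, then A_learn : K_transfer x K_t -> H."""
    k_transfer = transfer_algorithm(source_knowledge_list, target_knowledge)
    return learn_algorithm(k_transfer, target_knowledge)

def single_task_learn(learn_algorithm, target_knowledge):
    """Single-task baseline used for comparison: A_learn : K_t -> H, as in Equation (5.1)."""
    return learn_algorithm(target_knowledge)
```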
5.2.2 Taxonomy
learning, we expect that, as the number of source tasks increases, the transfer
algorithm is able to improve the average performance on the target tasks drawn
from Ω when compared to a single-task learning algorithm which does not use
any transferred knowledge.
(III) Transfer across tasks with different domains. Finally, in this setting tasks have a
different domain, that is they might have different state-action variables, both in
terms of number and range. Most of the transfer approaches in this case consider
the source-target scenario and focus on how to define a mapping between the
source state-action variables and the target variables so as to obtain an effective
transfer of knowledge.
The definition of transferred knowledge and the specific transfer process are the
main aspects characterizing a transfer learning algorithm. In the definition of Sec-
tion 5.2.1 the space K contains the instances collected from the environment (e.g.,
sample trajectories), the representation of the solution and the parameters of the al-
gorithm itself. Once the space of knowledge considered by the algorithm is defined,
it is important to design how this knowledge is actually used to transfer information
from the source tasks to the target task. Silver (2000) and Pan and Yang (2010) pro-
pose a general classification of the knowledge retained and transferred across tasks
in supervised learning. Taylor and Stone (2009) introduce a very detailed classifi-
cation for transfer in RL. Here we prefer to have a broader classification identifying
macro-categories of approaches along the lines of Lazaric (2008). We classify the
possible knowledge transfer approaches into three categories: instance transfer, representation transfer, and parameter transfer.
(I) Instance transfer. Unlike dynamic programming algorithms, where the dynam-
ics and reward functions are known in advance, all the RL algorithms rely on a
set of samples collected from a direct interaction with the MDP to build a solu-
tion for the task at hand. This set of samples can be used to estimate the model
of the MDP in model-based approaches or to directly build an approximation
of the value function or policy in model-free approaches. The simplest transfer algorithms collect samples coming from different source tasks and reuse them in learning the target task. For instance, the transfer of tra-
jectory samples can be used to simplify the estimation of the model of new
tasks (Sunmola and Wyatt, 2006) or the estimation of the action value function
as in (Lazaric et al, 2008).
(II) Representation transfer. Each RL algorithm uses a specific representation of
the task and of the solution, such as state-aggregation, neural networks, or a
set of basis functions for the approximation of the optimal value function. Af-
ter learning on different tasks, transfer algorithms often perform an abstrac-
tion process which changes the representation of the task and of the solutions.
In this category, many possible approaches are possible varying from reward
[Figure 5.3: three panels of learning curves, each plotting performance against experience.]
Fig. 5.3 The three main objectives of transfer learning (Langley, 2006). The red circles high-
light the improvement in the performance in the learning process expected by using transfer
solutions w.r.t. single-task approaches.
shaping (Konidaris and Barto, 2006) and MDP augmentation through op-
tions (Singh et al, 2004) to basis function extraction (Mahadevan and Maggioni,
2007).
(III) Parameter transfer. Most of the RL algorithms are characterized by a number
of parameters which define the initialization and the behavior of the algorithm
itself. For instance, in Q-learning (Watkins and Dayan, 1992) the Q-table is ini-
tialized with arbitrary values (e.g., the highest possible value for the action val-
ues Rmax /(1 − γ )) and it is updated using a gradient-descent rule with a learning
rate α . The initial values and the learning rate define the set of input param-
eters used by the algorithm. Some transfer approaches change and adapt the
algorithm parameters according to the source tasks. For instance, if the action
values in some state-action pairs are very similar across all the source tasks, the
Q-table for the target task could be initialized to more convenient values thus
speeding up the learning process. In particular, the transfer of initial solutions
(i.e., policies or value functions) is commonly adopted to initialize the learning
algorithm in the transfer setting with only one source task.
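As a concrete, hedged illustration of parameter transfer for tabular Q-learning, the sketch below initializes the target Q-table from the mean of the source-task Q-tables instead of the optimistic default Rmax/(1 − γ) mentioned above; the environment interface (reset/step/sample_action) is a hypothetical placeholder and the mean-initialization is only one possible choice.

```python
# A minimal sketch of parameter transfer for tabular Q-learning: the Q-table of
# the target task is initialized from the mean of the source-task Q-tables
# instead of the optimistic value R_max / (1 - gamma). The `env` interface is
# a hypothetical placeholder.
import numpy as np

def init_q_table(source_q_tables, n_states, n_actions, r_max, gamma):
    if source_q_tables:                                   # parameter transfer from source tasks
        return np.mean(np.stack(source_q_tables), axis=0)
    return np.full((n_states, n_actions), r_max / (1.0 - gamma))   # optimistic default

def q_learning(env, q, gamma=0.95, alpha=0.1, epsilon=0.1, n_episodes=500):
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = env.sample_action() if np.random.rand() < epsilon else int(np.argmax(q[s]))
            s_next, r, done = env.step(a)
            # Standard Q-learning update with learning rate alpha
            q[s, a] += alpha * (r + gamma * np.max(q[s_next]) - q[s, a])
            s = s_next
    return q
```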
previously solved tasks can be used to bias the learning algorithm towards a
limited set of solutions, so as to reduce its learning time. The complexity of
a learning algorithm is usually measured by the number of samples needed to
achieve a desired performance. In RL, this objective is pursued following two
different approaches. The first approach is to make the algorithm more effec-
tive in using the experience collected from the exploration of the environment.
For instance, Kalmar and Szepesvari (1999) and Hauskrecht (1998) show that
the use of options can improve the effectiveness of value iteration backups by
updating value function estimates with the total reward collected by an option,
and thus reducing the number of iterations to converge to a nearly optimal so-
lution. The second aspect is about the strategy used to collect the samples. In
online RL algorithms samples are collected from direct interaction with the en-
vironment through an exploration strategy. The experience collected by solving
a set of tasks can lead to the definition of better exploration strategies for new
related tasks. For instance, if all the tasks have goals in a limited region of the
state space, an exploration strategy that frequently visits that region will lead to
more informative samples.
In practice, at least three different methods can be used to measure the im-
provement in the learning speed: time to threshold, area ratio, and finite-sample
analysis. In all the problems where a target performance is considered (e.g.,
a small enough number of steps-to-go in a navigation problem), it is possible
to set a threshold and measure how much experience (e.g., samples, episodes,
iterations) is needed by the single-task and transfer algorithms to achieve that
threshold. If the transfer algorithm successfully takes advantage of the knowl-
edge collected from the previous tasks, we expect it to need much less expe-
rience to reach the target performance. The main drawback of this metric is
that the threshold might be arbitrary and that it does not take into account the
whole learning behavior of the algorithms. In fact, it could be the case that an
algorithm is faster in reaching a given threshold but it has a very poor initial
performance or does not achieve the asymptotic optimal performance. The area
ratio metric introduced by Taylor and Stone (2009) copes with this problem by
considering the whole area under the learning curves with and without transfer.
Formally, the area ratio is defined as
r = (area with transfer − area without transfer) / (area without transfer).    (5.4)
Although this metric successfully takes into consideration the behavior of the
algorithms until a given number of samples, it is scale dependent. For instance,
when the reward-per-episode is used as a measure of performance, the scale of
the rewards impacts on the area ratio and changes in the rewards might lead to
different conclusions in the comparison of different algorithms. While the two
previous measures allow an empirical comparison of the learning performance with and without transfer, it is also interesting to have a more rigorous comparison
by deriving sample-based bounds for the algorithms at hand. In such case, it is
possible to compute an upper bound on the error of the solution returned by the
algorithm, depending on the parameters of the task and on the number of samples available. For instance, if the algorithm returns a function h ∈ H and Q∗ is the optimal action-value function, a finite-sample bound typically bounds the error ||Q∗ − h|| by a quantity that decreases with the number of samples and depends on the complexity of the hypothesis space H.
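The first two empirical metrics discussed above are easy to compute from recorded learning curves. The sketch below assumes curves are given as per-episode performance values; the synthetic curves at the end are purely illustrative.

```python
# A minimal sketch of the time-to-threshold metric and of the area ratio of
# Equation (5.4). Learning curves are assumed to be per-episode performance values.
import numpy as np

def time_to_threshold(curve, threshold):
    """Index of the first episode whose performance reaches the threshold (None if never)."""
    above = np.nonzero(np.asarray(curve) >= threshold)[0]
    return int(above[0]) if above.size > 0 else None

def area_ratio(curve_with, curve_without):
    """r = (area with transfer - area without transfer) / area without transfer, Eq. (5.4)."""
    area_with, area_without = float(np.sum(curve_with)), float(np.sum(curve_without))
    return (area_with - area_without) / area_without

# Example with two synthetic reward-per-episode curves: transfer gives a jumpstart
# and reaches the 0.8 threshold earlier. Note that shifting the reward baseline
# changes the area ratio (the scale dependence discussed above) but not the
# time-to-threshold comparison.
without_transfer = np.linspace(0.0, 1.0, 100)
with_transfer = np.minimum(1.0, 0.3 + np.linspace(0.0, 1.0, 100))
print(time_to_threshold(without_transfer, 0.8), time_to_threshold(with_transfer, 0.8))
print(round(area_ratio(with_transfer, without_transfer), 3))
```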
Table 5.2 The three dimensions of transfer learning in RL. Each transfer solution is specifi-
cally designed for a setting, it transfers some form of knowledge, and it pursues an objective.
The survey classifies the existing algorithms according to the first dimension, it then reviews
the approaches depending on the transferred knowledge and discusses which objectives they
achieve.
As a result, after observing a number of source tasks, the transfer algorithm may
build an effective prior on the solution of the tasks in M and initialize the learn-
ing algorithm to a suitable initial hypothesis with a better performance w.r.t. a random initialization. It is worth noting that this objective does not necessarily
correspond to an improvement in the learning speed. Let us consider a source
task whose optimal policy is significantly different from the optimal policy of
the target task but that, at the same time, it achieves only a slightly suboptimal
performance (e.g., two goal states with different final positive rewards in dif-
ferent regions of the state space). In this case, the improvement of the initial
performance can be obtained by initializing the learning algorithm to the opti-
mal policy of the source task, but this may worsen the learning speed. In
fact, the initial policy does not provide samples of the actual optimal policy of
the task at hand, thus slowing down the learning algorithm. On the other hand,
it could be possible that the policy transferred from the source task is an effec-
tive exploration strategy for learning the optimal policy of the target task, but
that it also achieves very poor performance. This objective is usually pursued
by parameter-transfer algorithms in which the learning algorithm is initialized
with a suitable solution whose performance is better compared to a random (or
arbitrary) initialization.
Given the framework introduced in the previous sections, the survey is organized
along the dimensions in Table 5.2. In the following sections we first classify the
main transfer approaches in RL according to the specific setting they consider. In
each setting, we further divide the algorithms depending on the type of knowledge
they transfer from source to target, and, finally, we discuss which objectives are
achieved. As can be noticed, the literature on transfer in RL is not equally distributed across the three settings. Most of the early literature on transfer in RL focused
on the source-to-target setting, while the most popular scenario of recent research is
the general problem of transfer from a set of source tasks. Finally, research on the
problem of mapping different state and action spaces mostly relied on hand-coded
transformations and much room for further investigation is available.
Fig. 5.4 Example of the setting of transfer from source to target with a fixed state-action
space
In this section we consider the simplest setting in which transfer occurs from one
source task to a target task. We first formulate the general setting in the next sec-
tion and we then review the main approaches to this problem by categorizing them
according to the type of transferred knowledge. Most of the approaches reviewed
in the following change the representation of the problem or directly transfer the
source solution to the target task. Furthermore, unlike the other two settings con-
sidered in Sections 5.4 and 5.5, not all the possible knowledge transfer models are considered and, to the best of our knowledge, no instance-transfer method has been proposed for this specific setting.
In this transfer setting we define two MDPs, a source task Ms = ⟨S, A, Ts, Rs⟩ and a target task Mt = ⟨S, A, Tt, Rt⟩, sharing the same state-action space S × A. The en-
vironment E is defined by the task space M = {Ms , Mt } and a task distribution Ω
which simply returns Ms as the first task and Mt as second.
Example 3. Let us consider the transfer problem depicted in Figure 5.4. The source
task is a navigation problem where the agent should move from the region marked
with S to the goal region G. The target task shares exactly the same state-action
space and the same dynamics as the source task but the initial state and the goal
(and the reward function) are different. The transfer algorithm first collects some
form of knowledge from the interaction with the source task and then generates
a transferable knowledge that can be used as input to the learning algorithm on
the target task. In this example, the transfer algorithm can exploit the similarity in
the dynamics and identify regularities that could be useful in learning the target task.
As reviewed in the next section, one effective way to perform transfer in this case is
to discover policies (i.e., options) useful to navigate in an environment with such a
dynamics. For instance, the policy sketched in Figure 5.4 allows the agent to move
from any point in the left room to the door between the two rooms. Such a policy is
useful to solve any navigation task requiring the agent to move from a starting region
in the left room to a goal region in the right room. Another popular approach is to
discover features which are well-suited to approximate the optimal value functions
in the environment. In fact, the dynamics displays symmetries and discontinuities
which are likely to be preserved in the value functions. For instance, both the source
and target value functions are discontinuous close to the walls separating the two
rooms. As a result, once the source task is solved, the transfer algorithm should an-
alyze the dynamics and the value function and return a set of features which capture
this discontinuity and preserve the symmetries of the problem.
In some transfer problems no knowledge about the target task is available before
transfer actually takes place and Kt in Equation (5.2) is always empty. In this case, it
is important to abstract from the source task general characteristics that are likely to
apply to the target task as well. The transfer algorithm first collects some knowledge
from the source task and it then changes the representation either of the solution
space H or of the MDP so as to speed up the learning in the target task.
Option discovery. One of the most popular approaches to the source-target trans-
fer problem is to change the representation of the MDP by adding options (Sutton
et al, 1999) to the set of available actions (see Chapter 9 for a review of hierarchi-
cal RL methods). In discrete MDPs, options do not affect the possibility to achieve
the optimal solution (since all the primitive actions are available, any possible pol-
icy can still be represented), but they are likely to improve the learning speed if
they reach regions of the state space which are useful to learn the target task. All
the option-transfer methods consider discrete MDPs, a tabular representation of the
action-value function, and source and target tasks which differ only in the reward
function (i.e., Ts = Tt ). The idea is to exploit the structure of the dynamics shared
by the two tasks and to ignore the details about the specific source reward function.
Most of these methods share a common structure. A set of samples ⟨si, ai, ri, s'i⟩ is
first collected from the source task and an estimated MDP M̂s is computed. On the
basis of the characteristics of the estimated dynamics a set of relevant subgoals is
identified and a set of d options is learned to reach each of them. According to the
model in Section 5.2.1, the source knowledge is Ks = (S × A × S × R)Ns , and for any
specific realization K ∈ Ks , the transfer algorithm returns Atransfer (K) = (O, H ),
where O = {oi }di=1 and H = {h : S × {A ∪ O} → R}. The learning algorithm can
now use the new augmented action space to learn the solution to the target task us-
ing option Q-learning (Sutton et al, 1999). Although all these transfer algorithms
share the same structure, the critical point is how to identify the subgoals and learn
options from the estimated dynamics. McGovern and Barto (2001) define the con-
cept of bottleneck state as a state which is often traversed by the optimal policy
of the source task and that can be considered as critical to solve tasks in the same
MDP. Metrics defined for graph partitioning techniques are used in (Menache et al,
2002) and (Simsek et al, 2005) to identify states connecting different regions of the
state space. Hengst (2003) proposes a method to automatically develop a MAXQ
hierarchy on the basis of the concept of access states. Finally, Bonarini et al (2006) introduce a psychology-inspired notion of interest aimed at identifying states from which the
environment can be easily explored.
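To make the common structure described above concrete, the sketch below flags candidate bottleneck states with a simple visit-frequency heuristic over source-task trajectories and earmarks them as option subgoals; it is a toy simplification in the spirit of the methods cited above, not a faithful implementation of any of them.

```python
# A minimal sketch of subgoal discovery for option transfer: states that appear
# in many successful source trajectories but rarely in unsuccessful ones are
# flagged as candidate bottlenecks. This is a toy heuristic, not the published
# diverse-density or graph-partitioning methods.
from collections import Counter

def candidate_bottlenecks(successful_trajs, failed_trajs, top_k=3):
    """Each trajectory is a list of (discrete) states visited in the source task."""
    pos = Counter(s for traj in successful_trajs for s in set(traj))
    neg = Counter(s for traj in failed_trajs for s in set(traj))
    n_pos, n_neg = max(len(successful_trajs), 1), max(len(failed_trajs), 1)
    # Score = visit frequency in successful trajectories minus in failed ones
    scores = {s: pos[s] / n_pos - neg.get(s, 0) / n_neg for s in pos}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Each returned state would then become the termination condition of a new option,
# whose policy is learned on the estimated source MDP and added to the action set.
```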
dynamics and reward function may be different in source and target task. A differ-
ent method is proposed to build the source graph and extract proto-value functions
which are well suited to approximate functions obtained from similar dynamics and
reward functions.
All the previous methods about representation transfer rely on the implicit assump-
tion that source and target tasks are similar enough so that options or features ex-
tracted from the source task are effective in learning the solution of the target task.
Nonetheless, it is clear that many different notions of similarity can be defined. For
instance, we expect the option-transfer methods to work well whenever the two op-
timal policies have some parts in common (e.g., they both need passing through
some specific states to achieve the goal), while proto-value functions are effective
when the value functions preserve the structure of the transition graph (e.g., sym-
metries). The only explicit attempt to measure the expected performance of transfer
from source to target as a function of a distance between the two MDPs is pursued
by Ferns et al (2004) and Phillips (2006). In particular, they analyze the case in
which a policy πs is transferred from source to target task. The learning process in
the target task is then initialized using πs and its performance is measured. If the
MDPs are similar enough, then we expect this policy-transfer method to achieve
a jumpstart improvement. According to the formalism introduced in Section 5.2.1,
in this case Ks is any knowledge collected from the source task used to learn πs ,
while the transferred knowledge Ktransfer only contains πs and no learning phase ac-
tually takes place. Phillips (2006) defines a state distance between Ms and Mt along
the lines of the metrics proposed in (Ferns et al, 2004). In particular, the distance
d : S → R is defined as
d(s) = max_{a∈A} ( |Rs(s,a) − Rt(s,a)| + γ T(d)(Ts(·|s,a), Tt(·|s,a)) ),    (5.6)
where T (d) is the Kantorovich distance which measures the difference between
the two transition distributions Ts (·|s,a) and Tt (·|s,a) given the state distance d. The
recursive Equation (5.6) is proved to have a fixed point d∗ which is used as a state distance. Phillips (2006) proves that when a policy πs is transferred from source to
target, its performance loss w.r.t. the optimal target policy πt∗ can be upper bounded
by d ∗ as
||V_t^{πs} − V_t^{π∗_t}|| ≤ (2/(1 − γ)) max_{s∈S} d∗(s) + ((1 + γ)/(1 − γ)) ||V_s^{πs} − V_s^{π∗_s}||.
As can be noticed, when the transferred policy is the optimal policy πs∗ of the
source task, then its performance loss is upper bounded by the largest value of d ∗
which takes into consideration the difference between the reward functions and tran-
sition models of the two tasks at hand.
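The quantity bounded on the left-hand side above (the performance loss of running the source policy on the target task) can be computed directly for small tabular MDPs, which is useful when empirically checking the jumpstart effect; the sketch below does this with standard policy evaluation and value iteration. The tabular input format is an assumption, and the code does not compute the metric d∗ itself, which would require solving Kantorovich optimal-transport problems.

```python
# A minimal sketch, for a small tabular target MDP, of the performance loss
# ||V_t^{pi_s} - V_t^{pi_t*}||_inf incurred when the source policy pi_s is used
# directly on the target task. T has shape (S, A, S), R has shape (S, A); pi_s
# maps each state index to an action index. These input conventions are assumptions.
import numpy as np

def policy_evaluation(T, R, policy, gamma=0.95):
    n_states = T.shape[0]
    P_pi = T[np.arange(n_states), policy]           # (S, S) transition matrix under pi
    r_pi = R[np.arange(n_states), policy]           # (S,) reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def optimal_values(T, R, gamma=0.95, n_iter=1000):
    V = np.zeros(T.shape[0])
    for _ in range(n_iter):
        V = np.max(R + gamma * T @ V, axis=1)       # Bellman optimality backup
    return V

def jumpstart_loss(T_target, R_target, source_policy, gamma=0.95):
    V_pi = policy_evaluation(T_target, R_target, source_policy, gamma)
    V_star = optimal_values(T_target, R_target, gamma)
    return float(np.max(np.abs(V_star - V_pi)))     # the left-hand side of the bound
```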
While in the previous section we considered the setting in which only one source
task is available, here we review the main transfer approaches to the general setting
when a set of source tasks is available. Transfer algorithms in this setting should
deal with two main issues: how to merge knowledge coming from different sources
and how to avoid the transfer from sources which differ too much from the target
task (negative transfer).
In this section we consider the more general setting in which the environment E
is defined by a set of tasks M and a distribution Ω . Similar to the setting in Sec-
tion 5.3.1, here all the tasks share the same state-action space, that is for any M ∈ M ,
SM = S and AM = A. Although not all the approaches reviewed in the next section
explicitly define a distribution Ω , they all rely on the implicit assumption that all
the tasks involved in the transfer problem share some characteristics in the dynam-
ics and reward function and that by observing a number of source tasks, the transfer
algorithm is able to generalize well across all the tasks in M .
Example 4. Let us consider a similar scenario to the real-time strategy (RTS) game
introduced in (Mehta et al, 2008). In RTS, there is a number of basic tasks such
as attacking the enemy, mining gold, building structures, which are useful to ac-
complish more complex tasks such as preparing an army and conquering an enemy
region. The more complex tasks can be often seen as a combination of the low level
tasks and the specific combination depends also on the phase of the game, the char-
acteristics of the map, and many other parameters. A simple way to formalize the problem is to consider the case in which all the tasks in M share the same state-action space and dynamics but have different rewards. In particular, each reward function is the result of a linear combination of a set of d basis reward functions, that is, for each task M, the reward is defined as RM(·) = ∑_{i=1}^{d} wi ri(·), where w is a weight
vector and ri (·) is a basis reward function. Each basis reward function encodes a spe-
cific objective (e.g., defeat the enemy, collect gold), while the weights represent a
combination of them as in a multi-objective problem. It is reasonable to assume
that the specific task at hand is randomly generated by setting the weight vector w.
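A minimal sketch of how such a task distribution Ω could be instantiated is given below: the basis reward functions are fixed and a task is generated by sampling a weight vector. The particular basis functions and the Dirichlet sampling of w are illustrative assumptions, not the setup of Mehta et al (2008).

```python
# A minimal sketch of the task distribution in Example 4: every task shares the
# same dynamics, and a task is generated by sampling a weight vector w that
# linearly combines a fixed set of basis reward functions. The basis functions
# below are illustrative placeholders.
import numpy as np

basis_rewards = [
    lambda s, a: float(s.get("enemy_defeated", False)),    # r_1: defeat the enemy
    lambda s, a: s.get("gold_mined", 0.0),                  # r_2: collect gold
    lambda s, a: s.get("buildings_completed", 0.0),         # r_3: build structures
]

def sample_task_reward(rng):
    """Draw a task M ~ Omega by sampling w, and return R_M(s, a) = sum_i w_i r_i(s, a)."""
    w = rng.dirichlet(np.ones(len(basis_rewards)))          # random convex combination
    return lambda s, a: sum(wi * ri(s, a) for wi, ri in zip(w, basis_rewards))

rng = np.random.default_rng(0)
R_M = sample_task_reward(rng)
print(R_M({"enemy_defeated": True, "gold_mined": 2.0}, a=None))
```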
The main idea of instance-transfer algorithms is that the transfer of source samples
may improve the learning on the target task. Nonetheless, if samples are transferred
from sources which differ too much from the target task, then negative transfer
might occur. In this section we review the only instance-transfer approach for this
transfer setting proposed in (Lazaric et al, 2008) which selectively transfers samples
on the basis of the similarity between source and target tasks.
Let L be the number of source tasks; Lazaric et al (2008) propose an algorithm which first collects Ns samples for each source task, Ks = (S × A × S × R)^Ns, and Nt samples (with Nt ≪ Ns) from the target task, Kt = (S × A × S × R)^Nt, and the
transfer algorithm takes as input Ks and Kt . Instead of returning as output a set con-
taining all the source samples, the method relies on a measure of similarity between
the source and the target tasks to select which source samples should be included
in Ktransfer . Let Ksl ∈ Ks and Kt ∈ Kt be the specific source and target samples
available to the transfer algorithm. The number of source samples is assumed to be
large enough to build an accurate kernel-based estimation of each source model M̂sl .
Given the estimated model, the similarity between the source task Msl and the target
task Mt is defined as
Λ_{sl} = (1/Nt) ∑_{n=1}^{Nt} P(⟨sn, an, s'n, rn⟩ | M̂_{sl}),
where P(⟨sn, an, s'n, rn⟩ | M̂_{sl}) is the probability of the transition ⟨sn, an, s'n, rn⟩ ∈ Kt ac-
cording to the (estimated) model of Msl . The intuition behind this measure of sim-
ilarity is that it is more convenient to transfer samples collected from source tasks
which are likely to generate target samples. Finally, source samples are transferred
proportionally to their similarity Λsl to the target task. The method is further refined
using another measure of utility for each source sample, so that from each source task only the samples that are more likely to improve the learning performance in the target task are transferred. In the experiments reported in (Lazaric et al, 2008) this method is
shown to successfully identify which sources are more relevant to transfer samples
from and to avoid negative transfer.
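The proportional allocation just described can be sketched as follows; the estimated-model interface (transition_prob, reward_prob) and the budget-based allocation are simplifying assumptions, not the exact published method.

```python
# A minimal sketch of similarity-based sample transfer: each source task gets a
# compatibility score Lambda computed as the average likelihood of the target
# samples under its estimated model, and source samples are transferred
# proportionally to that score. The model interface is a hypothetical assumption.
import numpy as np

def compatibility(target_samples, source_model):
    """Average probability of the target transitions under the estimated source model."""
    probs = [source_model.transition_prob(s, a, s_next) * source_model.reward_prob(s, a, r)
             for (s, a, s_next, r) in target_samples]
    return float(np.mean(probs))

def select_source_samples(target_samples, source_models, source_sample_sets, budget, seed=0):
    """Allocate the transfer budget across source tasks proportionally to Lambda."""
    lambdas = np.array([compatibility(target_samples, m) for m in source_models])
    weights = lambdas / lambdas.sum() if lambdas.sum() > 0 else np.ones_like(lambdas) / len(lambdas)
    rng = np.random.default_rng(seed)
    transferred = []
    for w, samples in zip(weights, source_sample_sets):
        n = min(int(round(w * budget)), len(samples))
        idx = rng.choice(len(samples), size=n, replace=False)
        transferred.extend(samples[i] for i in idx)
    return transferred
```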
Fig. 5.5 Example of a generative model of a hierarchical Bayesian model (HBM). The obser-
vations s, a, s , r are generated according to an MDP parameterized by a parameter vector θ ,
while each task is generated according to a distribution defined by a set of hyper-parameters
Ψ . Ψ0 is a vector of parameters defining the prior over the hyper-parameters.
where ψ ∈ Ψ is a hyper-parameter vector. The main assumption is that all the task
parameters θ are independently and identically distributed according to a specific
task distribution Ωψ ∗ .
The hierarchical Bayesian model. The structure of this problem is usually represented as a hierarchical Bayesian model (HBM) as depicted in Figure 5.5. The transfer algorithms take as input samples Ksl = {⟨sn, an, s'n, rn⟩}_{n=1}^{Ns} from each of the source tasks Mθl (l = 1, . . . , L), which are drawn from the true distribution Ωψ∗ (i.e., θl ∼ Ωψ∗ i.i.d.) whose true hyper-parameters are unknown. Given a prior over ψ,
the algorithm solves the inference problem
P(ψ | {Ksl}_{l=1}^{L}) ∝ ∏_{l=1}^{L} P(Ksl | ψ) P(ψ),    (5.7)
where P(Ksl | ψ) ∝ P(Ksl | θ) P(θ | ψ). The ψ with highest probability is usually trans-
ferred and used to initialize the learning process on the target task. Notice that the
learning algorithm must be designed so as to take advantage of the knowledge about
the specific task distribution Ωψ returned by the transfer phase. Bayesian algorithms
for RL such as GPTD (Engel et al, 2005) are usually adopted (see Chapter 11).
The inference problem in Equation (5.7) leverages the knowledge collected
on all the tasks at the same time. Thus, even if few samples per task are available
(i.e., Ns is small), the algorithm can still take advantage of a large number of tasks
(i.e., L is large) to solve the inference problem and learn an accurate estimate of
ψ ∗ . As L increases the hyper-parameter ψ gets closer and closer to the true hyper-
parameter ψ ∗ and ψ can be used to build a prior on the parameter θ for any new
target task drawn from the distribution Ωψ ∗ . Depending on the specific definition of
Θ and Ωψ and the way the inference problem is solved, many different algorithms
can be deduced from this general model.
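As a deliberately simplified illustration of this general model, the sketch below replaces full posterior inference over ψ with an empirical-Bayes point estimate: each source task yields a parameter estimate θl, a Gaussian hyper-parameter ψ = (μ, Σ) is fitted over them, and ψ is used as a prior for the target task (here for a linear value-function model). All names are illustrative and the Gaussian choice is an assumption, not the specific algorithms reviewed next.

```python
# A simplified (empirical-Bayes) sketch of the hierarchical idea above: instead
# of full posterior inference as in Equation (5.7), per-task point estimates
# theta_l are summarized by a Gaussian hyper-parameter psi = (mu, Sigma), which
# then acts as a prior when fitting the target task.
import numpy as np

def fit_hyperparameters(source_thetas):
    """source_thetas: array of shape (L, d), one estimated parameter vector per source task."""
    thetas = np.asarray(source_thetas, dtype=float)
    mu = thetas.mean(axis=0)
    sigma = np.cov(thetas, rowvar=False) + 1e-6 * np.eye(thetas.shape[1])  # regularized covariance
    return mu, sigma

def map_estimate(mu, sigma, features, targets, noise_var=1.0):
    """MAP estimate of the target-task parameters under the Gaussian prior N(mu, sigma),
    for a linear model targets ~ features @ theta (e.g., a linear value function)."""
    precision = np.linalg.inv(sigma) + features.T @ features / noise_var
    rhs = np.linalg.inv(sigma) @ mu + features.T @ targets / noise_var
    return np.linalg.solve(precision, rhs)
```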
Inference for transfer. Tanaka and Yamamura (2003) consider a simpler approach.
Although the MDPs are assumed to be drawn from a distribution Ω , the proposed
algorithm does not try to estimate the task distribution; only statistics of the action values are computed. The mean and variance of the action values over different
tasks are computed and then used to initialize the Q-table for new tasks. Sunmola
and Wyatt (2006) and Wilson et al (2007) consider the case where the MDP dy-
namics and reward function are parameterized by a parameter vector θ and they
are assumed to be drawn from a common distribution Ωψ . The inference problem
is solved by choosing appropriate conjugate priors over the hyper-parameter ψ . A
transfer problem on POMDPs is considered in (Li et al, 2009). In this case, no ex-
plicit parameterization of the tasks is provided. On the other hand, it is the space
of history-based policies H which is parameterized by a vector parameter θ ∈ Θ .
A Dirichlet process is then used as a non-parametric prior over the parameters of
the optimal policies for different tasks. Lazaric and Ghavamzadeh (2010) consider
the case of a parameterized space of value functions by considering the space of
linear functions spanned by a given set of features, H = {h(x,a) = ∑di=1 θi ϕi (x,a)}.
The vector θ is assumed to be drawn from a multivariate Gaussian with parame-
ters ψ drawn from a normal-inverse-Wishart hyper-prior (i.e., θ ∼ N (μ ,Σ ) and
μ ,Σ ∼ N -I W (ψ )). The inference problem is solved using an EM-like algorithm
which takes advantage of the conjugate priors. This approach is further extended to
consider the case in which not all the tasks are drawn from the same distribution. In
order to cluster tasks into different classes, Lazaric and Ghavamzadeh (2010) place
a Dirichlet process on top of the hierarchical Bayesian model, and the number of classes and the assignment of tasks to classes are automatically learned by solving an in-
ference problem using a Gibbs sampling method. Finally, Mehta et al (2008) define
the reward function as a linear combination of reward features which are common
across tasks, while the weights are specific for each task. The weights are drawn
from a distribution Ω and the transfer algorithm compactly stores the optimal value
functions of the source tasks exploiting the structure of the reward function and uses
them to initialize the solution in the target task.
All the previous settings consider the case where all the tasks share the same domain
(i.e., they have the same state-action space). In the most general transfer setting the
tasks in M may also differ in terms of number or range of the state-action variables.
Although here we consider the general case in which each task M ∈ M is de-
fined as an MDP ⟨SM, AM, TM, RM⟩ and the environment E is obtained by defining a
Example 5. Let us consider the mountain car domain and the source and target tasks
in Figure 5.6 introduced by Taylor et al (2008a). Although the problem is somewhat
similar (i.e., an under-powered car has to move from the bottom of the valley to the
top of the hill), the two tasks are defined over a different state space and the action
space contains a different number of actions. In fact, in the 2D mountain car task the
state space is defined by the position and the velocity variables (x,ẋ) and the action
space contains the actions A = {Left, Neutral, Right}. On the other hand, the 3D task
has two additional state variables describing the position in y and its correspond-
ing speed ẏ and the action space becomes A = {Neutral, West, East, South, North}.
The transfer approaches described so far cannot be applied here because the knowl-
edge Ktransfer they transfer from the source task would not be compatible with
the target task. In this case, the transfer algorithm must define a suitable map-
ping between source and target state and action spaces, and then transfer solu-
tions learned in the 2D mountain car to initialize the learning process in the 3D
task.
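The sketch below shows one way such a mapping could look for Example 5: each 3D state-action pair is projected onto a 2D one, so that the source value function can initialize the target estimate. The particular mapping is an illustrative assumption in the spirit of the hand-coded inter-task mappings discussed next, not the mapping published by Taylor et al.

```python
# A minimal sketch of a hand-coded inter-task mapping for Example 5: each 3D
# mountain-car state-action pair is mapped to a 2D one, so that a learned source
# value function Q_2d can be used to initialize the target estimate. The mapping
# chosen here is an illustrative assumption, not the published one.

ACTION_MAP_3D_TO_2D = {            # chi_A: project 3D actions onto 2D actions
    "Neutral": "Neutral",
    "West": "Left", "East": "Right",
    "South": "Left", "North": "Right",
}

def state_map_3d_to_2d(state_3d):
    """chi_S: keep only the (x, x_dot) variables of the 3D state (x, x_dot, y, y_dot)."""
    x, x_dot, _, _ = state_3d
    return (x, x_dot)

def transferred_q_init(q_2d):
    """Build an initial 3D action-value estimate from the learned 2D value function."""
    def q_3d(state_3d, action_3d):
        return q_2d(state_map_3d_to_2d(state_3d), ACTION_MAP_3D_TO_2D[action_3d])
    return q_3d
```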
As reviewed in the previous sections, many transfer approaches develop options that
can be effectively reused in the target task. In this case, the main problem is that op-
tions learned on the source task are defined as a mapping from Ss to As and they
cannot be used in a target task with different state-action variables. A number of
transfer algorithms deal with this problem by considering abstract options that can
be reused in different tasks. Ravindran and Barto (2003); Soni and Singh (2006)
use the homomorphism framework to map tasks to a common abstract level. For
instance, let us consider all the navigation problems in an empty square room. In
this case, it is possible to define one common abstract MDP and obtain any specific
MDP by simply using operators such as translation, scaling, and rotation. In order
to deal with this scenario, Ravindran and Barto (2003) introduce the concept of rel-
ativized options. Unlike traditional options, relativized options are defined on the
abstract MDP, without an absolute frame of reference, and their policy is then trans-
formed according to the specific target task at hand. In particular, a set of possible
transformations is provided and the transfer phase needs to identify the most suit-
able transformation of the relativized options depending on the current target task.
The problem is cast as a Bayesian parameter estimation problem and the transfor-
mation which makes the sequence of states observed by following the option more
likely is selected. Konidaris and Barto (2007) define options at a higher level of
abstraction and they can be used in the target task without any explicit mapping or
transformation. In fact, portable options are defined in a non-Markovian agent space
which depends on the characteristics of the agent and remains fixed across tasks.
This way, even when tasks are defined on different state-action spaces, portable op-
tions can be reused to speed up learning in any target task at hand. Finally, Torrey
et al (2006) proposed an algorithm in which a set of skills is first identified using in-
ductive logic programming and then reused in the target task by using a hand-coded
mapping from source to target.
While in the setting considered in Section 5.3, the transfer of initial solutions (e.g.,
the optimal source policy) from the source to the target task is trivial, in this case
the crucial aspect in making transfer effective is to find a suitable mapping from the
source state-action space Ss × As to the target state-action space St × At .
Most of the algorithms reviewed in the following consider hand-coded mappings
and investigate how the transfer of different sources of knowledge (e.g., policies,
value functions) influence the performance on the target task. The transformation
through a hand-coded mapping and the transfer of the source value function to ini-
tialize the learning in the target task was first introduced by Taylor and Stone (2005) and Taylor et al (2005), and its impact has been studied in a number of challenging problems such as the simulated keep-away problem (Stone et al, 2005). Baner-
jee and Stone (2007) also consider the transfer of value functions in the context
of general games where different games can be represented by a common abstract
structure. Torrey et al (2005) learn the Q-table in the target task by reusing advice (i.e., actions with higher Q-values in the source task) which is mapped to the target
task through a hand-coded mapping. While the previous approaches assume that in
both source and target task the same solution representation is used (e.g., a tabu-
lar approach), Taylor and Stone (2007) consider the problem of mapping a solution
(i.e., a value function or a policy) to another solution when either the approximation
architecture (e.g., CMAC and neural networks) or the learning algorithm itself (e.g.,
value-based and policy search methods) changes between source and target tasks.
Using similar mappings as for the state-action mapping, they show that transfer is
still possible and it is still beneficial in improving the performance on the target
task. Finally, Taylor et al (2007b) study the transfer of the source policy where a
hand-coded mapping is used to transform the source policy into a valid policy for
the target task and a policy search algorithm is then used to refine it.
In this chapter we defined a general framework for the transfer learning problem in
the reinforcement learning paradigm, we proposed a classification of the different
approaches to the problem, and we reviewed the main algorithms available in the
literature. Although many algorithms have been already proposed, the problem of
transfer in RL is far from being solved. In the following we single out a few open
questions that are relevant to the advancement of the research on this topic. We
refer the reader to the survey by Taylor and Stone (2009) for other possible lines of
research in transfer in RL.
Theoretical analysis of transfer algorithms. Although experimental results sup-
port the idea that RL algorithms can benefit from transfer from related tasks, no
transfer algorithm for RL has strong theoretical guarantees. Recent research in trans-
fer and multi-task learning in the supervised learning paradigm achieved interesting
theoretical results identifying the conditions under which transfer approaches are ex-
pected to improve the performance over single-task learning. Crammer et al (2008) study the performance of learning when reusing samples coming from different classifica-
tion tasks and they prove that when the sample distributions of the source tasks do
not differ too much compared to the target distribution, then the transfer approach
performs better than just using the target samples. Baxter (2000) studies the problem
of learning the most suitable set of hypotheses for a given set of tasks. In particular,
he shows that, as the number of source tasks increases, the transfer algorithm man-
ages to identify a hypothesis set which is likely to contain good hypotheses for all
the tasks in M . Ben-David and Schuller-Borbely (2008) consider the problem of
learning the best hypothesis set in the context of multi-task learning where the ob-
jective is not to generalize on new tasks but to achieve a better average performance
in the source tasks. At the same time, novel theoretical results are now available
for a number of popular RL algorithms such as fitted value iteration (Munos and
Szepesvári, 2008), LSTD (Farahmand et al, 2008; Lazaric et al, 2010), and Bellman
residual minimization (Antos et al, 2008; Maillard et al, 2010). An interesting line
of research is to combine the theoretical results for transfer algorithms in the
supervised learning setting with those for single-task RL algorithms to develop
new RL transfer algorithms that provably improve the performance over single-
task learning.
Transfer learning for exploration. The objective of learning speed improvement
(see Section 5.2.2) is often achieved by a better use of the samples at hand (e.g., by
changing the hypothesis set) rather than by the collection of more informative sam-
ples. This problem is strictly related to the exploration-exploitation dilemma where
the objective is to trade off between the exploration of different strategies and the
exploitation of the best strategy found so far. Recent work by Bartlett and Tewari (2009)
and Jaksch et al (2010) studied optimal exploration strategies for single-task learning.
Although most of the option-based transfer methods implicitly bias the exploration
strategy, the question of how exploration in one task should be adapted on the
basis of knowledge of previous related tasks has received little
attention so far.
Concept drift and continual learning. One of the main assumptions of transfer
learning is that a clear distinction between the tasks in M is possible. Nonetheless,
in many interesting applications there is no sharp division between source and target
tasks; rather, it is the task itself that changes over time. This problem, also known
as concept drift, is strictly related to the continual learning and lifelong learning
paradigm (Silver and Poirier, 2007) in which, as the learning agent autonomously
discovers new regions of a non-stationary environment, it also increases its capabil-
ity to solve tasks defined on that environment. Although tools from transfer
learning could probably be reused in this setting as well, novel approaches are needed
to deal with the non-stationarity of the environment and to track the changes in the
task at hand.
References
Antos, A., Szepesvári, C., Munos, R.: Learning near-optimal policies with Bellman-residual
minimization based fitted policy iteration and a single sample path. Machine Learning
Journal 71, 89–129 (2008)
Argyriou, A., Evgeniou, T., Pontil, M.: Convex multi-task feature learning. Machine Learning
Journal 73(3), 243–272 (2008)
Asadi, M., Huber, M.: Effective control knowledge transfer through learning skill and repre-
sentation hierarchies. In: Proceedings of the 20th International Joint Conference on Arti-
ficial Intelligence (IJCAI-2007), pp. 2054–2059 (2007)
Banerjee, B., Stone, P.: General game learning using knowledge transfer. In: Proceedings of
the 20th International Joint Conference on Artificial Intelligence (IJCAI-2007), pp. 672–
677 (2007)
Bartlett, P.L., Tewari, A.: Regal: a regularization based algorithm for reinforcement learning
in weakly communicating mdps. In: Proceedings of the Twenty-Fifth Conference on Un-
certainty in Artificial Intelligence (UAI-2009), pp. 35–42. AUAI Press, Arlington (2009)
Baxter, J.: A model of inductive bias learning. Journal of Artificial Intelligence Research 12,
149–198 (2000)
Ben-David, S., Schuller-Borbely, R.: A notion of task relatedness yielding provable
multiple-task learning guarantees. Machine Learning Journal 73(3), 273–287 (2008)
Bernstein, D.S.: Reusing old policies to accelerate learning on new mdps. Tech. rep., Univer-
sity of Massachusetts, Amherst, MA, USA (1999)
Bonarini, A., Lazaric, A., Restelli, M.: Incremental Skill Acquisition for Self-motivated
Learning Animats. In: Nolfi, S., Baldassarre, G., Calabretta, R., Hallam, J.C.T., Marocco,
D., Meyer, J.-A., Miglino, O., Parisi, D. (eds.) SAB 2006. LNCS (LNAI), vol. 4095, pp.
357–368. Springer, Heidelberg (2006)
Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press
(2006)
Crammer, K., Kearns, M., Wortman, J.: Learning from multiple sources. Journal of Machine
Learning Research 9, 1757–1774 (2008)
Drummond, C.: Accelerating reinforcement learning by composing solutions of automati-
cally identified subtasks. Journal of Artificial Intelligence Research 16, 59–104 (2002)
Engel, Y., Mannor, S., Meir, R.: Reinforcement learning with Gaussian processes. In: Pro-
ceedings of the 22nd International Conference on Machine Learning (ICML-2005),
pp. 201–208 (2005)
Ernst, D., Geurts, P., Wehenkel, L.: Tree-based batch mode reinforcement learning. Journal
of Machine Learning Research 6, 503–556 (2005)
Farahmand, A.M., Ghavamzadeh, M., Szepesvári, C., Mannor, S.: Regularized policy itera-
tion. In: Proceedings of the Twenty-Second Annual Conference on Advances in Neural
Information Processing Systems (NIPS-2008), pp. 441–448 (2008)
Fawcett, T., Callan, J., Matheus, C., Michalski, R., Pazzani, M., Rendell, L., Sutton, R. (eds.):
Constructive Induction Workshop at the Eleventh International Conference on Machine
Learning (1994)
Ferguson, K., Mahadevan, S.: Proto-transfer learning in markov decision processes using
spectral methods. In: Workshop on Structural Knowledge Transfer for Machine Learning
at the Twenty-Third International Conference on Machine Learning (2006)
Ferns, N., Panangaden, P., Precup, D.: Metrics for finite markov decision processes. In: Pro-
ceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI-2004),
pp. 162–169 (2004)
Ferrante, E., Lazaric, A., Restelli, M.: Transfer of task representation in reinforcement learn-
ing using policy-based proto-value functions. In: Proceedings of the Seventh International
Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS-2008), pp.
1329–1332 (2008)
Foster, D.J., Dayan, P.: Structure in the space of value functions. Machine Learning Jour-
nal 49(2-3), 325–346 (2002)
Gentner, D., Loewenstein, J., Thompson, L.: Learning and transfer: A general role for ana-
logical encoding. Journal of Educational Psychology 95(2), 393–408 (2003)
Gick, M.L., Holyoak, K.J.: Schema induction and analogical transfer. Cognitive Psychol-
ogy 15, 1–38 (1983)
Hauskrecht, M.: Planning with macro-actions: Effect of initial value function estimate on con-
vergence rate of value iteration. Tech. rep., Department of Computer Science, University
of Pittsburgh (1998)
Hengst, B.: Discovering hierarchy in reinforcement learning. PhD thesis, University of New
South Wales (2003)
Jaksch, T., Ortner, R., Auer, P.: Near-optimal regret bounds for reinforcement learning. Jour-
nal of Machine Learning Research 11, 1563–1600 (2010)
Kalmar, Z., Szepesvari, C.: An evaluation criterion for macro-learning and some results. Tech.
Rep. TR-99-01, Mindmaker Ltd. (1999)
Konidaris, G., Barto, A.: Autonomous shaping: knowledge transfer in reinforcement learn-
ing. In: Proceedings of the Twenty-Third International Conference on Machine Learning
(ICML-2006), pp. 489–496 (2006)
Konidaris, G., Barto, A.G.: Building portable options: Skill transfer in reinforcement learn-
ing. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence
(IJCAI-2007), pp. 895–900 (2007)
Langley, P.: Transfer of knowledge in cognitive systems. In: Talk, Workshop on Structural
Knowledge Transfer for Machine Learning at the Twenty-Third International Conference
on Machine Learning (2006)
Lazaric, A.: Knowledge transfer in reinforcement learning. PhD thesis, Politecnico di Milano
(2008)
Lazaric, A., Ghavamzadeh, M.: Bayesian multi-task reinforcement learning. In: Proceed-
ings of the Twenty-Seventh International Conference on Machine Learning, ICML-2010
(2010) (submitted)
Lazaric, A., Restelli, M., Bonarini, A.: Transfer of samples in batch reinforcement learning.
In: Proceedings of the Twenty-Fifth Annual International Conference on Machine Learn-
ing (ICML-2008), pp. 544–551 (2008)
Lazaric, A., Ghavamzadeh, M., Munos, R.: Finite-sample analysis of LSTD. In: Proceedings of
the Twenty-Seventh International Conference on Machine Learning, ICML-2010 (2010)
Li, H., Liao, X., Carin, L.: Multi-task reinforcement learning in partially observable stochastic
environments. Journal of Machine Learning Research 10, 1131–1186 (2009)
Madden, M.G., Howley, T.: Transfer of experience between reinforcement learning environ-
ments with progressive difficulty. Artificial Intelligence Review 21(3-4), 375–398 (2004)
Mahadevan, S., Maggioni, M.: Proto-value functions: A Laplacian framework for learning
representation and control in Markov decision processes. Journal of Machine Learning
Research 8, 2169–2231 (2007)
Maillard, O.A., Lazaric, A., Ghavamzadeh, M., Munos, R.: Finite-sample analysis of bell-
man residual minimization. In: Proceedings of the Second Asian Conference on Machine
Learning, ACML-2010 (2010)
McGovern, A., Barto, A.G.: Automatic discovery of subgoals in reinforcement learning using
diverse density. In: Proceedings of the Eighteenth International Conference on Machine
Learning, ICML 2001 (2001)
Mehta, N., Natarajan, S., Tadepalli, P., Fern, A.: Transfer in variable-reward hierarchical
reinforcement learning. Machine Learning Journal 73(3), 289–312 (2008)
Menache, I., Mannor, S., Shimkin, N.: Q-cut - dynamic discovery of sub-goals in reinforce-
ment learning. In: Proceedings of the Thirteenth European Conference on Machine Learn-
ing, pp. 295–306 (2002)
Munos, R., Szepesvári, C.: Finite time bounds for fitted value iteration. Journal of Machine
Learning Research 9, 815–857 (2008)
Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowledge and Data
Engineering 22(22), 1345–1359 (2010)
Perkins, D.N., Salomon, G., Press, P.: Transfer of learning. In: International Encyclopedia of
Education. Pergamon Press (1992)
Perkins, T.J., Precup, D.: Using options for knowledge transfer in reinforcement learning.
Tech. rep., University of Massachusetts, Amherst, MA, USA (1999)
Phillips, C.: Knowledge transfer in markov decision processes. McGill School of Computer
Science (2006),
http://www.cs.mcgill.ca/˜martin/usrs/phillips.pdf
Ravindran, B., Barto, A.G.: Relativized options: Choosing the right transformation. In: Pro-
ceedings of the Twentieth International Conference on Machine Learning (ICML 2003),
pp. 608–615 (2003)
Sherstov, A.A., Stone, P.: Improving action selection in MDP’s via knowledge transfer. In:
Proceedings of the Twentieth National Conference on Artificial Intelligence, AAAI-2005
(2005)
Silver, D.: Selective transfer of neural network task knowledge. PhD thesis, University of
Western Ontario (2000)
Silver, D.L., Poirier, R.: Requirements for Machine Lifelong Learning. In: Mira, J., Álvarez,
J.R. (eds.) IWINAC 2007, Part I. LNCS, vol. 4527, pp. 313–319. Springer, Heidelberg
(2007)
Simsek, O., Wolfe, A.P., Barto, A.G.: Identifying useful subgoals in reinforcement learning
by local graph partitioning. In: Proceedings of the Twenty-Second International Confer-
ence of Machine Learning, ICML 2005 (2005)
Singh, S., Barto, A., Chentanez, N.: Intrinsically motivated reinforcement learning. In:
Proceedings of the Eighteenth Annual Conference on Neural Information Processing
Systems, NIPS-2004 (2004)
Soni, V., Singh, S.P.: Using homomorphisms to transfer options across continuous reinforce-
ment learning domains. In: Proceedings of the Twenty-first National Conference on Arti-
ficial Intelligence, AAAI-2006 (2006)
Stone, P., Sutton, R.S., Kuhlmann, G.: Reinforcement learning for RoboCup-soccer keep-
away. Adaptive Behavior 13(3), 165–188 (2005)
Sunmola, F.T., Wyatt, J.L.: Model transfer for markov decision tasks via parameter matching.
In: Proceedings of the 25th Workshop of the UK Planning and Scheduling Special Interest
Group, PlanSIG 2006 (2006)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge
(1998)
Sutton, R.S., Precup, D., Singh, S.: Between mdps and semi-mdps: a framework for temporal
abstraction in reinforcement learning. Artificial Intelligence 112, 181–211 (1999)
Talvitie, E., Singh, S.: An experts algorithm for transfer learning. In: Proceedings of the
20th International Joint Conference on Artificial Intelligence (IJCAI-2007), pp. 1065–
1070 (2007)
Tanaka, F., Yamamura, M.: Multitask reinforcement learning on the distribution of mdps. In:
IEEE International Symposium on Computational Intelligence in Robotics and Automa-
tion, vol. 3, pp. 1108–1113 (2003)
Taylor, M.E., Stone, P.: Behavior transfer for value-function-based reinforcement learning.
In: Proceedings of the Fourth International Joint Conference on Autonomous Agents and
Multiagent Systems (AAMAS-2005), pp. 53–59 (2005)
Taylor, M.E., Stone, P.: Representation transfer for reinforcement learning. In: AAAI 2007
Fall Symposium on Computational Approaches to Representation Change during Learn-
ing and Development (2007)
Taylor, M.E., Stone, P.: Transfer learning for reinforcement learning domains: A survey. Jour-
nal of Machine Learning Research 10(1), 1633–1685 (2009)
Taylor, M.E., Stone, P., Liu, Y.: Value functions for RL-based behavior transfer: A compara-
tive study. In: Proceedings of the Twentieth National Conference on Artificial Intelligence,
AAAI-2005 (2005)
Taylor, M.E., Stone, P., Liu, Y.: Transfer learning via inter-task mappings for temporal differ-
ence learning. Journal of Machine Learning Research 8, 2125–2167 (2007a)
Taylor, M.E., Whiteson, S., Stone, P.: Transfer via inter-task mappings in policy search
reinforcement learning. In: Proceedings of the Sixth International Joint Conference on
Autonomous Agents and Multiagent Systems, AAMAS-2007 (2007b)
Taylor, M.E., Jong, N.K., Stone, P.: Transferring instances for model-based reinforcement
learning. In: Proceedings of the European Conference on Machine Learning (ECML-
2008), pp. 488–505 (2008a)
Taylor, M.E., Kuhlmann, G., Stone, P.: Autonomous transfer for reinforcement learning. In:
Proceedings of the Seventh International Joint Conference on Autonomous Agents and
Multiagent Systems (AAMAS-2008), pp. 283–290 (2008b)
Thorndike, E.L., Woodworth, R.S.: The influence of improvement in one mental function
upon the efficiency of other functions. Psychological Review 8 (1901)
Torrey, L., Walker, T., Shavlik, J., Maclin, R.: Using Advice to Transfer Knowledge Acquired
in one Reinforcement Learning Task to Another. In: Gama, J., Camacho, R., Brazdil,
P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 412–424.
Springer, Heidelberg (2005)
Torrey, L., Shavlik, J., Walker, T., Maclin, R.: Skill Acquisition Via Transfer Learning and
Advice Taking. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS
(LNAI), vol. 4212, pp. 425–436. Springer, Heidelberg (2006)
Utgoff, P.: Shift of bias for inductive concept learning. Machine Learning 2, 163–190 (1986)
Walsh, T.J., Li, L., Littman, M.L.: Transferring state abstractions between mdps. In: ICML
Workshop on Structural Knowledge Transfer for Machine Learning (2006)
Watkins, C., Dayan, P.: Q-learning. Machine Learning 8, 279–292 (1992)
Wilson, A., Fern, A., Ray, S., Tadepalli, P.: Multi-task reinforcement learning: a hierarchi-
cal Bayesian approach. In: Proceedings of the Twenty-Fourth International Conference on
Machine Learning (ICML-2007), pp. 1015–1022 (2007)
Chapter 6
Sample Complexity Bounds of Exploration
Lihong Li
Yahoo! Research, 4401 Great America Parkway, Santa Clara, CA, USA 95054
e-mail: [email protected]
6.1 Introduction
6.2 Preliminaries
The chapter studies online reinforcement learning, which is more challenging than
the offline alternative, and focuses on agents that aim to maximize γ -discounted
cumulative reward for some given γ ∈ (0,1).
Throughout the chapter, we model the environment as a Markov decision process
(MDP) M = ⟨S, A, T, R, γ⟩ with state space S, action space A, transition function T,
reward function R, and discount factor γ. Standard notation in reinforcement learn-
ing is used, whose definition is found in Chapter 1. In particular, given a policy π
that maps S to A, V^π_M(s) and Q^π_M(s,a) are the state and state–action value functions
of π in MDP M; V^∗_M(s) and Q^∗_M(s,a) are the corresponding optimal value functions.
If M is clear from the context, it may be omitted and the value functions are de-
noted by V^π, Q^π, V^∗, and Q^∗, respectively. Finally, we use Õ(·) as a shorthand for
the common big-O notation with logarithmic factors suppressed. Online RL is made
precise by the following definition.
Definition 6.1. The online interaction between the agent and environment, modeled
as an MDP M = ⟨S, A, T, R, γ⟩, proceeds in discrete timesteps t = 1, 2, 3, . . . :
1. The agent perceives the current state st ∈ S of the environment, then takes an
action at ∈ A.
2. In response, the environment sends an immediate reward rt ∈ [0,1] to the agent,1
and moves to a next state st+1 ∈ S. This transition is governed by the dynamics
of the MDP. In particular, the expectation of rt is R(st ,at ), and the next state
st+1 is drawn randomly from the distribution T (st ,at ,·).
3. The clock ticks: t ← t + 1.
In addition, we assume an upper bound Vmax on the optimal value function such that
Vmax ≥ V^∗(s) for all s ∈ S. Since we have assumed R(s,a) ∈ [0,1], we always have
Vmax ≤ 1/(1 − γ), although in many situations we have Vmax ≪ 1/(1 − γ).
A number of heuristics have been proposed in the literature. In practice, the most
popular exploration strategy is probably ε -greedy: the agent chooses a random ac-
tion with probability ε , and the best action according to its current value-function
estimate otherwise. While the ε -greedy rule often guarantees sufficient exploration,
it may not be efficient since exploration occurs uniformly on all actions. A number
of alternatives such as soft-max, counter-based, recency-based, optimistic initializa-
tion, and exploration bonus are sometimes found more effective empirically (Sutton
and Barto, 1998; Thrun, 1992). Advanced techniques exist that guide exploration
using interval estimation (Kaelbling, 1993; Meuleau and Bourgine, 1999; Wiering
and Schmidhuber, 1998), information gain (Dearden et al, 1999), MDP characteris-
tics (Ratitch and Precup, 2002), or expert demonstration (Abbeel and Ng, 2005), to
name a few.
Unfortunately, while these heuristics work well in some applications, they may
be inefficient in others. Below is a prominent example (Whitehead, 1991), where
ε -greedy is provably inefficient.
Example 6.1. Consider an MDP with N + 1 states and 2 actions, as depicted in Fig-
ure 6.1. State 1 is the start state and state G is the absorbing goal state. Taking the
solid action transports the agent to the state to the right, while taking the dashed
action resets the agent to the start state 1 and it has to re-start from 1 trying to get
1 It is not a real restriction to assume rt ∈ [0,1] since any bounded reward function can be
shifted and rescaled to the range [0,1] without affecting the optimal policy (Ng et al, 1999).
to the goal state. To simplify exposition, we use γ = 1, and R(s,a) = −1 for all
state–actions unless s = G, in which case R(s,a) = 0.
An agent that uses the ε-greedy exploration rule will, at each timestep, be reset to
the start state with probability at least ε/2. Therefore, the probability that the agent
always chooses the solid action until it reaches the goal is (1 − ε/2)^N, which is
exponentially small in N for any fixed value of ε. In contrast, the optimal policy
requires only N steps, and the provably efficient algorithms described later in this
chapter require only a small number of steps (polynomial in N) to find the goal as
well as the optimal policy.
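As an illustration of this example, the following small simulation (a sketch, not an algorithm from the literature) measures how long an agent needs to reach the goal when, at every state, it takes the solid action with probability 1 − ε/2 and is otherwise reset, which is the best case for an ε-greedy agent whose greedy action is already the solid one.

import random

def steps_to_goal(n, epsilon, rng):
    """Steps to reach the goal in the chain MDP of Example 6.1 when the agent
    moves right with probability 1 - epsilon/2 and is reset to state 1 otherwise."""
    state, steps = 1, 0
    while state <= n:            # states 1..n; the goal is reached once state == n + 1
        steps += 1
        if rng.random() < 1.0 - epsilon / 2.0:
            state += 1           # solid action: move one state to the right
        else:
            state = 1            # dashed action: reset to the start state
    return steps

rng = random.Random(0)
for n in (5, 10, 20):
    avg = sum(steps_to_goal(n, 0.4, rng) for _ in range(500)) / 500.0
    print(n, round(avg, 1))      # the average grows exponentially with n for fixed epsilon

With ε = 0.4, the average number of steps grows on the order of (1 − ε/2)^{−N}, in line with the discussion above, whereas the optimal policy needs only N steps.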
This example demonstrates that a poor exploration rule may be exponentially inef-
ficient, as can be shown for many other heuristics. Furthermore, inefficient explo-
ration can have a substantial impact in practice, as illustrated by a number of empir-
ical studies where techniques covered in this chapter can lead to smarter exploration
schemes that outperform popular heuristics (see, e.g., Brunskill et al (2009), Li and
Littman (2010), and Nouri and Littman (2009, 2010)). Such problems raise the crit-
ical need for designing algorithms that are provably efficient, which is the focus of
the present chapter.
Before developing provably efficient algorithms, one must define what is meant by
efficient exploration. A few choices are surveyed below.
In measuring the sample complexity of exploration (or sample complexity for short)
of a reinforcement-learning algorithm A, it is natural to relate the complexity to the
number of steps in which the algorithm does not choose near-optimal actions. To do
so, we treat A as a non-stationary policy that maps histories to actions. Two addi-
tional inputs are needed: ε > 0 and δ ∈ (0,1). The precision parameter, ε , controls
the quality of behavior we require of the algorithm (i.e., how close to optimality do
we desire to be). The confidence parameter, δ , measures how certain we want to be
of the algorithm’s performance. As both parameters decrease, greater exploration
and learning are necessary, as more is expected of the algorithm.
Definition 6.2. (Kakade, 2003) Let ct = (s1, a1, r1, s2, a2, r2, . . . , st) be a path gener-
ated by executing a reinforcement-learning algorithm A in an MDP M up to timestep
t. The algorithm is viewed as a non-stationary policy, denoted At at timestep t. Let
(rt, rt+1, . . .) be the random sequence of rewards received by A after timestep t. The
state value at st, denoted V^At(st), is the expectation of the discounted cumulative
reward, rt + γ rt+1 + γ² rt+2 + · · ·. Given a fixed ε > 0, the sample complexity of A is
the number of timesteps τ such that the policy at timestep τ, Aτ, is not ε-optimal:
V^Aτ(sτ) ≤ V^∗(sτ) − ε.
The above definition of sample complexity was first used in the analysis of
Rmax (Kakade, 2003), and will be the focus of the present chapter. It is worth
mentioning that the analyses of the E³ family of algorithms (Kakade et al, 2003; Kearns
and Koller, 1999; Kearns and Singh, 2002) use slightly different definitions of effi-
cient learning. In particular, these algorithms are required to halt after a polynomial
amount of time and output a near-optimal policy for the last state, with high prob-
ability. Our analyses here are essentially equivalent, but simpler in the sense that
mixing-time arguments can be avoided.
The notion of PAC-MDP is quite general, avoiding the need for assumptions
such as reset (Fiechter, 1994, 1997), parallel sampling oracle (Even-Dar et al, 2002;
Kearns and Singh, 1999), or reachability of the transition matrices (see the next
subsection).2 On the other hand, as will be shown later, the criterion of achieving
small sample complexity of exploration is tractable in many challenging problems.
Definition 6.4. Given an MDP M and a start state s1, the T-step return of an algo-
rithm A is the total reward the agent collects by starting in s1 and following A:
    R_A(M, s1, T) := ∑_{t=1}^{T} rt .
The T-step regret of A compares this return with that of an optimal policy:
    L_A(T) := T ρ∗ − R_A(M, s1, T),                              (6.1)
where ρ∗ denotes the average per-step reward of an optimal policy.
2 It should be noted that current PAC-MDP analysis relies heavily on discounting, which
essentially makes the problem a finite-horizon one. This issue will be made explicit in the
proof of Theorem 6.1.
were made to derive strong regret bounds even when the reward function of the
MDP may change in an adaptively adversarial manner in every step, given complete
knowledge of the fixed transition probabilities (see, e.g., Neu et al (2011) and the
references therein).
In the heaven-or-hell example in Section 6.3.2, the sample complexity of any algo-
rithm in this problem is trivially 1, in contrast to the poor linear regret bound. The
drastic difference highlights a fundamental difference between sample complexity
and regret: the former is defined on the states visited by the algorithm, while the lat-
ter is defined with respect to the rewards along the trajectory of the optimal policy.
One may argue that this MDP is difficult to solve since mistakes in taking actions
cannot be recovered later, so the poor guarantee in terms of regret seems more con-
sistent with the hardness of the problem than, say, sample complexity. However,
mistakes are unavoidable for a trial-and-error agent as in reinforcement learning.
Thus it seems more natural to evaluate an algorithm conditioned on the history. In
this sense, the heaven-or-hell MDP is easy to solve since there is only one place
where the agent can make a mistake.
One may define the loss in cumulative rewards in a different way to avoid reach-
ability assumptions needed by regret analysis. One possibility is to use the assump-
tion of reset (or episodic tasks), with which the agent can go back to the same start
state periodically (Fiechter, 1994, 1997). However, resets are usually not available
in most online reinforcement-learning problems.
Another possibility, called the average loss (Strehl and Littman, 2008a), com-
pares the loss in cumulative reward of an agent on the sequence of states the agent
actually visits:
Definition 6.5. Fix T ∈ N and run an RL algorithm A for T steps. The instanta-
neous loss lt at timestep t is the difference between the optimal state value and the
cumulative discounted return:
    lt := V^∗(st) − ∑_{τ=t}^{T} γ^{τ−t} rτ .
This section gives a generic theorem for proving polynomial sample complexity
for a class of online RL algorithms. This result serves as a basic tool for further
investigation of model-based and model-free approaches in this chapter.
Consider an RL algorithm A that maintains an estimated state–action value func-
tion Q(·,·), and let Qt denote the value of Q immediately before the t-th action of
the agent. We say that A is greedy if it always chooses an action that maximizes
its current value function estimate; namely, at = arg max_{a∈A} Qt(st, a), where st is
the t-th state reached by the agent. For convenience, define Vt(s) := max_{a∈A} Qt(s, a).
For our discussions, two important definitions are needed, in which we denote by
K ⊆ S × A an arbitrary set of state–actions. While K may be arbitrarily defined, it
is often understood to be a subset that the agent already “knows” and need not be
explored. Finally, a state s is called “known” if (s,a) ∈ K for all a ∈ A.
Definition 6.6. We define Et to be the event, called the escape event (from K), that
some state–action (s,a) ∉ K is experienced by algorithm A at timestep t.
where I(s′ = s) is the indicator function. If we replace the true dynamics T and R by
respective estimates T̂ and R̂ in the right-hand side of Equations 6.2 and 6.3, then
the resulting MDP, denoted M̂K , is called an empirical known state–action MDP
with respect to K.
Intuitively, the known state–action MDP MK is an optimistic model of the true MDP
M as long as Q is optimistic (namely, Q(s,a) ≥ Q∗ (s,a) for all (s,a)). Furthermore,
the two MDPs’ dynamics agree on state–actions in K. Consider two cases:
1. If it takes too many steps for the agent to navigate from its current state s to
some unknown state by any policy, the value of exploration is small due to
γ -discounting. In other words, all unknown states are essentially irrelevant to
optimal action selection in state s, so the agent may just follow the optimal
policy of MK , which is guaranteed to be near-optimal for s in M.
2. If, on the other hand, an unknown state is close to the current state, the con-
struction of MK (assuming Q is optimistic) will encourage the agent to navigate
to the unknown states (the “escape” event) for exploration. The policy in the
current state may not be near-optimal.
Hence, the number of times the second case happens is linked directly to the sample
complexity of the algorithm. The known state–action MDP is the key concept to
balance exploration and exploitation in an elegant way.
The intuition above is formalized in the following generic theorem (Li, 2009),
which slightly improves the original result (Strehl et al, 2006a). It provides a com-
mon basis for all our sample-complexity analyses later in this chapter.
Theorem 6.1. Let A(ε, δ) be an algorithm that takes ε and δ as inputs (in addition
to other algorithm-specific inputs) and acts greedily according to its estimated state–
action value function, denoted Qt at timestep t. Suppose that on every timestep t,
there exists a set Kt of state–actions that depends only on the agent’s history up to
timestep t. We assume that Kt = Kt+1 unless, during timestep t, an update to some
state–action value occurs or the escape event Et happens. Let MKt be the known
state–action MDP (defined using Kt and Qt in Definition 6.7) and πt be the greedy
policy with respect to Qt. Suppose that for any inputs ε and δ, with probability at
least 1 − δ/2, the following conditions hold for all timesteps t:
1. (Optimism) Vt(st) ≥ V^∗(st) − ε/4,
2. (Accuracy) Vt(st) − V^{πt}_{M_{Kt}}(st) ≤ ε/4, and
3. (Bounded Surprises) The total number of updates of action–value estimates plus
the number of times the escape event from Kt, Et, can occur is bounded by a
function ζ(ε, δ, |M|).
Then, with probability at least 1 − δ, the sample complexity of A(ε, δ) is
    O( (Vmax / (ε(1 − γ))) · ( ζ(ε, δ, |M|) + ln(1/δ) ) · ln(1/(ε(1 − γ))) ).
Proof. (sketch) The proof consists of two major parts. First, let W denote the
event that, after following the non-stationary policy At from state st in M for
H = (1/(1 − γ)) ln(4/(ε(1 − γ))) timesteps, one of the following two events occurs: (a) the state–
action value function estimate is changed for some (s,a); or (b) an escape event from
Kt happens; namely, some (s,a) ∉ Kt is experienced. The optimism and accuracy
conditions then imply an enlightening result in the analysis, known as the “Implicit
Explore or Exploit” lemma (Kakade, 2003; Kearns and Singh, 2002):
    V^At(st) ≥ V^∗(st) − 3ε/4 − Pr(W) Vmax .
In other words, if Pr(W) < ε/(4Vmax), then At is ε-optimal in st (exploitation); otherwise, the
agent will experience event W with probability at least ε/(4Vmax) (exploration).
The second part of the proof bounds the number of times when Pr(W) ≥ ε/(4Vmax)
happens. To do so, we split the entire trajectory into pieces of length H, each of
which starts in state siH+1 for i ∈ N: (s1, . . . , sH), (sH+1, . . . , s2H), . . .. We then view as
Bernoulli trials the H-step sub-trajectories of the agent starting in siH+1 and follow-
ing algorithm A. A trial succeeds if event W happens. Clearly, outcomes of these
Bernoulli trials are independent of each other, conditioned on the starting states of
the sub-trajectories. Since each of the Bernoulli trials succeeds with probability at
least ε/(4Vmax) and the total number of successes is at most ζ(ε, δ, |M|) (due to the
bounded-surprises condition), a standard concentration argument bounds, with high
probability, the total number of such segments and hence yields the claimed sample
complexity.
6.5.1 Rmax
Algorithm 12 gives complete pseudocode for Rmax. Two critical ideas are used
in the design of Rmax. One is the distinction between known states—states where
the transition distribution and rewards can be accurately inferred from observed
transitions—and unknown states. The other is the principle for exploration known
as “optimism in the face of uncertainty” (Brafman and Tennenholtz, 2002).
The first key component in Rmax is the notion of known state–actions. Suppose
an action a ∈ A has been taken m times in a state s ∈ S. Let r[i] and s′[i] denote the
i-th observed reward and next state, respectively, for i = 1, 2, . . . , m. A maximum-
likelihood estimate of the reward and transition function for (s,a) is given by:
    T̂(s, a, s′) = |{i | s′[i] = s′}| / m,  ∀ s′ ∈ S,               (6.4)
    R̂(s, a) = (r[1] + r[2] + · · · + r[m]) / m.                     (6.5)
Intuitively, we expect these estimates to converge almost surely to their true values,
R(s,a) and T(s,a,s′), respectively, as m becomes large. When m is large enough (in a
sense made precise soon), the state–action (s,a) is called “known” since we have an
accurate estimate of the reward and transition probabilities for it; otherwise, (s,a) is
“unknown.” Let Kt be the set of known state–actions at timestep t (before the t-th
action is taken). Clearly, K1 = ∅ and Kt ⊆ Kt+1 for all t.
The second key component in Rmax explicitly deals with exploration. Accord-
ing to the simulation lemma, once all state–actions become known, the agent is able
to approximate the unknown MDP with high accuracy and thus act near-optimally.
It is thus natural to encourage visitation to unknown state–actions. This goal is
achieved by assigning the optimistic value Vmax to unknown state–actions such that,
for all timesteps t, Qt(s,a) = Vmax if (s,a) ∉ Kt. During execution of the algorithm,
if action at has been tried m times in state st , then Kt+1 = Kt ∪ {(st ,at )}. It is easy
to see that the empirical state–action MDP solved by Rmax at timestep t coincides
with the empirical known state–action MDP in Definition 6.7 (with Kt and Qt ).
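The bookkeeping behind Equations 6.4 and 6.5 and the known set Kt can be sketched as follows (class and attribute names are hypothetical; the planning step on the resulting known state–action MDP is omitted):

from collections import defaultdict

class RmaxModel:
    """Sketch of Rmax's empirical model (Equations 6.4-6.5) and known set."""

    def __init__(self, m, v_max):
        self.m = m                            # visits needed before (s,a) is "known"
        self.v_max = v_max                    # optimistic value for unknown pairs
        self.count = defaultdict(int)         # number of visits to (s,a)
        self.reward_sum = defaultdict(float)  # sum of observed rewards at (s,a)
        self.next_count = defaultdict(int)    # number of observed (s,a,s') transitions

    def observe(self, s, a, r, s_next):
        if self.count[(s, a)] >= self.m:
            return                            # Rmax freezes estimates once (s,a) is known
        self.count[(s, a)] += 1
        self.reward_sum[(s, a)] += r
        self.next_count[(s, a, s_next)] += 1

    def is_known(self, s, a):
        return self.count[(s, a)] >= self.m

    def r_hat(self, s, a):                    # Equation 6.5
        return self.reward_sum[(s, a)] / self.count[(s, a)]

    def t_hat(self, s, a, s_next):            # Equation 6.4
        return self.next_count[(s, a, s_next)] / self.count[(s, a)]

A planner is then run on the empirical known state–action MDP, with Q(s,a) fixed to v_max for every pair that is not yet known, and the agent acts greedily with respect to the resulting value function.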
Theorem 6.1 gives a convenient way to show Rmax is PAC-MDP. While a com-
plete proof is available in the literature (Kakade, 2003; Strehl et al, 2009), we give
a proof sketch that highlights some of the important steps in the analysis.
Proof. (sketch) Standard concentration inequalities imply that, with high probability,
the error terms |R̂(s,a) − R(s,a)| and ∑_{s′∈S} |T̂(s,a,s′) − T(s,a,s′)| are small enough
for the simulation lemma to apply, for any (s,a), as long as m ≥ m0 for some
m0 = Õ( |S| Vmax² / (ε²(1 − γ)²) ). With these highly
accurate estimates of reward and transition probabilities, the optimism and accuracy
conditions in Theorem 6.1 then follow immediately by the simulation lemma. Fur-
thermore, since any unknown state–action can be experienced at most m times (after
which it will become known), the bounded-surprises condition also holds with
    ζ(ε, δ, |M|) = |S| |A| m = Õ( |S|² |A| Vmax² / (ε²(1 − γ)²) ).
The sample complexity of Rmax then follows immediately.
The Rmax algorithm and its sample complexity analysis may be improved in vari-
ous ways. For instance, the binary concept of knownness of a state–action may be
replaced by the use of interval estimation that smoothly quantifies the prediction un-
certainty in the maximum-likelihood estimates, yielding the MBIE algorithm (Strehl
and Littman, 2008a). Furthermore, it is possible to replace the constant optimistic
value Vmax by a non-constant, optimistic value function to gain further improve-
ment in the sample complexity bound (Strehl et al, 2009; Szita and Lőrincz, 2008).
More importantly, we will show in the next subsection how to extend Rmax from
finite MDPs to general, potentially infinite, MDPs with the help of a novel supervised-
learning model.
Finally, we note that a variant of Rmax, known as MoRmax, was recently pro-
posed (Szita and Szepesvári, 2010) with a sample complexity of
Õ( |S| |A| Vmax² / (ε²(1 − γ)⁴) ). Dif-
ferent from Rmax which freezes its transition/reward estimates for a state–action
once it becomes known, MoRmax sometimes re-estimates these quantities when
new m-tuples of samples are collected. While this algorithm and its analysis are
more complicated, its sample complexity significantly improves over that of Rmax
in terms of |S|, 1/ε , and 1/Vmax , at a cost of an additional factor of 1/(1 − γ ). Its
dependence on |S| matches the currently best lower bound (Li, 2009).
Rmax is not applicable to continuous-state MDPs that are common in some applica-
tions. Furthermore, its sample complexity has an explicit dependence on the number
of state–actions in the MDP, which implies that it is inefficient even in finite MDPs
when the state or action space is large. As in supervised learning, generalization is
needed to avoid explicit enumeration of every state–action, but generalization can
also make exploration more challenging. Below, we first introduce a novel model
for supervised learning, which allows one to extend Rmax to arbitrary MDPs.
[Figure: the KWIK (“knows what it knows”) learning protocol. Given a hypothesis class and accuracy parameters, the environment secretly and adversarially picks a target and the inputs; on each input, the learner either makes a prediction (“I know”) or outputs “I don’t know” and then observes the correct output.]
Although KWIK is a harder learning model than existing models like Probably Ap-
proximately Correct (Valiant, 1984) and Mistake Bound (Littlestone, 1987), a num-
ber of useful hypothesis classes are shown to be efficiently KWIK-learnable. Below,
we review some representative examples from Li et al (2011).
Consider, for instance, KWIK-learning a deterministic linear function f(x) = w⊤x
over inputs x ∈ R^d for an unknown weight vector w, and let D = {(v1, f(v1)), . . . , (vk, f(vk))}
be the current set of training examples, whose stored inputs are linearly independent.
On a new input xt, the algorithm first detects whether xt is linearly
independent of the previous inputs stored in D. If xt is linearly independent, then the
algorithm predicts ⊥, observes the output yt = f(xt), and then expands the training
set: D ← D ∪ {(xt, yt)}. Otherwise, there exist k real numbers, a1, a2, · · · , ak, such
that xt = a1 v1 + a2 v2 + . . . + ak vk. In the latter case, we can accurately predict the
value of f(xt): f(xt) = a1 f(v1) + a2 f(v2) + . . . + ak f(vk). Furthermore, since D can
contain at most d linearly independent inputs, the algorithm’s KWIK bound is d.
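A minimal sketch of this KWIK learner, assuming numpy is available and using None in place of ⊥ (class and method names are hypothetical):

import numpy as np

class KWIKLinear:
    """KWIK learner for a deterministic linear function f(x) = w.x over R^d.
    It answers None ("I don't know") at most d times."""

    def __init__(self):
        self.X, self.y = [], []               # stored linearly independent inputs/outputs

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        if self.X:
            A = np.vstack(self.X)             # k x d matrix of stored inputs
            coeffs, _, _, _ = np.linalg.lstsq(A.T, x, rcond=None)
            if np.allclose(A.T @ coeffs, x):  # x is a linear combination of stored inputs
                return float(np.dot(coeffs, self.y))
        return None                           # "I don't know": request the true output

    def observe(self, x, fx):
        # Called only after predicting None; x is independent of the stored inputs.
        self.X.append(np.asarray(x, dtype=float))
        self.y.append(float(fx))

The learner outputs None at most d times, once per linearly independent input it must store, which matches the KWIK bound d.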
6.5.2.2 KWIK-Rmax
KWIK provides a formal learning model for studying prediction algorithms with un-
certainty awareness. It is natural to combine a KWIK algorithm for learning MDPs
with the basic Rmax algorithm. The main result in this subsection is the following:
if a class of MDPs can be KWIK-learned, then there exists an Rmax-style algorithm
that is PAC-MDP for this class of MDPs.
We first define KWIK-learnability of a class of MDPs, which is motivated by the
simulation lemma. For any set A, we denote by PA the set of probability distributions
defined over A.
Definition 6.10. Fix the state space S, action space A, and discount factor γ.
1. Define X = S × A, YT = PS, and ZT = S. Let HT ⊆ YT^X be a set of transi-
tion functions of an MDP. HT is (efficiently) KWIK-learnable if, in the accuracy
requirement of Definition 6.9, ‖T̂(· | s,a) − T(· | s,a)‖ is interpreted as the ℓ1
distance defined by:
    ∑_{s′∈S} |T̂(s′ | s,a) − T(s′ | s,a)|          if S is countable,
    ∫_{s′∈S} |T̂(s′ | s,a) − T(s′ | s,a)| ds′       otherwise.
1. The definitions of T̂ and R̂ are conceptual rather than operational. For finite
MDPs, one may represent T̂ by a matrix of size O(|S|2 |A|) and R̂ by a vector of
size O(|S| |A|). For structured MDPs, more compact representations are possi-
ble. For instance, MDPs with linear dynamical systems may be represented by
matrices of finite dimension (Strehl and Littman, 2008b; Brunskill et al, 2009),
and factored-state MDPs can be represented by a dynamic Bayes net (Kearns
and Koller, 1999).
2. It is unnecessary to update T̂ and R̂ and recompute Qt for every timestep t. The
known state–action MDP M̂ (and thus Q∗M̂ and Qt ) remains unchanged unless
some unknown state–action becomes known. Therefore, one may update M̂ and
Qt only when AT or AR obtain new samples in lines 13 or 16.
3. It is unnecessary to compute Qt for all (s,a). In fact, according to Theorem 6.1,
it suffices to guarantee that Qt is εP-accurate in state st: |Qt(st, a) − Q^∗_M̂(st, a)| <
εP for all a ∈ A. This kind of local planning often requires much less computa-
tion than global planning.
4. Given the approximate MDP M̂ and the current state st , the algorithm computes
a near-optimal action for st . This step can be done using dynamic programming
for finite MDPs. In general, however, doing so is computationally expensive.
Fortunately, recent advances in approximate local planning have made it possi-
ble for large-scale problems (Kearns et al, 2002; Kocsis and Szepesvári, 2006;
Walsh et al, 2010a).
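For finite state and action sets, the combination described above can be sketched as follows; kwik_T and kwik_R stand for the KWIK learners AT and AR (hypothetical interfaces returning None when they would predict ⊥), and routing unknown pairs to an absorbing optimistic state is one common way of realizing Q(s,a) = Vmax for unknown (s,a):

def build_known_mdp(states, actions, kwik_T, kwik_R, v_max, gamma):
    """Sketch: assemble an empirical known state-action MDP for KWIK-Rmax.
    kwik_T(s, a) -> dict {s_next: prob} or None ("I don't know")
    kwik_R(s, a) -> float or None"""
    OPT = "__optimistic__"                       # absorbing dummy state with value v_max
    T, R = {}, {}
    for s in list(states) + [OPT]:
        for a in actions:
            t_hat = kwik_T(s, a) if s != OPT else None
            r_hat = kwik_R(s, a) if s != OPT else None
            if t_hat is None or r_hat is None:   # unknown state-action
                T[(s, a)] = {OPT: 1.0}
                R[(s, a)] = (1.0 - gamma) * v_max  # makes the value of OPT equal v_max
            else:                                # known state-action: use the estimates
                T[(s, a)] = dict(t_hat)
                R[(s, a)] = float(r_hat)
    return T, R

Any planner (e.g., value iteration for finite MDPs, or the approximate local planners cited in remark 4) can then be run on (T, R), and the greedy action for the current state is executed.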
The sample complexity of KWIK-Rmax is given by the following theorem. It shows
that the algorithm’s sample complexity of exploration scales linearly with the KWIK
bound of learning the MDP. For every KWIK-learnable class M of MDPs, KWIK-
Rmax is automatically instantiated to a PAC-MDP algorithm for M with appropri-
ate KWIK learners.
Theorem 6.3. Let M be a class of MDPs with state space S and action space A.
If M can be (efficiently) KWIK-learned by algorithms AT (for transition functions)
and AR (for reward functions) with respective KWIK bounds BT and BR , then KWIK-
Rmax is PAC-MDP in M. In particular, if the following parameters are used,
    εT = ε(1 − γ)/(16 Vmax),   εR = ε(1 − γ)/16,   εP = ε(1 − γ)/24,   δT = δR = δ/4,
then the sample complexity of exploration of KWIK-Rmax is
    O( (Vmax / (ε(1 − γ))) · ( BT(εT, δT) + BR(εR, δR) + ln(1/δ) ) · ln(1/(ε(1 − γ))) ).
We omit the proof which is a generalization of that for Rmax (Theorem 6.1).
Instead, we describe a few classes of KWIK-learnable MDPs and show how
KWIK-Rmax unifies and extends previous PAC-MDP algorithms. Since transition
probabilities are usually more difficult to learn than rewards, we only describe how
to KWIK-learn transition functions. Reward functions may be KWIK-learned using
the same algorithmic building blocks. More examples are found in Li et al (2011).
Example 6.7. In many robotics and adaptive control applications, the systems being
manipulated have infinite state and action spaces, and their dynamics are governed
by linear equations (Abbeel and Ng, 2005; Strehl and Littman, 2008b). Here, S ⊆
R^{nS} and A ⊆ R^{nA} are the vector-valued state and action spaces, respectively. State
transitions follow a multivariate normal distribution: given the current state–action
(s,a), the next state s′ is randomly drawn from N(Fφ(s,a), Σ), where φ : R^{nS+nA} →
R^n is a basis function satisfying ‖φ(·,·)‖ ≤ 1, F ∈ R^{nS×n} is a matrix, and Σ ∈ R^{nS×nS} is
a covariance matrix. We assume that φ and Σ are given, but F is unknown.
For such MDPs, each component of the mean vector E[s′] is a linear function of
φ(s,a). Therefore, nS instances of noisy linear regression may be used to KWIK-
learn the nS rows of F.⁵ If individual components of E[s′] can be predicted accu-
rately by the sub-algorithms, one can easily concatenate them to predict the entire
vector E[s′]; otherwise, we can predict ⊥ and obtain data to allow the sub-algorithms
to learn. This meta-algorithm, known as output combination, allows one to KWIK-
learn such linearly parameterized MDPs with a KWIK bound of Õ(nS² n/ε⁴). There-
fore, the corresponding instantiation of KWIK-Rmax is PAC-MDP according to
Theorem 6.3.
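The output-combination meta-algorithm itself is simple enough to sketch directly (the sub-learner interface below, with predict returning None for ⊥, is a hypothetical convention):

class OutputCombination:
    """Combine one KWIK learner per output dimension into a KWIK learner
    for vector-valued outputs (sketch)."""

    def __init__(self, sub_learners):
        self.subs = sub_learners

    def predict(self, x):
        preds = [sub.predict(x) for sub in self.subs]
        if any(p is None for p in preds):
            return None              # "I don't know" if any component is unsure
        return preds

    def observe(self, x, y):
        for sub, y_i in zip(self.subs, y):
            sub.observe(x, y_i)      # each sub-learner sees only its own component

In Example 6.7, the sub-learners would be nS instances of noisy linear regression, one per component of E[s′], and the KWIK bound of the combined learner is roughly the sum of the sub-learners' bounds.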
A related class of MDPs is the normal offset model motivated by robot navigation
tasks (Brunskill et al, 2009), which may be viewed as a special case of the linear dy-
namics MDPs.
5 One technical subtlety exists: one has to first “clip” a normal distribution into a distribution
with bounded support since bounded noise is assumed in noisy linear regression.
where Pi(s) ⊆ {s[1], s[2], . . . , s[n]}, known as the i-th parent set, contains state vari-
ables that are relevant for defining the transition probabilities of s′[i]. Define two
quantities: D := max_i |Pi|, the maximum size of the parent sets, and N := max_i |Si|, the
maximum number of values a state variable can take. Therefore, although there are
m N^{2n} transition probabilities in a factored-state MDP, the MDP is actually specified
by no more than m n N^{D+1} free parameters. When D ≪ n, such an MDP can not only
be represented efficiently, but also be KWIK-learned with a small KWIK bound, as
shown next.
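For concreteness (with hypothetical sizes), take n = 20 binary state variables (N = 2), m = 4 actions, and parent sets of size at most D = 3: the full transition table has m N^{2n} = 4 · 2^40 ≈ 4.4 × 10^12 entries, whereas the factored representation is specified by at most m n N^{D+1} = 4 · 20 · 2⁴ = 1280 parameters.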
When the parent sets Pi are known a priori, output combination can be used
to combine predictions for the Ti, each of which can be KWIK-learned by an instance
of input partition (over all possible m N^D state–actions) applied to dice learning
(for a multinomial distribution over N elements). This three-level KWIK algorithm
provides an approach to learning the transition function of a factored-state MDP
with the following KWIK bound (Li et al, 2011): Õ(n³ m D N^{D+1}/ε²). This insight
can be used to derive PAC-MDP algorithms for factored-state MDPs (Guestrin et al,
2002; Kearns and Koller, 1999; Strehl, 2007a).
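A sketch of the two inner levels of this construction, with hypothetical names and a fixed sample threshold m standing in for the accuracy and confidence parameters of Definition 6.9:

from collections import defaultdict

class DiceLearner:
    """'Dice learning' sketch: KWIK-learn a multinomial distribution by
    answering None ("I don't know") until m samples have been observed."""
    def __init__(self, m):
        self.m, self.counts, self.total = m, defaultdict(int), 0
    def predict(self):
        if self.total < self.m:
            return None
        return {o: c / self.total for o, c in self.counts.items()}
    def observe(self, outcome):
        self.counts[outcome] += 1
        self.total += 1

class InputPartition:
    """Input-partition sketch: one independent sub-learner per input cell
    (here, per (parent configuration, action) pair)."""
    def __init__(self, make_learner):
        self.make, self.cells = make_learner, {}
    def _cell(self, key):
        if key not in self.cells:
            self.cells[key] = self.make()
        return self.cells[key]
    def predict(self, key):
        return self._cell(key).predict()
    def observe(self, key, outcome):
        self._cell(key).observe(outcome)

The outer level is the output-combination learner sketched earlier, with one InputPartition(lambda: DiceLearner(m)) instance per state variable i, queried with the key formed by the values of Pi(s) and the action a.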
In the more interesting case where the sets Pi are unknown, the RL agent has
to learn the structure of the factored-state representation. The problem of efficient
exploration becomes very challenging when combined with the need for structure
learning (Kearns and Koller, 1999). Fortunately, assuming knowledge of D, we can
use the noisy union algorithm (Li et al, 2011) to KWIK-learn the set of possible
structures, which is tractable as long as D is small. Combining noisy union with the
three-level KWIK algorithm for the known structure case above not only simplifies
an existing structure-learning algorithm (Strehl et al, 2007), but also facilitates de-
velopment of more efficient ones (Diuk et al, 2009).
The examples above show the power of the KWIK model when it is combined with
the generic PAC-MDP result in Theorem 6.3. Specifically, these examples show
how to KWIK-learn various important classes of MDPs, each of which leads imme-
diately to an instance of KWIK-Rmax that is PAC-MDP in those MDPs.
We conclude this section with two important open questions regarding the use of
KWIK model for devising PAC-MDP algorithms:
• KWIK learning a hypothesis class becomes very challenging when the realiz-
ability assumption is violated, that is, when h∗ ∉ H. In this case, some straight-
forward adaptation of the accuracy requirement in Definition 6.9 can make
forward adaptation of the accuracy requirement in Definition 6.9 can make
it impossible to KWIK-learn a hypothesis class, even if the class is KWIK-
learnable in the realizable case (Li and Littman, 2010). Recently, an interesting
effort has been made by Szita and Szepesvári (2011).
relation to Q-learning with the harmonic learning rates above and the batch nature
of the updates.
To encourage exploration, delayed Q-learning uses the same optimism-in-
the-face-of-uncertainty principle as many other algorithms like Rmax, MBIE and
UCRL2. Specifically, its initial Q-function is an over-estimate of the true function;
during execution, the successive value function estimates remain over-estimates
with high probability as long as m is sufficiently large, as will be shown soon.
While the role of the learning flags may not be immediately obvious, they are
critical to guarantee that the delayed update rule can happen only a finite number
of times during an entire execution of delayed Q-learning. This fact will be useful
when proving the following theorem for the algorithm’s sample complexity.
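A minimal sketch of the delayed update rule and learning flags described above, for a tabular problem: the parameters m and eps1 are those of the analysis, and the flag bookkeeping follows the spirit of the algorithm rather than its exact pseudocode.

from collections import defaultdict

class DelayedQ:
    """Sketch of delayed Q-learning's batch ("delayed") update rule and learning flags."""

    def __init__(self, actions, m, eps1, v_max):
        self.actions, self.m, self.eps1 = list(actions), m, eps1
        self.Q = defaultdict(lambda: v_max)      # optimistic initialization
        self.U = defaultdict(float)              # accumulated update targets
        self.l = defaultdict(int)                # samples since the last attempted update
        self.learn = defaultdict(lambda: True)   # learning flags

    def greedy(self, s):
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def step(self, s, a, r, s_next, gamma):
        if not self.learn[(s, a)]:
            return
        self.U[(s, a)] += r + gamma * max(self.Q[(s_next, b)] for b in self.actions)
        self.l[(s, a)] += 1
        if self.l[(s, a)] == self.m:             # attempted update
            target = self.U[(s, a)] / self.m
            if self.Q[(s, a)] - target >= 2 * self.eps1:   # successful update
                self.Q[(s, a)] = target + self.eps1
                for key in list(self.learn):     # any Q change re-enables learning flags
                    self.learn[key] = True
            else:                                # unsuccessful attempted update
                self.learn[(s, a)] = False
            self.U[(s, a)] = 0.0
            self.l[(s, a)] = 0

At each timestep the agent would call greedy(s) to act and then step(s, a, r, s_next, gamma) with the observed transition.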
It is worth noting that the algorithm’s sample complexity matches the lower bound
of Li (2009) in terms of the number of states. While the sample complexity of explo-
ration is worse than the recently proposed MoRmax (Szita and Szepesvári, 2010),
its analysis is the first to make use of an interesting notion of known state–actions
that is different from the one used by previous model-based algorithms like Rmax.
Let Qt be the value function estimate at timestep t; then the set of known state–
actions consists of those with essentially nonnegative Bellman errors:
    Kt := { (s,a) | R(s,a) + γ ∑_{s′∈S} T(s,a,s′) max_{a′∈A} Qt(s′,a′) − Qt(s,a) ≥ −ε(1 − γ)/4 }.
A few definitions are useful in the proof. An attempted update of a state–action (s,a)
is a timestep t for which (s,a) is experienced, L(s,a) = TRUE, and C(s,a) = m; in
other words, attempted updates correspond to the timesteps in which Lines 18—25
in Algorithm 14 are executed. If, in addition, the condition in Line 19 is satisfied,
the attempted update is called successful since the state–action value is changed;
otherwise, it is called unsuccessful.
Proof. (sketch) For convenience, let S = |S| and A = |A|. We first prove that there are
only finitely many attempted updates in a whole run of the algorithm. Obviously, the
number of successful updates is at most κ := S·A·Vmax/ξ since the state–action value
function is initialized to Vmax , the value function estimate is always non-negative,
and every successful update decreases Q(s,a) by at least ξ for some (s,a). Now
consider a fixed state–action (s,a). Once (s,a) is experienced for the m-th time,
an attempted update will occur. Suppose that an attempted update of (s,a) occurs
during timestep t. Afterward, for another attempted update to occur during some
later timestep t′ > t, it must be the case that a successful update of some state–action
(not necessarily (s,a)) has occurred on or after timestep t and before timestep t′. We
have proved at most κ successful updates are possible and so there are at most 1 + κ
attempted updates of (s,a). Since there are SA state–actions, there can be at most
SA(1 + κ ) total attempted updates. The bound on the number of attempted updates
allows one to use a union bound to show that (Strehl et al, 2009, Lemma 22), with
high probability, any attempted update on an (s,a) ∈ Kt will be successful using the
specified value of m.
We are now ready to verify the three conditions in Theorem 6.1. The optimism
condition is easiest to verify using mathematical induction. Delayed Q-learning
uses optimistic initialization of the value function, so Qt (s,a) ≥ Q∗ (s,a) and thus
Vt (s) ≥ V ∗ (s) for t = 1. Now, suppose Qt (s,a) ≥ Q∗ (s,a) and Vt (s) ≥ V ∗ (s) for
all (s,a). If some Q(s,a) is updated, then Hoeffding’s inequality together with the
specified value for m ensures that the new value of Q(s,a) is still optimistic (modulo
a small gap of O(ε ) because of approximation error). Since there are only finitely
many attempted updates, we may apply the union bound so that Vt (s) ≥ V ∗ (s) − ε /4
for all t, with high probability.
The accuracy condition can be verified using the definition of Kt . By definition,
the Bellman errors in known state–actions are at least −ε (1 − γ )/4. On the other
hand, unknown state–actions have zero Bellman error, by Definition 6.7. Therefore,
the well-known monotonicity property of Bellman operators implies that Qt is uni-
formly greater than the optimal Q-function in the known state–action MDP MKt,
modulo a small gap of O(ε). Thus the accuracy condition holds.
For the bounded-surprises condition, we first observe that the number of updates
to Q(·,·) is at most κ , as argued above. For the number of visits to unknown state–
actions, it can be argued that, with high probability, an attempted update to the
Q-value of an unknown state–action will be successful (using the specified value for
m), and an unsuccessful update followed by L(s,a) = FALSE indicates (s,a) ∈ Kt .
A formal argument requires more work but follows the same line of reasoning about
the learning flags as used above to bound the number of attempted updates (Strehl
et al, 2009, Lemma 25).
Delayed Q-learning may be combined with techniques like interval estimation
to gain further improvement (Strehl, 2007b). Although PAC-MDP algorithms like
delayed Q-learning exist for finite MDPs, extending the analysis to general MDPs
turns out to be very challenging, unlike the case for the model-based Rmax algorithm.
Part of the reason is the difficulty in analyzing model-free algorithms that often use
bootstrapping when updating value functions. While the learning target in model-
based algorithms is a fixed object (namely, the MDP model), the learning target in
a model-free algorithm is often not fixed. In a recent work (Li and Littman, 2010),
a finite-horizon reinforcement-learning problem in general MDPs is reduced to a
series of KWIK regression problems so that we can obtain a model-free PAC-MDP
RL algorithm as long as the individual KWIK regression problems are solvable.
However, it remains open how to devise a KWIK-based model-free RL algorithm in
discounted problems without first converting it into a finite-horizon one.
With the help of the KWIK model, PAC-MDP algorithms can be systematically developed for many use-
ful MDP classes beyond finite MDPs. Examples include various rich classes of prob-
lems like factored-state MDPs (Guestrin et al, 2002; Kearns and Koller, 1999; Strehl
et al, 2007; Diuk et al, 2009), continuous-state MDPs (Strehl and Littman, 2008b;
Brunskill et al, 2009), and relational MDPs (Walsh, 2010). Furthermore, ideas from
the KWIK model can also be combined with apprenticeship learning, resulting in
RL systems that can explore more efficiently with the help of a teacher (Walsh et al,
2010b; Sayedi et al, 2011; Walsh et al, 2012).
Although the worst-case sample complexity bounds may be conservative, the
principled algorithms and analyses have proved useful for guiding development of
more practical exploration schemes. A number of novel algorithms are motivated
and work well in practice (Jong and Stone, 2007; Li et al, 2009; Nouri and Littman,
2009, 2010), all of which incorporate various notion of knownness in guiding explo-
ration. Several PAC-MDP algorithms are also competitive in non-trivial applications
like robotics (Brunskill et al, 2009) and computer games (Diuk et al, 2008).
Acknowledgements. The author would like to thank John Langford, Michael Littman, Alex
Strehl, Tom Walsh, and Eric Wiewiora for significant contributions to the development of
the KWIK model and sample complexity analysis of a number of PAC-MDP algorithms. The
reviewers and editors of the chapter, as well as Michael Littman and Tom Walsh, have provided
valuable comments that improved the content and presentation of the article in a number of
substantial ways.
References
Abbeel, P., Ng, A.Y.: Exploration and apprenticeship learning in reinforcement learning.
In: Proceedings of the Twenty-Second International Conference on Machine Learning
(ICML-2005), pp. 1–8 (2005)
Asmuth, J., Li, L., Littman, M.L., Nouri, A., Wingate, D.: A Bayesian sampling approach to
exploration in reinforcement learning. In: Proceedings of the Twenty-Fifth Conference on
Uncertainty in Artificial Intelligence (UAI-2009), pp. 19–26 (2009)
Bartlett, P.L., Tewari, A.: REGAL: A regularization based algorithm for reinforcement learn-
ing in weakly communicating MDPs. In: Proceedings of the Twenty-Fifth Annual Con-
ference on Uncertainty in Artificial Intelligence (UAI-2009), pp. 35–42 (2009)
Barto, A.G., Bradtke, S.J., Singh, S.P.: Learning to act using real-time dynamic programming.
Artificial Intelligence 72(1-2), 81–138 (1995)
Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific (1996)
Brafman, R.I., Tennenholtz, M.: R-max—a general polynomial time algorithm for near-
optimal reinforcement learning. Journal of Machine Learning Research 3, 213–231 (2002)
Brunskill, E., Leffler, B.R., Li, L., Littman, M.L., Roy, N.: Provably efficient learning with
typed parametric models. Journal of Machine Learning Research 10, 1955–1988 (2009)
Burnetas, A.N., Katehakis, M.N.: Optimal adaptive policies for Markov decision processes.
Mathematics of Operations Research 22(1), 222–255 (1997)
Dearden, R., Friedman, N., Andre, D.: Model based Bayesian exploration. In: Proceedings of
the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI-1999), pp. 150–
159 (1999)
Diuk, C., Cohen, A., Littman, M.L.: An object-oriented representation for efficient reinforce-
ment learning. In: Proceedings of the Twenty-Fifth International Conference on Machine
Learning (ICML-2008), pp. 240–247 (2008)
Diuk, C., Li, L., Leffler, B.R.: The adaptive k-meteorologists problem and its application to
structure discovery and feature selection in reinforcement learning. In: Proceedings of the
Twenty-Sixth International Conference on Machine Learning (ICML-2009), pp. 249–256
(2009)
Duff, M.O.: Optimal learning: Computational procedures for Bayes-adaptive Markov
decision processes. PhD thesis, University of Massachusetts, Amherst, MA (2002)
Even-Dar, E., Mansour, Y.: Learning rates for Q-learning. Journal of Machine Learning
Research 5, 1–25 (2003)
Even-Dar, E., Mannor, S., Mansour, Y.: Multi-Armed Bandit and Markov Decision Processes.
In: Kivinen, J., Sloan, R.H. (eds.) COLT 2002. LNCS (LNAI), vol. 2375, pp. 255–270.
Springer, Heidelberg (2002)
Fiechter, C.N.: Efficient reinforcement learning. In: Proceedings of the Seventh Annual ACM
Conference on Computational Learning Theory (COLT-1994), pp. 88–97 (1994)
Fiechter, C.N.: Expected mistake bound model for on-line reinforcement learning. In: Pro-
ceedings of the Fourteenth International Conference on Machine Learning (ICML-1997),
pp. 116–124 (1997)
Guestrin, C., Patrascu, R., Schuurmans, D.: Algorithm-directed exploration for model-based
reinforcement learning in factored MDPs. In: Proceedings of the Nineteenth International
Conference on Machine Learning (ICML-2002), pp. 235–242 (2002)
Jaakkola, T., Jordan, M.I., Singh, S.P.: On the convergence of stochastic iterative dynamic
programming algorithms. Neural Computation 6(6), 1185–1201 (1994)
Jaksch, T., Ortner, R., Auer, P.: Near-optimal regret bounds for reinforcement learning. Jour-
nal of Machine Learning Research 11, 1563–1600 (2010)
Jong, N.K., Stone, P.: Model-based function approximation in reinforcement learning. In:
Proceedings of the Sixth International Joint Conference on Autonomous Agents and Mul-
tiagent Systems (AAMAS-2007), pp. 670–677 (2007)
Kaelbling, L.P.: Learning in Embedded Systems. MIT Press, Cambridge (1993)
Kakade, S.: On the sample complexity of reinforcement learning. PhD thesis, Gatsby Com-
putational Neuroscience Unit, University College London, UK (2003)
Kakade, S., Kearns, M.J., Langford, J.: Exploration in metric state spaces. In: Proceedings of
the Twentieth International Conference on Machine Learning (ICML-2003), pp. 306–312
(2003)
Kearns, M.J., Koller, D.: Efficient reinforcement learning in factored MDPs. In: Proceedings
of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-1999),
pp. 740–747 (1999)
Kearns, M.J., Singh, S.P.: Finite-sample convergence rates for Q-learning and indirect algo-
rithms. In: Advances in Neural Information Processing Systems (NIPS-1998), vol. 11,
pp. 996–1002 (1999)
Kearns, M.J., Singh, S.P.: Near-optimal reinforcement learning in polynomial time. Machine
Learning 49(2-3), 209–232 (2002)
Kearns, M.J., Mansour, Y., Ng, A.Y.: A sparse sampling algorithm for near-optimal planning
in large Markov decision processes. Machine Learning 49(2-3), 193–208 (2002)
Kocsis, L., Szepesvári, C.: Bandit Based Monte-Carlo Planning. In: Fürnkranz, J., Scheffer,
T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp. 282–293. Springer,
Heidelberg (2006)
Koenig, S., Simmons, R.G.: The effect of representation and knowledge on goal-directed
exploration with reinforcement-learning algorithms. Machine Learning 22(1-3), 227–250
(1996)
Kolter, J.Z., Ng, A.Y.: Near Bayesian exploration in polynomial time. In: Proceedings of the
Twenty-Sixth International Conference on Machine Learning (ICML-2009), pp. 513–520
(2009)
Li, L.: A unifying framework for computational reinforcement learning theory. PhD thesis,
Rutgers University, New Brunswick, NJ (2009)
Li, L., Littman, M.L.: Reducing reinforcement learning to KWIK online regression. Annals
of Mathematics and Artificial Intelligence 58(3-4), 217–237 (2010)
Li, L., Littman, M.L., Mansley, C.R.: Online exploration in least-squares policy iteration.
In: Proceedings of the Eighth International Conference on Autonomous Agents and Multiagent
Systems (AAMAS-2009), pp. 733–739 (2009)
Li, L., Littman, M.L., Walsh, T.J., Strehl, A.L.: Knows what it knows: A framework for self-
aware learning. Machine Learning 82(3), 399–443 (2011)
Littlestone, N.: Learning quickly when irrelevant attributes abound: A new linear-threshold
algorithm. Machine Learning 2(4), 285–318 (1987)
Meuleau, N., Bourgine, P.: Exploration of multi-state environments: Local measures and
back-propagation of uncertainty. Machine Learning 35(2), 117–154 (1999)
Moore, A.W., Atkeson, C.G.: Prioritized sweeping: Reinforcement learning with less data
and less time. Machine Learning 13(1), 103–130 (1993)
Neu, G., György, A., Szepesvári, C., Antos, A.: Online Markov decision processes under
bandit feedback. In: Advances in Neural Information Processing Systems 23 (NIPS-2010),
pp. 1804–1812 (2011)
Ng, A.Y., Harada, D., Russell, S.J.: Policy invariance under reward transformations: Theory
and application to reward shaping. In: Proceedings of the Sixteenth International Confer-
ence on Machine Learning (ICML-1999), pp. 278–287 (1999)
Nouri, A., Littman, M.L.: Multi-resolution exploration in continuous spaces. In: Advances in
Neural Information Processing Systems 21 (NIPS-2008), pp. 1209–1216 (2009)
Nouri, A., Littman, M.L.: Dimension reduction and its application to model-based explo-
ration in continuous spaces. Machine Learning 81(1), 85–98 (2010)
Poupart, P., Vlassis, N., Hoey, J., Regan, K.: An analytic solution to discrete Bayesian re-
inforcement learning. In: Proceedings of the Twenty-Third International Conference on
Machine Learning (ICML-2006), pp. 697–704 (2006)
Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Wiley-Interscience, New York (1994)
Ratitch, B., Precup, D.: Using MDP Characteristics to Guide Exploration in Reinforcement
Learning. In: Lavrač, N., Gamberger, D., Todorovski, L., Blockeel, H. (eds.) ECML 2003.
LNCS (LNAI), vol. 2837, pp. 313–324. Springer, Heidelberg (2003)
Robbins, H.: Some aspects of the sequential design of experiments. Bulletin of the American
Mathematical Society 58(5), 527–535 (1952)
Sayedi, A., Zadimoghaddam, M., Blum, A.: Trading off mistakes and don’t-know predictions.
In: Advances in Neural Information Processing Systems 23 (NIPS-2010), pp. 2092–2100
(2011)
Singh, S.P., Jaakkola, T., Littman, M.L., Szepesvári, C.: Convergence results for single-step
on-policy reinforcement-learning algorithms. Machine Learning 38(3), 287–308 (2000)
Walsh, T.J., Subramanian, K., Littman, M.L., Diuk, C.: Generalizing apprenticeship learning
across hypothesis classes. In: Proceedings of the Twenty-Seventh International Confer-
ence on Machine Learning (ICML-2010), pp. 1119–1126 (2010b)
Walsh, T.J., Hewlett, D., Morrison, C.T.: Blending autonomous and apprenticeship learning.
In: Advances in Neural Information Processing Systems 24, NIPS-2011 (2012)
Watkins, C.J., Dayan, P.: Q-learning. Machine Learning 8, 279–292 (1992)
Whitehead, S.D.: Complexity and cooperation in Q-learning. In: Proceedings of the Eighth
International Workshop on Machine Learning (ICML-1991), pp. 363–367 (1991)
Wiering, M., Schmidhuber, J.: Efficient model-based exploration. In: Proceedings of the Fifth
International Conference on Simulation of Adaptive Behavior: From Animals to Animats
5 (SAB-1998), pp. 223–228 (1998)
Part III
Constructive-Representational Directions
Chapter 7
Reinforcement Learning in
Continuous State and Action Spaces
7.1 Introduction
computational efficiency such that it can be used in real-time and sample efficiency
such that it can learn good action-selection policies with limited experience.
Because of the complexity of the full reinforcement-learning problem in contin-
uous spaces, many traditional reinforcement-learning methods have been designed
for Markov decision processes (MDPs) with small finite state and action spaces.
However, many problems inherently have large or continuous domains. In this chap-
ter, we discuss how to use reinforcement learning to learn good action-selection
policies in MDPs with continuous state spaces and discrete action spaces and in
MDPs where the state and action spaces are both continuous.
Throughout this chapter, we assume that a model of the environment is not
known. If a model is available, one can use dynamic programming (Bellman, 1957;
Howard, 1960; Puterman, 1994; Sutton and Barto, 1998; Bertsekas, 2005, 2007),
or one can sample from the model and use one of the reinforcement-learning algo-
rithms we discuss below. We focus mainly on the problem of control, which means
we want to find action-selection policies that yield high returns, as opposed to the
problem of prediction, which aims to estimate the value of a given policy.
For general introductions to reinforcement learning from varying perspectives,
we refer to the books by Bertsekas and Tsitsiklis (1996) and Sutton and Barto (1998)
and the more recent books by Bertsekas (2007), Powell (2007), Szepesvári (2010)
and Buşoniu et al (2010). Whenever we refer to a chapter, it is implied to be the
relevant chapter from the same volume as this chapter.
In the remainder of this introduction, we describe the structure of MDPs in con-
tinuous domains and discuss three general methodologies to find good policies
in such MDPs. We discuss function approximation techniques to deal with large
or continuous spaces in Section 7.2. We apply these techniques to reinforcement
learning in Section 7.3, where we discuss the current state of knowledge for re-
inforcement learning in continuous domains. This includes discussions on tempo-
ral differences, policy gradients, actor-critic algorithms and evolutionary strategies.
Section 7.4 shows the results of an experiment, comparing an actor-critic method
to an evolutionary strategy on a double-pole cart pole. Section 7.5 concludes the
chapter.
A Markov decision process (MDP) is a tuple (S,A,T,R,γ ). In this chapter, the state
space S is generally an infinitely large bounded set. More specifically, we assume
the state space is a subset of a possibly multi-dimensional Euclidean space, such
that S ⊆ RDS , where DS ∈ N is the dimension of the state space. The action space
is discrete or continuous and in the latter case we assume A ⊆ RDA , where DA ∈
N is the dimension of the action space.1 We consider two variants: MDPs with
continuous states and discrete actions and MDPs where both the states and actions
1 In general, the action space is more accurately represented with a function that maps a
state into a continuous set, such that A(s) ⊆ RDA . We ignore this subtlety for conciseness.
are continuous. Often, when we write ‘continuous’ the results hold for ‘large finite’
spaces as well. The notation used in this chapter is summarized in Table 7.1.

Table 7.1 Symbols used in this chapter. All vectors are column vectors.
The transition function T(s,a,s′) gives the probability of a transition to s′ when
action a is performed in s. When the state space is continuous, we can assume the
transition function specifies a probability density function (PDF), such that
∫_{S′} T(s,a,s′) ds′ = P(s_{t+1} ∈ S′ | s_t = s and a_t = a)
denotes the probability that action a in state s results in a transition to a state in the
region S′ ⊆ S. It is often more intuitive to describe the transitions through a function
that describes the system dynamics, such that
When the action space is also continuous, π (s) represents a PDF on the action space.
The goal of prediction is to find the value of the expected future discounted re-
ward for a given policy. The goal of control is to optimize this value by finding
an optimal policy. It is useful to define the following operators Bπ : V → V and
B∗ : V → V , where V is the space of all value functions:2
(Bπ V)(s) = ∫_A π(s,a) ∫_S T(s,a,s′) [ R(s,a,s′) + γ V(s′) ] ds′ da ,    (7.1)

(B∗ V)(s) = max_a ∫_S T(s,a,s′) [ R(s,a,s′) + γ V(s′) ] ds′ ,
In continuous MDPs, the values of a given policy and the optimal value can then
be expressed with the Bellman equations V π = Bπ V π and V ∗ = B∗V ∗ . Here V π (s)
is the value of performing policy π starting from state s and V ∗ (s) = maxπ V π (s) is
the value of the best possible policy. If the action space is finite, the outer integral
in equation (7.1) should be replaced with a summation. In this chapter, we mainly
consider discounted MDPs, which means that γ ∈ (0,1).
For control with finite action spaces, action values are often used. The optimal
action value for continuous state spaces is given by the Bellman equation
Q∗(s,a) = ∫_S T(s,a,s′) [ R(s,a,s′) + γ max_{a′} Q∗(s′,a′) ] ds′ .    (7.2)
2 In the literature, these operators are more commonly denoted T π and T ∗ (e.g., Szepesvári,
2010), but since we use T to denote the transition function, we choose to use B.
In the problem of control, the aim is an approximation of the optimal policy. The
optimal policy depends on the optimal value, which in turn depends on the model
of the MDP. In terms of equation (7.2), the optimal policy is the policy π ∗ that
maximizes Q∗ for each state: ∑a π ∗ (s,a)Q∗ (s,a) = maxa Q∗ (s,a). This means that
rather than trying to estimate π ∗ directly, we can try to estimate Q∗ , or we can even
estimate T and R to construct Q∗ and π ∗ when needed. These observations lead to
the following three general methodologies that differ in which part of the solution
is explicitly approximated. These methodologies are not mutually exclusive and we
will discuss algorithms that use combinations of these approaches.
Value Approximation. In this second methodology, the samples are used to ap-
proximate V ∗ or Q∗ directly. Many reinforcement-learning algorithms fall into this
category. We discuss value-approximation algorithms in Section 7.3.1.
issues on the choice of approximator. This discussion is split into a part on linear
function approximation and one on non-linear function approximation.
In equation (7.3) and in the rest of this chapter, θ t ∈ Θ denotes the adaptable param-
eter vector at time t and φ (s) ∈ Φ is the feature vector of state s. Since the function
in equation (7.3) is linear in the parameters, we refer to it as a linear function ap-
proximator. Note that it may be non-linear in the state variables, depending on the
feature extraction. In this section, the dimension DΘ of the parameter space is equal
to the dimension of the feature space DΦ . This does not necessarily hold for other
types of function approximation.
Linear function approximators are useful since they are better understood than
non-linear function approximators. Applied to reinforcement learning, this has led to
a number of convergence guarantees, under various additional assumptions (Sutton,
1984, 1988; Dayan, 1992; Peng, 1993; Dayan and Sejnowski, 1994; Bertsekas and
Tsitsiklis, 1996; Tsitsiklis and Van Roy, 1997). From a practical point of view, linear
approximators are useful because they are simple to implement and fast to compute.
Many problems have large state spaces in which each state can be represented
efficiently with a feature vector of limited size. For instance, the double pole cart
pole problem that we consider later in this chapter has continuous state variables,
and therefore an infinitely large state space. Yet, every state can be represented with
a vector with six elements. This means that we would need a table of infinite size,
but can suffice with a parameter vector with just six elements if we use (7.3) with
the state variables as features.
This reduction of tunable parameters of the value function comes at a cost. It is
obvious that not every possible value function can be represented as a linear combi-
nation of the features of the problem. Therefore, our solution is limited to the set of
value functions that can be represented with the chosen functional form. If one does
not know beforehand what useful features are for a given problem, it can be benefi-
cial to use non-linear function approximation, which we discuss in Section 7.2.2.
Fig. 7.1 An elliptical state space is discretized by tile coding with two tilings. For a state
located at the X, the two active tiles are shown in light grey. The overlap of these active
features is shown in dark grey. On the left, each tiling contains 12 tiles. The feature vector
contains 24 elements and 35 different combinations of active features can be encountered
in the elliptical state space. On the right, the feature vector contains 13 elements and 34
combinations of active features can be encountered, although some combinations correspond
to very small parts of the ellipse.
A common method to find features for a linear function approximator divides the
continuous state space into separate segments and attaches one feature to each seg-
ment. A feature is active (i.e., equal to one) if the relevant state falls into the corre-
sponding segment. Otherwise, it is inactive (i.e., equal to zero).
An example of such a discretizing method that is often used in reinforcement
learning is tile coding (Watkins, 1989; Lin and Kim, 1991; Sutton, 1996; San-
tamaria et al, 1997; Sutton and Barto, 1998), which is based on the Cerebel-
lar Model Articulation Controller (CMAC) structure proposed by Albus (1971,
1975). In tile coding, the state space is divided into a number of disjoint sets.
These sets are commonly called tiles in this context. For instance, one could de-
fine N hypercubes such that each hypercube Hn is defined by a Cartesian product
Hn = [xn,1 ,yn,1 ] × . . . × [xn,DS ,yn,DS ], where xn,d is the lower bound of hypercube Hn
in state dimension d and yn,d is the corresponding upper bound. Then, a feature
φn (s) ∈ φ (s) corresponding to Hn is equal to one when s ∈ Hn and zero otherwise.
The idea behind tile coding is to use multiple non-overlapping tilings. If a single
tiling contains N tiles, one could use M such tilings to obtain a feature vector of
dimension DΦ = MN. In each state, precisely M of these features are then equal to
one, while the others are equal to zero. An example with M = 2 tilings and DΦ = 24
features is shown on the left in Figure 7.1. The tilings do not have to be homo-
geneous. The right picture in Figure 7.1 shows a non-homogeneous example with
M = 2 tilings and DΦ = 13 features.
When M features are active for each state, up to \binom{D_Φ}{M} different situations can theoretically be represented with D_Φ features. This contrasts with the naive approach where only one feature is active for each state, which would only be able to represent D_Φ different situations with the same number of features.4 In practice, the upper bound of \binom{D_Φ}{M} will rarely be obtained, since many combinations of active features will not be possible. In both examples in Figure 7.1, the number of different possible feature vectors is indeed larger than the length of the feature vector and smaller than the theoretical upper bound: 24 < 35 < \binom{24}{2} = 276 and 13 < 34 < \binom{13}{2} = 78.

4 Note that 1 < M < D_Φ implies that D_Φ < \binom{D_Φ}{M}.

Fig. 7.2 A reward function and feature mapping. The reward is Markov for the features. If s_{t+1} = s_t + a_t with a_t ∈ {−2,2}, the feature-transition function is not Markov. This makes it impossible to determine an optimal policy.
One potential problem with discretizing methods such as tile coding is that the
resulting function that maps states into features is not injective. In other words,
φ(s) = φ(s′) does not imply that s = s′. This means that the resulting feature-space
MDP is partially observable and one should consider using an algorithm that is ex-
plicitly designed to work on partially observable MDPs (POMDPs). For more on
POMDPs, see Chapter 12. In practice, many good results have been obtained with
tile coding, but the discretization and the resulting loss of the Markov property im-
ply that most convergence proofs for ordinary reinforcement-learning algorithms do
not apply for the discretized state space. This holds for any function approximation
that uses a feature space that is not an injective function of the Markov state space.
Intuitively, this point can be explained with a simple example. Consider a state
space S = R that is discretized such that φ (s) = (1,0,0)T when s ≤ −2, φ (s) =
(0,1,0)T when −2 < s < 2 and φ (s) = (0,0,1)T when s ≥ 2. The action space is
A = {−2,2}, the transition function is st+1 = st + at and the initial state is s0 = 1.
The reward is defined by rt+1 = 1 if st ∈ (−2,2) and rt+1 = −1 otherwise. The
reward function and the feature mapping are shown in Figure 7.2. In this MDP, it is
optimal to jump back and forth between the states s = −1 and s = 1. However, if we observe the feature vector (0,1,0)^T, we cannot know whether we are in s = −1 or s = 1 and we cannot determine the optimal action.
Another practical issue with methods such as tile coding is related to the step-size
parameter that many algorithms use. For instance, in many algorithms the parame-
ters of a linear function approximator are updated with an update akin to

θ_{t+1} = θ_t + α_t(s_t) δ_t φ(s_t) ,    (7.4)
where αt (st ) ∈ [0,1] is a step size and δt is an error for the value of the current state.
This may be a temporal-difference error, the difference between the current value
and a Monte Carlo sample, or any other relevant error. A derivation and explanation
of this update and variants thereof are given below, in Sections 7.2.3.1 and 7.3.1.2.
If we look at the update to a value V(s) = θ^T φ(s) that results from (7.4), we get

V_{t+1}(s_t) = θ_{t+1}^T φ(s_t) = V_t(s_t) + α_t(s_t) ‖φ(s_t)‖² δ_t .

In other words, the effective step size for the values is equal to

α_t(s_t) ‖φ(s_t)‖² .    (7.5)

For instance, in tile coding ‖φ(s_t)‖² is equal to the number of tilings M. Therefore, the effective step size on the value function is larger than one for α_t(s_t) > 1/M. This can cause divergence of the parameters. Conversely, if the Euclidean norm ‖φ(s)‖ of the feature vector is often small, the change to the value function may be smaller than intended.
This issue can occur for any feature space and linear function approximation,
since then the effective step sizes in (7.5) are used for the update to the value func-
tion. This indicates that it can be a good idea to scale the step size appropriately, by
using
α̃_t(s_t) = α_t(s_t) / ‖φ(s_t)‖² ,
where α̃t (st ) is the scaled step size.5 This scaled step size can prevent unintended
small as well as unintended large updates to the values.
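A minimal sketch of this scaling, assuming the linear update (7.4) and binary features such as those produced by tile coding, is the following.

import numpy as np

def scaled_td_update(theta, phi, delta, alpha):
    """Linear update theta += alpha_tilde * delta * phi with
    alpha_tilde = alpha / ||phi||^2, so that the value of the current
    state changes by approximately alpha * delta."""
    norm_sq = float(np.dot(phi, phi))
    if norm_sq == 0.0:          # see footnote 5: no change if phi is all zeros
        return theta
    alpha_tilde = alpha / norm_sq
    return theta + alpha_tilde * delta * phi

theta = np.zeros(8)
phi = np.zeros(8); phi[[1, 5]] = 1.0     # e.g. two active tiles (M = 2)
theta = scaled_td_update(theta, phi, delta=0.5, alpha=0.1)
# theta.dot(phi) is now 0.05 = alpha * delta, independent of M.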
In general, it is often a good idea to make sure that |φ(s)| = ∑_{k=1}^{D_Φ} φ_k(s) ≤ 1 for all s. For instance, in tile coding we could set the value of active features equal to 1/M instead of to 1. Such feature representations have good convergence properties, because they are non-expansions, which means that max_s |φ(s)^T θ − φ(s)^T θ′| ≤ max_k |θ_k − θ′_k| for any feature vector φ(s) and any two parameter vectors θ and θ′. A non-expansive function makes it easier to prove that an algorithm iteratively improves its solution in expectation through a so-called contraction mapping (Gordon, 1995; Littman and Szepesvári, 1996; Bertsekas and Tsitsiklis, 1996; Bertsekas, 2007; Szepesvári, 2010; Buşoniu et al, 2010). Algorithms that implement a contraction mapping eventually reach an optimal solution and are guaranteed not to diverge, for instance through parameters that grow to infinitely large values.
5 One can safely define α̃_t(s_t) = 0 if ‖φ(s_t)‖ = 0, since in that case update (7.4) would not change the parameters anyway.
Some of the issues with discretization can be avoided by using a function that is
piece-wise linear, rather than piece-wise constant. One way to do this, is by using
so-called fuzzy sets (Zadeh, 1965; Klir and Yuan, 1995; Babuska, 1998). A fuzzy
set is a generalization of normal sets to fuzzy membership. This means that elements
can partially belong to a set, instead of just the possibilities of truth or falsehood.
A common example of fuzzy sets is the division of temperature into ‘cold’ and
‘warm’. There is a gradual transition between cold and warm, so often it is more
natural to say that a certain temperature is partially cold and partially warm.
In reinforcement learning, the state or state-action space can be divided into fuzzy
sets. Then, a state may belong partially to the set defined by feature φi and partially
to the set defined by feature φ j . For instance, we may have φi (s) = 0.1 and φ j (s) =
0.3. An advantage of this view is that it is quite natural to assume that ∑k φk (s) ≤ 1,
since each part of an element can belong to only one set. For instance, something
cannot be fully warm and fully cold at the same time.
It is possible to define the sets such that each combination of feature activations
corresponds precisely to one single state, thereby avoiding the partial-observability
problem sketched earlier. A common choice is to use triangular functions that are
equal to one at the center of the corresponding feature and decay linearly to zero
for states further from the center. With some care, such features can be constructed
such that they span the whole state space and ∑k φk (s) ≤ 1 for all states.
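As a small illustration, evenly spaced triangular membership functions over a one-dimensional state variable can be computed as follows; the centers and their spacing are assumptions made for the example.

import numpy as np

def triangular_features(s, centers):
    """Piece-wise linear (triangular) membership features for a scalar s.
    Adjacent triangles overlap, so at most two features are non-zero and
    their values sum to at most one."""
    centers = np.asarray(centers, dtype=float)
    width = centers[1] - centers[0]          # assumes evenly spaced centers
    return np.maximum(0.0, 1.0 - np.abs(s - centers) / width)

centers = np.linspace(-2.0, 2.0, 5)          # five fuzzy sets on [-2, 2]
print(triangular_features(0.5, centers))     # two active features, sum = 1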
A full treatment of fuzzy reinforcement learning falls outside the scope of this
chapter. References that make the explicit connection between fuzzy logic and re-
inforcement learning include Berenji and Khedkar (1992); Berenji (1994); Lin and
Lee (1994); Glorennec (1994); Bonarini (1996); Jouffe (1998); Zhou and Meng
(2003) and Buşoniu et al (2008, 2010). A drawback of fuzzy sets is that these sets
still need to be defined beforehand, which may be difficult.
6 Non-parametric approaches somewhat alleviate this point, but are harder to analyze in
general. A discussion on such methods falls outside the scope of this chapter.
Here the size of θt ∈ Θ is not necessarily equal to the size of φ (s) ∈ Φ . For instance,
V may be a neural network where θ t is a vector with all its weights at time t. Often,
the functional form of V is fixed. However, it is also possible to change the structure
of the function during learning (e.g., Stanley and Miikkulainen, 2002; Taylor et al,
2006; Whiteson and Stone, 2006; Buşoniu et al, 2010).
In general, a non-linear function approximator may approximate an unknown
function with better accuracy than a linear function approximator that uses the same
input features. In some cases, it is even possible to avoid defining features altogether
by using the state variables as inputs. A drawback of non-linear function approxima-
tion in reinforcement learning is that fewer convergence guarantees can be given. In
some cases, convergence to a local optimum can be assured (e.g., Maei et al, 2009),
but in general the theory is less well developed than for linear approximation.
Some algorithms allow for the closed-form computation of parameters that best
approximate the desired function, for a given set of experience samples. For in-
stance, when TD-learning is coupled with linear function approximation, least-
squares temporal-difference learning (LSTD) (Bradtke and Barto, 1996; Boyan,
2002; Geramifard et al, 2006) can be used to compute parameters that minimize the
empirical temporal-difference error over the observed transitions. However, for non-
linear algorithms such as Q-learning or when non-linear function approximation is
used, these methods are not applicable and the parameters should be optimized in a
different manner.
Below, we explain how to use the two general techniques of gradient descent
and gradient-free optimization to adapt the parameters of the approximations. These
procedures can be used with both linear and non-linear approximation and they can
be used for all three types of functions: models, value functions and policies. In
Section 7.3, we discuss reinforcement-learning algorithms that use these methods.
We will not discuss Bayesian methods in any detail, but such methods can be used
to learn the probability distributions of stationary functions, such as the reward and
transition functions of a stationary MDP. An advantage of this is that the exploration
of an online algorithm can choose actions to increase the knowledge of parts of the
model that have high uncertainty. Bayesian methods are somewhat less suited to
6: Update parameters: θ_{t+1} = θ_t − α_t ∇_θ E(x, θ_t)
learn the value of non-stationary functions, such as the value of a changing policy.
For more general information about Bayesian inference, see for instance Bishop
(2006). For Bayesian methods in the context of reinforcement learning, see Dearden
et al (1998, 1999); Strens (2000); Poupart et al (2006) and Chapter 11.
where E_i(x_i,θ_t) is the error for the i-th input x_i and α_t ∈ [0,1] is a step-size parameter.
If the error is defined over only a single input-output pair, the update is called a
stochastic gradient descent update. Batch updates can be used in offline algorithms,
while stochastic gradient descent updates are more suitable for online algorithms.
There is some indication that often stochastic gradient descent converges faster
than batch gradient descent (Wilson and Martinez, 2003). Another advantage of
stochastic gradient descent over batch learning is that it is straightforward to extend
online stochastic gradient descent to non-stationary targets, for instance if the policy
changes after an update. These features make online gradient methods quite suitable
for online reinforcement learning. In general, in combination with reinforcement
learning convergence to an optimal solution is not guaranteed, although in some
cases convergence to a local optimum can be proven (Maei et al, 2009).
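The following sketch contrasts the two update schemes for a generic squared error over a small synthetic data set; the linear model and the data are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # inputs x_i
y = X @ np.array([1.0, -2.0, 0.5])          # targets y_i
theta = np.zeros(3)
alpha = 0.05

def gradient(theta, x, target):
    # Gradient of the squared error 0.5 * (x.theta - target)^2 w.r.t. theta.
    return (x @ theta - target) * x

# Batch gradient descent: one update from the gradient averaged over all samples.
theta_batch = theta - alpha * sum(gradient(theta, x, t) for x, t in zip(X, y)) / len(X)

# Stochastic gradient descent: one update per individual sample.
theta_sgd = theta.copy()
for x, t in zip(X, y):
    theta_sgd = theta_sgd - alpha * gradient(theta_sgd, x, t)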
In the context of neural networks, gradient descent is often implemented through
backpropagation (Bryson and Ho, 1969; Werbos, 1974; Rumelhart et al, 1986),
which uses the chain rule and the layer structure of the networks to efficiently calcu-
late the derivatives of the network’s output to its parameters. However, the principle
of gradient descent can be applied to any differentiable function.
In some cases, the normal gradient is not the best choice. More formally, a prob-
lem of ordinary gradient descent is that the distance metric in parameter space may
differ from the distance metric in function space, because of interactions between
the parameters. Let dθ ∈ R^P denote a vector in parameter space. The Euclidean norm of this vector is ‖dθ‖ = √(dθ^T dθ). However, if the parameter space is a curved space, known as a Riemannian manifold, it is more appropriate to use the weighted norm √(dθ^T G dθ), where G is a P × P positive semi-definite matrix. With this weighted distance metric, the direction of steepest descent becomes

∇̃_θ E(x,θ) = G^{−1} ∇_θ E(x,θ) ,
which is known as the natural gradient (Amari, 1998). In general, the best choice
for matrix G depends on the functional form of E. Since E is not known in general,
G will usually need to be estimated.
Natural gradients have a number of advantages. For instance, the natural gradient
is invariant to transformations of the parameters. In other words, when using a natu-
ral gradient the change in our function does not depend on the precise parametriza-
tion of the function. This is somewhat similar to our observation in Section 7.2.1.2
that we can scale the step size to tune the size of the step in value space rather than
in parameter space. Only here we consider the direction of the update to the pa-
rameters, rather than its size. Additionally, the natural gradient avoids plateaus in
function space, often resulting in faster convergence. We discuss natural gradients
in more detail when we discuss policy-gradient algorithms in Section 7.3.2.1.
Gradient-free methods are useful when the function that is optimized is not differ-
entiable or when it is expected that many local optima exist. Many general global
Care should be taken that the variance of the distribution does not become too small
too quickly, in order to prevent premature convergence to sub-optimal solutions. A simple way to do this is to use a step-size parameter (Rubinstein and Kroese, 2004) on the parameters, in order to prevent overly large changes per iteration.
More sophisticated methods to prevent premature convergence include the use of
the natural gradient by NES, and the use of enforced correlations between the co-
variance matrices of consecutive populations by CMA-ES.
No general guarantees can be given concerning convergence to the optimal so-
lution for evolutionary strategies. Convergence to the optimal solution for non-
stationary problems, such as the control problem in reinforcement learning, seems
even harder to prove. Despite this lack of guarantees, these methods can perform
well in practice. The major bottleneck is usually that the computation of the fitness
can be both noisy and expensive. Additionally, these methods have been designed
mostly with stationary optimization problems in mind. Therefore, they are more
suited to optimize a policy using Monte Carlo samples than to approximate the
value of the unknown optimal policy. In Section 7.4, we compare the performance
of CMA-ES and an actor-critic temporal-difference approach.
The gradient-free methods mentioned above all fall into a category known as
metaheuristics (Glover and Kochenberger, 2003). These methods iteratively search
for good candidate solutions, or a distribution that generates these. Another ap-
proach is to construct a more easily solvable (e.g., quadratic) model of the function that
is to be optimized and then maximize this model analytically (see, e.g., Powell,
2002, 2006; Huyer and Neumaier, 2008). New samples can be iteratively chosen
to improve the approximate model. We do not know any papers that have used
such methods in a reinforcement learning context, but the sample-efficiency of such
In order to update a value with gradient descent, we must choose some measure
of error that we can minimize. This measure is often referred to as the objective
function. To be able to reason more formally about these objective functions, we
introduce the concepts of function space and projections. Recall that V is the space
of value functions, such that V ∈ V . Let F ⊆ V denote the function space of
representable functions for some function approximator. Intuitively, if F contains
a large subset of V , the function is flexible and can accurately approximate many
value functions. However, it may be prone to overfitting of the perceived data and it
may be slow to update since usually a more flexible function requires more tunable
parameters. Conversely, if F is small compared to V , the function is not very flexi-
ble. For instance, the function space of a linear approximator is usually smaller than
that of a non-linear approximator. A parametrized function has a parameter vector
θ = {θ [1], . . ., θ [DΘ ]} ∈ RDΘ that can be adjusted during training. The function
space is then defined by
F = { V(·,θ) | θ ∈ R^{D_Θ} } .

A weighted projection operator Π maps any value function V ∈ V to a closest representable function in F, such that

‖V − ΠV‖_w = min_{v ∈ F} ‖V − v‖_w = min_θ ‖V − V_θ‖_w .
This means that the projection is determined by the functional form of the approxi-
mator and the weights of the norm.
Let B = Bπ or B = B∗ , depending on whether we are approximating the value of
a given policy, or the value of the optimal policy. It is often not possible to find a
parameter vector that fulfills the Bellman equations V θ = BV θ for the whole state
space exactly, because the value BV θ may not be representable with the chosen
function. Rather, the best we can hope for is a parameter vector that fulfills
V θ = Π BV θ . (7.7)
This is called the projected Bellman equation; Π projects the outcome of the Bell-
man operator back to the space that is representable by the function approximation.
In some cases, it is possible to give a closed form expression for the projection
(Tsitsiklis and Van Roy, 1997; Bertsekas, 2007; Szepesvári, 2010). For instance,
consider a finite state space with N states and a linear function Vt (s) = θ T φ (s),
where D_Θ = D_Φ ≤ N. Let p_s = P(s_t = s) denote the expected steady-state prob-
abilities of sampling each state and store these values in a diagonal N × N matrix
P. We assume the states are always sampled according to these fixed probabilities.
Finally, the N × DΦ matrix Φ holds the feature vectors for all states in its rows,
such that Vt = Φθt and Vt (s) = Φs θt = θtT φ (s). Then, the projection operator can
be represented by the N × N matrix
Π = Φ (Φ^T P Φ)^{−1} Φ^T P .    (7.8)
The inverse exists if the features are linearly independent, such that Φ has rank DΦ .
With this definition Π V_t = Π Φθ_t = Φθ_t = V_t , but Π BV_t ≠ BV_t , unless BV_t can be
expressed as a linear function of the feature vectors. A projection matrix as defined
in (7.8) is used in the analysis and in the derivation of several algorithms (Tsitsiklis
and Van Roy, 1997; Nedić and Bertsekas, 2003; Bertsekas et al, 2004; Sutton et al,
2008, 2009; Maei and Sutton, 2010). We discuss some of these in the next section.
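As a numerical sketch of the projection in equation (7.8), consider a small finite state space with assumed features and steady-state probabilities:

import numpy as np

Phi = np.array([[1.0, 0.0],     # N x D_phi feature matrix (rows = states)
                [1.0, 1.0],
                [0.0, 1.0],
                [2.0, 1.0],
                [1.0, 3.0]])
p = np.array([0.1, 0.2, 0.3, 0.25, 0.15])   # assumed steady-state probabilities
P = np.diag(p)

# Projection onto the span of the features, weighted by P (equation (7.8)).
Pi = Phi @ np.linalg.inv(Phi.T @ P @ Phi) @ Phi.T @ P

V = np.array([0.0, 1.0, 2.0, 3.0, 4.0])     # an arbitrary value function
V_proj = Pi @ V                              # closest representable value function
assert np.allclose(Pi @ Phi, Phi)            # representable functions are unchanged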
Apart from the state and the parameters, the error depends on the MDP and the pol-
icy. We do not specify these dependencies explicitly to avoid cluttering the notation.
A direct implementation of gradient descent based on the error in (7.9) would
adapt the parameters to move Vt (s) closer to rt+1 + γ Vt (st+1 ) as desired, but would
also move γ Vt (st+1 ) closer to Vt (st ) − rt+1 . Such an algorithm is called a residual-
gradient algorithm (Baird, 1995). Alternatively, we can interpret rt+1 + γ Vt (st+1 ) as
a stochastic approximation for V π that does not depend on θ. Then, the negative gradient yields the update (Sutton, 1984, 1988)

θ_{t+1} = θ_t + α_t(s_t) δ_t ∇_θ V_t(s_t) ,   with   δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t) .    (7.10)

This is the conventional TD learning update and it usually converges faster than the
residual-gradient update (Gordon, 1995, 1999). For linear function approximation,
for any θ we have ∇θ Vt (st ) = φ (st ) and we obtain the same update as was shown
earlier for tile coding in (7.4). Similar updates for action-value algorithms are ob-
tained by replacing ∇_θ V_t(s_t) in (7.10) with ∇_θ Q_t(s_t,a_t) and using, for instance, the corresponding temporal-difference error for action values. Eligibility traces can also be added, for example with accumulating traces:
e_{t+1} = λγ e_t + ∇_θ V_t(s_t) ,
θ_{t+1} = θ_t + α_t(s_t) δ_t e_{t+1} ,
where e ∈ R^{D_Φ} is a trace vector. Replacing traces (Singh and Sutton, 1996) are less straightforward, although the suggestion by Främling (2007) to use an element-wise maximum, e_{t+1} = max(λγ e_t, ∇_θ V_t(s_t)), seems sensible,
since this corresponds nicely to the common practice for tile coding and this update
reduces to the conventional replacing traces update when the values are stored in a
table. However, a good theoretical justification for this update is still lacking.
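A compact sketch of TD(λ) with linear function approximation and accumulating traces, combining the trace and parameter updates above, is given below; the environment and policy interfaces are assumptions made for the example.

import numpy as np

def td_lambda_episode(env, policy, features, theta, alpha, gamma, lam):
    """One episode of TD(lambda) prediction with linear values V(s) = theta.phi(s).
    Assumed interface: env.reset() -> s, env.step(a) -> (s_next, r, done),
    policy(s) -> a, features(s) -> phi(s)."""
    e = np.zeros_like(theta)                 # eligibility trace vector
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        v = theta @ features(s)
        v_next = 0.0 if done else theta @ features(s_next)
        delta = r + gamma * v_next - v       # temporal-difference error
        e = gamma * lam * e + features(s)    # accumulating trace
        theta = theta + alpha * delta * e
        s = s_next
    return theta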
Parameters updated with (7.10) may diverge when off-policy updates are used.
This holds for any temporal-difference method with λ < 1 when we use linear
(Baird, 1995) or non-linear function approximation (Tsitsiklis and Van Roy, 1996).
In other words, if we sample transitions from a distribution that does not comply
completely with the state-visit probabilities that would occur under the estimation pol-
icy, the parameters of the function may diverge. This is unfortunate, because in the
control setting ultimately we want to learn about the unknown optimal policy.
Recently, a class of algorithms has been proposed to deal with this issue (Sutton
et al, 2008, 2009; Maei et al, 2009; Maei and Sutton, 2010). The idea is to perform
a stochastic gradient-descent update on the quadratic projected temporal difference:
E(θ) = (1/2) ‖V_t − Π B V_t‖²_P = (1/2) ∫_{s∈S} P(s_t = s) (V_t(s) − Π B V_t(s))² ds .    (7.11)
In contrast with (7.9), this error does not depend on the time step or the state. The
norm in (7.11) is weighted according to the state probabilities that are stored in the
diagonal matrix P, as described in Section 7.3.1.1. If we minimize (7.11), we reach
the fixed point in (7.7). To do this, we rewrite the error to
1 −1
E(θt ) = (E {δt ∇θ Vt (s)})T E ∇θ Vt (s)∇Tθ Vt (s) E {δt ∇θ Vt (s)} , (7.12)
2
where it is assumed that the inverse exists (Maei et al, 2009). The expectancies are
taken over the state probabilities in P. The error is the product of multiple expected
values. These expected values can not be sampled from a single experience, because
then the samples would be correlated. This can be solved by updating an additional
parameter vector. We use the shorthands φ = φ (st ) and φ = φ (st+1 ) and we assume
linear function approximation. Then ∇θ Vt (st ) = φ and we get
−1
−∇θ E(θt ) = E (φ − γφ )φ T E φ φ T E {δt φ }
T
≈ E (φ − γφ )φ w ,
where w ∈ R^{D_Φ} is an additional parameter vector that estimates [E{φ φ^T}]^{−1} E{δ_t φ} and can be updated with w_{t+1} = w_t + β_t(s_t)(δ_t − φ^T w_t) φ, where β_t(s_t) ∈ [0,1] is a step-size parameter. Then there is only one expected value left to approximate, which can be done with a single sample. This leads to the update
θ_{t+1} = θ_t + α_t(s_t) (φ − γφ′) (φ^T w_t) ,
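A sketch of the resulting pair of updates for a single sampled transition, in the style of GTD2 (Sutton et al, 2009), is shown below; the secondary vector w plays the role of the additional parameter vector discussed above, and the sampled transition is assumed to come from the off-policy behaviour distribution.

import numpy as np

def gtd2_update(theta, w, phi, r, phi_next, alpha, beta, gamma):
    """One gradient-TD update for linear off-policy policy evaluation."""
    delta = r + gamma * (theta @ phi_next) - (theta @ phi)
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi
    return theta, w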
The idea of policy-gradient algorithms is to update the policy with gradient ascent
on the cumulative expected value V π (Williams, 1992; Sutton et al, 2000; Baxter
and Bartlett, 2001; Peters and Schaal, 2008b; Rückstieß et al, 2010). If the gradient
is known, we can update the policy parameters with
ψ_{k+1} = ψ_k + β_k ∇_ψ E{V π(s_t)} = ψ_k + β_k ∇_ψ ∫_{s∈S} P(s_t = s) V π(s) ds .
Here P(st = s) denotes the probability that the agent is in state s at time step t
and βk ∈ [0,1] is a step size. In this update we use a subscript k in addition to t
to distinguish between the time step of the actions and the update schedule of the
policy parameters, which may not overlap. If the state space is finite, we can replace
the integral with a sum.
As a practical alternative, we can use stochastic gradient descent:

ψ_{t+1} = ψ_t + β_t(s_t) ∇_ψ V π(s_t) .    (7.13)
Here the time step of the update corresponds to the time step of the action and we use
the subscript t. Such procedures can at best hope to find a local optimum, because
they use a gradient of a value function that is usually not convex with respect to
the policy parameters. However, some promising results have been obtained, for
instance in robotics (Benbrahim and Franklin, 1997; Peters et al, 2003).
The obvious problem with update (7.13) is that in general V π is not known and
therefore neither is its gradient. For a successful policy-gradient algorithm, we need
an estimate of ∇ψ V π . We will now discuss how to obtain such an estimate.
We will use the concept of a trajectory. A trajectory S is a sequence of states
and actions:
S = {s0 , a0 , s1 , a1 , . . .} .
The probability that a given trajectory occurs is equal to the probability that the
corresponding sequence of states and actions occurs with the given policy:
P(S|s,ψ) = P(s_0 = s) P(a_0|s_0,ψ) P(s_1|s_0,a_0) P(a_1|s_1,ψ) P(s_2|s_1,a_1) ···
         = P(s_0 = s) ∏_{t=0}^{∞} π(s_t,a_t,ψ) P^{s_{t+1}}_{s_t a_t} .    (7.14)

The expected value V π can then be expressed as an integral over all possible trajectories for the given policy and the corresponding expected rewards:

V π(s) = ∫_S P(S|s,ψ) E{ ∑_{t=0}^{∞} γ^t r_{t+1} | S } dS .

The gradient of this value with respect to the policy parameters can then be written as

∇_ψ V π(s) = ∫_S ∇_ψ P(S|s,ψ) E{ ∑_{t=0}^{∞} γ^t r_{t+1} | S } dS
           = ∫_S P(S|s,ψ) ∇_ψ log P(S|s,ψ) E{ ∑_{t=0}^{∞} γ^t r_{t+1} | S } dS ,    (7.15)
where we used the general identity ∇x f (x) = f (x)∇x log f (x). This useful observa-
tion is related to Fisher’s score function (Fisher, 1925; Rao and Poti, 1946) and the
likelihood ratio (Fisher, 1922; Neyman and Pearson, 1928). It was applied to rein-
forcement learning by Williams (1992) for which reason it is sometimes called the
REINFORCE trick, after the policy-gradient algorithm that was proposed therein
(see, for instance, Peters and Schaal, 2008b).
The product in the definition of the probability of the trajectory as given in (7.14)
implies that the logarithm in (7.15) consists of a sum of terms, in which only the
policy terms depend on ψ . Therefore, the other terms disappear when we take the
gradient and we obtain:
∇_ψ log P(S|s,ψ) = ∇_ψ [ log P(s_0 = s) + ∑_{t=0}^{∞} log π(s_t,a_t,ψ) + ∑_{t=0}^{∞} log P^{s_{t+1}}_{s_t a_t} ]
                 = ∑_{t=0}^{∞} ∇_ψ log π(s_t,a_t,ψ) .    (7.16)
This is nice, since it implies we do not need the transition model. However, this only
holds if the policy is stochastic. If the policy is deterministic, we need the gradient ∇_ψ log P^{s′}_{sa} = ∇_a log P^{s′}_{sa} ∇_ψ π(s,a,ψ), which is available only when the transition
probabilities are known. In most cases this is not a big problem, since stochastic
policies are needed anyway to ensure sufficient exploration. Figure 7.3 shows two
examples of stochastic policies that can be used and the corresponding gradients.
Boltzmann exploration can be used in discrete action spaces. Assume that φ(s,a) is a feature vector of size D_Ψ corresponding to state s and action a. Suppose the policy is a Boltzmann distribution with parameters ψ, such that

π(s,a,ψ) = exp(ψ^T φ(s,a)) / ∑_{b∈A(s)} exp(ψ^T φ(s,b)) ,
∇_ψ log π(s,a,ψ) = φ(s,a) − ∑_{b∈A(s)} π(s,b,ψ) φ(s,b) .

Gaussian exploration can be used in continuous action spaces. Consider a Gaussian policy with mean μ ∈ R^{D_A} and D_A × D_A covariance matrix Σ, such that

π(s,a,{μ,Σ}) = (1/√((2π)^{D_A} det Σ)) exp( −(1/2) (a − μ)^T Σ^{−1} (a − μ) ) ,
∇_μ log π(s,a,{μ,Σ}) = (a − μ)^T Σ^{−1} ,
∇_Σ log π(s,a,{μ,Σ}) = (1/2) ( Σ^{−1} (a − μ)(a − μ)^T Σ^{−1} − Σ^{−1} ) ,

where the actions a ∈ A are vectors of the same size as μ. If ψ ∈ Ψ ⊆ R^{D_Ψ} is a parameter vector that determines the state-dependent location of the mean μ(s,ψ), then ∇_ψ log π(s,a,ψ) = J_ψ^T(μ(s,ψ)) ∇_μ log π(s,a,{μ,Σ}), where J_ψ(μ(s,ψ)) is the D_A × D_Ψ Jacobian matrix, containing the partial derivatives of each of the elements of μ(s,ψ) with respect to each of the elements of ψ.
The covariance matrix can be the output of a parametrized function as well, but care should be taken to preserve sufficient exploration. One way is to use natural-gradient updates, as normal gradients may decrease the exploration too fast. Another option is to use a covariance matrix σ²I, where σ is a tunable parameter that is fixed or decreased according to some predetermined schedule.
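As a small numerical sketch of the Gaussian case above, the gradients of the log-probability for a sampled action can be computed directly; the mean, covariance and action below are arbitrary values chosen for illustration.

import numpy as np

mu = np.array([0.5, -1.0])
Sigma = np.array([[0.4, 0.1],
                  [0.1, 0.3]])
a = np.array([0.8, -0.7])                    # a sampled (exploratory) action

Sigma_inv = np.linalg.inv(Sigma)
diff = a - mu
grad_mu = diff @ Sigma_inv                   # (a - mu)^T Sigma^{-1}
grad_Sigma = 0.5 * (Sigma_inv @ np.outer(diff, diff) @ Sigma_inv - Sigma_inv)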
When we know the gradient in (7.16), we can sample the quantity in (7.15). For
this, we need to sample the expected cumulative discounted reward. For instance,
if the task is episodic we can take a Monte Carlo sample that gives the cumulative
(possibly discounted) reward for each episode. In episodic MDPs, the sum in (7.16)
is finite rather than infinite and we obtain
∇_ψ V π(s_t) = E{ R_k(s_t) ∑_{j=t}^{T_k−1} ∇_ψ log π(s_j,a_j,ψ) } ,    (7.17)

where R_k(s_t) = ∑_{j=t}^{T_k−1} γ^{j−t} r_{j+1} is the total (discounted) return obtained after reaching state s_t in episode k, and T_k is the time at which that episode ended. This gradient can be sampled and used to update the policy through (7.13).
A drawback of sampling (7.17) is that the variance of Rk (st ) can be quite high,
resulting in noisy estimates of the gradient. Williams (1992) notes that this can be
mitigated somewhat by using the following update:
ψ_{t+1} = ψ_t + β_t(s_t) (R_k(s_t) − b(s_t)) ∑_{j=t}^{T_k−1} ∇_ψ log π(s_j,a_j,ψ_t) ,    (7.18)
where b(st ) is a baseline that does not depend on the policy parameters, although it
may depend on the state. This baseline can be used to minimize the variance without
adding bias to the update, since for any s ∈ S
∫_S ∇_ψ P(S|s,ψ) b(s) dS = b(s) ∇_ψ ∫_S P(S|s,ψ) dS = b(s) ∇_ψ 1 = 0 .
It has been shown that it can be a good idea to set this baseline equal to an esti-
mate of the state value, such that b(s) = Vt (s) (Sutton et al, 2000; Bhatnagar et al,
2009), although strictly speaking it is then not independent of the policy parameters.
Some work has been done to optimally set the baseline to minimize the variance and
thereby increase the convergence rate of the algorithm (Greensmith et al, 2004; Pe-
ters and Schaal, 2008b), but we will not go into this in detail here.
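A sketch of the episodic update (7.18) is given below; the functions grad_log_pi and baseline stand in for whichever policy parametrization and baseline estimate are used and are assumptions of the example.

import numpy as np

def reinforce_episode_update(psi, episode, grad_log_pi, baseline, beta, gamma):
    """episode: list of (state, action, reward) tuples, in order.
    grad_log_pi(s, a, psi): gradient of log pi(s, a, psi) w.r.t. psi.
    baseline(s): baseline value b(s), e.g. an estimate of V(s)."""
    T = len(episode)
    for t in range(T):
        # Discounted return obtained after reaching state s_t.
        R = sum(gamma ** (j - t) * episode[j][2] for j in range(t, T))
        s_t = episode[t][0]
        advantage = R - baseline(s_t)
        grad_sum = sum(grad_log_pi(s, a, psi) for s, a, _ in episode[t:])
        psi = psi + beta * advantage * grad_sum
    return psi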
The policy-gradient updates as defined above all use a gradient that updates the
policy parameters in the direction of steepest ascent of the performance metric.
However, the gradient update operates in parameter space, rather than in policy
space. In other words, when we use normal gradient descent with a step size, we
restrict the size of the change in parameter space: dψtT dψt , where dψt = ψt+1 − ψt
is the change in parameters. It has been argued that it is much better to restrict the
step size in policy space. This is similar to our observation in Section 7.2.1.2 that an
update in parameter space for a linear function approximator can result in an update
in value space with an unintended large or small step size. A good distance metric for
policies is the Kullback-Leibler divergence (Kullback and Leibler, 1951; Kullback,
This can be approximated with a second-order Taylor expansion dψ_t^T F_ψ dψ_t, where F_ψ is the D_Ψ × D_Ψ Fisher information matrix, defined as

F_ψ = E{ ∇_ψ log P(S|s,ψ) ∇_ψ^T log P(S|s,ψ) } ,
where the expectation ranges over the possible trajectories. This matrix can be sam-
pled with use of the identity (7.16). Then, we can obtain a natural policy gradient,
which follows a natural gradient (Amari, 1998). This idea was first introduced in
reinforcement learning by Kakade (2001). The desired update then becomes
ψ_{t+1} = ψ_t + β_t(s_t) F_ψ^{−1} ∇_ψ V π(s_t) ,    (7.19)
which needs to be sampled. A disadvantage of this update is the need for enough
samples to (approximately) compute the inverse matrix Fψ−1 . The number of re-
quired samples can be restrictive if the number of parameters is fairly large, espe-
cially if a sample consists of an episode that can take many time steps to complete.
Most algorithms that use a natural gradient use O(D_Ψ²) time per update and may require a reasonable number of samples. More details can be found elsewhere
(Kakade, 2001; Peters and Schaal, 2008a; Wierstra et al, 2008; Bhatnagar et al,
2009; Rückstieß et al, 2010).
The variance of the estimate of ∇ψ V π (st ) in (7.17) can be very high if Monte Carlo
roll-outs are used, which can severely slow convergence. Likewise, this is a prob-
lem for direct policy-search algorithms that use Monte Carlo roll-outs. A potential
where ∇_ψ log π(s_t,a_t,ψ_t) can be replaced with F_ψ^{−1} ∇_ψ log π(s_t,a_t,ψ_t) for a NAC algorithm.
In some cases, an explicit approximation of the inverse Fisher information matrix can be avoided by approximating Q π(s,a) − b(s) with a linear function approximator g_t^π(s,a,w) = w_t^T ∇_ψ log π(s,a,ψ_t) (Sutton et al, 2000; Konda and Tsitsiklis, 2003; Peters and Schaal, 2008a). After some algebraic manipulations we then get

ψ_{t+1} = ψ_t + β_t(s_t) w_t .
However, this elegant update only applies to critics that use the specific linear func-
tional form of gtπ (s,a,w) to approximate the value Qπ (s,a) − b(s). Furthermore, the
accuracy of this update clearly depends on the accuracy of wt . Other NAC variants
are described by Bhatnagar et al (2009).
There is significant overlap between some of the policy-gradient ideas in this
section and many of the ideas in the related field of adaptive dynamic programming
(ADP) (Powell, 2007; Wang et al, 2009). Essentially, reinforcement learning and
ADP can be thought of as different names for the same research field. However, in
practice there is a divergence between the sort of problems that are considered and
the solutions that are proposed. Usually, in adaptive dynamic programming more of
an engineering’s perspective is used, which results in a slightly different notation
and a somewhat different set of goals. For instance, in ADP the goal is often to sta-
bilize a plant (Murray et al, 2002). This puts some restraints on the exploration that
can safely be used and implies that often the goal state is the starting state and the
goal is to stay near this state, rather that to find better states. Additionally, problems
in continuous time are discussed more often than in reinforcement learning (Beard
et al, 1998; Vrabie et al, 2009) for which the continuous version of the Bellman op-
timality equation is used, that it known as the Hamilton–Jacobi–Bellman equation
(Bardi and Dolcetta, 1997). A further discussion of these specifics falls outside the
scope of this chapter.
One of the earliest actor-critic methods stems from the ADP literature. It approx-
imates Qπ , rather than V π . Suppose we use Gaussian exploration, centered at the
output of a deterministic function Ac : S × Ψ → A. Here, we refer to this function
as the actor, instead of to the whole policy. If we use a differentiable function Qt to
approximate Qπ , it becomes possible to update the parameters of this actor with use
of the chain rule:
ψ_{t+1} = ψ_t + α_t ∇_ψ Q_t(s_t, Ac(s_t,ψ), θ)
        = ψ_t + α_t J_ψ^T(Ac(s_t,ψ)) ∇_a Q_t(s_t,a) ,
and Thathachar, 1974, 1989), using the sign of the temporal-difference error as a
measure of ‘success’. Most other actor-critic methods use the size of the temporal-
difference error and also update in the opposite direction when its sign is negative.
However, this is usually not a good idea for Cacla, since this is equivalent to up-
dating towards some action that was not performed and for which it is not known
whether it is better than the current output of the actor. As an extreme case, consider
an actor that already outputs the optimal action in each state for some determinis-
tic MDP. For most exploring actions, the temporal-difference error is then negative.
If the actor were updated away from such an action, its output would almost certainly no longer be optimal.
This is an important difference between Cacla and policy-gradient methods: Ca-
cla only updates its actor when actual improvements have been observed. This
avoids slow learning when there are plateaus in the value space and the temporal-
difference errors are small. It was shown empirically that this can indeed result in
better policies than when the step size depends on the size of the temporal-difference
error (van Hasselt and Wiering, 2007). Intuitively, it makes sense that the distance to
a promising action at is more important than the size of the improvement in value.
A basic version of Cacla is shown in Algorithm 17. The policy in line 3 can
depend from the actor’s output, but this is not strictly necessary. For instance, unex-
plored promising parts of the action space could be favored by the action selection.
In Section 7.4, we will see that Cacla can even learn from a fully random policy. Ca-
cla can only update its actor when at = Ac(st ,ψ t ), but after training has concluded
the agent can deterministically use the action that is output by the actor.
The critic update in line 6 is an ordinary TD learning update. One can replace
this with a TD(λ ) update, an incremental least-squares update or with any of the
other updates from Section 7.3.1.2. The actor update in line 8 can be interpreted as
gradient descent on the error ‖a_t − Ac(s_t,ψ_t)‖ between the action that was performed
and the output of the actor. This is the second important difference with most other
actor-critic algorithms: instead of updating the policy in parameter space (Konda,
2002) or policy space (Peters and Schaal, 2008a; Bhatnagar et al, 2009), we use an
error directly in action space.
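A minimal sketch of one Cacla update for a single transition is given below, assuming a linear critic and a linear actor; the line numbers in the comments refer to Algorithm 17.

import numpy as np

def cacla_update(theta, psi, phi_s, a, r, phi_s_next, done, alpha, beta, gamma):
    """One Cacla update for a transition (s, a, r, s'), with linear critic
    V(s) = theta.phi(s) and linear actor Ac(s) = psi @ phi(s)."""
    # Critic: ordinary TD update (line 6 of Algorithm 17).
    v_next = 0.0 if done else theta @ phi_s_next
    delta = r + gamma * v_next - theta @ phi_s
    theta = theta + alpha * delta * phi_s
    # Actor: update towards the performed action only if it appeared to be
    # an improvement, i.e. the TD error is positive (line 8 of Algorithm 17).
    if delta > 0:
        psi = psi + beta * np.outer(a - psi @ phi_s, phi_s)
    return theta, psi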
In reinforcement learning, many different metrics have been used to compare the
performance of algorithms and no fully standardized benchmarks exist. Therefore,
we compare the results of Cacla to the results for CMA-ES and NAC from an earlier
paper (Heidrich-Meisner and Igel, 2008), using the dynamics and the performance
metric used therein. We choose this particular paper because it reports results for
NAC and for CMA-ES, which is considered the current state-of-the-art in direct
policy search and black-box optimization (Jiang et al, 2008; Gomez et al, 2008;
Glasmachers et al, 2010; Hansen et al, 2010).
The dynamics of the double cart pole are as follows (Wieland, 1991):
ẍ = ( F − μ_c sign(ẋ) + ∑_{i=1}^{2} [ (m_i l_i χ̇_i² sin χ_i)/2 + (3/4) m_i cos χ_i ( (2 μ_i χ̇_i)/(m_i l_i) + g sin χ_i ) ] ) / ( m_c + ∑_{i=1}^{2} m_i ( 1 − (3/4) cos² χ_i ) )

χ̈_i = − (3/(2 l_i)) ( ẍ cos χ_i + g sin χ_i + (2 μ_i χ̇_i)/(m_i l_i) )
Here l1 = 1 m and l2 = 0.1 m are the lengths of the poles, mc = 1 kg is the weight of
the cart, m1 = 0.1 kg and m2 = 0.01 kg are the weights of the poles and g = 9.81 m/s2
is the gravity constant. Friction is modeled with coefficients μc = 5 · 10−4 N s/m and
μ1 = μ2 = 2 · 10−6 N m s. The admissible state space is defined by the position of the
cart x ∈ [−2.4 m,2.4 m] and the angles of both poles χi ∈ [−36◦,36◦ ] for i ∈ {1,2}.
On leaving the admissible state space, the episode ends. Every time step yields a
reward of rt = 1 and therefore it is optimal to make episodes as long as possible.
The agent can choose an action from the range [−50 N,50 N] every 0.02 s.
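For concreteness, one Euler integration step of these dynamics could be implemented as below; it follows the full-pole-length form of the equations as reconstructed above, so the exact coefficients should be treated as an assumption rather than as the precise simulator used in the cited experiments.

import numpy as np

# Constants from the text (full pole lengths).
l = np.array([1.0, 0.1])        # m
m = np.array([0.1, 0.01])       # kg
m_c, g = 1.0, 9.81
mu_c, mu_p = 5e-4, 2e-6
dt = 0.02                       # s

def double_pole_step(state, F):
    """state = (x, x_dot, chi1, chi1_dot, chi2, chi2_dot); F in [-50, 50] N."""
    x, x_dot, c1, c1_dot, c2, c2_dot = state
    chi = np.array([c1, c2])
    chi_dot = np.array([c1_dot, c2_dot])
    pole_force = (m * l * chi_dot ** 2 * np.sin(chi) / 2.0
                  + 0.75 * m * np.cos(chi)
                  * (2.0 * mu_p * chi_dot / (m * l) + g * np.sin(chi)))
    eff_mass = m * (1.0 - 0.75 * np.cos(chi) ** 2)
    x_acc = (F - mu_c * np.sign(x_dot) + pole_force.sum()) / (m_c + eff_mass.sum())
    chi_acc = -(3.0 / (2.0 * l)) * (x_acc * np.cos(chi) + g * np.sin(chi)
                                    + 2.0 * mu_p * chi_dot / (m * l))
    # Simple Euler integration with time step dt.
    return (x + dt * x_dot, x_dot + dt * x_acc,
            c1 + dt * c1_dot, c1_dot + dt * chi_acc[0],
            c2 + dt * c2_dot, c2_dot + dt * chi_acc[1])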
Because CMA-ES uses Monte Carlo roll-outs, the task was made explicitly
episodic by resetting the environment every 20 s (Heidrich-Meisner and Igel, 2008).
This is not required for Cacla, but was done anyway to make the comparison
fair. The feature vector is φ (s) = (x,ẋ,χ1 ,χ̇1 ,χ2 ,χ̇2 )T . All episodes start in φ (s) =
(0,0,1◦ ,0,0,0)T . The discount factor in the paper was γ = 1. This means that the state
values are unbounded. Therefore, we use a discount factor of γ = 0.99. In practice,
this makes little difference for the performance. Even though Cacla optimizes the
discounted cumulative reward, we use the reward per episode as performance met-
ric, which is explicitly optimized by CMA-ES and NAC.
CMA-ES and NAC were used to train a linear controller, so Cacla is also used
to find a linear controller. We use a bias feature that is always equal to one, so we
are looking for a parameter vector ψ ∈ R7 . A hard threshold is used, such that if the
output of the actor is larger than 50 N or smaller than −50 N, the agent outputs 50 N
or −50 N, respectively. The critic was implemented with a multi-layer perceptron with 40 hidden nodes and a tanh activation function for the hidden layer. The initial controller was initialized with uniformly random parameters between −0.3 and 0.3.
No attempt was made to optimize this initial range for the parameters or the number
of hidden nodes.
The results by CMA-ES are shown in Figure 7.4. Heidrich-Meisner and Igel
(2008) show that NAC performs far worse and therefore we do not show its results.
Fig. 7.4 Median reward per episode by CMA-ES out of 500 repetitions of the experiment.
The x-axis shows the number of episodes. Figure is taken from Heidrich-Meisner and Igel
(2008).
Table 7.2 The mean reward per episode (mean), the standard error of this mean (se) and the
percentage of trials where the reward per episode was equal to 1000 (success) are shown for
Cacla with α = β = 10−3 . Results are shown for training episodes 401–500 (online) and for
the greedy policy after 500 episodes of training (offline). The action noise and exploration
are explained in the main text. Averaged over 1000 repetitions.
                                      online                        offline
action noise     exploration    mean     se     success      mean     se     success
0                σ = 5000       946.3    6.7    92.3 %       954.2    6.3    94.5 %
0                ε = 0.1        807.6    9.6    59.0 %       875.2    9.4    84.5 %
0                ε = 1           29.2    0.0     0 %         514.0   10.4    25.5 %
[−20 N, 20 N]    σ = 5000       944.6    6.9    92.5 %       952.4    6.5    94.5 %
[−20 N, 20 N]    ε = 0.1        841.0    8.7    60.7 %       909.5    8.1    87.4 %
[−20 N, 20 N]    ε = 1           28.7    0.0     0 %         454.7    9.5    11.3 %
[−40 N, 40 N]    σ = 5000       936.0    7.4    91.9 %       944.9    7.0    93.8 %
[−40 N, 40 N]    ε = 0.1        854.2    7.9    50.5 %       932.6    6.7    86.7 %
[−40 N, 40 N]    ε = 1           27.6    0.0     0 %         303.0    6.7     0 %
On average, the controllers found by Cacla after only 500 episodes are signifi-
cantly better than those found by CMA-ES after 10,000 episodes. Even ε = 1 results
in quite reasonable greedy policies. Naturally, when ε = 1 the online performance
is poor, because the policy is fully random. But note that the greedy performance of
514.0 is much better than the performance of CMA-ES after 500 episodes.
To test robustness of Cacla, we reran the experiment with noise in the action
execution. A uniform random force in the range [−20 N,20 N] or [−40 N,40 N] is
added to the action before execution. The action noise is added after cropping the
actor output to the admissible range and the algorithm is not informed of the amount
of added noise. For example, assume the actor of Cacla outputs an action of Ac(st ) =
40. Then Gaussian exploration is added, for instance resulting in at = 70. This action
is not in the admissible range [−50,50], so it is cropped to 50. Then uniform noise,
drawn from [−20,20] or [−40,40], is added. Suppose the result is 60. Then, a force
of 60 N is applied to the cart. If the resulting temporal-difference is positive, the
output of the actor for this state is updated towards at = 70, so the algorithm is
unaware of both the cropping and the uniform noise that were applied to its output.
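This sequence of exploration, cropping and action noise can be summarized as follows. The σ of the Gaussian exploration and the helper names are placeholders; the ordering matches the description above.

```python
import numpy as np

rng = np.random.default_rng()

def execute(actor_output, sigma, noise_range):
    """Return the actor's update target a_t and the force actually applied to the cart."""
    a_t = actor_output + rng.normal(0.0, sigma)           # Gaussian exploration, e.g. 40 -> 70
    applied = float(np.clip(a_t, -50.0, 50.0))            # crop to the admissible range, e.g. 50
    applied += rng.uniform(-noise_range, noise_range)     # uniform action noise, e.g. 50 -> 60
    # If the temporal-difference error turns out positive, the actor is updated
    # towards a_t (70), not towards the cropped or perturbed force that was applied.
    return a_t, applied
```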
The results including action noise are also shown in Table 7.2. The performance
of Cacla is barely affected when Gaussian exploration is used. The slight drop in
performance falls within the statistical margins of error, although it does seem con-
sistent. Interestingly, the added noise even improves the online and offline perfor-
mance of Cacla when ε -greedy exploration with ε = 0.1 is used. Apparently, the
added noise results in desirable extra exploration.
This experiment indicates that the relatively simple Cacla algorithm is very ef-
fective at solving some continuous reinforcement-learning problems. Other previous
work shows that natural-gradient and evolutionary algorithms typically need a few
thousand episodes to learn a good policy, not only on the double-pole but also on the
single-pole task (Sehnke et al, 2010). We do not show the results here, but Cacla also
performs very well on the single-pole cart pole. Naturally, this does not imply that
Cacla is the best choice for all continuous MDPs. For instance, in partially observ-
able MDPs an evolutionary approach that directly searches in parameter space may
find good controllers faster, although it is possible to use Cacla to train a recurrent
neural network, for instance with real-time recurrent learning (Williams and Zipser,
1989) or backpropagation through time (Werbos, 1990). Additionally, convergence
to an optimal policy or even local optima for variants of Cacla is not (yet) guar-
anteed, while for some actor-critic (Bhatnagar et al, 2009) and direct policy-search
algorithms convergence to a local optimum can be guaranteed.
The reason that Cacla performs much better than CMA-ES on this particular
problem is that CMA-ES uses whole episodes to estimate the fitness of a candidate
policy and stores a whole population of such policies. Cacla, on the other hand,
makes use of the structure of the problem by using temporal-difference errors. This
allows it to quickly update its actor, making learning possible even during the first
few episodes. NAC has the additional disadvantage that quite a few samples are
necessary to make its estimate of the Fisher information matrix accurate enough to
find the natural-gradient direction. Finally, the improvements to the actor in Cacla
are not slowed down by plateaus in the value space. As episodes become longer, the
value space will typically exhibit such plateaus, making the gradient estimates used
by NAC more unreliable and the updates smaller. Because Cacla operates directly
in action space, it does not have this problem and it can move towards better actions
with a fixed step size, whenever the temporal-difference is positive.
As a final note, the simple variant of Cacla will probably not perform very well
in problems with specific types of noise. For instance, Cacla may be tempted to
update towards actions that often yield fairly high returns but sometimes yield very
low returns, making them a poor choice on average. This problem can be mitigated
by storing an explicit approximation of the reward function, or by using averaged
temporal-difference errors instead of the stochastic errors. These issues have not
been investigated in depth.
7.5 Conclusion
There are numerous ways to find good policies in problems with continuous spaces.
Three general methodologies exist that differ in which part of the problem is explic-
itly approximated: the model, the value of a policy, or the policy itself. Function ap-
proximation can be used to approximate these functions, which can be updated with
gradient-based or gradient-free methods. Many different reinforcement-learning al-
gorithms result from combinations of these techniques. We mostly focused on value-
function and policy approximation, because models of continuous MDPs quickly
become intractable to solve, making explicit approximations of these less useful.
Acknowledgements. I would like to thank Peter Bosman and the anonymous reviewers for
helpful comments.
References
Akimoto, Y., Nagata, Y., Ono, I., Kobayashi, S.: Bidirectional Relation Between CMA Evolu-
tion Strategies and Natural Evolution Strategies. In: Schaefer, R., Cotta, C., Kołodziej, J.,
Rudolph, G. (eds.) PPSN XI. LNCS, vol. 6238, pp. 154–163. Springer, Heidelberg (2010)
Albus, J.S.: A theory of cerebellar function. Mathematical Biosciences 10, 25–61 (1971)
Albus, J.S.: A new approach to manipulator control: The cerebellar model articulation con-
troller (CMAC). In: Dynamic Systems, Measurement and Control, pp. 220–227 (1975)
Amari, S.I.: Natural gradient works efficiently in learning. Neural Computation 10(2), 251–
276 (1998)
Anderson, C.W.: Learning to control an inverted pendulum using neural networks. IEEE Con-
trol Systems Magazine 9(3), 31–37 (1989)
Antos, A., Munos, R., Szepesvári, C.: Fitted Q-iteration in continuous action-space MDPs.
In: Advances in Neural Information Processing Systems (NIPS-2007), vol. 20, pp. 9–16
(2008a)
Antos, A., Szepesvári, C., Munos, R.: Learning near-optimal policies with Bellman-residual
minimization based fitted policy iteration and a single sample path. Machine Learn-
ing 71(1), 89–129 (2008b)
Babuska, R.: Fuzzy modeling for control. Kluwer Academic Publishers (1998)
Bäck, T.: Evolutionary algorithms in theory and practice: evolution strategies, evolutionary
programming, genetic algorithms. Oxford University Press, USA (1996)
Bäck, T., Schwefel, H.P.: An overview of evolutionary algorithms for parameter optimization.
Evolutionary Computation 1(1), 1–23 (1993)
Baird, L.: Residual algorithms: Reinforcement learning with function approximation. In:
Prieditis, A., Russell, S. (eds.) Machine Learning: Proceedings of the Twelfth Interna-
tional Conference, pp. 30–37. Morgan Kaufmann Publishers, San Francisco (1995)
Baird, L.C., Klopf, A.H.: Reinforcement learning with high-dimensional, continuous actions.
Tech. Rep. WL-TR-93-114, Wright Laboratory, Wright-Patterson Air Force Base, OH
(1993)
Bardi, M., Dolcetta, I.C.: Optimal control and viscosity solutions of Hamilton–Jacobi–
Bellman equations. Springer, Heidelberg (1997)
Barto, A.G., Sutton, R.S., Anderson, C.W.: Neuronlike adaptive elements that can solve
difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernet-
ics SMC-13, 834–846 (1983)
Baxter, J., Bartlett, P.L.: Infinite-horizon policy-gradient estimation. Journal of Artificial In-
telligence Research 15, 319–350 (2001)
Beard, R., Saridis, G., Wen, J.: Approximate solutions to the time-invariant Hamilton–
Jacobi–Bellman equation. Journal of Optimization theory and Applications 96(3), 589–
626 (1998)
Bellman, R.: Dynamic Programming. Princeton University Press (1957)
Benbrahim, H., Franklin, J.A.: Biped dynamic walking using reinforcement learning.
Robotics and Autonomous Systems 22(3-4), 283–302 (1997)
Berenji, H.: Fuzzy Q-learning: a new approach for fuzzy dynamic programming. In: Pro-
ceedings of the Third IEEE Conference on Fuzzy Systems, IEEE World Congress on
Computational Intelligence, pp. 486–491. IEEE (1994)
Berenji, H., Khedkar, P.: Learning and tuning fuzzy logic controllers through reinforcements.
IEEE Transactions on Neural Networks 3(5), 724–740 (1992)
Bertsekas, D.P.: Dynamic Programming and Optimal Control, vol. I. Athena Scientific (2005)
Bertsekas, D.P.: Dynamic Programming and Optimal Control, vol. II. Athena Scientific
(2007)
Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-dynamic Programming. Athena Scientific, Belmont
(1996)
Bertsekas, D.P., Borkar, V.S., Nedic, A.: Improved temporal difference methods with linear
function approximation. In: Handbook of Learning and Approximate Dynamic Program-
ming, pp. 235–260 (2004)
Beyer, H., Schwefel, H.: Evolution strategies–a comprehensive introduction. Natural Com-
puting 1(1), 3–52 (2002)
Bhatnagar, S., Sutton, R.S., Ghavamzadeh, M., Lee, M.: Natural actor-critic algorithms.
Automatica 45(11), 2471–2482 (2009)
Bishop, C.M.: Neural networks for pattern recognition. Oxford University Press, USA (1995)
Bishop, C.M.: Pattern recognition and machine learning. Springer, New York (2006)
Bonarini, A.: Delayed reinforcement, fuzzy Q-learning and fuzzy logic controllers. In: Her-
rera, F., Verdegay, J.L. (eds.) Genetic Algorithms and Soft Computing. Studies in Fuzzi-
ness, vol. 8, pp. 447–466. Physica-Verlag, Berlin (1996)
Boyan, J.A.: Technical update: Least-squares temporal difference learning. Machine Learn-
ing 49(2), 233–246 (2002)
Bradtke, S.J., Barto, A.G.: Linear least-squares algorithms for temporal difference learning.
Machine Learning 22, 33–57 (1996)
Bryson, A., Ho, Y.: Applied Optimal Control. Blaisdell Publishing Co. (1969)
Buşoniu, L., Ernst, D., De Schutter, B., Babuška, R.: Continuous-State Reinforcement Learn-
ing with Fuzzy Approximation. In: Tuyls, K., Nowe, A., Guessoum, Z., Kudenko, D.
(eds.) ALAMAS 2005, ALAMAS 2006, and ALAMAS 2007. LNCS (LNAI), vol. 4865,
pp. 27–43. Springer, Heidelberg (2008)
Buşoniu, L., Babuška, R., De Schutter, B., Ernst, D.: Reinforcement Learning and Dynamic
Programming Using Function Approximators. CRC Press, Boca Raton (2010)
Coulom, R.: Reinforcement learning using neural networks, with applications to motor con-
trol. PhD thesis, Institut National Polytechnique de Grenoble (2002)
Crites, R.H., Barto, A.G.: Improving elevator performance using reinforcement learning. In:
Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (eds.) Advances in Neural Information
Processing Systems, vol. 8, pp. 1017–1023. MIT Press, Cambridge (1996)
Crites, R.H., Barto, A.G.: Elevator group control using multiple reinforcement learning
agents. Machine Learning 33(2/3), 235–262 (1998)
Davis, L.: Handbook of genetic algorithms. Arden Shakespeare (1991)
Dayan, P.: The convergence of TD(λ ) for general lambda. Machine Learning 8, 341–362
(1992)
Dayan, P., Sejnowski, T.: TD(λ ): Convergence with probability 1. Machine Learning 14,
295–301 (1994)
Dearden, R., Friedman, N., Russell, S.: Bayesian Q-learning. In: Proceedings of the Fifteenth
National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial
Intelligence, pp. 761–768. American Association for Artificial Intelligence (1998)
Dearden, R., Friedman, N., Andre, D.: Model based Bayesian exploration. In: Proceedings of
the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 150–159 (1999)
Eiben, A.E., Smith, J.E.: Introduction to evolutionary computing. Springer, Heidelberg
(2003)
Ernst, D., Geurts, P., Wehenkel, L.: Tree-based batch mode reinforcement learning. Journal
of Machine Learning Research 6(1), 503–556 (2005)
Fisher, R.A.: On the mathematical foundations of theoretical statistics. Philosophical Trans-
actions of the Royal Society of London Series A, Containing Papers of a Mathematical or
Physical Character 222, 309–368 (1922)
Fisher, R.A.: Statistical methods for research workers. Oliver & Boyd, Edinburgh (1925)
Främling, K.: Replacing eligibility trace for action-value learning with function approxima-
tion. In: Proceedings of the 15th European Symposium on Artificial Neural Networks
(ESANN-2007), pp. 313–318. d-side publishing (2007)
Gaskett, C., Wettergreen, D., Zelinsky, A.: Q-learning in continuous state and action spaces.
In: Advanced Topics in Artificial Intelligence, pp. 417–428 (1999)
Geramifard, A., Bowling, M., Sutton, R.S.: Incremental least-squares temporal difference
learning. In: Proceedings of the 21st National Conference on Artificial Intelligence, vol. 1,
pp. 356–361. AAAI Press (2006)
Geramifard, A., Bowling, M., Zinkevich, M., Sutton, R.: ilstd: Eligibility traces and con-
vergence analysis. In: Advances in Neural Information Processing Systems, vol. 19, pp.
441–448 (2007)
Glasmachers, T., Schaul, T., Yi, S., Wierstra, D., Schmidhuber, J.: Exponential natural evolu-
tion strategies. In: Proceedings of the 12th Annual Conference on Genetic and Evolution-
ary Computation, pp. 393–400. ACM (2010)
Glorennec, P.: Fuzzy Q-learning and dynamical fuzzy Q-learning. In: Proceedings of the
Third IEEE Conference on Fuzzy Systems, IEEE World Congress on Computational In-
telligence, pp. 474–479. IEEE (1994)
Glover, F., Kochenberger, G.: Handbook of metaheuristics. Springer, Heidelberg (2003)
Gomez, F., Schmidhuber, J., Miikkulainen, R.: Accelerated neural evolution through coopera-
tively coevolved synapses. The Journal of Machine Learning Research 9, 937–965 (2008)
Gordon, G.J.: Stable function approximation in dynamic programming. In: Prieditis, A., Rus-
sell, S. (eds.) Proceedings of the Twelfth International Conference on Machine Learning
(ICML 1995), pp. 261–268. Morgan Kaufmann, San Francisco (1995)
Gordon, G.J.: Approximate solutions to Markov decision processes. PhD thesis, Carnegie
Mellon University (1999)
Greensmith, E., Bartlett, P.L., Baxter, J.: Variance reduction techniques for gradient esti-
mates in reinforcement learning. The Journal of Machine Learning Research 5, 1471–
1530 (2004)
Hansen, N., Ostermeier, A.: Completely derandomized self-adaptation in evolution strategies.
Evolutionary Computation 9(2), 159–195 (2001)
Hansen, N., Müller, S.D., Koumoutsakos, P.: Reducing the time complexity of the deran-
domized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary
Computation 11(1), 1–18 (2003)
Hansen, N., Auger, A., Ros, R., Finck, S., Pošík, P.: Comparing results of 31 algorithms from
the black-box optimization benchmarking BBOB-2009. In: Proceedings of the 12th An-
nual Conference Companion on Genetic and Evolutionary Computation, GECCO 2010,
pp. 1689–1696. ACM, New York (2010)
Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall PTR (1994)
Heidrich-Meisner, V., Igel, C.: Evolution Strategies for Direct Policy Search. In: Rudolph,
G., Jansen, T., Lucas, S., Poloni, C., Beume, N. (eds.) PPSN 2008. LNCS, vol. 5199, pp.
428–437. Springer, Heidelberg (2008)
Holland, J.H.: Outline for a logical theory of adaptive systems. Journal of the ACM
(JACM) 9(3), 297–314 (1962)
Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press,
Ann Arbor (1975)
Howard, R.A.: Dynamic programming and Markov processes. MIT Press (1960)
Huyer, W., Neumaier, A.: SNOBFIT–stable noisy optimization by branch and fit. ACM
Transactions on Mathematical Software (TOMS) 35(2), 1–25 (2008)
Jiang, F., Berry, H., Schoenauer, M.: Supervised and Evolutionary Learning of Echo State
Networks. In: Rudolph, G., Jansen, T., Lucas, S., Poloni, C., Beume, N. (eds.) PPSN
2008. LNCS, vol. 5199, pp. 215–224. Springer, Heidelberg (2008)
Jouffe, L.: Fuzzy inference system learning by reinforcement methods. IEEE Transactions on
Systems, Man, and Cybernetics, Part C: Applications and Reviews 28(3), 338–355 (1998)
Kakade, S.: A natural policy gradient. In: Dietterich, T.G., Becker, S., Ghahramani, Z. (eds.)
Advances in Neural Information Processing Systems 14 (NIPS-2001), pp. 1531–1538.
MIT Press (2001)
Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proceedings of IEEE Interna-
tional Conference on Neural Networks, Perth, Australia, vol. 4, pp. 1942–1948 (1995)
Kirkpatrick, S.: Optimization by simulated annealing: Quantitative studies. Journal of Statis-
tical Physics 34(5), 975–986 (1984)
Klir, G., Yuan, B.: Fuzzy sets and fuzzy logic: theory and applications. Prentice Hall PTR,
Upper Saddle River (1995)
Konda, V.: Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology (2002)
Konda, V.R., Borkar, V.: Actor-critic type learning algorithms for Markov decision processes.
SIAM Journal on Control and Optimization 38(1), 94–123 (1999)
Konda, V.R., Tsitsiklis, J.N.: Actor-critic algorithms. SIAM Journal on Control and Opti-
mization 42(4), 1143–1166 (2003)
Kullback, S.: Statistics and Information Theory. J. Wiley and Sons, New York (1959)
Kullback, S., Leibler, R.A.: On information and sufficiency. Annals of Mathematical Statis-
tics 22, 79–86 (1951)
Lagoudakis, M., Parr, R.: Least-squares policy iteration. The Journal of Machine Learning
Research 4, 1107–1149 (2003)
Lin, C., Lee, C.: Reinforcement structure/parameter learning for neural-network-based fuzzy
logic control systems. IEEE Transactions on Fuzzy Systems 2(1), 46–63 (1994)
Lin, C.S., Kim, H.: CMAC-based adaptive critic self-learning control. IEEE Transactions on
Neural Networks 2(5), 530–533 (1991)
Lin, L.: Self-improving reactive agents based on reinforcement learning, planning and teach-
ing. Machine Learning 8(3), 293–321 (1992)
Lin, L.J.: Reinforcement learning for robots using neural networks. PhD thesis, Carnegie
Mellon University, Pittsburgh (1993)
Littman, M.L., Szepesvári, C.: A generalized reinforcement-learning model: Convergence
and applications. In: Saitta, L. (ed.) Proceedings of the 13th International Conference on
Machine Learning (ICML 1996), pp. 310–318. Morgan Kaufmann, Bari (1996)
Maei, H.R., Sutton, R.S.: GQ (λ ): A general gradient algorithm for temporal-difference pre-
diction learning with eligibility traces. In: Proceedings of the Third Conference On Arti-
ficial General Intelligence (AGI-2010), pp. 91–96. Atlantis Press, Lugano (2010)
Maei, H.R., Szepesvári, C., Bhatnagar, S., Precup, D., Silver, D., Sutton, R.: Convergent
temporal-difference learning with arbitrary smooth function approximation. In: Advances
in Neural Information Processing Systems 22 (NIPS-2009) (2009)
Maei, H.R., Szepesvári, C., Bhatnagar, S., Sutton, R.S.: Toward off-policy learning control
with function approximation. In: Proceedings of the 27th Annual International Conference
on Machine Learning (ICML-2010). ACM, New York (2010)
Maillard, O.A., Munos, R., Lazaric, A., Ghavamzadeh, M.: Finite sample analysis of Bellman
residual minimization. In: Asian Conference on Machine Learning, ACML-2010 (2010)
Mitchell, T.M.: Machine learning. McGraw Hill, New York (1996)
Moriarty, D.E., Miikkulainen, R.: Efficient reinforcement learning through symbiotic evolu-
tion. Machine Learning 22, 11–32 (1996)
Moriarty, D.E., Schultz, A.C., Grefenstette, J.J.: Evolutionary algorithms for reinforcement
learning. Journal of Artificial Intelligence Research 11, 241–276 (1999)
Murray, J.J., Cox, C.J., Lendaris, G.G., Saeks, R.: Adaptive dynamic programming. IEEE
Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 32(2),
140–153 (2002)
Narendra, K.S., Thathachar, M.A.L.: Learning automata - a survey. IEEE Transactions on
Systems, Man, and Cybernetics 4, 323–334 (1974)
Narendra, K.S., Thathachar, M.A.L.: Learning automata: an introduction. Prentice-Hall, Inc.,
Upper Saddle River (1989)
Nedić, A., Bertsekas, D.P.: Least squares policy evaluation algorithms with linear function
approximation. Discrete Event Dynamic Systems 13(1-2), 79–110 (2003)
Neyman, J., Pearson, E.S.: On the use and interpretation of certain test criteria for purposes
of statistical inference part i. Biometrika 20(1), 175–240 (1928)
Ng, A.Y., Parr, R., Koller, D.: Policy search via density estimation. In: Solla, S.A., Leen,
T.K., Müller, K.R. (eds.) Advances in Neural Information Processing Systems, vol. 13,
pp. 1022–1028. The MIT Press (1999)
Nguyen-Tuong, D., Peters, J.: Model learning for robot control: a survey. Cognitive Process-
ing, 1–22 (2011)
Ormoneit, D., Sen, Ś.: Kernel-based reinforcement learning. Machine Learning 49(2), 161–
178 (2002)
Pazis, J., Lagoudakis, M.G.: Binary action search for learning continuous-action control poli-
cies. In: Proceedings of the 26th Annual International Conference on Machine Learning,
pp. 793–800. ACM (2009)
Peng, J.: Efficient dynamic programming-based learning for control. PhD thesis, Northeastern
University (1993)
Peters, J., Schaal, S.: Natural actor-critic. Neurocomputing 71(7-9), 1180–1190 (2008a)
Peters, J., Schaal, S.: Reinforcement learning of motor skills with policy gradients. Neural
Networks 21(4), 682–697 (2008b)
Peters, J., Vijayakumar, S., Schaal, S.: Reinforcement learning for humanoid robotics. In:
IEEE-RAS International Conference on Humanoid Robots (Humanoids 2003). IEEE
Press (2003)
Poupart, P., Vlassis, N., Hoey, J., Regan, K.: An analytic solution to discrete Bayesian re-
inforcement learning. In: Proceedings of the 23rd International Conference on Machine
Learning, pp. 697–704. ACM (2006)
Powell, M.: UOBYQA: unconstrained optimization by quadratic approximation. Mathemat-
ical Programming 92(3), 555–582 (2002)
Powell, M.: The NEWUOA software for unconstrained optimization without derivatives. In:
Large-Scale Nonlinear Optimization, pp. 255–297 (2006)
Powell, W.B.: Approximate Dynamic Programming: Solving the Curses of Dimensionality.
Wiley-Blackwell (2007)
Precup, D., Sutton, R.S.: Off-policy temporal-difference learning with function approxi-
mation. In: Machine Learning: Proceedings of the Eighteenth International Conference
(ICML 2001), pp. 417–424. Morgan Kaufmann, Williams College (2001)
Precup, D., Sutton, R.S., Singh, S.P.: Eligibility traces for off-policy policy evaluation. In:
Proceedings of the Seventeenth International Conference on Machine Learning (ICML
2000), pp. 766–773. Morgan Kaufmann, Stanford University, Stanford, CA (2000)
Prokhorov, D.V., Wunsch, D.C.: Adaptive critic designs. IEEE Transactions on Neural
Networks 8(5), 997–1007 (2002)
Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming.
John Wiley & Sons, Inc., New York (1994)
Puterman, M.L., Shin, M.C.: Modified policy iteration algorithms for discounted Markov
decision problems. Management Science 24(11), 1127–1137 (1978)
Rao, C.R., Poti, S.J.: On locally most powerful tests when alternatives are one sided. Sankhyā:
The Indian Journal of Statistics, 439–439 (1946)
Rechenberg, I.: Evolutionsstrategie - Optimierung technischer Systeme nach Prinzipien der
biologischen Evolution. Fromman-Holzboog (1971)
Riedmiller, M.: Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural
Reinforcement Learning Method. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M.,
Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 317–328. Springer, Heidelberg
(2005)
Ripley, B.D.: Pattern recognition and neural networks. Cambridge University Press (2008)
Rubinstein, R.: The cross-entropy method for combinatorial and continuous optimization.
Methodology and Computing in Applied Probability 1(2), 127–190 (1999)
Rubinstein, R., Kroese, D.: The cross-entropy method: a unified approach to combinatorial
optimization, Monte-Carlo simulation, and machine learning. Springer-Verlag New York
Inc. (2004)
Rückstieß, T., Sehnke, F., Schaul, T., Wierstra, D., Sun, Y., Schmidhuber, J.: Exploring pa-
rameter space in reinforcement learning. Paladyn 1(1), 14–24 (2010)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error
propagation. In: Parallel Distributed Processing, vol. 1, pp. 318–362. MIT Press (1986)
Rummery, G.A., Niranjan, M.: On-line Q-learning using connectionist systems. Tech. Rep.
CUED/F-INFENG-TR 166, Cambridge University, UK (1994)
Santamaria, J.C., Sutton, R.S., Ram, A.: Experiments with reinforcement learning in prob-
lems with continuous state and action spaces. Adaptive Behavior 6(2), 163–217 (1997)
Scherrer, B.: Should one compute the temporal difference fix point or minimize the Bell-
man residual? The unified oblique projection view. In: Fürnkranz, J., Joachims, T. (eds.)
Proceedings of the 27th International Conference on Machine Learning (ICML 2010),
pp. 959–966. Omnipress (2010)
Schwefel, H.P.: Numerische Optimierung von Computer-Modellen. Interdisciplinary
Systems Research, vol. 26. Birkhäuser, Basel (1977)
Sehnke, F., Osendorfer, C., Rückstieß, T., Graves, A., Peters, J., Schmidhuber, J.: Parameter-
exploring policy gradients. Neural Networks 23(4), 551–559 (2010)
Singh, S.P., Sutton, R.S.: Reinforcement learning with replacing eligibility traces. Machine
Learning 22, 123–158 (1996)
Spaan, M., Vlassis, N.: Perseus: Randomized point-based value iteration for POMDPs. Jour-
nal of Artificial Intelligence Research 24(1), 195–220 (2005)
Stanley, K.O., Miikkulainen, R.: Efficient reinforcement learning through evolving neural
network topologies. In: Proceedings of the Genetic and Evolutionary Computation Con-
ference (GECCO-2002), pp. 569–577. Morgan Kaufmann, San Francisco (2002)
Strehl, A.L., Li, L., Wiewiora, E., Langford, J., Littman, M.L.: PAC model-free reinforce-
ment learning. In: Proceedings of the 23rd International Conference on Machine Learning,
pp. 881–888. ACM (2006)
Strens, M.: A Bayesian framework for reinforcement learning. In: Proceedings of the Sev-
enteenth International Conference on Machine Learning, p. 950. Morgan Kaufmann Pub-
lishers Inc. (2000)
Sun, Y., Wierstra, D., Schaul, T., Schmidhuber, J.: Efficient natural evolution strategies. In:
Proceedings of the 11th Annual conference on Genetic and Evolutionary Computation
(GECCO-2009), pp. 539–546. ACM (2009)
Sutton, R.S.: Temporal credit assignment in reinforcement learning. PhD thesis, University
of Massachusetts, Dept. of Comp. and Inf. Sci. (1984)
Sutton, R.S.: Learning to predict by the methods of temporal differences. Machine Learn-
ing 3, 9–44 (1988)
Sutton, R.S.: Generalization in reinforcement learning: Successful examples using sparse
coarse coding. In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (eds.) Advances in Neu-
ral Information Processing Systems, vol. 8, pp. 1038–1045. MIT Press, Cambridge (1996)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. The MIT press,
Cambridge (1998)
Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforce-
ment learning with function approximation. In: Advances in Neural Information Process-
ing Systems 13 (NIPS-2000), vol. 12, pp. 1057–1063 (2000)
Sutton, R.S., Szepesvári, C., Maei, H.R.: A convergent O(n) algorithm for off-policy
temporal-difference learning with linear function approximation. In: Advances in Neu-
ral Information Processing Systems 21 (NIPS-2008), vol. 21, pp. 1609–1616 (2008)
Sutton, R.S., Maei, H.R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., Wiewiora,
E.: Fast gradient-descent methods for temporal-difference learning with linear function
approximation. In: Proceedings of the 26th Annual International Conference on Machine
Learning (ICML 2009), pp. 993–1000. ACM (2009)
Szepesvári, C.: Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intel-
ligence and Machine Learning 4(1), 1–103 (2010)
Szepesvári, C., Smart, W.D.: Interpolation-based Q-learning. In: Proceedings of the Twenty-
First International Conference on Machine Learning (ICML 2004), p. 100. ACM (2004)
Szita, I., Lörincz, A.: Learning tetris using the noisy cross-entropy method. Neural Compu-
tation 18(12), 2936–2941 (2006)
Taylor, M.E., Whiteson, S., Stone, P.: Comparing evolutionary and temporal difference meth-
ods in a reinforcement learning domain. In: Proceedings of the 8th Annual Conference on
Genetic and Evolutionary Computation, p. 1328. ACM (2006)
Tesauro, G.: Practical issues in temporal difference learning. In: Lippman, D.S., Moody, J.E.,
Touretzky, D.S. (eds.) Advances in Neural Information Processing Systems, vol. 4, pp.
259–266. Morgan Kaufmann, San Mateo (1992)
Tesauro, G.: TD-Gammon, a self-teaching backgammon program, achieves master-level play.
Neural Computation 6(2), 215–219 (1994)
Tesauro, G.J.: Temporal difference learning and TD-Gammon. Communications of the
ACM 38, 58–68 (1995)
Thrun, S., Schwartz, A.: Issues in using function approximation for reinforcement learning.
In: Mozer, M., Smolensky, P., Touretzky, D., Elman, J., Weigend, A. (eds.) Proceedings
of the 1993 Connectionist Models Summer School. Lawrence Erlbaum, Hillsdale (1993)
Touzet, C.F.: Neural reinforcement learning for behaviour synthesis. Robotics and Au-
tonomous Systems 22(3/4), 251–281 (1997)
Tsitsiklis, J.N., Van Roy, B.: An analysis of temporal-difference learning with function ap-
proximation. Tech. Rep. LIDS-P-2322, MIT Laboratory for Information and Decision
Systems, Cambridge, MA (1996)
Tsitsiklis, J.N., Van Roy, B.: An analysis of temporal-difference learning with function
approximation. IEEE Transactions on Automatic Control 42(5), 674–690 (1997)
van Hasselt, H.P.: Double Q-Learning. In: Advances in Neural Information Processing
Systems, vol. 23. The MIT Press (2010)
van Hasselt, H.P.: Insights in reinforcement learning. PhD thesis, Utrecht University (2011)
van Hasselt, H.P., Wiering, M.A.: Reinforcement learning in continuous action spaces. In:
Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming
and Reinforcement Learning (ADPRL-2007), pp. 272–279 (2007)
van Hasselt, H.P., Wiering, M.A.: Using continuous action spaces to solve discrete problems.
In: Proceedings of the International Joint Conference on Neural Networks (IJCNN 2009),
pp. 1149–1156 (2009)
van Seijen, H., van Hasselt, H.P., Whiteson, S., Wiering, M.A.: A theoretical and empirical
analysis of Expected Sarsa. In: Proceedings of the IEEE International Symposium on
Adaptive Dynamic Programming and Reinforcement Learning, pp. 177–184 (2009)
Vapnik, V.N.: The nature of statistical learning theory. Springer, Heidelberg (1995)
Vrabie, D., Pastravanu, O., Abu-Khalaf, M., Lewis, F.: Adaptive optimal control for continuous-
time linear systems based on policy iteration. Automatica 45(2), 477–484 (2009)
Wang, F.Y., Zhang, H., Liu, D.: Adaptive dynamic programming: An introduction. IEEE
Computational Intelligence Magazine 4(2), 39–47 (2009)
Watkins, C.J.C.H.: Learning from delayed rewards. PhD thesis, King’s College, Cambridge,
England (1989)
Watkins, C.J.C.H., Dayan, P.: Q-learning. Machine Learning 8, 279–292 (1992)
Werbos, P.J.: Beyond regression: New tools for prediction and analysis in the behavioral
sciences. PhD thesis, Harvard University (1974)
Werbos, P.J.: Advanced forecasting methods for global crisis warning and models of intelli-
gence. In: General Systems, vol. XXII, pp. 25–38 (1977)
Werbos, P.J.: Backpropagation and neurocontrol: A review and prospectus. In: IEEE/INNS
International Joint Conference on Neural Networks, Washington, D.C, vol. 1, pp. 209–216
(1989a)
Werbos, P.J.: Neural networks for control and system identification. In: Proceedings of
IEEE/CDC, Tampa, Florida (1989b)
Werbos, P.J.: Consistency of HDP applied to a simple reinforcement learning problem. Neural
Networks 2, 179–189 (1990)
Werbos, P.J.: Backpropagation through time: What it does and how to do it. Proceedings of
the IEEE 78(10), 1550–1560 (1990)
Whiteson, S., Stone, P.: Evolutionary function approximation for reinforcement learning.
Journal of Machine Learning Research 7, 877–917 (2006)
Whitley, D., Dominic, S., Das, R., Anderson, C.W.: Genetic reinforcement learning for neu-
rocontrol problems. Machine Learning 13(2), 259–284 (1993)
Wieland, A.P.: Evolving neural network controllers for unstable systems. In: International
Joint Conference on Neural Networks, vol. 2, pp. 667–673. IEEE, New York (1991)
Wiering, M.A., van Hasselt, H.P.: The QV family compared to other reinforcement learning
algorithms. In: Proceedings of the IEEE International Symposium on Adaptive Dynamic
Programming and Reinforcement Learning, pp. 101–108 (2009)
Wierstra, D., Schaul, T., Peters, J., Schmidhuber, J.: Natural evolution strategies. In: IEEE
Congress on Evolutionary Computation (CEC-2008), pp. 3381–3387. IEEE (2008)
Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforce-
ment learning. Machine Learning 8, 229–256 (1992)
Williams, R.J., Zipser, D.: A learning algorithm for continually running fully recurrent neural
networks. Neural Computation 1(2), 270–280 (1989)
Wilson, D.R., Martinez, T.R.: The general inefficiency of batch training for gradient descent
learning. Neural Networks 16(10), 1429–1451 (2003)
Zadeh, L.: Fuzzy sets. Information and Control 8(3), 338–353 (1965)
Zhou, C., Meng, Q.: Dynamic balance of a biped robot using fuzzy reinforcement learning
agents. Fuzzy Sets and Systems 134(1), 169–187 (2003)
Chapter 8
Solving Relational and First-Order Logical
Markov Decision Processes: A Survey
Abstract. In this chapter we survey representations and techniques for Markov de-
cision processes, reinforcement learning, and dynamic programming in worlds ex-
plicitly modeled in terms of objects and relations. Such relational worlds can be
found everywhere, in planning domains, games, real-world indoor scenes and many
other settings. Relational representations allow for expressive and natural data structures that
capture the objects and relations in an explicit way, enabling generalization over
objects and relations, but also over similar problems which differ in the number of
objects. The field was recently surveyed completely in (van Otterlo, 2009b), and
here we describe a large portion of the main approaches. We discuss model-free –
both value-based and policy-based – and model-based dynamic programming tech-
niques. Several other aspects will be covered, such as models and hierarchies, and
we end with several recent efforts and future directions.
As humans we speak and think about the world as being made up of objects and
relations among objects. There are books, tables and houses, tables inside houses,
books on top of tables, and so on. ”We are equipped with an inductive bias, a pre-
disposition to learn to divide the world up into objects, to study the interaction of
those objects, and to apply a variety of computational modules to the representa-
tion of these objects” (Baum, 2004, p. 173). For intelligent agents this should not
be different. In fact, ”. . . it is hard to imagine a truly intelligent agent that does
not conceive of the world in terms of objects and their properties and relations to
other objects” (Kaelbling et al, 2001). Furthermore, such representations are highly
effective and compact: "The description of the world in terms of objects and simple
interactions is an enormously compressed description” (Baum, 2004, p. 168).
Over the last decade a new subfield of reinforcement learning (RL) has emerged that
tries to endow intelligent agents with both the knowledge representation of (proba-
bilistic) first-order logic – to deal with objects and relations – and efficient RL algo-
rithms – to deal with decision-theoretic learning in complex and uncertain worlds.
This field – relational RL (van Otterlo, 2009b) – takes inspiration, representational
methodologies and algorithms from diverse fields (see Fig. 8.2(left)) such as RL
(Sutton and Barto, 1998), logic-based artificial intelligence (Minker, 2000), knowl-
edge representation (KR) (Brachman and Levesque, 2004), planning (see Russell
and Norvig, 2003), probabilistic-logical machine learning (De Raedt, 2008) and
cognitive architectures (Langley, 2006). In many of these areas the use of objects
and relations is widespread and it is the assumption that if RL is to be applied in
these fields – e.g. for cognitive agents learning to behave, or RL in tasks where
communicating the acquired knowledge is required – one has to investigate how RL
algorithms interact with expressive formalisms such as first-order logic. Relational
RL offers a new representational paradigm for RL and can tackle many problems
more compactly and efficiently than state-of-the-art propositional approaches, and
in addition, can tackle new problems that could not be handled before. It also of-
fers new opportunities to inject additional background knowledge into algorithms,
surpassing tabula rasa learning.
In this chapter we introduce relational RL as new representational paradigm in
RL. We start by introducing the elements of generalization in RL and briefly sketch
historical developments. In Section 8.2 we introduce relational representations for
Markov decision processes (MDPs). In Sections 8.3 and 8.4 we survey model-based
and model-free solution algorithms. In addition we survey other approaches, such as
hierarchies and model learning, in Section 8.5, and we describe recent developments
in Section 8.6. We end with conclusions and future directions in Section 8.7.
assume a fixed representation, and learn either MDP-related functions (Q, V) over
this abstraction level (PIAGeT-1) or they learn parameters (e.g. neural network
weights) of the representation simultaneously (PIAGeT-2). PIAGeT-3 methods are the
most general; they adapt abstraction levels while solving an MDP.

Fig. 8.1 Policy Iteration using Abstraction and Generalization Techniques (PIAGeT):
abstraction levels PIAGeT-0 to PIAGeT-3 above the flat MDP, with a policy π and
value function V^π at each level

Since relational RL is above all a representational upgrade of RL we have to look
at concepts of generalization (or abstraction) in MDPs. We can distinguish five types
typically used in MDPs: i) state space abstraction, ii) factored MDP representations,
iii) value functions, iv) policies, and v) hierarchical decompositions. Relational RL
focuses on representing an MDP in terms of objects and relations, and then employing
logical generalization to obtain any of the five types, possibly making use of results
obtained with propositional representations.
Generalization is closely tied to representation. In other words, one can only
learn what one can represent. Even though there are many ways of representing
knowledge, including rules, neural networks, entity-relationship models, first-order
logic and so on, there are few representational classes:
Atomic. Most descriptions of algorithms and proofs use so-called atomic state
representations, in which the state space S is a discrete set, consisting of discrete
states s1 to sN and A is a set of actions. No abstraction (generalization) is used,
and all computation and storage (e.g. of values) happens state-wise.
Propositional. This is the most common form of MDP representation (see Bert-
sekas and Tsitsiklis, 1996; Sutton and Barto, 1998; Boutilier et al, 1999, for thor-
ough descriptions). Each state consists of a vector of attribute-value pairs, where
each attribute (feature, random variable) represents a measurable quantity of the
domain. Features can be binary or real-valued. Usually the action set A is still
a discrete set. Propositional encodings enable the use of, e.g. decision trees or
support vector machines to represent value functions and policies, thereby gen-
eralizing over parts of the state space.
Deictic. Some works use deictic representations (e.g. Finney et al, 2002) in which
propositional encodings are used to refer to objects. For example, the value of
the feature ”the color of the object in my hand” could express something about
a specific object, without explicitly naming it. Deictic representations, as well
as object-oriented representations (Diuk et al, 2008) are still propositional, but
bridge the gap to explicit relational formalisms.
Relational. A relational representation is an expressive knowledge representa-
tion (KR) format in which objects and relations between objects can be expressed
in an explicit way, and will be discussed in the next section.
Each representational class comes with its own ways to generalize and abstract.
Atomic states do not offer much in that respect, but for propositional representations
Fig. 8.2 (left) The material in this chapter embedded in various subfields of artificial in-
telligence (AI), (middle) relational representation of a scene, (right) generalization over a
probabilistic action
The research described in this chapter is a natural development in at least three areas.
The core element (see Fig. 8.2) is a first-order (or relational) MDP.
Reinforcement learning. Before the mid-nineties, many trial-and-error learn-
ing, optimal control, dynamic programming (DP) and value function approxima-
tion techniques were developed. The book by Sutton and Barto (1998) marked
the beginning of a new era in which more sophisticated algorithms and represen-
tations were developed, and more theory was formed concerning approximation
and convergence. Around the turn of the millennium, relational RL emerged as
a new subfield, going beyond the propositional representations used until then.
Džeroski et al (1998) introduced a first version of Q-learning in a relational do-
main, and Boutilier et al (2001) reported the first value iteration algorithm in
first-order logic. Together they initiated a new representational direction in RL.
Machine Learning. Machine learning (ML), of which RL is a subfield, has
used relational representations much earlier. Inductive logic programming (ILP)
(Bergadano and Gunetti, 1995) has a long tradition in learning logical concepts
from data. Still much current ML research uses purely propositional represen-
tations and focuses on probabilistic aspects (Alpaydin, 2004). However, over the last
decade logical and probabilistic approaches have been merging in the field of statistical
relational learning (SRL) (De Raedt, 2008). Relational RL itself can be seen as
an additional step, adding a utility framework to SRL.
Action languages. Planning and cognitive architectures are old AI subjects,
but before relational RL surprisingly few approaches could deal efficiently with
decision-theoretic concepts. First-order action languages ranging from STRIPS
(Fikes and Nilsson, 1971) to situation calculus (Reiter, 2001) were primarily
used in deterministic, goal-based planning contexts, and the same holds for
cognitive architectures. Much relational RL is based on existing action formalisms,
thereby extending their applicability to solve decision-theoretic problems.

Fig. 8.3 (left) a Blocks World planning problem (initial and goal state) with an optimal
plan: 1) move a to d, 2) move b to the floor, 3) move c to a, 4) move e to b; (right) a
logical Q-tree with tests on(A,B), clear(A) and clear(E) and leaf values 0.0, 1.0, 0.9
and 0.81
Objects and relations require a formalism to express them in explicit form. Fig. 8.2
(middle) depicts a typical indoor scene with several objects such as room1 , pot and
mug2 . Relations between objects are the solid lines, e.g. the relations connected
(room1 ,room2 ) and on(table,floor). Some relations merely express properties of
objects (i.e. the dashed lines) and can be represented as color(mug1 ,red), or as a
unary relation red(mug1 ). Actions are represented in a similar way, rendering them
parameterized, such as pickup(mug1 ) and goto(room1 ).
Note that a propositional representation of the scene would be cumbersome. Each
relation between objects should be represented by a binary feature. In order to rep-
resent a state by a feature vector, all objects and relations should first be fixed, and
ordered, to generate the features. For example, it should contain on_mug1_table = 1
and on_mug2_table = 1, but also useless features such as on_table_table = 0 and
on_room1_mug2 = 0. This would blow up in size since it must contain many irrele-
vant relations (many of which do not hold in a state), and would be very inflexible
if the number of objects or relations varies. Relational representations can nat-
urally deal with these issues.
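As a concrete illustration, the scene of Fig. 8.2 can be stored directly as a set of ground atoms rather than as a fixed, ordered feature vector. The tuple encoding below is one of many possible choices and only meant as a sketch.

```python
# Each ground atom is a tuple (predicate, arg1, ..., argn).
scene = {
    ("in", "table", "room1"), ("on", "mug1", "table"), ("on", "mug2", "table"),
    ("on", "table", "floor"), ("connected", "room1", "room2"),
    ("color", "mug1", "red"), ("here", "room1"),
}

# Queries are simple set comprehensions, and adding or removing objects does
# not require regenerating (and reordering) a feature vector.
mugs_on_table = {a[1] for a in scene if a[0] == "on" and a[2] == "table"}
print(mugs_on_table)   # {'mug1', 'mug2'}
```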
An important way to generalize over objects is by using variables that can stand
for different objects (denoted by uppercase letters). For example, in Fig. 8.2(right)
an action specification makes use of that in order to pickup an object denoted by
M. It specifies that the object should be on some object denoted by T, in the same
current location R, and M could be e.g. mug1 or mug2 . The bottom two rectangles
show two different outcomes of the action, governed by a probability distribution.
Out of many artificial domains, the Blocks World (Slaney and Thiébaux, 2001,
and Fig. 8.3(left)) is probably the most well-known problem in areas such as
KR and planning (see Russell and Norvig, 2003; Schmid, 2001; Brachman and
Levesque, 2004) and recently, relational RL. It is a computationally hard problem
for general purpose AI systems, it has a relatively simple form, and it supports
meaningful, systematic and affordable experiments. Blocks World is often used in
relational RL and we use it here to explain important concepts.
More generally, a relational representation consists of a domain of objects (or,
constants) C, and a set of relations {p/α } (or, predicates) of which each can have
several arguments (i.e. the arity α ). A ground atom is a relation applied to constants
from C, e.g. on(a,b). For generalization, usually a logical language is used which –
in addition to the domain C and set of predicates P – also contains variables, quanti-
fiers and connectives. In the sentence (i.e. formula) φ ≡ ∃X,Y on(X,Y) ∧ clear(X) in
such a language, φ represents all states in which there are (expressed by the quanti-
fier ∃) two objects (or, blocks) X and Y which are on each other, and also (expressed
by the connective ∧, or, the logical AND) that X is clear. In Fig. 8.3(left), in the
initial state X could only be a, but in the goal state, it could refer to c or e. Note that
φ is not ground since it contains variables.
Logical abstractions can be learned from data. This is typically done through
methods of ILP (Bergadano and Gunetti, 1995) or SRL (De Raedt, 2008). The de-
tails of these approaches are outside the scope of this chapter. It suffices here to see
them as search algorithms through a vast space of logical abstractions, similar in
spirit to propositional tree and rule learners (Alpaydin, 2004). The structure of the
space is given by the expressivity of the logical language used. In case of SRL, an
additional problem is learning the parameters (e.g. probabilities).
We first need to represent all the aspects of the problem in relational form, i.e. MDPs
in relational form, based on a domain of objects D and a set of predicates P.
Definition 8.2.1. Let P = {p1/α1, . . . , pn/αn} be a set of predicates with their ar-
ities, C = {c1, . . . , ck} a set of constants, and let A = {a1/α1, . . . , am/αm} be a
set of actions with their arities. Let S be the set of all ground atoms that can be
constructed from P and C, and let A be the set of all ground atoms over A and C.
A relational MDP (RMDP) is then a tuple M = ⟨S, A, T, R⟩, with a transition function
T and a reward function R defined over these ground sets.
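The ground sets of Definition 8.2.1 can be enumerated mechanically, as in the following sketch for a tiny Blocks World vocabulary (the predicate, action and constant choices are illustrative assumptions):

```python
from itertools import product

P = {"on": 2, "clear": 1}      # predicates p/alpha
ACT = {"move": 2}              # actions a/alpha
C = ["a", "b", "c", "floor"]   # constants

def ground_atoms(symbols, constants):
    """All ground atoms constructible from the given symbols and constants."""
    return {(name,) + args
            for name, arity in symbols.items()
            for args in product(constants, repeat=arity)}

S = ground_atoms(P, C)     # ground state atoms; relational states are sets of these
A = ground_atoms(ACT, C)   # ground actions
print(len(S), len(A))      # 20 ground atoms and 16 ground actions for this vocabulary
```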
Relational MDPs form the core model underlying all work in relational RL. Usu-
ally they are not posed in ground form, but specified by logical languages which
can differ much in their expressivity (e.g. which quantifiers and connectives they
support) and reasoning complexity. Here we restrict our story to a simple form: an
abstract state Z is a conjunction Z ≡ p1 ∧ . . . ∧ pm of logical atoms, which can con-
tain variables. A conjunction is implicitly existentially quantified, i.e. an abstract
state Z1 ≡ on(X,Y) should be seen as ∃X∃Y Z1, which reads as there are two blocks,
denoted X and Y, AND block X is on block Y. For brevity we will omit the quantifiers
and connectors from now on. An abstract state Z models a ground state s if we can find
a substitution of the variables in Z such that all the substituted atoms appear in s. It
is said that Z θ-subsumes s (with θ the substitution). For example, Z1 models (sub-
sumes) the state s ≡ clear(a), on(a,b), clear(c), . . . since we can substitute X and Y
with a and b and then on(a,b) appears in s. Thus, an abstract state generalizes over
a set of states of the underlying RMDP, i.e. it is an aggregate state.
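The θ-subsumption test just described can be sketched as a small backtracking search over substitutions. Atoms are tuples, variables are written as uppercase strings, and the generate-and-test strategy below is only meant to illustrate the idea, not to be an efficient subsumption engine.

```python
def is_var(term):
    return isinstance(term, str) and term[0].isupper()

def match(abstract, ground, theta=None):
    """Return a substitution theta such that every atom of the abstract
    conjunction, with theta applied, appears in the ground state; return
    None if no such substitution exists (no subsumption)."""
    theta = dict(theta or {})
    if not abstract:
        return theta
    first, rest = abstract[0], abstract[1:]
    for atom in ground:
        if atom[0] != first[0] or len(atom) != len(first):
            continue
        trial = dict(theta)
        if all(trial.setdefault(t, g) == g if is_var(t) else t == g
               for t, g in zip(first[1:], atom[1:])):
            result = match(rest, ground, trial)
            if result is not None:
                return result
    return None

Z1 = [("on", "X", "Y")]                                   # abstract state Z1
s = {("clear", "a"), ("on", "a", "b"), ("clear", "c")}    # a ground state
print(match(Z1, s))   # {'X': 'a', 'Y': 'b'}  ->  Z1 subsumes s
```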
Abstract states are also (partially) ordered by θ-subsumption. If an abstract state
Z1 subsumes another abstract state Z2, then Z1 is more general than Z2. A domain
theory supports reasoning and constraint handling to check whether states are legal
(wrt. to the underlying semantics of the domain) and to extend states (i.e. derive new
facts based on the domain theory). For the Blocks World, this could allow one to derive
from on(a,b) that also under(b,a) is true. In addition, it would also exclude states
that are easy to generate from relational atoms, but are impossible in the domain,
e.g. on(a,a), using the rule on(X,X) → false.
The transition function in Definition 8.2.1 is usually defined in terms of abstract
actions. Their definition depends on a specific action logic and many different forms
exist. An action induces a mapping between abstract states, i.e. from a set of states
to another set of states. An abstract action defines a probabilistic action operator
by means of a probability distribution over a set of deterministic action outcomes.
A generic definition is the following, based on probabilistic STRIPS (Hanks and
McDermott, 1994). Consider the abstract action

move(X,Y) : cl(X), cl(Y), on(X,Z)  →   0.9 : on(X,Y), cl(X), cl(Z)
                                        0.1 : cl(X), cl(Y), on(X,Z)          (8.1)

which moves block X on Y with probability 0.9. With probability 0.1 the action fails,
i.e., we do not change the state. For an action rule pre → post, if pre is θ-subsumed
by a state s, then the resulting state s′ is s with pre θ (the applied substitution) re-
moved, and post θ added. Applied to s ≡ cl(a), cl(b), on(a,c) the action tells us
that move(a,b) will result in s′ ≡ on(a,b), cl(a), cl(c) with probability 0.9 and
with probability 0.1 we stay in s. An action defines a probabilistic mapping over the
set of state-action-state pairs.
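Given a substitution θ produced by a matcher such as the one sketched above, applying a probabilistic rule of the form of Equation 8.1 amounts to sampling an outcome, deleting the substituted precondition atoms and adding the substituted effect atoms. The data structures below are illustrative assumptions.

```python
import random

def substitute(atoms, theta):
    """Apply substitution theta to a list of atoms."""
    return [tuple(theta.get(t, t) for t in atom) for atom in atoms]

def apply_action(state, pre, outcomes, theta, rng=random):
    """outcomes: list of (probability, effect atoms); the rule 'pre -> post'
    removes pre*theta from the state and adds post*theta."""
    post, = rng.choices([eff for _, eff in outcomes],
                        weights=[p for p, _ in outcomes])
    return (state - set(substitute(pre, theta))) | set(substitute(post, theta))

# The move(X, Y) rule of Equation 8.1, with outcome probabilities 0.9 and 0.1:
pre = [("cl", "X"), ("cl", "Y"), ("on", "X", "Z")]
outcomes = [(0.9, [("on", "X", "Y"), ("cl", "X"), ("cl", "Z")]),
            (0.1, pre)]                       # failure: nothing changes

s = {("cl", "a"), ("cl", "b"), ("on", "a", "c")}
theta = {"X": "a", "Y": "b", "Z": "c"}        # e.g. found by the matcher above
print(apply_action(s, pre, outcomes, theta))
# with probability 0.9: {('on', 'a', 'b'), ('cl', 'a'), ('cl', 'c')}
```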
Abstract (state) reward functions are easy to specify with abstract states. A
generic definition of R is the set {⟨Zi, ri⟩ | i = 1 . . . N} where each Zi is an abstract
state and ri is a numerical value. Then, for each relational state s of an RMDP, we can
define the reward of s to be the reward given to that abstract state Zi that generalizes
over s, i.e. that subsumes s. If rewards should be unique for states (not additive) one
has to ensure that R forms a partition over the complete state space of the RMDP.
Overall, by specifying i) a domain C and a set of relations P, ii) an abstract action
definition, and iii) an abstract reward function, one can define RMDPs in a compact
way. By only changing the domain C one obtains RMDP variants consisting of
different sets of objects, i.e. a family of related RMDPs (see van Otterlo, 2009b, for
more details).
For solution elements – policies and value functions – we can employ similar
representations. In fact, although value functions are different from reward func-
tions, we can use the same representation in terms of a set of abstract states with
values. However, other – more compact – representations are possible. To represent
a state-action value function, Džeroski et al (1998) employed a relational deci-
sion tree, i.e. a compact partition of the state-action space into regions of the same
value. Fig. 8.3(right) depicts such a tree, where the root node specifies the action
move(D,E). All state-action pairs that try to move block D on E, in a state where a
block A is not on B and where A is clear, get a value 1. Note that, in contrast to
propositional trees, node tests share variables.
Policies are mappings from states to actions, and they are often represented using
relational decision lists (Mooney and Califf, 1995). For example, consider the fol-
lowing abstract policy for a Blocks World, which optimally encodes how to reach
states where on(a,b) holds:
r1 : move(X,floor) ← onTop(X,b)
r2 : move(Y,floor) ← onTop(Y,a)
r3 : move(a,b) ← clear(a),clear(b)
r4 : noop ← on(a,b)
where noop denotes doing nothing. Rules are read from top to bottom: given a state,
the first rule where the abstract state applies generates an optimal action.
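Evaluating such a decision list only requires the subsumption test from before: scan the rules top to bottom and return the (substituted) action of the first rule whose condition matches the current state. The sketch below repeats a compact version of the earlier matcher so that it is self-contained; names and encodings are illustrative.

```python
def is_var(t):
    return isinstance(t, str) and t[0].isupper()

def match(abstract, ground, theta=None):
    # Same backtracking theta-subsumption test as in the earlier sketch.
    theta = dict(theta or {})
    if not abstract:
        return theta
    first, rest = abstract[0], abstract[1:]
    for atom in ground:
        if atom[0] != first[0] or len(atom) != len(first):
            continue
        trial = dict(theta)
        if all(trial.setdefault(t, g) == g if is_var(t) else t == g
               for t, g in zip(first[1:], atom[1:])):
            result = match(rest, ground, trial)
            if result is not None:
                return result
    return None

# The abstract policy for reaching on(a,b): (action, condition) pairs, top to bottom.
policy = [
    (("move", "X", "floor"), [("onTop", "X", "b")]),
    (("move", "Y", "floor"), [("onTop", "Y", "a")]),
    (("move", "a", "b"),     [("clear", "a"), ("clear", "b")]),
    (("noop",),              [("on", "a", "b")]),
]

def act(state):
    for action, condition in policy:
        theta = match(condition, state)
        if theta is not None:
            return tuple(theta.get(t, t) for t in action)
    return None

s = {("clear", "c"), ("onTop", "c", "b"), ("on", "c", "b"),
     ("clear", "a"), ("on", "a", "floor"), ("on", "b", "floor")}
print(act(s))   # ('move', 'c', 'floor'): first clear the block on top of b
```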
Relational RL, either model-free or model-based, has to cope with representation
generation and control learning. A starting point is the RMDP with associated value
functions V and Q and policy π that correspond to the problem, and the overall goal
of learning is an (optimal) abstract policy Π̃∗:

$$\text{RMDP } M = \langle S, A, T, R \rangle \;\xrightarrow{\;\text{control learning}\;+\;\text{representation learning}\;}\; \tilde{\Pi}^{*} : S \rightarrow A \qquad (8.2)$$
Because learning the policy in the ground RMDP is not an option, various routes
can be taken in order to find Π̃ ∗ . One can first construct abstract value functions and
deduce a policy from that. One might also inductively learn the policy from optimal
ground traces. Relational abstraction over RMDPs induces a number of structural
learning tasks. In the next sections we survey various approaches.
Model-based algorithms rely on the assumption that a full (abstract) model of the
RMDP is available, such that it can be used in DP algorithms to compute value
functions and policies. A characteristic solution pattern is the following series of
purely deductive steps:
$$\tilde{V}^{0} \equiv R \;\xrightarrow{D}\; \tilde{V}^{1} \;\xrightarrow{D}\; \tilde{V}^{2} \;\xrightarrow{D}\; \cdots \;\xrightarrow{D}\; \tilde{V}^{k} \;\xrightarrow{D}\; \tilde{V}^{k+1} \;\xrightarrow{D}\; \cdots \qquad (8.3)$$

with, at every step k, a deduction (D) from the value function Ṽ^k to the corresponding abstract policy Π̃^k.
The initial specification of the problem can be viewed as a logical theory. The initial
reward function R is used as the initial zero-step value function Ṽ 0 . Each subsequent
abstract value function Ṽ k+1 is obtained from Ṽ k by deduction (D). From each value
function Ṽ k a policy Π̃ k can be deduced. At the end of the section we will briefly
discuss additional sample-based approaches. Relational model-based solution al-
gorithms generally make explicit use of Bellman optimality equations. By turning
these into update rules they then heavily utilize the KR capabilities to exploit struc-
ture in both solutions and algorithms. This has resulted in several relational versions
of traditional algorithms such as value iteration (VI).
Let us take a look at the following Bellman update rule, which is usually applied to
all states s ∈ S simultaneously in an iteration:

$$V_{k+1}(s) \;:=\; \max_{a \in A} \sum_{s' \in S} T(s, a, s')\,\bigl[R(s, a, s') + \gamma\, V_{k}(s')\bigr]$$
Now, classical VI computes values individually for each state, even though many
states will share the exact same transition pattern to other states. Structured (rela-
tional) representations can be used to avoid such redundancy, and compute values
for many states simultaneously. Look again at the abstract action definition in Equa-
tion 8.1. This compactly specifies many transition probabilities in the form of rules.
In a similar way, abstract value functions can represent compactly the values for a
set of states using just a single abstract state. Intensional dynamic programming
(IDP) (van Otterlo, 2009a) makes the use of KR formalisms explicit in MDPs, and
provides a unifying framework for structured DP. IDP is defined in a representation-independent
way, but can be instantiated with any atomic, propositional, or relational
representation. The core of IDP consists of expressing the four mentioned computations
(overlap, regression, combination, and maximization) by representation-dependent
counterparts. Together they are called decision-theoretic regression (DTR). A sim-
ple instantiation of IDP is set-based DP, in which value functions are represented
as discrete sets, and DP employs set-based backups. Several other algorithms in the
literature can be seen as instantiations of IDP in the propositional context.
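As a toy illustration of set-based DP, the sketch below runs value iteration over a handful of abstract state sets; the blocks, rewards and transition probabilities are invented, and the partition is assumed to be consistent with the dynamics so that one backup per block suffices.

```python
# A toy sketch of set-based dynamic programming (a simple instantiation of
# IDP).  The blocks of states, rewards and transition probabilities are
# invented, and the partition is assumed to be consistent with the dynamics
# (block-to-block probabilities are well defined), so that a single backup
# per block gives the values of all states in that block.

GAMMA = 0.9

blocks = ['B_goal', 'B_near', 'B_far']            # each block stands for a set of states

R = {'B_goal': 10.0, 'B_near': 0.0, 'B_far': 0.0}
T = {                                             # T[(block, action)] = {next_block: prob}
    ('B_goal', 'noop'): {'B_goal': 1.0},
    ('B_near', 'move'): {'B_goal': 0.9, 'B_near': 0.1},
    ('B_near', 'noop'): {'B_near': 1.0},
    ('B_far',  'move'): {'B_near': 0.9, 'B_far': 0.1},
    ('B_far',  'noop'): {'B_far': 1.0},
}

def actions(block):
    return [a for (b, a) in T if b == block]

V = {b: 0.0 for b in blocks}                      # one value per block, not per state
for _ in range(100):                              # value iteration over blocks
    V = {b: max(R[b] + GAMMA * sum(p * V[nb] for nb, p in T[(b, a)].items())
                for a in actions(b))
         for b in blocks}

print({b: round(v, 2) for b, v in V.items()})
```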
A first development in IDP is explanation-based RL ( EBRL) (Dietterich and
Flann, 1997). The inspiration for this work is the similarity between Bellman back-
ups and explanation-based generalization ( EBG) in explanation-based learning
( EBL) (see Minton et al, 1989). The representation that is used consists of proposi-
tional rules and region-based (i.e. for 2D grid worlds) representations. Boutilier et al
(2000) introduced structured value and policy iteration algorithms (SVI and SPI)
based on propositional trees for value functions and policies and dynamic Bayesian
networks for representing the transition function. Later these techniques were ex-
tended to work with even more compact algebraic decision diagrams (ADD). A
third instantiation of IDP are the (hyper)-rectangular partitions of continuous state
spaces (Feng et al, 2004). Here, so-called rectangular piecewise-constant ( RPWC)
functions are stored using kd-trees (splitting hyperplanes along k axes) for efficient
manipulation. IDP techniques are conceptually easy to extend towards POMDPs
(Boutilier and Poole, 1996). Let us now go to the relational setting.
Exact model-based algorithms for RMDPs implement the four components of DTR
in a specific logical formalism. The starting point is an abstract action as specified in
Equation 8.1. Let us assume our reward function R, which is the initial value function
V^0, is {⟨on(a,b), on(c,d), 10⟩, ⟨true, 0⟩}, i.e. abstract states in which a is on b and c
is on d have value 10, and all other states have value 0. Now, in the first step of DTR, we can
use the action specification to find abstract states to backup the values in R to. Since
the value function is not specified in terms of individual states, we can use regres-
sion to find out what are the conditions for the action in order to end up in an ab-
stract state. Let us consider the first outcome of the action (with probability 0.9), and
take on(a,b), on(c,d) in the value function V^0. Matching now amounts to checking
whether there is an overlap between states expressed by on(X,Y),cl(X),cl(Z) and
on(a,b), on(c,d). That is, we want to consider those states that are modeled by both
the action effect and the part the value function we are looking at. It turns out there
are four ways of computing this overlap (see Fig. 8.4) due to the use of variables
(we omit constraints here).
Let us pick the first possibility in the picture. What happened was that
on(X,Y) was matched against on(a,b), which says that we consider the case where
the action performed has caused a being on b. Matching has generated the substitutions
X/a and Y/b, which results in the matched state σ ≡ on(a,b), cl(a), cl(Z)
"plus" on(c,d), which is the part the action did not cause. Now the regression step
is reasoning backwards through the action, finding out what the possible abstract
state should have been in order for the action move(a,b) to have caused
σ. This is a straightforward computation and results in σ₁ ≡ on(a,Z), cl(b) "plus"
on(c,d). Now, in terms of the DTR algorithm, we can compute a partial Q-value
Q(σ₁, move(a,b)) = γ · T(σ₁, move(a,b), σ) · V^k(σ) = γ · 0.9 · 10.
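The matching and regression steps just described can be mimicked in a few lines of code. The sketch below is illustrative only: the move-action specification is an assumed, simplified stand-in for Equation 8.1, and variables are simply encoded as uppercase strings.

```python
# An illustrative sketch of the matching and regression steps above.  The
# move-action specification is an assumed, simplified stand-in for
# Equation 8.1; atoms are tuples and uppercase strings denote variables.

GAMMA, P_OUT, VALUE = 0.9, 0.9, 10.0

def is_var(t):
    return isinstance(t, str) and t[0].isupper()

def unify(pattern, ground, subst):
    """Match one atom containing variables against one ground atom."""
    if pattern[0] != ground[0] or len(pattern) != len(ground):
        return None
    subst = dict(subst)
    for p, g in zip(pattern[1:], ground[1:]):
        if is_var(p):
            if subst.get(p, g) != g:
                return None
            subst[p] = g
        elif p != g:
            return None
    return subst

def apply_subst(atoms, subst):
    return [(a[0],) + tuple(subst.get(t, t) for t in a[1:]) for a in atoms]

# assumed first outcome of move(X,Y): effects and preconditions
effects = [('on', 'X', 'Y'), ('cl', 'X')]
pre     = [('on', 'X', 'Z'), ('cl', 'X'), ('cl', 'Y')]

sigma = [('on', 'a', 'b'), ('on', 'c', 'd')]       # abstract state with value 10 in V^k

# matching: the action's on(X,Y) effect caused on(a,b)
theta = unify(effects[0], sigma[0], {})
residue = [atom for atom in sigma if atom != sigma[0]]   # part the action did not cause

# regression: instantiate the preconditions and add the residue
pre_state = apply_subst(pre, theta) + residue
partial_q = GAMMA * P_OUT * VALUE

print(theta)                 # {'X': 'a', 'Y': 'b'}
print(pre_state)             # [('on','a','Z'), ('cl','a'), ('cl','b'), ('on','c','d')]
print(round(partial_q, 3))   # 8.1
```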
Fig. 8.4 Examples of regression in a Blocks World. All four configurations around the
center state are possible ’pre-states’ when regressing the center state through a standard move
action. The two left configurations are situations in which either block a is put onto b (1), or
block c is put onto d (2). Block Z in both these configurations can be either a block or the
floor. In the top right configuration (3) block a was on block c and is now put onto b. The
fourth configuration (4) is interesting because it assumes that none of the blocks a, b, c, d is
actually moved, but that there are additional blocks which were moved (i.e. blocks X and Y).
Doing these steps for all actions, all action outcomes and all abstract states in
the value function V^k results in a set of abstract state-action pairs ⟨σ, a, Q⟩ representing
partial Q-values. The combination of partial Q-values into a Q-function
is again done by computing an overlap, in this case between states appearing in
the partial Q-function. Let us assume we have computed another partial Q-value
Q(σ₂, move(a,b)) = 3 for σ₂ ≡ on(a,Z), cl(b), on(c,d), on(e,f), now for the second
outcome of the action (with probability 0.1). Now, in the overlap state σ₃ ≡
σ₁ ∧ σ₂ ≡ on(a,Z), cl(b), on(c,d), on(e,f) we know that doing action move(a,b) in
state σ₃ has an expected value of γ · 0.9 · 10 + 3, the sum of the two partial Q-values. The combination of partial Q-values has
to be performed for all possible combinations of action outcomes, for all actions.
The last part of DTR is the maximization. In our example case, this is fairly easy,
since the natural subsumption order on abstract states does most of the work. One
can just sort the Q-function {⟨σ, a, Q⟩} and take for any state-action pair the first
rule that subsumes it. For efficiency reasons, one can remove rules in the Q-function
that are subsumed by a higher-valued rule. For other formalisms, the maximiza-
tion step sometimes involves an extra reasoning effort to maximize over different
variable substitutions. The whole procedure can be summarized in the following
algorithm, called first-order decision-theoretic regression (FODTR):
Require: an abstract value function V^n
1: for each action type A(X) do
2:   regress every abstract state in V^n through each probabilistic outcome of A(X), yielding partial Q-values
3:   combine the partial Q-values over the outcomes of A(X) into an abstract Q-function for A(X)
4: end for
5: maximize over all abstract Q-functions (and over variable substitutions) to obtain V^{n+1}
Fig. 8.6 Examples of structures in exact, first-order IDP. Both are concrete examples
using the logistics domain described in our experimental section. The left two figures are first-
order decision diagrams taken from (Wang et al, 2007). On the far left, the transition function
(or, truth value diagram) is depicted for Bin(B,C) under action choice unload(b∗ ,t∗ ). The
right diagram depicts the value function 1 , which turns out to be equivalent to 1unload . The
formulas on the right represent the final value partition for the logistics domain computed in
(Boutilier et al, 2001).
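The combination and maximization steps of DTR can likewise be sketched in code. The partial Q-rules and the generate-and-test subsumption check below are illustrative assumptions; real systems use more careful subsumption machinery.

```python
# An illustrative sketch of the combination and maximization steps of DTR.
# The partial Q-rules, the generate-and-test subsumption check and the
# example state are assumptions; real systems use more careful machinery.

def is_var(t):
    return isinstance(t, str) and t[0].isupper()

def match_atom(pattern, ground, subst):
    if pattern[0] != ground[0] or len(pattern) != len(ground):
        return None
    subst = dict(subst)
    for p, g in zip(pattern[1:], ground[1:]):
        if is_var(p):
            if subst.get(p, g) != g:
                return None
            subst[p] = g
        elif p != g:
            return None
    return subst

def subsumes(abstract_atoms, ground_atoms, subst=None):
    """True if some substitution maps every abstract atom into the ground state."""
    subst = {} if subst is None else subst
    if not abstract_atoms:
        return True
    first, rest = abstract_atoms[0], abstract_atoms[1:]
    return any(s is not None and subsumes(rest, ground_atoms, s)
               for s in (match_atom(first, g, subst) for g in ground_atoms))

# two partial Q-rules for move(a,b), one per action outcome
q1 = ([('on', 'a', 'Z'), ('cl', 'b')], ('move', 'a', 'b'), 8.1)
q2 = ([('on', 'a', 'Z'), ('cl', 'b'), ('on', 'e', 'f')], ('move', 'a', 'b'), 3.0)

# combination: the overlap of the two abstract states gets the summed value
combined = (q2[0], q2[1], q1[2] + q2[2])          # here q2's state is the overlap
q_function = sorted([q1, q2, combined], key=lambda rule: -rule[2])

# maximization by sorting: a ground pair gets the value of the first rule
# (i.e. the highest-valued one) that subsumes it
state = [('on', 'a', 'c'), ('cl', 'b'), ('on', 'e', 'f'), ('on', 'c', 'floor')]
for atoms, action, value in q_function:
    if action == ('move', 'a', 'b') and subsumes(atoms, state):
        print(value)                               # 11.1, from the combined rule
        break
```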
Four systems implement exact IDP for RMDPs. The first is the symbolic dynamic
programming (SDP) approach (Boutilier et al, 2001), based on the situation calculus
language (see Reiter, 2001). The second approach, FOVI (Hölldobler and
Skvortsova, 2004), is based on the fluent calculus (Thielscher, 1998). The third approach
is ReBel (Kersting et al, 2004). The most recent approach, FODD (Wang et al, 2007),
is based on first-order decision diagrams. Abstract states are represented by conjunctions of atoms in
ReBel, decision diagrams in FODD, and both SDP and FOVI use their underlying
fluent and situation calculus action specifications. Figure 8.6 shows some exam-
ples of representations used in SDP and FODD. Note how exact, but also how
complex, a simple value function in SDP becomes. FODD’s decision diagrams are
highly compact, but strictly less expressive than SDP state descriptions. Regression
is built-in as a main reasoning method in SDP and FOVI but for ReBel and FODD
specialized procedures were invented.
All four methods perform IDP in first-order domains, without first grounding
the domain, thereby computing solutions directly on an abstract level. On a slightly
more general level, FODTR can be seen as a means to perform (lifted) first-order
reasoning over decision-theoretic values, i.e. as a kind of decision-theoretic logic.
One can say that all four methods deduce optimal utilities of states (possibly in in-
finite state spaces) through FODTR, using the action definitions and domain theory
as a set of axioms.
Several extensions to the four systems have been described, for example, policy
extraction and the use of tabling (van Otterlo, 2009b), search-based exploration
and efficient subsumption tests (Karabaev et al, 2006), policy iteration (Wang and
Khardon, 2007), factored decomposition of first-order MDPs and additive reward
models (Sanner and Boutilier, 2007), and universally quantified goals (Sanner and
Boutilier, 2006).
Since exact, optimal value functions are complex to compute and store (and may
be even infinite, in case of an unlimited domain size), several works have devel-
oped approximate model-based algorithms for RMDPs. The first type of approach
starts from FODTR approaches and then approximates, and a second type uses other
means, for example sampling and planning.
The first-order approximate linear programming technique ( FOALP) (Sanner
and Boutilier, 2005) extends the SDP approach, transforming it into an approximate
value iteration (AVI) algorithm (Schuurmans and Patrascu, 2001). Instead of exactly
representing the complete value function, which can be large and fine-grained (and
because of that hard to compute), the authors use a fixed set of basis functions, com-
parable to abstract states. That is, a value function can be represented as a weighted
sum of k first-order basis functions each containing a small number of formulae
that provide a first-order abstraction (i.e. partition) of the state space. The backup
of a linear combination of such basis functions is simply the linear combination of
the FODTR of each basis function. Unlike exact solutions where value functions
can grow exponentially in size and where much effort goes into logical simplifica-
tion of formulas, this feature-based approach must only look for good weights for
the case statements. Related to FOALP, the first-order approximate policy iteration
algorithm ( FOAPI) (Sanner and Boutilier, 2006) is a first-order generalization of
approximate policy iteration for factored MDPs (e.g. see Guestrin et al, 2003b). It
uses the same basis function decomposition of value functions as the FOALP ap-
proach, and in addition, an explicit policy representation. It iterates between two
phases. In the first, the value function for the current policy is computed, i.e. the
weights of the basis functions are computed using LP. The second phase computes
the policy from the value function. Convergence is reached if the policy remains sta-
ble between successive approximations. Loss bounds for the converged policy gen-
eralize directly from the ones for factored MDPs. The PRM approach by Guestrin
also provides bounds on policy quality. These, however, are PAC-bounds obtained under
the assumption that the probability of domains falls off exponentially with their
size, whereas the FOALP bounds on policy quality apply equally to all domains. FOALP
was used for factored FOMDPs by Sanner and Boutilier (2007) and applied in the
SysAdmin domain. Both FOALP and FOAPI have been entered into the probabilistic
part of the international planning competition (IPPC). In a similar AVI framework
as FOALP, Wu and Givan (2007) describe a technique for generating first-order fea-
tures. A simple ILP algorithm employs a beam-search in the feature space, guided
by how well each feature correlates with the ideal Bellman residual.
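The core representational idea of FOALP, a value function as a weighted sum of first-order basis functions, can be sketched as follows; the basis features and weights are invented for illustration, and the LP-based weight computation described above is not shown.

```python
# A sketch of the weighted-basis-function representation used in FOALP-style
# approximate value iteration.  Each first-order basis function is rendered
# here as a Python predicate over ground states; the particular features and
# weights are invented, and the LP that computes the weights is not shown.

def exists_block_on(target):
    """Basis feature: there exists a block X with on(X, target)."""
    return lambda state: any(atom[0] == 'on' and atom[2] == target for atom in state)

def is_clear(target):
    """Basis feature: nothing is on target."""
    return lambda state: all(not (atom[0] == 'on' and atom[2] == target) for atom in state)

basis   = [exists_block_on('b'), is_clear('a'), is_clear('b')]   # k = 3 basis functions
weights = [-2.0, 1.5, 1.5]                                       # assumed learned weights

def value(state):
    """V(s) = sum_i w_i * b_i(s)."""
    return sum(w * (1.0 if feature(state) else 0.0)
               for w, feature in zip(weights, basis))

s = [('on', 'c', 'a'), ('on', 'a', 'floor'), ('on', 'b', 'floor')]
print(value(s))    # 1.5: only the is_clear('b') feature is true in s
```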
A different approach is the approximation to the SDP approach described by
Gretton and Thiébaux (2004a,b). The method uses the same basic setup as SDP, but
the FODTR procedure is only partially computed. By employing multi-step classi-
cal regression from the goal states, a number of structures is computed that represent
abstract states. The combination and maximization steps are not performed, but in-
stead the structures generated by regression are used as a hypothesis language in
the higher-order inductive tree-learner ALKEMY (Lloyd, 2003) to induce a tree
representing the value function.
A second type of approximate algorithm does not function at the level of ab-
straction (as in FODTR and extensions) but uses sampling and generalization in the
process of generating solutions. That is, one can first generate a solution for one
or more (small) ground instances (plans, value functions, policies), and then use
inductive generalization methods to obtain generalized solutions. This type of solu-
tion was pioneered by the work of Lecoeuche (2001) who used a solved instance of
an RMDP to obtain a generalized policy in a dialogue system. Two other methods
that use complete, ground RMDP solutions are based on value function general-
ization (Mausam and Weld, 2003) and policy generalization (Cocora et al, 2006).
Both approaches first use a general MDP solver and both use a relational deci-
sion tree algorithm to generalize solutions using relatively simple logical languages.
de la Rosa et al (2008) also use a relational decision tree in the ROLLER algo-
rithm to learn generalized policies from examples generated by a heuristic planner.
The Relational Envelope-Based Planning (REBP) approach (Gardiol and Kaelbling, 2003)
uses a representation of a limited part of the state space (the envelope), which is
gradually expanded by sampling just outside the envelope. The aim is to compute
a policy by first generating a good initial plan and then using envelope-growing to
improve the robustness of the plans incrementally. A more recent extension of the
method allows for the representation of the envelope using varying numbers of pred-
icates, such that the representational complexity can be gradually increased during
learning (Gardiol and Kaelbling, 2008).
In the following table we summarize the approaches in this section.
Table 8.1 Main model-based approaches that were discussed. Legend: BW=Blocks
World, Cnj=conjunction of logical atoms, Q=Q-learning,PS=prioritized sweeping, E=exact,
A=approximate, PP-IPC=planning problems from the planning contest (IPC), FOF=first-
order (relational) features, PRM=probabilistic relational model, FOL=first-order logic.
Model-free, value-based algorithms follow the general scheme

\[
\tilde{Q}^{0} \xrightarrow{\;S\;} \{\langle s,a,q\rangle\} \xrightarrow{\;I\;} \tilde{Q}^{1} \xrightarrow{\;S\;} \{\langle s,a,q\rangle\} \xrightarrow{\;I\;} \tilde{Q}^{2} \xrightarrow{\;S\;} \cdots
\]

Here, an initial abstract Q-function is used to get (S) biased learning samples from
the RMDP. The samples are then used to learn (or, induce (I)) a new Q-function
structure. Policies can be computed (or, deduced (D)) from the current Q-function.
A restricted variation on this scheme, discussed in the next section, is to fix the
logical abstraction level (e.g. Q̃) and only sample the RMDP to get good estimates
of the values (e.g. parameters of Q̃).
The CARCASS representation (van Otterlo, 2004) transforms the underlying RMDP into a much smaller abstract MDP, which can
then be solved by (modifications of) RL algorithms. For example, one abstract state
may model all situations where all blocks are stacked, in which case the only action
possible is to move the top block to the floor. In this case three abstract
states generalize over 13 RMDP states. The abstraction levels are exact aggregations
(i.e. partitions) of state-action spaces, and are closely related to averagers (Gordon,
1995). van Otterlo (2004) uses this in a Q-learning setting, and also in a model-based
fashion (prioritized sweeping, Moore and Atkeson, 1993) in which a transition
model between abstract states is learned as well. In essence, what is learned is the best
abstract policy among all policies present in the representation, assuming that this
policy space can be obtained from a domain expert or by other means.
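A minimal sketch of learning over such a fixed abstraction level is given below; the abstraction function, the abstract actions and the reward are invented, and only the idea of applying an ordinary Q-learning update at the abstract level is taken from the text.

```python
# A minimal sketch of Q-learning over a fixed abstraction level (invented
# abstraction function, abstract actions and reward); only the idea of
# applying an ordinary Q-learning update to abstracted states is taken from
# the text.

from collections import defaultdict

ALPHA, GAMMA = 0.2, 0.9
ABSTRACT_ACTIONS = ['unstack_top', 'stack_on_goal', 'noop']

def abstract(state):
    """Map a ground Blocks World state (block -> support) to an abstract state."""
    n_stacked = sum(1 for b in state if state[b] != 'floor')
    if n_stacked == len(state) - 1:
        return 'all_stacked'          # a single tower
    if n_stacked == 0:
        return 'all_on_floor'
    return 'partially_stacked'

Q = defaultdict(float)

def q_update(s, a, r, s_next):
    """Standard Q-learning update, applied at the abstract level."""
    z, z_next = abstract(s), abstract(s_next)
    best_next = max(Q[(z_next, a2)] for a2 in ABSTRACT_ACTIONS)
    Q[(z, a)] += ALPHA * (r + GAMMA * best_next - Q[(z, a)])

s      = {'a': 'b', 'b': 'c', 'c': 'floor'}        # tower c-b-a
s_next = {'a': 'floor', 'b': 'c', 'c': 'floor'}    # top block a moved to the floor
q_update(s, 'unstack_top', 1.0, s_next)            # reward chosen arbitrarily
print(Q[('all_stacked', 'unstack_top')])           # 0.2 after one update
```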
Two closely related approaches are LOMDPs by Kersting and De Raedt (2004)
and rQ-learning by Morales (2003). LOMDP abstraction levels are very similar to
CARCASS, and they use Q-learning and a logical TD(λ )-algorithm to learn state
values for abstract states. The rQ-framework is based on a separate definition of ab-
stract states (r-states) and abstract actions (r-actions). The product space of r-states
and r-actions induces a new (abstract) state-action space over which Q-learning can
be performed. All approaches can be used to learn optimal policies in domains
where prior knowledge exists. They can also be employed as part of other learn-
ing methods, e.g. to learn sub-policies in a given hierarchical policy. A related effort
based on Markov logic networks was reported by Wang et al (2008b).
Initial investigations into automatically generating the abstractions have been
reported (Song and Chen, 2007, 2008), and Morales (2004) employed behavioral
cloning to learn r-actions from sub-optimal traces, generated by a human expert.
AMBIL (Walker et al, 2007) too learns abstract state-action pairs from traces; given
an abstraction level, it estimates an approximate model (as in CARCASS) and uses
it to generate new abstract state-action pairs. In each value-learning iteration a new
representation is generated.
Table 8.2 Main model-free approaches with static generalization. Legend: BW=Blocks
World, Cnj=conjunction of logical atoms, Q=Q-learning, PS=prioritized sweeping, TD=TD-
learning, RF=relational features.
As said, the Q-RRL method (Džeroski et al, 1998) was the first approach towards
model-free RL in RMDPs. It is a straightforward combination of Q-learning and
ILP for generalization of Q-functions, and was tested on small deterministic Blocks
Worlds. Q-RRL collects experience in the form of state-action pairs with corre-
sponding Q-values. During an episode, actions are taken according to the current
policy, based on the current Q-tree. After each episode a decision tree is induced
from the example set (see Fig. 8.3(right)). Q-trees can employ background knowledge
(such as the number and heights of towers, or the number of blocks).
One problem with Q-functions, however, is that they implicitly encode a distance
to the goal, and they depend on the domain size in families of RMDPs. A Q-
function represents more information than needed for selecting an optimal action.
P-learning can be used to learn policies from the current Q-function and a training
set (Džeroski et al, 2001). For each state s occurring in the training set, all possible
actions in that state are evaluated and a P-value is computed as P(s,a) = 1 if
a = arg max_{a'} Q(s,a') and 0 otherwise. The P-tree represents the best policy relative
to that Q-tree. In general, it will be less complex and generalize better over domains
with different numbers of blocks. Independently, Lecoeuche (2001) showed
similar results. Cole et al (2003) use a similar setup as Q-RRL, but upgrade the
representation language to higher-order logic ( HOL) (Lloyd, 2003).
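The construction of P-values from a Q-function is simple enough to show directly; the states, actions and Q-values below are dummies, and only the rule P(s,a) = 1 iff a maximizes Q(s,·) follows the definition above (ties all receive a 1 here).

```python
# A small sketch of the P-learning construction: Q-value training examples
# are turned into policy (P-value) examples with P(s,a) = 1 iff a maximizes
# Q(s,.) (ties all receive 1 here).  States, actions and Q-values are dummies.

def p_examples(q_examples):
    """q_examples: dict state -> dict action -> Q-value."""
    p = {}
    for s, qs in q_examples.items():
        best = max(qs.values())
        p[s] = {a: (1 if q == best else 0) for a, q in qs.items()}
    return p

q_examples = {
    's1': {'move(a,b)': 0.81, 'move(b,floor)': 0.72},
    's2': {'move(c,floor)': 0.9, 'move(a,b)': 0.9, 'noop': 0.1},
}
print(p_examples(q_examples))
# {'s1': {'move(a,b)': 1, 'move(b,floor)': 0},
#  's2': {'move(c,floor)': 1, 'move(a,b)': 1, 'noop': 0}}
```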
An incremental extension to Q-RRL is the TG-algorithm (Driessens et al, 2001).
It can be seen as a relational extension of the G-algorithm (Chapman and Kaelbling,
1991), and it incrementally builds Q-trees. Each leaf in a tree is now augmented
with statistics about Q-values and the number of positive matches of examples. A
node is split when it has seen enough examples and a test on the node’s statistics
becomes significant with high confidence. This mechanism removes the need for
storing, retrieving and updating individual examples. Generating new tests is much
more complex than in the propositional case, because the number of possible splits
is essentially unlimited and grows further down the tree with the number of variables
introduced in earlier nodes.
In a subsequent upgrade TGR of TG, Ramon et al (2007) tackle the problem of
the irreversibility of the splits by adding a tree restructuring operation. This includes
leaf or subtree pruning, and internal node revision. To carry out these operations
statistics are now stored in all nodes of the tree. Special care must be taken with
variables when building and restructuring the tree.
has used restructuring operations from the start is the relational UTree ( rUTree)
algorithm by Dabney and McGovern (2007). rUTree is a relational extension of
the UTree algorithm by McCallum (1995). Because rUTree is instance-based, tests
can be regenerated when needed for a split such that statistics do not have to be
kept for all nodes, as in TGR. Another interesting aspect of rUTree is that it uses
stochastic sampling (similar to the approach by Walker et al (2004)) to cope with
the large number of possible tests when splitting a node. Combining these last two
aspects shows an interesting distinction with TG (and TGR). Whereas TG must
keep all statistics and consider all tests, rUTree considers only a limited, sampled
set of possible tests. In return, rUTree must often recompute statistics.
Finally, first-order XCS ( FOXCS) by Mellor (2008) is a learning classifier sys-
tem (e.g. see Lanzi, 2002) with relational rules. FOXCS' policy representation is
similar to that of CARCASS, but each rule is augmented with an accuracy, and
each time an action is required, all rules that cover the state-action pair under
consideration are taken into account. The accuracy can be used both for action
selection and for guiding the evolutionary discovery of new rules.
Instead of building logical abstractions several methods use other means for gener-
alization over states modeled as relational interpretations.
The relational instance based regression method ( RIB) by Driessens and Ramon
(2003) uses instance-based learning (Aha et al, 1991) on ground relational states.
The Q-function is represented by a set of well-chosen experienced examples. To
look-up the value of a newly encountered state-action pair, a distance is computed
between this pair and the stored pairs, and the Q-value of the new pair is computed
as an average of the Q-values of pairs that it resembles. Special care is needed to
maintain the right set of examples, by throwing away, updating and adding exam-
ples. Instance-based regression for Q-learning has been employed for propositional
representations before but the challenge in relational domains is defining a suitable
distance between two interpretations. For RIB, a domain-specific distance has to
be defined beforehand. For example, in Blocks World problems, the distance between
two states is computed by first renaming variables, then comparing the stacks
of blocks in the states, and finally using the edit distance (e.g. how many actions are
needed to get from one state to the other). Other background knowledge or declar-
ative bias is not used, as the representation consists solely of ground states and
actions. Garcı́a-Durán et al (2008) used a more general instance-based approach in
a policy-based algorithm. Katz et al (2008) use a relational instance-based algo-
rithm in a robotic manipulation task. Their similarity measure is defined in terms of
isomorphic subgraphs induced by the relational representation.
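The following sketch illustrates instance-based Q-regression in the spirit of RIB; the distance function is a crude stand-in for the domain-specific relational distances discussed above, and the stored examples are invented.

```python
# An illustrative sketch of instance-based Q-regression in the spirit of RIB:
# the Q-value of a new state-action pair is a distance-weighted average of
# the Q-values of stored pairs.  The distance function is a crude stand-in
# for the domain-specific relational distances discussed in the text.

def distance(pair_a, pair_b):
    """Symmetric difference of state atoms plus a penalty for different actions."""
    (state_a, action_a), (state_b, action_b) = pair_a, pair_b
    return len(state_a ^ state_b) + (0 if action_a == action_b else 5)

def predict(memory, query, k=3):
    """Distance-weighted average over the k nearest stored examples."""
    nearest = sorted(memory, key=lambda ex: distance(ex[0], query))[:k]
    weights = [1.0 / (1.0 + distance(ex[0], query)) for ex in nearest]
    return sum(w * q for w, (_, q) in zip(weights, nearest)) / sum(weights)

memory = [
    ((frozenset({('on', 'a', 'b'), ('cl', 'a')}), ('noop',)), 10.0),
    ((frozenset({('on', 'a', 'c'), ('cl', 'b')}), ('move', 'a', 'b')), 9.0),
    ((frozenset({('on', 'b', 'a'), ('cl', 'b')}), ('move', 'b', 'floor')), 7.2),
]
query = (frozenset({('on', 'a', 'c'), ('cl', 'b'), ('on', 'e', 'f')}),
         ('move', 'a', 'b'))
print(round(predict(memory, query), 2))   # about 8.8, dominated by the nearest pair
```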
Later, the methods TG and RIB were combined by Driessens and Džeroski
(2005), making use of the strong points of both methods. TG builds an explicit,
structural model of the value function and – in practice – can only build up a coarse
approximation. RIB is not dependent on a language bias and the instance-based
nature is better suited for regression, but it does suffer from large numbers of exam-
ples that have to be processed. The combined algorithm, TRENDI, builds up a tree
like TG but uses an instance-based representation in the leaves of the tree. Because
of this, new splitting criteria are needed, and the (language) biases of both TG and
RIB are needed for TRENDI. However, on deterministic Blocks World examples
the new algorithm performs better on some aspects (such as computation time) than
its parent techniques. Note that the rUTree algorithm is also a combination of an
instance-based representation combined with a logical abstraction level in the form
of a tree. No comparison has been reported yet. Rodrigues et al (2008) recently in-
vestigated the online behavior of RIB and their results indicate that the number of
instances in the system is decreased when per-sample updates are used.
Compared to RIB, Gärtner et al (2003) take a more principled approach in
the KBR algorithm to distances between relational states and use graph kernels
and Gaussian processes for value function approximation in relational RL. Each
state-action pair is represented as a graph and a product kernel is defined for this
class of graphs. The kernel is wrapped into a Gaussian radial basis function, which
can be tuned to regulate the amount of generalization.
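A hedged sketch of this kernel-based regression idea is given below: a trivial shared-edge count stands in for the product graph kernel, it is wrapped in a Gaussian RBF as described, and the value estimate is the posterior mean of Gaussian-process regression with a precomputed Gram matrix. All data are invented.

```python
# A hedged sketch of kernel-based value regression: a trivial shared-edge
# count stands in for the product graph kernel, it is wrapped in a Gaussian
# RBF, and the value estimate is the posterior mean of Gaussian-process
# regression with a precomputed Gram matrix.  All data are invented.

import numpy as np

def base_kernel(g1, g2):
    """Number of shared edges (a crude positive-semidefinite graph kernel)."""
    return float(len(set(g1) & set(g2)))

def rbf_wrap(g1, g2, sigma=1.0):
    """Gaussian RBF in the feature space induced by the base kernel."""
    sq_dist = base_kernel(g1, g1) + base_kernel(g2, g2) - 2.0 * base_kernel(g1, g2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# training state-action pairs encoded as edge sets, with observed returns
graphs = [
    {('a', 'b'), ('c', 'floor'), ('act', 'noop')},
    {('a', 'c'), ('b', 'floor'), ('act', 'move_a_b')},
    {('b', 'a'), ('c', 'floor'), ('act', 'move_b_floor')},
]
y = np.array([10.0, 9.0, 7.2])

K = np.array([[rbf_wrap(g, h) for h in graphs] for g in graphs])
alpha = np.linalg.solve(K + 1e-2 * np.eye(len(graphs)), y)   # noisy GP regression

query = {('a', 'c'), ('b', 'floor'), ('e', 'f'), ('act', 'move_a_b')}
k_star = np.array([rbf_wrap(query, g) for g in graphs])
print(float(k_star @ alpha))     # GP posterior mean for the query pair
```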
Probabilistic Approaches
Two additional, structurally adaptive, algorithms learn and use probabilistic infor-
mation about the environment to optimize behavior, yet for different purposes.
SVRRL (Sanner, 2005) targets undiscounted, finite-horizon domains in which
there is a single terminal reward. This enables viewing the value function as a prob-
ability of success, such that it can be represented as a relational naive Bayes net-
work. The structure and the parameters of this network are learned simultaneously.
The parameters can be computed using standard maximum-likelihood techniques.
Two structure-learning approaches are described for SVRRL; in both,
relational features (ground relational atoms) can be combined into joint features if
they are more informative than the independent features' estimates. An extension,
DM-SVRRL (Sanner, 2006), uses data mining to focus structure learning on only those parts of the
state space that are frequently visited; it finds frequently co-occurring
features and turns them into joint features, which can later be used to build even
larger features.
SVRRL is not based on TD-learning, but on probabilistic reasoning. MARLIE
(Croonenborghs et al, 2007b) too uses probabilistic techniques, but employs it for
transition model learning. It is one of the very few relational model-based RL algo-
rithms (next to CARCASS). It learns how ground relational atoms change from one
state to another. For each predicate a probability tree is learned incrementally using a modified
TG algorithm. Such trees represent, for each ground instance of a predicate, the
probability that it will be true in the next state, given the current state and action.
Using the model amounts to looking ahead a few steps into the future using an existing
technique called sparse sampling. The original TG algorithm is used to store the
Q-value function.
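The way such a learned model can be exploited by sparse sampling can be sketched as follows; the corridor "model" is a placeholder for MARLIE's incremental probability trees, and the sampling width and depth are arbitrary.

```python
# A hedged sketch of combining a learned transition model with sparse
# sampling: Q(s,a) is estimated by sampling a few successor states from the
# model and recursing to a small depth.  The corridor "model" below is a
# placeholder for MARLIE's incremental probability trees.

import random

GAMMA = 0.9

def sparse_sampling_q(model, reward, actions, s, a, depth=2, width=3):
    """Estimate Q(s,a) with a width-C, depth-H sparse-sampling lookahead."""
    if depth == 0:
        return 0.0
    total = 0.0
    for _ in range(width):
        s_next = model(s, a)                         # sample the learned model
        best_next = max(sparse_sampling_q(model, reward, actions,
                                          s_next, a2, depth - 1, width)
                        for a2 in actions)
        total += reward(s_next) + GAMMA * best_next
    return total / width

def model(s, a):                                     # toy stochastic corridor
    step = 1 if a == 'right' else -1
    return max(0, min(5, s + step)) if random.random() < 0.9 else s

reward = lambda s: 1.0 if s == 5 else 0.0
print(sparse_sampling_q(model, reward, ['left', 'right'], 4, 'right'))
```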
Table 8.3 summarizes some of the main characteristics. Detailed (experimental)
comparison between the methods is still a task to be accomplished. A crucial aspect
of the methods is the combination of representation and behavior learning. Fixed ab-
straction levels can provide convergence guarantees but are not flexible. Incremental
algorithms such as TG and restructuring approaches such as rUTree and TGR provide
increasingly flexible function approximators, at the cost of increased computational
complexity and extensive bookkeeping.
An important aspect in all approaches is the representation used for examples,
and how examples are generalized into abstractions. Those based on logical abstrac-
tions have a number of significant advantages: i) they generalize more easily
over problems of different domain size through the use of variables, and ii) abstractions
are usually more comprehensible and transferable. On the other hand, logical
abstractions are less suitable for fine-grained regression. Methods such as TG have
severe difficulties with some very simple Blocks World problems. Highly relational
problems such as the Blocks World require complex patterns in their value functions
Table 8.3 Main model-free approaches with adaptive generalization that were discussed. Leg-
end: BW=Blocks World, IB=instance based, NB=naive Bayes, HOL=higher-order logic.
and learning these in a typical RL learning process is difficult. Other methods that
base their estimates (in part) on instance-based representations, kernels or first-order
features are more suitable because they can provide more smooth approximations
of the value function.
Policy-based approaches have a simple structure: one starts with a policy structure
Π̃^0, generates samples (S) by interaction with the underlying RMDP, induces (I)
a new abstract policy Π̃^1, and so on. The general structure is the following:
\[
\tilde{\Pi}^{0} \xrightarrow{\;S\;} \{\langle s,a,q\rangle\} \xrightarrow{\;I\;} \tilde{\Pi}^{1} \xrightarrow{\;S\;} \{\langle s,a,q\rangle\} \xrightarrow{\;I\;} \tilde{\Pi}^{2} \xrightarrow{\;S\;} \cdots
\]
The representation used is usually the decision-rule-like structure we have presented
before. An important difference with value learning is that no explicit representa-
tions of Q̃ are required. At each iteration of these approximate policy iteration algo-
rithms, the current policy is used to gather useful learning experience – which can
be samples of state-action pairs, or the amount of reward gathered by that policy –
which is then used to generate a new policy structure.
The first type of policy-based approaches are evolutionary approaches, which have
been used in propositional RL before (Moriarty et al, 1999). Distinct features of
these approaches are that they usually maintain a population (i.e. a set) of policy
structures, and that they assign a single-valued fitness to each policy (or policy rule)
based on how it performs on the problem. The fitness is used to combine or modify
policies, thereby searching directly in policy space.
The Grey system (Muller and van Otterlo, 2005) employs simple relational deci-
sion list policies to evolve Blocks World policies. GAPI (van Otterlo and De Vuyst,
2009) is a similar approach based on genetic algorithms, and evolves probabilis-
tic relational policies. Gearhart (2003) employs a related technique (genetic pro-
gramming (GP)) in the real-time strategy FreeCraft domain used by Guestrin et al
(2003a). Results show that it compares well to Guestrin et al’s approach, though it
has difficulties with rarely occurring states. Castilho et al (2004) focus on STRIPS
planning problems, but unlike Grey for example, each chromosome encodes a full
plan, meaning that the approach searches in plan space. Both Kochenderfer (2003)
and Levine and Humphreys (2003) do search in policy space, and both use a GP
algorithm. Levine and Humphreys learn decision list policies from optimal plans
generated by a planner, which are then used in a policy restricted planner. Kochen-
derfer allows for hierarchical structure in the policies, by simultaneously evolving
sub-policies that can call each other. Finally, Baum (1999) describes the Hayek
machines that use evolutionary methods to learn policies for Blocks Worlds. An
additional approach (FOXCS) was discussed in a previous section.
A second type of approach uses ILP algorithms to learn the structure of the policy
from sampled state-action pairs. This essentially transforms the RL process into a
sequence of supervised (classification) learning tasks in which an abstract policy is
repeatedly induced from a (biased) set of state-action pairs sampled using the pre-
vious policy structure. The challenge is to get good samples either by getting them
from optimal traces (e.g. generated by a human expert, or a planning algorithm), or
by smart trajectory sampling from the current policy. Both types of approaches are
PIAGET-3, combining structure and parameter learning in a single algorithm.
The model-based approach by Yoon et al (2002) induces policies from optimal
traces generated by a planner. The algorithm can be viewed as an extension of
the work by Martin and Geffner (2000) and Khardon (1999) to stochastic domains.
Khardon (1999) studied the induction of deterministic policies for undiscounted,
goal-based planning domains, and proved general PAC-bounds on the number of
samples needed to obtain policies of a certain quality.
The LRW-API approach by Fern et al (2006) unifies, and extends, the afore-
mentioned approaches into one practical algorithm. LRW-API is based on a con-
cept language (taxonomic syntax), similar to Martin and Geffner (2000)’s approach,
and targeted at complex, probabilistic planning domains, as is Yoon et al (2002)’s
approach. LWR-API shares its main idea of iteratively inducing policy structures
(i.e. approximate policy iteration, API) and using the current policy to bias the
generation of samples to induce an improved policy. Two main improvements of
LRW-API relative to earlier approaches lie in the sampling process of examples,
and in the bootstrapping process. Concerning the first, LRW-API uses policy rollout
(Boyan and Moore, 1995) to sample the current policy. That is, it estimates all action
values for the current policy for a state s by drawing w trajectories of length h,
where each trajectory is the result of starting in state s, doing action a, and following the
policy for h − 1 more steps. Note that this requires a simulator that can be sampled
from any state, at any moment in time. The sampling width w and horizon h are
parameters that trade off variance against computation time. A second main improve-
ment of LRW-API is the bootstrapping process, which amounts here to learning
from random worlds ( LRW). The idea is to learn complex problems by first starting
on simple problems and then iteratively solving more and more complex problem
instances. Each problem instance is generated by a random walk of length n through
the underlying RMDP, and by increasing n problems become more complex.
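The rollout estimator used for sampling can be written down compactly; the chain-world simulator and the weak random base policy below are placeholders, and only the w-trajectory, horizon-h estimate of Q^π(s,a) follows the description above.

```python
# A sketch of the policy-rollout estimator used to sample training data in
# LRW-API-style approximate policy iteration.  The chain-world simulator and
# the weak random base policy are placeholders; only the w-trajectory,
# horizon-h estimate of Q^pi(s,a) follows the description in the text.

import random

GAMMA = 0.95

def rollout_q(simulate, policy, s, a, w=20, h=15):
    """Monte-Carlo estimate of Q^pi(s,a) from w trajectories of length h."""
    total = 0.0
    for _ in range(w):
        state, ret, discount, act = s, 0.0, 1.0, a   # first step: the queried action
        for _ in range(h):
            state, reward = simulate(state, act)
            ret += discount * reward
            discount *= GAMMA
            act = policy(state)                      # afterwards: follow the policy
        total += ret
    return total / w

def rollout_greedy_action(simulate, policy, actions, s, **kw):
    """The induction target: the rollout-greedy action in state s."""
    return max(actions, key=lambda a: rollout_q(simulate, policy, s, a, **kw))

def simulate(state, action):                         # toy chain MDP
    state = min(state + 1, 5) if action == 'right' else max(state - 1, 0)
    return state, (1.0 if state == 5 else 0.0)

policy = lambda s: random.choice(['left', 'right'])  # weak base policy
print(rollout_greedy_action(simulate, policy, ['left', 'right'], 2))  # usually 'right'
```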
The previous two parts of this chapter have surveyed relational upgrades of tra-
ditional model-free and model-based algorithms. In this section we take a look at
relational techniques used for other aspects of the RMDP framework. We distin-
guish three groups: i) those that learn (probabilistic) models of the RMDP, ii) those
that impose structure on policies, and iii) those that generally bias the learner.
Table 8.4 Main model-free policy-based approaches that were discussed. Legend:
BW=Blocks World, PG=policy gradient, Q=Q-learning, PS=prioritized sweeping, TD=TD-
learning, DL=decision list, GD=gradient descent.
Learning world models is one of the most useful things an agent can do. Transition
models embody knowledge about the environment that can be exploited in various
ways. Such models can be used for more efficient RL algorithms, for model-based
DP algorithms, and furthermore, they can often be transferred to other, similar en-
vironments. There are several approaches that learn general operator models from
interaction. All model-based approaches in Section 8.3, by contrast, take for
granted that a complete, logical model is available.
Usually, model learning amounts to learning general operator descriptions, such
as the STRIPS rules earlier in this text. However, simpler models can already be
very useful and we have seen examples such as the partial models used in MARLIE
and the abstract models learned for CARCASS. Another example is the more spe-
cialized action model learning employed by Morales (2004) who learns r-actions us-
ing behavioral cloning. A recent approach by Halbritter and Geibel (2007) is based
on graph kernels, which can be used to store transition models without the use of
logical abstractions such as used in most action formalisms. A related approach in
robotics by Mourão et al (2008) is based on kernel perceptrons but is restricted to
learning (deterministic) effects of STRIPS-like actions.
Learning aspects of ( STRIPS) operators from (planning) data is an old prob-
lem (e.g. see Vere, 1977) and several older works give partial solutions (e.g. only
learning effects of actions). However, learning full models, with the added difficulty
of probabilistic outcomes, is complex since it involves learning both logical struc-
tures and parameters from data. Remember the probabilistic STRIPS action we have
defined earlier. This move action has two different, probabilistic outcomes. Learn-
ing such a description from data involves a number of aspects. First, the logical
descriptions of the pre- and post-conditions have to be learned from data. Sec-
ond, the learning algorithm has to infer how many outcomes an action has. Third,
probabilities must be estimated for each outcome of the action. For the relational
case, early approaches by Gil (1994) and Wang (1995) learn deterministic operator
descriptions from data, by interaction with simulated worlds. More recent work also
targets incomplete state information (Wu et al, 2005; Zhuo et al, 2007), or considers
sample complexity aspects (Walsh and Littman, 2008).
For general RMDPs though, we need probabilistic models. In the propositional
setting, the earliest learning approach was described by Oates and Cohen (1996).
For the relational models in the context of RMDPs the first approach was reported
by Pasula et al (2004), using a three-step greedy search approach. First, a search is
performed through the set of rule sets using standard ILP operators. Second, it finds
the best set of outcomes, given a context and an action. Third, it learns a probability
distribution over sets of outcomes. The learning process is supervised as it requires
a dataset of state-action-state pairs taken from the domain. As a consequence, the
rules are only valid on this set, and care has to be taken that it is representative of the
domain. Experiments on Blocks Worlds and logistics domains show the robustness
of the approach. The approach was later extended by Zettlemoyer et al (2005) who
added noise outcomes, i.e. outcomes which are difficult to model exactly, but which
do happen (like knocking over a blocks tower and scattering all blocks on the floor).
Another approach was introduced by Safaei and Ghassem-Sani (2007), which works
incrementally and combines planning and learning.
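The parameter-estimation part of this learning problem, estimating a probability distribution over the outcomes of an action from (state, action, next state) data, can be sketched as follows; the outcome classifier and the toy transitions are assumptions, and the ILP search over rule structures is not shown.

```python
# A small sketch of the parameter-estimation part of learning probabilistic
# action models: given (state, next_state) data for one action and a way to
# classify each transition into an outcome, the outcome probabilities are
# the maximum-likelihood (relative-frequency) estimates.  The outcome labels
# and toy data are assumptions; the ILP search over rule structure is not shown.

from collections import Counter

def outcome_of(s, s_next, moved_block, target):
    """Classify one move(moved_block, target) transition (assumed labels)."""
    if s_next.get(moved_block) == target:
        return 'success'
    if s_next == s:
        return 'no_change'
    return 'noise'

def estimate_outcome_probs(transitions, moved_block, target):
    counts = Counter(outcome_of(s, s2, moved_block, target) for s, s2 in transitions)
    n = sum(counts.values())
    return {outcome: c / n for outcome, c in counts.items()}

s0 = {'a': 'c', 'b': 'floor', 'c': 'floor'}
data = [
    (s0, {'a': 'b', 'b': 'floor', 'c': 'floor'}),      # a ended up on b
    (s0, {'a': 'b', 'b': 'floor', 'c': 'floor'}),
    (s0, {'a': 'c', 'b': 'floor', 'c': 'floor'}),      # nothing changed
    (s0, {'a': 'floor', 'b': 'floor', 'c': 'floor'}),  # a fell to the floor
]
print(estimate_outcome_probs(data, 'a', 'b'))
# {'success': 0.5, 'no_change': 0.25, 'noise': 0.25}
```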
Hierarchical approaches can be naturally incorporated into relational RL, yet not
many techniques have been reported so far, although some cognitive architectures
(and sapient agents, cf. van Otterlo et al, 2007) share these aspects. An advantage
of relational HRL is that parameterizations of sub-policies and goals naturally arise,
through logical variables. For example, a Blocks World task such as on(X,Y), where
X and Y can be instantiated using any two blocks, can be decomposed into two
tasks. First, all blocks must be removed from X and Y and then X and Y should be
moved on top of each other. Note that the first task also consists of two subtasks,
supporting even further decomposition. Now, by first learning policies for each of
these subtasks, the individual learning problems for each of these subtasks are much
simpler and the resulting skills might be reused. Furthermore, learning such subtasks
can be done by any of the model-free algorithms in Section 8.4. Depending on the
representation that is used, policies can be structured into hierarchies, facilitating
learning in more complex problems.
Simple forms of options are straightforward to model relationally (see Croo-
nenborghs et al, 2007a, for initial ideas). Driessens and Blockeel (2001) presented
an approach based on the Q-RRL algorithm to learn two goals simultaneously,
which can be considered a simple form of hierarchical decomposition. Aycenina
(2002) builds a complete hierarchical system on top of the original Q-RRL algorithm. A number of
subgoals is given and separate policies are learned to achieve them. When learning
a more complex task, instantiated sub-policies can be used as new actions. A related
system by Roncagliolo and Tadepalli (2004) uses batch learning on a set of ex-
amples to learn values for a given relational hierarchy. Andersen (2005) presents
the most thorough investigation using MAXQ hierarchies adapted to relational
representations. The work uses the Q-RRL framework to induce local Q-trees and
P-trees, based on a manually constructed hierarchy of subgoals. A shortcoming of
hierarchical relational systems is still that they assume the hierarchy is given before-
hand. Two related systems combine planning and learning, which can be considered
as generating hierarchical abstractions by planning, and using RL to learn concrete
policies for behaviors. The approach by Ryan (2002) uses a planner to build a high-
level task hierarchy, after which RL is used to learn the sub-policies. The method
by Grounds and Kudenko (2005) is based on similar ideas. Also in cognitive archi-
tectures, not much work has been reported yet, with notable exceptions of model-
learning in Icarus (Shapiro and Langley, 2002) and RL in SOAR (Nason and Laird,
2004).
Hierarchical learning can be seen as first decomposing a policy, and then
learning. Multi-agent (Wooldridge, 2002) decompositions are also possible. Letia
and Precup (2001) report on multiple agents, modeled as independent reinforce-
ment learners who do not communicate, but act in the same environment. Programs
specify initial plans and knowledge about the environment, complex actions
induce semi-MDPs, and learning is performed by model-free RL methods based
on options. A similar approach by Hernandez et al (2004) is based on the belief-
desires-intentions model. Finzi and Lukasiewicz (2004a) introduce GTGolog, a
game-theoretic language, which integrates explicit agent programming with game-
theoretic multi-agent planning in Markov Games. Along this direction Finzi and
Lukasiewicz (2004b) introduce relational Markov Games which can be used to ab-
stract over multi-agent RMDPs and to compute Nash policy pairs.
Bias
Solution algorithms for RMDPs can be helped in many ways. A basic distinction is
between helping the solution of the current problem (bias, guidance), and helping
the solution of related problems (transfer).
Concerning the first, one idea is to supply a policy to the learner that can guide
it towards useful areas in the state space (Driessens and Džeroski, 2002) by gener-
ating semi-optimal traces of behavior that can be used as experience by the learner.
Related to guidance is the usage of behavioral cloning based on human generated
traces, for example by (Morales, 2004) and Cocora et al (2006), the use of optimal
plans by Yoon et al (2002), or the random walk approach by Fern et al (2006), which
uses a guided exploration of domain instantiations. In the latter, first only easy in-
stantiations are generated and the difficulty of the problems is increased in accor-
dance with the current policy’s quality. Domain sampling was also used by Guestrin
et al (2003a). Another way to help the learner is by structuring the domain itself, for
example by imposing a strong topological structure (Lane and Wilson, 2005). Since
many of these guidance techniques are essentially representation-independent, there
are many more existing (propositional) algorithms to be used for RMDPs.
A second way to help the learner, is by transferring knowledge from other tasks.
Transfer learning – in a nutshell – is leveraging learned knowledge on a source
task to improve learning on a related, but different target task. It is particularly
appropriate for learning agents that are meant to persist over time, changing flexibly
among tasks and environments. Rather than having to learn each task from scratch,
the goal is to take advantage of its past experience to speed up learning (see Stone,
2007). Transfer is highly representation-dependent, and the use of (declarative) FOL
formalisms in relational RL offers good opportunities.
In the more restricted setting of (relational) RL there are several possibilities. For
example, the source and target problems can differ in the goal that must be reached,
but possibly the transition model and reward model can be transferred. Sometimes
a specific state abstraction can be transferred between problems (e.g. Walsh et al,
2006). Or, possibly some knowledge about actions can be transferred, although they
can have slightly different effects in the source and target tasks. Sometimes a com-
plete policy can be transferred, for example a stacking policy for a Blocks World
can be transferred between worlds of varying size. Another possibility is to transfer
solutions to subproblems, e.g. a sub-policy. Very often in relational domains, trans-
fer is possible by the intrinsic nature of relational representations alone. That is, in
Blocks Worlds, in many cases a policy that is learned for a task with n blocks will
work for m > n blocks.
So far, a few works have addressed slightly more general notions of trans-
fer in relational RL. Several approaches are based on transfer of learned skills in
hierarchical decompositions, e.g. (Croonenborghs et al, 2007a) and (Torrey et al,
2007), and an extension of the latter using Markov logic networks (Torrey et al,
2008). Explicit investigations into transfer learning were based on methods we
have discussed such as in rTD (Stracuzzi and Asgharbeygi, 2006) and the work by
García-Durán et al (2008) on transferring instance-based policies in deterministic
planning domains.
In this chapter so far, we have discussed an important selection of the methods de-
scribed in (van Otterlo, 2009b). In this last section we briefly discuss some recent
approaches and trends in solution techniques for RMDPs that appeared very re-
cently. These topics also mark at least two areas with lots of open problems and
potential: general logical-probabilistic engines, and partially observable problems.
Relational POMDPs
The area of relational POMDPs is almost unexplored so far, and in (van Otterlo, 2009b) we mentioned some
approaches that have gone down this road.
Wingate et al (2007) present the first steps towards relational KR in predictive
representations of states (Littman et al, 2001). Although the representation is still
essentially propositional, they do capture some of Blocks World structure in a much
richer framework than MDPs. Zhao and Doshi (2007) introduce a semi-Markov ex-
tension to the situation calculus approach in SDP in the Haley system for web-based
services. Although no algorithm for solving the induced first-order SMDP is given,
the approach clearly shows that the formalization can capture useful temporal struc-
tures. Gretton (2007a,b) computes temporally extended policies for domains with
non-Markovian rewards using a policy gradient approach, which we have discussed
in the previous chapter.
The first contribution to the solution of first-order POMDPs is given by Wang
(2007). Although modeling POMDPs using FOL formalisms has been done before
(e.g. see Geffner and Bonet, 1998; Wang and Schmolze, 2005), Wang is the first to
upgrade an existing POMDP solution algorithm to the first-order case. It takes the
FODD formalism we have discussed earlier and extends it to model observations,
conditioned on states. Based on the clear connections between regression-based
backups in IDP algorithms and value backups over belief states, Wang upgrades
the incremental pruning algorithm (see Kaelbling et al, 1998) to the first-order case.
Recently some additional efforts into relational POMDPs have been described,
for example by Lison (2010) in dialogue systems, and by both Wang and Khardon
(2010) and Sanner and Kersting (2010) who describe basic algorithms for such
POMDPs along the lines of IDP.
In addition to inference engines and POMDPs, other methods have appeared re-
cently. Earlier in this chapter we already mentioned the evolutionary technique GAPI
(van Otterlo and De Vuyst, 2009), and furthermore Neruda and Slusny (2009) pro-
vide a performance comparison between relational RL and evolutionary techniques.
New techniques for learning models include instance based techniques for de-
terministic action models in the SOAR cognitive architecture (Xu and Laird, 2010),
and incremental learning of action models in the context of noise by Rodrigues et al
(2010). Both represent new steps towards learning transition models that can be
used for planning and FODTR. A related technique by Vargas-Govea and Morales
(2009) learns grammars for sequences of actions, based on sequences of low-level
sensor readings in a robotic context.
In fact, domains such as robotics are interesting application areas for relational
RL. Both Vargas and Morales (2008) and Hernández and Morales (2010) apply re-
lational techniques for navigation purposes in a robotic domain. The first approach
learns teleo-reactive programs from traces using behavioral cloning. The second
combines relational RL and continuous actions. Another interesting area for rela-
tional RL is computer vision, and initial work in this direction is reported by Häming
and Peters (2009), who apply it to object recognition.
Just like in (van Otterlo, 2009b) we also pay attention to recent PhD theses
that were written on topics in relational RL. Five more theses have appeared, on
model-assisted approaches by Croonenborghs (2009), on transfer learning by Tor-
rey (2009), on object-oriented representations in RL by Diuk (2010), on efficient
model learning by Walsh (2010) and on FODTR by Joshi (2010).
In this chapter we have surveyed the field of relational RL, summarizing the main
techniques discussed in (van Otterlo, 2009b). We have discussed large subfields
such as model-based and model-free algorithms, and in addition hierarchies, mod-
els, POMDPs and much more.
There are many future directions for relational RL. For one, it can be noticed
that not many techniques make use of efficient data structures. Some model-based
techniques use binary (or algebraic) decision diagrams, but many other methods could
benefit from such structures, since relational RL is computationally demanding. The use of more
efficient inference engines has started with the development of languages such as
DT-Problog, but it is expected that in the next few years more efforts along these lines
will be developed in the field of SRL. In the same direction, POMDP techniques
and probabilistic model learning can be explored too from the viewpoint of SRL.
In terms of application areas, robotics, computer vision, and web and social net-
works may yield interesting problems. The connection to manipulation and navigation
in robotics is easily made, yet not many relational RL techniques have been
applied there so far. Especially the grounding and anchoring of sensor data in relational
representations is a hard and largely unsolved problem.
References
Aha, D., Kibler, D., Albert, M.: Instance-based learning algorithms. Machine Learning 6(1),
37–66 (1991)
Alpaydin, E.: Introduction to Machine Learning. The MIT Press, Cambridge (2004)
Andersen, C.C.S.: Hierarchical relational reinforcement learning. Master’s thesis, Aalborg
University, Denmark (2005)
Asgharbeygi, N., Stracuzzi, D.J., Langley, P.: Relational temporal difference learning. In:
Proceedings of the International Conference on Machine Learning (ICML), pp. 49–56
(2006)
Aycenina, M.: Hierarchical relational reinforcement learning. In: Stanford Doctoral Sympo-
sium (2002) (unpublished)
Baum, E.B.: Toward a model of intelligence as an economy of agents. Machine Learn-
ing 35(2), 155–185 (1999)
Baum, E.B.: What is Thought? The MIT Press, Cambridge (2004)
Bergadano, F., Gunetti, D.: Inductive Logic Programming: From Machine Learning to Soft-
ware Engineering. The MIT Press, Cambridge (1995)
Diuk, C.: An object-oriented representation for efficient reinforcement learning. PhD thesis,
Rutgers University, Computer Science Department (2010)
Diuk, C., Cohen, A., Littman, M.L.: An object-oriented representation for efficient rein-
forcement learning. In: Proceedings of the International Conference on Machine Learning
(ICML) (2008)
Driessens, K., Blockeel, H.: Learning Digger using hierarchical reinforcement learning for
concurrent goals. In: Proceedings of the European Workshop on Reinforcement Learning,
EWRL (2001)
Driessens, K., Džeroski, S.: Integrating experimentation and guidance in relational reinforce-
ment learning. In: Proceedings of the Nineteenth International Conference on Machine
Learning, pp. 115–122 (2002)
Driessens, K., Džeroski, S.: Combining model-based and instance-based learning for first
order regression. In: Proceedings of the International Conference on Machine Learning
(ICML), pp. 193–200 (2005)
Driessens, K., Ramon, J.: Relational instance based regression for relational reinforcement
learning. In: Proceedings of the International Conference on Machine Learning (ICML),
pp. 123–130 (2003)
Driessens, K., Ramon, J., Blockeel, H.: Speeding Up Relational Reinforcement Learning
Through the Use of an Incremental First Order Decision Tree Learner. In: Flach, P.A., De
Raedt, L. (eds.) ECML 2001. LNCS (LNAI), vol. 2167, pp. 97–108. Springer, Heidelberg
(2001)
Džeroski, S., De Raedt, L., Blockeel, H.: Relational reinforcement learning. In: Shavlik, J.
(ed.) Proceedings of the International Conference on Machine Learning (ICML), pp. 136–
143 (1998)
Džeroski, S., De Raedt, L., Driessens, K.: Relational reinforcement learning. Machine Learn-
ing 43, 7–52 (2001)
Feng, Z., Dearden, R.W., Meuleau, N., Washington, R.: Dynamic programming for structured
continuous Markov decision problems. In: Proceedings of the Conference on Uncertainty
in Artificial Intelligence (UAI), pp. 154–161 (2004)
Fern, A., Yoon, S.W., Givan, R.: Approximate policy iteration with a policy language bias:
Solving relational markov decision processes. Journal of Artificial Intelligence Research
(JAIR) 25, 75–118 (2006); special issue on the International Planning Competition 2004
Fern, A., Yoon, S.W., Givan, R.: Reinforcement learning in relational domains: A policy-
language approach. The MIT Press, Cambridge (2007)
Fikes, R.E., Nilsson, N.J.: STRIPS: A new approach to the application of theorem proving to
problem solving. Artificial Intelligence 2(2) (1971)
Finney, S., Gardiol, N.H., Kaelbling, L.P., Oates, T.: The thing that we tried didn't work very
well: Deictic representations in reinforcement learning. In: Proceedings of the Conference
on Uncertainty in Artificial Intelligence (UAI), pp. 154–161 (2002)
Finzi, A., Lukasiewicz, T.: Game-theoretic agent programming in Golog. In: Proceedings of
the European Conference on Artificial Intelligence (ECAI) (2004a)
Finzi, A., Lukasiewicz, T.: Relational Markov Games. In: Alferes, J.J., Leite, J. (eds.) JELIA
2004. LNCS (LNAI), vol. 3229, pp. 320–333. Springer, Heidelberg (2004b)
García-Durán, R., Fernández, F., Borrajo, D.: Learning and transferring relational instance-
based policies. In: Proceedings of the AAAI-2008 Workshop on Transfer Learning for
Complex Tasks (2008)
Gardiol, N.H., Kaelbling, L.P.: Envelope-based planning in relational MDPs. In: Proceedings
of the Neural Information Processing Conference (NIPS) (2003)
Gardiol, N.H., Kaelbling, L.P.: Adaptive envelope MDPs for relational equivalence-based
planning. Tech. Rep. MIT-CSAIL-TR-2008-050, MIT CS & AI Lab, Cambridge, MA
(2008)
Gärtner, T., Driessens, K., Ramon, J.: Graph kernels and Gaussian processes for relational re-
inforcement learning. In: Proceedings of the International Conference on Inductive Logic
Programming (ILP) (2003)
Gearhart, C.: Genetic programming as policy search in Markov decision processes. In: Ge-
netic Algorithms and Genetic Programming at Stanford, pp. 61–67 (2003)
Geffner, H., Bonet, B.: High-level planning and control with incomplete information using
pomdps. In: Proceedings Fall AAAI Symposium on Cognitive Robotics (1998)
Gil, Y.: Learning by experimentation: Incremental refinement of incomplete planning do-
mains. In: Proceedings of the International Conference on Machine Learning (ICML)
(1994)
Gordon, G.J.: Stable function approximation in dynamic programming. In: Proceedings of
the International Conference on Machine Learning (ICML), pp. 261–268 (1995)
Gretton, C.: Gradient-based relational reinforcement-learning of temporally extended poli-
cies. In: Proceedings of the International Conference on Artificial Intelligence Planning
Systems (ICAPS) (2007a)
Gretton, C.: Gradient-based relational reinforcement learning of temporally extended poli-
cies. In: Workshop on Artificial Intelligence Planning and Learning at the International
Conference on Automated Planning Systems (2007b)
Gretton, C., Thiébaux, S.: Exploiting first-order regression in inductive policy selection. In:
Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pp. 217–
225 (2004a)
Gretton, C., Thiébaux, S.: Exploiting first-order regression in inductive policy selection (ex-
tended abstract). In: Proceedings of the Workshop on Relational Reinforcement Learning
at ICML 2004 (2004b)
Groote, J.F., Tveretina, O.: Binary decision diagrams for first-order predicate logic. The Jour-
nal of Logic and Algebraic Programming 57, 1–22 (2003)
Grounds, M., Kudenko, D.: Combining Reinforcement Learning with Symbolic Planning. In:
Tuyls, K., Nowe, A., Guessoum, Z., Kudenko, D. (eds.) ALAMAS 2005, ALAMAS 2006,
and ALAMAS 2007. LNCS (LNAI), vol. 4865, pp. 75–86. Springer, Heidelberg (2008)
Guestrin, C.: Planning under uncertainty in complex structured environments. PhD thesis,
Computer Science Department, Stanford University (2003)
Guestrin, C., Koller, D., Gearhart, C., Kanodia, N.: Generalizing plans to new environments
in relational MDPs. In: Proceedings of the International Joint Conference on Artificial
Intelligence (IJCAI), pp. 1003–1010 (2003a)
Guestrin, C., Koller, D., Parr, R., Venkataraman, S.: Efficient solution algorithms for factored
MDPs. Journal of Artificial Intelligence Research (JAIR) 19, 399–468 (2003b)
Halbritter, F., Geibel, P.: Learning Models of Relational MDPs Using Graph Kernels. In:
Gelbukh, A., Kuri Morales, Á.F. (eds.) MICAI 2007. LNCS (LNAI), vol. 4827, pp. 409–
419. Springer, Heidelberg (2007)
Hanks, S., McDermott, D.V.: Modeling a dynamic and uncertain world I: Symbolic and prob-
abilistic reasoning about change. Artificial Intelligence 66(1), 1–55 (1994)
Guerra-Hernández, A., Fallah-Seghrouchni, A.E., Soldano, H.: Learning in BDI Multi-Agent
Systems. In: Dix, J., Leite, J. (eds.) CLIMA 2004. LNCS (LNAI), vol. 3259, pp. 218–233.
Springer, Heidelberg (2004)
8 Relational and First-Order Logical MDPs 287
Hernández, J., Morales, E.F.: Relational reinforcement learning with continuous actions by
combining behavioral cloning and locally weighted regression. Journal of Intelligent Sys-
tems and Applications 2, 69–79 (2010)
Häming, K., Peters, G.: Relational Reinforcement Learning Applied to Appearance-Based
Object Recognition. In: Palmer-Brown, D., Draganova, C., Pimenidis, E., Mouratidis, H.
(eds.) EANN 2009. Communications in Computer and Information Science, vol. 43, pp.
301–312. Springer, Heidelberg (2009)
Hölldobler, S., Skvortsova, O.: A logic-based approach to dynamic programming. In: Pro-
ceedings of the AAAI Workshop on Learning and Planning in Markov Processes - Ad-
vances and Challenges (2004)
Itoh, H., Nakamura, K.: Towards learning to learn and plan by relational reinforcement learn-
ing. In: Proceedings of the ICML Workshop on Relational Reinforcement Learning (2004)
Joshi, S.: First-order decision diagrams for decision-theoretic planning. PhD thesis, Tufts
University, Computer Science Department (2010)
Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable
stochastic domains. Artificial Intelligence 101, 99–134 (1998)
Kaelbling, L.P., Oates, T., Gardiol, N.H., Finney, S.: Learning in worlds with objects. In: The
AAAI Spring Symposium (2001)
Karabaev, E., Skvortsova, O.: A heuristic search algorithm for solving first-order MDPs. In:
Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI) (2005)
Karabaev, E., Rammé, G., Skvortsova, O.: Efficient symbolic reasoning for first-order MDPs.
In: ECAI Workshop on Planning, Learning and Monitoring with Uncertainty and Dynamic
Worlds (2006)
Katz, D., Pyuro, Y., Brock, O.: Learning to manipulate articulated objects in unstructured
environments using a grounded relational representation. In: Proceedings of Robotics:
Science and Systems IV (2008)
Kersting, K., De Raedt, L.: Logical Markov decision programs and the convergence of TD(λ ).
In: Proceedings of the International Conference on Inductive Logic Programming (ILP)
(2004)
Kersting, K., Driessens, K.: Non-parametric gradients: A unified treatment of propositional
and relational domains. In: Proceedings of the International Conference on Machine
Learning (ICML) (2008)
Kersting, K., van Otterlo, M., De Raedt, L.: Bellman goes relational. In: Proceedings of the
International Conference on Machine Learning (ICML) (2004)
Khardon, R.: Learning to take actions. Machine Learning 35(1), 57–90 (1999)
Kochenderfer, M.J.: Evolving Hierarchical and Recursive Teleo-Reactive Programs Through
Genetic Programming. In: Ryan, C., Soule, T., Keijzer, M., Tsang, E.P.K., Poli, R., Costa,
E. (eds.) EuroGP 2003. LNCS, vol. 2610, pp. 83–92. Springer, Heidelberg (2003)
Lane, T., Wilson, A.: Toward a topological theory of relational reinforcement learning for
navigation tasks. In: Proceedings of the International Florida Artificial Intelligence Re-
search Society Conference (FLAIRS) (2005)
Lang, T., Toussaint, M.: Approximate inference for planning in stochastic relational worlds.
In: Proceedings of the International Conference on Machine Learning (ICML) (2009)
Lang, T., Toussaint, M.: Probabilistic backward and forward reasoning in stochastic relational
worlds. In: Proceedings of the International Conference on Machine Learning (ICML)
(2010)
Langley, P.: Cognitive architectures and general intelligent systems. AI Magazine 27, 33–44
(2006)
288 M. van Otterlo
Lanzi, P.L.: Learning classifier systems from a reinforcement learning perspective. Soft Com-
puting 6, 162–170 (2002)
Lecoeuche, R.: Learning optimal dialogue management rules by using reinforcement learning
and inductive logic programming. In: Proceedings of the North American Chapter of the
Association for Computational Linguistics, NAACL (2001)
Letia, I., Precup, D.: Developing collaborative Golog agents by reinforcement learning. In:
Proceedings of the 13th IEEE International Conference on Tools with Artificial Intelli-
gence (ICTAI 2001). IEEE Computer Society (2001)
Levine, J., Humphreys, D.: Learning Action Strategies for Planning Domains Using Ge-
netic Programming. In: Raidl, G.R., Cagnoni, S., Cardalda, J.J.R., Corne, D.W., Got-
tlieb, J., Guillot, A., Hart, E., Johnson, C.G., Marchiori, E., Meyer, J.-A., Middendorf, M.
(eds.) EvoIASP 2003, EvoWorkshops 2003, EvoSTIM 2003, EvoROB/EvoRobot 2003,
EvoCOP 2003, EvoBIO 2003, and EvoMUSART 2003. LNCS, vol. 2611, pp. 684–695.
Springer, Heidelberg (2003)
Lison, P.: Towards relational POMDPs for adaptive dialogue management. In: ACL 2010:
Proceedings of the ACL 2010 Student Research Workshop, pp. 7–12. Association for
Computational Linguistics, Morristown (2010)
Littman, M.L., Sutton, R.S., Singh, S.: Predictive representations of state. In: Proceedings of
the Neural Information Processing Conference (NIPS) (2001)
Lloyd, J.W.: Logic for Learning: Learning Comprehensible Theories From Structured Data.
Springer, Heidelberg (2003)
Martin, M., Geffner, H.: Learning generalized policies in planning using concept languages.
In: Proceedings of the International Conference on Principles of Knowledge Representa-
tion and Reasoning (KR) (2000)
Mausam, Weld, D.S.: Solving relational MDPs with first-order machine learning. In: Work-
shop on Planning under Uncertainty and Incomplete Information at ICAPS 2003 (2003)
McCallum, R.A.: Instance-based utile distinctions for reinforcement learning with hidden
state. In: Proceedings of the International Conference on Machine Learning (ICML), pp.
387–395 (1995)
Mellor, D.: A Learning Classifier System Approach to Relational Reinforcement Learning.
In: Bacardit, J., Bernadó-Mansilla, E., Butz, M.V., Kovacs, T., Llorà, X., Takadama, K.
(eds.) IWLCS 2006 and IWLCS 2007. LNCS (LNAI), vol. 4998, pp. 169–188. Springer,
Heidelberg (2008)
Minker, J.: Logic-Based Artificial Intelligence. Kluwer Academic Publishers Group, Dor-
drecht (2000)
Minton, S., Carbonell, J., Knoblock, C.A., Kuokka, D.R., Etzioni, O., Gil, Y.: Explanation-
based learning: A problem solving perspective. Artificial Intelligence 40(1-3), 63–118
(1989)
Mooney, R.J., Califf, M.E.: Induction of first-order decision lists: Results on learning the past
tense of english verbs. Journal of Artificial Intelligence Research (JAIR) 3, 1–24 (1995)
Moore, A.W., Atkeson, C.G.: Prioritized sweeping: Reinforcement learning with less data
and less time. Machine Learning 13(1), 103–130 (1993)
Morales, E.F.: Scaling up reinforcement learning with a relational representation. In: Proceed-
ings of the Workshop on Adaptability in Multi-Agent Systems at AORC 2003, Sydney
(2003)
Morales, E.F.: Learning to fly by combining reinforcement learning with behavioral cloning.
In: Proceedings of the International Conference on Machine Learning (ICML), pp. 598–
605 (2004)
8 Relational and First-Order Logical MDPs 289
Moriarty, D.E., Schultz, A.C., Grefenstette, J.J.: Evolutionary algorithms for reinforcement
learning. Journal of Artificial Intelligence Research (JAIR) 11, 241–276 (1999)
Mourão, K., Petrick, R.P.A., Steedman, M.: Using kernel perceptrons to learn action ef-
fects for planning. In: Proceedings of the International Conference on Cognitive Systems
(CogSys), pp. 45–50 (2008)
Muller, T.J., van Otterlo, M.: Evolutionary reinforcement learning in relational domains. In:
Proceedings of the 7th European Workshop on Reinforcement Learning (2005)
Nason, S., Laird, J.E.: Soar-RL: Integrating reinforcement learning with soar. In: Proceedings
of the Workshop on Relational Reinforcement Learning at ICML 2004 (2004)
Nath, A., Domingos, P.: A language for relational decision theory. In: International Workshop
on Statistical Relational Learning, SRL (2009)
Neruda, R., Slusny, S.: Performance comparison of two reinforcement learning algorithms for
small mobile robots. International Journal of Control and Automation 2(1), 59–68 (2009)
Oates, T., Cohen, P.R.: Learning planning operators with conditional and probabilistic ef-
fects. In: Planning with Incomplete Information for Robot Problems: Papers from the
1996 AAAI Spring Symposium, pp. 86–94 (1996)
Pasula, H.M., Zettlemoyer, L.S., Kaelbling, L.P.: Learning probabilistic planning rules. In:
Proceedings of the International Conference on Artificial Intelligence Planning Systems
(ICAPS) (2004)
Poole, D.: The independent choice logic for modeling multiple agents under uncertainty.
Artificial Intelligence 94, 7–56 (1997)
Ramon, J., Driessens, K., Croonenborghs, T.: Transfer Learning in Reinforcement Learn-
ing Problems Through Partial Policy Recycling. In: Kok, J.N., Koronacki, J., Lopez de
Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI),
vol. 4701, pp. 699–707. Springer, Heidelberg (2007)
Reiter, R.: Knowledge in Action: Logical Foundations for Specifying and Implementing Dy-
namical Systems. The MIT Press, Cambridge (2001)
Rodrigues, C., Gerard, P., Rouveirol, C.: On and off-policy relational reinforcement learning.
In: Late-Breaking Papers of the International Conference on Inductive Logic Program-
ming (2008)
Rodrigues, C., Gérard, P., Rouveirol, C.: IncremEntal Learning of Relational Action Models
in Noisy Environments. In: Frasconi, P., Lisi, F.A. (eds.) ILP 2010. LNCS, vol. 6489, pp.
206–213. Springer, Heidelberg (2011)
Roncagliolo, S., Tadepalli, P.: Function approximation in hierarchical relational reinforce-
ment learning. In: Proceedings of the Workshop on Relational Reinforcement Learning at
ICML (2004)
Russell, S.J., Norvig, P.: Artificial Intelligence: a Modern Approach, 2nd edn. Prentice Hall,
New Jersey (2003)
Ryan, M.R.K.: Using abstract models of behaviors to automatically generate reinforcement
learning hierarchies. In: Proceedings of the International Conference on Machine Learn-
ing (ICML), pp. 522–529 (2002)
Saad, E.: A Logical Framework to Reinforcement Learning Using Hybrid Probabilistic Logic
Programs. In: Greco, S., Lukasiewicz, T. (eds.) SUM 2008. LNCS (LNAI), vol. 5291, pp.
341–355. Springer, Heidelberg (2008)
Safaei, J., Ghassem-Sani, G.: Incremental learning of planning operators in stochastic do-
mains. In: Proceedings of the International Conference on Current Trends in Theory and
Practice of Computer Science (SOFSEM), pp. 644–655 (2007)
290 M. van Otterlo
Sanner, S.: Simultaneous learning of structure and value in relational reinforcement learn-
ing. In: Driessens, K., Fern, A., van Otterlo, M. (eds.) Proceedings of the ICML-2005
Workshop on Rich Representations for Reinforcement Learning (2005)
Sanner, S.: Online feature discovery in relational reinforcement learning. In: Proceedings of
the ICML-2006 Workshop on Open Problems in Statistical Relational Learning (2006)
Sanner, S., Boutilier, C.: Approximate linear programming for first-order MDPs. In: Proceed-
ings of the Conference on Uncertainty in Artificial Intelligence (UAI) (2005)
Sanner, S., Boutilier, C.: Practical linear value-approximation techniques for first-order
MDPs. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI)
(2006)
Sanner, S., Boutilier, C.: Approximate solution techniques for factored first-order MDPs. In:
Proceedings of the International Conference on Artificial Intelligence Planning Systems
(ICAPS) (2007)
Sanner, S., Kersting, K.: Symbolic dynamic programming for first-order pomdps. In: Pro-
ceedings of the National Conference on Artificial Intelligence (AAAI) (2010)
Schmid, U.: Inductive synthesis of functional programs: Learning domain-specific control
rules and abstraction schemes. In: Habilitationsschrift, Fakultät IV, Elektrotechnik und
Informatik, Technische Universität Berlin, Germany (2001)
Schuurmans, D., Patrascu, R.: Direct value approximation for factored MDPs. In: Proceed-
ings of the Neural Information Processing Conference (NIPS) (2001)
Shapiro, D., Langley, P.: Separating skills from preference. In: Proceedings of the Interna-
tional Conference on Machine Learning (ICML), pp. 570–577 (2002)
Simpkins, C., Bhat, S., Isbell, C.L., Mateas, M.: Adaptive Programming: Integrating Rein-
forcement Learning into a Programming Language. In: Proceedings of the Twenty-Third
ACM SIGPLAN International Conference on Object-Oriented Programming, Systems,
Languages, and Applications, OOPSLA (2008)
Slaney, J., Thiébaux, S.: Blocks world revisited. Artificial Intelligence 125, 119–153 (2001)
Song, Z.W., Chen, X.P.: States evolution in Θ (λ )-learning based on logical mdps with nega-
tion. In: IEEE International Conference on Systems, Man and Cybernetics, pp. 1624–1629
(2007)
Song, Z.W., Chen, X.P.: Agent learning in relational domains based on logical mdps with
negation. Journal of Computers 3(9), 29–38 (2008)
Stone, P.: Learning and multiagent reasoning for autonomous agents. In: Proceedings of the
International Joint Conference on Artificial Intelligence (IJCAI), Computers and Thought
Award Paper (2007)
Stracuzzi, D.J., Asgharbeygi, N.: Transfer of knowledge structures with relational temporal
difference learning. In: Proceedings of the ICML 2006 Workshop on Structural Knowl-
edge Transfer for Machine Learning (2006)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: an Introduction. The MIT Press,
Cambridge (1998)
Sutton, R.S., McAllester, D.A., Singh, S., Mansour, Y.: Policy gradient methods for reinforce-
ment learning with function approximation. In: Proceedings of the Neural Information
Processing Conference (NIPS), pp. 1057–1063 (2000)
Thielscher, M.: Introduction to the Fluent Calculus. Electronic Transactions on Artificial
Intelligence 2(3-4), 179–192 (1998)
Thon, I., Guttman, B., van Otterlo, M., Landwehr, N., De Raedt, L.: From non-deterministic
to probabilistic planning with the help of statistical relational learning. In: Workshop on
Planning and Learning at ICAPS (2009)
8 Relational and First-Order Logical MDPs 291
Torrey, L.: Relational transfer in reinforcement learning. PhD thesis, University of Wisconsin-
Madison, Computer Science Department (2009)
Torrey, L., Shavlik, J., Walker, T., Maclin, R.: Relational macros for transfer in reinforcement
learning. In: Proceedings of the International Conference on Inductive Logic Program-
ming (ILP) (2007)
Torrey, L., Shavlik, J., Natarajan, S., Kuppili, P., Walker, T.: Transfer in reinforcement learn-
ing via markov logic networks. In: Proceedings of the AAAI-2008 Workshop on Transfer
Learning for Complex Tasks (2008)
Toussaint, M.: Probabilistic inference as a model of planned behavior. Künstliche Intelligenz
(German Artificial Intelligence Journal) 3 (2009)
Toussaint, M., Plath, N., Lang, T., Jetchev, N.: Integrated motor control, planning, grasping
and high-level reasoning in a blocks world using probabilistic inference. In: IEEE Inter-
national Conference on Robotics and Automation, ICRA (2010)
Van den Broeck, G., Thon, I., van Otterlo, M., De Raedt, L.: DTProbLog: A decision-
theoretic probabilistic prolog. In: Proceedings of the National Conference on Artificial
Intelligence (AAAI) (2010)
van Otterlo, M.: Efficient reinforcement learning using relational aggregation. In: Proceed-
ings of the Sixth European Workshop on Reinforcement Learning, Nancy, France (EWRL-
6) (2003)
van Otterlo, M.: Reinforcement learning for relational MDPs. In: Nowé, A., Lenaerts, T.,
Steenhaut, K. (eds.) Machine Learning Conference of Belgium and the Netherlands
(BeNeLearn 2004), pp. 138–145 (2004)
van Otterlo, M.: Intensional dynamic programming: A rosetta stone for structured dynamic
programming. Journal of Algorithms 64, 169–191 (2009a)
van Otterlo, M.: The Logic of Adaptive Behavior: Knowledge Representation and Algorithms
for Adaptive Sequential Decision Making under Uncertainty in First-Order and Relational
Domains. IOS Press, Amsterdam (2009b)
van Otterlo, M., De Vuyst, T.: Evolving and transferring probabilistic policies for relational
reinforcement learning. In: Proceedings of the Belgium-Netherlands Artificial Intelligence
Conference (BNAIC), pp. 201–208 (2009)
van Otterlo, M., Wiering, M.A., Dastani, M., Meyer, J.J.: A characterization of sapient agents.
In: Mayorga, R.V., Perlovsky, L.I. (eds.) Toward Computational Sapience: Principles and
Systems, ch. 9. Springer, Heidelberg (2007)
Vargas, B., Morales, E.: Solving navigation tasks with learned teleo-reactive programs,
pp. 4185–4185 (2008), doi:10.1109/IROS.2008.4651240
Vargas-Govea, B., Morales, E.: Learning Relational Grammars from Sequences of Actions.
In: Bayro-Corrochano, E., Eklundh, J.-O. (eds.) CIARP 2009. LNCS, vol. 5856, pp. 892–
900. Springer, Heidelberg (2009)
Vere, S.A.: Induction of relational productions in the presence of background information.
In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI),
pp. 349–355 (1977)
Walker, T., Shavlik, J., Maclin, R.: Relational reinforcement learning via sampling the space
of first-order conjunctive features. In: Proceedings of the Workshop on Relational Rein-
forcement Learning at ICML 2004 (2004)
Walker, T., Torrey, L., Shavlik, J., Maclin, R.: Building relational world models for rein-
forcement learning. In: Proceedings of the International Conference on Inductive Logic
Programming (ILP) (2007)
Walsh, T.J.: Efficient learning of relational models for sequential decision making. PhD
thesis, Rutgers University, Computer Science Department (2010)
292 M. van Otterlo
Walsh, T.J., Littman, M.L.: Efficient learning of action schemas and web-service descriptions.
In: Proceedings of the National Conference on Artificial Intelligence (AAAI) (2008)
Walsh, T.J., Li, L., Littman, M.L.: Transferring state abstractions between mdps. In: ICML-
2006 Workshop on Structural Knowledge Transfer for Machine Learning (2006)
Wang, C.: First-order markov decision processes. PhD thesis, Department of Computer Sci-
ence, Tufts University, U.S.A (2007)
Wang, C., Khardon, R.: Policy iteration for relational mdps. In: Proceedings of the Confer-
ence on Uncertainty in Artificial Intelligence (UAI) (2007)
Wang, C., Khardon, R.: Relational partially observable mdps. In: Proceedings of the National
Conference on Artificial Intelligence (AAAI) (2010)
Wang, C., Schmolze, J.: Planning with pomdps using a compact, logic-based representation.
In: Proceedings of the IEEE International Conference on Tools with Artificial Intelligence,
ICTAI (2005)
Wang, C., Joshi, S., Khardon, R.: First order decision diagrams for relational MDPs. In:
Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) (2007)
Wang, C., Joshi, S., Khardon, R.: First order decision diagrams for relational MDPs. Journal
of Artificial Intelligence Research (JAIR) 31, 431–472 (2008a)
Wang, W., Gao, Y., Chen, X., Ge, S.: Reinforcement Learning with Markov Logic Networks.
In: Gelbukh, A., Morales, E.F. (eds.) MICAI 2008. LNCS (LNAI), vol. 5317, pp. 230–
242. Springer, Heidelberg (2008b)
Wang, X.: Learning by observation and practice: An incremental approach for planning op-
erator acquisition. In: Proceedings of the International Conference on Machine Learning
(ICML), pp. 549–557 (1995)
Wingate, D., Soni, V., Wolfe, B., Singh, S.: Relational knowledge with predictive state repre-
sentations. In: Proceedings of the International Joint Conference on Artificial Intelligence
(IJCAI) (2007)
Wooldridge, M.: An introduction to MultiAgent Systems. John Wiley & Sons Ltd., West
Sussex (2002)
Wu, J.H., Givan, R.: Discovering relational domain features for probabilistic planning. In:
Proceedings of the International Conference on Artificial Intelligence Planning Systems
(ICAPS) (2007)
Wu, K., Yang, Q., Jiang, Y.: ARMS: Action-relation modelling system for learning action
models. In: Proceedings of the National Conference on Artificial Intelligence (AAAI)
(2005)
Xu, J.Z., Laird, J.E.: Instance-based online learning of deterministic relational action models.
In: Proceedings of the International Conference on Machine Learning (ICML) (2010)
Yoon, S.W., Fern, A., Givan, R.: Inductive policy selection for first-order MDPs. In: Proceed-
ings of the Conference on Uncertainty in Artificial Intelligence (UAI) (2002)
Zettlemoyer, L.S., Pasula, H.M., Kaelbling, L.P.: Learning planning rules in noisy stochas-
tic worlds. In: Proceedings of the National Conference on Artificial Intelligence (AAAI)
(2005)
Zhao, H., Doshi, P.: Haley: A hierarchical framework for logical composition of web services.
In: Proceedings of the International Conference on Web Services (ICWS), pp. 312–319
(2007)
Zhuo, H., Li, L., Bian, R., Wan, H.: Requirement Specification Based on Action Model Learn-
ing. In: Huang, D.-S., Heutte, L., Loog, M. (eds.) ICIC 2007. LNCS, vol. 4681, pp. 565–
574. Springer, Heidelberg (2007)
Chapter 9
Hierarchical Approaches
Bernhard Hengst
9.1 Introduction
Artificial intelligence (AI) is about how to construct agents that act rationally. An
agent acts rationally when it maximises a performance measure given a sequence
of perceptions (Russell and Norvig, 1995). Planning and control theory can also be
viewed from an agent perspective and included in this problem class. Reinforcement
learning may at first seem like a seductive approach to solve the artificial general
intelligence (AGI) problem (Hutter, 2007). While in principle this may be true, re-
inforcement learning is beleaguered by the “curse of dimensionality”. The curse
of dimensionality is a term coined by Bellman (1961) to refer to the exponential
increase in the state-space with each additional variable or dimension that describes
the problem. Bellman noted that sheer enumeration will not solve problems of any
significance. It is unlikely that complex problems can be described by only a few
variables, and so, it may seem we are at an impasse.
Fortunately the real world is highly structured with many constraints and with
most parts independent of most other parts. Without structure it would be impos-
sible to solve complex problems of any size (Russell and Norvig, 1995). Structure
can significantly reduce the naïve state space generated by sheer enumeration. For example, if the transition and reward functions for two variables are independent, then a reinforcement learner only needs to explore a state space whose size is the sum of the two variables’ state-space sizes, rather than their product.
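As a back-of-the-envelope illustration of that saving (the variable sizes below are invented for the example), independence turns a product of state-space sizes into a sum:

```python
# Two state variables whose transition and reward functions do not interact.
# The sizes are invented purely for illustration.
n_x, n_y = 25, 40

joint = n_x * n_y   # naive enumeration over both variables: 1000 states
split = n_x + n_y   # two independent sub-problems: 65 states
print(joint, split)
```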
Reinforcement learning is concerned with problems represented by actions as
well as states, generalising problem solving to systems that are dynamic in time.
Hence we often refer to reinforcement learning problems as tasks, and sub-problems
as sub-tasks.
This Chapter is concerned with leveraging hierarchical structure to reduce and solve complex reinforcement learning problems that would otherwise be difficult, if not impossible, to solve.
Hierarchy
Four-Room Task
Throughout this Chapter we will use a simple four-room task as a running example
to help illustrate concepts. Figure 9.1 (left) shows the agent view of reinforcement
learning. The agent interacts with the environment in a sense-act loop, receiving a
reward signal at each time-step as a part of the input.
Fig. 9.1 Left: The agent view of Reinforcement Learning. Right: A four-room task with the
agent in one of the rooms shown as a solid black oval.
In the four-room example, the agent is represented as a black oval at a grid loca-
tion in the South-East room of a four-room house (Figure 9.1, right). The rooms are
connected by open doorways. The North-West room has a doorway leading out of
the house. At each time-step the agent takes an action and receives a sensor obser-
vation and a reward from the environment.
Each cell represents a possible agent position. A position is uniquely described by the room together with the position within that room. In this example the rooms have the same dimensions, and corresponding positions in different rooms are described by the same identifier. The environment is fully observable: the agent is able to sense both the room it occupies and its position in the room. The agent can move one
cell-step in any of the four compass directions at each time-step. It also receives a
reward of −1 at each time-step. The objective is to leave the house via the least-cost
route. We assume that the actions are stochastic. When an action is taken there is
an 80% chance that the agent will move in the intended direction and a 20% chance
that it will stay in place. If the agent moves into a wall it will remain where it is.
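The dynamics just described are simple enough to write down directly. The sketch below is one possible rendering, assuming 5×5 rooms arranged in a 2×2 grid, one doorway per shared wall, and an exit cell above the North-West room; none of these layout details are fixed by the chapter.

```python
import random

ROOM_SIZE = 5                                    # assumed 5x5 cells per room
EXIT = (-1, 2)                                   # assumed doorway out of the NW room
ACTIONS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
DOORWAYS = {((2, 4), (2, 5)), ((7, 4), (7, 5)),  # assumed doorways between rooms
            ((4, 2), (5, 2)), ((4, 7), (5, 7))}

def blocked(src, dst):
    """True if moving from src to dst crosses an outer or inner wall."""
    r, c = dst
    if not (0 <= r < 2 * ROOM_SIZE and 0 <= c < 2 * ROOM_SIZE):
        return (r, c) != EXIT                    # only the exit cell lies outside
    crosses_inner_wall = ((src[0] < ROOM_SIZE) != (r < ROOM_SIZE) or
                          (src[1] < ROOM_SIZE) != (c < ROOM_SIZE))
    return (crosses_inner_wall and (src, dst) not in DOORWAYS
            and (dst, src) not in DOORWAYS)

def step(state, action):
    """One primitive step: 80% move as intended, 20% stay put; reward -1.
    Reaching EXIT can be treated as the terminal 'out of the house' state."""
    if random.random() < 0.2:
        return state, -1.0
    dr, dc = ACTIONS[action]
    nxt = (state[0] + dr, state[1] + dc)
    return (state if blocked(state, nxt) else nxt), -1.0
```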
Our approach to HRL starts with a well-specified reinforcement learning problem modelled as a Markov Decision Process (MDP), as described in Chapter 1. The
reader can easily verify that the four-room task is such a reinforcement learning
problem. We provide here an initial intuitive description of how HRL methods might
be applied to the four-room task.
If we can find a policy to leave a room, say by the North doorway, then we could
reuse this policy in any of the rooms because they are identical. The problem to
leave a room by the North doorway is just another, albeit smaller, reinforcement
learning problem that has inherited the position-in-room states, actions, transition
and reward function from the original problem. We proceed to solve two smaller
reinforcement learning problems, one to find a room-leaving policy to the North
and another to leave a room through the West doorway.
We also formulate and solve a higher-level reinforcement learning problem that
uses only the four room-states. In each room-state we allow a choice of execut-
ing one of the previously learnt room-leaving policies. For the higher-level problem
these policies are viewed as temporally extended actions because once they are in-
voked they will usually persist for multiple time-steps until the agent exits a room.
At this stage we simply specify a reward of −1 per room-leaving action. As we
shall see later, reinforcement learning can be generalised to work with temporally
extended actions using the formalism of semi-Markov Decision Processes (SMDP).
Once learnt, the execution of the higher-level house-leaving policy will deter-
mine the room-leaving action to invoke given the current room. Control is passed to
the room-leaving sub-task that leads the agent out of the room through the chosen
doorway. Upon leaving the room, the sub-task is terminated and control is passed
back to the higher level that chooses the next room-leaving action until the agent
finally leaves the house.
The above example hides many issues that HRL needs to address, including: safe
state abstraction; appropriately accounting for accumulated sub-task reward; opti-
mality of the solution; and specifying, or even learning, the hierarchical structure itself. In the next sections we will discuss these issues and review several approaches
to HRL.
9.2 Background
This section will introduce several concepts that are important to understanding hi-
erarchical reinforcement learning (HRL). There is general agreement that tempo-
rally extended actions and the related semi-Markov Decision Process formalism are
key ingredients. We will also discuss issues related to problem size reduction and
solution optimality.
Approaches to HRL employ actions that persist for multiple time-steps. These tem-
porally extended or abstract actions hide the multi-step state-transition and reward
details from the time they are invoked until termination. The room-leaving actions
discussed above for the four-room task are examples. Leaving a room may involve
taking multiple single-step actions to first navigate to a doorway and then stepping
through it.
Abstract actions are employed in many fields including AI, robotics and control
engineering. They are similar to macros in computer science that make available a
sequence of instructions as a single program statement. In planning, macros help
decompose and solve problems (see for example solutions to the Fifteen Puzzle and
Rubik’s Cube (Korf, 1985)).
Abstract actions in an MDP setting extend macros in that the more primitive
steps that comprise an abstract action may be modelled with stochastic transition
functions and use stochastic policies (see Chapter 1). Abstract actions may execute a
policy for a smaller Markov Decision Problem. Stochasticity can manifest itself in
several ways. When executing an abstract action a stochastic transition function will
make the sequence of states visited non-deterministic. The sequence of rewards may
also vary, depending on the sequence of states visited, even if the reward function
itself is deterministic. Finally, the time taken to complete an abstract action may
vary. Abstract actions with deterministic effects are just classical macro operators.
Properties of abstract actions can be seen in the four-room task. In this problem the 20% chance of staying in place makes it impossible to determine beforehand how many time-steps it will take to leave a room when a room-leaving abstract action is invoked.
Abstract actions can be generalised to continuous time (Puterman, 1994); however, this Chapter will focus on discrete-time problems. Abstract actions that terminate in one time-step are a special case: they are just ordinary actions, and we refer to them as primitive actions.
We will now extend MDPs from Chapter 1 to MDPs that include abstract actions.
MDPs that include abstract actions are called semi-Markov Decision Problems, or SMDPs (Puterman, 1994). As abstract actions can take a random number of time-steps to complete, we need to introduce another variable to account for the time to termination of the abstract action.
We denote by the random variable N ≥ 1 the number of time-steps that an abstract action a takes to complete, starting in state s and terminating in state s′.
The model of the SMDP, defined by the state transition probability function and the expected reward function, now includes the random variable N.¹
The transition function T : S × A × S × N → [0,1] gives the probability of the abstract action a, initiated in state s, terminating in state s′ after N steps:

T(s,a,s',N) = \Pr\{ s_{t+N} = s', N \mid s_t = s, a_t = a \}     (9.1)

The corresponding expected reward function gives the discounted reward accumulated while the abstract action executes:

R(s,a,s',N) = E\{ r_t + \gamma r_{t+1} + \dots + \gamma^{N-1} r_{t+N-1} \mid s_t = s, a_t = a, s_{t+N} = s' \}     (9.2)

The value functions and Bellman “backup” equations from Chapter 1 for MDPs can also be generalised for SMDPs. The value of state s for policy π, denoted V^π(s), is the expected return starting in state s at time t and taking abstract actions according to π:²

V^{\pi}(s) = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \;\Big|\; s_t = s \Big\}

If the abstract action executed in state s is π(s), persists for N steps and then terminates, we can write the value function as two series: the sum of the rewards accumulated over the first N steps, and the remainder of the series of rewards.

V^{\pi}(s) = E_{\pi}\big\{ (r_t + \gamma r_{t+1} + \dots + \gamma^{N-1} r_{t+N-1}) + (\gamma^{N} r_{t+N} + \dots) \;\big|\; s_t = s \big\}

Taking the expectation with respect to both s′ and N with probabilities given by Equation 9.1, substituting the N-step reward for abstract action π(s) from Equation 9.2, and recognising that the second series is just the value function starting in s′ discounted by N steps, we can write

V^{\pi}(s) = \sum_{s',N} T(s,\pi(s),s',N)\,\big[ R(s,\pi(s),s',N) + \gamma^{N} V^{\pi}(s') \big]

¹ This formulation of an SMDP is based on Sutton et al (1999) and Dietterich (2000), but has been changed to be consistent with the notation in Chapter 1.
² Note that we are overloading the function π: π(s,a) is the probability of choosing abstract action a in state s, while π(s) is the abstract action chosen in state s under a deterministic policy π.
The action-value function generalises in the same way:

Q^{\pi}(s,a) = \sum_{s',N} T(s,a,s',N)\,\big[ R(s,a,s',N) + \gamma^{N} Q^{\pi}(s',\pi(s')) \big]     (9.3)

The optimum value function for an SMDP (denoted by ∗) is also similar to that for MDPs, with the sum taken with respect to s′ and N. The optimum SMDP value function is

V^{*}(s) = \max_{a} \sum_{s',N} T(s,a,s',N)\,\big[ R(s,a,s',N) + \gamma^{N} V^{*}(s') \big]

For problems that are guaranteed to terminate, the discount factor γ can be set to 1. In this case the number of steps N can be marginalised out of the above equations and the sum taken with respect to s′ alone. The equations are then similar to the ones for MDPs, with the expected primitive reward replaced by the expected sum of rewards to termination of the abstract action.
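To make the backup above concrete, the following sketch runs synchronous value iteration over an enumerated SMDP model. The dictionary-based model access (T and R keyed by (s, a, s2, N)) is an assumption made for illustration, not something the chapter prescribes.

```python
def smdp_value_iteration(states, actions, T, R, gamma=0.95, sweeps=100, tol=1e-6):
    """Synchronous value iteration with abstract actions, following the SMDP
    backup above.  T[(s, a, s2, N)] is the probability that abstract action a,
    started in s, terminates in s2 after N steps; R[(s, a, s2, N)] is the
    corresponding expected (discounted) accumulated reward."""
    V = {s: 0.0 for s in states}

    def q(s, a):
        # Sum over all recorded terminations (s2, N) of abstract action a in s.
        return sum(p * (R.get((s, a, s2, N), 0.0) + gamma ** N * V[s2])
                   for (s1, a1, s2, N), p in T.items() if (s1, a1) == (s, a))

    for _ in range(sweeps):
        delta = 0.0
        for s in states:
            new_v = max(q(s, a) for a in actions)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:          # stop once the value function has converged
            break
    return V
```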
All the methods developed in Chapter 1 for solving Markov decision processes with primitive actions work equally well for problems
using abstract actions. As primitive actions are just a special case of abstract actions
we include them in the set of abstract actions.
The reader may well wonder whether the introduction of abstract actions buys us
anything. After all we have just added extra actions to the problem and increased
the complexity. The rest of this Chapter will show how abstract actions allow us
to leverage structure in problems to reduce storage requirements and increase the
speed of learning.
In a similar four-room example to that of Figure 9.1, Sutton et al (1999) show how the presence of abstract actions allows the agent to learn significantly faster, proceeding on a room-by-room basis rather than position by position.³ When the goal is not in a convenient location that can be reached by the given abstract actions, it is possible to include primitive actions as special-case abstract actions and still accelerate learning for some problems. For example, with room-leaving abstract actions alone, it may not be possible to reach a goal in the middle of a room.
Unless we introduce other abstract actions, primitive actions are still required when the room containing the goal state is entered. Although the inclusion of primitive actions guarantees convergence to the globally optimal policy, it may create extra work for the learner. Reinforcement learning may be accelerated because the
³ In this example, abstract actions take the form of options, which will be defined in Subsection 9.3.1.
value function can be backed up over greater distances in the state-space, but the introduction of additional actions increases the storage and exploration necessary.
9.2.3 Structure
Abstract actions and SMDPs naturally lead to hierarchical structure. With appropri-
ate abstract actions alone we may be able to learn a policy with less effort than it
would take to solve the problem using primitive actions. This is because abstract
actions can skip over large parts of the state-space terminating in a small subset of
states. We saw in the four-room task how room-leaving abstract actions are able to
reduce the problem state-space to room states alone.
Abstract actions themselves may be policies from smaller SMDPs (or MDPs).
This establishes a hierarchy where a higher-level parent task employs child subtasks
as its abstract actions.
Task Hierarchies
Fig. 9.2 A task-hierarchy decomposing the four-room task in Figure 9.1. The two lower-level
sub-tasks are generic room-leaving abstract actions, one each for leaving a room to the North
and West.
Figure 9.2 shows a task-hierarchy for the four-room task discussed in Section 9.1.
The two lower-level sub-tasks are MDPs for a generic room, where separate policies
are learnt to exit a room to the North and West. The arrows indicate transitions to
terminal states. States, actions, transitions and rewards are inherited from the origi-
nal MDP. The higher level problem (SMDP) consists of just four states representing
the rooms. Any of the sub-tasks (room-leaving actions) can be invoked in any of the
rooms.
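One might encode the task-hierarchy of Figure 9.2 as plain data, roughly as sketched below; the room labels, the 5×5 position-in-room states and the dictionary layout are assumptions made for illustration.

```python
# Illustrative encoding of the task-hierarchy in Figure 9.2; the chapter does
# not prescribe any particular data structure.
task_hierarchy = {
    "root": {
        "states": ["NW", "NE", "SW", "SE"],        # one abstract state per room
        "actions": ["leave_north", "leave_west"],  # child sub-tasks as actions
    },
    "leave_north": {
        "states": [(r, c) for r in range(5) for c in range(5)],  # position-in-room
        "actions": ["N", "S", "E", "W"],           # inherited primitive actions
        "terminal": "agent has passed through the North doorway",
    },
    "leave_west": {
        "states": [(r, c) for r in range(5) for c in range(5)],
        "actions": ["N", "S", "E", "W"],
        "terminal": "agent has passed through the West doorway",
    },
}
```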
Partial Programs
The benefit of decomposing a large MDP is that it will hopefully lead to state ab-
straction opportunities to help reduce the complexity of the problem. An abstracted
state space is smaller than the state space of an original MDP. There are broadly two
kinds of conditions under which state-abstractions can be introduced (Dietterich,
2000). They are situations in which:
• we can eliminate irrelevant variables, and
• where abstract actions “funnel” the agent to a small subset of states.
When reinforcement learning algorithms are given redundant information they will
learn the same value-function or policy for all the redundant states. For example,
navigating through a red coloured room may be the same as for a blue coloured
room, but a value function and policy treats each (position-in-room, colour) as a
different state. If colour has no effect on navigation it would simplify the problem
by eliminating the colour variable from consideration.
More generally, if we can find a partition of the state-space of a subtask m such that all transitions from states in one block of the partition have the same probability of, and expected reward for, transitioning to each of the other blocks, then we can reduce
the subtask to one where the states become the blocks. The solution of this reduced
subtask is the solution of the original subtask. This is the notion of stochastic bisim-
ulation homogeneity for MDP model minimisation (Dean and Givan, 1997). The
computational complexity of solving an MDP, being polynomial in |S|, will be re-
duced depending on the coarseness of the partition. The state abstraction can be
substantial. Consider the subtask that involves walking. This skill is mostly inde-
pendent of geography, dress, and objects in the world. If we could not abstract the
walking subtask, we would be condemned to re-learn to walk every time any one of these variables changed value.
Formally, if P = {B_1, ..., B_n} is a partition of the state-space of an SMDP, then the blocks of P can serve as the states of a reduced SMDP if and only if, for each B_i, B_j ∈ P, a ∈ A, p, q ∈ B_i, and number of time-steps to termination N,

\sum_{r \in B_j} T(p,a,r,N) = \sum_{r \in B_j} T(q,a,r,N)

\sum_{r \in B_j} R(p,a,r,N) = \sum_{r \in B_j} R(q,a,r,N)
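These two conditions can be checked mechanically on an enumerated model. The sketch below assumes the model is available as dictionaries keyed by (s, a, s2, N), which is an illustrative choice rather than anything prescribed here.

```python
from collections import defaultdict

def respects_bisimulation(partition, states, actions, horizons, T, R, tol=1e-9):
    """Check whether `partition` (state -> block id) satisfies the block-level
    transition and reward conditions above.  T and R map (s, a, s2, N) to a
    probability / expected reward; this dictionary interface is assumed."""
    blocks = defaultdict(set)
    for s, b in partition.items():
        blocks[b].add(s)

    def block_sums(p, a, N, model):
        # Aggregate model values from p into each destination block.
        sums = defaultdict(float)
        for s2 in states:
            sums[partition[s2]] += model.get((p, a, s2, N), 0.0)
        return sums

    for members in blocks.values():
        p = next(iter(members))                 # reference state for the block
        for q in members:
            for a in actions:
                for N in horizons:
                    for model in (T, R):
                        ref = block_sums(p, a, N, model)
                        other = block_sums(q, a, N, model)
                        for b in set(ref) | set(other):
                            if abs(ref.get(b, 0.0) - other.get(b, 0.0)) > tol:
                                return False
    return True
```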
Funnelling
A funnelling abstract action moves the agent from a large set of states to a small set of resulting states. The effect is exploited, for example, by Forestier and Varaiya (1978) in plant control and by Dean and Lin (1995) to reduce the size of the MDP.
Funnelling can be observed in the four-room task. Room-leaving abstract actions
move the agent from any position in a room to the state outside the respective door-
way. Funnelling allows the four-room task to be state-abstracted at the root node to
just 4 states because, irrespective of the starting position in each room, the abstract
actions have the property of moving the agent to another room state.
The task-hierarchy for the four-room task in Figure 9.2 has two successful higher-
level policies that will find a path out of the house from the starting position in the
South-East room. They are to leave rooms successively either North-West-North or
West-North-North. The latter is the shorter path, but the simple hierarchical rein-
forcement learner in Section 9.1 cannot make this distinction.
What is needed is a way to decompose the value function for the whole prob-
lem over the task-hierarchy. Given this decomposition we can take into account the
rewards within a subtask when making decisions at higher levels.
To see this, consider the first decision that the agent in the four-room task (Figure
9.1) has to make. Deciding whether it is better to invoke a North or West room-
leaving abstract action does not just depend on the agent’s room state – the South-
West room in this case. It also depends on the current position in the room. In this
case we should aim for the doorway that is closer, as once we have exited the room,
the distance out of the house is the same for either doorway. We need both the room-
state and the position-in-room state to decide on the best action.
The question now arises as to how to decompose the original value function
given the task-hierarchy so that the optimal action can be determined in each state.
This will be addressed by the MAXQ approach (Dietterich, 2000) in Section 9.3.3
with a two part decomposition of the value function for each subtask in the task-
hierarchy. The two parts are value to termination of the abstract action and the value
to termination of the subtask.
Andre and Russell (2002) have introduced a three-part decomposition of the value function by including the value to complete the problem after the subtask terminates. The advantage of this approach is that context is taken into account, making solutions hierarchically optimal (see Section 9.2.6), but usually at the cost of less state abstraction. The great benefit of decomposition is the state-abstraction opportunities it creates.
9.2.6 Optimality
We are familiar with the notion of an optimal policy for an MDP from Chapter 1.
Unfortunately, HRL cannot guarantee in general that a decomposed problem will
necessarily yield the optimal solution. It depends on the problem and the quality of
the decomposition in terms of the abstract actions available and the structure of the
task hierarchy or the partial program.
Hierarchically Optimal
Policies that are hierarchically optimal are ones that maximise the overall value
function consistent with the constraints imposed by the task-hierarchy. To illustrate
this concept, assume that the agent in the four-room task moves with a 70% probability in the intended direction, but slips with a 10% probability in each of the other three directions. Executing the hierarchically optimal policy for the task-hierarchy shown
in Figure 9.2 may not be optimal. The top level policy will choose the West room-
leaving action to leave the room by the nearest doorway. If the agent should find
itself near the North doorway due to stochastic drift, it will nevertheless stubbornly
persist to leave by the West doorway as dictated by the policy of the West room-
leaving abstract action. This is suboptimal. A “flat” reinforcement learner executing
the optimal policy would change its mind and attempt to leave the South-West room
by the closest doorway.
The task-hierarchy could once again be made to yield the optimal solution if we
included an abstract action that was tasked to leave the room by either the West or
North doorway.
Recursively Optimal
A recursively optimal policy is one in which each subtask policy is optimal given the learnt policies of its descendant subtasks, without reference to the context of its parent task (Dietterich, 2000); this is in general a weaker guarantee than hierarchical optimality.
As we have seen above, the stochastic nature of MDPs means that the condition under which an abstract action is appropriate may have changed after the action's
invocation and that another action may become a better choice because of the in-
herent stochasticity of transitions (Hauskrecht et al, 1998). A subtask policy pro-
ceeding to termination may be sub-optimal. By constantly interrupting the sub-task
a better sub-task may be chosen. Dietterich calls this “polling” procedure hierarchi-
cal greedy execution. While this is guaranteed to be no worse than a hierarchically optimal solution or a recursively optimal solution, and may be considerably better, it
still does not provide any global optimality guarantees.
HRL continues to be an active research area. We will now briefly survey some of the
work in HRL. Please also see Barto and Mahadevan (2003) for a survey of advances
in HRL, Si et al (2004) for hierarchical decision making and approaches to con-
currency, multi-agency and partial observability, and Ryan (2004) for an alternative
treatment and motivation underlying the movement towards HRL.
Historical Perspective
Ashby (1956) talks about amplifying the regulation of large systems in a series of stages, describing these hierarchical control systems as “ultra-stable”. HRL can be viewed as a gating mechanism that, at higher levels, learns to switch in appropriate, more reactive behaviours at lower levels. Ashby (1952) proposed such a gating mechanism for an agent to handle recurrent situations.
The subsumption architecture (Brooks, 1990) works along similar lines. It de-
composes complicated behaviour into many simple tasks which are organised into
layers, with higher level layers becoming increasingly abstract. Each layer’s goal
subsumes that of its child layers. For example, obstacle avoidance is subsumed by
a foraging for food parent task and switched in when an obstacle is sensed in the
robot’s path.
Watkins (1989) discusses the possibility of hierarchical control consisting of cou-
pled Markov decision problems at each level. In his example, the top level, like the
navigator of an 18th century ship, provides a kind of gating mechanism, instructing
the helmsman on which direction to sail. Singh (1992) developed a gating mecha-
nism called Hierarchical-DYNA (H-DYNA). DYNA is a reinforcement learner that
uses both real and simulated experience after building a model of the reward and
state transition function. H-DYNA first learns elementary tasks such as to navigate
to specific goal locations. Each task is treated as an abstract action at a higher level
of control.
The “feudal” reinforcement learning algorithm (Dayan and Hinton, 1992) em-
phasises another desirable property of HRL: state abstraction. The authors call it
“information hiding”. It is the idea that decision models should be constructed at
coarser granularities further up the control hierarchy.
In the hierarchical distance to goal (HDG) algorithm, Kaelbling (1993) intro-
duces the important idea of composing the value function from distance components
along the path to a goal. HDG is modelled on navigation by landmarks. The idea is
to learn and store local distances to neighbouring landmarks and distances between
any two locations within each landmark region. Another function is used to store
shortest-distance information between landmarks as it becomes available from local
distance functions. The HDG algorithm aims for the next nearest landmark on the
way to the goal and uses the local distance function to guide its primitive actions. A
higher level controller switches lower level policies to target the next neighbouring
landmark whenever the agent enters the last targeted landmark region. The agent
therefore rarely travels through landmarks but uses them as points to aim for on its
way to the goal. This algorithm provided the inspiration for both the MAXQ value
function decomposition (to be discussed in more detail in Section 9.3.3) and value
function improvement with hierarchical greedy execution (Paragraph 9.2.6).
Moore et al (1999) have extended the HDG approach with the “airport-hierarchy”
algorithm. The aim is to find an optimal policy to move from a start state to a goal
state where both states can be selected from the set of all states. This multi-goal
problem requires an MDP with |S|2 states. By learning a set of goal-state reach-
ing abstract actions with progressively smaller “catchment” areas, it is possible to
approximate an optimal policy to any goal-state using, in the best case, only order
N log N states.
Abstract actions may be included in hybrid hierarchies with other learning or
planning methods mixed at different levels. For example, Ryan and Reid (2000) use
a hybrid approach (RL-TOP) to constructing task-hierarchies. In RL-TOP, planning
is used at the abstract level to invoke reactive planning operators, extended in time,
based on teleo-reactive programs (Nilsson, 1994). These operators use reinforce-
ment learning to achieve their post-conditions as sub-goals.
Three paradigms predominate in HRL: Options, a formalisation of abstract actions; HAMQ, a partial-program approach; and MAXQ, a value-function decomposition that includes state abstraction.
9.3.1 Options
One formalisation of an abstract action is the idea of an option (Sutton et al, 1999).⁴
Definition 9.3.1. An option (in relation to an MDP ⟨S, A, T, R⟩) is a triple ⟨I, π, β⟩ in which I ⊆ S is an initiation set, π : S × A → [0,1] is a policy, and β : S⁺ → [0,1] is a termination condition.
An option can be taken in state s ∈ S of the MDP if s ∈ I. This allows option ini-
tiation to be restricted to a subset of S. Once invoked the option takes actions as
determined by the stochastic policy π . Options terminate stochastically according
to function β that specifies the probability of termination in each state. Many tasks
are episodic, meaning that they will eventually terminate. When we include a single
⁴ Abstract actions can be formalised in slightly different ways. One such alternative is used with the HEXQ algorithm, to be described later.
abstract terminal state in the definition of the state space of an MDP, we write S⁺ to denote the total state space including the terminal abstract state. If the option is in
state s it will terminate with probability β (s), otherwise it will continue execution
by taking the next action a with probability π (s,a).
When option policies and termination depend on only the current state s, options
are called Markov options. We may wish to base a policy on information other than the current state s. For example, if we want the option to time out after a certain number of time-steps, we could add a counter to the state space. In general the option policy and termination can depend on the entire history of states, actions and rewards since the option was initiated. Options of this sort are called semi-Markov options. We will later discuss examples of options that are formulated by augmenting an MDP with a stochastic finite state machine, such as a program.
It is easy to see that a primitive action is an option. A primitive action a is equivalent to the option ⟨I = S, π(s,a) = 1.0, β(s) = 1 for all s ∈ S⟩. It is possible to unify the set of options and primitive actions and take the set A to be their union.
Just as with policies over primitive actions, we can define policies over options. The option policy π : S × A → [0,1] is a function that selects an option with probability π(s,a), s ∈ S, a ∈ A. Since options select actions, and actions are just special
kinds of options, it is possible for options to select other options. In this way we can
form hierarchical structures to an arbitrary depth.
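The option triple and the SMDP backup of Section 9.2 combine naturally in an SMDP-style Q-learning update, where the bootstrapped value is discounted by γ^N for an option that happened to last N steps. The sketch below is illustrative only; the environment step function, learning rate and discount factor are assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable, Tuple

State = Tuple[int, int]

@dataclass(frozen=True)
class Option:
    """The option triple <I, pi, beta> of Definition 9.3.1 (a Markov option)."""
    initiation: Callable[[State], bool]     # I: states where the option may start
    policy: Callable[[State], str]          # pi: primitive action chosen in each state
    termination: Callable[[State], float]   # beta: probability of terminating in a state

def execute_option(step_fn, state, option, gamma=0.95):
    """Run an option until beta says stop; return (next state, discounted
    accumulated reward, number of elapsed time-steps N)."""
    total, discount, steps = 0.0, 1.0, 0
    while True:
        state, reward = step_fn(state, option.policy(state))
        total += discount * reward
        discount *= gamma
        steps += 1
        if random.random() < option.termination(state):
            return state, total, steps

def smdp_q_update(Q, s, o, s_next, reward, steps, options, alpha=0.1, gamma=0.95):
    """One SMDP-style Q-learning backup: the bootstrap value is discounted by
    gamma**N, where N is the option's (random) duration."""
    best_next = max((Q.get((s_next, o2), 0.0) for o2 in options
                     if o2.initiation(s_next)), default=0.0)
    old = Q.get((s, o), 0.0)
    Q[(s, o)] = old + alpha * (reward + gamma ** steps * best_next - old)
```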
9.3.2 HAMQ-Learning
If a small hand-coded set of abstract actions is known to help solve an MDP, we may be able to learn a policy with much less effort than it would take to solve the problem using primitive actions. This is because abstract actions can skip over large parts of the state space, terminating in a small subset of states. The original MDP may be made smaller because we now only need to learn a policy with the set of abstract actions over a reduced state space.
In the hierarchy of abstract machines (HAM) approach to HRL the designer
specifies abstract actions by providing stochastic finite state automata called ab-
stract machines that work jointly with the MDP (Parr and Russell, 1997). This ap-
proach explicitly specifies abstract actions, allowing users to provide background knowledge, more generally in the form of partial programs, with various levels of expressivity.
An abstract machine is a triple ⟨μ, I, δ⟩, where μ is a finite set of machine states,
I is a stochastic function from states of the MDP to machine states that determines
the initial machine state, and δ is a stochastic next-state function, mapping machine
states and MDP states to next machine states. The machine states are of different
types. Action-states specify the action to be taken given the state of the MDP to
be solved. Call-states execute another machine as a subroutine. Choice-states non-
deterministically select the next machine state. Halt-states halt the machine.
The parallel action of the abstract machine and the MDP yields a discrete-time
higher-level SMDP. Choice-states are states in which abstract actions can be initi-
ated. The abstract machine’s action-states and choice states generate a sequence of
actions that amount to an abstract action policy. If another choice-state is reached
before the currently executing machine has terminated, this is equivalent to an abstract
action selecting another abstract action, thereby creating another level in a hierarchy.
The abstract action is terminated by halt-states. A judicious specification of a HAM
may reduce the set of states of the original MDP associated with choice-points.
Fig. 9.3 An abstract machine (HAM) that provides routines for leaving rooms to the West and
North of the house in Figure 9.1. The choice is between two abstract actions. One to leave
the house through a West doorway, the other through a North doorway. If the West doorway
is chosen, the abstract action keeps taking a West primitive action until it has either moved
through the door and terminates or reaches a West wall. If at a West wall it takes only North
and South primitive actions at random until the West wall is no longer observable, i.e. it has
reached a doorway, whereupon it moves West through the doorway and terminates.
We can illustrate the operation of a HAM using the four-room example. The
abstract machine in Figure 9.3 provides choices for leaving a room to the West
or the North. In each room it will take actions that move the agent to a wall, and
perform a random walk along the wall until it finds the doorway. Primitive rewards
are summed between choice states. In this example we assume the agent’s initial
position is as shown in Figure 9.1. Only five states of the original MDP are states of
the SMDP. These states are the initial state of the agent and the states on the other
side of doorways where the abstract machine enters choice states. Reinforcement
learning methods update the value function for these five states in the usual way
with rewards accumulated since the last choice state.
Solving the SMDP yields an optimal policy for the agent to leave the house sub-
ject to the program constraints of the abstract machine. The best policy consists
of the three abstract actions, sequentially leaving a room to the West, North and North again. In this case it is not a globally optimal policy, because a random walk along a wall is not the most efficient way for the agent to reach a doorway.
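A HAM routine of this kind is easy to phrase as ordinary code. The sketch below collapses the machine states of the West-leaving routine in Figure 9.3 into observation tests, which is a simplification; the observation predicates and the env_step/observe interfaces are assumptions made for illustration.

```python
import random

def leave_west(obs):
    """One action-selection step of the 'leave by the West doorway' routine.
    `obs` is an assumed dictionary of sensor predicates about the MDP state.
    Returns a primitive action, or None at the halt-state."""
    if obs["through_west_door"]:
        return None                       # halt-state: return to the choice-state
    if obs["at_west_wall"]:
        return random.choice(["N", "S"])  # random walk along the wall
    return "W"                            # otherwise keep heading West

def run_machine(machine, env_step, observe, state):
    """Run a machine to its halt-state, summing primitive rewards in between,
    which yields one SMDP transition of the joint HAM-and-MDP process."""
    total = 0.0
    while True:
        action = machine(observe(state))
        if action is None:
            return state, total           # next choice-state and accumulated reward
        state, reward = env_step(state, action)
        total += reward
```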
Programmable HRL
In the most general case a HAM can be a program that executes any computable
mapping of the agent’s complete sensory-action history (Parr, 1998).
Andre and Russell extended the HAM approach by introducing more express-
ible agent design languages for HRL – Programmable HAM (PHAM) (Andre and
Russell, 2000) and ALisp, a Lisp-based high-level partial programming language
(Andre and Russell, 2002).
Golog is a logic programming language that allows agents to reason about actions, goals, perceptions, other agents, etc., using the situation calculus (Levesque et al, 1997). It has been extended into a partial-programming framework by embedding MDPs. Examples include decision-theoretic Golog (DTGolog) (Boutilier et al, 2000) and Readylog (Ferrein and Lakemeyer, 2008), which uses the options framework.
In each case the partial program allows users to provide background knowledge
about the problem structure using special choice-point routines that implement non-
deterministic actions for the agent to learn the best action to take from experience.
Programs of this kind leverage the expressiveness of the programming language to
succinctly specify (and solve) an SMDP.
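As a rough illustration of the idea, not ALisp syntax, the following Python sketch shows a partial program for the four-room task in which the designer fixes the control structure but leaves a labelled choice point to be resolved by learning; the choose helper, the Q-table and the callables passed in are all hypothetical:

import random
from collections import defaultdict

# Hypothetical choice-point helper: the structure of the program is fixed by
# the designer, while the decision made at each labelled choice point is
# learned from experience (here, epsilon-greedy over a Q-table keyed by the
# choice label, the current state, and the option).
Q = defaultdict(float)
EPSILON = 0.1

def choose(label, state, options):
    if random.random() < EPSILON:
        return random.choice(options)
    return max(options, key=lambda o: Q[(label, state, o)])

# A partial program for the four-room task: keep picking a room-leaving
# direction at a choice point until the agent is outside the house.
def leave_house(observe, outside, leave_room):
    state = observe()
    while not outside(state):
        direction = choose("leave-room", state, ["West", "North"])
        state = leave_room(state, direction)  # abstract action supplied by the caller
    return state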
9.3.3 MAXQ
MAXQ is an approach to HRL where the value function is decomposed over the
task hierarchy (Dietterich, 2000). It can lead to a compact representation of the
value function and makes sub-tasks context-free or portable.
MAXQ abstract actions are crafted by classifying subtask terminal states as either
goal states or non-goal states. Using disincentives for non-goal states, policies are
learnt for each subtask to encourage termination in goal states. This termination
predicate method may introduce an additional source of sub-optimality in the MDP
as “pseudo” rewards can distort the subtask policy.
A key feature of MAXQ is that it represents the value of a state as a decomposed
sum of sub-task completion values plus the expected reward for the immediate prim-
itive action. A completion value is the expected (discounted) cumulative reward to
complete the sub-task after taking the next abstract action.
We will derive the hierarchical decomposition following Dietterich (2000) and
extend the above SMDP notation by including explicit reference to a particular sub-
task m. Equation 9.3 for subtask m becomes

Qπ(m, s, a) = Vπ(ma, s) + ∑_{s′,N} T(m, s, a, s′, N) γ^N Qπ(m, s′, π(s′)).
Abstract action a for subtask m invokes a child subtask ma. The expected value of completing subtask ma is expressed as Vπ(ma, s). The hierarchical policy, π, is a set of policies, one for each subtask.
The completion function, Cπ (m,s,a), is the expected discounted cumulative re-
ward after completing abstract action a, in state s in subtask m, discounted back to
the point where a begins execution.
Cπ(m, s, a) = ∑_{s′,N} T(m, s, a, s′, N) γ^N Qπ(m, s′, π(s′))
The Q function for subtask m (Equation 9.4) can be expressed recursively as the value of completing the subtask that a invokes, ma, plus the completion value to the end of subtask m:

Qπ(m, s, a) = Vπ(ma, s) + Cπ(m, s, a)        (9.6)
To follow an optimal greedy policy given the hierarchy, the decomposition Equation
9.6 for the subtask implementing abstract action a is modified to choose the best
action a, i.e. V∗(ma, s) = max_a Q∗(ma, s, a). The introduction of the max operator
means that we have to perform a complete search through all the paths in the task-
hierarchy to determine the best action. Algorithm 18 performs such a depth-first
search and returns both the value and best action for subtask m in state s.
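A minimal Python sketch of this depth-first evaluation, in the spirit of Algorithm 18 but with our own names (the task-hierarchy, primitive reward estimates V and completion values C are assumed to be supplied as dictionaries):

# Recursively evaluate a (sub)task m in state s, returning the best value and
# the best action, using Q(m,s,a) = V(child of a, s) + C(m,s,a).
def evaluate(m, s, children, V, C):
    if not children[m]:                  # m is a primitive action
        return V[(m, s)], m
    best_value, best_action = float("-inf"), None
    for a in children[m]:                # a is an abstract or primitive child action
        child_value, _ = evaluate(a, s, children, V, C)
        q = child_value + C[(m, s, a)]   # value of a's subtask plus completion of m
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action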
As the depth of the task-hierarchy increases, this exhaustive search can become
prohibitive. Limiting the depth of the search is one way to control its complexity
(Hengst, 2004). To plan an international trip, for example, the flight and airport-
transfer methods need to be considered, but optimising which side of the bed to get
out of on the way to the bathroom on the day of departure can effectively be ignored
for higher level planning.
Fig. 9.4 The completion function components of the decomposed value function for the agent
following an optimal policy for the four-room problem in Figure 9.1. The agent is shown as
a solid black oval at the starting state.
For the four-room task-hierarchy in Figure 9.2, the decomposed value of the
agent’s state has three terms determined by the two levels in the task-hierarchy plus
a primitive action. With the agent located in the state shown in Figure 9.4 by a solid
black oval, the optimal value function for this state is the cost of the shortest path
out of the house. It is composed by adding the expected reward for taking the next
primitive action to the North, completing the lower-level sub-task of leaving the
room to the West, and completing the higher-level task of leaving the house.
Algorithm 19 gives the basic MAXQ learning algorithm. For an extended version of the algorithm, one that accelerates learning and distinguishes goal from non-goal terminal states using pseudo-rewards,
please see (Dietterich, 2000). Algorithm 20 initiates the MAXQ process that pro-
ceeds to learn and execute the task-hierarchy.
We will now put all the above ideas together and show how we can learn and ex-
ecute the four-room task in Figure 9.1 when the agent can start in any state. The
designer of the task-hierarchy in Figure 9.5 has recognised several state abstraction
opportunities. Recall that the state is described by the tuple (room, position).
The agent can leave each room by one of four potential doorways to the North,
East, South, or West, and we need to learn a separate navigation strategy for each.
However, because the rooms are identical, the room variable is irrelevant for intra-
room navigation.
Fig. 9.5 A task-hierarchy for the four-room task. The subtasks are room-leaving actions
and can be used in any of the rooms. The parent-level root subtask has just four states repre-
senting the four rooms. “X” indicates non-goal terminal states.
We also notice that room-leaving abstract actions always terminate in one state: a room-leaving abstract action "funnels" the agent through the doorway. For this reason the position-in-room states can be abstracted away, only the room state is retained at the root level, and only room-leaving abstract actions are deployed at
the root subtask. This means that instead of requiring 100 states for the root subtask
we only require four, one for each room. Also, we only need to learn 16 (4 states ×
4 actions) completion functions, instead of 400.
To solve the problem using the task-hierarchy with Algorithm 19, a main pro-
gram initialises expected primitive reward functions V (·,·) and completion functions
C(·,·,·) arbitrarily, and calls function MAXQ at the root subtask in the task-hierarchy
for starting state s0 , i.e. MAXQ(root,s0 ). MAXQ uses a Q-Learning like update
rule to learn expected rewards for primitive actions and completion values for all
subtasks.
With the values converged, α set to zero, and exploration turned off, MAXQ (Al-
gorithm 18) will execute a recursively optimal policy by searching for the shortest
path to exit the four rooms. An example of such a path and its decomposed value
function is shown in Figure 9.4 for one starting position.
Readers may be familiar with the mutilated checker-board problem showing that
problem representation plays a large part in its solution (Gamow and Stern, 1958). In
his seminal paper on six different representations for the missionaries and cannibals
problem, Amarel (1968) demonstrated the possibility of making machine learning
easier by discovering regularities and using them to formulating new representa-
tions. The choice of variables to represent states and actions in a reinforcement
learning problem plays a large part in providing opportunities to decompose the
problem.
Some researchers have tried to learn the hierarchical structure from the agent-
environment interaction. Most approaches look for sub-goals or sub-tasks that try
to partition the problem into near independent reusable sub-problems. Methods to
automatically decompose problems include ones that look for sub-goal bottleneck
or landmark states, and ones that find common behaviour trajectories or region
policies.
This is consistent with the principles advocated by Stone (1998), which include decomposing a problem into multiple layers of abstraction and learning tasks from the lowest level to the highest in a hierarchy, where the output of learning at one layer feeds
into the next layer. Utgoff and Stracuzzi (2002) point to the compression inherent
in the progression of learning from simple to more complex tasks. They suggest a
building block approach, designed to eliminate replication of knowledge structures.
Agents are seen to advance their knowledge by moving their “frontier of receptiv-
ity” as they acquire new concepts by building on earlier ones from the bottom up.
Their conclusion:
“Learning of complex structures can be guided successfully by assuming that local
learning methods are limited to simple tasks, and that the resulting building blocks are
available for subsequent learning”.
Some of the approaches used to learn abstract actions and structure include searching for common behaviour trajectories or common state-region policies (Thrun and
Schwartz, 1995; McGovern, 2002). Others look for bottleneck or landmark states
(Digney, 1998; McGovern, 2002; Menache et al, 2002). Şimşek and Barto (2004)
use a relative novelty measure to identify sub-goal states. Interestingly, Moore et al (1999) suggest that, for some navigation tasks, performance is insensitive to the position of landmarks: an automatically generated, random set of landmarks performs about as well as one whose landmarks are positioned more purposefully.
9.4.1 HEXQ
HEXQ searches for subspaces by exploring the transition and reward functions for
the projected state space onto each state variable. Subspace states are included in
the same block of a partition when transitions do not change the other variables
and the transition and reward functions are independent of other variables. This is
a stricter form of stochastic bisimulation homogeneity (Section 9.2.4), one where the context, in the guise of the variables other than the projected one, does not change. The associated state abstraction eliminates the context variables from the subspace as they
are irrelevant.
Whenever these conditions are violated for a state transition, an exit, represented
as a state-action pair, (s,a), is created. If exits cannot be reached from initial sub-
space states, the subspace is split and extra exits created. Creating exits is the mech-
anism by which subgoals are automatically generated.
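The following Python sketch (our own formulation, assuming a log of observed transitions over factored states) illustrates the simplest part of this test: a projected state-action pair becomes an exit whenever taking the action can change one of the other variables. A fuller implementation would also test whether the projected variable's transition and reward statistics depend on the other variables.

# transitions: list of (state, action, next_state) where states are tuples of
# variable values; var is the index of the projected variable.
def find_exits(transitions, var):
    exits = set()
    for s, a, s2 in transitions:
        others_changed = any(s[i] != s2[i] for i in range(len(s)) if i != var)
        if others_changed:
            exits.add((s[var], a))   # projected state-action pair is an exit
    return exits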
This process partitions the projected state space for each variable. Each block
of the partition forms a set of subtasks, one for each exit. Each subtask is an SMDP
with the goal of terminating on exit execution. HEXQ subtasks are like options,
except that the termination condition is defined by a state-action pair. The reward
on termination is not a part of the subtask, but is counted at higher levels in the
task-hierarchy. This represents a slight reformulation of the MAXQ value function
decomposition, but one that unifies the definition of the Q function over the task-
hierarchy and simplifies the recursive equations.
Early versions of HEXQ learn a monolithic hierarchy after ordering state vari-
ables by their frequency of change (Hengst, 2002). In more recent versions, state
variables are tackled in parallel (Hengst, 2008). The projected state-space for each
variable is partitioned into blocks as before. Parent level states are new variables
formed by taking the cross-product of block identifiers from the child variables.
This method of parallel decomposition of the factored state-space generates sequential (or multi-tasking) actions that invoke multiple child subtasks, one at a time. Sequential actions create partial-order task-hierarchies that have the potential for greater state abstraction (Hengst, 2008).
For the four-room task, HEXQ is provided with the factored state variable pairs
(room, position). It starts by forming one module for each variable. The position
module discovers one block with four exits. Exits are transitions that leave any room
by one of the doorways. They are discovered automatically because it is only for
these transitions that the room variable may change. The position module will for-
mulate four room-leaving subtasks, one for each of the exits. The subtask policies
can be learnt in parallel using standard off-policy Q-learning for SMDPs.
The room module learns a partition of the room states that consists of singleton
state blocks, because executing a primitive action in any room may change the po-
sition variable value. For each block there are four exits. The block identifier is just
the room state and is combined with the block identifier from the position module
to form a new state variable for the parent module.
In this case the room module does not add any value in the form of state abstrac-
tion and could be eliminated with the room variable passed directly to the parent
module as part of the input. Either way the parent module now represents abstract
room states and invokes room-leaving abstract actions to achieve its goal. The ma-
chine generated task-hierarchy is similar to that shown in Figure 9.5.
Interestingly, if the four-room state is defined using coordinates (x,y), where x
and y range in value from 0 to 9, instead of the more descriptive (room,position),
HEXQ will nevertheless create a higher level variable representing four rooms, but
one that was not supplied by the designer. Each of the x and y variables is par-
titioned by HEXQ into two blocks representing the space inside each room. The
cross-product of the block identifiers creates a higher-level variable representing the
four rooms. Sequential abstract actions are generated to move East-West and North-
South to leave rooms. In this decomposition the subspaces have a different meaning.
They model individual rooms instead of a generic room (Hengst, 2008).
Summary of HEXQ
Transfer learning has been studied in the reinforcement learning setting, including the transfer of abstract actions, state abstractions and hierarchical structure. For effective knowledge transfer the two environments need to be
“close enough”. Castro and Precup (2010) show that transferring abstract actions
is more successful than just primitive actions using a variant of the bisimulation
metric.
State abstraction for MAXQ like task-hierarchies is hindered by discounted re-
ward optimality models because the rewards after taking an abstract action are no
longer independent of the time to complete the abstract action. Hengst (2007) has
developed a method for decomposing discounted value functions over the task-
hierarchy by concurrently decomposing multi-time models (Precup and Sutton,
1997). The twin recursive decomposition functions restore state-abstraction oppor-
tunities and allow problems with continuing subtasks in the task-hierarchy to be
included in the class of problems that can be tackled by HRL.
Much of the literature in reinforcement learning involves one-dimensional ac-
tions. However, in many domains we wish to control several action variables simul-
taneously. These situations arise, for example, when coordinating teams of robots,
or when individual robots have multiple degrees of articulation. The challenge is to
decompose factored actions over a task-hierarchy, particularly if there is a chance
that the abstract actions will interact. Rohanimanesh and Mahadevan (2001) use
the options framework to show how actions can be parallelized with an SMDP for-
malism. Fitch et al (2005) demonstrate, using a two-taxi task, concurrent action
task-hierarchies and state abstraction to scale problems involving concurrent ac-
tions. Concurrent actions require a subtask termination scheme (Rohanimanesh and
Mahadevan, 2005) and the attribution of subtask rewards. Marthi et al (2005) extend
partial programs (Section 9.2.3) to the concurrent action case using multi-threaded
concurrent-ALisp.
When the state is not fully observable, or the observations are noisy, the MDP becomes a Partially Observable Markov Decision Problem, or POMDP (Chapter 12). Wiering and Schmidhuber (1997) decompose a POMDP into sequences of simpler subtasks; Hernandez and Mahadevan (2000) solve partially observable se-
quential decision tasks by propagating reward across long decision sequences us-
ing a memory-based SMDP. Pineau and Thrun (2002) present an algorithm for
planning in structured POMDPs using an action based decomposition to parti-
tion a complex problem into a hierarchy of smaller subproblems. Theocharous
and Kaelbling (2004) derive a hierarchical partially observable Markov decision
problem (HPOMDP) from hierarchical hidden Markov models extending previous
work to include multiple entry and exit states to represent the spatial borders of the
sub-space.
New structure learning techniques continue to be developed. The original HEXQ
decomposition uses a simple heuristic to determine an ordering over the state
variables for the decomposition. Jonsson and Barto (2006) propose a Bayesian
network model causal graph based approach – Variable Influence Structure Anal-
ysis (VISA) – that relates the way variables influence each other to construct the
task-hierarchy. Unlike HEXQ this algorithm combines variables that influence each
other and ignores lower-level activity. Bakker and Schmidhuber (2004)'s HASSLE algorithm learns hierarchical structure through subgoal discovery and subpolicy specialization.
9.6 Summary
References
Agre, P.E., Chapman, D.: Pengi: an implementation of a theory of activity. In: Proceedings of
the Sixth National Conference on Artificial Intelligence, AAAI 1987, vol. 1, pp. 268–272.
AAAI Press (1987)
Amarel, S.: On representations of problems of reasoning about actions. In: Michie, D. (ed.)
Machine Intelligence, vol. 3, pp. 131–171. Edinburgh at the University Press, Edinburgh
(1968)
Andre, D., Russell, S.J.: Programmable reinforcement learning agents. In: Leen, T.K., Diet-
terich, T.G., Tresp, V. (eds.) NIPS, pp. 1019–1025. MIT Press (2000)
Andre, D., Russell, S.J.: State abstraction for programmable reinforcement learning agents.
In: Dechter, R., Kearns, M., Sutton, R.S. (eds.) Proceedings of the Eighteenth National
Conference on Artificial Intelligence, pp. 119–125. AAAI Press (2002)
Ashby, R.: Design for a Brain: The Origin of Adaptive Behaviour. Chapman & Hall, London
(1952)
Ashby, R.: Introduction to Cybernetics. Chapman & Hall, London (1956)
Bakker, B., Schmidhuber, J.: Hierarchical reinforcement learning based on subgoal discov-
ery and subpolicy specialization. In: Proceedings of the 8-th Conference on Intelligent
Autonomous Systems, IAS-8, pp. 438–445 (2004)
Barto, A.G., Mahadevan, S.: Recent advances in hierarchical reinforcement learning. Special
Issue on Reinforcement Learning, Discrete Event Systems Journal 13, 41–77 (2003)
Bellman, R.: Adaptive Control Processes: A Guided Tour. Princeton University Press, Prince-
ton (1961)
Boutilier, C., Dearden, R., Goldszmidt, M.: Exploiting structure in policy construction. In:
Proceedings of the 14th International Joint Conference on Artificial Intelligence, vol. 2,
pp. 1104–1111. Morgan Kaufmann Publishers Inc., San Francisco (1995)
Boutilier, C., Reiter, R., Soutchanski, M., Thrun, S.: Decision-theoretic, high-level agent pro-
gramming in the situation calculus. In: Proceedings of the Seventeenth National Con-
ference on Artificial Intelligence and Twelfth Conference on Innovative Applications of
Artificial Intelligence, pp. 355–362. AAAI Press (2000)
Brooks, R.A.: Elephants don’t play chess. Robotics and Autonomous Systems 6, 3–15 (1990)
Castro, P.S., Precup, D.: Using bisimulation for policy transfer in mdps. In: Proceedings of
the 9th International Conference on Autonomous Agents and Multiagent Systems, AA-
MAS 2010, vol. 1, pp. 1399–1400. International Foundation for Autonomous Agents and
Multiagent Systems, Richland (2010)
Clark, A., Thornton, C.: Trading spaces: Computation, representation, and the limits of unin-
formed learning. Behavioral and Brain Sciences 20(1), 57–66 (1997)
Dayan, P., Hinton, G.E.: Feudal reinforcement learning. In: Advances in Neural Information
Processing Systems (NIPS), vol. 5 (1992)
Dean, T., Givan, R.: Model minimization in Markov decision processes. In: AAAI/IAAI, pp.
106–111 (1997)
Dean, T., Lin, S.H.: Decomposition techniques for planning in stochastic domains. Tech. Rep.
CS-95-10, Department of Computer Science Brown University (1995)
Dietterich, T.G.: Hierarchical reinforcement learning with the MAXQ value function decom-
position. Journal of Artificial Intelligence Research 13, 227–303 (2000)
Digney, B.L.: Learning hierarchical control structures for multiple tasks and changing envi-
ronments. From Animals to Animats 5: Proceedings of the Fifth International Conference
on Simulation of Adaptive Behaviour SAB (1998)
Ferrein, A., Lakemeyer, G.: Logic-based robot control in highly dynamic domains. Robot
Auton. Syst. 56(11), 980–991 (2008)
Fitch, R., Hengst, B., Šuc, D., Calbert, G., Scholz, J.: Structural Abstraction Experiments
in Reinforcement Learning. In: Zhang, S., Jarvis, R.A. (eds.) AI 2005. LNCS (LNAI),
vol. 3809, pp. 164–175. Springer, Heidelberg (2005)
Forestier, J., Varaiya, P.: Multilayer control of large Markov chains. IEEE Transactions on Auto-
matic Control 23, 298–304 (1978)
Gamow, G., Stern, M.: Puzzle-math. Viking Press (1958)
Ghavamzadeh, M., Mahadevan, S.: Continuous-time hierarchical reinforcement learning. In:
Proc. 18th International Conf. on Machine Learning, pp. 186–193. Morgan Kaufmann,
San Francisco (2001)
Ghavamzadeh, M., Mahadevan, S.: Hierarchical policy gradient algorithms. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), pp. 226–233 (2003)
Hauskrecht, M., Meuleau, N., Kaelbling, L.P., Dean, T., Boutilier, C.: Hierarchical solution
of Markov decision processes using macro-actions. In: Fourteenth Annual Conference on
Uncertainty in Artificial Intelligence, pp. 220–229 (1998)
Hengst, B.: Discovering hierarchy in reinforcement learning with HEXQ. In: Sammut, C.,
Hoffmann, A. (eds.) Proceedings of the Nineteenth International Conference on Machine
Learning, pp. 243–250. Morgan Kaufmann (2002)
Hengst, B.: Model Approximation for HEXQ Hierarchical Reinforcement Learning. In:
Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS
(LNAI), vol. 3201, pp. 144–155. Springer, Heidelberg (2004)
Hengst, B.: Safe State Abstraction and Reusable Continuing Subtasks in Hierarchical Re-
inforcement Learning. In: Orgun, M.A., Thornton, J. (eds.) AI 2007. LNCS (LNAI),
vol. 4830, pp. 58–67. Springer, Heidelberg (2007)
Hengst, B.: Partial Order Hierarchical Reinforcement Learning. In: Wobcke, W., Zhang, M.
(eds.) AI 2008. LNCS (LNAI), vol. 5360, pp. 138–149. Springer, Heidelberg (2008)
Hernandez, N., Mahadevan, S.: Hierarchical memory-based reinforcement learning. In:
Fifteenth International Conference on Neural Information Processing Systems, Denver
(2000)
Hutter, M.: Universal algorithmic intelligence: A mathematical top→down approach. In: Ar-
tificial General Intelligence, pp. 227–290. Springer, Berlin (2007)
Jong, N.K., Stone, P.: Compositional models for reinforcement learning. In: The European
Conference on Machine Learning and Principles and Practice of Knowledge Discovery in
Databases (2009)
Jonsson, A., Barto, A.G.: Causal graph based decomposition of factored mdps. Journal of
Machine Learning 7, 2259–2301 (2006)
Kaelbling, L.P.: Hierarchical learning in stochastic domains: Preliminary results. In: Pro-
ceedings of the Tenth International Conference Machine Learning, pp. 167–173. Morgan
Kaufmann, San Mateo (1993)
Konidaris, G., Barto, A.G.: Building portable options: skill transfer in reinforcement learning.
In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp.
895–900. Morgan Kaufmann Publishers Inc., San Francisco (2007)
Konidaris, G., Barto, A.G.: Skill discovery in continuous reinforcement learning domains us-
ing skill chaining. In: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C.K.I., Culotta,
A. (eds.) Advances in Neural Information Processing Systems, vol. 22, pp. 1015–1023
(2009)
Konidaris, G., Kuindersma, S., Barto, A.G., Grupen, R.: Constructing skill trees for reinforce-
ment learning agents from demonstration trajectories. In: Advances in Neural Information
Processing Systems NIPS, vol. 23 (2010)
Korf, R.E.: Learning to Solve Problems by Searching for Macro-Operators. Pitman Publish-
ing Inc., Boston (1985)
Levesque, H., Reiter, R., Lespérance, Y., Lin, F., Scherl, R.: Golog: A logic programming
language for dynamic domains. Journal of Logic Programming 31, 59–84 (1997)
Mahadevan, S.: Representation discovery in sequential decision making. In: 24th Confer-
ence on Artificial Intelligence (AAAI), Atlanta, July 11-15 (2010)
Marthi, B., Russell, S., Latham, D., Guestrin, C.: Concurrent hierarchical reinforcement
learning. In: Proc. IJCAI 2005 Edinburgh, Scotland (2005)
Marthi, B., Kaelbling, L., Lozano-Perez, T.: Learning hierarchical structure in policies. In:
NIPS 2007 Workshop on Hierarchical Organization of Behavior (2007)
McGovern, A.: Autonomous Discovery of Abstractions Through Interaction with an Envi-
ronment. In: Koenig, S., Holte, R.C. (eds.) SARA 2002. LNCS (LNAI), vol. 2371, pp.
338–339. Springer, Heidelberg (2002)
Mehta, N., Natarajan, S., Tadepalli, P., Fern, A.: Transfer in variable-reward hierarchical rein-
forcement learning. Mach. Learn. 73, 289–312 (2008a), doi:10.1007/s10994-008-5061-y
Mehta, N., Ray, S., Tadepalli, P., Dietterich, T.: Automatic discovery and transfer of maxq
hierarchies. In: Proceedings of the 25th International Conference on Machine Learning,
ICML 2008, pp. 648–655. ACM, New York (2008b)
Menache, I., Mannor, S., Shimkin, N.: Q-Cut - Dynamic Discovery of Sub-goals in Rein-
forcement Learning. In: Elomaa, T., Mannila, H., Toivonen, H. (eds.) ECML 2002. LNCS
(LNAI), vol. 2430, pp. 295–305. Springer, Heidelberg (2002)
Moerman, W.: Hierarchical reinforcement learning: Assignment of behaviours to subpoli-
cies by self-organization. PhD thesis, Cognitive Artificial Intelligence, Utrecht University
(2009)
Moore, A., Baird, L., Kaelbling, L.P.: Multi-value-functions: Efficient automatic action hier-
archies for multiple goal mdps. In: Proceedings of the International Joint Conference on
Artificial Intelligence, pp. 1316–1323. Morgan Kaufmann, San Francisco (1999)
Mugan, J., Kuipers, B.: Autonomously learning an action hierarchy using a learned qualitative
state representation. In: Proceedings of the 21st International Joint Conference on Artificial
Intelligence, pp. 1175–1180. Morgan Kaufmann Publishers Inc., San Francisco (2009)
Neumann, G., Maass, W., Peters, J.: Learning complex motions by sequencing simpler mo-
tion templates. In: Proceedings of the 26th Annual International Conference on Machine
Learning, ICML 2009, pp. 753–760. ACM, New York (2009)
Nilsson, N.J.: Teleo-reactive programs for agent control. Journal of Artificial Intelligence
Research 1, 139–158 (1994)
Osentoski, S., Mahadevan, S.: Basis function construction for hierarchical reinforcement
learning. In: Proceedings of the 9th International Conference on Autonomous Agents and
Multiagent Systems, AAMAS 2010, vol. 1, pp. 747–754. International Foundation for
Autonomous Agents and Multiagent Systems, Richland (2010)
Parr, R., Russell, S.J.: Reinforcement learning with hierarchies of machines. In: NIPS (1997)
Parr, R.E.: Hierarchical control and learning for Markov decision processes. PhD thesis,
University of California at Berkeley (1998)
Pineau, J., Thrun, S.: An integrated approach to hierarchy and abstraction for pomdps. CMU
Technical Report: CMU-RI-TR-02-21 (2002)
Polya, G.: How to Solve It: A New Aspect of Mathematical Method. Princeton University
Press (1945)
Precup, D., Sutton, R.S.: Multi-time models for temporally abstract planning. In: Advances
in Neural Information Processing Systems, vol. 10, pp. 1050–1056. MIT Press (1997)
Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming.
John Wiley & Sons, Inc., New York (1994)
Ravindran, B., Barto, A.G.: SMDP homomorphisms: An algebraic approach to abstraction
in semi Markov decision processes. In: Proceedings of the Eighteenth International Joint
Conference on Artificial Intelligence, IJCAI 2003 (2003)
10 Evolutionary Computation for Reinforcement Learning

Shimon Whiteson
Informatics Institute, University of Amsterdam
e-mail: [email protected]

10.1 Introduction
best performing ones are selected as the basis for a new population. This new pop-
ulation is formed via reproduction, in which the selected policies are mated (i.e.,
components of two different solutions are combined) and mutated (i.e., the parame-
ter values of one solution are stochastically altered). This process repeats over many
iterations, until a sufficiently fit solution has been found or the available computa-
tional resources have been exhausted.
There is an enormous number of variations on this approach, such as multi-
objective methods (Deb, 2001; Coello et al, 2007), diversifying algorithms (Holland,
1975; Goldberg and Richardson, 1987; Mahfoud, 1995; Potter and De Jong, 1995;
Darwen and Yao, 1996) and distribution-based methods (Larranaga and Lozano,
2002; Hansen et al, 2003; Rubinstein and Kroese, 2004). However, the basic ap-
proach is extremely general and can in principle be applied to all optimization prob-
lems for which f can be specified.
Included among these optimization problems are reinforcement-learning tasks
(Moriarty et al, 1999). In this case, C corresponds to the set of possible policies, e.g.,
mappings from S to A, and f (c) is the average cumulative reward obtained while
using such a policy in a series of Monte Carlo trials in the task. In other words, in
an evolutionary approach to reinforcement learning, the algorithm directly searches
the space of policies for one that maximizes the expected cumulative reward.
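A bare-bones Python sketch of this idea (a toy (1+λ) scheme with placeholder callables for the policy representation, mutation operator and episode runner, none of which come from the chapter) makes the Monte Carlo fitness evaluation explicit:

def fitness(policy, run_episode, n_trials=10):
    # Average cumulative reward over several Monte Carlo trials.
    return sum(run_episode(policy) for _ in range(n_trials)) / n_trials

def evolve(init_policy, mutate, run_episode, generations=100, offspring=20):
    # Simple (1 + lambda) evolution: keep the best of parent and mutants.
    parent = init_policy()
    parent_fit = fitness(parent, run_episode)
    for _ in range(generations):
        for _ in range(offspring):
            child = mutate(parent)
            child_fit = fitness(child, run_episode)
            if child_fit > parent_fit:
                parent, parent_fit = child, child_fit
    return parent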
Like many other policy-search methods, this approach reasons only about the
value of entire policies, without constructing value estimates for particular state-
action pairs, as temporal-difference methods do. The holistic nature of this approach
is sometimes criticized. For example, Sutton and Barto write:
Evolutionary methods do not use the fact that the policy they are searching for is a
function from states to actions; they do not notice which states an individual passes
through during its lifetime, or which actions it selects. In some cases this information
can be misleading (e.g., when states are misperceived) but more often it should enable
more efficient search (Sutton and Barto, 1998, p. 9).
These facts can put evolutionary methods at a theoretical disadvantage. For exam-
ple, in some circumstances, dynamic programming methods are guaranteed to find
an optimal policy in time polynomial in the number of states and actions (Littman
et al, 1995). By contrast, evolutionary methods, in the worst case, must iterate over
an exponential number of candidate policies before finding the best one. Empirical
results have also shown that evolutionary methods sometimes require more episodes
than temporal-difference methods to find a good policy, especially in highly stochas-
tic tasks in which many Monte Carlo simulations are necessary to achieve a reliable
estimate of the fitness of each candidate policy (Runarsson and Lucas, 2005; Lucas
and Runarsson, 2006; Lucas and Togelius, 2007; Whiteson et al, 2010b).
However, despite these limitations, evolutionary computation remains a popular
tool for solving reinforcement-learning problems and boasts a wide range of empir-
ical successes, sometimes substantially outperforming temporal-difference methods
(Whitley et al, 1993; Moriarty and Miikkulainen, 1996; Stanley and Miikkulainen,
2002; Gomez et al, 2008; Whiteson et al, 2010b). There are three main reasons why.
First, evolutionary methods can cope well with partial observability. While evo-
lutionary methods do not exploit the relationship between subsequent states that an
agent visits, this can be advantageous when the agent is unsure about its state. Since
temporal-difference methods rely explicitly on the Markov property, their value es-
timates can diverge when it fails to hold, with potentially catastrophic consequences
for the performance of the greedy policy. In contrast, evolutionary methods do not
rely on the Markov property and will always select the best policies they can find
for the given task. Severe partial observability may place a ceiling on the perfor-
mance of such policies, but optimization within the given policy space proceeds
normally (Moriarty et al, 1999). In addition, representations that use memory to re-
duce partial observability, such as recurrent neural networks, can be optimized in
a natural way with evolutionary methods (Gomez and Miikkulainen, 1999; Stanley
and Miikkulainen, 2002; Gomez and Schmidhuber, 2005a,b).
Second, evolutionary methods can make it easier to find suitable representations
for the agent’s solution. Since policies need only specify an action for each state,
instead of the value of each state-action pair, they can be simpler to represent. In
addition, it is possible to simultaneously evolve a suitable policy representation (see
Sections 10.3 and 10.4.2). Furthermore, since it is not necessary to perform learning
updates on a given candidate solution, it is possible to use more elaborate represen-
tations, such as those employed by generative and developmental systems (GDS)
(see Section 10.6).
Third, evolutionary methods provide a simple way to solve problems with large
or continuous action spaces. Many temporal-difference methods are ill-suited to
such tasks because they require iterating over the action space in each state in or-
der to identify the maximizing action. In contrast, evolutionary methods need only
evolve policies that directly map states to actions. Of course, actor-critic methods
(Doya, 2000; Peters and Schaal, 2008) and other techniques (Gaskett et al, 1999;
Millán et al, 2002; van Hasselt and Wiering, 2007) can also be used to make
temporal-difference methods suitable for continuous action spaces. Nonetheless,
evolutionary methods provide a simple, effective way to address such difficulties.
Of course, none of these arguments are unique to evolutionary methods, but ap-
ply in principle to other policy-search methods too. However, evolutionary methods
have proven a particularly popular way to search policy space and, consequently,
there is a rich collection of algorithms and results for the reinforcement-learning
setting. Furthermore, as modern methods, such as distribution-based approaches,
depart further from the original genetic algorithms, their resemblance to the pro-
cess of natural selection has decreased. Thus, the distinction between evolutionary
methods and other policy search approaches has become fuzzier and less important.
This chapter provides an introduction to and overview of evolutionary methods
for reinforcement learning. The vastness of the field makes it infeasible to address
all the important developments and results. In the interest of clarity and brevity, this
chapter focuses heavily on neuroevolution (Yao, 1999), in which evolutionary meth-
ods are used to evolve neural networks (Haykin, 1994), e.g., to represent policies.
While evolutionary reinforcement learning is by no means limited to neural-network
representations, neuroevolutionary approaches are by far the most common. Fur-
thermore, since neural networks are a popular and well-studied representation in
general, they are a suitable object of focus for this chapter.
10.2 Neuroevolution
1 This does not apply to the hybrid methods discussed in Section 10.4.
10.3 TWEANNs
In its simplest form, Algorithm 21 evolves only neural networks with fixed repre-
sentations. In such a setup, all the networks in a particular evolutionary run have
the same topology, i.e., both the number of hidden nodes and the set of edges con-
necting the nodes are fixed. The networks differ only with respect to the weights of
these edges, which are optimized by evolution. The use of fixed representations is
by no means unique to neuroevolution. In fact, though methods exist for automat-
ically discovering good representations for value-functions (Mahadevan and Mag-
gioni, 2007; Parr et al, 2007) temporal-difference methods typically also use fixed
representations for function approximation.
Nonetheless, reliance on fixed representations is a significant limitation. The pri-
mary reason is that it requires the user of the algorithm to correctly specify a good
representation in advance. Clearly, choosing too simple a representation will doom
evolution to poor performance, since describing high quality solutions becomes im-
possible. However, choosing too complex a representation can be just as harmful.
While such a representation can still describe good solutions, finding them may be-
come infeasible. Since each weight in the network corresponds to a dimension of
the search space, a representation with too many edges can lead to an intractable
search problem.
In most tasks, the user is not able to correctly guess the right representation.
Even in cases where the user possesses great domain expertise, deducing the right
representation from this expertise is often not possible. Typically, finding a good
representation becomes a process of trial and error. However, repeatedly running
evolution until a suitable representation is found greatly increases computational
costs. Furthermore, in on-line tasks (see Section 10.7) it also increases the real-
world costs of trying out policies in the target environment.
For these reasons, many researchers have investigated ways to automate the dis-
covery of good representations (Dasgupta and McGregor, 1992; Radcliffe, 1993;
Gruau, 1994; Stanley and Miikkulainen, 2002). Evolutionary methods are well
suited to this challenge because they take a direct policy-search approach to rein-
forcement learning. In particular, since neuroevolution already directly searches the
space of network weights, it can also simultaneously search the space of network
topologies. Methods that do so are sometimes called topology- and weight-evolving
artificial neural networks (TWEANNs).
Perhaps the earliest and simplest TWEANN is the structured genetic algorithm
(sGA) (Dasgupta and McGregor, 1992), which uses a two-part representation to
describe each network. The first part represents the connectivity of the network in
the form of a binary matrix. Rows and columns correspond to nodes in the network
and the value of each cell indicates whether an edge exists connecting the given pair
of nodes. The second part represents the weights of each edge in the network. In
principle, by evolving these binary matrices along with connection weights, sGA
can automatically discover suitable network topologies. However, sGA suffers from
several limitations. In the following section, we discuss these limitations in order to
highlight the main challenges faced by all TWEANNs.
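As a concrete illustration (hypothetical, not the original sGA code), the following Python sketch decodes such a two-part genome, a binary connectivity matrix plus a weight matrix, into the set of active weighted edges:

# sGA-style two-part genome: a binary connectivity matrix plus a real-valued
# weight matrix over the same set of nodes.
def decode_sga_genome(connectivity, weights):
    n = len(connectivity)
    edges = {}
    for i in range(n):
        for j in range(n):
            if connectivity[i][j] == 1:          # edge i -> j is switched on
                edges[(i, j)] = weights[i][j]    # its weight comes from the second part
    return edges

# Example: a 3-node network in which only node 0 feeds node 2.
connectivity = [[0, 0, 1],
                [0, 0, 0],
                [0, 0, 0]]
weights = [[0.0, 0.0, 0.7],
           [0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0]]
print(decode_sga_genome(connectivity, weights))   # {(0, 2): 0.7}

Mutating the binary matrix changes the topology, while mutating the weight matrix changes only the parameters of the existing topology.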
10.3.1 Challenges
There are three main challenges to developing a successful TWEANN. The first
is the competing conventions problem. In most tasks, there are multiple different
policies that have similar fitness. For example, many tasks contain symmetries that
give rise to several equivalent solutions. This can lead to difficulties for evolution
because of its reliance on crossover operators to breed new networks. When two
networks that represent different policies are combined, the result is likely to be
destructive, producing a policy that cannot successfully carry out the strategy used
by either parent.
While competing conventions can arise in any evolutionary method that uses
crossover, the problem is particularly severe for TWEANNs. Two parents may not
only implement different policies but also have different representations. There-
fore, to be effective, TWEANNs need a mechanism for combining networks with
different topologies in a way that minimizes the chance of catastrophic crossover.
Clearly, sGA does not meet this challenge, since the binary matrices it evolves are
crossed over without regard to incompatibility in representations. In fact, the dif-
ficulties posed by the competing conventions problem were a major obstacle for
early TWEANNs, to the point that some researchers simply avoided the problem by
developing methods that do not perform crossover at all (Radcliffe, 1993).
The second challenge is the need to protect topological innovations long enough
to optimize the associated weights. Typically, when new topological structures are
introduced (e.g., the addition of a new hidden node or edge), it has a negative ef-
fect on fitness even if that structure will eventually be necessary for a good policy.
The reason is that the weights associated with the new structure have not yet been
optimized.
For example, consider an edge in a network evolved via sGA that is not activated,
i.e., its cell in the binary matrix is set to zero. The corresponding weight for that
edge will not experience any selective pressure, since it is not manifested in the
network. If evolution suddenly activates that edge, the effect on fitness is likely to be
detrimental, since its weight is not optimized. Therefore, if topological innovations
are not explicitly protected, they will typically be eliminated from the population,
causing the search for better topologies to stagnate.
Fortunately, protecting innovation is a well-studied problem in evolutionary com-
putation. Speciation and niching methods (Holland, 1975; Goldberg and Richard-
son, 1987; Mahfoud, 1995; Potter and De Jong, 1995; Darwen and Yao, 1996) en-
sure diversity in the population, typically by segregating disparate individuals and/or
penalizing individuals that are too similar to others. However, using such methods
requires a distance metric to quantify the differences between individuals. Devis-
ing such a metric is difficult for TWEANNs, since it is not clear how to compare
networks with different topologies.
The third challenge is how to evolve minimal solutions. As mentioned above,
a central motivation for TWEANNs is the desire to avoid optimizing overly com-
plex topologies. However, if evolution is initialized with a population of randomly
chosen topologies, as in many TWEANNs, some of these topologies may already
be too complex. Thus, at least part of the evolutionary search will be conducted in
an unnecessarily high dimensional space. It is possible to explicitly reward smaller
solutions by adding size penalties in the fitness function (Zhang and Muhlenbein,
1993). However, there is no principled way to determine the size of the penalties
without prior knowledge about the topological complexity required for the task.
10.3.2 NEAT
Fig. 10.2 Structural mutation operators in NEAT. At left, a new node is added by splitting an
existing edge in two. At right, a new link (edge) is added between two existing nodes.
10.4 Hybrids
Many researchers have investigated hybrid methods that combine evolution with
supervised or unsupervised learning methods. In such systems, the individuals being
evolved do not remain fixed during their fitness evaluations. Instead, they change
during their ‘lifetimes’ by learning from the environments with which they interact.
Much of the research on hybrid methods focuses on analyzing the dynamics that
result when evolution and learning interact. For example, several studies (Whitley
et al, 1994; Yamasaki and Sekiguchi, 2000; Pereira and Costa, 2001; Whiteson and
Stone, 2006a) have used hybrids to compare Lamarckian and Darwinian systems.
In Lamarckian systems, the phenotypic effects of learning are copied back into the
genome before reproduction, allowing new offspring to inherit them. In Darwinian
systems, which more closely model biology, learning does not affect the genome.
As other hybrid studies (Hinton and Nowlan, 1987; French and Messinger, 1994;
Arita and Suzuki, 2000) have shown, Darwinian systems can indirectly transfer the
results of learning into the genome by way of the Baldwin effect (Baldwin, 1896), in
which learning creates selective pressures favoring individuals who innately possess
attributes that were previously learned.
Hybrid methods have also been employed to improve performance on supervised
learning tasks (Gruau and Whitley, 1993; Boers et al, 1995; Giraud-Carrier, 2000;
Schmidhuber et al, 2005, 2007). However, such methods are not directly applicable
to reinforcement-learning problems because the labeled data they require is absent.
Nonetheless, many hybrid methods for reinforcement learning have been
developed. To get around the problem of missing labels, researchers have employed
unsupervised learning (Stanley et al, 2003), trained individuals to resemble their par-
ents (McQuesten and Miikkulainen, 1997), trained them to predict state transitions
(Nolfi et al, 1994), and trained them to teach themselves (Nolfi and Parisi, 1997).
However, perhaps the most natural hybrids for the reinforcement learning setting are combinations of evolution with temporal-difference methods (Ackley and Littman,
1991; Wilson, 1995; Downing, 2001; Whiteson and Stone, 2006a). In this section,
we survey two such hybrids: evolutionary function approximation and XCS, a type
of learning classifier system.
10.4.1 Evolutionary Function Approximation

If the weights of the networks NEAT evolves are updated during their fitness evaluations using Q-learning and backpropagation, they will effectively evolve value functions
instead of action selectors. Hence, the outputs are no longer arbitrary values; they
represent the long-term discounted values of the associated state-action pairs and
are used, not just to select the most desirable action, but to update the estimates of
other state-action pairs.
Algorithm 22 shows the inner loop of NEAT+Q, replacing lines 9–13 in Algo-
rithm 21. Each time the agent takes an action, the network is backpropagated to-
wards Q-learning targets (line 7) and ε -greedy selection occurs (lines 4–5). Figure
10.3 illustrates the complete algorithm: networks are selected from the population
for evaluation and the Q-values they produce are used to select actions. The resulting
feedback from the environment is used both to perform TD updates and to measure
the network’s fitness, i.e., the total reward it accrues while learning.
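A schematic Python sketch of this inner loop (a simplification with hypothetical network and environment interfaces, not the actual NEAT+Q implementation):

import random

def neatq_episode(network, env, alpha=0.01, gamma=0.99, epsilon=0.1):
    # One fitness-evaluation episode in the spirit of NEAT+Q: the evolved
    # network outputs Q-values, actions are chosen epsilon-greedily, and the
    # network is backpropagated toward Q-learning targets as it goes.
    total_reward = 0.0
    s = env.reset()
    done = False
    while not done:
        q_values = network.predict(s)              # one Q-value per action (assumed interface)
        if random.random() < epsilon:
            a = random.randrange(len(q_values))
        else:
            a = max(range(len(q_values)), key=lambda i: q_values[i])
        s2, r, done = env.step(a)                  # assumed environment interface
        target = list(q_values)
        target[a] = r if done else r + gamma * max(network.predict(s2))
        network.train(s, target, alpha)            # backprop toward the TD target (assumed)
        total_reward += r
        s = s2
    return total_reward                            # contributes to the network's fitness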
10.4.2 XCS
A different type of hybrid method can be constructed using learning classifier sys-
tems (LCSs) (Holland, 1975; Holland and Reitman, 1977; Bull and Kovacs, 2005;
Butz, 2006; Drugowitsch, 2008). An LCS is an evolutionary system that uses a population of rules, called classifiers, to represent its solution. In XCS, the estimated value of a state-action pair is the fitness-weighted average of the predictions of the classifiers that match it:

Q(s,a) = ( ∑_{c∈M(s,a)} c.f · c.p ) / ( ∑_{c∈M(s,a)} c.f ),

where M(s,a) is the set of all classifiers matching s and a; c.f and c.p are the fitness and prediction, respectively, of classifier c.
Each time the agent is in state s, takes action a, receives reward r, and transitions to state s′, the following update rule is applied to each c ∈ M(s,a):

c.p ← c.p + β ( c.f / ∑_{c′∈M(s,a)} c′.f ) [ r + γ max_{a′} Q(s′, a′) − c.p ].
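A minimal Python sketch of these two computations, with classifier fields named as in the text and the matching and parameter settings left as placeholders:

def xcs_q(match_set):
    # Fitness-weighted average of the predictions of matching classifiers.
    num = sum(c.f * c.p for c in match_set)
    den = sum(c.f for c in match_set)
    return num / den if den > 0 else 0.0

def xcs_update(match_set, r, next_q_max, beta=0.2, gamma=0.9):
    # Distribute the TD update over the matching classifiers in proportion
    # to their share of the total fitness, as in the update rule above.
    total_f = sum(c.f for c in match_set)
    for c in match_set:
        c.p += beta * (c.f / total_f) * (r + gamma * next_q_max - c.p)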
Each classifier also maintains an estimate c.ε of the error of its prediction. Classifier accuracy is then defined in terms of this error. Specifically, when c.ε exceeds a minimum error threshold ε0, the accuracy of c is α(c.ε/ε0)^{−ν}, for fixed parameters α and ν; otherwise it is 1. More general classifiers tend to match more (and more varied) states and thus have lower accuracy. It might seem that, as a result, XCS will evolve only highly specific rules. However, more general rules also match more often. Since only matching classifiers can reproduce, XCS balances the pressure for specific rules with pressure for general rules. Thus, it strives to learn a complete, maximally general, and accurate set of classifiers for approximating the optimal Q-function.
Though there are no convergence proofs for XCS on MDPs, it has proven empir-
ically effective on many tasks. For example, on maze tasks, it has proven adept at
automatically discovering what state features to ignore (Butz et al, 2005) and solv-
ing problems with more than a million states (Butz and Lanzi, 2009). It has also
proven adept at complex sensorimotor control (Butz and Herbort, 2008; Butz et al,
2009) and autonomous robotics (Dorigo and Colombetti, 1998).
10.5 Coevolution
In cooperative coevolution, each member of the team is typically evolved in its own population; one member of each population is selected, often randomly, to form a team that is then evaluated in the task. The total reward obtained contributes to an estimate of the fitness of each participating agent, which is typically evaluated multiple times.
While this approach often outperforms monolithic evolution and has found suc-
cess in predator-prey (Yong and Miikkulainen, 2007) and robot-control (Cai and
Peng, 2002) applications, it also runs into difficulties when there are large numbers
of agents. The main problem is that the contribution of a single agent to the total
reward accrued becomes insignificant. Thus, the fitness an agent receives depends
more on which teammates it is evaluated with than on its own policy. However, it
is possible to construct special fitness functions for individual agents that are much
less sensitive to such effects (Agogino and Tumer, 2008). The main idea is to use
difference functions (Wolpert and Tumer, 2002) that compare the total reward the
team obtains when the agent is present to when it is absent or replaced by a fixed
baseline agent. While this approach requires access to a model of the environment
and increases the computational cost of fitness evaluation (so that the reward in both
scenarios can be measured), it can dramatically improve the performance of coop-
erative coevolution.
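A small sketch (our own notation) of such a difference function: the agent's fitness compares the global reward obtained with the agent in place against the reward obtained when it is replaced by a fixed baseline, which is why a model or simulator of the task is needed:

def difference_fitness(agent_index, team, simulate_team_reward, baseline_agent):
    # Global reward with the agent in place.
    g_with = simulate_team_reward(team)
    # Global reward with the agent swapped for a fixed baseline.
    counterfactual = list(team)
    counterfactual[agent_index] = baseline_agent
    g_without = simulate_team_reward(counterfactual)
    # The difference isolates the agent's own contribution.
    return g_with - g_without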
Coevolution can also be used to simultaneously evolve multiple components of
a single agent, instead of multiple agents. For example, in the task of robot soccer
keepaway, domain knowledge has been used to decompose the task into different
components, each representing an important skill such as running towards the ball
or getting open for a pass (Whiteson et al, 2005). Neural networks for each of these
components are then coevolved and together comprise a complete policy. In the
keepaway task, coevolution greatly outperforms a monolithic approach.
Cooperative coevolution can also be used in a single-agent setting to facilitate
neuroevolution. Rather than coevolving multiple networks, with one for each mem-
ber of a team or each component of a policy, neurons are coevolved, which to-
gether form a single network describing the agent’s policy (Potter and De Jong,
1995, 2000). Typically, networks have fixed topologies with a single hidden layer
and each neuron corresponds to a hidden node, including all the weights of its in-
coming and outgoing edges. Just as dividing a multi-agent task up by agent often
leads to simpler subproblems, so too can breaking up a neuroevolutionary task by
neuron. As Moriarty and Miikkulainen say, “neuron-level evolution takes advantage of the a priori knowledge that individual neurons constitute basic components of neural networks” (Moriarty and Miikkulainen, 1997).
One example is symbiotic adaptive neuroevolution (SANE) (Moriarty and Mi-
ikkulainen, 1996, 1997) in which evolution occurs simultaneously at two levels.
At the lower level, a single population of neurons is evolved. The fitness of each
neuron is the average performance of the networks in which it participates during
fitness evaluations. At the higher level, a population of blueprints is evolved, with
each blueprint consisting of a vector of pointers to neurons in the lower level. The
blueprints that combine neurons into the most effective networks tend to survive se-
lective pressure. On various reinforcement-learning tasks, such as robot control and pole balancing, SANE has proven effective.
Fig. 10.4 The CoSyNE algorithm, using six subpopulations, each containing m weights. All
the weights at a given index i form a genotype xi . Each weight is taken from a different
subpopulation and describes one edge in the neural network. Figure taken with permission
from (Gomez et al, 2008).
Coevolution has also proven a powerful tool for competitive settings. The most
common applications are in games, in which coevolution is used to simultaneously
evolve strong players and the opponents against which they are evaluated. The hope
is to create an arms race (Dawkins and Krebs, 1979) in which the evolving agents
exert continual selective pressure on each other, driving evolution towards increas-
ingly effective policies.
Perhaps the simplest example of competitive coevolution is the work of Pollack
and Blair in the game of backgammon (Pollack and Blair, 1998). Their approach
relies on a simple optimization technique (essentially an evolutionary method with
a population size of two) wherein a neural network plays against a mutated version
of itself and the winner survives. The approach works so well that Pollack and Blair
hypothesize that Tesauro’s great success with TD-Gammon (Tesauro, 1994) is due
more to the nature of backgammon than the power of temporal-difference methods.2
Using larger populations, competitive coevolution has also found success in the
game of checkers. The Blondie24 program uses the minimax algorithm (Von Neu-
mann, 1928) to play checkers, relying on neuroevolution to discover an effective
evaluator of board positions (Chellapilla and Fogel, 2001). During fitness evalua-
tions, members of the current population play games against each other. Despite the
minimal use of human expertise, Blondie24 evolved to play at a level competitive
with human experts.
Competitive coevolution can also have useful synergies with TWEANNs. In
fixed-topology neuroevolution, arms races may be cut short when additional
improvement requires an expanded representation. Since TWEANNs can automati-
cally expand their representations, coevolution can give rise to continual complexi-
fication (Stanley and Miikkulainen, 2004a).
The methods mentioned above evolve only a single population. However, as in
cooperative coevolution, better performance is sometimes possible by segregating
individuals into separate populations. In the host/parasite model (Hillis, 1990), one
population evolves hosts and another parasites. Hosts are evaluated based on their
robustness against parasites, e.g., how many parasites they beat in games of check-
ers. In contrast, parasites are evaluated based on their uniqueness, e.g., how many
hosts they can beat that other parasites cannot. Such fitness functions can be imple-
mented using competitive fitness sharing (Rosin and Belew, 1997).
In Pareto coevolution, the problem is treated as a multi-objective one, with each
opponent as an objective (Ficici and Pollack, 2000, 2001). The goal is thus to find
a Pareto-optimal solution, i.e., one that cannot be improved with respect to one
objective without worsening its performance with respect to another. Using this
approach, many methods have been developed that maintain Pareto archives of
2 Tesauro, however, disputes this claim, pointing out that the performance difference be-
tween Pollack and Blair’s approach and his own is quite significant, analogous to that
between an average human player and a world-class one (Tesauro, 1998).
opponents against which to evaluate evolving solutions (De Jong, 2004; Monroy
et al, 2006; De Jong, 2007; Popovici et al, 2010).
More recently, the HyperNEAT method (Stanley et al, 2009) has been developed
to extend NEAT to use indirect encodings. This approach is based on composi-
tional pattern producing networks (CPPNs). CPPNs are neural networks for describ-
ing complex patterns. For example, a two-dimensional image can be described by
a CPPN whose inputs correspond to an x-y position in the image and whose output
corresponds to the color that should appear in that position. The image can then
be generated by querying the CPPN at each x-y position and setting that position’s
color based on the output. Such CPPNs can be evolved by NEAT, yielding a devel-
opmental system with the CPPN as the genotype and the image as the phenotype.
In HyperNEAT, the CPPN is used to describe a neural network instead of an im-
age. Thus, both the genotype and phenotype are neural networks. As illustrated in
Figure 10.5, the nodes of the phenotypic network are laid out on a substrate, i.e., a
grid, such that each has a position. The CPPN takes as input two positions instead of
one and its output specifies the weight of the edge connecting the two corresponding
nodes. As before, these CPPNs can be evolved by NEAT based on the fitness of the
resulting phenotypic network, e.g., its performance as a policy in a reinforcement
learning task. The CPPNs can be interpreted as describing a spatial pattern in a four-
dimensional hypercube, yielding the name HyperNEAT. Because the developmental
approach makes it easy to specify networks that exploit symmetries and regularities
in complex tasks, HyperNEAT has proven an effective tool for reinforcement learn-
ing, with successful applications in domains such as checkers (Gauci and Stanley,
2008, 2010), keepaway soccer (Verbancsics and Stanley, 2010), and multi-agent
systems (D’Ambrosio et al, 2010).
Fig. 10.5 The HyperNEAT algorithm. Figure taken with permission from (Gauci and Stanley,
2010).
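The substrate-querying step described above can be illustrated with the minimal sketch below. The CPPN here is only a stand-in (a fixed closed-form function rather than an evolved network), and the grid layout, threshold, and function names are assumptions made for illustration; in HyperNEAT proper the CPPN itself is evolved by NEAT.

```python
import math

def cppn(x1, y1, x2, y2):
    # Stand-in for an evolved CPPN: maps a pair of substrate coordinates to a
    # connection weight. A real CPPN is a small network evolved by NEAT.
    return math.sin(x1 * x2) + math.cos(y1 - y2)

def build_phenotype_weights(grid_size=3, threshold=0.3):
    """Query the CPPN for every pair of substrate nodes to get edge weights."""
    coords = [(x, y) for x in range(grid_size) for y in range(grid_size)]
    weights = {}
    for (x1, y1) in coords:
        for (x2, y2) in coords:
            w = cppn(x1, y1, x2, y2)
            # Connections whose CPPN output is small are commonly pruned.
            if abs(w) > threshold:
                weights[((x1, y1), (x2, y2))] = w
    return weights

phenotype = build_phenotype_weights()
print(len(phenotype), "connections expressed on a 3x3 substrate")
```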
One possible approach is to use evolution, not as a complete solution method, but
as a component in a model-based method. In model-based algorithms, the agent’s
interactions with its environment are used to learn a model, to which planning meth-
ods are then applied. As the agent gathers more samples from the environment, the
quality of the model improves, which, in turn, improves the quality of the policy pro-
duced via planning. Because planning is done off-line, the number of interactions
needed to find a good policy is minimized, leading to strong on-line performance.
In such an approach, planning is typically conducted using dynamic program-
ming methods like value iteration. However, many other methods can be used in-
stead; if the model is continuous and/or high dimensional, evolutionary or other
policy-search methods may be preferable. Unfortunately, most model-based meth-
ods are designed only to learn tabular models for small, discrete state spaces. Still,
in some cases, especially when considerable domain expertise is available, more
complex models can be learned.
For example, linear regression has been used to learn models of helicopter dynam-
ics, which can then be used for policy-search reinforcement learning (Ng et al,
2004). The resulting policies have successfully controlled real model helicopters.
A similar approach was used to maximize on-line performance in the helicopter-
hovering events in recent Reinforcement Learning Competitions (Whiteson et al,
2010a): models learned via linear regression were used as fitness functions for poli-
cies evolved off-line via neuroevolution (Koppejan and Whiteson, 2009).
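A hedged sketch of this model-based use of evolution follows: a learned linear model of the dynamics, fit from logged transitions, serves as a simulator inside the fitness function, so candidate policies are evaluated off-line without further interaction. The state and action dimensions, reward, policy representation, and helper names are illustrative assumptions, not the setup of Koppejan and Whiteson (2009).

```python
import numpy as np

rng = np.random.default_rng(0)

# Fit a linear model s' = [s; a] A from logged transitions (least squares).
def fit_linear_model(states, actions, next_states):
    X = np.hstack([states, actions])
    A, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return A  # shape: (state_dim + action_dim, state_dim)

# Fitness of a linear policy a = W s, evaluated entirely inside the model.
def model_based_fitness(W, A, s0, horizon=50):
    s, total = s0.copy(), 0.0
    for _ in range(horizon):
        a = W @ s
        s = np.hstack([s, a]) @ A      # predicted next state
        total += -np.sum(s ** 2)       # illustrative reward: stay near the origin
    return total

# Toy data: 2-D state, 1-D action, noisy linear transitions.
S = rng.normal(size=(200, 2)); Acts = rng.normal(size=(200, 1))
S_next = S * 0.9 + 0.1 * Acts + rng.normal(scale=0.01, size=(200, 2))
A = fit_linear_model(S, Acts, S_next)

# A minimal (mu + lambda) evolutionary loop using the model as the fitness function.
pop = [rng.normal(size=(1, 2)) for _ in range(10)]
for gen in range(20):
    scored = sorted(pop, key=lambda W: model_based_fitness(W, A, np.ones(2)), reverse=True)
    parents = scored[:3]
    pop = parents + [p + 0.1 * rng.normal(size=p.shape) for p in parents for _ in range(3)]
print("best fitness:", model_based_fitness(pop[0], A, np.ones(2)))
```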
Alternatively, evolutionary methods can be used for the model-learning compo-
nent of a model-based solution. In particular, anticipatory learning classifier sys-
tems (Butz, 2002; Gerard et al, 2002, 2005; Sigaud et al, 2009), a type of LCS,
can be used to evolve models of the environment that are used for planning in a
framework similar to Dyna-Q (Sutton, 1990).
10.8 Conclusion
Thanks to hybrid methods, the use of evolutionary computation does not require forgoing the power
of temporal-difference methods. Furthermore, coevolutionary approaches extend
the reach of evolution to multi-agent reinforcement learning, both cooperative and
competitive. While most work in evolutionary computation has focused on off-line
settings, promising research exists in developing evolutionary methods for on-line
reinforcement learning, which remains a critical and exciting challenge for future
work.
References
Ackley, D., Littman, M.: Interactions between learning and evolution. Artificial Life II, SFI
Studies in the Sciences of Complexity 10, 487–509 (1991)
Agogino, A.K., Tumer, K.: Efficient evaluation functions for evolving coordination. Evolu-
tionary Computation 16(2), 257–288 (2008)
Arita, T., Suzuki, R.: Interactions between learning and evolution: The outstanding strategy
generated by the Baldwin Effect. Artificial Life 7, 196–205 (2000)
Baldwin, J.M.: A new factor in evolution. The American Naturalist 30, 441–451 (1896)
Boers, E., Borst, M., Sprinkhuizen-Kuyper, I.: Evolving Artificial Neural Networks using the
“Baldwin Effect”. In: Proceedings of the International Conference Artificial Neural Nets
and Genetic Algorithms in Ales, France (1995)
Bonarini, A.: An introduction to learning fuzzy classifier systems. Learning Classifier
Systems, 83–104 (2000)
Bull, L., Kovacs, T.: Foundations of learning classifier systems: An introduction. Foundations
of Learning Classifier Systems, 1–17 (2005)
Bull, L., O’Hara, T.: Accuracy-based neuro and neuro-fuzzy classifier systems. In: Proceed-
ings of the Genetic and Evolutionary Computation Conference, pp. 905–911 (2002)
Butz, M.: Anticipatory learning classifier systems. Kluwer Academic Publishers (2002)
Butz, M.: Rule-based evolutionary online learning systems: A principled approach to LCS
analysis and design. Springer, Heidelberg (2006)
Butz, M., Herbort, O.: Context-dependent predictions and cognitive arm control with XCSF.
In: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computa-
tion, pp. 1357–1364. ACM (2008)
Butz, M., Lanzi, P.: Sequential problems that test generalization in learning classifier systems.
Evolutionary Intelligence 2(3), 141–147 (2009)
Butz, M., Goldberg, D., Lanzi, P.: Gradient descent methods in learning classifier systems:
Improving XCS performance in multistep problems. IEEE Transactions on Evolutionary
Computation 9(5) (2005)
Butz, M., Lanzi, P., Wilson, S.: Function approximation with XCS: Hyperellipsoidal con-
ditions, recursive least squares, and compaction. IEEE Transactions on Evolutionary
Computation 12(3), 355–376 (2008)
Butz, M., Pedersen, G., Stalph, P.: Learning sensorimotor control structures with XCSF:
Redundancy exploitation and dynamic control. In: Proceedings of the 11th Annual Con-
ference on Genetic and Evolutionary Computation, pp. 1171–1178 (2009)
Cai, Z., Peng, Z.: Cooperative coevolutionary adaptive genetic algorithm in path planning of
cooperative multi-mobile robot systems. Journal of Intelligent and Robotic Systems 33(1),
61–71 (2002)
Cardamone, L., Loiacono, D., Lanzi, P.: On-line neuroevolution applied to the open racing
car simulator. In: Proceedings of the Congress on Evolutionary Computation (CEC), pp.
2622–2629 (2009)
Cardamone, L., Loiacono, D., Lanzi, P.L.: Learning to drive in the open racing car simulator
using online neuroevolution. IEEE Transactions on Computational Intelligence and AI in
Games 2(3), 176–190 (2010)
Chellapilla, K., Fogel, D.: Evolving an expert checkers playing program without using human
expertise. IEEE Transactions on Evolutionary Computation 5(4), 422–428 (2001)
Coello, C., Lamont, G., Van Veldhuizen, D.: Evolutionary algorithms for solving multi-
objective problems. Springer, Heidelberg (2007)
D’Ambrosio, D., Lehman, J., Risi, S., Stanley, K.O.: Evolving policy geometry for scal-
able multiagent learning. In: Proceedings of the Ninth International Conference on Au-
tonomous Agents and Multiagent Systems (AAMAS 2010), pp. 731–738 (2010)
Darwen, P., Yao, X.: Automatic modularization by speciation. In: Proceedings of the 1996
IEEE International Conference on Evolutionary Computation (ICEC 1996), pp. 88–93
(1996)
Dasgupta, D., McGregor, D.: Designing application-specific neural networks using the struc-
tured genetic algorithm. In: Proceedings of the International Conference on Combinations
of Genetic Algorithms and Neural Networks, pp. 87–96 (1992)
Dawkins, R., Krebs, J.: Arms races between and within species. Proceedings of the Royal
Society of London Series B, Biological Sciences 205(1161), 489–511 (1979)
de Jong, E.D.: The Incremental Pareto-coevolution Archive. In: Deb, K., et al. (eds.) GECCO
2004. LNCS, vol. 3102, pp. 525–536. Springer, Heidelberg (2004)
de Jong, E.: A monotonic archive for Pareto-coevolution. Evolutionary Computation 15(1),
61–93 (2007)
de Jong, K., Spears, W.: An analysis of the interacting roles of population size and crossover
in genetic algorithms. In: Parallel Problem Solving from Nature, pp. 38–47 (1991)
de Jong, K., Spears, W., Gordon, D.: Using genetic algorithms for concept learning. Machine
learning 13(2), 161–188 (1993)
Deb, K.: Multi-objective optimization using evolutionary algorithms. Wiley (2001)
Dorigo, M., Colombetti, M.: Robot shaping: An experiment in behavior engineering. The
MIT Press (1998)
Downing, K.L.: Reinforced genetic programming. Genetic Programming and Evolvable
Machines 2(3), 259–288 (2001)
Doya, K.: Reinforcement learning in continuous time and space. Neural Computation 12(1),
219–245 (2000)
Drugowitsch, J.: Design and analysis of learning classifier systems: A probabilistic approach.
Springer, Heidelberg (2008)
Ficici, S., Pollack, J.: A game-theoretic approach to the simple coevolutionary algorithm.
In: Parallel Problem Solving from Nature PPSN VI, pp. 467–476. Springer, Heidelberg
(2000)
Ficici, S., Pollack, J.: Pareto optimality in coevolutionary learning. Advances in Artificial
Life, 316–325 (2001)
Floreano, D., Mondada, F.: Evolution of homing navigation in a real mobile robot. IEEE
Transactions on Systems, Man, and Cybernetics, Part B 26(3), 396–407 (2002)
Floreano, D., Urzelai, J.: Evolution of plastic control networks. Autonomous Robots 11(3),
311–317 (2001)
French, R., Messinger, A.: Genes, phenes and the Baldwin effect: Learning and evolution in
a simulated population. Artificial Life 4, 277–282 (1994)
Gaskett, C., Wettergreen, D., Zelinsky, A.: Q-learning in continuous state and action spaces.
Advanced Topics in Artificial Intelligence, 417–428 (1999)
Gauci, J., Stanley, K.O.: A case study on the critical role of geometric regularity in ma-
chine learning. In: Proceedings of the Twenty-Third AAAI Conference on Artificial Intel-
ligence, AAAI 2008 (2008)
Gauci, J., Stanley, K.O.: Autonomous evolution of topographic regularities in artificial neural
networks. Neural Computation 22(7), 1860–1898 (2010)
Gerard, P., Stolzmann, W., Sigaud, O.: YACS: a new learning classifier system using antici-
pation. Soft Computing-A Fusion of Foundations, Methodologies and Applications 6(3),
216–228 (2002)
Gerard, P., Meyer, J., Sigaud, O.: Combining latent learning with dynamic programming
in the modular anticipatory classifier system. European Journal of Operational Re-
search 160(3), 614–637 (2005)
Giraud-Carrier, C.: Unifying learning with evolution through Baldwinian evolution and
Lamarckism: A case study. In: Proceedings of the Symposium on Computational Intel-
ligence and Learning (CoIL 2000), pp. 36–41 (2000)
Goldberg, D.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley (1989)
Goldberg, D., Deb, K.: A comparative analysis of selection schemes used in genetic algo-
rithms. Foundations of genetic algorithms 1, 69–93 (1991)
Goldberg, D., Richardson, J.: Genetic algorithms with sharing for multimodal function opti-
mization. In: Proceedings of the Second International Conference on Genetic Algorithms
and their Application, p. 49 (1987)
Gomez, F., Miikkulainen, R.: Solving non-Markovian control tasks with neuroevolution. In:
Proceedings of the International Joint Conference on Artificial Intelligence, pp. 1356–
1361 (1999)
Gomez, F., Miikkulainen, R.: Active guidance for a finless rocket using neuroevolution. In:
GECCO 2003: Proceedings of the Genetic and Evolutionary Computation Conference
(2003)
Gomez, F., Schmidhuber, J.: Co-evolving recurrent neurons learn deep memory POMDPs. In:
GECCO 2005: Proceedings of the Genetic and Evolutionary Computation Conference, pp.
491–498 (2005a)
Gomez, F.J., Schmidhuber, J.: Evolving Modular Fast-Weight Networks for Control. In:
Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN 2005. LNCS, vol. 3697,
pp. 383–389. Springer, Heidelberg (2005b)
Gomez, F.J., Schmidhuber, J., Miikkulainen, R.: Efficient Non-Linear Control Through Neu-
roevolution. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS
(LNAI), vol. 4212, pp. 654–662. Springer, Heidelberg (2006)
Gomez, F., Schmidhuber, J., Miikkulainen, R.: Accelerated neural evolution through cooper-
atively coevolved synapses. Journal of Machine Learning Research 9, 937–965 (2008)
Gruau, F.: Automatic definition of modular neural networks. Adaptive Behavior 3(2), 151
(1994)
Gruau, F., Whitley, D.: Adding learning to the cellular development of neural networks: Evo-
lution and the Baldwin effect. Evolutionary Computation 1, 213–233 (1993)
Hansen, N., Müller, S., Koumoutsakos, P.: Reducing the time complexity of the derandomized
evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computa-
tion 11(1), 1–18 (2003)
van Hasselt, H., Wiering, M.: Reinforcement learning in continuous action spaces. In: IEEE
International Symposium on Approximate Dynamic Programming and Reinforcement
Learning, ADPRL, pp. 272–279 (2007)
Haykin, S.: Neural networks: a comprehensive foundation. Prentice-Hall (1994)
Heidrich-Meisner, V., Igel, C.: Variable metric reinforcement learning methods applied to
the noisy mountain car problem. Recent Advances in Reinforcement Learning, 136–150
(2008)
Heidrich-Meisner, V., Igel, C.: Hoeffding and Bernstein races for selecting policies in evolu-
tionary direct policy search. In: Proceedings of the 26th Annual International Conference
on Machine Learning, pp. 401–408 (2009a)
Heidrich-Meisner, V., Igel, C.: Neuroevolution strategies for episodic reinforcement learning.
Journal of Algorithms 64(4), 152–168 (2009b)
Heidrich-Meisner, V., Igel, C.: Uncertainty handling CMA-ES for reinforcement learning. In:
Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation,
pp. 1211–1218 (2009c)
Hillis, W.: Co-evolving parasites improve simulated evolution as an optimization procedure.
Physica D: Nonlinear Phenomena 42(1-3), 228–234 (1990)
Hinton, G.E., Nowlan, S.J.: How learning can guide evolution. Complex Systems 1, 495–502
(1987)
Holland, J., Reitman, J.: Cognitive systems based on adaptive algorithms. ACM SIGART
Bulletin 63, 49–49 (1977)
Holland, J.H.: Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. University of Michigan Press (1975)
Hornby, G., Pollack, J.: Creating high-level components with a generative representation for
body-brain evolution. Artificial Life 8(3), 223–246 (2002)
Igel, C.: Neuroevolution for reinforcement learning using evolution strategies. In: Congress
on Evolutionary Computation, vol. 4, pp. 2588–2595 (2003)
Jansen, T., Wiegand, R.P.: The cooperative coevolutionary (1+1) EA. Evolutionary Compu-
tation 12(4), 405–434 (2004)
Kaelbling, L.P.: Learning in Embedded Systems. MIT Press (1993)
Kernbach, S., Meister, E., Scholz, O., Humza, R., Liedke, J., Ricotti, L., Jemai, J., Havlik, J.,
Liu, W.: Evolutionary robotics: The next-generation-platform for on-line and on-board ar-
tificial evolution. In: CEC 2009: IEEE Congress on Evolutionary Computation, pp. 1079–
1086 (2009)
Kohl, N., Miikkulainen, R.: Evolving neural networks for fractured domains. In: Proceedings
of the Genetic and Evolutionary Computation Conference, pp. 1405–1412 (2008)
Kohl, N., Miikkulainen, R.: Evolving neural networks for strategic decision-making prob-
lems. Neural Networks 22, 326–337 (2009); (special issue on Goal-Directed Neural
Systems)
Koppejan, R., Whiteson, S.: Neuroevolutionary reinforcement learning for generalized heli-
copter control. In: GECCO 2009: Proceedings of the Genetic and Evolutionary Computa-
tion Conference, pp. 145–152 (2009)
Kovacs, T.: Strength or accuracy: credit assignment in learning classifier systems. Springer,
Heidelberg (2003)
Larranaga, P., Lozano, J.: Estimation of distribution algorithms: A new tool for evolutionary
computation. Springer, Netherlands (2002)
Lindenmayer, A.: Mathematical models for cellular interactions in development II. Simple
and branching filaments with two-sided inputs. Journal of Theoretical Biology 18(3), 300–
315 (1968)
Littman, M.L., Dean, T.L., Kaelbling, L.P.: On the complexity of solving Markov decision
processes. In: Proceedings of the Eleventh International Conference on Uncertainty in
Artificial Intelligence, pp. 394–402 (1995)
Lucas, S.M., Runarsson, T.P.: Temporal difference learning versus co-evolution for acquir-
ing othello position evaluation. In: IEEE Symposium on Computational Intelligence and
Games (2006)
Lucas, S.M., Togelius, J.: Point-to-point car racing: an initial study of evolution versus temporal difference learning. In: IEEE Symposium on Computational Intelligence and Games, pp. 260–267 (2007)
Mahadevan, S., Maggioni, M.: Proto-value functions: A Laplacian framework for learning
representation and control in Markov decision processes. Journal of Machine Learning
Research 8, 2169–2231 (2007)
Mahfoud, S.: A comparison of parallel and sequential niching methods. In: Conference on
Genetic Algorithms, vol. 136, p. 143 (1995)
McQuesten, P., Miikkulainen, R.: Culling and teaching in neuro-evolution. In: Proceedings
of the Seventh International Conference on Genetic Algorithms, pp. 760–767 (1997)
Meyer, J., Husbands, P., Harvey, I.: Evolutionary robotics: A survey of applications and prob-
lems. In: Evolutionary Robotics, pp. 1–21. Springer, Heidelberg (1998)
Millán, J., Posenato, D., Dedieu, E.: Continuous-action Q-learning. Machine Learning 49(2),
247–265 (2002)
Monroy, G., Stanley, K., Miikkulainen, R.: Coevolution of neural networks using a layered
Pareto archive. In: Proceedings of the 8th Annual Conference on Genetic and Evolutionary
Computation, p. 336 (2006)
Moriarty, D., Miikkulainen, R.: Forming neural networks through efficient and adaptive
coevolution. Evolutionary Computation 5(4), 373–399 (1997)
Moriarty, D.E., Miikkulainen, R.: Efficient reinforcement learning through symbiotic evolu-
tion. Machine Learning 22(11), 11–33 (1996)
Moriarty, D.E., Schultz, A.C., Grefenstette, J.J.: Evolutionary algorithms for reinforcement
learning. Journal of Artificial Intelligence Research 11, 199–229 (1999)
Ng, A.Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., Liang, E.:
Inverted autonomous helicopter flight via reinforcement learning. In: Proceedings of the
International Symposium on Experimental Robotics (2004)
Nolfi, S., Parisi, D.: Learning to adapt to changing environments in evolving neural networks.
Adaptive Behavior 5(1), 75–98 (1997)
Nolfi, S., Elman, J.L., Parisi, D.: Learning and evolution in neural networks. Adaptive
Behavior 2, 5–28 (1994)
Nordin, P., Banzhaf, W.: An on-line method to evolve behavior and to control a miniature
robot in real time with genetic programming. Adaptive Behavior 5(2), 107 (1997)
Panait, L., Luke, S.: Cooperative multi-agent learning: The state of the art. Autonomous
Agents and Multi-Agent Systems 11(3), 387–434 (2005)
Panait, L., Luke, S., Harrison, J.F.: Archive-based cooperative coevolutionary algorithms. In:
GECCO 2006: Proceedings of the 8th Annual Conference on Genetic and Evolutionary
Computation, pp. 345–352 (2006)
Parr, R., Painter-Wakefield, C., Li, L., Littman, M.: Analyzing feature generation for value-
function approximation. In: Proceedings of the 24th International Conference on Machine
Learning, p. 744 (2007)
Pereira, F.B., Costa, E.: Understanding the role of learning in the evolution of busy beaver:
A comparison between the Baldwin Effect and a Lamarckian strategy. In: Proceedings of
the Genetic and Evolutionary Computation Conference, GECCO 2001 (2001)
Peters, J., Schaal, S.: Natural actor-critic. Neurocomputing 71(7-9), 1180–1190 (2008)
Pollack, J., Blair, A.: Co-evolution in the successful learning of backgammon strategy. Ma-
chine Learning 32(3), 225–240 (1998)
Popovici, E., Bucci, A., Wiegand, P., De Jong, E.: Coevolutionary principles. In: Rozenberg,
G., Baeck, T., Kok, J. (eds.) Handbook of Natural Computing. Springer, Berlin (2010)
Potter, M.A., De Jong, K.A.: Evolving neural networks with collaborative species. In: Sum-
mer Computer Simulation Conference, pp. 340–345 (1995)
Potter, M.A., De Jong, K.A.: Cooperative coevolution: An architecture for evolving coad-
apted subcomponents. Evolutionary Computation 8, 1–29 (2000)
Pratihar, D.: Evolutionary robotics: A review. Sadhana 28(6), 999–1009 (2003)
Priesterjahn, S., Weimer, A., Eberling, M.: Real-time imitation-based adaptation of gaming
behaviour in modern computer games. In: Proceedings of the Genetic and Evolutionary
Computation Conference, pp. 1431–1432 (2008)
Radcliffe, N.: Genetic set recombination and its application to neural network topology opti-
misation. Neural Computing & Applications 1(1), 67–90 (1993)
Rosin, C.D., Belew, R.K.: New methods for competitive coevolution. Evolutionary Compu-
tation 5(1), 1–29 (1997)
Rubinstein, R., Kroese, D.: The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. Springer, Heidelberg (2004)
Runarsson, T.P., Lucas, S.M.: Co-evolution versus self-play temporal difference learning for
acquiring position evaluation in small-board go. IEEE Transactions on Evolutionary Com-
putation 9, 628–640 (2005)
Schmidhuber, J., Wierstra, D., Gomez, F.J.: Evolino: Hybrid neuroevolution / optimal lin-
ear search for sequence learning. In: Proceedings of the Nineteenth International Joint
Conference on Artificial Intelligence, pp. 853–858 (2005)
Schmidhuber, J., Wierstra, D., Gagliolo, M., Gomez, F.: Training recurrent networks by
evolino. Neural Computation 19(3), 757–779 (2007)
Schroder, P., Green, B., Grum, N., Fleming, P.: On-line evolution of robust control systems:
an industrial active magnetic bearing application. Control Engineering Practice 9(1), 37–
49 (2001)
Sigaud, O., Butz, M., Kozlova, O., Meyer, C.: Anticipatory Learning Classifier Systems and
Factored Reinforcement Learning. Anticipatory Behavior in Adaptive Learning Systems,
321–333 (2009)
Stanley, K., Miikkulainen, R.: A taxonomy for artificial embryogeny. Artificial Life 9(2),
93–130 (2003)
Stanley, K.O., Miikkulainen, R.: Evolving neural networks through augmenting topologies.
Evolutionary Computation 10(2), 99–127 (2002)
Stanley, K.O., Miikkulainen, R.: Competitive coevolution through evolutionary complexifi-
cation. Journal of Artificial Intelligence Research 21, 63–100 (2004a)
Stanley, K.O., Miikkulainen, R.: Evolving a Roving Eye for Go. In: Deb, K., et al. (eds.)
GECCO 2004. LNCS, vol. 3103, pp. 1226–1238. Springer, Heidelberg (2004b)
Stanley, K.O., Bryant, B.D., Miikkulainen, R.: Evolving adaptive neural networks with and
without adaptive synapses. In: Proceedings of the 2003 Congress on Evolutionary Com-
putation (CEC 2003), vol. 4, pp. 2557–2564 (2003)
Stanley, K.O., D’Ambrosio, D.B., Gauci, J.: A hypercube-based indirect encoding for evolv-
ing large-scale neural networks. Artificial Life 15(2), 185–212 (2009)
Steels, L.: Emergent functionality in robotic agents through on-line evolution. In: Artificial
Life IV: Proceedings of the Fourth International Workshop on the Synthesis and Simula-
tion of Living Systems, pp. 8–16 (1994)
Sutton, R.S.: Integrated architectures for learning, planning, and reacting based on approxi-
mating dynamic programming. In: Proceedings of the Seventh International Conference
on Machine Learning, pp. 216–224 (1990)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (1998)
Sywerda, G.: Uniform crossover in genetic algorithms. In: Proceedings of the Third Interna-
tional Conference on Genetic Algorithms, pp. 2–9 (1989)
Tan, C., Ang, J., Tan, K., Tay, A.: Online adaptive controller for simulated car racing. In:
Congress on Evolutionary Computation (CEC), pp. 2239–2245 (2008)
Taylor, M.E., Whiteson, S., Stone, P.: Comparing evolutionary and temporal difference meth-
ods in a reinforcement learning domain. In: GECCO 2006: Proceedings of the Genetic
and Evolutionary Computation Conference, pp. 1321–1328 (2006)
Tesauro, G.: TD-gammon, a self-teaching backgammon program achieves master-level play.
Neural Computation 6, 215–219 (1994)
Tesauro, G.: Comments on co-evolution in the successful learning of backgammon strategy.
Machine Learning 32(3), 241–243 (1998)
Verbancsics, P., Stanley, K.: Evolving Static Representations for Task Transfer. Journal of
Machine Learning Research 11, 1737–1769 (2010)
Von Neumann, J.: Zur Theorie der Gesellschaftsspiele. Mathematische Annalen 100, 295–320 (1928)
Whiteson, S., Stone, P.: Evolutionary function approximation for reinforcement learning.
Journal of Machine Learning Research 7, 877–917 (2006a)
Whiteson, S., Stone, P.: On-line evolutionary computation for reinforcement learning in
stochastic domains. In: GECCO 2006: Proceedings of the Genetic and Evolutionary Com-
putation Conference, pp. 1577–1584 (2006b)
Whiteson, S., Kohl, N., Miikkulainen, R., Stone, P.: Evolving keepaway soccer players
through task decomposition. Machine Learning 59(1), 5–30 (2005)
Whiteson, S., Tanner, B., White, A.: The reinforcement learning competitions. AI Maga-
zine 31(2), 81–94 (2010a)
Whiteson, S., Taylor, M.E., Stone, P.: Critical factors in the empirical performance of tempo-
ral difference and evolutionary methods for reinforcement learning. Autonomous Agents
and Multi-Agent Systems 21(1), 1–27 (2010b)
Whitley, D., Dominic, S., Das, R., Anderson, C.W.: Genetic reinforcement learning for neu-
rocontrol problems. Machine Learning 13, 259–284 (1993)
Whitley, D., Gordon, S., Mathias, K.: Lamarckian evolution, the Baldwin effect and function
optimization. In: Parallel Problem Solving from Nature - PPSN III, pp. 6–15 (1994)
Wiegand, R., Liles, W., De Jong, K.: An empirical analysis of collaboration methods in coop-
erative coevolutionary algorithms. In: Proceedings of the Genetic and Evolutionary Com-
putation Conference (GECCO), pp. 1235–1242 (2001)
Wieland, A.: Evolving neural network controllers for unstable systems. In: International Joint
Conference on Neural Networks, vol 2, pp. 667–673 (1991)
Wilson, S.: Classifier fitness based on accuracy. Evolutionary Computation 3(2), 149–175
(1995)
Wilson, S.: Function approximation with a classifier system. In: GECCO 2001: Proceedings
of the Genetic and Evolutionary Computation Conference, pp. 974–982 (2001)
Wolpert, D., Tumer, K.: Optimal payoff functions for members of collectives. Modeling Com-
plexity in Economic and Social Systems, 355 (2002)
Yamasaki, K., Sekiguchi, M.: Clear explanation of different adaptive behaviors between Dar-
winian population and Lamarckian population in changing environment. In: Proceedings
of the Fifth International Symposium on Artificial Life and Robotics, vol. 1, pp. 120–123
(2000)
Yao, X.: Evolving artificial neural networks. Proceedings of the IEEE 87(9), 1423–1447
(1999)
Yong, C.H., Miikkulainen, R.: Coevolution of role-based cooperation in multi-agent sys-
tems. Tech. Rep. AI07-338, Department of Computer Sciences, The University of Texas
at Austin (2007)
Zhang, B., Muhlenbein, H.: Evolving optimal neural networks using genetic algorithms with
Occam’s razor. Complex Systems 7(3), 199–220 (1993)
Zufferey, J.-C., Floreano, D., van Leeuwen, M., Merenda, T.: Evolving vision-based flying
robots. In: Bülthoff, H.H., Lee, S.-W., Poggio, T.A., Wallraven, C. (eds.) BMCV 2002.
LNCS, vol. 2525, pp. 592–600. Springer, Heidelberg (2002)
Part IV
Probabilistic Models of Self and Others
Chapter 11
Bayesian Reinforcement Learning
Abstract. This chapter surveys recent lines of work that use Bayesian techniques
for reinforcement learning. In Bayesian learning, uncertainty is expressed by a prior
distribution over unknown parameters and learning is achieved by computing a
posterior distribution based on the data observed. Hence, Bayesian reinforcement
learning distinguishes itself from other forms of reinforcement learning by explic-
itly maintaining a distribution over various quantities such as the parameters of the
model, the value function, the policy or its gradient. This yields several benefits: a)
domain knowledge can be naturally encoded in the prior distribution to speed up
learning; b) the exploration/exploitation tradeoff can be naturally optimized; and c)
notions of risk can be naturally taken into account to obtain robust policies.
Nikos Vlassis
(1) Luxembourg Centre for Systems Biomedicine, University of Luxembourg, and
(2) OneTree Luxembourg
e-mail: [email protected], [email protected]
Mohammad Ghavamzadeh
INRIA
e-mail: [email protected]
Shie Mannor
Technion
e-mail: [email protected]
Pascal Poupart
University of Waterloo
e-mail: [email protected]
11.1 Introduction
Model-free RL methods are those that do not explicitly learn a model of the sys-
tem and only use sample trajectories obtained by direct interaction with the system.
Model-free techniques are often simpler to implement since they do not require any
data structure to represent a model nor any algorithm to update this model. How-
ever, it is often more complicated to reason about model-free approaches since it is
not always obvious how sample trajectories should be used to update an estimate of
the optimal policy or value function. In this section, we describe several Bayesian
techniques that treat the value function or policy gradient as random objects drawn
from a distribution. More specifically, Section 11.2.1 describes approaches to learn
distributions over Q-functions, Section 11.2.2 considers distributions over policy
gradients and Section 11.2.3 shows how distributions over value functions can be
used to infer distributions over policy gradients in actor-critic algorithms.
Value-function based RL methods search in the space of value functions to find the
optimal value (action-value) function, and then use it to extract an optimal policy. In
this section, we study two Bayesian value-function based RL algorithms: Bayesian
Q-learning (Dearden et al, 1998) and Gaussian process temporal difference learn-
ing (Engel et al, 2003, 2005a; Engel, 2005). The first algorithm caters to domains
with discrete state and action spaces while the second algorithm handles continuous
state and action spaces.
Since the posterior does not have a closed form due to the integral, it is approximated
by finding the closest Normal-Gamma distribution by minimizing KL-divergence.
At run-time, it is very tempting to select the action with the highest expected Q-value (i.e., a* = arg max_a E[Q(s,a)]); however, this strategy does not ensure exploration. To address this, Dearden et al (1998) proposed to add an exploration bonus to
the expected Q-values that estimates the myopic value of perfect information (VPI).
If exploration leads to a policy change, then the gain in value should be taken into
account. Since the agent does not know in advance the effect of each action, VPI is
computed as an expected gain
$$VPI(s,a) = \int_{-\infty}^{\infty} \mathrm{Gain}_{s,a}(x)\, P\big(Q(s,a) = x\big)\, dx \qquad (11.1)$$

where the gain corresponds to the improvement induced by learning the exact Q-value (denoted by q_{s,a}) of the action executed:

$$\mathrm{Gain}_{s,a}(q_{s,a}) = \begin{cases} q_{s,a} - E[Q(s,a_1)] & \text{if } a \neq a_1 \text{ and } q_{s,a} > E[Q(s,a_1)] \\ E[Q(s,a_2)] - q_{s,a} & \text{if } a = a_1 \text{ and } q_{s,a} < E[Q(s,a_2)] \\ 0 & \text{otherwise} \end{cases} \qquad (11.2)$$
There are two cases: a is revealed to have a higher Q-value than the action a1 with the highest expected Q-value, or the action a1 with the highest expected Q-value is
revealed to have a lower Q-value than the action a2 with the second highest expected
Q-value.
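As a concrete illustration, the sketch below estimates the VPI bonus of Eq. 11.1 by Monte Carlo, sampling Q-values from a simple Normal posterior per action. The posterior parameters and sample count are illustrative assumptions rather than the Normal-Gamma machinery of Dearden et al (1998).

```python
import numpy as np

def vpi(mean, std, a, n_samples=100_000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of VPI(s, a) for Gaussian posteriors over Q(s, .)."""
    order = np.argsort(mean)[::-1]
    a1, a2 = order[0], order[1]                  # best and second-best expected actions
    q = rng.normal(mean[a], std[a], n_samples)   # samples of q_{s,a}
    if a == a1:
        gain = np.maximum(mean[a2] - q, 0.0)     # a1 may turn out worse than a2
    else:
        gain = np.maximum(q - mean[a1], 0.0)     # a may turn out better than a1
    return gain.mean()

mean = np.array([1.0, 0.8, 0.2])   # expected Q-values for three actions
std = np.array([0.1, 0.5, 0.3])    # posterior standard deviations
scores = [mean[a] + vpi(mean, std, a) for a in range(3)]
print("action chosen:", int(np.argmax(scores)))
```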
Bayesian Q-learning (BQL) maintains a separate distribution over D(s,a) for each (s,a)-pair; thus, it cannot be used for problems with continuous state or action
spaces. Engel et al (2003, 2005a) proposed a natural extension that uses Gaussian
processes. As in BQL, D(s,a) is assumed to be Normal with mean μ (s,a) and pre-
cision τ (s,a). However, instead of maintaining a Normal-Gamma over μ and τ si-
multaneously, a Gaussian over μ is modeled. Since μ (s,a) = Q(s,a) and the main
quantity that we want to learn is the Q-function, it suffices to maintain a belief only over the mean. To accommodate infinite state and action spaces, a Gaussian
process is used to model infinitely many Gaussians over Q(s,a) for each (s,a)-pair.
A Gaussian process (e.g., Rasmussen and Williams 2006) is the extension of the
multivariate Gaussian distribution to infinitely many dimensions or equivalently,
corresponds to infinitely many correlated univariate Gaussians. Gaussian processes
GP(μ,k) are parameterized by a mean function μ(x) and a kernel function k(x,x′), which are the limits of the mean vector and covariance matrix of multivariate Gaussians as the number of dimensions becomes infinite. Gaussian processes are often
used for functional regression based on sampled realizations of some unknown un-
derlying function.
Along those lines, Engel et al (2003, 2005a) proposed a Gaussian Process Tem-
poral Difference (GPTD) approach to learn the Q-function of a policy based on
samples of discounted sums of returns. Recall that the distribution of the sum of
discounted rewards for a fixed policy π is defined recursively as follows:

$$D(z) = r(z) + \gamma D(z'), \qquad z' \sim P^{\pi}(\cdot \mid z). \qquad (11.3)$$
When z refers to states then E[D] = V and when it refers to state-action pairs then
E[D] = Q. Unless otherwise specified, we will assume that z = (s,a). We can de-
compose D as the sum of its mean Q and a zero-mean noise term Δ Q, which
will allow us to place a distribution directly over Q later on. Replacing D(z) by
Q(z) + ΔQ(z) in Eq. 11.3 and grouping the ΔQ terms into a single zero-mean noise term N(z,z′) = ΔQ(z) − γΔQ(z′), we obtain

$$r(z) = Q(z) - \gamma Q(z') + N(z,z'). \qquad (11.4)$$
The GPTD learning model (Engel et al, 2003, 2005a) is based on the statistical gen-
erative model in Eq. 11.4 that relates the observed reward signal r to the unobserved
action-value function Q. Now suppose that we observe the sequence z0 , z1 , . . . , zt ,
then Eq. 11.4 leads to a system of t equations that can be expressed in matrix form
as
$$r_{t-1} = H_t Q_t + N_t, \qquad (11.5)$$

where

$$r_t = \big(r(z_0), \ldots, r(z_t)\big)^{\top}, \quad Q_t = \big(Q(z_0), \ldots, Q(z_t)\big)^{\top}, \quad N_t = \big(N(z_0,z_1), \ldots, N(z_{t-1},z_t)\big)^{\top}, \qquad (11.6)$$

$$H_t = \begin{bmatrix} 1 & -\gamma & 0 & \cdots & 0 \\ 0 & 1 & -\gamma & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & -\gamma \end{bmatrix}. \qquad (11.7)$$
If we assume that the residuals ΔQ(z_0), ..., ΔQ(z_t) are zero-mean Gaussians with variance σ², and moreover, each residual is generated independently of all the others, i.e., E[ΔQ(z_i)ΔQ(z_j)] = 0 for i ≠ j, it is easy to show that the noise vector N_t is Gaussian with mean 0 and covariance matrix

$$\Sigma_t = \sigma^2 H_t H_t^{\top} = \sigma^2 \begin{bmatrix} 1+\gamma^2 & -\gamma & 0 & \cdots & 0 \\ -\gamma & 1+\gamma^2 & -\gamma & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & -\gamma & 1+\gamma^2 \end{bmatrix}. \qquad (11.8)$$
In episodic tasks, if zt−1 is the last state-action pair in the episode (i.e., st is a zero-
reward absorbing terminal state), Ht becomes a square t × t invertible matrix of
the form shown in Eq. 11.7 with its last column removed. The effect on the noise
covariance matrix Σ_t is that the bottom-right element becomes 1 instead of 1 + γ².
Placing a GP prior GP(0,k) on Q, we may use Bayes' rule to obtain the moments Q̂_t and Ŝ_t of the posterior Gaussian process on Q:

$$\hat{Q}_t(z) = E[Q(z) \mid D_t] = k_t(z)^{\top} \alpha_t, \qquad \hat{S}_t(z,z') = \mathrm{Cov}[Q(z),Q(z') \mid D_t] = k(z,z') - k_t(z)^{\top} C_t\, k_t(z'), \qquad (11.9)$$

where D_t denotes the observed data up to and including time step t. We used here the following definitions:

$$k_t(z) = \big(k(z_0,z), \ldots, k(z_t,z)\big)^{\top}, \quad [K_t]_{i,j} = k(z_i,z_j), \quad \alpha_t = H_t^{\top}\big(H_t K_t H_t^{\top} + \Sigma_t\big)^{-1} r_{t-1}, \quad C_t = H_t^{\top}\big(H_t K_t H_t^{\top} + \Sigma_t\big)^{-1} H_t. \qquad (11.10)$$
As more samples are observed, the posterior covariance decreases, reflecting a grow-
ing confidence in the Q-function estimate Q̂t .
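The batch form of the GPTD posterior (Eqs. 11.5 to 11.9) can be computed directly, as in the sketch below. The kernel, the toy trajectory, and the noise level are illustrative assumptions; a practical implementation would instead use recursive updates and the online sparsification discussed below.

```python
import numpy as np

def rbf(z1, z2, ell=1.0):
    return np.exp(-np.sum((np.asarray(z1) - np.asarray(z2)) ** 2) / (2 * ell ** 2))

def gptd_posterior(Z, r, gamma=0.95, sigma=0.1):
    """Batch GPTD: returns alpha_t and C_t so that
    Qhat(z) = k_t(z)^T alpha_t and Shat(z,z') = k(z,z') - k_t(z)^T C_t k_t(z')."""
    t = len(Z) - 1
    K = np.array([[rbf(zi, zj) for zj in Z] for zi in Z])   # kernel matrix K_t
    H = np.zeros((t, t + 1))
    for i in range(t):                                      # Eq. 11.7
        H[i, i], H[i, i + 1] = 1.0, -gamma
    Sigma = sigma ** 2 * H @ H.T                            # Eq. 11.8
    G = H @ K @ H.T + Sigma
    alpha = H.T @ np.linalg.solve(G, r)                     # Eq. 11.10
    C = H.T @ np.linalg.solve(G, H)
    return K, alpha, C

# Toy trajectory of state-action pairs and the rewards observed along it.
Z = [(0.0, 0), (0.2, 1), (0.5, 0), (0.9, 1)]
r = np.array([0.0, 0.1, 1.0])          # r_{t-1}: one reward per transition
K, alpha, C = gptd_posterior(Z, r)

z_query = (0.4, 0)
k_vec = np.array([rbf(z, z_query) for z in Z])
print("posterior mean Q:", k_vec @ alpha)                       # Eq. 11.9
print("posterior var  Q:", rbf(z_query, z_query) - k_vec @ C @ k_vec)
```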
The GPTD model described above is kernel-based and non-parametric. It is also
possible to employ a parametric representation under very similar assumptions. In
the parametric setting, the GP Q is assumed to consist of a linear combination of a
finite number of basis functions: Q(·,·) = φ(·,·)^⊤ W, where φ is the feature vector
and W is the weight vector. In the parametric GPTD, the randomness in Q is due
to W being a random vector. In this model, we place a Gaussian prior over W and
apply Bayes’ rule to calculate the posterior distribution of W conditioned on the
observed data. The posterior mean and covariance of Q may be easily computed by
multiplying the posterior moments of W with the feature vector φ . See Engel (2005)
for more details on parametric GPTD.
In the parametric case, the computation of the posterior may be performed on-line in O(n²) time per sample and O(n²) memory, where n is the number of basis functions used to approximate Q. In the non-parametric case, we have a new basis function for each new sample we observe, making the cost of adding the t'th sample O(t²) in both time and memory. This would seem to make the non-parametric form
of GPTD computationally infeasible except in small and simple problems. However,
the computational cost of non-parametric GPTD can be reduced by using an online
sparsification method (e.g., Engel et al 2002), to a level that it can be efficiently
implemented online.
The choice of the prior distribution may significantly affect the performance of
GPTD. However, in the standard GPTD, the prior is set at the beginning and remains
unchanged during the execution of the algorithm. Reisinger et al (2008) developed
an online model selection method for GPTD using sequential MC techniques, called
replacing-kernel RL, and empirically showed that it yields better performance than
the standard GPTD for many different kernel families.
Finally, the GPTD model can be used to derive a SARSA-type algorithm, called
GPSARSA (Engel et al, 2005a; Engel, 2005), in which state-action values are esti-
mated using GPTD and policies are improved by an ε-greedy strategy while slowly
decreasing ε toward 0. The GPTD framework, especially the GPSARSA algorithm,
has been successfully applied to large scale RL problems such as the control of an
octopus arm (Engel et al, 2005b) and wireless network association control (Aharony
et al, 2005).
The natural-gradient rule amounts to linearly transforming the gradient using the inverse Fisher information matrix of the policy. In empirical evaluations, natural PG has
been shown to significantly outperform conventional PG (Kakade, 2002; Bagnell
and Schneider, 2003; Peters et al, 2003; Peters and Schaal, 2008).
However, both conventional and natural policy gradient methods rely on Monte-
Carlo (MC) techniques in estimating the gradient of the performance measure.
Although MC estimates are unbiased, they tend to suffer from high variance, or al-
ternatively, require excessive sample sizes (see O’Hagan, 1987 for a discussion). In
the case of policy gradient estimation this is exacerbated by the fact that consistent
policy improvement requires multiple gradient estimation steps. O’Hagan (1991)
proposes a Bayesian alternative to MC estimation of an integral, called Bayesian
quadrature (BQ). The idea is to model integrals of the form ∫ f(x)g(x) dx as ran-
dom quantities. This is done by treating the first term in the integrand, f , as a ran-
dom function over which we express a prior in the form of a Gaussian process (GP).
Observing (possibly noisy) samples of f at a set of points {x1 ,x2 , . . . ,xM } allows
us to employ Bayes’ rule to compute a posterior distribution of f conditioned on
these samples. This, in turn, induces a posterior distribution over the value of the
integral. Rasmussen and Ghahramani (2003) experimentally demonstrated how this
approach, when applied to the evaluation of an expectation, can outperform MC es-
timation by orders of magnitude, in terms of the mean-squared error. Interestingly,
BQ is often effective even when f is known. The posterior of f can be viewed as an
approximation of f (that converges to f in the limit), but this approximation can be
used to perform the integration in closed form. In contrast, MC integration uses the
exact f , but only at the points sampled. So BQ makes better use of the information
provided by the samples by using the posterior to “interpolate” between the samples
and by performing the integration in closed form.
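The sketch below illustrates BQ on a one-dimensional toy integral ∫ f(x)g(x) dx, placing an RBF prior over f and computing the kernel-weighted integrals numerically on a grid; the specific kernel, test functions, and grid are assumptions made only for illustration, not part of O'Hagan's formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x) + 1.0    # "unknown" factor, evaluated only at a few points
g = lambda x: np.exp(-x ** 2)        # known factor
kernel = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

xs = np.linspace(-2, 2, 8)           # sample locations of f
ys = f(xs)

grid = np.linspace(-3, 3, 2001)      # fine grid for the kernel-weighted integrals
dx = grid[1] - grid[0]
# z_m = integral of k(x, x_m) g(x) dx, approximated on the grid.
z = (kernel(grid, xs) * g(grid)[:, None]).sum(axis=0) * dx

K = kernel(xs, xs) + 1e-8 * np.eye(len(xs))
bq_estimate = z @ np.linalg.solve(K, ys)     # posterior mean of the integral

# Plain Monte Carlo with the same number of f-evaluations, for comparison.
mc_x = rng.uniform(-3, 3, len(xs))
mc_estimate = 6.0 * np.mean(f(mc_x) * g(mc_x))

truth = np.sum(f(grid) * g(grid)) * dx
print(f"BQ: {bq_estimate:.4f}  MC: {mc_estimate:.4f}  truth: {truth:.4f}")
```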
In this section, we study a Bayesian framework for policy gradient estimation
based on modeling the policy gradient as a GP (Ghavamzadeh and Engel, 2006).
This reduces the number of samples needed to obtain accurate gradient estimates.
Moreover, estimates of the natural gradient as well as a measure of the uncertainty
in the gradient estimates, namely, the gradient covariance, are provided at little extra
cost.
Let us begin with some definitions and notations. A stationary policy π (·|s) is a
probability distribution over actions, conditioned on the current state. Given a fixed
policy π , the MDP induces a Markov chain over state-action pairs, whose transition
probability from (st ,at ) to (st+1 ,at+1 ) is π (at+1 |st+1 )P(st+1 |st ,at ). We generically
denote by ξ = (s0 ,a0 ,s1 ,a1 , . . . , sT −1 ,aT −1 ,sT ), T ∈ {0,1, . . . , ∞} a path generated
by this Markov chain. The probability (density) of such a path is given by
$$P(\xi \mid \pi) = P_0(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t). \qquad (11.11)$$

We denote by R(ξ) = ∑_{t=0}^{T-1} γ^t r(s_t,a_t) the discounted cumulative return of the path ξ, where γ ∈ [0,1] is a discount factor. R(ξ) is a random variable both because the path ξ itself is a random variable, and because, even for a given path, each of the rewards sampled in it may be stochastic. The expected value of R(ξ) for a given path ξ is denoted by R̄(ξ). Finally, we define the expected return of policy π as

$$\eta(\pi) = E[R(\xi)] = \int d\xi\, \bar{R}(\xi)\, P(\xi \mid \pi). \qquad (11.12)$$
2 To simplify notation, we omit ∇ and u's dependence on the policy parameters θ, and use ∇ and u(ξ) in place of ∇_θ and u(ξ;θ) in the sequel.
$$E\big[\nabla \eta_B(\theta) \mid D_M\big] = E\left[ \int d\xi\, R(\xi)\, \frac{\nabla P(\xi;\theta)}{P(\xi;\theta)}\, P(\xi;\theta) \,\Big|\, D_M \right]. \qquad (11.16)$$
In the Bayesian policy gradient (BPG) method of Ghavamzadeh and Engel (2006),
the problem of estimating the gradient of the expected return (Eq. 11.16) is cast as an
integral evaluation problem, and then the BQ method (O’Hagan, 1991), described
above, is used. In BQ, we need to partition the integrand into two parts, f (ξ ; θ )
and g(ξ ; θ ). We will model f as a GP and assume that g is a function known to us.
We will then proceed by calculating the posterior moments of the gradient ∇ηB (θ )
conditioned on the observed data DM = {ξ1 , . . . ,ξM }. Because in general, R(ξ ) can-
not be known exactly, even for a given ξ (due to the stochasticity of the rewards),
R(ξ ) should always belong to the GP part of the model, i.e., f (ξ ; θ ). Ghavamzadeh
and Engel (2006) proposed two different ways of partitioning the integrand in
Eq. 11.16, resulting in two distinct Bayesian models. Table 1 in Ghavamzadeh and
Engel (2006) summarizes the two models. Models 1 and 2 use Fisher-type kernels
for the prior covariance of f . The choice of Fisher-type kernels was motivated by
the notion that a good representation should depend on the data generating process
(see Jaakkola and Haussler 1999; Shawe-Taylor and Cristianini 2004 for a thor-
ough discussion). The particular choices of linear and quadratic Fisher kernels were
guided by the requirement that the posterior moments of the gradient be analytically
tractable.
Models 1 and 2 can be used to define algorithms for evaluating the gradient of the
expected return w.r.t. the policy parameters. The algorithm (for either model) takes
a set of policy parameters θ and a sample size M as input, and returns an estimate of
the posterior moments of the gradient of the expected return. This Bayesian PG eval-
uation algorithm, in turn, can be used to derive a Bayesian policy gradient (BPG)
algorithm that starts with an initial vector of policy parameters θ 0 and updates the
parameters in the direction of the posterior mean of the gradient of the expected re-
turn, computed by the Bayesian PG evaluation procedure. This is repeated N times,
or alternatively, until the gradient estimate is sufficiently close to zero.
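Schematically, the resulting BPG algorithm alternates gradient evaluation and a parameter update, as outlined below. The gradient-evaluation routine is only a placeholder (`posterior_gradient_mean` and `sample_paths` are hypothetical names standing in for the Model 1/Model 2 computations of Ghavamzadeh and Engel (2006)), and the step size and stopping rule are illustrative assumptions.

```python
import numpy as np

def sample_paths(theta, M):
    # Placeholder: roll out M trajectories under the policy with parameters theta.
    # A real implementation interacts with the MDP as in Eq. 11.11.
    rng = np.random.default_rng(0)
    return [rng.normal(size=4) for _ in range(M)]     # fake path statistics

def posterior_gradient_mean(theta, paths):
    # Placeholder for the Bayesian PG evaluation step: under Model 1 or 2 this
    # would return E[grad eta_B(theta) | D_M] via a GP with a Fisher-type kernel.
    return -theta + np.mean(paths, axis=0)            # fake gradient for illustration

def bpg(theta0, M=20, N=100, step=0.1, tol=1e-4):
    theta = np.array(theta0, dtype=float)
    for _ in range(N):
        paths = sample_paths(theta, M)                # D_M = {xi_1, ..., xi_M}
        grad = posterior_gradient_mean(theta, paths)  # posterior mean of the gradient
        theta += step * grad                          # ascend the expected return
        if np.linalg.norm(grad) < tol:                # stop when the estimate is near zero
            break
    return theta

print(bpg(np.zeros(4)))
```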
As mentioned earlier, the kernel functions used in Models 1 and 2 are both based
on the Fisher information matrix G (θ ). Consequently, every time we update the
policy parameters we need to recompute G . In most practical situations, G is not
known and needs to be estimated. Ghavamzadeh and Engel (2006) described two
possible approaches to this problem: MC estimation of G and maximum likelihood
(ML) estimation of the MDP's dynamics, using it to calculate G. They empirically
showed that even when G is estimated using MC or ML, BPG performs better than
MC-based PG algorithms.
BPG may be made significantly more efficient, both in time and memory, by
sparsifying the solution. Such sparsification may be performed incrementally, and
helps to numerically stabilize the algorithm when the kernel matrix is singular, or
nearly so. Similar to the GPTD case, one possibility is to use the on-line sparsifi-
cation method proposed by Engel et al (2002) to selectively add a new observed
path to a set of dictionary paths, which are used as a basis for approximating the
full solution. Finally, it is easy to show that the BPG models and algorithms can be
extended to POMDPs along the same lines as in Baxter and Bartlett (2001).
where r̄(z) is the mean reward for the state-action pair z, and μ^π(z) = ∑_{t=0}^{∞} γ^t P_t^π(z) is a discounted weighting of state-action pairs encountered while following policy π. Integrating a out of μ^π(z) = μ^π(s,a) results in the corresponding discounted weighting of states encountered by following policy π: ρ^π(s) = ∫_A da μ^π(s,a). Unlike ρ^π and μ^π, (1−γ)ρ^π and (1−γ)μ^π are distributions. They are analogous to the stationary distributions over states and state-action pairs of policy π in the undiscounted setting, since as γ → 1, they tend to these stationary distributions, if they exist. The policy gradient theorem (Marbach, 1998, Proposition 1; Sutton et al, 2000, Theorem 1; Konda and Tsitsiklis, 2000, Theorem 1) states that the gradient of the expected return for parameterized policies is given by

$$\nabla \eta(\theta) = \int ds\, da\, \rho(s;\theta)\, \nabla \pi(a|s;\theta)\, Q(s,a;\theta) = \int_{\mathcal{Z}} dz\, \mu(z;\theta)\, \nabla \log \pi(a|s;\theta)\, Q(z;\theta). \qquad (11.17)$$
Observe that if b : S → ℝ is an arbitrary function of s (also called a baseline), then

$$\int_{\mathcal{Z}} ds\, da\, \rho(s;\theta)\, \nabla \pi(a|s;\theta)\, b(s) = \int_{\mathcal{S}} ds\, \rho(s;\theta)\, b(s)\, \nabla \int_{\mathcal{A}} da\, \pi(a|s;\theta) = \int_{\mathcal{S}} ds\, \rho(s;\theta)\, b(s)\, \nabla 1 = 0,$$

and thus, for any baseline b(s), Eq. 11.17 may be written as

$$\nabla \eta(\theta) = \int_{\mathcal{Z}} dz\, \mu(z;\theta)\, \nabla \log \pi(a|s;\theta)\, \big[Q(z;\theta) + b(s)\big]. \qquad (11.18)$$
Now consider the case in which the action-value function for a fixed policy π , Qπ ,
is approximated by a learned function approximator. If the approximation is suffi-
ciently good, we may hope to use it in place of Qπ in Eqs. 11.17 and 11.18, and still
point roughly in the direction of the true gradient. Sutton et al (2000) and Konda
and Tsitsiklis (2000) showed that if the approximation Q̂π (·; w) with parameter w
is compatible, i.e., ∇_w Q̂^π(s,a;w) = ∇ log π(a|s;θ), and if it minimizes the mean squared error

$$\mathcal{E}^{\pi}(w) = \int_{\mathcal{Z}} dz\, \mu^{\pi}(z)\, \big[Q^{\pi}(z) - \hat{Q}^{\pi}(z;w)\big]^2 \qquad (11.19)$$
for parameter value w∗ , then we may replace Qπ with Q̂π (·; w∗ ) in Eqs. 11.17
and 11.18. An approximation for the action-value function, in terms of a linear
combination of basis functions, may be written as Q̂^π(z;w) = w^⊤ψ(z). This approximation is compatible if the ψ's are compatible with the policy, i.e., ψ(z;θ) = ∇ log π(a|s;θ). It can be shown that the mean squared-error problems of Eq. 11.19 and

$$\mathcal{E}^{\pi}(w) = \int_{\mathcal{Z}} dz\, \mu^{\pi}(z)\, \big[Q^{\pi}(z) - w^{\top}\psi(z) - b(s)\big]^2 \qquad (11.20)$$
have the same solutions (e.g., Bhatnagar et al 2007, 2009), and if the parameter w is
set to be equal to w∗ in Eq. 11.20, then the resulting mean squared error E π (w∗ )
is further minimized by setting b(s) = V π (s) (Bhatnagar et al, 2007, 2009). In
other words, the variance in the action-value function estimator is minimized if the
baseline is chosen to be the value function itself. This means that it is more mean-
ingful to consider w*^⊤ψ(z) as the least-squares optimal parametric representation
for the advantage function Aπ (s,a) = Qπ (s,a) − V π (s) rather than the action-value
function Qπ (s,a).
We are now in a position to describe the main idea behind the BAC approach.
Making use of the linearity of Eq. 11.17 in Q and denoting g(z;θ) = μ^π(z)∇ log π(a|s;θ), we obtain the following expressions for the posterior moments of the policy gradient (O'Hagan, 1991):

$$E[\nabla \eta(\theta) \mid D_t] = \int_{\mathcal{Z}} dz\, g(z;\theta)\, \hat{Q}_t(z;\theta) = \int_{\mathcal{Z}} dz\, g(z;\theta)\, k_t(z)^{\top} \alpha_t,$$
$$\mathrm{Cov}[\nabla \eta(\theta) \mid D_t] = \int_{\mathcal{Z}^2} dz\, dz'\, g(z;\theta)\, \hat{S}_t(z,z')\, g(z';\theta)^{\top} = \int_{\mathcal{Z}^2} dz\, dz'\, g(z;\theta)\, \big[k(z,z') - k_t(z)^{\top} C_t\, k_t(z')\big]\, g(z';\theta)^{\top}, \qquad (11.21)$$
where Q̂t and Ŝt are the posterior moments of Q computed by the GPTD critic from
Eq. 11.9.
These equations provide us with the general form of the posterior policy gradient
moments. We are now left with a computational issue, namely, how to compute the
following integrals appearing in these expressions?
$$U_t = \int_{\mathcal{Z}} dz\, g(z;\theta)\, k_t(z)^{\top} \quad \text{and} \quad V = \int_{\mathcal{Z}^2} dz\, dz'\, g(z;\theta)\, k(z,z')\, g(z';\theta)^{\top}. \qquad (11.22)$$

Using the definitions in Eq. 11.22, we may write the gradient posterior moments compactly as

$$E[\nabla \eta(\theta) \mid D_t] = U_t\, \alpha_t, \qquad \mathrm{Cov}[\nabla \eta(\theta) \mid D_t] = V - U_t\, C_t\, U_t^{\top}. \qquad (11.23)$$
Ghavamzadeh and Engel (2007) showed that in order to render these integrals analytically tractable, the prior covariance kernel should be defined as k(z,z′) = k_s(s,s′) + k_F(z,z′), the sum of an arbitrary state-kernel k_s and the Fisher kernel between state-action pairs k_F(z,z′) = u(z)^⊤ G(θ)^{-1} u(z′). They proved that using this prior covariance kernel, U_t and V from Eq. 11.22 satisfy U_t = [u(z_0), ..., u(z_t)] and V = G(θ). When the posterior moments of the gradient of the expected return are available, a Bayesian actor-critic (BAC) algorithm can be easily derived by updating the policy parameters in the direction of the mean.
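Given the GPTD critic's α_t and C_t and the score vectors u(z_i), the BAC gradient moments of Eq. 11.23 reduce to a few matrix products, as sketched below with randomly generated stand-ins for these quantities; the dimensions and values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 3, 10                      # number of policy parameters, number of samples

# Stand-ins for quantities a real implementation would produce:
U = rng.normal(size=(d, t + 1))   # U_t = [u(z_0), ..., u(z_t)], Fisher scores
G = U @ U.T / (t + 1)             # crude estimate of the Fisher information matrix
alpha = rng.normal(size=t + 1)    # GPTD posterior weights alpha_t
C = np.eye(t + 1) * 0.1           # GPTD posterior matrix C_t

grad_mean = U @ alpha             # E[grad eta | D_t] = U_t alpha_t
grad_cov = G - U @ C @ U.T        # Cov[grad eta | D_t] = V - U_t C_t U_t^T, with V = G

theta = np.zeros(d)
theta += 0.05 * grad_mean         # actor update in the direction of the mean
print(grad_mean, "\n", grad_cov)
```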
Similar to the BPG case in Section 11.2.2, the Fisher information matrix of each
policy may be estimated using MC or ML methods, and the algorithm may be made
significantly more efficient, both in time and memory, and more numerically sta-
ble by sparsifying the solution using for example the online sparsification method
of Engel et al (2002).
value 0 otherwise). This Kronecker delta reflects the assumption that unknown pa-
rameters are stationary, i.e., θ does not change with time. The observation function
Z_P(s′,θ,a,o) = P(o|s′,θ,a) indicates the probability of making an observation o when joint state ⟨s′,θ⟩ is reached after executing action a. Since the observations are the MDP states, P(o|s′,θ,a) = δ_{s′}(o).
We can formulate a belief-state MDP over this POMDP by defining beliefs over
the unknown parameters θ_a^{s,s′}. The key point is that this belief-state MDP is fully
observable even though the original RL problem involves hidden quantities. This
formulation effectively turns the reinforcement learning problem into a planning
problem in the space of beliefs over the unknown MDP model parameters.
For discrete MDPs a natural representation of beliefs is via Dirichlet distribu-
tions, as Dirichlets are conjugate densities of multinomials (DeGroot, 1970). A
Dirichlet distribution Dir(p;n) ∝ ∏_i p_i^{n_i−1} over a multinomial p is parameterized by positive numbers n_i, such that n_i − 1 can be interpreted as the number of times that the p_i-probability event has been observed. Since each feasible transition ⟨s,a,s′⟩
pertains only to one of the unknowns, we can model beliefs as products of Dirichlets,
one for each unknown model parameter θ_a^{s,s′}.
Belief monitoring in this POMDP corresponds to Bayesian updating of the be-
liefs based on observed state transitions. For a prior belief b(θ ) = Dir(θ ; n) over
some transition parameter θ, when a specific (s,a,s′) transition is observed in the environment, the posterior belief is analytically computed by Bayes' rule, b′(θ) ∝ θ_a^{s,s′} b(θ). If we represent belief states by a tuple ⟨s, {n_a^{s,s′}}⟩ consisting of the current state s and the hyperparameters n_a^{s,s′} for each Dirichlet, belief updating simply amounts to setting the current state to s′ and incrementing by one the hyperparameter n_a^{s,s′} that matches the observed transition ⟨s,a,s′⟩.
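A minimal sketch of this belief representation and update is given below: the belief is a table of Dirichlet hyperparameters n_a^{s,s′}, here initialized to 1 (an assumption corresponding to a uniform prior), and observing a transition simply increments one count.

```python
from collections import defaultdict

class DirichletBelief:
    """Belief over unknown transition parameters theta_a^{s,s'} of a discrete MDP,
    represented as one Dirichlet per (s, a) with hyperparameters n_a^{s,s'}."""

    def __init__(self, n_states, prior_count=1.0):
        self.n_states = n_states
        self.counts = defaultdict(lambda: prior_count)   # key: (s, a, s')

    def update(self, s, a, s_next):
        # Bayesian belief update b'(theta) proportional to theta_a^{s,s'} b(theta)
        # amounts to incrementing the matching hyperparameter by one.
        self.counts[(s, a, s_next)] += 1.0

    def mean_transition(self, s, a):
        # Posterior mean transition probabilities P(s'|s,b,a): normalized counts (Eq. 11.25).
        row = [self.counts[(s, a, s2)] for s2 in range(self.n_states)]
        total = sum(row)
        return [c / total for c in row]

b = DirichletBelief(n_states=3)
b.update(0, 1, 2)
b.update(0, 1, 2)
print(b.mean_transition(0, 1))   # [0.2, 0.2, 0.6]
```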
The POMDP formulation of Bayesian reinforcement learning provides a natural
framework to reason about the exploration/exploitation tradeoff. Since beliefs en-
code all the information gained by the learner (i.e., sufficient statistics of the history
of past actions and observations) and an optimal POMDP policy is a mapping from
beliefs to actions that maximizes the expected total rewards, it follows that an op-
timal POMDP policy naturally optimizes the exploration/exploitation tradeoff. In
other words, since the goal in balancing exploitation (immediate gain) and explo-
ration (information gain) is to maximize the overall sum of rewards, then the best
tradeoff is achieved by the best POMDP policy. Note however that this assumes that
the prior belief is accurate and that computation is exact, which is rarely the case
in practice. Nevertheless, the POMDP formulation provides a useful formalism to
design algorithms that naturally handle the exploration/exploitation tradeoff.
The POMDP formulation reduces the RL problem to a planning problem with
special structure. In the next section we derive the parameterization of the optimal
value function, which can be computed exactly by dynamic programming (Poupart
et al, 2006). However, since the complexity grows exponentially with the planning
horizon, we also discuss some approximations.
Here s is the current nominal MDP state, b is the current belief over the model parameters θ, and b_a^{s,s′} is the updated belief after transition ⟨s,a,s′⟩. The transition model is defined as

$$P(s' \mid s,b,a) = \int_{\theta} d\theta\, b(\theta)\, P(s' \mid s,\theta,a) = \int_{\theta} d\theta\, b(\theta)\, \theta_a^{s,s'}, \qquad (11.25)$$
and is just the average transition probability P(s′|s,a) with respect to belief b. Since
an optimal POMDP policy achieves by definition the highest attainable expected
future reward, it follows that such a policy would automatically optimize the explo-
ration/exploitation tradeoff in the original RL problem.
It is known (see, e.g., chapter 12 in this book) that the optimal finite-horizon
value function of a POMDP with discrete states and actions is piecewise linear and
convex, and it corresponds to the upper envelope of a set Γ of linear segments called
α-vectors: V*(b) = max_{α∈Γ} α(b). In the literature, α is both defined as a linear function of b (i.e., α(b)) and as a vector over s (i.e., α(s)) such that α(b) = ∑_s b(s)α(s). Hence, for discrete POMDPs, value functions can be parameterized
by a set of α -vectors each represented as a vector of values for each state. Conve-
niently, this parameterization is closed under Bellman backups.
In the case of Bayesian RL, despite the hybrid nature of the state space, the piece-
wise linearity and convexity of the value function may still hold as demonstrated by
Duff (2002) and Porta et al (2005). In particular, the optimal finite-horizon value
function of a discrete-action POMDP corresponds to the upper envelope of a set Γ
of linear segments called α -functions (due to the continuous nature of the POMDP
state θ), which can be grouped in subsets per nominal state s:

$$V_s^*(b) = \max_{\alpha \in \Gamma} \alpha_s(b). \qquad (11.26)$$

Suppose that the optimal value function V_s^k(b) for k steps-to-go is composed of a set Γ^k of α-functions such that V_s^k(b) = max_{α∈Γ^k} α_s(b). Using Bellman's equation, we
can compute by dynamic programming the best set Γ k+1 representing the optimal
value function V k+1 with k + 1 stages-to-go. First we rewrite Bellman’s equation
(Eq. 11.24) by substituting V k for the maximum over the α -functions in Γ k as in
Eq. 11.26:
$$V_s^{k+1}(b) = \max_a \Big[ R(b,a) + \gamma \sum_{s'} P(s' \mid s,b,a) \max_{\alpha \in \Gamma^k} \alpha_{s'}(b_a^{s,s'}) \Big]. \qquad (11.27)$$

Then we decompose Bellman's equation in three steps. The first step finds the maximal α-function for each a and s′. The second step finds the best action a. The third step performs the actual Bellman backup using the maximal action and α-functions:

$$\alpha_{b,a}^{s,s'} = \arg\max_{\alpha \in \Gamma^k} \alpha_{s'}(b_a^{s,s'}) \qquad (11.28)$$
$$a_b^s = \arg\max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,b,a)\, \alpha_{b,a}^{s,s'}(b_a^{s,s'}) \Big] \qquad (11.29)$$
$$V_s^{k+1}(b) = R(s,a_b^s) + \gamma \sum_{s'} P(s' \mid s,b,a_b^s)\, \alpha_{b,a_b^s}^{s,s'}(b_{a_b^s}^{s,s'}) \qquad (11.30)$$
We can further rewrite the third step by using α-functions in terms of θ (instead of b) and expanding the belief state b_{a_b^s}^{s,s'}:

$$V_s^{k+1}(b) = R(s,a_b^s) + \gamma \sum_{s'} P(s' \mid s,b,a_b^s) \int_{\theta} d\theta\, b_{a_b^s}^{s,s'}(\theta)\, \alpha_{b,a_b^s}^{s,s'}(\theta) \qquad (11.31)$$
$$= R(s,a_b^s) + \gamma \sum_{s'} P(s' \mid s,b,a_b^s) \int_{\theta} d\theta\, \frac{b(\theta)\, P(s' \mid s,\theta,a_b^s)}{P(s' \mid s,b,a_b^s)}\, \alpha_{b,a_b^s}^{s,s'}(\theta) \qquad (11.32)$$
$$= R(s,a_b^s) + \gamma \sum_{s'} \int_{\theta} d\theta\, b(\theta)\, P(s' \mid s,\theta,a_b^s)\, \alpha_{b,a_b^s}^{s,s'}(\theta) \qquad (11.33)$$
$$= \int_{\theta} d\theta\, b(\theta) \Big[ R(s,a_b^s) + \gamma \sum_{s'} P(s' \mid s,\theta,a_b^s)\, \alpha_{b,a_b^s}^{s,s'}(\theta) \Big] \qquad (11.34)$$

The bracketed term defines the α-function

$$\alpha_{b,s}(\theta) = R(s,a_b^s) + \gamma \sum_{s'} P(s' \mid s,\theta,a_b^s)\, \alpha_{b,a_b^s}^{s,s'}(\theta). \qquad (11.35)$$

For every b we define such an α-function, and together all α_{b,s} form the set Γ^{k+1}. Since each α_{b,s} was defined by using the optimal action and α-functions in Γ^k, it follows that each α_{b,s} is necessarily optimal at b and we can introduce a max over all α-functions with no loss:

$$V_s^{k+1}(b) = \int_{\theta} d\theta\, b(\theta)\, \alpha_{b,s}(\theta) = \alpha_s(b) = \max_{\alpha \in \Gamma^{k+1}} \alpha_s(b). \qquad (11.36)$$
Based on the above we can show the following (we refer to the original paper for
the proof):
Theorem 11.1 (Poupart et al (2006)). The α -functions in Bayesian RL are linear
combinations of products of (unnormalized) Dirichlets.
where the $w_i^\theta$'s are the importance weights of the sampled models depending on the
proposal distribution used. Dearden et al (1999) describe several efficient procedures
to sample the models from some proposal distributions that may be easier to work
with than P(θ ).
An alternative myopic Bayesian action selection strategy is Thompson sampling, which involves sampling just one MDP from the current belief, solving this MDP to optimality (e.g., by dynamic programming), and executing the optimal action at the current state (Thompson, 1933; Strens, 2000); this strategy reportedly tends to over-explore (Wang et al, 2005).
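As an illustration, here is a minimal sketch of Thompson sampling with independent Dirichlet posteriors over the rows of the transition model and a known reward function; the function names and array layout are illustrative assumptions, not taken from the cited papers.

import numpy as np

def sample_mdp(dirichlet_counts):
    """Draw one transition model T[s, a, s'] from independent Dirichlet posteriors."""
    S, A, _ = dirichlet_counts.shape
    T = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            T[s, a] = np.random.dirichlet(dirichlet_counts[s, a])
    return T

def solve_mdp(T, R, gamma=0.95, iters=500):
    """Value iteration on the sampled MDP; returns its optimal Q-values."""
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        Q = R + gamma * T @ V   # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] V[s']
        V = Q.max(axis=1)
    return Q

def thompson_action(s, dirichlet_counts, R, gamma=0.95):
    """Sample one MDP from the belief, solve it, and act greedily in state s."""
    Q = solve_mdp(sample_mdp(dirichlet_counts), R, gamma)
    return int(np.argmax(Q[s]))

After each observed transition (s, a, s') the posterior is updated simply by incrementing dirichlet_counts[s, a, s'].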
One may achieve a less myopic action selection strategy by trying to compute a
near-optimal policy in the belief-state MDP of the POMDP (see previous section).
Since this is just an MDP (albeit continuous and with a special structure), one may
use any approximate solver for MDPs. Wang et al (2005); Ross and Pineau (2008)
have pursued this idea by applying the sparse sampling algorithm of Kearns et al
(1999) on the belief-state MDP. This approach carries out an explicit lookahead to
the effective horizon starting from the current belief, backing up rewards through the
tree by dynamic programming or linear programming (Castro and Precup, 2007), re-
sulting in a near-Bayes-optimal exploratory action. The search through the tree does not, however, produce a policy that generalizes over the belief space, and a new tree has to be generated at each time step, which can be expensive in practice.
Presumably the sparse sampling approach can be combined with an approach that
generalizes over the belief space via an α -function parameterization as in BEETLE,
although no algorithm of that type has been reported so far.
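The lookahead just described can be sketched recursively. The following is only an illustration of the idea (not the algorithm of any of the cited papers), assuming a user-supplied generative simulator simulate(belief, a) of the belief-state MDP that returns a sampled reward and next belief.

import numpy as np

def sparse_sampling_value(belief, depth, actions, simulate, n_samples=5, gamma=0.95):
    """Estimate the Bayes-optimal value of a belief by depth-limited sparse sampling."""
    if depth == 0:
        return 0.0
    q_estimates = []
    for a in actions:
        total = 0.0
        for _ in range(n_samples):
            reward, next_belief = simulate(belief, a)   # sample s', o and update the belief
            total += reward + gamma * sparse_sampling_value(
                next_belief, depth - 1, actions, simulate, n_samples, gamma)
        q_estimates.append(total / n_samples)
    return max(q_estimates)

def sparse_sampling_action(belief, depth, actions, simulate, n_samples=5, gamma=0.95):
    """Pick the action with the highest sparse-sampling estimate at the root belief."""
    scores = []
    for a in actions:
        total = 0.0
        for _ in range(n_samples):
            reward, next_belief = simulate(belief, a)
            total += reward + gamma * sparse_sampling_value(
                next_belief, depth - 1, actions, simulate, n_samples, gamma)
        scores.append(total / n_samples)
    return actions[int(np.argmax(scores))]

The tree has on the order of (|A| n_samples)^depth nodes, which is why it has to be rebuilt at every time step and why the effective depth must remain small.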
Multi-task learning (MTL) is an important learning paradigm and has recently been
an area of active research in machine learning (e.g., Caruana 1997; Baxter 2000).
A common setup is that there are multiple related tasks for which we are inter-
ested in improving the performance over individual learning by sharing information
across the tasks. This transfer of information is particularly important when we are
provided with only a limited amount of data to learn each task. Exploiting data from
related problems provides more training samples for the learner and can improve the
performance of the resulting solution. More formally, the main objective in MTL is
to maximize the improvement over individual learning averaged over the tasks. This
should be distinguished from transfer learning in which the goal is to learn a suitable
bias for a class of tasks in order to maximize the expected future performance.
RL algorithms often need a large number of samples to solve a problem
and cannot directly take advantage of the information coming from other similar
tasks. However, recent work has shown that transfer and multi-task learning tech-
niques can be employed in RL to reduce the number of samples needed to achieve
nearly-optimal solutions. All approaches to multi-task RL (MTRL) assume that the
tasks share similarity in some components of the problem such as dynamics, reward
structure, or value function. While some methods explicitly assume that the shared
components are drawn from a common generative model (Wilson et al, 2007; Mehta
et al, 2008; Lazaric and Ghavamzadeh, 2010), this assumption is more implicit in
others (Taylor et al, 2007; Lazaric et al, 2008). In Mehta et al (2008), tasks share
the same dynamics and reward features, and only differ in the weights of the reward
function. The proposed method initializes the value function for a new task using
the previously learned value functions as a prior. Wilson et al (2007) and Lazaric
and Ghavamzadeh (2010) both assume that the distribution over some components
of the tasks is drawn from a hierarchical Bayesian model (HBM). We describe these
two methods in more detail below.
Lazaric and Ghavamzadeh (2010) study the MTRL scenario in which the learner
is provided with a number of MDPs with common state and action spaces. For any
given policy, only a small number of samples can be generated in each MDP, which
may not be enough to accurately evaluate the policy. In such an MTRL problem,
it is necessary to identify classes of tasks with similar structure and to learn them
jointly. It is important to note that here a task is a pair of an MDP and a policy, such that all the MDPs have the same state and action spaces. They consider a particular
class of MTRL problems in which the tasks share structure in their value functions.
To allow the value functions to share a common structure, it is assumed that they
are all sampled from a common prior. They adopt the GPTD value function model
(see Section 11.2.1) for each task, model the distribution over the value functions
using a HBM, and develop solutions to the following problems: (i) joint learning of
the value functions (multi-task learning), and (ii) efficient transfer of the informa-
tion acquired in (i) to facilitate learning the value function of a newly observed task
(transfer learning). They first present a HBM for the case in which all the value func-
tions belong to the same class, and derive an EM algorithm to find MAP estimates of
the value functions and the model’s hyper-parameters. However, if the functions do
not belong to the same class, simply learning them together can be detrimental (neg-
ative transfer). It is therefore important to have models that will generally benefit
from related tasks and will not hurt performance when the tasks are unrelated. This
When transfer learning and multi-task learning are not possible, the learner may still
want to use domain knowledge to reduce the complexity of the learning task. In non-
Bayesian reinforcement learning, domain knowledge is often implicitly encoded in
the choice of features used to encode the state space, parametric form of the value
function, or the class of policies considered. In Bayesian reinforcement learning, the
prior distribution provides an explicit and expressive mechanism to encode domain
knowledge. Instead of starting with a non-informative prior (e.g., uniform, Jeffreys
prior), one can reduce the need for data by specifying a prior that biases the learning
towards parameters that a domain expert feels are more likely.
For instance, in model-based Bayesian reinforcement learning, Dirichlet distri-
butions over the transition and reward distributions can naturally encode an expert’s
bias. Recall that the hyperparameters ni − 1 of a Dirichlet can be interpreted as the
number of times that the pi -probability event has been observed. Hence, if the ex-
pert has access to prior data where each event occurred ni − 1 times or has reasons
to believe that each event would occur ni − 1 times in a fictitious experiment, then a
corresponding Dirichlet can be used as an informative prior. Alternatively, if one has
some belief or prior data to estimate the mean and variance of some unknown multi-
nomial, then the hyperparameters of the Dirichlet can be set by moment matching.
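As an illustration of the moment-matching option, the following sketch (the function name is hypothetical) solves for the Dirichlet hyperparameters given the expert's estimated mean vector and the variance of one chosen component.

import numpy as np

def dirichlet_from_moments(mean, var, component=0):
    """Moment-match a Dirichlet prior.

    For Dir(n_1, ..., n_K) with n_0 = sum_i n_i the marginal moments are
        E[p_i] = n_i / n_0   and   Var[p_i] = E[p_i] (1 - E[p_i]) / (n_0 + 1),
    so n_0 follows from the desired variance of one component and the
    hyperparameters are the means scaled by n_0.
    """
    mean = np.asarray(mean, dtype=float)
    m = mean[component]
    n0 = m * (1.0 - m) / var - 1.0
    if n0 <= 0:
        raise ValueError("requested variance is too large for this mean")
    return mean * n0

# Example: the expert believes the outcome probabilities are roughly (0.7, 0.2, 0.1)
# and is fairly confident about the first one (standard deviation about 0.05).
print(dirichlet_from_moments([0.7, 0.2, 0.1], var=0.05 ** 2))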
A drawback of the Dirichlet distribution is that it only allows unimodal priors to
be expressed. However, mixtures of Dirichlets can be used to express multimodal
distributions. In fact, since Dirichlets are monomials (i.e., $\mathrm{Dir}(\theta) = \prod_i \theta_i^{n_i}$), mixtures of Dirichlets are polynomials with positive coefficients (i.e., $\sum_j c_j \prod_i \theta_i^{n_{ij}}$). So with a large enough number of mixture components it is possible to approximate any desired multimodal prior arbitrarily closely.
One of the main attractive features of the Bayesian approach to RL is the possibility of obtaining finite-sample estimates of the statistics of a given policy in terms
of posterior expected value and variance. This idea was first pursued by Mannor
et al (2007), who considered the bias and variance of the value function estimate
of a single policy. Assuming an exogenous sampling process (i.e., we only get to
observe the transitions and rewards, but not to control them), there exists a nominal
model (obtained by, say, maximum a-posteriori probability estimate) and a posterior
probability distribution over all possible models. Given a policy π and a posterior
distribution over models θ = ⟨T, r⟩, we can consider the expected posterior value function as

$E_{\tilde{T},\tilde{r}}\Big[ E_s\big[ \textstyle\sum_{t=1}^{\infty} \gamma^t\, \tilde{r}(s_t) \,\big|\, \tilde{T} \big] \Big],$   (11.38)
where the outer expectation is according to the posterior over the parameters of the
MDP model and the inner expectation is with respect to transitions given that the
model parameters are fixed. Collecting the infinite sum, we get
where T̃π and r̃π are the transition matrix and reward vector of policy π when model
< T̃ ,r̃ > is the true model. This problem maximizes the expected return over both
the trajectories and the model random variables. Because of the nonlinear effect of
T̃ on the expected return, Mannor et al (2007) argue that evaluating the objective of
this problem for a given policy is already difficult.
Assuming a Dirichlet prior for the transitions and a Gaussian prior for the re-
wards, one can obtain bias and variance estimates for the value function of a given
policy. These estimates are based on first order or second order approximations of
Equation (11.39). From a computational perspective, these estimates can be easily
computed and the value function can be de-biased. When trying to optimize over the
policy space, Mannor et al (2007) show experimentally that the common approach of using the most likely (or expected) parameters leads to a strong bias in the performance estimate of the resulting policy.
The Bayesian view for a finite sample naturally leads to the question of policy
optimization, where an additional maximum over all policies is taken in (11.38).
The standard approach in Markov decision processes is to consider the so-called
robust approach: assume the parameters of the problem belong to some uncertainty
set and find the policy with the best worst-case performance. This can be done ef-
ficiently using dynamic programming style algorithms; see Nilim and El Ghaoui
(2005); Iyengar (2005). The problem with the robust approach is that it leads to
over-conservative solutions. Moreover, the currently available algorithms require
the uncertainty in different states to be uncorrelated, meaning that the uncertainty
set is effectively taken as the Cartesian product of state-wise uncertainty sets.
One of the benefits of the Bayesian perspective is that it enables certain risk-aware approaches, since we have a probability distribution over the available models.
For example, it is possible to consider bias-variance tradeoffs in this context, where
one would maximize reward subject to variance constraints or give a penalty for
excessive variance. Mean-variance optimization in the Bayesian setup seems like
a difficult problem, and there are currently no known complexity results about it.
Curtailing this problem, Delage and Mannor (2010) present an approximation to a
risk-sensitive percentile optimization criterion:
maximizey∈R,π ∈ϒ y
∞
s.t. Pθ (Es (∑t=0 γ t rt (st )|s0 ∝ q,π ) ≥ y) ≥ 1 − ε . (11.40)
optimal Bayesian policy after a polynomial (in quantities describing the system)
number of time steps. The algorithm and analysis are reminiscent of PAC-MDP
(e.g., Brafman and Tennenholtz (2002); Strehl et al (2006)) but it explores in a
greedier style than PAC-MDP algorithms. In the second paper, Asmuth et al (2009)
present an approach that drives exploration by sampling multiple models from the
posterior and selecting actions optimistically. The decision when to re-sample the set
and how to combine the models is based on optimistic heuristics. The resulting algo-
rithm achieves near optimal reward with high probability with a sample complexity
that is low relative to the speed at which the posterior distribution converges during
learning. Finally, Fard and Pineau (2010) derive a PAC-Bayesian style bound that
allows balancing between the distribution-free PAC and the data-efficient Bayesian
paradigms.
While Bayesian Reinforcement Learning was perhaps the first kind of reinforce-
ment learning considered in the 1960s by the Operations Research community, a
recent surge of interest by the Machine Learning community has led to many ad-
vances described in this chapter. Much of this interest comes from the benefits of
maintaining explicit distributions over the quantities of interest. In particular, the
exploration/exploitation tradeoff can be naturally optimized once a distribution is
used to quantify the uncertainty about various parts of the model, value function or
gradient. Notions of risk can also be taken into account while optimizing a policy.
In this chapter we provided an overview of the state of the art regarding the use of
Bayesian techniques in reinforcement learning for a single agent in fully observable
domains. We note that Bayesian techniques have also been used in partially ob-
servable domains (Ross et al, 2007, 2008; Poupart and Vlassis, 2008; Doshi-Velez,
2009; Veness et al, 2010) and multi-agent systems (Chalkiadakis and Boutilier,
2003, 2004; Gmytrasiewicz and Doshi, 2005).
References
Aharony, N., Zehavi, T., Engel, Y.: Learning wireless network association control with Gaus-
sian process temporal difference methods. In: Proceedings of OPNETWORK (2005)
Asmuth, J., Li, L., Littman, M.L., Nouri, A., Wingate, D.: A Bayesian sampling approach to
exploration in reinforcement learning. In: Proceedings of the Twenty-Fifth Conference on
Uncertainty in Artificial Intelligence, UAI 2009, pp. 19–26. AUAI Press (2009)
Bagnell, J., Schneider, J.: Covariant policy search. In: Proceedings of the Eighteenth Interna-
tional Joint Conference on Artificial Intelligence (2003)
Barto, A., Sutton, R., Anderson, C.: Neuron-like elements that can solve difficult learn-
ing control problems. IEEE Transactions on Systems, Man, and Cybernetics 13, 835–846
(1983)
Baxter, J.: A model of inductive bias learning. Journal of Artificial Intelligence Research 12,
149–198 (2000)
Baxter, J., Bartlett, P.: Infinite-horizon policy-gradient estimation. Journal of Artificial Intel-
ligence Research 15, 319–350 (2001)
Bellman, R.: A problem in sequential design of experiments. Sankhya 16, 221–229 (1956)
Bellman, R.: Adaptive Control Processes: A Guided Tour. Princeton University Press (1961)
Bellman, R., Kalaba, R.: On adaptive control processes. Transactions on Automatic Control,
IRE 4(2), 1–9 (1959)
Bhatnagar, S., Sutton, R., Ghavamzadeh, M., Lee, M.: Incremental natural actor-critic algo-
rithms. In: Proceedings of Advances in Neural Information Processing Systems, vol. 20,
pp. 105–112. MIT Press (2007)
Bhatnagar, S., Sutton, R., Ghavamzadeh, M., Lee, M.: Natural actor-critic algorithms. Auto-
matica 45(11), 2471–2482 (2009)
Brafman, R., Tennenholtz, M.: R-max - a general polynomial time algorithm for near-optimal
reinforcement learning. JMLR 3, 213–231 (2002)
Caruana, R.: Multitask learning. Machine Learning 28(1), 41–75 (1997)
Castro, P., Precup, D.: Using linear programming for Bayesian exploration in Markov de-
cision processes. In: Proc. 20th International Joint Conference on Artificial Intelligence
(2007)
Chalkiadakis, G., Boutilier, C.: Coordination in multi-agent reinforcement learning: A
Bayesian approach. In: International Joint Conference on Autonomous Agents and Mul-
tiagent Systems (AAMAS), pp. 709–716 (2003)
Chalkiadakis, G., Boutilier, C.: Bayesian reinforcement learning for coalition formation un-
der uncertainty. In: International Joint Conference on Autonomous Agents and Multiagent
Systems (AAMAS), pp. 1090–1097 (2004)
Cozzolino, J., Gonzales-Zubieta, R., Miller, R.L.: Markovian decision processes with uncer-
tain transition probabilities. Tech. Rep. Technical Report No. 11, Research in the Control
of Complex Systems. Operations Research Center, Massachusetts Institute of Technology
(1965)
Cozzolino, J.M.: Optimal sequential decision making under uncertainty. Master’s thesis, Mas-
sachusetts Institute of Technology (1964)
Dearden, R., Friedman, N., Russell, S.: Bayesian Q-learning. In: Proceedings of the Fifteenth
National Conference on Artificial Intelligence, pp. 761–768 (1998)
Dearden, R., Friedman, N., Andre, D.: Model based Bayesian exploration. In: UAI, pp. 150–
159 (1999)
DeGroot, M.H.: Optimal Statistical Decisions. McGraw-Hill, New York (1970)
Delage, E., Mannor, S.: Percentile optimization for Markov decision processes with parame-
ter uncertainty. Operations Research 58(1), 203–213 (2010)
Dimitrakakis, C.: Complexity of stochastic branch and bound methods for belief tree search
in Bayesian reinforcement learning. In: ICAART (1), pp. 259–264 (2010)
Doshi-Velez, F.: The infinite partially observable Markov decision process. In: Neural Infor-
mation Processing Systems (2009)
Doshi-Velez, F., Wingate, D., Roy, N., Tenenbaum, J.: Nonparametric Bayesian policy priors
for reinforcement learning. In: NIPS (2010)
Duff, M.: Optimal learning: Computational procedures for Bayes-adaptive Markov decision
processes. PhD thesis, University of Massachusetts Amherst (2002)
Duff, M.: Design for an optimal probe. In: ICML, pp. 131–138 (2003)
Engel, Y.: Algorithms and representations for reinforcement learning. PhD thesis, The
Hebrew University of Jerusalem, Israel (2005)
Engel, Y., Mannor, S., Meir, R.: Sparse Online Greedy Support Vector Regression. In: Elo-
maa, T., Mannila, H., Toivonen, H. (eds.) ECML 2002. LNCS (LNAI), vol. 2430, pp.
84–96. Springer, Heidelberg (2002)
Engel, Y., Mannor, S., Meir, R.: Bayes meets Bellman: The Gaussian process approach to
temporal difference learning. In: Proceedings of the Twentieth International Conference
on Machine Learning, pp. 154–161 (2003)
Engel, Y., Mannor, S., Meir, R.: Reinforcement learning with Gaussian processes. In: Pro-
ceedings of the Twenty Second International Conference on Machine Learning, pp. 201–
208 (2005a)
Engel, Y., Szabo, P., Volkinshtein, D.: Learning to control an octopus arm with Gaussian
process temporal difference methods. In: Proceedings of Advances in Neural Information
Processing Systems, vol. 18, pp. 347–354. MIT Press (2005b)
Fard, M.M., Pineau, J.: PAC-Bayesian model selection for reinforcement learning. In: Laf-
ferty, J., Williams, C.K.I., Shawe-Taylor, J., Zemel, R., Culotta, A. (eds.) Advances in
Neural Information Processing Systems, vol. 23, pp. 1624–1632 (2010)
Ghavamzadeh, M., Engel, Y.: Bayesian policy gradient algorithms. In: Proceedings of
Advances in Neural Information Processing Systems, vol. 19, MIT Press (2006)
Ghavamzadeh, M., Engel, Y.: Bayesian Actor-Critic algorithms. In: Proceedings of the
Twenty-Fourth International Conference on Machine Learning (2007)
Gmytrasiewicz, P., Doshi, P.: A framework for sequential planning in multi-agent settings.
Journal of Artificial Intelligence Research (JAIR) 24, 49–79 (2005)
Greensmith, E., Bartlett, P., Baxter, J.: Variance reduction techniques for gradient estimates
in reinforcement learning. Journal of Machine Learning Research 5, 1471–1530 (2004)
Iyengar, G.N.: Robust dynamic programming. Mathematics of Operations Research 30(2),
257–280 (2005)
Jaakkola, T., Haussler, D.: Exploiting generative models in discriminative classifiers. In: Pro-
ceedings of Advances in Neural Information Processing Systems, vol. 11, MIT Press
(1999)
Kaelbling, L.P.: Learning in Embedded Systems. MIT Press (1993)
Kakade, S.: A natural policy gradient. In: Proceedings of Advances in Neural Information
Processing Systems, vol. 14 (2002)
Kearns, M., Mansour, Y., Ng, A.: A sparse sampling algorithm for near-optimal planning in
large Markov decision processes. In: Proc. IJCAI (1999)
Kolter, J.Z., Ng, A.Y.: Near-Bayesian exploration in polynomial time. In: Proceedings of the
26th Annual International Conference on Machine Learning, ICML 2009, pp. 513–520.
ACM, New York (2009)
Konda, V., Tsitsiklis, J.: Actor-Critic algorithms. In: Proceedings of Advances in Neural In-
formation Processing Systems, vol. 12, pp. 1008–1014 (2000)
Lazaric, A., Ghavamzadeh, M.: Bayesian multi-task reinforcement learning. In: Proceed-
ings of the Twenty-Seventh International Conference on Machine Learning, pp. 599–606
(2010)
Lazaric, A., Restelli, M., Bonarini, A.: Transfer of samples in batch reinforcement learning.
In: Proceedings of ICML, vol. 25, pp. 544–551 (2008)
Mannor, S., Simester, D., Sun, P., Tsitsiklis, J.N.: Bias and variance approximation in value
function estimates. Management Science 53(2), 308–322 (2007)
Marbach, P.: Simulated-based methods for Markov decision processes. PhD thesis, Mas-
sachusetts Institute of Technology (1998)
Martin, J.J.: Bayesian decision problems and Markov chains. John Wiley, New York (1967)
Mehta, N., Natarajan, S., Tadepalli, P., Fern, A.: Transfer in variable-reward hierarchical
reinforcement learning. Machine Learning 73(3), 289–312 (2008)
Meuleau, N., Bourgine, P.: Exploration of multi-state environments: local measures and back-
propagation of uncertainty. Machine Learning 35, 117–154 (1999)
Nilim, A., El Ghaoui, L.: Robust control of Markov decision processes with uncertain transi-
tion matrices. Operations Research 53(5), 780–798 (2005)
O’Hagan, A.: Monte Carlo is fundamentally unsound. The Statistician 36, 247–249 (1987)
O’Hagan, A.: Bayes-Hermite quadrature. Journal of Statistical Planning and Inference 29,
245–260 (1991)
Pavlov, M., Poupart, P.: Towards global reinforcement learning. In: NIPS Workshop on Model
Uncertainty and Risk in Reinforcement Learning (2008)
Peters, J., Schaal, S.: Reinforcement learning of motor skills with policy gradients. Neural
Networks 21(4), 682–697 (2008)
Peters, J., Vijayakumar, S., Schaal, S.: Reinforcement learning for humanoid robotics. In: Pro-
ceedings of the Third IEEE-RAS International Conference on Humanoid Robots (2003)
Peters, J., Vijayakumar, S., Schaal, S.: Natural Actor-Critic. In: Gama, J., Camacho, R.,
Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp.
280–291. Springer, Heidelberg (2005)
Porta, J.M., Spaan, M.T., Vlassis, N.: Robot planning in partially observable continuous do-
mains. In: Proc. Robotics: Science and Systems (2005)
Poupart, P., Vlassis, N.: Model-based Bayesian reinforcement learning in partially observable
domains. In: International Symposium on Artificial Intelligence and Mathematics, ISAIM
(2008)
Poupart, P., Vlassis, N., Hoey, J., Regan, K.: An analytic solution to discrete Bayesian rein-
forcement learning. In: Proc. Int. Conf. on Machine Learning, Pittsburgh, USA (2006)
Rasmussen, C., Ghahramani, Z.: Bayesian Monte Carlo. In: Proceedings of Advances in Neu-
ral Information Processing Systems, vol. 15, pp. 489–496. MIT Press (2003)
Rasmussen, C., Williams, C.: Gaussian Processes for Machine Learning. MIT Press (2006)
Reisinger, J., Stone, P., Miikkulainen, R.: Online kernel selection for Bayesian reinforcement
learning. In: Proceedings of the Twenty-Fifth Conference on Machine Learning, pp. 816–
823 (2008)
Ross, S., Pineau, J.: Model-based Bayesian reinforcement learning in large structured do-
mains. In: Uncertainty in Artificial Intelligence, UAI (2008)
Ross, S., Chaib-Draa, B., Pineau, J.: Bayes-adaptive POMDPs. In: Advances in Neural Infor-
mation Processing Systems, NIPS (2007)
Ross, S., Chaib-Draa, B., Pineau, J.: Bayesian reinforcement learning in continuous POMDPs
with application to robot navigation. In: IEEE International Conference on Robotics and
Automation (ICRA), pp. 2845–2851 (2008)
Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University
Press (2004)
Silver, E.A.: Markov decision processes with uncertain transition probabilities or rewards.
Tech. Rep. Technical Report No. 1, Research in the Control of Complex Systems. Opera-
tions Research Center, Massachusetts Institute of Technology (1963)
Spaan, M.T.J., Vlassis, N.: Perseus: Randomized point-based value iteration for POMDPs.
Journal of Artificial Intelligence Research 24, 195–220 (2005)
Strehl, A.L., Li, L., Littman, M.L.: Incremental model-based learners with formal learning-
time guarantees. In: UAI (2006)
Strens, M.: A Bayesian framework for reinforcement learning. In: ICML (2000)
Sutton, R.: Temporal credit assignment in reinforcement learning. PhD thesis, University of
Massachusetts Amherst (1984)
Sutton, R.: Learning to predict by the methods of temporal differences. Machine Learning 3,
9–44 (1988)
Sutton, R., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforcement
learning with function approximation. In: Proceedings of Advances in Neural Information
Processing Systems, vol. 12, pp. 1057–1063 (2000)
Taylor, M., Stone, P., Liu, Y.: Transfer learning via inter-task mappings for temporal differ-
ence learning. JMLR 8, 2125–2167 (2007)
Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view of
the evidence of two samples. Biometrika 25, 285–294 (1933)
Veness, J., Ng, K.S., Hutter, M., Silver, D.: Reinforcement learning via AIXI approximation.
In: AAAI (2010)
Wang, T., Lizotte, D., Bowling, M., Schuurmans, D.: Bayesian sparse sampling for on-line
reward optimization. In: ICML (2005)
Watkins, C.: Learning from delayed rewards. PhD thesis, Kings College, Cambridge, England
(1989)
Wiering, M.: Explorations in efficient reinforcement learning. PhD thesis, University of
Amsterdam (1999)
Williams, R.: Simple statistical gradient-following algorithms for connectionist reinforce-
ment learning. Machine Learning 8, 229–256 (1992)
Wilson, A., Fern, A., Ray, S., Tadepalli, P.: Multi-task reinforcement learning: A hierarchical
Bayesian approach. In: Proceedings of ICML, vol. 24, pp. 1015–1022 (2007)
Chapter 12
Partially Observable Markov Decision Processes
12.1 Introduction
The Markov decision process model has proven very successful for learning how
to act in stochastic environments. In this chapter, we explore methods for reinforce-
ment learning by relaxing one of the limiting factors of the MDP model, namely
the assumption that the agent knows with full certainty the state of the environment.
Put otherwise, the MDP model assumes that the agent's sensors allow it to perfectly monitor the state at all times,
where the state captures all aspects of the environment relevant for optimal deci-
sion making. Clearly, this is a strong assumption that can restrict the applicability
of the MDP framework. For instance, when certain state features are hidden from
the agent the state signal will no longer be Markovian, violating a key assumption
of most reinforcement-learning techniques (Sutton and Barto, 1998).
One example of particular interest arises when applying reinforcement learning
to embodied agents. In many robotic applications the robot’s on-board sensors do
not allow it to unambiguously identify its own location or pose (Thrun et al, 2005).
Furthermore, a robot’s sensors are often limited to observing its direct surround-
ings, and might not be adequate to monitor those features of the environment’s state
beyond its vicinity, so-called hidden state. Another source of uncertainty regarding the true state of the system is the imperfection of the robot's sensors. For instance, let
us suppose a robot uses a camera to identify the person it is interacting with. The
face-recognition algorithm processing the camera images is likely to make mistakes
sometimes, and report the wrong identity. Such an imperfect sensor also prevents the
robot from knowing the true state of the system: even if the vision algorithm reports
person A, it is still possible that person B is interacting with the robot. Although
in some domains the issues resulting from imperfect sensing might be ignored, in
general they can lead to severe performance deterioration (Singh et al, 1994).
Instead, in this chapter we consider an extension of the (fully observable) MDP
setting that also deals with uncertainty resulting from the agent’s imperfect sen-
sors. A partially observable Markov decision process (POMDP) allows for optimal
decision making in environments which are only partially observable to the agent
(Kaelbling et al, 1998), in contrast with the full observability mandated by the MDP
model. In general the partial observability stems from two sources: (i) multiple states
give the same sensor reading, in case the agent can only sense a limited part of the
environment, and (ii) its sensor readings are noisy: observing the same state can
result in different sensor readings. The partial observability can lead to “perceptual
aliasing”: different parts of the environment appear similar to the agent’s sensor sys-
tem, but require different actions. The POMDP captures the partial observability in
a probabilistic observation model, which relates possible observations to states.
Classic POMDP examples are the machine maintenance (Smallwood and Sondik,
1973) or structural inspection (Ellis et al, 1995) problems. In these types of prob-
lems, the agent has to choose when to inspect a certain machine part or bridge sec-
tion, to decide whether maintenance is necessary. However, to allow for inspection
the machine has to be stopped, or the bridge to be closed, which has a clear eco-
nomic cost. A POMDP model can properly balance the trade-off between expected
deterioration over time and scheduling inspection or maintenance activities. Fur-
thermore, a POMDP can model the scenario that only choosing to inspect provides
information regarding the state of the machine or bridge, and that some flaws are
not always revealed reliably. More recently, the POMDP model has gained in rele-
vance for robotic applications such as robot navigation (Simmons and Koenig, 1995;
Spaan and Vlassis, 2004; Roy et al, 2005; Foka and Trahanias, 2007), active sens-
ing (Hoey and Little, 2007; Spaan et al, 2010), object grasping (Hsiao et al, 2007)
or human-robot interaction (Doshi and Roy, 2008). Finally, POMDPs have been
(Fig. 12.1: schematic representation of a POMDP agent interacting with its environment, executing actions a according to a policy π.)
In this section we formally introduce the POMDP model and related decision-
making concepts.
A POMDP shares many elements with the fully observable MDP model as de-
scribed in Section 1.3, which we will repeat for completeness. Time is discretized
in steps, and at the start of each time step the agent has to execute an action. We
will consider only discrete, finite, models, which are by far the most commonly
used in the POMDP literature given the difficulties involved with solving continu-
ous models. For simplicity, the environment is represented by a finite set of states
S = {s_1, ..., s_N}. The set of possible actions A = {a_1, ..., a_K} represents the possible ways the agent can influence the system state. Each time step the agent takes an action a in state s, the environment transitions to state s' according to the probabilistic transition function T(s,a,s'), and the agent receives an immediate reward R(s,a,s').
What distinguishes a POMDP from a fully observable MDP is that the agent
now perceives an observation o ∈ Ω , instead of observing s directly. The discrete
set of observations Ω = {o_1, ..., o_M} represents all possible sensor readings the agent can receive. Which observation the agent receives depends on the next state s' and may also be conditional on its action a, and is drawn according to the observation function O : S × A × Ω → [0,1]. The probability of observing o in state s' after executing a is O(s',a,o). In order for O to be a valid probability distribution over possible observations it is required that O(s',a,o) ≥ 0 for all s' ∈ S, a ∈ A, o ∈ Ω, and that ∑_{o∈Ω} O(s',a,o) = 1. Alternatively, the observation function can also be defined as O : S × Ω → [0,1], reflecting domains in which the observation is independent of the last action.2
As in an MDP, the goal of the agent is to act in such a way as to maximize some
form of expected long-term reward, for instance
$E\Big[ \textstyle\sum_{t=0}^{h} \gamma^t R_t \Big],$   (12.1)
where E[·] denotes the expectation operator, h is the planning horizon, and γ is a
discount rate, 0 ≤ γ < 1.
Analogous to Definition 1.3.1, we can define a POMDP as follows.
Definition 12.2.1. A partially observable Markov decision process is a tuple ⟨S, A, Ω, T, O, R⟩ in which S is a finite set of states, A is a finite set of actions, Ω is a finite set of observations, T is a transition function defined as T : S × A × S → [0,1], O is an observation function defined as O : S × A × Ω → [0,1], and R is a reward function defined as R : S × A × S → ℝ.
Fig. 12.1 illustrates these concepts by depicting a schematic representation of a
POMDP agent interacting with the environment.
To illustrate how the observation function models different types of partial ob-
servability, consider the following examples, which assume a POMDP with 2 states,
2 observations, and 1 action (omitted for simplicity). The case that sensors make
mistakes or are noisy can be modeled as follows. For instance,
O(s1 ,o1 ) = 0.8, O(s1 ,o2 ) = 0.2, O(s2 ,o1 ) = 0.2, O(s2 ,o2 ) = 0.8,
models an agent equipped with a sensor that is correct in 80% of the cases. When
the agent observes o1 or o2 , it does not know for sure that the environment is in
state s1 resp. s2 . The possibility that the state is completely hidden to the agent can
be modeled by assigning the same observation to both states (and observation o2 is
effectively redundant):
O(s1 ,o1 ) = 1.0, O(s1 ,o2 ) = 0.0, O(s2 ,o1 ) = 1.0, O(s2 ,o2 ) = 0.0.
2 Technically speaking, by including the last action taken as a state feature, observation functions of the form O(s',o) can express the same models as O(s',a,o) functions.
When the agent receives observation o1 it is not able to tell whether the environment
is in state s1 or s2 , which models the hidden state adequately.
Although rewards are still associated with the environment state, as well as the state transitions, a single observation is not a Markovian state signal. In particular, a direct mapping of
observations to actions is not sufficient for optimal behavior. In order for an agent
to choose its actions successfully in partially observable environments memory is
needed.
To illustrate this point, consider the two-state infinite-horizon POMDP depicted
in Fig. 12.2 (Singh et al, 1994). The agent has two actions, one of which will de-
terministically transport it to the other state, while executing the other action has no
effect on the state. If the agent jumps to the other state it receives a reward of r > 0,
and −r otherwise. The optimal policy in the underlying MDP has a value of $\frac{r}{1-\gamma}$, as the agent can gather a reward of r at each time step. In the POMDP however, the agent receives the same observation in both states. As a result, there are only two memoryless deterministic stationary policies possible: always execute a1 or always execute a2. The maximum expected reward of these policies is $r - \frac{\gamma r}{1-\gamma}$, when the agent successfully jumps to the other state at the first time step. If we allow stochastic policies, the best stationary policy would yield an expected discounted reward of 0, when it chooses either action 50% of the time. However, if the agent could remember what actions it had executed, it could execute a policy that alternates between executing a1 and a2. Such a memory-based policy would gather $\frac{\gamma r}{1-\gamma} - r$ in the worst case, which is close to the optimal value in the MDP (Singh et al, 1994).
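For concreteness, these three values follow from geometric series; a short derivation (the labels V_MDP, V_det and V_mem are just names introduced for this sketch):

$V_{\text{MDP}} = \sum_{t=0}^{\infty} \gamma^t r = \frac{r}{1-\gamma}$   (a reward of r is collected at every step),

$V_{\text{det}} = r - \sum_{t=1}^{\infty} \gamma^t r = r - \frac{\gamma r}{1-\gamma}$   (best case for "always a1" or "always a2": +r once, then −r forever),

$V_{\text{mem}} = -r + \sum_{t=1}^{\infty} \gamma^t r = \frac{\gamma r}{1-\gamma} - r$   (worst case for the alternating policy: −r once, then +r forever).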
This example illustrates the need for memory when considering optimal decision
making in a POMDP. A straightforward implementation of memory would be to
simply store the sequence of actions executed and observations received. However,
such a form of memory can grow indefinitely over time, making it impractical for
long planning horizons. Fortunately, a better option exists, as we can transform the
POMDP to a belief-state MDP in which the agent summarizes all information about
its past using a belief vector b(s) (Stratonovich, 1960; Dynkin, 1965; Åström, 1965).
This transformation requires that the transition and observation functions are known
to the agent, and hence can be applied only in model-based RL methods.
The belief b is a probability distribution over S, which forms a Markovian signal
for the planning task. Given an appropriate state space, the belief is a sufficient
statistic of the history, which means the agent could not do any better even if it had
remembered the full history of actions and observations. All beliefs are contained in
a (|S| − 1)-dimensional simplex Δ (S), hence we can represent a belief using |S| − 1
numbers. Each POMDP problem assumes an initial belief b0 , which for instance
can be set to a uniform distribution over all states (representing complete ignorance
regarding the initial state of the environment). Every time the agent takes an action a
and observes o, its belief is updated by Bayes’ rule:
Fig. 12.3 Belief-update example (adapted from Fox et al (1999)). (a) A robot moves in a one-
dimensional corridor with three identical doors. (b)-(e) The evolution of the belief over time,
for details see main text.
$b_a^o(s') = \frac{p(o|s',a)}{p(o|b,a)} \sum_{s\in S} p(s'|s,a)\, b(s),$   (12.2)

where p(s'|s,a) and p(o|s',a) are defined by the model parameters T resp. O, and

$p(o|b,a) = \sum_{s'\in S} p(o|s',a) \sum_{s\in S} p(s'|s,a)\, b(s)$   (12.3)

is a normalizing constant.
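For concreteness, a minimal sketch of this update for a discrete POMDP stored as numpy arrays; the array layout (T[s, a, s'] = p(s'|s,a), O[s', a, o] = p(o|s',a)) is an illustrative convention, not one prescribed by the chapter.

import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes update of a discrete belief, Eq. (12.2).

    b : belief over states, shape (|S|,); a, o : indices of the executed
    action and received observation; T, O : transition and observation models.
    """
    predicted = b @ T[:, a, :]            # sum_s p(s'|s,a) b(s)
    unnormalized = O[:, a, o] * predicted
    p_o = unnormalized.sum()              # normalizing constant p(o|b,a), Eq. (12.3)
    if p_o == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / p_o

Repeated application of this function along the executed actions and received observations implements the belief tracking described above.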
Fig. 12.3 shows an example of a sequence of belief updates for a robot navigating
in a corridor with three identical doors. The corridor is discretized into 26 states and
is circular, i.e., the right end of the corridor is connected to the left end. The robot
can observe either door or corridor, but its sensors are noisy. When the robot is
positioned in front of a door, it observes door with probability 0.9 (and corridor with
probability 0.1). When the robot is not located in front of a door the probability of
observing corridor is 0.9. The robot has two actions, forward and backward (right
resp. left in the figure), which transport the robot 3 (20%), 4 (60%), or 5 (20%)
states in the corresponding direction. The initial belief b0 is uniform, as displayed in
Fig. 12.3b. Fig. 12.3c through (e) show how the belief of the robot is updated as it
executes the forward action each time. The true location of the robot is indicated by
the dark-gray component of its belief. In Fig. 12.3c we see that the robot is located
in front of the first door, and although it is fairly certain it is located in front of a
door, it cannot tell which one. However, after taking another move forward it again
observes door, and now can pinpoint its location more accurately, because of the
particular configuration of the three doors (Fig. 12.3d). However, in Fig. 12.3e the
belief blurs again, which is due to the noisy transition model and the fact that the
corridor observation is not very informative in this case.
As in the fully observable MDP setting, the goal of the agent is to choose actions
which fulfill its task as well as possible, i.e., to learn an optimal policy. In POMDPs,
an optimal policy π ∗ (b) maps beliefs to actions. Note that, contrary to MDPs, the
policy π (b) is a function over a continuous set of probability distributions over S. A
policy π can be characterized by a value function V π : Δ (S) → R which is defined
as the expected future discounted reward V π (b) the agent can gather by following π
starting from belief b:
$V^\pi(b) = E_\pi\Big[ \textstyle\sum_{t=0}^{h} \gamma^t R(b_t, \pi(b_t)) \,\Big|\, b_0 = b \Big],$   (12.4)
$V^* = H_{\text{POMDP}} V^*,$   (12.5)

where $H_{\text{POMDP}}$ is the Bellman backup operator for POMDPs, defined as

$V^*(b) = \max_{a\in A}\Big[ \sum_{s\in S} R(s,a)\, b(s) + \gamma \sum_{o\in\Omega} p(o|b,a)\, V^*(b_a^o) \Big],$   (12.6)

with $b_a^o$ given by (12.2), and p(o|b,a) as defined in (12.3). When (12.6) holds for every b ∈ Δ(S) we are ensured the solution is optimal.
Computing value functions over a continuous belief space might seem intractable
at first, but fortunately the value function has a particular structure that we can ex-
ploit (Sondik, 1971). It can be parameterized by a finite number of vectors and has
a convex shape. The convexity implies that the value of a belief close to one of the
Fig. 12.4 (a) An example of a value function in a two-state POMDP. The y-axis shows the
value of each belief, and the x-axis depicts the belief space Δ (S), ranging from (1,0) to (0,1).
(b) An example policy tree, where at a node the agent takes an action, and it transitions to a
next node based on the received observation o ∈ {o1 ,o2 , . . . ,o|O| }.
corners of the belief simplex Δ (S) will be high. In general, the less uncertainty the
agent has over its true state, the better it can predict the future, and as such take bet-
ter decisions. A belief located exactly at a particular corner of Δ (S), i.e., b(s) = 1 for
a particular s, defines with full certainty the state of the agent. In this way, the con-
vex shape of V can be intuitively explained. An example of a convex value function
for a two-state POMDP is shown in Fig. 12.4a. As the belief space is a simplex, we
can represent any belief in a two-state POMDP on a line, as b(s2 ) = 1 − b(s1 ). The
corners of the belief simplex are denoted by (1,0) and (0,1), which have a higher
(or equal) value than a belief in the center of the belief space, e.g., (0.5,0.5).
An alternative way to represent policies in POMDPs is by considering policy
trees (Kaelbling et al, 1998). Fig. 12.4b shows a partial policy tree, in which the
agent starts at the root node of tree. Each node specifies an action which the agent
executes at the particular node. Next it receives an observation o, which determines
to what next node the agent transitions. The depth of the tree depends on the plan-
ning horizon h, i.e., if we want the agent to consider taking h steps, the correspond-
ing policy tree has depth h.
First, we discuss some heuristic control strategies that have been proposed which rely on a solution $\pi^*_{\text{MDP}}(s)$ or $Q^*_{\text{MDP}}(s,a)$ of the underlying MDP (Cassandra et al, 1996). The idea is that solving the MDP is of much lower complexity than solving the POMDP (P-complete vs. PSPACE-complete) (Papadimitriou and Tsitsiklis, 1987), but that by tracking the belief state some notion of imperfect state perception can still be maintained. Cassandra (1998) provides an extensive experimental comparison of MDP-based heuristics.
Perhaps the most straightforward heuristic is to consider for a belief at a given
time step its most likely state (MLS), and use the action the MDP policy prescribes
for the state:

$\pi_{\text{MLS}}(b) = \pi^*_{\text{MDP}}\big(\arg\max_s b(s)\big).$   (12.7)
The MLS heuristic completely ignores the uncertainty in the current belief, which
clearly can be suboptimal.
A more sophisticated approximation technique is QMDP (Littman et al, 1995),
which also treats the POMDP as if it were fully observable. QMDP solves the MDP
and defines a control policy

$\pi_{Q_{\text{MDP}}}(b) = \arg\max_a \sum_{s\in S} b(s)\, Q^*_{\text{MDP}}(s,a).$   (12.8)

QMDP can be very effective in some domains, but the policies it computes will not
take informative actions, as the QMDP solution assumes that any uncertainty regarding
the state will disappear after taking one action. As such, QMDP policies will fail in
domains where repeated information gathering is necessary.
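Both heuristics are one-liners once the underlying MDP has been solved. A minimal sketch, assuming Q_mdp is a |S| × |A| numpy array holding the optimal MDP Q-values (the variable names are illustrative):

import numpy as np

def mls_action(b, Q_mdp):
    """MLS heuristic, Eq. (12.7): act as the MDP policy would in the
    state with the highest belief probability."""
    return int(np.argmax(Q_mdp[int(np.argmax(b))]))

def qmdp_action(b, Q_mdp):
    """QMDP heuristic: argmax_a sum_s b(s) Q*_MDP(s, a)."""
    return int(np.argmax(b @ Q_mdp))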
For instance, consider the toy domain in Figure 12.5, which illustrates how MDP-
based heuristics can fail (Parr and Russell, 1995). The agent starts in the state
marked I, and upon taking any action the system transitions with equal probabil-
ity to one of two states. In both states it would receive observation A, meaning the
agent cannot distinguish between them. The optimal POMDP policy is to take the
action a twice in succession, after which the agent is back in the same state. How-
ever, because it observed either C or D, it knows in which of the two states marked A
it currently is. This knowledge is important for choosing the optimal action (b or c)
to transition to the state with positive reward, labelled +1. The fact that the a ac-
tions do not change the system state, but only the agent’s belief state (two time steps
later) is very hard for the MDP-based methods to plan for. It forms an example of
reasoning about explicit information gathering effects of actions, for which methods
based on MDP solutions do not suffice.
One can also expand the MDP setting to model some form of sensing uncer-
tainty without considering full-blown POMDP beliefs. For instance, in robotics the
navigation under localization uncertainty problem can be modeled by the mean and
entropy of the belief distribution (Cassandra et al, 1996; Roy and Thrun, 2000).
Fig. 12.5 A simple domain in which MDP-based control strategies fail (see main text).
vectors in a smart way (Monahan, 1982; Zhang and Liu, 1996; Littman, 1996; Cas-
sandra et al, 1997; Feng and Zilberstein, 2004; Lin et al, 2004; Varakantham et al,
2005). However, pruning again requires linear programming.
The value of an optimal policy π ∗ is defined by the optimal value function V ∗
which we compute by iterating a number of stages, at each stage considering a step
further into the future. At each stage we apply the exact dynamic-programming
operator HPOMDP (12.6). If the agent has only one time step left to act, we only
have to consider the immediate reward for the particular belief b, and can ignore
any future value $V^*(b_a^o)$, and (12.6) reduces to

$V_0^*(b) = \max_a \Big[ \sum_s R(s,a)\, b(s) \Big].$   (12.9)

We can view the immediate reward function R(s,a) as a set of |A| vectors $\alpha_0^a = (\alpha_0^a(1), \ldots, \alpha_0^a(|S|))$, one for each action a: $\alpha_0^a(s) = R(s,a)$. Now we can rewrite (12.9) as follows, where we view b as an |S|-dimensional vector:

$V_0^*(b) = \max_{\{\alpha_0^a\}_{a\in A}} b \cdot \alpha_0^a.$   (12.10)
Additionally, an action $a(\alpha_n^k) \in A$ is associated with each vector, which is the optimal one to take in the current step for those beliefs for which $\alpha_n^k$ is the maximizing vector. Each vector defines a region in the belief space for which this vector is the maximizing element of $V_n$. These regions form a partition of the belief space, induced by the piecewise linearity of the value function, as illustrated by Fig. 12.6.
The gradient of the value function at b is given by the vector
The main idea behind many value-iteration algorithms for POMDPs is that for
a given value function Vn and a particular belief point b we can easily compute the
vector $\alpha_{n+1}^b$ of $H_{\text{POMDP}}V_n$ such that
$\alpha_{n+1}^b = \arg\max_{\{\alpha_{n+1}^k\}_k} b \cdot \alpha_{n+1}^k,$   (12.15)

where $\{\alpha_{n+1}^k\}_{k=1}^{|H_{\text{POMDP}}V_n|}$ is the (unknown) set of vectors for $H_{\text{POMDP}}V_n$. We will denote this operation $\alpha_{n+1}^b = \text{backup}(b)$. For this, we define $g_{ao}^k$ vectors

$g_{ao}^k(s) = \sum_{s'} O(s',a,o)\, T(s,a,s')\, \alpha_n^k(s'),$   (12.16)
b
which represent the vectors resulting from back-projecting αnk for a particular a
and o. Starting from (12.6) we can derive
!
Vn+1(b) = max b · α0a + γ b · ∑ arg max b · gkao (12.17)
a o {gkao }k
From (12.20) we can derive the vector backup(b), as this is the vector whose inner product with b yields $V_{n+1}(b)$:

$\text{backup}(b) = \arg\max_{\{g_a^b\}_{a\in A}} b \cdot g_a^b,$   (12.21)

with $g_a^b$ defined in (12.19). Note that in general not only the computed α-vector is retained, but also which action a was the maximizer in (12.21), as that is the optimal action associated with backup(b).
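The backup operator just derived translates almost directly into code. Below is a rough sketch for a flat, discrete POMDP, assuming the array layout T[s, a, s'], O[s', a, o], R[s, a] (an illustrative convention) and with Gamma a list of α-vectors representing V_n.

import numpy as np

def point_based_backup(b, Gamma, T, O, R, gamma):
    """Compute backup(b): the next-stage alpha-vector that is maximal at belief b.

    Returns the vector g_a^b with the largest inner product b . g_a^b,
    together with its associated (optimal) action a.
    """
    S, A = R.shape
    num_obs = O.shape[2]
    best_value, best_alpha, best_action = -np.inf, None, None
    for a in range(A):
        g_ab = R[:, a].astype(float)                 # immediate-reward vector alpha_0^a
        for o in range(num_obs):
            # back-projections g_ao^k of every alpha in Gamma, cf. Eq. (12.16)
            g_aok = [(T[:, a, :] * O[:, a, o]) @ alpha for alpha in Gamma]
            # keep only the back-projection maximizing the inner product with b
            g_ab = g_ab + gamma * max(g_aok, key=lambda g: float(b @ g))
        value = float(b @ g_ab)
        if value > best_value:
            best_value, best_alpha, best_action = value, g_ab, a
    return best_alpha, best_action

Since only inner products with the single belief b are needed, this backup is far cheaper than computing the full set of vectors of H_POMDP V_n, which is what the point-based methods of Section 12.3.4 exploit.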
The Bellman backup operator (12.21) computes a next-horizon vector for a single
belief, and now we will employ this backup operator to compute a complete value
function for the next horizon, i.e., one that is optimal for all beliefs in the belief
space. Although computing the vector backup(b) for a given b is straightforward,
locating the (minimal) set of points b required to compute all vectors $\bigcup_b \text{backup}(b)$ of $H_{\text{POMDP}}V_n$ is very costly. As each b has a region in the belief space in which its $\alpha_n^b$ is maximal, a family of algorithms tries to identify these regions (Sondik, 1971;
Cheng, 1988; Kaelbling et al, 1998). The corresponding b of each region is called
a “witness” point, as it testifies to the existence of its region. Other exact POMDP
value-iteration algorithms do not focus on searching in the belief space. Instead, they
consider enumerating all possible vectors of HPOMDPVn , followed by pruning useless
vectors (Monahan, 1982; Zhang and Liu, 1996; Littman, 1996; Cassandra et al,
1997; Feng and Zilberstein, 2004; Lin et al, 2004; Varakantham et al, 2005). We will
focus on the enumeration algorithms as they have seen more recent developments
and are more commonly used.
3 The cross sum of sets is defined as $\bigoplus_k R_k = R_1 \oplus R_2 \oplus \cdots \oplus R_k$, with $P \oplus Q = \{\, p + q \mid p \in P,\ q \in Q \,\}$.
Monahan (1982)'s algorithm first generates all $|A|\,|V_n|^{|O|}$ vectors of $H_{\text{POMDP}}V_n$ before pruning all dominated vectors. Incremental Pruning methods (Zhang and Liu, 1996; Cassandra et al, 1997; Feng and Zilberstein, 2004; Lin et al, 2004; Varakantham et al, 2005) save computation time by exploiting the fact that
In this way the number of constraints in the linear program used for pruning grows
slowly (Cassandra et al, 1997), leading to better performance. The basic Incremental
Pruning algorithm exploits (12.24) when computing Vn+1 as follows:
$V_{n+1} = \text{prune}\Big( \bigcup_a G_a \Big), \quad \text{with}$   (12.25)

$G_a = \text{prune}\Big( \bigoplus_o G_a^o \Big)$   (12.26)

$\phantom{G_a} = \text{prune}\big( G_a^1 \oplus G_a^2 \oplus G_a^3 \oplus \cdots \oplus G_a^{|O|} \big)$   (12.27)

$\phantom{G_a} = \text{prune}\big( \cdots \text{prune}\big(\text{prune}(G_a^1 \oplus G_a^2) \oplus G_a^3\big) \cdots \oplus G_a^{|O|} \big).$   (12.28)
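To make the structure of (12.25)-(12.28) concrete, here is a small sketch of the incremental cross-sum. For brevity the prune step below removes only pointwise-dominated vectors, whereas the algorithms cited above prune all dominated vectors with a linear program.

import numpy as np

def cross_sum(P, Q):
    """Cross sum of two sets of vectors: { p + q | p in P, q in Q }."""
    return [p + q for p in P for q in Q]

def prune_pointwise(vectors):
    """Keep only vectors that are not pointwise-dominated by another vector.
    (A simplification: exact pruning solves a linear program per vector.)"""
    kept = []
    for i, v in enumerate(vectors):
        dominated = any(j != i and np.all(w >= v) and np.any(w > v)
                        for j, w in enumerate(vectors))
        if not dominated:
            kept.append(v)
    return kept

def incremental_prune(G_a_obs):
    """Combine the per-observation sets G_a^o as in Eq. (12.28)."""
    result = prune_pointwise(G_a_obs[0])
    for G in G_a_obs[1:]:
        result = prune_pointwise(cross_sum(result, G))
    return result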
Given the high computational complexity of optimal POMDP solutions, many meth-
ods for approximate solutions have been developed. One powerful idea has been to
compute solutions only for those parts of the belief simplex that are reachable, i.e.,
that can be actually encountered by interacting with the environment. This has mo-
tivated the use of approximate solution techniques which focus on the use of a sam-
pled set of belief points on which planning is performed (Hauskrecht, 2000; Poon,
2001; Roy and Gordon, 2003; Pineau et al, 2003; Smith and Simmons, 2004; Spaan
and Vlassis, 2005a; Shani et al, 2007; Kurniawati et al, 2008), a possibility already
mentioned by Lovejoy (1991). The idea is that instead of planning over the com-
plete belief space of the agent (which is intractable for large state spaces), planning
is carried out only on a limited set of prototype beliefs B that have been sampled by
letting the agent interact with the environment.
In each backup stage the set B̃ is constructed by sampling beliefs from B until the resulting $V_{n+1}$ upper bounds $V_n$ over B, i.e., until condition (12.31) has been met. The $\tilde{H}_{\text{Perseus}}$ operator results in value functions with a relatively small number of vectors, allowing for the use of a much larger B, which has a positive effect on the approximation accuracy (Pineau et al, 2003).
Crucial to the control quality of the computed approximate solution is the
makeup of B. A number of schemes to build B have been proposed. For instance,
one could use a regular grid on the belief simplex, computed, e.g., by Freudenthal
triangulation (Lovejoy, 1991). Other options include taking all extreme points of
the belief simplex or using a random grid (Hauskrecht, 2000; Poon, 2001). An alter-
native scheme is to include belief points that can be encountered by simulating the
POMDP: we can generate trajectories through the belief space by sampling random
actions and observations at each time step (Lovejoy, 1991; Hauskrecht, 2000; Poon,
2001; Pineau et al, 2003; Spaan and Vlassis, 2005a). This sampling scheme focuses
the contents of B to be beliefs that can actually be encountered while experiencing
the POMDP model.
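A rough sketch of this trajectory-based construction of B, assuming the same T and O array layout as before and uniformly random exploration actions (all names are illustrative):

import numpy as np

def sample_belief_set(b0, T, O, num_beliefs=500, horizon=50, seed=0):
    """Collect a set B of reachable beliefs by simulating random interaction."""
    rng = np.random.default_rng(seed)
    S, A, _ = T.shape
    num_obs = O.shape[2]
    B = [np.asarray(b0, dtype=float)]
    while len(B) < num_beliefs:
        b = np.asarray(b0, dtype=float)
        s = rng.choice(S, p=b)                      # sample a true hidden state
        for _ in range(horizon):
            a = rng.integers(A)                     # random exploration action
            s = rng.choice(S, p=T[s, a])            # simulate the state transition
            o = rng.choice(num_obs, p=O[s, a])      # simulate the observation
            b = O[:, a, o] * (b @ T[:, a, :])       # Bayes update, Eq. (12.2)
            b = b / b.sum()
            B.append(b)
            if len(B) >= num_beliefs:
                break
    return B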
More intricate schemes for belief sampling have also been proposed. For in-
stance, one can use the MDP solution to guide the belief sampling process (Shani
et al, 2007), but in problem domains which require series of information-gathering
actions such a heuristic will suffer from similar issues as when using QMDP (Sec-
tion 12.3.1). Furthermore, the belief set B does not need to be static, and can be up-
dated while running a point-based solver. HSVI heuristically selects belief points in
the search tree starting from the initial belief, based on upper and lower bounds on
the optimal value function (Smith and Simmons, 2004, 2005). SARSOP takes this
idea a step further by successively approximating the optimal reachable belief space,
i.e., the belief space that can be reached by following an optimal policy (Kurniawati
et al, 2008).
In general, point-based methods compute solutions in the form of piecewise lin-
ear and convex value functions, and given a particular belief, the agent can simply
look up which action to take using (12.14).
Besides the point-based methods, other types of approximation structure have been
explored as well.
One way to sidestep the intractability of exact POMDP value iteration is to grid the
belief simplex, using either a fixed grid (Drake, 1962; Lovejoy, 1991; Bonet, 2002)
or a variable grid (Brafman, 1997; Zhou and Hansen, 2001). Value backups are per-
formed for every grid point, but only the value of each grid point is preserved and the
gradient is ignored. The value of non-grid points is defined by an interpolation rule.
The grid-based methods differ mainly in how the grid points are selected and what
shape the interpolation function takes. In general, regular grids do not scale well
in problems with high dimensionality and non-regular grids suffer from expensive
interpolation routines.
Bounded Policy Iteration (BPI) (Poupart and Boutilier, 2004) searches through the space of (bounded-size) stochastic finite-state controllers by performing policy-iteration steps. Other options
for searching the policy space include gradient ascent (Meuleau et al, 1999a; Kearns
et al, 2000; Ng and Jordan, 2000; Baxter and Bartlett, 2001; Aberdeen and Baxter,
2002) and heuristic methods like stochastic local search (Braziunas and Boutilier,
2004). In particular, the PEGASUS method (Ng and Jordan, 2000) estimates the
value of a policy by simulating a (bounded) number of trajectories from the POMDP
using a fixed random seed, and then takes steps in the policy space in order to max-
imize this value. Policy search methods have demonstrated success in several cases,
but searching in the policy space can often be difficult and prone to local optima
(Baxter et al, 2001).
Another approach for solving POMDPs is based on heuristic search (Satia and Lave,
1973; Hansen, 1998a; Smith and Simmons, 2004). Defining an initial belief b0 as the
root node, these methods build a tree that branches over (a,o) pairs, each of which
recursively induces a new belief node. Branch-and-bound techniques are used to
maintain upper and lower bounds to the expected return at fringe nodes in the search
tree. Hansen (1998a) proposes a policy-iteration method that represents a policy as
a finite-state controller, and which uses the belief tree to focus the search on areas
of the belief space where the controller can most likely be improved. However, its
applicability to large problems is limited by its use of full dynamic-programming
updates. As mentioned before, HSVI (Smith and Simmons, 2004, 2005) is an ap-
proximate value-iteration technique that performs a heuristic search through the be-
lief space for beliefs at which to update the bounds, similar to work by Satia and
Lave (1973).
When no models of the environment are available to the agent a priori, the model-
based methods presented in the previous section cannot be directly applied. Even
relatively simple techniques such as QMDP (Section 12.3.1) require knowledge of the
complete POMDP model: the solution to the underlying MDP is computed using
the transition and reward model, while the belief update (12.2) additionally requires
the observation model.
In general, there exist two ways of tackling such a decision-making problem,
known as direct and indirect reinforcement learning methods. Direct methods apply
true model-free techniques, which do not try to reconstruct the unknown POMDP
models, but for instance map observation histories directly to actions. At the other
extreme, one can attempt to reconstruct the POMDP model by interacting with the environment,
which then in principle can be solved using techniques presented in Section 12.3.
This indirect approach has long been out of favor for POMDPs, as (i) reconstruct-
ing (an approximation of) the POMDP models is very hard, and (ii) even with a
recovered POMDP, model-based methods would take prohibitively long to compute
a good policy. However, advances in model-based methods such as the point-based
family of algorithms (Section 12.3.4) have made these types of approaches more
attractive.
First, we consider methods for learning memoryless policies, that is, policies that
map each observation that an agent receives directly to an action, without consulting
any internal state. Memoryless policies can either be deterministic mappings, π :
Ω → A, or probabilistic mappings, π : Ω → Δ (A). As illustrated by the example
in Section 12.2.3, probabilistic policies allow for higher payoffs, at the cost of an
increased search space that can no longer be enumerated (Singh et al, 1994). In fact,
the problem of finding an optimal deterministic memoryless policy has been shown
to be NP-hard (Littman, 1994), while the complexity of determining the optimal
probabilistic memoryless policy is still an open problem.
Loch and Singh (1998) have demonstrated empirically that using eligibility
traces, in their case in SARSA(λ), can improve the ability of memoryless methods
to handle partial observability. SARSA(λ) was shown to learn the optimal determin-
istic memoryless policy in several domains (for which it was possible to enumerate
all such policies, of which there are |A|^{|Ω|}). Bagnell et al (2004) also consider the
memoryless deterministic case, but using non-stationary policies instead of station-
ary ones. They show that successful non-stationary policies can be found in cer-
tain maze domains for which no good stationary policies exist. Regarding learning
stochastic memoryless policies, an algorithm has been proposed by Jaakkola et al
(1995), and tested empirically by Williams and Singh (1999), showing that it can
successfully learn stochastic memoryless policies. An interesting twist is provided
by Hierarchical Q-Learning (Wiering and Schmidhuber, 1997), which aims to learn
a subgoal sequence in a POMDP, where each subgoal can be successfully achieved
using a memoryless policy.
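To make the memoryless setting concrete, the sketch below implements tabular SARSA(λ) that conditions only on the current observation, in the spirit of the Loch and Singh (1998) experiments. The environment interface (reset() returning an observation, step(a) returning an observation, a reward, and a done flag) and details such as accumulating traces and ε-greedy exploration are illustrative assumptions, not the original experimental setup.

```python
import random
from collections import defaultdict

def sarsa_lambda(env, actions, episodes=500, alpha=0.1, gamma=0.95,
                 lam=0.9, epsilon=0.1):
    """Tabular SARSA(lambda) that treats the current observation as the state."""
    Q = defaultdict(float)                           # Q[(observation, action)]

    def choose(obs):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(obs, a)])

    for _ in range(episodes):
        traces = defaultdict(float)                  # accumulating eligibility traces
        obs = env.reset()
        a = choose(obs)
        done = False
        while not done:
            obs2, reward, done = env.step(a)
            a2 = choose(obs2) if not done else None
            target = reward + (gamma * Q[(obs2, a2)] if not done else 0.0)
            delta = target - Q[(obs, a)]
            traces[(obs, a)] += 1.0
            for key in list(traces):
                Q[key] += alpha * delta * traces[key]
                traces[key] *= gamma * lam
            obs, a = obs2, a2
    return Q
```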
Fig. 12.7 (a) Long-term dependency T maze (Bakker, 2002). (b) Example of a suffix tree used
by the USM algorithm (McCallum, 1995), where fringe nodes are indicated by dashed lines.
A straightforward alternative to memoryless policies is to use the complete history of
actions and observations as internal state, but full histories grow without bound and do not allow
for easy generalization, e.g., it is not clear how experience obtained after history
a1, o1, a1, o1 can be used to update the value for history a2, o1, a1, o1. To counter
these problems, researchers have proposed many different internal-state representa-
tions, of which we give a brief overview.
First of all, the memoryless methods presented before can be seen as maintaining
a history window of only a single observation. Instead, these algorithms can also be
applied with a history window containing the last k observations (Littman, 1994;
Loch and Singh, 1998), where k is typically an a-priori defined parameter. In some
domains such a relatively cheap increase of the policy space (by means of a low k)
can buy a significant improvement in learning time and task performance. Finite
history windows have also been used as a representation for neural networks (Lin
and Mitchell, 1992).
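A k-step history window is equally easy to sketch: the tuple of the last k observations is simply used as the key of a tabular value function. The hypothetical class below uses Q-learning rather than the SARSA(λ) variants cited above, and the environment interface is again an assumption.

```python
import random
from collections import deque, defaultdict

class HistoryWindowAgent:
    """Q-learning agent whose 'state' is the tuple of the last k observations."""

    def __init__(self, actions, k=3, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.actions, self.k = actions, k
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.Q = defaultdict(float)                  # Q[(window, action)]
        self.window = deque(maxlen=k)

    def state(self):
        return tuple(self.window)                    # hashable key for the Q-table

    def observe(self, obs):
        self.window.append(obs)

    def act(self):
        s = self.state()
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def update(self, prev_state, action, reward, done):
        s2 = self.state()
        best_next = 0.0 if done else max(self.Q[(s2, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.Q[(prev_state, action)] += self.alpha * (td_target - self.Q[(prev_state, action)])

def run_episode(agent, env):
    """One interaction episode with an assumed reset()/step() environment."""
    agent.window.clear()
    agent.observe(env.reset())
    done = False
    while not done:
        s, a = agent.state(), agent.act()
        obs, reward, done = env.step(a)
        agent.observe(obs)
        agent.update(s, a, reward, done)
```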
Finite history windows cannot however capture arbitrary long-term dependen-
cies, such as for instance present in the T Maze in Figure 12.7a, an example pro-
vided by Bakker (2002). In this problem the agent starts at S, and needs to navigate
to G. However, the location of G is unknown initially: it might be on the left or on
the right at the end of the corridor. In the start state the agent can observe
a road sign X, which depends on the particular goal location. The length of the cor-
ridor can be varied (in Figure 12.7a it is 10), meaning that the agent needs to learn
to remember the road sign for many time steps. Obviously, such a dependency cannot
be represented well by finite history windows.
Alleviating the problem of fixed history windows, McCallum (1993, 1995, 1996)
proposed several algorithms for variable history windows, among other contribu-
tions. These techniques allow for the history window to have a different depth in
different parts of the state space. For instance, Utile Suffix Memory (USM) learns
a short-term memory representation by growing a suffix tree (McCallum, 1995), an
example of which is shown in Figure 12.7b. USM groups together RL experiences
based on how much history it considers significant for each instance. In this sense,
in different parts of the state space different history lengths can be maintained, in
contrast to the finite history window approaches. A suffix tree representation is de-
picted by solid lines in Figure 12.7b, where the leaves cluster instances that have
a matching history up to the corresponding depth. The dashed nodes are the so-
called fringe nodes: additional branches in the tree that the algorithm can consider
to add to the tree. When a statistical test indicates that instances in a branch of fringe
nodes come from different distributions of the expected future discounted reward,
the tree is grown to include this fringe branch. Put otherwise, if adding the branch
helps predict future rewards, it is worthwhile to extend the memory in
the corresponding part of the state space. More recent work building on these ideas
focuses on better learning behavior in the presence of noisy observations (Shani and
Brafman, 2005; Wierstra and Wiering, 2004). Along these lines, recurrent neural
networks, for instance based on the Long Short-Term Memory architecture, have
also been successfully used as internal state representation (Hochreiter and Schmid-
huber, 1997; Bakker, 2002).
Other representations have been proposed as well. Meuleau et al (1999b) extend
the VAPS algorithm (Baird and Moore, 1999) to learn policies represented as Fi-
nite State Automata (FSA). The FSA represent finite policy graphs, in which nodes
are labelled with actions, and the arcs with observations. As in VAPS, stochastic
gradient ascent is used to converge to a locally optimal controller. The problem of
finding the optimal policy graph of a given size has also been studied (Meuleau et al,
1999a). However, note that the optimal POMDP policy can require an infinite policy
graph to be properly represented.
Finally, predictive state representations (PSRs) have been proposed as an alter-
native to POMDPs for modeling stochastic and partially observable environments
(Littman et al, 2002; Singh et al, 2004), see Chapter 13. A PSR dispenses with the
hidden POMDP states, and only considers sequences of actions and observations,
which are observed quantities. In a PSR, the state of the system is expressed in terms of
possible future event sequences, or “core tests”, of alternating actions and observa-
tions. The state of a PSR is defined as a vector of probabilities that each core test
can actually be realized, given the current history. The advantages of PSRs are most
apparent in model-free learning settings, as the model only considers observable
events instead of hidden states.
To conclude, we discuss some types of approaches that have been gaining popularity
recently.
Most of the model-based methods discussed in this chapter are offline techniques
that determine a priori what action to take in each situation the agent might en-
counter. Online approaches, on the other hand, only compute what action to take at
the current moment (Ross et al, 2008b). Focusing exclusively on the current deci-
sion can provide significant computational savings in certain domains, as the agent
does not have to plan for areas of the state space which it never encounters. How-
ever, the need to choose actions every time step implies severe constraints on the
online search time. Offline point-based methods can be used to compute a rough
value function, serving as the online search heuristic. In a similar manner, Monte
Carlo approaches are also appealing for large POMDPs, as they only require a gen-
erative model (black box simulator) to be available and they have the potential to
mitigate the curse of dimensionality (Thrun, 2000; Kearns et al, 2000; Silver and
Veness, 2010).
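As a minimal illustration of planning with only a generative model, the sketch below estimates Q-values at the current belief by sampling a state from a particle-based belief, stepping a black-box simulator, and following a uniform-random rollout policy. Unlike POMCP (Silver and Veness, 2010) it builds no search tree and performs no belief updates inside the rollouts, and the simulator interface sim(state, action) returning (next state, observation, reward) is an assumption.

```python
import random

def mc_q_estimate(sim, belief_particles, action, actions,
                  n_rollouts=100, depth=20, gamma=0.95):
    """Crude Monte Carlo estimate of Q(b, a): sample a state from a particle belief,
    take `action` in the black-box simulator, then follow a uniform-random rollout."""
    total = 0.0
    for _ in range(n_rollouts):
        s = random.choice(belief_particles)          # sample s ~ b
        s, _, r = sim(s, action)                     # sim: (state, action) -> (state', obs, reward)
        ret, discount = r, gamma
        for _ in range(depth - 1):
            s, _, r = sim(s, random.choice(actions))
            ret += discount * r
            discount *= gamma
        total += ret
    return total / n_rollouts

def mc_select_action(sim, belief_particles, actions, **kwargs):
    """Greedy one-step lookahead at the current belief using rollout estimates."""
    return max(actions,
               key=lambda a: mc_q_estimate(sim, belief_particles, a, actions, **kwargs))
```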
As discussed in detail in Chapter 11, Bayesian RL techniques are promising for
POMDPs, as they provide an integrated way of exploring and exploiting models.
Put otherwise, they do not require interleaving the model-learning phases (e.g.,
using Baum-Welch (Koenig and Simmons, 1996) or other methods (Shani et al,
2005)) with model-exploitation phases, which could be a naive approach to apply
model-based methods to unknown POMDPs. Poupart and Vlassis (2008) extended
the BEETLE algorithm (Poupart et al, 2006), a Bayesian RL method for MDPs, to
partially observable settings. As in other Bayesian RL methods, the models are repre-
sented by Dirichlet distributions, and learning involves updating the Dirichlet hyper-
parameters. The work is more general than the earlier work by Jaulmes et al (2005),
which required the existence of an oracle that the agent could query to reveal the
true state. Ross et al (2008a) proposed the Bayes-Adaptive POMDP model, an al-
ternative model for Bayesian reinforcement learning which extends Bayes-Adaptive
MDPs (Duff, 2002). All these methods assume that the sizes of the state, observation,
and action spaces are known.
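The Dirichlet bookkeeping these methods rely on is straightforward when transitions are fully observed, as the sketch below shows for a transition model: each (s, a) pair carries a vector of hyperparameters that is incremented whenever a successor state is seen. In the partially observable setting the state is hidden, which is exactly why methods such as the Bayes-Adaptive POMDP must instead maintain a distribution over these counts; the hypothetical class below is meant only to convey the conjugate-update mechanics.

```python
import numpy as np

class DirichletTransitionModel:
    """Per-(s, a) Dirichlet distribution over next states, updated by counting."""

    def __init__(self, n_states, n_actions, prior=1.0):
        # alpha[s, a, s'] are Dirichlet hyperparameters; `prior` is a symmetric prior count.
        self.alpha = np.full((n_states, n_actions, n_states), prior)

    def update(self, s, a, s_next):
        self.alpha[s, a, s_next] += 1.0              # conjugate update: add one observed count

    def mean(self, s, a):
        """Posterior-mean transition distribution P(.|s, a)."""
        return self.alpha[s, a] / self.alpha[s, a].sum()

    def sample(self, s, a):
        """Sample a plausible transition distribution from the posterior."""
        return np.random.dirichlet(self.alpha[s, a])
```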
Policy gradient methods search in a space of parameterized policies, optimizing
the policy by performing gradient ascent in the parameter space (Peters and Bagnell,
2010). As these methods do not require estimating a belief state (Aberdeen and
Baxter, 2002), they have been readily applied in POMDPs, with impressive results
(Peters and Schaal, 2008).
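A bare-bones example of the policy-gradient idea is vanilla REINFORCE on a memoryless softmax policy π(a|o), sketched below; no belief state is ever computed. This is far simpler than the internal-state gradient methods of Aberdeen and Baxter (2002) or the natural actor-critic of Peters and Schaal (2008), and it assumes discrete observations indexed 0, ..., n_obs − 1 and the usual reset()/step() environment interface.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce(env, n_obs, n_actions, episodes=1000, alpha=0.01, gamma=0.99):
    """Vanilla REINFORCE on a memoryless softmax policy pi(a|o); theta holds one
    parameter per (observation, action) pair and no belief state is computed."""
    theta = np.zeros((n_obs, n_actions))
    for _ in range(episodes):
        trajectory = []                              # list of (obs, action, reward)
        obs, done = env.reset(), False
        while not done:
            probs = softmax(theta[obs])
            a = np.random.choice(n_actions, p=probs)
            obs2, reward, done = env.step(a)
            trajectory.append((obs, a, reward))
            obs = obs2
        G = 0.0
        for obs, a, reward in reversed(trajectory):  # discounted return-to-go
            G = reward + gamma * G
            grad_log = -softmax(theta[obs])          # d log pi(a|o) / d theta[o, :]
            grad_log[a] += 1.0
            theta[obs] += alpha * G * grad_log
    return theta
```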
Finally, a recent trend has been to cast the model-based RL problem as one of
probabilistic inference, for instance using Expectation Maximization for computing
optimal policies in MDPs. Vlassis and Toussaint (2009) showed how such methods
can also be extended to the model-free POMDP case. In general, inference methods
can provide fresh insights into well-known RL algorithms.
References
Aberdeen, D., Baxter, J.: Scaling internal-state policy-gradient methods for POMDPs. In:
International Conference on Machine Learning (2002)
Åström, K.J.: Optimal control of Markov processes with incomplete state information. Jour-
nal of Mathematical Analysis and Applications 10(1), 174–205 (1965)
Bagnell, J.A., Kakade, S., Ng, A.Y., Schneider, J.: Policy search by dynamic programming.
In: Advances in Neural Information Processing Systems, vol. 16. MIT Press (2004)
Baird, L., Moore, A.: Gradient descent for general reinforcement learning. In: Advances in
Neural Information Processing Systems, vol. 11. MIT Press (1999)
Bakker, B.: Reinforcement learning with long short-term memory. In: Advances in Neural
Information Processing Systems, vol. 14. MIT Press (2002)
Baxter, J., Bartlett, P.L.: Infinite-horizon policy-gradient estimation. Journal of Artificial
Intelligence Research 15, 319–350 (2001)
Baxter, J., Bartlett, P.L., Weaver, L.: Experiments with infinite-horizon, policy-gradient esti-
mation. Journal of Artificial Intelligence Research 15, 351–381 (2001)
Bernstein, D.S., Givan, R., Immerman, N., Zilberstein, S.: The complexity of decentralized
control of Markov decision processes. Mathematics of Operations Research 27(4), 819–
840 (2002)
Bonet, B.: An epsilon-optimal grid-based algorithm for partially observable Markov decision
processes. In: International Conference on Machine Learning (2002)
Boutilier, C., Poole, D.: Computing optimal policies for partially observable decision pro-
cesses using compact representations. In: Proc. of the National Conference on Artificial
Intelligence (1996)
Brafman, R.I.: A heuristic variable grid solution method for POMDPs. In: Proc. of the
National Conference on Artificial Intelligence (1997)
Braziunas, D., Boutilier, C.: Stochastic local search for POMDP controllers. In: Proc. of the
National Conference on Artificial Intelligence (2004)
Brunskill, E., Kaelbling, L., Lozano-Perez, T., Roy, N.: Continuous-state POMDPs with hy-
brid dynamics. In: Proc. of the Int. Symposium on Artificial Intelligence and Mathematics
(2008)
Cassandra, A.R.: Exact and approximate algorithms for partially observable Markov decision
processes. PhD thesis, Brown University (1998)
Cassandra, A.R., Kaelbling, L.P., Littman, M.L.: Acting optimally in partially observable
stochastic domains. In: Proc. of the National Conference on Artificial Intelligence (1994)
Cassandra, A.R., Kaelbling, L.P., Kurien, J.A.: Acting under uncertainty: Discrete Bayesian
models for mobile robot navigation. In: Proc. of International Conference on Intelligent
Robots and Systems (1996)
Cassandra, A.R., Littman, M.L., Zhang, N.L.: Incremental pruning: A simple, fast, exact
method for partially observable Markov decision processes. In: Proc. of Uncertainty in
Artificial Intelligence (1997)
Cheng, H.T.: Algorithms for partially observable Markov decision processes. PhD thesis,
University of British Columbia (1988)
Doshi, F., Roy, N.: The permutable POMDP: fast solutions to POMDPs for preference elic-
itation. In: Proc. of Int. Conference on Autonomous Agents and Multi Agent Systems
(2008)
Drake, A.W.: Observation of a Markov process through a noisy channel. Sc.D. thesis,
Massachusetts Institute of Technology (1962)
Duff, M.: Optimal learning: Computational procedures for Bayes-adaptive Markov decision
processes. PhD thesis, University of Massachusetts, Amherst (2002)
Dynkin, E.B.: Controlled random sequences. Theory of Probability and its Applica-
tions 10(1), 1–14 (1965)
Ellis, J.H., Jiang, M., Corotis, R.: Inspection, maintenance, and repair with partial observabil-
ity. Journal of Infrastructure Systems 1(2), 92–99 (1995)
Feng, Z., Zilberstein, S.: Region-based incremental pruning for POMDPs. In: Proc. of Un-
certainty in Artificial Intelligence (2004)
Foka, A., Trahanias, P.: Real-time hierarchical POMDPs for autonomous robot navigation.
Robotics and Autonomous Systems 55(7), 561–571 (2007)
Fox, D., Burgard, W., Thrun, S.: Markov localization for mobile robots in dynamic environ-
ments. Journal of Artificial Intelligence Research 11, 391–427 (1999)
Haight, R.G., Polasky, S.: Optimal control of an invasive species with imperfect information
about the level of infestation. Resource and Energy Economics (2010), in press
Hansen, E.A.: Finite-memory control of partially observable systems. PhD thesis, University
of Massachusetts, Amherst (1998a)
Hansen, E.A.: Solving POMDPs by searching in policy space. In: Proc. of Uncertainty in
Artificial Intelligence (1998b)
Hansen, E.A., Feng, Z.: Dynamic programming for POMDPs using a factored state represen-
tation. In: Int. Conf. on Artificial Intelligence Planning and Scheduling (2000)
Hauskrecht, M.: Value function approximations for partially observable Markov decision pro-
cesses. Journal of Artificial Intelligence Research 13, 33–95 (2000)
Hauskrecht, M., Fraser, H.: Planning treatment of ischemic heart disease with partially
observable Markov decision processes. Artificial Intelligence in Medicine 18, 221–244
(2000)
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–
1780 (1997)
Hoey, J., Little, J.J.: Value-directed human behavior analysis from video using partially ob-
servable Markov decision processes. IEEE Transactions on Pattern Analysis and Machine
Intelligence 29(7), 1–15 (2007)
Hoey, J., Poupart, P.: Solving POMDPs with continuous or large discrete observation spaces.
In: Proc. Int. Joint Conf. on Artificial Intelligence (2005)
Hsiao, K., Kaelbling, L., Lozano-Perez, T.: Grasping POMDPs. In: Proc. of the IEEE Int. Conf.
on Robotics and Automation, pp. 4685–4692 (2007)
Jaakkola, T., Singh, S.P., Jordan, M.I.: Reinforcement learning algorithm for partially observ-
able Markov decision problems. In: Advances in Neural Information Processing Systems,
vol. 7 (1995)
Jaulmes, R., Pineau, J., Precup, D.: Active Learning in Partially Observable Markov Decision
Processes. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML
2005. LNCS (LNAI), vol. 3720, pp. 601–608. Springer, Heidelberg (2005)
Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially observable
stochastic domains. Artificial Intelligence 101, 99–134 (1998)
Kearns, M., Mansour, Y., Ng, A.Y.: Approximate planning in large POMDPs via reusable
trajectories. In: Advances in Neural Information Processing Systems, vol. 12. MIT Press
(2000)
Koenig, S., Simmons, R.: Unsupervised learning of probabilistic models for robot navigation.
In: Proc. of the IEEE Int. Conf. on Robotics and Automation (1996)
Kurniawati, H., Hsu, D., Lee, W.: SARSOP: Efficient point-based POMDP planning by ap-
proximating optimally reachable belief spaces. In: Robotics: Science and Systems (2008)
Lin, L., Mitchell, T.: Memory approaches to reinforcement learning in non-Markovian do-
mains. Tech. rep., Carnegie Mellon University, Pittsburgh, PA, USA (1992)
Lin, Z.Z., Bean, J.C., White, C.C.: A hybrid genetic/optimization algorithm for finite horizon,
partially observed Markov decision processes. INFORMS Journal on Computing 16(1),
27–38 (2004)
Littman, M.L.: Memoryless policies: theoretical limitations and practical results. In: Proc. of
the 3rd Int. Conf. on Simulation of Adaptive Behavior: from Animals to Animats 3, pp.
238–245. MIT Press, Cambridge (1994)
Littman, M.L.: Algorithms for sequential decision making. PhD thesis, Brown University
(1996)
Littman, M.L., Cassandra, A.R., Kaelbling, L.P.: Learning policies for partially observable
environments: Scaling up. In: International Conference on Machine Learning (1995)
Littman, M.L., Sutton, R.S., Singh, S.: Predictive representations of state. In: Advances in
Neural Information Processing Systems, vol. 14. MIT Press (2002)
Loch, J., Singh, S.: Using eligibility traces to find the best memoryless policy in partially
observable Markov decision processes. In: International Conference on Machine Learning
(1998)
Lovejoy, W.S.: Computationally feasible bounds for partially observed Markov decision
processes. Operations Research 39(1), 162–175 (1991)
Madani, O., Hanks, S., Condon, A.: On the undecidability of probabilistic planning and
related stochastic optimization problems. Artificial Intelligence 147(1-2), 5–34 (2003)
McCallum, R.A.: Overcoming incomplete perception with utile distinction memory. In:
International Conference on Machine Learning (1993)
McCallum, R.A.: Instance-based utile distinctions for reinforcement learning with hidden
state. In: International Conference on Machine Learning (1995)
McCallum, R.A.: Reinforcement learning with selective perception and hidden state. PhD
thesis, University of Rochester (1996)
Meuleau, N., Kim, K.E., Kaelbling, L.P., Cassandra, A.R.: Solving POMDPs by searching
the space of finite policies. In: Proc. of Uncertainty in Artificial Intelligence (1999a)
Meuleau, N., Peshkin, L., Kim, K.E., Kaelbling, L.P.: Learning finite-state controllers for par-
tially observable environments. In: Proc. of Uncertainty in Artificial Intelligence (1999b)
Monahan, G.E.: A survey of partially observable Markov decision processes: theory, models
and algorithms. Management Science 28(1) (1982)
Ng, A.Y., Jordan, M.: PEGASUS: A policy search method for large MDPs and POMDPs. In:
Proc. of Uncertainty in Artificial Intelligence (2000)
Oliehoek, F.A., Spaan, M.T.J., Vlassis, N.: Optimal and approximate Q-value functions for
decentralized POMDPs. Journal of Artificial Intelligence Research 32, 289–353 (2008)
Papadimitriou, C.H., Tsitsiklis, J.N.: The complexity of Markov decision processes. Mathe-
matics of Operations Research 12(3), 441–450 (1987)
Parr, R., Russell, S.: Approximating optimal policies for partially observable stochastic do-
mains. In: Proc. Int. Joint Conf. on Artificial Intelligence (1995)
Peters, J., Bagnell, J.A.D.: Policy gradient methods. In: Springer Encyclopedia of Machine
Learning. Springer, Heidelberg (2010)
Peters, J., Schaal, S.: Natural actor-critic. Neurocomputing 71, 1180–1190 (2008)
Pineau, J., Thrun, S.: An integrated approach to hierarchy and abstraction for POMDPs. Tech.
Rep. CMU-RI-TR-02-21, Robotics Institute, Carnegie Mellon University (2002)
Pineau, J., Gordon, G., Thrun, S.: Point-based value iteration: An anytime algorithm for
POMDPs. In: Proc. Int. Joint Conf. on Artificial Intelligence (2003)
Platzman, L.K.: A feasible computational approach to infinite-horizon partially-observed
Markov decision problems. Tech. Rep. J-81-2, School of Industrial and Systems Engineer-
ing, Georgia Institute of Technology, reprinted in working notes AAAI, Fall Symposium
on Planning with POMDPs (1981)
Poon, K.M.: A fast heuristic algorithm for decision-theoretic planning. Master’s thesis, The
Hong-Kong University of Science and Technology (2001)
Porta, J.M., Spaan, M.T.J., Vlassis, N.: Robot planning in partially observable continuous
domains. In: Robotics: Science and Systems (2005)
Porta, J.M., Vlassis, N., Spaan, M.T.J., Poupart, P.: Point-based value iteration for continuous
POMDPs. Journal of Machine Learning Research 7, 2329–2367 (2006)
Poupart, P.: Exploiting structure to efficiently solve large scale partially observable Markov
decision processes. PhD thesis, University of Toronto (2005)
Poupart, P., Boutilier, C.: Bounded finite state controllers. In: Advances in Neural Information
Processing Systems, vol. 16. MIT Press (2004)
Poupart, P., Vlassis, N.: Model-based Bayesian reinforcement learning in partially observable
domains. In: International Symposium on Artificial Intelligence and Mathematics, ISAIM
(2008)
Poupart, P., Vlassis, N., Hoey, J., Regan, K.: An analytic solution to discrete Bayesian rein-
forcement learning. In: International Conference on Machine Learning (2006)
Ross, S., Chaib-draa, B., Pineau, J.: Bayes-adaptive POMDPs. In: Advances in Neural Infor-
mation Processing Systems, vol. 20, pp. 1225–1232. MIT Press (2008a)
Ross, S., Pineau, J., Paquet, S., Chaib-draa, B.: Online planning algorithms for POMDPs.
Journal of Artificial Intelligence Research 32, 664–704 (2008b)
Roy, N., Gordon, G.: Exponential family PCA for belief compression in POMDPs. In: Ad-
vances in Neural Information Processing Systems, vol. 15. MIT Press (2003)
Roy, N., Thrun, S.: Coastal navigation with mobile robots. In: Advances in Neural Informa-
tion Processing Systems, vol. 12. MIT Press (2000)
Roy, N., Gordon, G., Thrun, S.: Finding approximate POMDP solutions through belief com-
pression. Journal of Artificial Intelligence Research 23, 1–40 (2005)
Sanner, S., Kersting, K.: Symbolic dynamic programming for first-order POMDPs. In: Proc.
of the National Conference on Artificial Intelligence (2010)
Satia, J.K., Lave, R.E.: Markovian decision processes with probabilistic observation of states.
Management Science 20(1), 1–13 (1973)
Seuken, S., Zilberstein, S.: Formal models and algorithms for decentralized decision making
under uncertainty. Autonomous Agents and Multi-Agent Systems (2008)
Shani, G., Brafman, R.I.: Resolving perceptual aliasing in the presence of noisy sensors.
In: Saul, L.K., Weiss, Y., Bottou, L. (eds.) Advances in Neural Information Processing
Systems, vol. 17, pp. 1249–1256. MIT Press, Cambridge (2005)
Shani, G., Brafman, R.I., Shimony, S.E.: Model-Based Online Learning of POMDPs. In:
Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS
(LNAI), vol. 3720, pp. 353–364. Springer, Heidelberg (2005)
Shani, G., Brafman, R.I., Shimony, S.E.: Forward search value iteration for POMDPs. In:
Proc. Int. Joint Conf. on Artificial Intelligence (2007)
Shani, G., Poupart, P., Brafman, R.I., Shimony, S.E.: Efficient ADD operations for point-
based algorithms. In: Int. Conf. on Automated Planning and Scheduling (2008)
Silver, D., Veness, J.: Monte-Carlo planning in large POMDPs. In: Lafferty, J., Williams,
C.K.I., Shawe-Taylor, J., Zemel, R., Culotta, A. (eds.) Advances in Neural Information
Processing Systems, vol. 23, pp. 2164–2172 (2010)
Simmons, R., Koenig, S.: Probabilistic robot navigation in partially observable environments.
In: Proc. Int. Joint Conf. on Artificial Intelligence (1995)
Singh, S., Jaakkola, T., Jordan, M.: Learning without state-estimation in partially observable
Markovian decision processes. In: International Conference on Machine Learning (1994)
Singh, S., James, M.R., Rudary, M.R.: Predictive state representations: A new theory for
modeling dynamical systems. In: Proc. of Uncertainty in Artificial Intelligence (2004)
Smallwood, R.D., Sondik, E.J.: The optimal control of partially observable Markov decision
processes over a finite horizon. Operations Research 21, 1071–1088 (1973)
Smith, T., Simmons, R.: Heuristic search value iteration for POMDPs. In: Proc. of Uncer-
tainty in Artificial Intelligence (2004)
Smith, T., Simmons, R.: Point-based POMDP algorithms: Improved analysis and implemen-
tation. In: Proc. of Uncertainty in Artificial Intelligence (2005)
Sondik, E.J.: The optimal control of partially observable Markov processes. PhD thesis,
Stanford University (1971)
Spaan, M.T.J., Vlassis, N.: A point-based POMDP algorithm for robot planning. In: Proc. of
the IEEE Int. Conf. on Robotics and Automation (2004)
Spaan, M.T.J., Vlassis, N.: Perseus: Randomized point-based value iteration for POMDPs.
Journal of Artificial Intelligence Research 24, 195–220 (2005a)
Spaan, M.T.J., Vlassis, N.: Planning with continuous actions in partially observable environ-
ments. In: Proc. of the IEEE Int. Conf. on Robotics and Automation (2005b)
Spaan, M.T.J., Veiga, T.S., Lima, P.U.: Active cooperative perception in network robot sys-
tems using POMDPs. In: Proc. of International Conference on Intelligent Robots and
Systems (2010)
Sridharan, M., Wyatt, J., Dearden, R.: Planning to see: A hierarchical approach to planning
visual actions on a robot using POMDPs. Artificial Intelligence 174, 704–725 (2010)
Stankiewicz, B., Cassandra, A., McCabe, M., Weathers, W.: Development and evaluation of a
Bayesian low-vision navigation aid. IEEE Transactions on Systems, Man and Cybernetics,
Part A: Systems and Humans 37(6), 970–983 (2007)
Stratonovich, R.L.: Conditional Markov processes. Theory of Probability and Its Applica-
tions 5(2), 156–178 (1960)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (1998)
Theocharous, G., Mahadevan, S.: Approximate planning with hierarchical partially observ-
able Markov decision processes for robot navigation. In: Proc. of the IEEE Int. Conf. on
Robotics and Automation (2002)
Thrun, S.: Monte Carlo POMDPs. In: Advances in Neural Information Processing Systems,
vol. 12. MIT Press (2000)
Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics. MIT Press (2005)
Varakantham, P., Maheswaran, R., Tambe, M.: Exploiting belief bounds: Practical POMDPs
for personal assistant agents. In: Proc. of Int. Conference on Autonomous Agents and
Multi Agent Systems (2005)
Vlassis, N., Toussaint, M.: Model-free reinforcement learning as mixture learning. In: Inter-
national Conference on Machine Learning, pp. 1081–1088. ACM (2009)
Wang, C., Khardon, R.: Relational partially observable MDPs. In: Proc. of the National Con-
ference on Artificial Intelligence (2010)
White, C.C.: Partially observed Markov decision processes: a survey. Annals of Operations
Research 32 (1991)
Wiering, M., Schmidhuber, J.: HQ-learning. Adaptive Behavior 6(2), 219–246 (1997)
Wierstra, D., Wiering, M.: Utile distinction hidden Markov models. In: International Confer-
ence on Machine Learning (2004)
Williams, J.D., Young, S.: Partially observable Markov decision processes for spoken dialog
systems. Computer Speech and Language 21(2), 393–422 (2007)
Williams, J.K., Singh, S.: Experimental results on learning stochastic memoryless policies
for partially observable Markov decision processes. In: Advances in Neural Information
Processing Systems, vol. 11 (1999)
Zhang, N.L., Liu, W.: Planning in stochastic domains: problem characteristics and approxi-
mations. Tech. Rep. HKUST-CS96-31, Department of Computer Science, The Hong Kong
University of Science and Technology (1996)
Zhou, R., Hansen, E.A.: An improved grid-based approximation algorithm for POMDPs. In:
Proc. Int. Joint Conf. on Artificial Intelligence (2001)
Chapter 13
Predictively Defined Representations of State
David Wingate
Massachusetts Institute of Technology, Cambridge, MA, 02139
e-mail: [email protected]
13.1 Introduction
This chapter considers the class of dynamical systems models with predictively de-
fined representations of state. To motivate and introduce the idea, we will first re-
examine the fundamental notion of state, why it is important to an agent, and why
different representations of state might have different theoretical or computational
properties—particularly as they apply to learning both a representation and a system
dynamics model online.
In the model-based RL setting, for example, an agent must learn a model from
data acquired from interactions with an environment. These models are used with
planning algorithms to help an agent choose optimal actions by reasoning about
the future, so the fundamental capability models must have is to predict the conse-
quences of actions in terms of future observations and rewards. To make the best
predictions possible, models must summarize important past experience. Such a
summary of the past is known as state, which we will define formally momen-
tarily. For now, we informally define state as a summary of an agent’s knowledge
about the state of affairs in the environment. In Markov Decision Processes (MDPs),
for example, the most recent observation constitutes state, meaning that building a
model simply involves learning transitions between states. In partially observable
domains, however, the most recent observation does not constitute state, so an agent
must learn what is important about the past, how to remember it, and how to use it
to predict the future.
This chapter examines in-depth a particular class of state representations known
as “predictively defined representations of state”. In a model with a predictively
defined representation of state, an agent summarizes the past as a set of predictions
about the short-term future that allow the agent to make further predictions about the
infinite future. This representation often seems counter-intuitive at first, but the idea
is simple: every state is a summary of a past, and every past implies a distribution
over the future—so why not represent state directly as sufficient statistics of this
distribution over possible futures?
Here, we will examine the theoretical foundations of this type of representation,
and will compare their computational and representational characteristics with more
traditional notions of state. We will focus on how they can be learned directly from
data acquired online, by considering two key parts of model building: learning a rep-
resentation of state, and learning an algorithm which allows an agent to accurately
maintain state in response to new information.
This chapter is intimately concerned with the idea of state and how it can be rep-
resented. We now define state more precisely, with emphasis on the fact that there
are multiple acceptable representations of state. Throughout this chapter, we will
always discuss state from the perspective of the agent, as opposed to that of an
omniscient observer—a perspective motivated by our focus on agents who must
learn a model.
What exactly is “state”? Informally, state is the current situation that the agent
finds itself in: for a robot, state might be the position and angle of all its actuators,
as well as its map coordinates, battery status, the goal it is trying to achieve, etc. In a
Markov system, all of these variables are given to the agent. However, in a partially
observable system the agent may not have immediate access to all of the information
that it would like to know about its situation: a robot with a broken sensor may not
know the exact position of its arm; a stock trader may not know exactly what the
long-term strategies are of all the companies he is investing in; and a baseball batter
may not know exactly how fast the baseball is approaching.
In partially observable domains, there is a more formal definition of state: state
is a summary of all of the information that an agent could possibly have about its
current situation. This information is contained in the history of actions and ob-
servations it has experienced, which means that a full history constitutes perfect
state and is therefore the standard against which other state representations are mea-
sured. Of course, an agent will usually want to compress this history—without los-
ing information!—because otherwise it must store an increasingly large amount of
information as it interacts with the environment longer and longer. This motivates
our formal definition of state: a summary of history constitutes state if conditioning
on it makes the rest of the history unnecessary for predicting the future, that is, if
p(future|state, past) = p(future|state).
There are many acceptable summaries of history. To illustrate this, consider the ex-
ample of a robot localization problem. A small robot wanders through a building,
and must track its position, but it is given only a camera sensor. There are many pos-
sible representations for the robot’s position. Its pose could be captured in terms of
a distribution over x,y coordinates, for example. However, it could also be described
in terms of, say, a distribution over polar coordinates. Cartesian and polar coordi-
nates are different representations of state which are equally expressive, but both
are internal to the agent. From the perspective of the environment, neither is more
accurate, or more correct, or more useful—and we have said nothing about how ei-
ther Cartesian or polar coordinates could be accurately maintained given nothing but
camera images. It is easy to see that there are an infinite number of such state repre-
sentations: any one-to-one transformation of state is still state, and adding redundant
information to state is still state.
The realization that there are multiple satisfactory representations of state raises
the question: among all possible concepts of state, why should one be preferred
over another? There are many criteria that could be used to compare competing
representations. For example, a representation might be preferred if:
• It is easier for a human (designing an agent) to understand and use.
• It somehow “matches” another agent’s notion of state.
• It has favorable computational or statistical properties.
Not every statistic of history will be sufficient for predicting the future, which means
that some representations may only constitute approximate state. Among approxi-
mately sufficient representations, one might be preferred if:
• It is more expressive than another representation.
• It is less expressive than another, but is still sufficient to do what we want to do
with it (for example, control the system optimally).
Thus, even among state representations which are equally expressive, there might
be reasons to prefer one over another.
Because we are interested in learning agents, we are interested in learnable repre-
sentations of state—those for which effective learning algorithms are available. The
idea that one representation of state may be more learnable than another motivates
our first distinction between different representations of state: grounded1 represen-
tations of state are those in which every component of the state is defined using
only statistics about observable quantities (which could be either observables in the
future, or the past), and latent representations of state refer to everything else. In
our robot localization example, both Cartesian and polar coordinates are latent rep-
resentations of state, because neither is explicitly observed by the robot; only a state
representation defined in terms of features of camera images could be defined as
grounded.
1 Some disciplines may have other definitions of the word “grounded” which are specific
and technical; we avoid them.
Within the class of grounded representations of state, we will make further dis-
tinctions. Some grounded representations may be defined in terms of past observa-
tions, as in the case of k-th order Markov models (where the past k observations
constitute state), and others could be defined in terms of the current observation.
There is also a third class of grounded representations, which is the subject of
this chapter: predictively defined representations of state. In a predictively defined
representation of state, state is represented as statistics about features of future ob-
servations. These statistics are flexible: they may be the parameters of a distribution
over the short-term future, represent the expectations of random variables in the
future, represent the densities of specific futures, represent statements about future
strings of observations given possible future actions, etc. It is this class of state rep-
resentations which we will investigate throughout the chapter, along with algorithms
for learning and maintaining that state.
Predictively defined representations may also have advantages for learning and generalization. For example, Rafols et al (2005) have provided some pre-
liminary evidence that predictive representations provide a better basis for gen-
eralization than latent ones. To see the intuition for this, consider the problem of
assigning values to states for a domain in which an agent must navigate a maze.
Using a predictively defined representation, two states are “near” each other
when their distributions over the future are similar; if that is true, it is likely
that they should be assigned similar values. But if the agent uses, say, Cartesian
coordinates as a state representation, two states which are nearby in Euclidean
space may not necessarily have similar values. The classic example is two states
on either side of a wall: although the two states appear to be close, an agent may
have to travel long distances through the maze to reach one from the other, and
they should be assigned different values, which may be difficult to do with a
smooth function approximator. Littman et al (2002) have also suggested that
in compositional domains, predictions could also be useful in learning to make
other predictions, stating that in many cases “the solutions to earlier [predic-
tion] problems have been shown to provide features that generalize particularly
well to subsequent [prediction] problems.” This was also partly demonstrated
in Izadi and Precup (2005), who showed that PSRs have a natural ability to
compress symmetries in state spaces.
The rest of this chapter surveys important concepts and results for models with
predictively defined representations of state. Section 13.2 introduces PSRs, Sec-
tion 13.3 discusses how to learn a PSR model from data, and Section 13.4 discusses
planning in PSRs. Section 13.5 tours extensions of PSRs, and Section 13.6 surveys
other models with predictively defined state. Finally, Section 13.7 presents some
concluding thoughts.
13.2 PSRs
POMDPs have latent states at their heart—a POMDP begins by describing a latent
state space, transitions between those states, and the observations that they gener-
ate; it defines its sufficient statistic for history as a distribution over these latent
states, etc. This model is convenient in several situations, such as when a human de-
signer knows that there really are latent states in the system and knows something
about their relationships. However, numerous authors have pointed out that while a
POMDP is easy to write down, it is notoriously hard to learn from data (Nikovski,
2002; Shatkay and Kaelbling, 2002), which is our central concern.
In this section, we turn to an alternative model of controlled dynamical systems
with discrete observations, called a “Predictive State Representation” (or PSR)2.
The PSR was introduced by Littman et al (2002), and is one of the first models
with a predictively defined representation of state. A PSR is a model capable of
capturing dynamical systems that have discrete actions and discrete observations,
2 Unfortunately, this name would be more appropriate as a name for an entire class of
models.
Fig. 13.1 An example of state as predictions about the future. A latent state representation
might be defined as x,y coordinates, but a predictively defined state is defined in terms of
future possibilities.
like a POMDP. The important contrast is that while POMDPs are built around latent
states, PSRs never make any reference to a latent state; instead, a PSR represents
state as a set of statistics about the future. This will have positive consequences for
learning, as we will see, but importantly, we will lose nothing in terms of modeling
capacity: we will see that PSRs can model any system that a finite-state POMDP
can, and that many POMDP planning algorithms are directly applicable to PSRs.
We will now introduce the terminology needed to explain PSRs.
A test is a sequence of alternating actions and observations, t = a1 o1 a2 o2 · · · an on. One
can imagine that a sufficiently detailed set of sufficiently long predictions about such tests could
disambiguate any two positions in a maze—while only referring to observable
quantities. This is precisely the intuition behind the state representation in PSRs.
Tests form a central part of the state representation used by PSRs. They are also
central to the mathematical objects that PSRs rely on for theoretical results, as well
as most learning algorithms for PSRs. This is because from any given history, there
is some distribution over possible future sequences of actions and observations that
can be captured through the use of tests: each possible future corresponds to a dif-
ferent test, and there is some (possibly zero) probability that each test will occur
from every history.
We now define the prediction for a test t = a1 o1 a2 o2 · · · an on. We assume that
the agent is starting in history h = a1 o1 a2 o2 · · · am om, meaning that it has taken the
actions specified in h and seen the observations specified in h. The prediction of test
t is defined to be the probability that the next n observations are exactly those in t,
given that the next n actions taken are exactly those in t:

p(t|h) = Pr(o_{m+1} = o1, . . . , o_{m+n} = on | h, a_{m+1} = a1, . . . , a_{m+n} = an).

The actions in the test are executed in an open-loop way, without depending on any
future information, so Pr(an | a1, o1, · · · , an−1, on−1) = Pr(an | a1, · · · , an−1). See Bowling
et al (2006) for a discussion of why this matters, particularly
when empirically estimating test probabilities.
For ease of notation, we use the following shorthand: for a set of tests T =
{t1 ,t2 , · · ·tn }, the quantity p(T |h) = [p(t1 |h), p(t2 |h), · · · p(tn |h)] is a column vector
of predictions of the tests in the set T .
The systems dynamics vector (Singh et al, 2004) is a conceptual construct intro-
duced to define PSRs. This vector describes the evolution of a dynamical system
over time: every possible test t has an entry in this vector, representing p(t|∅),
the prediction of t from the null history. These tests are conventionally arranged in
length-lexicographic order, from shortest to longest. In general, this is an infinitely
long vector, but it will still be useful from a theoretical perspective. Here, we will use
the notation a_t^m o_t^n to denote the m-th action and the n-th observation at time t.

The closely related system dynamics matrix D has one row for every possible history
(longer histories growing downwards, starting from the null history ∅) and one column
for every possible test (longer futures growing to the right). Entry Dij holds the
prediction of test tj from history hi:

D_{ij} = p(tj | hi) = p(hi tj | ∅) / p(hi | ∅).
Note that the system dynamics vector has all of the information needed to com-
pute the full system dynamics matrix. Tests and histories are arranged length-
lexicographically, with ever increasing test and history lengths. The matrix has an
infinite number of rows and columns, and like the system dynamics vector, it is a
complete description of a dynamical system.
The system dynamics matrix inherently defines a notion of sufficient statistic, and
suggests several possible learning algorithms and state update mechanisms. For ex-
ample, even though the system dynamics matrix has an infinite number of columns
and rows, if it has finite rank, there must be at least one finite set of linearly inde-
pendent columns corresponding to a set of linearly independent tests. We call the
tests associated with these linearly independent columns core tests. Similarly, there
must be at least one set of linearly independent rows corresponding to linearly in-
dependent histories. We call the histories associated with these linearly independent
rows core histories.
In fact, the rank of the system dynamics matrix has been shown to be finite for
interesting cases, such as POMDPs, as explained in Section 13.2.10. Estimating and
leveraging the rank of the systems dynamics matrix is a key component of many
PSR learning algorithms, as we shall see.
13.2.6 State
We are now prepared to discuss the key idea of PSRs: PSRs represent state as a set of
predictions about core tests, which represent the probabilities of possible future ob-
servations given possible future actions. Core tests are at the heart of PSRs, because
by definition, every other column—i.e., any prediction about possible futures—can
be computed as a weighted combination of these columns. Importantly, the weights
are independent of time and history, which means they may be estimated once
and used for the lifetime of the agent.
To see how the predictions of a set of core tests can constitute state, consider a
particular history ht . Suppose that an agent knows which tests are core tests. We
will call this set Q, and suppose furthermore that the agent has access to a vector
containing their predictions from history ht, p(Q|ht). The prediction of any other
test c can then be computed as

p(c|ht) = m_c^⊤ p(Q|ht).
Because columns correspond to possible futures, this agent can predict anything
about the future that it needs to, assuming it has the appropriate weight vector mc .
The weight vector mc does not depend on history, which will be critical to maintain-
ing state, as we will see in a moment. Because an agent can predict anything it needs
to as a linear combination of the entries in p(Q|ht ), we say that the predictions of
these core tests are a linearly sufficient statistic for the system.
We have thus satisfied the key definition of state outlined in the introduction: that
p(future|state,past) = p(future|state). In this case, once we have the predictions of
the core tests, we can compute any future of interest, and we may discard history.
To maintain state after taking action a and observing o, the prediction of each core
test qi is updated as

p(qi | hao) = p(aoqi | h) / p(ao | h).
Here, the notation aoqi means “the test consisting of taking action a, observing o,
and then taking the rest of the actions specified by qi and seeing the corresponding
observations in qi .” Note that the quantities on the right-hand side are defined strictly
in terms of predictions from history h—and because p(Q|h) can be used to predict
any future from h, we have solved the problem: to maintain state, we only need to
compute the predictions of the one-step tests (ao) and the one-step extensions (aoqi)
to the core tests as a function of p(Q|h).
This formula is true for all PSRs, whether they use linearly sufficient statistics or
nonlinearly sufficient statistics. In the case of linearly sufficient statistics, the state
update takes on a particularly convenient form, as discussed next.
Linear PSRs are built around the idea of linearly sufficient statistics. In a linear PSR,
for every test c, there is a weight vector mc ∈ R^{|Q|}, independent of history h, such that
the prediction p(c|h) = mc^⊤ p(Q|h) for all h. This means that updating the prediction
of a single core test qi ∈ Q can be done efficiently in closed-form. From history h,
after taking action a and seeing observation o:
p(qi | hao) = p(aoqi | h) / p(ao | h) = (m_{aoq_i}^⊤ p(Q|h)) / (m_{ao}^⊤ p(Q|h)).     (13.1)
This equation shows how a single test can be recursively updated in an elegant,
closed-form way. Previously, we said that given the predictions of a set of core
tests for a certain history h, any other column in the same row could be computed
as a weighted sum of p(Q|h). Here, we see that in order to update state, only two
predictions are needed: the prediction of the one-step test p(ao|h), and the prediction
of the one-step extension p(aoqi |h). Thus, the agent only needs to know the weight
vectors mao , which are the weights for the one-step tests, and the maoqi , which are
the weights for the one-step extensions. We can combine the updates for all the
core tests into a single update by defining the matrix Mao, whose i-th row is equal
to maoqi. Updating all core tests simultaneously is now equivalent to a normalized
matrix-vector multiply:
p(Q|hao) = M_{ao} p(Q|h) / (m_{ao}^⊤ p(Q|h)).     (13.2)
The same parameters also suffice to predict an arbitrary test t = a1 o1 · · · an on from
history h:

p(t|h) = m_{a_n o_n}^⊤ M_{a_{n−1} o_{n−1}} · · · M_{a_1 o_1} p(Q|h).     (13.3)
This can be derived by considering the system dynamics matrix (Singh et al, 2004),
or by considering the parameters of an equivalent POMDP (Littman et al, 2002).
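The update and prediction equations (13.1)–(13.3) translate almost line-for-line into code. The hypothetical class below stores the core-test prediction vector p(Q|h) as its state and assumes the parameters M_ao and m_ao are given, for instance obtained from an equivalent POMDP or learned as described in Section 13.3.

```python
import numpy as np

class LinearPSR:
    """Minimal linear PSR. State is the prediction vector p(Q|h) of the core tests;
    parameters are the one-step weight vectors m_ao and the extension matrices M_ao."""

    def __init__(self, M, m, p0):
        self.M = M                    # dict: (a, o) -> |Q| x |Q| matrix M_ao
        self.m = m                    # dict: (a, o) -> length-|Q| vector m_ao
        self.p = np.array(p0, dtype=float)   # p(Q | empty history)

    def update(self, a, o):
        """State update after taking action a and observing o (Eq. 13.2)."""
        numerator = self.M[(a, o)] @ self.p
        denominator = self.m[(a, o)] @ self.p        # = p(ao | h)
        self.p = numerator / denominator

    def predict(self, test):
        """Probability of test [(a1, o1), ..., (an, on)] from the current state (Eq. 13.3)."""
        v = self.p.copy()
        for (a, o) in test[:-1]:
            v = self.M[(a, o)] @ v
        a_n, o_n = test[-1]
        return float(self.m[(a_n, o_n)] @ v)
```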
Many theoretical properties of linear PSRs have been established by examining the
mappings between linear PSRs and POMDPs. Here, we will examine one relation-
ship which provides some intuition for their general relationship: we show that both
models have linear update and prediction equations, and that the parameters of a
linear PSR are a similarity transformation of the equivalent POMDP’s parameters.
In this section, we only consider finite POMDPs with n states. We use O^{a,o} to
represent an observation matrix, which is a diagonal n × n matrix: entry O^{a,o}_{ii}
represents the probability of observation o in state i when action a is taken. T^a is the
n × n transition matrix, with columns representing T(s′|s, a).
Updating state in a POMDP is like updating state in a PSR. Given a belief state
b_h and a new action a and observation o, the belief state b_{hao} is

b_{hao} = O^{a,o} T^a b_h / (1^⊤ O^{a,o} T^a b_h),
where 1 is the n × 1 vector of all 1s. Note the strong similarity to the PSR state
update in Eq. 13.2: the matrix product O^{a,o} T^a can be computed once offline to create
a single matrix, meaning the computational complexity of updating state in both
representations is identical.
To compute p(t|h) where t = a1 o1 a2 o2 · · · an on, the POMDP must compute

p(t|h) = 1^⊤ O^{a_n,o_n} T^{a_n} · · · O^{a_1,o_1} T^{a_1} b_h = w_t^⊤ b_h,
where we can again precompute the vector-matrix products. Note the similarity to
the PSR prediction equation in Eq. 13.3: both can make any prediction as a weighted
combination of entries in their respective state vectors, with weights that do not
depend on history.
Linear PSRs and POMDPs also have closely related parameters. To see this, we
define the outcome matrix U for an n-dimensional POMDP and a given set of n core
tests Q. An entry Uij represents the probability that a particular test j will succeed
if executed from state i. Note that U will have full rank if the set of tests Q constitutes a
set of core tests. Given U, it has been shown (Littman et al, 2002) that:
M_{ao} = U^{−1} T^a O^{a,o} U,
m_{ao} = U^{−1} T^a O^{a,o} 1.
In other words, the parameters of a linear PSR are a similarity transformation of the
corresponding parameters of the equivalent POMDP. This fact turns out to be useful
in translating POMDP planning algorithms to PSRs, as we shall see in Section 13.4.
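Taking the relations above at face value, a conversion routine is short to sketch: build the outcome matrix U column by column by propagating each core test through the POMDP dynamics, then form the PSR parameters by the quoted similarity transformation. The code below is only illustrative; it assumes the chosen tests really are a set of n core tests (so that U is invertible), and in practice the row/column and transpose conventions for T^a, O^{a,o}, and U must be matched carefully to the definitions in use.

```python
import numpy as np

def pomdp_to_psr(T, O, core_tests):
    """Convert POMDP parameters to linear-PSR parameters via the outcome matrix U.

    T maps each action a to its n x n transition matrix T[a] (columns give T(s'|s,a));
    O maps (a, o) to the diagonal n x n observation matrix O^{a,o}; core_tests is a
    list of n tests, each a list of (action, observation) pairs."""
    n = next(iter(T.values())).shape[0]

    def test_outcome_column(test):
        # Probability that `test` succeeds from each nominal state (one column of U).
        prod = np.eye(n)
        for (a, o) in test:
            prod = O[(a, o)] @ T[a] @ prod           # propagate one (a, o) step
        return np.ones(n) @ prod                     # sum over final states

    U = np.column_stack([test_outcome_column(t) for t in core_tests])
    U_inv = np.linalg.inv(U)                         # requires the tests to be core tests

    M, m = {}, {}
    for (a, o) in O:
        M[(a, o)] = U_inv @ T[a] @ O[(a, o)] @ U     # M_ao = U^-1 T^a O^{a,o} U
        m[(a, o)] = U_inv @ T[a] @ O[(a, o)] @ np.ones(n)
    return M, m, U
```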
There are a variety of theoretical results on the properties of Linear PSRs. Here, we
survey a few of the most important.
Expressivity. It was shown early that every POMDP can be equivalently ex-
pressed by a PSR using a constructive proof translating from POMDPs directly
to PSRs (Littman et al, 2002). This result was generalized by James (2005), who
states that “Finite PSRs are able to model all finite POMDPs, HMMs, MDPs, MCs,
history-window frameworks, diversity representations, interpretable OOMs and in-
terpretable IO-OOMs.” In fact, PSRs are strictly more expressive than POMDPs.
James (2005) showed that “there exist finite PSRs which cannot be modeled by
any finite POMDP, Hidden Markov Model, MDP, Markov chain, history-window,
diversity representation, interpretable OOM, or interpretable IO-OOM.”
Compactness. PSRs are just as compact as POMDPs. James (2005) states that
“Every system that can be modeled by a finite POMDP (including all HMMs,
MDPs, and MCs) can also be modeled by a finite PSR using number of core tests
less than or equal to the number of nominal-states in the POMDP” (Thm. 4.2). In
addition, there are strong bounds on the length of the core tests needed to capture
state. Wolfe (2009) showed that “For any system-dynamics matrix of rank n, there
exists some set T of core tests such that no t ∈ T has length greater than n.” (Thm.
2.4). Similar statements apply for core histories (Thm. 2.5).
Converting models between other frameworks and PSRs. While we have fo-
cused on defining PSRs via the systems dynamics vector and matrix, one can also
derive PSR models from other models. An algorithm for converting POMDPs into
equivalent PSRs was given by Littman et al (2002), and James presented an ad-
ditional algorithm to convert an interpretable Observable Operator Model (OOM)
to a PSR (James, 2005). Interestingly, there is no known method of recovering a
POMDP from a PSR (perhaps because there are multiple POMDPs which map to
the same PSR). If there were, then any learning algorithm for a PSR would become
a learning algorithm for a POMDP.
Numerous algorithms have been proposed to learn PSRs, but they all address two
key problems: the discovery problem and the learning problem (James and Singh,
2004). The discovery problem is defined as the problem of finding a set of core tests,
and is essentially the problem of discovering a state representation. The learning
problem is defined as the problem of finding the parameters of the model needed
to update state, and is essentially the problem of learning the dynamical aspect of
the system. In the case of linear PSRs, these are the weight vectors mao and maoqi for all of the
one-step tests and the one-step extensions.
The idea of linear sufficiency suggests procedures for discovering sufficient statis-
tics: a set of core tests corresponds to a set of linearly independent columns of
the system dynamics matrix, and so techniques from linear algebra can be brought
to bear on empirical estimates of portions of the system dynamics matrix. Exist-
ing discovery algorithms search for linearly independent columns, which is a chal-
lenging task because the columns of the matrix are estimated from data, and noisy
columns are often linearly independent (Jaeger, 2004). Thus, the numerical rank of
the matrix must be estimated using a statistical test based on the singular values
of the matrix. The entire procedure typically relies on repeated singular value de-
compositions of the matrix, which is costly. For example, James and Singh (2004)
learn a “history-test matrix,” which is the predecessor to the systems dynamics ma-
trix. Their algorithm repeatedly estimates larger and larger portions of the matrix,
searching for additional linearly independent columns as more data becomes available.
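A minimal sketch of the discovery step, under the assumption that a finite block of the system dynamics matrix has already been estimated, is given below: the numerical rank is read off the singular values, and columns are greedily accepted while they remain (numerically) linearly independent. Real discovery algorithms interleave this with further estimation and use statistical tests rather than a fixed threshold; the function and parameter names here are hypothetical.

```python
import numpy as np

def discover_core_tests(D_hat, tests, tol=1e-2):
    """Greedily pick columns of an estimated system-dynamics block that are
    numerically linearly independent; the associated tests are candidate core tests."""
    svals = np.linalg.svd(D_hat, compute_uv=False)   # singular values, descending
    rank = int(np.sum(svals > tol * svals[0]))       # numerical rank estimate

    core_idx = []
    core_cols = np.zeros((D_hat.shape[0], 0))
    for j in range(D_hat.shape[1]):
        candidate = np.hstack([core_cols, D_hat[:, [j]]])
        if np.linalg.matrix_rank(candidate, tol=tol * svals[0]) > core_cols.shape[1]:
            core_idx.append(j)
            core_cols = candidate
        if len(core_idx) == rank:
            break
    return [tests[j] for j in core_idx]
```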
Once the core tests have been found, the update parameters must be learned. Singh
et al (2003) presented the first algorithm for the learning problem, which assumes
that the core tests are given and uses a gradient algorithm to compute weights. The
more common approach is with regression and sample statistics (James and Singh,
2004): once a set of core tests is given, the update parameters can be solved by
regressing the appropriate entries of the estimated system dynamics matrix.
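The regression step itself is a few lines: each one-step test or one-step extension column of the estimated matrix is regressed onto the core-test columns, yielding the corresponding weight vector. The function below is a hypothetical sketch and ignores practical issues such as weighting rows by how often each history was actually visited.

```python
import numpy as np

def learn_update_weights(D_hat, core_cols, target_cols):
    """Least-squares estimates of linear-PSR weight vectors.

    D_hat       : H x T array of estimated predictions p(t_j | h_i)
    core_cols   : column indices of the core tests (their predictions form p(Q|h))
    target_cols : dict mapping a label, e.g. ('a','o') or ('a','o','q_i'), to the
                  column index of the corresponding one-step test or extension."""
    PQ = D_hat[:, core_cols]                         # design matrix: one row per history
    weights = {}
    for label, j in target_cols.items():
        y = D_hat[:, j]                              # predictions of this test from every history
        w, *_ = np.linalg.lstsq(PQ, y, rcond=None)
        weights[label] = w                           # m such that p(test|h) ~ m^T p(Q|h)
    return weights
```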
Some authors combine both problems into a single algorithm. For example,
Wiewiora (2005) presents a method for learning regular form PSRs with an iter-
ative extend-and-regress method, while McCracken and Bowling (2006) propose an
online discovery and learning algorithm based on gradient descent.
Not every set of matrices Mao defines a valid linear PSR. If an estimation algo-
rithm returns invalid parameters, PSRs can suffer from underflow (predicting neg-
ative probabilities) or overflow (predicting probabilities greater than unity). Precise
constraints on PSR parameters were identified by Wolfe (2010), who used the con-
straints to improve estimates of the parameters, and demonstrated that predictions
generated from the resulting model did not suffer from underflow or overflow.
Many learning and discovery algorithms involve estimating the system dynamics
matrix, typically using sample statistics. In systems with a reset action, the agent
may actively reset to the empty history in order to repeatedly sample entries (James
and Singh, 2004). In systems without a reset, most researchers use the suffix-history
algorithm (Wolfe et al, 2005) to generate samples: given a trajectory of the system,
we slice the trajectory into all possible histories and futures. Active exploration is
also possible, as proposed by Bowling et al (2006).
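The suffix-history idea can be sketched as follows (an illustrative simplification that ignores the corrections needed when the exploration policy is not uniform): every time index of a single long trajectory is treated as a split point, so one trajectory yields many (history, future) samples from which the entries of the system dynamics matrix can be counted.

```python
from collections import defaultdict

def suffix_history_counts(trajectory, max_hist_len, max_test_len):
    """Co-occurrence counts of (history, test) pairs in one trajectory.

    trajectory is a list of (action, observation) pairs.  Each time index t
    is treated as the end of a history and the start of a future.
    """
    pair_counts = defaultdict(int)   # (history, test) -> occurrences
    hist_counts = defaultdict(int)   # history -> occurrences
    T = len(trajectory)
    for t in range(T + 1):
        for hl in range(min(max_hist_len, t) + 1):
            history = tuple(trajectory[t - hl:t])
            hist_counts[history] += 1
            for tl in range(1, min(max_test_len, T - t) + 1):
                test = tuple(trajectory[t:t + tl])
                pair_counts[(history, test)] += 1
    return pair_counts, hist_counts

# A crude estimate of one matrix entry (ignoring edge effects near the end of
# the trajectory and action-conditioning corrections) would then be
# pair_counts[(h, q)] / hist_counts[h].
```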
There has been comparatively little work on planning in PSRs, partly because it is
generally believed among PSR researchers that any POMDP planning algorithm can
be directly applied to PSRs (although there is no formal proof of this).
The basic PSR model has been extended in numerous ways, mostly in efforts to
scale them to cope with larger domains. This is typically done by leveraging some
sort of structure in the domain, although it can also be achieved by adopting a more
powerful (i.e., nonlinear) state update method.
Memory PSRs. James et al (2005b) showed that memory and predictions can be
combined to yield smaller models than can be obtained solely with predictions.
Their Memory-PSR (or mPSR) “remembers” an observation (typically the most
recent one), and builds a separate PSR for the distribution of the future conditioned
on each unique memory. This can result in compact, factored models if the memories
form landmarks that allow the domain to be decomposed. James and Singh (2005a)
then showed that effective planning is possible with the resulting model.
Hierarchical PSRs. Wolfe (2009) considers temporal abstraction with hPSRs: in-
stead of interleaving actions and observations, he interleaves options (Precup et al,
1998) and observations. These options capture temporal structure in the domain
by compressing regularities in action-observation sequences, which allows scaling
to larger domains. For example, in a maze domain, long hallways or large rooms
may have a simple compressed representation. Wolfe analyzes the resulting “option-
level” dynamical system as a compression of the system dynamics matrix which
sums out rows and columns. Counter to intuition, the resulting matrix can have a
larger rank than the original system (although he provides conditions where the rank
of the option system is no more than the rank of the original system [Thm 4.5]).
Factored PSRs. While hPSRs leverage temporal structure in a domain, Factored
PSRs (Wolfe, 2009) leverage structure among the observation dimensions of a
dynamical system with vector-valued observations by exploiting conditional inde-
pendence between subsets of observation dimensions. This can be important for
domains with many mostly independent observations (the motivating example is a
traffic prediction problem where observations consist of information about hundreds
of cars on a freeway). An important feature of Factored PSRs is that the representa-
tion can remain factored even when making long-term predictions about the future
state of the system. This is in contrast to, say, a DBN, where predictions about latent
states or observations many time steps in the future will often be dependent upon all
of the current latent state variables. The intuition behind this is explained by Wolfe:
“Consider the i-th latent state variable [of a DBN] at the current time step. It will affect
some subset of latent state variables at the next time step; that subset will affect an-
other subset of variables at the following time step; and so on. Generally speaking, the
subset of variables that can be traced back to the i-th latent state variable at the current
time step will continue to grow as one looks further forward in time. Consequently,
in general, the belief state of a DBN—which tracks the joint distribution of the latent
state variables—is not factored, and has size exponential in the number of latent state
variables. In contrast, the factored PSR does not reference latent state, but only actions
and observations.”
Multi-Mode PSRs. The Multi-mode PSR model (Wolfe et al, 2008, 2010) is de-
signed to capture uncontrolled systems that switch between several modes of op-
eration. The MMPSR makes specialized predictions conditioned upon the current
mode, which can simplify the overall model. The MMPSR was inspired by the
problem of predicting cars’ movements on a highway, where cars typically oper-
ate in discrete modes of operation like making a lane change, or passing another
car. These modes are like options, but modes are not latent, unobservable quantities
(as opposed to, say, upper levels in a hierarchical HMM). Wolfe defines a recogniz-
ability condition to identify modes, which can be a function of both past and future
observations, but emphasizes that this does not limit the class of systems that can be
modeled; it merely impacts the set of modes available to help model the system.
Relational PSRs. Wingate et al (2007) showed how PSRs can be used to model
relational domains by grounding all predicates in predictions of the future. One
contribution of this work was the introduction of new kinds of tests: set tests (which
make predictions about sequences of actions and sets of observations) and indexical
tests (which make predictions about sequences of actions and observations where an
action/observation at time t is required to have the same—but unspecified—value as
an action/observation at a later time t + k, thus allowing a sort of “variable” in the
test). These tests permit a great deal of flexibility, but still retain all of the linearity
properties of linear PSRs, making them expressive and tractable.
Continuous PSRs. There has been less work on systems with continuous ac-
tions and observations. Wingate and Singh (2007b) present a generalization of PSRs
where tests predict the density of a sequence of continuous actions and observations.
This creates a problem: PSRs with discrete observations derive the notion of predic-
tive state as a sufficient statistic via the rank of the system dynamics matrix, but with
continuous observations and actions, such a matrix and its rank no longer exist. The
authors define an analogous construct for the continuous case, called the system
dynamics distributions, and use information theoretic notions to define a sufficient
statistic and thus state. They use kernel density estimation to learn a model, and
information gradients to directly optimize the state representation.
Other extensions. Tanner et al (2007) presented a method to learn high-level ab-
stract features from low-level state representations. On the state update side, Rudary
and Singh (2004) showed that linear models can be compacted when nonlinearly
sufficient statistics are allowed.
There are other models of dynamical systems which capture state through the use
of predictions about the future.
Observable Operator Models (OOMs) were introduced and studied by Jaeger (2000).
Like PSRs, OOMs come in several variants on the same basic theme, making them more of
a framework than a single model. Within the family of OOMs are models which
are designed to deal with different versions of dynamical systems: the basic OOM
models uncontrolled dynamical systems, while the IO-OOM models controlled dy-
namical systems. OOMs have several similarities to PSRs. For example, there are
analogous constructs to core tests (“characteristic events”), core histories (“indica-
tive events”) and the system dynamics matrix. State in an OOM is represented as
a vector of predictions about the future, but the predictions do not correspond to a
single test. Instead, each entry in the state vector is the prediction of some set of
tests of the same length k. There are constraints on these sets: they must be disjoint,
but their union must cover all tests of length k.
A significant restriction on IO-OOMs is that the action sequence used in tests
must be the same for all tests, which is needed to satisfy some assumptions about the
state vector. The assumption is severe enough that James (2005) gives an example
of a system which the IO-OOM cannot model, but which a PSR can.
There are also variants of OOMs which do not use predictions as part of their state
representation (called “uninterpretable OOMs”) but there are no learning algorithms
for these models (James, 2005). We refer the reader to the technical report by Jaeger
(2004) for a detailed comparison of PSRs and OOMs.
The Predictive Linear-Gaussian model (PLG) (Rudary et al, 2005) is a predictively
defined model of linear dynamical systems: state is captured by statistics of a window
of future observations and is updated with equations analogous to the Riccati
equations. Like PSRs, the PLG has learn-
ing algorithms that are based on sample statistics and regressions. An important
learnability result was obtained by Rudary et al (2005), who showed a statistically
consistent parameter estimation algorithm, which is an important contrast to typical
LDS learning based on methods such as EM (Ghahramani and Hinton, 1996).
Nonlinear dynamics have been considered in two different ways. By applying the
kernel trick, linear dynamics can be represented in a nonlinear feature space. This
results in the Kernel PLG (Wingate and Singh, 2006a). A more robust and learn-
able method seems to be assuming that dynamics are piecewise linear, resulting in
the “Mixtures of PLGs” model (Wingate and Singh, 2006b). In many ways, these
extensions parallel the development of the Kalman filter: for example, state in the
KPLG can be updated with an efficient approximate inference algorithm based on
sigma-point approximations (yielding an algorithm related to the unscented Kalman
filter (Wan and van der Merwe, 2000)).
The diversity automaton of Rivest and Schapire (1987) is a model based on predic-
tions about the future, although with some severe restrictions. Like the PSR model,
diversity models represent state as a vector of predictions about the future. How-
ever, these predictions are not as flexible as the usual tests used by PSRs, but rather
are limited to be like the e-tests used by Rudary and Singh (2004). Each test ti is
the probability that a certain observation will occur in ni steps, given a string of ni
actions but not given any observations between time t + 1 and t + ni . Each of these
tests corresponds to an equivalence class over the distribution of future observations.
Rivest and Schapire (1987) showed tight bounds on the number of tests needed
by a diversity model relative to the number of states a minimal POMDP would
need. Diversity models can either compress or inflate a system: in the best case, a
logarithmic number of tests are needed, but in the worst case, an exponential num-
ber are needed. This contrasts with PSRs, where only n tests are needed to model any domain
modeled by an n-state POMDP. Diversity models are also limited to systems with
deterministic transitions and deterministic observations. This is due to the model’s
state update mechanism and the need to restrict the model to a finite number of tests
by restricting it to a finite number of equivalence classes of future distributions.
The exponential family PSR, or EFPSR, is a general model unifying many other
models with predictively defined state (Wingate and Singh, 2007a). It was moti-
vated by the observation that these models track the sufficient statistics of an expo-
nential family distribution over the short-term future. For example, the PLG uses
the parameters of a Gaussian, while the PSR uses the parameters of a multino-
mial. An exponential family distribution is any distribution which can be written
as p(future | history) ∝ exp( w_t^T φ(future) ). Suitable choices of φ(future) and an
update mechanism for w_t recover existing models.
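For a small discrete system the defining equation can be written out directly. The sketch below is purely illustrative: it uses a one-hot feature vector φ over complete length-k observation windows and normalizes by brute-force enumeration, which is the simplest (and least compact) instance of the idea.

```python
import itertools
import numpy as np

def efpsr_future_distribution(w, observations, k):
    """Distribution over length-k futures: p(f | history) proportional to exp(w^T phi(f)).

    Here phi(f) is a one-hot indicator of the whole window f, the simplest
    (and least compact) feature choice; real EFPSR variants use much sparser
    features of the window.
    """
    futures = list(itertools.product(observations, repeat=k))

    def phi(f):
        v = np.zeros(len(futures))
        v[futures.index(f)] = 1.0
        return v

    logits = np.array([w @ phi(f) for f in futures])
    probs = np.exp(logits - logits.max())   # subtract max for numerical stability
    probs /= probs.sum()
    return dict(zip(futures, probs))

# Two observations and a window of length 2 give four possible futures.
w = np.array([0.5, 0.0, 0.0, -0.5])
print(efpsr_future_distribution(w, observations=('a', 'b'), k=2))
```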
This leads to the idea of placing a general exponential family distribution over
the short-term future observations, parameterized with features φ of the future.
This generalization has been used to predict and analyze new models, including the
Linear-Linear EFPSR (Wingate and Singh, 2008) which is designed to model do-
mains with large numbers of features, and has also resulted in an information-form
of the PLG (Wingate, 2008). This results in strong connections between graphical
models, maximum entropy modeling, and PSRs.
From a purist’s point of view, it is questionable whether the EFPSR is really a
PSR: while the representation is defined in terms of the short-term distribution over
the future, the parameters of the model are less directly grounded in data: they are
not verifiable in the same way that an expected value of a future event is. The model
therefore sits somewhere in between latent state space models and PSRs.
Transformed PSRs (Rosencrantz et al, 2004) also learn a model by estimating the
system dynamics matrix. However, instead of searching for linearly independent
columns (i.e., core tests), they perform a singular value decomposition on the sys-
tem dynamics matrix to recover a low-dimensional representation of state, which
can be interpreted as a linear combination of core tests. The resulting state is a ro-
tated PSR, but elements of the state vector cannot be interpreted as predictions of
tests. This work was subsequently generalized by Boots et al (2010) with an alternative
learning algorithm and the ability to deal with continuous observations. The overall
result is one of the first learning algorithms that is statistically consistent and which
empirically works well enough to support planning in complex domains.
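A rough sketch of the transformed-PSR idea (a simplification of the actual estimators in Rosencrantz et al (2004) and Boots et al (2010), which use additional matrices and careful conditioning): an SVD of the estimated history-by-test matrix gives a low-dimensional subspace, and each history's row is projected onto it to obtain a transformed predictive state.

```python
import numpy as np

def tpsr_states(D_hat, dim):
    """Low-dimensional predictive states from an estimated (histories x tests) matrix.

    Projects each history row onto the top `dim` right singular vectors; each
    resulting state is a linear combination of test predictions rather than a
    vector of predictions itself.
    """
    U, s, Vt = np.linalg.svd(D_hat, full_matrices=False)
    V = Vt[:dim].T            # (tests x dim) projection
    return D_hat @ V, V       # (histories x dim) states and the projection

rng = np.random.default_rng(1)
D_hat = rng.random((50, 3)) @ rng.random((3, 20)) + 0.01 * rng.standard_normal((50, 20))
states, V = tpsr_states(D_hat, dim=3)
print(states.shape)           # (50, 3)
```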
13.7 Conclusion
Models with predictively defined representations of state are still relatively young,
but some theoretical and empirical results have emerged. What broad conclusions
can we draw from these results? What might their future hold?
Results on compactness, expressivity and learnability all suggest that the idea
is viable. We have seen, for example, that there are strong connections between
PSRs and POMDPs: PSRs are at least as compact as POMDPs and have compara-
ble computational complexity, and are strictly more expressive. Like other models,
PSRs can leverage structure in domains, as demonstrated by extensions to factored,
hierarchical, switching, kernelized, and continuous domains. Of course, the real
promise of the idea is learnability, and we have seen a variety of functional learning
algorithms—along with some tantalizing first glimpses of optimal learnability in the
statistical consistency results of PLGs and TPSRs.
Predictive representations are also surprisingly flexible. Different authors have
captured state by using probabilities of specific tests (the PSR model), densities of
specific tests (the Continuous PSR model), expectations of features of the short-term
future (the PLG family of models), the natural parameters of a distribution over a
window of short-term future observations (the EFPSR), and a wide variety of tests
(including set tests, indexical tests, and option tests). It is likely that there are many
other possibilities which are as yet unexplored.
Just as important is the absence of serious negative results. As researchers have
explored this space, there has always been the question: is there anything a model
with a predictively defined representation of state cannot do? Would this represen-
tation be a limitation? So far, the answer is no: there is no known representational,
computational, theoretical or empirical limitation that has resulted from adopting a
predictively defined representation of state, and indeed, there are still good reasons
to believe they have special properties which uniquely suit them to a variety of tasks.
Has their full depth been plumbed? It seems unlikely. But perhaps one of their
most significant contributions has already been made: to broaden our thinking by
challenging us to reconsider what it means to learn and represent state.
References
Aberdeen, D., Buffet, O., Thomas, O.: Policy-gradients for PSRs and POMDPs. In: International
Workshop on Artificial Intelligence and Statistics, AISTAT (2007)
Astrom, K.J.: Optimal control of Markov processes with incomplete state information. Journal
of Mathematical Analysis and Applications 10, 174–205 (1965)
Boots, B., Siddiqi, S., Gordon, G.: Closing the learning-planning loop with predictive state
representations. In: Proceedings of Robotics: Science and Systems VI, RSS (2010)
Boularias, A., Chaib-draa, B.: Predictive representations for policy gradient in POMDPs. In:
International Conference on Machine Learning, ICML (2009)
Bowling, M., McCracken, P., James, M., Neufeld, J., Wilkinson, D.: Learning predictive state
representations using non-blind policies. In: International Conference on Machine Learn-
ing (ICML), pp. 129–136 (2006)
Ghahramani, Z., Hinton, G.E.: Parameter estimation for linear dynamical systems. Tech. Rep.
CRG-TR-96-2, Dept. of Computer Science, U. of Toronto (1996)
Izadi, M., Precup, D.: Model minimization by linear PSR. In: International Joint Conference
on Artificial Intelligence (IJCAI), pp. 1749–1750 (2005)
Izadi, M.T., Precup, D.: Point-Based Planning for Predictive State Representations. In:
Bergler, S. (ed.) Canadian AI. LNCS (LNAI), vol. 5032, pp. 126–137. Springer,
Heidelberg (2008)
Jaeger, H.: Observable operator processes and conditioned continuation representations.
Neural Computation 12(6), 1371–1398 (2000)
Jaeger, H.: Discrete-time, discrete-valued observable operator models: A tutorial. Tech. rep.,
International University Bremen (2004)
James, M., Singh, S., Littman, M.: Planning with predictive state representations. In: Interna-
tional Conference on Machine Learning and Applications (ICMLA), pp. 304–311 (2004)
James, M., Wessling, T., Vlassis, N.: Improving approximate value iteration using memories
and predictive state representations. In: Proceedings of AAAI (2006)
James, M.R.: Using predictions for planning and modeling in stochastic environments. PhD
thesis, University of Michigan (2005)
James, M.R., Singh, S.: Learning and discovery of predictive state representations in dynam-
ical systems with reset. In: International Conference on Machine Learning (ICML), pp.
417–424 (2004)
James, M.R., Singh, S.: Planning in models that combine memory with predictive represen-
tations of state. In: National Conference on Artificial Intelligence (AAAI), pp. 987–992
(2005a)
James, M.R., Wolfe, B., Singh, S.: Combining memory and landmarks with predictive state
representations. In: International Joint Conference on Artificial Intelligence (IJCAI), pp.
734–739 (2005b)
Kalman, R.E.: A new approach to linear filtering and prediction problems. Transactions of the
ASME—Journal of Basic Engineering 82(Series D), 35–45 (1960)
Lagoudakis, M.G., Parr, R.: Least-squares policy iteration. Journal of Machine Learning
Research (JMLR) 4, 1107–1149 (2003)
Littman, M.L., Sutton, R.S., Singh, S.: Predictive representations of state. In: Neural Infor-
mation Processing Systems (NIPS), pp. 1555–1561 (2002)
McCracken, P., Bowling, M.: Online discovery and learning of predictive state representa-
tions. In: Neural Information Processings Systems (NIPS), pp. 875–882 (2006)
Nikovski, D.: State-aggregation algorithms for learning probabilistic models for robot con-
trol. PhD thesis, Carnegie Mellon University (2002)
Peters, J., Vijayakumar, S., Schaal, S.: Natural actor-critic. In: European Conference on
Machine Learning (ECML), pp. 280–291 (2005)
Precup, D., Sutton, R.S., Singh, S.: Theoretical results on reinforcement learning with tem-
porally abstract options. In: European Conference on Machine Learning (ECML), pp.
382–393 (1998)
Rafols, E.J., Ring, M.B., Sutton, R.S., Tanner, B.: Using predictive representations to improve
generalization in reinforcement learning. In: International Joint Conference on Artificial
Intelligence (IJCAI), pp. 835–840 (2005)
Rivest, R.L., Schapire, R.E.: Diversity-based inference of finite automata. In: IEEE Sympo-
sium on the Foundations of Computer Science, pp. 78–87 (1987)
Rosencrantz, M., Gordon, G., Thrun, S.: Learning low dimensional predictive representa-
tions. In: International Conference on Machine Learning (ICML), pp. 695–702 (2004)
Rudary, M., Singh, S.: Predictive linear-Gaussian models of stochastic dynamical systems
with vector-value actions and observations. In: Proceedings of the Tenth International
Symposium on Artificial Intelligence and Mathematics, ISAIM (2008)
Rudary, M.R., Singh, S.: A nonlinear predictive state representation. In: Neural Information
Processing Systems (NIPS), pp. 855–862 (2004)
Rudary, M.R., Singh, S.: Predictive linear-Gaussian models of controlled stochastic dynam-
ical systems. In: International Conference on Machine Learning (ICML), pp. 777–784
(2006)
Rudary, M.R., Singh, S., Wingate, D.: Predictive linear-Gaussian models of stochastic dy-
namical systems. In: Uncertainty in Artificial Intelligence, pp. 501–508 (2005)
Shatkay, H., Kaelbling, L.P.: Learning geometrically-constrained hidden Markov models
for robot navigation: Bridging the geometrical-topological gap. Journal of AI Research
(JAIR), 167–207 (2002)
Singh, S., Littman, M., Jong, N., Pardoe, D., Stone, P.: Learning predictive state representa-
tions. In: International Conference on Machine Learning (ICML), pp. 712–719 (2003)
Singh, S., James, M.R., Rudary, M.R.: Predictive state representations: A new theory for
modeling dynamical systems. In: Uncertainty in Artificial Intelligence (UAI), pp. 512–
519 (2004)
Sutton, R.S., Tanner, B.: Temporal-difference networks. In: Neural Information Processing
Systems (NIPS), pp. 1377–1384 (2005)
Tanner, B., Sutton, R.: TD(λ) networks: Temporal difference networks with eligibility
traces. In: International Conference on Machine Learning (ICML), pp. 888–895 (2005a)
Tanner, B., Sutton, R.: Temporal difference networks with history. In: International Joint
Conference on Artificial Intelligence (IJCAI), pp. 865–870 (2005b)
Tanner, B., Bulitko, V., Koop, A., Paduraru, C.: Grounding abstractions in predictive state
representations. In: International Joint Conference on Artificial Intelligence (IJCAI), pp.
1077–1082 (2007)
Wan, E.A., van der Merwe, R.: The unscented Kalman filter for nonlinear estimation. In:
Proceedings of Symposium 2000 on Adaptive Systems for Signal Processing, Communi-
cation and Control (2000)
Wiewiora, E.: Learning predictive representations from a history. In: International Conference
on Machine Learning (ICML), pp. 964–971 (2005)
Wingate, D.: Exponential family predictive representations of state. PhD thesis, University of
Michigan (2008)
Wingate, D., Singh, S.: Kernel predictive linear Gaussian models for nonlinear stochastic
dynamical systems. In: International Conference on Machine Learning (ICML), pp. 1017–
1024 (2006a)
Wingate, D., Singh, S.: Mixtures of predictive linear Gaussian models for nonlinear stochastic
dynamical systems. In: National Conference on Artificial Intelligence (AAAI) (2006b)
Wingate, D., Singh, S.: Exponential family predictive representations of state. In: Neural
Information Processing Systems, NIPS (2007a) (to appear)
Wingate, D., Singh, S.: On discovery and learning of models with predictive representations
of state for agents with continuous actions and observations. In: International Conference
on Autonomous Agents and Multiagent Systems (AAMAS), pp. 1128–1135 (2007b)
Wingate, D., Singh, S.: Efficiently learning linear-linear exponential family predictive repre-
sentations of state. In: International Conference on Machine Learning, ICML (2008)
Wingate, D., Soni, V., Wolfe, B., Singh, S.: Relational knowledge with predictive represen-
tations of state. In: International Joint Conference on Artificial Intelligence (IJCAI), pp.
2035–2040 (2007)
Wolfe, B.: Modeling dynamical systems with structured predictive state representations. PhD
thesis, University of Michigan (2009)
Wolfe, B.: Valid parameters for predictive state representations. In: Eleventh International
Symposium on Artificial Intelligence and Mathematics (ISAIM) (2010)
Wolfe, B., James, M.R., Singh, S.: Learning predictive state representations in dynamical
systems without reset. In: International Conference on Machine Learning, pp. 980–987
(2005)
Wolfe, B., James, M., Singh, S.: Approximate predictive state representations. In: Proceedings
of the 2008 International Conference on Autonomous Agents and Multiagent Systems
(AAMAS) (2008)
Wolfe, B., James, M., Singh, S.: Modeling multiple-mode systems with predictive state rep-
resentations. In: Proceedings of the 13th International IEEE Conference on Intelligent
Transportation Systems (2010)
Chapter 14
Game Theory and Multi-agent Reinforcement Learning
A. Nowé, P. Vrancx, and Y.-M. De Hauwere
14.1 Introduction
The reinforcement learning techniques studied throughout this book enable a single
agent to learn optimal behavior through trial-and-error interactions with its environ-
ment. Various RL techniques have been developed which allow an agent to optimize
its behavior in a wide range of circumstances. However, when multiple learners si-
multaneously apply reinforcement learning in a shared environment, the traditional
approaches often fail.
In the multi-agent setting, the assumptions that are needed to guarantee conver-
gence are often violated. Even in the most basic case where agents share a stationary
environment and need to learn a strategy for a single state, many new complexities
arise. When agent objectives are aligned and all agents try to maximize the same re-
ward signal, coordination is still required to reach the global optimum. When agents
have opposing goals, a clear optimal solution may no longer exist. In this case, an
equilibrium between agent strategies is usually searched for. In such an equilibrium,
no agent can improve its payoff when the other agents keep their actions fixed.
When, in addition to multiple agents, we assume a dynamic environment which
requires multiple sequential decisions, the problem becomes even more complex.
Now agents do not only have to coordinate, they also have to take into account the
current state of their environment. This problem is further complicated by the fact
that agents typically have only limited information about the system. In general,
they may not be able to observe actions or rewards of other agents, even though
these actions have a direct impact on their own rewards and their environment. In
the most challenging case, an agent may not even be aware of the presence of other
agents, making the environment seem non-stationary. In other cases, the agents have
access to all this information, but learning in a fully joint state-action space is in
general impractical, both due to the computational complexity and in terms of the
coordination required between the agents. In order to develop a successful multi-
agent approach, all these issues need to be addressed. Figure 14.1 depicts a standard
model of Multi-Agent Reinforcement Learning.
Despite the added learning complexity, a real need for multi-agent systems ex-
ists. Often systems are inherently decentralized, and a central, single agent learning
approach is not feasible. This situation may arise because data or control is physi-
cally distributed, because multiple, possibly conflicting, objectives should be met, or
simply because a single centralized controller requires too many resources. Examples
of such systems are multi-robot set-ups, decentralized network routing, distributed
load-balancing, electronic auctions, traffic control and many others.
The need for adaptive multi-agent systems, combined with the complexities of
dealing with interacting learners has led to the development of a multi-agent rein-
forcement learning field, which is built on two basic pillars: the reinforcement learn-
ing research performed within AI, and the interdisciplinary work on game theory.
While early game theory focused on purely competitive games, it has since devel-
oped into a general framework for analyzing strategic interactions. It has attracted
interest from fields as diverse as psychology, economics and biology. With the ad-
vent of multi-agent systems, it has also gained importance within the AI community
and computer science in general. In this chapter we discuss how game theory pro-
vides both a means to describe the problem setting for multi-agent learning and the
tools to analyze the outcome of learning.
Fig. 14.1 A standard model of multi-agent reinforcement learning: each agent k receives the joint state s_t and an individual reward r_k from the environment and selects an action a_k ; the individual actions together form the joint action a_t that is applied to the shared environment.
Table 14.1 classifies current multi-agent RL approaches according to their
applicability and the kind of information they use while learning in a multi-agent sys-
tem. We distinguish between techniques for stateless games, which focus on dealing
with multi-agent interactions while assuming that the environment is stationary, and
Markov game techniques, which deal with both multi-agent interactions and a dy-
namic environment. Furthermore, we also show the information used by the agents
for learning. Independent learners learn based only on their own reward observation,
while joint action learners also use observations of actions and possibly rewards of
the other agents.
Table 14.1 Overview of current MARL approaches. Algorithms are classified by their ap-
plicability (common interest or general Markov games) and their information requirement
(scalar feedback or joint-action information).
Game setting          Stateless Games        Team Markov Games     General Markov Games
Information
Requirement           Stateless Q-learning   Policy Search         MG-ILA
In the following section we will describe the repeated games framework. This
setting introduces many of the complexities that arise from interactions between
learning agents. However, the repeated game setting only considers static, stateless
environments, where the learning challenges stem only from the interactions with
other agents. In Section 14.3 we introduce Markov Games. This framework gen-
eralizes the Markov Decision Process (MDP) setting usually employed for single
agent RL. It considers both interactions between agents and a dynamic environment.
We explain both value iteration and policy iteration approaches for solving these
Markov games. Section 14.4 describes the current state of the art in multi-agent re-
search, which takes the middle ground between independent learning techniques and
Markov game techniques operating in the full joint-state joint-action space. Finally
in Section 14.5, we briefly describe other interesting background material.
The central idea of game theory is to model strategic interactions as a game between
a set of players. A game is a mathematical object, which describes the consequences
of interactions between player strategies in terms of individual payoffs. Different
representations for a game are possible. For example, traditional AI research often
focusses on extensive form games, which represent situa-
tions where players take turns to perform an action. This representation is used, for
instance, with the classical minimax algorithm (Russell and Norvig, 2003). In this
chapter, however, we will focus on the so called normal form games, in which game
players simultaneously select an individual action to perform. This setting is often
used as a testbed for multi-agent learning approaches. Below we review the basic
game theoretic terminology and define some common solution concepts in games.
Note that under mixed strategies the expected payoffs are linear in the player strategies, i.e. the expected reward
for player k for a strategy profile σ is given by:
R_k(σ) = ∑_{a ∈ A} ∏_{j=1}^{n} σ_j(a_j) R_k(a)
Table 14.2 Examples of 2-player, 2-action games. From left to right: (a) Matching pennies,
a purely competitive (zero-sum) game. (b) The prisoner’s dilemma, a general sum game. (c)
The coordination game, a common interest (identical payoff) game. (d) Battle of the sexes,
a coordination game where agents have different preferences. Pure Nash equilibria are indi-
cated in bold.
(a) Matching pennies
        a1       a2
a1    (1,-1)   (-1,1)
a2    (-1,1)   (1,-1)

(b) Prisoner's dilemma
        a1       a2
a1    (5,5)    (0,10)
a2    (10,0)   (1,1)

(c) Coordination game
        a1       a2
a1    (5,5)    (0,0)
a2    (0,0)    (10,10)

(d) Battle of the sexes
        a1       a2
a1    (2,1)    (0,0)
a2    (0,0)    (1,2)
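To make the expected-payoff formula above concrete, the following sketch (illustrative only) evaluates R_k(σ) for a two-player game given by its payoff matrices, using the matching pennies game of Table 14.2(a) under the uniformly mixed strategy profile.

```python
import numpy as np

def expected_payoffs(R1, R2, sigma1, sigma2):
    """Expected payoffs under mixed strategies; the two-player case of
    R_k(sigma) = sum_a prod_j sigma_j(a_j) R_k(a)."""
    joint = np.outer(sigma1, sigma2)   # probability of each joint action
    return float(np.sum(joint * R1)), float(np.sum(joint * R2))

# Matching pennies (Table 14.2(a)) under the uniformly mixed profile.
R1 = np.array([[ 1, -1],
               [-1,  1]])
R2 = -R1                               # zero-sum game
print(expected_payoffs(R1, R2, [0.5, 0.5], [0.5, 0.5]))   # (0.0, 0.0)
```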
Examples of these game types can be seen in Table 14.2. The first game in this
table, named matching pennies, is an example of a strictly competitive game. This
game describes a situation where the two players must each, individually, select one
side of a coin to show (i.e. Heads or Tails). When both players show the same side,
player one wins and is paid 1 unit by player 2. When the coins do not match, player
2 wins and receives 1 unit from player 1. Since both players are betting against each
other, one player’s win automatically translates in the other player’s loss, therefore
this is a zero-sum game.
The second game in Table 14.2, called the prisoner’s dilemma, is a general sum
game. In this game, 2 criminals have been apprehended by the police for commit-
ting a crime. They both have 2 possible actions: cooperate with each other and deny
the crime (action a1), or defect and betray the other, implicating him in the crime
(action a2). If both cooperate and deny the crime, the police have insufficient evi-
dence and they get a minimal sentence, which translates to a payoff of 5 for both. If
one player cooperates, but the other one defects, the cooperator takes all the blame
(payoff 0), while the defector escapes punishment (payoff 10). Should both play-
ers defect, however, they both receive a large sentence (payoff 1). The issue in this
game is that the cooperate action is strictly dominated by the defect action: no mat-
ter what action the other player chooses, to defect always gives the highest payoff.
This automatically leads to the (defect, defect) outcome, despite the fact that both
players could simultaneously do better by both playing cooperate.
The third game in Table 14.2 is a common interest game. In this case both players
receive the same payoff for each joint action. The challenge in this game is for the
players to coordinate on the optimal joint action. Selecting the wrong joint action
gives a suboptimal payoff and failing to coordinate results in a 0 payoff.
The fourth game, Battle of the sexes, is another example of a coordination game.
Here however, the players get individual rewards and prefer different outcomes.
Agent 1 prefers (a1,a1), whereas agent 2 prefers (a2,a2). In addition to the coor-
dination problem, the players now also have to agree on which of the preferred
outcomes to settle on.
Of course games are not restricted to only two actions but can have any number
of actions. In Table 14.3 we show some 3-action common interest games. In the
first, the climbing game from (Claus and Boutilier, 1998), the Nash equilibria are
surrounded by heavy penalties. In the second game, the penalties are left as a param-
eter k < 0. The smaller k, the more difficult it becomes to agree through learning
on one of the preferred solutions ((a1,a1) or (a3,a3)). (The dynamics of these games us-
ing a value-iteration approach are analyzed in (Claus and Boutilier, 1998), see also
Section 14.2.2).
Table 14.3 Examples of 2-player, 3-action games. From left to right: (a) Climbing game (b)
The penalty game, where k ≤ 0. Both games are of the common interest type. Pure Nash
equilibria are indicated in bold.
(a) Climbing game
         a1          a2         a3
a1    (11,11)    (-30,-30)    (0,0)
a2    (-30,-30)  (7,7)        (6,6)
a3    (0,0)      (0,0)        (5,5)

(b) Penalty game
         a1        a2        a3
a1    (10,10)    (0,0)     (k,k)
a2    (0,0)      (2,2)     (0,0)
a3    (k,k)      (0,0)     (10,10)
Since players in a game have individual reward functions which are dependent on
the actions of other players, defining the desired outcome of a game is often not
clearcut. One cannot simply expect participants to maximize their payoffs, as it may
not be possible for all players to achieve this goal at the same time. See for example
the Battle of the sexes game in Table 14.2(d).
Definition 14.2. Let σ = (σ1 , . . . ,σn ) be a strategy profile and let σ −k denote the
same strategy profile but without the strategy σk of player k. A strategy σk∗ ∈ μ (Ak )
is then called a best response for player k, if the following holds:

R_k(σ_{-k} ∪ σ_k^∗) ≥ R_k(σ_{-k} ∪ σ_k)   ∀ σ_k ∈ μ(A_k),

where σ_{-k} ∪ σ_k denotes the strategy profile where all agents play the same strategy
as they play in σ except agent k, who plays σ_k , i.e. (σ_1 , . . . , σ_{k−1} , σ_k , σ_{k+1} , . . . , σ_n ).
A central solution concept in games is the Nash equilibrium (NE). In a Nash equi-
librium, the players all play mutual best replies, meaning that each player uses a best
response to the current strategies of the other players. Nash (Nash, 1950) proved that
every normal form game has at least one Nash equilibrium, possibly in mixed strate-
gies. Based on the concept of best response we can define a Nash equilibrium as
a strategy profile σ = (σ_1 , . . . , σ_n ) in which, for every player k, the strategy σ_k is a
best response to the strategies of the other players.
Thus, when playing a Nash equilibrium, no player in the game can improve his
payoff by unilaterally deviating from the equilibrium strategy profile. As such no
player has an incentive to change his strategy, and multiple players have to change
their strategy simultaneously in order to escape the Nash equilibrium.
In common interest games such as the coordination game in Table 14.2(c), the Nash
equilibrium corresponds to a local optimum for all players, but it does not necessar-
ily correspond to the global optimum. This can clearly be seen in the coordination
game, where we have 2 Nash equilibria: the play (a1,a1) which gives both players
a reward of 5 and the global optimum (a2,a2) which results in a payoff of 10.
The prisoner’s dilemma game in Table 14.2 shows that a Nash equilibrium does
not necessarily correspond to the most desirable outcome for all agents. In the
unique Nash equilibrium both players play the ’defect’ action, despite the fact
that both would receive a higher payoff if both were cooperating. The cooperative outcome is not
a Nash equilibrium, however, as in this case both players can improve their payoff
by switching to the ’defect’ action.
The first game, matching pennies, does not have a pure strategy Nash equilib-
rium, as no pure strategy is a best response to another pure best response. Instead
the Nash equilibrium for this game is for both players to choose both sides with
equal probability. That is, the Nash strategy profile is ((1/2,1/2),(1/2,1/2)).
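The best-response and Nash-equilibrium definitions can be checked mechanically for small games. The sketch below (not part of the chapter) enumerates the pure Nash equilibria of a bimatrix game by testing, for every joint action, whether each player's action is a best response to the other's.

```python
import numpy as np

def pure_nash_equilibria(R1, R2):
    """Return all pure-strategy Nash equilibria (i, j) of a bimatrix game."""
    equilibria = []
    for i in range(R1.shape[0]):
        for j in range(R1.shape[1]):
            best_for_1 = R1[i, j] >= R1[:, j].max()   # i is a best response to j
            best_for_2 = R2[i, j] >= R2[i, :].max()   # j is a best response to i
            if best_for_1 and best_for_2:
                equilibria.append((i, j))
    return equilibria

# Prisoner's dilemma (Table 14.2(b)): only (defect, defect) survives.
R1 = np.array([[5, 0], [10, 1]])
R2 = np.array([[5, 10], [0, 1]])
print(pure_nash_equilibria(R1, R2))   # [(1, 1)]

# Matching pennies (Table 14.2(a)): no pure-strategy equilibrium.
R1 = np.array([[1, -1], [-1, 1]])
print(pure_nash_equilibria(R1, -R1))  # []
```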
The games described above are often used as test cases for multi-agent reinforce-
ment learning techniques. Unlike in the game theoretical setting, agents are not as-
sumed to have full access to the payoff matrix. In the reinforcement learning setting,
agents are taken to be players in a normal form game, which is played repeatedly,
allowing them to improve their strategies over time.
It should be noted that these repeated games do not yet capture the full multi-
agent reinforcement learning problem. In a repeated game all changes in the ex-
pected reward are due to changes in strategy by the players. There is no changing
environment state or state transition function external to the agents. Therefore, re-
peated games are sometimes also referred to as stateless games. Despite this lim-
itation, we will see further in this section that these games can already provide a
challenging problem for independent learning agents, and are well suited to test
coordination approaches. In the next section, we will address the Markov game
framework which does include a dynamic environment.
A number of different considerations have to be made when dealing with rein-
forcement learning in games. As is common in RL research, but contrary to tradi-
tional economic game theory literature, we assume that the game being played is
initially unknown to the agents, i.e. agents do not have access to the reward function
and do not know the expected reward that will result from playing a certain (joint)
action. However, RL techniques can still differ with respect to the observations the
agents make. Moreover, we also assume that the game payoffs can be stochastic,
meaning that a joint action does not always result in the same deterministic reward
for each agent. Therefore, actions have to be sampled repeatedly.
Since expected rewards depend on the strategy of all agents, many multi-agent
RL approaches assume that the learner can observe the actions and/or rewards of
all participants in the game. This allows the agent to model its opponents and to
explicitly learn estimates over joint actions. It could be argued however, that this
assumption is unrealistic, as in multi-agent systems which are physically distributed
this information may not be readily available. In this case the RL techniques must
be able to deal with the non-stationary rewards caused by the influence of the other
agents. As such, when developing a multi-agent reinforcement learning application
it is important to consider the information available in a particular setting in order
to match this setting with an appropriate technique.
Since it is in general impossible for all players in a game to maximize their pay-
off simultaneously, most RL methods attempt to achieve Nash equilibrium play.
However, a number of criticisms can be made of the Nash equilibrium as a solu-
tion concept for learning methods. The first issue is that Nash equilibria need not
be unique, which leads to an equilibrium selection problem. In general, multiple
Nash equilibria can exist for a single game. These equilibria can also differ in the
payoff they give to the players. This means that a method learning Nash equilibria
cannot guarantee a unique outcome or even a unique payoff for the players. This
can be seen in the coordination game of Table 14.2(c), where 2 Nash equilibria ex-
ist, one giving payoff 5 to the agents, and the other giving payoff 10. The game in
Table 14.3(b) also has multiple NE, with (a1,a1) and (a3,a3) being the 2 optimal
ones. This results in a coordination problem for learning agents, as both these NE
have the same quality.
Furthermore, since the players can have different expected payoffs even in an
equilibrium play, the different players may also prefer different equilibrium out-
comes, which means that care should be taken to make sure the players coordinate
on a single equilibrium. This situation can be observed in the Battle of the sexes
game in Table 14.2(d), where 2 pure Nash equilibria exist, but each player prefers a
different equilibrium outcome.
Another criticism is that a Nash Equilibrium does not guarantee optimality.
While playing a Nash equilibrium assures that no single player can improve his
payoff by unilaterally changing its strategy, it does not guarantee that the players
globally maximize their payoffs, or even that no play exists in which the players si-
multaneously do better. It is possible for a game to have non-Nash outcomes, which
nonetheless result in a higher payoff to all agents than they would receive for play-
ing a Nash equilibrium. This can be seen for example in the prisoner’s dilemma in
Table 14.2(b).
While often used as the main goal of learning, Nash equilibria are not the only
possible solution concept in game theory. In part due to the criticisms mentioned
above, a number of alternative solution concepts for games have been developed.
These alternatives include a range of other equilibrium concepts, such as the Cor-
related Equilibrium (CE) (Aumann, 1974), which generalizes the Nash equilibrium
concept, or the Evolutionary Stable Strategy (ESS) (Smith, 1982), which refines the
Nash equilibrium. Each of these equilibrium outcomes has its own applications and
(dis)advantages. Which solution concept to use depends on the problem at hand, and
the objective of the learning algorithm. A complete discussion of possible equilib-
rium concepts is beyond the scope of this chapter. We focus on the Nash equilibrium
and briefly mention regret minimization as these are the approaches most frequently
observed in the multi-agent learning literature. A more complete discussion of so-
lution concepts can be found in many textbooks, e.g. (Leyton-Brown and Shoham,
2008).
Before continuing, we mention one more evaluation criterion, which is regularly
used in repeated games: the notion of regret. Regret is the difference between the
payoff an agent realized and the maximum payoff the agent could have obtained
using some fixed strategy. Often the fixed strategies that one compares the agent
performance to, are simply the pure strategies of the agent. In this case, the total
regret of the agent is the accumulated difference between the obtained reward and
the reward the agent would have received for playing some fixed action. For an agent
k, given the history of play at time T , this is defined as:
R_T = max_{a ∈ A_k} ∑_{t=1}^{T} ( R_k(a_{-k}(t) ∪ {a}) − R_k(a(t)) ),        (14.1)
where a(t) denotes the joint action played at time t and a−k (t) ∪ {a} denotes the
same joint action but with player k playing action a. Most regret based learning
approaches attempt to minimize the average regret RT /T of the learner. Exact cal-
culation of this regret requires knowledge of the reward function and observation of
the actions of other agents in order to determine the Rk (a−k (t) ∪ {a}) term. If this
information is not available, regret has to be estimated from previous observations.
Under some assumptions regret based learning can be shown to converge to some
form of equilibrium play (Foster and Young, 2003; Hart and Mas-Colell, 2001).
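The sketch below (illustrative, and assuming full knowledge of the reward function and of the other player's past actions) computes the total regret of Equation 14.1 for one player of a two-player repeated game.

```python
import numpy as np

def total_regret(Rk, history, k):
    """Total regret of player k after T plays of a 2-player repeated game.

    Rk      : payoff matrix of player k, indexed by (own action, other's action)
    history : list of joint actions (a1, a2) actually played
    k       : 0 or 1, the index of the player whose regret we compute
    """
    obtained = sum(Rk[a[k], a[1 - k]] for a in history)
    best_fixed = max(
        sum(Rk[a_fixed, a[1 - k]] for a in history)
        for a_fixed in range(Rk.shape[0])
    )
    return best_fixed - obtained

# Player 1 in matching pennies, after always playing action 0 against an
# opponent who happened to play action 1 three times out of four.
R1 = np.array([[1, -1], [-1, 1]])
history = [(0, 1), (0, 1), (0, 0), (0, 1)]
RT = total_regret(R1, history, k=0)
print(RT, RT / len(history))   # total regret 4, average regret 1.0
```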
A natural question to ask is what happens when agents use a standard, single-agent
RL technique to interact in a game environment. Early research into multi-agent
RL focussed largely on the application of Q-learning to repeated games. In this so
called independent or uninformed setting, each player k keeps an individual vector
of estimated Q-values Qk (a), a ∈ Ak . The players learn Q-values over their own
action set and do not use any information on other players in the game. Since there
is no concept of environment state in repeated games, a single vector of estimates is
sufficient, rather than a full table of state-action pairs, and the standard Q-learning
update is typically simplified to its stateless version:

Q_k(a) ← Q_k(a) + α ( r − Q_k(a) ),

where r is the reward agent k received after the joint action containing its own action a.
In (Claus and Boutilier, 1998) the dynamics of stateless Q-learning in repeated nor-
mal form common interest games are empirically studied. The key questions here
are: is simple Q-learning still guaranteed to converge in a multi-agent setting, and,
if so, does it converge to (the optimal) equilibrium? The study also relates independent Q-
learners to joint action learners (see below) and investigates how the rates of con-
vergence and limit points are influenced by the game structures and action selection
strategies. In a related branch of research (Tuyls and Nowé, 2005; Wunder et al,
2010) the dynamics of independent Q-learning are studied using techniques from
evolutionary game theory (Smith, 1982).
While independent Q-learners were shown to reach equilibrium play under some
circumstances, they also demonstrated a failure to coordinate in some games, and
even failed to converge altogether in others.
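The following sketch (learning rate, temperature schedule and episode count are all illustrative choices) implements two such independent stateless Q-learners with Boltzmann exploration on the climbing game of Table 14.3(a); with this naive exploration they often settle on a suboptimal but "safe" joint action rather than the optimal equilibrium.

```python
import numpy as np

# Climbing game (Table 14.3(a)); common interest, so one shared payoff matrix.
R = np.array([[11., -30., 0.],
              [-30., 7.,  6.],
              [0.,   0.,  5.]])

def boltzmann(q, temperature):
    z = np.exp((q - q.max()) / temperature)
    return z / z.sum()

def independent_q_learning(episodes=5000, alpha=0.1, temp=5.0, decay=0.999):
    rng = np.random.default_rng(0)
    Q = [np.zeros(3), np.zeros(3)]          # one Q-vector per agent, own actions only
    for _ in range(episodes):
        a = [rng.choice(3, p=boltzmann(Q[k], temp)) for k in range(2)]
        r = R[a[0], a[1]]                   # both agents receive the same reward
        for k in range(2):
            Q[k][a[k]] += alpha * (r - Q[k][a[k]])   # stateless Q-update
        temp = max(temp * decay, 0.1)
    return [int(np.argmax(Qk)) for Qk in Q]

print(independent_q_learning())   # often [1, 1], i.e. the 'safe' joint action (a2, a2)
```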
Claus and Boutilier compared joint action learners to independent learners. In the former, the
agents learn Q-values for all joint actions; in other words, each agent j learns a Q-
value for every joint action a in A. Action selection is still done by each agent individually, based on
the belief the agent has about the other agents' strategies. Equation 14.3 expresses
that the Q-values of the joint actions are weighted according to the probability that the other
agents will select the corresponding actions. The resulting Expected Values (EV) can then be used
in combination with any action selection technique. Claus and Boutilier showed
experimentally, using the games of Table 14.3, that joint action learners and independent
learners using a Boltzmann exploration strategy with decreasing temperature behave
very similarly. These learners have been studied from an evolutionary game theory
point of view in (Tuyls and Nowé, 2005), and it has been shown that they
converge to evolutionary stable NE, which are not necessarily Pareto optimal.
However, the learners have difficulty reaching the optimal NE, and more sophis-
ticated exploration strategies are needed to increase the probability of converging
to it. The reason that simple exploration strategies are not sufficient
is mainly that the actions involved in the optimal NE often lead to
a much lower payoff when combined with other actions, so the potential quality of such an
action is underestimated. For example, in the climbing game of Table 14.3(a) the action a1 of the row
player only leads to the highest reward, 11, when combined with action a1 of the
column player. During the learning phase, the agents are still exploring and action a1
will also be combined with actions a2 and a3. As a result the agents will often settle
for the more “safe” NE (a2,a2). A similar behavior is observed in the penalty game of Table 14.3(b): since
miscoordination on the two optimal NE is punished, the larger the penalty (k < 0) the more dif-
ficult it becomes for the agents to reach either of the optimal NE. This also explains
why independent learners are generally not able to converge to a NE when they
are allowed to use an arbitrary, including a purely random, exploration strategy. Whereas in sin-
gle agent Q-learning the particular exploration strategy does not affect the eventual
convergence (Tsitsiklis, 1994), this no longer holds in a MAS setting.
The limitations of single-agent Q-learning have led to a number of extensions
of the Q-learning algorithm for use in repeated games. Most of these approaches fo-
cus on coordination mechanisms allowing Q-learners to reach the optimal outcome
in common interest games. The frequency maximum Q-learning (FMQ) algorithm
(Kapetanakis and Kudenko, 2002), for example, keeps a frequency value freq(R∗, a)
indicating how often the maximum reward so far (R∗ ) has been observed for a cer-
tain action a. This value is then used as a sort of heuristic which is added to the
Q-values. Instead of using Q-values directly, the FMQ algorithm relies on the following
heuristic evaluation of the actions:

EV(a) = Q(a) + w · freq(R∗, a) · R∗,

where w is a weight that controls the importance of the heuristic value freq(R∗, a) R∗.
The algorithm was empirically shown to be able to drive learners to the optimal joint
action in common interest games with deterministic payoffs.
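A minimal sketch of the FMQ bookkeeping for a single independent learner (class and attribute names are mine): it tracks, per action, the best reward seen so far and how frequently that best reward occurred, and combines them with the Q-values as in the heuristic above; the resulting evaluations would then be fed to, e.g., Boltzmann action selection.

```python
import numpy as np

class FMQAgent:
    """Frequency Maximum Q-value heuristic for one independent learner."""

    def __init__(self, n_actions, alpha=0.1, w=10.0):
        self.Q = np.zeros(n_actions)
        self.max_r = np.full(n_actions, -np.inf)   # best reward seen per action
        self.max_count = np.zeros(n_actions)       # how often that best was seen
        self.count = np.zeros(n_actions)           # how often the action was tried
        self.alpha, self.w = alpha, w

    def update(self, a, r):
        self.count[a] += 1
        self.Q[a] += self.alpha * (r - self.Q[a])
        if r > self.max_r[a]:
            self.max_r[a], self.max_count[a] = r, 1
        elif r == self.max_r[a]:
            self.max_count[a] += 1

    def evaluations(self):
        freq = np.where(self.count > 0, self.max_count / np.maximum(self.count, 1), 0.0)
        max_r = np.where(np.isfinite(self.max_r), self.max_r, 0.0)
        return self.Q + self.w * freq * max_r    # EV(a) = Q(a) + w * freq(R*, a) * R*
```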
In (Kapetanakis et al, 2003) the idea of commitment sequences has been intro-
duced to allow independent learning in games with stochastic payoffs. A commit-
ment sequence is a list of time slots during which an agent is committed to always
selecting the same action. These sequences of time slots are generated according to some
protocol the agents are aware of. Using the guarantee that at time slots belonging
to the same sequence the agents always select the same individual
action, the agents are able to distinguish between the two sources of uncertainty: the
noise on the reward signal and the influence of the other agents' actions on the reward.
This allows the agents to deal with games with stochastic payoffs.
A recent overview of multi-agent Q-learning approaches can be found in (Wun-
der et al, 2010).
As an alternative to the well known Q-learning algorithm, we now list some ap-
proaches based on gradient following updates. We will focus on players that employ
learning automata (LA) reinforcement schemes. Learning automata are relatively
simple policy iterators that keep a vector of action probabilities p over the action set
A. As is common in RL, these probabilities are updated based on a feedback received
from the environment. While initial studies focussed mainly on a single automaton
in n-armed bandit settings, RL algorithms using multiple automata were developed
to learn policies in MDPs (Wheeler Jr and Narendra, 1986). The most commonly
used LA update scheme is called Linear Reward-Penalty and updates the action
probabilities as follows:

p_i(t+1) = p_i(t) + λ1 r(t) (1 − p_i(t)) − λ2 (1 − r(t)) p_i(t)        if a_i is the action taken at time t,
p_j(t+1) = p_j(t) − λ1 r(t) p_j(t) + λ2 (1 − r(t)) ( 1/(K−1) − p_j(t) )        for all actions a_j ≠ a_i,

with r(t) the feedback received at time t and K the number of actions available
to the automaton. λ1 and λ2 are constants, called the reward and penalty parameter
respectively. Depending on the values of these parameters 3 distinct variations of the
algorithm can be considered. When λ1 = λ2 , the algorithm is referred to as Linear
Reward-Penalty (LR−P ) while it is called Linear Reward-ε Penalty (LR−ε P) when
λ1 ≫ λ2 . If λ2 = 0 the algorithm is called Linear Reward-Inaction (LR−I ). In this
case, λ1 is also sometimes called the learning rate.
This algorithm has also been shown to be a special case of the REINFORCE
(Williams, 1992) update rules. Despite the fact that all these update rules are derived
from the same general scheme, they exhibit very different learning behaviors. Inter-
estingly, these learning schemes perform well in game contexts, even though they
do not require any information (actions, rewards, strategies) on the other players in
the game. Each agent independently applies a LA update rule to change the prob-
abilities over its actions. Below we list some interesting properties of LA in game
settings. In two-person zero-sum games, the LR−I scheme converges to the Nash
equilibrium when this exists in pure strategies, while the LR−ε P scheme is able to
approximate mixed equilibria. In n-player common interest games reward-inaction
also converges to a pure Nash equilibrium. In (Sastry et al, 1994), the dynamics
of reward-inaction in general sum games are studied. The authors proceed by ap-
proximating the update in the automata game by a system of ordinary differential
equations. The following properties are found to hold for the LR−I dynamics:
• All Nash equilibria are stationary points.
• All strict Nash equilibria are asymptotically stable.
• All stationary points that are not Nash equilibria are unstable.
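As an illustration of these dynamics, the sketch below (the game, the scaling of rewards to [0, 1] and the learning rate are illustrative choices) lets two fully independent reward-inaction automata play the coordination game of Table 14.2(c); their probability vectors typically converge to one of the pure Nash equilibria.

```python
import numpy as np

def lr_i_game(R, episodes=20000, lam=0.01, seed=0):
    """Two independent Linear Reward-Inaction automata on a common interest game.

    R is the shared payoff matrix with entries scaled to [0, 1].  Each automaton
    keeps only a probability vector over its own actions and never observes
    the other player.
    """
    rng = np.random.default_rng(seed)
    p = [np.ones(R.shape[0]) / R.shape[0], np.ones(R.shape[1]) / R.shape[1]]
    for _ in range(episodes):
        a = [rng.choice(len(pk), p=pk) for pk in p]
        r = R[a[0], a[1]]                     # common reward in [0, 1]
        for k in range(2):
            e = np.zeros(len(p[k])); e[a[k]] = 1.0
            p[k] += lam * r * (e - p[k])      # reward-inaction: no update when r = 0
    return p

# Coordination game of Table 14.2(c), scaled to [0, 1].
R = np.array([[0.5, 0.0],
              [0.0, 1.0]])
print(lr_i_game(R))   # probability vectors typically concentrate on one pure NE
```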
In repeated games, payoffs depend only on the actions of the players: in a normal form game there is no concept of a system
with state transitions, which is a central issue of the Markov decision process. There-
fore, we now consider a richer framework which generalizes both repeated games
and MDPs. Introducing multiple agents to the MDP model significantly compli-
cates the problem that the learning agents face. Both rewards and transitions in the
environment now depend on the actions of all agents present in the system. Agents
are therefore required to learn in a joint action space. Moreover, since agents can
have different goals, an optimal solution which maximizes rewards for all agents
simultaneously may fail to exist.
To accommodate the increased complexity of this problem we use the represen-
tation of Stochastic or Markov games (Shapley, 1953). While they were originally
introduced in game theory as an extension of normal form games, Markov games
also generalize the Markov Decision process and were more recently proposed as
the standard framework for multi-agent reinforcement learning (Littman, 1994). As
the name implies, Markov games still assume that state transitions are Markovian,
however, both transition probabilities and expected rewards now depend on the joint
action of all agents. Markov games can be seen as an extension of MDPs to the
multi-agent case, and of repeated games to the multiple state case. If we assume only one
agent, or the case where other agents play a fixed policy, the Markov game reduces
to an MDP. When the Markov game has only 1 state, it reduces to a repeated normal
form game.
An extension of the single agent Markov decision process (MDP) to the multi-agent
case can be defined by Markov Games. In a Markov Game, joint actions are the
result of multiple agents choosing an action independently; formally, a Markov game
can be described by a tuple (n, S, A_1, . . . , A_n, T, R_1, . . . , R_n), with n agents, a state set
S = {s_1, . . . , s_N}, (possibly state-dependent) action sets A_k, a transition function T and
reward functions R_k.
Note that A_k(s_i) is now the action set available to agent k in state s_i , with k : 1 . . . n,
n being the number of agents in the system and i : 1, . . . , N, N being the number
of states in the system. Transition probabilities T(s_i, a_i, s_j) and rewards R_k(s_i, a_i, s_j)
now depend on the current state s_i , the next state s_j and a joint action from state s_i , i.e.
ai = (ai1 , . . . ain ) with aik ∈ Ak (si ). The reward function Rk (si ,ai ,s j ,) is now individual
1 As was the case for MDPs, one can also consider the equivalent case where reward does
not depend on the next state.
to each agent k. Different agents can receive different rewards for the same state
transition. Transitions in the game are again assumed to obey the Markov property.
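As a rough illustration of the ingredients just listed, the following sketch collects them in a single container. The field names and the tabular/functional representations are assumptions made for the example, not part of the formal definition.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

JointAction = Tuple[int, ...]  # one action index per agent

@dataclass
class MarkovGame:
    n_agents: int
    states: List[int]
    actions: Dict[Tuple[int, int], List[int]]                   # (agent k, state s) -> available actions A_k(s)
    transition: Callable[[int, JointAction], Dict[int, float]]  # T(s, a) -> distribution over next states
    reward: Callable[[int, int, JointAction, int], float]       # R_k(s, a, s') for agent k

    def step(self, s: int, a: JointAction, rng) -> Tuple[int, List[float]]:
        """Sample a next state and return the individual rewards of all agents."""
        dist = self.transition(s, a)
        next_s = rng.choices(list(dist.keys()), weights=list(dist.values()))[0]
        rewards = [self.reward(k, s, a, next_s) for k in range(self.n_agents)]
        return next_s, rewards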
As was the case in MDPs, agents try to optimize some measure of their future
expected rewards. Typically they try to maximize either their future discounted re-
ward or their average reward over time. The main difference with respect to single
agent RL is that now these criteria also depend on the policies of the other agents. This
results in the following definition for the expected discounted reward for agent k
under a joint policy π = (π1 , . . . ,πn ), which assigns a policy πi to each agent i:
$$ V_{k}^{\boldsymbol{\pi}}(s) = E_{\boldsymbol{\pi}}\Big[\, \sum_{t=0}^{\infty} \gamma^{t}\, r_{k}(t+1) \,\Big|\, s(0)=s \Big] \qquad (14.9) $$
while the average reward for agent k under this joint policy is defined as:
$$ J_{k}^{\boldsymbol{\pi}}(s) = \lim_{T\to\infty} \frac{1}{T}\, E_{\boldsymbol{\pi}}\Big[\, \sum_{t=0}^{T} r_{k}(t+1) \,\Big|\, s(0)=s \Big] \qquad (14.10) $$
Since it is in general impossible to maximize this criterion for all agents simul-
taneously, as agents can have conflicting goals, agents playing a Markov game face
the same coordination problems as in repeated games. Therefore, typically one relies
again on equilibria as the solution concept for these problems. The best response and
Nash equilibrium concepts can be extended to Markov games, by defining a policy
πk as a best response, when no other policy for agent k exists which gives a higher
expected future reward, provided that the other agents keep their policies fixed.
It should be noted that learning in a Markov game introduces several new issues
over learning in MDPs with regard to the policy being learned. In an MDP, it is
possible to prove that, given some basic assumptions, an optimal deterministic pol-
icy always exists. This means it is sufficient to consider only those policies which
deterministically map each state to an action. In Markov games, however, where we
must consider equilibria between agent policies, this no longer holds. Similarly to
the situation in repeated games, it is possible that a discounted Markov game only
has Nash equilibria in which stochastic policies are involved. As such, it is not suffi-
cient to let agents map a fixed action to each state: they must be able to learn a mixed
strategy. The situation becomes even harder when considering other reward criteria,
such as the average reward, since then it is possible that no equilibria in stationary
strategies exist (Gillette, 1957). This means that in order to achieve an equilibrium
outcome, the agents must be able to express policies which condition the action se-
lection in a state on the entire history of the learning process. Fortunately, one can
introduce some additional assumptions on the structure of the problem to ensure the
existence of stationary equilibria (Sobel, 1971).
While in normal form games the challenges for reinforcement learners originate
mainly from the interactions between the agents, in Markov games they face the
additional challenge of an environment with state transitions. This means that the
agents typically need to combine coordination methods or equilibrium solvers used
in repeated games with MDP approaches from single-agent RL.
t ← 0
Qk (s, a) ← 0   ∀ s, a, k
repeat
    for all agents k do
        select action ak (t)
    execute joint action a = (a1 , . . . , an )
    observe new state s′ and rewards rk
    for all agents k do
        Qk (s, a) ← Qk (s, a) + α [ Rk (s, a) + γ Vk (s′ ) − Qk (s, a) ]
until termination condition
This estimated frequency of play of the other agents allows the joint action learner
to calculate the expected Q-values for a state:
where Q(s,ak ,a−k ) denotes the Q-value in state s for the joint action in which agent
k plays ak and the other agents play according to a−k . These expected Q-values can
then be used for the agent’s action selection, as well as in the Q-learning update, just
as in the standard single-agent Q-learning algorithm.
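The sketch below illustrates how such a joint action learner might maintain empirical frequencies of the other agents' actions and compute expected Q-values from them. The class layout, the uniform fallback before any observations have been made, and all names are illustrative assumptions, not the chapter's implementation.

from collections import defaultdict
from itertools import product

class JointActionLearner:
    """Sketch of a joint-action learner for agent k: Q over joint actions plus
    empirical frequencies of the other agents' (tuple-valued) joint actions per state."""

    def __init__(self, my_actions, other_action_sets, gamma=0.95, alpha=0.1):
        self.my_actions = my_actions
        self.other_action_sets = other_action_sets           # list of action sets, one per other agent
        self.Q = defaultdict(float)                           # Q[(s, a_k, a_minus_k)]
        self.counts = defaultdict(lambda: defaultdict(int))   # counts[s][a_minus_k]
        self.gamma, self.alpha = gamma, alpha

    def expected_q(self, s, a_k):
        """EV(s, a_k): frequency-weighted average of Q(s, a_k, a_-k) over observed a_-k."""
        total = sum(self.counts[s].values())
        if total == 0:  # no observations yet: average uniformly over the joint actions
            joint = list(product(*self.other_action_sets))
            return sum(self.Q[(s, a_k, a_mk)] for a_mk in joint) / len(joint)
        return sum((n / total) * self.Q[(s, a_k, a_mk)]
                   for a_mk, n in self.counts[s].items())

    def update(self, s, a_k, a_minus_k, r, s_next):
        self.counts[s][a_minus_k] += 1
        best_next = max(self.expected_q(s_next, a) for a in self.my_actions)
        td_target = r + self.gamma * best_next
        self.Q[(s, a_k, a_minus_k)] += self.alpha * (td_target - self.Q[(s, a_k, a_minus_k)])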
Another method used in multi-agent Q-learning is to assume that the other agents
will play according to some strategy. For example, in the minimax Q-learning algo-
rithm (Littman, 1994), which was developed for 2-agent zero-sum problems, the
learning agent assumes that its opponent will play the action which minimizes the
learner’s payoff. This means that the max operator of single agent Q-learning is
replaced by the minimax value:
The Q-learning agent maximizes over its strategies for state s, while assuming that
the opponent will pick the action which minimizes the learner’s future rewards.
Note that the agent does not just maximize over deterministic strategies, as it
is possible that the maximum will require a mixed strategy. This system was later
generalized to friend-or-foe Q-learning (Littman, 2001a), where the learning agent
deals with multiple agents by marking them either as friends, who assist in maximizing
its payoff, or as foes, who try to minimize it.
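A common way to compute this minimax value is by linear programming. The following sketch does so for the Q-matrix of a single state, assuming SciPy is available; it is an illustration of the computation, not the chapter's implementation.

import numpy as np
from scipy.optimize import linprog

def minimax_value(Q_s):
    """Q_s[a][o]: learner's Q-values in one state for its action a vs. opponent action o.
    Returns (value, mixed strategy) of  max_pi min_o sum_a pi(a) Q_s[a][o]."""
    Q_s = np.asarray(Q_s, dtype=float)
    n_a, n_o = Q_s.shape
    # Variables: x = (pi_1, ..., pi_{n_a}, v); objective: maximize v  <=>  minimize -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # For every opponent action o:  v - sum_a pi(a) Q_s[a, o] <= 0
    A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    # The probabilities sum to one.
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n_a]

The returned mixed strategy is exactly what the learner must be able to express, since the maximum is in general not attained by a deterministic strategy.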
Alternative approaches assume that the agents will play an equilibrium strategy.
For example, Nash-Q (Hu and Wellman, 2003) observes the rewards for all agents
and keeps estimates of Q-values not only for the learning agent, but also for all
other agents. This allows the learner to represent the joint action selection in each
state as a game, where the entries in the payoff matrix are defined by the Q-values
of the agents for the joint action. This representation is also called the stage game.
A Nash-Q agent then assumes that all agents will play according to a Nash equilib-
rium of this stage game in each state:
Vk (s) = Nashk (s,Q1 ,...,Qn ),
where Nashk (s,Q1 ,...,Qn ) denotes the expected payoff for agent k when the agents
play a Nash equilibrium in the stage game for state s with Q-values Q1 ,...,Qn . Under
some rather strict assumptions on the structure of the stage games, Nash-Q can be
shown to converge in self-play to a Nash equilibrium between agent policies.
The approach used in Nash-Q can also be combined with other equilibrium con-
cepts, for example correlated equilibria (Greenwald et al, 2003) or the Stackelberg
equilibrium (Kononen, 2003). The main difficulty with these approaches is that the
value is not uniquely defined when multiple equilibria exist, and coordination is
needed to agree on the same equilibrium. In these cases, additional mechanisms are
typically required to select some equilibrium.
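The sketch below illustrates how such a stage game could be assembled from the agents' Q-tables. The equilibrium solver itself is left as an assumed external routine (solve_equilibrium), reflecting the fact that an additional mechanism is needed to select one equilibrium; all container layouts are illustrative.

from itertools import product

def build_stage_game(state, Q_tables, action_sets):
    """Build the stage game for one state: a dict mapping each joint action to the
    tuple of Q-values (one per agent). Q_tables[k][(state, joint_action)] is agent k's estimate."""
    joint_actions = list(product(*action_sets))
    return {ja: tuple(Q_tables[k].get((state, ja), 0.0) for k in range(len(Q_tables)))
            for ja in joint_actions}

def equilibrium_values(state, Q_tables, action_sets, solve_equilibrium):
    """Nash-Q-style state values: V_k(s) is agent k's expected payoff under the
    equilibrium that the assumed external solver returns for the stage game;
    the solver is expected to return a probability for every joint action."""
    game = build_stage_game(state, Q_tables, action_sets)
    dist = solve_equilibrium(game)   # {joint_action: probability}
    n = len(Q_tables)
    return [sum(p * game[ja][k] for ja, p in dist.items()) for k in range(n)]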
While the intensive research into value iteration based multi-agent RL has
yielded some theoretical guarantees (Littman, 2001b), convergence results in the
general Markov game case remain elusive. Moreover, recent research indicates that
a reliance on Q-values alone may not be sufficient to learn an equilibrium policy in
arbitrary general sum games (Zinkevich et al, 2006) and new approaches are needed.
initialise rprev (s,k), tprev (s), aprev (s,k), t, rtot (k), ρk (s,a), ηk (s,a) to zero, ∀ s, k, a
s ← s(0)
loop
    for all agents k do
        if s was visited before then
            • Calculate the reward received and the time passed since the last visit to state s:
                Δrk = rtot (k) − rprev (s,k),    Δt = t − tprev (s)
            • Update the response estimates for the previously taken action:
                ρk (s, aprev (s,k)) ← ρk (s, aprev (s,k)) + Δrk ,    ηk (s, aprev (s,k)) ← ηk (s, aprev (s,k)) + Δt
            • Calculate feedback:
                βk (t) = ρk (s, aprev (s,k)) / ηk (s, aprev (s,k))
            • Update automaton LA(s,k) using the LR−I update with a(t) = aprev (s,k) and βk (t) as above.
        • Let LA(s,k) select an action ak .
        • Store data for the current state visit:
            tprev (s) ← t,    rprev (s,k) ← rtot (k),    aprev (s,k) ← ak
    • Execute joint action a = (a1 , . . . , an ), observe immediate rewards rk and new state s′
    • s ← s′
    • rtot (k) ← rtot (k) + rk
    • t ← t + 1
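Reusing the LearningAutomaton sketch from earlier, the following simplified fragment illustrates the state-local bookkeeping: on revisiting a state, the automaton that acted there is reinforced with the average reward collected per time step since that visit. This is a simplification of the listing above (it does not maintain the accumulated ρ and η statistics across visits), and all names are illustrative.

class StateAutomataAgent:
    """One learning automaton per state for a single agent (simplified sketch)."""

    def __init__(self, n_states, n_actions, alpha=0.05):
        self.las = [LearningAutomaton(n_actions, alpha) for _ in range(n_states)]
        self.prev = {}            # state -> (time of last visit, total reward at last visit, action taken)
        self.t, self.r_tot = 0, 0.0

    def act(self, s):
        if s in self.prev:
            t_prev, r_prev, a_prev = self.prev[s]
            dt = max(self.t - t_prev, 1)
            beta = min(max((self.r_tot - r_prev) / dt, 0.0), 1.0)  # clip feedback into [0, 1]
            self.las[s].update(a_prev, beta)
        a = self.las[s].select_action()
        self.prev[s] = (self.t, self.r_tot, a)
        return a

    def observe(self, reward):
        self.r_tot += reward
        self.t += 1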
Peshkin et al (2000) develop a gradient-based policy search method for partially observable, identical
payoff stochastic games. The method is shown to converge to local optima which
are, however, not necessarily Nash equilibria between agent policies.
Fig. 14.2 Decoupling the learning process by learning when to take the other agent into
account on one level, and acting on the second level
Kok & Vlassis proposed an approach based on a sparse representation for the joint
action space of the agents while observing the entire joint state space. More specif-
ically they are interested in learning joint-action values for those states where the
agents explicitly need to coordinate. In many problems, this need only occurs in very
specific situations (Guestrin et al, 2002b). Sparse Tabular Multiagent Q-learning
maintains a list of states in which coordination is necessary. In these states, agents
select a joint action, whereas in all the uncoordinated states they all select an action
individually (Kok and Vlassis, 2004b). By replacing this list of states by coordi-
nation graphs (CG) it is possible to represent dependencies that are limited to a
few agents (Guestrin et al, 2002a; Kok and Vlassis, 2004a, 2006). This technique
is known as Sparse Cooperative Q-learning (SCQ). Figure 14.3 shows a graphical
representation of a simple CG for a situation where the effect of the actions of agent 4
depends on the actions of agent 2, and the actions of agents 2 and 3 both depend on
the actions of agent 1. The nodes represent the agents, while an edge defines a
dependency between two agents. If agents transition into a coordinated state, they
apply a variable elimination algorithm to compute the optimal joint action for the
current state. In all other states, the agents select their actions independently.
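A minimal sketch of this action-selection scheme (greedy case) is given below; the table layouts and names are illustrative assumptions, not the authors' implementation.

from itertools import product

def select_actions(state, coordination_states, joint_Q, individual_Qs, action_sets):
    """Sparse tabular multi-agent action selection: in the listed coordination states
    the agents pick a joint action from a shared joint-action Q-table; in all other
    states each agent k picks its own action from its individual Q-table."""
    if state in coordination_states:
        joint_actions = list(product(*action_sets))
        return max(joint_actions, key=lambda ja: joint_Q.get((state, ja), 0.0))
    return tuple(max(acts, key=lambda a: individual_Qs[k].get((state, a), 0.0))
                 for k, acts in enumerate(action_sets))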
In later work, the authors introduced Utile Coordination (Kok et al, 2005). This
is a more advanced algorithm that uses the same idea as SCQ, but instead of hav-
ing to define the CGs beforehand, they are learned online. This is done by
maintaining statistical information about the obtained rewards conditioned on the
states and actions of the other agents. As such, it is possible to learn the context spe-
cific dependencies that exist between the agents and represent them in a CG. This
technique is, however, limited to fully cooperative multi-agent systems.
Fig. 14.3 Simple coordination graph. In the situation depicted the effect of the actions of
agent 4 depends on the actions of agent 2, and the actions of agents 2 and 3 both depend on the
actions of agent 1.
The primary goal of these approaches is to reduce the joint-action space. How-
ever, the computation and learning in the algorithms described above always employ
a complete multi-agent view of the entire joint-state space to select actions,
even in states where using only local state information would be sufficient. As such,
the state space in which they are learning is still exponential in the number of agents,
and its use is limited to situations in which it is possible to observe the entire joint
state.
Learning of Coordination
Spaan and Melo approached the problem of coordination from a different angle than
Kok & Vlassis (Spaan and Melo, 2008). They introduced a new model for multi-
agent decision making under uncertainty called interaction-driven Markov games
(IDMG). This model contains a set of interaction states which lists all the states in
which coordination is beneficial.
Fig. 14.4 Graphical representation of state expansion with sparse interactions. Independent
single states are expanded to joint-states where necessary. Agents begin with 9 independent
states. After a while states 4 and 6 of an agent are expanded to include the states of another
agent.
In later work, Melo and Veloso introduced an algorithm where agents learn in
which states they need to condition their actions on the local state information of
other agents (Melo and Veloso, 2009). As such, their approach can be seen as a
way of solving an IDMG where the states in which coordination is necessary are
not specified beforehand. To achieve this, they augment the action space of each
agent with a pseudo-coordination action (COORDINATE). This action will perform
an active perception step. This could for instance be a broadcast to the agents to
divulge their location or using a camera or sensors to detect the location of the other
agents. This active perception step will decide whether coordination is necessary or
if it is safe to ignore the other agents. Since the penalty of miscoordination is bigger
than the cost of using the active perception, the agents learn to take this action in the
interaction states of the underlying IDMG. This approach solves the coordination
problem by deferring it to the active perception mechanism.
The active perception step of LoC can consist of the use of a camera, sensory data,
or communication to reveal the local state information of another agent. As such the
outcome of the algorithm depends on the outcome of this function. Given an adequate
active perception function, LoC is capable of learning a sparse set of states in which
coordination should occur. Note that depending on the active perception function,
this algorithm can be used for both cooperative and conflicting-interest systems.
The authors use a variation on the standard Q-learning update rule:
$$ Q^{C}_{k}(s,a_{k}) \leftarrow (1-\alpha(t))\, Q^{C}_{k}(s,a_{k}) + \alpha(t)\big[\, r_{k} + \gamma \max_{a} Q_{k}(s'_{k},a) \big] \qquad (14.11) $$
where QCk represents the Q-table containing the states in which agent k will coordinate
and Qk contains the state-action values for its independent states. The joint state
information is represented as s, whereas sk and ak are the local state information
and action of agent k. So the update of QCk uses the estimates of Qk . This represents
the one-step behaviour of the COORDINATE action and allows for a sparse repre-
sentation of QCk , since there is no direct dependency between the states in this joint
Q-table.
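A minimal sketch of an update of this form is shown below; the dictionary-based tables and parameter names are illustrative assumptions, not the authors' implementation.

def loc_update(QC_k, Q_k, joint_state, local_next_state, a_k, r_k,
               actions_k, alpha=0.1, gamma=0.95):
    """One update of the coordination Q-table, following the form of (14.11):
    the bootstrap uses the agent's individual Q-table at its next local state."""
    best_next = max(Q_k.get((local_next_state, a), 0.0) for a in actions_k)
    old = QC_k.get((joint_state, a_k), 0.0)
    QC_k[(joint_state, a_k)] = (1 - alpha) * old + alpha * (r_k + gamma * best_next)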
Coordinating Q-Learning
Coordinating Q-Learning, or CQ-learning, learns in which states an agent should
take the other agents into consideration (De Hauwere et al, 2010) and in which
states it can act using only its own state information. More precisely, the
algorithm will identify states in which an agent should take other agents into account
when choosing its preferred action.
The algorithm can be decomposed into three sections: detecting conflict situa-
tions, selecting actions, and updating the Q-values, which are now explained in
more detail:
1. Detecting conflict situations
Agents must identify in which states they experience the influence of at least
one other agent. CQ-Learning needs a baseline for this, so agents are assumed
to have learned a model about the expected payoffs for selecting an action in a
particular state applying an individual policy. For example, in a gridworld this
would mean that the agents have learned a policy to reach some goal, while
being the only agent present in the environment. If agents are influencing each
other, this will be reflected in the payoff the agents receive when they are acting
together. CQ-learning uses a statistical test to detect if there are changes in the
observed rewards for the selected state-action pair compared to the case where
they were acting alone in the environment. Two situations can occur:
a. The statistical test detects a change in the received immediate rewards.
In this situation, the algorithm will mark this state, and search for the cause
of this change by collecting new samples from the joint state space in order
to identify the joint state-action pairs in which collisions occur. These state-
action pairs are then marked as being dangerous, and the state space of the
agent is augmented by adding this joint state information. State-action pairs
that did not cause interactions are marked as being safe, i.e. the agent’s
actions in this state are independent from the states of other agents. So
the algorithm will first attempt to detect changes in the rewards an agent
receives, based solely on its own state, before trying to identify which other
agents cause these changes.
b. The statistics indicate that the rewards the agent receives are from the same
distribution as if the agent was acting alone. Therefore, no special action is
taken in this situation and the agent continues to act as if it was alone.
2. Selecting actions
If an agent selects an action, it will check if its current local state is a state in
which a discrepancy has been detected previously (case 1.a, described above). If
so, it will observe the global state information to determine if the state informa-
tion of the other agents is the same as when the conflict was detected. If this is
the case, it will condition its actions on this global state information, otherwise
it can act independently, using only its own local state information. If its local
state information has never caused a discrepancy (case 1.b, described above), it
can act without taking the other agents into consideration.
Initialise Qk through single-agent learning and Qkj ;
while true do
    if state sk of agent k is unmarked then
        Select ak for agent k from Qk
    else
        if the joint state information js is safe then
            Select ak for agent k from Qk
        else
            Select ak for agent k from Qkj based on the joint state information js
    Sample (sk , ak , rk )
    if a t-test detects a difference between the observed and the expected rewards for (sk , ak ) then
        mark sk
        for all other state information present in the joint state js do
            if a t-test detects a difference between the independent state sk and the joint state js then
                add js to Qkj
                mark js as dangerous
            else
                mark js as safe
    if sk is unmarked for agent k or js is safe then
        No need to update Qk (sk ).
    else
        Update Qkj ( js, ak ) ← (1 − αt ) Qkj ( js, ak ) + αt [ r( js, ak ) + γ maxa Qk (s′k , a) ]
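A minimal sketch of the change-detection step in the listing above is given below, using a two-sample (Welch) t-test from SciPy. The chapter does not prescribe this particular test or these parameter names, so treat the fragment as one possible instantiation.

import numpy as np
from scipy.stats import ttest_ind

def rewards_changed(baseline_rewards, observed_rewards, significance=0.05):
    """Compare the rewards observed for a state-action pair while acting together
    against the rewards recorded when the agent acted alone. Returns True when the
    observed reward distribution differs significantly from the single-agent baseline."""
    baseline = np.asarray(baseline_rewards, dtype=float)
    observed = np.asarray(observed_rewards, dtype=float)
    if len(baseline) < 2 or len(observed) < 2:
        return False                      # not enough samples to test
    _, p_value = ttest_ind(baseline, observed, equal_var=False)
    return p_value < significance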
This approach was later extended to detect sparse interactions that are only reflected
in the reward signal several timesteps in the future (De Hauwere et al, 2011). An
example of such a situation is a warehouse in which the order in which the goods
arrive is important.
References
Bowling, M., Veloso, M.: Scalable Learning in Stochastic Games. In: AAAI Workshop on
Game Theoretic and Decision Theoretic Agents (2002)
Busoniu, L., Babuska, R., De Schutter, B.: A comprehensive survey of multiagent reinforce-
ment learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications
and Reviews 38(2), 156–172 (2008)
Chalkiadakis, G., Boutilier, C.: Sequential Decision Making in Repeated Coalition Forma-
tion under Uncertainty. In: Parkes, P.M., Parsons (eds.) Proceedings of 7th Int. Conf. on
Autonomous Agents and Multiagent Systems (AAMAS 2008), pp. 347–354 (2008)
Claus, C., Boutilier, C.: The Dynamics of Reinforcement Learning in Cooperative Multiagent
Systems. In: Proceedings of the National Conference on Artificial Intelligence, pp. 746–
752. John Wiley & Sons Ltd. (1998)
De Hauwere, Y.M., Vrancx, P., Nowé, A.: Learning Multi-Agent State Space Representations.
In: Proceedings of the 9th International Conference on Autonomous Agents and Multi-
Agent Systems, Toronto, Canada, pp. 715–722 (2010)
De Hauwere, Y.M., Vrancx, P., Nowé, A.: Detecting and Solving Future Multi-Agent Inter-
actions. In: Proceedings of the AAMAS Workshop on Adaptive and Learning Agents,
Taipei, Taiwan, pp. 45–52 (2011)
Dorigo, M., Stützle, T.: Ant Colony Optimization. Bradford Company, MA (2004)
Fakir, M.: Resource Optimization Methods for Telecommunication Networks. PhD thesis,
Department of Electronics and Informatics, Vrije Universiteit Brussel, Belgium (2004)
Foster, D., Young, H.: Regret Testing: A Simple Payoff-based Procedure for Learning Nash
Equilibrium. University of Pennsylvania and Johns Hopkins University, Mimeo (2003)
Gillette, D.: Stochastic Games with Zero Stop Probabilities. Ann. Math. Stud. 39, 178–187
(1957)
Gintis, H.: Game Theory Evolving. Princeton University Press (2000)
Greenwald, A., Hall, K., Serrano, R.: Correlated Q-learning. In: Proceedings of the Twentieth
International Conference on Machine Learning, pp. 242–249 (2003)
Guestrin, C., Lagoudakis, M., Parr, R.: Coordinated Reinforcement Learning. In: Proceedings
of the 19th International Conference on Machine Learning, pp. 227–234 (2002a)
Guestrin, C., Venkataraman, S., Koller, D.: Context-Specific Multiagent Coordination and
Planning with Factored MDPs. In: 18th National Conference on Artificial Intelligence,
pp. 253–259. American Association for Artificial Intelligence, Menlo Park (2002b)
Hart, S., Mas-Colell, A.: A Reinforcement Procedure Leading to Correlated Equilibrium.
Economic Essays: A Festschrift for Werner Hildenbrand, 181–200 (2001)
Hu, J., Wellman, M.: Nash Q-learning for General-Sum Stochastic Games. The Journal of
Machine Learning Research 4, 1039–1069 (2003)
Kapetanakis, S., Kudenko, D.: Reinforcement Learning of Coordination in Cooperative
Multi-Agent Systems. In: Proceedings of the National Conference on Artificial Intelli-
gence, pp. 326–331. AAAI Press, MIT Press, Menlo Park, Cambridge (2002)
Kapetanakis, S., Kudenko, D., Strens, M.: Learning to Coordinate Using Commitment Se-
quences in Cooperative Multiagent-Systems. In: Proceedings of the Third Symposium on
Adaptive Agents and Multi-agent Systems (AAMAS-2003), p. 2004 (2003)
Kok, J., Vlassis, N.: Sparse Cooperative Q-learning. In: Proceedings of the 21st International
Conference on Machine Learning. ACM, New York (2004a)
Kok, J., Vlassis, N.: Sparse Tabular Multiagent Q-learning. In: Proceedings of the 13th
Benelux Conference on Machine Learning, Benelearn (2004b)
Kok, J., Vlassis, N.: Collaborative Multiagent Reinforcement Learning by Payoff Propaga-
tion. Journal of Machine Learning Research 7, 1789–1828 (2006)
Kok, J., ’t Hoen, P., Bakker, B., Vlassis, N.: Utile Coordination: Learning Interdependencies
among Cooperative Agents. In: Proceedings of the IEEE Symposium on Computational
Intelligence and Games (CIG 2005), pp. 29–36 (2005)
Kononen, V.: Asymmetric Multiagent Reinforcement Learning. In: IEEE/WIC International
Conference on Intelligent Agent Technology (IAT 2003), pp. 336–342 (2003)
Könönen, V.: Policy Gradient Method for Team Markov Games. In: Yang, Z.R., Yin, H.,
Everson, R.M. (eds.) IDEAL 2004. LNCS, vol. 3177, pp. 733–739. Springer, Heidelberg
(2004)
Leyton-Brown, K., Shoham, Y.: Essentials of Game Theory: A Concise Multidisciplinary
Introduction. Synthesis Lectures on Artificial Intelligence and Machine Learning 2(1),
1–88 (2008)
Littman, M.: Markov Games as a Framework for Multi-Agent Reinforcement Learning. In:
Proceedings of the Eleventh International Conference on Machine Learning, pp. 157–163.
Morgan Kaufmann (1994)
Littman, M.: Friend-or-Foe Q-learning in General-Sum Games. In: Proceedings of the Eigh-
teenth International Conference on Machine Learning, pp. 322–328. Morgan Kaufmann
(2001a)
Littman, M.: Value-function Reinforcement Learning in Markov Games. Cognitive Systems
Research 2(1), 55–66 (2001b), http://www.sciencedirect.com/science/
article/B6W6C-430G1TK-4/2/822caf1574be32ae91adf15de90becc4,
doi:10.1016/S1389-0417(01)00015-8
Littman, M., Boyan, J.: A Distributed Reinforcement Learning Scheme for Network Routing.
In: Proceedings of the 1993 International Workshop on Applications of Neural Networks
to Telecommunications, pp. 45–51. Erlbaum (1993)
Mariano, C., Morales, E.: DQL: A New Updating Strategy for Reinforcement Learning Based
on Q-Learning. In: Flach, P.A., De Raedt, L. (eds.) ECML 2001. LNCS (LNAI), vol. 2167,
pp. 324–335. Springer, Heidelberg (2001)
Melo, F., Veloso, M.: Learning of Coordination: Exploiting Sparse Interactions in Multiagent
Systems. In: Proceedings of the 8th International Conference on Autonomous Agents and
Multi-Agent Systems, pp. 773–780 (2009)
Nash, J.: Equilibrium Points in n-Person Games. Proceedings of the National Academy of
Sciences of the United States of America, 48–49 (1950)
Peshkin, L., Kim, K., Meuleau, N., Kaelbling, L.: Learning to Cooperate via Policy
Search. In: Proceedings of the 16th Conference on Uncertainty in Artificial Intelli-
gence, UAI 2000, pp. 489–496. Morgan Kaufmann Publishers Inc., San Francisco (2000),
http://portal.acm.org/citation.cfm?id=647234.719893
Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 2nd edn. Prentice-Hall,
Englewood Cliffs (2003)
Sastry, P., Phansalkar, V., Thathachar, M.: Decentralized Learning of Nash Equilibria in
Multi-Person Stochastic Games with Incomplete Information. IEEE Transactions on Sys-
tems, Man and Cybernetics 24(5), 769–777 (1994)
Shapley, L.: Stochastic Games. Proceedings of the National Academy of Sciences 39(10),
1095–1100 (1953)
Shoham, Y., Leyton-Brown, K.: Multiagent Systems: Algorithmic, Game-Theoretic, and
Logical Foundations. Cambridge University Press (2009)
Singh, S., Kearns, M., Mansour, Y.: Nash Convergence of Gradient Dynamics in General-
Sum Games. In: Proceedings of the Sixteenth Conference on Uncertainty in Artificial
Intelligence, pp. 541–548 (2000)
Smith, J.: Evolution and the Theory of Games. Cambridge Univ. Press (1982)
Sobel, M.: Noncooperative Stochastic Games. The Annals of Mathematical Statistics 42(6),
1930–1935 (1971)
Spaan, M., Melo, F.: Interaction-Driven Markov Games for Decentralized Multiagent Plan-
ning under Uncertainty. In: Proceedings of the 7th International Conference on Au-
tonomous Agents and Multi-Agent Systems (AAMAS), pp. 525–532. International Foun-
dation for Autonomous Agents and Multiagent Systems (2008)
Steenhaut, K., Nowe, A., Fakir, M., Dirkx, E.: Towards a Hardware Implementation of Re-
inforcement Learning for Call Admission Control in Networks for Integrated Services.
In: Proceedings of the International Workshop on Applications of Neural Networks to
Telecommunications, vol. 3, p. 63. Lawrence Erlbaum (1997)
Stevens, J.P.: Intermediate Statistics: A Modern Approach. Lawrence Erlbaum (1990)
Sutton, R., McAllester, D., Singh, S., Mansour, Y.: Policy Gradient Methods for Reinforce-
ment Learning with Function Approximation. In: Advances in Neural Information Pro-
cessing Systems, vol. 12(22) (2000)
Tsitsiklis, J.: Asynchronous stochastic approximation and Q-learning. Machine Learn-
ing 16(3), 185–202 (1994)
Tuyls, K., Nowé, A.: Evolutionary Game Theory and Multi-Agent Reinforcement Learning.
The Knowledge Engineering Review 20(01), 63–90 (2005)
Verbeeck, K.: Coordinated Exploration in Multi-Agent Reinforcement Learning. PhD thesis,
Computational Modeling Lab, Vrije Universiteit Brussel, Belgium (2004)
Verbeeck, K., Nowe, A., Tuyls, K.: Coordinated Exploration in Multi-Agent Reinforcement
Learning: An Application to Loadbalancing. In: Proceedings of the 4th International Con-
ference on Autonomous Agents and Multi-Agent Systems (2005)
Vrancx, P., Tuyls, K., Westra, R.: Switching Dynamics of Multi-Agent Learning. In: Pro-
ceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent
Systems (AAMAS 2008), vol. 1, pp. 307–313. International Foundation for Autonomous
Agents and Multiagent Systems, Richland (2008a),
http://portal.acm.org/citation.cfm?id=1402383.1402430
Vrancx, P., Verbeeck, K., Nowe, A.: Decentralized Learning in Markov Games. IEEE Trans-
actions on Systems, Man, and Cybernetics, Part B 38(4), 976–981 (2008b)
Vrancx, P., De Hauwere, Y.M., Nowé, A.: Transfer learning for Multi-Agent Coordination.
In: Proceedings of the 3th International Conference on Agents and Artificial Intelligence,
Rome, Italy, pp. 263–272 (2011)
Weiss, G.: Multiagent Systems, A Modern Approach to Distributed Artificial Intelligence.
The MIT Press (1999)
Wheeler Jr., R., Narendra, K.: Decentralized Learning in Finite Markov Chains. IEEE Trans-
actions on Automatic Control 31(6), 519–526 (1986)
Williams, R.: Simple Statistical Gradient-Following Algorithms for Connectionist Reinforce-
ment Learning. Machine Learning 8(3), 229–256 (1992)
Wooldridge, M.: An Introduction to Multi Agent Systems. John Wiley and Sons Ltd. (2002)
Wunder, M., Littman, M., Babes, M.: Classes of Multiagent Q-learning Dynamics with
epsilon-greedy Exploration. In: Proceedings of the 27th International Conference on Ma-
chine Learning, Haifa, Israel, pp. 1167–1174 (2010)
Zinkevich, M.: Online Convex Programming and Generalized Infinitesimal Gradient Ascent.
In: Machine Learning International Conference, vol. 20(2), p. 928 (2003)
Zinkevich, M., Greenwald, A., Littman, M.: Cyclic equilibria in Markov games. In: Advances
in Neural Information Processing Systems, vol. 18, p. 1641 (2006)
Chapter 15
Decentralized POMDPs
Frans A. Oliehoek
15.1 Introduction
Previous chapters generalized decision making to multiple agents (Chapter 14) and
to acting under state uncertainty as in POMDPs (Chapter 12). This chapter gen-
eralizes further by considering situations with both state uncertainty and multiple
agents. In particular, it focuses on teams of collaborative agents: the agents share a
single objective. Such settings can be formalized by the framework of decentralized
POMDPs (Dec-POMDPs) (Bernstein et al, 2002) or the roughly equivalent multi-
agent team decision problem (Pynadath and Tambe, 2002). The basic idea of this
model is illustrated in Figure 15.1, which depicts the two-agent case. At each stage,
the agents independently take an action. The environment undergoes a state transi-
tion and generates a reward depending on the state and the actions of both agents.
Finally, each agent receives an individual observation of the new state.
Frans A. Oliehoek
CSAIL, Massachusetts Institute of Technology
e-mail: [email protected]
[Figure 15.1 (schematic): each agent independently selects an action (a1 , a2 ); the environment transitions according to Pr(s′ |s,a) and generates a reward R(s,a); each agent receives an individual observation (o1 , o2 ).]
This framework allows modeling important real-world tasks for which the mod-
els in the previous chapters do not suffice. An example of such a task is load bal-
ancing among queues (Cogill et al, 2004). Each agent represents a processing unit
with a queue that has to decide whether to accept new jobs or pass them to an-
other queue, based only on the local observations of its own queue size and that of
immediate neighbors. Another important application area for Dec-POMDPs is com-
munication networks. For instance, consider a packet routing task in which agents
are routers that have to decide at each time step to which neighbor to send each
packet in order to minimize the average transfer time of packets (Peshkin, 2001).
An application domain that receives much attention in the Dec-POMDP community
is that of sensor networks (Nair et al, 2005; Varakantham et al, 2007; Kumar and
Zilberstein, 2009). Other areas of interest are teams of robotic agents (Becker et al,
2004b; Emery-Montemerlo et al, 2005; Seuken and Zilberstein, 2007a) and crisis
management (Nair et al, 2003a,b; Paquet et al, 2005).
Most research on multi-agent systems under partial observability is relatively
recent and has focused almost exclusively on planning—settings where the model of
the environment is given—rather than the full reinforcement learning (RL) setting.
This chapter also focuses exclusively on planning. Some pointers to RL approaches
are given at the end of the chapter.
A common assumption is that planning takes place in an off-line phase, after
which the plans are executed in an on-line phase. This on-line phase is completely
decentralized as shown in Figure 15.1: each agent receives its individual part of
the joint policy found in the planning phase1 and its individual history of actions
and observations. The off-line planning phase, however, is centralized. We assume
a single computer that computes the joint plan and subsequently distributes it to the
agents (who then merely execute the plan on-line).2
1 In some cases it is assumed that the agents are given the joint policy. This enables the
computation of a joint belief from broadcast local observations (see Section 15.5.4).
2 Alternatively, each agent runs the same planning algorithm in parallel.
Fig. 15.2 A more detailed illustration of the dynamics of a Dec-POMDP. At every stage
the environment is in a particular state. This state emits a joint observation according to the
observation model (dashed arrows) from which each agent observes its individual component
(indicated by solid arrows). Then each agent selects an action, together forming the joint
action, which leads to a state transition according to the transition model (dotted arrows).
Note that, when the wrong door is opened by one or both agents, they are attacked
by the tiger and receive a penalty. However, neither of the agents observes this attack
or the penalty (remember, the only possible observations are oHL and oHR ) and the
episode continues. Intuitively, an optimal joint policy for Dec-Tiger should specify
that the agents listen until they are certain enough to open one of the doors. At
the same time, the policy should be ‘as coordinated’ as possible, i.e., maximize the
probability of acting jointly.
In an MDP, the agent uses a policy that maps states to actions. In selecting its ac-
tion, it can ignore the history because of the Markov property. In a POMDP, the
agent can no longer observe the state, but it can compute a belief b that summarizes
the history; it is also a Markovian signal. In a Dec-POMDP, however, during execu-
tion each agent will only have access to its individual actions and observations and
there is no method known to summarize this individual history. It is not possible to
maintain and update an individual belief in the same way as in a POMDP, because
the transition and observation function are specified in terms of joint actions and
observations.3
This means that in a Dec-POMDP the agents do not have access to a Markovian
signal during execution. The consequence of this is that planning for Dec-POMDPs
involves searching the space of tuples of individual Dec-POMDP policies that map
full-length individual histories to actions. We will see later that this also means that
solving Dec-POMDPs is even harder than solving POMDPs.
15.3.1 Histories
3 Different forms of beliefs for Dec-POMDP-like settings have been considered by Nair et al
(2003c); Hansen et al (2004); Oliehoek et al (2009); Zettlemoyer et al (2009). These are not
specified over only states, but also specify probabilities over histories/policies/types/beliefs
of the other agents. The key point is that from an individual agent’s perspective just know-
ing a probability distribution over states is insufficient; it also needs to predict what actions
the other agents will take.
all stages for agent i is Θ̄ i and θ̄ i denotes an AOH from this set.4 Finally the set of
all possible joint AOHs θ̄ θ is denoted Θ̄Θ . At t = 0, the (joint) AOH is empty θ̄θ 0 = ().
Definition 15.3 (Observation history). The observation history (OH) for agent i,
ōi , is defined as the sequence of observations an agent has received. At a specific
time step t, this is:
ōti = oi1 , . . . ,oti .
The joint observation history is the OH for all agents: ōot = ⟨ōt1 , . . . , ōtn ⟩. The set of
observation histories for agent i at time t is denoted Ōti . Similar to the notation for
action-observation histories, we also use ōi ∈ Ōi and ōo ∈ ŌO.
Definition 15.4 (Action history). The action history (AH) for agent i, āi , is the
sequence of actions an agent has performed:
āti = ⟨ai0 , ai1 , . . . , ait−1 ⟩.
Notation for joint action histories and sets is analogous to that for observation
histories. Finally, note that a (joint) AOH consists of a (joint) action- and a (joint)
observation history: θ̄θ t = ⟨ōot , āat ⟩.
15.3.2 Policies
A policy π i for an agent i maps from histories to actions. In the general case, these
histories are AOHs, since they contain all information an agent has. The number of
AOHs grows exponentially with the horizon of the problem: at time step t, there are
(|Ai | × |Oi |)t possible AOHs for agent i. A policy π i assigns an action to each of
these histories. As a result, the number of possible policies π i is doubly exponential
in the horizon.
It is possible to reduce the number of policies under consideration by realizing
that many policies of the form considered above specify the same behavior. This is
illustrated by the left side of Figure 15.3: under a deterministic policy only a sub-
set of possible action-observation histories can be reached. Policies that only differ
with respect to an AOH that can never be reached manifest the same behavior. The
consequence is that in order to specify a deterministic policy, the observation his-
tory suffices: when an agent selects actions deterministically, it will be able to infer
what action it took from only the observation history. This means that a determinis-
tic policy can conveniently be represented as a tree, as illustrated by the right side
of Figure 15.3.
4 In a particular Dec-POMDP, it may be the case that not all of these histories can actually
be realized, because of the probabilities specified by the transition and observation model.
Fig. 15.3 A deterministic policy can be represented as a tree. Left: a tree of action-observation
histories θ̄ i for one of the agents from the Dec-Tiger problem. A deterministic policy π i is
highlighted. Clearly shown is that π i only reaches a subset of histories θ̄ i . (θ̄ i that are not
reached are not further expanded.) Right: The same policy can be shown in a simplified
policy tree. When both agents execute this policy in the h = 3 Dec-Tiger problem, the joint
policy is optimal.
For a deterministic policy, π i (θ̄ i ) denotes the action that it specifies for the obser-
vation history contained in θ̄ i . For instance, let θ̄ i = ⟨ōi , āi ⟩, then π i (θ̄ i ) = π i (ōi ).
We use π = ⟨π 1 , . . . , π n ⟩ to denote a joint policy. We say that a deterministic joint
policy is an induced mapping from joint observation histories to joint actions
π : ŌO → A . That is, the mapping is induced by individual policies π i that make up
the joint policy. Note, however, that only a subset of possible mappings f : ŌO → A
correspond to valid joint policies: when f does not specify the same individual ac-
tion for each ōi of each agent i, it will not be possible to execute f in a decentralized
manner. That is, such a policy is centralized: it would describe that an agent should
base its choice of action on the joint history. However, during execution it will only
be able to observe its individual history, not the joint history.
Agents can also execute stochastic policies, but we restrict our attention to deter-
ministic policies without sacrificing optimality, since a finite-horizon Dec-POMDP
has at least one optimal pure joint policy (Oliehoek et al, 2008b).
Policies specify actions for all stages of the Dec-POMDP. A common way to repre-
sent the temporal structure in a policy is to split it into decision rules δ i that specify
the policy for each stage. An individual policy is then represented as a sequence of
decision rules π i = (δ0i , . . . , δh−1 i ). In case of a deterministic policy, the form of the
decision rule for stage t is a mapping from length-t observation histories to actions
δti : Ōti → Ai . In the more general case its domain is the set of AOHs δti : Θ̄ti → Ai .
A joint decision rule δ t = ⟨δt1 , . . . , δtn ⟩ specifies a decision rule for each agent.
We will also consider policies that are partially specified with respect to time.
Formally, ϕ t = (δ 0 , . . . ,δ t−1 ) denotes the past joint policy at stage t, which is a
partial joint policy specified for stages 0,...,t − 1. By appending a joint decision rule
for stage t, we can ‘grow’ such a past joint policy.
A future policy ψti of agent i specifies all the future behavior relative to stage t. That
is, ψti = (δt+1 i , . . . , δh−1 i ). We also consider future joint policies ψ t = (δ t+1 , . . . , δ h−1 ).
Fig. 15.4 Structure of a policy for an agent with actions {a,ǎ} and observations {o,ǒ}. A
policy π i can be divided into decision rules δ i or sub-tree policies qi .
Joint policies differ in how much reward they can expect to accumulate, which will
serve as a criterion of their quality. Formally, we consider the expected cumulative
reward of a joint policy, also referred to as its value.
This expectation can be computed using a recursive formulation. For the last
stage t = h − 1, the value is given simply by the immediate reward, $V^{\boldsymbol{\pi}}(s_{h-1},\bar{\boldsymbol{o}}_{h-1}) = R\big(s_{h-1},\boldsymbol{\pi}(\bar{\boldsymbol{o}}_{h-1})\big)$, while for earlier stages it is given by
$$ V^{\boldsymbol{\pi}}(s_{t},\bar{\boldsymbol{o}}_{t}) = R\big(s_{t},\boldsymbol{\pi}(\bar{\boldsymbol{o}}_{t})\big) + \sum_{s_{t+1}\in S}\ \sum_{\boldsymbol{o}\in\boldsymbol{O}} P\big(s_{t+1},\boldsymbol{o}\mid s_{t},\boldsymbol{\pi}(\bar{\boldsymbol{o}}_{t})\big)\, V^{\boldsymbol{\pi}}(s_{t+1},\bar{\boldsymbol{o}}_{t+1}). \qquad (15.4) $$
Here, the probability is simply the product of the transition and observation prob-
abilities P(s′ ,oo|s,aa) = P(oo|aa,s′ )P(s′ |s,aa). In essence, fixing the joint policy trans-
forms the Dec-POMDP to a Markov chain with states (st ,ōot ). Evaluating this equa-
tion via dynamic programming will result in the value for all (s0 ,ōo0 )-pairs. The value
V (π ) is then given by weighting these pairs according to the initial state distribution
I. Note that given a fixed joint policy π , a history ōot actually induces a joint sub-tree
$$ P(s_{t},\bar{\boldsymbol{\theta}}_{t}\mid I,\boldsymbol{\varphi}_{t}) = \sum_{s_{t-1}\in S}\ \sum_{\boldsymbol{a}_{t-1}\in\boldsymbol{A}} P\big(s_{t},\boldsymbol{o}_{t}\mid s_{t-1},\boldsymbol{a}_{t-1}\big)\, P\big(\boldsymbol{a}_{t-1}\mid\bar{\boldsymbol{\theta}}_{t-1},\boldsymbol{\varphi}_{t}\big)\, P\big(s_{t-1},\bar{\boldsymbol{\theta}}_{t-1}\mid I,\boldsymbol{\varphi}_{t}\big) $$
This section gives an overview of methods proposed for finding exact and approx-
imate solutions for finite-horizon Dec-POMDPs. For the infinite-horizon problem,
which is significantly different, some pointers are provided in Section 15.5.
Because there exists an optimal deterministic joint policy for a finite-horizon Dec-
POMDP, it is possible to enumerate all joint policies, evaluate them as described
in Section 15.3.4 and choose the best one. However, the number of such joint
policies is
†h
n(|O | −1)
O |A† | |O† |−1 ,
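To make this blow-up concrete, the sketch below enumerates deterministic individual policies as mappings from observation histories to actions and evaluates every joint policy with an assumed black-box evaluate routine (standing in for the evaluation of Section 15.3.4). It is only practical for toy horizons; all names are illustrative.

from itertools import product

def individual_policies(actions, observations, horizon):
    """Enumerate all deterministic policies of one agent as mappings from
    observation histories (tuples of past observations) to actions."""
    histories = [h for t in range(horizon) for h in product(observations, repeat=t)]
    for choice in product(actions, repeat=len(histories)):
        yield dict(zip(histories, choice))

def brute_force(evaluate, actions, observations, n_agents, horizon):
    """Evaluate every joint policy (one policy per agent) and keep the best.
    `evaluate` is an assumed black box returning V(joint_policy) for the initial belief."""
    best_value, best_joint = float("-inf"), None
    per_agent = list(individual_policies(actions, observations, horizon))
    for joint in product(per_agent, repeat=n_agents):
        v = evaluate(joint)
        if v > best_value:
            best_value, best_joint = v, joint
    return best_value, best_joint

The number of per-agent policies generated here matches the count above: one action choice for each of the (|O|^h − 1)/(|O| − 1) observation histories.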
Theorem 15.1 (Dec-POMDP complexity). The problem of finding the optimal so-
lution for a finite-horizon Dec-POMDP with n ≥ 2 is NEXP-complete.
NEXP is the class of problems that can be solved in non-deterministic exponential time. Non-
deterministic means that, similar to NP, it requires generating a guess about the
Joint Equilibrium based Search for Policies (JESP) (Nair et al, 2003c) is a method
that is guaranteed to find a locally optimal joint policy, more specifically, a Nash
equilibrium: a tuple of policies such that for each agent i its policy π i is a best re-
sponse for the policies employed by the other agents π −i . It relies on a process called
alternating maximization. This is a procedure that computes a policy π i for an agent
i that maximizes the joint reward, while keeping the policies of the other agents
fixed. Next, another agent is chosen to maximize the joint reward by finding its best
response. This process is repeated until the joint policy converges to a Nash equilib-
rium, which is a local optimum. This process is also referred to as hill-climbing or
coordinate ascent. Note that the local optimum reached can be arbitrarily bad. For
instance, if agent 1 opens the left (aOL ) door right away in the Dec-Tiger problem,
the best response for agent 2 is to also select aOL . To reduce the impact of such bad
local optima, JESP can use random restarts.
JESP uses a dynamic programming approach to compute the best-response pol-
icy for a selected agent i. In essence, fixing π −i allows for a reformulation of the
problem as an augmented POMDP. In this augmented POMDP a state š = ⟨s, ō−i ⟩
consists of a nominal state s and the observation histories of the other agents ō−i .
Given the fixed deterministic policies of other agents π −i , such an augmented state
š is Markovian and all transition and observation probabilities can be derived from
π −i and the transition and observation model of the original Dec-POMDP.
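The following sketch shows only the alternating-maximization loop; best_response(i, joint) and evaluate(joint) are assumed external routines (the former standing in for the augmented-POMDP solution described above), and all names are illustrative.

def jesp(initial_joint_policy, best_response, evaluate, max_iters=100):
    """Alternating maximization: repeatedly replace one agent's policy by a best
    response against the others until no agent can improve."""
    joint = list(initial_joint_policy)
    value = evaluate(joint)
    for _ in range(max_iters):
        improved = False
        for i in range(len(joint)):
            candidate = list(joint)
            candidate[i] = best_response(i, joint)
            new_value = evaluate(candidate)
            if new_value > value + 1e-12:
                joint, value, improved = candidate, new_value, True
        if not improved:
            break                 # Nash equilibrium: no agent can improve unilaterally
    return joint, value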
This section describes an approach more in line with methods for single agent MDPs
and POMDPs: we identify an optimal value function Q∗ that can be used to derive
an optimal policy. Even though computation of Q∗ itself is intractable, the insight
it provides is valuable. In particular, it has a clear relation with the two dominant
approaches to solving Dec-POMDPs: the forward and the backward approach which
will be explained in the following subsections.
Fig. 15.5 Tree of joint AOHs θ̄θ for a fictitious 2-agent Dec-POMDP with actions {a1 , ǎ1 },
{a2 , ǎ2 } and observations {o1 , ǒ1 }, {o2 , ǒ2 }. Given I, the AOHs induce a 'joint belief' b(s)
over states. Solid lines represent joint actions and dashed lines joint observations. Due to the
size of the tree it is only partially shown. Highlighted joint actions represent a joint policy.
Given a joint sub-tree policy at a node (the action choices made in the sub-tree below it), the
value is given by (15.8). However, action choices are not independent in different parts of the
tree: e.g., the two nodes marked † have the same θ̄ 1 and therefore should specify the same
sub-tree policy for agent 1.
Let us start by considering Figure 15.5, which illustrates a tree of joint AOHs. For
a particular joint AOH (a node in Figure 15.5), we can try to determine which joint
sub-tree policy q τ =k is optimal. Recall that V (st , q τ =k ), the value of q τ =k starting
from st , is specified by (15.5). Also, let b(s) ≜ P(s | I, θ̄θ t ) be the joint belief corre-
sponding to θ̄θ t , which can be computed using Bayes' rule in the same way as the
POMDP belief update (see Chapter 12). Given an initial belief I and joint AOH θ̄θ t ,
we can compute a value for each joint sub-tree policy q τ =k that can be used from
that node onward via
Therefore one would hope that a dynamic programming approach would be pos-
sible, where, for each θ̄θ t one could choose the maximizing q τ =k . Unfortunately,
running such a procedure on the entire tree is not possible because of the decentral-
ized nature of a Dec-POMDP: it is not possible to choose maximizing joint sub-tree
policies q τ =k independently, since this could lead to a centralized joint policy.
The consequence is that, even though (15.8) can be used to compute the value for
a (θ̄θ t ,qq τ =k )-pair, it does not directly help to optimize the joint policy, because we
cannot reason about parts of the joint AOH tree independently. Instead, one should
decide what sub-tree policies to select by considering all θ̄θ t of an entire stage t at
the same time, assuming a past joint policy ϕ t . That is, when we assume we have
computed V (I, θ̄θ t , q τ =k ) for all θ̄θ t and for all q τ =k , then we can compute a special
form of joint decision rule Γt = ⟨Γt i ⟩i∈D for stage t. Here, the individual decision
rules map individual histories to individual sub-tree policies Γt i : Θ̄ti → Qiτ =k . The
optimal Γt satisfies:
$$ \Gamma_{t}^{*} = \arg\max_{\Gamma_{t}} \sum_{\bar{\boldsymbol{\theta}}_{t} \in \bar{\boldsymbol{\Theta}}_{t}} P(\bar{\boldsymbol{\theta}}_{t}\mid I,\boldsymbol{\varphi}_{t})\, V\big(I,\bar{\boldsymbol{\theta}}_{t},\Gamma_{t}(\bar{\boldsymbol{\theta}}_{t})\big), \qquad (15.10) $$
where Γt (θ̄θ t ) = ⟨Γt i (θ̄ti )⟩i∈D denotes the joint sub-tree policy q τ =k resulting from ap-
plication of the individual decision rules and the probability is a marginal of (15.6).
This equation clearly illustrates that the optimal joint policy at a stage t of a Dec-
POMDP depends on ϕ t , the joint policy followed up to stage t. Moreover, there are
additional complications that make (15.10) impractical to use:
1. It sums over joint AOHs, the number of which is exponential in both the number
of agents and t.
2. It assumes computation of V (I,θ̄θ t ,qqτ =k ) for all θ̄θ t , for all q τ =k .
3. The number of Γt to be evaluated is $O\big(|Q^{\dagger}_{\tau=k}|^{\,n|\bar{\Theta}^{\dagger}_{t}|}\big)$, where '†' denotes the largest
set. $|Q^{\dagger}_{\tau=k}|$ is doubly exponential in k and $|\bar{\Theta}^{\dagger}_{t}|$ is exponential in t. Therefore the
number of Γt is doubly exponential in h = t + k.
Note that by restricting our attention to deterministic ϕ t it is possible to reformulate
(15.10) as a summation over OHs, rather than AOHs (this involves adapting V to
take ϕ t as an argument). However, for such a reformulation, the same complications
hold.
This section shifts the focus back to regular decision rules δ i —as introduced in
Section 15.3.3—that map from OHs (or AOHs) to actions. We will specify a value
function that quantifies the expected value of taking actions as specified by δ t and
Fig. 15.6 Computation of Q∗ . The dashed ellipse indicates the optimal decision rule δ ∗2
for stage t = 2, given that ϕ 2 = ϕ 1 ◦ δ 1 is followed for the first two stages. The entries
Q∗ (I, θ̄θ 1 , ϕ 1 , δ 1 ) are computed by propagating relevant Q∗ -values of the next stage. For in-
stance, the Q∗ -value under ϕ 2 for the highlighted joint history θ̄θ 1 = ⟨(ǎ1 , ǒ1 ), (a2 , o2 )⟩ is
computed by propagating the values of the four successor joint histories, as per (15.12).
continuing optimally afterward. That is, we replace the value of sub-trees in (15.10)
by the optimal value of decision rules. The optimal value function for a finite-
horizon Dec-POMDP is defined as follows.
Theorem 15.2 (Optimal Q∗ ). The optimal Q-value function Q∗ (I,ϕ t ,θ̄θ t ,δ t ) is a
function of the initial state distribution and joint past policy, AOH and decision
rule. For the last stage, it is given by
$$ Q^{*}(I,\boldsymbol{\varphi}_{h-1},\bar{\boldsymbol{\theta}}_{h-1},\boldsymbol{\delta}_{h-1}) = R\big(\bar{\boldsymbol{\theta}}_{h-1},\boldsymbol{\delta}_{h-1}(\bar{\boldsymbol{\theta}}_{h-1})\big), \qquad (15.11) $$
while for all earlier stages it is given by
$$ Q^{*}(I,\boldsymbol{\varphi}_{t},\bar{\boldsymbol{\theta}}_{t},\boldsymbol{\delta}_{t}) = R\big(\bar{\boldsymbol{\theta}}_{t},\boldsymbol{\delta}_{t}(\bar{\boldsymbol{\theta}}_{t})\big) + \sum_{\boldsymbol{o}} P\big(\boldsymbol{o}\mid\bar{\boldsymbol{\theta}}_{t},\boldsymbol{\delta}_{t}(\bar{\boldsymbol{\theta}}_{t})\big)\, Q^{*}\big(I,\boldsymbol{\varphi}_{t+1},\bar{\boldsymbol{\theta}}_{t+1},\boldsymbol{\delta}^{*}_{t+1}\big), \qquad (15.12) $$
with $\boldsymbol{\varphi}_{t+1} = \boldsymbol{\varphi}_{t} \circ \boldsymbol{\delta}_{t}$, $\bar{\boldsymbol{\theta}}_{t+1} = (\bar{\boldsymbol{\theta}}_{t},\boldsymbol{\delta}_{t}(\bar{\boldsymbol{\theta}}_{t}),\boldsymbol{o})$ and
$$ \boldsymbol{\delta}^{*}_{t+1} = \arg\max_{\boldsymbol{\delta}_{t+1}} \sum_{\bar{\boldsymbol{\theta}}_{t+1} \in \bar{\boldsymbol{\Theta}}_{t+1}} P(\bar{\boldsymbol{\theta}}_{t+1}\mid I,\boldsymbol{\varphi}_{t+1})\, Q^{*}(I,\boldsymbol{\varphi}_{t+1},\bar{\boldsymbol{\theta}}_{t+1},\boldsymbol{\delta}_{t+1}). \qquad (15.13) $$
Proof. Because of (15.11), application of (15.13) for the last stage will maximize
the expected reward and thus is optimal. Equation (15.12) propagates these optimal
values to the preceding stage. Optimality for all stages follows by induction.
Note that ϕ t is necessary in order to compute δ ∗t+1 , the optimal joint decision rule
at the next stage, because (15.13) requires ϕ t+1 and thus ϕ t .
The above equations constitute a dynamic program. When assuming that only
deterministic joint past policies ϕ can be used, the dynamic program can be eval-
uated from the end (t = h − 1) to the beginning (t = 0). Figure 15.6 illustrates the
where Q∗ is defined as
By expanding this definition of Q∗ using (15.12), one can verify that it indeed has
the regular interpretation of the expected immediate reward induced by first taking
‘action’ δ t plus the cumulative reward of continuing optimally afterward (Oliehoek,
2010).
5 Note that performing the maximization in (15.13) has already been done and can be
cached.
(a) Forward-sweep policy computation (FSPC) can be used with Q∗ or a heuristic Q. (b) (Generalized) MAA∗ performs backtracking and hence is only useful with (admissible) heuristics Q.
Fig. 15.7 Forward approach to Dec-POMDPs
In the CBG agents use policies that map from their individual AOHs to actions.
That is, a policy of an agent i for a CBG corresponds to a decision rule δti for the
Dec-POMDP. The solution of the CBG is the joint decision rule δ t that maximizes
the expected payoff with respect to Q:
$$ \hat{\boldsymbol{\delta}}^{*}_{t} = \arg\max_{\boldsymbol{\delta}_{t}} \sum_{\bar{\boldsymbol{\theta}}_{t}\in\bar{\boldsymbol{\Theta}}_{t}} P(\bar{\boldsymbol{\theta}}_{t}\mid I,\boldsymbol{\varphi}_{t})\, Q\big(\bar{\boldsymbol{\theta}}_{t},\boldsymbol{\delta}_{t}(\bar{\boldsymbol{\theta}}_{t})\big). \qquad (15.15) $$
While the CBG for a stage is fully specified given I,ϕ t and Q, it is not obvious how
to choose Q. Here we discuss this issue.
Note that, for the last stage t = h − 1, δ̂ ∗t has a close relation6 with the optimal
decision rule selected by (15.13): if for the last stage the heuristic specifies the im-
mediate reward Q(θ̄θ t ,aa) = R(θ̄θ t ,aa), both will select the same actions. That is, in this
case δ̂ ∗t = δ ∗t .
While for other stages it is not possible to specify such a strong correspondence,
note that FSPC via CBGs is not sub-optimal per se: It is possible to compute a value
function of the form Qπ (θ̄θ t ,aa) for any π . Doing this for a π ∗ yields Qπ ∗ and when
using the latter as the payoff functions for the CBGs, FSPC is exact (Oliehoek et al,
2008b).7
However, the practical value of this insight is limited, since it requires knowing
an optimal policy to start with. In practice, research has considered using an ap-
proximate value function. For instance, it is possible to compute the value function
QM (s,aa ) of the ‘underlying MDP’: the MDP with the same transition and reward
function as the Dec-POMDP (Emery-Montemerlo et al, 2004; Szer et al, 2005).
This can be used to compute Q( θ̄θ t ,aa) = ∑s b (s)QM (s,aa ), which can be used as the
payoff function for the CBGs. This is called QMDP . Similarly, it is possible to use
the value function of the ‘underlying POMDP’ (QPOMDP ) (Roth et al, 2005b; Szer
et al, 2005), or the value function of the problem with 1-step delayed communication
(QBG ) (Oliehoek and Vlassis, 2007).
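A minimal sketch of the resulting payoff function is given below, assuming QM has been precomputed by solving the underlying MDP over joint actions; the container layouts are illustrative assumptions.

def q_mdp_heuristic(joint_belief, QM, joint_actions):
    """Underlying-MDP heuristic for a CBG payoff function:
    Q(theta, a) = sum_s b(s) * Q_M(s, a), where b is the joint belief induced by the
    joint AOH theta and QM[(s, a)] holds the precomputed underlying-MDP values."""
    return {a: sum(p * QM[(s, a)] for s, p in joint_belief.items())
            for a in joint_actions}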
A problem in FSPC is that (15.15) still maximizes over δ t that map from histo-
ries to actions; the number of such δ t is doubly exponential in t. There are two main
approaches to gain leverage. First, the maximization in (15.15) can be performed
6 Because Q∗ is a function of ϕ t and δ t , (15.13) has a slightly different form than (15.15).
The former technically does not correspond to a CBG, while the latter does.
7 There is a subtle but important difference between Qπ ∗ (θ̄θ t ,aa) and Q∗ (I,ϕ t ,θ̄θ t ,δ t ): the
latter specifies the optimal value given any past joint policy ϕ t while the former only
specifies optimal value given that π ∗ is actually being followed.
15.4.4.3 Multi-Agent A*
Since FSPC using Q can be seen as a single trace in a search tree, a natural idea is
to allow for back-tracking and perform a full heuristic search as in multi-agent A∗
(MAA∗ ) (Szer et al, 2005), illustrated in Figure 15.7b.
MAA∗ performs an A∗ search over past joint policies ϕ t . It computes a heuris-
tic value V (ϕ t ) by taking V 0...t−1 (ϕ t ), the actual expected reward over the first t
stages, and adding V t...h−1 , a heuristic value for the remaining h − t stages. When
the heuristic is admissible—a guaranteed overestimation—so is V (ϕ t ). MAA∗ per-
forms standard A∗ search (Russell and Norvig, 2003): it maintains an open list P of
partial joint policies ϕ t and their heuristic values V (ϕ t ). On every iteration MAA∗
selects the highest ranked ϕ t and expands it, generating and heuristically evaluating
all ϕ t+1 = ϕ t ◦ δ t and placing them in P. When using an admissible heuristic, the
heuristic values V (ϕ t+1 ) of the newly expanded policies are an upper bound to the
true values and any lower bound v that has been found can be used to prune P. The
search ends when the list becomes empty, at which point an optimal fully specified
joint policy has been found.
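The sketch below shows only the search loop; expand, heuristic_value, is_full and true_value are assumed external routines, and the pruning follows the description above (nodes whose upper bound cannot beat the incumbent are discarded). Names and data structures are illustrative.

import heapq

def maa_star(root_policy, expand, heuristic_value, is_full, true_value):
    """Best-first search over partial joint policies with an admissible heuristic."""
    open_list = [(-heuristic_value(root_policy), 0, root_policy)]   # max-heap via negation
    counter, best_value, best_policy = 1, float("-inf"), None
    while open_list:
        neg_h, _, phi = heapq.heappop(open_list)
        if -neg_h <= best_value:
            continue                    # upper bound no better than incumbent: prune
        for child in expand(phi):       # children phi ∘ delta_t
            if is_full(child):
                v = true_value(child)
                if v > best_value:
                    best_value, best_policy = v, child
            else:
                h_val = heuristic_value(child)
                if h_val > best_value:  # keep only nodes that can still improve
                    heapq.heappush(open_list, (-h_val, counter, child))
                    counter += 1
    return best_policy, best_value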
There is a direct relation between MAA∗ and the optimal value functions de-
scribed in the previous section: V ∗ given by (15.14) is the optimal heuristic V t...h−1
(note that V ∗ only specifies reward from stage t onward).
MAA∗ suffers from the same problem as FSPC via CBGs: the number of δ t
grows doubly exponential with t, which means that the number of children of a
node grows doubly exponential in its depth. In order to mitigate the problem, it is
possible to apply lossless clustering (Oliehoek et al, 2009), or to try and avoid the
expansion of all child nodes by incrementally expanding nodes only when needed
(Spaan et al, 2011).
Even though Figure 15.7 suggests a clear relation between FSPC and MAA∗ , it is
not directly obvious how they relate: the former solves CBGs, while the latter per-
forms heuristic search. Generalized MAA∗ (GMAA∗ ) (Oliehoek et al, 2008b) uni-
fies these two approaches by making explicit the ‘Expand’ operator.
Algorithm 26 shows GMAA∗ . When the Select operator selects the highest
ranked ϕ t and when the Expand operator works as described for MAA∗ , GMAA∗
simply is MAA∗ . Alternatively, the Expand operator can construct a CBG B(I,ϕ t )
for which all joint CBG-policies δ t are evaluated. These can then be used to
construct a new set of partial policies Φ Expand = {ϕ t ◦ δ t } and their heuristic
values. This corresponds to MAA∗ reformulated to work on CBGs. It can be shown
that when using a particular form of Q (including the mentioned heuristics QMDP ,
QPOMDP and QBG ), the approaches are identical (Oliehoek et al, 2008b). GMAA∗
can also use an Expand operator that does not construct all new partial policies, but
only the best-ranked one, Φ Expand = {ϕ t ◦ δ t∗ }. As a result the open list P will
never contain more than one partial policy and behavior reduces to FSPC. A gener-
alization called k-GMAA∗ constructs the k best-ranked partial policies, allowing to
trade off computation time and solution quality. Clustering of histories can also be
applied in GMAA∗ , but only lossless clustering will preserve optimality.
The forward approach to Dec-POMDPs incrementally builds policies from the first
stage t = 0 to the last t = h − 1. Prior to doing this, a Q-value function (optimal
Q∗ or approximate Q) needs to be computed. This computation itself, the dynamic
program represented in Theorem 15.2, starts with the last stage and works its way
back. The resulting optimal values correspond to the expected values of a joint de-
cision rule and continuing optimally afterwards. That is, in the light of (15.10) this
can be interpreted as the computation of the value for a subset of optimal (useful)
joint sub-tree policies.
This section treats dynamic programming (DP) for Dec-POMDPs (Hansen,
Bernstein, and Zilberstein, 2004). This method also works backwards, but rather
than computing a Q-value function, it directly computes a set of useful sub-tree
policies.
The core idea of DP is to incrementally construct sets of longer sub-tree policies for the agents: starting with a set of one-step-to-go (τ = 1) sub-tree policies (actions) that can be executed at the last stage, construct a set of 2-step policies to be executed at h − 2, etc. That is, DP constructs Q^i_{τ=1}, Q^i_{τ=2}, ..., Q^i_{τ=h} for all agents i. When the last backup step is completed, the optimal policy can be found by evaluating all induced joint policies π ∈ Q^1_{τ=h} × ··· × Q^n_{τ=h} for the initial belief I as described in Section 15.3.4.
Fig. 15.8 Difference between policy construction in MAA∗ (left) and dynamic programming
(right) for an agent with actions a,ǎ and observations o,ǒ. Dashed components are newly
generated, dotted components result from the previous iteration.
DP formalizes this idea using backup operations that construct Q^i_{τ=k+1} from Q^i_{τ=k}. For instance, the right side of Figure 15.8 shows how q^i_{τ=3}, a 3-steps-to-go sub-tree policy, is constructed from two q^i_{τ=2} ∈ Q^i_{τ=2}. In general, a one-step-extended policy q^i_{τ=k+1} is created by selecting a sub-tree policy for each observation and an action for the root. An exhaustive backup generates all possible q^i_{τ=k+1} that have policies from the previously generated set q^i_{τ=k} ∈ Q^i_{τ=k} as their sub-trees. We will denote the sets of sub-tree policies resulting from exhaustive backup for each agent i by Q^{e,i}_{τ=k+1}.
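A minimal sketch of the exhaustive backup for one agent, representing a sub-tree policy as a pair (root action, mapping from observations to shorter sub-tree policies); all names are illustrative:

from itertools import product

def exhaustive_backup(actions, observations, subtrees):
    """Return all (k+1)-steps-to-go sub-tree policies of one agent.

    actions      : the agent's individual actions
    observations : the agent's individual observations
    subtrees     : the set Q^i_{tau=k} of k-steps-to-go sub-tree policies
    A sub-tree policy is a pair (root_action, {observation: k-step sub-tree}).
    """
    backed_up = []
    for a in actions:
        # every assignment of an existing sub-tree to every observation
        for chosen in product(subtrees, repeat=len(observations)):
            backed_up.append((a, dict(zip(observations, chosen))))
    return backed_up

The growth discussed next is visible directly in this sketch: the backup produces |A_i| · |Q^i_{τ=k}|^{|O_i|} policies.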
Unfortunately, the sets of sub-tree policies maintained in this way grow doubly exponentially with k (since the q^i_{τ=k} are essentially full policies for the horizon-k problem, their number is doubly exponential in k). To counter this source of intractability, it is possible to prune dominated sub-tree policies from Q^{e,i}_{τ=k}, resulting in smaller maintained sets Q^{m,i}_{τ=k} (Hansen et al, 2004). The value of a q^i_{τ=k} depends on the probability distribution over states when it is started (at stage t = h − k) as well as the probability with which the other agents j ≠ i select their sub-tree policies. Therefore, a q^i_{τ=k} is dominated if it is not maximizing at any point in the multi-agent belief space: the simplex over S × Q^{m,−i}_{τ=k}. It is possible to test for dominance by linear programming. Removal of a dominated sub-tree policy q^i_{τ=k} of an agent i may cause a q^j_{τ=k} of another agent j to become dominated. Therefore DP iterates over agents until no further pruning is possible.
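The dominance test can be written as a small linear program, in the same spirit as vector pruning for POMDPs: a candidate can be pruned if no point of the simplex over S × Q^{m,−i}_{τ=k} gives it a strictly positive advantage over all other maintained policies. A sketch using scipy (the value array V, with one row per corner of that simplex and one column per candidate policy, is assumed to be given):

import numpy as np
from scipy.optimize import linprog

def is_dominated(q, policies, V, tol=1e-9):
    """Test whether sub-tree policy q is (weakly) dominated.

    policies : column indices of V (q is one of them)
    V[x, p]  : value of policy p at corner x of the multi-agent belief simplex.
    The LP searches for a belief b over the corners and a margin eps with
        b . V[:, q] >= b . V[:, p] + eps   for every other policy p,
    maximizing eps; if the best eps is <= 0, q is never strictly better
    anywhere and can be pruned.
    """
    others = [p for p in policies if p != q]
    if not others:
        return False
    n_points = V.shape[0]
    c = np.zeros(n_points + 1)
    c[-1] = -1.0                                      # maximize eps
    A_ub = np.hstack([(V[:, others] - V[:, [q]]).T,   # b.(V_p - V_q) + eps <= 0
                      np.ones((len(others), 1))])
    b_ub = np.zeros(len(others))
    A_eq = np.concatenate([np.ones(n_points), [0.0]]).reshape(1, -1)  # sum_x b(x) = 1
    b_eq = [1.0]
    bounds = [(0, None)] * n_points + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return bool(res.success) and (-res.fun) <= tol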
A drawback of DP is that it must compute and store the values of all maintained sub-tree policies for all states, which quickly amounts to an enormous number of values. As such, DP runs out of memory well before it runs out of time. In order to address this problem, Boularias and Chaib-draa (2008) represent these values
more compactly by making use of a sequence form (Koller et al, 1994) representa-
tion. A disadvantage, however, is that this approach can lead to keeping dominated policies. As such, there is a trade-off between the space required to store the values for all sub-tree policies and the number of sub-tree policies that are maintained.
15.4.5.2 Point-Based DP
DP only removes q^i_{τ=k} that are not maximizing at any point in the multi-agent belief space. Point-based DP (PBDP) (Szer and Charpillet, 2006) proposes to improve pruning of the set Q^{e,i}_{τ=k} by considering only a subset B^i ⊂ P(S × Q^{−i}_{τ=k}) of reachable multi-agent belief points. Only those q^i_{τ=k} that maximize the value at some b^i ∈ B^i are kept. The definition of reachable is slightly involved.

Definition 15.9. A multi-agent belief point b^i_t is reachable if there exists a probability distribution P(s_t, θ̄_t^{−i} | I, ϕ t) (for any deterministic ϕ t) and an induced mapping Γ_t^{−i} = ⟨Γ_t^j⟩_{j≠i} with Γ_t^j : Θ̄_t^j → Q^j_{τ=k} that result in b^i_t.

That is, a belief point b^i is reachable if there is a past joint policy that will result in the appropriate distribution over states and AOHs of other agents such that, when combined with a mapping of those AOHs to sub-tree policies, b^i is the resulting distribution over states and sub-tree policies.
PBDP can be understood in the light of (15.10). Suppose that the range of the Γ_t^i is restricted to the sets generated by exhaustive backup: Γ_t^i : Θ̄_t^i → Q^{e,i}_{τ=k}. Solving (15.10) for a past joint policy ϕ t will result in Γ_t^∗, which will specify, for all agents, all the useful sub-tree policies q^i_{τ=k} ∈ Q^{e,i}_{τ=k} given ϕ t. Solving (15.10) for all ϕ t will result in the set of all potentially useful q^i_{τ=k} ∈ Q^{e,i}_{τ=k}.
Given a ϕ t and a Γ_t^{−i}, (15.10) can be rewritten as a maximization from the perspective of agent i to compute its best response:

BR^i(θ̄_t^i, ϕ t, Γ_t^{−i}) = arg max_{q^i_{τ=k}} ∑_{θ̄_t^{−i}} ∑_{s_t} P(s_t, θ̄_t^{−i} | θ̄_t^i, I, ϕ t) V(s_t, ⟨Γ_t^{−i}(θ̄_t^{−i}), q^i_{τ=k}⟩).     (15.16)

That is, given ϕ t and Γ_t^{−i}, each θ̄_t^i generates a multi-agent belief point, for which (15.16) performs the maximization. The set Q^{m,i}_{τ=k} := {BR^i(θ̄_t^i, ϕ t, Γ_t^{−i})} of best responses for all ϕ t, Γ_t^{−i} and θ̄_t^i contains all non-dominated sub-tree policies, thus yielding an exact pruning step.
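A direct, enumerative reading of (15.16): given the distribution induced by ϕ t (already conditioned on agent i's own history) and a mapping Γ for the other agents, simply pick the candidate sub-tree policy with the highest expected value. The helper names below are illustrative, not part of any published implementation:

def best_response(candidates, prob, gamma_others, V):
    """Sketch of the maximization in (15.16) for agent i.

    candidates           : agent i's candidate sub-tree policies q
    prob[(s, thetas)]    : P(s, theta^{-i} | theta^i, I, phi^t)
    gamma_others[thetas] : sub-tree policies the other agents will follow
    V(s, others, q)      : value of the induced joint sub-tree policy from state s
    """
    def expected_value(q):
        return sum(p * V(s, gamma_others[thetas], q)
                   for (s, thetas), p in prob.items())
    return max(candidates, key=expected_value)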
PBDP uses the initial belief to overcome the need to test for dominance over the entire multi-agent belief space. It can also result in more pruning, since it avoids maintaining sub-tree policies that are maximizing in a part of this space that cannot be reached. Still, the operation described above is intractable because the number of (θ̄_t^i, ϕ t, Γ_t^{−i}) tuples is doubly exponential in t and because the maintained sets Q^{m,i}_{τ=k} can still grow doubly exponentially.
15.4.5.3 Memory-Bounded DP
The second approximation is that MBDP maintains sets Q^{m,i}_{τ=k} of a fixed size, M, which has two main consequences. First, the size of the candidate sets Q^{e,i}_{τ=k} formed by exhaustive backup is O(|A_†| M^{|O_†|}), which clearly does not depend on the horizon. Second, (15.17) does not have to be evaluated for all distinct joint beliefs b; rather, MBDP samples M joint belief points b on which (15.17) is evaluated. To perform this sampling, MBDP uses heuristic policies.
In order to perform the maximization in (15.17), MBDP loops over the |Q^e_{τ=k}| = O(|A_†|^n M^{n|O_†|}) joint sub-tree policies for each of the sampled belief points. To reduce the burden of this complexity, many papers have proposed new methods for
performing this point-based backup operation (Seuken and Zilberstein, 2007a; Car-
lin and Zilberstein, 2008; Dibangoye et al, 2009; Amato et al, 2009; Wu et al,
2010a). This backup corresponds to solving a CBG for each joint action (Kumar
and Zilberstein, 2010b; Oliehoek et al, 2010).
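Putting the pieces together, the overall structure of a memory-bounded backward pass might look as follows. This is only a sketch under the assumptions stated in the comments (the helpers for belief sampling, the exhaustive backup, and the point-based evaluation of (15.17) are taken as given); it is not the algorithm of Seuken and Zilberstein.

from itertools import product

def mbdp(horizon, M, agents, one_step_policies, exhaustive_backup,
         sample_beliefs, evaluate):
    """Memory-bounded DP sketch: keep at most M sub-tree policies per agent.

    one_step_policies[i]    : agent i's 1-step-to-go policies (its actions)
    exhaustive_backup(i, Q) : all one-step extensions of agent i's maintained set Q
    sample_beliefs(t, M)    : M joint beliefs for stage t, from heuristic policies
    evaluate(b, joint)      : point-based value of a joint sub-tree policy at b
    """
    maintained = {i: list(one_step_policies[i]) for i in agents}
    for k in range(1, horizon):                       # build the (k+1)-step sets
        candidates = {i: exhaustive_backup(i, maintained[i]) for i in agents}
        beliefs = sample_beliefs(horizon - 1 - k, M)  # stage at which they start
        kept = {i: [] for i in agents}
        for b in beliefs:
            # the costly point-based backup: best joint candidate at belief b
            best = max(product(*(candidates[i] for i in agents)),
                       key=lambda joint: evaluate(b, dict(zip(agents, joint))))
            for i, q in zip(agents, best):
                if q not in kept[i]:
                    kept[i].append(q)
        maintained = kept
    return maintained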
Finally, sample-based extensions have been proposed (Wu et al, 2010c,b). These
use sampling to evaluate the quantities V (s,qq τ =k ) and use particle representations
for the sampled joint beliefs.
There are a few other approaches for finite-horizon Dec-POMDPs, which we will
only briefly describe here. Aras et al (2007) proposed a mixed integer linear pro-
gramming formulation for the optimal solution of finite-horizon Dec-POMDPs,
based on representing the set of possible policies for each agent in sequence form
(Koller and Pfeffer, 1997). In this representation, a policy for an agent i is rep-
resented as a subset of the set of sequences (roughly corresponding to action-
observation histories) for the agent. As such the problem can be interpreted as a
combinatorial optimization problem and solved with a mixed integer linear
program.
The fact that solving a Dec-POMDP can be approached as a combinatorial opti-
mization problem was also recognized by approaches based on cross-entropy opti-
mization (Oliehoek et al, 2008a) and genetic algorithms (Eker and Akın, 2008).
A partially observable stochastic game (POSG) generalizes the Dec-POMDP by replacing the single shared reward function with a collection of reward functions: one for each agent. A POSG assumes self-interested agents that maximize their individual expected cumulative reward. The consequence of this is that there is no longer a simple concept of an optimal joint policy. Rather, the joint policy should be a Nash equilibrium (NE), and preferably a Pareto-optimal NE.
However, there is no clear way to identify the best one. Moreover, such an NE is only
guaranteed to exist in randomized policies (for a finite POSG), which means that it
is no longer possible to perform brute-force policy evaluation. Also, search methods
based on alternating maximization are no longer guaranteed to converge for POSGs.
The (not point-based) dynamic programming method, discussed in Section 15.4.5.1,
applies to POSGs since it finds the set of non-dominated policies for each agent.
Because of the negative complexity results for Dec-POMDPs, much research has
focused on special cases to which pointers are given below. For a more compre-
hensive overview the reader is referred to the texts by Pynadath and Tambe (2002);
Goldman and Zilberstein (2004); Seuken and Zilberstein (2008).
Some of the special cases are formed by different degrees of observability. These
range from fully- or individually observable as in a multi-agent MDP (Boutilier,
1996) to non-observable. In the non-observable case agents use open-loop policies
and solving it is easier from a complexity point of view (NP-complete, Pynadath
and Tambe 2002). Between these two extremes there are partially observable prob-
lems. One more special case has been identified, namely the jointly observable case,
where not the individual, but the joint observation identifies the true state. A jointly
observable Dec-POMDP is referred to as a Dec-MDP, which is a non-trivial sub-
class of the Dec-POMDP for which the NEXP-completeness result holds (Bernstein
et al, 2002).
Other research has tried to exploit structure in states, transitions and reward. For
instance, many approaches are based on special cases of factored Dec-POMDPs.
A factored Dec-POMDP (Oliehoek et al, 2008c) is a Dec-POMDP in which the
state space is factored, i.e., a state s = x1, . . . , xk is specified as an assignment to a number of state variables, or factors. For factored Dec-POMDPs the transition
and reward models can often be specified much more compactly by making use of
Bayesian networks and additive reward decomposition (the total reward is the sum
of a number of ‘smaller’ reward functions, specified over a subset of agents). Many
special cases have tried to exploit independence between agents by partitioning the
set of state factors into individual states si for each agent.
One such example is the transition- and observation-independent (TOI) Dec-
MDP (Becker et al, 2004b; Wu and Durfee, 2006) that assumes each agent i has
its own MDP with local states si and transitions, but that these MDPs are coupled
through certain events in the reward function: some combinations of joint actions
and joint states will cause extra reward (or penalty). This work introduced the idea
that in order to compute a best response against a policy π j , an agent i may not
need to reason about all the details of π j , but can use a more abstract representation
of the influence of π j on itself. This core idea was also used in event-driven (ED)
Dec-MDPs (Becker et al, 2004a) that model settings in which the rewards are inde-
pendent, but there are certain events that cause transition dependencies. Mostafa and
Lesser (2009) introduced the EDI-CR, a type of Dec-POMDP that generalizes TOI
and ED-Dec-MDPs. Recently the idea of abstraction has been further explored by
Witwicki and Durfee (2010), resulting in a more general formulation of influence-
based policy abstraction for a more general sub-class of the factored Dec-POMDP
called temporally decoupled Dec-POMDP (TD-POMDP) that also generalizes TOI-
and ED-Dec-MDPs (Witwicki, 2011). While much more general than TOI Dec-
MDPs (e.g., the local states of agents can overlap) the TD-POMDP is still restrictive
as it does not allow multiple agents to have direct influence on the same state factor.
Finally, there has been a body of work on networked distributed POMDPs (ND-
POMDPs) (Nair et al, 2005; Kim et al, 2006; Varakantham et al, 2007; Marecki et al,
2008; Kumar and Zilberstein, 2009; Varakantham et al, 2009). ND-POMDPs can be
understood as factored Dec-POMDPs with TOI and additively factored reward func-
tions. For this model, it was shown that the value function V (π ) can be additively
factored as well. As a consequence, it is possible to apply many ideas from dis-
tributed constraint optimization in order to optimize the value more efficiently. As
such ND-POMDPs have been shown to scale to moderate numbers (up to 20) of
agents. These results were extended to general factored Dec-POMDPs by Oliehoek
et al (2008c). In that case, the amount of independence depends on the stage of the
process; earlier stages are typically fully coupled limiting exact solutions to small
horizons and few (three) agents. Approximate solutions, however, were shown to
scale to hundreds of agents (Oliehoek, 2010).
The main focus of this chapter has been on finding solution methods for finite-
horizon Dec-POMDPs. There also has been quite a bit of work on infinite-horizon
Dec-POMDPs, some of which is summarized here.
The infinite-horizon case is substantially different from the finite-horizon case.
For instance, the infinite-horizon problem is undecidable (Bernstein et al, 2002),
which is a direct result of the undecidability of (single-agent) POMDPs over an
infinite horizon (Madani et al, 1999). This can be understood by thinking about the
representations of policies; in the infinite-horizon case the policy trees themselves
should be infinite and clearly there is no way to represent that in a finite amount of
memory.
As a result, research on infinite-horizon Dec-POMDPs has focused on approxi-
mate methods that use finite policy representations. A common choice is to use finite
state controllers (FSCs). A side-effect of limiting the amount of memory for the pol-
icy is that in many cases it can be beneficial to allow stochastic policies (Singh et al,
1994). Most research in this line of work has proposed methods that incrementally
improve the quality of the controller. For instance, Bernstein et al (2009) propose
a policy iteration algorithm that computes an ε -optimal solution by iteratively per-
forming backup operations on the FSCs. These backups, however, grow the size
of the controller exponentially. While value-preserving transformations may reduce
the size of the controller, the controllers can still grow unboundedly.
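For concreteness, a (possibly stochastic) finite state controller for a single agent can be represented by an action-selection distribution per node and a node-transition distribution per (node, observation) pair; a minimal sketch with illustrative names:

import random

class FiniteStateController:
    """Minimal stochastic FSC for a single agent.

    act[n]             : dict mapping action -> probability of taking it in node n
    transition[(n, o)] : dict mapping next_node -> probability after observing o in node n
    """
    def __init__(self, start_node, act, transition):
        self.node = start_node
        self.act = act
        self.transition = transition

    @staticmethod
    def _sample(dist):
        return random.choices(list(dist), weights=list(dist.values()))[0]

    def step(self, observation=None):
        """Update the internal node on an observation, then emit an action."""
        if observation is not None:
            self.node = self._sample(self.transition[(self.node, observation)])
        return self._sample(self.act[self.node])

Joint execution then amounts to each agent running its own controller on its private observations; bounding the number of nodes bounds the memory used by the policy.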
One idea to overcome this problem is bounded policy iteration (BPI) for Dec-
POMDPs (Bernstein et al, 2005). BPI keeps the number of nodes of the FSCs fixed
by applying bounded backups. BPI converges to a local optimum given a particular
controller size. Amato et al (2010) also consider finding an optimal joint policy
given a particular controller size, but instead propose a non-linear programming
(NLP) formulation. While this formulation characterizes the true optimum, solving
the NLP exactly is intractable. However, approximate NLP solvers have shown good
results in practice.
Finally, a recent development has been to address infinite-horizon Dec-POMDPs
via the planning-as-inference paradigm (Kumar and Zilberstein, 2010a). Pajarinen
and Peltonen (2011) extended this approach to factored Dec-POMDPs.
A next related issue is the more general setting of multi-agent reinforcement learn-
ing (MARL). That is, this chapter has focused on the task of planning given a model.
In a MARL setting however, the agents do not have access to such a model. Rather,
the model will have to be learned on-line (model-based MARL) or the agents will
have to use model-free methods. While there is a great deal of work on MARL in
general (Buşoniu et al, 2008), MARL in Dec-POMDP-like settings has received
little attention.
Probably one of the main reasons for this gap in literature is that it is hard to
properly define the setup of the RL problem in these partially observable environ-
ments with multiple agents. For instance, it is not clear when or how the agents
will observe the rewards (even in a POMDP, the agent is not assumed to have access to the immediate rewards, since they can convey hidden information about the states). Moreover, even when the agents can observe the state,
convergence of MARL is not well-understood: from the perspective of one agent,
the environment has become non-stationary since the other agent is also learning,
which means that convergence guarantees for single-agent RL no longer hold. Claus
and Boutilier (1998) argue that, in a cooperative setting, independent Q-learners are
guaranteed to converge to a local optimum, but not the optimal solution. Neverthe-
less, this method has on occasion been reported to be successful in practice (e.g.,
Crites and Barto, 1998) and theoretical understanding of convergence of individual
learners is progressing (e.g., Tuyls et al, 2006; Kaisers and Tuyls, 2010; Wunder
et al, 2010). There are coupled learning methods (e.g., Q-learning using the joint
action space) that will converge to an optimal solution (Vlassis, 2007). However,
all forms of coupled learning are precluded in the true Dec-POMDP setting: such
algorithms require either full observation of the state and actions of other agents, or
communication of all the state information.
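The contrast between independent and coupled learners is easiest to see in the update rules themselves; a schematic sketch (plain tabular Q-learning with the tables stored in defaultdicts, not tied to any particular paper cited above):

from collections import defaultdict

def independent_q_update(Q_i, s, a_i, r, s_next, actions_i, alpha=0.1, gamma=0.95):
    """Independent learner: agent i ignores what the others do, so from its
    point of view the environment is non-stationary while they keep learning."""
    target = r + gamma * max(Q_i[(s_next, b)] for b in actions_i)
    Q_i[(s, a_i)] += alpha * (target - Q_i[(s, a_i)])

def joint_q_update(Q, s, joint_a, r, s_next, joint_actions, alpha=0.1, gamma=0.95):
    """Coupled learner: Q ranges over joint actions, which requires observing
    the state and every agent's action (precluded in a true Dec-POMDP)."""
    target = r + gamma * max(Q[(s_next, b)] for b in joint_actions)
    Q[(s, joint_a)] += alpha * (target - Q[(s, joint_a)])

Q_i = defaultdict(float)       # per-agent table for an independent learner
Q_joint = defaultdict(float)   # one shared table over joint actions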
Concluding this section we will provide pointers to a few notable approaches
to RL in Dec-POMDP-like settings. Peshkin et al (2000) introduced decentralized
gradient ascent policy search (DGAPS), a method for MARL in partially observ-
able settings based on gradient descent. DGAPS represents individual policies us-
ing FSCs and assumes that agents observe the global rewards. Based on this, it is
possible for each agent to independently update its policy in the direction of the
gradient with respect to the return, resulting in a locally optimal joint policy. This
approach was extended to learn policies for self-configurable modular robots (Var-
shavskaya et al, 2008). Chang et al (2004) also consider decentralized RL assuming
that the global rewards are available to the agents. In their approach, these global re-
wards are interpreted as individual rewards, corrupted by noise due to the influence
of other agents. Each agent explicitly tries to estimate the individual reward using
Kalman filtering and performs independent Q-learning using the filtered individual
rewards. The method by Wu et al (2010b) is closely related to RL since it does not need a model as input. It does, however, need access to a simulator that can be initialized to specific states. Moreover, the algorithm itself is centralized; as such, it is not directly suitable for on-line RL.
Finally, there are MARL methods for partially observed decentralized settings
that require only limited amounts of communication. For instance, Boyan and
Littman (1993) considered decentralized RL for a packet routing problem. Their
approach, Q-routing, performs a type of Q-learning where there is only limited lo-
cal communication: neighboring nodes communicate the expected future waiting
time for a packet. Q-routing was extended to mobile wireless networks by Chang
and Ho (2004). A similar problem, distributed task allocation, is considered by Ab-
dallah and Lesser (2007). In this problem there also is a network, but now agents do
not send communication packets, but rather tasks to neighbors. Again, communica-
tion is only local. Finally, in some RL methods for multi-agent MDPs (i.e., coupled
methods) it is possible to have agents observe a subset of state factors if they have
the ability to communicate locally (Guestrin et al, 2002; Kok and Vlassis, 2006).
15.5.4 Communication
The Dec-POMDP can be extended with explicit communication actions, as in the Dec-POMDP-Com model, where the meaning of the messages is not fixed in advance. That is, in this perspective the semantics of the communication actions become part
of the optimization problem (Xuan et al, 2001; Goldman and Zilberstein, 2003;
Spaan et al, 2006; Goldman et al, 2007).
One can also consider the case where messages have fixed semantics. In such a
case the agents need a mechanism to process these semantics. For instance, when
the agents share their local observations, each agent maintains a joint belief and per-
forms an update of this joint belief, rather than maintaining the list of observations.
It was shown by Pynadath and Tambe (2002) that under cost-free communication,
a joint communication policy that shares the local observations at each stage is op-
timal. Much research has investigated sharing local observations in models similar
to the Dec-POMDP-Com (Ooi and Wornell, 1996; Pynadath and Tambe, 2002; Nair
et al, 2004; Becker et al, 2005; Roth et al, 2005b,a; Spaan et al, 2006; Oliehoek et al,
2007; Roth et al, 2007; Goldman and Zilberstein, 2008; Wu et al, 2011).
A final note is that, although models with explicit communication seem more
general than the models without, it is possible to transform the former to the latter.
That is, a Dec-POMDP-Com can be transformed to a Dec-POMDP (Goldman and
Zilberstein, 2004; Seuken and Zilberstein, 2008).
Acknowledgements. I would like to thank Leslie Kaelbling and Shimon Whiteson for the
valuable feedback they provided and the reviewers for their insightful comments. Research
supported by AFOSR MURI project #FA9550-09-1-0538.
References
Abdallah, S., Lesser, V.: Multiagent reinforcement learning and self-organization in a network
of agents. In: Proc. of the International Joint Conference on Autonomous Agents and
Multi Agent Systems, pp. 172–179 (2007)
Amato, C., Carlin, A., Zilberstein, S.: Bounded dynamic programming for decentralized
POMDPs. In: Proc. of the AAMAS Workshop on Multi-Agent Sequential Decision Mak-
ing in Uncertain Domains, MSDM (2007)
Amato, C., Dibangoye, J.S., Zilberstein, S.: Incremental policy generation for finite-horizon
DEC-POMDPs. In: Proc. of the International Conference on Automated Planning and
Scheduling, pp. 2–9 (2009)
Amato, C., Bernstein, D.S., Zilberstein, S.: Optimizing fixed-size stochastic controllers
for POMDPs and decentralized POMDPs. Autonomous Agents and Multi-Agent Sys-
tems 21(3), 293–320 (2010)
Aras, R., Dutech, A., Charpillet, F.: Mixed integer linear programming for exact finite-horizon
planning in decentralized POMDPs. In: Proc. of the International Conference on Auto-
mated Planning and Scheduling (2007)
Becker, R., Zilberstein, S., Lesser, V.: Decentralized Markov decision processes with event-
driven interactions. In: Proc. of the International Joint Conference on Autonomous Agents
and Multi Agent Systems, pp. 302–309 (2004a)
Becker, R., Zilberstein, S., Lesser, V., Goldman, C.V.: Solving transition independent decen-
tralized Markov decision processes. Journal of Artificial Intelligence Research 22, 423–
455 (2004b)
Becker, R., Lesser, V., Zilberstein, S.: Analyzing myopic approaches for multi-agent commu-
nication. In: Proc. of the International Conference on Intelligent Agent Technology, pp.
550–557 (2005)
Bernstein, D.S., Givan, R., Immerman, N., Zilberstein, S.: The complexity of decentralized
control of Markov decision processes. Mathematics of Operations Research 27(4), 819–
840 (2002)
Bernstein, D.S., Hansen, E.A., Zilberstein, S.: Bounded policy iteration for decentralized
POMDPs. In: Proc. of the International Joint Conference on Artificial Intelligence, pp.
1287–1292 (2005)
Bernstein, D.S., Amato, C., Hansen, E.A., Zilberstein, S.: Policy iteration for decentralized
control of Markov decision processes. Journal of Artificial Intelligence Research 34, 89–
132 (2009)
Boularias, A., Chaib-draa, B.: Exact dynamic programming for decentralized POMDPs with
lossless policy compression. In: Proc. of the International Conference on Automated
Planning and Scheduling (2008)
Boutilier, C.: Planning, learning and coordination in multiagent decision processes. In: Proc.
of the 6th Conference on Theoretical Aspects of Rationality and Knowledge, pp. 195–210
(1996)
Boyan, J.A., Littman, M.L.: Packet routing in dynamically changing networks: A reinforce-
ment learning approach. In: Advances in Neural Information Processing Systems, vol. 6,
pp. 671–678 (1993)
Buşoniu, L., Babuška, R., De Schutter, B.: A comprehensive survey of multi-agent reinforce-
ment learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications
and Reviews 38(2), 156–172 (2008)
Carlin, A., Zilberstein, S.: Value-based observation compression for DEC-POMDPs. In: Proc.
of the International Joint Conference on Autonomous Agents and Multi Agent Systems,
pp. 501–508 (2008)
Chang, Y.H., Ho, T.: Mobilized ad-hoc networks: A reinforcement learning approach. In:
Proceedings of the First International Conference on Autonomic Computing, pp. 240–
247 (2004)
Chang, Y.H., Ho, T., Kaelbling, L.P.: All learning is local: Multi-agent learning in global
reward games. In: Advances in Neural Information Processing Systems, vol. 16 (2004)
Claus, C., Boutilier, C.: The dynamics of reinforcement learning in cooperative multiagent
systems. In: Proc. of the National Conference on Artificial Intelligence, pp. 746–752
(1998)
Cogill, R., Rotkowitz, M., Roy, B.V., Lall, S.: An approximate dynamic programming ap-
proach to decentralized control of stochastic systems. In: Proc. of the 2004 Allerton Con-
ference on Communication, Control, and Computing (2004)
Crites, R.H., Barto, A.G.: Elevator group control using multiple reinforcement learning
agents. Machine Learning 33(2-3), 235–262 (1998)
Dibangoye, J.S., Mouaddib, A.I., Chaib-draa, B.: Point-based incremental pruning heuristic
for solving finite-horizon DEC-POMDPs. In: Proc. of the International Joint Conference
on Autonomous Agents and Multi Agent Systems, pp. 569–576 (2009)
Eker, B., Akın, H.L.: Using evolution strategies to solve DEC-POMDP problems. Soft Com-
puting - A Fusion of Foundations, Methodologies and Applications (2008)
Emery-Montemerlo, R., Gordon, G., Schneider, J., Thrun, S.: Approximate solutions for par-
tially observable stochastic games with common payoffs. In: Proc. of the International
Joint Conference on Autonomous Agents and Multi Agent Systems, pp. 136–143 (2004)
Emery-Montemerlo, R., Gordon, G., Schneider, J., Thrun, S.: Game theoretic control for
robot teams. In: Proc. of the IEEE International Conference on Robotics and Automation,
pp. 1175–1181 (2005)
Nair, R., Tambe, M., Marsella, S.C.: Team Formation for Reformation in Multiagent Domains
Like RoboCupRescue. In: Kaminka, G.A., Lima, P.U., Rojas, R. (eds.) RoboCup 2002:
Robot Soccer World Cup VI, LNCS (LNAI), vol. 2752, pp. 150–161. Springer, Heidelberg
(2003)
Nair, R., Tambe, M., Yokoo, M., Pynadath, D.V., Marsella, S.: Taming decentralized
POMDPs: Towards efficient policy computation for multiagent settings. In: Proc. of the
International Joint Conference on Artificial Intelligence, pp. 705–711 (2003c)
Nair, R., Roth, M., Yokoo, M.: Communication for improving policy computation in dis-
tributed POMDPs. In: Proc. of the International Joint Conference on Autonomous Agents
and Multi Agent Systems, pp. 1098–1105 (2004)
Nair, R., Varakantham, P., Tambe, M., Yokoo, M.: Networked distributed POMDPs: A syn-
thesis of distributed constraint optimization and POMDPs. In: Proc. of the National Con-
ference on Artificial Intelligence, pp. 133–139 (2005)
Oliehoek, F.A.: Value-based planning for teams of agents in stochastic partially observable
environments. PhD thesis, Informatics Institute, University of Amsterdam (2010)
Oliehoek, F.A., Vlassis, N.: Q-value functions for decentralized POMDPs. In: Proc. of The In-
ternational Joint Conference on Autonomous Agents and Multi Agent Systems, pp. 833–
840 (2007)
Oliehoek, F.A., Spaan, M.T.J., Vlassis, N.: Dec-POMDPs with delayed communication. In:
AAMAS Workshop on Multi-agent Sequential Decision Making in Uncertain Domains
(2007)
Oliehoek, F.A., Kooi, J.F., Vlassis, N.: The cross-entropy method for policy search in decen-
tralized POMDPs. Informatica 32, 341–357 (2008a)
Oliehoek, F.A., Spaan, M.T.J., Vlassis, N.: Optimal and approximate Q-value functions for
decentralized POMDPs. Journal of Artificial Intelligence Research 32, 289–353 (2008b)
Oliehoek, F.A., Spaan, M.T.J., Whiteson, S., Vlassis, N.: Exploiting locality of interaction in
factored Dec-POMDPs. In: Proc. of The International Joint Conference on Autonomous
Agents and Multi Agent Systems, pp. 517–524 (2008c)
Oliehoek, F.A., Whiteson, S., Spaan, M.T.J.: Lossless clustering of histories in decentralized
POMDPs. In: Proc. of The International Joint Conference on Autonomous Agents and
Multi Agent Systems, pp. 577–584 (2009)
Oliehoek, F.A., Spaan, M.T.J., Dibangoye, J., Amato, C.: Heuristic search for identical payoff
Bayesian games. In: Proc. of The International Joint Conference on Autonomous Agents
and Multi Agent Systems, pp. 1115–1122 (2010)
Ooi, J.M., Wornell, G.W.: Decentralized control of a multiple access broadcast channel: Per-
formance bounds. In: Proc. of the 35th Conference on Decision and Control, pp. 293–298
(1996)
Osborne, M.J., Rubinstein, A.: A Course in Game Theory. The MIT Press (1994)
Pajarinen, J., Peltonen, J.: Efficient planning for factored infinite-horizon DEC-POMDPs. In:
Proc. of the International Joint Conference on Artificial Intelligence (to appear, 2011)
Paquet, S., Tobin, L., Chaib-draa, B.: An online POMDP algorithm for complex multiagent
environments. In: Proc. of the International Joint Conference on Autonomous Agents and
Multi Agent Systems (2005)
Peshkin, L.: Reinforcement learning by policy search. PhD thesis, Brown University (2001)
Peshkin, L., Kim, K.E., Meuleau, N., Kaelbling, L.P.: Learning to cooperate via policy search.
In: Proc. of Uncertainty in Artificial Intelligence, pp. 307–314 (2000)
Pynadath, D.V., Tambe, M.: The communicative multiagent team decision problem: Analyz-
ing teamwork theories and models. Journal of Artificial Intelligence Research 16, 389–423
(2002)
Rabinovich, Z., Goldman, C.V., Rosenschein, J.S.: The complexity of multiagent systems: the
price of silence. In: Proc. of the International Joint Conference on Autonomous Agents
and Multi Agent Systems, pp. 1102–1103 (2003)
Roth, M., Simmons, R., Veloso, M.: Decentralized communication strategies for coordinated
multi-agent policies. In: Parker, L.E., Schneider, F.E., Shultz, A.C. (eds.) Multi-Robot
Systems. From Swarms to Intelligent Automata, vol. III, pp. 93–106. Springer, Heidelberg
(2005a)
Roth, M., Simmons, R., Veloso, M.: Reasoning about joint beliefs for execution-time commu-
nication decisions. In: Proc. of the International Joint Conference on Autonomous Agents
and Multi Agent Systems, pp. 786–793 (2005b)
Roth, M., Simmons, R., Veloso, M.: Exploiting factored representations for decentralized
execution in multi-agent teams. In: Proc. of the International Joint Conference on Au-
tonomous Agents and Multi Agent Systems, pp. 467–463 (2007)
Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 2nd edn. Pearson Educa-
tion (2003)
Seuken, S., Zilberstein, S.: Improved memory-bounded dynamic programming for decentral-
ized POMDPs. In: Proc. of Uncertainty in Artificial Intelligence (2007a)
Seuken, S., Zilberstein, S.: Memory-bounded dynamic programming for DEC-POMDPs.
In: Proc. of the International Joint Conference on Artificial Intelligence, pp. 2009–2015
(2007b)
Seuken, S., Zilberstein, S.: Formal models and algorithms for decentralized decision making
under uncertainty. Autonomous Agents and Multi-Agent Systems 17(2), 190–250 (2008)
Singh, S.P., Jaakkola, T., Jordan, M.I.: Learning without state-estimation in partially observ-
able Markovian decision processes. In: Proc. of the International Conference on Machine
Learning, pp. 284–292. Morgan Kaufmann (1994)
Spaan, M.T.J., Gordon, G.J., Vlassis, N.: Decentralized planning under uncertainty for teams
of communicating agents. In: Proc. of the International Joint Conference on Autonomous
Agents and Multi Agent Systems, pp. 249–256 (2006)
Spaan, M.T.J., Oliehoek, F.A., Amato, C.: Scaling up optimal heuristic search in Dec-
POMDPs via incremental expansion. In: Proc. of the International Joint Conference on
Artificial Intelligence (to appear, 2011)
Szer, D., Charpillet, F.: Point-based dynamic programming for DEC-POMDPs. In: Proc. of
the National Conference on Artificial Intelligence (2006)
Szer, D., Charpillet, F., Zilberstein, S.: MAA*: A heuristic search algorithm for solving
decentralized POMDPs. In: Proc. of Uncertainty in Artificial Intelligence, pp. 576–583
(2005)
Tuyls, K., Hoen, P.J., Vanschoenwinkel, B.: An evolutionary dynamical analysis of multi-
agent learning in iterated games. Autonomous Agents and Multi-Agent Systems 12(1),
115–153 (2006)
Varakantham, P., Marecki, J., Yabu, Y., Tambe, M., Yokoo, M.: Letting loose a SPIDER on
a network of POMDPs: Generating quality guaranteed policies. In: Proc. of the Interna-
tional Joint Conference on Autonomous Agents and Multi Agent Systems (2007)
Varakantham, P., Young Kwak, J., Taylor, M.E., Marecki, J., Scerri, P., Tambe, M.: Exploiting
coordination locales in distributed POMDPs via social model shaping. In: Proc. of the
International Conference on Automated Planning and Scheduling (2009)
Varshavskaya, P., Kaelbling, L.P., Rus, D.: Automated design of adaptive controllers for mod-
ular robots using reinforcement learning. International Journal of Robotics Research 27(3-
4), 505–526 (2008)
Vlassis, N.: A Concise Introduction to Multiagent Systems and Distributed Artificial Intelli-
gence. In: Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan &
Claypool Publishers (2007)
Witwicki, S.J.: Abstracting influences for efficient multiagent coordination under uncertainty.
PhD thesis, University of Michigan, Ann Arbor, Michigan, USA (2011)
Witwicki, S.J., Durfee, E.H.: Influence-based policy abstraction for weakly-coupled Dec-
POMDPs. In: Proc. of the International Conference on Automated Planning and Schedul-
ing, pp. 185–192 (2010)
Wu, F., Zilberstein, S., Chen, X.: Point-based policy generation for decentralized POMDPs.
In: Proc. of the International Joint Conference on Autonomous Agents and Multi Agent
Systems, pp. 1307–1314 (2010a)
Wu, F., Zilberstein, S., Chen, X.: Rollout sampling policy iteration for decentralized
POMDPs. In: Proc. of Uncertainty in Artificial Intelligence (2010b)
Wu, F., Zilberstein, S., Chen, X.: Trial-based dynamic programming for multi-agent planning.
In: Proc. of the National Conference on Artificial Intelligence, pp. 908–914 (2010c)
Wu, F., Zilberstein, S., Chen, X.: Online planning for multi-agent systems with bounded
communication. Artificial Intelligence 175(2), 487–511 (2011)
Wu, J., Durfee, E.H.: Mixed-integer linear programming for transition-independent decen-
tralized MDPs. In: Proc. of the International Joint Conference on Autonomous Agents
and Multi Agent Systems, pp. 1058–1060 (2006)
Wunder, M., Littman, M.L., Babes, M.: Classes of multiagent Q-learning dynamics with
epsilon-greedy exploration. In: Proc. of the International Conference on Machine Learn-
ing, pp. 1167–1174 (2010)
Xuan, P., Lesser, V., Zilberstein, S.: Communication decisions in multi-agent cooperation:
Model and experiments. In: Proc. of the International Conference on Autonomous Agents
(2001)
Zettlemoyer, L.S., Milch, B., Kaelbling, L.P.: Multi-agent filtering with infinitely nested be-
liefs. In: Advances in Neural Information Processing Systems, vol. 21 (2009)
Part V
Domains and Background
Chapter 16
Psychological and Neuroscientific Connections
with Reinforcement Learning
Ashvin Shah
Abstract. The field of Reinforcement Learning (RL) was inspired in large part by
research in animal behavior and psychology. Early research showed that animals
can, through trial and error, learn to execute behavior that would eventually lead to
some (presumably satisfactory) outcome, and decades of subsequent research was
(and is still) aimed at discovering the mechanisms of this learning process. This
chapter describes behavioral and theoretical research in animal learning that is di-
rectly related to fundamental concepts used in RL. It then describes neuroscientific
research that suggests that animals and many RL algorithms use very similar learn-
ing mechanisms. Along the way, I highlight ways that research in computer science
contributes to and can be inspired by research in psychology and neuroscience.
16.1 Introduction
Ashvin Shah
Department of Psychology, University of Sheffield, Sheffield, UK
e-mail: [email protected]
The animal’s behavior is familiar to readers of this book as it describes well the
behavior of a reinforcement learning (RL) agent engaged in a simple task. The basic
problems the animal faces—and solves—in the puzzle box are those that an RL
agent must solve: given no instruction and only a very coarse evaluation signal, how
does an agent learn what to do and when to do it in order to better its circumstances?
While RL is not intended to be a model of animal learning, animal behavior and
psychology form a major thread of research that led to its development (Chapter 1,
Sutton and Barto 1998). RL was also strongly influenced by the work of Harry Klopf
(Klopf, 1982), who put forth the idea that hedonistic (“pleasure seeking”) behavior
emerges from hedonistic learning processes, including processes that govern the
behavior of single neurons.
In this chapter I describe some of the early experimental work in animal be-
havior that started the field and developed the basic paradigms that are used even
today, and psychological theories that were developed to explain observed behavior.
I then describe neuroscience research aimed at discovering the brain mechanisms
responsible for such behavior. Rather than attempt to provide an exhaustive review
of animal learning and behavior and their underlying neural mechanisms in a single
chapter, I focus on studies that are directly-related to fundamental concepts used in
RL and that illustrate some of the experimental methodology. I hope that this fo-
cus will make clear the similarities—in some cases striking—between mechanisms
used by RL agents and mechanisms thought to dictate many types of animal behav-
ior. The fact that animals can solve problems we strive to develop artificial systems
to solve suggests that a greater understanding of psychology and neuroscience can
inspire research in RL and machine learning in general.
Prediction plays an important role in learning and control. Perhaps the most direct
way to study prediction in animals is with classical conditioning, pioneered by Ivan
Pavlov in Russia in the early 1900s. While investigating digestive functions of dogs,
Pavlov noticed that some dogs that he had worked with before would salivate before
any food was brought out. In what began as an attempt to account for this surpris-
ing behavior, Pavlov developed his theory of conditioned reflexes (Pavlov, 1927):
mental processes (e.g., perception of an auditory tone) can cause a physiological
reaction (salivation) that was previously thought to be caused only by physical pro-
cesses (e.g., smell or presence of food in the mouth). Most famously, the sound of
a ringing bell that reliably preceded the delivery of food eventually by itself caused
the dog to salivate. This behavior can be thought of as an indication that the dog has
learned to predict that food delivery will follow the ringing bell.
16.2.1 Behavior
While Pavlov’s drooling dog is the enduring image of classical conditioning, a more
studied system is the nictitating membrane (NM) of the rabbit eye (Gormezano et al,
1962), which is a thin “third eyelid” that closes to protect the eye. Typically, the
rabbit is restrained and an air puff or a mild electric shock applied to the eye (the
unconditioned stimulus, US) causes the NM to close (the unconditioned response,
UR). If a neutral stimulus that does not by itself cause the NM to close (conditioned
stimulus, CS), such as a tone or a light, is reliably presented to the rabbit before
the US, eventually the CS itself causes the NM to close (the conditioned response,
CR). Because the CR is acquired with repeated pairings of the CS and the US (the
acquisition phase of an experiment), the US is often referred to as a reinforcer. The
strength of the CR, often measured by how quickly the NM closes or the likelihood
that it closes before the US, is a measure of the predictive strength of the CS for
the animal. After the CR is acquired, if the CS is presented but the US is omitted
(extinction phase), the strength of the CR gradually decreases.
Manipulations to the experimental protocol can give us a better idea of how such
predictions are learned. Particularly instructive are manipulations in the timing of
the CS relative to the US and how the use of multiple CSs affects the predictive qual-
ities of each CS. (These manipulations are focused on in Sutton and Barto (1987)
and Sutton and Barto (1981), which describe temporal difference models of classical
conditioning.)
Two measures of timing between the CS and the subsequent US are (Figure 16.1,
top): 1) interstimulus interval (ISI), which is the time between the onset of the CS
and the onset of the US, and 2) trace interval (TI), which is the time between the
offset of the CS and the onset of the US. A simple protocol uses a short and constant
ISI and a zero-length TI (delay conditioning). For example, the tone is presented
briefly (500 ms) and the air puff is presented at the end of the tone. When the TI
is greater than zero (trace conditioning), acquisition and retention of the CR are
hindered. If the ISI is zero, i.e., if the CS and US are presented at the same time,
the CS is useless for prediction and the animal will not acquire a CR. There is an
optimal ISI (about 250 ms for the NM response) after which the strength of the CR
decreases gradually. The rate of decrease is greater in trace conditioning than it is in
delay conditioning. In addition, the rate of acquisition decreases with an increase in
ISI, suggesting that it is harder to predict temporally distant events.
The use of several CSs (compound stimuli) reveals that learning also depends
on the animal’s ability to predict the upcoming US. Figure 16.1, middle, illustrates
example protocols in which one CS (e.g., a tone, referred to as CSA) is colored in
light gray and the other (a light, CSB) is colored in dark gray. In blocking (Kamin,
1969), the US is paired with CSA alone and the animal acquires a CR. Afterwards,
the simultaneous presentation of CSA and CSB is paired with the US for a block
of trials. Subsequent presentation of CSB alone elicits no CR. Because CSA was
already a predictor of the US, CSB holds no predictive power. In conditioned in-
hibition, two types of stimulus presentations are intermixed during training: CSA
alone paired with the US, and the simultaneous presentation of CSA and CSB with
the US omitted. Subsequent pairing of CSB alone with the US results in a lower rate
of CR acquisition relative to animals that did not experience the compound stimulus.
CSB was previously learned as a reliable predictor that the US will not occur.
In the above two examples, the two CSs had identical temporal properties. Other
protocols show the effects of presenting compound stimuli that have different tem-
poral properties (serial compound stimuli) (Figure 16.1, bottom). As mentioned
earlier, acquisition of a CR is impaired in trace conditioning (TI > 0). However,
if another CS is presented during the TI, the acquisition of the CR in response to
the first CS is facilitated (facilitation by intervening stimulus). A related protocol
results in higher-order conditioning, in which a CR is first acquired in response to
CSB. Then, if CSA is presented prior to CSB, a CR is acquired in response to CSA.
In a sense, CSB plays the role of reinforcer.
In the primacy effect, CSA and CSB overlap in time. The offset time of each is
the same and immediately precedes the US, but the onset time of CSA is earlier than
that of CSB (Figure 16.1, bottom). Because CSB has a shorter ISI than CSA, one
may expect that a CR would be elicited more strongly in response to CSB alone than
to CSA alone. However, the presence of CSA actually results in a decrease in the
strength of the CR in response to CSB alone. More surprising is a prediction first
discussed in Sutton and Barto (1981). They presented a model of classical condi-
tioning that was first trained with CSB paired with the US (delay conditioning with
a short ISI), and then CSA (with an earlier onset time) was presented as well. Even
though the strength of the association between CSB and the response—which represents the predictive qualities of the CS—had reached its asymptotic level, it decreased
when CSA was presented. Such a finding is seemingly incongruous with the effects
of the ISI and the phenomenon of blocking. Sutton and Barto (1987) replicated this
result, and Kehoe et al (1987) confirmed this prediction experimentally.
16.2.2 Theory
The blocking effect suggests that learning occurs when the unexpected happens
(Kamin, 1969) as opposed to when two things are correlated. This idea led to
the development of the famous Rescorla-Wagner model of classical conditioning
(Rescorla and Wagner, 1972), in which the presence of a US during a trial is pre-
dicted by the sum of associative strengths between each CS present during the trial
and the US. Changes in associative strengths depend on the accuracy of the predic-
tion. For every CS i present during a trial:
Δw(i) = α ( r − ∑_i w(i) x(i) ),
where w(i) is the associative strength between CS i and the US, x(i) = 1 if CS i is
present and 0 otherwise, r is the maximum amount of conditioning the US can pro-
duce (analogous to the “magnitude” of the US), and α is a step-size parameter. (Note
that this notation differs from the original version.) If the US is perfectly-predicted
(i.e., if r − ∑i w(i)x(i) = 0), the associative strength of another CS subsequently
added (with an initial associative strength of zero) will not increase. This influential
model captures several features of classical conditioning (e.g., blocking). Also, as
first noted in Sutton and Barto (1981), it is similar to the independently-developed
Widrow-Hoff learning rule (Widrow and Hoff, 1960), showing the importance of
prediction-derived learning.
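A few lines suffice to simulate the model and reproduce blocking: once CSA alone fully predicts the US, pairing the CSA+CSB compound with the US leaves CSB with essentially no associative strength (a sketch; the step size and trial counts are arbitrary):

def rescorla_wagner(trials, alpha=0.1, n_cs=2):
    """trials: list of (x, r) pairs, with x a tuple of 0/1 CS indicators
    and r the magnitude of the US on that trial."""
    w = [0.0] * n_cs
    for x, r in trials:
        error = r - sum(w[i] * x[i] for i in range(n_cs))   # prediction error
        for i in range(n_cs):
            if x[i]:
                w[i] += alpha * error
    return w

# Blocking: phase 1 pairs CSA alone with the US, phase 2 pairs CSA+CSB with it.
phase1 = [((1, 0), 1.0)] * 100
phase2 = [((1, 1), 1.0)] * 100
print(rescorla_wagner(phase1 + phase2))   # w[0] ends near 1.0, w[1] stays near 0.0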
The Rescorla-Wagner model is a trial level account of classical conditioning in
that learning occurs from trial to trial as opposed to at each time point within a trial.
It cannot account for the effects that temporal properties of stimuli have on learning.
In addition, Sutton and Barto (1987) point out that animal learning processes may
not incorporate mechanisms that are dependent on the concept of the trial, which is
essentially a convenient way to segregate events.
Temporal difference (TD) models (Sutton, 1988; Sutton and Barto, 1998), which
have foundations in models of classical conditioning (Sutton and Barto, 1981; Barto
and Sutton, 1982; Sutton and Barto, 1987), were developed in part to do prediction
learning on a real-time level. In the following TD model of classical conditioning
(Sutton and Barto, 1987) (using notation that is similar to the equation above), let rt
be the presence (and magnitude) of the US at time step t, xt (i) be 1 if CS i is present
at time t and 0 otherwise, and wt (i) be the associative strength between CS i and the
US at time t. At each time point,
w_{t+1}(i) = w_t(i) + α ( r_t + γ [ ∑_i w_t(i) x_t(i) ]^+ − [ ∑_i w_t(i) x_{t−1}(i) ]^+ ) x̄_t(i),

where γ is the familiar temporal discount factor, [y]^+ returns zero if y < 0, and x̄_t(i) is an eligibility trace, e.g., x̄_t(i) = β x̄_{t−1}(i) + (1 − β) x_{t−1}(i), where 0 ≤ β < 1.
At each time point, weights are adjusted to minimize the difference between
r_t + γ [ ∑_i w_t(i) x_t(i) ]^+ and [ ∑_i w_t(i) x_{t−1}(i) ]^+ (i.e., the temporal difference error). These are temporally successive predictions of the same quantity: upcoming USs (more precisely, ∑_{k=0}^∞ γ^k r_{t+k}). The former prediction incorporates more recent information (r_t and x_t(i)) and serves as a target for the latter, which uses information from an earlier time point (x_{t−1}(i)). The eligibility trace (which restricts modifica-
tion to associations of CSs that were recently present), discount factor, and explicit
dependence on time allow the model to capture the effect on learning of temporal
relationships among stimuli within a trial. Also, because the prediction at time t
trains the prediction at time t − 1, the model accounts for higher order conditioning.
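A real-time sketch of this update, with the rectification [·]^+ and the eligibility trace x̄ written out explicitly (parameter values are arbitrary illustrations):

def td_conditioning(trials, n_cs, alpha=0.1, gamma=0.95, beta=0.6):
    """trials: each trial is a list of (x_t, r_t), with x_t a tuple of 0/1 CS
    indicators and r_t the magnitude of the US at time step t."""
    w = [0.0] * n_cs
    plus = lambda y: max(y, 0.0)                  # the [.]^+ rectification
    for trial in trials:
        trace = [0.0] * n_cs                      # eligibility trace x-bar
        x_prev = (0,) * n_cs
        for x, r in trial:
            v_prev = plus(sum(w[i] * x_prev[i] for i in range(n_cs)))
            v_now = plus(sum(w[i] * x[i] for i in range(n_cs)))
            delta = r + gamma * v_now - v_prev    # temporal difference error
            trace = [beta * trace[i] + (1 - beta) * x_prev[i] for i in range(n_cs)]
            for i in range(n_cs):
                w[i] += alpha * delta * trace[i]
            x_prev = x
    return w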
Classical conditioning experiments present the animal with a salient stimulus (the
US) contingent on another stimulus (the CS) regardless of the animal’s behavior.
Operant conditioning experiments present a salient stimulus (usually a “reward”
such as food) contingent on specific actions executed by the animal (Thorndike,
1911; Skinner, 1938). (The animal is thought of as an instrument that operates on the
environment.) Thorndike’s basic experimental protocol described at the beginning
of this chapter forms the foundation of most experiments described in this section.
16.3.1 Behavior
The simplest protocols use single-action tasks such as a rat pressing the one lever or
a pigeon pecking at the one key available to it. Behavior is usually described as the
“strength” of the action, measured by how quickly it is initiated, how quickly it is
executed, or likelihood of execution. Basic protocols and behavior are analogous to
those of classical conditioning. During acquisition (action is followed by reward),
action strength increases, and the reward is referred to as the reinforcer. During
extinction (reward is omitted after the acquisition phase), action strength decreases.
The rates of action strength increase and decrease also serve as measures of learning.
Learning depends on several factors that can be manipulated by the experimenter.
In most experiments, the animal is put into a deprived state before acquisition (e.g.,
it’s hungry if the reward is food). Action strength increases at a faster rate with de-
privation; hence, deprivation is said to increase the animal’s “drive” or “motivation.”
Other factors commonly studied include the magnitude of reward (e.g., volume of
food, where an increase results in an increase in learning) and delay between action
execution and reward delivery (where an increase results in a decrease).
Factors that affect learning in single-action tasks also affect selection, or decision-
making, in free choice tasks in which more than one action is available (e.g., there
are two levers). The different actions lead to outcomes with different characteristics,
and the strength of an action relative to others is usually measured by relative likeli-
hood (i.e., choice distribution). Unsurprisingly, animals more frequently choose the
action that leads to a reward of greater magnitude and/or shorter delay.
We can quantify the effects of one factor in terms of another by examining choice
distribution. For example, suppose action A leads to a reward of a constant magni-
tude and delay and action B leads to an immediate reward. By determining how
much we must decrease the magnitude of the immediate reward so that the animal
shows no preference between the two actions, we can describe a temporal discount
function. Although most accounts in RL use an exponential discount function, be-
havioral studies support a hyperbolic form (Green and Myerson, 2004).
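The two candidate forms are easy to compare side by side (the rate parameters below are arbitrary; the relevant qualitative point is that, for roughly matched short-delay values, the hyperbolic curve declines more slowly at long delays):

def exponential_discount(reward, delay, gamma=0.9):
    return reward * gamma ** delay

def hyperbolic_discount(reward, delay, k=0.1):
    return reward / (1.0 + k * delay)

for d in (0, 1, 5, 20, 50):
    print(d, round(exponential_discount(1.0, d), 3), round(hyperbolic_discount(1.0, d), 3))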
Probability of reward delivery is another factor that affects choice distribution.
In some tasks where different actions lead to rewards that are delivered with dif-
ferent probabilities, humans and some animals display probability matching, where
the choice distribution is similar to the relative probabilities that each action will
be reinforced (Siegel and Goldstein, 1959; Shanks et al, 2002). Such a strategy is
clearly suboptimal if the overall goal is to maximize total reward received. Some
studies suggest that probability matching may be due in part to factors such as the
small number of actions that are considered in most experiments or ambiguous task
instructions (i.e., participants may be attempting to achieve some goal other than
reward maximization) (Shanks et al, 2002; Gardner, 1958; Goodnow, 1955).
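The suboptimality of matching is easy to quantify for a two-action example with illustrative reinforcement probabilities:

p = {"A": 0.7, "B": 0.3}                 # probability each action is reinforced

matching = dict(p)                       # choose A 70% of the time, B 30%
maximizing = {a: float(a == max(p, key=p.get)) for a in p}   # always choose A

expected_reward = lambda policy: sum(policy[a] * p[a] for a in p)
print(expected_reward(matching), expected_reward(maximizing))   # 0.58 versus 0.70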
Naive animals usually do not execute the specific action(s) the experimenter
wishes to examine; the animal must be “taught” with methods drawn from the ba-
sic results outlined above and from classical conditioning. For example, in order to
draw a pigeon’s attention to a key to be pecked, the experimenter may first simply
illuminate the key and then deliver the food, independent of the pigeon’s behavior.
The pigeon naturally pecks at the food and, with repeated pairings, pecks at the key
itself (autoshaping, Brown and Jenkins 1968). Although such Pavlovian actions can
be exploited to guide the animal towards some behaviors, they can hinder the animal
in learning other behaviors (Dayan et al, 2006).
If the movements that compose an action can vary to some degree, a procedure
called shaping can be used to teach an animal to execute a specific movement by
gradually changing the range of movements that elicit a reward (Eckerman et al,
1980). For example, to teach a pigeon to peck at a particular location in space,
the experimenter defines a large imaginary sphere around that location. When the
pigeon happens to peck within that sphere, food is delivered. When the pigeon con-
sistently pecks within that sphere, the experimenter decreases the radius, and the
pigeon receives food only when it happens to peck within the smaller sphere. This
process continues until an acceptable level of precision is reached.
Shaping and autoshaping modify the movements that compose a single action.
Some behaviors are better described as a chain of several actions executed sequen-
tially. To train animals, experimenters exploit higher-order conditioning (or condi-
tioned reinforcement): as with classical conditioning, previously neutral stimuli can
take on the reinforcing properties of a reinforcer with which they were paired. In
backward chaining, the animal is first trained to execute the last action in a chain.
Then it is trained to execute the second to last action, after which it is in the state
from which it can execute the previously acquired action (Richardson and Warzak,
1981). This process, in which states from which the animal can execute a previously
learned action act as a reinforcers, continues until the entire sequence is learned.
16.3.2 Theory
To account for the behavior he had observed in his experiments, Thorndike devised
his famous Law of Effect:
Of several responses made to the same situation, those which are accompanied or
closely followed by satisfaction to the animal will, other things being equal, be more
firmly connected with the situation, so that, when it recurs, they will be more likely to
recur; those which are accompanied or closely followed by discomfort to the animal
will, other things being equal, have their connections with that situation weakened, so
that, when it recurs, they will be less likely to occur. The greater the satisfaction or
discomfort, the greater the strengthening or weakening of the bond. (Chapter 5, page
244 of Thorndike 1911.)
Learning occurs only with experience: actually executing the action and evaluating
the outcome. According to Thorndike, action strength is due to the strength of an
association between the response (or action) and the situation (state) from which
it was executed. The basic concepts described in the Law of Effect are also used
in many RL algorithms. In particular, action strength is analogous to action value,
Q(s,a), which changes according to the consequences of the action and determines
behavior in many RL schemes.
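Written as code, the analogy is a one-line update of an action value, with stronger "bonds" making the response more likely when the situation recurs (a sketch of the idea, not a full RL algorithm):

from collections import defaultdict

Q = defaultdict(float)              # the "bond" between situation s and response a

def law_of_effect_update(s, a, outcome, alpha=0.1):
    """outcome > 0 (satisfaction) strengthens the bond, outcome < 0 (discomfort)
    weakens it; larger outcomes produce larger changes."""
    Q[(s, a)] += alpha * outcome

def respond(s, available_actions):
    """When the situation recurs, respond according to the strongest bond."""
    return max(available_actions, key=lambda a: Q[(s, a)])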
This type of action generation, where the action is elicited by the current state, is
sometimes thought of as being due to a stimulus-response (SR) association. Though
the term is a bit controversial, I use it here because it is used in many neuroscience
accounts of behavior (e.g., Yin et al 2008) and it emphasizes the idea that behavior
is due to associations between available actions and the current state. In contrast,
actions may be chosen based on explicit predictions of their outcomes (discussed in
the next subsection).
Thorndike conducted his experiments in part to address questions that arose from
Charles Darwin’s Theory of Evolution: do humans and animals have similar men-
tal faculties? His Law of Effect provides a mechanistic account that, coupled with
variability (exploration), can explain even complicated behaviors. The SR associ-
ation, despite its simplicity, plays a role in several early psychological theories of
behavior (e.g., Thorndike 1911; Hull 1943; Watson 1914). New SR associations can
be formed from behavior generated by previously-learned ones, and a complicated
“response” can be composed of previously-learned simple responses. (Such con-
cepts are used in hierarchical methods as well, Grupen and Huber 2005; Barto and
Mahadevan 2003 and see also Chapter 9.) This view is in agreement with Evolu-
tion: simple processes of which animals are capable explain some human behavior
as well. (In Chapter 6 of his book, Thorndike suggests that thought and reason are
human qualities that arise from our superior ability to learn associations.)
As with animals, it may be very difficult for a naive artificial agent to learn a
specific behavior. Training procedures developed by experimentalists, such as back-
ward chaining and shaping, can be used to aid artificial agents as well (Konidaris and
Barto, 2009; Selfridge et al, 1985; Ng et al, 1999). A type of shaping also occurs
when reinforcement methods are used to address the structural credit assignment
problem (Barto, 1985): when an “action” is composed of multiple elements that can
each be modified, exactly what did an agent just do that led to the reward?
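One widely used form of shaping for artificial agents is the potential-based reward shaping of Ng et al (1999). The sketch below only illustrates that idea under assumed details; the grid world, goal location, and potential function are inventions for this example, not anything from the chapter.

```python
def shaped_reward(r, s, s_next, gamma, phi):
    """Potential-based shaping (in the sense of Ng et al., 1999): learning from
    r + gamma*phi(s') - phi(s) leaves the optimal policy of the original task unchanged."""
    return r + gamma * phi(s_next) - phi(s)

# Illustrative potential: reward progress toward an assumed goal cell of a grid world.
GOAL = (10, 10)

def phi(state):
    x, y = state
    return -(abs(x - GOAL[0]) + abs(y - GOAL[1]))  # closer to the goal -> higher potential

print(shaped_reward(0.0, (2, 3), (3, 3), gamma=0.99, phi=phi))  # small bonus for a step toward the goal
```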
Psychological theories that primarily use concepts similar to the SR association
represent a view in which behavior is accounted for only by variables that can be
directly-observed (e.g., the situation and the response). Taken to the extreme, it is
controversial and cannot explain all behavior. The next subsection discusses exper-
iments that show that some behavior is better explained by accounts that do allow
for variables that cannot be directly-observed.
16.4 Dopamine
Although different types of rewards could have similar effects on behavior, it was
unknown if similar or disparate neural mechanisms mediate those effects. In the
early 1950s, it was discovered that if an action led to electrical stimulation ap-
plied (via an electrode) to the brain, action strength would increase (Olds and Mil-
ner, 1954). This technique, known as intracranial self-stimulation (ICSS), showed
greater effects when the electrodes were strongly stimulating the projections of neu-
rons that release dopamine (DA) from their axon terminals. DA neurons are mostly
located in the substantia nigra pars compacta (SNpc, also called the A9 group) (a
part of the basal ganglia, BG) and the neighboring ventral tegmental area (VTA,
group A10) (Björklund and Dunnett, 2007). As described in the next section, many
of the areas DA neurons project to are involved with decision-making and move-
ment. ICSS shows that behavior may be modified according to a global signal com-
municated by DA neurons. Original interpretations suggested, quite reasonably, that
DA directly signals the occurrence of a reward. Subsequent research, some of which
is described in this section, shows that DA plays a more sophisticated role (Wise,
2004; Schultz, 2007; Bromberg-Martin et al, 2010; Montague et al, 2004).
DA neurons signal with brief, high-frequency bursts of firing (phasic activity). Early studies showed that a DA burst occurred in response
to sensory events that were task-related, intense, or surprising (Miller et al, 1981;
Schultz, 1986; Horvitz, 2000).
In one of the most important studies linking RL to brain processes, Ljungberg
et al (1992) recorded from DA neurons in monkeys while the monkeys learned to
reach for a lever when a light was presented, after which they received some juice.
Initially, a DA burst occurred in response only to juice delivery. As the task was
learned, a DA burst occurred at both the light and juice delivery, and later only to
the light (and not juice delivery). Finally, after about 30,000 trials (over many days),
the DA burst even in response to the light declined by a large amount (perhaps due
to a lack of attention or motivation, Ljungberg et al 1992). Similarly, Schultz et al
(1993) showed that as monkeys learned a more complicated operant conditioning
task, the DA burst moved from the time of juice delivery to the time of the stimuli
that indicated that the monkey should execute the action. If the monkey executed
the wrong action, DA neuron activity at the time of the expected juice delivery
decreased from baseline. Also, juice delivered before its expected time resulted in
a DA burst; the omission of expected juice (even if the monkey behaved correctly)
resulted in a decrease in activity from baseline; and the DA burst at the time of juice
delivery gradually decreased as the monkey learned the task (Schultz et al, 1997;
Hollerman and Schultz, 1998).
The progression of DA neuron activity over the course of learning did not cor-
relate with variables that were directly-manipulated in the experiment, but it caught
the attention of those familiar with RL (Barto, 1995). As noted by Houk et al (1995),
there “is a remarkable similarity between the discharge properties of DA neurons
and the effective reinforcement signal generated by a TD algorithm...” (page 256).
Montague et al (1996) hypothesized that “the fluctuating delivery of dopamine from
the VTA to cortical and subcortical target structures in part delivers information
about prediction errors between the expected amount of reward and the actual re-
ward” (page 1944). Schultz et al (1997) discuss in more detail the relationship be-
tween their experiments, TD, and psychological learning theories (Schultz, 2010,
2006, 1998). The importance of these similarities cannot be overstated: a funda-
mental learning signal developed years earlier within the RL framework appears
to be represented—almost exactly—in the activity of DA neurons recorded during
learning tasks.
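A minimal TD sketch (with assumed parameters, not any of the cited models) reproduces the qualitative pattern described above: early in training the prediction error is large at juice delivery, and after training it has moved to the unpredicted cue.

```python
# A toy Pavlovian trial: an unpredictable light (cue) is followed by juice (reward 1.0).
# Because nothing predicts the cue itself, the value of the pre-cue state stays at 0.
gamma, alpha = 1.0, 0.1   # assumed parameters
V_cue = 0.0               # learned value of the cue state

for trial in range(200):
    delta_cue = 0.0 + gamma * V_cue - 0.0   # TD error at cue onset (no reward, V(pre-cue) = 0)
    delta_juice = 1.0 - V_cue               # TD error at juice delivery (terminal step)
    V_cue += alpha * delta_juice            # the juice error trains the cue's value
    if trial in (0, 199):
        print(f"trial {trial}: delta_cue={delta_cue:.3f}, delta_juice={delta_juice:.3f}")

# After learning, omitting the juice would give an error of 0.0 - V_cue, a dip below baseline,
# analogous to the decrease in DA activity at the time of an omitted reward.
```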
Several studies suggest that the DA burst is, like the TD error, influenced by
the prediction properties of a stimulus, e.g., as in blocking (Waelti et al, 2001) and
conditioned inhibition (Tobler et al, 2003). When trained monkeys were faced with
stimuli that predict the probability of reward delivery, the magnitude of the DA
burst at the time of stimuli presentation increased with likelihood; that at the time
of delivered reward decreased; and DA neuron activity at the time of an omitted
reward decreased to a larger extent (Fiorillo et al, 2003) (but see Niv et al 2005).
When reward magnitude was drawn from a probability distribution, the DA burst at
the time of reward delivery reflected the difference between delivered and expected
magnitude (Tobler et al, 2005). Other influences on the DA burst include motivation
(Satoh et al, 2003), delay of reward delivery (Kobayashi and Schultz, 2008), and
history (if reward delivery was a function of past events) (Nakahara et al, 2004;
Bayer and Glimcher, 2005).
In a task that examined the role of DA neuron activity in decision-making (Mor-
ris et al, 2006), monkeys were presented with visual stimuli indicating probability of
reward delivery. If only one stimulus was presented, DA neuron activity at the time
of the stimulus presentation was proportional to the expected value. If two stimuli
were presented, and the monkey could choose between them, DA neuron activity
was proportional to the expected value of the eventual choice rather than represent-
ing some combination of the expected values of the available choices. The authors
suggest that these results indicate that DA neuron activity reflects the perceived
value of a chosen (by some other process) action, in agreement with a SARSA
learning scheme (Niv et al, 2006a) (though the results of other experiments support
a Q-learning scheme, Roesch et al 2007).
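The contrast at issue is whether the learning target uses the value of the action that is actually chosen next (SARSA) or the value of the best available action (Q-learning). The following sketch of the two textbook update rules is included only to make that distinction explicit; it is not a model of the recorded neurons.

```python
from collections import defaultdict

Q = defaultdict(float)  # tabular action values

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: the target uses the value of the action actually chosen in the next state."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy: the target uses the value of the best available action in the next state."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```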
The studies reviewed above provide compelling evidence that the DA burst reflects a
reward prediction error very similar to the TD error of RL. Such an interpretation is
attractive because we can then draw upon the rich computational framework of RL
to analyze such activity. However, other studies and interpretations suggest that DA
acts as a general reinforcer of behavior, but perhaps not just to maximize reward.
The incentive salience theory (Berridge and Robinson, 1998; Berridge, 2007;
Berridge et al, 2009) separates “wanting” (a behavioral bias) from “liking” (hedonic
feelings). Experiments using pharmacological manipulations suggest that opioid—
not DA—systems mediate facial expressions associated with pleasure (Berridge and
Robinson, 1998). DA could increase action strength without increasing such mea-
sures (Tindell et al, 2005; Wyvell and Berridge, 2000), and DA-deficient animals
could learn to prefer pleasurable stimuli over neutral ones (Cannon and Palmiter,
2003). The separation offers an explanation for some experimental results and irra-
tional behaviors (Wyvell and Berridge, 2000).
Redgrave et al (2008) argue that the latency of the DA burst, < 100 ms after
stimulus presentation in many cases, is too short and uniform (across stimuli and
species) to be based on identification—and hence predicted value—of the stimulus.
Rather, the DA burst may be due to projections from entirely subcortical pathways
that respond quickly—faster than DA neurons—to coarse perceptions that indicate
that something has happened, but not what it was (Dommett et al, 2005). More re-
cent experimental results provide evidence that additional phasic DA neuron activity
that occurs with a longer latency (< 200 ms) (Joshua et al, 2009; Bromberg-Martin
et al, 2010; Nomoto et al, 2010) may be due to early cortical processing and does
provide some reward-related information. Very early (< 100 ms) DA activity may
signal a sensory prediction error (e.g., Horvitz 2000) that biases the animal to repeat the actions that immediately preceded the unexpected event.
The experiments described in this section show that DA acts not as a “reward detector,”
but rather as a learning signal that reinforces behavior. Pharmacological treatments
that manipulate the effectiveness of DA further support this idea. For example, in
humans that were given DA agonists (which increase the effectiveness of DA on
target neurons) while performing a task, there was an increase in both learning and
a representation of the TD error in the striatum (a brain area targeted by DA neurons)
(Pessiglione et al, 2006). DA antagonists (which decrease the effectiveness of DA)
had the opposite effect.
There are a number of interesting issues that I have not discussed but deserve
mention. Exactly how the DA burst is shaped is a matter of some debate. Theories
based on projections to and from DA neurons suggest that they are actively sup-
pressed when a predicted reward is delivered (Hazy et al, 2010; Houk et al, 1995).
Also, because baseline DA neuron activity is low, aversive outcomes or omitted re-
wards cannot be represented in the same way as delivered rewards. Theories based
on stimulus representation (Ludvig et al, 2008; Daw et al, 2006a), other neurotrans-
mitter systems (Daw et al, 2002; Phelps and LeDoux, 2005; Wrase et al, 2007;
Doya, 2008), and/or learning rules (Frank, 2005; Frank et al, 2004) address this
issue. While this section focused on phasic DA neuron activity, research is also ex-
amining the effect that long-term (tonic) DA neuron activity has on learning and
behavior (Schultz, 2007; Daw and Touretzky, 2002). Finally, recent experimental
evidence suggests that the behavior of DA neurons across anatomical locations may
not be as uniform as suggested in this section (Haber, 2003; Wickens et al, 2007).
While several interpretations of how and to what end DA affects behavior have
been put forth, of most interest to readers of this chapter is the idea that the DA
burst represents a signal very similar to the TD error. To determine if that signal is
used in the brain in ways similar to how RL algorithms use it, we must examine target
structures of DA neurons.
Most DA neuron projections terminate in frontal cortex and the basal ganglia (BG),
areas of the brain that are involved in the control of movement, decision-making,
and other cognitive processes. Because DA projections to the BG are particularly
dense relative to projections in frontal cortex, this section focuses on the BG. What
follows is a brief (and necessarily incomplete) overview of the BG and its role in
learning and control.
The BG are a set of interconnected subcortical structures located near the thala-
mus. Scientists first connected their function with voluntary movement in the early
twentieth century when post-mortem analysis showed that a part of the BG was
damaged in Parkinson’s disease patients. Subsequent research has revealed a basic
understanding of their function in terms of movement (Mink, 1996), and research
over the past few decades shows that they mediate learning and cognitive functions
as well (Packard and Knowlton, 2002; Graybiel, 2005).
A part of the BG called the striatum receives projections from DA neurons and
excitatory projections from most areas of cortex and thalamus. A striatal neuron re-
ceives a large number of weak inputs from many cortical neurons, suggesting that
striatal neurons implement a form of pattern recognition (Houk and Wise, 1995;
Wilson, 2004). Striatal neurons send inhibitory projections to the internal segment
of the globus pallidus (GPi), the neurons of which are tonically-active and send in-
hibitory projections to brain stem and thalamus. Thus, excitation of striatal neurons
results in a disinhibition of neurons targeted by GPi neurons. In the case in which
activation of neurons targeted by the GPi elicits movements, their disinhibition in-
creases the likelihood that those movements will be executed.
On an abstract level, we can think of pattern recognition at striatal neurons as
analogous to the detection of state as used in RL, and the resulting disinhibition of
the targets of GPi neurons as analogous to the selection of actions. Corticostriatal
synapses are subject to DA-dependent plasticity (Wickens, 2009; Calabresi et al,
2007), e.g., a learning rule roughly approximated by the product of the activity of
the striatal neuron and the activities of the cortical and DA neurons that project to
it. Thus, the DA burst (e.g., representing the TD error) can modify the activation of
striatal neurons that respond to a particular state according to the consequences of
the resulting action. In other words, the BG possess characteristics that enable them
to modify the selection of actions through mechanisms similar to those used in RL
(Barto, 1995; Doya, 1999; Daw and Doya, 2006; Doll and Frank, 2009; Graybiel,
2005; Joel et al, 2002; Wickens et al, 2007).
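A schematic version of the three-factor corticostriatal rule described here, in which each weight change is the product of presynaptic cortical activity, postsynaptic striatal activity, and a global DA-like signal, might look as follows; the learning rate, array sizes, and activity patterns are assumptions made purely for illustration.

```python
import numpy as np

def corticostriatal_update(W, cortical, striatal, dopamine, eta=0.01):
    """Three-factor rule: each weight changes in proportion to the product of presynaptic
    (cortical) activity, postsynaptic (striatal) activity, and a global DA-like signal
    standing in for the TD error.  W has shape (n_striatal, n_cortical)."""
    return W + eta * dopamine * np.outer(striatal, cortical)

# Illustrative use: a positive, TD-error-like DA burst strengthens exactly those synapses
# that were active when the selected action was taken.
W = np.zeros((4, 16))
cortical = np.random.rand(16)                 # pattern representing the current state
striatal = np.zeros(4); striatal[2] = 1.0     # the striatal neuron whose action was selected
W = corticostriatal_update(W, cortical, striatal, dopamine=+0.8)
```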
Additional pathways within the BG (which are not described here) impart a
functionality to the BG useful for behavioral control. For example, actions can
be actively facilitated or inhibited, possibly through different learning mechanisms
(Frank, 2005; Frank et al, 2004; Hikosaka, 2007). Also, intra-BG architecture ap-
pears to be well-suited to implement selection between competing actions in an
optimal way (Bogacz and Gurney, 2007; Gurney et al, 2001).
Different areas of cortex—which are involved in different types of functions—
project to different areas of the striatum. These pathways stay segregated to a large
degree through the BG, to the thalamus, and back up to the cortex (Alexander et al,
1986; Middleton and Strick, 2002). The parallel loop structure allows the BG to
affect behavior by shaping cortical activity as well as through descending projec-
tions to brain stem. The functional implications of the parallel loop structure are
discussed later in this section. The next subsection describes studies that aim to
further elucidate the functions of the BG, focusing mostly on areas involved with
the selection of movements and decisions.
In most of the experiments described in this subsection, the activities of single neu-
rons in the striatum were recorded while the animal was engaged in some condition-
ing task. As the animal learned the task, neural activity began to display task-related
activity, including activity modulated by reward (Schultz et al, 2003; Hikosaka,
2007; Barnes et al, 2005).
In a particularly relevant study, Samejima et al (2005) recorded from dorsal stria-
tum (where dorsal means toward the top of the head) of monkeys engaged in a two-
action free choice task. Each action led to a large reward (a large volume of juice)
with some probability that was held constant over a block of trials, and a smaller
reward (small volume) the rest of the time. For example, in one block of trials, the
probability that action A led to the large reward was 0.5 and that of action B was
0.9, while in another block the probabilities were, respectively, 0.5 and 0.1. Such
a design dissociates the absolute and relative action values (the expected reward
for executing the action): the value of action A is the same in both blocks, but it
is lower than that of action B in the first block and higher in the second. Choices
during a block were distributed between the two actions, with a preference for the
more valuable one.
The recorded activity of about one third of the neurons during a block covaried
with the value of one of the actions (a lesser proportion covaried with other aspects
such as the difference in action values or the eventual choice). In addition, modelling
techniques were used to estimate action values online based on experience, i.e., past
actions and rewards, and to predict choice behavior based on those estimated action
values. Many neurons were found whose activities covaried with estimated action
value, and the predicted choice distribution agreed with the observed distribution.
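In such analyses, the online estimation typically amounts to a delta-rule update of the chosen action's value plus a softmax choice rule, whose trial-by-trial estimates are then regressed against neural activity. A sketch under assumed parameter values (alpha, beta) follows; it is not the authors' exact model.

```python
import math

def estimate_action_values(choices, rewards, n_actions=2, alpha=0.2, beta=3.0):
    """Delta-rule value estimates plus softmax choice probabilities for each trial.
    choices[t] is the action taken on trial t, rewards[t] the reward received."""
    Q = [0.0] * n_actions
    trajectory = []
    for a, r in zip(choices, rewards):
        exp_q = [math.exp(beta * q) for q in Q]
        probs = [e / sum(exp_q) for e in exp_q]   # predicted choice distribution this trial
        trajectory.append((list(Q), probs))       # estimates to regress against neural data
        Q[a] += alpha * (r - Q[a])                # update only the chosen action's value
    return trajectory
```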
The temporal profile of neural activity within a trial yields further insights. Lau
and Glimcher (2008) showed that dorsal striatal neurons that encoded the values of
available actions were more active before action execution, and those that encoded
the value of the chosen action were more active after action execution. The results
of these and other studies (Balleine et al, 2007; Kim et al, 2009) suggest that action
values of some form are represented in dorsal striatum and that such representations
participate in action selection and evaluation (though some lesion studies suggest
that dorsal striatum may not be needed for learning such values, Atallah et al 2007).
Analyses described in Samejima et al (2005) illustrate an approach that has been
growing more prominent over the past decade: to correlate neural activity not only
with variables that can be directly-observed (e.g., reward delivery or choice), but
also with variables thought to participate in learning and control according to theo-
ries and computational models (e.g., expected value) (Corrado and Doya, 2007; Daw
and Doya, 2006; Niv, 2009). This approach is especially useful in analyzing data
derived from functional magnetic resonance imaging (fMRI) methods (Gläscher and
O’Doherty, 2010; Montague et al, 2006; Haruno and Kawato, 2006), where the
activity of many brain areas can be recorded simultaneously, including in humans
engaged in complex cognitive tasks. Note that the precise relationship between the
measured signal (volume of oxygenated blood) and neural activity is not known
and the signal has a low temporal and spatial resolution relative to single neuron
recordings. That being said, analyses of the abundance of data can give us a better
idea of the overall interactions between brain areas.
Using fMRI, O’Doherty et al (2004) showed that, in a task in which the human
participant must choose an action from a set, dorsolateral striatum and ventral stria-
tum (where ventral means toward the bottom) exhibited TD error-like signals, while
in a task in which the participant had no choice, only ventral striatum exhibited such
a signal. These results suggest that dorsal and ventral striatum implement functions
analogous to the Actor and Critic (see also Barto 1995; Joel et al 2002; Montague
et al 2006), respectively, in the Actor-Critic architecture (Barto et al, 1983).
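For reference, here is a minimal tabular actor-critic in the spirit of Barto et al (1983): a single TD error, the putative DA-like signal, trains both the state values (the role attributed here to ventral striatum) and the action preferences (the role attributed to dorsal striatum). The parameter values and structure are illustrative only.

```python
import numpy as np

class ActorCritic:
    """Tabular actor-critic: one TD error trains both the state values (the 'critic')
    and the action preferences (the 'actor')."""

    def __init__(self, n_states, n_actions, alpha_v=0.1, alpha_p=0.1, gamma=0.99):
        self.V = np.zeros(n_states)                  # critic: state values
        self.pref = np.zeros((n_states, n_actions))  # actor: action preferences
        self.alpha_v, self.alpha_p, self.gamma = alpha_v, alpha_p, gamma

    def act(self, s):
        p = np.exp(self.pref[s] - self.pref[s].max())
        return np.random.choice(len(p), p=p / p.sum())

    def learn(self, s, a, r, s_next, done):
        target = r + (0.0 if done else self.gamma * self.V[s_next])
        delta = target - self.V[s]               # TD error: the DA-like teaching signal
        self.V[s] += self.alpha_v * delta        # critic update
        self.pref[s, a] += self.alpha_p * delta  # actor update
        return delta
```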
Recordings from single neurons in the ventral striatum of animals support this
suggestion to some degree. Information which can be used to evaluate behavior,
such as context (or state), some types of actions, and outcome, appears to be repre-
sented in ventral striatum (Ito and Doya, 2009; Roesch et al, 2009; Kim et al, 2009).
Roughly speaking, while dorsal striatum is more concerned with actions in general,
ventral striatum may participate in assigning value to stimuli, but it may also par-
ticipate in controlling some types of actions and play a more complicated role in
behavior (Yin et al, 2008; Humphries and Prescott, 2010; Nicola, 2007).
A model-free mechanism is thought to be implemented in a loop involving the dorsolateral striatum (DLS, also called the putamen), which receives cortical projections mainly from primary sensory, motor, and premotor cortices
(which will be referred to collectively as sensorimotor cortices, or SMC), providing
the DLS with basic sensory and movement information.
A model-based mechanism is thought to be implemented in a loop involving the
dorsomedial striatum (DMS, also called the caudate), which receives cortical pro-
jections mainly from prefrontal cortex (PFC). The PFC is on the front part of the
cortex and has reciprocal connections with many other cortical areas that mediate
abstract representations of sensations and movement. PFC neurons exhibit sustained
activity (working memory, Goldman-Rakic 1995) that allows them to temporarily
store information. RL mechanisms mediated by the DMS may determine which
past stimuli should be represented by sustained activity (O’Reilly and Frank, 2006).
Also, the PFC is thought to participate in the construction of a model of the envi-
ronment (Gläscher et al, 2010), allowing it to store predicted future events as well
(Mushiake et al, 2006; Matsumoto et al, 2003). Thus, the PFC can affect behavior
through a planning process and even override behavior suggested by other brain
areas (Miller and Cohen, 2001; Tanji and Hoshi, 2008).
The assignment of value to previously neutral stimuli may be implemented in a
loop involving the ventral striatum (VS, also called the nucleus accumbens), which
receives cortical projections mainly from the orbitofrontal cortex (OFC, the un-
derside part of the PFC that is just behind the forehead). The VS and the OFC
have connections with limbic areas, such as the amygdala and hypothalamus. These
structures, and VS and OFC, have been implicated in the processing of emotion,
motivations, and reward (Wallis, 2007; Cardinal et al, 2002; Mirolli et al, 2010).
Most theories of brain function that focus on the loop structure share elements of
the interpretation of Yin et al (2008) (though of course there are some differences)
(Samejima and Doya, 2007; Haruno and Kawato, 2006; Daw et al, 2005; Balleine
et al, 2009; Ashby et al, 2010; Balleine and O’Doherty, 2010; Pennartz et al, 2009;
Houk et al, 2007; Wickens et al, 2007; Redgrave et al, 2010; Mirolli et al, 2010;
Packard and Knowlton, 2002; Cohen and Frank, 2009). Common to all is the idea
that different loops implement different mechanisms that are useful for the types of
learning and control described in this chapter. Similar to how behavioral studies sug-
gest that different mechanisms dominate control at different points in learning (e.g.,
control is transferred from model-based to model-free mechanisms, as discussed in
Section 3.3), neuroscience studies suggest that brain structures associated with dif-
ferent loops dominate control at different points in learning (previous references and
Doyon et al 2009; Poldrack et al 2005).
For example, in humans learning to perform a sequence of movements, the PFC
(model-based mechanisms) dominates brain activity (as measured by fMRI) early in
learning while the striatum and SMC (model-free) dominates activity later (Doyon
et al, 2009). These results can be interpreted to suggest that decision-making mech-
anisms during model-based control lie predominantly within the PFC, while those
during model-free control lie predominantly at the cortico-striatal synapses to the
DLS. That is, control is transferred from cortical to BG selection mechanisms (Daw
et al, 2005; Niv et al, 2006b; Shah, 2008; Shah and Barto, 2009), in rough agree-
ment with experimental studies that suggest that the BG play a large role in encoding
motor skills and habitual behavior (Graybiel, 2008; Pennartz et al, 2009; Aldridge
and Berridge, 1998). However, other theories and experimental results suggest that
control is transferred in the opposite direction: the BG mediate trial and error learn-
ing early on (Pasupathy and Miller, 2005; Packard and Knowlton, 2002), but cor-
tical areas mediate habitual or skilled behavior (Ashby et al, 2007, 2010; Frank
and Claus, 2006; Matsuzaka et al, 2007). As discussed in Ashby et al (2010), part of
the discrepancy may be due to the use of different experimental methods, includ-
ing tasks performed by the animal and measures used to define habitual or skilled
behavior.
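As a rough computational caricature of the two kinds of control being contrasted (not any specific model from this literature), a model-free controller reads out a cached action value, whereas a model-based controller computes one by looking ahead through a learned transition model; all quantities below are assumed for the example.

```python
def model_free_value(Q, s, a):
    """Habit-like (model-free) control: read out a cached action value."""
    return Q[(s, a)]

def model_based_value(T, R, V, s, a, gamma=0.99):
    """Goal-directed (model-based) control: one-step lookahead through a learned model.
    T[(s, a)] maps candidate next states to estimated transition probabilities."""
    return sum(p * (R.get(s_next, 0.0) + gamma * V.get(s_next, 0.0))
               for s_next, p in T[(s, a)].items())

# Illustrative learned quantities (assumptions, not from the chapter):
T = {("s0", "press"): {"s1": 0.9, "s0": 0.1}}   # learned transition model
R = {"s1": 1.0}                                  # learned reward estimates
V = {"s1": 1.0}                                  # state values used for lookahead
Q = {("s0", "press"): 0.8}                       # cached (habitual) action values

print(model_based_value(T, R, V, "s0", "press"), model_free_value(Q, "s0", "press"))
```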
Finally, some research is also focusing on how control via the different loops is
coordinated. Behavior generated by mechanisms in one loop can be used to train
mechanisms of another, but there is also some communication between the loops
(Haber, 2003; Haber et al, 2006; Pennartz et al, 2009; Yin et al, 2008; Balleine
and O’Doherty, 2010; Mirolli et al, 2010; Joel and Weiner, 1994; Graybiel et al,
1994). Such communication may occur within the BG (Pennartz et al, 2009; Gray-
biel, 2008; Joel and Weiner, 1994; Graybiel et al, 1994) or via connections between
striatum and DA neurons (some of which are also part of the BG). The latter con-
nections are structured such that an area of the striatum projects to DA neurons that
send projections back to it and also to a neighboring area of striatum (Haber, 2003).
The pattern of connectivity forms a spiral where communication is predominantly
in one direction, suggestive of a hierarchical organization in which learned associ-
ations within the higher-level loop are used to train the lower-level loop (Haruno
and Kawato, 2006; Yin et al, 2008; Samejima and Doya, 2007). Again following an
interpretation similar to that of Yin et al (2008), the OFC-VS loop (SO association)
informs the PFC-DMS loop (AO), which informs the SMC-DLS loop (SR).
In addition, while DA affects neural activity directly, striatal activity is also shaped by the activities of interneu-
rons (neurons that project to other neurons within the same structure), which change
dramatically as an animal learns a task (Graybiel et al, 1994; Graybiel, 2008; Pen-
nartz et al, 2009), and the BG also affects behavior through recurrent connections
with subcortical structures (McHaffie et al, 2005). Reward-related activity in frontal
cortical areas (Schultz, 2006) is shaped not only through interactions with the BG,
but also by direct DA projections and interactions with other brain areas.
Computational mechanisms and considerations that were not discussed here but
are commonly used in RL and machine learning have analogs in the brain as well.
Psychological and neuroscientific research addresses topics such as behavior under
uncertainty, state abstraction, game theory, exploration versus exploitation, and hi-
erarchical behavior (Dayan and Daw, 2008; Doya, 2008; Gold and Shadlen, 2007;
Seger and Miller, 2010; Glimcher and Rustichini, 2004; Daw et al, 2006b; Yu and
Dayan, 2005; Wolpert, 2007; Botvinick et al, 2009; Grafton and Hamilton, 2007).
Determining how brain areas contribute to observed behavior is a very difficult en-
deavour. The approach discussed earlier—analyzing brain activity in terms of vari-
ables used in principled computational accounts (Corrado and Doya, 2007; Daw and
Doya, 2006; Niv, 2009)—is used for a variety of brain areas and accounts beyond
those discussed in this section. For example, the field of neuroeconomics—which
includes many ideas discussed in this chapter—investigates decision-making pro-
cesses and associated brain areas in humans engaged in economic games (Glimcher
and Rustichini, 2004; Glimcher, 2003).
Acknowledgements. I am grateful for comments from and discussions with Andrew Barto,
Tom Stafford, Kevin Gurney, and Peter Redgrave, comments from anonymous reviewers, and
financial support from the European Community’s 7th Framework Programme grant 231722
(IM-CLeVeR).
References
Aldridge, J.W., Berridge, K.C.: Coding of serial order by neostriatal neurons: a “natural
action” approach to movement sequence. The Journal of Neuroscience 18, 2777–2787
(1998)
Alexander, G.E., DeLong, M.R., Strick, P.L.: Parallel organization of functionally segregated
circuits linking basal ganglia and cortex. Annual Review of Neuroscience 9, 357–381
(1986)
Ashby, F.G., Ennis, J., Spiering, B.: A neurobiological theory of automaticity in perceptual
categorization. Psychological Review 114, 632–656 (2007)
Ashby, F.G., Turner, B.O., Horvitz, J.C.: Cortical and basal ganglia contributions to habit
learning and automaticity. Trends in Cognitive Sciences 14, 208–215 (2010)
Atallah, H.E., Lopez-Paniagua, D., Rudy, J.W., O’Reilly, R.C.: Separate neural substrates for
skill learning and performance in ventral and dorsal striatum. Nature Neuroscience 10,
126–131 (2007)
Balleine, B.W., O’Doherty, J.P.: Human and rodent homologies in action control: Corticos-
triatal determinants of goal-directed and habitual action. Neuropsychopharmacology 35,
48–69 (2010)
Balleine, B.W., Delgado, M.R., Hikosaka, O.: The role of the dorsal striatum in reward and
decision-making. The Journal of Neuroscience 27, 8161–8165 (2007)
Balleine, B.W., Liljeholm, M., Ostlund, S.B.: The integrative function of the basal ganglia in
instrumental conditioning. Behavioural Brain Research 199, 43–52 (2009)
Bar-Gad, I., Morris, G., Bergman, H.: Information processing, dimensionality reduction, and
reinforcement learning in the basal ganglia. Progress in Neurobiology 71, 439–473 (2003)
Barnes, T.D., Kubota, Y., Hu, D., Jin, D.Z., Graybiel, A.M.: Activity of striatal neurons re-
flects dynamic encoding and recoding of procedural memories. Nature 437, 1158–1161
(2005)
Daw, N.D., Doya, K.: The computational neurobiology of learning and reward. Current Opin-
ion in Neurobiology 16, 199–204 (2006)
Daw, N.D., Touretzky, D.S.: Long-term reward prediction in TD models of the dopamine
system. Neural Computation 14, 2567–2583 (2002)
Daw, N.D., Kakade, S., Dayan, P.: Opponent interactions between serotonin and dopamine.
Neural Networks 15, 603–616 (2002)
Daw, N.D., Niv, Y., Dayan, P.: Uncertainty-based competition between prefrontal and dorso-
lateral striatal systems for behavioral control. Nature Neuroscience 8, 1704–1711 (2005)
Daw, N.D., Courville, A.C., Tourtezky, D.S.: Representation and timing in theories of the
dopamine system. Neural Computation 18, 1637–1677 (2006a)
Daw, N.D., O’Doherty, J.P., Dayan, P., Seymour, B., Dolan, R.J.: Cortical substrates for ex-
ploratory decisions in humans. Nature 441, 876–879 (2006b)
Dayan, P., Daw, N.D.: Connections between computational and neurobiological perspec-
tives on decision making. Cognitive, Affective, and Behavioral Neuroscience 8, 429–453
(2008)
Dayan, P., Niv, Y.: Reinforcement learning: the good, the bad, and the ugly. Current Opinion
in Neurobiology 18, 185–196 (2008)
Dayan, P., Niv, Y., Seymour, B., Daw, N.D.: The misbehavior of value and the discipline of
the will. Neural Networks 19, 1153–1160 (2006)
Dickinson, A.: Actions and habits: the development of behavioural autonomy. Philosophical
Transactions of the Royal Society of London B: Biological Sciences 308, 67–78 (1985)
Dickinson, A., Balleine, B.W.: Motivational control of goal-directed action. Animal Learning
and Behavior 22, 1–18 (1994)
Doll, B.B., Frank, M.J.: The basal ganglia in reward and decision making: computational
models and empirical studies. In: Dreher, J., Tremblay, L. (eds.) Handbook of Reward and
Decision Making, ch. 19, pp. 399–425. Academic Press, Oxford (2009)
Dommett, E., Coizet, V., Blaha, C.D., Martindale, J., Lefebvre, V., Mayhew, N.W.J.E., Over-
ton, P.G., Redgrave, P.: How visual stimuli activate dopaminergic neurons at short latency.
Science 307, 1476–1479 (2005)
Doya, K.: What are the computations of the cerebellum, the basal ganglia, and the cerebral
cortex? Neural Networks 12, 961–974 (1999)
Doya, K.: Reinforcement learning: Computational theory and biological mechanisms. HFSP
Journal 1, 30–40 (2007)
Doya, K.: Modulators of decision making. Nature Neuroscience 11, 410–416 (2008)
Doyon, J., Bellec, P., Amsel, R., Penhune, V., Monchi, O., Carrier, J., Lehéricy, S., Benali,
H.: Contributions of the basal ganglia and functionally related brain structures to motor
learning. Behavioural Brain Research 199, 61–75 (2009)
Eckerman, D.A., Hienz, R.D., Stern, S., Kowlowitz, V.: Shaping the location of a pigeon’s
peck: Effect of rate and size of shaping steps. Journal of the Experimental Analysis of
Behavior 33, 299–310 (1980)
Ferster, C.B., Skinner, B.F.: Schedules of Reinforcement. Appleton-Century-Crofts, New
York (1957)
Fiorillo, C.D., Tobler, P.N., Schultz, W.: Discrete coding of reward probability and uncertainty
by dopamine neurons. Science 299, 1898–1902 (2003)
Frank, M.J.: Dynamic dopamine modulation in the basal ganglia: a neurocomputational ac-
count of cognitive deficits in medicated and nonmedicated Parkinsonism. Journal of Cog-
nitive Neuroscience 17, 51–72 (2005)
Haber, S.N.: The primate basal ganglia: Parallel and integrative networks. Journal of Chemi-
cal Neuroanatomy 26, 317–330 (2003)
Haber, S.N., Kim, K.S., Mailly, P., Calzavara, R.: Reward-related cortical inputs define a
large striatal region in primates that interface with associative cortical inputs, providing a
substrate for incentive-based learning. The Journal of Neuroscience 26, 8368–8376 (2006)
Haruno, M., Kawato, M.: Heterarchical reinforcement-learning model for integration of mul-
tiple cortico-striatal loops: fMRI examination in stimulus-action-reward association learn-
ing. Neural Networks 19, 1242–1254 (2006)
Hazy, T.E., Frank, M.J., O’Reilly, R.C.: Neural mechanisms of acquired phasic dopamine
repsonses in learning. Neuroscience and Biobehavioral Reviews 34, 701–720 (2010)
Herrnstein, R.J.: Relative and absolute strength of response as a function of frequency of
reinforcement. Journal of the Experimental Analysis of Behavior 4, 267–272 (1961)
Hikosaka, O.: Basal ganglia mechanisms of reward-oriented eye movement. Annals of the
New York Academy of Science 1104, 229–249 (2007)
Hollerman, J.R., Schultz, W.: Dopamine neurons report an error in the temporal prediction of
reward during learning. Nature Neuroscience 1, 304–309 (1998)
Horvitz, J.C.: Mesolimbocortical and nigrostriatal dopamine responses to salient non-reward
events. Neuroscience 96, 651–656 (2000)
Houk, J.C., Wise, S.P.: Distributed modular architectures linking basal ganglia, cerebellum,
and cerebral cortex: Their role in planning and controlling action. Cerebral Cortex 5, 95–
110 (1995)
Houk, J.C., Adams, J.L., Barto, A.G.: A model of how the basal ganglia generate and use
neural signals that predict reinforcement. In: Houk, J.C., Davis, J.L., Beiser, D.G. (eds.)
Models of Information Processing in the Basal Ganglia, ch. 13, pp. 249–270. MIT Press,
Cambridge (1995)
Houk, J.C., Bastianen, C., Fansler, D., Fishbach, A., Fraser, D., Reber, P.J., Roy, S.A., Simo,
L.S.: Action selection and refinement in subcortical loops through basal ganglia and cere-
bellum. Philosophical Transactions of the Royal Society of London B: Biological Sci-
ences 362, 1573–1583 (2007)
Hull, C.L.: Principles of Behavior. Appleton-Century-Crofts, New York (1943)
Humphries, M.D., Prescott, T.J.: The ventral basal ganglia, a selection mechanism at the
crossroads of space, strategy, and reward. Progress in Neurobiology 90, 385–417 (2010)
Ito, M., Doya, K.: Validation of decision-making models and analysis of decision variables
in the rat basal ganglia. The Journal of Neuroscience 29, 9861–9874 (2009)
Joel, D., Weiner, I.: The organization of the basal ganglia-thalamocortical circuits: Open in-
terconnected rather than closed segregated. Neuroscience 63, 363–379 (1994)
Joel, D., Niv, Y., Ruppin, E.: Actor-critic models of the basal ganglia: New anatomical and
computational perspectives. Neural Networks 15, 535–547 (2002)
Joshua, M., Adler, A., Bergman, H.: The dynamics of dopamine in control of motor behavior.
Current Opinion in Neurobiology 19, 615–620 (2009)
Kamin, L.J.: Predictability, surprise, attention, and conditioning. In: Campbell, B.A., Church,
R.M. (eds.) Punishment and Aversive Behavior, pp. 279–296. Appleton-Century-Crofts,
New York (1969)
Kehoe, E.J., Schreurs, B.G., Graham, P.: Temporal primacy overrides prior training in serial
compound conditioning of the rabbit’s nictitating membrane response. Animal Learning
and Behavior 15, 455–464 (1987)
Kim, H., Sul, J.H., Huh, N., Lee, D., Jung, M.W.: Role of striatum in updating values of
chosen actions. The Journal of Neuroscience 29, 14,701–14,712 (2009)
Kishida, K.T., King-Casas, B., Montague, P.R.: Neuroeconomic approaches to mental disor-
ders. Neuron 67, 543–554 (2010)
Klopf, A.H.: The Hedonistic Neuron: A Theory of Memory, Learning and Intelligence. Hemi-
sphere Publishing Corporation, Washington DC (1982)
Kobayashi, S., Schultz, W.: Influence of reward delays on responses of dopamine neurons.
The Journal of Neuroscience 28, 7837–7846 (2008)
Konidaris, G.D., Barto, A.G.: Skill discovery in continuous reinforcement learning domains
using skill chaining. In: Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C.K.I., Cu-
lotta, A. (eds.) Advances in Neural Information Processing Systems (NIPS), vol. 22, pp.
1015–1023. MIT Press, Cambridge (2009)
Lau, B., Glimcher, P.W.: Value representations in the primate striatum during matching be-
havior. Neuron 58, 451–463 (2008)
Ljungberg, T., Apicella, P., Schultz, W.: Responses of monkey dopamine neurons during
learning of behavioral reactions. Journal of Neurophysiology 67, 145–163 (1992)
Ludvig, E.A., Sutton, R.S., Kehoe, E.J.: Stimulus representation and the timing of reward-
prediction errors in models of the dopamine system. Neural Computation 20, 3034–3054
(2008)
Maia, T.V.: Reinforcement learning, conditioning, and the brain: Successes and challenges.
Cognitive, Affective, and Behavioral Neuroscience 9, 343–364 (2009)
Maia, T.V., Frank, M.J.: From reinforcement learning models to psychiatric and neurobiolog-
ical disorders. Nature Neuroscience 14, 154–162 (2011)
Matsumoto, K., Suzuki, W., Tanaka, K.: Neuronal correlates of goal-based motor selection in
the prefrontal cortex. Science 301, 229–232 (2003)
Matsuzaka, Y., Picard, N., Strick, P.: Skill representation in the primary motor cortex after
long-term practice. Journal of Neurophysiology 97, 1819–1832 (2007)
McHaffie, J.G., Stanford, T.R., Stein, B.E., Coizet, V., Redgrave, P.: Subcortical loops through
the basal ganglia. Trends in Neurosciences 28, 401–407 (2005)
Middleton, F.A., Strick, P.L.: Basal-ganglia “projections” to the prefrontal cortex of the pri-
mate. Cerebral Cortex 12, 926–935 (2002)
Miller, E.K., Cohen, J.D.: An integrative theory of prefrontal cortex function. Annual Review
of Neuroscience 24, 167–202 (2001)
Miller, J.D., Sanghera, M.K., German, D.C.: Mesencephalic dopaminergic unit activity in the
behaviorally conditioned rat. Life Sciences 29, 1255–1263 (1981)
Mink, J.W.: The basal ganglia: Focused selection and inhibition of competing motor pro-
grams. Progress in Neurobiology 50, 381–425 (1996)
Mirolli, M., Mannella, F., Baldassarre, G.: The roles of the amygdala in the affective regula-
tion of body, brain, and behaviour. Connection Science 22, 215–245 (2010)
Montague, P.R., Dayan, P., Sejnowski, T.J.: A framework for mesencephalic dopamine sys-
tems based on predictive Hebbian learning. Journal of Neuroscience 16, 1936–1947
(1996)
Montague, P.R., Hyman, S.E., Cohen, J.D.: Computational roles for dopamine in behavioural
control. Nature 431, 760–767 (2004)
Montague, P.R., King-Casas, B., Cohen, J.D.: Imaging valuation models in human choice.
Annual Review of Neuroscience 29, 417–448 (2006)
Moore, J.W., Choi, J.S.: Conditioned response timing and integration in the cerebellum.
Learning and Memory 4, 116–129 (1997)
Morris, G., Nevet, A., Arkadir, D., Vaadia, E., Bergman, H.: Midbrain dopamine neurons
encode decisions for future action. Nature Neuroscience 9, 1057–1063 (2006)
Mushiake, H., Saito, N., Sakamoto, K., Itoyama, Y., Tanji, J.: Activity in the lateral prefrontal
cortex reflects multiple steps of future events in action plans. Neuron 50, 631–641 (2006)
Nakahara, H., Itoh, H., Kawagoe, R., Takikawa, Y., Hikosaka, O.: Dopamine neurons can
represent context-dependent prediction error. Neuron 41, 269–280 (2004)
Ng, A., Harada, D., Russell, S.: Policy invariance under reward transformations: theory and
applications to reward shaping. In: Proceedings of the Sixteenth International Conference
on Machine Learning, pp. 278–287 (1999)
Nicola, S.M.: The nucleus accumbens as part of a basal ganglia action selection circuit. Psy-
chopharmacology 191, 521–550 (2007)
Niv, Y.: Reinforcement learning in the brain. Journal of Mathematical Psychology 53, 139–
154 (2009)
Niv, Y., Duff, M.O., Dayan, P.: Dopamine, uncertainty, and TD learning. Behavioral and
Brain Functions 1, 6 (2005)
Niv, Y., Daw, N.D., Dayan, P.: Choice values. Nature Neuroscience 9, 987–988 (2006a)
Niv, Y., Joel, D., Dayan, P.: A normative perspective on motivation. Trends in Cognitive
Sciences 10, 375–381 (2006b)
Nomoto, K., Schultz, W., Watanabe, T., Sakagami, M.: Temporally extended dopamine re-
sponses to perceptually demanding reward-predictive stimuli. The Journal of Neuro-
science 30, 10,692–10,702 (2010)
O’Doherty, J.P., Dayan, P., Schultz, J., Deichmann, R., Friston, K., Dolan, R.J.: Dissociable
roles of ventral and dorsal striatum in instrumental conditioning. Science 304, 452–454
(2004)
Olds, J., Milner, P.: Positive reinforcement produced by electrical stimulation of septal area
and other regions of rat brain. Journal of Comparative and Physiological Psychology 47,
419–427 (1954)
O’Reilly, R.C., Frank, M.J.: Making working memory work: a computational model of learn-
ing in the prefrontal cortex and basal ganglia. Neural Computation 18, 283–328 (2006)
Packard, M.G., Knowlton, B.J.: Learning and memory functions of the basal ganglia. Annual
Review of Neuroscience 25, 563–593 (2002)
Pasupathy, A., Miller, E.K.: Different time courses of learning-related activity in the pre-
frontal cortex and striatum. Nature 433, 873–876 (2005)
Pavlov, I.P.: Conditioned Reflexes: An Investigation of the Physiological Activity of the Cere-
bral Cortex. Oxford University Press, Toronto (1927)
Pennartz, C.M., Berke, J.D., Graybiel, A.M., Ito, R., Lansink, C.S., van der Meer, M., Re-
dish, A.D., Smith, K.S., Voorn, P.: Corticostriatal interactions during learning, memory
processing, and decision making. The Journal of Neuroscience 29, 12,831–12,838 (2009)
Pessiglione, M., Seymour, B., Flandin, G., Dolan, R.J., Frith, C.D.: Dopamine-dependent
prediction errors underpin reward-seeking behaviour in humans. Nature 442, 1042–1045
(2006)
Phelps, E.A., LeDoux, J.E.: Contributions of the amygdala to emotion processing: From ani-
mal models to human behavior. Neuron 48, 175–187 (2005)
Poldrack, R.A., Sabb, F.W., Foerde, K., Tom, S.M., Asarnow, R.F., Bookheimer, S.Y.,
Knowlton, B.J.: The neural correlates of motor skill automaticity. The Journal of Neu-
roscience 25, 5356–5364 (2005)
Pompilio, L., Kacelnik, A.: State-dependent learning and suboptimal choice: when starlings
prefer long over short delays to food. Animal Behaviour 70, 571–578 (2005)
Redgrave, P., Gurney, K.: The short-latency dopamine signal: a role in discovering novel
actions? Nature Reviews Neuroscience 7, 967–975 (2006)
Redgrave, P., Gurney, K., Reynolds, J.: What is reinforced by phasic dopamine signals? Brain
Research Reviews 58, 322–339 (2008)
Redgrave, P., Rodriguez, M., Smith, Y., Rodriguez-Oroz, M.C., Lehericy, S., Bergman, H.,
Agid, Y., DeLong, M.R., Obeso, J.A.: Goal-directed and habitual control in the basal
ganglia: implications for Parkinson’s disease. Nature Reviews Neuroscience 11, 760–772
(2010)
Redish, A.D., Jensen, S., Johnson, A.: A unified framework for addiction: Vulnerabilities in
the decision process. Behavioral and Brain Sciences 31, 415–487 (2008)
Rescorla, R.A., Wagner, A.R.: A theory of pavlovian conditioning: Variations in the effective-
ness of reinforcement and nonreinforcement. In: Black, A.H., Prokasy, W.F. (eds.) Classi-
cal Conditioning II: Current Research and Theory, pp. 64–99. Appleton-Century-Crofts,
New York (1972)
Richardson, W.K., Warzak, W.J.: Stimulus stringing by pigeons. Journal of the Experimental
Analysis of Behavior 36, 267–276 (1981)
Roesch, M.R., Calu, D.J., Schoenbaum, G.: Dopamine neurons encode the better option
in rats deciding between differently delayed or sized rewards. Nature Neuroscience 10,
1615–1624 (2007)
Roesch, M.R., Singh, T., Brown, P.L., Mullins, S.E., Schoenbaum, G.: Ventral striatal neurons
encode the value of the chosen action in rats deciding between differently delayed or sized
rewards. The Journal of Neuroscience 29, 13,365–13,376 (2009)
Samejima, K., Doya, K.: Multiple representations of belief states and action values in cor-
ticobasal ganglia loops. Annals of the New York Academy of Sciences 1104, 213–228
(2007)
Samejima, K., Ueda, Y., Doya, K., Kimura, M.: Representation of action-specific reward
values in the striatum. Science 310, 1337–1340 (2005)
Satoh, T., Nakai, S., Sato, T., Kimura, M.: Correlated coding of motivation and outcome of
decision by dopamine neurons. The Journal of Neuroscience 23, 9913–9923 (2003)
Schultz, W.: Responses of midbrain dopamine neurons to behavioral trigger stimuli in the
monkey. Journal of Neurophysiology 56, 1439–1461 (1986)
Schultz, W.: Predictive reward signal of dopamine neurons. Journal of Neurophysiology 80,
1–27 (1998)
Schultz, W.: Behavioral theories and the neurophysiology of reward. Annual Review of Psy-
chology 57, 87–115 (2006)
Schultz, W.: Multiple dopamine functions at different time courses. Annual Review of Neu-
roscience 30, 259–288 (2007)
Schultz, W.: Dopamine signals for reward value and risk: basic and recent data. Behavioral
and Brain Functions 6, 24 (2010)
Schultz, W., Apicella, P., Ljungberg, T.: Responses of monkey dopamine neurons to reward
and conditioned stimuli during successive steps of learning a delayed response task. The
Journal of Neuroscience 13, 900–913 (1993)
Schultz, W., Dayan, P., Montague, P.R.: A neural substrate of prediction and reward. Sci-
ence 275, 1593–1599 (1997)
Schultz, W., Tremblay, L., Hollerman, J.R.: Changes in behavior-related neuronal activity in
the striatum during learning. Trends in Neuroscience 26, 321–328 (2003)
Seger, C.A., Miller, E.K.: Category learning in the brain. Annual Review of Neuroscience 33,
203–219 (2010)
Selfridge, O.J., Sutton, R.S., Barto, A.G.: Training and tracking in robotics. In: Joshi, A.
(ed.) Proceedings of the Ninth International Joint Conference on Artificial Intelligence,
pp. 670–672. Morgan Kaufmann, San Mateo (1985)
Shah, A.: Biologically-based functional mechanisms of motor skill acquisition. PhD thesis,
University of Massachusetts Amherst (2008)
Shah, A., Barto, A.G.: Effect on movement selection of an evolving sensory representation:
A multiple controller model of skill acquisition. Brain Research 1299, 55–73 (2009)
Shanks, D.R., Tunney, R.J., McCarthy, J.D.: A re-examination of probability matching and
rational choice. Journal of Behavioral Decision Making 15, 233–250 (2002)
Siegel, S., Goldstein, D.A.: Decision making behaviour in a two-choice uncertain outcome
situation. Journal of Experimental Psychology 57, 37–42 (1959)
Skinner, B.F.: The Behavior of Organisms. Appleton-Century-Crofts, New York (1938)
Staddon, J.E.R., Cerutti, D.T.: Operant behavior. Annual Review of Psychology 54, 115–144
(2003)
Sutton, R.S.: Learning to predict by methods of temporal differences. Machine Learning 3,
9–44 (1988)
Sutton, R.S., Barto, A.G.: Toward a modern theory of adaptive networks: Expectation and
prediction. Psychological Review 88, 135–170 (1981)
Sutton, R.S., Barto, A.G.: A temporal-difference model of classical conditioning. In: Pro-
ceedings of the Ninth Annual Conference of the Cognitive Science Society, pp. 355–378
(1987)
Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge
(1998)
Tanji, J., Hoshi, E.: Role of the lateral prefrontal cortex in executive behavioral control. Phys-
iological Reviews 88, 37–57 (2008)
Thorndike, E.L.: Animal Intelligence: Experimental Studies. Macmillan, New York (1911)
Tindell, A.J., Berridge, K.C., Zhang, J., Pecina, S., Aldridge, J.W.: Ventral pallidal neurons
code incentive motivation: Amplification by mesolimbic sensitization and amphetamine.
European Journal of Neuroscience 22, 2617–2634 (2005)
Tobler, P.N., Dickinson, A., Schultz, W.: Coding of predicted reward omission by dopamine
neurons in a conditioned inhibition paradigm. The Journal of Neuroscience 23, 10,402–
10,410 (2003)
Tobler, P.N., Fiorillo, C.D., Schultz, W.: Adaptive coding of reward value by dopamine neu-
rons. Science 307, 1642–1645 (2005)
Tolman, E.C.: Cognitive maps in rats and men. The Psychological Review 55, 189–208
(1948)
Tolman, E.C.: There is more than one kind of learning. Psychological Review 56, 44–55
(1949)
Waelti, P., Dickinson, A., Schultz, W.: Dopamine responses comply with basic assumptions
of formal learning theory. Nature 412, 43–48 (2001)
Wallis, J.D.: Orbitofrontal cortex and its contribution to decision-making. Annual Review of
Neuroscience 30, 31–56 (2007)
Watson, J.B.: Behavior: An Introduction to Comparative Psychology. Holt, New York (1914)
Wickens, J.R.: Synaptic plasticity in the basal ganglia. Behavioural Brain Research 199, 119–
128 (2009)
Wickens, J.R., Budd, C.S., Hyland, B.I., Arbuthnott, G.W.: Striatal contributions to reward
and decision making. Making sense of regional variations in a reiterated processing ma-
trix. Annals of the New York Academy of Sciences 1104, 192–212 (2007)
Widrow, B., Hoff, M.E.: Adaptive switching circuits. In: 1960 WESCON Convention Record
Part IV, pp. 96–104. Institute of Radio Engineers, New York (1960)
Wilson, C.J.: Basal ganglia. In: Shepherd, G.M. (ed.) The Synaptic Organization of the Brain,
ch. 9, 5th edn., pp. 361–414. Oxford University Press, Oxford (2004)
Wise, R.A.: Dopamine, learning and motivation. Nature Reviews Neuroscience 5, 483–494
(2004)
Wolpert, D.: Probabilistic models in human sensorimotor control. Human Movement Sci-
ence 27, 511–524 (2007)
Wörgötter, F., Porr, B.: Temporal sequence learning, prediction, and control: A review of
different models and their relation to biological mechanisms. Neural Computation 17,
245–319 (2005)
Wrase, J., Kahnt, T., Schlagenhauf, F., Beck, A., Cohen, M.X., Knutson, B., Heinz, A.: Dif-
ferent neural systems adjust motor behavior in response to reward and punishment. Neu-
roImage 36, 1253–1262 (2007)
Wyvell, C.L., Berridge, K.C.: Intra-accumbens amphetamine increases the conditioned in-
centive salience of sucrose reward: Enhancement of reward “wanting” without enhanced
“liking” or response reinforcement. Journal of Neuroscience 20, 8122–8130 (2000)
Yin, H.H., Ostlund, S.B., Balleine, B.W.: Reward-guided learning beyond dopamine in the
nucleus accumbens: the integrative functions of cortico-basal ganglia networks. European
Journal of Neuroscience 28, 1437–1448 (2008)
Yu, A., Dayan, P.: Uncertainty, neuromodulation and attention. Neuron 46, 681–692 (2005)
Chapter 17
Reinforcement Learning in Games
István Szita
Abstract. Reinforcement learning and games have a long and mutually beneficial
common history. On the one hand, games are rich and challenging domains for testing
reinforcement learning algorithms. On the other hand, in several games the
best computer players use reinforcement learning. The chapter begins with a se-
lection of games and notable reinforcement learning implementations. Without any
modifications, the basic reinforcement learning algorithms are rarely sufficient for
high-level gameplay, so it is essential to discuss the additional ideas, ways of insert-
ing domain knowledge, and implementation decisions that are necessary for scaling up.
These are reviewed in sufficient detail to understand their potentials and their limi-
tations. The second part of the chapter lists challenges for reinforcement learning in
games, together with a review of proposed solution methods. While this listing has
a game-centric viewpoint, and some of the items are specific to games (like oppo-
nent modelling), a large portion of this overview can provide insight for other kinds
of applications, too. In the third part we review how reinforcement learning can be
useful in game development and find its way into commercial computer games. Fi-
nally, we provide pointers for more in-depth reviews of specific games and solution
approaches.
17.1 Introduction
Reinforcement learning (RL) and games have a long and fruitful common his-
tory. Samuel’s Checkers player, one of the first learning programs ever, already had
the principles of temporal difference learning, decades before temporal difference
learning (TD) was described and analyzed. And it was another game, Backgam-
mon, where reinforcement learning reached its first big success, when Tesauro’s
István Szita
University of Alberta, Canada
e-mail: [email protected]
TD-Gammon reached and exceeded the level of top human players – and did so en-
tirely by learning on its own. Since then, RL has been applied to many other games,
and while it could not repeat the success of TD-Gammon in every game, there are
many promising results and many lessons to be learned. We hope to present these
in this chapter, both in classical games and in computer games: real-time strat-
egy games, first-person shooters, role-playing games. Most notably, reinforcement
learning approaches seem to have the upper hand in one of the flagship applications
of artificial intelligence research, Go.
From a different point of view, games are an excellent testbed for RL research.
Games are designed to entertain, amuse and challenge humans, so, by studying
games, we can (hopefully) learn about human intelligence, and the challenges that
human intelligence needs to solve. At the same time, games are challenging domains
for RL algorithms as well, probably for the same reason they are so for humans:
they are designed to involve interesting decisions. The types of challenges span
a wide range, and we aim to present a representative selection of these challenges,
together with RL approaches to tackle them.
One goal of this chapter is to collect notable RL applications to games. But there is
another, far more important goal: to get an idea how (and why) RL algorithms work
(or fail) in practice. Most of the algorithms mentioned in the chapter are described
in detail in other parts of the book. Their theoretical analysis (if it exists) gives us
guarantees that they work under ideal conditions, but these are impractical for most
games: the conditions are too restrictive, the statements are too loose, or both. For example, we
know that TD-learning1 converges to an optimal policy if the environment is a finite
Markov decision process (MDP), values of each state are stored individually, learn-
ing rates are decreasing in a proper manner, and exploration is sufficient. Most of
these conditions are violated in a typical game application. Yet, TD-learning works
phenomenally well for Backgammon and not at all for other games (for example,
Tetris). There is a rich literature of attempts to identify game attributes that make
TD-learning and other RL algorithms perform well. We think that an overview of
these attempts is pretty helpful for future developments.
In any application of RL, the choice of algorithm is just one among many factors
that determine success or failure. Oftentimes the choice of algorithm is not even the
most significant factor: the choice of representation, formalization, the encoding of
domain knowledge, additional heuristics and variations, proper setting of parameters
can all have great influence. For each of these issues, we can find exciting ideas that
have been developed for conquering specific games. Sadly (but not surprisingly),
there is no “magic bullet”: all approaches are more-or-less game- or genre-specific.
1 In theoretical works, TD-learning refers to a policy evaluation method, while in the games-
related literature, it is used in the sense “actor-critic learning with a critic using TD-
learning”.
Still, we believe that studying these ideas can give us useful insights, and some of the findings can be generalized to other applications.
The vast variety of games makes the topic diverse and hard to organize. To keep
coherence, we begin with an in-depth overview of reinforcement learning in a few well-
studied games: Backgammon, Chess, Go, Poker, Tetris and real-time strategy games.
These games can be considered as case studies, introducing many different issues
of RL. Section 17.3 summarizes some of the notable issues and challenges. Section
17.4 is probably the part that deserves the most attention, as it surveys the practical
uses of RL in games. Finally, Section 17.5 contains pointers for further reading and
our conclusions.
17.1.2 Scope
In a chapter about RL applications in games, one has to draw boundaries along two
dimensions: what kind of algorithms fit in? And what kind of games fit in? Neither
decision is an easy one.
It is clear that TD-Gammon has a place here (and Deep Blue probably does not).
But, for example, should the application of UCT to Go be here? We could treat it un-
der the umbrella of planning, or even game theory, though UCT certainly has strong
connections to RL. Other applications have a fuzzy status (another good example is the use of evolutionary methods for RL tasks). Following the philosophy of the book,
we consider “RL applications” in a broad sense, including all approaches that take
inspiration from elements of reinforcement learning. However, we had to leave out
some important games, for example Poker, Bridge or the PSPACE-hard Soko-ban, or the theory of general-sum games, which inspired many results in AI research, but most of which is not generally considered RL.
As for games, we do not wish to exclude any actual game that people play, so computer games and modern board games definitely have a place here. Nevertheless, the
chapter will be inescapably biased towards “classical” games, simply because of a
larger body of research results available.
In this section we overview the RL-related results for several games. We focus on
the aspects that are relevant from the point of view of RL. So, for example, we
explain game rules only to the level of detail that is necessary for understanding the
particular RL issues.
17.2.1 Backgammon
In Backgammon, two players compete to remove their stones from the board as
quickly as possible. The board has 24 fields organized linearly, and each player
starts with 15 stones, arranged symmetrically. Figure 17.1 shows the board with the
starting position. Players move their stones in opposite directions. If a stone reaches
the end of the track, it can be removed, but only if all the player’s stones are already
in the last quarter. Players take turns, and they roll two dice each turn. The amount of
movement is determined by the dice, but the players decide which stones to move.
Single stones can be hit (and have to start the track from the beginning), but the
player cannot move to a field with two or more opponent stones.2
Fig. 17.1 Backgammon board with the initial game setup, two sets of dice and the doubling
cube
Basic strategies include blocking of key positions, building long blocks that are
hard or impossible to jump through, and racing to the end. On a higher level, players
need to estimate the probabilities of various events fairly accurately (among other
things). Another element of strategy is added by the “doubling cube”: any player
can offer to double the stakes of the game. The other player has two options: he
may accept the doubling, and get the cube and the right to offer the next doubling,
or he may give up the game, losing the current stakes only. Knowing when to offer
doubling and when to accept is the hardest part of Backgammon.
Traditional search algorithms are not effective for Backgammon because of its
chance element: for each turn, there are 21 possible outcomes for the dice rolls,
with 20 legal moves on average for each of them. This gives a branching factor over
400, making deep lookahead search impossible.
2 For a detailed description of rules, please refer to one of the many backgammon resources
on the internet, for example, http://www.play65.com/Backgammon.html
17.2.1.1 TD-Gammon
The first version of TD-Gammon (Tesauro, 1992, 2002) used a neural network as a
position evaluator. The input of the network was an encoding of the game position,
and its output was the value of the position.3 After rolling the dice, the program tried
all legal moves and evaluated the resulting positions, then took the highest rated
move. The network did not consider the possibility of doubling; that was handled by a separate heuristic formula. Rewards were +1 for winning and 0 for losing,4 so
output values approximated the probability of winning. For each player, each field of the board was represented by four neurons, using a truncated unary encoding (input k was 1 if the field had at least k stones, and 0 otherwise). Later versions of TD-
Gammon also used inputs that encoded expert features, e.g., “pip count”, a heuristic
progress measure (the total distance of pieces from the goal).
After each step, the output of the network was adjusted according to the TD rule, and the weights were updated with the backpropagation rule. The network was trained by self-play, with no explicit exploration.
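For illustration, a minimal Python sketch of such a truncated unary encoding and of one TD-style update of a differentiable evaluator is given below; the network interface (predict, gradient, weights) and the learning rate are hypothetical stand-ins, not Tesauro's actual implementation.

import numpy as np

def encode_position(white_counts, black_counts):
    # Truncated unary encoding: for each of the 24 fields and each player,
    # unit k (k = 1..4) is 1 if the field holds at least k stones of that player.
    # (Only the raw-board part of the input; the real network had further units,
    # e.g. for stones on the bar and borne-off stones.)
    features = []
    for counts in (white_counts, black_counts):
        for c in counts:                                   # 24 fields per player
            features.extend(1.0 if c >= k else 0.0 for k in range(1, 5))
    return np.array(features)

def td0_step(value_net, x_t, x_next, reward, gamma=1.0, alpha=0.1):
    # One TD(0) update of a position evaluator; value_net is assumed to expose
    # predict(x), gradient(x) (backpropagation) and a flat weight vector.
    v_t = value_net.predict(x_t)
    target = reward if x_next is None else gamma * value_net.predict(x_next)
    td_error = target - v_t
    value_net.weights += alpha * td_error * value_net.gradient(x_t)
    return td_error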
TD-Gammon did surprisingly well. Tesauro’s previous backgammon program,
Neurogammon, used neural networks with the same architecture, but was trained
on samples labeled by human backgammon experts. TD-Gammon became signif-
icantly stronger than Neurogammon, and it was on par with Neurogammon even
when handicapped, receiving only the raw table representation as input, without the
features encoding domain-knowledge. In later versions, Tesauro increased the num-
ber of hidden neurons, the number of training games (up to 1,500,000), added 3-ply
lookahead, and improved the representation. With these changes, TD-Gammon 3.0
became a world-class player.
Since then, TD-Gammon has retired, but TD(λ ) is still the basis of today’s
strongest Backgammon-playing programs, Snowie, Jellyfish, Bgblitz and GNUbg.
Little is known about the details because all of these programs except GNUbg are
closed-source commercial products (and the techniques of GNUbg are not well-
documented either). There seems to be a consensus that Snowie is the strongest of
these, but no reliable rankings exist. Comparing to humans is not easy because of
the high variance of the outcomes (an indicative example: in the 100-game tour-
nament between TD-Gammon and Malcolm Davis in 1998, TD-Gammon won 99
games, yet still lost the tournament because it lost so many points in its single
lost game). As an alternative to many-game tournaments, rollout analysis is widely used to compare players: a computer Backgammon program estimates the best move for each step of a recorded game by simulating thousands of games from the current state, and calculates how much worse the player’s actual choice was. According to
3 Actually, there were four output neurons, one for each combination of the players hav-
ing/not having the situation called “gammon”, when one player wins before the other
starts to remove his own stones. For the sake of simplicity, we consider only the case
when neither player has a gammon.
4 Again, the situation considered by Tesauro was slightly more complicated: certain victories
in Backgammon are worth +2 points, or even +3 in rare cases; furthermore, the point
values can be doubled several times.
rollout analysis, Backgammon programs have exceeded the playing strength of the
best human players (Tesauro, 2002).
The success of TD-Gammon definitely had a great role in making TD(λ ) popu-
lar. TD-learning has found many applications, with many games among them (see
Ghory, 2004, for an overview). In many of these games, TD could not repeat the
same level of success, which makes the performance of TD-Gammon even more
surprising. TD-learning in Backgammon has to deal with the RL problem in its
full generality: there are stochastic transitions, stochastic and delayed rewards and a
huge state space. Because of self-play, the environment is not really an MDP (the op-
ponent changes continuously). Even more troubling is the use of nonlinear function approximation. Furthermore, TD-Gammon was not doing any exploration: it always chose the greedy action, potentially resulting in poor performance. Many authors
have tried to identify the reasons why TD works so well in Backgammon, including
Tesauro (1995, 1998), Pollack and Blair (1997), Ghory (2004) and Wiering (2010).
We try to summarize their arguments here.
Representation. As emphasized in the opening chapter of the book, the proper
choice of representation is crucial to any RL application. Tesauro claimed that
the state representation of TD-Gammon is “knowledge-free”, as it could have been
formed without any knowledge of the game. On the other hand, the truncated unary
representation seems to fit Backgammon well, just like the natural board representation (one unit per field and player). And of course, the addition of more in-
formative features increased playing strength considerably. The suitability of repre-
sentation is supported by the fact that Neurogammon, which used the same features,
but no RL, was also a reasonably good computer player.
Randomness. Every source agrees that randomness in the game makes the learning
task easier. Firstly, it smooths the value function: the values of “similar” states are close to each other, because random fluctuations can mask the differences. Func-
tion approximation works better for smooth target functions. Secondly, randomness
helps solve the exploration-exploitation dilemma, or in this case, to evade it com-
pletely (in TD-Gammon, both learning agents choose the greedy action all the time).
Because of the effect of dice, the players will visit large parts of the state space
without any extra effort spent on exploration. As a third factor, human players are
reportedly not very good in estimating probabilities, which makes them easier to
defeat.
Training regime. TD-Gammon was trained by playing against itself. In general, this
can be both good and bad: the agent plays against an opponent of the same level,
but, on the other hand, learning may get stuck and the agent may specialize to a
strategy against a weak player (itself), as it never gets to see a strong opponent. The
pros and cons of self-play will be analyzed in detail in Section 17.3.3.1, but we note here that the “automatic exploration” property of Backgammon helps prevent
learning from getting stuck: even if the opponent is weak, it may arrive at a strong position by luck, which creates a driving force for the learner to catch up.
Randomness and “automatic exploration” make Backgammon quite unique among games. Pollack and Blair (1997) actually argued that these special properties of Backgammon contribute more to the success than TD-learning and neural networks do. They trained a linear architecture with simple gradient descent, which reached reasonably good performance. Tesauro (1998) claims, on the other hand, that this
is insufficient, as neural networks have actually discovered important features that
linear architectures cannot capture.
Parameters. The parameter λ of TD(λ) determines the amount of bootstrapping. There is a general consensus that intermediate values of λ work better than either extreme, 0 (full bootstrapping) or 1 (no bootstrapping). Tesauro used λ = 0.7 for the initial experiments, but later switched to λ = 0 because he noticed no significant difference in performance, and with no eligibility traces, the TD update rule is simpler and faster to compute. On the other hand, Wiering (2010) reported that initial learning progress is fastest with a high λ (≈ 0.8), while lower values (but still ≈ 0.6) are optimal in the long run.
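For concreteness, one linear TD(λ) step with an accumulating eligibility trace can be sketched as follows; setting lam = 0 makes the trace collapse to the current feature vector, which is why dropping the traces made the update simpler and faster (the variable names and step sizes are illustrative).

import numpy as np

def td_lambda_step(w, z, phi_t, phi_next, reward, gamma=1.0, lam=0.7, alpha=0.01):
    # w: weight vector, z: eligibility trace, phi_*: feature vectors.
    # With lam = 0, z always equals phi_t, so no trace vector needs to be stored.
    v_t = w @ phi_t
    v_next = 0.0 if phi_next is None else w @ phi_next    # terminal value is 0
    delta = reward + gamma * v_next - v_t                  # TD error
    z = gamma * lam * z + phi_t                            # decay and accumulate trace
    w = w + alpha * delta * z
    return w, z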
Proper architecture. Tesauro (2002) describes other parts of the architecture. We
emphasize one thing here: temporal difference learning is an important factor in the
success of TD-Gammon, but just as important was to know how to use it. The neural
network is trained to predict the probability of victory from a given state – and its predictions are not very accurate: the estimate can be biased by as much as 0.1, way too much for direct use. Luckily, the bias does not change much among similar game states,
so the neural network predictor can still be used to compare and rank the values of
legal moves from a given position, as the relative values are reliable most of the time.
However, we need to know the exact victory probabilities to handle the doubling
cube, that is, to decide when to offer the doubling of stakes and when to accept it. For
this reason, TD-Gammon used an independent heuristic formula to make doubling
decisions. In some cases it is possible to calculate the probabilities of certain events
exactly (e.g., in endgame situations or getting through a block). Later versions of
TD-Gammon used pre-calculated probabilities as input features, and today’s leading
Backgammon programs probably use both precalculated probabilities and endgame
databases.
17.2.2 Chess
The best human players have Elo ratings around 2800, while the winner of the Computer Chess Olympiad, Rybka, is estimated to be well above 3000.6
Chess is one of the top-tier games where RL has not been very successful so far, and not for the lack of trying. Initial attempts to reinforcement-learn Chess include Gherrity (1993), who applied Q-learning with very limited success. We review two programs here, NeuroChess of Thrun (1995) and TD-Leaf of Baxter et al (2000).
Fig. 17.2 Chessboard with a Sicilian opening. (source:
Wikimedia Commons, author: Rmrfstar)
17.2.2.1 NeuroChess
NeuroChess by Thrun (1995) trained two neural networks, one representing the
value function V and another one representing a predictive model M. For any game
position s, M(s) was an approximation of the expected encoding of the game state
two half-steps later. Each chess board was encoded as a 175-element vector of hand-
coded features (probably containing standard features of chess evaluation functions,
e.g., piece values, piece-square values, penalties for double pawn, etc.) The eval-
uation function network was trained by an interesting extension of the temporal difference method: not only was the value function adjusted toward the target value, but its slope was also adjusted towards the slope of the target value function, which is supposed to give a better fit. Let s_1, s_2, . . . be the sequence of states where it is white's turn (black's turns can be handled similarly), and consider a time step t. If t is the final step, then the target value for V(s_t) is the final outcome of the game (0 or ±1). In concordance with the TD(0) update rule, for non-final t, the target value is V^target = γ V(s_{t+1}) (note that all immediate rewards are 0). The target slope is obtained by the chain rule through the predictive model: since s_{t+1} ≈ M(s_t), the slope of the target is ∂V^target/∂s_t = γ · ∂V(s_{t+1})/∂s_{t+1} · ∂M(s_t)/∂s_t. NeuroChess also relied on a number of additional tricks; two of these tricks were most influential: firstly, each game starts with a random number
of steps from a game database, then finished by self-play. The initialization ensures
that learning effort is concentrated on interesting game situations. Secondly, the
features were designed to give a much smoother representation of the game board
than a raw representation (that is, the feature vectors of similar game positions were
typically close to each other). Further tricks included quiescence search,7 discount-
ing (with γ = 0.98), and increased learning rates for final states.
The resulting player was significantly better than a player that was trained by
pure TD learning (without slope information or a predictive model). However, Neu-
roChess was still not a strong player, winning only 13% of its matches against GNUchess, even when the search depth of GNUchess was restricted to two half-steps.
According to Thrun, “NeuroChess has learned successfully to protect its material, to
trade material, and to protect its king. It has not learned, however, to open a game in
a coordinated way, and it also frequently fails to play short endgames even if it has
a material advantage. [...] Most importantly, it still plays incredibly poor openings,
which are often responsible for a draw or a loss.”
The use of evaluation functions in Chess is usually combined with minimax search
or some other multi-step lookahead algorithm. Such algorithms search the game
tree up to some depth d (not necessarily uniformly over all branches), evaluate the
deepest nodes with a heuristic evaluation function V and propagate the values up
in the tree to the root. Of course, the evaluation function can be trained with TD
or other RL methods. The naive implementation of temporal difference learning
disregards the interaction with tree search: at time t, it tries to adjust the value of
the current state, V (xt ), towards V (xt+1 ). However, the search algorithm does not
use V (xt ), but the values of states d steps down in the search tree. The idea of TD-
Leaf(λ ) is to use this extra information provided by the search tree.
To understand how TD-Leaf works, consider the game tree rooted at xt , con-
sisting of all moves up to depth d, with all the nodes evaluated with the heuristic
function V . The principal variation of the tree is the path from the root where each
player selects the minimax-optimal move with respect to V. Let x_t^l denote the leaf node of the principal variation. TD-Leaf applies temporal difference learning to these principal leaves: it shifts V(x_t^l) towards V(x_{t+1}^l), using the TD(λ) update rule. This helps because the heuristic evaluation function is used exactly in these principal nodes (through comparison to the other nodes). We note that TD-Leaf can be
interpreted as multi-step TD-learning, where the multi-step lookahead is done by an
improved policy.
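A minimal, batch-style sketch of the TD-Leaf(λ) weight update, assuming a generic routine principal_leaf that returns the leaf of the principal variation of a depth-d minimax search and a differentiable evaluator; all names are placeholders, not KnightCap's actual code.

def td_leaf_update(positions, principal_leaf, evaluate, grad, w,
                   outcome, lam=0.7, alpha=1.0):
    # positions: root positions x_1..x_T encountered during one game;
    # outcome: the final result of the game (e.g. +1, 0 or -1).
    # Leaves and TD errors are computed with the weights from the start of the game.
    leaves = [principal_leaf(x, w) for x in positions]
    values = [evaluate(leaf, w) for leaf in leaves]
    targets = values[1:] + [outcome]
    deltas = [t - v for v, t in zip(values, targets)]      # one TD error per step
    for t, leaf in enumerate(leaves):
        # lambda-weighted sum of the TD errors from step t onwards
        err = sum((lam ** (k - t)) * deltas[k] for k in range(t, len(deltas)))
        w = w + alpha * err * grad(leaf, w)
    return w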
KnightCap (Baxter et al, 2000) used TD-Leaf(λ ) to train an evaluation func-
tion for Chess. The evaluation function is a linear combination of 5872 hand-coded
7 Quiescence search extends the search tree around “interesting” or “non-quiet” moves to
consider their longer-term effects. For example, on the deepest level of the tree, capturing
an opponent piece is considered very advantageous, but even a one-ply lookahead can
reveal that it is followed by a counter-capture.
features. Features are divided into four groups, used at different stages of the game
(opening, middle-game, ending and mating), and include piece material strengths,
piece strengths on specific positions, and other standard features. The algorithm
used eligibility traces with λ = 0.7 and a surprisingly high learning rate α = 1.0,
meaning that old values were immediately forgotten and overwritten. KnightCap at-
tempted to separate the effect of the opponent: a positive TD-error could mean that
the opponent made an error,8 so KnightCap made an update only if it could pre-
dict the opponent’s move (thus it seemed like a reasonable move). Furthermore, to make any learning progress, the material weights had to be initialized to their default values.
The key factor in KnightCap’s success was the training regime. According to
Baxter et al (2000), self-play converged prematurely and final performance was
poor. Therefore, they let KnightCap play against humans of various levels on an
internet chess server. The server usually matches players with roughly equal rank-
ings, which had an interesting effect: as KnightCap learned, it played against better
and better opponents, providing adequate experience for further improvement. As
a result, KnightCap climbed from an initial Elo ranking of 1650 to 2150, which is
slightly below the level of a human Master. With the addition of an opening book,
Elo ranking went up to 2400-2500, a reasonable level.
We note that TD-Leaf has been applied successfully to other games, most notably Checkers (Schaeffer et al, 2001), where the learned weight set was competitive
with the best hand-crafted player at the time.
Veness et al (2009) modified the idea of TD-Leaf: while TD-Leaf used the principal
node of step t to update the principal node of step t − 1, TreeStrap(minimax) used
it to update the search tree at time step t. Furthermore, not only the root node was
updated, but every inner node of the search tree got updated with its correspond-
ing principal node. TreeStrap(αβ ) used the same trick, but because of the pruning,
the value of lower nodes was not always determined, only bounded in an interval.
TreeStrap(αβ ) with expert trainers reached roughly the same level as TD-Leaf, but
without needing its tricks for filtering out bad opponent moves and proper initial-
ization. More importantly, TreeStrap learned much more efficiently from self-play.
While expert-trained versions of TreeStrap still converged to higher rankings than versions with self-play, the difference shrank to ≈ 150 Elo, weakening the myth that
self-play does not work in Chess. There is anecdotal evidence that TreeStrap is able
to reach ≈ 2800 Elo (Veness, personal communication).
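The corresponding TreeStrap update fits in a few lines, assuming the search routine also returns every interior node of the tree together with its backed-up minimax value (again, the names are placeholders):

def treestrap_update(w, searched_nodes, evaluate, grad, alpha=1e-3):
    # searched_nodes: iterable of (position, backed_up_value) pairs, one for
    # every interior node of the most recent minimax (or alpha-beta) search.
    # Each node's static evaluation is nudged toward its backed-up value,
    # so a single search yields many training targets.
    for position, backed_up_value in searched_nodes:
        error = backed_up_value - evaluate(position, w)
        w = w + alpha * error * grad(position, w)
    return w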
The following observation gives the key to the success of TreeStrap: during the
lookahead search, the algorithm will traverse through all kinds of improbable states
(taking moves that no sane chess player would take). Function approximation has
to make sure that these insane states are also evaluated more-or-less correctly (so
that their value is lower than the value of good moves), otherwise, search is misled
by the value function. As an additional factor, TreeStrap also uses resources more
effectively: for each tree search, it updates every inner node.
The deterministic nature of Chess is another factor that makes it hard for RL. The
agents have to explore actively to get experience from a sufficiently diverse subset
of the state space. The question of proper exploration is largely open, and we know
of no Chess-specific solution attempt; though the question is studied abstractly in
evolutionary game theory (Fudenberg and Levine, 1998). Additionally, determinism
makes learning sensitive to the opponent’s strategy: if the opponent’s strategy is not
sufficiently diverse (as it happens in self-play), learning will easily get stuck in an
equilibrium where one player’s strategy has major weaknesses but the other player
fails to exploit them.
While reinforcement learning approaches are not yet competitive for the control
task, that is, playing chess, they are definitely useful for evaluation. Beal and Smith
(1997) and later Droste and Fürnkranz (2008) applied TD-Leaf to learn the mate-
rial values of pieces, plus the piece-square values. The experiments of Droste and
Fürnkranz (2008) showed that the learned values performed better than the expert
values used by several chess programs. The effect was much stronger in the case of non-conventional Chess variants like Suicide Chess, which have received much less attention in AI research and have less polished heuristic functions.
17.2.3 Go
humans can easily look ahead 60 or more half-moves in special cases (Schraudolph
et al, 2001). Current tree search algorithms cannot go close to this depth yet. Sec-
ondly, it is hard to write good heuristic evaluation functions for Go. In Chess, mate-
rial strengths and other similar measures provide a rough evaluation that is still good
enough to be used in the leaf nodes. In Go, no such simple metrics exist (Müller,
2002), and even telling the winner from a final board position is nontrivial.
17.2.3.2 Dyna-2
Dyna-2 combines a permanent memory, trained by self-play TD(λ), with a transient memory that modifies weights based on the current game situation, also trained by self-play TD(λ). Transient memory is erased after each move selection. The function approximator for both components is a linear combination of several basic features plus shape patterns up to 3 × 3 for each board position (resulting in an impressive number of parameters to learn). The combination of transient and permanent memories
performed better than either of them alone, and Dyna-2 reached good performance
on small boards.
otherwise cheap random move selection is used.11 Furthermore, to save storage, the
tree is extended only with the first (several) nodes from the random part of the roll-
out. If the visit counts n_i are low, then UCT selects each move roughly equally often, while after many trials the bonus term becomes negligible and the minimax move is selected.
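For reference, the UCB1-style selection rule that gives UCT its name can be sketched as follows; the node fields and the exploration constant c are illustrative.

import math

def uct_select(node, c=1.0):
    # Pick the action maximizing Q(x,a) + c * sqrt(ln N(x) / n(x,a)),
    # trying each unvisited action at least once first.
    total_visits = sum(node.visit_count[a] for a in node.actions)
    best_action, best_score = None, float("-inf")
    for a in node.actions:
        n = node.visit_count[a]
        if n == 0:
            return a
        score = node.mean_value[a] + c * math.sqrt(math.log(total_visits) / n)
        if score > best_score:
            best_action, best_score = a, score
    return best_action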
While the UCT selection formula has theoretical underpinnings, other formulas have also been tried successfully. Chaslot et al suggest that the O(1/√n(x,a)) bonus is too conservative, and O(1/n(x,a)) works better in practice (Chaslot et al,
2008). In Chaslot et al (2009) they give an even more complex heuristic formula that
can also incorporate domain knowledge. Coulom (2006) proposed another MCTS
selection strategy that is similar in effect (goes gradually from averaging to mini-
max), but selection is probabilistic.
Although UCT or other MCTS algorithms are the basis of all state-of-the-art Go
programs, the playing strength of vanilla UCT can be improved greatly with heuris-
tics and methods to bias the search procedure with domain knowledge.12 Below we
overview some of the approaches of improvement.
Biasing tree search. Gelly and Silver (2008) incorporate prior knowledge by using
a heuristic function Qh (x,a). When a node is first added to the tree, it is initialized as
Q(x, a) = Qh (x, a) and n(x, a) = nh (x, a). The quantity nh indicates the confidence
in the heuristic in terms of equivalent experience. The technique can be alternatively
interpreted as adding a number of virtual wins/losses (Chatriot et al, 2008; Chaslot
et al, 2008). Progressive unpruning/widening (Chaslot et al, 2008; Coulom, 2006)
severely cuts the branching factor: initially, only the heuristically best move candi-
dates are added to each node. As the node is visited more times, more tree branches
are added, corresponding to move candidates that were considered weaker by the
heuristics.
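In code, such heuristic initialization simply seeds the statistics of a newly expanded edge with “virtual experience”, after which real simulation results are blended in through the usual running average (the node fields and the heuristic values Qh and nh are placeholders):

def expand_with_prior(node, action, q_h, n_h):
    # Initialize edge (x, a) as if it had already been tried n_h times
    # with mean outcome q_h.
    node.mean_value[action] = q_h
    node.visit_count[action] = n_h

def backup_edge(node, action, outcome):
    # Standard incremental mean update after one simulation through (x, a);
    # the heuristic prior is gradually washed out as real visits accumulate.
    node.visit_count[action] += 1
    node.mean_value[action] += (outcome - node.mean_value[action]) / node.visit_count[action]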
RAVE (rapid action value estimation; Gelly and Silver, 2008), also called “all-moves-as-first”, is a rather Go-specific heuristic that takes all the moves of the player along a simulated game and updates their visit counts and other statistics. The heuristic treats actions as if their order did not matter, which gives a quick but distorted estimate of values.
Biasing Monte-Carlo simulations. It is quite puzzling why Monte-Carlo evalua-
tion works at all: the simulated random games correspond to a game between two
extremely weak opponents, and it is not clear why the evaluation of states is use-
ful against stronger opponents. It seems reasonable that the stronger the simulation
policy is, the better the results. This is true to some extent, but there are two op-
posing factors. Firstly, simulation has to be diverse for sufficient exploration; and
secondly, move selection has to be very light on computation, as we need to run it
millions of times per game. Therefore, the rollout policy is critical to the success of
11 The switch from UCT move selection to random selection can also be done gradually.
Chaslot et al (2008) show that this can increase efficiency.
12 See http://senseis.xmp.net/?CGOSBasicUCTBots and http://cgos.
boardspace.net/9x9/allTime.html for results of pure UCT players on the
Computer Go Server. The strongest of these has an Elo rating of 1645, while the strongest
unconstrained programs surpass 2500.
MCTS methods. Gelly et al (2006) uses several 3 × 3 patterns that are considered
interesting, and chooses moves that match the patterns with higher probability. Most
other Go programs use similar biasing.
Silver and Tesauro (2009) suggested that the weakness of the rollout play is not
a problem as long as opponents are balanced, and their errors somehow cancel out.
While the opponents are equally good on average (they are identical), random fluc-
tuations may imbalance the rollout. Silver and Tesauro (2009) propose two methods to learn an MC simulation policy with low imbalance, and they show that it signifi-
cantly improves performance.
The success of UCT and other MCTS methods in Go inspired applications to many
other games in a wide variety of genres, but its success was not repeated in each
case—most notably, Chess (Ramanujan et al, 2010). The question naturally arises: why do UCT and other MCTS methods work so well in Go? From a theoretical
perspective, Coquelin and Munos (2007) showed that in the worst case, UCT can
perform much worse than random search: for finding a near-optimal policy, the re-
quired number of samples may grow as a nested exponential function of the depth.
They also propose an alternative BAST (bandit algorithm for smooth trees) which
has saner worst-case performance. Interestingly, the better worst-case performance
of BAST does not lead to a better performance in Go. The likely reason is that less
conservative update strategies are better empirically (Chaslot et al, 2008), and BAST
is even more conservative than UCT.
Policy bias vs. value bias. According to Szepesvári (2010, personal communication), the success of MCTS can (at least partly) be attributed to the fact that it enables a different bias than value-function-based methods like TD. Function approximation
for value estimation puts a “value bias” on the search. On the other hand, the heuris-
tics applied in tree search methods bias search in policy space, establishing a “policy
bias”. The two kinds of biases have different strengths and weaknesses (and can also
be combined, like Dyna-2 did), and for Go, it seems that policy bias is easier to do
well.
Narrow paths, shallow traps. UCT estimates the value of a state as an average
of outcomes. The average concentrates around the minimax value, but convergence can take a long time (Coquelin and Munos, 2007). Imagine a hypothetical situation
where Black has a “narrow path to victory”: he wins if he carries out a certain move
sequence exactly, but loses if he makes even a single mistake. The minimax value
of the state is a ‘win’, but the average value will be closer to ‘lose’ unless half of the weight is concentrated on the winning branch. However, the algorithm has to
traverse the full search tree before it can discover this.
In the same manner, averaging can be deceived by the opposite situation where
the opponent has a winning strategy, but any other strategy is harmless. Unless we
leave MCTS a very long time to figure out the situation, it will get trapped and
think that the state is safe. This is especially harmful if the opponent has a short
winning strategy from the given state: then even a shallow minimax search can find
that winning strategy and defeat MCTS. Ramanujan et al (2010) call these situations
“shallow traps”, and argue that (tragically) bad moves in Chess often lead to shallow
traps, making Chess a hard game for MCTS methods.
Narrow paths and shallow traps seem to be less prevalent in Go: bad moves lead
to defeat many steps later, and minimax-search methods cannot look ahead very
far, so MCTS can have the upper hand. Nevertheless, traps do exist in Go as well.
Chaslot et al (2009) identify a situation called “nakade”, which is a “situation in
which a surrounded group has a single large internal, enclosed space in which the
player won’t be able to establish two eyes if the opponent plays correctly. The group
is therefore dead, but the baseline Monte-Carlo simulator sometimes estimates that
it lives with a high probability”. Chaslot et al (2009) recognize and treat nakade
separately, using local minimax search.
Fig. 17.5 Screenshot of Tetris for the Nintendo entertainment system (NES). The left-hand
column lists the seven possible tetris pieces. The standard names for the tetrominoes are in
order: T, J, Z, O, S, L, I.
17.2.4 Tetris
In the Tetris variant commonly used in RL research (Bertsekas and Tsitsiklis, 1996), pieces do not fall; the player just needs to give the rotation and column where the current piece should fall. Furthermore, scoring is flat: each cleared line is worth +1 reward.13 Many variants exist with different board sizes, shape sets and shape distributions.
As a fully-observable, single-player game with randomness, Tetris fits the MDP
framework well, where most of RL research is concentrated. Two other factors
helped its popularity: its simplicity of implementation and its complexity of op-
timal control. Optimal placement of tetrominoes is NP-hard even to approximate,
even if the tetromino sequence is known in advance. Therefore, Tetris became the
most popular game for testing and benchmarking RL algorithms.
Besides the properties that make Tetris a good benchmark for RL, it has the un-
fortunate property that the variance of a policy’s value is huge, comparable to its
mean. In addition, the better a policy is, the longer it takes to evaluate it (we have to
play through the game). This makes it hard to compare top-level algorithms or inves-
tigate the effects of small changes. For these reasons, Szita and Szepesvári (2010)
proposed the variant SZ-Tetris, which only uses the ‘S’ and ‘Z’ tetrominoes. They argue that this variant is the “hard core” of Tetris, that is, it preserves (even amplifies) most of the complexity of Tetris, while maximum scores (and variances) are much lower, making it a much better experimental testbed.
Tetris is also special because it is one of the few applications where function approximation of the value function does not work well. Many of the value-function-
approximating RL methods are known to diverge or have only weak performance
bounds. However, theoretical negative results (or lack of positive results) are often
dismissed as too conservative, as value function approximation generally works well
in practice. Tetris is a valuable reminder that the theoretical performance bounds
might not be so loose even for problems of practical interest.
Bertsekas and Tsitsiklis (1996) were among the first ones to try RL on Tetris.
States were represented as 21-element feature vectors consisting of the individ-
ual column heights (10 features), column height differences (9 features), maximum
column height and the number of holes. This representation with linear function
approximation served as a basis for most subsequent approaches,14 which pro-
vides an excellent opportunity to compare value-function-based and preference-
function-based methods. In both groups, a linear combination of feature functions,
V(x) = ∑_{i=1}^{k} w_i φ_i(x), is used for decision making, but in the first group of methods, V(x) tries to approximate the optimal value function V*(x), while in the second group no attempt is made to ensure V(x) ≈ V*(x), making the parameter optimiza-
tion less constrained. Preference functions can also be considered a special case of
direct policy search methods.
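A sketch of this 21-element feature vector and the corresponding linear evaluation, assuming the board is given as a binary occupancy grid (the exact feature ordering is illustrative):

import numpy as np

def bertsekas_features(board):
    # 21 features for a 10-column board: the 10 column heights, the 9 absolute
    # differences of adjacent heights, the maximum height, and the number of
    # holes (empty cells below the top filled cell of their column).
    # board: 2D 0/1 array, board[row, col], with row 0 at the top.
    rows, cols = board.shape
    heights = np.zeros(cols)
    holes = 0
    for c in range(cols):
        filled = np.nonzero(board[:, c])[0]
        if filled.size > 0:
            top = filled[0]
            heights[c] = rows - top
            holes += int(np.sum(board[top:, c] == 0))
    diffs = np.abs(np.diff(heights))
    return np.concatenate([heights, diffs, [heights.max()], [holes]])

def evaluate(board, w):
    # Linear evaluation V(x) = sum_i w_i * phi_i(x).
    return float(w @ bertsekas_features(board))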
To learn the weights wi , Bertsekas and Tsitsiklis (1996) apply λ -policy iteration,
which can be considered as the planning version of TD(λ ), and generalizes value-
and policy-iteration. The best performance they report is around 3200 points, but
performance gets worse as training continues. Lagoudakis et al (2002) try least-squares policy iteration; Farias and van Roy (2006) apply approximate linear programming with an iterative process for generating samples; Kakade (2001) applies
natural policy gradient tuning of the weights. The performance of these methods is
within the same order of magnitude, in the range 3000–7000.
Böhm et al (2004) tune the weight vector with an evolutionary algorithm, using
two-point crossover, Gaussian noise as the mutation operator, and a random weight
rescaling operator which did not affect the fitness of the weight vector, but altered
how it reacted to future genetic operations. Szita and Lörincz (2006a) applied the
cross-entropy method (CEM), which maintains a Gaussian distribution over the pa-
rameter space, and adapts the mean and variance of the distribution so as to maxi-
mize the performance of weight vectors sampled from the distribution, while main-
taining a sufficient level of exploration. Their results were later improved by Thiery
and Scherrer (2009) using additional feature functions, reaching 35 million points.
While CEM draws each vector component from an independent Gaussian (that is, its
joint distribution is an axis-aligned Gaussian), the CMA-ES method (covariance ma-
trix adaptation evolutionary strategy) allows general covariance matrices at the cost
of increasing the number of algorithm parameters. Boumaza (2009) applies CMA-ES to Tetris, reaching results similar to those of Thiery and Scherrer (2009).
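A minimal sketch of the (noisy) cross-entropy method as used for tuning such weight vectors; the population size, elite fraction and noise level are illustrative, and evaluate_policy is assumed to play one or more games with a given weight vector and return the average score.

import numpy as np

def cross_entropy_method(evaluate_policy, dim, n_iterations=50,
                         population=100, elite_frac=0.1, noise=4.0):
    # Maintain an axis-aligned Gaussian over weight vectors; in each iteration,
    # sample a population, keep the best-scoring elite fraction, and refit the
    # mean and variance to the elite. The added noise keeps the variance from
    # collapsing before good solutions are found.
    mean = np.zeros(dim)
    var = np.full(dim, 100.0)
    n_elite = max(1, int(population * elite_frac))
    for _ in range(n_iterations):
        samples = mean + np.random.randn(population, dim) * np.sqrt(var)
        scores = np.array([evaluate_policy(w) for w in samples])
        elite = samples[np.argsort(scores)[-n_elite:]]
        mean = elite.mean(axis=0)
        var = elite.var(axis=0) + noise
    return mean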
Given an approximate value function V , V (x) is approximately the total reward col-
lectible from x. Preference functions do not have this extra meaning, and are only
meaningful in comparison: V (x) > V (y) means that the algorithm prefers x to y
(probably because it can collect more reward from x). Infinitely many preference
functions induce the same policy as a given value function, as the policies induced by preference functions are invariant to rescaling or any other monotonically increasing transformation. The exact solution to the Bellman optimality equations,
V ∗ , is also an optimal preference function, so solving the first task also solves
the other. In the function approximation case, this is not true any more: a value function that is quite good at minimizing the approximate Bellman error might be very bad as a preference function (it may rank almost all actions in the wrong order, while still having a small Bellman residual). Furthermore, the solution of the approximate Bellman equations loses this meaning – in fact, it can be arbitrarily far from V*. Usually this happens only in artificial counterexamples, but the weak results on Tetris with its traditional feature representation indicate that this problem is also subject to this phenomenon (Szita and Szepesvári, 2010). One way
to avoid problems would be to change the feature functions. However, it is an in-
teresting open question whether it is possible to unify the advantages of preference
functions (direct performance optimization) and approximate value functions (the
approximate Bellman equations allow bootstrapping: learning from V (xt+1 )).
In certain aspects, Tetris-playing RL agents are well beyond human capabilities: the
agent of Thiery and Scherrer (2009) can clear 35 million lines on average. Calculat-
ing with a speed of 1 tetromino/second (reasonably good for a human player), this
would take 2.77 years of nonstop playing time. However, observing the gameplay of these agents shows that they make nonintuitive (and possibly dangerous) moves that a human would avoid. Results on SZ-Tetris also support the superiority of humans so far: the CEM-trained controller reaches 133 points, while a simple hand-coded
controller, which would play weaker than a good human player, reaches 182 points
(Szita and Szepesvári, 2010).
Games in the real-time strategy (RTS) genre are war-simulations. In a typical RTS
(Fig. 17.6), players need to gather resources and use them to build a military base, develop technology and train military units, destroy the bases of the opponents and defend against the opponents’ attacks. Tasks of the player in-
clude ensuring a sufficient flow of resource income, balancing resource allocation
between economical expansion and military power, strategic placement of buildings
and defenses, exploration, attack planning, and tactical management of units during
battle. Due to the complexity of RTS games, RL approaches usually pick one or
several subtasks to learn while relying on default heuristics for the others. Listings
of subtasks can be found in Buro et al (2007); Laursen and Nielsen (2005); Pon-
sen et al (2010). One of the distinctive properties of RTS games, not present in any
of the previously listed games, is the need to handle parallel tasks. This is a huge
challenge, as the parallel tasks interact, and can span different time scales.
Most implementations use the Warcraft and Starcraft families, the open-source
RTS family Freecraft/Wargus/Stratagus/Battle of Survival/BosWars,15 or the ORTS
game engine which was written specifically to foster research (Buro and Furtak,
2003). The variety of tasks and the variety of different RTS games is accompanied
by a colorful selection of RL approaches.
used in Chess). Kerbusch (2005) has shown that TD-learning can do a better job in
tuning the relative strengths of units than hand-coding, and the updated evaluation
function improves the playing strength of dynamic scripting. Kok (2008) used an
implicit state representation: the preconditions of each rule determined whether the
rule is applicable; furthermore, the number of applicable actions was much lower,
typically around five. With these modifications, he reached good performance both
with vanilla DS and Monte-Carlo control (equivalent to TD(1)) with ε -greedy ex-
ploration and a modification that took into account the fixed ordering of rules.
Ponsen and Spronck (2004) also tried evolutionary learning on Wargus, with four
genes for each building-state, corresponding to the four action categories. In a follow-up paper, Ponsen et al (2005) get improved results by pre-evolving the action sets,
by selecting several four-tuples of actions per state that work well in conjunction.
EvolutionChamber is a recent, as yet unpublished approach16 to learning opening moves for Starcraft. It uses an accurate model of the game to reach a predefined goal
condition (like creating at least K units of a type) as quickly as possible. Evolution-
Chamber works under the assumption that the opponent does not interact, and needs
a goal condition as input, so in its current form it is applicable only to openings. As a
huge advantage, however, it produces opening strategies that are directly applicable
in Starcraft even by human players, and it has produced at least one opening (the
“seven-roach rush”, getting 7 heavily armored ranged units in under five minutes)
which was previously unknown and is regarded very effective by expert players.17
In case-based reasoning, the agent stores several relevant situations (cases), together
with policies that tell how to handle them. Relevant cases can be added to the case base either manually (Szczepański and Aamodt, 2009) or automatically (either offline or online). When the agent faces a new situation, it looks up the most similar stored case(s) and makes its decision based on them, and (when learning is involved) their statistics are updated based upon the reinforcement. The abstract states, set of actions and evaluation function used by Aha et al (2005) are identical to the ones used by Ponsen and Spronck (2004); the distance of cases is determined from an eight-dimensional feature vector. If a case is significantly different from all stored cases, it is added to the pool; otherwise, the reward statistics of the existing case are updated. The
16 See http://lbrandy.com/blog/2010/11/using-genetic-algorithms-
to-find-starcraft-2-build-orders/ for detailed description and
http://code.google.com/p/evolutionchamber/ for source code. We
note that even though the algorithm is not published through an academic journal or
proceedings, both the algorithm and the results are publicly available, and have been
verified by experts of the field.
17 See forum discussion at http://www.teamliquid.net/forum/
viewmessage.php?topic id=160231.
resulting algorithm outperformed the AIs evolved by Ponsen and Spronck (2004),
and is able to adapt to several unknown opponents (Molineaux et al, 2005). Weber
and Mateas (2009) introduce an improved distance metric similar to the “edit distance”: they define the distance of two cases as the resources required to transform one into the other.
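At its core, such a learning case base reduces to nearest-neighbour retrieval plus running reward statistics. A generic sketch (the feature distance and the novelty threshold are placeholders, not the cited systems' actual code):

import numpy as np

class CaseBase:
    # Minimal case-based reasoner: store (features, policy, statistics) triples,
    # retrieve the nearest stored case, and keep a running average of rewards.
    def __init__(self, distance_threshold=1.0):
        self.cases = []
        self.threshold = distance_threshold

    def retrieve(self, features):
        if not self.cases:
            return None, float("inf")
        dists = [np.linalg.norm(features - c["features"]) for c in self.cases]
        best = int(np.argmin(dists))
        return self.cases[best], dists[best]

    def observe(self, features, policy, reward):
        case, dist = self.retrieve(features)
        if case is None or dist > self.threshold:
            # significantly new situation: add it to the pool
            self.cases.append({"features": features, "policy": policy,
                               "mean_reward": reward, "n": 1})
        else:
            # similar case already stored: update its reward statistics
            case["n"] += 1
            case["mean_reward"] += (reward - case["mean_reward"]) / case["n"]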
The listed approaches deal with high-level decision making, delegating lower-level
tasks like tactical combat or arrangement of the base to default policies. Disregard-
ing the possible interactions certainly hurts their performance, e.g., fewer soldiers
may suffice if their attack is well-coordinated and they kill off opponents quickly.
But even discounting that, current approaches have apparent shortcomings, like the
limited ability to handle parallel decisions (see Marthi et al (2005) for a solution pro-
posal) or the limited exploration abilities. For example, Ponsen and Spronck (2004)
report that late-game states with advanced buildings were rarely experienced during
training, resulting in weaker strategies there.
State-of-the-art computer players of RTS games are well below the level of the best human players, as demonstrated, for example, by the match between Overmind, winner of the 2010 AIIDE Starcraft Competition, and retired professional player =DoGo=.
As of today, top-level AIs for real-time strategy games, for example, the contest entries of the StarCraft and ORTS competitions, do not use the reinforcement learning techniques listed above (or others).
The previous section explored several games and solution techniques in depth. The
games were selected so that they each have several interesting RL issues, and to-
gether they demonstrate the wide range of issues RL has to face. By no means was this list representative. In this section we give a list of important challenges in RL that
can be studied within games.
As it has already been pointed out in the introduction of the book, the question of
suitable representation plays a central role in all applications of RL. We usually have
rich domain knowledge about games, which can aid representation design. Domain
knowledge may come in many forms:
• Knowledge of the rules, the model of the game, knowing what information is
relevant;
• Gameplay information from human experts. This may come in the form of ad-
vice, rules-of-thumb, relevant features of the game, etc.;
18 See http://cswww.essex.ac.uk/staff/sml/pacman/
PacManContest.html. The contest is coordinated by Simon Lucas, and has been or-
ganized on various evolutionary computing and game AI conferences (WCCI, CEC and
GIG).
19 The Atari 2600 system is an amazing combination of simplicity and complexity: its mem-
ory, including the video memory, is 1024 bits, including the video output (that is, the video
had to be computed while the electron ray was drawing it). This amount of information is
on the verge of being manageable, and is a Markovian representation of the state of the
computer. On the other hand, the Atari 2600 was complex (and popular) enough so that it
got nearly 600 different games published.
that part of the reason is definitely “presentation bias”: if an application did not work
because of parameter sensitivity / divergence / too slow convergence / bad features,
its probability of publication is low.
Tabular representation. For games with discrete states, the trivial feature mapping assigns a unique identifier to each state, corresponding to a “tabular representation”.
The performance of dynamic programming-based methods depends on the size of
the state space, which usually makes the tabular representation intractable. How-
ever, state space sampling methods like UCT remain tractable, as they avoid the
storage of values for each state. They trade off the lack of stored values and the lack of generalization for increased per-move decision time. For examples, see the applications to Go (Section 17.2.3), Settlers of Catan (Szita et al, 2010), or general
game playing (Björnsson and Finnsson, 2009).
Expert features. Game situations are commonly characterized by several numerical
quantities that are considered relevant by an expert. Examples of such features are
material strength in Chess, number of liberties in Go, or the number of holes in
Tetris. It is also common to use binary features like the presence of a pattern in
Chess or Go.
Relational representation. Relational (Džeroski et al, 2001), object-oriented (Diuk et al, 2008) and deictic (Ponsen et al, 2006) representations are similar: they assume that the game situation can be characterized by several relevant relationships between objects (technically, a special case of expert features), where each relationship involves only a few objects. Such representations may be used either directly for value function approximation (Szita and Lörincz, 2006b), or by some structured reinforcement learning algorithm (see Chapters 8 and 9).
State aggregation, clustering. While in some cases it is possible to enumerate and
aggregate ground states directly (Gilpin et al, 2007), it is more common to perform
clustering in some previously created feature space. Cluster centres can be used
either as cases in case-based reasoning or as a high-level discretization of the state
space.
Combination features. Conjunction features require the combined presence of sev-
eral (binary) features, while more generally, we can take the combination of features
via any logical formula. Combination features are a simple way of introducing non-
linearity in the inputs, so they are especially useful for use with linear function
approximation. Combination features are also a common subject of automatic representation search techniques (see below).
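As a simple illustration, conjunctions of binary atomic features can be generated as follows and then fed to a linear evaluator (the example feature names are hypothetical):

from itertools import combinations

def conjunction_features(atomic, max_arity=2):
    # atomic: dict mapping feature names to booleans. Returns the atoms extended
    # with all conjunctions of up to max_arity of them; the conjunctions supply
    # the nonlinearity that a linear evaluator over single atoms would miss.
    features = dict(atomic)
    names = sorted(atomic)
    for arity in range(2, max_arity + 1):
        for combo in combinations(names, arity):
            features[" & ".join(combo)] = all(atomic[n] for n in combo)
    return features

# e.g. {"opp_king_exposed": True, "own_rook_on_open_file": True} yields the
# extra feature "opp_king_exposed & own_rook_on_open_file" with value True.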
Hidden layers of neural networks define feature mappings of the input space, so
methods that use neural networks can be considered as implicit feature construction
methods. Tesauro (1995) reports that TD-Gammon learned interesting patterns, cor-
responding to typical situations in Backgammon. Moriarty and Miikkulainen (1995)
use artificial evolution to develop neural networks for Othello position evaluation,
and manage to discover advanced Othello strategies.
The Morph system of Gould and Levinson (1992) extracts and generalizes pat-
terns from chess positions, just like Levinson and Weber (2001). Both approaches
learn the weights of patterns by TD(λ ). Finkelstein and Markovitch (1998) modify
the approach to search for state-action patterns. The difference is analogous to the
difference of the value functions V (x) and Q(x,a): the latter does not need lookahead
for decision making.
A general approach for feature construction is to combine atomic features into
larger conjunctions or more complex logical formulas. The Zenith system of Fawcett
and Utgoff (1992) constructs features for Othello from a set of atomic formulas
(of first-order predicate calculus), using decomposition, abstraction, regression, and
specialization operators. Features are created iteratively, by adding features that help discriminate between good and bad outcomes. The features are pruned and used
in a linear evaluation function. A similar system of Utgoff and Precup (1998) con-
structs new features in Checkers by the conjunction of simpler boolean predicates.
Buro (1998) uses a similar system to create conjunction features for Othello. For
a good summary on feature construction methods for games, see Utgoff (2001).
More recent results on learning conjunction features include Shapiro et al (2003) for
Diplomacy, Szita and Lörincz (2006b) for Ms. Pac-Man and Sturtevant and White
(2007) for Hearts.
Gilpin and Sandholm (2007) study abstractions of Poker (and more generally,
two-player imperfect-information games), and provide a state-aggregation algo-
rithm that preserves equilibrium solutions of the game, that is, preserves relevant
information with respect to the optimal solution. In general, some information has
to be sacrificed in order to keep the size of the task feasible. Gilpin et al (2007) pro-
pose another automatic abstraction method that groups information states to a fixed
number of clusters so that states within a cluster have roughly the same distribu-
tion of future rewards. Ponsen et al (2010) give an overview of (lossless and lossy)
abstraction operators specific to real-time strategy games.
In general, it seems hard to learn representations without any domain knowledge
(Bartók et al, 2010), but there are a few attempts to minimize the amount of do-
main knowledge. Examples can be found in the general game playing literature, see
e.g., Sharma et al (2009); Günther (2008); Kuhlmann (2010). These approaches
look for general properties and invariants that are found in most games: variables of
spatial/temporal type, counting-type variables, etc.
17.3.2 Exploration
We discuss opponent selection in Section 17.3.3, and concentrate here on the agent’s own role in enforcing variety.
Boltzmann and ε-greedy action selection are the most commonly used forms of exploration. We know from the theory of finite MDPs that much more efficient exploration methods exist (Chapter 6), but it is not yet known how to extend these methods to model-free RL with function approximation (which covers a large part of the game applications of RL). UCT and other Monte-Carlo tree search algorithms handle exploration in a somewhat more principled way, maintaining confidence intervals representing the uncertainty of outcomes and applying the principle
of “optimism in the face of uncertainty”. When a generative model of the game is
available (which is the case for many games), even classical planning can simulate
exploration. This happens, for example, in TD-Leaf and TreeStrap where limited
lookahead search determines the most promising direction.
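The two standard rules mentioned above take only a few lines each; the temperature and ε values below are illustrative.

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon pick a uniformly random action, otherwise the greedy one.
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0):
    # Sample an action with probability proportional to exp(Q / temperature);
    # a high temperature gives near-uniform exploration, a low one near-greedy play.
    prefs = np.array(q_values, dtype=float) / temperature
    prefs -= prefs.max()                                   # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(np.random.choice(len(q_values), p=probs))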
Evolutionary RL approaches explore the parameter space in a population-based
manner. Genetic operators like mutation and crossover ensure that the agents try
sufficiently diverse policies. RSPSA (Kocsis et al, 2006) also does exploration in
the parameter space, but uses local search instead of maintaining a population of
agents.
For games with two or more players, the environment of the learning agent de-
pends on the other players. The goal of the learning agent can be to find a Nash-
equilibrium, a best response play against a fixed opponent, or just performing well
against a set of strong opponents. However, learning can be very inefficient or com-
pletely blocked if the opponent is much stronger than the actual level of the agent:
all the strategies that a beginner agent tries will end up in a loss with very high
probability, so the reinforcement signal will be uniformly negative – not providing
any direction for improvement. On another note, opponents have to be sufficiently
diverse to prevent the agent from converging to local optima. For these reasons, the
choice of training opponents is a nontrivial question.
17.3.3.1 Self-play
Self-play is by far the most popular training method. According to experience, train-
ing is quickest if the learner’s opponent is roughly equally strong, and that definitely
holds for self-play. As a second reason for popularity, there is no need to imple-
ment or access a different agent with roughly equal playing strength. However, self-
play has several drawbacks, too. Theoretically, it is not even guaranteed to converge
(Bowling, 2004), though no sources mention that this would have caused a problem
in practice. The major problem with self-play is that the single opponent does not
provide sufficient exploration. Lack of exploration may lead to situations like the
one described in Section 6.1.1 of Davidson (2002) where a Poker agent is trapped
by a self-fulfilling prophecy: suppose that a state is (incorrectly) estimated to be bad.
Because of that, the agent folds and loses the game, so the estimated value of the state goes down even further.
17.3.3.2 Tutoring
Epstein (1994) proposed a “lesson and practice” setup, in which periods of (resource-expensive) tutoring games are followed by periods of self-play. Thrun (1995) trained
NeuroChess by self-play, but to enforce exploration, each game was initialized by
several steps from a random game of a game database. This way, the agent started
self-play from a more-or-less random position.
Learning from the observation of game databases is a cheap way of learning, but it usually does not lead to good results: the agent only gets to experience good moves
(which were taken in the database games), so it never learns about the value of the
worse actions and about situations where the same actions are bad.
Agents may have to face missing information in both classical and computer games,
for example, hidden cards in Poker, partial visibility in first-person shooters or “fog
of war” in real-time strategies. Here we only consider information that is “truly
missing”, that is, information that could not be obtained by using a better feature representation. Furthermore, missing information about the opponents’ behavior is discussed separately in Section 17.3.5.
One often-used approach is to ignore the problem and learn reactive policies based on the current observation. While TD-like methods can diverge under partial observability, TD(λ) with large λ (including Monte-Carlo estimation) is reported to be relatively stable. The applicability of direct policy search methods, including dynamic scripting (Spronck et al, 2006), cross-entropy policy search (Szita and Lörincz, 2006b) and evolutionary policy search (Koza, 1992), is unaffected by
partial observability.
Monte-Carlo sampling, however, becomes problematic in adversarial search: af-
ter sampling the unobserved quantities, the agent considers them observable, and
also assumes that the opponent can observe everything. As a result, decision making ignores the “value of information”: a move may be preferable because it gains extra information (or keeps information hidden from the opponent). Ginsberg (2002) introduces a variant of minimax search over information sets, i.e., sets of states that cannot be distinguished by the information the agent has; a simplified version is applied in the Bridge-playing program GIB. Amit and Markovitch (2006) use MC sampling, but for each player, information unknown to that player is masked out. The technique is applied to improve bidding in Bridge.
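The determinization scheme criticized here can be sketched as follows: sample the hidden information several times, solve each resulting perfect-information game with ordinary search, and vote over the recommended moves. The helper functions sample_hidden and solve_perfect_info are hypothetical placeholders.

```python
from collections import Counter

def determinized_move(observation, sample_hidden, solve_perfect_info, n_samples=50):
    """Monte-Carlo sampling of hidden information ("determinization").

    sample_hidden(observation) draws one complete state consistent with what
    the agent has observed (e.g., a concrete deal of the unseen cards), and
    solve_perfect_info(state) returns the move preferred by an ordinary
    perfect-information search on that state.  Both are assumed helpers.
    Note that, as discussed above, this scheme ignores the value of information.
    """
    votes = Counter()
    for _ in range(n_samples):
        full_state = sample_hidden(observation)
        votes[solve_perfect_info(full_state)] += 1
    return votes.most_common(1)[0][0]     # majority vote over the samples
```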
Fig. 17.7 A game of Texas Hold ’em poker in progress. Screenshot from Hoyle’s Texas Hold Em for the Xbox 360.
20 See Rubin and Watson (2011) for a comprehensive review on computer poker.
So far, the chapter has focused on the RL perspective: algorithmic issues and types of challenges characteristic of games. In this section, we take a look at the other side of the RL–games relationship: how RL can be useful in games. The two viewpoints more-or-less coincide when the goal is to find a strong AI player. However, RL in games can be useful in more roles than that, which is most apparent in modern video games. Below we give an overview of how reinforcement learning can be of use in game development and gameplay.
In plenty of games, making the computer smart enough is a challenge, but mak-
ing it weak enough might be one, too, as the example of Pong shows. Pong was
among the first video games with a computer opponent. The optimal policy is triv-
ial (the paddle should follow the y coordinate of the ball), but obviously not many
people would play Pong against the optimal policy. Imagine now an agent that plays perfectly 45% of the time and does nothing the rest of the time – matches against this agent would be close, but probably still not much fun. McGlinchey (2003) tries to extract
believable policies from recordings of human players. Modelling human behavior
is a popular approach, but is outside the scope of this chapter; see e.g., Hartley et al
(2005) for an overview.
Difficulty scaling techniques try to tune the difficulty of the game automatically.
They take a strong (possibly adaptive) strategy, and adjust its playing strength to the
level of the player, if necessary.
Spronck et al (2004) modify dynamic scripting. In the most successful of the
three modifications considered, they exclude rules that make the strategy too strong.
They maintain a weight threshold Wmax that is lowered for any victory and increased
in case of a defeat; if the weight of a rule exceeds Wmax , it cannot be selected. They
applied the technique to tactical combat in a simplified simulation of the RPG Bal-
dur’s Gate, reaching reliable and low-variance difficulty scaling against a number
of tactics. Andrade et al (2004) applied a similar technique to a Q-learning agent in the simplified fighting game Knock’em: if the agent was leading by too much, several of the best actions (ranked by Q-values) were disabled. Thanks to the use of Q-functions, difficulty adaptation could be done online, within episodes. The difference between the players’ hitpoints determined who was leading and whether the game was
deemed too difficult. Hunicke and Chapman (2004) propose a difficulty adjustment
system that modifies the game environment, e.g., gives more weapons to the player
and spawns fewer enemies when the game is considered too hard. A similar sys-
tem was implemented in the first-person shooter Max Payne,21 where the health of
enemies and the amount of auto-aim help are adjusted, based on the player’s perfor-
mance. Hagelbäck and Johansson (2009) aim at a specific score difference in an RTS game. According to questionnaires, their method significantly raised enjoyment and perceived variability compared to static opponents.
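A hedged sketch of the Q-value based scheme of Andrade et al (2004) described above: when the agent is leading by too much (measured here by a hitpoint difference), its top-ranked actions are disabled before selection. The threshold and the number of disabled actions are illustrative parameters, not values from the paper.

```python
def difficulty_scaled_action(q_values, agent_hp, player_hp,
                             lead_threshold=30, n_disabled=2):
    """Pick a greedy action, but drop the top-ranked actions when leading.

    q_values maps actions to estimated values; the hitpoint-difference test
    and the parameter values are purely illustrative.
    """
    ranked = sorted(q_values, key=q_values.get, reverse=True)
    if agent_hp - player_hp > lead_threshold and len(ranked) > n_disabled:
        ranked = ranked[n_disabled:]      # forbid the strongest moves
    return ranked[0]                      # best remaining action
```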
21 http://www.gameontology.org/index.php/Dynamic_Difficulty_Adjustment
provide more challenging opponents. It also protects against the easy exploitability
and fragility of hard-coded opponent strategies. Spronck et al (2006) demonstrate the idea with an implementation of dynamic scripting for tactical combat in the RPG Neverwinter Nights. The AI trained by dynamic scripting learned to overcome fixed AIs after 20–30 episodes.
Very few video games have in-game adaptation, notable exceptions being Black & White, Creatures, and the academic game NERO (Stanley et al, 2005). The three games have a lot in common: learning is a central part of the gameplay, and the player
can observe how his creatures are interacting with the environment. The player can
train his creatures by rewarding/punishing them and modifying their environment
(for example, to set up tasks or provide new stimuli).
There are two main reasons why so few examples of in-game learning exist:
learning speed and reliability. To have any noticeable effect, learning should be visible after a few repetitions (or a few dozen at most). Spronck et al (2006)’s application
of dynamic scripting is on the verge of usability in this respect. As for reliability,
learning is affected by many factors including randomness. Thus, it is unpredictable
how the learning process will end, and results will probably have significant variance
and a chance of failure. This kind of unpredictability is not acceptable in commer-
cial games, with the possible exception of “teaching games” like Black & White,
where the player controls the learning process.
Opportunities are wider for offline adaptation, in which all learning is done before the release of the game, and the resulting policies are then tested and fixed by the developers. For example, opponents were trained by AI techniques in the racing games
Re-Volt, Colin McRae Rally and Forza Motorsport. We highlight two areas where
offline learning has promising results: micromanagement and game balancing.
A major problem with real-time strategy games and turn-based strategy games
like the Civilization series is the need for micromanagement (Wender and Watson,
2009): in parallel to making high-level decisions (where to attack, what kind of technology to research), the player needs to make lower-level decisions, like assigning production, allocating the workforce in each city, and managing individual soldiers and workers. Micromanagement tasks are tedious and repetitive. In turn-based
strategy games, a standard solution to avoid micromanagement is to hire AI counsellors who can take over part of the job; in RTS games, counsellors are not yet common.
Szczepański and Aamodt (2009) apply case-based learning to micromanagement in
the RTS Warcraft 3. According to their experiments, the agents (that, for example,
learned to heal wounded units in battle), were favorably received by beginner and
Games are a thriving area for reinforcement learning research, with a large number
of applications. Temporal-difference learning, Monte-Carlo tree search and evolu-
tionary reinforcement learning are among the most popular techniques applied. In
many cases, RL approaches are competitive with other AI techniques and/or human intelligence; in others this has not happened yet, but progress is rapid.
The chapter tried to give a representative sample of problems and challenges specific to games, and of the major approaches used in game applications. In the review we focused on understanding the tips and tricks that make the difference between a working and a failing RL algorithm. By no means do we claim to have a complete overview of the RL techniques used in games. We did not even try to give a historical overview, and we had to leave out most results bordering on evolutionary computation, game theory, supervised learning and general AI research.
Furthermore, as the focus of the chapter was the RL side of the RL–games rela-
tionship, we had to omit whole genres of games, like turn-based strategy games,
References
Aha, D.W., Molineaux, M., Ponsen, M.: Learning to win: Case-based plan selection in a
real-time strategy game. Case-Based Reasoning Research and Development, 5–20 (2005)
Amit, A., Markovitch, S.: Learning to bid in bridge. Machine Learning 63(3), 287–327 (2006)
Andrade, G., Santana, H., Furtado, A., Leitão, A., Ramalho, G.: Online adaptation of com-
puter games agents: A reinforcement learning approach. Scientia 15(2) (2004)
Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem.
Machine Learning 47, 235–256 (2002)
Bartók, G., Szepesvári, C., Zilles, S.: Models of active learning in group-structured state
spaces. Information and Computation 208, 364–384 (2010)
Baxter, J., Tridgell, A., Weaver, L.: Learning to play chess using temporal-differences. Ma-
chine learning 40(3), 243–263 (2000)
Baxter, J., Tridgell, A., Weaver, L.: Reinforcement learning and chess. In: Machines that learn
to play games, pp. 91–116. Nova Science Publishers, Inc. (2001)
Beal, D., Smith, M.C.: Learning piece values using temporal differences. ICCA Journal 20(3),
147–151 (1997)
Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific (1996)
Billings, D., Davidson, A., Schauenberg, T., Burch, N., Bowling, M., Holte, R.C., Schaeffer,
J., Szafron, D.: Game-Tree Search with Adaptation in Stochastic Imperfect-Information
Games. In: van den Herik, H.J., Björnsson, Y., Netanyahu, N.S. (eds.) CG 2004. LNCS,
vol. 3846, pp. 21–34. Springer, Heidelberg (2006)
Björnsson, Y., Finnsson, H.: Cadiaplayer: A simulation-based general game player. IEEE
Transactions on Computational Intelligence and AI in Games 1(1), 4–15 (2009)
Böhm, N., Kókai, G., Mandl, S.: Evolving a heuristic function for the game of tetris. In: Proc.
Lernen, Wissensentdeckung und Adaptivität LWA, pp. 118–122 (2004)
Boumaza, A.: On the evolution of artificial Tetris players. In: IEEE Symposium on Compu-
tational Intelligence and Games (2009)
Bouzy, B., Helmstetter, B.: Monte Carlo Go developments. In: Advances in Computer Games,
pp. 159–174 (2003)
Bowling, M.: Convergence and no-regret in multiagent learning. In: Neural Information Pro-
cessing Systems, pp. 209–216 (2004)
Buro, M.: From simple features to sophisticated evaluation functions. In: International Con-
ference on Computers and Games, pp. 126–145 (1998)
Buro, M., Furtak, T.: RTS games as test-bed for real-time research. JCIS, 481–484 (2003)
Buro, M., Lanctot, M., Orsten, S.: The second annual real-time strategy game AI competition.
In: GAME-ON NA (2007)
Chaslot, G., Winands, M., Herik, H., Uiterwijk, J., Bouzy, B.: Progressive strategies for
monte-carlo tree search. New Mathematics and Natural Computation 4(3), 343 (2008)
Chaslot, G., Fiter, C., Hoock, J.B., Rimmel, A., Teytaud, O.: Adding Expert Knowledge and
Exploration in Monte-Carlo Tree Search. In: van den Herik, H.J., Spronck, P. (eds.) ACG
2009. LNCS, vol. 6048, pp. 1–13. Springer, Heidelberg (2010)
Chatriot, L., Gelly, S., Jean-Baptiste, H., Perez, J., Rimmel, A., Teytaud, O.: Including expert
knowledge in bandit-based Monte-Carlo planning, with application to computer-Go. In:
European Workshop on Reinforcement Learning (2008)
Coquelin, P.A., Munos, R.: Bandit algorithms for tree search. In: Uncertainty in Artificial
Intelligence (2007)
Coulom, R.: Efficient Selectivity and Backup Operators in Monte-carlo Tree Search. In: van
den Herik, H.J., Ciancarini, P., Donkers, H.H.L.M(J.) (eds.) CG 2006. LNCS, vol. 4630,
pp. 72–83. Springer, Heidelberg (2007)
Coulom, R.: Computing Elo ratings of move patterns in the game of go. ICGA Journal 30(4),
198–208 (2007)
Dahl, F.A.: Honte, a Go-playing program using neural nets. In: Machines that learn to play
games, pp. 205–223. Nova Science Publishers (2001)
Davidson, A.: Opponent modeling in poker: Learning and acting in a hostile and uncertain
environment. Master’s thesis, University of Alberta (2002)
Diuk, C., Cohen, A., Littman, M.L.: An object-oriented representation for efficient reinforce-
ment learning. In: International Conference on Machine Learning, pp. 240–247 (2008)
Droste, S., Fürnkranz, J.: Learning of piece values for chess variants. Tech. Rep. TUD–KE–
2008-07, Knowledge Engineering Group, TU Darmstadt (2008)
Džeroski, S., Raedt, L.D., Driessens, K.: Relational reinforcement learning. Machine Learn-
ing 43(1-2), 7–52 (2001)
Epstein, S.L.: Toward an ideal trainer. Machine Learning 15, 251–277 (1994)
Farias, V.F., van Roy, B.: Tetris: A Study of Randomized Constraint Sampling. In: Probabilis-
tic and Randomized Methods for Design Under Uncertainty. Springer, UK (2006)
Fawcett, T., Utgoff, P.: Automatic feature generation for problem solving systems. In: Inter-
national Conference on Machine Learning, pp. 144–153 (1992)
Finkelstein, L., Markovitch, S.: Learning to play chess selectively by acquiring move patterns.
ICCA Journal 21, 100–119 (1998)
Fudenberg, D., Levine, D.K.: The theory of learning in games. MIT Press (1998)
Fürnkranz, J.: Machine learning in games: a survey. In: Machines that Learn to Play Games,
pp. 11–59. Nova Science Publishers (2001)
Fürnkranz, J.: Recent advances in machine learning and game playing. Tech. rep., TU Darm-
stadt (2007)
Galway, L., Charles, D., Black, M.: Machine learning in digital games: a survey. Artificial
Intelligence Review 29(2), 123–161 (2008)
Gelly, S., Silver, D.: Achieving master-level play in 9x9 computer go. In: AAAI, pp. 1537–
1540 (2008)
Gelly, S., Wang, Y., Munos, R., Teytaud, O.: Modification of UCT with patterns in Monte-
Carlo go. Tech. rep., INRIA (2006)
Gherrity, M.: A game-learning machine. PhD thesis, University of California, San Diego, CA
(1993)
Ghory, I.: Reinforcement learning in board games. Tech. rep., Department of Computer Sci-
ence, University of Bristol (2004)
Gilgenbach, M.: Fun game AI design for beginners. In: AI Game Programming Wisdom,
vol. 3. Charles River Media, Inc. (2006)
Gilpin, A., Sandholm, T.: Lossless abstraction of imperfect information games. Journal of the
ACM 54(5), 25 (2007)
Gilpin, A., Sandholm, T., Sørensen, T.B.: Potential-aware automated abstraction of sequential
games, and holistic equilibrium analysis of Texas Hold’em poker. In: AAAI, vol. 22, pp.
50–57 (2007)
Ginsberg, M.L.: Gib: Imperfect information in a computationally challenging game. Journal
of Artificial Intelligence Research 14, 313–368 (2002)
Gould, J., Levinson, R.: Experience-based adaptive search. Tech. Rep. UCSC-CRL-92-10,
University of California at Santa Cruz (1992)
Günther, M.: Automatic feature construction for general game playing. PhD thesis, Dresden
University of Technology (2008)
Hagelbäck, J., Johansson, S.J.: Measuring player experience on runtime dynamic difficulty
scaling in an RTS game. In: International Conference on Computational Intelligence and
Games (2009)
Hartley, T., Mehdi, Q., Gough, N.: Online learning from observation for interactive computer
games. In: International Conference on Computer Games: Artificial Intelligence and Mo-
bile Systems, pp. 27–30 (2005)
van den Herik, H.J., Uiterwijk, J.W.H.M., van Rijswijck, J.: Games solved: Now and in the
future. Artificial Intelligence 134, 277–311 (2002)
Hsu, F.H.: Behind Deep Blue: Building the Computer that Defeated the World Chess Cham-
pion. Princeton University Press, Princeton (2002)
Hunicke, R., Chapman, V.: AI for dynamic difficult adjustment in games. In: Challenges in
Game AI Workshop (2004)
Kakade, S.: A natural policy gradient. In: Advances in Neural Information Processing Sys-
tems, vol. 14, pp. 1531–1538 (2001)
Kalles, D., Kanellopoulos, P.: On verifying game designs and playing strategies using rein-
forcement learning. In: ACM Symposium on Applied Computing, pp. 6–11 (2001)
Kerbusch, P.: Learning unit values in Wargus using temporal differences. BSc thesis (2005)
Kocsis, L., Szepesvári, C.: Bandit Based Monte-Carlo Planning. In: Fürnkranz, J., Scheffer,
T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp. 282–293. Springer,
Heidelberg (2006)
Kocsis, L., Szepesvári, C., Winands, M.H.M.: RSPSA: Enhanced Parameter Optimization in
Games. In: van den Herik, H.J., Hsu, S.-C., Hsu, T.-s., Donkers, H.H.L.M(J.) (eds.) CG
2005. LNCS, vol. 4250, pp. 39–56. Springer, Heidelberg (2006)
Kok, E.: Adaptive reinforcement learning agents in RTS games. Master’s thesis, University
of Utrecht, The Netherlands (2008)
Koza, J.: Genetic programming: on the programming of computers by means of natural se-
lection. MIT Press (1992)
Kuhlmann, G.J.: Automated domain analysis and transfer learning in general game playing.
PhD thesis, University of Texas at Austin (2010)
Lagoudakis, M.G., Parr, R., Littman, M.L.: Least-Squares Methods in Reinforcement Learn-
ing for Control. In: Vlahavas, I.P., Spyropoulos, C.D. (eds.) SETN 2002. LNCS (LNAI),
vol. 2308, pp. 249–260. Springer, Heidelberg (2002)
Laursen, R., Nielsen, D.: Investigating small scale combat situations in real time strategy
computer games. Master’s thesis, University of Aarhus (2005)
Levinson, R., Weber, R.: Chess Neighborhoods, Function Combination, and Reinforcement
Learning. In: Marsland, T., Frank, I. (eds.) CG 2001. LNCS, vol. 2063, pp. 133–150.
Springer, Heidelberg (2002)
Lorenz, U.: Beyond Optimal Play in Two-Person-Zerosum Games. In: Albers, S., Radzik, T.
(eds.) ESA 2004. LNCS, vol. 3221, pp. 749–759. Springer, Heidelberg (2004)
Mańdziuk, J.: Knowledge-Free and Learning-Based Methods in Intelligent Game Playing.
Springer, Heidelberg (2010)
Marthi, B., Russell, S., Latham, D.: Writing Stratagus-playing agents in concurrent alisp. In:
IJCAI Workshop on Reasoning, Representation, and Learning in Computer Games, pp.
67–71 (2005)
McGlinchey, S.J.: Learning of AI players from game observation data. In: GAME-ON, pp.
106–110 (2003)
Molineaux, M., Aha, D.W., Ponsen, M.: Defeating novel opponents in a real-time strategy
game. In: IJCAI Workshop on Reasoning, Representation, and Learning in Computer
Games, pp. 72–77 (2005)
Moriarty, D.E., Miikkulainen, R.: Discovering complex Othello strategies through evolution-
ary neural networks. Connection Science 7, 195–209 (1995)
Müller, M.: Position evaluation in computer go. ICGA Journal 25(4), 219–228 (2002)
Naddaf, Y.: Game-independent AI agents for playing Atari 2600 console games. Master’s
thesis, University of Alberta (2010)
Pollack, J.B., Blair, A.D.: Why did TD-Gammon work? In: Neural Information Processing
Systems, vol. 9, pp. 10–16 (1997)
Ponsen, M., Spronck, P.: Improving adaptive game AI with evolutionary learning. In: Com-
puter Games: Artificial Intelligence, Design and Education (2004)
Ponsen, M., Muñoz-Avila, H., Spronck, P., Aha, D.W.: Automatically acquiring adaptive real-
time strategy game opponents using evolutionary learning. In: Proceedings of the 17th
Innovative Applications of Artificial Intelligence Conference (2005)
Ponsen, M., Spronck, P., Tuyls, K.: Hierarchical reinforcement learning in computer games.
In: Adaptive Learning Agents and Multi-Agent Systems, pp. 49–60 (2006)
Ponsen, M., Taylor, M.E., Tuyls, K.: Abstraction and Generalization in Reinforcement Learn-
ing: A Summary and Framework. In: Taylor, M.E., Tuyls, K. (eds.) ALA 2009. LNCS,
vol. 5924, pp. 1–33. Springer, Heidelberg (2010)
Ramanujan, R., Sabharwal, A., Selman, B.: Adversarial search spaces and sampling-based
planning. In: International Conference on Automated Planning and Scheduling (2010)
Risk, N., Szafron, D.: Using counterfactual regret minimization to create competitive multi-
player poker agents. In: International Conference on Autonomous Agents and Multiagent
Systems, pp. 159–166 (2010)
Rubin, J., Watson, I.: Computer poker: A review. Artificial Intelligence 175(5-6), 958–987
(2011)
Schaeffer, J.: The games computers (and people) play. In: Zelkowitz, M. (ed.) Advances in
Computers, vol. 50, pp. 89–266. Academic Press (2000)
Schaeffer, J., Hlynka, M., Jussila, V.: Temporal difference learning applied to a high-
performance game-playing program. In: International Joint Conference on Artificial In-
telligence, pp. 529–534 (2001)
Schnizlein, D., Bowling, M., Szafron, D.: Probabilistic state translation in extensive games
with large action sets. In: International Joint Conference on Artificial Intelligence, pp.
278–284 (2009)
Schraudolph, N.N., Dayan, P., Sejnowski, T.J.: Learning to evaluate go positions via temporal
difference methods. In: Computational Intelligence in Games. Studies in Fuzziness and
Soft Computing, ch. 4, vol. 62, pp. 77–98. Springer, Heidelberg (2001)
Scott, B.: The illusion of intelligence. In: AI Game Programming Wisdom, pp. 16–20. Charles
River Media (2002)
Shapiro, A., Fuchs, G., Levinson, R.: Learning a Game Strategy Using Pattern-Weights and
Self-Play. In: Schaeffer, J., Müller, M., Björnsson, Y. (eds.) CG 2002. LNCS, vol. 2883,
pp. 42–60. Springer, Heidelberg (2003)
Sharifi, A.A., Zhao, R., Szafron, D.: Learning companion behaviors using reinforcement
learning in games. In: AIIDE (2010)
Sharma, S., Kobti, Z., Goodwin, S.: General game playing: An overview and open problems.
In: International Conference on Computing, Engineering and Information, pp. 257–260
(2009)
Silver, D., Tesauro, G.: Monte-carlo simulation balancing. In: International Conference on
Machine Learning (2009)
Silver, D., Sutton, R., Mueller, M.: Sample-based learning and search with permanent and
transient memories. In: ICML (2008)
Spronck, P., Sprinkhuizen-Kuyper, I., Postma, E.: Difficulty scaling of game AI. In: GAME-
ON 2004: 5th International Conference on Intelligent Games and Simulation (2004)
Spronck, P., Ponsen, M., Sprinkhuizen-Kuyper, I., Postma, E.: Adaptive game AI with dy-
namic scripting. Machine Learning 63(3), 217–248 (2006)
Stanley, K.O., Bryant, B.D., Miikkulainen, R.: Real-time neuroevolution in the NERO video
game. IEEE Transactions on Evolutionary Computation 9(6), 653–668 (2005)
Sturtevant, N., White, A.: Feature construction for reinforcement learning in Hearts. In: Ad-
vances in Computers and Games, pp. 122–134 (2007)
Szczepański, T., Aamodt, A.: Case-based reasoning for improved micromanagement in real-
time strategy games. In: Workshop on Case-Based Reasoning for Computer Games, 8th
International Conference on Case-Based Reasoning, pp. 139–148 (2009)
Szita, I., Lörincz, A.: Learning Tetris using the noisy cross-entropy method. Neural Compu-
tation 18(12), 2936–2941 (2006a)
Szita, I., Lörincz, A.: Learning to play using low-complexity rule-based policies: Illustrations
through Ms. Pac-Man. Journal of Artificial Intelligence Research 30, 659–684 (2006b)
Szita, I., Szepesvári, C.: SZ-Tetris as a benchmark for studying key problems of RL. In: ICML
2010 Workshop on Machine Learning and Games (2010)
Szita, I., Chaslot, G., Spronck, P.: Monte-Carlo Tree Search in Settlers of Catan. In: van
den Herik, H.J., Spronck, P. (eds.) ACG 2009. LNCS, vol. 6048, pp. 21–32. Springer,
Heidelberg (2010)
Tesauro, G.: Practical issues in temporal difference learning. Machine Learning 8, 257–277
(1992)
Tesauro, G.: Temporal difference learning and TD-gammon. Communications of the
ACM 38(3), 58–68 (1995)
Tesauro, G.: Comments on ‘co-evolution in the successful learning of backgammon strategy’.
Machine Learning 32(3), 241–243 (1998)
Tesauro, G.: Programming backgammon using self-teaching neural nets. Artificial Intelli-
gence 134(1-2), 181–199 (2002)
Thiery, C., Scherrer, B.: Building controllers for Tetris. ICGA Journal 32(1), 3–11 (2009)
Thrun, S.: Learning to play the game of chess. In: Neural Information Processing Systems,
vol. 7, pp. 1069–1076 (1995)
Utgoff, P.: Feature construction for game playing. In: Fürnkranz, J., Kubat, M. (eds.) Ma-
chines that Learn to Play Games, pp. 131–152. Nova Science Publishers (2001)
Utgoff, P., Precup, D.: Constructive function approximation. In: Liu, H., Motoda, H. (eds.)
Feature Extraction, Construction and Selection: A Data Mining Perspective, vol. 453, pp.
219–235. Kluwer Academic Publishers (1998)
Veness, J., Silver, D., Uther, W., Blair, A.: Bootstrapping from game tree search. In: Neural
Information Processing Systems, vol. 22, pp. 1937–1945 (2009)
Weber, B.G., Mateas, M.: Case-based reasoning for build order in real-time strategy games.
In: Artificial Intelligence and Interactive Digital Entertainment, pp. 1313–1318 (2009)
Wender, S., Watson, I.: Using reinforcement learning for city site selection in the turn-based
strategy game Civilization IV. In: Computational Intelligence and Games, pp. 372–377
(2009)
Wiering, M.A.: Self-play and using an expert to learn to play backgammon with temporal
difference learning. Journal of Intelligent Learning Systems and Applications 2, 57–68
(2010)
Zinkevich, M., Johanson, M., Bowling, M., Piccione, C.: Regret minimization in games
with incomplete information. In: Neural Information Processing Systems, pp. 1729–1736
(2008)
Chapter 18
Reinforcement Learning in Robotics: A Survey
18.1 Introduction
Robotics offers a nearly infinite number of interesting learning problems, a large percentage of which can be phrased as reinforcement learning problems. See Figure 18.1 for
an illustration of the wide variety of robots that have learned tasks using reinforce-
ment learning. However, robotics as a domain differs significantly from well-defined
typical reinforcement learning benchmark problems, which usually have discrete
states and actions. In contrast, many real-world problems in robotics are best repre-
sented with high-dimensional, continuous states and actions. Every single trial run
is costly and, as a result, such applications force us to focus on problems that do not
arise that frequently in classical reinforcement learning benchmark examples. In this
book chapter, we highlight the challenges faced in robot reinforcement learning and
bring many of the inherent problems of this domain to the reader’s attention.
Robotics is characterized by high dimensionality due to the many degrees of free-
dom of modern anthropomorphic robots. Experience on the real system is costly and
often hard to reproduce. However, it usually cannot be replaced by simulations, at
least for highly dynamic tasks, as even small modeling errors accumulate to sub-
stantially different dynamic behavior. Another challenge faced in robot reinforce-
ment learning is the generation of appropriate reward functions. Good rewards that
lead the systems quickly to success are needed to cope with the cost of real-world
experience but are a substantial manual contribution.
Obviously, not every reinforcement learning method is equally suitable for the
robotics domain. In fact, many of the methods that scale to more interesting tasks
are model-based (Atkeson et al, 1997; Abbeel et al, 2007) and often employ pol-
icy search rather than value function-based approaches (Gullapalli et al, 1994;
Miyamoto et al, 1996; Kohl and Stone, 2004; Tedrake et al, 2005; Peters and Schaal,
2008a,b; Kober and Peters, 2009). This stands in contrast to much of mainstream reinforcement learning (Kaelbling et al, 1996; Sutton and Barto, 1998). We attempt to give a
fairly complete overview of real robot reinforcement learning, citing most original papers while distinguishing between them mainly on a methodological level.
As none of the presented methods extends to robotics with ease, we discuss how
robot reinforcement learning can be made tractable. We present several approaches
to this problem, such as choosing an appropriate representation for the value function or policy, incorporating prior knowledge, and transferring from simulations.
In this book chapter, we survey real robot reinforcement learning and highlight
how these approaches were able to handle the challenges posed by this setting. Less
attention is given to results that correspond only to slightly enhanced grid-worlds or
that were learned exclusively in simulation. The challenges in applying reinforce-
ment learning in robotics are discussed in Section 18.2.
Standard reinforcement learning methods suffer from the discussed challenges.
As already pointed out in the reinforcement learning review paper by Kaelbling et al
(1996) “we must give up tabula rasa learning techniques and begin to incorporate
bias that will give leverage to the learning process”. Hence, we concisely present re-
inforcement learning techniques in the context of robotics in Section 18.3. Different
approaches of making reinforcement learning tractable are treated in Sections 18.4
to 18.6. Finally in Section 18.7, we employ the example of ball-in-a-cup to highlight
which of the various approaches discussed in the book chapter have been particu-
larly helpful for us to make such a complex task tractable. In Section 18.8, we give
a conclusion and outlook on interesting problems.
Fig. 18.1 This figure illustrates robots to which reinforcement learning has been applied. The robots cover the whole range from wheeled mobile robots and robotic arms to autonomous vehicles and humanoid robots. (a) The OBELIX robot is a wheeled mobile robot that learned to push boxes (Mahadevan and Connell, 1992) with a value function-based approach (Picture reprinted with permission of Sridhar Mahadevan). (b) The Zebra Zero robot is a robot arm that learned a peg-in-hole insertion task (Gullapalli et al, 1994) with a model-free policy gradient approach (Picture reprinted with permission of Rod Grupen). (c) The control of such autonomous blimps (Picture reprinted with permission of Axel Rottmann) was learned with both a value function-based approach (Rottmann et al, 2007) and model-based policy search (Ko et al, 2007). (d) The Sarcos humanoid DB learned a pole-balancing task (Schaal, 1997) using forward models (Picture reprinted with permission of Stefan Schaal).
Fig. 18.2 This figure illustrates the state space of a robot reinforcement learning task
commercial robot systems also encapsulate some of the state and action components
in an embedded control system (e.g., trajectory fragments are frequently used as
actions for industrial robots); however, this form of a state dimensionality reduction
severely limits the dynamic capabilities of the robot according to our experience
(Schaal et al, 2002; Peters et al, 2010b).
The reinforcement learning community has a long history of dealing with dimen-
sionality using computational abstractions. It offers a large set of applicable tools, ranging from adaptive discretizations (Buşoniu et al, 2010) through function approximation approaches (Sutton and Barto, 1998) to macro actions or options (Barto and Mahadevan, 2003). Macro actions would allow decomposing a task into elementary components and translate quite naturally to robotics. For example, a macro action “move one meter to the left” could be achieved by a lower-level controller that takes care of accelerating, moving, and stopping while ensuring the required precision. Using a
limited set of manually generated macro actions, standard reinforcement learning
approaches can be made tractable for navigational tasks for mobile robots. However, the automatic generation of such sets of macro actions remains the key issue for enabling such approaches. We will discuss approaches that have been successful in
robot reinforcement learning in Section 18.4.
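As an illustration of the macro-action idea, the sub-goal “move one meter to the left” can wrap a lower-level position controller; the robot interface and the simple proportional rule below are assumptions made purely for this sketch.

```python
def move_left_one_meter(robot, step_dt=0.01, tolerance=0.01):
    """Macro action: delegate accelerating, moving and stopping to a
    lower-level position controller until the sub-goal is reached.

    robot.position(), robot.send_velocity_command() and robot.wait() are
    hypothetical interfaces; the proportional gain and velocity limit are
    illustrative values only.
    """
    _, y0 = robot.position()
    goal_y = y0 + 1.0                                   # "left" in this convention
    while abs(goal_y - robot.position()[1]) > tolerance:
        error = goal_y - robot.position()[1]
        vy = max(-0.5, min(0.5, 2.0 * error))           # clamped proportional command
        robot.send_velocity_command(vx=0.0, vy=vy)
        robot.wait(step_dt)
    robot.send_velocity_command(vx=0.0, vy=0.0)         # stop at the sub-goal
```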
Robots inherently interact with the real world and, hence, robot reinforcement
learning suffers from most of the resulting real world problems. For example, robot
hardware is usually expensive, suffers from wear and tear, and requires careful main-
tenance. Repairing a robot system is a non-negligible effort associated with cost,
physical labor and long waiting periods. Hence, to apply reinforcement learning in
robotics, safe exploration becomes a key issue of the learning process; a problem
often neglected in the general reinforcement learning community.
However, several more aspects of the real world make robotics a challenging
domain. As the dynamics of a robot can change due to many external factors ranging
from temperature to wear, the learning process may never fully converge, i.e., it
needs a ‘tracking solution’ (Sutton et al, 2007). Frequently, the environment settings during an earlier learning period cannot be reproduced and the external factors are not clear, e.g., how the lighting conditions affected the performance of the vision system and, as a result, the performance on the task. This problem makes comparisons of algorithms particularly hard.
Most real robot learning tasks require some form of human supervision, e.g.,
putting the pole back on the robot’s end-effector during pole balancing, see Fig-
ure 18.1d, after a failure. Even when an automatic reset exists (e.g., by having a
smart contraption that resets the pole), learning speed becomes essential as a task on a real robot cannot be sped up. The whole episode needs to be completed, as it is often not possible to start from arbitrary states.
For such reasons, real-world samples are expensive in terms of time, labor and,
potentially, finances. Thus, sample efficient algorithms that are able to learn from
a small number of trials are essential. In Sections 18.6.2 and 18.6.3 we will dis-
cuss several approaches that allow reducing the amount of required real world
interactions.
As the robot is a physical system there are strict constraints on the interaction
between the learning algorithm and the robot setup. Usually the robot needs to receive commands at a fixed frequency, and for dynamic tasks the movement cannot be paused. Thus, the agent has to select actions in real time. It is often not possible to pause to think, learn or plan between actions; rather, the learning algorithm has to make do with a fixed amount of computation time. Thus, not only are samples expensive to obtain, but often only a very limited number of samples can be used if the runtime of the algorithm depends on the number of samples. These constraints are
less severe in an episodic setting where the time intensive part of the learning can
be postponed to the period between episodes.
On physical systems there are always delays in sensing and actuation. The state
of the setup represented by the sensors slightly lags behind the real state due to
processing and communication delays. More critically, there are also communication delays in the actuation, as well as delays due to the fact that a physical system cannot instantly change its movement. For example, due to inertia the direction of a movement cannot be inverted instantaneously; rather, a braking and an accelerating phase are required. The robot always needs this kind of transition phase. Due to these delays, actions
do not have instantaneous effects but are observable only several time steps later. In
contrast, in most general reinforcement learning algorithms, the actions are assumed
to take effect instantaneously.
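The mismatch with the instantaneous-effect assumption can be mimicked by a simple wrapper that applies each command only a fixed number of steps after it was issued; this is merely an illustration of the delay problem, not a method from the literature cited here.

```python
from collections import deque

class DelayedActuation:
    """Wraps an environment so that actions take effect `delay` steps later."""

    def __init__(self, env, delay=3, neutral_action=0.0):
        self.env = env                                   # assumed to offer step(action)
        self.queue = deque([neutral_action] * delay)     # commands still "in flight"

    def step(self, action):
        self.queue.append(action)
        applied = self.queue.popleft()                   # command issued `delay` steps ago
        return self.env.step(applied)
```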
One way to offset the cost of real-world interaction would be to use accurate models as simulators. In an ideal setting, such an approach would render it possible to learn the behavior in simulation and subsequently transfer it to the real
robot. Unfortunately, creating a sufficiently accurate model of the robot and the en-
vironment is challenging. As small model errors may accumulate, we can frequently
see a fast divergence of the simulated robot from the real-world system. When a pol-
icy is trained using an imprecise forward model as simulator, the behavior will not
transfer without significant modifications as experienced by Atkeson (1994) when
learning the underactuated swing-up. Only in a limited number of experiments have authors achieved such a direct transfer; see Section 18.6.3 for examples. If the
task is energy absorbing or excessive energy is not critical, it is safer to assume that
approaches that can be applied in simulation may work similarly in the real-world
(Kober and Peters, 2010). In an energy absorbing scenario, the task is inherently
stable and transferring policies poses a low risk of damaging the robot. However, in
energy absorbing scenarios, tasks can often be learned better in the real world than
in simulation due to complex interactions between mechanical objects. In an energy
emitting scenario, the task is inherently unstable and transferring policies poses a
high risk. As we will see later, models are best used to test the algorithms in simula-
tions but can also be used to check the proximity to theoretically optimal solutions,
or to perform ‘mental rehearsal’.
In reinforcement learning, the goal of the task is implicitly specified by the reward.
Defining a good reward function in robot reinforcement learning is hence often a
daunting task. Giving rewards only upon task achievement, e.g., did a table tennis
robot win the match, will result in a simple binary reward. However, the robot would
receive such a reward so rarely that it is unlikely to ever succeed in the lifetime
of a real-world system. Hence, instead of using only simple binary rewards, we frequently need to include additional knowledge in such scalar rewards to guide
the learning process to a reasonable solution. The trade-off between different factors
may be essential as hitting a table tennis ball very fast will result in a high score but
is likely to destroy the robot. Good reinforcement learning algorithms often exploit
the reward function in unexpected ways, especially if the reinforcement learning is
done locally and not globally. For example, if contact between racket and ball is
part of the reward in ball paddling (see Figure 18.2), many locally optimal solutions
would attempt to simply keep the ball on the racket.
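As an example of such additional knowledge, a shaped reward for the ball-paddling task might combine a tracking term, a small contact bonus and an effort penalty. Both the terms and the weights below are illustrative assumptions, and, as noted above, a contact term of this kind is exactly what locally optimal solutions can learn to exploit.

```python
def paddling_reward(ball_height, target_height, racket_contact,
                    joint_velocities, w_height=1.0, w_contact=0.1, w_effort=0.01):
    """Shaped scalar reward: track the desired ball height, mildly reward
    racket contact, and penalize excessive joint velocities.
    All weights are illustrative, not taken from any cited system."""
    height_term = -w_height * (ball_height - target_height) ** 2
    contact_term = w_contact if racket_contact else 0.0
    effort_term = -w_effort * sum(v * v for v in joint_velocities)
    return height_term + contact_term + effort_term
```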
Inverse reinforcement learning, also known as apprenticeship learning, is a
promising alternative to specifying the reward function manually. Instead, it as-
sumes that the reward function can be reconstructed from a set of expert demon-
strations. Recently, it has yielded a variety of successful applications in robotics,
see (Kolter et al, 2007; Abbeel et al, 2008; Coates et al, 2009) for more information.
Real-world domains such as robotics are affected more strongly by the basic ap-
proach choices. Hence, we introduce reinforcement learning in this chapter with a
particular point of view. As stated in Chapter 1, the goal of reinforcement learn-
ing is to find a policy π (s,a) that gathers maximal rewards R (s,a). However, in
real-world domains the average reward is often more suitable than a discounted
formulation due to its stability properties (Peters et al, 2004). In order to incor-
porate exploration, the policy is considered a conditional probability distribution
π (s,a) = f (a|s,θ ) with parameters θ . Reinforcement learning aims at finding the
Here, Equation (18.2) defines stationarity of the state distributions μ π (i.e., it ensures
that it converges) and Equation (18.3) ensures a proper state-action probability dis-
tribution. This optimization problem can be tackled in two substantially different ways (Bellman, 1967, 1971): we can search for the optimal solution directly in the original, primal problem, or we can optimize in the dual formulation. Optimizing in the primal formulation is known as policy search in reinforcement learning, while searching in the dual is called a value function-based approach.
As this implies that there are as many equations as the number of states multiplied by the number of actions, it is clear that only one action a∗ can be optimal. Thus, we have the Bellman Principle of Optimality (Kirk, 1970)
V^*(s) = \max_{a^*} \left[ R(s,a^*) - \bar{R} + \sum_{s'} V^*(s')\, T(s,a^*,s') \right]. \qquad (18.4)
When evaluating Equation (18.4), we realize that V∗(s) corresponds to the sum of the differences between the rewards encountered after taking the optimal action a∗ in state s and the average reward R̄. Note that this function is usually discovered by human insight
(Sutton and Barto, 1998). This principle of optimality has given birth to the field
of optimal control (Kirk, 1970) and the solution above corresponds to the dynamic
programming solution from the viewpoint of reinforcement learning.
Hence, we have a dual formulation of the original problem as a condition for optimality. Many traditional reinforcement learning approaches are based on this equation; these are called value function methods. Instead of directly learning a policy, they first approximate the Lagrangian multiplier V(s), also called the value function, and use it to reconstruct the optimal policy. A wide variety of methods exist and can be split mainly into three classes: (i) dynamic programming-based optimal control approaches such as policy iteration or value iteration, (ii) rollout-based Monte Carlo methods, and (iii) temporal difference methods such as TD(λ), Q-learning and
SARSA. See Chapter 1 for a more detailed discussion. However, such value function-based approaches often do not translate well to high-dimensional robotics, as proper representations for the value function become intractable, and even finding the optimal action can already be a hard problem. A particularly drastic problem is the error propagation in value functions, where a small change in the policy may cause a large change in the value function, which in turn causes a large change in the policy.
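For a small tabular problem, the fixed point of Equation (18.4) can be approximated by relative value iteration, as in the sketch below, where the average reward R̄ is estimated at a reference state. The tabular setting is, of course, exactly what does not scale to robotics; the sketch only serves to make the dual formulation concrete.

```python
def relative_value_iteration(states, actions, T, R, ref_state,
                             n_iters=1000, tol=1e-8):
    """Approximate the average-reward Bellman equation (18.4) on a small MDP.

    T[s][a][s2] is the transition probability, R[s][a] the expected reward.
    V is kept relative to a reference state, which also yields an estimate
    of the average reward R_bar.  Purely tabular, for illustration only.
    """
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        Q = {s: {a: R[s][a] + sum(T[s][a][s2] * V[s2] for s2 in states)
                 for a in actions}
             for s in states}
        R_bar = max(Q[ref_state].values())            # gain estimate at the reference state
        new_V = {s: max(Q[s].values()) - R_bar for s in states}
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            V = new_V
            break
        V = new_V
    return V, R_bar
```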
Table 18.1 This table illustrates different value function based reinforcement learning meth-
ods employed for robotic tasks and associated publications
Value Function Approaches
Approach Employed by. . .
Model-Based
Abbeel et al (2006, 2007); Atkeson and Schaal (1997); Atkeson (1998);
Bagnell and Schneider (2001); Bakker et al (2006); Coates et al (2009);
Donnart and Meyer (1996); Hester et al (2010); Kalmár et al (1998);
Ko et al (2007); Kolter et al (2008); Martı́nez-Marı́n and Duckett
(2005); Michels et al (2005); Morimoto and Doya (2001); Ng et al
(2004b,a); Pendrith (1999); Schaal and Atkeson (1994); Touzet (1997);
Willgoss and Iqbal (1999)
Model-Free
Asada et al (1996); Bakker et al (2003); Benbrahim et al (1992); Ben-
brahim and Franklin (1997); Birdwell and Livingston (2007); Bitzer
et al (2010); Conn and Peters II (2007); Duan et al (2007, 2008); Fagg
et al (1998); Gaskett et al (2000); Gräve et al (2010); Hafner and Ried-
miller (2007); Huang and Weng (2002); Ilg et al (1999); Katz et al
(2008); Kimura et al (2001); Kirchner (1997); Kroemer et al (2009,
2010); Latzke et al (2007); Lizotte et al (2007); Mahadevan and Con-
nell (1992); Mataric (1997); Nemec et al (2009, 2010); Oßwald et al
(2010); Paletta et al (2007); Platt et al (2006); Riedmiller et al (2009);
Rottmann et al (2007); Smart and Kaelbling (1998, 2002); Soni and
Singh (2006); Thrun (1995); Tokic et al (2009); Uchibe et al (1998);
Wang et al (2006)
While this may lead more quickly to good, possibly globally optimal solutions, such a learning process is considerably more dangerous when applied to real systems, where it is likely to cause significant damage. An overview of publications using value function-based methods is presented in Table 18.1. Here, model-based methods refer to all methods that employ a predetermined or a learned model.
It is straightforward to realize that the primal formulation has many features relevant to robotics. It allows a natural integration of expert knowledge, e.g., through initializations of the policy. It allows domain-appropriate pre-structuring of the policy in an approximate form without changing the original problem. Optimal policies often have far fewer parameters than optimal value functions, e.g., in linear quadratic control, the value function has quadratically many parameters while the policy only requires linearly many parameters. Extensions to continuous state and action spaces
follow straightforwardly. Local search in policy space can directly lead to good
results as exhibited by early hill-climbing approaches (Kirk, 1970). Additional con-
straints can be incorporated naturally. As a result, policy search appears more natural
to robotics.
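A deliberately simple instance of search in the primal is episodic hill climbing in parameter space: perturb the policy parameters and keep the perturbation whenever the measured return improves. The rollout function, which runs one episode on the system or a simulator and returns its return, is an assumed interface; the methods listed in Table 18.2 are far more sample efficient than this sketch.

```python
import random

def hill_climb_policy_search(theta, rollout, sigma=0.05, iterations=200):
    """Episodic policy search by random perturbation of the parameters.

    theta: list of policy parameters; rollout(theta) -> return of one episode
    (an assumed interface).  A perturbation is kept only if it improves the
    measured return.
    """
    best_return = rollout(theta)
    for _ in range(iterations):
        candidate = [t + random.gauss(0.0, sigma) for t in theta]
        candidate_return = rollout(candidate)
        if candidate_return > best_return:      # greedy acceptance of improvements
            theta, best_return = candidate, candidate_return
    return theta, best_return
```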
Nevertheless, policy search has been considered the harder problem for a long
time as the optimal solution cannot directly be determined from Equations (18.1-
18.3) while the Bellman Principle of Optimality (Kirk, 1970) directly arises from
the problems’ Karush-Kuhn-Tucker conditions (Kuhn and Tucker, 1950). Notwith-
standing, in robotics, policy search has recently become an important alternative
to value function based methods due to the reasons described above as well as the
convergence problems of approximate value function methods. Most policy search
Table 18.2 This table illustrates different policy search reinforcement learning methods em-
ployed for robotic tasks and associated publications
Policy Search
Approach Employed by. . .
Gradient
Deisenroth and Rasmussen (2010); Endo et al (2008); Geng et al
(2006); Guenter et al (2007); Gullapalli et al (1994); Hailu and Sommer
(1998); Kohl and Stone (2004); Kolter and Ng (2009); Mitsunaga et al
(2005); Miyamoto et al (1996); Peters and Schaal (2008c,b); Tamei and
Shibata (2009); Tedrake (2004); Tedrake et al (2005)
Heuristic
Erden and Leblebicioaglu (2008); Dorigo and Colombetti (1993);
Mataric (1994); Svinin et al (2001); Yasuda and Ohkura (2008);
Youssef (2005)
Sample
Buchli et al (2011); Kober and Peters (2009); Kober et al (2010); Pastor
et al (2011); Peters and Schaal (2008a); Peters et al (2010a)
Much of the success of reinforcement learning methods has been due to the smart
use of approximate representations. In a domain that is so inherently beyond the reach of complete tabular representation, the need for such approximations is particularly pronounced. The different ways of making reinforcement learning meth-
ods tractable in robotics are tightly coupled to the underlying framework. Policy
search methods require a choice of policy representation that limits the number
of representable policies to enhance learning speed, see Section 18.4.3. A value
function-based approach requires an accurate, robust but general function approxi-
mator that can capture the value function sufficiently precisely, see Section 18.4.2.
Reducing the dimensionality of states or actions by smart state-action discretization
is a representational simplification that may enhance both policy search and value
Table 18.3 This table illustrates different methods of making robot reinforcement learning
tractable by employing a suitable representation
Smart State-Action Discretization
Approach Employed by. . .
Hand crafted
Benbrahim et al (1992); Kimura et al (2001); Nemec et al (2010);
Paletta et al (2007); Tokic et al (2009); Willgoss and Iqbal (1999)
Learned
Piater et al (2010); Yasuda and Ohkura (2008)
Meta-actions
Asada et al (1996); Dorigo and Colombetti (1993); Kalmár et al (1998);
Mataric (1994, 1997); Platt et al (2006); Soni and Singh (2006); Nemec
et al (2009)
Relational
Representation Cocora et al (2006); Katz et al (2008)
Function Approximation
Approach Employed by. . .
Local Models
Bentivegna (2004); Schaal (1997); Smart and Kaelbling (1998)
Neural Networks
Benbrahim and Franklin (1997); Duan et al (2008); Gaskett et al
(2000); Hafner and Riedmiller (2003); Riedmiller et al (2009); Thrun
(1995)
GPR
Gräve et al (2010); Kroemer et al (2009, 2010); Lizotte et al (2007);
Rottmann et al (2007)
Neighbors
Hester et al (2010); Mahadevan and Connell (1992); Touzet (1997)
Pre-Structured Policies
Approach Employed by. . .
Motor Primitives
Kohl and Stone (2004); Kober and Peters (2009); Peters and Schaal
(2008c,b); Theodorou et al (2010)
Neural Networks
Endo et al (2008); Geng et al (2006); Gullapalli et al (1994); Hailu and
Sommer (1998)
Via Points
Miyamoto et al (1996)
Linear Models
Tamei and Shibata (2009)
GMM & LLM
Deisenroth and Rasmussen (2010); Guenter et al (2007); Peters and
Schaal (2008a)
Controller
Kolter and Ng (2009); Tedrake (2004); Tedrake et al (2005); Vlassis
et al (2009)
Non-parametric
Kober et al (2010); Mitsunaga et al (2005); Peters et al (2010a)
Function approximation has always been the key component that allowed value
function methods to scale into interesting domains. In robot reinforcement learning,
the following function approximation schemes have been popular and successful.
Neural networks. Neural networks as function approximators for continuous states
and actions have been used by various groups, e.g., multi-layer perceptrons were
used to learn a wandering behavior and visual servoing (Gaskett et al, 2000); fuzzy
neural networks (Duan et al, 2008) as well as explanation-based neural networks
(Thrun, 1995) have allowed learning basic navigation while CMAC neural networks
have been used for biped locomotion (Benbrahim and Franklin, 1997). A particularly impressive application has been the success of the Brainstormers RoboCup soccer team, which used multi-layer perceptrons to win the world cup several times in different leagues (Hafner and Riedmiller, 2003; Riedmiller et al, 2009).
Generalize to Neighboring Cells. As neural networks are globally affected by local errors, much work has focused on simply generalizing from neighboring cells. One of the earliest papers in robot reinforcement learning (Mahadevan and Connell, 1992) introduced this idea by statistically clustering states to speed up a box pushing task with a mobile robot, see Figure 18.1a. This approach was also used for a nav-
igation and obstacle avoidance task with a mobile robot (Touzet, 1997). Similarly,
decision trees have been used to generalize states and actions to unseen ones, e.g.,
to learn a penalty kick on a humanoid robot (Hester et al, 2010).
Local Models. Locally weighted regression is known to be a particularly efficient
function approximator. Using it for value function approximation has allowed learn-
ing a navigation task with obstacle avoidance (Smart and Kaelbling, 1998), a pole balancing task (Schaal, 1997), as well as an air hockey task (Bentivegna, 2004).
Gaussian Process Regression. Using GPs as function approximator for the value
function has allowed learning of hovering with an autonomous blimp (Rottmann
et al, 2007), see Figure 18.1c. Similarly, another paper shows that grasping can be
learned using Gaussian Process Regression (Gräve et al, 2010). Grasping locations
can be learned by focusing on rewards, modeled by GPR, by trying candidates with
predicted high rewards (Kroemer et al, 2009). High reward uncertainty allows in-
telligent exploration in reward-based grasping (Kroemer et al, 2010). The gait of robot dogs can be optimized by first learning the expected return function with Gaussian process regression and subsequently searching for the optimal solution (Lizotte et al, 2007).
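As a generic illustration of such function approximation (not the setup of any specific paper above), a value function can be fitted by ridge regression on radial-basis-function features of the continuous state:

```python
import numpy as np

def rbf_features(state, centers, width=0.5):
    """Map a continuous state to radial-basis-function features.
    `centers` is a (K, d) array of feature centers; width is illustrative."""
    state = np.asarray(state, dtype=float)
    return np.exp(-np.sum((centers - state) ** 2, axis=1) / (2.0 * width ** 2))

def fit_value_function(states, targets, centers, ridge=1e-3):
    """Least-squares fit of a linear value function V(s) = w . phi(s).

    `targets` would be Monte-Carlo returns or bootstrapped TD targets
    collected on the robot; everything here is a generic sketch.
    """
    Phi = np.vstack([rbf_features(s, centers) for s in states])
    A = Phi.T @ Phi + ridge * np.eye(Phi.shape[1])
    w = np.linalg.solve(A, Phi.T @ np.asarray(targets, dtype=float))
    return lambda s: float(rbf_features(s, centers) @ w)
```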
To make the policy search approach tractable, the policy needs to be represented
with an appropriate function approximation.
Fig. 18.3 Boston Dynamics LittleDog jumping (Kolter and Ng, 2009) (Picture reprint with
permission of Zico Kolter)
Operational space control was learned by Peters and Schaal (2008a) using locally
linear models.
Controller. Here, the parameters of a local linear controller are learned. Applications include learning to walk in 20 minutes with a biped robot (Tedrake, 2004; Tedrake et al, 2005), learning to drive a radio-controlled (RC) car as well as a jumping behavior for a robot dog (Kolter and Ng, 2009), as illustrated in Figure 18.3, and balancing a two-wheeled robot (Vlassis et al, 2009).
Non-parametric. Non-parametric representations can also be used in this context. The weights of different robot–human interaction possibilities (Mitsunaga et al, 2005), the weights of different striking movements in a table tennis task (Peters et al, 2010a), and the parameters of meta-actions for dart and table tennis tasks (Kober et al, 2010) can be optimized in this way.
Prior knowledge can significantly help to guide the learning process. Prior knowl-
edge can be included in the form of initial policies, initial models, or a predefined
structure of the task. These approaches significantly reduce the search space and,
thus, speed up the learning process. Providing a goal achieving initial policy allows
a reinforcement learning method to quickly explore promising regions in the value
functions or in policy space, see Section 18.5.1. Pre-structuring the task to break a
complicated task down into several more tractable ones can be very successful, see
Section 18.5.2. An overview of publications using prior knowledge to render the
learning problem tractable is presented in Table 18.4.
Animals and humans frequently learn using a combination of imitation and trial and
error. For example, when learning to play tennis, an instructor usually shows the
student how to do a proper swing, e.g., a forehand or backhand. The student will subsequently imitate this behavior but still needs hours of practice to successfully return balls to the opponent’s court. Input from a teacher is not limited to initial instruction. The instructor can give additional demonstrations in a later learning stage (Latzke et al, 2007), and these can also be used as differential feedback (Argall et al, 2008). Global exploration is not necessary as the student can improve by locally
optimizing his striking movements previously obtained by imitation. A similar ap-
proach can speed up robot reinforcement learning based on human demonstrations
or initial hand coded policies. For a recent survey on imitation learning for robotics
see (Argall et al, 2009).
Demonstrations by a Teacher. Demonstrations can be obtained by remotely controlling the robot, which was used to initialize a Q-table for a navigation task (Conn and Peters II, 2007). If the robot is back-drivable, kinesthetic teach-in (i.e., by
Table 18.4 This table illustrates different methods of making robot reinforcement learning
tractable by incorporating prior knowledge
Demonstration
Approach Employed by. . .
Teacher
Bitzer et al (2010); Conn and Peters II (2007); Gräve et al (2010);
Kober et al (2008); Kober and Peters (2009); Latzke et al (2007); Peters
and Schaal (2008c,b)
Policy
Birdwell and Livingston (2007); Erden and Leblebicioaglu (2008);
Martı́nez-Marı́n and Duckett (2005); Smart and Kaelbling (1998);
Tedrake (2004); Tedrake et al (2005); Wang et al (2006)
Task Structure
Approach Employed by. . .
Hierarchical
Donnart and Meyer (1996); Kirchner (1997); Morimoto and Doya
(2001)
Progressive Tasks
Asada et al (1996)
Directed Exploration
Employed by. . .
taking it by the hand and moving it) can be employed. This method has resulted
in applications including T-ball batting (Peters and Schaal, 2008c,b), reaching tasks
(Guenter et al, 2007; Bitzer et al, 2010), ball-in-a-cup (Kober and Peters, 2009),
flipping a light switch (Buchli et al, 2011), as well as playing pool and manipulat-
ing a box (Pastor et al, 2011). Motion-capture setups can be used alternatively, but
the demonstrations are often not as informative due to the correspondence problem.
Demonstrations obtained by motion capture have been used to learn ball-in-a-cup
(Kober et al, 2008) and grasping (Gräve et al, 2010).
Hand-Coded Policy. A pre-programmed policy can provide demonstrations instead
of a human teacher. A vision-based mobile robot docking task can be learned
faster with such a basic behavior than with Q-learning alone, as demonstrated in
(Martínez-Marín and Duckett, 2005). As an alternative, corrective actions applied
when the robot deviates significantly from the desired behavior can be employed as
prior knowledge. This approach has been used, in combination with Q-learning, to
adapt the walking patterns of a robot dog to new surfaces (Birdwell and Livingston,
2007). Having hand-coded stable initial gaits can also help significantly, as shown
for a six-legged robot gait (Erden and Leblebicioaglu, 2008) as well as for a biped
(Tedrake, 2004; Tedrake et al, 2005).
Often a task can be decomposed hierarchically into basic components (see Chapter 9)
or into a sequence of increasingly difficult tasks. In both cases the complexity
of the learning task is significantly reduced.
Hierarchical Reinforcement Learning. Easier tasks can be used as building blocks
for a more complex behavior. For example, hierarchical Q-learning has been used to
learn different behavioral levels for a six-legged robot: moving single legs, locally
moving the complete body, and globally moving the robot towards a goal (Kirchner,
1997). A stand-up behavior, treated as a hierarchical reinforcement learning task,
has been learned using Q-learning on the upper level and TD-learning on the lower
level (Morimoto and Doya, 2001). Navigation in a maze can be learned using an
actor-critic architecture by tuning the influence of different control modules and
learning these modules (Donnart and Meyer, 1996).
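To make the idea of such a decomposition concrete, the following minimal sketch shows a
two-level tabular Q-learner in which the upper level selects among sub-behaviors and each
sub-behavior is itself a Q-learner over primitive commands. The sub-behavior and command
names are purely illustrative assumptions and not taken from the cited systems.

# Minimal sketch of two-level hierarchical Q-learning; names are illustrative only.
import random
from collections import defaultdict

class TabularQ:
    def __init__(self, actions, alpha=0.1, gamma=0.95, epsilon=0.1):
        self.q = defaultdict(float)
        self.actions, self.alpha, self.gamma, self.epsilon = actions, alpha, gamma, epsilon

    def act(self, state):
        if random.random() < self.epsilon:
            return random.choice(self.actions)          # occasional exploration
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, s, a, r, s_next):
        target = r + self.gamma * max(self.q[(s_next, b)] for b in self.actions)
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])

# Lower level: one Q-learner per sub-behavior over primitive motor commands.
sub_behaviors = {name: TabularQ(actions=["cmd_a", "cmd_b", "cmd_c"])
                 for name in ["move_leg", "shift_body", "go_to_goal"]}
# Upper level: a Q-learner that picks which sub-behavior to execute next.
upper = TabularQ(actions=list(sub_behaviors.keys()))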
Progressive Tasks. Often complicated tasks are easier to learn if simpler tasks can
already be performed. A sequence of increasingly difficult missions has been
employed to learn a goal-shooting task with Q-learning in (Asada et al, 1996).
A mobile robot can learn to direct attention (Huang and Weng, 2002) by employing
a modified Q-learning approach using novelty. Using 'corrected truncated returns'
and taking into account the estimator variance, a six-legged robot equipped with
stepping reflexes can learn to walk (Pendrith, 1999). Using upper confidence bounds
to direct exploration, grasping can be learned efficiently (Kroemer et al, 2010).
Offline search can be used to enhance Q-learning during a grasping task (Wang et al,
2006).
Using a simulation instead of the real physical robot has major advantages such as
safety and speed. A simulation can be used to eliminate obviously bad behaviors
and often runs much faster than real time. Simulations are without doubt a helpful
testbed for debugging algorithms. A popular approach is to combine simulations and
real evaluations by testing only promising policies on the real system and using the
real system to collect new data to refine the simulation (Section 18.6.2). Unfortunately, directly
transferring policies learned in simulation to a real system can be challenging (Sec-
tion 18.6.3). An overview of publications using simulations to render the learning
problem tractable is presented in Table 18.5.
Table 18.5 This table illustrates different methods of making robot reinforcement learning
tractable using simulations

SIMULATIONS
  Mental Rehearsal        Abbeel et al (2006); Atkeson and Schaal (1997); Atkeson et al
                          (1997); Atkeson (1998); Bagnell and Schneider (2001); Bakker
                          et al (2006); Coates et al (2009); Deisenroth and Rasmussen
                          (2010); Ko et al (2007); Kolter et al (2008); Michels et al (2005);
                          Nemec et al (2010); Ng et al (2004b,a); Schaal and Atkeson
                          (1994); Uchibe et al (1998)
  Direct Policy Transfer  Bakker et al (2003); Duan et al (2007); Fagg et al (1998); Ilg
                          et al (1999); Oßwald et al (2010); Svinin et al (2001); Youssef
                          (2005)
Fig. 18.4 Autonomous inverted helicopter flight (Ng et al, 2004b) (picture reprinted with
permission of Andrew Ng)
Model-free algorithms try to learn the value function or the policy directly. Model-
based approaches jointly learn a model of the system and the value function or the
policy. Model-based methods can make the learning process considerably more
sample-efficient; however, depending on the type of model, they may require a lot
of memory. Model-based approaches rely on a method that finds good policies in
the learned model. Such methods run the risk of exploiting model inaccuracies to
decrease the cost. If the learning method requires predicting the future or using
derivatives, the inaccuracies may accumulate quickly and thus significantly amplify
noise and errors. These effects lead to value functions or policies that work well in
the model but poorly on the real system. This issue is closely related to the transfer
problem discussed in Section 18.2.4. Possible solutions are to overestimate the noise,
to introduce a controlled amount of inconsistency (Atkeson, 1998), or to use a crude
model only to estimate how policy changes affect the outcome while grounding the
policy evaluation in data from the real system (Abbeel et al, 2006). See Chapter 4 for
a more detailed discussion.
In Section 18.2.4, we discussed that policies learned in simulation often cannot
be transferred to the real system. However, simulations are still a very useful tool.
Most simulations run significantly faster than real time and many problems asso-
ciated with expensive samples (Section 18.2.2) can be avoided. For these reasons
simulations are usually used to debug, test and optimize algorithms. Learning in
simulation can often be made significantly easier than on real robots. The noise can
be controlled and all variables can be accessed. If the approach does not work in
simulation, it is often unlikely to work on the real system. Many papers also
use simulations to benchmark approaches, as repeating an experiment often enough
to observe the average behavior and to compare many algorithms is usually not
feasible on the real system.
The idea of combining learning in simulation and in the real environment has been
popularized by the Dyna-architecture (Sutton, 1990) in reinforcement learning. Due
to the obvious advantages in the robotics domain, it has been proposed in this con-
text as well. Experience collected in the real world can be used to learn a forward
model (Åström and Wittenmark, 1989) from data. Such a forward model allows
training in a simulated environment, and the resulting policy is subsequently
transferred to the real environment. This approach can also be iterated and may
significantly reduce the required interactions with the real world. However, the
learning process can often exploit the model errors, which may lead to biased
solutions and slow convergence.
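As a rough illustration of this alternation between real-world data collection and policy
improvement in the learned model, the sketch below uses a toy linear system and a simple
linear feedback policy; the dynamics, the least-squares model fit, and the grid search over
gains are all assumptions for the sake of the example, not the setup of the cited work.

# Sketch of iterated model learning and "rehearsal" on a toy system s' = A s + B u + noise.
import random

A_TRUE, B_TRUE = 0.9, 0.5               # unknown real dynamics

def real_step(s, u):
    return A_TRUE * s + B_TRUE * u + random.gauss(0.0, 0.01)

def rollout_real(k, horizon=30):
    s, data = 1.0, []
    for _ in range(horizon):
        u = -k * s + random.gauss(0.0, 0.1)   # feedback policy plus exploration noise
        s_next = real_step(s, u)
        data.append((s, u, s_next))
        s = s_next
    return data

def fit_model(data):
    # least-squares fit of s' ~ a s + b u from the collected real transitions
    sxx = sum(s * s for s, u, _ in data); suu = sum(u * u for s, u, _ in data)
    sxu = sum(s * u for s, u, _ in data)
    sxy = sum(s * y for s, u, y in data); suy = sum(u * y for s, u, y in data)
    det = sxx * suu - sxu * sxu
    return (sxy * suu - suy * sxu) / det, (suy * sxx - sxy * sxu) / det

def improve_in_simulation(a, b, gains):
    def cost(k):                         # simulated quadratic cost over a short horizon
        s, c = 1.0, 0.0
        for _ in range(30):
            u = -k * s
            c += s * s + 0.1 * u * u
            s = a * s + b * u
        return c
    return min(gains, key=cost)

data, k = [], 0.0
for _ in range(5):                       # alternate real trials and learning in the model
    data += rollout_real(k)
    a_hat, b_hat = fit_model(data)
    k = improve_in_simulation(a_hat, b_hat, gains=[i / 10 for i in range(21)])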
Such mental rehearsal has found many applications in robot reinforcement
learning. Parallel learning in simulation and directed exploration allows Q-learning
to learn a navigation task from scratch in 20 minutes (Bakker et al, 2006). Two
robots taking turns in learning a simplified soccer task were also able to profit from
mental rehearsal (Uchibe et al, 1998). Atkeson et al (1997) learned a billiards and
a devil-sticking task employing forward models. Nemec et al (2010) used a value
function learned in simulation to initialize learning on the real robot.
To reduce the simulation bias resulting from model errors, Ng et al (2004b,a)
suggested re-using the same series of random numbers when evaluating policies in
learned simulators for robotics and called this approach PEGASUS. Note that this
approach is well known in the simulation community for fixed models (Glynn, 1987).
The resulting approach has found various applications: it has been used to learn
artistic maneuvers for autonomous helicopters (Bagnell and Schneider, 2001; Ng
et al, 2004b,a), as illustrated in Figure 18.4, as well as control parameters for an RC
car (Michels et al, 2005) and an autonomous blimp (Ko et al, 2007). Alternative
means of using crude models by grounding policies in real-world data have been
suggested in (Abbeel et al, 2006) and were employed to learn to steer an RC car.
Instead of sampling from a forward-model-based simulator, such learned models can
also be used directly for computing optimal control policies. This has resulted in a
variety of robot reinforcement learning applications, ranging from pendulum
swing-up tasks learned with differential dynamic programming (DDP) (Atkeson and
Schaal, 1997; Atkeson, 1998), devil-sticking (a form of gyroscopic juggling) obtained
with local LQR solutions (Schaal and Atkeson, 1994), trajectory following with
space-indexed controllers trained with DDP for an autonomous RC car (Kolter et al,
2008), and the cart-pole task (Deisenroth and Rasmussen, 2010), to aerobatic
helicopter flight trained with DDP (Coates et al, 2009). Solving an LQR problem with
multiple probabilistic models and combining the resulting closed-loop control with
open-loop control has resulted in autonomous sideways sliding into a parking spot
(Kolter et al, 2010). A promising new related approach is LQR-trees (Tedrake et al,
2010).
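The core trick behind the PEGASUS idea mentioned above, fixing the random numbers so that
different policies are compared on the same simulated 'scenarios', can be illustrated with
the following small sketch; the toy dynamics, reward, and parameter grid are assumptions,
not the original implementation.

# Sketch of PEGASUS-style evaluation: re-using the same seeds means every candidate
# policy sees identical noise realizations, making the comparison (nearly) deterministic.
import random

def simulate_return(policy_gain, seed, horizon=50):
    rng = random.Random(seed)           # fixed scenario: same noise for every policy
    s, total = 1.0, 0.0
    for _ in range(horizon):
        u = -policy_gain * s
        s = 0.95 * s + 0.4 * u + rng.gauss(0.0, 0.05)   # toy stochastic dynamics
        total += -(s * s)               # reward: keep the state near zero
    return total

SCENARIOS = list(range(20))             # a fixed set of random seeds

def evaluate(policy_gain):
    return sum(simulate_return(policy_gain, seed) for seed in SCENARIOS) / len(SCENARIOS)

best_gain = max([i / 20 for i in range(31)], key=evaluate)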
Only a few papers claim that a policy learned in simulation can be transferred
directly to a real robot while maintaining a high level of performance. The few examples
include maze navigation tasks (Bakker et al, 2003; Oßwald et al, 2010; Youssef,
2005) and obstacle avoidance (Fagg et al, 1998) for a mobile robot. Similar transfer
was achieved in very basic robot soccer (Duan et al, 2007) and multi-legged robot
locomotion (Ilg et al, 1999; Svinin et al, 2001).
Up to this point in this chapter, we have reviewed a large variety of problems
and associated possible solutions in robot reinforcement learning. In this section,
we take an orthogonal approach and discuss in detail one task that we have
previously studied. This task is called ball-in-a-cup; due to its complexity, it can
serve as an example that highlights several of the challenges and methods discussed
above. In Section 18.7.1, we describe the experimental setting with a focus on
the task and the reward. Section 18.7.2 discusses a type of pre-structured policies
that has been particularly useful in robotics. Inclusion of prior knowledge is pre-
sented in Section 18.7.3. We explain the advantages of the employed policy search
algorithm in Section 18.7.4. We will discuss the use of simulations in this task in
Section 18.7.5. Finally, we explore an alternative reinforcement learning approach
in Section 18.7.6.
The children's motor game ball-in-a-cup, also known as balero and bilboquet, is
challenging even for most adults. The toy consists of a small cup held in one hand
(or, in our case, attached to the end-effector of the robot) and a small ball hanging
on a string attached to the cup's bottom (for our toy, the string is 40 cm long).
Initially, the ball is at rest, hanging down vertically. The player needs to move fast
to induce motion in the ball through the string, toss it up, and catch it with the cup.
Fig. 18.5 This figure shows schematic drawings of the ball-in-a-cup motion (a), the final
learned robot motion (c), as well as a kinesthetic teach-in (b). The green arrows show the
directions of the current movements in that frame. The human cup motion was taught to the
robot by imitation learning with 31 parameters per joint for an approximately 3 seconds long
movement. The robot manages to reproduce the imitated motion quite accurately, but the ball
misses the cup by several centimeters. After approximately 75 iterations of our Policy learn-
ing by Weighting Exploration with the Returns (PoWER) algorithm the robot has improved
its motion so that the ball regularly goes into the cup. See Figure 18.6 for the performance
increase.
A poorly chosen reward can be exploited by the robot, for example by hitting the
ball from below or from the side, as such behaviors are easier to achieve and yield
comparatively high rewards. To avoid such local optima, it was essential to find a
good reward function, such as the one described initially.
The task is quite complex, as the reward is not only affected by the cup's movements
but foremost by the ball's movements. As the ball's movements are very sensitive
to small perturbations, the initial conditions or small changes in the arm movement
will drastically affect the outcome. Creating an accurate simulation is hard due to
the nonlinear, unobservable dynamics of the string and its non-negligible weight.
The policy is represented by dynamical system motor primitives (Ijspeert et al, 2003;
Schaal et al, 2007). The global movement is encoded as a point attractor linear dy-
namical system. The details of the movement are generated by a transformation
function that allows learning complex behaviors. This transformation function is
modeled using locally linear function approximation. This combination of the global
attractor behavior and the local transformation allows a very parsimonious
representation of the policy. The policy is linear in its parameters, a = θ^T μ(s), and
it is thus straightforward to include prior knowledge from a demonstration using
supervised learning by locally weighted regression.
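As a rough illustration of how such a linear-in-parameters policy can be initialized from a
demonstration, the sketch below fits the parameter vector by weighted least squares on
recorded state-action pairs. The Gaussian basis functions, the one-dimensional state, and
all names are assumptions for the sake of the example and not the exact motor-primitive
parameterization used here.

# Sketch: policy a = theta^T mu(s) with Gaussian basis features, initialized from a
# demonstration by (locally) weighted least squares. Illustrative only.
import numpy as np

centers = np.linspace(0.0, 1.0, 10)        # basis-function centers over a 1-D phase/state
width = 0.05

def features(s):
    phi = np.exp(-(s - centers) ** 2 / (2 * width))
    return phi / phi.sum()                  # normalized basis activations

def fit_from_demonstration(states, actions, weights=None):
    Phi = np.vstack([features(s) for s in states])      # design matrix
    W = np.diag(weights if weights is not None else np.ones(len(states)))
    # weighted least squares: theta = (Phi^T W Phi)^-1 Phi^T W a (with a small regularizer)
    return np.linalg.solve(Phi.T @ W @ Phi + 1e-6 * np.eye(len(centers)),
                           Phi.T @ W @ np.asarray(actions))

def policy(theta, s):
    return theta @ features(s)

# Example: imitate a recorded (noisy) demonstration of a simple stroke
demo_s = np.linspace(0.0, 1.0, 100)
demo_a = np.sin(2 * np.pi * demo_s) + 0.01 * np.random.randn(100)
theta0 = fit_from_demonstration(demo_s, demo_a)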
Due to its complexity, ball-in-a-cup is a hard motor task even for children, who
usually only succeed after observing another person demonstrate it and after
additional trial-and-error-based learning. Mimicking how children learn
ball-in-a-cup, we first initialize the motor primitives by imitation and subsequently
improve them by reinforcement learning.
We obtained a demonstration for imitation by recording the motions of a human
player performing kinesthetic teach-in as shown in Figure 18.5b. Kinesthetic teach-
in means ‘taking the robot by the hand’, performing the task by moving the robot
while it is in gravity-compensation mode and recording the joint angles, velocities
and accelerations. It requires a back-drivable robot system similar to a human arm.
From the imitation, the number of needed policy parameters can be determined by
cross-validation. As the robot fails to catch the ball with the cup, additional rein-
forcement learning is needed for self-improvement.
A stochastic policy suitable for exploration can be given in the form
a = θ^T μ(s,t) + ε(μ(s,t)). Policy search approaches often focus on state-independent,
white Gaussian exploration, i.e., ε(μ(s,t)) ~ N(0, Σ), which has resulted in
applications such as T-ball batting (Peters and Schaal, 2008c) and constrained
movement (Guenter et al, 2007). However, from our experience, such unstructured
exploration at every step has several disadvantages: (i) it causes a large variance that
grows with the number of time steps, (ii) it perturbs actions too frequently, thus
'washing out' their effects, and (iii) it can damage the system executing the trajectory.
Alternatively, one can generate a form of structured, state-dependent exploration
(Rückstieß et al, 2008), ε(μ(s,t)) = ε_t μ(s,t) with [ε_t]_ij ~ N(0, σ_ij²), where the
σ_ij² are meta-parameters of the exploration that can also be optimized. This choice
results in the policy a ~ π(a_t | s_t, t) = N(a | μ(s,t), Σ̂(s,t)).
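The following small sketch contrasts the two exploration schemes just described: unstructured
Gaussian noise added to the action at every time step versus a structured perturbation that
lives in parameter space and can be held fixed for a whole episode. The toy feature function
and all names are assumptions, not the implementation used in this work.

# Sketch: unstructured action noise vs. structured, state-dependent exploration.
import numpy as np

def mu(s, t, centers=np.linspace(0, 1, 5), width=0.05):
    phi = np.exp(-(s - centers) ** 2 / (2 * width))
    return phi / phi.sum()

theta = np.zeros(5)
Sigma_action = 0.1 ** 2            # variance of unstructured action noise
sigma_param = 0.1 * np.ones(5)     # per-parameter exploration std (meta-parameters)

def act_unstructured(s, t, rng):
    # a = theta^T mu(s,t) + eps, eps ~ N(0, Sigma): fresh noise at every time step
    return theta @ mu(s, t) + rng.normal(0.0, np.sqrt(Sigma_action))

def act_structured(s, t, eps_t):
    # a = (theta + eps_t)^T mu(s,t): the perturbation acts on the parameters and can be
    # kept constant within an episode, avoiding washed-out, high-variance per-step noise
    return (theta + eps_t) @ mu(s, t)

rng = np.random.default_rng(0)
eps_episode = rng.normal(0.0, sigma_param)   # drawn once per episode
a1 = act_unstructured(0.3, 0, rng)
a2 = act_structured(0.3, 0, eps_episode)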
In (Kober and Peters, 2009), we have derived a framework of reward-weighted
imitation. Following (Dayan and Hinton, 1997), we consider the return of an episode
as an improper probability distribution and maximize a lower bound on the logarithm
of the expected return. Depending on how this lower bound is optimized and on the
exploration strategy, the framework yields several well-known policy search
algorithms: episodic REINFORCE (Williams, 1992), the policy gradient theorem
(Sutton et al, 2000), the episodic natural actor-critic (Peters and Schaal, 2008b), a
generalization of reward-weighted regression (Peters and Schaal, 2008a), as well as
our novel Policy learning by Weighting Exploration with the Returns (PoWER)
algorithm. PoWER is an expectation-maximization-inspired algorithm that employs
state-dependent exploration. The update rule is given by

    θ' = θ + E_τ{ Σ_{t=1}^T ε_t Q^π(s_t, a_t, t) } / E_τ{ Σ_{t=1}^T Q^π(s_t, a_t, t) },

where the expectation E_τ is taken over the sampled roll-outs.
To reduce the number of trials in this on-policy scenario, we reuse the trials through
importance sampling (Sutton and Barto, 1998). To avoid the fragility sometimes
resulting from importance sampling in reinforcement learning, samples with very
small importance weights are discarded. This algorithm essentially performs a local
search around the policy learned from prior knowledge.
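A compact sketch of a PoWER-style update for a linear-in-parameters policy with per-episode
parameter exploration is given below; the toy return function, the number of roll-outs, and
all names are illustrative assumptions, not the original implementation. The episode returns
play the role of the weights Q in the update above, and low-weight roll-outs could
additionally be discarded as described in the text.

# Sketch of a PoWER-style update: exploration offsets are averaged with return-based weights.
import numpy as np

rng = np.random.default_rng(0)
n_params, sigma = 5, 0.1
theta = np.zeros(n_params)

def episode_return(params):
    # toy objective standing in for the real roll-out return (kept positive here)
    target = np.array([0.5, -0.2, 0.3, 0.1, -0.4])
    return np.exp(-np.sum((params - target) ** 2))

for iteration in range(100):
    rollouts = []
    for _ in range(10):                          # sample exploratory roll-outs
        eps = rng.normal(0.0, sigma, n_params)   # parameter-space exploration
        rollouts.append((eps, episode_return(theta + eps)))
    # importance-sampling-style reuse could keep the best previous roll-outs here as well
    num = sum(eps * q for eps, q in rollouts)
    den = sum(q for eps, q in rollouts) + 1e-12
    theta = theta + num / den                    # theta' = theta + E[eps Q] / E[Q]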
Fig. 18.6 The expected return of the learned policy in the ball-in-a-cup evaluation, averaged
over 20 runs (average return plotted over the number of episodes)
Figure 18.6 shows the expected return as a function of the number of episodes;
convergence to a maximum is clearly recognizable. The robot regularly succeeds at
bringing the ball into the cup after approximately 75 episodes.
Using a value-function-based approach would require an unrealistic number of
samples to obtain a good estimate of the value function. Greedily searching for an
optimal motor command in such a high-dimensional action space is probably as
hard as finding a locally optimal policy.
We created a simulation of the robot using rigid body dynamics with parameters
estimated from data. The toy is simulated as a pendulum with an elastic string that
switches to a ballistic point mass when the ball is closer to the cup than the string
is long. The spring, damper and restitution constants were tuned to match recorded
data. Even though this simulation matches recorded data very well, policies that get
the ball into the cup in simulation usually miss the cup by several centimeters on the
real system, and vice versa. However, this simulation was very helpful for developing
and tuning the algorithm, as it runs faster than real time and does not require human
supervision or intervention.
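The hybrid nature of the string model described above, a spring-damper coupling while the
string is taut and purely ballistic flight while it is slack, can be sketched as follows. The
constants, state layout, and integration scheme are illustrative assumptions and not the
parameters of the actual simulation.

# Sketch of the hybrid string model: elastic coupling when taut, ballistic when slack.
import numpy as np

STRING_LENGTH, K_SPRING, DAMPING, MASS, G = 0.4, 200.0, 0.5, 0.02, 9.81

def ball_acceleration(ball_pos, ball_vel, cup_pos, cup_vel):
    diff = ball_pos - cup_pos
    dist = np.linalg.norm(diff)
    acc = np.array([0.0, 0.0, -G])                    # gravity always acts
    if dist > STRING_LENGTH:                          # taut: elastic string pulls back
        direction = diff / dist
        stretch = dist - STRING_LENGTH
        rel_vel = np.dot(ball_vel - cup_vel, direction)
        force = -(K_SPRING * stretch + DAMPING * rel_vel) * direction
        acc += force / MASS
    # else: ball closer to the cup than the string length -> slack string, ballistic flight
    return acc

def step(ball_pos, ball_vel, cup_pos, cup_vel, dt=0.002):
    acc = ball_acceleration(ball_pos, ball_vel, cup_pos, cup_vel)
    return ball_pos + dt * ball_vel, ball_vel + dt * acc   # simple Euler integration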
18.8 Conclusion
We have pointed out the inherent challenges of robot reinforcement learning, such as
the high-dimensional continuous state and action spaces, the high cost associated
with trials, the problems of transferring policies learned in simulation to real robots,
and the need for appropriate reward functions. We have discussed how different
robot reinforcement learning approaches are affected by these domain properties,
and we have surveyed different authors' approaches to rendering robot reinforcement
learning tractable through improved representations, the inclusion of prior knowledge,
and the use of simulation. To highlight aspects that we found particularly important,
we presented a case study of how a robot can learn a complex task such as
ball-in-a-cup.
References
Abbeel, P., Quigley, M., Ng, A.Y.: Using inaccurate models in reinforcement learning. In:
International Conference on Machine Learning, ICML (2006)
Abbeel, P., Coates, A., Quigley, M., Ng, A.Y.: An application of reinforcement learning to
aerobatic helicopter flight. In: Advances in Neural Information Processing Systems, NIPS
(2007)
Abbeel, P., Dolgov, D., Ng, A.Y., Thrun, S.: Apprenticeship learning for motion planning
with application to parking lot navigation. In: IEEE/RSJ International Conference on In-
telligent Robots and Systems, IROS (2008)
Argall, B.D., Browning, B., Veloso, M.: Learning robot motion control with demonstration
and advice-operators. In: IEEE/RSJ International Conference on Intelligent Robots and
Systems, IROS (2008)
Argall, B.D., Chernova, S., Veloso, M., Browning, B.: A survey of robot learning from
demonstration. Robotics and Autonomous Systems 57, 469–483 (2009)
Asada, M., Noda, S., Tawaratsumida, S., Hosoda, K.: Purposive behavior acquisition for a
real robot by vision-based reinforcement learning. Machine Learning 23(2-3), 279–303
(1996)
Atkeson, C.G., Moore, A.W., Schaal, S.: Locally weighted learning for control. AI Review 11,
75–113 (1997)
Atkeson, C.G.: Using local trajectory optimizers to speed up global optimization in dynamic
programming. In: Advances in Neural Information Processing Systems, NIPS (1994)
Atkeson, C.G.: Nonparametric model-based reinforcement learning. In: Advances in Neural
Information Processing Systems, NIPS (1998)
Atkeson, C.G., Schaal, S.: Robot learning from demonstration. In: International Conference
on Machine Learning, ICML (1997)
Bagnell, J.A., Schneider, J.C.: Autonomous helicopter control using reinforcement learning
policy search methods. In: IEEE International Conference on Robotics and Automation,
ICRA (2001)
Bakker, B., Zhumatiy, V., Gruener, G., Schmidhuber, J.: A robot that reinforcement-learns
to identify and memorize important previous observations. In: IEEE/RSJ International
Conference on Intelligent Robots and Systems, IROS (2003)
Bakker, B., Zhumatiy, V., Gruener, G., Schmidhuber, J.: Quasi-online reinforcement learning
for robots. In: IEEE International Conference on Robotics and Automation, ICRA (2006)
Barto, A.G., Mahadevan, S.: Recent advances in hierarchical reinforcement learning. Discrete
Event Dynamic Systems 13(4), 341–379 (2003)
Erden, M.S., Leblebicioaglu, K.: Free gait generation with reinforcement learning for a six-
legged robot. Robotics and Autonomous Systems 56(3), 199–212 (2008)
Fagg, A.H., Lotspeich, D.L., Hoff, J., Bekey, G.A.: Rapid reinforcement learning for reactive
control policy design for autonomous robots. In: Artificial Life in Robotics (1998)
Gaskett, C., Fletcher, L., Zelinsky, A.: Reinforcement learning for a vision based mobile
robot. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS
(2000)
Geng, T., Porr, B., Wörgötter, F.: Fast biped walking with a reflexive controller and real-time
policy searching. In: Advances in Neural Information Processing Systems, NIPS (2006)
Glynn, P.: Likelihood ratio gradient estimation: an overview. In: Winter Simulation Confer-
ence, WSC (1987)
Goldberg, D.E.: Genetic algorithms. Addison-Wesley (1989)
Gräve, K., Stückler, J., Behnke, S.: Learning motion skills from expert demonstrations and
own experience using gaussian process regression. In: Joint International Symposium on
Robotics (ISR) and German Conference on Robotics, ROBOTIK (2010)
Guenter, F., Hersch, M., Calinon, S., Billard, A.: Reinforcement learning for imitating con-
strained reaching movements. Advanced Robotics 21(13), 1521–1544 (2007)
Gullapalli, V., Franklin, J., Benbrahim, H.: Acquiring robot skills via reinforcement learning.
IEEE Control Systems Magazine 14(1), 13–24 (1994)
Hafner, R., Riedmiller, M.: Reinforcement learning on an omnidirectional mobile robot. In:
IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS (2003)
Hafner, R., Riedmiller, M.: Neural reinforcement learning controllers for a real robot appli-
cation. In: IEEE International Conference on Robotics and Automation, ICRA (2007)
Hailu, G., Sommer, G.: Integrating symbolic knowledge in reinforcement learning. In: IEEE
International Conference on Systems, Man and Cybernetics (SMC) (1998)
Hester, T., Quinlan, M., Stone, P.: Generalized model learning for reinforcement learning on
a humanoid robot. In: IEEE International Conference on Robotics and Automation, ICRA
(2010)
Huang, X., Weng, J.: Novelty and reinforcement learning in the value system of developmen-
tal robots. In: Lund University Cognitive Studies (2002)
Ijspeert, A.J., Nakanishi, J., Schaal, S.: Learning attractor landscapes for learning motor prim-
itives. In: Advances in Neural Information Processing Systems, NIPS (2003)
Ilg, W., Albiez, J., Jedele, H., Berns, K., Dillmann, R.: Adaptive periodic movement con-
trol for the four legged walking machine BISAM. In: IEEE International Conference on
Robotics and Automation, ICRA (1999)
Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement learning: A survey. Journal of
Artificial Intelligence Research 4, 237–285 (1996)
Kalmár, Z., Szepesvári, C., Lörincz, A.: Modular Reinforcement Learning: An Application to
a Real Robot Task. In: Birk, A., Demiris, J. (eds.) EWLR 1997. LNCS (LNAI), vol. 1545,
pp. 29–45. Springer, Heidelberg (1998)
Kappen, H.: Path integrals and symmetry breaking for optimal control theory. Journal of
Statistical Mechanics: Theory and Experiment 11 (2005)
Katz, D., Pyuro, Y., Brock, O.: Learning to manipulate articulated objects in unstructured en-
vironments using a grounded relational representation. In: Robotics: Science and Systems,
R:SS (2008)
Kimura, H., Yamashita, T., Kobayashi, S.: Reinforcement learning of walking behavior for a
four-legged robot. In: IEEE Conference on Decision and Control (CDC) (2001)
Kirchner, F.: Q-learning of complex behaviours on a six-legged walking machine. In: EU-
ROMICRO Workshop on Advanced Mobile Robots (1997)
Mitsunaga, N., Smith, C., Kanda, T., Ishiguro, H., Hagita, N.: Robot behavior adaptation for
human-robot interaction based on policy gradient reinforcement learning. In: IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS) (2005)
Miyamoto, H., Schaal, S., Gandolfo, F., Gomi, H., Koike, Y., Osu, R., Nakano, E., Wada,
Y., Kawato, M.: A kendama learning robot based on bi-directional theory. Neural Net-
works 9(8), 1281–1302 (1996)
Morimoto, J., Doya, K.: Acquisition of stand-up behavior by a real robot using hierarchical
reinforcement learning. Robotics and Autonomous Systems 36(1), 37–51 (2001)
Nakanishi, J., Cory, R., Mistry, M., Peters, J., Schaal, S.: Operational space control: a theo-
retical and empirical comparison. International Journal of Robotics Research 27, 737–757
(2008)
Nemec, B., Tamošiūnaitė, M., Wörgötter, F., Ude, A.: Task adaptation through exploration
and action sequencing. In: IEEE-RAS International Conference on Humanoid Robots,
Humanoids (2009)
Nemec, B., Zorko, M., Zlajpah, L.: Learning of a ball-in-a-cup playing robot. In: International
Workshop on Robotics in Alpe-Adria-Danube Region (RAAD) (2010)
Ng, A.Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., Berger, E., Liang, E.:
Autonomous inverted helicopter flight via reinforcement learning. In: International Sym-
posium on Experimental Robotics (ISER) (2004a)
Ng, A.Y., Kim, H.J., Jordan, M.I., Sastry, S.: Autonomous helicopter flight via reinforcement
learning. In: Advances in Neural Information Processing Systems (NIPS) (2004b)
Oßwald, S., Hornung, A., Bennewitz, M.: Learning reliable and efficient navigation with a
humanoid. In: IEEE International Conference on Robotics and Automation (ICRA) (2010)
Paletta, L., Fritz, G., Kintzler, F., Irran, J., Dorffner, G.: Perception and Developmental Learn-
ing of Affordances in Autonomous Robots. In: Hertzberg, J., Beetz, M., Englert, R. (eds.)
KI 2007. LNCS (LNAI), vol. 4667, pp. 235–250. Springer, Heidelberg (2007)
Pastor, P., Kalakrishnan, M., Chitta, S., Theodorou, E., Schaal, S.: Skill learning and task
outcome prediction for manipulation. In: IEEE International Conference on Robotics and
Automation (ICRA) (2011)
Pendrith, M.: Reinforcement learning in situated agents: Some theoretical problems and prac-
tical solutions. In: European Workshop on Learning Robots (EWRL) (1999)
Peters, J., Schaal, S.: Learning to control in operational space. International Journal of
Robotics Research 27(2), 197–212 (2008a)
Peters, J., Schaal, S.: Natural actor-critic. Neurocomputing 71(7-9), 1180–1190 (2008b)
Peters, J., Schaal, S.: Reinforcement learning of motor skills with policy gradients. Neural
Networks 21(4), 682–697 (2008c)
Peters, J., Vijayakumar, S., Schaal, S.: Linear quadratic regulation as benchmark for policy
gradient methods. Tech. rep., University of Southern California (2004)
Peters, J., Mülling, K., Altun, Y.: Relative entropy policy search. In: National Conference on
Artificial Intelligence (AAAI) (2010a)
Peters, J., Mülling, K., Kober, J., Nguyen-Tuong, D., Kroemer, O.: Towards motor skill learn-
ing for robotics. In: International Symposium on Robotics Research, ISRR (2010b)
Piater, J., Jodogne, S., Detry, R., Kraft, D., Krüger, N., Kroemer, O., Peters, J.: Learning
visual representations for perception-action systems. International Journal of Robotics
Research Online First (2010)
Platt, R., Grupen, R.A., Fagg, A.H.: Improving grasp skills using schema structured learning.
In: International Conference on Development and Learning (2006)
Åström, K.J., Wittenmark, B.: Adaptive control. Addison-Wesley, Reading (1989)
Riedmiller, M., Gabel, T., Hafner, R., Lange, S.: Reinforcement learning for robot soccer.
Autonomous Robots 27(1), 55–73 (2009)
Rottmann, A., Plagemann, C., Hilgers, P., Burgard, W.: Autonomous blimp control using
model-free reinforcement learning in a continuous state and action space. In: IEEE/RSJ
International Conference on Intelligent Robots and Systems, IROS (2007)
Rückstieß, T., Felder, M., Schmidhuber, J.: State-Dependent Exploration for Policy Gradient
Methods. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part II.
LNCS (LNAI), vol. 5212, pp. 234–249. Springer, Heidelberg (2008)
Sato, M.-A., Nakamura, Y., Ishii, S.: Reinforcement Learning for Biped Locomotion. In:
Dorronsoro, J.R. (ed.) ICANN 2002. LNCS, vol. 2415, pp. 777–782. Springer, Heidelberg
(2002)
Schaal, S.: Learning from demonstration. In: Advances in Neural Information Processing
Systems, NIPS (1997)
Schaal, S., Atkeson, C.G.: Robot juggling: An implementation of memory-based learning.
Control Systems Magazine 14(1), 57–71 (1994)
Schaal, S., Atkeson, C.G., Vijayakumar, S.: Scalable techniques from nonparametric statis-
tics for real-time robot learning. Applied Intelligence 17(1), 49–60 (2002)
Schaal, S., Mohajerian, P., Ijspeert, A.J.: Dynamics systems vs. optimal control - a unifying
view. Progress in Brain Research 165(1), 425–445 (2007)
Smart, W.D., Kaelbling, L.P.: A framework for reinforcement learning on real robots. In:
National Conference on Artificial Intelligence/Innovative Applications of Artificial Intel-
ligence, AAAI/IAAI (1998)
Smart, W.D., Kaelbling, L.P.: Effective reinforcement learning for mobile robots. In: IEEE
International Conference on Robotics and Automation (ICRA) (2002)
Soni, V., Singh, S.: Reinforcement learning of hierarchical skills on the sony aibo robot. In:
International Conference on Development and Learning (ICDL) (2006)
Strens, M., Moore, A.: Direct policy search using paired statistical tests. In: International
Conference on Machine Learning (ICML) (2001)
Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998)
Sutton, R.S.: Integrated architectures for learning, planning, and reacting based on approxi-
mating dynamic programming. In: International Machine Learning Conference (1990)
Sutton, R.S., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for reinforce-
ment learning with function approximation. In: Advances in Neural Information Process-
ing Systems (NIPS) (2000)
Sutton, R.S., Koop, A., Silver, D.: On the role of tracking in stationary environments. In:
International Conference on Machine Learning (ICML) (2007)
Svinin, M.M., Yamada, K., Ueda, K.: Emergent synthesis of motion patterns for locomotion
robots. Artificial Intelligence in Engineering 15(4), 353–363 (2001)
Tamei, T., Shibata, T.: Policy Gradient Learning of Cooperative Interaction with a Robot
Using User’s Biological Signals. In: Köppen, M., Kasabov, N., Coghill, G. (eds.) ICONIP
2008. LNCS, vol. 5507, pp. 1029–1037. Springer, Heidelberg (2009)
Tedrake, R.: Stochastic policy gradient reinforcement learning on a simple 3d biped. In: In-
ternational Conference on Intelligent Robots and Systems (IROS) (2004)
Tedrake, R., Zhang, T.W., Seung, H.S.: Learning to walk in 20 minutes. In: Yale Workshop
on Adaptive and Learning Systems (2005)
Tedrake, R., Manchester, I.R., Tobenkin, M.M., Roberts, J.W.: LQR-trees: Feedback motion
planning via sums of squares verification. International Journal of Robotics Research 29,
1038–1052 (2010)
Theodorou, E.A., Buchli, J., Schaal, S.: Reinforcement learning of motor skills in high di-
mensions: A path integral approach. In: IEEE International Conference on Robotics and
Automation (ICRA) (2010)
Thrun, S.: An approach to learning mobile robot navigation. Robotics and Autonomous
Systems 15, 301–319 (1995)
Tokic, M., Ertel, W., Fessler, J.: The crawler, a class room demonstrator for reinforcement
learning. In: International Florida Artificial Intelligence Research Society Conference
(FLAIRS) (2009)
Toussaint, M., Storkey, A., Harmeling, S.: Expectation-Maximization methods for solving
(PO)MDPs and optimal control problems. In: Inference and Learning in Dynamic Models.
Cambridge University Press (2010)
Touzet, C.: Neural reinforcement learning for behaviour synthesis. Robotics and Autonomous
Systems, Special Issue on Learning Robot: the New Wave 22(3-4), 251–281 (1997)
Uchibe, E., Asada, M., Hosoda, K.: Cooperative behavior acquisition in multi mobile robots
environment by reinforcement learning based on state vector estimation. In: IEEE Inter-
national Conference on Robotics and Automation (ICRA) (1998)
Vlassis, N., Toussaint, M., Kontes, G., Piperidis, S.: Learning model-free robot control by a
Monte Carlo EM algorithm. Autonomous Robots 27(2), 123–130 (2009)
Wang, B., Li, J., Liu, H.: A heuristic reinforcement learning for robot approaching objects.
In: IEEE Conference on Robotics, Automation and Mechatronics (2006)
Willgoss, R.A., Iqbal, J.: Reinforcement learning of behaviors in mobile robots using noisy
infrared sensing. In: Australian Conference on Robotics and Automation (1999)
Williams, R.J.: Simple statistical gradient-following algorithms for connectionist reinforce-
ment learning. Machine Learning 8, 229–256 (1992)
Yasuda, T., Ohkura, K.: A Reinforcement Learning Technique with an Adaptive Action Gen-
erator for a Multi-Robot System. In: Asada, M., Hallam, J.C.T., Meyer, J.-A., Tani, J.
(eds.) SAB 2008. LNCS (LNAI), vol. 5040, pp. 250–259. Springer, Heidelberg (2008)
Youssef, S.M.: Neuro-based learning of mobile robots with evolutionary path planning. In:
ICGST International Conference on Automation, Robotics and Autonomous Systems
(ARAS) (2005)
Part VI
Closing
Chapter 19
Conclusions, Future Directions and Outlook

Marco Wiering
Department of Artificial Intelligence, University of Groningen, The Netherlands
e-mail: [email protected]

Martijn van Otterlo
Radboud University Nijmegen, The Netherlands
e-mail: [email protected]
This book has provided the reader with a thorough description of the field of re-
inforcement learning (RL). In this last chapter we will first discuss what has been
accomplished with this book, followed by a description of topics that were left out,
mainly because they lie outside the main field of RL or because they are small
(possibly novel and emerging) subfields within RL. After looking back at what has
been done in RL and in this book, we take a step toward the future development of
the field, and we end with the opinions of some of the authors on what they think
will become the most important areas of research in RL.
The book aims to describe a very large part of the field of RL. It focuses on all major
directions of RL that were not covered in the important books on RL by Sutton and
Barto (1998) and by Bertsekas and Tsitsiklis (1996). Starting from the most
well-known methods in RL, described in chapter 2, several chapters are devoted
to describing more efficient solution methods than those proposed more than 15
years ago. The large body of work and the number of citations in this book clearly
show the enormous growth of the field since the initial contributions, most
importantly the TD algorithm (Sutton, 1988) and the discovery of the Q-learning
algorithm (Watkins, 1989; Watkins and Dayan, 1992). Many of these newer devel-
opments make better use of an agent’s experiences, such as batch RL (chapter 2),
least squares methods for policy iteration (chapter 3), and the use of models (chapter
4) and knowledge transfer (chapter 5). Chapter 6 analyzes theoretical advantages of
better exploration methods to obtain more important experiences.
In the third part of the book, different uses of a variety of representations in RL
are described. Such representations can be vector-based as in chapter 7, use first-order
logic (chapter 8), make efficient use of hierarchies (chapter 9), or be essentially free
of representation bias, as when evolutionary algorithms are used (chapter 10). When
the right representation is used, learning can be made much more effective, and the
learned policies can sometimes be much more intelligible, as in the case of first-order
logic programs and value functions.
In the fourth part different probabilistic frameworks and algorithms are de-
scribed. In chapter 11 the novel framework of Bayesian RL is described. In chapter
12, partially observable Markov decision processes and efficient solution methods
are covered. Chapter 13 describes predictive state representations, in which agent
histories are compactly described by a set of expectations about the future. In chapter 14,
an extension to multiple agents is given together with game-theoretic notions of
cooperation, coordination, and competition. This chapter is followed by a descrip-
tion of the decentralized partially observable Markov decision process framework
and planning algorithms for solving these hard problems, in chapter 15. The book
ends with background information and the relation to human learning in chapter
16, successful applications of RL for learning games in chapter 17, and using RL
methods for robot control in chapter 18.
RL is a topic within the field of machine learning. Very often unsupervised and su-
pervised machine learning methods are used for applying RL effectively in large
state-action spaces. The goal of this book is to describe recent developments in RL,
and therefore machine learning in general is not well covered. Of course reinforce-
ment learning techniques can be combined with many kinds of regression methods
to learn the value functions from a limited amount of interactions with the environ-
ment. Therefore, new developments in learning regression models are very impor-
tant for the field of RL as well. We refer the reader to machine learning textbooks,
e.g., (Mitchell, 1996; Alpaydin, 2010; Bishop, 2006), for studying the whole field of
machine learning, although most of the existing machine learning
books cover RL in a single chapter only.
Besides being related to machine learning, RL shares its objectives with planning
algorithms and control theory. These fields often use slightly different notation, but
the shared objective is to allow an agent or controller to select actions so that its
design objectives are optimally fulfilled. We cannot cover these fields in addition to
the field of RL, because they have a much longer history and a larger community.
Therefore, we will discuss topics that are not included in this book but are commonly
referred to as RL techniques, even though not all of them need to learn value
functions, as also discussed in the chapter on evolutionary RL (Moriarty and
Miikkulainen, 1996). We will first discuss some areas in RL that were not covered in
this book, although they show interesting new developments in this rich field. After
that, we will discuss some application areas in which RL has been used effectively to
solve difficult problems.
Of course, a book can never describe an entire field, especially not one that has
existed for more than two decades and whose research community is growing every
year. Therefore, the following list can never be complete, but it shows some
interesting ideas that can prove useful for making RL more efficient on particular
problems.
Online planning and roll-out methods. For large state-action spaces it can be
very hard to accurately approximate the value function with a limited amount of
resources. Often the function approximator cannot perfectly fit the target value
function, and this will cause the selection and execution of wrong actions. One
partial remedy for this problem is to use look-ahead planning in order to use more
information in the decision-making process. This technique was, for example, applied
successfully to make TD-Gammon (Tesauro, 1995) even stronger as a
backgammon-playing program. Another interesting technique, which receives a lot
of attention in the community applying artificial and computational intelligence
methods to game-playing programs, is Monte Carlo Tree Search (MCTS). This
method is based on quickly running many simulations to sample the outcomes of
many possible future paths of the agent. MCTS has been very effective for playing
the game of Go (Coulom, 2006). Such methods try to avoid the immediate use of a
learned value function that is not an accurate approximation of the optimal value
function. The idea is that these techniques can be more robust against uncertainties
in the estimates by averaging the outcomes over many simulations that generate
different future behaviors. Another technique that falls into the area of online
planning methods is model-based predictive control (MPC). This technique is well
known in the control theory field and uses a moving horizon window to compute the
return of different limited action sequences. The 'plan' that obtains the most reward
is then used to execute its first action, after which the planning process is repeated.
MPC techniques can be very useful for deterministic tasks, but may have problems
dealing with large uncertainties or stochastic outcomes.
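A minimal sketch of the receding-horizon idea just described is given below, using random
shooting over a toy linear model; the dynamics, cost, horizon, and number of candidate
sequences are illustrative assumptions, not a reference implementation of MPC.

# Sketch of receding-horizon planning (random-shooting MPC) on a toy model.
import random

def model_step(s, a):
    return 0.9 * s + 0.5 * a                 # (approximate) deterministic model

def cost(s, a):
    return s * s + 0.1 * a * a

def plan_first_action(s, horizon=10, n_candidates=200, rng=random.Random(0)):
    best_a0, best_cost = 0.0, float("inf")
    for _ in range(n_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s_sim, total = s, 0.0
        for a in seq:                        # roll the model forward over the window
            total += cost(s_sim, a)
            s_sim = model_step(s_sim, a)
        if total < best_cost:
            best_cost, best_a0 = total, seq[0]
    return best_a0                           # execute only the first action, then replan

s = 1.0
for t in range(50):
    a = plan_first_action(s)
    s = model_step(s, a) + random.gauss(0.0, 0.01)   # the real system is noisier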
Curiosity and internal motivations. Agents can profit from learning a sequence of
subtasks, such as sitting down before starting to use the computer. Learning adaptive
subgoals sometimes relies on identifying bottleneck states through frequently visited
states, but such methods obviously need a large amount of experience.
Internal motivations and the use of creativity to maximize some generated reward
function, either specified a priori or not, can give RL, and possibly model-based
RL in particular, a boost for understanding the world. One development (Schmid-
huber, 1991b,a, 2010) is the use of curiosity and creativity for generating trivial,
novel, and surprising behaviors and data. Such algorithms should have two learning
components: a general reward optimizer or reinforcement learner, and an adaptive
encoder of the agent’s growing data history (the record of the agent’s interaction
with its environment). The learning progress of the encoder is the intrinsic reward
for the reward optimizer. That is, the latter is motivated to invent interesting spatio-
temporal patterns that the encoder does not yet know but can easily learn to encode
better with little computational effort. To maximize expected reward (in the absence
of external reward), the reward optimizer will create more and more complex be-
haviors that yield temporarily surprising (but eventually boring) patterns that make
the encoder quickly improve.
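The following toy sketch illustrates the learning-progress idea in its simplest form: the
intrinsic reward is the improvement of a simple predictor on the data produced by the
agent's own actions, so that learnable structure is preferred over unlearnable noise. The
two-action world and the predictor are assumptions made purely for illustration.

# Sketch of learning-progress-based intrinsic reward.
import random

class Predictor:
    """Per action, learns the mean of the observation that follows it."""
    def __init__(self):
        self.mean = {}
    def error(self, action, obs):
        return (obs - self.mean.get(action, 0.0)) ** 2
    def update(self, action, obs, lr=0.2):
        m = self.mean.get(action, 0.0)
        self.mean[action] = m + lr * (obs - m)

def world(action, rng):
    # action 0 yields learnable structure, action 1 yields pure noise
    return 1.0 + 0.05 * rng.gauss(0, 1) if action == 0 else rng.gauss(0, 1)

rng, predictor = random.Random(0), Predictor()
avg_err, value = {}, {0: 0.0, 1: 0.0}
for t in range(500):
    greedy = max(value, key=value.get)
    action = greedy if rng.random() > 0.1 else rng.choice([0, 1])
    obs = world(action, rng)
    err = predictor.error(action, obs)                  # prediction error on fresh data
    intrinsic_reward = avg_err.get(action, err) - err   # reward = learning progress
    avg_err[action] = 0.9 * avg_err.get(action, err) + 0.1 * err
    predictor.update(action, obs)
    value[action] += 0.05 * (intrinsic_reward - value[action])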
Model-based RL with adaptive state partitioning. Model-based algorithms first
estimate a model based on the experiences of an agent, and then use dynamic pro-
gramming methods to compute a policy. The advantage is that once the transition
model is estimated, it is possible to use new reward functions, e.g. one that gives a
reward of one for entering a goal state and zero elsewhere, and a new policy can be
computed. Such knowledge reuse is often not possible with model-free RL methods.
Although model-based techniques can be very sample-efficient, they have problems
dealing with continuous states and actions, since it is hard to store the transition
probabilities (or probability densities) in this case. One possible direction for coping
with this is to use adaptive state partitioning to cluster similar states into an abstract
state, and to estimate transition probabilities between abstract states given some
discrete action. This approach may lead to models which do not obey the Markov
property anymore, and therefore it can make sense to combine them with memory
of previous states and actions, as is done in several techniques for solving POMDPs
(Ring, 1994; McCallum, 1995).
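A minimal sketch of the tabular version of this idea, estimating a transition and reward
model from counts and then running value iteration on the estimated model, is shown below.
The counts could equally be collected over abstract (clustered) states; the data structures
and function names are assumptions for illustration.

# Sketch of simple model-based RL: estimate a tabular model, then plan on it.
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))    # (s, a) -> {s': visit count}
reward_sum = defaultdict(float)
reward_n = defaultdict(int)

def record(s, a, r, s_next):
    counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    reward_n[(s, a)] += 1

def value_iteration(states, actions, gamma=0.95, sweeps=100):
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            q_values = []
            for a in actions:
                n = sum(counts[(s, a)].values())
                if n == 0:
                    continue                      # no data for this state-action pair yet
                r_hat = reward_sum[(s, a)] / reward_n[(s, a)]
                expected_v = sum(c / n * V[s2] for s2, c in counts[(s, a)].items())
                q_values.append(r_hat + gamma * expected_v)
            if q_values:
                V[s] = max(q_values)
    return V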
Other RL methods and function approximation techniques. A large number of
RL methods have been proposed in the literature; e.g., a single paper (Wiering and
van Hasselt, 2009) proposed five new RL methods. Other recent methods are double
Q-learning (van Hasselt, 2010) and Speedy Q-learning (Azar et al, 2011). Often
these algorithms have some advantages in particular environments, but may per-
form worse in other environments. Since the list of RL algorithms is so large, this
book has not covered all of them. The same holds for RL algorithms in combi-
nation with a variety of function approximators. We did not extensively cover the
use of recurrent neural networks in combination with RL, see e.g. (Schmidhuber,
1990; Bakker and Schmidhuber, 2004), although they can be important tools for
POMDPs or multi-agent environments. In multi-agent RL it can be very benefi-
cial to memorize previous interactions in order to model other agents, and possibly
lead them to behaviors which are advantageous for the agent itself. Other function
approximators, such as support-vector machines (Vapnik, 1995) have been used in
RL (Dietterich and Wang, 2002), but have not received much interest from the RL
community so far.
Universal and theoretically optimal RL. A separate line of research concerns
optimal prediction and universal inductive inference (Solomonoff, 1964, 1978). The
theoretically optimal yet uncomputable RL algorithm Aixi (Hutter, 2004) has found
several implementations now (Poland and Hutter, 2006; Veness et al, 2011), an ex-
ample of which is playing PacMan.
Most of these theoretical algorithms are based on finding the shortest program
that solves a problem. A problem with this is that often many programs need to be
tested, in order of complexity. If a short program is able to compute the optimal
solution, this approach can be very efficient. Some developments in this line of
research have taken place. First, the Speed Prior is derived from the fastest (not the
shortest) way of computing data, and it was shown to allow the derivation of a
computable strategy for optimal prediction of the future given the past (Schmidhuber,
2002). Unlike Levin search, the algorithm called optimal ordered problem solving
(OOPS) does not ignore the huge constants buried in the algorithmic complexity;
OOPS incrementally reduces the constants in front of the O() notation through
experience (Schmidhuber, 2004). Within a few days, OOPS learned a universal solver
for all n-disk Towers of Hanoi problems, solving all instances up to n = 30, where
the shortest solution (not the search for it!) takes more than 10^9 moves. Gödel ma-
chines (Schmidhuber, 2009) are the first class of mathematically rigorous, general,
fully self-referential, self-improving, optimally efficient RL machines or problem
solvers. Inspired by Gödel’s celebrated self-referential formulas (1931), a Gödel
machine rewrites any part of its own code as soon as it has found a proof that the
rewrite is useful.
Although this book has described two application areas, namely robotics and game
playing, there has also been considerable research on using RL to let agents learn
to act in several other interesting areas. Some of these areas are described below.
Economic applications. In economic applications the aim of an agent is to earn
as much money as possible in a reasonable time. One example is the use of RL
agents to trade stocks in stock markets or to buy and sell foreign currencies given
fluctuating exchange rates (Nevmyvaka et al, 2006). In such problems it is very
difficult or even impossible to predict the future. Such techniques often rely on letting
agents learn to optimize their gains given the past dynamics of the prices and then
using the best agent to trade in the near future. Although this approach may work, it
is often very difficult to deal with unpredictable events such as stock market crashes.
Another economic application is to have agents trade objects on the internet or
organize trips for a booking company. Such trading agents may have many benefits,
such as their speed and low cost, compared to human traders and organizers.
Network applications. Many real-world problems can be modelled with networks
where some items are travelling over the edges. One of the first successful
applications of RL for optimizing agents in a network of nodes was routing messages
on the internet (Littman and Boyan, 1993; Di Caro and Dorigo, 1998). In such
networks messages arrive at nodes, and an agent at each node learns the best way to
direct a message toward its goal node. Especially in quite saturated networks, these
adaptive methods are very promising. Similar to routing messages on the internet,
RL agents can be used in smart-grid applications to control the energy flow and
storage on an electricity grid. Another type of network can be found in daily traffic,
where vehicles have particular destinations and, at intersections (the nodes), traffic
lights operate to prevent collisions. In (Wiering, 2000), RL was used to optimize the
agents that control traffic lights in such a network and was shown to outperform a
variety of fixed controllers. As a last application, we mention the use of RL agents to
optimize the use of frequency channels in cognitive radio and mobile telephony,
where the RL agents have the goal of minimizing lost calls that require a particular
channel frequency.
Learning to communicate. RL agents can also be used to select actions related to
natural language. Saying or writing some words can be considered to be a particular
type of action. The difficulty is that a huge number of short natural language
messages are possible due to the huge number of existing words. To deal with this
issue, one interesting paper (Singh et al, 2002) describes the use of a fixed number
of sentences that can be told to people phoning a computer to inquire about partic-
ular upcoming events. Because the computer uses speech recognition to understand
the human user, it cannot always be certain about the information a caller wants to
have. Therefore, the probabilistic framework of MDPs and POMDPs can be fruitful
for such tasks.
Combinatorial optimization problems. The field of operations research and
metaheuristics is very large and aims to solve very complex problems that do not
necessarily require interaction with an agent. Often these problems are formulated
with matrices describing the problem state and a cost function stating which solution
is optimal. One combinatorial optimization method, called ant colony optimization
(Dorigo et al, 1996), uses artificial ants that make local decisions, similar to RL
agents. After each ant has performed a sequence of local decisions, a solution is
obtained. After that, the best sequence of decisions is rewarded with additional
pheromone, which influences the future decisions of all ants. In this way, ant colony
optimization is very similar to RL, and it has been applied successfully to problems
such as the traveling salesman problem (Dorigo and Gambardella, 1997), the
quadratic assignment problem (Gambardella et al, 1999) and internet traffic routing
(Di Caro and Dorigo, 1998). Other multi-agent RL methods have also been used
for combinatorial optimization problems, such as job-shop scheduling and load
balancing (Riedmiller and Riedmiller, 1999).
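The RL-like character of the pheromone update can be illustrated on a tiny traveling salesman
instance as sketched below; the city coordinates, parameters, and the elitist update rule are
assumptions chosen only for illustration.

# Sketch of ant colony optimization on a small TSP: local stochastic decisions plus
# evaporation and reinforcement of the pheromone along the best tour found so far.
import math, random

cities = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 1.5)]
n = len(cities)
dist = [[math.dist(cities[i], cities[j]) for j in range(n)] for i in range(n)]
pheromone = [[1.0] * n for _ in range(n)]
rng = random.Random(0)

def build_tour(alpha=1.0, beta=2.0):
    tour, unvisited = [0], set(range(1, n))
    while unvisited:
        i, options = tour[-1], list(unvisited)
        weights = [pheromone[i][j] ** alpha * (1.0 / dist[i][j]) ** beta for j in options]
        j = rng.choices(options, weights=weights)[0]    # local stochastic decision
        tour.append(j)
        unvisited.remove(j)
    return tour

def tour_length(tour):
    return sum(dist[tour[k]][tour[(k + 1) % n]] for k in range(n))

best = None
for iteration in range(100):
    tours = [build_tour() for _ in range(10)]
    best = min(tours + ([best] if best else []), key=tour_length)
    for row in pheromone:                               # evaporation
        for j in range(n):
            row[j] *= 0.9
    for k in range(n):                                  # reward the edges of the best tour
        i, j = best[k], best[(k + 1) % n]
        pheromone[i][j] += 1.0 / tour_length(best)
        pheromone[j][i] += 1.0 / tour_length(best)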
Experimental testbeds. When researchers develop a new RL method, they often
perform experiments with it to show how well the new algorithm is performing
compared to previous methods. A problem when comparing the results of different
algorithms described in different papers is that the problem parameters used are
often slightly different. E.g., the mountain car problem becomes much simpler to
solve and requires fewer actions to climb the hill if the discrete time step is a bit
larger. For this reason, some researchers have developed RL-Glue (Tanner and
White, 2009), with which competitions have been held between RL algorithms. RL
competitions have been held for challenging domains such as Tetris
ferent methods, and such standardized platforms should help to answer the question:
“which RL method performs best for which environment?”.
RL techniques have a lot of appeal, since they can be applied to many different
problems where an agent is used to perform tasks autonomously. Although RL has
already been used with success for many toy and some real-world problems, a lot of
future work is still needed to apply RL to many problems of interest. This section
will first look at research questions that have not yet been fully answered, and then
examine some applications that currently seem unsolvable with RL methods.
We will first describe research questions in RL that have so far remained
unanswered, although the field of RL could become even more effective for solving
many kinds of problems once the answers are known.
What is the best function approximator? RL algorithms have often been com-
bined with tabular representations for storing the value functions. Although most
of these algorithms have been proven to converge, for scaling up to large or con-
tinuous state-action spaces, function approximators are necessary. For this reason
most RL researchers use generalized linear models, where a fixed set of basis
functions is chosen, or multi-layer perceptrons. The advantage of linear models,
where only the weights between the basis-function activations and the state-value
estimates have to be learned, is that they are fast and allow techniques from linear
algebra to be used to compute the approximations efficiently. A disadvantage is that
this method requires a set of basis functions which has to be chosen a priori.
Furthermore, the basis functions used are often very local, such as fuzzy rules or
radial basis functions, and these tend not to scale up well to problems with very
many input dimensions. Neural networks can learn the hidden features and work
well with logistic activation functions that create more compact hidden
representations than more localized basis functions. Furthermore, they are more
flexible, since they can learn the placement of the basis functions, but training them
with RL often takes a lot of experience.
For representing the value functions, basically all regression algorithms from ma-
chine learning can be used. For example, in (Dietterich and Wang, 2002), support
vector machines were proposed in combination with RL and in (Ernst et al, 2005)
random forests were used. Recently, several feature construction techniques – e.g.,
based on reward-based heuristics or Bellman residuals – have been developed for RL
and dynamic programming, but much work is still needed to find out when and how
to apply them most effectively. In the book of Sutton and Barto (1998), Kanerva
encoding is proposed as a representation that could possibly scale up to many input
dimensions. The question that remains to be answered is which function
approximator works best for which type of problem.
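To make the generalized-linear-model setting concrete, the sketch below performs semi-gradient
TD(0) policy evaluation with a fixed set of radial basis functions; the toy task, the feature
layout, and all parameter values are assumptions made only for illustration.

# Sketch of TD(0) evaluation with a linear function approximator over RBF features.
import numpy as np

centers = np.linspace(-1.0, 1.0, 11)              # fixed, a-priori chosen basis functions
width = 0.05

def phi(s):
    return np.exp(-(s - centers) ** 2 / (2 * width))

def value(w, s):
    return w @ phi(s)

rng = np.random.default_rng(0)
w, alpha, gamma = np.zeros(len(centers)), 0.05, 0.95

def env_step(s):                                  # toy random walk with a simple reward
    s_next = np.clip(0.9 * s + rng.normal(0.0, 0.1), -1.0, 1.0)
    return -abs(s_next), s_next

s = rng.uniform(-1, 1)
for t in range(20000):
    r, s_next = env_step(s)
    td_error = r + gamma * value(w, s_next) - value(w, s)
    w += alpha * td_error * phi(s)                # semi-gradient TD(0) update
    s = s_next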
Convergence proofs for general algorithms. There has been quite some research
to study convergence of RL algorithms in combination with particular function ap-
proximators, e.g., with generalized linear models. Some research in this direction
(Schoknecht, 2002; Maei et al, 2010) has shown that some combinations can be
proven to converge given particular assumptions. Often these proofs do not extend
to the full control case, but only apply to the evaluation case where the value func-
tion of a fixed policy is estimated. We would like to see more proofs that guarantee
the convergence of a wider range of function approximators, along with clear
statements of the conditions under which they hold. Furthermore, the rate of
convergence also plays a very important
role. Such research could help practitioners to choose the best method for their spe-
cific problem.
Optimal exploration. A lot of research has focused on optimal exploration to find
the most important learning experiences. However, it is still an open question which
exploration method works best for particular problems. The problem of choosing
the optimal exploration policy also becomes much harder when function
approximators are used instead of lookup tables. Some research (Nouri and
Littman, 2010) has proposed an efficient exploration method for continuous spaces.
Such problems are very hard, since given a limited training time of the agent, the
agent is often not able to visit all parts of the environment. Therefore trade-offs need
to be made. Creating the most efficient exploration strategy for difficult problems is
still an underdeveloped topic in RL, and therefore most researchers are still using
the simple ε -greedy and Boltzmann exploration for their problems at hand.
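Since ε-greedy and Boltzmann (softmax) exploration are the de-facto defaults mentioned above, a minimal sketch of both action-selection rules is given below; the particular ε, temperature and action values are arbitrary assumptions for the example.

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def boltzmann(q_values, temperature=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    if rng is None:
        rng = np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                      # for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))

q = [0.2, 0.5, 0.1]                           # hypothetical action values
print(epsilon_greedy(q, epsilon=0.1), boltzmann(q, temperature=0.5))

The temperature (and ε) controls how strongly exploration is biased towards actions with high estimated value; neither rule takes the agent's uncertainty into account, which is precisely why more sophisticated exploration strategies remain an active research topic.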
Scaling up issues. RL has already been applied successfully to particular complex
problems involving many input dimensions. The most famous examples are TD-Gammon
(Tesauro, 1995) and elevator dispatching (Crites and Barto, 1996). However, fifteen
years later, still little is known about what is required to let RL scale up to
basically any control problem. It is quite probable that RL will perform very well
on some problems and much worse on others. For example, the success of TD-Gammon
can partially be explained by the smooth value function caused by the stochasticity
of the backgammon game. It is well known in machine learning that approximating
smooth functions is much simpler than approximating strongly fluctuating, or volatile,
functions. It is important to understand more about the difficulties that arise when
RL is applied to complex problems, since this will allow the field to address many
industrial applications for which other methods, e.g. from control theory, are
currently used.
How does the brain do RL? The old brain contains areas that generate emotions
and feelings and is connected to perceptual input. These emotions play a very
important role from the beginning of a child's life and can be translated into RL
terms as providing rewards to the cortex. There are also feedback loops in humans from
the frontal cortex to the old brain, so that a person can think and thereby influence
the way he or she feels. Although a rough analogy can be drawn between RL systems and
the brain, we still do not understand enough about the exact workings of the brain
to program more human-like learning abilities into agents. A lot of research in
neuroscience focuses on the detailed workings of the brain, and measurement devices
are becoming more and more accurate, but we expect that there is still a long way to
go before neuroscientists understand all human learning processes in detail.
We believe that RL can be used for many different control tasks that require an
agent to make decisions in order to manipulate a given environment. Below, we give
some examples of applications that seem too hard for RL to solve.
General intelligence only with RL. General intelligence requires perception, natural
language processing, the execution of complex behaviors, and much more. For these
different tasks, different subfields of artificial intelligence focus on developing
better and better algorithms over time. The field of artificial general intelligence
tries to develop agents that possess all these different skills. It is not advisable
to use only RL for creating such a variety of intelligent skills, although RL could
play a role in each of them. The ability to predict the future is perhaps one of the
most important abilities of intelligent entities. When we see a car driving along a
road, we can predict with high accuracy where it will be a few seconds later. When we
walk down a staircase, we predict when our feet will hit the stairs. RL can be
fruitfully applied to solve prediction problems, and therefore it can play an
important role in the creation of general intelligence, but it cannot do this without
developments in other subfields of artificial intelligence.
Using value-function based RL to implement programs. One control problem that RL
researchers often face is the task of implementing a program involving an RL agent
and an environment. It would be very interesting to study RL methods that can program
themselves, such as the Gödel machine (Schmidhuber, 2009). This could allow RL agents
to program even better RL agents, a process that need never stop. In our view it is
very unlikely that current value-function based RL methods are suited for this. The
reason is that searching for the right program is a needle-in-a-haystack problem:
almost no programs do anything useful, and only a few solve the task. Without being
able to learn from partial rewards, an RL optimization method for creating programs
is unlikely to work. We note that there has been an efficient application of RL to
optimize compilers (McGovern et al, 1999), but that application used a lot of domain
knowledge to make it work well. The use of RL in general programming languages is a
related, and interesting, direction for further research (Simpkins et al, 2008).
This would make it possible to write a program containing all available knowledge
and to let RL optimize all remaining aspects of that program.
We now give some directions that could improve the efficiency of RL. Furthermore,
we describe some application areas that do not yet fully exploit the potential of RL
algorithms.
Better world modelling techniques. Model-based RL can use learning experiences much
more effectively than model-free RL algorithms. Currently, many model-building
techniques exist, such as batch RL (Riedmiller, 2005; Ernst et al, 2005), LSPI (see
chapter 3 in this book), prioritized sweeping (Moore and Atkeson, 1993), Dyna
(Sutton, 1990), and best-match learning (van Seijen et al, 2011). The question is
whether we can use vector quantization techniques to discretize continuous state
spaces and then apply dynamic-programming-like algorithms in promising novel ways.
For example, for a robot navigation task, the well-known SIFT features (Lowe, 2004)
could be used to create a set of visual keywords. Multiple substates would then be
active in the model simultaneously, which makes the planning and representation
problem challenging. It would, for example, be possible to estimate the probability
that visual keywords are activated at the next time step from the currently active
keywords using particular regression algorithms, and to combine the active keywords
to estimate reward and value functions. Such a research direction raises many new
and interesting research questions.
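As a toy illustration of the discretize-then-plan idea – our own construction, not a method from the literature cited here – the sketch below quantizes a one-dimensional continuous state with a fixed codebook, estimates a tabular transition and reward model from random experience, and then plans with value iteration on the learned model. The environment, the codebook and all constants are assumptions made purely for the example; handling SIFT keywords with several simultaneously active substates would require considerably more machinery.

import numpy as np

CODEBOOK = np.linspace(0.0, 1.0, 10)      # vector-quantization prototypes
N, A = len(CODEBOOK), 2                   # actions: 0 = left, 1 = right

def quantize(x):
    """Map a continuous state to the index of its nearest prototype."""
    return int(np.argmin(np.abs(CODEBOOK - x)))

def collect_model(steps=20000, seed=0):
    rng = np.random.default_rng(seed)
    counts = np.zeros((N, A, N))
    rew_sum = np.zeros((N, A))
    x = 0.5
    for _ in range(steps):
        a = int(rng.integers(A))          # purely random exploration
        x_next = np.clip(x + (0.1 if a == 1 else -0.1) + rng.normal(0, 0.02), 0, 1)
        r = 1.0 if x_next > 0.95 else 0.0 # reward near the right edge
        s, s_next = quantize(x), quantize(x_next)
        counts[s, a, s_next] += 1
        rew_sum[s, a] += r
        x = x_next
    visits = counts.sum(axis=2, keepdims=True)
    P = counts / np.maximum(visits, 1)    # transition estimates; unvisited
    R = rew_sum / np.maximum(visits[..., 0], 1)   # (s,a) pairs stay all-zero
    return P, R

def value_iteration(P, R, gamma=0.95, iters=200):
    V = np.zeros(N)
    for _ in range(iters):
        V = np.max(R + gamma * (P @ V), axis=1)   # Bellman optimality backup
    return V

P, R = collect_model()
print(np.round(value_iteration(P, R), 2))

Planning on the learned model reuses every collected transition many times, which is exactly the data-efficiency argument for model-based RL made above.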
Highly parallel RL. With more and more distributed computing facilities becoming
available, it is interesting to investigate RL methods that profit optimally from
recent developments in cloud computing and supercomputing. In the case of linear
models, one could study whether averaging the weights of separately trained models
yields better models than the individual ones. Some research has also focused on
ensemble methods in RL (Wiering and van Hasselt, 2008), and such methods can be
advantageous when distributed computing facilities are used.
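A minimal sketch of the weight-averaging idea, under the assumption that several learners have been trained independently (e.g. on different machines) with the same feature mapping:

import numpy as np

def average_linear_models(weight_vectors):
    """Combine independently trained linear value functions that share one
    feature mapping by taking the element-wise mean of their weights."""
    return np.mean(np.stack(weight_vectors), axis=0)

# Hypothetical weight vectors produced by three parallel workers.
workers = [np.array([0.1, 0.4, 0.2]),
           np.array([0.2, 0.3, 0.3]),
           np.array([0.0, 0.5, 0.1])]
print(average_linear_models(workers))

Whether such an averaged model actually outperforms the separate models is precisely the open question raised above; averaging the predictions of the individual models, as in ensemble methods, is an alternative.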
Combining RL with learning from demonstration. For many applications, controllers or
agents already exist, which allows RL agents to first learn from demonstrated
experiences and then fine-tune themselves to optimize performance even further. In
robotics, learning from demonstration has already been applied successfully, see e.g.
(Coates et al, 2009; Peters et al, 2003; Peters and Schaal, 2008; Kober and Peters,
2011). However, in one application of learning from demonstration to the game of
backgammon, no performance improvement was obtained (Wiering, 2010). A challenging
research direction is to identify novel problems in which learning from demonstration
can be combined very successfully with RL.
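A minimal sketch of the demonstrate-first, fine-tune-later recipe with tabular Q-learning is given below; the toy chain environment, the scripted 'expert' and all constants are assumptions made purely for illustration.

import numpy as np

N_S, N_A, GAMMA, ALPHA = 6, 2, 0.95, 0.1   # toy chain: move left or right

def step(s, a):
    s_next = min(s + 1, N_S - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == N_S - 1 else 0.0)

def q_update(Q, s, a, r, s_next):
    Q[s, a] += ALPHA * (r + GAMMA * Q[s_next].max() - Q[s, a])

Q = np.zeros((N_S, N_A))

# Phase 1: replay demonstrated transitions (a scripted 'expert' that
# always moves right) before any autonomous exploration.
for _ in range(200):
    s = 0
    while s != N_S - 1:
        s_next, r = step(s, 1)
        q_update(Q, s, 1, r, s_next)
        s = s_next

# Phase 2: fine-tune with the agent's own epsilon-greedy experience.
rng = np.random.default_rng(0)
for _ in range(500):
    s = int(rng.integers(N_S - 1))
    a = int(rng.integers(N_A)) if rng.random() < 0.1 else int(Q[s].argmax())
    s_next, r = step(s, a)
    q_update(Q, s, a, r, s_next)

print(np.round(Q, 2))

In this toy case the demonstrations simply give Q-learning a head start; whether the same recipe helps on a given real problem is, as noted above, an empirical question.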
RL for learning to play very complex games. Some very complex games, such as Go,
are currently played best by computers that use Monte-Carlo tree search
(Coulom, 2006). RL methods can hardly be employed well for playing such games,
because it is very difficult to represent and learn the value function accurately;
at present, extensive search and sampling therefore works better than
knowledge-intensive solutions. The search/knowledge tradeoff (Berliner, 1977) states
that more knowledge comes at the expense of less search, since models containing more
parameters take more time to evaluate. It would be very interesting to develop novel
RL methods that learn to approximate the value function of complex games quite
accurately while remaining very fast to evaluate. Some research in this direction has
been performed for the game of chess (Baxter et al, 1997; Veness et al, 2009), where
linear models trained with RL could be used efficiently by the search algorithms.
Novel queueing problems. Many successful RL applications, such as network routing
(Littman and Boyan, 1993), elevator dispatching (Crites and Barto, 1996), and traffic
light control (Wiering, 2000), are closely related to queueing processes, in which
some items have to wait for others. An advantage of such problems is that direct
feedback is often available. It would be interesting to study RL methods for novel
queueing problems, such as train scheduling, where the expected traveling times of
many different individual trips are minimized by multi-agent RL techniques. Many
research disciplines, such as telecommunications and traffic engineering, may profit
from RL methods, which could lead to a broader acceptance of RL as a solution method.
Finally, to conclude the book, we asked the authors to comment on the following
questions:
• In which direction(s) will RL go in the future?
• Which topics in RL do you find most important to understand better and improve
further?
Some of the experts' best educated guesses are given below.
Ashvin Shah: Because of their focus on learning through interaction with the envi-
ronment with limited prior knowledge and guidance, many reinforcement learning
(RL) methods suffer from the curse of dimensionality. The inclusion of abstract rep-
resentations of the environment and hierarchical representations of behavior allows
RL agents to circumvent this curse to some degree. One important research topic
is the autonomous development of useful abstractions and hierarchies so that re-
liance on prior knowledge and guidance can continue to be limited. Inspiration may
be drawn from many sources, including studies on how animals develop and use
such representations, and resulting research can produce more capable, flexible, and
autonomous artificial agents.
Lucian Busoniu: Fueled by advances in approximation-based algorithms, the field
of RL has greatly matured and diversified over the last few years. One issue that
deserves more attention is
representing and exploiting the influence of state variables, features and actions on
each other. This may also lead to better integration with ideas from the multi-agent
planning community, where researchers have been exploiting the limited influence of
'local states' (subsets of features) and actions on each other.
Jan Peters: RL is at a waypoint where it can essentially go in two directions. First,
we could consolidate classical RL and subsequently fall into obscurity. Second, we
could return to the basic questions and solve them with more solid answers, as has
been done in supervised learning, while working on better applications. I hope it will
be the latter. Key steps for future RL will be to look more at the primal problem
of RL (as done, e.g., in (Peters et al, 2010)) and to study how changes in the primal
change the dual. Some RL methods are entirely logical from this perspective – some
will turn out to be theoretically improper. This may include classical assumptions
on the cost functions of value-function approximation, such as TD or Bellman errors,
as already indicated by Schoknecht's work (2002). The second component will be
proper assumptions about the world. Human intelligence is not independent of the
world we live in, and neither will robot RL be. In robot reinforcement learning, I
expect that hybrid discrete-continuous robot RL is about to happen.
Shimon Whiteson: Future research in RL will probably place an increasing em-
phasis on getting humans in the loop. Initial efforts in this direction are already
underway, e.g., the TAMER framework, but much remains to be done to exploit
developments from the field of human-computer interaction, better understand what
kinds of prior knowledge humans can express, and devise methods for learning from
the implicit and explicit feedback they can provide. The field of RL could benefit
from the development of both richer representations for learning and more practical
strategies for exploration. To date, some sophisticated representations such as the
indirect encodings used in evolutionary computation can only be used in policy-
search methods, as algorithms for using them in value-function methods have not
been developed. In addition, even the most efficient strategies for exploration are
much too dangerous for many realistic tasks. An important goal is the development
of practical strategies for safely exploring in tasks with substantial risk of catas-
trophic failure. These two areas are intimately connected, as representations should
be designed to guide exploration (e.g., by estimating their own uncertainties) and
exploration strategies should consider how the samples they acquire will be incor-
porated into the representation.
Acknowledgements. We want to thank Jürgen Schmidhuber for some help with writing this
concluding chapter.
References
Bakker, B., Schmidhuber, J.: Hierarchical reinforcement learning based on subgoal discov-
ery and subpolicy specialization. In: Proceedings of the 8th Conference on Intelligent
Autonomous Systems, IAS-8, pp. 438–445 (2004)
Baxter, J., Tridgell, A., Weaver, L.: KnightCap: A chess program that learns by combining
TD(λ) with minimax search. Tech. rep., Australian National University, Canberra (1997)
Berliner, H.: Experiences in evaluation with BKG - a program that plays backgammon. In:
Proceedings of IJCAI, pp. 428–433 (1977)
Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-dynamic Programming. Athena Scientific, Belmont
(1996)
Bishop, C.: Pattern Recognition and Machine learning. Springer, Heidelberg (2006)
Coates, A., Abbeel, P., Ng, A.: Apprenticeship learning for helicopter control. Commun.
ACM 52(7), 97–105 (2009)
Coulom, R.: Efficient Selectivity and Backup Operators in Monte-carlo Tree Search. In: van
den Herik, H.J., Ciancarini, P., Donkers, H.H.L.M(J.) (eds.) CG 2006. LNCS, vol. 4630,
pp. 72–83. Springer, Heidelberg (2007)
Cramer, N.L.: A representation for the adaptive generation of simple sequential programs. In:
Grefenstette, J. (ed.) Proceedings of an International Conference on Genetic Algorithms
and Their Applications, pp. 183–187 (1985)
Crites, R., Barto, A.: Improving elevator performance using reinforcement learning. In:
Touretzky, D., Mozer, M., Hasselmo, M. (eds.) Advances in Neural Information Process-
ing Systems, Cambridge, MA, vol. 8, pp. 1017–1023 (1996)
Di Caro, G., Dorigo, M.: An adaptive multi-agent routing algorithm inspired by ants behavior.
In: Proceedings of PART 1998 - Fifth Annual Australasian Conference on Parallel and
Real-Time Systems (1998)
Dietterich, T., Wang, X.: Batch value function approximation via support vectors. In: Ad-
vances in Neural Information Processing Systems, vol. 14, pp. 1491–1498 (2002)
Dorigo, M., Gambardella, L.M.: Ant colony system: A cooperative learning approach to the
traveling salesman problem. Evolutionary Computation 1(1), 53–66 (1997)
Dorigo, M., Maniezzo, V., Colorni, A.: The ant system: Optimization by a colony of cooper-
ating agents. IEEE Transactions on Systems, Man, and Cybernetics-Part B 26(1), 29–41
(1996)
Ernst, D., Geurts, P., Wehenkel, L.: Tree-based batch mode reinforcement learning. Journal
of Machine Learning Research 6, 503–556 (2005)
Gambardella, L.M., Taillard, E., Dorigo, M.: Ant colonies for the quadratic assignment prob-
lem. Journal of the Operational Research Society 50, 167–176 (1999)
van Hasselt, H.: Double Q-learning. In: Advances in Neural Information Processing Systems,
vol. 23, pp. 2613–2621 (2010)
Hutter, M.: Universal Artificial Intelligence: Sequential Decisions based on Algorithmic
Probability. Springer, Berlin (2004)
Kober, J., Peters, J.: Policy Search for Motor Primitives in Robotics. Machine Learning 84(1-
2), 171–203 (2011)
Kolmogorov, A.: Three approaches to the quantitative definition of information. Problems of
Information Transmission 1, 1–11 (1965)
Koza, J.R.: Genetic evolution and co-evolution of computer programs. In: Langton, C., Tay-
lor, C., Farmer, J.D., Rasmussen, S. (eds.) Artificial Life II, pp. 313–324. Addison Wesley
Publishing Company (1992)
Koza, J.R.: Genetic Programming II – Automatic Discovery of Reusable Programs. MIT
Press (1994)
Li, M., Vitányi, P.M.B.: An introduction to Kolmogorov complexity and its applications. In:
van Leeuwen, J. (ed.) Handbook of Theoretical Computer Science, pp. 188–254. Elsevier
Science Publishers B.V (1990)
Littman, M., Boyan, J.: A distributed reinforcement learning scheme for network routing.
In: Alspector, J., Goodman, R., Brown, T. (eds.) Proceedings of the First International
Workshop on Applications of Neural Networks to Telecommunication, Hillsdale, New
Jersey, pp. 45–51 (1993)
Lowe, D.: Distinctive image features from scale-invariant keypoints. International Journal of
Computer Vision, 315–333 (2004)
Maei, H., Szepesvari, C., Bhatnagar, S., Sutton, R.: Toward off-policy learning control with
function approximation. In: Proceedings of the International Conference on Machine
Learning, pp. 719–726 (2010)
McCallum, R.A.: Instance-based utile distinctions for reinforcement learning with hidden
state. In: Prieditis, A., Russell, S. (eds.) Machine Learning: Proceedings of the Twelfth
International Conference, pp. 387–395. Morgan Kaufmann Publishers, San Francisco
(1995)
McGovern, A., Moss, E., Barto, A.G.: Scheduling straight-line code using reinforcement
learning and rollouts. In: Proceedings of Neural Information Processing Systems. MIT
Press (1999)
Mitchell, T.M.: Machine learning. McGraw Hill, New York (1996)
Moore, A.W., Atkeson, C.G.: Prioritized sweeping: Reinforcement learning with less data
and less time. Machine Learning 13, 103–130 (1993)
Moriarty, D.E., Miikkulainen, R.: Efficient reinforcement learning through symbiotic evolu-
tion. Machine Learning 22, 11–32 (1996)
Nevmyvaka, Y., Feng, Y., Kearns, M.: Reinforcement learning for optimized trade execution.
In: Proceedings of the 23rd International Conference on Machine Learning, pp. 673–680
(2006)
Nouri, A., Littman, M.: Dimension reduction and its application to model-based exploration
in continuous spaces. Machine Learning 81(1), 85–98 (2010)
van Otterlo, M.: Efficient reinforcement learning using relational aggregation. Proceedings
of the Sixth European Workshop on Reinforcement Learning, EWRL-6 (2003)
Peters, J., Mülling, K., Altun, Y.: Relative entropy policy search. In: Fox, M., Poole, D. (eds.)
Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI
2010 (2010)
Peters, J., Vijayakumar, S., Schaal, S.: Reinforcement Learning for Humanoid Robotics. In:
IEEE-RAS International Conference on Humanoid Robots, Humanoids (2003)
Peters, J., Schaal, S.: Reinforcement Learning of Motor Skills with Policy Gradients. Neural
Networks 21(4), 682–697 (2008), doi:10.1016/j.neunet.2008.02.003
Poland, J., Hutter, M.: Universal learning of repeated matrix games. In: Proc. 15th Annual
Machine Learning Conf. of Belgium and The Netherlands (Benelearn 2006), Ghent, pp.
7–14 (2006)
Riedmiller, M.: Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural
Reinforcement Learning Method. In: Gama, J., Camacho, R., Brazdil, P.B., Jorge, A.M.,
Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 317–328. Springer, Heidel-
berg (2005)
Riedmiller, S., Riedmiller, M.: A neural reinforcement learning approach to learn local dis-
patching policies in production scheduling. In: Proceedings of International Joint Confer-
ence on Artificial Intelligence (IJCAI 1999) (1999)
Ring, M.: Continual learning in reinforcement environments. PhD thesis, University of Texas,
Austin, Texas (1994)
Sałustowicz, R.P., Schmidhuber, J.H.: Probabilistic incremental program evolution. Evolu-
tionary Computation 5(2), 123–141 (1997)
Schmidhuber, J.: The Speed Prior: A New Simplicity Measure Yielding Near-Optimal Com-
putable Predictions. In: Kivinen, J., Sloan, R.H. (eds.) COLT 2002. LNCS (LNAI),
vol. 2375, pp. 216–228. Springer, Heidelberg (2002)
Schmidhuber, J.: Optimal ordered problem solver. Machine Learning 54, 211–254 (2004)
Schmidhuber, J.: Ultimate cognition à la Gödel. Cognitive Computation 1(2), 177–193
(2009)
Schmidhuber, J.: Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE
Transactions on Autonomous Mental Development 2(3), 230–247 (2010)
Schmidhuber, J., Zhao, J., Schraudolph, N.: Reinforcement learning with self-modifying poli-
cies. In: Thrun, S., Pratt, L. (eds.) Learning to Learn, pp. 293–309. Kluwer (1997a)
Schmidhuber, J., Zhao, J., Schraudolph, N.N.: Reinforcement learning with self-modifying
policies. In: Thrun, S., Pratt, L. (eds.) Learning to Learn. Kluwer (1997b)
Schmidhuber, J.H.: Temporal-difference-driven learning in recurrent networks. In: Eckmiller,
R., Hartmann, G., Hauske, G. (eds.) Parallel Processing in Neural Systems and Comput-
ers, pp. 209–212. North-Holland (1990)
Schmidhuber, J.H.: Curious model-building control systems. In: Proceedings of the Inter-
national Joint Conference on Neural Networks, vol. 2, pp. 1458–1463. IEEE, Singapore
(1991a)
Schmidhuber, J.H.: A possibility for implementing curiosity and boredom in model-building
neural controllers. In: Meyer, J.A., Wilson, S.W. (eds.) Proceedings of the International
Conference on Simulation of Adaptive Behavior: From Animals to Animats, pp. 222–227.
MIT Press/Bradford Books (1991b)
Schmidhuber, J.H.: A general method for incremental self-improvement and multi-agent
learning in unrestricted environments. In: Yao, X. (ed.) Evolutionary Computation: The-
ory and Applications. Scientific Publ. Co., Singapore (1996)
Schmidhuber, J.H., Zhao, J., Wiering, M.A.: Shifting inductive bias with success-story algo-
rithm, adaptive Levin search, and incremental self-improvement. Machine Learning 28,
105–130 (1997c)
Schoknecht, R.: Optimality of reinforcement learning algorithms with linear function approx-
imation. In: Becker, S., Thrun, S., Obermayer, K. (eds.) Advances in Neural Information
Processing Systems, NIPS 2002, pp. 1555–1562 (2002)
van Seijen, H., Whiteson, S., van Hasselt, H., Wiering, M.: Exploiting best-match equations
for efficient reinforcement learning. Journal of Machine Learning Research 12, 2045–
2094 (2011)
Simpkins, C., Bhat, S., Isbell Jr., C., Mateas, M.: Towards adaptive programming: integrating
reinforcement learning into a programming language. SIGPLAN Not. 43, 603–614 (2008)
Singh, S., Litman, D., Kearns, M., Walker, M.: Optimizing dialogue management with rein-
forcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence
Research 16, 105–133 (2002)
Smart, W., Kaelbling, L.: Effective reinforcement learning for mobile robots. In: Proceedings
of the IEEE International Conference on Robotics and Automation, pp. 3404–3410 (2002)
Solomonoff, R.: A formal theory of inductive inference. Part I. Information and Control 7,
1–22 (1964)
Solomonoff, R.: Complexity-based induction systems. IEEE Transactions on Information
Theory IT-24(5), 422–432 (1978)
Sutton, R.S.: Learning to predict by the methods of temporal differences. Machine Learn-
ing 3, 9–44 (1988)
Sutton, R.S.: Integrated architectures for learning, planning and reacting based on dynamic
programming. In: Machine Learning: Proceedings of the Seventh International Workshop
(1990)
Sutton, R.S., Precup, D., Singh, S.P.: Between MDPs and semi-MDPs: Learning, planning,
and sequential decision making. Tech. Rep. COINS 89-95, University of Mas-
sachusetts, Amherst (1998)
Tanner, B., White, A.: RL-Glue: Language-independent software for reinforcement-learning
experiments. Journal of Machine Learning Research 10, 2133–2136 (2009)
Tesauro, G.: Temporal difference learning and TD-Gammon. Communications of the
ACM 38, 58–68 (1995)
Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
Veness, J., Silver, D., Uther, W., Blair, A.: Bootstrapping from game tree search. In: Bengio,
Y., Schuurmans, D., Lafferty, J., Williams, C.K.I., Culotta, A. (eds.) Advances in Neural
Information Processing Systems, vol. 22, pp. 1937–1945 (2009)
Veness, J., Ng, K., Hutter, M., Uther, W., Silver, D.: A Monte-carlo AIXI approximation.
Journal of Artificial Intelligence Research (2011)
Watkins, C.J.C.H.: Learning from delayed rewards. PhD thesis, King’s College, Cambridge,
England (1989)
Watkins, C.J.C.H., Dayan, P.: Q-learning. Machine Learning 8, 279–292 (1992)
Westra, J.: Organizing adaptation using agents in serious games. PhD thesis, Utrecht Univer-
sity (2011)
Wiering, M.: Self-play and using an expert to learn to play backgammon with temporal dif-
ference learning. Journal of Intelligent Learning Systems and Applications 2(2), 57–68
(2010)
Wiering, M., van Hasselt, H.: Ensemble algorithms in reinforcement learning. IEEE Transac-
tions on Systems, Man, and Cybernetics, Part B, Special Issue on Adaptive Dynamic
Programming and Reinforcement Learning in Feedback Control (2008)
Wiering, M., van Hasselt, H.: The QV family compared to other reinforcement learning al-
gorithms. In: Proceedings of the IEEE International Symposium on Adaptive Dynamic
Programming and Reinforcement Learning (ADPRL 2009), pp. 101–108 (2009)
Wiering, M.A.: Multi-agent reinforcement learning for traffic light control. In: Langley, P.
(ed.) Proceedings of the Seventeenth International Conference on Machine Learning,
pp. 1151–1158 (2000)
Wiering, M.A., Schmidhuber, J.H.: Solving POMDPs with Levin search and EIRA. In:
Saitta, L. (ed.) Machine Learning: Proceedings of the Thirteenth International Conference,
pp. 534–542. Morgan Kaufmann Publishers, San Francisco (1996)