Explorations in Reinforcement and
Model-based Learning
A thesis submitted by
Anthony J. Prescott
Department of Psychology
University of Sheffield
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
Submitted December 1993
Accepted 1994.
Explorations in Reinforcement and
Model-based Learning
Anthony J. Prescott
Summary
Reinforcement learning concerns the gradual acquisition of associations between
events in the context of specific rewarding outcomes, whereas model-based learning
involves the construction of representations of causal or world knowledge outside the
context of any specific task. This thesis investigates issues in reinforcement learning
concerned with exploration, the adaptive recoding of continuous input spaces, and
learning with partial state information. It also explores the borderline between
reinforcement and model-based learning in the context of the problem of navigation.
A connectionist learning architecture is developed for reinforcement and delayed
reinforcement learning that performs adaptive recoding in tasks defined over
continuous input spaces. This architecture employs networks of Gaussian basis
function units with adaptive receptive fields. Simulation results show that networks
with only a small number of units are capable of learning effective behaviour in real-time control tasks within reasonable time frames.
A tactical/strategic split in navigation skills is proposed and it is argued that tactical,
local navigation can be performed by reactive, task-specific systems. Acquisition of
an adaptive local navigation behaviour is demonstrated within a modular control
architecture for a simulated mobile robot. The delayed reinforcement learning system
for this task acquires successful, often plan-like strategies for control using only
partial state information. The algorithm also demonstrates adaptive exploration using performance-related control over local search.
Finally, it is suggested that strategic, way-finding navigation skills require model-based, task-independent knowledge. A method for constructing spatial models based
on multiple, quantitative local allocentric frames is described and simulated. This
system exploits simple neural network learning, storage and search mechanisms, to
support robust way-finding behaviour without the need to construct a unique global
model of the environment.
Declaration
This thesis has been composed by myself and contains original work of my own
execution. Some of the work reported here has previously been published:
Prescott, A.J. and Mayhew, J.E.W. (1992). Obstacle avoidance through reinforcement learning. In Moody, J.E., Hanson, S.J., and Lippmann, R.P. (eds), Advances in Neural Information Processing Systems 4, Morgan Kaufmann, New York.
Prescott, A.J. and Mayhew, J.E.W. (1992). Adaptive local navigation. In Blake, A. and Yuille, A. (eds), Active Vision, MIT Press, Cambridge MA.
Prescott, A.J. and Mayhew, J.E.W. (1993). Building long-range cognitive maps using local landmarks. In From Animals to Animats: Proceedings of the 2nd International Conference on Simulation of Adaptive Behaviour, MIT Press.
Tony Prescott, 14 December 1993.
We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started
And know the place for the first time.
T.S. Eliot: The Four Quartets.
For my parents— John and Diana
Acknowledgements
I would like to thank the many people who have given me their help, advice and
encouragement in researching and writing this dissertation. I am particularly indebted
to the following.
My supervisor John Mayhew for his depth of insight, forthrightness, and humour. His
ideas have been a constant source of inspiration to me through the years. John Frisby
for his patience, wise counsel, and generous support. John Porrill, Neil Thacker, and
latterly Steve Hippisley-Cox, for guiding my stumbling steps through the often alien
and bewildering realm of mathematics. Pat Langdon for some apposite advice on
programming Lisp and tuning Morris Minor engines. Paul Dean, Pete Redgrave, Pete
Coffey, Rod Nicolson and Mark Blades, for some inspiring conversations about
animal and human intelligence. The remaining members of the AI Vision Research
Unit and the Department of Psychology, past and present, for creating such a pleasant
and rewarding environment in which to work and learn.
I would not have been able to carry out this work without the friends who have given
me their support and companionship since coming to Sheffield. I am particularly
grateful to Phil Whyte, Leila Edwards and especially Sue Keeton for sharing their
lives and homes with me. I am also grateful to Justin Avery for proof-reading parts of
the text.
Finally, I wish to thank the Science and Engineering Research Council and the
University of Sheffield for the financial support I have received while carrying out
this work.
Contents

One        Introduction and Overview
Two        Reinforcement Learning Systems
Three      Exploration
Four       Input Coding for Reinforcement Learning
Five       Experiments in Delayed Reinforcement Learning using Networks of Basis Function Units
Six        Adaptive Local Navigation
Seven      Representations for Way-finding: Topological Models
Eight      Representations for Way-finding: Local Allocentric Frames
Nine       Conclusions and Future Work

Appendices

A          Algorithm and Simulation Details for Chapter Three
B          Algorithm and Simulation Details for Chapter Four
C          Algorithm and Simulation Details for Chapter Five
D          Algorithm and Simulation Details for Chapter Six

Bibliography
Chapter 1
Introduction and Overview
Summary
The ‘explorations’ in this thesis consider learning from a computational perspective.
In other words, it is assumed that both the biological ‘neural nets’ that underlie
adaptive processes in animals and humans, and the electronic circuitry that allows
learning in a robot or a computer simulation, can be considered as implementing
similar types of abstract, information processing operations. A computational
understanding of learning should then give insight into the adaptive capabilities of
both natural and artificial systems.
However, to determine such an understanding in the abstract is clearly a daunting, if not
impossible, task. This will be helped by looking both to natural systems, for
inspiration concerning how effective adaptive mechanisms have evolved and are
organised, and to artificial systems, as vehicles in which to embed and then evaluate
theories of learning. This chapter introduces two key domains in the study of learning
from each of these perspectives; it then sets out the research objectives that will be
pursued in the remainder of the thesis.
Learning in natural systems
Research in psychology suggests that underlying a large number of observable
phenomena of learning and memory, there are two broad clusters of learning
processes.
First, there are the associative learning processes involved in habit formation, the
acquisition of motor skills, and certain forms of classical and instrumental
conditioning. These processes involve incremental adaptation and do not seem to
need awareness. Learning is driven by a significant outcome in the form of a
positively or negatively reinforcing event. Further, it does not seem to require or
involve the acquisition of knowledge about the causal processes underlying the task
that is solved.
Second, there are the learning processes involved in acquiring knowledge about the
relationships between events (stimuli or responses). For instance, that one event
follows another (causal knowledge), or is close to another (spatial knowledge). These
forms of learning appear to have more of an all-or-none character, and may require awareness or involve attentional processes. They are also not directly involved in generating behaviour, and need not be acquired with respect to a specific task or desired outcome. The knowledge acquired can support both further learning and decision-making through inference.
Patterns of learning impairment in human amnesiacs [153, 159, 160] and lesion
studies with animals (e.g. with monkeys [104, 160], and with rats [63, 110]) indicate
that the second style of learning relies on specific medial-temporal structures in the
brain, in particular, the hippocampus. In contrast the simpler associative forms of
learning underlying habit and skill acquisition are not affected by damage to this
brain region, but appear instead to be supported by neural systems that evolved much
earlier. This view is supported by observations that all vertebrates and most
invertebrates show the more ‘primitive’ learning abilities, whereas the more
‘cognitive’ learning styles have evolved primarily in higher vertebrates [62]
coinciding with a massive increase in brain-size1.
A variety of terms have been suggested to capture the qualitative distinctions between
different learning processes, for instance, procedural and declarative [5, 186],
dispositional and representational [32, 110], implicit and explicit [142], and,
incremental and all-or-none [173]. This variation reflects the fact that there may be a
number of important dimensions of difference involved. Here I will adopt the terms
dispositional and representational suggested by Thomas [32] and Morris [110] to
refer to these two clusters of learning processes.
A fuller understanding of learning systems, in which their similarities, differences,
and interactions are better understood, can be gained by realising the mechanisms in
computational models and evaluating them in various task domains. This agenda
describes much of the recent connectionist research in cognitive science and Artificial
Intelligence (AI).
Learning and connectionism
The explosion of research in connectionist modelling in the last ten years has
reawakened interest in associative learning and has motivated researchers to attempt
to construct complex learning systems out of simpler associative components.
Connectionist systems, or artificial neural networks, consist of highly interconnected
networks of simple processing units in which knowledge is stored in the connection
strengths or weights between units. These systems demonstrate remarkable learning
capabilities yet adaptation of the network weights is governed by only a small number
of simple rules. Many of these rules have their origins in psychological learning
theories—the associationist ideas of Locke, James and others, Thorndike’s ‘law of
effect’, Hull’s stimulus-response theory, and the correlation learning principles
proposed by Hebb. Although most contemporary connectionist models use more
sophisticated learning rules and assume network circuitry unlikely to occur in real neural nets, the impression remains of a deep similarity with the adaptive capabilities of biological systems.

1 When body-size is taken into account the brains of higher vertebrates are roughly ten times as large as those of lower vertebrates [68].
Classical connectionist research in the 1960s by Rosenblatt [139] and by Widrow and
Hoff [182] concerned the acquisition of target associative mappings by adjusting a
single-layer of adaptive weights under feedback from a ‘teacher’ with knowledge of
the correct responses. However, researchers have since relaxed many of the
assumptions embodied in these early systems.
First, multi-layer systems have been developed that adaptively recode the input by
incorporating a trainable layer of ‘hidden’ units (e.g. [140]). This development
surmounted what had been a major limitation of early connectionist systems—the
inability of networks with only one adaptive layer to efficiently represent certain
classes of input-output mappings [102].
Second, reinforcement learning systems have been developed that learn appropriate
outputs without guidance from a ‘teacher’ by using environmental feedback in the
form of positively or negatively reinforcing outcomes (e.g. [162]). These systems
have been extended further to allow learning in delayed reward tasks in which
reinforcing outcomes occur only after a sequence of stimuli have been observed and
actions performed. Recent work has also considered reinforcement learning in multilayer systems with adaptive recodings (e.g. [167]).
Finally, model-based associative learning systems have been developed that, rather
than acquiring task knowledge directly, explicitly encode knowledge about causal
processes (forward models) or environment structure (world models) [41, 69, 106,
107, 164, 166]. This knowledge then forms the basis either for task-specific learning
or for decision-making by interpolation, inference, or planning.
It is clear that there are certain parallels between the connectionist learning systems
described so far and the classes of psychological learning processes described above.
In particular, there seems to be a reasonable match between some forms of
reinforcement learning and dispositional learning, and between model-based learning
and certain aspects of representational learning processes. To summarise, the former
pair are both concerned with the gradual acquisition of associations between events in
the context of specific rewarding outcomes. Although these events might individually
be composed of elaborate compound patterns, and the acquired link may involve
recoding processes, the input-output relation is of a simple, reflexive nature. On the
other hand, model-based learning and representational learning, while being
associative in a broad sense (in that they concern the acquisition of knowledge of the
relationships between events), generally involve the construction of representations of
causal or world knowledge to be used by other learning or decision-making
processes. These learning processes may also have other characteristics such as the
involvement of domain-specific learning mechanisms and/or memory structures.
The ‘Animat’ approach to understanding adaptive behaviour
The shared interest in adaptive systems, between psychologists and ethologists, on the
one hand, and Artificial Intelligence researchers and roboticists on the other, has
recently seen the development of a new inter-disciplinary research field. Being
largely uncharted it goes by a variety of titles—‘comparative’ or ‘biomimetic’
cognitive science (Roitblat [138]), ‘computational neuroethology’ (Cliff [35]),
‘behaviour-based’ AI (Maes [91]), ‘animat’ (simulated animal) AI (Wilson [187]) or
‘Nouvelle’ AI (Brooks [24]). The common research aim is to understand how
autonomous agents—animals, simulated animals, robots, or simulated robots—can
survive and adapt in their environments, and be successful in fulfilling needs and
achieving goals. The following seeks to identify some of the key elements of this
approach by citing some of its leading proponents.
Wilson [187] identifies the general methodology of this research programme as
follows:
“The basic strategy of the animat approach is to work towards higher levels of
intelligence from below—using minimal ad hoc machinery. The essential process is
incremental and holistic [...] it is vital (1) to maintain the realism and wholeness of
the environment [...] (2) to maximise physicality in the sensory signals [...] and (3) to
employ adaptive mechanisms maximally, to minimalise the rate of introduction of
new machinery and maximise understanding of adaptation.” ([187] p. 16)
An important theme is that control, in the agent, is not centralised but is distributed
between multiple task-oriented modules—
“The goal is to build complete intelligent systems. To the extent that the system
consists of modules, the modules are organised around activities, such as path-finding, rather than around sensory or representational systems. Each activity is a
complete behaving sub-system, which individually connects perception to action.”
(Roitblat [138] p. 9)
The animat approach therefore seeks minimal reliance on internal world models and
reasoning or planning processes—
“We argue that the traditional idea of building a world model, or a representation of the
state of the world is the wrong idea. Instead the creature [animat] needs to process
only aspects of the world that are relevant to its task. Furthermore, we argue that it
may be better to construct theoretical tools which instead of using the state of the
world as their central formal notion, instead use the aspects that the creature is
sensing as the primary formal notion.” (Brooks [23] p. 436)
It advocates, instead, an emphasis on the role of the agent’s interaction with its
environment in driving the selection and performance of appropriate, generally
reflexive, behaviours—
“Rather than relying on reasoning to intervene between perception and action, we
believe actions derive from very simple sorts of machinery interacting with the
immediate situation. This machinery exploits regularities in its interaction with the
world to engage in complex, apparently planful activity without requiring explicit
models of the world.” (Chapman and Agre [29] p. 1)
“One interesting hypothesis is that the most efficient systems will be those that convert
every frequently encountered important situation to one of ‘virtual stimulus-response’
in which internal state (intention, memory) and sensory stimulus together form a
compound stimulus that immediately implies the correct next intention or external
action. This would be in contrast to a system that often tends to ‘figure out’ or
undertake a chain of step by step reasoning to decide the next action.” (Wilson [187]
p. 19)
Perception too is targeted at acquiring task-relevant information rather than delivering
a general description of the current state of the perceived world—
“The basic idea is that it is unnecessary to equip the animat with a sensory apparatus
capable at all times of detecting and distinguishing between objects in its
environment in order to ensure its adaptive competence. All that is required is that it
be able to register only the features of a few key objects and ignore the rest. Also
those objects should be indexed according to the intrinsic features and properties that
make them significant.” (Meyer and Guillot [98] p. 3).
It is clear, from this brief overview, that the ‘Animat’ approach is in good accord with
reinforcement learning approaches to the adaptation of behavioural competences. In
view of the stated aim of building ‘complete intelligent systems’ in an incremental,
and bottom-up fashion this is wholly consistent with the earlier observation that
learning in simpler animals is principally of a dispositional nature. However, the
development of this research paradigm is already beginning to see the need for some
representational learning. One reason for this is the emphasis on mobile robotics as
the domain of choice for investigating animat AI. The next section contains a
preliminary look at this issue.
Navigation as a forcing domain
The fundamental skill required by a mobile agent is the ability to move around in the
immediate environment quickly and safely; this will be referred to here as local
navigation competence. Research in animat AI has had considerable success in using
pre-wired reactive competences to implement local navigation skills [6, 22, 38, 170].
The robustness, fluency, and responsiveness of these systems have played a
significant role in promoting the animat methodology as a means for constructing
effective, autonomous robots. In this thesis the possibility of acquiring adaptive local
navigation competences through reinforcement learning is investigated and advanced
as an appropriate mechanism for learning or fine-tuning such skills.
However, a second highly valuable aspect of navigation expertise is the ability to find and
follow paths to desired goals outside the current visual scene. This skill will be
referred to here as way-finding. The literature on animal spatial learning differentiates
the way-finding skills of invertebrates and lower vertebrates, from those of higher
vertebrates (birds and mammals). In particular, it suggests that invertebrate navigation
is performed primarily by using path integration mechanisms and compass senses and
secondarily by orienting to specific remembered stimulus patterns (landmarks) [26-28, 178, 179]. This suggests that invertebrates do not construct models of the spatial
layout of their environment and that consequently, their way-finding behaviour is
relatively inflexible and restricted to homing or retracing familiar routes2. In contrast,
higher vertebrates appear to construct and use representations of the spatial relations
between locations in their environments (see, for example, [52, 119, 120, 122]). They
are then able to use these models to select and follow paths to desired goals. This
form of learning is often regarded as the classic example of a representational
learning process (e.g. [152]).
This evidence has clear implications for research in animat AI. First, it suggests that
the current ethos of minimal representation and reactive competence could support
way-finding behaviour similar to that of invertebrates3. Second, however, the
acquisition of more flexible way-finding skills would appear to require model-based
learning abilities; this raises the interesting issue of how control and learning
architectures in animat AI should be developed to meet this need.
Content of the thesis
The above seeks to explain the motivation for the research described in the remaining
chapters. However, although inspired by the desire to understand and explain learning
in natural systems the work to be described primarily concerns learning in artificial
systems. The motivation, like much of the work in connectionism, is to seek to
understand learning systems from a general perspective before attempting to apply
this understanding to the interpretation and modelling of animal or human behaviour.
I have suggested above that much of the learning that occurs in natural systems
clusters into two fundamental classes—dispositional and representational learning. I
have further suggested that these two classes are loosely analogous to reinforcement
learning and model-based learning approaches in connectionism. Finally, I have
proposed that a forcing domain for the development of model-based learning systems
is that of navigation. These ideas form the focus for the work in this thesis.
2 Gould [54] has proposed a contrary view, that insects do construct models of spatial layout; however, the balance of evidence (cited above) appears to be against this position.

3 In particular it should be possible to exploit the good odometry information available to mobile robots.
The first objective, which is the focus of chapters two through five, is to understand reinforcement learning systems. A particular concern is with learning
in continuous state-spaces and with continuous outputs. Many natural learning
problems and most tasks in robot control are of this nature; however, much existing
work in reinforcement learning has concentrated primarily on finite, discrete state and
action spaces. These chapters concentrate on the issues relating to exploration and
adaptive recoding in reinforcement learning. In particular, chapters four and five
propose and evaluate a novel architecture for adaptive coding in which a network of
local expert units with trainable receptive fields is applied to continuous reinforcement learning problems.
A second objective, which is the topic of chapter six, is the consideration of reinforcement learning as a tool for acquiring adaptive local navigation competences.
This chapter also introduces the theme of navigation which is continued through
chapters seven and eight where the possibility of model-based learning systems for way-finding is considered. The focus of these later chapters is on two questions.
First, on whether spatial representations for way-finding should encode topological or
metric knowledge of spatial relations. And second, on whether a global representation
of space is desirable as opposed to multiple local models. Finally, chapter nine seeks
to draw some conclusions from the work described and considers future directions for
research.
A more detailed summary of the contents of each chapter is as follows:
Chapter Two—Reinforcement Learning Systems introduces the study of learning
systems in general and of reinforcement and delayed reinforcement learning systems
in particular. It focuses specifically on learning in continuous state-spaces and on the
Actor/Critic systems that have been proposed for such tasks in which one learning
element (the Actor) learns to control behaviour while the other (the Critic) learns to
predict future rewards. The relationship of delayed reward learning to dynamic
programming is reviewed and the possibility of systems that integrate reinforcement
learning with model-based learning is considered. The chapter concludes by arguing
that, despite the absence of strong theoretical results, reinforcement learning should
be possible in tasks with only partial state information where the strict equivalence
with stochastic dynamic programming does not apply.
Chapter Three—Exploration considers methods for determining effective
exploration behaviour in reinforcement learning systems. This chapter primarily
concerns the indirect effect on exploration of the predictions determined by the critic
system. The analysis given shows that if the initial evaluation is optimistic relative to
available rewards then an effective search of the state-space will arise that may
prevent convergence on sub-optimal behaviours. The chapter concludes with a brief
review of more direct methods for adapting exploration behaviour.
Chapter Four—Input Coding for Reinforcement Learning considers the task of
recoding a continuous input space in a manner that will support successful
reinforcement learning. Three general approaches to this problem are considered:
fixed quantisation methods; unsupervised learning methods for adaptively generating
an input coding; and adaptive methods that modify the input coding according to the
reinforcement received. The advantages and drawbacks of various recoding
techniques are considered and a novel multilayer learning architecture is described in
which a recoding layer of Gaussian basis function (GBF) units with adaptive
receptive fields is trained by generalised gradient descent to maximise the expected
reinforcement. The performance of this algorithm is demonstrated on a simple
immediate reinforcement task.
Chapter Five—Experiments in Delayed Reinforcement Learning Using
Networks of Basis Function Units applies the algorithm developed in the previous
chapter to a delayed reinforcement control task (the pole-balancer) that has often been
used as a test-bed for reinforcement learning systems. The performance of the GBF
algorithm is compared and contrasted with other work, and considered in relation to the
problem of input sampling that arises in real-time control tasks. The interface
between explicit task knowledge and adaptive reinforcement learning is considered,
and it is proposed that the GBF algorithm may be suitable for refining the control
behaviour of a coarsely pre-specified system.
Chapter Six—Adaptive Local Navigation introduces the topic of navigation and
argues for the division of navigation competences between tactical, local navigation
skills that deal with the immediate problems involved in moving efficiently while
avoiding collisions, and strategic, way-finding skills that allow the successful
planning and execution of paths to distant goals. It further argues that local navigation
can be efficiently supported by adaptive dispositional learning processes, while way-finding requires task-independent knowledge of the environment, in other words, it
requires representational, or model-based, learning of the spatial layout of the world.
A modular architecture in the spirit of Animat AI is proposed for the acquisition of
local navigation skills through reinforcement learning. To evaluate this approach a
prototype model of an acquired local navigation competence is described and
successfully tested in a simulation of a mobile robot.
Chapter Seven—Representations for Way-finding: Topological Models. Some
recent research in Artificial Intelligence has favoured spatial representations of a
primarily topological nature over more quantitative models on the grounds that they
are: cheaper and easier to construct, more robust in the face of poor sensor data,
simpler to represent, more economical to store, and also, perhaps, more biologically
plausible. This chapter suggests that it may be possible, given these criteria, to
construct sequential route-like knowledge of the environment, but that to integrate
this information into more powerful layout models or maps may not be straightforward. It is argued that the construction of such models realistically demands the
support of either strong vision capabilities or the ability to detect higher-order
geometric relations. And further, that in the latter case, it seems hard to justify not
using the acquired information to construct models with richer geometric structure
that can provide more effective support to way-finding.
Chapter Eight—Representations for Way-finding: Local Allocentric Frames.
This chapter describes a representation of metric environmental spatial relations with
respect to landmark-based local allocentric frames. The system works by recording in
a relational network of linear units the locations of salient landmarks relative to
barycentric coordinate frames defined by groups of three nearby cues. It is argued
that the robust and economical character of this system makes it a feasible mechanism
for way-finding in large-scale space. The chapter further argues for a heterarchical
view of spatial knowledge for way-finding. It proposes that knowledge should be
constructed in multiple representational ‘schemata’ where different schemata are
distinguished not so much by their geometric content but by their dependence on
different sensory modalities, environmental cues, or computational mechanisms. It
thus argues against storing unified models of space, favouring instead the use of runtime arbitration mechanisms to decide the relative contributions of different local
models in determining appropriate way-finding behaviour.
Chapter Nine: Conclusions and Future Work summarises the findings of the thesis
and considers some areas where further research might be worthwhile.
Chapter Two
Reinforcement Learning Systems
Summary
The purpose of this chapter is to set out the background to the learning systems
described in later parts of the thesis. It therefore consists primarily of a description of reinforcement learning systems and particularly of the actor/critic and temporal difference learning methods developed by Sutton and Barto.
Reinforcement learning systems have been studied since the early days of artificial
intelligence. An extensive review of this research has been provided by Sutton [162].
Williams [184] has also discussed a broad class of reinforcement learning algorithms
viewing them from the perspective of gradient ascent learning and in relation to the
theory of stochastic learning automata. An account of the relationship between
delayed reinforcement learning and the theory of dynamic programming has been
provided by Watkins [177] and is clearly summarised in [14]. The theoretical
understanding of these algorithms has recently seen several further advances [10, 41,
65]. In view of the thoroughness of these existing accounts the scope of the review
given here is limited to what I hope is a sufficient account of the theory of
reinforcement learning to support the work described later.
The structure of this chapter is as follows. The study of learning systems and their
embodiment in neural networks is briefly introduced from the perspective of function
estimation. Reinforcement learning methods are then reviewed and considered as
gradient ascent learning algorithms following the analysis given by Williams [184].
A sub-class of these learning methods comprises the reinforcement comparison algorithms developed by Sutton. Temporal difference methods for learning in delayed
reinforcement tasks are then described within the framework developed by Sutton and
Barto [11, 162, 163] and by Watkins [177]. This section also includes a brief review
of the relationship of TD methods to both supervised learning and dynamic
programming, and describes the actor/critic architecture for learning with delayed
rewards which is studied extensively in later chapters. Finally, a number of proposals
for combining reinforcement learning with model-based learning are reviewed, and
the chapter concludes by considering learning in circumstances where the system has
access to only partial state information.
2.1 Associative Learning Systems
Learning appropriate behaviour for a task can be characterised as forming an
associative memory that retrieves suitable actions in response to stimulus patterns. A
system that includes such a memory and a mechanism by which to improve the stored
associations during interactions with the environment is called an associative learning
system.
The stimulus patterns that provide the inputs to a learning system are measures of
salient aspects of the environment from which suitable outputs (often actions) can be
determined. However, a learning system may also attend to a second class of stimuli
called feedback signals. These signals arise as part of the environment’s response to
the recent actions of the system and provide measures of its performance. In general,
therefore, we are concerned with learning systems, as depicted in Figure 2.1, that
improve their responses to input stimuli under the influence of feedback from the
environment.
Figure 2.1: A learning system viewed as an associative memory. Inputs (stimuli) enter the memory, which produces outputs (actions); the learning mechanism causes associations to be formed in memory in accordance with a feedback signal from the environment (adapted from [162]).
Associative memories are mappings
Mathematically, the behaviour of any system that transforms inputs into outputs—
stimuli into responses—can be characterised as a function f that maps an input
domain X to an output domain Y. Any associative memory is therefore a species of
mapping.
Generally we will be concerned with mappings over input and output domains that are multi-dimensional vector spaces. That is, the input stimulus will be described by a vector4 $x = (x_1, x_2, \ldots, x_N)^T$ whose elements each measure some salient aspect of the current environmental state, and the output will also be a vector $y = (y_1, y_2, \ldots, y_M)^T$ whose elements characterise the system’s response. In order to learn, a system must be able to modify the associations that encode the input-output mapping. These adaptable elements of memory are the parameters of the learning system and can be described by a vector w taken from a domain W. The mapping defined by the memory component of a learning system can therefore be written as the function
$$y = f(w, x), \qquad f: W \times X \to Y.$$
Varieties of learning problem
To improve implies a measure of performance. As suggested above such measures
are generally provided in the form of feedback signals. The nature of the available
feedback can be used to classify different learning problems as supervised,
reinforcement, or unsupervised learning tasks.
In supervised learning feedback plays an instructive role indicating, for any given
input, what the output of the system ought to have been. The environment trains the
learning system by supplying examples of a target mapping
$$y^* = F(x), \qquad F: X \to Y.$$
4 Vectors are normally considered to be column vectors. Superscript T is used to indicate the transpose of a row vector into a column vector or vice versa.
For any input-output pair $(x, y^*)$ a measure of the error in the estimated output y can be
be determined. The task of the learning system is then to adapt the parameters w in a
way that will minimise the total error over all the input patterns in the training set.
Since the goal of learning is for the associative memory to approximate the target
mapping, learning can be viewed as a problem of function approximation.
In contrast, the feedback provided in a reinforcement learning task is of a far less
specific nature. A reinforcement signal is a scalar value judging whether performance
is good or bad but not indicating either the size or direction of the output error. Some
reinforcement problems provide feedback in the form of regular assessments on
sliding scales. In other tasks, however, reinforcement can be both more intermittent
and less informative. At the most ‘minimalist’ end of the spectrum signals can
indicate as little as whether the final outcome following a long sequence of actions
was a success or a failure.
In reinforcement learning the target mapping is any mapping that achieves maximum
positive reinforcement and minimum negative reinforcement. This mapping is
generally not known in advance, indeed, there may not be a unique optimal function.
Learning therefore requires active exploration of alternative input-output mappings.
In this process different outputs (for any given input stimulus) are tried out, the
consequent rewards are observed, and the estimated mapping f is adapted so as to
prefer those outputs that are the most successful.
Finally, in unsupervised learning there is no feedback, indeed, there is no teaching
signal at all other than the input stimulus. Unsupervised training rules are generally
devised, not with the primary goal of estimating a target function, but with the aim of
developing useful or interesting representations of the input domain (for input to other
processes). For example, a system might learn to code the input stimuli in a more
compact form that retains a maximal amount of the information in the original
signals.
Learning architectures
A functional description of an arbitrary learning system was given above as a
mapping from an input domain X to an output domain Y. In order to simplify the account this chapter focuses on mappings for which the input $x \in X$ is multi-valued and the output $y \in Y$ is a scalar. All the learning methods described will, however, generalise in a straightforward way to problems requiring a multi-valued output.
In order to specify an appropriate system, $y = f(w, x)$, for a particular task three principal issues need to be considered. Following Poggio and Girosi [126] these will be referred to as the representation, learning, and implementation problems.
• The representation problem concerns the choice of a suitable form of f (that is, how y depends on w and x) such that a good mapping for the task can be learned. Choosing any particular form of f can enormously constrain the range of mappings that can be approximated (to any degree of accuracy) regardless of how carefully the parameters w are selected.
• The learning problem is concerned with selecting appropriate rules for finding good parameter values with a given choice of f.
• Finally, the implementation problem concerns the choice of an efficient device in which to realise the abstract idea of the learning system (for instance, appropriate hardware).
This chapter is primarily concerned with the learning problem for the class of systems that are based on the linear mapping
$$y = f(w, x) = w^T \phi(x). \qquad (2.1.1)$$
That is, y is chosen to be the product of a parameter vector $w = (w_1, w_2, \ldots, w_P)$ and a recoding vector $\phi(x) \in \Phi$ whose elements $\phi_1(x), \phi_2(x), \ldots, \phi_P(x)$ are basis functions of the original stimulus pattern. In other words, we assume the existence of a recoding function $\phi$ that maps each input pattern to a vector representation in a new domain $\Phi$.
For any desired output mapping over X, an appropriate recoding can be defined that will allow a good approximation to be acquired. Of course, for any specific choice of $\phi$ only the limited class of linear mappings over $\Phi$ can be estimated. The representation problem is not solved, therefore; rather, it is transmuted into the problem of selecting, or learning, a suitable coding for a given task. What equation 2.1.1 does allow, however, is for a clear separation to be made between the choice of $\phi$ and the choice of suitable learning rules for a linear system. As this chapter concentrates on the latter it therefore assumes the existence of an adequate, fixed coding of the input. The recoding problem, which is clearly of critical importance, will be considered later as the main topic of Chapters Four and Five.
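To make the form of equation 2.1.1 concrete, the sketch below (an illustration added here, not part of the original simulations) computes $y = w^T \phi(x)$ over a recoding by Gaussian basis functions of the kind examined in Chapters Four and Five. The centre positions, widths, and weight values are arbitrary assumptions chosen only for the example.

```python
import numpy as np

def gaussian_recoding(x, centres, widths):
    """phi(x): one Gaussian basis function per centre,
    phi_k(x) = exp(-||x - c_k||^2 / (2 * s_k^2))."""
    diffs = x[None, :] - centres                       # (P, N): offset from each centre
    return np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * widths ** 2))

def linear_output(w, phi):
    """y = w^T phi(x), the linear mapping of equation 2.1.1."""
    return float(w @ phi)

# Illustrative (assumed) values: three basis units over a 2-D input space.
centres = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
widths = np.array([0.5, 0.5, 0.5])
w = np.array([0.2, -0.4, 0.7])

x = np.array([0.3, 0.8])
phi = gaussian_recoding(x, centres, widths)
print(phi, linear_output(w, phi))
```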
The process of learning involves a series of changes, or increments, to the parameters of the system. Thus, for instance, the jth update step for the parameter vector w could be written either as
$$w_{j+1} = w_j + \Delta w_j \quad \text{or as} \quad w_k(j+1) = w_k(j) + \Delta w_k(j)$$
giving, respectively, the new value of the vector, or of the kth individual parameter. Since many rules of this type will be considered in this thesis, a more concise notation will be used whereby a rule is expressed in terms of the increment alone, that is, by defining either $\Delta w$ or $\Delta w_k$.
Error minimisation and gradient descent learning
As suggested above, supervised training occurs by providing the learning system with a set of example input/output pairs $(x, y^*)$ of the target function F. This allows the task of the learning system to be defined as a problem of error minimisation. The total error for a given value of w can be written as
$$E(w) = \sum_i \left\| y_i^* - y_i \right\| \qquad (2.1.2)$$
where i indexes over the full training set and $\|\cdot\|$ is a distance measure. An optimal set of parameters $w^*$ is one for which this total error is minimised, i.e. where $E(w^*) \le E(w)$ for every choice of the parameter vector w.
A gradient descent learning algorithm is an incremental method for improving the function approximation provided by the parameter vector w. The error function E(w) can be thought of as defining an error surface over the domain W. (For instance, if the parameter space is two-dimensional then E(w) can be visualised as the height of a surface, above the 2-D plane, at each possible position $(w_1, w_2)$.) Starting from a given position $E(w(0))$ on the error surface, gradient descent involves moving the parameter vector a small distance in the direction of the steepest downward gradient and calculating a new estimate $E(w(1))$ of the total error. This process is then repeated over multiple iterations. On the jth pass through the training set the error
gradient $e(j)$ is given by the partial derivative of the total error with respect to the weights, i.e. by
$$e(j) = -\frac{\partial E(w(j))}{\partial w(j)}. \qquad (2.1.3)$$
This gives the iterative procedure for updating the parameter estimate
$$\Delta w(j) = \alpha\, e(j) \qquad (2.1.4)$$
where $\alpha$ is the step size or learning rate $(0 < \alpha \ll 1)$. When a point is reached where
the error gradient is greater than or equal to zero in all directions the parameters will
have converged to a stable solution.
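As a concrete illustration of equations 2.1.3 and 2.1.4 (not taken from the thesis), the following sketch iterates the update $\Delta w(j) = \alpha\, e(j)$ on a simple quadratic error surface; the surface, starting point, and step size are assumptions made for the example.

```python
import numpy as np

def gradient_descent(grad_E, w0, alpha=0.1, n_steps=100):
    """Iterate Delta w(j) = alpha * e(j), where e(j) = -dE/dw (equations 2.1.3 and 2.1.4)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        e = -grad_E(w)          # error gradient: direction of steepest descent
        w = w + alpha * e       # move a small distance down the error surface
    return w

# Assumed quadratic error surface E(w) = (w1 - 1)^2 + 2*(w2 + 0.5)^2.
grad_E = lambda w: np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])
print(gradient_descent(grad_E, w0=[0.0, 0.0]))  # approaches the minimum at (1, -0.5)
```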
Local minima
The parameter estimate found by a gradient descent learning rule will generally be a
locally minimal and not a globally minimal solution. Without prior knowledge of the
nature of the error surface a global minimum can only be guaranteed by exhaustive
search of the parameter space. For most interesting problems this is ruled out on the
grounds of computational expense. Other methods for finding improved solutions
involve adding noise to the learning system, for instance by perturbing the learning
rule. It has been pointed out (Vogl et al. [176]) that for most practical problems a
solution is required that is not necessarily a true minimum provided it reduces the
error to an extent that satisfies some prescribed criteria. For instance, a solution that
is not minimal but is robust to small changes in the input data might be preferable to
an exact minimum that is sensitive to change. Methods for overcoming the problem
of local minima are considered in this thesis as exploration strategies since effective
exploration of the parameter space promotes the likelihood of finding better solutions.
A gradient descent learning problem can always be restated as a gradient ascent, or hill-climbing, task simply by reversing the sign of the error (that is, the problem is defined as one of climbing the gradient of $-E(w)$). Desirable solutions will then be
local or global maxima rather than minima. Given that there are these two, effectively
interchangeable, ways of describing gradient learning, each algorithm considered here
will be described using the terminology that seems most appropriate for its context.
Least mean squares learning
In choosing a learning rule a common choice for the distance metric is the squared distance between the target and actual outputs. From equation 2.1.2 the total mean squared error in a training set is given by
$$E(w) = \frac{1}{2} \sum_i \left( y_i^* - y_i \right)^2. \qquad (2.1.5)$$
The error gradient for the input pattern $x_i$ and $y_i = w^T \phi(x_i)$ is then given by
$$-\frac{\partial E_i}{\partial w} = (y_i^* - y_i)\,\phi(x_i) \qquad (2.1.6)$$
giving the exact least mean squares learning rule
$$\Delta w = \alpha \sum_i (y_i^* - y_i)\,\phi(x_i). \qquad (2.1.7)$$
The Off-line/On-line distinction
The gradient descent rule described above operates by calculating the adjustment
needed for the entire batch of input patterns before any change in the parameters is
made. However, learning systems are often required to operate locally in time. This
means that any changes in the system relating to a given input must be made at the
time that pattern is presented and not recorded for updating later. To derive an update
rule that is local in time the assumption is made that the error gradient for the current
pattern is an unbiased estimate of the true gradient.
A notational convention is adopted here that variables pertaining to the input pattern at a given point in time are labelled by the time index. For instance, the vector encoding of the input at time t is given by $\phi(t)$ and the desired and actual outputs for that input are given by $y^*(t)$ and $y(t)$. If the input patterns are presented at discrete time intervals t = 0, 1, 2, … then the approximate LMS or ‘Widrow-Hoff’ rule is given by
$$\Delta w(t) = \alpha \left( y^*(t) - y(t) \right) \phi(t). \qquad (2.1.8)$$
If the learning rate $\alpha$ is small enough then the Widrow-Hoff rule is known to converge5 to a local minimum [183].
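For illustration, the following minimal sketch applies the on-line Widrow-Hoff update of equation 2.1.8 to a toy problem with a fixed two-element recoding; the target mapping, noise level, and learning rate are assumptions chosen for the example rather than settings used in the thesis.

```python
import numpy as np

def widrow_hoff_step(w, phi_t, y_target, alpha=0.05):
    """One on-line LMS update: w <- w + alpha * (y* - y) * phi(t)  (equation 2.1.8)."""
    y = w @ phi_t                                  # current estimate y = w^T phi(t)
    return w + alpha * (y_target - y) * phi_t

# Assumed toy task: the target output is y* = 2*phi_1 - phi_2 plus a little noise.
rng = np.random.default_rng(0)
w = np.zeros(2)
for t in range(2000):
    phi_t = rng.uniform(-1.0, 1.0, size=2)         # recoded input at time t
    y_star = 2.0 * phi_t[0] - phi_t[1] + 0.01 * rng.normal()
    w = widrow_hoff_step(w, phi_t, y_star)
print(w)   # approaches [2, -1] when the learning rate is small enough
```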
Neural Networks
Artificial neural networks provide an apposite way of understanding and
implementing learning systems. Widrow and Hoff [182] were pioneers of the network
approach, inventing the ‘Adaline’ neurone as an implementation of the gradient-descent rule. Rosenblatt [139] developed a single-layer network classifier called the
‘Perceptron’. The perceptron architecture and the gradient descent rule were later
generalised to multi-layer networks by Rumelhart, McClelland and Williams [140]
who developed the back-propagation learning algorithm for the ‘multi-layer
perceptron’ (MLP). The approach of treating networks as function approximators has
been taken by many researchers including Moody and Darken [105] and Poggio and
Girosi [127]. Hardware neural networks in which computations are performed in
parallel by a large number of simple processors constitute an effective solution to the
implementation problem in choosing a learning system for a specific task.
Viewed as a neural network the inputs and outputs of the learning system map onto
the activations of units in the input and output layers of the network. The recoding
vector $\phi(t)$ represents activity in a layer of hidden units and the parameter vector w
represents the weights on the connections from this hidden layer to the output units.
This architecture is shown in Figure 2.2 for a network with four input units, four
hidden units and one output unit.
5 The parameters will actually continue to vary around the minimum unless the learning rate is reduced to zero according to an appropriate annealing schedule.
Figure 2.2: Implementation of a learning system as a neural network. The inputs x feed a layer of recoding units through the recoding weights, and the recoding units feed the output unit through the parameter weights w.
2.2 Reinforcement Learning Systems
The basic architecture of an associative reinforcement learning system is illustrated in
Figure 2.3. For any input pattern an action preference is recorded in memory.
However, when the system is learning, the chosen action may differ from the stored
action according to search behaviour determined by the exploration component. The
stimulus-action associations stored in memory are modified by the learning
mechanism according to feedback in the form of scalar reinforcement signals. The
basic learning principle involves strengthening a stimulus-action association if
performing that action is followed by positive reinforcement and weakening the
association if the action is followed by negative reinforcement. This principle is
known to animal learning theorists as the law of effect [168] .
Figure 2.3: A reinforcement learning system. Context (stimuli) enters the memory, the exploration component determines the action actually taken, and the learning mechanism modifies the stored associations according to the reinforcement signal.
The term reinforcement originates from the study of animal learning and of
instrumental conditioning in particular. However, there are a number of different
ways in which this and related terms have been used in the psychological and AI
literature. For clarity, therefore, a brief definition of the terminology (as it is used
here) is appropriate.
Primary reinforcement or reward is feedback, provided by the environment, which
the learning system recognises as a measure of performance. Reinforcement can be
positive (pleasing) or negative (punishing). A stimulus that provides primary
reinforcement is called a primary reinforcer. Stimuli that do not provide such
reinforcement are called context or, simply, input.
An immediate reinforcement task is one in which every context is followed by a
reward. In this chapter the assumption is made that the inputs for such tasks are
sampled from stationary distributions (i.e. ones that do not vary with time) and that
they are independent of the actions of the learning system.
A delayed reinforcement task is one in which rewards occur only after sequences of
context/action pairs. The sequence of contexts generally depends on past contexts and
on the past actions performed by the system. Delayed reinforcement signals serve as a
measure of the overall success of multiple past actions.
In a delayed reinforcement task a context can become a secondary reinforcer if the
system has learnt to associate with it an internal heuristic or secondary reinforcement
signal. One way in which this could happen is if a specific context regularly occurs
just before a second stimulus that is a primary reinforcer. The system might therefore
learn that the first stimulus is a good predictor of the second. A backward chaining
of secondary reinforcement can occur over repeated learning experiences as the
system discovers that certain stimuli can be used to predict other stimuli that are
themselves secondary reinforcers6.
Credit Assignment
A central issue in learning is the problem of credit assignment (Minsky [103]). In
reinforcement learning the feedback signal evaluates the system’s action in an almost
qualitative fashion. Such a signal assigns an amount of credit (or blame) to the action
taken but does not indicate how the action can be improved or which alternate actions
might be better. For multi-valued outputs the feedback also fails to indicate how
credit can be distributed between the various elements of the action vector. These
issues concern problems of structural credit assignment.
In a delayed reinforcement task the learning problem is further compounded. The
delayed reward signal contains no information about the role played by each action in
the preceding sequence in facilitating or hindering that outcome. This question of
how to share credit and blame between a number of past decisions is termed the
problem of temporal credit assignment.
Maximising the expected reinforcement
In reinforcement learning it is not possible to construct a true error function to
minimise since there are no target outputs and the reinforcement signals do not
diminish as the system’s performance improves. However, there is a natural way of
viewing reinforcement learning problems that does lead to a gradient learning rule.
This is to treat the task as a problem in maximising the expected reinforcement signal.
6 This is a phenomenon of classical conditioning known as second-order conditioning.
(The expected value of the reward is used since both the inputs and outputs of the
system and the reinforcement signal provided by the environment may all depend on
random influences.) Williams [184] has demonstrated that there is a class of
reinforcement learning systems that perform stochastic gradient ascent in this
performance measure. In this section Williams’ analysis is considered for the problem
of immediate reinforcement learning.
The expected reinforcement defines a surface over the parameter set w on which the
learning system can do hill-climbing. Unlike supervised learning, however, there are
no target values with which to compare current outputs, so the system cannot directly
determine the local slope of this surface. Instead the system must try to find the
gradient by varying its actions away from those specified by its current parameters.
This exploration behaviour samples the reward surface at different nearby positions
allowing estimates of the local slope to be obtained.
In an immediate reinforcement task the learning system observes the context x, selects and performs an action y, then receives a reward signal r. The expected value of this reward $\mathbb{E}(r)$ depends on the actions performed by the system and may vary for different contexts. The action y depends both on the system’s action preferences, as encoded by the parameter vector w, and on its exploration strategy. Exploration is a stochastic process and y can therefore be regarded as a random variable that takes some value $\xi$ with a probability given by a function
$$g(\xi, w, x) = \Pr\left[\, y = \xi \mid w, x \,\right]. \qquad (2.2.1)$$
If the system is to follow a gradient ascent learning procedure then the system must be able to detect variation in $\mathbb{E}(r)$ for small changes in this action probability function. This means that g must be a continuous function of the parameter vector for which the gradient with respect to the parameter vector w can be determined. This gradient is henceforth termed the eligibility (of the parameters) and is written as $\delta w$. Williams shows that any system that performs exploration in this way can improve its performance by gradient ascent in $\mathbb{E}(r)$ if it uses an update rule of the form
$$\Delta w = \alpha \left[\, r - b \,\right] \delta w \qquad (2.2.2)$$
where $\alpha$ is a learning rate7 $(0 < \alpha \ll 1)$, and b is the reinforcement baseline. The
following two sections discuss in detail the concept of eligibility for different action
selection mechanisms, and the concept of a reinforcement baseline.
Action selection and eligibility
The probability function $g(\xi, w, x)$ can be viewed as the composition of a deterministic function, corresponding to the trainable memory component of the learning system, and a random number generator corresponding to the exploration component. The deterministic component is generally a continuous, non-linear function $\lambda(s)$ of the linear sum $s = w^T \phi(x)$. The term semilinear is often used to denote a composite non-linear function of this type. For instance, if the logistic function
$$\lambda(s) = \frac{1}{1 + e^{-s}} \qquad (2.2.3)$$
is used then the deterministic component is equivalent to the semilinear squashing function used in the multilayer perceptron [140]. As the action selection mechanism in reinforcement learning combines a random number generator with a semilinear function, Williams calls the overall construction a stochastic semilinear unit. Figure 2.4 illustrates this mechanism.
2.4 illustrates this mechanism.
Deterministic
linear non-linear
Context
s
l
Stochastic
g
Action
Figure 2.4: A stochastic semilinear unit for action selection. The deterministic
(semilinear) component of the unit consists of a linear function s and a non-linear
function ! . The stochastic component is a random number generator with the
probability function g .
7 The learning rate is generally constant, though it may vary with i and/or t.
Williams [184] shows that if the eligibility e_w is chosen as the derivative, with
respect to the weights, of ln g then the term (r − b)e_w gives an unbiased estimate of
the gradient in the expected reinforcement.
Hitherto, gradient learning methods have been described solely for linear
approximations. However, provided ln g is differentiable with respect to the linear
approximation s, the gradient with respect to the parameter vector can be found by the
chain rule, hence

e_w = \frac{\partial \ln g}{\partial w} = \frac{\partial \ln g}{\partial \lambda}\,\frac{\partial \lambda}{\partial s}\,\frac{\partial s}{\partial w} = \frac{\partial \ln g}{\partial \lambda}\,\frac{\partial \lambda}{\partial s}\,\phi(x)    (2.2.4)
The following sections discuss specific examples of action selection mechanisms for
tasks in which the required output domain is first, a real number Y = IR, and second,
a forced choice between two alternatives Y = {0, 1}.
Selecting an action from a continuous distribution
For a real-valued output range the action y can be treated as a Gaussian random
variable with the probability function
g(y, \sigma, \mu) = \frac{1}{(2\pi)^{1/2}\sigma}\, \exp\!\left( -\frac{(y - \mu)^2}{2\sigma^2} \right)    (2.2.5)

where the mean action is given by μ = w·φ(x) and the standard deviation σ is fixed.
This action selection mechanism can be described as a Gaussian (linear) unit with
eligibility proportional to
e_w = \frac{\partial \ln g}{\partial \mu}\,\phi(x) = \frac{(y - \mu)}{\sigma^2}\,\phi(x) .    (2.2.6)
A multi-parameter probability function can be defined by treating the standard
deviation as a second variable parameter. This general form of Gaussian action
selection unit is discussed further in the next chapter and employed in the simulations
described in chapter six.
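To make the mechanics of this unit concrete, the following fragment sketches a single update step combining equations (2.2.2), (2.2.5) and (2.2.6). It is only an illustrative sketch, not the learning system used in later chapters: the recoded context, the reward function, the learning rate and the fixed standard deviation are all assumptions invented for the example.

```python
import numpy as np

def gaussian_unit_step(w, phi_x, r_fn, b=0.0, alpha=0.02, sigma=0.5):
    """One immediate-reinforcement update for a Gaussian (linear) action unit.

    w     : parameter vector encoding the mean action, mu = w . phi(x)
    phi_x : recoded context vector phi(x)
    r_fn  : the environment's (possibly stochastic) reward function, r = r_fn(y)
    b     : reinforcement baseline
    """
    mu = np.dot(w, phi_x)                    # mean of the action distribution
    y = np.random.normal(mu, sigma)          # stochastic exploration, eq. (2.2.5)
    r = r_fn(y)                              # reward from the environment
    e_w = ((y - mu) / sigma ** 2) * phi_x    # eligibility, eq. (2.2.6)
    w = w + alpha * (r - b) * e_w            # update rule, eq. (2.2.2)
    return w, y, r

# Example: the reward is highest when the action is close to 2.0.
w = np.zeros(3)
phi = np.array([1.0, 0.5, -0.2])             # a fixed recoded context
for _ in range(3000):
    w, _, _ = gaussian_unit_step(w, phi, lambda a: -(a - 2.0) ** 2)
print("learned mean action:", np.dot(w, phi))
```

Over repeated trials the mean action climbs the local gradient of the expected reward, in this invented example settling near 2.0.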
Binary action selection
If the system has only two possible actions then y can be treated as a Bernoulli
random variable that takes the value one with probability p and the value zero with
probability 1 − p. The probability function g(y, p) is then given by

g(y, p) = \begin{cases} 1 - p & \text{if } y = 0 \\ p & \text{if } y = 1 \end{cases}    (2.2.7)
An action selection mechanism called a Bernoulli logistic unit combines this
probability function with the logistic function (2.2.3). Substituting the derivatives8 of
(2.2.3) and (2.2.7) into (2.2.4) gives the eligibility term

e_w = \frac{y - p}{p(1-p)}\; p(1-p)\, \phi(x) = (y - p)\,\phi(x)    (2.2.8)
which is the same gradient term favoured by Sutton [162] on empirical grounds.
Reinforcement learning systems based on this mechanism are described in chapters
four and five.
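By way of illustration, the fragment below sketches the corresponding update for a Bernoulli logistic unit, combining the logistic function (2.2.3), the eligibility (2.2.8) and the update rule (2.2.2). The two-outcome reward probabilities and all constants are invented for the example and are not taken from the simulations reported later.

```python
import numpy as np

def bernoulli_logistic_step(w, phi_x, r_fn, b=0.0, alpha=0.1):
    """One immediate-reinforcement update for a Bernoulli logistic unit."""
    s = np.dot(w, phi_x)                    # linear sum s = w . phi(x)
    p = 1.0 / (1.0 + np.exp(-s))            # logistic function, eq. (2.2.3)
    y = 1 if np.random.rand() < p else 0    # stochastic binary action
    r = r_fn(y)                             # reward from the environment
    e_w = (y - p) * phi_x                   # eligibility, eq. (2.2.8)
    w = w + alpha * (r - b) * e_w           # update rule, eq. (2.2.2)
    return w, y, r

# Example: action 1 is rewarded with probability 0.8, action 0 with probability 0.2.
w = np.zeros(2)
phi = np.array([1.0, 1.0])                  # a single, fixed recoded context
for _ in range(2000):
    w, _, _ = bernoulli_logistic_step(
        w, phi, lambda y: float(np.random.rand() < (0.8 if y == 1 else 0.2)))
print("probability of choosing action 1:",
      1.0 / (1.0 + np.exp(-np.dot(w, phi))))
```

With a baseline of zero the unit still climbs the expected-reinforcement gradient, though, as discussed below, a well-chosen baseline typically speeds convergence.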
Williams also considers an action selection mechanism used by Barto and co-workers
[9, 11, 12, 162] which he calls a stochastic linear threshold unit
y = \begin{cases} 1 & \text{if } w \cdot \phi(x) + \eta > 0 \\ 0 & \text{otherwise} \end{cases}    (2.2.9)

where η is a random number drawn from a distribution Φ. He shows that this is
equivalent to a Bernoulli semilinear unit with the non-linear function
λ(s) = 1 − Φ(−s), and that therefore the proof of gradient ascent learning applies. A
variant of this mechanism is employed in the learning system described in the next
chapter.
8 \frac{\partial \ln g}{\partial p} = \begin{cases} -\frac{1}{1-p} & \text{if } y = 0 \\ \frac{1}{p} & \text{if } y = 1 \end{cases} = \frac{y - p}{p(1-p)} , \qquad \frac{\partial \lambda}{\partial s} = \frac{\partial}{\partial s}\left( \frac{1}{1 + e^{-s}} \right) = \frac{e^{-s}}{(1 + e^{-s})^2} = p(1 - p) .
Reinforcement Comparison
For the proof of gradient ascent learning to be valid the reinforcement baseline b
must be independent of the action y . It may, however, be a function of the context
and/or of past reward signals. Sutton [162] investigated several learning rules of the
form (2.2.2) prior to the analysis given by Williams. In an empirical study over a
range of tasks he determined that a good choice for the reinforcement baseline is a
prediction of the expected reinforcement. Sutton reasoned that the difference between
a prediction of reward, determined before an action is taken, and the actual reward,
received following that action, should help the learning system to decide whether the
selected action was a good or a bad choice. If the predicted reward for context x is
given by V then a reinforcement comparison rule for updating w is given by
\Delta w \propto [\, r - V \,]\, e_w .    (2.2.10)
Sutton suggests that Δw can be viewed as the correlation of the variation in the
expected reward (the reinforcement comparison) and the variation in the mean action
(the eligibility).
One of the interesting facts revealed by Williams’ analysis was that the choice of
reinforcement baseline does not affect the gradient ascent property of the algorithm.
However, Sutton’s experiments clearly demonstrated that the baseline value does
have a significant influence on learning, and in particular, that convergence is faster
when the predicted reinforcement is used than when the baseline is zero. Dayan [41]
has suggested that the comparison term has a second order effect on learning. He
suggests selecting a baseline value that minimises the second order term of the Taylor
expansion of IE (r) on the grounds that this should give smoother progress up the
(first order) gradient9.
9 Dayan found that the optimal value for b from this perspective is slightly different from the predicted reinforcement. The term he derived gave a small improvement in convergence rate on a number of binary action tasks compared with Sutton's comparison term. However, the calculation of the new term is more complicated and it is not appropriate for problems with delayed reinforcement.
Learning with a critic
For problems in which the maximum expected reinforcement varies between contexts
(termed reinforcement-level asymmetry) Sutton found that the most effective learning
was achieved by making the prediction term a function of the context input.
However, this prediction function also has to be learned. Sutton therefore proposed a
learning architecture composed of two separate learning sub-systems: a critic element
that learns to predict future reinforcement and an actor or performance element that
learns appropriate actions. The sole purpose of the critic is to filter the rewards from
the environment and generate better credit assignment information for the
performance element. This actor/critic10 architecture is illustrated in figure 2.5.
[Figure 2.5 schematic: a Learning System comprising a Critic Element (prediction memory and learning mechanism, receiving the Context and Reinforcement and producing Internal Feedback) and an Actor/Performance Element (action memory, action learning mechanism and exploration component, producing the Action).]
Figure 2.5: An actor/critic learning system.
The critic can be viewed as another function approximator V(v, x): IR^N → IR with
adaptable parameters v = (v_1, v_2, …, v_M) for which the target function is the expected
10 Sutton also uses the term 'Adaptive Heuristic Critic' to describe learning systems of this sort.
reward IE (r) . In other words, over the set of input patterns, the goal of the critic
function is to minimise the error
E(v) = \tfrac{1}{2} \sum_i \left( \mathrm{IE}(r_i) - V_i \right)^2 .    (2.2.11)
The gradient descent error is therefore

-\frac{\partial E(v)}{\partial V} = r - V    (2.2.12)

which is the same as the reinforcement comparison term in (2.2.10). If V is given by
the linear sum v·φ(x) then the parameters can be updated by a Widrow-Hoff learning
rule

\Delta v = \beta\, (r - V)\, \phi(x)    (2.2.13)

where β (0 < β ≪ 1) is the learning rate.
The actor/critic architecture can thus be seen as two concurrent and interacting learning
processes. The actor performs gradient ascent in the expected reinforcement while the
critic continuously adapts (by a gradient descent rule) to try and predict this
reinforcement accurately. The critic generates a reinforcement comparison signal
which it uses to train itself and to provide the improved feedback signal for the
performance element.
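A minimal sketch of this arrangement for immediate reinforcement is given below, assuming a Bernoulli logistic actor, a linear critic and a small two-context task; the names and parameter values are illustrative assumptions rather than the architecture used in later chapters.

```python
import numpy as np

def actor_critic_step(w, v, phi_x, r_fn, alpha=0.1, beta=0.2):
    """One step of actor/critic learning with reinforcement comparison.

    w : actor parameters (action preferences)
    v : critic parameters (reward predictions)
    """
    V = np.dot(v, phi_x)                         # critic's prediction of reward
    p = 1.0 / (1.0 + np.exp(-np.dot(w, phi_x)))  # Bernoulli logistic actor
    y = 1 if np.random.rand() < p else 0
    r = r_fn(y)
    delta = r - V                                # reinforcement comparison, eq. (2.2.12)
    v = v + beta * delta * phi_x                 # Widrow-Hoff critic update, eq. (2.2.13)
    w = w + alpha * delta * (y - p) * phi_x      # actor update, eq. (2.2.10)
    return w, v

# Two orthogonally coded contexts with different best actions and reward levels.
contexts = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
p_reward = {0: (0.2, 0.9), 1: (0.8, 0.1)}        # (P(r=1) for y=0, for y=1) per context
w, v = np.zeros(2), np.zeros(2)
for t in range(4000):
    c = t % 2
    w, v = actor_critic_step(w, v, contexts[c],
                             lambda y, c=c: float(np.random.rand() < p_reward[c][y]))
print("critic predictions per context:", v)
```

Because the baseline is itself a learned function of the context, the comparison term remains informative even though the attainable reward differs between the two contexts.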
2.3 Temporal Difference learning
Sutton and Barto [11, 162] extended the concept of reinforcement comparison to
provide a solution to the temporal credit assignment problem in delayed
reinforcement learning. The resulting algorithms are called temporal difference (TD)
learning methods.
The following review of TD learning is split into three sections. The first describes
TD methods for learning predictions. The simplifying assumption is made that the
action preferences of the performance element are fixed. The aim is to show how,
under such circumstances, the critic can learn to accurately predict future returns.
The derivation of TD learning rules follows the analysis given by Watkins [177]. The
second section is intended to give some useful insight into how TD learning works. It
reviews the relationship of TD to the Widrow-Hoff learning rule and to dynamic
programming methods and also summarises what is known about the convergence
properties of TD algorithms. The final section describes the actor/critic method for
learning with delayed rewards in which the performance and critic elements adapt
concurrently.
Although there have been several applications to problems with real-valued inputs,
TD algorithms have generally been analysed in the context of tasks with discrete
input spaces containing a finite number of possible contexts. The following analysis
continues with the vector notation of the previous section so as to allow both discrete
and continuous input spaces to be represented. For a discrete problem space we can
assume that the set of contexts is encoded by a set of mutually orthogonal11 recoding
vectors; an encoding of this type is assumed throughout the following section.
Predicting delayed rewards
Before describing TD learning procedures it is necessary to outline some of the
terminology that applies to sequential decision tasks. It is also important to consider
exactly what function of future rewards the learning system should seek to maximise.
After dealing with these preliminaries the basic TD predictor is outlined and then
generalised to give a family of TD learning methods.
Sequential decision tasks
In most delayed reinforcement tasks there is a dynamic interaction between the
environment and the learning system. The future reinforcement and the future context
are dependent both on past contexts and on the actions of the system. In sequential
decision tasks of this nature the concept of state is useful for describing the condition
of the environment at any given time. A state description contains sufficient
information such that, when combined with knowledge of future actions, all aspects
11 A simple way to achieve this is to make the number of elements in φ equal to the total number of contexts. The basis vector for the ith context is then given by a vector with a non-zero element only at the ith position.
of future environmental states can be determined12. In other words, given a state
description, an appropriate action can be selected without knowing anything of the
history of past states. This is the ‘path independence’ property of the state description
and is also known as the Markov property.
The set of all possible states for a task is called the state space. The transition
function for a state space S is the set of probabilities p_ij (i, j ∈ S) giving the likelihood
that any one state will follow another. A state space together with a transition
function defines a Markov decision process. Figure 2.6 illustrates an example of a
Markov process of six discrete states with two possible transitions in each state.
Figure 2.6: A Markov decision process.
In general, for any given state, different transition probabilities will be associated
with the different actions available in that state. A function that maps states to
preferred actions (or action probabilities) is called a policy and denoted by the symbol
π. A second function that maps states to the expected future rewards for a given
policy is denoted V^π and called the evaluation function.
A common assumption in delayed reinforcement learning is that the task constitutes a
Markov decision process in which the set of possible contexts corresponds to the state
space for the task. The behaviour acquired by the actor system therefore constitutes
the policy function, and predictions of the critic system the evaluation function.
12 If the environment is stochastic, then the state description is such that the probabilities of all future aspects of the environment can be determined.
Time horizons and discounting
In tasks with delayed rewards actions must be selected to maximise the reinforcement
received at future points in time where the intervening delay can be of arbitrary
length. The extent to which any sequence of actions is optimal can therefore only be
determined with respect to some prior assumptions about the relative values of
immediate and long-term rewards. Such assumptions are made concrete by the
concept of a time horizon. For example, a system that has an infinite time horizon
values all rewards equally no matter how far they lie ahead. A system with a finite
time horizon, on the other hand, values rewards only up to some fixed delay limit. To
allow lower values to be attached to distant rewards without introducing a fixed cutoff it is usual to use an infinite discounted time horizon where the value of each future
reward is given by an exponentially decreasing function of the length of the delay
period13. In TD learning the slope of the discounted time horizon is specified by a
constant denoted by γ (0 < γ < 1) and known as the discount factor. Assuming a
sequence t=1,2,3,... of equally spaced discrete time-steps the total discounted return
R(t) , at a given time t, is then given by the sum of discounted future rewards
R(t) = r(t+1) + \gamma\, r(t+2) + \gamma^2 r(t+3) + \gamma^3 r(t+4) + \dots

which can be written as

R(t) = \sum_{k=1}^{\infty} \gamma^{k-1}\, r(t+k) .    (2.3.1)
The goal of the critic in delayed reward tasks is to learn to anticipate the expected
value of this return. In other words, the prediction V(t) should be an estimate of
\mathrm{IE}(R) = \mathrm{IE}\!\left[\, \sum_{k=1}^{\infty} \gamma^{k-1}\, r(t+k) \,\right] .    (2.3.2)
13 In some tasks, particularly those with deterministic transition and reward functions, an infinite time horizon clearly is desirable. However, a discounted return measure is still needed for most reinforcement learning algorithms as this gives the sum of future returns a finite value. A discussion of this issue and a proposal for a reinforcement learning method that allows an infinite time horizon is given in Schwartz [148].
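As a small numerical illustration of (2.3.1), the following lines compute a truncated discounted return for an arbitrary reward sequence; the sequence and the discount factor are invented for the example.

```python
def discounted_return(rewards, gamma):
    """R(t) = sum_k gamma^(k-1) * r(t+k), truncated to the rewards supplied."""
    return sum(gamma ** (k - 1) * r for k, r in enumerate(rewards, start=1))

# Rewards received on the next five steps, discount factor 0.9:
print(discounted_return([0, 0, 1, 0, 1], gamma=0.9))   # 0.81 + 0.6561 = 1.4661
```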
The TD(0) learning rule
One way of estimating the expected return would be to average values of the
truncated return after n steps

\sum_{k=1}^{n} \gamma^{k-1}\, r(t+k) .    (2.3.3)
In other words, the system could wait n steps, calculate this sum of discounted
rewards, and then use it as a target value for computing a gradient descent error term.
However, for any value of n, there will be an error in this estimate equal to the
prospective rewards (for as yet unexperienced time-steps)
\sum_{k=n+1}^{\infty} \gamma^{k-1}\, r(t+k) .
The central idea of TD learning is to notice that this error can be reduced by making
use of the predicted return V(t + n) associated with the context input at time t + n .
Combining the truncated return (2.3.3) with this prediction gives an estimate called
the corrected n-step return R_n(t)

R_n(t) = \sum_{k=1}^{n} \gamma^{k-1}\, r(t+k) + \gamma^{n}\, V(t+n) .    (2.3.4)
Of course, at the start of training the predictions generated by the critic will be poor
estimates of the unseen rewards. However, Watkins [177] has shown that the
expected value of the corrected return will on average be a better estimate of R(t)
than the current prediction at all stages of learning. This is called the error reduction
property of Rn (t) . Because of this useful property, estimates of Rn (t) are suitable
targets for training the predictor.
The estimator used in the temporal difference method which Sutton calls TD(0) is the
one-step corrected return
R_1(t) = r(t+1) + \gamma\, V(t+1)    (2.3.5)

which leads to a gradient descent error term called the TD error

e_{TD}(t+1) = [\, r(t+1) + \gamma\, V(t+1) \,] - V(t) .    (2.3.6)
Substituting this error into the critic learning rule (2.2.13) gives the update equation
\Delta v = \beta\, e_{TD}(t+1)\, \phi(t) .    (2.3.7)
The learning process occurs as follows. The system observes the current context, as
encoded by the vector φ(t), and calculates the prediction V(t). It then performs the
action associated with this context. The environment changes as a result of the action,
generating a new context φ(t+1) and a reinforcement signal r(t+1) (which may be
zero). The system then calculates the new prediction V(t+1), establishes the error in
the first prediction, and updates the parameter vector v appropriately.
To prevent changes in the weights made after each step from biasing the TD error it is
preferable to calculate the predictions V(t) and V(t + 1) using the same version of the
parameter vector. To make this point clear, the notation V(a|b) is introduced to
indicate the prediction computed with the parameter vector at time t = a for the
context at time t = b. The TD error term is then written as
e_{TD}(t+1) = r(t+1) + \gamma\, V(t\,|\,t+1) - V(t\,|\,t) .    (2.3.8)
The TD(0) learning rule carries the expectation of reward one interval back in time
thus allowing for the backward chaining of secondary reinforcement. For example,
consider a task in which the learning system experiences, over repeated trials, the
same sequence of context inputs (with no rewards attached) followed by a fixed
reward signal. On the first trial the system will learn that the final pattern predicts the
primary reinforcement. On the second trial it will learn that the penultimate pattern
predicts the secondary reinforcement associated with the final pattern. In general, on
the kth trial, the context that is seen k steps before the reward will start to predict the
primary reinforcement.
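The procedure just described can be written compactly. The sketch below shows TD(0) prediction learning for a task whose contexts are given orthogonal one-of-N codings, so that the parameter vector v behaves as a table of predictions; the environment interface, learning rate and discount factor are assumptions made for the example, and the policy is held fixed, as in the text.

```python
import numpy as np

def td0_episode(v, start, step_fn, gamma=0.9, beta=0.1):
    """One episode of TD(0) prediction learning with one-of-N context codings.

    v       : prediction parameters, one per context
    step_fn : maps a context index to (next context index or None, reward)
    """
    c = start
    while c is not None:
        c_next, r = step_fn(c)               # act under the fixed policy, observe r(t+1)
        v_next = 0.0 if c_next is None else v[c_next]
        e_td = r + gamma * v_next - v[c]     # TD error, eq. (2.3.6)
        v[c] += beta * e_td                  # update, eq. (2.3.7); phi(t) is one-of-N
        c = c_next
    return v
```

Called repeatedly over successive trials, this rule carries the expectation of reward backwards one context per trial, as described above.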
The family of TD(λ) learning methods
The TD(0) learning rule will eventually carry the expectation of a reward signal back
along a chain of stimuli of arbitrary length. The question that arises, however, is
whether it is possible to propagate the expectation at a faster rate. Sutton suggested
that this can be achieved by using the TD error to update the predictions associated with
a sequence of past contexts, where the update size for each context is weighted
according to recency. A learning rule [163] that incorporates this heuristic is

\Delta v(t) = \beta\, e_{TD}(t+1) \sum_{k=1}^{t} \lambda^{t-k}\, \phi(k)    (2.3.9)
where λ (0 ≤ λ ≤ 1) is a decay parameter that causes an exponential fall-off in the
update size as the time interval between context and reward lengthens. One of the
advantages of this rule is that the sum on the right hand side can be computed
recursively using an activity trace vector φ̄(t) given by

\bar{\phi}(t) = \phi(t) + \lambda\, \bar{\phi}(t-1)    (2.3.10)

where φ̄(0) = 0 (the zero vector). This gives the TD(λ) update rule

\Delta v(t) = \beta\, e_{TD}(t+1)\, \bar{\phi}(t) .    (2.3.11)
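In code, the recursion makes the rule cheap to apply at each step. The fragment below sketches how (2.3.10) and (2.3.11) interleave within a single time-step, with the surrounding trial loop omitted; the variable names and constants are illustrative assumptions, and a comment notes where the Watkins derivation discussed next would differ.

```python
import numpy as np

def td_lambda_step(v, phi_bar, phi_t, r_next, phi_next,
                   gamma=0.9, lam=0.7, beta=0.1):
    """One TD(lambda) prediction update using the activity trace.

    v        : prediction parameters
    phi_bar  : activity trace from the previous step, phi_bar(t-1)
    phi_t    : recoded context at time t
    r_next   : reward r(t+1)
    phi_next : recoded context at time t+1 (zero vector if the trial ends)
    """
    phi_bar = phi_t + lam * phi_bar                                   # activity trace, eq. (2.3.10)
    e_td = r_next + gamma * np.dot(v, phi_next) - np.dot(v, phi_t)    # TD error
    v = v + beta * e_td * phi_bar                                     # TD(lambda) update, eq. (2.3.11)
    # (Watkins' derivation would decay the trace by gamma * lam rather than lam.)
    return v, phi_bar
```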
Watkins shows an alternative method for deriving this learning rule. Instead of using
just the R_1(t) estimate, a weighted sum of different n-step corrected returns can be
used to estimate the expected return. This is appropriate because such a sum also has
the error reduction property14. The TD(λ) return R^λ(t) is defined to be such a sum in
which the weight for each R_n(t) is proportional to λ^n, such that

R^{\lambda}(t) = (1 - \lambda)\left[\, R_1(t) + \lambda R_2(t) + \lambda^2 R_3(t) + \dots \,\right] .
Watkins shows that this can be rewritten as the recursive expression
R^{\lambda}(t) = r(t+1) + \gamma (1 - \lambda)\, V(t+1) + \gamma \lambda\, R^{\lambda}(t+1) .    (2.3.12)
Using this R^λ(t) estimator, the gradient descent error for the context at time t is given
by R^λ(t) − V(t), for which a good approximation15 is the discounted sum of future
TD errors

e_{TD}(t+1) + \gamma\lambda\, e_{TD}(t+2) + (\gamma\lambda)^2\, e_{TD}(t+3) + \dots
14 Provided the weight on each of the corrected returns is between 0 and 1 and the sum of the weights is unity (see Watkins [177]).
15 If learning occurs off-line (i.e. after all context and reinforcement inputs have been seen) then R^λ(t) − V(t) can be given exactly as a discounted sum of TD errors. Otherwise, changes in the parameter vector over successive time-steps will bias the approximation by an amount equal to

\sum_{k=1}^{\infty} (\gamma\lambda)^{k} \left[\, V(t+k-1\,|\,t+k) - V(t+k\,|\,t+k) \,\right] ,

i.e. the discounted sum of the differences in the prediction of each state visited for successive parameter vectors. This sum will be small if the learning rate is not too large.
From this a rule for updating all past contexts can be shown to be

\Delta v(t) = \beta\, e_{TD}(t+1) \sum_{k=1}^{t} (\gamma\lambda)^{t-k}\, \phi(k)

which is identical to Sutton's update rule but for the substitution of the decay rate γλ
for λ.
Understanding TD learning
This section reviews some of the findings concerning the behaviour of TD methods
and their relationship to other ways of learning to predict. The aim is to establish a
clearer understanding of what these algorithms actually do.
TD(λ) and the Widrow-Hoff rule
Consider the definition of the TD(λ) return (2.3.12). If λ is made equal to one it can
easily be seen that R^1(t) is the same as the actual return (2.3.1). Sutton [163] shows
that the TD(1) learning rule is in fact equivalent16 to a Widrow-Hoff rule of the form

\Delta v(t) = \beta \left( R(t) - V(t) \right) \phi(t) .
An important question is therefore whether TD methods are anything other than just
an elegant, incremental method of implementing this supervised learning procedure.
To address this issue Sutton carried out an experiment using a simple delayed
reinforcement task called a bounded random walk. In this task there are only seven
states, A-B-C-D-E-F-G, two of which, A and G, are boundary states. In the non-boundary
states there is a 50% chance of moving to the right or to the left along the
chain. All states are encoded by orthogonal vectors. A sequence starts in a (random)
non-boundary state and terminates when either A or G is reached. A reward of 1 is
given at G and zero at A. The ideal prediction for any state is therefore just the
probability of terminating at G.
16 This is strictly true only if TD(1) learning occurs off-line, i.e. if the parameters are updated at the end of each trial rather than after each step.
[Figure 2.7 schematic: the seven states A to G in a row, with terminal rewards 0 at A and 1 at G, and ideal predictions 1/6, 1/3, 1/2, 2/3 and 5/6 for the non-terminal states B to F.]
Figure 2.7: Sutton’s random walk task. The numbers indicate the rewards available
in the terminal states (A and G) and the ideal predictions in the non-terminal states (B
to F).
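A compact simulation in the spirit of this task, though not a reproduction of Sutton's experiment, can be written in a few lines; the number of training sequences and the learning rate are arbitrary choices.

```python
import numpy as np

def random_walk_sequence():
    """One bounded random walk over B..F (coded 0..4), absorbed at A or G."""
    s, seq = np.random.randint(5), []
    while True:
        seq.append(s)
        s += np.random.choice([-1, 1])
        if s < 0:
            return seq, 0.0          # absorbed at A, reward 0
        if s > 4:
            return seq, 1.0          # absorbed at G, reward 1

v = np.full(5, 0.5)                  # initial predictions for states B..F
beta, gamma = 0.05, 1.0              # no discounting in this illustration
for _ in range(5000):
    seq, outcome = random_walk_sequence()
    for i, s in enumerate(seq):
        last = (i == len(seq) - 1)
        r = outcome if last else 0.0
        v_next = 0.0 if last else v[seq[i + 1]]
        v[s] += beta * (r + gamma * v_next - v[s])   # TD(0) update
print(np.round(v, 2))                # approaches the ideal values 1/6 ... 5/6
```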
Sutton generated a hundred sets of ten randomly generated sequences. TD(λ) learning
procedures for seven values of λ, including 0 and 1 (the Widrow-Hoff rule), were then
trained repeatedly17 on each set until the parameters converged to a stable solution.
Sutton then measured the total mean squared error between the predictions for each
state generated by the learned parameters and the ideal predictions. Significant
variation in the size of this error was found. The total error in the predictions was
lowest for λ=0, largest for λ=1, and increased monotonically between the two values.
To understand this result it is important to note that each training run used only a
small set of data. The supervised training rule minimises the mean squared error over
the training set, but as Sutton points out, this is not necessarily the best way to
minimise the error for future experience. In fact, Sutton was able to show that the
predictions learned using TD(0) are optimal in a different sense in that they maximise
the likelihood of correctly estimating the expected reward. He interpreted this finding
in the following way:
“...our real goal is to match the expected value of the subsequent outcome, not the
actual outcome occurring in the training set. TD methods can perform better than
supervised learning methods [on delayed reward problems] because the actual
outcome of a sequence is often not the best estimate of its expected value.” ([163]
p.33)
Sutton performed a second experiment with the bounded random walk task looking at
the speed at which good predictions were learned. If each training set was presented
just once to each learning method then the best choice for λ, in terms of reducing the
error most rapidly, was an intermediate value of around 0.3. Watkins also considers
17 The best value of the learning rate β was found for each value of λ in order to make a fair comparison between the different methods. The weights were updated off-line.
this question and points out that the choice of λ is a trade-off between using biased
estimates (λ=0) and estimates with high variance (λ=1). He suggests that if the
current predictions are nearly optimal then the variance of the estimated returns will
be lowest for λ=0 and that therefore the predictor should be trained using TD(0).
However, if the predictions are currently poor approximations then the corrections
added to the immediate reinforcement signals will be very inaccurate and introduce
considerable bias. The best approach overall might therefore be to start with λ=1,
giving unbiased estimates but with a high variance, then reduce λ towards zero as the
predictions become more accurate.
TD and dynamic programming
A second way to understand TD learning is in relation to the Dynamic Programming
methods for determining optimal control actions. ‘Heuristic’ methods of dynamic
programming were first proposed by Werbos [180]. However, Watkins [177] has
investigated the connection most thoroughly showing that TD methods can be
understood as incremental approximations to dynamic programming procedures. This
approach to studying actor/critic learning systems has also been taken by Williams
[185].
Dynamic Programming (the term was first introduced by Bellman [15]) is a search
method for finding a suitable policy for a Markov decision process. A policy is
optimal if the action chosen in every state maximises the expected return as defined
above (2.3.2). To compute this optimal control requires accurate models of both the
transition function and the reward function (which gives the value of the expected
reward that will be received in any state). Given these prerequisites dynamic
programming proceeds through an iterative, exhaustive search to calculate the
maximum expected return, or optimal evaluation, for each state. Once this optimal
evaluation function is known an optimal policy is easily found by selecting in each
state the action that leads to the highest expected return in the next state.
A significant disadvantage of dynamic programming is that it requires accurate
models of the transition and reward functions. Watkins [177] has shown that TD
algorithms can be considered as incremental forms of dynamic programming that
require no advanced or explicit knowledge of state transitions or of the distribution of
available rewards18. Instead, the learning system uses its ongoing experience as a
substitute for accurate models of these functions.
His analysis of dynamic programming led Watkins to propose a learning method
called ‘Q learning’ that arises directly from viewing the TD procedure as incremental
dynamic programming. In Q learning a prediction is associated with each of the
different actions available in a given state. While exploring the state space the system
improves the prediction for each state/action pair using a gradient learning rule. This
learning method does away with the need for explicitly learning the policy: the
preferred action in any state is simply the one with the highest associated value, so
as the system improves its predictions it also adapts its policy. If each
action in each state is attempted a sufficient number of times then Q learning will
eventually converge to an optimal set of evaluations. A family of Q(λ) learning
algorithms that use activity traces similar to those given for TD(λ) can also be
defined.
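A tabular sketch of the one-step Q learning update is given below for a discrete, episodic task; the epsilon-greedy exploration rule and the constants are additions made for the example rather than part of Watkins' specification.

```python
import numpy as np

def q_learning_episode(Q, start, step_fn, n_actions,
                       gamma=0.9, beta=0.1, epsilon=0.1):
    """One episode of one-step Q learning on a discrete, episodic task.

    Q       : dict mapping (state, action) -> estimated return
    step_fn : maps (state, action) to (next_state or None, reward)
    """
    s = start
    while s is not None:
        # Exploration: mostly greedy, occasionally a random action.
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)
        else:
            a = max(range(n_actions), key=lambda b: Q.get((s, b), 0.0))
        s_next, r = step_fn(s, a)
        best_next = 0.0 if s_next is None else max(
            Q.get((s_next, b), 0.0) for b in range(n_actions))
        # Move Q(s, a) towards the one-step corrected return r + gamma * max_b Q(s', b).
        q_sa = Q.get((s, a), 0.0)
        Q[(s, a)] = q_sa + beta * (r + gamma * best_next - q_sa)
        s = s_next
    return Q
```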
Convergence properties of TD methods
There are now several results showing that TD(λ) and Q(λ) learning algorithms will
converge [10, 41, 65, 163, 177], many of them based on an underlying identity with
stochastic dynamic programming. The latest proofs demonstrate convergence to the
ideal values with probability one in both batch and on-line training of both types of
algorithm. These proofs generally assume tasks that, like the bounded random walk,
are guaranteed to terminate, have one-to-one mappings from discrete states to
contexts, fixed transition probabilities, and encode different contexts using orthogonal
vectors.
Actor/Critic architectures for delayed rewards
The actor/critic learning methods developed by Barto and Sutton and described in
section 2.2 can also be applied to learning in tasks with delayed rewards [11, 162].
The separation of action learning from prediction learning has several useful
18 Watkins uses the term 'primitive' learning to describe learning of this sort, likening it to what I have called dispositional learning in animals.
consequences, although analysing the behaviour of the system is more difficult. One
important difference is that problems can be addressed in which actions are real-valued
(Q learning is restricted to tasks with discrete action spaces). A second
advantage arises in learning problems with continuous input spaces. Here the optimal
policy and evaluation functions may have quite different non-linearities with respect
to the input. Separating the two functions into distinct learning systems can therefore
allow appropriate recodings to be developed for each. The benefit of this is shown
clearly in the simulations described in chapter five.
The training rules for actor/critic learning in delayed reward tasks are based on the
rules described above for immediate reinforcement problems (section 2.2). With
delayed rewards the goal of the system is to maximise the expected return IE (R) . The
critic element is therefore trained to predict IE(R) using the TD(λ) procedure, while
the performance element is trained to maximise IE(R) using a variant of the
reinforcement comparison learning rule (equation 2.2.10). This gives the update Δw
for the parameters of the actor learning system

\Delta w(t) \propto [\, r(t+1) + \gamma\, V(t+1) - V(t) \,]\, \bar{e}_w(t) .    (2.3.13)
Here ē_w is a sum of past eligibility vectors (section 2.2) weighted according to
recency. This allows the actions associated with several past contexts to be updated at
each time-step. This eligibility trace is given by the recursive rule

\bar{e}_w(0) = 0 , \qquad \bar{e}_w(t) = e_w(t) + \delta\, \bar{e}_w(t-1)    (2.3.14)

where δ is the rate of trace decay. The eligibility trace and the activity trace (2.3.10)
(used to update the critic parameters) encode images of past contexts and behaviours
that persist after the original stimuli have disappeared. They can therefore be viewed
as short term memory (STM) components of the learning system. The weight vectors
encoding the prediction and action associations then constitute the long term memory
(LTM) components of the system.
There are no convergence proofs for actor/critic methods for delayed reward tasks
because of the problem of analysing the interaction of two concurrent learning
systems. However, successful learning has been demonstrated empirically on several
difficult tasks [3, 11, 167] which encourages the view that these training rules may
share the desirable gradient learning properties of their simpler counterparts.
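The following sketch shows how one time-step of such a system might be organised, with the TD error driving both elements through their respective traces. It is a plausible arrangement of equations (2.2.8), (2.3.10), (2.3.11), (2.3.13) and (2.3.14) for a Bernoulli logistic actor, not the control architecture developed in later chapters, and the decay rates and learning rates are arbitrary.

```python
import numpy as np

def actor_critic_delayed_step(w, v, e_bar, phi_bar, phi_t, env_step,
                              alpha=0.05, beta=0.1, gamma=0.95,
                              lam=0.8, delta=0.8):
    """One time-step of actor/critic learning with delayed rewards.

    w, v            : actor and critic parameter vectors
    e_bar, phi_bar  : eligibility trace and activity trace from the previous step
    phi_t           : recoded context at time t
    env_step        : maps the binary action y to (phi_next, r_next);
                      phi_next is the zero vector if the trial terminates
    """
    # Action selection by a Bernoulli logistic unit, with its instantaneous eligibility.
    p = 1.0 / (1.0 + np.exp(-np.dot(w, phi_t)))
    y = 1 if np.random.rand() < p else 0
    e_w = (y - p) * phi_t                              # eq. (2.2.8)

    # Short-term memory traces (images of past contexts and behaviours).
    e_bar = e_w + delta * e_bar                        # eligibility trace, eq. (2.3.14)
    phi_bar = phi_t + lam * phi_bar                    # activity trace, eq. (2.3.10)

    # The environment responds, and the TD error is shared by both elements.
    phi_next, r_next = env_step(y)
    e_td = r_next + gamma * np.dot(v, phi_next) - np.dot(v, phi_t)
    v = v + beta * e_td * phi_bar                      # critic update, eq. (2.3.11)
    w = w + alpha * e_td * e_bar                       # actor update, eq. (2.3.13)
    return w, v, e_bar, phi_bar, phi_next
```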
2.4 Integrating reinforcement and model-based learning
Planning, world knowledge and search
The classical AI method of action selection is to form an explicit plan by searching an
accurate internal representation of the environment appropriate to the current task
(see, for example [32, 114]). However, any planning system is faced with a scaling
problem. As the world becomes more complex, and as the system attempts to plan
further ahead, the size of the search space expands at an alarming rate. In particular,
problems arise as the system attempts to consider more of the available information
about the world. With each additional variable another dimension is added to the
input space which can cause an exponential rise in the time and memory costs of
search. Bellman [16] aptly described this as the “curse” of dimensionality.
Dynamic Programming is as subject to these problems as any other search method.
Incremental approximations to dynamic programming such as the TD learning
methods attempt to circumvent forward planning by making appropriate use of past
experience. Actions are chosen that in similar situations on repeated past occasions
have proven to be successful. A given context triggers an action that is in effect the
beginning of a ‘compiled plan’, summarising the best result from the history of past
experiences down that branch of the search tree. Thus TD methods are a solution (of
sorts) to the problem of acting in real time—there is no on-line search, and when the
future is a little different than expected, then there is often a ‘compiled plan’ available
for that too. However, the problem of search does not go away. Instead, the size of
the search space translates into the length of learning time required, and, when
exploration is local (as in gradient learning methods), there is an increased likelihood
of acquiring behaviours that are only locally optimal.
Optimal Learning, forward and world models
Real experience can be expensive to obtain—exploration can be a time-consuming,
even dangerous, affair. Optimal learning, rather than learning of optimal behaviour19,
19 Watkins [177] gives a fuller discussion of this distinction.
is concerned with gaining all possible knowledge from each experience that can be
used to maximise all future rewards and minimise future loss. Reinforcement learning
methods are not optimal in this sense. They extract information from the temporal
sequence of events that enables them to learn mappings of the following types
stimulus → action (S → A) (actor)
stimulus → reward (S → R) (critic)
stimulus × action → reward (S×A → R) (Q learning)
Associative learning of this type is clearly dispositional: it encodes task-specific
information and retains no knowledge of the underlying causal process. However,
associative mappings obtained through model-based learning can clearly help in
determining optimal behaviour. These can take the form of forward (causal) models,
i.e.
stimulus × action → stimulus (S×A → S)
or world models encoding information about neighbouring or successive stimuli, i.e.
stimulus → stimulus (S → S)
Where the knowledge such mappings contain is independent of any reward
contingencies, they can be applied to any task defined over that environment.
However, there can be substantial overhead in the extra computation and memory
required to learn, store, and maintain models of the environment. Several methods of
supplementing TD learning with different types of forward model have been
proposed [41, 108, 164, 166] and are described further below.
Forward models
A mapping of the S×A → S type is effectively a model of the transition function used
in dynamic programming. Sutton's [164] DYNA system uses such a model as an
extension to Q learning: the agent uses its actual experience in the world both to learn
evaluations for state/action pairs and to estimate the transition function. This allows
on-line learning to be supplemented by learning during simulated experience. In
other words, between taking actions in the real world and observing and learning
from their consequences, the agent performs actions in its ‘imagination’. Using its
current estimates of the evaluation and transition functions, it can then observe the
‘imaginary’ outcomes of these actions and learn from them accordingly. Sutton calls
this process ‘relaxation planning’—a large number of shallow searches, performed
whenever the system has a ‘free moment’, will eventually approximate a full search
of arbitrary depth. By carrying out this off-line search the system can propagate
information about delayed rewards more rapidly. Its actual behaviour will therefore
improve faster than by on-line learning alone.
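The interleaving of real and 'imagined' experience can be sketched as follows. This is a simplified, deterministic-model variant in the spirit of DYNA rather than Sutton's exact algorithm; the number of simulated updates per real step and the other constants are arbitrary choices.

```python
import numpy as np

def dyna_q_step(Q, model, s, step_fn, n_actions,
                gamma=0.9, beta=0.1, epsilon=0.1, n_planning=10):
    """One real step of Q learning followed by n_planning 'imagined' steps.

    Q       : dict mapping (state, action) -> estimated return
    model   : dict mapping (state, action) -> (next_state, reward), built from experience
    step_fn : the real environment, mapping (state, action) to (next_state, reward)
    """
    def greedy(state):
        return max(range(n_actions), key=lambda a: Q.get((state, a), 0.0))

    def backup(state, action, next_state, reward):
        best = 0.0 if next_state is None else max(
            Q.get((next_state, a), 0.0) for a in range(n_actions))
        q = Q.get((state, action), 0.0)
        Q[(state, action)] = q + beta * (reward + gamma * best - q)

    # Real experience: act, learn, and record the observed transition in the model.
    a = np.random.randint(n_actions) if np.random.rand() < epsilon else greedy(s)
    s_next, r = step_fn(s, a)
    backup(s, a, s_next, r)
    model[(s, a)] = (s_next, r)

    # Simulated experience: replay transitions drawn at random from the model.
    keys = list(model)
    for _ in range(n_planning):
        ps, pa = keys[np.random.randint(len(keys))]
        pns, pr = model[(ps, pa)]
        backup(ps, pa, pns, pr)
    return s_next
```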
Moore [108] describes a related method for learning in tasks with continuous input
spaces. To perform dynamic programming the continuous input space is partitioned
or quantised into discrete regions and an optimal action and evaluation learned in
each cell. The novel aspect of Moore’s approach is to suggest heuristic methods for
determining a suitable quantisation of the space that attempt to side-step the
dimensionality problem. He proposes varying the resolution of the quantisation
during learning, specifically, having a fine-grained quantisation in those parts of the
state-space that are visited during an actual or simulated sequence of behaviour and a
coarse granularity in the remainder. As the trajectory through the space changes over
repeated trials the quantisation is then altered in accordance with the most recent
behaviour.
World models
Sutton and Pinette [166] and Dayan [41] both propose learning S → S models. The
essence of both approaches is to train a network to estimate, from the current context
x(t), the discounted sum of future contexts

\sum_{k=1}^{\infty} \gamma^{k}\, x(t+k) .
One reason for learning this particular function is that a recursive error measure,
similar to the TD error, can be used to adapt the network parameters. Having acquired
such a mapping the resulting associations will reflect the topology of the task, which
may differ from the topology of the input space. When motivated to achieve a
specific goal, such a mapping may aid the learning system to distinguish situations in
which different actions are required, and to recognise circumstances where similar
behaviours can be applied.
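For a table of discrete contexts, a recursive, TD-like rule for this mapping might look as follows. This is one plausible reading of the scheme rather than the exact procedure of Sutton and Pinette or of Dayan; the discount factor and learning rate are arbitrary.

```python
import numpy as np

def world_model_step(M, c_t, c_next, gamma=0.9, beta=0.1):
    """Update a model M predicting the discounted sum of future contexts.

    M      : matrix whose row i estimates sum_k gamma^k x(t+k) given context i
    c_t    : index of the current context
    c_next : index of the next context observed while wandering
    """
    n = M.shape[0]
    x_next = np.zeros(n)
    x_next[c_next] = 1.0                       # one-of-N coding of the next context
    # Recursive target: gamma * (x(t+1) + M[c_next]), by analogy with the TD error.
    target = gamma * (x_next + M[c_next])
    M[c_t] += beta * (target - M[c_t])
    return M
```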
For instance, Dayan simulated a route-finding problem which involved moving
within a 2-D bounded arena between start and goal positions. A barrier across part of
the space obliged the agent to make detours when traversing between the separated
regions. The agent was constrained to make small, local movements, hence the
probability of co-occurrence of any two positions was generally a smooth function of
the distance between them, except for the non-linearity introduced by the barrier. An
S ! S mapping was trained by allowing the agent to wander unrewarded through the
space. Neighbouring positions thus acquired high expectations except where they lay
either side of the barrier where the expectation was low. The output of this mapping
system was used as additional input to an actor/critic learning system that was
reinforced for moving to a specific goal. The S ! S associations aided the system in
discriminating between positions on either side of the barrier. This allowed
appropriate detour behaviour to be acquired more rapidly than when the system was
trained without the model.
One significant problem with this specific method for learning world models is that it
is not independent of the behaviour of the system. If the S → S model in the route-finding
task is updated while the agent is learning paths to a specific goal then the
model will become biased to anticipate states that lie towards the target location.
When the goal is subsequently moved the model will then be much less effective as
an aid to learning. An additional problem with this mapping is that it is one-to-many,
making it difficult to represent efficiently and giving it poor scaling properties.
2.5 Continuous input spaces and partial state knowledge
Progress in the theoretical understanding of delayed reward learning algorithms has
largely depended on the assumptions of discrete states, and of a Markovian decision
process. If either of these assumptions is relaxed then the proofs of convergence,
noted above, no longer hold. Furthermore, it seems likely that strong theoretical
results for tasks in which these restrictions do not apply will be difficult to obtain.
One reason for this pessimism is that recent progress has depended on demonstrating
an underlying equivalence with stochastic dynamic programming for which the same
rigid assumptions are required.
An alternative attack on these issues is of course an empirical one—to investigate
problems in which the assumptions are relaxed and then observe the consequences.
This is a common approach in connectionist learning where the success of empirical
studies has often inspired further theoretical advances.
To be able to apply delayed reinforcement learning in tasks with continuous state
spaces would clearly be of great value. Many interesting tasks in robot control, for
example, are functions defined over real-valued inputs. It seems reasonable, given the
success of supervised learning in this domain, to expect that delayed reinforcement
learning will generalise to such tasks. This issue is one of the main focuses of the
investigations in this thesis.
Relaxing the assumption of Markovian state would give, perhaps, even greater gain.
The assertion that context input should have the Markov property constitutes an
extremely strict demand on a learning system. It seems likely that in many realistic
task environments the underlying Markovian state-space will be so vast that dynamic
programming in either its full or its incremental forms will not be viable.
It is clear, however, that for many tasks performed in complex, dynamic
environments, learning can occur perfectly well in the absence of full state
information. This is because the data required to predict the next state is
almost always a super-set of that needed to distinguish between the states for which
different actions are required. The latter discrimination is really all that an adaptive
agent needs to make. Consider, for instance, a hypothetical environment in which the
Markovian state information is encoded in an N-bit vector. Let us assume that all N
bits are required in order to predict the next state. It is clear that a binary-decision task
could be defined for this environment in which the optimal output is based on the
value at only a single bit position in the vector. An agent who observed the value of
this bit and no other could then perform as well as an agent who observed the entire
state description. Furthermore, this single-minded operator, who observes only the
task-relevant elements of the state information, has a potentially huge advantage. This
is that the size of its search-space (2 contexts) is reduced enormously from that of the
full Markovian task (2^N contexts). The crucial problem, clearly, is finding the right
variables to look at!20
20 This task has inspired research in reinforcement learning on perceptual aliasing—distinguishing states that have identical codings but require different actions (this issue is considered further in chapter four).
It is the above insight that has motivated the emphasis in Animat AI on reactive
systems that detect and exploit only the minimal number of key environmental
variables as opposed to attempting to construct a full world model. It is therefore to
be strongly hoped that delayed reinforcement learning will generalise to tasks with
only partial state information. Further, we might hope that it will degrade gracefully
where this information is not fully sufficient to determine the correct input-output
mapping. If these expectations are not met then these methods will not be free from
the explosive consequences of dimensionality that make dynamic programming an
interesting but largely inapplicable tool.
Conclusion
Reinforcement learning methods provide powerful mechanisms for learning in
circumstances of truly minimal feedback from the environment. The research
reviewed here shows that these learning systems can be viewed as climbing the
gradient in the expected reinforcement to a locally maximal position. The use of a
secondary system that predicts the expected future reward encourages successful
learning because it gives better feedback about the direction of this uphill gradient.
Many issues remain to be resolved. Success in reinforcement learning is largely
dependent on effective exploration behaviour. Chapters three and six of this thesis are
in part concerned with this issue. The learning systems described here have all been
given as linear functions in an unspecified representation of the input. However, for
continuous task spaces, finding an appropriate coding is clearly critical to the success
of the learning endeavour. The question of how suitable representations can be
chosen or learned will be taken up in chapters four and five.
Chapter 3
Exploration
Summary
Adaptive behaviour involves a trade-off between exploitation and exploration. At
each decision point there is a choice between selecting actions for which the
expected rewards are relatively well known and trying out other actions whose
outcomes are less certain. More successful behaviours can only be found by
attempting unknown actions but the more likely short-term consequence of
exploration is lower reward than would otherwise be achieved.
This chapter discusses methods for determining effective exploration behaviour. It
primarily concerns the indirect effect on exploration of the evaluation function.
The analysis given here shows that if the initial evaluation is optimistic relative to
available rewards then an effective search of the state-space will arise that may
prevent convergence on sub-optimal behaviours. The chapter concludes with a
brief summary of direct methods for adapting exploration behaviour.
3.1 Exploration and Expectation
Chapter two introduced learning rules for immediate reinforcement tasks of the
form
\Delta w_i(t) = \alpha\, [\, r(t+1) - b(t) \,]\, e_{w_i}(t) .
When there are a finite number of actions, altering the value of the reinforcement
baseline b has an interesting effect on the pattern of exploration behaviour.
Consider the task shown in figure 3.1, which could be viewed as a four-armed
maze.
[Figure 3.1 schematic: a central choice point with four arms N, S, E and W, offering rewards r_N, r_S, r_E and r_W respectively.]
Figure 3.1 : A simple four choice reinforcement learning problem.
Assume that for i ∈ {N, S, E, W} the preference for choosing each arm is given by
a parameter w_i and the reward for selecting each arm by r_i. Also let the action on
any trial be to choose the arm for which

w_i + \eta_i    (3.1)

is highest, where η_i is a random number drawn from a Gaussian distribution with a
fixed standard deviation. The eligibility e_{w_i}(t) is equal to one for the chosen arm
and zero for all others.
From inspecting the learning rule it is clear that an action will be punished
whenever the reinforcement achieved is less than the baseline b . Consider the
case where the reward in each arm is zero but the baseline is some small positive
value. It is easy to see that the preferred action at the start of each trial (i.e. the one
with least negative weighting) will be the one which has been attempted on fewest
previous occasions. The preference weights effectively encode a search strategy
based on action frequency. It is tempting to call this strategy spontaneous
alternation (after the exploration behaviour observed in rodents21) since a given
action is unlikely to be retried until all other alternatives have been attempted an
equal number of times. If non-zero rewards are available in any of the maze arms,
but the baseline is still higher than the maximum reward, then the behaviour of
the system will follow an alternation pattern in which the frequency p_i with which
each arm is selected22 is

p_i = \frac{(r_i - b)^{-1}}{\sum_j (r_j - b)^{-1}} .
In other words, the alternation behaviour is biased to produce actions with higher
reward levels more frequently.
With a fixed baseline, the learning system will never converge on actions that
achieve only lower levels of reward. Consequently, if the maximum reward value
r* is known, setting b = r* will ensure that the system will not cease exploring
unless an optimal behaviour is found. In Sutton’s reinforcement comparison
scheme, the baseline is replaced by the prediction of reinforcement V . From the
above discussion it is clear that there is a greater likelihood of achieving optimal
21 Spontaneous alternation (e.g. [43]) is usually studied in Y or T mazes. Over two successive trials, rodents and other animals are observed to select the alternate arm on the second trial in approximately 80% of tests. Although there may be a superficial likeness, the artificial spontaneous alternation described here is not intended as a psychological model—it seems probable that in most animals this phenomenon is due to representational learning.
22 If T is the total number of trials and T_i is the number of trials in which arm i was selected, then as T → ∞, for any pair i, j,

\frac{(r_i - b)\,T_i}{(r_j - b)\,T_j} \rightarrow 1 .

Therefore T_i ∝ (r_i − b)^{-1}, hence the frequency with which arm i is chosen is

p_i = \frac{T_i}{T} = \frac{(r_i - b)^{-1}}{\sum_j (r_j - b)^{-1}} .
behaviour (and avoiding convergence on poor actions) if the initial prediction
V_0 ≥ r*. Alternation behaviour will occur in a similar manner until V ≈ r*;
thereafter the optimal action will be rewarded while all others continue to be
punished.
For associative learning, the expectation V(x) associated with a context x is
described here as optimistic if V(x) ≥ r*(x) (where r*(x) is the maximum
possible reward for that context) and as pessimistic otherwise. In immediate
reinforcement tasks, setting the initial expectation V_0(x) = r*(x) will give a
greater likelihood of finding better actions than V_0(x) < r*(x); it should also
give faster learning than V_0(x) > r*(x). When r*(x) is not known, an
optimistic guess for V_0(x) will give slower learning than a pessimistic guess but
also a better chance of finding the best actions.
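The biased alternation pattern described for the four-armed task is easily checked numerically. The sketch below simulates selection rule (3.1) with a fixed optimistic baseline; the reward values, baseline, noise level and number of trials are invented for the illustration and are not those used in the experiments reported below.

```python
import numpy as np

rewards = {"N": 0.1, "S": 0.3, "E": 0.5, "W": 0.7}    # example deterministic rewards
b, alpha, noise_sd, trials = 1.0, 0.1, 0.01, 20000     # optimistic baseline b > max reward
arms = list(rewards)
w = {a: 0.0 for a in arms}
counts = {a: 0 for a in arms}

for _ in range(trials):
    # Selection rule (3.1): pick the arm with the highest noisy preference.
    chosen = max(arms, key=lambda a: w[a] + np.random.normal(0.0, noise_sd))
    counts[chosen] += 1
    w[chosen] += alpha * (rewards[chosen] - b)          # eligibility 1 for the chosen arm

observed = {a: counts[a] / trials for a in arms}
inv = {a: 1.0 / (b - rewards[a]) for a in arms}
predicted = {a: inv[a] / sum(inv.values()) for a in arms}
print("observed :", observed)
print("predicted:", predicted)   # frequencies proportional to (r_i - b)^(-1), as in the text
```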
A similar argument applies to delayed reinforcement tasks. In this case, an
expectation is considered optimistic if V(x) ≥ R*(x), where R*(x) is the
maximum possible return. If the initial expectation V_0 is the same in all states
and is globally optimistic then a form of spontaneous alternation will arise. While
the predictions are over-valued the TD error will on average be negative and
actions will generally be punished. However, transitions to states that have been
visited less frequently will be punished less. The selection mechanism therefore
favours those actions that have been attempted least frequently and that lead to the
least visited states. When the expected return varies between states the alternation
pattern should also be biased towards actions and states with higher levels of
return.
Hence, for an optimistic system, initial exploration is a function of action
frequency, state frequency and estimated returns. This results in behaviour that
traverses the state-space in a near systematic manner until expectation is reduced
to match the true level of available reward.
Sutton [162] performed several experiments investigating the effect on learning of
reinforcements that occur after varying lengths of delay. He found that the
learning system tends to adapt to maximise rewards that occur sooner rather than
later. This arises because secondary reinforcement from more immediate rewards
biases action selection before signals from later rewards have been backed-up
sufficiently to have any influence. Clearly this problem is not overcome by
altering rates of trace decay or learning since these parameters affect the rate of
propagation of all rewards equally. Providing the learning system with an
optimistic initial expectation can, however, increase the likelihood of learning
optimal behaviour. While the expectation is optimistic, action learning is
postponed in favour of spontaneous alternation. Chains of secondary
reinforcement only begin to form once the expectation falls below the level of
available reward. Rewards of greater value will obtain a head start in this backing-up process, increasing the likelihood of learning appropriate, optimal actions. In
the following section this effect of the initial expectation is demonstrated for
learning in a simple maze-like task.
A maze learning task
Figure 3.2 shows a maze learning problem represented as a grid in which the cells
of the grid correspond to intersections and the edges between cells to paths that
connect at these intersections. In each cell in the rectangular grid shown there are
therefore up to four paths leading to neighbouring places.
Figure 3.2: A maze learning task with 6x6 intersections. The agent (A) and goal
(G) are in opposite corners and there are four ‘hazard’ areas (H).
Behaviour is modelled in discrete time intervals where at each time-step the agent
makes a transition from one cell to a neighbour. A version of the actor/critic
architecture is used in which each cell is given a unique, discrete encoding. The
evaluation for a cell is encoded by a distinct parameter, and, as in the four-arm
maze (figure 3.1), there is a separate weight for each action in each cell. The
action in any specific cell is chosen by equation 3.1. Further details of the
algorithm are given in Appendix A.
For the task considered here, one cell of the grid is assigned to be the starting
position of the agent and a second cell is assigned to be the goal position where
positive reward (+1) is available. Certain cells contain hazards where a negative
reinforcement (-1) is given for entering the cell; in all other non-goal cells the
reward is zero. Note that with this reward schedule it is the effect of the
discounted time horizon (that delayed rewards are valued less) that encourages the
agent to find direct paths. It is also possible that the learning system will fail to
learn any route to the goal. This arises if a locally optimal, ‘dithering’ policy is
found that involves swapping back and forth between adjacent non-hazard cells to
avoid approaching punishing areas.
The maximum return R*(x) for any cell is γ^{s-1}, where s is the minimum number
of steps to the goal; hence, for all cells, 0 < R*(x) ≤ 1. An initial expectation of
zero is therefore pessimistic in all cells and an expectation of +1 optimistic.
The effect on learning of these different expectations was examined in the
following experiment. The agent was run on repeated trials with the maze
configuration shown above. Each trial ended either when the agent reached the
goal or after a thousand transitions had occurred. A run was terminated after 100
trials or after two successive trials in which the agent failed to reach the goal in
the allotted time. Suitable global parameters for the learning system (learning
rates, decay rates and discount factor) were determined by testing the system in a
hazard-free maze.
Out of ten learning runs starting from the pessimistic initial expectation, V 0 = 0 ,
the agent failed to learn a path to the goal on all occasions as a result of learning a
procrastination policy. Figure 3.3 shows a typical example of the associations
acquired.
Figure 3.3: Action preferences and cell evaluations after a series of trials learning
from initially pessimistic expectation. The arrows in the left diagram indicate the
preferred direction of movement in each cell. The heights of the columns in the
right diagram show the predicted return for each cell (white +ve, black -ve).
In this example the agent has gradually confined itself to the top left hand
corner—all preferences near the start cell direct the agent away from the hazards
and back toward the initial position. A ‘wall’ of negative expectation prevents the
agent from gaining any new experience near the goal.
In contrast to the poor performance of this pessimistic learner, given an optimistic
initial expectation23, V 0 = +1 , successful learning of a direct path was achieved on
all ten runs. Figure 3.4 illustrates the policy and evaluation after one particular
run. On this occasion the prediction in any cell never fell below zero expectation,
so convergence on a dithering policy could not occur.
23 The optimistic expectation is applied to all cells except the goal, which is given an initial value of zero. This does not imply any prior knowledge but merely indicates that once the goal is achieved the anticipation of it ceases. Experiments with a continuous version of the problem in which the agent moves from the goal cell back to its starting position (and updates the evaluation of the goal according to its experience in the next trial) support the conclusions reported here.
Figure 3.4: Action preferences and cell evaluations after a series of trials of learning from an initially optimistic expectation; the agent has learned a direct path to the goal (along the top row then down the left column).
To confirm that the exploration behaviour of an optimistic system is better than chance a simple experiment was performed using a ‘dry’ 6x6 maze (i.e. one without rewards of any kind). In each cell the action with the highest weighting was always selected (using random noise as a tie breaker). A good measure of how systematically the maze is traversed is the variance in the mean number of times each possible action is chosen out of a series of n actions. For $n = 120$, random action selection (or $V_0 = 0$) gave an average variance that was more than five times higher²⁴ than optimistic exploration behaviour ($V_0 = +1$). In other
words, the initial behaviour of an optimistic system traverses the maze in a
manner that is considerably more systematic than a random walk.
The effect of initial expectation is further demonstrated in figure 3.5. This graph
shows the average number of transitions in the first ten trials of learning starting
from different initial expectations²⁵. Behaviour in the maze with hazards is
contrasted with behaviour in a hazard-free maze. In the latter case the number of
transitions gradually rises as the value of the initial expectation is increased (from
zero through to one). This is entirely due to the alternation behaviour induced by
²⁴Since there are 120 actions in total, choosing n = 120 makes the mean number of choices 1. Over ten trials, random selection gave an average variance in this mean of 1.42; for optimistic search the variance was only 0.26.
²⁵The averages were calculated over ten runs, with only those runs which were ultimately
successful in learning a path to the goal being considered.
the more optimistic learning systems. In the hazardous maze, however, the trend is
reversed. In this case, systems with lower initial expectations take longer to learn
the task. More time is spent in unfruitful dithering behaviour and less in effective
exploration.
[Figure 3.5: plot of average steps per trial (0 to 120) against initial value (0.0 to 1.0), with separate curves for the maze with hazards and the hazard-free maze.]
Figure 3.5: Speed of learning over first ten trials for initial value functions 0.0,
0.3, 0.6, 0.8 and 1.0. (For an initial value of zero in the maze with hazards all ten
runs failed to find a successful route to the goal).
Clearly, it is the relative values of the expectation and the maximum return that
determine the extent to which the learning system is optimistic or pessimistic.
Therefore an alternative way to modify exploration behaviour is to change the rewards and not the expectation. For the maze task described above, equivalent learning behaviour is generated if there is zero reward at the goal and negative rewards for transitions between all non-goal cells²⁶.
²⁶In the task described above there is a reward $r_g$ at the goal and zero reward in all other non-hazard cells. Consider an identical task but with zero reward at the goal and a negative ‘cost’ reward $r_c$ for every transition between non-goal cells (Barto et al. [13] investigated a route-finding task of this nature). Let the maximum return in each cell for the first task be $R^G(x)$ and in the second task $R^C(x)$. It can easily be shown that, for a given discount factor $\gamma$, $R^C(x) = 1 - R^G(x)$ iff $r_c = (\gamma - 1)r_g$. Learning behaviour will therefore be the same under both circumstances if the initial expectation $V_0^C = 1 - V_0^G$.
As learning proceeds the effect of an optimistic bias in initial predictions
gradually diminishes. To obtain a similar, but permanent, influence on exploration
Williams [184] has therefore suggested adding a small negative bias to the error
signal. This will provoke continuous exploration since any action which does not
lead to a higher than expected level of reinforcement is penalised. The system will
never actually converge on a fixed policy but better actions will be preferred.
Relationship to animal learning
The indirect exploration induced by the discrepancy between predictions and
rewards seems to fit naturally with many of the characteristics of conditioning in
animal learning. In particular, a large number of theorists have proposed that it is
the ‘surprise’ generated by a stimulus that provokes exploration more than the
status of the outcome as a positive or negative reinforcer (see, for instance,
Lieberman [85]). The pessimistic-optimistic distinction seems also to have a
parallel in animal learning which is shown by experiments on learned
helplessness [123, 149]. This research demonstrates that animals fail to learn
avoidance of aversive stimuli if they are pre-trained in situations where the
punishing outcome is uncontrollable. In terms of the simulations described above,
it could be argued that the induced negative expectation brought about by the pretraining, reduces the discrepancy between the predicted outcome and the
(aversive) reward and so removes the incentive for exploration behaviour.
3.2 Direct Exploration Methods
In addition to the indirect exploration strategies described above a number of methods have been proposed for directly encouraging efficient exploration. This
section briefly reviews some of these techniques, and in this context, describes an
exploration method due to Williams [184] that is employed in the simulations
described in later chapters.
The simplest procedure for controlling exploration is to start out with a high level
of randomness in the action selection mechanism and reduce this toward zero as
learning proceeds. An annealing process of this sort can be applied to the task as a
whole, or for more efficient exploration, the probability function from which
actions are selected can be varied for different local regions of the state space. The
following considers several such methods for tailoring local exploration; these fall into two general categories that I will call uncertainty and performance measures.
Uncertainty measures attempt to estimate the accuracy in the learning system’s
current knowledge. Suitable heuristics are to attach more uncertainty to contexts
that have been observed less frequently [108], or less recently [164], or for which
recent estimates have shown large errors [108, 111, 145, 169]. Exploration can be
made a direct function of uncertainty by making the probability of selecting an
action depend on both the action preference and the uncertainty measure. The
effect of biasing exploration in this manner is local, that is, it cannot direct the
system to explore in a distant region of the state-space. An alternative approach is
to apply the uncertainty heuristic indirectly by adding some positive function of
uncertainty to the primary reinforcement (e.g. [164]). This mechanism works by
redefining optimal behaviours as those that both maximise reward and minimise
uncertainty. This method can produce a form of non-local exploration bias, since
uncertainty will be propagated by the backward-chaining of prediction estimates
eventually having some effect on the decisions made in distant states.
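To make the indirect use of an uncertainty measure concrete, the sketch below (a hypothetical illustration, not a method used elsewhere in this thesis) adds a simple count-based bonus to the primary reward within a tabular value update; the class name, parameters and bonus schedule are all assumptions.

import numpy as np

class CountBonusLearner:
    """Tabular Q-learning with a count-based uncertainty bonus added to the
    primary reward, so that rarely visited states appear more attractive."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, beta=0.05):
        self.q = np.zeros((n_states, n_actions))   # action-value estimates
        self.visits = np.zeros(n_states)            # state visit counts
        self.alpha, self.gamma, self.beta = alpha, gamma, beta

    def update(self, s, a, r, s_next):
        self.visits[s_next] += 1
        # Uncertainty heuristic: the bonus shrinks as a state is visited more often.
        bonus = self.beta / np.sqrt(self.visits[s_next])
        target = (r + bonus) + self.gamma * self.q[s_next].max()
        self.q[s, a] += self.alpha * (target - self.q[s, a])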
Performance measures estimate the success of the system’s behaviour as
compared with either a priori knowledge, or local estimates of available rewards.
Gullapalli [56] describes a method for immediate reinforcement tasks in which the
performance measure is a function of the difference between the predicted reward
and the maximum possible reward $r^*(x)$ (which is assumed to be known). The amount of exploration, which varies from zero up to some maximum level, is in direct proportion to the size of this disparity.
Williams [184] has proposed a method for adapting exploration behaviour that is
suited to learning real-valued actions when $r^*(x)$ is not known to the learning
system. He suggests allowing the degree of search to adapt according to the
variance in the expected reward around the current mean action. In other words,
if actions close to the current mean are achieving lower rewards than actions
further away, then the amount of noise in the decision rule should be increased to
allow more distant actions to be tried more frequently. If, on the other hand,
actions close to the mean are more successful than those further away, then noise
should be reduced, so that the more distant (less successful) actions are sampled
less often. A Gaussian action unit that performs adaptive exploration of this
nature is illustrated below and described in detail in Appendix D. Here both the
mean and the standard deviation of a Gaussian pdf are learned. The mean, of course, is trained so as to move toward actions that are rewarded and away from those that are punished; the standard deviation, however, is also adapted. The
width of the probability function is increased when the mean is in a region of low
reward (or surrounded by regions of high reward) and reduced when the mean is
close to a local peak in the reinforcement landscape (see figure 3.6).
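A minimal sketch of an action unit of this kind is given below; the learning rates, the zero reinforcement baseline and the lower bound on the width are illustrative assumptions, but the direction of the updates follows the behaviour summarised in figure 3.6.

import numpy as np

class GaussianActionUnit:
    """Real-valued action unit that samples from a Gaussian and adapts both the
    mean and the standard deviation from the reinforcement error signal."""

    def __init__(self, mu=0.0, sigma=1.0, lr_mu=0.05, lr_sigma=0.01):
        self.mu, self.sigma = mu, sigma
        self.lr_mu, self.lr_sigma = lr_mu, lr_sigma

    def act(self):
        # Exploratory action drawn from the current probability function.
        self.y = np.random.normal(self.mu, self.sigma)
        return self.y

    def learn(self, e):
        """e is the reinforcement error (reward minus its predicted/baseline value)."""
        d = self.y - self.mu
        # The mean moves toward actions that did better than expected.
        self.mu += self.lr_mu * e * d / self.sigma ** 2
        # The width grows when actions sampled more than one s.d. away do better
        # than expected ((y - mu)^2 > sigma^2 with e positive) and shrinks when
        # actions near the mean are the more successful ones.
        self.sigma += self.lr_sigma * e * (d ** 2 - self.sigma ** 2) / self.sigma ** 3
        self.sigma = max(self.sigma, 1e-3)   # keep the width strictly positive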
This learning procedure results in an automatic annealing process as the variance
of the Gaussian will shrink as the mean behaviour converges to the local
maximum. However, the width of the Gaussian can also expand if the mean is
locally sub-optimal allowing for an increase in exploratory behaviour at the start
of learning or if there are changes in the environment or in the availability of
reward.
It is interesting to contrast this annealing method with Gullapalli’s approach. In
the latter the aim of adaptive exploration is solely to enable optimal actions to be
found, consequently as performance improves the noise in the decision rule is
reduced to zero. In Williams’ method, however, the goal of learning is to adapt
the variance in the acquired actions to reflect the local slope of the expected
return. The final width of the Gaussian should therefore depend on whether the
local peak in this function is narrow or flat on top. The resulting variations in
behaviour can be viewed not so much as noise but as acquired versatility. An
application of Williams’ method to a difficult delayed reinforcement task is
described in chapter six where the value of this learned versatility can be clearly
seen.
Figure 3.6: Learning the standard deviation of the action probability function. The figures indicate the direction of change in the standard deviation ($\sigma$) for actions ($y$) sampled less ($(y - \mu)^2 < \sigma^2$) or more ($(y - \mu)^2 > \sigma^2$) than one standard deviation from the mean, for different values of the reinforcement error signal ($e$). The bottom left figure indicates that the distribution will widen when the mean is in a local trough in the reinforcement landscape; the bottom right, that it will narrow over a local peak.
Direct exploration and model-based learning
Mechanisms that adapt exploration behaviour according to some performance measure can be seen as rather subtle forms of reinforcement (i.e. non model-based) learning. Here the acquired associations are clearly task-specific, although they specify the degree of variation in response as well as the response itself.
Uncertainty measures, on the other hand, are more obviously ‘knowledge about
knowledge’, implying some further degree of sophistication in the learning
system. However, whether this knowledge should be characterised as model-based
is arguable. To a considerable extent this may depend on how and where the
knowledge is acquired and used.
For instance, a learning system could estimate the frequency or recency of
different input patterns, or the size of their associated errors, in the context of
learning a specific task. This measure could then be added to the internal reward
signal (as described above) so indirectly biasing exploration. These heuristics will
be available even in situations where the agent has only partial state knowledge.
In contrast, where causal or world models are constructed, the same uncertainty
estimates with respect to a given goal could be acquired within the context of a
task-independent framework [108, 164]. Exploration might then be determined
more directly and in a non-local fashion. That is, rather than simply biasing
exploration toward under-explored contexts, the learning system could explicitly
identify and move toward regions of the state-space where knowledge is known to
be lacking. Clearly such strategies depend on full model-based knowledge of the
task, although the motivating force for the exploration measure is still task-specific.
Finally, uncertainty measures could be determined with respect to the causal or
world models themselves, in other words, there could be task-independent
knowledge of uncertainty, something perhaps more like true curiosity, which
could then drive exploration behaviour.
Conclusion
This chapter has considered both direct and indirect methods for controlling the
extent of search carried out by a reinforcement learning system. In particular, the
value of the initial expectation (relative to the maximum available reward) has
been shown to have an indirect effect on exploration behaviour and consequently
on the likelihood of finding globally optimal solutions to the task in hand.
Chapter 4
Input Coding for Reinforcement Learning
Summary
The reinforcement learning methods described so far can be applied to any task in
which the correct outputs (actions and predictions) can be learned as linear functions
of the recoded input patterns. However, the nature of this recoding is obviously
critical to the form and speed of learning. Three general approaches can be taken to
the problem of choosing a suitable basis for recoding a continuous input space: fixed
quantisation methods; unsupervised learning methods for adaptively generating an
input coding; and adaptive methods that modify the input coding according to the
reinforcement received. This chapter considers the advantages and drawbacks of
various recoding techniques and describes a multilayer learning architecture in which
a recoding layer of Gaussian basis function units with adaptive receptive fields is
trained by generalised gradient descent to maximise the expected reinforcement.
4.1 Introduction
In chapter two multilayer neural networks were considered as function
approximators. The lower layers of a network were viewed as providing a recoding of
the input pattern, or recoding vector, which acts as the input to the upper network
layer where desired outputs (actions and predictions) are learned as linear functions.
This chapter addresses some of the issues concerned with selecting an appropriate
architecture for the recoding layer(s). The discussion divides into three parts: methods
that provide fixed or a priori codings; unsupervised or competitive learning methods
for adaptively generating codings based on characteristics of the input; and methods
that use the reinforcement feedback to adaptively improve the initial coding system.
As discussed in chapter two, for any given encoding of an input space, a system with a single layer of adjustable weights can only learn a limited set of output functions [102]. Unfortunately, there is no simple way of ensuring that an arbitrary output function can be learned short of expanding the set of input patterns into a high-dimensional space in which they are all orthogonal to each other (for instance, by assigning a different coding unit to each pattern). This option is clearly ruled out for
tasks defined over continuous input spaces as the set of possible input vectors is
infinite. However, even for tasks defined over a finite set such a solution is
undesirable because it allows no generalisation to occur between different inputs that
require similar outputs.
One very general assumption that is often made in choosing a representation is that
the input/output function will be locally smooth. If this is true, it follows that
generalisation from a learned input/output pairing will be worthwhile to nearby
positions in the input space but less so to distant positions. The recoding methods
discussed in this chapter exploit this assumption by mapping similar inputs to highly-correlated codings and dissimilar inputs to near-orthogonal codings. This allows local
generalisation to occur whilst reducing crosstalk (interference between similar
patterns requiring different outputs). To provide such a coding the input space is
mapped to a set of recoding units, or local experts¹, each with a limited, local
receptive field. Each element of the recoding vector then corresponds to the activation
of one such unit.
In selecting a good local representation for a learning task there are clearly two
opposing requirements. The first concerns making the search problem tractable by
limiting the size of recoded space. It is desirable to make the number of local experts
small enough and their receptive fields large enough that sufficient experience can be
gained by each expert over a reasonably short learning period. The more units there
are the longer learning will take. The second requirement is that of making the
desired mapping learnable. The recoding must be such that the non-linearities in the
input-output mapping can be adequately described by the single layer of output
weights. However, it is the nature of the task and not the nature of the input that
determines where the significant changes in the input-output mapping occur.
Therefore, in the absence of specific task knowledge, any a priori division of the
space may result in some input codings that obscure important distinctions between
input patterns. This problem—sometimes called perceptual aliasing—will here be
described as the ambiguity problem as such codings are ambiguous with regard to
discriminating contexts for which different outputs are required. In general the
likelihood of ambiguous codings will be reduced by adding more coding units. Thus
there is a direct trade-off between creating an adequate, unambiguous code and
keeping the search tractable.
4.2 Fixed and unsupervised coding methods
Boundary methods
The simplest form of a priori coding is formed by a division of the continuous input
space into bounded regions within each of which every input point is mapped to the
same coded representation. A recoding of this form is sometimes called a hard
¹The use of this term follows that of Nowlan [117] and Jacobs [66], though no strict definition is
intended here.
quantisation. Figure 4.1 illustrates a hard quantisation of an input space along a
single dimension.
Figure 4.1: Quantisation of a continuous input variable.
A hard quantisation can be defined by a set of M cells denoted by $C = \{c_1, c_2, \ldots, c_M\}$, where for each cell $c_i$ the range $\{x_i^{min}, x_i^{max}\}$ is specified, and where the ranges of all cells cover the space and are non-overlapping. The current input pattern $\mathbf{x}(t)$ then maps to the single cell $c^*(t) \in C$ whose boundaries encompass this position in the input space. The elements of the recoding vector are in a one-to-one mapping to the quantisation cells. If the index of the winning cell is given by $i^*$ this gives $\boldsymbol{\phi}(t)$ as a unit vector of size M where

$$\phi_i(t) = \delta(i, i^*) = \begin{cases} 1 & \text{iff } i = i^* \\ 0 & \text{otherwise} \end{cases} \qquad (4.1.1)$$

(The Kronecker delta $\delta(i, j) = 1$ for $i = j$, $0$ for $i \ne j$ will also be used to indicate functions of this type.)
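For illustration, a hard quantisation of a scalar input can be written as a one-hot coding; the boundary values in the example below are arbitrary.

import numpy as np

def hard_quantise(x, boundaries):
    """One-hot coding of a scalar input under a hard quantisation (cf. equation
    4.1.1). `boundaries` holds the interior cell boundaries, giving
    len(boundaries) + 1 cells in total."""
    code = np.zeros(len(boundaries) + 1)
    code[np.searchsorted(boundaries, x)] = 1.0
    return code

# Example: three cells over [0, 1) with boundaries at 1/3 and 2/3.
print(hard_quantise(0.5, [1/3, 2/3]))   # -> [0. 1. 0.]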
With a quantisation of this sort non-linearities that occur at the boundaries between
cells can be acquired easily. However, this is achieved at the price of having no
generalisation or transfer of learning between adjoining cells. For high-dimensional
input spaces this straightforward approach of dividing up the Euclidean space using a
fixed resolution grid rapidly falls prey to Bellman’s curse. Not only is a large amount
of space required to store the adjustable parameters (much of it possibly unused), but
learning occurs extremely slowly as each cell is visited very infrequently.
Coarse coding
One way to reduce the amount of storage required is to vary the resolution of the
quantisation cells. For instance, Barto et al. [11] (following Michie and Chambers
[99]) achieve this by using a priori task knowledge about which regions of the state
space are most critical. An alternative approach is to use coarse-coding [61], or soft
quantisation, methods where each input is mapped to a distribution over a subset of
the recoding vector elements. This can reduce storage requirements and at the same
time provide for a degree of local generalisation.
A simple form of coarse coding is created by overlapping two or more hard codings
or tilings as shown in figure 4.2.
Figure 4.2: Coarse-coding using offset tilings along a single dimension. One cell in
each tiling is active.
If each of the T tilings is described by a set $C_j$ ($j = 1 \ldots T$) of quantisation cells, then the current input $\mathbf{x}(t)$ is mapped to a single cell $c_j^*(t)$ in each tiling and hence to a set $U(t) = \{c_1^*(t), c_2^*(t), \ldots, c_T^*(t)\}$ of T cells overall. If, again, there is a one-to-one mapping of cells to the elements $\phi_i(t)$ then the coding vector is given by

$$\phi_i(t) = \begin{cases} \frac{1}{T} & \text{iff } c_i \in U(t) \\ 0 & \text{otherwise} \end{cases} \qquad (4.1.2)$$

Note that here each element is ‘normalised’ to the value $\frac{1}{T}$ so that the sum of activation across all the elements of the vector is unity.
A soft quantisation of this type can give a reasonably good approximation wherever
the input/output mapping varies in a smooth fashion. However, any sharp non-linearities in the function surface will necessarily be blurred as a result of taking the
average of the values associated with multiple cells each covering a relatively large
area.
A coarse-coding of this type called the CMAC (“Cerebellar Model Articulation
Computer”) was proposed by Albus [2] as a model of information processing in the
mammalian cerebellum. The use of CMACs to recode the input to reinforcement
learning systems has been described by Watkins [177]; they are also employed in the
learning system described in Chapter Six. The precise definition of a CMAC differs
slightly from the method given above, in that the cells of the CMAC are not
necessarily mapped in a one-to-one fashion to the elements of the recoding vector.
Rather a hashing function can be used to create a pseudo random many-to-one
mapping of cells to recoding vector elements. This can reduce the number of adaptive
parameters needed to encode an input space that is only sparsely sampled by the input
data.
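A minimal sketch of this style of coarse coding is given below, using offset grid tilings without hashing; the grid layout, the offset scheme and the clipping at the edges are illustrative assumptions rather than a definition of the CMAC.

import numpy as np

def cmac_code(x, n_tilings=4, bins_per_dim=4, lo=0.0, hi=1.0):
    """Coarse-code a point x (1-D array) with `n_tilings` offset grid tilings.
    One cell is active per tiling, and each active element is set to 1/T so
    that the total activation is unity (cf. equation 4.1.2)."""
    x = np.asarray(x, dtype=float)
    n_dims = x.size
    cell_width = (hi - lo) / bins_per_dim
    cells_per_tiling = bins_per_dim ** n_dims
    code = np.zeros(n_tilings * cells_per_tiling)
    for t in range(n_tilings):
        # Each tiling is displaced by a fraction of a cell along every axis.
        offset = t * cell_width / n_tilings
        idx = np.floor((x - lo + offset) / cell_width).astype(int)
        idx = np.clip(idx, 0, bins_per_dim - 1)
        flat = int(np.ravel_multi_index(idx, (bins_per_dim,) * n_dims))
        code[t * cells_per_tiling + flat] = 1.0 / n_tilings
    return code

# Example: code a point in a two-dimensional unit square.
phi = cmac_code([0.3, 0.7])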
Compared with a representation consisting of a single hyper-cuboid grid, a CMAC appears to quantise a high-dimensional input space to a similar resolution (albeit coarsely) using substantially less storage. For example, figure 4.3 shows a CMAC in a two-dimensional space consisting of four tilings with 64 adjustable parameters; this gives a degree of discrimination similar to a single grid with 169 parameters.
Figure 4.3: A CMAC consisting of four 4x4 tilings compared with a single grid
(13x13) of similar resolution
At first sight it appears that the economy in storage further improves as the dimensionality of the space increases—if all tilings are identical and each has a uniform resolution P in each of N input dimensions, then the total number of adjustable parameters (without hashing) is $P^N T$. This gives a maximum resolution similar² to a single grid of size $(PT)^N$; in other words, there is a saving in memory requirement of order $T^{N-1}$. However, this saving in memory actually incurs a
significant loss of coding accuracy as the nature of the interpolation performed by the
system varies along different directions in the input space. Specifically, the
interpolation is generated evenly along the axis on which the tilings are displaced, but
unevenly in other directions (in particular in directions orthogonal to this axis).
Figure 4.4 illustrates this for the 2-dimensional CMAC shown above. The figure
shows the sixteen high-resolution regions within a single cell of the foremost tiling.
For each of these regions, the figure shows the low-resolution cells in each of the
tilings that contribute to coding that area. There is a clear difference in the way the
interpolation occurs along the displacement axis (top-left to bottom-right) and
orthogonal to it. This difference will be more pronounced when coding higher
dimensional spaces.
²This is strictly true only if the input space is a closed surface (i.e. an N-dimensional torus); if the input space is open (i.e. has boundaries) then the resolution is lower at the edges of the CMAC because the tilings are displaced relative to each other. If the space is open in all dimensions (as in figure 4.3) then the CMAC will have maximum resolution only over a grid of size $(PT + 1 - T)^N$.
Figure 4.4: Uneven interpolation in a two dimensional CMAC. The distribution of
low resolution cells contributing to any single high-resolution area (the black squares)
is more balanced along the displacement axis (top-left to bottom-right) than
orthogonal to it (bottom-left to top-right).
Nearest Neighbour methods
Rather than defining the boundaries of the local regions it is a common practice to
partition the input space by defining the centres of the receptive fields of the set of
recoding units. This results in nearest-neighbour algorithms for encoding input
patterns [70, 105]. Given a set of i = 1…M units each centred at position c i in the
input space, a hard, winner-takes-all coding is obtained by finding the centre that is
closest to the current input x(t) according to some distance metric. For instance, if
the Euclidean metric is chosen, then the winning node is the one for which

$$d_i = d(\mathbf{x}(t), \mathbf{c}_i) = \sum_{j=1}^{N} \left( x_j(t) - c_{ij} \right)^2 \qquad (4.1.3)$$
is minimised. If the index of the winning node is written $i^*$ then the recoding vector is given by the unit vector

$$\phi_i(t) = \delta(i, i^*) \qquad (4.1.4)$$
This mechanism creates an implicit division of the space called a Voronoi
tessellation3.
Radial basis functions
This nearest neighbour method can be extended to form a soft coding by computing
for each unit a value gi which is some radially symmetric, non-linear function of the
distance from the input point to the node centre known as a radial basis function
(RBF) (see [131] for a review). Although there are a number of possible choices for
this function there are good reasons [127] for preferring the multi-dimensional
Gaussian basis function (GBF). First, the 2-dimensional Gaussian has a natural
interpretation as the ‘receptive field’ observed in biological neurons. Furthermore, it
is the only radial basis function that is factorisable in that a multi-dimensional
Gaussian can be formed from the product of several lower dimensional Gaussians
(this allows complex features to be built up by combining the outputs of two- or one-dimensional detectors).
A radial Gaussian node has a spherical activation function of the form

$$g_i(t) = g(\mathbf{x}(t), \mathbf{c}_i, w) = \frac{1}{(2\pi)^{N/2}\,w^N} \exp\left( -\sum_{j=1}^{N} \frac{\left[ x_j(t) - c_{ij} \right]^2}{2w^2} \right) \qquad (4.1.5)$$
where w denotes the width of the Gaussian distribution in all the input dimensions. It
is convenient to use a normalised encoding which can be obtained by scaling the
activation gi (t) of each unit according to the total activation of all the units. In other
words, the recoding vector element for the ith unit is calculated by
3 See Kohonen [70] chapter five.
$$\phi_i(t) = \frac{g_i(t)}{\sum_{j=1}^{M} g_j(t)}\,. \qquad (4.1.6)$$
Figure 4.5 illustrates a radial basis recoding of this type with equally spaced fixed
centres and the Gaussian activation function.
Figure 4.5: A radial basis quantisation of an input variable. The coding of input x is distributed between the two closest nodes.
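As a concrete sketch (with arbitrary centres and width), the normalised radial-basis coding of equations 4.1.5 and 4.1.6 can be computed as follows.

import numpy as np

def radial_gaussian_code(x, centres, w):
    """Normalised radial-basis coding (equations 4.1.5 and 4.1.6).
    x: input vector of length N; centres: (M, N) array of node centres;
    w: common width of the Gaussians in every input dimension."""
    x = np.asarray(x, dtype=float)
    centres = np.asarray(centres, dtype=float)
    n = x.size
    sq_dist = ((centres - x) ** 2).sum(axis=1)
    norm = (2.0 * np.pi) ** (n / 2.0) * w ** n        # Gaussian normalising constant
    g = np.exp(-sq_dist / (2.0 * w ** 2)) / norm       # unit activations (4.1.5)
    return g / g.sum()                                  # normalised coding vector (4.1.6)

# Example: a scalar input coded by three equally spaced centres.
phi = radial_gaussian_code([0.35], centres=[[0.0], [0.5], [1.0]], w=0.25)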
Without task-specific knowledge selecting a set of fixed centres is a straight trade-off
between the size of the search-space and the likelihood of generating ambiguous
codings. For a task with a high-dimensional state-space learning can be slow and
expensive in memory. It is therefore common practice, as we will see in the next
section, to employ algorithms that position the nodes dynamically in response to the
training data as this can create a coding that is both more compact and more directly
suited to the task.
Unsupervised learning
Unsupervised learning methods encapsulate several useful heuristics for reducing
either the bandwidth (dimensionality) of the input vector, or the number of units
required to adequately encode the space. Most such algorithms operate by attempting
to maximise the network’s ability to reconstruct the input according to some
particular criteria. This section discusses how adaptive basis functions can be used to
learn the probability distribution of the data, perform appropriate rescaling, and learn
the covariance of input patterns. To simplify the notation the dependence of the input
and the node parameters on the time t is assumed hereafter.
Learning the distribution of the input data
Many authors have investigated what is commonly known as competitive learning
methods (see [59] for a review) whereby a set of node centres are adapted according
to a rule of the form
$$\Delta \mathbf{c}_i \propto \delta(i, i^*)\,(\mathbf{x} - \mathbf{c}_i) \qquad (4.2.1)$$

where $i^*$ is the winning (nearest neighbour) node. In other words, at each time-step
the winning node is moved in the direction of the input vector4. The resulting network
provides an effective winner-takes-all quantisation of an input space that may support
supervised [105] or reinforcement learning (see next chapter). A similar learning rule
for a soft competitive network of spherical Gaussian nodes5 is given by
$$\Delta \mathbf{c}_i \propto \phi_i\,(\mathbf{x} - \mathbf{c}_i) \qquad (4.2.2)$$
In this rule the winner-only update is relaxed to allow each node to move toward the
input in proportion to its normalised activation (4.1.6). Nowlan [115] points out that
this learning procedure approximates a maximum likelihood fit of a set of Gaussian
nodes to the training data.
To avoid the problem of specifying the number of nodes and their initial positions in
the input space, new nodes can be generated as and when they are required. A
suitable heuristic for this (used, for instance, by Shannon and Mayhew [150]) is to
create a new node whenever the error in the reconstructed input is greater than a
threshold $\epsilon$. In other words, given an estimate $\hat{\mathbf{x}}$ of the input calculated by

$$\hat{\mathbf{x}} = \sum_{i=1}^{M} \phi_i\,\mathbf{c}_i \qquad (4.2.3)$$

a new node will be generated with its centre at $\mathbf{x}$ whenever

$$\left\| \mathbf{x} - \hat{\mathbf{x}} \right\| > \epsilon\,. \qquad (4.2.4)$$
4Learning is usually subject to an annealing process whereby the learning rate is gradually reduced to
zero over the duration of training.
5All nodes have equal variance and equal prior probability.
One of the advantages of such a scheme compared to an a priori quantisation, is in
coding the input of a task with a high-dimensional state-space which is only sparsely
visited. The node generation scheme will reduce substantially the number of units
required to encode the input patterns since units will never be assigned to unvisited
regions of the state space.
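A minimal sketch of the node-generation heuristic of equations 4.2.3 and 4.2.4 is given below; the function signature and the choice to centre the new node exactly on the current input are assumptions made for illustration.

import numpy as np

def maybe_add_node(x, centres, phi, threshold):
    """Reconstruct the input from the current coding (equation 4.2.3) and add a
    new unit centred on x when the reconstruction error exceeds the threshold
    (equation 4.2.4). Returns the (possibly extended) list of centres."""
    x = np.asarray(x, dtype=float)
    x_hat = sum(p * np.asarray(c, dtype=float) for p, c in zip(phi, centres))
    if np.linalg.norm(x - x_hat) > threshold:
        centres = list(centres) + [x.copy()]
    return centres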
Rescaling the input
Often a network is given a task where there is a qualitative difference between
different input dimensions. For instance, in control tasks, inputs often describe
variables such as Cartesian co-ordinates, velocities, acceleration, joint angles, angular
velocities etc. There is no sense in applying the same distance metric to such different
kinds of measure. In order for a metric such as the Euclidean norm to be of any use
here an appropriate scaling of the input dimensions must be performed. Fortunately, it
is possible to modify the algorithm for adaptive Gaussian nodes so that each unit
learns a distance metric which carries out a local rescaling of the input vectors
(Nowlan [116]). This is achieved by adding an additional set of adaptive parameters
to each node, that describes the width of the Gaussian in each of the input
dimensions. The (non-radial) activation function is therefore given by

$$g_i = g(\mathbf{x}, \mathbf{c}_i, \mathbf{w}_i) = \frac{1}{(2\pi)^{N/2} \prod_j w_{ij}} \exp\left( -\sum_{j=1}^{N} \frac{\left[ x_j - c_{ij} \right]^2}{2 w_{ij}^2} \right) \qquad (4.2.5)$$
The receptive field of such a node can be visualised as a multi-dimensional ellipse
that is extended in the jth input dimension in proportion to the width $w_{ij}$. Suitable rules for adapting the parameters of the ith node are

$$\Delta c_{ij} \propto \phi_i\,\frac{x_j - c_{ij}}{w_{ij}^2}\,, \quad \text{and} \quad \Delta w_{ij} \propto \phi_i\,\frac{(x_j - c_{ij})^2 - w_{ij}^2}{w_{ij}^3}\,. \qquad (4.2.6)$$
Figure 4.6 illustrates the receptive fields of a collection of nodes with adaptive
variance trained to reflect the distribution of an artificial data set. The data shown is
randomly distributed around three vertical lines in a two-dimensional space. The
central line is shorter than the other two but generates the same number of data points.
The nodes were initially placed at random positions near to the centre of the space.
Figure 4.6: Unsupervised learning with Gaussian nodes with adaptive variance.
Learning the covariance
Linsker [86, 87] proposed that unsupervised learning algorithms should seek to
maximise the rate of information retained in the recoded signal. A related approach
was taken by Sanger [141] who developed a Hebbian learning algorithm that learns
the first P principal components of the input and therefore gives an optimal linear
encoding (for P units) that minimises the mean squared error in the reconstructed
input patterns. Such an algorithm performs data compression allowing the input to be
described in a lower dimensional space and thus reducing the search problem. The
principal components of the data are equivalent to the eigenvectors of the covariance
(input correlation) matrix. Porrill [128] has also proposed the use of Gaussian basis
function units that learn the local covariance of the data by a competitive learning
rule.
To adapt the receptive field of such a unit a symmetric NxN covariance matrix S is
trained together with the parameter vector c describing the position of the unit’s
centre in the input space. The activation function of a Gaussian node with adaptive
covariance is therefore given by
$$g(\mathbf{x}, \mathbf{c}, S) = \frac{1}{(2\pi)^{N/2}\,|S|^{1/2}} \exp\left( -\tfrac{1}{2}(\mathbf{x} - \mathbf{c})^\top S^{-1}(\mathbf{x} - \mathbf{c}) \right) \qquad (4.2.7)$$
where $|\cdot|$ denotes the determinant. The term Gaussian basis function (GBF) unit will
be used hereafter to refer exclusively to units of this type.
In practice, it is easier to work with the inverse of the covariance matrix $H = S^{-1}$, which is known as the information matrix. With this substitution equation 4.2.7 becomes

$$g(\mathbf{x}, \mathbf{c}, H) = (2\pi)^{-N/2}\,|H|^{1/2} \exp\left( -\tfrac{1}{2}(\mathbf{x} - \mathbf{c})^\top H(\mathbf{x} - \mathbf{c}) \right) \qquad (4.2.8)$$
Given a collection of GBF nodes with normalised activations (eq. 4.1.6), suitable
update rules for the parameters of the ith node are
$$\Delta \mathbf{c}_i \propto \phi_i\,H_i(\mathbf{x} - \mathbf{c}_i)\,, \quad \text{and} \quad \Delta H_i \propto \phi_i\left[ S_i - (\mathbf{x} - \mathbf{c}_i)(\mathbf{x} - \mathbf{c}_i)^\top \right]. \qquad (4.2.9)$$
A further discussion of these rules is given in Appendix B.
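One way these rules might be realised for a single unit is sketched below; the class interface and learning rate are illustrative assumptions, and nothing is done here to keep the information matrix well conditioned (see the discussion of receptive field instability later in this chapter).

import numpy as np

class GBFUnit:
    """Gaussian basis function unit with an adaptive centre and information
    matrix, updated by the soft competitive rules of equations 4.2.8 and 4.2.9."""

    def __init__(self, centre, lr=0.05):
        self.c = np.asarray(centre, dtype=float)
        self.H = np.eye(self.c.size)        # start with a spherical receptive field
        self.lr = lr

    def activation(self, x):
        d = np.asarray(x, dtype=float) - self.c
        n = self.c.size
        return ((2 * np.pi) ** (-n / 2) * np.sqrt(np.linalg.det(self.H))
                * np.exp(-0.5 * d @ self.H @ d))        # equation 4.2.8

    def update(self, x, phi):
        """phi is this unit's normalised activation for the input x."""
        d = np.asarray(x, dtype=float) - self.c
        S = np.linalg.inv(self.H)                        # current covariance estimate
        self.c += self.lr * phi * (self.H @ d)           # move the centre toward x
        self.H += self.lr * phi * (S - np.outer(d, d))   # adapt the covariance (4.2.9)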
The receptive field of a GBF node is a multi-dimensional ellipse where the axes align
themselves along the principal components of the input. The value of storing and
learning the extra parameters that describe the covariance is that this often allows the
input data to be encoded to a given level of accuracy using a smaller number of units
than if nodes with a more restricted parameter set are used. Figure 4.7 illustrates the
use of this learning rule for unsupervised learning with a data set generated as a
random distribution around an X shape in a 2-dimensional space. The left frame in
the figure shows the receptive fields of four GBF nodes, for comparison the right
frame shows the fields learned by four units with adaptive variance alone.
Figure 4.7: Unsupervised learning with adaptive covariance (GBF) and adaptive
variance units. The units in the former case learn a more accurate encoding of the
input data.6
A set of GBF nodes trained in this manner can provide a set of local coordinate
systems that efficiently describe the topology of the input data. In particular, they can
be used to capture the shape of the manifold formed by the data-set in the input space.
Porrill [128] gives a fuller discussion of this interesting topic.
Discussion
Unfortunately, none of the techniques described here for unsupervised learning is able
to address the ambiguity problem directly. The information that is salient for a
particular task may not be the same information that is predominant in the input.
Hence though a data-compression algorithm may retain 99% of the information in the
input, it could be that the lost 1% is vital to solving the task. Secondly, unsupervised
learning rules generally attempt to split the state-space in such a way that the different
units get an equal share of the total input (or according to some similar heuristic).
However, it may be the case that a particular task requires a very fine-grained
6In this particular case the mean squared error in the reconstructed input was approximately half as
large for the GBF units (left figure) as for the units with adaptive variance only (right figure). (mse =
0.0075 and 0.014 respectively after six thousand input presentations).
quantisation in certain regions of the space though not in others, and that this
granularity is not reflected in the frequency distribution of the input data.
Alternatively, the acquired quantisation may be sufficiently fine-grained but the
region boundaries may not be aligned with the significant changes in the input-output
mapping.
In many reinforcement learning tasks the input vectors are determined in part by the
past actions of the system. Therefore as the system learns a policy for the task the
distribution of inputs will change and the recoding will either become out-dated or
will need to adapt continually. In the latter case the reinforcement learning system
will be presented with a moving target problem where the same coding may represent
different input patterns over the training period.
Finally, it is often the case that the temporal order in which the input vectors are
provided to the system is a poor reflection of their overall distribution. Consider, for
instance, a robot which is moving through an environment generating depth patterns
as input to a learning system. The patterns generated in each local area of the space
(and hence over any short time period) will be relatively similar and will not
randomly sample the total set of depth patterns that can be encountered. In order to
represent such an input space adequately the learning rate of the recoding algorithm
will either have to be extremely slow, or some buffering of the input or batch training
will be needed.
In general, unsupervised techniques can provide useful pre-processing but are not
able to discover relevant task-related structure in the input. The next section describes
some steps toward learning more appropriate input representations using algorithms
in which the reinforcement learning signal is used to adapt the input coding.
4.3 Adaptive coding using the reinforcement signal
Non-gradient descent methods for discrete codings
I am aware of two methods that have been proposed for adaptively improving a
discrete input quantisation. Both techniques are based on Watkins' Q-learning, and
therefore also require a discrete action space. Both also assume that the quantisation
of the input space consists of hyper-cuboid regions.
Whitehead and Ballard [181] suggest the following method for finding unambiguous
state representations. They observe that if a state is ambiguous relative to possible
outcomes then the Q learning algorithm will associate with each action in that state a
value which is actually an average of the future returns. For an unambiguous state,
however, the values converged upon by Q learning will always be less than or equal
to the true returns (but only if all states between it and the goal are also
unambiguous—an important caveat!). Their algorithm therefore does a search
through possible input state representations in which any one which learns to
overestimate some of its returns is suppressed. In the long run unambiguous states
will be suppressed less often and therefore come to dominate. This method has some
similarities with selectionist models of learning (e.g. Edelman [45]) since it requires
that there are several alternative input representations all competing to provide the
recoding with the less successful ones gradually dying out.
Chapman and Kaelbling [30] describe a more rigorous, statistical method based on a
similar principle. Their ‘G algorithm’ attempts to improve a discrete quantisation by
using the t-test to decide whether the reinforcement accruing either side of a
candidate splitting-point is derived from a single distribution. If the test suggests two
different underlying distributions then the space is divided at that position. The
technique can be applied recursively to any new quantisation cells that are generated.
The algorithm is likely to require very extensive training periods for the following
reasons. First, the evaluation function must be entirely re-learned every time the
quantisation is altered. Second, because the secondary reinforcement is noisy whilst
the values are being learned it is necessary to split training into two phases—value
function learning and t-test data acquisition. Finally, the requirement of the t-test that
data is drawn from normal distributions requires that the same state be visited many
times (ideally) before the splitting test can be applied.
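As an illustration of the statistical decision involved (a sketch of the test only, not of the G algorithm's bookkeeping), the returns gathered on either side of a candidate splitting point could be compared as follows; the significance level is an arbitrary assumption.

from scipy.stats import ttest_ind

def should_split(returns_left, returns_right, alpha=0.05):
    """Decide whether the returns observed on either side of a candidate
    splitting point appear to be drawn from different distributions."""
    t, p = ttest_ind(returns_left, returns_right)
    return p < alpha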
Both of these algorithms have a major limitation in that they require that the set of
potential splitting points, or alternative quantisations, is finite and preferably small.
This will clearly not be true for most tasks defined over continuous input spaces.
Gradient learning methods for continuous input spaces
In chapter two Williams’ [184] analysis of reinforcement learning algorithms as
gradient ascent learning methods was reviewed. As Williams has pointed out, once
the gradient of the error surface has been estimated it is possible to apply generalised
gradient learning rules to train multilayer neural networks on such problems. This
allows a suitable recoding of the input space to be learned dynamically by adaptation
of the connection weights to a layer of hidden units.
There are basically two alternative means for training a hidden layer of coding units
using the reinforcement signal. The first approach is the use of generalised gradient
descent whereby the error from the output layer is back-propagated to the weights on
the hidden units. This is the usual, supervised learning, method for adapting an
internal representation of the input. The second approach is a generalisation of the
correlation-based reinforcement learning rule. That is, the coding layer (or each
coding unit) attempts, independently of the rest of the net, to do stochastic gradient
ascent in the expected reward signal. Learning in this case is of a trial and error nature
where alternative codings are tested and judged by their effect on the global
reinforcement. The output layer of the network has no more direct influence in this
process than any other component of the changing environment in which the coding
system is seeking to maximise its reward. A network architecture that uses a
correlation rule to train hidden units has been proposed by Schmidhuber [143, 144]
and is discussed in the next chapter.
In general, it will be more efficient to use correlation-based learning only when
absolutely necessary [184]. That is, stochastic gradient ascent need only be used at
the output layer of the network where no more direct measure of error is possible.
Elsewhere units that are trained deterministically by back-propagation of error should
always learn more efficiently than stochastic units trained by the weaker rule. The
rest of this chapter therefore concerns methods based on the use of generalised
gradient descent training.
Reinforcement learning in multilayer perceptrons
The now classical multilayer learning architecture is the multilayer perceptron (MLP)
developed by Rumelhart et al. [140] and illustrated in figure 4.8.
layer 2: output units
layer 1: hidden units
layer 0: input units
Figure 4.8: Multilayer Perceptron architecture. (The curved lines indicate non-linear
activation functions.)
In a standard feed-forward MLP activation flows upward through the net, the output
of the nodes in each layer acting as the inputs to the nodes in the layer above.
Learning in the network is achieved by propagating errors in the reverse direction to the activation (hence the term back-propagation), where the generalised gradient-descent rule is used to calculate the desired alteration in the connection weights between units.
The activity in the hidden units provides a recoding of each input pattern appropriate
to the task being learned. In achieving this coding each hidden unit acts by creating a
partition of the space into two regions on either side of a hyperplane. The combined
effect of all the partitions identifies the sub-region of the space to which the input
pattern is assigned. This form of recoding is thus considerably more distributed than
the localist, basis function representations considered previously.
There have been several successful attempts to learn complex reinforcement learning
tasks by combining TD methods with MLP-like networks, examples being
Anderson's pole balancer [4] and Tesauro's backgammon player [167]. However, the
degree of crosstalk incurred by the distributed representation means that learning in
an MLP network can be exceptionally slow. This is especially a burden in
reinforcement learning where the error feedback signal is already extremely noisy.
For this reason, a more localist form of representation may be more appropriate and
effective for learning in such problems. This motivates the exploration of generalised
gradient learning methods for training networks of basis function units on
reinforcement learning tasks.
Reinforcement learning by generalised gradient learning in networks of
Gaussian Basis Function units.
This section describes a network architecture for reinforcement learning using a
recoding layer of Gaussian basis function nodes with adaptive centres and receptive
fields. The network is trained on line using an approximate gradient learning rule.
Franklin [51] and Millington [101] both describe reinforcement learning architectures
consisting of Gaussian basis nodes with adaptive variance only. Clearly, such
algorithms will be most effective only when the relevant task-dimensions are aligned
with the dimensions of the input space. The architecture described here is based on
units with adaptive covariance; the additional flexibility should provide a more
general and powerful solution.
An intuition into how the receptive fields of the GBF units should be trained arises
from considering Thorndike’s ‘law of effect’. The classic statement of this learning
principle (see for instance [85]) is that a stimulus-action association should be
strengthened if performing that action (after presentation of the stimulus) is followed
by positive reinforcement and weakened if the action is followed by negative
reinforcement. Consider an artificial neuron that is constrained to always emit the
same action but is able to vary its ‘bid’ as to how much it wants to respond to a given
stimulus. The implication of the law of effect is clear. The node should learn to make
high bids for stimuli where its action is rewarded, and low bids for stimuli where its
action is punished. Generalising this idea to a continuous domain suggests that the
neuron should seek to move the centre of its receptive field (i.e. its maximum bid)
towards regions of the space in which its action is highly rewarded and away from
regions of low reward. If the neuron is also able to adapt the shape of its receptive
field then it should expand wherever the feedback is positive and contract where it is
negative.
Now, if the constraint of a fixed action is relaxed then three adaptive processes will
occur concurrently: the neuron adapts its action so as to be more successful in the
region of the input space it currently occupies; meanwhile it moves its centre toward
regions of the space in which its current action is most effective; finally it changes its
receptive field in such a way as to cover the region of maximum reward as effectively
as possible. Figure 4.9 illustrates this process. The figure shows a single adaptive
GBF unit in a two-dimensional space. The shape of the ellipse shows the width of the
receptive field of the unit along its two principal axes. The unit's current action is a. If
the unit adapts its receptive field in the manner just described then it will migrate and
expand its receptive field towards regions in which action a receives positive reward
and away from regions where the reward is negative. It will also migrate away from
the position where the alternative action b is more successful. A group of units of
this type should therefore learn to partition the space between them so that each is
performing the optimal action for its ‘region of expertise’.
Figure 4.9: Adapting the receptive field of a Gaussian basis function unit according to
the reinforcement received. The unit will migrate and expand towards regions where
its current action is positively reinforced and will contract and move away from other
regions.
In an early paper Sutton and Barto [165] termed an artificial neuron that adapts its
output so as to maximise its reinforcement signal a ‘hedonistic’ neuron. This term
perhaps even more aptly describes units of the type just described in which both the
output (action) and input (stimulus sensitivity) adapt so as to maximise the total
‘goodness’ obtained from the incoming reward signal.
The learning algorithm
As in equation 4.2.8 the activation of each expert node is given by the Gaussian
$$g(\mathbf{x}, \mathbf{c}, H) = (2\pi)^{-N/2}\,|H|^{1/2} \exp\left( -\tfrac{1}{2}(\mathbf{x} - \mathbf{c})^\top H(\mathbf{x} - \mathbf{c}) \right)$$
where x is the current context, c is the parameter vector describing the position of
the centre of the node in the input space and H is the information matrix (the inverse
of the covariance matrix). Here I assume a scalar output for the network to make the
algorithm easier to describe; the extension to vector outputs is, however, straightforward.
The net output y is given as some function of the net sum s. Now if the error e in the
network output is known then update rules for the output parameter vector w and the
parameters c i and H i of the ith expert can be determined by the chain rule. Let
$$\delta = e\,\frac{\partial y}{\partial s}\,, \qquad \lambda_i = \frac{\partial s}{\partial g_i}\,, \qquad (4.3.1)$$

then

$$\Delta \mathbf{w} \propto \delta\,\frac{\partial s}{\partial \mathbf{w}} = \delta\,\boldsymbol{\phi}(\mathbf{x})\,, \qquad (4.3.2)$$

$$\Delta \mathbf{c}_i \propto \delta\,\lambda_i\,\frac{\partial g_i}{\partial \mathbf{c}_i}\,, \quad \text{and} \qquad (4.3.3)$$

$$\Delta H_i \propto \delta\,\lambda_i\,\frac{\partial g_i}{\partial H_i}\,. \qquad (4.3.4)$$
To see that these learning rules behave in the manner described above consider the
following example. Assume a simple immediate reinforcement task in which the
network output is given by a Gaussian probability function with standard deviation of
one and mean $\mathbf{w} \cdot \boldsymbol{\phi}(\mathbf{x}) = s$. After presentation of a context $\mathbf{x}$ the network receives a
reward signal r. From Williams’ gradient ascent learning procedure (Section 2.2) and
assuming a reinforcement baseline of zero we have
$$\delta = r\,(y - s) \qquad (4.3.5)$$
First of all consider the case where the outputs of the units are not normalised with
respect to each other, that is $\phi_i(\mathbf{x}) = g_i(\mathbf{x}) = g_i$ for each expert i. We have

$$\lambda_i = \frac{\partial}{\partial g_i} \left( \sum_i w_i g_i \right) = w_i\,, \qquad (4.3.6)$$

the update rules are therefore given by
$$\Delta \mathbf{w} \propto r\,(y - s)\,\boldsymbol{\phi}(\mathbf{x})\,, \qquad (4.3.7)$$

$$\Delta \mathbf{c}_i \propto r\,(y - s)\,w_i\,\phi_i\,H_i(\mathbf{x} - \mathbf{c}_i)\,, \quad \text{and} \qquad (4.3.8)$$

$$\Delta H_i \propto r\,(y - s)\,w_i\,\phi_i \left( H_i^{-1} - (\mathbf{x} - \mathbf{c}_i)(\mathbf{x} - \mathbf{c}_i)^\top \right) \qquad (4.3.9)$$
Since $\phi_i$ is always positive it will affect the size and not the direction of the change in
the network parameters. The dependence of the direction of change in parameters
according to the sign of the remaining components of the learning rules is illustrated
in the table below.
              components                       direction of change
        r      y − s     λ_i = w_i      Δw_i        Δc              ΔH
  a     +        +           +            +       toward x         grow
  b     +        −           −            −       toward x         grow
  c     −        +           +            −       away from x      shrink
  d     −        −           −            +       away from x      shrink
  e     +        +           −            +       away from x      shrink
  f     +        −           +            −       away from x      shrink
  g     −        +           −            −       toward x         grow
  h     −        −           +            +       toward x         grow
The table shows that the learning procedure will, as expected, result in each local
expert moving toward the input and expanding whenever the output weight of the unit
has the same sign as the exploration component of the action, and the reward is
positive (rows a, b). If the reward is negative the unit will move away and its
receptive field will contract (c, d). If the sign of the weight is opposite to the direction
of exploration then all these effects are reversed (e, f, g, h).
Explicit competition between units
There appears, then, to be good accordance between these training rules and the
intuitive idea for adaptive coding as a generalisation of the law of effect. However,
there remains an impression that this method of training the receptive fields is not
quite ideal. This arises because the direction of change in each of these rules depends
upon the correlation of the variation in the mean action with the absolute size of the
output weight, that is on
$(y - s)\,w_i$.
Intuitively, however, a more appropriate measure would seem to be the correlation of
the variation in the mean action with the variation of the output (of this unit)
compared with the mean output, that is
$(y - s)(w_i - s)$.
This measure seems more appropriate as it introduces an explicit comparison between
the local experts allowing them to judge their own success against the group mean
rather than against an absolute and arbitrary standard. Fortunately learning rules that
incorporate this alternative measure arise directly if the normalised activation is used
in determining the output of the local experts. If (as in equation 4.1.6) we have

$$\phi_i = \frac{g_i}{\sum_{j=1}^{M} g_j}$$

where M is the total number of local expert units, then

$$\lambda_i = \frac{\partial s}{\partial g_i} = \frac{1}{\sum_j g_j}\,(w_i - s)\,. \qquad (4.3.10)$$
From which we obtain the learning rules (from 4.3.3 and 4.3.4)
$$\Delta \mathbf{c}_i \propto \delta\,(w_i - s)\,\phi_i\,H_i(\mathbf{x} - \mathbf{c}_i)\,, \quad \text{and} \qquad (4.3.11)$$

$$\Delta H_i \propto \delta\,(w_i - s)\,\phi_i \left( H_i^{-1} - (\mathbf{x} - \mathbf{c}_i)(\mathbf{x} - \mathbf{c}_i)^\top \right) \qquad (4.3.12)$$
wherein the desired comparison measure $(w_i - s)$ arises as a natural consequence of
employing generalised gradient ascent learning.
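Drawing these pieces together, one possible training step for such a network is sketched below, combining the sampling rule of equation 4.3.5 with the updates 4.3.7, 4.3.11 and 4.3.12; the class structure, learning rates and the fixed unit exploration width are illustrative assumptions and not the implementation used in later chapters.

import numpy as np

class GBFReinforcementNet:
    """Network of normalised Gaussian basis function units whose output weights,
    centres and information matrices are all adapted from an immediate
    reinforcement signal (a sketch of equations 4.3.5, 4.3.7, 4.3.11, 4.3.12)."""

    def __init__(self, centres, lr_w=0.1, lr_c=0.05, lr_h=0.01, sigma=1.0):
        self.c = np.asarray(centres, dtype=float)       # (M, N) unit centres
        m, n = self.c.shape
        self.H = np.stack([np.eye(n)] * m)              # information matrices
        self.w = np.zeros(m)                            # output weights
        self.lr_w, self.lr_c, self.lr_h = lr_w, lr_c, lr_h
        self.sigma = sigma                              # exploration standard deviation

    def code(self, x):
        n = x.size
        g = np.array([(2 * np.pi) ** (-n / 2) * np.sqrt(np.linalg.det(H))
                      * np.exp(-0.5 * (x - c) @ H @ (x - c))
                      for c, H in zip(self.c, self.H)])
        return g / g.sum()                              # normalised activations

    def step(self, x, reward_fn):
        x = np.asarray(x, dtype=float)
        phi = self.code(x)
        s = self.w @ phi                                # mean action
        y = np.random.normal(s, self.sigma)             # exploratory action
        r = reward_fn(x, y)
        delta = r * (y - s)                             # equation 4.3.5, zero baseline
        self.w += self.lr_w * delta * phi               # equation 4.3.7
        for i in range(len(self.w)):
            d = x - self.c[i]
            k = delta * (self.w[i] - s) * phi[i]        # competitive comparison term
            self.c[i] += self.lr_c * k * (self.H[i] @ d)                              # eq. 4.3.11
            self.H[i] += self.lr_h * k * (np.linalg.inv(self.H[i]) - np.outer(d, d))  # eq. 4.3.12
        return r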
Further refinements and potential sources of difficulty
The use of GBF networks for a simple immediate reinforcement learning task is
described below; their application to difficult delayed reinforcement tasks is
investigated in the next chapter. Before describing any implementations, however,
some refinements to the learning system and potential problems (and possible
solutions) will be considered.
Learning scaling parameters
An important extension of the learning scheme outlined above involves adapting the
strength of response of each unit independently of the position and shape of its
receptive field. This sets up a competition for a share in the output of the network
between the different ‘regions of expertise’ occupied by the units. One of the benefits
of this competition is a finer degree of control in specifying the shape and slope of the
decision boundaries between competing nodes. For each unit an additional parameter
pˆ i is used which scales the activation of the ith node during the calculation of the
network output, i.e. the activation equation becomes
$$g_i = \hat{p}_i\,(2\pi)^{-N/2}\,|H_i|^{1/2} \exp\left( -\tfrac{1}{2}(\mathbf{x} - \mathbf{c}_i)^\top H_i (\mathbf{x} - \mathbf{c}_i) \right) \qquad (4.3.13)$$
The learning rule for the scale parameter is then given by
$$\Delta \hat{p}_i \propto \delta\,\lambda_i\,\frac{\partial g_i}{\partial \hat{p}_i}\,. \qquad (4.3.14)$$
The $\hat{p}_i$ s must be non-zero and sum to unity over the network. This requires a slightly
complicated learning rule since a change in any one scale parameter must be met by a
corresponding re-balancing of all the others. A suitable learning scheme (due to
Millington [101]) is described in Appendix B.
Receptive field instability
The learning rules for the node receptive fields described above do not guarantee that
the width of the field along each principal axis will always be positive. This is
guaranteed if the covariance matrix remains positive definite; a simple check,
necessary though not in general sufficient, is that its determinant remains positive
and non-zero. A pragmatic fix for this problem is to test the determinant after each
update and reinitialise any node receptive field which fails the test. A better
solution, however, is to adapt the square
root of the covariance matrix rather than attempt to learn the covariance (or
information) matrix directly. Algorithms using the square root technique are
described in [17].
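A sketch of the square-root idea follows: if the information matrix is stored as $H_i = A_i A_i^{\mathsf{T}}$ and the gradient steps are applied to $A_i$ rather than to $H_i$, then $H_i$ is positive semi-definite by construction. The mapping of the gradient onto $A$ follows from the chain rule. This is only an illustration of the parameterisation, not the specific algorithms of [17].

```python
import numpy as np

def grad_wrt_sqrt(grad_H, A):
    """Given a gradient dL/dH for H = A A^T, return the gradient dL/dA.
    Chain rule: dL/dA = (dL/dH + dL/dH^T) A."""
    return (grad_H + grad_H.T) @ A

def update_sqrt(A, grad_H, eta=0.01):
    """Update the square-root factor; H = A A^T remains positive semi-definite."""
    A = A + eta * grad_wrt_sqrt(grad_H, A)
    return A, A @ A.T   # new factor and the reconstructed information matrix
```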
Keeping the GBF units apart
A further problem that can arise in training sets of GBF nodes is that two nodes will
drift together, occupy the same space, and eventually become identical in every
respect. This is a locally optimal solution for the gradient learning algorithm and is
clearly an inefficient use of the basis units. To overcome this problem a spring
component that implements a repulsive ‘force’ between all of the node centres can be
added to the learning mechanism (see also [126]). This is not always a desirable
solution, however. For instance, it could be the case that two nodes have their centres
almost exactly aligned but differ considerably in both the shape of their receptive
field and their output weights. This can be a very efficient manner of approximating
some surfaces (see next chapter) but cannot arise if the spring heuristic is used to
keep the units separated.
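A minimal sketch of such a spring component is given below: each pair of node centres contributes a repulsive displacement that falls off with their separation. The exponential form and the strength and length-scale constants here are illustrative choices, not the particular mechanism used in Appendix B.

```python
import numpy as np

def spring_repulsion(centres, strength=0.01, length_scale=0.1):
    """Repulsive 'spring' displacements pushing every pair of node centres apart."""
    M = len(centres)
    push = np.zeros_like(centres)
    for i in range(M):
        for j in range(M):
            if i == j:
                continue
            diff = centres[i] - centres[j]
            dist = np.linalg.norm(diff) + 1e-12
            # force directed away from the other centre, decaying with distance
            push[i] += strength * np.exp(-dist / length_scale) * diff / dist
    return push   # to be added to the centres alongside the gradient update
```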
Staying in the data
The converse of the problem of units drifting together is that they may drift too far
apart. Specifically, some units can be pushed beyond the margins of the sampled
region of the input space through the action of the learning rule (for a unit to be
totally inactive is another locally optimal solution). A possible way to keep units ‘in
the data’ would be to implement some form of conscience mechanism whereby
inactive units have higher learning rates (see Appendix B for more on this topic) or to
use unsupervised learning to cause units that are under-used to migrate toward the
input. Both these devices will only be of use, however, if the temporal sequence of
inputs approximates a random sampling of the input space, a requirement that rarely
holds for learning in real-time control tasks.
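One simple form such a conscience mechanism could take is sketched below: a running estimate of each unit's activity is kept, and the effective learning rate of a unit is scaled up when its activity falls below its fair share. This is an illustration of the idea only, not a mechanism used in the experiments reported here.

```python
import numpy as np

def conscience_rates(usage, activity, base_rate=0.05, decay=0.99):
    """Track a running usage estimate per unit and return per-unit learning rates
    that are boosted for under-used units."""
    usage = decay * usage + (1 - decay) * activity        # update the usage trace
    fair_share = 1.0 / len(usage)
    rates = base_rate * fair_share / np.maximum(usage, 1e-6)  # under-used => larger rate
    return usage, rates
```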
Relationship to fuzzy inference systems
One of the most attractive features of basis function approximations is their
relationship to rule-based forms of knowledge, in particular, what are known as fuzzy
inference systems (FIS). An FIS is a device for function approximation based on fuzzy
if-then rules such as
“If the pressure is high, then the volume is small”
An FIS is defined by a set of fuzzy rules together with a set of membership functions
for the linguistic components of the rules, and a mechanism, called fuzzy reasoning,
for generating inferences. Networks of Gaussian basis function units have been
shown to be functionally equivalent to fuzzy inference systems [67]. In other words,
the local units in GBF networks can be directly equated with if-then type rules. For
instance, if we have a network of two GBF units a and b (in a 2D space with a single
scalar output), then an equivalent FIS would be described by
Rule A: If $x_1$ is $A_1$ and $x_2$ is $A_2$, then $y = w_A$;
Rule B: If $x_1$ is $B_1$ and $x_2$ is $B_2$, then $y = w_B$.
Here the membership functions ($A_1$, $A_2$, $B_1$, and $B_2$) are the components of the
(normalised) Gaussian receptive fields of the units in each input dimension. The
functional equivalence between the two systems allows an easy transfer of explicit
knowledge (fuzzy rules) into tuneable implicit knowledge (network parameters) and
vice versa. In other words, a priori knowledge about a target function can be built in
to the initial conditions of the network. Provided these initial fuzzy rules give a
reasonable first-order approximation to the target then learning should be greatly
accelerated and the likelihood of local optima much reduced. This ability to start the
learning process from good initial positions should be of great value in reinforcement
learning where tabula rasa systems can take an inordinately long time to train.
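As an illustration of this transfer, the sketch below turns a small set of hypothetical fuzzy rules, each given as a Gaussian membership function (centre and width) per input dimension plus a consequent output, into the initial centres, diagonal information matrices and output weights of a GBF network. The rule values are invented purely for the example.

```python
import numpy as np

# Hypothetical rules: per-dimension membership (centre, width) and a consequent output.
rules = [
    {"memberships": [(0.25, 0.1), (0.25, 0.1)], "output": 1.0},   # "Rule A"
    {"memberships": [(0.75, 0.1), (0.75, 0.1)], "output": 0.0},   # "Rule B"
]

def rules_to_gbf(rules):
    """Initialise a GBF network from fuzzy if-then rules."""
    centres = np.array([[c for c, _ in r["memberships"]] for r in rules])
    # Diagonal information matrices: 1/width^2 along each input dimension.
    H = np.array([np.diag([1.0 / w**2 for _, w in r["memberships"]]) for r in rules])
    weights = np.array([r["output"] for r in rules])
    return centres, H, weights

centres, H, weights = rules_to_gbf(rules)   # starting point for subsequent tuning
```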
A simple immediate reinforcement learning problem
To demonstrate the effectiveness of the GBF reinforcement learning mechanism this
section describes its application to a simple artificial immediate reinforcement
learning task. Its use for more complex problems involving delayed reinforcement is
discussed in the next chapter.
Figure 4.10 shows a two-dimensional input space partitioned into two regions by the
boundaries of an X shape. The ‘X’ task is defined such that to achieve maximum
reward a system should output a one for inputs sampled from within the X shape
and a zero for inputs sampled from the area outside it.
Figure 4.10: A simple immediate reinforcement task.
In the simulations described below the network architecture used a Bernoulli logistic
unit (see section 2.2.2) to generate the required stochastic binary output. A spring
mechanism was also employed to keep the GBF nodes separated. Full details of
the algorithm are given in Appendix B, where suitable learning rate parameters are also
described.
Networks of between five and ten GBF nodes were each trained on forty thousand
randomly selected inputs. Over the period of training the learning rates of the
networks were gradually reduced to zero to ensure that the system settled to a stable
configuration. Each network was then tested on ten thousand input points lying on a
uniform 100x100 grid. During this test phase the probabilistic output of the net was
replaced with a deterministic one, i.e. the most likely output was always taken.
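A minimal sketch of a training step for a task of this kind is given below. It assumes a Williams-style REINFORCE update for the Bernoulli logistic output unit driven by the normalised basis activations, and a simple diagonal-cross membership test standing in for the X region of figure 4.10; the exact algorithm and parameter settings used here are those of Appendix B, which may differ in detail.

```python
import numpy as np

def in_x_shape(x1, x2, half_width=0.15):
    """Illustrative membership test for an X formed by the two diagonals of the unit square."""
    return (abs(x2 - x1) < half_width) or (abs(x2 - (1.0 - x1)) < half_width)

def train_step(x, w, activations, rng, lr=0.1):
    """One immediate-reinforcement step for a Bernoulli logistic output unit."""
    p = 1.0 / (1.0 + np.exp(-(w @ activations)))     # probability of outputting a one
    y = 1.0 if rng.random() < p else 0.0             # stochastic binary action
    target = 1.0 if in_x_shape(*x) else 0.0
    r = 1.0 if y == target else 0.0                  # reward for the correct output
    w += lr * r * (y - p) * activations              # REINFORCE-style correlation update
    return w
```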
The learning mechanism was initially tested with units with fixed (equal) scale
parameters. Ten runs were performed with each size of network. The results for each
run, computed as the percentage of correct outputs during the test phase, are given in
Appendix B. In all, the best average performance was found with networks of eight
GBF units (hereafter 8-GBF nets). Figure 4.11 shows a typical run for such a net.
Initially the nodes were randomly positioned within 0.01 of the centre of the space.
By five hundred training steps the spring component of the learning rule has caused
the nodes to spread out slightly but they still have the appearance of a random cluster.
The next phase of training, illustrated here by the snapshot at two thousand time-steps, is characterised by movement of the node centres to strategic parts of the space
and adaptation of the output weights toward the optimal actions. Soon after,
illustrated here at five thousand steps, the receptive fields begin to rotate to follow the
shape of the desired output function. The last twenty thousand steps mainly involve
the consolidation and fine tuning of this configuration.
Figure 4.11: Learning the X task with an 8-GBF network. The figures show the
position, receptive field and probability of outputting a one for each GBF unit after
500, 2,000, 5,000 and 40,000 training steps.
Figure 4.12 shows that the output of the network during the test phase (that is, after
forty thousand steps) is a reasonable approximation to the X shape.
Figure 4.12: Test output of an 8-GBF network on the X task. Black shows a
preferred output of 1, white a preferred output of zero.
Averaged over ten runs of 8-GBF networks, the mean score on the test phase was 93.6%
optimal outputs (standard deviation 1.1%). This performance compared favourably
with that of eight-unit networks with adaptive variance only. The latter, being unable
to rotate the receptive fields of their basis units, achieved average scores of less than
90%.
Performance at a similar level to the 8-GBF networks was also achieved on most runs
with networks of seven units and on some runs using networks of six units (though in
the latter case the final configurations of the units were substantially different).
However, with fewer than eight units locally optimal solutions in which the X shape
is incompletely reproduced7 were more likely to arise.
Networks larger than eight units did not show any significant improvement in
performance over the 8-GBF nets, indeed, if anything the performance was less
consistently good. There are two observations that may be relevant to understanding
this result. First, on some runs with larger network sizes one or more units is
eventually pushed outside the space to a position in which it is almost entirely
inactive. This effectively reduces the number of units participating in the function
7For instance, one arm of the X might be missing or the space between the two arms incorrectly filled.
approximation. Second, with the larger nets, the number of alternative near optimal
configurations of units is increased. These networks are therefore less likely to
converge to the (globally) best possible solution on every run.
The experiments with GBF networks of five to ten units were repeated this time with
the learning rule for adapting the scale parameters switched on. Though the overall
performance was similar to that reported above, quantitatively the scores achieved
with each size of network were slightly better. Again the best performance was
achieved by the 8-GBF nets with a mean score of 95.4% (σ = 0.56), showing a significant8
improvement when compared to networks of the same size without adaptive scaling.
Once more there was no significant improvement for net sizes larger than eight units
indicating a clear ceiling effect9. Figure 4.13 shows the final configuration and test
output of a typical 8-GBF network with adaptive scaling. The additional degrees of
freedom provided by the scaling parameters result in the node centres being more
widely spaced, generating an output which better reproduces the straight edges and
square corners of the X.
8t = 4.47, p = 0.0003.
9A run of 15-GBF nets also failed to produce a higher performance than the 8-unit networks.
Figure 4.13: GBF network with adaptive priors. The numbers superimposed on the
nodes show the acquired scale factors (0.251, 0.003, 0.001, 0.271, 0.209, 0.002, 0.001,
0.260; the output probabilities were all near deterministic, i.e. >0.99 or <0.01). The
test output of this net indicates a better reproduction of the straight edges and square
corners of the X.
Conclusion
This chapter has considered a wide range of possible methods for generating a
suitable input coding of a continuous input space for immediate and delayed
reinforcement learning systems. Local coding methods have been emphasised as
their relative immunity to spatial crosstalk means they can learn with reasonable
speed even in tasks with impoverished feedback. The problem of generating
unambiguous codings has been considered and a gradient ascent learning mechanism
for adapting a layer of Gaussian basis function (GBF) units described. This algorithm
has been successfully applied to a simple immediate reinforcement task. In the next
chapter the extension of methods for adaptive coding to delayed reinforcement
problems will be investigated.
Chapter 5
Experiments in Delayed Reinforcement
Learning Using Networks of Basis
Function Units
Summary
The previous chapter introduced methods for training a layer of basis function units to
provide an adaptive coding of a continuous input space. This chapter applies these
systems to a real-time control task involving delayed reinforcement learning.
Specifically, the pole balancing task, which was used by Barto, Sutton and Anderson
[11] to demonstrate the original actor/critic learning system, is used to evaluate two
new architectures. The first uses a single memory-based, radial basis function coding
layer to provide input to actor and critic learning units. The second architecture
consists of separate actor and critic networks each composed of Gaussian basis
function (GBF) units trained by generalised gradient learning methods. The
performance of both systems on the simulated pole balancing problem is described
and compared with other recent reinforcement learning architectures for this task [4,
137, 143]. The analysis of these systems focuses, in particular, on the problems of
input sampling that arise in this and other real-time control tasks. The interface
between explicit task knowledge and adaptive reinforcement learning is also
considered, and it is suggested that the GBF algorithm may be particularly suitable for
refining the control behaviour of a coarsely pre-specified system.
5.1 Introduction
The pole balancing or inverted pendulum problem has become something of an acid
test for delayed reinforcement learning algorithms. The clearly defined nature of the
task, the fine-grained control required for a successful solution, and the availability of
results from other studies make it an appropriate test-bed for the learning methods
described in the previous chapter.
Pole balancing has a long history in the study of adaptive control and many different
variants of the problem have been explored (see Anderson [3] for a review). As a
problem in delayed reinforcement learning it was first studied by Barto, Sutton, and
Anderson [11] and the same version of the task (originally from Michie and
Chambers [99]) has since been used to evaluate several reinforcement learning
architectures involving adaptive coding [4, 137, 143]. Figure 5.1 illustrates the basic
control problem:
Figure 5.1: The cart/pole system modelled in the pole balancing task.
A pole is attached by a frictionless hinge to a cart which is free to move back and
forth along a length of track. The hinge restricts the movement of the pole to the
vertical plane. The continuous trajectory of the system is modelled as a sequence of
discrete time-steps. At each step the control system applies a fixed force (±10 N) to
the base of the cart pushing it either to the left or the right. The task for the system is
to balance the pole for as long as possible using this ‘bang-bang’ mechanism for
moving the cart.
The cart begins each trial at the centre of a finite length (2.4m) of track. Each trial
ends either when the pole falls beyond an angle of 12° from vertical (a pole failure),
or once the movement of the cart carries it beyond either end of the track (a track
failure). The information that allows the control system to learn and select
appropriate actions is supplied by a context input, describing the current state of the
cart-pole system, and a reinforcement signal. The context is a vector of four
continuous variables describing the momentary angle and angular velocity of the pole
(θ and θ̇) and the position and horizontal velocity of the cart along the track
(x and ẋ). The reinforcement signal is a single scalar value provided after each action
which is non-zero only when that action results in failure (and the end of the current
trial) at which point a punishment signal of -1 is provided. This non-zero feedback
occurs only after long sequences of actions by the control system, making the task one
of minimal, delayed reinforcement.
The equations describing the dynamics of the cart/pole simulation are taken from [11]
and are given in Appendix C.
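For reference, a sketch of a single Euler step of the cart/pole dynamics in the form commonly used with this benchmark is given below (gravity 9.8 m/s², cart mass 1.0 kg, pole mass 0.1 kg, pole half-length 0.5 m, ±10 N force, 0.02 s time-step). The exact equations and constants used in the present simulations are those reproduced in Appendix C.

```python
import math

GRAVITY, M_CART, M_POLE, HALF_LEN, FORCE_MAG, DT = 9.8, 1.0, 0.1, 0.5, 10.0, 0.02

def cart_pole_step(x, x_dot, theta, theta_dot, push_right):
    """One Euler integration step of the standard cart/pole equations of motion."""
    force = FORCE_MAG if push_right else -FORCE_MAG
    total_mass = M_CART + M_POLE
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + M_POLE * HALF_LEN * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        HALF_LEN * (4.0 / 3.0 - M_POLE * cos_t ** 2 / total_mass))
    x_acc = temp - M_POLE * HALF_LEN * theta_acc * cos_t / total_mass
    # Euler update of the four state variables
    return (x + DT * x_dot, x_dot + DT * x_acc,
            theta + DT * theta_dot, theta_dot + DT * theta_acc)

def failed(x, theta):
    """Trial ends on a track failure (|x| > 2.4 m) or a pole failure (|theta| > 12 degrees)."""
    return abs(x) > 2.4 or abs(theta) > 12 * math.pi / 180
```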
Characteristics of the pole balancing task
Before investigating alternative learning systems for this task it is worthwhile
discussing some of the characteristics of the pole balancing task that determine what
sorts of solutions are possible.
Sampling the input space
Many studies of learning bypass the problem of input sampling by arranging cycles of
training examples or by selecting input points from a flat, random distribution. This is
not, however, a situation that is ever faced by a real-time control system, be it animal
or machine. One of the most problematic aspects of real-time control is that inputs
tend to arrive in a highly non-random order. The learning system is therefore faced
with the difficult task of trying to find solutions that are valid for the whole space
when only a biased sample of that space is available within a given time-span. The
following description gives an informal analysis of the types of sampling bias that
arise in the pole-balancing task. This analysis should help in evaluating the
performance of different learning systems on this specific problem and also in
judging how robust these systems will be to sampling bias in general.
Biased sampling in the pole balancing task arises over several time-scales and in both
the context inputs and the reward signals.
Short term bias
The moment to moment state changes of the cart-pole system are relatively small,
hence, in the short term (tens of time-steps), the controller samples a long, narrow
trajectory of points in the input space with each pattern being very similar to the
previous one.
Medium term bias
In the medium term (hundreds of time-steps) the system will observe substantial
variation in pole angle, angular velocity and cart velocity but relatively small, gradual
change in the horizontal position of the cart. On this time-scale, therefore, the system
will sample only a narrow range of cart positions.
Long term bias
If the system is uncontrolled or randomly controlled the pole will fall past the critical
angle long before the cart reaches the extreme ends of the track. Pole failures are
therefore likely to happen within a much shorter time-scale (tens to hundreds of time-steps) than track failures (hundreds to thousands of time-steps). A consequence of this
is that during the early period of learning most trials will end due to pole failures
(henceforth stage one). Thereafter, once the system has discovered ways of
preventing the pole from falling, punishment will almost exclusively be as a result of
track failures (stage two). There are thus dramatic long-term (tens or hundreds of
trials) changes in the way that the alternative sources of reinforcement are sampled
by the control system.
During the first stage of learning the system will sample a wide range of pole angles
and angular velocities but a relatively narrow range of cart positions. During the
second stage this pattern is reversed with the distribution of pole angles covering a
narrow band near the vertical, angular velocities being relatively small, but a wider
range of cart positions and velocities being sampled (albeit over a slower time-scale).
In stage two the system may also get locked into certain patterns of behaviour, for
instance, always travelling to the left and failing at the far left end of the track. Whole
regions of the state-space may then be unrepresented in the input stream for many
trials. These effects all represent long term changes in the way the control system
samples the input space.
Multiple sources of reinforcement
The symmetry of the pole balancing task is evident to any human observer, however,
this knowledge is not available to the adaptive machine controller. From the
machine’s perspective, therefore, dropping the pole either to the left or to the right
constitutes two distinct sources of negative reward, reaching either end of the track
adds a further two. The existence of multiple sources of reward creates a hill-climbing
landscape for the learning system in which there will certainly be some locally optimal
strategies whereby some sources of punishment are avoided but not others. In view of
the sampling bias problems described above, learning a strategy that simultaneously
avoids both ends of the track may be particularly difficult.
Non-linearities and critical boundaries in the input-output mapping
Symmetry considerations and lay physics also make it intuitively evident that there
will be a critical turning point for any optimal control strategy near to the centre
position of the space (vertical pole, centre track, zero velocities). It also seems likely
that accurate positioning of this control boundary will be required for successful
control. Again, knowledge of this symmetry, or of critical regions and boundaries in
the input space, is not generally available to the learning system.
5.2 Fixed and unsupervised coding methods for the pole balancing task
Previous approaches
A priori quantisation
Barto et al.’s [11] solution for the pole-balancing task used an Actor/Critic system
with a hand-tuned hard quantisation scheme. The coding mechanism, adopted from
an earlier study by Michie and Chambers [99], involved splitting the state-space into
‘boxes’ formed by dividing the range of each variable into sections. The pole angle
θ was assigned six sections and each of the other variables (θ̇, x, and ẋ) three, giving
6 × 3 × 3 × 3 = 162 hyper-cube boxes in all. This system learned to balance the pole
for in excess of fifty thousand time-steps (equivalent to approximately fifteen minutes
of real time) within fifty to one hundred trials. Although the performance of this
system is impressive, it is fair to say that its success depended critically on the
carefully tailored coding scheme. The quantisation encoded both task-relevant
knowledge of the distribution of the input data, and partial knowledge of the
symmetry and critical decision regions of the task. The problems arising from
sampling bias were also largely avoided by having only fixed, local controllers (the
‘boxes’) each responsible for a small region of the state-space.
Unsupervised learning using Kohonen networks
Ritter et al. [137] used a variant of the pole balancing task to evaluate the use of
Kohonen’s self-organising networks for quantising the continuous space underlying
motor control problems. Kohonen’s [70] learning rule extends the unsupervised,
nearest neighbour learning methods described in section 4.2 to networks in which the
units are arranged in a pre-defined topology, typically a multi-dimensional grid. Each
time an input pattern is presented the centre of the closest, winning unit is moved
toward the input. In addition, however, units that are near to the winner in the
network topology, or index space, are also updated. The amount by which each of
these neighbouring units is moved depends on a neighbourhood function (typically a
Gaussian) of the distance in the index space from the unit to the winner. By reducing
with time both the width of the neighbourhood function and the overall learning rate
the network relaxes into a self-organised map in which the topology of the network
connections reflects the topology of the input space, that is, neighbouring units
correspond to neighbouring input patterns. As in other unsupervised learning methods
the distribution of units in space also comes to reflect the underlying distribution of
the input data.
Ritter et al. performed several experiments with a two-dimensional pole balancer in
which the problem was limited to balancing the pole regardless of the position or
horizontal velocity of the cart10. The pole angle and angular velocity (θ and θ̇) were
provided to a network of 25x25 nodes arranged in a grid. Each node had a single
output weight which learned the required action for the region of the state-space it
encoded. Whilst the node positions were adapted by Kohonen’s training rule the
output weights were trained either by supervised learning (the teaching signal being
given as a function of ! and !˙ ) or by immediate reinforcement learning (a reward of
! " 2 was given at every time-step). In both cases the network learned a suitable
recoding of the input space and acquired successful control behaviour.
To enable comparison of the Kohonen input layer with other adaptive coding methods
a version11 of the algorithm was tested here on the delayed reinforcement pole
balancing task described above. These experiments were largely unsuccessful as a
result of inadequate codings generated by the Kohonen learning rule. This poor
10Various other aspects of the task were varied from those described above, however, these details are
not important to the current discussion.
11In Kohonen’s original learning rule the positions of all the coding units in the network are given
small initial random values at the start of training. The network then gradually untangles itself as
learning proceeds. Luttrell [90] has pointed out that a large proportion of training is taken up
unravelling this initial knotted configuration. He proposed instead a multistage version of the algorithm
in which the full-size network is built-up incrementally by successive periods of training and splitting
smaller networks. This multistage learning algorithm, which shows advantages both in training time
and in the degree of distortion in the trained network, was used in the experiments alluded to here.
performance by the unsupervised learning system appeared to arise as a direct result
of the biased sampling of the input space—because the target input distribution is not
adequately represented in any time slice of experience, both distortions (some parts of
the space over-represented, others under-represented) and instabilities (no
convergence to a stable configuration over time) arose in the Kohonen mappings. As
the temporal distribution of inputs varies in the short, medium and long-term,
techniques for coping with sampling bias such as batch training, buffering the input,
or using very low learning rates did not significantly improve performance.
In the experiments performed by Ritter et al. the positions of the Kohonen nodes were
adapted only over the first thousand training steps. Over this time period the
neighbourhood function and learning rates were gradually reduced so that the
network settled to a fixed configuration. During this very early stage of learning the
pole is relatively uncontrolled and the system is likely to sample most of the
attainable input states. However, if the network is trained over a longer period, as is
essential for the full four-dimensional task, the effect of learning a successful control
strategy is to shrink the sampled region of the state-space, disrupting the input coding
as an undesirable but inevitable consequence. This suggests that Ritter et al.’s
network was effective only because learning was limited to a narrow time-slice of
experience which, fortuitously, sampled the state-space in a relatively unbiased
manner.
It seems likely that other unsupervised learning algorithms whose heuristic power
depends on the ability to adapt over time to the distribution of input data will be
subject to the same catastrophic difficulties on this task.
Memory-based coding using radial basis function units
In view of the problems in adapting an input coding by unsupervised learning an
alternative memory-based approach was attempted using networks of radial basis
function (RBF) units. The idea with this method is that, starting with an empty
network, new nodes are generated at fixed positions in the input space whenever the
current input is inadequately represented (similar methods have also been proposed in
[150]). A suitable node generation scheme was described in equations 4.2.3, 4.2.4
and involves creating a new node at the current input position whenever the error in
reconstructing the input is greater than a fixed threshold. The approach is termed
memory-based since, rather than adapting the node positions, learning is simply a
matter of storing unfamiliar input patterns. The resulting coding reflects the spatial
distribution of the input points—nodes are placed wherever inputs arise—but ignores
their temporal distribution. Although this technique avoids problems with sampling
bias, there is clearly a price to be paid in terms of the efficiency of the resulting
coding. The density of nodes in different regions of the input space will be the same
regardless of whether a given region generates a large or small number of inputs.
An actor/critic architecture using such a memory-based RBF coding was implemented
here for the pole balancing task. This architecture is illustrated in figure 5.2 and
described in detail in Appendix C. The following gives a brief qualitative description
of the learning system.
Figure 5.2: RBF architecture for the pole balancing task. The thick line between the
basis units indicates that the activation values are normalised; the dotted line indicates
that additional basis nodes are added during training.
At each time-step each of the basis units calculates its activation as a fixed-width
Gaussian function of the distance from its centre to the current input point. The
activations of all the basis nodes are then normalised (indicated by the thick line
between the units) so that the total sums to unity. The input is then reconstructed by
calculating the vector sum of the unit centres scaled by their normalised activation
values. The Euclidean distance between this reconstructed input and the original is
then computed. If this value is greater than a threshold the input coding is judged
inadequate, a new basis unit is added at the input position, and the recoding vector
(i.e. the vector of normalised activation values) is recalculated. This recoding vector
then serves as input, via weighted connections, to a linear critic unit and a binary
actor unit. The weights from each basis unit to the actor and critic elements are
initially zero and are trained over time following a standard delayed reinforcement
training procedure.
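The node-generation step at the heart of this coding can be sketched as follows: the current input is reconstructed from the normalised activations and, if the reconstruction error exceeds the threshold, a new unit is stored at the input point. The fixed width and threshold below are placeholders; the values actually used are given in Appendix C.

```python
import numpy as np

def recode(x, centres, width=0.3, threshold=0.2):
    """Return the recoding vector, adding a node if the input is poorly represented."""
    if len(centres) == 0:
        centres.append(x.copy())                              # first input becomes the first node
    g = np.array([np.exp(-np.sum((x - c) ** 2) / (2 * width ** 2)) for c in centres])
    rho = g / g.sum()                                         # normalised activations
    reconstruction = np.sum(rho[:, None] * np.array(centres), axis=0)
    if np.linalg.norm(x - reconstruction) > threshold:        # coding judged inadequate
        centres.append(x.copy())                              # store the unfamiliar input
        g = np.append(g, 1.0)                                 # new node is maximally active at x
        rho = g / g.sum()                                     # recalculate the recoding vector
    return rho, centres
```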
Results
Tests were performed with different values for the node-generation threshold creating
networks of varying size. It was found that to give a coding with an adequate
resolution for the task required networks of at least one hundred basis units. In the
experiments described below the network size was limited to a maximum of 162 units
giving the same number of output parameters as in Barto et al.’s a priori coding
system. Further tests were performed to determine suitable global parameters for the
actor/critic learning systems. The experiments described below used values that
seemed to give acceptable results (also given in the Appendix), however, the time
required to train the system prohibited a systematic search of possible parameter
values.
The learning system was tested over ten runs. Each run was terminated after one
thousand trials, five hundred thousand time-steps, or once the network succeeded in
balancing the pole for fifty thousand time-steps on a single trial. The latter condition
indicates balancing for more than fifteen minutes of real time which, as in [11], is
taken to indicate that optimal behaviour was achieved. Performance reached the
criterion level on all ten runs. On average this level of success was achieved on the
290th trial or after approximately 183,000 training-steps. This rate of acquisition is
only slightly slower than that observed with Barto et al.’s a priori coding (acquisition
in the initial stages of learning is actually slightly faster). Figure 5.3 shows the
distribution of the basis nodes in the input space on a successful run. The graphs show
the positions of the centres in the two-dimensional space defined by each pair of input
variables.
Figure 5.3: Positions of radial-basis function nodes in the 4-d space (θ, θ̇, x, ẋ),
illustrated by their projection into the 2-d space defined by each pair of variables12.
12The range of each variable (from left to right or bottom to top) is as follows:
θ: −12° to +12°; θ̇: −150°/s to +150°/s; x: −2.4 m to +2.4 m; ẋ: −3 m/s to +3 m/s.
The figure illustrates both what is good and what is bad about the coding mechanism.
The memory-based node generation algorithm places units only where there is input
data. Since many of the variables are strongly correlated (for instance θ̇ with both
θ and ẋ) large regions of the input space are never sampled, hence there is a
significant economy in the number of units required compared, say, with a uniform
resolution tiling. However, the inefficiency of the coding is also clearly demonstrated
by the figure. For instance, many basis units are placed in extreme regions of the
state-space (for instance, states with very high horizontal velocities) that are sampled
very rarely and, in any case, may represent unrecoverable positions. Furthermore, the
resolution of the coding is uniform throughout the space, being the same in the
sparsely visited extreme regions as in the critical area in the centre of the space. To
achieve an adequate resolution at the centre the threshold for generating new nodes
must be kept relatively low creating an excess of units in non-critical areas.
In that the memory-based coding method generates, automatically, an effective input
coding for this task, it is more general than systems that depend on their designers for
a tailored, a priori quantisation. At the same time, however, being a brute-force
method—scattering nodes around the input space in an almost haphazard fashion—it
is both inelegant and inefficient and fails to take advantage of (potentially) available
information about the adequacy of the coding. This observation motivates the
exploration in the remainder of this chapter of learning systems that attempt to use the
reinforcement signal to produce a coding that is specifically tuned to the task.
5.3 Adaptive coding using the reinforcement signal
Previous approaches
Multilayer Perceptron (MLP) networks
Anderson [4] proposed a reinforcement learning architecture that uses two MLP
networks, one to act as the actor learning system, the other as the critic. By separating
the two learning systems entirely, each system is free to learn an internal
representation of the input specific to its needs. Anderson successfully applied this
architecture to the pole balancing task using the network structure for both the actor
and critic systems illustrated in figure 5.4.
Figure 5.4: Anderson’s MLP architecture for the pole balancing task.
The input to each network consisted of the four state variables (θ, θ̇, x, and ẋ),
normalised to lie approximately between nought and one, plus a fifth input with a
constant, positive value. The output element in each net had both direct weighted
connections to these inputs and indirect connections via five hidden units with
sigmoidal activation functions. The critic system computed a simple linear output,
whilst the actor system generated a stochastic binary output. The hidden layer of
each network was trained using a generalised gradient learning procedure based on
the well-known MLP back-propagation algorithm.
This system took between five and ten thousand trials to learn to balance the pole for
in excess of fifty thousand time-steps. In other words, learning was at least an order
of magnitude slower than with either the a priori or memory-based learning
architectures described above. Several other disadvantages with this learning system
are also worth noting. First, Anderson reported that an extended search for values of
the global learning parameters was needed to obtain networks that would converge to
an effective solution. Such sensitivity to the learning rates was not noted with the
systems described in the previous section. Second, the cart/pole simulation had to be
started from a random initial state on each trial. This was required in order to increase
the sampling of different track positions and of track failures. Without countering the
inherently uneven sampling bias of the task in this manner the system failed to solve
the task of keeping the cart on the track.
The disadvantages of using MLP networks in reinforcement learning have already
been noted in the previous chapter. Here it suffices to recall that the distributed nature
of the recoding performed by the hidden units makes the learning process very
vulnerable to interference between different input patterns requiring different outputs
(spatial crosstalk). This is one possible cause of the extremely long learning times
required by Anderson’s simulation.
Recurrent networks
Schmidhuber [143, 144] describes an interesting recurrent network architecture for
reinforcement learning in which a correlation learning rule is used to train the hidden
units. His network consists of a number of fully connected primary units each of
which is a Bernoulli logistic unit producing a stochastic binary output. At each
simulation step every primary unit receives a combined input vector consisting of the
current context together with the outputs of all the primary units computed on the
previous time-step. One primary unit is nominated to act as the actor, that is, its
output is taken to be the control signal for the system overall. A secondary unit,
which receives as its input the same combined input vector but has no recurrent links
to other units, acts as a linear critic element. This architecture is shown in figure 5.5.
Figure 5.5: Recurrent network architecture for reinforcement learning. Thin lines
show feed-forward connections, the heavy lines indicate the two-way connections
between the primary units.
The critic element computes an error according to the normal temporal difference
comparison rule which is used to update the weights on the critic’s input lines by
gradient descent training. For the primary network the same temporal difference error
is used to adjust the weights for the recurrent units according to the following rule. If
the combined state vector is given by $x(t)$ and the TD error by $e_{TD}(t+1)$, then the
weight $w_{ij}$ on the directed connection from unit $i$ to unit $j$ is adjusted by

$$\Delta w_{ij} \propto e_{TD}(t+1)\,x_i(t+1)\,x_j(t). \qquad (5.2.1)$$
The effect of this rule is clearly to make the last transition more likely if the error signal
is positive and less likely when the error is negative. The update rule is therefore a
correlation learning rule consistent with the law of effect.
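In code, equation 5.2.1 amounts to a single outer-product update of the recurrent weight matrix, scaled by the TD error; the sketch below assumes a weight matrix W with W[i, j] holding the connection from unit i to unit j and a fixed learning rate, both of which are illustrative choices.

```python
import numpy as np

def recurrent_correlation_update(W, x_prev, x_next, td_error, lr=0.1):
    """Equation 5.2.1: strengthen the last transition when the TD error is positive,
    weaken it when negative. x_prev = x(t), x_next = x(t+1)."""
    W += lr * td_error * np.outer(x_next, x_prev)   # delta_w[i, j] = lr * e_TD * x_i(t+1) * x_j(t)
    return W
```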
For the pole-balancing task the primary network consisted of five units. The input to
the system was, as in Anderson’s simulation, the four normalised state variables
together with a fifth constant input. This system learned the pole-balancing task faster
than the MLP architecture, achieving runs of up to 30,000 time-steps within two to six
thousand trials, and runs in excess of 300,000 time-steps if the following
interventions were made: first, learning was switched off once the system achieved
the milestone of 1,000 successive balancing steps, and second, the stochastic
activation of the primary units was made deterministic (i.e. by always selecting the
activation that was most probable).
Given the observation, made in the previous chapter, that correlation learning is a
weaker method than back-propagation of error, it is something of a puzzle that
Schmidhuber’s recurrent architecture learned the pole balancing task faster than
Anderson’s MLP system. This question will be addressed at the end of the chapter as
the experiments reported in the next section appear to cast some light on this issue.
Gaussian Basis Function (GBF) networks
The procedure for training networks of GBF units by gradient ascent in the
reinforcement signal was evaluated using the pole balancing task. The principal
difference between the architecture used here and that employed for the immediate
reinforcement learning task (described in the previous chapter) is the use of a
second GBF network to act as the critic learning system.
The learning system was evaluated with 2, 3, and 4 GBF units in each network. At the
start of each run the GBF units were initialised to random positions near the centre of
the input space with the principal axes parallel to the input dimensions and with small
random perturbations in the initial widths. A number of test runs were performed to
determine suitable learning parameters for the two networks. However, because of the
number of global parameters and the time required to train each system no systematic
search of the parameter space was possible. The test runs indicated that systems with
as few as two GBF units in each network could perform reasonably well on the task
provided the node receptive fields in the critic network were allowed to overlap. The
spring mechanism which forces the unit centres to spread out was therefore switched
off throughout the experiments reported below.
The same performance criteria were applied as in the experiment with the memory-based RBF coding system. In other words, the system was tested over ten runs where
each run was terminated after one thousand trials, five hundred thousand time-steps,
or once the network succeeded in balancing the pole for fifty thousand time-steps on a
single trial. Figure 5.6 shows the results for a learning system with two GBF units in
each network (2/2-GBF networks hereafter). The table shows, for each run, the
number of steps in the most successful trial and the total number of trials and
training-steps up to and including the best trial.
run            1        2       3       4        5        6        7      8        9       10
best trial     30,160   2,146   6,965   50,000   29,214   27,232   532    50,000   7,730   24,284
total trials   837      315     348     145      854      796      415    873      823     779
total steps    372k     796k    195k    78k      218k     233k     53k    365k     473k    204k
Figure 5.6: Performance of the 2/2-GBF learning system on the pole balancing task.
Although only two of the ten runs reached the criterion level of performance, six out
of the ten achieved maximum balancing times in excess of twenty thousand steps
(equivalent to more than six minutes of real time). The system is thus at least partially
successful in solving the task. Systems with more coding units did not perform any
better than the 2/2-GBF networks. The main reason for this is that units tended to
migrate together if allowed to do so. When the spring mechanism was engaged to
prevent this happening this created spurious barriers in the error landscape which
caused the system to become stuck in poor configurations.
It is possible to gain some understanding of what the system has learned by
examining the positions of the GBF nodes in the input-space and the shapes of their
receptive fields. Figure 5.7 shows the final configuration of the actor network in the
eighth training run projected onto the plane formed by each pair of input variables.
Figure 5.7: The balancing strategy acquired by the actor learning system. Each graph
shows the positions and receptive field shapes of the two GBF nodes projected onto
the two-dimensional plane formed by each pair of normalised input variables.
The principal strategy in evidence is that of pushing the cart to the right when the pole
is angled or falling right and to the left when the pole is angled or falling left (this
strategy is evident from the positions of the two opposing units in 5.7a). However, a
second, subtler strategy has also evolved to cope with the problem of keeping the cart
on the track. Specifically, the system shows a preference for pushing the cart left
when it is near the left end of the track and right when it is near the right end of the
track (see 5.7b for instance). This behaviour biases the preferred balancing angle of
the pole to be right of vertical when the pole is near the left track end and left of
vertical near the right track end resulting in compensatory movement of the cart back
toward the centre of the track13. Finally, there is a slight preference for pushing the
cart left when it is moving right and vice versa (5.7c).
Figure 5.8 shows this balancing strategy as a set of three-dimensional graphs showing
the preferred action plotted against the pole angle and angular velocity for different
values of the cart position and cart velocity. In all the graphs the decision boundary is
approximately parallel to the axis θ = −θ̇, providing the main component of the
balancing behaviour that keeps the pole near upright. However, the shift in this
decision boundary (i.e. the balancing angle) towards the left as the cart nears the left
of the track and towards the right as the cart approaches the right boundary is also
clearly evident (compare 5.8a and 5.8i for instance).
The critic system learns to represent the evaluation function for this task by adapting
the receptive field shapes of its two GBF nodes rather than the node positions. Figure
5.9 illustrates projections of the final configuration of the critic net. In order to make
the shapes of the receptive fields more discernible only the central region of each
plane is shown.
The figure illustrates that the centres of the two units are effectively coincident on or
near the centre of the space. One unit, however, has a smaller receptive field and has
acquired a positive prediction (+ 0.21). This unit is drawn as a white ellipse in the
figure partially eclipsing the second unit which has a larger receptive field, a negative
weighting (−0.24), and is drawn in black. Because the first node has a smaller
receptive field its activation is stronger in the central region of the space, giving
near-zero evaluations in this region. The second node, having a larger receptive field,
dominates the peripheral regions of the space, giving generally lower evaluations
(down to -0.3) in these regions.
13Anderson [3] reports that a similar strategy, though more pronounced, was acquired in his MLP
learning system for the pole balancer.
Figure 5.8: The preferred action of the actor learning system (given by the probability
of pushing the cart to the right, plotted between 0.0 and 1.0) against the angle and
angular velocity of the pole for different values of the cart position and horizontal
velocity. Panels: a) x ≈ −2.4 m, ẋ ≈ +1.5 m/s; b) x ≈ −2.4 m, ẋ ≈ 0.0 m/s;
c) x ≈ −2.4 m, ẋ ≈ −1.5 m/s; d) x ≈ 0.0 m, ẋ ≈ +1.5 m/s; e) x ≈ 0.0 m, ẋ ≈ 0.0 m/s;
f) x ≈ 0.0 m, ẋ ≈ −1.5 m/s; g) x ≈ +2.4 m, ẋ ≈ +1.5 m/s; h) x ≈ +2.4 m, ẋ ≈ 0.0 m/s;
i) x ≈ +2.4 m, ẋ ≈ −1.5 m/s.
Figure 5.9: GBF receptive field shapes for the critic learning system. The unit with
positive weight is drawn in white, partially eclipsing the larger unit with a negative
weighting. Each graph is limited to the region of dimension 0.3 × 0.3 around the centre
of the normalised input space.
The subtler variations in the shapes of the receptive fields also appear to have a
significant effect on the acquired function. The most striking of these is evident from
graph 5.9a. Here the principal axis of the larger unit has rotated and widened along
the line θ = θ̇. The perpendicular axis of this unit, lying along the line θ = −θ̇, is
narrower and is almost the same width as the smaller unit which has a relatively
uniform diameter in all directions. This configuration results in a ridge of high
evaluations along θ = −θ̇, with negative evaluations toward the corners of the space
where the angle and angular velocity of the pole have the same sign. The ability to
adapt the full covariance is clearly essential to representing this aspect of the function
(given only two units).
There are further fine differences in the receptive field shapes due to cart velocity (ẋ).
The precise effect of these differences is difficult to deduce from figure 5.9 but can be
more easily observed by plotting the acquired evaluation function as a set of
three-dimensional graphs, as in figure 5.10. These graphs show a consistent variation
in the ridge of high predictions along the θ = −θ̇ axis for different values of cart
velocity. Specifically, if the cart is moving toward the right and the pole is tilting
towards the left then the prediction is higher than if the pole is tilting right (5.10c).
Similarly if the cart is moving towards the left then the system is ‘happier’ when the
pole is tilting to the right (5.10g).
A surprising feature of figure 5.10 is that there is only a small degree of variation in
the evaluation function with cart position. This observation suggests that, despite the
long balancing times on some runs, the learning system may not be particularly
sensitive to the punishment meted out for reaching either end of the track. To
determine whether or not this was the case two final control experiments were
performed. In both experiments the learning system received the normal negative
reward for pole failures but the punishment signal for track failures was suppressed14.
In the first control experiment (C1 hereafter) the inputs to the learning system were
limited to just the pole angle and angular velocity, in the second (C2) all four context
variables were provided. Figure 5.11 shows the number of steps in the longest trial for
ten runs of each experiment.
run   1     2       3        4       5        6       7       8     9     10
C1    373   450     575      518     451      1,093   508     656   397   403
C2    444   2,731   15,092   3,210   29,022   310     1,781   371   687   10,526
Figure 5.11: Performance on the pole balancing task without punishment for track
failures for systems with two- and four-dimensional input vectors. The difference in
performance between the two systems is statistically significant (Mann-Whitney U =
20, p < 0.025).
14More precisely, the final training step on trials ending in track failure was not performed. Hence,
there was no opportunity in this experiment for the control system to learn ‘what happens’ at the end of
the track.
Figure 5.10: The prediction computed by the critic learning system plotted (between
0.0 and −0.3) against the angle and angular velocity of the pole for different values of
the cart position and horizontal velocity. Panels: a) x ≈ −2.4 m, ẋ ≈ +1.5 m/s;
b) x ≈ −2.4 m, ẋ ≈ 0.0 m/s; c) x ≈ −2.4 m, ẋ ≈ −1.5 m/s; d) x ≈ 0.0 m, ẋ ≈ +1.5 m/s;
e) x ≈ 0.0 m, ẋ ≈ 0.0 m/s; f) x ≈ 0.0 m, ẋ ≈ −1.5 m/s; g) x ≈ +2.4 m, ẋ ≈ +1.5 m/s;
h) x ≈ +2.4 m, ẋ ≈ 0.0 m/s; i) x ≈ +2.4 m, ẋ ≈ −1.5 m/s.
The first system, with inputs relating only to the angle and angular velocity of the pole,
rarely achieved balancing times in excess of a thousand time-steps. On each of these
runs the system successfully avoided pole failures after approximately one hundred
trials15; however, the actual balancing angle was not constrained by this learning
process. As a result the pole was usually balanced slightly off vertical resulting in
compensatory horizontal movements of the cart and consistent track failures.
The surprising result, however, is shown in the second control experiment, C2. Here
substantially longer trials—tens of thousands of time-steps—were achieved on some
runs. On these runs it appeared that the system was learning to reduce horizontal
movement by balancing the pole as close to vertical as possible. This occurs in spite
of the absence of any negative reward signals for track failures. The extra constraint
on balancing angle appears to arise because of the strong negative correlation, noted
earlier, between pole angular velocity and cart horizontal velocity. The system learns
that movement of the cart is associated with dropping the pole, and, as a result, it
learns to keep the cart near stationary thus indirectly postponing failure at the track
boundary. Although on average the system in the original experiment (that did
receive the track punishment signal) performed marginally better than C2, this
difference is barely statistically significant16. It is not possible, then, to conclude with
any certainty that the learning system was improving its performance on the basis of
the delayed reward signals provided by the track failures.
The long balancing times achieved both by the original and C2 systems were not
simply due to some lucky initial configuration of the GBF units. This is demonstrated
for the latter in figure 5.12 which shows a graph of the number of steps in each trial
of run three in the C2 experiment. The graph shows the characteristic two-stage
learning process outlined earlier. Over the first 150–250 trials the system learns to
successfully keep the pole near vertical, thereafter all trials end due to track failures.
The learning system in C2 received no primary reinforcement throughout the second
learning stage (from approx. trial 250 onwards), in spite of this there is a gradual
increase in balancing times culminating at trial 944 in a period of successful
15This assertion is supported by an experiment in which an identical system was tested without the
constraint of finite track length. Over ten runs this system succeeded in balancing the pole for in excess
of 10,000 time-steps (the cut-off point where indefinite balancing was assumed) within 60 to 140 trials.
16Mann-Whitney U = 28.0, p<0.1.
balancing lasting over fifteen thousand time-steps. The graph also shows a drop in the
performance of the system immediately following this long period of successful
balancing. This phenomenon, which was characteristic of both the original and C2
experiments, has two implications. First, that the performance of both systems was
very sensitive to small changes in the control strategy, in other words, the acquired
behaviour is not robust. Second, that the severe sampling bias that occurs during the
longer balancing periods—the system samples a very narrow region of the state-space
for an extended period—has a significant disruptive effect on the acquired control
behaviour. The system is thus penalised by its own success.
Figure 5.12: Characteristic learning curve for the second control experiment (steps
balanced per trial plotted against trial number). Balancing continues to improve after
the cessation of primary reinforcement (c. trial 250).
Comparison with other systems
The above results perhaps cast some light on the behaviour of the pole balancing
systems developed by Anderson [3,4] and Schmidhuber [143,144]. Both these
systems achieved what might be called ‘perfect balancing behaviour’, but both
required specific changes either to the task or to the learning system to make this
possible.
First, consider Schmidhuber’s recurrent correlation-learning system. In order to
achieve maximum balancing times the learning mechanism was disengaged once the
system reached a thousand successive balancing steps. This intervention avoids the
undesirable effects of sampling bias observed above in very long trials. The action
selection mechanism was also altered from stochastic to deterministic. Without this
intervention a very high variance in trial length was observed. Thus the acquired
behaviour was not a robust solution to the task but one which was very sensitive to
noise in the action selection mechanism.
Anderson’s MLP (gradient descent) learning system, on the other hand, achieved
success by altering an important characteristic of the task—the initial starting position
of the cart/pole system. This change has a significant effect in countering medium and
long-term sampling bias of cart positions and track failures. It was noted above that
this system took far longer to solve the pole balancing problem than Schmidhuber's
system. One possible explanation for this difference might be that qualitatively
different solutions to the task were acquired. Specifically, it may be the case that
Schmidhuber’s system adapted principally to the correlation between cart velocity
and pole failure rather than to track failures directly. Anderson’s system, on the other
hand, forced to cope with difficult starting positions, acquired a more robust strategy
that was more sensitive to track failures (which were themselves less masked by
sampling noise). If this hypothesis is correct, this might account for the faster
acquisition of control behaviour observed in the system with the weaker training rule.
The coding units in both of the above systems generate an internal representation by
fitting hyper-planes through the input-space. The GBF system, by contrast, forms a
more localist representation. With hindsight, the sample biasing problems inherent in
the pole balancing task might be expected to create more problems for a localist
coding than for a more distributed one. The former, being by definition more
sensitive to local change, is also more likely to be disturbed by it.
5.4 Discussion
This chapter has described two novel delayed reinforcement learning methods for the
pole balancing task. In the first memory-based method a soft quantisation is generated
by scattering large numbers of basis function nodes around the populated regions of
the state-space. This method is effective and learns rapidly but is clearly inefficient in
terms of its memory and processing demands. The second, GBF generalised gradient
learning method is quite radically different. Here a very small, possibly minimal,
number of units are placed in the state-space and allowed to adapt both their outputs
and the exact positions and shapes of their regions of expertise in a manner that
maximises the global reward. Remarkably, perhaps, this system is able to extract
appropriate training signals from the very sparse primary reinforcement signals. This
method is not, however, robust to bias in the way the control task samples the input
space and available rewards. Furthermore, being a gradient learning system trying to
live off a very noisy error signal, it may find solutions that are not globally optimal.
Adaptive recoding and dynamic programming
It is clearly difficult to reconcile adaptive recoding methods with the view of delayed
reinforcement learning as incremental dynamic programming. In the circumstances of
the learning systems discussed above the memory-based system is clearly closest to
having Markovian state descriptions. Here the coding for each underlying task state is
relatively fixed from the time it is first encountered and is also very localist.
However, in the case of any method that uses the reward signal to adapt the internal
representation, the Markov assumption can hardly apply, for the following
reasons. First, although there may be sufficient information implicit in the context
input to determine the causal processes underlying events, this knowledge is not, at
least at the start of learning, coded in anything like a satisfactory form. Second,
during training the coding of any given world state will change, presenting the
learning system with a moving target problem and the impression of a non-stationary
underlying process. Finally, the recodings acquired by these systems are generally
specific to the function being learned—radically different representations can be
acquired, for instance, for determining predictions and actions. It seems likely that
such representations will not be suited to the acquisition of the very different
mappings (i.e. the transition and reward functions) that could support causal
knowledge of the task.
These observations suggest that it is inappropriate to consider generalised gradient
learning systems for delayed reinforcement learning from the perspective of
incremental dynamic programming. Thus, strong convergence properties cannot be
expected and the performance of the learning system may improve or worsen with
time. It seems clear, however, from the empirical studies described here and
elsewhere that the ability of these algorithms to climb the reward gradient can allow
successful acquisition of skill. These learning systems may therefore be useful in
tasks where no richer feedback signal than the delayed reward measure is possible
(note: the pole-balancer is not of this sort17!), and where it is not known a priori what
the relevant aspects of the context information will be.
Suggestions for coping with input sampling problems and local minima
The experiments with the pole balancing problem demonstrate that critical
characteristics of a task can be so masked by bias in the temporal sampling of
contexts and rewards as to become almost invisible. This indicates a need to develop
systems that are able to detect bias and make appropriate allowances for it. Such
systems, which would make use of uncertainty or world models (as discussed in
chapters two and three), could then control some of the attentional and exploratory
aspects of learning. For instance, a system which modelled the frequency and/or
recency of visits to different regions of the state-space could detect over- or under-sampling and take appropriate actions such as biasing exploration, suppressing
learning, triggering new trials in unexplored regions, etc.
A possible mechanism for achieving improved learning is the use of an adversarial or
parasitic system. This is a secondary learning system, whose outputs control critical
task parameters of the primary learner. The adversary is rewarded whenever the
primary system performs badly and punished whenever it does well. The idea is that
17Clearly the deviation of the pole from vertical or of the cart from the track centre could serve to
provide far richer feedback about the moment-to-moment performance of the system.
this second system will expose the weaknesses of the primary system forcing it to
abandon control policies that are locally optimal in favour of more robust solutions.
Hillis [60] has demonstrated that co-evolving parasitic systems can be used to coerce
a genetic algorithm to find improved solutions to an optimisation problem. In turn-taking tasks, such as multi-player games, the primary system can, of course, be its
own adversary if it is set to play against itself. This technique was employed by
Tesauro [167] in training his reinforcement learning backgammon player and is
possibly one of the factors that enabled that system to achieve a near-expert level of
play.
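To make the adversarial idea concrete, the toy sketch below couples two simple hill-climbers through a sign-reversed reward; the task, the reward function, and the learning rule are all invented for illustration and bear no relation to the systems discussed above.

```python
import random

# Toy illustration of an adversarial (parasitic) pairing: the primary
# learner tunes a control parameter x to maximise reward, while the
# adversary tunes a task parameter d so as to *minimise* the primary's
# reward. Both are naive stochastic hill-climbers; the reward function is
# invented purely for illustration.

def reward(x, d):
    return -(x - d) ** 2          # the primary does best when x matches d

x, d = 0.0, 1.0
for step in range(5000):
    # Primary: keep a perturbation of x only if it improves the reward.
    x_trial = x + random.gauss(0.0, 0.1)
    if reward(x_trial, d) > reward(x, d):
        x = x_trial
    # Adversary: keep a perturbation of d only if it *lowers* the reward,
    # continually exposing the weaknesses of the current policy.
    d_trial = d + random.gauss(0.0, 0.1)
    if reward(x, d_trial) < reward(x, d):
        d = d_trial
```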
Implicit and explicit knowledge
Perhaps one of the most striking differences between the memory-based and GBF
learning methods is in the accessibility of the knowledge they contain. The memory-based method involves a very large number of nodes that individually do not reflect
the task structure in any clear way. This form of knowledge is clearly on the implicit
side—it is difficult to extract knowledge from the system or down-load task
knowledge into it. In contrast the second method appears to straddle the
implicit/explicit divide. The knowledge encoded in the individual units reflects the
structure of the task in a meaningful way and, through the similarity with fuzzy
reasoning, rule-like knowledge could be ‘programmed in’ to be improved through
experience. This suggests that one role for such systems in reinforcement learning
tasks may be as tuning mechanisms for coarse control strategies that are initially set
up by other processes or learning methods. A second role might lie in extracting (by
some supervised learning process) the knowledge from less accessible systems like
the memory-based learner. That is, the less memory-efficient, more opaque, but also
more robust system would learn the task through reinforcement learning. This
network would then act as a skilled but dumb teacher for a more compact system of
trainable ‘experts’ that would identify and refine the essential task knowledge.
Conclusion
The idea of adapting the internal representation of a task state-space by hill-climbing
in the delayed reinforcement signal stretches the possibility of machine learning
toward one of its limits. This chapter has demonstrated that learning of this nature is
possible even in awkward, real-time control tasks such as the pole balancer.
However, such learning is undoubtedly very time-consuming and carries no guarantee
of an optimal solution. I have argued here that the principal value of such methods
may be more as a mechanism for refining ‘first guess’ control strategies derived from
other sources than as a means of learning effective behaviour from scratch. The
systems of trainable Gaussian basis function units investigated here may be
particularly useful in this respect as they can provide a smooth interface between
explicit task knowledge and adaptable motor skill.
Chapter 6
Adaptive Local Navigation
Summary
So far in this thesis learning has been considered largely for its own sake and
beyond the context of any single problem domain. The remaining chapters have a
different emphasis, focusing instead on a crucial problem area in Artificial
Intelligence and Cognitive Science, that of spatial learning for the task of
navigation in a mobile agent.
Several authors [37, 102, 172, 188] have suggested a division in navigation
competence between tactical skills that deal with the immediate problems
involved in moving efficiently while avoiding collisions, and strategic skills that
allow the successful planning and execution of paths to distant goals. These two
levels of navigation skill are here termed local navigation and way-finding1
respectively. This chapter proposes a functional distinction between the two levels
of skill. Namely, that local navigation can be efficiently supported by adaptive,
task-oriented, stimulus-response mechanisms, but that way-finding requires
mechanisms that encode task-independent knowledge of the environment. In other
words, whereas local navigation can operate through acquired simple associations
allowing very rapid, but largely stereotyped reactions to environmental events,
1This term is borrowed from the psychological literature. It is used here in place of the usual AI
term path-planning as the latter often connotes a complete solution to the problem of finding a
trajectory through an environment to a desired goal. In contrast the term way-finding is intended
to describe the task of choosing appropriate strategic level control behaviour leaving the tasks of
local path-planning and obstacle avoidance to the tactical navigation systems.
way-finding requires mechanisms that construct and use internal models that
encode the spatial layout of the world.
The principal focus in this chapter is the problem of learning local navigation
skills (way-finding is the main topic of chapters seven and eight). I argue that
delayed reinforcement learning is a suitable mechanism for acquiring such
behaviours as it allows a direct link to be forged between the evaluation of skills
and the tactical goals they aim to fulfil. An architecture for an adaptive local
navigation module is then proposed that instantiates this idea.
To evaluate this approach a prototype model of an acquired local navigation
competence is described and tested in a simulation of a mobile robot2. This
models a small, agile vehicle with minimal sensory apparatus that must learn to
move at speed around a simulated two-dimensional environment while avoiding
collisions with stationary or slow-moving obstacles.
The control system developed for this task acquires appropriate real-valued
actions for input patterns drawn from a continuous space. This system combines
the Actor/Critic architecture with the method for adjusting local exploration
behaviour suggested by Williams [184] (to my knowledge this is the first
application of Williams’ method to a delayed reinforcement task). Input vectors
are recoded using a priori coarse-coding techniques (CMAC networks [2]). The
learning system adapts its responses to sensory input patterns, encoding only
partial state information, so as to minimise the negative reinforcement arising
from collisions and from an internal ‘drive’ that encourages the vehicle to
maintain a consistent speed. Successful acquisition of local navigation skill is
demonstrated across several environments with different obstacle configurations.
2This work has previously been described in [133, 134].
6.1 Introduction
The fundamental skill required by a mobile agent is the ability to move around in
the immediate environment flexibly, quickly and safely. This will be referred to
here as tactical, local navigation competence. A second, higher level of valuable
strategic expertise is the ability, here called way-finding, to plan and follow routes
to desired goals. The next three chapters are concerned with computational
models of the mechanisms that might underlie these different levels of navigation
behaviour.
I suggest below that the distinction between local navigation and way-finding is
not simply a matter of efficient hierarchical planning—dividing control into long
and short-term decision processes. Rather, that it reflects a fundamental functional
difference between the two interacting mechanisms that are required to allow an
agent to successfully navigate its world.
Local Navigation
Local navigation covers a wide range of behaviours, for example, Gibson [53]
describes nine different problems in the control of locomotion: starting, stopping,
backing-up, steering, aiming, approaching without collision, steering among
obstacles, and pursuit and flight. In natural systems there is an evolutionary
pressure for many of these skills to be very rapidly enacted. Faced with an
imminent collision with an unforeseen obstacle, for example, or with the sight of
a predator or prey there is very little time to plan an effective course of action. An
animal that can put into effect an appropriate pre-compiled or stereotypical action
plan will be able to respond most rapidly and therefore have the best survival
chance3. In robot systems there is a similar premium in having fast reactions that
can provide smooth, highly responsive behaviour without the need for time-
3The predator-prey evolutionary ‘arms race’ has led to some incredibly fast local navigation
reflexes. In reviewing research in this field Gronenberg et al. [55] cite several remarkable
examples, for instance the jumps of fleas take circa 1 ms (millisecond); the escape responses of
cockroaches c. 40 ms, and of certain fish c. 35 ms; take-off behaviour in locusts and flies also
occurs within fractions of a second. The preying responses of the predators that catch these
animals are even faster.
consuming planning processes. These observations suggest that, wherever
possible, local navigation skills might be best implemented by rapid reflexive,
stimulus-response behaviours.
An increasing number of researchers (see Meyer [98] and Maes [91] for reviews)
have taken the view that interesting, plan-like behaviour can emerge from the
interplay of a set of pre-wired reflexes with the evolving structure of the world. A
similar approach has been taken by Brooks [20, 22, 24] in building control
systems for mobile robots. He suggests that fast reactions and robust behaviour
might be best achieved through task-dedicated sub-systems that use direct sensor
data, little internal state, and relatively simple decision mechanisms to choose
between alternative actions. His subsumption architecture consists of a hierarchy
of sub-systems that implement behaviours at several levels of competence. The
lowest levels are concerned with the basic, self-preserving, motor-control
activities such as stopping and collision avoidance while higher levels implement
more abstract behaviours with longer-term goals. The control architecture is
developed by an incremental bottom-up process in which the lowest level of
competence is built first and fully debugged. Layers of more complex behaviour
are then gradually added where each layer ‘subsumes’ part of the competence of
those below by suppressing their outputs and substituting its own.
This emphasis on developing structures that achieve self-sustaining goals is useful
not least because it reflects the pressures that have guided the evolution of natural
intelligence. Moreover, it has led to the development of several modular, task-oriented systems for robust local navigation in mobile robots (for instance [6, 21,
38, 158]). However, the structures often proposed for implementing individual
behaviours operate largely through the use of heuristic rules. They are therefore
shaped primarily by the ingenuity of their designers and only indirectly influenced
by the robot’s ability to operate effectively in its environment. The robot builder
creates a candidate control system, then through an iterative process of observing
the robot’s behaviour and modifying its architecture this initial system is refined
to give more effective behaviour.
Clearly, however, the temporal structure in an agent's interaction with its
environment can act as more than just a trigger for fixed reactions. Given a
suitable learning mechanism, it may be possible to acquire sequences of new (or
modified) responses more suited to the task by attending to pertinent
environmental feedback signals. The learning methods described in previous
chapters are clearly candidate mechanisms for these purposes, and, below, the
possibility of acquiring effective local navigation behaviour through delayed
reinforcement learning will be investigated.
Way-finding
Way-finding refers to the ability to move to desired locations in the world. More
specifically, it is taken here to be the ability to travel from a starting place A to a
goal place B where B is (usually) outside the scene visible from A but can be
reached by physical travel. In other words, it is about finding and following
routes in large-scale4 environments of which the agent has first-hand experience.
In contrast to local navigation which, I have suggested, can operate through
simple reflexive skills, it seems evident that a simple stimulus-response mapping
cannot support general way-finding behaviour. This is so for the reason, also
discussed in previous chapters, that reflexive skills are, inevitably, goal- or task-specific. In other words, although a chain of conditioned reflexes could be
acquired that would allow travel to a particular goal (as, for instance, in the
simulated maze experiments in chapter three) it does not contain enough
information to be of use in finding a path to any other destination. Information
about the agent’s interaction with the environment is, in this form, simply too
inflexible and specific to have any general value for navigating large-scale space.
If acquiring way-finding skills were simply a case of learning suitable response
chains, then learning to travel between arbitrary starting and destination points
would be a hugely slow and laborious process.
This is not to say that simple associative learning cannot play a role in traversing
large-scale environments. It may well be the case that a certain route is
sufficiently well-travelled that the often-repeated sequence of moves can be stored
intact rather than recreated on every trip. However, whereas in local navigation a
4Kuipers [77] defines a large-scale environment as “one whose structure is revealed by integrating local observations over time, rather than being perceived from one vantage point.”
small number of stereotyped behaviours could be sufficiently general as to be
applicable in wide-ranging circumstances, in navigating a large-scale environment
the opportunity to repeat any pre-compiled action sequence will arise far less
frequently. This leads to the view that spatial knowledge of large-scale
environments should be organised in a manner that is more flexible and task-independent, that is, it should involve some form of acquired model. The question
of the form such models should take—what information is stored, and how it is
acquired and used—is one of the longest running controversies in both
psychology and artificial intelligence. Chapters seven and eight consider this
debate.
6.2 A reinforcement learning module for local navigation
The remainder of this chapter seeks to demonstrate that effective local navigation
behaviour can be developed by acquiring conditioned responses to perceptual
stimuli on the basis of feedback signals about the success of actions in achieving
tactical goals. Such feedback will necessarily be qualitative and often intermittent
in nature hence training methods will generally involve delayed reinforcement
learning. The Actor/Critic systems considered in previous chapters are clearly
appropriate for reasons that can be summarised as follows:
• Learning is driven by the system’s ability to achieve its goals.
• Behaviour is reactive—at any moment the system responds only to the immediate
sensory input.
• The system acts to anticipate future rewards but does no explicit planning or
forward search.
• Short term memory requirements are minimal—the system remembers only the
most recent input patterns and chosen actions.
• Continuous actions can be learned.
• Prediction and action functions can be based on different, appropriate recodings of
a continuous input space.
Assuming a modular decomposition of the overall skill into task-specific
components, a suitable architecture for an adaptive local navigation module is as
illustrated in Figure 6.1.
[Figure schematic: the module receives context and reward inputs; the motivate component converts the rewards into a primary reinforcement signal for the critic, the critic passes a prediction error to the actor, and the actor generates the module’s outputs.]
Figure 6.1: Architecture of an adaptive local navigation module.
The inputs to the adaptive module consist of signals from sensors or from other
modules in the control system. The outputs, which will generally be interpreted as
control signals, may also feed into other modules, or they may control actuators
directly. The distinction between context and reward is intrinsic to the design of
the control system—the output of a sensor, for instance a collision detector, could
act as context to one module and as reward to another. The nature of the primary
reinforcement is decided within the adaptive module itself. This function is
performed by the motivate component which computes an overall reinforcement
signal as some pre-determined function of the input reward signals. The design of
a module for any given task therefore involves specifying the following:
• The context and reward inputs.
• The form of the required outputs and their initial values.
• The motivation function and the time horizon.
• Mechanisms for recoding the input space.
• Output selection mechanisms.
• Learning algorithms.
This architecture is investigated below in the context of a modular control system
for a simulated mobile robot operating in a simple two-dimensional world. The
goal of the adaptive module, in this case, is to acquire reflexive behaviours that
allow the vehicle to move at speed whilst avoiding collisions with obstacles.
Actions are acquired by the actor learning system as Gaussian distributions (using
the method suggested by Williams [184]) which specify a range of possible
behaviours from which control signals are selected. Success in the task is thus
equivalent to learning effective wandering behaviour for the simple environments
considered.
The following section gives a description of the simulation and the overall
control architecture. To make the account more readable much of the detail of the
system is given in Appendix D, the aim of the description here being to give a
largely qualitative overview of how the system operates.
6.3 A simulation of adaptive local navigation
The architecture of the control system used in the simulation is illustrated in
Figure 6.2. The overall system is made up of five linked modules. Two of these,
touch and map, are associated with sensor systems; two are control modules—
recover, a fixed collision recovery competence, and wander, the adaptive local
navigation module; the final module, motor control, generates low-level
instructions for controlling the vehicle’s wheel speeds.
[Figure schematic: the touch and map sensor modules feed the recover and wander control modules; wander contains the motivate, critic, and actor components and receives motor feedback; recover can issue stop, back, and turn commands and reset signals to the learning components; the selected action (f, a) is passed to the motor control module, which sets the left and right wheel speeds (lws, rws).]
Figure 6.2: Modular control architecture for adaptive local navigation in a
simulated mobile robot.
For the purposes of simulation the continuous control behaviour and trajectory of
a robot vehicle is approximated by a sequence of discrete time intervals. At each
step the model robot acquires new perceptual data; learns; generates control
signals; and executes motor commands. In the following the perceptual, motor,
and collision recovery components of the simulation are described first. This
provides the context for a more detailed description of the specific architecture for
the adaptive control module.
Perception
One of the goals of this research is to explore the extent to which spatially sparse
but frequent data can support complex behaviour. For this reason the map
module, which provides the context input for the learning system, simulates a
sensor system that detected the distance to the nearest obstacle at a small number
of fixed angles from the robot’s heading. Specifically, in the experiments reported
below, a laser range-finder was simulated giving the logarithmically scaled
distance to the nearest object at angles of -60°, 0°, and +60°. This unprocessed
depth ‘map’ (a real-valued 3-vector) was provided to the control system as its sole
source of information other than collisions about the local geography of the
world. Figure 6.3 shows the simulated robot, modelled as a 0.3 ! 0.3 m square
box, casting its three ‘rays’ in a sample 5 ! 5m world.
Figure 6.3: The simulated robot in a two-dimensional world.
Two additional sources of perceptual information are also modelled. The module
touch models a set of contact-sensitive sensors mounted on the corners of the
vehicle. These sensors are triggered by contact with walls or obstacles and hence
act as collision detectors. Finally, wheel speed information (motor in figure 6.2) is
obtained internally from the motor control system and is used in determining
reinforcement signals.
Motor Control
The vehicle model assumes two main drive wheels with independent motors and a
third, unpowered wheel (that acts to give stability). A first-order approximation to
the kinematics of this vehicle is to consider any movement as a rotation around a
point on the line through the main axle (see figure 6.4). The adaptive control
module for the robot generates, at each time-step t, a real-valued two-vector
y(t) = (f(t), a(t)) where the signals f(t) and a(t) indicate the desired forward
velocity and steering angle of the vehicle. The motor control module converts
these signals into desired left and right wheel-speeds which it then attempts to
match. This module also deals with requests from the recover module to perform
an emergency stop, back-up, or turn. Restrictions on acceleration and braking are
included in the model by specifying the maximum increase or decrease in wheel
speed per time-step. This enforces a degree of realism in the robot’s ability to
initiate fast avoidance behaviour.
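The conversion performed by the motor control module might be sketched as follows; the wheel separation and the acceleration limit are illustrative values, not those used in the simulation.

```python
# Sketch of the motor control conversion: desired forward velocity f and
# steering angle a are mapped to left/right wheel speeds (see Figure 6.4),
# and the change in each wheel speed per time-step is clipped to model the
# restrictions on acceleration and braking. Constants are assumed values.

WHEEL_SEPARATION = 0.25   # distance between the drive wheels (m), assumed
MAX_DELTA = 0.02          # maximum wheel-speed change per time-step, assumed

def clip_towards(target, current, max_delta=MAX_DELTA):
    return max(current - max_delta, min(current + max_delta, target))

def motor_control(f, a, prev_lws, prev_rws):
    lws_target = f + 0.5 * WHEEL_SEPARATION * a
    rws_target = f - 0.5 * WHEEL_SEPARATION * a
    lws = clip_towards(lws_target, prev_lws)
    rws = clip_towards(rws_target, prev_rws)
    return lws, rws
```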
[Figure diagram: the simulated vehicle with left and right drive wheels (speeds lws and rws), forward velocity f(t), steering angle a(t), and wheel separation γ.]
Figure 6.4: The simulated vehicle. (Given the instantaneous forward and angular
velocities f(t) and a(t), the left and right wheel speeds are given by
lws = f(t) + (1/2)γa(t) and rws = f(t) - (1/2)γa(t), where γ is the distance between
the drive wheels.)
Collision recovery
Recover is a ‘pre-wired’ collision recovery module. When activated it stops the
robot, suppresses the output of the adaptive controller, and sends its own control
signals to the motor system. These signals perform a sequence of actions causing
the vehicle to back-up (by undoing its most recent movement) and then rotate by a
random amount ( ± 90–180°) before proceeding. If a further collision occurs
during this recovery process the robot stops and waits to be relocated to a new
starting position. Recover also sends a reset signal to each of the learning
components of the adaptive controller. This signal inhibits learning whilst
recover is in control of the vehicle (once appropriate updates have been made for
the actions that resulted in the collision). It also clears all short-term memory
activity in the adaptive module so that, when learning recommences, actions and
predictions associated with pre-collision states are not updated further.
The adaptive local navigation module
Context and reward inputs, outputs and initial values
The motivate component of the wander module generates reinforcement signals
based on touch (collision) and motor reward signals. The actor and critic
components both have the current depth map as their context input. The critic
learns to predict sums of future motivation signals, the actor learns the control
vector y(t) = ( f (t), a(t)) that determines the movement of the vehicle.
Initially, the prediction and action functions are uniform throughout the input
space. The default output of the actor corresponds a forward velocity equal to
half the vehicle’s maximum speed and a steering angle of zero. In other words,
the vehicle is set to travel in approximately a straight line. The default prediction
is equal to the maximal attainable reward (i.e. it is optimistic). Clearly, as learning
proceeds, both the prediction and actions will adapt and become specialised for
the different regions of the input space.
Motivation
In order to learn useful local navigation behaviour the robot must move around
and explore its environment. If the system is rewarded solely for avoiding
collisions an optimal strategy would be to remain still or to stay within a small,
safe, circular orbit. To generate more interesting behaviour some additional
‘drive’ is required. This task of providing suitable incentives to the learning
system is performed by the motivate component.
In the experiments described below motivate combines two different sources of
reward. To promote obstacle avoidance the signals from the external collision
sensors are combined to compute a ‘collision’ reward. This signal is zero unless
the robot makes contact with an obstacle which produces an immediate negative
reward signal. To encourage exploration internal feedback from the motor system
is used to derive a ‘movement’ reward which is a Gaussian function of the
vehicle’s current absolute translational velocity. This reward is maximal (zero)
when the robot is travelling at an optimal speed and becomes increasingly
negative away from this optimum.
There are two points to note about the movement reward signal. First, by basing
the signal on translational motion only (i.e. excluding rotational movement), the
system is discouraged from following small circular orbits. Second, by using the
absolute velocity the system is rewarded equally for travelling backwards at speed
as it is for travelling forward. This can encourage the robot to back out of difficult
situations. The system learns to travel forward most of the time, not because
forward motion is in itself preferable, but because the robot discovers that
reversing is a dangerous activity—since all of its range sensors are directed
forward it is unable to predict, and hence avoid, collisions that occur while
backing.
The output of motivate is the total reward given by the weighted sum of the
collision and movement rewards. The reward horizon of the module is infinite
and discounted, in other words, the critic learns to predict a discounted sum of
future motivation signals. Full details are given in the Appendix.
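A minimal sketch of the motivate function is given below, assuming illustrative weights and speed constants (the actual values and the discounting details are those given in the Appendix).

```python
import math

# Sketch of the motivate component: a weighted sum of a 'collision' reward
# (zero unless a contact sensor has fired) and a 'movement' reward that is
# a Gaussian function of the absolute translational velocity, maximal
# (zero) at the optimal speed and increasingly negative away from it.
# All constants are assumed for illustration.

OPTIMAL_SPEED = 5.0      # preferred translational speed, assumed units
SPEED_WIDTH = 2.0        # width of the Gaussian speed preference, assumed
W_COLLISION = 1.0        # weighting of the collision reward, assumed
W_MOVEMENT = 0.1         # weighting of the movement reward, assumed

def motivate(collision, translational_velocity):
    collision_reward = -1.0 if collision else 0.0
    deviation = (abs(translational_velocity) - OPTIMAL_SPEED) / SPEED_WIDTH
    movement_reward = math.exp(-deviation ** 2) - 1.0   # zero at the optimum
    return W_COLLISION * collision_reward + W_MOVEMENT * movement_reward
```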
Input coding
As discussed in chapters four and five, to learn non-linear functions of a
continuous input space the input vectors must be recoded in a suitable manner.
Both a priori and adaptive recoding methods have been considered hitherto;
however, it is clear that the latter generally makes the learning problem
significantly more difficult. For this reason a fixed, a priori coding method was
chosen for these initial experiments in adaptive local navigation. To have the
advantages of a localist coding and yet retain some generalisation ability the
CMAC coarse-coding architecture [2] (Section 4.1) is employed. In the
experiments described below, identical CMAC codings are used by both actor and
critic components.
Output computation
The CMAC encoding effectively maps the current input pattern into activity in a
small number of the stored parameters of each adaptive component. For a given
input the value of the prediction computed by the critic is obtained simply by
averaging the values of all the active parameters.
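This read-out might be sketched as below; the number of tilings, the tile resolution, and the input range are illustrative stand-ins for the values given in the Appendix.

```python
# Sketch of a CMAC read-out: several overlapping tilings of the depth-map
# space, each offset by a fraction of a tile width; an input activates one
# tile per tiling and the stored value is the average of the parameters
# attached to those active tiles. All constants are assumed.

NUM_TILINGS = 5
TILES_PER_DIM = 5
LOW, HIGH = 0.0, 200.0          # assumed range of each scaled depth input

def active_tiles(x):
    """One (tiling, i, j, k) index per tiling for a 3-vector x."""
    width = (HIGH - LOW) / TILES_PER_DIM
    tiles = []
    for t in range(NUM_TILINGS):
        offset = width * t / NUM_TILINGS        # each tiling is offset
        idx = tuple(min(TILES_PER_DIM - 1,
                        int((xi - LOW + offset) / width)) for xi in x)
        tiles.append((t,) + idx)
    return tiles

def read_out(params, x):
    """Average the parameters attached to all tiles activated by x."""
    tiles = active_tiles(x)
    return sum(params.get(tile, 0.0) for tile in tiles) / len(tiles)
```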
The procedure for obtaining the output of the actor component is slightly more
complex. The design of the actor component is based on the Gaussian action unit
[184] described in chapter three. As its output the actor generates the control
vector y(t) by selecting random values from Gaussian probability distributions
specified as functions of the current depth map. To generate each element of the
two-valued action requires a mean and a standard deviation. Hence, in all, the
actor has four sets of adaptive parameters from which it computes four values
(again by averaging over the active, CMAC-indexed parameters in each set).
Figure 6.5 illustrates the mechanism for recoding the input and computing the
critic and actor outputs. Details of the exact coding system and activation rules are
given in the Appendix.
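In code, the actor’s output step might look like the following sketch; the read-out function is assumed to behave like the CMAC read-out sketched earlier, and all names are illustrative.

```python
import random

# Sketch of the actor's output computation: the mean and standard deviation
# of each action element (forward velocity f, steering angle a) are read
# out of separate CMAC parameter sets, and the action is sampled from the
# resulting Gaussian distributions. The read_out argument stands in for a
# CMAC read-out such as the one sketched above.

def actor_output(depth_map, read_out,
                 f_mean_params, f_std_params, a_mean_params, a_std_params):
    f_mean = read_out(f_mean_params, depth_map)
    f_std = max(read_out(f_std_params, depth_map), 1e-3)   # keep positive
    a_mean = read_out(a_mean_params, depth_map)
    a_std = max(read_out(a_std_params, depth_map), 1e-3)
    # The spread of each distribution controls the amount of local exploration.
    return random.gauss(f_mean, f_std), random.gauss(a_mean, a_std)
```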
Learning
The critic and actor components both make use of short term memory (STM)
buffers for encoding recent states and actions. The critic’s STM records the
activity trace of past activations required for the TD(λ) update rule while the
STM of the actor records the eligibility trace of each of its adaptive parameters
(see section 2.2). The reset line allows the contents of these buffers to be erased
by a signal from the recover module. As described in chapter two, the use of
STM traces allows a more rapid propagation of credit and thus gives accelerated
learning. Again, full details of the update rules for all parameters are given in the
Appendix.
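For orientation, the updates take roughly the following standard form (the precise rules, learning rates, and trace decay constants used here are those given in the Appendix): the critic computes a TD error from successive predictions, its parameters are moved along decaying activity traces, and the actor’s Gaussian parameters are adjusted using Williams’ characteristic eligibilities.

```latex
% Sketch of the form of the updates; symbols are standard, not the exact
% notation of the appendix.
\delta_t = r_t + \gamma V(\mathbf{x}_t) - V(\mathbf{x}_{t-1})
\qquad
\Delta v_i = \alpha \, \delta_t \, \bar{e}_i,
\quad
\bar{e}_i \leftarrow \gamma \lambda \, \bar{e}_i + \frac{\partial V}{\partial v_i}
% Williams' eligibilities for a Gaussian unit with mean \mu, standard
% deviation \sigma, and sampled output y:
\frac{\partial \ln g}{\partial \mu} = \frac{y - \mu}{\sigma^{2}},
\qquad
\frac{\partial \ln g}{\partial \sigma} = \frac{(y - \mu)^{2} - \sigma^{2}}{\sigma^{3}}
```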
The learning cycle
The following summarises the steps that are carried out at each time interval (a schematic rendering in code follows the list).
1) The sensor modules map and touch obtain new perceptual input which is
communicated to the control modules recover and wander. The latter also
receives the motor feedback signal from the motor control module.
2) The motivate component of wander generates a primary reinforcement signal.
3) The critic component calculates a new prediction and the TD error.
4) Actor and critic components update their parameters according to the contents of
their STM buffers.
5) If a collision has occurred the recover module takes temporary control, suppresses learning and erases all STM buffers in the adaptive module; otherwise the actor generates new control signals based on the current depth map.
6) The STM memory buffers are updated.
7) The motor control module attempts to execute the motor commands.
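Very schematically, one pass of this cycle might be rendered as follows; every module interface named here is an illustrative placeholder rather than the implementation used in the thesis.

```python
# Schematic rendering of the per-step cycle listed above. The module
# objects (sensors, wander, recover, motor_control) and their methods are
# illustrative placeholders only.

def simulation_step(robot, sensors, wander, recover, motor_control):
    # 1) New perceptual input and motor feedback.
    depth_map = sensors.map(robot)
    collided = sensors.touch(robot)
    motor_feedback = motor_control.feedback()

    # 2) Primary reinforcement from the motivate component.
    reward = wander.motivate(collided, motor_feedback)

    # 3) New prediction and TD error from the critic.
    td_error = wander.critic.update_prediction(depth_map, reward)

    # 4) Actor and critic parameters updated via their STM traces.
    wander.learn(td_error)

    # 5) Collision recovery overrides the adaptive controller if needed.
    if collided:
        command = recover.take_control(robot)
        wander.reset()                # erase STM buffers, inhibit learning
    else:
        command = wander.actor.act(depth_map)

    # 6) STM buffers updated; 7) motor commands executed.
    wander.update_stm(depth_map, command)
    motor_control.execute(command)
```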
[Figure schematic: the laser range-finder depth map (distance to the nearest obstacle at ray angles of -60°, 0°, and +60°) forms the input vector; the critic CMAC outputs a prediction of future reward, and the actor CMACs output the means and standard deviations of the forward velocity and steering angle distributions. Note: the figure depicts 3x(3x3) CMAC tilings of a two-dimensional space; the simulations use 5x(5x5x5) tilings of the three-dimensional space (3 rays) of depth patterns.]
Figure 6.5: Recoding and output computation of actor/critic components of the wander module. The CMAC coding divides the input space into a set
of overlapping but offset tilings. For a given input the value of each stored function is found by averaging the parameters attached to all the tiles that
overlap that point in space. The critic generates a prediction of future reward, the actor the desired forward velocity and steering angle of the
vehicle (f, a). Each element of the action is specified by a Gaussian pdf and is encoded by two adjustable parameters denoting its mean and standard
deviation. The action is chosen by selecting randomly from the two distributions specified.
6.4 Results
To test the effectiveness of the learning algorithm the performance of the system
was compared before and after fifty thousand training steps in the environment
shown in figure 6.3. Averaged over ten runs5 the proportion of steps ending in
collision fell from over 5.4% before training to less than 0.01% afterwards. At the
same time the average translational velocity more than doubled to just below the
optimal speed. These changes represent a very significant improvement in
successful, obstacle avoiding, travel. This is further illustrated in Figure 6.6 which
shows sample vehicle trajectories before and after training. The paths shown
represent two thousand time-steps of robot behaviour. The dots show the vehicle’s
position on consecutive time-steps, crosses indicate collisions, and circles show
new starting positions. After training, collision-free trajectories averaging in
excess of forty metres were achieved compared with an average of less than one
metre before training.
The requirement of maintaining an optimum speed encourages the vehicle to
follow trajectories that avoid slowing down, stopping or reversing. However, if
the vehicle is placed too close to an obstacle to turn away safely, it can perform an
n-point-turn manoeuvre requiring it to stop, back-off, turn, and then move
forward. It is thus capable of generating quite complex sequences of actions.
Furthermore, although the robot has no perceptual inputs near its rear corners its
behaviour is adapted so that turning movements implicitly take account of its
body shape.
Most of the obstacle avoidance learning occurs during the first twenty thousand
steps of the simulation, thereafter the vehicle optimises its speed with little change
in the number of collisions6. The learning process is therefore quite rapid: if a
mobile robot with a sample rate of 5 Hz could be trained in this manner it could
begin to move around competently after roughly one hour of training.
5On each run the three measures of performance were computed by averaging over five thousand
time-steps before and after training with the learning mechanism disengaged.
6See Figure 6.7 below.
Figure 6.6: Trajectories before and after training for 50,000 simulation steps. Dots show
vehicle positions, crosses show collisions, and circles new starting positions.
Learning in different environments
Some differences were found in the system’s ability to negotiate different
environments with the effectiveness of the avoidance learning system varying for
different configurations of obstacles. In the following, the original environment
(Figures 6.3, 6.6) is referred to as E1 and the two new environments illustrated in
Figure 6.7 as E2 and E3 respectively.
Figure 6.7: Test environments E2 and E3.
The performance in each environment was measured, before and after training, by
the percentage of steps resulting in collisions and by the average translational
velocity. These results, averaged over ten runs in each environment, are shown in
Figure 6.8.
Environment    collisions %              translational velocity (cm/step)
               before       after        before       after
E1             5.43         0.09         3.77         7.40
E2             6.29         0.24         3.77         7.42
E3             8.56         0.55         3.64         6.94
Figure 6.8: Performance of the local navigation system in different training
settings. The differences between environments in the ‘collisions’ measure are
significant for all comparisons (p<0.05).
These results indicate substantial variation in performance for different
configurations of obstacles. Although there is a major improvement over the
training period in local navigation skill in all settings, the number of collisions in
E2 and E3 is certainly not negligible—in the worst case (E3) a collision occurs
approximately once in every two hundred time-steps after training.
The experiments also suggest that some obstacle configurations are, a priori,
more difficult to negotiate than others—this is indicated by the identical ranking
between the before and after values of the ‘collisions’ measure. Unfortunately it is
difficult to precisely specify the differences between environments. The test
situations were not devised according to any strict criteria, indeed, the question of
what criteria could be applied is itself a research issue7. Possible explanations for
the failure to learn optimal, collision-free behaviour in relation to characteristics
of different obstacle configurations will be considered further in the discussion
section below.
7The problem of categorising environments in relation to adaptive behaviours is considered in
[187, 171, 88]. However, the criteria suggested largely assume a discrete state-space.
Transfer of acquired behaviour between environments
The specificity of the acquired behaviour was evaluated by testing each trained
system on the two unseen environments as well as on the original training
environment. On average, obstacle avoidance skill transferred with reasonable
success to unseen settings. For instance, Figure 6.9 shows performance in E2
after training in environment E1.
Figure 6.9: Sprite behaviour in a novel environment. The trajectories show
behaviour (without learning) after transfer from the training environment (E1).
Performance was, however, slightly better when training and testing both took
place in the same setting. This was shown by a higher percentage of collisions in
any given environment if the system had no previous training there. Any drop in
performance after moving to a novel environment can, however, be made up if the
system continues to train after the transfer. This process is illustrated in Figure
6.10 which shows the average reinforcement signal during training, initially in
environment E1, and subsequently in E2.
[Plot: average reward against simulation steps.]
Figure 6.10: The change in the reward received over a training run in which the
simulated vehicle was transferred to the new situation (E2) after fifty thousand
simulation steps and allowed to adapt its actions to suit the new circumstances.
As a final test of the flexibility of the acquired behaviour, ten training runs were
carried out during which the vehicle was repeatedly transferred between the three
environments. After training these systems did not perform significantly worse in
any single test environment than systems that were trained exclusively in that
setting. These findings encourage the view that the acquired behaviour is
capturing some general local navigation skills rather than situation-specific
actions suited only to the spatial layout of the training setting.
Exploration
An automatic annealing process takes place in the exploration of alternative
actions as the local probability functions adapt to fit the reward landscape (see
section 3.2). Figure 6.11 illustrates this process showing a record of the change,
over training, in the variance of selected actions from their mean values. For both
elements of the output vector the expected fall-off in exploration is seen as the
learning proceeds. The increase in variance following transfer to the new
environment may indicate the ability of the algorithm to respond to changes in the
local distribution of reward by increasing exploration8.
[Two plots: the standard deviation of the selected forward velocity (cm) and of the steering angle (degrees), each plotted against simulation steps.]
Figure 6.11: Change in average exploration over training.
What is learned?
The different kinds of tactical behaviour acquired by the system can be illustrated
using three-dimensional plots of the preferred actions of the control system. After
training for fifty thousand time-steps in an environment containing two slow
moving rectangular obstacles the mean desired forward velocity and steering
angle were as shown in Figure 6.12. Each plot in the figure shows the preferred
action as a function of the three rays from the simulated laser range-finder: the x
and y axes show the lengths of the left (-60°) and right rays (+60°); the vertical
slices correspond to different critical lengths (9, 35 and 74cm) of the central ray
(0°); and the height of the surface indicates the mean action for that position in the
input space.
8Alternative interpretations of these results are possible however. For instance, the observed
increase in exploration following transfer may be due to increased sampling of input patterns to
which the system has had little prior exposure.
[Six surface plots: mean forward velocity (left column) and steering angle (right column) as functions of the left and right ray lengths, for central ray lengths of 9, 35, and 74 cm; surface heights range roughly from -5 to +15 cm for velocity and from -30° to +35° for steering angle.]
Figure 6.12: Surfaces showing desired forward velocity and steering angle
for different values of the central ray (9, 35 and 74 cm).
The graphs9 show clearly several features that might be expected of effective
wandering behaviour. Most notably, there is a transition occurring over the three
vertical slices during which the policy changes from one of braking then reversing
(graph a) to one of turning sharply (d) whilst maintaining speed or accelerating
(e). This transition clearly corresponds to the threshold below which a collision
cannot be avoided by swerving but requires backing-off instead. There is a
considerable degree of left-right symmetry (reflection along the line left-ray =
right-ray) in most of the graphs. This concurs with the observation that obstacle
avoidance is by and large a symmetric problem. However some asymmetric
behaviour is acquired in order to break the deadlock that arises when the vehicle
is faced with obstacles that are equidistant on both sides.
Varying the reward schedule
The behaviour acquired by the system is very sensitive to the precise nature of the
primary reinforcement signal provided by the motivate component. Varying the
ratio of the ‘movement’ and ‘collision’ rewards, for instance, can create
noticeable differences in the character of the vehicle’s actions. An increase in the
relative importance of the ‘movement’ reward encourages the vehicle to optimise
its trajectory for speed which means going very close to obstacles in order to
follow the fastest possible curves. Alternatively, if ‘collision’ rewards are
emphasised, then the acquired behaviour errs on the side of caution: the vehicle
will seek to maintain a healthy distance between itself and obstacle surfaces at all
times. A minor addition to the motivate function causes the system to switch from
wandering to wall-following. This is achieved by adding a component to the
reinforcement signal that encourages the vehicle to maintain a fixed, short
distance on either its left or right ray. An example of the acquired wall-following
9In order to eliminate irrelevant contours arising from unvisited parts of the space (still at default
values) a smoothing algorithm was applied to the CMACs before sampling. This had the effect of
propagating values of neighbouring cells to positions that had been updated less than 200 times
(less than 1% of the maximum number of updates). The algorithm thus has no effect on the major
areas of the function space that generate behaviour.
behaviour is shown in Figure 6.13 for a system trained in the environment E2.
Unfortunately the transfer of behaviour across environments was noticeably
poorer for wall-following than for the original wandering competence. Adding
extra constraints to the reinforcement signal may mean that optimal behaviour is
more rigidly specified and thus less flexible or transferable.
Figure 6.13: Wall-following behaviour can be acquired by altering the function
computed by the motivate component.
Different sensor systems
A number of experiments were carried out with modified sensor systems. In one
experiment 5% Gaussian noise was added to the range-finder measures. This did
not produce any significant reduction in performance. In a second experiment the
range-finder model was modified to simulate an optic flow sensor [18] which
detects the speed of visual motion at a set angle from the current heading. Direct
optic flow signals confound visual motion due to environment structure with
motion due to rotation of the vehicle. However, because of the characteristics of
the vehicle kinematics, the motion detected by a sensor that is directed straight
ahead is due solely to vehicle rotation. By subtracting this forward signal from the
motion detected by other sensors the rotational component of flow can be
removed. Four such signals from sensors at fixed angles of -60°, -15°, +15°, and
+60° were used as input to the adaptive control system. In spite of the fact that the
flow signals vary for different vehicle velocities the controller was, nevertheless,
able to acquire effective local navigation skill. The main difference from the
results with the range-finder simulation was that, with the optic flow signals,
acquired backing-up behaviour was less effective. A likely reason for this is that
the robot must be moving in order to generate non-zero flow signals. To go from
forward to backward motion involves having zero motion at some point. This
suggests that with the optic flow context some complementary sensor mechanisms
may be needed to guide starting and stopping behaviour. Attempts are currently
underway to test the adaptive controller on a real robot platform with an array of
visual motion sensors.
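The ‘derotation’ step just described can be expressed very simply; the sensor angles follow the text, and everything else is illustrative.

```python
# Sketch of removing the rotational component of flow: for this vehicle the
# straight-ahead sensor measures only rotational motion, so subtracting its
# signal from each off-axis sensor leaves the translational component.

OFF_AXIS_ANGLES = (-60, -15, +15, +60)   # degrees, as in the experiment

def derotate(off_axis_flows, forward_flow):
    """Subtract the forward (purely rotational) flow from each sensor."""
    return [flow - forward_flow for flow in off_axis_flows]
```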
6.5 Discussion
The following sections consider some of the issues arising from the simulation,
analyse sources of difficulty, and suggest some possible future improvements.
This is followed by a review of related work on adaptive local navigation.
Partial State Information
The learning system acquires successful wandering behaviour despite having only
partial state knowledge. The information in the three depth measures provided by
the ray-caster is clearly insufficient to predict the next input pattern though it
seems largely adequate to predict rewards and to determine suitable behaviours.
The underlying task may be a deterministic Markov process; however, as a result
of the lack of sufficient context, the learning system experiences a stochastic
process and must try to optimise its behaviour with respect to that. Furthermore,
because the system’s actions are changing over time, the transition and reward
probabilities in this process are non-stationary10. In as far as the learning system
is successful in acquiring suitable behaviours this demonstrates that the gradient
climbing properties of the learning algorithm, that exploit the correlation between
actions and rewards to improve behaviour, do not require, either explicitly or
implicitly, the ability to model the causal processes in the agent’s interaction with
the world. This is not to say that more information would not allow improved
learning—I will argue below that improving the context data available to this
system should give better performance. Rather, it seems evident that the learning
process does not expect or depend on full state knowledge.
In many tasks in complex, dynamic environments it will be impossible to
determine full state descriptions. This is one of the fundamental limitations of
dynamic programming as a means of determining effective control behaviour.
Relating delayed reinforcement learning methods to dynamic programming
allows the former to exploit the strong convergence properties of the latter.
However, to press this equation too emphatically is to ignore the property of
reinforcement learning that it is, at base, a correlation learning rule that neither
needs nor notices whether it has sufficient data to formulate a causative
explanation of events. To paraphrase the quotation from Brooks [23] given in
chapter one, reinforcement learning uses the aspects of the world that the agent is
sensing as its primary formal notion, whereas dynamic programming (and its
incremental variants) use the state of the world as their central formal notion. The
range of applicability of reinforcement learning should therefore be enormously
more diverse than that of incremental dynamic programming though it may be a
weaker learning method.
Given this argument that learning does not require full state information it is
nevertheless a truism that the ability of the system to learn optimal actions
depends critically on what characteristics of the current state it does observe. The
following therefore considers how the sparse perceptual information used by the
adaptive wander module could be enhanced to give better performance.
10In particular, the rewards for the system are dependent on its actions (forward velocity) about
which there is no direct information in the context input.
Perception
Many of the collisions which occur after training involve the vehicle hitting the
convex corners of obstacles. A comparison between the different test
environments bears out this view—the increase in difficulty across environments
appears to be matched by an increase in the number of convex corners in each
setting (9, 16, and 27 for E1, E2, and E3 respectively)11. One possible explanation
for this finding is the extremely sparse sensing. The three ‘rays’ of the simulated
laser range-finder often fail to catch the nearest point of an obstacle and can thus
give a false indication of distance to the nearest surface. Given the coarseness of
this perceptual apparatus it is perhaps surprising that the system performs as well
as it does.
A second possible cause of failure to avoid collision is the lack of any direct
context input concerning the vehicle’s instantaneous forward and angular
velocities. This is equivalent to an implicit but questionable assumption that, for
any given input pattern, a single action vector (desired velocities) will be
appropriate no matter how the vehicle is currently moving. It is easy to see
situations where this might not be appropriate. For instance, if an obstacle lies
directly in front of the vehicle then different actions may be required if the
vehicle’s current angular velocity is carrying it towards the left than if it is
moving towards the right. The learning system can minimise the impact of this
deficiency by developing characteristic patterns of approach behaviour for any
local setting. However, when similar local geometries are experienced in two
different settings different approach behaviour may be inevitable. The lack of
suitable proprioceptive context signals therefore encourages the acquisition of
more inflexible behaviour than might otherwise be learned.
Unfortunately, given the current learning architecture, enriching the observed
state information is not just a matter of loading more range-finder measures
and/or motor feedback signals into the context input. Using the current localist
coarse-coding technique (CMACs) this would result in an exponential growth in
11Clearly this is only one of several differences that may be significant. For instance, variation in
the number of obstacles or their density may also be having an effect.
both the number of adaptive parameters and the amount of time and experience
required for learning. Of course, any increase in the dimensionality of the input
vector will almost certainly increase the redundancy of the context signals. Hence
adaptive recoding techniques such as those discussed in chapters four and five
could be appropriate for compressing the input space or adapting the granularity
of the input coding. A further possibility for enriching the context while
maintaining, or increasing, the rate of skill acquisition is to introduce further
information gradually. For instance, learning could begin using a very coarse-grained coding to which detail is gradually added as training proceeds. An
example of this approach would be a hierarchy of CMAC tilings of different
resolutions. Training would begin with the lowest resolution tiling allowing rapid
generalisation of acquired knowledge. As local navigation skill improves, tilings
of higher resolutions would gradually come into use generating a more accurate
fit to an optimal control strategy. Finally, rather than simply adding extra context,
the sensor modules themselves could be made adaptive. For instance, a module
that controls the direction of sampling of the laser range-finder could be trained to
tune the sampling direction according to some reward-related feedback signal. An
adaptive attentional mechanism of this sort might allow the sensor system to pinpoint nearby obstacles with increased accuracy without a blanket increase in the
amount of sampling by the range-finder.
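One hypothetical rendering of the coarse-to-fine suggestion is sketched below; the blending schedule and all names are invented for illustration.

```python
# Hypothetical sketch of a coarse-to-fine CMAC hierarchy: read-outs from
# tilings of increasing resolution are blended, with finer tilings phased
# in as training proceeds. The schedule and weighting are invented.

def hierarchical_read_out(read_outs, x, training_step, steps_per_level=10000):
    """read_outs: list of callables ordered from coarsest to finest."""
    total, weight_sum = 0.0, 0.0
    for level, read_out in enumerate(read_outs):
        if level == 0:
            w = 1.0                              # coarsest level always active
        else:
            # Each finer level fades in linearly after its predecessors.
            w = min(1.0, max(0.0, (training_step - level * steps_per_level)
                                  / steps_per_level))
        total += w * read_out(x)
        weight_sum += w
    return total / weight_sum
```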
Multi-modal actions
The learning system selects only one of the multiple solutions possible for
wandering behaviour. Although the acquired behaviour is versatile, in that a range
of acceptable actions is determined for each region of the input space, this
distribution is unimodal whereas, in general, a full catalogue of the possible
options for any specific context input might have a multi-modal distribution.
Millington [101] has described a reinforcement learning technique capable of
learning multi-modal outputs which could be applicable here12.
12Millington divides the input space into a number of discrete cells then assigns a number of Gaussian pdf ‘modes’ to each cell which adapt to explore different regions of output space. A further alternative would be to adopt a discrete output-space quantisation and use a Q-learning algorithm to determine the values of alternative actions.
The existence of multiple solutions creates particular difficulties for the learning
system in situations where the input vector is left-right symmetric, for example,
where the robot is moving toward a head-on collision with a wall. In this situation
veering strongly either to the left or the right is generally an acceptable solution.
However, since both options are equally likely to occur through random
exploration, it is possible that by repeatedly trying first one and then the other the
learning system will consistently cancel out the positive effect of both and be left
with a policy of moving directly forward—a clear compromise, but the worst
possible strategy. This situation would not arise, however, if there were two systems, one specialised in left turns and the other in right turns; indeed, since one could be the mirror-image of the other, only one system might actually be required. A separate module would then have the task of deciding which option to
employ in any given circumstances.
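The alternative noted in footnote 12, a discrete quantisation of the output space whose action values are estimated by Q-learning, can be sketched as follows. This is a generic tabular Q-learning fragment of my own (the action names and parameter values are hypothetical), not Millington's algorithm. The relevant point is that a separate value is kept for each discrete action, so both 'veer left' and 'veer right' can remain highly valued and greedy selection commits to one of them rather than averaging them into 'straight ahead'.

```python
# A generic tabular Q-learning sketch (not Millington's algorithm); the
# action names and parameter values are hypothetical.

import random
from collections import defaultdict

ACTIONS = ['left', 'straight', 'right']

class DiscreteQLearner:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)                    # (state, action) -> estimated value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state):
        if random.random() < self.epsilon:             # occasional exploratory action
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```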
Goal conflict
The adaptive module is rewarded both for successful avoidance behaviour and for
moving reasonably quickly. Inevitably, therefore, the strategy that develops is a
compromise between these initially antagonistic goals. In the early stages of
learning the critic rapidly detects a strong positive correlation between moving at
speed (desirable) and colliding with obstacles (undesirable). A likely response to
this double-bind is to become excessively cautious or acquire prevarication (i.e.
dithering) behaviours. Indeed, the most difficult part of setting up the learning
algorithm was found to be tuning the global parameters of the adaptive module so
as to overcome the initial ‘hump’ of discouraging experience and the consequent
disinclination to move around.
Relating reinforcement to speed of movement was necessary because, in the
absence of any more specific objective, some incentive was required that would
force movement and exploration. In general, however, it would be desirable to
replace the ‘movement’ reward by some measure of success more closely related
to a genuine navigation goal. For instance, a way-finding module might specify a
local target position or a target heading; reward would then be related to success
in satisfying that criterion. However, the problem of conflicting goals will always
exist for a vehicle with variable velocity as any type of movement increases the
risk of collision with stationary obstacles13. Furthermore, unlike the ‘movement’
reward more appropriate incentives may provide only intermittent feedback
making the optimal reinforcement gradient even more difficult to detect and
follow. In building robots that are increasingly self-reliant and self-motivated it
seems likely that this problem of balancing conflicting goals will become more
not less important. This suggests that future research will require a better
understanding of the operation of motivational mechanisms in both animals and
robots14.
Related work on adaptive local navigation
There has been a substantial amount of recent work on acquisition of local
navigation skill. The following is not an attempt at an exhaustive review, rather,
the objective is to identify some of the progress that has been made and the
different approaches that have been taken.
Zapata et al. [190] implemented both pre-wired and adaptive local navigation
competences, obtaining qualitatively similar results with each when tested on mobile robots in outdoor and indoor environments. The adaptive competences were acquired using multi-layer neural network controllers trained by back-propagation on a teaching signal which was some pre-defined function of the input
vector (for instance this function might relate to deformations of a desired
13Some experiments were conducted in which movement was encouraged indirectly by populating
the environment with moving obstacles. This should motivate the robot to move around as
remaining in any one location will not guarantee obstacle avoidance. However, in general, these
experiments were not very successful suggesting that some more straight-forward and obvious
link between movement and reward is needed to overcome the disincentives that arise from
collisions.
14Toates [170] gives a review of motivation research in psychology and ethology; Halparin [58] has described a model for investigating issues in robot motivation.
obstacle-free zone around the robot vehicle). Three adaptive modules were
developed including a module for dynamic collision avoidance.
Nehmzow [113] describes acquisition of forward motion, collision avoidance,
wall-following and corridor-following skills in a small, indoor mobile robot.
Learning occurred in a single-layer network with an eight-bit input vector
describing the state of two touch sensors (whiskers) and a forward motion sensor.
The output of the network selected one of four possible actions—turn left, turn
right, move forward, or move back, where the chosen action was timed to last for
a fixed period. Exploration of behaviours occurred simply by rotating through the
alternative outputs. Stimulus-response associations were reinforced which
satisfied a set of instinct rules. These rules described desirable sensory conditions
such as “Keep the forward motion sensor on”, “Keep the whiskers straight”, etc.
Relatively complex behaviours such as wall-following were developed in this
manner through the interaction of a number of such instinct rules. Learning for
these tasks was very rapid, both because of the small size of the input and output
spaces and because the network was specified in a manner that made the input
patterns linearly separable with respect to satisfactory outputs. Typically, a
behaviour such as wall-following could be learned in just a few minutes. More
efficient or accurate manoeuvring would clearly require an increase in the size of
the search-space and would therefore result in slower learning speed. However,
this research does make the point that very simple acquired reflexes can support
robust behaviour.
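The flavour of this scheme can be conveyed by a small sketch. The code below is my own reconstruction from the description above, not Nehmzow's implementation: a single-layer network scores four discrete actions from an eight-bit sensor vector, and the chosen stimulus-response association is strengthened or weakened according to whether a hypothetical set of instinct rules is satisfied.

```python
# A reconstruction from the description above (not Nehmzow's implementation):
# a single-layer network maps an eight-bit sensor vector to one of four actions,
# and the chosen stimulus-response association is strengthened when the
# resulting sensory state satisfies the instinct rules, weakened otherwise.

import numpy as np

ACTIONS = ['forward', 'back', 'left', 'right']

class InstinctLearner:
    def __init__(self, n_inputs=8, lr=0.2):
        self.w = np.zeros((len(ACTIONS), n_inputs))
        self.lr = lr

    def choose(self, sensors):
        return int(np.argmax(self.w @ sensors))        # index into ACTIONS

    def reinforce(self, sensors, action, satisfied):
        sign = 1.0 if satisfied else -1.0
        self.w[action] += sign * self.lr * sensors

def instincts_satisfied(sensors):
    # hypothetical rules: forward-motion bit (index 0) on, whisker bits (1, 2) off
    return sensors[0] == 1 and sensors[1] == 0 and sensors[2] == 0
```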
The research in adaptive local navigation by Kröse and van Dam [73, 74] bears
the closest resemblance to the work described in this chapter. They describe a
simulated robot whose task is to avoid punishment signals incurred through
collisions with obstacles. The motor speed of the vehicle is fixed and positive; hence, stationary, backing-up or procrastination behaviours are not possible. This removes a major source of locally-optimal behaviours but at the same time
significantly restricts the manoeuvrability of the vehicle. The simulation was
trained using an Actor/Critic learning architecture for which the input was a
vector of logarithmically-scaled measures from eight range-finder sensors set in a
semi-circular arc at the front of the vehicle. The stochastic output of the actor
component moved the current heading of the robot either to the left or the right.
A major difference from the work reported here was in the recoding methods used
to form a quantisation of the state-space. Both a priori and adaptive discrete
quantisation techniques were tested. The adaptive techniques included
unsupervised learning using a Kohonen self-organising network and nearest
neighbour coding using an on-line node generation and pruning algorithm. The
latter worked in the following manner. Starting with just one node, whenever a
collision occurred new units were added at the points in the input space
corresponding to the last M input patterns prior to the collision. In order to
prevent an explosion in the number of nodes, adjacent units with similar output
weights were periodically merged. The prediction values of the units were also monitored, and units with persistently low predictions were removed from the network.
The system was tested in a two-dimensional environment of polygonal obstacles.
The training schedule was somewhat different from that in the experiments described above; however, roughly similar training times were required to obtain good
performance. The performance with the nearest neighbour coding, with a final
size of 80 units, was similar to that obtained with a Kohonen network with 128
units and better than with an a priori quantisation with 256 cells. The nearest
neighbour coding method appears to have been reasonably successful, however,
the mechanism for generating new nodes is heuristically driven and exploits the
task characteristic that avoidance can be initiated close to the collision site. This
heuristic might be less effective if task parameters were changed, for instance, if
the turning circle of the robot was increased requiring earlier initiation of
avoidance activity.
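The node generation and pruning procedure can be summarised roughly as follows; the thresholds, memory length, and merging test in this sketch are illustrative guesses rather than the values used by Kröse and van Dam.

```python
# A rough sketch of the node generation and pruning idea described above.
# Thresholds, the memory length M, and the merging test are illustrative
# guesses, not the values used by Kröse and van Dam.

import numpy as np

class NearestNeighbourCoder:
    def __init__(self, memory_length=5, merge_dist=0.1, merge_value_diff=0.05):
        self.centres = []          # unit positions in the input space
        self.values = []           # e.g. prediction / output weight per unit
        self.recent = []           # the last M input patterns seen
        self.M = memory_length
        self.merge_dist = merge_dist
        self.merge_value_diff = merge_value_diff

    def observe(self, x):
        self.recent = (self.recent + [np.asarray(x, dtype=float)])[-self.M:]

    def nearest(self, x):
        if not self.centres:
            return None
        dists = [np.linalg.norm(np.asarray(x, dtype=float) - c) for c in self.centres]
        return int(np.argmin(dists))

    def on_collision(self):
        # add new units at the last M input patterns preceding the collision
        for x in self.recent:
            self.centres.append(x.copy())
            self.values.append(0.0)

    def merge_similar(self):
        # periodically merge adjacent units whose values are similar,
        # keeping the first of each cluster encountered
        kept_c, kept_v = [], []
        for c, v in zip(self.centres, self.values):
            duplicate = any(np.linalg.norm(kc - c) < self.merge_dist
                            and abs(kv - v) < self.merge_value_diff
                            for kc, kv in zip(kept_c, kept_v))
            if not duplicate:
                kept_c.append(c)
                kept_v.append(v)
        self.centres, self.values = kept_c, kept_v
```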
Millan and Torras [100] describe a reinforcement learning approach to the local
path-finding problem. They consider an abstract model in which a ‘point robot’
learns to move to nearby goal positions in an environment containing only
circular obstacles. Success is defined as entering a circle of small radius around
the goal position. The inputs to the learning system, which assume some relatively
complex pre-processing of sensory signals, consist of an attractive force exerted
by the goal and repulsive forces exerted by nearby obstacles. The output of the
learning system gives the size and direction of a movement vector in the
coordinate frame defined by a direct line to the goal. The reinforcement provided
on every simulation step is a function of the attractive and repulsive force inputs
and the current clearance between the robot and the nearest obstacle. Reward is
maximum at the goal and minimum when a collision occurs. The learning system
is based on the actor/critic architecture and employs a variant of Gullapalli’s [56]
algorithm for determining suitable exploration behaviour (section 3.2). In order to
allow the acquisition of non-linear input-output mappings the actor component is
an MLP network whose hidden units are trained by generalised gradient ascent in
the reinforcement signal. Certain domain-specific heuristics were used to
facilitate the discovery of suitable paths, and some intervention (re-siting the
robot after collisions) was needed to allow successful learning.
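The shape of the per-step reinforcement described above might be rendered, very loosely, as in the sketch below; the exact functional form and constants are assumptions of mine and are not taken from Millan and Torras.

```python
# A toy rendering of the kind of per-step reward described above; the exact
# functional form and constants are assumptions of mine, not Millan and Torras's.

def step_reward(dist_to_goal, clearance, goal_radius=0.05, collision=False):
    if collision:
        return -1.0                                   # minimum reward on collision
    if dist_to_goal < goal_radius:
        return 1.0                                    # maximum reward at the goal
    attraction = 1.0 / (1.0 + dist_to_goal)           # grows as the goal is approached
    repulsion = 1.0 / (1.0 + clearance)               # grows as obstacles are approached
    return attraction - 0.5 * repulsion
```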
Although reinforcement learning is a very slow method for planning, the local
path-finding skills acquired by this system did generalise moderately well to other
configurations of circular obstacles. The acquired behaviour therefore does meet
the criteria proposed here that local navigation competences should not be
environment-specific. Millan and Torras share a similar view to the one advocated
in this chapter that navigation requires a combination of planning mechanisms
that employ world models and local navigation mechanisms encoding stimulus-response behaviours. Whether the appropriate place to interface these two
mechanisms is at the level of local path-planning is an open question. In chapter
eight I will suggest that the way-finding mechanism should have a more
continuous role than that suggested by Millan and Torras (specifying intermediate
goal positions). For instance, the way-finding system might continually update its
planned route as the robot moves and direct the local navigation systems by
specifying the current desired heading. This increases the burden of work for the
way-finding system (means for coping with this burden are discussed in chapter
eight) but reduces the need for reactive modules that are adapted to maximise
very long-term rewards. Acquisition of local navigation skill should therefore be a
faster and more straightforward process.
Conclusion
This chapter has argued that the problem of navigation should be divided between
a way-finding component that builds world models and performs planning, and a
number of local navigation components which together implement appropriate
local manoeuvres for traversing the desired route. I have also demonstrated that
sophisticated local navigation behaviour can arise from sequences of learned
reactions to raw perceptual data. A simulation of a mobile robot has been
described that acquires successful wandering behaviour in environments of
polygonal obstacles. The trajectories generated by the simulation often have the
appearance of planned activity since each individual action is only appropriate as
part of an extended pattern of movement. However, planning only occurs as an
implicit part of the learning process that allows experience of rewarding outcomes
to be propagated backwards to influence future actions taken in similar contexts.
This learning process exploits the underlying regularities in the robot's interaction
with its world to find an effective mapping from sensor data to motor actions in
the absence of full state information.
Chapter 7
Representations for Way-finding:
Topological Models
Summary
This chapter and the next consider the construction of representations that can
support way-finding. Some recent research in Artificial Intelligence has favoured
spatial representations of a primarily topological nature over more quantitative
models on the grounds that they are: cheaper and easier to construct, more robust
in the face of poor sensor data, simpler to represent, more economical to store,
and also, perhaps, more biologically plausible. This chapter suggests that it may
be possible, given these criteria, to construct sequential route-like knowledge of
the environment, but that to integrate this information into more powerful layout
models or maps may not be straight-forward. It is argued that the construction of
such models realistically demands the support of either strong vision capabilities
or the ability to detect higher-order geometric relations. And further, that in the
latter case, it seems hard to justify not using the acquired information to construct
models with richer geometric structure that can provide more effective support to
way-finding.
7.1 Introduction
The problem of navigating in large-scale space has been the subject of
considerable study both by Artificial Intelligence researchers interested in
building autonomous mobile robots, and by ethologists and psychologists
interested in understanding the navigation behaviour of animals and humans. For
a long time these two strands of research have been largely independent, however,
recently there has been considerable cross-over of ideas. Robot builders have
begun to seek inspiration from natural navigation systems as examples of robust
systems that have been selected and fine-tuned by evolution. Meanwhile
psychologists and animal behaviourists have started to recognise that robotics
research represents a rich resource of detailed computational models that can be
used to evaluate and inspire theoretical accounts.
A recent trend in research in robotics has involved a substantive change in the
character of the systems under investigation. Specifically, the emphasis of
‘classical’ AI methods on detailed path-planning using metric models of the
environment (e.g. [34, 40, 64, 89, 136, 174]) has been rejected by some
researchers in favour of the use of more ‘qualitative’ methods and models (e.g.
[38, 72, 77, 78, 81, 82, 84, 92, 94, 112]). Researchers with this perspective often
regard metric modelling and planning as supplementary to a core level of
navigation skill based, primarily, on representations of topological spatial
relations. This new approach has, as part of its motivation, the perceived
inadequacies of classical systems which are regarded as over-reliant on accurate
sensing and detailed world models. It is suggested that such systems are both too
‘brittle’ in the face of degraded or missing sensory information, and too costly in
terms of the computational and memory resources they require. In contrast, the
emphasis on topological models can be understood as part of a general trend
toward ‘behaviour-based’ robot control (i.e. animat AI) that seeks minimal
reliance on internal models, and—where representations are considered
essential—the need for on-line acquisition of appropriate knowledge.
Representations are therefore preferred that are simple to construct, cheap to
store, and support a ‘graceful degradation’ of performance when confronted with
unreliable sensory data.
A second motivation for investigating topological spatial models is research on
human way-finding. Much of this literature follows a theory originating with
Piaget [124, 125] that human spatial knowledge has a hierarchical structure and is
acquired through a stage-like process. Specifically, Piaget, and later Siegel and
White [154], have argued that a fundamental stage in the acquisition of spatial
knowledge is the construction of qualitative models of the environment from
more elementary sensorimotor associations. This representation is then gradually
supplemented by distance and direction information to form a more detailed
quantitative map. An important element of this theory is the view that a primarily
topological representation can support robust way-finding behaviour in everyday
environments. Computational models inspired by the human way-finding
literature have been described by Leiser [82] and by Kuipers [76-81]. The latter in
particular has developed a number of robot simulations of considerable
sophistication and detail based on the hypothesis of a hierarchical representation
of spatial knowledge. The following extract serves to illustrate this theoretical
position:
“There is a natural four-level semantic hierarchy of descriptions of large-scale
space that supports robust map-learning and navigation:
1. Sensorimotor: The traveller’s input-output relations with the environment.
2. Procedural: Learned and stored procedures defined in terms of sensori-motor
primitives for accomplishing particular instances of place-finding and route-following tasks.
3. Topological: A description of the environment in terms of fixed entities,
such as places, paths, landmarks, and regions, linked by topological relations
such as connectivity, containment and order.
4. Metric: A description of the environment in terms of fixed entities such as
places, paths, landmarks, and regions, linked by metric relations such as
relative distance, relative angle and absolute angle and distance with respect
to a frame of reference.
In general, although not without exception, assimilation of the cognitive map
proceeds from the lowest level of the spatial semantic hierarchy to the highest, as
resources permit. The lower levels of the cognitive map can be created
accurately without depending greatly on computational resources or
observational accuracy. A complete and accurate lower level map improves the
interpretation of observations and the creation of higher levels of the map.”
([81], p. 26, my italics)
However, while agreeing with many aspects of this position it is possible to
question one of its central themes—that successive levels build on those below.
This issue can be highlighted by contrasting the above view of human way-finding with much of the literature from the wider field of animal navigation. This
evidence suggests a discontinuity between procedural knowledge and the use of
map-like metric spatial representations [52, 119, 120].
For instance, in contrast to the incremental hierarchy of spatial knowledge
outlined above, O’Keefe [119, 120] has argued that there are two fundamentally
independent navigation systems used by mammals including man. The first of
these, which he calls the taxon system, is supported by route-like chains of stimulus → action → stimulus associations. Each element in such a chain is an
association that involves approaching or avoiding a specific cue, or performing a
body-centred action (generally a rotation) in response to a cue. Taxon strategies
therefore have a similar nature to the procedural knowledge in the second level of
Kuipers' hierarchy. O’Keefe’s second system, called the locale system, is,
however, a ‘true’ mapping system in that it constructs a representational
stimulus → stimulus model describing the metric spatial relations between
locations in the environment. Evidence for the existence of this system and its
independence from taxon strategies consists of both observational and laboratory
studies of animal behaviour, and neurophysiological studies suggesting that
different brain structures underlie the two systems.
Although the highest level of Kuipers' hierarchy can be identified with O’Keefe’s
locale system the former suggests a continuity—with assimilation of information
onto ‘weaker’ representations to generate the metric model, whereas the latter
stresses the discontinuity and apparent autonomy of the two alternative
mechanisms. A further difference is that O’Keefe’s theory bypasses the level of
the topological map, if such a map exists it is as an abstraction from the full
metric representation. Gallistel [52], who provides a recent and extensive review
of research on animal navigation, also concludes that animals make considerable
use of metric data for navigation. Like O’Keefe he also proposes a modular and
autonomous mapping system that stores a metric representation of spatial layout1.
The controversy highlighted above really rests on the importance of metric
knowledge in constructing map representations of space. On the one hand is the
1O’Keefe and Gallistel agree on the existence of a separate metric mapping system but largely
disagree on the relative importance of dead reckoning and environmental fixes in constructing the
map. This debate will be considered further in chapter eight.
view that a useful level of topological map knowledge can be acquired without
relying on higher-order geometric information, on the other, the assertion that the
ability to detect large-scale metric spatial relations is the key to constructing an
effective model. It might be helpful here, to delineate alternative views on this
issue more clearly. Two possible positions are as follows:
1. That a map describing the topological spatial relations between salient places can
be efficiently constructed from sensorimotor and procedural knowledge without
any need to detect metric spatial relations.
2. That the construction of any sort of map of environmental layout is best begun
with the detection of metric spatial relations. These can then be used to construct
a metric model from which topological relations can be abstracted as required.
Both of these positions are as much a matter of practice as of principle. They are
concerned with building useful and robust representations without making
excessive demands on the agent's resources of time, perceptual skills,
computational abilities, and memory. I believe that, given these caveats, the
second view would be reasonably acceptable to O’Keefe, Gallistel, and many
researchers adopting more quantitative approaches to robot map-building. The
first view is possibly too strong to ascribe to most builders of behaviour-based
robots. The aim of much of this research is—not unreasonably—to build a
working system that exploits whatever information can be acquired cheaply and
re-acquired with fair reliability2. This approach might therefore be better
characterised as follows.
3. That a map describing primarily topological spatial relations can be efficiently
constructed from sensorimotor and procedural knowledge with some additional
but only approximate knowledge of metric spatial relations.
This approach, however, is open to the criticism that if metric relations are
detected then they may as well be stored and exploited. Why not construct a
metric model? Even if such a model is only approximate it should still have some
advantages over a topological map, for instance, in estimating direct routes. If the
detected relations are too inaccurate to be worth storing then they will, as likely,
2Many systems, for instance, make use of odometry (wheel-speed) information and simple
compass sensors to compute rough estimates of spatial position that are used in constructing a
primarily topological model.
be of little use in constructing the topological model. If this argument is correct,
then the case against building a metric map can only be based on the assertion that
the construction or use of such a model is significantly more expensive or
complex. This issue will be taken up in the next chapter where it will be argued
that the use of more quantitative models is compatible with the general research
paradigm of Animat AI.
A further problem with this third approach is that it undermines Kuipers' theory of
the spatial semantic hierarchy. If the flow of information is not principally
bottom-up then the argument for placing the metric map at a higher level than the
topological model is weakened. In accordance with the more classical approach
the construction of metric spatial relations could be viewed as having a more
direct link to the sensorimotor level than the topological map. In other words the
theory of the semantic hierarchy would lose much of its bite.
On the basis of these arguments this chapter therefore focuses on the first
approach which I will call the ‘strong’ topological map hypothesis—the
possibility of constructing topological models without reliance on higher
geometric knowledge. The first section reviews some of the general issues in
constructing and using models, topological or otherwise, of large-scale space for
way-finding. This discussion is not intended to elucidate any new or startling
truths but rather to define some relatively basic ideas and seek to identify some of
their logical consequences. The second section then considers, in relatively
abstract terms, the construction of route representations and their assimilation into
topological maps. Some empirical attempts at constructing such systems, for
mobile robots or robot simulations, are reviewed in the following section, and the chapter concludes by considering some of the implications for building spatial
representations that can support way-finding in artificial and natural systems.
7.2 The problem of way-finding
A pre-requisite for way-finding is knowing where you are in the world. Knowing
where you are involves being localised—that is knowing what place you are
currently in, and being oriented—knowing where other places are with respect to
you. Constructing a representation for way-finding in a novel environment is
about accumulating information that will allow you to answer these ‘where?’
questions and then solve the travel problem of finding and following a path from
the current location to the desired goal. The ‘where?’ questions are intimately
tied-up with questions of what constitutes a place (or a location), therefore it is
appropriate to begin by attempting to define this fundamental spatial entity.
The concept of place
A common-place understanding of the idea of a ‘place’ is of a particular point or
part of space3. This definition indicates that places gain their identities not
through any sensory quality of the objects that occupy them but by virtue of their
relationships to the other parts of space. These spatial relations can be
considered the indispensable4, or primary qualities of a place since no two places
can have the same spatial relations without merging their identities. Places can,
however, be associated with non-spatial qualities. These are properties that can be
directly determined from sensation such as the shape, colour, and texture of the
surfaces of objects, or by characteristic odours, temperature, and sounds.
However, the non-spatial qualities of a place are dispensable or secondary
attributes since they do not guarantee uniqueness. Two distinct locations that are
3This is the first meaning of the word listed in many dictionaries (see, for instance, “Collins
Dictionary of the English Language”, Collins: London. 1979).
4The distinction between primary and secondary qualities was made by Locke and perhaps originated with Aristotle. Kubovy [75] introduced the terms indispensable and dispensable attribute to more precisely characterise this distinction. Shephard [151] contains a further discussion of spatial representations in this light.
indistinguishable in terms of their secondary attributes nevertheless remain
different places5.
Although local sensory characteristics do not define location they can, however,
play an important role in constructing a model of space. Salient places often do
have distinctive characteristics and by detecting these features the primary spatial
identity can be decided with some degree of certainty.
It is important to note that this distinction between the primary and secondary
characteristics of places, is not a distinction between different types of sensation.
The sensed geometric properties of the local scene, for instance, the depth, slant,
and shape of surfaces can be treated as distinctive sensory patterns with no special
regard given to their spatial content. Considered in this way they constitute
secondary characteristics. Perceptions of depth, shape, etc. can clearly also be
used as a source of information from which spatial relations are constructed.
Regarded in this way they provide the data from which the primary
characteristics of places are determined. Almost any sensory modality can either
supply cues for spatial location, or be treated as a source of local, sensory
patterns. For instance, the high temperature in proximity to a heat source can be
considered as a distinctive secondary characteristic of that place. However, the
temperature gradient away from the source can also be used to indicate distance
from the source and can therefore be treated as a spatial cue.
The non-local nature of spatial invariance presents a difficult perceptual problem.
The navigator must construct information about the spatial relations in the
environment out of its local sensory experience. This problem gives rise to two
fundamental questions that underpin a long controversy in the psychological6
literature. First, what is the nature of the spatial relations that are encoded in cognitive
5This view of the nature of places follows that of O’Keefe
[120] who also attributes it to Newton
and Kant: “The notions of place and space are logical and conceptual primitives which cannot be
reduced to, or defined in terms of, other entities. Specifically, places and space are not, in our
view, defined in terms of objects or the relations between objects. [...]. It is sometimes convenient
to locate a place by reference to objects occupying it [...] but this is only a convenience.” ([120]
p. 86.)
6O’Keefe
[120] gives an excellent review of the history of this debate in the psychological and
philosophical literature.
spatial representations? and, second, what is the role of different types of
sensation in constructing these relations? In considering these questions, we must
first give some substance to the debate by defining more carefully the different
classes of spatial relations in terms of their geometric properties.
Geometric properties of spatial representations
There are various ways of classifying spatial relations. One classification, for
instance, is based on the methods used to derive theorems in geometry. This gives
rise to the distinction between synthetic geometry which is developed from purely
geometric axioms concerning the properties of points, lines, and figures in space,
and analytic geometry which builds on number theory and algebra through the use
of numerical coordinates. An alternative classification is to categorise geometric
theorems according to content. For instance, many of the theorems of geometry
are concerned with magnitudes—distances, areas, angles and so forth—which are
determined with respect to a measure of the distance between two points called a
metric. One way of describing this body of theory is that it concerns those
properties that are invariant under the magnitude preserving, displacement
transformations of rotation and translation. These theorems of metric geometry
allow tests of the congruence (identity of size and shape) and the analysis of rigid
motions of figures. A further set of theorems, however, concern properties that
survive transformations that involve changes in magnitude. For instance, under
affine transformations, angles, distances, and areas can become distorted but other
properties are retained such as that parallel lines remain parallel. Projective
transformations introduce further distortions but preserve still more basic
properties such as the collinearity of points (that is, lines remain lines), and the
notion of one geometric figure lying between two others. Finally, the most radical
topological transformations disrupt all but the most fundamental spatial relations
of connectivity (that is, that adjacent points remain adjacent). In fact, metric
models of space stand at the top of a hierarchy7 of geometric models—metric,
affine, projective, and topological—in which each level necessarily preserves the
spatial relations of every lower level. Hence, metric models incorporate all more
7This hierarchical view of geometry was introduced by the mathematician Felix Klein in a famous
address given in 1872 known as the “Erlanger program”, its relevance to research in cognitive
maps has previously been considered by Gallistel [52].
fundamental relations but topological models do not encode any more detailed
structure. Figure 7.1 illustrates a simple geometric figure undergoing successive
transformations each disrupting spatial relations at a deeper level.
Figure 7.1: A simple geometric figure (a) subjected to successive transformations that preserve affine (b), projective (c), and topological (d) spatial relations.
Theoretically at least, a navigator could construct a representation that describes
the spatial relations of its environment up to any level in the above hierarchy. For
instance, if the model was metric then it would encode (at least implicitly)
estimates of the distances and relative angles between locations. At the other end
of the scale a topological model would encode information solely about which
locations are adjacent to each other. The debate over the nature of spatial
representations for way-finding is generally characterised as being between these
two positions. However, it is important to note that models corresponding to
intermediate geometries are also possible. For instance, a representation that
encoded no magnitude information might yet specify sets of locations as lying on
straight lines (i.e. it could encode certain projective relations).
There is a further important geometric property that does not fit neatly into the
hierarchical model of geometry. This is the property of the sense of a geometric
figure which is preserved under all transformations except those that involve
reflection (Figure 7.2 below). The construction of a useful representation of space
often requires that the sense of spatial relations in the environment is detected and
stored, that is, it demands the ability to distinguish up from down and left from
right (or clockwise from anti-clockwise).
Figure 7.2: The sense relation is disturbed by any transformation involving
reflection.
In the case of metric representations a specific measure of distance must be
established. It seems reasonable to assume that any representing system, natural
or artificial, that seeks to capture the metric spatial relations of its environment
will not employ a metric that is seriously at odds with reality. This rules out many
of the different metrics studied in theoretical geometry that are known to conflict
with the three-dimensional geometry of the real physical world8. Both classical
Euclidean geometry and the non-Euclidean, hyperbolic geometry qualify on these
grounds as possible models. However, given that they are empirically
indistinguishable, in environments on the scale considered here, the Euclidean
geometry is, perhaps, to be preferred on the grounds that it is simpler to deal
with9. Henceforth, therefore, by metric representation we will assume models
based on Euclid’s metric.
A further class of spatial relations are ‘qualitative’ ones such as “A is near to B”
or “B is between A and C”. These relations are sometimes grouped together with
the topological relation of adjacency whereas in fact they are propositional
statements of higher-order geometric relations. This suggests that when
8Gallistel [52] gives an extensive argument in support of this supposition in modelling animal cognitive maps. The issue is perhaps less clear in considering robot systems that navigate in very artificial environments. Here other metrics, such as the ‘city-block’ metric (distance(x1, x2) = Σ |x2 − x1|), may have useful applications.
9 Courant and Robbins [39] (chapter four) contains a good introduction both to this issue and to
the general topic of the classification of geometrical properties.
‘unpacked’ this type of knowledge will show some approximate or ‘fuzzy’
knowledge of quantitative spatial relations. The qualitative relation of
containment represents a further important form of spatial knowledge that is
particularly relevant in constructing models of space that describe containment
hierarchies. This type of knowledge seems to have important psychological
reality for humans [161, 175]. Containment relations are relevant in way-finding
as a means of performing planning at different spatial scales; however, qualitative planning at a more abstract level must at some point be turned into specific
instructions that guide local navigation. In this context planning must be
concerned with inter-object spatial relations—representations of these relations
are therefore the main concern of this chapter.
Representations for way-finding
In keeping with the terminology of earlier chapters, the forms of representation
that might be employed are distinguished here in terms of the mapping they
describe. In chapter one the distinction was made between dispositional
associative mappings, such as stimulus → action (S→A), and representational models such as stimulus → action ⇒ stimulus (S→A⇒S) or stimulus → stimulus (S→S). In the context of acquiring way-finding skills the term stimulus refers to
the characteristics of salient locations in the environment while actions indicate
motor control behaviours (or chains of such behaviours). I have argued in the
previous chapter that dispositional learning will not support general way-finding
skills. Rather, a store of spatial knowledge is required that is more flexible and
task-independent, in other words, some form of model is needed.
In this chapter and the next, methods for constructing topological and metric maps
will be described and contrasted. Before considering this issue, however, some
brief comments are appropriate on possible structures for representing knowledge
of the physical world. Two principal structures will be considered here: graphs
that describe (primarily) topological relations, and coordinate systems that can be
used to describe higher order spatial relations.
Graph representations of space
A graph G is defined by a set of vertices V and a set of edges E, where eij ∈ E is the edge connecting vertices vi and vj. Graphs can either be directed or
undirected, in the former edges can be traversed only in one specified direction, in
the latter case all edges can be traversed in both directions.
A graph is a very powerful model for representing space. For instance, if the
vertices of the graph are identified with salient locations and the edges with the
spatial relations between locations, then the way-finding problem partially
reduces to the soluble problem of graph search—finding a sequence of connecting
edges between start and goal vertices in the graph. Furthermore, the idea of a
containment hierarchy can be easily expressed using a hierarchy of graphs in
which each corresponds to a different physical or psychological scale. For
instance, the graph which models the layout of a room can become a vertex in the
graph of a building which is itself a vertex in the graph of a street and so on. Such
a hierarchy allows planning to occur at different levels of abstraction giving
efficient search for paths between very distant positions.
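For illustration, the core of way-finding over such a model reduces to a standard search problem; a minimal breadth-first search over an adjacency structure is sketched below (the vertex names are arbitrary).

```python
# A minimal sketch of way-finding as graph search: breadth-first search over an
# undirected graph of places; vertex names are arbitrary.

from collections import deque

def find_route(adjacency, start, goal):
    # adjacency: dict mapping each vertex to the set of adjacent vertices
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in adjacency.get(path[-1], ()):
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None                                       # goal not reachable from start

# example: find_route({'A': {'B'}, 'B': {'A', 'C'}, 'C': {'B'}}, 'A', 'C') -> ['A', 'B', 'C']
```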
Coordinate systems
The discovery of analytical geometry, by Descartes, led to the unification of
number theory and geometry. This allowed the continuous, spatially-extended
structure of the physical world to be fully described in abstract, numerical terms.
Such a description is achieved by the use of a coordinate system. A coordinate
system is established by a set of reference lines or points that describe a
coordinate frame. The position of any point in space is then identified by a set of
real numbers (coordinates) that are distances or angles from the reference frame.
The use of co-ordinates greatly enhances the power of a spatial representation
allowing, at least potentially, all the techniques of trigonometry and vector
algebra to be brought to bear on geometric problems.
In constructing spatial representations, graphs and coordinate systems are not
incompatible. For one thing, a metric model can always be contracted onto a
graph by a suitable partitioning of the space, the vertices of the graph could then
be labelled with specific positions relative to the global coordinate frame. A
further possibility is for neighbouring regions of physical space to be modelled
with respect to separate coordinate frames, these frames might then be linked in a
graph model. An atlas of pictorial maps is an example of such a representation.
An extension of this idea is to tag the edges of the graph with the descriptions of
the spatial relations between the coordinate frames at the vertices. This can allow
the reconstruction of spatial relations across frames (across the pages of the atlas
as it were). Chapter eight considers some specific models of this type.
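A minimal sketch of this 'atlas' idea is given below, assuming planar rigid transforms stored on the edges of the graph; all names and values are illustrative. Each edge transform maps coordinates expressed in the destination frame into the source frame, so composing the transforms along a path expresses a point held in a distant frame in the current one.

```python
# An illustrative sketch of the 'atlas' idea: each vertex carries its own planar
# coordinate frame and each edge stores a rigid transform between frames.
# edge_transforms[(a, b)] maps coordinates expressed in frame b into frame a,
# so composing the transforms along a path expresses a point held in a distant
# frame in the current one.  All names and values are illustrative.

import numpy as np

def make_transform(angle, tx, ty):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])                # homogeneous 2-D rigid transform

def compose_along_path(edge_transforms, path):
    total = np.eye(3)
    for a, b in zip(path, path[1:]):
        total = total @ edge_transforms[(a, b)]
    return total

# example: a point known in frame 'C', expressed in frame 'A'
transforms = {('A', 'B'): make_transform(np.pi / 2, 1.0, 0.0),
              ('B', 'C'): make_transform(0.0, 2.0, 0.5)}
point_in_C = np.array([0.3, 0.0, 1.0])                # homogeneous coordinates
point_in_A = compose_along_path(transforms, ['A', 'B', 'C']) @ point_in_C
```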
Having determined some of the constraints on the construction of spatial models
we are now in a position to consider the virtues and vices of alternative forms of
representation that can support the task of way-finding. The next section considers
the acquisition of route-based knowledge and the possibility of assimilating
information of this kind into topological models of environmental layout.
7.3 From routes to topological maps
Constructing route knowledge
An agent exploring its environment will experience a temporally ordered
sequence of local views. The continuity of this experience contains information
about the spatial layout of the world—successive experiences relate to adjacent
locations. The simplest form of model—a straight-forward record of successive
stimuli (S0 → S1 → S2 …)—can therefore be derived directly and easily from
experience. Furthermore, if information is stored concerning the required
movements between each location then a record (S0 → A0 ⇒ S1 → A1 ⇒ S2 → A2 …)
of a retraceable sequential model can be created.
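Such a record might be represented as simply as the following sketch (the names are illustrative), in which the exploring agent logs alternating stimuli and actions and can later replay the log to retrace the route.

```python
# A minimal sketch (names illustrative) of a retraceable sequential model: the
# exploring agent logs alternating stimuli and actions, and the log can later
# be replayed to retrace the same route.

class RouteRecord:
    def __init__(self, first_stimulus):
        self.stimuli = [first_stimulus]               # S0, S1, S2, ...
        self.actions = []                             # A0, A1, ... (one fewer than stimuli)

    def extend(self, action, next_stimulus):
        self.actions.append(action)
        self.stimuli.append(next_stimulus)

    def replay(self):
        # yields (stimulus recognised, action to perform) pairs in order
        for s, a in zip(self.stimuli, self.actions):
            yield s, a
```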
To store and re-use such a model requires that the agent segments its experience
into a set of places (points or regions of finite extent) where control decisions
are made. These places are then linked by records of the sensory information or
motor control programs that will allow the agent to retrace its trajectory between
each pair of adjacent locations. Adopting the language of graph theory, decision
places can be considered as vertices and the intervening links as edges in an
unknown graph of the environment. A sequential model therefore describes a
walk—an alternating sequence of vertex and edge elements—through such a
graph. In general a walk may cross itself or go back on itself arbitrarily often. A
walk in which all the vertices are known to be distinct is termed a path and
corresponds to what is usually described in the navigation literature as route10
knowledge.
To retrace a specific walk or route the environment must be segmented in the
same manner on each occasion. This implies that the agent must possess a set of
segmentation functions that use sensory data and local navigation skills to
perform the following low-level processes (a minimal interface is sketched after this list):
1. Classify any visited point in the environment as belonging either to a vertex or to
an edge11.
2. Detect the exits from a visited vertex to adjoining edges and discriminate12
between different exits (the number of detected exits determines the degree of the
vertex).
3. Orient toward a desired exit.
4. Follow the edge connecting two vertices.
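A minimal interface capturing these four functions might look as follows; the class and method names are my own and serve only to make the required competences explicit.

```python
# A minimal interface for the four segmentation functions listed above; the
# class and method names are assumptions made for illustration.

from abc import ABC, abstractmethod
from typing import Sequence

class SegmentationFunctions(ABC):
    @abstractmethod
    def classify(self, sensors) -> str:
        """Return 'vertex', 'edge' or 'undifferentiated' for the current point."""

    @abstractmethod
    def detect_exits(self, sensors) -> Sequence[int]:
        """Return discriminable exits from the current vertex (its degree is the length)."""

    @abstractmethod
    def orient_toward(self, exit_id: int) -> None:
        """Rotate the agent to face the chosen exit."""

    @abstractmethod
    def follow_edge(self) -> None:
        """Run a local navigation behaviour until the next vertex is reached."""
```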
The problems involved in segmentation are clearly non-trivial. Several attempts
have been made at solving the segmentation problem for qualitative navigation
systems in mobile robots; these systems will be considered further below.
Although route knowledge can support travel along known paths it does not allow
novel or short-cut routes between locations to be determined. As such it is little
more than a chain of dispositional associations. Navigation by route-following
also runs the risk that if the path is lost, through any sort of error, then the
10The concept of route knowledge arises frequently in the literature though definitions vary. The
definition given here is similar to that given by O’Keefe [120] who gives an extended discussion
of navigation by following routes.
11This segmentation function could allow a point to be recognised as belonging neither to a vertex
nor an edge, but simply to intervening, undifferentiated space. In such circumstances the navigator
would then depend on search strategies to move to a position that could be identified with an
element of the graph.
12This discrimination can be absolute, that is, based on sensory characteristics of the exits, or, in
the absence of such characteristics (in a true maze for instance), it can be relative to the last edge
traversed. An example of the latter, for a planar graph, would be to distinguish between exits by
ordering them clockwise starting with the one through which the node was entered.
navigator has no fall-back method other than search by which to get back on
track. This argues for a richer representation in which the information gained
during each trajectory through the environment is integrated into some overall
layout plan. In the metric-free case this means the construction of the true
underlying graph of the environment. Henceforth the term topological map will
be used to describe a representation of this type.
In constructing a topological map the temporal constraint that successive
experiences denote adjacent locations is clearly insufficient. Instead, the agent
must be able to judge when experiences separated by arbitrary time intervals are
generated from a single spatial location. In other words, a fundamental
requirement of map-building is place identification—the ability to uniquely
identify and then re-identify places. Given appropriate segmentation functions
that convert the temporal flow of experience into a walk through the unknown
graph the place identification task can be defined as the problem of determining
as each vertex is entered whether it should be added as a new vertex in the
partially-built graph or should be identified with an existing one.
Constructing a topological map by place identification
It was suggested earlier that the identity of places can be inferred from
comparisons of either primary characteristics—spatial relations—or secondary
characteristics—distinctive local sensory attributes. With both types of
comparison, similarity is taken as evidence in favour of identity and dissimilarity
as evidence against it. Given that any comparison will be subject to errors arising
directly or indirectly from sensing inaccuracies there is always a danger of
making incorrect matches. Furthermore, in comparing secondary qualities there is
the additional problem that false positive matches can arise because places may
not have unique sensory attributes.
There is clearly an important distinction that needs to be made between the
knowledge that is required to construct a model and the knowledge that is
encoded within the stored representation. For instance, a navigator might employ
knowledge of metric relations in constructing a model that encodes only
topological ones. However, one of the strongest arguments for the construction
and use of a topological map rests on the idea that such a representation can be
constructed without recourse to knowledge of higher spatial relations. This
possibility must therefore be given some consideration.
To build a topological map without exploiting further geometric knowledge the
decision as to whether two places are identical must rely primarily on secondary
qualities. This can be demonstrated by considering the problem of mapping an
environment in which all vertices have identical appearance and degree. The
correct graph of such an environment will be indistinguishable, on the basis of
connectivity alone, from an infinite number of other graphs. Figure 7.3 illustrates
this fairly obvious point for graphs of degree two.
Figure 7.3: If all the places in the environment are identical in appearance then the
correct graph cannot be determined on the basis of sensed topological relations—
each of the graphs shown here is indistinguishable on this basis from each of the
others.
Assuming that the environment does contain places that have secondary
characteristics that are locally but not necessarily globally distinctive (a not
unreasonable assumption perhaps) then the task of place identification involves:
1. A comparison of the secondary characteristics.
2. A spatial test in which the adjacency relations of the two places are compared.
That is, neighbouring locations are tested for identity.
The second test is clearly recursive: it demands that identity is established for
neighbouring places which can mean testing the neighbours of the neighbours,
and so on. This process will only be halted by tests of identity that succeed on the
basis of the first test alone. In other words, by testing locations whose secondary
characteristics are known, a priori, to be globally distinctive. The process of
determining and carrying out a suitable series of adjacency tests that will
terminate in this way is called a rehearsal procedure [44, 78].
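The logic of place identification with a rehearsal step can be caricatured as follows. This self-contained sketch uses data structures, cycle handling, and a depth limit that are simplifications of mine rather than a proposal from the literature: secondary characteristics are compared first and, where these are not globally distinctive, neighbourhoods are compared recursively. In the toy example two look-alike 'corner' places are told apart because only one of them adjoins the globally distinctive home place.

```python
# A self-contained caricature of place identification with a recursive
# rehearsal step.  The data structures, the naive cycle handling, and the
# depth limit are simplifications made for illustration.

def same_place(a, b, features, distinctive, adjacency, depth=3, seen=None):
    seen = seen or set()
    if (a, b) in seen:                    # already being compared higher up; assume identity
        return True
    seen = seen | {(a, b)}
    if features[a] != features[b]:
        return False                      # secondary characteristics disagree
    if distinctive[a] or distinctive[b]:
        return True                       # a globally distinctive label settles the matter
    if depth == 0:
        return False                      # undecided; cautiously treat as distinct
    # spatial test: every neighbour of b must be matched by some neighbour of a
    return all(any(same_place(na, nb, features, distinctive, adjacency, depth - 1, seen)
                   for na in adjacency[a])
               for nb in adjacency[b])

# two look-alike 'corner' places are told apart because only one adjoins the
# globally distinctive home place
features    = {'H': 'home', 'P1': 'corner', 'P2': 'corner'}
distinctive = {'H': True, 'P1': False, 'P2': False}
adjacency   = {'H': ['P1'], 'P1': ['H', 'P2'], 'P2': ['P1']}
print(same_place('P1', 'P2', features, distinctive, adjacency))   # -> False
```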
Consideration of the rehearsal procedure gives rise to several questions. First,
how many such globally distinctive places (GDPs) are required to make mapping
a given environment possible? Second, given that rehearsal is a costly process that
involves substantial extra travel to known locations, how efficient are the possible
rehearsal procedures?
Neither of these questions has been properly addressed in the literature. The
question of how many GDPs are needed to map an environment would seem to
depend both on the underlying graph of the environment and on the prevalence of
locally distinctive sensory characteristics. Kuipers and Byun [78] state that a
single unique place or ‘home’ is sufficient to eliminate false positive matches in
constructing a graph. However, this may not always be the case. For instance,
figure 7.4a illustrates an environment isomorphic to the complete graph K4 in
which one vertex is globally distinctive (shaded black) but the remaining three are
indiscriminable (shaded grey). It appears that no sequence of actions in this graph
will allow it to be discriminated from the graph with three self-connected
vertices13 (figure 7.4b).
13Assume that the graph is embedded in the plane and that the edges at any vertex can only be
distinguished by ordering with respect to the last edge traversed. At any vertex, then, the map-builder has the choice of taking either the left or right exit or returning along the edge by which it
entered. To discriminate between the two graphs therefore requires that there exists a sequence of
such actions that generates a different sequence of sensory experiences in the two graphs. In
computer simulation using sequences of up to 24 actions no such discriminating sequence has been
found. Unfortunately, I have been unable to determine a simple proof that no such sequence can
exist for this graph although intuitively this seems a reasonable proposition.
Figure 7.4: A fully-connected graph of four vertices (a) which has only one globally distinctive vertex. This graph is indistinguishable by any rehearsal procedure from a graph with three self-connected vertices (b).
The graph in figure 7.4a can be reconstructed, however, if at least one other
vertex is locally or globally distinctive. The question of whether there are graphs
that cannot be mapped given two or more GDPs is unresolved. To some extent,
however, the existence of such pathologically unmappable environments is
largely academic. In environments with such sparse sensory features and very few
GDPs it is evident that every call to the rehearsal procedure will have to go to
considerable depth. The cost of this process will therefore scale rapidly with the
size of the graph making map-building a very inefficient process.
Dudek et al. [44] have considered an interesting and related problem of
constructing the graph of an environment lacking distinctive secondary
characteristics by using portable markers.
Their work, which assumes
appropriate segmentation functions, demonstrates that a map-builder with one or
more globally distinct markers (that can be both put down and picked up) can map
any graph in a sequence of moves that is polynomial in the number of markers
and graph vertices. Furthermore, simulation studies by these authors show that although the costs of this rehearsal procedure are lowered considerably if the map-builder has several retrievable markers (rather than just one), these overheads are, nonetheless, still large.
The arguments of the preceding paragraphs suggest that, to avoid lengthy and
possibly non-terminating rehearsal behaviour, topological map-building systems
must either make use of place markers or assume that many locations will be
discriminable on the basis of sensory characteristics alone. Failing that they must
exploit knowledge of higher spatial relations.
7.4 Practical approaches to topological map building
The previous section has described in abstract the necessary mechanisms for
constructing a topological map of an environment without relying on the detection of metric spatial relations. It has been argued that the agent must have perceptual and motor mechanisms that can generate a segmentation of the environment, re-identify many locations in a reliable and robust fashion from their secondary
characteristics alone, and disambiguate any locations that are not globally
distinctive by a rehearsal procedure. How realistic are these tasks? To a large
extent this is an empirical issue that is best answered by proposing candidate
solutions then evaluating their success on a robot platform or, failing that, in a
robot simulation. This section therefore considers some of the work in qualitative
navigation relating to this issue and attempts to draw some preliminary
conclusions.
In considering approaches to the construction of topological maps it is possible to
distinguish two alternative methods of segmenting the environment to form the
elementary components of a graph model. The first is to associate the vertices of
the graph with distinctive zero-dimensional points in space and the edges with a
network of distinctive one-dimensional paths connecting these positions.
Alternatively the vertices can be associated with extended regions of space and
the edges with the boundaries between regions. These boundaries will generally
be determined by topographic features, for instance by landmarks and physical
barriers, thus covering the environment with a tessellation of contiguous
irregularly-shaped areas. The two forms of graph are closely related—for any
abstracted tessellation there is a dual which is a network and vice versa. However,
there are some practical differences in the way each form of segmentation can be
performed and, as importantly, exploited. The latter differences arise because in a
network the agent is almost always in transit between places on a defined path,
whereas in a tessellation transitions are instantaneous, paths are not specifically
designated, and the agent is always either in one place or another.
A number of recent robot navigation systems are based on the network model [71,
72, 78-81, 92-94, 157]. Here two of these systems are described in some detail in
order to highlight some of the strengths and weaknesses of this approach.
Tessellation models have been less widely studied; however, Levitt and Lawton [84] have described and simulated a system of this type, which is also considered below.
The ‘NX’ robot simulation
Kuipers and Byun [78-81] describe a robot simulation ‘NX’ designed to
implement the theory of a ‘spatial semantic hierarchy’ described earlier. As this
theory suggests, NX constructs spatial knowledge at multiple levels; however,
here we will focus on the procedures implemented to construct a topological
representation in the form of a network map.
NX operates in a two-dimensional environment of corridors and rooms using
simulated sonar and compass senses. The robot identifies places and paths during
exploration in terms of a set of pre-defined ‘distinctiveness’ measures. In the
simulations described, the principal measure employed is a function “Equal-Distance-to-Near-Objects” that computes a single scalar value from the distance
data returned by a ring of sonar sensors. This function generates zero-dimensional
peak values at the intersections of corridor-like environments, and ridges of
locally maximal value along the mid-lines of passages. The former are therefore
defined to be the vertices of the constructed graph and the latter the edges. Since
many points in the environment do not lie on the graph, the robot relies on a hill-climbing search strategy to find distinctive places and relocate paths.
The number of directions of free travel surrounding the robot at a distinctive place
is estimated using sonar readings. These exits are then discriminated from each
other by using a local co-ordinate frame defined by an absolute compass. Edges
are followed using local navigation behaviours. For instance, a corridor-following
behaviour is used to track the mid-line between two walls and a wall-following
behaviour to move along the boundary wall of a room.
To perform place recognition, vertices are distinguished by the pattern of local geometry detected by sonar. Similar vertices are disambiguated using metric information accumulated by dead reckoning and through the use of a shallow rehearsal procedure that tests the identity of nearby places up to a depth of one transition in the graph.
The Michigan robot
Researchers at Michigan University [71, 72, 83] have begun investigating
topological map-building for a mobile robot platform with vision and sonar
sensors which navigates in a corridor environment. The robot builds a network
model of space based on the notion of gateways, which are defined as transition points between distinct areas of free-space. Typically, gateways are positions such
as corridor intersections and doorways that are identified by characteristic sonar
patterns. The robot constructs both route knowledge and a separate representation
of the topological map. Each vertex in this map is identified with one or more
gateways. The areas of free-space between gateways constitute the edges of the
graph. However, the lack of a precise definition for edges means that the robot
relies on environment structure to stay on a path14.
The Michigan robot uses vision as its primary sense for place identification. At
each gateway a local view representation is constructed by storing arrays of visual
patterns obtained by rotating to a small number of orientations while remaining in
a fixed position. By matching the current visual scene with the pattern recorded in
one of these segments the robot is able to re-orient itself at a gateway. The overlap
in sensory patterns between adjacent segments can also be used to determine the
required direction of turn to face in any desired orientation. Exits are defined by
areas of free-space detected by sonar as adjacent to the gateway. Each exit is
linked to the sensory pattern stored when the robot is facing in that direction.
Thus to choose a particular exit the robot orients toward the appropriate segment
of the stored local view.
To perform place identification the Michigan robot attempts to match the view at
any gateway with records of previously observed local views. If necessary it
rotates itself to observe the scene at different view orientations in order to check
for matches. A running estimate of the likelihood that two gateways correspond to a unique place is maintained, and when this ‘hypothesis strength’ exceeds a threshold the robot assumes that the place is unique and assigns it to a vertex in
the topological map. As in the NX simulation the robot also uses a shallow
rehearsal procedure to disambiguate similar locations. Local distance information
14In the experiments described in [72] edge-following behaviour has yet to be developed and the
robot is led by hand through the environment.
obtained by dead reckoning and stereo matching is also exploited to some degree
in determining the adjacency relations of places.
The ‘Qualnav’ robot simulation
The Qualnav simulation developed by Levitt and Lawton [84] constructs a multi-level representation of large-scale space that, like the NX system, can be viewed as an instance of Kuipers' [81] ‘semantic hierarchy’ theory. Qualnav comprises a metric-level model which, in situations of degraded or absent metric data, and under some strong assumptions about visual scene analysis and object recognition skills, can generate a tessellation model of topological spatial relations. This
review will focus on the elements of the model relating to the construction of the
topological map without reliance on metric knowledge. The proposals for metric
map construction and use will be considered in the next chapter. The topological
mapping system is able to construct a model, and plan and follow routes, on the
basis of detected adjacency and sense relations, and qualitative judgements of
visual angle.
Qualnav is an attempt to address the problem of navigation in open, non-urban
environments. It therefore does not attempt to contract the world onto a finite number of places and paths; rather, it seeks to define a segmentation of space into
finite regions separated by virtual boundaries. To do this Qualnav uses vision to
identify landmarks which are defined as prominent topographic features with
distinctive secondary characteristics of shape, colour, etc.
The basic strategy of Qualnav is to divide up the world using ‘landmark pair boundaries’ (LPBs): these are virtual lines connecting each pair of identified landmarks in a given scene. The observed left-right ordering of a landmark pair
(as the robot faces toward them) defines the current location as being on either
one side or the other of this boundary. The LPBs from multiple landmark pairs
generate a segmentation of space into the regions that make up the tessellated
graph. The current ‘place’ is therefore defined as the bounded region formed by
the intersections of the most proximal LPBs. This segmentation is illustrated in
Figure 7.5 for an environment of five distinctive landmarks. Each region in this
tessellation has a unique description in terms of the left-right ordering of each of
the visible landmark pairs. Note that without range-sensing the boundaries
between regions can only be detected as they are crossed, and it is impossible to
say prior to crossing which boundaries are the most proximal.
Figure 7.5: Segmentation in the Qualnav model (adapted from Levitt and Lawton
[84] p. 327).
A route description consists of a list of desired LPB crossings. An LPB can be
crossed between the two landmarks, or on the left or right side of both landmarks.
In the former case orientation is determined by moving in such a way as to create
a qualitative increase in the visual angle between the two landmarks, in the latter
case by moving so as to decrease the visual angle up to the point where one
landmark occludes the other. A further possibility is for the robot to head directly
towards a landmark. A production system rule-base can be used to optimise a
route by combining certain pairs of desired LPB crossings into more efficient
ones.
The Qualnav model assumes that landmarks will be correctly identified from different viewing positions and re-identified on subsequent visits; however, it does not assume that any single landmark is necessarily re-acquired. Failure to re-identify any single landmark will reduce the set of LPBs that can be constructed; however, the remaining LPBs may still provide an adequate, though less refined, localisation of the robot.
Discussion
In both of the network systems the ability to create and maintain a mapping
between the world and the graph clearly depends on the environment being highly
structured. Segmentation assumes a world in which physical barriers constrain
travel to a finite number of narrow paths and the distinction between intersections
and connecting paths is reasonably clear-cut. In more open environments with
large areas of traversable space the partitioning of the world would be more
arbitrary, harder to define and then maintain. NX gets around this problem to a
limited extent by treating more open areas (for instance any large room) as a
circuit of paths each with a single boundary wall. Less drastic methods for
mapping an open space to multiple network graph elements are clearly possible.
However, the difficulties in identifying and re-identifying particular places and
paths in an open area may render the idea of a unique contraction onto a network
impractical. Furthermore, the constraint of limiting travel to a finite number of
‘virtual’ paths between fixed choice points will become increasingly inappropriate
in more open settings.
The Qualnav model demonstrates that by ‘flipping’ into the alternative
tessellation mode of graph description a useful segmentation can be achieved in
open terrain. This model also successfully relaxes the arbitrary constraint of
following a finite number of paths. It would be interesting to see if the tessellation approach could be applied successfully in more structured environments such as buildings.
Place recognition and orientation
The NX simulation uses local sonar patterns to characterise different places.
However, this sort of local geometry information is not likely to be globally
distinctive, hence the need to use dead reckoning to make the place identification
task possible. The lack of sensed secondary characteristics also makes it necessary
to use compass information to distinguish the exits from a vertex. The model
consequently relies on metric information to create the topological map and, to
this extent, abstracts the topological map from knowledge of higher geometric
relations. The need to exploit metric knowledge demonstrates the difficulty of
topological mapping with only sparse sensory data.
The Michigan robot is more ambitious in its attempts to build a network map
using primarily topological and sensory data. By using vision this system allows
for the possibility of exploiting visual pattern matching and object recognition for
the task of place identification. The system also constitutes a commendable
attempt at performing the orientation task without exploiting explicit direction
sensors. The emphasis in this research on relatively complex sensory processing
amply demonstrates the need for rich descriptions of invariant secondary
characteristics if the use of higher geometric knowledge is to be avoided. These
descriptions must be robust to superficial and minor changes in sensation (due, for
instance, to changing lighting conditions, moving objects, etc.) but at the same
time be sufficiently unique to justify an assumption of global distinctiveness for
many of the places that are encountered. The Qualnav simulation similarly
demonstrates the need for a sophisticated vision front-end to provide reliable
landmark acquisition and identification.
Conclusion
Topological maps
The nature of the spatial relations encoded by a world model determines the type
of navigation behaviour that can be supported. Procedural (or route) knowledge
can only support movement along segments of known paths. Knowledge of the
topological layout of the environment gains the navigator the ability to identify
suitable sub-goals, and generate and follow novel routes between locations.
However, because this knowledge is limited to knowing the connectivity of places, navigation is constrained to using known path segments between adjacent sub-goals. A navigator with a topological map who enters unfamiliar territory can explore new paths and construct new map knowledge but cannot engage in goal-directed movement to target sites. The ability to determine short-cut or straight-line routes across unexplored terrain requires knowledge of higher-order spatial relations. Where such behaviours are observed in animals this is usually taken as strong evidence for the use of a metric model. That such skills would be very useful to an animal or robot is undeniable, giving a strong incentive for constructing and using knowledge of this type.
Given the value of metric knowledge, is there a justification for constructing only a weaker form of spatial representation? So far this chapter has considered one possible argument for this view—the idea that a topological map could be constructed without the need to detect higher geometric relations. Mathematically, topological geometry is simpler and more basic than metric geometry—it requires fewer axioms and specifies weaker constraints. I have argued here that this
mathematical simplicity belies the real difficulties of constructing topological
knowledge in the absence of metric knowledge. I have suggested that such a
model is in general realisable only if the agent has sensory abilities that can be
relied on to give accurate identification and re-identification of most locations,
and that in practice this may require vision skills capable of object recognition or,
at least, of very robust pattern matching. If this view is correct then this ‘strong’
topological map hypothesis fits uneasily with the bottom-up bias of Animat AI,
and, indeed, with the current perceptual capabilities of most robots.
The problems of constructing a topological map are eased considerably by
introducing additional constraints to the map-building process. One possible
constraint is to limit the behavioural repertoire of the robot. One way this might
be achieved is to force the robot to maintain a travel-path that follows object
boundaries [37, 92, 112]. This constraint of ‘wall-following’ assists substantially
both in segmentation and place identification by limiting the maximum degree of
vertices in the graph, reducing the number of true choice points (i.e. of vertices of
degree>2), and by avoiding the need to represent places that lack distinctive local
geometric structure. This approach has led to some successes in building way-finding systems for indoor autonomous robots; however, it also has an obvious cost—open spaces will be poorly represented in the map and movement will be more rigidly limited to a small set of paths.
In general Animat systems have also not been entirely metric-free. The use of
some approximate metric knowledge underlies what might be called the ‘weak’
topological map approach—the possibility that metric information might be
computed and exploited in map-building, but might not actually be explicitly
recorded or used for way-finding. Given the advantages that metric knowledge of any sort can endow, the main justification for this proposal must be that the cost or complexity of building (or using) such a representation would outweigh its
usefulness. That there may be methods that are robust to noise yet relatively
simple to compute and use will be argued in the next chapter.
The semantic spatial hierarchy
A further approach which has been considered in this chapter is expressed in
Kuipers' theory of the ‘semantic spatial hierarchy’. This theory advocates
construction of both topological and metric models. However, although it allows
for some exceptions in the direction of information flow it still presents a
relatively strong claim of a progressive and incremental increase in structure and
geometrical richness. Indeed, Kuipers [78] explicitly argues against an alternative
hierarchy in which the metric level precedes the topological one.
I suggest that this hierarchical view contains the valid insight that robust
navigation requires separate systems that encode information concerning
independent spatial constraints. However, the hierarchical model confounds the
need for multiple systems with the idea that these systems should only be
distinguished in terms of the classes of geometric relations they encode. The
hierarchical view also weakens the value of having such distinct systems by
insisting on some form of dependence of higher levels on those below. I will
argue in the next chapter that distinctions other than those based on geometric content may be as important, or more so, in distinguishing different representations for way-finding. I will also suggest that what is really required is independent
pathways for the determination of spatial constraints, in other words, that a
hierarchical approach is not justified or desirable.
Chapter 8
Representations for Way-finding:
Local Allocentric Frames
Summary
This chapter describes a representation of quantitative environmental spatial
relations with respect to landmark-based local allocentric frames. The system
works by recording in a relational network of linear units the locations of salient
landmarks relative to barycentric coordinate frames defined by groups of three
nearby cues. The system can orient towards distant targets, find and follow a path
to a goal, and generate a dynamic description of the environment aligned with
the agent’s momentary position and orientation. It is argued that the robust and economical character of this system makes it a feasible mechanism for way-finding in large-scale space.
The chapter further argues for a heterarchical view of spatial knowledge for way-finding. It proposes that knowledge should be constructed in multiple
representational ‘schemata’ where different schemata are distinguished not so
much by their geometric content but by their dependence on different sensory
modalities, environmental cues, or computational mechanisms. It thus argues
against storing unified models of space, favouring instead the use of run-time
arbitration mechanisms to decide the relative contributions of different
constraints obtained from active schemata in choosing appropriate way-finding
behaviour.
8.1
Introduction
The focus of most of the research in robot navigation over the past two decades
has been on the task of path-planning using quantitative models of space. Much of
this work (e.g. [19, 89]) is concerned with planning optimal, collision-free
trajectories using an accurate metric map of spatial layout and a model of the
shape and dynamics of the robot. Based on the assumption that such a map is
needed, research on the acquisition of quantitative spatial knowledge has centred on the task of constructing an appropriate, detailed metric model in which object
positions and shapes are described with respect to a global coordinate frame. Most
of these models have either assumed accurate sensing, sought to correct errors in position estimates during map-building (e.g. [40]), or incorporated estimates of spatial error explicitly in the map (e.g. [46, 95, 109, 155, 156]). To facilitate search, the
map is also segmented to form a graph either by imposing a grid [46, 109], or by
determining an appropriate tessellation or network model [19, 33, 34, 40, 64, 135,
136, 174]. A distinctive characteristic of this research, which is shared with many
of the more qualitative approaches, is the emphasis on combining all the available
information into a unified global model. This chapter suggests that it is this
characteristic, perhaps more than any other, that contributes to the inflexibility of
these systems and to their over-sensitivity to measurement error. It is proposed
instead that different sources of spatial constraint are used to construct multiple,
relatively independent15, representational schemata—often encoding only the
local spatial layout, from which large-scale relations are determined as needed.
The first section of this chapter considers the general problem of constructing a
quantitative model of space for way-finding in natural or artificial systems. The
following four sections then describe a preliminary simulation of a robot way-finding system that uses a network architecture to relate multiple spatial models based on local coordinate frames. This system, which is also described in [132], builds on research by Zipser [191-193] and is related to a method of metric map
construction proposed by Levitt and Lawton [84]. The final section then attempts
15In a sense to be defined below.
to generalise from some properties of this system to an overall approach to the
problem of constructing and using spatial representations for way-finding.
8.2
Quantitative representations of space
The task of building a representation of an environment that encodes distance and
direction is very different from the procedures outlined in the previous chapter for
constructing a metric-free model. To detect metric relations, places must be
located with respect to common coordinate frames. In determining appropriate
frameworks a significant problem is immediately apparent. The coordinate frames
most directly available are egocentric, that is, they are defined by the navigator’s
instantaneous position and orientation in space. However, to establish invariant
relations over a large-scale environment egocentric relations will evidently not
suffice. This follows since first, by definition, such an environment cannot be
entirely viewed from any single position, and second, movement of any sort will
alter the viewer’s egocentric frame. The necessary solution to this problem is that
observations from different view-points must be integrated into representations
with respect to environment-centred or allocentric frames. That is, egocentric relations must be transformed to give allocentric spatial relations.
To construct a model of metric spatial relations during a trajectory through an
environment therefore requires at least the following:
1. The establishment of a suitable allocentric frame.
2. The determination of the current relation of the self to this frame (irrespective of
ego-motion).
3. The determination of the relation of salient places, first to the self, and thereafter
(by suitable displacement transformations), to the allocentric frame.
If position estimates are accurately determined with respect to an allocentric frame then the task of constructing layout knowledge by place identification disappears—places that have the same coordinates with respect to the global frame are identical. Generally, however, we wish to assume error, possibly of a
large or cumulative nature, in the estimates of spatial position. In this case, the
identification problem still exists though now there are substantial additional
constraints provided by estimates of spatial position. There is therefore less need,
than in the metric-free topological case, for the accurate perception of invariant
secondary characteristics, or for the costly fall-back strategy of rehearsal. Place
matching not only serves to determine a cycle in the sequence of observed
locations but can also play an important role in the correction of spatial errors—
by obtaining ‘fixes’ with respect to landmarks of known position the navigator
can correct the errors in the estimated positions of itself and of recently visited
places.
The problems that arise in constructing a metric model concern the difficulty of
obtaining good distance information (or failing that, dealing with noisy
information), and the resource demands of the above processes, in particular, the
need for continuous transformations between egocentric and allocentric frames.
This chapter considers way-finding representations and mechanisms that can
address the latter issue. Specifically, a distributed representation is proposed in
which the environment is modelled using a network of local coordinate frames.
The computations required to construct the representation from egocentric metric
data require only linear mathematics and have memory requirements roughly
proportional to the number of goal-sites and landmarks stored. The task of
determining direction or route information from the model is performed in
parallel giving search times roughly linear in the distance to the goal. During
route following the system exploits run-time error-correction by incorporating
egocentric fixes on sighted landmarks. This means that the route following system
is highly robust to noise in the spatial representation.
A frame-based model can represent the true continuous nature of physical space in which every zero-dimensional position is a potential goal or sub-goal. However,
the idea that every point will be explicitly represented is clearly naive. A viable,
compact, non-graphical representation of space will not describe the content of
space at every point16, rather, it will explicitly represent only those positions that
16Gallistel [52] makes a similar point by contrasting different computer representations of a
graphical image. One possibility is to record the colour of every pixel in the image. This results in
a representation that is expensive in memory use and opaque with respect to further image
processing—the salient objects in the image are not distinguished in any way from the surround.
In contrast an object-based picture description language like “Postscript” explicitly encodes the
structure in an image by representing lines, curves, etc. This representation is generally both more
memory-efficient and more accessible for further processing.
are salient as targets or landmarks. This notion has not always been appreciated in
psychological studies of human cognitive maps. In particular, many early studies
did not distinguish the hypothesis of an underlying metric model from the idea of
a picture-like representation. Consequently, experiments that found errors in
human spatial knowledge—distortion, gaps, holes, fault-lines, and asymmetries—
were interpreted as strong evidence against the use of metric representations in
human way-finding. However, although there is little support for the idea of
cognitive spatial representations that are isomorphic to graphical maps17 (see, for
instance, [77]) this does not constitute a falsification of the view that humans
construct or use metric models. Indeed, there is strong evidence that metric
representations of a non-graphical nature form an important element of human
way-finding competence [96, 146, 147].
As was indicated in the last chapter there is also substantial evidence that other
animals construct and use metric spatial representations. This is indicated by,
among other things, the ability observed in many species to find novel or straight-line routes to distant positions outside the current visible scene (see [52, 119, 120]
for reviews). There is some controversy, however, over how animals use
perceptual data to construct the metric model. Several researchers [52, 97] have
argued that the principal sources of position information are dead-reckoning
skills—integrating changes in position from sensory signals—and 'compass'
senses—determining orientation by using non-local cues such as the sun, or by
sensing physical gradients such as geomagnetism. These skills are used to
maintain an estimate of current position and heading relative to the origin (e.g. the
nest) of a global allocentric frame. Places are coded in terms of the distance and
direction from this origin or from single landmarks whose positions in the global frame are already stored. Following McNaughton et al. [97] such a mapping
scheme will be referred to here as a vector coding system.
An alternative view, proposed for instance by O’Keefe [119, 120], is that a place
is encoded in terms of the array of local landmarks visible from that position. In
other words, that the cognitive spatial representation stores the locations of
17An exception is where humans are explicitly trained by showing them graphical maps, in this
case an image-like memory of the map may be retained and used in way-finding. Spatial
knowledge acquired by exploring an environment is, however, of a very different nature [147].
potential goals in an allocentric coordinate frame determined by a group of (two
or more) salient cues. Such a system is here called a relational coding.
This chapter is not concerned with the empirical question of which coding
method is used in animal cognitive maps. Indeed, there seems no particular reason
to believe that any one method will be relied upon to the exclusion of others. Rather,
since robust navigation skills are critical to survival, multiple, relatively
independent coding systems may be employed. In other words, for any specific
navigation task, both vector and relational solutions may be computed and
behaviour chosen to be as consistent as possible with all available constraints (this
proposal will be considered further below).
The goal of the chapter is instead to consider the idea of relational models, that do
not rely on dead-reckoning or compass senses, in more detail. It will be proposed
that such methods can provide robust support for way-finding without being
expensive in terms of computational power or memory and without requiring
complex coding mechanisms. It will argued these properties encourage the view
that similar mechanisms might be employed in animal navigation, or could
support way-finding for an autonomous mobile robot.
8.3
Relational models of space
The task of navigating a large-scale environment using relational methods divides
into three problems: identification and re-identification of salient landmarks;
encoding, and later remembering, goal locations in terms of sets of visible local
cues; and finally, calculating routes between positions that share no common
view. The first task, landmark identification, has been considered elsewhere both
from the point of view of animal and robot navigation systems [84, 119, 191,
193]. In this chapter landmarks are taken to be objects (or parts of objects) with
locally distinctive secondary characteristics that can be identified with zero-dimensional locations in egocentric space. The agent is assumed to be able to detect
suitable landmarks and determine, at least approximately, their positions relative
to itself. While realising that this constitutes a tremendous simplification of the
map-building problem, the justification is to allow the logically distinct problems
of encoding and remembering goals in local landmark frames to be investigated,
and to consider the use of such representations for large-scale way-finding tasks.
Since metric relations will be recoverable from the stored model the landmarks
need not be globally distinctive, hence there is less need for the rich descriptions
of landmark characteristics that are required for metric-free topological mapping.
The issues arising from inaccurate perceptual data and discriminating ambiguous
landmarks are ongoing research topics and will be considered further toward the
end of the chapter.
Proposed relational codings
A proposal for a relational coding system, inspired by empirical studies of ‘place’
cells in the hippocampus of the rat, has been provided by O'Keefe. In the most
recent version of this model O’Keefe [118, 119] suggests that the rat brain
computes the origin and orientation of a polar coordinate frame from the vectors
giving the egocentric locations of the set of salient visible cues. Specifically, he
proposes that these location vectors are averaged to compute the origin (or centroid) of the polar frame, and that the gradients of vectors between each pair of
cues are averaged to compute its orientation (or slope). Goal positions can then be
recorded in the coordinate system of this allocentric frame in a form that will be
invariant regardless of the position and orientation of the animal. This idea is
illustrated in figure 8.1.
Figure 8.1: O’Keefe’s proposal for an allocentric polar coordinate frame defined
by ‘centroid’ and ‘slope’ measures determined from the positions of local
landmarks in the agent’s egocentric coordinate frame. The arrows indicate two
possible viewing positions. (adapted from O’Keefe [119] p. 281)
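The kind of computation O’Keefe proposes can be sketched in a few lines; the fragment below is only an illustration (it is not taken from his model), assuming one particular reading of how the pairwise gradients are averaged—here as the mean direction of the difference vectors taken in a fixed order—and using arbitrary cue coordinates.

```python
import numpy as np

def centroid_and_slope(cues):
    """Illustrative sketch of the centroid/slope scheme.  cues is an (n, 2)
    array of landmark positions in the agent's egocentric frame."""
    cues = np.asarray(cues, dtype=float)
    centroid = cues.mean(axis=0)                 # average of the location vectors
    # Average the directions of the vectors between each pair of cues.  The
    # result depends on the (arbitrary) order in which the cues are paired --
    # the difficulty discussed in the text below.
    angles = [np.arctan2(*(cues[j] - cues[i])[::-1])
              for i in range(len(cues)) for j in range(i + 1, len(cues))]
    slope = np.mean(angles)
    return centroid, slope

# Five hypothetical cues seen from one viewpoint.
print(centroid_and_slope([(1, 2), (3, 1.5), (2.5, -1), (-0.5, 0.5), (0, -2)]))
```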
However, there are problems with this hypothesis. Firstly, the computation of the
slope is such that the resulting angle will differ if the cues are taken in different
orders18. Since any ordering is essentially arbitrary, a specific sequence will have
to be remembered in order to generate the same allocentric frame from all
positions within sight of the landmark set. Secondly, as landmarks move out of
sight, are occluded by each other, or new ones come into view, the values of the
slope and centroid will change.
Rather than changing the global frame each time a landmark appears or
disappears it seems more judicious to maintain multiple local frames based on
subsets of the available cues. These would supply several mutually-consistent
encodings making the mapping system robust to changes in individual landmarks.
The use of multiple local frames has been proposed by Levitt and Lawton [84].
They observe that the minimum number of landmarks required to generate a
coordinate frame is two (in two-dimensions, three in 3D). They also provide a
useful analysis of how the constraints generated by multiple local frames can be
combined, even in the presence of very poor distance information, to provide
18This arises because the gradient of a line is a scalar function with a singularity at infinity when
the line is parallel to the y axis. Hence in order to average gradients this must be done in terms of
angles or vectors, in which case the order in which the points are compared is important.
robust location estimates. To calculate, from a novel position, a goal location that
has been encoded in a two-landmark frame requires non-linear computations
(trigonometric functions and square roots). It also requires that an arbitrary
ordering of the two landmarks is remembered in order to specify a unique
coordinate system.
Zipser [193], who had earlier considered a landmark pair method [192], points out
that if one more landmark is used to compute the local frame then all the
calculations required become linear. In fact, all that is required to encode a goal
location using three landmarks (in 2D, four in 3D) is that one constant is
associated with each cue. Zipser called these constants “β-coefficients”; they are, however, identical to the barycentric19 coordinates that have been known to mathematicians since Moebius (see for instance [50]). The system for large-scale navigation described below uses this three-landmark method and it is therefore described in more detail in the following section. In the remainder of the chapter the navigation problem will be considered as two-dimensional; however, the
extension of these methods to three dimensions is straightforward.
Barycentric coordinates
Figure 8.2 shows the relative locations of a group of three landmarks (hereafter
termed an L-trie) labelled A, B, and C, seen from two different viewing positions
V and V′. A goal site G is assumed to be visible only from the first viewpoint.
19This term, originally used in physics, is derived from “barycentre” meaning “centre of gravity”.
Figure 8.2: relative positions of three landmarks and a goal from two viewpoints
V and V'. (adapted from Zipser [193] p.461)
"
The column vectors $\mathbf{x}_i = (x_i, y_i, 1)^T$ and $\mathbf{x}'_i = (x'_i, y'_i, 1)^T$ give the location in homogeneous coordinates of object $i$ in the egocentric frames centred at $V$ and $V'$ respectively. The two frames can therefore be described by the matrices

$$X = [\mathbf{x}_A \;\; \mathbf{x}_B \;\; \mathbf{x}_C], \qquad X' = [\mathbf{x}'_A \;\; \mathbf{x}'_B \;\; \mathbf{x}'_C]. \tag{8.1}$$

If the three landmarks are distinct and not collinear then there exists a unique vector $\boldsymbol{\beta} = (\beta_A, \beta_B, \beta_C)^T$ such that

$$X\boldsymbol{\beta} = \mathbf{x}_G \quad \text{and} \quad X'\boldsymbol{\beta} = \mathbf{x}'_G. \tag{8.2}$$

In other words, by remembering the invariant $\boldsymbol{\beta}$ the egocentric goal position from any new viewing position $V'$ can be determined by the linear sums

$$x'_G = \beta_A x'_A + \beta_B x'_B + \beta_C x'_C, \qquad y'_G = \beta_A y'_A + \beta_B y'_B + \beta_C y'_C, \qquad (1 = \beta_A + \beta_B + \beta_C). \tag{8.3}$$

Note that since each constant is tied to a specific cue the ordering of the landmarks is irrelevant.

The $\boldsymbol{\beta}$ vector can be determined directly by computing the inverse matrix $X^{-1}$ since

$$\boldsymbol{\beta} = X^{-1}X\boldsymbol{\beta} = X^{-1}\mathbf{x}_G. \tag{8.4}$$
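As a concrete illustration (not part of the simulations reported here), equations (8.1)–(8.4) amount to a few lines of linear algebra; the landmark and goal coordinates below are hypothetical, and the second view is simply the first translated, standing in for the agent having moved.

```python
import numpy as np

def frame_matrix(a, b, c):
    """Homogeneous frame matrix X = [x_A x_B x_C] of equation (8.1)."""
    return np.array([[a[0], b[0], c[0]],
                     [a[1], b[1], c[1]],
                     [1.0,  1.0,  1.0]])

def beta_coefficients(a, b, c, g):
    """beta = X^-1 x_G, equation (8.4)."""
    return np.linalg.solve(frame_matrix(a, b, c), np.array([g[0], g[1], 1.0]))

def locate_goal(beta, a, b, c):
    """Recover the egocentric goal position from new cue positions, eq. (8.3)."""
    return (frame_matrix(a, b, c) @ beta)[:2]

# From viewpoint V: landmarks A, B, C and goal G (hypothetical coordinates).
beta = beta_coefficients((2, 5), (6, 4), (4, 1), (5, 3))
# From viewpoint V' the agent has moved by (3, 2), so every egocentric
# position is shifted by (-3, -2); the stored beta still recovers the goal.
print(beta)                                            # ~ (0.071, 0.571, 0.357)
print(locate_goal(beta, (-1, 3), (3, 2), (1, -1)))     # ~ (2.0, 1.0) = G - (3, 2)
```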
Though this inverse calculation uses only linear maths, the value of the β-encoding as a possible biological model has been questioned on the grounds of its apparent mathematical complexity [189]. However, Zipser points out that an even simpler computational mechanism is possible by allowing the β values to converge to a solution under feedback. This can be viewed as adapting the connection strengths of a linear learning unit by gradient descent. A network architecture that instantiates this mechanism is as follows. The network consists of two types of simple processing unit. The first are object-position units (object-units hereafter) whose activation represents the locations in egocentric space of specific goal-sites and salient landmarks. These units receive their primary input from high-level perceptual processing systems that identify goals and landmarks and determine their positions relative to the self. The second type of processor is termed a β-coding unit. This receives input from three object-units and adapts its connection strengths (the β values) to match its output vector to the activation of a fourth.
An example of this architecture is illustrated in Figure 8.3 which shows a β-coding unit G/ABC that receives the positions of the landmarks A, B, and C as its inputs. The unit adapts its weights $(\beta_A, \beta_B, \beta_C)$ until the output $(x, y, z)$ matches the goal vector $(x_G, y_G, 1)$. The unit is assumed to be triggered whenever all three input nodes are active. Gradient-descent learning is used to adapt the connection strengths. For the weight $\beta_i$ from the $i$th object unit this gives the update rule at each iteration

$$\Delta\beta_i = \eta\left[(x_G - x)x_i + (y_G - y)y_i + (1 - z)\right] \tag{8.5}$$

where the parameter $\eta$ is the learning rate. In general, the network will rapidly converge to a good estimate of the β values.
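A minimal sketch of this learning scheme is given below; the coordinates, learning rate, and fixed iteration count are illustrative assumptions rather than the parameters used in the simulations.

```python
import numpy as np

def beta_unit_step(beta, landmarks, goal, eta=0.01):
    """One gradient-descent update of a beta-coding unit (equation 8.5).
    landmarks: three (x, y) cue positions; goal: the target (x_G, y_G)."""
    X = np.vstack([np.asarray(landmarks, float).T, np.ones(3)])  # columns (x_i, y_i, 1)
    error = np.array([goal[0], goal[1], 1.0]) - X @ beta         # (x_G - x, y_G - y, 1 - z)
    return beta + eta * (X.T @ error)

landmarks, goal = [(2.0, 5.0), (6.0, 4.0), (4.0, 1.0)], (5.0, 3.0)
beta = np.zeros(3)
for _ in range(10000):            # converges for coordinates of this scale
    beta = beta_unit_step(beta, landmarks, goal)
print(beta)                       # approaches the coefficients given by matrix inversion
```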
Figure 8.3: The β-coding unit. A gradient-learning model of the β-coefficient calculation.
In order to provide a further understanding of the β-coding a geometrical interpretation can be given. The coefficient associated with each landmark is the ratio of the perpendiculars from the goal and that landmark to the line between the other two cues. For example, consider landmark A in figure 8.4. The coefficient $\beta_A$ defines an axis that is perpendicular to the line BC and scaled according to the ratio of the two perpendiculars $h_G / h_A$ (this can also be thought of as the ratio of the areas of the triangles GBC and ABC). Taken together the three coefficients
define a barycentric coordinate frame. This coding system in fact records affine
rather than metric spatial relations, hence, another term for the coefficients is
affine coordinates. However, assuming that the agent detects metric egocentric
spatial relations according to some calibrated Euclidean measure, metric
relations—direction and distance—will be recoverable from the stored model.
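This geometric reading also suggests an equivalent way of computing the coefficients directly from signed triangle areas. The short sketch below, using the same hypothetical coordinates as the earlier fragment, agrees with the matrix-inversion route; it is offered only as an illustration of the interpretation just given.

```python
import numpy as np

def signed_area2(p, q, r):
    """Twice the signed area of triangle pqr (the sign encodes orientation)."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (r[0] - p[0]) * (q[1] - p[1])

def beta_from_areas(a, b, c, g):
    """beta_A = area(GBC)/area(ABC), and similarly for B and C."""
    abc = signed_area2(a, b, c)
    return np.array([signed_area2(g, b, c) / abc,
                     signed_area2(a, g, c) / abc,
                     signed_area2(a, b, g) / abc])

print(beta_from_areas((2, 5), (6, 4), (4, 1), (5, 3)))   # ~ (0.071, 0.571, 0.357)
```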
Figure 8.4: Coding a goal position (G) relative to the three landmarks A, B, and
C using barycentric coordinates.
8.4
Modelling large-scale space
I now describe how this coding method can be extended to determine the spatial
relations between points over a wide environment that share no common
landmarks. The essence of the method is to build a two-layer relational network
of object and β-coding units that stores the positions of landmarks in the local
frames defined by neighbouring L-trie groups. The resulting structure will
therefore record the relationships between multiple local frames. Thereafter, the
locations of distant landmarks (and goal sites) can be found by propagating local
view information through this network. Zipser [192] and Levitt and Lawton [84]
have both discussed methods of this type for large-scale navigation using
landmark-pair coordinate frames. The advantage of using the three landmark
method20, however, is that following a sequence of transformations through the
network is significantly simpler. Since all calculations are linear and independent
of landmark order the process can be carried out by spreading activation through
20Zipser was no doubt aware of the application of his earlier multiple frame model to his later idea
of the β-coding; however, he has not, as far as I am aware, published anything that relates the two
ideas.
the relational network. In contrast, a landmark-pair method would require
networks of local processing units of considerably greater complexity in order to
perform the necessary non-linear transformations between frames.
Constructing a large-scale representation
The relational network that encodes the large-scale spatial model is constructed
whilst exploring the environment. The specific method investigated here is as
follows. Each time the agent moves the egocentric positions of all visible
landmarks are computed. If there are any new landmarks in this set then new
object-units are recruited to the lower layer of the network to represent them.
Then, for each L-trie combination in the set a test is made to see if a β-coding unit (with this local frame) already exists for each of the remaining visible cues. If not, new β-units are recruited to the network’s upper layer and linked appropriately with the object-units. The β-coefficients are then calculated either directly (using matrix inversion) or gradually (via the gradient descent learning rule) as the agent moves within sight of the relevant cues.
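A toy version of this recruitment step might look as follows; direct matrix inversion is used for brevity, and landmark identity is simply given, both of which are simplifications of the scheme described above.

```python
from itertools import combinations
import numpy as np

class RelationalNet:
    """Toy sketch of map-building: object-units are just landmark names and
    each beta-unit is keyed by (target, frame), where frame is a sorted
    triple of cue names."""

    def __init__(self):
        self.object_units = set()
        self.beta_units = {}                    # (target, frame) -> beta vector

    def observe(self, visible):
        """visible maps landmark name -> egocentric (x, y) position."""
        self.object_units.update(visible)       # recruit units for new landmarks
        for frame in combinations(sorted(visible), 3):
            X = np.array([[visible[n][0] for n in frame],
                          [visible[n][1] for n in frame],
                          [1.0, 1.0, 1.0]])
            if abs(np.linalg.det(X)) < 1e-6:    # ignore near-collinear L-tries
                continue
            for target in visible:
                key = (target, frame)
                if target in frame or key in self.beta_units:
                    continue                    # unit already exists for this frame
                t = visible[target]
                self.beta_units[key] = np.linalg.solve(X, np.array([t[0], t[1], 1.0]))

net = RelationalNet()
net.observe({'A': (2, 5), 'B': (6, 4), 'C': (4, 1), 'D': (1, 1)})
print(sorted(net.beta_units))   # the four units A/BCD, B/ACD, C/ABD, D/ABC
```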
Figure 8.5 shows an example of this process for a simple environment of five
landmarks. From the current view position four landmarks A, B, C, and D are
visible (assuming 360° perceptual capability), for which the agent generates β-coding units A/BCD, B/ACD, C/ABD, D/ABC. Following adequate exploration the network illustrated in Figure 8.6 will have been generated.
Figure 8.5: An environment with five landmarks; the agent is oriented towards the right with field of view as indicated by the circle.
Figure 8.6: A relational network for the five-landmark environment. The network consists of an input/output layer of object-units and a hidden layer of β-coding units that encode landmark transformations in specific local frames.
Given this network the agent can determine the location of any target landmark
when it is within sight of any group of three others. For instance, if cues A, B, and
C are visible and E is required, then the active object units will trigger D/ABC
(activating unit D) and hence E/BCD to give the egocentric location of the target.
The method clearly generalises to allow the position of any goal site that is
encoded within an L-trie frame to be found.
The topology of the relational net
The connectivity of the relational network defines a graph of the topological
arrangement of local landmark frames. For instance, the network shown above
instantiates the L-trie graph shown in Figure 8.7.
Figure 8.7: The L-trie adjacency graph for the five-landmark environment.
The links between nodes in this graph correspond to the β-coding units. Although
the graph shown here has entirely bilateral connections, there is nothing
intrinsically symmetrical about the coding method. For instance, it would be quite
possible to encode the relationship D/ABC and not the reverse A/BCD. This
could plausibly happen if the agent, whilst moving through its environment,
encodes the positions of landmarks in front with respect to those it is already
passing but not vice versa. This property of the mapping mechanism accords with
observations of non-reflexive spatial knowledge in humans (see [77]).
Figures 8.8 and 8.9 show respectively an environment of twenty-four landmarks
and the adjacency graph generated after exploration by random walk. In learning
this environment the system was restricted to encoding the relative locations of
only the four closest landmarks at any time. This reduces the connectivity of the
graph and the memory requirements of the network substantially. However, even
without this restriction, the memory requirements of the system are O(N) (i.e.
proportional to the number of landmarks) rather than O(N²) since the relations
between landmarks are only stored if they share a common local group.
Figure 8.8: Landmark locations for a sample environment. The circle indicates
the range of the simulated perceptual system.
The box indicates a target
landmark (see below).
Figure 8.9: An L-trie graph. Each vertex represents an L-trie node and is placed
in the position corresponding to the centre of the triangle formed by its three
landmarks (in the previous figure). The boxes enclose the L-trie nodes local to
the agent’s position and the target landmark ‘X’.
8.5
Way-finding using the relational model
Target finding: estimating the distance and direction to desired goals
From Figure 8.9 it is evident that there are multiple possible paths through the
network that will connect any two landmark groups. Hence the system represents
a highly redundant coding of landmark locations. As described in the previous
section, the (external) activation of the object units for any group of three landmarks triggers the firing of all associated β-coding units, which in turn activate further object units. This ‘domino’ effect will eventually propagate
egocentric position estimates of all landmarks throughout the graph. Indeed, due
to the redundancy of the paths through the graph many estimates will be
computed for each landmark, each arriving at the object node after differing
periods of delay. For any specific goal or landmark the delay (between the initial
activation and the arrival of the first estimate) will be proportional to the length of
the shortest sequence of frames linking the two positions.
Assuming noise in the perceptual mechanisms that determine the relative
positions of visible cues (and hence noise in the stored β-coefficients) the position
estimates arriving via different routes through the graph will vary. The question
then arises as to how the relative accuracy of these different estimates can be
judged. The simplest heuristic to adopt is to treat the first estimate that arrives at
each node as the best approximation to that landmark’s true position. This is
motivated by the observation that each transition in a sequence of frames can only
add noise to a position estimate, hence better estimates will (on average) be
provided by sequences with a minimal number of frames (whose outputs will
arrive first). I accordingly call this the minimal sequence heuristic. However, there is a second important factor that affects the accuracy of propagated location estimates, which is the spatial arrangement of the cues in each L-trie group. The worst case is if all the landmarks in a given group lie along a line; in this situation the β-coefficients for an encoded point will be undefined. In general, landmark groups that are near collinear will also give large errors when computing cue
positions in the presence of noise. One possibility, as yet unimplemented, is for
the system to calculate estimates of the size of these errors and propagate these
together with the computed landmark positions. Each computed location would
then arrive with a label indicating how accurate an estimate it is judged to be. In
the following examples landmark positions were calculated simply by rejecting
information from L-trie frames that were near-collinear (i.e. within some margin
of error) and otherwise using the minimal-sequence heuristic, that is, adopting the
first location estimate to arrive at each node. The issue of combining multiple
estimates to obtain greater accuracy is considered further below.
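The propagation scheme just described can be sketched as follows. The fragment is a simplified stand-in for the simulation: units are built only for cues that were co-visible during a hypothetical exploration phase, the agent's egocentric frame is taken to coincide with the world frame, and the minimal-sequence heuristic is realised by sweeping the network one 'time-step' at a time so that the first estimate to arrive at a node is the one retained.

```python
from itertools import combinations
import numpy as np

def build_units(landmark_xy, views):
    """Recruit beta-units within each 'view' (a set of co-visible landmarks)."""
    units = {}
    for view in views:
        for frame in combinations(sorted(view), 3):
            X = np.array([[landmark_xy[n][0] for n in frame],
                          [landmark_xy[n][1] for n in frame],
                          [1.0, 1.0, 1.0]])
            if abs(np.linalg.det(X)) < 1e-6:                 # reject collinear frames
                continue
            for target in view:
                if target not in frame and (target, frame) not in units:
                    t = landmark_xy[target]
                    units[(target, frame)] = np.linalg.solve(X, np.array([t[0], t[1], 1.0]))
    return units

def target_find(units, visible, goal, max_steps=20):
    """'Domino' propagation of egocentric position estimates; each sweep is one
    time-step, so the first estimate to reach a node comes via a minimal
    sequence of frames and is the one kept."""
    estimates = dict(visible)
    for _ in range(max_steps):
        if goal in estimates:
            break
        new = {}
        for (target, frame), beta in units.items():
            if target in estimates or target in new:
                continue
            if any(n not in estimates for n in frame):
                continue
            X = np.array([[estimates[n][0] for n in frame],
                          [estimates[n][1] for n in frame],
                          [1.0, 1.0, 1.0]])
            new[target] = tuple((X @ beta)[:2])
        if not new:
            break
        estimates.update(new)
    return estimates.get(goal)

# A small strip of hypothetical landmarks; only A, B, C are in view, X is the target.
world = {'A': (0, 0), 'B': (2, 1), 'C': (3, -1), 'D': (5, 0), 'E': (7, 2), 'X': (9, 0)}
views = [{'A', 'B', 'C', 'D'}, {'B', 'C', 'D', 'E'}, {'C', 'D', 'E', 'X'}]
units = build_units(world, views)
print(target_find(units, {'A': (0, 0), 'B': (2, 1), 'C': (3, -1)}, 'X'))  # ~ (9.0, 0.0)
```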
Finding the direction and distance to a specific goal by propagating landmark
positions is here called target finding. This competence is sufficient to support
behaviours such as orienting toward distant locations, and moving in the direction
of a goal with the hope of finding a direct route. However, this mechanism will
not always be appropriate as a method of navigation for two reasons. First, the
direct line to a goal location is clearly not always a viable path. Secondly, the
target finding system is susceptible to cumulative noise. I have simulated the
effect of 0, 5, 10 and 20% Gaussian relative noise in the measurement of all
egocentric position vectors that occurs during map learning and navigation.
Figure 8.10 shows an example of the effect of this noise on estimates for the
position of landmark X (in the environment shown in Figure 8.8) relative to the
agent’s starting location.
Figure 8.10: Target finding in the presence of 0, 5, 10 and 20% relative21 noise
in perceived landmark positions.
The figure demonstrates that with the less accurate perceptual data only a rough
approximation to a desired goal vector can be obtained. Of course it would be
possible for the agent to move in the direction indicated by target finding and then
hope to use landmarks that are recognised en route to gradually correct the initial
error and so home in on the goal. However, this will often be a risky strategy as a
direct line will not necessarily cross known territory. The following section
describes a method which exploits a heuristic of this sort but in a form that is
more likely to generate a successful path to the goal.
Path following: a robust method for moving to a goal location.
An alternative strategy is to move, not in the direction of the goal itself, but
towards landmarks that are known to lie along a possible path to the goal. This
method, here called path following, involves determining a sequence of adjacent
local frames that link start and goal positions prior to setting out, then
navigating by moving from frame to frame through this topological sequence.
Because perceptual information about known landmarks will (very likely) become
21The egocentric vector $(x, y)$ is perceived as $(x + N(0,\sigma)\sqrt{x^2 + y^2},\; y + N(0,\sigma)\sqrt{x^2 + y^2})$, where $N(0,\sigma)$ is a mean-zero Gaussian random variable with standard deviation $\sigma$. In effect this
means an approximately linear increase in noise with distance. These noise characteristics were
not chosen to model any specific sensor system.
available as each frame is crossed, the agent will be able to replace estimates of
cue positions with ‘hard’ data, thus avoiding the build-up of noise encountered by
the target finding system. There is, however, some computational overhead to be incurred through the need to calculate a suitable sequence of adjoining frames. Since, again, there are multiple sequences to choose from, some heuristics are required to determine which should be preferred. The minimal sequence heuristic is again appropriate, though on the slightly different grounds that shorter sequences should (on average) give more direct routes. Other heuristics are possible: for instance, estimates of the actual distances covered by alternative routes could be calculated, allowing a more informed judgement as to which is the shortest path.
To find the minimal sequence the process of propagating information through the
relational network is simply reversed. In other words, a spreading activation
search22 is performed from the goal back toward the start position. This is easiest
to imagine in the context of the L-trie graph (Figure 8.9); however, it could be
implemented directly in the relational network by backward connections between
units.
In simulation I have modelled this parallel search process through a series of
discrete time-steps. This occurs as follows. The L-trie node closest to the goal is
activated and clamped on (i.e. its activity is fixed throughout the search) while all
other nodes are initialised with zero activity. The signal at the goal is then allowed
to diffuse through the adjacency graph, decaying by a constant factor for each link that is traversed23. Once the activation reaches the L-trie node local to the agent the minimal sequence can be found. Beginning with the start node, this sequence is traced through the network simply by connecting each node to its
most active neighbour.
22Spreading activation as a graph searching technique has a long history in cognitive modelling
[5] and in the literature on graph search [42] and path planning (e.g. [40]). It also has clear
similarities with Dynamic Programming.
23This is achieved by, at each time-step, updating the activity of each L-trie node to be equal to
the maximum of its own activation and that of any of its immediate neighbours (multiplied by the
decay factor) at the previous time-step.
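The update described in footnote 23 is easy to state in code. The following sketch runs the synchronous spread over a small hand-built adjacency graph (the node names and decay value are illustrative, and the goal is assumed to be reachable) and then traces the winning sequence by hill-climbing on activity.

```python
def minimal_sequence(adjacency, start, goal, decay=0.95, max_steps=100):
    """Spreading-activation search over an L-trie adjacency graph.  The goal
    node is clamped at activity 1.0; at every time-step each node takes the
    maximum of its own activity and decay times its most active neighbour."""
    activity = {n: 0.0 for n in adjacency}
    activity[goal] = 1.0
    for _ in range(max_steps):
        if activity[start] > 0.0:
            break
        new = {n: max(activity[n],
                      decay * max((activity[m] for m in nbrs), default=0.0))
               for n, nbrs in adjacency.items()}
        new[goal] = 1.0                          # the goal stays clamped on
        activity = new
    # Trace the minimal sequence by following the most active neighbour.
    path, node = [start], start
    while node != goal:
        node = max(adjacency[node], key=activity.get)
        path.append(node)
    return path

# A toy chain of L-trie nodes.
adj = {'ABC': {'BCD'}, 'BCD': {'ABC', 'CDE'}, 'CDE': {'BCD', 'DEX'}, 'DEX': {'CDE'}}
print(minimal_sequence(adj, 'ABC', 'DEX'))       # ['ABC', 'BCD', 'CDE', 'DEX']
```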
This spread of activation is illustrated in Figure 8.11. The three frames show the
activity after 0, 4, and 9 time-steps, after which time the activity has filtered
through to the start node. The path found is indicated by the boxes enclosing the
winning nodes.
Figure 8.11: Spread of activation through the L-trie graph (Figure 8.9) after 0, 4, and 9 time-steps (decay factor = 0.95). The points in the figures represent the vertices in
the L-trie graph. The size of each point shows the level of activation of that L-trie
node. The boxes indicate that a minimal sequence ABC BCE CEL EGL GLM
LMN MNP MNW NVW VWX was found.
Having found a minimal sequence the path following method proceeds as follows.
The agent moves toward the average position of the landmarks vectors for the
first L-trie in the path. Once that position is reached (it will be near the centre of
the three cues), the position of the next L-trie is generated (using direct perceptual
data as far as possible) and so on till the goal is reached. Figure 8.12 illustrates
this mechanism for the noise-free case, and Figure 8.13 for noise levels of 5, 10
and 20%. The second figure demonstrates that path following is extremely robust
to noise as the error in the final goal estimate is independent of total path-length.
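A compact way to view this procedure is as a conversion of the L-trie sequence into a series of waypoints. The following is a minimal Python sketch (an illustration only, not the thesis code), assuming a dictionary of current landmark position estimates that is refreshed from perception whenever a landmark is in view.

    import numpy as np

    def waypoints(l_trie_sequence, landmark_positions):
        """Convert the minimal L-trie sequence into a series of waypoints,
        each the centroid of the three landmark positions defining that frame.

        landmark_positions: dict of current (egocentric) position estimates,
        replaced by 'hard' perceptual data whenever a landmark is directly visible.
        """
        return [np.mean([landmark_positions[lm] for lm in l_trie], axis=0)
                for l_trie in l_trie_sequence]

The agent steers toward the first waypoint, re-estimating it from direct perceptual data as its landmarks come into view, and advances to the next waypoint once the current one is reached.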
Figure 8.12: Moving to the goal by path following. The dotted lines indicate
additional landmark locations that were utilised en route.
Figure 8.13: Performance of the path following system in the presence of 5, 10
and 20% relative noise in perceived landmark positions. (Different L-trie
sequences were followed in each case as networks with different connectivity
were acquired during exploration).
Building a predictive map
The ‘domino’ effect that propagates local landmark positions through the
relational net will eventually generate egocentric location estimates for all
landmarks (with connections to the L-trie graph). The resulting activity can be
thought of as a dynamic ‘map’ of the environment which could be updated
continuously as the agent moves such that it is always arranged with the agent at
the centre and oriented towards its current heading. Figures 8.14 and 8.15 show
this egocentric map computed (with 20% noise) for an environment of twenty-six
landmarks.
As a result of cumulative error the exact layout of landmarks is more accurately
judged close at hand than further away; however, the topological relations
between landmarks are reproduced throughout. One use to which such a map
might be put is to disambiguate perceptually similar landmarks by calculating
predictions of upcoming scenes. In other words, if the agent sees a landmark
which appears identical to one it already knows, then it can judge whether the two
cues actually arise from the same object from the extent to which the place where
the landmark appears agrees with the location predicted for it by the mapping
system. If there is a large discrepancy between actual and predicted locations then
the new landmark could be judged to be a distinct object. On the other hand, if
there is a good match the agent could conclude that it is observing the original
cue.
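As an illustration of this disambiguation test, the following is a minimal Python sketch (the threshold and function name are assumptions for illustration, not part of the thesis model).

    import numpy as np

    def same_landmark(observed_position, predicted_position, threshold=0.5):
        """Judge whether a newly seen cue is the known landmark it resembles.

        If the observed egocentric position agrees with the position predicted
        by the dynamic map (to within a tolerance), treat the cues as the same
        object; otherwise treat the new cue as a distinct landmark.
        """
        discrepancy = np.linalg.norm(np.asarray(observed_position) -
                                     np.asarray(predicted_position))
        return discrepancy <= threshold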
Note that the purpose and nature of this dynamic map demonstrate one of the major differences between the relational frame approach to the acquisition of spatial knowledge and methods that emphasise the construction of a permanent 'map' of environmental layout in which position errors are minimised or explicitly modelled. In this relational approach there is no long-term static representation of large-scale spatial relations. Instead, to the extent that a large-scale map of any sort exists, it is described by the continuously changing
activations of the object-units in the relational net.
Figure 8.14: An environment of twenty-six landmarks with the agent positioned
and oriented as shown.
Figure 8.15: The ‘dynamic’ map in the agent’s egocentric coordinate frame
generated from the viewing position shown in the previous figure.
8.6 Spatial knowledge and way-finding in perspective
This final section is an attempt to find a common theme in the arguments and ideas expressed in this chapter and the last, and to consider where this might lead in the future.
Chapter seven ended by considering the theory, proposed by Kuipers [78, 81], of
a hierarchical organisation of spatial knowledge. This theory proposes four levels
of organisation—sensorimotor, procedural, topological, and metric—that form a
hierarchy in which each level introduces a greater degree of complexity and
resource demands. On the whole, lower level representations are constructed first
and play some role in the assimilation of knowledge to higher levels.
I will argue below in favour of one element of this view—the proposal that spatial
knowledge is organised in distinct components encoding separate constraints.
However, I also wish to suggest that the hierarchical view is misleading, both in
its emphasis on geometric content (as the principal distinguishing factor between
the top two levels), and in its suggestion of a largely incremental process in the
construction of spatial knowledge. The following is an attempt to set out an
alternative perspective which, following Michael Arbib [7, 8], I call a ‘multiple
schemata’ view.
Arbib has proposed the use of the term schemata to describe active
representational systems or “perceptual structures and programs for distributed
motor control” ([7] p. 47). In the context of constructing and using models of
space he suggests—
“The representation of [...] space in the brain is not one absolute space, but rather a
patchwork of approximate spaces (partial representations) that link sensation to
action. I mean ‘partial’ and ‘approximate’ in two senses: a representation will be
partial because it represents only one small sample of space (...), and it will be
approximate in that it will be based on an incomplete sample of sensory data and
may be of limited accuracy and reliability. I will suggest that our behaviour is
mediated by the interaction of these partial representations: both through their
integration to map ever larger portions of the territory relevant to behaviour, and
through their mutual calibration to yield a shared representation more reliable
than that obtained by any one alone.” ([8] p. 380)
In the specific context of cognitive maps, he also argues that
“There is no reason, in general, to expect the system to use Euclidean space or
Cartesian coordinates for such a map. Rather, the system needs a whole array of
local representations that are easily interfaced and moved between.” ([7] p. 47).
The view advocated below is, I hope, in close accord with these ideas.
A ‘multiple schemata’ view
Spatial information can be picked up through multiple sensory modalities in a number of different guises and forms. This information may describe geometric spatial relations up to any level in the topological–metric hierarchy, and it may also lie anywhere on a scale from very precise to utterly vague. Each piece of information
can be viewed as supplying a potential constraint for constructing spatial
representations for way-finding.
I propose that the critical distinction with regard to different forms of constraint
information is not to do with the geometric content of the knowledge but with the
process by which that knowledge is derived. For instance, metric information
derived from odometry (dead reckoning) is independent from metric information
determined by perceived distance and direction to identifiable salient landmarks.
They thus represent constraints that are independent because they derive from
different sensory modalities. Multiple constraints can also be obtained from
within a single modality by observing different environmental cues. For instance,
the observed position of the sun gives a constraint that has an independent source
from spatial data acquired from local landmarks. Indeed, different individual
landmarks or landmark groups can supply separate constraints as has been
demonstrated in this chapter. Finally, relatively independent constraints can arise
within a modality by reference to the same external cues but by employing
different computational mechanisms. It is in this sense that the distinction between
different geometries may be most relevant. For instance, as has been discussed in
chapter seven, the visual characteristics of distinctive landmarks might be used to
construct knowledge of topological relations that is independent of the
mechanisms that extract distance or direction from the visual scene. To the extent
that different constraints are independent (in this sense), two constraints will
clearly be much more powerful than one, three more than two, etc. It therefore
seems reasonable to suggest that an agent should seek to detect and represent a
number of independent or near independent constraints that describe the spatial
relations between important places.
The emphasis of a multiple schemata approach is not on constructing unified
representations such as topological or metric maps but rather on establishing
multiple orienting systems. I suggest that each schema should exploit a different
combination of cues, channels, and mechanisms to instantiate a set of
environmental spatial relations. Thus there will, overall, be a number of relatively distinct pathways through which knowledge is acquired. This suggests a heterarchy of models (as opposed to a hierarchy), with some, but not all, schemata sharing common sources and resources. At any time an agent should exploit the knowledge in multiple schemata to support its current navigation task. Although some tasks may require the temporary creation of a unified model (drawing a graphical map of the environment might constitute such a task), in general the underlying representations will be distinct, allowing the reliability of each source of information to be assessed at run-time.
Way-finding should exploit acquired schemata via arbitration procedures which
decide on the basis of the content and accuracy of each active model the extent to
which it should contribute to the decision process. This arbitration could be
carried out through some fixed subsumption mechanism whereby, for instance,
knowledge determined from large-scale metric relations could override taxon
(route-following) strategies. Alternatively a more sophisticated system would
seek to combine the constraints afforded by multiple schemata by weighting them
according to their perceived accuracy or reliability. In this way, reliable
identification of a highly distinctive landmark might override estimates of spatial
position or orientation determined by some metric reckoning process. Kalman
filtering techniques [25] can also be applied for combining multiple constraints,
as, for instance, in the ‘feature-based’ robot navigator developed by Hallam,
Forster, and Howe [57].
I hope that it is reasonably evident that these ideas are consistent with the
relational coding system described above. From a multiple schemata perspective
the set of coding units associated with each distinct L-trie would constitute a schema encoding the location of other landmarks and goals relative to a specific allocentric coordinate frame. If a specific location is encoded by two separate 'L-trie schemata' based on non-overlapping landmark sets then these would constitute relatively independent constraints. To the extent that landmark sets do overlap
they will obviously be less independent, but will nevertheless encode partially
distinct constraint information. However, the idea of multiple schemata also
generalises to encompass different coding systems. For instance, representations
based principally on direction sense and dead reckoning could be constructed, providing a source of information independent of the relational coding.
An obvious argument against a multiple schemata view is that acquiring and
storing spatial knowledge is not without cost. It makes demands on attention,
processing and memory (there are really separate costs associated with detecting
constraints, storing them, retrieving them, and combining them!). One defence
against this argument is the relative independence between different schemata
which will allow parallel processing to be exploited to a considerable extent. A
second possibility is that the amount of resources devoted to a given location (i.e.
the number of constraints stored) may vary according to the subjective importance
of being able to relocate that place or reach it quickly. We could expect, for
instance, that an animal’s home or nest (or a robot’s power source) would have
the highest priority and that therefore ‘homing’ might be the most robustly
supported way-finding behaviour.
It is my contention that a multiple schemata view may help in understanding the
evolution of way-finding competence in animals, and may also provide support
for the essentially pragmatic approach of Animat AI. In the case of the former,
O'Keefe's [119] separate taxon and locale systems (which follow a very long line of research into response vs. place knowledge in animal navigation; see Olton [121, 122]) can be viewed as a distinction along these lines. However, there also
seems a reasonable case for breaking up the ‘locale’ system into multiple
schemata, for instance, models derived from dead reckoning and direction senses
[52, 97] and those derived principally from relational codings. With respect to
robotics this view suggests the abandonment of theoretical pre-conceptions about
the priority, or lack of it, of different forms of geometric knowledge. It also
implies that the ‘brittleness’ of classical approaches arises not so much from the
emphasis on metric modelling but from the search for an accurate unified metric
model. The alternative, advocated here, is clearly to seek multiple constraint
models that, although individually fallible, can combine to give robust support to
way-finding.
For a multiple schemata view to have any theoretical teeth it will need to be
developed in such a way as to create predictions of animal or human behaviour.
More specific definitions of the different schemata are needed—what constraints
they compute, using what information, and the extent of their inter-dependence.
Secondly, we need to hypothesise the nature of the arbitration procedure and develop predictions of the likely effects of opposing or complementary constraints on behaviour. Clearly Michael Arbib's work on schema theory [8] is relevant to these issues; in addition, there is much existing work in psychology and AI that
shares the same concerns. For instance, in robot simulation, the ‘Qualnav’ model
[84] discussed earlier24 describes some interesting arbitration mechanisms for
integrating range information determined from pairs of local landmarks to provide
accurate localisation. This work could generate some interesting behavioural
predictions. Some research in human navigation by Scholl [147] can also be
viewed along these general theoretical lines. Work in animal navigation that
specifically fits this research theme has been performed by Poucet et al. [31, 129,
130], Collett et al. [36] and Etienne et al. [47-49]. To end this chapter I would
therefore like to draw upon a couple of examples from this work.
Etienne et al. [47] report that hamsters have effective dead reckoning skills which
are sufficient to relocate their nest in darkness. However, in lighted conditions
hamsters were found to orient primarily using visual information about local
landmarks. In conflict situations, where a landmark (a single light spot) was
rotated relative to the learned position, the hamsters homed using either the
landmark information or their dead-reckoning sense. When the visual information
and dead reckoning produced highly divergent paths, dead reckoning was used; however, with smaller discrepancies visual information took priority over path
integration. Etienne et al. also report that the dead-reckoning sense was more
precise when used to return to the nest than when used to locate a secondary
feeding site. This suggests that a dead reckoning way-finding schema may be more
available for homing than for general path-finding.
Experiments by Collett et al. [36] with gerbils suggest that these animals may
encode goal positions (buried sunflower seeds) in terms of individual visible
24Although this work has been interpreted from the perspective of the ‘spatial semantic hierarchy’
[81], I believe it fits at least equally well with a multiple constraints view.
landmarks by using some form of direction sense. For instance, in one experiment
gerbils were trained to locate a food cache at the centre of an array of two
landmarks. When the distance between landmarks was doubled the gerbils searched at two sites, each at the correct distance and orientation to one of the landmarks, rather than at the centre of the two locations (as some theories of a landmark 'map' might predict). In a further experiment the gerbils were trained to go to a goal-site at the centre of a triangle of three landmarks. During testing, the distance of one landmark from the centre was doubled; Collett et al. report that the animals spent most of their search time around the place specified by the two landmarks, ignoring the one that broke the pattern. They interpreted this result in
the following way:
“The gerbil is thus equipped with a useful procedure for deciding between
discrepant solutions. When most of the landmarks agree in specifying the same
goal, with just a few pointing to other sites, the chances are that the majority
view is correct and that the additional possibilities result from mistakes in
computation or from disturbances to the environment.” ([36] p. 845).
Collett et al. are therefore suggesting that this multiple encoding of landmark-goal
relations by gerbils occurs to provide the system with robustness. In other
words, they advocate something like a multiple schemata system and give a clear
example of the ability of such a hypothesis to generate interesting and testable
predictions.
Chapter Nine
Conclusions and Future Work
What has been achieved?
This thesis has investigated reinforcement learning systems, with particular
concern for the issues of adaptive recoding and exploration, and has explored the
distinction between reinforcement and model-based learning in the context of
acquiring navigation skills. Delayed reinforcement learning architectures have
been described and simulated for adaptive recoding in control tasks, and for the
acquisition of tactical, local navigation skills in a mobile robot. Model-based
learning methods have been investigated with respect to the construction and use
of quantitative spatial world models for way-finding.
Adaptive recoding for reinforcement learning
A connectionist learning architecture has been developed that performs adaptive
recoding in reinforcement and delayed reinforcement learning tasks defined over
continuous input spaces. The architecture employs networks of Gaussian basis
function (GBF) units with adaptive receptive fields. Learning is achieved by
gradient climbing in a measure of the expected reinforcement.
This architecture has several desirable characteristics for reinforcement learning.
First, the function approximation is performed locally. This means that the
network should learn considerably faster than systems based on global
approximators such as the multilayer perceptron. Second, the parameters of the
trained network should be easier to initialise and interpret than more distributed
learning systems. This creates the potential for interfacing explicit knowledge
with adaptive reinforcement learning for control. Finally, by adapting the
covariance matrix each GBF unit can align its principal axis according to the
variance of the optimal output in its region of expertise. For a given task, fewer
units should therefore be needed than in similar architectures in which only the
centres and the widths of the receptive fields are learned.
Results with a simulation of a real-time control problem show that networks with
only a small number of units are capable of learning effective behaviour in
reinforcement learning tasks within reasonable time frames. However, as in all
multilayer delayed reward learning systems, the learning system is susceptible to local minima, is vulnerable to interference due to input sampling bias, and is not guaranteed to converge. I have therefore argued that this system may be more appropriate for fine-tuning coarsely specified control behaviours than for acquiring new skills from scratch.
Reinforcement learning in navigation
A tactical/strategic split in navigation skill has been proposed for both natural and
artificial systems. I have argued that tactical local navigation could be performed
by reactive, task-specific systems encoding competences that will be effective
across different environmental settings. These systems might be innately
specified, learned, or coarsely specified and adaptable. I have argued that learning of appropriate control behaviour may be possible with short or medium-term rewards and using only partial state information. The learning system can
exploit the redundancy in the mapping from environmental states to appropriate
actions to reduce the dimensionality of the search space from the full Markovian
problem down to something more manageable.
I have demonstrated acquisition of an adaptive local navigation behaviour within
a simulated modular control architecture for a mobile robot. The Actor/Critic
learning system for this task acquires successful, often plan-like strategies for
control using very sparse sensory data and delayed reward feedback. The
algorithm also demonstrates adaptive exploration using Williams’ proposal for
performance related control over local search behaviour.
Model-based learning in navigation
I have suggested that strategic, way-finding skills require model-based, task-independent, but environment-specific knowledge. I have argued against the view
that topological spatial models are simpler to construct, store, or use than
quantitative, metric models, and have simulated a local allocentric frame method
as an example of way-finding using quantitative spatial knowledge. This system
exploits simple neural network learning, storage and search mechanisms, yet
generates very robust way-finding behaviour. I have also argued against a strong
distinction between different forms of geometric knowledge in cognitive spatial
representations and against the construction of complete global models. I have
proposed instead that robust systems should be based on the acquisition of
multiple representations or ‘schemata’ with mechanisms for chaining and
combining constraints arising from different models.
Future Work
The research described above leaves many gaps to be filled in and many avenues
to be explored. Various specific proposals for extending this research have
already been outlined; here I will briefly describe how the various underlying
threads might be brought together to make some more coherent wholes.
It is clear that the GBF adaptive recoding techniques for delayed reinforcement
learning could be applied in local navigation systems allowing these systems to be
coarsely pre-specified and giving the possibility of accelerated learning. Further,
since the adaptive coding methods are more memory-efficient and allow broader and more flexible generalisation than coarse-coding, this could allow for tactical learning in input spaces of much higher dimension. The possibility of using the GBF methods to distill control knowledge from more opaque memory systems such as CMAC coding is also worth investigating. The problem of input sampling has been discussed as a major complication in learning real-time control. Mechanisms need to be developed that will keep track of, and cope with, over-sampling or under-sampling. This is part of the overall need to develop better
attentional and exploration mechanisms for reactive learning systems.
A further possibility for the development of local navigation systems is to use
reinforcement learning to adapt the perceptual components of a system rather than
the outputs. For instance, one idea would be to exploit the notion of the
‘Deformable Virtual Zone’ (DVZ) proposed by Zapata et al. [190] and mentioned
briefly in chapter six. The DVZ describes an area of space around the robot within
which intrusions from obstacles trigger avoidance manoeuvres. The shape of this
zone is a function of the dynamics and current trajectory of the robot. Figure 9.1
illustrates some possible elliptical zone shapes for different instantaneous robot
trajectories. The proposal would therefore be to use reward signals to adapt the
parameters of the DVZ separately from any adaptive mechanisms that control the
actual avoidance behaviour. This could be achieved by a gradient learning rule
and would essentially constitute an adaptive attentional mechanism.
Figure 9.1: The deformable virtual zone (DVZ) is a function of the vehicle dynamics and current trajectory that specifies the 'danger' area within which intrusions will trigger avoidance behaviour. (Adapted from [190] p. 110.)
I have argued for a distinction between tactical and strategic navigation skills but
have not yet considered how this might be brought about. Clearly advances are
needed in many areas. First, more appropriate local navigation modules are required that achieve tactical objectives rather than simply wandering without purpose. Second, multiple representational systems for way-finding need to be developed and integrated. This work has barely begun; mechanisms for determining, chaining and, in particular, combining different sources of constraint need to be investigated. Finally, mechanisms for integrating
the two levels of control must be specified. The research described here suggests a
tactical/strategic split in which the way-finding system continuously specifies a
target heading while the local navigation systems work to keep the agent on
course while avoiding obstacles etc. Both systems would run in parallel. Some
advantages of this decomposition would be that it avoids the specification of
arbitrary sub-goals; it limits the need for local navigation systems that adapt with
respect to very long-term rewards; and it allows both components of the system to
be fully responsive to contingencies that arise from moment to moment. A critical
aspect of all this work will be to make contact with real environments through
robotics, or at least, through more realistic simulations; the lack of such realism is
perhaps the major drawback of the work performed so far.
Appendix A
Algorithm and Simulation Details for Chapter Three
This appendix describes the actor/critic architecture used in the maze learning
task described in chapter three.
The learning system
The N cells of the maze (figure 3.1) describe a state space $x_i \in X$. The action selected in the ith cell is given by $a_i \in \{1, \ldots, 4\}$. The orientation of the agent is ignored, hence selecting action $a_i = j$ will always result in a transition to the jth neighbouring cell. Behaviour is modelled in discrete time intervals where at each time-step the agent makes a transition from one cell to a neighbour. The cell occupied at time t is denoted $x(t)$ and is represented by the recoding vector $\rho(t)$. The current action is denoted $a(t)$.

In this model, each cell is given a unique, discrete encoding. This is achieved by using a recoding vector $\rho(t)$ that is a unit vector of size N. In other words, if the current cell is $x_i$ then

$\rho_j(t) = 1$ iff $j = i$, and $0$ otherwise.   (A.1)

The parameters of the learning system are encoded by a vector v of size N for the critic element and a vector w of size $N \times 4$ for the performance element. Since cells are encoded by unit vectors, the prediction associated with the ith cell is just the parameter $v_i$ and the degree of preference for the jth action in this cell is given by the parameter $w_{(4i+j)}$. To select an action, Gaussian noise with mean zero and standard deviation $\sigma$ is added to each of the appropriate weights1. The action is then chosen that maximises

$w_{(4i+j)} + N(0, \sigma)$ for $j = 1 \ldots 4$.   (A.2)
Only the weight for the selected action becomes eligible for learning. This
architecture corresponds to a winner-takes-all competition between a set of
stochastic linear threshold units. Figure A.1 summarises the learning algorithm.
1For cells on the periphery of the maze the weights for invalid actions are ignored.
$e_{TD}(t+1) = r(t+1) + \gamma V(t+1) - V(t)$

$\delta w_{(4i+j)}(t) = 1$ iff $\rho_i(t) = 1$ and $a(t) = j$, and $0$ otherwise

$\bar{\rho}(0) = 0, \quad \bar{\rho}(t) = \rho(t) + \lambda\,\bar{\rho}(t-1)$

$\bar{w}(0) = 0, \quad \bar{w}(t) = \delta w(t) + \kappa\,\bar{w}(t-1)$

$\Delta v = \beta\, e_{TD}(t+1)\, \bar{\rho}(t)$

$\Delta w(t) = \alpha\, e_{TD}(t+1)\, \bar{w}(t)$

Figure A.1: Update rules for the actor/critic maze learning system.
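For concreteness, the following is a minimal NumPy sketch of one step of these update rules (an illustration following the symbols of Figure A.1, not the original simulation code).

    import numpy as np

    def maze_actor_critic_step(i, j, r, i_next, v, w, rho_trace, w_trace,
                               gamma=0.95, alpha=0.2, beta=0.2, lam=0.6, kappa=0.6):
        """One actor/critic update after moving from cell i to cell i_next
        by action j and receiving reward r.

        v: critic parameters (length N); w: actor parameters (length 4*N);
        rho_trace, w_trace: eligibility traces of the same shapes.
        """
        N = len(v)
        rho = np.zeros(N); rho[i] = 1.0             # unit-vector recoding of cell i
        dw = np.zeros_like(w); dw[4 * i + j] = 1.0  # only the chosen action is eligible

        # Decay and accumulate the eligibility traces.
        rho_trace = rho + lam * rho_trace
        w_trace = dw + kappa * w_trace

        # Temporal difference error.
        e_td = r + gamma * v[i_next] - v[i]

        # Parameter updates.
        v = v + beta * e_td * rho_trace
        w = w + alpha * e_td * w_trace
        return v, w, rho_trace, w_trace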
There are six global parameters in the learning algorithm: the discount factor $\gamma$, the action and prediction learning rates $\alpha$ and $\beta$, the rates of trace decay $\lambda$ and $\kappa$, and the standard deviation $\sigma$ of the noise in the action selection rule. A discount factor $\gamma = 0.95$ was chosen to provide a slow decline in the importance attached to delayed rewards. Suitable values for the remaining parameters were chosen by experiments with maze tasks in which there were no hazard cells. Acquisition of a consistent path is most accelerated in such tasks by using high values for the decay and learning rates. However, this can be at the cost of exploration, making convergence on an indirect path more likely. To find a suitable compromise between speed of learning and achieving a direct route, the learning rates $\alpha = \beta = 0.2$ and standard deviation of the Gaussian noise $\sigma = 0.05$ were tested with different values of trace decay $\lambda = \kappa = 0.0, 0.3, 0.6$, and $0.9$ (the initial expectation in all tests was zero). Figure A.2 shows the average number of transitions over the first one hundred trials in learning the 6x6 maze. Of the values tested a decay rate of 0.6 provided the fastest convergence to a direct path. The experiments described in chapter three therefore used the parameters given above together with the decay rates $\lambda = \kappa = 0.6$.
[Plot: average number of transitions per trial over the first one hundred trials, for trace decay values 0.0, 0.3, 0.6, and 0.9.]
Figure A.2: Learning mazes without hazards with different values of trace decay.
Appendix B
Algorithm and Simulation Details for Chapter Four
This appendix gives further details of the algorithms for training networks of
Gaussian Basis Function (GBF) units discussed in chapter four. The first section
summarises the activation rule and partial derivative computations for single-layer
networks of GBF units and contains a brief discussion of unsupervised learning.
The second section describes in detail the architecture for immediate
reinforcement learning in the binary output ‘X’ task including the additional
learning rules required for adapting the GBF scaling parameters. The final section
gives the specifications of the ‘X’ task, details of the network parameters, and
describes the results of the simulations.
B.1 Networks of Gaussian Basis Function Units
Given a network of M GBF nodes, the activation of the ith unit for context x is given by the Gaussian

$g_i = (2\pi)^{-N/2} \left| H_i \right|^{1/2} \exp\left(-\tfrac{1}{2}(x - c_i)^\top H_i (x - c_i)\right)$   (B.1.1)

where $c_i$ is the parameter vector describing the position of the centre of the unit in the input space and $H_i$ the information matrix. The activation of each unit is often scaled by the total activation of all units. This gives a recoding vector of normalised activation values in which the ith element is computed by

$\rho_i = \dfrac{g_i}{\sum_j g_j}$.   (B.1.2)
Most learning rules require the gradients of the (unscaled) activation function given by the partial derivatives with respect to the centre and information matrix of each unit. These are, respectively,2

$\dfrac{\partial g_i}{\partial c_i} = g_i H_i (x - c_i)$, and   (B.1.3)

$\dfrac{\partial g_i}{\partial H_i} = \tfrac{1}{2} g_i \left( S_i - (x - c_i)(x - c_i)^\top \right)$   (B.1.4)

where $S_i = H_i^{-1}$.

For unsupervised learning each unit is moved in the direction of the partial derivative gradients in proportion to its normalised activation. This gives the update rules (given in section 4.2) for training the centres c and information matrices H:

$\Delta c_i \propto \rho_i H_i (x - c_i)$, and   (B.1.5)

$\Delta H_i \propto \rho_i \left[ S_i - (x - c_i)(x - c_i)^\top \right]$.   (B.1.6)

As it is usually thought desirable that all units get an equal share of the data, learning rules of this kind are often supplemented by a conscience mechanism [1, 128]. For instance, a running average of the activation of each unit can be learned which is used to modulate the unit's learning rate: a node that is less active than average has its gain temporarily increased, while one that is more active than average has its learning rate temporarily reduced.

2 The partial derivatives of the activation of the ith unit with respect to the position and shape (information matrix) of its receptive field are obtained as follows (assuming dependence on i throughout). For the node centre c, let $d = (x - c)$; then $\frac{\partial g}{\partial c} = \frac{\partial g}{\partial d}\frac{\partial d}{\partial c}$, and since $\frac{\partial}{\partial d}(d^\top H d) = H d + H^\top d = 2Hd$ (H being symmetric), $\frac{\partial g}{\partial c} = \left(-g H (x - c)\right)\left(-I_N\right) = g H (x - c)$. For the information matrix H, $\frac{\partial g}{\partial H} = \tfrac{1}{2}\left|H\right|^{-1/2}\frac{\partial \left|H\right|}{\partial H}\left|H\right|^{-1/2} g - \tfrac{1}{2} g \frac{\partial}{\partial H}\left((x - c)^\top H (x - c)\right) = \tfrac{1}{2} g \left[ \frac{\partial}{\partial H}\left(\ln\left|H\right|\right) - (x - c)(x - c)^\top \right] = \tfrac{1}{2} g \left( H^{-1} - (x - c)(x - c)^\top \right)$.
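As an informal check of these formulas, the following NumPy sketch (my own illustration, not code from the thesis) computes the GBF activation, the normalised recoding vector, and the gradients of equations B.1.3 and B.1.4.

    import numpy as np

    def gbf_activation(x, c, H):
        """Activation of a single GBF unit (equation B.1.1)."""
        N = len(x)
        d = x - c
        return ((2 * np.pi) ** (-N / 2) * np.sqrt(np.linalg.det(H))
                * np.exp(-0.5 * d @ H @ d))

    def gbf_recoding(x, centres, infos):
        """Normalised recoding vector over all units (equation B.1.2)."""
        g = np.array([gbf_activation(x, c, H) for c, H in zip(centres, infos)])
        return g / g.sum()

    def gbf_gradients(x, c, H):
        """Gradients of the unscaled activation (equations B.1.3 and B.1.4)."""
        g = gbf_activation(x, c, H)
        d = x - c
        dg_dc = g * (H @ d)
        dg_dH = 0.5 * g * (np.linalg.inv(H) - np.outer(d, d))
        return dg_dc, dg_dH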
B.2 A GBF architecture for a simple immediate reinforcement task
This section describes the learning architecture used for the ‘X’ task described in
section 4.3. In this task networks of between five and ten GBF units were trained
to output either 1 or 0 according to whether the context input originated from within or without an 'X'-shaped pattern imposed on a two-dimensional input space.
The network architecture is illustrated below and described in detail in the
following text.
[Figure: inputs x feed a layer of GBF units with centres c and information matrices H; the normalised activations are weighted by the policy weights w to form the output q, which drives a binary action generator.]
Figure B.1: A network architecture for immediate reinforcement learning. The thick line joining the GBF units is intended to indicate that the activations are normalised with respect to the total activation.
As in section B.1 above, each GBF local expert i has parameters $c_i$ and $H_i$ describing its centre and information matrix. The output of each unit is computed as in equation B.1.1 and normalised as in B.1.2. The normalised activation vector for all experts at time t is given by the vector $\rho(t)$. The network has one output, computed as follows. The net sum $s(t)$ is first computed by

$s(t) = w^\top \rho(t)$   (B.2.1)

where w is the output weight vector. This is then passed through a logistic function to give the probability

$p(t) = \dfrac{1}{1 + e^{-s(t)}}$   (B.2.2)

that the current action $y(t)$ will take the value 1 (or the value 0 with probability $1 - p(t)$).
In Williams’ terminology (see chapter two) the output component of the network
therefore acts as a Bernoulli logistic unit.
Learning rules
The method for estimating the error in a Bernoulli logistic unit was described in section 2.2. Assuming dependence on time, for an immediate reinforcement task in which r is the payoff and b is the reinforcement baseline this gives the error estimate

$e = (r - b)(y - p)$.   (B.2.3)

The update rule for the output weights w is therefore

$\Delta w = \alpha\, e\, \rho$   (B.2.4)

where $\alpha$ is the learning rate, and for the parameters of the local experts (from section 4.3)

$\Delta c_i = \alpha_c\, e\, (w_i - s)\, \rho_i\, H_i (x - c_i) + \kappa_s d_i$, and   (B.2.5)

$\Delta H_i = \alpha_H\, e\, (w_i - s)\, \rho_i \left( H_i^{-1} - (x - c_i)(x - c_i)^\top \right)$   (B.2.6)

where $\alpha_c$, $\kappa_s$ and $\alpha_H$ are learning rates. The rule for training the node centres contains an extra component $d_i$. This is the spring force between the GBF units that acts to prevent any two nodes occupying exactly the same position in the input space. A suitable measure of the proximity of each pair of nodes i and j is the Gaussian, with width $\sigma_s$, given by

$\eta_{ij} = \exp\left( -\tfrac{1}{\sigma_s^2} \left\| c_i - c_j \right\|^2 \right)$.   (B.2.7)

The resultant force on unit i is then given by

$d_i = \sum_{j,\, j \neq i}^{M} \eta_{ij}\, (c_i - c_j)$   (B.2.8)

where $\kappa_s$ (the spring gain) scales the contribution of the spring component to the total update for that unit. The Gaussian is chosen so that the spring is maximal for nodes that are coincident and decays rapidly as the distance between centres increases.
B.3 Learning scaling parameters
To adapt the strength of response of each unit independently of its position and shape requires the following modified activation function and learning rule (see section 4.3):

$g_i = \hat{p}_i\, (2\pi)^{-N/2} \left| H_i \right|^{1/2} \exp\left( -\tfrac{1}{2}(x - c_i)^\top H_i (x - c_i) \right)$.   (B.3.1)

Since

$\dfrac{\partial g_i}{\partial \hat{p}_i} = g_i\, \hat{p}_i^{-1}$

then from equations 4.3.7, 4.3.12 and 4.3.16 we have

$\Delta\hat{p}_i = \alpha_p\, r\, (y - s)(w_i - s)\, \rho_i\, \hat{p}_i^{-1}$   (B.3.2)

where $\hat{p}_i$ is the scaling parameter for the ith unit and $\alpha_p$ is a learning rate. Since the $\hat{p}_i$ are probabilities, the following conditions must be met:

$\sum_i \hat{p}_i = 1$ and $0 \le \hat{p}_i \le 1$ for all i.

To satisfy the first of these conditions the change $\Delta\hat{p}_i$ is distributed evenly between the remaining nodes. That is, all nodes j where $j \neq i$ are adjusted by

$\Delta\hat{p}_{j(i)} = -\dfrac{\Delta\hat{p}_i}{M - 1}$

where M is the total number of GBF units. However, because all values must stay within the bounds $0 \le \hat{p}_i \le 1$ an iterative procedure is required to balance the parameters correctly. The full algorithm (adapted from [101]) is given below in pseudocode form. It is worth noting, however, that in practice such a scheme will rarely be required provided that the learning rates are appropriately set and there is not a large surplus of GBF units for the task in hand.
Iterative procedure for maintaining scaling parameters within bounds

; δ is the change remaining to be distributed between other units.
; U is the set of units excluded from this distribution process.
; M is the total number of GBF units in the network.

For each p̂_i do {
    Δp̂_i = α_p r (y − s)(w_i − s) ρ_i p̂_i⁻¹          ; determine the desired update
    Let U = {i}                                        ; initialise the exclusion set
    if p̂_i + Δp̂_i > 1 then {
        δ = −(1 − p̂_i)                                 ; δ = negative of change applied
        p̂_i = 1 }                                      ; update to the upper bound
    else if p̂_i + Δp̂_i < 0 then {
        δ = −(−p̂_i)
        p̂_i = 0 }                                      ; update to the lower bound
    else {
        p̂_i = p̂_i + Δp̂_i                               ; normal update
        δ = −Δp̂_i }
    While δ ≠ 0 do {                                   ; iterative redistribution
        ε = δ / (M − size(U))                          ; compute change per unit
        δ = 0                                          ; reset distribution sum
        For each p̂_j, j ∉ U do {
            if p̂_j + ε > 1 then {
                δ = δ + (p̂_j + ε − 1)                  ; add excess to δ
                U = U ∪ {j}                            ; add unit to exclusion set
                p̂_j = 1 }                              ; update to the upper bound
            else if p̂_j + ε < 0 then {
                δ = δ + (p̂_j + ε)                      ; add excess to δ
                U = U ∪ {j}                            ; add unit to exclusion set
                p̂_j = 0 }                              ; update to the lower bound
            else
                p̂_j = p̂_j + ε                          ; normal update
        }
    }                                                  ; repeats until δ = 0
}
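For reference, a runnable Python version of this redistribution step might look as follows (my own sketch; the desired change Δp̂_i itself would be computed as in equation B.3.2).

    import numpy as np

    def redistribute(p, i, delta_i):
        """Apply a change delta_i to scaling parameter p[i], then spread the
        opposite change over the remaining units so that p stays a probability
        vector (sums to one, every entry in [0, 1])."""
        p = p.copy()
        excluded = {i}
        # Apply the update to unit i, clipping to the [0, 1] bounds.
        new_pi = np.clip(p[i] + delta_i, 0.0, 1.0)
        delta = -(new_pi - p[i])          # change the other units must absorb
        p[i] = new_pi
        # Iteratively spread delta over the non-excluded units, clipping as needed.
        while abs(delta) > 1e-12 and len(excluded) < len(p):
            eps = delta / (len(p) - len(excluded))
            delta = 0.0
            for j in range(len(p)):
                if j in excluded:
                    continue
                new_pj = p[j] + eps
                if new_pj > 1.0:
                    delta += new_pj - 1.0   # excess carried over to the next pass
                    excluded.add(j)
                    p[j] = 1.0
                elif new_pj < 0.0:
                    delta += new_pj         # shortfall carried over (negative)
                    excluded.add(j)
                    p[j] = 0.0
                else:
                    p[j] = new_pj
        return p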
B.4 Simulation details for the X task
The learning system was required to emit the action $y(t) = 1$ to obtain maximum payoff for inputs from within the boundaries of the cross and action $y(t) = 0$ to maximise payoff outside the cross. Inputs were sampled randomly from the unit square, where the cross was defined by the set of points $(x_1, x_2)$ for which

$(0.85 \le x_1 + x_2 \le 1.15)$ or $(-0.15 \le x_1 - x_2 \le 0.15)$.

The payoff $r(t+1)$ was equal to 1 with probability 0.9 for a correct action and 0.1 for an incorrect action, and was zero otherwise. The reinforcement baseline was zero throughout.
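A minimal sketch of this task environment (a Python illustration for clarity, not the original simulation code) is:

    import numpy as np

    def in_cross(x1, x2):
        """True if (x1, x2) lies inside the 'X' defined above."""
        return (0.85 <= x1 + x2 <= 1.15) or (-0.15 <= x1 - x2 <= 0.15)

    def payoff(x1, x2, action, rng=np.random.default_rng()):
        """Stochastic payoff: correct actions are rewarded with probability 0.9,
        incorrect ones with probability 0.1."""
        correct = (action == 1) == in_cross(x1, x2)
        return 1.0 if rng.random() < (0.9 if correct else 0.1) else 0.0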
Parameter settings
The following global parameter settings were determined to give reasonable
performance (no systematic exploration of the parameter space was performed):
    learning rate for weight vector          α        0.05
    learning rate for node centres           α_c      0.004
    learning rate for information matrix     α_H      10.0
    learning rate for scale parameters       α_p      0.0 or 0.0005
    gain for inter-node spring               κ_s      1.0
    width of inter-node spring               σ_s      0.06
All learning rates were annealed to zero at a linear rate over the training period of forty thousand input patterns. In other words, for the jth input each learning rate had a value equal to $(40{,}000 - j)/40{,}000$ times its starting value. Each GBF unit was initially placed at a random position $\{N(0.5, 0.01),\, N(0.5, 0.01)\}$ near the centre of the space. The receptive fields of the nodes were initialised with principal axes parallel to the dimensions of the input space and with widths in each dimension of 0.12. The scale parameters were all equal at the start of training. On runs in which these were trained the learning rate $\alpha_p$ was set at 0.0005 and was subject to annealing over the training period.
Results
Figure B.2 shows the results for ten runs using networks of between five and ten
GBF units with the zero learning rate for the adaptive priors. The data shows quite
clearly two (or more) levels of performance which correspond to qualitative
differences in the network's ability to approximate the X shape. Specifically, only those nets scoring above approximately 87% produced a test output which showed all four arms of the X; below this figure either one arm was missing or the
space between the two arms was incorrectly filled.
[Plot: individual training runs with different numbers of GBF units (5 to 10), plotted against percentage success in reproducing the test output (72 to 97%).]
Figure B.2: Performance on the X task for nets of 5, 6, 7, 8, 9, and 10 GBF units.
Figure B.3 compares the performance shown above for networks of eight GBF
units (8-GBF) with ten runs using networks of eight units with adaptive variance
only. The difference between the average performance of the two systems is
significant3.
3 t = 7.658, p < 0.001 (2-tailed). Nine out of ten runs in the 'width only' test approximated the full X shape (µ = 89.3%, σ = 1.4%). The tenth run (in which the approximation was incomplete) was
excluded from the t-test comparison. The mean performance over all ten runs of the 8-GBF
network was 93.6%.
[Plot: individual training runs plotted against percentage success in reproducing the test output (72 to 97%).]
Figure B.3: Comparison of performance on the X task for nets with adaptive
covariance (clear triangles) and with adaptive variance only (black circles).
The experiments with GBF networks of five to ten units were repeated with the
non-zero learning rate for the prior probabilities. Figure B.4 shows the results for
each size of network. Overall, the performance is similar to that reported above, with networks of six units or fewer frequently failing to reproduce the full X
shape. Quantitatively, however, the scores achieved with each size of network are
slightly better.
[Plot: individual training runs with different numbers of GBF units (5 to 10), plotted against percentage success in reproducing the test output (72 to 97%).]
Figure B.4: Performance on the X task for networks of 5, 6, 7, 8, 9, and 10 GBF units
with adaptive prior probabilities.
Appendix C
Algorithm and Simulation Details for Chapter Five
This appendix gives details of the delayed reinforcement learning algorithms used
for the experiments described in Chapter Five with the simulated pole balancing
task. The first section gives the equations (from Barto et al. [11]) governing the
dynamics of the simulated cart-pole system. The following sections describe the
algorithms for training Actor/Critic learning systems on this task: for a system with a radial basis function (RBF) 'memory-based' coding layer, and for systems
of Gaussian basis function (GBF) units trained by generalised gradient ascent.
C.1 Dynamics of the cart-pole system
The dynamics of the cart-pole system are determined by the following equations (angles are expressed in radians):

$\ddot{\theta}_t = \dfrac{g \sin\theta_t + \cos\theta_t \left[ \dfrac{-F_t - m_p l \dot{\theta}_t^2 \sin\theta_t}{m_c + m_p} \right]}{l \left[ \dfrac{4}{3} - \dfrac{m_p \cos^2\theta_t}{m_c + m_p} \right]}$,

$\ddot{x}_t = \dfrac{F_t + m_p l \left[ \dot{\theta}_t^2 \sin\theta_t - \ddot{\theta}_t \cos\theta_t \right]}{m_c + m_p}$.

Here $F_t = +10$ newtons if the output of the control system is 1, and $-10$ newtons if the output is 0 (or $-1$); this is the force applied to the cart by the control system. $m_c = 1.0$ kg and $m_p = 0.1$ kg are the respective masses of the cart and pole, $l = 0.5$ m is the distance from the centre of mass of the pole to the pivot, and $g = 9.8$ m/s² is the acceleration due to gravity.

As in [11] the system was simulated according to Euler's method using a time-step $\tau = 0.02$, giving the following discrete-time equations for the four state variables:

$x_{t+1} = x_t + \tau \dot{x}_t$, $\quad \dot{x}_{t+1} = \dot{x}_t + \tau \ddot{x}_t$, $\quad \theta_{t+1} = \theta_t + \tau \dot{\theta}_t$, $\quad \dot{\theta}_{t+1} = \dot{\theta}_t + \tau \ddot{\theta}_t$.
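The following is a small Python sketch of one Euler integration step of these dynamics (an illustration consistent with the equations above, not the original simulation code).

    import math

    def cart_pole_step(state, action, tau=0.02, g=9.8, mc=1.0, mp=0.1, l=0.5):
        """Advance the cart-pole state (x, x_dot, theta, theta_dot) by one
        Euler step. action = 1 pushes right (+10 N), otherwise left (-10 N)."""
        x, x_dot, theta, theta_dot = state
        F = 10.0 if action == 1 else -10.0
        total_mass = mc + mp

        temp = (-F - mp * l * theta_dot ** 2 * math.sin(theta)) / total_mass
        theta_acc = (g * math.sin(theta) + math.cos(theta) * temp) / \
                    (l * (4.0 / 3.0 - mp * math.cos(theta) ** 2 / total_mass))
        x_acc = (F + mp * l * (theta_dot ** 2 * math.sin(theta)
                               - theta_acc * math.cos(theta))) / total_mass

        return (x + tau * x_dot,
                x_dot + tau * x_acc,
                theta + tau * theta_dot,
                theta_dot + tau * theta_acc)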
C.2 RBF memory-based learning architecture
The activation of the ith RBF node is given by

$g_i(t) = g(x(t), c_i, \sigma) = \dfrac{1}{\left[(2\pi)^{1/2}\sigma\right]^{N}} \exp\left( -\sum_{j=1}^{N} \dfrac{\left(x_j(t) - c_{ij}\right)^2}{2\sigma^2} \right)$

where $\sigma$ denotes the width of the Gaussian distribution of all nodes in all the input dimensions. The normalised basis encoding is then obtained by

$\rho_i(t) = \dfrac{g_i(t)}{\sum_{j=1}^{M} g_j(t)}$

and an estimate $\hat{x}$ of the input reconstructed by

$\hat{x}(t) = \sum_{i=1}^{M} \rho_i(t)\, c_i(t)$.

A new node is generated with its centre at x whenever

$\left\| x(t) - \hat{x}(t) \right\| > \epsilon$

where $\|\cdot\|$ denotes the squared Euclidean distance measure. The initial network contains zero units.

The recoding vector $\rho(t)$ acts as the input to an actor/critic learning architecture with parameters w and v, generating the action $y(t)$ and the prediction $V(t)$ at each time step according to

$V(t) = v^\top \rho(t)$

$y(t) = 1$ iff $w^\top \rho(t) + N(0, \sigma_a) \ge 0$, and $-1$ otherwise,

where $N(0, \sigma_a)$ is a Gaussian random variable with mean zero and standard deviation $\sigma_a$. The parameter vectors are then updated by

$\bar{w}(0) = 0, \quad \bar{\rho}(0) = 0$

$\bar{\rho}(t) = \rho(t) + \lambda\, \bar{\rho}(t-1)$

$\bar{w}(t) = y(t)\,\rho(t) + \kappa\, \bar{w}(t-1)$

$\Delta w(t) = \alpha\, e_{TD}(t+1)\, \bar{w}(t)$

$\Delta v = \beta\, e_{TD}(t+1)\, \bar{\rho}(t)$

where the temporal difference error is given by

$e_{TD}(t+1) = r(t+1) + \gamma V(t+1) - V(t)$.

When a new node is added to the network the recoding and parameter vectors all increase in dimension, with the new parameters taking the value zero.
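A minimal sketch of this node-allocation rule (a Python illustration only; the threshold and width correspond to the parameters listed below) is:

    import numpy as np

    class RBFCoder:
        """Grows a normalised RBF encoding, adding a node whenever the input
        cannot be reconstructed to within a squared-distance threshold epsilon."""

        def __init__(self, sigma=0.18, epsilon=0.18):
            self.sigma, self.epsilon = sigma, epsilon
            self.centres = []                       # the initial network has no units

        def _rho(self, x):
            # The normalisation constant cancels, so only the exponentials are needed.
            g = np.array([np.exp(-np.sum((x - c) ** 2) / (2 * self.sigma ** 2))
                          for c in self.centres])
            return g / g.sum()

        def encode(self, x):
            x = np.asarray(x, dtype=float)
            if self.centres:
                rho = self._rho(x)
                x_hat = sum(r * c for r, c in zip(rho, self.centres))
            else:
                x_hat = np.full_like(x, np.inf)     # force allocation of the first node
            if np.sum((x - x_hat) ** 2) > self.epsilon:
                self.centres.append(x.copy())       # new node centred on the input
            return self._rho(x)

In the full system the actor and critic parameter vectors would grow in step with the coder, with new entries initialised to zero.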
Specific architecture for the pole balancing task
The four state variables of the cart-pole system $\theta$, $\dot{\theta}$, $x$ and $\dot{x}$ were normalised to provide the 4-vector input x for the learning system. The following scaling equations (taken from Anderson [4]) were used:

$x_1 = \dfrac{\theta + 12}{24}, \quad x_2 = \dfrac{x + 2.4}{4.8}, \quad x_3 = \dfrac{\dot{x} + 1.5}{3.0}, \quad x_4 = \dfrac{\dot{\theta} + 115}{300}$,

where $\theta$ and $\dot{\theta}$ are given in degrees and degrees/s and $x$ and $\dot{x}$ are given in metres and m/s.
The reward given to the learning system was

$r(t) = -1$ if $\theta < -12°$, $\theta > +12°$, $x < -2.4$ or $x > 2.4$, and $0$ otherwise.
The global parameters used for the learning system were:

RBF coding (the network was limited to a maximum size of 162 coding units)
    threshold                          ε        0.18
    width                              σ        0.18

Actor
    learning rate                      α        0.2
    trace decay                        κ        0.9
    standard deviation of noise        σ_a      0.01

Critic
    discount factor                    γ        0.95
    trace decay                        λ        0.9
    learning rate                      β        0.1
C.3 Gaussian Basis Function (GBF) learning architecture
The network architecture is much the same as that described in Appendix B for the immediate reinforcement task. The principal difference is that the temporal difference error, which is computed by a second critic GBF network, is used in place of the immediate reinforcement; thus the learning system is an Actor/Critic architecture. The algorithm is given here in full to avoid any possible ambiguity.

GBF input coding
The activation of each GBF unit i with centre $c_i$ and information matrix $H_i$ for context x is given by the Gaussian

$g_i = (2\pi)^{-N/2} \left| H_i \right|^{1/2} \exp\left( -\tfrac{1}{2}(x - c_i)^\top H_i (x - c_i) \right)$   (C.3.1)

and normalised according to

$\rho_i = \dfrac{g_i}{\sum_j g_j}$.   (C.3.2)

Actor network
The actor network has $M_A$ GBF units where each unit i has parameters $c_{Ai}$ and $H_{Ai}$ describing its centre and information matrix. The normalised activation vector for all units at time t is given by the vector $\rho_A(t)$. The network has one output $y(t) \in \{1, 0\}$ which takes the value 1 with probability

$p(t) = \dfrac{1}{1 + e^{-s(t)}}$   (C.3.3)

where

$s(t) = w^\top \rho_A(t)$   (C.3.4)

and w is the output weight vector.

Given the temporal difference error $e_{TD}(t+1)$ let

$e_A(t+1) = e_{TD}(t+1)\left( y(t) - p(t) \right)$.   (C.3.5)

The update rule for the output weight vector w is then given by

$\Delta w = \alpha\, e_A(t+1)\, \rho_A(t)$   (C.3.6)

and for the parameters of the GBF units by

$\Delta c_{Ai} = \alpha_c\, e_A\, (w_i - s)\, \rho_{Ai}\, H_{Ai}(x - c_{Ai}) + \kappa_s d_{Ai}$, and   (C.3.7)

$\Delta H_{Ai} = \alpha_H\, e_A\, (w_i - s)\, \rho_{Ai}\left( H_{Ai}^{-1} - (x - c_{Ai})(x - c_{Ai})^\top \right)$   (C.3.8)

where dependence on time is assumed. The spring component $d_{Ai}$ between the GBF units (if required) is computed as described in Appendix B.2.
Critic network
The critic net has $M_V$ GBF units where each node i has parameters $c_{Vi}$ and $H_{Vi}$ describing its centre and information matrix. The normalised activation vector for all units at time t is given by the vector $\rho_V(t)$. The network has one output unit which generates the prediction $V(t|t)$ according to

$V(t|t) = v(t)^\top \rho_V(t)$   (C.3.9)

where v is the output weight vector, from which the temporal difference error

$e_{TD}(t+1) = r(t+1) + \gamma V(t+1|t) - V(t|t)$   (C.3.10)

is computed in the double-time-dependent manner.

The update rule for the output weight vector v is then given by

$\Delta v = \beta\, e_{TD}(t+1)\, \rho_V(t)$   (C.3.11)

and for the parameters of the GBF units by

$\Delta c_{Vi} = \beta_c\, e_{TD}\, (v_i - V)\, \rho_{Vi}\, H_{Vi}(x - c_{Vi}) + \kappa_s d_{Vi}$, and   (C.3.12)

$\Delta H_{Vi} = \beta_H\, e_{TD}\, (v_i - V)\, \rho_{Vi}\left( H_{Vi}^{-1} - (x - c_{Vi})(x - c_{Vi})^\top \right)$   (C.3.13)

where dependence on time is assumed. The spring component $d_{Vi}$ between the GBF units (if required) is computed as described in Appendix B.
Specific architecture for the pole balancing task.
The context vector and reward were computed as in section C.2 above. The values
for all the global parameters of the learning system were as follows.
Actor
    learning rate for output weights        α        0.2
    learning rate for node centres          α_c      0.002
    learning rate for information matrix    α_H      10.0
    gain for inter-node spring              κ_s      0.0

Critic
    discount factor                         γ        0.95
    learning rate for output weights        β        0.05
    learning rate for node centres          β_c      0.001
    learning rate for information matrix    β_H      10.0
    gain for inter-node spring              κ_s      0.0

All learning rates were annealed to zero at a linear rate over the maximum training period of one thousand trials. In other words, on the nth trial each gain parameter had a value equal to $(1000 - n)/1000$ times its starting value.

The GBF units were initially placed randomly near the centre of the normalised input space, with the position in each dimension sampled from the Gaussian distribution $N(0.5, 0.01)$. The receptive fields of the nodes were initialised with principal axes parallel to the dimensions of the input space and with widths in each dimension sampled from $N(0.12, 0.006)$.
Appendix D
Algorithm and Simulation Details for Chapter Six
This appendix gives details of the learning algorithms and global parameter
values for the adaptive local navigation architecture described in chapter six. The
input recoding mechanism used by the adaptive wander module is described first, followed by details of the algorithms employed by the motivate, critic, and actor
components.
Input Coding
For both the critic and actor components the real-valued input from the laser
range-finder was recoded to give index entries to identical CMAC coarse-coded
look-up tables.
In chapter four a CMAC was defined as a set $C = \{C_0, C_1, \ldots, C_T\}$ of T tilings, each itself a set $C_i = \{c_{i1}, c_{i2}, \ldots, c_{in}\}$ of non-overlapping quantisation cells. In this instance the three-dimensional space of depth vectors was quantised using tilings of cuboid cells of size $\frac{D}{5} \times \frac{D}{5} \times \frac{D}{5}$, where D is the maximum range of the range-finder. Five tilings, each offset from its neighbours by $\frac{D}{25}$ in each dimension, were overlaid to form each CMAC. Figure D.1 illustrates this coding mechanism.
Figure D.1: Input encoding by CMAC. Each tiling covers the space of possible
depth patterns. Five overlapping and offset tilings make up the CMAC encoder.
The input vector $x(t)$ (i.e. the current depth map) selects the set of coding cells $U(t) = \{c_1(t), c_2(t), \ldots, c_5(t)\}$ where each $c_i(t)$ is the cell in the ith tiling that encompasses this input position in the 3D space. CMAC cells are mapped in a one-to-one manner4 to elements of the recoding vector $\rho(t)$; hence,

$\rho_i(t) = \tfrac{1}{T}$ iff $c_i \in U(t)$, and $0$ otherwise.   (D.1)

4 For larger input spaces (formed by adding further depth measures) a hashing function can be used to create a many-to-one mapping and hence effectively reduce the number of stored parameters.
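A minimal sketch of this CMAC encoder (a Python illustration with tile offsets and sizes as described above; not the original simulation code) is:

    import numpy as np

    def cmac_cells(depths, D=2.0, n_tilings=5):
        """Return, for each tiling, the cuboid cell containing the 3-vector of
        depth readings. Cell width is D/5; each tiling is offset from its
        neighbour by D/25 in every dimension."""
        depths = np.asarray(depths, dtype=float)
        cell_width = D / 5.0
        cells = []
        for k in range(n_tilings):
            offset = k * D / 25.0
            idx = tuple(np.floor((depths + offset) / cell_width).astype(int))
            cells.append((k,) + idx)          # (tiling, i, j, k) identifies a cell
        return cells

    def recode(depths, cell_index, D=2.0, n_tilings=5):
        """Sparse recoding vector rho(t): each active cell contributes 1/T.
        cell_index is a dict mapping cell identifiers to parameter positions,
        growing as new cells are visited (the look-up-table idea)."""
        active = [cell_index.setdefault(c, len(cell_index))
                  for c in cmac_cells(depths, D, n_tilings)]
        rho = np.zeros(len(cell_index))
        rho[active] = 1.0 / n_tilings
        return rho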
Motivate
Motivate takes input from external (collision) and internal (motor) sensors and
combines them into an overall scalar reward signal for the system.
Collision reward
At each time-step t each of the collision sensors generates a binary output oi (t)
which is non-zero only when the sensor is in contact with an object. The total
collision reward $r_C(t)$ is calculated as the negative sum of these outputs, i.e.

$r_C(t) = -\kappa_C \sum_{i=1}^{4} o_i(t)$   (D.2)

where $\kappa_C$ is a scaling factor. As long as the robot avoids collision this reward
will therefore be zero. However, when a collision does occur the punishment is
proportional to the number of sensors that are triggered. This should assist the
learning system by enabling it to discriminate between crashes of differing
severity. For instance, a head- or side-on collision should trigger two sensors (giving a punishment of -2), but if the robot crashes on only one corner (a position that is easier to recover from) the punishment will be only -1.
Movement reward
The motor system measures the current actual speeds at which the wheels are revolving. By averaging the two wheel speeds the current translational velocity can be calculated. Motivate takes the absolute value $s(t)$ of this signal and uses it to compute the movement reward $r_M$. The goal of the system is to maintain a constant target velocity $s^*$ that is just below the vehicle's maximum speed $s^{max}$. The reward $r_M$ is therefore computed as a Gaussian function which is zero at $s(t) = s^*$ and negative at all other speeds; this is given by

$r_M(t) = \kappa_M \left( \exp\left( -\omega \left[ \dfrac{s(t) - s^*}{s^{max}} \right]^2 \right) - 1 \right)$.   (D.3)

The constant $\omega$ in this equation determines whether the peak in the reward function is narrow or flat; in other words, it affects how tightly the constraint of constant velocity is enforced. The scaling parameter $\kappa_M$ controls the overall strength of the signal.
Total reward
The total reward for the current time-step is given by a weighted sum of the
collision and movement reward signals, i.e.
r(t) = rC (t) + rM (t ) .
(D.4)
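A compact sketch of the motivate component (a Python illustration only; parameter names follow the table at the end of this appendix, and the movement reward follows the reconstruction of equation D.3 given above) is:

    import math

    def motivate(collision_outputs, wheel_speeds,
                 s_star=0.08, s_max=0.1, kappa_c=1.0, kappa_m=1.0, omega=1.4):
        """Combine collision and movement signals into a scalar reward.

        collision_outputs: binary outputs of the four collision sensors.
        wheel_speeds: (left, right) actual wheel speeds in m/step.
        """
        # Collision reward: zero unless sensors are triggered (equation D.2).
        r_c = -kappa_c * sum(collision_outputs)
        # Movement reward: zero at the target speed, negative elsewhere (D.3).
        s = abs(sum(wheel_speeds) / 2.0)           # current translational speed
        r_m = kappa_m * (math.exp(-omega * ((s - s_star) / s_max) ** 2) - 1.0)
        return r_c + r_m                           # total reward (equation D.4)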
Critic
The critic component is based on the standard adaptive critic architecture; that is, it learns to output a prediction $V(t)$ given by a linear function of a parameter vector v and the recoding vector $\rho(t)$, i.e.

$V(t) = v^\top \rho(t)$.   (D.5)

The parameter vector v is updated by the usual TD(λ) rule

$\Delta v(t) = \beta\, e_{TD}(t+1)\, \bar{\rho}(t)$   (D.6)

where

$e_{TD}(t+1) = r(t+1) + \gamma V(t+1) - V(t)$   (D.7)

is the TD error and

$\bar{\rho}(0) = 0, \quad \bar{\rho}(t) = \rho(t) + \lambda_v\, \bar{\rho}(t-1)$   (D.8)

is the STM trace of past recoding vectors, with rate of trace decay $\lambda_v$. Note that the time counter t is set to zero after a collision by the recover module.
Actor
The actor component is based on the gaussian action unit architecture proposed
by Williams [184] and described qualitatively in chapter three. The following
describes the procedure for a one-dimensional gaussian which generalises in a
straight-forward way to the two-dimensional output used in the simulation.
The output y of a 1-D gaussian action unit is selected randomly from a probability function g with mean $\mu$ and standard deviation $\sigma$ given by

$g(y, \mu, \sigma) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\dfrac{(y - \mu)^2}{2\sigma^2} \right)$.   (D.9)

The mean and standard deviation are encoded by a parameter matrix $w = (w_\mu, w_\sigma)$ and computed as functions of the recoding vector according to

$\mu = (w_\mu)^\top \rho(x)$ and $\sigma = \exp\left( (w_\sigma)^\top \rho(x) \right)$

where the exponential is used in computing the standard deviation to ensure that it always has a positive value and approaches zero very slowly (see [184]).

The eligibilities (see section 2.2) of the parameter vectors $w_\mu$ and $w_\sigma$ are, respectively,

$\delta w_\mu = \dfrac{\partial \ln g}{\partial \mu}\dfrac{\partial \mu}{\partial w_\mu} = \dfrac{y - \mu}{\sigma^2}\,\rho(x)$, and   (D.10)

$\delta w_\sigma = \dfrac{\partial \ln g}{\partial (\ln\sigma)}\dfrac{\partial (\ln\sigma)}{\partial w_\sigma} = \dfrac{(y - \mu)^2 - \sigma^2}{\sigma^2}\,\rho(x)$.   (D.11)

This gives the eligibility trace for each vector,

$\bar{w}_i(0) = 0, \quad \bar{w}_i(t) = \delta w_i(t) + \lambda_i\, \bar{w}_i(t-1)$,   (D.12)

and the update rule

$\Delta w_i(t) = \alpha_i\, e_{TD}(t+1)\, \bar{w}_i(t)$   (D.13)

where i is one of the indices $\mu$ or $\sigma$, $\alpha_i$ is a learning rate and $\lambda_i$ the rate of eligibility trace decay.

For the wander module the actor component computes the two-dimensional output $y(t) = (f, a)$ corresponding to the desired forward velocity and steering angle of the vehicle. Each element of the output is encoded by a separate weight matrix and has different global parameters.
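A minimal sketch of a one-dimensional Gaussian action unit and its eligibilities (a Python illustration of equations D.9 to D.11, not the original simulation code) is:

    import numpy as np

    def gaussian_action_step(rho, w_mu, w_sigma, rng=np.random.default_rng()):
        """Sample an action from the Gaussian policy and return the
        characteristic eligibilities for the mean and log-std parameter vectors."""
        mu = w_mu @ rho
        sigma = np.exp(w_sigma @ rho)          # exponential keeps sigma positive
        y = rng.normal(mu, sigma)              # stochastic action (equation D.9)
        elig_mu = (y - mu) / sigma ** 2 * rho                        # equation D.10
        elig_sigma = ((y - mu) ** 2 - sigma ** 2) / sigma ** 2 * rho  # equation D.11
        return y, elig_mu, elig_sigma

The eligibility traces and TD-error-weighted updates of equations D.12 and D.13 are then applied in the same manner as for the critic.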
Parameters
The following values for the parameters of the simulation and the control modules
were used in the experiments reported.
motor control
    maximum speed                       s_max        0.1 m/step
    maximum acceleration                             0.02 m/step
    maximum braking                                  0.05 m/step
    axle width                                       0.3 m

perception
    ray angles                                       -60°, 0°, +60°
    maximum depth                       d_max        2 m
    scaling and normalisation of rays                d / d_max

motivate
    target velocity                     s*           0.08 m/step
    collision reward scale                           1.0
    movement reward scale               κ            1.0
    movement reward width               σ_M          1.4

critic
    discount factor                     γ            0.9
    learning rate                       β            0.15
    trace decay                         λ_v          0.7
    initial evaluation function                      0.0
To make the selection of learning parameters more straight-forward, the
parameters of the actor modules are scaled to lie roughly in the range -1 to +1.
The actual output of each module is computed by simply multiplying the output
of the CMAC by the scaling factor given below. Following a suggestion by
Watkins [177] the learning rate for the mean action was halved for negative
values of the TD error.
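As a small illustration of this convention (the helper names below are mine, not the thesis code), the command actually sent to the motor system is the unit's output multiplied by the per-module scaling factor, and the learning rate for the mean is chosen according to the sign of the TD error:

    def scaled_command(unit_output, scaling_factor):
        # Map an actor output lying roughly in [-1, +1] to vehicle units,
        # e.g. scaling_factor = 0.1 m/step for forward velocity,
        #      scaling_factor = 0.67 radians/step for steering angle.
        return scaling_factor * unit_output

    def mean_learning_rate(e_td, base_rate=0.10):
        # Halve the learning rate for the action mean when the TD error
        # is negative (Watkins [177]).
        return base_rate if e_td >= 0 else 0.5 * base_rate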
forward velocity (f) (range -0.1 to +0.1 m/step)
    mean:   learning rate               α_µf         0.10 (e_TD ≥ 0), 0.05 (e_TD < 0)
            initial parameter value                  0.5
    ln σ:   learning rate               α_σf         0.02
            initial parameter value                  ln 0.2
    trace decay                         λ_µf, λ_σf   0.7
    output scaling factor                            0.1
steering angle (a) (range -0.67 to +0.67 radians/step)
    mean:   learning rate               α_µa         0.10 (e_TD ≥ 0), 0.05 (e_TD < 0)
            initial parameter value                  0.0
    ln σ:   learning rate               α_σa         0.01
            initial parameter value                  ln 0.2
    trace decay                         λ_µa, λ_σa   0.7
    output scaling factor                            0.67
Bibliography
1.
Ahalt, S.C., K.K. Ashok, P. Chen, and D.E. Melton (1990). Competitive learning
algorithms for vector quantization. Neural Networks. 3: p. 277-290.
2.
Albus, J.S. (1971). A theory of cerebellar function. Mathematical Biosciences. 10:
p. 25-61.
3.
Anderson, C.W. (1986). Learning and Problem Solving with Multilayer
Connectionist Systems. PhD thesis. University of Massachusetts.
4.
Anderson, C.W. (1988). Strategy learning with multilayer connectionist
representations. GTE Laboratories, MA. Report no. TR87-509.3.
5.
Anderson, J.R. (1983). The Architecture of Cognition. Cambridge, MA: Harvard
University Press.
6.
Anderson, T.L. (1990). Autonomous robots and emergent behaviours: a set of
primitive behaviours for mobile robot control. In International Workshop on
Intelligent Robots and Systems. Tsukuba, Japan.
7.
Arbib, M.A. (1989). The Metaphorical Brain. New York: Wiley and Sons.
8.
Arbib, M.A. (1990). Interaction of multiple representations of space in the brain.
In Brain and Space, J. Paillard, Editor. Oxford: Oxford University Press. p. 380-403.
9.
Barto, A.G. (1985). Learning by statistical cooperation of self-interested neuronlike computing elements. Human Neurobiology. 4: p. 229-256.
10.
Barto, A.G., S.J. Bradtke, and S.P. Singh (1993). Learning to act using real-time
dynamic programming. Department of Computer Science, Amherst,
Massachusetts. Report no. CMPSCI 93-02.
11.
Barto, A.G., R.S. Sutton, and C.W. Anderson (1983). Neuronlike adaptive
elements that can solve difficult learning control problems. IEEE Transactions in
systems, man, and cybernetics. SMC-13: p. 834-846.
12.
Barto, A.G., R.S. Sutton, and P.S. Brouwer (1981). Associative search network: a
reinforcement learning associative memory. Biological Cybernetics. 43: p. 175-185.
13.
Barto, A.G., R.S. Sutton, and C.J.H.C. Watkins (1989). Learning and sequential
decision making. In Learning and Computational Neuroscience, J.W. Moore and
M. Gabriel, Editor. Cambridge, MA: MIT Press.
14.
Barto, A.G., R.S. Sutton, and C.J.H.C. Watkins (1990). Sequential decision
problems and neural networks. In Advances in Neural Information Processing
Systems 2. San Mateo, CA: Morgan Kaufmann.
15.
Bellman, R.E. (1957). Dynamic Programming. Princeton, NJ: Princeton
University Press.
16.
Bellman, R.E. and S.E. Dreyfus (1962). Applied Dynamic Programming. Rand
Corporation.
17.
Bierman, G.J. (1977). Factorization Methods of Discrete Sequential Estimation.
New York: Academic Press.
18.
Blake, A., G. Hamid, and L. Tarassenko (1992). A design for a visual motion
transducer. University of Oxford, Department of Engineering Science. Report
no. OUEL 1960/92.
19.
Brooks, R.A. (1985). A subdivision algorithm in configuration space for findpath
with rotation. IEEE Transactions on Systems, Man, and Cybernetics. SMC-15(2):
p. 224-233.
20.
Brooks, R.A. (1986). A robust layered control system for a mobile robot. IEEE
Journal on Robotics and Automation. RA-2: p. 14-23.
21.
Brooks, R.A. (1989). Robot beings. In International Workshop on Intelligent
Robots and Systems. Tsukuba, Japan.
22.
Brooks, R.A. (1989). A robot that walks: emergent behaviour from a carefully
evolved network. Neural Computation. 1(2): p. 253-262.
23.
Brooks, R.A. (1990). Challenges for complete creature architectures. In From
Animals to Animats: Proceedings of the First International Conference on the
Simulation of Adaptive Behaviour. Paris. Cambridge, MA: MIT Press.
24.
Brooks, R.A. (1990). Elephants don’t play chess. In Designing Autonomous
Agents: Theory and Practice from Biology to Engineering and Back, P. Maes,
Editor. Cambridge, MA: MIT Press.
25.
Brown, R.G. and P.Y.C. Hwang (1992). Introduction to Random Signals and
Applied Kalman Filtering. 2nd ed. New York: John Wiley & Sons.
26.
Cartwright, B.A. and T.S. Collett (1979). How honeybees know their distance
from a nearby visual landmark. Journal of Experimental Biology. 82: p. 367-72.
27.
Cartwright, B.A. and T.S. Collett (1983). Landmark learning in bees: experiments
and models. Journal of Comparative Physiology A. 151(4): p. 521-544.
28.
Cartwright, B.A. and T.S. Collett (1987). Landmark maps for honeybees.
Biological Cybernetics. 57: p. 85-93.
29.
Chapman, D. and P.E. Agre (1987). Pengi: An implementation of a theory of
activity. In Proceedings of the Sixth National Conference on AI (AAAI-87). San
Mateo, CA: Morgan Kaufmann.
30.
Chapman, D. and L.P. Kaelbling (1990). Learning from delayed reinforcement in
a complex domain. Teleos Research, Palo Alto. Report no. TR-90-11.
31.
Chapuis, N., C. Thinus-Blanc, and B. Poucet (1983). Dissociation of mechanisms
involved in dogs’ oriented displacements. Quarterly Journal of Experimental
Psychology. 35B: p. 213-219.
32.
Charniak, E. and D. McDermott (1985). Introduction to Artificial Intelligence.
Reading, MA: Addison-Wesley.
33.
Chatila, R. (1982). Path planning and environment learning in a mobile robot
system. In European Conference on AI. France.
34.
Chatila, R. (1986). Mobile robot navigation: space modelling and decisional
processes. In 3rd International Symposium on Robotics Research. Cambridge,
MA: MIT Press.
35.
Cliff, D.T. (1990). Computational neuroethology: a provisional manifesto.
University of Sussex. Report no. CSRP 162.
36.
Collett, T.S., B.A. Cartwright, and B.A. Smith (1986). Landmark learning and
visuo-spatial memory in gerbils. Journal of Comparative Physiology A. 158: p.
835-51.
37.
Connell, J. (1988). Navigation by path remembering. SPIE Mobile Robots III.
1007.
38.
Connell, J.H. (1990). Minimalist Mobile Robots. Perspectives in Artificial
Intelligence, Boston: Academic Press.
39.
Courant, R. and H. Robbins (1941). What is Mathematics? Oxford: Oxford
University Press.
40.
Crowley, J.L. (1985). Navigation for an intelligent mobile robot. IEEE Journal of
Robotics and Automation. 1(1): p. 31-41.
41.
Dayan, P. (1991). Reinforcing Connectionism: Learning the statistical way. PhD
thesis. Edinburgh.
42.
Dijkstra, E.W. (1959). A note on two problems in connexion with graphs. Numerische
Mathematik. 1: p. 269-272.
43.
Douglas, R.J. (1966). Cues for spontaneous alternation. Journal of Comparative
and Physiological Psychology. 62: p. 171-183.
44.
Dudek, G., M. Jenkin, E. Milios, and D. Wilkes (1988). Robotic exploration as
graph construction. Department of Computer Science, University of Toronto.
Report no. RBCV-TR-88-23.
45.
Edelman, G.M. (1989). Neural Darwinism. Oxford: Oxford University Press.
46.
Elfes, A. (1987). Sonar-based real-world mapping and navigation. IEEE Journal
of Robotics and Automation. 3(3): p. 249-265.
47.
Etienne, A.S. (1992). Navigation of a small mammal by dead reckoning and local
cues. Current Directions in Psychological Science. 1(2): p. 48-52.
48.
Etienne, A.S., R. Maurer, F. Saucy, and E. Teroni (1986). Short-distance homing
in the golden hamster after a passive outward journey. Animal Behaviour. 34: p.
696-715.
49.
Etienne, A.S., E. Teroni, C. Hurni, and V. Portenier (1990). The effect of a single
light cue on homing behaviour in the golden hamster. Animal Behaviour. 39: p.
17-41.
50.
Farin, G. (1988). Curves and Surfaces for Computer Aided Geometric Design: A
Practical Guide. Boston: Academic Press.
51.
Franklin, J.A. (1989). Input space representation for refinement learning control.
In IEEE International Symposium on Intelligent Control. Albany, NY. Computer
Society Press.
52.
Gallistel, C.R. (1990). The Organisation of Learning. Cambridge, MA: MIT
Press.
53.
Gibson, J.J. (1950). The perception of the visual world. Cambridge, MA:
Riverside Press.
54.
Gould, J.L. (1986). The locale map of honeybees: do insects have cognitive
maps? Science. 232: p. 861-63.
55.
Gronenberg, W., J. Tautz, and B. Hölldobler (1993). Fast trap jaws and giant neurons
in the ant Odontomachus. Science. 262: p. 561-563.
56.
Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning
real-valued functions. Neural Networks. 3: p. 671-692.
57.
Hallam, J., P. Forster, and J. Howe (1989). Map-Free Localisation in a Partially
Moving 3D World: The Edinburgh Feature-Based Navigator. In Intelligent
Autonomous Systems: Proceedings of an International Conference. Amsterdam.
Stichting International Congress of Intelligent Autonomous Systems.
58.
Halperin, J.R.P. (1990). Machine Motivation. In From Animals to Animats:
Proceedings of the First International Conference Simulation of Adaptive
Behaviour. Paris. Cambridge, MA: MIT Press.
59.
Hertz, J., A. Krogh, and R.G. Palmer (1991). Introduction to the Theory of Neural
Computation. Redwood City, CA: Addison-Wesley.
60.
Hillis, W.D. (1992). Co-evolving parasites improve simulated evolution as an
optimisation procedure. In Artificial Life II. Santa Fe Institute Studies in the
Sciences of Complexity, vol. 6. Reading, Mass.: Addison-Wesley.
61.
Hinton, G.E., J.L. McClelland, and D.E. Rumelhart (1986). Distributed
Representations. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1, D.E. Rumelhart and J.L. McClelland, Editor.
Cambridge, MA: Bradford Books.
62.
Hintzman, D.L. (1992). Twenty-five years of learning and memory: was the
cognitive revolution a mistake? In Fourteenth International Symposium on
Attention and Performance. Michigan. Cambridge, MA: Bradford Books.
63.
Hirsh, R. (1974). The hippocampus and contextual retrieval of information from
memory: a theory. Behavioural Biology. 12: p. 421-444.
64.
Iyengar, S., C. Jorgensen, S. Rao, and C. Weisbin (1985). Learned navigation
paths for a robot in unexplored terrain. In 2nd Conference on Artificial
Intelligence Applications, Vol. 1. p. 148-154.
65.
Jaakkola, T., M.I. Jordan, and S.P. Singh (1993). On the convergence of
stochastic iterative dynamic programming algorithms. MIT Computational
Cognitive Science. Report no. 9307.
66.
Jacobs, R.A., M.L. Jordan, S.J. Nowlan, and G.E. Hinton (1991). Adaptive
mixtures of local experts. Neural Computation. 3(1).
67.
Jang, J.-S.R. and C.-T. Sun. Functional equivalence between radial basis function
networks and fuzzy inference systems. Department of Electrical Engineering and
Computer Science, University of California.
68.
Jerison, H. (1973). Evolution of the Brain and Intelligence. New York: Academic
Press.
69.
Jordan, M.I. (1992). Forward models: supervised learning with a distal teacher.
Cognitive Science.
70.
Kohonen, T. (1984). Self-organization and associative memory. Heidelberg:
Springer-Verlag.
71.
Kortenkamp, D. and E. Chown (1992). A directional spreading activation network
for mobile robot navigation. In From Animals to Animats: Proceedings of the
Second International Conference on Simulation of Adaptive Behaviour. Honolulu,
USA. Cambridge, MA: MIT Press.
72.
Kortenkamp, D., T. Weymouth, E. Chown, and S. Kaplan (1991). A scene-based
multi-level representation for mobile robot spatial mapping and navigation. IEEE
transactions on robotics and automation.
73.
Krose, B.J.A. and J.W.M. van Dam (1992). Adaptive state space quantisation for
reinforcement learning of collision-free navigation. In IEEE International
Conference on Intelligent Robots and Systems.
74.
Krose, B.J.A. and J.W.M. van Dam (1992). Learning to avoid collisions: a
reinforcement learning paradigm for mobile robot navigation. In
IFAC/IFIP/IMACS International Symposium on Artificial Intelligence in Real-Time Control. Delft.
75.
Kubovy, M. (1971). Concurrent pitch segregation and the theory of indispensable
attributes. In Perceptual Organization, M. Kubovy and J. Pomerantz, Editor.
Erlbaum: Hillsdale, N.J.
76.
Kuipers, B. (1978). Modelling spatial knowledge. Cognitive Science. 2: p. 129-153.
77.
Kuipers, B. (1982). The “map in the head” metaphor. Environment and
behaviour. 14: p. 202-220.
78.
Kuipers, B. and Y. Byun (1991). A robot exploration and mapping strategy based
on a semantic hierarchy of spatial representations. Robotics and Autonomous
Systems. 8.
79.
Kuipers, B. and Y.T. Byun (1987). A qualitative approach to robot exploration
and map-learning. In Spatial reasoning and multi-sensor fusion workshop.
Chicago, Illinois.
80.
Kuipers, B. and Y.T. Byun (1987). A robust, qualitative method for robot
exploration and map-learning. In Proceedings of the Sixth National Conference
on AI (AAAI–87). St. Pauls, Minneapolis. Morgan Kaufmann.
81.
Kuipers, B. and T. Levitt (1988). Navigation and mapping in large-scale space. AI
Magazine. (Summer 1988): p. 25-43.
82.
Leiser, D. and A. Zilbershatz (1989). The traveller: a computational model of
spatial network learning. Environment and behaviour. 21(4): p. 435-463.
83.
Levenick, J.R. (1991). NAPS: A connectionist implementation of cognitive maps.
Connection Science. 3(2).
84.
Levitt, T.S. and D.T. Lawton (1990). Qualitative navigation for mobile robots.
Artificial Intelligence. 44: p. 305-360.
85.
Lieberman, D.A. (1993). Learning: Behaviour and Cognition. 2nd ed. Pacific
Grove, CA: Brooks/Cole Publishing Co.
86.
Linsker, R. (1986). From basic network principles to neural architecture:
emergence of orientation-selective cells. Proceedings of the National Academy of
Sciences. 83: p. 8390-8394.
87.
Linsker, R. (1988). Self-organization in a perceptual network. IEEE Computer.
(March): p. 105-117.
88.
Littman, M.L. (1992). An optimization-based categorization of reinforcement
learning environments. In From Animals to Animats: Proceedings of the Second
International Conference on Simulation of Adaptive Behaviour. Honolulu, USA.
Cambridge, MA: MIT Press.
89.
Lozano-Perez, T. (1983). Spatial planning: a configuration space approach. IEEE
transactions on computers. C-32(2): p. 108-121.
90.
Luttrell, S.P. (1989). Hierarchical vector quantisation. Proceedings of the IEE.
136: p. 405-413.
91.
Maes, P. (1992). Behaviour-Based Artificial Intelligence. In From Animals to
Animats: Proceedings of the Second International Conference on Simulation of
Adaptive Behaviour. Honolulu, USA. Cambridge, MA: MIT Press.
92.
Mataric, M.J. (1990). Navigating with a rat brain: A neurologically-inspired
model for robot spatial representation. In From Animals to Animats: Proceedings
of the First International Conference Simulation of Adaptive Behaviour. Paris.
Cambridge, MA: MIT Press.
93.
Mataric, M.J. (1990). Parallel, decentralized spatial mapping for robot navigation
and path planning. In 1st Workshop on Parallel Problem Solving from Nature.
Dortmund, FRG. Springer-Verlag.
94.
Mataric, M.J. (1992). Integration of representation into goal-driven behaviour-based robots. IEEE Transactions on Robotics and Automation. 8(3): p. 304-312.
95.
Matthies, L. and S. Shafer (1987). Error modelling in stereo navigation. IEEE
Journal of Robotics and Automation. 3(3): p. 239-248.
96.
McNamara, T.P. (1992). Spatial representation. Geoforum. 23(2): p. 139-150.
97.
McNaughton, B.L., L.L. Chen, and E.J. Markus (1991). Landmark learning and
the sense of direction - a neurophysiological and computational hypothesis.
Journal of Cognitive Neuroscience. 3(2): p. 192-202.
98.
Meyer, J. and A. Guillot (1990). Simulation of adaptive behaviour in animats:
review and prospect. In From Animals to Animats: Proceedings of the First
International Conference Simulation of Adaptive Behaviour. Paris. Cambridge,
MA: MIT Press.
99.
Michie, D. and R.A. Chambers (1968). Boxes: an experiment in adaptive control.
In Machine Intelligence 2, E. Dale and D. Michie, Editor. Oliver and Boyd:
Edinburgh.
100.
Millan, J.d.R. and C. Torras (1992). A reinforcement connectionist approach to
robot path finding in non-maze-like environments. Machine Learning. 8(3/4): p.
229-256.
101.
Millington, P.J. (1991). Associative Reinforcement Learning for Optimal Control.
MSc thesis. Massachusetts Institute of Technology, Cambridge, MA.
102.
Minsky, M. and S. Papert (1988). Perceptrons. 3rd ed. Cambridge, MA: MIT
Press.
103.
Minsky, M.L. (1961). Steps toward artificial intelligence. Proceedings IRE. 49: p.
8-30.
104.
Mishkin, M., B. Malamut, and J. Bachevalier (1992). Memories and habits: two
neural systems. In Fourteenth International Symposium on Attention and
Performance. Michigan. Bradford Books.
105.
Moody, J. and C. Darken (1989). Fast learning in networks of locally-tuned
processing units. Neural Computation. 1(2): p. 281-294.
106.
Moore, A.W. (1991). Fast, robust adaptive control by learning only with forward
models. In Advances in Neural Information Processing systems 4. Denver. San
Mateo, CA: Morgan Kaufmann.
107.
Moore, A.W. (1991). Knowledge of knowledge and intelligent experimentation
for learning control. In International Joint Conference on Neural Networks.
Seattle.
108.
Moore, A.W. (1991). Variable resolution dynamic programming: efficiently
learning action maps in multi-variate real-valued state-spaces. In Machine
Learning: Proceedings of the 8th International Workshop. San Mateo, CA:
Morgan Kaufmann.
109.
Moravec, H. (1988). Sensor fusion in uncertainty grids. AI Magazine. (Summer
1988): p. 61-74.
110.
Morris, R.G.M. (1990). Toward a representational hypothesis of the role of
hippocampal synaptic plasticity in spatial and other forms of learning. Cold
Spring Harbor Symposia on Quantitative Biology. 55: p. 161-173.
111.
Myers, C.E. (1992). Delay learning in artificial neural networks. London:
Chapman & Hall.
112.
Nehmzow, U. and T. Smithers (1990). Mapbuilding using self-organising
networks in “really useful robots”. In From Animals to Animats: Proceedings of
the First International Conference Simulation of Adaptive Behaviour. Paris.
Cambridge, MA: MIT Press.
113.
Nehmzow, U., T. Smithers, and B. McGonigle (1992). Increasing behavioural
repertoire in a mobile robot. In From Animals to Animats: Proceedings of the
Second International Conference on Simulation of Adaptive Behaviour. Honolulu,
USA. Cambridge, MA: MIT Press.
114.
Nilsson, N.J. (1982). Principles of Artificial Intelligence. Berlin: Springer Verlag.
115.
Nowlan, S.J. (1990). Competing experts: an experimental investigation of
associative mixture models. Department of Computer Science, University of
Toronto. Report no. CRG-TR-90-5.
116.
Nowlan, S.J. (1990). Maximum likelihood competition in RBF networks.
University of Toronto. Report no. CRG-TR-90-2.
117.
Nowlan, S.J. and G.E. Hinton (1991). Evaluation of adaptive mixtures of
competing experts. In Advances in Neural Information Processing Systems 3.
Denver, Colorado. San Mateo, CA: Morgan Kaufmann.
118.
O’Keefe, J. (1990). Computational theory of the hippocampal cognitive map.
Progress in Brain Research. 83: p. 301-312.
119.
O’Keefe, J. (1990). The hippocampal cognitive map and navigational strategies.
In Brain and Space, J. Paillard, Editor. Oxford: Oxford University Press.
120.
O’Keefe, J.A. and L. Nadel (1978). The Hippocampus as a Cognitive Map.
Oxford: Oxford University Press.
121.
Olton, D.S. (1979). Mazes, maps, and memory. American Psychologist. 34(7): p.
583-596.
122.
Olton, D.S. (1982). Spatially organized behaviours of animals: behavioural and
neurological studies. In Spatial Abilities: Development and Physiological
Foundations, M. Potegal, Editor. Academic Press: New York.
123.
Overmier, J.B. and M.E.P. Seligman (1967). Effects of inescapable shock upon
subsequent escape and avoidance learning. Journal of Comparative and
Physiological Psychology. 63: p. 23-33.
124.
Piaget, J. and B. Inhelder (1967). The Child’s Conception of Space. New York:
Norton.
125.
Piaget, J., B. Inhelder, and A. Szeminska (1960). The Child’s Conception of
Geometry. New York: Basic Books.
126.
Poggio, T. and F. Girosi (1989). A theory of networks for approximation and
learning. MIT AI Lab. Report no. 1140.
127.
Poggio, T. and F. Girosi (1990). Networks for approximation and learning.
Proceedings of the IEEE. 78(9): p. 1481-1496.
128.
Porrill, J. (1993). Approximation by linear combinations of basis functions. AI
Vision Research Unit, Sheffield University.
129.
Poucet, B. (1985). Spatial behaviour of cats in cue-controlled environments.
Quarterly Journal of Experimental Psychology. 37B: p. 155-179.
130.
Poucet, B., C. Thinus-Blanc, and N. Chapuis (1983). Route planning in cats, in
relation to the visibility of the goal. Animal Behaviour. 31: p. 594-599.
131.
Powell, M.J.D. (1987). Radial basis functions for multivariable interpolation: a
review. In Algorithms for Approximation, J.C. Mason and M.G. Cox, Editor.
Oxford: Clarendon Press.
132.
Prescott, A.J. (1993). Building long-range cognitive maps using local landmarks.
In From Animals to Animats: Proceedings of the Second International Conference
on Simulation of Adaptive Behaviour. Honolulu, USA. Cambridge, MA: MIT
Press.
133.
Prescott, A.J. and J.E.W. Mayhew (1992). Obstacle avoidance through
reinforcement learning. In Advances in Neural Information Processing Systems 4.
Denver. San Mateo, CA: Morgan Kaufman.
134.
Prescott, A.J. and J.E.W.M. Mayhew (1992). Adaptive local navigation. In Active
Vision, A. Blake and A. Yuille, Editor. MIT Press:
135.
Rao, N., S. Iyengar, and G. deSaussure (1988). The visit problem: visibility
graph-based solution. In IEEE International Conference on Robotics and
Automation.
136.
Rao, N., N. Stoltzfus, and S. Iyengar (1988). A retraction method for terrain
model acquisition. In IEEE International Conference on Robotics and
Automation.
137.
Ritter, H., T. Martinetz, and K. Schulten (1992). Neural Computation and Self-Organizing Maps. Reading, MA: Addison-Wesley.
138.
Roitblat, H.L. (1992). Comparative Approaches to Cognitive Science. University
of Honolulu. Report no.
139.
Rosenblatt, F. (1961). Principles of Neurodynamics: Perceptrons and the Theory
of Brain Mechanisms. Washington, DC: Spartan Books.
140.
Rumelhart, D.E., G.E. Hinton, and R.J. Williams (1986). Learning internal
representations by error propagation. In Parallel Distributed Processing:
Explorations in the Micro-structure of Cognition., D.E. Rumelhart and J.L.
McClelland, Editor. Bradford: Cambridge, MA.
141.
Sanger, T.D. An optimality principle for unsupervised learning. MIT AI
Laboratory, Cambridge, MA. Report no.
142.
Schacter, D.L. (1987). Implicit memory: history and current status. Journal of
Experimental Psychology: Learning, Memory, and Cognition. 13: p. 501-518.
143.
Schmidhuber, J.H. (1990). Networks adjusting networks. In Distributed Adaptive
Neural Information Processing. Oldenburg.
144.
Schmidhuber, J.H. (1990). Recurrent networks adjusted by adaptive critics. In
IEEE International Joint Conference on Neural Networks. Washington DC.
145.
Schmidhuber, J.H. (1991). Adaptive confidence and adaptive curiosity.
Technische Universitat Munchen, Germany. Report no. FKI-149-91.
146.
Scholl, M.J. (1987). Cognitive maps as orienting schemata. Journal of
Experimental Psychology: Learning, Memory and Cognition. 13: p. 615-628.
147.
Scholl, M.J. (1992). Landmarks, places, environments: multiple mind-brain
systems for spatial orientation. Geoforum. 23(2): p. 151-164.
148.
Schwartz, A. (1993). Thinking locally to act globally: a novel approach to
reinforcement learning. In Fifteenth Annual Conference of the Cognitive Science
Society. University of Colorado-Boulder. Lawrence Erlbaum Associates.
149.
Seligman, M.E.P. and S.F. Maier (1967). Failure to escape traumatic shock.
Journal of Experimental Psychology. 74: p. 1-9.
150.
Shannon, S. and J.E.W. Mayhew (1990). Simple associative memory. Research
Initiative in Pattern Recognition, UK. Report no. RIPREP/1000/78/90.
151.
Shepard (1990). Internal representation of universal regularities. In Neural
Connections, Mental Computation, L. Nadel, L.A. Cooper, P. Culicover, and R.M.
Harnish, Editor. Bradford Books: Cambridge, MA.
152.
Sherry, D.F. and D.L. Schacter (1987). The evolution of multiple memory
systems. Psychological Review. 94(4): p. 439-454.
153.
Shimamura, A.P. (1989). Disorders of memory: the cognitive science perspective.
In Handbook of Neuropsychology, F. Boller and J. Grafman, Editor. Elsevier
Press: Amsterdam.
154.
Siegel, A.W. and S.H. White (1975). The development of spatial representations
of large-scale environments. In Advances in Child Development and Behaviour,
H.W. Reese, Editor. Academic Press:
155.
Smith, R. and P. Cheeseman (1986). On the representation and estimation of
spatial uncertainty. Int. Journal of Robotics Research. 5(4): p. 56-68.
156.
Smith, R., M. Self, and P. Cheeseman (1987). A stochastic map for uncertain
spatial relationships. In Workshop on Spatial Reasoning and Multisensor Fusion.
157.
Snaith, M. and O. Holland (1991). A biologically plausible mapping system for
animat navigation. Technology Applics. Group, Alnwick, UK.
158.
Soldo, M. (1990). Reactive and preplanned control in a mobile robot. In IEEE
Conference on Robotics and Automation. Cincinnati, Ohio.
159.
Squire, L.R. (1987). Memory and Brain. Oxford: OUP.
160.
Squire, L.R. and S. Zola-Morgan (1988). Memory: brain systems and behaviour.
Trends in Neuroscience. 11: p. 170-175.
161.
Stevens, A. and P. Coupe (1978). Distortions in judged spatial relations. Cognitive
Psychology. 10: p. 422-437.
162.
Sutton, R.S. (1984). Temporal Credit Assignment in Reinforcement Learning
Control. PhD thesis. University of Massachusetts.
163.
Sutton, R.S. (1988). Learning to predict by the methods of temporal differences.
Machine Learning. 3: p. 9-44.
164.
Sutton, R.S. (1990). Integrated architectures for learning, planning and reacting
based on approximate dynamic programming. In 7th Int. Conf. on Machine
Learning. Austin, Texas. Morgan Kaufmann.
165.
Sutton, R.S. and A.G. Barto (1981). Toward a modern theory of adaptive
networks: expectation and prediction. Psych. Review. 88(2): p. 135-70.
166.
Sutton, R.S. and B. Pinette (1985). The learning of world models by connectionist
networks. In Proceedings of the Seventh Annual Conference of the Cognitive
Science Society.
167.
Tesauro, G.J. (1991). Practical issues in temporal difference learning. Machine
Learning. 8(3/4): p. 257-278.
168.
Thorndike, E.L. (1911). Animal Intelligence. Darien, Conn.: Hafner.
169.
Thrun, S. and K. Moller (1992). Active Exploration in Dynamic Environments. In
Advances in Neural Information Processing Systems IV. Denver. San Mateo, CA:
Morgan Kaufmann.
170.
Toates, F. and P. Jensen (1990). Ethological and psychological models of
motivation—towards a synthesis. In From Animals to Animats: Proceedings of
the First International Conference Simulation of Adaptive Behaviour. Paris.
Cambridge, MA: MIT Press.
171.
Todd, P.M. and S.W. Wilson (1992). Environment structure and adaptive
behaviour from the ground up. In From Animals to Animats: Proceedings of the
Second International Conference on Simulation of Adaptive Behaviour. Honolulu,
USA. Cambridge, MA: MIT Press.
172.
Torras, C. (1990). Motion planning and control: symbolic and neural levels of
computation. In Proceedings of the 3rd COGNITIVA Conference.
173.
Trabasso, T. (1963). Stimulus emphasis and all-or-none learning of concept
identification. Journal of Experimental Psychology. 65: p. 395-406.
174.
Turchan, M. and A. Wong (1985). Low-level learning for a mobile robot:
Environment model acquisition. In 2nd Conference on Artificial Intelligence
Applications.
175.
Tversky, B. (1981). Distortions in memory for maps. Cognitive Psychology. 13: p.
407-433.
176.
Vogl, T.P., J.K. Mangis, A.K. Rigler, W.T. Zink, and D.L. Alkon (1988).
Accelerating the convergence of the back-propagation method. Biological
Cybernetics. 59: p. 257-263.
177.
Watkins, C.J.H.C. (1989). Learning from delayed rewards. PhD thesis. King’s
College, Cambridge.
178.
Wehner, R. (1983). Celestial and terrestrial navigation: human strategies— insect
strategies. In Neuroethology and Behavioural Physiology, F. Huber and H. Markl,
Editor. Springer-Verlag: Berlin.
179.
Wehner, R. and R. Menzel (1990). Do insects have cognitive maps? Annual
Review of Neuroscience. 13: p. 403-14.
180.
Werbos, P.J. (1977). Advanced forecasting methods for global crisis warning and
models of intelligence. General Systems Yearbook. 22: p. 25-38.
181.
Whitehead, S.D. and D.H. Ballard (1990). Active perception and reinforcement
learning. Neural Computation. 2: p. 409-419.
182.
Widrow, B. (1962). Generalization and information storage in networks of
Adaline ‘neurons’. In Self-organizing systems, M.C. Jovitz, J.T. Jacobi, and G.
Goldstein, Eds. Spartan Books: Washington, DC. p. 435-461.
183.
Widrow, B. and S.D. Stearns (1985). Adaptive signal processing. Englewood
Cliffs, NJ: Prentice-Hall.
184.
Williams, R.J. (1988). Towards a theory of reinforcement learning connectionist
systems. College of Computer Science, Northeastern University, Boston, MA.
Report no. NU-CCS-88-3.
185.
Williams, R.J. and L.J. Baird III (1990). A mathematical analysis of actor-critic
architectures for learning optimal controls through incremental dynamic
programming. In Sixth Yale workshop on Adaptive and Learning Systems. New
Haven, CT.
186.
Willingham, D.B., M.J. Nissen, and P. Bullemer (1989). On the development of
procedural knowledge. Journal of Experimental Psychology: Learning, Memory,
and Cognition. 15: p. 1047-1060.
187.
Wilson, S.W. (1990). The animat path to AI. In From Animals to Animats:
Proceedings of the First International Conference on the Simulation of Adaptive
Behaviour. Paris. Cambridge, MA: MIT Press.
188.
Wong, V.S. and D.W. Payton (1987). Goal-oriented obstacle avoidance through
behaviour selection. SPIE Mobile Robots II. 852: p. 2-10.
189.
Worden, R. (1992). Navigation by fragment fitting: a theory of hippocampal
function. Hippocampus. 2(2): p. 165-188.
190.
Zapata, R., P. Lepinay, C. Novales, and P. Deplanques (1992). Reactive
behaviours of fast mobile robots in unstructured environments: sensor-based
control and neural networks. In From Animals to Animats: Proceedings of the
Second International Conference on Simulation of Adaptive Behaviour. Honolulu,
USA. Cambridge, MA: MIT Press.
191.
Zipser, D. (1983). The representation of location. Institute for Cognitive Science,
University of California at San Diego. Report no. ICS 8301.
192.
Zipser, D. (1983). The representation of maps. Institute of Cognitive Science,
University of California at San Diego. Report no. ICS 8304.
193.
Zipser, D. (1986). Biologically plausible models of place recognition and place
location. In Parallel Distributed Processing: Explorations in the Micro-structure
of Cognition, Volume 2. J.L. McClelland and D.E. Rumelhart, Editor. Bradford:
Cambridge, MA. p. 432-70.