

Explorations in Reinforcement and Model-based Learning

A thesis submitted by Anthony J. Prescott, Department of Psychology, University of Sheffield, in partial fulfillment of the requirements for the degree of Doctor of Philosophy. Submitted December 1993. Accepted 1994.

Summary

Reinforcement learning concerns the gradual acquisition of associations between events in the context of specific rewarding outcomes, whereas model-based learning involves the construction of representations of causal or world knowledge outside the context of any specific task. This thesis investigates issues in reinforcement learning concerned with exploration, the adaptive recoding of continuous input spaces, and learning with partial state information. It also explores the borderline between reinforcement and model-based learning in the context of the problem of navigation. A connectionist learning architecture is developed for reinforcement and delayed reinforcement learning that performs adaptive recoding in tasks defined over continuous input spaces. This architecture employs networks of Gaussian basis function units with adaptive receptive fields. Simulation results show that networks with only a small number of units are capable of learning effective behaviour in real-time control tasks within reasonable time frames. A tactical/strategic split in navigation skills is proposed and it is argued that tactical, local navigation can be performed by reactive, task-specific systems. Acquisition of an adaptive local navigation behaviour is demonstrated within a modular control architecture for a simulated mobile robot. The delayed reinforcement learning system for this task acquires successful, often plan-like strategies for control using only partial state information. The algorithm also demonstrates adaptive exploration using performance-related control over local search. Finally, it is suggested that strategic, way-finding navigation skills require model-based, task-independent knowledge. A method for constructing spatial models based on multiple, quantitative local allocentric frames is described and simulated. This system exploits simple neural network learning, storage and search mechanisms to support robust way-finding behaviour without the need to construct a unique global model of the environment.

Declaration

This thesis has been composed by myself and contains original work of my own execution. Some of the work reported here has previously been published:

Prescott, A.J. and Mayhew, J.E.W. (1992). Obstacle avoidance through reinforcement learning. In Moody, J.E., Hanson, S.J., and Lippmann, R.P. (eds), Advances in Neural Information Processing Systems 4. Morgan Kaufmann, New York.

Prescott, A.J. and Mayhew, J.E.W. (1992). Adaptive local navigation. In Blake, A. and Yuille, A. (eds), Active Vision. MIT Press, Cambridge MA.

Prescott, A.J. and Mayhew, J.E.W. (1993). Building long-range cognitive maps using local landmarks. In From Animals to Animats: Proceedings of the 2nd International Conference on Simulation of Adaptive Behaviour. MIT Press.

Tony Prescott, 14 December 1993.

We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started
And know the place for the first time.

T.S. Eliot: The Four Quartets.

For my parents—John and Diana

Acknowledgements

I would like to thank the many people who have given me their help, advice and encouragement in researching and writing this dissertation.
I am particularly indebted to the following. My supervisor John Mayhew for his depth of insight, forthrightness, and humour. His ideas have been a constant source of inspiration to me through the years. John Frisby for his patience, wise counsel, and generous support. John Porrill, Neil Thacker, and latterly Steve Hippisley-Cox, for guiding my stumbling steps through the often alien and bewildering realm of mathematics. Pat Langdon for some apposite advice on programming Lisp and tuning Morris Minor engines. Paul Dean, Pete Redgrave, Pete Coffey, Rod Nicolson and Mark Blades, for some inspiring conversations about animal and human intelligence. The remaining members of the AI Vision Research Unit and the Department of Psychology, past and present, for creating such a pleasant and rewarding environment in which to work and learn.

I would not have been able to carry out this work without the friends who have given me their support and companionship since coming to Sheffield. I am particularly grateful to Phil Whyte, Leila Edwards and especially Sue Keeton for sharing their lives and homes with me. I am also grateful to Justin Avery for proof-reading parts of the text. Finally, I wish to thank the Science and Engineering Research Council and the University of Sheffield for the financial support I have received while carrying out this work.

Contents

One      Introduction and Overview
Two      Reinforcement Learning Systems
Three    Exploration
Four     Input Coding for Reinforcement Learning
Five     Experiments in Delayed Reinforcement Learning using Networks of Basis Function Units
Six      Adaptive Local Navigation
Seven    Representations for Way-finding: Topological Models
Eight    Representations for Way-finding: Local Allocentric Frames
Nine     Conclusions and Future Work

Appendices
A        Algorithm and Simulation Details for Chapter Three
B        Algorithm and Simulation Details for Chapter Four
C        Algorithm and Simulation Details for Chapter Five
D        Algorithm and Simulation Details for Chapter Six

Bibliography

Chapter One
Introduction and Overview

Summary

The 'explorations' in this thesis consider learning from a computational perspective. In other words, it is assumed that both the biological 'neural nets' that underlie adaptive processes in animals and humans, and the electronic circuitry that allows learning in a robot or a computer simulation, can be considered as implementing similar types of abstract, information processing operations. A computational understanding of learning should then give insight into the adaptive capabilities of both natural and artificial systems. However, to determine such an understanding in abstract is clearly a daunting, if not impossible, task. This will be helped by looking both to natural systems, for inspiration concerning how effective adaptive mechanisms have evolved and are organised, and to artificial systems, as vehicles in which to embed and then evaluate theories of learning. This chapter introduces two key domains in the study of learning from each of these perspectives; it then sets out the research objectives that will be pursued in the remainder of the thesis.

Learning in natural systems

Research in psychology suggests that underlying a large number of observable phenomena of learning and memory there are two broad clusters of learning processes.
First, there are the associative learning processes involved in habit formation, the acquisition of motor skills, and certain forms of classical and instrumental conditioning. These processes involve incremental adaptation and do not seem to need awareness. Learning is driven by a significant outcome in the form of a positively or negatively reinforcing event. Further, it does not seem to require or involve the acquisition of knowledge about the causal processes underlying the task that is solved.

Second, there are the learning processes involved in acquiring knowledge about the relationships between events (stimuli or responses). For instance, that one event follows another (causal knowledge), or is close to another (spatial knowledge). These forms of learning appear to have more of an all-or-none character, and may require awareness or involve attentional processes. They are also not directly involved in generating behaviour, and need not be acquired with respect to a specific task or desired outcome. The knowledge acquired can support both further learning and decision-making through inference.

Patterns of learning impairment in human amnesiacs [153, 159, 160] and lesion studies with animals (e.g. with monkeys [104, 160], and with rats [63, 110]) indicate that the second style of learning relies on specific medial-temporal structures in the brain, in particular, the hippocampus. In contrast the simpler associative forms of learning underlying habit and skill acquisition are not affected by damage to this brain region, but appear instead to be supported by neural systems that evolved much earlier. This view is supported by observations that all vertebrates and most invertebrates show the more 'primitive' learning abilities, whereas the more 'cognitive' learning styles have evolved primarily in higher vertebrates [62], coinciding with a massive increase in brain-size.¹

¹ When body-size is taken into account the brains of higher vertebrates are roughly ten times as large as those of lower vertebrates [68].

A variety of terms have been suggested to capture the qualitative distinctions between different learning processes, for instance: procedural and declarative [5, 186], dispositional and representational [32, 110], implicit and explicit [142], and incremental and all-or-none [173]. This variation reflects the fact that there may be a number of important dimensions of difference involved. Here I will adopt the terms dispositional and representational suggested by Thomas [32] and Morris [110] to refer to these two clusters of learning processes.

A fuller understanding of learning systems, in which their similarities, differences, and interactions are better understood, can be gained by realising the mechanisms in computational models and evaluating them in various task domains. This agenda describes much of the recent connectionist research in cognitive science and Artificial Intelligence (AI).

Learning and connectionism

The explosion of research in connectionist modelling in the last ten years has reawakened interest in associative learning and has motivated researchers to attempt to construct complex learning systems out of simpler associative components. Connectionist systems, or artificial neural networks, consist of highly interconnected networks of simple processing units in which knowledge is stored in the connection strengths or weights between units. These systems demonstrate remarkable learning capabilities yet adaptation of the network weights is governed by only a small number of simple rules.
Many of these rules have their origins in psychological learning theories—the associationist ideas of Locke, James and others, Thorndike's 'law of effect', Hull's stimulus-response theory, and the correlation learning principles proposed by Hebb. Although most contemporary connectionist models use more sophisticated learning rules and assume network circuitry unlikely to occur in real neural nets, the impression remains of a deep similarity with the adaptive capabilities of biological systems.

Classical connectionist research in the 1960s by Rosenblatt [139] and by Widrow and Hoff [182] concerned the acquisition of target associative mappings by adjusting a single layer of adaptive weights under feedback from a 'teacher' with knowledge of the correct responses. However, researchers have since relaxed many of the assumptions embodied in these early systems.

First, multi-layer systems have been developed that adaptively recode the input by incorporating a trainable layer of 'hidden' units (e.g. [140]). This development surmounted what had been a major limitation of early connectionist systems—the inability of networks with only one adaptive layer to efficiently represent certain classes of input-output mappings [102].

Second, reinforcement learning systems have been developed that learn appropriate outputs without guidance from a 'teacher' by using environmental feedback in the form of positively or negatively reinforcing outcomes (e.g. [162]). These systems have been extended further to allow learning in delayed reward tasks in which reinforcing outcomes occur only after a sequence of stimuli have been observed and actions performed. Recent work has also considered reinforcement learning in multi-layer systems with adaptive recodings (e.g. [167]).

Finally, model-based associative learning systems have been developed that, rather than acquiring task knowledge directly, explicitly encode knowledge about causal processes (forward models) or environment structure (world models) [41, 69, 106, 107, 164, 166]. This knowledge then forms the basis either for task-specific learning or for decision-making by interpolation, inference, or planning.

It is clear that there are certain parallels between the connectionist learning systems described so far and the classes of psychological learning processes described above. In particular, there seems to be a reasonable match between some forms of reinforcement learning and dispositional learning, and between model-based learning and certain aspects of representational learning processes. To summarise, the former pair are both concerned with the gradual acquisition of associations between events in the context of specific rewarding outcomes. Although these events might individually be composed of elaborate compound patterns, and the acquired link may involve recoding processes, the input-output relation is of a simple, reflexive nature. On the other hand, model-based learning and representational learning, while being associative in a broad sense (in that they concern the acquisition of knowledge of the relationships between events), generally involve the construction of representations of causal or world knowledge to be used by other learning or decision-making processes.
These learning processes may also have other characteristics such as the involvement of domain-specific learning mechanisms and/or memory structures.

The 'Animat' approach to understanding adaptive behaviour

The shared interest in adaptive systems, between psychologists and ethologists, on the one hand, and Artificial Intelligence researchers and roboticists on the other, has recently seen the development of a new inter-disciplinary research field. Being largely uncharted it goes by a variety of titles—'comparative' or 'biomimetic' cognitive science (Roitblat [138]), 'computational neuroethology' (Cliff [35]), 'behaviour-based' AI (Maes [91]), 'animat' (simulated animal) AI (Wilson [187]) or 'Nouvelle' AI (Brooks [24]). The common research aim is to understand how autonomous agents—animals, simulated animals, robots, or simulated robots—can survive and adapt in their environments, and be successful in fulfilling needs and achieving goals. The following seeks to identify some of the key elements of this approach by citing some of its leading proponents.

Wilson [187] identifies the general methodology of this research programme as follows:

“The basic strategy of the animat approach is to work towards higher levels of intelligence from below—using minimal ad hoc machinery. The essential process is incremental and holistic [...] it is vital (1) to maintain the realism and wholeness of the environment [...] (2) to maximise physicality in the sensory signals [...] and (3) to employ adaptive mechanisms maximally, to minimalise the rate of introduction of new machinery and maximise understanding of adaptation.” ([187] p. 16)

An important theme is that control, in the agent, is not centralised but is distributed between multiple task-oriented modules—

“The goal is to build complete intelligent systems. To the extent that the system consists of modules, the modules are organised around activities, such as path-finding, rather than around sensory or representational systems. Each activity is a complete behaving sub-system, which individually connects perception to action.” (Roitblat [138] p. 9)

The animat approach therefore seeks minimal reliance on internal world models and reasoning or planning processes—

“We argue that the traditional idea of building a world model, or a representation of the state of the world is the wrong idea. Instead the creature [animat] needs to process only aspects of the world that are relevant to its task. Furthermore, we argue that it may be better to construct theoretical tools which instead of using the state of the world as their central formal notion, instead use the aspects that the creature is sensing as the primary formal notion.” (Brooks [23] p. 436)

It advocates, instead, an emphasis on the role of the agent's interaction with its environment in driving the selection and performance of appropriate, generally reflexive, behaviours—

“Rather than relying on reasoning to intervene between perception and action, we believe actions derive from very simple sorts of machinery interacting with the immediate situation. This machinery exploits regularities in its interaction with the world to engage in complex, apparently planful activity without requiring explicit models of the world.” (Chapman and Agre [29] p. 1)
“One interesting hypothesis is that the most efficient systems will be those that convert every frequently encountered important situation to one of ‘virtual stimulus-response’ in which internal state (intention, memory) and sensory stimulus together form a compound stimulus that immediately implies the correct next intention or external action. This would be in contrast to a system that often tends to ‘figure out’ or undertake a chain of step by step reasoning to decide the next action.” (Wilson [187] p. 19)

Perception too is targeted at acquiring task-relevant information rather than delivering a general description of the current state of the perceived world—

“The basic idea is that it is unnecessary to equip the animat with a sensory apparatus capable at all times of detecting and distinguishing between objects in its environment in order to ensure its adaptive competence. All that is required is that it be able to register only the features of a few key objects and ignore the rest. Also those objects should be indexed according to the intrinsic features and properties that make them significant.” (Meyer and Guillot [98] p. 3)

It is clear, from this brief overview, that the 'Animat' approach is in good accord with reinforcement learning approaches to the adaptation of behavioural competences. In view of the stated aim of building 'complete intelligent systems' in an incremental and bottom-up fashion, this is wholly consistent with the earlier observation that learning in simpler animals is principally of a dispositional nature. However, the development of this research paradigm is already beginning to see the need for some representational learning. One reason for this is the emphasis on mobile robotics as the domain of choice for investigating animat AI. The next section contains a preliminary look at this issue.

Navigation as a forcing domain

The fundamental skill required by a mobile agent is the ability to move around in the immediate environment quickly and safely; this will be referred to here as local navigation competence. Research in animat AI has had considerable success in using pre-wired reactive competences to implement local navigation skills [6, 22, 38, 170]. The robustness, fluency, and responsiveness of these systems have played a significant role in promoting the animat methodology as a means for constructing effective, autonomous robots. In this thesis the possibility of acquiring adaptive local navigation competences through reinforcement learning is investigated and advanced as an appropriate mechanism for learning or fine-tuning such skills.

However, a second highly valuable form of navigation expertise is the ability to find and follow paths to desired goals outside the current visual scene. This skill will be referred to here as way-finding. The literature on animal spatial learning differentiates the way-finding skills of invertebrates and lower vertebrates from those of higher vertebrates (birds and mammals). In particular, it suggests that invertebrate navigation is performed primarily by using path integration mechanisms and compass senses and secondarily by orienting to specific remembered stimulus patterns (landmarks) [26-28, 178, 179]. This suggests that invertebrates do not construct models of the spatial layout of their environment and that consequently, their way-finding behaviour is relatively inflexible and restricted to homing or retracing familiar routes.²

² Gould [54] has proposed a contrary view that insects do construct models of spatial layout; however, the balance of evidence (cited above) appears to be against this position.
In contrast, higher vertebrates appear to construct and use representations of the spatial relations between locations in their environments (see, for example, [52, 119, 120, 122]). They are then able to use these models to select and follow paths to desired goals. This form of learning is often regarded as the classic example of a representational learning process (e.g. [152]).

This evidence has clear implications for research in animat AI. First, it suggests that the current ethos of minimal representation and reactive competence could support way-finding behaviour similar to that of invertebrates.³ Second, however, the acquisition of more flexible way-finding skills would appear to require model-based learning abilities; this raises the interesting issue of how control and learning architectures in animat AI should be developed to meet this need.

³ In particular it should be possible to exploit the good odometry information available to mobile robots.

Content of the thesis

The above seeks to explain the motivation for the research described in the remaining chapters. However, although inspired by the desire to understand and explain learning in natural systems, the work to be described primarily concerns learning in artificial systems. The motivation, like much of the work in connectionism, is to seek to understand learning systems from a general perspective before attempting to apply this understanding to the interpretation and modelling of animal or human behaviour.

I have suggested above that much of the learning that occurs in natural systems clusters into two fundamental classes—dispositional and representational learning. I have further suggested that these two classes are loosely analogous to reinforcement learning and model-based learning approaches in connectionism. Finally, I have proposed that a forcing domain for the development of model-based learning systems is that of navigation. These ideas form the focus for the work in this thesis.

The first objective, which is the focus of chapters two through five, is to understand reinforcement learning systems. A particular concern is with learning in continuous state-spaces and with continuous outputs. Many natural learning problems and most tasks in robot control are of this nature; however, much existing work in reinforcement learning has concentrated primarily on finite, discrete state and action spaces. These chapters concentrate on the issues relating to exploration and adaptive recoding in reinforcement learning. In particular, chapters four and five propose and evaluate a novel architecture for adaptive coding in which a network of local expert units with trainable receptive fields is applied to continuous reinforcement learning problems.

A second objective, which is the topic of chapter six, is the consideration of reinforcement learning as a tool for acquiring adaptive local navigation competences. This chapter also introduces the theme of navigation, which is continued through chapters seven and eight where the possibility of model-based learning systems for way-finding is considered. The focus of these later chapters is on two questions. First, on whether spatial representations for way-finding should encode topological or metric knowledge of spatial relations.
And second, on whether a global representation of space is desirable as opposed to multiple local models. Finally, chapter nine seeks to draw some conclusions from the work described and considers future directions for research.

A more detailed summary of the contents of each chapter is as follows:

Chapter Two—Reinforcement Learning Systems introduces the study of learning systems in general and of reinforcement and delayed reinforcement learning systems in particular. It focuses specifically on learning in continuous state-spaces and on the Actor/Critic systems that have been proposed for such tasks, in which one learning element (the Actor) learns to control behaviour while the other (the Critic) learns to predict future rewards. The relationship of delayed reward learning to dynamic programming is reviewed and the possibility of systems that integrate reinforcement learning with model-based learning is considered. The chapter concludes by arguing that, despite the absence of strong theoretical results, reinforcement learning should be possible in tasks with only partial state information where the strict equivalence with stochastic dynamic programming does not apply.

Chapter Three—Exploration considers methods for determining effective exploration behaviour in reinforcement learning systems. This chapter primarily concerns the indirect effect on exploration of the predictions determined by the critic system. The analysis given shows that if the initial evaluation is optimistic relative to available rewards then an effective search of the state-space will arise that may prevent convergence on sub-optimal behaviours. The chapter concludes with a brief review of more direct methods for adapting exploration behaviour.

Chapter Four—Input Coding for Reinforcement Learning considers the task of recoding a continuous input space in a manner that will support successful reinforcement learning. Three general approaches to this problem are considered: fixed quantisation methods; unsupervised learning methods for adaptively generating an input coding; and adaptive methods that modify the input coding according to the reinforcement received. The advantages and drawbacks of various recoding techniques are considered and a novel multilayer learning architecture is described in which a recoding layer of Gaussian basis function (GBF) units with adaptive receptive fields is trained by generalised gradient descent to maximise the expected reinforcement. The performance of this algorithm is demonstrated on a simple immediate reinforcement task.

Chapter Five—Experiments in Delayed Reinforcement Learning Using Networks of Basis Function Units applies the algorithm developed in the previous chapter to a delayed reinforcement control task (the pole-balancer) that has often been used as a test-bed for reinforcement learning systems. The performance of the GBF algorithm is compared and contrasted with other work, and considered in relation to the problem of input sampling that arises in real-time control tasks. The interface between explicit task knowledge and adaptive reinforcement learning is considered, and it is proposed that the GBF algorithm may be suitable for refining the control behaviour of a coarsely pre-specified system.
Chapter Six—Adaptive Local Navigation introduces the topic of navigation and argues for the division of navigation competences between tactical, local navigation skills that deal with the immediate problems involved in moving efficiently while avoiding collisions, and strategic, way-finding skills that allow the successful planning and execution of paths to distant goals. It further argues that local navigation can be efficiently supported by adaptive dispositional learning processes, while way-finding requires task-independent knowledge of the environment; in other words, it requires representational, or model-based, learning of the spatial layout of the world. A modular architecture in the spirit of Animat AI is proposed for the acquisition of local navigation skills through reinforcement learning. To evaluate this approach a prototype model of an acquired local navigation competence is described and successfully tested in a simulation of a mobile robot.

Chapter Seven—Representations for Way-finding: Topological Models. Some recent research in Artificial Intelligence has favoured spatial representations of a primarily topological nature over more quantitative models on the grounds that they are: cheaper and easier to construct, more robust in the face of poor sensor data, simpler to represent, more economical to store, and also, perhaps, more biologically plausible. This chapter suggests that it may be possible, given these criteria, to construct sequential route-like knowledge of the environment, but that to integrate this information into more powerful layout models or maps may not be straightforward. It is argued that the construction of such models realistically demands the support of either strong vision capabilities or the ability to detect higher-order geometric relations. And further, that in the latter case, it seems hard to justify not using the acquired information to construct models with richer geometric structure that can provide more effective support to way-finding.

Chapter Eight—Representations for Way-finding: Local Allocentric Frames. This chapter describes a representation of metric environmental spatial relations with respect to landmark-based local allocentric frames. The system works by recording in a relational network of linear units the locations of salient landmarks relative to barycentric coordinate frames defined by groups of three nearby cues. It is argued that the robust and economical character of this system makes it a feasible mechanism for way-finding in large-scale space. The chapter further argues for a heterarchical view of spatial knowledge for way-finding. It proposes that knowledge should be constructed in multiple representational 'schemata' where different schemata are distinguished not so much by their geometric content but by their dependence on different sensory modalities, environmental cues, or computational mechanisms. It thus argues against storing unified models of space, favouring instead the use of runtime arbitration mechanisms to decide the relative contributions of different local models in determining appropriate way-finding behaviour.

Chapter Nine—Conclusions and Future Work summarises the findings of the thesis and considers some areas where further research might be worthwhile.

Chapter Two
Reinforcement Learning Systems

Summary

The purpose of this chapter is to set out the background to the learning systems described in later parts of the thesis.
It therefore consists primarily of a description of reinforcement learning systems, and particularly of the actor/critic and temporal difference learning methods developed by Sutton and Barto.

Reinforcement learning systems have been studied since the early days of artificial intelligence. An extensive review of this research has been provided by Sutton [162]. Williams [184] has also discussed a broad class of reinforcement learning algorithms, viewing them from the perspective of gradient ascent learning and in relation to the theory of stochastic learning automata. An account of the relationship between delayed reinforcement learning and the theory of dynamic programming has been provided by Watkins [177] and is clearly summarised in [14]. The theoretical understanding of these algorithms has recently seen several further advances [10, 41, 65]. In view of the thoroughness of these existing accounts the scope of the review given here is limited to what I hope is a sufficient account of the theory of reinforcement learning to support the work described later.

The structure of this chapter is as follows. The study of learning systems and their embodiment in neural networks is briefly introduced from the perspective of function estimation. Reinforcement learning methods are then reviewed and considered as gradient ascent learning algorithms following the analysis given by Williams [184]. A sub-class of these learning methods are the reinforcement comparison algorithms developed by Sutton. Temporal difference methods for learning in delayed reinforcement tasks are then described within the framework developed by Sutton and Barto [11, 162, 163] and by Watkins [177]. This section also includes a brief review of the relationship of TD methods to both supervised learning and dynamic programming, and describes the actor/critic architecture for learning with delayed rewards which is studied extensively in later chapters. Finally, a number of proposals for combining reinforcement learning with model-based learning are reviewed, and the chapter concludes by considering learning in circumstances where the system has access to only partial state information.

2.1 Associative Learning Systems

Learning appropriate behaviour for a task can be characterised as forming an associative memory that retrieves suitable actions in response to stimulus patterns. A system that includes such a memory and a mechanism by which to improve the stored associations during interactions with the environment is called an associative learning system.

The stimulus patterns that provide the inputs to a learning system are measures of salient aspects of the environment from which suitable outputs (often actions) can be determined. However, a learning system may also attend to a second class of stimuli called feedback signals. These signals arise as part of the environment's response to the recent actions of the system and provide measures of its performance. In general, therefore, we are concerned with learning systems, as depicted in Figure 2.1, that improve their responses to input stimuli under the influence of feedback from the environment.

Figure 2.1: A learning system viewed as an associative memory. The learning mechanism causes associations to be formed in memory in accordance with feedback from the environment (adapted from [162]).
Associative memories are mappings

Mathematically, the behaviour of any system that transforms inputs into outputs—stimuli into responses—can be characterised as a function f that maps an input domain X to an output domain Y. Any associative memory is therefore a species of mapping. Generally we will be concerned with mappings over input and output domains that are multi-dimensional vector spaces. That is, the input stimulus will be described by a vector⁴ $x = (x_1, x_2, \ldots, x_N)^T$ whose elements each measure some salient aspect of the current environmental state, and the output will also be a vector $y = (y_1, y_2, \ldots, y_M)^T$ whose elements characterise the system's response.

⁴ Vectors are normally considered to be column vectors. Superscript T is used to indicate the transpose of a row vector into a column vector or vice versa.

In order to learn, a system must be able to modify the associations that encode the input-output mapping. These adaptable elements of memory are the parameters of the learning system and can be described by a vector w taken from a domain W. The mapping defined by the memory component of a learning system can therefore be written as the function $y = f(w, x)$, $f: W \times X \to Y$.

Varieties of learning problem

To improve implies a measure of performance. As suggested above such measures are generally provided in the form of feedback signals. The nature of the available feedback can be used to classify different learning problems as supervised, reinforcement, or unsupervised learning tasks.

In supervised learning feedback plays an instructive role indicating, for any given input, what the output of the system ought to have been. The environment trains the learning system by supplying examples of a target mapping $y^* = F(x)$, $F: X \to Y$. For any input-output pair $(x, y^*)$ a measure of the error in the estimated output y can be determined. The task of the learning system is then to adapt the parameters w in a way that will minimise the total error over all the input patterns in the training set. Since the goal of learning is for the associative memory to approximate the target mapping, learning can be viewed as a problem of function approximation.

In contrast, the feedback provided in a reinforcement learning task is of a far less specific nature. A reinforcement signal is a scalar value judging whether performance is good or bad but not indicating either the size or direction of the output error. Some reinforcement problems provide feedback in the form of regular assessments on sliding scales. In other tasks, however, reinforcement can be both more intermittent and less informative. At the most 'minimalist' end of the spectrum signals can indicate as little as whether the final outcome following a long sequence of actions was a success or a failure.

In reinforcement learning the target mapping is any mapping that achieves maximum positive reinforcement and minimum negative reinforcement. This mapping is generally not known in advance; indeed, there may not be a unique optimal function. Learning therefore requires active exploration of alternative input-output mappings.
In this process different outputs (for any given input stimulus) are tried out, the consequent rewards are observed, and the estimated mapping f is adapted so as to prefer those outputs that are the most successful.

Finally, in unsupervised learning there is no feedback; indeed, there is no teaching signal at all other than the input stimulus. Unsupervised training rules are generally devised, not with the primary goal of estimating a target function, but with the aim of developing useful or interesting representations of the input domain (for input to other processes). For example, a system might learn to code the input stimuli in a more compact form that retains a maximal amount of the information in the original signals.

Learning architectures

A functional description of an arbitrary learning system was given above as a mapping from an input domain X to an output domain Y. In order to simplify the account this chapter focuses on mappings for which the input $x \in X$ is multi-valued and the output $y \in Y$ is a scalar. All the learning methods described will, however, generalise in a straightforward way to problems requiring a multi-valued output.

In order to specify an appropriate system, $y = f(w, x)$, for a particular task three principal issues need to be considered. Following Poggio and Girosi [126] these will be referred to as the representation, learning, and implementation problems.

• The representation problem concerns the choice of a suitable form of f (that is, how y depends on w and x) such that a good mapping for the task can be learned. Choosing any particular form of f can enormously constrain the range of mappings that can be approximated (to any degree of accuracy) regardless of how carefully the parameters w are selected.

• The learning problem is concerned with selecting appropriate rules for finding good parameter values with a given choice of f.

• Finally, the implementation problem concerns the choice of an efficient device in which to realise the abstract idea of the learning system (for instance, appropriate hardware).

This chapter is primarily concerned with the learning problem for the class of systems that are based on the linear mapping

$y = f(w, x) = w^T \phi(x)$ .    (2.1.1)

That is, y is chosen to be the product of a parameter vector $w = (w_1, w_2, \ldots, w_P)$ and a recoding vector $\phi(x) \in \Phi$ whose elements $\phi_1(x), \phi_2(x), \ldots, \phi_P(x)$ are basis functions of the original stimulus pattern. In other words, we assume the existence of a recoding function $\phi$ that maps each input pattern to a vector representation in a new domain $\Phi$. For any desired output mapping over X, an appropriate recoding can be defined that will allow a good approximation to be acquired. Of course, for any specific choice of $\phi$ only the limited class of linear mappings over $\Phi$ can be estimated. The representation problem is not solved therefore, rather it is transmuted into the problem of selecting, or learning, a suitable coding for a given task. What equation 2.1.1 does allow, however, is for a clear separation to be made between the choice of $\phi$ and the choice of suitable learning rules for a linear system. As this chapter concentrates on the latter it therefore assumes the existence of an adequate, fixed coding of the input. The recoding problem, which is clearly of critical importance, will be considered later as the main topic of Chapters Four and Five.
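To make the form of equation 2.1.1 concrete, the following short sketch (in Python; the Gaussian recoding, the fixed grid of centres, and all numerical values are illustrative assumptions rather than details of the architecture developed later in the thesis) computes a linear output over a small layer of basis function units:

import numpy as np

def gaussian_basis(x, centres, width):
    # phi(x): one Gaussian response per recoding unit, each with a fixed receptive field.
    sq_dist = np.sum((centres - x) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * width ** 2))

def linear_output(w, x, centres, width):
    # y = f(w, x) = w^T phi(x), as in equation 2.1.1.
    return np.dot(w, gaussian_basis(x, centres, width))

# Example: a 2-D input recoded by nine units with centres on a regular grid.
centres = np.array([[i, j] for i in (0.0, 0.5, 1.0) for j in (0.0, 0.5, 1.0)])
w = np.ones(len(centres)) * 0.1      # the adaptable parameters (arbitrary initial values)
print(linear_output(w, np.array([0.3, 0.7]), centres, width=0.25))

Only the weight vector w is adapted by the learning rules discussed in this chapter; adapting the centres and widths of the recoding units themselves is the recoding problem taken up in Chapters Four and Five.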
The process of learning involves a series of incremental changes to the parameters of the system. Thus, for instance, the jth update step for the parameter vector w could be written either as $w_{j+1} = w_j + \Delta w_j$ or as $w_k(j+1) = w_k(j) + \Delta w_k(j)$, giving, respectively, the new value of the vector, or of the kth individual parameter. Since many rules of this type will be considered in this thesis, a more concise notation will be used whereby a rule is expressed in terms of the increment alone, that is, by defining either $\Delta w$ or $\Delta w_k$.

Error minimisation and gradient descent learning

As suggested above, supervised training occurs by providing the learning system with a set of example input/output pairs $(x, y^*)$ of the target function F. This allows the task of the learning system to be defined as a problem of error minimisation. The total error for a given value of w can be written as

$E(w) = \sum_i \| y_i^* - y_i \|$    (2.1.2)

where i indexes over the full training set and $\|\cdot\|$ is a distance measure. An optimal set of parameters $w^*$ is one for which this total error is minimised, i.e. where $E(w^*) \le E(w)$ for every choice of the parameter vector w.

A gradient descent learning algorithm is an incremental method for improving the function approximation provided by the parameter vector w. The error function E(w) can be thought of as defining an error surface over the domain W. (For instance, if the parameter space is two-dimensional then E(w) can be visualised as the height of a surface, above the 2-D plane, at each possible position $(w_1, w_2)$.) Starting from a given position $E(w^{(0)})$ on the error surface, gradient descent involves moving the parameter vector a small distance in the direction of the steepest downward gradient and calculating a new estimate $E(w^{(1)})$ of the total error. This process is then repeated over multiple iterations. On the jth pass through the training set the error gradient $e^{(j)}$ is given by the partial derivative of the total error with respect to the weights, i.e. by

$e^{(j)} = -\dfrac{\partial E(w^{(j)})}{\partial w^{(j)}}$ .    (2.1.3)

This gives the iterative procedure for updating the parameter estimate

$\Delta w^{(j)} = \alpha\, e^{(j)}$    (2.1.4)

where $\alpha$ is the step size or learning rate $(0 < \alpha \ll 1)$. When a point is reached where the error gradient is greater than or equal to zero in all directions the parameters will have converged to a stable solution.

Local minima

The parameter estimate found by a gradient descent learning rule will generally be a locally minimal and not a globally minimal solution. Without prior knowledge of the nature of the error surface a global minimum can only be guaranteed by exhaustive search of the parameter space. For most interesting problems this is ruled out on the grounds of computational expense. Other methods for finding improved solutions involve adding noise to the learning system, for instance by perturbing the learning rule. It has been pointed out (Vogl et al. [176]) that for most practical problems a solution is required that is not necessarily a true minimum provided it reduces the error to an extent that satisfies some prescribed criteria. For instance, a solution that is not minimal but is robust to small changes in the input data might be preferable to an exact minimum that is sensitive to change. Methods for overcoming the problem of local minima are considered in this thesis as exploration strategies since effective exploration of the parameter space promotes the likelihood of finding better solutions.
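Read as a procedure, equations 2.1.3 and 2.1.4 amount to the following loop. This is a minimal illustrative sketch only; the quadratic error surface and step size are arbitrary choices for the example, not values used elsewhere in the thesis.

import numpy as np

def gradient_descent(error_gradient, w, alpha=0.05, iterations=200):
    # Repeated application of equations 2.1.3 and 2.1.4:
    # e = -dE/dw (steepest downhill direction), then dw = alpha * e.
    w = np.array(w, dtype=float)
    for _ in range(iterations):
        e = -error_gradient(w)
        w = w + alpha * e
    return w

# Example: E(w) = ||w - target||^2, whose gradient is 2(w - target).
target = np.array([1.0, -2.0])
print(gradient_descent(lambda w: 2.0 * (w - target), w=[0.0, 0.0]))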
A gradient descent learning problem can always be restated as a gradient ascent, or hill-climbing, task simply by reversing the sign of the error (that is, the problem is defined as one of climbing the gradient of $-E(w)$). Desirable solutions will then be local or global maxima rather than minima. Given that there are these two, effectively interchangeable, ways of describing gradient learning, each algorithm considered here will be described using the terminology that seems most appropriate for its context.

Least mean squares learning

In choosing a learning rule a common choice for the distance metric is the squared distance between the target and estimated outputs. From equation 2.1.2 the total mean squared error in a training set is given by

$E(w) = \frac{1}{2} \sum_i (y_i^* - y_i)^2$ .    (2.1.5)

The error gradient for the input pattern $x_i$ and $y_i = w^T \phi(x_i)$ is then given by

$-\dfrac{\partial E_i}{\partial w} = (y_i^* - y_i)\, \phi(x_i)$    (2.1.6)

giving the exact least mean squares learning rule

$\Delta w = \alpha \sum_i (y_i^* - y_i)\, \phi(x_i)$ .    (2.1.7)

The Off-line/On-line distinction

The gradient descent rule described above operates by calculating the adjustment needed for the entire batch of input patterns before any change in the parameters is made. However, learning systems are often required to operate locally in time. This means that any changes in the system relating to a given input must be made at the time that pattern is presented and not recorded for updating later. To derive an update rule that is local in time the assumption is made that the error gradient for the current pattern is an unbiased estimate of the true gradient.

A notational convention is adopted here that variables pertaining to the input pattern at a given point in time are labelled by the time index. For instance, the vector encoding of the input at time t is given by $\phi(t)$ and the desired and actual outputs for that input are given by $y^*(t)$ and $y(t)$. If the input patterns are presented at discrete time intervals t = 0, 1, 2, … then the approximate LMS or 'Widrow-Hoff' rule is given by

$\Delta w(t) = \alpha\, (y^*(t) - y(t))\, \phi(t)$ .    (2.1.8)

If the learning rate $\alpha$ is small enough then the Widrow-Hoff rule is known to converge⁵ to a local minimum [183].

⁵ The parameters will actually continue to vary around the minimum unless the learning rate is reduced to zero according to an appropriate annealing schedule.
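The on-line rule of equation 2.1.8 is equally direct to state as code. The sketch below is again only illustrative; the target mapping, noise level, and learning rate are assumptions made for the example.

import numpy as np

def widrow_hoff_step(w, phi_t, y_target, alpha=0.1):
    # One on-line LMS update: dw(t) = alpha * (y*(t) - y(t)) * phi(t), equation 2.1.8.
    y = np.dot(w, phi_t)
    return w + alpha * (y_target - y) * phi_t

# Example: learn the mapping y* = 2*phi_1 - phi_2 from noisy on-line samples.
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(1000):
    phi_t = rng.uniform(-1.0, 1.0, size=2)
    y_star = 2.0 * phi_t[0] - phi_t[1] + rng.normal(scale=0.01)
    w = widrow_hoff_step(w, phi_t, y_star)
print(w)    # settles close to [2, -1]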
Neural Networks

Artificial neural networks provide an apposite way of understanding and implementing learning systems. Widrow and Hoff [182] were pioneers of the network approach, inventing the 'Adaline' neurone as an implementation of the gradient-descent rule. Rosenblatt [139] developed a single-layer network classifier called the 'Perceptron'. The perceptron architecture and the gradient descent rule were later generalised to multi-layer networks by Rumelhart, McClelland and Williams [140] who developed the back-propagation learning algorithm for the 'multi-layer perceptron' (MLP). The approach of treating networks as function approximators has been taken by many researchers including Moody and Darken [105] and Poggio and Girosi [127]. Hardware neural networks in which computations are performed in parallel by a large number of simple processors constitute an effective solution to the implementation problem in choosing a learning system for a specific task.

Viewed as a neural network the inputs and outputs of the learning system map onto the activations of units in the input and output layers of the network. The recoding vector $\phi(t)$ represents activity in a layer of hidden units and the parameter vector w represents the weights on the connections from this hidden layer to the output units. This architecture is shown in Figure 2.2 for a network with four input units, four hidden units and one output unit.

Figure 2.2: Implementation of a learning system as a neural network.

2.2 Reinforcement Learning Systems

The basic architecture of an associative reinforcement learning system is illustrated in Figure 2.3. For any input pattern an action preference is recorded in memory. However, when the system is learning, the chosen action may differ from the stored action according to search behaviour determined by the exploration component. The stimulus-action associations stored in memory are modified by the learning mechanism according to feedback in the form of scalar reinforcement signals. The basic learning principle involves strengthening a stimulus-action association if performing that action is followed by positive reinforcement and weakening the association if the action is followed by negative reinforcement. This principle is known to animal learning theorists as the law of effect [168].

Figure 2.3: A reinforcement learning system.

The term reinforcement originates from the study of animal learning and of instrumental conditioning in particular. However, there are a number of different ways in which this and related terms have been used in the psychological and AI literature. For clarity, therefore, a brief definition of the terminology (as it is used here) is appropriate.

Primary reinforcement or reward is feedback, provided by the environment, which the learning system recognises as a measure of performance. Reinforcement can be positive (pleasing) or negative (punishing). A stimulus that provides primary reinforcement is called a primary reinforcer. Stimuli that do not provide such reinforcement are called context or, simply, input.

An immediate reinforcement task is one in which every context is followed by a reward. In this chapter the assumption is made that the inputs for such tasks are sampled from stationary distributions (i.e. ones that do not vary with time) and that they are independent of the actions of the learning system.

A delayed reinforcement task is one in which rewards occur only after sequences of context/action pairs. The sequence of contexts generally depends on past contexts and on the past actions performed by the system. Delayed reinforcement signals serve as a measure of the overall success of multiple past actions.

In a delayed reinforcement task a context can become a secondary reinforcer if the system has learnt to associate with it an internal heuristic or secondary reinforcement signal. One way in which this could happen is if a specific context regularly occurs just before a second stimulus that is a primary reinforcer. The system might therefore learn that the first stimulus is a good predictor of the second.
A backward chaining of secondary reinforcement can occur over repeated learning experiences as the system discovers that certain stimuli can be used to predict other stimuli that are themselves secondary reinforcers.⁶

⁶ This is a phenomenon of classical conditioning known as second-order conditioning.

Credit Assignment

A central issue in learning is the problem of credit assignment (Minsky [103]). In reinforcement learning the feedback signal evaluates the system's action in an almost qualitative fashion. Such a signal assigns an amount of credit (or blame) to the action taken but does not indicate how the action can be improved or which alternate actions might be better. For multi-valued outputs the feedback also fails to indicate how credit can be distributed between the various elements of the action vector. These issues concern problems of structural credit assignment.

In a delayed reinforcement task the learning problem is further compounded. The delayed reward signal contains no information about the role played by each action in the preceding sequence in facilitating or hindering that outcome. This question of how to share credit and blame between a number of past decisions is termed the problem of temporal credit assignment.

Maximising the expected reinforcement

In reinforcement learning it is not possible to construct a true error function to minimise since there are no target outputs and the reinforcement signals do not diminish as the system's performance improves. However, there is a natural way of viewing reinforcement learning problems that does lead to a gradient learning rule. This is to treat the task as a problem in maximising the expected reinforcement signal. (The expected value of the reward is used since both the inputs and outputs of the system and the reinforcement signal provided by the environment may all depend on random influences.) Williams [184] has demonstrated that there is a class of reinforcement learning systems that perform stochastic gradient ascent in this performance measure. In this section Williams' analysis is considered for the problem of immediate reinforcement learning.

The expected reinforcement defines a surface over the parameter set w on which the learning system can do hill-climbing. Unlike supervised learning, however, there are no target values with which to compare current outputs, so the system cannot directly determine the local slope of this surface. Instead the system must try to find the gradient by varying its actions away from those specified by its current parameters. This exploration behaviour samples the reward surface at different nearby positions allowing estimates of the local slope to be obtained.

In an immediate reinforcement task the learning system observes the context x, selects and performs an action y, then receives a reward signal r. The expected value of this reward $\mathbb{E}(r)$ depends on the actions performed by the system and may vary for different contexts. The action y depends both on the system's action preferences, as encoded by the parameter vector w, and on its exploration strategy. Exploration is a stochastic process and y can therefore be regarded as a random variable that takes some value $\xi$ with a probability given by a function

$g(\xi, w, x) = \Pr[\, y = \xi \mid w, x \,]$ .    (2.2.1)

If the system is to follow a gradient ascent learning procedure then the system must be able to detect variation in $\mathbb{E}(r)$ for small changes in this action probability function.
This means that g must be a continuous function of the parameters for which the gradient with respect to the parameter vector w can be determined. This gradient is henceforth termed the eligibility (of the parameters) and is written e_w. Williams shows that any system that performs exploration in this way can improve its performance by gradient ascent in IE(r) if it uses an update rule of the form

\Delta w = \alpha \, [r - b] \, e_w      (2.2.2)

where α is a learning rate (0 < α ≪ 1), and b is the reinforcement baseline.

(Footnote 7: The learning rate is generally constant though it may vary with i and/or t.)

The following two sections discuss in detail the concept of eligibility for different action selection mechanisms, and the concept of a reinforcement baseline.

Action selection and eligibility

The probability function g(ξ, w, x) can be viewed as the composition of a deterministic function, corresponding to the trainable memory component of the learning system, and a random number generator, corresponding to the exploration component. The deterministic component is generally a continuous, non-linear function ψ(s) of the linear sum s = w · φ(x). The term semilinear is often used to denote a composite non-linear function of this type. For instance, if the logistic function

\psi(s) = \frac{1}{1 + e^{-s}}      (2.2.3)

is used then the deterministic component is equivalent to the semilinear squashing function used in the multilayer perceptron [140]. As the action selection mechanism in reinforcement learning combines a random number generator with a semilinear function, Williams calls the overall construction a stochastic semilinear unit. Figure 2.4 illustrates this mechanism.

[Figure 2.4: A stochastic semilinear unit for action selection. The deterministic (semilinear) component of the unit consists of a linear function s and a non-linear function ψ applied to the context; the stochastic component is a random number generator with the probability function g, which outputs the action.]

Williams [184] shows that if the eligibility e_w is chosen as the derivative, with respect to the weights, of ln g then the term (r − b) e_w gives an unbiased estimate of the gradient in the expected reinforcement. Hitherto, gradient learning methods have been described solely for linear approximations. However, provided ln g is differentiable with respect to the linear sum s, the gradient with respect to the parameter vector can be found by the chain rule, hence

e_w = \frac{\partial \ln g}{\partial w} = \frac{\partial \ln g}{\partial \psi} \frac{\partial \psi}{\partial s} \frac{\partial s}{\partial w} = \frac{\partial \ln g}{\partial \psi} \frac{\partial \psi}{\partial s} \, \phi(x).      (2.2.4)

The following sections discuss specific examples of action selection mechanisms for tasks in which the required output domain is, first, a real number (Y = IR) and, second, a forced choice between two alternatives (Y = {0, 1}).

Selecting an action from a continuous distribution

For a real-valued output range the action y can be treated as a Gaussian random variable with the probability function

g(y, \sigma, \mu) = \frac{1}{(2\pi)^{1/2}\sigma} \exp\!\left( -\frac{(y - \mu)^2}{2\sigma^2} \right)      (2.2.5)

where the mean action is given by µ = w · φ(x) and the standard deviation σ is fixed. This action selection mechanism can be described as a Gaussian (linear) unit with eligibility proportional to

e_w = \frac{\partial \ln g}{\partial \mu} \, \phi(x) = \frac{(y - \mu)}{\sigma^2} \, \phi(x).      (2.2.6)

A multi-parameter probability function can be defined by treating the standard deviation as a second adaptable parameter.
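To make these updates concrete, the following Python sketch (my own illustration, not code from the thesis) implements a Gaussian linear unit with a fixed standard deviation, trained by the rule of (2.2.2) with the eligibility of (2.2.6). The reward function, hyperparameters and function name are illustrative assumptions; the baseline here is simply a running average of recent rewards.

```python
import numpy as np

def gaussian_unit_reinforce(reward_fn, n_inputs, sigma=0.5, alpha=0.05,
                            baseline_rate=0.1, n_trials=3000, seed=0):
    """Immediate-reinforcement learning with a Gaussian (linear) action unit.

    The mean action is mu = w . phi(x); exploration adds Gaussian noise with a
    fixed standard deviation sigma.  The eligibility is ((y - mu)/sigma^2) phi(x)
    (equation 2.2.6) and the weights follow dw = alpha (r - b) e_w
    (equation 2.2.2), with a running-average reinforcement baseline b.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(n_inputs)
    b = 0.0                                      # reinforcement baseline
    for _ in range(n_trials):
        phi = rng.uniform(-1.0, 1.0, n_inputs)   # recoded context phi(x)
        mu = w @ phi                             # mean action
        y = rng.normal(mu, sigma)                # stochastic action selection
        r = reward_fn(phi, y)                    # scalar reward from environment
        e_w = ((y - mu) / sigma**2) * phi        # eligibility (2.2.6)
        w += alpha * (r - b) * e_w               # update rule (2.2.2)
        b += baseline_rate * (r - b)             # slowly track the average reward
    return w

if __name__ == "__main__":
    # Illustrative task: reward is highest when the action matches 2 * phi[0],
    # so the weights should move towards approximately (2, 0, 0).
    reward = lambda phi, y: -(y - 2.0 * phi[0]) ** 2
    print(gaussian_unit_reinforce(reward, n_inputs=3))
```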
This more general form of Gaussian action selection unit, with an adaptable standard deviation, is discussed further in the next chapter and employed in the simulations described in chapter six.

Binary action selection

If the system has only two possible actions then y can be treated as a Bernoulli random variable that takes the value one with probability p and the value zero with probability 1 − p. The probability function g(y, p) is then given by

g(y, p) = \begin{cases} 1 - p & \text{if } y = 0 \\ p & \text{if } y = 1. \end{cases}      (2.2.7)

An action selection mechanism called a Bernoulli logistic unit combines this probability function with the logistic function (2.2.3). Substituting the derivatives of (2.2.3) and (2.2.7) into (2.2.4) gives the eligibility term

e_w = \frac{y - p}{p(1 - p)} \, p(1 - p) \, \phi(x) = (y - p) \, \phi(x)      (2.2.8)

which is the same gradient term favoured by Sutton [162] on empirical grounds. Reinforcement learning systems based on this mechanism are described in chapters four and five.

(Footnote 8: The required derivatives are
\frac{\partial \ln g}{\partial p} = \begin{cases} -\frac{1}{1-p} & \text{if } y = 0 \\ \frac{1}{p} & \text{if } y = 1 \end{cases} = \frac{y - p}{p(1 - p)}, \qquad
\frac{\partial \psi}{\partial s} = \frac{\partial}{\partial s}\left(\frac{1}{1 + e^{-s}}\right) = \frac{e^{-s}}{(1 + e^{-s})^2} = p(1 - p).)

Williams also considers an action selection mechanism used by Barto and co-workers [9, 11, 12, 162] which he calls a stochastic linear threshold unit

y = \begin{cases} 1 & \text{if } w \cdot \phi(x) + \eta > 0 \\ 0 & \text{otherwise} \end{cases}      (2.2.9)

where η is a random number drawn from a distribution with cumulative distribution function Φ. He shows that this is equivalent to a Bernoulli semilinear unit with the non-linear function ψ(s) = 1 − Φ(−s) and that therefore the proof of gradient ascent learning applies. A variant of this mechanism is employed in the learning system described in the next chapter.

Reinforcement Comparison

For the proof of gradient ascent learning to be valid the reinforcement baseline b must be independent of the action y. It may, however, be a function of the context and/or of past reward signals. Sutton [162] investigated several learning rules of the form (2.2.2) prior to the analysis given by Williams. In an empirical study over a range of tasks he determined that a good choice for the reinforcement baseline is a prediction of the expected reinforcement. Sutton reasoned that the difference between a prediction of reward, determined before an action is taken, and the actual reward, received following that action, should help the learning system to decide whether the selected action was a good or a bad choice. If the predicted reward for context x is given by V then a reinforcement comparison rule for updating w is given by

\Delta w \propto [r - V] \, e_w.      (2.2.10)

Sutton suggests that Δw can be viewed as the correlation of the variation in the expected reward (the reinforcement comparison) with the variation in the mean action (the eligibility).

One of the interesting facts revealed by Williams' analysis was that the choice of reinforcement baseline does not affect the gradient ascent property of the algorithm. However, Sutton's experiments clearly demonstrated that the baseline value does have a significant influence on learning, and in particular, that convergence is faster when the predicted reinforcement is used than when the baseline is zero. Dayan [41] has suggested that the comparison term has a second order effect on learning. He suggests selecting a baseline value that minimises the second order term of the Taylor expansion of IE(r), on the grounds that this should give smoother progress up the (first order) gradient.

(Footnote 9: Dayan found that the optimal value for b from this perspective is slightly different from the predicted reinforcement.)
The term he derived gave a small improvement in convergence rate on a number of binary action tasks compared with Sutton's comparison term. However, the calculation of the new term is more complicated and it is not appropriate for problems with delayed reinforcement.

Learning with a critic

For problems in which the maximum expected reinforcement varies between contexts (termed reinforcement-level asymmetry) Sutton found that the most effective learning was achieved by making the prediction term a function of the context input. However, this prediction function also has to be learned. Sutton therefore proposed a learning architecture composed of two separate learning sub-systems: a critic element that learns to predict future reinforcement, and an actor or performance element that learns appropriate actions. The sole purpose of the critic is to filter the rewards from the environment and generate better credit assignment information for the performance element. This actor/critic architecture is illustrated in Figure 2.5.

(Footnote 10: Sutton also uses the term 'Adaptive Heuristic Critic' to describe learning systems of this sort.)

[Figure 2.5: An actor/critic learning system. The critic element holds a prediction memory, trained by its own learning mechanism from the external reinforcement, and passes internal feedback to the actor/performance element, whose action memory, exploration component and learning mechanism select and improve the actions taken in each context.]

The critic can be viewed as another function approximator V(v, x): IR^N → IR with adaptable parameters v = (v_1, v_2, ..., v_M) for which the target function is the expected reward IE(r). In other words, over the set of input patterns, the goal of the critic function is to minimise the error

E(v) = \frac{1}{2} \sum_i \left( IE(r_i) - V_i \right)^2.      (2.2.11)

The gradient descent error is therefore

r - V      (2.2.12)

which is the same as the reinforcement comparison term in (2.2.10). If V is given by the linear sum v · φ(x) then the parameters can be updated by a Widrow-Hoff learning rule

\Delta v = \beta \, (r - V) \, \phi(x)      (2.2.13)

where β (0 < β ≪ 1) is the learning rate.

The actor/critic architecture can thus be seen as two concurrent and interactive learning processes. The actor performs gradient ascent in the expected reinforcement while the critic continuously adapts (by a gradient descent rule) to try and predict this reinforcement accurately. The critic generates a reinforcement comparison signal which it uses both to train itself and to provide the improved feedback signal for the performance element.

2.3 Temporal Difference learning

Sutton and Barto [11, 162] extended the concept of reinforcement comparison to provide a solution to the temporal credit assignment problem in delayed reinforcement learning. The resulting algorithms are called temporal difference (TD) learning methods. The following review of TD learning is split into three sections. The first describes TD methods for learning predictions. The simplifying assumption is made that the action preferences of the performance element are fixed; the aim is to show how, under such circumstances, the critic can learn to accurately predict future returns. The derivation of TD learning rules follows the analysis given by Watkins [177]. The second section is intended to give some useful insight into how TD learning works.
It CHAPTER TWO REINFORCEMENT LEARNING SYSTEMS 33 reviews the relationship of TD to the Widrow-Hoff learning rule and to dynamic programming methods and also summarises what is known about the convergence properties of TD algorithms. The final section describes the actor/critic method for learning with delayed rewards in which the performance and critic elements adapt concurrently. Although there have been several applications to problems with real-valued inputs, TD algorithms have generally been analysed in the context of tasks with discrete input spaces containing a finite number of possible contexts. The following analysis continues with the vector notation of the previous section so as to allow both discrete and continuous input spaces to be represented. For a discrete problem space we can assume that the set of contexts is encoded by a set of mutually orthogonal11 recoding vectors, an encoding of this type is assumed throughout the following section. Predicting delayed rewards Before describing TD learning procedures it is necessary to outline some of the terminology that applies to sequential decision tasks. It is also important to consider exactly what function of future rewards the learning system should seek to maximise. After dealing with these preliminaries the basic TD predictor is outlined and then generalised to give a family of TD learning methods. Sequential decision tasks In most delayed reinforcement tasks there is a dynamic interaction between the environment and the learning system. The future reinforcement and the future context are dependent both on past contexts and on the actions of the system. In sequential decision tasks of this nature the concept of state is useful for describing the condition of the environment at any given time. A state description contains sufficient information such that, when combined with knowledge of future actions, all aspects 11A simple way to achieve this is to make the number of elements in ! equal to the total number of contexts. The basis vector for the ith context is then given by a vector with a non-zero element only at the ith position. CHAPTER TWO REINFORCEMENT LEARNING SYSTEMS 34 of future environmental states can be determined12. In other words, given a state description, an appropriate action can be selected without knowing anything of the history of past states. This is the ‘path independence’ property of the state description and is also known as the Markov property. The set of all possible states for a task is called the state space. The transition function for a state space S is the set of probabilities pij (i, j !S) giving the likelihood that any one state will follow another. A state space together with a transition function defines a Markov decision process. Figure 2.6 illustrates an example of a Markov process of six discrete states with two possible transitions in each state. Figure 2.6: A Markov decision process. In general, for any given state, different transition probabilities will be associated with the different actions available in that state. A function that maps states to preferred actions (or action probabilities) is called a policy and denoted by the symbol ! . A second function that maps states to the expected future rewards for a given ! policy is denoted V and called the evaluation function. A common assumption in delayed reinforcement learning is that the task constitutes a Markov decision process in which the set of possible contexts corresponds to the state space for the task. 
The behaviour acquired by the actor system therefore constitutes the policy function, and the predictions of the critic system the evaluation function.

(Footnote 12: If the environment is stochastic, then the state description is such that the probabilities of all future aspects of the environment can be determined.)

Time horizons and discounting

In tasks with delayed rewards, actions must be selected to maximise the reinforcement received at future points in time, where the intervening delay can be of arbitrary length. The extent to which any sequence of actions is optimal can therefore only be determined with respect to some prior assumptions about the relative values of immediate and long-term rewards. Such assumptions are made concrete by the concept of a time horizon. For example, a system that has an infinite time horizon values all rewards equally no matter how far they lie ahead. A system with a finite time horizon, on the other hand, values rewards only up to some fixed delay limit. To allow lower values to be attached to distant rewards without introducing a fixed cutoff it is usual to use an infinite discounted time horizon, where the value of each future reward is given by an exponentially decreasing function of the length of the delay period. In TD learning the slope of the discounted time horizon is specified by a constant denoted by γ (0 < γ < 1) and known as the discount factor. Assuming a sequence t = 1, 2, 3, ... of equally spaced discrete time-steps, the total discounted return R(t) at a given time t is then given by the sum of discounted future rewards

R(t) = r(t+1) + \gamma r(t+2) + \gamma^2 r(t+3) + \gamma^3 r(t+4) + \ldots

which can be written as

R(t) = \sum_{k=1}^{\infty} \gamma^{k-1} r(t+k).      (2.3.1)

The goal of the critic in delayed reward tasks is to learn to anticipate the expected value of this return. In other words, the prediction V(t) should be an estimate of

IE(R) = IE\left[ \sum_{k=1}^{\infty} \gamma^{k-1} r(t+k) \right].      (2.3.2)

(Footnote 13: In some tasks, particularly those with deterministic transition and reward functions, an infinite time horizon clearly is desirable. However, a discounted return measure is still needed for most reinforcement learning algorithms as this gives the sum of future returns a finite value. A discussion of this issue and a proposal for a reinforcement learning method that allows an infinite time horizon is given in Schwartz [148].)

The TD(0) learning rule

One way of estimating the expected return would be to average values of the truncated return after n steps

\sum_{k=1}^{n} \gamma^{k-1} r(t+k).      (2.3.3)

In other words, the system could wait n steps, calculate this sum of discounted rewards, and then use it as a target value for computing a gradient descent error term. However, for any value of n, there will be an error in this estimate equal to the prospective rewards (for as yet unexperienced time-steps)

\sum_{k=n+1}^{\infty} \gamma^{k-1} r(t+k).

The central idea of TD learning is to notice that this error can be reduced by making use of the predicted return V(t+n) associated with the context input at time t+n. Combining the truncated return (2.3.3) with this prediction gives an estimate called the corrected n-step return R_n(t)

R_n(t) = \sum_{k=1}^{n} \gamma^{k-1} r(t+k) + \gamma^n V(t+n).      (2.3.4)

Of course, at the start of training the predictions generated by the critic will be poor estimates of the unseen rewards.
However, Watkins [177] has shown that the expected value of the corrected return will, on average, be a better estimate of R(t) than the current prediction at all stages of learning. This is called the error reduction property of R_n(t). Because of this useful property, estimates of R_n(t) are suitable targets for training the predictor.

The estimator used in the temporal difference method which Sutton calls TD(0) is the one-step corrected return

R_1(t) = r(t+1) + \gamma V(t+1)      (2.3.5)

which leads to a gradient descent error term called the TD error

e_{TD}(t+1) = [r(t+1) + \gamma V(t+1)] - V(t).      (2.3.6)

Substituting this error into the critic learning rule (2.2.13) gives the update equation

\Delta v = \beta \, e_{TD}(t+1) \, \phi(t).      (2.3.7)

The learning process occurs as follows. The system observes the current context, as encoded by the vector φ(t), and calculates the prediction V(t). It then performs the action associated with this context. The environment changes as a result of the action, generating a new context φ(t+1) and a reinforcement signal r(t+1) (which may be zero). The system then calculates the new prediction V(t+1), establishes the error in the first prediction, and updates the parameter vector v appropriately.

To prevent changes in the weights made after each step from biasing the TD error it is preferable to calculate the predictions V(t) and V(t+1) using the same version of the parameter vector. To make this point clear, the notation V(a|b) is introduced to indicate the prediction computed with the parameter vector at time t = a for the context at time t = b. The TD error term is then written as

e_{TD}(t+1) = r(t+1) + \gamma V(t|t+1) - V(t|t).      (2.3.8)

The TD(0) learning rule carries the expectation of reward one interval back in time, thus allowing for the backward chaining of secondary reinforcement. For example, consider a task in which the learning system experiences, over repeated trials, the same sequence of context inputs (with no rewards attached) followed by a fixed reward signal. On the first trial the system will learn that the final pattern predicts the primary reinforcement. On the second trial it will learn that the penultimate pattern predicts the secondary reinforcement associated with the final pattern. In general, on the kth trial, the context that is seen k steps before the reward will start to predict the primary reinforcement.

The family of TD(λ) learning methods

The TD(0) learning rule will eventually carry the expectation of a reward signal back along a chain of stimuli of arbitrary length. The question that arises, however, is whether it is possible to propagate the expectation at a faster rate. Sutton suggested that this can be achieved by using the TD error to update the predictions associated with a sequence of past contexts, where the update size for each context is weighted according to recency. A learning rule [163] that incorporates this heuristic is

\Delta v(t) = \beta \, e_{TD}(t+1) \sum_{k=0}^{t-1} \lambda^{k} \phi(t-k)      (2.3.9)

where λ (0 ≤ λ ≤ 1) is a decay parameter that causes an exponential fall-off in the update size as the time interval between context and reward lengthens. One of the advantages of this rule is that the sum on the right-hand side can be computed recursively using an activity trace vector \bar{\phi}(t) given by

\bar{\phi}(t) = \phi(t) + \lambda \bar{\phi}(t-1)      (2.3.10)

where \bar{\phi}(0) = 0 (the zero vector). This gives the TD(λ) update rule

\Delta v(t) = \beta \, e_{TD}(t+1) \, \bar{\phi}(t).      (2.3.11)
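The following Python sketch (my own illustration; the tabular representation, data format and variable names are assumptions) shows how (2.3.6), (2.3.10) and (2.3.11) fit together for the simple case of orthogonal, one-hot context vectors, where the parameter vector v reduces to one prediction per discrete state.

```python
import numpy as np

def td_lambda_prediction(episodes, n_states, gamma=0.9, lam=0.5, beta=0.1):
    """Learn state-value predictions with the TD(lambda) rule (2.3.11).

    `episodes` is a list of trajectories, each a list of (state, reward) pairs,
    where `reward` is the reinforcement received on entering that state.
    States are indices 0..n_states-1 (i.e. one-hot, orthogonal context vectors),
    so v holds one prediction per state.
    """
    v = np.zeros(n_states)
    for episode in episodes:
        trace = np.zeros(n_states)        # activity trace, phi_bar in (2.3.10)
        state, _ = episode[0]
        for next_state, reward in episode[1:]:
            trace *= lam                  # exponential decay of past contexts
            trace[state] += 1.0           # add the current context phi(t)
            td_error = reward + gamma * v[next_state] - v[state]   # (2.3.6)
            v += beta * td_error * trace  # TD(lambda) update (2.3.11)
            state = next_state
    return v
```

Note that, following (2.3.10), the trace here decays by λ alone; the alternative derivation that follows leads to a decay rate of γλ instead.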
Watkins shows an alternative method for deriving this learning rule. Instead of using just the R_1(t) estimate, a weighted sum of different n-step corrected returns can be used to estimate the expected return. This is appropriate because such a sum also has the error reduction property. The TD(λ) return R^λ(t) is defined to be such a sum in which the weight for each R_n(t) is proportional to λ^{n−1}, such that

R^{\lambda}(t) = (1 - \lambda)\left[ R_1(t) + \lambda R_2(t) + \lambda^2 R_3(t) + \ldots \right].

Watkins shows that this can be rewritten as the recursive expression

R^{\lambda}(t) = r(t+1) + \gamma (1 - \lambda) V(t+1) + \gamma\lambda R^{\lambda}(t+1).      (2.3.12)

Using this R^λ(t) estimator, the gradient descent error for the context at time t is given by R^λ(t) − V(t), for which a good approximation is the discounted sum of future TD errors

e_{TD}(t+1) + \gamma\lambda \, e_{TD}(t+2) + (\gamma\lambda)^2 \, e_{TD}(t+3) + \ldots

(Footnote 14: Provided the weight on each of the corrected returns is between 0 and 1 and the sum of weights is unity (see Watkins [177]).)

(Footnote 15: If learning occurs off-line (i.e. after all context and reinforcement inputs have been seen) then R^λ(t) − V(t) can be given exactly as a discounted sum of TD errors. Otherwise, changes in the parameter vector over successive time-steps will bias the approximation by an amount equal to \sum_{k=1}^{\infty} (\gamma\lambda)^k \left[ V(t+k-1|t+k) - V(t+k|t+k) \right], i.e. the discounted sum of the differences in the prediction of each state visited for successive parameter vectors. This sum will be small if the learning rate is not too large.)

From this a rule for updating all past contexts can be shown to be

\Delta v(t) = \beta \, e_{TD}(t+1) \sum_{k=0}^{t-1} (\gamma\lambda)^{k} \phi(t-k)

which is identical to Sutton's update rule (2.3.9) but for the substitution of the decay rate γλ for λ.

Understanding TD learning

This section reviews some of the findings concerning the behaviour of TD methods and their relationship to other ways of learning to predict. The aim is to establish a clearer understanding of what these algorithms actually do.

TD(λ) and the Widrow-Hoff rule

Consider the definition of the TD(λ) return (2.3.12). If λ is made equal to one it can easily be seen that R^1(t) is the same as the actual return (2.3.1). Sutton [163] shows that the TD(1) learning rule is in fact equivalent to a Widrow-Hoff rule of the form

\Delta v(t) = \beta \left( R(t) - V(t) \right) \phi(t).

(Footnote 16: This is strictly true only if TD(1) learning occurs off-line, i.e. if the parameters are updated at the end of each trial rather than after each step.)

An important question is therefore whether TD methods are anything other than just an elegant, incremental method of implementing this supervised learning procedure. To address this issue Sutton carried out an experiment using a simple delayed reinforcement task called a bounded random walk. In this task there are only seven states A-B-C-D-E-F-G, two of which, A and G, are boundary states. In the non-boundary states there is a 50% chance of moving to the right or to the left along the chain. All states are encoded by orthogonal vectors. A sequence starts in a (random) non-boundary state and terminates when either A or G is reached. A reward of 1 is given at G and zero at A. The ideal prediction for any state is therefore just the probability of terminating at G.

[Figure 2.7: Sutton's bounded random walk task. The numbers indicate the rewards available in the terminal states (A = 0, G = 1) and the ideal predictions in the non-terminal states (B to F: 1/6, 1/3, 1/2, 2/3, 5/6).]
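For concreteness, a generator for such training data might look like the following (again my own sketch; the starting-state convention and data format are assumptions chosen to match the description above). It can be paired directly with the TD(λ) predictor sketched earlier.

```python
import numpy as np

def random_walk_episode(rng, n_states=7):
    """One bounded random walk over states 0..6 (A..G).

    The walk starts in a random non-boundary state and moves left or right with
    equal probability until a boundary is reached.  Each step is returned as a
    (state, reward) pair, with reward 1 on reaching G and 0 everywhere else.
    """
    state = int(rng.integers(1, n_states - 1))    # random non-boundary start
    episode = [(state, 0.0)]
    while 0 < state < n_states - 1:
        state += int(rng.choice((-1, 1)))         # 50% left, 50% right
        reward = 1.0 if state == n_states - 1 else 0.0
        episode.append((state, reward))
    return episode

rng = np.random.default_rng(0)
training_sets = [[random_walk_episode(rng) for _ in range(10)] for _ in range(100)]
```

Training a predictor repeatedly on each small set and comparing the resulting values with the ideal predictions (1/6 to 5/6) sets up the comparison described next.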
Sutton generated a hundred sets of ten randomly generated sequences. TD(") learning procedures for seven values of " including 0 and 1 (the Widrow-Hoff rule) were then trained repeatedly17 on each set until the parameters converged to a stable solution. Sutton then measured the total mean squared error, between the predictions for each state generated by the learned parameters, and the ideal predictions. Significant variation in the size of this error was found. The total error in the predictions was lowest for "=0, largest for "=1, and increased monotonically between the two values. To understand this result it is important to note that each training run used only a small set of data. The supervised training rule minimises the mean squared error over the training set, but as Sutton points out, this is not necessarily the best way to minimise the error for future experience. In fact, Sutton was able to show that the predictions learned using TD(0) are optimal in a different sense in that they maximise the likelihood of correctly estimating the expected reward. He interpreted this finding in the following way: “...our real goal is to match the expected value of the subsequent outcome, not the actual outcome occurring in the training set. TD methods can perform better than supervised learning methods [on delayed reward problems] because the actual outcome of a sequence is often not the best estimate of its expected value.” ([163] p.33) Sutton performed a second experiment with the bounded random walk task looking at the speed at which good predictions were learned. If each training set was presented just once to each learning method then the best choice for ", in terms of reducing the error most rapidly, was an intermediate value of around 0.3. Watkins also considers 17The best value of the learning rate # was found for each value of " in order to make a fair comparison between the different methods. The weights were updated off-line. CHAPTER TWO REINFORCEMENT LEARNING SYSTEMS 41 this question and points out that the choice of " is a trade-off between using biased estimates ("=0) and estimates with high variance ("=1). He suggests that if the current predictions are nearly optimal then the variance of the estimated returns will be lowest for "=0 and that therefore the predictor should be trained using TD(0). However, if the predictions are currently poor approximations then the corrections added to the immediate reinforcement signals will be very inaccurate and introduce considerable bias. The best approach overall might therefore be to start with "=1, giving unbiased estimates but with a high variance, then reduce " towards zero as the predictions become more accurate. TD and dynamic programming A second way to understand TD learning is in relation to the Dynamic Programming methods for determining optimal control actions. ‘Heuristic’ methods of dynamic programming were first proposed by Werbos [180]. However, Watkins [177] has investigated the connection most thoroughly showing that TD methods can be understood as incremental approximations to dynamic programming procedures. This approach to studying actor/critic learning systems has also been taken by Williams [185]. Dynamic Programming (the term was first introduced by Bellman [15]) is a search method for finding a suitable policy for a Markov decision process. A policy is optimal if the action chosen in every state maximises the expected return as defined above (2.3.2). 
To compute this optimal control requires accurate models of both the transition function and the reward function (which gives the value of the expected reward that will be received in any state). Given these prerequisites dynamic programming proceeds through an iterative, exhaustive search to calculate the maximum expected return, or optimal evaluation, for each state. Once this optimal evaluation function is known an optimal policy is easily found by selecting in each state the action that leads to the highest expected return in the next state. A significant disadvantage of dynamic programming is that it requires accurate models of the transition and reward functions. Watkins [177] has shown that TD algorithms can be considered as incremental forms of dynamic programming that require no advanced or explicit knowledge of state transitions or of the distribution of CHAPTER TWO REINFORCEMENT LEARNING SYSTEMS 42 available rewards18. Instead, the learning system uses its ongoing experience as a substitute for accurate models of these functions. His analysis of dynamic programming led Watkins to propose a learning method called ‘Q learning’ that arises directly from viewing the TD procedure as incremental dynamic programming. In Q learning a prediction is associated with each of the different actions available in a given state. While exploring the state space the system improves the prediction for each state/action pair using a gradient learning rule. This learning method does away with the need for explicitly learning the policy, the preferred action in any state is simply the one with the highest associated value, therefore as the system improves its predictions it also adapts its policy. If each action in each state is attempted a sufficient number of times then Q learning will eventually converge to an optimal set of evaluations. A family of Q(") learning algorithms that use activity traces similar to those given for TD(") can also be defined. Convergence properties of TD methods There are now several results showing that TD(") and Q(") learning algorithms will converge[10, 41, 65, 163, 177], many of them based on an underlying identity with stochastic dynamic programming. The latest proofs demonstrate convergence to the ideal values with probability of one in both batch and on-line training of both types of algorithm. These proofs generally assume tasks that, like the bounded random walk, are guaranteed to terminate, have one-to-one mappings from discrete states to contexts, fixed transition probabilities, and encode different contexts using orthogonal vectors. Actor/Critic architectures for delayed rewards The actor/critic learning methods developed by Barto and Sutton and described in section 2.2 can be also be applied to learning in tasks with delayed rewards [11, 162]. The separation of action learning from prediction learning has several useful 18Watkins uses the term ‘primitive’ learning to describe learning of this sort, likening it to what I have called dispositional learning in animals. CHAPTER TWO REINFORCEMENT LEARNING SYSTEMS 43 consequences although analysing the behaviour of the system is more difficult. One important difference is that problems can be addressed in which actions are realvalued (Q learning is restricted to tasks with discrete action spaces). A second advantage arises in learning problems with continuous input spaces. Here the optimal policy and evaluation functions may have quite different non-linearities with respect to the input. 
Separating the two functions into distinct learning systems can therefore allow appropriate recodings to be developed for each. The benefit of this is shown clearly in the simulations described in Chapter 5. The training rules for actor/critic learning in delayed reward tasks are based on the rules described above for immediate reinforcement problems (section 2.2). With delayed rewards the goal of the system is to maximise the expected return IE (R) . The critic element is therefore trained to predict E (R) using the TD(") procedure, while the performance element is trained to maximise IE (R) using a variant of the reinforcement comparison learning rule (equation 2.2.14). This gives the update !w for the parameters of the actor learning system !w(t) " [ r(t + 1) + #V(t + 1) $ V (t)] %w(t) . (2.3.13) Here !w is a sum of past eligibility vectors (section 2.2) weighted according to recency. This allows the actions associated with several past contexts to be updated at each time-step. This eligibility trace is given by the recursive rule !w(0) = 0 , !w(t) = !w(t) + " !w(t # 1) (2.3.14) where $ is the rate of trace decay. The eligibility trace and the activity trace (2.3.10) (used to update the critic parameters) encode images of past contexts and behaviours that persist after the original stimuli have disappeared. They can therefore be viewed as short term memory (STM) components of the learning system. The weight vectors encoding the prediction and action associations then constitute the long term memory (LTM) components of the system. There are no convergence proofs for actor/critic methods for delayed reward tasks because of the problem of analysing the interaction of two concurrent learning systems. However successful learning has been demonstrated empirically on several difficult tasks [3, 11, 167] which encourages the view that these training rules may share the desirable gradient learning properties of their simpler counterparts. CHAPTER TWO 2.4 REINFORCEMENT LEARNING SYSTEMS 44 Integrating reinforcement and model-based learning Planning, world knowledge and search The classical AI method of action selection is to form an explicit plan by searching an accurate internal representation of the environment appropriate to the current task (see, for example [32, 114]). However, any planning system is faced with a scaling problem. As the world becomes more complex, and as the system attempts to plan further ahead, the size of the search space expands at an alarming rate. In particular, problems arise as the system attempts to consider more of the available information about the world. With each additional variable another dimension is added to the input space which can cause an exponential rise in the time and memory costs of search. Bellman [16] aptly described this as the “curse” of dimensionality. Dynamic Programming is as subject to these problems as any other search method. Incremental approximations to dynamic programming such as the TD learning methods attempt to circumvent forward planning by making appropriate use of past experience. Actions are chosen that in similar situations on repeated past occasions have proven to be successful. A given context triggers an action that is in effect the beginning of a ‘compiled plan’, summarising the best result from the history of past experiences down that branch of the search tree. 
Thus TD methods are a solution (of sorts) to the problem of acting in real time—there is no on-line search, and when the future is a little different than expected, then there is often a ‘compiled plan’ available for that too. However, the problem of search does not go away. Instead, the size of the search space translates into the length of learning time required, and, when exploration is local (as in gradient learning methods), there is an increased likelihood of acquiring behaviours that are only locally optimal. Optimal Learning, forward and world models Real experience can be expensive to obtain—exploration can be a time-consuming, even dangerous, affair. Optimal learning, rather than learning of optimal behaviour19, 19Watkins [177] gives a fuller discussion of this distinction. CHAPTER TWO REINFORCEMENT LEARNING SYSTEMS 45 is concerned with gaining all possible knowledge from each experience that can be used to maximise all future rewards and minimise future loss. Reinforcement learning methods are not optimal in this sense. They extract information from the temporal sequence of events that enables them to learn mappings of the following types stimulus ! action (S! A) (actor) stimulus ! reward (S! R) (critic) stimulus ! action " reward (S!A " R) (Q learning) Associative learning of this type is clearly dispositional, it encodes task-specific information and retains no knowledge of the underlying causal process. However, associative mappings obtained through model-based learning can clearly help in determining optimal behaviour. These can take the form of forward (causal) models, i.e. stimulus ! action " stimulus (S!A " S) or world models encoding information about neighbouring or successive stimuli, i.e. stimulus ! stimulus (S ! S) Where the knowledge such mappings contain is independent of any reward contingencies they can be applied to any task defined over that environment. However, there can be substantial overhead in the extra computation and memory required to learn, store, and maintain models of the environment. Several methods of supplementing TD learning with different types of forward model have been proposed [41, 108, 164, 166] and are described further below. Forward models A mapping of the S!A " S type, is effectively a model of the transition function used in dynamic programming. Sutton’s [164] DYNA system uses such a model as an extension to Q learning, the agent uses its actual experience in the world both to learn evaluations for state/action pairs and to estimate the transition function. This allows on-line learning to be supplemented by learning during simulated experience. In other words, between taking actions in the real world and observing and learning CHAPTER TWO REINFORCEMENT LEARNING SYSTEMS 46 from their consequences, the agent performs actions in its ‘imagination’. Using its current estimates of the evaluation and transition functions, it can then observe the ‘imaginary’ outcomes of these actions and learn from them accordingly. Sutton calls this process ‘relaxation planning’—a large number of shallow searches, performed whenever the system has a ‘free moment’, will eventually approximate a full search of arbitrary depth. By carrying out this off-line search the system can propagate information about delayed rewards more rapidly. Its actual behaviour will therefore improve faster than by on-line learning alone. Moore [108] describes a related method for learning in tasks with continuous input spaces. 
To perform dynamic programming the continuous input space is partitioned or quantised into discrete regions and an optimal action and evaluation learned in each cell. The novel aspect of Moore’s approach is to suggest heuristic methods for determining a suitable quantisation of the space that attempt to side-step the dimensionality problem. He proposes varying the resolution of the quantisation during learning, specifically, having a fine-grained quantisation in those parts of the state-space that are visited during an actual or simulated sequence of behaviour and a coarse granularity in the remainder. As the trajectory through the space changes over repeated trials the quantisation is then altered in accordance with the most recent behaviour. World models Sutton and Pinette [166] and Dayan [41] both propose learning S ! S models. The essence of both approaches is to train a network to estimate, from the current context x(t) , the discounted sum of future contexts # " k =1 ! k x(t + n) . One reason for learning this particular function is that a recursive error measure, similar to the TD error, can be used to adapt the network parameters. Having acquired such a mapping the resulting associations will reflect the topology of the task, which may differ from the topology of the input space. When motivated to achieve a specific goal, such a mapping may aid the learning system to distinguish situations in which different actions are required, and to recognise circumstances where similar behaviours can be applied. CHAPTER TWO REINFORCEMENT LEARNING SYSTEMS 47 For instance, Dayan simulated a route-finding problem which involved moving within a 2-D bounded arena between start and goal positions. A barrier across part of the space obliged the agent to make detours when traversing between the separated regions. The agent was constrained to make small, local movements, hence the probability of co-occurrence of any two positions was generally a smooth function of the distance between them, except for the non-linearity introduced by the barrier. An S ! S mapping was trained by allowing the agent to wander unrewarded through the space. Neighbouring positions thus acquired high expectations except where they lay either side of the barrier where the expectation was low. The output of this mapping system was used as additional input to an actor/critic learning system that was reinforced for moving to a specific goal. The S ! S associations aided the system in discriminating between positions on either side of the barrier. This allowed appropriate detour behaviour to be acquired more rapidly than when the system was trained without the model. One significant problem with this specific method for learning world models is that it is not independent of the behaviour of system. If the S ! S model in the routefinding task is updated while the agent is learning paths to a specific goal then the model will become biased to anticipate states that lie towards the target location. When the goal is subsequently moved the model will then by much less effective as an aid to learning. An additional problem with this mapping is that it is one-to-many making it difficult to represent efficiently and giving it poor scaling properties. 2.5 Continuous input spaces and partial state knowledge Progress in the theoretical understanding of delayed reward learning algorithms has largely depended on the assumptions of discrete states, and of a Markovian decision process. 
If either of these assumptions is relaxed then the proofs of convergence, noted above, no longer hold. Furthermore, it seems likely that strong theoretical results for tasks in which these restrictions do not apply will be difficult to obtain. One reason for this pessimism is that recent progress has depended on demonstrating an underlying equivalence with stochastic dynamic programming for which the same rigid assumptions are required. An alternative attack on these issues is of course an empirical one—to investigate problems in which the assumptions are relaxed and then observe the consequences. CHAPTER TWO REINFORCEMENT LEARNING SYSTEMS 48 This is a common approach in connectionist learning where the success of empirical studies has often inspired further theoretical advances. To be able to apply delayed reinforcement learning in tasks with continuous state spaces would clearly be of great value. Many interesting tasks in robot control, for example, are functions defined over real-valued inputs. It seems reasonable, given the success of supervised learning in this domain, to expect that delayed reinforcement learning will generalise to such tasks. This issue is one of the main focuses of the investigations in this thesis. Relaxing the assumption of Markovian state would give, perhaps, even greater gain. The assertion that context input should have the Markov property constitutes an extremely strict demand on a learning system. It seems likely that in many realistic task environments the underlying Markovian state-space will be so vast that dynamic programming in either its full or its incremental forms will not be viable. It is clear, however, that for many tasks performed in complex, dynamic environments, learning can occur perfectly well in the absence of full state information. This is for the reason that the data required to predict the next state is almost always a super-set of that needed to distinguish between the states for which different actions are required. The latter discrimination is really all that an adaptive agent needs to make. Consider, for instance, a hypothetical environment in which the Markovian state information is encoded in a N bit vector. Let us assume that all N bits are required in order to predict the next state. It is clear that a binary-decision task could be defined for this environment in which the optimal output is based on the value at only a single bit position in the vector. An agent who observed the value of this bit and no other could then perform as well as an agent who observed the entire state description. Furthermore, this single-minded operator, who observes only the task-relevant elements of the state information, has a potentially huge advantage. This is that the size of its search-space (2 contexts) is reduced enormously from that of the full Markovian task (2N contexts). The crucial problem, clearly, is finding the right variables to look at!20 20This task has inspired research in reinforcement learning on perceptual aliasing—distinguishing states that having identical codings but require different actions (this issue is considered further in chapter four). CHAPTER TWO REINFORCEMENT LEARNING SYSTEMS 49 It is the above insight that has motivated the emphasis in Animat AI on reactive systems that detect and exploit only the minimal number of key environmental variables as opposed to attempting to construct a full world model. 
It is therefore to be strongly hoped that delayed reinforcement learning will generalise to tasks with only partial state information. Further, we might hope that it will degrade gracefully where this information is not fully sufficient to determine the correct input-output mapping. If these expectations are not met then these methods will not be free from the explosive consequences of dimensionality that make dynamic programming an interesting but largely inapplicable tool.

Conclusion

Reinforcement learning methods provide powerful mechanisms for learning in circumstances of truly minimal feedback from the environment. The research reviewed here shows that these learning systems can be viewed as climbing the gradient in the expected reinforcement to a locally maximal position. The use of a secondary system that predicts the expected future reward encourages successful learning because it gives better feedback about the direction of this uphill gradient.

Many issues remain to be resolved. Success in reinforcement learning is largely dependent on effective exploration behaviour; chapters three and six of this thesis are in part concerned with this issue. The learning systems described here have all been given as linear functions in an unspecified representation of the input. However, for continuous task spaces, finding an appropriate coding is clearly critical to the success of the learning endeavour. The question of how suitable representations can be chosen or learned will be taken up in chapters four and five.

Chapter 3
Exploration

Summary

Adaptive behaviour involves a trade-off between exploitation and exploration. At each decision point there is a choice between selecting actions for which the expected rewards are relatively well known and trying out other actions whose outcomes are less certain. More successful behaviours can only be found by attempting unknown actions, but the more likely short-term consequence of exploration is lower reward than would otherwise be achieved. This chapter discusses methods for determining effective exploration behaviour. It primarily concerns the indirect effect on exploration of the evaluation function. The analysis given here shows that if the initial evaluation is optimistic relative to the available rewards then an effective search of the state-space will arise that may prevent convergence on sub-optimal behaviours. The chapter concludes with a brief summary of direct methods for adapting exploration behaviour.

3.1 Exploration and Expectation

Chapter two introduced learning rules for immediate reinforcement tasks of the form

\Delta w_i(t) = \alpha \, [r(t+1) - b(t)] \, e_{w_i}(t).

When there are a finite number of actions, altering the value of the reinforcement baseline b has an interesting effect on the pattern of exploration behaviour. Consider the task shown in Figure 3.1, which can be viewed as a four-armed maze.

[Figure 3.1: A simple four-choice reinforcement learning problem. From the central position the agent chooses between the north, south, east and west arms, which yield rewards r_N, r_S, r_E and r_W respectively.]

Assume that for i ∈ {N, S, E, W} the preference for choosing each arm is given by a parameter w_i and the reward for selecting each arm by r_i; also let the action on any trial be to choose the arm for which

w_i + \eta_i      (3.1)

is highest, where η_i is a random number drawn from a Gaussian distribution with a fixed standard deviation. The eligibility e_{w_i}(t) is equal to one for the chosen arm and zero for all others.
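A minimal simulation of this four-choice task, written as a Python sketch (my own; the reward values, noise level and function name are illustrative assumptions), makes the effect of the baseline easy to observe.

```python
import numpy as np

def four_arm_trials(rewards, baseline, alpha=0.1, noise_sd=0.01,
                    n_trials=40, seed=0):
    """Simulate the four-choice task of Figure 3.1.

    `rewards` maps each arm to its (fixed) reward.  On each trial the arm with
    the highest w_i + eta_i is chosen (equation 3.1), and its weight is updated
    by dw_i = alpha * (r - baseline), since the eligibility is one for the
    chosen arm and zero for the others.  Returns the sequence of chosen arms.
    """
    rng = np.random.default_rng(seed)
    arms = list(rewards)
    w = {arm: 0.0 for arm in arms}
    choices = []
    for _ in range(n_trials):
        arm = max(arms, key=lambda a: w[a] + rng.normal(0.0, noise_sd))
        w[arm] += alpha * (rewards[arm] - baseline)
        choices.append(arm)
    return choices

# With zero rewards and a positive baseline every chosen arm is devalued, so the
# system tends to cycle through the arms; with a baseline of zero the weights
# never change and selection stays at the level of the random noise.
print(four_arm_trials({'N': 0.0, 'S': 0.0, 'E': 0.0, 'W': 0.0}, baseline=0.1))
```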
From inspecting the learning rule it is clear that an action will be punished whenever the reinforcement achieved is less than the baseline b. Consider the case where the reward in each arm is zero but the baseline is some small positive value. It is easy to see that the preferred action at the start of each trial (i.e. the one with the least negative weighting) will be the one which has been attempted on the fewest previous occasions. The preference weights effectively encode a search strategy based on action frequency. It is tempting to call this strategy spontaneous alternation (after the exploration behaviour observed in rodents), since a given action is unlikely to be retried until all other alternatives have been attempted an equal number of times.

(Footnote 21: Spontaneous alternation (e.g. [43]) is usually studied in Y or T mazes. Over two successive trials, rodents and other animals are observed to select the alternate arm on the second trial in approximately 80% of tests. Albeit that there may be a superficial likeness, the artificial spontaneous alternation described here is not intended as a psychological model; it seems probable that in most animals this phenomenon is due to representational learning.)

If non-zero rewards are available in any of the maze arms, but the baseline is still higher than the maximum reward, then the behaviour of the system will follow an alternation pattern in which the frequency p_i with which each arm is selected is

p_i = \frac{(r_i - b)^{-1}}{\sum_j (r_j - b)^{-1}}.

In other words, the alternation behaviour is biased to produce actions with higher reward levels more frequently.

(Footnote 22: If T is the total number of trials and T_i is the number of trials in which arm i was selected, then the accumulated weight changes keep the preferences roughly level, so (r_i - b)T_i / [(r_j - b)T_j] \to 1 as T \to \infty for any pair i, j. Therefore T_i \propto (r_i - b)^{-1}, and hence the frequency with which arm i is chosen is p_i = T_i / T = (r_i - b)^{-1} / \sum_j (r_j - b)^{-1}.)

With a fixed baseline, the learning system will never converge on actions that achieve only lower levels of reward. Consequently, if the maximum reward value r* is known, setting b = r* will ensure that the system will not cease exploring unless an optimal behaviour is found. In Sutton's reinforcement comparison scheme, the baseline is replaced by the prediction of reinforcement V. From the above discussion it is clear that there is a greater likelihood of achieving optimal behaviour (and avoiding convergence on poor actions) if the initial prediction V_0 ≥ r*. Alternation behaviour will occur in a similar manner until V ≤ r*; thereafter the optimal action will be rewarded while all others continue to be punished.

For associative learning, the expectation V(x) associated with a context x is described here as optimistic if V(x) ≥ r*(x) (where r*(x) is the maximum possible reward for that context) and as pessimistic otherwise. In immediate reinforcement tasks, setting the initial expectation V_0(x) = r*(x) will give a greater likelihood of finding better actions than for V_0(x) < r*(x); it should also give faster learning than for V_0(x) > r*(x). When r*(x) is not known, an optimistic guess for V_0(x) will give slower learning than a pessimistic guess but also a better chance of finding the best actions. A similar argument applies to delayed reinforcement tasks. In this case, an expectation is considered optimistic if V(x) ≥ R*(x), where R*(x) is the maximum possible return.
If the initial expectation V 0 is the same in all states and is globally optimistic then a form of spontaneous alternation will arise. While the predictions are over-valued the TD error will on average be negative and actions will generally be punished. However, transitions to states that have been visited less frequently will be punished less. The selection mechanism therefore favours those actions that have been attempted least frequently and that lead to the least visited states. When the expected return varies between states the alternation pattern should also be biased towards actions and states with higher levels of return. Hence, for an optimistic system, initial exploration is a function of action frequency, state frequency and estimated returns. This results in behaviour that traverses the state-space in a near systematic manner until expectation is reduced to match the true level of available reward. Sutton [162] performed several experiments investigating the effect on learning of reinforcements that occur after varying lengths of delay. He found that the learning system tends to adapt to maximise rewards that occur sooner rather than later. This arises because secondary reinforcement from more immediate rewards biases action selection before signals from later rewards have been backed-up sufficiently to have any influence. Clearly this problem is not overcome by altering rates of trace decay or learning since these parameters effect the rate of propagation of all rewards equally. Providing the learning system with an CHAPTER THREE EXPLORATION 55 optimistic initial expectation can, however, increase the likelihood of learning optimal behaviour. While the expectation is optimistic, action learning is postponed in favour of spontaneous alternation. Chains of secondary reinforcement only begin to form once the expectation falls below the level of available reward. Rewards of greater value will obtain a head start in this backingup process increasing the likelihood of learning appropriate, optimal actions. In the following section this effect of the initial expectation is demonstrated for learning in a simple maze-like task. A maze learning task Figure 3.2 shows a maze learning problem represented as a grid in which the cells of the grid correspond to intersections and the edges between cells to paths that connect at these intersections. In each cell in the rectangular grid shown there are therefore upto four paths leading to neighbouring places. A H H H H G Figure 3.2: A maze learning task with 6x6 intersections. The agent (A) and goal (G) are in opposite corners and there are four ‘hazard’ areas (H). Behaviour is modelled in discrete time intervals where at each time-step the agent makes a transition from one cell to a neighbour. A version of the actor/critic architecture is used in which each cell is given a unique, discrete encoding. The evaluation for a cell is encoded by a distinct parameter, and, as in the four-arm maze (figure 3.1), there is a separate weight for each action in each cell. The CHAPTER THREE EXPLORATION 56 action in any specific cell is chosen by equation 3.1. Further details of the algorithm are given in Appendix A. For the task considered here, one cell of the grid is assigned to be the starting position of the agent and a second cell is assigned to be the goal position where positive reward (+1) is available. Certain cells contain hazards where a negative reinforcement (-1) is given for entering the cell, in all other non-goal cells the reward is zero. 
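A minimal sketch of this reward schedule in Python may help to fix ideas (my own illustration; the grid size matches Figure 3.2 but the particular hazard coordinates are placeholders, since the exact layout is not specified in the text).

```python
# Reward received on entering a cell of the 6x6 maze: +1 at the goal,
# -1 in a hazard cell, and 0 everywhere else.
GOAL = (5, 5)                                 # opposite corner to the start
HAZARDS = {(1, 3), (2, 1), (3, 4), (4, 2)}    # four hazard cells (positions assumed)

def maze_reward(cell):
    if cell == GOAL:
        return 1.0
    if cell in HAZARDS:
        return -1.0
    return 0.0
```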
Note that with this reward schedule it is the effect of the discounted time horizon (that delayed rewards are valued less) that encourages the agent to find direct paths. It is also possible that the learning system will fail to learn any route to the goal. This arises if a locally optimal, ‘dithering’ policy is found that involves swapping back and forth between adjacent non-hazard cells to avoid approaching punishing areas. s"1 where s is the minimum number The maximum return R* (x) for any cell is ! * of steps to the goal, hence, for all cells, 0 < R (x) ! 1 . An initial expectation of zero is therefore pessimistic in all cells and an expectation of +1 optimistic. The effect on learning of these different expectations was examined in the following experiment. The agent was run on repeated trials with the maze configuration shown above. Each trial ended either when the agent reached the goal or after a thousand transitions had occurred. A run was terminated after 100 trials or after two successive trials in which the agent failed to reach the goal in the allotted time. Suitable global parameters for the learning system (learning rates, decay rates and discount factor) were determined by testing the system in a hazard-free maze. Out of ten learning runs starting from the pessimistic initial expectation, V 0 = 0 , the agent failed to learn a path to the goal on all occasions as a result of learning a procrastination policy. Figure 3.3 shows a typical example of the associations acquired. CHAPTER THREE EXPLORATION 57 Figure 3.3: Action preferences and cell evaluations after a series of trials learning from initially pessimistic expectation. The arrows in the left diagram indicate the preferred direction of movement in each cell. The heights of the columns in the right diagram show the predicted return for each cell (white +ve, black -ve). In this example the agent has gradually confined itself to the top left hand corner—all preferences near the start cell direct the agent away from the hazards and back toward the initial position. A ‘wall’ of negative expectation prevents the agent from gaining any new experience near the goal. In contrast to the poor performance of this pessimistic learner, given an optimistic initial expectation23, V 0 = +1 , successful learning of a direct path was achieved on all ten runs. Figure 3.4 illustrates the policy and evaluation after one particular run. On this occasion the prediction in any cell never fell below zero expectation, so convergence on a dithering policy could not occur. 23The optimistic expectation is applied to all cells except the goal which is given an initial value of zero. This does not imply any prior knowledge but merely indicates that once the goal is achieved the anticipation of it ceases. Experiments with a continuous version of the problem in which the agent moves from the goal cell back to its starting position (and updates the evaluation of the goal according to its experience in the next trial) support the conclusions reported here. CHAPTER THREE EXPLORATION 58 Figure 3.4: Action preferences and cell evaluations after a series of trials of learning from an initially optimistic expectation, the agent has learned a direct path to the goal (along the top row then down the left column). To confirm that the exploration behaviour of a optimistic system is better than chance a simple experiment was performed using a ‘dry’ 6x6 maze (i.e. one without rewards of any kind). 
In each cell the action with the highest weighting was always selected (using random noise as a tie-breaker). A good measure of how systematically the maze is traversed is the variance in the mean number of times each possible action is chosen out of a series of n actions. For n = 120, random action selection (or $V_0 = 0$) gave an average variance that was more than five times higher24 than optimistic exploration behaviour ($V_0 = +1$). In other words, the initial behaviour of an optimistic system traverses the maze in a manner that is considerably more systematic than a random walk.

24 Since there are 120 actions in total, choosing n = 120 makes the mean number of choices 1. Over ten trials, random selection gave an average variance in this mean of 1.42; for optimistic search the variance was only 0.26.

The effect of initial expectation is further demonstrated in figure 3.5. This graph shows the average number of transitions in the first ten trials of learning starting from different initial expectations25. Behaviour in the maze with hazards is contrasted with behaviour in a hazard-free maze. In the latter case the number of transitions gradually rises as the value of the initial expectation is increased (from zero through to one). This is entirely due to the alternation behaviour induced by the more optimistic learning systems. In the hazardous maze, however, the trend is reversed. In this case, systems with lower initial expectations take longer to learn the task. More time is spent in unfruitful dithering behaviour and less in effective exploration.

25 The averages were calculated over ten runs, with only those runs which were ultimately successful in learning a path to the goal being considered.

Figure 3.5: Speed of learning (average steps per trial, plotted against initial value) over the first ten trials for initial value functions 0.0, 0.3, 0.6, 0.8 and 1.0, with and without hazards. (For an initial value of zero in the maze with hazards all ten runs failed to find a successful route to the goal.)

Clearly, it is the relative values of the expectation and the maximum return that determine the extent to which the learning system is optimistic or pessimistic. An alternative way to modify exploration behaviour is therefore to change the rewards rather than the expectation. For the maze task described above, equivalent learning behaviour is generated if there is zero reward at the goal and negative rewards for transitions between all non-goal cells26.

26 In the task described above there is a reward $r_g$ at the goal and zero reward in all other non-hazard cells. Consider an identical task but with zero reward at the goal and a negative 'cost' reward $r_c$ for every transition between non-goal cells (Barto et al. [13] investigated a route-finding task of this nature). Let the maximum return in each cell for the first task be $R^G(x)$ and in the second task $R^C(x)$. It can easily be shown that, for a given discount factor $\gamma$, $R^C(x) = R^G(x) - 1$ iff $r_c = (\gamma - 1)r_g$. Learning behaviour will therefore be the same under both circumstances if the initial expectation satisfies $V_0^C = V_0^G - 1$.

As learning proceeds the effect of an optimistic bias in initial predictions gradually diminishes. To obtain a similar, but permanent, influence on exploration Williams [184] has therefore suggested adding a small negative bias to the error signal. This will provoke continuous exploration since any action which does not lead to a higher-than-expected level of reinforcement is penalised. The system will never actually converge on a fixed policy but better actions will be preferred.
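As a concrete illustration of the role of the initial value function, the sketch below reconstructs a tabular actor/critic learner of the kind used in these maze experiments. It is a minimal sketch under stated assumptions: the hazard positions, learning rates, discount factor and the soft-max selection rule are invented stand-ins for the settings specified in Appendix A, not the thesis implementation itself. Setting V0 to 0 or +1 reproduces, qualitatively, the pessimistic and optimistic conditions compared above.

    import math, random

    SIZE = 6
    START, GOAL = (0, 0), (5, 5)
    HAZARDS = {(1, 3), (2, 1), (3, 4), (4, 2)}   # hazard positions are assumed for illustration
    ACTIONS = {'N': (0, -1), 'S': (0, 1), 'E': (1, 0), 'W': (-1, 0)}
    GAMMA, ALPHA_V, ALPHA_W, V0 = 0.9, 0.2, 0.2, 1.0   # V0 = +1 gives an optimistic critic

    # critic: one evaluation per cell; actor: one weight per action per cell
    V = {(x, y): (0.0 if (x, y) == GOAL else V0) for x in range(SIZE) for y in range(SIZE)}
    W = {(x, y): dict.fromkeys(ACTIONS, 0.0) for x in range(SIZE) for y in range(SIZE)}

    def choose(state):
        # soft-max action selection over the action weights (a stand-in for equation 3.1)
        exps = {a: math.exp(w) for a, w in W[state].items()}
        r = random.random() * sum(exps.values())
        for a, e in exps.items():
            r -= e
            if r <= 0:
                return a
        return a

    def step(state, action):
        dx, dy = ACTIONS[action]
        nxt = (min(max(state[0] + dx, 0), SIZE - 1), min(max(state[1] + dy, 0), SIZE - 1))
        reward = 1.0 if nxt == GOAL else (-1.0 if nxt in HAZARDS else 0.0)
        return nxt, reward

    for trial in range(100):
        state, steps = START, 0
        while state != GOAL and steps < 1000:
            action = choose(state)
            nxt, reward = step(state, action)
            delta = reward + GAMMA * V[nxt] - V[state]   # TD error: negative while V is over-valued
            V[state] += ALPHA_V * delta                  # critic update
            W[state][action] += ALPHA_W * delta          # actor update: less-tried options stay attractive
            state, steps = nxt, steps + 1

With an optimistic V0 the TD error is negative almost everywhere at first, so whichever action was just taken is penalised relative to the untried alternatives, producing the near-systematic alternation described above.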
Relationship to animal learning

The indirect exploration induced by the discrepancy between predictions and rewards seems to fit naturally with many of the characteristics of conditioning in animal learning. In particular, a large number of theorists have proposed that it is the 'surprise' generated by a stimulus that provokes exploration, more than the status of the outcome as a positive or negative reinforcer (see, for instance, Lieberman [85]). The pessimistic-optimistic distinction also seems to have a parallel in animal learning, shown by experiments on learned helplessness [123, 149]. This research demonstrates that animals fail to learn avoidance of aversive stimuli if they are pre-trained in situations where the punishing outcome is uncontrollable. In terms of the simulations described above, it could be argued that the negative expectation induced by the pre-training reduces the discrepancy between the predicted outcome and the (aversive) reward and so removes the incentive for exploration behaviour.

3.2 Direct Exploration Methods

In addition to the indirect exploration strategies described above, a number of methods have been proposed for directly encouraging efficient exploration. This section briefly reviews some of these techniques and, in this context, describes an exploration method due to Williams [184] that is employed in the simulations described in later chapters.

The simplest procedure for controlling exploration is to start out with a high level of randomness in the action selection mechanism and reduce this toward zero as learning proceeds. An annealing process of this sort can be applied to the task as a whole or, for more efficient exploration, the probability function from which actions are selected can be varied for different local regions of the state space. The following considers several such methods for tailoring local exploration; these fall into two general categories that I will call uncertainty and performance measures.

Uncertainty measures attempt to estimate the accuracy of the learning system's current knowledge. Suitable heuristics are to attach more uncertainty to contexts that have been observed less frequently [108], or less recently [164], or for which recent estimates have shown large errors [108, 111, 145, 169]. Exploration can be made a direct function of uncertainty by making the probability of selecting an action depend on both the action preference and the uncertainty measure. The effect of biasing exploration in this manner is local, that is, it cannot direct the system to explore in a distant region of the state-space. An alternative approach is to apply the uncertainty heuristic indirectly by adding some positive function of uncertainty to the primary reinforcement (e.g. [164]). This mechanism works by redefining optimal behaviours as those that both maximise reward and minimise uncertainty. It can produce a form of non-local exploration bias, since uncertainty will be propagated by the backward-chaining of prediction estimates, eventually having some effect on the decisions made in distant states.

Performance measures estimate the success of the system's behaviour as compared with either a priori knowledge or local estimates of available rewards.
Gullapalli [56] describes a method for immediate reinforcement tasks in which the performance measure is a function of the difference between the predicted reward and the maximum possible reward $r^*(x)$ (which is assumed to be known). The amount of exploration, which varies from zero up to some maximum level, is in direct proportion to the size of this disparity.

Williams [184] has proposed a method for adapting exploration behaviour that is suitable for learning real-valued actions when $r^*(x)$ is not known to the learning system. He suggests allowing the degree of search to adapt according to the variance in the expected reward around the current mean action. In other words, if actions close to the current mean are achieving lower rewards than actions further away, then the amount of noise in the decision rule should be increased to allow more distant actions to be tried more frequently. If, on the other hand, actions close to the mean are more successful than those further away, then noise should be reduced, so that the more distant (less successful) actions are sampled less often. A Gaussian action unit that performs adaptive exploration of this nature is illustrated below and described in detail in Appendix D. Here both the mean and the standard deviation of a Gaussian pdf are learned. The mean, of course, is trained so as to move toward actions that are rewarded and away from those that are punished; however, the standard deviation is also adapted. The width of the probability function is increased when the mean is in a region of low reward (or surrounded by regions of high reward) and reduced when the mean is close to a local peak in the reinforcement landscape (see figure 3.6). This learning procedure results in an automatic annealing process, as the variance of the Gaussian will shrink while the mean behaviour converges to the local maximum. However, the width of the Gaussian can also expand if the mean is locally sub-optimal, allowing for an increase in exploratory behaviour at the start of learning or if there are changes in the environment or in the availability of reward.

It is interesting to contrast this annealing method with Gullapalli's approach. In the latter the aim of adaptive exploration is solely to enable optimal actions to be found; consequently, as performance improves the noise in the decision rule is reduced to zero. In Williams' method, however, the goal of learning is to adapt the variance in the acquired actions to reflect the local slope of the expected return. The final width of the Gaussian should therefore depend on whether the local peak in this function is narrow or flat on top. The resulting variations in behaviour can be viewed not so much as noise but as acquired versatility. An application of Williams' method to a difficult delayed reinforcement task is described in chapter six where the value of this learned versatility can be clearly seen.

Figure 3.6: Learning the standard deviation of the action probability function. The panels indicate the direction of change in the standard deviation ($\sigma$) for actions ($y$) sampled less ($(y - \mu)^2 < \sigma^2$) or more ($(y - \mu)^2 > \sigma^2$) than one standard deviation from the mean, for different values of the reinforcement error signal ($e$). The bottom left panel indicates that the distribution will widen when the mean is in a local trough in the reinforcement landscape; the bottom right, that it will narrow over a local peak.
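The sketch below gives one minimal reading of such an adaptive search rule: a single Gaussian action unit whose mean and standard deviation are both adapted using REINFORCE-style characteristic eligibilities. It is illustrative only, not the formulation of Appendix D; the bimodal reward landscape, baseline rule and learning rates are invented for the example.

    import random, math

    class GaussianActionUnit:
        """A 'hedonistic' real-valued action unit: samples y ~ N(mu, sigma)
        and adapts both mu and sigma from the reinforcement error e."""
        def __init__(self, mu=0.0, sigma=1.0, alpha_mu=0.05, alpha_sigma=0.02):
            self.mu, self.sigma = mu, sigma
            self.alpha_mu, self.alpha_sigma = alpha_mu, alpha_sigma

        def act(self):
            self.y = random.gauss(self.mu, self.sigma)
            return self.y

        def learn(self, e):
            # characteristic eligibilities for a Gaussian pdf:
            #   d ln p / d mu    = (y - mu) / sigma^2
            #   d ln p / d sigma = ((y - mu)^2 - sigma^2) / sigma^3
            dy = self.y - self.mu
            self.mu += self.alpha_mu * e * dy / self.sigma ** 2
            self.sigma += self.alpha_sigma * e * (dy ** 2 - self.sigma ** 2) / self.sigma ** 3
            self.sigma = max(self.sigma, 1e-3)   # keep the width positive

    # Invented reward landscape: a broad peak at y = -1 and a narrow peak at y = +2.
    # The unit settles on one peak and its final sigma reflects that peak's width.
    def reward(y):
        return math.exp(-(y + 1) ** 2 / 2.0) + math.exp(-(y - 2) ** 2 / 0.1)

    unit, baseline = GaussianActionUnit(), 0.0
    for t in range(5000):
        r = reward(unit.act())
        unit.learn(r - baseline)            # reinforcement error relative to a baseline
        baseline += 0.01 * (r - baseline)   # running estimate of the expected reward
    print(unit.mu, unit.sigma)

The sigma update widens the distribution when samples far from the mean do better than expected and narrows it when they do worse, which is exactly the behaviour summarised in figure 3.6.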
Direct exploration and model-based learning

Mechanisms that adapt exploration behaviour according to some performance measure can be seen as rather subtle forms of reinforcement (i.e. non model-based) learning. Here the acquired associations are clearly task-specific, although they specify the degree of variation in response as well as the response itself. Uncertainty measures, on the other hand, are more obviously 'knowledge about knowledge', implying some further degree of sophistication in the learning system. However, whether this knowledge should be characterised as model-based is arguable. To a considerable extent this may depend on how and where the knowledge is acquired and used.

For instance, a learning system could estimate the frequency or recency of different input patterns, or the size of their associated errors, in the context of learning a specific task. This measure could then be added to the internal reward signal (as described above), so indirectly biasing exploration. These heuristics will be available even in situations where the agent has only partial state knowledge. In contrast, where causal or world models are constructed, the same uncertainty estimates with respect to a given goal could be acquired within the context of a task-independent framework [108, 164]. Exploration might then be determined more directly and in a non-local fashion. That is, rather than simply biasing exploration toward under-explored contexts, the learning system could explicitly identify and move toward regions of the state-space where knowledge is known to be lacking. Clearly such strategies depend on full model-based knowledge of the task, although the motivating force for exploration is still task-specific. Finally, uncertainty measures could be determined with respect to the causal or world models themselves; in other words, there could be task-independent knowledge of uncertainty, something perhaps more like true curiosity, which could then drive exploration behaviour.

Conclusion

This chapter has considered both direct and indirect methods for controlling the extent of search carried out by a reinforcement learning system. In particular, the value of the initial expectation (relative to the maximum available reward) has been shown to have an indirect effect on exploration behaviour and consequently on the likelihood of finding globally optimal solutions to the task in hand.

Chapter 4

Input Coding for Reinforcement Learning

Summary

The reinforcement learning methods described so far can be applied to any task in which the correct outputs (actions and predictions) can be learned as linear functions of the recoded input patterns. However, the nature of this recoding is obviously critical to the form and speed of learning. Three general approaches can be taken to the problem of choosing a suitable basis for recoding a continuous input space: fixed quantisation methods; unsupervised learning methods for adaptively generating an input coding; and adaptive methods that modify the input coding according to the reinforcement received.
This chapter considers the advantages and drawbacks of various recoding techniques and describes a multilayer learning architecture in which a recoding layer of Gaussian basis function units with adaptive receptive fields is trained by generalised gradient descent to maximise the expected reinforcement. CHAPTER FOUR 4.1 INPUT CODING 64 Introduction In chapter two multilayer neural networks were considered as function approximators. The lower layers of a network were viewed as providing a recoding of the input pattern, or recoding vector, which acts as the input to the upper network layer where desired outputs (actions and predictions) are learned as linear functions. This chapter addresses some of the issues concerned with selecting an appropriate architecture for the recoding layer(s). The discussion divides into three parts: methods that provide fixed or a priori codings; unsupervised or competitive learning methods for adaptively generating codings based on characteristics of the input; and methods that use the reinforcement feedback to adaptively improve the initial coding system. As discussed in the chapter two, for any given encoding of an input space, a system with a single layer of adjustable weights can only learn a limited set of output functions [102] . Unfortunately, there is no simple way of ensuring that an arbitrary output function can be learned short of expanding the set of input patterns into a highdimensional space in which they are all orthogonal to each other (for instance, by assigning a different coding unit to each pattern). This option is clearly ruled out for tasks defined over continuous input spaces as the set of possible input vectors is infinite. However, even for tasks defined over a finite set such a solution is undesirable because it allows no generalisation to occur between different inputs that require similar outputs. One very general assumption that is often made in choosing a representation is that the input/output function will be locally smooth. If this is true, it follows that generalisation from a learned input/output pairing will be worthwhile to nearby positions in the input space but less so to distant positions. The recoding methods discussed in this chapter exploit this assumption by mapping similar inputs to highlycorrelated codings and dissimilar inputs to near-orthogonal codings. This allows local generalisation to occur whilst reducing crosstalk (interference between similar patterns requiring different outputs). To provide such a coding the input space is CHAPTER FOUR INPUT CODING 65 mapped to a set of recoding units, or local experts1, each with a limited, local receptive field. Each element of the recoding vector then corresponds to the activation of one such unit. In selecting a good local representation for a learning task there are clearly two opposing requirements. The first concerns making the search problem tractable by limiting the size of recoded space. It is desirable to make the number of local experts small enough and their receptive fields large enough that sufficient experience can be gained by each expert over a reasonably short learning period. The more units there are the longer learning will take. The second requirement is that of making the desired mapping learnable. The recoding must be such that the non-linearities in the input-output mapping can be adequately described by the single layer of output weights. 
1 The use of this term follows that of Nowlan [117] and Jacobs [66], though no strict definition is intended here.

However, it is the nature of the task and not the nature of the input that determines where the significant changes in the input-output mapping occur. Therefore, in the absence of specific task knowledge, any a priori division of the space may result in some input codings that obscure important distinctions between input patterns. This problem, sometimes called perceptual aliasing, will here be described as the ambiguity problem, as such codings are ambiguous with regard to discriminating contexts for which different outputs are required. In general the likelihood of ambiguous codings will be reduced by adding more coding units. Thus there is a direct trade-off between creating an adequate, unambiguous code and keeping the search tractable.

4.2 Fixed and unsupervised coding methods

Boundary methods

The simplest form of a priori coding is formed by a division of the continuous input space into bounded regions, within each of which every input point is mapped to the same coded representation. A recoding of this form is sometimes called a hard quantisation. Figure 4.1 illustrates a hard quantisation of an input space along a single dimension.

Figure 4.1: Quantisation of a continuous input variable x between its minimum and maximum values.

A hard quantisation can be defined by a set of M cells $C = \{c_1, c_2, \ldots, c_M\}$, where for each cell $c_i$ the range $\{x_i^{min}, x_i^{max}\}$ is specified, and where the ranges of all cells cover the space and are non-overlapping. The current input pattern $x(t)$ then maps to the single cell $c^*(t) \in C$ whose boundaries encompass this position in the input space. The elements of the recoding vector are in a one-to-one mapping to the quantisation cells. If the index of the winning cell is given by $i^*$, this gives $\phi(t)$ as a unit vector of size M where

$\phi_i(t) = \delta(i, i^*) = \begin{cases} 1 & \text{iff } i = i^* \\ 0 & \text{otherwise.} \end{cases}$   (4.1.1)

(The Kronecker delta, $\delta(i, j) = 1$ for $i = j$, $0$ for $i \neq j$, will also be used to indicate functions of this type.)

With a quantisation of this sort, non-linearities that occur at the boundaries between cells can be acquired easily. However, this is achieved at the price of having no generalisation or transfer of learning between adjoining cells. For high-dimensional input spaces this straightforward approach of dividing up the Euclidean space using a fixed-resolution grid rapidly falls prey to Bellman's curse of dimensionality. Not only is a large amount of space required to store the adjustable parameters (much of it possibly unused), but learning occurs extremely slowly as each cell is visited very infrequently.

Coarse coding

One way to reduce the amount of storage required is to vary the resolution of the quantisation cells. For instance, Barto et al. [11] (following Michie and Chambers [99]) achieve this by using a priori task knowledge about which regions of the state space are most critical. An alternative approach is to use coarse-coding [61], or soft quantisation, methods where each input is mapped to a distribution over a subset of the recoding vector elements. This can reduce storage requirements and at the same time provide for a degree of local generalisation. A simple form of coarse coding is created by overlapping two or more hard codings, or tilings, as shown in figure 4.2; a small coded example is given below.

Figure 4.2: Coarse-coding using offset tilings along a single dimension. One cell in each tiling is active.
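A minimal sketch of these two codings for a single input dimension follows: a hard quantisation maps an input to a one-of-M unit vector (equation 4.1.1), while overlapping offset tilings activate one cell per tiling, each contributing 1/T. The number of cells, the number of tilings and the input range are arbitrary choices for illustration.

    def hard_quantise(x, x_min=0.0, x_max=1.0, M=8):
        """One-of-M coding: phi_i = 1 for the cell containing x, 0 elsewhere."""
        cell = min(int((x - x_min) / (x_max - x_min) * M), M - 1)
        return [1.0 if i == cell else 0.0 for i in range(M)]

    def tile_code(x, x_min=0.0, x_max=1.0, cells_per_tiling=8, T=4):
        """Coarse coding with T offset tilings; one cell per tiling is active,
        each contributing 1/T so the coding vector sums to one."""
        phi = [0.0] * (cells_per_tiling * T)
        width = (x_max - x_min) / cells_per_tiling
        for j in range(T):
            offset = j * width / T                    # tilings displaced by width/T
            cell = int((x - x_min + offset) / width)
            cell = min(max(cell, 0), cells_per_tiling - 1)
            phi[j * cells_per_tiling + cell] = 1.0 / T
        return phi

    # Nearby inputs share most of their active cells; distant inputs share none.
    print(tile_code(0.30))
    print(tile_code(0.33))

The shared active cells between nearby inputs are what provide the local generalisation discussed above, while the hard coding provides none.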
If each of the T tilings is described by a set $C_j$ ($j = 1 \ldots T$) of quantisation cells, then the current input $x(t)$ is mapped to a single cell $c_j^*(t)$ in each tiling and hence to a set $U(t) = \{c_1^*(t), c_2^*(t), \ldots, c_T^*(t)\}$ of T cells overall. If, again, there is a one-to-one mapping of cells to the elements $\phi_i(t)$, then the coding vector is given by

$\phi_i(t) = \begin{cases} \frac{1}{T} & \text{iff } c_i \in U(t) \\ 0 & \text{otherwise.} \end{cases}$   (4.1.2)

Note that here each element is 'normalised' to the value $\frac{1}{T}$ so that the sum of activation across all the elements of the vector is unity.

A soft quantisation of this type can give a reasonably good approximation wherever the input/output mapping varies in a smooth fashion. However, any sharp non-linearities in the function surface will necessarily be blurred as a result of taking the average of the values associated with multiple cells, each covering a relatively large area.

A coarse-coding of this type called the CMAC ("Cerebellar Model Articulation Computer") was proposed by Albus [2] as a model of information processing in the mammalian cerebellum. The use of CMACs to recode the input to reinforcement learning systems has been described by Watkins [177]; they are also employed in the learning system described in Chapter Six. The precise definition of a CMAC differs slightly from the method given above, in that the cells of the CMAC are not necessarily mapped in a one-to-one fashion to the elements of the recoding vector. Rather, a hashing function can be used to create a pseudo-random many-to-one mapping of cells to recoding vector elements. This can reduce the number of adaptive parameters needed to encode an input space that is only sparsely sampled by the input data.

Compared with a representation consisting of a single hyper-cuboid grid, a CMAC appears to quantise a high-dimensional input space to a similar resolution (albeit coarsely) using substantially less storage. For example, figure 4.3 shows a CMAC in a two-dimensional space consisting of four tilings with 64 adjustable parameters; this gives a degree of discrimination similar to a single grid with 169 parameters.

Figure 4.3: A CMAC consisting of four 4x4 tilings compared with a single grid (13x13) of similar resolution.

At first sight it appears that the economy in storage further improves as the dimensionality of the space increases: if all tilings are identical and each has a uniform resolution P in each of N input dimensions, then the total number of adjustable parameters (without hashing) is $P^N T$. This gives a maximum resolution similar2 to a single grid of size $(PT)^N$; in other words, there is a saving in memory requirement of order $T^{N-1}$. However, this saving in memory actually incurs a significant loss of coding accuracy, as the nature of the interpolation performed by the system varies along different directions in the input space. Specifically, the interpolation is generated evenly along the axis on which the tilings are displaced, but unevenly in other directions (in particular in directions orthogonal to this axis). Figure 4.4 illustrates this for the 2-dimensional CMAC shown above. The figure shows the sixteen high-resolution regions within a single cell of the foremost tiling. For each of these regions, the figure shows the low-resolution cells in each of the tilings that contribute to coding that area. There is a clear difference in the way the interpolation occurs along the displacement axis (top-left to bottom-right) and orthogonal to it.
This difference will be more pronounced when coding higher dimensional spaces.

2 This is strictly true only if the input space is a closed surface (i.e. an N-dimensional torus); if the input space is open (i.e. has boundaries) then the resolution is lower at the edges of the CMAC because the tilings are displaced relative to each other. If the space is open in all dimensions (as in figure 4.3) then the CMAC will have maximum resolution only over a grid of size $(PT + 1 - T)^N$.

Figure 4.4: Uneven interpolation in a two-dimensional CMAC. The distribution of low-resolution cells contributing to any single high-resolution area (the black squares) is more balanced along the displacement axis (top-left to bottom-right) than orthogonal to it (bottom-left to top-right).

Nearest neighbour methods

Rather than defining the boundaries of the local regions, it is common practice to partition the input space by defining the centres of the receptive fields of the set of recoding units. This results in nearest-neighbour algorithms for encoding input patterns [70, 105]. Given a set of $i = 1 \ldots M$ units, each centred at position $c_i$ in the input space, a hard, winner-takes-all coding is obtained by finding the centre that is closest to the current input $x(t)$ according to some distance metric. For instance, if the Euclidean metric is chosen, then the winning node is the one for which

$d_i = d(x(t), c_i) = \left[ \sum_{j=1}^{N} (x_j(t) - c_{ij})^2 \right]^{1/2}$   (4.1.3)

is minimised. If the index of the winning node is written $i^*$ then the recoding vector is given by the unit vector

$\phi_i(t) = \delta(i, i^*).$   (4.1.4)

This mechanism creates an implicit division of the space called a Voronoi tessellation3.

3 See Kohonen [70], chapter five.

Radial basis functions

This nearest neighbour method can be extended to form a soft coding by computing for each unit a value $g_i$ which is some radially symmetric, non-linear function of the distance from the input point to the node centre, known as a radial basis function (RBF) (see [131] for a review). Although there are a number of possible choices for this function, there are good reasons [127] for preferring the multi-dimensional Gaussian basis function (GBF). First, the 2-dimensional Gaussian has a natural interpretation as the 'receptive field' observed in biological neurons. Furthermore, it is the only radial basis function that is factorisable, in that a multi-dimensional Gaussian can be formed from the product of several lower-dimensional Gaussians (this allows complex features to be built up by combining the outputs of two- or one-dimensional detectors). A radial Gaussian node has a spherical activation function of the form

$g_i(t) = g(x(t), c_i, w) = \frac{1}{(2\pi)^{N/2} w^N} \exp\left[ -\sum_{j=1}^{N} \frac{(x_j(t) - c_{ij})^2}{2w^2} \right]$   (4.1.5)

where w denotes the width of the Gaussian distribution in all the input dimensions. It is convenient to use a normalised encoding, which can be obtained by scaling the activation $g_i(t)$ of each unit according to the total activation of all the units. In other words, the recoding vector element for the ith unit is calculated by

$\phi_i(t) = \frac{g_i(t)}{\sum_{j=1}^{M} g_j(t)}.$   (4.1.6)

Figure 4.5 illustrates a radial basis recoding of this type with equally spaced fixed centres and the Gaussian activation function; a small coded example follows.

Figure 4.5: A radial basis quantisation of an input variable. The coding of input x is distributed between the two closest nodes.
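The sketch below computes the normalised Gaussian coding of equations 4.1.5 and 4.1.6 for a set of fixed centres. The centres, width and input dimensionality are arbitrary choices made for illustration.

    import math

    def gaussian_activation(x, centre, w):
        """Radial Gaussian node (equation 4.1.5) with a single width w."""
        N = len(x)
        sq_dist = sum((xj - cj) ** 2 for xj, cj in zip(x, centre))
        norm = (2 * math.pi) ** (N / 2) * w ** N
        return math.exp(-sq_dist / (2 * w ** 2)) / norm

    def rbf_code(x, centres, w=0.15):
        """Normalised recoding vector (equation 4.1.6): activations sum to one."""
        g = [gaussian_activation(x, c, w) for c in centres]
        total = sum(g)
        return [gi / total for gi in g]

    # Five equally spaced centres along one input dimension, as in figure 4.5.
    centres = [[0.1], [0.3], [0.5], [0.7], [0.9]]
    print(rbf_code([0.42], centres))   # most of the mass is shared by the nodes at 0.3 and 0.5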
Without task-specific knowledge, selecting a set of fixed centres is a straight trade-off between the size of the search-space and the likelihood of generating ambiguous codings. For a task with a high-dimensional state-space, learning can be slow and expensive in memory. It is therefore common practice, as we will see in the next section, to employ algorithms that position the nodes dynamically in response to the training data, as this can create a coding that is both more compact and more directly suited to the task.

Unsupervised learning

Unsupervised learning methods encapsulate several useful heuristics for reducing either the bandwidth (dimensionality) of the input vector, or the number of units required to adequately encode the space. Most such algorithms operate by attempting to maximise the network's ability to reconstruct the input according to some particular criterion. This section discusses how adaptive basis functions can be used to learn the probability distribution of the data, perform appropriate rescaling, and learn the covariance of input patterns. To simplify the notation, the dependence of the input and the node parameters on the time t is assumed hereafter.

Learning the distribution of the input data

Many authors have investigated what are commonly known as competitive learning methods (see [59] for a review), whereby a set of node centres is adapted according to a rule of the form

$\Delta c_i \propto \delta(i, i^*)(x - c_i)$   (4.2.1)

where $i^*$ is the winning (nearest neighbour) node. In other words, at each time-step the winning node is moved in the direction of the input vector4. The resulting network provides an effective winner-takes-all quantisation of an input space that may support supervised [105] or reinforcement learning (see next chapter). A similar learning rule for a soft competitive network of spherical Gaussian nodes5 is given by

$\Delta c_i \propto \phi_i (x - c_i).$   (4.2.2)

In this rule the winner-only update is relaxed to allow each node to move toward the input in proportion to its normalised activation (4.1.6). Nowlan [115] points out that this learning procedure approximates a maximum likelihood fit of a set of Gaussian nodes to the training data.

4 Learning is usually subject to an annealing process whereby the learning rate is gradually reduced to zero over the duration of training.

5 All nodes have equal variance and equal prior probability.

To avoid the problem of specifying the number of nodes and their initial positions in the input space, new nodes can be generated as and when they are required. A suitable heuristic for this (used, for instance, by Shannon and Mayhew [150]) is to create a new node whenever the error in the reconstructed input is greater than a threshold $\theta$. In other words, given an estimate $\hat{x}$ of the input calculated by

$\hat{x} = \sum_{i=1}^{M} \phi_i c_i$   (4.2.3)

a new node will be generated with its centre at x whenever

$\| x - \hat{x} \| > \theta.$   (4.2.4)

One of the advantages of such a scheme, compared to an a priori quantisation, is in coding the input of a task with a high-dimensional state-space which is only sparsely visited. The node generation scheme will substantially reduce the number of units required to encode the input patterns, since units will never be assigned to unvisited regions of the state space. A sketch of this soft competitive rule with node generation is given below.
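The following sketch combines the soft competitive update (4.2.2) with the node-generation heuristic (4.2.3 and 4.2.4). The learning rate, width and threshold values are arbitrary, and the reconstruction error uses the Euclidean norm; these are assumptions made for illustration rather than the settings used in the thesis simulations.

    import math

    class SoftCompetitiveCoder:
        def __init__(self, width=0.2, rate=0.05, threshold=0.3):
            self.centres, self.w, self.rate, self.threshold = [], width, rate, threshold

        def _phi(self, x):
            # normalised spherical Gaussian activations (equations 4.1.5 and 4.1.6)
            g = [math.exp(-sum((xj - cj) ** 2 for xj, cj in zip(x, c)) / (2 * self.w ** 2))
                 for c in self.centres]
            total = sum(g) or 1.0
            return [gi / total for gi in g]

        def observe(self, x):
            phi = self._phi(x) if self.centres else []
            # reconstruction x_hat = sum_i phi_i * c_i  (equation 4.2.3)
            x_hat = [sum(p * c[j] for p, c in zip(phi, self.centres)) for j in range(len(x))]
            err = math.sqrt(sum((xj - xh) ** 2 for xj, xh in zip(x, x_hat)))
            if not self.centres or err > self.threshold:
                self.centres.append(list(x))          # generate a new node at x (4.2.4)
                return
            for p, c in zip(phi, self.centres):       # soft competitive update (4.2.2)
                for j in range(len(c)):
                    c[j] += self.rate * p * (x[j] - c[j])

Feeding a stream of input vectors to observe() grows the set of centres only where data actually occur, which is the property emphasised above for sparsely visited state-spaces.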
Rescaling the input

Often a network is given a task where there is a qualitative difference between different input dimensions. For instance, in control tasks, inputs often describe variables such as Cartesian co-ordinates, velocities, accelerations, joint angles, angular velocities, and so on. There is no sense in applying the same distance metric to such different kinds of measure. For a metric such as the Euclidean norm to be of any use here, an appropriate scaling of the input dimensions must be performed. Fortunately, it is possible to modify the algorithm for adaptive Gaussian nodes so that each unit learns a distance metric which carries out a local rescaling of the input vectors (Nowlan [116]). This is achieved by adding to each node an additional set of adaptive parameters that describe the width of the Gaussian in each of the input dimensions. The (non-radial) activation function is therefore given by

$g_i = g(x, c_i, w_i) = \frac{1}{(2\pi)^{N/2} \prod_j w_{ij}} \exp\left[ -\sum_{j=1}^{N} \frac{(x_j - c_{ij})^2}{2 w_{ij}^2} \right].$   (4.2.5)

The receptive field of such a node can be visualised as a multi-dimensional ellipse that is extended in the jth input dimension in proportion to the width $w_{ij}$. Suitable rules for adapting the parameters of the ith node are

$\Delta c_{ij} \propto \phi_i \frac{x_j - c_{ij}}{w_{ij}^2}$, and $\Delta w_{ij} \propto \phi_i \frac{(x_j - c_{ij})^2 - w_{ij}^2}{w_{ij}^3}.$   (4.2.6)

Figure 4.6 illustrates the receptive fields of a collection of nodes with adaptive variance trained to reflect the distribution of an artificial data set. The data shown are randomly distributed around three vertical lines in a two-dimensional space. The central line is shorter than the other two but generates the same number of data points. The nodes were initially placed at random positions near the centre of the space.

Figure 4.6: Unsupervised learning with Gaussian nodes with adaptive variance.

Learning the covariance

Linsker [86, 87] proposed that unsupervised learning algorithms should seek to maximise the rate of information retained in the recoded signal. A related approach was taken by Sanger [141], who developed a Hebbian learning algorithm that learns the first P principal components of the input and therefore gives an optimal linear encoding (for P units) that minimises the mean squared error in the reconstructed input patterns. Such an algorithm performs data compression, allowing the input to be described in a lower-dimensional space and thus reducing the search problem. The principal components of the data are equivalent to the eigenvectors of the covariance (input correlation) matrix.

Porrill [128] has also proposed the use of Gaussian basis function units that learn the local covariance of the data by a competitive learning rule. To adapt the receptive field of such a unit, a symmetric NxN covariance matrix S is trained together with the parameter vector c describing the position of the unit's centre in the input space. The activation function of a Gaussian node with adaptive covariance is therefore given by

$g(x, c, S) = \frac{1}{(2\pi)^{N/2} |S|^{1/2}} \exp\left( -\tfrac{1}{2} (x - c)^\top S^{-1} (x - c) \right)$   (4.2.7)

where $|\cdot|$ denotes the determinant. The term Gaussian basis function (GBF) unit will be used hereafter to refer exclusively to units of this type. In practice, it is easier to work with the inverse of the covariance matrix, $H = S^{-1}$, which is known as the information matrix. With this substitution equation 4.2.7 becomes

$g(x, c, H) = (2\pi)^{-N/2} |H|^{1/2} \exp\left( -\tfrac{1}{2} (x - c)^\top H (x - c) \right).$   (4.2.8)
Given a collection of GBF nodes with normalised activations (eq. 4.1.6), suitable update rules for the parameters of the ith node are

$\Delta c_i \propto \phi_i H_i (x - c_i)$, and $\Delta H_i \propto \phi_i \left[ S_i - (x - c_i)(x - c_i)^\top \right].$   (4.2.9)

A further discussion of these rules is given in Appendix B. The receptive field of a GBF node is a multi-dimensional ellipse whose axes align themselves along the principal components of the input. The value of storing and learning the extra parameters that describe the covariance is that this often allows the input data to be encoded to a given level of accuracy using a smaller number of units than if nodes with a more restricted parameter set are used.

Figure 4.7 illustrates the use of this learning rule for unsupervised learning with a data set generated as a random distribution around an X shape in a 2-dimensional space. The left frame in the figure shows the receptive fields of four GBF nodes; for comparison, the right frame shows the fields learned by four units with adaptive variance alone.

Figure 4.7: Unsupervised learning with adaptive covariance (GBF) and adaptive variance units. The units in the former case learn a more accurate encoding of the input data.6

6 In this particular case the mean squared error in the reconstructed input was approximately half as large for the GBF units (left figure) as for the units with adaptive variance only (right figure) (mse = 0.0075 and 0.014 respectively after six thousand input presentations).

A set of GBF nodes trained in this manner can provide a set of local co-ordinate systems that efficiently describe the topology of the input data. In particular, they can be used to capture the shape of the manifold formed by the data-set in the input space. Porrill [128] gives a fuller discussion of this interesting topic.

Discussion

Unfortunately, none of the techniques described here for unsupervised learning is able to address the ambiguity problem directly. The information that is salient for a particular task may not be the same information that is predominant in the input. Hence, though a data-compression algorithm may retain 99% of the information in the input, it could be that the lost 1% is vital to solving the task. Secondly, unsupervised learning rules generally attempt to split the state-space in such a way that the different units get an equal share of the total input (or according to some similar heuristic). However, it may be the case that a particular task requires a very fine-grained quantisation in certain regions of the space though not in others, and that this granularity is not reflected in the frequency distribution of the input data. Alternatively, the acquired quantisation may be sufficiently fine-grained but the region boundaries may not be aligned with the significant changes in the input-output mapping.

In many reinforcement learning tasks the input vectors are determined in part by the past actions of the system. Therefore, as the system learns a policy for the task, the distribution of inputs will change and the recoding will either become out-dated or will need to adapt continually. In the latter case the reinforcement learning system will be presented with a moving-target problem where the same coding may represent different input patterns over the training period. Finally, it is often the case that the temporal order in which the input vectors are provided to the system is a poor reflection of their overall distribution.
Consider, for instance, a robot which is moving through an environment generating depth patterns as input to a learning system. The patterns generated in each local area of the space (and hence over any short time period) will be relatively similar and will not randomly sample the total set of depth patterns that can be encountered. In order to represent such an input space adequately the learning rate of the recoding algorithm will either have to be extremely slow, or some buffering of the input or batch training will be needed. In general, unsupervised techniques can provide useful pre-processing but are not able to discover relevant task-related structure in the input. The next section describes some steps toward learning more appropriate input representations using algorithms in which the reinforcement learning signal is used to adapt the input coding. CHAPTER FOUR 4.3 INPUT CODING 79 Adaptive coding using the reinforcement signal Non-gradient descent methods for discrete codings I am aware of two methods that have been proposed for adaptively improving a discrete input quantisation. Both techniques are based on Watkin's Q learning, and therefore also require a discrete action space. Both also assume that the quantisation of the input space consists of hyper-cuboid regions. Whitehead and Ballard [181] suggest the following method for finding unambiguous state representations. They observe that if a state is ambiguous relative to possible outcomes then the Q learning algorithm will associate with each action in that state a value which is actually an average of the future returns. For an unambiguous state, however, the values converged upon by Q learning will always be less than or equal to the true returns (but only if all states between it and the goal are also unambiguous—an important caveat!). Their algorithm therefore does a search through possible input state representations in which any one which learns to overestimate some of its returns is suppressed. In the long run unambiguous states will be suppressed less often and therefore come to dominate. This method has some similarities with selectionist models of learning (e.g. Edelman [45]) since it requires that there are several alternative input representations all competing to provide the recoding with the less successful ones gradually dying out. Chapman and Kaebling [30] describe a more rigorous, statistical method based on a similar principle. Their ‘G algorithm’ attempts to improve a discrete quantisation by using the t-test to decide whether the reinforcement accruing either side of a candidate splitting-point is derived from a single distribution. If the test suggests two different underlying distributions then the space is divided at that position. The technique can be applied recursively to any new quantisation cells that are generated. The algorithm is likely to require very extensive training periods for the following reasons. First, the evaluation function must be entirely re-learned every time the quantisation is altered. Second, because the secondary reinforcement is noisy whilst the values are being learned it is necessary to split training into two phases—value function learning and t-test data acquisition. Finally, the requirement of the t-test that CHAPTER FOUR INPUT CODING 80 data is drawn from normal distributions requires that the same state be visited many times (ideally) before the splitting test can be applied. 
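The core of this splitting test can be sketched as follows. This is only an illustration of the general idea, comparing the reinforcement gathered on either side of a candidate boundary with a two-sample t statistic; it is not the exact 'G algorithm' procedure described above, and the fixed threshold is an arbitrary stand-in for a proper significance test.

    import math

    def t_statistic(a, b):
        """Welch's two-sample t statistic for two samples of reinforcement values."""
        na, nb = len(a), len(b)
        ma, mb = sum(a) / na, sum(b) / nb
        va = sum((x - ma) ** 2 for x in a) / (na - 1)
        vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
        return (ma - mb) / math.sqrt(va / na + vb / nb + 1e-12)

    def should_split(returns_below, returns_above, threshold=2.0):
        """Split a cell at a candidate boundary if the reinforcement gathered on its
        two sides appears to be drawn from different underlying distributions."""
        return abs(t_statistic(returns_below, returns_above)) > threshold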
Both of these algorithms have a major limitation in that they require that the set of potential splitting points, or alternative quantisations, is finite and preferably small. This will clearly not be true for most tasks defined over continuous input spaces. Gradient learning methods for continuous input spaces In chapter two Williams’ [184] analysis of reinforcement learning algorithms as gradient ascent learning methods was reviewed. As Williams has pointed out, once the gradient of the error surface has been estimated it is possible to apply generalised gradient learning rules to train multilayer neural networks on such problems. This allows a suitable recoding of the input space to be learned dynamically by adaptation of the connection weights to a layer of hidden units. There are basically two alternative means for training a hidden layer of coding units using the reinforcement signal. The first approach is the use of generalised gradient descent whereby the error from the output layer is back-propagated to the weights on the hidden units. This is the usual, supervised learning, method for adapting an internal representation of the input. The second approach is a generalisation of the correlation-based reinforcement learning rule. That is, the coding layer (or each coding unit) attempts, independently of the rest of the net, to do stochastic gradient ascent in the expected reward signal. Learning in this case is of a trial and error nature where alternative codings are tested and judged by their effect on the global reinforcement. The output layer of the network has no more direct influence in this process than any other component of the changing environment in which the coding system is seeking to maximise its reward. A network architecture that uses a correlation rule to train hidden units has been proposed by Schmidhuber [143, 144] and is discussed in the next chapter. In general, it will more efficient to use correlation-based learning only when absolutely necessary [184]. That is, stochastic gradient ascent need only be used at the output layer of the network where no more direct measure of error is possible. Elsewhere units that are trained deterministically by back-propagation of error should always learn more efficiently than stochastic units trained by the weaker rule. The CHAPTER FOUR INPUT CODING 81 rest of this chapter therefore concerns methods based on the use of generalised gradient descent training. Reinforcement learning in multilayer perceptrons The now classical multilayer learning architecture is the multilayer perceptron (MLP) developed by Rumelhart et al. [140] and illustrated in figure 4.8. layer 2: output units layer 1: hidden units layer 0: input units Figure 4.8: Multilayer Perceptron architecture. (The curved lines indicate non-linear activation functions.) In a standard feed-forward MLP activation flows upward through the net, the output of the nodes in each layer acting as the inputs to the nodes in the layer above. Learning in the network is achieved by propagating errors in the reverse direction to the activation (hence the term back-propagation) where the generalised gradientdescent rule is used to calculate the desired alteration in the connection weights between units. The activity in the hidden units provides a recoding of each input pattern appropriate to the task being learned. In achieving this coding each hidden unit acts by creating a partition of the space into two regions on either side of a hyper plane. 
The combined effect of all the partitions identifies the sub-region of the space to which the input CHAPTER FOUR INPUT CODING 82 pattern is assigned. This form of recoding is thus considerably more distributed than the localist, basis function representations considered previously. There have been several successful attempts to learn complex reinforcement learning tasks by combining TD methods with MLP-like networks, examples being Anderson's pole balancer [4] and Tesauro's backgammon player [167]. However, the degree of crosstalk incurred by the distributed representation means that learning in an MLP network can be exceptionally slow. This is especially a burden in reinforcement learning where the error feedback signal is already extremely noisy. For this reason, a more localist form of representation may be more appropriate and effective for learning in such problems. This motivates the exploration of generalised gradient learning methods for training networks of basis function units on reinforcement learning tasks. Reinforcement learning by generalised gradient learning in networks of Gaussian Basis Function units. This section describes a network architecture for reinforcement learning using a recoding layer of Gaussian basis function nodes with adaptive centres and receptive fields. The network is trained on line using an approximate gradient learning rule. Franklin [51] and Millington [101] both describe reinforcement learning architectures consisting of Gaussian basis nodes with adaptive variance only. Clearly, such algorithms will be most effective only when the relevant task-dimensions are aligned with the dimensions of the input space. The architecture described here is based on units with adaptive covariance, the additional flexibility should provide a more general and powerful solution. An intuition into how the receptive fields of the GBF units should be trained arises from considering Thorndike’s ‘law of effect’. The classic statement of this learning principle (see for instance [85]) is that a stimulus-action association should be strengthened if performing that action (after presentation of the stimulus) is followed by positive reinforcement and weakened if the action is followed by negative reinforcement. Consider an artificial neuron that is constrained to always emit the same action but is able to vary its ‘bid’ as to how much it wants to respond to a given stimulus. The implication of the law of effect is clear. The node should learn to make CHAPTER FOUR INPUT CODING 83 high bids for stimuli where its action is rewarded, and low bids for stimuli where its action is punished. Generalising this idea to a continuous domain suggests that the neuron should seek to move the centre of its receptive field (i.e. its maximum bid) towards regions of the space in which its action is highly rewarded and away from regions of low reward. If the neuron is also able to adapt the shape of its receptive field then it should expand wherever the feedback is positive and contract where it is negative. Now, if the constraint of a fixed action is relaxed then three adaptive processes will occur concurrently: the neuron adapts its action so as to be more successful in the region of the input space it currently occupies; meanwhile it moves its centre toward regions of the space in which its current action is most effective; finally it changes its receptive field in such a way as to cover the region of maximum reward as effectively as possible. Figure 4.9 illustrates this process. 
The figure shows a single adaptive GBF unit in a two-dimensional space. The shape of the ellipse shows the width of the receptive field of the unit along its two principal axes. The unit's current action is a. If the unit adapts its receptive field in the manner just described then it will migrate and expand its receptive field towards regions in which action a receives positive reward and away from regions where the reward is negative. It will also migrate away from the position where the alternative action b is more successful. A group of units of this type should therefore learn to partition the space between them so that each is performing the optimal action for its 'region of expertise'.

Figure 4.9: Adapting the receptive field of a Gaussian basis function unit according to the reinforcement received. The unit will migrate and expand towards regions where its current action is positively reinforced and will contract and move away from other regions.

In an early paper Sutton and Barto [165] termed an artificial neuron that adapts its output so as to maximise its reinforcement signal a 'hedonistic' neuron. This term perhaps even more aptly describes units of the type just described, in which both the output (action) and input (stimulus sensitivity) adapt so as to maximise the total 'goodness' obtained from the incoming reward signal.

The learning algorithm

As in equation 4.2.8, the activation of each expert node is given by the Gaussian

$g(x, c, H) = (2\pi)^{-N/2} |H|^{1/2} \exp\left( -\tfrac{1}{2} (x - c)^\top H (x - c) \right)$

where x is the current context, c is the parameter vector describing the position of the centre of the node in the input space, and H is the information matrix (the inverse of the covariance matrix). Here I assume a scalar output for the network to make the algorithm easier to describe; the extension to vector outputs is, however, straightforward. The net output y is given as some function of the net sum s. Now, if the error e in the network output is known, then update rules for the output parameter vector w and the parameters $c_i$ and $H_i$ of the ith expert can be determined by the chain rule. Let

$\delta = e \frac{\partial y}{\partial s}$, and $\delta_i = \delta \frac{\partial s}{\partial g_i}$,   (4.3.1)

then

$\Delta w \propto \delta \frac{\partial s}{\partial w} = \delta \phi(x)$,   (4.3.2)

$\Delta c_i \propto \delta_i \frac{\partial g_i}{\partial c_i}$, and   (4.3.3)

$\Delta H_i \propto \delta_i \frac{\partial g_i}{\partial H_i}.$   (4.3.4)

To see that these learning rules behave in the manner described above, consider the following example. Assume a simple immediate reinforcement task in which the network output is given by a Gaussian probability function with a standard deviation of one and mean $w \cdot \phi(x) = s$. After presentation of a context x the network receives a reward signal r. From Williams' gradient ascent learning procedure (Section 2.2) and assuming a reinforcement baseline of zero, we have

$\delta = r(y - s).$   (4.3.5)

First of all, consider the case where the outputs of the units are not normalised with respect to each other, that is, $\phi_i(x) = g_i(x) = g_i$ for each expert i. We have

$\frac{\partial s}{\partial g_i} = \frac{\partial}{\partial g_i}\left( \sum_j g_j w_j \right) = w_i$,   (4.3.6)

and the update rules are therefore given by

$\Delta w \propto r(y - s)\phi(x)$,   (4.3.7)

$\Delta c_i \propto r(y - s) w_i g_i H_i (x - c_i)$, and   (4.3.8)

$\Delta H_i \propto r(y - s) w_i g_i \left( H_i^{-1} - (x - c_i)(x - c_i)^\top \right).$   (4.3.9)

Since $g_i$ is always positive it will affect the size and not the direction of the change in the network parameters. The dependence of the direction of change in the parameters on the signs of the remaining components of the learning rules is illustrated in the table below.
        components                         direction of change
        r     y - s     w_i        Δw_i      Δc_i             ΔH_i (receptive field)
    a   +       +        +          +        toward x          grows
    b   +       -        -          -        toward x          grows
    c   -       +        +          -        away from x       shrinks
    d   -       -        -          +        away from x       shrinks
    e   +       +        -          +        away from x       shrinks
    f   +       -        +          -        away from x       shrinks
    g   -       +        -          -        toward x          grows
    h   -       -        +          +        toward x          grows

The table shows that the learning procedure will, as expected, result in each local expert moving toward the input and expanding whenever the output weight of the unit has the same sign as the exploration component of the action and the reward is positive (rows a, b). If the reward is negative the unit will move away and its receptive field will contract (c, d). If the sign of the weight is opposite to the direction of exploration then all these effects are reversed (e, f, g, h).

Explicit competition between units

There appears, then, to be good accordance between these training rules and the intuitive idea of adaptive coding as a generalisation of the law of effect. However, there remains an impression that this method of training the receptive fields is not quite ideal. This arises because the direction of change in each of these rules depends upon the correlation of the variation in the mean action with the absolute size of the output weight, that is, on $(y - s)w_i$. Intuitively, however, a more appropriate measure would seem to be the correlation of the variation in the mean action with the variation of the output of this unit compared with the mean output, that is, $(y - s)(w_i - s)$. This measure seems more appropriate as it introduces an explicit comparison between the local experts, allowing them to judge their own success against the group mean rather than against an absolute and arbitrary standard.

Fortunately, learning rules that incorporate this alternative measure arise directly if the normalised activation is used in determining the output of the local experts. If (as in equation 4.1.6) we have

$\phi_i = \frac{g_i}{\sum_{j=1}^{M} g_j}$

where M is the total number of local expert units, then

$\frac{\partial s}{\partial g_i} = \frac{1}{\sum_j g_j}(w_i - s).$   (4.3.10)

From this we obtain the learning rules (from 4.3.3 and 4.3.4)

$\Delta c_i \propto \delta (w_i - s) \phi_i H_i (x - c_i)$, and   (4.3.11)

$\Delta H_i \propto \delta (w_i - s) \phi_i \left( H_i^{-1} - (x - c_i)(x - c_i)^\top \right)$   (4.3.12)

wherein the desired comparison measure $(w_i - s)$ arises as a natural consequence of employing generalised gradient ascent learning. A short sketch of these update rules in code is given below.
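The sketch below implements the normalised update rules (4.3.11 and 4.3.12) for a small network of GBF units with a real-valued Gaussian output, as in the example above. It is illustrative only: it omits the scale parameters, spring mechanism and other refinements discussed next, adapts the information matrix directly (so it does not guard against the instabilities noted below), and the unit count and learning rates are arbitrary assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    class GBFUnit:
        def __init__(self, dim):
            self.c = rng.normal(0.0, 0.1, dim)     # centre
            self.H = np.eye(dim) * 4.0             # information matrix (inverse covariance)
            self.w = 0.0                           # output weight

        def activation(self, x):
            d = x - self.c
            det = np.linalg.det(self.H)
            return np.sqrt(det) / (2 * np.pi) ** (len(x) / 2) * np.exp(-0.5 * d @ self.H @ d)

    class GBFNetwork:
        def __init__(self, n_units=8, dim=2, a_w=0.1, a_c=0.05, a_h=0.01):
            self.units = [GBFUnit(dim) for _ in range(n_units)]
            self.a_w, self.a_c, self.a_h = a_w, a_c, a_h

        def forward(self, x):
            g = np.array([u.activation(x) for u in self.units])
            phi = g / (g.sum() + 1e-12)            # normalised activations (4.1.6)
            s = sum(p * u.w for p, u in zip(phi, self.units))
            return phi, s

        def step(self, x, reward_fn):
            phi, s = self.forward(x)
            y = s + rng.normal()                   # Gaussian exploration around the mean, sigma = 1
            r = reward_fn(x, y)
            delta = r * (y - s)                    # equation 4.3.5, zero baseline
            for p, u in zip(phi, self.units):
                d = x - u.c
                comp = u.w - s                     # comparison measure (w_i - s)
                u.w += self.a_w * delta * p        # output weight update (4.3.7 with phi)
                u.c += self.a_c * delta * comp * p * (u.H @ d)                                  # (4.3.11)
                u.H += self.a_h * delta * comp * p * (np.linalg.inv(u.H) - np.outer(d, d))      # (4.3.12)
            return r

Calling step repeatedly with inputs drawn from the task distribution trains all three parameter sets concurrently, in the manner described above: the action, the centre and the receptive field of each local expert.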
Further refinements and potential sources of difficulty

The use of GBF networks for a simple immediate reinforcement learning task is described below; their application to difficult delayed reinforcement tasks is investigated in the next chapter. Before describing any implementations, however, some refinements to the learning system and potential problems (and possible solutions) will be considered.

Learning scaling parameters

An important extension of the learning scheme outlined above involves adapting the strength of response of each unit independently of the position and shape of its receptive field. This sets up a competition for a share in the output of the network between the different 'regions of expertise' occupied by the units. One of the benefits of this competition is a finer degree of control in specifying the shape and slope of the decision boundaries between competing nodes. For each unit an additional parameter $\hat{p}_i$ is used which scales the activation of the ith node during the calculation of the network output, i.e. the activation equation becomes

$g_i = \hat{p}_i (2\pi)^{-N/2} |H_i|^{1/2} \exp\left( -\tfrac{1}{2} (x - c_i)^\top H_i (x - c_i) \right).$   (4.3.13)

The learning rule for the scale parameter is then given by

$\Delta \hat{p}_i \propto \delta \frac{\partial g_i}{\partial \hat{p}_i}.$   (4.3.14)

The $\hat{p}_i$s must be non-zero and sum to unity over the network. This requires a slightly complicated learning rule, since a change in any one scale parameter must be met by a corresponding re-balancing of all the others. A suitable learning scheme (due to Millington [101]) is described in Appendix B.

Receptive field instability

The learning rules for the node receptive fields described above do not guarantee that the width of the field along each principal axis will always be positive. A sufficient condition for this is for the covariance matrix to be positive definite, which can be determined by checking for a positive, non-zero value of the determinant. A simple fix for this problem is to check this value after each update and reinitialise any node receptive field which fails the test. A better solution, however, is to adapt the square root of the covariance matrix rather than attempt to learn the covariance (or information) matrix directly. Algorithms using the square root technique are described in [17].

Keeping the GBF units apart

A further problem that can arise in training sets of GBF nodes is that two nodes will drift together, occupy the same space, and eventually become identical in every respect. This is a locally optimal solution for the gradient learning algorithm and is clearly an inefficient use of the basis units. To overcome this problem, a spring component that implements a repulsive 'force' between all of the node centres can be added to the learning mechanism (see also [126]). This is not always a desirable solution, however. For instance, it could be the case that two nodes have their centres almost exactly aligned but differ considerably in both the shape of their receptive fields and their output weights. This can be a very efficient manner of approximating some surfaces (see next chapter) but cannot arise if the spring heuristic is used to keep the units separated.

Staying in the data

The converse of the problem of units drifting together is that they may drift too far apart. Specifically, some units can be pushed beyond the margins of the sampled region of the input space through the action of the learning rule (for a unit to be totally inactive is another locally optimal solution). A possible way to keep units 'in the data' would be to implement some form of conscience mechanism whereby inactive units have higher learning rates (see Appendix B for more on this topic), or to use unsupervised learning to cause units that are under-used to migrate toward the input. Both these devices will only be of use, however, if the temporal sequence of inputs approximates a random sampling of the input space, a requirement that rarely holds for learning in real-time control tasks.

Relationship to fuzzy inference systems

One of the most attractive features of basis function approximations is their relationship to rule-based forms of knowledge, in particular, what are known as fuzzy inference systems (FIS). A FIS is a device for function approximation based on fuzzy if-then rules such as "If the pressure is high, then the volume is small". An FIS is defined by a set of fuzzy rules together with a set of membership functions for the linguistic components of the rules, and a mechanism, called fuzzy reasoning, for generating inferences. Networks of Gaussian basis function units have been shown to be functionally equivalent to fuzzy inference systems [67].
In other words, the local units in GBF networks can be directly equated with if-then type rules. For instance, if we have a network of two GBF units a and b (in a 2D space with a single scalar output), then an equivalent FIS would be described by Rule A : If x1 is A1 and x 2 is A2 , then y = wA , Rule B: If x1 is B1 and x 2 is B2 , then y = wB . Here the membership functions ( A1 , A2 , B1 , and B2 ) are the components of the (normalised) Gaussian receptive fields of the units in each input dimension. The functional equivalence between the two systems allows an easy transfer of explicit knowledge (fuzzy rules) into tuneable implicit knowledge (network parameters) and vice versa. In other words, a priori knowledge about a target function can be built in to the initial conditions of the network. Provided these initial fuzzy rules give a reasonable first-order approximation to the target then learning should be greatly accelerated and the likelihood of local optima much reduced. This ability to start the learning process from good initial positions should be of great value in reinforcement learning where tabula rasa systems can take an inordinately long time to train. CHAPTER FOUR INPUT CODING 91 A simple immediate reinforcement learning problem To demonstrate the effectiveness of the GBF reinforcement learning mechanism this section describes its application to a simple artificial immediate reinforcement learning task. Its use for more complex problems involving delayed reinforcement is discussed in the next chapter. Figure 4.10 shows a two-dimensional input space partitioned into two regions by the boundaries of an X shape. The ‘X’ task is defined such that to achieve maximum reward a system should output a one for inputs sampled from the within the X shape and a zero for inputs sampled from the area outside it. Figure 4.10: A simple immediate reinforcement task. In the simulations described below the network architecture used a Bernoulli logistic unit (see section 2.2.2) to generate the required stochastic binary output. A spring mechanism was also employed to keep the GBF nodes separated. Full details of algorithm are given in Appendix B where suitable learning rate parameters are also described. Networks of between five and ten GBF nodes were each trained on forty thousand randomly selected inputs. Over the period of training the learning rates of the networks were gradually reduced to zero to ensure that the system settled to a stable configuration. Each network was then tested on ten thousand input points lying on a CHAPTER FOUR INPUT CODING 92 uniform 100x100 grid. During this test phase the probabilistic output of the net was replaced with a deterministic one, i.e. the most likely output was always taken. The learning mechanism was initially tested with units with fixed (equal) scale parameters. Ten runs were performed with each size of network. The results for each run, computed as the percentage of correct outputs during the test phase, are given in Appendix B. In all, the best average performance was found with networks of eight GBF units (hereafter 8-GBF nets). Figure 4.11 shows a typical run for such a net. Initially the nodes were randomly positioned within 0.01 of the centre of the space. By five hundred training steps the spring component of the learning rule has caused the nodes to spread out slightly but they still have the appearance of a random cluster. 
The next phase of training, illustrated here by the snapshot at two thousand time-steps, is characterised by movement of the node centres to strategic parts of the space and adaptation of the output weights toward the optimal actions. Soon after, illustrated here at five thousand steps, the receptive fields begin to rotate to follow the shape of the desired output function. The last twenty thousand steps mainly involve the consolidation and fine tuning of this configuration.

Figure 4.11: Learning the X task with an 8-GBF network. The figures show the position, receptive field and probability of outputting a one for each GBF unit after 500, 2,000, 5,000 and 40,000 training steps.

Figure 4.12 shows that the output of the network during the test phase (that is, after forty thousand steps) is a reasonable approximation to the X shape.

Figure 4.12: Test output of an 8-GBF network on the X task. Black shows a preferred output of one, white a preferred output of zero.

Averaged over ten runs of the 8-GBF networks, the mean score on the test phase was 93.6% optimal outputs (standard deviation 1.1%). This performance compared favourably with that of eight-unit networks with adaptive variance only. The latter, being unable to rotate the receptive fields of their basis units, achieved average scores of less than 90%. Performance at a similar level to the 8-GBF networks was also achieved on most runs with networks of seven units and on some runs using networks of six units (though in the latter case the final configurations of the units were substantially different). However, with fewer than eight units, locally optimal solutions in which the X shape is incompletely reproduced were more likely to arise (for instance, one arm of the X might be missing, or the space between the two arms incorrectly filled). Networks larger than eight units did not show any significant improvement in performance over the 8-GBF nets; indeed, if anything the performance was less consistently good. There are two observations that may be relevant to understanding this result. First, on some runs with larger network sizes one or more units is eventually pushed outside the space to a position in which it is almost entirely inactive. This effectively reduces the number of units participating in the function approximation. Second, with the larger nets, the number of alternative near-optimal configurations of units is increased. These networks are therefore less likely to converge to the (globally) best possible solution on every run.

The experiments with GBF networks of five to ten units were repeated, this time with the learning rule for adapting the scale parameters switched on. Though the overall performance was similar to that reported above, quantitatively the scores achieved with each size of network were slightly better. Again the best performance was achieved by the 8-GBF nets, with a mean score of 95.4% (σ = 0.56%), a significant improvement (t = 4.47, p = 0.0003) when compared to networks of the same size without adaptive scaling. Once more there was no significant improvement for net sizes larger than eight units, indicating a clear ceiling effect (a run of 15-GBF nets also failed to produce higher performance than the eight-unit networks). Figure 4.13 shows the final configuration and test output of a typical 8-GBF network with adaptive scaling.
The additional degrees of freedom provided by the scaling parameters result in the node centres being more widely spaced, generating an output which better reproduces the straight edges and square corners of the X.

Figure 4.13: GBF network with adaptive priors. The numbers superimposed on the nodes show the acquired scale factors (the output probabilities were all near deterministic, i.e. >0.99 or <0.01). The test output of this net indicates a better reproduction of the straight edges and square corners of the X.

Conclusion

This chapter has considered a wide range of possible methods for generating a suitable input coding of a continuous input space for immediate and delayed reinforcement learning systems. Local coding methods have been emphasised as their relative immunity to spatial crosstalk means they can learn with reasonable speed even in tasks with impoverished feedback. The problem of generating unambiguous codings has been considered and a gradient ascent learning mechanism for adapting a layer of Gaussian basis function (GBF) units described. This algorithm has been successfully applied to a simple immediate reinforcement task. In the next chapter the extension of methods for adaptive coding to delayed reinforcement problems will be investigated.

Chapter 5

Experiments in Delayed Reinforcement Learning Using Networks of Basis Function Units

Summary

The previous chapter introduced methods for training a layer of basis function units to provide an adaptive coding of a continuous input space. This chapter applies these systems to a real-time control task involving delayed reinforcement learning. Specifically, the pole balancing task, which was used by Barto, Sutton and Anderson [11] to demonstrate the original actor/critic learning system, is used to evaluate two new architectures. The first uses a single memory-based, radial basis function coding layer to provide input to actor and critic learning units. The second architecture consists of separate actor and critic networks, each composed of Gaussian basis function (GBF) units trained by generalised gradient learning methods. The performance of both systems on the simulated pole balancing problem is described and compared with other recent reinforcement learning architectures for this task [4, 137, 143]. The analysis of these systems focuses, in particular, on the problems of input sampling that arise in this and other real-time control tasks. The interface between explicit task knowledge and adaptive reinforcement learning is also considered, and it is suggested that the GBF algorithm may be particularly suitable for refining the control behaviour of a coarsely pre-specified system.

5.1 Introduction

The pole balancing or inverted pendulum problem has become something of an acid test for delayed reinforcement learning algorithms. The clearly defined nature of the task, the fine-grained control required for a successful solution, and the availability of results from other studies make it an appropriate test-bed for the learning methods described in the previous chapter. Pole balancing has a long history in the study of adaptive control and many different variants of the problem have been explored (see Anderson [3] for a review).
As a problem in delayed reinforcement learning it was first studied by Barto, Sutton, and Anderson [11] and the same version of the task (originally from Michie and Chambers [99]) has since been used to evaluate several reinforcement learning architectures involving adaptive coding [4, 137, 143]. Figure 5.1 illustrates the basic control problem: ! x Figure 5.1: the cart/pole system modelled in the pole balancing task. CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 100 A pole is attached by a frictionless hinge to a cart which is free to move back and forth along a length of track. The hinge restricts the movement of the pole to the vertical plane. The continuous trajectory of the system is modelled as a sequence of discrete time-steps. At each step the control system applies a fixed force (±10 N) to the base of the cart pushing it either to the left or the right. The task for the system is to balance the pole for as long as possible using this ‘bang-bang’ mechanism for moving the cart. The cart begins each trial at the centre of a finite length (2.4m) of track. Each trial ends either when the pole falls beyond an angle of 12° from vertical (a pole failure), or once the movement of the cart carries it beyond either end of the track (a track failure). The information that allows the control system to learn and select appropriate actions is supplied by a context input, describing the current state of the cart-pole system, and a reinforcement signal. The context is a vector of four continuous variables describing the momentary angle and angular velocity of the pole ! and !˙ and the position and horizontal velocity of the cart along the track ( x and x˙) . The reinforcement signal is a single scalar value provided after each action which is non-zero only when that action results in failure (and the end of the current trial) at which point a punishment signal of -1 is provided. This non-zero feedback occurs only after long sequences of actions by the control system making the task one of minimal, delayed reinforcement. ( ) The equations describing the dynamics of the cart/pole simulation are taken from [11] and are given in Appendix C. Characteristics of the pole balancing task Before investigating alternative learning systems for this task it is worthwhile discussing some of the characteristics of the pole balancing task that determine what sorts of solutions are possible. Sampling the input space Many studies of learning bypass the problem of input sampling by arranging cycles of training examples or by selecting input points from a flat, random distribution. This is CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 101 not, however, a situation that is ever faced by a real-time control system be it animal or machine. One of the most problematic aspects of real-time control is that inputs tend to arrive in a highly non-random order. The learning system is therefore faced with the difficult task of trying to find solutions that are valid for the whole space when only a biased sample of that space is available within a given time-span. The following description gives an informal analysis of the types of sampling bias that arise in the pole-balancing task. This analysis should help in evaluating the performance of different learning systems on this specific problem and also in judging how robust these systems will be to sampling bias in general. Biased sampling in the pole balancing task arises over several time-scales and in both the context inputs and the reward signals. 
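For concreteness, the trial structure just described can be sketched as follows. The dynamics function stands in for the equations of Appendix C, and the time-step, initial state and controller interface shown here are illustrative assumptions rather than details taken from the thesis.

```python
import numpy as np

FORCE = 10.0                        # fixed 'bang-bang' push (Newtons)
ANGLE_LIMIT = np.radians(12.0)      # pole failure beyond +/-12 degrees
TRACK_LIMIT = 2.4                   # track failure (metres; +/-2.4 assumed from the state ranges)
DT = 0.02                           # simulation time-step in seconds (assumed)

def cart_pole_dynamics(state, force, dt=DT):
    """Stand-in for the cart/pole equations of motion given in Appendix C."""
    theta, theta_dot, x, x_dot = state
    # ... integrate the standard cart-pole equations over one time-step ...
    return np.array([theta, theta_dot, x, x_dot])

def run_trial(controller, max_steps=50_000):
    """One trial: apply a left/right push each step until the pole or cart fails."""
    state = np.zeros(4)                              # cart at track centre, pole upright (assumed)
    for step in range(max_steps):
        push_right = controller.act(state)           # hypothetical controller interface
        state = cart_pole_dynamics(state, FORCE if push_right else -FORCE)
        theta, _, x, _ = state
        failed = abs(theta) > ANGLE_LIMIT or abs(x) > TRACK_LIMIT
        controller.learn(reward=-1.0 if failed else 0.0, failed=failed)
        if failed:
            return step + 1                          # trial length in time-steps
    return max_steps                                 # criterion length: optimal behaviour
```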
Short term bias The moment to moment state changes of the cart-pole system are relatively small, hence, in the short term (tens of time-steps), the controller samples a long, narrow trajectory of points in the input space with each pattern being very similar to the previous one. Medium term bias In the medium term (hundreds of time-steps) the system will observe substantial variation in pole angle, angular velocity and cart velocity but relatively small, gradual change in the horizontal position of the cart. On this time-scale, therefore, the system will sample only a narrow range of cart positions. Long term bias If the system is uncontrolled or randomly controlled the pole will fall past the critical angle long before the cart reaches the extreme ends of the track. Pole failures are therefore likely to happen within a much shorter time scale (tens to hundreds of timesteps) than track failures (hundreds to thousands of time-steps). A consequence of this is that during the early period of learning most trials will end due to pole failures (henceforth stage one). Thereafter, once the system has discovered ways of preventing the pole from falling, punishment will almost exclusively be as a result of CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 102 track failures (stage two). There are thus dramatic long-term (tens or hundreds of trials) changes in the way that the alternative sources of reinforcement are sampled by the control system. During the first stage of learning the system will sample a wide range of pole angles and angular velocities but a relatively narrow range of cart positions. During the second stage this pattern is reversed with the distribution of pole angles covering a narrow band near the vertical, angular velocities being relatively small, but a wider range of cart positions and velocities being sampled (albeit over a slower time-scale). In stage two the system may also get locked into certain patterns of behaviour, for instance, always travelling to the left and failing at the far left end of the track. Whole regions of the state-space may then be unrepresented in the input stream for many trials. These effects all represent long term changes in the way the control system samples the input space. Multiple sources of reinforcement The symmetry of the pole balancing task is evident to any human observer, however, this knowledge is not available to the adaptive machine controller. From the machine’s perspective, therefore, dropping the pole either to the left or to the right constitutes two distinct sources of negative reward, reaching either end of the track adds a further two. The existence of multiple sources of reward creates a hillclimbing landscape for the learning system in which there will certainly be some locally optimal strategies whereby some sources of punishment are avoided but not others. In view of the sampling bias problems described above learning a strategy that simultaneously avoids both end of the track may be particularly difficult. Non-linearities and critical boundaries in the input-output mapping Symmetry considerations and lay physics also make it intuitively evident that there will be a critical turning point for any optimal control strategy near to the centre position of the space (vertical pole, centre track, zero velocities). It also seem likely that accurate positioning of this control boundary will be required for successful CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 103 control. 
Again, knowledge of this symmetry, or of critical regions and boundaries in the input space, is not generally available to the learning system. 5.2 Fixed and unsupervised coding methods for the pole balancing task Previous approaches A priori quantisation Barto et al.’s [11] solution for the pole-balancing task used an Actor/Critic system with a hand-tuned hard quantisation scheme. The coding mechanism, adopted from an earlier study by Michie and Chambers [99], involved splitting the state-space into ‘boxes’ formed by dividing the range of each variable into sections. The pole angle # was assigned six sections and each of the other variables ( !˙ , x, and x˙ ) three, giving 6 ! 3 ! 3 ! 3 = 162 hyper-cube boxes in all. This system learned to balance the pole for in excess of fifty thousand time-steps (equivalent to approximately fifteen minutes of real time) within fifty to one hundred trials. Although the performance of this system is impressive, it is fair to say that its success depended critically on the carefully tailored coding scheme. The quantisation encoded both task-relevant knowledge of the distribution of the input data, and partial knowledge of the symmetry and critical decision regions of the task. The problems arising from sampling bias were also largely avoided by having only fixed, local controllers (the ‘boxes’) each responsible for a small region of the state-space. Unsupervised learning using Kohonen networks Ritter et al. [137] used a variant of the pole balancing task to evaluate the use of Kohonen’s self-organising networks for quantising the continuous space underlying motor control problems. Kohonen’s [70] learning rule extends the unsupervised, nearest neighbour learning methods described in section 4.2 to networks in which the units are arranged in a pre-defined topology, typically a multi-dimensional grid. Each time an input pattern is presented the centre of the closest, winning unit is moved toward the input. In addition, however, units that are near to the winner in the CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 104 network topology, or index space, are also updated. The amount by which each of these neighbouring units is moved depends on a neighbourhood function (typically a Gaussian) of the distance in the index space from the unit to the winner. By reducing with time both the width of the neighbourhood function and the overall learning rate the network relaxes into a self-organised map in which the topology of the network connections reflects the topology of the input space, that is, neighbouring units correspond to neighbouring input patterns. As in other unsupervised learning methods the distribution of units in space also comes to reflect the underlying distribution of the input data. Ritter et al. performed several experiments with a two-dimensional pole balancer in which the problem was limited to balancing the pole regardless of the position or horizontal velocity of the cart10. The pole angle and angular velocity ( ! and !˙ ) were provided to a network of 25x25 nodes arranged in a grid. Each node had a single output weight which learned the required action for the region of the state-space it encoded. Whilst the node positions were adapted by Kohonen’s training rule the output weights were trained either by supervised learning (the teaching signal being given as a function of ! and !˙ ) or by immediate reinforcement learning (a reward of ! " 2 was given at every time-step). 
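As a reminder of the mechanism involved, the position update at the heart of such a self-organising layer can be sketched as below. The grid layout, learning rate and neighbourhood width are illustrative; in practice both the rate and the width are annealed toward zero as training proceeds so that the map settles into a stable, topology-preserving configuration.

```python
import numpy as np

def kohonen_step(x, centres, grid, lr, sigma):
    """One Kohonen update: move the winner and its grid neighbours toward x.

    x       : input pattern, shape (d,)
    centres : unit centres, shape (n_units, d)
    grid    : index-space coordinates of each unit, shape (n_units, 2)
    """
    # winning unit = nearest centre to the input
    win = np.argmin(np.sum((centres - x) ** 2, axis=1))
    # Gaussian neighbourhood function over distance in the index space
    grid_dist2 = np.sum((grid - grid[win]) ** 2, axis=1)
    h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
    # every unit moves toward the input in proportion to its neighbourhood value
    centres += lr * h[:, None] * (x - centres)
    return win
```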
In both cases the network learned a suitable recoding of the input space and acquired successful control behaviour. To enable comparison of the Kohonen input layer with other adaptive coding methods a version11 of the algorithm was tested here on the delayed reinforcement pole balancing task described above. These experiments were largely unsuccessful as a result of inadequate codings generated by the Kohonen learning rule. This poor 10Various other aspects of the task were varied from those described above, however, these details are not important to the current discussion. 11In Kohonen’s original learning rule the positions of all the coding units in the network are given small initial random values at the start of training. The network then gradually untangles itself as learning proceeds. Luttrell [90] has pointed out that a large proportion of training is taken up unravelling this initial knotted configuration. He proposed instead a multistage version of the algorithm in which the full-size network is built-up incrementally by successive periods of training and splitting smaller networks. This multistage learning algorithm which shows advantages both in training time and in the degree of distortion in the trained network was used in the experiments alluded to here. CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 105 performance by the unsupervised learning system appeared to arise as a direct result of the biased sampling of the input space—because the target input distribution is not adequately represented in any time slice of experience both distortions (some parts of the space over-represented, others under-represented) and instabilities (no convergence to a stable configuration over time) arose in the Kohonen mappings. As the temporal distribution of inputs varies in the short, medium and long-term, techniques for coping with sampling bias such as batch training, buffering the input, or using very low learning rates did not significantly improve performance. In the experiments performed by Ritter et al. the positions of the Kohonen nodes were adapted only over the first thousand training steps. Over this time period the neighbourhood function and learning rates were gradually reduced so that the network settled to a fixed configuration. During this very early stage of learning the pole is relatively uncontrolled and the system is likely to sample most of the attainable input states. However, if the network is trained over a longer period, as is essential for the full four-dimensional task, the effect of learning a successful control strategy is to shrink the sampled region of the state-space disrupting the input coding as an undesirable but inevitable consequence. This suggests that Ritter et al.’s network was effective only because learning was limited to a narrow time-slice of experience which, fortuitously, sampled the state-space in a relatively unbiased manner. It seems likely that other unsupervised learning algorithms whose heuristic power depends on the ability to adapt over time to the distribution of input data will be subject to the same catastrophic difficulties on this task. Memory-based coding using radial basis function units In view of the problems in adapting an input coding by unsupervised learning an alternative memory-based approach was attempted using networks of radial basis function (RBF) units. 
The idea with this method is that, starting with an empty network, new nodes are generated at fixed positions in the input space whenever the current input is inadequately represented (similar methods have also been proposed in [150]). A suitable node generation scheme was described in equations 4.2.3, 4.2.4 and involves creating a new node at the current input position whenever the error in CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 106 reconstructing the input is greater than a fixed threshold. The approach is termed memory-based since, rather than adapting the node positions, learning is simply a matter of storing unfamiliar input patterns. The resulting coding reflects the spatial distribution of the input points—nodes are placed wherever inputs arise—but ignores their temporal distribution. Although this technique avoids problems with sampling bias there is clearly be a price to paid in terms of the efficiency of the resulting coding. The density of nodes in different regions of the input space will be the same regardless of whether a given region generates a large or small number of inputs. An actor/critic architecture using such a memory-based RBF coding was implemented here for the pole balancing task. This architecture is illustrated in figure 5.2 and described in detail in Appendix C. The following gives a brief qualitative description of the learning system. outputs critic actor basis units inputs Figure 5.2: RBF architecture for the pole balancing task. The thick line between the basis units indicates that the activation values are normalised; the dotted line indicates that additional basis nodes are added during training. CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 107 At each time-step each of the basis units calculates its activation as a fixed-width Gaussian function of the distance from its centre to the current input point. The activations of all the basis nodes are then normalised (indicated by the thick line between the units) so that the total sums to unity. The input is then reconstructed by calculating the vector sum of the unit centres scaled by their normalised activation values. The Euclidean distance between this reconstructed input and the original is then computed. If this value is greater than a threshold the input coding is judged inadequate, a new basis unit is added at the input position, and the recoding vector (i.e. the vector of normalised activation values) is recalculated. This recoding vector then serves as input, via weighted connections, to a linear critic unit and a binary actor unit. The weights from each basis unit to the actor and critic elements are initially zero and are trained over time following a standard delayed reinforcement training procedure. Results Tests were performed with different values for the node-generation threshold creating networks of varying size. It was found that to give a coding with an adequate resolution for the task required networks of at least one hundred basis units. In the experiments described below the network size was limited to a maximum of 162 units giving the same number of output parameters as in Barto et al.’s a priori coding system. Further tests were performed to determine suitable global parameters for the actor/critic learning systems. The experiments described below used values that seemed to give acceptable results (also given in the Appendix), however, the time required to train the system prohibited a systematic search of possible parameter values. The learning system was tested over ten runs. 
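The node-generation scheme just described amounts to the following procedure; the fixed Gaussian width and the reconstruction-error threshold are its free parameters, and this sketch is only illustrative of the fuller description in Appendix C.

```python
import numpy as np

def recode(x, centres, width, threshold):
    """Memory-based RBF coding: add a node whenever x is poorly reconstructed.

    Returns the (possibly extended) array of node centres and the normalised
    activation vector that serves as input to the actor and critic units.
    """
    def activations(c):
        d2 = np.sum((c - x) ** 2, axis=1)
        g = np.exp(-d2 / (2.0 * width ** 2))      # fixed-width Gaussian units
        return g / g.sum()                         # normalise to sum to unity

    if len(centres) == 0:                          # start with an empty network
        return x[None, :].copy(), np.ones(1)

    gamma = activations(centres)
    reconstruction = gamma @ centres               # activation-weighted sum of centres
    if np.linalg.norm(x - reconstruction) > threshold:
        centres = np.vstack([centres, x])          # store the unfamiliar input pattern
        gamma = activations(centres)               # recalculate the recoding vector
    return centres, gamma
```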
Each run was terminated after one thousand trials, five hundred thousand time-steps, or once the network succeeded in balancing the pole for fifty thousand time-steps on a single trial. The latter condition indicates balancing for more than fifteen minutes of real time which, as in [11], is taken to indicate that optimal behaviour was achieved. Performance reached the criterion level on all ten runs. On average this level of success was achieved on the 290th trial or after approximately 183,000 training-steps. This rate of acquisition is only slightly slower than that observed with Barto et al.'s a priori coding (acquisition in the initial stages of learning is actually slightly faster). Figure 5.3 shows the distribution of the basis nodes in the input space on a successful run. The graphs show the positions of the centres in the two-dimensional space defined by each pair of input variables.

Figure 5.3: Positions of the radial basis function nodes in the four-dimensional space (θ, θ̇, x, ẋ), illustrated by their projection into the two-dimensional space defined by each pair of variables. The range of each variable (from left to right or bottom to top) is: θ: −12° to +12°; θ̇: −150°/s to +150°/s; x: −2.4 m to +2.4 m; ẋ: −3 m/s to +3 m/s.

The figure illustrates both what is good and what is bad about the coding mechanism. The memory-based node generation algorithm places units only where there is input data. Since many of the variables are strongly correlated (for instance θ̇ correlates with θ and, negatively, with ẋ), large regions of the input space are never sampled, hence there is a significant economy in the number of units required compared, say, with a uniform resolution tiling. However, the inefficiency of the coding is also clearly demonstrated by the figure. For instance, many basis units are placed in extreme regions of the state-space (for instance, states with very high horizontal velocities) that are sampled very rarely and, in any case, may represent unrecoverable positions. Furthermore, the resolution of the coding is uniform throughout the space, being the same in the sparsely visited extreme regions as in the critical area in the centre of the space. To achieve an adequate resolution at the centre, the threshold for generating new nodes must be kept relatively low, creating an excess of units in non-critical areas.

In that the memory-based coding method automatically generates an effective input coding for this task, it is more general than systems that depend on their designers for a tailored, a priori quantisation. At the same time, however, being a brute-force method that scatters nodes around the input space in an almost haphazard fashion, it is both inelegant and inefficient and fails to take advantage of (potentially) available information about the adequacy of the coding. This observation motivates the exploration in the remainder of this chapter of learning systems that attempt to use the reinforcement signal to produce a coding that is specifically tuned to the task.

5.3 Adaptive coding using the reinforcement signal

Previous approaches

Multilayer Perceptron (MLP) networks

Anderson [4] proposed a reinforcement learning architecture that uses two MLP networks, one to act as the actor learning system and the other as the critic.
By separating the two learning systems entirely, each system is free to learn an internal representation of the input specific to its needs. Anderson successfully applied this architecture to the pole balancing task using the network structure for both the actor and critic systems illustrated in figure 5.4. output unit hidden units inputs Figure 5.4: Anderson’s MLP architecture for the pole balancing task. CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 111 ( ) The input to each network consisted of the four state variables !, !˙ , x, and x˙ normalised to lie approximately between nought and one, plus a fifth input with a constant, positive value. The output element in each net had both direct weighted connections to these inputs and indirect connections via five hidden units with sigmoidal activation functions. The critic system computed a simple linear output, whilst the actor system generated a stochastic binary output. The hidden layer of each network was trained using a generalised gradient learning procedure based on the well-known MLP back-propagation algorithm, This system took between five and ten thousand trials to learn to balance the pole for in excess of fifty thousand time-steps. In other words, learning was at least an order of magnitude slower than with either the a priori or memory-based learning architectures described above. Several other disadvantages with this learning system are also worth noting. First, Anderson reported that an extended search for values of the global learning parameters was needed to obtain networks that would converge to an effective solution. Such sensitivity to the learning rates was not noted with the systems described in the previous section. Second, the cart/pole simulation had to be started from a random initial state on each trial. This was required in order to increase the sampling of different track positions and of track failures. Without countering the inherently uneven sampling bias of the task in this manner the system failed to solve the task of keeping the cart on the track. The disadvantages of using MLP networks in reinforcement learning have already been noted in the previous chapter. Here it suffices to recall that the distributed nature of the recoding performed by the hidden units makes the learning process very vulnerable to interference between different input patterns requiring different outputs (spatial crosstalk). This is one possible cause of the extremely long learning times required by Anderson’s simulation. Recurrent networks Schmidhuber [143, 144] describes an interesting recurrent network architecture for reinforcement learning in which a correlation learning rule is used to train the hidden units. His network consists of a number of fully connected primary units each of which is a Bernoulli logistic unit producing a stochastic binary output. At each CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 112 simulation step every primary unit receives a combined input vector consisting of the current context together with the outputs of all the primary units computed on the previous time-step. One primary unit is nominated to act as the actor, that is, its output is taken to be the control signal for the system overall. A secondary unit, which receives as its input the same combined input vector but has no recurrent links to other units, acts as a linear critic element. This architecture is shown in figure 5.5. 
evaluation secondary unit (critic) actor unit primary units inputs Figure 5.5: Recurrent network architecture for reinforcement learning. Thin lines show feed-forward connections, the heavy lines indicate the two-way connections between the primary units. The critic element computes an error according to the normal temporal difference comparison rule which is used to update the weights on the critic’s input lines by gradient descent training. For the primary network the same temporal difference error is used to adjust the weights for the recurrent units according to the following rule. If the combined state vector is given by x(t) and the TD error by eTD (t + 1) then the directed w ij from unit i to unit j is adjusted by wij " eTD (t + 1) x i (t + 1) x j (t) . (5.2.1) CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 113 The effect of this rule is clearly to make the last transition more likely if error signal is positive and less likely when the error is negative. The update rule is therefore a correlation learning rule consistent with the law of effect. For the pole-balancing task the primary network consisted of five units. The input to the system was, as in Anderson’s simulation, the four normalised state variables together with a fifth constant input. This system learned the pole-balancing task faster than the MLP architecture achieving runs of upto 30,000 time-steps within two to six thousand trials, and runs in excess of 300,000 time-steps if the following interventions were made: first, learning was switched off once the system achieved the mile-stone of 1,000 successive balancing steps, and second, the stochastic activation of the primary units was made deterministic (i.e. by always selecting the activation that was most probable). Given the observation, made in the previous chapter, that the correlation learning is a weaker method than back-propagation of error it is something of a puzzle that Schmidhuber’s recurrent architecture learned the pole balancing task faster than Anderson’s MLP system. This question will be addressed at the end of the chapter as the experiments reported in the next section appear to cast some light on this issue. Gaussian Basis Function (GBF) networks The procedure for training networks of GBF units by gradient ascent in the reinforcement signal was evaluated using the pole balancing task. The principal difference between the architecture used here and that employed for the immediate reinforcement learning task (described in the previous chapter) being the use of a second GBF network to act as the critic learning system. The learning system was evaluated with 2,3, and 4 GBF units in each network. At the start of each run the GBF units were initialised to random positions near the centre of the input space with the principal axes parallel to the input dimensions and with small random perturbations in the initial widths. A number of test runs were performed to determine suitable learning parameters for the two networks. However, because of the number of global parameters and the time required to train each system no systematic search of the parameter space was possible. The test runs indicated that systems with CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 114 as few as two GBF units in each network could perform reasonably well on the task provided the node receptive fields in the critic network were allowed to overlap. The spring mechanism which forces the unit centres to spread out was therefore switched off throughout the experiments reported below. 
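For reference, the actor/critic training step that both the memory-based RBF and the GBF systems build upon can be sketched as follows. The linear critic and the Bernoulli logistic actor follow the description above; the discount factor and learning rates shown are illustrative assumptions, and the full details are those given in Appendix C.

```python
import numpy as np

def act(phi, a_weights, rng):
    """Actor: Bernoulli logistic unit over the recoding vector phi."""
    p = 1.0 / (1.0 + np.exp(-(a_weights @ phi)))       # probability of pushing right
    return (1 if rng.random() < p else 0), p

def update(phi, action, p, reward, phi_next, failed,
           v_weights, a_weights, gamma=0.95, beta=0.5, alpha=0.1):
    """One temporal-difference actor/critic update after observing the outcome."""
    v      = v_weights @ phi                            # critic prediction at time t
    v_next = 0.0 if failed else v_weights @ phi_next    # no prediction beyond a failure
    td_error = reward + gamma * v_next - v              # temporal-difference error
    v_weights += beta * td_error * phi                  # critic follows the TD error
    a_weights += alpha * td_error * (action - p) * phi  # reinforce the action deviation
    return td_error
```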
The same performance criteria were applied as in the experiment with the memory-based RBF coding system. In other words, the system was tested over ten runs where each run was terminated after one thousand trials, five hundred thousand time-steps, or once the network succeeded in balancing the pole for fifty thousand time-steps on a single trial. Figure 5.6 shows the results for a learning system with two GBF units in each network (2/2-GBF networks hereafter). The table shows, for each run, the number of steps in the most successful trial and the total number of trials and training-steps up to and including the best trial.

Figure 5.6: Performance of the 2/2-GBF learning system on the pole balancing task (for each of the ten runs: steps in the best trial, total trials, and total training-steps up to and including the best trial).

Although only two of the ten runs reached the criterion level of performance, six out of the ten achieved maximum balancing times in excess of twenty thousand steps (equivalent to more than six minutes of real time). The system is thus at least partially successful in solving the task. Systems with more coding units did not perform any better than the 2/2-GBF networks. The main reason for this is that units tended to migrate together if allowed to do so. When the spring mechanism was engaged to prevent this happening, it created spurious barriers in the error landscape which caused the system to become stuck in poor configurations.

It is possible to gain some understanding of what the system has learned by examining the positions of the GBF nodes in the input space and the shapes of their receptive fields. Figure 5.7 shows the final configuration of the actor network in the eighth training run projected onto the plane formed by each pair of input variables.

Figure 5.7: The balancing strategy acquired by the actor learning system. Each graph (a–f) shows the positions and receptive field shapes of the two GBF nodes projected onto the two-dimensional plane formed by each pair of normalised input variables.

The principal strategy in evidence is that of pushing the cart to the right when the pole is angled or falling right and to the left when the pole is angled or falling left (this strategy is evident from the positions of the two opposing units in 5.7a). However, a second, subtler strategy has also evolved to cope with the problem of keeping the cart on the track. Specifically, the system shows a preference for pushing the cart left when it is near the left end of the track and right when it is near the right end of the track (see 5.7b for instance). This behaviour biases the preferred balancing angle of the pole to be right of vertical when the pole is near the left track end and left of vertical near the right track end, resulting in compensatory movement of the cart back toward the centre of the track (Anderson [3] reports that a similar, though more pronounced, strategy was acquired by his MLP learning system for the pole balancer).
In all the graphs the decision boundary is approximately parallel to the axis  = " ˙ providing the main component of the balancing behaviour that keeps the pole near upright. However, the shift in this decision boundary (i.e. the balancing angle) towards the left as the cart nears the left of the track and towards the right as the cart approaches the right boundary is also clearly evident (compare 5.8a and 5.8i for instance). The critic system learns to represent the evaluation function for this task by adapting the receptive field shapes of its two GBF nodes rather than the node positions. Figure 5.9 illustrates projections of the final configuration of the critic net. In order to make the shapes of the receptive fields more discernible only the central region of each plane is shown. The figure illustrates that the centres of the two units are effectively coincident on or near the centre of the space. One unit, however, has a smaller receptive field and has acquired a positive prediction (+ 0.21). This unit is drawn as a white ellipse in the figure partially eclipsing the second unit which has a larger receptive field, a negative weighting (-0.24), and is drawn in black. Because the first node has as smaller receptive field its activation is stronger in the central region of the space giving nearzero evaluations in this region. The second node having a larger receptive field dominates the peripheral regions of the space giving generally lower evaluations (down to -0.3) in these regions. 13Anderson [3] reports that a similar strategy, though more pronounced, was acquired in his MLP learning system for the pole balancer. CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 117 1.0 + p 0.0 ž ž +- a) x ! "2. 4m, x˙ ! <1.5ms -1 b) x ! "2. 4m, x˙ ! 0.0ms -1 c) x ! "2.4m, x˙ ! "1.5ms -1 d) x ! 0.0m, x˙ ! <1.5ms -1 e) x ! 0. 0m, x˙ ! 0. 0ms -1 f ) x ! 0.0m, x˙ ! "1.5ms -1 g) x ! <2. 4m, x˙ ! <1. 5ms -1 h) x ! <2.4m, x˙ ! 0.0ms -1 i ) x ! <2. 4m, x˙ ! "1. 5ms -1 Figure 5.12: The preferred action of the actor learning system (given by the probability of pushing the cart to the right) plotted against the angle and angular velocity of the pole for different values of the cart position and horizontal velocity. CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 118 0.3 X ž X ž ž a) X c) X X ž d) ž b) ž e) X f) Figure 5.9: GBF receptive field shapes for the critic learning system. The unit with positive weight is drawn in white partially eclipsing the larger unit with a negative weighting. Each graph is limited to the region of dimension 0.3x0.3 around the centre of the normalised input space. The subtler variations in the shapes of the receptive fields also appear to have a significant effect on the acquired function. The most striking of these is evident from graph 5.9a. Here the principal axis of the larger unit has rotated and widened along the line ! = !˙ . The perpendicular axis of this unit, lying along the line = " ˙ , is narrower and is almost the same width as the smaller unit which has a relatively uniform diameter in all directions. This configuration results in a ridge of high evaluations along ! = " !˙ with negative evaluations toward the corners of the space where the angle and angular velocity of the pole have the same sign. The ability to adapt the full covariance is clearly essential to representing this aspect of the function (given only two units). CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 119 There are further fine differences in the receptive shapes due to cart velocity ( x˙ ). 
These graphs show a consistent variation in the ridge of high predictions along the θ = −θ̇ axis for different values of cart velocity. Specifically, if the cart is moving toward the right and the pole is tilting towards the left then the prediction is higher than if the pole is tilting right (5.10c). Similarly, if the cart is moving towards the left then the system is 'happier' when the pole is tilting to the right (5.10g).

Figure 5.10: The prediction computed by the critic learning system, plotted (between 0.0 and −0.3) against the angle and angular velocity of the pole for different values of the cart position and horizontal velocity. The nine panels (a–i) cover the combinations of cart position x ≈ −2.4, 0.0, +2.4 m with cart velocity ẋ ≈ −1.5, 0.0, +1.5 m/s.

A surprising feature of figure 5.10 is that there is only a small degree of variation in the evaluation function with cart position. This observation suggests that, despite the long balancing times on some runs, the learning system may not be particularly sensitive to the punishment meted out for reaching either end of the track. To determine whether or not this was the case, two final control experiments were performed. In both experiments the learning system received the normal negative reward for pole failures but the punishment signal for track failures was suppressed (more precisely, the final training step on trials ending in track failure was not performed; hence there was no opportunity in this experiment for the control system to learn 'what happens' at the end of the track). In the first control experiment (C1 hereafter) the inputs to the learning system were limited to just the pole angle and angular velocity; in the second (C2) all four context variables were provided. Figure 5.11 shows the number of steps in the longest trial for ten runs of each experiment.

run     1      2       3       4       5       6      7      8     9      10
C1     373    450     575     518     451   1,093    508    656   397    403
C2     444  2,731  15,092   3,210  29,022     310  1,781    371   687  10,526

Figure 5.11: Performance on the pole balancing task without punishment for track failures, for systems with two- and four-dimensional input vectors. The difference in performance between the two systems is statistically significant (Mann-Whitney U = 20, p < 0.025).

The first system, with inputs relating only to the position and velocity of the pole, rarely achieved balancing times in excess of a thousand time-steps. On each of these runs the system successfully avoided pole failures after approximately one hundred trials (this assertion is supported by an experiment in which an identical system was tested without the constraint of finite track length: over ten runs, that system succeeded in balancing the pole for in excess of 10,000 time-steps, the cut-off point where indefinite balancing was assumed, within 60 to 140 trials). However, the actual balancing angle was not constrained by this learning process. As a result the pole was usually balanced slightly off vertical, resulting in compensatory horizontal movements of the cart and consistent track failures. The surprising result, however, is shown in the second control experiment, C2. Here substantially longer trials, lasting tens of thousands of time-steps, were achieved on some runs. On these runs it appeared that the system was learning to reduce horizontal movement by balancing the pole as close to vertical as possible.
This occurs in spite of the absence of any negative reward signals for track failures. The extra constraint on balancing angle appears to arise because of the strong negative correlation, noted earlier, between pole angular velocity and cart horizontal velocity. The system learns that movement of the cart is associated with dropping the pole, and, as a result, it learns to keep the cart near stationary, thus indirectly postponing failure at the track boundary. Although on average the system in the original experiment (that did receive the track punishment signal) performed marginally better than C2, this difference is barely statistically significant (Mann-Whitney U = 28.0, p < 0.1). It is not possible, then, to conclude with any certainty that the learning system was improving its performance on the basis of the delayed reward signals provided by the track failures.

The long balancing times achieved both by the original and C2 systems were not simply due to some lucky initial configuration of the GBF units. This is demonstrated for the latter in figure 5.12, which shows a graph of the number of steps in each trial of run three in the C2 experiment. The graph shows the characteristic two-stage learning process outlined earlier. Over the first 150–250 trials the system learns to successfully keep the pole near vertical; thereafter all trials end due to track failures. The learning system in C2 received no primary reinforcement throughout the second learning stage (from approximately trial 250 onwards); in spite of this there is a gradual increase in balancing times, culminating at trial 944 in a period of successful balancing lasting over fifteen thousand time-steps. The graph also shows a drop in the performance of the system immediately following this long period of successful balancing. This phenomenon, which was characteristic of both the original and C2 experiments, has two implications. First, that the performance of both systems was very sensitive to small changes in the control strategy; in other words, the acquired behaviour is not robust. Second, that the severe sampling bias that occurs during the longer balancing periods (the system samples a very narrow region of the state-space for an extended period) has a significant disruptive effect on the acquired control behaviour. The system is thus penalised by its own success.

Figure 5.12: Characteristic learning curve for the second control experiment (steps per trial plotted over 1,000 trials). Balancing continues to improve after the cessation of primary reinforcement (c. trial 250).

Comparison with other systems

The above results perhaps cast some light on the behaviour of the pole balancing systems developed by Anderson [3, 4] and Schmidhuber [143, 144]. Both these
In order to achieve maximum balancing times the learning mechanism was disengaged once the system reached a thousand successive balancing steps. This intervention avoids the undesirable effects of sampling bias observed above in very long trials. The action selection mechanism was also altered from stochastic to deterministic. Without this intervention a very high variance in trial length was observed. Thus the acquired behaviour was not a robust solution to the task but one which was very sensitive to noise in the action selection mechanism. Anderson’s MLP (gradient descent) learning system, on the other hand, achieved success by altering an important characteristic of the task—the initial starting position of the cart/pole system. This change has a significant effect in countering medium and long-term sampling bias of cart positions and track failures. It was noted above that this system took far longer to solve the pole balancing problem that Schmidhuber’s system. One possible explanation for this difference might be that qualitatively different solutions to the task were acquired. Specifically, it may be the case that Schmidhuber’s system adapted principally to the correlation between cart velocity and pole failure rather than to track failures directly. Anderson’s system, on the other hand, forced to cope with difficult starting positions, acquired a more robust strategy that was more sensitive to track failures (which were themselves less masked by sampling noise). If this hypothesis is correct, this might account for the faster acquisition of control behaviour observed in the system with the weaker training rule. The coding units in both of the above systems generate an internal representation by fitting hyper-planes through the input-space. The GBF system, by contrast, forms a more localist representation. With hindsight, the sample biasing problems inherent in the pole balancing task might be expecting to create more problems for a localist coding that for a more distributed one. The former being, by definition, more sensitive to local change is also more likely to be disturbed by it. CHAPTER FIVE 5.4 DELAYED REINFORCEMENT LEARNING 124 Discussion This chapter has described two novel delayed reinforcement learning methods for the pole balancing task. In the first memory-based method a soft quantisation is generated by scattering large numbers of basis function nodes around the populated regions of the state-space. This method is effective and learns rapidly but is clearly inefficient in terms of its memory and processing demands. The second, GBF generalised gradient learning method is quite radically different. Here a very small, possibly minimal, number of units are placed in the state-space and allowed to adapt both their outputs and the exact positions and shapes of their regions of expertise in a manner that maximises the global reward. Remarkably, perhaps, this system, is able to extract appropriate training signals from the very sparse primary reinforcement signals. This method is not, however, robust to bias in the way the control task samples the input space and available rewards. Furthermore, being a gradient learning system trying to live off a very noisy error signal, it may find solutions that are not globally optimal. Adaptive recoding and dynamic programming It is clearly difficult to reconcile adaptive recoding methods with the view of delayed reinforcement learning as incremental dynamic programming. 
In the circumstances of the learning systems discussed above the memory-based system is clearly closest to having Markovian state descriptions. Here the coding for each underlying task state is relatively fixed from the time it is first encountered and is also very localist. However, in the case of any method that uses the reward signal to adapt the internal representation the Markov assumption can hardly apply, this is for the following reasons. First, although there may be sufficient information implicit in the context input to determine the causal processes underlying events, this knowledge is not, at least at the start of learning, coded in anything like a satisfactory form. Second, during training the coding of any given world state will change, presenting the learning system with a moving target problem and the impression of a non-stationary underlying process. Finally, the recodings acquired by these systems are generally specific to the function being learned—radically different representations can be acquired, for instance, for determining predictions and actions. It seems likely that such representations will not be suited to the acquisition of the very different CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 125 mappings (i.e. the transition and reward functions) that could support causal knowledge of the task. These observations suggest that it is inappropriate to consider generalised gradient learning systems for delayed reinforcement learning from the perspective of incremental dynamic programming. Thus, strong convergence properties cannot be expected and the performance of the learning system may improve or worsen with time. It seems clear, however, from the empirical studies described here and elsewhere that the ability of these algorithms to climb the reward gradient can allow successful acquisition of skill. These learning systems may therefore be useful in tasks where no richer feedback signal than the delayed reward measure is possible (note: the pole-balancer is not of this sort17!), and where it is not known a priori what the relevant aspects of the context information will be. Suggestions for coping with input sampling problems and local minima The experiments with the pole balancing problem demonstrate that critical characteristics of a task can be so masked by bias in the temporal sampling of contexts and rewards as to become almost invisible. This indicates a need to develop systems that are able to detect bias and make appropriate allowances for it. Such systems, which would make use of uncertainty or world models (as discussed in chapters two and three), could then control some of the attentional and exploratory aspects of learning. For instance, a system which modelled the frequency and/or recency of visits to different regions of the state-space could detect over- or undersampling and take appropriate actions such as biasing exploration, suppressing learning, triggering new trials in unexplored regions, etc. A possible mechanism for achieving improved learning is the use of an adversarial or parasitic system. This is a secondary learning system, whose outputs control critical task parameters of the primary learner. The adversary is rewarded whenever the primary system performs badly and punished whenever it does well. The idea is that 17Clearly the deviation of the pole from vertical or of the cart from the track centre could serve to provide far richer feedback about the moment-to-moment performance of the system. 
CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 126 this second system will expose the weaknesses of the primary system forcing it to abandon control policies that are locally optimal in favour of more robust solutions. Hillis [60] has demonstrated that co-evolving parasitic systems can be used to coerce a genetic algorithm to find improved solutions to an optimisation problem. In turntaking tasks, such as multi-player games, the primary system can, of course, be its own adversary if it is set to play against itself. This technique was employed by Tesauro [167] in training his reinforcement learning backgammon player and is possibly one of the factors that enabled that system to achieve a near-expert level of play. Implicit and explicit knowledge Perhaps one of the most striking differences between the memory-based and GBF learning methods is in the accessibility of the knowledge they contain. The memorybased method involves a very large number of nodes that individually do not reflect the task structure in any clear way. This form of knowledge is clearly on the implicit side—it is difficult to extract knowledge from the system or down-load task knowledge into it. In contrast the second method appears to straddle the implicit/explicit divide. The knowledge encoded in the individual units reflects the structure of the task in a meaningful way and, through the similarity with fuzzy reasoning, rule-like knowledge could be ‘programmed in’ to be improved through experience. This suggests that one role for such systems in reinforcement learning tasks may be as tuning mechanisms for coarse control strategies that are initially set up by other processes or learning methods. A second role might lie in extracting (by some supervised learning process) the knowledge from less accessible systems like the memory-based learner. That is, the less memory-efficient, more opaque, but also more robust system would learn the task through reinforcement learning. This network would then act as a skilled but dumb teacher for a more compact system of trainable ‘experts’ that would identify and refine the essential task knowledge. Conclusion The idea of adapting the internal representation of a task state-space by hill-climbing in the delayed reinforcement signal stretches the possibility of machine learning CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 127 toward one of its limits. This chapter has demonstrated that learning of this nature is possible even in awkward, real-time control tasks such as the pole balancer. However, such learning is undoubtedly very time-consuming and carries no guarantee of an optimal solution. I have argued here that the principal value of such methods may be more as a mechanism for refining ‘first guess’ control strategies derived from other sources than as a means of learning effective behaviour from scratch. The systems of trainable Gaussian basis function units investigated here may be particularly useful in this respect as they can provide a smooth interface between explicit task knowledge and adaptable motor skill. CHAPTER FIVE DELAYED REINFORCEMENT LEARNING 128 128 Chapter 6 Adaptive Local Navigation Summary So far in this thesis learning has been considered largely for its own sake and beyond the context of any single problem domain. The remaining chapters have a different emphasis, focusing instead on a crucial problem area in Artificial Intelligence and Cognitive Science, that of spatial learning for the task of navigation in a mobile agent. 
Several authors [37, 102, 172, 188] have suggested a division in navigation competence between tactical skills that deal with the immediate problems involved in moving efficiently while avoiding collisions, and strategic skills that allow the successful planning and execution of paths to distant goals. These two levels of navigation skill are here termed local navigation and way-finding1 respectively. This chapter proposes a functional distinction between the two levels of skill. Namely, that local navigation can be efficiently supported by adaptive, task-oriented, stimulus-response mechanisms, but that way-finding requires mechanisms that encode task-independent knowledge of the environment. In other words, whereas local navigation can operate through acquired simple associations allowing very rapid, but largely stereotyped reactions to environmental events, 1This term is borrowed from the psychological literature. It is used here in place of the usual AI term path-planning as the latter often connotes a complete solution to the problem of finding a trajectory through an environment to a desired goal. In contrast the term way-finding is intended to describe the task of choosing appropriate strategic level control behaviour leaving the tasks of local path-planning and obstacle avoidance to the tactical navigation systems. CHAPTER SIX LOCAL NAVIGATION 129 way-finding requires mechanisms that construct and use internal models that encode the spatial layout of the world. The principal focus in this chapter is the problem of learning local navigation skills (way-finding is the main topic of chapters seven and eight). I argue that delayed reinforcement learning is a suitable mechanism for acquiring such behaviours as it allows a direct link to be forged between the evaluation of skills and the tactical goals they aim to fulfil. An architecture for an adaptive local navigation module is then proposed that instantiates this idea. To evaluate this approach a prototype model of an acquired local navigation competence is described and tested in a simulation of a mobile robot2. This models a small agile, vehicle with minimal sensory apparatus that must learn to move at speed around a simulated two-dimensional environment while avoiding collisions with stationary or slow-moving obstacles. The control system developed for this task acquires appropriate real-valued actions for input patterns drawn from a continuous space. This system combines the Actor/Critic architecture with the method for adjusting local exploration behaviour suggested by Williams [184] (to my knowledge this is the first application of Williams’ method to a delayed reinforcement task). Input vectors are recoded using a priori coarse-coding techniques (CMAC networks [2]). The learning system adapts its responses to sensory input patterns, encoding only partial state information, so as to minimise the negative reinforcement arising from collisions and from an internal ‘drive’ that encourages the vehicle to maintain a consistent speed. Successful acquisition of local navigation skill is demonstrated across several environments with different obstacle configurations. 2This work has previously been described in [133, 134]. CHAPTER SIX 6.1 LOCAL NAVIGATION 130 Introduction The fundamental skill required by a mobile agent is the ability to move around in the immediate environment flexibly, quickly and safely. This will be referred to here as tactical, local navigation competence. 
A second, higher level of valuable strategic expertise is the ability, here called way-finding, to plan and follow routes to desired goals. The next three chapters are concerned with computational models of the mechanisms that might underlie these different levels of navigation behaviour. I suggest below that the distinction between local navigation and way-finding is not simply a matter of efficient hierarchical planning—dividing control into long and short-term decision processes. Rather, that it reflects a fundamental functional difference between the two interacting mechanisms that are required to allow an agent to successfully navigate its world. Local Navigation Local navigation covers a wide range of behaviours, for example, Gibson [53] describes nine different problems in the control of locomotion: starting, stopping, backing-up, steering, aiming, approaching without collision, steering among obstacles, and pursuit and flight. In natural systems there is an evolutionary pressure for many of these skills to be very rapidly enacted. Faced with an imminent collision with an unforeseen obstacle, for example, or with the sight of a predator or prey there is very little time to plan an effective course of action. An animal that can put into effect an appropriate pre-compiled or stereotypical action plan will be able to respond most rapidly and therefore have the best survival chance3. In robot systems there is a similar premium in having fast reactions that can provide smooth, highly responsive behaviour without the need for time- 3The predator-prey evolutionary ‘arms race’ has led to some incredibly fast local navigation reflexes. In reviewing research in this field Gronenberg et al. [55] cite several remarkable examples, for instance the jumps of fleas take circa 1 ms (millisecond); the escape responses of cockroaches c. 40 ms, and of certain fish c. 35 ms; take-off behaviour in locusts and flies also occurs within fractions of a second. The preying responses of the predators that catch these animals are even faster. CHAPTER SIX LOCAL NAVIGATION 131 consuming planning processes. These observations suggest that, wherever possible, local navigation skills might be best implemented by rapid reflexive, stimulus-response behaviours. An increasing number of researchers (see Meyer [98] and Maes [91] for reviews) have taken the view that interesting, plan-like behaviour can emerge from the interplay of a set of pre-wired reflexes with the evolving structure of the world. A similar approach has been taken by Brooks [20, 22, 24] in building control systems for mobile robots. He suggests that fast reactions and robust behaviour might be best achieved through task-dedicated sub-systems that use direct sensor data, little internal state, and relatively simple decision mechanisms to choose between alternative actions. His subsumption architecture consists of a hierarchy of sub-systems that implement behaviours at several levels of competence. The lowest levels are concerned with the basic, self-preserving, motor-control activities such as stopping and collision avoidance while higher levels implement more abstract behaviours with longer-term goals. The control architecture is developed by an incremental bottom-up process in which the lowest level of competence is built first and fully debugged. Layers of more complex behaviour are then gradually added where each layer ‘subsumes’ part of the competence of those below by suppressing their outputs and substituting its own. 
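To fix the flavour of this layered arrangement, the sketch below arbitrates between a small set of hypothetical behaviours by priority, with the winning layer effectively suppressing the outputs of those beneath it. It is a simplified illustration only: the behaviour names are invented for the example, and genuine subsumption architectures suppress and inhibit individual connections between modules rather than whole behaviours.

```python
# Minimal sketch of layered, priority-based behaviour arbitration in the spirit
# of the subsumption idea (illustrative only; behaviour names are hypothetical).

from typing import Callable, Optional, Sequence

SensorData = dict
Behaviour = Callable[[SensorData], Optional[str]]   # returns an action or None

def collision_recovery(sensors: SensorData) -> Optional[str]:
    # Basic self-preserving competence: only speaks up when contact is detected.
    return "stop_back_up_and_turn" if sensors.get("touch") else None

def avoid_obstacles(sensors: SensorData) -> Optional[str]:
    # Steer away when something is close in front of the vehicle.
    return "turn_away" if sensors.get("front_range", 999.0) < 30.0 else None

def wander(sensors: SensorData) -> Optional[str]:
    # Most abstract layer in this sketch: keep moving when nothing urgent applies.
    return "move_forward"

def arbitrate(layers: Sequence[Behaviour], sensors: SensorData) -> Optional[str]:
    """Return the action of the highest-priority layer that proposes one;
    that layer suppresses the outputs of the layers below it."""
    for layer in layers:            # ordered from highest to lowest priority
        action = layer(sensors)
        if action is not None:
            return action
    return None

# Example: the recovery layer takes over whenever a collision is sensed.
print(arbitrate([collision_recovery, avoid_obstacles, wander], {"touch": True}))
print(arbitrate([collision_recovery, avoid_obstacles, wander], {"front_range": 20.0}))
```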
This emphasis on developing structures that achieve self-sustaining goals is useful not least because it reflects the pressures that have guided the evolution of natural intelligence. Moreover it has led to the development of several modular, taskoriented systems for robust local navigation in mobile robots (for instance [6, 21, 38, 158]). However, the structures often proposed for implementing individual behaviours operate largely through the use of heuristic rules. They are therefore shaped primarily by the ingenuity of their designers and only indirectly influenced by the robot’s ability to operate effectively in its environment. The robot builder creates a candidate control system, then through an iterative process of observing the robot’s behaviour and modifying its architecture this initial system is refined to give more effective behaviour. Clearly, however, the temporal structure in an agent's interaction with its environment can act as more than just a trigger for fixed reactions. Given a suitable learning mechanism, it may be possible to acquire sequences of new (or CHAPTER SIX LOCAL NAVIGATION 132 modified) responses more suited to the task by attending to pertinent environmental feedback signals. The learning methods described in previous chapters are clearly candidate mechanisms for these purposes, and, below the possibility of acquiring effective local navigation behaviour through delayed reinforcement learning will be investigated. Way-finding Way-finding refers to the ability to move to desired locations in the world. More specifically, it is taken here to be the ability to travel from a starting place A to a goal place B where B is (usually) outside the scene visible from A but can be reached by physical travel. In other words, it is about finding and following routes in large-scale4 environments of which the agent has first-hand experience. In contrast to local navigation which, I have suggested, can operate through simple reflexive skills, it seems evident that a simple stimulus-response mapping cannot support general way-finding behaviour. This is so for the reason, also discussed in previous chapters, that reflexive skills are, inevitably, goal- or taskspecific. In other words, although a chain of conditioned reflexes could be acquired that would allow travel to a particular goal (as, for instance, in the simulated maze experiments in chapter three) it does not contain enough information to be of use in finding a path to any other destination. Information about the agent’s interaction with the environment, is in this form, simply too inflexible and specific to have any general value for navigating large-scale space. If acquiring way-finding skills were simply a case of learning suitable response chains, then learning to travel between arbitrary starting and destination points would be a hugely slow and laborious process. This is not to say that simple associative learning cannot play a role in traversing large-scale environments. It may well be the case that a certain route is sufficiently well-travelled that the often-repeated sequence of moves can be stored intact rather than recreated on every trip. 
However, whereas in local navigation a small number of stereotyped behaviours could be sufficiently general as to be applicable in wide-ranging circumstances, in navigating a large-scale environment the opportunity to repeat any pre-compiled action sequence will arise far less frequently. This leads to the view that spatial knowledge of large-scale environments should be organised in a manner that is more flexible and task-independent, that is, it should involve some form of acquired model. The question of the form such models should take—what information is stored, and how it is acquired and used—is one of the longest running controversies in both psychology and artificial intelligence. Chapters seven and eight consider this debate.

4Kuipers [77] defines a large-scale environment as “one whose structure is revealed by integrating local observations over time, rather than being perceived from one vantage point.”

6.2 A reinforcement learning module for local navigation

The remainder of this chapter seeks to demonstrate that effective local navigation behaviour can be developed by acquiring conditioned responses to perceptual stimuli on the basis of feedback signals about the success of actions in achieving tactical goals. Such feedback will necessarily be qualitative and often intermittent in nature; hence training methods will generally involve delayed reinforcement learning. The Actor/Critic systems considered in previous chapters are clearly appropriate for reasons that can be summarised as follows:
• Learning is driven by the system’s ability to achieve its goals.
• Behaviour is reactive—at any moment the system responds only to the immediate sensory input.
• The system acts to anticipate future rewards but does no explicit planning or forward search.
• Short term memory requirements are minimal—the system remembers only the most recent input patterns and chosen actions.
• Continuous actions can be learned.
• Prediction and action functions can be based on different, appropriate recodings of a continuous input space.
Assuming a modular decomposition of the overall skill into task-specific components, a suitable architecture for an adaptive local navigation module is as illustrated in Figure 6.1.

Figure 6.1: Architecture of an adaptive local navigation module.

The inputs to the adaptive module consist of signals from sensors or from other modules in the control system. The outputs, which will generally be interpreted as control signals, may also feed into other modules, or they may control actuators directly. The distinction between context and reward is intrinsic to the design of the control system—the output of a sensor, for instance a collision detector, could act as context to one module and as reward to another. The nature of the primary reinforcement is decided within the adaptive module itself. This function is performed by the motivate component which computes an overall reinforcement signal as some pre-determined function of the input reward signals. The design of a module for any given task therefore involves specifying the following (a minimal sketch of such a module follows this list):
• The context and reward inputs.
• The form of the required outputs and their initial values.
• The motivation function and the time horizon.
• Mechanisms for recoding the input space.
• Output selection mechanisms.
• Learning algorithms.
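To make this checklist concrete, the following sketch outlines one possible shape for such a module. It is a minimal illustration only: the tabular critic and actor, the quantising recode function, the particular weights in the motivation function, and the parameter values are all assumptions made for the example rather than the design used later in this chapter.

```python
# Sketch of the ingredients listed above for an adaptive module (Figure 6.1).
# Names, weights and parameter values are illustrative assumptions.

import random
from collections import defaultdict

def motivate(reward_inputs):
    """Combine the module's reward inputs into a single primary reinforcement
    (here, an assumed weighted sum of a collision and a movement signal)."""
    weights = {"collision": 1.0, "movement": 0.2}
    return sum(w * reward_inputs.get(name, 0.0) for name, w in weights.items())

def recode(context):
    """Recoding of the input space: a coarse quantisation of a continuous
    context vector into a hashable cell index (a stand-in for a real recoder)."""
    return tuple(int(x // 10) for x in context)

class AdaptiveModule:
    def __init__(self, gamma=0.95, alpha=0.1):
        self.gamma, self.alpha = gamma, alpha        # time horizon, learning rate
        self.prediction = defaultdict(float)         # critic: predicted return per cell
        self.action_mean = defaultdict(float)        # actor: preferred output per cell
        self.prev = None                             # last (cell, action, prediction)

    def step(self, context, reward_inputs):
        r = motivate(reward_inputs)                  # primary reinforcement
        cell = recode(context)
        p = self.prediction[cell]
        if self.prev is not None:
            prev_cell, prev_action, prev_p = self.prev
            td_error = r + self.gamma * p - prev_p   # prediction error
            self.prediction[prev_cell] += self.alpha * td_error
            # Move the preferred action towards (or away from) what was tried.
            self.action_mean[prev_cell] += self.alpha * td_error * (
                prev_action - self.action_mean[prev_cell])
        # Output selection: explore around the current preferred action.
        action = self.action_mean[cell] + random.gauss(0.0, 1.0)
        self.prev = (cell, action, p)
        return action
```

The step method corresponds to one pass around the loop of Figure 6.1: reward inputs are folded into a primary reinforcement, the context is recoded, the prediction error is used to update both critic and actor, and an output is selected for the current context.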
This architecture is investigated below in the context of a modular control system for a simulated mobile robot operating in a simple two-dimensional world. The goal of the adaptive module, in this case, is to acquire reflexive behaviours that allow the vehicle to move at speed whilst avoiding collisions with obstacles. Actions are acquired by the actor learning system as Gaussian distributions (using the method suggested by Williams [184]) which specify a range of possible behaviours from which control signals are selected. Success in the task is thus equivalent to learning effective wandering behaviour for the simple environments considered. The following section gives a description of the simulation and the overall control architecture. To make the account more readable much of the detail of the system is given in Appendix D, the aim of the description here being to give a largely qualitative overview of how the system operates.

6.3 A simulation of adaptive local navigation

The architecture of the control system used in the simulation is illustrated in Figure 6.2. The overall system is made up of five linked modules. Two of these, touch and map, are associated with sensor systems; two are control modules—recover, a fixed collision recovery competence, and wander, the adaptive local navigation module; the final module, motor control, generates low-level instructions for controlling the vehicle’s wheel speeds.

Figure 6.2: Modular control architecture for adaptive local navigation in a simulated mobile robot.

For the purposes of simulation the continuous control behaviour and trajectory of a robot vehicle is approximated by a sequence of discrete time intervals. At each step the model robot acquires new perceptual data; learns; generates control signals; and executes motor commands. In the following the perceptual, motor, and collision recovery components of the simulation are described first. This provides the context for a more detailed description of the specific architecture for the adaptive control module.

Perception

One of the goals of this research is to explore the extent to which spatially sparse but frequent data can support complex behaviour. For this reason the map module, which provides the context input for the learning system, simulates a sensor system that detects the distance to the nearest obstacle at a small number of fixed angles from the robot’s heading. Specifically, in the experiments reported below, a laser range-finder was simulated giving the logarithmically scaled distance to the nearest object at angles of -60°, 0°, and +60°. This unprocessed depth ‘map’ (a real-valued 3-vector) was provided to the control system as its sole source of information, other than collisions, about the local geography of the world. Figure 6.3 shows the simulated robot, modelled as a 0.3 × 0.3 m square box, casting its three ‘rays’ in a sample 5 × 5 m world.

Figure 6.3: The simulated robot in a two-dimensional world.

Two additional sources of perceptual information are also modelled. The module touch models a set of contact-sensitive sensors mounted on the corners of the vehicle. These sensors are triggered by contact with walls or obstacles and hence act as collision detectors.
Finally, wheel speed information (motor in figure 6.2) is obtained internally from the motor control system and is used in determining reinforcement signals.

Motor Control

The vehicle model assumes two main drive wheels with independent motors and a third, unpowered wheel (that acts to give stability). A first-order approximation to the kinematics of this vehicle is to consider any movement as a rotation around a point on the line through the main axle (see figure 6.4). The adaptive control module for the robot generates, at each time-step t, a real-valued two-vector y(t) = (f(t), a(t)) where the signals f(t) and a(t) indicate the desired forward velocity and steering angle of the vehicle. The motor control module converts these signals into desired left and right wheel-speeds which it then attempts to match. This module also deals with requests from the recover module to perform an emergency stop, back-up, or turn. Restrictions on acceleration and braking are included in the model by specifying the maximum increase or decrease in wheel speed per time-step. This enforces a degree of realism in the robot’s ability to initiate fast avoidance behaviour.

Figure 6.4: The simulated vehicle. (Given the instantaneous forward and angular velocities f(t) and a(t), the left and right wheel speeds are given by lws = f(t) + ½ d a(t), and rws = f(t) − ½ d a(t), where d is the distance between the drive wheels.)

Collision recovery

Recover is a ‘pre-wired’ collision recovery module. When activated it stops the robot, suppresses the output of the adaptive controller, and sends its own control signals to the motor system. These signals perform a sequence of actions causing the vehicle to back-up (by undoing its most recent movement) and then rotate by a random amount (±90–180°) before proceeding. If a further collision occurs during this recovery process the robot stops and waits to be relocated to a new starting position. Recover also sends a reset signal to each of the learning components of the adaptive controller. This signal inhibits learning whilst recover is in control of the vehicle, once appropriate updates have been made for the actions that resulted in the collision. It also clears all short-term memory activity in the adaptive module so that when learning recommences actions and predictions associated with pre-collision states are not updated further.

The adaptive local navigation module

Context and reward inputs, outputs and initial values

The motivate component of the wander module generates reinforcement signals based on touch (collision) and motor reward signals. The actor and critic components both have the current depth map as their context input. The critic learns to predict sums of future motivation signals, the actor learns the control vector y(t) = (f(t), a(t)) that determines the movement of the vehicle. Initially, the prediction and action functions are uniform throughout the input space. The default output of the actor corresponds to a forward velocity equal to half the vehicle’s maximum speed and a steering angle of zero. In other words, the vehicle is set to travel in approximately a straight line. The default prediction is equal to the maximal attainable reward (i.e. it is optimistic). Clearly, as learning proceeds, both the prediction and actions will adapt and become specialised for the different regions of the input space.
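As a small illustration of the motor control conversion described above, the sketch below maps a desired forward velocity and steering signal onto left and right wheel speeds and applies a per-step limit on how quickly each wheel speed may change. The wheelbase and acceleration limit are illustrative values, not those used in the simulation.

```python
# Sketch of the conversion from the actor's output (f, a) to wheel speeds,
# with a simple per-step acceleration/braking limit (values are illustrative).

def clamp(value, lo, hi):
    return max(lo, min(hi, value))

def wheel_speeds(f, a, wheelbase=0.3):
    """Convert forward velocity f and steering signal a into (lws, rws),
    following lws = f + 0.5*wheelbase*a and rws = f - 0.5*wheelbase*a."""
    return f + 0.5 * wheelbase * a, f - 0.5 * wheelbase * a

def limit_change(desired, current, max_delta=1.0):
    """Restrict the change in a wheel speed to at most max_delta per time-step."""
    return clamp(desired, current - max_delta, current + max_delta)

# Example: one motor-control update from current to desired wheel speeds.
lws_now, rws_now = 3.0, 3.0                  # current wheel speeds
lws_des, rws_des = wheel_speeds(f=5.0, a=0.5)
lws_next = limit_change(lws_des, lws_now)
rws_next = limit_change(rws_des, rws_now)
```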
Motivation In order to learn useful local navigation behaviour the robot must move around and explore its environment. If the system is rewarded solely for avoiding collisions an optimal strategy would be to remain still or to stay within a small, safe, circular orbit. To generate more interesting behaviour some additional ‘drive’ is required. This task of providing suitable incentives to the learning system is performed by the motivate component. In the experiments described below motivate combines two different sources of reward. To promote obstacle avoidance the signals from the external collision sensors are combined to compute a ‘collision’ reward. This signal is zero unless the robot makes contact with an obstacle which produces an immediate negative reward signal. To encourage exploration internal feedback from the motor system is used to derive a ‘movement’ reward which is a Gaussian function of the vehicle’s current absolute translational velocity. This reward is maximal (zero) when the robot is travelling at an optimal speed and becomes increasingly negative away from this optimum. CHAPTER SIX LOCAL NAVIGATION 140 There are two points to note about the movement reward signal. First, by basing the signal on translational motion only (i.e. excluding rotational movement), the system is discouraged from following small circular orbits. Second, by using the absolute velocity the system is rewarded equally for travelling backwards at speed as it is for travelling forward. This can encourage the robot to back out of difficult situations. The system learns to travel forward most of the time, not because forward motion is in itself preferable, but because the robot discovers that reversing is a dangerous activity—since all of its range sensors are directed forward it is unable to predict, and hence avoid, collisions that occur while backing. The output of motivate is the total reward given by the weighted sum of the collision and movement rewards. The reward horizon of the module is infinite and discounted, in other words, the critic learns to predict a discounted sum of future motivation signals. Full details are given in the Appendix. Input coding As discussed in chapters four and five, to learn non-linear functions of a continuous input space the input vectors must be recoded in a suitable manner. Both a priori and adaptive recoding methods have been considered hitherto, however, it is clear that the latter generally makes the learning problem significantly more difficult. For this reason a fixed, a priori coding method was chosen for these initial experiments in adaptive local navigation. To have the advantages of a localist coding and yet retain some generalisation ability the CMAC coarse-coding architecture [2] (Section 4.1) is employed. In the experiments described below, identical CMAC codings are used by both actor and critic components. Output computation The CMAC encoding effectively maps the current input pattern into activity in a small number of the stored parameters of each adaptive component. For a given input the value of the prediction computed by the critic is obtained simply by averaging the values of all the active parameters. CHAPTER SIX LOCAL NAVIGATION 141 The procedure for obtaining the output of the actor component is slightly more complex. The design of the actor component is based on the Gaussian action unit [184] described in chapter three. 
As its output the actor generates the control vector y(t) by selecting random values from Gaussian probability distributions specified as functions of the current depth map. To generate each element of the two-valued action requires a mean and a standard deviation. Hence, in all, the actor has four sets of adaptive parameters from which it computes four values (again by averaging over the active, CMAC-indexed parameters in each set). Figure 6.5 illustrates the mechanism for recoding the input and computing the critic and actor outputs. Details of the exact coding system and activation rules are given in the Appendix.

Learning

The critic and actor components both make use of short term memory (STM) buffers for encoding recent states and actions. The critic’s STM records the activity trace of past activations required for the TD(λ) update rule while the STM of the actor records the eligibility trace of each of its adaptive parameters (see section 2.2). The reset line allows the contents of these buffers to be erased by a signal from the recover module. As described in chapter two, the use of STM traces allows a more rapid propagation of credit and thus gives accelerated learning. Again, full details of the update rules for all parameters are given in the Appendix.

The learning cycle

The following summarises the steps that are carried out at each time interval:
1) The sensor modules map and touch obtain new perceptual input which is communicated to the control modules recover and wander. The latter also receives the motor feedback signal from the motor control module.
2) The motivate component of wander generates a primary reinforcement signal.
3) The critic component calculates a new prediction and the TD error.
4) Actor and critic components update their parameters according to the contents of their STM buffers.
5) If a collision has occurred the recover module takes temporary control, suppresses learning and erases all STM buffers in the adaptive module, otherwise the actor generates new control signals based on the current depth map.
6) The STM memory buffers are updated.
7) The motor control module attempts to execute the motor commands.

Figure 6.5: Recoding and output computation of actor/critic components of the wander module. The CMAC coding divides the input space into a set of overlapping but offset tilings. For a given input the value of each stored function is found by averaging the parameters attached to all the tiles that overlap that point in space. The critic generates a prediction of future reward, the actor the desired forward velocity and steering angle of the vehicle (f, a). Each element of the action is specified by a Gaussian pdf and is encoded by two adjustable parameters denoting its mean and standard deviation. The action is chosen by selecting randomly from the two distributions specified. (Note: the figure depicts 3×(3×3) CMAC tilings of a two-dimensional space; the simulations use 5×(5×5×5) tilings of the three-dimensional space (3 rays) of depth patterns.)

6.4 Results

To test the effectiveness of the learning algorithm the performance of the system was compared before and after fifty thousand training steps in the environment shown in figure 6.3.
Averaged over ten runs5 the proportion of steps ending in collision fell from over 5.4% before training to less than 0.1% afterwards. At the same time the average translational velocity more than doubled to just below the optimal speed. These changes represent a very significant improvement in successful, obstacle avoiding, travel. This is further illustrated in Figure 6.6 which shows sample vehicle trajectories before and after training. The paths shown represent two thousand time-steps of robot behaviour. The dots show the vehicle’s position on consecutive time-steps, crosses indicate collisions, and circles show new starting positions. After training, collision-free trajectories averaging in excess of forty metres were achieved compared with an average of less than one metre before training. The requirement of maintaining an optimum speed encourages the vehicle to follow trajectories that avoid slowing down, stopping or reversing. However, if the vehicle is placed too close to an obstacle to turn away safely, it can perform an n-point-turn manoeuvre requiring it to stop, back-off, turn, and then move forward. It is thus capable of generating quite complex sequences of actions. Furthermore, although the robot has no perceptual inputs near its rear corners its behaviour is adapted so that turning movements implicitly take account of its body shape. Most of the obstacle avoidance learning occurs during the first twenty thousand steps of the simulation, thereafter the vehicle optimises its speed with little change in the number of collisions6. The learning process is therefore quite rapid; if a mobile robot with a sample rate of 5 Hz could be trained in this manner it could begin to move around competently after roughly one hour of training.

5On each run the three measures of performance were computed by averaging over five thousand time-steps before and after training with the learning mechanism disengaged.
6See Figure 6.7 below.

Figure 6.6: Trajectories before and after training for 50,000 simulation steps. Dots show vehicle positions, crosses show collisions, and circles new starting positions.

Learning in different environments

Some differences were found in the system’s ability to negotiate different environments with the effectiveness of the avoidance learning system varying for different configurations of obstacles. In the following, the original environment (Figures 6.3, 6.6) is referred to as E1 and the two new environments illustrated in Figure 6.7 as E2 and E3 respectively.

Figure 6.7: Test environments E2 and E3.

The performance in each environment was measured, before and after training, by the percentage of steps resulting in collisions and by the average translational velocity. These results, averaged over ten runs in each environment, are shown in Figure 6.8.

Environment    collisions (%)           translational velocity (cm/step)
               before      after        before      after
E1             5.43        0.09         3.77        7.40
E2             6.29        0.24         3.77        7.42
E3             8.56        0.55         3.64        6.94

Figure 6.8: Performance of the local navigation system in different training settings. The differences between environments in the ‘collisions’ measure are significant for all comparisons (p<0.05).

These results indicate substantial variation in performance for different configurations of obstacles.
Although there is a major improvement over the training period in local navigation skill in all settings, the number of collisions in E2 and E3 is certainly not negligible—in the worst case (E3) a collision occurs approximately once in every two hundred time-steps after training. The experiments also suggest that some obstacle configurations are, a priori, more difficult to negotiate than others—this is indicated by the identical ranking between the before and after values of the ‘collisions’ measure. Unfortunately it is difficult to precisely specify the differences between environments. The test situations were not devised according to any strict criteria; indeed, the question of what criteria could be applied is itself a research issue7. Possible explanations for the failure to learn optimal, collision-free behaviour in relation to characteristics of different obstacle configurations will be considered further in the discussion section below.

7The problem of categorising environments in relation to adaptive behaviours is considered in [187, 171, 88]. However, the criteria suggested largely assume a discrete state-space.

Transfer of acquired behaviour between environments

The specificity of the acquired behaviour was evaluated by testing each trained system on the two unseen environments as well as on the original training environment. On average, obstacle avoidance skill transferred with reasonable success to unseen settings. For instance, Figure 6.9 shows performance in E2 after training in environment E1.

Figure 6.9: Sprite behaviour in a novel environment. The trajectories show behaviour (without learning) after transfer from the training environment (E1).

Performance was, however, slightly better when training and testing both took place in the same setting. This was shown by a higher percentage of collisions in any given environment if the system had no previous training there. Any drop in performance after moving to a novel environment can, however, be made up if the system continues to train after the transfer. This process is illustrated in Figure 6.10 which shows the average reinforcement signal during training, initially in environment E1, and subsequently in E2.

Figure 6.10: The change in the reward received over a training run in which the simulated vehicle was transferred to the new situation (E2) after fifty thousand simulation steps and allowed to adapt its actions to suit the new circumstances. (Average reward is plotted against simulation steps.)

As a final test of the flexibility of the acquired behaviour, ten training runs were carried out during which the vehicle was repeatedly transferred between the three environments. After training these systems did not perform significantly worse in any single test environment than systems that were trained exclusively in that setting. These findings encourage the view that the acquired behaviour is capturing some general local navigation skills rather than situation-specific actions suited only to the spatial layout of the training setting.

Exploration

An automatic annealing process takes place in the exploration of alternative actions as the local probability functions adapt to fit the reward landscape (see section 3.2). Figure 6.11 illustrates this process, showing a record of the change, over training, in the variance of selected actions from their mean values. A small sketch of the kind of update that produces this effect is given below.
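The sketch below illustrates the kind of Gaussian action unit update that produces this annealing effect. The learning rate, the lower bound on the standard deviation, and the use of a single scalar reinforcement are assumptions made for the example; the eligibility terms follow the general form given by Williams rather than the exact update used in the simulation.

```python
# Sketch of a Gaussian action unit with adaptive mean and standard deviation
# (after Williams-style eligibilities; parameter values are illustrative).

import random

class GaussianActionUnit:
    def __init__(self, mean=0.0, std=1.0, lr=0.01, min_std=0.05):
        self.mean, self.std, self.lr, self.min_std = mean, std, lr, min_std
        self.last_action = mean

    def select(self):
        """Sample an action from the current Gaussian distribution."""
        self.last_action = random.gauss(self.mean, self.std)
        return self.last_action

    def update(self, reinforcement):
        """Move the distribution towards actions that did better than expected.

        With eligibilities (a - mu)/sigma^2 and ((a - mu)^2 - sigma^2)/sigma^3,
        positive reinforcement for actions close to the mean narrows the
        distribution (annealing exploration), while positive reinforcement for
        outlying actions widens it again.
        """
        a, mu, sigma = self.last_action, self.mean, self.std
        self.mean += self.lr * reinforcement * (a - mu) / sigma**2
        self.std += self.lr * reinforcement * ((a - mu)**2 - sigma**2) / sigma**3
        self.std = max(self.std, self.min_std)   # keep a minimum level of exploration
```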
For both elements of the output vector the expected fall-off in exploration is seen as the learning proceeds. The increase in variance following transfer to the new environment may indicate the ability of the algorithm to respond to changes in the local distribution of reward by increasing exploration8.

Figure 6.11: Change in average exploration over training. (The panels show the standard deviation of the forward velocity (cm) and of the steering angle (degrees) plotted against simulation steps.)

What is learned?

The different kinds of tactical behaviour acquired by the system can be illustrated using three-dimensional plots of the preferred actions of the control system. After training for fifty thousand time-steps in an environment containing two slow moving rectangular obstacles the mean desired forward velocity and steering angle were as shown in Figure 6.12. Each plot in the figure shows the preferred action as a function of the three rays from the simulated laser range-finder: the x and y axes show the lengths of the left (-60°) and right (+60°) rays; the vertical slices correspond to different critical lengths (9, 35 and 74 cm) of the central ray (0°); and the height of the surface indicates the mean action for that position in the input space.

8Alternative interpretations of these results are possible however. For instance, the observed increase in exploration following transfer may be due to increased sampling of input patterns to which the system has had little prior exposure.

Figure 6.12: Surfaces showing desired forward velocity and steering angle for different values of the central ray (9, 35 and 74 cm). (Panels a, c and e show forward velocity; panels b, d and f show steering angle.)

The graphs9 show clearly several features that might be expected of effective wandering behaviour. Most notably, there is a transition occurring over the three vertical slices during which the policy changes from one of braking then reversing (graph a) to one of turning sharply (d) whilst maintaining speed or accelerating (e). This transition clearly corresponds to the threshold below which a collision cannot be avoided by swerving but requires backing-off instead. There is a considerable degree of left-right symmetry (reflection along the line left-ray = right-ray) in most of the graphs. This concurs with the observation that obstacle avoidance is by and large a symmetric problem. However some asymmetric behaviour is acquired in order to break the deadlock that arises when the vehicle is faced with obstacles that are equidistant on both sides.

Varying the reward schedule

The behaviour acquired by the system is very sensitive to the precise nature of the primary reinforcement signal provided by the motivate component. Varying the ratio of the ‘movement’ and ‘collision’ rewards, for instance, can create noticeable differences in the character of the vehicle’s actions. An increase in the relative importance of the ‘movement’ reward encourages the vehicle to optimise its trajectory for speed, which means going very close to obstacles in order to follow the fastest possible curves.
Alternatively, if ‘collision’ rewards are emphasised, then the acquired behaviour errs on the side of caution—the vehicle will seek to maintain a healthy distance between itself and obstacle surfaces at all times. A minor addition to the motivate function causes the system to switch from wandering to wall-following. This is achieved by adding a component to the reinforcement signal that encourages the vehicle to maintain a fixed, short distance on either its left or right ray. An example of the acquired wall-following behaviour is shown in Figure 6.13 for a system trained in the environment E2. Unfortunately the transfer of behaviour across environments was noticeably poorer for wall-following than for the original wandering competence. Adding extra constraints to the reinforcement signal may mean that optimal behaviour is more rigidly specified and thus less flexible or transferable.

9In order to eliminate irrelevant contours arising from unvisited parts of the space (still at default values) a smoothing algorithm was applied to the CMACs before sampling. This had the effect of propagating values of neighbouring cells to positions that had been updated less than 200 times (less than 1% of the maximum number of updates). The algorithm thus has no effect on the major areas of the function space that generate behaviour.

Figure 6.13: Wall-following behaviour can be acquired by altering the function computed by the motivate component.

Different sensor systems

A number of experiments were carried out with modified sensor systems. In one experiment 5% Gaussian noise was added to the range-finder measures. This did not produce any significant reduction in performance. In a second experiment the range-finder model was modified to simulate an optic flow sensor [18] which detects the speed of visual motion at a set angle from the current heading. Direct optic flow signals confound visual motion due to environment structure with motion due to rotation of the vehicle. However, because of the characteristics of the vehicle kinematics, the motion detected by a sensor that is directed straight ahead is due solely to vehicle rotation. By subtracting this forward signal from the motion detected by other sensors the rotational component of flow can be removed. Four such signals from sensors at fixed angles of -60°, -15°, +15°, and +60° were used as input to the adaptive control system. In spite of the fact that the flow signals vary for different vehicle velocities, the controller was, nevertheless, able to acquire effective local navigation skill. The main difference from the results with the range-finder simulation was that, with the optic flow signals, acquired backing-up behaviour was less effective. A likely reason for this is that the robot must be moving in order to generate non-zero flow signals. To go from forward to backward motion involves having zero motion at some point. This suggests that with the optic flow context some complementary sensor mechanisms may be needed to guide starting and stopping behaviour. Attempts are currently underway to test the adaptive controller on a real robot platform with an array of visual motion sensors.

6.5 Discussion

The following sections consider some of the issues arising from the simulation, analyse sources of difficulty, and suggest some possible future improvements. This is followed by a review of related work on adaptive local navigation.
Partial State Information The learning system acquires successful wandering behaviour despite having only partial state knowledge. The information in the three depth measures provided by the ray-caster is clearly insufficient to predict the next input pattern though it seems largely adequate to predict rewards and to determine suitable behaviours. The underlying task may be a deterministic Markov process, however, as a result of the lack of sufficient context, the learning system experiences a stochastic process and must try to optimise its behaviour with respect to that. Furthermore, because the system’s actions are changing over time, the transition and reward CHAPTER SIX LOCAL NAVIGATION 155 probabilities in this process are non-stationary10. In as far as the learning system is successful in acquiring suitable behaviours this demonstrates that the gradient climbing properties of the learning algorithm, that exploit the correlation between actions and rewards to improve behaviour, do not require, either explicitly or implicitly, the ability to model the causal processes in the agent’s interaction with the world. This is not to say that more information would not allow improved learning—I will argue below that improving the context data available to this system should give better performance. Rather, it seems evident that the learning process does not expect or depend on full state knowledge. In many tasks in complex, dynamic environments it will be impossible to determine full state descriptions. This is one of the fundamental limitations of dynamic programming as a means of determining effective control behaviour. Relating delayed reinforcement learning methods to dynamic programming allows the former to exploit the strong convergence properties of the latter. However, to make this equation too emphatically ignores the property of reinforcement learning that it is, at base, a correlation learning rule that neither needs nor notices whether it has sufficient data to formulate a causative explanation of events. To paraphrase the quotation from Brooks [23] given in chapter one, reinforcement learning uses the aspects of the world that the agent is sensing as its primary formal notion, whereas dynamic programming (and its incremental variants) use the state of the world as their central formal notion. The range of applicability of reinforcement learning should therefore be enormously more diverse than that of incremental dynamic programming though it may be a weaker learning method. Given this argument that learning does not require full state information it is nevertheless a truism that the ability of the system to learn optimal actions depends critically on what characteristics of the current state it does observe. The following therefore considers how the sparse perceptual information used by the adaptive wander module could be enhanced to give better performance. 10In particular, the rewards for the system are dependent on its actions (forward velocity) about which there is no direct information in the context input. CHAPTER SIX LOCAL NAVIGATION 156 Perception Many of the collisions which occur after training involve the vehicle hitting the convex corners of obstacles. A comparison between the different test environments bears out this view—the increase in difficulty across environments appears to be matched by an increase in the number of convex corners in each setting (9, 16, and 27 for E1, E2, and E3 respectively)11. 
One possible explanation for this finding is the extremely sparse sensing. The three ‘rays’ of the simulated laser range-finder often fail to catch the nearest point of an obstacle and can thus give a false indication of distance to the nearest surface. Given the coarseness of this perceptual apparatus it is perhaps surprising that the system performs as well as it does. A second possible cause of failure to avoid collision is the lack of any direct context input concerning the vehicle’s instantaneous forward and angular velocities. This is equivalent to an implicit but questionable assumption that, for any given input pattern, a single action vector (desired velocities) will be appropriate no matter how the vehicle is currently moving. It is easy to see situations where this might not be appropriate. For instance, if an obstacle lies directly in front of the vehicle then different actions may be required if the vehicle’s current angular velocity is carrying it towards the left than if it is moving towards the right. The learning system can minimise the impact of this deficiency by developing characteristic patterns of approach behaviour for any local setting. However, when similar local geometries are experienced in two different settings different approach behaviour may be inevitable. The lack of suitable proprioceptive context signals therefore encourages the acquisition of more inflexible behaviour than might otherwise be learned. Unfortunately, given the current learning architecture, enriching the observed state information is not just a matter of loading more range-finder measures and/or motor feedback signals into the context input. Using the current localist coarse-coding technique (CMACs) this would result in an exponential growth in 11Clearly this is only one of several differences that may be significant. For instance, variation in the number of obstacles or their density may also be having an affect. CHAPTER SIX LOCAL NAVIGATION 157 both the number of adaptive parameters and the amount of time and experience required for learning. Of course, any increase in the dimensionality of the input vector will almost certainly increase the redundancy of the context signals. Hence adaptive recoding techniques such as those discussed in chapters four and five could be appropriate for compressing the input space or adapting the granularity of the input coding. A further possibility for enriching the context while maintaining, or increasing, the rate of skill acquisition is to introduce further information gradually. For instance, learning could begin using a very coarsegrained coding to which detail is gradually added as training proceeds. An example of this approach would be a hierarchy of CMAC tilings of different resolutions. Training would begin with the lowest resolution tiling allowing rapid generalisation of acquired knowledge. As local navigation skill improves, tilings of higher resolutions would gradually come into use generating a more accurate fit to an optimal control strategy. Finally, rather than simply adding extra context, the sensor modules themselves could be made adaptive. For instance, a module that controls the direction of sampling of the laser range-finder could be trained to tune the sampling direction according to some reward-related feedback signal. An adaptive attentional mechanism of this sort might allow the sensor system to pinpoint nearby obstacles with increased accuracy without a blanket increase in the amount of sampling by the range-finder. 
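As an illustration of the coarse-to-fine scheme suggested above, the sketch below blends a coarse and a fine tiling of the input space, shifting weight towards the finer tiling as training proceeds. The resolutions, the linear blending schedule, and the class structure are assumptions for the purpose of illustration only.

```python
# Sketch of a coarse-to-fine hierarchy of tilings (illustrative only; the
# resolutions and blending schedule are assumptions, not the thesis design).

import numpy as np

class Tiling:
    """A single uniform tiling over a bounded input space, one parameter per tile."""
    def __init__(self, lows, highs, bins):
        self.lows = np.asarray(lows, dtype=float)
        self.highs = np.asarray(highs, dtype=float)
        self.bins = bins
        self.params = np.zeros([bins] * len(lows))

    def index(self, x):
        rel = (np.asarray(x, dtype=float) - self.lows) / (self.highs - self.lows)
        return tuple(np.clip((rel * self.bins).astype(int), 0, self.bins - 1))

    def value(self, x):
        return self.params[self.index(x)]

    def update(self, x, error, lr=0.1):
        self.params[self.index(x)] += lr * error

class CoarseToFine:
    """Blend coarse and fine tilings, shifting weight to the fine one over time."""
    def __init__(self, lows, highs, resolutions=(3, 9), anneal_steps=20000):
        self.tilings = [Tiling(lows, highs, b) for b in resolutions]
        self.anneal_steps = anneal_steps
        self.t = 0

    def weights(self):
        w_fine = min(1.0, self.t / self.anneal_steps)   # fine tiling phased in
        return [1.0 - w_fine, w_fine]

    def value(self, x):
        return sum(w * tl.value(x) for w, tl in zip(self.weights(), self.tilings))

    def update(self, x, error):
        self.t += 1
        for w, tl in zip(self.weights(), self.tilings):
            tl.update(x, w * error)
```

Early in training all credit flows to the coarse tiling, giving rapid generalisation; as the fine tiling is phased in, local detail can be captured without discarding the coarse approximation already learned.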
Multi-modal actions

The learning system selects only one of the multiple solutions possible for wandering behaviour. Although the acquired behaviour is versatile, in that a range of acceptable actions is determined for each region of the input space, this distribution is unimodal whereas, in general, a full catalogue of the possible options for any specific context input might have a multi-modal distribution. Millington [101] has described a reinforcement learning technique capable of learning multi-modal outputs which could be applicable here12.

12Millington divides the input space into a number of discrete cells then assigns a number of Gaussian pdf ‘modes’ to each cell which adapt to explore different regions of output space. A further alternative would be to adopt a discrete output-space quantisation and use a Q-learning algorithm to determine the values of alternative actions.

The existence of multiple solutions creates particular difficulties for the learning system in situations where the input vector is left-right symmetric, for example, where the robot is moving toward a head-on collision with a wall. In this situation veering strongly either to the left or the right is generally an acceptable solution. However, since both options are equally likely to occur through random exploration, it is possible that by repeatedly trying first one and then the other the learning system will consistently cancel out the positive effect of both and be left with a policy of moving directly forward—a clear compromise, but the worst possible strategy. This situation would not arise, however, if there were two systems, one specialised in left turns, the other in right turns; indeed, since one could be the mirror-image of the other only one system might actually be required. A separate module would then have the task of deciding which option to employ in any given circumstances.

Goal conflict

The adaptive module is rewarded both for successful avoidance behaviour and for moving reasonably quickly. Inevitably, therefore, the strategy that develops is a compromise between these initially antagonistic goals. In the early stages of learning the critic rapidly detects a strong positive correlation between moving at speed (desirable) and colliding with obstacles (undesirable). A likely response to this double-bind is to become excessively cautious or acquire prevarication (i.e. dithering) behaviours. Indeed, the most difficult part of setting up the learning algorithm was found to be tuning the global parameters of the adaptive module so as to overcome the initial ‘hump’ of discouraging experience and the consequent disinclination to move around. Relating reinforcement to speed of movement was necessary because, in the absence of any more specific objective, some incentive was required that would force movement and exploration. In general, however, it would be desirable to replace the ‘movement’ reward by some measure of success more closely related to a genuine navigation goal. For instance, a way-finding module might specify a local target position or a target heading; reward would then be related to success in satisfying that criterion. However, the problem of conflicting goals will always exist for a vehicle with variable velocity as any type of movement increases the risk of collision with stationary obstacles13.
Furthermore, unlike the ‘movement’ reward more appropriate incentives may provide only intermittent feedback making the optimal reinforcement gradient even more difficult to detect and follow. In building robots that are increasingly self-reliant and self-motivated it seems likely that this problem of balancing conflicting goals will become more not less important. This suggests that future research will require a better understanding of the operation of motivational mechanisms in both animals and robots14. Related work on adaptive local navigation There has been a substantial amount of recent work on acquisition of local navigation skill. The following is not an attempt at an exhaustive review, rather, the objective is to identify some of the progress that has been made and the different approaches that have been taken. Zapata et al. [190] implemented both pre-wired and adaptive local navigation competences obtaining qualitatively similar results with each when tested on mobile robots in outdoor and indoor environments. The adaptive competences were acquired using multi-layer neural networks controllers trained by backpropagation in a teaching signal which was some pre-defined function of the input vector (for instance this function might relate to deformations of a desired 13Some experiments were conducted in which movement was encouraged indirectly by populating the environment with moving obstacles. This should motivate the robot to move around as remaining in any one location will not guarantee obstacle avoidance. However, in general, these experiments were not very successful suggesting that some more straight-forward and obvious link between movement and reward is needed to overcome the disincentives that arise from collisions. 14Toates [170] gives a review of motivation research in psychology and ethology, Halparin [58] has described a model for investigating issues in robot motivation. CHAPTER SIX LOCAL NAVIGATION 160 obstacle-free zone around the robot vehicle). Three adaptive modules were developed including a module for dynamic collision avoidance. Nehmzow [113] describes acquisition of forward motion, collision avoidance, wall-following and corridor-following skills in a small, indoor mobile robot. Learning occurred in a single-layer network with an eight-bit input vector describing the state of two touch sensors (whiskers) and a forward motion sensor. The output of the network selected one of four possible actions—turn left, turn right, move forward, or move back, where the chosen action was timed to last for a fixed period. Exploration of behaviours occurred simply by rotating through the alternative outputs. Stimulus-response associations were reinforced which satisfied a set of instinct rules. These rules described desirable sensory conditions such as “Keep the forward motion sensor on”, “Keep the whiskers straight”, etc. Relatively complex behaviours such as wall-following were developed in this manner through the interaction of a number of such instinct rules. Learning for these tasks was very rapid, both because of the small size of the input and output spaces and because the network was specified in a manner that made the input patterns linearly separable with respect to satisfactory outputs. Typically, a behaviour such as wall-following could be learned in just a few minutes. More efficient or accurate manoeuvring would clearly require an increase in the size of the search-space and would therefore result in slower learning speed. 
However, this research does make the point that very simple acquired reflexes can support robust behaviour.

The research in adaptive local navigation by Kröse and van Dam [73, 74] bears the closest resemblance to the work described in this chapter. They describe a simulated robot whose task is to avoid punishment signals incurred through collisions with obstacles. The motor speed of the vehicle is fixed and positive, hence stationary, backing-up or procrastination behaviours are not possible. This removes a major source of locally-optimal behaviours but at the same time significantly restricts the manoeuvrability of the vehicle. The simulation was trained using an Actor/Critic learning architecture for which the input was a vector of logarithmically-scaled measures from eight range-finder sensors set in a semi-circular arc at the front of the vehicle. The stochastic output of the actor component moved the current heading of the robot either to the left or the right.

A major difference from the work reported here was in the recoding methods used to form a quantisation of the state-space. Both a priori and adaptive discrete quantisation techniques were tested. The adaptive techniques included unsupervised learning using a Kohonen self-organising network and nearest neighbour coding using an on-line node generation and pruning algorithm. The latter worked in the following manner. Starting with just one node, whenever a collision occurred new units were added at the points in the input space corresponding to the last M input patterns prior to the collision. In order to prevent an explosion in the number of nodes, adjacent units with similar output weights were periodically merged. The prediction values of units were also monitored and units with persistently low predictions were removed from the network. The system was tested in a two-dimensional environment of polygonal obstacles. The training schedule was somewhat different from that used in the experiments described above; however, roughly similar training times were required to obtain good performance. The performance with the nearest neighbour coding, with a final size of 80 units, was similar to that obtained with a Kohonen network with 128 units and better than with an a priori quantisation with 256 cells. The nearest neighbour coding method appears to have been reasonably successful; however, the mechanism for generating new nodes is heuristically driven and exploits the task characteristic that avoidance can be initiated close to the collision site. This heuristic might be less effective if task parameters were changed, for instance, if the turning circle of the robot was increased, requiring earlier initiation of avoidance activity.

Millan and Torras [100] describe a reinforcement learning approach to the local path-finding problem. They consider an abstract model in which a ‘point robot’ learns to move to nearby goal positions in an environment containing only circular obstacles. Success is defined as entering a circle of small radius around the goal position. The inputs to the learning system, which assume some relatively complex pre-processing of sensory signals, consist of an attractive force exerted by the goal and repulsive forces exerted by nearby obstacles. The output of the learning system gives the size and direction of a movement vector in the coordinate frame defined by a direct line to the goal.
The reinforcement provided on every simulation step is a function of the attractive and repulsive force inputs and the current clearance between the robot and the nearest obstacle. Reward is maximum at the goal and minimum when a collision occurs. The learning system is based on the actor/critic architecture and employs a variant of Gullapalli’s [56] algorithm for determining suitable exploration behaviour (section 3.2). In order to allow the acquisition of non-linear input-output mappings the actor component is an MLP network whose hidden units are trained by generalised gradient ascent in the reinforcement signal. Certain domain-specific heuristics were used to facilitate the discovery of suitable paths, and some intervention (re-siting the robot after collisions) was needed to allow successful learning. Although reinforcement learning is a very slow method for planning, the local path-finding skills acquired by this system did generalise moderately well to other configurations of circular obstacles. The acquired behaviour therefore does meet the criterion proposed here that local navigation competences should not be environment-specific.

Millan and Torras share a view similar to the one advocated in this chapter, that navigation requires a combination of planning mechanisms that employ world models and local navigation mechanisms encoding stimulus-response behaviours. Whether the appropriate place to interface these two mechanisms is at the level of local path-planning is an open question. In chapter eight I will suggest that the way-finding mechanism should have a more continuous role than that suggested by Millan and Torras (specifying intermediate goal positions). For instance, the way-finding system might continually update its planned route as the robot moves and direct the local navigation systems by specifying the current desired heading. This increases the burden of work for the way-finding system (means for coping with this burden are discussed in chapter eight) but reduces the need for reactive modules that are adapted to maximise very long-term rewards. Acquisition of local navigation skill should therefore be a faster and more straightforward process.

Conclusion

This chapter has argued that the problem of navigation should be divided between a way-finding component that builds world models and performs planning, and a number of local navigation components which together implement appropriate local manoeuvres for traversing the desired route. I have also demonstrated that sophisticated local navigation behaviour can arise from sequences of learned reactions to raw perceptual data. A simulation of a mobile robot has been described that acquires successful wandering behaviour in environments of polygonal obstacles. The trajectories generated by the simulation often have the appearance of planned activity since each individual action is only appropriate as part of an extended pattern of movement. However, planning only occurs as an implicit part of the learning process that allows experience of rewarding outcomes to be propagated backwards to influence future actions taken in similar contexts. This learning process exploits the underlying regularities in the robot's interaction with its world to find an effective mapping from sensor data to motor actions in the absence of full state information.
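For readers who want the mechanism behind this backward propagation of reward spelled out, the following is a generic temporal-difference sketch with an eligibility trace. It is illustrative only: the linear critic, the parameter names and the default values are standard textbook choices, not the specific update rules or settings used in the simulations reported in this chapter.

    # Generic sketch of how delayed reward is propagated backwards in time:
    # a temporal-difference critic update with a decaying eligibility trace.
    import numpy as np

    def td_update(V, features, reward, next_features, trace,
                  alpha=0.1, gamma=0.95, lam=0.8):
        """V: weight vector of a linear critic; features: current state coding."""
        delta = reward + gamma * np.dot(V, next_features) - np.dot(V, features)
        trace = gamma * lam * trace + features   # decaying record of recently visited states
        V = V + alpha * delta * trace            # credit flows back along the trace
        return V, trace

The trace term is what lets a reward received now adjust the predictions (and hence the action preferences) attached to states visited several steps earlier, which is the sense in which the learned behaviour comes to look plan-like without any explicit planning.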
Chapter 7

Representations for Way-finding: Topological Models

Summary

This chapter and the next consider the construction of representations that can support way-finding. Some recent research in Artificial Intelligence has favoured spatial representations of a primarily topological nature over more quantitative models on the grounds that they are: cheaper and easier to construct, more robust in the face of poor sensor data, simpler to represent, more economical to store, and also, perhaps, more biologically plausible. This chapter suggests that it may be possible, given these criteria, to construct sequential route-like knowledge of the environment, but that integrating this information into more powerful layout models or maps may not be straight-forward. It is argued that the construction of such models realistically demands the support of either strong vision capabilities or the ability to detect higher-order geometric relations. And further, that in the latter case, it seems hard to justify not using the acquired information to construct models with richer geometric structure that can provide more effective support to way-finding.

7.1 Introduction

The problem of navigating in large-scale space has been the subject of considerable study both by Artificial Intelligence researchers interested in building autonomous mobile robots, and by ethologists and psychologists interested in understanding the navigation behaviour of animals and humans. For a long time these two strands of research have been largely independent; recently, however, there has been considerable cross-over of ideas. Robot builders have begun to seek inspiration from natural navigation systems as examples of robust systems that have been selected and fine-tuned by evolution. Meanwhile, psychologists and animal behaviourists have started to recognise that robotics research represents a rich resource of detailed computational models that can be used to evaluate and inspire theoretical accounts.

A recent trend in robotics research has involved a substantive change in the character of the systems under investigation. Specifically, the emphasis of ‘classical’ AI methods on detailed path-planning using metric models of the environment (e.g. [34, 40, 64, 89, 136, 174]) has been rejected by some researchers in favour of the use of more ‘qualitative’ methods and models (e.g. [38, 72, 77, 78, 81, 82, 84, 92, 94, 112]). Researchers with this perspective often regard metric modelling and planning as supplementary to a core level of navigation skill based, primarily, on representations of topological spatial relations. This new approach has, as part of its motivation, the perceived inadequacies of classical systems which are regarded as over-reliant on accurate sensing and detailed world models. It is suggested that such systems are both too ‘brittle’ in the face of degraded or missing sensory information, and too costly in terms of the computational and memory resources they require. In contrast, the emphasis on topological models can be understood as part of a general trend toward ‘behaviour-based’ robot control (i.e. animat AI) that seeks minimal reliance on internal models, and—where representations are considered essential—the need for on-line acquisition of appropriate knowledge.
Representations are therefore preferred that are simple to construct, cheap to store, and support a ‘graceful degradation’ of performance when confronted with unreliable sensory data.

A second motivation for investigating topological spatial models is research on human way-finding. Much of this literature follows a theory originating with Piaget [124, 125] that human spatial knowledge has a hierarchical structure and is acquired through a stage-like process. Specifically, Piaget, and later Siegel and White [154], have argued that a fundamental stage in the acquisition of spatial knowledge is the construction of qualitative models of the environment from more elementary sensorimotor associations. This representation is then gradually supplemented by distance and direction information to form a more detailed quantitative map. An important element of this theory is the view that a primarily topological representation can support robust way-finding behaviour in everyday environments.

Computational models inspired by the human way-finding literature have been described by Leiser [82] and by Kuipers [76-81]. The latter in particular has developed a number of robot simulations of considerable sophistication and detail based on the hypothesis of a hierarchical representation of spatial knowledge. The following extract serves to illustrate this theoretical position:

“There is a natural four-level semantic hierarchy of descriptions of large-scale space that supports robust map-learning and navigation:

1. Sensorimotor: The traveller’s input-output relations with the environment.

2. Procedural: Learned and stored procedures defined in terms of sensori-motor primitives for accomplishing particular instances of place-finding and route-following tasks.

3. Topological: A description of the environment in terms of fixed entities, such as places, paths, landmarks, and regions, linked by topological relations such as connectivity, containment and order.

4. Metric: A description of the environment in terms of fixed entities such as places, paths, landmarks, and regions, linked by metric relations such as relative distance, relative angle and absolute angle and distance with respect to a frame of reference.

In general, although not without exception, assimilation of the cognitive map proceeds from the lowest level of the spatial semantic hierarchy to the highest, as resources permit. The lower levels of the cognitive map can be created accurately without depending greatly on computational resources or observational accuracy. A complete and accurate lower level map improves the interpretation of observations and the creation of higher levels of the map.” ([81], p. 26, my italics)

However, while agreeing with many aspects of this position it is possible to question one of its central themes—that successive levels build on those below. This issue can be highlighted by contrasting the above view of human way-finding with much of the literature from the wider field of animal navigation. This evidence suggests a discontinuity between procedural knowledge and the use of map-like metric spatial representations [52, 119, 120]. For instance, in contrast to the incremental hierarchy of spatial knowledge outlined above, O’Keefe [119, 120] has argued that there are two fundamentally independent navigation systems used by mammals including man.
The first of these, which he calls the taxon system, is supported by route-like chains of stimulus → action → stimulus associations. Each element in such a chain is an association that involves approaching or avoiding a specific cue, or performing a body-centred action (generally a rotation) in response to a cue. Taxon strategies therefore have a similar nature to the procedural knowledge in the second level of Kuipers’ hierarchy. O’Keefe’s second system, called the locale system, is, however, a ‘true’ mapping system in that it constructs a representational stimulus → stimulus model describing the metric spatial relations between locations in the environment. Evidence for the existence of this system and its independence from taxon strategies consists of both observational and laboratory studies of animal behaviour, and neurophysiological studies suggesting that different brain structures underlie the two systems. Although the highest level of Kuipers’ hierarchy can be identified with O’Keefe’s locale system, the former suggests a continuity—with assimilation of information onto ‘weaker’ representations to generate the metric model, whereas the latter stresses the discontinuity and apparent autonomy of the two alternative mechanisms. A further difference is that O’Keefe’s theory bypasses the level of the topological map; if such a map exists it is as an abstraction from the full metric representation.

Gallistel [52], who provides a recent and extensive review of research on animal navigation, also concludes that animals make considerable use of metric data for navigation. Like O’Keefe he also proposes a modular and autonomous mapping system that stores a metric representation of spatial layout1.

Footnote 1: O’Keefe and Gallistel agree on the existence of a separate metric mapping system but largely disagree on the relative importance of dead reckoning and environmental fixes in constructing the map. This debate will be considered further in chapter eight.

The controversy highlighted above really rests on the importance of metric knowledge in constructing map representations of space. On the one hand is the view that a useful level of topological map knowledge can be acquired without relying on higher-order geometric information; on the other, the assertion that the ability to detect large-scale metric spatial relations is the key to constructing an effective model. It might be helpful here to delineate alternative views on this issue more clearly. Two possible positions are as follows:

1. That a map describing the topological spatial relations between salient places can be efficiently constructed from sensorimotor and procedural knowledge without any need to detect metric spatial relations.

2. That the construction of any sort of map of environmental layout is best begun with the detection of metric spatial relations. These can then be used to construct a metric model from which topological relations can be abstracted as required.

Both of these positions are as much a matter of practice as of principle. They are concerned with building useful and robust representations without making excessive demands on the agent’s resources of time, perceptual skills, computational abilities, and memory. I believe that, given these caveats, the second view would be reasonably acceptable to O’Keefe, Gallistel, and many researchers adopting more quantitative approaches to robot map-building.
The first view is possibly too strong to ascribe to most builders of behaviour-based robots. The aim of much of this research is—not unreasonably—to build a working system that exploits whatever information can be acquired cheaply and re-acquired with fair reliability2. This approach might therefore be better characterised as follows.

Footnote 2: Many systems, for instance, make use of odometry (wheel-speed) information and simple compass sensors to compute rough estimates of spatial position that are used in constructing a primarily topological model.

3. That a map describing primarily topological spatial relations can be efficiently constructed from sensorimotor and procedural knowledge with some additional but only approximate knowledge of metric spatial relations.

This approach, however, is open to the criticism that if metric relations are detected then they may as well be stored and exploited. Why not construct a metric model? Even if such a model is only approximate it should still have some advantages over a topological map, for instance in estimating direct routes. If the detected relations are too inaccurate to be worth storing then they will, as likely, be of little use in constructing the topological model. If this argument is correct, then the case against building a metric map can only be based on the assertion that the construction or use of such a model is significantly more expensive or complex. This issue will be taken up in the next chapter where it will be argued that the use of more quantitative models is compatible with the general research paradigm of Animat AI.

A further problem with this third approach is that it undermines Kuipers’ theory of the spatial semantic hierarchy. If the flow of information is not principally bottom-up then the argument for placing the metric map at a higher level than the topological model is weakened. In accordance with the more classical approach, the construction of metric spatial relations could be viewed as having a more direct link to the sensorimotor level than the topological map. In other words, the theory of the semantic hierarchy would lose much of its bite.

On the basis of these arguments this chapter therefore focuses on the first approach, which I will call the ‘strong’ topological map hypothesis—the possibility of constructing topological models without reliance on higher geometric knowledge. The first section reviews some of the general issues in constructing and using models, topological or otherwise, of large-scale space for way-finding. This discussion is not intended to elucidate any new or startling truths but rather to define some relatively basic ideas and seek to identify some of their logical consequences. The second section then considers, in relatively abstract terms, the construction of route representations and their assimilation into topological maps. Some empirical attempts at constructing such systems, for mobile robots or robot simulations, are reviewed in the following section, and the chapter concludes by considering some of the implications for building spatial representations that can support way-finding in artificial and natural systems.

7.2 The problem of way-finding

A pre-requisite for way-finding is knowing where you are in the world. Knowing where you are involves being localised—that is, knowing what place you are currently in, and being oriented—knowing where other places are with respect to you.
Constructing a representation for way-finding in a novel environment is about accumulating information that will allow you to answer these ‘where?’ questions and then solve the travel problem of finding and following a path from the current location to the desired goal. The ‘where?’ questions are intimately tied up with questions of what constitutes a place (or a location); it is therefore appropriate to begin by attempting to define this fundamental spatial entity.

The concept of place

A common-place understanding of the idea of a ‘place’ is of a particular point or part of space3. This definition indicates that places gain their identities not through any sensory quality of the objects that occupy them but by virtue of their relationships to the other parts of space. These spatial relations can be considered the indispensable4, or primary, qualities of a place since no two places can have the same spatial relations without merging their identities. Places can, however, be associated with non-spatial qualities. These are properties that can be directly determined from sensation such as the shape, colour, and texture of the surfaces of objects, or by characteristic odours, temperature, and sounds. However, the non-spatial qualities of a place are dispensable or secondary attributes since they do not guarantee uniqueness. Two distinct locations that are indistinguishable in terms of their secondary attributes nevertheless remain different places5.

Footnote 3: This is the first meaning of the word listed in many dictionaries (see, for instance, “Collins Dictionary of the English Language”, Collins: London, 1979).

Footnote 4: The distinction between primary and secondary qualities was made by Locke and perhaps originated with Aristotle. Kubovy [75] introduced the terms indispensable and dispensable attribute to more precisely characterise this distinction. Shephard [151] contains a further discussion of spatial representations in this light.

Although local sensory characteristics do not define location they can, however, play an important role in constructing a model of space. Salient places often do have distinctive characteristics and by detecting these features the primary spatial identity can be decided with some degree of certainty. It is important to note that this distinction between the primary and secondary characteristics of places is not a distinction between different types of sensation. The sensed geometric properties of the local scene, for instance the depth, slant, and shape of surfaces, can be treated as distinctive sensory patterns with no special regard given to their spatial content. Considered in this way they constitute secondary characteristics. Perceptions of depth, shape, etc. can clearly also be used as a source of information from which spatial relations are constructed. Regarded in this way they provide the data from which the primary characteristics of places are determined. Almost any sensory modality can either supply cues for spatial location, or be treated as a source of local, sensory patterns. For instance, the high temperature in proximity to a heat source can be considered as a distinctive secondary characteristic of that place. However, the temperature gradient away from the source can also be used to indicate distance from the source and can therefore be treated as a spatial cue. The non-local nature of spatial invariance presents a difficult perceptual problem.
The navigator must construct information about the spatial relations in the environment out of its local sensory experience. This problem gives rise to two fundamental questions that underpin a long controversy in the psychological6 literature. First, what is the nature of the spatial relations that are encoded in cognitive spatial representations? And, second, what is the role of different types of sensation in constructing these relations? In considering these questions, we must first give some substance to the debate by defining more carefully the different classes of spatial relations in terms of their geometric properties.

Footnote 5: This view of the nature of places follows that of O’Keefe [120] who also attributes it to Newton and Kant: “The notions of place and space are logical and conceptual primitives which cannot be reduced to, or defined in terms of, other entities. Specifically, places and space are not, in our view, defined in terms of objects or the relations between objects. [...]. It is sometimes convenient to locate a place by reference to objects occupying it [...] but this is only a convenience.” ([120] p. 86.)

Footnote 6: O’Keefe [120] gives an excellent review of the history of this debate in the psychological and philosophical literature.

Geometric properties of spatial representations

There are various ways of classifying spatial relations. One classification, for instance, is based on the methods used to derive theorems in geometry. This gives rise to the distinction between synthetic geometry, which is developed from purely geometric axioms concerning the properties of points, lines, and figures in space, and analytic geometry, which builds on number theory and algebra through the use of numerical coordinates. An alternative classification is to categorise geometric theorems according to content. For instance, many of the theorems of geometry are concerned with magnitudes—distances, areas, angles and so forth—which are determined with respect to a measure of the distance between two points called a metric. One way of describing this body of theory is that it concerns those properties that are invariant under the magnitude-preserving, displacement transformations of rotation and translation. These theorems of metric geometry allow tests of congruence (identity of size and shape) and the analysis of rigid motions of figures. A further set of theorems, however, concerns properties that survive transformations that involve changes in magnitude. For instance, under affine transformations angles, distances, and areas can become distorted but other properties are retained, such as that parallel lines remain parallel. Projective transformations introduce further distortions but preserve still more basic properties such as the collinearity of points (that is, lines remain lines), and the notion of one geometric figure lying between two others. Finally, the most radical topological transformations disrupt all but the most fundamental spatial relations of connectivity (that is, that adjacent points remain adjacent). In fact, metric models of space stand at the top of a hierarchy7 of geometric models—metric, affine, projective, and topological—in which each level necessarily preserves the spatial relations of every lower level.
Hence, metric models incorporate all more fundamental relations, but topological models do not encode any more detailed structure. Figure 7.1 illustrates a simple geometric figure undergoing successive transformations, each disrupting spatial relations at a deeper level.

Footnote 7: This hierarchical view of geometry was introduced by the mathematician Felix Klein in a famous address given in 1872 known as the “Erlanger program”; its relevance to research in cognitive maps has previously been considered by Gallistel [52].

Figure 7.1: A simple geometric figure (a) subjected to successive transformations that preserve affine (b), projective (c), and topological (d) spatial relations.

Theoretically at least, a navigator could construct a representation that describes the spatial relations of its environment up to any level in the above hierarchy. For instance, if the model was metric then it would encode (at least implicitly) estimates of the distances and relative angles between locations. At the other end of the scale a topological model would encode information solely about which locations are adjacent to each other. The debate over the nature of spatial representations for way-finding is generally characterised as being between these two positions. However, it is important to note that models corresponding to intermediate geometries are also possible. For instance, a representation that encoded no magnitude information might yet specify sets of locations as lying on straight lines (i.e. it could encode certain projective relations).

There is a further important geometric property that does not fit neatly into the hierarchical model of geometry. This is the property of the sense of a geometric figure, which is preserved under all transformations except those that involve reflection (Figure 7.2 below). The construction of a useful representation of space often requires that the sense of spatial relations in the environment is detected and stored; that is, it demands the ability to distinguish up from down and left from right (or clockwise from anti-clockwise).

Figure 7.2: The sense relation is disturbed by any transformation involving reflection.

In the case of metric representations a specific measure of distance must be established. It seems reasonable to assume that any representing system, natural or artificial, that seeks to capture the metric spatial relations of its environment will not employ a metric that is seriously at odds with reality. This rules out many of the different metrics studied in theoretical geometry that are known to conflict with the three-dimensional geometry of the real physical world8. Both classical Euclidean geometry and the non-Euclidean, hyperbolic geometry qualify on these grounds as possible models. However, given that they are empirically indistinguishable in environments on the scale considered here, the Euclidean geometry is, perhaps, to be preferred on the grounds that it is simpler to deal with9. Henceforth, therefore, by metric representation we will assume models based on Euclid’s metric.

A further class of spatial relations are ‘qualitative’ ones such as “A is near to B” or “B is between A and C”. These relations are sometimes grouped together with the topological relation of adjacency whereas in fact they are propositional statements of higher-order geometric relations.
This suggests that when ‘unpacked’ this type of knowledge will show some approximate or ‘fuzzy’ knowledge of quantitative spatial relations.

Footnote 8: Gallistel [52] gives an extensive argument in support of this supposition in modelling animal cognitive maps. The issue is perhaps less clear in considering robot systems that navigate in very artificial environments. Here other metrics, such as the ‘city-block’ metric distance(x1, x2) = Σi |x1i − x2i| (the sum of the absolute coordinate differences), may have useful applications.

Footnote 9: Courant and Robbins [39] (chapter four) contains a good introduction both to this issue and to the general topic of the classification of geometrical properties.

The qualitative relation of containment represents a further important form of spatial knowledge that is particularly relevant in constructing models of space that describe containment hierarchies. This type of knowledge seems to have important psychological reality for humans [161, 175]. Containment relations are relevant in way-finding as a means of performing planning at different spatial scales; however, qualitative planning at a more abstract level must at some point be turned into specific instructions that guide local navigation. In this context planning must be concerned with inter-object spatial relations—representations of these relations are therefore the main concern of this chapter.

Representations for way-finding

In keeping with the terminology of earlier chapters, the forms of representation that might be employed are distinguished here in terms of the mapping they describe. In chapter one the distinction was made between dispositional associative mappings, such as stimulus → action (S → A), and representational models such as stimulus → action → stimulus (S → A → S) or stimulus → stimulus (S → S). In the context of acquiring way-finding skills the term stimulus refers to the characteristics of salient locations in the environment while actions indicate motor control behaviours (or chains of such behaviours). I have argued in the previous chapter that dispositional learning will not support general way-finding skills. Rather, a store of spatial knowledge is required that is more flexible and task-independent; in other words, some form of model is needed. In this chapter and the next, methods for constructing topological and metric maps will be described and contrasted. Before considering this issue, however, some brief comments are appropriate on possible structures for representing knowledge of the physical world. Two principal structures will be considered here: graphs that describe (primarily) topological relations, and coordinate systems that can be used to describe higher-order spatial relations.

Graph representations of space

A graph G is defined by a set of vertices V and a set of edges E, where e_ij ∈ E is the edge connecting vertices v_i and v_j. Graphs can be either directed or undirected; in the former edges can be traversed only in one specified direction, in the latter case all edges can be traversed in both directions. A graph is a very powerful model for representing space. For instance, if the vertices of the graph are identified with salient locations and the edges with the spatial relations between locations, then the way-finding problem partially reduces to the soluble problem of graph search—finding a sequence of connecting edges between start and goal vertices in the graph.
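To make this reduction concrete, the following is a minimal sketch of way-finding as search over a place graph. The breadth-first strategy and the example place names are illustrative assumptions only; they are not a claim about how any of the systems reviewed later in this chapter plan their routes.

    # Minimal sketch: way-finding as breadth-first search over a place graph.
    from collections import deque

    def find_route(adjacency, start, goal):
        """Return a sequence of connected places from start to goal, or None."""
        frontier = deque([[start]])
        visited = {start}
        while frontier:
            route = frontier.popleft()
            place = route[-1]
            if place == goal:
                return route
            for neighbour in adjacency.get(place, ()):
                if neighbour not in visited:
                    visited.add(neighbour)
                    frontier.append(route + [neighbour])
        return None    # goal not reachable in the known graph

    # Example: a small undirected place graph stored as an adjacency mapping.
    places = {"door": ["hall"], "hall": ["door", "kitchen", "stairs"],
              "kitchen": ["hall"], "stairs": ["hall", "landing"], "landing": ["stairs"]}
    print(find_route(places, "door", "landing"))   # ['door', 'hall', 'stairs', 'landing']

Any standard search method would do here; the point is simply that once the environment has been contracted onto a graph, route generation is a solved problem, and the real difficulty lies in building and maintaining the graph itself.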
Furthermore, the idea of a containment hierarchy can be easily expressed using a hierarchy of graphs in which each corresponds to a different physical or psychological scale. For instance, the graph which models the layout of a room can become a vertex in the graph of a building, which is itself a vertex in the graph of a street, and so on. Such a hierarchy allows planning to occur at different levels of abstraction, giving efficient search for paths between very distant positions.

Coordinate systems

The discovery of analytical geometry, by Descartes, led to the unification of number theory and geometry. This allowed the continuous, spatially-extended structure of the physical world to be fully described in abstract, numerical terms. Such a description is achieved by the use of a coordinate system. A coordinate system is established by a set of reference lines or points that describe a coordinate frame. The position of any point in space is then identified by a set of real numbers (coordinates) that are distances or angles from the reference frame. The use of coordinates greatly enhances the power of a spatial representation, allowing, at least potentially, all the techniques of trigonometry and vector algebra to be brought to bear on geometric problems.

In constructing spatial representations, graphs and coordinate systems are not incompatible. For one thing, a metric model can always be contracted onto a graph by a suitable partitioning of the space; the vertices of the graph could then be labelled with specific positions relative to the global coordinate frame. A further possibility is for neighbouring regions of physical space to be modelled with respect to separate coordinate frames, and for these frames then to be linked in a graph model. An atlas of pictorial maps is an example of such a representation. An extension of this idea is to tag the edges of the graph with descriptions of the spatial relations between the coordinate frames at the vertices. This can allow the reconstruction of spatial relations across frames (across the pages of the atlas, as it were). Chapter eight considers some specific models of this type.

Having determined some of the constraints on the construction of spatial models we are now in a position to consider the virtues and vices of alternative forms of representation that can support the task of way-finding. The next section considers the acquisition of route-based knowledge and the possibility of assimilating information of this kind into topological models of environmental layout.

7.3 From routes to topological maps

Constructing route knowledge

An agent exploring its environment will experience a temporally ordered sequence of local views. The continuity of this experience contains information about the spatial layout of the world—successive experiences relate to adjacent locations. The simplest form of model—a straight-forward record of successive stimuli (S0 → S1 → S2 …)—can therefore be derived directly and easily from experience. Furthermore, if information is stored concerning the required movements between each location then a record (S0 → A0 → S1 → A1 → S2 → A2 …) of a retraceable sequential model can be created (a minimal sketch of such a record is given below). To store and re-use such a model requires that the agent segments its experience into a set of places (points or regions of finite extent) where control decisions are made.
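The sketch below shows one way such a retraceable sequential record might be held, as a simple store of (stimulus, action) pairs. The class name and the recognition and execution hooks are invented for illustration and stand in for whatever segmentation, place-recognition and local navigation machinery the agent actually possesses.

    # Minimal sketch of route knowledge as a retraceable record of
    # (stimulus, action) pairs; 'recognise' and 'execute' are placeholders
    # for the agent's own perceptual and motor competences.
    class Route:
        def __init__(self):
            self.steps = []                      # [(stimulus, action), ...]

        def record(self, stimulus, action):
            self.steps.append((stimulus, action))

        def retrace(self, recognise, execute):
            """Replay the route: confirm each expected stimulus, then act."""
            for stimulus, action in self.steps:
                if not recognise(stimulus):      # lost the path: fall back on search
                    return False
                execute(action)
            return True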
These places are then linked by records of the sensory information or motor control programs that will allow the agent to retrace its trajectory between each pair of adjacent locations. Adopting the language of graph theory, decision places can be considered as vertices and the intervening links as edges in an unknown graph of the environment. A sequential model therefore describes a walk—an alternating sequence of vertex and edge elements—through such a graph. In general a walk may cross itself or go back on itself arbitrarily often. A walk in which all the vertices are known to be distinct is termed a path and corresponds to what is usually described in the navigation literature as route10 knowledge.

Footnote 10: The concept of route knowledge arises frequently in the literature though definitions vary. The definition given here is similar to that given by O’Keefe [120] who gives an extended discussion of navigation by following routes.

To retrace a specific walk or route the environment must be segmented in the same manner on each occasion. This implies that the agent must possess a set of segmentation functions that use sensory data and local navigation skills to perform the following low-level processes:

1. Classify any visited point in the environment as belonging either to a vertex or to an edge11.

2. Detect the exits from a visited vertex to adjoining edges and discriminate12 between different exits (the number of detected exits determines the degree of the vertex).

3. Orient toward a desired exit.

4. Follow the edge connecting two vertices.

Footnote 11: This segmentation function could allow a point to be recognised as belonging neither to a vertex nor an edge, but simply to intervening, undifferentiated space. In such circumstances the navigator would then depend on search strategies to move to a position that could be identified with an element of the graph.

Footnote 12: This discrimination can be absolute, that is, based on sensory characteristics of the exits, or, in the absence of such characteristics (in a true maze for instance), it can be relative to the last edge traversed. An example of the latter, for a planar graph, would be to distinguish between exits by ordering them clockwise starting with the one through which the node was entered.

The problems involved in segmentation are clearly non-trivial. Several attempts have been made at solving the segmentation problem for qualitative navigation systems in mobile robots; these systems will be considered further below. Although route knowledge can support travel along known paths it does not allow novel or short-cut routes between locations to be determined. As such it is little more than a chain of dispositional associations. Navigation by route-following also runs the risk that if the path is lost, through any sort of error, then the navigator has no fall-back method other than search by which to get back on track. This argues for a richer representation in which the information gained during each trajectory through the environment is integrated into some overall layout plan. In the metric-free case this means the construction of the true underlying graph of the environment. Henceforth the term topological map will be used to describe a representation of this type. In constructing a topological map the temporal constraint that successive experiences denote adjacent locations is clearly insufficient.
Instead, the agent must be able to judge when experiences separated by arbitrary time intervals are generated from a single spatial location. In other words, a fundamental requirement of map-building is place identification—the ability to uniquely identify and then re-identify places. Given appropriate segmentation functions that convert the temporal flow of experience into a walk through the unknown graph, the place identification task can be defined as the problem of determining, as each vertex is entered, whether it should be added as a new vertex in the partially-built graph or should be identified with an existing one.

Constructing a topological map by place identification

It was suggested earlier that the identity of places can be inferred from comparisons of either primary characteristics—spatial relations—or secondary characteristics—distinctive local sensory attributes. With both types of comparison, similarity is taken as evidence in favour of identity and dissimilarity as evidence against it. Given that any comparison will be subject to errors arising directly or indirectly from sensing inaccuracies there is always a danger of making incorrect matches. Furthermore, in comparing secondary qualities there is the additional problem that false positive matches can arise because places may not have unique sensory attributes.

There is clearly an important distinction that needs to be made between the knowledge that is required to construct a model and the knowledge that is encoded within the stored representation. For instance, a navigator might employ knowledge of metric relations in constructing a model that encodes only topological ones. However, one of the strongest arguments for the construction and use of a topological map rests on the idea that such a representation can be constructed without recourse to knowledge of higher spatial relations. This possibility must therefore be given some consideration.

To build a topological map without exploiting further geometric knowledge, the decision as to whether two places are identical must rely primarily on secondary qualities. This can be demonstrated by considering the problem of mapping an environment in which all vertices have identical appearance and degree. The correct graph of such an environment will be indistinguishable, on the basis of connectivity alone, from an infinite number of other graphs. Figure 7.3 illustrates this fairly obvious point for graphs of degree two.

Figure 7.3: If all the places in the environment are identical in appearance then the correct graph cannot be determined on the basis of sensed topological relations—each of the graphs shown here is indistinguishable on this basis from each of the others.

Assuming that the environment does contain places that have secondary characteristics that are locally but not necessarily globally distinctive (a not unreasonable assumption perhaps), then the task of place identification involves:

1. A comparison of the secondary characteristics.

2. A spatial test in which the adjacency relations of the two places are compared. That is, neighbouring locations are tested for identity.

The second test is clearly recursive: it demands that identity is established for neighbouring places, which can mean testing the neighbours of the neighbours, and so on. This process will only be halted by tests of identity that succeed on the basis of the first test alone.
In other words, by testing locations whose secondary characteristics are known, a priori, to be globally distinctive. The process of determining and carrying out a suitable series of adjacency tests that will terminate in this way is called a rehearsal procedure [44, 78]. Consideration of the rehearsal procedure gives rise to several questions. First, how many such globally distinctive places (GDPs) are required to make mapping a given environment possible? Second, given that rehearsal is a costly process that involves substantial extra travel to known locations, how efficient are the possible rehearsal procedures?

Neither of these questions has been properly addressed in the literature. The question of how many GDPs are needed to map an environment would seem to depend both on the underlying graph of the environment and on the prevalence of locally distinctive sensory characteristics. Kuipers and Byun [78] state that a single unique place or ‘home’ is sufficient to eliminate false positive matches in constructing a graph. However, this may not always be the case. For instance, figure 7.4a illustrates an environment isomorphic to the complete graph K4 in which one vertex is globally distinctive (shaded black) but the remaining three are indiscriminable (shaded grey). It appears that no sequence of actions in this graph will allow it to be discriminated from the graph with three self-connected vertices13 (figure 7.4b).

Footnote 13: Assume that the graph is embedded in the plane and that the edges at any vertex can only be distinguished by ordering with respect to the last edge traversed. At any vertex, then, the map-builder has the choice of taking either the left or right exit or returning along the edge by which it entered. To discriminate between the two graphs therefore requires that there exists a sequence of such actions that generates a different sequence of sensory experiences in the two graphs. In computer simulation using sequences of up to 24 actions no such discriminating sequence has been found. Unfortunately, I have been unable to determine a simple proof that no such sequence can exist for this graph although intuitively this seems a reasonable proposition.

Figure 7.4: A fully-connected graph of four vertices (a) which has only one globally distinctive vertex. This graph is indistinguishable by any rehearsal procedure from a graph with three self-connected vertices (right).

The graph in figure 7.4a can be reconstructed, however, if at least one other vertex is locally or globally distinctive. The question of whether there are graphs that cannot be mapped given two or more GDPs is unresolved. To some extent, however, the existence of such pathologically unmappable environments is largely academic. In environments with such sparse sensory features and very few GDPs it is evident that every call to the rehearsal procedure will have to go to considerable depth. The cost of this process will therefore scale rapidly with the size of the graph, making map-building a very inefficient process.

Dudek et al. [44] have considered an interesting and related problem of constructing the graph of an environment lacking distinctive secondary characteristics by using portable markers.
Their work, which assumes appropriate segmentation functions, demonstrates that a map-builder with one or more globally distinct markers (that can be both put down and picked up) can map any graph in a sequence of moves that is polynomial in the number of markers and graph vertices. Furthermore, simulation studies by these authors show that although the costs of this rehearsal procedure are lowered considerably if the map-builder has several retrievable markers (rather than just one), these overheads are, nonetheless, still large.

The arguments of the preceding paragraphs suggest that, to avoid lengthy and possibly non-terminating rehearsal behaviour, topological map-building systems must either make use of place markers or assume that many locations will be discriminable on the basis of sensory characteristics alone. Failing that, they must exploit knowledge of higher spatial relations.

7.4 Practical approaches to topological map building

The previous section has described in abstract the necessary mechanisms for constructing a topological map of an environment without relying on the detection of metric spatial relations. It has been argued that the agent must have perceptual and motor mechanisms that can generate a segmentation of the environment, re-identify many locations in a reliable and robust fashion from their secondary characteristics alone, and disambiguate any locations that are not globally distinctive by a rehearsal procedure. How realistic are these tasks? To a large extent this is an empirical issue that is best answered by proposing candidate solutions then evaluating their success on a robot platform or, failing that, in a robot simulation. This section therefore considers some of the work in qualitative navigation relating to this issue and attempts to draw some preliminary conclusions.

In considering approaches to the construction of topological maps it is possible to distinguish two alternative methods of segmenting the environment to form the elementary components of a graph model. The first is to associate the vertices of the graph with distinctive zero-dimensional points in space and the edges with a network of distinctive one-dimensional paths connecting these positions. Alternatively, the vertices can be associated with extended regions of space and the edges with the boundaries between regions. These boundaries will generally be determined by topographic features, for instance by landmarks and physical barriers, thus covering the environment with a tessellation of contiguous, irregularly-shaped areas. The two forms of graph are closely related—for any abstracted tessellation there is a dual which is a network and vice versa. However, there are some practical differences in the way each form of segmentation can be performed and, as importantly, exploited. The latter differences arise because in a network the agent is almost always in transit between places on a defined path, whereas in a tessellation transitions are instantaneous, paths are not specifically designated, and the agent is always either in one place or another.

A number of recent robot navigation systems are based on the network model [71, 72, 78-81, 92-94, 157]. Here two of these systems are described in some detail in order to highlight some of the strengths and weaknesses of this approach.
Tessellation models have been less widely studied; however, Levitt and Lawton [84] have described and simulated a system of this type which is also considered below.

The ‘NX’ robot simulation

Kuipers and Byun [78-81] describe a robot simulation, ‘NX’, designed to implement the theory of a ‘spatial semantic hierarchy’ described earlier. As this theory suggests, NX constructs spatial knowledge at multiple levels; here, however, we will focus on the procedures implemented to construct a topological representation in the form of a network map. NX operates in a two-dimensional environment of corridors and rooms using simulated sonar and compass senses. The robot identifies places and paths during exploration in terms of a set of pre-defined ‘distinctiveness’ measures. In the simulations described the principal measure employed is a function “Equal-Distance-to-Near-Objects” that computes a single scalar value from the distance data returned by a ring of sonar sensors. This function generates zero-dimensional peak values at the intersections of corridor-like environments, and ridges of locally maximal value along the mid-lines of passages. The former are therefore defined to be the vertices of the constructed graph and the latter the edges. Since many points in the environment do not lie on the graph, the robot relies on a hill-climbing search strategy to find distinctive places and relocate paths. The number of directions of free travel surrounding the robot at a distinctive place is estimated using sonar readings. These exits are then discriminated from each other by using a local co-ordinate frame defined by an absolute compass. Edges are followed using local navigation behaviours. For instance, a corridor-following behaviour is used to track the mid-line between two walls and a wall-following behaviour to move along the boundary wall of a room. To perform place recognition, vertices are distinguished by the pattern of local geometry detected by sonar. Similar vertices are disambiguated using metric information accumulated by dead reckoning and through the use of a shallow rehearsal procedure that tests the identity of nearby places up to a depth of one transition in the graph.

The Michigan robot

Researchers at Michigan University [71, 72, 83] have begun investigating topological map-building for a mobile robot platform with vision and sonar sensors which navigates in a corridor environment. The robot builds a network model of space based on the notion of gateways, which are defined as transition points between distinct areas of free-space. Typically, gateways are positions such as corridor intersections and doorways that are identified by characteristic sonar patterns. The robot constructs both route knowledge and a separate representation of the topological map. Each vertex in this map is identified with one or more gateways. The areas of free-space between gateways constitute the edges of the graph. However, the lack of a precise definition for edges means that the robot relies on environment structure to stay on a path14. The Michigan robot uses vision as its primary sense for place identification. At each gateway a local view representation is constructed by storing arrays of visual patterns obtained by rotating to a small number of orientations while remaining in a fixed position. By matching the current visual scene with the pattern recorded in one of these segments the robot is able to re-orient itself at a gateway.
The overlap in sensory patterns between adjacent segments can also be used to determine the required direction of turn to face in any desired orientation. Exits are defined by areas of free-space detected by sonar as adjacent to the gateway. Each exit is linked to the sensory pattern stored when the robot is facing in that direction. Thus to choose a particular exit the robot orients toward the appropriate segment of the stored local view. To perform place identification the Michigan robot attempts to match the view at any gateway with records of previously observed local views. If necessary it rotates itself to observe the scene at different view orientations in order to check for matches. A running estimate of the likelihood that two gateways correspond to a unique place is maintained and when this ‘hypothesis strength’ exceeds a threshold the robot assumes that the place is unique and assigns it to a vertex in the topological map. As in the NX simulation the robot also uses a shallow rehearsal procedure to disambiguate similar locations. Local distance information obtained by dead reckoning and stereo matching is also exploited to some degree in determining the adjacency relations of places.

Footnote 14: In the experiments described in [72] edge-following behaviour has yet to be developed and the robot is led by hand through the environment.

The ‘Qualnav’ robot simulation

The Qualnav simulation developed by Levitt and Lawton [84] constructs a multi-level representation of large-scale space that, like the NX system, can be viewed as an instance of Kuipers’ [81] ‘semantic hierarchy’ theory. Qualnav comprises a metric-level model which, in situations of degraded or absent metric data, and under some strong assumptions of visual scene analysis and object recognition skills, can generate a tessellation model of topological spatial relations. This review will focus on the elements of the model relating to the construction of the topological map without reliance on metric knowledge. The proposals for metric map construction and use will be considered in the next chapter. The topological mapping system is able to construct a model, and plan and follow routes, on the basis of detected adjacency and sense relations, and qualitative judgements of visual angle.

Qualnav is an attempt to address the problem of navigation in open, non-urban environments. It therefore does not attempt to contract the world onto a finite number of places and paths; rather, it seeks to define a segmentation of space into finite regions separated by virtual boundaries. To do this Qualnav uses vision to identify landmarks, which are defined as prominent topographic features with distinctive secondary characteristics of shape, colour, etc.
Each region in this tessellation has a unique description in terms of the left-right ordering of each of the visible landmark pairs. Note that without range-sensing the boundaries between regions can only be detected as they are crossed, and it is impossible to say prior to crossing which boundaries are the most proximal.

Figure 7.5: Segmentation in the Qualnav model (adapted from Levitt and Lawton [84] p. 327).

A route description consists of a list of desired LPB crossings. An LPB can be crossed between the two landmarks, or on the left or right side of both landmarks. In the former case orientation is determined by moving in such a way as to create a qualitative increase in the visual angle between the two landmarks, in the latter case by moving so as to decrease the visual angle up to the point where one landmark occludes the other. A further possibility is for the robot to head directly towards a landmark. A production system rule-base can be used to optimise a route by combining certain pairs of desired LPB crossings into more efficient ones. The Qualnav model assumes that landmarks will be correctly identified from different viewing positions and re-identified on subsequent visits, however, it does not assume that any single landmark is necessarily re-acquired. Failure to re-identify any single landmark will reduce the set of LPBs that can be constructed, however, the remaining LPBs may still provide an adequate, though less refined, localisation of the robot.

Discussion

In both of the network systems the ability to create and maintain a mapping between the world and the graph clearly depends on the environment being highly structured. Segmentation assumes a world in which physical barriers constrain travel to a finite number of narrow paths and the distinction between intersections and connecting paths is reasonably clear-cut. In more open environments with large areas of traversable space the partitioning of the world would be more arbitrary, harder to define and then maintain. NX gets around this problem to a limited extent by treating more open areas (for instance any large room) as a circuit of paths each with a single boundary wall. Less drastic methods for mapping an open space to multiple network graph elements are clearly possible. However, the difficulties in identifying and re-identifying particular places and paths in an open area may render the idea of a unique contraction onto a network impractical. Furthermore, the constraint of limiting travel to a finite number of 'virtual' paths between fixed choice points will become increasingly inappropriate in more open settings. The Qualnav model demonstrates that by 'flipping' into the alternative tessellation mode of graph description a useful segmentation can be achieved in open terrain. This model also successfully relaxes the arbitrary constraint of following a finite number of paths. It would be interesting to see if the tessellation approach could be applied successfully in more structured environments such as buildings.

Place recognition and orientation

The NX simulation uses local sonar patterns to characterise different places. However, this sort of local geometry information is not likely to be globally distinctive, hence the need to use dead reckoning to make the place identification task possible.
The lack of sensed secondary characteristics also makes it necessary to use compass information to distinguish the exits from a vertex. The model consequently relies on metric information to create the topological map and, to this extent, abstracts the topological map from knowledge of higher geometric relations. The need to exploit metric knowledge demonstrates the difficulty of topological mapping with only sparse sensory data. The Michigan robot is more ambitious in its attempts to build a network map using primarily topological and sensory data. By using vision this system allows for the possibility of exploiting visual pattern matching and object recognition for the task of place identification. The system also constitutes a commendable attempt at performing the orientation task without exploiting explicit direction sensors. The emphasis in this research on relatively complex sensory processing amply demonstrates the need for rich descriptions of invariant secondary characteristics if the use of higher geometric knowledge is to be avoided. These descriptions must be robust to superficial and minor changes in sensation (due, for CHAPTER SEVEN WAY-FINDING: TOPOLOGICAL MODELS 187 instance, to changing lighting conditions, moving objects, etc.) but at the same time be sufficiently unique to justify an assumption of global distinctiveness for many of the places that are encountered. The Qualnav simulation similarly demonstrates the need for a sophisticated vision front-end to provide reliable landmark acquisition and identification. Conclusion Topological maps The nature of the spatial relations encoded by a world model determines the type of navigation behaviour that can be supported. Procedural (or route) knowledge can only support movement along segments of known paths. Knowledge of the topological layout of the environment gains the navigator the ability to identify suitable sub-goals, and generate and follow novel routes between locations. However, because this knowledge is limited to knowing the connectivity of places navigation is constrained to using known path segments between adjacent subgoals. A navigator with a topological map who enters unfamiliar territory can explore new paths and construct new map knowledge but cannot engage in goaldirected movement to target sites. The ability to determine short-cut or straightline routes across un-explored terrain requires knowledge of higher-order spatial relations. Where such behaviours are observed in animals this is usually taken as strong evidence for the use of a metric model. That such skills would be very useful to an animal or robot is undeniable giving a strong incentive for constructing and using knowledge of this type. Given the value of metric knowledge is there a justification for constructing only a weaker form of spatial representation? So far this chapter has considered one possible argument for this view—the idea that a topological map could be constructed without the need to detect higher geometric relations. Mathematically topological geometry is simpler and more basic than metric geometry—it requires fewer axioms and specificies weaker constraints. I have argued here that this mathematical simplicity belies the real difficulties of constructing topological knowledge in the absence of metric knowledge. 
I have suggested that such a model is in general realisable only if the agent has sensory abilities that can be relied on to give accurate identification and re-identification of most locations, and that in practice this may require vision skills capable of object recognition or, CHAPTER SEVEN WAY-FINDING: TOPOLOGICAL MODELS 188 at least, of very robust pattern matching. If this view is correct then this ‘strong’ topological map hypothesis fits uneasily with the bottom-up bias of Animat AI, and, indeed, with the current perceptual capabilities of most robots. The problem of constructing a topological map are eased considerably by introducing additional constraints to the map-building process. One possible constraint is to limit the behavioural repertoire of the robot. One way this might be achieved is to force the robot to maintain a travel-path that follows object boundaries [37, 92, 112]. This constraint of ‘wall-following’ assists substantially both in segmentation and place identification by limiting the maximum degree of vertices in the graph, reducing the number of true choice points (i.e. of vertices of degree>2), and by avoiding the need to represent places that lack distinctive local geometric structure. This approach has lead to some successes in building wayfinding systems for indoor autonomous robots, however, it also has an obvious cost—open spaces will be poorly represented in the map and movement will be more rigidly limited to a small set of paths. In general Animat systems have also not been entirely metric-free. The use of some approximate metric knowledge underlies what might be called the ‘weak’ topological map approach—the possibility that metric information might be computed and exploited in map-building, but might not actually be explicitly recorded or used for way-finding. Given the advantages that metric knowledge of any sort can endow the main justification for this proposal must be that the cost or complexity of building (or using) such representation would outweigh its usefulness. That there may be methods that are robust to noise yet relatively simple to compute and use will be argued in the next chapter. The semantic spatial hierarchy A further approach which has been considered in this chapter is expressed in Kuipers theory of the ‘semantic spatial hierarchy’. This theory advocates construction of both topological and metric models. However, although it allows for some exceptions in the direction of information flow it still presents a relatively strong claim of a progressive and incremental increase in structure and geometrical richness. Indeed, Kuipers [78] explicitly argues against an alternative hierarchy in which the metric level precedes the topological one. I suggest that this hierarchical view contains the valid insight that robust navigation requires separate systems that encode information concerning CHAPTER SEVEN WAY-FINDING: TOPOLOGICAL MODELS 189 independent spatial constraints. However, the hierarchical model confounds the need for multiple systems with the idea that these systems should only be distinguished in terms of the classes of geometric relations they encode. The hierarchical view also weakens the value of having such distinct systems by insisting on some form of dependence of higher levels on those below. I will argue in the next chapter that distinctions other than those based on geometric content may be more or as important in distinguishing different representations for way-finding. 
I will also suggest that what is really required is independent pathways for the determination of spatial constraints, in other words, that a hierarchical approach is not justified or desirable. 190 Chapter 8 Representations for Way-finding: Local Allocentric Frames Summary This chapter describes a representation of quantitative environmental spatial relations with respect to landmark-based local allocentric frames. The system works by recording in a relational network of linear units the locations of salient landmarks relative to barycentric coordinate frames defined by groups of three nearby cues. The system can orient towards distant targets, find and follow a path to a goal, and generate a dynamic description of the environment aligned with agent’s momentary position and orientation. It is argued that the robust and economical character of this system makes it a feasible mechanism for wayfinding in large-scale space. The chapter further argues for a heterarchical view of spatial knowledge for wayfinding. It proposes that knowledge should be constructed in multiple representational ‘schemata’ where different schemata are distinguished not so much by their geometric content but by their dependence on different sensory modalities, environmental cues, or computational mechanisms. It thus argues against storing unified models of space, favouring instead the use of run-time arbitration mechanisms to decide the relative contributions of different constraints obtained from active schemata in choosing appropriate way-finding behaviour. CHAPTER EIGHT 8.1 WAY-FINDING: LOCAL ALLOCENTRIC FRAMES 191 Introduction The focus of most of the research in robot navigation over the past two decades has been on the task of path-planning using quantitative models of space. Much of this work (e.g. [19, 89]) is concerned with planning optimal, collision-free trajectories using an accurate metric map of spatial layout and a model of the shape and dynamics of the robot. Based on the assumption that such a map is needed, research on the acquisition of quantitative spatial knowledge has centred of the task of constructing an appropriate, detailed metric model in which object positions and shapes are described with respect to a global coordinate frame. Most of these models have either assumed accurate sensing, sought to correct error estimates during map-building (e.g. [40]), or to incorporate estimates of spatial error explicitly in the map (e.g. [46, 95, 109, 155, 156]). To facilitate search the map is also segmented to form a graph either by imposing a grid [46, 109], or by determining an appropriate tessellation or network model [19, 33, 34, 40, 64, 135, 136, 174]. A distinctive characteristic of this research, which is shared with many of the more qualitative approaches, is the emphasis on combining all the available information into a unified global model. This chapter suggests that it is this characteristic, perhaps more than any other, that contributes to the inflexibility of these systems and to their over-sensitivity to measurement error. It is proposed instead that different sources of spatial constraint are used to construct multiple, relatively independent15, representational schemata—often encoding only the local spatial layout, from which large-scale relations are determined as needed. The first section of this chapter considers the general problem of constructing a quantitative model of space for way-finding in natural or artificial systems. 
The following four sections then describe a preliminary simulation of a robot wayfinding system that uses a network architecture to relate multiple spatial models based on local coordinate frames. This system which is also described in [132] builds on research by Zipser [191-193] and is related to a method of metric map construction proposed by Levitt and Lawton [84]. The final section then attempts 15In a sense to be defined below. CHAPTER EIGHT WAY-FINDING: LOCAL ALLOCENTRIC FRAMES 192 to generalise from some properties of this system to an overall approach to the problem of constructing and using spatial representations for way-finding. 8.2 Quantitative representations of space The task of building a representation of an environment that encodes distance and direction is very different from the procedures outlined in the previous chapter for constructing a metric-free model. To detect metric relations places must be located with respect to common coordinate frames. In determining appropriate frameworks a significant problem is immediately apparent. The coordinate frames most directly available are egocentric, that is, they are defined by the navigator’s instantaneous position and orientation in space. However, to establish invariant relations over a large-scale environment egocentric relations will evidently not suffice. This follows since first, by definition, such an environment cannot be entirely viewed from any single position, and second, movement of any sort will alter the viewer’s egocentric frame. The necessary solution to this problem is that observations from different view-points must be integrated into representations with respect to environment-centred or allocentric frames. That is egocentric relations must be transformed to give allocentric spatial relations. To construct a model of metric spatial relations during a trajectory through an environment therefore requires at least the following: 1. The establishment of a suitable allocentric frame. 2. The determination of the current relation of the self to this frame (irrespective of ego-motion). 3. The determination of the relation of salient places, first to the self, and thereafter (by suitable displacement transformations), to the allocentric frame. If position estimates are accurately determined with respect to an allocentric frame then the task of constructing layout knowledge by place identification disappears—places are identical that have the same coordinates with respect to the global frame. Generally, however, we wish to assume error, possibly of a large or cumulative nature, in the estimates of spatial position. In this case, the identification problem still exists though now there are substantial additional constraints provided by estimates of spatial position. There is therefore less need, than in the metric-free topological case, for the accurate perception of invariant CHAPTER EIGHT WAY-FINDING: LOCAL ALLOCENTRIC FRAMES 193 secondary characteristics, or for the costly fall-back strategy of rehearsal. Place matching not only serves to determine a cycle in the sequence of observed locations but can also play an important role in the correction of spatial errors— by obtaining ‘fixes’ with respect to landmarks of known position the navigator can correct the errors in the estimated positions of itself and of recently visited places. 
The problems that arise in constructing a metric model concern the difficulty of obtaining good distance information (or failing that, dealing with noisy information), and the resource demands of the above processes, in particular, the need for continuous transformations between egocentric and allocentric frames. This chapter considers way-finding representations and mechanisms that can address the latter issue. Specifically, a distributed representation is proposed in which the environment is modelled using a network of local coordinate frames. The computations required to construct the representation from egocentric metric data require only linear mathematics and have memory requirements roughly proportional to the number of goal-sites and landmarks stored. The task of determining direction or route information from the model is performed in parallel giving search times roughly linear in the distance to the goal. During route following the system exploits run-time error-correction by incorporating egocentric fixes on sighted landmarks. This means that the route following system is highly robust to noise in the spatial representation. A frame-based model can represent the true continuous nature of physical space in which every zero-magnitude position is a potential goal or sub-goal. However, the idea that every point will be explicitly represented is clearly naive. A viable, compact, non-graphical representation of space will not describe the content of space at every point16, rather, it will explicitly represent only those positions that 16Gallistel [52] makes a similar point by contrasting different computer representations of a graphical image. One possibility is to record the colour of every pixel in the image. This results in a representation that is expensive in memory use and opaque with respect to further image processing—the salient objects in the image are not distinguished in any way from the surround. In contrast an object-based picture description language like “Postscript” explicitly encodes the structure in an image by representing lines, curves, etc. This representation is generally both more memory-efficient and more accessible for further processing. CHAPTER EIGHT WAY-FINDING: LOCAL ALLOCENTRIC FRAMES 194 are salient as targets or landmarks. This notion has not always been appreciated in psychological studies of human cognitive maps. In particular, many early studies did not distinguish the hypothesis of an underlying metric model from the idea of a picture-like representation. Consequently, experiments that found errors in human spatial knowledge—distortion, gaps, holes, fault-lines, and asymmetries— were interpreted as strong evidence against the use of metric representations in human way-finding. However, although there is little support for the idea of cognitive spatial representations that are isomorphic to graphical maps17 (see, for instance, [77]) this does not constitute a falsification of the view that humans construct or use metric models. Indeed, there is strong evidence that metric representations of a non-graphical nature form an important element of human way-finding competence [96, 146, 147]. As was indicated in the last chapter there is also substantial evidence that other animals construct and use metric spatial representations. This is indicated by, among other things, the ability observed in many species to find novel or straightline routes to distant positions outside the current visible scene (see [52, 119, 120] for reviews). 
There is some controversy, however, over how animals use perceptual data to construct the metric model. Several researchers [52, 97] have argued that the principal sources of position information are dead-reckoning skills—integrating changes in position from sensory signals—and 'compass' senses—determining orientation by using non-local cues such as the sun, or by sensing physical gradients such as geomagnetism. These skills are used to maintain an estimate of current position and heading relative to the origin (e.g. the nest) of a global allocentric frame. Places are coded in terms of the distance and direction from this origin or from single landmarks whose position in the global frame are already stored. Following McNaughton et al. [97] such a mapping scheme will be referred to here as a vector coding system. An alternative view, proposed for instance by O’Keefe [119, 120], is that a place is encoded in terms of the array of local landmarks visible from that position. In other words, that the cognitive spatial representation stores the locations of 17An exception is where humans are explicitly trained by showing them graphical maps, in this case an image-like memory of the map may be retained and used in way-finding. Spatial knowledge acquired by exploring an environment is, however, of a very different nature [147]. CHAPTER EIGHT WAY-FINDING: LOCAL ALLOCENTRIC FRAMES 195 potential goals in an allocentric coordinate frame determined by a group of (two or more) salient cues. Such a system is here called a relational coding. This chapter is not concerned with the empirical question of which coding method is used in animal cognitive maps. Indeed, there seems no particular reason to believe that any one method will be relied upon to the exclusion others. Rather, since robust navigation skills are critical to survival, multiple, relatively independent coding systems may be employed. In other words, for any specific navigation task, both vector and relational solutions may be computed and behaviour chosen to be as consistent as possible with all available constraints (this proposal will be considered further below). The goal of the chapter is instead to consider the idea of relational models, that do not rely on dead-reckoning or compass senses, in more detail. It will be proposed that such methods can provide robust support for way-finding without being expensive in terms of computational power or memory and without requiring complex coding mechanisms. It will argued these properties encourage the view that similar mechanisms might be employed in animal navigation, or could support way-finding for an autonomous mobile robot. 8.3 Relational models of space The task of navigating a large-scale environment using relational methods divides into three problems: identification and re-identification of salient landmarks; encoding, and later remembering, goal locations in terms of sets of visible local cues; and finally, calculating routes between positions that share no common view. The first task, landmark identification, has been considered elsewhere both from the point of view of animal and robot navigation systems [84, 119, 191, 193]. In this chapter landmarks are taken to be objects (or parts of objects) with locally distinctive secondary characteristics that can be identified with zerodimension locations in egocentric space. The agent is assumed to be able to detect suitable landmarks and determine, at least approximately, their positions relative to itself. 
While realising that this constitutes a tremendous simplification of the map-building problem, the justification is to allow the logically distinct problems of encoding and remembering goals in local landmark frames to be investigated, and to consider the use of such representations for large-scale way-finding tasks. Since metric relations will be recoverable from the stored model the landmarks need not be globally distinctive, hence there is less need for the rich descriptions of landmark characteristics that are required for metric-free topological mapping. The issues arising from inaccurate perceptual data and discriminating ambiguous landmarks are ongoing research topics and will be considered further toward the end of the chapter.

Proposed relational codings

A proposal for a relational coding system, inspired by empirical studies of 'place' cells in the hippocampus of the rat, has been provided by O'Keefe. In the most recent version of this model O'Keefe [118, 119] suggests that the rat brain computes the origin and orientation of a polar coordinate frame from the vectors giving the egocentric locations of the set of salient visible cues. Specifically, he proposes that these location vectors are averaged to compute the origin (or centroid) of the polar frame, and that the gradients of vectors between each pair of cues are averaged to compute its orientation (or slope). Goal positions can then be recorded in the coordinate system of this allocentric frame in a form that will be invariant regardless of the position and orientation of the animal. This idea is illustrated in figure 8.1.

Figure 8.1: O'Keefe's proposal for an allocentric polar coordinate frame defined by 'centroid' and 'slope' measures determined from the positions of local landmarks (A–E) in the agent's egocentric coordinate frame. The arrows indicate two possible viewing positions. (adapted from O'Keefe [119] p. 281)

However, there are problems with this hypothesis. Firstly, the computation of the slope is such that the resulting angle will differ if the cues are taken in different orders¹⁸. Since any ordering is essentially arbitrary, a specific sequence will have to be remembered in order to generate the same allocentric frame from all positions within sight of the landmark set. Secondly, as landmarks move out of sight, are occluded by each other, or new ones come into view, the values of the slope and centroid will change. Rather than changing the global frame each time a landmark appears or disappears it seems more judicious to maintain multiple local frames based on subsets of the available cues. These would supply several mutually-consistent encodings making the mapping system robust to changes in individual landmarks. The use of multiple local frames has been proposed by Levitt and Lawton [84]. They observe that the minimum number of landmarks required to generate a coordinate frame is two (in two dimensions, three in 3D). They also provide a useful analysis of how the constraints generated by multiple local frames can be combined, even in the presence of very poor distance information, to provide robust location estimates.

¹⁸ This arises because the gradient of a line is a scalar function with a singularity at infinity when the line is parallel to the y axis. Hence in order to average gradients this must be done in terms of angles or vectors, in which case the order in which the points are compared is important.
To calculate, from a novel position, a goal location that has been encoded in a two landmark frame requires non-linear computations (trigonometric functions and square roots). It also requires that an arbitrary ordering of the two landmarks is remembered in order to specify a unique coordinate system. Zipser [193], who had earlier considered a landmark pair method [192], points out that if one more landmark is used to compute the local frame then all the calculations required become linear. In fact, all that is required to encode a goal location using three landmarks (in 2D, four in 3D) is that one constant is associated with each cue. Zipser called these constants 'β-coefficients'; they are, however, identical to the barycentric¹⁹ coordinates that have been known to mathematicians since Moebius (see for instance [50]). The system for large-scale navigation described below uses this three landmark method and therefore it is described in more detail in the following section. In the remainder of the chapter the navigation problem will be considered as two-dimensional, however, the extension of these methods to three dimensions is straightforward.

¹⁹ This term, originally used in physics, is derived from 'barycentre' meaning 'centre of gravity'.

Barycentric coordinates

Figure 8.2 shows the relative locations of a group of three landmarks (hereafter termed an L-trie) labelled A, B, and C, seen from two different viewing positions V and V'. A goal site G is assumed to be visible only from the first viewpoint.

Figure 8.2: Relative positions of three landmarks and a goal from two viewpoints V and V'. (adapted from Zipser [193] p. 461)

The column vectors \mathbf{x}_i = (x_i, y_i, 1)^T and \mathbf{x}'_i = (x'_i, y'_i, 1)^T give the location in homogeneous coordinates of object i in the egocentric frames centred at V and V' respectively. The two frames can therefore be described by the matrices

X = [\,\mathbf{x}_A \;\; \mathbf{x}_B \;\; \mathbf{x}_C\,], \qquad X' = [\,\mathbf{x}'_A \;\; \mathbf{x}'_B \;\; \mathbf{x}'_C\,]    (8.1)

If the three landmarks are distinct and not collinear then there exists a unique vector \boldsymbol{\beta} = (\beta_A, \beta_B, \beta_C)^T such that

X\boldsymbol{\beta} = \mathbf{x}_G \quad \text{and} \quad X'\boldsymbol{\beta} = \mathbf{x}'_G.    (8.2)

In other words, by remembering the invariant \boldsymbol{\beta} the egocentric goal position from any new viewing position V' can be determined by the linear sums

x'_G = \beta_A x'_A + \beta_B x'_B + \beta_C x'_C, \qquad y'_G = \beta_A y'_A + \beta_B y'_B + \beta_C y'_C, \qquad 1 = \beta_A + \beta_B + \beta_C.    (8.3)

Note that since each constant is tied to a specific cue the ordering of the landmarks is irrelevant. The \boldsymbol{\beta} vector can be determined directly by computing the inverse matrix X^{-1} since

\boldsymbol{\beta} = X^{-1}X\boldsymbol{\beta} = X^{-1}\mathbf{x}_G.    (8.4)
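To make equations 8.1–8.4 concrete, the following sketch (illustrative only: the function names, example coordinates, and use of numpy are mine, not part of the thesis simulations) encodes a goal as β-coefficients from one viewpoint and then recovers its egocentric position from a second viewpoint using equation 8.3.

import numpy as np

def encode_goal(landmarks, goal):
    # Compute the beta (barycentric) coefficients of a goal with respect to three
    # landmarks, all given as egocentric (x, y) positions (equation 8.4).
    X = np.vstack([np.append(p, 1.0) for p in landmarks]).T   # columns are (x_i, y_i, 1)
    return np.linalg.solve(X, np.append(goal, 1.0))           # beta = X^-1 x_G

def recover_goal(landmarks, beta):
    # Predict the goal's egocentric position from a new viewpoint (equation 8.3).
    X = np.vstack([np.append(p, 1.0) for p in landmarks]).T
    return (X @ beta)[:2]

# Landmarks A, B, C and goal G as seen from viewpoint V ...
A, B, C, G = (2.0, 1.0), (5.0, 4.0), (1.0, 6.0), (3.0, 3.0)
beta = encode_goal([A, B, C], G)                               # invariant; sums to 1

# ... and the same scene from a second viewpoint V' (a rotation plus translation).
theta, t = 0.7, np.array([-4.0, 2.0])
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
view2 = [R @ np.array(p) + t for p in (A, B, C)]
print(recover_goal(view2, beta))                               # matches R @ G + t

Because the coefficients sum to one, the same β vector remains valid under any rotation and translation of the egocentric frame, which is exactly the invariance exploited in the text.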
Though this inverse calculation uses only linear maths, the value of the encoding as a possible biological model has been questioned on the grounds of its apparent mathematical complexity [189]. However, Zipser points out that an even simpler computational mechanism is possible by allowing the β values to converge to a solution under feedback. This can be viewed as adapting the connection strengths of a linear learning unit by gradient descent. A network architecture that instantiates this mechanism is as follows. The network consists of two types of simple processing unit. The first are object-position units (object-units hereafter) whose activation represents the locations in egocentric space of specific goal-sites and salient landmarks. These units receive their primary input from high-level perceptual processing systems that identify goals and landmarks and determine their positions relative to the self. The second type of processor is termed a β-coding unit. This receives input from three object-units and adapts its connection strengths (the β values) to match its output vector to the activation of a fourth. An example of this architecture is illustrated in Figure 8.3 which shows a coding unit G/ABC that receives the positions of the landmarks A, B, and C as its inputs. The unit adapts its weights (β_A, β_B, β_C) till the output (x, y, z) matches the goal vector (x_G, y_G, 1). The unit is assumed to be triggered whenever all three input nodes are active. Gradient-descent learning is used to adapt the connection strengths. For the weight β_i from the ith object unit this gives the update rule at each iteration

\Delta\beta_i = \eta\left[(x_G - x)\,x_i + (y_G - y)\,y_i + (1 - z)\right]    (8.5)

where the parameter η is the learning rate. In general, the network will rapidly converge to a good estimate of the β values.

Figure 8.3: The β-coding unit. A gradient-learning model of the β-coefficient calculation.

In order to provide a further understanding of the β-coding a geometrical interpretation can be given. The coefficient associated with each landmark is the ratio of the perpendiculars from the goal and that landmark to the line between the other two cues. For example, consider landmark A in figure 8.4. The coefficient β_A defines an axis that is perpendicular to the line BC and scaled according to the ratio of the two perpendiculars h_G/h_A (this can also be thought of as the ratio of the areas of the triangles GBC and ABC). Taken together the three coefficients define a barycentric coordinate frame. This coding system in fact records affine rather than metric spatial relations, hence, another term for the coefficients is affine coordinates. However, assuming that the agent detects metric egocentric spatial relations according to some calibrated Euclidean measure, metric relations—direction and distance—will be recoverable from the stored model.

Figure 8.4: Coding a goal position (G) relative to the three landmarks A, B, and C using barycentric coordinates; in the illustrated example G(β_A, β_B, β_C) = (1.3, -1.8, 0.6).

8.4 Modelling large-scale space

I now describe how this coding method can be extended to determine the spatial relations between points over a wide environment that share no common landmarks. The essence of the method is to build a two-layer relational network of object and β-coding units that stores the positions of landmarks in the local frames defined by neighbouring L-trie groups. The resulting structure will therefore record the relationships between multiple local frames. Thereafter, the locations of distant landmarks (and goal sites) can be found by propagating local view information through this network. Zipser [192] and Levitt and Lawton [84] have both discussed methods of this type for large-scale navigation using landmark-pair coordinate frames. The advantage of using the three landmark method²⁰, however, is that following a sequence of transformations through the network is significantly simpler. Since all calculations are linear and independent of landmark order the process can be carried out by spreading activation through the relational network. In contrast, a landmark-pair method would require networks of local processing units of considerably greater complexity in order to perform the necessary non-linear transformations between frames.

²⁰ Zipser was no doubt aware of the application of his earlier multiple frame model to his later idea of the β-coding, however, he has not, as far as I am aware, published anything that relates the two ideas.
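As an illustration of the feedback mechanism of equation 8.5, the sketch below implements a single β-coding unit trained by gradient descent from a sequence of simulated viewpoints. The class, the learning rate, and the random training regime are my own assumptions for the purposes of illustration, not values taken from the thesis simulations.

import numpy as np

class BetaCodingUnit:
    # A linear unit that learns the beta-coefficients of one goal relative to one
    # L-trie (three landmarks) using the update rule of equation 8.5.
    def __init__(self, eta=0.002):
        self.beta = np.zeros(3)
        self.eta = eta

    def update(self, landmarks, goal):
        # landmarks: 3x2 array of egocentric (x, y) positions; goal: (x_G, y_G).
        X = np.hstack([landmarks, np.ones((3, 1))]).T    # columns are (x_i, y_i, 1)
        out = X @ self.beta                              # current output (x, y, z)
        err = np.append(goal, 1.0) - out                 # (x_G - x, y_G - y, 1 - z)
        self.beta += self.eta * (X.T @ err)              # equation 8.5, one step per weight

rng = np.random.default_rng(0)
landmarks = np.array([[2.0, 1.0], [5.0, 4.0], [1.0, 6.0]])
goal = np.array([3.0, 3.0])
unit = BetaCodingUnit()
for _ in range(5000):
    # Present the same scene from a random viewpoint (rotation plus translation);
    # the beta values that satisfy every view are the invariant barycentric coordinates.
    a, t = rng.uniform(0, 2 * np.pi), rng.uniform(-3, 3, 2)
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    unit.update(landmarks @ R.T + t, R @ goal + t)
print(unit.beta)   # approaches the values given by direct matrix inversion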
Constructing a large-scale representation

The relational network that encodes the large-scale spatial model is constructed whilst exploring the environment. The specific method investigated here is as follows. Each time the agent moves the egocentric positions of all visible landmarks are computed. If there are any new landmarks in this set then new object-units are recruited to the lower layer of the network to represent them. Then, for each L-trie combination in the set a test is made to see if a β-coding unit (with this local frame) already exists for each of the remaining visible cues. If not, new β-units are recruited to the network's upper layer and linked appropriately with the object-units. The β-coefficients are then calculated either directly (using matrix inversion) or gradually (via the gradient descent learning rule) as the agent moves within sight of the relevant cues. Figure 8.5 shows an example of this process for a simple environment of five landmarks. From the current view position four landmarks A, B, C, and D are visible (assuming 360° perceptual capability) for which the agent generates coding units A/BCD, B/ACD, C/ABD, D/ABC. Following adequate exploration the network illustrated in Figure 8.6 will have been generated.

Figure 8.5: An environment with five landmarks; the agent is oriented towards the right with field of view as indicated by the circle.

Figure 8.6: A relational network for the five landmark environment. The network consists of an input/output layer of object-units and a hidden layer of β-coding units that encode landmark transformations in specific local frames.

Given this network the agent can determine the location of any target landmark when it is within sight of any group of three others. For instance if cues A, B, and C are visible and E is required, then the active object units will trigger D/ABC (activating unit D) and hence E/BCD to give the egocentric location of the target. The method clearly generalises to allow the position of any goal site that is encoded within an L-trie frame to be found.

The topology of the relational net

The connectivity of the relational network defines a graph of the topological arrangement of local landmark frames. For instance, the network shown above instantiates the L-trie graph shown in Figure 8.7.

Figure 8.7: The L-trie adjacency graph for the five landmark environment.

The links between nodes in this graph correspond to the coding units. Although the graph shown here has entirely bilateral connections, there is nothing intrinsically symmetrical about the coding method. For instance, it would be quite possible to encode the relationship D/ABC and not the reverse A/BCD. This could plausibly happen if the agent, whilst moving through its environment, encodes the positions of landmarks in front with respect to those it is already passing but not vice versa. This property of the mapping mechanism accords with observations of non-reflexive spatial knowledge in humans (see [77]).
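The recruitment procedure described under 'Constructing a large-scale representation' above might be summarised as in the schematic sketch below. The dictionary-based data structures are a convenience of mine, and encode_goal is the illustrative function from the earlier barycentric-coordinate sketch rather than anything from the thesis implementation.

from itertools import combinations

class RelationalNet:
    def __init__(self):
        self.object_units = set()    # one unit per known landmark or goal site
        self.coding_units = {}       # (frozenset frame, target name) -> beta coefficients

    def observe(self, visible):
        # visible: dict mapping landmark name -> egocentric (x, y) position.
        self.object_units |= set(visible)          # recruit object-units for new landmarks
        for frame in combinations(sorted(visible), 3):
            for target in visible:
                if target in frame:
                    continue
                key = (frozenset(frame), target)
                if key not in self.coding_units:   # recruit a beta-coding unit for this frame
                    # Coefficients are computed here directly by matrix inversion and are
                    # stored in the sorted-name order of the frame's landmarks; equally
                    # they could be left to converge by gradient descent (equation 8.5).
                    self.coding_units[key] = encode_goal(
                        [visible[name] for name in frame], visible[target])

# During exploration the agent simply calls net.observe(current_view) after each move.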
Figures 8.8 and 8.9 show respectively an environment of twenty-four landmarks and the adjacency graph generated after exploration by random walk. In learning this environment the system was restricted to encoding the relative locations of only the four closest landmarks at any time. This reduces the connectivity of the graph and the memory requirements of the network substantially. However, even without this restriction, the memory requirements of the system are O(N) (i.e. proportional to the number of landmarks) rather than O(N²) since the relations between landmarks are only stored if they share a common local group.

Figure 8.8: Landmark locations (A–X) for a sample environment. The circle indicates the range of the simulated perceptual system. The box indicates a target landmark (see below).

Figure 8.9: An L-trie graph. Each vertex represents an L-trie node and is placed in the position corresponding to the centre of the triangle formed by its three landmarks (in the previous figure). The boxes enclose the L-trie nodes local to the agent's position and the target landmark 'X'.

8.5 Way-finding using the relational model

Target finding: estimating the distance and direction to desired goals

From Figure 8.9 it is evident that there are multiple possible paths through the network that will connect any two landmark groups. Hence the system represents a highly redundant coding of landmark locations. As described in the previous section the (external) activation of the object units for any group of three landmarks triggers the firing of all associated β-coding units which in turn activates further object units. This 'domino' effect will eventually propagate egocentric position estimates of all landmarks throughout the graph. Indeed, due to the redundancy of the paths through the graph many estimates will be computed for each landmark, each arriving at the object node after differing periods of delay. For any specific goal or landmark the delay (between the initial activation and the arrival of the first estimate) will be proportional to the length of the shortest sequence of frames linking the two positions. Assuming noise in the perceptual mechanisms that determine the relative positions of visible cues (and hence noise in the stored β-coefficients) the position estimates arriving via different routes through the graph will vary. The question then arises as to how the relative accuracy of these different estimates can be judged. The simplest heuristic to adopt is to treat the first estimate that arrives at each node as the best approximation to that landmark's true position. This is motivated by the observation that each transition in a sequence of frames can only add noise to a position estimate, hence better estimates will (on average) be provided by sequences with a minimal number of frames (whose outputs will arrive first). I call this accordingly the minimal sequence heuristic.
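A sequential rendering of this 'domino' propagation, with the minimal sequence heuristic realised by keeping the first estimate to arrive at each object node, might look like the sketch below. The level-synchronous sweep is my own serial approximation to the parallel process described in the text, and it reuses the illustrative RelationalNet and recover_goal definitions from the earlier sketches.

def target_finding(net, visible, target):
    # visible: dict of landmark -> egocentric (x, y) seen directly from the current position.
    # Returns an egocentric position estimate for the (possibly distant) target landmark.
    estimates = dict(visible)                        # direct percepts are 'hard' data
    while target not in estimates:
        new = {}
        for (frame, landmark), beta in net.coding_units.items():
            # A coding unit fires once all three landmarks of its frame have estimates.
            if landmark not in estimates and landmark not in new and frame.issubset(estimates):
                new[landmark] = recover_goal(
                    [estimates[name] for name in sorted(frame)], beta)
        if not new:
            return None                              # target not connected to the current view
        # Because estimates spread one frame-transition per pass, the first estimate to
        # reach a node has come via a minimal number of frames and is the one kept.
        estimates.update(new)
    return estimates[target]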
However, there is a second important factor that affects the accuracy of propagated location estimates which is the spatial arrangement of the cues in each L-trie group. The worst case is if all the landmarks in a given group lie along a line; in this situation the β-coefficients for an encoded point will be undefined. In general, landmark groups that are near collinear will also give large errors when computing cue positions in the presence of noise. One possibility, as yet unimplemented, is for the system to calculate estimates of the size of these errors and propagate these together with the computed landmark positions. Each computed location would then arrive with a label indicating how accurate an estimate it is judged to be. In the following examples landmark positions were calculated simply by rejecting information from L-trie frames that were near-collinear (i.e. within some margin of error) and otherwise using the minimal-sequence heuristic, that is, adopting the first location estimate to arrive at each node. The issue of combining multiple estimates to obtain greater accuracy is considered further below.

Finding the direction and distance to a specific goal by propagating landmark positions is here called target finding. This competence is sufficient to support behaviours such as orienting toward distant locations, and moving in the direction of a goal with the hope of finding a direct route. However, this mechanism will not always be appropriate as a method of navigation for two reasons. First, the direct line to a goal location is clearly not always a viable path. Secondly, the target finding system is susceptible to cumulative noise. I have simulated the effect of 0, 5, 10 and 20% Gaussian relative noise in the measurement of all egocentric position vectors that occurs during map learning and navigation. Figure 8.10 shows an example of the effect of this noise on estimates for the position of landmark X (in the environment shown in Figure 8.8) relative to the agent's starting location.

Figure 8.10: Target finding in the presence of 0, 5, 10 and 20% relative²¹ noise in perceived landmark positions.

²¹ The egocentric vector (x, y) is perceived as (x + N(0, \sigma)\sqrt{x^2 + y^2},\; y + N(0, \sigma)\sqrt{x^2 + y^2}) where N(0, \sigma) is a mean zero Gaussian random variable with standard deviation \sigma. In effect this means an approximately linear increase in noise with distance. These noise characteristics were not chosen to model any specific sensor system.

The figure demonstrates that with the less accurate perceptual data only a rough approximation to a desired goal vector can be obtained. Of course it would be possible for the agent to move in the direction indicated by target finding and then hope to use landmarks that are recognised en route to gradually correct the initial error and so home in on the goal. However, this will often be a risky strategy as a direct line will not necessarily cross known territory. The following section describes a method which exploits a heuristic of this sort but in a form that is more likely to generate a successful path to the goal.

Path following: a robust method for moving to a goal location

An alternative strategy is to move, not in the direction of the goal itself, but towards landmarks that are known to lie along a possible path to the goal. This method, here called path following, involves determining a sequence of adjacent local frames that link start and goal positions prior to setting out, then navigating by moving from frame to frame through this topological sequence. Because perceptual information about known landmarks will (very likely) become available as each frame is crossed, the agent will be able to replace estimates of cue positions with 'hard' data, thus avoiding the build-up of noise encountered by the target finding system. There is, however, some computational overhead to be incurred through the need to calculate a suitable sequence of adjoining frames. Since, again, there are multiple sequences to choose from, some heuristics are required to determine which should be preferred. The minimal sequence heuristic is again appropriate though on the slightly different grounds that shorter sequences should (on average) give more direct routes. Other heuristics are possible, for instance, estimates of the actual distances covered by alternate routes could be calculated allowing a more informed judgement as to which is the shortest path.

To find the minimal sequence the process of propagating information through the relational network is simply reversed. In other words, a spreading activation search²² is performed from the goal back toward the start position. This is easiest to imagine in the context of the L-trie graph (Figure 8.9) however it could be implemented directly in the relational network by backward connections between units. In simulation I have modelled this parallel search process through a series of discrete time-steps. This occurs as follows. The L-trie node closest to the goal is activated and clamped on (i.e. its activity is fixed throughout the search) while all other nodes are initialised with zero activity. The signal at the goal is then allowed to diffuse through the adjacency graph decaying by a constant factor for each link that is traversed²³. Once the activation reaches the L-trie node local to the agent the minimal sequence can be found. Beginning with the start node this sequence is traced through the network simply by connecting each node to its most active neighbour.

²² Spreading activation as a graph searching technique has a long history in cognitive modelling [5] and in the literature on graph search [42] and path planning (e.g. [40]). It also has clear similarities with Dynamic Programming.

²³ This is achieved by, at each time-step, updating the activity of each L-trie node to be equal to the maximum of its own activation and that of any of its immediate neighbours (multiplied by the decay factor) at the previous time-step.

This spread of activation is illustrated in Figure 8.11. The three frames show the activity after 0, 4, and 9 time-steps, after which time the activity has filtered through to the start node. The path found is indicated by the boxes enclosing the winning nodes.

Figure 8.11: Spread of activation through the L-trie graph (see Figure 8.9) after 0, 4, and 9 time steps (decay factor 0.95). The points in the figures represent the vertices in the L-trie graph. The size of each point shows the level of activation of that L-trie node. The boxes indicate that a minimal sequence ABC BCE CEL EGL GLM LMN MNP MNW NVW VWX was found.
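A discrete-time version of this search could be sketched as follows. The adjacency-dictionary representation of the L-trie graph and the assumption that start and goal are connected are mine; the decay factor of 0.95 follows the figure caption.

def minimal_sequence(adjacency, start, goal, decay=0.95):
    # adjacency: dict mapping each L-trie node to the set of its neighbouring nodes.
    # Returns a list of nodes from start to goal (assumes the two are connected).
    activity = {node: 0.0 for node in adjacency}
    activity[goal] = 1.0                                    # goal node clamped on
    while activity[start] == 0.0:
        previous = dict(activity)
        for node in adjacency:
            # Each node takes the maximum of its own activity and its neighbours'
            # activity multiplied by the decay factor (footnote 23).
            spread = decay * max(previous[n] for n in adjacency[node])
            activity[node] = max(previous[node], spread)
        activity[goal] = 1.0                                # re-apply the clamp
    path, node = [start], start
    while node != goal:
        # Trace the path by stepping from each node to its most active neighbour.
        node = max(adjacency[node], key=activity.get)
        path.append(node)
    return path

Once the diffusion reaches the start node each node's activity equals the decay factor raised to the power of its graph distance from the goal, so following the most active neighbour from the start traces a shortest sequence of frames.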
Having found a minimal sequence the path following method proceeds as follows. The agent moves toward the average position of the landmark vectors for the first L-trie in the path. Once that position is reached (it will be near the centre of the three cues), the position of the next L-trie is generated (using direct perceptual data as far as possible) and so on till the goal is reached. Figure 8.12 illustrates this mechanism for the noise-free case, and Figure 8.13 for noise levels of 5, 10 and 20%. The second figure demonstrates that path following is extremely robust to noise as the error in the final goal estimate is independent of total path-length.

Figure 8.12: Moving to the goal by path following. The dotted lines indicate additional landmark locations that were utilised en route.

Figure 8.13: Performance of the path following system in the presence of 5, 10 and 20% relative noise in perceived landmark positions. (Different L-trie sequences were followed in each case as networks with different connectivity were acquired during exploration.)
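The frame-to-frame movement loop just described might be realised along the following lines. The helper functions perceive and move_towards stand in for the simulated robot's perceptual and motor systems, and the waypoint tolerance is an arbitrary choice of mine; target_finding and the network structure are the illustrative definitions from the earlier sketches.

import numpy as np

def follow_path(net, sequence, perceive, move_towards, tolerance=0.5):
    # sequence: list of L-trie frames (frozensets of three landmark names) from start to goal.
    for frame in sequence:
        while True:
            visible = perceive()   # dict of landmark -> egocentric (x, y), refreshed each step
            # Prefer direct perception of each cue; otherwise fall back on an estimate
            # propagated through the relational network (assumes each cue is either
            # visible or reachable through the network).
            positions = [visible[name] if name in visible
                         else target_finding(net, visible, name)
                         for name in sorted(frame)]
            waypoint = np.mean(np.asarray(positions, dtype=float), axis=0)
            if np.linalg.norm(waypoint) < tolerance:   # the agent sits at the origin of its
                break                                  # egocentric frame, so this is the distance
            move_towards(waypoint)

Because the waypoint for each frame is recomputed from fresh percepts whenever its landmarks come into view, errors accumulated earlier in the route do not carry forward, which is the robustness property demonstrated in Figure 8.13.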
Building a predictive map

The 'domino' effect that propagates local landmark positions through the relational net will eventually generate egocentric location estimates for all landmarks (with connections to the L-trie graph). The resulting activity can be thought of as a dynamic 'map' of the environment which could be updated continuously as the agent moves such that it is always arranged with the agent at the centre and oriented towards its current heading. Figures 8.14 and 8.15 show this egocentric map computed (with 20% noise) for an environment of twenty-six landmarks. As a result of cumulative error the exact layout of landmarks is more accurately judged close at hand than further away, however, the topological relations between landmarks are reproduced throughout. One use to which such a map might be put is to disambiguate perceptually similar landmarks by calculating predictions of upcoming scenes. In other words, if the agent sees a landmark which appears identical to one it already knows, then it can judge whether the two cues actually arise from the same object from the extent to which the place where the landmark appears agrees with the location predicted for it by the mapping system. If there is a large discrepancy between actual and predicted locations then the new landmark could be judged to be a distinct object. On the other hand, if there is a good match the agent could conclude that it is observing the original cue. Note that the purpose and nature of this dynamic map demonstrates one of the major differences between the relational frame approach to the acquisition of spatial knowledge and methods that emphasise the construction of a permanent 'map' of environmental layout in which position errors are minimised or explicitly modelled. In this relational approach there is no long-term static representation of large-scale spatial relations. Instead, to the extent that a large-scale map of any sort exists, it is described by the continuously changing activations of the object-units in the relational net.

Figure 8.14: An environment of twenty-six landmarks (A–Z) with the agent positioned and oriented as shown.

Figure 8.15: The 'dynamic' map in the agent's egocentric coordinate frame generated from the viewing position shown in the previous figure.

8.6 Spatial knowledge and way-finding in perspective

This final section is an attempt to find a common theme in the arguments and ideas expressed in this chapter and the last, and to consider where this might lead in the future. Chapter seven ended by considering the theory, proposed by Kuipers [78, 81], of a hierarchical organisation of spatial knowledge. This theory proposes four levels of organisation—sensorimotor, procedural, topological, and metric—that form a hierarchy in which each level introduces a greater degree of complexity and resource demands. On the whole, lower level representations are constructed first and play some role in the assimilation of knowledge to higher levels. I will argue below in favour of one element of this view—the proposal that spatial knowledge is organised in distinct components encoding separate constraints. However, I also wish to suggest that the hierarchical view is misleading, both in its emphasis on geometric content (as the principal distinguishing factor between the top two levels), and in its suggestion of a largely incremental process in the construction of spatial knowledge. The following is an attempt to set out an alternative perspective which, following Michael Arbib [7, 8], I call a 'multiple schemata' view. Arbib has proposed the use of the term schemata to describe active representational systems or "perceptual structures and programs for distributed motor control" ([7] p. 47). In the context of constructing and using models of space he suggests—

"The representation of [...] space in the brain is not one absolute space, but rather a patchwork of approximate spaces (partial representations) that link sensation to action. I mean 'partial' and 'approximate' in two senses: a representation will be partial because it represents only one small sample of space (...), and it will be approximate in that it will be based on an incomplete sample of sensory data and may be of limited accuracy and reliability. I will suggest that our behaviour is mediated by the interaction of these partial representations: both through their integration to map ever larger portions of the territory relevant to behaviour, and through their mutual calibration to yield a shared representation more reliable than that obtained by any one alone." ([8] p. 380)

In the specific context of cognitive maps, he also argues that
I propose that the critical distinction with regard to different forms of constraint information is not to do with the geometric content of the knowledge but with the process by which that knowledge is derived. For instance, metric information derived from odometry (dead reckoning) is independent from metric information determined by perceived distance and direction to identifiable salient landmarks. They thus represent constraints that are independent because they derive from different sensory modalities. Multiple constraints can also be obtained from within a single modality by observing different environmental cues. For instance, the observed position of the sun gives a constraint that has an independent source from spatial data acquired from local landmarks. Indeed, different individual landmarks or landmark groups can supply separate constraints as has been demonstrated in this chapter. Finally, relatively independent constraints can arise within a modality by reference to the same external cues but by employing different computational mechanisms. It is in this sense that the distinction between different geometries may be most relevant. For instance, as has been discussed in chapter seven, the visual characteristics of distinctive landmarks might be used to construct knowledge of topological relations that is independent of the mechanisms that extract distance or direction from the visual scene. To the extent that different constraints are independent (in this sense), two constraints will clearly be much more powerful than one, three more than two, etc. It therefore seems reasonable to suggest that an agent should seek to detect and represent a number of independent or near independent constraints that describe the spatial relations between important places. CHAPTER EIGHT WAY-FINDING: LOCAL ALLOCENTRIC FRAMES 217 The emphasis of a multiple schemata approach is not on constructing unified representations such as topological or metric maps but rather on establishing multiple orienting systems. I suggest that each schema should exploit a different combination of cues, channels, and mechanisms to instantiate a set of environmental spatial relations. Thus, there will overall, be a number of relatively distinct path-ways through which knowledge is acquired. This suggests a heterarchy of models (as opposed to a hierarchy), with some, but not all, schemata sharing common sources and resources. At any time an agent should exploit the knowledge in multiple schemata to support its current navigation task. Although some tasks may require the temporary creation of a unified model (drawing a graphical map of the environment might constitute such a task) in general the underlying representations will be distinct allowing the reliability of each source of information to be assessed at run-time. Way-finding should exploit acquired schemata via arbitration procedures which decide on the basis of the content and accuracy of each active model the extent to which it should contribute to the decision process. This arbitration could be carried out through some fixed subsumption mechanism whereby, for instance, knowledge determined from large-scale metric relations could override taxon (route-following) strategies. Alternatively a more sophisticated system would seek to combine the constraints afforded by multiple schemata by weighting them according to their perceived accuracy or reliability. 
In this way, reliable identification of a highly distinctive landmark might override estimates of spatial position or orientation determined by some metric reckoning process. Kalman filtering techniques [25] can also be applied for combining multiple constraints, as, for instance, in the ‘feature-based’ robot navigator developed by Hallam, Forster, and Howe [57]. I hope that it is reasonably evident that these ideas are consistent with the relational coding system described above. From a multiple schemata perspective the set of !-coding units associated with each distinct L-trie would constitute a schema encoding the location of other landmarks and goals relative to a specific allocentric coordinate frame. If a specific location is encoded by two separate ‘Ltrie schema’ based on non-overlapping landmark sets then these would constitute relatively independent constraints. To the extent that landmark sets do overlap they will obviously be less independent, but will nevertheless encode partially CHAPTER EIGHT WAY-FINDING: LOCAL ALLOCENTRIC FRAMES 218 distinct constraint information. However, the idea of multiple schemata also generalises to encompass different coding systems. For instance, representations based principally on direction sense and dead reckoning could be constructed which would provide a modality-independent source of information from the relational coding. An obvious argument against a multiple schemata view is that acquiring and storing spatial knowledge is not without cost. It makes demands on attention, processing and memory (there are really separate costs associated with detecting constraints, storing them, retrieving them, and combining them!). One defence against this argument is the relative independence between different schemata which will allow parallel processing to be exploited to a considerable extent. A second possibility is that the amount of resources devoted to a given location (i.e. the number of constraints stored) may vary according to the subjective importance of being able to relocate that place or reach it quickly. We could expect, for instance, that an animal’s home or nest (or a robot’s power source) would have the highest priority and that therefore ‘homing’ might be the most robustly supported way-finding behaviour. It is my contention that a multiple schemata view may help in understanding the evolution of way-finding competence in animals, and may also provide support for the essentially pragmatic approach of Animat AI. In the case of the former, O’Keefe’s [119] separate taxon and locale systems (which follows a very long line of research into response vs. place knowledge in animal navigation, see Olton [121, 122]) can be viewed as a distinction along these lines. However, there also seems a reasonable case for breaking up the ‘locale’ system into multiple schemata, for instance, models derived from dead reckoning and direction senses [52, 97] and those derived principally from relational codings. With respect to robotics this view suggests the abandonment of theoretical pre-conceptions about the priority, or lack of it, of different forms of geometric knowledge. It also implies that the ‘brittleness’ of classical approaches arises not so much from the emphasis on metric modelling but from the search for an accurate unified metric model. The alternative, advocated here, is clearly to seek multiple constraint models that, although individually fallible, can combine to give robust support to way-finding. 
For a multiple schemata view to have any theoretical teeth it will need to be developed in such a way as to create predictions of animal or human behaviour. More specific definitions of the different schemata are needed—what constraints they compute, using what information, and the extent of their inter-dependence. Secondly, we need to hypothesise the nature of the arbitration procedure and develop predictions of the likely effects of opposing or complementary constraints on behaviour. Clearly Michael Arbib’s work on schema theory [8] is relevant to these issues; in addition there is much existing work in psychology and AI that shares the same concerns. For instance, in robot simulation, the ‘Qualnav’ model [84] discussed earlier²⁴ describes some interesting arbitration mechanisms for integrating range information determined from pairs of local landmarks to provide accurate localisation. This work could generate some interesting behavioural predictions. Some research in human navigation by Scholl [147] can also be viewed along these general theoretical lines. Work in animal navigation that specifically fits this research theme has been performed by Poucet et al. [31, 129, 130], Collett et al. [36] and Etienne et al. [47-49]. To end this chapter I would therefore like to draw upon a couple of examples from this work.

²⁴Although this work has been interpreted from the perspective of the ‘spatial semantic hierarchy’ [81] I believe it fits at least equally well with a multiple constraints view.

Etienne et al. [47] report that hamsters have effective dead reckoning skills which are sufficient to relocate their nest in darkness. However, in lighted conditions hamsters were found to orient primarily using visual information about local landmarks. In conflict situations, where a landmark (a single light spot) was rotated relative to the learned position, the hamsters homed using either the landmark information or their dead-reckoning sense. When the visual information and dead reckoning produced highly divergent paths, dead reckoning was used; with smaller discrepancies, however, visual information took priority over path integration. Etienne et al. also report that the dead-reckoning sense was more precise when used to return to the nest than when used to locate a secondary feeding site. This suggests that a dead reckoning way-finding schema may be more available for homing than for general path-finding.

Experiments by Collett et al. [36] with gerbils suggest that these animals may encode goal positions (buried sunflower seeds) in terms of individual visible landmarks by using some form of direction sense. For instance, in one experiment gerbils were trained to locate a food cache at the centre of an array of two landmarks. When the distance between landmarks was doubled the gerbils searched at two sites, each at the correct distance and orientation to one of the landmarks, rather than at the centre of the two locations (as some theories of a landmark ‘map’ might predict). In a further experiment the gerbils were trained to go to a goal-site at the centre of a triangle of three landmarks. During testing the distance of one landmark from the centre was doubled; Collett et al. report that the animals spent most of their search time around the place specified by the two landmarks, ignoring the one that broke the pattern.
They interpreted this result in the following way:

“The gerbil is thus equipped with a useful procedure for deciding between discrepant solutions. When most of the landmarks agree in specifying the same goal, with just a few pointing to other sites, the chances are that the majority view is correct and that the additional possibilities result from mistakes in computation or from disturbances to the environment.” ([36] p. 845).

Collett et al. are therefore suggesting that this multiple encoding of landmark-goal relations by gerbils occurs to provide the system with robustness. In other words, they advocate something like a multiple schemata system and give a clear example of the ability of such a hypothesis to generate interesting and testable predictions.

Chapter Nine
Conclusions and Future Work

What has been achieved?

This thesis has investigated reinforcement learning systems, with particular concern for the issues of adaptive recoding and exploration, and has explored the distinction between reinforcement and model-based learning in the context of acquiring navigation skills. Delayed reinforcement learning architectures have been described and simulated for adaptive recoding in control tasks, and for the acquisition of tactical, local navigation skills in a mobile robot. Model-based learning methods have been investigated with respect to the construction and use of quantitative spatial world models for way-finding.

Adaptive recoding for reinforcement learning

A connectionist learning architecture has been developed that performs adaptive recoding in reinforcement and delayed reinforcement learning tasks defined over continuous input spaces. The architecture employs networks of Gaussian basis function (GBF) units with adaptive receptive fields. Learning is achieved by gradient climbing in a measure of the expected reinforcement. This architecture has several desirable characteristics for reinforcement learning. First, the function approximation is performed locally. This means that the network should learn considerably faster than systems based on global approximators such as the multilayer perceptron. Second, the parameters of the trained network should be easier to initialise and interpret than those of more distributed learning systems. This creates the potential for interfacing explicit knowledge with adaptive reinforcement learning for control. Finally, by adapting the covariance matrix each GBF unit can align its principal axis according to the variance of the optimal output in its region of expertise. For a given task, fewer units should therefore be needed than in similar architectures in which only the centres and the widths of the receptive fields are learned. Results with a simulation of a real-time control problem show that networks with only a small number of units are capable of learning effective behaviour in reinforcement learning tasks within reasonable time frames. However, as in all multilayer delayed reward learning systems, the learning system is susceptible to local minima, is vulnerable to interference due to input sampling bias, and is not guaranteed to converge. I have therefore argued that this system may be more appropriate for fine-tuning coarsely specified control behaviours than for acquiring new skills from scratch.

Reinforcement learning in navigation

A tactical/strategic split in navigation skill has been proposed for both natural and artificial systems.
I have argued that tactical local navigation could be performed by reactive, task-specific systems encoding competences that will be effective across different environmental settings. These systems might be innately specified, learned, or, coarsely specified and adaptable. I have argued that learning of appropriate control behaviour may be possible with short or mediumterm rewards and using only partial state information. The learning system can exploit the redundancy in the mapping from environmental states to appropriate actions to reduce the dimensionality of the search space from the full Markovian problem down to something more manageable. I have demonstrated acquisition of an adaptive local navigation behaviour within a simulated modular control architecture for a mobile robot. The Actor/Critic learning system for this task acquires successful, often plan-like strategies for control using very sparse sensory data and delayed reward feedback. The algorithm also demonstrates adaptive exploration using Williams’ proposal for performance related control over local search behaviour. Model-based learning in navigation I have suggested that strategic, way-finding skills require model-based, taskindependent, but environment-specific knowledge. I have argued against the view that topological spatial models are simpler to construct, store, or use than quantitative, metric models, and have simulated a local allocentric frame method as an example of way-finding using quantitative spatial knowledge. This system CHAPTER NINE CONCLUSION 223 exploits simple neural network learning, storage and search mechanisms, yet generates very robust way-finding behaviour. I have also argued against a strong distinction between different forms of geometric knowledge in cognitive spatial representations and against the construction of complete global models. I have proposed instead that robust systems should be based on the acquisition of multiple representations or ‘schemata’ with mechanisms for chaining and combining constraints arising from different models. Future Work The research described above leaves many gaps to be filled in and many avenues to be explored. Various specific proposals for extending this research have already been outlined, here I will briefly describe how the various underlying threads might be brought together to make some more coherent wholes. It is clear that the GBF adaptive recoding techniques for delayed reinforcement learning could be applied in local navigation systems allowing these systems to be coarsely pre-specified and giving the possibility of accelerated learning. Further, since the adaptive coding methods are more memory-efficient and allow broader and more flexible generalisation than coarse-coding this could allow for tactical learning in input spaces of much higher dimension. The possibility of using the GBF methods to distill control knowledge from more opaque memory systems such as CMAC coding is also worth investigating. The problem of input sampling has been discussed as being a major complication in learning real-time control. Mechanisms need to be developed that will keep track of and cope with oversampling or under-sampling. This is part of the overall need to develop better attentional and exploration mechanisms for reactive learning systems. A further possibility for the development of local navigation systems is to use reinforcement learning to adapt the perceptual components of a system rather than the outputs. 
For instance, one idea would be to exploit the notion of the ‘Deformable Virtual Zone’ (DVZ) proposed by Zapata et al. [190] and mentioned briefly in chapter six. The DVZ describes an area of space around the robot within which intrusions from obstacles trigger avoidance manoeuvres. The shape of this zone is a function of the dynamics and current trajectory of the robot. Figure 9.1 illustrates some possible elliptical zone shapes for different instantaneous robot trajectories. The proposal would therefore be to use reward signals to adapt the parameters of the DVZ separately from any adaptive mechanisms that control the actual avoidance behaviour. This could be achieved by a gradient learning rule and would essentially constitute an adaptive attentional mechanism.

Figure 9.1: The deformable virtual zone (DVZ) is a function of the vehicle dynamics and current trajectory that specifies the ‘danger’ area within which intrusions will trigger avoidance behaviour (adapted from [190] p. 110).

I have argued for a distinction between tactical and strategic navigation skills but have not yet considered how this might be brought about. Clearly advances are needed in many areas. First, more appropriate local navigation modules are required that achieve tactical objectives rather than simply wandering without purpose. Second, the development and integration of multiple representational systems for way-finding is needed. This work has barely begun; mechanisms for determining, chaining and, in particular, combining different sources of constraint need to be investigated. Finally, mechanisms for integrating the two levels of control must be specified. The research described here suggests a tactical/strategic split in which the way-finding system continuously specifies a target heading while the local navigation systems work to keep the agent on course while avoiding obstacles etc. Both systems would run in parallel. Some advantages of this decomposition would be that it avoids the specification of arbitrary sub-goals; it limits the need for local navigation systems that adapt with respect to very long-term rewards; and it allows both components of the system to be fully responsive to contingencies that arise from moment to moment. A critical aspect of all this work will be to make contact with real environments through robotics, or at least, through more realistic simulations; the lack of such realism is perhaps the major drawback of the work performed so far.

Appendix A
Algorithm and Simulation Details for Chapter Three

This appendix describes the actor/critic architecture used in the maze learning task described in chapter three.

The learning system

The N cells of the maze (figure 3.1) describe a state space x_i ∈ X. The action selected in the ith cell is given by a_i ∈ {1, 2, 3, 4}. The orientation of the agent is ignored, hence selecting action a_i = j will always result in a transition to the jth neighbouring cell. Behaviour is modelled in discrete time intervals where at each time-step the agent makes a transition from one cell to a neighbour. The cell occupied at time t is denoted x(t) and is represented by the recoding vector φ(t). The current action is denoted a(t). In this model, each cell is given a unique, discrete encoding. This is achieved by using a recoding vector φ(t) that is a unit vector of size N. In other words, if the current cell is x_i then

φ_j(t) = 1 if i = j, and 0 otherwise.   (A.1)
The parameters of the learning system are encoded by a vector v of size N for the critic element and a vector w of size N × 4 for the performance element. Since cells are encoded by unit vectors, the prediction associated with the ith cell is just the parameter v_i and the degree of preference for the jth action in this cell is given by the parameter w_(4i+j). To select an action, Gaussian noise with mean zero and standard deviation σ is added to each of the appropriate weights¹. The action is then chosen that maximises

w_(4i+j) + N(0, σ)  for j = 1…4.   (A.2)

Only the weight for the selected action becomes eligible for learning. This architecture corresponds to a winner-takes-all competition between a set of stochastic linear threshold units. Figure A.1 summarises the learning algorithm.

¹For cells on the periphery of the maze the weights for invalid actions are ignored.

e_TD(t+1) = r(t+1) + γ V(t+1) − V(t)

η_(4i+j)(t) = 1 if φ_i(t) = 1 and a(t) = j, and 0 otherwise

φ̄(0) = 0,   φ̄(t) = φ(t) + λ φ̄(t−1)

η̄(0) = 0,   η̄(t) = η(t) + κ η̄(t−1)

Δv = β e_TD(t+1) φ̄(t)

Δw = α e_TD(t+1) η̄(t)

Figure A.1: Update rules for the actor/critic maze learning system (η marks the weight of the selected action as eligible; φ̄ and η̄ are the eligibility traces for the critic and actor parameters respectively).

There are six global parameters in the learning algorithm: the discount factor γ, the action and prediction learning rates α and β, the rates of trace decay λ and κ, and the standard deviation σ of the noise in the action selection rule. A discount factor γ = 0.95 was chosen to provide a slow decline in the importance attached to delayed rewards. Suitable values for the remaining parameters were chosen by experiments with maze tasks in which there were no hazard cells. Acquisition of a consistent path is most accelerated in such tasks by using high values for the decay and learning rates. However, this can be at the cost of exploration, making convergence on an indirect path more likely. To find a suitable compromise between speed of learning and achieving a direct route, the learning rates α = β = 0.2 and standard deviation of the Gaussian noise σ = 0.05 were tested with different values of trace decay λ = κ = 0.0, 0.3, 0.6, and 0.9 (the initial expectation in all tests was zero). Figure A.2 shows the average number of transitions over the first one hundred trials in learning the 6x6 maze. Of the values tested a decay rate of 0.6 provided the fastest convergence to a direct path. The experiments described in chapter three therefore used the parameters given above together with the decay rates λ = κ = 0.6.

Figure A.2: Learning mazes without hazards with different values of trace decay. (The plot shows the average number of transitions per trial over the first hundred trials for decay rates of 0.0, 0.3, 0.6 and 0.9.)
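As a supplement to Figure A.1, the following Python sketch re-expresses the maze learning algorithm in executable form. It is an illustration only: the `maze` object and its `neighbour` and `reward` helpers are hypothetical, the treatment of the goal cell as absorbing (with prediction zero) is an assumption not spelled out above, and the default parameter values are those given in the text.

```python
import numpy as np

def run_trial(maze, v, w, gamma=0.95, alpha=0.2, beta=0.2,
              lam=0.6, kappa=0.6, sigma=0.05, max_steps=500):
    """One trial of the tabular actor/critic maze learner of Figure A.1.

    v (size N) and w (size N x 4) persist across trials and are updated
    in place.  `maze` is a hypothetical helper object assumed to provide
    N, start, goal, neighbour(cell, action) -> cell or None, and
    reward(cell); only the update rules follow the appendix.
    """
    phi_trace = np.zeros(maze.N)        # critic eligibility trace
    eta_trace = np.zeros((maze.N, 4))   # actor eligibility trace
    cell = maze.start
    for step in range(max_steps):
        # winner-takes-all action selection with Gaussian exploration
        # noise; weights of invalid (periphery) actions are ignored
        valid = [a for a in range(4) if maze.neighbour(cell, a) is not None]
        noisy = w[cell, valid] + np.random.normal(0.0, sigma, size=len(valid))
        action = valid[int(np.argmax(noisy))]

        next_cell = maze.neighbour(cell, action)
        r = maze.reward(next_cell)
        done = (next_cell == maze.goal)

        # TD error; the prediction for the absorbing goal cell is taken as 0
        v_next = 0.0 if done else v[next_cell]
        e_td = r + gamma * v_next - v[cell]

        # decay the traces, then mark the current cell (and chosen action)
        # as eligible for learning
        phi_trace *= lam
        phi_trace[cell] += 1.0
        eta_trace *= kappa
        eta_trace[cell, action] += 1.0

        # parameter updates from Figure A.1
        v += beta * e_td * phi_trace
        w += alpha * e_td * eta_trace

        cell = next_cell
        if done:
            return step + 1
    return max_steps
```

Repeated calls with the same v and w correspond to successive trials, as in the learning curves of Figure A.2.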
Appendix B
Algorithm and Simulation Details for Chapter Four

This appendix gives further details of the algorithms for training networks of Gaussian Basis Function (GBF) units discussed in chapter four. The first section summarises the activation rule and partial derivative computations for single-layer networks of GBF units and contains a brief discussion of unsupervised learning. The second section describes in detail the architecture for immediate reinforcement learning in the binary output ‘X’ task including the additional learning rules required for adapting the GBF scaling parameters. The final section gives the specifications of the ‘X’ task, details of the network parameters, and describes the results of the simulations.

B.1 Networks of Gaussian Basis Function Units

Given a network of M GBF nodes the activation of the ith unit for context x is given by the Gaussian

g_i = (2π)^{-N/2} |H_i|^{1/2} exp( -½ (x - c_i)^T H_i (x - c_i) )   (B.1.1)

where c_i is the parameter vector describing the position of the centre of the unit in the input space and H_i the information matrix. The activation of each unit is often scaled by the total activation of all units. This gives a recoding vector of normalised activation values in which the ith element is computed by

ρ_i = g_i / Σ_j g_j.   (B.1.2)

Most learning rules require the gradients of the (unscaled) activation function given by the partial derivatives with respect to the centre and information matrix of each unit². These are, respectively,

∂g_i/∂c_i = g_i H_i (x - c_i)   (B.1.3)

and

∂g_i/∂H_i = ½ g_i ( S_i - (x - c_i)(x - c_i)^T )   (B.1.4)

where S_i = H_i^{-1}. For unsupervised learning each unit is moved in the direction of the partial derivative gradients in proportion to its normalised activation. This gives the update rules (given in section 4.2) for training the centres c and information matrices H:

Δc_i ∝ ρ_i H_i (x - c_i)   (B.1.5)

ΔH_i ∝ ρ_i [ S_i - (x - c_i)(x - c_i)^T ].   (B.1.6)

As it is usually thought desirable that all units get an equal share of the data, learning rules of this kind are often supplemented by a conscience mechanism [1, 128]. For instance, a running average of the activation of each unit can be learned which is used to modulate the unit's learning rate—a node that is less active than average has its gain temporarily increased, while one that is more active than average has its learning rate temporarily reduced.

²The partial derivatives of the activation of the ith unit with respect to the position and shape (information matrix) of its receptive field are obtained as follows (dependence on i is assumed). For the node centre c, let d = (x - c); then ∂g/∂c = (∂g/∂d)(∂d/∂c). Since ∂(d^T H d)/∂d = (H + H^T)d = 2Hd (H is symmetric) and ∂d/∂c = -I_N, it follows that ∂g/∂c = g H(x - c). For the information matrix H, ∂|H|^{1/2}/∂H = ½ |H|^{1/2} H^{-1} (from ∂ ln|H|/∂H = H^{-1}) and ∂(d^T H d)/∂H = d d^T, so that ∂g/∂H = ½ g ( H^{-1} - (x - c)(x - c)^T ).

B.2 A GBF architecture for a simple immediate reinforcement task

This section describes the learning architecture used for the ‘X’ task described in section 4.3. In this task networks of between five and ten GBF units were trained to output either 1 or 0 according to whether the context input originated from within or without an ‘X’-shaped pattern imposed on a two-dimensional input space. The network architecture is illustrated below and described in detail in the following text.

Figure B.1: A network architecture for immediate reinforcement learning. (The figure shows the inputs x feeding a layer of GBF units with centres c and information matrices H, whose normalised activations are combined through the policy weights w to give the output q of a binary action generator.) The thick line joining the GBF units is intended to indicate that the activations are normalised with respect to the total activation.

As in section B.1 (above) each GBF local expert i has parameters c_i and H_i describing its centre and information matrix. The output of each unit is computed as in equation B.1.1 and normalised as in B.1.2. The normalised activation vector for all experts at time t is given by the vector ρ(t). The network has one output computed as follows. The net sum s(t) is first computed by

s(t) = w^T ρ(t)   (B.2.1)

where w is the output weight vector.
This is then passed through a logistic function to give the probability p(t) = 1 !s (t ) 1+e (B.2.2) APPENDIX B. 232 that the current action y(t ) will take the value 1 (or value 0 with probability ! p(t) ). In Williams’ terminology (see chapter two) the output component of the network therefore acts as a Bernoulli logistic unit. Learning rules The method for estimating the error in a Bernoulli logistic unit was described in section 2.2. Assuming dependance on time, for an immediate reinforcement task in which r is the payoff and b is the reinforcement baseline this gives the error estimate e = (r ! b)(y ! ) . The update rule is for the output weights w is therefore (B.2.3) !w = " e # (B.2.4) where ! is the learning rate, and for the parameters of the local experts (from section 4.3) !c i = " c e (wi # s) $ i Hi (x # c i ) + " s di and (B.2.5) % !Hi = " H e (wi # s) $i (H #1 i # (x # c i )(x # c i ) ) . (B.2.6) where ! c , ! s and ! H are learning rates. The rule for training the node centres contains a extra component di . This is the spring force between the GBF units that acts to prevent any two nodes occupying exactly the same position in the input space. A suitable measure of the proximity of each pair of nodes i and j is the Gaussian, with width ! s , given by ! ij ( 1 exp " # 2s c i " c j 2 ). (B.2.7) The resultant force on unit i is then given by M di = s $" ij (B.2.8) j , j #i where ! s scales the contribution of the spring component to the total update for that unit. The gaussian is chosen so that the spring is maximal for nodes that are coincident and decays rapidly as the distance between centres increases. B.3 Learning scaling parameters To adapt the strength of response of each unit independently of its position and shape requires the following modified activation function and learning rule (see section 4.3) APPENDIX B. i 233 12 1 = pˆ i (2 ! )" N 2 H i exp (" 2 (x " ci )# H i (x " c i )) (B.3.1) Since ! i "1 = gi pˆi ! pˆi Then from equations 4.3.7, 4.3.12 and 4.3.16 we have pˆi = p r(y # s)(wi # s) $ i pˆ i#1 (B.3.2) where pˆ i is the scaling parameter for the ith unit and ! p is a learning rate. Since the pˆ i s are probabilities then the following conditions must be met ! pˆi = 1 and 0 ! pˆi ! 1 for all i. To satisfy the first of these conditions the change ! pˆi is distributed evenly between the remaining nodes. That is, all nodes j where j ! i are adjusted by i ! pˆi M "1 where M is the total number of GBF units. However, because all values must stay within the bounds 0 ! pˆi ! 1 an iterative procedure is required to balance the parameters correctly. The full algorithm (adapted from [101]) is given below in pseudocode form. It is worth noting, however, that in practice such a scheme will rarely be required provided that the learning rates are appropriately set and there is not a large surplus of GBF units for the task in hand. ! pˆ j (i ) = " Iterative procedure for maintaining scaling parameters within bounds ;! is the change remaining to be distributed between other units. ; U is the set of units excluded from this distribution process. ; M is the total number of GBF units in the network. For each pˆ i do { ! pˆi = " p r(y # s)(wi # s) $ i pˆ i#1 Let U = {i } ˆ i + !pˆ i > 1 then { if p ! = " (1 " pˆi ) pˆ i = 1 } ˆ i + !pˆ i < 0 then else if p ! = "(" pˆi ) pˆ i = 0 } else { ;determine the desired update ;initialise the exclusion set ;update to the upper bound ; ! = negative of change ;update to the lower bound ;normal update pˆ i pˆ i + !pˆ i ! = "! pˆi } While ! 
" 0 do { ;iterative redistribution APPENDIX B. 234 ;repeats till " = M # size(U) ;compute change per unit !=0 For each !=0 ;reset distribution sum pˆ j , j !U if do { pˆ  + ! > 1 then { else if else ! ! pˆ j U = U ! {j} pˆ j = 1 } pˆ j + ! < 0 then { ;update to the upper bound ) #1 ! = ! + ( pˆ j + " ) U = U ! {j} pˆ  = 0 } ˆp  = pˆ  + ! ;add excess to ! ;add unit to exclusion set ;update to lower bound ;normal update }} B.4 Simulation details for the X task The learning system was required to emit the action y(t ) = 1 to obtain maximum payoff for inputs from within the boundaries of the cross and action y(t ) = 0 to maximise payoff outside the cross. Inputs were sampled randomly from the unit square where the cross was defined by the set of points U = (x1 , x2 ) where ( x1 + x 2 ! 0.85 and x1 + x 2 ! 1.15 ) or ( x1 ! x 2 " !0.15 and x1 ! x 2 " 0.15 ) The payoff r(t + 1) was equal to 1 with a probability 0.9 for a correct action and 0.1 for an incorrect action and was zero otherwise. The reinforcement baseline was zero throughout. Parameter settings The following global parameter settings were determined to give reasonable performance (no systematic exploration of the parameter space was performed): ! learning rate for weight vector 0.05 !c 0.004 learning rate for node centres learning rate for information matrix 10.0 !H !p learning rate for scale parameters 0.0 or 0.0005 gain for inter-node spring !s 1.0 !s 0.06 width of inter-node spring All learning rates were annealed to zero at a linear rate over the training period of forty thousand input patterns. In other words, for the jth input each learning rate APPENDIX B. 235 j) times its starting value. Each GBF unit was initially had a value equal to (40,000! 4 0,000 placed at a random position { N(0.5,0.01) , N(0.5,0.01) } near the centre of the space. The receptive fields of the nodes were initialise with principal axes parallel to the dimensions of the input space and with widths in each dimension of 0.12. The scale parameters were all equal at the start of training. On runs in which these were trained the learning rate ! p was set at 0.0005 and was subject to annealing over the training period. Results Figure B.2 shows the results for ten runs using networks of between five and ten GBF units with the zero learning rate for the adaptive priors. The data shows quite clearly two (or more) levels of performance which correspond to qualitative differences in the network's ability to approximate of the X shape. Specifically, only those nets scoring above approximately 87% produced a test output which showed all four arms of the X, below this figure either one arm was missing or the space between the two arms was incorrectly filled. 236 Individual training runs with different numbers of GBF units APPENDIX B. 5 6 7 8 9 10 72 77 82 87 92 97 percentage success in reproducing test output Figure B.2: Performance on the X task for nets of 5,6,7,8,9, and 10 GBF units. Figure B.3 compares the performance shown above for networks of eight GBF units (8-GBF) with ten runs using networks of eight units with adaptive variance only. The difference between the average performance of the two systems is significant3. 3t= 7.658, p<0.001 (2-tailed). Nine out of ten runs in the 'width only' test approximated the full X shape (µ=89.3%, != 1.4%). The tenth run (in which the approximation was incomplete) was excluded from the t-test comparison. The mean performance over all ten runs of the 8-GBF network was 93.6%. 237 Individual training runs APPENDIX B. 
72 77 82 87 92 97 percentage success in reproducing test output Figure B.3: Comparison of performance on the X task for nets with adaptive covariance (clear triangles) and with adaptive variance only (black circles). The experiments with GBF networks of five to ten units were repeated with the non-zero learning rate for the prior probabilities. Figure B.4 shows the results for each size of network. The overall the performance is similar to that reported above with networks of six units or less frequently failing to reproduce the full X shape. Quantitatively, however, the scores achieved with each size of network are slightly better. 238 Individual training runs with different numbers of GBF units APPENDIX B. 5 6 7 8 9 10 72 77 82 87 92 97 percentage success in reproducing test output Figure B.4: Performance on X task for networks of 5,6,7,8,9, and 10 GBF units with adaptive prior probabilities. 239 Appendix C Algorithm and Simulation Details for Chapter Five This appendix gives details of the delayed reinforcement learning algorithms used for the experiments described in Chapter Five with the simulated pole balancing task. The first section gives the equations (from Barto et al. [11]) governing the dynamics of the simulated cart-pole system. The following sections describe the algorithms for training Actor/Critic learning systems on this task: for a system a with a radial basis function (RBF) ‘memory-based’ coding layer, and for systems of Gaussian basis function (GBF) units trained by generalised gradient ascent. C.1 Dynamics of the cart-pole system The dynamics of the cart-pole system are determined by the following equations (angles are expressed in radians): gsin ! t + cos ! t !˙˙t = l ˙˙t = t + p # " F " mp l!˙t2 sin ! t & (' %$ mc + m p # 4 mp cos 2 !t & " %3 mc + m p (' $ [!˙ 2 t sin ! t " !˙˙t cos ! t c + p ] , . Here " 10 newtons, if the output of the control system = 1 Ft = # " " " = 0 or -1 $!10 newtons " is the force applied to the cart by the control system, mc = 1.0 kg and m p = 0.1 kg are the respective masses of the cart and pole, l= 0.5 m is the distance from the centre of mass of the pole to the pivot, and g=9.8 m/s2 is the acceleration due to gravity. As in [11] the system was simulated according to Euler’s method using a timestep $=0.02 giving the following the discrete time equations for the four state variables xt +1 = xt + ! x˙t , x˙t +1 = x˙t + ! ˙x˙t , ! t +1 = ! t + " !˙t , !˙t +1 = !˙t + " !˙˙t . 240 APPENDIX C. C.2 RBF memory-based learning architecture The activation of the ith RBF node is given by 2 % ( N x j (t) # c ij gi (t) = g(x(t), c i , ! ) = exp # ' $j * ( 2 " )1 2 ! N 2 !2 & ) where ! denotes the width of the Gaussian distribution of all nodes in all the input dimensions. The normalised basis encoding is then obtained by [ 1 ! i (t ) = ] gi (t) " j =1g j (t) . M and an estimate xˆ of the input reconstructed by x (t) = "i =1! i (t)c i (t) . A new node is generated with its centre at x whenever x(t) ! ˆ (t) > " where . denotes the squared Euclidean distance measure. The initial network contains zero units. The recoding vector ! (t) acts as the input to an actor/critic learning architecture with parameters w and v, generating the action y(t ) and the prediction V(t) at each time step according to V(t) = v ! " (t) &1 iff w! " (t) + N(0, #) $ 0 y(t ) = ' ( %1 otherwise where N(0, ! ) is a Gaussian random variable with mean zero and standard deviation ! . The parameter vectors are then updated by !w(0) = 0 ! (0) = 0 (t) = (t) + " ! 
(t # 1) !w(t) = y(t)" (t) + # !w(t $ 1) !w(t) = eTD (t + #w t ) !v = " eTD (t + 1) # (t) where the temporal difference error is given by eTD (t + 1) = r(t + 1) + V(t + 1) " V (t) . When a new node is added to the network the recoding and parameter vectors all increase dimension with the new parameters taking the value zero. Specific architecture for the pole balancing task The four state variables of the cart/pole system ! ! x and x˙ were normalised to provide the 4-vector input x for the learning system. The following scaling equations (taken from Anderson [4]) were used ( x1 = ! + 12 x + 2.4 x˙ + 1.5 !˙ + 115 x = x = x = 3 4 , 24 , 2 3.0 300 , ) APPENDIX C. 241 where ! and !˙ are given in degrees and degrees/s and x and x˙ are given in metres and m/s. The reward given to the learning system was r(t) = $ % !1 if " < -12° or " > +12° or x < 2.4 or x > 2.4 . 0 otherwise APPENDIX C. 242 The global parameters used for the learning system were: RBF coding (The network was limited to a maximum size of 162 coding units.) threshold $ 0.18 width ! 0.18 Actor learning rate trace decay standard deviation of noise % # & 0.2 0.9 0.01 Critic discount factor trace decay learning rate ' " ( 0.95 0.9 0.1 243 APPENDIX C. C.3 Gaussian Basis Function (GBF) learning architecture The network architecture is much the same as that described in Appendix B for the immediate reinforcement task. The principal difference is that the temporal difference error which is computed by a second critic GBF network is used in place of the immediate reinforcement, thus the learning system is an Actor/Critic architecture. The algorithm is given here in full to avoid any possible ambiguity. GBF input coding The activation of each GBF unit i with centre c i and information matrix H i for context x is given by the Gaussian gi = (2! ) Hi exp (" 12 (x " c i )# H i (x " ci )) and normalised according to "N 2 !i = gi "g j 12 . (C.3.1) (C.3.2) j Actor network The actor network has MA GBF units where each unit i has parameters c Ai and H Ai describing its centre and information matrix. The normalised activation vector for all units at time t is given by the vector ! A(t) . The network has one output y(t ) ! {1,0} which takes the value 1 with probability p(t) = 1 !s (t ) where 1+e s(t) = w! " A(t) and w is the output weight vector. Given the temporal difference error eTD (t + 1) let (C.3.3) (C.3.4) (t + 1) = TD (t + 1)(y(t) ! p(t)) . The update rule for the output weight vector w is then given by (C.3.5) !w = " eA (t + 1) # A (t) and for the parameters of the GBF units by (C.3.6) A !c Ai = " c e A (wi # s) $ Ai H(x # c Ai ) + " s d Ai and (C.3.7) % !H Ai = H eA (wi # s) $ Ai ( H #1 Ai # (x # c Ai )(x # c Ai ) ) (C.3.8) where dependence on time is assumed. The spring component d Ai between the GBF units (if required) is computed as described in Appendix B.2. 244 APPENDIX C. Critic network The critic net has Mv GBF units where each node i has parameters c Vi and H Vi describing its centre and information matrix. The normalised activation vector for all units at time t is given by the vector ! V (t) . The network has one output unit which generates the prediction V(t, t) according to V(t, t) = v(t)! " V (t) . (C.3.9) where v is the output weight vector. From which the temporal difference error eTD (t + 1) = r(t + 1) + V(t|t + 1) " V (t|t) is computed in the double-time dependent manner. 
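Before the critic network is described, the actor component just outlined can be summarised by the following Python sketch. It is an approximation for illustration only: the spring term, the learning-rate annealing, and the critic network itself are omitted, the class and parameter names are my own (the default learning rates follow the values given for the pole balancing task below), and the temporal difference error e_TD is assumed to be supplied externally.

```python
import numpy as np

class GBFActor:
    """Sketch of a GBF actor with a Bernoulli-logistic output unit.
    Units have centres c[i] (N-vectors) and information matrices H[i]
    (N x N); w holds the output weights."""

    def __init__(self, centres, info_matrices, alpha=0.2,
                 alpha_c=0.002, alpha_h=10.0):
        self.c = np.array(centres, dtype=float)        # (M, N)
        self.H = np.array(info_matrices, dtype=float)  # (M, N, N)
        self.w = np.zeros(len(centres))
        self.alpha, self.alpha_c, self.alpha_h = alpha, alpha_c, alpha_h

    def activations(self, x):
        """Gaussian activations normalised by the total activation."""
        n = x.shape[0]
        g = np.empty(len(self.c))
        for i, (ci, Hi) in enumerate(zip(self.c, self.H)):
            d = x - ci
            g[i] = ((2 * np.pi) ** (-n / 2) * np.sqrt(np.linalg.det(Hi))
                    * np.exp(-0.5 * d @ Hi @ d))
        return g / g.sum()

    def act(self, x):
        """Sample the binary action y with probability p = sigmoid(w.rho)."""
        rho = self.activations(x)
        s = self.w @ rho
        p = 1.0 / (1.0 + np.exp(-s))
        y = 1 if np.random.rand() < p else 0
        return y, p, rho, s

    def update(self, x, y, p, rho, s, e_td):
        """TD-error-modulated updates of the weights, centres and matrices."""
        e_a = e_td * (y - p)                 # actor error: e_A = e_TD (y - p)
        self.w += self.alpha * e_a * rho
        for i in range(len(self.c)):
            d = x - self.c[i]
            coeff = e_a * (self.w[i] - s) * rho[i]
            self.c[i] += self.alpha_c * coeff * (self.H[i] @ d)
            self.H[i] += self.alpha_h * coeff * (np.linalg.inv(self.H[i])
                                                 - np.outer(d, d))
```

The characteristic eligibility (y − p) of the Bernoulli-logistic unit plays the same role here as in the immediate reinforcement rule of Appendix B.2, with the temporal difference error taking the place of (r − b).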
The update rule for the output weight vector v is then given by (C.3.10) !v = eT (t + 1) # V (t ) and for the parameters of the GBF units by (C.3.11) !c Vi = c eTD (vi # V) $Vi H Vi (x # c Vi ) + s dV i and (C.3.12) !HV = "H TD ( # V) $ V (H V#1 # (x # cV )(x # c V )% ) (C.3.13) where dependence on time is assumed. The spring component dV i between the GBF units (if required) is computed as described in Appendix B. Specific architecture for the pole balancing task. The context vector and reward were computed as in section C.2 above. The values for all the global parameters of the learning system were as follows. Actor learning rate for output weights learning rate for node centres learning rate for information matrix gain for inter-node spring Critic discount factor learning rate for output weights learning rate for node centres learning rate for information matrix gain for inter-node spring ! !c 0.2 !H !S 0.002 10.0 0.0 ! ! !c !H !S 0.95 0.05 0.001 10.0 0.0 APPENDIX C. 245 All learning rates were annealed to zero at a linear rate over the maximum training period of one thousand trials. In other words, on the nth trial each gain n) times its starting value. parameter had a value equal to (1000! 1000 The GBF units were initially placed randomly near the centre of the normalised input space with the position in each dimension sampled from the Gaussian distribution !(0.5, 0.01) . The receptive fields of the nodes were initialise with principal axes parallel to the dimensions of the input space and with widths in each dimension !(0.12, 0.006) . 246 Appendix D Algorithm and Simulation Details for Chapter Six This appendix gives details of the learning algorithms and global parameter values for the adaptive local navigation architecture described in chapter six. The input recoding mechanism used by the adaptive wander module is described first followed by details of the algorithms employed by the motivate, critic, and actor components. Input Coding For both the critic and actor components the real-valued input from the laser range-finder was recoded to give index entries to identical CMAC coarse-coded look-up tables. In chapter four a CMAC was defined as a set C = {C0 ,C1 ,…CT } of T tilings each itself a set Ci = ci 1 ,ci2 ,…cin of non-overlapping quantisation cells. In this instance the three-dimensional space of depth vectors was quantised using tilings D D D of cuboid cells of size 5 ! 5 ! 5 where D is the maximum range of the rangeD finder. Five tilings, each offset from its neighbours by 2 5 in each dimension, were overlaid to form each CMAC. Figure D.1 illustrates this coding mechanism. 247 APPENDIX D CMAC Uniform quantisation x5 -60° 0° offset +60° Figure D.1: Input encoding by CMAC. Each tiling covers the space of possible depth patterns. Five overlapping and offset tilings make up the CMAC encoder. The input vector x(t) (i.e. the current depth map) selects the set of coding cells U t) = {c1 (t),c 2 t ,…c5 t } where each ci (t) is the cell in ith tiling that encompasses this input position in the 3D space. CMAC cells are mapped in a one-to-one manner4 to elements of the recoding vector ! (t) , hence, 1 T i (t ) = $ %0 iff ci "U(t) otherwise . (D.1) Motivate Motivate takes input from external (collision) and internal (motor) sensors and combines them into an overall scalar reward signal for the system. Collision reward At each time-step t each of the collision sensors generates a binary output oi (t) which is non-zero only when the sensor is in contact with an object. 
The total collision reward rc (t) is calculated as the negative sum of these outputs i.e. 4 r (t) = ! " #i =1oi (t) . (D.2) where ! is a scaling factor. As long as the robot avoids collision this reward will therefore be zero. However, when a collision does occur the punishment is proportional to the number of sensors that are triggered. This should assist the learning system by enabling it to discriminate between crashes of differing severity. For instance, a head- or side-on collision should trigger two sensors 4For larger input spaces (formed by adding further depth measures) a hashing function has be used to create a many-to-one mapping and hence effectively reduce the number of stored parameters. 248 APPENDIX D giving a punishment of -2), but if the robots crashes on only one corner (a position that is easier to recover from) the punishment will be only -1). Movement reward The motor system measures the current actual speeds at which the wheels are revolving. By averaging the two wheel speeds the current translational velocity can be calculated. Motivate takes the absolute value s(t) of this signal and uses it to compute the movement reward rM . The goal of the system is to maintain a * max constant target velocity s that is just below the vehicles maximum speed s . The reward rM is therefore computed as a gaussian function which is zero at * s(t) = s and negative at all other speeds this is given by ( 2 ) rM (t) ! exp " [(s(t) " s* ) s max ] # 2 " 1 . (D.3) The constant ! in this equation determines whether the peak in the reward function is narrow or flat, in other words, it effects how tightly the constraint of constant velocity is enforced. The scaling parameter * controls the overall strength the signal. Total reward The total reward for the current time-step is given by a weighted sum of the collision and movement reward signals, i.e. r(t) = rC (t) + rM (t ) . (D.4) Critic The critic component is based on the standard adaptive critic architecture, that is it learns to output a prediction V(t) given by a linear function of a parameter vector v and the recoding vector ! (t) , i.e. V(t) = v ! " (t) . The parameter vector v is updated by the usual TD(") rule !v(t) = " eTD(t + 1) # (t) (D.5) (D.6) where eTD (t + 1) = r(t + is the TD error and + V(t + 1) " V (t) (D.7) ! (0) = 0 , (t) = (t) + v (t # 1) (D.8) is the STM trace of past recoding vectors with rate of trace decay ! v . Note that the time counter t is set to zero after a collision by the recover module. 249 APPENDIX D Actor The actor component is based on the gaussian action unit architecture proposed by Williams [184] and described qualitatively in chapter three. The following describes the procedure for a one-dimensional gaussian which generalises in a straight-forward way to the two-dimensional output used in the simulation. The output y of a 1-D gaussian action unit is selected randomly from a probability function g with mean µ and standard deviation ! given by 1 $ (y # ) 2 ' exp& # (D.9) 2"! 2! 2 )( . % The mean and standard deviation are encoded by a parameter matrix (wµ , w! ) and computed as functions of the recoding vector according to g(y, , ! ) µ = (wµ ) ! " (x) and ! = exp (w ! )" # (x) where the exponential is used in computing the standard deviation to ensure that it always has a positive value and approaches zero very slowly (see [184]). The eligibilities (see section 2.2) of the parameter vectors wµ and w! 
are, respectively, wµ = " ln g "µ y#µ = % (x) and "µ " wµ $2 (D.10) # ln g # (ln " ) (y $ µ )2 $ " 2 = % (x) . # (ln " ) #w " "2 This gives the eligibility trace for each vector, (D.11) !w i (0) = 0, !w i (t) = !w i (t) + " i !wi (t # 1) , and the update rule by (D.13) w" = !wi ( ) = " i eTD (t + 1) #w i (t) (D.15) where i is one of the indices µ or ! , ! i is a learning rate and ! i the rate of eligibility trace decay. For the wander module the actor component computes the two-dimensional output y(t) = ( f , a) corresponding to the desired forward velocity and steering angle of the vehicle. Each element of the output is encoded by a separate weight matrix and has different global parameters. Parameters The following values for the parameters of the simulation and the control modules were used in the experiments reported. motor control 250 APPENDIX D maximum speed maximum acceleration maximum braking axle width perception ray angles maximum depth s max d max scaling and normalisation of rays motivate target velocity collision reward scale movement reward scale movement reward width critic discount factor learning rate trace decay initial evaluation function 0.1 m/step 0.02 m/step 0.05 m/step 0.3 m -60°, 0°, +60° 2m 0 (d d max ) s* ! ! ! 0.08 m/step 1.0 1.0 1.4 ! ! !v 0.9 0.15 0.7 0.0 To make the selection of learning parameters more straight-forward, the parameters of the actor modules are scaled to lie roughly in the range -1 to +1. The actual output of each module is computed by simply multiplying the output of the CMAC by the scaling factor given below. Following a suggestion by Watkins [177] the learning rate for the mean action was halved for negative values of the TD error. forward velocity (f) (range -0.1 to +0.1m/step) !µ f mean: learning rate ln ! : initial parameter value learning rate initial parameter value trace decay output scaling factor !" f 0.10 eTD ! 0 0.05 eTD < 0 0.5 0.02 ! µf , ! "f ln 0.2 0.7 steering angle (a) (range -0.67 to +0.67 radians/step) mean: learning rate !µ a 0.1 0.10 eTD ! 0 0.05 eTD < 0 251 APPENDIX D ln ! : initial parameter value learning rate initial parameter value trace decay output scaling factor !" a ! µa , ! " a 0.0 0.01 ln 0.2 0.7 0.67 APPENDIX D 252 BIBLIOGRAPHY 253 Bibliography 1. Ahalt, S.C., K.K. Ashok, P. Chen, and D.E. Melton (1990). Competitive learning algorithms for vector quantization. Neural Networks. 3: p. 277-290. 2. Albus, J.S. (1971). A theory of cerebellar function. Mathematical Biosciences. 10: p. 25-61. 3. Anderson, C.W. (1986). Learning and Problem Solving with Multilayer Connectionist Systems. Phd thesis. University of Massachusetts. 4. Anderson, C.W. (1988). Strategy learning with multilayer connectionist representations. GTE Laboratories, MA. Report no. TR87-509.3. 5. Anderson, J.R. (1983). The Architecture of Cognition. Cambridge, MA: Harvard University Press. 6. Anderson, T.L. (1990). Autonomous robots and emergent behaviours: a set of primitive behaviours for mobile robot control. In International Workshop on Intelligent Robots and Systems. Tsukuba, Japan. 7. Arbib, M.A. (1989). The Metaphorical Brain. New York: Wiley and Sons. 8. Arbib, M.A. (1990). Interaction of multiple representations of space in the brain. In Brain and Space, J. Paillard, Editor. Oxford: Oxford University Press. p. 380403. 9. Barto, A.G. (1985). Learning by statistical cooperation of self-interested neuronlike computing elements. Human Neurobiology. 4: p. 229-256. 10. Barto, A.G., S.J. Bradtke, and S.P. Singh (1993). 
Learning to act using real-time dynamic programming. Department of Computer Science, Amherst, Massachusetts. Report no. CMPSCI 93-02. 11. Barto, A.G., R.S. Sutton, and C.W. Anderson (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions in systems, man, and cybernetics. SMC-13: p. 834-846. 253 BIBLIOGRAPHY 254 12. Barto, A.G., R.S. Sutton, and P.S. Brouwer (1981). Associative search network: a reinforcement learning associative memory. Biological Cybernetics. 43: p. 175185. 13. Barto, A.G., R.S. Sutton, and C.J.H.C. Watkins (1989). Learning and sequential decision making. In Learning and Computational Neuroscience, J.W. Moore and M. Gabriel, Editor. Cambridge, MA: MIT Press. 14. Barto, A.G., R.S. Sutton, and C.J.H.C. Watkins (1990). Sequential decision problems and neural networks. In Advances in Neural Information Processing Systems 2. San Mateo, CA: Morgan Kaufmann. 15. Bellman, R.E. (1957). Dynamic Programming. Princeton, NJ: Princeton University Press. 16. Bellman, R.E. and S.E. Dreyfus (1962). Applied Dynamic Programming. Rand Corporation. 17. Bierman, G.J. (1977). Factorization Methods of Discrete Sequential Estimation. New York: Academic Press. 18. Blake, A., G. Hamid, and L. Tarassenko (1992). A design for a visual motion transducer. University of Oxford, Department of Engineering Science. Report no. OUEL 1960/92. 19. Brooks, R.A. (1985). A subdivision algorithm in configuration space for findpath with rotation. IEEE Transactions on Systems, Man, and Cybernetics. SMC-15(2): p. 224-233. 20. Brooks, R.A. (1986). A robust layered control system for a mobile robot. IEEE Journal on Robotics and Automation. RA-2, 14-23. 21. Brooks, R.A. (1989). Robot beings. In International Workshop on Intelligent Robots and Systems. Tsukuba, Japan. 22. Brooks, R.A. (1989). A robot that walks: emergent behaviour from a carefully evolved network. Neural Computation. 1(2): p. 253-262. 23. Brooks, R.A. (1990). Challenges for complete creature architectures. In From Animals to Animats: Proceedings of the First International Conference on the Simulation of Adaptive Behaviour. Paris. Cambridge, MA: MIT Press. 24. Brooks, R.A. (1990). Elephants don’t play chess. In Designing Autonomous Agents: Theory and Practice from Biology to Engineering and Back, P. Maes, Editor. Cambridge, MA: MIT Press. 254 BIBLIOGRAPHY 255 25. Brown, R.G. and P.Y.C. Hwang (1992). Introduction to Random Signals and Applied Kalman Filtering. 2nd ed. New York: John Wiley & Sons. 26. Cartwright, B.A. and T.S. Collett (1979). How honeybees know their distance from a nearby visual landmark. Journal of Experimental Biology. 82: p. 367-72. 27. Cartwright, B.A. and T.S. Collett (1983). Landmark learning in bees: experiments and models. Journal of Comparative Physiology A. 151(4): p. 521-544. 28. Cartwright, B.A. and T.S. Collett (1987). Landmark maps for honeybees. Biological Cybernetics. 57: p. 85-93. 29. Chapman, D. and P.E. Agre (1987). Pengi: An implementation of a theory of activity. In Proceedings of the Sixth National Conference on AI (AAAI-87). San Mateo, CA: Morgan Kaufmann. 30. Chapman, D. and L.P. Kaelbling (1990). Learning from delayed reinforcement in a complex domain. Teleor Research, Palo Alto. Report no. TR-90-11. 31. Chapuis, N., C. Thinus-Blanc, and B. Poucet (1983). Dissociation of mechanisms involved in dogs’ oriented displacements. Quarterly Journal of Experimental Psychology. 35B: p. 213-219. 32. Charniak, E. and D. McDermott (1985). 
Introduction to Artificial Intelligence. Reading, MA: Addison-Wesley. 33. Chatila, R. (1982). Path planning and environment learning in a mobile robot system. In European Conference on AI. France. 34. Chatila, R. (1986). Mobile robot navigation: space modelling and decisional processes. In 3rd International Symposium on Robotics Research. Cambridge, MA: MIT Press. 35. Cliff, D.T. (1990). Computational neuroethology: a provisional manifesto. University of Sussex. Report no. CSRP 162. 36. Collett, T.S., B.A. Cartwright, and B.A. Smith (1986). Landmark learning and visuo-spatial memory in gerbils. Journal of Comparative Physiology A. 158: p. 835-51. 37. Connell, J. (1988). Navigation by path remembering. SPIE Mobile Robots III. 1007. 38. Connell, J.H. (1990). Minimalist Mobile Robots. Perspectives in Artificial Intelligence, Boston: Academic Press. 255 BIBLIOGRAPHY 256 39. Courant, R. and H. Robbins (1941). What is Mathematics? Oxford: Oxford University Press. 40. Crowley, J.L. (1985). Navigation for an intelligent mobile robot. IEEE Journal of Robotics and Automation. 1(1): p. 31-41. 41. Dayan, P. (1991). Reinforcing Connectionism: Learning the statistical way. PhD thesis. Edinburgh. 42. Dijistra, E.W. (1959). A note on two problems in connexion graphs. Numerishe Mathematik. 1: p. 269-272. 43. Douglas, R.J. (1966). Cues for spontaneous alternation. Journal of Comparative and Physiological Psychology. 62: p. 171-183. 44. Dudek, G., M. Jenkin, E. Milios, and D. Wilkes (1988). Robotic exploration as graph construction. Department of Computer Science, University of Toronto. Report no. RBCV-TR-88-23. 45. Edelman, G.M. (1989). Neural Darwinism. Oxford: Oxford University Press. 46. Elfes, A. (1987). Sonar-based real-world mapping and navigation. IEEE Journal of Robotics and Automation. 3(3): p. 249-265. 47. Etienne, A.S. (1992). Navigation of a small mammal by dead reckoning and local cues. Current Directions in Psychological Science. 1(2): p. 48-52. 48. Etienne, A.S., R. Maurer, F. Saucy, and E. Teroni (1986). Short-distance homing in the golden hamster after a passive outward journey. Animal Behaviour. 34: p. 696-715. 49. Etienne, A.S., E. Teroni, C. Hurni, and V. Portenier (1990). The effect of a single light cue on homing behaviour in the golden hamster. Animal Behaviour. 39: p. 17-41. 50. Farin, G. (1988). Curves and Surfaces for Computer Aided Geometric Design: A Practical Guide. Boston: Academic Press. 51. Franklin, J.A. (1989). Input space representation for refinement learning control. In IEEE International Symposium on Intelligent Control. Albany, NY. Computer Society Press. 52. Gallistel, C.R. (1990). The Organisation of Learning. Cambridge, MA: MIT Press. 256 BIBLIOGRAPHY 257 53. Gibson, J.J. (1950). The perception of the visual world. Cambridge, MA: Riverside Press. 54. Gould, J.L. (1986). The locale map of honeybees: do insects have cognitive maps? Science. 232: p. 861-63. 55. Gronenberg, W., J. Taütz, and Hölldobler (1993). Fast trap jaws and giant neurons in the ant Odontomachus. Science. 262: p. 561-563. 56. Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued functions. Ñeural Networks. 3: p. 671-692. 57. Hallam, J., P. Forster, and J. Howe (1989). Map-Free Localisation in a Partially Moving 3D World: The Edinburgh Feature-Based Navigator. In Intelligent Autonomous Systems: Proceedings of an International Conference. Amsterdam. Stichting International Congress of Intelligent Autonomous Systems. 58. Halperin, J.R.P. (1990). 
Machine Motivation. In From Animals to Animats: Proceedings of the First International Conference Simulation of Adaptive Behaviour. Paris. Cambridge, MA: MIT Press. 59. Hertz, J., A. Krogh, and R.G. Palmer (1991). Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley. 60. Hillis, W.D. (1992). Co-evolving parasites improve simulated evolution as an optimisation procedure. In Artificial Life II. Santa Fe Institute Studies in the Sciences of Complexity, vol. 6. Reading, Mass.: Addison-Wesley. 61. Hinton, G.E., J.L. McClelland, and D.E. Rumelhart (1986). Distributed Representations. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1, D.E. Rumelhart and J.L. McClelland, Editor. Cambridge, MA: Bradford Books. 62. Hintzman, D.L. (1992). Twenty-five years of learning and memory: was the cognitive revolution a mistake. In Fourteenth International Symposium on Attention and Performance. Michigan. Cambridge, MA: Bradford Books. 63. Hirsh, R. (1974). The hippocampus and contextual retrival of information from memory: a theory. Behavioural Biology. 12: p. 421-444. 64. Iyengar, S., C. Jorgensen, S. Rao, and C. Weisbin (1985). Learned navigation paths for a robot in unexplored terrain. In 2nd Conference on Artificial Intelligence Applications, Vol. 1. p. 148-154. 257 BIBLIOGRAPHY 258 65. Jaakkola, T., M.I. Jordan, and S.P. Singh (1993). On the convergence of stochastic iterative dynamic programming algorithms. MIT Computational Cognitive Science. Report no. 9307. 66. Jacobs, R.A., M.L. Jordan, S.J. Nowlan, and G.E. Hinton (1991). Adaptive mixtures of local experts. Neural Computation. 3(1). 67. Jang, J.-S.R. and C.-T. Sun Functional equivalence between radial basis function networks and fuzzy inference systems. Department of Electrical Engineering and Computer Science, University of California. 68. Jerison, H. (1973). Evolution of the Brain and Intelligence. New York: Academic Press. 69. Jordan, M.I. (1992). Forward models: supervised learning with a distal teacher. Cognitive Science. 70. Kohonen, T. (1984). Self-organization and associative memory. Heidelberg: Springer-Verlag. 71. Kortenkamp, D. and E. Chown (1992). A directional spreading activation network for mobile robot navigation. In From Animals to Animats: Proceedings of the Second International Conference on Simulation of Adaptive Behaviour. Honolulu, USA. Cambridge, MA: MIT Press. 72. Kortenkamp, D., T. Weymouth, E. Chown, and S. Kaplan (1991). A scene-based multi-level representation for mobile robot spatial mapping and navigation. IEEE transactions on robotics and automation. 73. Krose, B.J.A. and J.W.M. van Dam (1992). Adaptive state space quantisation for reinforcement learning of collision-free navigation. In IEEE International Conference on Intelligent Robots and Systems. 74. Krose, B.J.A. and J.W.M. van Dam (1992). Learning to avoid collisions: a reinforcement learning paradigm for mobile robot navigation. In IFAC/IFIP/IMACS International Symposium on Artificial intelligence in RealTime Control. Delft. 75. Kubovy, M. (1971). Concurrent pitch segregation and the theory of indispensable attributes. In Perceptual Organization, M. Kubovy and J. Pomerantz, Editor. Erlbaum: Hillsdale, N.J. 76. Kuipers, B. (1978). Modelling spatial knowledge. Cognitive Science. 2: p. 129153. 258 BIBLIOGRAPHY 259 77. Kuipers, B. (1982). The “map in the head” metaphor. Environment and behaviour. 14: p. 202-220. 78. Kuipers, B. and Y. Byun (1991). 
79. Kuipers, B. and Y.T. Byun (1987). A qualitative approach to robot exploration and map-learning. In Spatial Reasoning and Multi-Sensor Fusion Workshop. Chicago, Illinois.
80. Kuipers, B. and Y.T. Byun (1987). A robust, qualitative method for robot exploration and map-learning. In Proceedings of the Sixth National Conference on AI (AAAI-87). St. Paul, Minnesota. Morgan Kaufmann.
81. Kuipers, B. and T. Levitt (1988). Navigation and mapping in large-scale space. AI Magazine. (Summer 1988): p. 25-43.
82. Leiser, D. and A. Zilbershatz (1989). The Traveller: a computational model of spatial network learning. Environment and Behaviour. 21(4): p. 435-463.
83. Levenick, J.R. (1991). NAPS: A connectionist implementation of cognitive maps. Connection Science. 3(2).
84. Levitt, T.S. and D.T. Lawton (1990). Qualitative navigation for mobile robots. Artificial Intelligence. 44: p. 305-360.
85. Lieberman, D.A. (1993). Learning: Behaviour and Cognition. 2nd ed. Pacific Grove, CA: Brooks/Cole Publishing Co.
86. Linsker, R. (1986). From basic network principles to neural architecture: emergence of orientation-selective cells. Proceedings of the National Academy of Sciences. 83: p. 8390-8394.
87. Linsker, R. (1988). Self-organization in a perceptual network. IEEE Computer. (March): p. 105-117.
88. Littman, M.L. (1992). An optimization-based categorization of reinforcement learning environments. In From Animals to Animats: Proceedings of the Second International Conference on Simulation of Adaptive Behaviour. Honolulu, USA. Cambridge, MA: MIT Press.
89. Lozano-Perez, T. (1983). Spatial planning: a configuration space approach. IEEE Transactions on Computers. C-32(2): p. 108-121.
90. Luttrell, S.P. (1989). Hierarchical vector quantisation. Proceedings of the IEE. 136: p. 405-413.
91. Maes, P. (1992). Behaviour-Based Artificial Intelligence. In From Animals to Animats: Proceedings of the Second International Conference on Simulation of Adaptive Behaviour. Honolulu, USA. Cambridge, MA: MIT Press.
92. Mataric, M.J. (1990). Navigating with a rat brain: a neurologically-inspired model for robot spatial representation. In From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behaviour. Paris. Cambridge, MA: MIT Press.
93. Mataric, M.J. (1990). Parallel, decentralized spatial mapping for robot navigation and path planning. In 1st Workshop on Parallel Problem Solving from Nature. Dortmund, FRG. Springer-Verlag.
94. Mataric, M.J. (1992). Integration of representation into goal-driven behaviour-based robots. IEEE Transactions on Robotics and Automation. 8(3): p. 304-312.
95. Matthies, L. and S. Shafer (1987). Error modelling in stereo navigation. IEEE Journal of Robotics and Automation. 3(3): p. 239-248.
96. McNamara, T.P. (1992). Spatial representation. Geoforum. 23(2): p. 139-150.
97. McNaughton, B.L., L.L. Chen, and E.J. Markus (1991). Landmark learning and the sense of direction: a neurophysiological and computational hypothesis. Journal of Cognitive Neuroscience. 3(2): p. 192-202.
98. Meyer, J. and A. Guillot (1990). Simulation of adaptive behaviour in animats: review and prospect. In From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behaviour. Paris. Cambridge, MA: MIT Press.
99. Michie, D. and R.A. Chambers (1968). Boxes: an experiment in adaptive control. In Machine Intelligence 2, E. Dale and D. Michie, Editor. Oliver and Boyd: Edinburgh.
100. Millan, J.d.R. and C. Torras (1992). A reinforcement connectionist approach to robot path finding in non-maze-like environments. Machine Learning. 8(3/4): p. 229-256.
101. Millington, P.J. (1991). Associative Reinforcement Learning for Optimal Control. MSc thesis. Massachusetts Institute of Technology, Cambridge, MA.
102. Minsky, M. and S. Papert (1988). Perceptrons. 3rd ed. Cambridge, MA: MIT Press.
103. Minsky, M.L. (1961). Steps toward artificial intelligence. Proceedings IRE. 49: p. 8-30.
104. Mishkin, M., B. Malamut, and J. Bachevalier (1992). Memories and habits: two neural systems. In Fourteenth International Symposium on Attention and Performance. Michigan. Bradford Books.
105. Moody, J. and C. Darken (1989). Fast learning in networks of locally-tuned processing units. Neural Computation. 1(2): p. 281-294.
106. Moore, A.W. (1991). Fast, robust adaptive control by learning only with forward models. In Advances in Neural Information Processing Systems 4. Denver. San Mateo, CA: Morgan Kaufmann.
107. Moore, A.W. (1991). Knowledge of knowledge and intelligent experimentation for learning control. In International Joint Conference on Neural Networks. Seattle.
108. Moore, A.W. (1991). Variable resolution dynamic programming: efficiently learning action maps in multi-variate real-valued state-spaces. In Machine Learning: Proceedings of the 8th International Workshop. San Mateo, CA: Morgan Kaufmann.
109. Moravec, H. (1988). Sensor fusion in uncertainty grids. AI Magazine. (Summer 1988): p. 61-74.
110. Morris, R.G.M. (1990). Toward a representational hypothesis of the role of hippocampal synaptic plasticity in spatial and other forms of learning. Cold Spring Harbor Symposia on Quantitative Biology. 55: p. 161-173.
111. Myers, C.E. (1992). Delay Learning in Artificial Neural Networks. London: Chapman & Hall.
112. Nehmzow, U. and T. Smithers (1990). Mapbuilding using self-organising networks in “really useful robots”. In From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behaviour. Paris. Cambridge, MA: MIT Press.
113. Nehmzow, U., T. Smithers, and B. McGonigle (1992). Increasing behavioural repertoire in a mobile robot. In From Animals to Animats: Proceedings of the Second International Conference on Simulation of Adaptive Behaviour. Honolulu, USA. Cambridge, MA: MIT Press.
114. Nilsson, N.J. (1982). Principles of Artificial Intelligence. Berlin: Springer-Verlag.
115. Nowlan, S.J. (1990). Competing experts: an experimental investigation of associative mixture models. Department of Computer Science, University of Toronto. Report no. CRG-TR-90-5.
116. Nowlan, S.J. (1990). Maximum likelihood competition in RBF networks. University of Toronto. Report no. CRG-TR-90-2.
117. Nowlan, S.J. and G.E. Hinton (1991). Evaluation of adaptive mixtures of competing experts. In Advances in Neural Information Processing Systems 3. Denver, Colorado. San Mateo, CA: Morgan Kaufmann.
118. O’Keefe, J. (1990). Computational theory of the hippocampal cognitive map. Progress in Brain Research. 83: p. 301-312.
119. O’Keefe, J. (1990). The hippocampal cognitive map and navigational strategies. In Brain and Space, J. Paillard, Editor. Oxford: Oxford University Press.
120. O’Keefe, J.A. and L. Nadel (1978). The Hippocampus as a Cognitive Map. Oxford: Oxford University Press.
121. Olton, D.S. (1979). Mazes, maps, and memory. American Psychologist. 34(7): p. 583-596.
122. Olton, D.S. (1982). Spatially organized behaviours of animals: behavioural and neurological studies. In Spatial Abilities: Development and Physiological Foundations, M. Potegal, Editor. Academic Press: New York.
123. Overmier, J.B. and M.E.P. Seligman (1967). Effects of inescapable shock upon subsequent escape and avoidance learning. Journal of Comparative and Physiological Psychology. 63: p. 23-33.
124. Piaget, J. and B. Inhelder (1967). The Child’s Conception of Space. New York: Norton.
125. Piaget, J., B. Inhelder, and A. Szeminska (1960). The Child’s Conception of Geometry. New York: Basic Books.
126. Poggio, T. and F. Girosi (1989). A theory of networks for approximation and learning. MIT AI Lab. Report no. 1140.
127. Poggio, T. and F. Girosi (1990). Networks for approximation and learning. Proceedings of the IEEE. 78(9): p. 1481-1496.
128. Porrill, J. (1993). Approximation by linear combinations of basis functions. AI Vision Research Unit, Sheffield University.
129. Poucet, B. (1985). Spatial behaviour of cats in cue-controlled environments. Quarterly Journal of Experimental Psychology. 37B: p. 155-179.
130. Poucet, B., C. Thinus-Blanc, and N. Chapuis (1983). Route planning in cats, in relation to the visibility of the goal. Animal Behaviour. 31: p. 594-599.
131. Powell, M.J.D. (1987). Radial basis functions for multivariable interpolation: a review. In Algorithms for Approximation, J.C. Mason and M.G. Cox, Editor. Oxford: Clarendon Press.
132. Prescott, A.J. (1993). Building long-range cognitive maps using local landmarks. In From Animals to Animats: Proceedings of the Second International Conference on Simulation of Adaptive Behaviour. Honolulu, USA. Cambridge, MA: MIT Press.
133. Prescott, A.J. and J.E.W. Mayhew (1992). Obstacle avoidance through reinforcement learning. In Advances in Neural Information Processing Systems 4. Denver. San Mateo, CA: Morgan Kaufmann.
134. Prescott, A.J. and J.E.W. Mayhew (1992). Adaptive local navigation. In Active Vision, A. Blake and A. Yuille, Editor. MIT Press: Cambridge, MA.
135. Rao, N., S. Iyengar, and G. deSaussure (1988). The visit problem: visibility graph-based solution. In IEEE International Conference on Robotics and Automation.
136. Rao, N., N. Stoltzfus, and S. Iyengar (1988). A retraction method for terrain model acquisition. In IEEE International Conference on Robotics and Automation.
137. Ritter, H., T. Martinetz, and K. Schulten (1992). Neural Computation and Self-Organizing Maps. Reading, MA: Addison-Wesley.
138. Roitblat, H.L. (1992). Comparative Approaches to Cognitive Science. University of Honolulu. Report no.
139. Rosenblatt, F. (1961). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington, DC: Spartan Books.
140. Rumelhart, D.E., G.E. Hinton, and R.J. Williams (1986). Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D.E. Rumelhart and J.L. McClelland, Editor. Bradford: Cambridge, MA.
141. Sanger, T.D. An optimality principle for unsupervised learning. MIT AI Laboratory, Cambridge, MA. Report no.
142. Schacter, D.L. (1987). Implicit memory: history and current status. Journal of Experimental Psychology: Learning, Memory, and Cognition. 13: p. 501-518.
143. Schmidhuber, J.H. (1990). Networks adjusting networks. In Distributed Adaptive Neural Information Processing. Oldenburg.
144. Schmidhuber, J.H. (1990). Recurrent networks adjusted by adaptive critics. In IEEE International Joint Conference on Neural Networks. Washington DC.
145. Schmidhuber, J.H. (1991). Adaptive confidence and adaptive curiosity. Technische Universität München, Germany. Report no. FKI-149-91.
146. Scholl, M.J. (1987). Cognitive maps as orienting schemata. Journal of Experimental Psychology: Learning, Memory and Cognition. 13: p. 615-628.
147. Scholl, M.J. (1992). Landmarks, places, environments: multiple mind-brain systems for spatial orientation. Geoforum. 23(2): p. 151-164.
148. Schwartz, A. (1993). Thinking locally to act globally: a novel approach to reinforcement learning. In Fifteenth Annual Conference of the Cognitive Science Society. University of Colorado, Boulder. Lawrence Erlbaum Associates.
149. Seligman, M.E.P. and S.F. Maier (1967). Failure to escape traumatic shock. Journal of Experimental Psychology. 74: p. 1-9.
150. Shannon, S. and J.E.W. Mayhew (1990). Simple associative memory. Research Initiative in Pattern Recognition, UK. Report no. RIPREP/1000/78/90.
151. Shepard, R.N. (1990). Internal representation of universal regularities. In Neural Connections, Mental Computation, L. Nadel, L.A. Cooper, P. Culicover, and R.M. Harnish, Editor. Bradford Books: Cambridge, MA.
152. Sherry, D.F. and D.L. Schacter (1987). The evolution of multiple memory systems. Psychological Review. 94(4): p. 439-454.
153. Shimamura, A.P. (1989). Disorders of memory: the cognitive science perspective. In Handbook of Neuropsychology, F. Boller and J. Grafman, Editor. Elsevier Press: Amsterdam.
154. Siegel, A.W. and S.H. White (1975). The development of spatial representations of large-scale environments. In Advances in Child Development and Behaviour, H.W. Reese, Editor. Academic Press.
155. Smith, R. and P. Cheeseman (1986). On the representation and estimation of spatial uncertainty. International Journal of Robotics Research. 5(4): p. 56-68.
156. Smith, R., M. Self, and P. Cheeseman (1987). A stochastic map for uncertain spatial relationships. In Workshop on Spatial Reasoning and Multisensor Fusion.
157. Snaith, M. and O. Holland (1991). A biologically plausible mapping system for animat navigation. Technology Applics. Group, Alnwick, UK.
158. Soldo, M. (1990). Reactive and preplanned control in a mobile robot. In IEEE Conference on Robotics and Automation. Cincinnati, Ohio.
159. Squire, L.R. (1987). Memory and Brain. Oxford: Oxford University Press.
160. Squire, L.R. and S. Zola-Morgan (1988). Memory: brain systems and behaviour. Trends in Neuroscience. 11: p. 170-175.
161. Stevens, A. and P. Coupe (1978). Distortions in judged spatial relations. Cognitive Psychology. 10: p. 422-437.
162. Sutton, R.S. (1984). Temporal Credit Assignment in Reinforcement Learning Control. PhD thesis. University of Massachusetts.
163. Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning. 3: p. 9-44.
164. Sutton, R.S. (1990). Integrated architectures for learning, planning and reacting based on approximate dynamic programming. In 7th Int. Conf. on Machine Learning. Austin, Texas. Morgan Kaufmann.
165. Sutton, R.S. and A.G. Barto (1981). Toward a modern theory of adaptive networks: expectation and prediction. Psychological Review. 88(2): p. 135-170.
166. Sutton, R.S. and B. Pinette (1985). The learning of world models by connectionist networks. In Proceedings of the Seventh Annual Conference of the Cognitive Science Society.
167. Tesauro, G.J. (1991). Practical issues in temporal difference learning. Machine Learning. 8(3/4): p. 257-278.
168. Thorndike, E.L. (1911). Animal Intelligence. Darien, Conn.: Hafner.
169. Thrun, S. and K. Moller (1992). Active exploration in dynamic environments. In Advances in Neural Information Processing Systems 4. Denver. San Mateo, CA: Morgan Kaufmann.
170. Toates, F. and P. Jensen (1990). Ethological and psychological models of motivation: towards a synthesis. In From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behaviour. Paris. Cambridge, MA: MIT Press.
171. Todd, P.M. and S.W. Wilson (1992). Environment structure and adaptive behaviour from the ground up. In From Animals to Animats: Proceedings of the Second International Conference on Simulation of Adaptive Behaviour. Honolulu, USA. Cambridge, MA: MIT Press.
172. Torras, C. (1990). Motion planning and control: symbolic and neural levels of computation. In Proceedings of the 3rd COGNITIVA Conference.
173. Trabasso, T. (1963). Stimulus emphasis and all-or-none learning of concept identification. Journal of Experimental Psychology. 65: p. 395-406.
174. Turchan, M. and A. Wong (1985). Low-level learning for a mobile robot: environment model acquisition. In 2nd Conference on Artificial Intelligence Applications.
175. Tversky, B. (1981). Distortions in memory for maps. Cognitive Psychology. 13: p. 407-433.
176. Vogl, T.P., J.K. Mangis, A.K. Rigler, W.T. Zink, and D.L. Alkon (1988). Accelerating the convergence of the back-propagation method. Biological Cybernetics. 59: p. 257-263.
177. Watkins, C.J.H.C. (1989). Learning from Delayed Rewards. PhD thesis. King’s College, Cambridge.
178. Wehner, R. (1983). Celestial and terrestrial navigation: human strategies - insect strategies. In Neuroethology and Behavioural Physiology, F. Huber and H. Markl, Editor. Springer-Verlag: Berlin.
179. Wehner, R. and R. Menzel (1990). Do insects have cognitive maps? Annual Review of Neuroscience. 13: p. 403-414.
180. Werbos, P.J. (1977). Advanced forecasting methods for global crisis warning and models of intelligence. General Systems Yearbook. 22: p. 25-38.
181. Whitehead, S.D. and D.H. Ballard (1990). Active perception and reinforcement learning. Neural Computation. 2: p. 409-419.
182. Widrow, B. (1962). Generalization and information storage in networks of Adaline ‘neurons’. In Self-Organizing Systems, M.C. Jovitz, J.T. Jacobi, and G. Goldstein, Editor. Spartan Books: Washington, DC. p. 435-461.
183. Widrow, B. and S.D. Stearns (1985). Adaptive Signal Processing. Englewood Cliffs, NJ: Prentice-Hall.
184. Williams, R.J. (1988). Towards a theory of reinforcement learning connectionist systems. College of Computer Science, Northeastern University, Boston, MA. Report no. NU-CCS-88-3.
185. Williams, R.J. and L.J. Baird III (1990). A mathematical analysis of actor-critic architectures for learning optimal controls through incremental dynamic programming. In Sixth Yale Workshop on Adaptive and Learning Systems. New Haven, CT.
186. Willingham, D.B., M.J. Nissen, and P. Bullemer (1989). On the development of procedural knowledge. Journal of Experimental Psychology: Learning, Memory and Cognition. 15: p. 1047-1060.
187. Wilson, S.W. (1990). The animat path to AI. In From Animals to Animats: Proceedings of the First International Conference on the Simulation of Adaptive Behaviour. Paris. Cambridge, MA: MIT Press.
188. Wong, V.S. and D.W. Payton (1987). Goal-oriented obstacle avoidance through behaviour selection. SPIE Mobile Robots II. 852: p. 2-10.
189. Worden, R. (1992). Navigation by fragment fitting: a theory of hippocampal function. Hippocampus. 2(2): p. 165-188.
190. Zapata, R., P. Lepinay, C. Novales, and P. Deplanques (1992). Reactive behaviours of fast mobile robots in unstructured environments: sensor-based control and neural networks. In From Animals to Animats: Proceedings of the Second International Conference on Simulation of Adaptive Behaviour. Honolulu, USA. Cambridge, MA: MIT Press.
191. Zipser, D. (1983). The representation of location. Institute for Cognitive Science, University of California at San Diego. Report no. ICS 8301.
192. Zipser, D. (1983). The representation of maps. Institute for Cognitive Science, University of California at San Diego. Report no. ICS 8304.
193. Zipser, D. (1986). Biologically plausible models of place recognition and place location. In Parallel Distributed Processing: Explorations in the Micro-structure of Cognition, Volume 2, J.L. McClelland and D.E. Rumelhart, Editor. Bradford: Cambridge, MA. p. 432-470.