Reinforcement Learning in High-Diameter,
Continuous Environments
Jefferson Provost
Report AI-TR-07-344
[email protected]
http://www.cs.utexas.edu/users/nn/
http://www.cs.utexas.edu/users/qr/
Artificial Intelligence Laboratory
The University of Texas at Austin
Austin, TX 78712
Copyright
by
Jefferson Provost
2007
The Dissertation Committee for Jefferson Provost
certifies that this is the approved version of the following dissertation:
Reinforcement Learning In High-Diameter,
Continuous Environments
Committee:
Benjamin J. Kuipers, Supervisor
Risto Miikkulainen, Supervisor
Raymond Mooney
Bruce Porter
Peter Stone
Brian Stankiewicz
Reinforcement Learning In High-Diameter,
Continuous Environments
by
Jefferson Provost, B.S.
Dissertation
Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
Doctor of Philosophy
The University of Texas at Austin
August 2007
Acknowledgments
A couple of years ago someone forwarded me a link to an article interpreting Frodo’s quest in The
Lord of the Rings as an allegory for writing and filing a dissertation (Lee, 2005). At the time, I saw
this as nerdy humor for PhD-seeking Tolkien geeks like myself. But when I read the article again
last week, it made me embarrassingly emotional. I won’t belabor this page with the details, to spare
you the boredom, and to spare myself any more of the ridicule that Pat, Laura, Foster, Pic, and Todd
surely all have waiting for me after reading that. (What else are friends for, after all?) I’ll only say
this: I think I had much more fun than Frodo ever did (except maybe just before the end, where
we’re about even), and my fellowship of advisors, friends, and family was far larger and richer than
Frodo’s.
My first thanks go to my advisors and committee. Ben Kuipers and Risto Miikkulainen were
simultaneously the most unlikely and yet most complementary co-advisors I could imagine. Each
provided his own form of guidance, rigor, inspiration, and prodding without which my education
would be incomplete. Ray Mooney, Bruce Porter, Peter Stone, and Brian Stankiewicz all were there
over the years with inspiration, advice, ideas, and sometimes even beer.
Pat Beeson, Joseph Modayil, Jim Bednar and Harold Chaput were my comrades in the
trenches at the UT AI Lab—all just as quick with a laugh as a good idea. Along with Tal Tversky,
Ken Stanley, Aniket Murarka, Ram Ramamoorthy, Misha Bilenko, Prem Melville, Nick Jong, Sugato Basu, Shimon Whiteson, Bobby Bryant, Tino Gomez, and everyone else... It’s hard to imagine
another collection of great minds who would also be so much fun to be around.
My brother Foster was my secret weapon through the years, providing advice from the
perspective of a professor, but intended for a peer. And without him, who would Pat and I have to
drink with in a red state on election night? ...all that, and he let me freeload in his hotel rooms at
conferences for years. Thanks, man!
To Pic, Todd, Rob, Rachel, Sean, Meghan, Ezra, Christy, and all my other non-UT friends,
it’s great to know that there’s life and laughs outside academia. Thanks for sticking with me.
Last, first and always are Laura and Maggie. Laura trooped along with me for all these
years, only occasionally asking me what the heck I was doing and why the **** it was taking so
long. Well here it is. I love you. Maggie, some day you’ll read all this nonsense and say, “Are you
kiddin’ me?!” And you won’t have any idea why your mom and dad think that’s so hilarious. I love
you, bucket!
J EFF P ROVOST
The University of Texas at Austin
August 2007
Reinforcement Learning In High-Diameter,
Continuous Environments
Publication No.
Jefferson Provost, Ph.D.
The University of Texas at Austin, 2007
Supervisor: Benjamin J. Kuipers
Many important real-world robotic tasks have high diameter, that is, their solution requires a
large number of primitive actions by the robot. For example, they may require navigating to distant
locations using primitive motor control commands. In addition, modern robots are endowed with
rich, high-dimensional sensory systems, providing measurements of a continuous environment. Reinforcement learning (RL) has shown promise as a method for automatic learning of robot behavior,
but current methods work best on low-diameter, low-dimensional tasks. Because of this problem,
the success of RL on real-world tasks still depends on human analysis of the robot, environment,
and task to provide a useful set of perceptual features and an appropriate decomposition of the task
into subtasks.
This thesis presents Self-Organizing Distinctive-state Abstraction (SODA) as a solution to
this problem. Using SODA, a robot with little prior knowledge of its sensorimotor system, environment, and task can automatically reduce the effective diameter of its tasks. First, it uses a
self-organizing feature map to learn higher level perceptual features while exploring using primitive, local actions. Then, using the learned features as input, it learns a set of high-level actions that
carry the robot between perceptually distinctive states in the environment.
Experiments in two robot navigation environments demonstrate that SODA learns useful
features and high-level actions, that using these new actions dramatically speeds up learning for
high-diameter navigation tasks, and that the method scales to large (building-sized) robot environments. These experiments demonstrate SODA’s effectiveness as a generic learning agent for mobile
robot navigation, pointing the way toward developmental robots that learn to understand themselves
and their environments through experience in the world, reducing the need for human engineering
for each new robotic application.
Contents

Acknowledgments
Abstract
Contents
List of Tables
List of Figures

Chapter 1 Introduction
1.1 Motivation
1.2 Reinforcement Learning in Realistic Robots
1.3 Learning Without Prior Knowledge
1.4 Bootstrapping to Higher Levels of Representation
1.5 Approach
1.6 Overview of the Dissertation

Chapter 2 Background and Related Work
2.1 The Building Blocks of SODA
2.1.1 The Spatial Semantic Hierarchy (SSH)
2.1.2 Reinforcement Learning
2.1.3 Self-Organizing Maps (SOMs)
2.2 Bootstrap Learning
2.2.1 Drescher’s Schema System
2.2.2 Constructivist Learning Architecture
2.2.3 Bootstrap Learning for Place Recognition
2.2.4 Learning an Object Ontology with OPAL
2.2.5 Other methods of temporal abstraction
2.3 Automatic Feature Construction for Reinforcement Learning
2.3.1 U-Tree Algorithm
2.3.2 Backpropagation
2.4 SODA and Tabula Rasa Learning
2.5 Conclusion

Chapter 3 Learning Method and Representation
3.1 Overview
3.2 Top_n State Representation
3.3 Trajectory Following
3.4 Hill-climbing
3.4.1 HC using Gradient Approximation
3.4.2 Learning to Hill-Climb with RL
3.5 Conclusion

Chapter 4 Learning Perceptual Features
4.1 Experimental Setup
4.2 T-Maze Environment
4.3 ACES Fourth Floor Environment
4.4 Feature Discussion

Chapter 5 Learning High-Level Actions
5.1 Trajectory Following
5.2 Hill Climbing
5.2.1 Hand-coded Predictive Action Model
5.2.2 Hill-Climbing Experiment
5.2.3 HC Learning Discussion
5.3 Action Learning Conclusion

Chapter 6 Learning High-Diameter Navigation
6.1 Learning in the T-Maze
6.2 The Role of Hill-Climbing
6.3 Scaling Up The Environment: ACES 4
6.4 Navigation Discussion
6.5 Navigation Conclusion

Chapter 7 Discussion and Future Directions
7.1 SODA with Perceptual Aliasing
7.2 Dealing with Turns
7.2.1 Alternative Distance Metrics for Feature Learning
7.2.2 Learning Places, Paths, and Gateways
7.3 Scaling to More Complex Sensorimotor Systems
7.4 Prior Knowledge and Self-Calibration
7.5 Discussion Summary

Chapter 8 Summary and Conclusion

Bibliography

Vita
List of Tables

3.1 Open-loop and Closed-loop Trajectory-following. Top: The open-loop trajectory-following macro repeats the same action until the current SOM winner changes. Bottom: The closed-loop Option policy chooses the action with the highest value from the Option’s action set. The action set, defined in Equation (3.7), is constructed to force the agent to make progress in the direction of a_i while being able to make small orthogonal course adjustments.

3.2 Pseudo-code for hill-climbing policy π_i^HC using gradient estimation. The value of G_ij is the estimated change in feature f_i with respect to primitive action a_j^0. G_ij can be determined either by sampling the change induced by each action or by using an action model to predict the change. Sampling is simple and requires no knowledge of the robot’s sensorimotor system or environment dynamics, but is expensive, requiring 2|A^0| − 2 sampling steps for each movement.

3.3 Summary of the components of SODA.

4.1 T-Maze Abstract Motor Interface and Primitive Actions

4.2 ACES Abstract Motor Interface and Primitive Actions

5.1 The average trajectory length for open loop vs. learned trajectory-following. Learned TF options produce longer trajectories.
List of Figures

1.1 The Reinforcement Learning Problem Framework. In reinforcement learning problems an agent interacts with an environment by performing actions and receiving sensations and a reward signal. The agent must learn to act so as to maximize future reward.

1.2 The SODA Architecture. This expanded version of the diagram in Figure 1.1 shows the internal structure of the SODA agent (for the purposes of this description, the physical robot is considered part of the environment, and not the agent). The agent receives from the environment a sensation in the form of a continuous sense vector. This vector is passed to a self-organizing feature map that learns a set of high-level perceptual features. These perceptual features and the scalar reward passed from the environment are fed as input to a learned policy that is updated using reinforcement learning. This policy makes an action selection from among a set of high-level actions consisting of combinations of learned trajectory-following (TF) and hill-climbing (HC) control laws. These high-level actions generate sequences of primitive actions that take the form of continuous motor vectors sent from the agent to the robot.

2.1 Continuous-to-Discrete Abstraction of Action. In the SSH Control Level (top), continuous action takes the form of hill-climbing control laws, that move the robot to perceptually distinctive states, and trajectory-following control laws, that carry the robot from one distinctive state into the neighborhood of another. In the Causal Level, the trajectory-following/hill-climbing combination is abstracted to a discrete action, A, and the perceptual experience at each distinctive state into views V1 and V2, forming a causal schema ⟨V1|A|V2⟩. (Adapted from Kuipers (2000).)

2.2 Example 5x5 Self-Organizing Map. A SOM learning a representation of a high-dimensional, continuous state space. In training, each sensor image is compared with the weight vector of each cell, and the weights are adapted so that over time each cell responds to a different portion of the input space. [Figure adapted from (Miikkulainen, 1990)]

2.3 A Growing Neural Gas (GNG) Network. An example GNG network adapting to model an input distribution with 1-dimensional, 2-dimensional, and 3-dimensional parts. The GNG network has several properties that make it a potential improvement over typical SOMs for our feature learning problem: it requires no prior assumptions about the dimensionality of the task, and it can continue to grow and adapt indefinitely. The Homeostatic-GNG developed for SODA regulates its growth in order to maintain a fixed value of cumulative error between the input and the winning units. Figure adapted from Fritzke (1995).

4.1 Simulated Robot Environment, T-Maze. (a) A “sensor-centric” plot of a single scan from the robot’s laser rangefinder in the T-maze environment, with the individual range sensor on the X axis and the range value on the Y axis. (b) The egocentric plot of the same scan in polar coordinates, with the robot at the origin, heading up the Y axis. This format provides a useful visualization of laser rangefinder scans for human consumption. (c) A screen-shot of the Stage robot simulator with the robot in the pose that generated the scan used in (a) and (b). The shaded region represents the area scanned by the laser rangefinder. The robot has a drive-and-turn base, and a laser rangefinder. The environment is approximately 10 meters by 6 meters. The formats in (a) and (b) will be used in later figures to display the learned perceptual prototypes.

4.2 Example Learned Perceptual Prototypes, T-Maze. The agent’s self-organizing feature map learns a set of perceptual prototypes that are used to define perceptually distinctive states in the environment. This figure shows the set of features learned from one feature-learning run in the T-Maze. In each row the first figure represents a unit in the GNG, and the other figures represent the units it is connected to in the GNG topology. Some rows are omitted to save space, but every learned feature appears at least once. Each feature is a prototypical laser rangefinder image. On the left, the ranges are plotted in the human-readable format of Figure 4.1(b). On the right, the ranges are plotted in the sensor-centric format of Figure 4.1(a). The agent learns a rich feature set covering the perceptual situations in the environment.

4.3 Three Learned Prototypes, Enlarged. Enlarged views of features 8, 35, and 65 from Figure 4.2. These figures show prototypical sensory views of looking down the long corridor in the T-Maze, looking toward the side wall at the end of the corridor, and the view from the middle of the T intersection, respectively.

4.4 Simulated Robot Environment, ACES. The ACES4 environment. A simulated environment generated from an occupancy grid map of the fourth floor of the ACES building at UT Austin. The map was collected using a physical robot similar to the simulated robot used in these experiments. The environment is approximately 40m × 35m. The small circle represents the robot. The area swept by the laser rangefinder is shaded. This environment is much larger and perceptually richer than the T-maze.

4.5 Example Learned Perceptual Prototypes, ACES. The agent’s self-organizing feature map learns a set of perceptual prototypes that are used to define perceptually distinctive states in the environment. The richer variety of perceptual situations in the ACES environment produces a larger set of features using the same parameter settings for the Homeostatic-GNG.

4.6 Views of T-Maze Intersection. The use of Euclidean distance as the similarity metric for the GNG leads to the learning of many similar views that differ only by a small rotation. A small rotation in the robot’s configuration space induces a large (Euclidean) movement in the robot’s input space, causing large distortion error in the GNG, which must add more features in that region of the space to reduce the error. The large number of features means that the agent must traverse many more distinctive states when turning than when traveling forward and backward. Nevertheless, the experiments in Chapter 6 show that SODA still greatly reduces task diameter and improves navigation learning.

5.1 Trajectory Following Improvement with Learning. These figures show the results of 100 trajectory following runs from each of three locations (marked with large black disks). The end point of each run is marked with a ’+’. The top figure shows the results of open-loop TF, and the bottom figure shows the results of TF learned using reinforcement learning. The TF learned using RL are much better clustered, indicating much more reliable travel.

5.2 The average trajectory length and endpoint spread for open loop vs. learned trajectory-following for the last 100 episodes in each condition. Learned TF options produce longer trajectories. Error bars indicate ± one standard error. All differences are significant at p < 6 × 10^-5.

5.3 Test Hill-Climbing Features. The hill-climbing experiments tested SODA’s ability to hill-climb on these five features from the GNG in Figure 4.2 in the T-Maze environment.

5.4 Hill-climbing Learning Curves. These curves compare learned hill-climbing using the Top_3, Top_4, and Top_5 state representations against hill-climbing by approximating the gradient with user-defined linear action models. Each plot compares hill-climbers on one of the five different SOM features in Figure 5.3. The Y axis indicates the final feature activation achieved in each episode. The thick straight line indicates the mean performance of sampling-based HC, and the two thin straight lines indicate its standard deviation. This figure shows that learned HC does about as well as an HC controller that manually approximates the feature gradient at each point, and sometimes better.

5.5 Hill-climbing performance with and without learned options. Using learned options makes hill-climbing achieve the same feature values faster. Top: The average lengths of hill-climbing episodes in the neighborhoods of the three different features shown in Figure 5.3. All differences are significant (p < 2 × 10^-6). The bottom chart shows the average maximum feature value achieved for each prototype per episode. The plots compare the last 100 HC episodes for each feature with 100 hard-coded HC runs. Differences are significant between learned options and the other two methods (p < 0.03) for all features except 65, which has no significant differences. Across all features, the maximum values achieved are comparable, but the numbers of actions needed to achieve them are much smaller.

6.1 T-Maze Navigation Targets. The red circles indicate the locations used as starting and ending points for navigation in the T-Maze in this chapter. In the text, these points are referred to as top left, top right, and bottom. Navigation between these targets is a high-diameter task: they are all separated from each other by hundreds of primitive actions, and from each target the other two are outside the sensory horizon of the robot.

6.2 T-Maze Learning, all routes. These learning curves show the length per episode for learning to navigate between each pair of targets shown in Figure 6.1. The curves compare learning with A^1 actions and learning with A^0 actions. Each curve is the aggregate of three runs using each of three trained GNGs. Error bars indicate +/- one standard error. In all cases the A^1 agents dramatically outperform the A^0 agents.

6.3 Navigation using Learned Abstraction. An example episode after the agent has learned the task using the A^1 actions. The triangles indicate the state of the robot at the start of each A^1 action. The sequence of winning features corresponding to these states is [8, 39, 40, 14, 0, 4, 65, 7, 30, 62], shown in Figure 6.4. The narrow line indicates the sequence of A^0 actions used by the A^1 actions. In two cases the A^1 action essentially abstracts the concept, ‘drive down the hall to the next decision point.’ Navigating to the goal requires only 10 A^1 actions, instead of hundreds of A^0 actions. In other words, task diameter is vastly reduced.

6.4 Features for Distinctive States. These are the perceptual features for the distinctive states used in the navigation path shown in Figure 6.3, in the order they were traversed in the solution. (Read left-to-right, top-to-bottom.) The first feature [8] and the last two [30, 62] represent the distinctive states used to launch long actions down the hallways, while the intervening seven features [39, 40, 14, 0, 4, 65, 7] show the robot progressively turning to the right to follow the lower corridor. The large number of turn actions is caused by the large number of features in the GNG used to represent views separated by small turns, discussed in Section 4.4 and Figure 4.6. Despite the many turn actions, however, SODA still reduces the task diameter by an order of magnitude over primitive actions.

6.5 A^1 vs TF-only Learning Performance, T-Maze. These learning curves compare the length per episode in the T-Maze top-left-to-bottom navigation task using primitive actions (A^0), trajectory-following alone (TF), and trajectory following with hill-climbing (A^1). Each curve is the average of 50 runs, 5 each using 10 different learned SOMs. Error bars indicate +/- one standard error. The agents using just TF actions, without the final HC step, learn as fast and perform slightly better than the agents using A^1 actions. (TF vs A^1 performance is significantly different, p < 0.0001, in the last 50 episodes.)

6.6 Hill-climbing Reduces Transition Entropy. These plots compare the average transition entropy for all state/action pairs for each of the 10 different GNGs used in the T-Maze experiment. The x-axis indicates the transition entropy (in bits) using hill-climbing, and the y-axis indicates the entropy without hill-climbing. The solid line indicates equal entropy (y = x). Error bars indicate 95% confidence intervals. Hill-climbing reduces the entropy by about 0.4 bits, on average. This is approximately equivalent to a reduction in branching factor from 3 to 2. These results indicate that hill-climbing makes actions more deterministic, making them more useful for building planning-based abstractions on top of SODA.

6.7 Hill-climbing Improves Task Diameter. Bars show the average abstract length per episode of the successful episodes taken from the last 100 episodes for each agent in the T-Maze experiment. Abstract length is measured in the number of high-level actions needed to complete the task. Using trajectory-following with hill-climbing, the agents require an average of 23 actions to complete the task, while without hill-climbing they require an average of 67. (Error bars indicate +/- one standard error. Differences are statistically significant with infinitesimal p.) This result indicates that the hill-climbing component of A^1 actions will make them more useful for building future, planning-based abstractions on top of the SODA abstraction.

6.8 ACES Navigation Task. The circles indicate the locations used as starting and ending points for navigation in ACES in this chapter. The green circle on the right indicates the starting point and the red circle on the left indicates the ending point. The shaded area shows the robot’s field of view. The longer task and added complexity of the environment make this task much more difficult than the tasks in the T-maze.

6.9 Learning Performance in the ACES Environment. These learning curves compare the length per episode for learning to navigate in ACES from the center-right intersection to the lower-left corner of the environment. The curves compare primitive actions and A^1 actions. Each curve is the average of 50 runs, five each using 10 different learned SOMs. Error bars indicate +/- one standard error. The minimum length path is around 1200 actions. The agents using the high-level actions learn to solve the task while those using only primitive actions have a very difficult time learning to solve the task in the allotted time.

7.1 Grouping SODA Actions into Gateways, Places and Paths. Topological abstraction of SODA’s actions could be accomplished by identifying the states that initiate or terminate sequences of long similar actions as gateways, then aggregating the sequences of states connected by long actions between two gateways into paths, and the collections of states connecting groups of gateways into places. In this T-Maze example, from Chapter 6, the states in which the robot enters and leaves the intersection terminate and begin sequences of long actions, and thus are gateways. The states and actions moving the robot down the corridor are grouped into paths, and the collection of states traversed in turning are grouped into a place. This second-level (A^2) abstraction would reduce the diameter of this task from ten actions to three.
Chapter 1
Introduction
A generic learning agent boots up in a brand new mobile robot, with a high-resolution, high-dimensional range sensor and continuous motor commands that send voltages to the wheel motors to drive and turn the robot. It also has a list of tasks that require navigating to distant locations in an office building. Being generic, however, the agent knows only that it has a large vector of continuous input values, a few continuous outputs, and some reward signals whose values it must maximize over time. This dissertation addresses the following question: under these conditions, how can the agent learn to complete its tasks?
1.1 Motivation
The ultimate goal of machine learning in robotics is an agent that learns autonomously. In other
words, it is a generic learning agent that can learn to perform its task without needing humans to
point out relevant perceptual features or break down its task into more tractable subtasks. Constructing such an agent is difficult for two major reasons: (1) many important real-world robotic tasks have a high diameter, that is, their solution requires a large number of primitive actions by the robot; and (2) modern robots are endowed with rich, high-dimensional sensorimotor systems, providing measurements of a continuous environment and sending commands to a continuous motor
system. These factors make the space of possible policies for action extremely large.
Methods of reinforcement learning — by which an agent learns to maximize a reward signal — have shown promise for learning robot behavior, but currently work best on low-diameter,
low-dimensional tasks. Thus, current applications of reinforcement learning to real-world robotics
require prior human analysis of the robot, environment, and task to provide a useful set of perceptual
features and an appropriate task decomposition.
This dissertation presents Self-Organizing Distinctive-state Abstraction (SODA), a method by which a robot with little prior knowledge of its sensorimotor system and environment can automatically reduce the diameter of its tasks by bootstrapping up from its raw sensorimotor system to high-level perceptual features and large-scale, discrete actions. Using unsupervised learning, the
agent learns higher level perceptual features while exploring using primitive, local actions. Then,
using the learned features, it builds a set of high-level actions that carry it between perceptually
distinctive states in the environment. More specifically,
• SODA improves reinforcement learning in realistic robots, that is, robots with rich sensorimotor systems in high-diameter, continuous environments;
• it learns autonomously, with little or no prior knowledge of its sensorimotor system, environment, or task;
• it develops an accessible internal representation that is suitable for further bootstrapping to
yet higher representations.
The remainder of this chapter discusses each of these goals in more detail. Section 1.2
briefly reviews the broad field of reinforcement learning, and discusses the difficulties of applying
reinforcement learning in robots with rich sensorimotor systems in high-diameter, continuous environments. Section 1.3 discusses the motivations for learning without prior knowledge. Section 1.4
motivates the need for learning an internal representation suitable for further bootstrap learning, and
the constraints it places upon the methods to be used. Section 1.5 gives an overview of the SODA
method. Finally, Section 1.6 lays out the organization of the rest of the material in the dissertation.
1.2 Reinforcement Learning in Realistic Robots
The learning problem that SODA addresses is a special case of reinforcement learning. Such problems all involve an agent interacting with an environment. The agent receives sensations from the
environment, and performs actions that may change the state of the environment. It also receives a
scalar reward signal that indicates how well it is performing its task. The learning problem for the
agent, generally, is to discover a policy for action that maximizes the cumulative reward. Rewards
may be sparsely distributed, and delayed, for example occurring only when the agent achieves some
goal. As a result of this sparsity, the agent must determine how to assign credit (or blame) for the
reward among all of the decisions leading up to the reward.
Reinforcement learning spans a vast set of learning problems that vary along many dimensions. Sensations and actions may be continuous or discrete. Rewards may be frequent or sparse,
immediate or delayed. Tasks may be episodic, with a definite termination, or may continue indefinitely. Agents may begin learning tabula rasa, or may be given a full model of the environment
a priori. A single agent may learn alone, or multiple agents may cooperate to maximize a single,
global reward signal, or compete against one another.
SODA addresses the problem of a single agent learning in a realistic robot. Realistic robots
have a number of features that make reinforcement learning with little prior knowledge difficult.
Figure 1.1: The Reinforcement Learning Problem Framework. In reinforcement learning problems an agent interacts with an environment by performing actions and receiving sensations and a reward signal. The agent must learn to act so as to maximize future reward.
There are two particular features of realistic robots that are of interest to this dissertation: (1) they have rich sensors measuring a complex environment, and (2) they have short-range, noisy, local actions that give their tasks a high diameter. Efficient learning requires an abstraction of the continuous, high-resolution sensorimotor system to make the problem tractable.
First, modern robots have rich sensory systems, such as laser range-finders or cameras, that provide an array of hundreds or thousands of measurements of the environment at once. In addition, real environments are rich and complex. In typical robotics applications, engineers use their
knowledge of the nature of the sensors and environment to endow the agent with an appropriate
set of feature detectors with which to interpret its sensory input. For example, an agent might be
programmed with perceptual routines designed based on the knowledge that its sensors are rangefinders and that it should attempt to extract features in the form of line segments because it operates
in an indoor office environment. An agent learning to operate a robot without such prior knowledge
is confronted simply with a large vector of numbers. The agent must learn on its own to extract
the appropriate perceptual features from that input vector by acting in the world and observing the
results.
Second, modern robots have motor systems that accept continuous-valued commands, such
as voltages for controlling motors. Although robots controlled by digital computers still act by
sending a discrete sequence of such commands, the size of the discretization is typically determined
by the controller’s duty cycle, which usually has a frequency of many cycles per second. As a result,
the robot’s primitive actions are typically fine-grained and local. Performing meaningful tasks, such
as navigating to the end of a corridor, may require hundreds or thousands of these actions. Such
long sequences of actions, combined with sparse reward, make the problem of assigning credit to
the decisions in that sequence difficult, requiring a great deal of exploration of the space of possible
sequences of actions. An engineer might shorten the task diameter by providing the agent with
prior knowledge encoded in a set of high-level action procedures that abstract a long sequence of
actions into a single step, such as turn-around or follow-wall-to-end. Without this prior knowledge,
an agent must develop its own high-level abstraction of action if it is to reduce its task diameter.
In addition to rich sensors and local actions, real, physical robots have a number of other
properties that make their use challenging, but that are not directly related to the main thesis of
this dissertation, which is the autonomous development of a useful abstraction to aid reinforcement
learning. These properties include the fact that they can operate for only a limited time on a single battery charge and that they often must be physically repositioned at a starting point in order to run controlled experiments. For these reasons, the evaluation of SODA in Chapters 4, 5, and 6 is done in a realistic simulation in which the robot is modeled on an actual physical robot with rich sensors, fine-grained continuous actions, and realistic noise. This simulation allows many experiments to be run very
efficiently while preserving the properties that are important for evaluating SODA.
1.3 Learning Without Prior Knowledge
The second goal of this work is to make progress toward the generic autonomous learning agent
mentioned at the outset of this chapter. There are both engineering and scientific reasons for this
goal. The engineering reason is that building an agent that requires little prior knowledge to learn
a task saves human labor in adding the knowledge manually. Furthermore, such an agent should be
robust and adaptable to new situations, since it must learn to act with few prior assumptions about
itself or its world. The scientific reason is that by exploring the limits of how much prior knowledge
is required to learn to act in the world we gain a greater understanding of the role of such knowledge.
With such understanding, we can make progress toward a general theory of how much knowledge
is necessary to perform a task, and how and when one might add prior knowledge to an agent to
expedite the solution of a particular problem, without compromising the agent’s robustness and
adaptability to new situations.
In practice, it is probably impossible for an artificial agent to begin learning entirely tabula
rasa. At a minimum, it is likely that the algorithms used will have free parameters that must be set
properly for learning to succeed, and their values will encapsulate some prior knowledge. The representational framework will likely also embody some assumptions about the nature of the problem
and the world. In addition, SODA is not intended to address every single problem encountered in
applying reinforcement learning in robots. Some problems, such as the inability of a robot to completely observe its state through available sensors, are orthogonal research questions, already under investigation elsewhere (see Section 2.2.5). When such problems, known as perceptual aliasing or partial observability, are encountered in experimental demonstrations of SODA, it is reasonable to add sufficient additional information to SODA’s sensory input (e.g., a compass or coarse location sensor) to allow the experiments to run without confounding the issues of high diameter with those
of partial observability.
However, by consciously attempting to identify and limit the prior knowledge given to the
agent, we can try to set an upper bound on the prior knowledge needed for an important class of
learning problems, and make steps toward a principled framework for understanding how and when
such knowledge should be added. One means of constructing such a theory is to divide an agent’s
cognitive system into a set of modules with well-defined inputs, and then examine the assumptions
that the modules make about their inputs to see how much prior knowledge must be engineered
into each module. In this way, it may be possible to create a hierarchy of such modules, in which
the lowest-level modules assume only raw input and output from the robot, and the higher-level modules’ assumptions are satisfied by the output from the lower modules, bootstrapping up to high-level behavior.
1.4 Bootstrapping to Higher Levels of Representation
SODA can be seen as one investigation in a larger research area known as Bootstrap Learning
(Kuipers et al., 2006), that seeks to understand how an agent can begin with the “pixel-level” ontology of continuous sensorimotor experience with the world and learn high-level concepts needed
for common-sense knowledge. This learning is achieved through the hierarchical application of
simple-but-general learning algorithms, such that the concepts learned by one level become the input to learning processes at the next level. One goal of SODA is to produce a method that can serve
as a substrate for bootstrapping of yet higher-level concepts and behaviors. Thus it is desirable to
produce representations that can be used as input by a wide variety of learning methods.
This goal constrains the choices of methods available for constructing the learning architecture. Some methods, such as backpropagation networks and neuro-evolution, are attractive for their
performance on these kinds of problems, but their internal representations are implicit, distributed
and often inscrutable. These properties make it difficult to use them as part of a bootstrap learning
process. The methods used in SODA were chosen in part because they abstract the agent’s experience into components, such as discrete symbols and kernel functions, that are usable by a wide
variety of learning methods.
1.5 Approach
Given a robot with high-dimensional, continuous sensations, continuous actions, and one or more
reinforcement signals for high-diameter tasks, the agent’s learning process consists of the following
steps (a detailed formal description of the method is provided in Chapter 3).
Figure 1.2: The SODA Architecture. This expanded version of the diagram in Figure 1.1 shows the internal structure of the SODA agent (for the purposes of this description, the physical robot is considered part of the environment, and not the agent). The agent receives from the environment a sensation in the form of a continuous sense vector. This vector is passed to a self-organizing feature map that learns a set of high-level perceptual features. These perceptual features and the scalar reward passed from the environment are fed as input to a learned policy that is updated using reinforcement learning. This policy makes an action selection from among a set of high-level actions consisting of combinations of learned trajectory-following (TF) and hill-climbing (HC) control laws. These high-level actions generate sequences of primitive actions that take the form of continuous motor vectors sent from the agent to the robot.
1. Define Primitive Actions. The agent defines a set of discrete, short-range, local actions to
act as the primitive motor operations. They are a fixed discretization of a learnable abstract
motor interface consisting of a set of “principal motor components”.
2. Learn High-level Perceptual Features. The agent explores using the primitive actions, and
feeds the observed sensations to a self-organizing feature map that organizes its units into
a set of high-level perceptual features. These features are prototypical sensory impressions
used both as a discrete state abstraction suitable for tabular reinforcement learning, and a set
of continuous perceptual features for continuous control.
3. Learn High-level Actions. Using these new features, the agent defines perceptually distinctive states as points in the robot’s state space that are the local best match for some perceptual
feature, and creates actions that carry it from one distinctive state to another. The actions are
compositions of (1) trajectory-following control laws that carry the agent into the neighborhood of a new distinctive state, and (2) hill-climbing control laws that climb the gradient of a
perceptual feature to a distinctive state.
4. Learn Tasks. The agent attempts to learn its task using a well-known and simple reinforcement learning method with a tabular state and action representation (such as Sarsa(λ)), using
the perceptual categories in the self-organizing feature map as the state space, and the new
high-level actions as actions.
Figure 1.2 shows the architecture of the learning agent after the high-level features and
actions are learned. This method of Bootstrap Learning, in which each stage of the representation builds upon the previously learned stage, allows the robot to start with a high-diameter, high-dimensional, continuous task and build up a progressively more abstract representation. The learned
features reduce the high-dimensional continuous state space to an atomic, discrete one. Likewise,
the high-level actions reduce the effective diameter of the task by ensuring that the agent moves
through the environment in relatively large steps. This reduction in task diameter allows the agent
to learn to perform its tasks much more quickly than it would if it had to learn a policy over primitive
actions.
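To make this four-step process concrete, the following Python sketch shows how the learned components could fit together in a single agent loop. It is an illustrative outline only, not the implementation described in Chapter 3: the environment interface (returning a sense vector, reward, and termination flag), the feature map object, and the helper make_tf_hc_option that packages a trajectory-following/hill-climbing pair for a feature are hypothetical placeholders, and one-step Sarsa is used in place of Sarsa(λ) for brevity.

import random

class SodaAgentSketch:
    """Illustrative outline of the SODA learning stages (hypothetical interfaces)."""

    def __init__(self, feature_map, primitive_actions,
                 alpha=0.1, gamma=0.9, epsilon=0.1):
        self.fmap = feature_map              # self-organizing feature map (Step 2)
        self.primitives = primitive_actions  # discretized abstract motor interface (Step 1)
        self.options = []                    # TF+HC high-level actions (Step 3)
        self.Q = {}                          # tabular values over (feature, option) pairs
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def learn_features(self, env, steps):
        # Step 2: explore with primitive actions and train the feature map.
        sense = env.reset()
        for _ in range(steps):
            self.fmap.train(sense)
            sense, _, _ = env.step(random.choice(self.primitives))

    def build_actions(self):
        # Step 3: one trajectory-following + hill-climbing option per learned feature.
        self.options = [make_tf_hc_option(f, self.primitives)
                        for f in self.fmap.features()]

    def q(self, s, a):
        return self.Q.get((s, a), 0.0)

    def choose(self, s):
        # Epsilon-greedy selection over high-level actions.
        if random.random() < self.epsilon:
            return random.randrange(len(self.options))
        return max(range(len(self.options)), key=lambda a: self.q(s, a))

    def learn_task(self, env, episodes):
        # Step 4: tabular Sarsa with winning features as states and options as actions.
        for _ in range(episodes):
            sense = env.reset()
            s = self.fmap.winner(sense)      # discrete state = winning feature
            a = self.choose(s)
            done = False
            while not done:
                sense, reward, done = self.options[a].run(env)
                s2 = self.fmap.winner(sense)
                a2 = self.choose(s2)
                target = reward + (0.0 if done else self.gamma * self.q(s2, a2))
                self.Q[(s, a)] = self.q(s, a) + self.alpha * (target - self.q(s, a))
                s, a = s2, a2

The key design point the sketch tries to capture is that the same feature map serves two roles: its continuous feature activations drive the TF and HC control laws inside each option, while its discrete winner index serves as the tabular state for the top-level learner.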
1.6 Overview of the Dissertation
This dissertation demonstrates the effectiveness of SODA in learning navigation, using a series of
experiments in two different environments on a realistic simulation of a mobile robot. The dissertation is structured as follows:
• Chapter 2 describes SODA’s foundations in self-organizing feature maps, reinforcement learning, and the Spatial Semantic Hierarchy. The chapter also discusses related work such as
other autonomous learning architectures, hierarchical reinforcement learning (including the
Options framework), automatic temporal abstraction for learning agents, and automatic feature
construction for reinforcement learning.
• Chapter 3 presents a formal description of SODA’s algorithm and assumptions, a formal definition of trajectory-following and hill-climbing actions, some different methods used for
implementing TF and HC controllers, including learning them with RL, and the novel state representation used by SODA for learning the controllers.
• Chapter 4 describes the simulated robot and the two learning environments used in the experiments in this dissertation, one a small test environment, the other a large, realistic simulation
of the floor of an office building. The chapter goes on to describe the results of the feature
learning phase of SODA in the two environments, showing that the agent learns features representing the variety of possible perceptual situations in those environments, such as corridors
and intersections at different relative orientations to the robot.
• Chapter 5 describes experiments and results comparing learned trajectory-following and hill-climbing actions to hard-coded alternatives, showing that learning the actions as nested reinforcement learning problems makes them more efficient and reliable while minimizing the
amount of prior knowledge required.
• Chapter 6 presents experiments comparing SODA’s ability to learn long-range navigation
tasks to that of an agent using only low-level, primitive actions. SODA learns much faster. In
addition, this chapter analyzes the role of hill-climbing in SODA’s high-level actions. When
SODA uses a hill-climbing step at the end of each high-level action, its actions are more
reliable and the total task-diameter is lower than when just using trajectory-following alone.
• Chapter 7 discusses SODA’s performance, strengths, and weaknesses in more detail, as well as the role of SODA in the general Bootstrap Learning framework. In addition, this chapter
discusses a variety of possible future extensions to or extrapolations from SODA, including
(1) replacing SODA’s model-free reinforcement learning with some form of learned predictive
model, (2) bootstrapping up to a higher, topological, representation, (3) improving SODA’s
feature learning by using better distance metrics for comparing sensory images. In addition,
this section discusses the general role of prior knowledge in the context of creating a “self-calibrating” robot learning algorithm.
• Chapter 8 summarizes the SODA architecture and the experimental results showing the effectiveness of the SODA abstraction in reducing task diameter, and concludes with a view
of the relationship between SODA, bootstrap learning, and the general enterprise of creating
artificially intelligent agents.
Chapter 2
Background and Related Work
The first section of this chapter provides background information on the main pieces of prior work
on which SODA is based: (1) the Spatial Semantic Hierarchy, (2) artificial reinforcement learning,
including hierarchical reinforcement learning, and (3) self-organizing maps. The next sections, 2.2 and 2.3, discuss related work:
• Section 2.2 describes Bootstrap Learning, a philosophy for building agents that learn world
representations from primitive sensorimotor experience by first learning primitive concepts
and using those learned concepts to bootstrap the learning of higher-level concepts.
• Section 2.3 discusses approaches for automatic feature construction or state abstraction in
reinforcement learning, and relates these to SODA’s use of self-organizing maps.
Finally, Section 2.4 looks at SODA vis-a-vis tabula rasa learning, and examines the assumptions
implicit in SODA’s architecture.
2.1 The Building Blocks of SODA
SODA is built upon three major components of prior work: (1) The Spatial Semantic Hierarchy is
used as a framework for building agents that automatically learn to navigate, as discussed in Section 2.1.1; (2) artificial reinforcement learning, discussed in Section 2.1.2, is a SODA agent’s
method for learning how to act; (3) Self-Organizing Maps, described in Section 2.1.3, are used
by SODA to extract useful perceptual features from the agent’s sensory input stream.
2.1.1 The Spatial Semantic Hierarchy (SSH)
The Spatial Semantic Hierarchy (Kuipers, 2000), a model of knowledge of large-scale space used
for navigation, provides part of the representational framework for SODA’s learning problem. The
SSH abstracts an agent’s continuous experience in the world into a discrete topology through a
hierarchy of levels with well-defined interfaces: the control level, the causal level, the topological
level, and the metrical level. SODA is founded on the control and causal levels, which are described
below. A possible extension of SODA based on ideas from the topological level is described in
Section 7.2.2.
The control and causal levels of SSH provide a framework for grounding the reinforcement
learning interface in continuous sensorimotor experience. Specifically, the control level of the SSH
abstracts continuous sensory experience and motor control into a set of discrete states and actions
defined by trajectory-following (TF) and hill-climbing (HC) control laws. It defines the distinctive
state as the principal subgoal for navigation. A distinctive state is a stationary fixed point in the environment at the local maximum of a continuous, scalar perceptual feature, reachable (from within
some local neighborhood) using a hill-climbing control law that moves the robot up the gradient of
the feature. The control level also includes a set of trajectory-following control laws that carry the
robot from one distinctive state into the neighborhood of another. The causal level of the SSH represents the sensorimotor interface in terms of discrete actions that carry the robot from one distinctive
state to another, and views that are a discrete abstraction of the robot’s continuous perceptual state
experienced at distinctive states.
The control level to causal level transition of the SSH provides a discrete state and action
model, well grounded in continuous sensorimotor interaction. This abstraction is the means by
which the learning agent will reduce its task diameter, by doing reinforcement learning in the space
of these large-scale abstract actions, rather than small-scale local actions.
SODA builds on previous work by Pierce & Kuipers (1997) on learning the SSH control
level in a robot with a continuous sensorimotor interface, but little prior knowledge of the nature
of its sensors, effectors, and environment. In that work, the learning agent successfully learned the
structure of the robot’s sensorimotor apparatus and defined an abstract interface to the robot suitable
for low-level control. By observing correlations in sensor values during action, the agent was able
to group similarly behaving sensors together, separating out, for example, a ring of range sensors
from other sensors. Then, using a variety of statistical and other data analysis techniques, the agent
discovered the spatial arrangement of the sensor groups. By computing sensory flow fields on the
major sensor groups and using principal component analysis (PCA), the agent defined an abstract
motor interface, represented by a basis set of orthogonal motor vectors u_0, . . . , u_n.
To define higher-level actions, the agent defined two kinds of control laws, homing behaviors
for hill-climbing and path-following behaviors for trajectory following. Each kind of behavior was
represented by a simple control template. Every control law is an instantiation of one of those two
templates, with the parameters set to minimize an error criterion. The error criteria are defined in
terms of a set of learned scalar perceptual features Y, such as local maxima or minima over sensory
images, defined through a set of feature generators.
Each homing behavior is defined by a local state variable y ∈ Y, and is applicable only when y is known to be controllable by a single motor basis vector u_j. The homing control template defines a one-dimensional proportional-integral controller that scales u_j as a function of y* − y, where y* is the target value for y. Each controller only operates along one axis of the motor space, requiring the sensory and motor spaces to decompose in closely corresponding ways.
Figure 2.1: Continuous-to-Discrete Abstraction of Action. In the SSH Control Level (top), continuous action takes the form of hill-climbing control laws, that move the robot to perceptually distinctive states, and trajectory-following control laws, that carry the robot from one distinctive state into the neighborhood of another. In the Causal Level, the trajectory-following/hill-climbing combination is abstracted to a discrete action, A, and the perceptual experience at each distinctive state into views V1 and V2, forming a causal schema ⟨V1|A|V2⟩. (Adapted from Kuipers (2000).)
Each path-following behavior is defined by a set of local state variables {y_1, . . . , y_n} ⊆ Y, and is applicable when all its local state variables can be held constant while applying some u_j (or its opposite) to move the robot. The path-following control template comprises a constant application of u_j plus a linear combination of the other motor basis vectors where the linear coefficients are determined by proportional-integral or proportional-derivative controllers designed to keep all y_i
within a target range.
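As a rough illustration of these two control templates, the sketch below implements the one-dimensional PI homing law and a path-following command of the form just described. It is a minimal sketch under stated assumptions: the gains, time step, and vector representation are illustrative and are not taken from Pierce & Kuipers (1997).

import numpy as np

class PIController:
    """Proportional-integral controller on a scalar error (illustrative gains)."""
    def __init__(self, kp=0.5, ki=0.05, dt=0.1):
        self.kp, self.ki, self.dt = kp, ki, dt
        self.integral = 0.0

    def output(self, error):
        self.integral += error * self.dt
        return self.kp * error + self.ki * self.integral

def homing_command(u_j, y, y_target, controller):
    """Homing template: scale the single motor basis vector u_j by the PI
    response to the feature error (y* - y)."""
    return controller.output(y_target - y) * np.asarray(u_j)

def path_following_command(u_j, other_basis, errors, controllers):
    """Path-following template: constant motion along u_j plus corrections along
    the other basis vectors, each scaled by its own PI (or PD) controller."""
    correction = sum(c.output(e) * np.asarray(u)
                     for u, e, c in zip(other_basis, errors, controllers))
    return np.asarray(u_j) + correction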
These methods – sensor grouping and learning an abstract, low-level motor interface – allow
the robot to learn continuous control with little prior knowledge, but are not a suitable discrete
sensorimotor abstraction for reinforcement learning. In particular, the feature discovery mechanism constructs a tree of heterogeneous features of different types and dimensionalities, and it is difficult
to see how this variety of different kinds of percepts could be used as input to an RL algorithm.
Furthermore, the learned behaviors are limited by their use of simple controller templates and their
applicability only when a local state variable can be controlled using a single component of the
abstract motor interface.
SODA assumes that the agent has discovered the primary sensor grouping and learned
the abstract motor interface using the methods above. SODA then augments this system with a new
feature generator that provides features that can be used both as local state variables for continuous
control, and as a discrete state abstraction for well-understood reinforcement learning algorithms.
It also defines a new set of trajectory-following and hill-climbing controllers based on these new
features that are not subject to the limitations of Pierce & Kuipers’ method.
2.1.2 Reinforcement Learning
Expanding the summary of reinforcement learning (RL) presented in the introduction, this section
introduces the basic concepts in reinforcement learning used by SODA. The material below lays
out the formal underpinnings of reinforcement learning in Markov Decision processes (MDPs),
describes the methods that SODA’s RL policies use to select exploratory actions for learning, and
describes the Sarsa learning algorithm, used by SODA. The section goes on to discuss hierarchical
extensions to reinforcement learning used by SODA.
Formal Framework of RL
The general formal framework for reinforcement learning is the Markov Decision Process (MDP):
Given a set of states S, a set of actions As applicable in each state s ∈ S, and a reinforcement signal
r, the task of a reinforcement learning system is to learn a policy, π : S × A → [0, 1]. Given a
state s ∈ S and an action a ∈ As , π(s, a) indicates the probability of taking action a in state s. The
optimal policy maximizes the expected discounted future reward, as expressed in the value function
of the state under that policy:
V π (s) = E[rt + γrt+1 + γ²rt+2 + · · · | st = s, π],                (2.1)
indicating the expected discounted sum of reward for future action subsequent to time t assuming
the state at time t is s, and actions are selected according to policy π. γ ∈ [0, 1] is the discount rate
parameter. When a predictive model of the environment is available, in the form of the transition
function T (st , at , st+1 ) = P (st+1 |st , at ) it is possible to derive the policy for action from the value
function by predicting what state will result from each of the available actions, and setting the
probabilities π(st , at ) to maximize the expected value of V (st+1 ).
When no model of the environment is available to provide the T function, it is typical to
collapse V and T into an intermediate function, the state-action-value function, Qπ (s, a), giving
the value of taking action a in state s under policy π:
Qπ (s, a) = E[rt + γrt+1 + γ²rt+2 + · · · | st = s, at = a, π].                (2.2)
The policy can be derived from Qπ through a number of methods, for example by always selecting
the action with the highest Q value. Therefore a large family of reinforcement learning methods
concentrates on learning estimates of one of these two functions, V or Q.
Methods for learning Q rely on the fact that the optimal state-action value function Q∗ obeys the Bellman equation:
Q∗ (s, a) = R(s, a) + γ ∑s′ P (s′ |s, a) maxa′ ∈As′ Q∗ (s′ , a′ ),                (2.3)
where R(s, a) is the expected reward r resulting from taking action a in state s.
An iterative algorithm for approximating Q∗ via dynamic programming (DP) uses an update
operation based on the Bellman equation:
Qk+1 (s, a) = R(s, a) + γ ∑s′ P (s′ |s, a) maxa′ ∈As′ Qk (s′ , a′ ).                (2.4)
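To make the dynamic-programming update concrete, the sketch below applies Equation 2.4 synchronously to every state-action pair of a small tabular MDP. The randomly generated transition and reward arrays are purely hypothetical stand-ins for a real model.

```python
import numpy as np

n_states, n_actions = 4, 2
gamma = 0.9

# Hypothetical model: P[s, a, s'] = P(s'|s, a); R[s, a] = expected reward R(s, a).
rng = np.random.default_rng(0)
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)               # normalize into proper distributions
R = rng.random((n_states, n_actions))

def dp_sweep(Q):
    """One synchronous application of Equation 2.4 to every (s, a) pair."""
    best_next = Q.max(axis=1)                   # max over a' of Q_k(s', a') for each s'
    return R + gamma * P @ best_next            # Q_{k+1}(s, a)

Q = np.zeros((n_states, n_actions))
for _ in range(100):                             # iterate the backup toward Q*
    Q = dp_sweep(Q)
```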
Many methods exist that combine DP updating with Monte Carlo policy iteration to update a policy
π on-line such that Qπ provably converges to Q∗ under a handful of simplifying assumptions.
Sarsa, the method used by SODA, is one such method and is described in more detail later in this
section. The approximations necessary to apply these techniques in robots with rich sensorimotor
systems generally violate these simplifying assumptions, and convergence proofs in these situations
have thus far remained elusive. Nevertheless, these techniques have been quite successful in real-world domains such as robotic soccer (Stone & Veloso, 2000), albeit by manually incorporating a
significant amount of prior knowledge of the task.
In order to learn, these on-line algorithms need to choose at each step whether to perform
the action with the highest Q-value, or whether to choose some other, exploratory action, in order
to gain more information for learning. The most popular of these methods are covered in the next
section.
Exploration vs. Exploitation in RL
In reinforcement learning, agents attempt to learn their task while performing it, iteratively improving performance as they learn. An agent in such a situation must decide when choosing actions
whether to explore in order to gain more information about its state-action space, or exploit its current knowledge to attempt to immediately maximize future reward. The typical way of dealing with
this choice is for the agent’s policy to explore more at the beginning and gradually reduce exploration as time
goes on. The two most popular methods for implementing this tradeoff, both used by SODA,
are ǫ-greedy action selection and optimistic initialization.
In ǫ-greedy action selection, at each step the agent chooses a random exploratory
action with probability ǫ, 0 < ǫ < 1, while with probability 1 − ǫ the agent chooses the greedy
action, i.e., the action it estimates will lead to the highest discounted future reward, arg maxa∈As Q(s, a).
Learning begins with a high value for ǫ that is annealed toward zero or some small value as learning
continues.
Optimistic initialization refers to initializing the estimates of Q with higher-than-expected
values. As the agent learns, the estimates will be adjusted downward toward the actual values. This
way greedy action selection biases the agent to seek out regions of the state space or state-action
space that have not been visited as frequently. It is possible to combine optimistic initialization with
ǫ-greedy action selection in a single agent.
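The following sketch combines the two mechanisms just described: a Q table initialized optimistically and an ǫ-greedy selection rule whose ǫ is annealed over time. All numerical values are illustrative, not parameters used by SODA.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action from one row of the Q table: random with probability
    epsilon, greedy (arg max) otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

# Optimistic initialization: start every Q(s, a) above any value the agent
# expects to see (the ceiling of 10.0 is an arbitrary illustrative choice).
n_states, n_actions = 50, 4
Q = np.full((n_states, n_actions), 10.0)

# Anneal epsilon from a high value toward a small floor as learning proceeds.
rng = np.random.default_rng(0)
epsilon, epsilon_min, decay = 0.5, 0.01, 0.999
for step in range(1000):
    s = int(rng.integers(n_states))             # stand-in for the agent's current state
    a = epsilon_greedy(Q[s], epsilon, rng)
    epsilon = max(epsilon_min, epsilon * decay)
```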
Episodic vs. Continual RL Tasks
Some reinforcement learning problems terminate after some period of time, and are characterized
as episodic. Others continue indefinitely and are called continual (or non-episodic). Trajectory-following and hill-climbing in the SSH are examples of episodic tasks, since each continues for
a finite period of time, while large-scale robot navigation using TF and HC actions may be either
continual or episodic, depending on the task.
Episodic tasks have a terminal state in which action stops and the agent receives its “final
reward” (for that episode). Non-episodic tasks, on the other hand, have no terminal state, and the
agent continues to act indefinitely. The learning problem in episodic tasks is usually to maximize the
total reward per episode, so it is not necessary to discount the reward and γ = 1.0. In continual tasks,
reward will accumulate indefinitely, so it is necessary to discount future reward with 0 < γ < 1.
Generally speaking, episodic tasks can be solved with either on-line or off-line methods, while
continual tasks must be solved on-line. Depending on the nature of the task, however, it may be
possible for an agent in a continual task to stop acting and “sleep” periodically in order to learn
off-line.
The experimental navigation tasks used to test SODA in Chapter 6 are all episodic. However,
SODA could just as easily be applied to continual navigation, for example in a delivery robot that
must continually make rounds of a large building.
The Sarsa(λ) Algorithm
The feature and action construction methods in SODA are intended to be independent of the specific
reinforcement learning algorithm used. The experiments in this thesis use Sarsa(λ), described by
Sutton & Barto (1998). Sarsa was chosen because it is simple and well understood. Also, Sarsa has
good performance when used with function approximation – a necessity when using RL in robots.
Some other methods, like Q-learning (Watkins & Dayan, 1992), are known to diverge from the
optimal policy when used with function approximation (Baird, 1995).
‘Sarsa’ is an acronym for State, Action, Reward, State, Action; each Sarsa update uses the
states and actions from time t and the reward, state, and action from time t + 1, usually denoted
by the tuple: hst , at , rt+1 , st+1 , at+1 i. In the simplest form, Sarsa modifies the state-value estimate
Q(s, a) as follows:
Q(st , at ) ← Q(st , at ) + α [rt+1 + γQ(st+1 , at+1 ) − Q(st , at )] .
(2.5)
This rule updates the current Q value by a fraction of the reward after action at plus the temporal
difference of the discounted reward predicted from the next state-action pair, Q(st+1 , at+1 ), and the
current estimate of Q(st , at ). The parameter 0 < α < 1 is a learning rate that controls how much of
this value is used. The parameter 0 < γ ≤ 1 is the discount factor. For episodic tasks that ultimately
reach a terminating state, γ = 1 is used, allowing Q(s, a) to approach an estimate of the remaining
reward for the episode.
For faster learning, Sarsa(λ) performs multiple-step backups by keeping a scalar eligibility
trace, e(s, a), for each state action pair, that tracks how recently action a was taken in state s — i.e.,
how “eligible” it is to receive a backup. Using this information the algorithm updates the estimates
of the value function for many recent state-action pairs on each step, rather than just updating the
estimate for the most recent pair. When a step is taken, each eligibility trace is decayed according
to a parameter 0 ≤ λ < 1:
∀s, a : e(s, a) ← λe(s, a),
(2.6)
Then the eligibility trace for the most recent state and action is updated:
e(st , at ) ← 1.
(2.7)
Finally the Q table is updated according to the eligibility trace:
∀s, a : Q(s, a) ← Q(s, a) + e(s, a) α [rt+1 + γQ(st+1 , at+1 ) − Q(st , at )] .
(2.8)
This method can speed up learning by a large margin, backing up the reward estimates to recently
experienced state-action pairs, rather than just the most recent. (Note that when λ = 0 the eligibility
trace is zero for all but the most recent state-action pair, and the update rule in Equation 2.8 reduces
to the one step update in Equation 2.5.) Sarsa(λ)’s simplicity and known convergence properties
make it a good choice for use in SODA.
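A minimal tabular sketch of the Sarsa(λ) update (Equations 2.6–2.8) is shown below. The toy state and reward stream is fabricated for illustration; it is not SODA's learning loop.

```python
import numpy as np

def sarsa_lambda_update(Q, e, s, a, r_next, s_next, a_next,
                        alpha=0.1, gamma=1.0, lam=0.9):
    """Apply Equations 2.6-2.8 in place for one <s_t, a_t, r_t+1, s_t+1, a_t+1> tuple."""
    e *= lam                                    # decay every eligibility trace (Eq. 2.6)
    e[s, a] = 1.0                               # refresh the trace of the latest pair (Eq. 2.7)
    delta = r_next + gamma * Q[s_next, a_next] - Q[s, a]
    Q += alpha * delta * e                      # back up all eligible pairs (Eq. 2.8)

# Toy run on a 5-state, 2-action table with a fabricated transition stream.
rng = np.random.default_rng(0)
Q = np.zeros((5, 2))
e = np.zeros_like(Q)
s, a = 0, 0
for _ in range(100):
    s_next, a_next, r = int(rng.integers(5)), int(rng.integers(2)), float(rng.random())
    sarsa_lambda_update(Q, e, s, a, r, s_next, a_next)
    s, a = s_next, a_next
```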
Hierarchical Reinforcement Learning and Options
Reinforcement learning methods that maintain a single, flat policy struggle on high-diameter problems. To address this failing, much recent work has gone into extending the paradigm to learn a
task as a decomposition of subtasks, or to use other forms of temporal abstraction. Originally
this work assumed a task that had been decomposed by a human, and concentrated on policy learning. More recently, however, work has gone into agents that automatically discover a useful task
decomposition or temporal abstraction.
The most widely-used formalism for representing hierarchical task decomposition in reinforcement learning is the semi-Markov decision process (SMDP). SMDPs extend MDPs by allowing actions with variable temporal extent, that may themselves be implemented as SMDPs or MDPs,
executed as “subroutines.” Such processes are “semi-Markov” because the choice of primitive actions (at the lowest level of the decomposition) depends not only on the environmental state, but
also on the internal state of the agent, as manifest in the choice of higher-level actions.
There are three major learning systems and frameworks based on SMDPs: Options (Precup,
2000; Sutton, Precup, & Singh, 1999), MAXQ value function decomposition (Dietterich, 2000), and
Hierarchies of Abstract Machines (HAMs; Parr & Russell, 1997; Andre & Russell, 2001). These
methods derive much of their power from the fact that short policies are easier to learn than long
policies, because the search space is smaller. Using this knowledge they structure a long task as a
short sequence of extended subtasks, that may themselves be compositions of lower-level subtasks.
In other words, all these methods attempt to reduce the diameter of high-diameter tasks.
SODA’s abstraction is built on Options, because its formalism provides a convenient framework for specifying trajectory-following and hill-climbing actions. In addition, hand-specified Options have been successfully used in real-world robotics tasks. For example, Stone & Sutton (2001)
used the options framework in the complex task of robot soccer. Finally Options has been the most
successful framework for automatic discovery of high-level actions. MAXQ and HAMs have primarily been used as a means by which a human can specify a useful set of subtasks for learning.
The rest of this section presents the Options framework in more detail, and describes some existing
methods for automatic Option discovery.
The options formalism extends the standard reinforcement learning paradigm with a set of
temporally-extended actions O, called options. Each option oi in O is a tuple hIi , πi , βi i, where the
input set, Ii ⊆ S, is the set of states where the option may be executed, the policy, πi : S × A →
[0, 1], determines the probability of selecting a particular action in a particular state while the option
is executing, and the termination condition, βi : S → [0, 1], indicates the probability that the option
will terminate in any particular state. For uniformity, primitive actions are formalized as options so
each SMDP’s policy chooses among options only. Each primitive action a ∈ A can be seen as an
option whose input set I is the set of states where the action is applicable, whose policy π always
chooses a, and whose termination condition β always returns 1. In cases where the set of actions
is very large (or continuous) it is often desirable to allow each option to act only over a limited set
of actions Ai , such that πi : S × Ai → [0, 1]. In addition, most work in options focuses on cases
where the option policies are themselves learned using reinforcement learning — although this is
not strictly necessary within the formalism. In this case, it is often useful to augment the option
with a pseudo-reward function Ri , different from the MDP’s reward function, to specify the subtask
that the option is to accomplish. SODA’s trajectory-following and hill-climbing actions are defined
and learned as options. Sections 3.3 and 3.4.2 define Ii , πi , βi , Ri , and Ai for SODA’s TF and HC
options.
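A minimal data structure for the option tuple ⟨Ii , πi , βi ⟩, together with the optional restricted action set Ai and pseudo-reward Ri discussed above, might look like the following sketch; the field names and the helper for wrapping primitive actions are illustrative, not code from the Options literature or from SODA.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

State = Any
Action = Any

@dataclass
class Option:
    """Illustrative container for an option <I, pi, beta>, plus the optional
    restricted action set and pseudo-reward discussed in the text."""
    initiation_set: Callable[[State], bool]       # I: may the option start in s?
    policy: Callable[[State], Action]             # pi: action chosen while executing
    termination: Callable[[State], float]         # beta: probability of terminating in s
    actions: Sequence[Action] = ()                # A_i: restricted action set (optional)
    pseudo_reward: Callable[[State, Action], float] = lambda s, a: 0.0  # R_i (optional)

def as_option(primitive_action, applicable):
    """Wrap a primitive action as a one-step option, as described in the text."""
    return Option(
        initiation_set=applicable,
        policy=lambda s: primitive_action,
        termination=lambda s: 1.0,                # always terminates after one step
        actions=(primitive_action,),
    )
```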
To learn a policy for selecting options to execute, the Sarsa(λ) algorithm reviewed in Section 2.1.2 must be slightly modified to accommodate options (Precup, 2000). Assume an option
ot is executed at time t, and takes τ steps to complete. Define ρt+τ as the cumulative, discounted
reward over the duration of the option:
ρt+τ = ∑i=0…τ−1 γ i rt+i+1 .                (2.9)
Using this value, the one-step update rule (Equation 2.5) is modified as follows:
Q(st , ot ) ← Q(st , ot ) + α [ρt+τ + γ τ Q(st+τ , ot+τ ) − Q(st , ot )] .
(2.10)
For Sarsa(λ), the multi-step update rule (Equation 2.8) is modified analogously. The eligibility trace
now tracks state-option pairs, e(s, o), and is updated upon option selection.
Note that for one-step options, these modifications reduce to the original Sarsa(λ) equations.
Also, when γ = 1, as is often true in episodic tasks, the reward ρ is simply the total reward accumulated over the course of executing the option, and the update rule is again essentially the same as
the original, if the reward r in the original rule is taken to mean the reward accumulated since the
last action.
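The option-level backup of Equations 2.9 and 2.10 can be sketched as follows; the dictionary-based Q table and the state and option labels are hypothetical.

```python
def smdp_sarsa_update(Q, s, o, rewards, s_next, o_next, alpha=0.1, gamma=0.99):
    """One option-level backup (Equations 2.9 and 2.10).

    `rewards` holds [r_t+1, ..., r_t+tau], the rewards collected while option o
    executed, so its length is the option's duration tau.
    """
    tau = len(rewards)
    rho = sum(gamma ** i * r for i, r in enumerate(rewards))   # Eq. 2.9
    delta = rho + gamma ** tau * Q[s_next][o_next] - Q[s][o]   # Eq. 2.10: successor discounted by gamma^tau
    Q[s][o] += alpha * delta

# Toy usage with a dict-of-dicts Q table over hypothetical state/option labels.
Q = {"s0": {"tf": 0.0, "hc": 0.0}, "s1": {"tf": 0.0, "hc": 0.0}}
smdp_sarsa_update(Q, "s0", "tf", rewards=[0.0, 0.0, 1.0], s_next="s1", o_next="hc")
```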
Work is ongoing to refine and extend the options framework, including sharing information
for learning between options (Sutton, Precup, & Singh, 1998) and using the task decomposition
to improve state abstraction (Jonsson & Barto, 2000). Although much early work on options has
assumed an a priori task decomposition, more recent work has focused on discovering a useful set
of options from interaction with the environment. This and other work on automatic discovery of
hierarchical actions is reviewed in the next section.
Automatic Hierarchy Discovery
Manually decomposing a task into subtasks is a means of using prior knowledge to shrink the
effective diameter of a high-diameter task. One of the goals of SODA is an agent that learns to
reduce the diameter of its task without such prior knowledge.
Some work has been done on automatically learning a task hierarchy for this purpose.
Nested Q-Learning (Digney, 1996, 1998) builds a hierarchy of behaviors implemented as learned
sub-controllers similar to options (Section 2.1.2). It operates either by proposing every discrete feature value as a subgoal and learning a controller for each, or proposing as subgoals those states that
are frequently visited or that have a steep reward gradient. The former method can only be tractable
with a relatively small set of discrete features. Digney intended the latter version to be tractable
with larger feature sets, although it was only tested in a very small, discrete grid-world.
The work of McGovern & Barto (2001) and McGovern (2002) is similar in many respects
to the work of Digney. States are selected as subgoals using a statistic called “diverse density” to
discover states that occur more frequently in successful trials of a behavior than in unsuccessful
trials, and new options are created to achieve these states as subgoals. This method differs from
Nested Q-Learning in that it is able to use negative evidence (that a state is not on the path to
the goal, and thus not likely to be a subgoal). To do this, however, it requires that experiences be
separated into “successful” and “failed” episodes, which seems to preclude its use in non-episodic
tasks. Furthermore, it requires the use of a hand-crafted, task-specific “static filter” to filter out
spurious subgoals. In testing on 2-room and 4-room grid-worlds with narrow doorways connecting
the rooms, this method correctly identified the doorways between the rooms as subgoals.
Building on the work of McGovern & Barto, Şimşek et al. have developed two new ways of
discovering useful options. The first method uses a new statistic called relative novelty (Şimşek &
Barto, 2004) to discover states through which the agent must pass in order to reach other parts of
the state space. The method then proposes these access states as subgoals and constructs options for
reaching them. The second method uses local graph partitioning (Şimşek, Wolfe, & Barto, 2005)
to identify states that lie between densely connected regions of the state space, and proposes those
states as subgoals.
SODA is similar to these methods in that it also learns a task decomposition, but it does so by
abstracting the agent’s continuous sensorimotor experience into a set of discrete, perceptually distinctive states and actions to carry the robot between these states. In contrast, the option-discovery
methods above assume a discrete state abstraction, and attempt to find a smaller set of states that
act as useful subgoals and learn new subtask policies that carry the robot to these states. In each
case, the agent’s task then decomposes into a sequence of higher-level, temporally extended actions
that decrease the effective diameter of the task. SODA’s hill-climbing actions are similar to these
sub-goal options in that they attempt to reach a distinctive state as a sub-goal. SODA differs in that
hill-climbing options are quite local in their scope, and trajectory-following options are the means
of carrying the robot from one distinctive state into the local neighborhood of another. Trajectory-following (which is formulated as an option in Section 3.3) is different from sub-goal options: Its
purpose is to make progress on a given trajectory for as long as possible, rather than to achieve a
termination condition as quickly as possible. In this way trajectory-following defines a new kind
of option that is the dual of traditional sub-goal-based options. For very large diameter tasks, it
may be necessary to add more layers of features and actions to further reduce the diameter. The
methods above may be directly useful for bootstrapping to even higher levels of abstraction. One
possible method of identifying and constructing higher-level options on top of SODA is proposed
in Section 7.2.2.
To summarize, SODA uses reinforcement learning to learn policies for trajectory-following and hill-climbing actions, as well as for high-diameter navigation tasks using those actions. The agent’s
action set includes high-level actions as described above in Section 2.1.1, which are defined as Options using hierarchical reinforcement learning, and the state representation is learned using a self-organizing map, described in the next section.
2.1.3 Self-Organizing Maps (SOMs)
The unsupervised feature learning algorithm used in SODA is the Self-Organizing Map (SOM;
Kohonen, 1995). The self-organizing map has been a popular method of learning perceptual features
or state representations in general robotics, as well as world modeling for navigation (Martinetz,
Ritter, & Schulten, 1990; Duckett & Nehmzow, 2000; Nehmzow & Smithers, 1991; Nehmzow,
Smithers, & Hallam, 1991; Provost, Beeson, & Kuipers, 2001; Kröse & Eecen, 1994; Zimmer,
1996; Toussaint, 2004). Many of these systems, however, provide some form of prior knowledge or
Figure 2.2: Example 5×5 Self-Organizing Map. A SOM learning a representation of a high-dimensional, continuous state space. In training, each sensor image is compared with the weight
vector of each cell, and the weights are adapted so that over time each cell responds to a different
portion of the input space. [Figure adapted from Miikkulainen (1990).]
external state information to the agent, and none attempt to build higher level actions to reduce the
task diameter. The first part of this section presents the basic SOM learning method in the context
of the standard “Kohonen Map” SOM algorithm, and details the reasons why SOMs are well suited
for feature learning in SODA. The second part describes the Growing Neural Gas (GNG) SOM
algorithm, and the Homeostatic-GNG variant developed for use with SODA.
The Standard Kohonen SOM
A standard SOM consists of a set of units or cells arranged in a lattice, often (but not necessarily) a 2D rectangular grid. The SOM takes a
continuous-valued vector x as input and returns one of its units as the output. Each unit has a
weight vector wi of the same dimension as the input. On the presentation of an input, each weight
vector is compared with the input vector and a winner is selected as arg mini kx − wi k.
In training, the weight vectors in the SOM are initialized to random values. When an input
vector xt is presented, each unit’s weights are adjusted to move it closer to the input vector by some
fraction of the distance between the input and the weights according to
wi ← wi + ηt Nt (i)(xt − wi ),
(2.11)
where 0 < ηt < 1 is the learning rate at time t, and Nt : N → [0, 1] is the neighborhood function at time t, defined relative to the winning unit ww . The neighborhood function returns 1 if i
is the winning unit, and decreases with the distance of unit i from the winner in the SOM lattice,
eventually decreasing to zero for units “outside the neighborhood” (Figure 2.2).
Training begins with initially large values for both the learning rate and the neighborhood
size. As training proceeds, the learning rate and neighborhood are gradually annealed to very small
values. As a result, early training orients the map to cover the gross topology of the input space, and
as the parameters are annealed, finer grained structure of the input space emerges.
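The training procedure of Equation 2.11 can be sketched as below. The Gaussian form of the neighborhood function and the annealing schedules are common choices assumed for this sketch, not the specific ones used in the text.

```python
import numpy as np

def som_step(weights, grid, x, eta, sigma):
    """One training step of Equation 2.11.

    weights: (n_units, dim) weight vectors; grid: (n_units, 2) lattice coordinates.
    eta and sigma are the (annealed) learning rate and neighborhood width; the
    Gaussian neighborhood used here is one common choice among several.
    """
    winner = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
    lattice_dist = np.linalg.norm(grid - grid[winner], axis=1)
    neighborhood = np.exp(-(lattice_dist ** 2) / (2.0 * sigma ** 2))   # N_t(i), equal to 1 at the winner
    weights += eta * neighborhood[:, None] * (x - weights)
    return winner

# Toy 5x5 map on 3-D inputs, with eta and sigma annealed over training.
rng = np.random.default_rng(0)
side, dim, steps = 5, 3, 1000
grid = np.array([[i, j] for i in range(side) for j in range(side)], dtype=float)
weights = rng.random((side * side, dim))
for t in range(steps):
    eta = 0.5 * (1.0 - t / steps)                     # anneal the learning rate
    sigma = max(0.5, 3.0 * (1.0 - t / steps))         # anneal the neighborhood size
    som_step(weights, grid, rng.random(dim), eta, sigma)
```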
Self-organizing maps have several properties that lend themselves well to our feature learning task:
• Data- and sensor-generality. Because they operate on any input that can be expressed in
the form of a vector, they are not specific to any particular kinds of sensor or environment,
making them especially well suited to learning with an unknown sensorimotor system.
• Clustering with topology preservation. SOMs partition the input space into a set of clusters
that tends to preserve the topology of the original input space in reduced dimensions. Thus
features near one another in the SOM will be similar to one another. This property can be
exploited to speed up reinforcement learning when a SOM is used to represent the state space
(Smith, 2002).
• Incremental training. A SOM can be trained incrementally, with training vectors presented
on-line as they are received during robot exploration. As training progresses the SOM first
organizes into a rough, general approximation of the input space that is progressively refined. This property coincides well with the on-line nature of most reinforcement learning
algorithms, and enables the agent to learn its policy concurrently with learning its state representation (Smith, 2002).
• Adapting to the input distribution. Unlike a priori fixed discretizations, SOMs concentrate
their units in areas of the perceptual space where the input is distributed. Furthermore, the
Growing Neural Gas (GNG), the modified SOM algorithm used in SODA (described below), is
good at following non-stationary input distributions (Fritzke, 1997), making it easier for the
robot to learn incrementally as it explores its environment, without needing to store
training data and present it in batch. In some cases, this allows the agent to learn task policies
with Sarsa while training the SOM concurrently.
In SODA, a SOM is used to learn a set of perceptual features from the sensory input. The
features have three complementary roles. First, the units are used as discrete perceptual categories that form the state space for the reinforcement learning algorithm that chooses high-level
actions. Second, the continuous activation values on the SOM are used by the agent to define
closed-loop hill-climbing and trajectory-following control laws with which to construct high-level
actions. Specifically, the agent hill-climbs to perceptually distinctive states, each defined by the
local maxima of activation on a SOM unit, and trajectory-follows so as to make progress while
maintaining the activation of the current SOM unit. Third, the sorted list of the n closest units to
the current input forms the Topn state representation, described in Section 3.2, which is used when
learning hill-climbing and trajectory-following using reinforcement learning.
The Growing Neural Gas Algorithm
The SOM implementation used by SODA is based on a variant of the standard SOM algorithm
called the Growing Neural Gas (GNG; Fritzke, 1995). The GNG begins with a small set of units
and inserts new units incrementally to minimize distortion error (the error between the input, and
the winning unit in the network). The GNG is able to continue learning indefinitely, adapting to
changing input distributions. This property makes the GNG especially suitable for robot learning,
since a robot experiences its world sequentially, and may experience entirely new regions of the
input space after an indeterminate period of exploration. In addition, the GNG is not constrained by
the pre-specified topology of the SOM lattice. It learns its own topology in response to experience
with the domain. An abbreviated description of the GNG algorithm follows.
• Begin with two units, randomly placed in the input space.
• Upon presentation of an input vector x:
1. Select the two closest units to x, denoted as w1 and w2 with weight vectors q1 and q2 ,
respectively. If these units are not already connected in the topology, add a connection
between them.
2. Move q1 toward x by a fraction of the distance between them. Move the weights of
all the topological neighbors of w1 toward x by a smaller fraction of the respective
distances.
3. Increment the age aw1 j of all edges j emanating from w1 .
4. Set the age aw1 w2 of the edge between w1 and w2 to 0.
5. Remove any edges ij for which aij > amax . If this results in any units with no edges
emanating from them, remove those units as well.
6. Add the squared error ||x − q1 ||2 to an accumulator ew1 associated with w1 .
Figure 2.3: A Growing Neural Gas (GNG) Network. An example GNG network adapting to
model an input distribution with 1-dimensional, 2-dimensional, and 3-dimensional parts. The GNG
network has several properties that make it a potential improvement over typical SOMs for our
feature learning problem: it requires no prior assumptions about the dimensionality of the task, and it
can continue to grow and adapt indefinitely. The Homeostatic-GNG developed for SODA regulates
its growth in order to maintain a fixed value of cumulative error between the input and the winning
units. (Figure adapted from Fritzke (1995).)
7. Decay the accumulated error of all nodes by a fraction of their values: ∀i : ei ← ei − βei , where 0 < β < 1.
• Periodically, every λ inputs, add a unit by selecting the existing unit with the greatest accumulated error and the unit among its topological neighbors with the most accumulated error;
between these two units add a new unit whose weight vector is the average of the two selected
units. Connect the new unit to the two selected units, and delete the original connection between the two.
The original GNG algorithm adds units until the network reaches some fixed criterion such as a
maximum number of units. SODA uses a new, slightly modified algorithm, Homeostatic-GNG, that
has no fixed stopping criterion, but rather only adds units if the average discounted cumulative error
over the network is greater than a given threshold. Every λ inputs, the Homeostatic-GNG checks
the condition ē > eτ , where eτ is the error threshold. If the average error exceeds the threshold, a unit is added; otherwise none is added. Homeostatic-GNG was first published as
Equilibrium-GNG (Provost, Kuipers, & Miikkulainen, 2006).
Given a stationary input distribution, a Homeostatic-GNG will grow until reaching an equilibrium between the rate of accumulation of error (per unit) and the rate of decay. If the distribution
changes to cover a new part of input space, the error accumulated in the network will increase above
the threshold, and the network will grow again in the region nearest the new inputs.
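The homeostatic growth rule can be sketched as a check that runs every λ inputs; the data layout (parallel weight and error lists plus an edge-age dictionary) and the error initialization for the new unit are assumptions of this sketch, not SODA's implementation.

```python
import numpy as np

def maybe_grow(weights, errors, edges, error_threshold):
    """Homeostatic growth check, run every lambda inputs: add one unit only if
    the mean accumulated error exceeds the threshold.  The data layout and the
    new unit's error initialization are assumptions of this sketch."""
    if sum(errors) / len(errors) <= error_threshold:
        return False                                   # error is at or below equilibrium
    q = max(range(len(errors)), key=lambda i: errors[i])           # highest-error unit
    nbrs = [j for (i, j) in edges if i == q] + [i for (i, j) in edges if j == q]
    f = max(nbrs, key=lambda i: errors[i])                         # its highest-error neighbor
    weights.append(0.5 * (np.asarray(weights[q]) + np.asarray(weights[f])))
    errors.append(0.5 * (errors[q] + errors[f]))       # one reasonable initialization
    r = len(weights) - 1
    edges.pop((min(q, f), max(q, f)), None)            # replace the q-f edge ...
    edges[(min(q, r), max(q, r))] = 0                  # ... with q-r and r-f
    edges[(min(r, f), max(r, f))] = 0
    return True
```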
To summarize, SODA uses a variant of the self-organizing map algorithm, the Homeostatic-GNG network, to learn a set of prototypical sensory images that form the perceptual basis of the abstraction.
Homeostatic-GNG is well suited for this task because it can be trained incrementally, covering the
changing input distribution as the robot explores, and because it does not require a fixed specification of the number of features to learn, instead adjusting the number of features to maintain a
prespecified level of discounted cumulative error.
This section described SODA’s foundations in the Spatial Semantic Hierarchy, reinforcement learning, and self-organizing maps. The remainder of this chapter discusses a variety of related
work in bootstrap learning, hierarchical reinforcement learning, and automatic feature construction
for reinforcement learning.
2.2 Bootstrap Learning
Most robot learning architectures endow the agent with significant amounts of prior knowledge of
the robot, environment, and task. As mentioned in Section 1.4, SODA is an instance of a “bootstrap
learning” algorithm (Kuipers et al., 2006). Agents using bootstrap learning algorithms build representations of their world progressively from the bottom up, by first learning simple concepts and
then using those as building blocks for more complex concepts. These kinds of algorithms can be
classified into two broad classes: homogeneous and heterogeneous.
Homogeneous bootstrap learning methods use the same learning method or set of methods
at every level of learning, positing a single, uniform learning algorithm to account for agent learning from raw pixels and motor commands all the way up to high-level behavior. Drescher’s Schema
Mechanism, described in Section 2.2.1, and Chaput’s Constructivist Learning Architecture (Section 2.2.2) are two such methods. Heterogeneous bootstrap learning methods, on the other hand,
use different algorithms as needed for different levels of behavior. Human researchers determine the
hierarchy of methods and interfaces; within each level, the agents learn their representations autonomously. Heterogeneous bootstrap learning methods are generally directed at learning specific kinds of agent
knowledge rather than the whole scope of high-level behavior. In addition, they often assume the
existence of some lower-level knowledge that has already been learned. As a result of this presumption, those methods can themselves be seen as building blocks in a larger bootstrap learning process
that knits together the individual methods. Heterogeneous bootstrap learning methods include the
work of Pierce & Kuipers, SODA, the place recognition system of Kuipers & Beeson (2002), and
OPAL (Modayil & Kuipers, 2006, 2004). Pierce and Kuipers’ work is described in Section 2.1.1;
Kuipers and Beeson’s place detection and Modayil’s OPAL are described in Sections 2.2.3 and
2.2.4.
2.2.1 Drescher’s Schema System
The Schema System (Drescher, 1991) uses a constructivist, Piagetian model of child development
as a framework for an intelligent agent that learns to understand and act in its world with no prior
knowledge of the meaning of its sensors and effectors. It was not applied to realistic robots, but
was only tested in a very small discrete grid world. The Schema System does not use reinforcement
learning, but instead explores its world attempting to find reliable context-action-result schemas. It
assumes a primitive set of discrete, propositional features and discrete, short-range actions. High-level perceptual features are represented as propositional conjuncts, generated by exhaustively pairing existing features, or their negations, and testing the resulting conjuncts to see if they can be
reliably achieved through action. High-level actions are created by chaining together sequences of
primitive actions that reliably activate some high level feature.
One especially interesting feature of its representation is what Drescher calls the synthetic
item. Each schema has a synthetic item representing the hidden state of the world that would make
that schema reliable. For example, the schema for moving the hand to position-X and then feeling
something touching the hand would have a synthetic item that could be interpreted as the proposition
that there is an object at position-X. Drescher defined heuristics for when the agent would “turn on”
synthetic items. For example, after a schema had successfully executed, its synthetic item would
stay on for some period of time. He also proposed that artificial neural networks might be used to
learn when to turn them on and off.
The Schema System was tested in a simulated “micro-world,” i.e. a small, discrete two
dimensional world with a hand that can touch and grasp “objects,” a simple visual system with
foveation, and a few other sensors and effectors. Even in this simple world, the number of combinations of propositions to search through is extremely large. Given that Drescher’s implementation
was unable to scale up to the full micro-world even using a Connection Machine, it is inconceivable
that the system as implemented would scale to modern physical robots with rich, continuous sensorimotor systems, and there’s no evidence that any attempt at such a system has been made (but see
Section 2.2.2 for an alternative implementation in the same simulated world).
One of the main contributions of the Schema System is the idea that an agent with practically
no prior knowledge can learn to understand itself and its world through a bottom-up, constructive
search through the space of causal schemas, using ideas from developmental psychology both to
provide a representational framework, and a set of heuristics for agent behavior to guide the search.
Another contribution, embodied in the synthetic items, is the idea that, through a kind of informal
abductive process, the agent can begin to form a representation of the latent concepts that explain
its sensorimotor experience.
However, unlike the Schema System, SODA explicitly concentrates on learning in a continuous world, developing a useful continuous-to-discrete abstraction that improves learning. Using
this abstraction, SODA uses reinforcement learning to learn to perform tasks, rather than just learning a model of the world.
2.2.2 Constructivist Learning Architecture
The Constructivist Learning Architecture (CLA; Chaput, 2004, 2001; Chaput & Cohen, 2001; Cohen, Chaput, & Cashon, 2002) is another computational model of child development that uses SOMs
(Section 2.1.3) as its feature representation, and constructs high-level features by using higher level
SOMs that learn the correlations of features on two or more lower-level SOMs. The architecture has
been used successfully to model development of infants’ perception of causation, and other child
developmental processes. It also successfully replicates Drescher’s results from the Schema System
operating in the micro-world (Chaput, Kuipers, & Miikkulainen, 2003). In addition, it has been
used to implement a learning robot controller for a simple foraging task in a simulated robot.
Part of the power of CLA is that it recognizes that an agent’s actual sensorimotor experience
is a very small subset of the set of experiences that can be represented in its sensorimotor system, and
it uses data-driven, unsupervised, competitive learning in SOMs to focus the search for high-level
features and actions on the regions of the state space in which the agent’s sensorimotor experience
resides. The rest of the power comes from its hierarchical structure, and the fact that it explicitly
uses the factored and compositional structure of high-level percepts and actions. That is, high-level
features are a small subset of the set of possible combinations of lower-level features, that are a small
subset of combinations of still lower-level features, continuing on down to the primitive features.
Assuming this kind of structure, CLA can further focus the search by first learning the lowest level
of features, then conducting the search for higher level features in the reduced space formed by
combining the existing lower level feature sets.
Like the Schema System, CLA does not explicitly address the continuous-to-discrete abstraction, but merely assumes that such an abstraction exists, whereas SODA explicitly deals with
learning such an abstraction. SODA, however, learns features with a single, monolithic SOM on one
large sensor group. Extending SODA to learn a factored continuous-to-discrete abstraction using
techniques from CLA is an interesting direction for future work that is discussed in Section 7.3.
2.2.3 Bootstrap Learning for Place Recognition
Kuipers and Beeson’s (2002) bootstrap learning system for place recognition uses unsupervised
clustering on sensory images, and the topology abduction methods of the Spatial Semantic Hierarchy to bootstrap training data for a supervised learning algorithm that learns to recognize places
directly from their sensory images.
The method assumes that a major sensor group has been provided or found through a grouping method like that of Pierce & Kuipers (1997). It also assumes an existing set of TF and HC
control-laws such as those learned by SODA. Using these control laws the agent moves through the
environment, from one distinctive state to another, collecting sensor images from distinctive states.
The agent then clusters the set of sensor images into a set of views, choosing a number of clusters
small enough to ensure that there is no image variability within a distinctive state (i.e., every distinctive state has the same view), but assuming that such a small set of clusters will create perceptual
aliasing (i.e. more than one state has the same view). The algorithm then uses an expensive exploration procedure combined with the SSH’s topology abduction to produce a set of sensory images
associated with their correct place labels. These data are then used to train a supervised learning
algorithm (k-nearest neighbor) to immediately recognize places from their sensory images, without
any exploration or topology abduction.
2.2.4 Learning an Object Ontology with OPAL
The Object Perception and Action Learning system (OPAL; Modayil & Kuipers, 2004, 2006) assumes the
existence of a method for constructing an occupancy-grid representation of the world from range
data, and uses a hierarchy of clustering and action learning methods to progressively distinguish
dynamic objects from the static background, track the objects as they move, register various views
of the same object into a coherent object model, classify new instances of objects based on existing
object models, and learn the actions that can be performed on different classes of objects. Each
one of these steps builds on the representations learned in the one below it, forming a multi-layer
bootstrap learning system for an object and action ontology.
2.2.5 Other Methods of Temporal Abstraction
Ring (1994, 1997) developed two methods for temporal abstraction in reinforcement learning, neither one based on SMDPs. The focus of both methods, however, was on reinforcement learning
in non-Markov problems. By using temporal abstraction, he was able to develop reinforcement
learning methods to work on k-Markov problems, that is, problems in which the next state is
dependent only on the current state and some finite k immediately previous states.
Ring’s first method, Behavior Nets (Ring, 1994) measured how often temporally successive
actions co-occur. This data was used to chain together actions that frequently occur in sequence to
form new, ballistic “macro actions”; these actions are added to the agent’s repertoire of available
actions, and also are available for further chaining. These new actions are similar in some respects to
the compound actions in Drescher’s Schema System (Section 2.2.1); however, Drescher’s compound
actions form trees of connected schemas that converge upon some goal state, while Behavior Nets
form single chains and are not goal-directed. Although this work was addressed to the problem of
non-Markov reinforcement learning, it is possible that such macros may be useful in high-diameter
reinforcement learning problems as well. However, as shown in Chapter 5, open-loop behaviors
such as these are less effective in physical robots with noisy motor systems, and closed-loop methods
are preferred for usable high-level actions.
Ring’s second method, Temporal Transition Hierarchies (Ring, 1994, 1997), does not create
new actions. Rather, it builds a hierarchy of units that encode into the current state representation
information from progressively more distant time steps in the past, forming a kind of task-specific
memory. The agents’ policy function is implemented as a neural network with a single layer of
weights mapping from the state representation to the primitive actions. The state representation
initially consists only of units representing primitive perceptual features, such as walls bordering
each side of a cell in a grid-world. As the agent learns, it monitors the changes in the weights of
the network, and identifies weights that are not converging on fixed values. When it identifies such
a weight wij , it adds a new unit that takes input from the previous time step. The output of this new
unit is used to dynamically modify the value of weight wij based on the previous state. Each new
unit’s weights are likewise monitored, and the system progressively creates a cascading hierarchy
of units looking further back in time. This method, combined with Q-Learning for learning state-action value functions, forms the CHILD algorithm for continual learning (Ring, 1997). CHILD is a
very effective means of dealing with partially observable environments, i.e., environments in which
many places produce the same perceptual view. Possible extensions of SODA to such environments,
including the use of CHILD, are discussed in Section 7.1.
2.3 Automatic Feature Construction for Reinforcement Learning
Tabular value-function reinforcement learning methods cannot learn in continuous state spaces
without some mapping of the continuous space into a discrete space. Furthermore, even large discrete spaces make learning difficult, because each state, or state-action pair, must be visited in order
to learn the value function (Sutton & Barto, 1998, Ch. 8). To deal with this it is necessary to extract
or construct features from the sensory vector that provide a usable state abstraction from which
the agent can learn a policy. Many times the state abstraction is constructed by hand and given a
priori, as with a popular method called tile coding (also called CMAC; Sutton & Barto, 1998, Sec.
8.3.2). Unlike these methods, SODA’s GNG network learns a state abstraction from input. Two
other methods that learn a state abstraction from input are the U-Tree algorithm, and value-function
learning using backpropagation.
2.3.1 U-Tree Algorithm
The U-Tree algorithm (McCallum, 1995) progressively learns a state abstraction in the form of a
decision tree that takes as input vectors from a large space of discrete, nominal features and splits
the space on the values of specific features, forming a partitioning of the state space at the leaves
of the tree. Splits are selected using a statistical test of utility, making distinctions only where
necessary to improve the Q function. In addition, the algorithm considers not only the current state,
but the recent history of states when making splits, allowing it to act as a compact state memory in
k-Markov problems.
U-Tree’s principal demonstration was in a simulated highway driving task with a state vector
of 8 features having from 2 to 6 possible discrete, nominal values, for a total of 2592 perceptual
states, with much of the state hidden. The individual features and actions were related to a set of
visual routines coded into the agent that encapsulated a great deal of prior task knowledge. For
example, one feature indicates whether the object in the current gaze is a car, road, or road shoulder.
The algorithm eventually discovered a state abstraction with fewer than 150 states within which it
could perform the task.
It is unclear from this experiment how U-Tree would perform using as input, for example,
a raw laser rangefinder vector with approximately 500¹⁸⁰ perceptual states. Nevertheless, U-Tree
could potentially be useful once an initial discrete interface to the robot has been learned. The
could potentially be useful once an initial discrete interface to the robot has been learned. The
possibility of replacing Sarsa in SODA with U-Tree or other methods is discussed in Section 7.1.
In addition, supervised learning algorithms that learn decision trees for classification problems can discover and use thresholds for splitting continuous-valued attributes (Mitchell, 1997). It
may be possible to implement similar continuous attribute splitting to apply U-Tree to continuous
state spaces, but there are problems with any method that abstracts a continuous space simply by
partitioning it. MDP-based reinforcement learning methods assume that the environment obeys the
Markov property – that there is no perceptual aliasing – but partition-based state abstractions automatically alias all states within a partition. If the size of the action is not well matched to the
granularity of the state partitioning, action can be highly uncertain, since it is impossible to know
whether executing an action will keep the robot within the same perceptual state or carry it across
the boundary into a new state. Furthermore, even if the scale of the actions is large enough to always move the robot into a new perceptual state, small amounts of positional variation, especially
in orientation or other angular measurements, can often lead to large differences in the outcome
of actions. Although SODA partitions the input space using a SOM, it defines high-level actions
that operate within the perceptual partition (neighborhood) defined by a single SOM unit. SODA
reduces the positional uncertainty induced by the SOM by hill-climbing to perceptually distinctive
states within each perceptual neighborhood.
2.3.2 Backpropagation
One popular non-tabular method of value-function approximation in reinforcement learning is training a feed-forward neural network using backpropagation to approximate the Q or V functions.
While not explicitly a method of feature construction, backpropagation networks implicitly learn an
intermediate representation of their input in their hidden layer, and thus can be said to be performing
a form of state abstraction.
Unfortunately, these features are typically not accessible in a form that allows easy reuse of
the features for other purposes, such as for further bootstrap learning. Typically these features are
encoded in distributed activation patterns across the hidden units in such a way that understanding
the encoded features requires further analysis using methods like clustering and principal component
analysis on the hidden unit activations. Given this difficulty, it is not clear what benefit the backprop
hidden layer provides over performing similar analyses directly on the original inputs.
Moreover, even these networks often require substantial manual engineering of the input
features to work successfully. For example, TD-Gammon (Tesauro, 1995) is often cited as an example of a reinforcement learning method that used a backpropagation network to learn to play
grandmaster-level backgammon by playing games against itself. However, TD-Gammon incorporated significant prior knowledge of backgammon into its input representation (Pollack & Blair,
1997; Sutton & Barto, 1998) before any backpropagation learning took place.
Finally, unlike state abstraction using linear function approximators like tile coding, some
of which are proven to converge near the optimal policy (Gordon, 2000), backpropagation is nonlinear, and no such convergence proofs exist for it.
2.4 SODA and Tabula Rasa Learning
Although one of SODA’s goals is to minimize the need for human prior knowledge in the learning
process, it must be acknowledged that there is no truly tabula rasa learning, and SODA is not
entirely free of prior knowledge. This prior knowledge can be divided into three basic categories:
(1) general learning methods, (2) domain-specific knowledge assumed to come from a lower-level
bootstrap learning process, and (3) the parameter settings of SODA’s constituent learning methods.
First, SODA contains prior knowledge of a variety of general learning methods and representations, specifically those described in Section 2.1: the SSH, Homeostatic-GNG, Sarsa(λ), and
Options. These methods embed in them assumptions about the nature of the world:
• that the world is generally continuous and the agent travels through it on a connected path,
• that high-dimensional sensory input is distributed in a way that can be modeled usefully by a
topological graph structure like the GNG,
• that the spatial world has regularities that allow hill-climbing and trajectory following,
• and as Hume (1777) pointed out, that the past is a reasonable guide to the future.
Second, SODA assumes that certain domain-specific knowledge has already been learned by existing methods, namely:
• a main sensor group or modality that has been separated out from other sensors on the robot,
• an abstract motor interface that has been learned through interaction with the environment.
These items of knowledge can be learned using the methods of Pierce & Kuipers (1997), which
are assumed to form the bootstrap learning layer beneath SODA. These assumptions are described
more formally in Section 3.1.
Third, the parameter settings of SODA’s various learning methods embed some prior knowledge about the world, such as how long the agent must explore the world in order to learn a good
feature set (Chapter 4), or the degree of stochasticity in the environment, which determines an appropriate setting for Sarsa’s learning rate α. Section 7.4, discusses parameter setting in more detail,
and proposes some methods by which the need for such prior knowledge can be further reduced.
2.5 Conclusion
To summarize, SODA rests on three foundational areas: (1) the causal and control levels of the
Spatial Semantic Hierarchy provide a continuous-to-discrete abstraction of action that reduces the
agent’s task diameter and reduces state uncertainty between actions; (2) hierarchical reinforcement
learning methods allow the automatic construction of high-level actions; and (3) self-organizing
maps learn a state abstraction that is concentrated in the important regions of the state space, providing a discrete abstraction for reinforcement learning, and continuous features usable by the SSH
control level. The next chapter describes how the SODA agent combines these three foundational
methods for learning in high-diameter, continuous environments with little prior knowledge.
Chapter 3
Learning Method and Representation
The previous chapter reviewed the prior work on which SODA is based, as well as a variety of
related work. This chapter presents a formal description of SODA’s learning method, followed by
detailed descriptions of the various components of the method. The first section lays out the formal
assumptions of the methods and the steps of the learning algorithm. The following sections describe
the state representation used for learning trajectory-following (TF) and hill-climbing (HC) actions,
the formal definition of TF and HC actions as Options, and alternative formulations of TF and HC
actions that are used for comparison in the experiments in subsequent chapters.
3.1 Overview
The SODA algorithm can be characterized formally as follows. Given
• a robot with a sensory system providing experience at regular intervals as a sequence of N-dimensional, continuous sensory vectors y1 , y2 , . . . , where every yt ∈ RN ;
• a continuous, M-dimensional motor system that accepts at regular intervals from the agent a
sequence of motor vectors u1 , u2 , . . ., where every ut ∈ RM ;
• an almost-everywhere-continuous world, in which small actions usually induce small changes
in sensor values, though isolated discontinuities may exist; and
• a scalar reward signal, r1 , r2 , . . ., that defines a high-diameter task, such that properly
estimating the value of a state requires assigning credit for reward over a long sequence of
motor vectors ut , . . . , ut+k ,
the SODA algorithm consists of five steps:
1. Define a set of discrete, local primitive actions A0 . First, using methods developed by Pierce
& Kuipers (1997), learn an abstract motor interface, i.e., a basis set of orthogonal motor
vectors U = {u0 , u1 , ...un−1 } spanning the set of motor vectors ut possible for the robot.
Then define A0 to be the set of 2n motor vectors formed by the members of U and their
opposites: A0 = U ∪ {−ui |ui ∈ U }.
2. Learn a set F of high-level perceptual features. Exploring the environment with a random
sequence of A0 actions, train a Growing Neural Gas network (GNG) with the sensor signal
yt to converge to a set of high-level features of the environment. For each unit i in the GNG
with weight vector wi there is an activation function fi ∈ F such that:
fi (y) = exp(−|wi − y|² / σ²),                (3.1)
σ = ∑i ∑j n(i, j)|wi − wj | / ∑i ∑j n(i, j),                (3.2)
n(i, j) = 1 if units i and j are adjacent, and 0 otherwise.                (3.3)
These equations define each feature function fi (y) as a Gaussian kernel centered on wi , with a
standard deviation equal to the average distance between adjacent units in the GNG (a code sketch of these feature functions appears after this list).
3. Define trajectory-following control-laws. For each distinctive state defined by fi ∈ F, define
trajectory-following control laws that take the agent to a state where a different feature fj ∈
F is dominant. Methods for defining trajectory-following control laws are described in
Section 3.3.
4. Define a hill-climbing (HC) control law for each fi ∈ F. For each fi , in the context where
arg maxf ∈F f (y) = fi , the controller HCi climbs the gradient of fi to a local maximum. Methods
for defining hill-climbing controllers are described in Section 3.4.
5. Define a set of higher-level actions A1 . Each a1j ∈ A1 consists of executing one TF control
law, and then hill-climbing on the resulting dominant feature fj . At this point the agent
has abstracted its continuous state and action space into a discrete Markov Decision Process
(MDP) with one state for each feature in F, and the large-scale actions in A1 . At the A0
level, this abstraction forms a Semi-Markov Decision process, as described in Section 2.1.2,
in which the choice of A0 action depends on both the current state and the currently running
A1 action. In the case where there is perceptual aliasing, the abstract space forms a Partially
Observable Markov Decision Process (POMDP); extending SODA to the POMDP case is
discussed further in Section 7.1.
Tasks such as robot navigation have considerably smaller diameter in this new A1 state-action space
than in the original A0 space, allowing the agent to learn to perform them much more quickly.
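The feature functions of Equations 3.1–3.3 (step 2 above) can be evaluated as in the following sketch; the three-unit prototype set and its adjacency matrix are made-up examples, not learned values.

```python
import numpy as np

def feature_activations(y, weights, adjacency):
    """Evaluate Equations 3.1-3.3: Gaussian activations of every GNG unit.

    weights: (n_units, dim) GNG prototypes; adjacency: (n_units, n_units)
    0/1 matrix n(i, j) of topological connections."""
    pair_dist = np.linalg.norm(weights[:, None, :] - weights[None, :, :], axis=2)
    sigma = (adjacency * pair_dist).sum() / adjacency.sum()   # Eq. 3.2: mean adjacent distance
    d = np.linalg.norm(weights - y, axis=1)                   # |w_i - y| for every unit
    return np.exp(-(d ** 2) / sigma ** 2)                     # Eq. 3.1

# Toy usage: three prototypes connected in a chain (made-up values).
W = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
N = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
print(feature_activations(np.array([0.9, 0.1]), W, N))
```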
π_i^TF : Open-loop Trajectory-follow on a_i^0 :
    f_w ← arg max_{f∈F} f(y)
    while f_w = arg max_{f∈F} f(y):
        execute action a_i^0
    f_ds ← arg max_{f∈F} f(y)

π_ij^TF : Closed-loop Trajectory-follow on a_i^0 in feature f_j :
    f_w ← arg max_{f∈F} f(y)
    while f_w = f_j :
        a ← arg max_{b∈A_ij^TF} Q_ij^TF (y, b)
        execute action a
    f_ds ← arg max_{f∈F} f(y)
Table 3.1: Open-loop and Closed-loop Trajectory-following. Top: The open-loop trajectory-following macro repeats the same action until the current SOM winner changes. Bottom: The
closed-loop Option policy chooses the action with the highest value from the Option’s action set.
The action set, defined in Equation (3.7), is constructed to force the agent to make progress in the
direction of ai while being able to make small orthogonal course adjustments.
3.2 Topn State Representation
Trajectory-following and hill-climbing Options operate within an individual perceptual neighborhood. In order to learn policies for them, the learner needs a state representation that provides more
resolution than the winning GNG unit can provide by itself. Therefore, the TF and HC Options
described in Sections 3.3 and 3.4 use a simple new state abstraction derived from the GNG, called
the Topn representation.
If i_1, i_2, ..., i_{|F|} are the indices of the feature functions in F, sorted in decreasing order of
the value of f_i(y), then Top_n(y) = ⟨i_1, ..., i_n⟩. This representation uses the GNG prototypes to
create a hierarchical tessellation of the input space starting with the Voronoi tessellation induced by
the GNG prototypes. Each Voronoi cell is then subdivided according to the next closest prototype,
and those cells by the next, etc. This tuple of integers can easily be hashed into an index into a Q-table for use as a state representation. This state representation allows the agent to learn TF and HC
Option policies using simple, tabular reinforcement learning methods like Sarsa(λ), reusing existing
information from the GNG instead of needing an entirely new learning method for Option policies.
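For illustration, a minimal sketch of the Top_n computation and one plausible way to hash the resulting tuple into a Q-table index is shown below; the hashing scheme and names are assumptions for the example, not necessarily those used in the dissertation.

    import numpy as np

    def top_n(activations, n):
        """Indices of the n most active features, in decreasing order of f_i(y)."""
        order = np.argsort(-np.asarray(activations))
        return tuple(int(i) for i in order[:n])

    def state_index(top, num_features):
        """Hash a Top_n tuple into a single integer index for a tabular Q-function."""
        idx = 0
        for i in top:
            idx = idx * num_features + i
        return idx

    acts = [0.2, 0.9, 0.5, 0.1]
    state = top_n(acts, 3)            # e.g. (1, 2, 0): winner, runner-up, third
    print(state, state_index(state, len(acts)))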
3.3 Trajectory Following
The purpose of trajectory-following actions is to move the robot from one perceptual neighborhood
to another by moving the robot through a qualitatively uniform region of the environment. The
classic example is following a corridor: beginning in a pose aligned with the corridor, the robot
moves down the corridor until it reaches a qualitatively different region of space, e.g. an intersection
or a dead end.
The simplest form of trajectory-following is to repeat a single action until the SOM winner
changes, as described in Table 3.1. Chapter 5 will show that this sort of open-loop macro is unreliable when a realistic amount of noise perturbs the robot’s trajectory. Angular deviations from
motor noise accumulate, causing large deviations in the trajectory, sometimes pushing the robot off
of the “side” of the trajectory. As a result, the end state of the TF actions varies greatly, making
their outcomes highly unreliable. The solution described below constructs closed-loop TF Options
that can correct for the perturbations of noise and keep the robot moving along the trajectory that
best matches the current perceptual prototype. For example, in corridor following, the closed-loop
TF Option is expected to move down the hallway, correcting for deviations to maintain the view
looking forward as much as possible. As described in Section 2.1.2, each Option is defined by an
initiation set I, a termination function β, a pseudo-reward function R, an action set A, and a policy
π. The remainder of this section defines these elements for trajectory-following Options.
To achieve reliable trajectories, SODA defines a closed-loop TF Option for each combination of prototype and primitive action: {TF_ij | ⟨a_i, f_j⟩ ∈ A^0 × F}. The initiation set of each TF
Option is the set of states¹ where its prototype is the winner:

I_{ij}^{TF} = \{ y \mid j = \arg\max_k f_k(y) \}.    (3.4)

The Option terminates if it leaves its prototype's perceptual neighborhood:

\beta_{ij}^{TF}(y) = \begin{cases} 0 & \text{if } y \in I_{ij}^{TF} \\ 1 & \text{otherwise.} \end{cases}    (3.5)
Each TF Option's pseudo-reward function is designed to reward the agent for keeping the current
feature value as high as possible for as long as possible, thus:

R_{ij}^{TF}(y) = \begin{cases} f_j(y) & \text{if not terminal,} \\ 0 & \text{if terminal.} \end{cases}    (3.6)
In order to force the TF Options to make progress in some direction (instead of just oscillating
in some region of high reward), each Option is given a limited action set consisting of a progress
action selected from A0 , plus a set of corrective actions formed by adding a small component of
each orthogonal action in A0 :
A_{ij}^{TF} = \{ a_i \} \cup \{ a_i + c_{tf}\, a_k \mid a_k \in A^0,\; a_k^T a_i = 0 \}.    (3.7)
Lastly, the Option policy π_{ij}^{TF} is learned using tabular Sarsa(λ), using the Top_n(y) state representation described in Section 3.2, and the actions A_{ij}^{TF}.
¹To simplify the terminology, these descriptions refer to the input vector y as if it were the state s. Since y is a
function of s, this terminology is sufficient to specify the Options.
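The components of a TF Option defined above can be collected into a short sketch. The code below is an illustration of Equations (3.4)–(3.7) with hypothetical names and an assumed value of c_tf = 0.1 (consistent with the corrective actions [1, ±0.1]^T used in the Chapter 5 experiment); it is not the SODA implementation itself.

    import numpy as np

    C_TF = 0.1  # size of the orthogonal corrective component (assumed value)

    def tf_initiation(acts, j):
        """Eq. (3.4): TF_ij may start only where prototype j is the winner."""
        return int(np.argmax(acts)) == j

    def tf_termination(acts, j):
        """Eq. (3.5): beta = 1 (terminate) once the winner is no longer j."""
        return 0 if int(np.argmax(acts)) == j else 1

    def tf_reward(acts, j, terminal):
        """Eq. (3.6): pseudo-reward is f_j(y) until termination, then 0."""
        return 0.0 if terminal else float(acts[j])

    def tf_action_set(A0, i):
        """Eq. (3.7): the progress action a_i plus small orthogonal corrections."""
        a_i = A0[i]
        corrections = [a_i + C_TF * a_k for a_k in A0 if np.dot(a_k, a_i) == 0]
        return [a_i] + corrections

    A0 = [np.array([1.0, 0.0]), np.array([-1.0, 0.0]),
          np.array([0.0, 1.0]), np.array([0.0, -1.0])]
    print(tf_action_set(A0, 0))   # forward, forward+slight-left, forward+slight-right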
This definition of trajectory-following actions as Options allows the agent to learn closed-loop trajectory-following control for each combination of perceptual feature and primitive action.
Experiments in Chapter 5 show that the learned Options are far more reliable than open-loop TF in
a robot with realistic motor noise.
3.4 Hill-climbing
Once a SODA agent has executed a trajectory-following action to carry the robot from one distinctive state into the neighborhood of another, it then performs hill-climbing to reach a new distinctive
state. Hill-climbing actions remove positional uncertainty that may have accumulated during the
execution of a trajectory-following action by moving the robot to a fixed point in the environment
defined by the local maximum of the activation of the current winning SOM unit. One way of
moving to the local maximum is to estimate the gradient of the winning feature function f and follow it upward. An alternative method, used by SODA, is to learn a hill-climbing Option for each
perceptual neighborhood, that climbs to a local maximum.
Section 3.4.1 below describes two means of gradient-estimate hill-climbing: sampling the
feature changes from each action and using action models to estimate the feature changes. These
methods have drawbacks: the former is inefficient, and the latter requires substantial prior knowledge of the dynamics of the sensorimotor system. Section 3.4.2 describes how HC actions can
instead be formulated as Options and learned using reinforcement learning; Chapter 5 will then show
that learning to hill-climb in this fashion results in actions that are as efficient as those using a
hand-built action model, but that do not need prior knowledge of the action dynamics.
3.4.1 HC using Gradient Approximation
Ideally, a hill-climbing action would move the agent exactly in the direction of the feature gradient
(the greatest increase in feature value). Unfortunately, knowledge of the exact direction of the
gradient is not available to the agent. In addition, it is likely that none of the primitive actions A0 will
move the agent exactly in the gradient direction. However, it is possible to approximate the gradient
by selecting the primitive action that increases the value of the feature by the greatest amount at
each step. The change in a feature value induced by a particular action, or the feature-action delta
is denoted Gij (t), and defined as
G_{ij}(t) \triangleq f_i(y_{t+1}) - f_i(y_t) \quad \text{given } u_t = a_j^0.    (3.8)
When the time t is obvious in context (e.g. the current time at which the agent is operating), the
feature-action delta will be abbreviated simply as Gij .
Hill-climbing using gradient approximation is accomplished using the simple greedy policy
shown in Table 3.2. This policy chooses the action with the greatest estimated feature-action delta
π_i^HC: Hill-climb on f_i:
    while not β_i^HC:
        w ← arg max_j G_ij
        execute action a_w^0

Table 3.2: Pseudo-code for hill-climbing policy π_i^HC using gradient estimation. The value of G_ij is
the estimated change in feature f_i with respect to primitive action a_j^0. G_ij can be determined either
by sampling the change induced by each action or by using an action model to predict the change.
Sampling is simple and requires no knowledge of the robot's sensorimotor system or environment
dynamics, but is expensive, requiring 2|A^0| − 2 sampling steps for each movement.
and executes it, terminating when all the estimated deltas are negative. The two methods described
in this section differ in how they obtain the estimate of Gij , and in how the termination condition
βiHC is defined.
Sampling the Deltas
The simplest way to estimate the deltas is by applying each action a0i , recording the feature change,
and reversing the action (by applying −a0i ). The termination criterion for this policy is simply to
stop when all the estimates are negative:
\beta_i^{HC} = 1 \iff \max_j G_{ij} < 0.    (3.9)
This method easily produces an estimate for each action, but it requires 2|A0 |−2 exploratory actions
for each action “up the hill.” (The two steps are saved by caching the feature change from the last
up-hill action, so that it is not necessary to sample back in the direction from which the agent came.)
These extra actions are very costly, even with small action spaces.
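A minimal sketch of delta sampling is shown below, assuming a step callback that applies a motor vector to the robot and returns the next input vector (the callback and the caching convention are illustrative assumptions, not the dissertation's interface).

    def sample_deltas(step, f_i, y, A0, skip=None):
        """Estimate G_ij = f_i(y') - f_i(y) for each primitive action a_j by trial.

        step(a) executes motor vector a and returns the resulting input vector.
        skip is the index of the action leading back where the agent came from;
        its delta is cached by the caller, saving the two extra sampling steps.
        """
        base = f_i(y)
        deltas = {}
        for j, a in enumerate(A0):
            if j == skip:
                continue
            y_trial = step(a)                 # try the action ...
            deltas[j] = f_i(y_trial) - base   # ... record the feature change ...
            step(-a)                          # ... then reverse it.
        return deltas

With the four primitive actions used in these experiments, this amounts to the 2|A0| − 2 = 6 extra sampling actions per up-hill step noted above.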
Predicting the Deltas with Action Models
Although the purpose of SODA is to construct an agent that learns with little a priori knowledge
from human engineers, a human-engineered predictive action model for hill-climbing makes a useful comparison with the autonomously learned HC Options described in Section 3.4.2. A controller
using a predictive model can dispense with costly gradient sampling and simply use its model to
predict the gradient. In this case, the model would take the form of a function D such that
\hat{y}_{j,t+1} = D(y_t, a_j),    (3.10)

where \hat{y}_{j,t+1} is the estimated value of y after executing action a_j at time t. Using this model, the
feature-action delta can be estimated as

G_{ij}(t) \approx f_i(\hat{y}_{j,t+1}) - f_i(y_t).    (3.11)
When using approximate models, in some cases the gradient magnitude will be near the precision
of the model, and approximation error may cause a sign error in one or more of the estimates. In
this case the process may fail to terminate, or may terminate prematurely, under the termination
condition in Equation 2.10. Without the termination condition, however, the greedy hill-climbing
policy will cause the agent to “hover” near the true local maximum of the feature, with little net
increase in fi (y) over time. Thus it is possible to define a new, k-Markov stopping condition in
which the Option terminates if the average step-to-step change in the feature value over a finite
moving window falls below a small fixed threshold:
\beta_i^{HC}(y_{t-c_w}, \ldots, y_t) = \left\lceil c_{stop} - \frac{\sum_{k=0}^{c_w - 1} |\Delta^{t-k} f_i(y)|}{c_w} \right\rceil,    (3.12)

where c_w is the window size, c_stop is a constant threshold, and

\Delta^t f_i(y) = f_i(y_t) - f_i(y_{t-1})    (3.13)
is the change in feature value at time t.
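The window test of Equation (3.12) can be sketched as a small stateful check. The default constants below are the values used for the experiments in Section 5.2.2 (c_w = 10, c_stop = 0.005); the class structure itself is only an illustration.

    from collections import deque

    class KMarkovStop:
        """Terminate when the mean per-step change in f_i over a window is small."""

        def __init__(self, c_w=10, c_stop=0.005):
            self.c_w, self.c_stop = c_w, c_stop
            self.history = deque(maxlen=c_w + 1)   # keeps c_w successive differences

        def update(self, f_value):
            """Feed f_i(y_t); return 1 to terminate, 0 to continue (Eq. 3.12)."""
            self.history.append(f_value)
            if len(self.history) <= self.c_w:
                return 0                            # not enough history yet
            diffs = [abs(self.history[k + 1] - self.history[k])
                     for k in range(self.c_w)]
            return 1 if sum(diffs) / self.c_w < self.c_stop else 0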
Chapter 5 describes a predictive model of sensorimotor dynamics specifically engineered
for the robot and environments used in this dissertation, and compares its performance with both
delta sampling and learned HC Options, described in the next section.
3.4.2 Learning to Hill-Climb with RL
The greedy, gradient-based methods in Section 3.4.1 above have some drawbacks. Specifically, the
sampling-based method wastes many actions gathering information, while the model-based method
requires an action model D. Such a model could be provided a priori, but one of the objectives
of SODA is to develop a system that can learn with no such prior knowledge. Although it may
be possible to learn D from interaction with the environment using supervised learning techniques,
that would require adding yet another learning method to the system. Given that the system already
learns to follow trajectories by reinforcement learning, it is natural to use the same methods for
hill-climbing as well.
With hard-coded policies (Table 3.2), SODA would need only a single hill-climbing policy
that worked in all perceptual neighborhoods. In contrast, when hill-climbing is learned, each feature
presents a different pseudo-reward function, and thus requires a separate HC Option, HCi , for each
fi in F. As with the TF Options, the initiation set of each HC Option is the perceptual neighborhood
of that Option’s corresponding GNG prototype:
I_i^{HC} = \{ y \mid \arg\max_j f_j(y) = i \}.    (3.14)
Termination, however, is more complicated for hill-climbing. The perceptual input is unlikely to
ever match any perceptual prototype exactly, so the maximum feature value attainable in any neighborhood will be some value less than 1. Because it is difficult or impossible to know this value in
advance, the stopping criterion is not easily expressed as a function of the single-step input. Rather,
the Options use the same k-Markov termination function described above in Equation 3.12 for use
with gradient approximation.
The task of the HC Option is to climb the gradient of its feature as quickly as possible,
and terminate at the local maximum of the feature value. On nonterminal steps the pseudo-reward
for each HC Option is a shaping function (Ng, Harada, & Russell, 1999) consisting of a constant
multiple of the one-step change in fi , minus a small penalty for taking a step; on terminal steps, the
reward is simply fi itself:
R_i^{HC} = \begin{cases} c_{R1}\, \Delta f_i(y) - c_{R2} & \text{if not terminal,} \\ f_i(y) & \text{if terminal.} \end{cases}    (3.15)
Finally, the action set for HC Options is just the set of primitive actions:
A_i^{HC} = A^0.    (3.16)
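As a small illustration, the shaping pseudo-reward of Equation (3.15) reduces to a few lines; the constants are the experimental settings reported in Section 5.2.2 (c_R1 = 10, c_R2 = 0.001), and the function name is hypothetical.

    C_R1, C_R2 = 10.0, 0.001   # shaping constants reported in Section 5.2.2

    def hc_reward(f_curr, f_prev, terminal):
        """Eq. (3.15): reward the one-step rise in f_i minus a small step penalty;
        on the terminal step, return the feature value itself."""
        if terminal:
            return f_curr
        return C_R1 * (f_curr - f_prev) - C_R2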
Hill-climbing policies learned in this way do not use explicit estimates of the feature-action
delta Gij . When they are learned using standard temporal-difference policy learning methods like
Sarsa (Section 2.1.2), they do learn a similar function, i.e. the state-action value function Q(y, a).
This function can be thought of as a kind of “internal gradient,” on which the controller hill-climbs.
One important difference between the learned value function and the true gradient, however, is that
because the learning algorithm distributes credit for reward changes over the sequence of past actions,
it is possible for the controller to sometimes achieve higher feature activations than possible with
greedy, gradient-based policies. This is possible because the learned policies can perform down-gradient actions that eventually lead to a higher ultimate reward. An example of this is shown in
Chapter 5.
This section has given the formal definition for SODA’s learned hill-climbing Options and
described two alternatives using gradient approximation. Chapter 5 presents experiments showing
that the learned Options perform as well as the other two HC methods, without requiring extensive prior knowledge, or expensive sampling. When executed after a trajectory-following Option
(Section 3.3) HC Options form the second step in SODA’s two-step A1 actions used for high-level
navigation. The next section concludes this chapter with a summary of the formal definition of
SODA and reviews the questions answered by the experiments in the next three chapters.
3.5 Conclusion
Table 3.3 summarizes the formal specification of the components of SODA: the primitive actions
A0 , GNG for feature learning, TF and HC Options, and A1 high-level actions used for high-diameter
navigation. The next three chapters present experimental results that answer several questions about
the method:
A0 (Primitive) Actions
    Action output:          application of a motor vector from the basis u_0, ..., u_m

GNG Feature learning
    Input:                  raw input vector y
    Outputs:                perceptual prototypes w_i; continuous features f_i; Top_n(y) state representation for TF and HC Options

Trajectory-following Options
    Policy:                 learned with RL
    Initiation:             I_{ij}^{TF} = {y | j = arg max_k f_k(y)}
    Termination:            β_{ij}^{TF}(y) = 0 if y ∈ I_{ij}^{TF}, 1 otherwise
    State representation:   Top_n(y)
    Actions:                A_{ij}^{TF} = {a_i} ∪ {a_i + c_tf a_k | a_k ∈ A^0, a_k^T a_i = 0}
    Reward:                 R_{ij}^{TF}(y) = f_j(y) if not terminal, 0 if terminal

Hill-climbing Options
    Policy:                 learned with RL
    Initiation:             I_i^{HC} = {y | i = arg max_j f_j(y)}
    Termination:            β_i^{HC}(y_{t−c_w}, ..., y_t) = ⌈ c_stop − (Σ_{k=0}^{c_w−1} |Δ^{t−k} f_i(y)|) / c_w ⌉
    State representation:   Top_n(y)
    Actions:                A^0
    Reward:                 R_i^{HC} = c_R1 Δf_i(y) − c_R2 if not terminal, f_i(y) if terminal

A1 Actions
    Policy:                 hard-coded: TF + HC
    Initiation:             same as TF component
    Termination:            same as HC component
    State representation:   N/A
    Actions:                TF, HC
    Reward:                 N/A

Table 3.3: Summary of the components of SODA.
• Can the GNG learn a feature set that covers a robot’s environment from data gained through
random exploration? Chapter 4 introduces a simulated mobile robot and two navigation environments and shows that SODA learns rich feature sets for both environments.
• Do the learned TF and HC Options perform as well as or better than obvious hand-coded alternatives? Chapter 5 presents experiments comparing open-loop and closed-loop TF, showing
that learned closed-loop TF Options produce longer, more reliable trajectories. In addition,
the chapter compares learned HC Options against HC by sampling feature-deltas, and HC
using feature-delta estimates from a hand-coded predictive model of the robot’s perceptual
dynamics; these experiments show that while all three methods achieve comparable final activation levels, the learned Options are far more efficient than sampling, while not requiring
the a priori knowledge needed for the hand-coded model.
• Do the A1 actions reduce task diameter? Chapter 6 shows that SODA reduces the diameter
of robot navigation tasks by an order of magnitude.
• Do the A1 actions enable the agent to learn to navigate more quickly? Experiments in Chapter 6 show dramatic speedups in navigation using A1 actions over using A0 actions.
• What is the contribution of the HC step in the A1 actions? An ablation study in Chapter 6
shows that using hill-climbing makes state transitions more reliable, and reduces task diameter over navigating using TF Options alone.
To summarize, this chapter has presented the formal description of SODA. The following
chapters present empirical evaluation of SODA's feature learning, trajectory-following and hill-climbing Options, and high-diameter navigation.
Chapter 4
Learning Perceptual Features
In the first phase of learning, a SODA agent explores the environment, collecting sensor observations. As the observations are received they are given as training examples to a Homeostatic-GNG
(Section 2.1.3) network that learns a set of perceptual features in the form of prototypical sensor
views. This chapter describes experimental results from running this phase of learning on a simulated robot in two environments: the first is a small hand-built environment, and the second a large,
realistic environment, derived from an actual floor of a building on the University of Texas campus. Example feature sets learned in each environment show that the GNG learns a wide variety of
features covering the sensory space of the robot in the environment.
4.1 Experimental Setup
All the experiments in this dissertation were performed using the Stage robot simulator (Gerkey,
Vaughan, & Howard, 2003). Stage is a widely-used simulator for two-dimensional mobile robot
simulations. It can simulate a wide variety of robot hardware, and allows the fixed “architecture” of
the robot’s environment to be described easily using bitmapped layouts.
The robot configuration used in the experiments simulates an RWI Magellan Pro robot. The
robot was equipped with a laser range finder returning 180 readings at 1◦ intervals over the forward
semicircle around the robot, with a maximum range of 8000 mm. The Stage laser rangefinder model
returns ranges in mm, with no noise. Sensor noise was simulated through a two-stage process of
alternately adding Gaussian error and rounding, applied individually to each range reading. This
model provides a good characterization of the error on a SICK LMS laser rangefinder. However, the
actual noise in a SICK LMS is very small, approximately ±10mm, and it is not a significant cause
of uncertainty.
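A sketch of such a two-stage noise model is shown below. Since the text does not give the exact constants, the Gaussian standard deviation and rounding quantum here are placeholder values chosen only for illustration, and a single add-then-round pass stands in for the alternating process described above.

    import numpy as np

    def noisy_scan(ranges_mm, sigma_mm=5.0, quantum_mm=10.0, rng=np.random):
        """Two-stage sensor noise: add Gaussian error, then round each reading.

        ranges_mm: array of 180 noise-free ranges from the simulated rangefinder.
        sigma_mm and quantum_mm are illustrative values, not the dissertation's.
        """
        perturbed = ranges_mm + rng.normal(0.0, sigma_mm, size=ranges_mm.shape)
        return np.round(perturbed / quantum_mm) * quantum_mm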
The robot was also equipped with a differential-drive base, taking two continuous control
values, linear velocity v and angular velocity ω. The Stage simulator does not model positional
(a) Abstract Motor Interface
            u^0            u^1
    drive   250 mm/sec     0 mm/sec
    turn    0°/sec         20°/sec

(b) Primitive Actions A^0
    action    u        step
    a_0^0     u^0      25 mm
    a_1^0     −u^0     −25 mm
    a_2^0     u^1      2°
    a_3^0     −u^1     −2°

Table 4.1: T-Maze Abstract Motor Interface and Primitive Actions.
error internally, so motor noise was simulated by perturbing the motor commands thus:
\hat{v} = N(v,\; k_{vv} v + k_{v\omega}\,\omega),    (4.1)

\hat{\omega} = N(\omega,\; k_{\omega v} v + k_{\omega\omega}\,\omega),    (4.2)

where v̂ and ω̂ are the noisy motor commands, and N(µ, σ) is Gaussian noise with mean µ and
standard deviation σ. The constants used were k_vv = 0.1, k_vω = 0.1, k_ωv = 0.2, k_ωω = 0.1. This is
a simplified motor noise model, inspired by realistic models used in robot localization and mapping
(Roy & Thrun, 1999; Beeson, Murarka, & Kuipers, 2006). The robot accepts motor commands
from the agent 10 times per simulated second.
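Equations (4.1)–(4.2) translate directly into code. The sketch below uses the constants listed above; the absolute values are an added assumption to keep the standard deviation nonnegative for reverse commands.

    import numpy as np

    K_VV, K_VW, K_WV, K_WW = 0.1, 0.1, 0.2, 0.1   # noise constants from the text

    def noisy_command(v, w, rng=np.random):
        """Perturb a (linear v, angular w) command as in Eqs. (4.1)-(4.2)."""
        v_hat = rng.normal(v, K_VV * abs(v) + K_VW * abs(w))
        w_hat = rng.normal(w, K_WV * abs(v) + K_WW * abs(w))
        return v_hat, w_hat

    # Example: a noisy forward-drive command at 250 mm/sec.
    print(noisy_command(250.0, 0.0))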
This simulated robot was used for all the experiments described in this dissertation, including those in Chapters 5 and 6. The two simulated environments used are described in the next two
sections, along with any minor differences in simulation settings used in the different environments.
4.2 T-Maze Environment
The first experiment environment, called the T-Maze, is a 10 m × 6 m T-shaped room, shown in
Figure 4.1. The T-Maze was designed to be large enough to provide an interesting test environment,
yet small enough to allow experiments to run quickly. The environment has a single major decision
point, the central intersection, and the extremities of the space are separated from one another by several
hundred primitive actions, allowing reasonably long-diameter navigation tasks.
In this environment the robot was given a basic drive speed of 250 mm/sec, and a basic turn
speed of 20◦ /sec, resulting in the abstract motor interface and primitive action set shown in Table 4.1.
For all experiments in both the T-Maze and the ACES environment (below), the discovery of the
abstract motor interface and the primitive actions A0 (step 1 of the algorithm in Chapter 3) was
assumed to have already been performed, using the methods of Pierce & Kuipers (1997).
To learn perceptual features in the T-Maze environment, the agent was allowed to wander by
selecting randomly from its set of A0 actions for 500,000 steps (about 14 simulated hours), training
its GNG with the input vector y received on each step. The GNG parameters (Fritzke, 1995) are
λ = 2000, α = 0.5, β = 0.0005, ε_b = 0.05, ε_n = 0.0006, a_max = 100. The GNG was configured
Figure 4.1: Simulated Robot Environment, T-Maze. (a) A “sensor-centric” plot of a single scan
from the robot’s laser rangefinder in the T-maze environment, with the individual range sensor on
the X axis and the range value on the Y axis. (b) The egocentric plot of the same scan in polar
coordinates, with the robot at the origin, heading up the Y axis. This format provides a useful
visualization of laser rangefinder scans for human consumption. (c) A screen-shot of the Stage
robot simulator with the robot in the pose that generated the scan used in (a) and (b). The shaded
region represents the area scanned by the laser rangefinder. The robot has a drive-and-turn base, and
a laser rangefinder. The environment is approximately 10 meters by 6 meters. The formats in (a)
and (b) will be used in later figures to display the learned perceptual prototypes.
to grow only if the average cumulative distortion error across all units was greater than 0.5%. These
parameters were selected after hand experimentation with the algorithm, and can be understood as
follows: λ allows the network to grow every 2000 input presentations, or just over three minutes
of simulated time (if the error threshold is met). This allows rapid growth early on in learning
(when error is usually high), but still allows the agent to experience a large amount of its local
neighborhood before growing again. The error decay β was set to 1/2000, the reciprocal of λ. This
setting allows error to decay to about 40% of its original value during the course of one learning
period, focusing learning on error accumulated in the most recent one or two growth periods. The
parameters ǫb and ǫn are the learning rates for the winner and its neighbors, respectively, and are set
to low values to counteract the similarity between successive inputs when they are presented from a
random walk in the environment. If the learning rates are too high, the winning neighborhood will
“track” the inputs, adapting so much on each presentation that the winning unit never changes. The
parameter amax sets how long the algorithm takes to delete edges in the connection graph when
they connect nodes that have moved away from one another in the input space; amax = 100 is a
relatively high value, designed to keep good connectivity when the units are distributed in a high-dimensional input space. Finally, the error threshold of 0.5% was set as the maximum value that
produced reasonably full coverage of environmental features on each run.
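The growth rule just described, attempt to insert a unit every λ inputs but only while the average distortion error is above the 0.5% threshold, can be sketched as follows. The method names on the gng object are hypothetical, the threshold is written as a fraction for illustration, and the real Homeostatic-GNG update involves more bookkeeping than shown.

    GROWTH_PERIOD = 2000      # lambda: inputs between growth attempts
    ERROR_THRESHOLD = 0.005   # grow only while average distortion error > 0.5%

    def maybe_grow(gng, step_count):
        """Attempt a GNG growth step every GROWTH_PERIOD input presentations,
        but only while the average distortion error remains above threshold."""
        if step_count % GROWTH_PERIOD != 0:
            return False
        if gng.average_distortion_error() <= ERROR_THRESHOLD:
            return False
        gng.insert_unit_near_highest_error()   # standard GNG insertion step
        return True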
Ten runs with these parameters were performed to train ten GNG networks for use in the experiments in Chapter 6, to allow those experiments to test navigation ability across multiple feature
sets.
Figure 4.2 shows the features learned in one learning run in the T-Maze. The features are
organized into rows according to the GNG topology. Since the GNG weight vectors have the same
dimensionality as the sensory input, the learned weight vectors (features) can be thought of as
perceptual prototypes. The first plot in each row shows the sensory prototype represented by the
weight vector of a GNG unit, and the remaining plots in the row show the prototypes represented by
the neighboring units in the GNG topology. On the left the prototypes are plotted in the human-readable egocentric visualization of Figure 4.1(b), while on the right the same figures are plotted in
the sensor-centric format of Figure 4.1(a).
Three of the units, features 8, 35, and 65, are shown enlarged in Figure 4.3. Feature 8 shows
a prototypical view of the agent’s sensation of looking straight down the long corridor, feature 35
shows a view that represents facing the wall, at the end of the long corridor, and feature 65 shows a
prototypical view from the middle of the T intersection.
4.3 ACES Fourth Floor Environment
In order to test the ability of SODA to scale up to larger environments, the robot was also run in a
simulation of the fourth floor corridors of the ACES building on the UT Austin campus. This environment, shown in Figure 4.4, was constructed from an occupancy-grid map of the floor, collected
Figure 4.2: Example Learned Perceptual Prototypes, T-Maze. The agent’s self-organizing feature map learns a set of perceptual prototypes that are used to define perceptually distinctive states
in the environment. This figure shows the set of features learned from one feature-learning run in
the T-Maze. In each row the first figure represents a unit in the GNG, and the other figures represent
the units it is connected to in the GNG topology. Some rows are omitted to save space, but every
learned feature appears at least once. Each feature is a prototypical laser rangefinder image. On the
left, the ranges are plotted in the human-readable format of Figure 4.1(b). On the right, the ranges
are plotted in the sensor-centric format of Figure 4.1(a). The agent learns a rich feature set covering
the perceptual situations in the environment.
Figure 4.3: Three Learned Prototypes, Enlarged. Enlarged views of features 8, 35, and 65 from
Figure 4.2. These figures show prototypical sensory views of looking down the long corridor in the
T-Maze, looking toward the side wall at the end of the corridor, and the view from the middle of the
T intersection, respectively.
(a) Abstract Motor Interface
            u^0            u^1
    drive   500 mm/sec     0 mm/sec
    turn    0°/sec         20°/sec

(b) Primitive Actions A^0
    action    u        step
    a_0^0     u^0      50 mm
    a_1^0     −u^0     −50 mm
    a_2^0     u^1      2°
    a_3^0     −u^1     −2°

Table 4.2: ACES Abstract Motor Interface and Primitive Actions.
using a physical RWI Magellan Pro robot similar to the simulated robot described in Section 4.1. To
construct the simulated environment from the original occupancy-grid map, the walls were thickened slightly and some imperfections removed to ensure that the walls would be entirely opaque to
the simulated laser scanner. In addition to being much larger than the T-Maze at approximately 40 m
× 35 m, the ACES environment is perceptually richer, with a rounded atrium, T- and L-intersections,
a dead end, and an alcove.
Because of the large size of the ACES environment, a very long random walk over primitive
A0 actions would be required for the agent to experience enough of the environment to learn a
useful feature set. Instead, SODA was configured to explore the environment using a random walk
over trajectory-following options (Section 3.3). In this case the GNG was trained concurrently with
the training of the TF option policies. However, for experimental clarity, the learned TF policies
were discarded after the feature sets were learned, then the TF options were trained anew during the
experiments in Chapter 6. The GNG was trained on each primitive step while the TF macros were
running. In addition, to reduce the running time of the experiments in this larger environment, the
robot’s forward speed was doubled to 500 mm/sec, giving the abstract motor interface described in
Table 4.2.
As with the T-Maze environment, ten runs were performed to train ten GNGs for later use in
the experiments in Chapter 6. In each run, the agent explored the environment for 5,000,000 time
Figure 4.4: Simulated Robot Environment, ACES. The ACES4 environment. A simulated environment generated from an occupancy grid map of the fourth floor of the ACES building at UT
Austin. The map was collected using a physical robot similar to the simulated robot used in these
experiments. The environment is approximately 40m × 35m. The small circle represents the robot.
The area swept by the laser rangefinder is shaded. This environment is much larger and perceptually
richer than the T-maze.
steps. An example learned GNG from the ACES environment is shown in Figure 4.5.
4.4 Feature Discussion
As shown in Figures 4.2 and 4.5, the features learned by SODA in these environments cover a
broad range of the environment, showing prototypical views of corridors and intersections at a wide
variety of relative angles, as well as, in ACES, a variety of views of the central atrium. In fact, a
large part of every feature set is dedicated to representing similar views that differ mainly by small
changes in relative angle (Figure 4.6). This preponderance of similar features results from the fact
that a small rotation of the robot in its configuration space induces a “shift” of the values in the
input vector generated by the laser rangefinder. When some elements of the input vector are very
large and some are very small, as is often the case in these environments, such a shift moves the
Figure 4.5: Example Learned Perceptual Prototypes, ACES. The agent’s self-organizing feature
map learns a set of perceptual prototypes that are used to define perceptually distinctive states in the
environment. The richer variety of perceptual situations in the ACES environment produces a larger
set of features using the same parameter settings for the Homeostatic-GNG.
Figure 4.6: Views of T-Maze Intersection. The use of Euclidean distance as the similarity metric
for the GNG leads to the learning of many similar views that differ only by a small rotation. A small
rotation in the robot’s configuration space induces a large (Euclidean) movement in the robot’s input
space causing large distortion error in the GNG, which must add more features in that region of the
space to reduce the error. The large number of features means that the agent must traverse many
more distinctive states when turning than when traveling forward and backward. Nevertheless,
the experiments in Chapter 6 show that SODA still greatly reduces task diameter and improves
navigation learning.
input a large Euclidean distance in the input space, essentially moving the vector from one “corner”
of the space to another. Given that the stated purpose of SODA is to decrease the diameter of the
task in the environment, one might argue that a more coarse-grained feature representation would
be preferable. Indeed Chapter 6 will show that SODA agents tend to use many more A1 actions
to turn than they use to travel forward (and backward). Section 7.2.2 describes possible future
research directions for finding ways to cut down or prune this proliferation of features from turning.
Chapter 6 will show, however, that despite these extra features, SODA still does a very good job of
cutting down the task diameter in the T-Maze and ACES environments. The reinforcement learning
algorithm learns from experience which features are useful in getting to the goal quickly, essentially
doing a form of rudimentary feature selection. Furthermore, Chapter 5 describes how SODA agents
learn trajectory-following and hill-climbing controllers that operate within the neighborhood of a
single feature. These controllers use the proximity of the input to nearby features (other than the
winner) to do this kind of “intra-neighborhood” navigation. The success of SODA at doing both
high-level and low-level navigation suggests that detailed feature sets such as those shown in this
chapter are not only sufficient for SODA’s purposes, they might be necessary as well.
Finally, the large number of algorithmic parameters for the GNG network deserves some
mention with reference to SODA's stated goal of reducing the need for human prior knowledge in
the agent's learning process (Section 1.3). As with most learning algorithms, the GNG has a set of
parameter “knobs” controlling learning that must be set properly for the agent to learn. To some
extent, these settings come from human experimentation with the algorithm in the target domain.
Reducing the need for such experimental knob twisting is a major direction for future research,
discussed in more detail in Section 7.4. With respect to the use of the GNG for feature learning,
however, these results do provide some reason for optimism. First, although human experimentation
was required, a painstaking, exhaustive search was not needed to discover suitable learning parameters. Rather, the parameters used in this chapter were discovered by means of a relatively short
“educated walk” through the parameter space. Second, the GNG’s learning performance seems to
be robust against small changes in learning parameters. Once a suitable region of the parameter
space has been found, small tweaks in parameter values do not seem to induce major qualitative
changes in the algorithm’s behavior. Lastly, it was not necessary to find a new set of GNG parameters when the agent was moved from the T-Maze to ACES, despite ACES’ greater perceptual
richness – although this transfer says nothing about what would be necessary for a robot with a
different sensorimotor configuration or a vastly different environment. These observations suggest
that it might be possible to find a useful set of default parameter values, and a simple set of rules for
searching for the correct parameter set, either automatically or by hand.
In conclusion, this chapter has presented the two experimental environments used throughout this dissertation, and demonstrated SODA’s ability to learn a set of perceptual features through
unsupervised interaction in each environment. The next two chapters will show that the learned features are useful both for learning local control in SODA’s trajectory-following and hill-climbing actions, and for reducing the task-diameter of large-scale navigation tasks, thus improving the agent’s
ability to learn to navigate from place to place.
Chapter 5
Learning High-Level Actions
The previous chapter described how the agent learns a set of perceptual features that it can use
for navigation. Once it has done so, its next task is to learn a set of trajectory-following and hillclimbing actions that can be combined to form high-level A1 actions. The formal structure of these
actions is described in Chapter 3. This chapter examines how such actions are learned and shows
that learning the actions as nested reinforcement learning problems (options) produces more reliable
and efficient actions than alternative hard-coded means of constructing them, without requiring prior
knowledge of the agent’s dynamics.
To separate the problem of learning the high-level actions from that of learning large-scale navigation behavior (covered in Chapter 6), this chapter presents experiments testing SODA's
trajectory-following (TF) and hill-climbing (HC) actions in isolation from any larger navigation
problem. The trajectory-following experiment in Section 5.1 compares the two TF methods described in Section 3.3: learned TF options and ballistic, open-loop TF macros. This experiment
shows that the learned options are more reliable and provide longer trajectories than the open-loop
macros. The experiment in Section 5.2 compares the three HC methods discussed in Section 3.4:
hill-climbing by manually sampling to approximate the feature gradient, hill-climbing by approximating the gradient using a hand-coded action model, and learning hill-climbing options using
reinforcement learning. The experiment shows that the learned HC options are significantly more
efficient than sampling, and perform comparably to the hand-coded action model, while not requiring prior knowledge of the dynamics of the robot and its environment.
5.1 Trajectory Following
The first step of a high-level action is trajectory following. In this task the SODA agent moves the
robot out of its current perceptual neighborhood and into another. It does so by making progress
along some axis of its abstract motor interface, while simultaneously trying to maintain the activation of the current SOM winner at as high a level as possible. One example of trajectory following
is following a corridor to its end. Assuming the robot is facing nearly straight down a corridor,
its current winning SOM feature should represent a view down a corridor, similar to Feature 8 in
Figure 4.3. To trajectory-follow down the corridor the agent would move the robot forward while
making adjustments left and right to keep the activation of the “facing down the hall” feature as high
as possible for as long as possible, stopping when some other perceptual feature became the winner.
The purpose of TF actions is to give SODA’s A1 actions spatial extent, reducing the effective
diameter of the high-level navigation task. It is desirable for TF actions to follow a given trajectory
for as long as possible, thus encapsulating many primitive actions in a single abstract action, and
requiring fewer abstract actions to reach distant goals. Given this purpose, a natural question to ask
is whether the simpler open-loop TF macros (described in Section 3.3) would be just as effective in
decreasing task diameter. Open-loop TF macros merely repeat the progress (e.g. forward) action
until the perceptual neighborhood changes, without making any corrective actions to keep the current feature maximized. The experiment below shows that the learned, closed-loop TF options not
only increase the average trajectory length compared with open-loop macros, but also more reliably
terminate near the same state in the environment.
The trajectory-following experiment in this section compared the reliability of open-loop
TF macros and learned TF options. The experiment tested the agent’s ability to learn to follow
a trajectory forward down each of the three corridors of the environment. The option’s action
set consisted of the progress action (moving straight forward), [1, 0]T , and two corrective actions
(moving forward while turning left or right) [1, 0.1]T and [1, −0.1]T . The option was trained for
2000 episodes from each starting point, although in each case the behavior converged within 100–400 episodes. The Sarsa(λ) parameters were λ = 0.9, α = 0.1, γ = 1.0; the option used ε-greedy
action selection with ε_0 = 1.0, annealed down to ε_∞ = 0.001 with a half-life of 400 steps. All Q
values were initialized to 0, and the agent used the Top_3(y) state representation (Section 3.2). Runs
using the Top_4 and Top_5 representations were also performed. Adding more winners to the representation caused
the behavior to converge more slowly, but made no significant difference in the converged behavior.
Figure 5.1 shows the ending points of the last 100 runs from each starting point, compared with the
ending points for 100 runs using the open-loop TF macro. The bottom panel of Figure 5.2 shows the
average inter-point distance between endpoints for those 100 runs in each condition. For each of the
three starting points, the endpoints are more tightly clustered when using the learned option. This is
because there are many fewer episodes where the trajectory terminates part of the way down the hall
due to motor noise pushing the robot off of its trajectory and into a new perceptual neighborhood.
As a result, the learned option produces longer trajectories more reliably, and the endpoints tend
to be near one another. Figure 5.2 shows the average lengths (in steps) of the last 100 runs of the
learned TF option, compared with 100 runs from the open-loop macro. Table 5.1 shows the precise
values of the averages and standard deviations. In all three cases the average trajectory length from
the learned options is significantly longer (p < 6 × 10^−5), and from two of the three starting points
it is dramatically longer.
Figure 5.1: Trajectory Following Improvement with Learning These figures show the results of
100 trajectory following runs from each of three locations (marked with large black disks). The end
point of each run is marked with a ’+’. The top figure shows the results of open-loop TF, and the
bottom figure shows the results of TF learned using reinforcement learning. The endpoints of the TF
trajectories learned using RL are much more tightly clustered, indicating much more reliable travel.
[Figure 5.2: bar charts of TF Trajectory Length (steps) and TF Inter-Point Distance (mm), comparing Open-Loop and Learned TF at the Top Left, Top Right, and Lower starting positions.]
Figure 5.2: The average trajectory length and endpoint spread for open loop vs. learned
trajectory-following for the last 100 episodes in each condition. Learned TF options produce
longer trajectories. Error bars indicate ± one standard error. All differences are significant at
p < 6 × 10^−5.
Starting point    Open-loop mean (σ)    Learned mean (σ)
Top left          56 (34.3)             114 (20.1)
Top right         26 (32.3)             77 (29.4)
Lower             86 (12.8)             90 (9.6)
Table 5.1: The average trajectory length for open loop vs. learned trajectory-following.
Learned TF options produce longer trajectories.
These experiments show that the learned TF options are more reliable and produce longer
trajectories than the naïve alternative of open-loop macros. However, even with this improvement,
there is still a fairly large variation in the results, especially for the case in the upper-right corridor of
the environment. Although the average inter-endpoint distance shown in Figure 5.2 for the learned
TF option in the upper right corner is very large, the distribution of endpoints in Figure 5.1 shows
that nearly all of the endpoints are clustered into three clusters, one of which is far from the other two
and very near the starting point, resulting in very short trajectories. Thus, despite the large standard
deviation in trajectory length, the outcome of the learned action is still considerably more reliable
relative to the open-loop case, in which the endpoints are distributed widely along the mid-line of
the upper right hallway. In addition, the cluster of endpoints so near the starting point in the upperright hallway illustrates just how narrow the trajectories are that the agent must learn to follow. The
TF option is defined to terminate as soon as the SOM winner changes (Equation (3.5)). Because
the learned feature set contains many similar representations of corridors at different relative angles
(as described in Section 4.4), only a small deviation in heading from the mid-line of the corridor is
required for the TF option to terminate. It is likely that some of this variation could be eliminated by
allowing the TF option to stray temporarily from its designated perceptual neighborhood, as long as
it returns within a short window of time. Such a non-Markov termination function would be similar
in some respects to the termination function for the hill-climbing options in Equation (3.12) that
terminate if no progress is made over a short window of time. Such a change would likely improve
the average trajectory length somewhat as well as reducing both the variance in trajectory length
and the uncertainty in the outcome state. Such a change would come at the cost of adding another
free parameter to the algorithm (the size of the time window for termination). It is unclear, though,
how much impact such a change would have on the overall navigation performance, described in
Chapter 6.
5.2 Hill Climbing
Once the SODA agent has completed trajectory following, the second part of an A1 action is hill-climbing, in which the agent attempts to reduce its positional uncertainty by moving the robot to a
local maximum of the activation of the current winning SOM unit. Section 3.4 described three HC
methods. The first, hill-climbing by sampling, was shown to be somewhat effective in preliminary
experiments (Provost, Kuipers, & Miikkulainen, 2006). It is inefficient, however, using 2|A0 | − 2
sampling steps for each step “up the hill.” The second method eliminates these sampling steps by
using a hand-engineered model to predict the perceptual outcome of actions, at the cost of considerable prior knowledge of the robot and environment required of a human engineer. The third method
eliminates the need for manual sampling and human prior knowledge of the robot’s dynamics by
using reinforcement learning methods to learn a hill-climbing policy for each perceptual feature in
the agent’s feature set. The experiment in this section compares these three methods, showing that
all three are able to achieve similar feature activations, but the action models and the learned options
are considerably more efficient at doing so, with the learned options in particular performing comparably with the other two methods while requiring neither sampling nor prior knowledge. Below
is a detailed description of the hand-coded action model used in the experiment, followed by the
experiment description, results, and discussion.
5.2.1 Hand-coded Predictive Action Model
With detailed knowledge of the robot’s sensorimotor configuration, actions, and environment dynamics, it may be possible for a human engineer to create a predictive action model that will allow
the SODA agent to hill-climb by predicting the feature changes, as described in Section 3.4.1. Such
a model can be constructed for the robot and T-Maze environment described in Section 4.1.
This model consists of four affine functions, one for each primitive action in A0 , that give the
difference between the current sensor input yt and the estimated input on the next time-step ŷt+1 :
\hat{y}_{t+1} - y_t = A_i y_t + b_i.    (5.1)
The four primitive actions for the robot in the T-Maze correspond to steps forward, backward, left,
and right (Table 4.1). The linear components of the four predictive functions will be referred to
as Aforward , Abackward , Aleft , and Aright , and the translation components will be called bforward ,
bbackward , bleft , and bright .
The functions for the turning actions are relatively simple. Since they turn the robot approximately 2◦ , and the laser range-finder samples radially at 1◦ increments, the turn action function
should return the difference between the current input and the same input shifted by two places, i.e.,
this 180 × 180 matrix:

A_{left} = \begin{pmatrix}
-1 &  0 &  1 &  0 & \cdots &  0 & 0 & 0 \\
 0 & -1 &  0 &  1 & \cdots &  0 & 0 & 0 \\
 0 &  0 & -1 &  0 & \cdots &  0 & 0 & 0 \\
 0 &  0 &  0 & -1 & \cdots &  0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\
 0 &  0 &  0 &  0 & \cdots & -1 & 0 & 1 \\
 0 &  0 &  0 &  0 & \cdots &  0 & 0 & 0 \\
 0 &  0 &  0 &  0 & \cdots &  0 & 0 & 0
\end{pmatrix}    (5.2)
Because it is not possible to know what will be “shifted in” on the left side of the range-finder, the
function predicts no change in the last two positions. (The laser range-finder scans are numbered
counterclockwise, i.e. right-to-left, around the robot.) The right turn matrix Aright is constructed
analogously. The translation component of the turn models is zero:
b_{left} = b_{right} = 0.    (5.3)
The functions for moving forward and backward are more complicated, since translating
the robot produces a complicated change in the radially organized range-finder image. The intuition
behind them comes from approximating how the input behaves at three key scans – far left, straight
ahead, and far right – and interpolating the effects in between. When the robot executes a forward
action, the forward facing scan (scan 90) is reduced by a constant amount, while the leftmost scan
(scan 179) behaves roughly as if the robot has rotated to the right, and the rightmost scan (scan 0)
behaves as if the robot has rotated to the left. The scans in between can be approximated with a
combination of a constant change and a shift, with the proportion of each depending on the scan’s
relative angle — scans facing more forward get more of a constant change, scans facing more
sideways get more of a shift.
The functions for moving forward and backward can be constructed using three blending
matrices, to interpolate between the right, left, and center parts of the function. The sin² and cos²
functions provide an ideal means for constructing these matrices since sin²x + cos²x = 1 and both
functions have periods of 180◦ , the same as the extent of the arc of the laser range-finder. The three
blending matrices are
C_1 = \mathrm{Diag}_{180}\left( \cos^2(0°), \cos^2(1°), \ldots, \cos^2(89°), 0, \ldots, 0 \right),    (5.4)

C_2 = \mathrm{Diag}_{180}\left( 0, \ldots, 0, \cos^2(90°), \cos^2(91°), \ldots, \cos^2(179°) \right),    (5.5)

and

S = \left( \sin^2(0°), \sin^2(1°), \ldots, \sin^2(179°) \right)^T.    (5.6)
The matrix C1 is used for weighting the rotation component on the right side of the robot: it gives
a weight of 1.0 to the rightmost part of the laser range-finder, falls to zero at the center, and remains
zero over the left half of the scan. Likewise, C2 weights the rotation on the left side: it gives a weight
of 1.0 to the leftmost part of the laser range-finder and falls to zero at the center. S weights the
translation component: it gives a weight of 1.0 to the center and falls to zero at both sides.

Figure 5.3: Test Hill-Climbing Features. The hill-climbing experiments tested SODA's ability to
hill-climb on these five features from the GNG in Figure 4.2 in the T-Maze environment.
The forward and backward motion models are created using the above blending functions. The forward model,

A_{forward} = C_1 A_{right} + C_2 A_{left}, \qquad b_{forward} = -25S,    (5.7)

combines a left turn on the right, a right turn on the left, and a 25 mm reduction of the centermost
range value. The backward motion model,

A_{backward} = C_1 A_{left} + C_2 A_{right}, \qquad b_{backward} = 25S,    (5.8)

combines a left turn on the left, a right turn on the right, and a 25 mm increase of the centermost
range value.
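The matrices of Equations (5.2)–(5.8) are straightforward to build. The sketch below constructs A_left, an analogous A_right, the blending matrices, and the forward model, and then uses them to predict a feature-action delta as in Equation (3.11). It is a transcription of the formulas for illustration only; in particular, the construction of A_right (zero in its first two rows) is an inference from "constructed analogously."

    import numpy as np

    N = 180  # number of laser range readings

    # Eq. (5.2): a left turn shifts each reading two places; last two rows are zero.
    A_left = np.zeros((N, N))
    for k in range(N - 2):
        A_left[k, k], A_left[k, k + 2] = -1.0, 1.0
    # Analogous right-turn matrix (assumed: first two rows left unchanged).
    A_right = np.zeros((N, N))
    for k in range(2, N):
        A_right[k, k], A_right[k, k - 2] = -1.0, 1.0

    # Eqs. (5.4)-(5.6): blending matrices and translation weights.
    deg = np.arange(N)
    C1 = np.diag(np.where(deg < 90, np.cos(np.radians(deg)) ** 2, 0.0))
    C2 = np.diag(np.where(deg >= 90, np.cos(np.radians(deg)) ** 2, 0.0))
    S = np.sin(np.radians(deg)) ** 2

    # Eq. (5.7): forward model; the backward model swaps the turn matrices.
    A_forward = C1 @ A_right + C2 @ A_left
    b_forward = -25.0 * S

    def predicted_forward_delta(f, y):
        """Eqs. (5.1) and (3.11): estimated change in feature f for a forward step."""
        y_hat = y + A_forward @ y + b_forward
        return f(y_hat) - f(y)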
As shown in the next section, a hill-climbing controller using these definitions to predict the
feature changes does as well as the method that samples manually, while using many fewer actions.
This model, however, requires considerable prior knowledge of the robot, including the size and
direction of its primitive actions, and the type and physical configuration of the sensors. The next
section will show that learned HC options can achieve similar performance without requiring that
prior knowledge.
5.2.2 Hill-Climbing Experiment
The hill-climbing experiment compared the speed and effectiveness of the learned HC options
against HC by gradient approximation, using either sampling or the hand-coded models described above. The
experiment tested the agent’s ability to hill-climb in the neighborhood of each of five specific features, shown in Figure 5.3, taken from the GNG in Figure 4.2. The features were chosen to sample a
[Figure 5.4: five learning-curve panels, "Hillclimbing Outcome" for features 8, 62, 18, 65, and 35, plotting final activation against episode; each panel compares learned HC with 3, 4, and 5 winners against the sampling average and its standard deviation.]
Figure 5.4: Hill-climbing Learning Curves. These curves compare learned hill-climbing using
the Top_3, Top_4, and Top_5 state representations against hill-climbing by sampling to approximate
the feature gradient. Each plot compares hill-climbers on one of the five different
SOM features in Figure 5.3. The Y axis indicates the final feature activation achieved in each
episode. The thick straight line indicates the mean performance of sampling-based HC, and the
two thin straight lines indicate its standard deviation. This figure shows that Learned HC does
about as well as an HC controller that manually approximates the feature gradient at each point, and
sometimes better.
[Figure 5.5: bar charts of Hill-Climbing Length (steps) and Hill-Climbing Final Activation for features 8, 18, 35, 62, and 65, comparing Learned, Models, and Sampling HC.]
Figure 5.5: Hill-climbing performance with and without learned options. Using learned options makes hill-climbing achieve the same feature values faster. Top: The average lengths of
hill-climbing episodes in the neighborhoods of the five different features shown in Figure 5.3. All
differences are significant (p < 2 × 10^−6). The bottom chart shows the average maximum feature
value achieved for each prototype per episode. The plots compare the last 100 HC episodes for each
feature with 100 hard-coded HC runs. Differences are significant between learned options and the
other two methods (p < 0.03) for all features except 65, which has no significant differences. Across
all features, the maximum values achieved are comparable, but the numbers of actions needed to
achieve them are much smaller.
wide variety of different perceptual situations for the robot in the T-Maze, including views of corridors, walls, intersections, and dead ends. For each feature, the robot was repeatedly placed at 2000
randomly selected poses in the perceptual neighborhood of the given feature, and the hill-climbing
option (or macro) was initiated from that point. The option’s Sarsa(λ) parameters were λ = 0.9,
α = 0.1, γ = 0.997. The agent did not use ε-greedy action selection; rather, all Q-values were
initialized optimistically to 1.0 to encourage exploration. HC options do not use ε-greedy exploration while TF options do, because hill-climbing is essentially a type of "shortest path" problem. It
has a fixed upper bound on the reward achievable from any state, making it suitable for exploration
by optimistic initialization. Trajectory-following, on the other hand, is a kind of “longest path”
problem in which the agent must try to continue acting (within constraints) for as long as possible,
with no obvious maximum value for an action, thus requiring another form of exploration policy.
The ε-greedy method is standard in the reinforcement learning literature. The HC option parameters
(Section 3.4) were c_stop = 0.005, c_w = 10, c_R1 = 10, c_R2 = 0.001. Runs were performed using
the Top_3, Top_4, and Top_5 state representations (Section 3.2).
Figure 5.4 presents learning curves comparing the final feature activations achieved by the
learned HC options with those achieved by the method of manually sampling to approximate the
feature gradient. It shows that for all the features, the learned HC can achieve feature activations
near (or better than) the manual sampling method within a few hundred episodes of training.
Figure 5.5 shows bar plots of the average episode lengths and final activations of the last
100 episodes for the learned HC options (using the Top4 state representation), and the other two
methods. It shows that all three methods achieve comparable final activations, while the learned
options and the hand-coded models use many fewer steps to climb the hill, because the sampling
method performs many extra steps for each step “up the hill.” From this result one can conclude that
the learned options make hill-climbing as efficient as a controller using hand-coded motion model,
without the need for the detailed prior knowledge of the robot’s configuration and dynamics that is
embedded in the model.
5.2.3
HC Learning Discussion
The HC experiment shows that, after training, the learned HC options generally hill-climb as well as
methods that climb the gradient directly, but the learned options do not require expensive sampling
actions, or extensive prior knowledge of the robot’s sensorimotor system in order to approximate
the direction of the gradient. However, there are some aspects of the results that deserve further
discussion.
First, the results have large variances in the final activation for all five features and all three
HC methods. This variance is largely a result of the experimental methodology: The trials for a
particular feature are started at random locations in the environment in the perceptual neighborhood
of that feature. The features may apply in multiple, distinct locations in the environment, and the
features are approximations of the actual perceptual image at the particular location. Thus, the lo61
cal maximum near one starting point may be substantially different from the local maximum near
another starting point for the same feature. The artificial variation in starting locations in this experiment is likely to be much greater than the actual variation in starting locations when running SODA
in practice, since SODA does not begin hill-climbing at uniformly distributed random locations in
the environment, but at the endpoints of trajectory-following control laws. The TF experiments in
Section 5.1 show that once the TF control laws have been learned, the endpoints of trajectories (and
the starting points for hill-climbing) are fairly tightly clustered.
A second point to note is that in Figure 5.4 the learned HC options for two features, 18 and
62, do not achieve quite as high a final activation as the sampling method. Interestingly, these two
features both represent views that might be seen at the end of a corridor (Figure 5.3). Figure 5.5
also shows that an HC episode for these two features using sampling is much longer than for any
other feature or method. In fact, for these two features the sampling HC episodes are much longer
than could be accounted for by the extra 2|A0 | − 2 = 6 sampling actions needed for each “up-hill”
step. The extra-long HC episodes and slightly higher activations for sampling seem result from the
combination of the nature of these two particular features and the particular stopping conditions
associated with the different HC methods. Features 18 and 62 are distinguished from the other
features by the fact that all of their individual range values fall relatively near one another. (That
fact that is more visible in the right hand column of Figure 4.2: compare variance in range values
the row containing features 18 and 62 to the rows above and below it.) Because all the ranges
in these features are relatively near one another, turn actions in these perceptual neighborhoods
cause much less of a change in feature activation than it would for say, Feature 8. In addition, the
sampling HC method only terminates when all sampled actions indicate a negative feature change,
while the learned HC options and the model-based HC both terminate if the average change in
feature activation over a short time window is near zero. As a result, when the robot gets into
an area where the activation change for all actions is small enough to be lost in the motor noise,
the learned HC options will stop relatively quickly, while the sampling HC method will “bounce
around,” continuing to collect noisy samples until the samples for all actions show a negative feature
change. Until it terminates, the sampling method chooses to execute the action that showed the
highest positive feature change, however slight, thus moving the robot stochastically up even a tiny
feature gradient, should one exist. As a result, for features like 18 and 62, the sampling method is
able to achieve slightly higher activations, at the cost of many more actions. This is not a reasonable
trade-off given the need to navigate efficiently from one location to another.
Although the learned HC options do not achieve the highest activations for any of the test
features, they are also not dominated by either of the other methods. The learned options achieve
comparable final activations to the other two methods very efficiently, without needing a handengineered action model.
62
5.3 Action Learning Conclusion
In conclusion, this chapter described experiments examining trajectory-following and hill-climbing
actions in isolation from any high-level navigation task. The TF experiments showed that learning
trajectory-following actions as options using reinforcement learning produces longer, more reliable
trajectories than the obvious naı̈ve alternative of simple repeat-action macros. The HC experiments
showed that learning HC actions as options produces actions that perform as well as either HC
by sampling or HC using predictive models; yet learned HC options use many fewer primitive
actions than the sampling method, and do not require the extensive prior knowledge needed for the
predictive models. The next chapter shows how the learned TF and HC actions, combined together,
improve learning in large-scale navigation tasks, over performing the same tasks using primitive
actions.
63
Chapter 6
Learning High-Diameter Navigation
Once a SODA agent has learned a set of perceptual features and a set of high-level trajectoryfollowing and hill-climbing actions based on those features (as described in Chapters 4 and 5), it can
begin to use these new features and actions to navigate between distant locations in the environment.
Navigating in the abstracted space defined by the learned A1 actions, the agent reduces its task
diameter dramatically. This chapter describes several experiments in which agents learn navigation
tasks using reinforcement learning over SODA actions in the environments described in Chapter 4.
The first set of experiments, run in the T-Maze, show that agents can learn to navigate much more
quickly using SODA actions than using primitive actions, using as few as 10 A1 actions to complete
tasks requiring hundreds of A0 . In addition, these experiments investigate the benefit provided
by the hill-climbing step in the SODA actions, showing that hill-climbing produces more reliable
actions, and produces solutions requiring fewer abstract actions. The second set of experiments, run
in the ACES environment, show that SODA can scale up from the T-Maze to a realistic, buildingsized environment, and that the benefit provided by the SODA actions is even greater in the larger
environment.
6.1 Learning in the T-Maze
The first set of navigation experiments was conducted in the T-Maze, described in Section 4.2. For
these experiments, three navigation targets were defined in the upper-left, upper-right, and bottom
extremities of the environment, as shown in Figure 6.1. Each target was defined by a point in the
environment. The robot was judged to have reached the target if its centroid passed within 500
mm of the target point. The agent was tested on its ability to learn to navigate between each pair
of targets using Sarsa(λ) over A0 actions and over A1 actions, to test whether the agents learn to
navigate more effectively using SODA.
As explained in Section 1.3, SODA is not intended to address problems of partial observability (also known as perceptual aliasing), in which multiple distinct states in the environment have
64
Figure 6.1: T-Maze Navigation Targets. The red circles indicate the locations used as starting and
ending points for navigation in the T-Maze in this chapter. In the text, these points are referred to as
top left, top right, and bottom. Navigation between these targets is a high-diameter task: they are all
separated from each other by hundreds of primitive actions, and from each target the other two are
outside the sensory horizon of the robot.
the same perceptual representation. However, in some cases the SODA abstraction induces perceptual aliasing in the environment. For example, in the T-Maze, the same GNG unit is active when the
robot is in the center of the intersection facing southeast or facing southwest (unit 4 in the first row of
Figure 4.2). Resolving such ambiguity is outside the scope of this dissertation. Therefore, to avoid
resorting to complicated representations and algorithms for partially observable environments, the
agents were given more sensory information to help reduce aliasing. The two new sensors were a
stall warning to indicate collisions, and an eight-point compass. Using these sensors, the learning
agent’s state representation was the tuple hstall,compass,(argmaxj fj ∈ F)i.
As with the Top-N representation (Section 3.2), the state tuple was hashed into an index
65
into the Q-table. The reward function for the task gave a reward of 0 for reaching the goal, −1 for
taking a step, −6 for stalling. Each episode timed out after 10,000 steps. When the robot reached
the goal or timed out, the robot was automatically returned to the starting point. For each of the
six pairs of starting and ending points, there were 18 total trials of 1000 episodes each: three trials
using each of three different trained GNG networks using A0 actions and A1 actions. The Sarsa(λ)
parameters were λ = 0.9, α = 0.1, γ = 1.0. All Q values were initialized optimistically to 0. The
agent also used ǫ-greedy action selection with ǫ0 = 0.1 annealing to ǫ∞ = 0.01 with a half-life of
100,000 (primitive) steps. The agents using A1 actions all learned their TF and HC options using
the parameters described in Chapter 5. For each start-end pair, the experiment began with all options
untrained, and option policy learning proceeded concurrently with high-level policy learning.
Figure 6.2 shows the learning curves comparing the performance of the agents using A0
actions to the agents using A1 actions. For every pair of start and end points, the agents using A1
actions learn dramatically faster than the agents using A0 actions. In each case, the behavior of
the agents using the high-level actions has converged easily within 1000 episodes (and often much
sooner) while in no case did the behavior of the agents using primitive actions converge within that
period. These results show that SODA significantly speeds up navigation learning.
Figure 6.3 shows a representative trace of learned behavior from one trained agent, traveling
from the top-left location to the bottom location. This figure shows the starting points of the A1
actions as triangles, and the path of the agent in underlying A0 actions. The agent has abstracted a
task requiring a minimum of around 300 primitive actions to a sequence of ten high-level actions.
Figure 6.4 shows the perceptual features that define the distinctive states for the starting points
of the actions in Figure 6.3. The trace and feature plots show that the agent began by traveling
forward in a single A1 action from the starting point to the intersection, trajectory-following on a
feature representing a view straight ahead down a long corridor. After reaching the intersection,
the agent then used seven shorter actions to progress through the intersection while turning to face
the lower corridor. Each of these seven distinctive states is defined by a feature resembling a view
of an intersection turned progressively more at each step. Finally, the agent progresses down the
lower corridor in two long actions. The lower corridor is about half the length of the upper corridor,
and as the robot progresses down the corridor, the laser rangefinder view of the corridor grows
progressively shorter. As a result, the view of the corridor is represented by two features (30 and
62) with shorter forward ranges than the feature used to represent the upper corridor (feature 8).
Although, to the human eye, these final two features bear a less obvious resemblance to a corridor
than the other features, they are still sufficient to allow the agent to navigate to the goal.
6.2 The Role of Hill-Climbing
The previous section showed that SODA agents learn to navigate faster than agents using primitive
actions, and that SODA reduces the diameter of navigation tasks in the T-maze by an order of
66
Top Left to Bottom
10000
Primitive Actions
SODA Actions
10000
8000
8000
6000
6000
Steps
Steps
Top Left to Top Right
Primitive Actions
SODA Actions
4000
4000
2000
2000
0
0
0
100
200
300
400
500 600
Episode
700
800
900 1000
0
100
200
300
Bottom to Top Left
8000
8000
6000
6000
Steps
Steps
700
800
900 1000
Primitive Actions
SODA Actions
10000
4000
4000
2000
2000
0
0
0
100
200
300
400
500 600
Episode
700
800
900 1000
0
100
200
Top Right to Bottom
300
400
500 600
Episode
700
800
900 1000
Top Right to Top Left
Primitive Actions
SODA Actions
10000
Primitive Actions
SODA Actions
10000
8000
8000
6000
6000
Steps
Steps
500 600
Episode
Bottom to Top Right
Primitive Actions
SODA Actions
10000
400
4000
4000
2000
2000
0
0
0
100
200
300
400
500 600
Episode
700
800
900 1000
0
100
200
300
400
500 600
Episode
700
800
900 1000
Figure 6.2: T-Maze Learning, all routes. These learning curves show the length per episode for
learning to navigate between each pair of targets shown in Figure 6.1. The curves compare learning
with A1 actions and learning with A0 actions. Each curve is the aggregate of three runs using
each of three trained GNGs. Error bars indicate +/- one standard error. In all cases the A1 agents
dramatically outperform the A0 agents.
67
Figure 6.3: Navigation using Learned Abstraction. An example episode after the agent has
learned the task using the A1 actions. The triangles indicate the state of the robot at the start of
each A1 action. The sequence of winning features corresponding to these states is [8, 39, 40, 14, 0,
4, 65, 7, 30, 62], shown in Figure 6.4. The narrow line indicates the sequence of A0 actions used by
the A1 actions. In two cases the A1 action essentially abstracts the concept, ‘drive down the hall to
the next decision point.’ Navigating to the goal requires only 10 A1 actions, instead of hundreds of
A0 actions. In other words, task diameter is vastly reduced.
magnitude. The question arises, however: How much of the benefit of SODA comes from the
trajectory-following and how much from hill-climbing? This section describes an ablation study
comparing navigation using SODA with and without hill-climbing. The study shows that while all
the speed-up in learning can be attributed to the TF component of the actions, the HC step makes
the actions more deterministic and cuts the task diameter roughly in half compared to navigating
using TF actions alone.
One of the stated goals of SODA, described in Section 1.4, is that it function as a building
block in a bootstrap-learning process for robotics. One likely next stage of bootstrap learning is to
move from the model-free navigation policies learned by Sarsa(λ) to navigation based on predictive
models and explicit planning. Such systems benefit greatly from more deterministic actions, which
have fewer possible outcomes, and thus reduce the branching factor of search for planning. They
also benefit from shorter task diameters, which reduce the depth of search. As shown in the study
68
Figure 6.4: Features for Distinctive States These are the perceptual features for the distinctive
states used in the navigation path shown in Figure 6.3, in the order they were traversed in the
solution. (Read left-to-right, top-to-bottom.) The first feature [8] and the last two [30, 62] represent
the distinctive states used to launch long actions down the hallways, while the intervening seven
features [39, 40, 14, 0, 4, 65, 7] show the robot progressively turning to the right to follow the lower
corridor. The large number of turn actions is caused by the large number of features in the GNG
used to represent views separated by small turns, discussed in Section 4.4 and Figure 4.6. Despite
the many turn actions, however, SODA still reduces the task diameter by an order of magnitude over
primitive actions.
in this section, the HC in A1 actions provides both of these benefits.
In the ablation study, the task of the agent was to navigate from the top-left location to the
bottom location in the T-Maze (Figure 6.1). The parameters of the experiment were identical to
those in the experiments in Section 6.1, with three exceptions: (1) In this case there were three
experimental conditions: A1 actions, TF actions only, and A0 actions. (2) Each condition consisted
of five independent learning agents for each of the ten different trained GNGs, to provide a larger
sample for computing statistical significance, giving a total of 3 × 5 × 10 = 150 total agents. (3)
Each agent ran 5000 episodes, instead of 1000, to ensure that the TF and A1 agents ran until policy
learning converged.
The learning curves from the experiment are shown in Figure 6.5. In the first 1000 episodes,
the
and A0 curves show the same basic shape as the learning curve for the same task above
(Figure 6.2, top left plot), with lower variation in the average as a result of the larger number of
agents run. Importantly, the TF-only curve is nearly identical to the A1 curve, showing that, in
terms of learning speed, all the benefit of SODA actions comes from the TF component of the
actions. In addition, the actions taken to perform hill-climbing exact a small cost. In the last 50
episodes, the average TF-only episode length is around 400 steps, while the average for TF+HC
agents is around 600 steps over the same period. These differences are significant (p < 0.0001).
A1
69
Episode Length, Top-Left-to-Bottom Task
10000
Primitive Actions
TF Only
SODA Actions
9000
8000
7000
Steps
6000
5000
4000
3000
2000
1000
0
0
500 1000 1500 2000 2500 3000 3500 4000 4500 5000
Episode
Figure 6.5: A1 vs TF-only Learning Performance, T-Maze These learning curves compare the
length per an episode in the T-Maze top-left-to-bottom navigation task using primitive actions (A0 ),
trajectory-following alone (TF), and trajectory following with hill-climbing (A1 ). Each curve is the
average of 50 runs, 5 each using 10 different learned SOMs. Error bars indicate +/- one standard
error. The agents using just TF actions, without the final HC step learn as fast and perform slightly
better than the agents using A1 actions. (TF vs A1 performance is significantly different, p <
0.0001, in the last 50 episodes.)
Although the hill-climbing component of A1 actions does not make learning faster, HC does
improve navigation by making the actions more deterministic, as measured by lower state transition
entropy. Given an environment’s state-transition function T (s, a, s′ ), indicating the probability of
ending in state s′ after taking action a in state s, the state transition entropy of a state-action pair
(s, a) is the entropy of the probability distribution over possible states s′ . Because the entropy of a
distribution is the number of bits needed to encode a selection from the distribution, the transition
entropy can be interpreted informally as the log2 of the number of possible outcomes weighted by
the likelihood of their occurrence. Thus an entropy of zero indicates exactly one possible outcome,
an entropy of 1 indicates roughly two possible outcomes, etc.
To estimate transition entropy in the T-maze, every hs, a, s′ i sequence the agent experi70
TF+HC vs TF-only Transition Entropy
1.8
GNG 0
GNG 1
GNG 2
GNG 3
GNG 4
GNG 5
GNG 6
GNG 7
GNG 8
GNG 9
1.6
TF-only Entropy
1.4
1.2
1
0.8
0.6
0.4
0.2
0
0
0.2
0.4
0.6 0.8 1 1.2
TF+HC Entropy
1.4
1.6
1.8
Figure 6.6: Hill-climbing Reduces Transition Entropy. These plots compare the average transition entropy for all state/action pairs for each of the 10 different GNGs used in the T-Maze experiment. The x-axis indicates the transition entropy (in bits) using hill-climbing, and the y-axis
indicates the entropy without hill-climbing. The solid line indicates equal entropy (y = x). Error
bars indicate 95% confidence intervals. Hill-climbing reduces the entropy by about 0.4 bits, on average. This is approximately equivalent to a reduction in branching factor from 3 to 2. These results
indicate that hill-climbing makes actions more deterministic, making them more useful for building
planning-based abstractions on top of SODA.
71
enced over 5000 episodes was counted and the frequencies were used to estimate the distribution
P (s′ |s, a). These estimates were then used to compute an entropy value for each (s, a) pair, and
the entropies for all pairs were averaged to compute an average transition entropy for the run. Figure 6.6 compares the average transition entropy of agents using hill-climbing with those using only
trajectory-following, for each GNG. This figure shows that transition entropies are an average of
0.4 bits lower for the agents using HC actions. This result corresponds to a decrease of about 30%
in the number of possible outcomes for each state-action pair.
In addition to making actions more reliable, using hill-climbing also reduces the task diameter, as measured by the total number of abstract actions needed to reach the goal. Figure 6.7 shows
the average length, in high-level actions, of successful episodes for agents using TF+HC actions
versus agents using TF without HC. The bars show the average number of actions taken in the successful episodes taken from last 100 episodes for all TF+HC agents and all TF-only agents. (In 15
of the 100 × 5 × 2 = 10000 episodes, the agents did not reach the goal before the episode timed
out. These outliers were discarded because they are not representative of the learned task diameter.)
The agents using HC actions are able to complete the task using an average of 23 actions while the
agents using TF-only require an average of 67, a 62% decrease in task diameter.
6.3 Scaling Up The Environment: ACES 4
The third experiment in this section investigates how SODA scales to the larger, richer ACES environment. As mentioned in Section 4.3, the ACES environment not only provides much longer task
diameters at 40 m × 40 m, but it is far richer with a wide variety of different intersection types,
different corridor widths, irregular wall features, and a circular atrium. In addition, the outer corridors of the environment are long and narrow, making it very unlikely for a robot with realistic
motor noise robot to traverse them with open-loop trajectory-following macros, and thus requiring
closed-loop TF.
The task in the ACES environment was to navigate from the center-right intersection to the
lower-left corner of the environment, as shown in Figure 6.8. The experimental set-up and agent
parameters were identical to those in the T-Maze experiments, with these exceptions: First, the
general set-up of the environment and the robot was the same as described in Section 4.3. Second, the time-out for each episode was increased to 30,000 time-steps. Third, because the robot’s
rangefinder has a much shorter range than than the length of a typical corridor, many corridors and
intersections are perceptually indistinguishable. To deal with this problem the agent was given extra
state variables consisting of the eight-point compass and stall sensor used in the T-Maze experiment, plus a coarse tiling of the robot’s (x, y) position in the environment, in which the x and y
positions were tiled into ten-meter bins. This binning was chosen because it is sufficient to give
each intersection a unique sensory signature, and can be interpreted as similar to an environment in
which different areas of the building are painted a unique color. The agent was given the state tuple
72
Abstract Actions per Episode
(last 100 episodes)
80
70
60
50
40
30
20
10
0
TF+HC
TF Only
Figure 6.7: Hill-climbing Improves Task Diameter. Bars show the average abstract length per
episode of the successful episodes taken from the last 100 episodes for each agent in the T-Maze
experiment. Abstract length is measured in the number of high-level actions needed to complete the
task. Using trajectory-following with hill-climbing, the agents require an average of 23 actions to
complete the task, while without hill-climbing they require an average of 67. (Error bars indicate
+/- one standard error. Differences are statistically significant with infinitesimal p.) This result
indicates agents using hill-climbing the hill-climbing component of A1 actions will be make them
more useful for building future, planning-based abstractions on top of the SODA abstraction.
hstall,compass,x-bin,y-bin,(arg maxj fj ∈ F)i, which was hashed into the agent’s Q-table.
Figure 6.9 shows the learning curves from the ACES environment. Each curve averages the
performance of 50 agents — 5 runs using each of 10 trained GNGs — for agents using SODA’s
TF+HC options and agents using primitive actions only. The curves show that the SODA agents are
able to learn to solve the navigation task while the agents using primitive actions are not able learn
it within the alloted time-out period.
73
Figure 6.8: ACES Navigation Task. The circles indicate the locations used as starting and ending
points for navigation in ACES in this chapter. The green circle on the right indicates the starting
point and the red circle on the left indicates the ending point. The shaded area shows the robot’s
field of view. The longer task and added complexity of the environment make this task much more
difficult than the tasks in the T-maze.
6.4 Navigation Discussion
The results in this chapter show that SODA allows an agent to learn to navigate dramatically faster
than it can using primitive actions. However, there are two major points from these results worth
further discussion: (1) SODA does not learn clean, high-level “turn” actions, and (2) the learning
times in the larger environment are relatively long, compared to those in the T-maze.
First, Figure 6.3 shows that although SODA learns to abstract “hallway following” into a few
large actions, it does not cleanly learn to turn to a new corridor as a single large action. As discussed
in Section 4.4, the Euclidean distance metric causes SODA to learn many perceptual features when
turning, resulting in many closely spaced distinctive states separated by small turn actions. Using
the A1 actions, the agent must traverse sequences of these states in order to turn 90◦ . Although
SODA performs well despite this problem, an effective means of learning large turn actions could
potentially reduce the task diameter by another order of magnitude. The ideal abstract path to the
74
Length per Episode
35000
Primitive Actions
SODA Options
30000
Length
25000
20000
15000
10000
5000
0
0
500
1000
1500
Episode
2000
2500
3000
Figure 6.9: Learning Performance in the ACES Environment. These learning curves compare
the length per episode for learning to navigate in ACES from the center-right intersection to the
lower-left corner of the environment. The curves compare primitive actions and A1 actions. Each
curve is the average of 50 runs, five each using 10 different learned SOMs. Error bars indicate +/one standard error. The minimum length path is around 1200 actions. The agents using the highlevel actions learn to solve the task while those using only primitive actions have a very difficult
time learning to solve the task in the allotted time.
goal in the T-maze consists of three abstract actions: (1) drive to the intersection, (2) turn right, (3)
drive to the goal. Achieving such an ideal on every run is unlikely, given that actions may terminate
early because of motor noise, and that feature learning may not learn perfect features. However,
given that the majority of the actions in the trace in Figures 6.3 and 6.4 are used to turn, a better
method of learning turn actions could reduce the mean of 23 actions (Figure 6.7) to fewer than ten
actions. Section 7.2.2 in the next chapter discusses methods by which SODA could be extended to
more cleanly learn large turn actions, either by using alternative distance metrics that better handle
turning, or by aggregation of states into “places” and “paths.”
The second point for discussion is SODA’s performance in the ACES environment. Although SODA’s advantage over primitive actions in the larger environment is even greater than in
75
the T-Maze, the absolute learning performance is notably worse than that in the T-Maze. In the TMaze, performance converged in a few hundred episodes. In ACES, it took nearly the whole 3000
episodes. Much of this difference can be explained by the greater complexity of the environment.
The T-Maze has only a single intersection, while ACES has eight. In addition, the state space in
ACES is much larger: while the total number of states in the Q table of an agent in the T-Maze is
around 500-600, the number of states for an agent in ACES is around 4000-5000. Note that because of the hashing used in the Q-table, the number of states in the table represents the number
of states visited by the agent, not the total number of possible states in the environment. In other
words, the agents in ACES have to explore much more environment (visit more states) in order to
learn a solution. This substantial increase in complexity indicates that rapid learning in very large
environments may require bootstrapping to a higher level of abstraction, such as the aggregation of
states into places and paths as mentioned above and discussed in detail in Section 7.2.2.
6.5 Navigation Conclusion
To summarize, this chapter described experiments showing that SODA allows agents to learn to
navigate much more quickly than agents using primitive actions. In the T-Maze, SODA agents
out-performed agents using primitive actions in navigation between several locations in the environment. In addition, an ablation study showed that the hill-climbing phase of SODA’s actions
makes the actions more reliable than trajectory-following alone, and dramatically reduces the number of abstract actions needed to solve the task, at a small cost in the total number of primitive
actions needed to solve the task. These features will be attractive for bootstrapping from model-free
reinforcement learning methods to navigation using predictive models and planning. Finally, in the
navigation experiment in the larger, more complicated ACES environment, SODA’s advantage over
primitive actions was even more pronounced, although the long learning times in this environment
suggest that bootstrapping to a higher level of abstraction may be needed for learning to navigate in
practice.
76
Chapter 7
Discussion and Future Directions
The last chapter showed that robotic agents learn to navigate much faster using SODA than using
primitive actions, that the hill-climbing step of SODA’s A1 actions makes actions more reliable and
reduces the task diameter over using trajectory-following alone, and that SODA scales up to larger,
realistic environments. This chapter discusses some potential problems raised in the previous chapters and how they might be addressed in future research. First, Section 7.1 reviews the SODA
agents’ need for extra state information to disambiguate perceptually aliased states, and proposes
several replacements to Sarsa(λ) for learning navigation in partially observable environments. Next,
Section 7.2 discusses the large number of features, and hence distinctive states, created to represent
perceptual changes caused by turning the robot, and proposes how this problem may be dealt with
either by replacing Euclidean distance in the GNG with a different metric, or by bootstrapping up
to a higher A2 representation that aggregates distinctive states together into “places” and “paths.”
Section 7.3 considers how SODA may be scaled to even richer and more complicated sensorimotor
systems. Finally, Section 7.4 characterizes how well SODA’s goal of reducing the need for prior
knowledge in constructing an autonomous robot has been met, and the extent to which the knowledge embedded in SODA’s learning parameter settings might be further reduced.
7.1 SODA with Perceptual Aliasing
Under SODA’s abstraction the environments used in the previous chapters suffer from the problem
of perceptual aliasing, also called “partial observability”, in which more than one distinctive state is
represented by one GNG unit. Perceptual aliasing makes it difficult or impossible to navigate based
only on current perceptual input, and considerable research exists on methods for learning to act
in partially observable environments (Kaelbling, Littman, & Moore, 1996; Shani, 2004). SODA’s
design deliberately factors the problem of constructing a perceptual-motor abstraction away from
the problem of perceptual aliasing, with the intent that the abstract representation constructed by
SODA could be used as input to existing (or new) methods, thus bootstrapping up to a higher77
level of representation, as described in Section 1.4. As a place-holder for one of these methods,
the agents in Chapter 6 were given additional sensory information to disambiguate aliased states.
In the T-Maze the agents were given an eight-point compass, to differentiate aliased poses in the
intersection (e.g. facing southeast vs. facing southwest). In ACES, where the corridors are much
longer than the maximum range of the rangefinder, the agent was also given a coarse tiling of the
robot’s x/y position in the environment, approximately equivalent to using different wall or floor
colors in different parts of the building.
SODA’s abstraction provides a ready interface for adding methods designed for reasoning and acting in aliased environments. Formally, a SODA agent’s interaction with a perceptually
aliased environment forms a Partially Observable Markov Decision Process (POMDP). A POMDP
extends a standard Markov decision process (Section 2.1.2) consisting of a set of states S, a set of
actions A, and a transition function T : S × A × S → [0, 1] giving the probability that taking an
action in a given state will lead to some subsequent state. In a POMDP, the agent is unable to observe the current state s ∈ S directly. Rather, it observes an observation o from a set of observations
O where the probability of observing a particular observation o in state s is governed by an observation function Ω : S × O → [0, 1]. SODA’s distinctive states map on to the POMDP’s states S,
the high-level actions A1 provide the actions, GNG units are the observations O and the GNG itself
provides the observation function Ω. The remainder of this section discusses three major classes of
methods for learning to act in partially observable environments: model-free methods that extend
MDP-based reinforcement learning methods such as Sarsa(λ) and Q-learning to partially-observable
environments without explicitly constructing a POMDP model, model-based methods that induce
a POMDP from observations and then use it for planning and navigation, and newer model-based
methods that construct other forms of predictive models than explicit POMDP models, based on
experience.
The simplest method for bootstrapping SODA up to handling perceptual aliasing is simply to replace Sarsa(λ) for high-level navigation learning with another model-free reinforcement
learning method designed to deal with partial-observability. Two methods discussed in Chapter 2,
McCallum’s U-Tree algorithm (Section 2.3.1) and Ring’s CHILD (Section 2.2.5) have been successful in learning tasks in highly-aliased environments. Both of these methods track the uncertainty in
the value of each (o, a) pair, i.e the uncertainty in Q(o, a), and progressively build a memory m of
recent observations and actions, such that the uncertainty in Q(o, a, m) is minimized.
Another class of model-free learning methods, neuroevolution (NE), has had success on partially observable continuous control tasks (Gomez & Miikkulainen, 2002; Gomez, 2003). Neuroevolution methods use an evolutionary search in the space of policy functions expressed as neural
networks, without explicitly learning or storing a value function. Gomez, Schmidhuber, & Miikkulainen (2006) compared several neuroevolution methods against a variety of value-function-based
and policy-search-based RL methods on a series of pole-balancing tasks. The NE methods dramatically outperformed the other methods, although neither U-Tree nor CHILD was included in the
78
comparison. The pole-balancing tasks are low-dimensional, unlike the high dimensional input used
in the SODA experiments. Nevertheless, it is worth considering whether these methods could be
used to learn to navigate over SODA’s A1 actions. One potential problem with using neuroevolution is the construction of the fitness function. For episodes in which the robot reaches the goal, the
fitness should be inversely related to the number of steps the agent took to reach the goal. However,
for episodes that time-out before reaching the goal, it is not clear how to fashion a fitness function
without some external knowledge of the robot’s actual position in the environment. This would
nonetheless be an interesting direction for future research.
U-Tree, CHILD, and neuroevolution have all had success in learning to perform tasks in
partially-observable environments. However, like Sarsa(λ) these methods create an entirely new
learning problem for each navigation target in the environment, conflating general knowledge of
the dynamics of the environment and specific knowledge for a single task. For example, in the
set of six experiments described in Section 6.1, the agent had to learn the Q function, and hence
the environment dynamics, afresh for each experiment. This is very inefficient, given that all the
experiments are performed in the same environment, and the learned distinctive states, A1 actions,
and state-transition function are shared among all the tasks. An alternative is to learn a model of the
environment that allows the agent to predict the outcome of actions. This model can then be used
to solve each navigation task in the environment, either by learning a state value function for each
task (which is simpler than the state-action value functions learned by model-free methods), or by
explicit look-ahead planning.
The most obvious representation for a predictive environment model for SODA is the POMDP
representation itself. As mentioned above, the action set A and observation set O are already known,
so learning the POMDP would entail estimating the hidden state set S, transition function T , and
observation function Ω. For a given S, such a model can be learned from an observation/action
trace using a generalized expectation-maximization algorithm for learning hidden Markov models
known as the Baum-Welch algorithm (Duda, Hart, & Stork, 2001). Baum-Welch can also be used
to learn POMDPs (Chrisman, 1992). When the number of hidden states is unknown, Baum-Welch
can be applied repeatedly as the number of states is increased, to discover the number of states
that maximizes the likelihood of a second, independently collected, validation trace. In addition to
this iterative method, other search methods exist for learning graphical topologies in other domains,
including Bayesian networks (Segal et al., 2005; Teyssier & Koller, 2005), and neural networks
(Stanley, 2003). It is likely that one or more of these methods could be extended to POMDPs, especially in cases where the transition function, and hence the graph topology, is very sparse, as it is
with SODA—as shown in Figure 6.6, the average transition entropy in the T-Maze using A1 actions
is less than 1 bit.
In addition to the POMDP model, new state representations for partially observable dynamical systems have been developed recently. Chief among these are Predictive State Representations
(PSRs; Littman, Sutton, & Singh, 2002; Singh, James, & Rudary, 2004), and Temporal-Difference
79
Networks (TD-Nets; Sutton & Tanner, 2005). Rather than explicitly representing each hidden state
and the transitions between them, PSRs represent the agent’s state as predictions over a set of tests,
where a test is an action-conditioned sequence of future observations. Representationally, PSRs
have advantages over history-based method like U-Tree and CHILD, in that they can accurately
model some systems that cannot be modeled by any finite history method. In addition, PSRs are
often more compact than the equivalent POMDP representation for the same underlying dynamical system. Given an appropriate set of “core tests,” the parameters for updating a model can be
learned (Singh et al., 2003), and some progress has been made on discovering the core tests from
data (Wolfe, James, & Singh, 2005), though the methods do not work well for all environments.
However, navigation environments are considerably more constrained in their dynamics than what
can be represented by a POMDP generally. For example, the agent cannot jump to arbitrary states,
so the transition function is very sparse. It is worth investigating whether or not the properties that
make PSR discovery fail in some cases apply in navigation.
TD-Nets are a different representation, also based on predictions, in which nodes in a graph
represent scalar predictions about the world and links between them represent action-conditioned
temporal dependencies between the predicted outcomes. The parameters of the network are learned
from data using methods of temporal differences (Section 2.1.2), but no methods exist for discovering the predictions or network structure from experience, making it difficult to see how they can be
applied to SODA.
Although SODA itself does not deal with perceptual aliasing, its sparse, discrete abstraction
is well suited for use as input to a variety of methods for learning to act in partially observable
environments. Investigating these methods is an attractive area for future research.
7.2 Dealing with Turns
Although the learning results in Chapters 5 and 6 show that SODA’s feature learning method suffices
to construct features that greatly reduce task diameter, the method still has the problem, discussed
in Section 4.4, that turning the robot moves the perceptual vector farther in the feature space than
translating forward or backward. Because the GNG network adds features to minimize the distance
between the input and the winning feature, it creates many features that represent views of the
environment that differ from one another by only a small turn. Since there is at least one distinctive
state for each feature, this process creates many distinctive states that only differ by a small rotation.
As a result, as shown in Figures 6.3 and 6.4, turning the robot requires several smaller A1 actions
while traveling forward or backward only requires a few larger actions. Ideally, the abstraction
would treat traveling and turning equivalently: traveling to an intersection, turning to face a new
corridor, traveling again, etc. More generally, an ideal abstraction would cut the environment at
its natural decision points, regardless of whether some primitive actions move the robot farther
in perceptual space than others. SODA might be modified to create such an abstraction either
80
by replacing Euclidean distance in the GNG network with another metric that better captures the
distances between small actions, or by constructing a higher A2 layer of abstraction on top of the A1
layer, which aggregates the small turn actions together into larger actions. These ideas are discussed
in more detail below.
7.2.1
Alternative Distance Metrics for Feature Learning
One possible method of dealing with the large number of distinctive states created by turning is
to replace Euclidean distance for comparing sensory images with a metric that can account for the
relationships between the elements in the input vector. One such metric is earth movers’ distance.
Alternatively, it may be possible for SODA to learn a good distance metric for each new sensorimotor system.
Earth movers’ distance (EMD; Rubner, Tomasi, & Guibas, 2000) is a distance metric used
in computer vision that has properties well-suited for comparing sensor inputs. EMD computes
the cost to “move” the contents of one distribution to best cover another distribution, given known
distances between the bins of the distribution. By treating each input vector or GNG unit as a distribution histogram, it is possible to apply EMD to comparing inputs in SODA. Applying EMD this
way requires a distance matrix D = [dij ] indicating the distance between each sensor represented
in the input vector. Although no such distance matrix is specified in SODA, the formalization in
Chapter 3 assumes that the sensor group that produces the input vector was discovered using the
sensor-grouping methods of Pierce & Kuipers (1997). These methods include automatically learning a distance matrix over the robot’s sensor array that assigns small distances to sensors that are
correlated (i.e. that sample nearby regions of space), and larger distances to sensors that are less
correlated.
Using this distance matrix with the robot used in this dissertation, EMD should assign much
smaller distances between input vectors separated by small turns than between input vectors separated by large turns, because with small turns the similar range values are much closer to one another
(in terms of D). One potential drawback to EMD in this context is that computing a single distance
directly requires solving a linear program. This may be too computationally intensive for a program that must compute many distances many times per second. However, approximate EMD-like
methods have been developed that provide significant speedups (Grauman & Darrell, 2007, 2005).
Alternatively, it may be possible to use the computationally intensive EMD to generate training
data for a faster function approximator that computes an approximate EMD between arbitrary input
pairs.
An alternative to choosing a specific new distance function such as EMD is to have the
SODA agent learn the distance function from experience. Vector-space similarity function learning has already been studied for use in clustering (Xing et al., 2003) and text mining (Bilenko &
Mooney, 2003). These methods require information that indicates which pairs of vectors are similar
(or dissimilar). This information is used to train a function approximator that learns a distance func81
tion between pairs of vectors. In the case of SODA, this information is available in history of the
training phase for the GNG. Since one of SODA’s assumptions is that small actions produce small
changes in the input (Section 3.1), input vectors separated by a single action can be defined to have
a small distance between them. This information can be used to learn a distance metric in which
turning actions and translating actions are separated by similar distances, reducing the number of
GNG units (and hence distinctive states) devoted to representing variants of the same view that only
differ by small turn actions.
To summaraize, if Euclidean distance in GNG training is replaced with a new distance
metric that better represents the topological distance between inputs in the action space, SODA may
be able to reduce task diameters in environments like the T-Maze and ACES by another order of
magnitude.
7.2.2
Learning Places, Paths, and Gateways
An alternative, potentially more promising method for dealing with turns may be to aggregate the
large groups of distinctive states (dstates) linked by small turn actions into places, connected together by paths consisting of a few dstates linked by long “travel” actions. This is the approach
used by the Spatial Semantic Hierarchy (Section 2.1.1) in moving from the causal abstraction level
to the topological level. The SSH causal level is characterized by distinctive states linked by actions, while the topological level consists of places linked into sequences along paths. The resulting
abstraction forms a bipartite graph of places and paths, in which the agent can plan. Building such
an abstraction upon SODA’s dstates and actions would both reduce task diameter further and form
an abstraction suitable for navigation by planning.
The classic SSH formalization requires prior labeling of particular actions as either travel
actions or turn actions. This labeling is antithetical to SODA’s purpose of reducing the need for
human prior knowledge for the learning process. It may be possible to get around the need for
action labeling by adapting the concept of gateways from the Hybrid SSH (HSSH Kuipers et al.,
2004; Beeson et al., 2003). The HSSH combines the strengths of topological mapping for representation of large-scale space, and metrical, probabilistic methods (Thrun, Burgard, & Fox, 2005)
for representation and control in small-scale space. The HSSH labels each place (a decision point
or intersection of paths) with a local perceptual model (e.g. an occupancy grid). Each place has a
discrete set of gateways through which paths enter and leave. These gateways define the interface
between the large-scale, topological representation, and the small-scale metrical representation. An
analogous abstraction could be constructed from SODA’s actions and distinctive states by identifying the dstates that begin or end sequences of one or more long actions, all of which begin by
trajectory-following along the same progress vector (Section 3.3). These starting states would be
labeled as “gateways”; the sequences of long actions (and associated dstates) that connect them
would be labeled as “path segments”; and the groups of dstates and small actions that connect the
path segments together would become “places.” An example of such an abstraction in the T-Maze is
82
Place
Gateways
Path
Segments
Figure 7.1: Grouping SODA Actions into Gateways, Places and Paths. Topological abstraction
of SODA’s actions could be accomplished identifying the states that initiate or terminate sequences
of long similar actions as gateways, and then aggregating the sequences of states connected by
long actions between two gateways into paths, and the collections of states connecting groups of
gateways into places. In this T-Maze example, from Chapter 6, the states in which the robot enters
and leaves the intersection terminate and begin sequences of long actions, and thus are gateways.
The states and actions moving the robot down the corridor are grouped into paths, and the collection
of states traversed in turning are grouped into a place. This second-level (A2 ) abstraction would
reduce the diameter of this task from ten actions to three.
shown in Figure 7.1. This abstraction requires a definition of what constitutes a “long” action. One
method is to cluster all the A1 action instances in the training data into two sets using k-means, and
treat the longer set as the “long” actions.
Control in this new topological A2 abstraction could be accomplished by a new set of options defined over A1 actions, allowing the robot to navigate robustly between gateways even when
motor noise causes a TF action to terminate prematurely. Each gateway leaving a place would have
an associated option for reaching that gateway from any dstate within the place (including the incoming gateways); likewise, each path segment would have an option for reaching the end gateway
from anywhere within the path segment. Limiting these A2 actions to operating over A1 actions
is likely to make their learned polices somewhat less efficient than similar options over A0 . Once
83
policies for these options have been learned using A1 actions, however, it may be possible to replace
each option with a new optimized version that operates over A0 actions. Each optimized option’s
policy would be bootstrapped with action traces generated by the old A1 -based policy, as was done
by Smart & Kaelbling (2002), resulting (after further learning) in A2 actions that drive the robot
directly to the gateways without intermediate trajectory-following and hill-climbing. Such an abstraction would reduce the task diameter considerably, reducing, for example, the diameter of the
task in Figure 7.1 from ten actions to three.
7.3 Scaling to More Complex Sensorimotor Systems
The results in Section 6.3 show that SODA scales up to navigation environments of realistic size.
Increasing the environment size, however, is only one kind of scaling. An important future direction
for research in SODA is investigating how the abstraction scales along other dimensions, such as
input dimensionality and output dimensionality.
The SODA experiments in previous chapters show that SODA operates well with 180 input
dimensions, which is many more than the typical reinforcement learning system in which handcoded feature extractors reduce the dimensionality fewer than 10 dimensions. Nevertheless, there
are obvious ways in which the input dimensionality may be scaled up even further, either by using
sensors of enormous dimensionality, such as cameras (a 640x480 RGB camera provides nearly a
million individual input elements), or by adding multiple heterogeneous sensors. While in some
limited environments it may be possible to pass full camera images or other very large input vectors
directly to a single, huge SOM, such an approach is likely to be highly inefficient, as SOM lookups
for such large, dense vectors are computationally very expensive. Instead, handling these very
high-D inputs will require an extension to SODA’s feature learning. Rather than learn a single flat
feature set to characterize the whole input, a possible extension is to divide the large sensor group
into many smaller (possibly overlapping) groups of scalar elements. Such groups are computed as
an intermediate stage in the sensor grouping method of Pierce & Kuipers (1997). These groups
can then each be characterized by its own SOM. Once these SOMs have been trained, a similar
grouping can be performed on the SOMs, and another layer of SOMs learned to characterize the
joint behavior of the first layer, and so on.
This scheme forms a multistage hierarchical vector quantizer (Luttrell, 1988; Luttrell, 1989).
These kinds of networks are known to trade a small loss of encoding accuracy for a large increase
in computational efficiency. This hierarchical approach is also reminiscent of CLA, the Constructivist Learning Architecture (Chaput, 2004) (Section 2.2.2). In CLA the choice of which lower level
maps to combine in a higher-level map was made by the human engineer, but it may be possible
to automate the process using distance-based grouping. In particular, since each SOM’s output
is effectively a discrete random variable, it may be possible to group them based on informationtheoretic distance measures that maximize mutual information between grouped variables. Such
84
methods were used by Olsson, Nehaniv, & Polani (2006). By building and combining such groupings, SODA should be able to scale its feature learning to very large input domains.
Another possible direction for scaling up is in the dimensionality of the robot’s motor system. Many newer robot systems, such as manipulators, robot dogs, and humanoids, have many
degrees of freedom, yet trajectory-following and hill-climbing still appear to be reasonable highlevel actions for these more complicated robots. For example, it is easy to imagine a robot arm
retrieving an object by following a trajectory to place the gripper near the object, then hill-climbing
to a suitable position to grasp the object. The problem in these robots is that value-function-based
reinforcement learning methods scale poorly as the size of the action space increases. One possible
way of dealing with this problem would be to replace Sarsa-based learning in the TF and HC options
with methods such as neuroevolution or policy-gradient methods, that learn their policies directly,
rather than learning a value function. By encapsulating the interface to the high-dimensional action
space in this way, much of the problem of scaling the robot’s motor space should be alleviated.
7.4 Prior Knowledge and Self-Calibration
One of the stated long-term goals of SODA is to minimize or eliminate the need for human engineers
to embed in each robot specific knowledge of its own sensorimotor system and environment. In the
experiments in the previous chapters, SODA has largely succeeded in this, since parameters of its
perceptual function and control policies are learned from experience (e.g. the GNG size and weights,
and the control option Q-tables). In addition, although the learning parameters for those functions
were still set by hand, the specification of a handful of learning parameters is often considerably
more compact than explicit specification of the functions to be learned. For example, the hand-specified predictive motion model for hill-climbing in Section 5.2.1 is much more complicated to
specify than the hill-climbing options learned in the following section. Furthermore, as with the
perceptual learning described in Section 4.4, the parameters used for the options and other learning
methods were chosen by relatively short “educated walks” in the parameter space — exhaustive
searching was not required. Also, most of the learning parameters used in the T-Maze were used
unmodified in ACES.
Nevertheless, this kind of “knob-tweaking,” common to all machine learning practice, embeds human prior knowledge and should be reduced, with the ultimate goal of achieving self-calibrating or “calibration-free” robots (Graefe, 1999). Assuming that there is no single set of
learning parameters that will work for SODA in all robots and environments, two methods of automatically setting the parameter values present themselves for further study: homeostatic control,
and evolutionary search.
“Homeostasis” is the term used in biology to describe a property of an organism (e.g. body
temperature) that is kept constant by the organism’s physiological processes. Homeostasis is distinguished from “equilibrium” by active control. While a closed system will fall into equilibrium in
the absence of outside perturbation, an organism maintains its systems in homeostasis by actively
changing the value of some parameters of the system to achieve constant target values on some measure or measures of system behavior. Homeostasis in learning has been studied in computational
neuroscience, modeling neurons whose activation threshold is modified to maintain a target firing
rate distribution over time (Turrigiano, 1999; Triesch, 2005; Kurniawan, 2006).
The Homeostatic-GNG used in SODA for feature learning employs a homeostatic mechanism to set the size of the GNG, adding units only when the error rate rises above a set threshold,
and deleting units that poorly represent the input distribution. The success of Homeostatic-GNG
and the work in computational neuroscience suggest that homeostatic processes might be able to
reduce manual search for parameter settings in SODA. It must be noted, however, that homeostatic
processes do not eliminate prior parameter specification, but rather replace one or many parameters
with a single new parameter, the homeostatic target value for the process. Homeostasis is effective
when the same target value can be used in situations where the control parameters would have to
change. For example in the T-Maze and ACES experiments, the Homeostatic-GNG error threshold
used was the same between environments, while the number of features needed to maintain that
error rate was much higher in ACES because of its greater perceptual variety.
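The general pattern can be sketched as follows: a single target value replaces the underlying control parameters, and the controlled quantity is grown or pruned so that a running measure of system behavior stays near the target. The class below is a hypothetical illustration of this idea, not the Homeostatic-GNG implementation.

class HomeostaticController:
    """Hold a running error measure near a target by growing or pruning."""
    def __init__(self, target, decay=0.99):
        self.target = target          # the single remaining free parameter
        self.decay = decay
        self.running_error = 0.0

    def observe(self, error):
        """Exponential moving average of the monitored measure."""
        self.running_error = (self.decay * self.running_error
                              + (1.0 - self.decay) * error)

    def action(self):
        """Grow when error exceeds the target, shrink when well below it."""
        if self.running_error > self.target:
            return "add_unit"
        if self.running_error < 0.5 * self.target:
            return "prune_unit"
        return "hold"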
It is possible that some parameters may not be controllable homeostatically. One important
parameter that seems to fall into this category is the length of the initial exploration period for feature
learning. To learn a complete feature set, the robot must explore the environment long enough to
experience the full range of possible perceptual situations. Beyond that, extra exploration helps
refine the features, but there is a point of diminishing returns. Experiments with learning high-level navigation concurrently with the feature set (by allowing the Q-table to grow as new features
were added) were unsuccessful for two reasons. First, the navigation learner was chasing a moving
target: by the time the Q-values were backed up, the states they belonged to had changed. Second, exploration was poor: the “optimistic initialization” exploration policy requires a good representation of the state space before learning so that the Q-table can be initialized properly. For these reasons the feature set needed to be learned and fixed before navigation could be learned successfully.
This kind of multi-stage learning fits well with constructivist approaches to agent learning
such as Drescher’s Schema Mechanism and Chaput’s CLA (Section 2.2.2), and it also fits with
Bayesian-network-based hierarchical learning models of the cerebral cortex that train and fix each
hierarchical layer before training the layers above (Dean, 2005). If, in fact, a period of dedicated
exploration is truly necessary for SODA to learn a good set of features for navigation, then the
length of exploration is a free parameter that cannot be set by a homeostatic process. However, it
may be possible to set the length of the exploration phase by resetting the timeout clock each time a
new feature is added. That is, if the time elapsed since the last feature was added exceeds some preset
value, then feature learning ends and the agent moves to the policy learning phase with a fixed set
of features.
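A sketch of this stopping rule is shown below. The agent interface (explore_one_step, freeze_features) is hypothetical and stands in for SODA's exploration and feature-learning loop.

def explore_until_stable(agent, patience=5000, max_steps=200000):
    """End feature learning once no new feature has appeared for `patience` steps."""
    steps_since_new_feature = 0
    for step in range(max_steps):
        added = agent.explore_one_step()      # assumed to return True when
        if added:                             # the quantizer grew a feature
            steps_since_new_feature = 0       # reset the timeout clock
        else:
            steps_since_new_feature += 1
        if steps_since_new_feature >= patience:
            break                             # feature set considered stable
    agent.freeze_features()                   # fix features before policy learning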
Alternatively, however, an evolutionary search for parameters may be more appropriate,
where “evolutionary search” refers to any search method that generates parameter sets, tests their
fitness by running each one in an entire developmental cycle, and uses the fittest to generate new parameters. Such an evolutionary search would not be particularly efficient in a single-robot learning
scenario. It could work well, however, in an environment where many robots must perform similar
tasks, since different parameter sets could be evaluated in parallel, and the fittest agents could be
easily replicated into all the robots.
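The following sketch illustrates such a search: each candidate parameter vector is scored by one complete developmental run, and the fittest candidates seed the next generation. The run_development function is a placeholder for a full exploration-and-learning cycle (evaluable in parallel across robots), and the population sizes and mutation scale are arbitrary choices.

import numpy as np

def evolve_parameters(run_development, bounds, pop_size=12, generations=10,
                      elite=3, noise=0.1, seed=0):
    """Simple elitist evolutionary search over bounded learning parameters."""
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    pop = rng.uniform(lo, hi, size=(pop_size, len(bounds)))
    for _ in range(generations):
        # Fitness of each parameter vector = task performance after a full
        # developmental cycle with those settings.
        fitness = np.array([run_development(p) for p in pop])
        parents = pop[np.argsort(fitness)[-elite:]]
        children = [np.clip(parents[rng.integers(elite)]
                            + noise * (hi - lo) * rng.normal(size=len(bounds)),
                            lo, hi)
                    for _ in range(pop_size - elite)]
        pop = np.vstack([parents] + children)
    # Return the best parameter vector from the final population.
    return pop[np.argmax([run_development(p) for p in pop])]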
To summarize, SODA succeeds in its goal of learning without needing prior knowledge
in that it is constructed entirely out of generic learning methods, and does not embed any implicit
knowledge of the robot or environment, such as the detailed sensor and action models typically used
in probabilistic robotics (Thrun, Burgard, & Fox, 2005). However, some implicit prior knowledge is
still embedded in the settings of SODA’s learning parameters. Homeostatic control and evolutionary
parameter search may allow SODA to reduce the amount of such implicit prior knowledge necessary
to learn, moving the system closer to the goal of creating a self-calibrating robot.
7.5 Discussion Summary
This chapter has examined several of the most important directions for future research on SODA:
dealing with perceptual aliasing, eliminating the large number of distinctive states created by turn
actions either through a new distance metric in the GNG, or by aggregating them into “places,”
scaling the method to more complex sensorimotor systems, and further reducing or eliminating the
amount of human-provided prior knowledge built into the system. The next chapter summarizes
this dissertation and concludes by placing this work in the greater context of bootstrap learning,
and, more generally, the construction of intelligent agents.
Chapter 8
Summary and Conclusion
To summarize, this dissertation has presented Self-Organizing Distinctive-state Abstraction (SODA),
a method by which a learning agent controlling a robot can learn an abstract set of percepts and actions that reduces the effective diameter of large-scale navigation tasks. The agent constructs the
abstraction by first learning a set of prototypical sensory images, and then using these to define a set
of high-level actions that move the robot between perceptually distinctive states in the environment.
The agent learns the set of perceptual features using a new, modified Growing Neural Gas
vector quantizing network called Homeostatic-GNG. Homeostatic-GNG incrementally adds features in the regions of the input space where the representation has the most error, stopping when
the accumulated training error over the input falls below a given threshold, and adding new units
again if the error rises above the threshold. This algorithm allows the agent to train the network
incrementally with the observations it experiences while exploring the environment, growing the
network as needed, for example when it enters a new room it has never seen before. This automatic
tuning of the feature set size makes it possible for the agent to learn features for environments of
different size and perceptual complexity without changing the learning parameters.
After learning perceptual features, SODA uses these features to learn two types of abstract
actions: trajectory-following (TF) actions and hill-climbing (HC) actions. A TF action begins at a
distinctive state, and carries the robot into the neighborhood of a new perceptual feature. An HC
action begins in the neighborhood of a perceptual feature and takes the robot to a new distinctive
state at a local maximum of this new feature. By pairing TF and HC actions, the agent navigates
from one distinctive state to another. Both TF and HC actions are defined using the Options formalism for hierarchical reinforcement learning, and their policies are learned from experience in the
environment using a novel prototype-based state abstraction called the Topn representation. Topn
represents the agent’s state as the sorted tuple of the top n closest perceptual prototypes, allowing
TF and HC options to use the GNG features for navigation, even when navigating entirely within
the neighborhood of one feature. In addition, the Topn representation can be hashed into a table
index, so that options can be learned with off-the-shelf tabular reinforcement learning algorithms
like Sarsa and Q-learning.
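One plausible reading of this representation is sketched below: the state key is the sorted tuple of the indices of the n nearest prototypes, which can serve directly as a dictionary key (or be hashed to a table index) for ordinary tabular updates. The helper names are hypothetical, and the sketch is an illustration rather than the dissertation's implementation.

import numpy as np
from collections import defaultdict

def top_n_state(observation, prototypes, n=3):
    """Return the indices of the n closest prototypes as a sorted tuple."""
    d = np.linalg.norm(prototypes - observation, axis=1)
    return tuple(sorted(np.argsort(d)[:n].tolist()))

# A Q-table keyed by (Top-n state, action); works with ordinary Sarsa updates.
Q = defaultdict(float)

def sarsa_update(s, a, r, s2, a2, alpha=0.1, gamma=0.95):
    Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])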
Operating in this new, abstracted state space, moving from one distinctive state to the next,
SODA agents then use reinforcement learning to learn to navigate between widely separated locations in their environment, using a relatively small number of abstract actions, each comprising
many primitive actions.
Experiments using a simulated robot with a high-dimensional range sensor and drive-and-turn effectors showed that SODA was able to learn good sets of prototypes, and reliable, efficient
TF and HC options, in each environment. TF and HC options performed as well as hard-coded TF
and HC controllers constructed with extensive knowledge of the robot’s sensorimotor system and
environment dynamics. In addition, they outperformed hard-coded controllers requiring no such
prior knowledge. Using these learned perceptual features and actions, SODA agents navigating
between distant locations in these environments learned to navigate to their destinations dramatically
faster than agents using primitive, local actions. An ablation study showed that although HC options
do not significantly improve overall navigation learning times, they make the high-level actions
more reliable and reduce the average number of abstract actions needed to navigate to a target.
These features make the abstraction more suitable for future extensions to model-based navigation
by planning.
SODA learns its state-action abstraction autonomously, and the abstraction reflects only the
environment and the agent’s sensorimotor capabilities, without external direction. This abstraction
is learned through a process of bootstrap learning, in which simple or low-level representations
are learned first and then used as building blocks to construct more complex or higher-level representations. SODA bootstraps internally by first learning its perceptual representation using an
unsupervised, self-organizing algorithm, and then using that representation to construct its TF and
HC options, which are then used to learn high-level navigation. In addition, SODA itself is designed
to be used as a building block in a larger bootstrap learning process: The identity of the main sensor group that SODA uses for learning perceptual prototypes can be determined automatically by
existing lower-level bootstrap learning methods like those of Pierce & Kuipers (1997). In addition,
SODA’s representation of discrete states and actions is suitable for bootstrapping up to even higher
representations, such as topological maps or learning agents that can function in the presence of
hidden state.
Bootstrap learning is one approach to a larger problem: how can we create intelligent systems that autonomously acquire the knowledge they need to perform their tasks? This problem is
predicated on the notion that an essential objective of AI is “the turning over of responsibility for
the decision-making and organization of the AI system to the AI system itself.” (Sutton, 2001)
Achieving this objective is an enormous task that must be approached incrementally. This dissertation is one such step. This step enables a robot to learn an abstraction that allows it to navigate in
large scale space—a foundational domain of commonsense knowledge—and can itself be used as
the foundation for learning higher-level concepts and behaviors.
Bibliography
Andre, D., and Russell, S. J. 2001. Programmable reinforcement learning agents. In Advances in
Neural Information Processing Systems 12, 1019–1025.
Baird, L. 1995. Residual algorithms: Reinforcement learning with function approximation. In
Proceedings of the Twelfth International Conference on Machine Learning, 30–37. Morgan
Kaufmann.
Bednar, J. A.; Choe, Y.; De Paula, J.; Miikkulainen, R.; Provost, J.; and Tversky, T. 2004. Modeling
cortical maps with Topographica. Neurocomputing 58-60:1129–1135.
Beeson, P.; MacMahon, M.; Modayil, J.; Provost, J.; Savelli, F.; and Kuipers, B. 2003. Exploiting
local perceptual models for topological map-building. In IJCAI-2003 Workshop on Reasoning
with Uncertainty in Robotics (RUR-03).
Beeson, P.; Murarka, A.; and Kuipers, B. 2006. Adapting proposal distributions for accurate,
efficient mobile robot localization. In IEEE International Conference on Robotics and Automation.
Bilenko, M., and Mooney, R. J. 2003. Adaptive duplicate detection using learnable string similarity
measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD-2003), 39–48.
Chaput, H. H., and Cohen, L. B. 2001. A model of infant causal perception and its development.
In Proceedings of the 23rd Annual Conference of the Cognitive Science Society, 182–187.
Hillsdale, NJ: Erlbaum.
Chaput, H. H.; Kuipers, B.; and Miikkulainen, R. 2003. Constructivist learning: A neural implementation of the schema mechanism. In Proceedings of the Workshop on Self-Organizing
Maps (WSOM03).
Chaput, H. H. 2001. Post-Piagetian constructivism for grounded knowledge acquisition. In Proceedings of the AAAI Spring Symposium on Grounded Knowledge.
Chaput, H. H. 2004. The Constructivist Learning Architecture: a model of cognitive development
for robust autonomous robots. Ph.D. Dissertation, The University of Texas at Austin. Technical
Report AI-TR-04-34.
Chrisman, L. 1992. Reinforcement learning with perceptual aliasing: The perceptual distinctions
approach. In Swartout, W., ed., Proceedings of the Tenth National Conference on Artificial
Intelligence, 183–188. Cambridge, MA: MIT Press.
Cohen, J. D.; MacWhinney, B.; Flatt, M.; and Provost, J. 1993. PsyScope: An interactive graphic
system for designing and controlling experiments in the psychology laboratory using Macintosh computers. Behavioral Research Methods, Instruments and Computers 25(2):257–271.
Cohen, L. B.; Chaput, H. H.; and Cashon, C. H. 2002. A constructivist model of infant cognition.
Cognitive Development.
Dean, T. 2005. A computational model of the cerebral cortex. In The Proceedings of the Twentieth
National Conference on Artificial Intelligence. MIT Press.
Dietterich, T. G. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13:227–303.
Digney, B. 1996. Emergent hierarchical control structures: Learning reactive/hierarchical relationships in reinforcement environments. In Proceedings of the Fourth Conference on the Simulation of Adaptive Behavior: SAB 96.
Digney, B. 1998. Learning hierarchical control structure for multiple tasks and changing environments. In Proceedings of the Fifth Conference on the Simulation of Adaptive Behavior: SAB
98.
Drescher, G. L. 1991. Made-up minds: A constructivist approach to artificial intelligence. Cambridge, MA: The MIT Press.
Duckett, T., and Nehmzow, U. 2000. Performance comparison of landmark recognition systems for
navigating mobile robots. In Proc. 17th National Conf. on Artificial Intelligence (AAAI-2000).
AAAI Press/The MIT Press.
Duda, R. O.; Hart, P. E.; and Stork, D. G. 2001. Pattern Classification. New York: John Wiley &
Sons, Inc., second edition.
Fritzke, B. 1995. A growing neural gas network learns topologies. In Advances in Neural Information Processing Systems 7.
Fritzke, B. 1997. A self-organizing network that can follow non-stationary distributions. In Gerstner, W.; Germond, A.; Hasler, M.; and Nicoud, J.-D., eds., Proceedings of the Seventh International Conference on Artificial Neural Networks: ICANN-97, volume 1327 of Lecture Notes
in Computer Science, 613–618. Berlin: Springer.
Gerkey, B.; Vaughan, R. T.; and Howard, A. 2003. The player/stage project: Tools for multirobot and distributed sensor systems. In Proceedings of the 11th International Conference on
Advanced Robotics, 317–323.
Gomez, F., and Miikkulainen, R. 2002. Robust nonlinear control through neuroevolution. Technical
Report AI02-292, Department of Computer Sciences, The University of Texas at Austin.
Gomez, F.; Schmidhuber, J.; and Miikkulainen, R. 2006. Efficient non-linear control through
neuroevolution. In Proceedings of the European Conference on Machine Learning.
Gomez, F. 2003. Robust Non-linear Control through Neuroevolution. Ph.D. Dissertation, The
University of Texas at Austin, Austin, TX 78712.
Gordon, G. J. 2000. Reinforcement learning with function approximation converges to a region. In
Advances in Neural Information Processing Systems, 1040–1046.
Graefe, V. 1999. Calibration-free robots. In Proceedings 9th Intelligent System Symposium. Japan
Society of Mechanical Engineers.
Grauman, K., and Darrell, T. 2005. The pyramid match kernel: Discriminative classification with
sets of images. In Proceedings of the IEEE International Conference on Computer Vision
(ICCV).
Grauman, K., and Darrell, T. 2007. Approximate correspondences in high dimensions. In Advances
in Neural Information Processing Systems (NIPS), volume 19.
Hume, D. 1777. An Enquiry Concerning Human Understanding. Clarendon Press, Oxford.
Jonsson, A., and Barto, A. G. 2000. Automated state abstraction for options using the U-Tree
algorithm. In Advances in Neural Information Processing Systems 12, 1054–1060.
Kaelbling, L. P.; Littman, M.; and Moore, A. W. 1996. Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4:237–285.
Kohonen, T. 1995. Self-Organizing Maps. Berlin: Springer.
Kröse, B., and Eecen, M. 1994. A self-organizing representation of sensor space for mobile robot navigation. In Proceedings of the IEEE International Conference on Intelligent Robots
and Systems.
Kuipers, B., and Beeson, P. 2002. Bootstrap learning for place recognition. In Proc. 18th National
Conf. on Artificial Intelligence (AAAI-2002). AAAI/MIT Press.
Kuipers, B.; Modayil, J.; Beeson, P.; MacMahon, M.; and Savelli, F. 2004. Local metrical and
global topological maps in the Hybrid Spatial Semantic Hierarchy. In IEEE International
Conference on Robotics & Automation (ICRA-04).
Kuipers, B.; Beeson, P.; Modayil, J.; and Provost, J. 2006. Bootstrap learning of foundational
representations. Connection Science 18(2).
Kuipers, B. 2000. The Spatial Semantic Hierarchy. Artificial Intelligence 119:191–233.
Kurniawan, V. 2006. Self-organizing visual cortex model using a homeostatic plasticity mechanism.
Master’s thesis, The University of Edinburgh, Scotland, UK.
Lee, S. J. 2005. Frodo Baggins, A.B.D. The Chronicle of Higher Education.
Littman, M.; Sutton, R. S.; and Singh, S. 2002. Predictive representations of state. In Advances in
Neural Information Processing Systems, volume 14, 1555–1561. MIT Press.
Luttrell, S. P. 1988. Self-organizing multilayer topographic mappings. In Proceedings of the IEEE
International Conference on Neural Networks (San Diego, CA). Piscataway, NJ: IEEE.
Luttrell, S. P. 1989. Hierarchical vector quantisation. IEE Proceedings: Communications Speech
and Vision 136:405–413.
MacWhinney, B.; Cohen, J. D.; and Provost, J. 1997. The PsyScope experiment-building system.
Spatial Vision 11(1):99–101.
Martinetz, T. M.; Ritter, H.; and Schulten, K. J. 1990. Three-dimensional neural net for learning
visuomotor coordination of a robot arm. IEEE Transactions on Neural Networks 1:131–136.
McCallum, A. K. 1995. Reinforcement Learning with Selective Perception and Hidden State. Ph.D.
Dissertation, University of Rochester, Rochester, New York.
McGovern, A., and Barto, A. G. 2001. Automatic discovery of subgoals in reinforcement learning
using diverse density. In Machine Learning: Proceedings of the 18th Annual Conference,
361–368.
McGovern, A. 2002. Autonomous Discovery of Temporal Abstractions from Interaction with an
Environment. Ph.D. Dissertation, The University of Massachusetts at Amherst.
Miikkulainen, R. 1990. Script recognition with hierarchical feature maps. Connection Science
2:83–101.
Mitchell, T. M. 1997. Machine Learning. WCB/McGraw Hill.
Modayil, J., and Kuipers, B. 2004. Bootstrap learning for object discovery. In IEEE/RSJ International Conference on Intelligent Robots and Systems.
Modayil, J., and Kuipers, B. 2006. Autonomous shape model learning for object localization and
recognition. In IEEE International Conference on Robotics and Automation.
Nehmzow, U., and Smithers, T. 1991. Mapbuilding using self-organizing networks in really useful
robots. In Proceedings SAB ’91.
Nehmzow, U.; Smithers, T.; and Hallam, J. 1991. Location recognition in a mobile robot using selforganizing feature maps. Research Paper 520, Department of Artificial Intelligence, University
of Edinburgh, Edinburgh, UK.
Ng, A. Y.; Harada, D.; and Russell, S. 1999. Policy invariance under reward transformations: theory
and application to reward shaping. In Proc. 16th International Conf. on Machine Learning,
278–287. Morgan Kaufmann, San Francisco, CA.
Olsson, L. A.; Nehaniv, C. L.; and Polani, D. 2006. From unknown sensors and actuators to actions
grounded in sensorimotor perceptions. Connection Science 18(2):121–144.
Parr, R., and Russell, S. 1997. Reinforcement learning with hierarchies of machines. In Advances
in Neural Information Processing Systems 9.
Pierce, D. M., and Kuipers, B. J. 1997. Map learning with uninterpreted sensors and effectors.
Artificial Intelligence 92:169–227.
Pollack, J. B., and Blair, A. D. 1997. Why did TD-gammon work? In Mozer, M. C.; Jordan, M. I.;
and Petsche, T., eds., Advances in Neural Information Processing Systems, volume 9, 10. The
MIT Press.
Precup, D. 2000. Temporal abstraction in reinforcement learning. Ph.D. Dissertation, The University of Massachusetts at Amherst.
Provost, J.; Beeson, P.; and Kuipers, B. J. 2001. Toward learning the causal layer of the spatial
semantic hierarchy using SOMs. AAAI Spring Symposium Workshop on Learning Grounded
Representations.
Provost, J.; Kuipers, B. J.; and Miikkulainen, R. 2006. Developing navigation behavior through
self-organizing distinctive-state abstraction. Connection Science 18(2).
Ring, M. B. 1994. Continual Learning in Reinforcement Environments. Ph.D. Dissertation, Department of Computer Sciences, The University of Texas at Austin, Austin, Texas 78712.
Ring, M. B. 1997. Child: A first step towards continual learning. Machine Learning 28(1):77–104.
Roy, N., and Thrun, S. 1999. Online self calibration for mobile robots. In IEEE International
Conference on Robotics and Automation.
Rubner, Y.; Tomasi, C.; and Guibas, L. J. 2000. The earth mover’s distance as a metric for image
retrieval. International Journal of Computer Vision 40(2):99–121.
Segal, E.; Pe’er, D.; Regev, A.; Koller, D.; and Friedman, N. 2005. Learning module networks.
Journal of Machine Learning Research 6:557–588.
Shani, G. 2004. A survey of model-based and model-free methods for resolving perceptual aliasing.
Technical Report 05-02, Department of Computer Science at the Ben-Gurion University in the
Negev.
Şimşek, Ö., and Barto, A. G. 2004. Using relative novelty to identify useful temporal abstractions
in reinforcement learning. In Proceedings of the Twenty-First International Conference on
Machine Learning, 751–758. ACM Press.
Şimşek, Ö.; Wolfe, A. P.; and Barto, A. G. 2005. Identifying useful subgoals in reinforcement learning by local graph partitioning. In Proceedings of the Twenty-Second International Conference
on Machine Learning.
Singh, S.; Littman, M. L.; Jong, N. K.; Pardoe, D.; and Stone, P. 2003. Learning predictive state
representations. In The Twentieth International Conference on Machine Learning (ICML-2003).
Singh, S.; James, M. R.; and Rudary, M. R. 2004. Predictive state representations: A new theory
for modeling dynamical systems. In Proceedings of the 20th conference on Uncertainty in
artificial intelligence.
Smart, W. D., and Kaelbling, L. P. 2002. Effective reinforcement learning for mobile robots. In
Proceedings of the International Conference on Robotics and Automation.
Smith, A. J. 2002. Applications of the self-organizing map to reinforcement learning. Neural
Networks 15:1107–1124.
Stanley, K. O. 2003. Efficient Evolution of Neural Networks Through Complexification. Ph.D.
Dissertation, Department of Computer Sciences, The University of Texas at Austin, Austin,
TX.
Stone, P., and Sutton, R. S. 2001. Scaling reinforcement learning toward roboCup soccer. In
Proceedings of the Eighteenth International Conference on Machine Learning.
Stone, P., and Veloso, M. 2000. Layered learning. In Eleventh European Conference on Machine
Learning (ECML-2000).
Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning: An Introduction. Cambridge, MA:
MIT Press.
Sutton, R. S., and Tanner, B. 2005. Temporal-difference networks. In Advances in Neural Information Processing Systems, volume 17, 1377–1384.
Sutton, R. S.; Precup, D.; and Singh, S. 1998. Intra-option learning about temporally abstract actions. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), 556–564. Morgan Kaufmann.
Sutton, R. S.; Precup, D.; and Singh, S. 1999. Between MDPs and SMDPs: A framework for
temporal abstraction in reinforcement learning. Artificial Intelligence 112:181–211.
Sutton, R. 2001. What’s wrong with artificial intelligence. http://www-anw.cs.umass.edu/~rich/IncIdeas/WrongWithAI.html.
Tesauro, G. J. 1995. Temporal difference learning and TD-gammon. Communications of the ACM
38:58–68.
Teyssier, M., and Koller, D. 2005. Ordering-based search: A simple and effective algorithm for learning Bayesian networks. In Proceedings of the Twenty-first Conference on Uncertainty in
AI (UAI), 584–590.
Thrun, S.; Burgard, W.; and Fox, D. 2005. Probabilistic Robotics. Cambridge, MA: MIT Press.
Toussaint, M. 2004. Learning a world model and planning with a self-organizing, dynamic neural
system. In Advances in Neural Information Processing Systems 16.
Triesch, J. 2005. A gradient rule for the plasticity of a neuron’s intrinsic excitability. Artificial
Neural Networks: Biological Inspirations - ICANN 2005 65–70.
Turrigiano, G. G. 1999. Homeostatic plasticity in neuronal networks: The more things change, the
more they stay the same. Trends in Neurosciences 22(5):221–227.
Watkins, C. J. C. H., and Dayan, P. 1992. Q-learning. Machine Learning 8(3):279–292.
Wolfe, B.; James, M. R.; and Singh, S. 2005. Learning predictive state representations in dynamical systems without reset. In Proceedings of the 22nd International Conference on Machine
Learning.
Xing, E. P.; Ng, A. Y.; Jordan, M. I.; and Russell, S. 2003. Distance metric learning with applications
to clustering with side information. In Advances in Neural Information Processing Systems
(NIPS).
Zimmer, U. R. 1996. Robust world-modelling and navigation in a real world. Neurocomputing
13(2-4):247–260.
Vita
Jefferson Provost was born in Mt. Lebanon, Pennsylvania in 1968. He graduated from Mt. Lebanon
High School in 1986, and received his B.S. in Computer Science from the University of Pittsburgh in
1990. From then until entering graduate school in 1998, he worked writing software to help psychologists design and run experiments. First, at the Carnegie Mellon University Psychology Department, he helped design and implement the PsyScope experiment design and control system (MacWhinney,
Cohen, & Provost, 1997; Cohen et al., 1993). Later, at Psychology Software Tools, Inc., he was part
of the team that designed and implemented the E-Prime experiment system. More recently, he was
a key member of the team that designed and developed the Topographica cortical simulator (Bednar
et al., 2004).
Permanent Address: 3257 Eastmont Ave
Pittsburgh, PA 15216
[email protected]
http://www.cs.utexas.edu/users/jp/
This dissertation was typeset with LaTeX 2ε by the author.

LaTeX 2ε is an extension of LaTeX. LaTeX is a collection of macros for TeX. TeX is a trademark of the American Mathematical Society. The macros used in formatting this dissertation were written by Dinesh Das, Department of Computer Sciences, The University of Texas at Austin, and extended by Bert Kay and James A. Bednar.