
Policy Gradient Reinforcement Learning Without Regret

by

Travis Dick

A thesis submitted in partial fulfillment of the requirements for the


degree of

Master of Science

Department of Computing Science


University of Alberta

© Travis Dick, 2015


Abstract

This thesis consists of two independent projects, each contributing to a central goal of
artificial intelligence research: to build computer systems that are capable of performing
tasks and solving problems without problem-specific direction from us, their designers. I
focus on two formal learning problems that have a strong mathematical grounding. Many
real-world learning problems can be cast as instances of one of these two problems. Whenever
our translation from the real to the formal accurately captures the character of the problem,
the mathematical arguments we make about algorithms in the formal setting will
approximately hold in the real world as well.

The first project focuses on an open question in the theory of policy gradient reinforcement
learning methods. These methods learn by trial and error and decide whether a trial was good
or bad by comparing its outcome to a given baseline. The baseline has no impact on the formal
asymptotic guarantees for policy gradient methods, but it does alter their finite-time
behaviour. This immediately raises the question: which baseline should we use? I propose that
the baseline should be chosen such that a certain estimate used internally by policy gradient
methods has the smallest error. I prove that, under slightly idealistic assumptions, this
baseline gives a good upper bound on the regret of policy gradient methods. I derive
closed-form expressions for this baseline in terms of properties of the formal learning
problem and the computer's behaviour. The quantities appearing in the closed-form expressions
are often unknown, so I also propose two algorithms for estimating this baseline from only
known quantities. Finally, I present an empirical comparison of commonly used baselines that
demonstrates improved performance when using my proposed baseline.

The second project focuses on a recently proposed class of formal learning problems that is
in the intersection of two fields of computing science research: reinforcement learning and
online learning. The considered problems are sometimes called online Markov decision
processes, or Markov decision processes with changing rewards. The unique property of this
class is that it assumes the computer's environment is adversarial, as though it were playing
a game against the computer. This is in contrast to the more common assumption that the
environment's behaviour is determined entirely by stochastic models. I propose three new
algorithms for learning in Markov decision processes with changing rewards under various
conditions. I prove theoretical performance guarantees for each algorithm that either
complement or improve the best existing results and that often hold even under weaker
assumptions. This comes at the cost of increased (but still polynomial) computational
complexity. Finally, in the development and analysis of these algorithms, it was necessary to
analyze an approximate version of a well-known optimization algorithm called online mirror
ascent. To the best of my knowledge, this is the first rigorous analysis of this algorithm
and it is of independent interest.

Contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

1 Introduction 1

2 Reinforcement Learning and Decision Processes 5


2.1 Markov decision processes . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Total Reward in Episodic MDPs . . . . . . . . . . . . 9
2.1.2 Average Reward in Ergodic MDPs . . . . . . . . . . . 11
2.2 Markov decision processes with changing rewards . . . . . . . 13
2.2.1 Loop-free Episodic MDPCRs . . . . . . . . . . . . . . 17
2.2.2 Uniformly Ergodic MDPCRs . . . . . . . . . . . . . . 19

3 Gradient Methods 21
3.1 Gradient Ascent . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Stochastic Gradient Ascent . . . . . . . . . . . . . . . . . . . 25
3.3 Online Mirror Ascent . . . . . . . . . . . . . . . . . . . . . . . 27

4 Policy Gradient Methods and Baseline Functions 32


4.1 Policy Gradient Methods . . . . . . . . . . . . . . . . . . . . 33
4.2 Baseline Functions . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 MSE Minimizing Baseline . . . . . . . . . . . . . . . . . . . . 41
4.3.1 Regret Bound from the MSE Minimizing Baseline . . 42
4.3.2 MSE Minimizing Baseline for Average Reward . . . . 43
4.3.3 MSE Minimizing Baseline for Total Reward . . . . . . 45
4.4 Estimating the MSE Minimizing Baseline . . . . . . . . . . . 50

4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5 Learning in MDPCRs 69
5.1 Reductions to Online Linear Optimization . . . . . . . . . . . 70
5.1.1 Reduction of Loop-Free Episodic MDPCRs . . . . . . 71
5.1.2 Reduction of Uniformly Ergodic MDPCRs . . . . . . . 75
5.2 Online Mirror Ascent with Approximate Projections . . . . . 82
5.3 Learning Algorithms and Regret Bounds . . . . . . . . . . . . 87
5.3.1 Loop Free Episodic MDPCRs with Instructive Feedback 90
5.3.2 Loop Free Episodic MDPCRs with Evaluative Feedback 93
5.3.3 Uniformly Ergodic MDPCRs with Instructive Feedback 96

6 Conclusion 99

Chapter 1

Introduction

This thesis focuses on a central goal of artificial intelligence: building computer
systems capable of performing tasks or solving problems without the
need for us, their designers, to treat each task or problem individually. That
is, we want algorithms that enable computers to learn for themselves how to
succeed at tasks and how to solve problems. I believe that a good approach
is to first focus on designing algorithms for formal learning problems that
we can mathematically reason about. Once we have a repertoire of formal
problems and well-understood algorithms, we can then solve a real-world
problem by first translating it into one of our formal learning problems and
then applying an algorithm designed for that formal problem. If our for-
mal model accurately captures the nature of the real-world problem, then
the mathematical arguments we make about our learning algorithms will
(nearly) hold in the real world as well. I am not alone in this belief, and
this strategy is common in the artificial intelligence community.
To create truly general learning algorithms we should also automate the
modeling step, in which the real-world problem is approximated by a formal
one. This is a very exciting and interesting research problem, but it appears
to be quite difficult to make progress on. Fortunately, even without automatic
modeling, it is still worthwhile to study and design algorithms for formal
learning problems. This is because it may be easier for a human to model
a problem than to solve it. A computing scientist equipped with a set

of descriptive formal learning problems and good learning algorithms can
then approach a difficult real-world problem by first modeling it formally
and then handing it off to a computer. Moreover, when we do eventually
have strategies for automatic modeling, it will be convenient to already have
algorithms for many formal learning problems.
This thesis describes two projects in pursuit of the strategy outlined
above. Both projects work towards answering interesting mathematical
questions arising in the design and analysis of algorithms for two different
formal learning problems. Both learning problems are formulated in the lan-
guage of reinforcement learning, which is an approach whereby the computer
learns by trial and error. Further, the algorithms studied in both projects
treat learning as mathematical optimization and are derived from an opti-
mization algorithm called gradient ascent. Finally, both projects measure
learning performance in terms of regret, which is roughly how much worse
the computer learner performed than if it had known the best strategy be-
forehand. The title of the thesis, Policy Gradient Reinforcement Learning
Without Regret, mentions explicitly each of these three components, which
will be described in more detail in the remainder of the thesis.
The goal of the first project is to answer an open question about a fam-
ily of algorithms called policy gradient methods. The question is somewhat
technical, so for now I will only discuss it at a high level and postpone the de-
tailed description until Chapter 4. In the reinforcement learning framework,
the computer system receives a reward following each of its decisions and
its goal is to earn the most reward in the long run. Policy gradient meth-
ods learn to earn rewards by trial and error. After trying one behaviour
for a period of time, the algorithm compares the rewards it earned to a
given baseline. If the computer performed better than the baseline, then
the tendency to follow that behaviour again in the future is increased. Sim-
ilarly, if the computer performed worse than the baseline, the likelihood of
that behaviour is decreased. Surprisingly, the asymptotic formal guarantees
for policy gradient methods do not depend on what baseline they compare
against. The baseline does, however, influence the computer’s finite-time
behaviour, which leads immediately to the question: what baseline should

we use? This is the question addressed by the first project in this thesis.
The answer to this question depends on our goals for the computer system.
For example, some baselines may be computationally efficient to construct,
while others may allow the computer to learn more quickly.
The main contributions of my first project are as follows: for two specific
policy gradient algorithms, I propose that the baseline should be chosen to
minimize the mean-squared-error of an internally used gradient estimate. I
support this proposal by showing that under certain conditions, this choice
results in a good regret bound for the two policy gradient algorithms. I
present closed-form expressions showing how this baseline can be computed
from properties of the environment. These closed-form expressions depend
on properties of the environment that are usually unknown, so I also pro-
vide two new algorithms for estimating the baseline from interaction with
the environment. Finally, I present an empirical comparison between three
baselines that shows an empirical benefit to using my proposed baseline.
The second project combines aspects of reinforcement learning with as-
pects of another sub-field of computing science sometimes called online learn-
ing. The class of learning problems that are most commonly considered
in reinforcement learning are called Markov decision processes, which are
stochastic models describing how the computer’s environment behaves. In
contrast, problems from online learning typically use adversarial models,
where we imagine that the environment is playing a game against the com-
puter. In this project, we consider a class of learning problems called Markov
decision processes with changing rewards (MDPCR), which are very similar
to Markov decision processes, except that some stochastic pieces of the model
are replaced with adversarial counterparts. The goal of this project is to design
new efficient learning algorithms for this class of problems.
The main contributions of the second project, which were also presented
at ICML 2014 [DGS2014], are as follows: I show that learning in Loop-free
MDPCRs and Uniformly Ergodic MDPCRs can be reduced to another prob-
lem called online linear optimization. From this reduction, I derive three
new learning algorithms with sublinear regret bounds. There are trivial
algorithms that achieve regret bounds of the same order, but their compu-

tational complexity is exponential in the problem size. The three proposed
algorithms all have computational complexity that is polynomial in the prob-
lem size. Moreover, for the so-called bandit information setting, the regret
bound for the proposed algorithm holds under significantly weaker assump-
tions on the environment than those required by existing algorithms. The
three proposed algorithms are of interest because of their low computational
complexity and because their analysis holds under weak conditions.
This thesis is organized as follows: (i) Chapter 2 introduces the reinforce-
ment learning framework, Markov decision processes, and Markov decision
processes with changing rewards. These are formal learning problems con-
sidered in this thesis; (ii) Chapter 3 introduces stochastic gradient ascent
and online mirror ascent, which are the optimization tools that are used in all
algorithms studied in this thesis; (iii) Chapter 4 presents the first project,
which focuses on the baseline for policy gradient methods; (iv) Chapter 5
presents the second project, which focuses on designing new algorithms for
Markov decision processes with changing rewards; (v) and finally, Chapter 6
discusses directions for future research and gives concluding remarks.

Chapter 2

Reinforcement Learning and


Decision Processes

This chapter introduces the reinforcement learning framework, Markov decision
processes, and Markov decision processes with changing rewards. Both
projects in this thesis work towards designing algorithms that effectively
learn how to make decisions in one or the other of these two decision prob-
lems. The first project focuses on Markov decision processes, and the second
project focuses on Markov decision processes with changing rewards.
The field of reinforcement learning is concerned with the following sit-
uation: A computer program, called the agent, is trying to achieve a well-
specified goal while interacting with its environment. For example, an agent
maintaining the inventory of a convenience store might be responsible for
choosing how much of each product to order at the end of each week. In
this setting, a natural goal for the agent is to maximize profits. If the agent
orders too little of a popular product, then profits may be lost due to missed
sales if it sells out. On the other hand, if the agent orders too much, profits
may be lost if the excess items expire before being sold. To perform well at
this task, the agent must interact with and anticipate the external world.
Every reinforcement learning problem has three components: states, ac-
tions, and rewards. In the above example, at the end of each week the
agent chooses an action (the amount of each product to order) based on the

environment’s current state (the store’s current inventory and any other ob-
servations available to the agent). Following each action, the agent receives
a scalar reward (the weekly profit), and the agent’s goal is to maximize some
measure of the long-term reward, such as the total profit over a fixed number
of weeks, or the average profit per week.
Reinforcement learning problems differ in the set of states, the set of
actions, and how the environment responds to the agent’s actions. Markov
decision processes and Markov decision processes with changing rewards are
formal models for how the agent’s actions affect the environment’s state and
rewards. We can mathematically reason about these formal models to make
strong theoretical claims about learning algorithms. The two models pre-
sented in this chapter are complementary and each can be used to accurately
model different kinds of real-world problems.

2.1 Markov decision processes


This section briefly describes Markov decision processes. The presentation
below is heavily influenced by my interactions with Rich Sutton, András
György, and Csaba Szepesvári, and the excellent books by Rich Sutton and
Andy Barto [SB1998] and by Csaba Szepesvári [Cs2010].
Markov decision processes are stochastic models in that they suppose
there are probability distributions that describe the outcomes of the agent’s
actions. For any set S, let ∆S denote the set of probability distributions
over S.1 With this, we have the following definition:

Definition 2.1. A Markov decision process (MDP) is a tuple (X , A, xstart , T)
where X is a finite set of states, A is a finite set of actions, xstart ∈ X is the
starting state, and T : X × A → ∆X×R is a transition probability kernel.

The interpretation of these quantities is as follows: Prior to choosing an


1 When S is finite, we will consider probability measures with respect to the discrete
σ-algebra. When S is the real line, we consider probability measures with respect to the
Borel σ-algebra. Otherwise, S will be a product of finite sets and one or more copies of
the real line, in which case we consider the product σ-algebra. The rest of this thesis does
not discuss measure theoretic results.

action (encoded as an element a ∈ A), the agent observes the environment’s
state (encoded as an element x ∈ X ). For each state action pair (x, a), the
transition probability kernel gives a distribution over states and rewards,
denoted by T(x, a). The environment’s next state and the agent’s reward
are jointly distributed according to T(x, a) whenever the agent takes action
a from state x. When the agent begins interacting with the environment,
the environment is in state xstart . Typically, we consider the case when the
agent knows the sets of states, the set of actions, and the starting state, but
does not know the transition probability kernel.
We rarely need to work with the transition probability kernel directly.
For almost all purposes, given a state-action pair (x, a), we only care about
the marginal distribution over the next state and the expected reward.
Therefore, we use the following simpler notation: Let (x, a) be any state-
action pair and let (X′, R) be randomly sampled from T(x, a). We define the
state transition probabilities by P(x, a, x′) = P(X′ = x′) and the expected
reward by r(x, a) = E[R]. The dependence of these functions on the pair
(x, a) is through the distribution of X′ and R.
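As a concrete illustration of this notation (my own sketch, not part of the thesis), the
marginals P and r can be estimated by repeatedly sampling from a simulator of the kernel T;
the function sample_T below is a hypothetical stand-in for such a simulator.

import numpy as np

# Minimal sketch: estimate P(x, a, x') and r(x, a) by Monte Carlo sampling.
# sample_T(x, a) is assumed to return a pair (next_state, reward) drawn from T(x, a).
def estimate_marginals(sample_T, num_states, num_actions, num_samples=10_000):
    P = np.zeros((num_states, num_actions, num_states))  # state transition probabilities
    r = np.zeros((num_states, num_actions))               # expected rewards
    for x in range(num_states):
        for a in range(num_actions):
            for _ in range(num_samples):
                x_next, reward = sample_T(x, a)            # (X', R) ~ T(x, a)
                P[x, a, x_next] += 1.0 / num_samples
                r[x, a] += reward / num_samples
    return P, r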
Just as it is useful to have a model for how the environment behaves, it
is useful to have a model for how the agent chooses actions. For MDPs, a
natural choice is to suppose that the agent chooses actions according to a
Markov policy, which is a stochastic mapping from states to actions.

Definition 2.2. A (Markov) policy is a map π : X → ∆A that assigns a
probability distribution over actions to each state. Let Π = Π(X , A) denote
the set of all Markov policies.

We say that an agent is following policy π if, whenever the environment
is in state x, she randomly chooses an action according to the distribution
π(x). We will denote by π(x, a) the probability of choosing action a from
state x.
Following any fixed policy π ∈ Π will produce a random trajectory of
states, actions, and rewards. The distribution on this trajectory depends
only on the policy π and the MDP transition probability kernel T. We will
denote a sample of this random trajectory by X_1^π, A_1^π, R_1^π, X_2^π, A_2^π, R_2^π, . . . .

The meaning of the time indexes is as follows: action A_t^π was taken after
observing X_t^π, and the reward produced by this action was R_t^π.2
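The following small Python sketch (again my own illustration, with hypothetical array-based
P, r and pi) samples such a trajectory by alternately drawing an action from the policy and a
next state from the transition probabilities.

import numpy as np

def sample_trajectory(P, r, pi, x_start, horizon, rng=np.random.default_rng()):
    """P: (X, A, X) transition probabilities, r: (X, A) expected rewards,
    pi: (X, A) action probabilities, x_start: index of the starting state."""
    x = x_start
    trajectory = []
    for t in range(horizon):
        a = rng.choice(P.shape[1], p=pi[x])          # A_t ~ pi(X_t)
        reward = r[x, a]                              # for simplicity, use the mean reward
        x_next = rng.choice(P.shape[2], p=P[x, a])    # X_{t+1} ~ P(X_t, A_t, .)
        trajectory.append((x, a, reward))
        x = x_next
    return trajectory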

At first glance it seems restrictive that Markov policies are only permit-
ted to choose actions based on the environment’s current state, rather than
the entire history of states, actions, and rewards. It turns out, however,
that in all the cases we consider, there is an optimal policy that chooses
actions as a deterministic function of the environment’s current state. This
is because, conditioned on the current state, the future of an MDP is inde-
pendent of the history. We consider the more general class of (stochastic)
Markov policies because they allow the agent to randomly choose actions,
which is useful for trial and error learning.
Reinforcement learning algorithms adapt their behaviour over time to
maximize reward. In principle, if the agent knew the environment’s transi-
tion probability kernel T beforehand, the agent could compute an optimal
policy off-line prior to interacting with the environment. But, since the
probability kernel is unknown, the agent must use its interaction with the
environment to improve its policy. For example, a simple approach would
be to compute a maximum likelihood estimate of the transition probability
kernel based on the observed state transitions and rewards and to calculate
the optimal policy for the approximated kernel. In general, this approach
is too costly in terms of memory, computation, and interactions with the
environment, so we seek other approaches.
The only remaining aspect of an MDP left to formalize is the agent’s
learning objective. That is, what exactly is the agent trying to accomplish
while interacting with her environment? A formal learning objective is a map
J : Π → R that maps each policy to a scalar measure of its performance.
The map J specifies precisely what we desire in a policy and is usually a
function of both the policy and the environment. Intuitively, the learning
objective should be to maximize some measure of the long-term reward the
agent receives. The two most commonly used learning objectives are: First,
maximizing the agent’s expected total reward in repeated attempts at a
2 This is somewhat non-standard, and often R_t^π is taken to be the reward produced by
executing action A_{t−1}^π.

task. Second, maximizing the long-term average reward per-action in a task
that continues indefinitely. The following subsections introduce these two
learning objectives, together with additional conditions on the environment
that make learning possible.

2.1.1 Total Reward in Episodic MDPs


The first formal learning objective that we consider is appropriate when the
agent repeatedly tries the same task, and each attempt takes finite time.
Each attempt is called an episode, and in this case, a natural formal goal
is for the agent to maximize the total reward she earns in each episode.
This objective is not appropriate when the agent’s experience isn’t naturally
divided into episodes, since the total reward over an infinite span of time is
generally also infinite. To accommodate this formal objective, we need to
introduce the notion of episodes into the MDP model.
Definition 2.3. An MDP (X , A, xstart , T) is said to be episodic if there
exists a unique state xterm ∈ X , called the terminal state, such that for
all actions a ∈ A the transition kernel T(xterm , a) places all of its mass on
(xterm , 0). In other words, once the MDP enters state xterm , it remains there
indefinitely while producing no reward.
Since nothing interesting happens after an episodic MDP enters its ter-
minal state, we are free to restart the MDP and let the agent try again. We
could also model the restarts directly in the MDP by adding a transition
from the terminal state back to the starting state, but it is formally more
convenient to have a single infinite trajectory of states, actions, and rewards
for each episode (even if after some time it remains in the same state with
zero reward).
The total episodic reward learning objective is defined as follows:
Definition 2.4. Let M be an episodic MDP. The expected total reward is
a map Jtotal : Π → R given by

\[ J_{\mathrm{total}}(\pi) = \mathbb{E}\Big[\sum_{t=1}^{\infty} R_t^{\pi}\Big]. \]

Let π ∈ Π be any policy for an episodic MDP and let

\[ T^{\pi} = \inf\{t \in \mathbb{N} : X_t^{\pi} = x_{\mathrm{term}}\} \]

be the (random) first time that an agent following policy π enters the ter-
minal state. Then we can rewrite the expected total reward as

\[ J_{\mathrm{total}}(\pi) = \mathbb{E}\Big[\sum_{t=1}^{T^{\pi}-1} R_t^{\pi}\Big], \]

since after time T^π the rewards are 0 with probability one.


Two useful theoretical tools for learning to maximize total reward in
episodic MDPs are the value and action-value functions. The value function
measures the expected total reward an agent following policy π will receive
starting from a given state x, and the action-value function measures the
same when the agent starts from state x and takes action a first.

Definition 2.5. Let M be an episodic MDP. The value function Vtotal :
X × Π → R is defined by

\[ V_{\mathrm{total}}(x, \pi) = \mathbb{E}_x\Big[\sum_{t=1}^{\infty} R_t^{\pi}\Big], \]

where E_x denotes the expectation where the environment starts in state x,
rather than xstart . For each policy π, the map x ↦ Vtotal (x, π) is usually
called the value function for policy π.

Definition 2.6. Let M be an episodic MDP. The action-value function
Qtotal : X × A × Π → R is defined by

\[ Q_{\mathrm{total}}(x, a, \pi) = \mathbb{E}_{x,a}\Big[\sum_{t=1}^{\infty} R_t^{\pi}\Big], \]

where E_{x,a} denotes the expectation where the environment starts in state
x and the agent's first action is a. For each policy π, the map (x, a) ↦
Qtotal (x, a, π) is usually called the action-value function for policy π.

Since the state transitions in an MDP do not depend on the history of
states, any time an agent following policy π finds herself in state x, her
expected total reward until the end of the episode is given by Vtotal (x, π).
Similarly, whenever the agent finds herself in state x and she takes action a,
then Qtotal (x, a, π) is her expected total reward until the end of the episode.
In other words, for any time t, we have

\[ \mathbb{E}\Big[\sum_{s=t}^{\infty} R_s^{\pi} \,\Big|\, X_t^{\pi} = x\Big] = V_{\mathrm{total}}(x, \pi) \]

and

\[ \mathbb{E}\Big[\sum_{s=t}^{\infty} R_s^{\pi} \,\Big|\, X_t^{\pi} = x,\, A_t^{\pi} = a\Big] = Q_{\mathrm{total}}(x, a, \pi) \]

whenever the events being conditioned on happen with non-zero probability.
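Since these conditional expectations are defined by rollouts of the policy, a simple (if
inefficient) way to approximate the value function is by Monte Carlo simulation. The sketch
below is my own illustration under the array conventions of the earlier sketches; it is not
an algorithm from the thesis.

import numpy as np

def estimate_value(P, r, pi, x, x_term, num_rollouts=1000, max_steps=10_000,
                   rng=np.random.default_rng()):
    """Monte Carlo estimate of V_total(x, pi): average total reward of rollouts
    started from state x and run until the terminal state x_term is reached."""
    total = 0.0
    for _ in range(num_rollouts):
        state, episode_return = x, 0.0
        for _ in range(max_steps):
            if state == x_term:                       # zero reward forever afterwards
                break
            a = rng.choice(P.shape[1], p=pi[state])
            episode_return += r[state, a]
            state = rng.choice(P.shape[2], p=P[state, a])
        total += episode_return
    return total / num_rollouts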

2.1.2 Average Reward in Ergodic MDPs


The second formal learning objective that we consider is appropriate when
we can’t naturally divide the agent’s experience into episodes. In this case,
the agent’s total reward on her infinite trajectory of states, actions, and
rewards is generally also infinite. Given two policies that both have total
reward diverging to infinity, how should we choose between them? A natural
idea is to choose the policy that gives the fastest divergence. The long-term
average reward per-action measures the asymptotic rate that a policy’s total
reward diverges.

Definition 2.7. Let M be any MDP. The average reward learning objective
is a map Javg : Π → R defined by

\[ J_{\mathrm{avg}}(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=1}^{T} R_t^{\pi}\Big]. \]

There is a potential problem with choosing policies that maximize Javg :


Since the agent changes her policy during her interaction with the environ-
ment, all of the policies she follows except for the first will not be started

from the starting state xstart . Therefore, if we use Javg to choose between poli-
cies, we would like to impose some constraints on the MDP that ensure the
long-term average reward of a policy does not depend on the starting state.
Otherwise, the agent may be encouraged to choose a policy which has high
average reward when started from the starting state, but which performs
poorly given the environment’s current state.
A relatively mild condition on the environment that ensures the average
reward of a policy does not depend on the starting state is ergodicity:

Definition 2.8. An MDP (X , A, xstart , T) is said to be weakly ergodic 3 if,
for each policy π ∈ Π, there exists a unique distribution ν(π) ∈ ∆X , called
the stationary distribution of π, such that

\[ \nu(x', \pi) = \sum_{x \in \mathcal{X}} \sum_{a \in \mathcal{A}} \nu(x, \pi)\, \pi(x, a)\, P(x, a, x'), \]

where ν(x, π) denotes the probability mass given to state x by the distribu-
tion ν(π).

The condition in Definition 2.8 is a fixed-point equation and states that
if I sample a random state x from ν(π) and then take a single step according
to the policy π to get a new state x′, then the distribution of x′ is exactly
ν(π) again.
It is well known that in weakly ergodic MDPs, it is possible to rewrite
the average reward learning objective in terms of the stationary distribution
as follows:
\[ J_{\mathrm{avg}}(\pi) = \sum_{x, a} \nu(x, \pi)\, \pi(x, a)\, r(x, a) = \mathbb{E}[r(X, A)], \quad \text{where } X \sim \nu(\pi),\ A \sim \pi(X). \]

The starting state of the MDP no longer appears in this expression for the
average reward, and therefore the average reward does not depend on the
starting state.
3 In the remainder of this thesis, I will refer to weakly ergodic MDPs simply as ergodic
MDPs.
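The rewritten form of Javg suggests a direct way to evaluate it numerically when the model is
known: approximate the stationary distribution of the policy by iterating the induced
state-transition matrix and then average the expected rewards under it. The sketch below is
my own illustration, not part of the thesis.

import numpy as np

def average_reward(P, r, pi, num_iters=1000):
    """P: (X, A, X) transition probabilities, r: (X, A) expected rewards,
    pi: (X, A) action probabilities.  Assumes the induced chain is ergodic."""
    num_states = P.shape[0]
    # Induced chain under pi: M[x, x'] = sum_a pi[x, a] * P[x, a, x']
    M = np.einsum('xa,xay->xy', pi, P)
    nu = np.full(num_states, 1.0 / num_states)   # any initial distribution works
    for _ in range(num_iters):
        nu = nu @ M                               # one step of the chain: nu <- nu M
    # J_avg(pi) = sum_{x,a} nu(x) pi(x, a) r(x, a)
    return float(np.einsum('x,xa,xa->', nu, pi, r))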

Like in the total reward setting, the value and action-value functions
are useful theoretical tools which have essentially the same interpretation as
before. Now, rather than measuring the total reward following a given state,
they measure the transient benefit of being in a given state, or state-action
pair, compared with the long-term average.

Definition 2.9. Let M be an ergodic MDP. The value function Vavg :
X × Π → R is defined by

\[ V_{\mathrm{avg}}(x, \pi) = \mathbb{E}_x\Big[\sum_{t=1}^{\infty} \big(R_t^{\pi} - J_{\mathrm{avg}}(\pi)\big)\Big]. \]

Definition 2.10. Let M be an ergodic MDP. The action-value function
Qavg : X × A × Π → R is defined by

\[ Q_{\mathrm{avg}}(x, a, \pi) = \mathbb{E}_{x,a}\Big[\sum_{t=1}^{\infty} \big(R_t^{\pi} - J_{\mathrm{avg}}(\pi)\big)\Big]. \]

2.2 Markov decision processes with changing rewards
MDPs are not suitable models for all environments. In particular, since the
transition probability kernel T of an MDP is unchanging with time, it can
be difficult to model environments whose dynamics change over time. This
section describes Markov decision processes with changing rewards (MD-
PCRs), which are a class of environment model that capture some kinds of
non-stationarity. This class of problems has been the focus of recent research
efforts [EKM2005, EKM2009, NGS2010, NGSA2014, YMS2009] and goes
by several different names, the most common of which is “Online MDP”. I
choose to use the name MDPCR because “Online MDP” is not universally
used, and I think MDPCR is more descriptive.
Before describing MDPCRs, I would like to comment on an alternative
approach to modeling non-stationary environments. In principle, we can
model a non-stationary environment as an MDP by including a description

of the environment’s current behaviour in the state. For example, if the en-
vironment switches between a finite number of modes, then we could include
an integer in the MDP state that indicates which mode the environment is
currently in. The drawback of this approach is that, in the MDP framework,
the agent completely observes the state, so the agent must be able to ob-
serve the environment’s current mode, which is a rather strong requirement.
One way to avoid this requirement is to modify the MDP model so that
the agent only partially observes the state. This kind of model is called a
partially observable MDP (POMDP). POMDPs are an interesting research
topic, but not a focus of this thesis.
MDPCRs are a different approach to modeling non-stationary environ-
ments. They keep the assumption that the agent completely observes the
modeled state and that the state transitions are Markovian. The difference
is that the reward for taking action a from state x changes over time in
a non-stochastic way. Specifically, there is an unknown sequence of reward
functions r1 , r2 , . . . and executing action at from state xt at time t produces
a reward rt (xt , at ).

Definition 2.11. A Markov decision process with changing rewards (MD-
PCR) is a tuple (X , A, xstart , P, (rt )t∈N ) where X is a finite set of states, A
is a finite set of actions, xstart ∈ X is the starting state, P : X × A → ∆X
encodes the state transition probabilities, and each rt : X × A → R is a
reward function.

The interpretation of all quantities in Definition 2.11 is the same as for
regular MDPs, with the exception of the sequence of reward functions. In
this thesis, I consider the case where the agent knows the set of states, the
set of actions, the starting state, and the state transition probabilities, and
the only unknown quantity is the sequence of reward functions.
There are two different protocols under which the agent learns about
the reward functions. The first, which is sometimes called evaluative feed-
back (or bandit feedback), is where the agent only observes the scalar value
rt (Xt , At ) after executing action At from state Xt . The second, which is
sometimes called instructive feedback (or full-information feedback) is where

the agent observes the entire reward function rt : X × A → R after choos-
ing an action at time t. The evaluative feedback setting is more useful in
real-world tasks, since often the rewards are determined by some real-world
process and we only see the outcome of the action that was taken. The
instructive feedback case is still interesting theoretically, sometimes useful
in practice, and acts as a stepping stone towards developing algorithms for
the evaluative feedback setting.
The main usefulness of MDPCRs is modeling tasks where the agent’s
rewards depend on a difficult-to-model aspect of the environment. For ex-
ample, suppose the agent’s task is to explore a maze searching for treasure.
The agent’s actions have predictable (and time-invariant) effects on her lo-
cation in the maze and are therefore easily modeled with Markovian transi-
tion probabilities. Suppose, however, that there is a wizard that periodically
creates and destroys treasures throughout the maze. The agent’s reward de-
pends not only on her position in the maze, but also on the recent actions of
the wizard. This problem is difficult to model as an MDP, since the rewards
must be sampled from a time-invariant distribution that depends only on
the most recent state and action. This forces the state to explicitly model
(at least part of) the wizard’s behaviour, which may be very complicated.
On the other hand, it is easy to model this task as an MDPCR, since we
can leave the wizard out of the state entirely and use the sequence of reward
functions to model the moving treasures. Similar problems arise in many
situations where the agent interacts with another entity with agency, such
as a human user or another machine.
Like in the MDP case, we consider two different formal learning objec-
tives: one suitable for episodic tasks and one suitable for continuing tasks.
For each formal learning objective, we consider a sub-class of MDPCRs
where learning is possible. The sub-classes considered in this thesis are more
restrictive than in the MDP setting, and an interesting open question is to
design learning algorithms for more general settings. In the episodic setting,
we consider learning in loop-free episodic MDPCRs and in the continuing
setting, we consider uniformly ergodic MDPCRs.
Before giving detailed descriptions of the two cases, I will discuss some

common features of both models.
In each case, we define a sequence of performance functions J1 , J2 , . . . ,
where JT (π1 , . . . , πT ) is a function of T policies and represents the expected
performance of an agent following policies π1 , . . . , πT for the first T time-
steps. For example, we might define

\[ J_T(\pi_1, \ldots, \pi_T) = \mathbb{E}_{\pi_{1:T}}\Big[\sum_{t=1}^{T} r_t(X_t, A_t)\Big] \]

to be the expected total reward earned by an agent following policies
π1 , . . . , πT for the first T time steps. The reason why we need a sequence of
performance functions, rather than a single performance function like for
MDPs, is because the performance depends on the changing sequence of
reward functions. This thesis focuses on the formal learning objective of
maximizing JT (π1 , . . . , πT ) for some fixed time-horizon T . There are stan-
dard techniques, such as the doubling trick (see, for example, Section 2.3.1
of [S2012]) that allow these algorithms to be extended to the case when the
time-horizon T is not known in advance.
In our formal analysis, it will be more convenient to work with the regret,
which is defined below.

Definition 2.12. Let J1 , J2 , . . . be any sequence of performance functions.
The regret of the sequence of policies π1 , . . . , πT ∈ Π relative to a fixed policy
π ∈ Π at time (or episode) T is given by

\[ R_T(\pi_1, \ldots, \pi_T; \pi) = J_T(\pi, \ldots, \pi) - J_T(\pi_1, \ldots, \pi_T). \]

In words, it is the gap in performance between an agent that follows policies
π1 , . . . , πT and an agent that follows policy π on every time step. The regret
of the sequence relative to the set of Markov policies is given by

\[ R_T(\pi_1, \ldots, \pi_T) = \sup_{\pi \in \Pi} R_T(\pi_1, \ldots, \pi_T; \pi) = \sup_{\pi \in \Pi} J_T(\pi, \ldots, \pi) - J_T(\pi_1, \ldots, \pi_T). \]

Minimizing regret is equivalent to maximizing the performance function,
but as we will see later, the regret is easier to analyze. In particular, we will
be able to provide upper bounds on the regret which depend only loosely
on the actual sequence of rewards.

2.2.1 Loop-free Episodic MDPCRs


A loop-free episodic MDPCR is much like an episodic MDP with two ad-
ditional constraints: The agent can never visit the same state twice in a
single episode and every episode has the same length. Formally, we have the
following definition

Definition 2.13. A loop-free MDPCR is a tuple (X , A, P, (rt )t∈N ) such
that the state space X can be partitioned into L layers X1 , . . . , XL with the
following properties:

1. the first layer contains a unique starting state: X1 = {xstart };

2. the last layer contains a unique terminal state: XL = {xterm };

3. for every action a and every t, we have rt (xterm , a) = 0;

4. and, for any states x and x′ and any action a, if P(x, a, x′) > 0, then
either x = x′ = xterm or there exists a layer index 1 ≤ l < L such that
x ∈ Xl and x′ ∈ Xl+1 .

The above conditions guarantee that every episode in a loop-free episodic
MDPCR will visit exactly L distinct states, one from each layer. The agent
starts in the first layer, which contains only the starting state, and proceeds
through the layers until arriving at the final layer, which contains only the
terminal state. Once the agent enters the terminal state, the rest of the
trajectory remains in the terminal state and produces zero reward.
It is natural to measure time in loop-free episodic MDPCRs by counting
episodes, rather than time steps. Since the agent will never return to any
state in a single episode, there is no reason for her to update her action
selection probabilities for that state before the end of the episode. Similarly,

we can take the duration of each reward function to be an entire episode,
rather than a single time step. Therefore, we denote by πτ and rτ the
agent’s policy and the reward function for episode τ , respectively. Finally,
we measure the agent’s regret after T episodes, rather than after T time
steps of interaction.
The performance function that we consider in loop-free episodic MD-
PCRs is the total reward earned over the first T episodes:

Definition 2.14. In an MDPCR, the expected total reward of the policies
π1 , . . . , πT in the first T episodes is given by

\[ J_{\mathrm{total},T}(\pi_1, \ldots, \pi_T) = \sum_{\tau=1}^{T} \mathbb{E}_{\pi_\tau}\Big[\sum_{t=1}^{L} r_\tau(X_t, A_t)\Big], \]

where E_{π_τ} denotes the expectation where actions are selected according to
policy πτ .

For this performance function, we can write the regret (relative to all
Markov policies) as follows:

\[ R_{\mathrm{total},T}(\pi_1, \ldots, \pi_T) = \sup_{\pi \in \Pi} \sum_{\tau=1}^{T} \left\{ \mathbb{E}_{\pi}\Big[\sum_{t=1}^{L} r_\tau(X_t, A_t)\Big] - \mathbb{E}_{\pi_\tau}\Big[\sum_{t=1}^{L} r_\tau(X_t, A_t)\Big] \right\}. \]

Suppose that an agent’s regret grows sublinearly with T . Then for any
policy π, we have

\[ R_{\mathrm{total},T}(\pi_{1:T}; \pi)/T = \frac{1}{T}\sum_{\tau=1}^{T} \mathbb{E}_{\pi}\Big[\sum_{t=1}^{L} r_\tau(X_t, A_t)\Big] - \frac{1}{T}\sum_{\tau=1}^{T} \mathbb{E}_{\pi_\tau}\Big[\sum_{t=1}^{L} r_\tau(X_t, A_t)\Big] \leq R_{\mathrm{total},T}(\pi_1, \ldots, \pi_T)/T \to 0. \]

Taking π to be the best Markov policy, we have that the average episodic
reward of the agent is converging to the average episodic reward of the
best Markov policy. Therefore, our main goal is to show that the regret of
our algorithms grows sublinearly. Naturally, slower growth rates are more
desirable.

2.2.2 Uniformly Ergodic MDPCRs
Uniformly ergodic MDPCRs are very similar to ergodic MDPs. Recall that
ergodic MDPs were characterized by the existence of a unique stationary
distribution ν(π) ∈ ∆X for each policy π that described the long-term av-
erage state visitation probabilities while following policy π. The condition
in uniformly ergodic MDPCRs guarantees that every policy has a unique sta-
tionary distribution, that the finite-time state-visitation probabilities νt (π)
(νt (x, π) = P(X_t^π = x)) converge to the unique stationary distribution, and
that the rate of convergence of νt (π) to ν(π) is uniformly fast over all policies.
Formally, we have the following definition

Definition 2.15. An MDPCR (X , A, xstart , P, (rt )t∈N ) is said to be uni-
formly ergodic if there exists a constant τ ≥ 0 such that for any two distri-
butions ν, ν′ ∈ ∆X and any policy π, we have

\[ \| \nu P^{\pi} - \nu' P^{\pi} \|_1 \leq e^{-1/\tau} \| \nu - \nu' \|_1, \]

where P^π is a linear operator on distributions corresponding to taking a single
step according to π, defined component-wise as follows:

\[ (\nu P^{\pi})(x') = \sum_{x} \nu(x) \sum_{a} \pi(x, a)\, P(x, a, x'). \]

The above condition implies ergodicity in the sense of Definition 2.8.


Suppose an agent follows policy π on every time step and let νt (π) ∈ ∆X
denote her probability distribution over states at time t. Since the agent
starts in state xstart , we have that ν1 (x, π) = I{x = xstart }. Using the
notation from above, it is not difficult to check that νt (π) = ν1 (P^π)^{t−1}. The
condition shows that the operator P π is a contraction and, by the Banach
fixed point theorem, we have that the sequence νt (π) converges to the unique
fixed point of P π , which we denote by ν(π). Being a fixed point of P π is
exactly the definition of a stationary distribution from Definition 2.8.
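As a small numerical illustration (my own, not from the thesis), one can probe the
contraction property of a single policy's operator P^π by comparing the 1-norm distance
between pairs of distributions before and after one application of the operator; the sampled
maximum only lower-bounds the true contraction factor e^{−1/τ}.

import numpy as np

def estimated_contraction_factor(M, num_pairs=1000, rng=np.random.default_rng(0)):
    """M is the induced transition matrix of a fixed policy pi,
    M[x, x'] = sum_a pi[x, a] * P[x, a, x']."""
    num_states = M.shape[0]
    worst = 0.0
    for _ in range(num_pairs):
        nu = rng.dirichlet(np.ones(num_states))        # random distribution nu
        nu_prime = rng.dirichlet(np.ones(num_states))  # random distribution nu'
        before = np.abs(nu - nu_prime).sum()
        after = np.abs(nu @ M - nu_prime @ M).sum()
        worst = max(worst, after / before)
    return worst   # lower bound on the contraction factor over the sampled pairs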
Uniform ergodicity is a considerably stronger requirement than ergod-
icity, and an interesting open question is to decide if there exist learning

algorithms for (non-uniform) ergodic MDPCRs with provably good perfor-
mance.
The performance function we use for uniformly ergodic MDPCRs is very
similar to the one used for the loop-free episodic MDPCRs, except that the
total reward is measured in terms of time steps, rather than episodes:

Definition 2.16. In a uniformly ergodic MDPCR, the expected total re-
ward of the policies π1 , . . . , πT in the first T time steps is given by

\[ J_{\mathrm{total},T}(\pi_1, \ldots, \pi_T) = \mathbb{E}_{\pi_{1:T}}\Big[\sum_{t=1}^{T} r_t(X_t, A_t)\Big], \]

where E_{π_{1:T}} denotes the expectation where the t-th action At is chosen ac-
cording to policy πt .

For this performance function, we can write the regret (relative to all
Markov policies) as follows:

\[ R_{\mathrm{total},T}(\pi_1, \ldots, \pi_T) = \sup_{\pi \in \Pi} \left\{ \mathbb{E}_{\pi}\Big[\sum_{t=1}^{T} r_t(X_t, A_t)\Big] - \mathbb{E}_{\pi_{1:T}}\Big[\sum_{t=1}^{T} r_t(X_t, A_t)\Big] \right\}. \]

By exactly the same argument as for loop-free episodic MDPCRs, an
agent with sublinear regret will have average reward converging to the av-
erage reward of the best Markov policy.

Chapter 3

Gradient Methods

This chapter introduces three optimization algorithms: gradient ascent,
stochastic gradient ascent, and online mirror ascent. These algorithms are
relevant to this thesis because all of the considered learning algorithms are
based on mathematical optimization. Policy gradient methods, which are
the focus of the first project, are an instance of stochastic gradient ascent.
All three algorithms introduced in the second project are instances of online
mirror ascent. Gradient ascent is included in the discussion because it is a
good starting point for describing the other two.
The goal of mathematical optimization can be stated formally as follows:
Given a function f : K → R where K ⊂ Rd is a subset of d-dimensional
space, find a vector w ∈ K that maximizes f (w). We use the following
notation to write this problem:

\[ \operatorname*{argmax}_{w \in K} f(w). \]

The difficulty of finding a maximizer depends heavily on the structural
properties of f and K. For example, when f is a concave function and K is a
convex set, the global maximizer of f can be efficiently found. When f is
not concave, the best we can hope for is to find a local maximum of the
function f .

3.1 Gradient Ascent
The gradient ascent algorithm can be applied whenever the objective func-
tion f is differentiable. Gradient ascent produces a sequence of vectors w1 ,
w2 , . . . such that the function value of f along the sequence is increasing.
In this section, we only consider so-called unconstrained maximization prob-
lems where K = Rd . Pseudocode for gradient ascent is given in Algorithm 1.

Input: step-size η > 0.


1 Choose w1 = 0 ∈ Rd ;
2 for each time t = 1, 2, . . . do
3 Optionally use wt in some other computation;
4 Set wt+1 = wt + η∇f (wt );
5 end
Algorithm 1: Gradient Ascent
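As a concrete rendering of Algorithm 1 (my own minimal sketch; the objective and step size
are illustrative choices, not from the thesis), the following Python code runs gradient
ascent on a simple concave function.

import numpy as np

c = np.array([1.0, -2.0, 3.0])      # gradient ascent on f(w) = -||w - c||^2 / 2

def grad_f(w):
    return c - w                     # gradient of f at w

eta = 0.1                            # step size
w = np.zeros_like(c)                 # w_1 = 0
for t in range(200):
    w = w + eta * grad_f(w)          # w_{t+1} = w_t + eta * grad f(w_t)
print(w)                             # converges to the maximizer c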

There is a simple geometric idea underlying gradient ascent. We imagine
that the graph of the function f is a landscape, where f (w) is the height
at location w (this analogy works best in R2 ). The gradient ∇f (w) is a
vector that points in the direction from w that f increases the most rapidly.
Therefore, the gradient ascent update wt+1 = wt + η∇f (wt ) produces wt+1
by taking a small step uphill from wt . It is appropriate to think of gradient
ascent as optimizing a function the way a walker searches for the highest
point in a park by walking uphill.
We can also motivate the gradient ascent update as maximizing a linear
approximation to the function f . Specifically, since f is differentiable, the
first order Taylor expansion gives a linear approximation to f :

\[ f(u) \approx f(w) + \nabla f(w)^{\top} (u - w). \]

This approximation is accurate whenever u is sufficiently close to w. A naive
idea would be to fix some w0 ∈ Rd and maximize the linear approximation
w ↦ f(w0) + ∇f(w0)^⊤(w − w0). There are two problems with this approach:
first, the linearized objective function is unbounded (unless it is constant)

and therefore has no maximizer. Second, the linear objective is only a good
approximation of f near the point w0 , so we should only trust solutions
that are near to w0 . A better idea is to successively maximize linear ap-
proximations to the function f , with a penalty that prevents our solutions
from being too far from the region where the approximation is accurate.
Formally, we might set

\[ w_{t+1} = \operatorname*{argmax}_{u \in \mathbb{R}^d} \ \eta\big(f(w_t) + \nabla f(w_t)^{\top}(u - w_t)\big) - \tfrac{1}{2}\|u - w_t\|_2^2. \]

The objective function in the above update has two competing terms. The
first term encourages wt+1 to maximize the linear approximation of f at the
point wt and the second term encourages wt+1 to stay near to the point wt .
The step size parameter η trades off between these two competing objectives.
The above objective is a concave quadratic function of u, and we can express
its unique maximizer in closed form as

\[ w_{t+1} = \operatorname*{argmax}_{u \in \mathbb{R}^d} \ \eta\big(f(w_t) + \nabla f(w_t)^{\top}(u - w_t)\big) - \tfrac{1}{2}\|u - w_t\|_2^2 = w_t + \eta \nabla f(w_t), \]

which is exactly the gradient ascent update. Therefore, we may think of
the gradient ascent update rule as maximizing a linear approximation to
the function f with an additional penalty that keeps the maximizer near to
the previous guess. When we view gradient ascent in this way, we see that
the update equation is defined based on the squared 2-norm. We will see
in Section 3.3 that we can derive similar algorithms where the squared
2-norm distance is replaced by another distance function.
In addition to the above intuitions, we care that gradient ascent actually
maximizes functions. The following theorem guarantees that as long as the
step size is sufficiently small, the gradient ascent algorithm converges to a
stationary point of the function f .

Theorem 3.1. Let f : Rd → R be such that f ∗ = supw f (w) < ∞ and ∇f
is L-Lipschitz. That is,

\[ \|\nabla f(u) - \nabla f(v)\|_2 \leq L\, \|u - v\|_2 \quad \text{for all } u, v \in \mathbb{R}^d. \]

If the step-size satisfies η < 1/L, then the sequence (wt )t produced by gradi-
ent ascent converges to some point w∞ such that ∇f (w∞ ) = 0.
In most cases, this result is enough to guarantee that gradient ascent
converges to a local maximum of the function f . However, it is possible to
get unlucky and converge to some other point where the gradient of f is
zero, such as a local minimum or a saddle-point of f . Since maximizing
arbitrary non-concave functions is computationally intractable, this is the
best result we can hope for.
When the function f is concave, the situation is much better:
Theorem 3.2. Let f : Rd → R be a concave function with maximizer w∗ .
Let (wt )t be the sequence of points produced by running gradient ascent with
step size η > 0. Suppose that ‖w∗‖2 ≤ B and ‖∇f (wt )‖2 ≤ G for all times
t = 1, . . . , T . Then

\[ \sum_{t=1}^{T} \big(f(w^*) - f(w_t)\big) \leq \frac{B^2}{\eta} + \eta T G^2. \]

Setting the step size to be

\[ \eta = \frac{B}{G\sqrt{T}} \]

gives the best bound of

\[ \sum_{t=1}^{T} \big(f(w^*) - f(w_t)\big) \leq 2BG\sqrt{T}. \]

Proof. This is a standard result. Since f is concave, for any points u and w
in Rd , we have

\[ f(w) \leq f(u) + \nabla f(u)^{\top} (w - u). \]

Rearranging this inequality gives f(w) − f(u) ≤ ∇f(u)^⊤(w − u). Taking
u = wt and w = w∗ gives f(w∗) − f(wt) ≤ ∇f(wt)^⊤(w∗ − wt). Summing
over times t, we have

\[ \sum_{t=1}^{T} \big(f(w^*) - f(w_t)\big) \leq \sum_{t=1}^{T} \nabla f(w_t)^{\top} (w^* - w_t). \]

Theorem 5.11 and Lemma 5.12, together with the facts that ‖·‖2 is self-
dual and 1-strongly convex with respect to itself, give the final result. The
above theorem and lemma are the main subject of Section 5.2.

This result shows that whenever we set the step size appropriately, the
total suboptimality (usually called the regret) of the sequence w1 , . . . , wT
produced by gradient ascent grows at a rate of only √T. Equivalently,
dividing by T shows that the average suboptimality (1/T) Σ_{t=1}^T (f(w∗) − f(wt))
goes to zero at least as quickly as 1/√T.

3.2 Stochastic Gradient Ascent


Stochastic gradient ascent is a variant of the gradient ascent algorithm that
can sometimes be used to maximize a function f even if we can’t compute
the value of f or its gradient ∇f . This method requires only that we are able
to produce random vectors whose expectation is equal to the gradient of the
function f . The idea behind stochastic gradient ascent is simply to use these
stochastic gradient estimates in place of the true gradients. Pseudocode is
given in Algorithm 2.

Input: step-size η > 0.


1 Choose w1 = 0 ∈ Rd ;
2 for each time t = 1, 2, . . . do
3 Optionally use wt in some other computation;
4 Set wt+1 = wt + η∇t where E[∇t | wt ] = ∇f (wt );
5 end
Algorithm 2: Stochastic Gradient Ascent
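A direct Python rendering of Algorithm 2 might look as follows (my own sketch, not from the
thesis); grad_estimate stands for any routine that returns a random vector whose conditional
expectation at w is ∇f(w).

import numpy as np

def stochastic_gradient_ascent(grad_estimate, dim, eta, num_steps):
    """Run Algorithm 2 with a user-supplied stochastic gradient oracle."""
    w = np.zeros(dim)                      # w_1 = 0
    iterates = [w.copy()]
    for t in range(num_steps):
        w = w + eta * grad_estimate(w)     # w_{t+1} = w_t + eta * grad_t
        iterates.append(w.copy())
    return iterates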

One common situation where we can’t evaluate f or ∇f , but for which

we can get unbiased stochastic estimates of the gradient is as follows: Let
P be a probability distribution that is unknown, but from which we can
sample. Set f (w) = E[g(w, X)] where g is a known function, and X has
distribution P . Since the distribution of X is unknown, we can’t evaluate f
or its gradient ∇f . But, let X have distribution P and set ∇ = ∇w g(w, X).
Then we have

E[∇] = E[∇w g(w, X)] = ∇w E[g(w, X)] = ∇f (w).

Therefore, even though we can't compute f or ∇f , we can produce a random
vector whose expectation is equal to ∇f (w) for any w ∈ Rd .
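For a toy instance of this construction (my own illustration), take g(w, x) = −‖w − x‖²/2, so
that ∇w g(w, X) = X − w and f(w) = E[g(w, X)] is maximized at w = E[X]; stochastic gradient
ascent then converges to the unknown mean of P.

import numpy as np

rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0, 3.0])        # E[X]; treated as unknown by the learner
eta = 0.05
w = np.zeros(3)                          # w_1 = 0
for t in range(2000):
    x = rng.normal(loc=mean, scale=1.0)  # draw X ~ P
    w = w + eta * (x - w)                # grad_w g(w, X) = X - w; E[X - w | w] = grad f(w)
print(w)                                 # approaches the maximizer E[X]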
In Algorithm 2, the condition on ∇t is that E[∇t | wt ] = ∇f (wt ). The
reason that the conditional expectation is used instead of a plain expectation
is that the sequence wt is itself random. It will generally not be the case that
the random vector ∇f (wt ) is equal to the constant E[∇t ]. The condition
E[∇t | wt ] = ∇f (wt ) is roughly equivalent to “given the value of wt , the
expectation of ∇t should be equal to ∇f (wt ).”
Since the sequence (wt )t∈N produced by Algorithm 2 is random, we can
only make probabilistic statements about how well the sequence optimizes
f . When the function f is non-concave, a standard result shows that the
random sequence (wt )t produced by stochastic gradient ascent almost surely
converges to a local maximum of the function f when a time-varying step size
is used that goes to zero at an appropriate rate. In practice, a constant step
size is often used instead, since this results in faster convergence early in the
optimization, at the cost of never quite driving the error to zero. I omit the
exact details of these results since they are quite technical and never used
directly in this thesis.
As with the deterministic case, the situation is much better when the
function f is concave. In this case, following essentially the same approach
as before, we have the following upper bound on the expected total subop-
timality (regret) of stochastic gradient ascent.

Theorem 3.3. Let f : Rd → R be a concave function with maximizer w∗ .
Let (wt )t be the sequence of points produced by running stochastic gradient
ascent with step size η > 0 and gradient estimates (∇t )t . Suppose that
‖w∗‖2 ≤ B and E[‖∇t‖2^2] ≤ G^2 for all times t = 1, . . . , T . Then

\[ \mathbb{E}\Big[\sum_{t=1}^{T} \big(f(w^*) - f(w_t)\big)\Big] \leq \frac{B^2}{\eta} + \eta T G^2. \]

Setting the step size to be

\[ \eta = \frac{B}{G\sqrt{T}} \]

gives the best bound of

\[ \mathbb{E}\Big[\sum_{t=1}^{T} \big(f(w^*) - f(w_t)\big)\Big] \leq 2BG\sqrt{T}. \]

Proof. The proof of this result is essentially identical to the proof of Theo-
rem 3.2 and is omitted.

Notice that the bound in Theorem 3.3 only depends on the distribution
of the gradient estimates ∇t by way of their second moment. Therefore,
to get the best possible bound, one should try to construct the gradient
estimates to have the smallest possible second moment.

3.3 Online Mirror Ascent


This section introduces online mirror ascent, which generalizes gradient as-
cent in two ways. First, it is an algorithm for a problem called online linear
optimization, which is a slightly more general problem than maximizing a
function f using only the gradient of f . Second, online mirror ascent has
an additional parameter called the regularizer function R : Rd → R that
defines a distance function that replaces the squared 2-norm distance. The
regularizer allows mirror ascent to better exploit the natural geometry of an
optimization problem. If we take the regularizer to be R(w) = (1/2)‖w‖2^2, then
we recover exactly gradient ascent, but there are other interesting cases as
well. For example, if the vectors represent probability distributions, then we
may want to measure distances in terms of the Kullback-Leibler divergence
instead of the squared 2-norm distance.
The problem solved by online mirror ascent is called online linear opti-
mization, defined as follows:

Definition 3.4. Online linear optimization is a game played between an
agent and her environment. On round t of the game, the agent chooses
a point wt from a convex set K ⊂ Rd . Following the agent's choice, the
environment chooses a payout vector rt ∈ Rd and the agent earns reward
given by the inner product r_t^⊤ w_t. The set K is fixed for all rounds of the
game. The agent's choice wt may only depend on w1:(t−1) and r1:(t−1) , while
the environment's choice of rt may depend on w1:t and r1:(t−1) .

Given a fixed time horizon T , the agent's goal is to maximize her total
reward in the first T rounds. Equivalently, she can minimize her regret
relative to the set K, given by

\[ R_T(w_{1:T}, r_{1:T}) = \sup_{w \in K} \sum_{t=1}^{T} r_t^{\top} w - \sum_{t=1}^{T} r_t^{\top} w_t = \sup_{w \in K} \sum_{t=1}^{T} r_t^{\top} (w - w_t). \]

We treat the online linear optimization problem in a game-theoretic style
and prove bounds on the regret for the worst-case sequence of payout vectors
r1:T under the constraint that r_t^⊤ w ∈ [0, 1] for all rounds t and w ∈ K.
The online mirror ascent algorithm is similar in spirit to gradient ascent,
but different in three important ways. First, rather than using the gradient
of a function to construct its update, it uses the payout vector from an
online linear optimization game. Second, rather than having an update that
is defined in terms of the squared Euclidean distance (as in gradient ascent),
the update is defined in terms of a so-called Bregman divergence, which
allows the algorithm to better take advantage of the underlying geometry of
a problem. For example, if the set K consists of probability distributions,
then it may make more sense to measure the distance between them by the
KL-divergence than by their squared Euclidean distance. Finally, we will
present online mirror ascent for constrained online linear optimization, where
K is a proper subset of Rd . In principle, (stochastic) gradient ascent can
also accommodate constrained optimization problems, but this discussion was
omitted above because it is not used in this thesis.
I will now introduce Bregman divergences and describe how they are
used in the mirror ascent update. First, we need the notion of a strongly
convex function:

Definition 3.5. A function f : S → R with S ⊂ Rd is said to be σ-strongly
convex with respect to the norm ‖·‖ if

\[ f(u) \geq f(w) + \nabla f(w)^{\top}(u - w) + \frac{\sigma}{2}\|u - w\|^2 \]

for all vectors u and w in S.

Each strongly convex function induces a Bregman divergence on the
domain S, which is similar to a distance function.

Definition 3.6. Let R : S → R be a σ-strongly convex function with
respect to the norm ‖·‖. The Bregman divergence induced by R is a map
DR : S × S° → R defined by

\[ D_R(u, w) = R(u) - R(w) - \nabla R(w)^{\top}(u - w), \]

where S° denotes the interior of S.

The following lemma establishes some properties of Bregman divergences
that show they are somewhat similar to distance functions:

Lemma 3.7. Let R : S → R be a σ-strongly convex function with respect to
the norm ‖·‖ and let DR be the induced Bregman divergence. Then the following
statements hold:

1. DR (u, v) ≥ 0 for all vectors u and v in S,

2. DR (u, v) = 0 if and only if u = v,

3. (Pythagorean Theorem) If K is a convex subset of S, w ∈ S, u ∈ K,
and we set w′ = argmin_{v∈K} DR (v, w), then DR (u, w) ≥ DR (u, w′) +
DR (w′, w).

Even though Bregman divergences behave somewhat like distances, they are
not true distance functions because they are not symmetric and do not satisfy
the triangle inequality.
With these definitions in hand, we are ready to define online mirror
ascent. Algorithm 3 gives pseudocode. The online mirror ascent update
has two steps: first, we compute an unconstrained maximizer wt+1/2 of the
most recent payout vector together with a penalty that encourages wt+1/2
to not stray too far from wt . Usually this update step has a closed-form
expression that can be efficiently evaluated. The second step is to set wt+1
to be the projection of wt+1/2 back onto the constraint set K with respect
to the Bregman divergence DR . Theorem 3.8 bounds the regret of online
mirror ascent.

Theorem 3.8. Let $R : S \to \mathbb{R}$ be a $\sigma$-strongly convex regularizer with respect
to the norm $\|\cdot\|$ and $K \subset S$ be a convex set. Then for any sequence of payout
vectors $r_t$ and any fixed point $w \in K$, the regret (relative to $w$) of online
mirror ascent with step size $\eta > 0$ and regularizer $R$ satisfies

$$\sum_{t=1}^T r_t^\top (w - w_t) \le \frac{D_R(w, w_1)}{\eta} + \frac{\eta}{\sigma} \sum_{t=1}^T \|r_t\|_*^2 ,$$

where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$. Moreover, if $\|r_t\|_* \le G$ for all $t$, then we
have

$$\sum_{t=1}^T r_t^\top (w - w_t) \le \frac{D_R(w, w_1)}{\eta} + \frac{\eta T G^2}{\sigma} = 2G \sqrt{T D_R(w, w_1) / \sigma},$$

where the last line is obtained by taking $\eta = \sqrt{\frac{\sigma D_R(w, w_1)}{T G^2}}$, which is the
optimal value.

Proof. As before, this result is a special case of Theorem 5.11 together with
Lemma 5.12, so I defer the proof until Section 5.2.

Input: Step size η > 0, Regularizer R : S → R with S ⊃ K
1 Choose w1 ∈ K arbitrarily;
2 for each round t = 1, 2, . . . do
3     Optionally use wt in another computation;
4     Set wt+1/2 = argmax_{u∈S} η rt⊤ u − DR(u, wt);
5     Set wt+1 = argmin_{u∈K} DR(u, wt+1/2);
6 end
Algorithm 3: Online mirror ascent
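To make the two-step update concrete, the following Python sketch specializes online mirror ascent to the case where K is the probability simplex and R is the negative entropy; these assumptions (and the function names and toy payout sequence) are mine, chosen only for illustration. Under them the unconstrained step is a multiplicative update and the Bregman projection is a renormalization, so the method reduces to exponentiated gradient.

import numpy as np

def online_mirror_ascent_simplex(payouts, eta):
    """Online mirror ascent when K is the probability simplex and R is the
    negative-entropy regularizer (a minimal sketch of Algorithm 3 under
    those assumptions)."""
    d = len(payouts[0])
    w = np.full(d, 1.0 / d)              # w_1: uniform point in the simplex
    iterates = []
    for r in payouts:
        iterates.append(w.copy())        # w_t is chosen before r_t is revealed
        w_half = w * np.exp(eta * r)     # unconstrained mirror step
        w = w_half / w_half.sum()        # Bregman (KL) projection back onto the simplex
    return iterates

# toy usage: the second coordinate always pays more, so weight shifts toward it
rs = [np.array([0.0, 1.0])] * 50
ws = online_mirror_ascent_simplex(rs, eta=0.1)
print(ws[-1])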
Chapter 4

Policy Gradient Methods


and Baseline Functions

This chapter describes the first project that I worked on during my MSc
program. The goal of this project was to resolve an open question related to
policy gradient methods, which are learning algorithms for MDPs. Policy
gradient methods are instances of stochastic gradient ascent. Recall that to
apply stochastic gradient ascent, we must be able to sample random vec-
tors whose expectation is equal to the gradient of the objective function.
For both the total and average reward learning objectives, there are estab-
lished theoretical results that give methods for generating random gradient
estimates. Both gradient estimation schemes have a parameter called the
baseline function. The baseline does not change the expectation of the gradi-
ent estimates and therefore has no influence on the asymptotic performance
of policy gradient methods. The baseline function does, however, impact
the finite-time learning performance. How to choose the baseline function
is currently an open question. I propose that the baseline function should
be chosen to minimize the mean squared error (MSE) of the gradient esti-
mates. Under slightly idealistic conditions, I prove that this choice gives the
tightest bound on the suboptimality of the algorithm obtainable from the
standard analysis of stochastic gradient ascent. Unfortunately, the MSE-
minimizing baseline function depends on the transition probability kernel

T of the MDP, so the agent cannot directly compute it. The final contri-
bution of this project is to show that the MSE-minimizing baseline can be
estimated from only observable quantities.

4.1 Policy Gradient Methods


Policy gradient methods are learning algorithms for MDPs based on stochas-
tic gradient ascent. We will consider two specific algorithms, one for learning
to maximize the total reward in episodic MDPs, and another for learning to
maximize the long-term average reward in ergodic MDPs.
To apply stochastic gradient ascent, we need to express the problem of
choosing good policies in terms of maximizing a function f : Rd → R. One
way to accomplish this is to choose a scheme for representing policies as
vectors in Rd . Such a scheme is called a policy parameterization, and is
a function π : Rd → Π that gives us a policy for each parameter vector
θ ∈ Rd . The composition of a policy parameterization π : Rd → Π and a
formal learning objective J : Π → R gives us a map θ 7→ J(π(θ)) which
is a suitable objective function for stochastic gradient ascent. To simplify
notation, I will write J(θ) to mean J(π(θ)). The set of policies that can be
represented by a parameterization π is given by π(Rd ) = π(θ) : θ ∈ Rd .


Maximizing J(θ) over Rd is equivalent to maximizing J(π) over the set of


policies π(Rd ).
Typically, not every Markov policy will be representable by a policy
parameterization (i.e., π(Rd ) is a strict subset of Π). This is actually an
advantage of policy gradient methods. Intuitively, the difficulty of finding
good parameters for a parameterization scales with the number of parame-
ters. For many real-world problems, the reinforcement learning practitioner
can design a policy parameterization with only a few parameters but which
can still represent nearly optimal policies. This allows the practitioner to
leverage their understanding of a problem to get better performance in prac-
tice.
Policy gradient methods were enabled by the development of techniques
for generating random vectors that are equal in expectation to the gradient of

the learning objective. I will loosely call such a random vector an estimated
gradient. The remainder of this section presents two gradient estimation
techniques: one for the total reward in episodic MDPs originally due to
Williams [W1992] and one for the average reward in ergodic MDPs due to
Sutton et al. [SMSM2000]. The presentation is slightly modified from the
original sources to better fit with this thesis.
Both gradient estimates have the same basic structure. To estimate the
gradient ∇J(θ), the agent follows policy π(θ) for a short period of time.
For any state-action pair (x, a), the policy gradient ∇θ π(x, a, θ) is the di-
rection in parameter-space that the agent would move the parameter vector
θ to increase the probability of choosing action a from state x. The learn-
ing objective gradient estimate is the sum of the policy gradients over the
state-action pairs visited during the trial period, each scaled by a term that
depends on how well the agent performed following that action (the details
depend on J) divided by the probability of choosing action a from state x.
Intuitively, the effect of adding this gradient estimate to the agent’s param-
eter vector is to increase the probability of the actions that performed well,
and to increase even more the probability of actions that performed well and
are rarely chosen. The following two theorems give detailed constructions of
the gradient estimates and show that they are equal in expectation to the
gradient of the learning objective.

Theorem 4.1. Let $\pi : \mathbb{R}^d \to \Pi$ be a parametric policy for an episodic
MDP and let $\theta \in \mathbb{R}^d$ be any parameter vector. Let $(X_t^\theta, A_t^\theta, R_t^\theta)_{t=1}^\infty$ be the
sequence of states, actions, and rewards obtained by following policy $\pi(\theta)$
for a single episode, let $T^\theta$ be the first time the terminal state is entered,
and let $G_t^\theta = \sum_{s=t}^{T^\theta} R_s^\theta$ be the total reward earned after time $t$. Finally, let
$\varphi_t^\theta = \frac{\nabla_\theta \pi(X_t^\theta, A_t^\theta, \theta)}{\pi(X_t^\theta, A_t^\theta, \theta)}$ be the so-called vector of compatible features at time $t$.
Then, the random vector

$$\nabla_\theta = \sum_{t=1}^{T^\theta - 1} \varphi_t^\theta G_t^\theta$$

satisfies $\mathbb{E}[\nabla_\theta] = \nabla J_{\mathrm{total}}(\theta)$.

Since the policy parameterization π is chosen by the reinforcement learn-

ing practitioner, the policy gradient ∇θ π(x, a, θ) can be computed by the
agent. All other quantities that appear in the gradient estimate from Theo-
rem 4.1 are observable by the agent. Further, since the gradient estimate is
a function of the current parameter vector θ, whenever θ is itself random, we
have the property E[∇θ | θ] = ∇Jtotal (θ). Therefore, we can use these gra-
dient estimates in a simple policy gradient method that updates the policy
parameters once after each episode. Pseudocode is given in Algorithm 4.

Input: step-size η > 0.
1 Choose θ1 ∈ Rd arbitrarily;
2 for each episode index τ = 1, 2, . . . do
3     Run one episode following π(θτ) until the terminal state is reached;
4     Compute ∇θτ according to Theorem 4.1;
5     Set θτ+1 = θτ + η∇θτ ;
6 end
Algorithm 4: Policy Gradient Method for Episodic MDPs
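The following Python sketch shows how the estimate of Theorem 4.1 and the update of Algorithm 4 can be computed for a tabular Gibbs (softmax) policy; the parameterization and the episode data layout are illustrative assumptions rather than part of the formal development.

import numpy as np

def softmax_probs(theta, x):
    """Action probabilities of a tabular Gibbs (softmax) policy.
    theta has shape (num_states, num_actions)."""
    prefs = theta[x] - np.max(theta[x])
    p = np.exp(prefs)
    return p / p.sum()

def episodic_gradient_estimate(episode, theta):
    """The gradient estimate of Theorem 4.1 for a tabular softmax policy
    (a sketch; the tabular softmax parameterization is my own choice here).

    episode: list of (state, action, reward) tuples for one episode,
             with the terminal state excluded.
    Returns an array with the same shape as theta."""
    grad = np.zeros_like(theta)
    rewards = [r for (_, _, r) in episode]
    for t, (x, a, _) in enumerate(episode):
        G_t = sum(rewards[t:])               # total reward earned from time t on
        p = softmax_probs(theta, x)
        phi = np.zeros_like(theta)           # compatible features: grad(pi)/pi
        phi[x] = -p
        phi[x, a] += 1.0
        grad += phi * G_t
    return grad

# one update of Algorithm 4 with step size eta:
# theta = theta + eta * episodic_gradient_estimate(episode, theta)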

Theorem 4.2. Let $\pi : \mathbb{R}^d \to \Pi$ be a parametric policy for an ergodic MDP
and let $\theta \in \mathbb{R}^d$ be any parameter vector. Let $X^\theta$ be randomly sampled from
the stationary distribution $\nu(\pi(\theta))$, $A^\theta$ be sampled from $\pi(X^\theta, \cdot, \theta)$, and
$\varphi^\theta = \frac{\nabla_\theta \pi(X^\theta, A^\theta, \theta)}{\pi(X^\theta, A^\theta, \theta)}$ be the so-called vector of compatible features. Then the
random vector

$$\nabla_\theta = \varphi^\theta Q_{\mathrm{avg}}(X^\theta, A^\theta, \theta)$$

satisfies $\mathbb{E}[\nabla_\theta] = \nabla J_{\mathrm{avg}}(\theta)$.

Unlike in the episodic case, this gradient estimate depends on quantities
that are unknown to the agent. First, the action value function depends
on the MDP transition probability kernel T, which is unknown. In place
of the true action value function, we can use an estimate (for example, the
estimate from linear Sarsa(λ); for a modern account of Sarsa(λ), see Sec-
tion 7.5 of [SB1998]). Using estimated action values introduces a small bias
into the gradient estimates, but its effect on the performance of stochastic
gradient ascent is small. Second, the random variable X θ in the statement
of the theorem is drawn from the stationary distribution ν(π(θ)) which the
agent doesn’t know and can’t directly sample from. Rather than drawing
a sample from ν(π(θ)), the agent can simply use the state they visit while
interacting with the MDP. In practice, the distribution over states at time t
is close enough to the stationary distribution that this also introduces only
a small bias. So, in the average reward setting, the agent is able to com-
pute a random vector which is nearly an unbiased estimate of the gradient
of Javg (θ). Pseudocode for a policy gradient method based on this nearly
unbiased gradient estimate is given in Algorithm 5.

Input: step-size η > 0.
1 Choose θ1 ∈ Rd arbitrarily;
2 Initialize action value estimate q̂;
3 for each time t = 1, 2, . . . do
4     Receive state Xt from the environment;
5     Sample action At from π(Xt , ·, θt);
6     Receive reward Rt ;
7     Compute ∇θt according to Theorem 4.2 using Xt , At , and the
      estimated action value function q̂;
8     Set θt+1 = θt + η∇θt ;
9     Update the action value function estimate q̂;
10 end
Algorithm 5: Policy Gradient Method for Ergodic MDPs
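For the average reward case, the per-step estimate of Theorem 4.2 has an equally simple form once an action value estimate q̂ is available. The sketch below again assumes a tabular softmax parameterization, purely for illustration.

import numpy as np

def average_reward_gradient_estimate(theta, x, a, q_hat):
    """The per-step gradient estimate of Theorem 4.2 for a tabular softmax
    policy, with the unknown action values replaced by an estimate q_hat
    (a sketch; the parameterization and q_hat are assumptions).

    theta, q_hat: arrays of shape (num_states, num_actions)."""
    p = np.exp(theta[x] - np.max(theta[x]))
    p /= p.sum()
    phi = np.zeros_like(theta)          # compatible features grad(pi)/pi
    phi[x] = -p
    phi[x, a] += 1.0
    return phi * q_hat[x, a]

# the parameter update of Algorithm 5:
# theta = theta + eta * average_reward_gradient_estimate(theta, x_t, a_t, q_hat)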

4.2 Baseline Functions


In both of the gradient estimation techniques from the previous section,
the value zero plays a special role. In the total reward setting (and the
average reward setting is similar), each time t in an episode contributes an
update to the parameter vector that increases the probability of choosing
the action At from state Xt . The scale of this update is proportional to
the difference Gt − 0 = Gt , where Gt is the total reward following action
At . One consequence is that if Gt is positive, then the update will make
the action more probable, and if it is negative, then the update will make
the action less probable. This behaviour is strange, since there is nothing
special about earning zero total reward. In fact, some MDPs have only
negative rewards and, in this case, the agent never directly increases the
probability of choosing good actions. They are only increased as a side-
effect of decreasing the probability of worse actions more aggressively. This
raises the following questions: Can we compare the total reward Gt to a
value other than zero? And how does the choice affect the performance of
the policy gradient method?
We will see below that, rather than comparing to zero, we can compare
to any baseline value that depends on the state Xt and the agent’s parameter
vector θ. The function b : X → R that maps each state to the baseline value
in that state is called the baseline function. The following two results show
how to incorporate the baseline function into the gradient estimates from
the previous section.

Corollary 4.3. Let π : Rd → Π be a policy parameterization for an episodic
MDP, θ ∈ Rd be any parameter vector, and b : X → R be any baseline.
Assume that ∇θ π(xterm , a, θ) = 0 for all actions a and parameter vectors θ
(since the actions chosen in the terminal state have no effects, the policy
parameterization can always be made to satisfy this condition). Further,
suppose that E[T θ ] < ∞ for all parameter vectors θ. Using the notation of
Theorem 4.1, the random vector

$$\nabla_\theta^b = \sum_{t=1}^{T^\theta - 1} \varphi_t^\theta \left( G_t^\theta - b(X_t^\theta) \right)$$

satisfies $\mathbb{E}[\nabla_\theta^b] = \nabla J_{\mathrm{total}}(\theta)$.

Proof. From Theorem 4.1, it suffices to show that

$$\mathbb{E}\left[ \sum_{t=1}^{T^\theta - 1} \varphi_t^\theta b(X_t^\theta) \right] = 0.$$

For any time $t \ge T^\theta$, we have that $X_t^\theta = x_{\mathrm{term}}$ and therefore $\varphi_t^\theta = 0$. From
this, it follows that

$$\sum_{t=1}^{T^\theta - 1} \varphi_t^\theta b(X_t^\theta) = \sum_{t=1}^{\infty} \varphi_t^\theta b(X_t^\theta).$$

Consider any component $\theta_i$ of $\theta$. Letting $\varphi_{t,i}^\theta$ denote the $i$th component of
$\varphi_t^\theta$, we have

$$\begin{aligned}
\mathbb{E}\left[ \sum_{t=1}^{\infty} \left| \varphi_{t,i}^\theta b(X_t^\theta) \right| \right]
&= \mathbb{E}\left[ \sum_{t=1}^{\infty} \left| \frac{\frac{\partial}{\partial \theta_i} \pi(X_t^\theta, A_t^\theta, \theta)}{\pi(X_t^\theta, A_t^\theta, \theta)} b(X_t^\theta) \right| \right] \\
&= \mathbb{E}\left[ \sum_{x,a} \sum_{t=1}^{\infty} \mathbb{I}\left\{ X_t^\theta = x, A_t^\theta = a \right\} \left| \frac{\frac{\partial}{\partial \theta_i} \pi(x, a, \theta)}{\pi(x, a, \theta)} b(x) \right| \right] \\
&= \mathbb{E}\left[ \sum_{x \ne x_{\mathrm{term}},\, a} \sum_{t=1}^{\infty} \mathbb{I}\left\{ X_t^\theta = x, A_t^\theta = a \right\} \left| \frac{\frac{\partial}{\partial \theta_i} \pi(x, a, \theta)}{\pi(x, a, \theta)} b(x) \right| \right] \\
&= \sum_{x \ne x_{\mathrm{term}},\, a} \left| \frac{\frac{\partial}{\partial \theta_i} \pi(x, a, \theta)}{\pi(x, a, \theta)} b(x) \right| \cdot \mathbb{E}\left[ \sum_{t=1}^{\infty} \mathbb{I}\left\{ X_t^\theta = x, A_t^\theta = a \right\} \right] \\
&\le \sum_{x \ne x_{\mathrm{term}},\, a} \left| \frac{\frac{\partial}{\partial \theta_i} \pi(x, a, \theta)}{\pi(x, a, \theta)} b(x) \right| \cdot \mathbb{E}[T^\theta] \\
&< \infty,
\end{aligned}$$

where in the third line we used the fact that $\frac{\partial}{\partial \theta_i} \pi(x_{\mathrm{term}}, a, \theta) = 0$. Therefore, by
Fubini's theorem, we have for each $i$

$$\mathbb{E}\left[ \sum_{t=1}^{\infty} \varphi_{t,i}^\theta b(X_t^\theta) \right] = \sum_{t=1}^{\infty} \mathbb{E}\left[ \varphi_{t,i}^\theta b(X_t^\theta) \right].$$

In vector form, this gives

$$\mathbb{E}\left[ \sum_{t=1}^{\infty} \varphi_t^\theta b(X_t^\theta) \right] = \sum_{t=1}^{\infty} \mathbb{E}\left[ \varphi_t^\theta b(X_t^\theta) \right].$$

Using the shorthand $p_t(x, a) = \mathbb{P}(X_t^\theta = x, A_t^\theta = a)$, $p_t(a \mid x) = \mathbb{P}(A_t^\theta = a \mid X_t^\theta = x)$,
and $p_t(x) = \mathbb{P}(X_t^\theta = x)$, we can rewrite the expectation as a sum
over the possible state-action pairs

$$\begin{aligned}
\mathbb{E}\left[ \varphi_t^\theta b(X_t^\theta) \right]
&= \sum_{x,a} p_t(x, a) \frac{\nabla_\theta \pi(x, a, \theta)}{\pi(x, a, \theta)} b(x) \\
&= \sum_{x,a} p_t(a \mid x)\, p_t(x) \frac{\nabla_\theta \pi(x, a, \theta)}{\pi(x, a, \theta)} b(x) \\
&= \sum_{x,a} \pi(x, a, \theta)\, p_t(x) \frac{\nabla_\theta \pi(x, a, \theta)}{\pi(x, a, \theta)} b(x) \\
&= \sum_{x} p_t(x)\, b(x) \sum_{a} \nabla_\theta \pi(x, a, \theta) \\
&= 0,
\end{aligned}$$

where in the last line we used that $\sum_a \nabla_\theta \pi(x, a, \theta) = \nabla_\theta 1 = 0$. We have
shown that

$$\mathbb{E}\left[ \sum_{t=1}^{T^\theta - 1} \varphi_t^\theta b(X_t^\theta) \right] = 0,$$

as required.
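As the corollary indicates, incorporating a baseline changes the episodic estimate only by shifting the return by b(Xt) before it scales the compatible features. A small Python sketch, under the same illustrative tabular softmax assumptions used earlier:

import numpy as np

def episodic_gradient_with_baseline(episode, theta, baseline):
    """Gradient estimate of Corollary 4.3 for a tabular softmax policy
    (a sketch; the parameterization and the callable `baseline` are
    illustrative assumptions).

    episode:  list of (state, action, reward) tuples, terminal state excluded.
    baseline: function mapping a state to a real number."""
    grad = np.zeros_like(theta)
    rewards = [r for (_, _, r) in episode]
    for t, (x, a, _) in enumerate(episode):
        G_t = sum(rewards[t:])                      # return following time t
        p = np.exp(theta[x] - np.max(theta[x]))
        p /= p.sum()
        phi = np.zeros_like(theta)                  # compatible features
        phi[x] = -p
        phi[x, a] += 1.0
        grad += phi * (G_t - baseline(x))           # the baseline shifts the comparison point
    return grad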

Corollary 4.4. Let π : Rd → Π be a policy parameterization for an ergodic
MDP, θ ∈ Rd be any parameter vector, and b : X → R be any baseline
function. Using the notation of Theorem 4.2, the random vector

$$\nabla_\theta^b = \varphi^\theta \left( Q_{\mathrm{avg}}(X^\theta, A^\theta, \theta) - b(X^\theta) \right)$$

satisfies $\mathbb{E}[\nabla_\theta^b] = \nabla J_{\mathrm{avg}}(\theta)$.

Proof. Again it suffices to show that $\mathbb{E}[\varphi^\theta b(X^\theta)] = 0$. Using the shorthand
$p(x, a) = \mathbb{P}(X^\theta = x, A^\theta = a)$, $p(a \mid x) = \mathbb{P}(A^\theta = a \mid X^\theta = x)$, and
$p(x) = \mathbb{P}(X^\theta = x)$, we can rewrite the expectation as a sum

$$\begin{aligned}
\mathbb{E}\left[ \varphi^\theta b(X^\theta) \right]
&= \sum_{x,a} p(x, a) \frac{\nabla_\theta \pi(x, a, \theta)}{\pi(x, a, \theta)} b(x) \\
&= \sum_{x} p(x)\, b(x) \sum_{a} \nabla_\theta \pi(x, a, \theta) \\
&= 0.
\end{aligned}$$

Notice that in the proof of Corollary 4.4, the expectation of ϕθ b(X θ ) is
zero independently of the distribution over X θ . Therefore, even when we
compute the gradient estimate with the state visited by the agent, which is
not distributed according to the stationary distribution, the baseline func-
tion introduces no additional bias.
Two commonly used baseline functions are the constantly zero baseline
and the value function of the current policy: b(x) = V (x, θ), where V (x, θ) =
Vtotal (x, π(θ)) in the total reward setting and V (x, θ) = Vavg (x, π(θ)) in the
average reward setting. The value function baseline seems like a natural
choice, since it compares the agent’s sampled performance to her expected
performance. If she performs better than expected, then she should increase
the probability of the actions tried and decrease them otherwise.
A hint that these two baseline functions are sub-optimal is that they
do not depend in any way on the policy parameterization. That is, given
two policy parameterizations π1 : Rd1 → Π and π2 : Rd2 → Π, there may
be parameter vectors θ1 ∈ Rd1 and θ2 ∈ Rd2 such that π1 (θ1 ) = π2 (θ2 ).
In this case, the value function and constantly zero baseline will have the
same value for both policy parameterizations in every state. But, since the
baseline function’s purpose is to modify the parameter updates, it would be
surprising if we should choose the baseline in a way that does not depend
on the particular parameterization used.
[GBB2004] have proposed that the baseline function should be chosen
to minimize the variance of the gradient estimates (specifically, the trace
of the covariance matrix). They consider learning in partially observable

MDPs and a different performance gradient estimate than the two described
in the previous section, so their results do not directly carry over to the
two cases studied in this project, though there are strong similarities. They
derive closed-form expressions for the variance minimizing baseline, a theory
that analyzes the variance of different baseline functions, and algorithms for
estimating the variance minimizing baseline. To the best of my knowledge,
they do not show that the variance minimizing baseline is a good or optimal
choice.
The goal in the rest of the chapter is to explore the connection between
the baseline function and the performance of policy gradient methods, and
to decide which baseline function gives the best performance.

4.3 MSE Minimizing Baseline


In this section I argue that the baseline function should be chosen to min-
imize the mean squared error of the performance gradient estimates. This
section derives a closed form expression for the MSE minimizing baseline,
which reveals an interesting connection to the value function baseline. I also
show that when the learning objective is a concave function of the policy pa-
rameters, the MSE-minimizing baseline gives the best possible bound on the
agent’s total expected sub-optimality obtainable from a standard analysis
of stochastic gradient ascent.
First, I show that choosing the baseline to minimize the MSE of the
gradient estimates is equivalent to choosing it to minimize the trace of the
covariance matrix of the estimates, which is equivalent to choosing it to
minimize the second moment of the gradient estimates. This equivalence
is useful because minimizing the MSE makes intuitive sense, minimizing the
second moment is the easiest to work with formally, and minimizing the trace
of the covariance matrix shows that this idea is equivalent to minimizing the
variance, which was already proposed by Greensmith, Bartlett, and Baxter.
The idea of minimizing the variance is not new, but the following analysis
and estimation techniques are.

Lemma 4.5. Let $\mu \in \mathbb{R}^d$ be any vector and suppose for each function $b : \mathcal{X} \to \mathbb{R}$,
we have a random vector $\nabla^b$ such that $\mathbb{E}[\nabla^b] = \mu$ for all $b$. Then

$$\operatorname*{argmin}_{b : \mathcal{X} \to \mathbb{R}} \operatorname{tr} \operatorname{Cov}(\nabla^b) = \operatorname*{argmin}_{b : \mathcal{X} \to \mathbb{R}} \mathbb{E}\left\| \nabla^b - \mu \right\|_2^2 = \operatorname*{argmin}_{b : \mathcal{X} \to \mathbb{R}} \mathbb{E}\left\| \nabla^b \right\|_2^2 .$$

Proof. This result is a consequence of the fact that $\mathbb{E}[\nabla^b]$ does not depend
on $b$. Using the trace rotation equality (that is, $\operatorname{tr}(AB) = \operatorname{tr}(BA)$) and the
definition of the covariance matrix, we have

$$\begin{aligned}
\operatorname{tr} \operatorname{Cov}(\nabla^b)
&= \operatorname{tr}\left( \mathbb{E}\left[ (\nabla^b - \mu)(\nabla^b - \mu)^\top \right] \right) \\
&= \mathbb{E}\left[ \operatorname{tr}\left( (\nabla^b - \mu)(\nabla^b - \mu)^\top \right) \right] \\
&= \mathbb{E}\left[ (\nabla^b - \mu)^\top (\nabla^b - \mu) \right] \\
&= \mathbb{E}\left\| \nabla^b - \mu \right\|_2^2 ,
\end{aligned}$$

which proves the first equivalence. Expanding the definition of the squared
2-norm, we have

$$\mathbb{E}\left\| \nabla^b - \mu \right\|_2^2 = \mathbb{E}\left\| \nabla^b \right\|_2^2 + \left\| \mu \right\|_2^2 - 2\, \mathbb{E}[\nabla^b]^\top \mu = \mathbb{E}\left\| \nabla^b \right\|_2^2 - \left\| \mu \right\|_2^2 .$$

It follows that $\mathbb{E}\| \nabla^b - \mu \|_2^2$ and $\mathbb{E}\| \nabla^b \|_2^2$ differ by a constant that does not
depend on b. Therefore, the same function b will minimize both expressions.

Applying Lemma 4.5 to the case where µ = ∇J(θ) and ∇b is a gradi-
ent estimate with baseline b shows that minimizing the MSE is equivalent
to minimizing the trace of the covariance matrix, which is equivalent to
minimizing the second moment.

4.3.1 Regret Bound from the MSE Minimizing Baseline


In this section, I prove that the MSE-minimizing baseline has good theoret-
ical properties. Suppose that the learning objective J is a concave function
of the policy parameters and we use a policy gradient method to produce
a sequence of parameter vectors θ1 , . . . , θT . Let θ∗ be any parameter vec-
tor that maximizes J. We can use Theorem 3.3 to upper bound the sum
$\sum_{t=1}^T \mathbb{E}[J(\theta^*) - J(\theta_t)]$, which is one measure of the agent's learning perfor-
mance until time T. The upper bound that we get is

$$\sum_{t=1}^T \mathbb{E}[J(\theta^*) - J(\theta_t)] \le \frac{B^2}{\eta} + \eta T G^2 ,$$

where η is the step size used in the policy gradient method, B is an upper
bound on $\|\theta^*\|_2$, and $G^2$ is an upper bound on the second moments
of the stochastic gradient estimates. Setting the step size according to
$\eta = B/(G\sqrt{T})$ gives the best bound of $2BG\sqrt{T}$. The gradient estimates
only appear in this bound through their second moments, and minimizing
the second moments gives the best bound. Since minimizing the second
moments of the gradient estimates is equivalent to minimizing the MSE, the
MSE-minimizing baseline gives the best possible regret bound from Theo-
rem 3.3.
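For completeness, the step size quoted above is simply the minimizer of the right-hand side of the bound over η:

$$\frac{d}{d\eta}\left( \frac{B^2}{\eta} + \eta T G^2 \right) = -\frac{B^2}{\eta^2} + T G^2 = 0 \iff \eta = \frac{B}{G\sqrt{T}}, \qquad \text{which gives} \qquad \frac{B^2}{\eta} + \eta T G^2 = 2BG\sqrt{T}.$$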
The requirement that J be a concave function of the policy parameters
is almost never satisfied. However, it is often the case that J will be concave
in a neighbourhood of its local maxima. In this case, once the algorithm
enters one of these neighbourhoods, the above analysis holds if we replace
θ∗ with the local maximum.

4.3.2 MSE Minimizing Baseline for Average Reward


This section derives a closed form expression for the MSE minimizing base-
line for the average reward learning objective. The derivation for the average
reward is given before the derivation for the total reward because it is con-
siderably simpler and uses the same ideas.

Theorem 4.6. Let π : Rd → Π be a policy parameterization for an ergodic
MDP and let θ ∈ Rd be any parameter vector. Then the function

$$b(x) = \sum_a w(x, a, \theta)\, Q_{\mathrm{avg}}(x, a, \theta),$$

where

$$w(x, a, \theta) = \frac{\tilde{w}(x, a, \theta)}{\sum_{a'} \tilde{w}(x, a', \theta)} \qquad \text{and} \qquad \tilde{w}(x, a, \theta) = \frac{\left\| \nabla_\theta \pi(x, a, \theta) \right\|_2^2}{\pi(x, a, \theta)},$$

is the baseline function that minimizes the MSE of the gradient estimate $\nabla_\theta^b$
in Corollary 4.4.

Proof. By Lemma 4.5, we can equivalently find the minimizer of the second
moment of the gradient estimate. That is, we want to solve the optimization
problem

$$\operatorname*{argmin}_{b : \mathcal{X} \to \mathbb{R}} \mathbb{E}\left\| \varphi^\theta \left( Q_{\mathrm{avg}}(X^\theta, A^\theta, \theta) - b(X^\theta) \right) \right\|_2^2 .$$

The general approach is as follows: by writing the expectation as a sum, we
see that it is separable over the values b(x), so we are free to optimize each
b(x) independently. Further, the contribution of each b(x) to the second
moment of the gradient estimate $\nabla_\theta^b$ is quadratic in b(x) and therefore
the minimizer can easily be computed.
First, we write the second moment as a sum and show it separates over
the values b(x). Let p(x, a), p(a | x), and p(x) be shorthand for the probabil-
ities P(X θ = x, Aθ = a), P(Aθ = a | X θ = x), and P(X θ = x), respectively.
Then

$$\begin{aligned}
\mathbb{E}\left\| \varphi^\theta \left( Q_{\mathrm{avg}}(X^\theta, A^\theta, \theta) - b(X^\theta) \right) \right\|_2^2
&= \mathbb{E}\left[ \left\| \varphi^\theta \right\|_2^2 \left( Q_{\mathrm{avg}}(X^\theta, A^\theta, \theta) - b(X^\theta) \right)^2 \right] \\
&= \sum_{x,a} p(x, a) \frac{\left\| \nabla_\theta \pi(x, a, \theta) \right\|_2^2}{\pi(x, a, \theta)^2} \left( Q_{\mathrm{avg}}(x, a, \theta) - b(x) \right)^2 \\
&= \sum_{x,a} p(a \mid x)\, p(x) \frac{\tilde{w}(x, a, \theta)}{\pi(x, a, \theta)} \left( Q_{\mathrm{avg}}(x, a, \theta) - b(x) \right)^2 \\
&= \sum_{x} p(x) \sum_{a} \tilde{w}(x, a, \theta) \left( Q_{\mathrm{avg}}(x, a, \theta) - b(x) \right)^2 .
\end{aligned}$$

For each state x, the value b(x) appears in exactly one term of the
sum over x. Since there are no constraints between the b(x), we are free
to minimize each term independently. Therefore, we can express the MSE-
minimizing baseline point-wise as

$$\begin{aligned}
b(x) &= \operatorname*{argmin}_{c \in \mathbb{R}} \; p(x) \sum_a \tilde{w}(x, a, \theta) \left( Q_{\mathrm{avg}}(x, a, \theta) - c \right)^2 \\
&= \sum_a w(x, a, \theta)\, Q_{\mathrm{avg}}(x, a, \theta),
\end{aligned}$$

which completes the proof.
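In code, evaluating this baseline at a state is a short computation once the action probabilities, their gradients, and (estimates of) the action values are available. The sketch below treats all of these as given inputs; how they are obtained is left open.

import numpy as np

def mse_minimizing_baseline(pi_x, grad_pi_x, q_x):
    """MSE-minimizing baseline of Theorem 4.6 at a single state x
    (a sketch; all inputs are assumed to be known or estimated elsewhere).

    pi_x:      shape (num_actions,), action probabilities pi(x, a, theta)
    grad_pi_x: shape (num_actions, d), rows are grad_theta pi(x, a, theta)
    q_x:       shape (num_actions,), action values Q_avg(x, a, theta)"""
    w_tilde = np.sum(grad_pi_x ** 2, axis=1) / pi_x   # ||grad pi||^2 / pi
    w = w_tilde / w_tilde.sum()                        # normalize to a distribution
    return np.dot(w, q_x)                              # weighted average of action values

# For comparison, the value-function baseline weights the same action values
# by pi(x, a, theta) instead:  V(x, theta) = np.dot(pi_x, q_x).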

There is an interesting connection between the MSE-minimizing base-
line and the value function: both are weighted averages of the action values.
Since w̃(x, a, θ) is non-negative for each action a, the weights w(x, a, θ) form
a probability distribution over the actions and therefore the MSE-minimizing
baseline b(x) is a weighted average of the action values. Similarly, the Bell-
man equation shows that the value function is a weighted average of the
action values:

$$V_{\mathrm{avg}}(x, \theta) = \sum_a \pi(x, a, \theta)\, Q_{\mathrm{avg}}(x, a, \theta).$$

The only difference between these two baseline functions is the weighting
used in the average.
It is also interesting to notice that the value function baseline minimizes
the second moment of the quantity Qavg (X θ , Aθ , θ) − b(X θ ). The MSE-
minimizing baseline uses the modified weights w(x, a, θ) in place of the action
selection probabilities in order to instead minimize the second moment of
the gradient estimate ϕθ (Qavg (X θ , Aθ , θ) − b(X θ )).

4.3.3 MSE Minimizing Baseline for Total Reward


This section derives expressions for two different baseline functions for the
total reward learning objective. The first is the baseline function that truly
minimizes the MSE, but it has a complicated form due to the correlations
between the states and actions visited during a single episode. The second
baseline is derived by ignoring the correlations between states, has a much
simpler form, and may still come close to minimizing the MSE in prac-
tice. I will first present the exact MSE minimizing baseline, followed by the

approximation.

Theorem 4.7. Let π : Rd → Π be a policy parameterization for an episodic
MDP and let θ ∈ Rd be any parameter vector. Then the function

$$b(x) = \frac{\sum_{t=1}^{\infty} \mathbb{E}\left[ \mathbb{I}\left\{ X_t^\theta = x \right\} \nabla_\theta^\top \varphi_t^\theta \right]}{\sum_{t=1}^{\infty} \mathbb{E}\left[ \mathbb{I}\left\{ X_t^\theta = x \right\} \left\| \varphi_t^\theta \right\|_2^2 \right]},$$

where $\nabla_\theta = \nabla_\theta^0$, is the baseline function that minimizes the MSE of the
gradient estimate $\nabla_\theta^b$ in Corollary 4.3.

Proof. Let $(X_t^\theta, A_t^\theta, R_t^\theta)_{t=1}^\infty$ be the random sequence of states, actions, and
rewards obtained by following $\pi(\theta)$, let $T^\theta$ be the first time the terminal
state is entered, $G_t^\theta = \sum_{s=t}^{T^\theta} R_s^\theta$ be the total reward earned after time $t$,
and $\varphi_t^\theta = \frac{\nabla_\theta \pi(X_t^\theta, A_t^\theta, \theta)}{\pi(X_t^\theta, A_t^\theta, \theta)}$ be the vector of compatible features at time $t$. We
can rewrite the gradient estimate from Corollary 4.3 as

$$\nabla_\theta^b = \sum_{t=1}^{T^\theta - 1} \varphi_t^\theta \left( G_t^\theta - b(X_t^\theta) \right)
= \sum_{t=1}^{\infty} \varphi_t^\theta G_t^\theta - \sum_{t=1}^{\infty} \varphi_t^\theta b(X_t^\theta)
= \nabla_\theta - \sum_{t=1}^{\infty} \varphi_t^\theta b(X_t^\theta),$$

where $\nabla_\theta$ is the gradient estimate with the constantly zero baseline function.
We can therefore decompose the second moment as follows

$$\begin{aligned}
\mathbb{E}\left\| \nabla_\theta^b \right\|_2^2
&= \mathbb{E}\left[ \left( \nabla_\theta - \sum_{t=1}^{\infty} \varphi_t^\theta b(X_t^\theta) \right)^{\!\top} \left( \nabla_\theta - \sum_{t=1}^{\infty} \varphi_t^\theta b(X_t^\theta) \right) \right] \\
&= \mathbb{E}\left\| \nabla_\theta \right\|_2^2 - 2 \sum_{t=1}^{\infty} \mathbb{E}\left[ \nabla_\theta^\top \varphi_t^\theta b(X_t^\theta) \right]
+ \sum_{t=1}^{\infty} \sum_{s=1}^{\infty} \mathbb{E}\left[ \varphi_t^{\theta\top} \varphi_s^\theta b(X_t^\theta) b(X_s^\theta) \right]. \qquad (4.1)
\end{aligned}$$

The double sum in (4.1) can be simplified using the following observation.
Let s and t be any two times with s < t. Since Markov policies choose

actions independently of the history of states and actions, the probability of
choosing action a from state $X_t^\theta$ at time t does not depend on the state and
action visited at time s. Formally, for any states $x$ and $x'$, and any actions
$a$ and $a'$, we have

$$\mathbb{P}(A_t^\theta = a \mid X_t^\theta = x, X_s^\theta = x', A_s^\theta = a') = \mathbb{P}(A_t^\theta = a \mid X_t^\theta = x) = \pi(x, a, \theta).$$

Therefore, we can factor the joint probability of taking action $a'$ from state
$x'$ at time s and later taking action $a$ from state $x$ at time t as follows:

$$\begin{aligned}
\mathbb{P}(X_t^\theta = x, A_t^\theta = a, X_s^\theta = x', A_s^\theta = a')
&= \mathbb{P}(A_t^\theta = a \mid X_t^\theta = x, X_s^\theta = x', A_s^\theta = a')\, \mathbb{P}(X_t^\theta = x \mid X_s^\theta = x', A_s^\theta = a') \\
&\quad \cdot \mathbb{P}(A_s^\theta = a' \mid X_s^\theta = x')\, \mathbb{P}(X_s^\theta = x') \\
&= \pi(x, a, \theta)\, \pi(x', a', \theta)\, \mathbb{P}(X_s^\theta = x')\, \mathbb{P}(X_t^\theta = x \mid X_s^\theta = x', A_s^\theta = a').
\end{aligned}$$

Note that this trick does not work when the time s follows time t, because
then knowing the state Xsθ gives you information about what action was
taken at time t. Using this factorization, for any times s < t, we have

$$\begin{aligned}
\mathbb{E}\left[ \varphi_t^{\theta\top} \varphi_s^\theta b(X_t^\theta) b(X_s^\theta) \right]
&= \sum_{x, a, x', a'} \mathbb{P}(X_t^\theta = x, A_t^\theta = a, X_s^\theta = x', A_s^\theta = a')
\left( \frac{\nabla_\theta \pi(x, a, \theta)^\top \nabla_\theta \pi(x', a', \theta)}{\pi(x, a, \theta)\, \pi(x', a', \theta)} \right) b(x)\, b(x') \\
&= \sum_{x, x', a'} \mathbb{P}(X_s^\theta = x')\, \mathbb{P}(X_t^\theta = x \mid X_s^\theta = x', A_s^\theta = a')
\left( \sum_a \nabla_\theta \pi(x, a, \theta)^\top \nabla_\theta \pi(x', a', \theta)\, b(x)\, b(x') \right) \\
&= 0,
\end{aligned}$$

where the last line uses the fact that $\sum_a \nabla_\theta \pi(x, a, \theta) = \nabla_\theta 1 = 0$. The
expression is symmetric in the times t and s, so an identical argument can
be used to show that $\mathbb{E}[\varphi_t^{\theta\top} \varphi_s^\theta b(X_t^\theta) b(X_s^\theta)] = 0$ when $s > t$. Therefore, the

only non-zero terms of the double sum from (4.1) are the terms when s = t,
and therefore we can write the second moment as

$$\begin{aligned}
\mathbb{E}\left\| \nabla_\theta^b \right\|_2^2
&= \mathbb{E}\left\| \nabla_\theta \right\|_2^2 - 2 \sum_{t=1}^{\infty} \mathbb{E}\left[ \nabla_\theta^\top \varphi_t^\theta b(X_t^\theta) \right] + \sum_{t=1}^{\infty} \mathbb{E}\left[ \left\| \varphi_t^\theta \right\|_2^2 b(X_t^\theta)^2 \right] \\
&= \mathbb{E}\left\| \nabla_\theta \right\|_2^2 + \sum_{x} \sum_{t=1}^{\infty} \left( \mathbb{E}\left[ \mathbb{I}\left\{ X_t^\theta = x \right\} \left\| \varphi_t^\theta \right\|_2^2 \right] b(x)^2
- 2\, \mathbb{E}\left[ \mathbb{I}\left\{ X_t^\theta = x \right\} \nabla_\theta^\top \varphi_t^\theta \right] b(x) \right).
\end{aligned}$$

The last line above is obtained by first summing over the states, and then
summing over the times in which state x was visited, and shows that the
second moment is again separable over the states x.
Since the second moment is separable over the values b(x), we are free
to minimize each independently. Again, the contribution of each b(x) to the
second moment is an easily-minimized quadratic expression in b(x). The
MSE-minimizing baseline is therefore given by

$$b(x) = \frac{\sum_{t=1}^{\infty} \mathbb{E}\left[ \mathbb{I}\left\{ X_t^\theta = x \right\} \nabla_\theta^\top \varphi_t^\theta \right]}{\sum_{t=1}^{\infty} \mathbb{E}\left[ \mathbb{I}\left\{ X_t^\theta = x \right\} \left\| \varphi_t^\theta \right\|_2^2 \right]}.$$

The above expression for the MSE minimizing baseline may not be very
useful, since it may not be easy to compute even if we knew the MDP
transition probability kernel T. Both the form and derivation of this baseline
function are complicated because of the correlations between the states and
actions visited at different times during a single episode.
We can derive a much simpler alternative baseline that may still go a
long way towards minimizing the MSE. The idea is, rather than minimizing
the second moment of the complete gradient estimate $\nabla_\theta^b = \sum_{t=1}^{T^\theta - 1} \varphi_t^\theta (G_t^\theta - b(X_t^\theta))$,
we can try to minimize the second moment of each term in the sum
independently. The following result shows that a much simpler baseline
function minimizes the second moment of $\varphi_t^\theta (G_t^\theta - b(X_t^\theta))$ for all times t.

Theorem 4.8. Let π : Rd → Π be a policy parameterization for an episodic
MDP and let θ ∈ Rd be any parameter vector. Then the function

$$b(x) = \sum_a w(x, a, \theta)\, Q_{\mathrm{total}}(x, a, \theta),$$

where

$$w(x, a, \theta) = \frac{\tilde{w}(x, a, \theta)}{\sum_{a'} \tilde{w}(x, a', \theta)} \qquad \text{and} \qquad \tilde{w}(x, a, \theta) = \frac{\left\| \nabla_\theta \pi(x, a, \theta) \right\|_2^2}{\pi(x, a, \theta)},$$

simultaneously minimizes the second moment of $\varphi_t^\theta (G_t^\theta - b(X_t^\theta))$ for all times t.

Proof. Fix any time $t$ and let $X_t^\theta$ and $A_t^\theta$ be the state and action at time $t$
when following policy $\pi(\theta)$. Let $\varphi_t^\theta = \frac{\nabla_\theta \pi(X_t^\theta, A_t^\theta, \theta)}{\pi(X_t^\theta, A_t^\theta, \theta)}$ be the vector of compatible
features at time $t$ and $G_t^\theta$ be the total reward following action $A_t^\theta$. Our
goal is to find the baseline function $b : \mathcal{X} \to \mathbb{R}$ that minimizes

$$\mathbb{E}\left\| \varphi_t^\theta \left( G_t^\theta - b(X_t^\theta) \right) \right\|_2^2 .$$

Again, the general strategy is to express the objective as a separable sum
over the states $x$ and to solve for each $b(x)$ independently. Let $p_t(x, a) = \mathbb{P}(X_t^\theta = x, A_t^\theta = a)$,
$p_t(a \mid x) = \mathbb{P}(A_t^\theta = a \mid X_t^\theta = x)$, and $p_t(x) = \mathbb{P}(X_t^\theta = x)$.
Using the fact that $\mathbb{E}[G_t^\theta \mid X_t^\theta = x, A_t^\theta = a] = Q_{\mathrm{total}}(x, a, \theta)$ and the definition
of $\tilde{w}$ from the statement of the theorem, we have

$$\begin{aligned}
\mathbb{E}\left\| \varphi_t^\theta \left( G_t^\theta - b(X_t^\theta) \right) \right\|_2^2
&= \mathbb{E}\left[ \left\| \varphi_t^\theta \right\|_2^2 \left( G_t^\theta - b(X_t^\theta) \right)^2 \right] \\
&= \sum_{x,a} p_t(x, a) \frac{\left\| \nabla_\theta \pi(x, a, \theta) \right\|_2^2}{\pi(x, a, \theta)^2} \left( Q_{\mathrm{total}}(x, a, \theta) - b(x) \right)^2 \\
&= \sum_{x,a} p_t(a \mid x)\, p_t(x) \frac{\tilde{w}(x, a, \theta)}{\pi(x, a, \theta)} \left( Q_{\mathrm{total}}(x, a, \theta) - b(x) \right)^2 \\
&= \sum_{x} p_t(x) \sum_{a} \tilde{w}(x, a, \theta) \left( Q_{\mathrm{total}}(x, a, \theta) - b(x) \right)^2 .
\end{aligned}$$

From this we see that the second moment is separable over the states and
again the contribution of each state is quadratic in $b(x)$. Therefore, we can
choose each $b(x)$ independently as follows

$$\begin{aligned}
b(x) &= \operatorname*{argmin}_{c \in \mathbb{R}} \; p_t(x) \sum_a \tilde{w}(x, a, \theta) \left( Q_{\mathrm{total}}(x, a, \theta) - c \right)^2 \\
&= \sum_a w(x, a, \theta)\, Q_{\mathrm{total}}(x, a, \theta).
\end{aligned}$$

Since the above baseline does not depend on the time index $t$, it simultane-
ously minimizes the second moment of each $\varphi_t^\theta (G_t^\theta - b(X_t^\theta))$.

This approximate MSE-minimizing baseline shows that, modulo the cor-
relations between times in an episode, the MSE minimizing baseline in the
total reward setting has the same form as in the average reward setting.

4.4 Estimating the MSE Minimizing Baseline


It may be easier to directly estimate the MSE minimizing baseline than to
estimate the unknown quantities in the closed-form expressions from the
previous section. For example, the action-value function (which appears in
two of the three baselines) is a function from state-action pairs to real num-
bers, while the baseline function is only a map from states to real numbers.
Since the baseline function has a simpler form, it may be possible to esti-
mate it more easily and from less data than the action value function. This
section describes a stochastic-gradient method for directly estimating the
MSE minimizing baselines from experience.
We will represent our baseline estimates as linear function approxima-
tors. Specifically, we will fix a map ψ : X → Rn which maps each state to a
feature vector. Our baseline estimate will be a linear function of the feature
vector: $b(x, w) = \psi(x)^\top w$, where $w \in \mathbb{R}^n$ is the baseline parameter vector.
Our goal is to find the parameter vector $w \in \mathbb{R}^n$ that minimizes the MSE of
the gradient estimate $\nabla_\theta^w = \nabla_\theta^{b(\cdot, w)}$, which is equivalent to minimizing the
second moment.

For both the average and total reward settings, we will show that the
second moment is a convex quadratic function of the weights w used in the
baseline function. Minimizing the second moment is equivalent to maximiz-
ing the negative second moment, which is a concave function. Therefore, if
we can construct unbiased random estimates of the gradient $\nabla_w \mathbb{E}\| \nabla_\theta^w \|_2^2$,
then we can use stochastic gradient ascent to directly approximate the MSE
minimizing baselines. The following theorems show that it is possible to
compute unbiased random estimates of the gradient by interacting with the
environment.

Theorem 4.9. Let π : Rd → Π be a policy parameterization for an ergodic
MDP, θ ∈ Rd be any policy parameter vector, ψ : X → Rn be a baseline
feature map, and w ∈ Rn be any baseline parameter vector. Then the map
$w \mapsto \mathbb{E}\| \nabla_\theta^w \|_2^2$ is a convex quadratic function of $w$ and the random vector

$$D_\theta^w = 2 \left\| \varphi^\theta \right\|_2^2 \psi^\theta \psi^{\theta\top} w - 2\, \psi^\theta \varphi^{\theta\top} \nabla_\theta^0$$

satisfies $\mathbb{E}[D_\theta^w] = \nabla_w \mathbb{E}\| \nabla_\theta^w \|_2^2$, where $\psi^\theta = \psi(X^\theta)$ and $\nabla_\theta^w$ is the gradient
estimate from Corollary 4.4 with baseline $b(x) = \psi(x)^\top w$.

Proof. We can rewrite $\nabla_\theta^w$ as follows

$$\nabla_\theta^w = \varphi^\theta \left( Q_{\mathrm{avg}}(X^\theta, A^\theta, \theta) - \psi^{\theta\top} w \right) = \nabla_\theta^0 - \varphi^\theta \psi^{\theta\top} w.$$

With this, we have

$$\begin{aligned}
\mathbb{E}\left\| \nabla_\theta^w \right\|_2^2
&= \mathbb{E}\left[ \left( \nabla_\theta^0 - \varphi^\theta \psi^{\theta\top} w \right)^\top \left( \nabla_\theta^0 - \varphi^\theta \psi^{\theta\top} w \right) \right] \\
&= w^\top \mathbb{E}\left[ \left\| \varphi^\theta \right\|_2^2 \psi^\theta \psi^{\theta\top} \right] w - 2\, \mathbb{E}\left[ \nabla_\theta^{0\top} \varphi^\theta \psi^{\theta\top} \right] w + \mathbb{E}\left\| \nabla_\theta^0 \right\|_2^2 .
\end{aligned}$$

This is a quadratic function of $w$. Since the second moment is bounded
below by 0, it follows that it must also be convex. With this, we have

$$\nabla_w \mathbb{E}\left\| \nabla_\theta^w \right\|_2^2
= \nabla_w \left\{ w^\top \mathbb{E}\left[ \left\| \varphi^\theta \right\|_2^2 \psi^\theta \psi^{\theta\top} \right] w - 2\, \mathbb{E}\left[ \nabla_\theta^{0\top} \varphi^\theta \psi^{\theta\top} \right] w + \mathbb{E}\left\| \nabla_\theta^0 \right\|_2^2 \right\}
= \mathbb{E}\left[ 2 \left\| \varphi^\theta \right\|_2^2 \psi^\theta \psi^{\theta\top} w - 2\, \psi^\theta \varphi^{\theta\top} \nabla_\theta^0 \right],$$

and it follows that

$$D_\theta^w = 2 \left\| \varphi^\theta \right\|_2^2 \psi^\theta \psi^{\theta\top} w - 2\, \psi^\theta \varphi^{\theta\top} \nabla_\theta^0$$

is an unbiased estimate of the gradient $\nabla_w \mathbb{E}\| \nabla_\theta^w \|_2^2$.

All quantities in Dθw are observable by the agent, so the agent can pro-
duce samples from Dθw . Pseudocode for a policy gradient method that uses
the above estimate of the baseline function is given in Algorithm 6.

Input: policy step-size η > 0, baseline step-size α > 0
1 Choose θ1 ∈ Rd arbitrarily;
2 Choose w1 ∈ Rn arbitrarily;
3 Initialize action value estimate q̂;
4 for each time t = 1, 2, . . . do
5     Receive state Xt from the environment;
6     Sample action At from π(Xt , ·, θt);
7     Receive reward Rt ;
8     Compute ∇^{wt}_{θt} according to Corollary 4.4 using Xt , At , the
      estimated action value function q̂, and the baseline function
      b(x) = ψ(x)⊤ wt ;
9     Set θt+1 = θt + η∇^{wt}_{θt} ;
10    Compute D^{wt}_{θt} according to Theorem 4.9;
11    Set wt+1 = wt − αD^{wt}_{θt} ;
12    Update the action value function estimate q̂;
13 end
Algorithm 6: Policy gradient method for ergodic MDPs with a linear
approximation to the MSE minimizing baseline.
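The baseline-weight update on lines 10–11 of Algorithm 6 is a single vector computation. The following Python sketch evaluates the estimate from Theorem 4.9, treating the compatible features, the baseline features, and the zero-baseline gradient estimate as quantities already computed on the current time step (the function and argument names are my own).

import numpy as np

def baseline_weight_gradient(phi, psi, w, grad0):
    """Unbiased estimate D_theta^w from Theorem 4.9 of the gradient of the
    second moment with respect to the baseline weights w (a sketch).

    phi:   compatible-feature vector at the sampled (state, action), shape (d,)
    psi:   baseline feature vector psi(x), shape (n,)
    w:     current baseline weights, shape (n,)
    grad0: gradient estimate with the zero baseline, phi * q_hat(x, a), shape (d,)"""
    return 2.0 * np.dot(phi, phi) * psi * np.dot(psi, w) - 2.0 * np.dot(phi, grad0) * psi

# baseline-weight step used on line 11 of Algorithm 6:
# w = w - alpha * baseline_weight_gradient(phi, psi, w, grad0)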

Theorem 4.10. Let π : Rd → Π be a policy parameterization for an episodic
MDP, θ ∈ Rd be any policy parameter vector, ψ : X → Rn be a baseline
feature map, and w ∈ Rn be any baseline parameter vector. Assume that
∇θ π(xterm , a, θ) = 0 for all actions a and parameter vectors θ (the parame-
terization can always be chosen so that this property is satisfied). Then the
map $w \mapsto \mathbb{E}\| \nabla_\theta^w \|_2^2$ is a convex quadratic function of $w$ and the random
vector

$$D_\theta^w = 2 \sum_{t=1}^{T^\theta - 1} \psi_t^\theta \left( \left\| \varphi_t^\theta \right\|_2^2 \psi_t^{\theta\top} w - \varphi_t^{\theta\top} \nabla_\theta^0 \right)$$

satisfies $\mathbb{E}[D_\theta^w] = \nabla_w \mathbb{E}\| \nabla_\theta^w \|_2^2$, where $\psi_t^\theta = \psi(X_t^\theta)$ and $\nabla_\theta^w$ is the gradient
estimate from Corollary 4.3 with baseline $b(x) = \psi(x)^\top w$.

Proof. We can rewrite $\nabla_\theta^w$ as follows

$$\nabla_\theta^w = \sum_{t=1}^{T^\theta - 1} \varphi_t^\theta \left( G_t^\theta - \psi_t^{\theta\top} w \right)
= \sum_{t=1}^{\infty} \varphi_t^\theta G_t^\theta - \sum_{t=1}^{\infty} \varphi_t^\theta \psi_t^{\theta\top} w
= \nabla_\theta^0 - M^\theta w,$$

where $M^\theta = \sum_{t=1}^{\infty} \varphi_t^\theta \psi_t^{\theta\top}$. We can use this to write the second moment as

$$\begin{aligned}
\mathbb{E}\left\| \nabla_\theta^w \right\|_2^2
&= \mathbb{E}\left[ \left( \nabla_\theta^0 - M^\theta w \right)^\top \left( \nabla_\theta^0 - M^\theta w \right) \right] \\
&= w^\top \mathbb{E}\left[ M^{\theta\top} M^\theta \right] w - 2\, \mathbb{E}\left[ \nabla_\theta^{0\top} M^\theta \right] w + \mathbb{E}\left\| \nabla_\theta^0 \right\|_2^2 .
\end{aligned}$$
Again this is a quadratic form in w and, since the second moment is bounded
below, it must be convex.
Before taking the gradient, we can simplify the matrix of coefficients
$\mathbb{E}[M^{\theta\top} M^\theta]$ in a way very similar to the simplification of the double sum in
Theorem 4.7. Recall that for any times $s < t$, we have the following

$$\mathbb{P}(X_t^\theta = x, A_t^\theta = a, X_s^\theta = x', A_s^\theta = a')
= \pi(x, a, \theta)\, \pi(x', a', \theta)\, \mathbb{P}(X_t^\theta = x \mid X_s^\theta = x', A_s^\theta = a')\, \mathbb{P}(X_s^\theta = x').$$

Therefore, whenever $s < t$, we have

$$\begin{aligned}
\mathbb{E}\left[ \psi_t^\theta \varphi_t^{\theta\top} \varphi_s^\theta \psi_s^{\theta\top} \right]
&= \sum_{x, a, x', a'} \mathbb{P}(X_t^\theta = x, A_t^\theta = a, X_s^\theta = x', A_s^\theta = a')\,
\psi(x) \frac{\nabla_\theta \pi(x, a, \theta)^\top \nabla_\theta \pi(x', a', \theta)}{\pi(x, a, \theta)\, \pi(x', a', \theta)} \psi(x')^\top \\
&= \sum_{x, x', a'} \mathbb{P}(X_s^\theta = x')\, \mathbb{P}(X_t^\theta = x \mid X_s^\theta = x', A_s^\theta = a')\,
\psi(x) \left( \sum_a \nabla_\theta \pi(x, a, \theta)^\top \nabla_\theta \pi(x', a', \theta) \right) \psi(x')^\top \\
&= 0.
\end{aligned}$$

Similarly, when $s > t$, we have $\mathbb{E}[\psi_t^\theta \varphi_t^{\theta\top} \varphi_s^\theta \psi_s^{\theta\top}] = 0$. Therefore,

$$\mathbb{E}\left[ M^{\theta\top} M^\theta \right] = \sum_{t=1}^{\infty} \sum_{s=1}^{\infty} \mathbb{E}\left[ \psi_t^\theta \varphi_t^{\theta\top} \varphi_s^\theta \psi_s^{\theta\top} \right]
= \sum_{t=1}^{\infty} \mathbb{E}\left[ \left\| \varphi_t^\theta \right\|_2^2 \psi_t^\theta \psi_t^{\theta\top} \right].$$

Finally, since $\nabla_\theta \pi(x_{\mathrm{term}}, a, \theta) = 0$, we have that $\varphi_t^\theta = 0$ whenever
$X_t^\theta = x_{\mathrm{term}}$, therefore

$$\mathbb{E}\left[ M^{\theta\top} M^\theta \right] = \mathbb{E}\left[ \sum_{t=1}^{T^\theta - 1} \left\| \varphi_t^\theta \right\|_2^2 \psi_t^\theta \psi_t^{\theta\top} \right]$$

and

$$\mathbb{E}\left[ M^{\theta\top} \nabla_\theta^0 \right] = \mathbb{E}\left[ \sum_{t=1}^{T^\theta - 1} \psi_t^\theta \varphi_t^{\theta\top} \nabla_\theta^0 \right].$$

Substituting these equalities into the expression for the second moment, we
have

$$\mathbb{E}\left\| \nabla_\theta^w \right\|_2^2
= w^\top \mathbb{E}\left[ \sum_{t=1}^{T^\theta - 1} \left\| \varphi_t^\theta \right\|_2^2 \psi_t^\theta \psi_t^{\theta\top} \right] w
- 2\, \mathbb{E}\left[ \sum_{t=1}^{T^\theta - 1} \nabla_\theta^{0\top} \varphi_t^\theta\, \psi_t^{\theta\top} \right] w
+ \mathbb{E}\left\| \nabla_\theta^0 \right\|_2^2 .$$

Taking the gradient of this expression and exchanging the gradient and
expectation gives that

$$D_\theta^w = 2 \sum_{t=1}^{T^\theta - 1} \psi_t^\theta \left( \left\| \varphi_t^\theta \right\|_2^2 \psi_t^{\theta\top} w - \varphi_t^{\theta\top} \nabla_\theta^0 \right)$$

satisfies $\mathbb{E}[D_\theta^w] = \nabla_w \mathbb{E}\| \nabla_\theta^w \|_2^2$.

It is worth noting that the gradient estimate from Theorem 4.10 can be
computed in linear time in the length of the episode. The gradient estimate
∇0θ can be computed in the first pass, and Dθw can be computed in a second
pass.
Pseudocode for a policy gradient method that uses the above estimate
of the baseline function is given in Algorithm 7.

Input: policy step-size η > 0, baseline step-size α > 0
1 Choose θ1 ∈ Rd arbitrarily;
2 Choose w1 ∈ Rn arbitrarily;
3 for each episode index τ = 1, 2, . . . do
4     Run one episode following π(θτ) until the terminal state is reached;
5     Compute ∇^{wτ}_{θτ} according to Corollary 4.3 using the baseline
      b(x) = ψ(x)⊤ wτ ;
6     Set θτ+1 = θτ + η∇^{wτ}_{θτ} ;
7     Compute D^{wτ}_{θτ} according to Theorem 4.10;
8     Set wτ+1 = wτ − αD^{wτ}_{θτ} ;
9 end
Algorithm 7: Policy gradient method for episodic MDPs with a linear
approximation to the MSE minimizing baseline.
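The two-pass computation mentioned above can be written directly as follows; this is a sketch in which the per-step quantities of one episode are assumed to have been recorded already, and the variable names are illustrative.

import numpy as np

def episodic_baseline_weight_gradient(phis, psis, returns, w):
    """Two-pass computation of D_theta^w from Theorem 4.10 (a sketch).

    phis:    list of compatible-feature vectors phi_t, each of shape (d,)
    psis:    list of baseline feature vectors psi(X_t), each of shape (n,)
    returns: list of returns G_t following each time step
    w:       current baseline weight vector, shape (n,)"""
    # first pass: the gradient estimate with the constantly zero baseline
    grad0 = np.zeros_like(phis[0])
    for phi, G in zip(phis, returns):
        grad0 += phi * G
    # second pass: accumulate the terms of D_theta^w
    D = np.zeros_like(w)
    for phi, psi in zip(phis, psis):
        D += 2.0 * psi * (np.dot(phi, phi) * np.dot(psi, w) - np.dot(phi, grad0))
    return grad0, D

# the baseline update of Algorithm 7 is then  w = w - alpha * D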

4.5 Experiments
This section presents an empirical comparison of the three baseline functions
discussed in this thesis: the always zero baseline, the value function, and
the MSE minimizing baseline. The goal of these experiments is to answer

the question: Does the baseline function have a significant impact on per-
formance? And if so, does one of the baseline functions consistently give
better performance than the others?
Earlier in this chapter I showed that when the formal learning objective
is a concave function of the policy parameters, then we can upper bound the
agent’s expected regret. This regret bound only depended on the baseline
through the second moments of the performance gradient estimates. There-
fore, the best regret bound is obtained by choosing the baseline to minimize
the second moment, which we saw was equivalent to minimizing the MSE.
We also saw that the best regret bound after T updates is obtained by
setting the step size to be
$$\eta(T) = \frac{B}{G \sqrt{T}},$$
where G2 is an upper bound on the second moments of the first T gradient
estimates, and B is an upper bound on the 2-norm of the unknown optimal
parameter vector. Therefore, in the concave setting, the step size which gives
the best regret bound scales inversely proportional to G. My hypothesis is
that even when the learning objective is not a concave function of the policy
parameters, choosing the baseline to minimize the MSE (and therefore the
second moment) is a good choice. This hypothesis would be supported by
the experiments if the MSE baseline gives the highest performance and if its
maximum performance is achieved with a higher step size than for the other
baselines. All but one of the following experiments support this hypothesis.
Policy gradient methods have two significant parameters: the step size
and the baseline function. The expected performance depends on the pair of
parameter settings. To study the effect of the baseline function, I estimate
the expected performance when using each baseline with a wide range of step
sizes. For each baseline, this results in a graph showing the performance of
the given baseline as a function of the step size. This kind of parameter
study involves running the algorithm many more times than would be done
in practice. In practice, the parameters might be chosen according to a
rule of thumb or based on some brief experiment and then the algorithm
may be run only once with those settings. In these experiments, however,

we run the algorithm for many parameter settings and, for each parameter
setting, we do enough runs to accurately estimate the expected total reward.
While these parameter studies do not necessarily resemble reinforcement
learning in practice, they allow us to confidently compare the baselines and
to understand how their performance depends on the step size. A strong
result would be to find that one baseline outperforms the rest for every
step size. In this case, no matter how you choose the step size, you should
always use the winning baseline. A weaker result would be to find that the
maximum performance of one baseline (maximizing over the step sizes) is
higher than the maximum performances of the other baselines. This result
is weaker, since it might not be easy to find good step sizes in practice.
The experiments compare the three baseline functions in two different
test-beds. I borrowed the first test-bed, called the ten-armed test-bed, from
Rich Sutton and Andy Barto’s book [SB1998]. In the ten-armed test-bed,
the MDP has a single state and ten actions, each having rewards drawn from
a Gaussian distribution with unit standard deviation. Such an MDP is called
a bandit problem, in reference to multi-armed bandits or slot machines. An
episode in a bandit problem consists of choosing a single arm, and the agent’s
goal is to maximize her total reward over a fixed number of episodes. In all
the following experiments, the number of episodes is taken to be 20. Rather
than fixing a single bandit for the comparison, we can randomly produce
many distinct bandits by sampling the mean payout for each arm from a
standard Gaussian distribution. We compare the average performance over
a large number of randomly generated bandits to reduce the chance that the
results are an artifact of one specific bandit.
MDPs with a single state are missing many properties of typical MDPs.
For example, the action value function in a single-state MDP does not de-
pend on the agent’s policy at all. I designed the second test bed, called the
triangle test-bed, to be similar to the ten-armed test-bed but with more than
one state. Again, the payout for each state-action pair will be a Gaussian
with unit standard deviation and mean sampled from a standard Gaussian
distribution. The states are organized in a triangle with R = 5 rows, where
there are r states in the rth row. The starting state is the unique state in

Figure 4.1: Each black circle represents a state in a triangular MDP and
each arrow represents the deterministic transition resulting from the Left
action or Right action, depending on whether the arrow points left or right.
Starting from the top state, the agent chooses to move either Left or Right
until she reaches the bottom row.

the first row and there are two actions, Left and Right, which each move
the agent from its current state to the state below and left, or below and
right, respectively. Figure 4.1 shows the layout of the states and the avail-
able actions. As in the ten-armed test-bed, the agent is allowed to perform
a fixed number of episodes on each randomly generated triangle MDP, and
her goal is to maximize total reward. In all of the following experiments,
the agent is given 50 episodes to interact with each instance of the triangle
MDP.
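The triangle test-bed is straightforward to generate and simulate. The sketch below is one possible implementation of the construction just described; the state encoding and function names are my own choices, not part of the experimental setup itself.

import numpy as np

def sample_triangle_mdp(num_rows=5, mean_shift=0.0, rng=None):
    """Sample one instance of the triangle test-bed (a sketch).

    States are (row, position) pairs with positions 0..row in each row.
    Action 0 (Left) moves to (row+1, position); action 1 (Right) moves to
    (row+1, position+1).  Each state-action pair gets a mean reward drawn
    from a Gaussian with mean `mean_shift` and unit standard deviation."""
    rng = np.random.default_rng() if rng is None else rng
    means = {}
    for row in range(num_rows - 1):           # no actions from the bottom row
        for pos in range(row + 1):
            for action in (0, 1):             # 0 = Left, 1 = Right
                means[(row, pos, action)] = mean_shift + rng.standard_normal()
    return means

def run_episode(means, policy, num_rows=5, rng=None):
    """Roll out one episode: start at the top state and move Left/Right
    until the bottom row; rewards are Gaussian around the sampled means."""
    rng = np.random.default_rng() if rng is None else rng
    row, pos, total = 0, 0, 0.0
    while row < num_rows - 1:
        action = policy(row, pos)             # policy maps (row, pos) to 0 or 1
        total += means[(row, pos, action)] + rng.standard_normal()
        row, pos = row + 1, pos + action
    return total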
I use a similar policy parameterization in both test-beds. The policy
parameterization is a mixture of the uniform policy, which chooses each
action with equal probability, and the so-called Gibbs policy, which is a
common policy parameterization when there are a small number of states
and actions. In the Gibbs policy, the agent stores a weight, or preference,
for each state-action pair. The weights are stored in a |X × A|-dimensional
vector θ, and we write θ(x, a) to denote the weight associated to action a
in state x. The Gibbs policy is parameterized by the agent’s preferences in

the following way:

$$\pi_{\mathrm{Gibbs}}(x, a, \theta) = \frac{\exp(\theta(x, a))}{\sum_{a'} \exp(\theta(x, a'))},$$

where the sum is taken over all actions. Under the Gibbs policy, the agent
prefers to choose actions with a high weight and the action selection prob-
abilities are invariant to adding a constant to all weights. For a mixing
constant ε, which in all experiments I take to be 0.05, the policy parameter-
ization that I use is given by

$$\pi(x, a, \theta) = \varepsilon / |\mathcal{A}| + (1 - \varepsilon)\, \pi_{\mathrm{Gibbs}}(x, a, \theta).$$

The mixture constant ε is not a parameter of the policy that is controlled
by the agent. The reason for using a mixture of the Gibbs policy and
the uniform policy is that, in the mixture, the minimum action selection
probability is ε/|A| > 0. Having a parameter-independent lower bound
on the action selection probabilities ensures that the agent will continue
to explore the available actions for every possible parameter vector. This
helps to avoid situations where the agent sets the probability of choosing
an action to zero early in an episode based on an unlucky reward. It is a
straightforward calculation to check that the gradient of the Gibbs policy
with respect to the weight vector θ is given by

$$\nabla_\theta \pi_{\mathrm{Gibbs}}(x, a, \theta) = \pi_{\mathrm{Gibbs}}(x, a, \theta) \left( e(x, a) - \pi_{\mathrm{Gibbs}}(x, \theta) \right),$$

where e(x, a) is the (x, a)th standard-basis vector and $\pi_{\mathrm{Gibbs}}(x, \theta)$ is the
vector whose $(x', a')$th entry is equal to $\mathbb{I}\{x = x'\}\, \pi_{\mathrm{Gibbs}}(x', a', \theta)$. From
this, the gradient of the mixed policy is given by

$$\nabla_\theta \pi(x, a, \theta) = (1 - \varepsilon)\, \nabla_\theta \pi_{\mathrm{Gibbs}}(x, a, \theta)
= (1 - \varepsilon)\, \pi_{\mathrm{Gibbs}}(x, a, \theta) \left( e(x, a) - \pi_{\mathrm{Gibbs}}(x, \theta) \right).$$

Since we have closed form expressions for the action selection probabili-
ties and their gradient with respect to the policy parameters, it is easy to

compute the stochastic gradient estimates discussed in this chapter. In all
experiments, the weight vector θ is always initialized to the zero vector,
which results in the uniform policy.
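The policy parameterization and its gradient translate directly into code. The following sketch stores the preferences in a (num states) × (num actions) table, which is an implementation choice rather than something prescribed above.

import numpy as np

EPSILON = 0.05  # mixing constant used in the experiments

def mixed_gibbs_probs(theta, x):
    """Action probabilities of the epsilon-mixture of the uniform policy and
    the Gibbs policy described above (a sketch; theta is a table of
    preferences of shape (num_states, num_actions))."""
    prefs = theta[x] - np.max(theta[x])
    gibbs = np.exp(prefs)
    gibbs /= gibbs.sum()
    num_actions = theta.shape[1]
    return EPSILON / num_actions + (1.0 - EPSILON) * gibbs

def mixed_gibbs_grad(theta, x, a):
    """Gradient of pi(x, a, theta) with respect to theta, following the
    closed form above: (1 - eps) * pi_Gibbs(x, a) * (e(x, a) - pi_Gibbs(x, .))."""
    prefs = theta[x] - np.max(theta[x])
    gibbs = np.exp(prefs)
    gibbs /= gibbs.sum()
    grad = np.zeros_like(theta)
    grad[x] = -gibbs          # subtract pi_Gibbs(x, .) in row x
    grad[x, a] += 1.0         # add the standard-basis vector e(x, a)
    return (1.0 - EPSILON) * gibbs[a] * grad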
One practical complication is that the value function and the MSE min-
imizing baseline are both unknown to the computer agent, since they de-
pend on the unknown MDP transition probability kernel. An agent solving
real-world problems must estimate these functions in order to use them as
baselines. The experimental challenge is that, if we compare estimated base-
lines, we can’t be sure that a difference in performance is due to the choice
of baseline and not the quality of the estimation. Even though unknowable
baseline functions are not usable in practice, measuring their performance is
still useful because it motivates the search for good approximation methods
for the best theoretical baselines. For this reason, the experiments com-
pare not only the estimated baselines, but also their true values. In both of
the test-beds, we have perfect knowledge of the transition probability kernel
(though we do not share this information with the agent). Using this knowl-
edge, we can give the computer agent access to an oracle that produces the
true MSE minimizing baseline function, or the true value function.

Comparison of True Baseline Functions: First, I present the results
of the comparison in the setting where the agent has access to the true
MSE minimizing baseline and the true value function. This is the simplest
setting, since there are no parameters related to baseline estimation that
need to be tuned. Figure 4.2 shows the estimated performance in both
the ten-armed and triangle test-beds. In both test-beds, the expected total
reward for each parameter setting is estimated by taking the sample mean
of 1,000,000 independent runs. These parameter studies give statistically
significant support to my hypothesis, since the MSE minimizing baseline
gives better performance than the zero baseline and the value function across
a wide range of step sizes, and it attains its highest performance at a higher
step-size than the other baselines.
It appears, however, that using either the value function or the MSE
minimizing baseline gives only a small improvement over using the always-


Figure 4.2: Comparison of the total reward earned by the episodic policy
gradient method when using the true baseline functions. The error bars
show 3 standard errors in the mean. The left figure shows performance in
the ten-armed test-bed. The right figure shows performance in the triangle
test-bed.

zero baseline, which is equivalent to using no baseline at all. For the ten-
armed test-bed, the average reward for each action is drawn from a standard
Gaussian random variable and therefore zero is a good guess at the average
payout of the arms. We might expect that if the mean payouts were shifted
up or down, the performance of the zero baseline may deteriorate, since
zero is no longer a good estimate of the mean arm payout. In contrast, the
updates made by both the value function and MSE minimizing baselines
do not change when a constant is added to all rewards. The situation is
similar in the triangle test-bed, since the expected payout of a path through
the states is also zero. Figure 4.3 shows the performance of the baselines
in both test-beds when the rewards are shifted up or down by 10. In this
case, we see a significant improvement when using non-zero baselines. It
is difficult to see the difference between the value function and the MSE
minimizing baseline, but since these methods are invariant to translations
in the rewards, their difference is exactly the same as in the mean-zero case.
The above comparisons show that for the two test-beds, it is important

Figure 4.3: Comparisons of the three baselines in the ten-armed test-bed
and the triangle test-bed. The mean reward for each action is drawn from
a Gaussian distribution with mean µ and unit standard deviation. In the
first row, µ = −10 and in the second row µ = 10. The left column shows
estimates of the expected total reward in the ten-armed test-bed and the
second column shows the same for the triangle test-bed. Again, the error
bars show 3 standard errors in the mean.
to use a non-zero baseline whenever the mean payout per episode is far
from zero. Moreover, the MSE minimizing baseline consistently gave better
performance than the other baselines for a wide range of step sizes, and its
maximum performance was obtained at a higher step size than for the other
baselines, which supports my hypothesis.

Comparison of Estimated Baseline Functions: Next, I will present
the results of the comparison of the estimated baseline functions. This
case is slightly more complicated than when using the true baseline func-
tions, since we need to choose any parameters that the estimation techniques
have. Rather than carefully optimizing the parameters for the estimation
techniques, I tried to choose the parameters in a way that would be realistic
in practice. First, I will describe the estimation techniques and the param-
eter choices, followed by a comparison of the different baselines. The names
I use for each baseline estimate are prefixed with either the letter Q or G,
which indicates how they are estimated as will be described shortly.
I proposed two different techniques for estimating the MSE minimizing
baseline. In the first estimation, we ignored the correlations between dif-
ferent time-steps in each episode, which gave rise to an approximate form
of the MSE minimizing baseline that is a weighted average of the action
value function. When the MDP has only a single state, there are no corre-
lations to ignore and this approximation is exact. Given an estimate of the
action value function, which can be obtained in various standard ways, we
can substitute the estimated action values into the closed form approxima-
tion of the MSE minimizing baseline. This estimate is appealing because its
only parameters are those of the action value estimation technique, which
in many cases can be chosen according to rules-of-thumb. I will refer to this
estimate as the Qmse baseline (Q for action-values). The second estimation
was obtained by performing stochastic gradient descent to estimate the MSE
minimizing baseline directly from the observed sequences of states, actions,
and rewards. This estimation is appealing because it does not ignore the
correlation between time steps in the episode, but one drawback is that its
step size parameter is difficult to tune. I will refer to this estimate as the

Gmse (G for stochastic gradient descent).
I expect that the choice of the step size for the stochastic gradient descent
algorithm used to compute the Gmse baseline, denoted by ηbl will have a
similar effect on the performance of the agent for all policy gradient step sizes
η. Therefore, to set ηbl for each test bed, I ran the agent with policy gradient
step size η = 0.9 for the correct number of episodes (20 in the ten-armed
test-bed and 50 in the triangle test-bed) 1000 times and chose the baseline
step-size from the set {0.001, 0.01, 0.1, 1.0} that maximized performance.
The best parameter settings were ηbl = 0.01 in the ten-armed test bed and
ηbl = 0.1 in the triangle test-bed.
There are standard techniques for estimating the value function of a pol-
icy in an MDP. Rather than estimating the value function directly, though,
I use the fact that the value function is a weighted average of the action
value function. This gives more accurate estimates in the ten-armed test-
bed, since the action values do not depend on the agent’s current policy. I
will refer to this as the Qvalue baseline.
To estimate the action value function in the ten-armed test-bed, I use
the fact that the action value function does not depend on the agent’s policy,
since there is only one state. In this case, a good estimate of the action value
for a given action is to take the sample average of the observed rewards for
that action. For actions that have not yet been tried, a default value of 0 is
used. The only parameter of this estimation technique for the action values
is the default value. Since it only influences performance early in learning,
I did not tune the default value.
In the triangle test-bed, I use the Sarsa(λ) algorithm to estimate the
action value functions that are passed into the two action-value oriented
baseline estimates. Sarsa(λ) has two parameters, a step size α and an el-
igibility trace parameter λ. Again, I expect that the Sarsa(λ) parameters
should affect the agent’s performance similarly for all policy-gradient step
sizes and all baselines. For this reason, I chose the parameters α and λ
by running the policy gradient method with fixed step size η = 0.9 for 50
episodes 1000 times, and chose the parameters that gave the smallest aver-
age squared error in the action-value estimates at the end of the 50 episodes.

Of all pairs of α in {0.001, 0.01, 0.1, 1.0} and λ in {0.001, 0.01, 0.1, 1.0} the
best setting was to take α = λ = 0.1.
Figure 4.4 shows the parameter studies for each of the estimated baselines
in the two test-beds. As before, I also present the results when the rewards
are shifted up or down by 10. The results of this experiment tell a different
story than what we saw for the true baseline functions. Let µ denote the
amount that the rewards are shifted by. In the first experiments where we
compared the true baseline functions, the MSE minimizing baseline gave
the best performance across a wide range of parameters. In this setting,
however, the baseline with the best performance depends on the step size
and which baseline achieves the highest performance for an optimized step
size depends on the value of µ. These results do not support my hypothesis
and were surprising because I expected the relative performance of the non-
zero baselines to be the same independent of the shift µ.
The differences in performance between the non-zero baselines for the
various values of µ can be explained by an interesting property of policy
gradient methods. Consider the bandit problem and suppose that our base-
line is substantially larger than the rewards. Suppose the agent chooses a
random action A and receives reward R. Then the term (R − b), where b
is the baseline value, is negative with high probability, so the agent will reduce
the probability of choosing action A, even if it was the best action. Since
the action selection probabilities must sum to one, the probability of the
other actions will be increased. On the following episode, the agent will be
more likely to choose an action other than A, even if A was the best avail-
able action. In this way, having a baseline that overestimates the rewards
encourages systematic exploration of the actions. On the other hand, if the
baseline is substantially lower than the rewards, the probability of choosing
action A will always be increased, even if it was the worst action. On the
following episodes, the agent will be more likely to choose the same action
again. This asymmetry between having baselines that are too high or too
low suggests that it is better to have an overestimate, which results in
systematic exploration, rather than an underestimate, which results in less
exploration and more erratic updates to the parameter vector.
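This asymmetry is easy to reproduce in a tiny simulation of the softmax policy-gradient update on a two-armed bandit; the numbers below are arbitrary and the code is only an illustration of the sign effect of (R − b), not one of the thesis test-beds.

```python
import numpy as np

def softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def pg_step(prefs, rewards, baseline, rng, eta=0.1):
    """One REINFORCE-style update of softmax action preferences with a fixed baseline b."""
    pi = softmax(prefs)
    a = rng.choice(len(prefs), p=pi)
    r = rewards[a] + rng.normal()       # noisy reward for the chosen arm
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0               # gradient of log pi(a) with respect to the preferences
    return prefs + eta * (r - baseline) * grad_log_pi

# With rewards near 10, a baseline of 20 makes (R - b) negative, so the chosen
# arm's preference is pushed down and the other arm becomes more likely next time;
# a baseline of 0 makes (R - b) positive, so whichever arm was tried first keeps
# being reinforced.
rng = np.random.default_rng(0)
prefs = np.zeros(2)
for _ in range(5):
    prefs = pg_step(prefs, np.array([10.0, 11.0]), baseline=20.0, rng=rng)
print(prefs)
```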

Figure 4.4: Comparisons of the four estimated baselines (zero, Qvalue, Qmse, and Gmse) in the ten-armed and the triangle test-beds. The mean reward for each action is drawn from a Gaussian distribution with mean µ and unit standard deviation. In the first row, µ = 10; in the second row, µ = 0; and in the third row, µ = −10. The left column shows estimates of the expected total reward in the ten-armed test-bed and the second column shows the same for the triangle test-bed. Again, the error bars show 3 standard errors in the mean.
But why should this asymmetry change which of the non-zero base-
lines performs best? The reason is that both of the non-zero baselines are
weighted averages of the action values. In the case of the Qvalue baseline,
the weights are exactly the action selection probabilities, so it places high
weight on the actions that will have the most reliable action value estimates.
On the other hand, the weights in the Qmse baseline are proportional to
‖∇_θ π(x, a, θ)‖_2^2 /π(x, a, θ). Since these weights scale inversely with the
action selection probabilities, the weighted average depends more heavily on
the actions that are infrequently selected. Therefore, when the initial ac-
tion value estimates are very high, as is the case when µ = −10, we expect
there to be enough exploration for both estimated baselines to become accu-
rate. In this case, the MSE-minimizing baselines perform better. But when
µ = 10, the amount of exploration is reduced, and therefore the value function
estimate becomes more accurate than the MSE baseline estimates. This is
one possible explanation for the difference between the µ = 10 and µ = −10
cases.
To test this hypothesis I ran the experiments again, but this time I
initialized the starting action value estimate to be a better estimate than 0.
In the ten-armed test-bed, I pull each arm once and use the sampled reward
as the initial estimate of the action value, instead of using the default value
of zero. For the triangle test-bed, I compute the true value function and
initialize the action value estimate with this instead. In principle, I could
have run several episodes using Sarsa(λ) to compute a more realistic initial
estimate of the action value function for the triangle MDP, but using the true
values requires less computation and has essentially the same result. Further,
even though this initialization is slightly unrealistic, it shouldn’t favour any
baseline function. Results for this experiment are shown in Figure 4.5. These
results are very similar to those for the true baseline functions and support
my hypothesis. The lesson from this experiment is that when using policy
gradient methods, we should be careful to initialize the baseline function in
a reasonable way so that the agent’s policy does not become too focused
early on, independently of the observed rewards.

Figure 4.5: Comparisons of the four estimated baselines (zero, Qvalue, Qmse, and Gmse) with good initializations in the ten-armed and the triangle test-beds. The mean reward for each action is drawn from a Gaussian distribution with mean µ and unit standard deviation. In the first row, µ = 10; in the second row, µ = 0; and in the third row, µ = −10. The left column shows estimates of the expected total reward in the ten-armed test-bed and the second column shows the same for the triangle test-bed. Again, the error bars show 3 standard errors in the mean.
Chapter 5

Learning in MDPCRs

This chapter describes the second project that I worked on during my MSc,
which focused on designing and analyzing efficient learning algorithms for
loop-free episodic and uniformly ergodic MDPCRs which were introduced in
Section 2.2. We propose three new algorithms: an algorithm for learning in
loop-free episodic MDPCRs under instructive (full-information) feedback,
where the agent observes the entire reward function rt after each action;
an algorithm for learning in loop-free episodic MDPCRs under evaluative
(bandit) feedback, where the agent only observes the reward for the action
she took; and an algorithm for learning in uniformly ergodic MDPCRs under
instructive feedback. We believe that the algorithm for learning under
instructive feedback in ergodic MDPCRs can be extended to learning un-
der evaluative feedback, but the analysis proved to be quite challenging.
In all cases, we assume that the rewards belong to the interval [0, 1]. The
theoretical results for these three algorithms either improve or complement
the results for existing algorithms and often hold even under weaker condi-
tions. This comes at the cost of having increased, though still polynomial,
computational complexity.
A common strategy in computing science is to reduce a problem that we
would like to solve to a problem that we have already solved. The strategy of
this project is to reduce the problem of learning in MDPCRs to the problem
of online linear optimization. Reduction in this case means that if I have an

algorithm for online linear optimization with provable regret bounds, then I
should be able to use that algorithm to achieve a similar regret bound while
learning in an MDPCR. Section 3.3 presented online mirror ascent, which is
an algorithm for online linear optimization with a good regret bound. The
three algorithms proposed in this project are all instances of online mirror
ascent applied to the problem of learning in MDPCRs by way of a reduction
to online linear optimization.
In all cases considered in this project, online mirror ascent cannot be
implemented exactly. The update rule of online mirror ascent has two steps:
first, we compute the unconstrained maximizer of an objective function that
combines the goals of maximizing the most recent payout vector and not
moving too far from the previous choice. Second, we project the uncon-
strained maximizer back onto the set of feasible solutions. In many cases,
the unconstrained maximizer can be computed as a simple closed-form ex-
pression but the projection step is expensive and can only be solved ap-
proximately. A natural question is: how do these approximations impact
the performance of online mirror ascent? The final result of this project
is of independent interest and provides theoretical analysis for a natural
approximate implementation of online mirror ascent.

5.1 Reductions to Online Linear Optimization


In this section, we reduce loop-free episodic MDPCRs and uniformly er-
godic MDPCRs to online linear optimization. Recall that in online linear
optimization, the agent chooses a sequence of points w1 , . . . , wT from a
convex set K ⊂ Rd . Following the agent’s choice of wt , her environment
chooses a payout vector rt and she earns reward equal to rt> wt . Her choice
of wt should only depend on w1:(t−1) and r1:(t−1) , while the environment’s
choice of rt should depend only on w1:t and r1:(t−1) . The agent’s goal is to
choose the sequence wt so that her regret relative to the best-in-hindsight
fixed point in K is small. That is, she wants to minimize

R_T(w_{1:T}, r_{1:T}) = sup_{w ∈ K} Σ_{t=1}^T r_t^⊤ (w − w_t).
We only provide reductions for the instructive feedback setting, where the
agent observes the entire reward function, since our algorithm for the evalu-
ative feedback setting is derived from the instructive case by statistically
estimating the reward function.

5.1.1 Reduction of Loop-Free Episodic MDPCRs


This subsection shows that learning in loop-free episodic MDPCRs can be
reduced to online linear optimization. Recall that in a loop-free episodic
MDPCR, the state space X is partitioned into L layers X1 , . . . , XL and that
each episode starts in X1 , and moves through the layers in order until the
agent reaches XL . As a consequence, every episode visits exactly one state
from each layer. Since each state can be visited at most once in an episode,
we consider the case where the reward function and the agent’s policy only
change at the end of an episode, rather than at every time step. We denote
the reward function and the agent’s policy for the τ th episode by rτ and πτ ,
respectively.
The main idea behind the reduction from learning in loop-free MDPCRs
to online linear optimization is to represent the agent’s policies in such a
way that the expected total reward in the τ th episode is a linear function of
the representation of the policy πτ . With such a policy representation, we
can construct an online linear optimization game where in each round the
agent chooses a policy for episode τ and the linear payout vector for that
round is set so that the agent’s reward in the linear optimization round is
exactly the expected total reward in the τ th episode.
We will represent policies by their occupancy measure. The occupancy
measure of a policy π in a loop-free episodic MDPCR describes how often an
agent following π will visit each state. An agent following policy π will visit
exactly one state in each layer X_ℓ and, since the transitions in an MDPCR are
stochastic, there is a well-defined probability of visiting each state x ∈ X_ℓ.
The (state) occupancy measure of a policy π is the map ν(π) : X → [0, 1]
defined by

ν(x, π) = P(X_ℓ^π = x),

where ℓ ∈ {1, . . . , L} is the layer index such that x ∈ X_ℓ and X_ℓ^π is the
random state from layer X_ℓ visited by the agent. In words, ν(x, π) is the
probability that an agent following policy π will visit state x in an episode.
The (state-action) occupancy measure of a policy, denoted by µ(π), is defined
by

µ(x, a, π) = P(X_ℓ^π = x, A_ℓ^π = a) = ν(x, π)π(x, a).

The quantity µ(x, a, π) is the probability that an agent following policy π


will be in state x and choose action a. For the rest of this section, ν will
always refer to the state occupancy measure of a policy, and µ will always
refer to the state-action occupancy measure.
Our plan is to represent policies by their state-action occupancy mea-
sures and to choose policies by playing an online linear optimization game
over the set of state-action occupancy measures. For this approach to be
sensible, we need to show that the state-action occupancy measures can ac-
tually be used to represent policies (i.e., all policies have one, and the policy
can be determined from only the occupancy measure). In order to apply
online mirror ascent, we need to show that the set of occupancy measures
is a convex set. Finally, to make the connection between the online lin-
ear optimization game and learning in the MDPCR, we need to show that
the expected total episodic reward is a linear function of the state action-
occupancy measure.
First, we show that it is possible to recover a policy from its state-action
occupancy measure.

Lemma 5.1. Let µ : X × A → [0, 1] be the state-action occupancy measure
of some unknown policy π. Set

π̂(x, a) = µ(x, a)/ν(x)  if ν(x) > 0,  and  π̂(x, a) = 1/|A|  otherwise,

where ν(x) = Σ_a µ(x, a). Then π̂(x, a) = π(x, a) for all states x that π
visits with non-zero probability.

Proof. Suppose that π is the unknown policy. Then we have µ(x, a) =
µ(x, a, π) and ν(x) = ν(x, π). From the definition of µ(x, a, π), for each
state x we have

Σ_a µ(x, a, π) = Σ_a ν(x, π)π(x, a) = ν(x, π).

Further, whenever ν(x, π) ≠ 0, we can divide the equation µ(x, a, π) =
ν(x, π)π(x, a) by ν(x, π) to obtain

π(x, a) = µ(x, a, π)/ν(x, π) = µ(x, a, π) / Σ_{a'} µ(x, a', π).

It follows that π̂(x, a) = π(x, a) whenever ν(x) > 0.

This lemma shows that, given only the state-action occupancy measure
of a policy, we can recover the policy’s action-selection probabilities in every
state that it visits with non-zero probability. It is not a serious problem that
we cannot recover the action selection probabilities in the remaining states,
since an agent following the policy will visit them with probability zero.
Therefore, since every policy has a state-action occupancy measure and,
since we can (essentially) recover a policy from any state-action occupancy
measure, we are able to represent policies by their state-action occupancy
measures. In the language of policy gradient methods, we can think of the
map given by Lemma 5.1 as a policy parameterization.
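This map is also easy to compute; a minimal sketch (assuming µ is stored as a |X| × |A| array, which is an illustrative choice rather than anything specified in the thesis) is:

```python
import numpy as np

def policy_from_occupancy(mu):
    """Recover pi(x, a) from a state-action occupancy measure, as in Lemma 5.1.

    `mu` is assumed to be a |X| x |A| array with mu[x, a] = mu(x, a).
    """
    nu = mu.sum(axis=1, keepdims=True)            # nu(x) = sum_a mu(x, a)
    num_actions = mu.shape[1]
    uniform = np.full_like(mu, 1.0 / num_actions)
    with np.errstate(divide="ignore", invalid="ignore"):
        policy = np.where(nu > 0, mu / nu, uniform)  # uniform in unvisited states
    return policy
```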
Next, we want to show that the set of all state-action occupancy measures
K = {µ(π) : π ∈ Π} is a convex subset of R^d, where d = |X × A| is the number
of state-action pairs. We can think of the set of functions {f : X × A →
R} as a d-dimensional vector space by identifying functions with tables (or
vectors) of their values at each of the d state-action pairs. In this space of
functions, the natural inner product is defined by f^⊤ g = Σ_{x,a} f(x, a)g(x, a).

For the rest of this chapter, we treat functions with finite domains as vectors
in finite-dimensional vector spaces together with this inner product. With
this convention, we can show that the set of occupancy measures is a convex
set.

Lemma 5.2. Fix a loop-free episodic MDPCR and let K = {µ(π) : π ∈ Π} ⊂
R^d be the set of occupation measures. Then

K = { µ : X × A → [0, 1] : ν(x_start) = 1, ∀x' ∈ X : ν(x') = Σ_{x,a} µ(x, a)P(x, a, x') },

where we used the shorthand ν(x) = Σ_a µ(x, a). Moreover, since K is
defined by a set of linear inequalities, it is a convex subset of R^d.
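Since K is cut out by finitely many linear constraints, it can be handed directly to a generic convex solver, which is also what the approximate projection step used later in this chapter needs. The sketch below is illustrative only: the state indexing, the layer list, and the dense transition array are assumptions, not part of the thesis. It assembles the equality constraints of Lemma 5.2, imposing the flow constraints on the layers below the first one (whose occupancy is pinned by ν(x_start) = 1); together with 0 ≤ µ ≤ 1 these describe the polytope K.

```python
import numpy as np

def occupancy_constraints(P, layers, start_state):
    """Assemble equality constraints A_eq @ mu_flat = b_eq describing K.

    P[x, a, x2] is the transition kernel, `layers` is a list of lists of state
    indices (layers[0] contains the start state), and the variables are the
    entries of mu flattened as mu[x, a].  Illustrative sketch only.
    """
    num_states, num_actions, _ = P.shape
    d = num_states * num_actions
    idx = lambda x, a: x * num_actions + a

    rows, rhs = [], []

    # nu(x_start) = sum_a mu(x_start, a) = 1.
    row = np.zeros(d)
    for a in range(num_actions):
        row[idx(start_state, a)] = 1.0
    rows.append(row)
    rhs.append(1.0)

    # Flow constraints for every state below the first layer:
    # sum_a mu(x2, a) = sum_{x, a} mu(x, a) P[x, a, x2].
    for layer in layers[1:]:
        for x2 in layer:
            row = np.zeros(d)
            for a in range(num_actions):
                row[idx(x2, a)] += 1.0
            for x in range(num_states):
                for a in range(num_actions):
                    row[idx(x, a)] -= P[x, a, x2]
            rows.append(row)
            rhs.append(0.0)

    return np.vstack(rows), np.array(rhs)
```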

Finally, we want to show that the expected total reward of an agent


following policy πτ in the τ th epsiode is a linear function of µ(πτ ). We
obtain this result by applying the following lemma with f = rτ .

Lemma 5.3. Let π be any policy for a loop-free episodic MDPCR and let
f : X × A → R be any function. Then

E[ Σ_{ℓ=1}^L f(X_ℓ^π, A_ℓ^π) ] = f^⊤ µ(π).

Proof. The proof follows from the fact that for each state x in the ℓth layer,
and for each action a, we have that µ(x, a, π) = P(X_ℓ^π = x, A_ℓ^π = a):

E[ Σ_{ℓ=1}^L f(X_ℓ^π, A_ℓ^π) ] = Σ_{ℓ=1}^L E[f(X_ℓ^π, A_ℓ^π)]
  = Σ_{ℓ=1}^L Σ_{x∈X_ℓ, a} P(X_ℓ^π = x, A_ℓ^π = a) f(x, a)
  = Σ_{ℓ=1}^L Σ_{x∈X_ℓ, a} µ(x, a, π) f(x, a)
  = Σ_{x,a} µ(x, a, π) f(x, a)
  = f^⊤ µ(π).

Combining Lemmas 5.1, 5.2, and 5.3, we have the following reduction
from learning in loop-free episodic MDPCRs to online linear optimization.

Theorem 5.4. Let M = (X , A, P, (rτ )τ ∈N ) be a loop-free episodic MDPCR.
Then, the regret of any sequence of policies π1 , . . . , πT relative to the set
of Markov policies is equal to the regret of the sequence of state-action oc-
cupation measures µ(π1 ), . . . , µ(πT ) in an online linear optimization game
where K = {µ(π) : π ∈ Π} and the adversary chooses the payout vector rτ
for the τ th round to be equal to the reward function in the MDPCR for the
τ th episode.

5.1.2 Reduction of Uniformly Ergodic MDPCRs


This subsection shows that learning in uniformly ergodic MDPCRs can be
reduced to online linear optimization. Recall that in a uniformly ergodic
MDPCR, every policy π has a unique stationary distribution ν(π) ∈ ∆X and
that each policy converges to its stationary distribution uniformly quickly.
The stationary distribution over state-action pairs for a policy π, denoted
by µ(π), is defined by

µ(x, a, π) = ν(x, π)π(x, a).

The reduction presented in this section is very similar to the reduction in


the previous section with the state-action stationary distribution replacing
the state-action occupancy measure.
In this case, we represent the agent’s policies by their stationary distri-
bution over the set of state-action pairs. Again, we need to show that it is
possible to recover a policy from its stationary distribution, that the set of
stationary distributions is convex, and that we can establish a relationship
between the regret in an online linear optimization game and the regret in a
uniformly ergodic MDPCR. In the loop-free episodic case, the relationship
was very straightforward, while in this case the situation is slightly more
subtle.
First, we show that it is possible to recover a policy π from its stationary
distribution µ(π).

Lemma 5.5. Let µ : X × A → [0, 1] be the state-action stationary distribution
of an unknown policy π. Set

π̂(x, a) = µ(x, a)/ν(x)  if ν(x) > 0,  and  π̂(x, a) = 1/|A|  otherwise,

where ν(x) = Σ_a µ(x, a). Then π̂(x, a) = π(x, a) for all states x with non-
zero probability in the stationary distribution of π.

Proof. The proof is identical to the proof of Lemma 5.1.

Next, we show that the set of stationary distributions is a convex subset


of Rd when we identify the set of functions {f : X × A → R} with tables
(or vectors) of their values at each of the d = |X × A| state-action pairs.

Lemma 5.6. Fix a uniformly ergodic MDPCR and let K = {µ(π) : π ∈ Π} ⊂
R^d denote the set of stationary distributions. Then

K = { µ : X × A → [0, 1] : ∀x' ∈ X : ν(x') = Σ_{x,a} µ(x, a)P(x, a, x') },

where we used the shorthand ν(x) = Σ_a µ(x, a). Moreover, since K is
defined by a set of linear inequalities, it is a convex subset of R^d.

Finally, we want to show that the regret of an agent following policies


π1 , . . . , πT in the uniformly ergodic MDPCR can somehow be related to
linear functions of µ(π1 ), . . . , µ(πT ). In the loop-free episodic case, the ex-
pected reward in each episode was exactly the inner product rτ> µ(πτ ). In
the uniformly ergodic case, the inner product rt> µ(πt ) is the long-term av-
erage reward of the policy πt in the MDP (not MDPCR) with deterministic
rewards given by rt and with the same states, actions, and transition prob-
abilities as the MDPCR. The following lemma shows that we can bound
the agent’s regret in the MDPCR in terms of linear functions of the stationary
distributions.

Lemma 5.7. Fix a uniformly ergodic MDPCR with mixing time τ < ∞
and suppose that the reward functions satisfy r_t(x, a) ∈ [0, 1] for all times,
states, and actions. Let B > 0 and suppose π_1, . . . , π_T is any sequence of
policies with ‖ν(π_{t−1}) − ν(π_t)‖_1 ≤ B for all times t = 2, . . . , T. Then the
regret of the sequence of policies π_1, . . . , π_T relative to any fixed policy π
can be bounded as follows:

E_π[ Σ_{t=1}^T r_t(X_t, A_t) ] − E_{π_{1:T}}[ Σ_{t=1}^T r_t(X_t, A_t) ]
  ≤ Σ_{t=1}^T r_t^⊤( µ(π) − µ(π_t) ) + (τ + 1)TB + 4τ + 4.

Proof. Recall that we use the notation ν(π) and µ(π) for the state and
state-action stationary distributions of the policy π, respectively. We now
introduce notation for the finite-time distributions.

Notation for following the policy π: Let ν̃tπ (x) = Pπ (Xt = x) be the
probability that an agent following policy π visits state x at time t and let
µ̃πt (x, a) = Pπ (Xt = x, At = a) be the probability that she takes action a
from state x at time t. We have

µ̃πt (x, a) = Pπ (Xt = x, At = a)


= Pπ (At = a | Xt = x)Pπ (Xt = x)
= π(x, a)ν̃tπ (x).

The following recursive expression for ν̃tπ will be useful: Since the agent
starts in the state xstart with probability one, we know that ν̃1π (x) = I {x = xstart }.
For each t ≥ 1, we have
ν̃^π_{t+1} = ν̃^π_t P^π,

where P π is as in Definition 2.15 and is an operator on ∆X that corresponds


to taking a single step according to the policy π.

Notation for following the sequence of policies π1:T : Similarly, let


ν̃t (x) = Pπ1:T (Xt = x) be the probability that an agent following the se-
quence of policies π1:T visits state x at time t and let µ̃t (x, a) = Pπ1:T (Xt =

x, At = a) be the probability that she takes action a from state x. Again,
we have µ̃t (x, a) = πt (x, a)ν̃t (x), and we can express ν̃t recursively: ν̃1 (x) =
I {x = xstart } and
ν̃t+1 = ν̃t P πt .

With the above notation, we are ready to prove the lemma. First, we
rewrite the two expectations as sums:

E_π[ Σ_{t=1}^T r_t(X_t, A_t) ] = Σ_{t=1}^T E_π[r_t(X_t, A_t)]
  = Σ_{t=1}^T Σ_{x,a} µ̃^π_t(x, a) r_t(x, a)
  = Σ_{t=1}^T r_t^⊤ µ̃^π_t.

Similarly,

E_{π_{1:T}}[ Σ_{t=1}^T r_t(X_t, A_t) ] = Σ_{t=1}^T E_{π_{1:T}}[r_t(X_t, A_t)]
  = Σ_{t=1}^T Σ_{x,a} µ̃_t(x, a) r_t(x, a)
  = Σ_{t=1}^T r_t^⊤ µ̃_t.

With this, the expected regret of the policies π_1, . . . , π_T relative to the fixed
policy π can be written as:

E_π[ Σ_{t=1}^T r_t(X_t, A_t) ] − E_{π_{1:T}}[ Σ_{t=1}^T r_t(X_t, A_t) ] = Σ_{t=1}^T r_t^⊤(µ̃^π_t − µ̃_t).

We can add and subtract the stationary distributions of π and πt into the

tth term of the sum above to obtain the following decomposition:

Σ_{t=1}^T r_t^⊤(µ̃^π_t − µ̃_t) = Σ_{t=1}^T r_t^⊤(µ̃^π_t − µ(π)) + Σ_{t=1}^T r_t^⊤(µ(π) − µ(π_t)) + Σ_{t=1}^T r_t^⊤(µ(π_t) − µ̃_t).   (5.1)

The middle term of the above decomposition appears in the bound from
the statement of the lemma, so it remains to upper bound the first and last
term by (τ + 1)BT + 4τ + 4.
To bound the first term we use the following lemma from [NGSA2014].

Lemma 5.8 (Lemma 1 from [NGSA2014]).

Σ_{t=1}^T r_t^⊤(µ̃^π_t − µ(π)) ≤ 2τ + 2.   (5.2)

To bound the last term, we use the following technical lemma:

Lemma 5.9. For each time t, we have

‖ν̃_t − ν(π_t)‖_1 ≤ 2e^{−(t−1)/τ} + B Σ_{s=0}^{t−2} e^{−s/τ} ≤ 2e^{−(t−1)/τ} + B(τ + 1).

Moreover, for each time t, we have

‖µ̃_t − µ(π_t)‖_1 ≤ 2e^{−(t−1)/τ} + B(τ + 1).

Proof. We prove the first inequality by induction on t. The base case is
when t = 1. By the triangle inequality and the fact that ν̃_1 and ν(π_1) are
distributions, we have ‖ν̃_1 − ν(π_1)‖_1 ≤ ‖ν̃_1‖_1 + ‖ν(π_1)‖_1 ≤ 2 = 2e^{−(1−1)/τ} +
B Σ_{s=0}^{−1} e^{−s/τ}. Therefore, the claim holds when t = 1. Now suppose that
the claim holds for t. Then we have

‖ν̃_{t+1} − ν(π_{t+1})‖_1 ≤ ‖ν̃_{t+1} − ν(π_t)‖_1 + ‖ν(π_t) − ν(π_{t+1})‖_1   (Triangle Inequality)
  ≤ ‖ν̃_t P^{π_t} − ν(π_t)P^{π_t}‖_1 + B   (Stationarity of ν(π_t))
  ≤ e^{−1/τ} ‖ν̃_t − ν(π_t)‖_1 + B   (Uniformly Ergodic)
  ≤ e^{−1/τ} ( 2e^{−(t−1)/τ} + B Σ_{s=0}^{t−2} e^{−s/τ} ) + B   (Induction Hypothesis)
  = 2e^{−((t+1)−1)/τ} + B Σ_{s=0}^{(t+1)−2} e^{−s/τ}.

It follows that the first inequality holds for all times t. The second inequality
follows from the first, together with the fact that

Σ_{s=0}^{t−2} e^{−s/τ} ≤ 1 + ∫_0^∞ e^{−s/τ} ds = 1 + τ.

The final inequality is proved as follows:

‖µ̃_t − µ(π_t)‖_1 = Σ_{x,a} |µ̃_t(x, a) − µ(x, a, π_t)|
  = Σ_{x,a} |ν̃_t(x)π_t(x, a) − ν(x, π_t)π_t(x, a)|
  = Σ_x |ν̃_t(x) − ν(x, π_t)| Σ_a π_t(x, a)
  = Σ_x |ν̃_t(x) − ν(x, π_t)|
  = ‖ν̃_t − ν(π_t)‖_1
  ≤ 2e^{−(t−1)/τ} + B(τ + 1).

We are finally ready to bound the sum Σ_{t=1}^T r_t^⊤(µ̃_t − µ(π_t)):

Σ_{t=1}^T r_t^⊤(µ̃_t − µ(π_t)) ≤ Σ_{t=1}^T ‖r_t‖_∞ ‖µ̃_t − µ(π_t)‖_1   (Hölder's Inequality)
  ≤ Σ_{t=1}^T ( 2e^{−(t−1)/τ} + B(τ + 1) )   (Lemma 5.9, ‖r_t‖_∞ ≤ 1)
  = (τ + 1)BT + 2 Σ_{t=1}^T e^{−(t−1)/τ}
  ≤ (τ + 1)BT + 2τ + 2,   (5.3)

where in the last line we again used Σ_{t=1}^T e^{−(t−1)/τ} ≤ τ + 1.
Substituting inequalities (5.2) and (5.3) into the regret decomposition
(5.1) proves the lemma.

Combining Lemmas 5.5, 5.6 and 5.7 gives the following reduction from
learning in uniformly ergodic MDPCRs to online linear optimization.

Theorem 5.10. Fix a uniformly ergodic MDPCR with mixing time τ < ∞
and bounded rewards: rt (x, a) ∈ [0, 1] for all states, actions, and times. Con-
sider the online linear optimization problem where K is the set of stationary
distributions over state-action pairs and the environment chooses the payout
vector in round t to be equal to the tth MDPCR reward function. Suppose
an agent for the online linear optimization game chooses the stationary dis-
tributions µ(π1 ), . . . , µ(πT ). Then we can essentially recover the policies
π1 , . . . , πT , and if an agent follows those policies in the MDPCR, then her
regret is bounded by

E_π[ Σ_{t=1}^T r_t(X_t, A_t) ] − E_{π_{1:T}}[ Σ_{t=1}^T r_t(X_t, A_t) ]
  ≤ Σ_{t=1}^T r_t^⊤( µ(π) − µ(π_t) ) + (τ + 1)TB + 4τ + 4

for any B such that B ≥ ‖ν(π_{t−1}) − ν(π_t)‖_1 for all t = 2, . . . , T.

This reduction shows that if we can achieve low regret in an online
linear optimization problem, and if the sequence of choices doesn’t change too
quickly, then we can also achieve low regret in a uniformly ergodic MDPCR.

5.2 Online Mirror Ascent with Approximate Projections
A natural idea is to use online mirror ascent to learn low-regret sequences of
policies for MDPCRs by way of their reduction to online linear optimization.
Recall that the online mirror ascent update has two steps: first we compute an
update that is not constrained to the set K, which we then project back onto
K. In most cases, the unconstrained update can be expressed as a closed-form
expression that can be evaluated efficiently. When the set K is simple (such
as the unit ball or the probability simplex) and the Bregman divergence
is chosen appropriately, the projection step may also have a closed-form
expression that can be evaluated efficiently. In general, however, computing
the Bregman projection onto the set K is a convex optimization problem
whose solution must be approximated iteratively by, for example, interior
point optimization methods. This section addresses the important question:
How much additional regret is incurred by using approximate projections?
To the best of our knowledge, this is the first formal analysis of online
mirror ascent with approximate projections, despite the fact that in most
applications the projection step must be approximated.
Formally, we consider the following notion of approximate projection:
Fix any constant c > 0. Let R : S → R be a σ-strongly convex function
with respect to the norm ‖·‖ and let K be a convex subset of S. For any
point w ∈ S, we say that a point w' ∈ K is a c-approximate projection of
w onto K with respect to the Bregman divergence D_R if ‖w' − w*‖ < c,
where w* = argmin_{u∈K} D_R(u, w) is the exact projection. Algorithm 8 gives
pseudocode for online mirror ascent with c-approximate projections.

Input: Step size η > 0, Regularizer R : S → R, S ⊃ K, and a
black-box P_K that computes c-approximate projections onto K
1 Choose w_1 ∈ K arbitrarily;
2 for each round t = 1, 2, . . . do
3 Optionally use w_t in some other computation;
4 Set w_{t+1/2} = argmax_{u∈S} η r_t^⊤ u − D_R(u, w_t);
5 Set w_{t+1} = P_K(w_{t+1/2});
6 end
Algorithm 8: Online Mirror Ascent with c-approximate Projections

Theorem 5.11. Let R : S → R be a convex Legendre function and K ⊂ S be
a convex set such that ∇R is L-Lipschitz on K wrt ‖·‖ (that is, ‖∇R(u) − ∇R(w)‖ ≤
L ‖u − w‖ for all u, w ∈ K). Let D = sup_{u,v∈K} ‖u − v‖_∗ be the diameter
of K with respect to the dual norm of ‖·‖. Then the regret of online mirror
ascent with c-approximate projections, step size η > 0, and regularizer R
satisfies

Σ_{t=1}^T r_t^⊤(w − w_t) ≤ D_R(w, w_1)/η + cLDT/η + Σ_{t=1}^T r_t^⊤(w_{t+1/2} − w_t).

Moreover, when c = 0 the claim holds even when L = ∞.

Proof. This roughly follows the proof of Theorem 15.4 from [GPS2014] with
the appropriate modifications to handle the case where the projections are
only c-approximate.
Let w_1, . . . , w_T ∈ K be the sequence of points generated by online mirror
ascent with c-approximate projections, let w_{t+1/2} be the unprojected
updates for t = 1, . . . , T, and, finally, let w*_{t+1} = argmin_{u∈K} D_R(u, w_{t+1/2})
be the exact projection of w_{t+1/2} onto the set K.
For each t = 1, . . . , T, we know that w_{t+1/2} is the unconstrained maximizer
of the concave objective function J_t(u) = η r_t^⊤ u − D_R(u, w_t) and therefore
∇J_t(w_{t+1/2}) = 0. We can compute the gradient of J_t to be

∇J_t(u) = ∇( η r_t^⊤ u − R(u) + R(w_t) + ∇R(w_t)^⊤(u − w_t) ) = η r_t − ∇R(u) + ∇R(w_t).

Rearranging the condition ∇J_t(w_{t+1/2}) = 0 gives

r_t = (1/η)( ∇R(w_{t+1/2}) − ∇R(w_t) ).

Therefore, for each time t we have

r_t^⊤(w − w_t) = (1/η)( ∇R(w_{t+1/2}) − ∇R(w_t) )^⊤ (w − w_t)
  = (1/η)( D_R(w, w_t) − D_R(w, w_{t+1/2}) + D_R(w_t, w_{t+1/2}) ),

where the second line is obtained by a long but straight-forward calculation.


From the Pythagorean theorem for Bregman divergences, we have that

D_R(w, w_{t+1/2}) ≥ D_R(w, w*_{t+1}) + D_R(w*_{t+1}, w_{t+1/2}) ≥ D_R(w, w*_{t+1}).

Substituting this above gives

r_t^⊤(w − w_t) ≤ (1/η)( D_R(w, w_t) − D_R(w, w*_{t+1}) + D_R(w_t, w_{t+1/2}) )
  = (1/η)( D_R(w, w_t) − D_R(w, w_{t+1}) + D_R(w, w_{t+1}) − D_R(w, w*_{t+1}) + D_R(w_t, w_{t+1/2}) ).

Summing from t = 1 to T, the first two terms of the above expression will
telescope, leaving only the first and last:

Σ_{t=1}^T r_t^⊤(w − w_t) ≤ D_R(w, w_1)/η − D_R(w, w_{T+1})/η + (1/η) Σ_{t=1}^T D_R(w_t, w_{t+1/2})
    + (1/η) Σ_{t=1}^T ( D_R(w, w_{t+1}) − D_R(w, w*_{t+1}) )
  ≤ D_R(w, w_1)/η + (1/η) Σ_{t=1}^T D_R(w_t, w_{t+1/2})
    + (1/η) Σ_{t=1}^T ( D_R(w, w_{t+1}) − D_R(w, w*_{t+1}) ).   (5.4)

All that remains is to bound the two sums in (5.4).


We can bound the first sum as follows: since Bregman divergences are
non-negative, we have

D_R(w_t, w_{t+1/2}) ≤ D_R(w_t, w_{t+1/2}) + D_R(w_{t+1/2}, w_t)
  = ( ∇R(w_t) − ∇R(w_{t+1/2}) )^⊤ (w_t − w_{t+1/2})
  = η r_t^⊤(w_{t+1/2} − w_t).

Substituting this into the first sum gives

(1/η) Σ_{t=1}^T D_R(w_t, w_{t+1/2}) ≤ Σ_{t=1}^T r_t^⊤(w_{t+1/2} − w_t).

We can bound the second sum as follows: First, if c = 0 then w_t = w*_t
and the sum is zero. In this case, we never needed the condition that ∇R
was L-Lipschitz. If c > 0, then, since R is a convex function, we have that

R(w_t) ≥ R(w*_t) + ∇R(w*_t)^⊤(w_t − w*_t).

Rearranging this inequality gives

R(w*_t) − R(w_t) ≤ ∇R(w*_t)^⊤(w*_t − w_t).

Expanding D_R(w, w_t) − D_R(w, w*_t) and using the above inequality gives

D_R(w, w_t) − D_R(w, w*_t) = R(w*_t) − R(w_t) + ∇R(w*_t)^⊤(w − w*_t) − ∇R(w_t)^⊤(w − w_t)
  ≤ ∇R(w*_t)^⊤(w − w_t) − ∇R(w_t)^⊤(w − w_t)
  = ( ∇R(w*_t) − ∇R(w_t) )^⊤ (w − w_t)
  ≤ ‖∇R(w*_t) − ∇R(w_t)‖ ‖w − w_t‖_∗
  ≤ L ‖w*_t − w_t‖ D
  ≤ cLD.

Finally, substituting this into the second sum gives

(1/η) Σ_{t=1}^T ( D_R(w, w_t) − D_R(w, w*_t) ) ≤ cLDT/η.
Substituting the above bounds into (5.4) completes the proof.

When the regularizer R is σ-strongly convex wrt ‖·‖, we can use the
following lemma to bound the sum Σ_t r_t^⊤(w_{t+1/2} − w_t) in Theorem 5.11.

Lemma 5.12. Let R : S → R be a σ-strongly convex Legendre function
wrt the norm ‖·‖, η > 0, w_t ∈ S, r_t ∈ R^d, and define w_{t+1/2} to be the
unconstrained mirror ascent update w_{t+1/2} = argmax_{u∈S} η r_t^⊤ u − D_R(u, w_t).
Then

r_t^⊤(w_{t+1/2} − w_t) ≤ (η/σ) ‖r_t‖²_∗,

where ‖·‖_∗ denotes the dual norm of ‖·‖.

Proof. As in the proof of Theorem 5.11, we have that r_t = (1/η)( ∇R(w_{t+1/2}) −
∇R(w_t) ). Since R is σ-strongly convex, for all u and v in K, we have

R(u) ≥ R(v) + ∇R(v)^⊤(u − v) + (σ/2) ‖u − v‖².

Summing and rearranging two instances of this inequality, one with u = w_t
and v = w_{t+1/2}, and one with u = w_{t+1/2} and v = w_t, gives

‖w_t − w_{t+1/2}‖² ≤ (1/σ)( ∇R(w_{t+1/2}) − ∇R(w_t) )^⊤ (w_{t+1/2} − w_t)
  ≤ (1/σ) ‖∇R(w_{t+1/2}) − ∇R(w_t)‖_∗ ‖w_t − w_{t+1/2}‖
  = (η/σ) ‖r_t‖_∗ ‖w_t − w_{t+1/2}‖.

Dividing both sides by ‖w_t − w_{t+1/2}‖ shows that ‖w_t − w_{t+1/2}‖ ≤ (η/σ) ‖r_t‖_∗.
Therefore,

r_t^⊤(w_{t+1/2} − w_t) ≤ ‖r_t‖_∗ ‖w_{t+1/2} − w_t‖ ≤ (η/σ) ‖r_t‖²_∗,

completing the proof.
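To make the update rule of Algorithm 8 concrete, here is a minimal Python sketch, specialized to the unnormalized negentropy regularizer for which the unconstrained step has the closed form derived in the next section. The reward oracle and the approximate projection are passed in as assumed black boxes; this is an illustration, not the implementation analyzed above.

```python
import numpy as np

def online_mirror_ascent(T, get_reward, approx_project, eta, w_init):
    """Online mirror ascent with c-approximate projections (a sketch of Algorithm 8).

    `get_reward(t, w)` is assumed to return the payout vector r_t, and
    `approx_project(w)` is the black-box that returns a c-approximate Bregman
    projection onto the feasible set K.
    """
    w = np.array(w_init, dtype=float)
    total_reward = 0.0
    for t in range(T):
        r = get_reward(t, w)                 # environment reveals the payout vector
        total_reward += float(r @ w)
        w_half = w * np.exp(eta * r)         # unconstrained step for the negentropy regularizer
        w = approx_project(w_half)           # c-approximate projection back onto K
    return w, total_reward
```

For instance, on the probability simplex the exact Bregman projection under the entropy regularizer is plain renormalization, so `approx_project = lambda w: w / w.sum()` would serve as a zero-error projection in that special case.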

5.3 Learning Algorithms and Regret Bounds


This section introduces three new learning algorithms for MDPCRs. All
three algorithms use online mirror ascent with approximate projections to
choose a sequence of occupancy measures / stationary distributions in the
online linear optimization problems from Section 5.1. All of these algorithms
have the same interpretation: On each time step (or episode), the agent ob-
serves the rewards in some or all of the states. Following this observation,
the agent updates her policy so that the occupancy measure / stationary
distribution places more weight on the states that had high rewards. In-
tuitively, the agent makes a small update to her policy so that she spends
more time taking actions from states which give high rewards.
In order to get the best regret bounds, we should choose the regularizer
function R for online mirror ascent so that the induced Bregman divergence
matches the geometry of the underlying problem. In the uniformly ergodic
MDPCR case, the set K consists of probability distributions, and it is natu-
ral to measure distances between them in terms of the Kullback-Leibler (KL)
divergence. Recall that we identify the set of functions {f : X × A → R} as
a d = |X × A|-dimensional vector space. Consider the regularizer J : (0, ∞)^d → R
defined by

J(w) = Σ_{x,a} ( w(x, a) ln(w(x, a)) − w(x, a) ).

This is the so-called unnormalized negative entropy (negentropy) regularizer,


and the induced Bregman divergence is

D_J(u, w) = Σ_{x,a} ( u(x, a) ln( u(x, a)/w(x, a) ) + w(x, a) − u(x, a) ),

which is the unnormalized relative entropy between the non-negative func-


tions u and w. When u and w are probability vectors (i.e., their components
sum to one), then DJ (u, w) is exactly the KL divergence between the vectors.
Similarly, in the loop free episodic case, the set K consists of occu-
pancy measures, which are distributions when restricted to each of the lay-
ers X1 , . . . , XL of the MDPCR. In this case, a natural choice is to choose
the regularizer so that the induced Bregman divergence is the sum of the
KL-divergences between the probability distributions on each layer. The
unnormalized negentropy regularizer again accomplishes this.
For the regularizer J, the unconstrained update step of online mirror
ascent is defined by

w_{t+1/2} = argmax_{u∈(0,∞)^d} ( η r_t^⊤ u − Σ_{x,a} ( u(x, a) ln( u(x, a)/w_t(x, a) ) + w_t(x, a) − u(x, a) ) ).

This is a concave function of u, so we can find the maximizer by taking the


derivative and setting it to zero, which yields

w_{t+1/2}(x, a) = w_t(x, a) exp(η r_t(x, a))

for each state x and action a.
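The closed form can be checked numerically; the following small sketch (with arbitrary random inputs, purely for illustration) verifies that the gradient of the mirror-ascent objective vanishes at w_t exp(η r_t) and that nearby points have smaller objective value.

```python
import numpy as np

def unnormalized_kl(u, w):
    """D_J(u, w) = sum_i u_i ln(u_i / w_i) + w_i - u_i."""
    return np.sum(u * np.log(u / w) + w - u)

rng = np.random.default_rng(0)
d, eta = 4, 0.5
w = rng.uniform(0.1, 1.0, size=d)
r = rng.uniform(0.0, 1.0, size=d)

u_star = w * np.exp(eta * r)                 # claimed closed-form maximizer

# Gradient of  eta * r^T u - D_J(u, w)  is  eta * r - ln(u / w); it vanishes at u_star.
grad_at_u_star = eta * r - np.log(u_star / w)
print(np.allclose(grad_at_u_star, 0.0))      # True

# The objective is smaller at perturbed points, as expected for the maximizer.
obj = lambda u: eta * r @ u - unnormalized_kl(u, w)
print(obj(u_star) >= obj(1.05 * u_star), obj(u_star) >= obj(0.95 * u_star))  # True True
```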


Moreover, suppose that the set K ⊂ { w ∈ (0, ∞)^d : ‖w‖_1 ≤ B }. Then
we have that J is 1/B-strongly convex with respect to ‖·‖_1 on the set K.
Lemma 5.13 (Example 2.5 from [S2012]). The function

R(w) = Σ_{i=1}^d ( w(i) log(w(i)) − w(i) )

is 1/B-strongly convex with respect to ‖·‖_1 over the set

S = { w ∈ R^d : w(i) > 0, ‖w‖_1 ≤ B }.

There are two problems with using the unnormalized negentropy regu-
larizer, both coming from the fact that ∇J is not Lipschitz continuous
on the set of occupancy measures or the set of stationary distributions. To
see this, note that the partial derivatives of J are given by

∂J(w)/∂w(x, a) = ln(w(x, a)),

which goes to −∞ as w(x, a) goes to 0. In general, there will be poli-


cies that have occupancy measures or stationary distributions with com-
ponents equal to zero, which means the gradients of J will be unbounded.
This prevents us from applying the results from Theorem 5.11 and makes
it challenging to compute c-approximate projections. In each case, we
deal with this by approximating K with the slightly smaller set Kα =
{µ ∈ K : ∀x, a : µ(x, a) ≥ α} which contains only occupancy measures or
stationary distributions that put at least mass α on every state-action pair.
We will be able to use online mirror ascent with approximate projections
to choose occupancy measures / stationary distributions from the set Kα
which have low regret relative to the best in Kα , and we will show that the
best policy in K can’t be much better than the best policy in Kα , which
gives us a regret bound relative to the entire set of Markov policies. The
restricted set Kα forces the agent to choose policies that explore the state
and action spaces sufficiently well. In the evaluative feedback setting, where
the agent must explore to find good actions, this would be a good idea even
if the theory did not require it.
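In code, the restriction to K_α is typically enforced exactly by this mixing step; a minimal sketch, with `mu_exp` standing in for the occupancy measure or stationary distribution of the assumed exploration policy:

```python
import numpy as np

def mix_with_exploration(mu, mu_exp, delta, beta):
    """Return mu_delta = (1 - delta) * mu + delta * mu_exp and check membership in K_{delta*beta}.

    `mu` and `mu_exp` are occupancy measures (or stationary distributions) stored as
    arrays over state-action pairs; `mu_exp` is assumed to satisfy mu_exp >= beta
    everywhere, which is exactly the assumption on the exploration policy.
    """
    mu, mu_exp = np.asarray(mu, dtype=float), np.asarray(mu_exp, dtype=float)
    mu_delta = (1.0 - delta) * mu + delta * mu_exp
    assert np.all(mu_delta >= delta * beta - 1e-12), "mixture left the restricted set"
    return mu_delta
```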
The following subsections give detailed descriptions of the three algorithms
and corresponding regret bounds.

5.3.1 Loop Free Episodic MDPCRs with Instructive Feedback
This section introduces an algorithm for learning in loop free episodic MD-
PCRs under instructive feedback, where the agent observes the entire reward
function rt after choosing her action at time t. For the remainder of this
section, fix a loop-free episodic MDPCR M = (X , A, P, (rt )t∈N ) with layers
X1 , . . . , XL and rτ (x, a) ∈ [0, 1] for all states, actions, and episode indices.
Let d = |X × A| be the number of state action pairs and set K ⊂ Rd
to be the convex set of occupancy measures described by Lemma 5.2. Fi-
nally, let β > 0 be such that there exists an exploration policy πexp with
µ(x, a, πexp ) ≥ β for all states x and actions a. This guarantees that for all
α < β, the set Kα is non-empty.
Algorithm 9 gives pseudocode for the proposed method and Theorem 5.14
applies the lemmas from the previous section to get a regret bound.

Input: Step size η > 0, exploration constant δ ∈ (0, 1],


approximation constant c > 0
1 Choose µ1 ∈ Kδβ arbitrarily;
2 for Each episode index τ = 1, . . . , T do
3 Execute one episode following policy πτ , obtained from µτ
according to Lemma 5.1;
4 Receive complete reward function rτ from environment;
5 Set µτ +1/2 (x, a) = µτ (x, a) exp(ηrτ (x, a)) for each state x and
action a;
6 Set µτ+1 = PKδβ (µτ+1/2 ), where PKδβ is a black-box that
computes c-approximate projections onto Kδβ wrt ‖·‖1 .
7 end
Algorithm 9: Approximate Online Mirror Ascent for Loop Free
Episodic MDPCRs Under Instructive Feedback
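A compact sketch of the per-episode loop of Algorithm 9 is given below. The episode runner and the approximate projection are assumed black boxes (they are not specified in the thesis), and the policy-recovery step is the map from Lemma 5.1, which is safe to compute with a plain division because every entry of µ in Kδβ is at least δβ > 0.

```python
import numpy as np

def omd_loop_free_instructive(T, mu_init, run_episode, project_K_delta_beta, eta):
    """A sketch of Algorithm 9 (instructive feedback).

    `run_episode(policy)` is assumed to execute one episode with the given policy and
    return the full reward function r_tau as a |X| x |A| array; `project_K_delta_beta`
    is the black-box c-approximate Bregman projection onto K_{delta*beta}.
    """
    mu = np.array(mu_init, dtype=float)              # occupancy measure in K_{delta*beta}
    for tau in range(T):
        policy = mu / mu.sum(axis=1, keepdims=True)  # recover the policy (Lemma 5.1)
        r = run_episode(policy)                      # observe the entire reward function
        mu_half = mu * np.exp(eta * r)               # unconstrained negentropy update
        mu = project_K_delta_beta(mu_half)           # approximate projection step
    return mu
```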

Theorem 5.14. Let M be a loop free episodic MDPCR, π ∈ Π be any
Markov policy, and π_1, . . . , π_T be the sequence of policies produced by Algo-
rithm 9 with parameters δ ∈ (0, 1], c = βδη/√T, and η = √(D_max/(LT)), where L is the
number of layers in the MDPCR and D_max ≥ sup_{µ∈Kδβ} D_J(µ, µ(π_1)). Then,
the regret of an agent that follows the sequence of policies π_{1:T} relative to
the fixed policy π is bounded as follows:

E_π[ Σ_{t=1}^T r_t(X_t, A_t) ] − E_{π_{1:T}}[ Σ_{t=1}^T r_t(X_t, A_t) ] ≤ 2√(LT D_max) + √T + LδT,

and the per-time-step computational cost is O(H(βδ, c) + d), where H(βδ, c)
is the cost of the black-box approximate projection routine and d is the num-
ber of state-action pairs.

Proof. First, we show that the agent does not incur too much additional
regret by choosing policies from the set Kδβ rather than the larger set K.
For any occupancy measure µ ∈ K, consider the mixed measure µδ = (1 −
δ)µ + δµ(πexp ). We have the following properties: First, µδ(x, a) = (1 − δ)µ(x, a) +
δµ(x, a, πexp ) ≥ δβ, and therefore µδ ∈ Kδβ . Second, for any payout vector
r with r(x, a) ∈ [0, 1], we have |r^⊤(µ − µδ)| = δ|r^⊤(µ − µ(πexp))| ≤ δL,
since both r^⊤µ and r^⊤µ(πexp) lie in [0, L]. Therefore, for any occupancy measure µ ∈ K, there is an
occupancy measure in Kδβ that earns nearly as much reward. This implies
that having a good regret bound relative to any point in Kδβ gives us a
regret bound relative to any point in K.
Next, we show that the regularizer J is 1/L-strongly convex with respect
to ‖·‖_1 on the set K. Since each occupancy measure µ is a distribution when
restricted to the states and actions in a single layer X_ℓ × A, we have the
following:

‖µ‖_1 = Σ_{x,a} |µ(x, a)| = Σ_{ℓ=1}^L Σ_{x∈X_ℓ, a} |µ(x, a)| = Σ_{ℓ=1}^L 1 = L.

Therefore, by Lemma 5.13, we have that J is 1/L-strongly convex on K
with respect to ‖·‖_1.

Finally, we show that ∇J is Lipschitz continuous on Kδβ with respect
to ‖·‖_1. Let w ∈ Kδβ be any occupancy measure and consider indices
i, j ∈ {1, . . . , d} (in this proof, it is more convenient to use integer indices,
rather than pairs of states and actions). Then we can compute the partial
derivatives of J(w) to be


∂J(w)/∂w(i) = ln(w(i)),    ∂²J(w)/( ∂w(i) ∂w(j) ) = I{i = j}/w(i).

It follows that the Hessian satisfies ∇²J(w) ⪯ (δβ)^{−1} I, and therefore ∇J is
(δβ)^{−1}-Lipschitz continuous.
Let µ ∈ K be an arbitrary occupancy measure and let µδ = (1 − δ)µ +
δµ(πexp ) be the mixture of µ with the exploration policy's occupancy measure. Since J is 1/L-
strongly convex and ∇J is (δβ)^{−1}-Lipschitz on Kδβ, we can apply Theorem 5.11
and Lemma 5.12 to get the following bound:

Σ_{τ=1}^T r_τ^⊤(µδ − µ(π_τ)) ≤ D_R(µδ, µ(π_1))/η + cLDT/η + ηL Σ_{τ=1}^T ‖r_τ‖²_∞
  ≤ D_R(µδ, µ(π_1))/η + √T + ηLT.

Finally, since r_τ^⊤(µ − µδ) ≤ δL, we have that

Σ_{τ=1}^T r_τ^⊤(µ − µ(π_τ)) = Σ_{τ=1}^T r_τ^⊤(µ − µδ) + Σ_{τ=1}^T r_τ^⊤(µδ − µ(π_τ))
  ≤ D_max/η + √T + ηLT + LTδ
  = 2√(T D_max L) + √T + LTδ.

By Theorem 5.4, the same regret bound holds for the sequence of policies
in the MDPCR.

Note that D_max = Θ(L ln(1/π_0)), where π_0 = min_{(x,a)} π_exp(x, a) (notice that
π_exp(x, a) ≥ β, since µ(x, a, π_exp) ≥ β). If, for example, π_exp(x, ·) is selected
to be the uniform distribution over A, then β > 0 and π_0 = 1/|A|, making
the regret scale with O(L √(T ln(|A|))) when δ = 1/√T. Also, this makes
the computational cost Õ(d^{3.5} T^{1/4}/√β), where Õ hides log-factors. Neu
et al. [NGS2010] gave an algorithm that achieves O(L² √(T ln(|A|))) regret
with O(d) computational complexity per time-step. Thus, our regret bound
scales better in the problem parameters than that of Neu et al. [NGS2010],
at the price of increasing the computational complexity. It is an interesting
(and probably challenging) problem to achieve the best of the two results.

5.3.2 Loop Free Episodic MDPCRs with Evaluative Feedback
This section introduces an algorithm for learning in loop free episodic MD-
PCRs under evaluative feedback, where the agent only observes the reward
rt (Xt , At ) for the state Xt and action At that she visited during the tth time
step. The algorithm for this setting is essentially identical to the algorithm
from the previous section for the instructive feedback setting, except we use
importance sampling to estimate the complete reward function. Specifically,
following the τ th episode, we set

r̂_τ(x, a) = r_τ(x, a)/µ(x, a, π_τ)  if (x, a) was visited in the τth episode,  and  r̂_τ(x, a) = 0  otherwise,   (5.5)

and perform the update with this estimate of the complete reward function.
Since r̂τ is only non-zero for state-action pairs visited by the agent in episode
τ , it is known to the agent. This estimate is only well defined if µ(x, a, πτ ) >
0 for all states and actions, but since we restrict our algorithm to the set of
occupancy measures which are lower bounded, this will always be the case
for policies chosen by the algorithm. This particular reward approximation
will be justified in the proof of Theorem 5.15.
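A sketch of how the estimate (5.5) might be formed from one episode's trajectory; the trajectory format is an assumption made for illustration.

```python
import numpy as np

def estimate_rewards(visited, observed_rewards, mu):
    """Importance-sampling estimate (5.5) of the full reward function.

    `visited` is a list of (state, action) pairs seen in episode tau,
    `observed_rewards` the corresponding observed rewards, and `mu` the
    state-action occupancy measure of the policy that was followed.
    """
    r_hat = np.zeros_like(np.asarray(mu, dtype=float))
    for (x, a), r in zip(visited, observed_rewards):
        r_hat[x, a] = r / mu[x, a]   # well defined because mu(x, a) >= delta * beta > 0
    return r_hat
```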
As in the previous section, fix a loop free episodic MDPCR M with
layers X1 , . . . , XL and rτ (x, a) ∈ [0, 1] for all states, actions, and episode
indices. Let d = |X × A| be the number of state-action pairs and set K ⊂
R^d to be the convex set of occupancy measures described by Lemma 5.2.
Finally, let β > 0 be such that there exists some exploration policy πexp
with µ(x, a, πexp ) > β for all states x and actions a.

Input: Step size η > 0, exploration constant δ ∈ (0, 1],


approximation constant c > 0
1 Choose µ1 ∈ Kδβ arbitrarily;
2 for Each episode index τ = 1, . . . , T do
3 Execute one episode following policy πτ , obtained from µτ
according to Lemma 5.1;
4 Estimate the complete reward function r̂τ as in (5.5);
5 Set µτ+1/2 (x, a) = µτ (x, a) exp(η r̂τ (x, a)) for each state x and
action a;
6 Set µτ+1 = PKδβ (µτ+1/2 ), where PKδβ is a black-box that
computes c-approximate projections onto Kδβ wrt ‖·‖1 .
7 end
Algorithm 10: Approximate Online Mirror Ascent for Loop Free
Episodic MDPCRs Under Evaluative Feedback

Theorem 5.15. Let M be a loop free episodic MDPCR, π ∈ Π be any
Markov policy, and π_1, . . . , π_T be the sequence of policies produced by Algo-
rithm 10 with parameters δ ∈ (0, 1], c = βδη/√T, and η = √(D_max/(dT)), where L is the
number of layers in the MDPCR and D_max ≥ sup_{µ∈Kδβ} D_J(µ, µ(π_1)). Then,
the regret of an agent that follows the sequence of policies π_{1:T} relative to
the fixed policy π is bounded as follows:

Σ_{τ=1}^T ( E_π[ Σ_{ℓ=1}^L r_τ(X_ℓ, A_ℓ) ] − E_{π_τ}[ Σ_{ℓ=1}^L r_τ(X_ℓ, A_ℓ) ] ) ≤ 2√(dT D_max) + √T + LδT,

and the per-time-step computational cost is O(H(βδ, c) + d), where H(βδ, c)
is the cost of the black-box approximate projection routine and d is the num-
ber of state-action pairs.

Proof. The proposed algorithm for the evaluative feedback setting is identi-
cal to the instructive feedback setting, with the exception that we estimate
the reward function. As in the proof of Theorem 5.14, for any µ ∈ K, let
µδ = (1 − δ)µ + δµ(πexp ) be the mixture of µ with µ(πexp ). Then we have

Σ_{τ=1}^T r̂_τ^⊤(µ − µ(π_τ)) = Σ_{τ=1}^T r̂_τ^⊤(µ − µδ) + Σ_{τ=1}^T r̂_τ^⊤(µδ − µ(π_τ))
  ≤ D_max/η + √T + LδT + Σ_{τ=1}^T r̂_τ^⊤( µ(π_{τ+1/2}) − µ(π_τ) ).

In the previous proof we bounded the expressions r̂_τ^⊤(µ(π_{τ+1/2}) − µ(π_τ)) ≤
η ‖r̂_τ‖²_∞ using Lemma 5.12. That is a bad idea, in this case, since the com-

η kr̂t k2∞ using Lemma 5.12. That is a bad idea, in this case, since the com-
ponents of r̂t scale inversely with µ(πt ) (because of the importance weights)
and may be very large. Instead, we upper bound the expectation of this
term in the following way.
The following lemma is extracted from [AHR2008].

Lemma 5.16. Let Ft be a σ-field, wt and r̂t be random d-dimensional vec-


tors that are measurable with respect to Ft , and set

wt+1/2 (x, a) = wt (x, a) exp(ηr̂t (x, a))

and suppose E[r̂t | Ft ] = rt . Then


E[ r̂_t^⊤(w_{t+1/2} − w_t) | F_t ] ≤ η E[ Σ_{x,a} w_t(x, a) r̂_t(x, a)² | F_t ].

Now, let F_τ denote the sigma algebra generated by the first τ − 1 episodes.
For each state x and action a, we have

E[ r̂_τ(x, a) | F_τ ] = E[ ( r_τ(x, a)/µ(x, a, π_τ) ) I{X_ℓ = x, A_ℓ = a} | F_τ ]
  = ( r_τ(x, a)/µ(x, a, π_τ) ) E[ I{X_ℓ = x, A_ℓ = a} | F_τ ]
  = r_τ(x, a).
Therefore, we can apply Lemma 5.16 to get

E[ r̂_τ^⊤( µ(π_{τ+1/2}) − µ(π_τ) ) ] ≤ ηd.

Substituting this above gives the bound

Σ_{τ=1}^T E[ r̂_τ^⊤(µ − µ(π_τ)) ] ≤ ηTd + D_max/η + √T + LδT.

Setting η = √( D_max/(Td) ) gives the optimal bound of

Σ_{τ=1}^T E[ r̂_τ^⊤(µ − µ(π_τ)) ] ≤ 2√(Td D_max) + √T + LδT.


As far as the dependence on τ is concerned, by choosing δ = 1/√T, we
can thus improve the previous state-of-the-art bound of Neu et al. [NGSA2014],
which scales as O(τ^{3/2} √(T ln(|A|))), to O(√(τ T ln|A|)). The update cost of the
algorithm of Neu et al. [NGSA2014] is O(|X|³ + |X|²|A|), while here the
cost of our algorithm is Õ(T^{1/4} d^{3.5}/√β).

5.3.3 Uniformly Ergodic MDPCRs with Instructive Feedback
Finally, this section introduces an algorithm for learning in uniformly ergodic
MDPCRs under instructive feedback. For the rest of this section, fix a
uniformly ergodic MDPCR M = (X , A, xstart , P, (rt )t∈N ) with mixing time
τ < ∞ and rt (x, a) ∈ [0, 1] for all states, actions, and times. Let d = |X × A|
be the number of state action pairs and set K ⊂ Rd to be the convex set
of stationary distributions described by Lemma 5.6. Finally, let β > 0 be
such that there exists an exploration policy πexp with µ(x, a, πexp ) ≥ β for
all states and actions.
Input: Step size η > 0, exploration constant δ ∈ (0, 1],
approximation constant c > 0
1 Choose µ1 ∈ Kδβ arbitrarily;
2 for each time index t = 1, . . . , T do
3 Receive state Xt from environment;
4 Sample action At from πt , obtained from µt according to Lemma 5.5;
5 Receive complete reward function rt from environment;
6 Set µt+1/2 (x, a) = µt (x, a) exp(η rt (x, a)) for each state x and action a;
7 Set µt+1 = PKδβ (µt+1/2 ), where PKδβ is a black-box that
computes c-approximate projections onto Kδβ wrt ‖·‖1 .
8 end
Algorithm 11: Approximate Online Mirror Ascent for Uniformly Ergodic MDPCRs Under Instructive Feedback

Theorem 5.17. Let M be a uniformly ergodic MDPCR with mixing time
τ < ∞, let π ∈ Π be any Markov policy, and let π_1, . . . , π_T be the sequence of
policies produced by Algorithm 11 with parameters δ ∈ (0, 1], η = √( D_max/(T(2τ + 3)) ),
and c = βδη/√T. Then the regret of the agent that follows policy π_t at time t
relative to policy π can be bounded as

E_π[ Σ_{t=1}^T r_t(X_t, A_t) ] − E_{π_{1:T}}[ Σ_{t=1}^T r_t(X_t, A_t) ] ≤ 2√((2τ + 3)T D_max) + √T + δT + 4τ + 4.

Proof. This proof is essentially identical to the proof of Theorem 5.14 and
has been omitted.

According to Bubeck et al. [BCK2012], for online bandit linear opti-


mization over a compact action set K ⊂ R^d, it is possible to obtain a regret
of order O(d √(T log T)) regardless of the shape of the decision set K, which,
in our case, would translate into a regret bound of order O(|X × A| √(T log T)).
Whether the algorithm proposed in that paper can be implemented efficiently
depends, however, on the particular properties of K: designing the explo-
ration distribution needed by this algorithm requires the computation of
the minimum volume ellipsoid containing K, and this problem is in general
NP-hard even when considering a constant factor approximation [N2007].
Selecting π_exp(x, ·) to be the uniform distribution gives β > 0 and D_max ≤
L ln(|A|), and results in an O(√(dLT ln(|A|))) bound on the regret for δ = 1/√T,
while the time-complexity of the algorithm is still O(d^{3.5} T^{1/4}/√β) as in the
full-information case. Neu et al. considered the same problem under the
assumption that any policy π visits any state with probability at least α for
some α > 0, that is, inf_π Σ_a µ(x, a, π) ≥ α > 0. They provide an algorithm
with O(d) per-round complexity whose regret is O(L² √(T|A| ln(|A|)/α)).
Compared to their result, we managed to lift the assumption α > 0, and
also improved the dependence on the size of the MDP, while paying a price
in terms of increased computational complexity.

Chapter 6

Conclusion

This thesis documents the two projects that I worked on during my MSc
program. Both projects contribute to the goal of building computer systems
that are capable of learning for themselves to solve problems and succeed at
tasks. Each project focuses on specific mathematical questions related to a
formal learning problem.
The first project addresses the question of which baseline function to
use in policy gradient reinforcement learning methods for Markov decision
processes. The baseline function’s role is to alter the performance gradi-
ent estimate used internally by policy gradient methods. I show that if the
formal learning objective is a concave function of the agent’s policy param-
eters, then the regret of a policy gradient method can be upper bounded
by a quantity that depends on the baseline function only through the
second moment of the gradient estimates. This suggests that the baseline
function should be chosen to minimize the second moment of the gradi-
ent estimates, which I show to be equivalent to the more intuitive notion
of minimizing the mean squared error of the gradient estimates. I derive
closed form expressions for this baseline in terms of the MDP transition
probability kernel, the agent’s policy, and the agent’s policy parameteriza-
tion. Since the MDP transition probability kernel is unknown to the agent,
I also propose two algorithms for estimating this baseline while interacting
with the environment. Finally, I present a preliminary empirical comparison

of the always-zero baseline, the value function baseline, and my proposed
baseline. This comparison demonstrates a statistically significant increase
in performance when using my proposed baseline, as long as we are careful
to initialize our estimates reasonably accurately.
The goal of the second project is to design new learning algorithms for
MDPCRs. The main difference between MDPCRs and standard MDPs is
that, in the former, the environment chooses the sequence of reward func-
tions in an adversarial manner. This difference makes it easier to model
some real-world problems as MDPCRs, especially those with non-stationary
dynamics. I propose three new algorithms, all based on an approximate ver-
sion of online mirror ascent: one for learning in loop-free MDPCRs under
instructive feedback, one for learning in loop-free MDPCRs under evalua-
tive feedback, and one for learning in uniformly ergodic MDPCRs under
instructive feedback. Each of these algorithms has regret bounds that either
improve or complement the regret bounds of existing algorithms, and which
often hold even under weaker assumptions on the environment. In the de-
velopment of these algorithms, it was necessary to analyze an approximate
version of online mirror ascent, where the projection step is only computed
approximately. To my knowledge, this is the first rigorous analysis of this
approximation to online mirror ascent, despite the fact that the projection
step can often only be approximated.
Both projects provide sound, theoretically justified answers to important
questions in the fields of reinforcement learning and online learning.

Bibliography

[AHR2008] Abernethy, J., Hazan, E., & Rakhlin, A. “Competing in the


Dark: An efficient algorithm for bandit linear optimization” in Proceed-
ings of the 21st Annual Conference on Learning Theory (July 2008):
263–274.

[BCK2012] Bubeck, S., Cesa-Bianchi, N., & Kakade, S. M. “Towards mini-


max policies for online linear optimization with bandit feedback” in Jour-
nal of Machine Learning Research - Proceedings Track, (2012): 23:41.1–
41.14

[DGS2014] Dick, T., György, A., & Szepesvári, Cs., “Online learning in
Markov Decision Processes with Changing Cost Sequences” in Proceed-
ings of The 31st International Conference on Machine Learning (2014):
512–520.

[EKM2005] Even-Dar, E., Kakade, S. M., & Mansour, Y. “Experts in a


Markov Decision Process” in Advances in neural information processing
systems, Vol. 17 (2005): 401–408.

[EKM2009] Even-Dar, E., Kakade, S. M., & Mansour, Y. “Online markov


decision processes” in Mathematics of Operations Research, Vol. 34, No. 3
(2009): 726–736.

[GBB2004] Greensmith, E., Bartlett, P., & Baxter, J. “Variance reduction


techniques for gradient estimates in reinforcement learning” in The Jour-
nal of Machine Learning Research Vol. 5 (2004): 1471–1530.

[GPS2014] György, A., Pál, D., & Szepesvári, Cs. “Online learn-
ing: Algorithms for big data” (2014): https://www.dropbox.com/s/
bd38n4cuyxslh1e/online-learning-book.pdf.

[N2007] Nemirovski, A. “Advances in convex optimization: Conic pro-


gramming” in Proceedings of International Congress of Mathematicians,
Vol. 1 (2007): 413–444.

[NGSA2014] Neu, G., György, A., Szepesvári, Cs., & Antos, A. “Online
markov decision processes under bandit feedback” in IEEE Transactions
on Automatic Control, Vol. 59, No. 3 (February 2014): 676–691.

[NGS2010] Neu, G., György, A., & Szepesvári, Cs. “The online loop-free
stochastic shortest-path problem” in Proceedings of the 23rd Annual
Conference on Learning Theory (June 2010): 231–243.

[S2012] Shalev-Shwartz, S. “Online learning and online convex optimiza-


tion” in Foundations and Trends in Machine Learning, Vol. 4, No. 2
(2012): 107–194.

[SB1998] Sutton, R. S. & Barto, A. G. Reinforcement learning: An Intro-


duction. Cambridge, MA: The MIT Press, 1998.

[SMSM2000] Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour,


Y. “Policy gradient methods for reinforcement learning with function
approximation” in Advances in Neural Information Processing Systems,
Vol. 12. Cambridge, MA: The MIT Press, 2000: 1057–1063.

[Cs2010] Szepesvári, Cs. Algorithms for Reinforcement Learning [Synthesis


Lectures on Artificial Intelligence and Machine Learning 9]. San Rafael,
CA: Morgan & Claypool Publishers, 2010.

[W1992] Williams, R. J. “Simple statistical gradient-following algorithms


for connectionist reinforcement learning” in Machine Learning, Vol. 8,
No. 3-4 (1992): 229–256.

[YMS2009] Yu, J. Y., Mannor, S., & Shimkin, N. “Markov decision pro-
cesses with arbitrary reward processes” in Mathematics of Operations
Research, Vol. 34, No. 3 (2009): 737–757.
