Policy Gradient Reinforcement Learning Without Regret
by
Travis Dick
Master of Science
that are capable of performing tasks and solving problems without problem-specific direction from us, their designers. I focus on two formal learning problems. Whenever our translation from the real to the formal accurately captures the important aspects of a problem, conclusions about algorithms in the formal setting will approximately hold in the real world as well.
error and decide whether a trial was good or bad by comparing its outcome to a baseline. The choice of baseline does not change the asymptotic guarantees for policy gradient methods, but it does alter their finite-time behaviour. This raises the question: what baseline should we use? I propose that the baseline should be chosen such that a certain estimate used internally by policy gradient methods has the smallest error. I prove that, under slightly idealistic assumptions, this baseline gives a good regret bound. The baseline can be written in closed form in terms of properties of the environment, but these closed form expressions are often unknown, so I also propose two algorithms for estimating this baseline from only known quantities. Finally, I present an empirical comparison of baseline functions.
The second class of learning problems I study is sometimes called online Markov decision processes, or Markov decision processes with changing rewards. The unique property of this class is that the environment is modeled adversarially, as though it were playing a game against the computer. This is in contrast to the more common stochastic models of the environment. My main contributions are new learning algorithms whose regret guarantees complement or improve the best existing results and that often hold even under weaker assumptions. This comes at the cost of increased (but still polynomial) computational complexity. One of the tools developed in the analysis is of independent interest.
Contents

Abstract  ii

1 Introduction  1

3 Gradient Methods  21
  3.1 Gradient Ascent  22
  3.2 Stochastic Gradient Ascent  25
  3.3 Online Mirror Ascent  27

4.5 Experiments  55

5 Learning in MDPCRs  69
  5.1 Reductions to Online Linear Optimization  70
    5.1.1 Reduction of Loop-Free Episodic MDPCRs  71
    5.1.2 Reduction of Uniformly Ergodic MDPCRs  75
  5.2 Online Mirror Ascent with Approximate Projections  82
  5.3 Learning Algorithms and Regret Bounds  87
    5.3.1 Loop Free Episodic MDPCRs with Instructive Feedback  90
    5.3.2 Loop Free Episodic MDPCRs with Evaluative Feedback  93
    5.3.3 Uniformly Ergodic MDPCRs with Instructive Feedback  96

6 Conclusion  99
Chapter 1
Introduction
of descriptive formal learning problems and good learning algorithms can
then approach a difficult real-world problem by first modeling it formally
and then handing it off to a computer. Moreover, when we do eventually
have strategies for automatic modeling, it will be convenient to already have
algorithms for many formal learning problems.
This thesis describes two projects in pursuit of the strategy outlined
above. Both projects work towards answering interesting mathematical
questions arising in the design and analysis of algorithms for two different
formal learning problems. Both learning problems are formulated in the lan-
guage of reinforcement learning, which is an approach whereby the computer
learns by trial and error. Further, the algorithms studied in both projects
treat learning as mathematical optimization and are derived from an opti-
mization algorithm called gradient ascent. Finally, both projects measure
learning performance in terms of regret, which is roughly how much worse
the computer learner performed than if it had known the best strategy beforehand. The title of the thesis, Policy Gradient Reinforcement Learning
Without Regret, mentions explicitly each of these three components, which
will be described in more detail in the remainder of the thesis.
The goal of the first project is to answer an open question about a fam-
ily of algorithms called policy gradient methods. The question is somewhat
technical, so for now I will only discuss it at a high level and postpone the de-
tailed description until Chapter 4. In the reinforcement learning framework,
the computer system receives a reward following each of its decisions and
its goal is to earn the most reward in the long run. Policy gradient meth-
ods learn to earn rewards by trial and error. After trying one behaviour
for a period of time, the algorithm compares the rewards it earned to a
given baseline. If the computer performed better than the baseline, then
the tendency to follow that behaviour again in the future is increased. Sim-
ilarly, if the computer performed worse than the baseline, the likelihood of
that behaviour is decreased. Surprisingly, the asymptotic formal guarantees
for policy gradient methods do not depend on what baseline they compare
against. The baseline does, however, influence the computer’s finite-time
behaviour, which leads immediately to the question: what baseline should
we use? This is the question addressed by the first project in this thesis.
The answer to this question depends on our goals for the computer system.
For example, some baselines may be computationally efficient to construct,
while others may allow the computer to learn more quickly.
The main contributions of my first project are as follows: for two specific
policy gradient algorithms, I propose that the baseline should be chosen to
minimize the mean-squared-error of an internally used gradient estimate. I
support this proposal by showing that under certain conditions, this choice
results in a good regret bound for the two policy gradient algorithms. I
present closed-form expressions showing how this baseline can be computed
from properties of the environment. This closed form expression depends
on properties of the environment that are usually unknown, so I also pro-
vide two new algorithms for estimating the baseline from interaction with
the environment. Finally, I present an empirical comparison between three
baselines that shows an empirical benefit to using my proposed baseline.
The second project combines aspects of reinforcement learning with as-
pects of another sub-field of computing science sometimes called online learn-
ing. The class of learning problems that are most commonly considered
in reinforcement learning are called Markov decision processes, which are
stochastic models describing how the computer’s environment behaves. In
contrast, problems from online learning typically use adversarial models,
where we imagine that the environment is playing a game against the com-
puter. In this project, we consider a class of learning problems called Markov
decision processes with changing rewards (MDPCR), which are very similar
to Markov decision processes where some stochastic pieces of the model are
replaced with adversarial counterparts. The goal of this project is to design
new efficient learning algorithms for this class of problems.
The main contributions of the second project, which were also presented
at ICML 2014 [DGS2014], are as follows: I show that learning in Loop-free
MDPCRs and Uniformly Ergodic MDPCRs can be reduced to another prob-
lem called online linear optimization. From this reduction, I derive three
new learning algorithms with sublinear regret bounds. There are trivial
algorithms that achieve regret bounds of the same order, but their compu-
tational complexity is exponential in the problem size. The three proposed
algorithms all have computational complexity that is polynomial in the prob-
lem size. Moreover, for the so-called bandit information setting, the regret
bound for the proposed algorithm holds under significantly weaker assump-
tions on the environment than existing algorithms. The three proposed
algorithms are of interest because of their low complexity and since their
analysis holds under weak conditions.
This thesis is organized as follows: (i) Chapter 2 introduces the reinforce-
ment learning framework, Markov decision processes, and Markov decision
processes with changing rewards. These are formal learning problems con-
sidered in this thesis; (ii) Chapter 3 introduces stochastic gradient descent
and mirror descent, which are the optimization tools that are used in all
algorithms studied in this thesis; (iii) Chapter 4 presents the first project,
which focuses on the baseline for policy gradient methods; (iv) Chapter 5
presents the second project, which focuses on designing new algorithms for
Markov decision processes with changing rewards; (v) and finally, Chapter 6
discusses directions for future research and gives concluding remarks.
Chapter 2
environment’s current state (the store’s current inventory and any other ob-
servations available to the agent). Following each action, the agent receives
a scalar reward (the weekly profit), and the agent’s goal is to maximize some
measure of the long-term reward, such as the total profit over a fixed number
of weeks, or the average profit per week.
Reinforcement learning problems differ in the set of states, the set of
actions, and how the environment responds to the agent’s actions. Markov
decision processes and Markov decision processes with changing rewards are
formal models for how the agent’s actions affect the environment’s state and
rewards. We can mathematically reason about these formal models to make
strong theoretical claims about learning algorithms. The two models pre-
sented in this chapter are complementary and each can be used to accurately
model different kinds of real-world problems.
action (encoded as an element a ∈ A), the agent observes the environment’s
state (encoded as an element x ∈ X ). For each state action pair (x, a), the
transition probability kernel gives a distribution over states and rewards,
denoted by T(x, a). The environment’s next state and the agent’s reward
are jointly distributed according to T(x, a) whenever the agent takes action
a from state x. When the agent begins interacting with the environment,
the environment is in state x_start. Typically, we consider the case when the agent knows the set of states, the set of actions, and the starting state, but does not know the transition probability kernel.
We rarely need to work with the transition probability kernel directly.
For almost all purposes, given a state-action pair (x, a), we only care about
the marginal distribution over the next state and the expected reward.
Therefore, we use the following simpler notation: Let (x, a) be any state-action pair and let (X′, R) be randomly sampled from T(x, a). We define the state transition probabilities by P(x, a, x′) = P(X′ = x′) and the expected reward by r(x, a) = E[R]. The dependence of these functions on the pair (x, a) is through the distribution of X′ and R.
Just as it is useful to have a model for how the environment behaves, it
is useful to have a model for how the agent chooses actions. For MDPs, a
natural choice is to suppose that the agent chooses actions according to a
Markov policy, which is a stochastic mapping from states to actions.
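To make these objects concrete, the following is a minimal Python sketch of a finite MDP in the tabular representation used above (arrays indexed by states and actions) together with a stochastic Markov policy; the class and function names, and the Gaussian reward noise, are illustrative choices rather than part of the formal definitions.

```python
import numpy as np

class TabularMDP:
    """A finite MDP described by state-transition probabilities P[x, a, x']
    and expected rewards r[x, a], as in the marginal notation above."""

    def __init__(self, P, r, x_start=0, seed=0):
        self.P = np.asarray(P)            # shape (|X|, |A|, |X|)
        self.r = np.asarray(r)            # shape (|X|, |A|)
        self.x_start = x_start
        self.rng = np.random.default_rng(seed)

    def step(self, x, a):
        """Sample a next state and a (noisy) reward for the pair (x, a)."""
        x_next = self.rng.choice(self.P.shape[2], p=self.P[x, a])
        reward = self.r[x, a] + self.rng.normal()
        return x_next, reward

def sample_action(policy, x, rng):
    """A Markov policy is a stochastic mapping from states to actions:
    policy[x, a] is the probability of choosing action a in state x."""
    return rng.choice(policy.shape[1], p=policy[x])

# A two-state, two-action example environment and the uniform policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])
mdp = TabularMDP(P, r)
uniform = np.full((2, 2), 0.5)

x = mdp.x_start
for t in range(5):
    a = sample_action(uniform, x, mdp.rng)
    x, reward = mdp.step(x, a)
```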
The meaning of the time indexes is as follows: action A_t^π was taken after observing X_t^π, and the reward produced by this action was R_t^π.²
At first glance it seems restrictive that Markov policies are only permit-
ted to choose actions based on the environment’s current state, rather than
the entire history of states, actions, and rewards. It turns out, however,
that in all the cases we consider, there is an optimal policy that chooses
actions as a deterministic function of the environment’s current state. This
is because, conditioned on the current state, the future of an MDP is inde-
pendent of the history. We consider the more general class of (stochastic)
Markov policies because they allow the agent to randomly choose actions,
which is useful for trial and error learning.
Reinforcement learning algorithms adapt their behaviour over time to
maximize reward. In principle, if the agent knew the environment's transition probability kernel T beforehand, the agent could compute an optimal
policy off-line prior to interacting with the environment. But, since the
probability kernel is unknown, the agent must use its interaction with the
environment to improve its policy. For example, a simple approach would
be to compute a maximum likelihood estimate of the transition probability
kernel based on the observed state transitions and rewards and to calculate
the optimal policy for the approximated kernel. In general, this approach
is too costly in terms of memory, computation, and interactions with the
environment, so we seek other approaches.
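As a concrete illustration of this (often impractical) approach, the maximum likelihood estimates of the transition probabilities and expected rewards from a batch of observed transitions are just normalized counts and sample means. This sketch assumes the tabular representation used above; the function name and the uniform default for unvisited pairs are arbitrary choices.

```python
import numpy as np

def max_likelihood_model(transitions, n_states, n_actions):
    """Estimate P(x, a, x') and r(x, a) from (x, a, reward, x') tuples by
    counting observed transitions and averaging observed rewards."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for x, a, reward, x_next in transitions:
        counts[x, a, x_next] += 1
        reward_sums[x, a] += reward
        visits[x, a] += 1
    # Unvisited state-action pairs are left uniform with zero estimated reward.
    P_hat = np.full((n_states, n_actions, n_states), 1.0 / n_states)
    r_hat = np.zeros((n_states, n_actions))
    seen = visits > 0
    P_hat[seen] = counts[seen] / visits[seen][:, None]
    r_hat[seen] = reward_sums[seen] / visits[seen]
    return P_hat, r_hat
```

One could then run a planning method (for example value iteration) on the estimated model, which is exactly the approach the text argues is usually too costly.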
The only remaining aspect of an MDP left to formalize is the agent’s
learning objective. That is, what exactly is the agent trying to accomplish
while interacting with her environment? A formal learning objective is a map
J : Π → R that maps each policy to a scalar measure of its performance.
The map J specifies precisely what we desire in a policy and is usually a
function of both the policy and the environment. Intuitively, the learning
objective should be to maximize some measure of the long-term reward the
agent receives. The two most commonly used learning objectives are: First,
maximizing the agent’s expected total reward in repeated attempts at a
task. Second, maximizing the long-term average reward per-action in a task that continues indefinitely. The following subsections introduce these two learning objectives, together with additional conditions on the environment that make learning possible.

² This is somewhat non-standard, and often R_t^π is taken to be the reward produced by executing action A_{t−1}^π.
Let π ∈ Π be any policy for an episodic MDP and let T^π = min{t ≥ 1 : X_t^π = x_term} be the (random) first time that an agent following policy π enters the terminal state. Then we can rewrite the expected total reward as

J_total(π) = E[ Σ_{t=1}^{T^π − 1} R_t^π ].

For each state-action pair (x, a), we also define

Q_total(x, a, π) = E_{x,a}[ Σ_{t=1}^{T^π − 1} R_t^π ],

where E_{x,a} denotes the expectation where the environment starts in state x and the agent's first action is a. For each policy π, the map (x, a) ↦ Q_total(x, a, π) is usually called the action-value function for policy π.
Since the state transitions in an MDP do not depend on the history of
states, any time an agent following policy π finds herself in state x, her
expected total reward until the end of the episode is given by Vtotal (x, π).
Similarly, whenever the agent finds herself in state x and she takes action a,
then Qtotal (x, a, π) is her expected total reward until the end of the episode.
In other words, for any time t, we have

E[ Σ_{s=t}^{∞} R_s^π | X_t^π = x ] = V_total(x, π)

and

E[ Σ_{s=t}^{∞} R_s^π | X_t^π = x, A_t^π = a ] = Q_total(x, a, π).
Definition 2.7. Let M be any MDP. The average reward learning objective is the map J_avg : Π → R defined by

J_avg(π) = lim_{T→∞} (1/T) E[ Σ_{t=1}^{T} R_t^π ].
from the starting state x_start. Therefore, if we use J_avg to choose between policies, we would like to impose some constraints on the MDP that ensure the long-term average reward of a policy does not depend on the starting state.
Otherwise, the agent may be encouraged to choose a policy which has high
average reward when started from the starting state, but which performs
poorly given the environment’s current state.
A relatively mild condition on the environment that ensures the average
reward of a policy does not depend on the starting state is ergodicity:
where ν(x, π) denotes the probability mass given to state x by the distribu-
tion ν(π).
The starting state of the MDP no longer appears in this expression for the
average reward, and therefore the average reward does not depend on the
starting state.
³ In the remainder of this thesis, I will refer to weakly ergodic MDPs simply as ergodic MDPs.
Like in the total reward setting, the value and action-value functions
are useful theoretical tools which have essentially the same interpretation as
before. Now, rather than measuring the total reward following a given state,
they measure the transient benefit of being in a given state, or state-action
pair, compared with the long-term average.
Q_avg(x, a, π) = E_{x,a}[ Σ_{t=1}^{∞} ( R_t^π − J_avg(π) ) ].
of the environment’s current behaviour in the state. For example, if the en-
vironment switches between a finite number of modes, then we could include
an integer in the MDP state that indicates which mode the environment is
currently in. The drawback of this approach is that, in the MDP framework,
the agent completely observes the state, so the agent must be able to ob-
serve the environment’s current mode, which is a rather strong requirement.
One way to avoid this requirement is to modify the MDP model so that
the agent only partially observes the state. This kind of model is called a
partially observable MDP (POMDP). POMDPs are an interesting research
topic, but not a focus of this thesis.
MDPCRs are a different approach to modeling non-stationary environ-
ments. They keep the assumption that the agent completely observes the
modeled state and that the state transitions are Markovian. The difference
is that the reward for taking action a from state x changes over time in
a non-stochastic way. Specifically, there is an unknown sequence of reward
functions r1 , r2 , . . . and executing action at from state xt at time t produces
a reward rt (xt , at ).
the agent observes the entire reward function rt : X × A → R after choos-
ing an action at time t. The evaluative feedback setting is more useful in
real-world tasks, since often the rewards are determined by some real-world
process and we only see the outcome of the action that was taken. The
instructive feedback case is still interesting theoretically, sometimes useful
in practice, and acts as a stepping stone towards developing algorithms for
the evaluative feedback setting.
The main usefulness of MDPCRs is modeling tasks where the agent’s
rewards depend on a difficult-to-model aspect of the environment. For ex-
ample, suppose the agent’s task is to explore a maze searching for treasure.
The agent’s actions have predictable (and time-invariant) effects on her lo-
cation in the maze and are therefore easily modeled with Markovian transi-
tion probabilities. Suppose, however, that there is a wizard that periodically
creates and destroys treasures throughout the maze. The agent’s reward de-
pends not only on her position in the maze, but also on the recent actions of
the wizard. This problem is difficult to model as an MDP, since the rewards
must be sampled from a time-invariant distribution that depends only on
the most recent state and action. This forces the state to explicitly model
(at least part of) the wizard’s behaviour, which may be very complicated.
On the other hand, it is easy to model this task as an MDPCR, since we
can leave the wizard out of the state entirely and use the sequence of reward
functions to model the moving treasures. Similar problems arise in many
situations where the agent interacts with another entity with agency, such
as a human user or another machine.
Like in the MDP case, we consider two different formal learning objec-
tives: one suitable for episodic tasks and one suitable for continuing tasks.
For each formal learning objective, we consider a sub-class of MDPCRs
where learning is possible. The sub-classes considered in this thesis are more
restrictive than in the MDP setting, and an interesting open question is to
design learning algorithms for more general settings. In the episodic setting,
we consider learning in loop-free episodic MDPCRs and in the continuing
setting, we consider uniformly ergodic MDPCRs.
Before giving detailed descriptions of the two cases, I will discuss some
common features of both models.
In each case, we define a sequence of performance functions J_1, J_2, . . . , where J_T(π_1, . . . , π_T) is a function of T policies and represents the expected performance of an agent following policies π_1, . . . , π_T for the first T time steps. For example, we might define

J_T(π_1, . . . , π_T) = E_{π_1:T}[ Σ_{t=1}^{T} r_t(X_t, A_t) ].
Minimizing regret is equivalent to maximizing the performance function,
but as we will see later, the regret is easier to analyze. In particular, we will
be able to provide upper bounds on the regret which depend only loosely
on the actual sequence of rewards.
4. and, for any states x and x′ and any action a, if P(x, a, x′) > 0, then either x = x′ = x_term or there exists a layer index 1 ≤ l < L such that x ∈ X_l and x′ ∈ X_{l+1}.
we can take the duration of each reward function to be an entire episode,
rather than a single time step. Therefore, we denote by πτ and rτ the
agent’s policy and the reward function for episode τ , respectively. Finally,
we measure the agent’s regret after T episodes, rather than after T time
steps of interaction.
The performance function that we consider in loop-free episodic MDPCRs is the total reward earned over the first T episodes:

J_total,T(π_1, . . . , π_T) = Σ_{τ=1}^{T} E_{π_τ}[ Σ_{t=1}^{L} r_τ(X_t, A_t) ],

where E_{π_τ} denotes the expectation where actions are selected according to policy π_τ.
For this performance function, we can write the regret (relative to all Markov policies) as follows:

R_total,T(π_1, . . . , π_T) = sup_{π∈Π} Σ_{τ=1}^{T} { E_π[ Σ_{t=1}^{L} r_τ(X_t, A_t) ] − E_{π_τ}[ Σ_{t=1}^{L} r_τ(X_t, A_t) ] }.
Suppose that an agent's regret grows sublinearly with T. Then for any policy π, we have

R_total,T(π_1:T; π)/T = (1/T) Σ_{τ=1}^{T} E_π[ Σ_{t=1}^{L} r_τ(X_t, A_t) ] − (1/T) Σ_{τ=1}^{T} E_{π_τ}[ Σ_{t=1}^{L} r_τ(X_t, A_t) ],

and this quantity converges to zero as T grows.
Taking π to be the best Markov policy, we have that the average episodic
reward of the agent is converging to the average episodic reward of the
best Markov policy. Therefore, our main goal is to show that the regret of
our algorithms grows sublinearly. Naturally, slower growth rates are more
desirable.
2.2.2 Uniformly Ergodic MDPCRs
Uniformly ergodic MDPCRs are very similar to ergodic MDPs. Recall that
ergodic MDPs were characterized by the existence of a unique stationary
distribution ν(π) ∈ ∆X for each policy π that described the long-term av-
erage state visitation probabilities while following policy π. The condition
in uniformly ergodic MDPs guarantees that every policy has a unique sta-
tionary distribution, that the finite-time state-visitation probabilities ν_t(π) (where ν_t(x, π) = P(X_t^π = x)) converge to the unique stationary distribution, and that the rate of convergence of ν_t(π) to ν(π) is uniformly fast over all policies.
Formally, we have the following definition.

Definition 2.15. An MDPCR (X, A, x_start, P, (r_t)_{t∈N}) is said to be uniformly ergodic if there exists a constant τ ≥ 0 such that for any two distributions ν, ν′ ∈ ∆_X and any policy π, we have

‖νP^π − ν′P^π‖_1 ≤ e^{−1/τ} ‖ν − ν′‖_1,
algorithms for (non-uniform) ergodic MDPCRs with provably good perfor-
mance.
The performance function we use for uniformly ergodic MDPCRs is very similar to the one used for the loop-free episodic MDPCRs, except that the total reward is measured in terms of time steps, rather than episodes:

J_total,T(π_1, . . . , π_T) = E_{π_1:T}[ Σ_{t=1}^{T} r_t(X_t, A_t) ],

where E_{π_1:T} denotes the expectation where the t-th action A_t is chosen according to policy π_t.
For this performance function, we can write the regret (relative to all Markov policies) as follows:

R_total,T(π_1, . . . , π_T) = sup_{π∈Π} { E_π[ Σ_{t=1}^{T} r_t(X_t, A_t) ] − E_{π_1:T}[ Σ_{t=1}^{T} r_t(X_t, A_t) ] }.
Chapter 3
Gradient Methods
argmax_{w∈K} f(w).
3.1 Gradient Ascent
The gradient ascent algorithm can be applied whenever the objective func-
tion f is differentiable. Gradient ascent produces a sequence of vectors w1 ,
w2 , . . . such that the function value of f along the sequence is increasing.
In this section, we only consider so-called unconstrained maximization prob-
lems where K = Rd . Pseudocode for gradient ascent is given in Algorithm 1.
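Algorithm 1 itself is not reproduced in this excerpt; the update it performs is simply w ← w + η∇f(w). Below is a minimal Python sketch of that loop on an illustrative concave objective (the objective, step size, and iteration count are arbitrary choices).

```python
import numpy as np

def gradient_ascent(grad_f, w0, step_size, n_steps):
    """Repeatedly step in the gradient direction: w <- w + eta * grad f(w)."""
    w = np.array(w0, dtype=float)
    iterates = [w.copy()]
    for _ in range(n_steps):
        w = w + step_size * grad_f(w)
        iterates.append(w.copy())
    return iterates

# Example: maximize f(w) = -0.5 * ||w - c||^2, whose gradient is c - w.
c = np.array([1.0, -2.0])
iterates = gradient_ascent(lambda w: c - w, w0=[0.0, 0.0], step_size=0.1, n_steps=100)
# iterates[-1] is close to the unique maximizer c.
```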
and therefore has no maximizer. Second, the linear objective is only a good
approximation of f near the point w0 , so we should only trust solutions
that are near to w0 . A better idea is to successively maximize linear ap-
proximations to the function f , with a penalty that prevents our solutions
from being too far from the region where the approximation is accurate.
Formally, we might set

w_{t+1} = argmax_{u∈R^d} η( f(w_t) + ∇f(w_t)^⊤(u − w_t) ) − (1/2)‖u − w_t‖²₂.

The objective function in the above update has two competing terms. The first term encourages w_{t+1} to maximize the linear approximation of f at the point w_t, and the second term encourages w_{t+1} to stay near to the point w_t. The step size parameter η trades off between these two competing objectives. The above objective is a concave quadratic function of u, and we can express its unique maximizer in closed form as

w_{t+1} = argmax_{u∈R^d} η( f(w_t) + ∇f(w_t)^⊤(u − w_t) ) − (1/2)‖u − w_t‖²₂ = w_t + η∇f(w_t),
is L-Lipschitz. That is,

‖∇f(u) − ∇f(w)‖₂ ≤ L‖u − w‖₂  for all u, w ∈ R^d.

If the step-size satisfies η < 1/L, then the sequence (w_t)_t produced by gradient ascent converges to some point w_∞ such that ∇f(w_∞) = 0.
In most cases, this result is enough to guarantee that gradient ascent
converges to a local maximum of the function f . However, it is possible to
get unlucky and converge to some other point where the gradient of f is
zero, such as a local minimum or a saddle point of f. Since maximizing an arbitrary non-concave function is computationally intractable, this is the best result we can hope for.
When the function f is concave, the situation is much better:
Theorem 3.2. Let f : R^d → R be a concave function with maximizer w*. Let (w_t)_t be the sequence of points produced by running gradient ascent with step size η > 0. Suppose that ‖w*‖₂ ≤ B and ‖∇f(w_t)‖₂ ≤ G for all times t = 1, . . . , T. Then

Σ_{t=1}^{T} ( f(w*) − f(w_t) ) ≤ B²/η + ηTG².

In particular, setting η = B/(G√T) gives

Σ_{t=1}^{T} ( f(w*) − f(w_t) ) ≤ 2BG√T.
Proof. This is a standard result. Since f is concave, for any points u and w in R^d, we have

f(w) ≤ f(u) + ∇f(u)^⊤(w − u).
over times t, we have

Σ_{t=1}^{T} ( f(w*) − f(w_t) ) ≤ Σ_{t=1}^{T} ∇f(w_t)^⊤(w* − w_t).
Theorem 5.11 and Lemma 5.12, together with the facts that ‖·‖₂ is self-dual and 1-strongly convex with respect to itself, give the final result. The above theorem and lemma are the main subject of Section 5.2.

This result shows that whenever we set the step size appropriately, the total suboptimality (usually called the regret) of the sequence w_1, . . . , w_T produced by gradient ascent grows at a rate of only √T. Equivalently, dividing by T shows that the average suboptimality (1/T) Σ_{t=1}^{T} ( f(w*) − f(w_t) ) goes to zero at least as quickly as 1/√T.
we can get unbiased stochastic estimates of the gradient is as follows: Let P be a probability distribution that is unknown, but from which we can sample. Set f(w) = E[g(w, X)], where g is a known function and X has distribution P. Since the distribution of X is unknown, we can't evaluate f or its gradient ∇f. But, let X have distribution P and set ∇ = ∇_w g(w, X). Then (exchanging the gradient and the expectation) we have E[∇] = ∇_w E[g(w, X)] = ∇f(w), so ∇ is an unbiased estimate of the gradient.
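A minimal sketch of stochastic gradient ascent in this setting, where each update uses a single sampled gradient ∇_w g(w, X) with X drawn from P; the particular distribution and objective below are illustrative only.

```python
import numpy as np

def stochastic_gradient_ascent(sample_grad, w0, step_size, n_steps, rng):
    """Gradient ascent in which each step uses an unbiased estimate of grad f(w)."""
    w = float(w0)
    for _ in range(n_steps):
        w += step_size * sample_grad(w, rng)
    return w

# Example: f(w) = E[-0.5 * (w - X)^2] with X ~ N(mu, 1), maximized at w = mu.
# For a single sample X, grad_w of -0.5 * (w - X)^2 is X - w, an unbiased estimate.
mu = 3.0
rng = np.random.default_rng(0)
w_hat = stochastic_gradient_ascent(lambda w, rng: rng.normal(mu, 1.0) - w,
                                   w0=0.0, step_size=0.05, n_steps=2000, rng=rng)
# w_hat is close to mu.
```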
ascent with step size η > 0 and gradient estimates (∇_t)_t. Suppose that ‖w*‖₂ ≤ B and E[‖∇_t‖²₂] ≤ G² for all times t = 1, . . . , T. Then

E[ Σ_{t=1}^{T} ( f(w*) − f(w_t) ) ] ≤ B²/η + ηTG².

In particular, setting η = B/(G√T) gives

E[ Σ_{t=1}^{T} ( f(w*) − f(w_t) ) ] ≤ 2BG√T.
Proof. The proof of this result is essentially identical to the proof of Theo-
rem 3.2 and is omitted.
Notice that the bound in Theorem 3.3 depends on the distribution of the gradient estimates ∇_t only by way of their second moments. Therefore, to get the best possible bound, one should try to construct the gradient estimates to have the smallest possible second moment.
may want to measure distances in terms of the Kullback-Leibler divergence instead of the squared 2-norm distance.
The problem solved by online mirror ascent is called online linear opti-
mization, defined as follows:
Given a fixed time horizon T, the agent's goal is to maximize her total reward in the first T rounds. Equivalently, she can minimize her regret relative to the set K, given by

R_T(w_1:T, r_1:T) = sup_{w∈K} Σ_{t=1}^{T} r_t^⊤ w − Σ_{t=1}^{T} r_t^⊤ w_t = sup_{w∈K} Σ_{t=1}^{T} r_t^⊤(w − w_t).
also accommodate constrained optimization problems, but this discussion was
omitted above because it is not used in this thesis.
I will now introduce Bregman divergences and describe how they are
used in the mirror ascent update. First, we need the notion of a strongly convex function:

f(u) ≥ f(w) + ∇f(w)^⊤(u − w) + (σ/2)‖u − w‖²
Even though Bregman divergences behave like distances, they are not true distances because they are not symmetric and do not satisfy the triangle inequality.
With these definitions in hand, we are ready to define online mirror
ascent. Algorithm 3 gives pseudocode. The online mirror ascent update
has two steps: first, we compute an unconstrained maximizer wt+1/2 of the
most recent payout vector together with a penalty that encourages wt+1/2
to not stray too far from wt . Usually this update step has a closed-form
expression that can be efficiently evaluated. The second step is to set wt+1
to be the projection of wt+1/2 back onto the constraint set K with respect
to the Bregman divergence DR . Theorem 3.8 bounds the regret of online
mirror ascent.
Σ_{t=1}^{T} r_t^⊤(w − w_t) ≤ D_R(w, w_1)/η + (η/σ) Σ_{t=1}^{T} ‖r_t‖²_*,

where ‖·‖_* is the dual norm. Moreover, if ‖r_t‖_* ≤ G for all t, then we have

Σ_{t=1}^{T} r_t^⊤(w − w_t) ≤ D_R(w, w_1)/η + ηTG²/σ = 2G√( T D_R(w, w_1)/σ ),

where the last line is obtained by taking η = √( σ D_R(w, w_1) / (T G²) ), which is the optimal value.
Proof. As before, this result is a special case of Theorem 5.11 together with
Lemma 5.12, so I defer the proof until Section 5.2.
Input: step size η > 0, regularizer R : S → R with S ⊃ K
1  Choose w_1 ∈ K arbitrarily;
2  for each round t = 1, 2, . . . do
3      Optionally use w_t in another computation;
4      Set w_{t+1/2} = argmax_{u∈S} η r_t^⊤ u − D_R(u, w_t);
5      Set w_{t+1} = argmin_{u∈K} D_R(u, w_{t+1/2});
6  end
Algorithm 3: Online mirror ascent
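One concrete instance of Algorithm 3, shown in the sketch below, takes K to be the probability simplex and R to be the negative entropy (which is 1-strongly convex with respect to the 1-norm). The unconstrained step then becomes a multiplicative update and the Bregman (KL) projection back onto the simplex is a simple renormalization. The setup is illustrative and not tied to the MDPCR reductions discussed later.

```python
import numpy as np

def entropic_mirror_ascent(payouts, step_size):
    """Online mirror ascent on the probability simplex with the negative-entropy
    regularizer: play w_t, observe payout vector r_t, update multiplicatively,
    and renormalize (the KL projection onto the simplex)."""
    d = payouts.shape[1]
    w = np.full(d, 1.0 / d)          # w_1: start at the uniform distribution
    iterates = []
    for r in payouts:                # r_t is revealed only after w_t is played
        iterates.append(w.copy())
        w = w * np.exp(step_size * r)
        w = w / w.sum()
    return np.array(iterates)

rng = np.random.default_rng(0)
payouts = rng.random((100, 5))       # illustrative payout vectors in [0, 1]^5
iterates = entropic_mirror_ascent(payouts, step_size=0.1)
total_reward = sum(r @ w for r, w in zip(payouts, iterates))
```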
Chapter 4
This chapter describes the first project that I worked on during my MSc
program. The goal of this project was to resolve an open question related to
policy gradient methods, which are learning algorithms for MDPs. Policy
gradient methods are instances of stochastic gradient ascent. Recall that to
apply stochastic gradient ascent, we must be able to sample random vec-
tors whose expectation is equal to the gradient of the objective function.
For both the total and average reward learning objectives, there are estab-
lished theoretical results that give methods for generating random gradient
estimates. Both gradient estimation schemes have a parameter called the
baseline function. The baseline does not change the expectation of the gradi-
ent estimates and therefore has no influence on the asymptotic performance
of policy gradient methods. The baseline function does, however, impact
the finite-time learning performance. How to choose the baseline function
is currently an open question. I propose that the baseline function should
be chosen to minimize the mean squared error (MSE) of the gradient esti-
mates. Under slightly idealistic conditions, I prove that this choice gives the
tightest bound on the suboptimality of the algorithm obtainable from the
standard analysis of stochastic gradient ascent. Unfortunately, the MSE-
minimizing baseline function depends on the transition probability kernel
T of the MDP, so the agent cannot directly compute it. The final contri-
bution of this project is to show that the MSE-minimizing baseline can be
estimated from only observable quantities.
the learning objective. I will loosely call such a random vector an estimated
gradient. The remainder of this section presents two gradient estimation
techniques: one for the total reward in episodic MDPs originally due to
Williams [W1992] and one for the average reward in ergodic MDPs due to
Sutton et al. [SMSM2000]. The presentation is slightly modified from the
original sources to better fit with this thesis.
Both gradient estimates have the same basic structure. To estimate the
gradient ∇J(θ), the agent follows policy π(θ) for a short period of time.
For any state-action pair (x, a), the policy gradient ∇θ π(x, a, θ) is the di-
rection in parameter-space that the agent would move the parameter vector
θ to increase the probability of choosing action a from state x. The learn-
ing objective gradient estimate is the sum of the policy gradients over the
state-action pairs visited during the trial period, each scaled by a term that
depends on how well the agent performed following that action (the details
depend on J) divided by the probability of choosing action a from state x.
Intuitively, the effect of adding this gradient estimate to the agent’s param-
eter vector is to increase the probability of the actions that performed well,
and to increase even more the probability of actions that performed well and
are rarely chosen. The following two theorems give detailed constructions of
the gradient estimates and show that they are equal in expectation to the
gradient of the learning objective.
ing practitioner, the policy gradient ∇θ π(x, a, θ) can be computed by the
agent. All other quantities that appear in the gradient estimate from Theo-
rem 4.1 are observable by the agent. Further, since the gradient estimate is
a function of the current parameter vector θ, whenever θ is itself random, we
have the property E[∇θ | θ] = ∇Jtotal (θ). Therefore, we can use these gra-
dient estimates in a simple policy gradient method that updates the policy
parameters once after each episode. Pseudocode is given in Algorithm 4.
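Algorithm 4 is not reproduced in this excerpt, but the update it performs has a simple shape: after an episode, accumulate the gradient estimate from Theorem 4.1 and add it to the parameters. The sketch below assumes the compatible features ϕ_t = ∇_θ log π(X_t, A_t, θ) are supplied by the caller; the interface and names are illustrative.

```python
import numpy as np

def episode_gradient_estimate(grad_log_pi_values, rewards):
    """Gradient estimate for one episode: sum over t of grad log pi(x_t, a_t, theta)
    scaled by the total reward G_t earned from time t onward."""
    rewards = np.asarray(rewards, dtype=float)
    returns_to_go = np.cumsum(rewards[::-1])[::-1]     # G_t = sum of rewards from t on
    grad = np.zeros_like(grad_log_pi_values[0])
    for phi_t, g_t in zip(grad_log_pi_values, returns_to_go):
        grad += g_t * phi_t
    return grad

def policy_gradient_update(theta, grad_log_pi_values, rewards, step_size):
    """One update of a simple policy gradient method after a complete episode."""
    return theta + step_size * episode_gradient_estimate(grad_log_pi_values, rewards)
```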
of the theorem is drawn from the stationary distribution ν(π(θ)) which the
agent doesn’t know and can’t directly sample from. Rather than drawing
a sample from ν(π(θ)), the agent can simply use the state they visit while
interacting with the MDP. In practice, the distribution over states at time t
is close enough to the stationary distribution that this also introduces only
a small bias. So, in the average reward setting, the agent is able to com-
pute a random vector which is nearly an unbiased estimate of the gradient
of J_avg(θ). Pseudocode for a policy gradient method based on this nearly unbiased gradient estimate is given in Algorithm 5.
At . One consequence is that if Gt is positive, then the update will make
the action more probable, and if it is negative, then the update will make
the action less probable. This behaviour is strange, since there is nothing
special about earning zero total reward. In fact, some MDPs have only
negative rewards and, in this case, the agent never directly increases the
probability of choosing good actions. They are only increased as a side-
effect of decreasing the probability of worse actions more aggressively. This
raises the following questions: Can we compare the total reward Gt to a
value other than zero? And how does the choice affect the performance of
the policy gradient method?
We will see below that, rather than comparing to zero, we can compare
to any baseline value that depends on the state Xt and the agent’s parameter
vector θ. The function b : X → R that maps each state to the baseline value
in that state is called the baseline function. The following two results show
how to incorporate the baseline function into the gradient estimates from
the previous section.
For any time t ≥ T^θ, we have that X_t^θ = x_term and therefore ϕ_t^θ = 0. From this, it follows that

Σ_{t=1}^{T^θ − 1} ϕ_t^θ b(X_t^θ) = Σ_{t=1}^{∞} ϕ_t^θ b(X_t^θ).

Moreover, for each coordinate i,

E[ Σ_{t=1}^{∞} |ϕ_{t,i}^θ b(X_t^θ)| ] = E[ Σ_{t=1}^{∞} ( |∂π(X_t^θ, A_t^θ, θ)/∂θ_i| / π(X_t^θ, A_t^θ, θ) ) |b(X_t^θ)| ]
  = Σ_{x,a} E[ Σ_{t=1}^{∞} I{X_t^θ = x, A_t^θ = a} ] ( |∂π(x, a, θ)/∂θ_i| / π(x, a, θ) ) |b(x)|
  = Σ_{x≠x_term, a} E[ Σ_{t=1}^{∞} I{X_t^θ = x, A_t^θ = a} ] ( |∂π(x, a, θ)/∂θ_i| / π(x, a, θ) ) |b(x)|
  ≤ Σ_{x≠x_term, a} ( |∂π(x, a, θ)/∂θ_i| / π(x, a, θ) ) |b(x)| · E[T^θ]
  < ∞,

where in line 3 we used the fact that ∂π(x_term, a, θ)/∂θ_i = 0. Therefore, by Fubini's theorem, we have for each i

E[ Σ_{t=1}^{∞} ϕ_{t,i}^θ b(X_t^θ) ] = Σ_{t=1}^{∞} E[ ϕ_{t,i}^θ b(X_t^θ) ].
Using the shorthand p_t(x, a) = P(X_t^θ = x, A_t^θ = a), p_t(a | x) = P(A_t^θ = a | X_t^θ = x), and p_t(x) = P(X_t^θ = x), we can rewrite the expectation as a sum over the possible state-action pairs:

E[ ϕ_t^θ b(X_t^θ) ] = Σ_{x,a} p_t(x, a) ( ∇_θ π(x, a, θ) / π(x, a, θ) ) b(x)
  = Σ_{x,a} p_t(a | x) p_t(x) ( ∇_θ π(x, a, θ) / π(x, a, θ) ) b(x)
  = Σ_{x,a} π(x, a, θ) p_t(x) ( ∇_θ π(x, a, θ) / π(x, a, θ) ) b(x)
  = Σ_x p_t(x) b(x) Σ_a ∇_θ π(x, a, θ)
  = 0,

where in the last line we used that Σ_a ∇_θ π(x, a, θ) = ∇_θ 1 = 0. We have shown that

E[ Σ_{t=1}^{T^θ − 1} ϕ_t^θ b(X_t^θ) ] = 0,

as required.
Proof. Again it suffices to show that E[ϕ^θ b(X^θ)] = 0. Using the shorthand p(x, a) = P(X^θ = x, A^θ = a), p(a | x) = P(A^θ = a | X^θ = x), and p(x) = P(X^θ = x), we can rewrite the expectation as a sum:

E[ ϕ^θ b(X^θ) ] = Σ_{x,a} p(x, a) ( ∇_θ π(x, a, θ) / π(x, a, θ) ) b(x)
  = Σ_x p(x) b(x) Σ_a ∇_θ π(x, a, θ)
  = 0.
MDPs and a different performance gradient estimate than the two described
in the previous section, so their results do not directly carry over to the
two cases studied in this project, though there are strong similarities. They
derive closed-form expressions for the variance minimizing baseline, a theory
that analyzes the variance of different baseline functions, and algorithms for
estimating the variance minimizing baseline. To the best of my knowledge,
they do not show that the variance minimizing baseline is a good or optimal
choice.
The goal in the rest of the chapter is to explore the connection between
the baseline function and the performance of policy gradient methods, and
to decide which baseline function gives the best performance.
Lemma 4.5. Let µ ∈ R^d be any vector and suppose that for each function b : X → R we have a random vector ∇^b such that E[∇^b] = µ. Then

argmin_{b:X→R} tr Cov(∇^b) = argmin_{b:X→R} E‖∇^b − µ‖²₂ = argmin_{b:X→R} E‖∇^b‖²₂.

Proof. This result is a consequence of the fact that E[∇^b] does not depend on b. Using the trace rotation equality (that is, tr(AB) = tr(BA)) and the definition of the covariance matrix, we have

tr Cov(∇^b) = tr E[ (∇^b − µ)(∇^b − µ)^⊤ ] = E[ (∇^b − µ)^⊤(∇^b − µ) ] = E‖∇^b − µ‖²₂,

which proves the first equivalence. Expanding the definition of the squared 2-norm, we have

E‖∇^b − µ‖²₂ = E‖∇^b‖²₂ − 2E[∇^b]^⊤µ + ‖µ‖²₂ = E‖∇^b‖²₂ − ‖µ‖²₂.

It follows that E‖∇^b − µ‖²₂ and E‖∇^b‖²₂ differ by a constant that does not depend on b. Therefore, the same function b minimizes both expressions.
a sequence of parameter vectors θ_1, . . . , θ_T. Let θ* be any parameter vector that maximizes J. We can use Theorem 3.3 to upper bound the sum Σ_{t=1}^{T} E[J(θ*) − J(θ_t)], which is one measure of the agent's learning performance until time T. The upper bound that we get is

Σ_{t=1}^{T} E[J(θ*) − J(θ_t)] ≤ B²/η + ηTG²,

where η is the step size used in the policy gradient method, B is an upper bound on ‖θ*‖₂, and G² is an upper bound on the second moments of the stochastic gradient estimates. Setting the step size according to η = B/(G√T) gives the best bound of 2BG√T. The gradient estimates
only appear in this bound through their second moments, and minimizing
the second moments gives the best bound. Since minimizing the second
moments of the gradient estimates is equivalent to minimizing the MSE, the
MSE-minimizing baseline gives the best possible regret bound from Theo-
rem 3.3.
The requirement that J be a concave function of the policy parameters
is almost never satisfied. However, it is often the case that J will be concave
in a neighbourhood of its local maxima. In this case, once the algorithm
enters one of these neighbourhoods, the above analysis holds if we replace θ* with the corresponding local maximum.
where

w(x, a, θ) = w̃(x, a, θ) / Σ_{a′} w̃(x, a′, θ)    and    w̃(x, a, θ) = ‖∇_θ π(x, a, θ)‖²₂ / π(x, a, θ),

is the baseline function that minimizes the MSE of the gradient estimate ∇_θ^b in Corollary 4.4.
Proof. By Lemma 4.5, we can equivalently find the minimizer of the second moment of the gradient estimate. That is, we want to solve the optimization problem

argmin_{b:X→R} E‖ ϕ^θ ( Q_avg(X^θ, A^θ, θ) − b(X^θ) ) ‖²₂.

Expanding the second moment gives

E‖ ϕ^θ ( Q_avg(X^θ, A^θ, θ) − b(X^θ) ) ‖²₂
  = E[ ‖ϕ^θ‖²₂ ( Q_avg(X^θ, A^θ, θ) − b(X^θ) )² ]
  = Σ_{x,a} p(x, a) ( ‖∇_θ π(x, a, θ)‖²₂ / π(x, a, θ)² ) ( Q_avg(x, a, θ) − b(x) )²
  = Σ_{x,a} p(a | x) p(x) ( w̃(x, a, θ) / π(x, a, θ) ) ( Q_avg(x, a, θ) − b(x) )²
  = Σ_x p(x) Σ_a w̃(x, a, θ) ( Q_avg(x, a, θ) − b(x) )².
For each state x, the value b(x) appears in exactly one term of the sum over x. Since there are no constraints between the b(x), we are free to minimize each term independently. Therefore, we can express the MSE-minimizing baseline point-wise as

b(x) = argmin_{c∈R} p(x) Σ_a w̃(x, a, θ) ( Q_avg(x, a, θ) − c )²
     = Σ_a w(x, a, θ) Q_avg(x, a, θ).
The only difference between these two baseline functions is the weighting used in the average.

It is also interesting to notice that the value function baseline minimizes the second moment of the quantity Q_avg(X^θ, A^θ, θ) − b(X^θ). The MSE-minimizing baseline uses the modified weights w(x, a, θ) in place of the action selection probabilities in order to instead minimize the second moment of the gradient estimate ϕ^θ ( Q_avg(X^θ, A^θ, θ) − b(X^θ) ).
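When the action values and the policy are known exactly (as in the idealized setting of this section), both baselines are weighted averages of the action values in each state and differ only in the weights. A small sketch for a single state, with made-up numbers standing in for π(x, ·, θ), ∇_θπ(x, ·, θ), and Q_avg(x, ·, θ):

```python
import numpy as np

def value_function_baseline(pi_x, q_x):
    """Weights are the action probabilities: b(x) = sum_a pi(x, a) Q(x, a)."""
    return pi_x @ q_x

def mse_minimizing_baseline(pi_x, grad_pi_x, q_x):
    """Weights proportional to ||grad_theta pi(x, a)||^2 / pi(x, a), following the
    closed-form expression for the MSE-minimizing baseline."""
    w_tilde = np.sum(grad_pi_x ** 2, axis=1) / pi_x
    w = w_tilde / w_tilde.sum()
    return w @ q_x

pi_x = np.array([0.2, 0.5, 0.3])            # action probabilities in state x
grad_pi_x = np.array([[0.1, -0.2],          # rows: grad_theta pi(x, a, theta)
                      [0.3, 0.1],
                      [-0.4, 0.1]])
q_x = np.array([1.0, 0.0, 2.0])             # action values Q_avg(x, a, theta)
b_value = value_function_baseline(pi_x, q_x)
b_mse = mse_minimizing_baseline(pi_x, grad_pi_x, q_x)
```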
approximation.
b(x) = Σ_{t=1}^{∞} E[ I{X_t^θ = x} ∇_θ^⊤ ϕ_t^θ ] / Σ_{t=1}^{∞} E[ I{X_t^θ = x} ‖ϕ_t^θ‖²₂ ],

where ∇_θ = ∇_θ^0 denotes the gradient estimate with the constantly zero baseline function, is the baseline function that minimizes the MSE of the gradient estimate ∇_θ^b in Corollary 4.3.

Proof. The gradient estimate with baseline b can be written as ∇_θ^b = ∇_θ − Σ_{t=1}^{∞} ϕ_t^θ b(X_t^θ), where ∇_θ is the gradient estimate with the constantly zero baseline function. We can therefore decompose the second moment as follows:

E‖∇_θ^b‖²₂ = E[ ( ∇_θ − Σ_{t=1}^{∞} ϕ_t^θ b(X_t^θ) )^⊤ ( ∇_θ − Σ_{t=1}^{∞} ϕ_t^θ b(X_t^θ) ) ]
  = E‖∇_θ‖²₂ − 2 Σ_{t=1}^{∞} E[ ∇_θ^⊤ ϕ_t^θ b(X_t^θ) ] + Σ_{t=1}^{∞} Σ_{s=1}^{∞} E[ ϕ_t^θ⊤ ϕ_s^θ b(X_t^θ) b(X_s^θ) ].    (4.1)
The double sum in (4.1) can be simplified using the following observation.
Let s and t be any two times with s < t. Since Markov policies choose
actions independently of the history of states and actions, the probability of choosing action a from state X_t^θ at time t does not depend on the state and action visited at time s. Formally, for any states x and x′, and any actions a and a′, we have

P( A_t^θ = a | X_t^θ = x, X_s^θ = x′, A_s^θ = a′ ) = π(x, a, θ).

Therefore, we can factor the joint probability of taking action a′ from state x′ at time s and later taking action a from state x at time t as follows:

P( X_t^θ = x, A_t^θ = a, X_s^θ = x′, A_s^θ = a′ ) = π(x, a, θ) P( X_t^θ = x, X_s^θ = x′, A_s^θ = a′ ).

Note that this trick does not work when the time s follows time t, because then knowing the state X_s^θ gives you information about what action was taken at time t. Using this factorization, for any times s < t, we have

E[ ϕ_t^θ⊤ ϕ_s^θ b(X_t^θ) b(X_s^θ) ]
  = Σ_{x,a,x′,a′} P( X_t^θ = x, A_t^θ = a, X_s^θ = x′, A_s^θ = a′ ) ( ∇_θ π(x, a, θ)^⊤ ∇_θ π(x′, a′, θ) / ( π(x, a, θ) π(x′, a′, θ) ) ) b(x) b(x′)
  = Σ_{x,x′,a′} P( X_t^θ = x, X_s^θ = x′, A_s^θ = a′ ) ( ( Σ_a ∇_θ π(x, a, θ) )^⊤ ∇_θ π(x′, a′, θ) / π(x′, a′, θ) ) b(x) b(x′)
  = 0,

where the last line uses the fact that Σ_a ∇_θ π(x, a, θ) = ∇_θ 1 = 0. The expression is symmetric in the times t and s, so an identical argument can be used to show that E[ ϕ_t^θ⊤ ϕ_s^θ b(X_t^θ) b(X_s^θ) ] = 0 when s > t. Therefore, the
only non-zero terms of the double sum from (4.1) are the terms where s = t, and therefore we can write the second moment as

E‖∇_θ^b‖²₂ = E‖∇_θ‖²₂ − 2 Σ_{t=1}^{∞} E[ ∇_θ^⊤ ϕ_t^θ b(X_t^θ) ] + Σ_{t=1}^{∞} E[ ‖ϕ_t^θ‖²₂ b(X_t^θ)² ]
  = E‖∇_θ‖²₂ + Σ_x Σ_{t=1}^{∞} ( E[ I{X_t^θ = x} ‖ϕ_t^θ‖²₂ ] b(x)² − 2 E[ I{X_t^θ = x} ∇_θ^⊤ ϕ_t^θ ] b(x) ).

The last line above is obtained by first summing over the states, and then summing over the times at which state x was visited, and shows that the second moment is again separable over the states x.

Since the second moment is separable over the values b(x), we are free to minimize each independently. Again, the contribution of each b(x) to the second moment is an easily-minimized quadratic expression in b(x). The MSE-minimizing baseline is therefore given by

b(x) = Σ_{t=1}^{∞} E[ I{X_t^θ = x} ∇_θ^⊤ ϕ_t^θ ] / Σ_{t=1}^{∞} E[ I{X_t^θ = x} ‖ϕ_t^θ‖²₂ ].
The above expression for the MSE minimizing baseline may not be very
useful, since it may not be easy to compute even if we knew the MDP
transition probability kernel T. Both the form and derivation of this baseline
function are complicated because of the correlations between the states and
actions visited at different times during a single episode.
We can derive a much simpler alternative baseline that may still go a long way towards minimizing the MSE. The idea is, rather than minimizing the second moment of the complete gradient estimate ∇_θ^b = Σ_{t=1}^{T^θ − 1} ϕ_t^θ ( G_t^θ − b(X_t^θ) ), we can try to minimize the second moment of each term in the sum independently. The following result shows that a much simpler baseline function minimizes the second moment of ϕ_t^θ ( G_t^θ − b(X_t^θ) ) for all times t.
Theorem 4.8. Let π : R^d → Π be a policy parameterization for an episodic MDP and let θ ∈ R^d be any parameter vector. Then the function

b(x) = Σ_a w(x, a, θ) Q_total(x, a, θ),

where

w(x, a, θ) = w̃(x, a, θ) / Σ_{a′} w̃(x, a′, θ)    and    w̃(x, a, θ) = ‖∇_θ π(x, a, θ)‖²₂ / π(x, a, θ),

simultaneously minimizes the second moment of ϕ_t^θ ( G_t^θ − b(X_t^θ) ) for all times t.
Proof. Fix any time t and let X_t^θ and A_t^θ be the state and action at time t when following policy π(θ). Let ϕ_t^θ = ∇_θ π(X_t^θ, A_t^θ, θ) / π(X_t^θ, A_t^θ, θ) be the vector of compatible features at time t and let G_t^θ be the total reward following action A_t^θ. Our goal is to find the baseline function b : X → R that minimizes

E‖ ϕ_t^θ ( G_t^θ − b(X_t^θ) ) ‖²₂.

Expanding this second moment gives

E‖ ϕ_t^θ ( G_t^θ − b(X_t^θ) ) ‖²₂ = E[ ‖ϕ_t^θ‖²₂ ( G_t^θ − b(X_t^θ) )² ]
  = Σ_{x,a} p_t(x, a) ( ‖∇_θ π(x, a, θ)‖²₂ / π(x, a, θ)² ) ( Q_total(x, a, θ) − b(x) )²
  = Σ_{x,a} p_t(a | x) p_t(x) ( w̃(x, a, θ) / π(x, a, θ) ) ( Q_total(x, a, θ) − b(x) )²
  = Σ_x p_t(x) Σ_a w̃(x, a, θ) ( Q_total(x, a, θ) − b(x) )².

From this we see that the second moment is separable over the states and again the contribution of each state is quadratic in b(x). Therefore, we can choose each b(x) independently as follows:

b(x) = argmin_{c∈R} p_t(x) Σ_a w̃(x, a, θ) ( Q_total(x, a, θ) − c )²
     = Σ_a w(x, a, θ) Q_total(x, a, θ).

Since the above baseline does not depend on the time index t, it simultaneously minimizes the second moment of each ϕ_t^θ ( G_t^θ − b(X_t^θ) ).
For both the average and total reward settings, we will show that the
second moment is a convex quadratic function of the weights w used in the
baseline function. Minimizing the second moment is equivalent to maximiz-
ing the negative second moment, which is a concave function. Therefore, if
we can construct unbiased random estimates of the gradient ∇_w E‖∇_θ^w‖²₂,
then we can use stochastic gradient ascent to directly approximate the MSE
minimizing baselines. The following theorems show that it is possible to
compute unbiased random estimates of the gradient by interacting with the
environment.
D_θ^w = 2( ‖ϕ^θ‖²₂ ψ^θ ψ^θ⊤ w − ψ^θ ϕ^θ⊤ ∇_θ^0 )

satisfies E[D_θ^w] = ∇_w E[‖∇_θ^w‖²₂], where ψ^θ = ψ(X^θ) and ∇_θ^w is the gradient estimate from Corollary 4.4 with baseline b(x) = ψ(x)^⊤ w.

Proof. The gradient estimate with this baseline can be written as

∇_θ^w = ϕ^θ ( Q_avg(X^θ, A^θ, θ) − ψ^θ⊤ w ) = ∇_θ^0 − ϕ^θ ψ^θ⊤ w.

Therefore, the second moment is

E‖∇_θ^w‖²₂ = E[ ( ∇_θ^0 − ϕ^θ ψ^θ⊤ w )^⊤ ( ∇_θ^0 − ϕ^θ ψ^θ⊤ w ) ]
  = w^⊤ E[ ‖ϕ^θ‖²₂ ψ^θ ψ^θ⊤ ] w − 2 E[ ∇_θ^0⊤ ϕ^θ ψ^θ⊤ ] w + E‖∇_θ^0‖²₂.
This is a quadratic form in w and, since the second moment is bounded below by 0, it follows that it must also be convex. With this, we have

∇_w E‖∇_θ^w‖²₂ = ∇_w { w^⊤ E[ ‖ϕ^θ‖²₂ ψ^θ ψ^θ⊤ ] w − 2 E[ ∇_θ^0⊤ ϕ^θ ψ^θ⊤ ] w + E‖∇_θ^0‖²₂ }
  = E[ 2 ‖ϕ^θ‖²₂ ψ^θ ψ^θ⊤ w − 2 ψ^θ ϕ^θ⊤ ∇_θ^0 ],

so

D_θ^w = 2( ‖ϕ^θ‖²₂ ψ^θ ψ^θ⊤ w − ψ^θ ϕ^θ⊤ ∇_θ^0 )

is an unbiased estimate of the gradient ∇_w E‖∇_θ^w‖²₂.
All quantities in D_θ^w are observable by the agent, so the agent can produce samples of D_θ^w. Pseudocode for a policy gradient method that uses the above estimate of the baseline function is given in Algorithm 6.
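Algorithm 6 itself is not included in this excerpt; the following sketch shows the kind of per-sample update it suggests, with the policy parameters following the usual policy gradient step and the baseline weights descending along D_θ^w from Theorem 4.9. The argument q_estimate stands in for (an estimate of) Q_avg(X^θ, A^θ, θ), and all names and the interface are illustrative.

```python
import numpy as np

def baseline_weight_update(w, phi, psi, grad_zero, baseline_step):
    """One stochastic descent step on the baseline weights using
    D = 2 * (||phi||^2 * psi * (psi . w) - psi * (phi . grad_zero))."""
    D = 2.0 * (np.dot(phi, phi) * psi * np.dot(psi, w) - psi * np.dot(phi, grad_zero))
    return w - baseline_step * D        # descend: we are minimizing the second moment

def average_reward_step(theta, w, phi, psi, q_estimate, step_size, baseline_step):
    """Policy step using the linear baseline b(x) = psi(x)^T w, then a baseline step."""
    grad_zero = phi * q_estimate                      # estimate with the zero baseline
    grad_with_baseline = phi * (q_estimate - np.dot(psi, w))
    theta_new = theta + step_size * grad_with_baseline
    w_new = baseline_weight_update(w, phi, psi, grad_zero, baseline_step)
    return theta_new, w_new
```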
Theorem 4.10. Let π : R^d → Π be a policy parameterization for an episodic MDP, θ ∈ R^d be any policy parameter vector, ψ : X → R^n be a baseline feature map, and w ∈ R^n be any baseline parameter vector. Assume that ∇_θ π(x_term, a, θ) = 0 for all actions a and parameter vectors θ (the parameterization can always be chosen so that this property is satisfied). Then the map w ↦ E‖∇_θ^w‖²₂ is a convex quadratic function of w and the random vector

D_θ^w = 2 Σ_{t=1}^{T^θ − 1} ( ‖ϕ_t^θ‖²₂ ψ_t^θ ψ_t^θ⊤ w − ψ_t^θ ϕ_t^θ⊤ ∇_θ^0 )

satisfies E[D_θ^w] = ∇_w E‖∇_θ^w‖²₂, where ψ_t^θ = ψ(X_t^θ) and ∇_θ^w is the gradient estimate from Corollary 4.3 with baseline b(x) = ψ(x)^⊤ w.
Proof. The gradient estimate with this baseline can be written as

∇_θ^w = Σ_{t=1}^{T^θ − 1} ϕ_t^θ ( G_t^θ − ψ_t^θ⊤ w )
  = Σ_{t=1}^{∞} ϕ_t^θ G_t^θ − Σ_{t=1}^{∞} ϕ_t^θ ψ_t^θ⊤ w
  = ∇_θ^0 − M^θ w,

where M^θ = Σ_{t=1}^{∞} ϕ_t^θ ψ_t^θ⊤. We can use this to write the second moment as

E‖∇_θ^w‖²₂ = E[ ( ∇_θ^0 − M^θ w )^⊤ ( ∇_θ^0 − M^θ w ) ]
  = w^⊤ E[ M^θ⊤ M^θ ] w − 2 E[ ∇_θ^0⊤ M^θ ] w + E‖∇_θ^0‖²₂.

Again this is a quadratic form in w and, since the second moment is bounded below, it must be convex.

Before taking the gradient, we can simplify the matrix of coefficients E[M^θ⊤ M^θ] in a way very similar to the simplification of the double sum in Theorem 4.7. Recall that for any times s < t, we have the following factorization:

P( X_t^θ = x, A_t^θ = a, X_s^θ = x′, A_s^θ = a′ ) = π(x, a, θ) P( X_t^θ = x, X_s^θ = x′, A_s^θ = a′ ).
Therefore, whenever s < t, we have

E[ ψ_t^θ ϕ_t^θ⊤ ϕ_s^θ ψ_s^θ⊤ ]
  = Σ_{x,a,x′,a′} P( X_t^θ = x, A_t^θ = a, X_s^θ = x′, A_s^θ = a′ ) ψ(x) ( ∇_θ π(x, a, θ)^⊤ ∇_θ π(x′, a′, θ) / ( π(x, a, θ) π(x′, a′, θ) ) ) ψ(x′)^⊤
  = Σ_{x,x′,a′} P( X_t^θ = x, X_s^θ = x′, A_s^θ = a′ ) ψ(x) ( ( Σ_a ∇_θ π(x, a, θ) )^⊤ ∇_θ π(x′, a′, θ) / π(x′, a′, θ) ) ψ(x′)^⊤
  = 0.

The same holds when s > t, so only the diagonal terms of the double sum survive and

E[ M^θ⊤ M^θ ] = E[ Σ_{t=1}^{∞} Σ_{s=1}^{∞} ψ_t^θ ϕ_t^θ⊤ ϕ_s^θ ψ_s^θ⊤ ] = Σ_{t=1}^{∞} E[ ‖ϕ_t^θ‖²₂ ψ_t^θ ψ_t^θ⊤ ].

Since ϕ_t^θ = 0 for all t ≥ T^θ, this gives

E[ M^θ⊤ M^θ ] = E[ Σ_{t=1}^{T^θ} ‖ϕ_t^θ‖²₂ ψ_t^θ ψ_t^θ⊤ ]

and

E[ M^θ⊤ ∇_θ^0 ] = E[ Σ_{t=1}^{T^θ} ψ_t^θ ϕ_t^θ⊤ ∇_θ^0 ].

Substituting these equalities into the expression for the second moment, we have

E‖∇_θ^w‖²₂ = w^⊤ E[ Σ_{t=1}^{T^θ} ‖ϕ_t^θ‖²₂ ψ_t^θ ψ_t^θ⊤ ] w − 2 E[ Σ_{t=1}^{T^θ} ψ_t^θ ϕ_t^θ⊤ ∇_θ^0 ]^⊤ w + E‖∇_θ^0‖²₂.
Taking the gradient of this expression and exchanging the gradient and expectation gives that

D_θ^w = 2 Σ_{t=1}^{T^θ} ( ‖ϕ_t^θ‖²₂ ψ_t^θ ψ_t^θ⊤ w − ψ_t^θ ϕ_t^θ⊤ ∇_θ^0 )

satisfies E[D_θ^w] = ∇_w E‖∇_θ^w‖²₂.
It is worth noting that the gradient estimate from Theorem 4.10 can be computed in time linear in the length of the episode: the gradient estimate ∇_θ^0 can be computed in a first pass over the episode, and D_θ^w can then be computed in a second pass.

Pseudocode for a policy gradient method that uses the above estimate of the baseline function is given in Algorithm 7.
   b(x) = ψ(x)^⊤ w_τ;
6  Set θ_{τ+1} = θ_τ + η ∇_{θ_τ};
7  Compute D_{θ_τ}^{w_τ} according to Theorem 4.10;
8  Set w_{τ+1} = w_τ − α D_{θ_τ}^{w_τ};
9  end
Algorithm 7: Policy gradient method for episodic MDPs with a linear approximation to the MSE minimizing baseline.
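A sketch of the two-pass computation described above, mirroring the structure of Algorithm 7: the first pass accumulates the zero-baseline estimate ∇_θ^0, and the second accumulates D_θ^w from Theorem 4.10. The data layout (per-step lists of compatible features, baseline features, and returns) is an assumption made for the illustration.

```python
import numpy as np

def episodic_baseline_gradients(phis, psis, returns, w):
    """Compute the episode's gradient estimate with the linear baseline
    psi(x)^T w and the baseline-weight gradient estimate D from Theorem 4.10."""
    # First pass: zero-baseline estimate grad0 = sum_t G_t * phi_t.
    grad0 = sum(g * phi for phi, g in zip(phis, returns))
    # Gradient estimate with the baseline subtracted.
    grad_w = sum((g - psi @ w) * phi for phi, psi, g in zip(phis, psis, returns))
    # Second pass: D = 2 * sum_t (||phi_t||^2 * psi_t * (psi_t . w) - psi_t * (phi_t . grad0)).
    D = 2.0 * sum((phi @ phi) * psi * (psi @ w) - psi * (phi @ grad0)
                  for phi, psi in zip(phis, psis))
    return grad_w, D

def episodic_update(theta, w, phis, psis, returns, step_size, baseline_step):
    """One episode's update of the policy parameters and the baseline weights."""
    grad_w, D = episodic_baseline_gradients(phis, psis, returns, w)
    return theta + step_size * grad_w, w - baseline_step * D
```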
4.5 Experiments
This section presents an empirical comparison of the three baseline functions
discussed in this thesis: the always zero baseline, the value function, and
the MSE minimizing baseline. The goal of these experiments is to answer
the question: Does the baseline function have a significant impact on per-
formance? And if so, does one of the baseline functions consistently give
better performance than the others?
Earlier in this chapter I showed that when the formal learning objective
is a concave function of the policy parameters, then we can upper bound the
agent’s expected regret. This regret bound only depended on the baseline
through the second moments of the performance gradient estimates. There-
fore, the best regret bound is obtained by choosing the baseline to minimize
the second moment, which we saw was equivalent to minimizing the MSE.
We also saw that the best regret bound after T updates is obtained by
setting the step size to be

η(T) = B / (G√T),
where G2 is an upper bound on the second moments of the first T gradient
estimates, and B is an upper bound on the 2-norm of the unknown optimal
parameter vector. Therefore, in the concave setting, the step size which gives
the best regret bound scales inversely proportional to G. My hypothesis is
that even when the learning objective is not a concave function of the policy
parameters, choosing the baseline to minimize the MSE (and therefore the
second moment) is a good choice. This hypothesis would be supported by
the experiments if the MSE baseline gives the highest performance and if its
maximum performance is achieved with a higher step size than for the other
baselines. All but one of the following experiments support this hypothesis.
Policy gradient methods have two significant parameters: the step size
and the baseline function. The expected performance depends on the pair of
parameter settings. To study the effect of the baseline function, I estimate
the expected performance when using each baseline with a wide range of step
sizes. For each baseline, this results in a graph showing the performance of
the given baseline as a function of the step size. This kind of parameter
study involves running the algorithm many more times than would be done
in practice. In practice, the parameters might be chosen according to a
rule of thumb or based on some brief experiment and then the algorithm
may be run only once with those settings. In these experiments, however,
we run the algorithm for many parameter settings and, for each parameter
setting, we do enough runs to accurately estimate the expected total reward.
While these parameter studies do not necessarily resemble reinforcement
learning in practice, they allow us to confidently compare the baselines and
to understand how their performance depends on the step size. A strong
result would be to find that one baseline outperforms the rest for every
step size. In this case, no matter how you choose the step size, you should
always use the winning baseline. A weaker result would be to find that the
maximum performance of one baseline (maximizing over the step sizes) is
higher than the maximum performances of the other baselines. This result
is weaker, since it might not be easy to find good step sizes in practice.
The experiments compare the three baseline functions in two different
test-beds. I borrowed the first test-bed, called the ten-armed test-bed, from
Rich Sutton and Andy Barto’s book [SB1998]. In the ten-armed test-bed,
the MDP has a single state and ten actions, each having rewards drawn from
a Gaussian distribution with unit standard deviation. Such an MDP is called
a bandit problem, in reference to multi-armed bandits or slot machines. An
episode in a bandit problem consists of choosing a single arm, and the agent’s
goal is to maximize her total reward over a fixed number of episodes. In all
the following experiments, the number of episodes is taken to be 20. Rather
than fixing a single bandit for the comparison, we can randomly produce
many distinct bandits by sampling the mean payout for each arm from a
standard Gaussian distribution. We compare the average performance over
a large number of randomly generated bandits to reduce the chance that the
results are an artifact of one specific bandit.
MDPs with a single state are missing many properties of typical MDPs.
For example, the action value function in a single-state MDP does not de-
pend on the agent’s policy at all. I designed the second test bed, called the
triangle test-bed, to be similar to the ten-armed test-bed but with more than
one state. Again, the payout for each state-action pair will be a Gaussian
with unit standard deviation and mean sampled from a standard Gaussian
distribution. The states are organized in a triangle with R = 5 rows, where
there are r states in the rth row. The starting state is the unique state in
Figure 4.1: Each black circle represents a state in a triangular MDP and
each arrow represents the deterministic transition resulting from the Left
action or Right action, depending on whether the arrow points left or right.
Starting from the top state, the agent chooses to move either Left or Right
until she reaches the bottom row.
the first row and there are two actions, Left and Right, which each move
the agent from its current state to the state below and left, or below and
right, respectively. Figure 4.1 shows the layout of the states and the avail-
able actions. As in the ten-armed test-bed, the agent is allowed to perform
a fixed number of episodes on each randomly generated triangle MDP, and
her goal is to maximize total reward. In all of the following experiments,
the agent is given 50 episodes to interact with each instance of the triangle
MDP.
I use a similar policy parameterization in both test-beds. The policy
parameterization is a mixture of the uniform policy, which chooses each
action with equal probability, and the so-called Gibbs policy, which is a
common policy parameterization when there are a small number of states
and actions. In the Gibbs policy, the agent stores a weight, or preference,
for each state-action pair. The weights are stored in a |X × A|-dimensional
vector θ, and we write θ(x, a) to denote the weight associated to action a
in state x. The Gibbs policy is parameterized by the agent’s preferences in
the following way:

π_Gibbs(x, a, θ) = exp(θ(x, a)) / Σ_{a′} exp(θ(x, a′)),

where the sum is taken over all actions. Under the Gibbs policy, the agent prefers to choose actions with a high weight, and the action selection probabilities are invariant to adding a constant to all weights. For a mixing constant ε, which in all experiments I take to be 0.05, the policy parameterization that I use is given by

π(x, a, θ) = (1 − ε) π_Gibbs(x, a, θ) + ε / |A|.

The gradient of the Gibbs policy is

∇_θ π_Gibbs(x, a, θ) = π_Gibbs(x, a, θ) ( e(x, a) − π_Gibbs(x, θ) ),

where e(x, a) is the (x, a)th standard-basis vector and π_Gibbs(x, θ) is the vector whose (x′, a′)th entry is equal to I{x = x′} π_Gibbs(x′, a′, θ). From this, the gradient of the mixed policy is given by

∇_θ π(x, a, θ) = (1 − ε) π_Gibbs(x, a, θ) ( e(x, a) − π_Gibbs(x, θ) ).
Since we have closed form expressions for the action selection probabili-
ties and their gradient with respect to the policy parameters, it is easy to
compute the stochastic gradient estimates discussed in this chapter. In all
experiments, the weight vector θ is always initialized to the zero vector,
which results in the uniform policy.
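A small sketch of this parameterization for a single state, following the mixture and gradient expressions above; the 5% mixing constant and the tabular weight layout come from the text, while everything else is an illustrative implementation choice.

```python
import numpy as np

EPSILON = 0.05                        # mixing constant used in the experiments

def gibbs_probs(theta_x):
    """Gibbs (softmax) probabilities for one state from its preference vector."""
    z = np.exp(theta_x - theta_x.max())     # subtract the max for numerical stability
    return z / z.sum()

def mixed_policy(theta_x):
    """Mixture of the uniform policy and the Gibbs policy for one state."""
    n_actions = theta_x.shape[0]
    return EPSILON / n_actions + (1.0 - EPSILON) * gibbs_probs(theta_x)

def mixed_policy_gradient(theta_x, a):
    """Gradient of pi(x, a, theta) with respect to the preferences theta(x, .):
    (1 - eps) * pi_Gibbs(x, a) * (e_a - pi_Gibbs(x, .))."""
    p = gibbs_probs(theta_x)
    e_a = np.zeros_like(theta_x)
    e_a[a] = 1.0
    return (1.0 - EPSILON) * p[a] * (e_a - p)

theta_x = np.zeros(2)                 # zero preferences give the uniform policy
probs = mixed_policy(theta_x)         # array([0.5, 0.5])
grad = mixed_policy_gradient(theta_x, a=0)
```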
One practical complication is that the value function and the MSE min-
imizing baseline are both unknown to the computer agent, since they de-
pend on the unknown MDP transition probability kernel. An agent solving
real-world problems must estimate these functions in order to use them as
baselines. The experimental challenge is that, if we compare estimated base-
lines, we can’t be sure that a difference in performance is due to the choice
of baseline and not the quality of the estimation. Even though unknowable
baseline functions are not usable in practice, measuring their performance is
still useful because it motivates the search for good approximation methods
for the best theoretical baselines. For this reason, the experiments com-
pare not only the estimated baselines, but also their true values. In both of
the test-beds, we have perfect knowledge of the transition probability kernel
(though we do not share this information with the agent). Using this knowl-
edge, we can give the computer agent access to an oracle that produces the
true MSE minimizing baseline function, or the true value function.
Figure 4.2: Comparison of the total reward earned by the episodic policy
gradient method when using the true baseline functions (mse, value, and
zero). The error bars show 3 standard errors in the mean. The left figure
shows performance in the ten-armed test-bed. The right figure shows
performance in the triangle test-bed.
The first experiment compares these two true baseline functions to the zero baseline, which is equivalent to using no baseline at all. For the ten-
armed test-bed, the average reward for each action is drawn from a standard
Gaussian random variable and therefore zero is a good guess at the average
payout of the arms. We might expect that if the mean payouts were shifted
up or down, the performance of the zero baseline may deteriorate, since
zero is no longer a good estimate of the mean arm payout. In contrast, the
updates made by both the value function and MSE minimizing baselines
do not change when a constant is added to all rewards. The situation is
similar in the triangle test-bed, since the expected payout of a path through
the states is also zero. Figure 4.3 shows the performance of the baselines
in both test-beds when the rewards are shifted up or down by 10. In this
case, we see a significant improvement when using non-zero baselines. It
is difficult to see the difference between the value function and the MSE
minimizing baseline, but since these methods are invariant to translations
in the rewards, their difference is exactly the same as in the mean-zero case.
The above comparisons show that, for the two test-beds, it is important to use a non-zero baseline.
Figure 4.3: Performance of the true baseline functions (mse, value, and zero) in the ten-armed and triangle test-beds when the rewards are shifted by µ = ±10.
I refer to this estimate of the MSE minimizing baseline as Gmse (G for stochastic gradient descent).
I expect that the choice of the step size for the stochastic gradient descent
algorithm used to compute the Gmse baseline, denoted by ηbl, will have a
similar effect on the performance of the agent for all policy gradient step sizes
η. Therefore, to set ηbl for each test-bed, I ran the agent with policy gradient
step size η = 0.9 for the correct number of episodes (20 in the ten-armed
test-bed and 50 in the triangle test-bed) 1000 times and chose the baseline
step-size from the set {0.001, 0.01, 0.1, 1.0} that maximized performance.
The best parameter settings were ηbl = 0.01 in the ten-armed test-bed and
ηbl = 0.1 in the triangle test-bed.
There are standard techniques for estimating the value function of a pol-
icy in an MDP. Rather than estimating the value function directly, though,
I use the fact that the value function is a weighted average of the action
value function. This gives more accurate estimates in the ten-armed test-
bed, since the action values do not depend on the agent’s current policy. I
will refer to this as the Qvalue baseline.
To estimate the action value function in the ten-armed test-bed, I use
the fact that the action value function does not depend on the agent’s policy,
since there is only one state. In this case, a good estimate of the action value
for a given action is to take the sample average of the observed rewards for
that action. For actions that have not yet been tried, a default value of 0 is
used. The only parameter of this estimation technique for the action values
is the default value. Since it only influences performance early in learning,
I did not tune the default value.
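A minimal sketch of this estimator, assuming rewards are processed one at a time, is the incremental sample mean below; the class name and interface are illustrative only.

import numpy as np

class SampleAverageQ:
    """Sample-average action-value estimates for the single-state test-bed."""

    def __init__(self, num_actions, default_value=0.0):
        self.q = np.full(num_actions, default_value)   # default for untried actions
        self.counts = np.zeros(num_actions, dtype=int)

    def update(self, action, reward):
        self.counts[action] += 1
        # incremental form of the sample mean: q <- q + (r - q) / n
        self.q[action] += (reward - self.q[action]) / self.counts[action]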
In the triangle test-bed, I use the Sarsa(λ) algorithm to estimate the
action value functions that are passed into the two action-value oriented
baseline estimates. Sarsa(λ) has two parameters, a step size α and an el-
igibility trace parameter λ. Again, I expect that the Sarsa(λ) parameters
should affect the agent’s performance similarly for all policy-gradient step
sizes and all baselines. For this reason, I chose the parameters α and λ
by running the policy gradient method with fixed step size η = 0.9 for 50
episodes 1000 times, and chose the parameters that gave the smallest aver-
age squared error in the action-value estimates at the end of the 50 episodes.
Of all pairs of α in {0.001, 0.01, 0.1, 1.0} and λ in {0.001, 0.01, 0.1, 1.0} the
best setting was to take α = λ = 0.1.
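For reference, a minimal tabular Sarsa(λ) sketch with accumulating traces is given below. The environment and policy interfaces (reset, step, and a callable policy) are assumptions of the sketch and not part of the test-bed code.

import numpy as np

def sarsa_lambda_episode(env, policy, Q, alpha=0.1, lam=0.1, gamma=1.0):
    """One episode of tabular Sarsa(lambda) with accumulating traces.

    env is assumed to expose reset() -> state and step(a) -> (next_state,
    reward, done); policy(state) samples an action; Q is a (num_states,
    num_actions) array that is updated in place."""
    z = np.zeros_like(Q)                  # eligibility traces
    x = env.reset()
    a = policy(x)
    done = False
    while not done:
        x_next, r, done = env.step(a)
        if done:
            delta = r - Q[x, a]
        else:
            a_next = policy(x_next)
            delta = r + gamma * Q[x_next, a_next] - Q[x, a]
        z[x, a] += 1.0                    # accumulating trace
        Q += alpha * delta * z
        z *= gamma * lam
        if not done:
            x, a = x_next, a_next
    return Q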
Figure 4.4 shows the parameter studies for each of the estimated baselines
in the two test-beds. As before, I also present the results when the rewards
are shifted up or down by 10. The results of this experiment tell a different
story than what we saw for the true baseline functions. Let µ denote the
amount that the rewards are shifted by. In the first experiments where we
compared the true baseline functions, the MSE minimizing baseline gave
the best performance across a wide range of parameters. In this setting,
however, the baseline with the best performance depends on the step size
and which baseline achieves the highest performance for an optimized step
size depends on the value of µ. These results do not support my hypothesis
and were surprising because I expected the relative performance of the non-
zero baselines to be the same independent of the shift µ.
The differences in performance between the non-zero baselines for the
various values of µ can be explained by an interesting property of policy
gradient methods. Consider the bandit problem and suppose that our base-
line is substantially larger than the rewards. Suppose the agent chooses a
random action A and receives reward R. Then the term (R − b), where b
is the baseline value, is negative with high probability, so the agent will reduce
the probability of choosing action A, even if it was the best action. Since
the action selection probabilities must sum to one, the probability of the
other actions will be increased. On the following episode, the agent will be
more likely to choose an action other than A, even if A was the best avail-
able action. In this way, having a baseline that overestimates the rewards
encourages systematic exploration of the actions. On the other hand, if the
baseline is substantially lower than the rewards, the probability of choosing
action A will always be increased, even if it was the worst action. On the
following episodes, the agent will be more likely to choose the same action
again. This asymmetry between having baselines that are too high or too
low suggests that it is better to have an overestimate, which results in
systematic exploration, rather than an underestimate, which results in less
exploration and more erratic updates to the parameter vector.
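The following small simulation illustrates this asymmetry on a two-armed Gaussian bandit. The constants (baselines of ±20, step size 0.1) are chosen only to make the effect visible; they are not taken from the experiments.

import numpy as np

def reinforce_step(rng, theta, baseline, means, eta=0.1):
    """One policy-gradient update on a Gaussian bandit with a fixed baseline."""
    p = np.exp(theta - theta.max())
    p /= p.sum()                                  # Gibbs action probabilities
    a = rng.choice(len(p), p=p)
    r = rng.normal(means[a], 1.0)
    grad_log = -p.copy()
    grad_log[a] += 1.0                            # gradient of log pi(a | theta)
    return theta + eta * (r - baseline) * grad_log, a, p

rng = np.random.default_rng(0)
means = np.array([1.0, 0.0])
for baseline in (20.0, -20.0):                    # far above / far below the rewards
    theta, a, p_before = reinforce_step(rng, np.zeros(2), baseline, means)
    p_after = np.exp(theta - theta.max())
    p_after /= p_after.sum()
    # With baseline +20 the chosen arm's probability drops (the agent moves on
    # to other arms); with baseline -20 it rises, whichever arm was chosen.
    print(baseline, a, p_before[a], p_after[a])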
Figure 4.4: Parameter studies for the estimated baselines (Gmse, Qmse, Qvalue, and zero) in the ten-armed and triangle test-beds, with and without the rewards shifted by µ = ±10.
Figure 4.5: Comparisons of the four estimated baselines with good initial-
izations in the ten-armed and triangle test-beds. The mean reward for each
action is drawn from a Gaussian distribution with mean µ and unit standard
deviation. In the first row, µ = 10, in the second row, µ = 0, and in the
third row µ = −10. The left column shows estimates of the expected total
reward in the ten-armed test-bed and the second column shows the same for
the triangle test-bed. Again, the error bars show 3 standard errors in the
mean.
Chapter 5
Learning in MDPCRs
This chapter describes the second project that I worked on during my MSc,
which focused on designing and analyzing efficient learning algorithms for
loop-free episodic and uniformly ergodic MDPCRs, which were introduced in
Section 2.2. We propose three new algorithms: an algorithm for learning in
loop-free episodic MDPCRs under instructive (full-information) feedback,
where the agent observes the entire reward function rt after each action;
an algorithm for learning in loop-free episodic MDPCRs under evaluative
(bandit) feedback, where the agent only observes the reward for the action
she took; and an algorithm for learning in uniformly ergodic MDPCRs under
instructive feedback. We believe that the algorithm for learning under
instructive feedback in ergodic MDPCRs can be extended to learning un-
der evaluative feedback, but the analysis proved to be quite challenging.
In all cases, we assume that the rewards belong to the interval [0, 1]. The
theoretical results for these three algorithms either improve or complement
the results for existing algorithms and often hold even under weaker condi-
tions. This comes at the cost of having increased, though still polynomial,
computational complexity.
A common strategy in computing science is to reduce a problem that we
would like to solve to a problem that we have already solved. The strategy of
this project is to reduce the problem of learning in MDPCRs to the problem
of online linear optimization. Reduction in this case means that if I have an
algorithm for online linear optimization with provable regret bounds, then I
should be able to use that algorithm to achieve a similar regret bound while
learning in an MDPCR. Section 3.3 presented online mirror ascent, which is
an algorithm for online linear optimization with a good regret bound. The
three algorithms proposed in this project are all instances of online mirror
ascent applied to the problem of learning in MDPCRs by way of a reduction
to online linear optimization.
In all cases considered in this project, online mirror ascent cannot be
implemented exactly. The update rule of online mirror ascent has two steps:
first, we compute the unconstrained maximizer of an objective function that
combines the goals of maximizing the most recent payout vector and not
moving too far from the previous choice. Second, we project the uncon-
strained maximizer back onto the set of feasible solutions. In many cases,
the unconstrained maximizer can be computed as a simple closed-form ex-
pression but the projection step is expensive and can only be solved ap-
proximately. A natural question is: how do these approximations impact
the performance of online mirror ascent? The final result of this project
is of independent interest and provides theoretical analysis for a natural
approximate implementation of online mirror ascent.
We only provide reductions for the instructive feedback setting, where the
agent observes the entire reward function, since our algorithm for the eval-
uative feedback setting is derived from the instructive case by statistically
estimating the reward function.
The (state) occupancy measure ν(·, π) of a policy π is defined by
\[
\nu(x, \pi) = \mathbb{P}\bigl(X_\ell^\pi = x\bigr),
\]
where ℓ ∈ {1, . . . , L} is the layer index such that x ∈ X_ℓ and X_ℓ^π is the
random state from layer X_ℓ visited by the agent. In words, ν(x, π) is the
probability that an agent following policy π will visit state x in an episode.
The (state-action) occupancy measure of a policy, denoted by µ(π), is defined
by
\[
\mu(x, a, \pi) = \mathbb{P}\bigl(X_\ell^\pi = x,\ A_\ell^\pi = a\bigr) = \nu(x, \pi)\,\pi(x, a).
\]
Lemma 5.1. Let π be any Markov policy for a loop-free episodic MDPCR, and write
µ(x, a) = µ(x, a, π) and ν(x) = ν(x, π). From the definition of µ(x, a, π), for each
state x we have
\[
\sum_a \mu(x, a, \pi) = \sum_a \nu(x, \pi)\,\pi(x, a) = \nu(x, \pi).
\]
Therefore, for every state x with ν(x, π) > 0,
\[
\pi(x, a) = \frac{\mu(x, a, \pi)}{\nu(x, \pi)} = \frac{\mu(x, a, \pi)}{\sum_{a'} \mu(x, a', \pi)}.
\]
This lemma shows that, given only the state-action occupancy measure
of a policy, we can recover the policy’s action-selection probabilities in every
state that it visits with non-zero probability. It is not a serious problem that
we cannot recover the action selection probabilities in the remaining states,
since an agent following the policy will visit them with probability zero.
Therefore, since every policy has a state-action occupancy measure, and
since we can (essentially) recover a policy from any state-action occupancy
measure, we are able to represent policies by their state-action occupancy
measures. In the language of policy gradient methods, we can think of the
map given by Lemma 5.1 as a policy parameterization.
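A sketch of this map, assuming the occupancy measure is stored as a (states × actions) array, is given below; states that are visited with probability zero are assigned the uniform policy, which is one valid arbitrary choice.

import numpy as np

def policy_from_occupancy(mu, num_actions):
    """Recover action-selection probabilities from a state-action occupancy
    measure mu of shape (num_states, num_actions)."""
    nu = mu.sum(axis=1, keepdims=True)            # state occupancy nu(x)
    pi = np.full_like(mu, 1.0 / num_actions)      # arbitrary (uniform) where nu(x) = 0
    visited = nu[:, 0] > 0
    pi[visited] = mu[visited] / nu[visited]
    return pi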
Next, we want to show that the set of all state-action occupancy measures
K = {µ(π) : π ∈ Π} is a convex subset of R^d, where d = |X × A| is the number
of state-action pairs. We can think of the set of functions {f : X × A →
R} as a d-dimensional vector space by identifying functions with tables (or
vectors) of their values at each of the d state-action pairs. In this space of
functions, the natural inner product is defined by f^⊤g = Σ_{x,a} f(x, a) g(x, a).
For the rest of this chapter, we treat functions with finite domains as vectors
in finite-dimensional vector spaces together with this inner product. With
this convention, we can show that the set of occupancy measures is a convex
set.
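Concretely, if functions on X × A are stored as arrays, this inner product is just an element-wise product summed over all state-action pairs:

import numpy as np

def inner(f, g):
    # f and g are arrays of shape (num_states, num_actions); their inner
    # product sums f(x, a) * g(x, a) over all state-action pairs.
    return float(np.sum(f * g))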
Lemma 5.2. Let K = {µ(π) : π ∈ Π} ⊂ R^d be the set of occupation measures. Then
\[
K = \Bigl\{ \mu : \mathcal{X} \times \mathcal{A} \to [0, 1] \;:\; \nu(x_{\text{start}}) = 1,\ \forall x' \in \mathcal{X} : \nu(x') = \sum_{x,a} \mu(x, a)\, P(x, a, x') \Bigr\},
\]
where we used the shorthand ν(x) = Σ_a µ(x, a). Moreover, since K is
defined by a set of linear constraints, it is a convex subset of R^d.
Lemma 5.3. Let π be any policy for a loop-free episodic MDPCR and let
f : X × A → R be any function. Then
\[
\mathbb{E}_\pi\Bigl[\sum_{\ell=1}^{L} f(X_\ell^\pi, A_\ell^\pi)\Bigr] = f^\top \mu(\pi).
\]
Proof. The proof follows from the fact that, for each state x in the ℓth layer
and each action a, we have µ(x, a, π) = P(X_ℓ^π = x, A_ℓ^π = a):
\begin{align*}
\mathbb{E}_\pi\Bigl[\sum_{\ell=1}^{L} f(X_\ell^\pi, A_\ell^\pi)\Bigr]
  &= \sum_{\ell=1}^{L} \mathbb{E}\bigl[f(X_\ell^\pi, A_\ell^\pi)\bigr] \\
  &= \sum_{\ell=1}^{L} \sum_{x \in \mathcal{X}_\ell,\, a} \mathbb{P}(X_\ell^\pi = x, A_\ell^\pi = a)\, f(x, a) \\
  &= \sum_{\ell=1}^{L} \sum_{x \in \mathcal{X}_\ell,\, a} \mu(x, a, \pi)\, f(x, a) \\
  &= \sum_{x,a} \mu(x, a, \pi)\, f(x, a) \\
  &= f^\top \mu(\pi).
\end{align*}
Combining Lemmas 5.1, 5.2, and 5.3, we have the following reduction
from learning in loop-free episodic MDPCRs to online linear optimization.
Theorem 5.4. Let M = (X , A, P, (rτ )τ ∈N ) be a loop-free episodic MDPCR.
Then, the regret of any sequence of policies π1 , . . . , πT relative to the set
of Markov policies is equal to the regret of the sequence of state-action oc-
cupation measures µ(π1 ), . . . , µ(πT ) in an online linear optimization game
where K = {µ(π) : π ∈ Π} and the adversary chooses the payout vector rτ
for the τ th round to be equal to the reward function in the MDPCR for the
τ th episode.
Lemma 5.5. Let µ be the stationary state-action distribution of an unknown policy π. Set
\[
\hat{\pi}(x, a) =
\begin{cases}
\mu(x, a) / \nu(x) & \text{if } \nu(x) > 0, \\
1/|\mathcal{A}| & \text{otherwise,}
\end{cases}
\]
where ν(x) = Σ_a µ(x, a). Then π̂(x, a) = π(x, a) for all states x with non-
zero probability in the stationary distribution of π.
Lemma 5.6. Let K ⊂ R^d be the set of stationary state-action distributions of Markov policies in a uniformly ergodic MDPCR. Then
\[
K = \Bigl\{ \mu : \mathcal{X} \times \mathcal{A} \to [0, 1] \;:\; \sum_{x,a} \mu(x, a) = 1,\ \forall x' \in \mathcal{X} : \nu(x') = \sum_{x,a} \mu(x, a)\, P(x, a, x') \Bigr\},
\]
where we used the shorthand ν(x) = Σ_a µ(x, a). Moreover, since K is
defined by a set of linear constraints, it is a convex subset of R^d.
Lemma 5.7. Fix a uniformly ergodic MDPCR with mixing time τ < ∞
and suppose that the reward functions satisfy r_t(x, a) ∈ [0, 1] for all times,
states, and actions. Let B > 0 and suppose that π_1, . . . , π_T is any sequence of
policies with ‖ν(π_{t−1}) − ν(π_t)‖_1 ≤ B for all times t = 2, . . . , T. Then the
regret of the sequence of policies π_1, . . . , π_T relative to any fixed policy π
can be bounded as follows:
\[
\mathbb{E}_\pi\Bigl[\sum_{t=1}^{T} r_t(X_t, A_t)\Bigr] - \mathbb{E}_{\pi_{1:T}}\Bigl[\sum_{t=1}^{T} r_t(X_t, A_t)\Bigr]
\le \sum_{t=1}^{T} r_t^\top\bigl(\mu(\pi) - \mu(\pi_t)\bigr) + (\tau + 1) T B + 4\tau + 4.
\]
Proof. Recall that we use the notation ν(π) and µ(π) for the state and
state-action stationary distributions of the policy π, respectively. We now
introduce notation for the finite-time distributions.

Notation for following the policy π: Let ν̃_t^π(x) = P_π(X_t = x) be the
probability that an agent following policy π visits state x at time t and let
µ̃_t^π(x, a) = P_π(X_t = x, A_t = a) be the probability that she takes action a
from state x at time t. We have µ̃_t^π(x, a) = π(x, a) ν̃_t^π(x).
The following recursive expression for ν̃_t^π will be useful: since the agent
starts in the state x_start with probability one, we know that ν̃_1^π(x) = I{x = x_start}.
For each t ≥ 1, we have
\[
\tilde{\nu}_{t+1}^\pi = \tilde{\nu}_t^\pi P^\pi,
\]
where P^π denotes the state transition probability matrix induced by π.
Notation for following the sequence π_{1:T}: Similarly, let ν̃_t(x) = P_{π_{1:T}}(X_t = x)
be the probability that an agent following the sequence of policies π_1, . . . , π_T
visits state x at time t, and let µ̃_t(x, a) = P_{π_{1:T}}(X_t = x, A_t = a) be the
probability that she takes action a from state x. Again, we have
µ̃_t(x, a) = π_t(x, a) ν̃_t(x), and we can express ν̃_t recursively: ν̃_1(x) = I{x = x_start} and
\[
\tilde{\nu}_{t+1} = \tilde{\nu}_t P^{\pi_t}.
\]
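This recursion is straightforward to compute numerically. The sketch below assumes the transition kernel is stored as an array P[x, a, y] and each policy as an array of action-selection probabilities; these layouts are my own conventions.

import numpy as np

def state_distributions(P, policies, x_start):
    """Finite-time state distributions under a sequence of policies."""
    num_states = P.shape[0]
    nu = np.zeros(num_states)
    nu[x_start] = 1.0                            # nu_1 puts all mass on x_start
    out = [nu.copy()]
    for pi in policies:
        P_pi = np.einsum('xa,xay->xy', pi, P)    # state transition matrix under pi
        nu = nu @ P_pi                           # nu_{t+1} = nu_t P^{pi_t}
        out.append(nu.copy())
    return out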
With the above notation, we are ready to prove the lemma. First, we
rewrite the two expectations as sums:
\begin{align*}
\mathbb{E}_\pi\Bigl[\sum_{t=1}^{T} r_t(X_t, A_t)\Bigr]
  &= \sum_{t=1}^{T} \mathbb{E}_\pi\bigl[r_t(X_t, A_t)\bigr]
   = \sum_{t=1}^{T} \sum_{x,a} \tilde{\mu}_t^\pi(x, a)\, r_t(x, a)
   = \sum_{t=1}^{T} r_t^\top \tilde{\mu}_t^\pi.
\end{align*}
Similarly,
\begin{align*}
\mathbb{E}_{\pi_{1:T}}\Bigl[\sum_{t=1}^{T} r_t(X_t, A_t)\Bigr]
  &= \sum_{t=1}^{T} \mathbb{E}_{\pi_{1:T}}\bigl[r_t(X_t, A_t)\bigr]
   = \sum_{t=1}^{T} \sum_{x,a} \tilde{\mu}_t(x, a)\, r_t(x, a)
   = \sum_{t=1}^{T} r_t^\top \tilde{\mu}_t.
\end{align*}
With this, the expected regret of the policies π_1, . . . , π_T relative to the fixed
policy π can be written as
\[
\mathbb{E}_\pi\Bigl[\sum_{t=1}^{T} r_t(X_t, A_t)\Bigr] - \mathbb{E}_{\pi_{1:T}}\Bigl[\sum_{t=1}^{T} r_t(X_t, A_t)\Bigr]
 = \sum_{t=1}^{T} r_t^\top\bigl(\tilde{\mu}_t^\pi - \tilde{\mu}_t\bigr).
\]
We can add and subtract the stationary distributions of π and πt into the
tth term of the sum above to obtain the following decomposition:
\begin{align}
\sum_{t=1}^{T} r_t^\top(\tilde{\mu}_t^\pi - \tilde{\mu}_t)
 = \sum_{t=1}^{T} r_t^\top\bigl(\tilde{\mu}_t^\pi - \mu(\pi)\bigr)
 + \sum_{t=1}^{T} r_t^\top\bigl(\mu(\pi) - \mu(\pi_t)\bigr)
 + \sum_{t=1}^{T} r_t^\top\bigl(\mu(\pi_t) - \tilde{\mu}_t\bigr).
\tag{5.1}
\end{align}
The middle term of the above decomposition appears in the bound from
the statement of the lemma, so it remains to upper bound the first and last
term by (τ + 1)BT + 4τ + 4.
To bound the first term, we use the following lemma from [NGSA2014]:
\begin{align}
\sum_{t=1}^{T} r_t^\top\bigl(\tilde{\mu}_t^\pi - \mu(\pi)\bigr) \le 2\tau + 2.
\tag{5.2}
\end{align}
To bound the last term, we use the following bounds on the distance between
the finite-time and stationary distributions.

Lemma 5.9. For every t = 1, . . . , T,
\[
\bigl\|\tilde{\nu}_t - \nu(\pi_t)\bigr\|_1 \le 2 e^{-(t-1)/\tau} + B \sum_{s=0}^{t-2} e^{-s/\tau} \le 2 e^{-(t-1)/\tau} + B(\tau + 1),
\]
and
\[
\bigl\|\tilde{\mu}_t - \mu(\pi_t)\bigr\|_1 \le 2 e^{-(t-1)/\tau} + B(\tau + 1).
\]
Proof. We prove the first inequality by induction on t. Since ν̃_1 and ν(π_1)
are both probability distributions, ‖ν̃_1 − ν(π_1)‖_1 ≤ 2 = 2e^{-(1-1)/τ}, so the
claim holds for t = 1. Now suppose that the claim holds for t. Then we have
\begin{align*}
\bigl\|\tilde{\nu}_{t+1} - \nu(\pi_{t+1})\bigr\|_1
 &\le \bigl\|\tilde{\nu}_{t+1} - \nu(\pi_t)\bigr\|_1 + \bigl\|\nu(\pi_t) - \nu(\pi_{t+1})\bigr\|_1 && \text{(Triangle Inequality)} \\
 &\le \bigl\|\tilde{\nu}_t P^{\pi_t} - \nu(\pi_t) P^{\pi_t}\bigr\|_1 + B && \text{(Stationarity of } \nu(\pi_t)\text{)} \\
 &\le e^{-1/\tau} \bigl\|\tilde{\nu}_t - \nu(\pi_t)\bigr\|_1 + B && \text{(Uniformly Ergodic)} \\
 &\le e^{-1/\tau}\Bigl(2 e^{-(t-1)/\tau} + B \sum_{s=0}^{t-2} e^{-s/\tau}\Bigr) + B && \text{(Induction Hypothesis)} \\
 &= 2 e^{-((t+1)-1)/\tau} + B \sum_{s=0}^{(t+1)-2} e^{-s/\tau}.
\end{align*}
It follows that the first inequality holds for all times t. The second inequality
follows from the first, together with the fact that
\[
\sum_{s=0}^{t-2} e^{-s/\tau} \le 1 + \int_0^\infty e^{-s/\tau}\, ds = 1 + \tau.
\]
Finally, since µ̃_t(x, a) = π_t(x, a) ν̃_t(x) and µ(x, a, π_t) = π_t(x, a) ν(x, π_t),
\[
\bigl\|\tilde{\mu}_t - \mu(\pi_t)\bigr\|_1
 = \sum_{x,a} \pi_t(x, a)\,\bigl|\tilde{\nu}_t(x) - \nu(x, \pi_t)\bigr|
 = \bigl\|\tilde{\nu}_t - \nu(\pi_t)\bigr\|_1
 \le 2 e^{-(t-1)/\tau} + B(\tau + 1).
\]
We are finally ready to bound the sum Σ_{t=1}^T r_t^⊤(µ̃_t − µ(π_t)):
\begin{align}
\sum_{t=1}^{T} r_t^\top\bigl(\tilde{\mu}_t - \mu(\pi_t)\bigr)
 &\le \sum_{t=1}^{T} \|r_t\|_\infty \bigl\|\tilde{\mu}_t - \mu(\pi_t)\bigr\|_1 && \text{(H\"older's Inequality)} \notag\\
 &\le \sum_{t=1}^{T} \Bigl(2 e^{-(t-1)/\tau} + B(\tau + 1)\Bigr) && \text{(Lemma 5.9, } \|r_t\|_\infty \le 1\text{)} \notag\\
 &= (\tau + 1) B T + 2 \sum_{t=1}^{T} e^{-(t-1)/\tau} \notag\\
 &\le (\tau + 1) B T + 2\tau + 2, \tag{5.3}
\end{align}
where in the last line we again used Σ_{t=1}^T e^{-(t-1)/τ} ≤ τ + 1.
Substituting inequalities (5.2) and (5.3) into the regret decomposition
(5.1) proves the lemma.
Combining Lemmas 5.5, 5.6 and 5.7 gives the following reduction from
learning in uniformly ergodic MDPCRs to online linear optimization.
Theorem 5.10. Fix a uniformly ergodic MDPCR with mixing time τ < ∞
and bounded rewards: rt (x, a) ∈ [0, 1] for all states, actions, and times. Con-
sider the online linear optimization problem where K is the set of stationary
distributions over state-action pairs and the environment chooses the payout
vector in round t to be equal to the tth MDPCR reward function. Suppose
an agent for the online linear optimization game chooses the stationary dis-
tributions µ(π_1), . . . , µ(π_T), and that ‖ν(π_{t−1}) − ν(π_t)‖_1 ≤ B for all
t = 2, . . . , T. Then we can essentially recover the policies π_1, . . . , π_T, and
if an agent follows those policies in the MDPCR, then her regret relative to
any fixed policy π is bounded by
\[
\mathbb{E}_\pi\Bigl[\sum_{t=1}^{T} r_t(X_t, A_t)\Bigr] - \mathbb{E}_{\pi_{1:T}}\Bigl[\sum_{t=1}^{T} r_t(X_t, A_t)\Bigr]
\le \sum_{t=1}^{T} r_t^\top\bigl(\mu(\pi) - \mu(\pi_t)\bigr) + (\tau + 1) T B + 4\tau + 4.
\]
This reduction shows that if we can achieve low regret in an online
linear optimization problem, and if the sequence of choices doesn't change too
quickly, then we can also achieve low regret in a uniformly ergodic MDPCR.
Input: Step size η > 0, regularizer R : S → R with S ⊃ K, and a
black-box P_K that computes c-approximate projections onto K
1 Choose w_1 ∈ K arbitrarily;
2 for each round t = 1, 2, . . . do
3     Optionally use w_t in some other computation;
4     Set w_{t+1/2} = argmax_{u∈S} η r_t^⊤ u − D_R(u, w_t);
5     Set w_{t+1} = P_K(w_{t+1/2});
6 end
Algorithm 8: Online Mirror Ascent with c-approximate Projections
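To make the two-step update concrete, here is a minimal Python sketch of Algorithm 8, specialized to the unnormalized negentropy regularizer used by the algorithms in this chapter, for which the unconstrained step has a closed form. The projection routine is deliberately left as a black box, since its implementation depends on the feasible set K.

import numpy as np

def mirror_ascent_entropic(rewards, w1, project, eta):
    """Online mirror ascent with approximate projections, specialized to the
    unnormalized negentropy regularizer (unconstrained step: w * exp(eta * r)).

    project(w) is the c-approximate projection black box onto K; its
    implementation is problem specific and left abstract here."""
    w = w1.copy()
    iterates = [w.copy()]
    for r_t in rewards:                       # r_t is revealed after playing w
        w_half = w * np.exp(eta * r_t)        # unconstrained maximizer
        w = project(w_half)                   # approximate Bregman projection
        iterates.append(w.copy())
    return iterates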
Theorem 5.11. Suppose that ∇R is L-Lipschitz continuous with respect to ‖·‖,
that the diameter of K with respect to the dual norm ‖·‖_* is at most D, and
that the black-box P_K returns c-approximate projections, in the sense that
‖P_K(u) − argmin_{v∈K} D_R(v, u)‖ ≤ c for all u. Then, for any w ∈ K, the points
w_1, . . . , w_T produced by Algorithm 8 satisfy
\[
\sum_{t=1}^{T} r_t^\top(w - w_t) \le \frac{D_R(w, w_1)}{\eta} + \frac{c L D T}{\eta} + \sum_{t=1}^{T} r_t^\top\bigl(w_{t+1/2} - w_t\bigr).
\]
Proof. This roughly follows the proof of Theorem 15.4 from [GPS2014] with
the appropriate modifications to handle the case where the projections are
only c-approximate.
Let w_1, . . . , w_T ∈ K be the sequence of points generated by online
mirror ascent with c-approximate projections, let w_{t+1/2} be the unprojected
updates for t = 1, . . . , T, and, finally, let w_{t+1}^* = argmin_{u∈K} D_R(u, w_{t+1/2})
be the exact projection of w_{t+1/2} onto the set K.

For each t = 1, . . . , T, we know that w_{t+1/2} is the unconstrained max-
imizer of the objective function J_t(u) = η r_t^⊤ u − D_R(u, w_t) and therefore
∇J_t(w_{t+1/2}) = 0. We can compute the gradient of J_t to be
\[
r_t = \frac{1}{\eta}\bigl(\nabla R(w_{t+1/2}) - \nabla R(w_t)\bigr).
\]
Therefore, for any w ∈ K,
\begin{align*}
r_t^\top(w - w_t)
 &= \frac{1}{\eta}\bigl(\nabla R(w_{t+1/2}) - \nabla R(w_t)\bigr)^\top (w - w_t) \\
 &= \frac{1}{\eta}\Bigl(D_R(w, w_t) - D_R(w, w_{t+1/2}) + D_R(w_t, w_{t+1/2})\Bigr),
\end{align*}
where the second equality follows from expanding the Bregman divergences.
Since w_{t+1}^* is the Bregman projection of w_{t+1/2} onto the convex set K and
w ∈ K, the generalized Pythagorean inequality gives
\[
D_R(w, w_{t+1/2}) \ge D_R(w, w_{t+1}^*) + D_R(w_{t+1}^*, w_{t+1/2}) \ge D_R(w, w_{t+1}^*).
\]
Combining the last two displays, we obtain
\begin{align*}
r_t^\top(w - w_t)
 &\le \frac{1}{\eta}\Bigl(D_R(w, w_t) - D_R(w, w_{t+1}^*) + D_R(w_t, w_{t+1/2})\Bigr) \\
 &= \frac{1}{\eta}\Bigl(D_R(w, w_t) - D_R(w, w_{t+1}) + D_R(w, w_{t+1}) - D_R(w, w_{t+1}^*)
   + D_R(w_t, w_{t+1/2})\Bigr).
\end{align*}
Summing from t = 1 to T, the first two terms of the above expression
telescope, leaving
\begin{align}
\sum_{t=1}^{T} r_t^\top(w - w_t)
 &\le \frac{D_R(w, w_1)}{\eta} - \frac{D_R(w, w_{T+1})}{\eta}
   + \frac{1}{\eta}\sum_{t=1}^{T} D_R(w_t, w_{t+1/2})
   + \frac{1}{\eta}\sum_{t=1}^{T}\Bigl(D_R(w, w_{t+1}) - D_R(w, w_{t+1}^*)\Bigr) \notag\\
 &\le \frac{D_R(w, w_1)}{\eta}
   + \frac{1}{\eta}\sum_{t=1}^{T} D_R(w_t, w_{t+1/2})
   + \frac{1}{\eta}\sum_{t=1}^{T}\Bigl(D_R(w, w_{t+1}) - D_R(w, w_{t+1}^*)\Bigr). \tag{5.4}
\end{align}
Next, using the convexity of R and the first-order optimality condition for
w_{t+1/2}, we can bound the second sum:
\[
\frac{1}{\eta}\sum_{t=1}^{T} D_R(w_t, w_{t+1/2}) \le \sum_{t=1}^{T} r_t^\top\bigl(w_{t+1/2} - w_t\bigr).
\]
Finally, we bound the terms of the form D_R(w, w_t) − D_R(w, w_t^*), where w_t^*
denotes the exact projection and w_t the c-approximate projection. By the
convexity of R,
\[
R(w_t^*) - R(w_t) \le \nabla R(w_t^*)^\top (w_t^* - w_t).
\]
Expanding D_R(w, w_t) − D_R(w, w_t^*) and using the above inequality gives
\begin{align*}
D_R(w, w_t) - D_R(w, w_t^*)
 &= R(w_t^*) - R(w_t) + \nabla R(w_t^*)^\top (w - w_t^*) - \nabla R(w_t)^\top (w - w_t) \\
 &\le \nabla R(w_t^*)^\top (w - w_t) - \nabla R(w_t)^\top (w - w_t) \\
 &= \bigl(\nabla R(w_t^*) - \nabla R(w_t)\bigr)^\top (w - w_t) \\
 &\le \bigl\|\nabla R(w_t^*) - \nabla R(w_t)\bigr\|\, \|w - w_t\|_* \\
 &\le L\, \|w_t^* - w_t\|\, D \\
 &\le c L D.
\end{align*}
Summing over t, we have
\[
\frac{1}{\eta}\sum_{t=1}^{T}\Bigl(D_R(w, w_t) - D_R(w, w_t^*)\Bigr) \le \frac{c L D T}{\eta}.
\]
Substituting these bounds into (5.4) proves the theorem.
When the regularizer R is σ-strongly convex with respect to ‖·‖, we can use
the following lemma to bound the sum Σ_t r_t^⊤(w_{t+1/2} − w_t) in Theorem 5.11.

Lemma 5.12. Suppose that R is σ-strongly convex with respect to ‖·‖ on S;
that is, for all u, v ∈ S,
\[
R(u) \ge R(v) + \nabla R(v)^\top (u - v) + \frac{\sigma}{2}\|u - v\|^2.
\]
Then, for every t,
\[
r_t^\top\bigl(w_{t+1/2} - w_t\bigr) \le \frac{\eta}{\sigma}\, \|r_t\|_*^2.
\]

Proof. Summing two applications of the strong convexity inequality, one with u = w_t
and v = w_{t+1/2}, and one with u = w_{t+1/2} and v = w_t, gives
\begin{align*}
\bigl\|w_t - w_{t+1/2}\bigr\|^2
 &\le \frac{1}{\sigma}\bigl(\nabla R(w_{t+1/2}) - \nabla R(w_t)\bigr)^\top \bigl(w_{t+1/2} - w_t\bigr) \\
 &\le \frac{1}{\sigma}\bigl\|\nabla R(w_{t+1/2}) - \nabla R(w_t)\bigr\|_*\, \bigl\|w_t - w_{t+1/2}\bigr\| \\
 &= \frac{\eta}{\sigma}\, \|r_t\|_*\, \bigl\|w_t - w_{t+1/2}\bigr\|.
\end{align*}
Dividing both sides by ‖w_t − w_{t+1/2}‖ shows that ‖w_t − w_{t+1/2}‖ ≤ (η/σ)‖r_t‖_*.
Therefore,
\[
r_t^\top\bigl(w_{t+1/2} - w_t\bigr) \le \|r_t\|_*\, \bigl\|w_{t+1/2} - w_t\bigr\| \le \frac{\eta}{\sigma}\, \|r_t\|_*^2,
\]
which proves the lemma.
Our algorithms use online mirror ascent with the unnormalized negentropy
regularizer J, defined by
\[
J(w) = \sum_{x,a} \Bigl( w(x, a) \ln\bigl(w(x, a)\bigr) - w(x, a) \Bigr).
\]
The corresponding Bregman divergence is
\[
D_J(u, w) = \sum_{x,a} \Bigl( u(x, a) \ln \frac{u(x, a)}{w(x, a)} + w(x, a) - u(x, a) \Bigr),
\]
and the unconstrained step of online mirror ascent becomes
\[
w_{t+1/2} = \operatorname*{argmax}_{u \in (0, \infty)^d}\; \eta r_t^\top u - \sum_{x,a} \Bigl( u(x, a) \ln \frac{u(x, a)}{w_t(x, a)} + w_t(x, a) - u(x, a) \Bigr).
\]
Setting the gradient of this objective to zero shows that the maximizer has the
simple closed form w_{t+1/2}(x, a) = w_t(x, a) e^{η r_t(x, a)}.
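To make the closed form explicit, differentiating the objective coordinate-wise and setting the derivative to zero gives
\[
\frac{\partial}{\partial u(x, a)}\Bigl[\eta r_t^\top u - D_J(u, w_t)\Bigr]
 = \eta r_t(x, a) - \ln\frac{u(x, a)}{w_t(x, a)} = 0
 \quad\Longrightarrow\quad
 u(x, a) = w_t(x, a)\, e^{\eta r_t(x, a)},
\]
which is exactly the multiplicative update that appears in line 6 of Algorithm 11 below.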
Lemma 5.13 (Example 2.5 from [S2012]). The function
\[
R(w) = \sum_{i=1}^{d} \bigl( w(i) \ln\bigl(w(i)\bigr) - w(i) \bigr)
\]
is (1/B)-strongly convex with respect to ‖·‖_1 over the set {w ∈ (0, ∞)^d : ‖w‖_1 ≤ B}.
There are two problems with using the unnormalized negentropy regu-
larizer, both coming from the fact that ∇J is not Lipschitz continuous
on the set of occupancy measures or the set of stationary distributions. To
see this, note that the partial derivatives of J are given by
\[
\frac{\partial}{\partial w(x, a)} J(w) = \ln\bigl(w(x, a)\bigr),
\]
The remainder of this chapter presents the three learning algorithms and corresponding regret bounds.
Theorem 5.14. Let M be a loop-free episodic MDPCR, π ∈ Π be any
Markov policy, and π_1, . . . , π_T be the sequence of policies produced by Algo-
rithm 9 with parameters δ ∈ (0, 1], c = βδη/√T, and η = √(D_max/(LT)), where L is the
number of layers in the MDPCR and D_max ≥ sup_{µ∈K_{δβ}} D_J(µ, µ(π_1)). Then,
the regret of an agent that follows the sequence of policies π_{1:T} relative to
the fixed policy π is bounded as follows:
\[
\mathbb{E}_\pi\Bigl[\sum_{t=1}^{T} r_t(X_t, A_t)\Bigr] - \mathbb{E}_{\pi_{1:T}}\Bigl[\sum_{t=1}^{T} r_t(X_t, A_t)\Bigr]
\le 2\sqrt{L T D_{\max}} + \sqrt{T} + L\delta T.
\]
Proof. First, we show that the agent does not incur too much additional
regret by choosing policies from the set K_{δβ} rather than the larger set K.
For any occupancy measure µ ∈ K, consider the mixed measure µ_δ = (1 −
δ)µ + δµ(π_exp). First, µ_δ(x, a) = (1 − δ)µ(x, a) + δµ(x, a, π_exp) ≥ δβ, and
therefore µ_δ ∈ K_{δβ}. Second, for any payout vector r with r(x, a) ∈ [0, 1],
we have
\[
\bigl| r^\top(\mu - \mu_\delta) \bigr| = \delta \bigl| r^\top\bigl(\mu - \mu(\pi_{\text{exp}})\bigr) \bigr| \le \delta L,
\]
since both r^⊤µ and r^⊤µ(π_exp) lie in [0, L]. Therefore, for any occupancy
measure µ ∈ K, there is an occupancy measure in K_{δβ} that earns nearly as
much reward. This implies that having a good regret bound relative to any
point in K_{δβ} gives us a regret bound relative to any point in K.
Next, we show that the regularizer J is 1/L-strongly convex with respect
to ‖·‖_1 on the set K. Since each occupancy measure µ is a distribution when
restricted to the states and actions in a single layer X_ℓ × A, we have
\[
\|\mu\|_1 = \sum_{x,a} |\mu(x, a)| = \sum_{\ell=1}^{L} \sum_{x \in \mathcal{X}_\ell,\, a} |\mu(x, a)| = \sum_{\ell=1}^{L} 1 = L,
\]
so Lemma 5.13 applied with B = L gives the claimed strong convexity.
Finally, we show that ∇J is Lipschitz continuous on K_{δβ} with respect
to ‖·‖_1. Let w ∈ K_{δβ} be any occupancy measure and consider indices
i, j ∈ {1, . . . , d} (in this proof, it is more convenient to use integer indices
rather than pairs of states and actions). Then the partial derivatives of J(w) are
\[
\frac{\partial}{\partial w(i)} J(w) = \ln\bigl(w(i)\bigr),
\qquad
\frac{\partial^2}{\partial w(i)\, \partial w(j)} J(w) = \frac{\mathbb{I}\{i = j\}}{w(i)}.
\]
Since every w ∈ K_{δβ} satisfies w(i) ≥ δβ, the second derivatives are bounded
by 1/(δβ), and therefore ∇J is Lipschitz continuous on K_{δβ}.
Combining Theorem 5.11 and Lemma 5.12 on the set K_{δβ}, for any µ_δ ∈ K_{δβ} we have
\begin{align*}
\sum_{\tau=1}^{T} r_\tau^\top\bigl(\mu_\delta - \mu(\pi_\tau)\bigr)
 &\le \frac{D_J(\mu_\delta, \mu(\pi_1))}{\eta} + \frac{c L D T}{\eta} + L\eta \sum_{\tau=1}^{T} \|r_\tau\|_\infty^2 \\
 &\le \frac{D_J(\mu_\delta, \mu(\pi_1))}{\eta} + \sqrt{T} + \eta L T.
\end{align*}
\begin{align*}
\sum_{\tau=1}^{T} r_\tau^\top\bigl(\mu - \mu(\pi_\tau)\bigr)
 &= \sum_{\tau=1}^{T} r_\tau^\top(\mu - \mu_\delta) + \sum_{\tau=1}^{T} r_\tau^\top\bigl(\mu_\delta - \mu(\pi_\tau)\bigr) \\
 &\le \frac{D_{\max}}{\eta} + \sqrt{T} + \eta L T + L T \delta \\
 &= 2\sqrt{T D_{\max} L} + \sqrt{T} + L T \delta.
\end{align*}
By Theorem 5.4, the same regret bound holds for the sequence of poli-
cies in the MDPCR.
Note that D_max = Θ(L ln(1/π_0)), where π_0 = min_{(x,a)} π_exp(x, a) (notice that
π_exp(x, a) ≥ β, since µ(x, a, π_exp) ≥ β). If, for example, π_exp(x, ·) is selected
to be the uniform distribution over A, then β > 0 and π_0 = 1/|A|, making
the regret scale with O(L√(T ln(|A|))) when δ = 1/√T. Also, this makes
the computational cost Õ(d^{3.5} T^{1/4}/√β), where Õ hides log-factors. Neu
et al. [NGS2010] gave an algorithm that achieves O(L²√(T ln(|A|))) regret
with O(d) computational complexity per time-step. Thus, our regret bound
scales better in the problem parameters than that of Neu et al. [NGS2010],
at the price of increasing the computational complexity. It is an interesting
(and probably challenging) problem to achieve the best of the two results.
In the evaluative feedback setting, the agent observes only the rewards of the
state-action pairs that she visits. After episode τ, we therefore form the
importance-weighted estimate
\[
\hat{r}_\tau(x, a) = \frac{r_\tau(x, a)}{\mu(x, a, \pi_\tau)}\, \mathbb{I}\{X_\ell = x,\ A_\ell = a\},
\]
where ℓ is the layer containing x, and perform the update with this estimate
of the complete reward function. Since r̂_τ is only non-zero for state-action
pairs visited by the agent in episode τ, it is known to the agent. This estimate
is only well defined if µ(x, a, π_τ) > 0 for all states and actions, but since we
restrict our algorithm to the set of occupancy measures which are lower
bounded, this will always be the case for policies chosen by the algorithm.
This particular reward approximation will be justified in the proof of
Theorem 5.15.
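A sketch of this estimator is given below; the data layout (a list of visited (state, action, reward) triples and an occupancy-measure array) is my own convention for illustration.

import numpy as np

def estimate_rewards(observed, mu, num_states, num_actions):
    """Importance-weighted estimate of the full reward function from one
    episode of bandit feedback.

    observed is a list of (state, action, reward) triples visited in the
    episode and mu[x, a] is the occupancy measure of the current policy."""
    r_hat = np.zeros((num_states, num_actions))
    for x, a, r in observed:
        # well defined because mu is bounded below on the restricted set
        r_hat[x, a] = r / mu[x, a]
    return r_hat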
As in the previous section, fix a loop-free episodic MDPCR M with
layers X_1, . . . , X_L and r_τ(x, a) ∈ [0, 1] for all states, actions, and episode
indices. Let d = |X × A| be the number of state-action pairs and let K ⊂
R^d be the convex set of occupancy measures described by Lemma 5.2.
Finally, let β > 0 be such that there exists some exploration policy π_exp
with µ(x, a, π_exp) > β for all states x and actions a.
Theorem 5.15. Let M be a loop-free episodic MDPCR, π ∈ Π be any Markov
policy, and π_1, . . . , π_T be the sequence of policies produced by the evaluative-
feedback version of Algorithm 9, run with the reward estimates r̂_τ and parameters
δ ∈ (0, 1], c = βδη/√T, and η = √(D_max/(Td)). Then
\[
\sum_{\tau=1}^{T}\Bigl( \mathbb{E}_\pi\Bigl[\sum_{t=1}^{L} r_\tau(X_t, A_t)\Bigr]
 - \mathbb{E}_{\pi_\tau}\Bigl[\sum_{t=1}^{L} r_\tau(X_t, A_t)\Bigr] \Bigr)
 \le 2\sqrt{d T D_{\max}} + \sqrt{T} + L\delta T.
\]
Proof. The proposed algorithm for the evaluative feedback setting is identi-
cal to the instructive feedback setting, with the exception that we estimate
the reward function. As in the proof of Theorem 5.14, for any µ ∈ K, let
µ_δ = (1 − δ)µ + δµ(π_exp) be the mixture of µ with µ(π_exp). Then we have
\begin{align*}
\sum_{\tau=1}^{T} \hat{r}_\tau^\top\bigl(\mu - \mu(\pi_\tau)\bigr)
 &= \sum_{\tau=1}^{T} \hat{r}_\tau^\top(\mu - \mu_\delta) + \sum_{\tau=1}^{T} \hat{r}_\tau^\top\bigl(\mu_\delta - \mu(\pi_\tau)\bigr) \\
 &\le \frac{D_{\max}}{\eta} + \sqrt{T} + L\delta T + \sum_{\tau=1}^{T} \hat{r}_\tau^\top\bigl(\mu_{\tau+1/2} - \mu(\pi_\tau)\bigr).
\end{align*}
Now, let F_τ denote the sigma algebra generated by the first τ − 1 episodes.
For each state x and action a, we have
\begin{align*}
\mathbb{E}\bigl[\hat{r}_\tau(x, a) \mid \mathcal{F}_\tau\bigr]
 &= \mathbb{E}\Bigl[\frac{r_\tau(x, a)}{\mu(x, a, \pi_\tau)}\, \mathbb{I}\{X_\ell = x,\ A_\ell = a\} \,\Big|\, \mathcal{F}_\tau\Bigr] \\
 &= \frac{r_\tau(x, a)}{\mu(x, a, \pi_\tau)}\, \mathbb{E}\bigl[\mathbb{I}\{X_\ell = x,\ A_\ell = a\} \mid \mathcal{F}_\tau\bigr] \\
 &= r_\tau(x, a).
\end{align*}
Therefore, we can apply Lemma 5.16 to get
\[
\sum_{\tau=1}^{T} \mathbb{E}\bigl[\hat{r}_\tau^\top\bigl(\mu - \mu(\pi_\tau)\bigr)\bigr]
 \le \eta T d + \frac{D_{\max}}{\eta} + \sqrt{T} + L\delta T.
\]
Setting η = √(D_max/(Td)) gives the optimal bound of
\[
\sum_{\tau=1}^{T} \mathbb{E}\bigl[\hat{r}_\tau^\top\bigl(\mu - \mu(\pi_\tau)\bigr)\bigr]
 \le 2\sqrt{T d D_{\max}} + \sqrt{T} + L\delta T.
\]
As far as the dependence on τ is concerned, by choosing δ = 1/√T we
can thus improve the previous state-of-the-art bound of Neu et al. [NGSA2014],
which scales as O(τ^{3/2}√(T ln(|A|))), to O(√(τT ln|A|)). The update cost of the
algorithm of Neu et al. [NGSA2014] is O(|X|³ + |X|²|A|), while here the
cost of our algorithm is Õ(T^{1/4} d^{3.5}/√β).
Input: Step size η > 0, exploration constant δ ∈ (0, 1],
approximation constant c > 0
1 Choose µ_1 ∈ K_{δβ} arbitrarily;
2 for each time index t = 1, . . . , T do
3     Receive state X_t from the environment;
4     Sample action A_t from π_t, obtained from µ_t according to Lemma 5.5;
5     Receive the complete reward function r_t from the environment;
6     Set µ_{t+1/2}(x, a) = µ_t(x, a) exp(η r_t(x, a)) for each state x and action a;
7     Set µ_{t+1} = P_{K_{δβ}}(µ_{t+1/2}), where P_{K_{δβ}} is a black-box that
      computes c-approximate projections onto K_{δβ} with respect to ‖·‖_1;
8 end
Algorithm 11: Approximate Online Mirror Ascent for Uniformly Er-
godic MDPCRs Under Instructive Feedback
Fix a uniformly ergodic MDPCR with mixing time τ < ∞, let π be any
Markov policy, and let π_1, . . . , π_T be the sequence of policies produced by
Algorithm 11 with parameters δ ∈ (0, 1], η = √(D_max/(T(2τ + 3))), and
c = βδη/√T. Then the regret of the agent that follows policy π_t at time t
relative to policy π can be bounded as
\[
\mathbb{E}_\pi\Bigl[\sum_{t=1}^{T} r_t(X_t, A_t)\Bigr] - \mathbb{E}_{\pi_{1:T}}\Bigl[\sum_{t=1}^{T} r_t(X_t, A_t)\Bigr]
 \le 2\sqrt{(2\tau + 3) T D_{\max}} + \sqrt{T} + \delta T + 4\tau + 4.
\]
Proof. This proof is essentially identical to the proof of Theorem 5.14 and
has been omitted.
the minimum volume ellipsoid containing K, and this problem is in general
NP-hard even when considering a constant factor approximation [N2007].
Selecting π_exp(x, ·) to be the uniform distribution, β > 0 and D_max ≤
L ln(|A|), results in an O(√(dLT ln(|A|))) bound on the regret for δ = 1/√T,
while the time-complexity of the algorithm is still O(d^{3.5} T^{1/4}/√β) as in the
full-information case. Neu et al. considered the same problem under the
assumption that any policy π visits any state with probability at least α for
some α > 0, that is, inf_π Σ_a µ(x, a, π) ≥ α > 0. They provide an algorithm
with O(d) per-round complexity whose regret is O(L²√(T|A| ln(|A|)/α)).
Compared to their result, we managed to lift the assumption α > 0, and
also improved the dependence on the size of the MDP, while paying a price
in terms of increased computational complexity.
Chapter 6
Conclusion
This thesis documents the two projects that I worked on during my MSc
program. Both projects contribute to the goal of building computer systems
that are capable of learning for themselves to solve problems and succeed at
tasks. Each project focuses on specific mathematical questions related to a
formal learning problem.
The first project addresses the question of which baseline function to
use in policy gradient reinforcement learning methods for Markov decision
processes. The baseline function’s role is to alter the performance gradi-
ent estimate used internally by policy gradient methods. I show that if the
formal learning objective is a concave function of the agent’s policy param-
eters, then the regret of a policy gradient method can be upper bounded
by a quantity that only depends on the baseline function only through the
second moment of the gradient estimates. This suggests that the baseline
function should be chosen to minimize the second moment of the gradi-
ent estimates, which I show to be equivalent to the more intuitive notion
of minimizing the mean squared error of the gradient estimates. I derive
closed form expressions for this baseline in terms of the MDP transition
probability kernel, the agent’s policy, and the agent’s policy parameteriza-
tion. Since the MDP transition probability kernel is unknown to the agent,
I also propose two algorithms for estimating this baseline while interacting
with the environment. Finally, I present a preliminary empirical comparison
of the always-zero baseline, the value function baseline, and my proposed
baseline. This comparison demonstrates a statistically significant increase
in performance when using my proposed baseline, as long as we are careful
to initialize our estimates reasonably accurately.
The goal of the second project is to design new learning algorithms for
MDPCRs. The main difference between MDPCRs and standard MDPs is
that, in the former, the environment chooses the sequence of reward func-
tions in an adversarial manner. This difference makes it easier to model
some real-world problems as MDPCRs, especially those with non-stationary
dynamics. I propose three new algorithms, all based on an approximate ver-
sion of online mirror ascent: one for learning in loop-free MDPCRs under
instructive feedback, one for learning in loop-free MDPCRs under evalua-
tive feedback, and one for learning in uniformly ergodic MDPCRs under
instructive feedback. Each of these algorithms has regret bounds that either
improve or complement the regret bounds of existing algorithms, and which
often hold even under weaker assumptions on the environment. In the de-
velopment of these algorithms, it was necessary to analyze an approximate
version of online mirror ascent, where the projection step is only computed
approximately. To my knowledge, this is the first rigorous analysis of this
approximation to online mirror ascent, despite the fact that the projection
step can often only be approximated.
Both projects provide sound, theoretically justified answers to important
questions in the fields of reinforcement learning and online learning.
Bibliography
[DGS2014] Dick, T., György, A., & Szepesvári, Cs., “Online learning in
Markov Decision Processes with Changing Cost Sequences” in Proceed-
ings of The 31st International Conference on Machine Learning (2014):
512–520.
[GPS2014] György, A., Pál, D., & Szepesvári, Cs. “Online learn-
ing: Algorithms for big data” (2014): https://www.dropbox.com/s/
bd38n4cuyxslh1e/online-learning-book.pdf.
[NGSA2014] Neu, G., György, A., Szepesvári, Cs., & Antos, A. “Online
Markov decision processes under bandit feedback” in IEEE Transactions
on Automatic Control, Vol. 59, No. 3 (February 2014): 676–691.
[NGS2010] Neu, G., György, A., & Szepesvári, Cs. “The online loop-free
stochastic shortest-path problem” in Proceedings of the 23rd Annual
Conference on Learning Theory (June 2010): 231–243.
[YMS2009] Yu, J. Y., Mannor, S., & Shimkin, N. “Markov decision pro-
cesses with arbitrary reward processes” in Mathematics of Operations
Research, Vol. 34, No. 3 (2009): 737–757.