

Hoeffding and Bernstein Races for Selecting Policies
in Evolutionary Direct Policy Search

Verena Heidrich-Meisner [email protected]


Christian Igel [email protected]
Institut für Neuroinformatik, Ruhr-Universität Bochum, 44780 Bochum, Germany

Abstract

Uncertainty arises in reinforcement learning from various sources, and therefore it is necessary to consider statistics based on several roll-outs for evaluating behavioral policies. We add an adaptive uncertainty handling based on Hoeffding and empirical Bernstein races to the CMA-ES, a variable metric evolution strategy proposed for direct policy search. The uncertainty handling adjusts individually the number of episodes considered for the evaluation of a policy. The performance estimation is kept just accurate enough for a sufficiently good ranking of candidate policies, which is in turn sufficient for the CMA-ES to find better solutions. This increases the learning speed as well as the robustness of the algorithm.

1. Introduction

Dealing with uncertainty is one of the major issues in reinforcement learning (RL). When solving (partially observable) Markov decision processes solely based on observations and interactions with the environment, uncertainty and randomness arise from several sources. The initial state usually varies, state transitions and reward signals can be stochastic, and the state observations may be noisy. We consider RL methods that search in a parametrized policy space, in which the search direction can be determined using estimates of the performance of behavioral policies or estimates of performance gradients. Uncertainty and randomness require that these estimates are based on a sample of several episodes (roll-outs). The sample size is a crucial parameter. If too few episodes are considered, the estimates are not reliable enough to allow for learning. If too many episodes are considered, the learning process gets too slow. Unfortunately, it is usually not possible to determine an appropriate sample size for a given problem a priori (in practice we just make it "large enough"). The optimal number may vary in the course of learning and between candidate policies (e.g., really bad behaviors can be detected after just a few roll-outs).

We employ the covariance matrix adaptation evolution strategy (CMA-ES, Hansen et al., 2003; Suttorp et al., 2009) for direct policy search, which gives striking results on RL benchmark problems (Gomez et al., 2008; Heidrich-Meisner & Igel, 2008). The CMA-ES adapts the policy as well as parameters of its own search strategy (such as a variable metric) based on ranking policies. This is already much less susceptible to noise than estimating absolute performances or performance gradients (Heidrich-Meisner & Igel, 2008). Still, the ranking must be sufficiently accurate to evolve better policies, and the accuracy of the ranking depends on the degree of uncertainty as well as on the number of roll-outs considered per performance estimation of each candidate solution. We propose to augment the CMA-ES for RL with an adaptive uncertainty handling scheme based on Hoeffding or empirical Bernstein races (Maron & Moore, 1994; Maron & Moore, 1997; Audibert et al., 2007; Mnih et al., 2008), which dynamically adapts the number of episodes for evaluating a policy such that the ranking of new candidate solutions is just reliable enough to drive the learning process. All individuals participate in a selection race, in which their performances are sampled. A policy remains in the race as long as one cannot be sure with an a priori fixed probability whether the policy is promising or not, i.e., as long as the confidence interval for the estimated performance based on Hoeffding or empirical Bernstein bounds does not clearly distinguish it from the other candidates. The selection race is finished when a complete and accurate ranking is obtained.

(Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).)
In the next section, we describe the CMA-ES for direct policy search. In section 3 we introduce our new uncertainty handling scheme based on racing algorithms. Then we present some empirical results before we end with our conclusions.

2. Evolutionary direct policy search

Evolution strategies (ESs) are random search methods, which iteratively sample a set of candidate solutions from a probability distribution over the search space (here a parametrized space of policies), evaluate these potential solutions, and construct a new probability distribution over the search space based on the gathered information (Beyer, 2007). In ESs, this search distribution is parametrized by a set of candidate solutions, the parent population with size µ, and by parameters of the variation operators that are used to create new candidate solutions (the offspring population with size λ) from the parent population.

Algorithm 1: rank-µ CMA-ES
 1  initialize m^(0) = xinit, σ^(0), evolution paths pσ^(0) = pc^(0) = 0 and covariance matrix C^(0) = I (unity matrix), tlimit^(0) = 3
    // k counts the number of generations
 2  for k = 0, . . . do
      // create new offspring
 3    for l = 1, . . . , λ do  xl^(k+1) ∼ N(m^(k), σ^(k)² C^(k))
      // evaluate new offspring, see section 3
      // X̂l^(k+1) reflects the quality of xl^(k+1), l = 1, . . . , λ
 4    ({X̂1^(k+1), . . . , X̂µ^(k+1)}, tlimit^(k+1)) ←
 5        selectRace({x1^(k+1), . . . , xλ^(k+1)}, µ, a, b, tlimit^(k), δ)
      // recombination and selection
 6    m^(k+1) ← Σ_{i=1}^{µ} wi xi:λ^(k+1)
      // step size control
 7    pσ^(k+1) ← (1 − cσ) pσ^(k) + sqrt(cσ(2 − cσ)µeff) C^(k)^(−1/2) (m^(k+1) − m^(k)) / σ^(k)
 8    σ^(k+1) ← σ^(k) exp( (cσ/dσ) ( ‖pσ^(k+1)‖ / E[‖N(0, I)‖] − 1 ) )
      // covariance matrix update
 9    pc^(k+1) ← (1 − cc) pc^(k) + sqrt(cc(2 − cc)µeff) (m^(k+1) − m^(k)) / σ^(k)
10    C^(k+1) ← (1 − ccov) C^(k) + (ccov/µcov) pc^(k+1) [pc^(k+1)]^T + ccov (1 − 1/µcov) Σ_{i=1}^{µ} wi zi:λ^(k) [zi:λ^(k)]^T

Arguably the most elaborate ES for real-valued optimization is the covariance matrix adaptation ES (CMA-ES, Hansen et al., 2003; Suttorp et al., 2009), in which the main variation operator is additive Gaussian perturbation. The CMA-ES is a variable metric algorithm adapting shape and strength of its search distribution, and it is regarded as one of the most efficient evolutionary algorithms for real-valued optimization (Beyer, 2007). The CMA-ES is robust in the sense that it does not rely on tweaking of hyperparameters, see below. In a recent study, the CMA-ES (without rank-µ update, i.e., an outdated version) was compared to 8–12 (depending on the task) other RL algorithms including value-function and policy gradient approaches (Gomez et al., 2008). On the four test problems where the CMA-ES was considered, it ranked first, second (twice), and third. For further examples of successful applications of the CMA-ES for RL (Pellecchia et al., 2005; Siebel & Sommer, 2007) and additional comparisons on RL benchmarks (Heidrich-Meisner & Igel, 2008; 2009) we refer to the literature.

In each generation k of the CMA-ES, which is shown in Algorithm 1, the lth offspring xl^(k+1) ∈ R^n, l ∈ {1, . . . , λ}, is generated by additive multi-variate Gaussian mutation and weighted global intermediate recombination, i.e., xl^(k+1) ← m^(k) + σ^(k) zl^(k) with mutation σ^(k) zl^(k) ∼ σ^(k) N(0, C^(k)) and recombination m^(k) ← Σ_{l=1}^{µ} wl xl:λ^(k). Here xl:λ^(k) denotes the lth best individual of the λ offspring. Ranking the λ candidates (policies) requires their evaluation, which is discussed in detail in section 3. Considering the best µ of the offspring in the recombination implements non-elitist, rank-based selection. A common choice for the recombination weights, used here, is wl ∝ ln(µ + 1) − ln(l) with ‖w‖₁ = 1, w ∈ R^µ.
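To make the sampling and recombination step just described concrete, the following minimal Python sketch draws λ offspring from N(m^(k), σ^(k)² C^(k)) and recombines the µ best with log-linear weights. It is an illustration only, not the authors' implementation; the rng argument and the evaluate_and_rank callback (which could be realized by the selection race of section 3) are assumptions made for the example.

    import numpy as np

    def sample_and_recombine(m, sigma, C, lam, mu, evaluate_and_rank, rng):
        """One sampling/recombination step of the ES sketched above.

        m: current mean (n,), sigma: global step size, C: covariance matrix (n, n),
        evaluate_and_rank: returns offspring indices ordered from best to worst.
        """
        n = m.shape[0]
        # log-linear recombination weights w_l proportional to ln(mu+1) - ln(l), normalized to sum 1
        w = np.log(mu + 1.0) - np.log(np.arange(1, mu + 1))
        w /= w.sum()

        # mutation steps z_l ~ N(0, C); offspring x_l = m + sigma * z_l
        z = rng.multivariate_normal(np.zeros(n), C, size=lam)
        x = m + sigma * z

        order = evaluate_and_rank(x)      # e.g., ranking via a selection race (section 3)
        x_sel = x[order[:mu]]             # the mu best offspring
        z_sel = z[order[:mu]]             # their mutation steps, reused in the covariance update

        m_new = w @ x_sel                 # weighted global intermediate recombination
        return m_new, z_sel, w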
The CMA-ES adapts both the n-dimensional covariance matrix C^(k) of the normal mutation distribution as well as the global step size σ^(k) ∈ R⁺. The covariance matrix update has two parts, the rank-1 update considering the change of the population mean over time and the rank-µ update considering the successful variations in the last generation. The rank-1 update is based on a low-pass filtered evolution path pc^(k) of successful (i.e., selected) steps

    pc^(k+1) ← (1 − cc) pc^(k) + sqrt(cc(2 − cc)µeff) (m^(k+1) − m^(k)) / σ^(k)

and aims at changing C^(k) to make steps in the promising direction pc^(k+1) more likely by morphing the covariance towards pc^(k+1) [pc^(k+1)]^T. The backward time horizon of the cumulation process is approximately cc⁻¹, where cc = 4/(n + 4) is roughly inversely linear in the dimension of the path vector. The variance effective selection mass µeff = (Σ_{l=1}^{µ} wl²)⁻¹ is a normalization constant. The rank-µ update aims at making the single steps that were selected in the last iteration more likely by morphing C^(k) towards Σ_{i=1}^{µ} wi zi:λ^(k) [zi:λ^(k)]^T. Putting both updates together, we have

    C^(k+1) ← (1 − ccov) C^(k) + (ccov/µcov) pc^(k+1) [pc^(k+1)]^T + ccov (1 − 1/µcov) Σ_{i=1}^{µ} wi zi:λ^(k) [zi:λ^(k)]^T .

The constants ccov and µcov are fixed learning rates. The learning rate of the covariance matrix update ccov = 2/(n + √2)² is roughly inversely proportional to the degrees of freedom of the covariance matrix. The parameter µcov mediates between the rank-µ update (µcov → ∞) and the rank-one update (µcov = 1). The default value is µcov = µeff.

The global step size σ^(k) is adapted on a faster timescale. It is increased if the selected steps are larger and/or more correlated than expected and decreased if they are smaller and/or more anticorrelated than expected:

    σ^(k+1) ← σ^(k) exp( (cσ/dσ) ( ‖pσ^(k+1)‖ / E[‖N(0, I)‖] − 1 ) ) ,

where the (conjugate) evolution path is

    pσ^(k+1) ← (1 − cσ) pσ^(k) + sqrt(cσ(2 − cσ)µeff) C^(k)^(−1/2) (m^(k+1) − m^(k)) / σ^(k) .

Again, cσ = (µeff + 2)/(n + µeff + 3) is a fixed learning rate and dσ = 1 + 2 max(0, sqrt((µeff − 1)/(n + 1)) − 1) + cσ is a damping factor. The matrix C^(−1/2) is defined as B D⁻¹ B^T, where B D² B^T is an eigendecomposition of C (B is an orthogonal matrix with the eigenvectors of C and D a diagonal matrix with the corresponding eigenvalues) and sampling N(0, C) is done by sampling B D N(0, I).

The values of the learning rates and the damping factor are well considered and have been validated by experiments on many basic test functions (Hansen et al., 2003). They need not be adjusted dependent on the problem and are therefore no hyperparameters of the algorithm. Also the population sizes can be set to default values, which are λ = max(4 + ⌊3 ln n⌋, 5) and µ = ⌊λ/2⌋ for offspring and parent population, respectively. If we fix C^(0) = I, the only hyperparameter to be chosen problem dependent is the initial global step size σ^(0) (apart from the parameters controlling sample size and uncertainty handling, see section 3).
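The update equations above translate almost line by line into code. The following sketch is an illustration under the same notation, not the authors' implementation; the chi_n approximation of E[‖N(0, I)‖] and the argument list are choices made for the example, and the constants are assumed to be set to the default values given in the text.

    import numpy as np

    def cma_update(m_old, m_new, sigma, C, p_sigma, p_c, z_sel, w,
                   c_sigma, d_sigma, c_c, c_cov, mu_cov, mu_eff):
        """Evolution-path, step-size, and covariance updates (sketch of the
        equations above; constants as defined in the text)."""
        n = m_old.shape[0]
        y = (m_new - m_old) / sigma                       # normalized mean shift

        # C^{-1/2} = B D^{-1} B^T via the eigendecomposition C = B D^2 B^T
        eigvals, B = np.linalg.eigh(C)
        C_inv_sqrt = B @ np.diag(1.0 / np.sqrt(eigvals)) @ B.T

        # conjugate evolution path and step-size update
        p_sigma = (1 - c_sigma) * p_sigma + \
            np.sqrt(c_sigma * (2 - c_sigma) * mu_eff) * (C_inv_sqrt @ y)
        chi_n = np.sqrt(n) * (1 - 1.0 / (4 * n) + 1.0 / (21 * n ** 2))  # common approximation of E||N(0,I)||
        sigma = sigma * np.exp((c_sigma / d_sigma) * (np.linalg.norm(p_sigma) / chi_n - 1))

        # evolution path for the rank-1 update
        p_c = (1 - c_c) * p_c + np.sqrt(c_c * (2 - c_c) * mu_eff) * y

        # rank-1 plus rank-mu covariance update
        rank_mu = sum(w_i * np.outer(z_i, z_i) for w_i, z_i in zip(w, z_sel))
        C = (1 - c_cov) * C \
            + (c_cov / mu_cov) * np.outer(p_c, p_c) \
            + c_cov * (1 - 1.0 / mu_cov) * rank_mu
        return sigma, C, p_sigma, p_c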
3. Selection races

Due to uncertainty and randomness, the performance estimates, which are required for selecting policies, should be based on samples of several episodes (roll-outs). Our goal is to ensure with a given confidence that the µ selected policies have indeed the best mean performances while using as few roll-outs as possible. To this end, we want to systematically control in each iteration of our direct policy search (i) the overall number of roll-outs, and (ii) the distribution of roll-outs among the candidate policies. This can be achieved by adopting racing algorithms (Maron & Moore, 1994; Maron & Moore, 1997) for selection.

We view the performance of the policy encoded by xi, i ∈ {1, . . . , λ}, as a real-valued random variable Xi. The return (accumulated reward) Xi,t of the tth episode corresponds to a realization of Xi. We assume that the returns are almost surely bounded with known bounds a and b, Pr(Xi,t ∈ [a, b]) = 1, and define R = |a − b|. Further, we assume an a priori fixed confidence level 1 − δ and a maximum number of evaluations per individual tlimit (alternatively, we may fix a maximum number of evaluations per iteration). If we consider the empirical mean X̂i,t = (1/t) Σ_{t'=1}^{t} Xi,t' of t realizations (i.e., a performance estimate based on t episodes), Hoeffding's inequality bounds the deviation of the empirical mean from the true expectation E[Xi]. With probability of at least 1 − δ we have

    |X̂i,t − E[Xi]| ≤ R sqrt( log(2/δ) / (2t) ) .

This rather loose bound may be improved by applying Bernstein's inequality instead, which requires the usually unknown standard deviation of Xi. However, the recently derived empirical Bernstein bound (Audibert et al., 2007; Mnih et al., 2008) depends only on the empirical standard deviation σ̂i,t with σ̂i,t² = (1/t) Σ_{t'=1}^{t} (Xi,t' − X̂i,t)². With probability of at least 1 − δ it holds

    |X̂i,t − E[Xi]| ≤ σ̂i,t sqrt( 2 log(3/δ) / t ) + 3R log(3/δ) / t .

The dependency on R decreases linearly with the sample size t and not just with its square root, but now the bound depends also on σ̂i,t/√t. If the standard deviation is small compared to R, this bound is tighter than the Hoeffding bound.

Racing algorithms are iterative methods trying to identify with a high probability the best among several options. In each iteration, a couple of options are sampled. In their standard formulation, which we adopt here, there is an upper bound on the number of iterations.
Procedure selectRace({x1, . . . , xλ}, µ, a, b, tlimit^(k), δ).
 1  S = ∅                            // set of selected individuals
 2  D = ∅                            // set of discarded individuals
 3  U = {xi | i = 1, . . . , λ}      // set of undecided individuals
 4  t ← 1
 5  forall xi ∈ U do
 6      Xi,t ← performance(xi)       // evaluate individual
 7      LBi ← a, UBi ← b             // init lower and upper bounds
 8  ut = |U|
 9  while t < tlimit^(k) ∧ |S| < µ do
10      t ← t + 1
11      ut = |U|                     // needed for confidence intervals
        // reevaluate undecided individuals
12      forall xi ∈ U do
13          Xi,t ← performance(xi)
14          X̂i ← (1/t) Σ_{t'=1}^{t} Xi,t'   // recompute mean
15          compute new confidence interval [X̂i,t − ci,t, X̂i,t + ci,t]
16              using the Hoeffding or empirical Bernstein bound
            // update lower and upper bounds
17          LBi ← max{LBi, X̂i − ci,t}
18          UBi ← min{UBi, X̂i + ci,t}
19      forall xi ∈ U do
20          if |{xj ∈ U | LBi > UBj}| ≥ λ − µ − |D| then
                // probably among the best µ
21              S ← S ∪ {xi}         // select
22              U ← U \ {xi}
23          if |{xj ∈ U | UBi < LBj}| ≥ µ − |S| then
                // probably not among the best µ
24              D ← D ∪ {xi}         // discard
25              U ← U \ {xi}
    // adapt tlimit^(k+1) depending on whether tlimit^(k) was large enough
    // for selection with confidence 1 − δ or not
26  if |S| = µ then tlimit^(k+1) = max{3, α⁻¹ tlimit^(k)}
27  else tlimit^(k+1) = min{α tlimit^(k), tmax}
28  return ({X̂1, . . . , X̂λ}, tlimit^(k+1))

Our algorithm is described in the procedure selectRace. For the rank-based selection procedure used in the ES, we need to know the best µ from λ policies. Initially, all policies are evaluated (line 6) and then labelled undecided (line 3). In every following iteration or racing step, all policies tagged undecided are reevaluated (line 11). The estimated mean performance (line 14) and confidence interval (lines 17 and 18) are updated for each resampled policy (see below). If the lower bound of a policy is better than the upper bounds of at least λ − µ other candidates, it is selected (lines 20–22). Figure 1 illustrates this idea. If the upper bound of a policy is worse than the lower bounds of at least µ other candidates, it is discarded (lines 23–25). If µ individuals are selected or the number of iterations exceeds tlimit, the race is finished.

[Figure 1. Example of a selection race for µ = 2 and λ = 6. The candidate policies are plotted against their averaged returns with confidence intervals and are marked as selected, unselected, or undecided ("race!").]

We compute the confidence intervals using the Hoeffding or empirical Bernstein bound. If a confidence interval is recomputed, the highest lower and lowest upper bounds determined so far are stored. If we want our final decision to be correct with probability of at least 1 − δ, all computed bounds must be valid with probability of at least 1 − δ. By the union bound, this holds if each single bound holds with probability of at least 1 − δ/nb, where nb is the maximum number of considered bounds. In our setting, nb can be upper bounded by λ tlimit, because we have at most tlimit iterations and at most λ evaluations per iteration. Thus, after t evaluations of policy xi we get a confidence interval [X̂i,t − ci,t, X̂i,t + ci,t] with

    ci,t^Hoeffding = R sqrt( (log(2nb) − log δ) / (2t) )

using the Hoeffding and

    ci,t^Bernstein = σ̂i,t sqrt( 2 (log(3nb) − log δ) / t ) + 3R (log(3nb) − log δ) / t

using the empirical Bernstein bound. The approximation nb = λ tlimit is much too loose because it does not consider that already selected or discarded individuals are not reevaluated. Therefore we use in iteration t

    nb,t = Σ_{k=1}^{t−1} uk + (tlimit − t + 1) ut ,

where ut is the number of individuals labeled undecided in iteration t.
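As a complement to the numbered pseudocode, the following Python sketch shows one way the race could be implemented; it is an illustration, not the authors' code. The performance callback, the radius callback (e.g., one of the radii sketched earlier in this section), and the slightly rearranged handling of the first evaluation round are assumptions of the example; the pseudocode above remains the authoritative description.

    import math
    import numpy as np

    def select_race(policies, mu, a, b, t_limit, delta, performance, radius,
                    alpha=1.5, t_max=50):
        """Selection race (sketch). performance(x) returns the return of one
        episode; radius(t, R, emp_std, delta) is a confidence radius such as
        hoeffding_radius or empirical_bernstein_radius."""
        lam = len(policies)
        R = abs(a - b)
        returns = [[] for _ in range(lam)]            # sampled returns X_{i,t'}
        LB, UB = [a] * lam, [b] * lam                 # tightest bounds seen so far
        S, D = set(), set()                           # selected / discarded
        U = set(range(lam))                           # undecided
        u_history = []                                # u_t values for n_{b,t}

        t = 0
        while t < t_limit and len(S) < mu and U:
            t += 1
            u_history.append(len(U))
            # n_{b,t}: bounds already computed plus those still possible (union bound)
            n_b = sum(u_history[:-1]) + (t_limit - t + 1) * u_history[-1]
            for i in list(U):
                returns[i].append(performance(policies[i]))
                if t < 2:
                    continue                          # first round: only collect samples
                x_hat = float(np.mean(returns[i]))
                emp_std = float(np.std(returns[i]))   # sigma_hat with 1/t normalization
                c = radius(t, R, emp_std, delta / n_b)
                LB[i] = max(LB[i], x_hat - c)
                UB[i] = min(UB[i], x_hat + c)
            for i in list(U):
                beats = sum(1 for j in U if j != i and LB[i] > UB[j])
                beaten_by = sum(1 for j in U if j != i and UB[i] < LB[j])
                if beats >= lam - mu - len(D):        # probably among the best mu
                    S.add(i)
                    U.discard(i)
                elif beaten_by >= mu - len(S):        # probably not among the best mu
                    D.add(i)
                    U.discard(i)

        # adapt the race length for the next generation (lines 26/27 of the pseudocode)
        if len(S) == mu:
            t_limit_next = max(3, math.ceil(t_limit / alpha))
        else:
            t_limit_next = min(math.ceil(alpha * t_limit), t_max)

        means = [float(np.mean(r)) for r in returns]
        return S, means, t_limit_next

With radius set to the Hoeffding or the empirical Bernstein radius, this sketch corresponds to the two race variants discussed next.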
We combine this selection algorithm based on racing with the CMA-ES and call the resulting algorithms Hoeffding- and Bernstein-Race-CMA-ES depending on the way we compute the confidence intervals. By construction, our selection algorithm has the following property:

Proposition 1  Let {x1, . . . , xλ} be a set of individuals with fitness values almost surely between a and b, µ < λ, tlimit ≥ 3, and δ ∈ ]0, 1]. If the above procedure selectRace selects a set S of policies, then with probability of at least 1 − δ these elements belong to the best µ out of {x1, . . . , xλ} in terms of mean performance.

If after tlimit iterations fewer than µ individuals were selected, the – according to their estimated mean – best of the not yet discarded policies are considered in the CMA-ES. If this happens, tlimit was too small, and therefore the maximum number of iterations is increased by a factor of α > 1 in the next generation, upper bounded by a threshold tmax (line 27). If tlimit iterations were sufficient, tlimit is reduced by a factor of α⁻¹ (line 26), which has a positive effect on the bounds and therefore can speed up future races. Thus, the selection procedure adapts the overall budget of policy evaluations and distributes the available evaluations among the candidate policies.

Related work.  Stagge (1998) and Schmidt et al. (2006) propose methods for uncertainty handling in rank-based evolution strategies for noisy function optimization. However, both heuristics assume a particular distribution of the performance samples. Heidrich-Meisner and Igel (2009) use the uncertainty handling CMA-ES (UH-CMA-ES) developed by Hansen et al. (2009) for reliably ranking policies in RL. However, this heuristic only adjusts globally in each generation the number of roll-outs considered for evaluation. Racing algorithms have been proposed in the domain of evolutionary computation for the empirical evaluation of different evolutionary algorithms (e.g., different external strategy parameter configurations) (Birattari et al., 2002; Yuan & Gallagher, 2004), but to the best of our knowledge not yet for selection.

4. Experiments

Benchmark problems.  We consider RL benchmarks taken from the literature. The mountain car task serves as a minimal example. Here an underpowered car has to be driven out of a valley to the goal state at the hilltop (Sutton & Barto, 1998). The state s = [x, ẋ]^T of the system is given by the position x ∈ [−1.2, 0.6] of the car and by its current velocity ẋ ∈ [−0.07, 0.07]. Actions are discrete forces applied to the car, a ∈ {−amax, 0, amax}. In every time step a reward of −1 is given to the agent. The initial state is uniformly drawn from [−1.2, 0.6] × [−0.07, 0.07]. In a second scenario we add Gaussian noise N(0, 0.01) to the state observation (Heidrich-Meisner & Igel, 2008), making the underlying Markov decision process partially observable. A trial is successful if the final policy allows the car to reach the hilltop in less than 100 time steps on average for the fully observable task and in less than 120 time steps on average for the partially observable task.

As a higher dimensional and more challenging task we consider the swimmer problem (Coulom, 2002). The swimmer consists of nc compartments floating on a liquid surface and is supposed to move its center of gravity as fast as possible in a predefined direction. The state description s = [A0, ζ1, . . . , ζnc, Ȧ0, ζ̇1, . . . , ζ̇nc]^T includes the position of the end point of the first compartment A0 (marking the "head" of the swimmer), the angle of the ith part (i = 1, . . . , nc) with respect to the x-axis, the corresponding velocity of the "head" Ȧ0, and the angular velocities ζ̇1, . . . , ζ̇nc of the parts. Actions a = [a1, . . . , anc−1]^T are torques applied between body parts. The reward given to the swimmer is the velocity component of the swimmer's center of gravity parallel to the x-axis. We considered two swimmers with nc ∈ {3, 4}. The swimmer's initial position is set to 0 and the initial angular velocities are each drawn uniformly from [0, π]. A swimmer trial is called successful if the average velocity of the swimmer's center of gravity is larger than 3nc/40 m/s, i.e., the swimmer covers a distance of at least one and a half times its length in the desired direction in the simulated time span of 20 s.
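For both benchmarks, a single evaluation of a policy corresponds to the accumulated reward of one episode, i.e., one realization Xi,t. The following minimal sketch illustrates this; the env object with reset() and step() methods and the step limit are assumptions made for the example, not part of the benchmark definitions.

    def episode_return(policy, env, max_steps=1000):
        """Accumulated reward of a single episode (one realization X_{i,t}); sketch
        assuming a minimal environment interface with reset() and step()."""
        s = env.reset()                  # random initial state, as in both benchmarks
        total, steps, done = 0.0, 0, False
        while not done and steps < max_steps:
            a = policy(s)                # e.g., the linear policy of the experimental setup
            s, r, done = env.step(a)     # mountain car: r = -1 per step; swimmer: forward velocity
            total += r
            steps += 1
        return total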
Baseline algorithms.  In order to judge the performance of the CMA-ES and the two Race-CMA-ESs for RL, we consider two alternative methods for comparison. First, we use simple random search (random weight guessing, RWG) as a baseline for evaluation. In every iteration new policy parameters are drawn uniformly from an interval [−xmax, xmax]^n, where n is the number of policy parameters, and the best solution is maintained. Second, we apply the efficient episodic natural actor-critic algorithm (NAC) according to Peters and Schaal (2008), a state-of-the-art policy gradient method. The NAC is, like the CMA-ES, a variable metric algorithm operating on a predefined policy class.

Experimental setup.  In all four benchmark problems uncertainty arises from random start states. In the second of the two mountain car tasks we additionally have to cope with noisy observations. We always consider the same type of linear policies in order to allow for a fair comparison.
The linear policies examined here are typically used with the NAC (Peters & Schaal, 2008). More sophisticated choices of policy classes certainly improve the performance of the CMA-ES, which for instance works fine with non-linear neural networks (Pellecchia et al., 2005; Siebel & Sommer, 2007; Gomez et al., 2008). Thus all methods operate on the same policy class πx^deter(s) = x^T s with s, x ∈ R^n. For learning, the NAC uses the stochastic policy πx^stoch(s, a) = N(πx^deter(s), σNAC), where the standard deviation σNAC is considered as an additional adaptive parameter of the policy gradient method. The NAC is evaluated on the corresponding deterministic policy. The policy parameters (except the exploration parameter σNAC for the NAC) are always initialized with zero. For the swimmer task the action consists of nc − 1 continuous values and we apply an independent linear policy for each action component, thus the search space is 2(nc² − 1)-dimensional. For the independent evaluation of the algorithms (e.g., in the following plots), we determine the median of 50 roll-outs for the mountain car tasks and of 10 roll-outs for the swimmer tasks.

[Figure 2. Median of performance (median of return over the number of episodes) over 500 trials for RWG, NAC, CMA-ES, Bernstein-Race-CMA-ES, and Hoeffding-Race-CMA-ES in the a) fully observable and b) partially observable mountain car task. The respective best parameter configuration is plotted.]

[Figure 3. Median of performance (median of return over the number of episodes) over 20 trials for RWG, NAC, CMA-ES, Bernstein-Race-CMA-ES, and Hoeffding-Race-CMA-ES for swimmers with a) 3 and b) 4 segments. The respective best parameter configuration is plotted.]

For the mountain car problems we test for the CMA-ES all combinations of initial global step size σ(0) ∈ {0.1, 1, 5, 10, 15, 25, 50, 100} and sample size neval ∈ {1, 10, 20, 30}, and for the NAC all combinations of initial exploration σNAC ∈ {0.1, 1, 5, 10, 15, 25, 50, 100}, learning rate αNAC ∈ {0.1, 0.01, 0.001, 0.0001}, and sample size neval ∈ {n + 2, 2(n + 2), 3(n + 2)} (the NAC needs a minimum of n + 2 roll-outs per policy update). Since the swimmer problem is the more computationally demanding one, we restricted σ(0) and σNAC to {1, 10, 50} for this task.

For the Hoeffding- and Bernstein-Race-CMA-ES we always test all combinations of confidence δ ∈ {0.01, 0.05, 0.1} and σ(0) ∈ {1, 10, 50}. In each trial we initialize tlimit = 3 (at least three roll-outs are necessary to compute the empirical Bernstein bounds), and we use α = 1.5 and tmax = 50 as upper bound on tlimit.
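To make the policy classes of the experimental setup explicit, here is a minimal sketch (an illustration; the rng argument and the per-component handling for the swimmer are assumptions of the example):

    import numpy as np

    def deterministic_policy(x, s):
        """pi_x^deter(s) = x^T s, the deterministic linear policy used by all methods."""
        return float(np.dot(x, s))

    def stochastic_policy(x, s, sigma_nac, rng):
        """pi_x^stoch(s, a) = N(pi_x^deter(s), sigma_NAC): the Gaussian exploration
        policy the NAC learns with; it is evaluated via its deterministic counterpart."""
        return rng.normal(deterministic_policy(x, s), sigma_nac)

    def swimmer_policy(X, s):
        """For the swimmer, one independent linear map per action component:
        X has shape (n_c - 1, dim(s)); returns the vector of n_c - 1 torques."""
        return X @ np.asarray(s)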

Results.  In Figs. 2 and 3 the performances of RWG, CMA-ES, NAC, and Bernstein- and Hoeffding-Race-CMA-ES are shown for the four test scenarios. For all methods the performance for the best respective parameter configuration is plotted. The race-based selections clearly increased the learning speed. RWG, NAC, and CMA-ES performed best for large but not too large sample sizes neval. Among the methods without uncertainty handling the CMA-ES was more robust against uncertainty than NAC and RWG when looking at the best respective parameter configurations.
In the fully observable mountain car problem the NAC performed exceptionally well because it benefits from a policy initialization close to an optimum (Heidrich-Meisner & Igel, 2008). However, in the partially observable mountain car problem the Race-CMA-ESs outperformed all other methods. For both mountain car tasks the Hoeffding-Race-CMA-ES performed better than the Bernstein-Race-CMA-ES. Initializing the start state of the mountain car randomly led to a large standard deviation σ̂i,t compared to the range R, and therefore the Hoeffding bound was tighter than the empirical Bernstein bound. In the swimmer tasks the empirical Bernstein bound was tighter than the Hoeffding bound, and the Bernstein-Race-CMA-ES outperformed the Hoeffding-Race-CMA-ES. The performance of both Race-CMA-ESs depended on the upper bound tmax of the race length tlimit, as shown in Fig. 4 for the Bernstein-Race-CMA-ES. Choosing tmax too large results in very costly policy updates and unnecessarily slow learning. This can be seen in the swimmer task with nc = 4, where both Race-CMA-ESs performed poorly even though they initially improved the learning speed. The inset in Fig. 5 shows the values of tlimit chosen by the Bernstein-Race-CMA-ES for single trials. At the beginning small sample sizes were used, and when the accuracy was no longer sufficient the sample size was increased. Finally, for fine tuning, the maximal allowed sample size of tmax = 50 was used most of the time. As can be seen in Fig. 5, a lower confidence improved the performance in our experiments, because in the initial learning phase, which is crucial for the overall performance in the simple mountain car task, a high confidence is not needed.

[Figure 4. Median of the performance (median of return over the number of episodes) over 20 trials for the Bernstein-Race-CMA-ES with δ = 0.05 and σ(0) = 1 for different maximal race lengths tmax ∈ {25, 50, 100} in the swimmer task with nc = 4.]

[Figure 5. Median of performance (median of return over the number of episodes) over 500 trials for the Hoeffding-Race-CMA-ES with σ(0) = 50 in the partially observable mountain car task for different confidence levels δ ∈ {0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5}. The race length tlimit selected by the Hoeffding-Race-CMA-ES with λ = 5, tmax = 50 and δ = 0.05 for four selected trials is plotted in the inset.]

But the Race-CMA-ESs not only improved the learning speed. They also proved to be very robust against changes in their hyperparameter values. For the mountain car task both the Bernstein- and the Hoeffding-Race-CMA-ES were successful in more than 80% of the 500 trials for 9 out of 9 tested parameter configurations, while the standard CMA-ES was successful in more than 80% of the trials in only 18 of 32 cases, the NAC in 60 of 96 cases, and RWG in 7 of 28 cases. In the partially observable mountain car task, the Bernstein- and the Hoeffding-Race-CMA-ES were again successful in more than 80% of the trials for 9 out of 9 tested parameter configurations. CMA-ES, NAC, and RWG achieved a success level of 80% in 3 of 3, 6 of 12, and 0 of 28 cases, respectively. (In this case only the most effective fixed sample size was tested for the CMA-ES and the NAC.) For swimmers with 3 compartments the Race-CMA-ESs were also robust. The Bernstein-Race-CMA-ES was successful in more than 50% of all trials for 6 of 9 and the Hoeffding-Race-CMA-ES for 5 of 9 parameter configurations. For swimmers with nc = 4 both Race-CMA-ESs were not successful, presumably because the value of tmax was too large. The CMA-ES achieved successes in more than 50% of the trials in 11 of 12 and 5 of 12 cases for nc ∈ {3, 4}, while the NAC and RWG reached this success level in none of the configurations. Thus, uncertainty handling through selection races remarkably sped up learning and at the same time improved the robustness compared to the CMA-ES without uncertainty handling, which is already much more robust than the policy gradient method.

5. Discussion and Conclusion

Evolution strategies (ESs) are powerful direct policy search methods. One of their main advantages is their ability to cope with uncertainty and noise. Still, random elements in the environment require gathering statistics over several episodes for the evaluation of candidate policies. We added a new adaptive uncertainty handling to evolutionary reinforcement learning. It adjusts the number of roll-outs per evaluation of a policy such that the signal to noise ratio is just high enough for a sufficiently good ranking of candidate policies, which in turn suffices for the ES to find better solutions.
The uncertainty handling exploits the advantage of small sample sizes in the beginning and increases the sample size when a higher accuracy is necessary. It balances the trade-off between fast learning and sufficient accuracy. This significantly increases both learning speed and robustness.

The statistically sound uncertainty handling scheme is independent of the CMA-ES and could be combined with other RL approaches (e.g., for evolutionary online RL, Whiteson & Stone, 2006). Moreover, it is not limited to RL and may also be adopted for general noisy function approximation tasks.

References

Audibert, J.-Y., Munos, R., & Szepesvári, C. (2007). Tuning bandit algorithms in stochastic environments. 18th International Conference on Algorithmic Learning Theory (ALT) (pp. 150–165). Springer-Verlag.

Beyer, H.-G. (2007). Evolution strategies. Scholarpedia, 2, 1965.

Birattari, M., Stutzle, T., Paquete, L., & Varrentrapp, K. (2002). A racing algorithm for configuring metaheuristics. Genetic and Evolutionary Computation Conference (GECCO 2002) (pp. 11–18). Morgan Kaufmann Publishers.

Coulom, R. (2002). Apprentissage par renforcement utilisant des réseaux de neurones, avec des applications au contrôle moteur. Thèse de doctorat, Institut National Polytechnique de Grenoble.

Gomez, F., Schmidhuber, J., & Miikkulainen, R. (2008). Accelerated neural evolution through cooperatively coevolved synapses. Journal of Machine Learning Research, 9, 937–965.

Hansen, N., Müller, S. D., & Koumoutsakos, P. (2003). Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11, 1–18.

Hansen, N., Niederberger, A. S. P., Guzzella, L., & Koumoutsakos, P. (2009). A method for handling uncertainty in evolutionary optimization with an application to feedback control of combustion. IEEE Transactions on Evolutionary Computation, 13, 180–197.

Heidrich-Meisner, V., & Igel, C. (2008). Variable metric reinforcement learning methods applied to the noisy mountain car problem. European Workshop on Reinforcement Learning (EWRL 2008) (pp. 136–150). Springer-Verlag.

Heidrich-Meisner, V., & Igel, C. (2009). Uncertainty handling CMA-ES for reinforcement learning. Genetic and Evolutionary Computation Conference (GECCO 2009). ACM Press.

Maron, O., & Moore, A. W. (1994). Hoeffding races: Accelerating model selection search for classification and function approximation. Advances in Neural Information Processing Systems (pp. 59–66). Morgan Kaufmann Publishers.

Maron, O., & Moore, A. W. (1997). The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11, 193–225.

Mnih, V., Szepesvári, C., & Audibert, J. (2008). Empirical Bernstein stopping. Proceedings of the 25th International Conference on Machine Learning (ICML 2008) (pp. 672–679).

Pellecchia, A., Igel, C., Edelbrunner, J., & Schöner, G. (2005). Making driver modeling attractive. IEEE Intelligent Systems, 20, 8–12.

Peters, J., & Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71, 1180–1190.

Schmidt, C., Branke, J., & Chick, S. (2006). Integrating techniques from statistical ranking into evolutionary algorithms. Applications of Evolutionary Computing (pp. 752–763). Springer-Verlag.

Siebel, N. T., & Sommer, G. (2007). Evolutionary reinforcement learning of artificial neural networks. International Journal of Hybrid Intelligent Systems, 4, 171–183.

Stagge, P. (1998). Averaging efficiently in the presence of noise. Parallel Problem Solving from Nature (PPSN V) (pp. 188–197). Springer-Verlag.

Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. MIT Press.

Suttorp, T., Hansen, N., & Igel, C. (2009). Efficient covariance matrix update for variable metric evolution strategies. Machine Learning, 75, 167–197.

Whiteson, S., & Stone, P. (2006). Evolutionary function approximation for reinforcement learning. Journal of Machine Learning Research, 7, 877–917.

Yuan, B., & Gallagher, M. (2004). Statistical racing techniques for improved empirical evaluation of evolutionary algorithms. Parallel Problem Solving from Nature (PPSN VIII) (pp. 172–181). Springer-Verlag.