Robust Black-Box Optimization For Stochastic Search and Episodic Reinforcement Learning
Abstract
Black-box optimization is a versatile approach to solve complex problems where the objective
function is not explicitly known and no higher order information is available. Due to its
general nature, it finds widespread applications in function optimization as well as machine
learning, especially episodic reinforcement learning tasks. While traditional black-box
optimizers like CMA-ES may falter in noisy scenarios due to their reliance on ranking-based
transformations, a promising alternative emerges in the form of the Model-based Relative
Entropy Stochastic Search (MORE) algorithm. MORE can be derived from natural policy
gradients and compatible function approximation and directly optimizes the expected fitness
without resorting to rankings. However, in its original formulation, MORE often cannot
achieve state-of-the-art performance. In this paper, we improve MORE by decoupling the
updates of the search distribution's mean and covariance, by introducing an improved entropy
scheduling technique based on an evolution path that results in faster convergence, and by
simplifying the model learning approach in comparison to the original paper. We show that
our algorithm performs comparably to state-of-the-art black-box optimizers on standard
benchmark functions. Further, it clearly outperforms ranking-based methods, other
policy-gradient-based black-box algorithms, and state-of-the-art deep reinforcement
learning algorithms when used for episodic reinforcement learning tasks.
Keywords: black-box optimization, stochastic search, derivative-free optimization, evolu-
tion strategies, episodic reinforcement learning
1. Introduction
Stochastic search algorithms (Spall, 2005) are problem-independent algorithms well-suited
for black-box optimization (BBO) (Larson et al., 2019) of an objective function. They only
require function evaluations and are used when the objective function cannot be modeled
analytically and no gradient information is available. This is often the case for real world
problems such as robotics (Chatzilygeroudis et al., 2017) where the objective function
describes the outcome of a task, medical applications (Winter et al., 2008), or forensic
identification (Ibáñez et al., 2009).
Typically, these algorithms maintain a search distribution over the optimization variables
of the objective function. Solution candidates are sampled, evaluated, and the parameters of
the search distribution (e.g. mean and covariance for Gaussian search distributions) are then
updated towards a more promising direction. This process is repeated until a satisfactory
solution quality is found or a pre-defined budget of objective function evaluations is reached.
In this paper, we re-introduce Model-based Relative Entropy Stochastic Search (MORE)
(Abdolmaleki et al., 2015), a versatile, general purpose stochastic-search optimization
algorithm. Using insights from reinforcement learning (RL) and information-theoretic trust-
regions, it aims to update the parameters of a Gaussian search distribution in the direction
of the natural gradient (Kakade, 2001). To this end, MORE uses compatible function
approximation (Sutton et al., 1999; Pajarinen et al., 2019) to learn a quadratic surrogate
model of the objective function and bounds the Kullback-Leibler (KL) divergence between
subsequent search distributions. In its original formulation, a bound on the Kullback-Leibler
divergence and a bound on the loss of entropy between the old and new search distribution
additionally act as an exploration-exploitation trade-off to prevent premature convergence
of the algorithm.
Most other successful stochastic search algorithms make use of rankings of objective
function evaluations for updating the search distribution’s parameters (Hansen, 2016; Wier-
stra et al., 2014; Rubinstein and Kroese, 2004). While the use of rankings is well studied for
deterministic objective function evaluations, its effects in the case of stochastic objective
function evaluations, which are, for example, common in episodic reinforcement learning,
are less well understood. We analyze these ranking based algorithms for this stochastic
evaluation case and uncover their limitations to solve such tasks. In contrast, MORE directly
optimizes the expected objective function values without ranking transformations of the
objective function evaluations and can therefore also be applied for such domains.
However, the original MORE approach suffers from (a) the need for conservative KL
bounds to keep the covariance updates stable, (b) a fixed entropy-decreasing schedule
resulting in suboptimal exploration and slow convergence, and (c) an unnecessarily
complicated and inaccurate model learning method. We aim to fix these issues by (a)
splitting the KL divergence into separate updates for the mean and covariance with separate
trust region bounds, (b) introducing an adaptive entropy schedule based on an evolution
path, and (c) using ordinary least squares with appropriate data pre-processing techniques to
simplify the model learning process. We call our new algorithm Coordinate-Ascent MORE
with Step Size Adaptation or CAS-MORE for short.
We empirically evaluate our algorithm on simulated robotics tasks, as well as a set of
benchmark optimization functions. While we are competitive with state-of-the-art algorithms
such as CMA-ES on the benchmark optimization functions, the RL experiments clearly show
the strength of MORE.
2. Related Work
In the following sections, we describe in more depth current state-of-the-art algorithms
from the field of black-box optimization and review common design choices and algorithmic
procedures. In addition, we highlight current step- and episode-based reinforcement
learning approaches along with their benefits and drawbacks and discuss trust-region based
reinforcement learning algorithms.
Contrary to these algorithms, MORE does not discard poorly performing candidates in
its optimization process and only applies common data pre-processing techniques such as
standardization before feeding them to the learning algorithm. There also exist variations of
the CMA-ES using surrogate models such as (Loshchilov et al., 2012b) or (Hansen, 2019).
Whereas these algorithms use the surrogate to generate approximate function values, which
are then again used to generate a ranking, MORE directly uses the model parameters to
update the search distribution parameters.
use of a trust-region update is TRPO (Schulman et al., 2015) which enforces a bound
on an average KL divergence trust-region. Similar sample-based approximations of the
trust-region are also employed by recent deep reinforcement learning techniques such as
Maximum A-Posteriori Policy Optimization (MPO) (Abdolmaleki et al., 2018b). Compatible
function approximation for deep reinforcement learning has been explored in (Pajarinen
et al., 2019). An improved way of enforcing trust-regions directly in the policy function by
using differentiable trust-region layers (TRPLs) has been introduced in (Otto et al., 2021).
A closely related method to MORE is the Relative Entropy Policy Search (REPS) for
step-based reinforcement learning (Peters et al., 2010) and its episodic formulation (albeit
derived in a contextual setting) in (Kupcsik et al., 2013). Both use samples to approximate
the objective and the trust-regions which does not guarantee that the KL trust-region is
enforced for the new parametric policy in practice. Layered direct policy search (LaDIPS)
(End et al., 2017) extends MORE to the contextual setting using a linear mapping from the
context to the mean. Additionally, the natural gradient can be derived from a second-order
approximation of the KL divergence, resulting in the Fisher information matrix used in NES.
Recently, Otto et al. (2023) extended TRPLs to the episodic case and use a policy
gradient method for learning movement primitives that allow a non-linear mapping from a
context vector to the motion primitive parameters. They show that other algorithms such
as PPO (Schulman et al., 2017) are unsuitable as they suffer from the higher dimensional
action space induced by the motion primitives. In contrast, we focus on the non-contextual
case where the policy is a Gaussian distribution with full covariance matrix. We show that
in this case, the natural gradient updates from MORE clearly outperform the trust-region
policy gradient updates from TRPL, indicating that extending the exact natural gradient
updates from MORE to deep neural networks is an interesting direction for future work.
The MORE algorithm finds application in a variety of fields, such as variational inference
(Arenz et al., 2018) or density estimation (Becker et al., 2019). In the context of trajectory
optimization, ideas from the MORE algorithm are explored in the MOTO algorithm (Akrour
et al., 2018).
3. Preliminaries
Before we introduce our new algorithm, we present in detail the problem setting and the
original MORE algorithm. Additionally, we provide another view on the algorithm derived
from compatible function approximation and its relation to natural gradient optimization.
and collapse the search distribution to a point estimate which would prevent it from further
exploration. To control the exploration-exploitation trade-off, additional constraints on the
updated search distribution need to be introduced. First, the KL-divergence between the
current search distribution and the solution of the optimization problem is upper bounded
which ensures that the mean and covariance only slowly change and the approximate model
is not over-exploited. Furthermore, an additional lower bound on the entropy is introduced
to prevent premature convergence.
Closed Form Updates The optimization problem can be solved in closed form using the
method of Lagrangian multipliers. The dual function is given by
$$g(\eta, \omega) = \eta\epsilon - \omega\beta + (\eta + \omega) \log\left(\int \pi_t(x)^{\frac{\eta}{\eta+\omega}} \exp\left(\frac{F(x)}{\eta+\omega}\right) \mathrm{d}x\right)$$
with Lagrangian multipliers η and ω, where ε denotes the KL bound and β the entropy bound. The new search distribution is then given in terms of
the Lagrangian multipliers as
$$\pi(x) \propto \pi_t(x)^{\frac{\eta}{\eta+\omega}} \exp\left(\frac{F(x)}{\eta+\omega}\right).$$
Analytic Solution The use of a quadratic model is beneficial for two reasons. First, it is
expressive enough to solve a wide range of problems while being computationally easy to
obtain. Second, quadratic features are the compatible features of the Gaussian distribution
which make the integrals tractable. As shown in Section 3.3, the use of quadratic features
also allows us to obtain exact natural gradient updates. The optimization problem in terms
of the natural parameters $m = \Sigma^{-1}\mu$ and $\Lambda = \Sigma^{-1}$ is solved by minimizing the Lagrangian
dual function, which is given as
$$g(\eta, \omega) = \eta\epsilon - \omega\beta + \frac{1}{2}\Big( \omega k \log(2\pi) - \eta\big(\log|\Lambda_t^{-1}| + m_t^T \Lambda_t^{-1} m_t\big) + (\eta + \omega)\big(\log|\Lambda^{-1}| + m^T \Lambda^{-1} m\big) \Big),$$
with the inverse of the Fisher information matrix (FIM) $F = \mathbb{E}_{x\sim\pi}\left[\nabla_\theta \log \pi(x)\, \nabla_\theta \log \pi(x)^T\right]$, i.e.,
$$g_{\text{NG}} = F^{-1} \nabla_\theta J.$$
where $y_i = f(x_i)$, $i = 1, \dots, N$. Equation (4) resembles the least squares solution to a linear
regression problem which is given by
$$w^* = \arg\min_w \sum_i \left(\nabla_\theta \log \pi(x_i)^T w - y_i\right)^2 \qquad (5)$$
using the function approximator $\nabla_\theta \log \pi(x)^T w$, where the features are given by the gradient
of the log search distribution with respect to the parameters of π. In this case, the natural
gradient is given by the least squares solution, i.e., $g_{\text{NG}} = w^*$.
Consequently, if the model is obtained by a linear least squares fit using quadratic features,
the MORE algorithm performs an unbiased update in the natural gradient direction,
where the learning rate is specified implicitly by the trust-region bound ε. This is reflected in the
update equations of MORE which, assuming an entropy bound β → −∞, are simply given by
$$m = m_t + \eta^{-1} a, \qquad \Lambda = \Lambda_t + \eta^{-1} A,$$
where a and A are the estimated parameters of the compatible function approximation.
They directly correspond to the fitted model parameter $w^*$ in Equation (5). In contrast to
other methods such as CMA-ES (Akimoto et al., 2010) or NES, where a relation to natural
gradients has also been established, the natural gradient update of MORE is exact since the
estimation of the quadratic model is unbiased in expectation and no ranking-based reward
transformation is used.
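To make this connection concrete, the following minimal Python sketch (our own illustration, not the reference implementation) fits a quadratic surrogate by ordinary least squares and applies the resulting natural-gradient update of the natural parameters for β → −∞; the toy objective, the sample size, and the fixed η are assumptions made only for this example.

```python
import numpy as np

def quadratic_features(X):
    """Compatible features of a Gaussian: bias, linear terms, and x_i * x_j for i <= j."""
    N, n = X.shape
    iu = np.triu_indices(n)
    quad = np.stack([np.outer(x, x)[iu] for x in X])
    return np.hstack([np.ones((N, 1)), X, quad])

def fit_quadratic_model(X, y):
    """Least-squares fit of f_hat(x) = -0.5 x^T A x + a^T x + a0; returns (A, a, a0)."""
    N, n = X.shape
    w, *_ = np.linalg.lstsq(quadratic_features(X), y, rcond=None)
    a0, a, w_quad = w[0], w[1:n + 1], w[n + 1:]
    B = np.zeros((n, n))
    B[np.triu_indices(n)] = w_quad
    A = -(B + B.T)  # map upper-triangular feature weights back to the symmetric matrix A
    return A, a, a0

def natural_gradient_update(mu, Sigma, A, a, eta):
    """MORE-style update of the natural parameters m and Lambda for beta -> -inf."""
    Lambda_t = np.linalg.inv(Sigma)
    m_t = Lambda_t @ mu
    Lambda_new = Lambda_t + A / eta
    m_new = m_t + a / eta
    Sigma_new = np.linalg.inv(Lambda_new)
    return Sigma_new @ m_new, Sigma_new  # back to mean and covariance

# Toy usage on a deterministic quadratic objective f(x) = -||x - 1||^2.
rng = np.random.default_rng(0)
mu, Sigma, eta = np.zeros(3), np.eye(3), 10.0
for _ in range(50):
    X = rng.multivariate_normal(mu, Sigma, size=50)
    y = -np.sum((X - 1.0) ** 2, axis=1)
    A, a, a0 = fit_quadratic_model(X, y)
    mu, Sigma = natural_gradient_update(mu, Sigma, A, a, eta)
print(mu)  # approaches the optimum at (1, 1, 1)
```

Because the toy objective is exactly quadratic, the fitted model is exact and the update direction coincides with the natural gradient; for general objectives the fit is only a local approximation, which is why the trust-region bounds are needed.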
$$M_\mu(\lambda) = \lambda\Sigma_t^{-1} + A, \qquad m_\mu(\lambda) = \lambda\Sigma_t^{-1}\mu_t + a.$$
A detailed derivation of the dual function can be found in Appendix A.1. We now solve the
dual problem
$$\underset{\lambda}{\text{minimize}} \;\; g_\mu(\lambda) \qquad \text{subject to} \;\; \lambda > 0,$$
where $\lambda^*$ is the solution of the optimization problem, which we find using a non-linear
optimization algorithm.
$$\underset{\nu,\,\omega}{\text{minimize}} \;\; g_\Sigma(\nu) \qquad \text{subject to} \;\; \nu > 0,$$
where ν is again a Lagrangian multiplier. The optimal solution $\Sigma^*$ in terms of the solution
$\nu^*$ can be found analogously and is given by
$$\Sigma^* = \Lambda(\nu^*)^{-1}.$$
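As an illustration of how these dual problems can be solved in practice, the following sketch minimizes the mean dual with a bounded scalar optimizer and recovers the new mean. It assumes the form of $g_\mu(\lambda)$ reconstructed in Appendix A.1 and uses `eps_mu` for the mean trust-region bound, so it is a sketch under those assumptions rather than the reference implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mean_update(mu_t, Sigma_t, A, a, eps_mu):
    """Coordinate-ascent mean update: minimize the dual g_mu(lambda) over lambda > 0
    and return the corresponding new mean mu* = M_mu(lambda*)^{-1} m_mu(lambda*)."""
    Prec_t = np.linalg.inv(Sigma_t)          # Sigma_t^{-1}
    quad_t = mu_t @ Prec_t @ mu_t            # mu_t^T Sigma_t^{-1} mu_t

    def dual(lam):
        M = lam * Prec_t + A                 # M_mu(lambda)
        m = lam * Prec_t @ mu_t + a          # m_mu(lambda)
        try:
            sol = np.linalg.solve(M, m)
        except np.linalg.LinAlgError:
            return np.inf                    # M not invertible for this lambda
        return lam * eps_mu + 0.5 * (m @ sol - lam * quad_t)

    lam = minimize_scalar(dual, bounds=(1e-12, 1e12), method="bounded").x
    mu_new = np.linalg.solve(lam * Prec_t + A, lam * Prec_t @ mu_t + a)
    return mu_new, lam
```

The covariance step proceeds analogously by minimizing $g_\Sigma(\nu)$ and setting $\Sigma^* = \Lambda(\nu^*)^{-1}$.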
Figure 1: This figure illustrates the concept of the evolution path in a 2D example and
is adapted from Hansen (2016). In each figure, arrows with black heads correspond to an
update of the mean. The evolution path is a smoothed sum over subsequent mean updates
and depicted with an open head. In the left plot, the mean updates show no clear search
direction and, as a result, the evolution path is short. The right plot shows the opposite
where heavily correlated mean updates lead to a long evolution path. The center plot shows
the desired case, where subsequent mean updates are uncorrelated.
can also lead to slow convergence as it takes a long time to settle in on a solution and simply
increasing the bound on the covariance can lead to unstable updates.
To solve these issues, we take inspiration from CMA-ES which already makes use of
an entropy control mechanism in form of the step-size update (Hansen, 2016). The crucial
property underlying this update is the so-called evolution path which is a smoothed sum
over previous mean updates. The idea of step-size adaptation is the following. Whenever
we take steps in the same direction, we could essentially take a single, larger step, while
dithering around a constant location indicates no clear search direction. In the first case, a
larger entropy would allow for bigger steps, in the second case, we need to decrease entropy.
We illustrate the concept in Figure 1. We can achieve this by scaling the covariance with a
scalar factor
$$\sigma_{t+1} = \exp\left(-\frac{\delta_{t+1}}{n}\right)$$
after the optimization steps, where n is the problem dimensionality and $\delta_{t+1}$ is the desired
change between the entropy in iteration t and the next iteration t + 1 of the algorithm. The
question is now how to determine a suitable value for $\delta_{t+1}$.
The step size update we propose is inspired by the step size control mechanism used in
CMA-ES called cumulative step size adaptation (CSA). We also use the evolution path p
which is a vector tracking previous mean updates and compare its length to a hypothetical
length of the evolution path resulting from uncorrelated mean updates. Therefore, the
evolution path update needs to be constructed in a way that we can derive a computable
desired length from it. Inspired by CMA-ES, we initialize the evolution path p0 to zero and
update it as follows
$$p_{t+1} = (1 - c_\sigma)\, p_t + \sqrt{\frac{c_\sigma (2 - c_\sigma)}{2\epsilon_\mu}}\; \Sigma_t^{-\frac{1}{2}} (\mu_{t+1} - \mu_t),$$
where $c_\sigma < 1$ is a hyper-parameter. The factors are chosen analogously to CMA-ES such that
$\sqrt{(1 - c_\sigma)^2 + c_\sigma(2 - c_\sigma)} = 1$. While multiplying the shift in means with the inverse square
root of the covariance matrix in CMA-ES makes the evolution path roughly zero mean and
unit variance distributed, we know its exact length due to the constraint on the mean update
in Equation (6) as long as the mean update is at its bound. The vector $\Sigma_t^{-\frac{1}{2}}(\mu_{t+1} - \mu_t)$ has
in that case length $\sqrt{2\epsilon_\mu}$, and by scaling it with $(\sqrt{2\epsilon_\mu})^{-1}$ the resulting length is 1. We set
$$\delta_{t+1} = \alpha \left( 1 - \frac{\|p_{t+1}\|}{\|p^{\text{des}}_{t+1}\|} \right),$$
where $\|p^{\text{des}}_{t+1}\|$ is a desired length of the evolution path, given by the length of an evolution
path resulting from fully uncorrelated updates, and α influences the magnitude of the entropy
change; it was set to 1 in the black-box optimization experiments and to 0.1 in the episodic
RL experiments. This rule for determining the change in entropy is almost identical to the
one in CMA-ES. If the norm of the current evolution path $p_{t+1}$ is smaller than $\|p^{\text{des}}_{t+1}\|$, i.e., the
mean updates were dithering around a constant location, we reduce entropy (i.e., $\delta_{t+1} > 0$),
while entropy is increased if the length of the evolution path is longer than desired.
In order to determine the length of an uncorrelated path $\|p^{\text{des}}_{t+1}\|$, we first look at the
length of the sum of two vectors,
$$\|a + b\| = \sqrt{\|a\|^2 + \|b\|^2 + 2\|a\|\|b\|\cos\theta},$$
where θ is the angle between the vectors a and b. If these two vectors are uncorrelated (i.e.,
θ = π/2), the length becomes $\sqrt{\|a\|^2 + \|b\|^2}$. Under the assumption that the mean update
is always at the bound, i.e., $\|(2\epsilon_\mu \Sigma_t)^{-\frac{1}{2}}(\mu_{t+1} - \mu_t)\| = 1$, the length of the evolution path $p_{t+1}$
given the previous length $p_t$ is then
$$p_{t+1} = \sqrt{(1 - c_\sigma)^2 p_t^2 + c_\sigma(2 - c_\sigma)}.$$
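A compact sketch of this entropy adaptation is given below (our own illustration): it accumulates the whitened mean shift into the evolution path, updates the desired path length under the uncorrelated-update assumption, and rescales the covariance with the factor $\sigma_{t+1} = \exp(-\delta_{t+1}/n)$. Cholesky-based whitening is used in place of the symmetric matrix square root for simplicity; `alpha=1.0` corresponds to the black-box optimization setting described above, while the default value of `c_sigma` is an arbitrary placeholder.

```python
import numpy as np

def entropy_adaptation(p, p_des, mu_old, mu_new, Sigma_old, Sigma_new,
                       eps_mu, c_sigma=0.3, alpha=1.0):
    """One step of the evolution-path based entropy control described above.

    p      : evolution path vector from the previous iteration
    p_des  : previous desired path length (scalar)
    Returns the rescaled covariance and the updated (p, p_des)."""
    n = mu_old.shape[0]
    # Whitened, normalized mean shift; its length is 1 when the mean update is at its bound.
    L = np.linalg.cholesky(Sigma_old)
    shift = np.linalg.solve(L, mu_new - mu_old) / np.sqrt(2.0 * eps_mu)
    p = (1.0 - c_sigma) * p + np.sqrt(c_sigma * (2.0 - c_sigma)) * shift
    # Desired length assuming uncorrelated, full-length mean updates.
    p_des = np.sqrt((1.0 - c_sigma) ** 2 * p_des ** 2 + c_sigma * (2.0 - c_sigma))
    delta = alpha * (1.0 - np.linalg.norm(p) / p_des)   # > 0 shrinks entropy, < 0 grows it
    sigma = np.exp(-delta / n)                          # scalar rescaling factor
    return sigma * Sigma_new, p, p_des
```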
Figure 2: This figure shows the function value (left) and the entropy of the search distribution
(right) over the course of optimization of a 15-dimensional Rosenbrock function using different
variants of MORE. The original MORE is shown in green, while coordinate ascent versions
of MORE are orange and blue, where CA-MORE indicates a fixed entropy reduction schedule
and CAS-MORE indicates the adaptive entropy schedule using the step-size adaptation. For
comparison, CMA-ES is plotted in red.
a target function value of $1 \times 10^{-8}$ is reached, or 12 000 function evaluations are exceeded.
Figure 2 shows median and 5% / 95% quantiles of the function value at the mean (Figure 2a)
and the entropy of the search distribution (Figure 2b) over the optimization. The hyper-
parameters for the KL bounds (and entropy schedule for MORE) are chosen individually for
each algorithm based on a grid search. We observe that decoupling the mean and covariance
update already significantly improves convergence speed as it allows for a higher learning
rate for the mean and therefore a quicker entropy reduction. Note that reducing entropy
even more quickly leads to premature convergence, while increasing the bound for the original
MORE leads to divergence of the algorithm. This problem is alleviated by introducing the adaptive entropy schedule
that first quickly reduces entropy, then plateaus, and, finally, quickly reduces entropy again,
resulting in an accelerated optimization process.
5. Model Learning
In this section, we introduce a simple but effective approach of learning a quadratic model
based on polynomial ridge regression using the method of least squares as replacement for
the more complex Bayesian dimensionality reduction technique introduced in the original
paper (Abdolmaleki et al., 2015). We design the model learning with two objectives in mind.
First, it should be data- and time-efficient to allow for quick execution of the algorithm.
Additionally, it should be robust towards outliers and samples from very low entropy regimes.
To this end, we apply a series of data pre-processing techniques, re-use old samples, and start
learning a model with a lower complexity and increase it once sufficient data is available.
where Φ is the design matrix whose rows are given by a feature transformation φ(x), i.e.,
$\Phi = [\phi(x_1), \dots, \phi(x_Q)]^T$, y is a vector containing the fitness evaluations $y_i = f(x_i)$, and λ is
the regularization factor. The solution is given by the well-known ridge regression estimator
$$\hat{w} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T y.$$
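For concreteness, a minimal version of this ridge estimator could look as follows; it is a sketch only, and the default regularization value is an assumption for the example rather than the paper's setting.

```python
import numpy as np

def ridge_estimator(Phi, y, lam=1e-8):
    """Ridge regression: solve (Phi^T Phi + lam * I) w = Phi^T y without an explicit inverse."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
```

Solving the regularized normal equations directly avoids forming the matrix inverse explicitly, which tends to be numerically more stable.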
Figure 3: This figure shows the effects of data pre-processing on the optimization. We plot
the function value over the course of optimization for a Rosenbrock (a) and Attractive Sector
(b) function in 15 dimensions and turn off whitening and robust target normalization with
clipping, respectively. We see that whitening becomes important for low entropy regimes
near the optimum, while target normalization is especially important in case of outliers in
the function values as can be seen for the Attractive Sector function.
We empirically found that we require at least 10% more samples than the model has
parameters, i.e., we start with a linear model once |D| ≥ 1.1(1 + n) and continue with a diagonal
model once |D| ≥ 1.1(1 + 2n). Finally, we switch to estimating a full quadratic model once
|D| ≥ 1.1(1 + n(n + 3)/2).
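This scheduling of the model complexity can be written down directly; the sketch below is our own illustration of the rule stated above, with the parameter counts of the linear, diagonal, and full quadratic models made explicit.

```python
def model_complexity(num_samples, n, margin=1.1):
    """Select the surrogate complexity from the dataset size |D| (10% margin rule above)."""
    if num_samples >= margin * (1 + n * (n + 3) / 2):
        return "full quadratic"      # 1 + n + n(n+1)/2 parameters
    if num_samples >= margin * (1 + 2 * n):
        return "diagonal quadratic"  # bias, linear, and squared terms: 1 + 2n parameters
    if num_samples >= margin * (1 + n):
        return "linear"              # bias and linear terms: 1 + n parameters
    return "insufficient data"
```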
Estimating the model parameters can be numerically difficult with function values spanning
several orders of magnitude, outliers, and very low variances of input and output values near
the optimum. Therefore, we perform data pre-processing on the input values x, as well as
the design matrix Φ and the target values y before solving the least squares problem. In
particular, we perform whitening of the inputs x, standardization of the design matrix Φ
and a special form of standardization of the target values y and construct the normalized
data set Dw . We demonstrate the effect of data pre-processing in Figure 3.
$$x_w = \bar{C}_D^{-1} (x - \bar{x}_D) \quad \forall\, x \in D,$$
where $\bar{x}_D$ is the empirical mean and $\bar{C}_D$ is the Cholesky factorization of the empirical
covariance of the data-set. The parameters $\hat{\beta}_w$ learned with the whitened data-set to be
$$y_w = (y - \bar{y}_D)/s_D \quad \forall\, y \in D$$
of the target values, where $\bar{y}_D$ is the empirical mean and $s_D$ is the empirical standard
deviation of the target values in D.
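A minimal sketch of these two pre-processing steps, whitening of the inputs and standardization of the targets, is given below; it follows the formulas above, while the standardization of the design matrix Φ itself is omitted for brevity.

```python
import numpy as np

def whiten_inputs(X):
    """x_w = C_bar^{-1} (x - x_bar) with C_bar the Cholesky factor of the empirical covariance."""
    x_bar = X.mean(axis=0)
    C_bar = np.linalg.cholesky(np.cov(X, rowvar=False))
    X_w = np.linalg.solve(C_bar, (X - x_bar).T).T
    return X_w, x_bar, C_bar

def standardize_targets(y):
    """y_w = (y - y_bar) / s_D, plain standardization of the fitness values."""
    return (y - y.mean()) / y.std()
```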
Robust Target Normalization Additionally, we propose an adaptive iterative normal-
ization and clipping scheme based on the excess kurtosis of the target values resulting in
a normalization process more robust towards outliers. Given the data-set in a particular
iteration, we first perform standardization like above, then treat all values outside the interval
[−vclip , vclip ] as outliers. Afterwards, we look at the excess kurtosis of the standardized
targets in the interval (−vclip , vclip ). If it is high, the data is still compacted to a small
interval while outliers at the borders negatively influence the least squares optimization.
In order to evenly spread the data, we repeat the procedure (normalization and clipping),
until the excess kurtosis falls below a threshold or the standard deviation of the remaining
values is close to 1. After the procedure, we replace the negative and positive outliers with
the minimum and maximum value of the remaining targets, respectively. We illustrate the
procedure in Figure 4, pseudo-code of the approach can be found in Appendix B. While this
effectively changes the optimization objective, it is not as severe as replacing all function
values with a ranking and, thus, the update direction still corresponds to an approximated
natural gradient. An exact quantification of the error is subject to future work.
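The following sketch illustrates one possible implementation of this robust normalization loop; it is our own reading of the procedure (pseudo-code is given in Appendix B), and the clipping interval and kurtosis threshold default to the values used in the illustration of Figure 4.

```python
import numpy as np
from scipy.stats import kurtosis  # Fisher definition, i.e., excess kurtosis

def robust_target_normalization(y, v_clip=3.0, kurt_thresh=0.55, max_iter=10):
    """Iteratively standardize the targets and clip outliers until the excess kurtosis
    of the inlier targets falls below the threshold (or their std is close to 1)."""
    y = np.asarray(y, dtype=float).copy()
    inlier = np.ones_like(y, dtype=bool)
    for _ in range(max_iter):
        mean = y[inlier].mean()
        std = max(y[inlier].std(), 1e-12)
        y = (y - mean) / std                     # standardize using the inlier statistics
        inlier = np.abs(y) < v_clip              # values outside [-v_clip, v_clip] are outliers
        if kurtosis(y[inlier]) < kurt_thresh or np.isclose(y[inlier].std(), 1.0):
            break
    # Replace negative/positive outliers with the extreme values of the remaining targets.
    y[y <= -v_clip] = y[inlier].min()
    y[y >= v_clip] = y[inlier].max()
    return y
```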
6. Experiments
In this section, we evaluate the performance of CAS-MORE5 in the black-box stochastic
search and episodic reinforcement learning setting. While we use the former to highlight the
applicability over a wide range of objective functions using as few samples as possible, it is
the episodic reinforcement learning domain where the core strengths of the algorithm come
into play. We start by looking at the performance in terms of sample efficiency on a set of
5. An implementation of the algorithm can be found under https://github.com/ALRhub/cas-more
[Figure 4 panels: histograms of the standardized targets after each level of recursion, with excess kurtosis 11.57, 12.37, 6.61, 2.10, 1.40, 0.56, 0.00, followed by the final clipped result.]
Figure 4: This figure illustrates the robust target normalization scheme on 100 synthetic data
points, generated by a reward function y = −0.5xT x. We sample x from a two dimensional
multivariate normal distribution with zero mean and unit variance and pick 20 random y
values and add Gaussian noise with a standard deviation of 10. Plots from top left to bottom
right show histograms of the data after standardization of the input in each level of recursion
of the procedure. The excess kurtosis is measured on the black data points in the interval
(-3, 3) and only these data points are recursively treated again until the excess kurtosis of
the clipped data is below the threshold of 0.55. Finally, the bottom right histogram shows
the output with the red data points clipped to the minimum and maximum values of the
remaining data points. Note that in each plot the standardized data has zero mean and
unit variance, but the model quality suffers from the outliers present.
black-box optimization benchmark functions provided by Hansen et al. (2021) and compare
CAS-MORE to the original formulation of MORE, as well as competitors such as CMA-ES
(Hansen, 2016) and XNES (Wierstra et al., 2014). Afterwards, we run our algorithm on a
suite of different episodic reinforcement learning tasks from the domain of robotics where
the fitness function evaluation is inherently noisy and we also care about the quality of the
generated samples (e.g., the sample evaluations should not damage the robot) instead of
plainly looking at the fitness evaluation at the mean of the search distribution. We compare
against state of the art episodic and step-based reinforcement learning algorithms.
multi-modal functions such as the Rastrigin function, where MORE often focuses on a local
optimum. Here, the solution is to draw more samples in each iteration to improve the
estimated model. For more results on the BBOB suite we refer to Appendix D. Typically,
these black-box optimization benchmarks also focus purely on the fitness evaluation of the
mean of the distribution and disregard the quality of the samples. While CAS-MORE is
comparable to state-of-the-art black-box optimizers such as CMA-ES in these domains, the
real benefits of MORE appear if we consider problems with noisy fitness evaluations and
whenever we care also about the quality of the generated samples. This is for example the
case in robot reinforcement learning problems which are discussed in the next sub-section.
Our experiments aim to showcase the benefits of CAS-MORE over other black-box optimiza-
tion techniques in the episodic RL setting, as well as deep RL algorithms for the step-based
RL setting. To this end, we compare CAS-MORE to the original MORE algorithm that
uses the dimensionality reduction model, an intermediate algorithm labeled as MORE-LS
comprising the same joint KL and entropy constraint as in MORE and the simplified
model learning approach as in CAS-MORE, and CMA-ES, representing black-box stochastic
search optimizers. Furthermore, we compare to policy gradient based episodic reinforcement
learning using movement primitives with and without trust regions (Otto et al., 2023)
labeled as BBRL-TRPL and BBRL-PPO, respectively. Note that these algorithms have
been developed for contextual episodic RL, which allows learning a deep neural network
mapping from a context vector, which describes the task, to a motion primitive parameter
vector, which describes the corresponding robot motion. In this work, we concentrate on the
non-contextual case where we only have to learn a single Gaussian distribution. While this
case is arguably simpler, we show that CAS-MORE significantly outperforms trust-region
policy gradient based methods, showing the benefits of a closed form natural gradient update.
Finally, we compare with contemporary step-based deep RL algorithms such as PPO, SAC,
and TD-3. We run step-based PPO using approximately the same number of transitions as
the episode-based algorithms. For SAC and TD-3, we are limited by a 24 hour compute
budget and therefore report the performance after this budget is exceeded.
In our experiments, we analyze the performance of the mean of the search distribution, as
well as other performance indicators such as success rates. The experiments are designed to
be non-contextual with unchanged starting configurations of the agent in a noisy environment.
For the episodic RL experiments, we use Probabilistic Movement Primitives (Paraschos et al.,
2013) to plan trajectories. A PD-controller provides the torques necessary to follow them.
For the step-based deep RL algorithms, we provide proprioceptive feedback in form of the
joint angles and joint velocities, as well as a normalized time-step to the agent. The policy
directly outputs the torques applied to the joints. For the BBRL and deep RL experiments,
we orient ourselves towards the hyper-parameters used in Otto et al. (2023). They are
provided in Appendix C. For more careful exploration, we set $\epsilon_\mu = 0.05$ and $\epsilon_\Sigma = 0.005$ and
draw 64 samples per iteration for the episodic RL algorithms. Since we do not operate with
a very low number of new samples per iteration and the reward functions do not produce
large outliers, we can use the standard target normalization allowing for more exact natural
[Figure 5 panels: bootstrapped ECDFs aggregated over all functions and function subgroups (first two rows), and for the individual functions f2 Ellipsoid separable, f9 Rosenbrock rotated, f12 Bent cigar, f16 Weierstrass, f23 Katsuuras, and f6 Attractive sector in 20-D; each panel plots the fraction of function-target pairs over log10(# f-evals / dimension).]
Figure 5: This figure shows the bootstrapped empirical cumulative distribution of the
number of objective function evaluations divided by dimension (FEvals/DIM) for 51 targets
with target precision in $10^{[-8..2]}$ in 20-D, representing the percentage of targets achieved
over the number of function evaluations. The results are averaged over 15 instances of each
function. As reference algorithm, the best algorithm from BBOB 2009 is shown as light thick
line with diamond markers. The first two rows show the aggregated results for all functions
and subgroups of functions. Additionally, we show the results of individual functions in the
third and fourth row. Big thin crosses indicate the used budget median for the respective
algorithm which is 10 000 n function evaluations for each trial of CAS-MORE. Runtimes to
the right of the cross are based on simulated restarts and are used to determine a runtime
for unsuccessful runs (Hansen et al., 2016c).
Figure 6: This figure shows an illustration of the hole reacher task. The robot arm starts
upright and smoothly reaches down the narrow hole. The policy is learned using CAS-MORE
and shows a typical behavior that keeps a safe distance to the ground.
gradient updates. We report the results in terms of the interquartile mean (IQM) with a
95% stratified bootstrap confidence interval (Agarwal et al., 2021).
The first task is an advanced version of the holereaching task from Abdolmaleki et al. (2015)
that aims to show the difference between step-based and episodic reward formulations and
the benefits of maximizing the expected fitness. The end-effector of a 5-link planar robot arm
has to reach into a narrow hole without colliding with the ground or the walls of the hole.
The arm is controlled in joint-space by applying torques to each of the 5 joints. Each link
has a length of 1 m, the hole has a width of 20 cm, a depth of 1 m and is located 2 m away
from the robot’s base. For an illustration of a successful episode, see Figure 6. While the
simulation of the arm is rather crude, this experiment has other properties that emphasize
certain shortcomings of step-based RL, as well as rank-based optimization techniques. These
properties include two distinct regions of exploration where above ground level, the agent
can explore freely, while below the ground level even slight errors lead to a collision with the
walls of the hole. Additionally, the depth of the hole, which is a main contributor to the
reward function, is subject to noise in the noisy environment.
For the episodic RL setting, we learn the parameters of a ProMP with 3 basis functions,
resulting in 15 parameters to be optimized. An episode consists of 200 time steps and
in each iteration of the learning algorithm, we draw 64 new samples for the episodic RL
algorithms. The cost function of the task is composed of two stages. First, the distance of
the end-effector to the entrance of the hole is minimized, subsequently, the reward increases
the further down the end-effector reaches. The reward is scaled down with a constant
factor if the robot touches the ground. Additional penalty costs for acceleration ensure a
smooth and energy-efficient trajectory. This reward function is different from the original
MORE paper and is designed to highlight the risk awareness of the tested algorithms as
the reward increases towards the bottom of the hole and then suddenly drops. The exact
reward function can be found in Appendix E.1.
We examine 4 different settings of this experiment where we highlight the consequences
of each choice on the learning process, namely
where in the noisy environment the perception of the depth of the hole is subject to noise6 .
The difficulty of this task lies in the two distinct sectors above and below the ground level.
While above ground level, the agent can explore without negative consequences, it has to be
much more careful once navigating into the narrow hole. The results of these experiments
can be found in Figure 7. We plot the performance at the mean of the search distribution
in the left column, the expected performance under the search distribution in the center left
column, the percentage of samples that collided with the ground in the center right column,
and the distance of the end-effector to the bottom of the hole at the end of the episode in
the right column. The result is averaged over 20 seeds for the episodic RL algorithms and
10 seeds for the step-based RL algorithms.
Dense in time reward We first examine the results of the experiment in the noise-free
environment with dense in time reward, the usual setting of current deep RL literature, in
the first row of Figure 7. When looking at the second column, we can see that CAS-MORE
optimizes towards a stable solution reaching into the ground while avoiding collisions with the
ground. While CMA-ES finds a solution that comes closer to the bottom of the hole (right
column) leading to a higher performance at the mean, it often draws samples that collide
with the walls during the optimization as can be seen in the center left column. Compared
to the original MORE algorithm, MORE-LS using a model based on our newly introduced
model learning approach leads to a similar performance as CAS-MORE. BBRL-TRPL also
performs similarly to CAS-MORE, yet, at the cost of much higher computation time. For
a comparison of run times, see also Table 1. PPO optimizes towards fast movement of
the end-effector in the direction of the entrance of the hole but fails to consistently reach
into it. When looking at individual learning curves, we found that it often finds policies
that successfully reach into the hole but soon after forgets this solution again. SAC, TD-3,
BBRL-PPO, and also the original MORE with the dimensionality reduction model fail to
find policies that reach into the ground in our experiments.
Next, we look at the results of the experiments in the noisy environment in the second
row of Figure 7. Compared to the noise-free experiment, we can see the biggest impact on
CMA-ES. CMA-ES again optimizes very fast towards the ground of the hole but due to the
noise it quickly ends up in a solution that collides with the ground. CAS-MORE, MORE-LS
and BBRL-TRPL on the other hand are hardly influenced by the noise in the environment
and still provide solutions that reach into hole, albeit maintaining a slightly larger distance
6. In the beginning of the episode, we sample the depth of the hole uniformly from (1.00 ± 0.02) m.
to the ground compared to the noise-less case. The results of the rest of the algorithms stay
more or less unchanged as they never get close to the ground.
Sparse in time reward As noted in Otto et al. (2023), a dense reward leads to a fast
movement and overall poor energy efficiency. Thus, a better task description for this task
is to only provide a distance-based reward in the last time-step and penalize high action
values in each time-step. Again, we first examine the noise-free environment results which
can be found in the third row of Figure 7. Even without noise, the results of CMA-ES
become inconsistent over individual seeds, while CAS-MORE consistently optimizes towards
well performing distributions. We assume this is the case as the distance between the
initialization of the movement primitive parameters and a well performing solution is quite
large. MORE-LS and BBRL-TRPL still find good solutions but perform slightly worse as
can be seen from the distance of the end-effector to the ground in the last column. All
step-based algorithms now fail to find good policies.
Last, we look at the results of the experiments in the noisy environment in the last
row of Figure 7. We can now see the full potential of optimizing the expected reward and
individual updates of the mean and covariance of the search distribution, as CAS-MORE
finds the policies with the best performance at the mean as well as the best performing
samples during the optimization process, with MORE-LS and BBRL-TRPL slightly below
them. CMA-ES now consistently produces policies that collide with the ground. Results for
all other algorithms remain poor.
Figure 7: This figure shows the results of the hole-reacher experiment. Plotted is the
expected performance of the mean in the left column, the mean of episode returns of the
samples drawn in each iteration in the center left column, the percentage of trajectories
that led to collisions with the ground in the center right column, and the distance of the
end-effector at the end of the trajectory in the right column. The first row contains the
results for the noise-less experiment with dense step-based reward, while the second row
shows the results of the noisy experiment with dense step-based reward. The third and
fourth row show the results for the noiseless and noisy experiments using a sparse episodic
reward, respectively.
[Figure 9 panels: reward, expected reward, and success rate over environment interactions; legend: CAS-MORE, MORE, MORE-LS, CMA-ES, BBRL-TRPL, BBRL-PPO, PPO, SAC, TD-3.]
Figure 9: This figure shows the results of the table tennis experiment. Plotted is the expected
performance of the mean in the left column, the mean of episode returns of the samples
drawn in each iteration in the center column, and the percentage of trajectories that led to
ball landing points within the last 10cm of the table in the right column.
initialized as zero, so that the robot needs to learn the movement from scratch. The reward
function is composed of three stages. First, the distance between the ball and the racket is
minimized to enforce hitting of the ball. Second, a term is added to minimize the distance
between the landing point of the ball and the far edge of the table. Once the ball lands
on the table, the reward increases with the distance of the ball. Note, that different from
the hole-reacher task, the reward for this task is non-Markovian as it depends on the
whole trajectory (Otto et al., 2023). In order to provide a step-based reward for the
step-based RL algorithms, we run the simulation until the racket touches the ball while only
returning the action cost. The rest of the episode is then simulated without agent interaction
and presented to the learning algorithm as one last time-step with the task reward as in Otto
et al. (2023). For more details on the environment, we refer the reader to Appendix E.2.
The reward function differs from Otto et al. (2023) to showcase the risk awareness of the
optimization algorithms, as the reward drops as soon as the ball flies past the table.
The results of the experiment are presented in Figure 9. We can again see that CAS-
MORE and MORE using the novel model learning approach produce the best results, this
time with a distinct advantage over both the step-based RL algorithms PPO, TD3, and SAC
and the gradient-based episodic RL algorithms BBRL-TRPL and BBRL-PPO. Only
CAS-MORE is able to consistently return around 90% of the balls into the target zone
towards the edge of the table. CMA-ES and BBRL TRPL only produce distributions that
are able to return around 60% of the balls.
Figure 11: This figure shows the results of the beerpong experiment. Plotted is the expected
performance of the mean in the left column, the mean of episode returns of the samples
drawn in each iteration in the center column, and the percentage of successful throws where
the ball landed in the cup in the right column.
6.2.3 Beerpong
We use the same simulated robot as before to throw a ball towards a table where it first
needs to bounce at least once and then fly into a cup. The task is depicted in Figure 10.
As the robot has no actuated end-effector and the ball is directly attached to the robot’s last
joint, we only need to actuate the first 6 joints, resulting in a 12 dimensional search space
for the episodic RL algorithms. In this experiment, the ball release is noisy and designed in
a way that it is not possible to find a conservative strategy. The noise will always cause
some trials to miss the cup. The results can be found in Figure 11.
Overall, CAS-MORE again provides the best trade-off between convergence speed and
task performance. Interestingly, the original MORE performs better than MORE using
the least squares model, while the step-based reinforcement learning algorithms fail to find any
good strategy. This experiment also shows the benefits of trust-region optimization, as
BBRL-PPO starts to approach a good strategy but then fails due to, presumably, too
aggressive policy updates.
Figure 12: A successful episode of the hopper jump task learned by CAS-MORE. At first,
the agent fully bends and then releases into a fully extended jump. This way, it can reach a
safe height not risking contact with the box’s edge.
Figure 13: This figure shows the results of the hopper jump experiment. Plotted is the
expected performance of the mean in the left column, the mean of episode returns of the
samples drawn in each iteration in the center column, and the success rate in the right column.
7. Conclusion
In this paper, we presented CAS-MORE, a new version of Model-based Relative Entropy
Stochastic Search, based on a coordinate ascent strategy on the mean and covariance of
the search distribution and an adaptive entropy schedule. Additionally, we provide an
improved model fitting process that is computationally more efficient and show how to deal
with objective functions that produce large outliers. We show how the parameter updates
follow the direction of the natural gradient and, as the model estimation is not based on
rankings of objective function values, we truly optimize the expected fitness under the search
distribution. The result is a robust, risk-aware optimization algorithm which we show to outperform
state-of-the-art algorithms such as CMA-ES on stochastic search tasks as well as on episodic and
step-based reinforcement learning tasks, especially noisy episodic reinforcement learning
tasks.
Limitations and Future Work Arguably the biggest limitation of CAS-MORE is the
restriction to the non-contextual setting. However, we believe that CAS-MORE still is
a valuable tool in the prototyping stage of a more complex contextual problem, or if the
problem at hand is non-contextual. Here, the robustness and speed of the algorithm provide
a large benefit over more complex algorithms. In addition, the algorithm can easily be
extended to low-dimensional context spaces using a linear mapping from the context to the mean.
Acknowledgments
Before we solve the optimization problems, we will first state the closed-form solution of
the objective and KL-divergence using a quadratic model under multi-variate Gaussian
distributions. The solution to the objective is given by
$$\int_x \pi(x)\hat{f}(x)\,\mathrm{d}x = -\frac{1}{2}\mu^T A \mu - \frac{1}{2}\operatorname{tr}(A\Sigma) + \mu^T a + a_0,$$
and the KL divergence between the Gaussian search distributions is given by
$$\mathrm{KL}(\pi \,\|\, \pi_t) = \frac{1}{2}\Big( (\mu_t - \mu)^T \Sigma_t^{-1} (\mu_t - \mu) + \operatorname{tr}(\Sigma_t^{-1}\Sigma) - k + \log |\Sigma_t| - \log |\Sigma| \Big).$$
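These closed forms are easy to sanity-check numerically; the short sketch below (an illustration only, with arbitrary example values) evaluates both expressions and compares the expected model value against a Monte-Carlo estimate.

```python
import numpy as np

def expected_model_value(mu, Sigma, A, a, a0):
    """Closed form of E_pi[f_hat(x)] for f_hat(x) = -0.5 x^T A x + a^T x + a0."""
    return -0.5 * mu @ A @ mu - 0.5 * np.trace(A @ Sigma) + mu @ a + a0

def gaussian_kl(mu, Sigma, mu_t, Sigma_t):
    """KL(pi || pi_t) between multivariate Gaussians, k being the dimensionality."""
    k = mu.shape[0]
    Prec_t = np.linalg.inv(Sigma_t)
    diff = mu_t - mu
    return 0.5 * (diff @ Prec_t @ diff + np.trace(Prec_t @ Sigma) - k
                  + np.log(np.linalg.det(Sigma_t)) - np.log(np.linalg.det(Sigma)))

# Monte-Carlo sanity check of the expected model value.
rng = np.random.default_rng(1)
mu, Sigma = np.array([1.0, -2.0]), np.array([[2.0, 0.3], [0.3, 1.0]])
A, a, a0 = np.array([[3.0, 0.5], [0.5, 1.0]]), np.array([0.2, -0.1]), 0.7
X = rng.multivariate_normal(mu, Sigma, size=200_000)
mc = np.mean(-0.5 * np.einsum('ni,ij,nj->n', X, A, X) + X @ a + a0)
print(expected_model_value(mu, Sigma, A, a, a0), mc)  # the two values should be close
```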
After rearranging terms, the dual problem for the mean is given by
$$g_\mu(\lambda) = \lambda\epsilon_\mu + \frac{1}{2}\Big( m_\mu(\lambda)^T M_\mu(\lambda)^{-1} m_\mu(\lambda) - \lambda\, m_t^T M_t^{-1} m_t \Big),$$
with $M_t = \Sigma_t^{-1}$ and $m_t = \Sigma_t^{-1}\mu_t$.
After rearranging terms again, the dual for the covariance is given in terms of
$$\Lambda(\nu) = \frac{\nu\Sigma_t^{-1} + A}{\nu}.$$
Appendix C. Hyper-Parameters
We provide default hyper-parameters for CAS-MORE in terms of the problem dimensionality
n in Table 2 which we empirically found to work well over all benchmark functions. For
Table 2: Empirically found default hyper-parameters for CAS-MORE based on the problem
dimensionality n.
some functions, a higher bound on the mean often leads to quicker convergence but may
result in divergence for others.
Table 3 provides the chosen hyper-parameters for the step-based deep RL experiments
(PPO, SAC, TD3), as well as the BBRL experiments. Note that samples per iteration
corresponds to tuples $(s, a, r, s')$ in the step-based case, and to whole trajectories in the BBRL
case.
hidden layers          [256, 256]   [256, 256]   [256, 256]   n.a.   n.a.
hidden layers critic   [256, 256]   [256, 256]   [256, 256]   n.a.   n.a.
hidden activation      tanh         ReLU         ReLU         n.a.   n.a.
initial std            0.6          1.0          0.1          1.0    1.0
divided by the number of trials that actually reached $f_t$ (Hansen et al., 2012; Price, 1997).
Statistical significance is tested with the rank-sum test for a given target $\Delta f_t$ using, for
each trial, either the number of needed function evaluations to reach $\Delta f_t$ (inverted and
multiplied by −1), or, if the target was not reached, the best $\Delta f$-value achieved, measured
only up to the smallest number of overall function evaluations for any unsuccessful trial
under consideration.
Appendix E. Episodic RL
In this section, we provide additional information and reward functions for the episodic RL
tasks. The action cost for all tasks is given by
$$\tau_t = \sum_{i=1}^{K} (a_t^i)^2.$$
E.1 Holereaching
The reward function is composed of two phases. In the first phase, the distance $d_g = \|p_t - p_g\|$
of the position of the end-effector $p_t = (x_{ee,t}, y_{ee,t})$ to a goal point $p_g = (2, -0.1)$ below the
entrance of the hole is minimized. Afterwards, the reward scales linearly with the absolute
value $|y_{ee,t}|$ of the end-effector's y-coordinate. The task ends if either t = 200 or a collision happens. In the
deterministic case, the hole has a depth of 1 m, while in the stochastic case, the depth is
sampled uniformly from U(0.98 m, 1.02 m). The task reward is given by
$$R_{\text{task}} = \begin{cases} c_{\text{coll}} \exp(-d_g) & \text{if a collision happened or } t = T, \\ \exp(-d_g) & \text{if } y_{ee,t} \geq 0 \text{ and no collision happened}, \\ 1 + |y_{ee,t}| & \text{if } y_{ee,t} < 0 \text{ and no collision happened}, \end{cases}$$
and the dense-in-time reward is
$$r_t = R_{\text{task}} - 0.01\,\tau_t.$$
Sparse reward. The sparse reward only returns the task reward in the terminal time-step
T, which can either be the last time-step of the episode or the time-step where a collision
happens, and is given by
$$r_t = \begin{cases} -0.001\,\tau_t & \text{if } t < T, \\ R_{\text{task}} - 0.001\,\tau_t & \text{if } t = T. \end{cases}$$
to convert large distances to a value of 0 and small distances to a value of 1. The task
reward is given as
$$R_{\text{task}} = \begin{cases} 0.2\,\rho(d_{b,r}) & \text{if the ball was not hit}, \\ 0.5\,\rho(d_{b,r}) + \rho(d_{b,g}) & \text{if the ball was hit but landed on the floor}, \\ 2\,\rho(d_{b,r}) + 3\,\rho(d_{l,e}) & \text{if the ball was hit, landed on the table and } p_{l,x} \geq -1.25, \\ 5\,\rho(d_{b,r}) + 10\,|p_{l,x}| & \text{if the ball was hit, landed on the table and } p_{l,x} < -1.25, \end{cases}$$
and we consider a trajectory to be successful if the ball lands within 10 cm of the opponent's
side edge of the table. We use the sparse-in-time reward
$$r_t = \begin{cases} -0.001\,\tau_t & \text{if } t < T, \\ R_{\text{task}} - 0.001\,\tau_t & \text{if } t = T. \end{cases}$$
E.3 Beerpong
In the table beerpong task, the episode starts with the ball attached to the end-effector of
the arm. In order to add stochasticity to the environment, we add noise to the velocity at a
pre-defined fixed time-step of the release of the ball. For a successful throw, the ball first
needs to bounce once on the table and then land inside the cup. The reward function is
again non-Markovian and is based on the final distance $d_{b,c,T} = \|p_{b,T} - p_c\|$, and on the
minimum distance $d_{b,c} = \min_t \|p_{b,t} - p_c\|$ between the ball position $p_{b,t}$ and the center of
the cup $p_c$ within an episode. With the same transform ρ(x) as before, the task reward is
given by
$$R_{\text{task}} = \begin{cases} 0.2\,\rho(d_{b,c}) + 0.1\,\rho(d_{b,c,T}) & \text{if the ball first has contact with anything but the table}, \\ \rho(d_{b,c}) + 0.5\,\rho(d_{b,c,T}) & \text{if the ball first has contact with the table but is not in the cup}, \\ \rho(d_{b,c}) + 2\,\rho(d_{b,c,T}) + 1 & \text{if the ball is in the cup, without contact with the table}, \\ \rho(d_{b,c}) + 2\,\rho(d_{b,c,T}) + 3 & \text{if the ball is in the cup and had contact with the table before}. \end{cases}$$
position $p_f$ and the top center of the box $p_c$. The individual reward terms are given as
$$R_{\text{height}} = \begin{cases} 5 h_{\max} & \text{if } h_{\max} < 2, \\ 10 + h_T & \text{if } h_{\max} \geq 2 \text{ and the agent did not fall after landing on the box}, \\ 2 h_{\max} & \text{if the agent is on the box but falling over}, \end{cases}$$
$$R_{\text{dist}} = \begin{cases} -5 & \text{if } h_{\max} < 2, \\ -5 d_{f,c,T} & \text{if } h_{\max} \geq 2, \end{cases}$$
$$R_{\text{min dist}} = -2 d_{f,c},$$
$$R_{\text{healthy}} = \begin{cases} 1 & \text{if } z_T \in [0.7, \infty],\ \phi \in [-\infty, \infty] \text{ and } \theta, \dot{\theta} \in [-100, 100], \\ 0 & \text{else}, \end{cases}$$
$$R_{\text{box}} = \begin{cases} 5 & \text{if the agent is on the box and } h_T > 1.6, \\ 0 & \text{else}, \end{cases}$$
$$R_{\text{box vel}} = \begin{cases} -\min(10 v_x^2, 1) & \text{if the agent is on the box}, \\ 0 & \text{else}. \end{cases}$$
[Figure 15 panels: one plot per BBOB function (f5 Linear slope, f6 Attractive sector, f7 Step-ellipsoid, f8 Rosenbrock original, f9 Rosenbrock rotated, f10 Ellipsoid, f11 Discus, f12 Bent cigar, f13 Sharp ridge, f14 Sum of different powers, f15 Rastrigin, f16 Weierstrass, f17/f18 Schaffer F7, f19 Griewank-Rosenbrock F8F2, f20 Schwefel, f21/f22 Gallagher peaks, f23 Katsuuras, f24 Lunacek bi-Rastrigin), comparing CAS-MORE, MORE, CMA-ES, and xNESas; see the caption below.]
Figure 15: Expected running time (ERT in number of f -evaluations as log10 value), divided
by dimension for target function value 10−8 versus dimension. Slanted grid lines indicate
quadratic scaling with the dimension. Different symbols correspond to different algorithms
given in the legend of f1 and f24 . Light symbols give the maximum number of function
evaluations from the longest trial divided by dimension. Black stars indicate a statistically
better result compared to all other algorithms with p < 0.01 and Bonferroni correction for the
number of dimensions (six). Legend: ◦: CAS-MORE, ♦: MORE, ?: CMA-ES, O: xNESas.
References
Abbas Abdolmaleki, Rudolf Lioutikov, Jan R Peters, Nuno Lau, Luis Paulo Reis, and Gerhard
Neumann. Model-based relative entropy stochastic search. In C. Cortes, N. Lawrence,
D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing
Systems, volume 28. Curran Associates, Inc., 2015.
Abbas Abdolmaleki, Jost Tobias Springenberg, Jonas Degrave, Steven Bohez, Yuval Tassa,
Dan Belov, Nicolas Heess, and Martin Riedmiller. Relative entropy regularized policy
iteration. arXiv preprint arXiv:1812.02256, 2018a.
Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess,
and Martin Riedmiller. Maximum a posteriori policy optimisation. In International
Conference on Learning Representations, 2018b.
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc
Bellemare. Deep reinforcement learning at the edge of the statistical precipice. Advances
in neural information processing systems, 34:29304–29320, 2021.
Youhei Akimoto, Yuichi Nagata, Isao Ono, and Shigenobu Kobayashi. Bidirectional relation
between cma evolution strategies and natural evolution strategies. In International
Conference on Parallel Problem Solving from Nature, pages 154–163. Springer, 2010.
Riad Akrour, Abbas Abdolmaleki, Hany Abdulsamad, Jan Peters, and Gerhard Neumann.
Model-free trajectory-based policy optimization with monotonic improvement. The Journal
of Machine Learning Research, 19(1):565–589, 2018.
Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):
251–276, 1998.
Brandon Amos and Denis Yarats. The differentiable cross-entropy method. In International
Conference on Machine Learning, pages 291–302. PMLR, 2020.
Oleg Arenz, Gerhard Neumann, and Mingjun Zhong. Efficient gradient-free variational
inference using policy search. In International conference on machine learning, pages
234–243. PMLR, 2018.
Anne Auger and Nikolaus Hansen. A restart cma evolution strategy with increasing
population size. In 2005 IEEE congress on evolutionary computation, volume 2, pages
1769–1776. IEEE, 2005.
Anne Auger, Marc Schoenauer, and Nicolas Vanhaecke. Ls-cma-es: A second-order algorithm
for covariance matrix adaptation. In International Conference on Parallel Problem Solving
from Nature, pages 182–191. Springer, 2004.
Philipp Becker, Oleg Arenz, and Gerhard Neumann. Expected information maximization:
Using the i-projection for mixture density estimation. In International Conference on
Learning Representations, 2019.
Zdravko I Botev, Dirk P Kroese, Reuven Y Rubinstein, and Pierre L’Ecuyer. The cross-
entropy method for optimization. In Handbook of statistics, volume 31, pages 35–59.
Elsevier, 2013.
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie
Tang, and Wojciech Zaremba. Openai gym, 2016.
Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. A survey on policy search for
robotics. Foundations and trends in Robotics, 2(1-2):388–403, 2013.
Felix End, Riad Akrour, Jan Peters, and Gerhard Neumann. Layered direct policy search
for learning hierarchical skills. In 2017 IEEE International Conference on Robotics and
Automation (ICRA), pages 6442–6448. IEEE, 2017.
Tobias Glasmachers, Tom Schaul, Sun Yi, Daan Wierstra, and Jürgen Schmidhuber. Ex-
ponential natural evolution strategies. In Proceedings of the 12th annual conference on
Genetic and evolutionary computation, pages 393–400, 2010.
Nikolaus Hansen. The cma evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772,
2016.
Nikolaus Hansen. A global surrogate assisted cma-es. In Proceedings of the Genetic and
Evolutionary Computation Conference, pages 664–672, 2019.
Nikolaus Hansen, Steffen Finck, Raymond Ros, and Anne Auger. Real-Parameter Black-Box
Optimization Benchmarking 2009: Noiseless Functions Definitions. Research Report
RR-6829, INRIA, 2009b. URL https://hal.inria.fr/inria-00362633.
Nikolaus Hansen, Tea Tušar, Olaf Mersmann, Anne Auger, and Dimo Brockhoff. Coco: The
experimental procedure. arXiv preprint arXiv:1603.08776, 2016c.
Nikolaus Hansen, Anne Auger, Raymond Ros, Olaf Mersmann, Tea Tušar, and Dimo
Brockhoff. Coco: A platform for comparing continuous optimizers in a black-box setting.
Optimization Methods and Software, 36(1):114–144, 2021.
Verena Heidrich-Meisner and Christian Igel. Hoeffding and bernstein races for selecting
policies in evolutionary direct policy search. In Proceedings of the 26th Annual International
Conference on Machine Learning, pages 401–408, 2009.
Jemin Hwangbo, Christian Gehring, Hannes Sommer, Roland Siegwart, and Jonas Buchli.
Rock∗—efficient black-box optimization for policy learning. In 2014 IEEE-RAS Interna-
tional Conference on Humanoid Robots, pages 535–540. IEEE, 2014.
Oscar Ibáñez, Lucia Ballerini, Oscar Cordón, Sergio Damas, and José Santamaría. An experi-
mental study on the applicability of evolutionary algorithms to craniofacial superimposition
in forensic identification. Information Sciences, 179(23):3998–4028, 2009.
Jeffrey Larson, Matt Menickelly, and Stefan M Wild. Derivative-free optimization methods.
Acta Numerica, 28:287–404, 2019.
Zhenhua Li and Qingfu Zhang. What does the evolution path learn in cma-es? In Parallel
Problem Solving from Nature–PPSN XIV: 14th International Conference, Edinburgh, UK,
September 17-21, 2016, Proceedings 14, pages 751–760. Springer, 2016.
Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale
optimization. Mathematical programming, 45(1):503–528, 1989.
Ilya Loshchilov, Marc Schoenauer, and Michele Sebag. Alternative restart strategies for
cma-es. In International Conference on Parallel Problem Solving from Nature, pages
296–305. Springer, 2012a.
John A Nelder and Roger Mead. A simplex method for function minimization. The computer
journal, 7(4):308–313, 1965.
Michael A Osborne, Roman Garnett, and Stephen J Roberts. Gaussian processes for global
optimization. In 3rd international conference on learning and intelligent optimization
(LION3), pages 1–15, 2009.
Fabian Otto, Philipp Becker, Ngo Anh Vien, Hanna Carolin Ziesche, and Gerhard Neu-
mann. Differentiable trust region layers for deep reinforcement learning. arXiv preprint
arXiv:2101.09207, 2021.
Fabian Otto, Onur Celik, Hongyi Zhou, Hanna Ziesche, Vien Anh Ngo, and Gerhard
Neumann. Deep black-box reinforcement learning with movement primitives. In Karen
Liu, Dana Kulic, and Jeff Ichnowski, editors, Proceedings of The 6th Conference on Robot
Learning, volume 205 of Proceedings of Machine Learning Research, pages 1244–1265.
PMLR, 14–18 Dec 2023.
Joni Pajarinen, Hong Linh Thai, Riad Akrour, Jan Peters, and Gerhard Neumann. Compat-
ible natural gradient policy search. Machine Learning, 108(8):1443–1466, 2019.
Alexandros Paraschos, Christian Daniel, Jan R Peters, and Gerhard Neumann. Probabilistic
movement primitives. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and
K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26.
Curran Associates, Inc., 2013.
Jan Peters, Katharina Mulling, and Yasemin Altun. Relative entropy policy search. In
Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
Kenneth Price. Differential evolution vs. the functions of the second ICEO. In Proceedings of
the IEEE International Congress on Evolutionary Computation, pages 153–157, Piscataway,
NJ, USA, 1997. IEEE. doi: 10.1109/ICEC.1997.592287.
Reuven Y Rubinstein and Dirk P Kroese. The cross-entropy method: a unified approach to
combinatorial optimization, Monte-Carlo simulation, and machine learning, volume 133.
Springer, 2004.
John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust
region policy optimization. In International conference on machine learning, pages
1889–1897. PMLR, 2015.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal
policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Yi Sun, Daan Wierstra, Tom Schaul, and Jürgen Schmidhuber. Efficient natural evolution
strategies. In Proceedings of the 11th Annual conference on Genetic and evolutionary
computation, pages 539–546, 2009.
Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient
methods for reinforcement learning with function approximation. Advances in neural
information processing systems, 12, 1999.
Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmid-
huber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):
949–980, 2014.
Susanne Winter, Bernhard Brendel, Ioannis Pechlivanis, Kirsten Schmieder, and Christian
Igel. Registration of ct and intraoperative 3-d ultrasound images of the spine using evolu-
tionary and gradient-based methods. IEEE Transactions on Evolutionary Computation,
12(3):284–296, 2008.