CMA-ES with Optimal Covariance Update and
Storage Complexity
Oswin Krause
Dept. of Computer Science
University of Copenhagen
Copenhagen, Denmark
[email protected]
Dídac R. Arbonès
Dept. of Computer Science
University of Copenhagen
Copenhagen, Denmark
[email protected]
Christian Igel
Dept. of Computer Science
University of Copenhagen
Copenhagen, Denmark
[email protected]
Abstract
The covariance matrix adaptation evolution strategy (CMA-ES) is arguably one
of the most powerful real-valued derivative-free optimization algorithms, finding
many applications in machine learning. The CMA-ES is a Monte Carlo method,
sampling from a sequence of multi-variate Gaussian distributions. Given the
function values at the sampled points, updating and storing the covariance matrix
dominates the time and space complexity in each iteration of the algorithm. We
propose a numerically stable quadratic-time covariance matrix update scheme
with minimal memory requirements based on maintaining triangular Cholesky
factors. This requires a modification of the cumulative step-size adaptation (CSA)
mechanism in the CMA-ES, in which we replace the inverse of the square root of
the covariance matrix by the inverse of the triangular Cholesky factor. Because
the triangular Cholesky factor changes smoothly with the matrix square root, this
modification does not change the behavior of the CMA-ES in terms of required
objective function evaluations, as verified empirically. Thus, the described algorithm
can and should replace the standard CMA-ES whenever the time and memory required for
updating and storing the covariance matrix matter.
1 Introduction
The covariance matrix adaptation evolution strategy, CMA-ES [Hansen and Ostermeier, 2001], is
recognized as one of the most competitive derivative-free algorithms for real-valued optimization
[Beyer, 2007; Eiben and Smith, 2015]. The algorithm has been successfully applied in many unbiased
performance comparisons and numerous real-world applications. In machine learning, it is mainly
used for direct policy search in reinforcement learning and hyperparameter tuning in supervised
learning (e.g., see Gomez et al. [2008]; Heidrich-Meisner and Igel [2009a,b]; Igel [2010], and
references therein).
The CMA-ES is a Monte Carlo method for optimizing functions f : Rd → R. The objective function
f does not need to be continuous and can be multi-modal, constrained, and disturbed by noise. In
each iteration, the CMA-ES samples from a d-dimensional multivariate normal distribution, the
search distribution, and ranks the sampled points according to their objective function values. The
mean and the covariance matrix of the search distribution are then adapted based on the ranked points.
Given the ranking of the sampled points, the runtime of one CMA-ES iteration is ω(d²) because
the square root of the covariance matrix is required, which is typically computed by an eigenvalue
decomposition. If the objective function can be evaluated efficiently and/or d is large, the computation
of the matrix square root can easily dominate the runtime of the optimization process.
Various strategies have been proposed to address this problem. The basic approach for reducing the
runtime is to perform an update of the matrix only every τ ∈ Ω(d) steps [Hansen and Ostermeier,
1996, 2001], effectively reducing the amortized time complexity per iteration to O(d²). However, this forces the algorithm
to use outdated matrices during most iterations and can increase the number of function evaluations.
Furthermore, it leads to an uneven distribution of computation time over the iterations. Another
approach is to restrict the model complexity of the search distribution [Poland and Zell, 2001; Ros
and Hansen, 2008; Sun et al., 2013; Akimoto et al., 2014; Loshchilov, 2014, 2015], for example,
to consider only diagonal matrices [Ros and Hansen, 2008]. However, this can lead to a drastic
increase in function evaluations needed to approximate the optimum if the objective function is not
compatible with the restriction, for example, when optimizing highly non-separable problems while
only adapting the diagonal of the covariance matrix [Omidvar and Li, 2011]. More recently, methods
were proposed that update the Cholesky factor of the covariance matrix instead of the covariance
matrix itself [Suttorp et al., 2009; Krause and Igel, 2015]. This works well for some CMA-ES
variants (e.g., the (1+1)-CMA-ES and the multi-objective MO-CMA-ES [Suttorp et al., 2009;
Krause and Igel, 2015; Bringmann et al., 2013]); however, the original CMA-ES relies on the matrix
square root, which cannot be replaced one-to-one by a Cholesky factor.
In the following, we explore the use of the triangular Cholesky factorization instead of the square root
in the standard CMA-ES. In contrast to previous attempts in this direction, we present an approach
that comes with a theoretical justification for why it does not deteriorate the algorithm’s performance.
This approach leads to the optimal asymptotic storage and runtime complexity when adaptation of
the full covariance matrix is required, as is the case for non-separable ill-conditioned problems. Our
CMA-ES variant, referred to as Cholesky-CMA-ES, reduces the runtime complexity of the algorithm
with no significant change in the number of objective function evaluations. It also reduces the memory
footprint of the algorithm.
Section 2 briefly describes the original CMA-ES algorithm (for details we refer to Hansen [2015]).
In Section 3 we propose our new method for approximating the step-size adaptation and give a
theoretical justification for the convergence of the new algorithm. In Section 4 we provide empirical
performance results comparing the original CMA-ES with the new Cholesky-CMA-ES on various
benchmark functions. Finally, we discuss our results and draw our conclusions.
2 Background
Before we briefly describe the CMA-ES to fix our notation, we discuss some basic properties of
using a Cholesky decomposition to sample from a multivariate Gaussian distribution. Sampling
from a d-dimensional multivariate normal distribution N(m, Σ), m ∈ R^d, Σ ∈ R^{d×d}, is usually
done using a decomposition of the covariance matrix Σ. This could be the matrix square root
H ∈ R^{d×d} with Σ = HH, or a lower triangular Cholesky factorization Σ = AA^T, which is related to the
square root by the QR-decomposition H = AE, where E is an orthogonal matrix. We can sample a
point x from N(m, Σ) using a sample z ∼ N(0, I) by x = Hz + m = AEz + m = Ay + m,
where we set y = Ez. We have y ∼ N(0, I) since E is orthogonal. Thus, as long as we are only
interested in the value of x and do not need y, we can sample using the Cholesky factor instead of
the matrix square root.
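For illustration, the following minimal NumPy sketch (ours, not part of the paper; variable names are illustrative) verifies that sampling with a triangular Cholesky factor produces the desired distribution without ever forming a matrix square root.

    # Sampling from N(m, Sigma) using only the lower triangular Cholesky factor A,
    # exploiting x = A y + m with y ~ N(0, I), as discussed above.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 5
    m = np.zeros(d)
    B = rng.standard_normal((d, d))
    Sigma = B @ B.T + d * np.eye(d)        # a symmetric positive definite covariance

    A = np.linalg.cholesky(Sigma)          # lower triangular factor, Sigma = A A^T
    x = A @ rng.standard_normal(d) + m     # one sample from N(m, Sigma)

    # Sanity check: the empirical covariance of many samples approaches Sigma.
    Z = rng.standard_normal((100_000, d))
    X = Z @ A.T + m
    print(np.allclose(np.cov(X, rowvar=False), Sigma, atol=0.5))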
2.1 CMA-ES
The CMA-ES was proposed by Hansen and Ostermeier [1996, 2001]; its most recent version
is described by Hansen [2015]. In the tth iteration of the algorithm, the CMA-ES samples λ points
from a multivariate normal distribution N(mt, σt² · Ct), evaluates the objective function f at these
points, and adapts the parameters Ct ∈ R^{d×d}, mt ∈ R^d, and σt ∈ R+. In the following, we present
the update procedure in a slightly simplified form for didactic reasons; we refer to Hansen [2015] for
the details. All parameters (µ, λ, ωi, cσ, dσ, cc, c1, cµ) are set to their default values [Hansen, 2015,
Table 1].
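For reference, the sketch below lists one commonly used set of default parameter values in the style of the CMA-ES tutorial. These formulas are reproduced from memory of Hansen's tutorial and should be treated as an approximation of [Hansen, 2015, Table 1], not as an authoritative transcription.

    # Hedged sketch of common CMA-ES default parameters (cf. Hansen's tutorial);
    # exact constants may differ slightly from [Hansen, 2015, Table 1].
    import numpy as np

    def default_parameters(d):
        lam = 4 + int(3 * np.log(d))                  # population size lambda
        mu = lam // 2                                 # number of selected points
        w = np.log((lam + 1) / 2.0) - np.log(np.arange(1, mu + 1))
        w /= w.sum()                                  # positive weights summing to one
        mu_eff = 1.0 / np.sum(w ** 2)                 # effective sample size
        c_sigma = (mu_eff + 2) / (d + mu_eff + 5)
        d_sigma = 1 + 2 * max(0.0, np.sqrt((mu_eff - 1) / (d + 1)) - 1) + c_sigma
        c_c = (4 + mu_eff / d) / (d + 4 + 2 * mu_eff / d)
        c_1 = 2.0 / ((d + 1.3) ** 2 + mu_eff)
        c_mu = min(1 - c_1, 2 * (mu_eff - 2 + 1 / mu_eff) / ((d + 2) ** 2 + mu_eff))
        return lam, mu, w, mu_eff, c_sigma, d_sigma, c_c, c_1, c_mu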
For a minimization task, the λ points are ranked by function value such that f(x_{1,t}) ≤ f(x_{2,t}) ≤
· · · ≤ f(x_{λ,t}). The distribution mean is set to the weighted average mt+1 = Σ_{i=1}^{µ} ωi x_{i,t}. The
weights depend only on the ranking, not on the function values directly. This renders the algorithm
invariant under order-preserving transformations of the objective function. Points with smaller ranks
(i.e., better objective function values) are given a larger weight ωi, with Σ_{i=1}^{λ} ωi = 1. The weights
are zero for ranks larger than µ < λ, where typically µ = λ/2. Thus, points with function values
worse than the median do not enter the adaptation process of the parameters. The covariance matrix
is updated using two terms, a rank-1 and a rank-µ update. For the rank-1 update, a long term average
of the changes of mt is maintained:

$$p_{c,t+1} = (1 - c_c)\, p_{c,t} + \sqrt{c_c (2 - c_c)\, \mu_{\mathrm{eff}}}\; \frac{m_{t+1} - m_t}{\sigma_t} \qquad (1)$$

where µeff = 1/Σ_{i=1}^{µ} ωi² is the effective sample size given the weights. Note that pc,t is large
when the algorithm performs steps in the same direction, while it becomes small when the algorithm
performs steps in alternating directions.¹ The rank-µ update estimates the covariance of the weighted
steps xi,t − mt, 1 ≤ i ≤ µ. Combining the rank-1 and rank-µ updates gives the final update rule for Ct,
which can be motivated by principles from information geometry [Akimoto et al., 2012]:

$$C_{t+1} = (1 - c_1 - c_\mu)\, C_t + c_1\, p_{c,t+1} p_{c,t+1}^T + \frac{c_\mu}{\sigma_t^2} \sum_{i=1}^{\mu} \omega_i\, (x_{i,t} - m_t)(x_{i,t} - m_t)^T \qquad (2)$$

So far, the update is (apart from initialization) invariant under affine linear transformations (i.e.,
x ↦ Bx + b, B ∈ GL(d, R)).
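As a concrete illustration of equation (2), the following NumPy sketch (our own, with illustrative names) performs the combined rank-1 and rank-µ update on the explicit covariance matrix; Section 3 replaces exactly this step by an update of a triangular factor.

    # Rank-1 plus rank-mu update of the covariance matrix C as in equation (2).
    import numpy as np

    def update_C(C, p_c, X_mu, m_old, sigma, w, c1, cmu):
        """C: (d, d) covariance, p_c: (d,) evolution path, X_mu: (mu, d) best points,
        m_old: (d,) previous mean, w: (mu,) positive weights summing to one."""
        Y = (X_mu - m_old) / sigma                   # steps (x_{i,t} - m_t) / sigma_t
        rank_mu = (Y * w[:, None]).T @ Y             # sum_i w_i y_i y_i^T
        return (1.0 - c1 - cmu) * C + c1 * np.outer(p_c, p_c) + cmu * rank_mu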
The update of the global step-size parameter σt is based on the cumulative step-size adaptation
algorithm (CSA). It measures the correlation of successive steps in a normalized coordinate system.
The goal is to adapt σt such that the steps of the algorithm become uncorrelated. Under the assumption
that uncorrelated steps are standard normally distributed, a carefully designed long term average over
the steps should have the same expected length as a χ-distributed random variable, denoted by E{χ}.
The long term average has the form

$$p_{\sigma,t+1} = (1 - c_\sigma)\, p_{\sigma,t} + \sqrt{c_\sigma (2 - c_\sigma)\, \mu_{\mathrm{eff}}}\; C_t^{-1/2}\, \frac{m_{t+1} - m_t}{\sigma_t} \qquad (3)$$

with pσ,1 = 0. The normalization by the factor Ct^{−1/2} is the main difference between equations
(1) and (3). It is important because it corrects for a change of Ct between iterations. Without this
correction, it is difficult to measure correlations accurately in the un-normalized coordinate system.
For the update, the length of pσ,t+1 is compared to the expected length E{χ}, and σt is changed
depending on whether the average step taken is longer or shorter than expected:

$$\sigma_{t+1} = \sigma_t \exp\left( \frac{c_\sigma}{d_\sigma} \left( \frac{\lVert p_{\sigma,t+1} \rVert}{E\{\chi\}} - 1 \right) \right) \qquad (4)$$

This update is not proven to preserve invariance under affine linear transformations [Auger, 2015],
and it is conjectured that it does not.
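The following sketch (again ours, not the reference implementation) summarizes equations (3) and (4); inv_sqrt_C stands for Ct^{−1/2}, and the expected length E{χ} is replaced by the standard approximation used in the CMA-ES literature.

    # Cumulative step-size adaptation, equations (3) and (4).
    import numpy as np

    def update_sigma(p_sigma, sigma, m_new, m_old, inv_sqrt_C, mu_eff, c_sigma, d_sigma, dim):
        # E{chi}: expected length of a standard normal vector in `dim` dimensions
        # (common approximation from the CMA-ES literature).
        chi_d = np.sqrt(dim) * (1.0 - 1.0 / (4.0 * dim) + 1.0 / (21.0 * dim ** 2))
        step = inv_sqrt_C @ (m_new - m_old) / sigma
        p_sigma = (1.0 - c_sigma) * p_sigma \
                  + np.sqrt(c_sigma * (2.0 - c_sigma) * mu_eff) * step
        sigma = sigma * np.exp(c_sigma / d_sigma * (np.linalg.norm(p_sigma) / chi_d - 1.0))
        return p_sigma, sigma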
3 Cholesky-CMA-ES
In general, computing the matrix square root or the Cholesky factor of a d × d matrix has time
complexity ω(d²) (i.e., it scales worse than quadratically). To reduce this complexity, Suttorp et al.
[2009] suggested replacing the process of updating the covariance matrix and decomposing it
afterwards by updates operating directly on the decomposition (i.e., the covariance matrix is never
computed and stored explicitly; only its factorization is maintained). Krause and Igel [2015] have
shown that the update of Ct in equation (2) can be rewritten as a quadratic-time update of its triangular
Cholesky factor At with Ct = At At^T. They consider the special case µ = λ = 1. We propose
to extend this update to the standard CMA-ES, which leads to a runtime of O(µd²). As typically
µ = O(log(d)), this gives a large speed-up compared to the explicit recomputation of the Cholesky
factor or the inverse of the covariance matrix.
Unfortunately, the fast Cholesky update cannot be applied directly to the original CMA-ES. To see
this, consider the term st = Ct^{−1/2}(mt+1 − mt) in equation (3). Rewriting pσ,t+1 in terms of st in
a non-recursive fashion, we obtain

$$p_{\sigma,t+1} = \sqrt{c_\sigma (2 - c_\sigma)\, \mu_{\mathrm{eff}}}\, \sum_{k=1}^{t} \frac{(1 - c_\sigma)^{t-k}}{\sigma_k}\, s_k \;.$$

¹ Given cc, the factors in (1) are chosen to compensate for the change in variance when adding distributions.
If the ranking of the points were purely random, √µeff · (mt+1 − mt)/σt ∼ N(0, Ct), and if Ct = I and
pc,t ∼ N(0, I), then pc,t+1 ∼ N(0, I).
Algorithm 1: The Cholesky-CMA-ES.
input: λ, µ, m1, ω_{i=1...µ}, cσ, dσ, cc, c1, and cµ
A1 = I, pc,1 = 0, pσ,1 = 0
for t = 1, 2, . . . do
    for i = 1, . . . , λ do
        xi,t = σt At yi,t + mt,   yi,t ∼ N(0, I)
    sort xi,t, i = 1, . . . , λ, increasing by f(xi,t)
    mt+1 = Σ_{i=1}^{µ} ωi xi,t
    pc,t+1 = (1 − cc) pc,t + √(cc(2 − cc) µeff) (mt+1 − mt)/σt
    // Apply formula (2) to At
    At+1 ← √(1 − c1 − cµ) At
    At+1 ← rankOneUpdate(At+1, c1, pc,t+1)
    for i = 1, . . . , µ do
        At+1 ← rankOneUpdate(At+1, cµ ωi, (xi,t − mt)/σt)
    // Update σ using ŝk as in (5)
    pσ,t+1 = (1 − cσ) pσ,t + √(cσ(2 − cσ) µeff) At^{−1} (mt+1 − mt)/σt
    σt+1 = σt exp((cσ/dσ)(‖pσ,t+1‖/E{χ} − 1))
Algorithm 2: rankOneUpdate(A, β, v)
input: Cholesky factor A ∈ R^{d×d} of C, β ∈ R, v ∈ R^d
output: Cholesky factor A′ of C + β v v^T
α ← v
b ← 1
for j = 1, . . . , d do
    A′_jj ← √(A_jj² + (β/b) α_j²)
    γ ← A_jj² b + β α_j²
    for k = j + 1, . . . , d do
        α_k ← α_k − (α_j / A_jj) A_kj
        A′_kj ← (A′_jj / A_jj) A_kj + (A′_jj β α_j / γ) α_k
    b ← b + β α_j² / A_jj²
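For concreteness, here is a direct NumPy transcription of Algorithm 2 (a sketch; the authors' implementation lives in Shark and uses LAPACK). The quick check at the end compares the updated factor against a fresh Cholesky factorization of the explicitly updated matrix.

    # Rank-one update of a lower triangular Cholesky factor, Algorithm 2.
    import numpy as np

    def rank_one_update(A, beta, v):
        """Return the lower triangular factor of A A^T + beta * v v^T."""
        d = A.shape[0]
        A_new = np.zeros_like(A)
        alpha = v.astype(float).copy()
        b = 1.0
        for j in range(d):
            A_new[j, j] = np.sqrt(A[j, j] ** 2 + (beta / b) * alpha[j] ** 2)
            gamma = A[j, j] ** 2 * b + beta * alpha[j] ** 2
            for k in range(j + 1, d):
                alpha[k] -= (alpha[j] / A[j, j]) * A[k, j]
                A_new[k, j] = (A_new[j, j] / A[j, j]) * A[k, j] \
                              + (A_new[j, j] * beta * alpha[j] / gamma) * alpha[k]
            b += beta * alpha[j] ** 2 / A[j, j] ** 2
        return A_new

    # Quick check against recomputing the Cholesky factor of the updated matrix.
    rng = np.random.default_rng(1)
    d = 6
    M = rng.standard_normal((d, d))
    C = M @ M.T + d * np.eye(d)
    A = np.linalg.cholesky(C)
    v = rng.standard_normal(d)
    A_up = rank_one_update(A, 0.3, v)
    print(np.allclose(A_up @ A_up.T, C + 0.3 * np.outer(v, v)))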
By the RQ-decomposition, we can find Ct^{1/2} = At Et with Et an orthogonal matrix and At
lower triangular. When replacing st by ŝt = At^{−1}(mt+1 − mt), we obtain

$$p_{\sigma,t+1} = \sqrt{c_\sigma (2 - c_\sigma)\, \mu_{\mathrm{eff}}}\, \sum_{k=1}^{t} \frac{(1 - c_\sigma)^{t-k}}{\sigma_k}\, E_k^T \hat{s}_k \;.$$
Thus, replacing Ct^{−1/2} by At^{−1} introduces a new random rotation matrix Et^T, which changes in every
iteration. Obtaining Et from At can be achieved by the polar decomposition, which is a cubic-time
operation: currently no algorithms are known that can update an existing polar decomposition
from an updated Cholesky factor in less than cubic time. Thus, if our goal is to apply the fast Cholesky
update, we have to perform the update without this correction factor:

$$p_{\sigma,t+1} \approx \sqrt{c_\sigma (2 - c_\sigma)\, \mu_{\mathrm{eff}}}\, \sum_{k=1}^{t} \frac{(1 - c_\sigma)^{t-k}}{\sigma_k}\, \hat{s}_k \qquad (5)$$
This introduces some error, but we will show in the following that we can expect this error to be small
and to decrease over time as the algorithm converges to the optimum. For this, we need the following
result:
Lemma 1. Consider the sequence of symmetric positive definite matrices (C̄t)_{t=0}^{∞} with
C̄t = Ct (det Ct)^{−1/d}. Assume that C̄t → C̄ as t → ∞ and that C̄ is symmetric positive definite with
det C̄ = 1. Let C̄t^{1/2} = Āt Et denote the RQ-decomposition of C̄t^{1/2}, where Et is orthogonal and Āt
lower triangular. Then it holds that E_{t−1}^T Et → I as t → ∞.

Proof. Let C̄^{1/2} = ĀE be the RQ-decomposition of C̄^{1/2}. As det C̄ ≠ 0, this decomposition is unique.
Since the matrix square root is continuous on symmetric positive definite matrices, C̄t^{1/2} → C̄^{1/2}.
Because the RQ-decomposition is continuous, it maps convergent sequences to convergent sequences.
Therefore Et → E and thus E_{t−1}^T Et → E^T E = I.
This result establishes that, when Ct converges to a certain shape (but not necessarily to a certain
scaling), At and thus Et will also converge (up to scaling). Thus, as we only need the norm of pσ,t+1,
we can rotate the coordinate system and, by multiplying with Et, we obtain

$$\lVert p_{\sigma,t+1} \rVert = \lVert E_t\, p_{\sigma,t+1} \rVert = \left\lVert \sqrt{c_\sigma (2 - c_\sigma)\, \mu_{\mathrm{eff}}}\, \sum_{k=1}^{t} \frac{(1 - c_\sigma)^{t-k}}{\sigma_k}\, E_t E_k^T \hat{s}_k \right\rVert \qquad (6)$$

Therefore, if Et E_{t−1}^T → I, the error in the norm will also vanish due to the exponential weighting
in the summation. Note that this does not hold for an arbitrary decomposition Ct = Bt Bt^T. If we do not
constrain Bt to be triangular and allow any matrix, we no longer have a bijective mapping between Ct
and Bt, and the introduction of d(d−1)/2 additional degrees of freedom (as, e.g., in the update proposed
by Suttorp et al. [2009]) allows the creation of non-converging sequences of Et even for Ct = const.
As the CMA-ES is a randomized algorithm, we cannot assume convergence of Ct. However, in
simplified algorithms the expectation of Ct converges [Beyer, 2014]. Still, the reasoning behind
Lemma 1 establishes that the error caused by replacing st by ŝt is small if Ct changes slowly.
Equation (6) shows that the error depends only on the rotation of coordinate systems. As the
mapping from Ct to the triangular factor At is one-to-one and smooth, the coordinate-system change
in every step will be small; because of the exponentially decaying weighting, only the last few
coordinate systems matter at a particular time step t.
The Cholesky-CMA-ES algorithm is given in Algorithm 1. One can derive the algorithm from the
standard CMA-ES by decomposing (2) into a sequence of rank-1 updates, Ct+1 = (((αCt + β1 v1 v1^T) +
β2 v2 v2^T) . . .), and applying them to the Cholesky factor using Algorithm 2.
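Put together, the covariance part of Algorithm 1 reduces to a scaling of At followed by µ + 1 rank-one updates; the sketch below reuses the rank_one_update function from the transcription of Algorithm 2 above (names are illustrative).

    # Applying equation (2) to the triangular factor A_t instead of C_t, as in Algorithm 1.
    # Assumes rank_one_update(A, beta, v) from the Algorithm 2 sketch above is in scope.
    import numpy as np

    def update_cholesky_factor(A, p_c, X_mu, m_old, sigma, w, c1, cmu):
        A = np.sqrt(1.0 - c1 - cmu) * A              # scaling term (1 - c1 - cmu) C_t
        A = rank_one_update(A, c1, p_c)              # rank-1 term c1 * p_c p_c^T
        for w_i, x_i in zip(w, X_mu):                # rank-mu terms, one per selected point
            A = rank_one_update(A, cmu * w_i, (x_i - m_old) / sigma)
        return A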
Properties of the update rule. The O(µd²) complexity of the update in the Cholesky-CMA-ES is
asymptotically optimal.² Apart from the theoretical guarantees, there are several additional
advantages compared to approaches using a non-triangular Cholesky factorization (e.g., Suttorp et
al. [2009]). First, as only triangular matrices have to be stored, the storage complexity is optimal.
Second, the diagonal elements of a triangular Cholesky factor are the square roots of the eigenvalues
of the factorized matrix, that is, we get the eigenvalues of the covariance matrix for free. These
are important, for example, for monitoring the conditioning of the optimization problem and, in
particular, for enforcing lower bounds on the variances of σt Ct projected on its principal components.
Third, a triangular matrix can be inverted in quadratic time. Thus, we can efficiently compute At^{−1}
from At when needed, instead of maintaining two separate quadratic-time updates for At^{−1} and At, which
requires more memory and is prone to numerical instabilities.
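For example, the modified CSA normalization At^{−1}(mt+1 − mt) amounts to a single triangular solve, which costs O(d²); a minimal sketch using SciPy (one standard way to do this, not the paper's code):

    # Quadratic-time computation of A_t^{-1}(m_{t+1} - m_t) by forward substitution.
    from scipy.linalg import solve_triangular

    def normalized_step(A, m_new, m_old):
        # Solve A s_hat = (m_new - m_old); A is lower triangular.
        return solve_triangular(A, m_new - m_old, lower=True)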
4 Experiments and Results
Experiments. We compared the Cholesky-CMA-ES with other CMA-ES variants.³ The reference
CMA-ES implementation uses a delay strategy in which the matrix square root is recomputed only every
max{1, 1/(10 d (c1 + cµ))} iterations [Hansen, 2015], which equals one for the dimensions considered
in our experiments. We call this variant CMA-ES-Ref.

² Actually, the complexity is related to the complexity of multiplying two µ × d matrices. We assume a naïve
implementation of matrix multiplication. With a faster multiplication algorithm, the complexity can be reduced
accordingly.
³ We added our algorithm to the open-source machine learning library Shark [Igel et al., 2008] and used
LAPACK for high efficiency.
[Figure 1: Function evaluations required to reach f(x) < 10^{−14} over problem dimensionality d (medians of 100 trials) for Cholesky-CMA-ES, Suttorp-CMA-ES, CMA-ES/d, and CMA-ES-Ref on (a) Sphere, (b) Cigar, (c) Discus, (d) Ellipsoid, (e) Rosenbrock, and (f) DiffPowers. The graphs for CMA-ES-Ref and Cholesky-CMA-ES overlap.]
[Figure 2: Runtime in seconds over problem dimensionality d (medians of 100 trials) for Cholesky-CMA-ES, Suttorp-CMA-ES, CMA-ES/d, and CMA-ES-Ref on (a) Sphere, (b) Cigar, (c) Discus, (d) Ellipsoid, (e) Rosenbrock, and (f) DiffPowers. Note the logarithmic scaling on both axes.]
Name              f(x)
Sphere            ‖x‖²
Rosenbrock        Σ_{i=0}^{d−2} 100(x_{i+1} − x_i²)² + (1 − x_i)²
Discus            x_0² + Σ_{i=1}^{d−1} 10^{−6} x_i²
Cigar             10^{−6} x_0² + Σ_{i=1}^{d−1} x_i²
Ellipsoid         Σ_{i=0}^{d−1} 10^{−6 i/(d−1)} x_i²
Different Powers  Σ_{i=0}^{d−1} |x_i|^{2 + 10 i/(d−1)}

Table 1: Benchmark functions used in the experiments (additionally, a rotation matrix B transforms
the variables, x ↦ Bx)
[Figure 3: Function value f(mt) over time in seconds on the benchmark functions with d = 128 for Cholesky-CMA-ES, Suttorp-CMA-ES, CMA-ES/d, and CMA-ES-Ref: (a) Sphere, (b) Cigar, (c) Discus, (d) DiffPowers, (e) Ellipsoid, and (f) Rosenbrock. Shown are single runs, namely those with runtimes closest to the corresponding median runtimes.]
As an alternative, we experimented with delaying the update for d steps; we refer to this variant
as CMA-ES/d. We also adapted the non-triangular Cholesky factor approach by Suttorp et al. [2009]
to the state-of-the-art implementation of the CMA-ES. We refer to the resulting algorithm as
Suttorp-CMA-ES.
We considered standard benchmark functions for derivative-free optimization, given in Table 1. Sphere
is included to show that on a spherical function the step-size adaptation does not behave differently;
Cigar, Discus, and Ellipsoid model functions with different convex shapes near the optimum; Rosenbrock
tests learning a function with d − 1 bends, which leads to slowly converging covariance matrices in
the optimization process; DiffPowers is an example of a function with arbitrarily bad conditioning.
To test rotation invariance, we applied a rotation matrix to the variables, x ↦ Bx, B ∈ SO(d, R).
This was done for every benchmark function, and a rotation matrix was chosen randomly at the
beginning of each trial. All starting points were drawn uniformly from [0, 1], except for Sphere,
where we sampled from N(0, I). For each function, we varied d ∈ {4, 8, 16, . . . , 256}. Due to the long
running times, we computed CMA-ES-Ref only up to d = 128. For every choice of d, we ran 100 trials
from different initial points and monitored the number of iterations and the wall-clock time needed
to sample a point with a function value below 10^{−14}. For Rosenbrock, we excluded the trials in
which the algorithm did not converge to the global optimum. We further evaluated the algorithms on
additional benchmark functions inspired by Stich and Müller [2012] and measured the change of
rotation introduced by the Cholesky-CMA-ES at each iteration (Et); see the supplementary material.
Results. Figure 1 shows that CMA-ES-Ref and Cholesky-CMA-ES required the same number
of function evaluations to reach a given objective value. The CMA-ES/d required slightly more
evaluations, depending on the benchmark function. When considering the wall-clock runtime, the
Cholesky-CMA-ES was significantly faster than the other algorithms. As expected from the theoretical
analysis, the differences become more pronounced with increasing dimensionality; see Figure 2
(note the logarithmic scales). For d = 64, the Cholesky-CMA-ES was already 20 times faster than the
CMA-ES-Ref. The drastic differences in runtime become apparent when inspecting single trials.
Note that for d = 256 the matrix size exceeded the L2 cache, which affected the performance of
the Cholesky-CMA-ES and Suttorp-CMA-ES. Figure 3 plots the trials with runtimes closest to the
corresponding median runtimes for d = 128.
5 Conclusion
CMA-ES is a ubiquitous algorithm for derivative-free optimization. The CMA-ES has proven to be a
highly efficient direct policy search algorithm and to be a useful tool for model selection in supervised
learning. We propose the Cholesky-CMA-ES, which can be regarded as an approximation of the
original CMA-ES. We gave theoretical arguments for why our approximation, which only affects the
global step-size adaptation, does not impair performance. The Cholesky-CMA-ES achieves a better,
asymptotically optimal time complexity of O(µd²) for the covariance update and optimal memory
complexity. It allows for numerically stable computation of the inverse of the Cholesky factor in
quadratic time and provides the eigenvalues of the covariance matrix without additional costs. We
empirically compared the Cholesky-CMA-ES to the state-of-the-art CMA-ES with delayed covariance
matrix decomposition. Our experiments demonstrated a significant increase in optimization speed. As
expected, the Cholesky-CMA-ES needed the same number of objective function evaluations as the
standard CMA-ES, but required much less wall-clock time, and this speed-up increases with the
search space dimensionality. Still, our algorithm scales quadratically with the problem dimensionality.
If the dimensionality gets so large that maintaining a full covariance matrix becomes computationally
infeasible, one has to resort to low-dimensional approximations [e.g., Loshchilov, 2015], which,
however, bear the risk of a significant drop in optimization performance. Thus, we advocate our new
Cholesky-CMA-ES for scaling up CMA-ES to large optimization problems for which updating and
storing the covariance matrix is still possible, for example, for training neural networks in direct
policy search.
Acknowledgement. We acknowledge support from the Innovation Fund Denmark through the
projects “Personalized breast cancer screening” (OK, CI) and “Cyber Fraud Detection Using Advanced Machine Learning Techniques” (DRA, CI).
References
Y. Akimoto, Y. Nagata, I. Ono, and S. Kobayashi. Theoretical foundation for CMA-ES from
information geometry perspective. Algorithmica, 64(4):698–716, 2012.
Y. Akimoto, A. Auger, and N. Hansen. Comparison-based natural gradient optimization in high
dimension. In Proceedings of the 16th Annual Genetic and Evolutionary Computation Conference
(GECCO), pages 373–380. ACM, 2014.
A. Auger. Analysis of Comparison-based Stochastic Continuous Black-Box Optimization Algorithms.
Habilitation thesis, Faculté des Sciences d’Orsay, Université Paris-Sud, 2015.
H.-G. Beyer. Evolution strategies. Scholarpedia, 2(8):1965, 2007.
H.-G. Beyer. Convergence analysis of evolutionary algorithms that are based on the paradigm of
information geometry. Evolutionary Computation, 22(4):679–709, 2014.
K. Bringmann, T. Friedrich, C. Igel, and T. Voß. Speeding up many-objective optimization by Monte
Carlo approximations. Artificial Intelligence, 204:22–29, 2013.
A. E. Eiben and J. Smith. From evolutionary computation to the evolution of things. Nature,
521:476–482, 2015.
F. Gomez, J. Schmidhuber, and R. Miikkulainen. Accelerated neural evolution through cooperatively
coevolved synapses. Journal of Machine Learning Research, 9:937–965, 2008.
N. Hansen and A. Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In Proceedings of IEEE International Conference on
Evolutionary Computation (CEC 1996), pages 312–317. IEEE, 1996.
N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies.
Evolutionary Computation, 9(2):159–195, 2001.
N. Hansen. The CMA evolution strategy: A tutorial. Technical report, Inria Saclay – Île-de-France,
Université Paris-Sud, LRI, 2015.
V. Heidrich-Meisner and C. Igel. Hoeffding and Bernstein races for selecting policies in evolutionary
direct policy search. In Proceedings of the 26th International Conference on Machine Learning
(ICML 2009), pages 401–408, 2009.
V. Heidrich-Meisner and C. Igel. Neuroevolution strategies for episodic reinforcement learning.
Journal of Algorithms, 64(4):152–168, 2009.
C. Igel, T. Glasmachers, and V. Heidrich-Meisner. Shark. Journal of Machine Learning Research,
9:993–996, 2008.
C. Igel. Evolutionary kernel learning. In Encyclopedia of Machine Learning. Springer-Verlag, 2010.
O. Krause and C. Igel. A more efficient rank-one covariance matrix update for evolution strategies.
In Proceedings of the 2015 ACM Conference on Foundations of Genetic Algorithms (FOGA XIII),
pages 129–136. ACM, 2015.
I. Loshchilov. A computationally efficient limited memory CMA-ES for large scale optimization. In
Proceedings of the 16th Annual Genetic and Evolutionary Computation Conference (GECCO),
pages 397–404. ACM, 2014.
I. Loshchilov. LM-CMA: An alternative to L-BFGS for large scale black-box optimization. Evolutionary Computation, 2015.
M. N. Omidvar and X. Li. A comparative study of CMA-ES on large scale global optimisation. In AI
2010: Advances in Artificial Intelligence, volume 6464 of LNAI, pages 303–312. Springer, 2011.
J. Poland and A. Zell. Main vector adaptation: A CMA variant with linear time and space complexity.
In Proceedings of the 10th Annual Genetic and Evolutionary Computation Conference (GECCO),
pages 1050–1055. Morgan Kaufmann Publishers, 2001.
R. Ros and N. Hansen. A simple modification in CMA-ES achieving linear time and space complexity.
In Parallel Problem Solving from Nature (PPSN X), pages 296–305. Springer, 2008.
S. U. Stich and C. L. Müller. On spectral invariance of randomized Hessian and covariance matrix
adaptation schemes. In Parallel Problem Solving from Nature (PPSN XII), pages 448–457. Springer,
2012.
Y. Sun, T. Schaul, F. Gomez, and J. Schmidhuber. A linear time natural evolution strategy for
non-separable functions. In 15th Annual Conference on Genetic and Evolutionary Computation
Conference Companion, pages 61–62. ACM, 2013.
T. Suttorp, N. Hansen, and C. Igel. Efficient covariance matrix update for variable metric evolution
strategies. Machine Learning, 75(2):167–197, 2009.