Natural Evolution Strategies

Daan Wierstra, Tom Schaul, Jan Peters and Juergen Schmidhuber∗

Technical Report No. IDSIA-01-08, January 2008

IDSIA / USI-SUPSI, Istituto Dalle Molle di studi sull'intelligenza artificiale, Galleria 2, 6928 Manno, Switzerland. IDSIA was founded by the Fondazione Dalle Molle per la Qualità della Vita and is affiliated with both the Università della Svizzera italiana (USI) and the Scuola universitaria professionale della Svizzera italiana (SUPSI).

Abstract

This paper presents Natural Evolution Strategies (NES), a novel algorithm for performing real-valued 'black box' function optimization: optimizing an unknown objective function where algorithm-selected function measurements constitute the only information accessible to the method. Natural Evolution Strategies search the fitness landscape using a multivariate normal distribution with a self-adapting mutation matrix to generate correlated mutations in promising regions. NES shares this property with Covariance Matrix Adaptation (CMA), an Evolution Strategy (ES) which has been shown to perform well on a variety of high-precision optimization tasks. The Natural Evolution Strategies algorithm, however, is simpler, less ad-hoc and more principled. Self-adaptation of the mutation matrix is derived using a Monte Carlo estimate of the natural gradient towards better expected fitness. By following the natural gradient instead of the 'vanilla' gradient, we can ensure efficient update steps while preventing early convergence due to overly greedy updates, resulting in reduced sensitivity to local suboptima. We show NES has competitive performance with CMA on several tasks, while outperforming it on one task that is rich in deceptive local optima, the Rastrigin benchmark.

1 Introduction

Real-valued 'black box' function optimization is one of the major branches of modern applied machine learning research [1]. It concerns itself with optimizing the continuous parameters of some unknown objective function, also called a fitness function. The exact structure of the objective function is assumed to be unknown or unspecified, but specific function measurements, freely chosen by the algorithm, are available. This is a recurring problem setup in real-world domains, since often the precise structure of a problem is either not available to the engineer or too expensive to model or simulate. Numerous real-world problems can be treated as real-valued black box function optimization problems. To illustrate the importance and prevalence of this general problem setup, one can point to a diverse set of tasks such as the classic nozzle shape design problem [2], developing an Aibo robot gait [3] or non-Markovian neurocontrol [4].

Since exhaustively searching the entire space of solution parameters is infeasible, and since we do not assume a precise model of our fitness function, we are forced to settle for trying to find a reasonably fit solution that satisfies certain pre-specified constraints. This inevitably involves using a sufficiently intelligent heuristic approach. Though this may sound crude, in practice it has proven crucial to find the right domain-specific trade-off on issues such as convergence speed, expected quality of the solutions found, and the algorithm's sensitivity to local suboptima on the fitness landscape.
∗ Daan Wierstra, Tom Schaul and Juergen Schmidhuber are with IDSIA, Manno-Lugano, Switzerland (email: [daan, tom, juergen]@idsia.ch). Jan Peters is with the Max Planck Institute for Biological Cybernetics, Tuebingen, Germany (email: [email protected]).

A variety of algorithms has been developed within this framework, including methods such as Simulated Annealing [5], Simultaneous Perturbation Stochastic Optimization [6], simple Hill Climbing, Particle Swarm Optimization [7] and the class of Evolutionary Algorithms, of which Evolution Strategies (ES) [8, 9, 10], and in particular its Covariance Matrix Adaptation (CMA) instantiation [11], are of great interest to us.

Evolution Strategies, so named because of their inspiration from natural Darwinian evolution, generally produce consecutive generations of samples. During each generation, a batch of samples is generated by perturbing the parents' parameters – mutating their genes, if you will. (Note that ES generally uses asexual reproduction: new individuals are typically produced without crossover or similar techniques more prevalent in the field of genetic algorithms.) A number of samples is selected based on their fitness values, while the less fit individuals are discarded. The winners are then used as parents for the next generation, and so on, a process which typically leads to increasing fitness over the generations. This basic ES framework, though simple and heuristic in nature, has proven to be very powerful and robust, spawning a wide variety of algorithms.

In Evolution Strategies, mutations are generally drawn from a normal distribution with specific mutation sizes associated with every problem parameter. One of the major research topics in Evolution Strategies concerns the automated adjustment of these mutation sizes for the production of new samples, a procedure generally referred to as self-adaptation of mutation. Choosing mutation sizes too high produces debilitating mutations and prevents convergence to a sufficiently fit region of parameter space; choosing them too low leads to extremely small convergence rates and causes the algorithm to get trapped in bad local suboptima. Generally the mutation size must be chosen from a small range of values – known as the evolution window – specific to both the problem domain and to the distribution of current individuals on the fitness landscape. ES algorithms must therefore adjust mutation during evolution, based on the progress made on the recent evolution path. Often this is done by simultaneously evolving both the problem parameters and the corresponding mutation sizes, which has been shown to produce excellent results in a number of cases [10].

The Covariance Matrix Adaptation algorithm constitutes a more sophisticated approach. CMA adapts a variance-covariance mutation matrix Σ used by a multivariate normal distribution from which mutations are drawn. This enables the algorithm to generate correlated mutations, speeding up evolution significantly on many real-world fitness landscapes. Self-adaptation of this mutation matrix is achieved by integrating information on successful mutations along its recent evolution path, making similar mutations more likely. CMA performs excellently on a number of benchmark tasks. One of the problems with the CMA algorithm, however, is its ad-hoc nature and its relatively complex, even contrived, set of mathematical justifications and 'rules of thumb'.
Another problem pertains to its sensitivity to local optima.

The goal of this paper is to advance the state of the art of real-valued 'black box' function optimization while providing a firmer mathematical grounding of its mechanics, derived from first principles. Simultaneously, we want to develop a method that can achieve optimization up to arbitrarily high precision, and that is less sensitive to local suboptima than certain earlier methods such as CMA. We present an algorithm, Natural Evolution Strategies, that is both elegant and competitive, inheriting CMA's strength of correlated mutations while enhancing the ability to prevent early convergence to local optima. Our method (which can actually be seen as a (1, λ)-Evolution Strategy with 1 candidate solution per generation and λ samples or 'children') adapts both a mutation matrix and the parent individual using a natural gradient based update step [12]. Every generation, a gradient towards better expected fitness is estimated using a Monte Carlo approximation. This gradient is then used to update both the parent individual's parameters and the mutation matrix. By using a natural gradient instead of a 'vanilla' gradient, we can prevent early convergence to local optima while ensuring large update steps. We show that the algorithm has competitive performance with CMA on a number of benchmarks, while it outperforms CMA on the Rastrigin function benchmark, a task with many local suboptima.

The paper is organized as follows. The next section provides a quick overview of the general problem framework of real-valued black box function optimization. The ensuing sections describe the derivation of the 'vanilla' gradient approach, the concept of 'fitness shaping', and the natural gradient instantiation of our algorithm. The experimental results section shows our initial results on a number of benchmark problems and compares the performance to the CMA algorithm. The paper concludes with a discussion of the advantages and problems of the method, and points out possible directions for future work.

2 Algorithm Framework

First let us introduce the algorithm framework and the corresponding notation. The objective is to optimize the n-dimensional continuous vector of objective parameters x for an unknown fitness function f : R^n → R. The function is unknown or 'black box', in that the only information accessible to the algorithm consists of function measurements selected by the algorithm. The goal is to optimize f(x) while keeping the number of function evaluations – which are considered costly – as low as possible. This is done by evaluating λ separate individuals z_1 ... z_λ each successive generation g, using the information from the fitness evaluations f(z_1) ... f(z_λ) to adjust both the current candidate objective parameters x and the mutation sizes.

In conventional Evolution Strategies, optimization is achieved by mimicking natural evolution: at every generation, parent solution x produces offspring z_1 ... z_λ by mutating string x using a multivariate normal distribution with zero mean and some variance σ. After evaluating all individuals, the best µ individuals are kept (selected), stored as candidate solutions and subsequently used as 'parents' for the next generation. This simple process is known to produce excellent results for a number of challenging problems; a minimal sketch of one such conventional generation is given below.
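For concreteness, here is a minimal NumPy sketch of one conventional (µ, λ)-ES generation as just described (truncation selection with isotropic mutations; the function and variable names are our own illustration, and self-adaptation of σ is omitted):

```python
import numpy as np

def es_generation(parents, f, lam, mu, sigma):
    """One conventional (mu, lambda)-ES step: mutate, evaluate, select the best mu."""
    n = parents.shape[1]
    # each of the lam offspring perturbs a randomly chosen parent with N(0, sigma^2 I) noise
    idx = np.random.randint(len(parents), size=lam)
    offspring = parents[idx] + sigma * np.random.randn(lam, n)
    fitness = np.array([f(z) for z in offspring])
    keep = np.argsort(fitness)[:mu]   # minimization: keep the mu fittest, discard the rest
    return offspring[keep]
```

NES departs from this scheme by replacing truncation selection with a gradient estimated from all λ samples, as derived next.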
Algorithm 1 Natural Evolution Strategies

  g ← 0, initialize population parameters θ^(g) = (x, Σ)
  repeat
    for k = 1 ... λ do
      draw z_k ∼ π(x, Σ)
      evaluate fitness of individual: f(z_k)
      compute ∇_x log π(z_k) = Σ^{-1}(z_k − x)
      compute ∇_Σ log π(z_k) = 2A − diag A, where A ≡ ½ Σ^{-1}(z_k − x)(z_k − x)^T Σ^{-1} − ½ Σ^{-1}
    end for
    Φ ← matrix with rows [∇_θ log π(z_k), 1], k = 1 ... λ
    R ← [f(z_1), ..., f(z_λ)]^T
    δθ ← (Φ^T Φ)^{-1} Φ^T R
    θ^(g+1) ← θ^(g) + β · δθ
    g ← g + 1
  until stopping criterion is met

3 'Vanilla' Gradients for Evolution Strategies

Our approach, however, differs from conventional Evolution Strategies in one important respect. Instead of 'wasting' information by discarding low-fitness samples, we aim to use all available fitness values, even the bad ones, to generate a gradient for updating our population.

The core idea is that we want to optimize the expected fitness J = E_z[f(z)] of the next generation. We assume at every generation g a population π^(g) parameterized by θ = ⟨x, Σ⟩, representing the current candidate solution ('parent') x and the mutation matrix Σ used for producing the next generation of search points. To adjust the parameters θ = ⟨x, Σ⟩ towards solutions that are likely more fit, we estimate a gradient on θ for the expected fitness. Let f(z) be the fitness at a particular search point z and, using the familiar multivariate normal distribution, let

$$\pi(z|\theta) = \mathcal{N}(z|x,\Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(z-x)^T \Sigma^{-1} (z-x)\right)$$

denote the probability density of search point z given the current population θ = ⟨x, Σ⟩. The expected fitness can then be expressed as

$$J = \mathbb{E}_z[f(z)] = \int \pi(z|\theta)\, f(z)\, dz.$$

Taking the derivative of J with respect to the parameters θ of population π, we can write

$$\nabla_\theta J = \nabla_\theta \int \pi(z|\theta)\, f(z)\, dz = \int \frac{\pi(z|\theta)}{\pi(z|\theta)}\, \nabla_\theta \pi(z|\theta)\, f(z)\, dz = \int \pi(z|\theta)\, \nabla_\theta \log \pi(z|\theta)\, f(z)\, dz$$

using the 'likelihood-ratio' trick. Taking a Monte Carlo approximation of this expectation by choosing λ search points, we get

$$\nabla_\theta J = \mathbb{E}_z\big[\nabla_\theta \log \pi(z|\theta)\, f(z)\big] \approx \frac{1}{\lambda} \sum_{k=1}^{\lambda} \nabla_\theta \log \pi(z_k|\theta)\, f(z_k).$$

The population parameter vector θ = ⟨x, Σ⟩ comprises both the current candidate solution center and its mutation matrix, concatenated in one single vector. To calculate the derivatives of the log-likelihood with respect to the individual elements of θ for this multivariate normal distribution, first note that

$$\log \pi(z|x,\Sigma) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log\det\Sigma - \frac{1}{2}(z-x)^T \Sigma^{-1}(z-x).$$

We will need its derivatives ∇_x log π(z|x, Σ) and ∇_Σ log π(z|x, Σ). The first is simply

$$\nabla_x \log \pi(z|x,\Sigma) = \Sigma^{-1}(z-x),$$

while the latter is more complex. For symmetric Σ, we have

$$\nabla_\Sigma \log \pi(z|x,\Sigma) = \tilde\nabla_\Sigma \log \pi(z|x,\Sigma) + \tilde\nabla_\Sigma \log \pi(z|x,\Sigma)^T - \mathrm{diag}\,\tilde\nabla_\Sigma \log \pi(z|x,\Sigma),$$

where $\tilde\nabla_\Sigma$ denotes the derivative which ignores the symmetry of Σ, and diag A denotes the diagonal matrix formed from the diagonal of matrix A. Now, we have

$$\tilde\nabla_\Sigma \log \pi(z|\theta) = \frac{1}{2}\Sigma^{-T}(z-x)(z-x)^T\Sigma^{-T} - \frac{1}{2}\Sigma^{-T} = \frac{1}{2}\Sigma^{-1}(z-x)(z-x)^T\Sigma^{-1} - \frac{1}{2}\Sigma^{-1} \equiv A,$$

since Σ's symmetry ensures Σ^{-T} = Σ^{-1}. Due to this symmetry, we then have

$$\nabla_\Sigma \log \pi(z|x,\Sigma) = 2A - \mathrm{diag}\,A.$$

Using these derivatives to calculate ∇_θ J, we can update θ ← θ + β∇_θ J with learning rate β. This produces a new candidate solution x^(g+1) each generation and simultaneously self-adapts the associated mutation matrix to Σ^(g+1). This simple update rule, which covers both object parameters and strategy parameters in one framework, is in marked contrast to the complicated procedure the CMA algorithm uses. A NumPy sketch of this 'vanilla' update is given below.
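The two log-density derivatives above translate directly into code. The following NumPy sketch (our own illustration; names such as `vanilla_gradient` are not from the paper) computes the Monte Carlo estimate of ∇_θ J:

```python
import numpy as np

def log_density_gradients(z, x, Sigma):
    """Gradients of log N(z | x, Sigma) w.r.t. x and Sigma (Section 3 formulas)."""
    S = np.linalg.inv(Sigma)
    d = z - x
    grad_x = S @ d                                # Sigma^{-1} (z - x)
    A = 0.5 * S @ np.outer(d, d) @ S - 0.5 * S    # the matrix A from the text
    grad_Sigma = 2.0 * A - np.diag(np.diag(A))    # 2A - diag A, exploiting symmetry
    return grad_x, grad_Sigma

def vanilla_gradient(x, Sigma, f, lam):
    """Monte Carlo estimate of grad J = E[grad log pi(z|theta) f(z)] from lam samples."""
    gx, gS = np.zeros_like(x), np.zeros_like(Sigma)
    for _ in range(lam):
        z = np.random.multivariate_normal(x, Sigma)
        dgx, dgS = log_density_gradients(z, x, Sigma)
        fz = f(z)
        gx += dgx * fz
        gS += dgS * fz
    return gx / lam, gS / lam
```

The update θ ← θ + β∇_θ J then amounts to `x += beta * gx; Sigma += beta * gS`.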
4 Nonlinear Fitness Shaping

Apart from slow convergence, one of the main problems of the 'vanilla' version of the algorithm described so far is early convergence caused by rapidly shrinking mutation sizes in Σ. To see why this happens, imagine the curvature of some hypothetical fitness landscape around x. If the fitness decreases quickly in one direction along the main axis of the hyperellipsoid defined by Σ, while it increases only slowly in the immediate neighborhood along the opposite direction, the estimated gradient will tend to decrease Σ too much, even driving it towards 0.

To overcome this problem, we introduce fitness shaping: the use of a nonlinear fitness transformation function that, intuitively speaking, 'awards' better samples more than it 'punishes' bad samples. The choice of fitness shaping function is arbitrary as long as it is monotonically increasing with the original fitness; it should therefore be considered one of the tuning parameters of the algorithm, to be chosen in a domain-dependent manner. We have empirically found that ranking-based shaping functions work best for various problems, also because they circumvent the problem of extreme fitness values disproportionately distorting the gradient, making careful adaptation of the learning rate during evolution unnecessary even for problems with wildly fluctuating fitnesses. The ranking function used for all experiments in this paper shapes fitness as $f(z) = i + 5\lambda i^{20}$, where i is the relative rank of z's original fitness in the batch z_1 ... z_λ, scaled between 0 and 1. A small sketch of this rule is given below.
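As a concrete illustration of this shaping rule, here is a small NumPy sketch (our own; it assumes larger raw fitness is better, i.e. the quantity being maximized, with the best sample in the batch receiving rank i = 1):

```python
import numpy as np

def shaped_fitness(raw):
    """Rank-based shaping f(z) = i + 5*lam*i**20, with i the scaled rank in (0, 1]."""
    lam = len(raw)
    ranks = np.empty(lam)
    ranks[np.argsort(raw)] = np.arange(1, lam + 1) / lam   # worst -> 1/lam, best -> 1
    # the i**20 term 'awards' the top-ranked samples far more than it 'punishes' the rest
    return ranks + 5.0 * lam * ranks ** 20
```

Because only ranks enter the gradient estimate, extreme raw fitness values cannot distort the update.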
5 Natural Evolution Strategies

Standard gradient methods, however, have been shown to converge slowly in optimization landscapes with ridges and plateaus. An ad-hoc and often-used remedy is momentum; natural gradients constitute a more principled approach. First introduced by Amari [12], natural gradients have numerous advantages over 'vanilla' gradients. They tend to avoid overly greedy updates, thus providing a 'safe' method for updating the population. Additionally, they avoid overfitting on the most recently seen samples, and they have the benefit of being covariant under reparameterization.

The traditional gradient ∇J simply follows the steepest ascent in the space of the actual parameters. While this might be a good idea for many problems, its main drawback appears when we need to maintain uncertainty, since uncertainty generates the exploration necessary for finding good solutions. In this case, we need to stay close to the presented type of solutions while maximizing the fitness. As our solutions are represented by a random sample, we need a measure of distance D(θ̂||θ) between the probability distributions π_θ(z) and π_θ̂(z). The natural distance measure between two probability distributions is the Kullback-Leibler divergence, but alternatively one could use the Hellinger distance. The natural gradient then follows from the program

$$\max_{\delta\theta}\; J(\theta + \delta\theta) \approx J(\theta) + \delta\theta^T \nabla J \quad \text{s.t.} \quad D(\theta + \delta\theta\,\|\,\theta) = \varepsilon,$$

where J(θ) is the expected fitness of the population π parameterized by θ, δθ is the direction of constrained steepest ascent, ∇J is the plain gradient, D(θ + δθ||θ) is a measure of closeness between probability distributions, and ε is a small increment size. If D(θ + δθ||θ) is the Kullback-Leibler divergence or the Hellinger distance, then for small δθ we have

$$D(\theta + \delta\theta\,\|\,\theta) = \delta\theta^T F(\theta)\,\delta\theta + \text{const},$$

where

$$F(\theta) = \int \pi(z)\,\nabla\log\pi(z)\,\nabla\log\pi(z)^T\,dz = \mathbb{E}\left[\nabla\log\pi(z)\,\nabla\log\pi(z)^T\right]$$

is the Fisher information matrix. This yields the natural gradient δθ, defined by the necessary condition

$$F(\theta)\,\delta\theta = \beta\nabla J,$$

with β being the learning rate.

Additionally, we can introduce a fitness baseline b. Since ∫π(z)dz = 1 implies ∫∇π(z)dz = ∇1 = 0, we can write

$$\nabla J = \int \nabla\pi(z)\,f(z)\,dz - b\underbrace{\int \nabla\pi(z)\,dz}_{=\,0} = \int \nabla\pi(z)\,[f(z)-b]\,dz = \int \pi(z)\,\nabla\log\pi(z)\,[f(z)-b]\,dz = \mathbb{E}\big[\nabla\log\pi(z)\,[f(z)-b]\big].$$

Thus the fitness baseline parameter b can be used to reduce the estimation variance Var[∇log π(z) [f(z) − b]]. Note that

$$\mathrm{Var}\left[\nabla\log\pi(z)\,C f(z)\right] = C^2\,\mathrm{Var}\left[\nabla\log\pi(z)\,f(z)\right],$$

that is, the variance grows quadratically with the average magnitude of the fitnesses. It can be significantly reduced if a proper fitness baseline is used, reducing the number of samples required to estimate the gradient correctly. Writing φ(z) = ∇log π(z), the condition for the natural gradient becomes

$$\mathbb{E}\left[\phi(z)\,\phi(z)^T\right]\delta\theta = \mathbb{E}\left[\phi(z)\,f(z)\right] - \mathbb{E}\left[\phi(z)\right] b,$$

with one open parameter b. We obtain b by realizing the lower bound

$$\mathrm{Var}\big[\phi(z)\,[f(z)-b]\big] = \bar f^2\,\mathrm{Var}\left[\phi(z)\,\frac{f(z)-\bar f}{\bar f}\right] + \mathrm{Var}\big[\phi(z)\,(\bar f - b)\big] \ge \bar f^2\,\mathbb{E}\left[\phi(z)\,\phi(z)^T\right] + \mathrm{Var}\big[\phi(z)\,(\bar f - b)\big],$$

where f̄ = E[f(z)]; this bound is minimized at b = f̄. However, this choice is not ideal, as it ignores the interplay between δθ and b. Since E[φ(z)] = 0, we can additionally require

$$\mathbb{E}\left[\phi(z)\right]^T \delta\theta + b = 0 + b = \mathbb{E}\left[f(z)\right],$$

giving the equation system

$$\mathbb{E}\left[\phi(z)\,\phi(z)^T\right]\delta\theta + \mathbb{E}\left[\phi(z)\right] b = \mathbb{E}\left[\phi(z)\,f(z)\right],$$
$$\mathbb{E}\left[\phi(z)\right]^T \delta\theta + b = \mathbb{E}\left[f(z)\right].$$

This system can be solved straightforwardly as a linear regression problem using the pseudoinverse. Replacing the expectations E[·] by sample averages, we obtain the general natural gradient estimator

$$\delta\theta = (\Phi^T\Phi)^{-1}\Phi^T R, \quad \text{where} \quad \Phi = \begin{bmatrix} \nabla_\theta\log\pi(z_1) & 1\\ \vdots & \vdots\\ \nabla_\theta\log\pi(z_\lambda) & 1 \end{bmatrix}, \quad R = \big[f(z_1), \ldots, f(z_\lambda)\big]^T.$$

The resulting Natural Evolution Strategies algorithm is described in pseudocode in Algorithm 1; a NumPy sketch of this regression-based generation is given below.
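Putting the pieces together, one NES generation of Algorithm 1 can be sketched as follows (our own NumPy illustration, reusing `log_density_gradients` and `shaped_fitness` from the sketches above; here θ naively stacks x with the full flattened Σ, and a single learning rate β is used, whereas the experiments below use separate rates β_x and β_Σ):

```python
import numpy as np

def nes_generation(x, Sigma, f, lam, beta):
    """One generation of Algorithm 1: regression-based natural gradient update."""
    n = len(x)
    rows, raw = [], []
    for _ in range(lam):
        z = np.random.multivariate_normal(x, Sigma)
        gx, gS = log_density_gradients(z, x, Sigma)
        rows.append(np.concatenate([gx, gS.ravel(), [1.0]]))  # [grad_theta log pi, 1]
        raw.append(f(z))
    Phi = np.array(rows)                      # lam x (n + n^2 + 1) design matrix
    R = shaped_fitness(np.array(raw))         # rank-shaped fitness vector
    # least squares = (Phi^T Phi)^{-1} Phi^T R; lstsq also handles rank deficiency
    delta, *_ = np.linalg.lstsq(Phi, R, rcond=None)
    x = x + beta * delta[:n]
    dSigma = delta[n:-1].reshape(n, n)        # the Sigma block; delta[-1] is the baseline b
    Sigma = Sigma + beta * dSigma
    Sigma = 0.5 * (Sigma + Sigma.T)           # guard: keep Sigma numerically symmetric
    return x, Sigma
```

In practice Σ must also remain positive definite (e.g. by keeping the learning rate for Σ small), and λ must be large enough that the regression is not underconstrained, as discussed in the experiments below.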
6 Experiments

To test the performance of the algorithm, we chose five different nonlinear test functions. Good test functions should be easy to interpret, yet scale up with n, be highly nonlinear or largely resistant to hill-climbing, or have deceptive local suboptima. We chose to minimize the classic Sphere function, the Rosenbrock benchmark, the Tablet task, the Schwefel function and the Rastrigin benchmark. All functions were tested with both Natural Evolution Strategies and the CMA algorithm as described in [11]. An overview of the function definitions is given in Table 1; a NumPy rendering of these benchmarks is sketched at the end of this section.

Table 1: Test functions (to be minimized)

  Sphere:      $\sum_{j=1}^n z_j^2$
  Schwefel:    $\sum_{j=1}^n \big( \sum_{k=1}^j z_k \big)^2$
  Tablet:      $(1000\, z_1)^2 + \sum_{j=2}^n z_j^2$
  Rosenbrock:  $\sum_{j=1}^{n-1} \big[ 100\,(z_j^2 - z_{j+1})^2 + (z_j - 1)^2 \big]$
  Rastrigin:   $10n + \sum_{j=1}^n \big( z_j^2 - 10\cos(2\pi z_j) \big)$

The tunable parameters of the NES algorithm are (a) the population/batch size λ, (b) the fitness shaping function f, and (c) the learning rate. Tuning these parameters instead of taking robust default settings can speed up the algorithm by roughly a factor of 3; we did this for all experiments. The first parameter, the batch size, should be chosen such that the linear regression behind the pseudoinverse is not underconstrained. This entails a minimum batch size of n + n(n+1)/2 (the size of the triangular mutation matrix plus the length of the parameter vector), but larger sizes, up to a factor of 3 larger, have been found to converge sooner. The second parameter, the fitness shaping function, was taken to be $f(z) = i + 5\lambda i^{20}$ in all our experiments; the algorithm proved robust to changes in the specific parameters of this function. The last parameter, the learning rate, should be set as high as possible without destabilizing the updates. Updates that are too large can cause destructive mutations to both x and Σ, making the algorithm diverge or converge early, especially if values of Σ are driven too close to 0. To accommodate this problem, we experimented with separate learning rates for x and Σ, and found that, though problem-dependent, good performance is typically reached when the learning rate for Σ is roughly 5 times lower than that for x. High learning rates lead to quick convergence to the global optimum, but the trade-off is the occasional failing run. As said above, tuning all these parameters can improve performance, but only by about a factor of 3.

The results for the first four functions, Sphere, Schwefel, Tablet and Rosenbrock, are depicted in Figure 2. All graphs show results averaged over multiple runs, for both NES and CMA. The Sphere function, with n = 3, was given λ = 25, β_x = β_Σ = 0.015, and succeeded in 30/49 runs. The same function, with n = 10, was given λ = 125, β_x = 0.02, β_Σ = 0.005, and succeeded in 48/48 runs. The Rosenbrock function, with n = 3, was given λ = 60, β_x = 0.05, β_Σ = 0.01, and succeeded in 22/50 runs. For n = 15, it was given λ = 300, β_x = 0.01, β_Σ = 0.002, and succeeded in 5/7 runs. The Schwefel function, with n = 15, was given λ = 200, β_x = 0.01, β_Σ = 0.002, and succeeded in 10/10 runs. The Tablet function, with n = 5, was given λ = 100, β_x = 0.01, β_Σ = 0.002, and succeeded in 20/20 runs. The results on these four functions indicate performance roughly 5 times slower than CMA's, also in higher dimensions.

NES outperforms CMA on the Rastrigin benchmark, though; see Figure 3 for an illustration of this deceptive test function and the corresponding results. As can be seen in the performance histogram, taken over 100 runs, NES nearly always reaches the global optimum, irrespective of the distance of the starting x from it. CMA typically fails to converge to the global optimum and gets stuck in a local suboptimum. Figure 1 shows the evolution of the mutation matrix during a run on the Rosenbrock function: over the generations, the mutations learn to align themselves with the curvature of the test function.

[Figure 1: The evolution of the mutation matrices over the generations. Shown are the Σ-defined ellipsoids of generations 0, 20, 40, 60 and 80, imposed on the fitness landscape of the Rosenbrock benchmark. The function has its optimum at point (1, 1). It can clearly be seen how the mutations 'learn' to align themselves in the local direction towards improved fitness.]

To summarize, our experiments indicate that NES is nearly competitive with CMA on the selected high-precision test functions, and outperforms CMA significantly on the Rastrigin benchmark due to the natural gradient's prevention of premature convergence of Σ. This suggests the algorithm might also outperform CMA on real-world problems with many deceptive local suboptima.
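The Table 1 benchmarks translate directly into NumPy. A minimal sketch (our own rendering of the definitions; all are to be minimized, so one would pass e.g. `lambda z: -rastrigin(z)` as the fitness to a maximizing update such as `nes_generation` above):

```python
import numpy as np

def sphere(z):     return np.sum(z ** 2)
def schwefel(z):   return np.sum(np.cumsum(z) ** 2)              # sum_j (sum_{k<=j} z_k)^2
def tablet(z):     return (1000 * z[0]) ** 2 + np.sum(z[1:] ** 2)
def rosenbrock(z): return np.sum(100 * (z[:-1] ** 2 - z[1:]) ** 2 + (z[:-1] - 1) ** 2)
def rastrigin(z):  return 10 * len(z) + np.sum(z ** 2 - 10 * np.cos(2 * np.pi * z))
```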
7 Discussion

Natural Evolution Strategies constitute a well-principled approach to real-valued black box function optimization with a relatively clean derivation from first principles. Its theoretical relationship to the field of Policy Gradients [13, 14], and in particular Natural Actor-Critic [15], should be clear to any reader familiar with both fields.

The experiments show, however, that on most benchmarks NES is still roughly 5 times slower than CMA. This contrasts with the results on the Rastrigin function, on which NES clearly outperforms CMA. That result suggests NES might be less sensitive to local suboptima than CMA, possibly due to the use of natural gradients, which typically better preserve uncertainty in the production of new samples, yielding a better exploration-exploitation trade-off. The results also suggest that NES and CMA scale similarly with increasing dimensionality.

We argue, though, that a method such as Natural Evolution Strategies or CMA should be used only for problems of relatively small dimensionality (at most a few dozen), since the number of parameters in θ grows quadratically with n. To prevent underconstrained linear regression when computing the natural gradient, one needs a sufficient number of samples per generation before executing an update step; for this reason too, the dimensionality n needs to be kept low. Alternatively, instead of using the entire mutation matrix, one could use only its diagonal if n becomes infeasibly large.

NES does require the user to manually specify some algorithm parameters: the learning rate, the batch/generation size, and the fitness shaping function. In contrast, the CMA algorithm has a set of excellent 'magic' default settings.

Future work on Natural Evolution Strategies must determine whether NES can be made to outperform CMA consistently on typical benchmarks and real-world tasks. We suggest extending the algorithm from a single multinormal distribution as population representation to a mixture of Gaussians, thus further reducing its sensitivity to local suboptima. Another improvement could be achieved by replacing the pseudoinverse with Recursive Least Squares, enabling the algorithm to make update steps after every new sample and to move away from generational batches. Finally, even though NES uses all samples, even the low-fitness ones (unlike conventional ES and CMA), we suspect that developing sound statistical methods for better exploiting the information in low-fitness samples might further improve performance.
[Figure 2: Results for six experiments on four different benchmark functions (SphereFunction with n = 3 and n = 10, RosenbrockFunction with n = 3 and n = 15, SchwefelFunction with n = 15, TabletFunction with n = 5); each panel plots −fitness against the number of evaluations, on a logarithmic scale, for NES and CMA. Shown are averages over several runs. Notice how in most cases NES performs roughly 5 times slower than CMA, regardless of problem dimensionality.]

[Figure 3: Performance of NES and CMA on the Rastrigin function. The histogram shows the percentage of runs that reaches the global optimum, for initial distances from the global optimum varying from 0 to 1000. Results are shown for 100 trials. NES greatly outperforms CMA, suggesting a much improved resistance to converging to bad local suboptima.]

8 Conclusion

In this paper we presented Natural Evolution Strategies, a novel algorithm for tackling the important class of real-valued 'black box' function optimization problems. The proposed method is nearly competitive with the well-established Covariance Matrix Adaptation algorithm, with which it shares the property of producing correlated mutations that greatly help performance in real-world problem domains. Using a Monte Carlo-estimated natural gradient for updating both candidate solutions and the mutation matrix, one might suspect a reduced sensitivity to getting stuck in local suboptima, and our initial experimental results suggest that this might indeed be the case. Moreover, NES seems simpler and better-principled than CMA and other Evolution Strategies. Future work will determine whether NES can be shown to consistently outperform CMA on more realistic problem settings and benchmarks.

References

[1] J. Spall, S. Hill, and D. Stark, "Theoretical framework for comparing several stochastic optimization approaches," Probabilistic and Randomized Methods for Design under Uncertainty, pp. 99–117, 2006.
[2] J. Klockgether and H.-P. Schwefel, "Two-phase nozzle and hollow core jet experiments," in Proc. 11th Symp. Engineering Aspects of Magnetohydrodynamics, 1970, pp. 141–148.
[3] N. Kohl and P. Stone, "Policy gradient reinforcement learning for fast quadrupedal locomotion," in Proceedings of the IEEE International Conference on Robotics and Automation, May 2004.
[4] F. J. Gomez and R. Miikkulainen, "Solving non-Markovian control tasks with neuroevolution," in Proc. IJCAI 99. Denver, CO: Morgan Kaufmann, 1999.
[5] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671–680, 1983.
[6] J. C. Spall, "Stochastic optimization and the simultaneous perturbation method," in WSC '99: Proceedings of the 31st Conference on Winter Simulation. New York, NY, USA: ACM, 1999, pp. 101–109.
[7] J. Kennedy and R. C. Eberhart, Swarm Intelligence. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2001.
[8] I. Rechenberg, "Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution," Ph.D. dissertation, TU Berlin, 1971.
[9] H.-G. Beyer and H.-P. Schwefel, "Evolution strategies: A comprehensive introduction," Natural Computing: an international journal, vol. 1, no. 1, pp. 3–52, 2002.
[10] H.-G. Beyer, "Toward a theory of evolution strategies: Self-adaptation," Evolutionary Computation, vol. 3, no. 3, pp. 311–347, 1996.
[11] N. Hansen and A. Ostermeier, "Completely derandomized self-adaptation in evolution strategies," Evolutionary Computation, vol. 9, no. 2, pp. 159–195, 2001.
[12] S.-i. Amari, "Natural gradient works efficiently in learning," Neural Computation, vol. 10, no. 2, pp. 251–276, 1998.
[13] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, pp. 229–256, 1992.
[14] J. Peters and S. Schaal, "Policy gradient methods for robotics," in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Beijing, China, 2006, pp. 2219–2225.
[15] J. Peters, S. Vijayakumar, and S. Schaal, "Natural actor-critic," in Proceedings of the 16th European Conference on Machine Learning (ECML 2005), 2005, pp. 280–291.