
Exploring Parameter Space in Reinforcement Learning

2010, Paladyn

This paper discusses parameter-based exploration methods for reinforcement learning. Parameter-based methods perturb parameters of a general function approximator directly, rather than adding noise to the resulting actions. Parameter-based exploration unifies reinforcement learning and black-box optimization, and has several advantages over action perturbation. We review two recent parameter-exploring algorithms: Natural Evolution Strategies and Policy Gradients with Parameter-Based Exploration. Both outperform state-of-the-art algorithms in several complex high-dimensional tasks commonly found in robot control. Furthermore, we describe how a novel exploration method, State-Dependent Exploration, can modify existing algorithms to mimic exploration in parameter space.

PALADYN Journal of Behavioral Robotics · Review Article · DOI: 10.2478/s13230-010-0002-4 · JBR · 1(1) · 2010 · 14-24

Exploring Parameter Space in Reinforcement Learning

Thomas Rückstieß1∗, Frank Sehnke1, Tom Schaul2, Daan Wierstra2, Yi Sun2, Jürgen Schmidhuber2
∗ E-mail: [email protected]

1 Technische Universität München, Institut für Informatik VI, Boltzmannstr. 3, 85748 Garching, Germany
2 Dalle Molle Institute for Artificial Intelligence (IDSIA), Galleria 2, 6928 Manno-Lugano, Switzerland

Keywords: reinforcement learning · optimization · exploration · policy gradients
Received 20 February 2010 · Accepted 16 March 2010

1. Introduction

Reinforcement learning (RL) is the method of choice for many complex real-world problems where engineers are unable to explicitly determine the desired policy of a controller. Unfortunately, because the indirect reinforcement signal provides less information to the learning algorithm than the teaching signal in supervised learning, learning requires a large number of trials.

Exploration is a critical component of RL, affecting both the number of trials required and the quality of the solution found. Novel solutions can be found only through effective exploration. Preferably, exploration should be broad enough not to miss good solutions, economical enough not to require too many trials, and intelligent in the sense that the information gained through it is high. Clearly, these objectives are difficult to trade off. In practice, unfortunately, many RL practitioners do not focus on exploration, instead relying on small random perturbations of the actions of the current policy.

In this paper we review some alternative methods of exploration for Policy Gradient (PG) based RL that go beyond action-based exploration and directly perturb policy parameters instead. We will look at two recent parameter-exploring algorithms: Natural Evolution Strategies (NES) [33] and Policy Gradients with Parameter-Based Exploration (PGPE) [22]. Both algorithms have been shown to outperform state-of-the-art PG methods in several complex high-dimensional tasks commonly found in robot control. We will further review a novel exploration technique, called State-Dependent Exploration (SDE), first introduced in [18]. SDE can modify existing PG algorithms to mimic exploration in parameter space and has also been demonstrated to improve state-of-the-art PG methods.

We take a stand for parameter exploration (PE), the common factor of the above-mentioned methods, which has the advantage of reducing the variance of exploration while at the same time easing credit assignment. Furthermore, we establish how these methods relate to the field of Black-Box Optimization (BBO), and Evolution Strategies (ES) in particular.
We highlight the topic of parameter-based exploration from different angles, give an overview of the superiority of this approach, and provide reasons for the performance gains as well as insights into additional properties of this kind of exploration.

The general outline of this paper is as follows: in Section 2 we introduce the general frameworks of RL and BBO and review the state of the art. Section 3 then details parameter-based exploration, reviews two parameter-exploring algorithms and demonstrates how traditional algorithms can be brought to behave like parameter-exploring ones. Experimental results are described in Section 4 and the paper concludes in Section 5.

2. Background

In this section we introduce the RL framework in general and fix the terminology. In that context, we then formalize the concept of exploration. Finally, we provide a brief overview of policy gradients and black-box optimization.

2.1. Reinforcement Learning

RL generally tries to optimize an agent's behavior in its environment. Unlike supervised learning, the agent is not told the correct behavior directly, but is only informed about how well it did, usually in terms of a scalar value called reward. Learning proceeds in a cycle of interactions with the environment. At time t, the agent observes a state s_t from the environment, performs an action a_t and receives a reward r_t. The environment then determines the next state s_{t+1}. We will use the term history for the concatenation of all encountered states, actions and rewards up to time step t, written h_t = {s_0, a_0, r_0, ..., s_{t-1}, a_{t-1}, r_{t-1}, s_t}. Often the environment fulfills the Markov assumption, i.e. the probability of the next state depends only on the last observed state and the current action: p(s_{t+1} | h_t) = p(s_{t+1} | s_t, a_t). The environment is episodic if it admits at least one terminal, absorbing state.

The objective of RL is to maximize the 'long term' return R_t, which is the (discounted) sum of future rewards

R_t = \sum_{k=0}^{T} \gamma^k r_{t+k+1}.

In the episodic case, the objective is to maximize R_1, i.e. the discounted sum of rewards from the initial time step. The discounting factor γ ∈ [0, 1] puts more or less emphasis on the most recent rewards. For infinite-horizon tasks that do not have a foreseeable end, γ < 1 prevents unbounded sums.

Actions are selected by a policy π that maps a history h_t to a probability of choosing an action: π(h_t) = p(a_t | h_t). If the environment is Markovian, it is sufficient to consider Markovian policies that satisfy p(a_t | h_t) = p(a_t | s_t). We define J(θ) to be the performance of a policy π with parameters θ: J(θ) = E{R_t}. For deterministic policies we also write π(s) = a. In either case, the selected action is not necessarily the one executed, as it can be modified or substituted by a different action, depending on the exploration component. We describe exploration in detail in Section 2.2.

Three main categories can be found in RL: direct, value-based and model-based learning, as depicted in Figure 1.

Figure 1. Different Reinforcement Learning categories.

Value-based RL tries to map each combination of state and action to a Q-value. A Q-value is the expected return if we execute action a in state s at time t and follow the policy π greedily thereafter: Q^π(s, a) = E{R_t | s_t = s, a_t = a}. If the task is episodic, we can approximate the values by Monte-Carlo methods, generating several episodes and averaging over the received returns. For infinite-horizon tasks, the definition of the Q-values can be transformed into Q^π(s, a) = E{r_{t+1} + γ Q^π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a}. Using an exponential moving average and replacing Q(s_{t+1}, a_{t+1}) by max_a Q(s_{t+1}, a), we end up with the well-known Q-Learning algorithm [31]:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \big)   (1)

A greedy policy then selects the action with the highest Q-value in a given state.
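To make the update in Equation (1) concrete, the following minimal sketch (not from the paper; the tabular representation, learning rate and discount factor are illustrative assumptions) applies one Q-learning step per observed transition:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Equation (1): Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)  # unseen state-action pairs start at a value of 0.0
```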
In a continuous action space, action selection is much harder, mainly because calculating arg max_a Q(s, a) is no longer trivial. Using a general function approximator (FA) to estimate the Q-values of state-action pairs, it is possible but expensive to follow the gradient ∂Q(s, a)/∂a towards an action that returns a higher Q-value. In actor-critic architectures [27], where the policy (the actor) is separated from the learning component (the critic), one can backpropagate the temporal difference error through the critic FA (usually implemented as a neural network) to the actor FA and train the actor to output actions that return higher Q-values [16, 30].

Direct reinforcement learning methods, in particular Policy Gradient methods [13, 14, 34], avoid the problem of finding arg max_a Q(s, a) altogether, which makes them popular for continuous action and state domains. Instead, states are mapped to actions directly by means of a parameterized function approximator, without utilizing Q-values. The parameters θ are changed by following the performance gradient ∇_θ J(θ). Different approaches exist to estimate this gradient [14].

Finally, model-based RL aims to estimate the transition probabilities P^a_{ss'} = p(s' | s, a) and the rewards R^a_{ss'} = E{r | s, s', a} of going from state s to state s' with action a. Having a model of the environment allows one to use direct or value-based techniques within a simulation of the environment, or even to apply dynamic programming solutions.

2.2. Exploration

The exploration/exploitation dilemma is one of the main problems that needs to be dealt with in Reinforcement Learning: without exploration, the agent can only go for the best solution found so far, never learning about potentially better solutions. Too much exploration leads to mostly random behavior that does not exploit the learned knowledge. A good exploration strategy carefully balances exploration and greedy policy execution.

Many exploration techniques have been developed for the case of discrete actions [28, 32], commonly divided into undirected and directed exploration. The most popular, albeit not the most effective, undirected exploration method is ε-greedy exploration, where the probability of choosing a random action decreases over time. In practice, a random number r is drawn from a uniform distribution r ∼ U(0, 1), and action selection follows the rule

\pi(s) = \begin{cases} \arg\max_a Q(s, a) & \text{if } r \geq \varepsilon \\ \text{random action from } A(s) & \text{if } r < \varepsilon \end{cases}

where A(s) is the set of valid actions in state s and 0 ≤ ε ≤ 1 is the trade-off parameter, which is reduced over time to slowly transition from exploration to exploitation. Other exploration methods for discrete actions are presented in [28].
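A minimal sketch of the ε-greedy rule just described (the Q-table and annealing schedule are illustrative assumptions, not prescribed by the paper):

```python
import random

def epsilon_greedy(Q, s, valid_actions, epsilon):
    """With probability epsilon pick a random valid action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(valid_actions)
    return max(valid_actions, key=lambda a: Q[(s, a)])

# epsilon is typically annealed over episodes, e.g. epsilon_k = max(0.01, 0.5 * 0.99 ** k),
# so that behaviour shifts from exploration towards exploitation.
```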
For the remainder of the paper we will set aside the well-established area of discrete-action exploration and concentrate on continuous actions. In the case of continuous actions, exploration is often neglected. If the policy is considered to be stochastic, a Gaussian distribution over actions is usually assumed, where the mean is interpreted as the greedy action:

a \sim \mathcal{N}(a_{\text{greedy}}, \sigma^2) = a_{\text{greedy}} + \varepsilon, \quad \text{where } \varepsilon \sim \mathcal{N}(0, \sigma^2)   (2)

During learning, exploration then occurs implicitly, almost as a side-effect, by sampling actions from the stochastic policy. While this is convenient, it conceals the fact that two different stochastic elements are involved here: the exploration and the stochastic policy itself. This becomes most apparent if we let σ adapt over time as well, following the gradient ∂J(θ)/∂σ, which is well-defined if the policy is differentiable. If the best policy is in fact a deterministic one, σ will decrease quickly and exploration therefore comes to a halt as well. This clearly undesirable behavior can be circumvented by adapting the variance manually, e.g. by decreasing it slowly over time.

Another disadvantage of this implicit exploration is the independence of samples over time. In each time step we draw a new ε and add it to the actions, leading to a very noisy trajectory through action space (see Figure 2). A robot controlled by such actions would exhibit very shaky behavior, with a severe impact on performance. Imagine an algorithm with this kind of exploration controlling the torques of a robot end-effector directly: the trembling movement of the end-effector will worsen performance in almost any object manipulation task, and such consecutive contradicting motor commands might even damage the robot or simply cannot be executed. Applying such methods therefore requires the use of motion primitives [19] or other transformations. Despite these problems, many current algorithms [13, 17, 34] use this kind of Gaussian, action-perturbing exploration (cf. Equation 2).

Figure 2. Illustration of the main difference between action (left) and parameter (right) exploration. Several rollouts in state-action space of a task with state x ∈ R² (velocity and angle axes) and action a ∈ R (torque axis) are plotted. While exploration based on action perturbation follows the same trajectory over and over again (with added noise), parameter exploration instead tries different strategies and can quickly find solutions that would take a long time to discover otherwise.
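The contrast illustrated in Figure 2 can be sketched in a few lines for a linear policy. This is an illustrative sketch only; the environment interface (reset/step returning state, reward and a done flag) and the linear parameterisation are assumptions made here for clarity:

```python
import numpy as np

def rollout_action_noise(env, theta, sigma, T):
    """Gaussian action perturbation (Equation 2): fresh noise at every step,
    which produces the jittery trajectories discussed above."""
    s = env.reset()
    actions = []
    for _ in range(T):
        a = theta @ s + np.random.normal(0.0, sigma)   # a ~ N(a_greedy, sigma^2)
        actions.append(a)
        s, _, _ = env.step(a)
    return actions

def rollout_parameter_noise(env, theta, sigma, T):
    """Parameter perturbation: theta is perturbed once per episode and then held
    fixed, so the executed trajectory stays smooth."""
    theta_episode = theta + np.random.normal(0.0, sigma, size=theta.shape)
    s = env.reset()
    actions = []
    for _ in range(T):
        a = theta_episode @ s                          # deterministic within the episode
        actions.append(a)
        s, _, _ = env.step(a)
    return actions
```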
2.3. Policy Gradients

In this paper we will compare the parameter-exploring algorithms to policy gradient algorithms that perturb the resulting action, in particular to REINFORCE [34] and eNAC [13]. Below we give a short overview of the derivation of policy gradient methods. We start with the probability of observing history h^π under policy π, which is given by the probability of starting with an initial observation s_0, multiplied by the probability of taking action a_0 given h_0, multiplied by the probability of receiving the next observation s_1, and so on. Thus,

p(h^\pi) = p(s_0) \prod_{t=0}^{T-1} \pi(a_t \mid h^\pi_t)\, p(s_{t+1} \mid h^\pi_t, a_t)   (3)

gives the probability of encountering a certain history h^π. Inserting this into the definition of the performance measure J(θ), multiplying with 1 = p(h^π)/p(h^π) and using the identity ∇x / x = ∇ log x, we get

\nabla_\theta J(\pi) = \int \frac{p(h^\pi)}{p(h^\pi)} \nabla_\theta p(h^\pi) R(h^\pi)\, dh^\pi   (4)
= \int p(h^\pi)\, \nabla_\theta \log p(h^\pi)\, R(h^\pi)\, dh^\pi.   (5)

For now, let us consider the gradient ∇_θ log p(h^π). Substituting the probability p(h^π) according to (3) gives

\nabla_\theta \log p(h^\pi) = \nabla_\theta \Big[ \log p(s_0) + \sum_{t=0}^{T-1} \log \pi(a_t \mid h^\pi_t) + \sum_{t=0}^{T-1} \log p(s_{t+1} \mid h^\pi_t, a_t) \Big].   (6)

On the right side of (6), only the policy π depends on θ, so the gradient simplifies to

\nabla_\theta \log p(h^\pi) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid h^\pi_t).   (7)

We can now resubstitute this term into (5) and get

\nabla_\theta J(\pi) = \int p(h^\pi) \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid h^\pi_t)\, R(h^\pi)\, dh^\pi = \mathbb{E}\Big\{ \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid h^\pi_t)\, R(h^\pi) \Big\}.   (8)

Unfortunately, the probability distribution p(h^π) over the histories produced by π is not known in general. Thus we need to approximate the expectation, e.g. by Monte-Carlo sampling. To this end we collect N samples through world interaction, where a single sample comprises a complete history h^π (one episode or rollout) to which a return R(h^π) can be assigned, and sum over all samples, which yields Williams' [34] episodic REINFORCE gradient estimate:

\nabla_\theta J(\pi) \approx \frac{1}{N} \sum_{h^\pi} \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid h^\pi_t)\, R(h^\pi).   (9)

The derivation of eNAC is based on REINFORCE but taken further. It is an actor-critic method that uses the natural gradient to estimate the performance gradient. Its derivation can be found in [13].
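As a concrete instance of Equation (9), the sketch below estimates the gradient for a linear Gaussian policy. The specific policy form (a ~ N(θᵀs, σ²)) and the episode data structure are assumptions for illustration, not the general formulation used above:

```python
import numpy as np

def reinforce_gradient(episodes, theta, sigma):
    """Episodic REINFORCE estimate (Equation 9) for a linear Gaussian policy
    a ~ N(theta @ s, sigma^2).  `episodes` is a list of (states, actions, R)
    tuples, where R is the return of the whole episode."""
    grad = np.zeros_like(theta)
    for states, actions, R in episodes:
        for s, a in zip(states, actions):
            # d/dtheta log N(a | theta @ s, sigma^2) = (a - theta @ s) / sigma^2 * s
            grad += (a - theta @ s) / sigma**2 * s * R
    return grad / len(episodes)
```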
2.4. Black-Box Optimization

The objective of optimization is to find parameters x ∈ D that maximize a given fitness function f(x). Black-box optimization places few restrictions on f; in particular, it does not require f to be differentiable, continuous or deterministic. Consequently, black-box optimization algorithms have access only to a number of evaluations, that is, (x, f(x)) tuples, without the additional information (e.g. gradient, Hessian) typically assumed by other classes of optimization algorithms. Furthermore, as evaluations are generally considered costly, black-box optimization attempts to maximize f with a minimal number of them. Such optimization algorithms allow domain experts to search for good or near-optimal solutions to numerous difficult real-world problems in areas ranging from medicine and finance to control and robotics.

The literature on real-valued black-box optimization is too rich to be exhaustively surveyed. To the best of our knowledge, among the various families of algorithms (e.g. hill-climbing, differential evolution [23], evolution strategies [15], immunological algorithms, particle swarm optimization [8] or estimation of distribution algorithms [10]), methods from the evolution strategy family seem to have an edge, particularly in cases where the fitness functions are high-dimensional, non-separable or ill-shaped. The common approach in evolution strategies is to maintain a Gaussian mutation distribution and update it through the generations such that the probability of generating points with better fitness is high. Evolution strategies have been extended to account for correlations between the dimensions of the parameter space D, leading to state-of-the-art methods like covariance matrix adaptation (CMA-ES) [6] and natural evolution strategies (see Section 3.4).

3. Parameter-based Exploration

A significant problem with policy gradient algorithms such as REINFORCE [34] is that the high variance in the gradient estimation leads to slow convergence. Various approaches have been proposed to reduce this variance [1, 3, 14, 26]. However, none of these methods address the underlying cause of the high variance, which is that repeatedly sampling from a probabilistic policy injects noise into the gradient estimate at every time step. Furthermore, the variance increases linearly with the length of the history [12], since each state may depend on the entire sequence of previous samples.

An alternative to the action-perturbing exploration described in Section 2.2 is to manipulate the parameters θ of the policy directly. In this section we start by showing how this parameter-based exploration can be realized (Section 3.1) and provide a concrete algorithm for doing so (Section 3.2). Realizing how this relates to black-box optimization, specifically evolution strategies (Section 3.3), we then describe a related family of algorithms from that field (Section 3.4). Finally, we introduce a methodology for bridging the gap between action-based and parameter-based exploration (Section 3.5).

3.1. Exploring in Parameter Space

Instead of manipulating the resulting actions, parameter exploration adds a small perturbation δθ directly to the parameters θ of the policy before each episode, and follows the resulting policy throughout the whole episode. Finite differences are a simple approach to estimate the gradient ∇_θ J(θ) towards better performance:

\nabla_\theta J(\theta) \approx \frac{J(\theta + \delta\theta) - J(\theta)}{\delta\theta}.   (10)

In order to get a more accurate approximation, several parameter perturbations are usually collected (one for each episode) and the gradient is then estimated through linear regression. For this, we generate several rollouts by adding exploratory noise to our policy parameters, resulting in actions a = f(s; θ + δθ). From the rollouts we build the matrix Θ, which has one row per parameter perturbation δθ_i, and a column vector J with the corresponding returns J(θ + δθ_i) in each row:

\Theta_i = [\, \delta\theta_i \;\; 1 \,]   (11)
J_i = [\, J(\theta + \delta\theta_i) \,]   (12)

The ones in the right column of Θ are needed for the bias term of the linear regression. The gradient can now be estimated with

\beta = (\Theta^T \Theta)^{-1} \Theta^T J,   (13)

where the first n elements of β are the components of the gradient ∇_θ J(θ), one for each dimension of θ.
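The following sketch implements Equations (10)-(13) with a least-squares solver; it is illustrative only, and the rollout interface `evaluate(params)` as well as the perturbation scale are assumptions:

```python
import numpy as np

def fd_gradient(evaluate, theta, n_rollouts=20, eps=0.1):
    """Finite-difference gradient estimate via linear regression (Equations 10-13).
    `evaluate(params)` is assumed to run one episode and return J(params)."""
    deltas, returns = [], []
    for _ in range(n_rollouts):
        delta = np.random.normal(0.0, eps, size=theta.shape)
        deltas.append(delta)
        returns.append(evaluate(theta + delta))
    Theta = np.column_stack([np.array(deltas), np.ones(n_rollouts)])  # rows [delta_i, 1]
    J = np.array(returns)
    beta, *_ = np.linalg.lstsq(Theta, J, rcond=None)  # beta = (Theta^T Theta)^-1 Theta^T J
    return beta[:-1]   # first n entries are the gradient, the last one is the bias term
```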
Parameter-based exploration has several advantages. First, we no longer need to calculate the derivative of the policy with respect to its parameters, since we already know which choice of parameters caused the changes in behavior. Policies are therefore no longer required to be differentiable, which in turn provides more flexibility in choosing a suitable policy for the task at hand. Second, when exploring in parameter space, the resulting actions come from an instance of the same family of functions. This contrasts with action-perturbing exploration, which might produce actions that the underlying function could never have delivered itself; in that case, gradient descent could keep changing the parameters in a certain direction without improving the overall behavior. For example, in a neural network with sigmoid outputs, the explorative noise could push the action values above +1.0. Third, parameter exploration avoids the noisy trajectories that result from adding i.i.d. noise in each time step, as illustrated in Figure 2. Each episode is executed entirely with the same parameters, which are only altered between episodes, resulting in much smoother action trajectories. Furthermore, this introduces much less variance into the rollouts, which facilitates credit assignment and generally leads to faster convergence [18, 22]. Lastly, the finite-difference gradient is much more accessible and simpler to calculate than the likelihood-ratio gradients of [34] or [13].

3.2. Policy Gradients with Parameter-based Exploration

Following the idea above, it is proposed in [22] to use policy gradients with parameter-based exploration (PGPE), where a distribution over the parameters of a controller is maintained and updated. PGPE therefore explores purely in parameter space. The parameters are sampled from this distribution at the start of each sequence, and thereafter the controller is deterministic. Since the reward for each sequence depends only on a single sample, the gradient estimates are significantly less noisy, even in stochastic environments. In what follows we briefly summarize [21], outlining the derivation that leads to PGPE as far as it is needed for the rest of the paper.

Given the RL formulation from Section 2.1, we can associate a cumulative reward R(h) with each history h by summing over the rewards at each time step: R(h) = \sum_{t=1}^{T} r_t. In this setting, the goal of reinforcement learning is to find the parameters θ that maximize the agent's expected reward

J(\theta) = \int_H p(h \mid \theta)\, R(h)\, dh.   (14)

An obvious way to maximize J(θ) is to find ∇_θ J and use it to carry out gradient ascent. Noting that R(h) is independent of θ, and using the standard identity ∇_x y(x) = y(x) ∇_x log y(x), we can write

\nabla_\theta J(\theta) = \int_H p(h \mid \theta)\, \nabla_\theta \log p(h \mid \theta)\, R(h)\, dh.   (15)

Assuming the environment is Markovian, and that the states are conditionally independent of the parameters given the agent's choice of actions, we can write p(h | θ) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} | s_t, a_t)\, p(a_t | s_t, θ). Substituting this into Eq. (15) yields

\nabla_\theta J(\theta) = \int_H p(h \mid \theta) \sum_{t=1}^{T} \nabla_\theta \log p(a_t \mid s_t, \theta)\, R(h)\, dh.   (16)

Clearly, integrating over the entire space of histories is unfeasible, and we therefore resort to sampling methods:

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log p(a^n_t \mid s^n_t, \theta)\, R(h^n),   (17)

where the histories h^n are chosen according to p(h^n | θ). The question then is how to model p(a_t | s_t, θ). In policy gradient methods such as REINFORCE, the parameters θ are used to determine a probabilistic policy π_θ(a_t | s_t) = p(a_t | s_t, θ). A typical policy model would be a parametric function approximator whose outputs define the probabilities of taking different actions. In this case the histories can be sampled by choosing an action at each time step according to the policy distribution, and the final gradient is then calculated by differentiating the policy with respect to the parameters. However, sampling from the policy at every time step leads to high variance in the sample over histories, and therefore to a noisy gradient estimate.

PGPE addresses the variance problem by replacing the probabilistic policy with a probability distribution over the parameters θ, i.e.

p(a_t \mid s_t, \rho) = \int_\Theta p(\theta \mid \rho)\, \delta_{F_\theta(s_t), a_t}\, d\theta,   (18)

where ρ are the parameters determining the distribution over θ, F_θ(s_t) is the (deterministic) action chosen by the model with parameters θ in state s_t, and δ is the Dirac delta function. The advantage of this approach is that the actions are deterministic, and an entire history can therefore be generated from a single parameter sample. This reduction in samples-per-history is what reduces the variance in the gradient estimate. As an added benefit, the parameter gradient is estimated by direct parameter perturbations, without having to backpropagate any derivatives, which allows the use of non-differentiable controllers. The expected reward for a given ρ is

J(\rho) = \int_\Theta \int_H p(h, \theta \mid \rho)\, R(h)\, dh\, d\theta.   (19)

Clearly, integrating over the entire space of histories and parameters is unfeasible, and we again resort to sampling methods. This is done by first choosing θ from p(θ | ρ), then running the agent to generate h from p(h | θ). Noting that h is conditionally independent of ρ given θ, we have p(h, θ | ρ) = p(h | θ) p(θ | ρ) and therefore ∇_ρ log p(h, θ | ρ) = ∇_ρ log p(θ | ρ), which gives

\nabla_\rho J(\rho) = \int_\Theta \int_H p(h \mid \theta)\, p(\theta \mid \rho)\, \nabla_\rho \log p(\theta \mid \rho)\, R(h)\, dh\, d\theta,   (20)

\nabla_\rho J(\rho) \approx \frac{1}{N} \sum_{n=1}^{N} \nabla_\rho \log p(\theta^n \mid \rho)\, R(h^n).   (21)

In the original formulation of PGPE, ρ consists of a set of means {µ_i} and standard deviations {σ_i} that determine an independent normal distribution for each parameter θ_i in θ, of the form p(θ_i | ρ_i) = N(θ_i | µ_i, σ_i²). Some rearrangement gives the following forms for the derivative of log p(θ | ρ) with respect to µ_i and σ_i:

\nabla_{\mu_i} \log p(\theta \mid \rho) = \frac{\theta_i - \mu_i}{\sigma_i^2}, \qquad \nabla_{\sigma_i} \log p(\theta \mid \rho) = \frac{(\theta_i - \mu_i)^2 - \sigma_i^2}{\sigma_i^3},   (22)

which can be substituted into (21) to approximate the µ and σ gradients that give the PGPE update rules. Note the similarity to REINFORCE [34]; in contrast to REINFORCE, however, θ defines the parameters of the model, not the probabilities of the actions.
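A minimal sketch of one PGPE update under the independent-Gaussian parametrisation of Equations (21) and (22). The learning rates, the number of samples per batch, and the subtraction of a mean-return baseline are assumptions made here for illustration and are not part of the equations as printed above:

```python
import numpy as np

def pgpe_update(mu, sigma, evaluate, n_samples=10, alpha_mu=0.2, alpha_sigma=0.1):
    """One PGPE step: sample parameter vectors, run one deterministic rollout each
    via `evaluate(theta)` (returns R(h)), then follow Eq. (22) gradients."""
    thetas = [np.random.normal(mu, sigma) for _ in range(n_samples)]
    returns = np.array([evaluate(th) for th in thetas])
    b = returns.mean()                       # baseline (common variance-reduction choice)
    grad_mu = np.zeros_like(mu)
    grad_sigma = np.zeros_like(sigma)
    for th, R in zip(thetas, returns):
        grad_mu += (th - mu) / sigma**2 * (R - b)                      # Eq. (22), mu part
        grad_sigma += ((th - mu)**2 - sigma**2) / sigma**3 * (R - b)   # Eq. (22), sigma part
    mu = mu + alpha_mu * grad_mu / n_samples
    sigma = np.maximum(sigma + alpha_sigma * grad_sigma / n_samples, 1e-3)
    return mu, sigma
```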
3.3. Reinforcement Learning as Optimization

There is a double link between RL and optimization. On the one hand, we may consider optimization to be a simple sub-problem of RL, with only a single state and a single time step per episode, where the fitness corresponds to the reward (i.e. a bandit problem). On the other hand, and more interestingly, the return of a whole RL episode can be interpreted as a single fitness evaluation, where the parameters x map onto the policy parameters θ. In this case, parameter-based exploration in RL is equivalent to black-box optimization. Moreover, when the exploration in parameter space is normally distributed, a direct link between RL and evolution strategies can be established.

3.4. Natural Evolution Strategies

The original PGPE can be seen as a stochastic optimization algorithm in which the parameters are adapted separately. In practice, however, the parameters representing the policies are often strongly correlated, which renders per-parameter updates less efficient. The family of natural evolution strategies (NES) [24, 25, 33] offers a principled alternative by following the natural gradient of the expected fitness. NES maintains and iteratively updates a multivariate Gaussian mutation distribution. Parameters are updated by estimating a natural evolution gradient, i.e. the natural gradient on the parameters of the mutation distribution, and following it towards better expected fitness. The evolution gradient obtained automatically takes the correlation between parameters into account, and is thus particularly suitable here. A well-known advantage of natural gradient methods over 'vanilla' gradient ascent is isotropic convergence on ill-shaped fitness landscapes [2]. Although relying exclusively on function value evaluations, the resulting optimization behavior closely resembles second-order optimization techniques, which avoids the tendency of regular gradients towards slow or even premature convergence.

The core idea of NES is its strategy adaptation mechanism: NES follows a sampled natural gradient of expected fitness in order to update the parameters of the search distribution. NES uses a Gaussian distribution with a fully adaptive covariance matrix, but it may in principle be used with a different family of search distributions. NES exhibits the typical characteristics of evolution strategies: it maintains a population of vector-valued candidate solutions, samples new offspring and adapts its search distribution generation-wise. The essential concepts of NES are briefly revisited in the following.

We collect the parameters of the Gaussian, the mean µ and the covariance matrix C, in the variable θ = (µ, C). To sample efficiently from this distribution we need a square root of the covariance matrix, i.e. a matrix A fulfilling A A^T = C. Then x = µ + A z transforms a standard normal vector z ∼ N(0, I) into a sample x ∼ N(µ, C). Let

p(x \mid \theta) = \frac{1}{(2\pi)^{d/2} \det(A)} \exp\!\Big( -\tfrac{1}{2} \big\| A^{-1}(x - \mu) \big\|^2 \Big)

denote the density of the normal search distribution N(µ, C), and let f(x) denote the fitness of sample x (which is of dimension d). Then

J(\theta) = \mathbb{E}[f(x) \mid \theta] = \int f(x)\, p(x \mid \theta)\, dx

is the expected fitness under the search distribution given by θ. Again, the log-likelihood trick enables us to write

\nabla_\theta J(\theta) = \int f(x)\, \nabla_\theta \log p(x \mid \theta)\, p(x \mid \theta)\, dx.

From this form we obtain the Monte Carlo estimate

\nabla_\theta J(\theta) \approx \widehat{\nabla_\theta J}(\theta) = \frac{1}{n} \sum_{i=1}^{n} f(x_i)\, \nabla_\theta \log p(x_i \mid \theta)

of the expected fitness gradient. For the Gaussian search distribution N(µ, C), the term ∇_θ log p(x | θ) can be computed efficiently; see e.g. [24].

Instead of using this stochastic gradient directly for updates, NES follows the natural gradient [2]. In a nutshell, the natural gradient amounts to G = F^{-1} ∇_θ J(θ), where F denotes the Fisher information matrix of the parametric family of search distributions. Natural gradient ascent has well-known advantages over vanilla gradient ascent; most prominently, it results in isotropic convergence on ill-shaped fitness landscapes, because the natural gradient is invariant under linear transformations of the search space. Additional techniques have been developed to enhance NES' performance and viability, including importance mixing to reduce the number of required samples [24, 25], and an exponential parametrization of the search distribution that guarantees invariance while at the same time providing an elegant and efficient way of computing the natural gradient without the explicit Fisher information matrix or its costly inverse (under review). NES' results are now comparable to those of the well-known CMA-ES algorithm [5, 9], the de facto 'industry standard' for continuous black-box optimization.
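The Monte Carlo estimate above can be sketched as follows for the mean of the search distribution. Note that this is a simplified illustration under assumptions of our own: it only updates µ, uses a fixed A, and deliberately omits the multiplication by the inverse Fisher matrix that turns the plain search gradient into the natural gradient used by NES:

```python
import numpy as np

def search_gradient_step(f, mu, A, n=50, lr=0.1):
    """Monte Carlo estimate of the search gradient w.r.t. the mean of a Gaussian
    search distribution N(mu, A A^T); NES would additionally apply F^-1."""
    d = mu.shape[0]
    C_inv = np.linalg.inv(A @ A.T)
    grad_mu = np.zeros(d)
    for _ in range(n):
        z = np.random.randn(d)
        x = mu + A @ z                        # sample from the search distribution
        # grad_mu log p(x | mu, C) = C^-1 (x - mu)
        grad_mu += f(x) * (C_inv @ (x - mu))
    return mu + lr * grad_mu / n
```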
3.5. State-Dependent Exploration

An alternative to parameter-based exploration that addresses most of the shortcomings of action-based exploration is State-Dependent Exploration (SDE) [18]. Its core benefit is that it is compatible with standard policy gradient methods like REINFORCE, in the sense that it can simply replace or augment the existing Gaussian exploration described in Section 2.2 and Equation 2. Actions are generated as follows, where f is the parameterized function approximator:

a = f(s, \theta) + \hat{\varepsilon}(s, \hat{\theta}).   (23)

Instead of adding i.i.d. noise in each time step (cf. Equation 2), [18] introduces a pseudo-random function ε̂(s) that takes the current state as input and is itself parameterized with parameters θ̂. These exploration parameters are in turn drawn from a normal distribution with zero mean,

\hat{\theta} \sim \mathcal{N}(0, \hat{\sigma}^2).   (24)

The exploration parameters are varied between episodes (just as introduced in Section 3.1) and held constant during the rollout. The exploration function ε̂ can therefore still carry the necessary exploratory randomness through variation between episodes, but will always return the same value in the same state within an episode. Effectively, by drawing θ̂ we create a policy delta, similar to finite-difference methods. In fact, if both f(s; Θ) with Θ = [θ_ji] and ε̂(s; Θ̂) with Θ̂ = [θ̂_ji] are linear functions, we see that

a = f(s; \Theta) + \hat{\varepsilon}(s; \hat{\Theta}) = \Theta s + \hat{\Theta} s = (\Theta + \hat{\Theta})\, s,   (25)

which shows that direct parameter perturbation methods (cf. Equation (10)) are a special case of SDE and can be expressed in this more general framework. In effect, state-dependent exploration can be seen as a converter from action-exploring to parameter-exploring methods. A method equipped with the SDE converter does not benefit from all the advantages mentioned in Section 3.1; for example, actions are not chosen from the same family of functions, since the exploration value is still added to the greedy action. It does, however, produce smooth trajectories and thus mitigates the credit assignment problem (as illustrated in Figure 2).

For a linear exploration function ε̂(s; Θ̂) = Θ̂s it is also possible to calculate the derivative of the log-likelihood with respect to the variance. Following the derivation in [18], we see that the action element a_j is distributed as

a_j \sim \mathcal{N}\Big( f_j(s, \Theta),\; \sum_i (s_i \hat{\sigma}_{ji})^2 \Big).   (26)

Therefore, differentiation of the policy with respect to the free parameters σ̂_ji yields

\frac{\partial \log \pi(a \mid s)}{\partial \hat{\sigma}_{ji}} = \sum_k \frac{\partial \log \pi_k(a_k \mid s)}{\partial \sigma_j} \frac{\partial \sigma_j}{\partial \hat{\sigma}_{ji}} = \frac{(a_j - \mu_j)^2 - \sigma_j^2}{\sigma_j^4}\, s_i^2\, \hat{\sigma}_{ji},   (27)

which can be inserted directly into the gradient estimator of REINFORCE. For more complex exploration functions, calculating the exact derivative for the sigma adaptation might not be possible, and heuristic or manual adaptation (e.g. with a slowly decreasing σ̂) is required.
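A minimal sketch of an SDE rollout for the linear case of Equation (25); the environment interface and the per-episode resampling loop are illustrative assumptions:

```python
import numpy as np

def sde_episode(env, Theta, sigma_hat, T):
    """State-Dependent Exploration with a linear policy and a linear exploration
    function (cf. Equation 25): a = Theta @ s + Theta_hat @ s = (Theta + Theta_hat) @ s.
    Theta_hat is drawn once per episode and kept fixed, so identical states receive
    identical exploration within the episode."""
    Theta_hat = np.random.normal(0.0, sigma_hat, size=Theta.shape)
    s = env.reset()
    history = []
    for _ in range(T):
        a = (Theta + Theta_hat) @ s      # smooth, state-dependent exploration
        s_next, r, done = env.step(a)
        history.append((s, a, r))
        s = s_next
        if done:
            break
    return history
```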
4. Experimental Results

In this section we compare the reviewed parameter-exploring methods PGPE and NES to the action-exploring policy gradient algorithms REINFORCE and eNAC on several simulated control scenarios. We also demonstrate how both policy gradient algorithms can perform better when equipped with the SDE converter, which makes them act more like parameter-exploring methods. The experiments are all executed in simulation, but their complexity is similar to today's real-life RL problems [11, 14]. For all experiments we plot the agent's reward against the number of training episodes. An episode is a sequence of T interactions of the agent with the environment, where T is fixed for each experiment, during which the agent makes one attempt to complete the task. For all methods, the agent and the environment are reset at the beginning of every episode. All experiments were conducted with empirically tuned meta-parameters. The benchmarks and algorithms are included in the open-source machine learning library PyBrain [20]. Section 4.1 describes each of the experiments briefly, Section 4.2 lists the results, and Section 4.3 discusses them.

4.1. Benchmark Environments

4.1.1. Pole Balancing

The first scenario is the extended pole balancing benchmark as described in [17]. Pole balancing is a standard benchmark in reinforcement learning. In contrast to [17], however, we do not initialize the controller with a previously chosen stabilizing policy but rather start with random policies, which makes the task more difficult. In this task the agent's goal is to maximize the length of time a movable cart can balance a pole upright in the centre of a track. The agent's inputs are the angle and angular velocity of the pole and the position and velocity of the cart. The agent is represented by a linear controller with four inputs and one output unit. The simulation is updated 50 times a second. The initial position of the cart and angle of the pole are chosen randomly.

4.1.2. Biped Robust Standing

The task in this scenario was to keep a simulated biped robot standing while perturbed by external forces. The simulation, based on the biped robot Johnnie [29], was implemented using the Open Dynamics Engine. The lengths and masses of the body parts, the location of the connection points, and the range of allowed angles and torques in the joints were matched with those of the original robot. Due to the difficulty of accurately simulating the robot's feet, the friction between them and the ground was approximated by a Coulomb friction model. The framework has 11 degrees of freedom and a 41-dimensional observation vector (11 angles, 11 angular velocities, 11 forces, 2 pressure sensors in the feet, 3 degrees of orientation and 3 degrees of acceleration in the head). The controller is a Jordan network [7] with 41 inputs, 20 hidden units and 11 output units. The aim of the task is to maximize the height of the robot's head, up to the limit of standing completely upright. The robot is continually perturbed by random forces (depicted by the particles in Figure 3) that would knock it over unless it counterbalances. Figure 3 shows a typical scenario of the robust standing task.

Figure 3. The original Johnnie (left). From left to right, a typical solution which worked well in the robust standing task: 1. Initial posture. 2. Stable posture. 3. Perturbation by heavy weights that are thrown randomly at the robot. 4.-7. Backsteps right, left, right, left. 8. Stable posture regained.

4.1.3. Object Grasping

The task in this scenario was to grasp an object from a table. The simulation, based on the CCRL robot [4], was implemented using the Open Dynamics Engine. The lengths and masses of the body parts and the location of the connection points were matched with those of the original robot. Friction was approximated by a Coulomb friction model. The framework has 7 degrees of freedom and a 35-dimensional observation vector (8 angles, 8 angular velocities, 8 forces, 2 pressure sensors in the hand, 3 degrees of orientation, 3 values for the position of the hand and 3 values for the position of the object). The controller was a Jordan network [7] with 35 inputs, 1 hidden unit and 7 output units. The system has only 7 DoF despite having 8 joints, because the actual grasping is realized as a reflex: once the hand has come sufficiently close to the center of gravity of the object, the hand closes automatically. A further simplification is that the object is always placed at the same position at the very edge of the table. With these simplifications the task is easy enough to be learned from scratch within 20,000 episodes, and the required controller could be kept small enough to also allow methods that adapt a covariance matrix to learn the task. Figure 4 shows a typical solution of the grasping task.

Figure 4. The original CCRL robot (left). From left to right, a typical solution which worked well in the object grasping task: 1. Initial posture. 2. Approach. 3. Enclose. 4. Take hold. 5. Lift.

4.1.4. Ball Catching

This series of experiments is based on a simulated robot hand with realistically modeled physics. We chose this experiment to show the superiority of policy gradients equipped with SDE, especially in a realistic robot task. We used the Open Dynamics Engine to model the hand, arm, body, and object. The arm has 3 degrees of freedom: shoulder, elbow, and wrist, where each joint is assumed to be a 1D hinge joint, which limits the arm movements to forward-backward and up-down. The hand itself consists of 4 fingers with 2 joints each, but for simplicity we use a single actor to move all finger joints together, which gives the system the possibility to open and close the hand but does not allow it to control individual fingers. These limitations to hand and arm movement reduce the overall complexity of the task while giving the system enough freedom to catch the ball. A 3D visualization of the robot attempting a catch is shown in Figure 5.

Figure 5. Visualization of the simulated robot hand while catching a ball. The ball is released above the palm with added noise in the x and y coordinates. When the fingers grasp the ball and do not release it throughout the episode, the best possible return (close to −1.0) is achieved.

The reward function is defined as follows: upon release of the ball, in each time step the reward is either −3 if the ball hits the ground (in which case the episode is considered a failure, because the system cannot recover from it) or else the negative distance between ball center and palm center, which can take any value between −3 (we capped the distance at 3 units) and −0.5 (the closest possible distance considering the palm height and ball radius). The return for a whole episode of N time steps is the mean over the episode, R = \frac{1}{N} \sum_{t=1}^{N} r_t. In practice, we found an overall episodic return of −1 or better to represent nearly optimal catching behavior, considering that the time from ball release to impact on the palm is penalized with the capped distance to the palm center.
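The per-step reward just described can be sketched directly; the function name and argument types are illustrative assumptions, the constants (−3 failure reward, 3-unit distance cap, mean episodic return) follow the description above:

```python
import numpy as np

def catching_reward(ball_pos, palm_pos, ball_on_ground, max_dist=3.0):
    """Per-step reward for the ball-catching task: -3 once the ball has hit the
    ground (the episode counts as a failure), otherwise the negative distance
    between ball centre and palm centre, capped at 3 units."""
    if ball_on_ground:
        return -3.0
    dist = np.linalg.norm(np.asarray(ball_pos) - np.asarray(palm_pos))
    return -min(dist, max_dist)

# The episodic return is then the mean of these per-step rewards over the episode.
```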
4.2. Results

We present the results of our experiments below, ordered by benchmark. All plots show performance (i.e. average returns) over episodes. The solid lines are the mean over many repetitions of the same experiment, while the thick bars represent the variance. The thin vertical lines indicate the best and worst performance of all repetitions. The Ball Catching experiment was repeated 100 times, the Object Grasping experiment was repeated 10 times, and all other experiments were repeated 40 times.

4.2.1. Pole Balancing

For the first set of experiments, we compared PGPE to REINFORCE with varying action perturbation probabilities. Instead of changing the additive ε from Eqn. (2) in every time step, the probability of drawing a new ε was set to 0.125, 0.25, 0.5 and 1.0 respectively (the last one being the original REINFORCE again). Figure 6 shows the results. PGPE clearly outperformed all versions of REINFORCE, finding a better solution in shorter time. Original REINFORCE showed the worst performance; the smaller the probability of changing the perturbation, the faster REINFORCE improved.

A second experiment on the pole balancing benchmark was conducted, comparing PGPE, NES and eNAC directly. The results are illustrated in Figure 7. NES converged fastest, with PGPE second. The action-perturbing eNAC got stuck on a plateau after approximately 1000 episodes but eventually recovered and found an equally good solution, although much more slowly.

Figure 6. REINFORCE on the pole balancing task, with various action perturbation probabilities (1, 0.5, 0.25, 0.125). PGPE is shown for reference.

Figure 7. PGPE and NES compared to eNAC on the pole balancing benchmark.

4.2.2. Biped Robust Standing

This complex, high-dimensional task was executed with REINFORCE and its parameter-exploring counterpart PGPE, as shown in Figure 8. While PGPE is slower at first, it quickly overtakes REINFORCE and finds a robust posture in less than 1000 episodes, whereas REINFORCE needs twice as many episodes to reach that performance.

Figure 8. PGPE compared to REINFORCE on the robust standing benchmark.

4.2.3. Object Grasping

The object grasping experiment compares the two parameter-exploring algorithms PGPE and NES. Object grasping is a mid-dimensional task. As can be seen in Figure 9, with a parameter dimension of 48 both algorithms perform nearly the same, and both manage to learn to grasp the object from scratch in reasonable time.

Figure 9. PGPE and NES on the object grasping benchmark.

4.2.4. Ball Catching

Two experiments demonstrate the performance of eNAC and REINFORCE, both with and without SDE. The results are shown in Figures 10 and 11, respectively. REINFORCE enhanced with SDE clearly exceeds its action-exploring counterpart, finding a superior solution with very low variance. REINFORCE without SDE, on the other hand, has a very high variance, which indicates that it sometimes finds good solutions while at other times it gets stuck early on. The same experiment was repeated with eNAC as the policy gradient method, again with and without SDE. While the results are not as clear as in the case of REINFORCE, SDE still improved the performance significantly.

Figure 10. REINFORCE compared to the SDE version of REINFORCE. While SDE managed to learn to catch the ball quickly in every single case, original REINFORCE occasionally found a good solution, but in most cases did not learn to catch the ball.

Figure 11. eNAC compared to the SDE version. Both learning curves had relatively high variances. While original eNAC often did not find a good solution, SDE found a catching behavior in almost every case, but many times lost it again due to continued exploration, hence the high variance.

4.3. Discussion

4.3.1. PGPE

We previously asserted that the lower variance of PGPE's gradient estimates compared to action-exploring methods is partly due to the fact that PGPE requires only one parameter sample per history, whereas REINFORCE requires samples at every time step. This suggests that reducing the frequency of REINFORCE perturbations should improve its gradient estimates, thereby bringing it closer to PGPE. As can be seen in Figure 6, performance generally improved with decreasing perturbation probability; the difference between 0.25 and 0.125, however, is negligible. This is because reducing the number of perturbations constrains the range of exploration at the same time as it reduces the variance of the gradient, leading to a saturation point beyond which performance does not increase. Note that this trade-off does not exist for PGPE, because a single perturbation of the parameters can lead to a large change in the overall behavior. Because PGPE also uses only the log-likelihood gradient for the parameter update and does not adapt a full covariance matrix but only a single variance parameter per exploration dimension, the difference in performance is solely based on the different exploration strategies.

In Figure 8 PGPE is compared to its direct competitor and ancestor REINFORCE. The robust standing benchmark should clearly favor REINFORCE, because the policy has many more parameters than output dimensions (1111 parameters for 11 DoF) and is furthermore quite noisy. Both properties are said to challenge parameter-exploring algorithms [14]. While these disadvantages are most likely responsible for the worse performance of PGPE in the beginning (less than 500 episodes), PGPE still overtakes REINFORCE and finds a better solution in shorter time.

4.3.2. NES

NES has shown its superiority in the pole balancing experiment, demonstrating its real strength, difficult low-dimensional tasks, which it inherits from its origins in CMA-ES. Like eNAC it uses a natural gradient for the parameter updates of the policy; the difference in performance is therefore only due to the adaptation of the full covariance matrix and the exploration in parameter space. eNAC's convergence temporarily slows down after 1000 episodes, where it reaches a plateau. By looking at intermediate solutions of eNAC at plateau level, we discovered an interesting reason for this behavior: eNAC learns to balance the pole first, and only later proceeds to learn to drive to the middle of the area. NES and PGPE (and in fact all tested parameter-exploring techniques, including CMA-ES) learn both subtasks simultaneously. This is because action-perturbing methods try out small action variations and are thus more greedy; in effect, they learn small subtasks first. Parameter-exploring methods, on the other hand, vary the strategy in each episode and are thus able to find the overall solution more directly. Admittedly, while the latter resulted in faster convergence in this task, it is possible that action-perturbing methods can use this greediness to their advantage by learning step-by-step rather than tackling the big problem at once. This issue should be examined further in a future publication.

We also compared NES to PGPE on the object grasping task, which has a medium number of dimensions. As can be seen in Figure 9, with a parameter dimension of 48 both algorithms perform similarly. This underlines that NES is a specialist for difficult low-dimensional problems, while PGPE was constructed to cope with high-dimensional problems.

4.3.3. SDE

The experiment that supports our claim most strongly is probably the ball catching task, where REINFORCE and eNAC are compared to their SDE-converted counterparts. The underlying algorithms are exactly the same; only the exploration method has been switched from normally distributed action exploration to state-dependent exploration. As mentioned in Section 3.5, this does not make them fully parameter-exploring, but it does carry over some of the advantageous properties. As can be seen in Figures 10 and 11, the proposed kind of exploration results in a completely different quality of the found solutions. The large difference in performance on this challenging task, where torques are directly affected by the output of the controller policy, strengthens our arguments for parameter exploration further.

5. Conclusion

We studied the neglected issue of exploration for continuous RL, especially for Policy Gradient methods. Parameter-based exploration has many advantages over perturbing actions by Gaussian noise. Several of our novel RL algorithms profit from it, despite stemming from different subfields and pursuing different goals. All find reliable gradient estimates and outperform traditional policy gradient methods in many applications, converging faster and often also more reliably than standard PG methods with random exploration in action space. SDE replaces the latter by state-dependent exploration, searching the parameter space of an exploration function approximator. It combines the advantages of policy gradients, in particular the advanced gradient estimation techniques found in eNAC, with the reduced variance of parameter-exploring methods. PGPE goes one step further and explores by perturbing the policy parameters directly. It lacks the methodology of existing PG methods, but it also works for non-differentiable controllers and learns to execute smooth trajectories. NES, finally, combines desirable properties of CMA-ES with natural gradient estimates that replace the population/distribution-based convergence mechanism of CMA, while keeping the covariance matrix for more informed exploration. We believe that parameter-based exploration should play a more important role not only for PG methods but for continuous RL in general, and for continuous value-based RL in particular. This is the subject of ongoing research.

References
[1] D. Aberdeen. Policy-Gradient Algorithms for Partially Observable Markov Decision Processes. PhD thesis, Australian National University, 2003.
[2] S. Amari and S. C. Douglas. Why natural gradient? In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '98), volume 2, pages 1213–1216, 1998.
[3] J. Baxter and P. L. Bartlett. Reinforcement learning in POMDPs via direct gradient ascent. In Proc. 17th International Conf. on Machine Learning, pages 41–48. Morgan Kaufmann, San Francisco, CA, 2000.
[4] M. Buss and S. Hirche. Institute of Automatic Control Engineering, TU München, Germany, 2008. http://www.lsr.ei.tum.de/.
[5] N. Hansen and A. Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
[6] N. Hansen, S. D. Müller, and P. Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (CMA-ES). Evolutionary Computation, 11(1):1–18, 2003.
[7] M. I. Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. Proc. of the Eighth Annual Conference of the Cognitive Science Society, 8:531–546, 1986.
[8] J. Kennedy, R. C. Eberhart, et al. Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks, volume 4, pages 1942–1948. Piscataway, NJ: IEEE, 1995.
[9] S. Kern, S. D. Müller, N. Hansen, D. Büche, J. Ocenasek, and P. Koumoutsakos. Learning probability distributions in continuous evolutionary algorithms – a comparative review. Natural Computing, 3(1):77–112, 2004.
[10] P. Larranaga and J. A. Lozano. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers, 2002.
[11] H. Müller, M. Lauer, R. Hafner, S. Lange, A. Merke, and M. Riedmiller. Making a robot learn to play soccer. Proceedings of the 30th Annual German Conference on Artificial Intelligence (KI-2007), 2007.
[12] R. Munos and M. Littman. Policy gradient in continuous time. Journal of Machine Learning Research, 7:771–791, 2006.
[13] J. Peters and S. Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.
[14] J. Peters and S. Schaal. Policy gradient methods for robotics. In Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006.
[15] I. Rechenberg. Evolution strategy. Computational Intelligence: Imitating Life, pages 147–159, 1994.
[16] M. Riedmiller. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. Lecture Notes in Computer Science, 3720:317, 2005.
[17] M. Riedmiller, J. Peters, and S. Schaal. Evaluation of policy gradient methods and variants on the cart-pole benchmark. In Proceedings of the 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007.
[18] T. Rückstieß, M. Felder, and J. Schmidhuber. State-Dependent Exploration for policy gradient methods. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2008, Part II, LNAI 5212, pages 234–249, 2008.
[19] S. Schaal, J. Peters, J. Nakanishi, and A. Ijspeert. Learning movement primitives. In International Symposium on Robotics Research. Citeseer, 2004.
[20] T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber. PyBrain. Journal of Machine Learning Research, 11:743–746, 2010.
[21] F. Sehnke, C. Osendorfer, T. Rückstieß, A. Graves, J. Peters, and J. Schmidhuber. Parameter-exploring policy gradients. Neural Networks, Special Issue, December 2009.
[22] F. Sehnke, C. Osendorfer, T. Rückstieß, A. Graves, J. Peters, and J. Schmidhuber. Policy gradients with parameter-based exploration for control. In Proceedings of the International Conference on Artificial Neural Networks (ICANN), 2008.
[23] R. Storn and K. Price. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11(4):341–359, 1997.
[24] Y. Sun, D. Wierstra, T. Schaul, and J. Schmidhuber. Stochastic search using the natural gradient. In International Conference on Machine Learning (ICML), 2009.
[25] Y. Sun, D. Wierstra, T. Schaul, and J. Schmidhuber. Efficient natural evolution strategies. In Genetic and Evolutionary Computation Conference (GECCO), 2009.
[26] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. NIPS-1999, pages 1057–1063, 2000.
[27] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[28] S. B. Thrun. The role of exploration in learning control. Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Approaches, pages 527–559, 1992.
[29] H. Ulbrich. Institute of Applied Mechanics, TU München, Germany, 2008. http://www.amm.mw.tum.de/.
[30] H. van Hasselt and M. Wiering. Reinforcement learning in continuous action spaces. In Proc. 2007 IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, volume 272, page 279. Citeseer, 2007.
[31] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3):279–292, 1992.
[32] M. Wiering and J. Schmidhuber. Efficient model-based exploration. From Animals to Animats 5: Proceedings of the Fifth International Conference on Simulation of Adaptive Behavior, 1998.
[33] D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber. Natural evolution strategies. In IEEE World Congress on Computational Intelligence (WCCI 2008), 2008.
[34] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.