
Evolving Plastic Neural Networks with Novelty Search

Sebastian Risi*, Charles E. Hughes, Kenneth O. Stanley
Evolutionary Complexity Research Group
School of Electrical Engineering and Computer Science
University of Central Florida
4000 Central Florida Blvd., Orlando, FL 32816-2362 USA

In: Adaptive Behavior journal 18(6), pages 470-491, London: SAGE, 2010

* Correspondence to: Sebastian Risi, School of EECS, University of Central Florida, Orlando, USA. E-Mail: [email protected]  Tel.: +1 407 929 5113

Abstract

Biological brains can adapt and learn from past experience. Yet neuroevolution, i.e., automatically creating artificial neural networks (ANNs) through evolutionary algorithms, has sometimes focused on static ANNs that cannot change their weights during their lifetime. A profound problem with evolving adaptive systems is that learning to learn is highly deceptive. Because it is easier at first to improve fitness without evolving the ability to learn, evolution is likely to exploit domain-dependent static (i.e., non-adaptive) heuristics. This paper analyzes this inherent deceptiveness in a variety of different dynamic, reward-based learning tasks, and proposes a way to escape the deceptive trap of static policies based on the novelty search algorithm. The main idea in novelty search is to abandon objective-based fitness and instead simply search only for novel behavior, which avoids deception entirely. A series of experiments and an in-depth analysis show how behaviors that could potentially serve as a stepping stone to finding adaptive solutions are discovered by novelty search yet are missed by fitness-based search. The conclusion is that novelty search has the potential to foster the emergence of adaptive behavior in reward-based learning tasks, thereby opening a new direction for research in evolving plastic ANNs.

Keywords: Novelty Search, Neural Networks, Adaptation, Learning, Neuromodulation, Neuroevolution

1 Introduction

Neuroevolution (NE), i.e., evolving artificial neural networks (ANNs) through evolutionary algorithms, has shown promise in a variety of control tasks (Floreano, Dürr, & Mattiussi, 2008; Reil & Husbands, 2002; Stanley, Bryant, & Miikkulainen, 2005; Stanley & Miikkulainen, 2002; Yao, 1999). However, the synaptic connections in ANNs produced by NE are normally static, which may limit the adaptive dynamics the network can display during its lifetime (Blynel & Floreano, 2002). While some tasks do not require the network to change its behavior, many domains would benefit from online adaptation. In other words, whereas evolution produces phylogenetic adaptation, learning gives the individual the ability to react much faster to environmental changes by modifying its behavior during its lifetime. For example, a robot that is physically damaged should be able to adapt to its new circumstances without the need to re-evolve its neurocontroller. In this way, when the environment changes from what was encountered during evolution, adapting online is often necessary to maintain performance.

There is much evidence that evolution and learning are both integral to the success of biological evolution (Mayley, 1997; Nolfi & Floreano, 1999) and that lifetime learning itself can help to guide evolution to higher fitness (Hinton & Nowlan, 1987). Thus NE can benefit from combining these complementary forms of adaptation by evolving ANNs with synaptic plasticity driven by local learning rules (Baxter, 1992; Floreano & Urzelai, 2000; Stanley, Bryant, & Miikkulainen, 2003).
Synaptic plasticity allows the network to change its internal connection weights based on experience during its lifetime. It also resembles the way organisms in nature, which possess plastic nervous systems, cope with changing and unpredictable environments (Floreano & Urzelai, 2000; Niv, Joel, Meilijson, & Ruppin, 2002; Soltoggio, Bullinaria, Mattiussi, Dürr, & Floreano, 2008). In this paper, the term plastic ANNs refers in particular to ANNs that can accordingly change their connection weights during their lifetime, while the term adaptive ANNs refers to the larger class of ANNs that can adapt through any means (e.g., through recurrent connections). In a recent demonstration of the power of the plastic approach, Soltoggio et al. (2008) evolved plastic Hebbian networks with neuromodulation, that is, networks in which some neurons can enhance or dampen the neural plasticity of their target nodes. These networks acquired the ability to memorize the position of a reward from previous trials in a T-Maze learning problem. However, evolving adaptive controllers for more complicated tasks has proven difficult in part because learning to learn is deceptive, which is the focus of this paper.

Objective functions often exhibit the pathology of local optima (Goldberg, 2007; Mitchell, Forrest, & Holland, 1991), and the more ambitious the goal, the more likely it is that search can be deceived by suboptimal solutions (Lehman & Stanley, 2008, 2010a). In particular, if fitness does not reward the stepping stones that lead to the final solution in the search space, fitness-based search may be led astray. Deception in domains that require adaptation is particularly pathological for two primary reasons: (1) reaching a mediocre fitness through non-adaptive behavior is often relatively easy, but any further improvement requires an improbable leap to sophisticated adaptive behavior, and (2) only sparse feedback on the acquisition of adaptive behavior is available from an objective-based performance measure. Because it is easier at first to improve fitness without evolving the ability to learn, objective functions may sometimes exploit domain-dependent static (i.e., non-adaptive) heuristics that can lead them further away from the adaptive solution in the genotypic search space, as analysis in this paper will confirm.

Because of the problem of deception in adaptive domains, prior experiments in evolving plastic ANNs have needed to be carefully designed to ensure that no non-adaptive heuristics exist that could potentially lead evolution prematurely astray. This awkward requirement has significantly limited the scope of domains amenable to adaptive evolution and stifled newcomers from entering the research area. To remedy this situation and open up the range of problems amenable to evolving adaptation, this paper proposes that the novelty search algorithm (Lehman & Stanley, 2008), which abandons the traditional notion of objective-based fitness, circumvents the deception inherent in adaptive domains. Instead of searching for a final objective behavior, novelty search rewards finding any instance whose behavior is significantly different from what has been discovered before. Surprisingly, this radical form of search has been shown to outperform traditional fitness-based search in several deceptive domains (Lehman & Stanley, 2008, 2010b, 2010c; Mouret, 2009), suggesting that it may be applicable to addressing the problem of deception in evolving plastic ANNs, which is the focus of this paper.
To demonstrate the potential of this approach, this paper first compares novelty search to fitness-based evolution in a dynamic, reward-based single T-Maze scenario first studied in the context of NE by Blynel and Floreano (2003) and further investigated by Soltoggio et al. (2008) to demonstrate the advantage of neuromodulated plasticity. In this scenario, the reward location is a variable factor in the environment that the agent must learn to exploit. Because the aim of this paper is to show that novelty search solves particularly difficult problems in the evolution of plastic networks, and it has been shown that neuromodulation is critical to those domains (Soltoggio et al., 2008), all evolved ANNs employ this most effective form of plasticity.

Counterintuitively, novelty search significantly outperforms regular fitness-based search in the T-Maze learning problem because it returns more information about how behavior changes throughout the search space. To explain this result and understand the nature of deception in this domain, the locus of deception in the T-Maze is uncovered through a Sammon's mapping visualization that shows how fitness-based search and novelty search navigate the high-dimensional genotypic search space. The main result is that genotypes that are leveraged by novelty search as stepping stones can in fact lead fitness-based search astray.

Furthermore, deceptiveness in reward-based scenarios can increase when learning is only needed in a low percentage of trials. In that case, evolution is trapped in local optima that do not require learning at all because high fitness is achieved in the majority of trials. By varying the number of times the reward location changes in the T-Maze domain, the effect of adaptation on the fitness function can be controlled to make the domain more or less deceptive for objective-based fitness. While fitness-based search performs worse with increased domain deception (as one would expect), novelty search is not significantly affected, suggesting an intriguing new approach to evolving adaptive behavior. The interesting aspect of this observation is that novelty search both solves the problem and solves it in a general way despite lacking any incentive to do so.

Additional experiments in the more complicated double T-Maze domain and a bee foraging task add further evidence to the hypothesis that novelty search can effectively overcome the deception inherent in many dynamic, reward-based scenarios. In these domains, novelty search still significantly outperforms fitness-based search even with an increased behavioral search space and raised domain complexity.

The paper begins with a review of novelty search and evolving adaptive ANNs in the next section. The T-Maze domain is then described in Section 3, followed by the experimental design in Section 4. Results are presented in Section 5 and a detailed analysis of the inherent deception in the T-Maze domain is conducted in Section 6. The double T-Maze and bee domain experiments are described in Section 7. The paper concludes with a discussion and ideas for future work in Section 8.

2 Background

This section first reviews novelty search, which is the proposed solution to deception in the evolution of learning. Then an overview of evolving plastic ANNs is given, focusing on the neuromodulation-based model followed in this paper. The section concludes with a description of NEAT, which is augmented in this paper to encode neuromodulated plasticity.
2.1 The Search for Novelty

The problem with the objective fitness function in evolutionary computation is that it does not necessarily reward the intermediate stepping stones that lead to the objective. The more ambitious the objective, the harder it is to identify a priori these stepping stones. This paper hypothesizes that evolving plastic ANNs is especially susceptible to missing the essential intermediate stepping stones for fitness-based search and is therefore highly deceptive. Reaching a mediocre fitness through non-adaptive behavior is relatively easy, but any further improvement requires sophisticated adaptive behavior with only sparse feedback from an objective-based performance measure. Such deception is inherent in most dynamic, reward-based scenarios.

A potential solution to this problem is novelty search, which is a recent method for avoiding deception based on the radical idea of ignoring the objective (Lehman & Stanley, 2008, 2010a). The idea is to identify novelty as a proxy for stepping stones. That is, instead of searching for a final objective, the learning method is rewarded for finding any behavior whose functionality is significantly different from what has been discovered before. Thus, instead of an objective function, search employs a novelty metric. That way, no attempt is made to measure overall progress. In effect, such a process gradually accumulates novel behaviors. This idea is also related to the concept of curiosity and seeking novelty in reinforcement learning research (Schmidhuber, 2003, 2006).

Although it is counterintuitive, novelty search was actually more effective at finding the objective than a traditional objective-based fitness function in a deceptive navigation domain that requires an agent to navigate through a maze to reach a specific goal location (Lehman & Stanley, 2008; Mouret, 2009), in evolving biped locomotion (Lehman & Stanley, 2010a), and in evolving a program for an artificial ant benchmark task (Lehman & Stanley, 2010b). Thus novelty search might be a solution to the longstanding problem with training for adaptation. The next section describes the novelty search algorithm (Lehman & Stanley, 2008) in more detail.

2.1.1 The Novelty Search Algorithm

Evolutionary algorithms are well-suited to novelty search because the population that is central to such algorithms naturally covers a wide range of expanding behaviors. In fact, tracking novelty requires little change to any evolutionary algorithm aside from replacing the fitness function with a novelty metric. The novelty metric measures how different an individual is from other individuals, creating a constant pressure to do something new. The key idea is that instead of rewarding performance on an objective, novelty search rewards diverging from prior behaviors. Therefore, novelty needs to be measured.

There are many potential ways to measure novelty by analyzing and quantifying behaviors to characterize their differences. Importantly, like the fitness function, this measure must be fitted to the domain. The novelty of a newly generated individual is computed with respect to the observed behaviors (i.e., not the genotypes) of an archive of past individuals whose behaviors were highly novel when they originated. In addition, if the evolutionary algorithm is steady state (i.e., one individual is replaced at a time), then the current population can also supplement the archive by representing the most recently visited points.
The aim is to characterize how far away the new individual is from the rest of the population and its predecessors in novelty space, i.e., the space of unique behaviors. A good metric should thus compute the sparseness at any point in the novelty space. Areas with denser clusters of visited points are less novel and therefore rewarded less. A simple measure of sparseness at a point is the average distance to the k-nearest neighbors of that point, where k is a fixed parameter that is determined experimentally. Intuitively, if the average distance to a given point's nearest neighbors is large then it is in a sparse area; it is in a dense region if the average distance is small. The sparseness ρ at point x is given by

    ρ(x) = (1/k) ∑_{i=1}^{k} dist(x, μ_i),   (1)

where μ_i is the ith-nearest neighbor of x with respect to the distance metric dist, which is a domain-dependent measure of behavioral difference between two individuals in the search space. The nearest neighbors calculation must take into consideration individuals from the current population and from the permanent archive of novel individuals. Candidates from more sparse regions of this behavioral search space then receive higher novelty scores. It is important to note that this novelty space cannot be explored purposefully; that is, it is not known a priori how to enter areas of low density just as it is not known a priori how to construct a solution close to the objective. Thus moving through the space of novel behaviors requires exploration. In effect, because novelty is measured relative to other individuals in evolution, it is driven by a coevolutionary dynamic.

If novelty is sufficiently high at the location of a new individual, i.e., above some minimal threshold ρ_min, then the individual is entered into the permanent archive that characterizes the distribution of prior solutions in novelty space, similarly to archive-based approaches in coevolution (De Jong, 2004). The current generation plus the archive give a comprehensive sample of where the search has been and where it currently is; that way, by attempting to maximize the novelty metric, the gradient of search is simply towards what is new, with no other explicit objective. To ensure that the archive continues to push the search to new areas and does not expand too fast, the threshold ρ_min is adjusted dynamically (e.g., by lowering ρ_min if no new individuals are added during a certain number of evaluations) to maintain a healthy rate of expansion.

It is important to note that novelty search resembles prior diversity maintenance techniques (i.e., speciation) popular in evolutionary computation (Darwen & Yao, 1996; Goldberg & Richardson, 1987; Hornby, 2006; Hu, Goodman, Seo, Fan, & Rosenberg, 2005; Mahfoud, 1995). The most well known are variants of fitness sharing (Darwen & Yao, 1996; Goldberg & Richardson, 1987). These also in effect open up the search by reducing selection pressure. However, in these methods, as in Hutter's fitness uniform selection (Hutter & Legg, 2006), the search is still ultimately guided by the fitness function. Diversity maintenance simply keeps the population more diverse than it otherwise would be. Also, most diversity maintenance techniques measure genotypic diversity as opposed to behavioral diversity (Darwen & Yao, 1996; Mahfoud, 1995). In contrast, novelty search takes the radical step of only rewarding behavioral diversity with no concept of fitness or a final objective, inoculating it against traditional deception.
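As a minimal sketch (in Python, with illustrative names; this is not the original implementation), the sparseness of Equation 1 and the archive update can be expressed as follows, where dist is the domain-dependent behavioral distance and k and ρ_min are set as described later in Section 4.3:

```python
def sparseness(candidate, population, archive, dist, k=15):
    """Average behavioral distance to the k nearest neighbors (Equation 1)."""
    others = [b for b in population + archive if b is not candidate]
    if not others:
        return float("inf")  # nothing to compare against yet
    nearest = sorted(dist(candidate, b) for b in others)[:k]
    return sum(nearest) / len(nearest)


def evaluate_novelty(candidate, population, archive, dist, rho_min, k=15):
    """Novelty score used in place of fitness; sufficiently novel behaviors
    enter the permanent archive (the threshold rho_min is adapted online)."""
    rho = sparseness(candidate, population, archive, dist, k)
    if rho > rho_min:
        archive.append(candidate)
    return rho
```

In the experiments in this paper, k = 15 and the initial threshold ρ_min = 2.0, with ρ_min adjusted dynamically over evaluations (Section 4.3).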
Other related methods seek to accelerate search through neutral networks by recognizing neutral areas in the search space (Stewart, 2001; Barnett, 2001). Stewart (2001) explicitly rewards drifting further away in genotype space from the center of the population once a neutral network is encountered. Similarly, Barnett (2001) seeks to accelerate movement across a neutral network of equal objective fitness by reducing the population to one individual. However, identifying when the search is actually stalled may be difficult in practice, and while such approaches potentially decrease the search complexity, finding the objective might still take a long time depending on the deceptiveness of the task.

It is also important to note that novelty search is not a random walk; rather, it explicitly maximizes novelty. Because novelty search includes an archive that accumulates a record of where search has been, backtracking, which can happen in a random walk, is effectively avoided in behavioral spaces of any dimensionality. In this way, novelty search resembles tabu search (Glover & Laguna, 1997), which keeps a list of potential solutions to avoid repeatedly visiting the same points. However, tabu search still tries to measure overall progress and therefore can be potentially led astray by deception.

The novelty search approach in general allows any behavior characterization and any novelty metric. Although generally applicable, novelty search is best suited to domains with deceptive fitness landscapes, intuitive behavioral characterization, and domain constraints on possible expressible behaviors. Changing the way the behavior space is characterized and the way characterizations are compared will lead to different search dynamics, similarly to how researchers now change the fitness function to improve the search. The intent is not to imply that setting up novelty search is easier than objective-based search. Rather, once novelty search is set up, the hope is that it can find solutions beyond what even a sophisticated objective-based search can currently discover. Thus the effort is justified in its returns.

In summary, novelty search depends on the following four main concepts:

• Individuals' behaviors are characterized so that they can be compared.
• The novelty of an individual is computed with respect to observed behaviors of other individuals and not others' genotypes.
• Novelty search replaces the fitness function with a novelty metric that computes the sparseness at any point in the novelty space.
• An archive of past individuals is maintained whose behaviors were highly novel.

The evolutionary algorithm that evolves neuromodulated plastic networks (explained later in Section 2.3) through novelty search in this paper is NeuroEvolution of Augmenting Topologies (NEAT; Stanley & Miikkulainen, 2002), which offers the ability to discover minimal effective plastic topologies. The next section reviews the evolution of adaptive ANNs and details the model for neuromodulated plasticity in this paper, which is followed by an explanation of NEAT.

2.2 Evolving Adaptive Neural Networks

Researchers have been evolving adaptive ANNs for more than fifteen years. Early work often focused on combining the built-in adaptive capabilities of backpropagation with NE. For example, Nolfi and Parisi (1993, 1996) evolved self-teaching networks that trained a motor control network through backpropagation from the outputs of a teaching subnetwork.
In separate work, they evolved a network that learns through backpropagation to predict what it would see after moving around in its environment (Nolfi, Parisi, & Elman, 1994). Learning to predict the next state during the network's lifetime was shown to enhance performance in a foraging task. Interestingly, Chalmers (1990) evolved a global learning rule (i.e., a rule that applies to every connection) and discovered that the evolved rule was similar to the well-known delta rule used in backpropagation. Furthermore, McQuesten and Miikkulainen (1997) showed that NE can benefit from parent networks teaching their offspring through backpropagation.

Baxter (1992) performed early work on evolving networks with synaptic plasticity driven by local learning rules, setting the stage for NE of plastic ANNs. He evolved a very simple network that could learn Boolean functions of one value. Each connection had a rule for changing its weight to one of two possible values. Baxter's contribution was mainly to show that local learning rules are sufficient to evolve a plastic network.

Floreano and Urzelai (2000) later showed that the evolution of local (node-based) synaptic plasticity parameters produces networks that can solve complex problems better than recurrent networks with fixed weights. In Floreano and Urzelai's experiment, a plastic network and a fixed-weight fully-recurrent network were evolved to turn on a light by moving to a switch. After the light turned on, the networks had to move onto a gray square. The plastic networks were compared to the fixed-weight networks. Each connection in the plastic network included a learning rule and a learning rate. The fixed-weight network only encoded static connection weights. The sequence of two actions proved difficult to learn for the fixed-weight network because the network could not adapt to the sudden change in goals after the light was switched on. Fixed-weight networks tended to circle around the environment, slightly attracted by both the light switch and the gray square. Plastic networks, on the other hand, completely changed their trajectories after turning on the light, reconfiguring their internal weights to tackle the problem of finding the gray square. This landmark result established the promise of evolving plastic ANNs and showed that in fact plastic networks can sometimes evolve faster than static networks. The local learning rules in the evolved networks facilitated the policy transition from one task to the other.

Plastic ANNs have also been successfully evolved for simulated robots in a dangerous foraging domain (Stanley et al., 2003). Although this work also showed that recurrent fixed-weight networks can be more effective and reliable than plastic Hebbian controllers in some domains, more recent studies (Niv et al., 2002; Soltoggio et al., 2008; Soltoggio, Dürr, Mattiussi, & Floreano, 2007) suggest that both network types reach their limits when more elaborate forms of learning are needed. For example, classical conditioning seems to require mechanisms that are not present in most current network models. To expand to such domains, following Soltoggio et al. (2008), the study presented in this paper controls plasticity through neuromodulation.

2.2.1 Neuromodulated Plasticity

In the plastic ANNs presented in the previous section (e.g., Floreano & Urzelai, 2000; Stanley et al., 2003), the internal synaptic connection strengths change following a Hebbian learning rule that modifies synaptic weights based on pre- and postsynaptic neuron activity.
The generalized Hebbian plasticity rule (Niv et al., 2002) takes the following form:

    Δw = η · [A·x·y + B·x + C·y + D],   (2)

where η is the learning rate, x and y are the activation levels of the presynaptic and postsynaptic neurons, and A–D are the correlation term, presynaptic term, postsynaptic term, and constant, respectively.

In a neuromodulated network, a special neuromodulatory neuron can change the degree of potential plasticity between two standard neurons based on their activation levels (Figure 1).

Figure 1: Neuromodulated plasticity. The weight of the connection between standard neurons n1 and n2 is modified by a Hebbian rule. Modulatory neuron m determines the magnitude of the weight change.

In addition to its standard activation value a_i, each neuron i also computes its modulatory activation m_i:

    a_i = ∑_{j∈Std} w_{ji} · o_j,   (3)

    m_i = ∑_{j∈Mod} w_{ji} · o_j,   (4)

where w_{ji} is the connection strength between presynaptic neuron j and postsynaptic neuron i, and o_j is calculated as o_j(a_j) = tanh(a_j/2). The weight between neurons j and i then changes following the m_i-modulated plasticity rule

    Δw_{ji} = tanh(m_i/2) · η · [A·o_j·o_i + B·o_j + C·o_i + D].   (5)

The benefit of adding modulation is that it allows the ANN to change the level of plasticity on specific neurons at specific times. That is, it becomes possible to decide when learning should stop and when it should start. This property seems to play a critical role in regulating learning behavior in animals (Carew, Walters, & Kandel, 1981), and neuromodulated networks have a clear advantage in more complex dynamic, reward-based scenarios: Soltoggio et al. (2008) showed that networks with neuromodulated plasticity significantly outperform both fixed-weight and traditional plastic ANNs without neuromodulation in the double T-Maze domain, and display nearly optimal learning performance. The next section describes NEAT, the method that evolves plastic neuromodulated ANNs in this paper.
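Before moving on, the update defined by Equations 2–5 can be made concrete with a minimal sketch (Python; variable and function names are illustrative, not taken from the evolved networks or the rtNEAT package):

```python
import math


def neuron_output(activation):
    # o_j(a_j) = tanh(a_j / 2), the output used in Equations 3-5
    return math.tanh(activation / 2.0)


def modulated_hebbian_update(w_ji, o_j, o_i, m_i, eta, A, B, C, D):
    """One application of the m_i-modulated plasticity rule (Equation 5).

    o_j, o_i: outputs of the pre- and postsynaptic neurons.
    m_i: modulatory activation of the postsynaptic neuron (Equation 4).
    """
    delta = math.tanh(m_i / 2.0) * eta * (A * o_j * o_i + B * o_j + C * o_i + D)
    return w_ji + delta
```

With the coefficients used later in the T-Maze experiments (A = B = D = 0, C = −0.38, η = −94.6), the bracketed term reduces to a single postsynaptic term and the rule collapses to the form given in Equation 7 of Section 4.3.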
2.3 NeuroEvolution of Augmenting Topologies (NEAT)

The NEAT method was originally developed to evolve ANNs to solve difficult control and sequential decision tasks and has proven successful in a wide diversity of domains (Aaltonen et al., 2009; Stanley et al., 2003, 2005; Stanley & Miikkulainen, 2002; Taylor, Whiteson, & Stone, 2006; Whiteson & Stone, 2006). Evolved ANNs control agents that select actions based on their sensory inputs. NEAT is unlike many previous methods that evolved neural networks, i.e., neuroevolution methods, which traditionally evolve either fixed-topology networks (Gomez & Miikkulainen, 1999; Saravanan & Fogel, 1995) or arbitrary random-topology networks (Angeline, Saunders, & Pollack, 1994; Gruau, Whitley, & Pyeatt, 1996; Yao, 1999). Instead, NEAT begins evolution with a population of small, simple networks and complexifies the network topology into diverse species over generations, leading to increasingly sophisticated behavior. A similar process of gradually adding new genes has been confirmed in natural evolution (Martin, 1999; Watson, Hopkins, Roberts, Steitz, & Weiner, 1987) and shown to improve adaptation in a few prior evolutionary (Watson et al., 1987) and neuroevolutionary (Harvey, 1993) approaches. However, a key feature that distinguishes NEAT from prior work in complexification is its unique approach to maintaining a healthy diversity of complexifying structures simultaneously, as this section reviews.

Complete descriptions of the NEAT method, including experiments confirming the contributions of its components, are available in Stanley and Miikkulainen (2002, 2004) and Stanley et al. (2005). Before describing the neuromodulatory extension, let us review the three key ideas on which the basic NEAT method is based.

First, to allow network structures to increase in complexity over generations, a method is needed to keep track of which gene is which. Otherwise, it is not clear in later generations which individual is compatible with which in a population of diverse structures, or how their genes should be combined to produce offspring. NEAT solves this problem by assigning a unique historical marking to every new piece of network structure that appears through a structural mutation. The historical marking is a number assigned to each gene corresponding to its order of appearance over the course of evolution. The numbers are inherited during crossover unchanged, and allow NEAT to perform crossover among diverse topologies without the need for expensive topological analysis.

Second, historical markings make it possible for the system to divide the population into species based on how similar they are topologically. That way, individuals compete primarily within their own niches instead of with the population at large. Because adding new structure is often initially disadvantageous, this separation means that unique topological innovations are protected and therefore have time to optimize their structure before competing with other niches in the population. The distance δ between two network encodings can be measured as a linear combination of the number of excess (E) and disjoint (D) genes, as well as the average weight differences of matching genes (W̄), where excess genes are those that arise in the lineage of one parent at a time later than all the genes in the other parent, and disjoint genes are any other genes in the lineage of one parent but not the other one (Stanley & Miikkulainen, 2002, 2004):

    δ = (c1·E)/N + (c2·D)/N + c3·W̄.   (6)

The coefficients c1, c2, and c3 adjust the importance of the three factors, and the factor N, the number of genes in the larger genome, normalizes for genome size (N is normally set to one unless both genomes are excessively large; accordingly, it is set to one in this paper). Genomes are tested one at a time; if a genome's distance to a representative member of the species is less than δ_t, a compatibility threshold, the genome is placed into this species.

Third, many systems that evolve network topologies and weights begin evolution with a population of random topologies (Gruau et al., 1996; Yao, 1999). In contrast, NEAT begins with a uniform population of simple networks with no hidden nodes, differing only in their initial random weights. Because of speciation, novel topologies gradually accumulate over evolution, thereby allowing diverse and complex phenotype patterns to be represented. No limit is placed on the size to which topologies can grow. New structures are introduced incrementally as structural mutations occur, and only those structures survive that are found to be useful through fitness evaluations. In effect, then, NEAT searches for a compact, appropriate topology by incrementally increasing the complexity of existing structure.
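As a concrete illustration of the compatibility distance in Equation 6, the following minimal sketch (Python) computes δ for two genomes represented simply as dictionaries keyed by historical markings; this simplified representation is an assumption for illustration and is not the actual rtNEAT data structure:

```python
def compatibility_distance(genome1, genome2, c1=1.0, c2=1.0, c3=1.0, N=1.0):
    """NEAT genome distance (Equation 6).

    Each genome maps a historical marking (innovation number) to a connection
    weight.  Excess genes lie beyond the highest marking of the other genome;
    disjoint genes are the remaining non-matching genes.
    """
    markings1, markings2 = set(genome1), set(genome2)
    matching = markings1 & markings2
    cutoff = min(max(markings1), max(markings2))
    non_matching = markings1 ^ markings2          # symmetric difference
    excess = sum(1 for m in non_matching if m > cutoff)
    disjoint = len(non_matching) - excess
    # Average weight difference of matching genes (W-bar in Equation 6).
    w_bar = (sum(abs(genome1[m] - genome2[m]) for m in matching) / len(matching)
             if matching else 0.0)
    return (c1 * excess) / N + (c2 * disjoint) / N + c3 * w_bar
```

In the experiments in this paper c1 = c2 = c3 = 1.0 and N = 1 (Section 4.3), and a genome joins the first species whose representative lies within the compatibility threshold δ_t.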
Few modifications to the standard NEAT algorithm are required to also encode neuromodulated plasticity. NEAT's genetic encoding is augmented with a new modulatory neuron type, and each time a node is added through structural mutation, it is randomly assigned a standard or modulatory role. The neuromodulatory dynamics follow Equations 3–5.

Also, importantly for this paper, novelty search is designed to work in combination with NEAT (Lehman & Stanley, 2008, 2010c). In particular, once objective-based fitness is replaced with novelty, the NEAT algorithm operates as normal, selecting the highest scoring individuals to reproduce. Over generations, the population spreads out across the space of possible behaviors, continually ascending to new levels of complexity (i.e., by expanding the neural networks in NEAT) to create novel behaviors as the simpler variants are exhausted. Thus, through NEAT, novelty search in effect searches not just for new behaviors, but for increasingly complex behaviors.

Therefore, the main idea is to evolve neuromodulatory ANNs with NEAT through novelty search. The hypothesis is that this combination should help to escape the deception inherent in many adaptive domains. The next section describes such a domain, which is the initial basis for testing this hypothesis in this paper.

3 The T-Maze Domain

The first domain in this paper is based on experiments performed by Soltoggio et al. (2008) on the evolution of neuromodulated networks for the T-Maze learning problem. This domain is ideal to test the hypothesis that novelty search escapes deception in adaptive domains because it is well-established from prior work (Blynel & Floreano, 2003; Dürr, Mattiussi, Soltoggio, & Floreano, 2008; Soltoggio et al., 2008, 2007) and can be adjusted to be more or less deceptive, as is done in this paper. Furthermore, it represents a typical reward-based dynamic scenario (i.e., the agent's actions that maximize reward intake can change during its lifetime), where optimal performance can only be obtained by an adaptive agent. Thus the results presented here should also provide more insight into the potential deceptiveness in similar learning problems.

Figure 2: The T-Maze. In this depiction, high reward is located on the left and low reward is on the right side, but these positions can change over a set of trials. The goal of the agent is to navigate to the position of the high reward and back home to its starting position. The challenge is that the agent must remember the location of the high reward from one trial to the next.

The single T-Maze (Figure 2) consists of two arms that either contain a high or low reward. The agent begins at the bottom of the maze and its goal is to navigate to the reward position and return home. This procedure is repeated many times during the agent's lifetime. One such attempted trip to a reward location and back is called a trial. A deployment consists of a set of trials (e.g., 20 trials are attempted over the course of a deployment in the single T-Maze experiments in this paper). The goal of the agent is to maximize the amount of reward collected over deployments, which requires it to memorize the position of the high reward in each deployment. When the position of the reward sometimes changes, the agent should alter its strategy accordingly to explore the other arm of the maze in the next trial. In Soltoggio's original experiments (Soltoggio et al., 2008), the reward location changes at least once during each deployment of the agent, which fosters the emergence of learning behavior.
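To make the trial/deployment structure concrete, the following sketch runs a single deployment (Python). The agent and maze interfaces are hypothetical placeholders rather than the actual simulator; the trial count and switch point shown here follow the single T-Maze setup described later in Section 4.

```python
def run_deployment(agent, maze, trials=20, switch_after=10):
    """Run one deployment: a sequence of trials in which the high-reward arm
    may switch sides partway through (hypothetical interface)."""
    maze.place_high_reward("left")           # high reward always starts on the left
    outcomes = []
    for trial in range(trials):
        if trial == switch_after:            # move the high reward mid-deployment
            maze.place_high_reward("right")
        agent.reset_position()               # back to home; synaptic weights persist
        reward, crashed = maze.run_trial(agent)
        outcomes.append((reward, crashed))   # e.g. ('high', False) is the HN outcome
    return outcomes
```

The key point is that the agent's plastic weights persist across trials within a deployment, so memorizing the high-reward arm (and revising that memory after the switch) is what an adaptive solution must accomplish.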
However, the deceptiveness of this domain with respect to the evolution of learning can be increased if the reward location is not changed in all deployments in which the agent is evaluated. For example, an individual that performs well in 99 out of 100 deployments wherein learning is not required and only fails in the one deployment that requires learning will most likely score a high fitness value. Thus such a search space is highly deceptive to evolving learning, and the stepping stones that ultimately lead to an adaptive agent will not be rewarded. The problem is that learning domains often have the property that significant improvement in fitness is possible by discovering hidden heuristics that avoid lifetime adaptation entirely, creating a pathological deception against learning to learn. If adaptation is thus only required in a small subset of deployments, the advantage in fitness of an adaptive individual over a non-adaptive individual (i.e., one always navigating to the same side) is only marginal. The hypothesis is that novelty search should outperform fitness-based search with increased domain deception.

4 Single T-Maze Experiment

To compare the performance of NEAT with fitness-based search and NEAT with novelty search, each agent is evaluated on ten deployments, each consisting of 20 trials. The number of deployments in which the high reward is moved after ten trials varies among one (called the 1/10 scenario), five (called the 5/10 scenario), and ten (called the 10/10 scenario), effectively controlling the level of deception. The high reward always begins on the left side at the start of each deployment. Note that all deployments are deterministic, that is, a deployment in which the reward does not switch sides will always lead to the same outcome with the same ANN. Thus the number of deployments in which the reward switches is effectively a means to control the proportional influence of adaptive versus non-adaptive deployments on fitness and novelty. The question is whether the consequent deception impacts novelty as it does fitness. Of course, it is important to note that a population rewarded for performance in the 1/10 scenario would not necessarily be expected to be attracted to a general solution. At the same time, a process like novelty search that continues to find new behaviors should ultimately encounter the most general such behavior. Thus the hypothesized advantage of novelty search in such scenarios follows naturally from the dynamics of these different types of search.

Figure 3: ANN topology. The network has four inputs, one bias, and one output neuron (Soltoggio et al., 2008). Turn is set to 1.0 at a turning point location. M-E is set to 1.0 at the end of the maze, and Home is set to 1.0 when the agent returns to the home location. The Reward input returns the level of reward collected at the end of the maze. The bias neuron emits a constant 1.0 activation that can connect to other neurons in the ANN. Network topology is evolved by NEAT.

Figure 3 shows the inputs and outputs of the ANN (following Soltoggio et al., 2008). The Turn input is set to 1.0 when a turning point is encountered. M-E is set to 1.0 at the end of the maze and Home becomes 1.0 when the agent successfully navigates back to its starting position. The Reward input is set to the amount of reward collected at the maze end. An agent crashes if it does not (1) maintain a forward direction (i.e.,
activation of the output neuron between −0.3 and 0.3) in corridors, (2) turn either right (o > 0.3) or left (o < −0.3) when it encounters the junction, or (3) make it back home after collecting the reward. If the agent crashes then the current trial is terminated.

The fitness function for fitness-based NEAT (which is identical to that of Soltoggio et al., 2008) is calculated as follows: collecting the high reward has a value of 1.0 and the low reward is worth 0.2. If the agent fails to return home by taking a wrong turn after collecting a reward then a penalty of 0.3 is subtracted from fitness. On the other hand, 0.4 is subtracted if the agent does not maintain forward motion in corridors or does not turn left or right at a junction. The total fitness of an individual is determined by summing the fitness values for each of the 20 trials over all ten deployments. Novelty search, on the other hand, requires a novelty metric to distinguish between different behaviors. The novelty metric for this domain distinguishes between learning and non-learning individuals and is explained in more detail in the next section.

4.1 Measuring Novelty in the Single T-Maze

The aim of the novelty metric is to measure differences in behavior. In effect, it determines the behavior space through which the search explores. Because the goal of this paper is to evolve adaptive individuals, the novelty metric must distinguish a learning agent from a non-learning agent. Thus it is necessary to characterize behavior so that different such behaviors can be compared.

The behavior of an agent in the T-Maze domain is characterized by a series of trial outcomes (i.e., 200 trial outcomes for ten deployments with 20 trials each). To observe learning behavior, and to distinguish it from non-learning behavior, it is necessary to run multiple trials in a single lifetime, such that the agent's behavior before and after a reward switch can be observed. Importantly, the behavior space in the T-Maze domain is therefore significantly larger than in prior experiments (Lehman & Stanley, 2008), effectively testing novelty search's ability to succeed in a high-dimensional behavior space of 200 dimensions (versus only two dimensions in Lehman & Stanley, 2008). It is important to note that the dimensionality of the behavior space is not the only possible characterization of the dimensionality of the problem. For example, the dimensionality of the solution ANN is also significantly related to the difficulty of the problem.

Each trial outcome is characterized by two values: (1) the amount of reward collected (high, low, none) and (2) whether or not the agent crashed. These outcomes are assigned different distances to each other depending on how similar they are (Figure 4).

Figure 4: The T-Maze novelty metric. Each trial is characterized by (1) the amount of collected reward and (2) whether the agent crashed, yielding five possible trial outcomes: NY (no reward, crashed), LY (low reward, crashed), HY (high reward, crashed), LN (low reward, returned home), and HN (high reward, returned home). The pairwise distances among these five outcomes depend on their behavioral similarities.

In particular, an agent that collects the high reward and returns home successfully without crashing (HN) should be more similar to an agent that collects the low reward and also returns home (LN) than to one that crashes without reaching any reward location (NY).
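One way to implement this behavioral characterization is sketched below (Python). Only two facts about the outcome distances are fixed by the text: identical outcomes are 0 apart and LN versus HN are 1 apart (see Figure 5); the remaining table entries are illustrative assumptions that simply grow with behavioral dissimilarity, in the spirit of Figure 4.

```python
# Five possible trial outcomes (Figure 4): 'NY' (no reward, crash),
# 'LY' (low reward, crash), 'HY' (high reward, crash),
# 'LN' (low reward, home), 'HN' (high reward, home).
OUTCOME_DISTANCE = {
    frozenset(("LN", "HN")): 1.0,   # differ only in which reward was collected
    frozenset(("LY", "HY")): 1.0,   # assumed: same relation among crashing agents
    frozenset(("NY", "LY")): 1.0,   # assumed
    frozenset(("NY", "HY")): 2.0,   # assumed
    frozenset(("LY", "LN")): 2.0,   # assumed: crash vs. successful return home
    frozenset(("HY", "HN")): 2.0,   # assumed
    frozenset(("LY", "HN")): 3.0,   # assumed
    frozenset(("HY", "LN")): 3.0,   # assumed
    frozenset(("NY", "LN")): 3.0,   # assumed
    frozenset(("NY", "HN")): 3.0,   # assumed
}


def outcome_distance(o1, o2):
    """Distance between two single-trial outcomes."""
    return 0.0 if o1 == o2 else OUTCOME_DISTANCE[frozenset((o1, o2))]


def behavioral_distance(outcomes1, outcomes2):
    """dist_novelty: sum of per-trial outcome distances over all deployments."""
    return sum(outcome_distance(a, b) for a, b in zip(outcomes1, outcomes2))
```

Applied to the two non-learning agents of Figure 5 below, whose outcomes differ only in trials where one agent collected the low reward and the other the high reward, this reproduces the value dist_novelty(agent1, agent2) = 4.0 discussed in the next paragraphs.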
The novelty distance metric dist_novelty is ultimately computed by summing the distances between each trial outcome of two individuals over all deployments.

Figure 5 depicts outcomes over several trials of three example agents. The first agent always alternates between the left and the right T-Maze arm, which leads to oscillating low and high rewards. The second agent always navigates to the left T-Maze arm. This strategy results in collecting the high reward in the first four trials and then collecting the low reward after the reward switch. The third agent exhibits the desired learning behavior and is able to collect the high reward in seven out of eight trials. (One trial of exploration is needed after the reward switch.) Interestingly, because both agents one and two collect the same amount of high and low reward, they achieve the same fitness, making them indistinguishable to fitness-based search. However, novelty search discriminates between them because dist_novelty(agent1, agent2) = 4.0 (Figure 5).

Figure 5: Three sample behaviors. These learning and non-learning individuals all exhibit distinguishable behaviors when compared over multiple trials. Agent three achieves the desired adaptive behavior. The vertical line indicates the point in time at which the position of the high reward changed. While agents 1 and 2 look the same to fitness, novelty search notices their difference, as the distance calculation between agents 1 and 2 shows.

Recall that this behavioral distance is part of the novelty metric (Equation 1), which replaces the fitness function and estimates the sparseness at a specific point in behavior space. Importantly, fitness and novelty both use the same information (i.e., the amount of reward collected and whether or not the agent crashed) to explore the search space, though in a completely different way. Thus the comparison is fair.

4.2 Generalization Performance

An important goal of the comparison between fitness and novelty is to determine which learns to adapt most efficiently in the different deployment scenarios, i.e., 1/10, 5/10, and 10/10. Because performance will vary with the number of deployments in which the reward location switches, analyzing the results requires an independent measure that reveals the overall adaptive capabilities of each individual. Therefore, to test the ability of the individuals to generalize independently of the number of deployments in which the position of the high reward changes, they are tested for 20 trials on each of two different initial settings: (1) high reward starting left and (2) high reward starting right. In both cases, the position of the high reward changes after 10 trials. An individual passes the generalization test if it can collect the high reward and return home in at least 18 out of 20 trials from both initial positions. Two low-reward trials in each setting are necessary to explore the T-Maze at the beginning of each deployment and when the position of the high reward switches. The generalization measure does not necessarily correlate to fitness. An individual that receives a high fitness in the 1/10 scenario can potentially perform poorly on the generalization test because it does not exhibit adaptive behavior.
Nevertheless, generalization performance does follow a general upward trend over evaluations and reveals the ultimate quality of solutions (i.e., individuals passing the generalization test would receive high fitness scores in all scenarios).

4.3 Experimental Parameters

NEAT with fitness-based search and novelty search run with the same parameters in the experiments in this paper. The steady-state real-time NEAT (rtNEAT) package (Stanley, 2006-2008) is extended to encode neuromodulatory neurons. The population size is 500, with a 0.001 probability of adding a node (uniformly randomly chosen to be standard or modulatory) and 0.01 probability of adding a link. The weight mutation power is 1.8. The coefficients c1, c2, and c3 for NEAT's genome distance (see Equation 6) are all set to 1.0. Runs last up to 125,000 evaluations. They are stopped when the generalization test is solved. The number of nearest neighbors for the novelty search algorithm is 15 (following Lehman & Stanley, 2008). The novelty threshold is 2.0. This threshold for adding behaviors to the archive changes dynamically every 1,500 evaluations: if no new individuals are added during that time the threshold is lowered by 5%, and it is raised by 20% if the number of individuals added is equal to or higher than four. The novelty scores of the current population are reevaluated every 100 evaluations to keep them up to date (the archive does not need to be reevaluated). Connection weights range within [-10, 10]. These parameter values are shared by all experiments in this paper.

The coefficients of the generalized Hebbian learning rule used by all evolved neuromodulated networks in the T-Maze domain are A = 0.0, B = 0.0, C = −0.38, D = 0.0, and η = −94.6, resulting in the following m_i-modulated plasticity rule:

    Δw_{ji} = tanh(m_i/2) · 35.95·y.   (7)

These values worked well for a neuromodulated ANN in the T-Maze learning problem described by Soltoggio et al. (2008). Therefore, to isolate the effect of evolving based on novelty versus fitness, they are fixed at these values in the T-Maze experiments in this paper. However, modulatory neurons still affect the learning rate at Hebbian synapses as usual. For a more detailed description of the implications of different coefficient values for the generalized Hebbian plasticity rule, see Niv et al. (2002).

5 Single T-Maze Results

Because the aim of the experiment is to determine how quickly a general solution is found by fitness-based search and novelty search, an agent that can solve the generalization test described in Section 4.2 counts as a solution.

Figure 6: Comparing generalization of novelty search and fitness-based search. The change in performance (calculated like fitness) over evaluations on the generalization test is shown for NEAT with novelty search and fitness-based search in the 1/10 and 10/10 scenarios. All results are averaged over 20 runs. The main result is that novelty search learns a general solution significantly faster.

Figure 6 shows the average performance (over 20 runs) of the current best-performing individuals on the generalization test across evaluations for novelty search and fitness-based search, depending on the number of deployments in which the reward location changes. Novelty search performs consistently better in all scenarios.
Even in the 10/10 domain that resembles the original experiment (Soltoggio et al., 2008), it takes fitness significantly longer to reach a solution than novelty search. The fitness-based approach initially stalls, followed by gradual improvement, whereas on average novelty search rises sharply from early in the run.

Figure 7: Average evaluations to solution for novelty search and fitness-based search. The average number of evaluations over 20 runs that it took novelty search and fitness-based search to solve the generalization test is shown. Novelty search performs significantly better in all scenarios and fitness-based search performs even worse when deception is high. Interestingly, novelty search performance does not degrade at all with increasing deception.

Figure 7 shows the average number of evaluations (over 20 runs) that it took fitness-based and novelty-based NEAT to solve the generalization test in the 1/10, 5/10, and 10/10 scenarios. If no solution was found within the initial 125,000 evaluations, the current simulation was restarted (i.e., a new run was initiated). This procedure was repeated until a solution was found, counting all evaluations over all restarts. Both novelty and fitness-based NEAT were restarted three times out of 20 runs in the 10/10 scenario. Fitness-based search took on average 90,575 evaluations (σ = 52,760) while novelty search was almost twice as fast at 48,235 evaluations on average (σ = 55,638). This difference is significant (p < 0.05). In the more deceptive 1/10 scenario, fitness-based search had to be restarted six times and it took 124,495 evaluations on average (σ = 81,789) to find a solution. Novelty search only had to be restarted three times and was 2.7 times faster (p < 0.001) at 45,631 evaluations on average (σ = 46,687).

Fitness-based NEAT performs worse with increased domain deception and is 1.4 times slower in the 1/10 scenario than in the 10/10 scenario. It took fitness on average 105,218 evaluations (σ = 65,711) in the intermediate 5/10 scenario, which is in between its performance on the 1/10 and 10/10 scenarios, confirming that deception increases as the number of trials requiring adaptation decreases. In contrast, novelty search is not significantly affected by increased domain deception: the performance differences among the 1/10, 5/10, and 10/10 scenarios are insignificant for novelty search, confirming its immunity to deception.

Recall that individuals cannot avoid spending at least two trials (when the position of the reward switches for both initial settings; Section 4.2) collecting the low reward to pass the generalization test. However, the action after collecting the low reward (e.g., crashing into a wall or taking a wrong turn on the way home) is not part of the criteria for passing the test. To ensure that stricter criteria do not change the results, a second experiment consisting of 50 independent runs was performed in which the agent also had to return home after collecting the low reward. That way, the agent must always return home. It took novelty search on average 191,771 evaluations (σ = 162,601) to solve this harder variation while fitness-based search took 267,830 evaluations (σ = 198,455). In this harder scenario, both methods needed more evaluations to find a solution but the performance difference is still significant (p < 0.05).
Interestingly, novelty search not only outperforms fitness-based search in the highly deceptive 1/10 scenario but also in the intermediate 5/10 scenario and even in the 10/10 scenario, in which the location of the reward changes in every deployment. There is no obvious deception in the 10/10 scenario (which resembles Soltoggio's original experiment; Soltoggio et al., 2008). However, the recurring plateaus in fitness common to all scenarios (Figure 6) suggest a general problem for evolving learning behavior inherent to dynamic, reward-based scenarios. The next section addresses this issue in more depth.

6 Analysis of Deception in the Single T-Maze

Why does novelty search outperform fitness-based search even when the position of the high reward changes in every deployment? To answer this question, the analysis in this section is based on runs in a seemingly non-deceptive setting: every individual is evaluated in one deployment consisting of 20 trials in which the position of the high reward switches after ten trials. The high reward is always located on the left side of the maze at the beginning of each deployment, and an individual counts as a solution if it can collect at least 18 out of 20 high rewards and navigate back home. This reduction in deployments combined with the relaxed solution criteria simplifies an in-depth analysis (e.g., by yielding smaller novelty archives that are easy to visualize) and is analogous to the 10/10 scenario. The aim is to uncover the hidden source of deception and how exactly novelty search avoids it.

One interesting way to analyze different evolutionary search techniques is to examine what they consider best, i.e., which individuals have the highest chances to reproduce. For novelty search, these are the individuals that display the most novel behavior and therefore enter the archive. For fitness-based search, the most rewarded are the champions, i.e., the behaviors with the highest fitness found so far. Although the probabilistic nature of the evolutionary search means that such individuals are not guaranteed to produce offspring, they represent the most likely to reproduce.

Figure 8: Novelty search archive and fitness champions. Behaviors archived by novelty search and the highest-fitness-so-far found by fitness-based search during evolution are shown together with their corresponding fitness and the evaluation at which they were discovered. Agents are evaluated on 20 trials and the reward location switches after 10 trials. Arcs (at top) connect behaviors that were highly rewarded by both methods. Novelty search consistently archives new behaviors while fitness-based search improves maximum fitness only four times during the whole evolutionary run. Many of the behaviors found by novelty search would receive the same fitness, which means they are indistinguishable to fitness-based search. The main result is that a higher number of promising directions are explored by novelty search.
Highlighting the dramatic difference between these contrasting reward systems, Figure 8 shows the behaviors archived by novelty search and the most fit individuals (when they first appear) found by fitness-based search during a typical evolutionary run. It took novelty search 27,410 evaluations to find a solution in this scenario while fitness-based search took almost twice as long with 49,943 evaluations. While novelty search finds 30 behaviors that are novel enough to enter the archive, fitness-based search only discovers five new champions during the whole evolutionary run. A look at the fitness values of the archived novel behaviors reveals that many of them collapse to the same score, making them indistinguishable to fitness-based search (see also Section 4.1 for a discussion of such conflation). For example, the second through fifth archived behaviors in Figure 8, which represent different combinations of ten HY (high reward/crash) and ten LY (low reward/crash) events, all receive the same fitness. However, they are all highly rewarded by novelty search at the time they are discovered, which places them into the archive. In the first 40,429 evaluations, fitness-based search does not discover any new champions, giving it little information about the direction in which the search should proceed. On the other hand, novelty search constantly produces novel behaviors and takes these behaviors and the current population into account to guide the search.

A visualization technique can help to gain a deeper understanding of how the two approaches navigate the high-dimensional genotypic search space. The most common technique to visualize evolution is to plot fitness over evaluations; although this technique reveals information about the quality of the solution found so far, it provides no information on how the search proceeds through the high-dimensional search space. Various methods have been proposed to illuminate the trajectory of the search (Barlow, Galloway, & Abbass, 2002; Kim & Moon, 2003; Vassilev, Fogarty, & Miller, 2000), most of which focus on visualizing the fitness landscape to gain a deeper understanding of its ruggedness. However, the aim of this analysis is to visualize how the genotypes produced by both search methods traverse the search space in relation to each other. Two candidate visualization techniques for this purpose are Principal Component Analysis (PCA) (Kittler & Young, 1973) and Sammon's mapping (Sammon, 1969). Both methods provide a mapping of high-dimensional points in genotypic space (ℜ^p) to points in ℜ^2. However, while PCA tries to account for the most variance in the data at the expense of the original Euclidean distances, Sammon's mapping aims to preserve the distances of the genotypes when mapping to a lower dimension (Dybowski, Collins, Hall, & Weller, 1996). Therefore, Sammon's mapping is chosen for this analysis because the distances between genotypes produced by fitness-based search and novelty search in the two-dimensional visualization should be as close to their original distances as possible to understand how they relate. This approach facilitates the comparison between the different regions of the search space that both methods explore. Sammon's mapping maps a high-dimensional dataset onto a lower number of dimensions (typically two or three), allowing a better understanding of the underlying structure of the data.
The mapping minimizes the stress measure E, which is the discrepancy between the high-dimensional distances δij between all objects i and j and the resulting distances dij between the data points in the lower dimension:

E = \frac{1}{\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \delta_{ij}} \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \frac{(\delta_{ij} - d_{ij})^2}{\delta_{ij}}.   (8)

The stress measure can be minimized by a steepest descent procedure, and the resulting value of E is a good indicator of the quality of the projection. For this study, Sammon's mapping projects high-dimensional genotypes produced over the course of evolution onto a two-dimensional space. The output of the mapping is an x and y coordinate for every genotype, chosen to minimize the stress measure E. The original high-dimensional distance δij between two genotypes is based on NEAT's genome distance (Equation 6), which is a good indicator of the similarity of two network encodings. The distance dij between two objects i and j in the visualization space is calculated as their Euclidean distance \sqrt{(i_x - j_x)^2 + (i_y - j_y)^2}.

To make the two-dimensional visualization clearer, not all genotypes created during evolution are part of the Sammon's mapping; instead, only those are shown that either (1) have a genome distance δ greater than 9.0 compared to already recorded genotypes or (2) have a distance smaller than 9.0 but display a different behavior (based on the novelty metric described in Section 4.1). These criteria ensure that a representative selection of genotypes is shown that is still sparse enough to be visible in the projection onto two dimensions.
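As an illustration of how such a projection can be computed, the sketch below evaluates the stress measure E of Equation 8 for a candidate two-dimensional layout and minimizes it numerically with a general-purpose optimizer rather than Sammon's original steepest-descent procedure; genome_distance stands in for NEAT's genome distance (Equation 6), and all parameter choices are illustrative.

# Minimal sketch of a Sammon-style projection: the stress E of Equation 8 is
# computed from pairwise genotype distances and minimized numerically.
# genome_distance is a placeholder for NEAT's genome distance (Equation 6);
# the optimizer choice differs from Sammon's original steepest-descent scheme.
import numpy as np
from scipy.optimize import minimize

def sammon_stress(coords_flat, delta):
    # delta is the n x n matrix of original distances (assumed > 0 for i != j).
    n = delta.shape[0]
    coords = coords_flat.reshape(n, 2)
    diffs = coords[:, None, :] - coords[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1))      # Euclidean distances in 2-D
    iu = np.triu_indices(n, k=1)                # all pairs i < j
    c = delta[iu].sum()                         # normalizing constant
    return ((delta[iu] - d[iu]) ** 2 / delta[iu]).sum() / c

def sammon_mapping(genotypes, genome_distance, seed=0):
    n = len(genotypes)
    delta = np.array([[genome_distance(a, b) for b in genotypes] for a in genotypes])
    rng = np.random.default_rng(seed)
    x0 = rng.normal(size=2 * n)                 # random initial 2-D layout
    result = minimize(sammon_stress, x0, args=(delta,), method="L-BFGS-B")
    return result.x.reshape(n, 2), result.fun   # coordinates and final stress E

A low final stress (such as the E = 0.058 reported below) indicates that the original genotypic distances are well preserved in two dimensions.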
Figure 9 shows a Sammon's mapping of 882 genotypes; 417 were found by novelty search and 465 were found by fitness-based search during a typical evolutionary run of each. In this example, novelty search found a solution after 19,524 evaluations while it took fitness-based search 36,124 evaluations. The low stress measure E = 0.058 indicates that the original genotypic distances have been conserved by the mapping: genotypes that are close to each other in the two-dimensional output space are also close to each other in genotype space. The mapping reveals that both methods discover different regions of high fitness and that the majority of behaviors simply crash without collecting any rewards (denoted by the smallest points). The main result is that while novelty search (light gray) discovers a genotypic region of high fitness and then quickly reaches the solution (i.e. a behavior that can collect the high reward in at least 18 out of 20 trials, denoted by D in Figure 9), fitness-based search (black) needs to cover more of the genotypic search space because it searches through many identical behaviors (though different genotypes) when it is stuck at a local optimum.

Figure 9: Combined Sammon's Mapping. The Sammon's mapping of 417 genotypes found by novelty search (gray) and 465 found by fitness-based search (black) is shown. The size of each mark corresponds to the fitness of the decoded network; larger size means higher fitness. Fitness-based search covered more of the genotypic search space than novelty search because it searches through many identical behaviors (though different genotypes) when it is stuck at a local optimum. Four important individuals are identified: the final solution found by fitness-based search (A), a network that collects the high reward in the first ten trials and then the low reward (B), a network that collects the low reward in 18/20 trials (C), and the final solution found by novelty search (D). Although A and C are close, they have significantly different fitness values. Thus, while the discovery of C could potentially serve as a stepping stone for novelty search, fitness-based search is led astray from the final solution. Points B and D are discussed in the text.

Interestingly, an intermediate solution found by fitness-based search discovers a behavior that collects 18 out of 20 low rewards and returns back home (denoted by C in Figure 9). The network that produces this behavior and the final solution (A) are close in genotypic space though they have very different fitness values (178 vs. 50). Thus, while the discovery of this behavior could potentially serve as a stepping stone to finding the final solution for novelty search, rather than helping fitness-based search it actually deceives it. Agents that collect the high reward in 10 out of 20 trials and return back home (B in Figure 9) receive a higher fitness than C-type agents even though they are actually farther away from the final solution in genotype space and therefore might lead fitness-based search astray.

Figure 10 examines the temporal progression of the two search methods in more detail by showing the Sammon's mapping from Figure 9 at different stages of evolution in the corresponding run. For each evaluation shown (i.e. snapshot in time) the mapping displays the genotypes found so far together with the behaviors archived by novelty search and the champions found by fitness-based search.

Figure 10: Sammon's Mapping of novelty and fitness-based search at different stages of evolution. A mapping of 882 recorded genotypes – 470 produced by novelty search (second column) and 465 by fitness-based search (fourth column) – is shown at seven different time steps together with the corresponding behavior characterizations added to the archive by novelty search and those of the champions found by fitness-based search. Larger markers in the Sammon's mapping denote higher fitness received by the decoded network. The archived behaviors found by novelty search and the champions found by fitness-based search are connected to show the progression of each search. The magnification (bottom left) of the novelty mapping shows a region of the genotypic space with many novel behaviors that have small genotypic distances to each other. Novelty search finds a solution significantly faster than fitness-based search by exploiting intermediate stepping stones to guide the search.

Novelty search explores a wider sampling of the search space than fitness-based search during the first 1,000 evaluations. After that, both methods explore similar behaviors until novelty search finds a novel behavior at evaluation 13,912 that collects the low reward and returns back home in the first ten trials and then collects the high reward and returns back home in the successive trials. The ability to successfully return back home after collecting a reward turns out to be a stepping stone to regions of higher fitness. It opens up a wide range of possible new behaviors that lead novelty search to discover 18 new archive members between evaluations 15,000 and 19,520. Interestingly, all the underlying network encodings for these behaviors are close to each other in genotypic space even though they produce significantly different behaviors.
Finally, novelty search discovers a solution after 19,524 evaluations. In contrast, fitness-based search is not able to exploit the same set of behaviors as potential stepping stones because many collapse to the same fitness. While fitness-based search discovers two new champions in the first 1,000 evaluations, it does not discover the next until evaluation 19,520. This more fit behavior is located within a cluster of high-fitness genotypes close to the final solution. However, it takes fitness-based search another 17,439 evaluations to discover that solution. The problem again is that fitness-based search is deceived by genotypes that have a higher fitness than those that are actually closer to the solution (Figure 9). In a sense, novelty search proceeds more systematically, discovering a region of novel behaviors and then discovering the final solution in fewer evaluations than fitness-based search by exploiting intermediate stepping stones to guide the search. In fact, the number of archived behaviors is always higher than the number of new champions found by fitness across all runs.

To gain a better understanding of the fitness landscape in this domain, Figure 11 shows histograms of fitness values for individuals discovered by novelty and fitness-based search in a typical run. The histograms are normalized so that the area under each sums to one. Interestingly, the vast majority of behaviors (for both novelty and fitness-based search) receive one of three different fitness values, resulting in three peaks in each distribution. In effect, many behaviors receive the same fitness, which is another indicator of the lack of intermediate stepping stones and the absence of a fitness gradient in the T-Maze domain. Moreover, the majority of behaviors (61% for fitness and 88% for novelty) simply crash without collecting any reward, suggesting that the encoded networks are brittle to small mutations.

Figure 11: Distribution of fitness for novelty and fitness-based search for a typical run. The fitness values for fitness-based search (a) and novelty search (b) each cluster around three peaks. There are other values, but they are relatively rare. Thus many of the genotypes collapse to the same few fitness values, suggesting a lack of intermediate stepping stones for a fitness-based search method.

Overall, the analysis in this section shows that novelty search is able to return more information about how behavior changes throughout the search space. It finds a solution significantly faster than fitness-based search by exploiting intermediate stepping stones to guide its search. Interestingly, genotypes that are potential stepping stones for novelty search can lead fitness-based search astray if fitness does not correlate with distance to the final solution (Figure 9).

7 Additional Experiments

To further demonstrate novelty search's ability to efficiently evolve plastic ANNs, two substantially more complex scenarios are investigated, which are explained in the next sections.

7.1 Double T-Maze

The double T-Maze (Figure 12) includes two turning points and four maze endings, which makes the learning task substantially more difficult than the single T-Maze studied in the previous sections (Soltoggio et al., 2008). In effect the agent must now memorize a location on a map that is twice as large.
The experiment follows the setup described in Section 3 with a slightly modified novelty metric to capture behaviors in the larger environment. The behavior of an agent is still characterized by a series of trial outcomes, but each such outcome is now determined by the corresponding trial fitness value (e.g. 0.2 for collecting the low reward). The behavioral difference between two behaviors is then calculated as the sum over all trial differences (a sketch of this characterization appears at the end of this subsection). Each evaluation consists of two deployments with 200 trials each in which the high reward changes location after every 50 trials. Thus the behavior characterization includes 400 dimensions.

Figure 12: The Double T-Maze. The maze includes two turning points and four maze endings. In this depiction, the high reward is located at the top-left of the maze, but its position can change over a set of trials. The goal of the agent is to navigate to the position of the high reward and remember that location from one trial to the next.

Fitness-based search had to be restarted five times and found a solution in 801,798 evaluations on average (σ = 695,534). Novelty search found a solution in 364,821 evaluations on average (σ = 411,032) and had to be restarted two times. Therefore, even with an increased behavioral characterization (200-dimensional for the single T-Maze vs. 400-dimensional for the double T-Maze) and increased domain complexity, novelty search still finds the appropriate adaptive behavior significantly faster than fitness-based search (p < 0.05).
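The characterization and distance just described can be sketched as follows; simulate_deployment is a hypothetical helper standing in for the double T-Maze simulator and is assumed to return the per-trial fitness values of one deployment.

# Minimal sketch of the double T-Maze behavior characterization: one entry per
# trial (its trial fitness value, e.g. 0.2 when the low reward is collected),
# concatenated over both deployments for 2 x 200 = 400 dimensions.
# simulate_deployment is a hypothetical stand-in for the domain simulator.

def characterize(agent, simulate_deployment, deployments=2):
    characterization = []
    for _ in range(deployments):
        characterization.extend(simulate_deployment(agent))  # 200 trial outcomes
    return characterization  # 400 values in total

def behavior_difference(char_a, char_b):
    # The behavioral difference between two agents is the sum of the
    # per-trial differences between their characterizations.
    return sum(abs(a - b) for a, b in zip(char_a, char_b))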
7.2 Foraging Bee

Another domain studied for its reinforcement learning-like characteristics is the bee foraging task (Soltoggio et al., 2007; Niv et al., 2002). A simulated bee needs to remember which type of flower yields the most nectar in a stochastic environment wherein the amount of nectar associated with each type of flower can change. The bee flies in a simulated three-dimensional environment (Figure 13a) that contains a 60 × 60 meter flower field on the ground with two different kinds of flowers (e.g. yellow and blue). The bee constantly flies downward at a speed of 0.5 m/s and can perceive the flower patch through a single eye with a 10-degree view cone. The outside of the flower patch is perceived as gray-colored.

Figure 13: Comparing Novelty Search to Fitness-based Search in the Bee Domain. The simulated bee flying in a three-dimensional space is shown in (a). The bee is constantly flying downwards but can randomly change its direction. The bee can perceive the flower patch with a simulated view cone (Soltoggio et al., 2007). (b) The change in fitness over time (i.e. number of evaluations) is shown for NEAT with novelty search and fitness-based NEAT, both averaged over 25 runs for each approach. The main result is that both methods reach about the same average fitness but novelty search finds a solution significantly faster.

The ANN controller has five inputs (following Soltoggio et al., 2007); the first three are Gray, Blue, and Yellow, which receive the percentage of each perceived color in the bee's view cone. The Reward input is set to the amount of nectar consumed after the bee successfully lands on either a yellow or blue flower; it remains zero during flight. The Landing input is set to 1.0 upon landing. The bee has only one output, which determines whether it stays on its current course or changes its direction to a new random heading.

Each bee is evaluated on four deployments with 20 trials each. The flower colors are inverted after 10 trials on average and between deployments, thereby changing the association between color and reward value. The amount of collected nectar (0.8 for high-rewarding and 0.3 for low-rewarding flowers) over all trials is the performance measure for fitness-based search. To mitigate noisy evaluation in novelty search (i.e. a single random change in direction at the beginning of the flight can lead to a very different sequence of collected high and low rewards), each deployment is described by two values: the first is the number of high rewards collected before the color switch and the second is the number of high rewards collected afterwards. This novelty metric rewards the bee for collecting a different number of high rewards before and after the reward switch than other individuals. This measure reflects that what is important is displaying adaptive behavior that depends on the time of the reward switch.

The coefficients of the generalized Hebbian learning rule for this experiment are A = −0.79, B = 0.0, C = 0.0, and D = −0.038 (a schematic sketch of this rule appears at the end of this subsection). These values worked well for a neuromodulated ANN in the foraging bee domain described by Soltoggio et al. (2007). The other experimental parameters are kept unchanged. A bee counts as a solution if it displays the desired learning behavior of associating the right color with the currently high-rewarding flower (which corresponds to a fitness of 61).

Both fitness-based search and novelty search discovered solutions in 13 out of 25 runs. Novelty search took on average 261,098 evaluations (σ = 130,926) when successful and fitness-based search took on average 491,221 evaluations (σ = 277,497). Although novelty search still finds a solution significantly faster (p < 0.05), both methods quickly reach a high local optimum before that (Figure 13b).
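For concreteness, the following sketch shows a generalized, neuromodulated Hebbian weight update of the general form used in this line of work (Soltoggio et al., 2007, 2008), instantiated with the coefficients reported above. The learning rate and the way the modulatory signal gates the update are assumptions for illustration only; the exact rule used in this paper is the one defined in the earlier sections.

# Illustrative sketch of a generalized, neuromodulated Hebbian update in the
# style of Soltoggio et al. (2007, 2008). The coefficients A, B, C, D are those
# reported above for the bee experiment; eta and the gating by the modulatory
# signal m are assumptions, not the paper's exact formulation.
A, B, C, D = -0.79, 0.0, 0.0, -0.038

def weight_change(pre, post, m, eta=0.1):
    # Generalized Hebbian term: correlation, presynaptic, postsynaptic, and
    # constant contributions, scaled by the learning rate and gated by the
    # modulatory signal m (no modulation, m = 0, means no weight change).
    hebbian = A * pre * post + B * pre + C * post + D
    return eta * m * hebbian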
8 Discussion and Future Work

Novelty search outperforms fitness-based search in all domains investigated and is not affected by increased domain deception. This result is notable because it is surprising that, without any other a priori knowledge, an algorithm that is not even aware of the desired behavior would find such behavior at all, let alone in a general sense. Fitness-based search also takes significantly more evaluations to produce individuals that exhibit the desired adaptive behavior when the impact of learning on the fitness of the agent is only marginal. Because it is easier at first to improve fitness without evolving the ability to learn, objective-based search methods are likely to exploit domain-dependent static (i.e. non-adaptive) heuristics. In the T-Maze domain in this paper, agents initially learn to always navigate to one arm of the maze and back, resulting in collecting 20 high rewards (i.e. ten high rewards for each of the two starting positions) on the generalization test. Yet, because the reward location changes after ten trials for both initial settings, being more successful requires the agents to exhibit learning behavior. The natural question, then, is why novelty search outperforms fitness-based search even in the seemingly non-deceptive 10/10 scenario. While the deception in this setting is not as obvious, the analysis presented in Section 6 addressed this issue in more depth.

The problem is that evolving the right neuromodulated dynamics to achieve learning behavior is not an easy task. There is little information available to incentivize fitness-based search to pass beyond static behavior, making it act more like random search. In other words, the stepping stones that lead to learning behavior are hidden from the objective approach behind long plateaus in the search space. This analysis reveals that fitness-based search is easily led astray if fitness does not reward the stepping stones to the final solution, which is the case in the T-Maze learning problem (Figure 9). Novelty search, on the other hand, escapes the deceptive trap and instead builds on the intermediate stepping stones to proceed through the search space more efficiently. Novelty search's ability to keep track of already-explored regions of the search space is probably another factor that accounts for its superior performance.

While in some domains the fitness gradient can be improved, e.g. by giving the objective-based search clues about which direction to search, such an approach might not be possible in dynamic, reward-based scenarios. The problem in such domains is that reaching a certain fitness level is relatively easy, but any further improvement requires sophisticated adaptive behavior to evolve from only sparse feedback from an objective-based performance measure. Novelty search, in contrast, returns more information about how behavior changes throughout the search space. In this way, novelty search removes the need to carefully design a domain that fosters the emergence of learning because novelty search on its own is capable of doing exactly that. The only prerequisite is that the novelty metric is constructed such that learning and non-learning agents are separable, which is not necessarily easy, but is worth the effort if objective-based search would otherwise fail.

In fact, because NEAT itself employs the fitness sharing diversity maintenance technique (Goldberg & Richardson, 1987; Stanley & Miikkulainen, 2002) within its species (Section 2.3), the significant difference in performance between NEAT with novelty search and NEAT with fitness-based search also suggests that traditional diversity maintenance techniques do not evade deception as effectively as novelty search. Interestingly, novelty search has also been shown to succeed independently of NEAT (Mouret, 2009) in evolving ANNs, and it also outperforms fitness-based search in genetic programming (Lehman & Stanley, 2010b). Thus evidence is building for its generality.

Novelty search's ability to build gradients that lead to stepping stones is evident in the performance curves (Figure 6). The increase in generalization performance is steeper than for fitness-based NEAT, indicating a more efficient climb to higher-complexity behaviors. In effect, by abandoning the objective, the stepping stones come into greater focus (Lehman & Stanley, 2008, 2010a). Although this means that the search is wider, the alternative is to be trapped by deception. Of course, there are likely domains for which the representation is not suited to discovering the needed adaptive behavior or in which the space of behaviors is too vast for novelty search to reliably discover the right one. However, even in the double T-Maze domain, in which the length of the behavioral characterization is substantially larger (i.e. 400 dimensions), novelty search still significantly outperforms fitness-based search.
There are only so many ways to behave, and therefore the search for behavioral novelty becomes computationally feasible and is different from random search. On the other hand, even though novelty search is still significantly faster in the foraging bee task, fitness-based search reaches a local optimum that is very close to the final solution in about the same number of evaluations. A possible explanation for the more even performance in this domain is that the noisy environment offers a vast space of exploitable behavioral strategies. Future research will address the problem of noise in novelty search in more detail.

Overall, the results in this paper are important because research on evolving adaptive agents has been hampered largely as a result of the deceptiveness of adaptive tasks. Yet the promise of evolving plastic ANNs is among the most intriguing in artificial intelligence. After all, our own brains are the result of such an evolutionary process. Therefore, a method that makes such domains more amenable to evolution has the potential to further unleash a promising research direction that is only just beginning.

To explore this opportunity, a promising future direction is to apply novelty search to other adaptive problems without the need to worry about mitigating their potential for deception. For example, an ambitious domain that may benefit from this approach is training a simulated biped to walk adaptively. Lehman and Stanley (2010a) already showed that novelty search significantly outperforms objective-based search in a biped walking task. However, as in previous work (Bongard & Paul, 2001; Hase & Yamazaki, 1999; Reil & Husbands, 2002), static ANNs were evolved. Although plastic biped-controlling ANNs have been evolved in the past (Ishiguro, Fujii, & Hotz, 2003; McHale & Husbands, 2004), new advances in evolving neuromodulated ANNs (Dürr et al., 2008; Soltoggio et al., 2008) can potentially allow such controllers to be more robust to environmental changes and to morphological damage. Moreover, unlike past evolved biped controllers, such networks could be deployed into a wide range of body variants and seamlessly adapt to their bodies of origin, just as people can walk as they grow up through a wide array of body sizes and proportions. As is common when novelty search succeeds, this adaptive domain likely suffers from deception. Initially, the robot simply falls down on every attempt, obfuscating to the objective function any improvements in leg oscillation (Panne & Lamouret, 1995). However, after exploring all the different ways to fall down, novelty search should lead evolution to more complex behaviors and eventually to walking. Novelty search combined with neuromodulated ANNs might thus be the right recipe to evolve such a new class of plastic controllers.

Characterizing when and why novelty search fails is also an important future research direction. Yet its performance in this paper and in past research (Lehman & Stanley, 2008, 2010b, 2010c; Mouret, 2009) has proven surprisingly robust. While it is not always going to work well, this paper suggests that it is a viable new tool in the toolbox of evolutionary computation to counteract the deception inherent in evolving adaptive behavior.

9 Conclusions

This paper showed that (1) reinforcement learning problems like the T-Maze and the bee foraging task are inherently deceptive for fitness-based search and (2) novelty search can avoid such deception by exploiting intermediate stepping stones to guide its search.
A detailed analysis revealed how the two different approaches explore the genotypic space and demonstrated that novelty search, which abandons the objective to search only for novel behaviors, in fact facilitates the evolution of adaptive behavior. Results in a variety of domains demonstrated that novelty search can significantly outperform objective-based search and that it performs consistently under varying levels of domain deception. Fitness-based NEAT, on the other hand, performs increasingly poorly as domain deception increases, adding to the growing body of evidence (Lehman & Stanley, 2008, 2010b, 2010c; Mouret, 2009) that novelty search can overcome the deception inherent in a diversity of tasks. The main conclusion is that it may now be more realistic to learn interesting adaptive behaviors that have heretofore seemed too difficult.

Acknowledgments

This research was partially supported by the National Science Foundation under grants DRL0638977 and IIP0750551 and in part by DARPA under grant HR0011-09-1-0045 (Computer Science Study Group Phase 2). Special thanks to Andrea Soltoggio for sharing his wisdom in the T-Maze and bee domains.

References

Aaltonen, T., et al. (2009). Measurement of the top quark mass with dilepton events selected using neuroevolution at CDF. Physical Review Letters.

Angeline, P. J., Saunders, G. M., & Pollack, J. B. (1994). An evolutionary algorithm that constructs recurrent neural networks. IEEE Transactions on Neural Networks, 5(1), 54–65.

Barlow, M., Galloway, J., & Abbass, H. A. (2002). Mining evolution through visualization. In E. Bilotta et al. (Eds.), ALife VIII Workshops (pp. 103–112).

Barnett, L. (2001). Netcrawling-optimal evolutionary search with neutral networks. In Proceedings of the 2001 Congress on Evolutionary Computation (Vol. 1).

Baxter, J. (1992). The evolution of learning algorithms for artificial neural networks. In D. Green & T. Bossomaier (Eds.), Complex Systems (pp. 313–326). IOS Press.

Blynel, J., & Floreano, D. (2002). Levels of dynamics and adaptive behavior in evolutionary neural controllers. In B. Hallam, D. Floreano, J. Hallam, G. Hayes, & J.-A. Meyer (Eds.), From Animals to Animats 7.

Blynel, J., & Floreano, D. (2003). Exploring the T-Maze: Evolving learning-like robot behaviors using CTRNNs. In 2nd European Workshop on Evolutionary Robotics (EvoRob 2003).

Bongard, J. C., & Paul, C. (2001). Making evolution an offer it can't refuse: Morphology and the extradimensional bypass. In Proceedings of the 6th European Conference on Advances in Artificial Life (ECAL 2001) (pp. 401–412). London, UK: Springer-Verlag.

Carew, T., Walters, E., & Kandel, E. (1981). Classical conditioning in a simple withdrawal reflex in Aplysia californica. The Journal of Neuroscience, 1(12), 1426–1437.

Chalmers, D. J. (1990). The evolution of learning: An experiment in genetic connectionism. In Proceedings of the 1990 Connectionist Models Summer School (pp. 81–90). Morgan Kaufmann.

Darwen, P., & Yao, Y. (1996). Every niching method has its niche: Fitness sharing and implicit sharing compared. Parallel Problem Solving from Nature (PPSN IV), 398–407.

De Jong, E. D. (2004). The incremental Pareto-coevolution archive. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2004) (pp. 525–536). Springer.

Dürr, P., Mattiussi, C., Soltoggio, A., & Floreano, D. (2008). Evolvability of neuromodulated learning for robots. In The 2008 ECSIS Symposium on Learning and Adaptive Behavior in Robotic Systems (pp. 41–46). Los Alamitos, CA: IEEE Computer Society.
Dybowski, R., Collins, T. D., Hall, W., & Weller, P. R. (1996). Visualization of binary string convergence by Sammon mapping. In Proceedings of the Fifth Annual Conference on Evolutionary Programming. MIT Press.

Floreano, D., Dürr, P., & Mattiussi, C. (2008). Neuroevolution: From architectures to learning. Evolutionary Intelligence, 1, 47–62.

Floreano, D., & Urzelai, J. (2000). Evolutionary robots with online self-organization and behavioral fitness. Neural Networks, 13, 431–443.

Glover, F., & Laguna, M. (1997). Tabu search. Boston: Kluwer Academic Publishers.

Goldberg, D. E. (2007). Simple genetic algorithms and the minimal deceptive problem. In L. D. Davis (Ed.), Genetic Algorithms and Simulated Annealing, Research Notes in Artificial Intelligence. Morgan Kaufmann.

Goldberg, D. E., & Richardson, J. (1987). Genetic algorithms with sharing for multimodal function optimization. In Proceedings of the Second International Conference on Genetic Algorithms and their Application (pp. 41–49). Hillsdale, NJ, USA: L. Erlbaum Associates Inc.

Gomez, F. J., & Miikkulainen, R. (1999). Solving non-Markovian control tasks with neuroevolution. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (pp. 1356–1361). Morgan Kaufmann.

Gruau, F., Whitley, D., & Pyeatt, L. (1996). A comparison between cellular encoding and direct encoding for genetic neural networks. In Genetic Programming 1996: Proceedings of the First Annual Conference (pp. 81–89). MIT Press.

Harvey, I. (1993). The artificial evolution of adaptive behavior. Unpublished doctoral dissertation, School of Cognitive and Computing Sciences, University of Sussex, Sussex.

Hase, K., & Yamazaki, N. (1999). Computational evolution of human bipedal walking by a neuro-musculo-skeletal model. Artificial Life and Robotics, 3, 133–138.

Hinton, G. E., & Nowlan, S. J. (1987). How learning can guide evolution. Complex Systems, 1.

Hornby, G. S. (2006). ALPS: The age-layered population structure for reducing the problem of premature convergence. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2006) (pp. 815–822). New York, NY, USA: ACM.

Hu, J., Goodman, E., Seo, K., Fan, Z., & Rosenberg, R. (2005). The hierarchical fair competition (HFC) framework for sustainable evolutionary algorithms. Evolutionary Computation, 13(2), 241–277. (PMID: 15969902)

Hutter, M., & Legg, S. (2006). Fitness uniform optimization. IEEE Transactions on Evolutionary Computation, 10, 568–589.

Ishiguro, A., Fujii, A., & Hotz, P. E. (2003). Neuromodulated control of bipedal locomotion using a polymorphic CPG circuit. Adaptive Behavior, 11(1), 7–17.

Kim, Y.-H., & Moon, B.-R. (2003). New usage of Sammon's mapping for genetic visualization. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2003) (Vol. 2723 of LNCS, pp. 1136–1147). Springer-Verlag.

Kittler, J., & Young, P. C. (1973). A new approach to feature selection based on the Karhunen-Loeve expansion. Pattern Recognition, 5, 335–352.

Lehman, J., & Stanley, K. O. (2008). Exploiting open-endedness to solve problems through the search for novelty. In Proceedings of the Eleventh International Conference on Artificial Life. Cambridge, MA: MIT Press.

Lehman, J., & Stanley, K. O. (2010a). Abandoning objectives: Evolution through the search for novelty alone. Evolutionary Computation. (To appear)

Lehman, J., & Stanley, K. O. (2010b). Efficiently evolving programs through the search for novelty. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2010). New York, NY, USA: ACM. (To appear)
Lehman, J., & Stanley, K. O. (2010c). Revising the evolutionary computation abstraction: Minimal criteria novelty search. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2010). New York, NY, USA: ACM. (To appear)

Mahfoud, S. W. (1995). Niching methods for genetic algorithms. Unpublished doctoral dissertation, Champaign, IL, USA.

Martin, A. P. (1999). Increasing genomic complexity by gene duplication and the origin of vertebrates. The American Naturalist, 154(2), 111–128.

Mayley, G. (1997). Guiding or hiding: Explorations into the effects of learning on the rate of evolution. In Fourth European Conference on Artificial Life (pp. 135–144). MIT Press.

McHale, G., & Husbands, P. (2004). GasNets and other evolvable neural networks applied to bipedal locomotion. From Animals to Animats 8.

McQuesten, P., & Miikkulainen, R. (1997). Culling and teaching in neuro-evolution. In Proceedings of the Seventh International Conference on Genetic Algorithms. San Francisco: Kaufmann.

Mitchell, M., Forrest, S., & Holland, J. H. (1991). The royal road for genetic algorithms: Fitness landscapes and GA performance. In Proceedings of the First European Conference on Artificial Life (pp. 245–254). MIT Press.

Mouret, J. (2009). Novelty-based multiobjectivization. In Proceedings of the IROS Workshop on Exploring New Horizons in the Evolutionary Design of Robots.

Niv, Y., Joel, D., Meilijson, I., & Ruppin, E. (2002). Evolution of reinforcement learning in uncertain environments: A simple explanation for complex foraging behaviors. Adaptive Behavior, 10(1), 5–24.

Nolfi, S., & Floreano, D. (1999, July). Learning and evolution. Autonomous Robots, 7(1), 89–113.

Nolfi, S., & Parisi, D. (1993). Auto-teaching: Networks that develop their own teaching input. In J. L. Deneubourg et al. (Eds.), Proceedings of the Second European Conference on Artificial Life (pp. 845–862).

Nolfi, S., & Parisi, D. (1996). Learning to adapt to changing environments in evolving neural networks. Adaptive Behavior, 5, 75–98.

Nolfi, S., Parisi, D., & Elman, J. L. (1994). Learning and evolution in neural networks. Adaptive Behavior, 3, 5–28.

Panne, M. van de, & Lamouret, A. (1995, September). Guided optimization for balanced locomotion. In 6th Eurographics Workshop on Animation and Simulation.

Reil, T., & Husbands, P. (2002). Evolution of central pattern generators for bipedal walking in a real-time physics environment. IEEE Transactions on Evolutionary Computation, 6(2), 159–168.

Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, 18(5), 401–409.

Saravanan, N., & Fogel, D. B. (1995). Evolving neural control systems. IEEE Expert: Intelligent Systems and Their Applications, 10(3), 23–27.

Schmidhuber, J. (2003). Exploring the predictable. In S. Ghosh & S. Tsutsui (Eds.), Advances in Evolutionary Computing: Theory and Applications (pp. 579–612). New York, NY, USA: Springer-Verlag New York, Inc.

Schmidhuber, J. (2006). Developmental robotics, optimal artificial curiosity, creativity, music, and the fine arts. Connection Science, 18(2), 173–187.

Soltoggio, A., Bullinaria, J. A., Mattiussi, C., Dürr, P., & Floreano, D. (2008). Evolutionary advantages of neuromodulated plasticity in dynamic, reward-based scenarios. In Artificial Life XI (pp. 569–576). Cambridge, MA: MIT Press.

Soltoggio, A., Dürr, P., Mattiussi, C., & Floreano, D. (2007). Evolving neuromodulatory topologies for reinforcement learning-like problems. In Proceedings of the IEEE Congress on Evolutionary Computation.
Stanley, K. O. (2006-2008). rtNEAT C++ software homepage: www.cs.utexas.edu/users/nn/keyword?rtneat.

Stanley, K. O., Bryant, B. D., & Miikkulainen, R. (2003). Evolving adaptive neural networks with and without adaptive synapses. In Proceedings of the 2003 IEEE Congress on Evolutionary Computation (CEC-2003). Canberra, Australia: IEEE Press.

Stanley, K. O., Bryant, B. D., & Miikkulainen, R. (2005, December). Real-time neuroevolution in the NERO video game. IEEE Transactions on Evolutionary Computation, 9(6), 653–668.

Stanley, K. O., & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10, 99–127.

Stanley, K. O., & Miikkulainen, R. (2004). Competitive coevolution through evolutionary complexification. Journal of Artificial Intelligence Research, 21, 63–100.

Stewart, T. (2001). Extrema selection: Accelerated evolution on neutral networks. In Proceedings of the 2001 IEEE International Conference on Evolutionary Computation (Vol. 1). IEEE Press.

Taylor, M. E., Whiteson, S., & Stone, P. (2006, July). Comparing evolutionary and temporal difference methods in a reinforcement learning domain. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2006) (pp. 1321–1328). New York, NY, USA: ACM.

Vassilev, V. K., Fogarty, T. C., & Miller, J. F. (2000). Information characteristics and the structure of landscapes. Evolutionary Computation, 8(1), 31–60.

Watson, J. D., Hopkins, N. H., Roberts, J. W., Steitz, J. A., & Weiner, A. M. (1987). Molecular biology of the gene (4th ed.). Menlo Park, CA: The Benjamin Cummings Publishing Company, Inc.

Whiteson, S., & Stone, P. (2006). Evolutionary function approximation for reinforcement learning. Journal of Machine Learning Research, 7, 877–917.

Yao, X. (1999). Evolving artificial neural networks. Proceedings of the IEEE, 87(9), 1423–1447.