Learning Distributed Strategies for Traffic Control

David E. Moriarty, Simon Handley, and Pat Langley
Daimler-Benz Research and Technology Center
1510 Page Mill Road, Palo Alto, CA 94304
{moriarty, handley, [email protected]}

Abstract

In this paper, we cast the problem of managing traffic flow in terms of a distributed collection of independent agents that adapt to their environment. We describe an evolutionary algorithm that learns strategies for lane selection, using local information, on a simulated highway that contains hundreds of agents. Experimental studies suggest that the learned controllers lead to better traffic flow than ones constructed manually, and that the learned controllers are robust with respect to blocked lanes and changes in the number of lanes on the highway.

1. Introduction

In recent years, there has been growing interest in the distributed behavior of large populations of independent agents, and in the ability of such agents to adapt not only to their environment but to each other's behavior. Most research in this area has focused on agent populations designed to mimic those that occur in the natural world, such as colonies of ants and schools of fish. However, the artificial urban environment created by humans contains another important example of distributed agent behavior: automobile traffic on roads and highways.

The traffic domain has much to recommend it as a fertile source of research problems. Clearly, each agent (a driver and his vehicle) has independent control of its actions, but its behavior must take into account physical constraints, such as staying on the road and avoiding collisions, and the behavior of many other agents. Another advantage is that there exist clear criteria for evaluation, such as maximizing traffic flow and minimizing lane changes. In addition, the domain supports complex behaviors of the overall system, such as traffic jams, even though the individual agents are relatively simple. Also, most researchers have personal experience in traffic environments and, presumably, have good intuitions about reasonable strategies and behaviors. Finally, progress in this area could lead to improvements in actual traffic conditions and thus increase the quality of life for drivers.

In this paper, we describe research on distributed adaptation in the traffic domain. Following Moriarty and Langley (1998), we formulate the problem of traffic management from a distributed, car-centered perspective and the task of improving this process in terms of distributed machine learning. We assume that each agent receives local information about the vehicles that immediately surround it, including their location and speed, and that the agent determines which lane to drive in but not the vehicle's speed. The aim of learning, and thus our measure of performance, is not the behavior of individual traffic agents but rather the behavior of the traffic system as a whole. To this end, we want the learning module to develop a control strategy for lane selection that considers not only the maintenance of each car's desired speed, but that takes into account how its selection will affect the speeds of other cars. For instance, we would like the cars to organize themselves into a cooperative system that lets the fast drivers pass through, while still letting the slow drivers maintain their speeds. We begin by presenting our formulation of the traffic control task, and its associated learning problem, in more detail.
We next describe the inputs and outputs of our reactive vehicle controllers, along with our genetic approach to learning distributed control strategies. After this, we characterize our simulated traffic environment and report experimental studies of the system under a variety of conditions. Elsewhere (Moriarty & Langley, 1998), we showed that our approach learns controllers that are robust to changes in the proportion of learned to 'selfish' cars on the highway and to changes in traffic density. Here we focus instead on the system's ability to generalize to situations that involve blocked lanes and different numbers of lanes from those used in training. We close the paper with a discussion of related work on distributed learning and our plans for future research.

2. Distributed Traffic Control

Our approach to traffic management involves a reformulation of the problem into a distributed artificial intelligence task, in which cars coordinate lane changes to maintain desired speeds and reduce total lane maneuvers. We make no assumptions about the level of automation of the cars. Lane selection information could be provided to the driver, who completes the maneuver, or to a regulation controller in an automated car (Pomerleau, 1995; Varaiya, 1993). Here we consider in more detail our formulation of the performance and learning tasks.

[Figure 1: (a) An example traffic situation in which the traffic flows from left to right and the number on each car shows the car's speed. (b) Traffic after reorganization, in which cars 75 and 65 swap lanes, followed by another lane change by car 65, so that all cars can maintain their desired speeds.]

2.1 Definition of the Problem

Most work on advanced traffic management views cars as tokens that follow simple, selfish rules of behavior. These management systems affect the flow of the car tokens by controlling external, fixed-position devices such as traffic signals, ramp meters, speed limits, and dynamic lanes. Surprisingly, little research has addressed how the cars themselves might sense and intelligently affect traffic dynamics.[1] Our view is that cars are not blind tokens, but rather can sense their environment and act cooperatively to achieve desired global behavior. More specifically, cars can learn to organize themselves by traffic lanes to increase overall traffic throughput, reduce the average number of lane changes, and maintain the desired speeds of drivers. Car-centered control, specifically lane selection, should therefore complement existing traffic management efforts by providing better behavior between traffic signals.

Figure 1(a) illustrates a situation in which lane coordination is beneficial. The figure shows five cars along with their speeds, which we will use as identifiers. Car 72 is quickly approaching car 65 and will be unable to pass because of the position of car 67. Without reorganization, car 65 forces car 72 to reduce its speed and wait for car 67 to pass car 65, which will decrease traffic throughput and car 72's satisfaction. An efficient solution to this problem is for car 75 and car 65 to immediately swap lanes, followed by car 65 moving into the bottom lane, as shown in Figure 1(b). This maneuver ensures that no speeds are reduced and no throughput is lost.

We recast the traffic management task as a problem in distributed artificial intelligence, where each car represents an individual agent in a multi-agent system.

[1] One exception is the work of Carrara and Morello in the DOMINC project.
Cars act on their world (the highway) by selecting appropriate lanes to drive in. They interact with other cars by competing for resources (the spaces or slots on the highway). Each action is local in nature, and may not produce any noticeable benefit to the car. Collectively, however, the local actions can improve the global performance of the traffic. For example, yielding a lane to a faster car does not produce any local benefit to the slower car, but it does increase the overall traffic throughput and lets the passing car maintain its desired speed.

Global traffic performance could be defined in many different ways. Governments want high traffic throughput, whereas drivers want to maintain desired speeds with few lane changes. We selected the driver-oriented metric, since drivers are likely to be the harshest critics of cooperative driving. The performance measure for a set of cars C contains two terms, one that penalizes deviations in speed and one that penalizes lane changes,

P(C) = \frac{\sum_{t=1}^{T} \sum_{i=1}^{N} (S^{a}_{it} - S^{d}_{it})^{2} + 4 \cdot 60 \cdot \sum_{i=1}^{N} L_{i}}{T N} ,   (1)

where T is the total number of time steps (in seconds), N is the number of cars, S^{d}_{it} is the desired speed of car i at time t, S^{a}_{it} is the actual speed of car i at time t, and L_{i} is the total number of lane changes for car i over the T time steps. The first constant, 4, is a weighting factor, whereas the second, 60, converts the lane changes per second into lane changes per minute. The goal is to minimize the difference between actual speeds and desired speeds, modulated by the number of lane changes, averaged over several time steps and over all learned cars on the road. Each speed difference is squared to penalize extreme behavior. For example, driving 60 m/h 90% of the time and 10 m/h 10% of the time gives an average of 55 m/h, but is clearly less desirable than driving 56 m/h 50% of the time and 54 m/h 50% of the time, which gives the same average. Squaring the error from desired speed gives a higher evaluation to the more consistent strategy.

The problem is thus to find a lane-changing strategy that minimizes equation 1. A naive strategy for each car, which most traffic management systems assume, is to select the lane that lets it most consistently achieve its desired speed and only change lanes if a slower car is encountered. The disadvantage of such a strategy is that it does not take into account the global criteria of traffic performance. A slow car should not drive in the "fast" lane simply because it can maintain its desired speed. We will refer to cars that employ the naive strategy as selfish, since they maximize the local performance of their respective car. We are interested in smart strategies that maximize the aggregate performance of traffic through cooperative lane-selection strategies.
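As a concrete illustration, the following minimal Python sketch computes this performance measure from recorded simulation data; the array shapes, the function name, and the toy example values are our own illustrative assumptions, not part of the original system.

```python
import numpy as np

def performance(actual, desired, lane_changes):
    """Equation 1: squared speed deviations plus weighted lane changes (lower is better).

    actual, desired -- arrays of shape (T, N): speed of car i at time step t, in m/h
    lane_changes    -- array of shape (N,): total lane changes of car i over the T steps
    """
    T, N = actual.shape
    speed_error = np.sum((actual - desired) ** 2)     # summed over all cars and time steps
    lane_penalty = 4 * 60 * np.sum(lane_changes)      # 4 = weight, 60 = per-second -> per-minute
    return (speed_error + lane_penalty) / (T * N)

# Toy example: 2 cars over 3 one-second steps, one lane change in total.
actual = np.array([[60.0, 55.0], [58.0, 55.0], [60.0, 55.0]])
desired = np.full((3, 2), 60.0)
print(performance(actual, desired, np.array([1, 0])))
```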
2.2 Communication and Coordination

The previous section defined the problem of car-centered traffic management, but left open some important issues about the level of communication between cars and the knowledge available about other cars' decisions and states. The multi-agent literature is often divided on these matters, and we feel that they are not central to the problem definition. Still, we should describe our assumptions about communication and state information.

We assume that cars have access to information on their own state, including knowledge of their current driving speed and the driver's desired speed. One could imagine a driver specifying desired speeds at the start of a trip, or the system could infer this information from the driver's historical behavior. We also assume that agents can perceive limited state information about surrounding cars, such as their relative speeds. The system could sense this information using radar or receive it directly from other cars via radio waves or the Internet. Agents should also sense which surrounding cars are cooperative and which are selfish. Again, the system could infer cooperation from driver behavior or direct communication.

Figure 2 illustrates the input for an agent in a specific traffic situation. The middle car receives as input its current speed, its desired speed, the relative speeds of surrounding traffic, and whether surrounding cars are cooperative or selfish. The range and granularity of the relative speed inputs could be adjusted to take into account both local traffic and distant traffic. For example, it may prove beneficial to receive not only relative speeds of individual cars in the immediate vicinity, but also relative speeds of groups of cars in farther ranges.

[Figure 2: An illustration of the input to each agent. The shaded region shows the current input information for the middle car. The agent has access to its current speed, its desired speed, the relative speeds of surrounding traffic, and whether other cars are smart or selfish.]

We assume that the controller's output consists of three options: stay in the current lane, change lanes to the left, or change lanes to the right. The output does not specify the best lane to drive in, but rather whether the lanes immediately to the left or immediately to the right are better than the current lane. This control provides flexibility, since it does not depend on the number of lanes or on knowledge of the current driving lane. Thus, controllers that learn on a three-lane highway should, at least in principle, generalize to greater or fewer lanes.

We assume that the controller's output represents a ranking of the three possible choices, with the highest ranked choice that is both valid and safe being selected as the car's next action. For a recommendation to be valid, there must be a lane available in the specified direction. For a recommendation to be safe, there must not be a car in the same longitudinal position in the new lane. It is always safe to remain in the current lane. The system could also incorporate other safety assurances, such as detecting whether a lane change produces an unsafe spacing between cars in the new lane. For example, one might specify that a slow car should not move in front of a fast car even if there is immediate space for it in the fast car's lane, since the fast car will likely close that space during the span of the lane change. The higher-level safety and validation process relieves the controller of the overhead of deciding which lanes are safe and centers the control problem on lane selection. In other words, by removing the problem of validation and safety, the controller can focus on, and more easily learn, the ranking of lanes. This approach is analogous to separating the identification of legal moves from the selection of desirable moves in game playing.
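To make the separation between ranking and the validity/safety check concrete, the sketch below (ours, not the authors' code) selects the highest-ranked action that is both valid and safe; the lane indexing and the occupied-slot set are simplifying assumptions standing in for the simulator's own checks.

```python
ACTIONS = ("left", "stay", "right")

def choose_action(rankings, lane, position, num_lanes, occupied):
    """Pick the highest-ranked action that is both valid and safe.

    rankings  -- dict mapping "left"/"stay"/"right" to the controller's score
    lane      -- current lane index (0 = leftmost)
    position  -- longitudinal slot of the car
    num_lanes -- number of lanes on the highway
    occupied  -- set of (lane, position) pairs taken by other cars
    """
    for action in sorted(ACTIONS, key=lambda a: rankings[a], reverse=True):
        if action == "stay":
            return action                      # staying in the current lane is always safe
        target = lane - 1 if action == "left" else lane + 1
        if not (0 <= target < num_lanes):      # valid: a lane must exist in that direction
            continue
        if (target, position) in occupied:     # safe: no car in the same longitudinal slot
            continue
        return action
    return "stay"

# Example: the controller prefers a left change, but that slot is occupied, so the car stays.
print(choose_action({"left": 0.9, "stay": 0.5, "right": 0.1},
                    lane=1, position=42, num_lanes=3, occupied={(0, 42)}))
```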
Another important issue concerns support for individual differences among drivers. Clearly, different drivers should be able to select lanes differently. Slower drivers will normally (but not always) use lane selection to open up lanes for faster traffic, whereas faster drivers will select lanes to get through slower traffic. Average-speed drivers will employ elements of both strategies. At issue is how to represent and implement the different types of strategies. One approach is to maintain an explicit control policy for each type of driver. For example, fast drivers would utilize a fast lane-selection strategy and slow drivers a slow lane-selection strategy. A disadvantage of this approach is that it requires a priori knowledge of the number of driver types and the boundaries that separate them. Also, it does not provide a smooth transition between styles of driving. A driver on a boundary would be forced into one of the two surrounding strategies instead of an interpolation between the two. A better approach is to parameterize the driving style and use it as input to a single control policy. Each car would contain the same control policy, but since it receives driving style as input, it behaves differently for different types of drivers. In this case, driving style is simply the desired speed. No a priori decisions are necessary regarding the number of lane-selection strategies or their boundaries. Moreover, since the different strategies are keyed to a continuous input (desired speed), there can be smooth transitions and interpolations between different lane-selection strategies.

2.3 Approaches to Intelligent Lane Selection

Creating distributed lane-changing controllers by hand appears quite difficult. It is unclear whether experts exist in this domain and, even if they do, experts often find it difficult to verbalize complex control skills, which creates a knowledge acquisition bottleneck. Also, the innumerable traffic patterns and varying driving styles create a large problem space. Even with significant expert domain knowledge, hand crafting a controller that operates effectively in all areas of the problem space may not be feasible.

Another solution is to apply machine learning to develop intelligent controllers through direct experience with the domain. A learning algorithm would modify the controller based on good and bad experiences in the problem space. This approach frees us from the task of acquiring and encoding expert domain knowledge, since it discovers examples of good and bad decisions through direct experience. Moreover, the controllers are not necessarily fixed and could continue to learn and adapt with new experiences.

The lane-selection problem appears out of reach of the more standard, supervised machine learning methods (e.g., Quinlan, 1986; Rumelhart, Hinton, & Williams, 1986). In supervised learning, control policies are formed from examples of correct behavior. In the case of intelligent lane selection, supervised learning requires demonstrations of good and bad lane selections. Without expert domain knowledge, it is difficult to generate these examples. In some control problems, supervised learning is used to mimic the behavior of people (e.g., Pomerleau, 1992; Sammut, Hurst, Kedzier, & Michie, 1992). For intelligent lane selection, however, this is exactly what we do not want to model. We believe that most drivers do not select lanes intelligently, but rather are more selfish in nature. Thus, it seems misguided to use real driver behaviors as a basis for learning cooperative lane selection.

A more flexible machine learning approach that is capable of learning from general rewards instead of behavioral examples has been termed reinforcement learning. The rewards provide only a general measure of proficiency on the task and do not explicitly direct the learner toward any course of action.
The learner adjusts its actions through trial-and-error interactions with the environment to maximize the reward signal. In the lane-selection problem, agents receive the rewards defined by equation 1 at specific time steps. In response, they adjust their lane-selection strategies using some reinforcement learning algorithm to maximize the reward function.

The literature includes two main types of approach to reinforcement learning. One class of methods learns through calculation of temporal differences (Sutton, 1988; Watkins & Dayan, 1992; Kaelbling, Littman, & Moore, 1996) over rewards, which lets them acquire mappings from state-action pairs onto expected values. Another class of methods searches more directly through the space of control policies (Grefenstette, Ramsey, & Schultz, 1990; Holland & Reitman, 1978; Moriarty & Miikkulainen, 1996; Whitley, Dominic, Das, & Anderson, 1993; Wilson, 1994), often using evolutionary algorithms to this end. Our approach, to which we now turn, relies on an evolutionary algorithm as the primary mechanism for reinforcement learning, but it also incorporates a technique similar to temporal-difference learning to handle smaller strategy refinements.

2.4 Machine Learning for Lane Selection

Elsewhere (Moriarty & Langley, 1998), we have described our distributed learning system in detail, so here we only review its main components. The system represents its control knowledge as a feedforward neural network with one hidden layer, as depicted in Figure 3. This network includes 16 input units, 12 hidden units, and three output units, with full connections between adjacent levels. The input nodes correspond to information about the car's current and desired speeds, as well as the speeds of surrounding vehicles, whereas the output nodes specify whether to stay in the current lane, to move left, or to move right. On each time step, the controller uses the input values to compute activations for each output node, then selects the action with the highest activation.

[Figure 3: The inputs and outputs of the neural network for lane selection. Inputs: Current Speed; Error from Desired Speed; Left Ahead, Left Side, Left Behind, Center Ahead, Center Behind, Right Ahead, Right Side, and Right Behind Speeds; Left Ahead, Left Behind, Center Ahead, Center Behind, Right Ahead, and Right Behind Types. Outputs: Move Left, Stay Center, Move Right.]
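A rough sketch of such a controller appears below (our own illustration, not the authors' code): a 16-12-3 feedforward pass that returns the action with the highest output activation. The random weights and the sigmoid hidden-layer activation are assumptions made for the example, since the paper does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# 16 inputs -> 12 hidden -> 3 outputs, fully connected between adjacent layers.
W1 = rng.normal(scale=0.5, size=(12, 16))   # hidden-layer weights (learned by SANE in the paper)
W2 = rng.normal(scale=0.5, size=(3, 12))    # output-layer weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_lane(inputs):
    """Forward pass of the lane-selection controller.

    inputs -- length-16 vector: current speed, error from desired speed,
              eight relative speeds, and six smart/selfish type flags.
    """
    hidden = sigmoid(W1 @ inputs)
    outputs = W2 @ hidden
    return ("move left", "stay center", "move right")[int(np.argmax(outputs))]

print(select_lane(np.zeros(16)))   # example call with a dummy sensor vector
```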
The learning system relies on three interrelated modules to determine the weights on the network's links, and thus to acquire robust controllers. The first component is SANE (Moriarty & Miikkulainen, 1996; Moriarty, 1997), which carries out genetic search through the space of feedforward networks, given a network architecture, by operating at two distinct levels. At one level, the module retains a population of complete controllers, each defined as a collection of hidden-layer neurons. The second-level population consists of these individual neurons, each of which specifies the weights on its input and output links. Each member of the neuron population can appear in zero or more members of the controller population. SANE assigns fitness to each candidate controller by converting its encoding (stored as bit strings) into a complete neural network, then using that network in a simulated traffic environment for 400 seconds. The system determines the fitness of a given controller by applying equation 1 to traces of the traffic behavior. SANE also assigns fitness to each candidate neuron based on the fitnesses of the controllers to which it contributes. The algorithm selectively applies genetic operators, such as crossover, to members of each population to generate new members, repeating the evaluation and generation process many times.

The second learning module is responsible for seeding the initial populations. Rather than starting SANE with random populations of controllers and neurons, the system initializes them with candidates that are likely to be useful. The module accomplishes this feat by collecting behavioral traces of the hand-written 'polite' controller that we describe later. These provide training cases that take the form of sensor-action pairs, which the system passes to a supervised algorithm (backpropagation) to learn weights that approximate the polite controller. The module repeats this process a number of times to generate a collection of controllers and neurons that form 25% of the two initial populations.

As noted above, the SANE module evaluates the fitness of candidate controllers by running them in the simulator for fixed time periods. Because the reward signals generated by this scheme are infrequent, we added a third learning module that relies on more immediate feedback. After every ten simulated seconds, this component checks whether the overall traffic performance has changed since the last measurement. If performance has improved substantially, it labels all actions taken during this period as desirable; if performance has worsened, it labels all invoked actions as undesirable. In either case, the module passes these actions (and their associated sensory inputs) to the backpropagation algorithm, which alters the controller's weights to either encourage or discourage their use.

Recall that the system's reward signal is based on equation 1, which assumes global information about overall traffic behavior. Clearly, such information is not currently available to actual cars or their drivers, but we predict this will change as vehicles come to include positioning devices and gain access to the Internet, which will let them report their position and speed to a central facility. For now, we have been forced to rely on a simulated traffic domain, which has also encouraged us to use an offline training regimen to collect accurate statistics over extended runs. However, the basic approach also lends itself to online learning, though we expect the learning rate would decrease in this scenario.
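The third module's labeling scheme can be summarized with a short sketch (our own illustration; the improvement threshold and the `backprop_update` stub are assumptions, since the paper does not give these details): every ten simulated seconds it compares the current equation-1 score with the previous one and pushes the period's sensor-action pairs toward or away from the actions actually taken.

```python
def backprop_update(controller, sensors, action, positive):
    """Placeholder for the supervised (backpropagation) step described in the text."""
    pass

def local_update(controller, period_examples, prev_score, new_score, threshold=0.05):
    """Label the last period's actions and pass them to backpropagation.

    period_examples       -- list of (sensor_vector, action_taken) pairs from the last 10 s
    prev_score, new_score -- equation-1 scores (lower is better) before and after the period
    threshold             -- assumed minimum relative change needed to trigger an update
    """
    change = (prev_score - new_score) / max(prev_score, 1e-9)
    if abs(change) < threshold:
        return                                  # no substantial change: no update
    desirable = change > 0                      # score dropped, so performance improved
    for sensors, action in period_examples:
        # Encourage the taken action if the period was good, discourage it otherwise.
        backprop_update(controller, sensors, action, positive=desirable)
```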
3. Experimental Evaluation

Our approach to distributed learning appears to offer a viable method for acquiring lane-selection strategies and thus improving overall traffic performance. However, whether the method works in practice is an empirical question, and in this section we report experimental studies of the system's adaptive behavior.

3.1 A Simulated Traffic Environment

To evaluate traffic management through intelligent lane selection, we developed a simulator to model traffic on a highway. For each car, the simulator updates the continuous values of position, velocity, and acceleration at one-second intervals. The acceleration and deceleration functions were set by visualizing traffic performance under different conditions and represent our best estimate of the behavior of actual drivers. We adjust acceleration (A) using the equation A(s) = 10 s^{-0.5}, where s represents the current speed in miles per hour (m/h). Deceleration occurs at the rate of -2.0 m/h per second if the difference in speed from the immediately preceding car is greater than twice the number of seconds separating the two cars. In other words, if a car approaches a slower car, the deceleration point is proportional to the difference in speed and the distance between the cars. If there is a large difference in speed, cars will decelerate sooner than if the speed differences are small. If the gap closes to two seconds, the speed is matched instantaneously. The simulator allows lane changes only if the change maintains a two-second gap between leading and following cars. The simulated roadway is 3.3 miles long, but the top of each lane "wraps around" toroidally to the bottom, creating an infinite stretch of roadway.

We designed the simulator as a tool to efficiently evaluate different lane-selection strategies, and thus it makes several assumptions about traffic dynamics. The current model makes five primary assumptions:

- all cars are the same size;
- all cars use the same acceleration rules;
- cars accelerate to and maintain their desired speed if there are no slower cars directly ahead;
- lane changes are instantaneous; and
- there are no curves, hills, on ramps, or exit ramps.

Although none of these assumptions hold for real-world traffic, they do not appear crucial for evaluating the merits of intelligent lane selection, and removing them would unnecessarily complicate the model. In future work, however, we hope to expand our experiments to more realistic simulators such as SmartPATH (Eskafi, 1996).

During training, the learning system uses the traffic simulator to evaluate candidate lane-selection strategies. Each evaluation or trial lasts 400 simulated seconds and begins with a random dispersal of 200 cars over three lanes on the 3.3-mile roadway. Desired speeds are selected randomly from a normal distribution with mean 60 m/h and standard deviation 8 m/h. In each trial, the percentage of smart cars is randomly selected from a uniform distribution with a minimum percentage of 5%. All other cars follow the selfish lane-selection strategy outlined in Section 2.1. To simulate congestion caused by lane closures and merging, we blocked portions of either the far right or far left lane during training. Lane closures last for one mile and only one closure exists at any given time. There is an equal probability that the far right or far left lane will be blocked. A lane-selection strategy perceives a blocked lane as a car with a speed of zero.

Each training run begins with a population of 75 random lane-selection strategies and 25 seeded strategies, which are modified by SANE and the local learning module over 30 simulated driving hours. SANE keeps track of the best strategy found so far based on its performance over a trial. When the system finds a better strategy, it is saved to a file for later testing. The saved strategies are each tested over ten 2000-second trials and the best is returned as the final strategy.

We developed two hand-written controllers for use as benchmarks and to provide the simulated environment. The polite controllers follow four rules (a sketch of both hand-written strategies appears after this list):

- If your desired speed is 55 m/h or less and the right lane is open, then change lanes to the right.
- If you are in the left lane, a car behind you has a higher speed, and the right lane is open, then change lanes to the right.
- If a car in front of you has a slower current speed than your desired speed and the left lane is open, then change lanes to the left.
- In the previous situation, if the left lane was not open but the right lane is open, then change to the right.
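A minimal sketch of the two benchmark strategies, under our own simplifying assumptions (boolean "open" flags supplied by the simulator and speeds in m/h), might look as follows; the polite controller applies all four rules in order, and the selfish controller uses only the last two.

```python
def polite_action(desired, left_open, right_open, in_left_lane,
                  faster_car_behind, car_ahead_slower):
    """Hand-written 'polite' benchmark: yield to the right, pass on the left."""
    if desired <= 55 and right_open:
        return "right"                      # slow drivers keep right
    if in_left_lane and faster_car_behind and right_open:
        return "right"                      # yield the left lane to faster traffic
    if car_ahead_slower and left_open:
        return "left"                       # pass a slower car on the left
    if car_ahead_slower and right_open:
        return "right"                      # otherwise pass on the right if possible
    return "stay"

def selfish_action(left_open, right_open, car_ahead_slower):
    """'Selfish' benchmark: only the last two rules (pass, never yield)."""
    if car_ahead_slower and left_open:
        return "left"
    if car_ahead_slower and right_open:
        return "right"
    return "stay"

# Example: a driver who wants 50 m/h moves right under the polite rules.
print(polite_action(50, left_open=True, right_open=True, in_left_lane=False,
                    faster_car_behind=False, car_ahead_slower=False))
```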
We based these rules on our interpretation of the "slower traffic yield to the right" signs posted on highways. The selfish strategy described earlier uses only the last two rules.

3.2 Evaluation of Intelligent Lane Selection

Our earlier studies (Moriarty & Langley, 1998) evaluated the learned controllers' behavior as we varied the density of traffic and the ratio of learned to selfish controllers. The results, which showed reasonable behavior over a wide range of densities and ratios, encouraged us to carry out additional studies to further test the adaptive nature of the learned controllers.

We designed our first experiment to evaluate the polite, selfish, and learned strategies in the presence of lane closures. Recall that, during training, we closed one mile of either the far left or far right lane and the location of the closure changed every 500 simulated seconds. We replicated lane closures in testing to determine each strategy's ability to handle the high-congestion areas created by the merging traffic. The degree of merging in these tests is extreme (a blocked lane every 3.3 miles of roadway), to fully test the robustness of the three strategies.

[Figure 4: Traffic performance when portions of the lanes were blocked: (a) mean squared difference from desired speed and (b) lane changes per minute, each plotted against the number of cars for the selfish, polite, and learned strategies.]

Figure 4 plots the mean squared error in desired speeds and the average number of lane changes with closed lanes. Surprisingly, the polite strategy performed worse than the selfish one when lanes were blocked, which differed from our earlier results with no closures. Figure 4(a) shows that, under a high degree of merging, it is better to act greedily than politely. The large errors that the polite strategy incurs come when portions of the rightmost lane are closed. Since the polite strategy directs all of its slow drivers into the right lane, it becomes difficult to merge them back into the two faster lanes when this lane is blocked. This difficulty causes large bottlenecks in the right lane and creates high errors in desired speed. Since the selfish strategy assigns no lane bias based on driving speed, it is less affected by right-lane closures.

Although the learned strategy also directs its slower drivers to the right lane, its response to bottlenecks is even more robust than the selfish scheme. Under the learned strategy, faster drivers in the center and left lanes maneuver to let slower drivers merge more easily, which eases congestion. These seemingly altruistic behaviors were learned because reinforcement comes from the aggregate traffic performance. Additionally, the learned cars have relative speed sensors that can detect slow speeds in traffic ahead. Thus, the learned strategy can merge the cars much earlier than the polite strategy, which does not begin to merge until a slow car or closed lane forces it to decelerate. Overall, the learned strategy incurs substantially lower driving errors and performs only a fraction of the lane change maneuvers of the other two strategies.

The second experiment evaluated the learned strategy's ability to adapt to four-lane highways.
As noted in Section 2.2, we designed the controller input to ignore the car's actual lane, and the output to reflect only whether the left or right lane is better than the current one. Thus, in principle, an effective strategy learned only on a three-lane highway should perform well on four lanes. Since the learning system only experienced three-lane highways in training, this experiment serves as another test of the learned strategy's adaptability.

[Figure 5: Traffic performance with four driving lanes; no lanes were blocked in this condition. (a) Mean squared difference from desired speed and (b) lane changes per minute, each plotted against the number of cars for the selfish, polite, and learned strategies.]

Figure 5 plots the error in driving speed and the average number of lane changes using four lanes of traffic. Since there is more lane capacity, we used up to 600 cars in this study. The figure shows that the learned strategy achieves the same performance gains over the polite and selfish strategies in four lanes of traffic as it does in three lanes. In dense traffic, the learned strategy incurs one third to one quarter of the driving speed error of the selfish strategy and one half of the error of the polite strategy. As with three lanes of traffic, the polite and selfish strategies make substantially more lane change maneuvers than the learned controller.

Figure 6 provides a visualization of the lane utilization for the three strategies with four traffic lanes using 300 cars. This shows that the selfish strategy assigns no lane bias based on driving speed, whereas the polite strategy exhibits a sharp transition between slow and fast driving styles. The graph for the learned strategy is very similar to its three-lane counterpart, giving further evidence that it generalizes to more than three lanes. With only three lanes, the learned strategy encumbers the right lane with many slow drivers and uses the other lanes to organize the faster drivers. Under four lanes of traffic, however, the fastest drivers are placed more consistently in the leftmost lane than under three lanes. For example, drivers with desired speeds of 80 m/h drive in the left lane 83% of the time with four lanes of traffic, compared to only 57% of the time with three lanes. Most of the traffic organization occurs in the middle two lanes, with the middle-speed drivers. This strategy seems reasonable, since the middle-speed drivers make two different types of lane changes, passing slow cars and yielding to fast cars, and therefore must reorganize more frequently.

[Figure 6: Utilization of lanes with respect to desired speeds for the (a) selfish, (b) polite, and (c) learned strategies with four traffic lanes. Each panel plots use of the left, left-center, right-center, and right lanes against desired speed (m/h).]

4. Related and Future Work

One can roughly divide research on communities of learning agents into two broad categories. The first, often called multi-agent learning, refers to situations in which the agents have shared goals and thus cooperate, either explicitly or implicitly, to achieve those goals. Examples of this approach include Schultz, Grefenstette, and Adams' (1996) work on multi-robot herding behavior, Mataric's (1994) efforts on foraging, in which four robots acquire social rules that reduce disruption, Tan's (1993) studies of reinforcement learning among predators cooperating to track down prey, and Sen and Sekaran's (1998) use of reinforcement learning to improve two-agent coordination in block pushing. Stone and Veloso (1997) present an extensive review of work on multi-agent learning, including their own results on soccer playing, so we will not try to be exhaustive here.
Another category focuses on situations that involve many agents, typically more than in multi-agent settings, each of which pursues its own goals. Research on such distributed learning seems less common than multi-agent work, but it also bears a closer relation to our own approach. Perhaps the best-known effort of this sort revolves around Holland's (1996) Echo, a simulation framework designed to study the behavior of complex biological systems, such as the interaction of plants, herbivores, and carnivores in an ecosystem (Schmitz & Booth, 1997). Schoonderwoerd, Holland, and Bruten (1997) use distributed agents to balance loads in telecommunications networks, but learning occurs only in the sense that agents lay down ant-like trails to improve performance. Grand, Cliff, and Malhotra's (1997) work on Creatures is more akin to our own work, with independent agents that exist in a simulated environment, receive rewards, and change their behaviors with experience.

There does exist some work on machine learning for traffic control, but it has focused on learning for individual driving agents. For example, Sukthankar et al. (1996) use reinforcement learning to acquire control strategies for a vehicle that operates on a simulated highway among other cars controlled by hand-crafted strategies. Similarly, McCallum (1996) reports a system that uses reinforcement learning to acquire a single-agent controller for 'New York driving', which involves weaving around slower traffic. There also exists a substantial literature on more traditional approaches to traffic management that typically involve more centralized control, which we have reviewed in a separate paper (Moriarty & Langley, 1998).

Although we have found no other work on distributed learning in the traffic domain, we remain excited about its potential as a fertile research testbed. In future work, we plan to improve our traffic simulator to include durative lane changes, entrance ramps, and exit ramps. We believe these additions will produce more realistic congestion patterns and thus increase the need for intelligent lane selection. The revised simulator will also let the vehicles increase or decrease their speed, which should provide further improvements in traffic flow, at least for learned controllers.

In the longer term, we envision an extended simulator that supports a network of interconnected highways. Each car would be given a destination and, to the extent that multiple routes are available, the controller would select among routes just as it currently selects among lanes. This will require access to higher-level information about the distribution of cars on the various highways, so that the distributed controllers can select routes that avoid congestion. These variations on the task of distributed traffic management should provide a rich set of problems to drive our research in years to come.
However, our research methodology should remain much the same, in that we will construct systems that control traffic in a distributed manner and we will study those systems' adaptive behavior under a variety of experimental conditions. We invite other researchers to join us in our exploration of an intriguing domain that remains poorly understood despite its relevance to our everyday lives.

Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments and Dan Shapiro for his evaluation of traffic simulators.

References

Carrara, M., & Morello, E. Advanced control strategies and methods for motorway of the future. In The DRIVE project DOMINC: New concepts and research under way.

Eskafi, F. (1996). Modeling and simulation of the automated highway system. Ph.D. thesis, Department of Electrical Engineering and Computer Science, University of California, Berkeley.

Grand, S., Cliff, D., & Malhotra, A. (1997). Creatures: Artificial life autonomous software agents for home entertainment. In Proceedings of the First International Conference on Autonomous Agents, pp. 22-29. New York: ACM Press.

Grefenstette, J. J., Ramsey, C. L., & Schultz, A. (1990). Learning sequential decision rules using simulation models and competition. Machine Learning, 5, 355-381.

Holland, J. H. (1996). Hidden order: How adaptation builds complexity. Reading, MA: Addison-Wesley.

Holland, J. H., & Reitman, J. S. (1978). Cognitive systems based on adaptive algorithms. In Waterman, D. A., & Hayes-Roth, F. (Eds.), Pattern-directed inference systems. New York: Academic Press.

Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237-285.

Mataric, M. J. (1994). Learning to behave socially. In Proceedings of the Third International Conference on Simulation of Adaptive Behavior, pp. 453-462. Cambridge, MA: MIT Press.

McCallum, A. K. (1996). Learning to use selective attention and short-term memory in sequential tasks. In Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, pp. 315-324. Cape Cod, MA.

Moriarty, D. E. (1997). Symbiotic evolution of neural networks in sequential decision tasks. Ph.D. thesis, Department of Computer Sciences, University of Texas at Austin.

Moriarty, D. E., & Langley, P. (1998). Learning cooperative lane selection strategies for highways. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press.

Moriarty, D. E., & Miikkulainen, R. (1996). Efficient reinforcement learning through symbiotic evolution. Machine Learning, 22, 11-32.

Pomerleau, D. (1995). Ralph: Rapidly adapting lateral position handler. In Proceedings of the 1995 IEEE Symposium on Intelligent Vehicles, pp. 506-511. Detroit, MI.

Pomerleau, D. A. (1992). Neural network perception for mobile robot guidance. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81-106.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E., & McClelland, J. L. (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1: Foundations. Cambridge, MA: MIT Press.

Sammut, C., Hurst, S., Kedzier, D., & Michie, D. (1992). Learning to fly. In Proceedings of the Ninth International Workshop on Machine Learning, pp. 385-393. San Francisco: Morgan Kaufmann.

Schmitz, O., & Booth, G.
(1997). Modelling food web complexity: The consequences of individual-based, spatially explicit behavioral ecology on trophic interactions. Evolutionary Ecology, 11, 379-398.

Schoonderwoerd, R., Holland, O., & Bruten, J. (1997). Ant-like agents for load balancing in telecommunications networks. In Proceedings of the First International Conference on Autonomous Agents, pp. 209-216. New York: ACM Press.

Schultz, A. C., Grefenstette, J. J., & Adams, W. (1996). RoboShepherd: Learning a complex behavior. In Proceedings of RoboLearn-96: International Workshop for Learning in Autonomous Robots. Key West, FL.

Sen, S., & Sekaran, M. (1998). Individual learning of coordination knowledge. Journal of Experimental & Theoretical Artificial Intelligence, 10.

Stone, P., & Veloso, M. (1997). Multiagent systems: A survey from a machine learning perspective. Tech. rep. CMU-CS-97-193, School of Computer Science, Carnegie Mellon University.

Sukthankar, R., Hancock, J., Baluja, S., Pomerleau, D., & Thorpe, C. (1996). Adaptive intelligent vehicle modules for tactical driving. In Proceedings of the AAAI-96 Workshop on Intelligent Adaptive Agents, pp. 13-22. Portland, OR. Also available at http://www.cs.cmu.edu/rahuls/Shiva/.

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.

Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pp. 330-337. San Francisco: Morgan Kaufmann.

Varaiya, P. (1993). Smart cars on smart roads: Problems of control. IEEE Transactions on Automatic Control, 38, 195-207.

Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292.

Whitley, D., Dominic, S., Das, R., & Anderson, C. W. (1993). Genetic reinforcement learning for neurocontrol problems. Machine Learning, 13, 259-284.

Wilson, S. W. (1994). ZCS: A zeroth level classifier system. Evolutionary Computation, 2, 1-18.