
Reinforcement learning and human behavior

Current Opinion in Neurobiology 2014, 25:93–98


Hanan Shteingart (1) and Yonatan Loewenstein (1,2,3,4)

Addresses
1 Edmond and Lily Safra Center for Brain Sciences, The Hebrew University, Jerusalem 91904, Israel
2 Department of Neurobiology, The Alexander Silberman Institute of Life Sciences, The Hebrew University, Jerusalem 91904, Israel
3 Department of Cognitive Science, The Hebrew University, Jerusalem 91904, Israel
4 Center for the Study of Rationality, The Hebrew University, Jerusalem 91904, Israel

Corresponding author: Loewenstein, Yonatan ([email protected])

This review comes from a themed issue on Theoretical and computational neuroscience, edited by Adrienne Fairhall and Haim Sompolinsky.
http://dx.doi.org/10.1016/j.conb.2013.12.004

The dominant computational approach to modeling operant learning and its underlying neural activity is model-free reinforcement learning (RL). However, there is accumulating behavioral and neural evidence that human (and animal) operant learning is far more multifaceted. Theoretical advances in RL, such as hierarchical and model-based RL, extend the explanatory power of RL to account for some of these findings. Nevertheless, other aspects of human behavior remain inexplicable even in the simplest tasks. Here we review developments and remaining challenges in relating RL models to human operant learning. In particular, we emphasize that learning a model of the world is an essential step before, or in parallel to, learning the policy in RL, and we discuss alternative models that directly learn a policy without an explicit world model in terms of state-action pairs.

Model-free RL

The computational problem in many operant learning tasks can be formulated in the framework of Markov Decision Processes (MDPs) [1]. In an MDP, the world can be in one of several states, which determine the consequences of the agent's actions with respect to future rewards and world states. A policy defines the agent's behavior in a given situation; in an MDP, a policy is a mapping from the states of the environment to the actions to be taken in those states [1]. Finding the optimal policy is difficult because actions may have both immediate and long-term consequences. However, the problem can be simplified by estimating values, the expected cumulative (discounted) rewards associated with states and actions, and using these values to choose actions (for a detailed characterization of the mapping from values to actions in humans, see [2]).

Model-free RL, as its name suggests, is a family of RL algorithms devised to learn the values of the states without learning the full specification of the MDP. In a class of model-free algorithms known as temporal-difference learning, the learning of the values is driven by the reward-prediction error (RPE), the discrepancy between the expected reward before and after an action is taken (taking into account also the ensuing obtained reward).
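As a concrete illustration of a temporal-difference update driven by the RPE, the following is a minimal sketch in Python of a Q-learning-style learner in a one-state, two-armed bandit. The reward probabilities, learning rate and softmax temperature are illustrative choices, not parameters taken from the studies cited here; in a multi-state MDP the RPE would also contain the discounted value of the next state, r + γ max_a' Q(s',a') − Q(s,a), a term that vanishes in a one-state bandit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-armed bandit: the reward probabilities are arbitrary choices.
reward_prob = np.array([0.3, 0.7])

alpha = 0.1      # learning rate (illustrative)
beta = 3.0       # softmax inverse temperature (illustrative)
Q = np.zeros(2)  # action values

for trial in range(1000):
    # Softmax action selection from the current values.
    p = np.exp(beta * Q) / np.sum(np.exp(beta * Q))
    action = rng.choice(2, p=p)
    reward = float(rng.random() < reward_prob[action])

    # Reward-prediction error: obtained reward minus expected reward.
    rpe = reward - Q[action]

    # Temporal-difference-style update of the chosen action's value.
    Q[action] += alpha * rpe

print("Learned action values:", Q)
```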
The hypothesis that the brain utilizes model-free RL for operant learning holds considerable sway in the field of neuroeconomics. It is supported by experiments demonstrating that in primates, the phasic activity of midbrain dopaminergic neurons is correlated with the RPE [3,4]. In mice, this correlation was also shown to be causal: optogenetic activation of dopaminergic neurons is sufficient to drive operant learning, supporting the hypothesis that dopaminergic neurons encode the RPE that is used for operant learning [5]. Other putative brain regions for this computation are the striatum, whose activity is correlated with the values of states and/or actions [6,7], and the nucleus accumbens and pallidum, which are involved in the selection of actions [8].

In addition to its neural correlates, model-free RL has been used to account for the trial-by-trial dynamics of learning (e.g., [2]) and for several robust aggregate features of human behavior, such as risk aversion [9], recency [10] and primacy [2]. Moreover, model-free RL has proven useful in the field of computational psychiatry as a way of diagnosing and characterizing different pathologies [11–13,14].

However, there is also evidence that the correspondence between dopaminergic neurons and the RPE is more complex and diverse than previously thought [15]. First, dopaminergic neurons increase their firing rate in response to both surprisingly positive and surprisingly negative reinforcements [16,17]. Second, dopaminergic activity is correlated with other task variables, such as uncertainty [18]. Third, the RPE is not exclusively represented by dopamine: additional neuromodulators, in particular serotonin, are also correlated with the RPE [19]. Finally, some findings suggest that reinforcement and punishment signals are not local but rather ubiquitous in the human brain [20]. These results challenge the dominance of the anatomically modular, model-free RL account of operant learning.

Model-based RL

When training is intense, task-independent reward devaluation, for example through satiety, has little immediate effect on behavior. This habitual learning is consistent with model-free RL because, in this framework, the value of an action is updated only when the action is executed. By contrast, when training is moderate, the response to reward devaluation is immediate and substantial [21]. This and other behaviors (e.g., planning) are consistent with an alternative RL approach, known as model-based RL, in which a model of the world, that is, the parameters that specify the MDP, is learned before choosing a policy. The effect of reward devaluation after moderate training can be explained by model-based RL because a change in a world parameter (e.g., the devaluation of the reward as a result of satiety) can be used to update, off-line, the values of other states and actions.

If the parameters of the MDP are known, one can compute the values of all states and actions, for example by means of dynamic programming or Monte Carlo simulation. Alternatively, one can choose an action by expanding a look-ahead decision tree on-line [1]. However, a full expansion of a look-ahead tree is computationally difficult because the number of branches increases exponentially with the depth of the tree, so pruning the tree is a necessary approximation. Indeed, a recent study suggested that humans prune the decision tree by trimming branches associated with large losses [14].
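For concreteness, the following sketch illustrates the two routes just described on a small, made-up MDP: computing values by dynamic programming (value iteration) when the model is known, and a depth-limited look-ahead in which branches that begin with a large loss are pruned. The MDP parameters, the pruning threshold and the horizon are illustrative assumptions, loosely in the spirit of the pruning idea of [14] rather than a reimplementation of that study's model.

```python
import numpy as np

# Illustrative known MDP (not taken from any of the cited studies):
# 3 states, 2 actions; P[s, a, s'] are transition probabilities, R[s, a] expected rewards.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.0, 0.9]],
    [[0.0, 0.9, 0.1], [0.8, 0.1, 0.1]],
    [[0.1, 0.0, 0.9], [0.0, 0.9, 0.1]],
])
R = np.array([[1.0, -4.0],
              [0.0,  2.0],
              [0.5,  0.0]])
gamma = 0.9

# Route 1: dynamic programming (value iteration), given the full model.
V = np.zeros(3)
for _ in range(200):
    V = np.max(R + gamma * P @ V, axis=1)
print("Values from dynamic programming:", V)

# Route 2: depth-limited look-ahead, pruning branches that start with a large
# loss (illustrative threshold) instead of expanding them further.
def lookahead(state, depth, loss_threshold=-3.0):
    if depth == 0:
        return 0.0
    best = -np.inf
    for a in range(2):
        if R[state, a] < loss_threshold:
            continue  # prune this branch
        future = sum(P[state, a, s2] * lookahead(s2, depth - 1, loss_threshold)
                     for s2 in range(3))
        best = max(best, R[state, a] + gamma * future)
    return 0.0 if best == -np.inf else best

print("Pruned look-ahead value of state 0:", lookahead(state=0, depth=4))
```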
Whether or not model-based and model-free learning are implemented by two anatomically distinct systems is a subject of debate. In support of anatomical modularity are findings that the medial striatum is more engaged during planning, whereas the lateral striatum is more engaged during choices in extensively trained tasks [22]. In addition, the state prediction error, which signals the discrepancy between the current model and the observed state transitions, is correlated with activity in the intraparietal sulcus and lateral prefrontal cortex, spatially separated from the main correlate of the RPE in the ventral striatum [23]. Findings that negate anatomical modularity include reports of signatures of both model-based and model-free learning in the ventral striatum [24].

The curse of dimensionality and the blessing of hierarchical RL

There are theoretical reasons why the RL models described above cannot fully account for operant learning in natural environments. First, the computational problem of finding the values is bedeviled by the 'curse of dimensionality': the number of states grows exponentially with the number of variables that define a state [1]. Second, when the state of the world is only partially known (i.e., the environment is a partially observable MDP, POMDP), applying model-free algorithms such as Q-learning may converge to a solution that is far from optimal, or may fail to converge altogether [25]. One approach to addressing these problems is to break the learning task down into a hierarchy of simpler learning problems, a framework known as hierarchical reinforcement learning (HRL) [26]. Neuroimaging studies have indeed found neural responses that are consistent with a subgoal-related RPE, as predicted by HRL [27].

Challenges in relating human behavior to RL algorithms

Despite the many successes of the different RL algorithms in explaining some of the observed human operant learning behaviors, others are still difficult to account for. For example, humans tend to alternate rather than repeat an action after receiving a positively surprising payoff. This behavior is observed both in simple repeated two-alternative forced-choice tasks with probabilistic rewards (also known as the two-armed bandit task, Figure 1a) and in the stock market [28]. Moreover, a recent study found that the behavior of half of the participants in a four-alternative version of the bandit task, known as the Iowa gambling task, is better explained by the simple ad hoc heuristic 'win-stay, lose-shift' (WSLS) than by RL models [29].
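For reference, the WSLS heuristic requires no value learning at all. The following minimal sketch (with an illustrative, made-up four-armed bandit) spells out the rule: repeat the previous action after a reward, otherwise switch to a randomly chosen other action.

```python
import numpy as np

rng = np.random.default_rng(1)
reward_prob = np.array([0.4, 0.6, 0.3, 0.5])  # illustrative four-armed bandit

def wsls(n_trials=1000):
    """Win-stay, lose-shift: repeat the last action after a reward,
    otherwise switch to a randomly chosen different action."""
    action = rng.integers(4)
    total = 0.0
    for _ in range(n_trials):
        reward = float(rng.random() < reward_prob[action])
        total += reward
        if reward == 0.0:
            action = rng.choice([a for a in range(4) if a != action])
    return total / n_trials

print("WSLS average reward:", wsls())
```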
Another challenge to current RL models is the tremendous heterogeneity in reports on human operant learning, even in simple bandit tasks measured in different laboratories under slightly different conditions. For example, as described above, the WSLS behavior observed in the four-armed bandit task [29] is inconsistent with the alternation after positively surprising payoffs [28]. Additionally, probability matching, the tendency to choose an action in proportion to the probability of reward associated with that action, has been a subject of debate for over half a century. On the one hand, there are numerous reports supporting this law of behavior, both in the laboratory and when humans gamble substantial amounts of money on the outcome of real-life events [30]. On the other hand, there is an abundant literature arguing that people deviate from probability matching in favor of choosing the more rewarding action (maximization) [31]. Finally, there is substantial heterogeneity not only between subjects and laboratories but also within subjects over time. A recent study demonstrated substantial day-to-day fluctuations in the learning behavior of monkeys in the two-armed bandit task and showed that these fluctuations are correlated with day-to-day fluctuations in neural activity in the putamen [32].

Heterogeneity in world model

The lack of uniformity of behavior even in simple tasks could be due to heterogeneity in the prior expectations of the participants. From the experimentalist's point of view, the two-armed bandit task, for example, is simple: the world is characterized by a single state and two actions (Figure 1a). However, from the participant's point of view there is, theoretically, an infinite repertoire of possible world models, characterized by different sets of states and actions. This can be true even when precise instructions are given, due to, for example, lack of trust, inattention or forgetfulness. With respect to the action set, the participant may assume that there is only a single available action, the pressing of any button regardless of its properties (Figure 1b). Alternatively, differences in the timing of the button press, the finger used, and so on could all define different actions. Such an overly precise definition of actions, which is irrelevant to the task, may result in non-optimal behavior [32]. With respect to the state set, the participant may assume that there are several states that depend on the history of actions and/or rewards. For example, the participant may assume that the state is defined by the last action (Figure 1c), by the last action and the last reward, or by a function of the long history of actions [33]. Finally, the participant may assume a strategic game setting, such as a matching-pennies game (Figure 1d). These and other possible assumptions (Figure 1e) may lead to very different predictions about behavior [34]. In support of this possibility, experimental manipulations such as instructions, which are irrelevant to the reward schedule but may change the prior belief about the number of states, can have a considerable effect on human behavior [35].

Figure 1. Repertoire of possible world models. In this example, a participant is tested in the two-armed bandit task. (a) From the experimentalist's point of view, the world is characterized by a single state (S0) and two actions: a left (L) or right (R) button press. However, from the participant's point of view there is an infinite repertoire of possible world models characterized by different sets of states and actions. (b) With respect to the action set, she may assume that there is only a single available action, pressing any button regardless of its location (L/R). (c) With respect to the state set, the participant may assume that the state is defined by her last action (SL and SR, for a previous L and R action, respectively). (d) Moreover, the participant may assume she is playing a penny-matching game with another human. (e) These and other possible assumptions may lead to very different predictions in the framework of RL.

Finally, humans and animals have been shown to develop idiosyncratic and stereotyped superstitious behaviors even in simple laboratory settings [36]. If participants fail to recognize the true structure of a learning problem in simple laboratory settings, they may also fail to identify the relevant states and actions when learning from rewards in natural environments. For example, professional basketball players have been shown to overgeneralize when learning from their experience [37].
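The following minimal sketch makes the point about assumed state sets concrete: the same Q-learner from the earlier sketch is run on the same (illustrative) two-armed bandit under two different world models, one with a single state (Figure 1a) and one in which the state is taken to be the previous action (Figure 1c). The environment and parameters are made up; the sketch only illustrates that the learner's internal representation, and hence its learning dynamics, depend on the assumed state set.

```python
import numpy as np

rng = np.random.default_rng(2)
reward_prob = np.array([0.3, 0.7])  # illustrative two-armed bandit
alpha, beta = 0.1, 3.0

def run(state_fn, n_states, n_trials=2000):
    """Q-learning in which the subjective state is given by state_fn(previous action)."""
    Q = np.zeros((n_states, 2))
    prev_action, total = 0, 0.0
    for _ in range(n_trials):
        s = state_fn(prev_action)
        p = np.exp(beta * Q[s]) / np.sum(np.exp(beta * Q[s]))
        a = rng.choice(2, p=p)
        r = float(rng.random() < reward_prob[a])
        Q[s, a] += alpha * (r - Q[s, a])
        prev_action, total = a, total + r
    return total / n_trials

# World model 1: a single state, as the experimentalist intends (Figure 1a).
print("single state:       ", run(lambda prev: 0, n_states=1))
# World model 2: the state is the previous action (Figure 1c).
print("state = last action:", run(lambda prev: prev, n_states=2))
```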
Learning the world model

Many models of operant learning take as given that the learner has already recognized the available sets of states and actions (Figure 2a). Hence, when attempting to account for human behavior, they fail to consider the necessary preliminary step of identifying them (correctly or incorrectly). In machine learning, classification is often preceded by unsupervised dimensionality reduction for feature extraction [38,39]. Similarly, it has been suggested that operant learning is a two-step process (Figure 2b): in the first step, the state and action sets are learned from the history (possibly using priors on the world), and in the second step RL algorithms are used to find the optimal policy given these sets [40]. An interesting alternative is that the relevant state-action sets and the policy are learned in parallel (Figure 2c). For example, in a new approach in RL, known as feature RL, the state set and the values of the states are learned simultaneously from the history of observations, actions and rewards. One crucial property of feature RL is that it neither requires nor learns a model of the complete observation space, but rather learns a model that is based on the reward-relevant observations [41].

Figure 2. Alternative models of operant learning. In operant learning, experience, composed of present and past observations, actions and rewards, is used to learn a policy. (a) Standard RL models typically assume that the learner has access to the relevant state and action sets before the learning of the policy. Alternative suggestions are that the state and action sets are learned from experience and from prior expectations before (b) or in parallel with (c) the learning of the policy. (d) Alternatively, the agent may learn directly, without an explicit representation of states and actions, by tuning a parametric policy, for example using stochastic gradient methods on the policy's parameters.
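The following is a minimal sketch of the two-step proposal (Figure 2b) under made-up assumptions: raw, noisy observations are first clustered into a small discrete state set by an unsupervised step (a tiny two-means clustering stands in for state discovery), and a standard value-learning rule is then run on the discovered states. This is not the feature-RL algorithm of [41]; the environment, the number of clusters and the learning rule are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative environment: a hidden state in {0, 1} flips with probability 0.1 each
# trial; the learner only sees a noisy scalar observation of it.
def step(hidden, action):
    reward = float(rng.random() < (0.8 if action == hidden else 0.2))
    if rng.random() < 0.1:
        hidden = 1 - hidden
    return hidden, reward

def observe(hidden):
    return hidden + 0.3 * rng.normal()

# Step 1: learn a state set from observations gathered under a random policy
# (a tiny 1D two-means clustering stands in for unsupervised state discovery).
hidden, obs = 0, []
for _ in range(500):
    obs.append(observe(hidden))
    hidden, _ = step(hidden, rng.integers(2))
obs = np.array(obs)
centers = np.array([obs.min(), obs.max()])
for _ in range(20):
    labels = np.argmin(np.abs(obs[:, None] - centers[None, :]), axis=1)
    centers = np.array([obs[labels == k].mean() for k in range(2)])

# Step 2: value learning on the discovered states (per-state bandit-style update;
# discounting of future states is omitted for brevity).
Q, alpha, beta = np.zeros((2, 2)), 0.1, 3.0
hidden, total = 0, 0.0
for _ in range(3000):
    s = int(np.argmin(np.abs(observe(hidden) - centers)))
    p = np.exp(beta * Q[s]) / np.sum(np.exp(beta * Q[s]))
    a = rng.choice(2, p=p)
    hidden, r = step(hidden, a)
    Q[s, a] += alpha * (r - Q[s, a])
    total += r
print("Average reward with learned states:", total / 3000)
```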
Learning without states

Operant learning can also be accomplished without an explicit representation of states and actions, by directly tuning a parametric policy (Figure 2d). A plausible implementation of such direct policy learning is provided by stochastic policy-gradient methods [42–44]. The idea behind these methods is that the gradient of the average reward (with respect to the policy parameters) can be estimated on-line by perturbing a neural network model with noise and considering the effect of these perturbations on the stream of payoffs delivered to the learning agent. Changes in the policy in the direction of this estimated gradient are bound, under certain assumptions, to improve performance. However, local minima may prevent the learning dynamics from converging to the optimal solution.
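A minimal sketch of this idea, in a REINFORCE-like form for a two-armed bandit: the stochasticity of the choice plays the role of the noise perturbation, and its correlation with the reward, relative to a running baseline, provides an on-line estimate of the gradient of the average reward with respect to the policy parameter. The task and learning rates are illustrative assumptions, not the specific models of the studies cited here.

```python
import numpy as np

rng = np.random.default_rng(4)
reward_prob = np.array([0.3, 0.7])  # illustrative two-armed bandit

theta, baseline = 0.0, 0.0          # policy parameter and running reward baseline
eta, eta_b = 0.2, 0.05              # learning rates (illustrative)

for trial in range(3000):
    p = 1.0 / (1.0 + np.exp(-theta))   # probability of choosing action 1
    a = int(rng.random() < p)          # stochastic choice acts as the perturbation
    r = float(rng.random() < reward_prob[a])

    # Correlate the fluctuation of the choice around its mean (a - p) with the
    # reward relative to a running baseline: an on-line, REINFORCE-style estimate
    # of the gradient of the average reward with respect to theta.
    theta += eta * (r - baseline) * (a - p)
    baseline += eta_b * (r - baseline)

print("P(choose the better action):", 1.0 / (1.0 + np.exp(-theta)))
```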
Direct policy methods have been proposed to explain birdsong learning [45] and have received some experimental support [46,47]. In humans, a model for gradient learning in spiking neurons [48,49] has been shown to be consistent with the dynamics of human learning in two-player games [50]. Under certain conditions, gradient-like learning can be implemented using covariance-based synaptic plasticity. Interestingly, operant matching (not to be confused with probability matching) naturally emerges in this framework [51,52]. A model based on attractor dynamics and covariance-based synaptic plasticity has been shown to quantitatively account for free operant learning in rats [53]. However, experimental evidence for gradient-based learning implemented at the level of single synapses awaits future experiments.

Concluding remarks

RL is the dominant theoretical framework for operant learning in humans and animals. RL models have been partially successful in quantitatively modeling learning behavior and have provided important insights into the putative roles of different brain structures in operant learning. Yet substantial theoretical and experimental challenges remain, indicating that these models may be substantially oversimplified. In particular, how state-space representations are learned in operant learning remains an important challenge for future research.

Acknowledgements

We would like to thank Ido Erev for many fruitful discussions and David Hansel, Gianluigi Mongillo, Tal Neiman and Ran Darshan for carefully reading the manuscript. This work was supported by the Israel Science Foundation (Grant No. 868/08), by a grant from the Ministry of Science and Technology, Israel, and the Ministry of Foreign and European Affairs and the Ministry of Higher Education and Research, France, and by the Gatsby Charitable Foundation.

References and recommended reading

Papers of particular interest, published within the period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Sutton RS, Barto AG: Introduction to Reinforcement Learning. MIT Press; 1998.
2. Shteingart H, Neiman T, Loewenstein Y: The role of first impression in operant learning. J Exp Psychol Gen 2013, 142:476-488.
This paper analyzed the behavior of human participants in a two-armed bandit task. Its main contributions are: first, it characterizes non-parametrically the action-selection function in humans, which exhibits substantial exploration even when the difference in the values of the alternatives is large; second, it demonstrates a substantial primacy effect in operant learning; third, the primacy and trial-by-trial dynamics are well fitted in the model-free framework if the first experience is assumed to reset the initial conditions of the action values.
3. Schultz W: Updating dopamine reward signals. Curr Opin Neurobiol 2013, 23:229-238.
4. Montague PR, Hyman SE, Cohen JD: Computational roles for dopamine in behavioural control. Nature 2004, 431:760-767.
5. Kim KM, Baratta MV, Yang A, Lee D, Boyden ES, Fiorillo CD: Optogenetic mimicry of the transient activation of dopamine neurons by natural reward is sufficient for operant reinforcement. PLoS ONE 2012, 7:e33612.
The hypothesis that transient dopaminergic activity codes for the RPE had been based on the finding that the two are correlated. This study demonstrates a causal link between the dopamine signal and operant learning by simulating the natural transient activation of VTA dopamine neurons using optogenetic methods. The results provide strong support for the role of dopaminergic neurons in driving operant learning.
6. Samejima K, Ueda Y, Doya K, Kimura M: Representation of action-specific reward values in the striatum. Science 2005, 310:1337-1340.
7. O'Doherty J, Dayan P, Schultz J, Deichmann R, Friston K, Dolan RJ: Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science 2004, 304:452-454.
8. Nicola SM: The nucleus accumbens as part of a basal ganglia action selection circuit. Psychopharmacology (Berl) 2007, 191:521-550.
9. Denrell J: Adaptive learning and risk taking. Psychol Rev 2007, 114:177-187.
10. Hertwig R, Barron G, Weber EU, Erev I: Decisions from experience and the effect of rare events in risky choice. Psychol Sci 2004, 15:534-539.
11. Yechiam E, Busemeyer JR, Stout JC, Bechara A: Using cognitive models to map relations between neuropsychological disorders and human decision-making deficits. Psychol Sci 2005, 16:973-978.
12. Maia TV, Frank MJ: From reinforcement learning models to psychiatric and neurological disorders. Nat Neurosci 2011, 14:154-162.
13. Montague PR, Dolan RJ, Friston KJ, Dayan P: Computational psychiatry. Trends Cogn Sci 2012, 16:72-80.
14. Huys QJM, Eshel N, O'Nions E, Sheridan L, Dayan P, Roiser JP: Bonsai trees in your head: how the Pavlovian system sculpts goal-directed choices by pruning decision trees. PLoS Comput Biol 2012, 8:e1002410.
Participants in this study were tested in a sequential reinforcement-based task. Modeling the participants' behavior with a decision tree, the authors found that behavior is consistent with a simple pruning strategy: pruning any further evaluation of a sequence as soon as a large loss is encountered. These findings also highlight the importance of considering simple heuristics when studying human behavior.
15. Lammel S, Lim BK, Malenka RC: Reward and aversion in a heterogeneous midbrain dopamine system. Neuropharmacology 2013, 76:351-359.
16. Iordanova MD: Dopamine transmission in the amygdala modulates surprise in an aversive blocking paradigm. Behav Neurosci 2010, 124:780-788.
17. Joshua M, Adler A, Mitelman R, Vaadia E, Bergman H: Midbrain dopaminergic neurons and striatal cholinergic interneurons encode the difference between reward and aversive events at different epochs of probabilistic classical conditioning trials. J Neurosci 2008, 28:11673-11684.
18. Fiorillo CD, Tobler PN, Schultz W: Discrete coding of reward probability and uncertainty by dopamine neurons. Science 2003, 299:1898-1902.
19. Seymour B, Daw ND, Roiser JP, Dayan P, Dolan R: Serotonin selectively modulates reward value in human decision-making. J Neurosci 2012, 32:5833-5842.
20. Vickery TJ, Chun MM, Lee D: Ubiquity and specificity of reinforcement signals throughout the human brain. Neuron 2011, 72:166-177.
21. Tricomi E, Balleine BW, O'Doherty JP: A specific role for posterior dorsolateral striatum in human habit learning. Eur J Neurosci 2009, 29:2225-2232.
22. Wunderlich K, Dayan P, Dolan RJ: Mapping value based planning and extensively trained choice in the human brain. Nat Neurosci 2012, 15:786-791.
23. Gläscher J, Daw N, Dayan P, O'Doherty JP: States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron 2010, 66:585-595.
24. Daw ND, Gershman SJ, Seymour B, Dayan P, Dolan RJ: Model-based influences on humans' choices and striatal prediction errors. Neuron 2011, 69:1204-1215.
This study explored the behavior and BOLD signals of human participants in a two-stage operant task. The behavioral results are consistent with neither a purely model-based nor a purely model-free algorithm; rather, they indicate that participants use a combination of the two. Activity in the ventral striatum is correlated with both model-free and model-based error signals, challenging the hypothesis that model-based and model-free learning are anatomically separated.
25. Jaakkola T, Singh SP, Jordan MI: Reinforcement learning algorithm for partially observable Markov decision problems. Adv Neural Inf Process Syst 1995, 7:345-352.
26. Barto AG, Mahadevan S: Recent advances in hierarchical reinforcement learning. Discrete Event Dyn Syst 2003, 13:341-379.
27. Diuk C, Tsai K, Wallis J, Botvinick M, Niv Y: Hierarchical learning induces two simultaneous, but separable, prediction errors in human basal ganglia. J Neurosci 2013, 33:5797-5805.
The authors used functional neuroimaging to measure correlates of RPE signals in humans performing a two-level hierarchical task. Signals in the ventral striatum and the ventral tegmental area were significantly correlated with the two (weakly correlated) RPEs of the two levels of hierarchy in the task. This result suggests a more complex representation of the RPE in the brain than predicted by simple model-free learning.
28. Nevo I, Erev I: On surprise, change, and the effect of recent outcomes. Front Psychol 2012, 3:24.
29. Worthy DA, Hawthorne MJ, Otto AR: Heterogeneity of strategy use in the Iowa gambling task: a comparison of win-stay/lose-shift and reinforcement learning models. Psychon Bull Rev 2013, 20:364-371.
This study of human behavior reports that even in the simple Iowa gambling task (a four-armed bandit task), the predictive power of RL algorithms is comparable to that of the simple WSLS heuristic. This result was controlled for the extent to which the WSLS and RL models assume unique strategies. These results demonstrate the limitations of RL models in explaining even simple human behaviors.
30. McCrea SM, Hirt ER: Match madness: probability matching in prediction of the NCAA basketball tournament. J Appl Soc Psychol 2009, 39:2809-2839.
31. Vulkan N: An economist's perspective on probability matching. J Econ Surv 2000, 14:101-118.
32. Laquitaine S, Piron C, Abellanas D, Loewenstein Y, Boraud T: Complex population response of dorsal putamen neurons predicts the ability to learn. PLoS ONE 2013, 8:e80683.
33. Loewenstein Y, Prelec D, Seung HS: Operant matching as a Nash equilibrium of an intertemporal game. Neural Comput 2009, 21:2755-2773.
34. Green CS, Benson C, Kersten D, Schrater P: Alterations in choice behavior by manipulations of world model. Proc Natl Acad Sci USA 2010, 107:16401-16406.
35. Morse EB, Runquist WN: Probability-matching with an unscheduled random sequence. Am J Psychol 1960, 73:603-607.
36. Ono K: Superstitious behavior in humans. J Exp Anal Behav 1987, 47:261-271.
37. Neiman T, Loewenstein Y: Reinforcement learning in professional basketball players. Nat Commun 2011, 2:569.
38. Hinton GE, Salakhutdinov RR: Reducing the dimensionality of data with neural networks. Science 2006, 313:504-507.
39. Hinton G: Where do features come from? Cogn Sci 2013, http://dx.doi.org/10.1111/cogs.12049.
40. Legenstein R, Wilbert N, Wiskott L: Reinforcement learning on slow features of high-dimensional input streams. PLoS Comput Biol 2010, 6.
41. Nguyen P, Sunehag P, Hutter M: Context tree maximizing reinforcement learning. In Proceedings of the 26th AAAI Conference on Artificial Intelligence 2012:1075-1082.
42. Williams R: Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach Learn 1992, 8:229-256.
43. Sutton RS, McAllester D, Singh S, Mansour Y: Policy gradient methods for reinforcement learning with function approximation. Adv Neural Inf Process Syst 1999, 12:1057-1063.
44. Baxter J, Bartlett PL: Infinite-horizon policy-gradient estimation. J Artif Intell Res 2001, 15:319-350.
45. Fiete IR, Fee MS, Seung HS: Model of birdsong learning based on gradient estimation by dynamic perturbation of neural conductances. J Neurophysiol 2007, 98:2038-2057.
46. Andalman AS, Fee MS: A basal ganglia-forebrain circuit in the songbird biases motor output to avoid vocal errors. Proc Natl Acad Sci USA 2009, 106:12518-12523.
47. Tumer EC, Brainard MS: Performance variability enables adaptive plasticity of crystallized adult birdsong. Nature 2007, 450:1240-1244.
48. Urbanczik R, Senn W: Reinforcement learning in populations of spiking neurons. Nat Neurosci 2009, 12:250-252.
49. Friedrich J, Urbanczik R, Senn W: Spatio-temporal credit assignment in neuronal population learning. PLoS Comput Biol 2011, 7:e1002092.
50. Friedrich J, Senn W: Spike-based decision learning of Nash equilibria in two-player games. PLoS Comput Biol 2012, 8:e1002691.
51. Loewenstein Y, Seung HS: Operant matching is a generic outcome of synaptic plasticity based on the covariance between reward and neural activity. Proc Natl Acad Sci USA 2006, 103:15224-15229.
52. Loewenstein Y: Synaptic theory of replicator-like melioration. Front Comput Neurosci 2010, 4:17.
53. Neiman T, Loewenstein Y: Covariance-based synaptic plasticity in an attractor network model accounts for fast adaptation in free operant learning. J Neurosci 2013, 33:1521-1534.