
Structure learning in human sequential decision-making

2009, Frontiers in Systems Neuroscience


Structure Learning in Human Sequential Decision-Making

Daniel Acuña, Dept. of Computer Science and Eng., University of Minnesota–Twin Cities [email protected]
Paul Schrater, Dept. of Psychology and Computer Science and Eng., University of Minnesota–Twin Cities [email protected]

Abstract

We use graphical models and structure learning to explore how people learn policies in sequential decision-making tasks. Studies of sequential decision-making in humans frequently find suboptimal performance relative to an ideal actor that knows the graph model that generates reward in the environment. We argue that the learning problem humans face also involves learning the graph structure for reward generation in the environment. We formulate the structure learning problem using mixtures of reward models, and solve the optimal action selection problem using Bayesian Reinforcement Learning. We show that structure learning in one- and two-armed bandit problems produces many of the qualitative behaviors deemed suboptimal in previous studies. Our argument is supported by the results of experiments that demonstrate humans rapidly learn and exploit new reward structure.

1 Introduction

Humans perform sequential decision-making under uncertainty daily: to choose products, services, careers, and jobs, and to mate and survive as a species. One of the central problems in sequential decision-making under uncertainty is balancing exploration and exploitation in the search for good policies. Using model-based (Bayesian) reinforcement learning [1], it is possible to solve this problem optimally by finding policies that maximize the expected discounted future reward [2]. However, solutions are notoriously hard to compute, and it is unclear whether optimal models are appropriate for human decision-making. For tasks simple enough to allow comparison between human behavior and normative theory, like the multi-armed bandit problem, human choices appear suboptimal. In particular, earlier studies suggested human choices reflect inaccurate Bayesian updating with suboptimalities in exploration [3, 4, 5, 6]. Moreover, in one-armed bandit tasks where exploration is not necessary, people frequently converge to probability matching [7, 8, 9, 10] rather than the better option, even when subjects are aware which option is best [11].

However, failures against normative prediction may reflect optimal decision-making for a task that differs from the experimenter's intention. For example, people may assume the environment is potentially dynamically varying. When this assumption is built into normative predictions, the resulting models account much better for human choices in one-armed bandit problems [12], and potentially in multi-armed problems [13]. In this paper, we investigate another possibility: that humans may be learning the structure of the task by forming beliefs over a space of canonical causal models of reward-action contingencies. Most assessments of human performance view the subject's task as parameter estimation (e.g., of reward probabilities) within a known model (a fixed causal graph structure) that encodes the relations between environmental states, rewards, and actions created by the experimenter. However, despite instruction, subjects may reasonably be uncertain about the model and instead try to learn it.

To illustrate structure learning in a simple task, suppose you are alone in a casino with many rooms. In one room you find two slot machines.
It is typically assumed that you know the machines are independent and give rewards of either 0 (failure) or 1 (success) with unknown probabilities that must be estimated. The structure learning viewpoint allows for more possibilities: Are the machines independent, or are they rigged to covary? Do they have the same probability? Does reward accrue when a machine is not played for a while? We believe answers to these questions form a natural set of causal hypotheses about how reward/action contingencies may occur in natural environments.

In this work, we assess the effect of uncertainty between two critical reward structures in terms of the need to explore. The first structure is a one-armed bandit problem in which exploration is not necessary (reward generation is coupled across arms), so greedy action is optimal [14]. The second is a two-armed bandit problem in which exploration is necessary (reward generation is independent at each arm), so each action must balance the exploration/exploitation tradeoff [15]. We illustrate how structure learning affects action selection and the value of information gathering in a simple sequential choice task resembling a multi-armed bandit (MAB), but with uncertainty between the two previous models of reward coupling. We develop a normative model of learning and action for this class of problems, illustrate the effect of model uncertainty on action selection, and show evidence that people perform structure learning.

2 Bayesian Reinforcement Learning: Structure Learning

The language of graphical models provides a useful framework for describing the possible structure of rewards in the environment. Consider an environment with several distinct reward sites that can be sampled, but in which the way rewards are generated is unknown. In particular, rewards at each site may be independent, or there may be a latent cause that accounts for the presence of rewards at both sites. Even if independent, the reward sites may be homogeneous and therefore have the same probability. Uncertainty about which reward model is correct naturally produces a mixture as the appropriate learning model. This structure learning model is a special case of Bayesian Reinforcement Learning (BRL), where the states of the environment are the reward sites and the transitions between states are determined by the action of sampling a reward site. Uncertainty about reward dynamics and contingencies can be modeled by including within the belief state not only reward probabilities but also the possibility of independent or coupled rewards. The optimal balance of exploration and exploitation in BRL then results in action selection that seeks to maximize (1) expected rewards, (2) information about reward dynamics, and (3) information about task structure.

Given that the tasks tested in this research involve mixtures of multi-armed bandit (MAB) problems, we borrow MAB terminology, calling a reward site an arm and a sample a choice or pull. However, the mixture models we describe are not MAB problems: MAB problems require that the dynamics of a site (arm) remain frozen until it is visited again, which is not true in general for our mixture model.

Let $\gamma$ ($0 < \gamma < 1$) be a discounting factor such that a possibly stochastic reward $x$ obtained $t$ time steps in the future is worth $\gamma^t x$ today. Optimality requires an action selection policy that maximizes the expected total discounted future reward $E_b\left[x + \gamma x + \gamma^2 x + \cdots\right]$, where $b$ is the belief over environment dynamics. Let $x_a$ be a reward acquired from arm $a$.
After observing reward $x_a$, we compute a belief-state posterior $b_{x_a} \equiv p(b \mid x_a) \propto p(x_a \mid b)\, p(b)$. Let $f(x_a \mid b)$ be the predicted probability of reward $x_a$ given belief $b$, obtained by marginalizing the reward likelihood over the belief, and let $r(b, a) \equiv \sum_{x_a} x_a\, f(x_a \mid b)$ be the expected reward of sampling arm $a$ at state $b$. The value of a state can be found using the Bellman equation [2],

$$V(b) = \max_a \left\{ r(b, a) + \gamma \sum_{x_a} f(x_a \mid b)\, V(b_{x_a}) \right\}. \qquad (1)$$

The optimal action can be recovered by choosing the arm

$$a = \arg\max_{a'} \left\{ r(b, a') + \gamma \sum_{x_{a'}} f(x_{a'} \mid b)\, V(b_{x_{a'}}) \right\}. \qquad (2)$$

The belief over dynamics is effectively a probability distribution over the possible Markov decision processes that would explain the observables. As such, the optimal policy can be described as a mapping from belief states to actions. In principle, the optimal solution can be found by solving the Bellman optimality equations, but in general there are countably or uncountably many belief states and solutions require approximation.

Figure 1: Different graphical models for the generation of rewards at two known sites in the environment. The agent faces $M$ bandit tasks, each comprising a random number $N$ of choices. (a) 2-arm bandit with no coupling: reward sites are independent. (b) 1-arm bandit with reward coupling: rewards are dependent within a bandit task. (c) Mixture of generative models used by the learning model: the causes of reward may be independent or coupled, and the node $c$ acts as an "XOR" switch between coupled and independent reward.

In Figure 1, we show the two reward structures considered in this paper. Figure 1(a) illustrates a structure where the arms are independent and Figure 1(b) one where they are coupled. When independent, rewards $x_a$ at arm $a$ are samples from an unknown distribution $p(x_a \mid \theta_a)$. When coupled, the rewards $x_1$ and $x_2$ both depend on a "hidden" reward state $x_3$ sampled from $p(x_3 \mid \theta_3)$. If we were certain which of the two models was right, the action selection problem would have a known solution in both cases, presented below.

Independent rewards. Learning and acting in an environment like the one described in Figure 1(a) is known as the multi-armed bandit (MAB) problem. The MAB problem is a special case of BRL because we can partition the belief $b$ into a disjoint set of beliefs about each arm, $\{b_a\}$. Because beliefs about non-sampled arms remain frozen until those arms are sampled again, and sampling one arm does not affect the belief about any other, independent learning and action selection for each arm is possible. Let $\lambda_a$ be the reward of a deterministic arm chosen such that the two terms inside the maximization of $V(b_a) = \max\left\{ \lambda_a/(1-\gamma),\; r(b_a, a) + \gamma \sum_{x_a} f(x_a \mid b_a)\, V(b_{x_a}) \right\}$ are equal. Gittins [16] proved that it is optimal to choose the arm $a$ with the highest such reward $\lambda_a$ (called the Gittins index). This allows a speedup of computation by transforming a many-arm bandit problem into many 2-arm bandit problems. In our task, the belief about a binary reward may be represented by a Beta distribution with sufficient-statistic parameters $\alpha_a, \beta_a$ (both $> 0$) such that $x_a \sim p(x_a \mid \theta_a) = \theta_a^{x_a}(1-\theta_a)^{1-x_a}$, where $\theta_a \sim p(\theta_a; \alpha_a, \beta_a) \propto \theta_a^{\alpha_a - 1}(1-\theta_a)^{\beta_a - 1}$. Thus the expected reward $r(\alpha_a, \beta_a, a)$ and the predicted probability of reward $f(x_a = 1 \mid \alpha_a, \beta_a)$ are both $\alpha_a(\alpha_a + \beta_a)^{-1}$, and the belief-state transition is $b_{x_a} = \langle \alpha_a + x_a,\; \beta_a + 1 - x_a \rangle$.
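As a concrete illustration of this Beta-Bernoulli bookkeeping, the sketch below implements the belief state for a single independent arm: the predicted probability of reward (which equals the expected immediate reward) and the belief-state transition. The class and method names are our own illustration, not code from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BetaBelief:
    """Belief about one independent Bernoulli arm: theta_a ~ Beta(alpha, beta)."""
    alpha: float = 1.0  # pseudo-count of successes (uniform prior: alpha = 1)
    beta: float = 1.0   # pseudo-count of failures  (uniform prior: beta = 1)

    def predicted_reward(self) -> float:
        # f(x_a = 1 | alpha_a, beta_a) = alpha_a / (alpha_a + beta_a),
        # which also equals the expected immediate reward r(b, a).
        return self.alpha / (self.alpha + self.beta)

    def update(self, x: int) -> "BetaBelief":
        # Belief-state transition <alpha_a + x_a, beta_a + 1 - x_a>.
        return BetaBelief(self.alpha + x, self.beta + 1 - x)

# Example: start from a uniform prior, observe a success and then a failure.
b = BetaBelief().update(1).update(0)   # Beta(2, 2)
print(b.predicted_reward())            # 0.5
```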
Therefore, the Gittins index may be found by solving the Bellman equation using dynamic programming,

$$V(\alpha_a, \beta_a) = \max\left\{ \frac{\lambda_a}{1-\gamma},\; (\alpha_a + \beta_a)^{-1}\left[\alpha_a + \gamma\left(\alpha_a V(\alpha_a + 1, \beta_a) + \beta_a V(\alpha_a, \beta_a + 1)\right)\right] \right\},$$

to a sufficiently large horizon. In our experiments we use $\gamma = 0.98$, for which a horizon of $H = 1000$ suffices.

Coupled rewards. Learning and acting in coupled environments (Figure 1b) is trivial because there is no need to maximize information in acting [14]. The belief state is represented by a Beta distribution with sufficient statistics $\alpha_3, \beta_3$ ($> 0$), and the optimal action is to choose the arm $a$ with the highest expected reward

$$r(\alpha_3, \beta_3, a) = \begin{cases} \alpha_3 / (\alpha_3 + \beta_3) & a = 1 \\ \beta_3 / (\alpha_3 + \beta_3) & a = 2 \end{cases}$$

The belief-state transitions are $b_1 = \langle \alpha_3 + x_1,\; \beta_3 + 1 - x_1 \rangle$ and $b_2 = \langle \alpha_3 + 1 - x_2,\; \beta_3 + x_2 \rangle$.

3 Learning and acting with model uncertainty

In this section we consider the case where there is uncertainty about the reward model. The agent's belief is captured by a graphical model for a family of reward structures that may or may not be coupled. We show that learning can be accurate and that action selection is relatively efficient. We restrict ourselves to the following scenario: the agent is presented with a block of $M$ bandit tasks, each with initially unknown Bernoulli reward probabilities and coupling. Each task involves $N$ discrete choices, where $N$ is sampled from a geometric distribution $(1-\gamma)\gamma^N$.

Figure 1(c) shows the mixture of the two possible reward models of Figures 1(a) and (b). Node $c$ switches the mixture between the two reward models and encodes part of the belief state of the process; it acts as an "XOR" gate between the two generative models. Because $c$ is unknown, $p(c = 0)$ is the mixing proportion for the independent reward structure and $p(c = 1)$ is the mixing proportion for the coupled reward structure. We place a prior on $c$ using the distribution $p(c; \phi) = \phi^c (1-\phi)^{1-c}$, with parameter $\phi$. The posterior is

$$p(\theta_1, \theta_2, \theta_3, c \mid s_1, f_1, s_2, f_2) \propto \begin{cases} (1-\phi)\; \theta_1^{\alpha_1 - 1 + s_1}(1-\theta_1)^{\beta_1 - 1 + f_1}\; \theta_2^{\alpha_2 - 1 + s_2}(1-\theta_2)^{\beta_2 - 1 + f_2}\; \theta_3^{\alpha_3 - 1}(1-\theta_3)^{\beta_3 - 1} & c = 0 \\ \phi\; \theta_1^{\alpha_1 - 1}(1-\theta_1)^{\beta_1 - 1}\; \theta_2^{\alpha_2 - 1}(1-\theta_2)^{\beta_2 - 1}\; \theta_3^{\alpha_3 - 1 + s_1 + f_2}(1-\theta_3)^{\beta_3 - 1 + s_2 + f_1} & c = 1 \end{cases} \qquad (3)$$

where $s_a$ and $f_a$ are the numbers of successes and failures observed at arm $a$. The posterior (3) is a mixture of the beliefs on the parameters $\theta_j$, $1 \le j \le 3$. With mixing proportion $\phi$, successes of arm 1 and failures of arm 2 are attributed to successes of the shared "hidden" arm 3, whereas failures of arm 1 and successes of arm 2 are attributed to failures of arm 3. With mixing proportion $1-\phi$, the usual Beta-Bernoulli learning of independent arms applies. At the beginning of each bandit task, we assume the agent "resets" its belief about the arms ($s_i = f_i = 0$), but the posterior over $c$ is carried over and used as the prior for the next bandit task.

Let $\mathrm{Beta}(\alpha, \beta)$ be the Beta function. The marginal posterior on $c$ is

$$p(c \mid s_1, f_1, s_2, f_2) \propto \begin{cases} (1-\phi)\, \dfrac{\mathrm{Beta}(\alpha_1 + s_1, \beta_1 + f_1)\, \mathrm{Beta}(\alpha_2 + s_2, \beta_2 + f_2)}{\mathrm{Beta}(\alpha_1, \beta_1)\, \mathrm{Beta}(\alpha_2, \beta_2)} & c = 0 \\ \phi\, \dfrac{\mathrm{Beta}(\alpha_3 + s_1 + f_2, \beta_3 + f_1 + s_2)}{\mathrm{Beta}(\alpha_3, \beta_3)} & c = 1 \end{cases}$$

The belief state $b$ of this process may be completely represented by $\langle s_1, f_1, s_2, f_2;\; \phi, \alpha_1, \beta_1, \alpha_2, \beta_2, \alpha_3, \beta_3 \rangle$.
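The marginal posterior on $c$ is straightforward to evaluate numerically. The following is a minimal Python sketch using log-Beta functions for numerical stability; the function name, the argument defaults, and the use of scipy are our choices for illustration, not the paper's implementation.

```python
import numpy as np
from scipy.special import betaln  # log of the Beta function

def coupling_posterior(s1, f1, s2, f2, phi=0.5,
                       a1=1.0, b1=1.0, a2=1.0, b2=1.0, a3=1.0, b3=1.0):
    """Return p(c = 1 | s1, f1, s2, f2), the posterior probability that rewards are coupled."""
    # Log-evidence for the independent model (c = 0): two separate Beta-Bernoulli arms.
    log_indep = (np.log(1 - phi)
                 + betaln(a1 + s1, b1 + f1) - betaln(a1, b1)
                 + betaln(a2 + s2, b2 + f2) - betaln(a2, b2))
    # Log-evidence for the coupled model (c = 1): successes of arm 1 and failures of
    # arm 2 count as successes of the hidden arm 3, and vice versa.
    log_coup = (np.log(phi)
                + betaln(a3 + s1 + f2, b3 + f1 + s2) - betaln(a3, b3))
    m = max(log_indep, log_coup)
    w0, w1 = np.exp(log_indep - m), np.exp(log_coup - m)
    return w1 / (w0 + w1)

# Example: arm 1 mostly succeeds while arm 2 mostly fails, which favors coupling.
print(coupling_posterior(s1=8, f1=2, s2=1, f2=9))
```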
The predicted probabilities of rewards $x_1$ and $x_2$ are

$$f(x_1 \mid s_1, f_1, s_2, f_2) = \begin{cases} (1-\phi)\, \dfrac{\alpha_1 + s_1}{\alpha_1 + s_1 + \beta_1 + f_1} + \phi\, \dfrac{\alpha_3 + s_1 + f_2}{\alpha_3 + s_1 + f_2 + \beta_3 + s_2 + f_1} & x_1 = 1 \\ (1-\phi)\, \dfrac{\beta_1 + f_1}{\alpha_1 + s_1 + \beta_1 + f_1} + \phi\, \dfrac{\beta_3 + s_2 + f_1}{\alpha_3 + s_1 + f_2 + \beta_3 + s_2 + f_1} & x_1 = 0 \end{cases} \qquad (4)$$

and similarly

$$f(x_2 \mid s_1, f_1, s_2, f_2) = \begin{cases} (1-\phi)\, \dfrac{\alpha_2 + s_2}{\alpha_2 + s_2 + \beta_2 + f_2} + \phi\, \dfrac{\beta_3 + s_2 + f_1}{\alpha_3 + s_1 + f_2 + \beta_3 + s_2 + f_1} & x_2 = 1 \\ (1-\phi)\, \dfrac{\beta_2 + f_2}{\alpha_2 + s_2 + \beta_2 + f_2} + \phi\, \dfrac{\alpha_3 + s_1 + f_2}{\alpha_3 + s_1 + f_2 + \beta_3 + s_2 + f_1} & x_2 = 0 \end{cases} \qquad (5)$$

Let us drop the prior parameters $\alpha_j, \beta_j$ ($1 \le j \le 3$) and $\phi$ from $b$. Action selection then involves solving the Bellman equations

$$V(s_1, f_1, s_2, f_2) = \max_{a = 1, 2} \begin{cases} r(b, 1) + \gamma\left[ f(x_1 = 0 \mid b)\, V(s_1, f_1 + 1, s_2, f_2) + f(x_1 = 1 \mid b)\, V(s_1 + 1, f_1, s_2, f_2) \right] & a = 1 \\ r(b, 2) + \gamma\left[ f(x_2 = 0 \mid b)\, V(s_1, f_1, s_2, f_2 + 1) + f(x_2 = 1 \mid b)\, V(s_1, f_1, s_2 + 1, f_2) \right] & a = 2 \end{cases} \qquad (6)$$

Figure 2: Learning example. A block of four bandit tasks of 50 trials each for each environment: (a) learning in a coupled environment; (b) learning in an independent environment. Marginal beliefs on the reward probabilities and on coupling are shown as functions of time, with brightness indicating relative probability mass. The coupling belief distribution starts uniform with $\phi = 0.5$ and is not reset within a block. The priors $p(\theta_i; \alpha_i, \beta_i)$ are reset at the beginning of each task with $\alpha_i = \beta_i = 1$ ($1 \le i \le 3$). Note that how well the reward probabilities sum to one forms critical evidence for or against coupling.

To solve (6) using dynamic programming to a horizon $H$, a total of $\frac{1}{24}(1+H)(2+H)(3+H)(4+H)$ computations are required, representing the distinct configurations of the $s_i, f_i$ among the $4^H$ possible reward histories. This dramatic reduction allows us to be relatively accurate in our approximation to the optimal value of an action.

4 Simulation Results

In Figure 2, we show simulations of learning on blocks of four bandit tasks, each comprising 50 trials. In one simulation (a) rewards are coupled, and in the other (b) they are independent. The model learns quickly in both cases, but more slowly when the task is truly coupled, because fewer configurations of the data support this hypothesis than support the independent hypothesis.

The belief about the coupling parameter is important because it has a decisive influence on exploratory behavior. Coupling between the two arms corresponds to the case where one arm is a winner and the other a loser by experimenter design. When playing coupled arms, evidence that one arm is "good" (e.g., reward probability > 0.5) necessarily entails that the other is "bad", and hence eliminates the need for exploratory behavior: the optimal choice is to "stick with the winner" and switch when the probability estimate dips below 0.5. An agent learning a coupling parameter while sampling arms can manifest a range of exploratory behaviors that depend critically on both the recent reward history and the current state of the belief about $c$, as illustrated in Figure 3. The top row shows the value of both arms as a function of the coupling belief $p(c)$ after different amounts of evidence for the success of arm 2. The plots show that optimal actions stick with the winner when belief in coupling is high, even for small amounts of data; a sketch of how these action values can be computed is given below.
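The value curves just discussed come from solving the recursion in (6). The following Python sketch computes those action values by memoized finite-horizon dynamic programming. It is an illustration under stated assumptions, not the paper's code: uniform Beta(1,1) priors, a uniform prior on coupling ($\phi = 0.5$), the mixture weight in the predictive taken to be the current posterior probability of coupling given the counts (our reading of $\phi$ in (4)-(5)), and a short truncation horizon so the example runs quickly (much shorter than the horizons reported in the paper).

```python
from functools import lru_cache
from math import lgamma, log, exp

GAMMA, PHI, HORIZON = 0.98, 0.5, 40   # discount, prior coupling weight, truncation depth

def _logbeta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def coupling_weight(s1, f1, s2, f2):
    """Posterior probability that the arms are coupled, given the observed counts."""
    log_ind = log(1 - PHI) + _logbeta(1 + s1, 1 + f1) + _logbeta(1 + s2, 1 + f2)
    log_cpl = log(PHI) + _logbeta(1 + s1 + f2, 1 + f1 + s2)
    m = max(log_ind, log_cpl)
    return exp(log_cpl - m) / (exp(log_ind - m) + exp(log_cpl - m))

def predicted(s1, f1, s2, f2):
    """Mixture predictive probabilities of success on arms 1 and 2, cf. (4)-(5)."""
    w = coupling_weight(s1, f1, s2, f2)
    p1 = (1 - w) * (1 + s1) / (2 + s1 + f1) + w * (1 + s1 + f2) / (2 + s1 + f2 + s2 + f1)
    p2 = (1 - w) * (1 + s2) / (2 + s2 + f2) + w * (1 + s2 + f1) / (2 + s1 + f2 + s2 + f1)
    return p1, p2

def value(s1, f1, s2, f2, t):
    return max(q_values(s1, f1, s2, f2, t))

@lru_cache(maxsize=None)
def q_values(s1, f1, s2, f2, t):
    """Action values of pulling arm 1 or arm 2 in belief state (s1, f1, s2, f2), cf. (6)."""
    if t >= HORIZON:
        return 0.0, 0.0
    p1, p2 = predicted(s1, f1, s2, f2)
    q1 = p1 + GAMMA * (p1 * value(s1 + 1, f1, s2, f2, t + 1) +
                       (1 - p1) * value(s1, f1 + 1, s2, f2, t + 1))
    q2 = p2 + GAMMA * (p2 * value(s1, f1, s2 + 1, f2, t + 1) +
                       (1 - p2) * value(s1, f1, s2, f2 + 1, t + 1))
    return q1, q2

# Example: one of the information states from Figure 3 (s1 = 1, f1 = 0, s2 = 5, f2 = 3).
q1, q2 = q_values(1, 0, 5, 3, 0)
print("Q(arm 1) = %.3f, Q(arm 2) = %.3f -> pull arm %d" % (q1, q2, 1 if q1 >= q2 else 2))
```

Sweeping $f_2$ or the prior coupling weight in this sketch traces out value curves of the kind plotted in Figure 3 (after converting values to reward per unit time, $V(1-\gamma)$).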
As these plots show, belief in coupling produces underexploration compared to a model that assumes independence, and generates behavior similar to a "win-stay, lose-switch" heuristic early in learning. However, overexploration can also occur when the expected values of both arms are similar. Figure 3 (lower left) shows that uncertainty about $c$ provides an exploratory bonus to the lower-probability arm that incentivizes switching, and hence overexploration. In fact, when the difference in reward probability between the arms is small, action selection can fail to converge to the better option. Figure 3 (right) shows that $p(c)$, together with the probability of the better arm, determines the transition between exploration and exploitation. These results show that optimal action selection with model uncertainty can generate several kinds of behavior typically labeled suboptimal in multi-armed bandit experiments. Next we provide evidence that people are capable of learning and exploiting coupling, evidence that structure learning may play a role in apparent failures of humans to behave optimally in multi-armed bandit tasks.

Figure 3: Value of arms as a function of coupling. The priors are uniform ($\alpha_j = \beta_j = 1$, $1 \le j \le 3$), the evidence for arm 1 is fixed across cases ($s_1 = 1$, $f_1 = 0$), and the successes of arm 2 are fixed as well ($s_2 = 5$), while the failures of arm 2 ($f_2$) vary from 1 to 6. Upper left: belief that the arms are coupled, $p(c)$, versus reward per unit time, $V(1-\gamma)$ (where $V$ is the value), for arm 1 (dashed line) and arm 2 (solid line). In all cases an independent model would choose to pull arm 1. The vertical line shows the critical coupling belief at which the structure learning model switches to exploitative behavior. Lower left: exploratory bonus, $V(1-\gamma) - r$ (where $r$ is the expected reward), for each arm. Right panel: critical coupling belief for exploitative behavior versus the expected probability of reward of arm 2; individual points correspond to different information states (successes and failures on both arms).

5 Human Experiments

Each of 16 subjects performed 32 bandit tasks: a block of 16 in an independent environment and a block of 16 in a coupled environment. Within blocks the presentation order was randomized, and the position of the coupled block was randomized across subjects. On average each task required 48 pulls. In the independent environment the subjects made 1194 choices across the 16 tasks, and in the coupled environment 925.

Each arm is shown on the screen as a slot machine. Subjects pull a machine by pressing a key on the keyboard. When a machine is pulled, an animation of the lever is shown, 200 ms later the reward appears on the machine's screen, and a sound mimicking dropping coins plays for a duration proportional to the amount gathered. We provide several cues, some redundant, to help subjects keep track of previous rewards. At the top, each machine shows the number of pulls, the total reward, and the average reward per pull so far. Instead of binary rewards of 0 and 1, the task presented 0 and 100.
The machine's screen changes color according to the average reward, from red (zero points) through yellow (fifty points) to green (one hundred points). The machine's total reward is shown as a pile of coins underneath it. The total score, total pulls, and rankings within a game were also presented.

6 Results

We analyze how task uncertainty affects decisions by comparing human behavior to that of the optimal model and of models that assume a fixed structure. For each agent, whether human or model, we compute the empirical probability that it selects the oracle-best action as a function of the optimal belief that a block of tasks is coupled. The idea behind this measure is to show how the belief about task structure changes behavior, and which of the models better captures human behavior. We ran 1000 agents for each of the models, with task uncertainty (optimal model), an assumed coupled reward task (coupled model), and an assumed independent reward task (independent model), under the same conditions that subjects faced in both the coupled and independent blocks. For each of the decisions of these models and for the 33904 decisions made by the 16 subjects, we compute the optimal belief on coupling according to our model and bin the proportion of times the agent chooses the (oracle) best arm according to this belief.

The results are summarized in Figure 4. The independent model tends to perform equally well on both coupled and independent reward tasks. The coupled model tends to perform well only in the coupled tasks and worse in the independent tasks. As expected, the optimal model has better overall performance, but it does not outperform the models with fixed task structure in their respective tasks, because it pays the price of learning early in the block. The optimal model behaves like a mixture of the coupled and independent models. Human behavior is much better captured by the optimal model (Figure 4).

Figure 4: Effect of coupling on behavior. For each of the decisions of subjects and of simulated models run under the same conditions, we compute the optimal belief on coupling according to the model proposed in this paper and bin the proportion of times an agent chooses the (oracle) best arm according to this belief. The plot shows, for humans and for the optimal, coupled, and independent models, the empirical probability that an agent picks the best arm at a given belief on coupling.

This is evidence that human behavior shares the characteristics of the optimal model: it reflects task uncertainty and exploits knowledge of the task structure to maximize gains. The gap in performance between the optimal model and humans may be explained by memory limitations or by subjects entertaining more complicated task structures. Because the subjects were not told the coupling state of the environment, and the arms appear as separate options, we conclude that people are capable of learning and exploiting task structure. Together these results suggest that structure learning may play a significant role in explaining differences between human behavior and previous normative predictions.

7 Conclusions and future directions

We have provided evidence that structure learning may be an important missing piece in evaluating human sequential decision making.
The idea of modeling sequential decision-making under uncertainty as a structure learning problem is a natural extension of previous work on structure learning in Bayesian models of cognition [17, 18] and animal learning [19] to sequential decision-making problems under uncertainty. It also extends previous work on Bayesian approaches to modeling sequential decision-making in the multi-armed bandit [20] by adding structure learning.

It is important to note that we have intentionally focused on reward structure, ignoring issues involving dependencies across trials. Clearly, reward structure learning must be integrated with learning about temporal dependencies [21]. Although we focused on learning coupling between arms, there are other kinds of reward structure learning that may account for a broad variety of human decision-making performance. In particular, allowing dependence between the probability of reward at a site and previous actions can produce large changes in decision-making behavior. For instance, in a "foraging" model where reward is collected from a site and probabilistically replenished, optimal strategies will produce choice sequences that alternate between reward sites. Thus uncertainty about the independence of reward from previous actions can produce a continuum of behavior, from maximization to probability matching. Note that structure learning explanations for probability matching are significantly different from explanations based on reinforcing previously successful actions (the "law of effect") [22]. Instead of explaining behavior in terms of the idiosyncrasies of a learning rule, structure learning constitutes a fully rational response to uncertainty about the causal structure of rewards in the environment. We intend to test the predictive power of a range of structure learning ideas on experimental data we are currently collecting. Our hope is that, by expanding the range of normative hypotheses for human decision-making, we can begin to develop more principled accounts of human sequential decision-making behavior.

Acknowledgements

This work was supported by NIH NPCS 1R90 DK71500-04, a NIPS 2008 Travel Award, CONICYT-FIC-World Bank 05-DOCFIC-BANCO-01, ONR MURI N 00014-07-1-0937, and NIH EY02857.

References

[1] Pascal Poupart, Nikos Vlassis, Jesse Hoey, and Kevin Regan. An analytic solution to discrete Bayesian reinforcement learning. In 23rd International Conference on Machine Learning, Pittsburgh, PA, 2006.
[2] Richard Ernest Bellman. Dynamic Programming. Princeton University Press, Princeton, 1957.
[3] Noah Gans, George Knox, and Rachel Croson. Simple models of discrete choice and their performance in bandit experiments. Manufacturing and Service Operations Management, 9(4):383–408, 2007.
[4] C. M. Anderson. Behavioral Models of Strategies in Multi-Armed Bandit Problems. PhD thesis, Pasadena, CA, 2001.
[5] Jeffrey Banks, David Porter, and Mark Olson. An experimental analysis of the bandit problem. Economic Theory, 10(1):55–77, 1997.
[6] R. J. Meyer and Y. Shi. Sequential choice under ambiguity: Intuitive solutions to the armed-bandit problem. Management Science, 41:817–83, 1995.
[7] N. Vulkan. An economist's perspective on probability matching. Journal of Economic Surveys, 14:101–118, 2000.
[8] Yvonne Brackbill and Anthony Bravos. Supplementary report: The utility of correctly predicting infrequent events. Journal of Experimental Psychology, 64(6):648–649, 1962.
[9] W. Edwards. Probability learning in 1000 trials. Journal of Experimental Psychology, 62:385–394, 1961.
[10] W. Edwards. Reward probability, amount, and information as determiners of sequential two-alternative decisions. Journal of Experimental Psychology, 52(3):177–88, 1956.
[11] E. Fantino and A. Esfandiari. Probability matching: Encouraging optimal responding in humans. Canadian Journal of Experimental Psychology, 56:58–63, 2002.
[12] Timothy E. J. Behrens, Mark W. Woolrich, Mark E. Walton, and Matthew F. S. Rushworth. Learning the value of information in an uncertain world. Nature Neuroscience, 10(9):1214–1221, 2007.
[13] N. D. Daw, J. P. O'Doherty, P. Dayan, B. Seymour, and R. J. Dolan. Cortical substrates for exploratory decisions in humans. Nature, 441(7095):876–879, 2006.
[14] J. S. Banks and R. K. Sundaram. A class of bandit problems yielding myopic optimal strategies. Journal of Applied Probability, 29(3):625–632, 1992.
[15] John Gittins and You-Gan Wang. The learning component of dynamic allocation indices. The Annals of Statistics, 20(2):1626–1636, 1992.
[16] J. C. Gittins and D. M. Jones. A dynamic allocation index for the sequential design of experiments. Progress in Statistics, pages 241–266, 1974.
[17] Joshua B. Tenenbaum, Thomas L. Griffiths, and Charles Kemp. Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences, 10(7):309–318, 2006.
[18] Joshua B. Tenenbaum and Thomas L. Griffiths. Structure learning in human causal induction. Advances in Neural Information Processing Systems 13, pages 59–65, 2000.
[19] A. C. Courville, N. D. Daw, G. J. Gordon, and D. S. Touretzky. Model uncertainty in classical conditioning. Advances in Neural Information Processing Systems, 16:977–986, 2004.
[20] Daniel Acuna and Paul Schrater. Bayesian modeling of human sequential decision-making on the multi-armed bandit problem. In Proceedings of the Annual Conference of the Cognitive Science Society (CogSci), 2008.
[21] Michael D. Lee. A hierarchical Bayesian model of human decision-making on an optimal stopping problem. Cognitive Science: A Multidisciplinary Journal, 30:1–26, 2006.
[22] Ido Erev and Alvin E. Roth. Predicting how people play games: Reinforcement learning in experimental games with unique, mixed strategy equilibria. The American Economic Review, 88(4):848–881, 1998.