
Bayesian Brains without Probabilities

Trends in Cognitive Sciences


Original citation: Sanborn, Adam N. and Chater, Nick (2016) Bayesian brains without probabilities. Trends in Cognitive Sciences. doi: 10.1016/j.tics.2016.10.003

Opinion

Adam N. Sanborn (University of Warwick, Coventry, UK) and Nick Chater (Warwick Business School, Coventry, UK)
*Correspondence: [email protected] (A.N. Sanborn)
© 2016 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY 4.0 license (http://creativecommons.org/licenses/by/4.0/).

Bayesian explanations have swept through cognitive science over the past two decades, from intuitive physics and causal learning to perception, motor control, and language. Yet people flounder with even the simplest probability questions. What explains this apparent paradox? How can a supposedly Bayesian brain reason so poorly with probabilities? In this paper, we propose a direct and perhaps unexpected answer: that Bayesian brains need not represent or calculate probabilities at all and are, indeed, poorly adapted to do so. Instead, the brain is a Bayesian sampler. Only with infinite samples does a Bayesian sampler conform to the laws of probability; with finite samples it systematically generates classic probabilistic reasoning errors, including the unpacking effect, base-rate neglect, and the conjunction fallacy.

Trends

Bayesian models in cognitive science and artificial intelligence operate over domains such as vision, motor control, and language processing by sampling from vastly complex probability distributions. Such models cannot, and typically do not need to, calculate explicit probabilities.

Sampling naturally generates a variety of systematic probabilistic reasoning errors on elementary probability problems, which are observed in experiments with people.

Thus, it is possible to reconcile probabilistic models of cognitive and brain function with the human struggle to master even the most elementary explicit probabilistic reasoning.

Bayesian Brains without Probabilities

In an uncertain world it is not easy to know what to do or what to believe. The Bayesian approach gives a formal framework for finding the best action despite that uncertainty, by assigning each possible state of the world a probability, and using the laws of probability to calculate the best action. Bayesian cognitive science has successfully modelled behavior in complex domains, whether in vision, motor control, language, categorization, or common-sense reasoning, in terms of highly complex probabilistic models [1–13]. Yet in many simple domains, people make systematic probability reasoning errors, which have been argued to undercut the Bayesian approach [14–17]. In this paper, we argue for the opposite view: that the brain implements Bayesian inference and that systematic probability reasoning errors actually follow from a Bayesian approach. We stress that a Bayesian cognitive model does not require that the brain calculates or even represents probabilities. Instead, the key assumption is that the brain is a Bayesian sampler (see Glossary).
While the idea that cognition is implemented by Bayesian samplers is not new [18–27], here we show that any Bayesian sampler is faced with two challenges that automatically generate classic probabilistic reasoning errors, and only converges on 'well-behaved' probabilities at an (unattainable) limit of an infinite number of samples. In short, here we make the argument that Bayesian cognitive models that operate well in complex domains actually predict probabilistic reasoning errors in simple domains (see Table 1, Key Table).

To see why, we begin with the well-known theoretical argument against the naïve conception that a Bayesian brain must represent all possible probabilities and make exact calculations using these probabilities: that such calculations are too complex for any physical system, including brains, to perform in even moderately complex domains [4,28–30]. A crude indication of this is that the number of real numbers required to encode the joint probability distribution over n binary variables grows as 2^n, that is, exponentially in the number of variables, quickly exceeding the capacity of any imaginable physical storage system. Yet Bayesian computational models often must represent vast data spaces, such as the space of possible images or speech waves, and effectively infinite hypothesis spaces, such as the space of possible scenes or sentences. Explicitly keeping track of probabilities in such complex domains as vision and language, where Bayesian models have been most successful, is therefore clearly impossible.

Key Table. Table 1. Properties of a Bayesian Sampler Compared to Ideal Bayesian Reasoning

Property | Ideal Bayesian reasoning | Bayesian sampler
Strictly follows laws of probability | Yes | Only asymptotically; otherwise can show systematic biases
Represents all hypotheses simultaneously in the brain | Yes | No, only represents one or a few at a time
Is easy to generate examples | Yes, though requires a sampling mechanism | Yes, just selects a sample
Can find all likely hypotheses, even if surprising | Yes | No, leading to unpacking effects and conjunction fallacies
Can evaluate relative probabilities of far-apart or incommensurable hypotheses | Yes | No, leading to forms of base-rate neglect and other forms of the conjunction fallacy
Is more effective in 'small' worlds compared to large, context-rich worlds | Yes, better with small, well-defined worlds | No, better when context guides the sampler to relevant hypotheses, rather than aggregating over broad hypothesis spaces
Produces stochastic and autocorrelated behavior | No | Yes

Glossary

Ballistic accumulator model: a model of evidence accumulation that explains both accuracy and response time. Unlike other similar models, it assumes that evidence accumulation is deterministic within each trial.
Base-rate neglect: a reasoning fallacy in which individuals overweight diagnostic information (e.g., the fever I have could be caused by the Ebola virus) and underweight relevant background information (e.g., the Ebola virus is very rare).
Bayesian sampler: an approximation to a Bayesian model that uses a sampling algorithm such as MCMC to avoid intractable integrals. While the model is used to perform Bayesian inference, the sampling algorithm itself is simply a mechanism for producing samples.
Boltzmann machine: an artificial neural network with binary nodes that change state according to the 'energy' of a pattern of states, or how well that pattern fits the relationships determined by the edges between nodes.
Deep belief network: a hierarchical artificial neural network of binary variables. Each layer of the network can be composed of simpler networks such as Boltzmann machines.
Markov chain Monte Carlo (MCMC): a family of algorithms for drawing samples from probability distributions. These algorithms transition from state to state with probabilities that depend only on the current state. The transition probabilities are carefully chosen so that the states are (dependent) samples of a target probability distribution.
Metropolis–Hastings: an MCMC algorithm that proposes new states based on the current state, and transitions to new states based on the relative probability of the current state and the proposed state.
Multistable stimuli: stimuli with more than one perceptual interpretation. One's perception of the stimulus tends to switch back and forth between interpretations over time.
Normalization constant: while probabilities of all of the hypotheses must sum to 1, often it is much easier to represent values that are proportional to the probabilities instead. The normalizing constant is the sum of these proportional values – it is the number by which each of these proportional values must be divided to become probabilities.
Particle filters: an algorithm designed to sample from probability distributions for data that arrives sequentially.
Posterior probability: the probability of a hypothesis in response to a question.
Sampling: generating hypotheses with frequency proportional to their posterior probability. Probability estimates can then be based on the relative frequencies of these sampled hypotheses.
Satisficing: searching until a good-enough solution is found, rather than searching until the best possible solution is found.
Small worlds: restricted worlds with few variables and well-defined probabilities over those variables.
Wisdom of crowds: the empirical result that the aggregation of individual estimates is better than the average individual estimate, or sometimes any individual estimate.

Computational limits apply even in apparently simple cases. In a classic study, Tversky and Kahneman [31] asked people about the relative frequency of randomly choosing words that fit the pattern _ _ _ _ _ n _ from a novel.
An explicit Bayesian calculation would require three steps:

(i) Posterior probabilities: calculating the posterior probability that each of the tens of thousands of words in our vocabulary (and ideally each of the over 600,000 words in the Oxford English Dictionary) is found in a novel;
(ii) Conditionalization: filtering those which fit the pattern _ _ _ _ _ n _;
(iii) Marginalization: adding up the posterior probabilities of the words that passed the filter to arrive at the overall probability that words randomly chosen from a novel would fit the pattern _ _ _ _ _ n _.

Explicitly performing these three steps is challenging, in terms of both memory and calculation; in more complex domains, each step is computationally infeasible. And the brain clearly does not do this: indeed, even the fact that a common word like 'nothing' fits the pattern is obvious only in retrospect [30].

How, then, can a Bayesian model of reading, speech recognition, or vision possibly work if it does not explicitly represent probabilities? A key insight is that, although explicitly representing and working with a probability distribution is hard, drawing samples from that distribution is relatively easy. Sampling does not require knowledge of the whole distribution. It can work merely with a local sense of relative posterior probabilities. Intuitively, we have this local sense: once we 'see' a solution (e.g., 'nothing'), it is often easy to see that it is better than another solution ('nothing' has higher posterior probability than 'capping'), even if we cannot say exactly what either posterior probability is. And now that we have thought of two words ending in '-ing', we can rapidly generate more (sitting, singing, etc.). By continually sampling, we slowly build up a picture of all of the possibilities. Using a number of samples much smaller than the number of hypotheses makes the computations feasible.
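To make the three explicit steps concrete, here is a minimal sketch (not from the original paper) that runs the calculation over a tiny, invented vocabulary with made-up frequency counts; only the logic of the three steps matters. At realistic scales, with tens of thousands of words or the effectively infinite hypothesis spaces of vision and language, it is exactly this exhaustive enumeration that becomes infeasible.

```python
# Hedged illustration: an invented toy vocabulary with invented counts stands
# in for "words in a novel"; the three steps mirror (i)-(iii) in the text.
toy_counts = {
    "nothing": 120, "evening": 45, "morning": 40, "capping": 1,
    "husband": 30, "company": 25, "thought": 60, "further": 20,
}

def fits_pattern(word):
    # The pattern _ _ _ _ _ n _ : seven letters with 'n' in the sixth position.
    return len(word) == 7 and word[5] == "n"

# (i) Posterior probabilities: a probability for every word in the vocabulary.
total = sum(toy_counts.values())
posterior = {w: c / total for w, c in toy_counts.items()}

# (ii) Conditionalization: keep only the words that fit the pattern.
matching = {w: p for w, p in posterior.items() if fits_pattern(w)}

# (iii) Marginalization: sum the probabilities of the words that passed the filter.
print(sum(matching.values()))
```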
This sampling approach to probabilistic inference began in the 1940s and 1950s [32,33], and is ubiquitously and successfully applied in complex cognitive domains, whether in cognitive science or artificial intelligence [5,34]. If we sample forever, we can make any probabilistic estimate we need to, without ever calculating explicit probabilities. But, as we shall see, restricted sampling will instead lead to systematic probabilistic errors – including some classic probabilistic reasoning fallacies.

So how does this work in practice? Think of a posterior probability distribution as a hilly, high-dimensional landscape over possible hypotheses (see Figure 1A). Knowing the shape of this landscape, and even its highest peaks, is impossibly difficult. But a sampling algorithm simply explores the landscape, step by step. Perhaps the best known class of sampling algorithms is Markov chain Monte Carlo (MCMC [35,36]). A common type of MCMC, the Metropolis–Hastings algorithm [32,37], can be thought of as an android trying to climb the peaks of the probability distribution, but in dense fog, and with no memory of where it has been. It climbs by sticking one foot out in a random direction and 'noisily' judging whether it is moving uphill (i.e., it noisily knows a 'better' hypothesis when it finds one). If so, it shifts to the new location; otherwise, it stays put. The android repeats the process, slowly climbing through the probability landscape, using only relative instead of absolute probability to guide its actions. Despite the simplicity of the android's method, a histogram of the android's locations will resemble the hill (arbitrarily closely, in the limit), meaning the positions are samples from the probability distribution (see last column of Figure 1A). These samples can then easily be used in a variety of calculations. For example, to estimate the chance that a word in a novel will follow the form _ _ _ _ _ n _, just sample some text, and use the proportion of the samples that follow that pattern as the estimate. This is, indeed, very close to Tversky and Kahneman's availability heuristic [38] – to work out how likely something is by generating possible examples; but now construed as Bayesian sampling rather than a mere 'rule of thumb.'
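As a concrete, hedged illustration of the android metaphor, the sketch below runs a bare-bones Metropolis–Hastings walk over a one-dimensional 'landscape'. The target density, step size, and number of steps are arbitrary choices for the example, not anything specified in the paper. Only relative (unnormalized) probabilities are ever consulted, yet the visited locations approximate the distribution, and an event's probability can be estimated as the proportion of samples in which it occurs.

```python
import math, random

def unnorm_prob(x):
    # Unnormalized target density: the android only ever needs ratios of these
    # values, never the normalization constant. (Toy choice: a Gaussian-shaped
    # "hill" centred at 2 with unit width.)
    return math.exp(-0.5 * (x - 2.0) ** 2)

def metropolis_hastings(n_steps, start=0.0, step_size=1.0, seed=0):
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)       # stick a foot out at random
        accept_prob = min(1.0, unnorm_prob(proposal) / unnorm_prob(x))
        if rng.random() < accept_prob:                 # noisy "uphill" judgement
            x = proposal                               # shift to the new location
        samples.append(x)                              # otherwise stay put
    return samples

samples = metropolis_hastings(20_000)
# Estimate P(x > 3) as the proportion of samples in that region, mirroring
# "use the proportion of samples that follow the pattern" (true value ~0.16 here).
print(sum(s > 3.0 for s in samples) / len(samples))
```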
There is powerful prima facie evidence that the brain can readily draw samples from very complex distributions. We can imagine visual or auditory events, and we can generate language (and, most tellingly, mimic the language, gestures, or movements of other people), which are forms of sampling [39]. Similarly, when given a part of a sentence, or fragments of a picture, people generate one (and sometimes many) possible completions [40,41]. This 'generative' aspect of perception and cognition (e.g., [42]) follows automatically from sampling models [6,10].

Unavoidable Limitations of Sampling

Any distribution that can be sampled can, in the limit, be approximated – but finite samples will be imperfect and hence potentially misleading. The android's progress around the probability landscape is, first and most obviously, biased by its starting point. Suppose, specifically, that the landscape contains an isolated peak. Starting far from the peak in an undulating landscape, the android has little chance of finding the peak. Even complex sampling algorithms, which can be likened to more advanced androids that sense the gradient under their feet or are adapted to particular landscapes, will blindly search in the foothills, missing the mountain next door (see Figure 1A and the bottom row of Figure 1B). While we have illustrated this problem with a simple sampling algorithm, it applies to any sampling algorithm without prior knowledge of the location of the peaks [28,30]; indeed, even the inventors of sampling algorithms are not immune to these problems [36,43].

Figure 1. Sampling Algorithms Have Difficulties with Isolated Modes and Produce Autocorrelations. (A) Illustration of the android metaphor, with the android climbing the landscape of the (log) posterior probability distribution. The android uses the difference in height of its two feet to decide where to step, and its location is tracked over time (red x). A histogram of its locations after many steps matches the mode of the probability distribution it explored. (B) Comparison of sampling methods on 2D distributions. Each row is a different example probability distribution: a unimodal distribution with uncorrelated x and y variables, a unimodal distribution with correlated x and y variables, a bimodal distribution with relatively nearby modes, and a bimodal distribution where the modes are further apart. The first column shows a topographic map of the posterior density, with darker regions indicating higher probability. The second and third columns illustrate samples drawn using the Metropolis–Hastings algorithm and the JAGS program, respectively. Within each column are trace plots that show how the location of the sampler changes along each variable during each iteration of the sampling process. Autocorrelations are present when a sample depends on the value of the previous sample in the trace plots (e.g., Metropolis–Hastings in the second row). Also shown are bivariate scatterplots that can be used to compare the samples obtained against the true distributions in the first column. These show that not all of the modes are always sampled, even when thousands of samples are drawn (i.e., in the bottom row). R code for these plots is included as supplemental material.
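The bottom-row behaviour in Figure 1B can be reproduced with the same bare-bones Metropolis–Hastings walk sketched earlier; the two-mode target below is an arbitrary illustrative choice, not the distribution used in the paper's figure. Started near one mode of a well-separated pair, the walk essentially never discovers the other mode within a realistic number of steps, so any estimate that depends on the missed mode is badly wrong.

```python
import math, random

def unnorm_prob(x):
    # Two well-separated "peaks" of equal mass (toy choice): one near 0, one near 20.
    return math.exp(-0.5 * x ** 2) + math.exp(-0.5 * (x - 20.0) ** 2)

def metropolis_hastings(n_steps, start, step_size=1.0, seed=1):
    rng = random.Random(seed)
    x, samples = start, []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)
        if rng.random() < min(1.0, unnorm_prob(proposal) / unnorm_prob(x)):
            x = proposal
        samples.append(x)
    return samples

samples = metropolis_hastings(20_000, start=0.0)
# Half of the true probability mass lies beyond x = 10, but a sampler started at
# the left-hand mode almost never crosses the gap, so the estimate is close to 0.
print(sum(s > 10.0 for s in samples) / len(samples))
```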
A Bayesian sampler that misses peaks of probability will produce reasoning fallacies. Various fallacies, such as the unpacking effect and the conjunction fallacy, involve participants believing that a more detailed description of an event is more likely than a less detailed description of that same event, violating probability theory. In the unpacking effect, participants judge, say, 'being a lawyer' to be less probable than the 'unpacked' description 'being a tax, corporate, patent, or other type of lawyer' [44–46]. From a sampling point of view, bringing to mind types of lawyer that are numerous but of low salience ensures these 'peaks' are not missed by the sampling process, yielding a higher probability rating. Yet unpacking can also guide people away from the high-probability hypotheses: if the unpacked hypotheses are low probability instead of high, for example trying to assess whether a person with a background in social activism becomes either 'a lawyer' or 'a tax, corporate, patent, or other type of lawyer', then the probability of the unpacked event is judged less than that of the packed event [46] – the sampler is guided away from the peaks of probability (e.g., 'human rights lawyer').

The conjunction fallacy is a complex effect [47,48], but one source of the fallacy is the inability to bring to mind relevant information. We struggle to estimate how likely a random word in a novel is to match the less detailed pattern _ _ _ _ _ n _: our sampling algorithm searches around a large space and may miss peaks of high probability. However, when guided to where the peaks are (i.e., starting the android from a different location), for example by being asked about the pattern _ _ _ _ ing in the more detailed description, then large peaks are found and probability estimates are higher [31,45]. The process involved is visually illustrated in Figure 2A. The conjunction of the correct locations of all the puzzle pieces cannot, of course, be more probable than the correct location of a single piece. Yet when considered in isolation, the evidence that an isolated piece is correct is weak (from a sampling standpoint, it is not clear whether, e.g., swapping pieces leads to a higher or lower probability). But in the fully assembled puzzle (i.e., when the 'peak' in probability space is presented), local comparisons are easy – switching any of the pieces would make the fit worse – so you can be nearly certain that all the pieces are in the correct position. So the whole puzzle will be judged more probable than a single piece, exhibiting the conjunction fallacy.

A second bias in sampling is subtler. Sometimes regions of high probability are so far apart that a sampler starting in one region is extremely unlikely to transition to the other. As shown in Figure 2B, our android could be set down in Britain or in Colorado, and in each case would gradually build up a picture of the local topography.
But these local topographies would give no clue that the baseline height of Colorado is much higher than that of Britain. The android is incredibly unlikely to wander from one region to the other, so the relative heights of the two regions would remain unknown. This problem is not restricted to sampling algorithms or to modes that are far apart in a single probability distribution. Indeed, the problem is even starker when a Bayesian sampler compares probabilities in entirely different models – then it is often extremely difficult for the android to shuttle between landscapes [49]. So, for example, although it is obvious that we are more likely to obtain at least one, rather than at least five, double sixes from 24 dice throws, it is by no means obvious whether this first event is more or less likely than obtaining heads in a coin flip. Indeed, this problem sparked key developments in early probability theory by Pascal and Fermat [50].

One might think there is an easy solution to comparing across domains. Probabilities must sum to 1, after all. So if sampling explores the entire landscape, then we can figure out the absolute size of probabilities (i.e., whether we are dealing with Britain or Colorado) because the volume under the landscape must be precisely 1. But exploring the entire landscape is impossible, because the space is huge and may contain unknown isolated peaks of any possible size, as we have seen (technically, the normalization constant often cannot be determined, even approximately). An example of this is given in Figure 2B, where within-domain comparisons are relatively easy. So it may be easy to estimate which is more likely in a randomly chosen year: a total or a partial eclipse; or which of two quotations is more likely to appear on a randomly chosen website. However, between-domain comparisons, such as deciding whether a total eclipse is more likely than 'Things fall apart; the centre cannot hold' appearing, are more difficult: the astronomical event and the quotation must be compared against different sets of experiences (e.g., years and websites, respectively).

Figure 2. Illustrations of the Conjunction Fallacy and Base-Rate Neglect. (A) Constituent question: what is the probability that the piece outlined in red is in the correct position in the frame? Conjunction question: what is the probability that all of the pieces are in the correct positions in the frame? The panel illustrates why the conjunction fallacy arises from a Bayesian sampler: the top row gives a question about a single piece of the puzzle, and the bottom row illustrates that evaluating the probability of the conjunction will be easier. (B) Four events (two quotations, 'The falcon cannot hear the falconer' and 'Things fall apart; the centre cannot hold', and two astronomical events) illustrate why local assessments of relative probability are easier. Comparing the probability of seeing the two astronomical events in a year, or the probability of the two quotations appearing on a random website, is relatively easy. Comparing the probability of seeing one of the astronomical events in a year to the probability of seeing one of the quotations on a random website is more difficult. In particular, when comparing 'Things fall apart; the centre cannot hold' to the eclipse, the quote may seem more likely as it is common among quotes, yet this neglects the base rates: most websites do not have literary quotations, and there are many chances for an eclipse each year.
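For reference, the dice-versus-coin comparison mentioned above can be settled by a short explicit calculation (a sketch added here, not part of the original paper; the numeric values are computed, not quoted). It is exactly this kind of cross-domain bookkeeping that a local sampler cannot perform.

```python
from math import comb

p = 1 / 36   # probability of a double six on one throw of two dice
n = 24       # number of throws

def prob_at_least(k, n, p):
    # Binomial tail: P(at least k successes in n independent trials).
    return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

print(prob_at_least(1, n, p))   # ~0.491: just below the 0.5 of a fair coin flip
print(prob_at_least(5, n, p))   # ~0.0005: "at least five" double sixes is far rarer
```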
Being unable to effectively compare the probabilities of two hypotheses is a second way in which a Bayesian sampler generates reasoning fallacies observed in people. The example in Figure 2B also illustrates a version of base-rate neglect. 'Things fall apart; the centre cannot hold' is a notable quotation, and eclipses are rare, so from local comparisons it may seem that 'Things fall apart; the centre cannot hold' is more likely. However, the base rates, which are often neglected, reverse this: the vast majority of websites have no literary quotations, and each year provides many opportunities for an eclipse. Fully taking account of base rates would require searching the entire probability space, which is typically impossible. This inability to compare probabilities from different domains plays a role in other reasoning fallacies: the conjunction fallacy also occurs when the two conjoined events are completely unrelated to one another [51]. While a sampler can provide a rough sense of the probabilities of each hypothesis separately (is this quotation or astronomical event common compared with similar quotes or astronomical events?), the inability to bridge between the two phenomena means that sampling cannot accurately assess base rates or estimate the probabilities of conjunctions or disjunctions. In these cases, participants are left to combine the non-normalized sampled probability estimates for each hypothesis 'manually', and perhaps choose just one of the probabilities or perhaps combine them inappropriately [52–54]. Interestingly, accuracy is improved in base-rate neglect studies when a causal link is provided between the individual events [55,56], and experiencing outcomes together weakens conjunction fallacies [57]; we can interpret these studies as providing a link that allows a Bayesian sampler to traverse between the two regions to combine probabilities meaningfully.

Sampling and Task Richness

Reasoning fallacies such as the unpacking effect, conjunction fallacy, and base-rate neglect have greatly influenced our general understanding of human cognition. They arise in very simple situations and violate the most elementary laws of probability. We agree that these fallacies make a convincing argument against the view that the brain represents and calculates probabilities directly. This argument appears strengthened in complex real-world tasks: if our brains do not respect the laws of probability for simple tasks, surely the Bayesian approach to the mind must fail in rich domains such as vision, language, and motor control, with their huge data and hypothesis spaces [29]. Indeed, even Savage, the great Bayesian pioneer, suggested restricting Bayesian methods to 'small' worlds [58,59].

Viewing brains as sampling from complex probability distributions upends this argument. Rich, realistic tasks, in which there is a lot of contextual information available to guide sampling, are just those where the Bayesian sampler is most effective. Rich tasks focus the sampler on the areas of the probability landscape that matter – those that arise through experience. By limiting the region in which the sampler must search, rich problems can often be far easier for the sampler than apparently simpler, but more abstract, problems.
Consider how hard it would be to solve a jigsaw which is uniformly white; the greater the richness of the picture on the jigsaw, the more the sampler can locally be guided by our knowledge of real-world scenes. Moreover, the problem of learning the structure of the world, or interpreting an image or a sentence, involves finding 'good-enough' hypotheses to usefully guide our actions, which can be achieved by local sampling in the probability landscape. Such hypotheses are no less valuable if an isolated peak, corresponding to an even better hypothesis, remains undiscovered. We suggest, too, that for many real-world problems, multiple but distant peaks, corresponding to very different hypotheses about the world, may be rare, particularly when context and background knowledge are taken into account. Language is locally ambiguous [60], but it is very unlikely that the acoustic signal of a whole sentence in English happens to have an equally good interpretation in Latin; vision, too, is locally ambiguous (e.g., [61]), but the probability that a portrait photograph could equally be reinterpreted as a rural scene is infinitesimal. In complex real-world problems, then, climbing a rugged probability landscape to find 'good-enough' hypotheses is crucial; linking to numerical probabilities, even approximately, is not. Thus, the view of cognition as satisficing [62] need not be viewed as opposed to the Bayesian approach [14,29]. Rather, Bayesian sampling provides a mechanism for satisficing in real-world environments.

For these reasons, Bayesian models of cognition have primarily flourished as explanations of the complex dependency of behavior on the environment in domains including intuitive physics, causal learning, perception, motor control, and language [1–10]; and these computational models generally do not involve explicit probability calculations, but apply Bayesian sampling, using methods such as MCMC.

We have argued that rich real-world tasks may be more tractable to Bayesian sampling than abstract lab tasks, because sampling is more constrained. But is it possible that the fundamental difference is not task richness, but cognitive architecture, for example, between well-optimized Bayesian perceptual processes and heuristic central processes [63,64]? We do not believe this to be the case, for two reasons. First, the difference between these kinds of tasks tends to disappear when performance is measured on the same scale [65,66]. Second, there are counterexamples to the architectural distinction. Language interpretation is a high-level cognitive task which shows a good correspondence to Bayesian models [5,6]. And, conversely, purely perceptual versions of reasoning fallacies can be constructed, as Figure 2A illustrates. More broadly, any perceptual task where a hint can disambiguate the stimulus (e.g., sine-wave speech [67]) will generate examples of the conjunction fallacy.

If the brain is indeed a Bayesian sampler, then sampling should leave traces in behavior. One immediate consequence is that behavior should be stochastic, because of the variability of the samples drawn. Hence behavior will look noisy, even where on average the response will be correct (e.g., generating the wisdom of crowds even from a single individual [68–70]).
The ubiquitous variability of human behavior, especially in highly abstract judgment or choice tasks [71–74], is puzzling for pure 'optimality' explanations. Bayesian samplers can match behavioral stochasticity in tasks such as perception, categorization, causal reasoning, and decision making [20,21,23,24].

A second consequence of sampling is that behavior will be autocorrelated, meaning that each new sample depends on the last, because the sampler makes only small steps in the probability landscape (see Box 1). Autocorrelation appears ubiquitous within and between trials throughout memory, decision making, perception, language, and motor control (e.g., [75,76]), and Bayesian samplers produce human-like autocorrelations in perceptual and reasoning tasks [18,25].

Box 1. Sampling and Autocorrelation

When sampling from complex distributions, it is often impossible to draw samples directly, and sampling algorithms such as MCMC are used instead. Because these algorithms look locally when producing the next sample, autocorrelation can result. Figure 1B in the main text compares samples drawn independently to samples drawn from two versions of MCMC: the Metropolis–Hastings algorithm and Gibbs sampling as implemented in JAGS [36,93]. Samples produced by both MCMC methods have low autocorrelations for independent unimodal distributions (top row of Figure 1B), but for highly correlated variables and bimodal distributions autocorrelations are more prevalent, even when the aggregated samples match the overall distribution (middle rows of Figure 1B). More recently developed sampling methods, such as those employed by the program Stan, can also show sample autocorrelations, but more complex distributions are needed to induce them [94]. And of course, when two modes are far apart, these samplers are very unlikely to sample both modes (bottom row of Figure 1B).

Although they are outnumbered by models of behavior that assume independent sampling, several models of memory and decision making do assume there are autocorrelations in sampling. Models of free recall such as Search of Associative Memory [95] and SIMPLE [96] use the previously recalled word to cue the next word, and this mechanism has been used productively to account for the dependencies seen when people attempt to recall a list of words. In decision making, the ballistic accumulator model could be considered an extreme version of autocorrelation, in which each internal time step produces the same strength of sampled evidence within a trial [97]. Explicit models of autocorrelated sampling have been used to account for perceptual switching times of multistable stimuli [18] and anchoring effects in reasoning tasks [25].

One particular consequence of autocorrelation will be that 'priming' a particular starting point in the sampling process will bias the sampler. Thus, asking about, for example, the gestation time of an elephant will bias a person's estimate because they begin with the common reference point of 9 months: the starting point is an 'anchor' and sampling 'adjusts' from that anchor – but insufficiently when small samples are drawn. This provides a Bayesian reinterpretation of 'anchoring and adjustment' [25], a process widely observed in human judgment [15,77]. The extent of autocorrelation depends on both the algorithm and the probability distribution (see Figure 1B).
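A minimal sketch of how autocorrelated sampling can produce anchoring-like bias, using the same toy Metropolis–Hastings walk as in the earlier examples (the target, anchors, and sample sizes are invented for illustration and are not the models in [25]): with only a few autocorrelated samples, the reported estimate stays near the starting point, whereas with many samples the starting point washes out.

```python
import math, random

def unnorm_prob(x):
    # Toy "posterior" over some estimated quantity, centred on 50 (arbitrary).
    return math.exp(-0.5 * ((x - 50.0) / 10.0) ** 2)

def mh_estimate(n_samples, start, step_size=2.0, seed=0):
    rng = random.Random(seed)
    x, total = start, 0.0
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step_size)
        if rng.random() < min(1.0, unnorm_prob(proposal) / unnorm_prob(x)):
            x = proposal
        total += x
    return total / n_samples   # report the mean of the (autocorrelated) samples

# Anchoring the walk at 9 versus 90: with 20 samples the two estimates are pulled
# toward their respective anchors; with 20,000 they give nearly the same answer.
for n in (20, 20_000):
    print(n, mh_estimate(n, start=9.0), mh_estimate(n, start=90.0))
```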
Sampling algorithms often have tuning parameters which are chosen to minimize autocorrelation and to better explore multiple peaks. Of course, the best settings of these tuning parameters are not known in advance, but they can be learned by drawing 'warm-up' samples to get a sense of what the distribution is like. Interestingly, there is behavioral evidence for this: participants given 'clumpier' reward distributions in a 2D computer foraging task later behaved as if 'tuned' to clumpier distributions in semantic space in a word-generation task. This suggests that there is a general sampling process that people adapt to the properties of the probability distributions that they face [78–80].

Concluding Remarks and Future Perspectives

Sampling provides a natural and scalable implementation of Bayesian models. Moreover, the questions that are difficult to answer with sampling correspond to those with which people struggle, and those easier to answer with sampling are contextually rich, and highly complex, real-world questions on which people show surprising competence. Sampling generates reasoning fallacies, and leaves traces, such as stochasticity and autocorrelations, which are observed in human behavior.

The sampling viewpoint also fits with current thinking in computational neuroscience, where an influential proposal is that transitions between brain states correspond to sampling from a probability distribution [26,27,81,82], or that multiple samples are represented simultaneously (e.g., using particle filters [83–86]). Moreover, neural networks dating back to the Boltzmann machine [87] take a sampling approach to probabilistic inference. For example, consider deep belief networks, which have been highly successful in vision, acoustic analysis, language, and other areas of machine learning [88]. These networks consist of layers of binary variables connected by weights. Once the network is trained, it defines a probability distribution over the output features. However, the probabilities of configurations of output features are not known. Instead, samples from the total probability distribution are generated by randomly initializing the output variables, and conditional samples are generated by fixing some of the output variables to particular values and sampling the remaining variables. Applications of deep belief networks include generating facial expressions and recognizing objects [34,89]. As with human performance, these networks readily sample highly complex probability distributions, but can only calculate explicit probabilities with difficulty.

Of course, it is possible that the brain represents probability distributions over small collections of variables [90,91], or variational approximations to probability distributions [92], but this would not affect our key argument, which stems from the unavoidable difficulty of finding isolated peaks in, and calculating the volume of, complex probability distributions.
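As a hedged illustration of the Boltzmann-machine-style sampling discussed above, the sketch below runs Gibbs sampling on a single, tiny, hand-specified network (the weights, biases, and network size are invented for the example; this is not a trained deep belief net). Unconditional samples are drawn by repeatedly resampling each binary unit given the others, and conditional samples by clamping some units to fixed values, mirroring how such networks generate and complete patterns without ever computing explicit configuration probabilities.

```python
import math, random

# A tiny, hand-specified Boltzmann machine: 3 binary units, symmetric weights
# (zero diagonal) and biases. These numbers are illustrative, not trained.
W = [[0.0, 1.5, -1.0],
     [1.5, 0.0, 0.5],
     [-1.0, 0.5, 0.0]]
b = [0.0, -0.5, 0.5]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gibbs_sample(n_sweeps, clamped=None, seed=0):
    # clamped: dict {unit index: fixed 0/1 value} for conditional sampling.
    rng = random.Random(seed)
    clamped = clamped or {}
    s = [clamped.get(i, rng.randint(0, 1)) for i in range(len(b))]
    for _ in range(n_sweeps):
        for i in range(len(b)):
            if i in clamped:
                continue
            # Resample unit i given the current states of the other units.
            z = b[i] + sum(W[i][j] * s[j] for j in range(len(b)) if j != i)
            s[i] = 1 if rng.random() < sigmoid(z) else 0
    return s

print(gibbs_sample(100))                   # a sample from the full distribution
print(gibbs_sample(100, clamped={0: 1}))   # a conditional sample with unit 0 fixed on
```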
Outstanding Questions

Is sampling sequential or parallel? If the brain samples distributed patterns across a network of neurons (in line with connectionist models), then sampling should be sequential. This implies severe attentional limitations: for example, that we can sample from, recognize, or search for, one pattern at a time.

How is sampling neurally implemented? Connectionist models suggest that sampling may be implemented naturally across a distributed network, such as the brain.

How are 'autocorrelated' samples integrated by the brain? Accumulator models in perception, categorization, and memory seem best justified if sampling is independent. How should such models and their predictions be recast in the light of autocorrelated samples?

Does autocorrelation of samples reduce over trials as the brain becomes tuned to particular tasks?

Are samples drawn from the correct complex probability distribution, or is the distribution simplified first? Variational approximations can be used to simplify complex probability distributions at the cost of a loss of accuracy.

How does sampling deal with counterfactuals and the arrow of causality? Does sampling across causal networks allow us to 'imagine' what might have happened if Hitler had been assassinated in 1934? How can we sample over entirely fictional worlds (e.g., to consider possible endings to a story)?

How far can we simulate sampling in 'other minds' to infer what other people are thinking?

How are past interpretations suppressed to generate new interpretations for ambiguous stimuli?

How far does sampling explain stochastic behavior in humans and nonhuman animals?

While Bayesian sampling has great promise in answering the big questions of how Bayesian cognition could work, there are many open issues. As detailed in the Outstanding Questions, interesting avenues for future work are computational models of reasoning fallacies, explanations of how complex causal structures are represented, and further exploration of the nature of the sampling algorithms that may be implemented in the mind and the brain.

Acknowledgments

ANS was supported by the ESRC (grant number ES/K004948/1). NC was supported by ERC grant 295917-RATIONALITY, the ESRC Network for Integrated Behavioural Science (grant number ES/K002201/1), the Leverhulme Trust (grant number RP2012-V-022), and RCUK grant EP/K039830/1.

Supplemental Information

Supplemental information associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.tics.2016.10.003.

References

1. Battaglia, P.W. et al. (2013) Simulation as an engine of physical scene understanding. Proc. Natl. Acad. Sci. U.S.A. 110, 18327–18332
2. Sanborn, A.N. et al. (2013) Reconciling intuitive physics and Newtonian mechanics for colliding objects. Psychol. Rev. 120, 411
3. Pantelis, P.C. et al. (2014) Inferring the intentional states of autonomous virtual agents. Cognition 130, 360–379
4. Anderson, J.R. (1991) The adaptive nature of human categorization. Psychol. Rev. 98, 409
5. Griffiths, T.L. et al. (2007) Topics in semantic representation. Psychol. Rev. 114, 211
6. Chater, N. and Manning, C.D. (2006) Probabilistic models of language processing and acquisition. Trends Cogn. Sci. 10, 335–344
7. Goodman, N.D. et al. (2008) A rational analysis of rule-based concept learning. Cogn. Sci. 32, 108–154
8. Kemp, C. and Tenenbaum, J.B. (2009) Structured statistical models of inductive reasoning. Psychol. Rev. 116, 20–58
9. Wolpert, D.M. (2007) Probabilistic models in human sensorimotor control. Hum. Mov. Sci. 26, 511–524
10. Yuille, A. and Kersten, D. (2006) Vision as Bayesian inference: analysis by synthesis? Trends Cogn. Sci. 10, 301–308
11. Houlsby, N.M.T. et al. (2013) Cognitive tomography reveals complex, task-independent mental representations. Curr. Biol. 23, 2169–2175
12. Griffiths, T.L. and Tenenbaum, J.B. (2011) Predicting the future as Bayesian inference: people combine prior knowledge with observations when estimating duration and extent. J. Exp. Psychol. Gen. 140, 725
13. Petzschner, F.H. et al. (2015) A Bayesian perspective on magnitude estimation. Trends Cogn. Sci. 19, 285–293
14. Brighton, H. and Gigerenzer, G. (2012) Are rational actor models 'rational' outside small worlds. In Evolution and Rationality: Decisions, Co-operation, and Strategic Behavior (Okasha, S. and Binmore, K., eds), pp. 84–109, Cambridge University Press
15. Tversky, A. and Kahneman, D. (1974) Judgment under uncertainty: heuristics and biases. Science 185, 1124–1131
16. Elqayam, S. and Evans, J.S.B. (2011) Subtracting 'ought' from 'is': descriptivism versus normativism in the study of human thinking. Behav. Brain Sci. 34, 233–248
17. Gigerenzer, G. and Gaissmaier, W. (2011) Heuristic decision making. Annu. Rev. Psychol. 62, 451–482
18. Gershman, S.J. et al. (2012) Multistability and perceptual inference. Neural Comput. 24, 1–24
19. Griffiths, T.L. et al. (2012) Bridging levels of analysis for probabilistic models of cognition. Curr. Directions Psychol. Sci. 21, 263–268
20. Vul, E. et al. (2014) One and done? Optimal decisions from very few samples. Cogn. Sci. 38, 599–637
21. Sanborn, A.N. et al. (2010) Rational approximations to rational models: alternative algorithms for category learning. Psychol. Rev. 117, 1144
22. Shi, L. et al. (2010) Exemplar models as a mechanism for performing Bayesian inference. Psychon. Bull. Rev. 17, 443–464
23. Wozny, D.R. et al. (2010) Probability matching as a computational strategy used in perception. PLoS Comput. Biol. 6, e1000871
24. Denison, S. et al. (2013) Rational variability in children's causal inferences: the sampling hypothesis. Cognition 126, 285–300
25. Lieder, F. et al. (2012) Burn-in, bias, and the rationality of anchoring. Adv. Neural Inf. Process Syst. 25, 2690–2798
26. Fiser, J. et al. (2010) Statistically optimal perception and learning: from behavior to neural representations. Trends Cogn. Sci. 14, 119–130
27. Moreno-Bote, R. et al. (2011) Bayesian sampling in visual perception. Proc. Natl. Acad. Sci. U.S.A. 108, 12491–12496
28. Kwisthout, J. et al. (2011) Bayesian intractability is not an ailment that approximation can cure. Cogn. Sci. 35, 779–784
29. Gigerenzer, G. and Goldstein, D.G. (1996) Reasoning the fast and frugal way: models of bounded rationality. Psychol. Rev. 103, 650
30. Aragones, E. et al. (2005) Fact-free learning. Am. Econ. Rev. 95, 1355–1368
31. Tversky, A. and Kahneman, D. (1983) Extensional versus intuitive reasoning: the conjunction fallacy in probability judgment. Psychol. Rev. 90, 293
32. Metropolis, N. et al. (1953) Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087–1092
33. Metropolis, N. and Ulam, S. (1949) The Monte Carlo method. J. Am. Stat. Assoc. 44, 335–341
34. Susskind, J.M. et al. (2008) Generating facial expressions with deep belief nets. In Affective Computing, Emotion Modelling, Synthesis and Recognition (Kordic, V., ed.), pp. 421–440, ARS Publishers
35. Neal, R.M. (1993) Probabilistic inference using Markov chain Monte Carlo methods.
36. Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741
37. Hastings, W.K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97–109
38. Tversky, A. and Kahneman, D. (1973) Availability: a heuristic for judging frequency and probability. Cogn. Psychol. 5, 207–232
39. Griffiths, T.L. and Kalish, M.L. (2007) Language evolution by iterated learning with Bayesian agents. Cogn. Sci. 31, 441–480
40. Bloom, P.A. and Fischler, I. (1980) Completion norms for 329 sentence contexts. Mem. Cogn. 8, 631–642
41. Gold, J.M. et al. (2000) Deriving behavioural receptive fields for visually completed contours. Curr. Biol. 10, 663–666
42. Shepard, R.N. (1984) Ecological constraints on internal representation: resonant kinematics of perceiving, imagining, thinking, and dreaming. Psychol. Rev. 91, 417
43. Morris, R. et al. (1996) The Ising/Potts model is not well suited to segmentation tasks. In Digital Signal Processing Workshop Proceedings, 1996, pp. 263–266, IEEE
44. Fischhoff, B. et al. (1978) Fault trees: sensitivity of estimated failure probabilities to problem representation. J. Exp. Psychol. Hum. Percept. Perform. 4, 330
45. Tversky, A. and Koehler, D.J. (1994) Support theory: a nonextensional representation of subjective probability. Psychol. Rev. 101, 547
46. Sloman, S. et al. (2004) Typical versus atypical unpacking and superadditive probability judgment. J. Exp. Psychol. Learn. Mem. Cogn. 30, 573–582
47. Hertwig, R. and Gigerenzer, G. (1999) The 'conjunction fallacy' revisited: how intelligent inferences look like reasoning errors. J. Behav. Decis. Mak. 12, 275
48. Wedell, D.H. and Moro, R. (2008) Testing boundary conditions for the conjunction fallacy: effects of response mode, conceptual focus, and problem type. Cognition 107, 105–136
49. Han, C. and Carlin, B.P. (2001) Markov chain Monte Carlo methods for computing Bayes factors. J. Am. Stat. Assoc. 96, 1122–1132
50. David, F.N. (1998) Games, Gods, and Gambling: A History of Probability and Statistical Ideas, Courier Corporation
51. Gavanski, I. and Roskos-Ewoldsen, D.R. (1991) Representativeness and conjoint probability. J. Pers. Soc. Psychol. 61, 181
52. Cosmides, L. and Tooby, J. (1996) Are humans good intuitive statisticians after all? Rethinking some conclusions from the literature on judgment under uncertainty. Cognition 58, 1–73
53. Fisk, J.E. (2002) Judgments under uncertainty: representativeness or potential surprise? Br. J. Psychol. 93, 431–449
54. Nilsson, H. et al. (2009) Linda is not a bearded lady: configural weighting and adding as the cause of extension errors. J. Exp. Psychol. Gen. 138, 517
55. Ajzen, I. (1977) Intuitive theories of events and the effects of base-rate information on prediction. J. Pers. Soc. Psychol. 35, 303
56. Krynski, T.R. and Tenenbaum, J.B. (2007) The role of causality in judgment under uncertainty. J. Exp. Psychol. Gen. 136, 430
57. Nilsson, H. (2008) Exploring the conjunction fallacy within a category learning framework. J. Behav. Decis. Mak. 21, 471–490
58. Binmore, K. (2008) Rational Decisions, Princeton University Press
59. Savage, L.J. (1954) The Foundations of Statistics, Wiley
60. McGurk, H. and MacDonald, J. (1976) Hearing lips and seeing voices. Nature 264, 746–748
61. Marr, D. (1982) Vision, W.H. Freeman
62. Simon, H.A. (1956) Rational choice and the structure of the environment. Psychol. Rev. 63, 129
63. Maloney, L.T. et al. (2007) Questions without words: a comparison between decision making under risk and movement planning under risk. Integrated Models Cogn. Syst. 297–313
64. Oaksford, M. and Hall, S. (2016) On the source of human irrationality. Trends Cogn. Sci. 20, 336–344
65. Jarvstad, A. et al. (2013) Perceptuo-motor, cognitive, and description-based decision-making seem equally good. Proc. Natl. Acad. Sci. U.S.A. 110, 16271–16276
66. Jarvstad, A. et al. (2014) Are perceptuo-motor decisions really more optimal than cognitive decisions? Cognition 130, 397–416
67. Remez, R.E. et al. (1981) Speech perception without traditional speech cues. Science 212, 947–949
68. Vul, E. and Pashler, H. (2008) Measuring the crowd within: probabilistic representations within individuals. Psychol. Sci. 19, 645–647
69. Stroop, J.R. (1932) Is the judgment of the group better than that of the average member of the group? J. Exp. Psychol. 15, 550
70. Herzog, S.M. and Hertwig, R. (2014) Harnessing the wisdom of the inner crowd. Trends Cogn. Sci. 18, 504–506
71. Mosteller, F. and Nogee, P. (1951) An experimental measurement of utility. J. Polit. Econ. 59, 371–404
72. Loomes, G. (2015) Variability, noise, and error in decision making under risk. In The Wiley Blackwell Handbook of Judgment and Decision Making (Keren, G. and Wu, G., eds), pp. 658–695, Wiley
73. Herrnstein, R.J. (1961) Relative and absolute strength of response as a function of frequency of reinforcement. J. Exp. Anal. Behav. 4, 267–272
74. Vulkan, N. (2000) An economist's perspective on probability matching. J. Econ. Surveys 14, 101–118
75. Gilden, D.L. et al. (1995) 1/f noise in human cognition. Science 267, 1837
76. Kello, C.T. et al. (2008) The pervasiveness of 1/f scaling in speech reflects the metastable basis of cognition. Cogn. Sci. 32, 1217–1231
77. Epley, N. and Gilovich, T. (2006) The anchoring-and-adjustment heuristic: why the adjustments are insufficient. Psychol. Sci. 17, 311–318
78. Abbott, J.T. et al. (2015) Random walks on semantic networks can resemble optimal foraging. Psychol. Rev. 122, 558–559
79. Hills, T.T. et al. (2012) Optimal foraging in semantic memory. Psychol. Rev. 119, 431
80. Hills, T.T. et al. (2008) Search in external and internal spaces: evidence for generalized cognitive search processes. Psychol. Sci. 19, 802–808
81. Hennequin, G. et al. (2014) Fast sampling-based inference in balanced neuronal networks. Adv. Neural Inf. Process Syst. 27, 2240–2248
82. Buesing, L. et al. (2011) Neural dynamics as sampling: a model for stochastic computation in recurrent networks of spiking neurons. PLoS Comput. Biol. 7, e1002211
83. Lee, T.S. and Mumford, D. (2003) Hierarchical Bayesian inference in the visual cortex. J. Opt. Soc. Am. A Opt. Image Sci. Vis. 20, 1434–1448
84. Huang, Y. and Rao, R.P. (2014) Neurons as Monte Carlo samplers: Bayesian inference and learning in spiking networks. Adv. Neural Inf. Process Syst. 27, 1943–1951
85. Legenstein, R. and Maass, W. (2014) Ensembles of spiking neurons with noise support optimal probabilistic inference in a dynamically changing environment. PLoS Comput. Biol. 10, e1003859
86. Probst, D. et al. (2015) Probabilistic inference in discrete spaces can be implemented into networks of LIF neurons. Front. Comput. Neurosci. 9, 13
87. Ackley, D.H. et al. (1985) A learning algorithm for Boltzmann machines. Cogn. Sci. 9, 147–169
88. Hinton, G.E. et al. (2006) A fast learning algorithm for deep belief nets. Neural Comput. 18, 1527–1554
89. Nair, V. and Hinton, G.E. (2009) 3D object recognition with deep belief nets. Adv. Neural Inf. Process Syst. 22, 1339–1347
90. Kopp, B. et al. (2016) P300 amplitude variations, prior probabilities, and likelihoods: a Bayesian ERP study. Cogn. Affect. Behav. Neurosci. 16, 911–928
91. Chan, S.C.Y. et al. (2016) A probability distribution over latent causes in the orbitofrontal cortex. J. Neurosci. 36, 7817–7828
92. Friston, K. (2008) Hierarchical models in the brain. PLoS Comput. Biol. 4, e1000211
93. Plummer, M. (2003) JAGS: a program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, pp. 125
94. Carpenter, B. et al. (2016) Stan: a probabilistic programming language. J. Stat. Softw.
95. Raaijmakers, J.G. and Shiffrin, R.M. (1981) Search of associative memory. Psychol. Rev. 88, 93
96. Brown, G.D. et al. (2007) A temporal ratio model of memory. Psychol. Rev. 114, 539
97. Brown, S. and Heathcote, A. (2005) A ballistic model of choice response time. Psychol. Rev. 112, 117