Academia.eduAcademia.edu

Probability biases as Bayesian inference

2006, Judgment and Decision Making

In this article, I will show how several observed biases in human probabilistic reasoning can be partially explained as good heuristics for making inferences in an environment where probabilities have uncertainties associated to them. Previous results show that the weight functions and the observed violations of coalescing and stochastic dominance can be understood from a Bayesian point of view. We will review those results and see that Bayesian methods should also be used as part of the explanation behind other known biases. That means that, although the observed errors are still errors under the laboratory conditions in which they are demonstrated, they can be understood as adaptations to the solution of real life problems. Heuristics that allow fast evaluations and mimic a Bayesian inference would be an evolutionary advantage, since they would give us an efficient way of making decisions.

Judgment and Decision Making, Vol. 1, No. 2, November 2006, pp. 108–117 Probability biases as Bayesian inference André C. R. Martins∗ Universidade de São Paulo Abstract In this article, I will show how several observed biases in human probabilistic reasoning can be partially explained as good heuristics for making inferences in an environment where probabilities have uncertainties associated to them. Previous results show that the weight functions and the observed violations of coalescing and stochastic dominance can be understood from a Bayesian point of view. We will review those results and see that Bayesian methods should also be used as part of the explanation behind other known biases. That means that, although the observed errors are still errors under the laboratory conditions in which they are demonstrated, they can be understood as adaptations to the solution of real life problems. Heuristics that allow fast evaluations and mimic a Bayesian inference would be an evolutionary advantage, since they would give us an efficient way of making decisions. Keywords: weighting functions, probabilistic biases, adaptive probability theory. 1 Introduction It is a well known fact that humans make mistakes when presented with probabilistic problems. In the famous paradoxes of Allais (1953) and Ellsberg (1961), it was observed that, when faced with the choice between different gambles, people make their choices in a way that is not compatible with normative decision theory. Several attempts to describe this behavior exist in the literature, including Prospect Theory (Kahneman & Tversky, 1979), Cumulative Prospect Theory (Kahneman & Tversky, 1992), and a number of configural weighting models (Birnbaum & Chavez, 1997; Luce, 2000; Marley & Luce, 2001). All these models use the idea that, when analyzing probabilistic gambles, people alter the stated probabilistic values using a S-shaped weighting function w(p) and use these altered values in order to calculate which gamble would provide a maximum expected return. Exact details of all operations involved in these calculations, as values associated to each branch of a bet, coalescing of equal branches, or aspects of framing are dealt with differently in each model, but the models agree that people do not use the exact known probabilistic values when making their decisions. There are also models based on different approaches, as the decision by sampling model. Decision by sampling proposes that people make their decision by making comparisons of attribute values remembered by them and it can describe many of the characteristics of human reasoning well (Stewart et al., 2006). ∗ GRIFE – Escola de Artes, Ciências e Humanidades, Universidade de São Paulo, Av. Arlindo Bettio, 1000, Prédio I1, sala 310 F, CEP 03828-000, São Paulo - SP Brazil, [email protected] Recently, strong evidence has appeared indicating that the configural weighting models describe human behavior better than Prospect Theory. Several tests have shown that people don’t obey simple decision rules. If a bet is presented with two equal possible outcomes, for example, 5% of chance of getting 10 in one outcome and 10% of chance of getting the same return, 10, in another possible result, it should make no difference if both outcomes were combined into one single possibility, that is, a 15% chance of obtaining 10. This property is called coalescing of branches and it has been observed that it is not always respected (Starmer & Sugden, 1993; Humphrey, 1995; Birnbaum, 2004). Other strong requirement of decision theory that is violated in laboratory experiments is that people should obey stochastic dominance. Stochastic dominance happens when there are two bets available and the possible gains of one of them are as good as the other one, with at least one possibility to gain more. Per example, given the bets G = $96, 0.9; $12, 0.1 and G+ = $96, 0.9; $14, 0.05; $12, 0.05, G+ clearly dominates G, since the first outcome is the same and the second outcome in G is split into two possibilities in G+, returning the same or more than G, depending on luck. The only rational choice here is G+, but laboratory tests show that people do not always follow this simple rule (Birnbaum, 1999). Since rank-dependent models, as Prospect Theory (and Cumulative Prospect Theory) obey both stochastic dominance and coalescing of branches, configural weight models, that can predict those violations, are probably a better description of real behavior. In the configural weight models, each branch of a bet is given a different 108 Judgment and Decision Making, Vol. 1, No. 1, July 2006 weight, so that the branches with worst outcome will be given more weight by the decider. This allows those basic principles to be violated. However, although configural weight models can be good descriptive models, telling how we reason, the problem of understanding why we reason the way we do is not solved by them. The violations of normative theory it predicts are violations of very simple and strong principles and it makes sense to ask why people would make such obvious mistakes. Until recently, the reason why humans make these mistakes was still not completely clear. Evolutionary psychologists have suggested that it makes no sense that humans would have a module in their brains that made wrong probability assessments (Pinker, 1997), therefore, there must be some logical explanation for those biases. It was also suggested that, since our ancestors had to deal with observed frequencies instead of probability values, the observed biases might disappear if people were presented with data in the form of observed frequencies in a typical Bayes Theorem problem. Gigerenzer and Hoffrage (1995) conducted an experiment confirming this idea. However, other studies checking those claims (Griffin, 1999; Sloman, 2003) have shown that frequency formats seem to improve the reasoning only under some circumstances. If those circumstances are not met, frequency formats have either no effect or might even cause worse probability evaluations by the tested subjects. On the other hand, proponents of the heuristics and biases point of view claim that, given that our intellectual powers are necessarily limited, errors should be expected and the best one can hope is that humans would use heuristics that are efficient, but prone to error (Gigerenzer & Goldstein, 1996). And, as a matter of fact, they have shown that, for decision problems, there are simple heuristics that do a surprisingly good job (Martignon, 2001). But, since many of the calculations involved in the laboratory experiments are not too difficult to perform, the question of the reasons behind our probabilistic reasoning mistakes still needed answering. If we are using a reasonable heuristics to perform probabilistic calculations, understanding when this is a good heuristic and why it fails in the tests is an important question. Of course, the naïve idea that people should simply use observed frequencies, instead of probability values, can certainly be improved from a Bayesian point of view. The argument that our ancestors should be well adapted to deal with uncertainty from their own observations is quite compelling, but, to make it complete, we can ask what would happen if our ancestors minds (and therefore, our own) were actually more sophisticated than a simple frequentistic mind. If they had a brain that, although possibly using rules of thumb, behaved in a way that mimicked a Bayesian inference instead of a frequentistic evaluation, they would be better equipped to make sound Biases as Bayesian inference 109 decisions and, therefore, that would have been a good adaptation. In other words, our ancestors who were (approximately) Bayesians would be better adapted than any possible cousins who didn’t consider uncertainty in their analysis. And that would eventually lead those cousins to extinction. Of course, another possibility is that we learn those heuristics as we grow up, adjusting them to provide better answers. But, even if this is the dynamics behind our heuristics, good learning should lead us closer to a Bayesian answer than afrequentistic one. So, it makes sense to ask if humans are actually smarter than the current literature describes them as. Evidence supporting the idea that our reasoning resembles Bayesian reasoning already exists. Tenenbaum et al. (in press) have shown that observed inductive reasoning can be modeled by theory-based Bayesian models and that those models can provide approximately optimal inference. Tests of human cognitive judgments about everyday phenomena seems to suggest that our inferences provide a very good prediction for the real statistics (Griffiths & Tenenbaum, 2006). 1.1 Adaptive probability theory (APT) In a recent work (Martins, 2005), I have proposed the Adaptive Probability Theory (APT). APT claims that the biases in human probabilistic reasoning can be actually understood as an approximation to a Bayesian inference. If one supposes that people treat all probability values as if they were uncertain (even when they are not) and make some assumptions about the sample size where those probabilities would have been observed as frequencies, it follows that the observed shape of the weighting functions is obtained. Here, I will review those results and also show that we can extend the ideas that were introduced to explain weighting functions to explain other observed biases. I will show that some of those biases can be partially explained as a result of a mind adapted to make inferences in an environment where probabilities have uncertainties associated to them. That is, the weighting functions of Prospect Theory (and the whole class of models that use weighting functions to describe our behavior) can be understood and predicted from a Bayesian point of view. Even the observed violations of descriptive Prospect Theory, that is, violations of coalescing and stochastic dominance, that need configural weight models to be properly described, can also be predicted by using APT. And I will propose that Bayesian methods should be used as part of the explanation behind a number of other biases (for a good introductory review to many of the reported mistakes, see, for example, Plous, 1993). Judgment and Decision Making, Vol. 1, No. 1, July 2006 1.2 What kind of theory is APT? Finally, a note on what APT really is, from an epistemological point of view, is needed. Usually, science involves working on theories that should describe a set of data, making predictions from those theories and testing them in experiments. Decision theory, however, requires a broader definition of proper scientific work. This happens because, unlike other areas, we have a normative decision theory that tells us how we should reason. It does not necessarily describe real behavior, since it is based on assumptions about what the best choice is, not about how real people behave. Its testing is against other decision strategies and, as long as it provides optimal decisions, the normative theory is correct, even if it does not predict behavior for any kind of agents. That means that certain actions can be labeled as wrong, in the sense that they are far from optimal decisions, even though they correspond to real actions of real people. This peculiarity of decision theory means that not every model needs to actually predict behavior. Given nonoptimal observed behavior, understanding what makes the deciders to behave that way is also a valid line of inquiry. That is where APT stands. Its main purpose is to show that apparently irrational behavior can be based on an analysis of the decision problem that follows from normative theory. The assumptions behind such analysis might be wrong and, therefore, the observed behavior would not be optimal. That means that our common sense is not perfect. However, if it works well for most real life problems, it is either a good adaptation or well learned. APT intends to make a bridge between normative and descriptive theories. This means that it is an exploratory work, in the sense of trying to understand the problems that led our minds to reason the way they do. While based on normative theory, it was designed to agree with the observed biases. This means that APT does not claim to be the best actual description of real behavior (although it might be). Even if other theories (such as configural weight models or decision by sampling) actually describe correctly the way our minds really work, as long as their predictions are compatible with APT, APT will show that the actual behavior predicted by those theories is reasonable and an approximation to optimal decisions. Laboratory tests can show if APT is actually the best description or not and we will see that APT suggests new experiments in the problem of base rate neglect, in order to understand better our reasoning. But the main goal of APT is to show that real behavior is reasonable and it does that well. 2 Bayesian weighting functions Suppose you are an intuitive Bayesian ancestor of mankind (or a Bayesian child learning how to reason Biases as Bayesian inference 110 about the world). That is, you are not very aware of how you decide the things you do, but your mind does something close to Bayesian estimation (although it might not be perfect). You are given a choice between the following two gambles: Gamble A 85% to win 100 15% to win 50 Gamble B 95% to win 100 5% to win 7 If you are sure about the stated probability values and you are completely rational, you should just go ahead and assign utilities to each monetary value and choose the gamble that provides the largest expected utility. And, as a matter of fact, the laboratory experiments that revealed the failures in human reasoning provided exact values, without any uncertainty associated to them. Therefore, if humans were perfect Bayesian statisticians, when faced with those experiments, the subjects should have treated those values as if they were known for sure. But, from your point of view of an intuitive Bayesian, or from the point of view of everyday life, there is no such a thing as a probability observation that does not carry with it some degree of uncertainty. Even values from probabilistic models based on some symmetry of the problem depend, in a more complete analysis, on the assumption that the symmetry does hold. If it doesn’t, the value could be different and, therefore, even though the uncertainty might be small, we would still not be completely sure about the probability value. Assuming there is uncertainty, what you need to do is to obtain your posterior estimate of the chances, given the stated gamble probabilities. Here, the probabilities you were told are actually the data you have about the problem. And, as long as you were not in a laboratory, it is very likely they have been obtained as observed frequencies, as proposed by the evolutionary psychologists. That is, what you understand is that you are being told that, in the observed sample, a specific result was observed 85% of the times it was checked. And, with that information in mind, you must decide what posterior value it will use. The best answer would certainly involve a hierarchical model about possible ways that frequency was observed and a lot of integrations over all the nuisance parameters (parameters you are not interested about). You should also consider whether all observations were made under the same circumstances, if there is any kind of correlation between their results, and so on. But all those calculations involve a cost for your mind and it might be a good idea to accept simpler estimations that work reasonably well most of the time. You are looking for a good heuristic, one that is simple and efficient and that gives you correct answers most of the time (or, at least, close enough). That is the basic idea behind Adaptive Probability Theory Judgment and Decision Making, Vol. 1, No. 1, July 2006 Biases as Bayesian inference (Martins, 2005). Our minds, from evolution or learning, are built to work with probability values as if they were uncertain and make decisions compatible with that possibility. APT does not claim we are aware of that, it just says that our common sense is built in a way that mimics a Bayesian inference of a complex, uncertain problem. If you hear a probability value, it is a reasonable assumption to think that the value was obtained from a frequency observation. In that case, the natural place to look for a simple heuristic is by treating this problem as one of independent, identical observations. In this case, the problem has a binomial likelihood and the solution to the problem would be straight-forward if not for one missing piece of information. You were informed the frequency, but not the sample size n. Therefore, you must use some prior opinion about n. In the full Bayesian problem, that means that n is a nuisance parameter. This means that, while inference about p is desired, the posterior distribution depends also on n and the final result must be integrated over n. The likelihood that a observed frequency o, equivalent to the observation of s = no successes, is reported is given by f (o|p, n) ∝ pon (1 − p)(n(1−o)) . (1) In order to integrate over n, a prior for it is required. However, the problem is actually more complex than that since it is reasonable that our opinion on n should depend on the value of o. That happens because if o = 0.5, it is far more likely that n = 2 than if o = 0.001, when it makes sense to assume that at least 1,000 observations were made. And we should also consider that extreme probabilities are more subject to error. In real life, outside the realm of science, people rarely, if ever, have access to large samples to draw their conclusions from. For the problem of detecting correlates, there is some evidence that using small samples can be a good heuristics (Kareev, 1997). In other words, when dealing with extreme probabilities, we should also include the possibility that the sample size was actually smaller and the reported frequency is wrong. The correct prior f (n, o), therefore, can be very difficult to describe and, for the complete answer, hierarchical models including probabilities of error are needed. However, such a complicated, complete model is not what we are looking for. A good heuristic should be reasonably fast to use and shouldn’t depend on too many details of the model. Therefore, it makes sense to look for reasonable average values of n and simply assume that value for the inference process. Given a fixed value for n, it is easy to obtain a posterior distribution. The likelihood in Equation 1 is a binomial likelihood and the easiest way to obtain inferences when dealing with binomial likelihoods is assuming a Beta distribution for the prior. Given a Beta prior with parameters 111 a and b, the posterior distribution will also be a Beta distribution with parameters a+s and b+n. The average of a random variable that follows a Beta distribution with paa rameters a and b has a simple form, a+b . That means that we can obtain a simple posterior average for the probability p, given the observed frequency o, w(o) = E[p|o] w(o) = a + on , a+b+n (2) which is a straight line if n is a constant (independent of o), but different and less inclined than w(o) = o. For a non-informative prior distribution, that corresponds to the choice a = 1 and b = 1, Equation 2 can be written in the traditional form (1 + s)/(2 + n) (Laplace rule). However, a fixed sample size, equal for all values of p does not make much sense in the regions closer to certainty and n must somehow increase as we get close to those regions (o → 0 or o → 1). The easiest way to model n is to suppose that the event was observed at least once (and, at least once, it was not observed). That is, if o is small, choose an observed number of successes s = 1 (some other value would serve, but we should remember that humans tend to think with small samples, as observed by Kareev et al., 1997). If p is closer to 1, take s = n − 1. That is, the sample size will be given by n = 1/t where t = min(o; 1 − o) and we have that w(o) = 2o/(2o + 1) for o < 0.5 and w(o) = 1/(3 − 2o) for o > 0.5. By calculating h i w(x) w(y) h w(cx) w(cy) i, it is easy to show that the common-rate effect holds, meaning that the curves are subproportional, both for o < 0.5 and o > 0.5. Estimating the above fraction shows that w(x)/w(y) < w(cx)/w(cy), for c < 1, exactly when x < y. The curve w(o) can be observed in Figure 1, where it is compared to a curve proposed by Prelec (2000) as a parameterization that fits reasonably well the data observed in the experiments. A few comments are needed here. For most values of o, the predicted value of n will not be an integer, as it would be reasonable to expect. If o = 0.4, we have n = 2.5, an odd and absurd sample size, if taken literally. One could propose using a sample size of n = 5 for this case, but that would mean a non-continuous weighting function. More than that, for too precise values, as 0.499, that would force n to be as large as 1.000. However, it must be noted that, in the original Bayesian model, n is not supposed to be an exact value, but the average value that is obtained after it is integrated out (remember it is a nuisance parameter). As an average value, there is nothing wrong with non-integer numbers. Also, it is necessary to remember that this is a proposed heuristic. It is not the Judgment and Decision Making, Vol. 1, No. 1, July 2006 Binomial Inference for o =1/e Fixed Point f 1 1 0.9 0.9 0.8 0.8 0.7 0.7 0.6 w(o) w(o) 0.6 0.5 0.4 0.5 0.4 0.3 0.3 0.2 0.2 w(o)=o α=β=1, s=1 Prelec 0.1 0 112 Biases as Bayesian inference 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 w(o)=o s=1, α=1, β=e−1 s=2, α=1, β=e−1 Prelec 0.1 1 Observed Value Figure 1: Weighting Function as a function of observed frequency. The curve proposed by Prelec, fitted to the observed data, as well as the w(o) = o curve are also shown for comparison. exact Bayesian solution to the problem, but an approximation to it. In the case of o = 0.499, it is reasonable to assume that people would interpret it as basically 50%. In that sense, what the proposed behavior for n says is that, around 50%, the sample size is estimated to be around n = 2; around o = 0.33, n is approximately 3; and so on. The first thing to notice in Figure 1 is that, by correcting the assumed sample size as a function of o, the S-shaped format of the observed behavior is obtained. However, there are still a few important quantitative differences between the observations and the predicted curve. The most important one is on the location of the fixed point of , defined as the solution to the equation w(o) = o. If we had no prior information about o (a = b = 1), we should have of = 0.5. Instead, the actual observed value is closer to 1/3. That is, for some reason, our mind seems to use an informative prior where the probability associated with obtaining the better outcomes are considered less likely than those associated with the worse outcomes (notice that the probability o is traditionally associated with the larger gain in the experiment). As a matter of fact, configural weighting models propose that, given a gamble, the branches with worst outcomes are given more weight than those with higher returns. For gambles with two branches, the lower branch is usually assigned a weight around 2, while the upper branch has a weight of 1. This can be understood as risk management or also as some lack of trust. If someone offered you a bet, it is possible that the real details are not what you are told and it would make sense to consider the possibility that the bet is actually worse than the described one. 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Observed value Figure 2: Weighting Functions as a function of observed frequency for a binomial likelihood with fixed point of = 1/e A different fixed point can be obtained by assigning different priors, that is if a, in the Beta distribution, correspond to the upper branch and b to the lower one, keep using a = 1 and change the value of b for something larger. Following the suggestion that of = 1/e (Prelec, 2000), based on compound invariance, we have a = 1 and b = e − 1. This values are not necessarily exact and they can also be understood as giving more weight to the worse outcome. That choice would be reasonable if there was any reason to believe that the chances might actually be worse than the stated probability values. Figure 2 shows the results for this informative prior, with the s = 1 (n − 1 for p > 0.5) and s = 2 (or n − 2) curves. This change leads to a posterior average w(o) with a correct fixed point, but the predicted curve does not deviate from the w(o) = o curve as much as it has been observed. This is especially true for values of o near the certainty values, where the effect should be stronger, due to what is known as certainty effect (Tversky & Kahneman, 1981). The certainty effect is the idea that when a probability changes from certainty to almost certain, humans change their decisions about the problem much more than decision theory tells they should. This means that the real curve should approach 0 and 1 with larger derivatives than the ones obtained from the simple binomial model. In order to get a closer agreement, other effects must be considered. Martins (2005) has suggested a number of possible corrections to the binomial curve. Among them, there is a possibility that the sample size does not grow as fast as it should when one gets closer to certainty (changing as tγ , instead of t1 ). Another investigated explanation is that people might actually be us- Judgment and Decision Making, Vol. 1, No. 1, July 2006 ing stronger priors. The curves generated from those hypotheses are similar to those of Figure 2. The main effect of the different parametrizations is on how the weighting function behaves as one gets closer to certainty. Simple stronger priors (a = 3, b = 3(e − 1)) do not actually solve the certainty effect, while slowly growing sample sizes seem to provide a better fitting. This agrees with the idea that people tend to use small samples for their decisions. Large samples are costly to obtain and it makes sense that most everyday inferences are based on small samples, even when the stated probability would imply a larger one. And it is important to keep in mind that what is being analyzed here are possible heuristics and in what sense they actually agree with decision theory. That mistakes are still made is nothing to be surprised about. This means that APT is able to explain the reasons behind the shape of the weighting functions. As such, APT is also capable of explaining the observed violations of coalescing and stochastic dominance. For example, suppose that you have to choose between the two gambles A and B. One of the possibilities in gamble A is a branch that gives you a 85% chance to win 100. If gamble A is changed to a gamble A’, where the only change is that this branch is substituted for two new branches, the first with 5% to win 100 and the second with 80% to win 100, no change was actually made to the gamble, if those values are known for sure. Both new branches give the same return and they can be added up back to the original gamble. In other words, the only difference between the gambles is the coalescing of branches. Therefore, if only coalescing (or splitting) of branches is performed to transform A into A’, people who chose A over B, should still choose A’ over B. However, notice that, the game A has its most extreme probability value equal to 15%, while in A’, it is 5%. If people behave in a way that is consistent with APT, that means they will use different sample sizes. And they will make different inferences. It should not be a surprise, therefore, that, under some circumstances, coalescing would be broken. Of course, coalescing is not broken for every choice, but APT, by using the supposition that the sample grows slower than it should (γ around 0.3) does predict violation of coalescing in the example reported by Birnbaum (2005). The assumption about sample sizes will also affect choices where one gamble clearly dominates the other and, therefore, Martins (2005) has shown that APT can explain the observed violations of coalescing and stochastic dominance. 3 Other biases It is an interesting observation that our reported mistakes can be understood as an approximation to a Bayesian in- Biases as Bayesian inference 113 ference process. But if that is true, the same effect should be able to help explain our observed biases in other situations, not only our apparent use of weighting functions. And, as a matter of fact, the literature of biases in human judgment has many other examples of observed mistakes. If we are approximately Bayesians, it would make sense to assume that the ideas behind APT should be useful under other circumstances. In this section, I will discuss a collection of other mistakes that can be, at least partially, explained by APT. In the examples bellow, a literature trying to explain those phenomena already exists, so, it is very likely that the explanations from APT are not the only source of the observed biases, and I am not claiming that they are. But APT should, at least, show that our mistakes are less wrong than previously thought. Therefore, it makes sense to expect a better fit between normative and descriptive results when uncertainty is included in the analysis as well as the possibility that some sort of error or mistake exists in the data. We will see that the corrections to decision theoretic results derived from those considerations are consistently in the direction of observed human departure from rationality. 3.1 Conjunctive events Cohen et al. (1979) reported that people tend to overestimate the probability of conjunctive events. If people are asked to estimate the probability of a result in a two-stage lottery with equal probabilities in each state, their answer was far higher than the correct 25%, showing an average value of 45%. Again, as in the choice between gambles, this is certainly wrong from a probabilistic point of view. But the correct 25% value is only true if independence can be assumed and the value 0.5 is actually known for sure. Real, everyday problems can be more complex than that. If you were actually unsure of the real probability and only thought that, in average, the probability of a given outcome in the lottery was 50%, the independence becomes conditional on the value of p. The chance that two equal outcomes will obtain is given by p2 . But, if p is unknown, you’d have to calculate an average estimate for that chance. Supposing a uniform prior, that is f (p) = 1 for 0 ≤ p ≤ 1, the expected value will be Z 1 E[p2 ] = f (p).p2 dp = 1/3. 0 That is, for real problems where only conditional independence exists, the result is not the correct 25% for the situation where p is known to be 0.5 with certainty. Of course, if the uncertainty in the priori was smaller, the result would become closer to 25%. Furthermore, if the conditional independence assumption is also dropped, the predicted results can become even closer to the observed behavior. In many situations, Judgment and Decision Making, Vol. 1, No. 1, July 2006 especially when little is known about a system, even conditional independence might be too strong an assumption. Suppose, for example, that our ancestors needed to evaluate the probability of finding predators at the river they used to get water from. If a rational man had a prior uniform (a = b = 1, with an average a/(a + b) = 1/2) distribution for the chance the predator would be there and, after that, only one observation was made an hour ago where the predator was actually seen, the average chance a predator would be by the river would change to a posterior where a = 2 and b = 1. That is, the average probability would now be 2/3. However, if he wanted to return to the river only one hour later, the events would not be really conditionally independent, as the same predator might still be there. The existence of correlation between the observations implies that the earlier sighting of the predator should increase the probability of observing it there again. Real problems are not as simple as complete independence would suggest. Therefore, a good heuristic is not one that simply multiplies probabilities. When probabilistic values are not known for sure, true independence does not exist, only conditional independence remains, and the heuristic should model that. If our heuristics are also built to include the possibility of dependent events, they might be more successful for real problems. However, they would fail more seriously in the usual laboratory experiments. This means that the observed estimate for the conjunctive example in Cohen, around 45%, can be at least partially explained as someone trying to make inferences when independence, or even conditional independence, do not necessarily hold. It is important to keep in mind that our ancestors had to deal with a world they didn’t know how to describe and model as well as we do nowadays. It would make sense for a successful heuristic to include the learning about the systems it was applied to. The notion of independent sampling for similar events might not be natural in many cases, and our minds might be better equipped by not assuming independence. When faced with the same situation, not only the previous result can be used as inference for the next ones, but also it might have happened that some covariance between the results existed and this might be the origin of the conjunctive events bias. Of course, this doesn’t mean that people are aware of that nor that our minds perform all the analysis proposed here. The actual calculations can be performed following different guidelines. All that is actually required is that, most of the time, they should provide answers that are close to the correct ones. 3.2 Conservatism It might seem, at first, that humans are good intuitive Bayesian statisticians. However, it has been shown that, Biases as Bayesian inference 114 when faced with typical Bayesian problems, people make mistakes. This result seems to contradict APT. One example of such a behavior is conservatism. Conservatism happens because people seem to update their probability estimates slower than the rules of probability dictate, when presented with new information (Phillips & Edwards, 1966). That is, given the prior estimates and new data, the data are given less importance than they should have. And, therefore, the subjective probabilities, after learning the new information, change less than they should, when compared to a correct Bayesian analysis. This raises the question of why we would have heuristics that mimic a Bayesian inference, but apparently fail to change our points of view by following the Bayes Theorem. Described like that, conservatism might sound as a challenge to APT. In order to explain what might be going on, we need to understand that APT is not a simple application of Bayesian rules. It is actually based on a number of different assumptions. First of all, even though our minds approximate a Bayesian analysis, they do not perform one flawlessly. Second, we have seen that, when given probabilities, people seem to use sample sizes that are actually smaller than they should be. And, for any set of real data, there is always the possibility that some mistake was made. This possibility exists in scientific works, subject to much more testing and checking than everyday problems. For those real problems, the chance of error is certainly larger. This means that our heuristics should approximate not just a simple Bayesian inference, but a more complete model. And this model should include the possibility that the new information could be erroneous or deceptive. If the probability of deception or errors, is sufficiently large, the information in the new data should not be completely trusted. This means that the posterior estimates will actually change slower than the simpler calculation would predict. This does not mean that people actually distrust the reported results, at least, not in a conscious way. Instead, it is a heuristic that might have evolved in a world where the information available was subject to all kind of errors. 3.3 Illusory and invisible correlations Another observed bias is the existence of illusory and invisible correlations. Chapman and Chapman (1967) observed that people tended to detect correlations in sets of data where no correlation was present. They presented pairs of words on a large screen to the tested subjects, where the pairs were presented such that each first word was shown an equal number of times together with each of the different word from the second set. However, after watching the whole sequence, people tended to believe that pairs like lion and tiger, or eggs and bacon, showed Judgment and Decision Making, Vol. 1, No. 1, July 2006 up more often than the pairs where there was no logical relation between the variables. Several posterior studies have confirmed that people tend to observe illusory correlations when none is available, if they expect some relation between the variables or observations for whatever reason. On the other hand, Hamilton and Rose (1980) observed that, when no correlation between the variables were expected, it is common that people will fail to see a real correlation between those two variables. Sometimes, even strong correlations go unnoticed. And the experiment also shows that, even when the correlation is detected, it is considered weaker than the actual one. That is, it is clear that, whatever our minds do, they do not simply calculate correlations based on the data. From a Bayesian point of view, this qualitative description of the experiments can not be called an error at all. Translated correctly, the illusory correlation case simply states that, when your prior tells you there is a correlation, it is possible that a set of data with no correlation at all will not be enough to convince you otherwise. Likewise, if you don’t believe a priori that there is a correlation, your posterior estimate will be smaller than the observed correlation calculated from the sample. That is, your posterior distribution will be something between your prior and the data. When one puts it this way, the result is so obvious that what is surprising is that those effects have been labeled as mistakes without any further checks. Of course, this does not mean that the observed behavior is exactly a Bayesian analysis of the problem. From what we have seen so far, it is most likely that it is only an approximation. In order to verify how people actually deviate from the Bayesian predictions, we would need to measure their prior opinions about the problem. But this may not be possible, depending on the experiment. If the problem is presented together with the data, people had to make their guesses about priors and update them at the same time they were already looking at the data (by whatever mechanism they actually use). It is important to notice that, once more, the observed biases were completely compatible with an approximation to a Bayesian inference. Much of the observed problem in the correlation case was actually due to the expectation by the experimenters that people should simply perform a frequentistic analysis and not include any other information. However, if connections were apparent between variables, it would be natural to have used an informative prior, since the honest opinion would be, initially, that you know something about the correlation. Of course, in problems where just words are put together, there should be no reason to expect a correlation. But, for a good heuristic, detecting a pattern and assuming a correlation can be efficient. Biases as Bayesian inference 115 3.4 Base rate neglect In Section 3.2, we have seen an example of how humans are not actually accomplished Bayesian statisticians, since they fail to apply Bayes Theorem in a case where a simple calculation would provide the correct answer. Another observed bias where the same effect is observed is the problem of base rate neglect (Kahneman and Tversky, 1973). Base rate neglect is the observed fact that, when presented with information about the relative frequency some events are expected to happen, people often ignore that information when making guesses about what they expect to be observed, if other data about the problem is present. In the experiment where they observed base rate neglect, Kahneman and Tversky told their subjects that a group was composed of 30 engineers and 70 lawyers. When presenting extra information about one person of that group, the subjects, in average, used only that extra information when evaluating the chance that the person might be an engineer. Even if the information was noninformative about the profession of the person, the subjects provided a 50%-50% of being either an engineer or a lawyer, despite the fact that lawyers were actually more probable. Application of APT to this problem is not straightforward. Since APT claims that we use approximations to deal with probability problem, one possible approximation is actually ignore part of the information and use only what the subject considers more relevant. In the case of those tests, it would appear that the subjects consider the description of the person much more relevant than the base rates. Therefore, one possible approximation is actually ignoring the base rates. If our ancestors (or the subjects of the experiments, as they grew up) didn’t have access to base rates of most problems they had to deal with, there would be no reason to include base rates in whatever heuristics we are using. On the other hand, it is possible that we actually use the base rates. Notice that base rates are given as frequencies and, assuming people are not simply ignoring them, those frequencies would be altered by the weighting functions. In this problem, a question that must be answered is if there is a worst outcome. Remember that we can use a non-informative prior (a = b = 1), but, for choice between bets, the fixed point agrees with a prior information that considers the worse outcome as initially more likely to happen (a = 1 and b = e − 1, for example). In a problem as estimating if someone is an engineer or a lawyer, the choice of best or worst outcome is nonexistent, or individual, at best. This suggests we should use the non-informative prior for a first analysis. In order to get a complete picture, we present the results of calculating weighting functions for the base rates in Kah- Judgment and Decision Making, Vol. 1, No. 1, July 2006 Table 1: The result of the weighting functions w(o) applied to the base rates of Kahneman and Tversky (1973) experiment, for observed frequencies of engineers (or lawyers) given by o = 0.3 or o = 0.7. The parameter γ describes how the sample size n grows as the observed value o moves towards certainty. a=b=1 a = 1 and b = e − 1 o = 0.3, γ = 1 0.375 0.330 o = 0.3, γ = 0.3 0.416 0.344 o = 0.7, γ = 1 0.625 0.551 o = 0.7, γ = 0.3 0.584 0.483 neman and Tversky (1973) experiment in Table 1. The weighting functions associated with the observed values of o = 30% and o = 70% are presented and two possibilities are considered in the table, that the sample size grows with 1/t (corresponding to γ = 1) and that it grows slower than it should, with 1/t0.3 (γ = 0.3). While the first one would correspond to a more correct inference, the second alternative (γ = 0.3) actually fits better the observed behavior (Martins, 2005). The first thing to notice in Table 1 is that, both for observed frequencies of o = 0.3 and o = 0.7, all weighting functions provide an answer closer to 0.5 than the provided base rate. Actually ignoring the base rates correspond to a prior choice to 0.5, therefore this means that people would actually behave in a way that is closer to ignoring the base rates. Notice that, using the case that better describes human observed choices (γ = 0.3) and the fact that there are no better outcomes here, a = b = 1, we get the altered base rates of 0.42 and 0.58, approximately, instead of 0.3 and 0.7. The correction, once more, leads in the direction of the observed biases, confirming that APT can play a role in base rate neglect, even if the base rates are not completelly ignored. These results also provide an interesting test of how much information our minds see in the base rate problem. As mentioned above, it is not clear if people simply ignore the base rates, or transform them, obtaining values much closer to 0.5 than the base rates as initial guesses. Even if the actual behavior is to ignore the base rates, APT shows that there might be a reason why this approximation is less of an error than initially believed. Anyway, since we have different predictions for this problem, how we are actually thinking can be tested and an experiment is being planned in order to check which alternative seems to provide the best description of actual behavior. Biases as Bayesian inference 116 4 Conclusions We have seen that our observed biases seem to originate from an approximation to a Bayesian solution of the problems. This does not mean that people are actually decent Bayesian statisticians, capable of performing complex integrations when faced with new problems and capable of providing correct answers to those problems. What I have shown is that many of the observed biases can actually be understood as the use of a heuristic that tries to approximate a Bayesian inference, based on some initial assumptions about probabilistic problems. In the laboratory experiments where our reasoning was tested, quite often, those assumptions were wrong. When a probability value is stated without uncertainty, people still behave as if it were uncertain and that was their actual mistake. When assessing correlations, some simple heuristics might be involved in evaluating what seems reasonable, that is, our minds work as if they were using a prior that, although informative, might not be including everything we should know. Even violations of Bayesian results, as ignoring prior information in the base rate neglect problem, can be better understood by applying the tools proposed by APT. Since full rationality would require an ability to make all complex calculations in almost no time, departures from it are inevitable. Still, it is clear that, whatever our mind is doing when making decisions, the best evolutionary (or learning) solution is to have heuristics that do not cost too much brain activity, but that provide answers as close as possible to a Bayesian inference. In this sense, the explanations presented here are not meant to be the only cause of the errors observed and, as such, do not challenge the rest of the literature about those biases. What APT provides is an explanation to the open problem of why we make so many mistakes. People use probabilistic rules of thumbs that are built to approximate a Bayesian inference under the most common conditions our ancestors would have met (or that we would have met as we grew up and learned about the world). The laboratory experiments do not show that these results actually come from an evolutionary process. It is quite possible that we actually learned to think that way when we grow up. In both circumstances, APT shows that we are a little more competent than we previously believed. References Allais, P. M. (1953). The behavior of rational man in risky situations - A critique of the axioms and postulates of the American School. Econometrica, 21, 503546. Birnbaum, M. H. (1999). Paradoxes of Allais, stochastic dominance, and decision weights. In J. Shanteau, Judgment and Decision Making, Vol. 1, No. 1, July 2006 B. A. Mellers, & D. A. Schum (Eds.), Decision science and technology: Reflections on the contributions of Ward Edwards, 27-52. Norwell, MA: Kluwer Academic Publishers. Birnbaum, M. H. (2004). Tests of rank-dependent utility and cumulative prospect theory in gambles represented by natural frequencies: Effects of format, event framing, and branch splitting. Organizational Behavior and Human Decision Processes, 95, 40-65. Birnbaum, M. H., (2005). New paradoxes of risky decision making. Working Paper. Birnbaum, M. H., & Chavez, A. (1997). Tests of theories of decision making: Violations of branch independence and distribution independence. Organizational Behavior and Human Decision Processes, 71 (2), 161194. Chapman, L. J., & Chapman, J. P. (1967). Genesis of popular but erroneous psychodiagnostic observations. Journal of Abnormal Psychology, 72, 193-204. Cohen, J., Chesnick, E.I., & Haran, D., (1979). Evaluation of compound probabilities in sequential choice, Nature, 232, 414-416. Ellsberg, D. (1961). Risk, ambiguity and the Savage axioms. Quart. J. of Economics, 75, 643-669. Gigerenzer, G., & Goldstein, D. G. (1996). ‘Reasoning the fast and frugal way: Models of bounded rationality’ Psych. Rev., 103: 650-669 Gigerenzer, G., & Hoffrage, U. (1995). How to improve Bayesian reasoning without instruction: Frequency formats. Psych. Rev., 102, 684-704. Griffin, D. H. & Buehler, R. (1999). Frequency, probability and prediction: Easy solutions to cognitive illusions?, Cognitive Psychology, 38, 48-78. Griffiths, T. L. and Tenenbaum, J. B. (2006). Optimal predictions in everyday cognition. Psychological Science 17(9), 767-773. Hamilton, D. L., & Rose, T. L. (1980). Illusory correlation and the maintenance of stereotypic beliefs. Journal of Personality and Social Psychology, 39, 832-845. Humphrey, S. J. (1995). Regret aversion or eventsplitting effects? More evidence under risk and uncertainty. Journal of Risk and Uncertainty, 11, 263-274. Kahneman, D., & Tversky, A. (1972). Subjective probability: a judgment of representativeness. Cognitive Psychology, 3, 430-454. Kahneman, D., & Tversky, A. (1973). On the psychology of prediction. Psychology Review, 80, 237-251. Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47, 263-291. Kahneman, D., & Tversky, A. (1992). Advances in prospect theory: Cumulative representation of uncertainty. J. of Risk and Uncertainty, 5, 297-324. Biases as Bayesian inference 117 Kareev, Y., Lieberman, I., & Lev, M. (1997). Through a narrow window: Sample size and the perception of correlation. J. of Exp. Psych.: General, 126, 278-287. Lichtenstein, D., & Fischhoff, B. (1977). Do those who know more also know more about how much they know? The Calibration of Probability Judgments. Organizational Behavior and Human Performance, 3, 552-564. Luce, R. D. (2000). Utility of gains and losses: Measurement-theoretical and experimental approaches. Mahwah: Lawrence Erlbaum Associates. Marley, A. A. J., & Luce, E. D. (2001). Rank-weighted utilities and qualitative convolution. J. of Risk and Uncertainty, 23 (2), 135-163. Martignon, L. (2001). Comparing fast and frugal heuristics and optimal models in G. Gigerenzer, & R. Selten (eds.), Bounded rationality: The adaptive toolbox. Dahlem Workshop Report, 147-171. Cambridge, Mass, MIT Press. Martins, A. C. R. (2005). Adaptive Probability Theory: Human Biases as an Adaptation. Cogprint preprint at http://cogprints.org/4377/ . Philips, L. D., & Edwards, W. (1966). Conservatism in a simple probability inference task. Journal of Experimental Psychology, 72, 346-354. Pinker, S. (1997). How the mind works. New York, Norton. Plous, S. (1993). The Psychology of Judgment and Decision Making. New York, MacGraw-Hill. Prelec, D. (1998). The Probability Weighting Function. Econometrica, 66, 3, 497-527. Prelec, D. (2000). Compound invariant weighting functions in Prospect Theory in Kahneman, D., & Tversky, A. (eds.), Choices, Values and Frames, 67-92. New York, Russell Sage Foundation, Cambridge University Press. Sloman, S. A., Slovak, L., Over, D. & Stibel, J. M. (2003). Frequency illusions and other fallacies. Organizational Behavior and Human Decision Processes, 91, 296-309. Starmer, C., & Sugden, R. (1993). Testing for juxtaposition and event-splitting effects. Journal of Risk and Uncertainty, 6, 235-254. Stewart, N., Chater, N., and Brown, G. D. A. (2006). Decision by sampling. Cognitive Psychology, 53, 1, 1-26. Tenenbaum, J. B., Kemp, C., and Shafto, P. (in press). Theory-based Bayesian models of inductive reasoning. To appear in Feeney, A. & Heit, E. (Eds.), Inductive reasoning. Cambridge University Press. Tversky, A., & Kahneman, D. (1981). The framing of decisions and the psychology of choice. Science, 211, 453-458.