Bayesian Statistics
JENS VOGELGESANG
University of Hohenheim, Germany
MICHAEL SCHARKOW
Zeppelin University, Germany
Introduction
Bayesian statistics allows prior knowledge to be incorporated into a statistical model.
Central to this approach is Bayes’ theorem, which provides a mathematical rule to
update the prior knowledge with new data. The results of a Bayesian analysis are characterized by a posterior distribution, which balances the prior knowledge—specified by
using a particular form of probability distribution—and the data likelihood. Bayesian
statistics is conceived as a counterpart to the classical approach to statistics. The classical
approach, however, is the dominant statistical paradigm in communication research.
Historically, the classical approach to inferential statistics can be traced back to R.
A. Fisher (1890–1962), Jerzy Neyman (1894–1981), and Egon Pearson (1895–1980).
The current statistical practice in the field of communication is a hybrid approach
binding together frequentist methodology developed by Neyman and Pearson with
likelihood-based methodology developed by Fisher (Gigerenzer et al., 1989). However, introductory textbooks on statistics usually present
the classical approach as if there were never any discussion about its coherence and
applicability. The frequentist approach views probability as a ratio of frequencies.
Most students in communication are introduced to the classical concept of probability
by studying the infinite sampling properties of a coin toss or the roll of a die. This
particular notion of probability can be described using axioms developed by Andrei
Nikolaevich Kolmogorov (1903–1987): (a) the probability of any event is equal to or
greater than zero; (b) the probability of a certain event is 1; (c) if A and B are two
mutually exclusive events (events that cannot both occur), then the probability of the
disjunction (the probability of either A or B occurring) is equal to the sum of their
individual probabilities: P(A or B) = P(A) + P(B). If two events A and B are independent, then the occurrence of one event does not influence the probability of the other, so that P(A and B) = P(A) × P(B). There are situations in
which probabilities can be estimated from data (e.g., the probability of someone being
a newspaper reader in a population), but sometimes they cannot (e.g., the probability,
in advance of data, of the null hypothesis being correct). When probabilities cannot be
thought of as a ratio of frequencies, they still need to be estimated somehow. In such
situations the Bayesian approach to statistics comes into play.
Interestingly, the Bayesian approach to statistics predates the classical approach
by more than 150 years. In the 18th century, Thomas Bayes (1702–1761) described
the mathematical rule that we now call Bayes’ theorem in a paper titled “An Essay Towards Solving a Problem in the Doctrine of Chances.” This paper was found after
his death and posthumously published in the Philosophical Transactions of the Royal
Society. Pierre-Simon Laplace (1749–1827) later generalized the work by Bayes. This
is why Bayes’ theorem is sometimes also called the Bayes–Laplace rule. The work
by Bayes and Laplace, among others, laid the foundation of modern statistical theory.
Until the second half of the 18th century, probability calculation was almost entirely
focused on estimating the likelihood of future uncertain events. It was the insight of
Bayes that the calculus of probability could be used to assess not just the likelihood of
future events, but also the likelihood of past events. Essentially, with Bayes’ theorem the
concept of conditional probability was introduced. This probability is denoted using a
vertical bar; for example, p(A|B), reads as “the probability of A given B.” Conditional
probabilities refer to the case of nonindependent events, which can be described
using the axioms put forth by Alfréd Rényi (1921–1970), who extended the system of
Kolmogorov’s axioms (Kaplan, 2014; Press, 2003). The conditional probability p(A|B)
is determined by the joint distribution of A and B.
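As a minimal numerical sketch (with invented probabilities), the conditional probability p(A|B) is simply the joint probability of A and B divided by the marginal probability of B, and it is generally not the same as p(B|A):

```python
# Hypothetical probabilities for two events:
# A = "reads a daily newspaper", B = "is older than 50 years".
p_a = 0.50          # P(A), marginal probability of A
p_b = 0.40          # P(B), marginal probability of B
p_a_and_b = 0.15    # P(A and B), joint probability

# Conditional probabilities from the joint and marginal probabilities.
p_a_given_b = p_a_and_b / p_b   # P(A | B) = 0.375
p_b_given_a = p_a_and_b / p_a   # P(B | A) = 0.30

print(p_a_given_b, p_b_given_a)
```

This asymmetry between p(A|B) and p(B|A) is exactly what Bayes’ theorem, introduced below, is designed to handle.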
Bayesian statistics and probability
Bayesian and classical (hybrid approach) statistics are based on the assumption that
statistical methods allow valid inference statements when there is random variability
in the data. In both approaches, sample data are used to make inference statements
about unknown population parameters. However, Bayesian statistics challenges many
of the assumptions underlying classical statistics. One key difference between both
approaches concerns the nature of the unknown parameter. According to the frequentist approach, an unknown population parameter is a fixed, nonrandom quantity. It is
assumed that there is one true population parameter. As a consequence, no probability
statements can be made about its value. In the Bayesian view, in contrast, the true
value of a population parameter is conceived as uncertain and is therefore considered
a random variable. According to the Bayesian approach, the unknown, random
population parameter should be described by a probability distribution. Unlike the
frequentist approach, the Bayesian counterpart allows a probability statement to be
made about the value of an unknown parameter. Both approaches also differ in their
notion of probability. Frequentist procedures are based on a concept of probability
that is associated with the idea of long-run frequency (e.g., a coin toss). Frequentist
inference, which employs sampling distributions based on infinite repeated sampling, is
focused on the performance over all possible random samples. Therefore, a frequentist
probability statement does not relate to a particular random sample that was obtained.
Rather, the sampling distribution, which describes the probability distribution of the
sample statistic over all possible random samples from the population, is used to make
a confidence statement about the unknown population parameter. The name “confidence statement” is chosen because the inference probability is based on all possible
datasets that could have occurred for the fixed but unknown population parameter.
The Bayesian approach, in contrast, has a different interpretation of probability.
According to this view, a probability statement about an unknown parameter mirrors
a subjective degree of belief or experience of uncertainty. This uncertainty is captured
by a probability distribution that is defined before observing the data. In the Bayesian
terminology, this particular distribution is called the prior distribution, or simply
prior. The idea of a prior is best described as being analogous to placing a bet. The bet
comprises the amount of certainty that a bettor has about a random outcome before
knowing the outcome’s realization. The Bayesian approach provides a mathematical
rule called Bayes’ theorem describing how to change existing prior beliefs about the
value of an unknown random parameter in the light of new evidence, such as empirical
(sample) data. The data can be expressed in terms of a likelihood function, sometimes
simply called the likelihood. Using Bayes’ theorem as a formal rule to weigh the
likelihood of the actual occurred data with the beliefs held before observing the data
gives the posterior distribution. The posterior distribution allows researchers to make
probability statements concerning the unknown parameter of interest.
Bayes’ theorem
Bayes’ theorem relates three essential elements: it balances a prior state of knowledge against the data likelihood to yield a more informed posterior distribution, that is:

posterior distribution ∝ prior distribution × data likelihood,   (1)
where the symbol ∝ means “is proportional to.” More specifically, Bayesian inference
always begins with some prior probability statement about an unknown parameter, that
is f(θ), for example. Recall that, in contrast to the frequentist approach, all unknown
parameters of a Bayesian model are treated as random variables that can be described in
terms of a distribution. Unlike the frequentist approach, the Bayesian approach allows
incorporation of prior knowledge before observing the data. The prior probability statement f(θ) is nothing but a summarized expression of the current state of knowledge and the subjective degree of belief about θ before gathering or seeing any new
data. The functional form of the prior is usually chosen to facilitate the calculation of
the posterior. For example, the mean of a normally distributed prior would represent
an informed guess about the location of the unknown population mean, whereas the
variance would reflect the amount of uncertainty about that particular parameter. The
smaller the prior variance, the more certain one is that the prior mean mirrors the population mean. In Bayesian terminology, the inverse of the prior variance is called the precision. Parameters
such as the prior mean or the precision are also referred to as hyperparameters. When
a probability distribution is specified on the hyperparameters in a fully hierarchical
Bayesian model, they are referred to as hyperpriors (Kaplan, 2014). As can be seen from
Equation 1, the prior distribution is weighted with the data likelihood (which is not a
probability distribution). Let, for example, data = (x_1, … , x_n) be a sample from a density f_θ with an unknown parameter θ and with an associated likelihood function:

l(θ | data) = ∏_{i=1}^{n} f_θ(x_i).   (2)
The likelihood function summarizes the sample information about θ and identifies the value of θ that makes the observed data most likely to have occurred. The information from the likelihood function is weighted with the prior probability distribution by
employing Bayes’ theorem to calculate an updated distribution posterior to the former
state of knowledge:
f(θ | data) = [f(data | θ) × f(θ)] / f(data),   (3)
where f(θ|data) denotes the posterior distribution for the parameter θ, f(data|θ) is a sampling density for the data, f(θ) is the prior distribution for the parameter, and f(data) is the marginal probability of the data. Whereas classical inference about θ follows from inspection of the likelihood, Bayesian inference,
in contrast, relies on inspecting the posterior distribution using descriptive measures.
The shape of the posterior distribution can be described by calculating the location
parameters, such as the posterior mean, which is the expected value of θ under f(θ|data), or the posterior mode, which is the most likely value under f(θ|data), as well as by variability measures such as the posterior variance or standard deviation.
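As a small numerical illustration of Equation 3 (all numbers invented), the following sketch updates a normal prior for an unknown population mean with a normal likelihood whose variance is assumed known; in this conjugate case the posterior mean and variance follow from standard updating formulas:

```python
import numpy as np

# Invented example: prior belief about a population mean theta.
prior_mean, prior_var = 50.0, 100.0          # prior: theta ~ N(50, 10^2)

# Observed sample (likelihood), assuming a known data variance.
data = np.array([62.0, 58.0, 65.0, 61.0, 59.0])
data_var = 81.0                               # assumed known sigma^2
n, xbar = len(data), data.mean()

# Conjugate updating: precisions (inverse variances) add up.
post_precision = 1.0 / prior_var + n / data_var
post_var = 1.0 / post_precision
post_mean = post_var * (prior_mean / prior_var + n * xbar / data_var)

print(post_mean, post_var ** 0.5)  # posterior mean and standard deviation
```

Because the posterior is normal here, its mean and mode coincide, and the posterior standard deviation expresses the remaining uncertainty about θ.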
Bayes’ theorem is fundamental, for example, to the Naïve Bayes (NB) algorithm that
is commonly used in automatic text classification. The attribute naïve comes from the
assumption that the features (such as words) in a text are mutually independent, and
therefore probabilities for all features can simply be multiplied to yield a combined
probability. The NB algorithm is used to assign documents to prespecified categories.
Specifically, the probability that a document belongs in a category given its features is
computed using (a) the prior probability of the category and (b) the frequency of the
occurring features in documents previously assigned to the category. The basic idea
of this algorithm is to maximize the posterior probability for a category given some
training data to formulate a classification or, more generally, a decision rule.
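A toy sketch of such a classifier (with an invented training corpus and category labels) could look as follows; it estimates category priors and smoothed word likelihoods from labeled documents and scores a new document on the log scale for numerical stability:

```python
import math
from collections import Counter, defaultdict

# Toy training data: (document, category); labels invented for illustration.
train = [
    ("economy stock market growth", "business"),
    ("stock prices fall market", "business"),
    ("team wins final match", "sports"),
    ("player scores goal match", "sports"),
]

# Count categories and word frequencies per category.
cat_counts = Counter(cat for _, cat in train)
word_counts = defaultdict(Counter)
for doc, cat in train:
    word_counts[cat].update(doc.split())
vocab = {w for doc, _ in train for w in doc.split()}

def predict(doc):
    scores = {}
    for cat in cat_counts:
        # Log prior probability of the category.
        score = math.log(cat_counts[cat] / len(train))
        total = sum(word_counts[cat].values())
        for w in doc.split():
            # Laplace-smoothed word likelihood, assuming word independence.
            score += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
        scores[cat] = score
    return max(scores, key=scores.get)

print(predict("market growth slows"))  # expected: "business"
```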
Bayesian interpretation of p-values and credible intervals
The different notions of probability of Bayesian and classical statistics have far-reaching
ramifications for how conclusions are drawn. Scientists formulate hypotheses based
on theoretical reasoning and the question is whether or not the hypothesis is correct.
Unlike the classical approach, the Bayesian approach provides an answer to this question. Bayesian statistics is concerned with the probability of a hypothesis given the data;
that is, p(hypothesis|data). By contrast, classical statistics is interested in the probability
of obtaining data as extreme, or more extreme, than those observed if the hypothesis is
correct, that is p(data|hypothesis). Correspondingly, different notions of probability also
affect the interpretation of Bayesian credible intervals and classical confidence intervals.
Both types of intervals provide a measure of uncertainty with respect to an estimated
parameter. Assuming that a posterior density is approximately normal, derivation of a
95 percent credible interval, for example, is straightforward:
posterior mean ± 1.96 × posterior standard deviation.   (4)
The Bayesian 95 percent credible interval is expected to contain the unknown population parameter with a probability of 95 percent. Unlike the Bayesian credible interval,
which refers to the parameter space, the classical confidence interval refers to the sample space. A probability statement associated with a 95 percent confidence interval can
only be made with reference to the procedure, not the unknown population parameter
itself. Since the population parameter is a fixed, nonrandom quantity, a frequentist 95
percent confidence interval means nothing other than that the procedure of interval
construction is expected to construct intervals that include the population parameter
approximately 95 percent of the time.
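As a brief numerical sketch (using invented posterior draws), Equation 4 and the quantile-based 95 percent credible interval can both be computed directly from draws of an approximately normal posterior:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented example: draws from an approximately normal posterior for a mean.
posterior_draws = rng.normal(loc=3.2, scale=0.5, size=10_000)

# Normal-approximation credible interval (Equation 4).
m, s = posterior_draws.mean(), posterior_draws.std()
print(m - 1.96 * s, m + 1.96 * s)

# Quantile-based 95 percent credible interval from the draws themselves.
print(np.quantile(posterior_draws, [0.025, 0.975]))
```

Either interval can be read as a direct probability statement about the parameter given the data and the prior, which, as noted above, is not a valid reading of a frequentist confidence interval.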
Prior distributions
Scientific progress rests on the idea of learning; gathering scientific knowledge is a
cumulative process. Most studies, if not all, are conducted in the light of previous
research. When planning a study, it is rational to collate as much theoretical thought
as possible, and to become familiar with methodological and statistical standards,
and existing empirical findings. Bayesian statistics requires researchers to take into
account all existing knowledge when choosing a prior distribution. The notion that
statistical inference is a way of updating existing knowledge given new data, and that,
consequently, data do not speak for themselves, may seem counterintuitive at first. The
incorporation of preexisting information adds a seemingly more subjective flavor
to the scientific method than pure likelihood-based inference. Consequently, the
subjectivity of the chosen priors is the most prominent objection to the Bayesian
approach. Generally speaking, the choice of a prior is based on how much information
one believes oneself to have prior to the data collection and how certain one is
about this subjective belief. There is considerable dissent on the issue of specifying
priors. There are two general types of priors: uninformative and informative prior
distributions. This distinction is also referred to as objective and subjective priors (Press, 2003). The former are chosen in such a way that the data—that is, the
likelihood according to Bayes’ theorem—speak for themselves. Although there is
consensus that no statistical or other scientific method can be truly objective, many
scholars argue that using uninformative priors is more justifiable and transparent
to colleagues, students, or reviewers. For those, the notion of objectivity might be
more important than the statistical efficiency gained by using informative priors. The
public policy prior denotes a special case of using uninformative prior distributions
and concerns reporting results that are as general as possible; that is, minimizing the impact
of the researcher’s subjective belief on the posterior (Press, 2003). Noninformative
priors are also referred to as vague or diffuse priors. Statistically, using noninformative
priors yields about the same results as classical inference but still allows Bayesian
interpretation. The simplest way of expressing prior ignorance is often seen in
using a uniform distribution for a parameter of interest. A uniform distribution
deems every value equally probable, possibly within some bounds like [−1, 1] for
a correlation. Because this kind of prior is a constant, the posterior distribution is
computed only from the likelihood, yielding a pseudofrequentist result. As easy as
the interpretation and justification of this prior may seem, uniform priors are not
equally well suited for all kinds of parameters, mainly because they are not robust to
simple transformations (Gill, 2008). For example, a uniform prior for a variance is
different from that for a standard deviation, and the two yield different posteriors. The
construction of uninformative priors that are both robust to transformations and
lead to proper posteriors has challenged many Bayesian statisticians; the short simulation below illustrates the transformation problem for a uniform prior.
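A short simulation (purely illustrative) makes the problem concrete: a prior that is uniform on a standard deviation implies a nonuniform prior on the variance, so the two specifications encode different prior beliefs.

```python
import numpy as np

rng = np.random.default_rng(42)

# Prior 1: uniform on the standard deviation sigma in [0, 10].
sigma = rng.uniform(0, 10, size=100_000)
implied_variance = sigma ** 2          # implied prior on the variance

# Prior 2: uniform directly on the variance in [0, 100].
variance = rng.uniform(0, 100, size=100_000)

# The two priors on the variance clearly differ (medians: ~25 vs. ~50).
print(np.median(implied_variance), np.median(variance))
```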
If one accepts the idea of informative priors, the question of how to obtain them arises. The answer
is both simple and complicated: Any knowledge or belief, be it from an expert, a
preexisting study, a meta-analysis, theoretical reasoning, or just an educated guess,
can be used as long as it can be transformed into a probability distribution of some
kind. It is mathematically convenient to choose conjugate informative priors. A
prior distribution is called conjugate if the posterior is in the same distributional
family as the prior. If the prior is not conjugate, the resulting posterior distribution
typically cannot be derived in closed form. In that case, numerical simulation methods such as
Markov chain Monte Carlo (MCMC) estimation need to be used to find approximate
solutions for the posterior (Gilks, Richardson, & Spiegelhalter, 1996). Together with
the advent of fast computer hardware in the 1990s, MCMC algorithms and freely
available software programs helped to popularize Bayesian data analysis methods in
the social sciences. Generally speaking, there are three types of informative priors:
empirical priors, weakly informative priors, and subjective priors. Empirical priors
refer to previous observation or other forms of data collection, including expert
interviews. The latter are often referred to as elicited priors. Eliciting a prior is very
demanding because transforming qualitative or vague expert opinions into prior
distributions for parameters such as regression coefficients or between-group variances
is a challenging task for any applied researcher. Another kind of empirical prior comes
from the incorporation of previous single results or meta-analyses. Those results can
either be directly incorporated into strict replication studies or somewhat discounted,
depending on the similarity of the previous and the current study. While this can be
intuitively accomplished by using equivalent sample sizes to relate prior and likelihood,
a more versatile and rigorous approach to this problem involves power priors. In the
presence of historical data or data from previous similar studies with large sample
size, a power prior can be realized by raising the likelihood function based on the
prior data to a suitable power δ (0 ≤ δ ≤ 1) that downweights the historical data relative to the current data; a small sketch of this construction is given below.
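For a simple proportion, a sketch of this power prior idea (using a conjugate beta–binomial setup and invented counts) might look as follows; the historical successes and failures enter the posterior downweighted by δ:

```python
# Power prior sketch for a proportion (all numbers invented).
a0, b0 = 1.0, 1.0            # initial Beta(1, 1) prior before any data

y_hist, n_hist = 120, 400    # historical study: 120 successes in 400 trials
y_cur, n_cur = 25, 80        # current study: 25 successes in 80 trials

delta = 0.5                  # weight given to the historical likelihood

# Beta posterior: historical counts enter downweighted by delta.
a_post = a0 + y_cur + delta * y_hist
b_post = b0 + (n_cur - y_cur) + delta * (n_hist - y_hist)

post_mean = a_post / (a_post + b_post)
print(post_mean)  # posterior mean of the proportion
```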
Subjective priors express personal theories or beliefs about a
phenomenon, and as such they are highly debatable. However, it may often be desirable
to check the consequences of different priors on the posterior distribution or compare
models with optimistic or pessimistic priors (Press, 2003). While both empirical and
subjective priors are domain specific, a third kind of informative prior distribution
is based on statistical convenience and common sense about model parameters. The
rationale behind weakly informative priors is that extreme values of parameters such
as correlations or regression coefficients are highly unlikely and should therefore be
given less prior probability (Gelman, Carlin, Stern, & Rubin, 2013). This approach to
using default priors leads to efficient and stable estimation without overly affecting the
likelihood.
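A minimal illustration of this stabilizing effect (with invented data): under a weakly informative Beta(2, 2) prior, the estimate of a proportion based on a tiny sample is pulled away from the boundary, whereas the maximum likelihood estimate sits at an implausible extreme.

```python
# Estimating a proportion from 3 successes in 3 trials (invented data).
y, n = 3, 3

# Maximum likelihood estimate: unstable at the boundary.
mle = y / n                                # 1.0

# Weakly informative Beta(2, 2) prior -> Beta(2 + y, 2 + n - y) posterior.
a, b = 2 + y, 2 + (n - y)
posterior_mean = a / (a + b)               # (3 + 2) / (3 + 4) = 0.714

print(mle, posterior_mean)
```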
Bayesian data analysis
Applications of Bayesian data analysis have grown in the social sciences in recent years.
One reason for this growth, among others, is that Bayesian methods were successively
implemented in software packages that are frequently used by social scientists such as
R, Mplus, and SPSS Amos. These software packages, which were originally designed
with respect to the frequentist approach, now also allow estimation of the posterior
distribution using MCMC methods.
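The following toy random-walk Metropolis sampler (a sketch for illustration, not the implementation used by any of these packages) shows the basic MCMC logic; the target is the posterior of a mean under a non-conjugate Student-t prior and a normal likelihood with known variance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Invented data; the data standard deviation is assumed known.
data = np.array([4.1, 5.3, 4.8, 6.0, 5.1, 4.4])
sigma = 1.0

def log_posterior(theta):
    # Student-t prior (non-conjugate) plus normal log likelihood.
    log_prior = stats.t.logpdf(theta, df=3, loc=0, scale=5)
    log_lik = stats.norm.logpdf(data, loc=theta, scale=sigma).sum()
    return log_prior + log_lik

draws, theta = [], 0.0
for _ in range(20_000):
    proposal = theta + rng.normal(0, 0.5)          # random-walk proposal
    log_accept = log_posterior(proposal) - log_posterior(theta)
    if np.log(rng.uniform()) < log_accept:         # Metropolis acceptance step
        theta = proposal
    draws.append(theta)

draws = np.array(draws[5_000:])                    # discard burn-in draws
print(draws.mean(), np.quantile(draws, [0.025, 0.975]))
```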
Predefined routines in software packages now support researchers in evaluating the posterior distribution, which constitutes the result of a Bayesian data analysis. This can be done,
for example, by using point estimates of the posterior, such as the posterior mean or
variance. Another summary statistic that is commonly used is the mode of the posterior distribution. This mode is also referred to as the maximum a posteriori (MAP)
estimate. The MAP is the Bayesian analogue of the classical maximum likelihood estimator (MLE). In addition to point estimates, interval summaries can be obtained to
characterize the posterior distribution (Kaplan, 2014). As pointed out above, credible
intervals—also sometimes called posterior probability intervals—of point estimates can
be directly obtained from the quantiles of the posterior distribution. Another alternative is the so-called highest posterior density (HPD) interval, which has the property
that the density within the interval region is never lower than the density outside. It is
recommended to use the HPD interval instead of standard credible intervals when the posterior density is asymmetric or multimodal (Box & Tiao, 1973).
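Given a set of posterior draws, these summaries are straightforward to compute; the sketch below (with invented draws from a skewed posterior) contrasts the equal-tailed credible interval with the shortest interval containing 95 percent of the draws, a simple approximation to the HPD interval:

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented skewed posterior draws (e.g., for a variance-like parameter).
draws = rng.gamma(shape=2.0, scale=1.5, size=20_000)

# Point summaries of the posterior.
print(draws.mean(), np.median(draws))

# Equal-tailed 95 percent credible interval from the quantiles.
print(np.quantile(draws, [0.025, 0.975]))

# Shortest interval containing 95 percent of the draws (HPD approximation).
sorted_draws = np.sort(draws)
m = int(np.ceil(0.95 * len(sorted_draws)))
widths = sorted_draws[m - 1:] - sorted_draws[:len(sorted_draws) - m + 1]
lo = int(np.argmin(widths))
print(sorted_draws[lo], sorted_draws[lo + m - 1])
```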
As in classical statistics when using the likelihood ratio test, Bayesian data analysis
allows testing model fit and comparing competing models or hypotheses. For
example, the Bayes factor (BF) is a weighted average likelihood ratio. It is often interpreted as the relative evidence in the data, indicating the odds that the data favor one
model or hypothesis over another. Konijn, van de Schoot, Winter, and Ferguson (2015),
who illustrated the use of BF by reanalyzing data of published communication studies,
argue that the BF offers more meaningful results than frequentist null hypothesis significance testing (NHST). BF null hypothesis testing is supposed to have advantages over
classical NHST, because it allows a more intuitive interpretation, considers likelihood
under both the null and the alternative hypothesis, and can also provide evidence for
and not just against the null hypothesis. It should be noted, however, that the BF is sensitive to the choice of priors used for the parameters in each model. Therefore,
any BF should be used with caution. To overcome the sensitivity of the BF to the prior,
Gelman et al. (2013) recommend weakly informative default priors like the Cauchy
distribution when estimating parameters of common statistical models like analysis of
variance (ANOVA) or regression analysis. Kass and Raftery (1995) suggest that if BF is
1 to 3, the evidence is not worth more than a bare mention; if BF is 3 to 20, it is positive;
if BF is 20 to 150, it is strong; and if BF is more than 150, it is very strong. When the
exact BF is impossible to calculate (e.g., due to computational limitations or when it is
difficult to specify reasonable priors), the Bayesian information criterion (BIC) is said
to provide a reasonable approximation to the BF. Note that the BIC is a widely used
measure in model selection when using classical statistics such as time series analysis
or structural equation modeling (SEM). Unlike the BF, there are no rules of thumb with
respect to the size of the BIC difference between two models. A common guideline is
to favor the model with the smallest BIC. The deviance information criterion (DIC) is a
Bayesian alternative to the BIC. Based on MCMC estimation, the DIC uses the posterior
density, implying that the prior information is taken into account. Among a candidate
set of models, the one with the lowest DIC value is chosen.
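A rough sketch of the BIC route (with simulated data, and using the Gaussian linear-model BIC up to an additive constant): the BIC difference between two competing regression models can be converted into an approximate Bayes factor, since 2 ln BF ≈ ΔBIC.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data: y depends on x1 but not on x2.
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.5 * x1 + rng.normal(scale=1.0, size=n)

def bic(y, X):
    # Gaussian linear-model BIC: n*log(RSS/n) + k*log(n), constants dropped.
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(((y - X @ beta) ** 2).sum())
    k = X.shape[1] + 1            # regression coefficients plus error variance
    return len(y) * np.log(rss / len(y)) + k * np.log(len(y))

X0 = np.column_stack([np.ones(n), x1])          # model 0: intercept + x1
X1 = np.column_stack([np.ones(n), x1, x2])      # model 1: adds the noise x2

bic0, bic1 = bic(y, X0), bic(y, X1)
approx_bf_01 = np.exp((bic1 - bic0) / 2)        # approximate BF for model 0 over model 1
print(bic0, bic1, approx_bf_01)
```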
It is important to recognize that there is no unique Bayesian solution to a statistical problem. However, the Bayesian approach provides a versatile, flexible toolkit that
might help to overcome the limitations of classical statistical approaches. Generally
speaking, the Bayesian approach is particularly well suited to modeling complex data structures. For example, Bayesian multilevel modeling has been used to estimate a proportional
hazards model to investigate the number of seconds that commercials are viewed before
being zapped, while accounting for unobserved heterogeneity across both consumers
and commercials (Gustafson & Siddarth, 2007). Likewise, marketing researchers used a
Bayesian hidden Markov model to identify visual attention states to magazine advertisements from individual eye-tracking data (Liechty, Pieters, & Wedel, 2003). Moreover,
Bayesian parameter estimates have favorable efficiency and bias properties relative to
MLE in small samples. For example, when few countries are available in comparative
research designs, using a Bayesian approach yields far more stable and precise estimation results than frequentist techniques (Stegmueller, 2013).
SEE ALSO: Amos (Software); Comparative Research Methods; Computational Simulation Methods; Mplus; Probability Distributions; R (Software); Statistical Significance
(Testing)
References
Box, G. E. P., & Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading, MA:
Addison-Wesley.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2013). Bayesian data analysis (3rd ed.).
London: CRC Press.
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krüger, L. (1989). The empire of chance: How probability changed science and everyday life. New York: Cambridge University Press.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (Eds.) (1996). Markov chain Monte Carlo in
practice. London: Chapman & Hall.
Gill, J. (2008). Bayesian methods: A social and behavioral sciences approach (2nd ed.). Boca Raton,
FL: Chapman & Hall/CRC.
Gustafson, P., & Siddarth, S. (2007). Describing the dynamics of attention to TV commercials: A
hierarchical Bayes analysis of the time to zap an ad. Journal of Applied Statistics, 34(5), 585–609.
doi:10.1080/02664760701235279
Kaplan, D. (2014). Bayesian statistics for the social sciences. New York: Guilford Press.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association,
90(430), 773–795. doi:10.2307/2291091
Konijn, E. A., van de Schoot, R., Winter, S. D., & Ferguson, C. J. (2015). Possible solution to
publication bias through Bayesian statistics, including proper null hypothesis testing. Communication Methods and Measures, 9(4), 280–302. doi:10.1080/19312458.2015.1096332
Liechty, J., Pieters, R., & Wedel, M. (2003). Global and local covert visual attention: Evidence from a Bayesian hidden Markov model. Psychometrika, 68(4), 519–541. doi:10.1007/
bf02295608
Press, S. J. (2003). Subjective and objective Bayesian statistics: Principles, models, and applications
(2nd ed.). New York: John Wiley & Sons.
Stegmueller, D. (2013). How many countries for multilevel modeling? A comparison of frequentist and Bayesian approaches. American Journal of Political Science, 57(3), 748–761.
doi:10.1111/ajps.12001
Further reading
Kruschke, J. K. (2015). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan (2nd ed.).
Amsterdam: Academic Press.
McGrayne, S. B. (2011). The theory that would not die: How Bayes’ rule cracked the enigma code,
hunted down Russian submarines, and emerged triumphant from two centuries of controversy.
New Haven, CT: Yale University Press.
Jens Vogelgesang (PhD, Free University of Berlin) is a professor of communication at
the University of Hohenheim, Germany. His main interests concern audience research,
media effects, and methodology.
Michael Scharkow (PhD, University of the Arts Berlin) is a professor of communication at Zeppelin University, Germany. His research interests include empirical research
methods, online communication, and media use.