Bayesian-Statistics Final 20140416
Bayesian Statistics
Meng-Yun Lin
mylin@bu.edu

This paper was published in fulfillment of the requirements for PM931 Directed Study in Health Policy and Management under Professor Cindy Christiansen's ([email protected]) direction. Michal Horny, Jake Morgan, Marina Soley Bori, and Kyung Min Lee provided helpful reviews and comments.
Table of Contents
Executive Summary
1.
2. Basic Concepts
3.
3.1.
3.2.
3.3.
3.4. Software
4. Bayesian Modeling
4.1.
4.2.
4.3. Bayes Estimation
4.4.
4.5.
5.
6. Useful Resources
Executive Summary
This report is a brief introduction to Bayesian statistics. The first section describes the basic concepts of the Bayesian approach and how they are applied to statistical estimation and hypothesis testing. The next section presents statistical modeling using the Bayesian approach: it first explains the main components of a Bayes model, including the prior, the likelihood function, and the posterior, and then introduces informative and non-informative Bayes models. The last section provides an example of fitting a Bayesian logistic regression in SAS, illustrating how to program a Bayes model and how to check model convergence.
Keywords: Bayesian, Prior, Posterior, Informative Bayes Model, Non-informative Bayes Model.
[Table: comparison of the Bayesian approach with the classical approach. Under the Bayesian view, the parameters of a model are random quantities with probability distributions, so one can make probability statements about the parameters; the main outcome of an analysis is the posterior distribution, from which estimates and inferences are read. An interval estimate carries a measurable probability: one can make a direct probability statement about parameters, e.g., a 95% credible interval (a, b) implies the parameter lies in the interval with probability 0.95. Classical probability, by contrast, is defined through the long-run behavior of repeated experiments.]
2. Basic Concepts
Bayesian probability is the foundation of Bayesian statistics. It interprets probability as an abstract concept, a quantity that one assigns theoretically (by specifying some prior probabilities) to represent a state of knowledge; it is then updated in the light of new and relevant data. In Bayesian statistics, a probability between 0 and 1 can be assigned to a hypothesis whose truth value is uncertain. Broadly speaking, there are two views on Bayesian probability that interpret the concept in different ways. For objectivists, probability objectively measures the plausibility of propositions: the probability of a proposition corresponds to a reasonable belief. For subjectivists, probability corresponds to a personal belief. Rationality and coherence constrain the probabilities one may hold but allow for substantial variation within those constraints. The objective and subjective variants of Bayesian probability differ mainly in their interpretation and construction of the prior probability.¹
Based on Bayesian probability, Bayes' theorem links the degree of belief in a proposition before and after accounting for evidence by giving the relationship between the marginal probabilities of H and D, p(H) and p(D), and the conditional probabilities of H given D and D given H, p(H|D) and p(D|H). The formula is p(H|D) = p(D|H)p(H)/p(D), where H and D stand for hypothesis and data (evidence), respectively; p(H) is the initial degree of belief in H (the prior); p(H|D) is the degree of belief in H having accounted for D (the posterior); and p(D|H)/p(D) represents the support D provides for H.²
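As a numeric illustration of the formula (a hypothetical screening-test example in plain Python; the probabilities are invented, not taken from this report):

```python
# Bayes' theorem: p(H|D) = p(D|H) * p(H) / p(D)
# Hypothetical illustration: rare condition, fairly accurate test.
p_H = 0.01             # prior p(H)
p_D_given_H = 0.95     # likelihood p(D|H)
p_D_given_notH = 0.05  # p(D | not H)

# Marginal p(D) by the law of total probability
p_D = p_D_given_H * p_H + p_D_given_notH * (1 - p_H)

posterior = p_D_given_H * p_H / p_D
print(round(posterior, 3))  # the data raise belief in H from 0.01 to about 0.16
```

Note how the support term p(D|H)/p(D) = 0.95/0.059 multiplies the prior, exactly as in the formula above.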
Further, Bayes' rule relates the odds of event A1 to event A2 before and after conditioning on event B. The relationship is expressed in terms of the Bayes factor, Λ(A1:A2|B), which represents the impact of the conditioning on the odds. On the grounds of Bayesian probability, Bayes' rule relates the odds on probability models A1 and A2 before and after evidence B is observed; in this case Λ represents the impact of the evidence on the odds. For a single piece of evidence, the relation is given by O(A1:A2|B) = O(A1:A2) · Λ(A1:A2|B), where O(A1:A2) is called the prior odds and O(A1:A2|B) the posterior odds. In brief, Bayes' rule is preferred to Bayes' theorem when the relative probability (the odds) of two events matters rather than the individual probabilities.
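The odds form can be checked the same way (hypothetical numbers for two competing models):

```python
# Bayes' rule in odds form: posterior odds = prior odds * Bayes factor.
prior_odds = 0.5 / 0.5    # O(A1:A2) = 1, models equally likely a priori
bayes_factor = 0.8 / 0.2  # Lambda = p(B|A1) / p(B|A2) = 4

posterior_odds = prior_odds * bayes_factor
posterior_prob_A1 = posterior_odds / (1 + posterior_odds)  # odds back to probability
print(posterior_odds, posterior_prob_A1)  # 4.0 0.8
```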
Page 5 of 21
1. Lee, Peter M. Bayesian Statistics: An Introduction. John Wiley & Sons, 2012.
2. Goodman, Steven N. "Toward Evidence-Based Medical Statistics. 2: The Bayes Factor." Annals of Internal Medicine 130, no. 12 (June 15, 1999): 1005-1013.
p(H): the prior probability, the probability of H before E is observed, indicating one's preconceived beliefs about how likely different hypotheses are.
p(H|E): the posterior probability, the probability of H given E (after E is observed).
p(E|H): the likelihood, indicating the compatibility of the evidence with the given hypothesis.
p(E): the marginal likelihood, the total probability of E, which is the same for all hypotheses being considered.
The ratio p(E|H)/p(E) determines how the evidence updates the current state of belief:
p(E|H)/p(E) > 1: given the hypothesis is true, the evidence would be more likely than is predicted by the current state of belief, so observing E raises the probability of H.
p(E|H)/p(E) = 1: the evidence is independent of the hypothesis; it would be exactly as likely as predicted by the current state of belief, and the probability of H is unchanged.
p(E|H)/p(E) < 1: the evidence is less likely than predicted by the current state of belief, so observing E lowers the probability of H.
A non-Bayesian prediction would plug a single point estimate of the parameter into the sampling distribution. This approach has the disadvantage that it does not account for any uncertainty in the value of the parameter and therefore underestimates the variance of the predictive distribution. The Bayesian approach instead uses the posterior predictive distribution to predict the distribution of a new, unobserved data point: instead of a fixed point, a distribution over possible points is returned.
The posterior predictive distribution is the distribution of a new data point x~, marginalized over the posterior:
p(x~|X, α) = ∫ p(x~|θ) p(θ|X, α) dθ
The prior predictive distribution is the distribution of a new data point, marginalized over the prior:
p(x~|α) = ∫ p(x~|θ) p(θ|α) dθ
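A minimal Monte Carlo sketch of the posterior predictive integral, assuming a hypothetical beta(3, 7) posterior for a Bernoulli success probability (not data from this report): draw θ from the posterior, then draw x~ given θ.

```python
import random

random.seed(0)

# Posterior predictive by Monte Carlo: theta ~ p(theta|X), then x~ ~ p(x~|theta).
# Hypothetical beta(3, 7) posterior, e.g. 2 successes in 8 trials under beta(1, 1).
alpha, beta = 3, 7

draws = []
for _ in range(20000):
    theta = random.betavariate(alpha, beta)      # draw from the posterior
    x_new = 1 if random.random() < theta else 0  # draw a new observation given theta
    draws.append(x_new)

# Marginally, P(x~ = 1 | X) = E[theta | X] = alpha / (alpha + beta) = 0.3
print(sum(draws) / len(draws))
```

Averaging over θ draws is exactly the integral above, which is why the predictive probability matches the posterior mean rather than any single plug-in value.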
By parameterizing the space of models, the belief in all models can be updated in a single step. The distribution of belief over the model space can then be considered as a distribution of belief over the parameter space. Suppose the vector θ spans the parameter space and the initial prior distribution over θ is p(θ|α), where α is a set of hyperparameters of the prior. Given a piece of evidence e1 ~ p(e|θ), the posterior distribution over θ is given by:
p(θ|e1, α) = p(e1|θ, α) p(θ|α) / p(e1|α), where p(e1|α) = ∫ p(e1|θ, α) p(θ|α) dθ.
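This update can be sketched on a discrete grid standing in for the parameter space (a hypothetical coin-flip example, not from this report):

```python
# Grid update: p(theta | e1, alpha) is proportional to p(e1 | theta) * p(theta | alpha).
# Hypothetical: theta is a coin's heads probability, uniform prior, evidence e1 = heads.
thetas = [i / 10 for i in range(11)]     # grid over the parameter space
prior = [1 / len(thetas)] * len(thetas)  # p(theta | alpha), uniform

likelihood = [t for t in thetas]         # p(e1 = heads | theta) = theta
unnorm = [l * p for l, p in zip(likelihood, prior)]
evidence = sum(unnorm)                   # p(e1 | alpha), the normalizing constant
posterior = [u / evidence for u in unnorm]

# The posterior still sums to 1 and now favors larger theta.
print(round(sum(posterior), 6), posterior[10] > posterior[1])
```

All grid points (all "models") were updated in the single normalization step, which is the point of the paragraph above.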
3.4. Software
1) SAS: the BAYES statement (which specifies priors and estimates the parameters by a Markov chain Monte Carlo sampling approach) is available in
a. PROC GENMOD
b. PROC LIFEREG
c. PROC PHREG
4. Bayesian Modeling
The principle of a Bayes model is to compute posteriors based on specified priors and the likelihood function of the data. It requires researchers to specify priors appropriately, since inappropriate priors may lead to biased estimates or make computation of the posteriors difficult. In this section, we will briefly go through the most common priors, how posteriors are calculated from priors, and how the selection of priors influences posterior computation.
Among priors matching the distributional attributes of the likelihood, a conjugate prior is usually the best choice because it provides a practical way to obtain the posterior distribution. From Bayes' theorem, the posterior distribution is equal to the product of the likelihood function, p(x|θ), and the prior, p(θ), normalized (divided) by the probability of the data, p(x). The use of a conjugate prior ensures that the posterior has the same algebraic form as the prior (generally with different parameter values). Further, conjugate priors may give intuition by showing more transparently how a likelihood function updates a distribution. The following examples show how a conjugate prior is updated by the corresponding data so that the posterior keeps the same distributional form.
Example:⁴
1) If the likelihood is Poisson distributed, y ~ poisson(λ), a conjugate prior on λ is the Gamma distribution, λ ~ gamma(v, β), with density proportional to λ^(v-1) exp(-βλ). The posterior is
p(λ|Y) ∝ p(Y|λ) p(λ) ∝ λ^y exp(-nλ) · λ^(v-1) exp(-βλ) = λ^(v+y-1) exp(-(β+n)λ),
which is the kernel of a gamma(v+y, β+n) distribution:

          Prior: λ ~ gamma(v, β)    Data: Y|λ ~ poisson(λ)    Posterior: λ|Y ~ gamma(v+y, β+n)
Shape     v                                                   v+y
Rate      β                                                   β+n
E(λ)      v/β                                                 (v+y)/(β+n)
Var(λ)    v/β²                                                (v+y)/(β+n)²
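The table's update rule reduces to simple arithmetic; a quick check with hypothetical values v = 2, β = 1, n = 4, and total count y = 10 (not data from this report):

```python
# Gamma-Poisson conjugate update from the table above, as plain arithmetic.
v, beta_rate = 2, 1  # prior: lambda ~ gamma(shape v, rate beta)
n, y = 4, 10         # hypothetical data: n observations, total count y

shape_post = v + y         # posterior gamma shape: v + y
rate_post = beta_rate + n  # posterior gamma rate:  beta + n

post_mean = shape_post / rate_post      # (v + y) / (beta + n)
post_var = shape_post / rate_post**2    # (v + y) / (beta + n)^2
print(shape_post, rate_post, post_mean)  # 12 5 2.4
```

The posterior mean 2.4 sits between the prior mean v/β = 2 and the data rate y/n = 2.5, as a conjugate update should.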
2) If the likelihood is binomial, Y|θ ~ binomial(n, θ), a conjugate prior on θ is the Beta distribution, θ ~ beta(α, β), with density proportional to θ^(α-1)(1-θ)^(β-1). The posterior is
p(θ|Y) ∝ p(Y|θ) p(θ) ∝ θ^y (1-θ)^(n-y) · θ^(α-1) (1-θ)^(β-1) = θ^(α+y-1) (1-θ)^(β+n-y-1),
which is the kernel of a beta(α+y, β+n-y) distribution:

          Prior: θ ~ beta(α, β)          Data: Y|θ ~ binomial(n, θ)    Posterior: θ|Y ~ beta(α+y, β+n-y)
Shape     α, β                                                         α+y, β+n-y
E(θ)      α/(α+β)                                                      (α+y)/(α+β+n)
Var(θ)    αβ/[(α+β)²(α+β+1)]                                           (α+y)(β+n-y)/[(α+β+n)²(α+β+n+1)]

⁴ Referred to Dr. Gheorghe Doros' lecture on "Bayesian Approach to Statistics: Discrete Case."
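Again with hypothetical values (θ ~ beta(2, 2) prior, y = 7 successes of n = 10 trials):

```python
# Beta-binomial conjugate update from the table above.
a, b = 2, 2   # prior: theta ~ beta(alpha, beta)
n, y = 10, 7  # hypothetical data: y successes in n trials

a_post, b_post = a + y, b + n - y  # posterior: beta(alpha + y, beta + n - y)

post_mean = a_post / (a_post + b_post)  # (alpha + y) / (alpha + beta + n)
post_var = (a_post * b_post) / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
print(a_post, b_post, round(post_mean, 3))  # 9 5 0.643
```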
Every distribution in the natural exponential family (NEF) has conjugate priors.⁵ This family includes the normal, Poisson, gamma, binomial, and negative binomial distributions, and the NEFs generated by the generalized hyperbolic secant (GHS). Morris and Lock (2009) graphically illustrate the statistical properties of NEFs and their connections with the corresponding conjugate distributions.⁶ Another study (Morris, 1983) provides a comprehensive overview of how each member of the NEFs is related to its prior and posterior.⁷ Additionally, Wikipedia provides comprehensive tables summarizing various conjugate prior/likelihood combinations along with the corresponding model, prior, and posterior parameters in a less mathematical manner.⁸
5. Gelman, Andrew, and John Carlin. Bayesian Data Analysis. 2nd ed. CRC Press, 2003.
6. Morris, Carl N., and Kari F. Lock. "Unifying the Named Natural Exponential Families and Their Relatives." The American Statistician 63, no. 3 (August 2009): 247-253. doi:10.1198/tast.2009.08145.
7. Morris, Carl N. "Natural Exponential Families with Quadratic Variance Functions: Statistical Theory." Institute of Mathematical Statistics 11, no. 2 (n.d.): 515-529.
8. The Wikipedia page on Conjugate Prior at http://en.wikipedia.org/wiki/Conjugate_prior
In general, f_{X|Y=y}(x) gives the posterior probability density function for a random variable X given the data Y = y, where f_X(x) is the prior density of X, L_{X|Y=y}(x) = f_{Y|X=x}(y) is the likelihood function as a function of x, and ∫ f_X(x) L_{X|Y=y}(x) dx is the normalizing constant. In brief, the formula implies posterior probability ∝ prior probability × likelihood. However, Bayesian modeling is still controversial, and computation of Bayesian statistics can be very intractable; a potential solution is Markov chain Monte Carlo.
4.3. Bayes Estimation
Generally, posterior calculations with a normal likelihood for the mean are simple as long as the prior is chosen from the normal family. Although many data are not themselves continuous (binomial and time-to-event data, for example), the estimates of the coefficients in generalized linear models (GLMs) follow a normal distribution; with sufficient data, all the usual GLM estimates are approximated well by a normal distribution.
In contrast to classical methods, which report the maximum likelihood estimate of a parameter, Bayesian approaches are primarily based on the posterior distribution. All relevant information about the parameter (given the data and prior experience) is summarized in the posterior distribution. There are various ways to summarize the distribution, including the mean, median, and mode of the parameter's posterior distribution; most often, point estimation is conducted by reporting the posterior mean. The following section introduces how the Bayesian approach conducts interval estimation and hypothesis testing.
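A short sketch of summarizing a posterior from simulated draws, in plain Python rather than SAS (the right-skewed lognormal "posterior" is invented for illustration):

```python
import random

random.seed(1)

# Summarize a posterior by its mean and median, as described above.
# Hypothetical right-skewed posterior, simulated with lognormal draws.
samples = sorted(random.lognormvariate(0, 0.5) for _ in range(10001))

post_mean = sum(samples) / len(samples)
post_median = samples[len(samples) // 2]  # middle of the sorted draws

# For a skewed posterior the mean and median summaries differ.
print(round(post_mean, 2), round(post_median, 2))
```

Which summary to report depends on the loss one has in mind; the posterior mean is the usual default, but for skewed posteriors the median can be the more representative point estimate.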
The Bayesian analogue of a confidence interval is a credible set, an interval of the posterior with a stated probability content. For example,
1) One-sided lower 95% CI for θ: p(θ < U|data) = 0.95, or p(θ ≥ U|data) = 0.05
2) One-sided upper 95% CI for θ: p(θ > L|data) = 0.95, or p(θ ≤ L|data) = 0.05
3) Two-sided 95% CI for θ: p(L < θ < U|data) = 0.95, or p(θ ≥ U or θ ≤ L|data) = 0.05
One can construct credible sets that have equal tails. Equal-tail-area intervals divide the probability of the complement into two equal areas: p(θ ≤ L|data) = p(θ ≥ U|data) = 0.025. Another frequently used Bayesian credible set is the Highest Posterior Density (HPD) interval. An HPD interval is a region that contains the parameter values with the highest posterior density; HPD ensures that the posterior density at the two ends of the interval is the same. For symmetric posteriors, the HPD interval equals the equal-tail-area interval; for skewed posteriors, the HPD interval is shorter than the equal-tail-area interval, though it is more difficult to construct.⁹ Figure 1 graphically illustrates the difference between equal-tail-area and HPD intervals.
⁹ Referred to Dr. Gheorghe Doros' lecture on "Estimation and Hypothesis Testing in Clinical Trials: Bayesian Approach."
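Both interval types can be computed directly from posterior draws; a sketch assuming a hypothetical skewed (lognormal) posterior, not SAS output:

```python
import random

random.seed(2)

# Equal-tail vs. HPD 95% intervals from posterior draws.
draws = sorted(random.lognormvariate(0, 0.5) for _ in range(20000))
n = len(draws)

# Equal-tail-area interval: cut 2.5% of the draws from each tail.
eq_tail = (draws[int(0.025 * n)], draws[int(0.975 * n)])

# HPD interval: the shortest window containing 95% of the draws.
k = int(0.95 * n)
width, i = min((draws[j + k] - draws[j], j) for j in range(n - k))
hpd = (draws[i], draws[i + k])

# For a skewed posterior the HPD interval is the shorter of the two.
print(hpd[1] - hpd[0] < eq_tail[1] - eq_tail[0])
```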
indicates control in fact is better than intervention. In sum, conclusions about a hypothesis test are reached by relating the information on θ, including the prior, data, and posterior, to three intervals: the inferiority interval (-∞, δI), the equivalency interval (δI, δS), and the superiority interval (δS, ∞).
Jeffreys prior: a prior distribution on the parameter space that is proportional to the square root of the determinant of the Fisher information, p(θ) ∝ √det I(θ). It expresses the same belief no matter which parameterization is used.
Several issues need to be considered when adopting non-informative priors: (1) in parameter estimation problems, the use of a non-informative prior typically yields results not too different from a conventional statistical analysis, as the likelihood function often carries more information than the non-informative prior; (2) non-informative priors are frequently improper, that is, the area under the prior density is not unity (the sum or integral of the prior values does not equal 1). In most cases improper priors, for instance beta(0, 0), the uniform distribution on an infinite interval, or the logarithmic prior on the positive reals, can be used in Bayesian analyses without major problems. The main thing to watch out for is that an improper prior must still yield a proper posterior for the inference to be valid.
Example:
Fixed effects model: y_i = α_j[i] + x_i1·β1 + x_i2·β2 + ε_i controls for the time-invariant variables by letting the intercept vary by group (j). It takes into account ALL group-level variables but cannot estimate the effect of an individual group-level covariate.
Random effects model: y_i = α_j[i] + x_i1·β1 + x_i2·β2 + ε_i with α_j ~ N(μ_α, σ_α²) assumes the intercepts are drawn from a normal distribution. One can incorporate group-level covariates by modeling the intercept distribution, so that the joint posterior is proportional to the product of the likelihood and the priors: p(y | α, β, σ)·p(α | μ_α, σ_α)·p(μ_α)·p(σ). Last, solve for the joint posterior using Gibbs sampling or Metropolis-Hastings.
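A minimal Metropolis-Hastings sketch in plain Python (a single-parameter toy posterior, not the full random-effects model above; the data values are invented):

```python
import math
import random

random.seed(3)

# Metropolis-Hastings for one parameter: sample mu from p(mu | y) with a
# normal likelihood, flat prior, and known sigma = 1. Hypothetical data:
y = [1.2, 0.8, 1.5, 1.1, 0.9]

def log_post(mu):
    # log posterior up to an additive constant (flat prior drops out)
    return -0.5 * sum((yi - mu) ** 2 for yi in y)

mu, chain = 0.0, []
for _ in range(20000):
    prop = mu + random.gauss(0, 0.5)  # symmetric random-walk proposal
    if math.log(random.random()) < log_post(prop) - log_post(mu):
        mu = prop                     # accept; otherwise keep the current mu
    chain.append(mu)

burned = chain[1000:]                 # discard burn-in iterations
print(sum(burned) / len(burned))      # posterior mean ~ ybar = 1.1
```

The same accept/reject loop, run coordinate by coordinate with full conditional draws instead of proposals, is what a Gibbs sampler does for the multi-parameter model.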
Ma et al. (2009) compared Bayesian and classical statistical analysis methods in the analysis of binary outcomes from a cluster randomized trial.¹⁰ The individual-level analyses included (1) standard logistic regression, (2) the robust standard errors approach, (3) generalized estimating equations, (4) a random-effects meta-analytic approach, (5) random-effects logistic regression, and (6) Bayesian random-effects regression. They found Bayesian random-effects logistic regression yielded the widest 95% interval estimate for the odds ratio and led to the most conservative conclusion, though the results remained robust under all methods. The individual-level standard logistic regression is the least appropriate method, as it ignores the correlation of the outcomes for individuals within the same cluster.
10. Ma, Jinhui, Lehana Thabane, Janusz Kaczorowski, Larry Chambers, Lisa Dolovich, Tina Karwalajtys, and Cheryl Levitt. "Comparison of Bayesian and Classical Methods in the Analysis of Cluster Randomized Controlled Trials with a Binary Outcome: The Community Hypertension Assessment Trial (CHAT)." BMC Medical Research Methodology 9, no. 1 (June 16, 2009): 37.
An informative prior expresses specific information about a variable; for example, a reasonable prior for tomorrow's noontime temperature is a normal distribution with expected value equal to today's noontime temperature and variance equal to the day-to-day variance of atmospheric temperature.
Figure 3: Effect of informative prior on posterior [left: weak prior; right: strong prior]
(Source: Dr. Gheorghe Doros' lecture on Prior Elicitation, page 2)
Initiation and completion rates remain low among teen girls for human papillomavirus (HPV) vaccination.¹³ However, it is not clear whether the low rate of immunization is associated with teens' demographic characteristics and provider attributes. This example uses data from the National Immunization Survey-Teen (year 2009) from the CDC to explore which demographic and provider characteristics predict HPV vaccination rates among teen girls aged 13-17 in NY. The outcome of interest is the probability of receiving at least one dose of HPV vaccine. The example builds a logistic model that regresses the probability of initiating HPV vaccination on a set of covariates including teen's race and age, mom's age and education, provider recommendation of HPV, the Vaccine for Children (VFC) program,¹⁴ immunization registry, and facility type.
Let Yi denote the response variable (subscript i represents an individual teen girl), p_utdhpv, flagging whether the teen received at least one HPV vaccine dose; X denotes a vector representing the set of covariates described above. Applying a generalized linear model (GLM), we can fit the data points Yi with a binary distribution, Yi ~ binary(Pi), where Pi is the probability of Yi equal to 1, and link it to the regression covariates X through a logit transformation. The Bayesian model is given by Pr(β | logit(Pi), X) ∝ Pr(logit(Pi) | X, β)·Pr(β), where logit(Pi) is the likelihood function of the data. The main advantage of GLM is that it allows for using any distribution of the exponential family. In this example, the GLM assumes logit(Pi) is normally distributed with a mean of Xβ, where β is a vector representing the regression coefficients.
PROC GENMOD offers convenient access to Bayesian analysis for GLMs. We can specify a model essentially the same way as we do from a frequentist approach, but add a BAYES statement to request Bayesian estimation. Sample code for fitting a Bayesian logistic regression is provided below:

proc genmod data=ads desc;
class white / param=glm order=internal desc;
model p_utdhpv = white age momage / dist=bin link=logit;
bayes seed=1 coeffprior=normal nbi=1000 nmc=20000 outpost=posterior;
run;

13. "HPV Vaccination Initiation, Completion Rates Remain Low among Teen Girls." Infectious Diseases in Children. Accessed April 22, 2013. http://www.healio.com/pediatrics/news/print/infectious-diseases-in-children/%7Bba020c72-98de-4c1c-928e89902eda921f%7D/hpv-vaccination-initiation-completion-rates-remain-low-among-teen-girls.
14. The VFC program offers vaccines at no cost for eligible children through VFC-enrolled providers.
We first specify the GLM in the MODEL statement as usual. In the BAYES statement that follows, the SEED option specifies an integer seed for the random number generator in the simulation, enabling researchers to reproduce identical Markov chains for the same specification. COEFFPRIOR=NORMAL specifies a non-informative independent normal prior distribution with zero mean and a variance of 1E6 for each parameter. The NBI=1000 option specifies the number of burn-in iterations before the chain is saved; burn-in refers to the practice of discarding an initial portion of a Markov chain sample so that the effect of initial values on the posterior inference is minimized. The NMC=20000 option specifies the number of iterations after burn-in. The OUTPOST option names a SAS dataset for the posterior samples for further analysis.
Maximum likelihood estimates of the model parameters are computed by default (Output 1, right panel). The GLM shows white race is negatively related to the HPV initiation rate, while teen's age and provider recommendation are associated with an increased rate of HPV vaccination. Summary statistics for the posterior sample are displayed in the left panel of Output 1. Since non-informative prior distributions were used for the regression coefficients, the means, standard deviations, and intervals of the posterior distributions for the model parameters are close to the maximum likelihood estimates and standard errors.
Output 1: Posterior Descriptive and Interval Statistics of Regression Coefficients (L: Bayesian; R: GLM)
Simulation-based Bayesian inference requires using simulated draws to summarize the posterior distribution or calculate any relevant quantities of interest, so researchers should treat the simulation draws with care. SAS performs various convergence diagnostics to help researchers determine whether the Markov chain has successfully converged, that is, reached its stationary (the desired posterior) distribution. One can assess Markov chain convergence by visually checking a number of diagnostic graphs automatically produced by SAS, including trace, autocorrelation, and kernel density plots.
The trace plot (Output 2, upper panel) shows the mean of the Markov chain has stabilized and appears constant over the graph. Also, the chain has good mixing and is dense: it traverses the posterior space rapidly. The autocorrelation plot (Output 2, bottom left panel) indicates no high degree of autocorrelation
for each of the posterior samples, implying good mixing. The kernel density plot (Output 2, bottom right panel) estimates the posterior marginal distribution of the parameter. In sum, these plots suggest the Markov chain has successfully converged to the desired posterior. Though this example only displays diagnostic graphics for the covariate of interest (white race), it is essential to visually examine the convergence of ALL parameters, not just those of interest: one cannot obtain valid posterior inference for parameters that appear to have good mixing if the other parameters have bad mixing.
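The quantity behind the autocorrelation plot can be computed directly; an illustrative (non-SAS) sketch comparing an independent chain with a slowly mixing AR(1) chain, both simulated here:

```python
import random

random.seed(4)

# Lag-1 autocorrelation, the quantity summarized by an autocorrelation plot.
# Low values indicate good mixing; values near 1 indicate a sticky chain.
def lag1_autocorr(x):
    n = len(x)
    m = sum(x) / n
    var = sum((v - m) ** 2 for v in x)
    cov = sum((x[i] - m) * (x[i + 1] - m) for i in range(n - 1))
    return cov / var

iid = [random.gauss(0, 1) for _ in range(5000)]  # independent draws, good mixing
sticky = [0.0]
for _ in range(4999):
    sticky.append(0.95 * sticky[-1] + random.gauss(0, 1))  # AR(1), slow mixing

print(round(lag1_autocorr(iid), 2), round(lag1_autocorr(sticky), 2))
```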
Output 2: Diagnostic Plots for the Coefficient of White Race
Though fitting a Bayesian GLM is relatively easy using PROC GENMOD, the BAYES statement provides limited Bayesian capability: it can only request that the regression coefficients be estimated by a Bayesian method. When fitting a Bayesian logistic model, directly presenting the estimated coefficients is usually not clinically meaningful, so researchers ordinarily apply an exponential transformation to convey odds ratios. For computing such more intuitive estimates, PROC MCMC is much more flexible than the BAYES statement in PROC GENMOD: the procedure allows users to simulate the point estimates and intervals of the odds ratios themselves, not just the coefficients. It can also be used to estimate the probability that the coefficients or odds ratios exceed certain critical values. The following code illustrates how to fit the Bayesian logistic model and calculate the estimates of odds ratios using PROC MCMC.
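What the odds-ratio transformation amounts to can be sketched outside SAS: transform posterior coefficient draws and summarize them (the normal(-0.7, 0.15) "posterior" for beta1 below is invented for illustration, not this report's output):

```python
import math
import random

random.seed(5)

# Transform posterior coefficient draws to odds-ratio draws, then summarize,
# which is what monitoring or = exp(beta1) does in miniature.
beta1_draws = [random.gauss(-0.7, 0.15) for _ in range(20000)]
or_draws = sorted(math.exp(b) for b in beta1_draws)

n = len(or_draws)
or_mean = sum(or_draws) / n
ci = (or_draws[int(0.025 * n)], or_draws[int(0.975 * n)])  # equal-tail 95% interval
p_below_1 = sum(o < 1 for o in or_draws) / n               # P(OR < 1 | data)

print(round(or_mean, 2), round(p_below_1, 3))
```

Probabilities such as P(OR < 1 | data) are exactly the "exceed a critical value" summaries mentioned above, and they come for free once the draws are in hand.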
proc mcmc data=ads seed=14 nbi=1000 nmc=20000 outpost=posterior monitor=(or);
beginnodata;
or = exp(beta1); /* odds ratio for white vs. minorities */
endnodata;
parms beta: 0; /* beta0-beta2, all initial values 0 */
prior beta: ~ normal(0, var=1e6);
pi = logistic(beta0 + beta1*white + beta2*age); /* linear predictor on the logit scale */
model p_utdhpv ~ binary(pi);
run;
The PROC MCMC statement invokes the procedure, and the options specified are similar to those in the BAYES statement of PROC GENMOD. The MONITOR option outputs analyses for the symbols of interest, which are specified in the BEGINNODATA/ENDNODATA block; in this case, we are interested in the odds ratio of initiating the HPV vaccine among the white compared to minorities. The PARMS statement identifies the parameters in the model, the regression coefficients beta0, beta1, and beta2, and assigns their initial values as zeros. The PRIOR statement specifies prior distributions for the parameters; again, we apply a normal prior with mean 0 and variance 1E6 for each parameter. The following Pi assignment statement calculates the logit of the expected probability of initiating HPV vaccination as a linear function of the set of covariates. The MODEL statement specifies the likelihood function, indicating that the response variable, p_utdhpv, follows a binary distribution with parameter Pi.
Output 3 reports the estimated odds ratio of initiating the HPV vaccine among the white compared to minorities. Again, since we adopted non-informative priors, the mean, standard deviation, and interval of the posterior distribution for the odds ratio are close to the maximum likelihood estimate and standard error. It reveals that the odds of receiving at least one dose of HPV vaccine are about 50% lower among white teen girls compared to their minority counterparts, controlling for teens' and moms' demographic characteristics and the attributes of health care providers and facilities.
Output 3: Summary Statistics for the Odds Ratio of Initiating HPV Vaccine among the White
(Upper Panel: Bayesian model; Bottom Panel: GLM model)
Output 4 displays the diagnostic plots for the estimated odds ratio of initiating the HPV vaccine among the white compared to minorities. Here, the diagnostic graphs suggest the simulation draws have reasonably converged, so we can be more confident about the accuracy of the posterior inference. Additionally, the kernel density plot (bottom right panel) confirms the mean of the odds ratio is around 0.5, which is consistent with the summary statistics in Output 3.
Output 4: Diagnostic Plots for the Odds of Initiating HPV Vaccine among the White
Last, this report compares the odds ratios estimated by the logistic, GLM, and Bayes models; the Bayes model is fitted using both the GENMOD and MCMC procedures. Figure 4 shows the estimated ORs are around 0.5 and consistent across the different approaches. The intervals of the odds ratios for both the logistic and GLM models appear less symmetric, probably because the exponentiated regression coefficient (the OR) is no longer normally distributed even though the original coefficient is assumed to follow a normal distribution. In summary, the results of the Bayes models with non-informative priors are very similar to those of the traditional logistic and GLM models: both the point estimates and the intervals do not seem to vary much.
Figure 4. Model Comparisons: OR (95% CI) of Initiating HPV among White vs. Minorities [models compared: Traditional Logistic; Bayesian Logistic (proc genmod); GLM; Bayesian Logistic (proc mcmc)]
6. Useful Resources
Tutorial Introductions to Bayesian Statistics
1)
2)
3)
Lee, Peter M. Bayesian Statistics: An Introduction. John Wiley & Sons, 2012.
4)
Dixon-Woods, Mary, Shona Agarwal, David Jones, Bridget Young, and Alex Sutton.
Bernardo, Jose M. Reference Posterior Distributions for Bayesian Inference. Journal of the Royal