15.097: Probabilistic Modeling and Bayesian Analysis
We will assume that the data were generated from a probability distribution that is described by some parameters θ (not necessarily scalar). We treat θ as a random variable. We will use the shorthand notation p(y|θ) to represent the family of conditional density functions over y, parameterized by the random variable θ. We call this family p(y|θ) a likelihood function or likelihood model for the data y, as it tells us how likely the data y are given the model specified by any value of θ. We also specify a prior distribution p(θ) over the parameters, which encodes our beliefs about θ before observing them.
Our end goal is the conditional density function over θ, given the observed
data, which we denote as p(θ|y). We call this the posterior distribution, and
it informs us which parameters are likely given the observed data.
We, the modeler, specify the likelihood function (as a function of y and θ)
and the prior (we completely specify this) using our knowledge of the system
at hand. We then use these quantities, together with the data, to compute
the posterior. The likelihood, prior, and posterior are all related via Bayes’
rule:
p(θ|y) = p(y|θ)p(θ) / p(y) = p(y|θ)p(θ) / ∫ p(y|θ')p(θ') dθ',    (1)
where the second step uses the law of total probability. Unfortunately the
integral in the denominator, called the partition function, is often intractable.
This is what makes Bayesian analysis difficult, and the remainder of the notes
will essentially be methods for avoiding that integral.
Coin Flip Example Part 1. Suppose we have been given data from a series of m coin flips, and we are not sure if the coin is fair or not. We might assume that the data were generated by a sequence of independent draws from a Bernoulli distribution, parameterized by θ, which is the probability of flipping Heads.
But what’s the value of θ? That is, which Bernoulli distribution generated
these data?
We could estimate θ as the proportion of the flips that are Heads. We will see shortly that this is a principled Bayesian approach. Let yi = 1 if flip i was Heads, and yi = 0 otherwise. Let mH = Σ_{i=1}^m yi be the number of heads in m tosses. Then the likelihood model is

p(y|θ) = θ^{mH}(1 − θ)^{m−mH}.    (2)
2 Point estimates
Rather than estimate the entire distribution p(θ|y), sometimes it is sufficient
to find a single ‘good’ value for θ. We call this a point estimate. For the sake
of completeness, we will briefly discuss two widely used point estimates, the
maximum likelihood (ML) estimate and the maximum a posteriori (MAP)
estimate.
The ML estimator is very popular and has been used as far back as Laplace. It has a number of nice properties, one of which is that it is a consistent estimator. Let’s explain what that means.
Definition 1 (Convergence in Probability). A sequence of random variables X1, X2, . . . is said to converge in probability to a random variable X if, ∀ε > 0,

lim_{n→∞} P(|Xn − X| ≥ ε) = 0.

We denote this convergence as Xn →ᴾ X.
Definition 2 (Consistent estimators). Suppose the data y1, . . . , ym were generated by a probability distribution p(y|θ0). An estimator θ̂ is consistent if it converges in probability to the true value: θ̂ →ᴾ θ0 as m → ∞.
We said that maximum likelihood is consistent. This means that if the distribution that generated the data belongs to the family defined by our likelihood model, maximum likelihood is guaranteed to find the correct distribution as m goes to infinity. Consistency does not come with the finite sample guarantees offered by statistical learning theory; and data are always finite.
Coin Flip Example Part 2. Returning to the coin flip example, equation (2), the log-likelihood is

ℓ(θ) = log p(y|θ) = mH log θ + (m − mH) log(1 − θ).

We can maximize this by differentiating and setting to zero, and doing a few lines of algebra:

0 = dℓ(θ)/dθ |_{θ̂ML} = mH/θ̂ML − (m − mH)/(1 − θ̂ML)
mH(1 − θ̂ML) = (m − mH) θ̂ML
mH − θ̂ML mH = m θ̂ML − θ̂ML mH
θ̂ML = mH/m.    (5)
(It turns out not to be difficult to verify that this is indeed a maximum).
In this case, the maximum likelihood estimate is exactly what we intuitively
thought we should do: estimate θ as the observed proportion of Heads.
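To make this concrete, here is a minimal numerical sketch in Python (the data, seed, and true bias are all hypothetical choices) that checks the closed-form estimate (5) against a brute-force grid maximization of the log-likelihood:

import numpy as np

# Hypothetical data: 100 flips of a coin with true P(Heads) = 0.7.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.7, size=100)
m, m_H = len(y), y.sum()

def log_likelihood(theta):
    # Log of equation (2): mH log(theta) + (m - mH) log(1 - theta).
    return m_H * np.log(theta) + (m - m_H) * np.log(1 - theta)

# Closed-form ML estimate from equation (5).
theta_ml = m_H / m

# Numerical check: evaluate the log-likelihood on a grid and take the max.
grid = np.linspace(0.001, 0.999, 999)
theta_grid = grid[np.argmax(log_likelihood(grid))]

print(theta_ml, theta_grid)  # the two estimates agree to grid resolution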
The maximum a posteriori (MAP) estimate maximizes the posterior rather than the likelihood:

θ̂MAP ∈ arg max_θ p(θ|y) = arg max_θ (log p(y|θ) + log p(θ)).    (6)

We don’t have to worry about evaluating the partition function ∫ p(y|θ')p(θ') dθ', because it does not depend on θ and thus does not affect the arg max. When the prior is uniform, the MAP estimate is identical to the ML estimate, because log p(θ) is constant.
One might ask what would be a bad choice for a prior. We will see later that
reasonable choices of the prior are those that do not assign zero probability to
the true value of θ. If we have such a prior, the MAP estimate is consistent,
which we will discuss in more detail later. Some other properties of the MAP
estimate are illustrated in the next example.
Coin Flip Example Part 3. We again return to the coin flip example. Suppose we model θ using a Beta prior (we will see later why this is a good idea): θ ∼ Beta(α, β). The Beta distribution is:

Beta(θ; α, β) = (1/B(α, β)) θ^{α−1}(1 − θ)^{β−1},    (7)

where B(α, β) is the beta function, and is constant with respect to θ:

B(α, β) = ∫₀¹ t^{α−1}(1 − t)^{β−1} dt.    (8)
The quantities α and β are parameters of the prior which we are free to set according to our prior belief about θ. By varying α and β, we can encode a wide range of possible beliefs, as shown in the figure in the Wikipedia article on the Beta distribution.
We can get the MAP estimate for θ from the formula in (6), plugging in the likelihood we found in Part 2 and the definition of the Beta distribution for the prior (7):

θ̂MAP ∈ arg max_θ (log p(y|θ) + log p(θ))
     = arg max_θ (mH log θ + (m − mH) log(1 − θ) + (α − 1) log θ + (β − 1) log(1 − θ) − log B(α, β)).

Differentiating and setting to zero, as before, yields

θ̂MAP = (mH + α − 1) / (m + α + β − 2).

Comparing this to the ML estimate (5), the MAP estimate is equivalent to adding α − 1 Heads and β − 1 Tails to the data before taking the observed proportion of Heads. For example, choosing α = 7 and β = 3 is like starting the coin tossing experiment with 6 Heads and 2 Tails on the record. If we truly believed before we started flipping coins that the probability of Heads was around 6/8, then this is a good idea. This can be very useful in reducing the variance of the estimate for small samples.
For example, suppose the data contain only one coin flip, a Heads. The ML estimate will be θ̂ML = 1, which predicts that we will never flip Tails! However, we, the modeler, suspect that the coin is probably fair, and can assign α = β = 3 (or some other number with α = β), and we get θ̂MAP = 3/5.
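A small sketch of this computation (the helper name is ours; the data are the one-flip example from the text):

import numpy as np

def theta_map(m_H, m, alpha, beta):
    # MAP estimate for the Bernoulli likelihood with a Beta(alpha, beta)
    # prior: the mode of the posterior, valid for alpha, beta > 1.
    return (m_H + alpha - 1) / (m + alpha + beta - 2)

# One flip, one Heads: ML says theta = 1, but a Beta(3, 3) prior
# shrinks the estimate toward 1/2.
print(theta_map(1, 1, 3, 3))   # 0.6, i.e. 3/5 as in the text
print(1 / 1)                   # ML estimate: 1.0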
Question How would you set α and β for the coin toss under a strong prior
belief vs. a weak prior belief that the probability of Heads was 1/8?
For large samples it is easy to see for the coin flipping that the effect of the prior goes to zero:

lim_{m→∞} θ̂MAP = lim_{m→∞} θ̂ML = θtrue. Why?
Recall what we know about regularization in machine learning: data plus knowledge implies generalization. The prior is the “knowledge” part. One could interpret the MAP estimate as a regularized version of the ML estimate, or a version with “shrinkage.”
Example 1. (Rare Events) The MAP estimate is particularly useful when dealing with rare events. Suppose we are trying to estimate the probability that a given credit card transaction is fraudulent. Perhaps we monitor transactions for a day, and there are no fraudulent transactions. The ML estimate tells us that the probability of credit card fraud is zero. The MAP estimate would allow us to incorporate our prior knowledge that there is some probability of fraud; we just haven’t seen it yet.
Now consider regression. Suppose we have data pairs (xi, yi), i = 1, . . . , m, where the xi ∈ ℝ^d are the explanatory variables and the yi ∈ ℝ are the dependent variables. We will use a likelihood model under which yi depends linearly on xi, and the yi’s are all independent. Specifically, we model

yi = θ^T xi + ε,

where θ ∈ ℝ^d are the parameters we are interested in, and ε represents noise. We will assume that the noise is distributed normally with zero mean and some known variance σ²: ε ∼ N(0, σ²). This, together with our assumption of independence, allows us to express the likelihood function for the observations y = {y1, . . . , ym}:
p(y|x, θ) = ∏_{i=1}^m p(yi|xi, θ) = ∏_{i=1}^m N(yi; θ^T xi, σ²)
          = ∏_{i=1}^m (1/√(2πσ²)) exp(−(1/(2σ²))(yi − θ^T xi)²),
where the first line follows from independence. As usual, it is more convenient to work with the log-likelihood:

ℓ(θ) = Σ_{i=1}^m log [(1/√(2πσ²)) exp(−(1/(2σ²))(yi − θ^T xi)²)]
     = −(m/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^m (yi − θ^T xi)².
The ML estimate is

θ̂ML ∈ arg max_θ ℓ(θ) = arg max_θ −(1/(2σ²)) Σ_{i=1}^m (yi − θ^T xi)²    (10)
    = arg min_θ Σ_{i=1}^m (yi − θ^T xi)².    (11)

This is exactly the least squares problem: under Gaussian noise, maximum likelihood recovers ordinary least squares regression.
Now let us try the MAP estimate. We saw in our coin toss example that the MAP estimate acts as a regularized ML estimate. For probabilistic regression, we will use a multivariate normal prior (we will see later why this is a good idea) with mean 0. The covariance matrix will be diagonal: I (the identity matrix) times σ²/C, where σ²/C is the known, constant variance of each component of θ. So the prior is:

θ ∼ N(0, Iσ²/C), with density p(θ) = (1/((2π)^{d/2} |Iσ²/C|^{1/2})) exp(−(1/2) θ^T (Iσ²/C)^{−1} θ).

Here we are saying that our prior belief is that the “slope” of the regression line is near 0. Now,

θ̂MAP ∈ arg max_θ (log p(y|θ) + log p(θ)),

where we take the log-likelihood log p(y|θ) from (10). At this point you can really see how MAP is a regularized ML. Continuing,
θ̂MAP ∈ arg max_θ [−(1/(2σ²)) Σ_{i=1}^m (yi − θ^T xi)² − (d/2) log 2π − (1/2) log|Iσ²/C| − (1/2) θ^T (IC/σ²) θ]
      = arg max_θ [−(1/(2σ²)) Σ_{i=1}^m (yi − θ^T xi)² − (1/(2σ²)) C θ^T θ]
      = arg min_θ [Σ_{i=1}^m (yi − θ^T xi)² + C ‖θ‖₂²].    (12)

This is exactly ℓ2-regularized least squares (ridge regression), with regularization constant C.
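Since (12) has the closed-form minimizer θ̂MAP = (XᵀX + C I)⁻¹ Xᵀ y (set the gradient of (12) to zero), we can compare the ML and MAP estimates numerically. A minimal sketch, with all data and constants hypothetical:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: m = 50 points in d = 3 dimensions with Gaussian noise.
m, d, sigma, C = 50, 3, 0.5, 2.0
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(m, d))
y = X @ theta_true + rng.normal(0, sigma, size=m)

# ML / least squares: minimize sum (y_i - theta^T x_i)^2, equation (11).
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)

# MAP / ridge: minimize (12); closed form is (X^T X + C I)^{-1} X^T y.
theta_map = np.linalg.solve(X.T @ X + C * np.eye(d), X.T @ y)

print(theta_ml)   # unregularized estimate
print(theta_map)  # shrunk toward the prior mean, zero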
3 Conjugate priors
Although point estimates can be useful in many circumstances (and are used
in many circumstances), our true goal in Bayesian analysis is often to find
the full posterior, p(θ|y). One reason for wanting the posterior is to be able
to use the posterior predictive distribution of a yet unobserved data point:
p(y_{m+1}|y) = ∫ p(y_{m+1}|y, θ) p(θ|y) dθ = ∫ p(y_{m+1}|θ) p(θ|y) dθ,    (13)

where the second equality holds because y_{m+1} is conditionally independent of y given θ.
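The integral in (13) can also be approximated by simple Monte Carlo: draw θ’s from the posterior and average p(y_{m+1}|θ). A minimal sketch for the coin flip, assuming the Beta posterior derived just below and hypothetical data:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical coin-flip setup: 10 Heads in 40 flips, Beta(5, 5) prior.
m, m_H, alpha, beta = 40, 10, 5, 5

# Draw theta from the Beta posterior (conjugate update derived below) and
# average p(y_{m+1} = 1 | theta) = theta over the draws: equation (13).
theta_draws = rng.beta(m_H + alpha, m - m_H + beta, size=100_000)
p_heads_mc = theta_draws.mean()

# Closed form: the posterior mean (m_H + alpha) / (m + alpha + beta).
p_heads_exact = (m_H + alpha) / (m + alpha + beta)

print(p_heads_mc, p_heads_exact)  # should agree to about 3 decimals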
We saw earlier that the posterior can be obtained in principle from the prior
and the likelihood using Bayes’ rule, but that there is an integral in the
denominator which often makes this intractable. One approach to circumventing the integral is to use conjugate priors.
As with the MAP estimate, we can see the interplay between the data yi and the prior parameters α and β in forming the posterior. As before, the exact choice of α and β does not matter asymptotically; the data overwhelm the prior.

If we knew that the Beta distribution is the conjugate prior to the Bernoulli, we could have figured out the same thing faster by recognizing that

p(θ|y) ∝ p(y|θ)p(θ) ∝ θ^{mH+α−1}(1 − θ)^{m−mH+β−1},    (15)

and then realizing that by the definition of a conjugate prior, the posterior must be a Beta distribution. There is exactly one Beta distribution that satisfies (15): the one that is normalized correctly, and that is equation (14), Beta(θ; mH + α, m − mH + β).
The parameters of the prior distribution (α and β in the case of the Beta prior) are called prior hyperparameters. We choose them to best represent our beliefs about the distribution of θ. The parameters of the posterior distribution (mH + α and m − mH + β) are called posterior hyperparameters. Any time a likelihood model is used together with its conjugate prior, we know the posterior is from the same family as the prior, and moreover we have an explicit formula for the posterior hyperparameters. A table summarizing some of the useful conjugate prior relationships follows. There are many more conjugate prior relationships that are not shown in the following table but that can be found in reference books on Bayesian statistics.¹

¹ Bayesian Data Analysis by Gelman, Carlin, Stern, and Rubin is an excellent choice.
Likelihood                    Conjugate Prior   Prior Hyperparams   Posterior Hyperparams
Bernoulli                     Beta              α, β                α + Σ_{i=1}^m yi,  β + m − Σ_{i=1}^m yi
Binomial                      Beta              α, β                α + Σ_{i=1}^m yi,  β + Σ_{i=1}^m mi − Σ_{i=1}^m yi
Poisson                       Gamma             α, β                α + Σ_{i=1}^m yi,  β + m
Geometric                     Beta              α, β                α + m,  β + Σ_{i=1}^m yi
Exponential                   Gamma             α, β                α + m,  β + Σ_{i=1}^m yi
Normal (unknown mean,         Normal            µ0, σ0²             (µ0/σ0² + Σ_{i=1}^m yi/σ²) / (1/σ0² + m/σ²),  (1/σ0² + m/σ²)^{−1}
known variance σ²)
We will now discuss a few of these conjugate prior relationships to try to gain additional insight. For Bernoulli-Beta, we saw that the prior hyperparameters can be interpreted as starting the tossing with a certain number of Heads and Tails on the record. The posterior hyperparameters then simply add the observed number of heads (mH) to the prior hyperparameter for the number of Heads (α), and the observed number of tails (m − mH) to the prior hyperparameter for the number of Tails (β).
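A minimal sketch of the Bernoulli-Beta update (the data and the helper name beta_posterior are our own hypothetical choices):

import numpy as np

def beta_posterior(y, alpha, beta):
    # Bernoulli-Beta conjugate update: the posterior is
    # Beta(alpha + m_H, beta + m - m_H), per the table above.
    m, m_H = len(y), int(np.sum(y))
    return alpha + m_H, beta + m - m_H

# Hypothetical data: 7 Heads and 3 Tails with a Beta(2, 2) prior.
y = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
a_post, b_post = beta_posterior(y, 2, 2)
print(a_post, b_post)  # (9, 5): prior pseudo-counts plus observed counts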
p(θ|y) ∝ p(y|θ)p(θ)
       = [∏_{i=1}^m f(yi)] g(θ)^m exp(φ(θ)^T Σ_{i=1}^m u(yi)) · g(θ)^η exp(φ(θ)^T ν) / (normalizing constant)
       ∝ g(θ)^{η+m} exp(φ(θ)^T (ν + Σ_{i=1}^m u(yi)))    (dropping non-θ terms),

which is in the same family as the prior (17), with the posterior hyperparameters being η + m and ν + Σ_{i=1}^m u(yi).
Although this proof yields the form of the conjugate prior, we may not always be able to compute the partition function, so we can’t always gain something in practice from it.

It turns out that in general, the converse to Theorem 1 is true, and exponential families are the only distributions with (non-trivial) conjugate priors.
Coin Flip Example Part 5. Returning again to the coin flip example, let us first verify that the Bernoulli distribution is an exponential family:

p(y|θ) = θ^y (1 − θ)^{1−y} = (1 − θ) exp(y log(θ/(1 − θ))),

which is in exponential family form with f(y) = 1, g(θ) = 1 − θ, φ(θ) = log(θ/(1 − θ)), and u(y) = y.
4 Posterior asymptotics

Up to this point, we have defined a likelihood model that is parameterized by θ, assigned a prior distribution to θ, and then computed the posterior p(θ|y). There are two natural questions that arise. First, what if we choose the ‘wrong’ likelihood model? That is, what if the data were actually generated by some distribution q(y) such that q(y) ≠ p(y|θ) for any θ, but we use p(y|θ) as our likelihood model? Second, what if we assign the ‘wrong’ prior? We can answer both of these questions asymptotically as m → ∞. First we need a measure of the distance between two distributions; the Kullback–Leibler (KL) divergence,

D(q(·)||p(·)) = ∫ q(y) log (q(y)/p(y)) dy,

will do this
for us.
Theorem 2. For any probability distributions q(·) and p(·), D(q(·)||p(·)) ≥ 0, with equality if and only if q(y) = p(y) ∀y.

Proof. We will rely on Jensen’s inequality, which states that for any convex function f and random variable X, E[f(X)] ≥ f(E[X]). Take f(x) = −log x, which is convex, and X = p(y)/q(y) with y ∼ q(y):

−∫ q(y) log (p(y)/q(y)) dy ≥ −log ∫ q(y) (p(y)/q(y)) dy
                            = −log ∫ p(y) dy
                            = −log 1 = 0,  so
∫ q(y) log (q(y)/p(y)) dy ≥ 0,

with equality under the same conditions required for equality in Jensen’s inequality: if and only if X is constant, that is, q(y) = p(y) ∀y.
We will use the KL divergence to find the distribution from the likelihood family that is ‘closest’ to the true generating distribution:

θ∗ = arg min_{θ∈Θ} D(q(·)||p(·|θ)).    (18)

For convenience, we will suppose that the arg min in (18) is unique. The results can be easily extended to the case where the arg min is not unique. The main results of this section are two theorems, the first for discrete parameter spaces and the second for continuous parameter spaces. The intuition for the theorem is that as long as there is some probability in the prior that θ = θ∗, then as m → ∞ the whole posterior will be close to θ∗.

Theorem 3 (Posterior consistency in finite parameter space). Let F be a finite family of likelihood models, with Θ = {θ : p(·|θ) ∈ F} the finite parameter space. Let y = (y1, . . . , ym) be a set of independent samples from an arbitrary distribution q(·), and θ∗ be as in (18). If p(θ = θ∗) > 0, then p(θ = θ∗|y) → 1 as m → ∞.
Proof. Consider any θ ≠ θ∗:

log (p(θ|y)/p(θ∗|y)) = log (p(θ)p(y|θ) / (p(θ∗)p(y|θ∗))) = log (p(θ)/p(θ∗)) + Σ_{i=1}^m log (p(yi|θ)/p(yi|θ∗)).    (19)

By the law of large numbers, as m → ∞,

(1/m) Σ_{i=1}^m log (p(yi|θ)/p(yi|θ∗)) → E_{y∼q} [log (p(y|θ)/p(y|θ∗))] = D(q(·)||p(·|θ∗)) − D(q(·)||p(·|θ)),

which is strictly negative because θ∗ is the unique minimizer in (18); thus the sum in (19) tends to −∞. Let’s plug back into (19). By supposition, p(θ∗) > 0 (and the prior doesn’t change as m → ∞). Thus as m → ∞, (19) obeys

log (p(θ|y)/p(θ∗|y)) = log (p(θ)/p(θ∗)) + Σ_{i=1}^m log (p(yi|θ)/p(yi|θ∗)) → −∞,
and

log (p(θ|y)/p(θ∗|y)) → −∞  implies  p(θ|y)/p(θ∗|y) → 0,  which implies  p(θ|y) → 0.

This holds for every θ ≠ θ∗, thus p(θ∗|y) → 1.
Again, this theorem tells us that the posterior eventually becomes concentrated on the value θ∗. θ∗ corresponds to the likelihood model that is ‘closest’ in the KL sense to the true generating distribution q(·). If q(·) = p(·|θ0) for some θ0 ∈ Θ, then θ∗ = θ0 is the unique minimizer of (18) because p(·|θ∗) is closest to p(·|θ0) when θ∗ = θ0. The theorem tells us that the posterior will become concentrated around the true value. This is only the case if in the prior, p(θ∗) > 0, which shows the importance of choosing a prior that assigns non-zero probability to every plausible value of θ.
We now present the continuous version of Theorem 3, but the proof is quite
technical and is omitted.
Theorem 4 (Posterior consistency in continuous parameter space). If Θ is a compact set, θ∗ is defined as in (18), and A is a neighborhood of θ∗ with prior probability p(θ ∈ A) > 0, then p(θ ∈ A|y) → 1 as m → ∞.
These theorems show that asymptotically, the choice of the prior does not
matter as long as it assigns non-zero probability to every θ ∈ Θ. They also
show that if the data were generated by a member of the family of likelihood
models, we will converge to the correct likelihood model. If not, then we will
converge to the model that is ‘closest’ in the KL sense.
We give these theorems along with a word of caution. These are asymptotic
results that tell us absolutely nothing about the sort of m we encounter in
practical applications. For small sample sizes, poor choices of the prior or
likelihood model can yield poor results and we must be cautious.
Coin Flip Example Part 6. Suppose that the coin flip data truly came from a biased coin with a 3/4 probability of Heads, but we restrict the likelihood model to only include coins with probability in the interval [0, 1/2]:

p(y|θ) = θ^{mH}(1 − θ)^{m−mH},  θ ∈ [0, 1/2].

This time we will use a uniform prior, p(θ) = 2, θ ∈ [0, 1/2]. The posterior distribution is then

p(θ|y) = θ^{mH}(1 − θ)^{m−mH} · 2 / ∫₀^{1/2} θ'^{mH}(1 − θ')^{m−mH} · 2 dθ'.

The partition function is an incomplete Beta integral, which has no closed form but can easily be computed numerically. In the following figure, we draw samples iid from the true distribution (3/4 probability of Heads) and show how the posterior becomes increasingly concentrated around θ = 0.5, the closest likelihood function to the true generating distribution.
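A small numerical sketch of this computation (the draws, grid resolution, and seed are hypothetical choices):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical draws from the true coin, P(Heads) = 3/4.
m = 200
y = rng.binomial(1, 0.75, size=m)
m_H = y.sum()

# Unnormalized posterior on the restricted space theta in (0, 1/2],
# normalized numerically on a grid (the incomplete Beta integral).
theta = np.linspace(1e-6, 0.5, 2000)
log_post = m_H * np.log(theta) + (m - m_H) * np.log(1 - theta)
post = np.exp(log_post - log_post.max())       # subtract max for stability
post /= post.sum() * (theta[1] - theta[0])     # Riemann-sum normalization

print(theta[np.argmax(post)])  # concentrates near 0.5 as m grows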
5 Hierarchical modeling

Hierarchical models are a powerful application of Bayesian analysis. They allow us to reason about knowledge at multiple levels of abstraction. Basically, the prior parameters at the lower level come from a shared distribution at the higher level, and there is a prior at that level, and so on. This is best explained by example, so we will develop a hierarchical model for topic modeling, an important information retrieval problem.
The model has several components. The data are m documents, with document i consisting of ni words. Each word in the document will be associated with one of K topics. We let zi,j denote the topic of word j in document i. We model zi,j ∼ Multinomial(θi), where θi ∈ ℝ^K describes the topic mixture of document i.
For each topic, we define a multinomial distribution over all possible words. For example, given the topic is “Sports”, the probability of having the word “football” might be high; if the topic were “Weather”, the probability of having the word “football” might be lower. Other words, like “the”, will have a high probability regardless of the topic. If words are chosen from a set of W possible words, then we let φk ∈ ℝ^W be the multinomial parameter over words for topic k. Word j of document i, denoted wi,j, will be generated by the distribution over words corresponding to the topic zi,j: wi,j ∼ Multinomial(φ_{zi,j}).

Finally, we give prior distributions for the parameters θi and φk. The multinomial distribution is a generalization of the binomial distribution, and its conjugate prior is the Dirichlet distribution, a generalization of the Beta: we take θi ∼ Dirichlet(α) and φk ∼ Dirichlet(β). This model is known as latent Dirichlet allocation (LDA).
We are interested in inferring the posterior distribution for the latent variables. Let Z = {zi,j}_{i=1,...,m, j=1,...,ni}, θ = {θi}_{i=1,...,m}, φ = {φk}_{k=1,...,K}, and W = {wi,j}_{i=1,...,m, j=1,...,ni}. Then, by Bayes’ rule, ignoring the constant denominator, we can express the posterior as:

p(Z, θ, φ|W, α, β) ∝ p(W|Z, θ, φ, α, β) p(Z, θ, φ|α, β).    (21)

We will look at each of these pieces and show that they have a compact analytical form.
p(W|Z, φ, θ, α, β) = ∏_{i=1}^m ∏_{j=1}^{ni} p(wi,j|Z, φ, θ, α, β)    (by the iid assumption)
                   = ∏_{i=1}^m ∏_{j=1}^{ni} p(wi,j|zi,j, φ)          (by conditional independence)
                   = ∏_{i=1}^m ∏_{j=1}^{ni} Multinomial(wi,j; φ_{zi,j})    (by definition).
Also,

p(Z, θ, φ|α, β) = p(Z, θ|α) p(φ|β)

by conditional independence. Again considering each term, by the definition of conditional probability:

p(Z, θ|α) = p(Z|θ, α) p(θ|α) = p(Z|θ, α) ∏_{i=1}^m Dirichlet(θi; α),

where

p(Z|θ, α) = p(Z|θ) = ∏_{i=1}^m ∏_{j=1}^{ni} p(zi,j|θi) = ∏_{i=1}^m ∏_{j=1}^{ni} Multinomial(zi,j; θi),
and

p(φ|β) = ∏_{k=1}^K p(φk|β) = ∏_{k=1}^K Dirichlet(φk; β).
Plugging all of these pieces back into (21), we obtain

p(Z, θ, φ|W, α, β) ∝ ∏_{i=1}^m ∏_{j=1}^{ni} Multinomial(wi,j; φ_{zi,j}) ∏_{i=1}^m ∏_{j=1}^{ni} Multinomial(zi,j; θi) × ∏_{i=1}^m Dirichlet(θi; α) ∏_{k=1}^K Dirichlet(φk; β).    (22)
For any given Z, θ, φ, W, α, and β, we can easily evaluate (22). (All the normalizations are known, so it’s easy.) We will see in the next section that this is sufficient to be able to simulate draws from the posterior.
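For instance, a direct (unoptimized) evaluation of the log of (22) might look like the following sketch; the function name, data layout, and toy inputs are our own conventions, not part of LDA itself:

import numpy as np
from scipy.stats import dirichlet

def lda_log_posterior(Z, theta, phi, W, alpha, beta):
    # Log of the unnormalized posterior (22).
    # Z[i][j]: topic of word j in doc i; W[i][j]: word id;
    # theta[i]: topic mixture of doc i; phi[k]: word distribution of topic k;
    # alpha, beta: Dirichlet hyperparameter vectors.
    lp = 0.0
    for i in range(len(W)):
        lp += dirichlet.logpdf(theta[i], alpha)   # Dirichlet(theta_i; alpha)
        for j in range(len(W[i])):
            k = Z[i][j]
            lp += np.log(theta[i][k])             # Multinomial(z_ij; theta_i)
            lp += np.log(phi[k][W[i][j]])         # Multinomial(w_ij; phi_{z_ij})
    for k in range(len(phi)):
        lp += dirichlet.logpdf(phi[k], beta)      # Dirichlet(phi_k; beta)
    return lp

# Toy usage: 2 topics, 3 words, 1 document of 2 words (all values hypothetical).
theta = [np.array([0.6, 0.4])]
phi = [np.array([0.5, 0.3, 0.2]), np.array([0.1, 0.2, 0.7])]
print(lda_log_posterior([[0, 1]], theta, phi, [[0, 2]], np.ones(2), np.ones(3)))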
Under mild conditions that will be satisfied by all of the chains we are interested in, the sequence of state distributions π⁰(θ), π¹(θ), . . . will converge to a unique distribution π(θ), which we call the stationary distribution (or equilibrium distribution, or steady-state distribution).
There are many options when choosing a proposal distribution which we will
discuss later. The proposal distribution will yield a random walk over the
parameter space, proposing steps θ → θ' . We accept or reject each step
depending on the relative posterior probabilities for θ and θ' . When we run
the random walk for long enough, the accepted values will simulate draws
from the posterior.
Suppose we are considering the transition θ → θ' . If p(θ' |y) is larger than
p(θ|y), then for every accepted draw of θ, we should have at least as many
accepted draws of θ' and so we should always accept the transition θ → θ' .
If p(θ'|y) is less than p(θ|y), then for every accepted draw of θ, we should have on average p(θ'|y)/p(θ|y) accepted draws of θ'. We thus accept the transition with probability p(θ'|y)/p(θ|y). Thus for any transition, we accept the transition with probability

min (p(θ'|y)/p(θ|y), 1).    (24)
Equation (25) reduces to what we developed intuition for in (24) when the
proposal distribution is symmetric: J(θ, θ' ) = J(θ' , θ). We will see in the
next theorem that the extra factors in (25) are necessary for the posterior to
be the stationary distribution.
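A minimal sketch of this accept/reject rule for the coin flip posterior, using a symmetric Gaussian random walk proposal so that (24) applies directly (the data, prior, step size, and chain length are all hypothetical):

import numpy as np

rng = np.random.default_rng(0)

# Unnormalized log posterior for the coin flip with a Beta(alpha, beta)
# prior. The partition function is not needed, because (24) only uses the
# ratio p(theta'|y) / p(theta|y).
m, m_H, alpha, beta = 40, 10, 5, 5
def log_post(theta):
    if theta <= 0 or theta >= 1:
        return -np.inf       # zero posterior outside (0, 1)
    return (m_H + alpha - 1) * np.log(theta) + \
           (m - m_H + beta - 1) * np.log(1 - theta)

r = 0.1                      # proposal step size (see the tuning discussion below)
theta = 0.5                  # starting point
draws = []
for t in range(20_000):
    prop = theta + rng.normal(0, r)          # symmetric Gaussian proposal
    # Accept with probability min(p(prop|y)/p(theta|y), 1), equation (24).
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    draws.append(theta)

samples = np.array(draws[10_000:])           # discard burn-in
print(samples.mean())  # near the posterior mean (m_H+alpha)/(m+alpha+beta) = 0.3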
Because the proposal distribution and α(θ^{t−1}, θ∗) depend only on the current state, the sequence θ⁰, θ¹, . . . forms a Markov chain. What makes the Metropolis-Hastings algorithm special is the following theorem, which shows that if we simulate the chain long enough, we will simulate draws from the posterior.
Theorem 5. If J(θ, θ') is such that the Markov chain θ⁰, θ¹, . . . produced by the Metropolis-Hastings algorithm has a unique stationary distribution, then the stationary distribution is p(θ|y).
Proof. We will use the fact given in (23) that if the posterior p(θ|y) satisfies
the detailed balance equation, then it is the stationary distribution. Thus we
wish to show:
K(θ, θ' )p(θ|y) = K(θ' , θ)p(θ' |y), for all θ, θ' .
where the transition kernel comes from Metropolis-Hastings:
K(θ, θ' ) = (probability of proposing θ' )×
(probability of accepting θ' given it was proposed)
= J(θ, θ' )α(θ, θ' ).
To show that the detailed balance equation holds, take any θ and θ' , and
without loss of generality, suppose that α is less than or equal to 1 for the
transition θ to θ' , which means
J(θ, θ' )p(θ|y) ≥ J(θ' , θ)p(θ' |y),
so that

α(θ, θ') = J(θ', θ)p(θ'|y) / (J(θ, θ')p(θ|y)),

and α(θ', θ) = 1. Now let’s plug in:

K(θ, θ')p(θ|y) = J(θ, θ')α(θ, θ')p(θ|y)
               = J(θ, θ')p(θ|y) · J(θ', θ)p(θ'|y) / (J(θ, θ')p(θ|y))
               = J(θ', θ)p(θ'|y)    (cancel terms)
               = J(θ', θ)α(θ', θ)p(θ'|y)    (1 in disguise)
               = K(θ', θ)p(θ'|y).
And that’s the detailed balance equation.
In fact, it is often the case in hierarchical models that their structure allows us to determine these conditional posterior distributions analytically. This can be done for LDA; the derivation is quite lengthy but can be found in the Wikipedia article on LDA. The Gibbs sampler updates the posterior variables θ1, θ2, . . . , θd one at a time. At each step, all of them are held constant in their current state except for one (j), which is then updated by drawing from its (known) conditional posterior distribution, p(θj|θ1, . . . , θ_{j−1}, θ_{j+1}, . . . , θd, y). We then hold it in its new state and move on to the next variable, updating it similarly. When we do this iterative updating for long enough, we eventually simulate draws from the full posterior. The full algorithm is below; a small numerical sketch follows the steps.
Step 1. Initialize θ⁰ = [θ1⁰, . . . , θd⁰]. Set t = 1.
Step 2. For j ∈ {1, . . . , d}, sample θj^t from p(θj | θ1^t, . . . , θ_{j−1}^t, θ_{j+1}^{t−1}, . . . , θd^{t−1}, y).
Step 3. Until the stationary distribution and the desired number of draws are reached, increment t ← t + 1 and return to Step 2.
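As a toy illustration of Steps 1-3 (not the LDA sampler; the target here is a standard bivariate normal with correlation ρ, where both conditionals are known univariate normals):

import numpy as np

rng = np.random.default_rng(0)

# Gibbs sampling for a standard bivariate normal with correlation rho.
# Each conditional is univariate: theta1 | theta2 ~ N(rho*theta2, 1 - rho^2),
# and symmetrically for theta2 | theta1.
rho = 0.8
theta1, theta2 = 0.0, 0.0    # Step 1: initialize
draws = []
for t in range(20_000):      # Steps 2-3: sweep the coordinates in turn
    theta1 = rng.normal(rho * theta2, np.sqrt(1 - rho ** 2))
    theta2 = rng.normal(rho * theta1, np.sqrt(1 - rho ** 2))
    draws.append((theta1, theta2))

samples = np.array(draws[10_000:])    # discard burn-in
print(np.corrcoef(samples.T)[0, 1])   # should be close to rho = 0.8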
This is probably the largest flaw in MCMC sampling. There are a large number of heuristics used to assess convergence that generally involve looking at how θ^t varies with time. In general, the initial simulations depend strongly on the starting point and are thrown away. This is referred to as burn-in, and often involves discarding the first half of all draws.
The resulting acceptance ratio is easy to compute. This formula and a uniform random number generator for the proposal distribution are all that is required to implement the Metropolis-Hastings algorithm.

In the next figure, with r = 1 the steps are too large and we reject most of the proposals. This leads to a small number of accepted draws after 2000 proposals.
The chain with r = 0.1 is the happy medium, where we rapidly reach the
stationary distribution, and accept most of the samples.
The final figure shows a histogram of the resulting draws, along with the analytical posterior. The running time to generate the MCMC draws was less than a second, and they are a reasonable approximation to the posterior.
By this point it should come as no surprise that our first example will be the coin flip example. Specifying the model in OpenBUGS is very natural, and uses two main symbols: “<-” means deterministic assignment, as in R, and “~” means distributional assignment. OpenBUGS includes many different distributions; a full list of how to specify them is at:

http://mathstat.helsinki.fi/openbugs/Manuals/ModelSpecification.html#ContentsAI

The coin flip model is:

model {
    Y ~ dbin(theta, m)
    theta ~ dbeta(alpha, beta)
}
Here dbin and dbeta are the binomial and beta distributions, respectively. Notice that in the model we do not specify the constants m, α, and β. We then specify the data, including all constants. Data are specified in OpenBUGS as an R list. Here, for 10 Heads out of 40 flips and priors α = β = 5, we write:

list(Y=10, m=40, alpha=5, beta=5)

Finally, we specify the initial parameter values for each chain, each as an R list. Here we will start one chain from θ = 0, and the other from θ = 1:

list(theta=0)
list(theta=1)
Alternatively, instead of loading initial values for both chains, we could have pressed the button “gen inits” in the model specification window, which generates random initial values for both chains. We may now close the model specification window.

OpenBUGS will only store whatever results we tell it to. Before simulating the chains, we must tell OpenBUGS to store the values of theta that it simulates.
To view the results, we close the update tool and return to the sample monitor tool (Inference -> Samples). Here there are a few useful quantities to look at. We type in “theta” as the node we are interested in, and press accept.
Pressing “accept” shows the acceptance rate vs. iteration number, as in the following figure.
http://www.openbugs.info/Examples/Pumps.html
This example is a hierarchical model for pump failure. There are 10 pumps,
pump i being in operation for a certain amount of time ti and having had a
certain number of failures xi . We wish to estimate the failure rate for each
pump, and we will do so with a Bayesian hierarchical model. We suspect that
the failure rates are related for all of the pumps, so we include a common prior.
We give flexibility to the prior by sampling it from a diffuse distribution. The
generative model for the observed number of pump failures is then:
model {
    alpha ~ dexp(1)
    beta ~ dgamma(0.1, 1.0)
    for (i in 1 : 10) {
        theta[i] ~ dgamma(alpha, beta)
        lambda[i] <- theta[i] * t[i]
        x[i] ~ dpois(lambda[i])
    }
}

Notice in the above that OpenBUGS does not allow mathematical operations inside of calls to distributions like dpois(theta[i]*t[i]), so we had to use a deterministic relationship to specify the new variable lambda[i]. The data to be specified are “t” and “x”, and we do this with an R list:
list(t = c(94.3, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5),
x = c( 5, 1, 5, 14, 3, 19, 1, 1, 4, 22))
MIT OpenCourseWare
http://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.