15.097: Probabilistic Modeling and Bayesian Analysis
We will assume that the data were generated from a probability distribution that is described by some parameters θ (not necessarily scalar). We treat θ as a random variable. We will use the shorthand notation p(y|θ) to represent the family of conditional density functions over y, parameterized by the random variable θ. We call this family p(y|θ) a likelihood function or likelihood model for the data y, as it tells us how likely the data y are given the model specified by any value of θ. We also specify a prior distribution p(θ) over the parameters, which encodes our beliefs about θ before observing them.
Our end goal is the conditional density function over θ, given the observed
data, which we denote as p(θ|y). We call this the posterior distribution, and
it informs us which parameters are likely given the observed data.
We, the modeler, specify the likelihood function (as a function of y and θ)
and the prior (we completely specify this) using our knowledge of the system
at hand. We then use these quantities, together with the data, to compute
the posterior. The likelihood, prior, and posterior are all related via Bayes’
rule:
p(θ|y) = p(y|θ)p(θ) / p(y) = p(y|θ)p(θ) / ∫ p(y|θ')p(θ') dθ',    (1)
where the second step uses the law of total probability. Unfortunately the
integral in the denominator, called the partition function, is often intractable.
This is what makes Bayesian analysis difficult, and the remainder of the notes
will essentially be methods for avoiding that integral.
Coin Flip Example Part 1. Suppose we have been given data from a series of m coin flips, and we are not sure if the coin is fair or not. We might assume that the data were generated by a sequence of independent draws from a Bernoulli distribution, parameterized by θ, which is the probability of flipping Heads.
But what’s the value of θ? That is, which Bernoulli distribution generated
these data?
We could estimate θ as the proportion of the flips that are Heads. We will see shortly that this is a principled Bayesian approach. Let yi = 1 if flip i was Heads, and yi = 0 otherwise. Let mH = Σ_{i=1}^m yi be the number of heads in m tosses. Then the likelihood model is

p(y|θ) = θ^{mH}(1 − θ)^{m−mH}.    (2)
2 Point estimates
Rather than estimate the entire distribution p(θ|y), sometimes it is sufficient
to find a single ‘good’ value for θ. We call this a point estimate. For the sake
of completeness, we will briefly discuss two widely used point estimates, the
maximum likelihood (ML) estimate and the maximum a posteriori (MAP)
estimate.
The ML estimator is very popular and has been used as far back as Laplace. It has a number of nice properties, one of which is that it is a consistent estimator. Let’s explain what that means.
Definition 1 (Convergence in Probability). A sequence of random variables X1, X2, . . . is said to converge in probability to a random variable X if, ∀ε > 0,

lim_{n→∞} P(|Xn − X| ≥ ε) = 0.

We denote this convergence as Xn →ᴾ X.
Definition 2 (Consistent estimators). Suppose the data y1, . . . , ym were generated by a probability distribution p(y|θ0). An estimator θ̂ is consistent if it converges in probability to the true value: θ̂ →ᴾ θ0 as m → ∞.
We said that maximum likelihood is consistent. This means that if the distribution that generated the data belongs to the family defined by our likelihood model, maximum likelihood is guaranteed to find the correct distribution as m goes to infinity. Consistency does not come with the finite sample guarantees offered by statistical learning theory; and data are always finite.
Coin Flip Example Part 2. Returning to the coin flip example, equation (2), the log-likelihood is

ℓ(θ) = log p(y|θ) = mH log θ + (m − mH) log(1 − θ).

We can maximize this by differentiating and setting to zero, and doing a few lines of algebra:

0 = dℓ(θ)/dθ |_{θ̂ML} = mH/θ̂ML − (m − mH)/(1 − θ̂ML)
mH(1 − θ̂ML) = (m − mH) θ̂ML
mH − θ̂ML mH = m θ̂ML − θ̂ML mH
θ̂ML = mH/m.    (5)
(It turns out not to be difficult to verify that this is indeed a maximum).
In this case, the maximum likelihood estimate is exactly what we intuitively
thought we should do: estimate θ as the observed proportion of Heads.
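To make this concrete, here is a minimal numerical sketch in Python (the data, seed, and true bias are all hypothetical choices) that checks the closed-form estimate (5) against a brute-force grid maximization of the log-likelihood:

import numpy as np

# Hypothetical data: 100 flips of a coin with true P(Heads) = 0.7.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.7, size=100)
m, m_H = len(y), y.sum()

def log_likelihood(theta):
    # Log of equation (2): mH log(theta) + (m - mH) log(1 - theta).
    return m_H * np.log(theta) + (m - m_H) * np.log(1 - theta)

# Closed-form ML estimate from equation (5).
theta_ml = m_H / m

# Numerical check: evaluate the log-likelihood on a grid and take the max.
grid = np.linspace(0.001, 0.999, 999)
theta_grid = grid[np.argmax(log_likelihood(grid))]

print(theta_ml, theta_grid)  # the two estimates agree to grid resolution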
The maximum a posteriori (MAP) estimate maximizes the posterior rather than the likelihood:

θ̂MAP ∈ arg max_θ p(θ|y) = arg max_θ (log p(y|θ) + log p(θ)).    (6)

We don’t have to worry about evaluating the partition function ∫ p(y|θ')p(θ') dθ', because it does not depend on θ and thus does not affect the arg max. When the prior is uniform, the MAP estimate is identical to the ML estimate, because log p(θ) is constant.
One might ask what would be a bad choice for a prior. We will see later that
reasonable choices of the prior are those that do not assign zero probability to
the true value of θ. If we have such a prior, the MAP estimate is consistent,
which we will discuss in more detail later. Some other properties of the MAP
estimate are illustrated in the next example.
Coin Flip Example Part 3. We again return to the coin flip example. Suppose we model θ using a Beta prior (we will see later why this is a good idea): θ ∼ Beta(α, β). The Beta distribution is:

Beta(θ; α, β) = (1/B(α, β)) θ^{α−1}(1 − θ)^{β−1},    (7)

where B(α, β) is the beta function, and is constant with respect to θ:

B(α, β) = ∫₀¹ t^{α−1}(1 − t)^{β−1} dt.    (8)
The quantities α and β are parameters of the prior which we are free to set according to our prior belief about θ. By varying α and β, we can encode a wide range of possible beliefs, as shown in the figure in the Wikipedia article on the Beta distribution.
We can get the MAP estimate for θ from the formula in (6), plugging in the likelihood we found in Part 2 and the definition of the Beta distribution for the prior (7):

θ̂MAP ∈ arg max_θ (log p(y|θ) + log p(θ))
     = arg max_θ (mH log θ + (m − mH) log(1 − θ) + (α − 1) log θ + (β − 1) log(1 − θ) − log B(α, β)).

Differentiating and setting to zero, as before, yields

θ̂MAP = (mH + α − 1) / (m + α + β − 2).

Comparing this to the ML estimate (5), the MAP estimate is equivalent to adding α − 1 Heads and β − 1 Tails to the data before taking the observed proportion of Heads. For example, choosing α = 7 and β = 3 is like starting the coin tossing experiment with 6 Heads and 2 Tails on the record. If we truly believed before we started flipping coins that the probability of Heads was around 6/8, then this is a good idea. This can be very useful in reducing the variance of the estimate for small samples.
For example, suppose the data contain only one coin flip, a Heads. The ML estimate will be θ̂ML = 1, which predicts that we will never flip Tails! However, we, the modeler, suspect that the coin is probably fair, and can assign α = β = 3 (or some other number with α = β), and we get θ̂MAP = 3/5.
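A small sketch of this computation (the helper name is ours; the data are the one-flip example from the text):

import numpy as np

def theta_map(m_H, m, alpha, beta):
    # MAP estimate for the Bernoulli likelihood with a Beta(alpha, beta)
    # prior: the mode of the posterior, valid for alpha, beta > 1.
    return (m_H + alpha - 1) / (m + alpha + beta - 2)

# One flip, one Heads: ML says theta = 1, but a Beta(3, 3) prior
# shrinks the estimate toward 1/2.
print(theta_map(1, 1, 3, 3))   # 0.6, i.e. 3/5 as in the text
print(1 / 1)                   # ML estimate: 1.0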
Question How would you set α and β for the coin toss under a strong prior
belief vs. a weak prior belief that the probability of Heads was 1/8?
For large samples it is easy to see for the coin flipping that the effect of the prior goes to zero:

lim_{m→∞} θ̂MAP = lim_{m→∞} θ̂ML = θtrue. Why?
Recall what we know about regularization in machine learning: data plus knowledge implies generalization. The prior is the “knowledge” part. One could interpret the MAP estimate as a regularized version of the ML estimate, or a version with “shrinkage.”
Example 1. (Rare Events) The MAP estimate is particularly useful when dealing with rare events. Suppose we are trying to estimate the probability that a given credit card transaction is fraudulent. Perhaps we monitor transactions for a day, and there are no fraudulent transactions. The ML estimate tells us that the probability of credit card fraud is zero. The MAP estimate would allow us to incorporate our prior knowledge that there is some probability of fraud; we just haven’t seen it yet.
Now consider regression. Suppose we have data pairs (xi, yi), i = 1, . . . , m, where the xi ∈ ℝ^d are the explanatory variables and the yi ∈ ℝ are the dependent variables. We will use a likelihood model under which yi depends linearly on xi, and the yi’s are all independent. Specifically, we model

yi = θ^T xi + ε,

where θ ∈ ℝ^d are the parameters we are interested in, and ε represents noise. We will assume that the noise is distributed normally with zero mean and some known variance σ²: ε ∼ N(0, σ²). This, together with our assumption of independence, allows us to express the likelihood function for the observations y = {y1, . . . , ym}:
p(y|x, θ) = ∏_{i=1}^m p(yi|xi, θ) = ∏_{i=1}^m N(yi; θ^T xi, σ²)
          = ∏_{i=1}^m (1/√(2πσ²)) exp(−(1/(2σ²))(yi − θ^T xi)²),
where the first line follows from independence. As usual, it is more convenient to work with the log-likelihood:

ℓ(θ) = Σ_{i=1}^m log [(1/√(2πσ²)) exp(−(1/(2σ²))(yi − θ^T xi)²)]
     = −(m/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^m (yi − θ^T xi)².
The ML estimate is

θ̂ML ∈ arg max_θ ℓ(θ) = arg max_θ −(1/(2σ²)) Σ_{i=1}^m (yi − θ^T xi)²    (10)
    = arg min_θ Σ_{i=1}^m (yi − θ^T xi)².    (11)

This is exactly the least squares problem: under Gaussian noise, maximum likelihood recovers ordinary least squares regression.
Now let us try the MAP estimate. We saw in our coin toss example that the MAP estimate acts as a regularized ML estimate. For probabilistic regression, we will use a multivariate normal prior (we will see later why this is a good idea) with mean 0. The covariance matrix will be diagonal: I (the identity matrix) times σ²/C, where σ²/C is the known, constant variance of each component of θ. So the prior is:

θ ∼ N(0, Iσ²/C), with density p(θ) = (1/((2π)^{d/2} |Iσ²/C|^{1/2})) exp(−(1/2) θ^T (Iσ²/C)^{−1} θ).

Here we are saying that our prior belief is that the “slope” of the regression line is near 0. Now,

θ̂MAP ∈ arg max_θ (log p(y|θ) + log p(θ)),

where we take the log-likelihood log p(y|θ) from (10). At this point you can really see how MAP is a regularized ML. Continuing,
θ̂MAP ∈ arg max_θ [−(1/(2σ²)) Σ_{i=1}^m (yi − θ^T xi)² − (d/2) log 2π − (1/2) log|Iσ²/C| − (1/2) θ^T (IC/σ²) θ]
      = arg max_θ [−(1/(2σ²)) Σ_{i=1}^m (yi − θ^T xi)² − (1/(2σ²)) C θ^T θ]
      = arg min_θ [Σ_{i=1}^m (yi − θ^T xi)² + C ‖θ‖₂²].    (12)

This is exactly ℓ2-regularized least squares (ridge regression), with regularization constant C.
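Since (12) has the closed-form minimizer θ̂MAP = (XᵀX + C I)⁻¹ Xᵀ y (set the gradient of (12) to zero), we can compare the ML and MAP estimates numerically. A minimal sketch, with all data and constants hypothetical:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: m = 50 points in d = 3 dimensions with Gaussian noise.
m, d, sigma, C = 50, 3, 0.5, 2.0
theta_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(m, d))
y = X @ theta_true + rng.normal(0, sigma, size=m)

# ML / least squares: minimize sum (y_i - theta^T x_i)^2, equation (11).
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)

# MAP / ridge: minimize (12); closed form is (X^T X + C I)^{-1} X^T y.
theta_map = np.linalg.solve(X.T @ X + C * np.eye(d), X.T @ y)

print(theta_ml)   # unregularized estimate
print(theta_map)  # shrunk toward the prior mean, zero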
3 Conjugate priors
Although point estimates can be useful in many circumstances (and are used
in many circumstances), our true goal in Bayesian analysis is often to find
the full posterior, p(θ|y). One reason for wanting the posterior is to be able
to use the posterior predictive distribution of a yet unobserved data point:
p(y_{m+1}|y) = ∫ p(y_{m+1}|y, θ) p(θ|y) dθ = ∫ p(y_{m+1}|θ) p(θ|y) dθ,    (13)

where the second equality holds because y_{m+1} is conditionally independent of y given θ.
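The integral in (13) can also be approximated by simple Monte Carlo: draw θ’s from the posterior and average p(y_{m+1}|θ). A minimal sketch for the coin flip, assuming the Beta posterior derived just below and hypothetical data:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical coin-flip setup: 10 Heads in 40 flips, Beta(5, 5) prior.
m, m_H, alpha, beta = 40, 10, 5, 5

# Draw theta from the Beta posterior (conjugate update derived below) and
# average p(y_{m+1} = 1 | theta) = theta over the draws: equation (13).
theta_draws = rng.beta(m_H + alpha, m - m_H + beta, size=100_000)
p_heads_mc = theta_draws.mean()

# Closed form: the posterior mean (m_H + alpha) / (m + alpha + beta).
p_heads_exact = (m_H + alpha) / (m + alpha + beta)

print(p_heads_mc, p_heads_exact)  # should agree to about 3 decimals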
We saw earlier that the posterior can be obtained in principle from the prior
and the likelihood using Bayes’ rule, but that there is an integral in the
denominator which often makes this intractable. One approach to circumventing the integral is to use conjugate priors.
As with the MAP estimate, we can see the interplay between the data yi and the prior parameters α and β in forming the posterior. As before, the exact choice of α and β does not matter asymptotically; the data overwhelm the prior.

If we knew that the Beta distribution is the conjugate prior to the Bernoulli, we could have figured out the same thing faster by recognizing that

p(θ|y) ∝ p(y|θ)p(θ) ∝ θ^{mH+α−1}(1 − θ)^{m−mH+β−1},    (15)

and then realizing that by the definition of a conjugate prior, the posterior must be a Beta distribution. There is exactly one Beta distribution that satisfies (15): the one that is normalized correctly, and that is equation (14), Beta(θ; mH + α, m − mH + β).
The parameters of the prior distribution (α and β in the case of the Beta prior) are called prior hyperparameters. We choose them to best represent our beliefs about the distribution of θ. The parameters of the posterior distribution (mH + α and m − mH + β) are called posterior hyperparameters. Any time a likelihood model is used together with its conjugate prior, we know the posterior is from the same family as the prior, and moreover we have an explicit formula for the posterior hyperparameters. A table summarizing some of the useful conjugate prior relationships follows. There are many more conjugate prior relationships that are not shown in the following table but that can be found in reference books on Bayesian statistics.¹

¹ Bayesian Data Analysis by Gelman, Carlin, Stern, and Rubin is an excellent choice.
Likelihood                    Conjugate Prior   Prior Hyperparams   Posterior Hyperparams
Bernoulli                     Beta              α, β                α + Σ_{i=1}^m yi,  β + m − Σ_{i=1}^m yi
Binomial                      Beta              α, β                α + Σ_{i=1}^m yi,  β + Σ_{i=1}^m mi − Σ_{i=1}^m yi
Poisson                       Gamma             α, β                α + Σ_{i=1}^m yi,  β + m
Geometric                     Beta              α, β                α + m,  β + Σ_{i=1}^m yi
Exponential                   Gamma             α, β                α + m,  β + Σ_{i=1}^m yi
Normal (unknown mean,         Normal            µ0, σ0²             (µ0/σ0² + Σ_{i=1}^m yi/σ²) / (1/σ0² + m/σ²),  (1/σ0² + m/σ²)^{−1}
known variance σ²)
We will now discuss a few of these conjugate prior relationships to try to gain additional insight. For Bernoulli-Beta, we saw that the prior hyperparameters can be interpreted as starting the tossing with a certain number of Heads and Tails on the record. The posterior hyperparameters then simply add the observed number of heads (mH) to the prior hyperparameter for the number of Heads (α), and the observed number of tails (m − mH) to the prior hyperparameter for the number of Tails (β).
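A minimal sketch of the Bernoulli-Beta update (the data and the helper name beta_posterior are our own hypothetical choices):

import numpy as np

def beta_posterior(y, alpha, beta):
    # Bernoulli-Beta conjugate update: the posterior is
    # Beta(alpha + m_H, beta + m - m_H), per the table above.
    m, m_H = len(y), int(np.sum(y))
    return alpha + m_H, beta + m - m_H

# Hypothetical data: 7 Heads and 3 Tails with a Beta(2, 2) prior.
y = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
a_post, b_post = beta_posterior(y, 2, 2)
print(a_post, b_post)  # (9, 5): prior pseudo-counts plus observed counts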
p(θ|y) ∝ p(y|θ)p(θ)
       = [∏_{i=1}^m f(yi)] g(θ)^m exp(φ(θ)^T Σ_{i=1}^m u(yi)) · g(θ)^η exp(φ(θ)^T ν) / (normalizing constant)
       ∝ g(θ)^{η+m} exp(φ(θ)^T (ν + Σ_{i=1}^m u(yi)))    (dropping non-θ terms),

which is in the same family as the prior (17), with the posterior hyperparameters being η + m and ν + Σ_{i=1}^m u(yi).
Although this proof yields the form of the conjugate prior, we may not always be able to compute the partition function, so we can’t always gain something in practice from it.

It turns out that in general, the converse to Theorem 1 is true, and exponential families are the only distributions with (non-trivial) conjugate priors.
Coin Flip Example Part 5. Returning again to the coin flip example, let us first verify that the Bernoulli distribution is an exponential family:

p(y|θ) = θ^y (1 − θ)^{1−y} = (1 − θ) exp(y log(θ/(1 − θ))),

which is in exponential family form with f(y) = 1, g(θ) = 1 − θ, φ(θ) = log(θ/(1 − θ)), and u(y) = y.
4 Posterior asymptotics

Up to this point, we have defined a likelihood model that is parameterized by θ, assigned a prior distribution to θ, and then computed the posterior p(θ|y). There are two natural questions that arise. First, what if we choose the ‘wrong’ likelihood model? That is, what if the data were actually generated by some distribution q(y) such that q(y) ≠ p(y|θ) for any θ, but we use p(y|θ) as our likelihood model? Second, what if we assign the ‘wrong’ prior? We can answer both of these questions asymptotically as m → ∞. First we need a measure of the distance between two distributions; the Kullback–Leibler (KL) divergence,

D(q(·)||p(·)) = ∫ q(y) log (q(y)/p(y)) dy,

will do this
for us.
Theorem 2. For any probability distributions q(·) and p(·), D(q(·)||p(·)) ≥ 0, with equality if and only if q(y) = p(y) ∀y.

Proof. We will rely on Jensen’s inequality, which states that for any convex function f and random variable X, E[f(X)] ≥ f(E[X]). Take f(x) = −log x, which is convex, and X = p(y)/q(y) with y ∼ q(y):

−∫ q(y) log (p(y)/q(y)) dy ≥ −log ∫ q(y) (p(y)/q(y)) dy
                            = −log ∫ p(y) dy
                            = −log 1 = 0,  so
∫ q(y) log (q(y)/p(y)) dy ≥ 0,

with equality under the same conditions required for equality in Jensen’s inequality: if and only if X is constant, that is, q(y) = p(y) ∀y.
We will use the KL divergence to find the distribution from the likelihood family that is ‘closest’ to the true generating distribution:

θ∗ = arg min_{θ∈Θ} D(q(·)||p(·|θ)).    (18)

For convenience, we will suppose that the arg min in (18) is unique. The results can be easily extended to the case where the arg min is not unique. The main results of this section are two theorems, the first for discrete parameter spaces and the second for continuous parameter spaces. The intuition for the theorem is that as long as there is some probability in the prior that θ = θ∗, then as m → ∞ the whole posterior will be close to θ∗.

Theorem 3 (Posterior consistency in finite parameter space). Let F be a finite family of likelihood models, with Θ = {θ : p(·|θ) ∈ F} the finite parameter space. Let y = (y1, . . . , ym) be a set of independent samples from an arbitrary distribution q(·), and θ∗ be as in (18). If p(θ = θ∗) > 0, then p(θ = θ∗|y) → 1 as m → ∞.
Proof. Consider any θ ≠ θ∗:

log (p(θ|y)/p(θ∗|y)) = log (p(θ)p(y|θ) / (p(θ∗)p(y|θ∗))) = log (p(θ)/p(θ∗)) + Σ_{i=1}^m log (p(yi|θ)/p(yi|θ∗)).    (19)

By the law of large numbers, as m → ∞,

(1/m) Σ_{i=1}^m log (p(yi|θ)/p(yi|θ∗)) → E_{y∼q} [log (p(y|θ)/p(y|θ∗))] = D(q(·)||p(·|θ∗)) − D(q(·)||p(·|θ)),

which is strictly negative because θ∗ is the unique minimizer in (18); thus the sum in (19) tends to −∞. Let’s plug back into (19). By supposition, p(θ∗) > 0 (and the prior doesn’t change as m → ∞). Thus as m → ∞, (19) obeys

log (p(θ|y)/p(θ∗|y)) = log (p(θ)/p(θ∗)) + Σ_{i=1}^m log (p(yi|θ)/p(yi|θ∗)) → −∞,
and

log (p(θ|y)/p(θ∗|y)) → −∞  implies  p(θ|y)/p(θ∗|y) → 0,  which implies  p(θ|y) → 0.

This holds for every θ ≠ θ∗, thus p(θ∗|y) → 1.
Again, this theorem tells us that the posterior eventually becomes concentrated on the value θ∗. θ∗ corresponds to the likelihood model that is ‘closest’ in the KL sense to the true generating distribution q(·). If q(·) = p(·|θ0) for some θ0 ∈ Θ, then θ∗ = θ0 is the unique minimizer of (18) because p(·|θ∗) is closest to p(·|θ0) when θ∗ = θ0. The theorem tells us that the posterior will become concentrated around the true value. This is only the case if in the prior, p(θ∗) > 0, which shows the importance of choosing a prior that assigns non-zero probability to every plausible value of θ.
We now present the continuous version of Theorem 3, but the proof is quite
technical and is omitted.
Theorem 4 (Posterior consistency in continuous parameter space). If Θ is a compact set, θ∗ is defined as in (18), and A is a neighborhood of θ∗ with prior probability p(θ ∈ A) > 0, then p(θ ∈ A|y) → 1 as m → ∞.
These theorems show that asymptotically, the choice of the prior does not
matter as long as it assigns non-zero probability to every θ ∈ Θ. They also
show that if the data were generated by a member of the family of likelihood
models, we will converge to the correct likelihood model. If not, then we will
converge to the model that is ‘closest’ in the KL sense.
We give these theorems along with a word of caution. These are asymptotic
results that tell us absolutely nothing about the sort of m we encounter in
practical applications. For small sample sizes, poor choices of the prior or
likelihood model can yield poor results and we must be cautious.
Coin Flip Example Part 6. Suppose that the coin flip data truly came from a biased coin with a 3/4 probability of Heads, but we restrict the likelihood model to only include coins with probability in the interval [0, 1/2]:

p(y|θ) = θ^{mH}(1 − θ)^{m−mH},  θ ∈ [0, 1/2].

This time we will use a uniform prior, p(θ) = 2, θ ∈ [0, 1/2]. The posterior distribution is then

p(θ|y) = θ^{mH}(1 − θ)^{m−mH} · 2 / ∫₀^{1/2} θ'^{mH}(1 − θ')^{m−mH} · 2 dθ'.

The partition function is an incomplete Beta integral, which has no closed form but can easily be computed numerically. In the following figure, we draw samples iid from the true distribution (3/4 probability of Heads) and show how the posterior becomes increasingly concentrated around θ = 0.5, the closest likelihood function to the true generating distribution.
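A small numerical sketch of this computation (the draws, grid resolution, and seed are hypothetical choices):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical draws from the true coin, P(Heads) = 3/4.
m = 200
y = rng.binomial(1, 0.75, size=m)
m_H = y.sum()

# Unnormalized posterior on the restricted space theta in (0, 1/2],
# normalized numerically on a grid (the incomplete Beta integral).
theta = np.linspace(1e-6, 0.5, 2000)
log_post = m_H * np.log(theta) + (m - m_H) * np.log(1 - theta)
post = np.exp(log_post - log_post.max())       # subtract max for stability
post /= post.sum() * (theta[1] - theta[0])     # Riemann-sum normalization

print(theta[np.argmax(post)])  # concentrates near 0.5 as m grows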
5 Hierarchical modeling

Hierarchical models are a powerful application of Bayesian analysis. They allow us to reason about knowledge at multiple levels of abstraction. Basically, the prior parameters at the lower level come from a shared distribution at the higher level, and there is a prior at that level, and so on. This is best explained by example, so we will develop a hierarchical model for topic modeling, an important information retrieval problem.
The model has several components. The data are m documents, with document i consisting of ni words. Each word in the document will be associated with one of K topics. We let zi,j denote the topic of word j in document i. We model zi,j ∼ Multinomial(θi), where θi ∈ ℝ^K describes the topic mixture of document i.
For each topic, we define a multinomial distribution over all possible words. For example, given the topic is “Sports”, the probability of having the word “football” might be high; if the topic were “Weather”, the probability of having the word “football” might be lower. Other words, like “the”, will have a high probability regardless of the topic. If words are chosen from a set of W possible words, then we let φk ∈ ℝ^W be the multinomial parameter over words for topic k. Word j of document i, denoted wi,j, will be generated by the distribution over words corresponding to the topic zi,j: wi,j ∼ Multinomial(φ_{zi,j}).

Finally, we give prior distributions for the parameters θi and φk. The multinomial distribution is a generalization of the binomial distribution, and its conjugate prior is the Dirichlet distribution, a generalization of the Beta: we take θi ∼ Dirichlet(α) and φk ∼ Dirichlet(β). This model is known as latent Dirichlet allocation (LDA).
We are interested in inferring the posterior distribution for the latent variables. Let Z = {zi,j}_{i=1,...,m, j=1,...,ni}, θ = {θi}_{i=1,...,m}, φ = {φk}_{k=1,...,K}, and W = {wi,j}_{i=1,...,m, j=1,...,ni}. Then, by Bayes’ rule, ignoring the constant denominator, we can express the posterior as:

p(Z, θ, φ|W, α, β) ∝ p(W|Z, θ, φ, α, β) p(Z, θ, φ|α, β).    (21)

We will look at each of these pieces and show that they have a compact analytical form.
p(W|Z, φ, θ, α, β) = ∏_{i=1}^m ∏_{j=1}^{ni} p(wi,j|Z, φ, θ, α, β)    (by the iid assumption)
                   = ∏_{i=1}^m ∏_{j=1}^{ni} p(wi,j|zi,j, φ)          (by conditional independence)
                   = ∏_{i=1}^m ∏_{j=1}^{ni} Multinomial(wi,j; φ_{zi,j})    (by definition).
Also,

p(Z, θ, φ|α, β) = p(Z, θ|α) p(φ|β)

by conditional independence. Again considering each term, by the definition of conditional probability:

p(Z, θ|α) = p(Z|θ, α) p(θ|α) = p(Z|θ, α) ∏_{i=1}^m Dirichlet(θi; α),

where

p(Z|θ, α) = p(Z|θ) = ∏_{i=1}^m ∏_{j=1}^{ni} p(zi,j|θi) = ∏_{i=1}^m ∏_{j=1}^{ni} Multinomial(zi,j; θi),
and

p(φ|β) = ∏_{k=1}^K p(φk|β) = ∏_{k=1}^K Dirichlet(φk; β).
Plugging all of these pieces back into (21), we obtain

p(Z, θ, φ|W, α, β) ∝ ∏_{i=1}^m ∏_{j=1}^{ni} Multinomial(wi,j; φ_{zi,j}) ∏_{i=1}^m ∏_{j=1}^{ni} Multinomial(zi,j; θi) × ∏_{i=1}^m Dirichlet(θi; α) ∏_{k=1}^K Dirichlet(φk; β).    (22)
For any given Z, θ, φ, W, α, and β, we can easily evaluate (22). (All the normalizations are known, so it’s easy.) We will see in the next section that this is sufficient to be able to simulate draws from the posterior.
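For instance, a direct (unoptimized) evaluation of the log of (22) might look like the following sketch; the function name, data layout, and toy inputs are our own conventions, not part of LDA itself:

import numpy as np
from scipy.stats import dirichlet

def lda_log_posterior(Z, theta, phi, W, alpha, beta):
    # Log of the unnormalized posterior (22).
    # Z[i][j]: topic of word j in doc i; W[i][j]: word id;
    # theta[i]: topic mixture of doc i; phi[k]: word distribution of topic k;
    # alpha, beta: Dirichlet hyperparameter vectors.
    lp = 0.0
    for i in range(len(W)):
        lp += dirichlet.logpdf(theta[i], alpha)   # Dirichlet(theta_i; alpha)
        for j in range(len(W[i])):
            k = Z[i][j]
            lp += np.log(theta[i][k])             # Multinomial(z_ij; theta_i)
            lp += np.log(phi[k][W[i][j]])         # Multinomial(w_ij; phi_{z_ij})
    for k in range(len(phi)):
        lp += dirichlet.logpdf(phi[k], beta)      # Dirichlet(phi_k; beta)
    return lp

# Toy usage: 2 topics, 3 words, 1 document of 2 words (all values hypothetical).
theta = [np.array([0.6, 0.4])]
phi = [np.array([0.5, 0.3, 0.2]), np.array([0.1, 0.2, 0.7])]
print(lda_log_posterior([[0, 1]], theta, phi, [[0, 2]], np.ones(2), np.ones(3)))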
Under mild conditions that will be satisfied by all of the chains we are interested in, the sequence of state distributions π⁰(θ), π¹(θ), . . . will converge to a unique distribution π(θ), which we call the stationary distribution (or equilibrium distribution, or steady-state distribution).
There are many options when choosing a proposal distribution which we will
discuss later. The proposal distribution will yield a random walk over the
parameter space, proposing steps θ → θ' . We accept or reject each step
depending on the relative posterior probabilities for θ and θ' . When we run
the random walk for long enough, the accepted values will simulate draws
from the posterior.
Suppose we are considering the transition θ → θ' . If p(θ' |y) is larger than
p(θ|y), then for every accepted draw of θ, we should have at least as many
accepted draws of θ' and so we should always accept the transition θ → θ' .
If p(θ'|y) is less than p(θ|y), then for every accepted draw of θ, we should have on average p(θ'|y)/p(θ|y) accepted draws of θ'. We thus accept the transition with probability p(θ'|y)/p(θ|y). Thus for any transition, we accept the transition with probability

min (p(θ'|y)/p(θ|y), 1).    (24)
Equation (25) reduces to what we developed intuition for in (24) when the
proposal distribution is symmetric: J(θ, θ' ) = J(θ' , θ). We will see in the
next theorem that the extra factors in (25) are necessary for the posterior to
be the stationary distribution.
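A minimal sketch of this accept/reject rule for the coin flip posterior, using a symmetric Gaussian random walk proposal so that (24) applies directly (the data, prior, step size, and chain length are all hypothetical):

import numpy as np

rng = np.random.default_rng(0)

# Unnormalized log posterior for the coin flip with a Beta(alpha, beta)
# prior. The partition function is not needed, because (24) only uses the
# ratio p(theta'|y) / p(theta|y).
m, m_H, alpha, beta = 40, 10, 5, 5
def log_post(theta):
    if theta <= 0 or theta >= 1:
        return -np.inf       # zero posterior outside (0, 1)
    return (m_H + alpha - 1) * np.log(theta) + \
           (m - m_H + beta - 1) * np.log(1 - theta)

r = 0.1                      # proposal step size (see the tuning discussion below)
theta = 0.5                  # starting point
draws = []
for t in range(20_000):
    prop = theta + rng.normal(0, r)          # symmetric Gaussian proposal
    # Accept with probability min(p(prop|y)/p(theta|y), 1), equation (24).
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    draws.append(theta)

samples = np.array(draws[10_000:])           # discard burn-in
print(samples.mean())  # near the posterior mean (m_H+alpha)/(m+alpha+beta) = 0.3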
Because the proposal distribution and α(θ^{t−1}, θ∗) depend only on the current state, the sequence θ⁰, θ¹, . . . forms a Markov chain. What makes the Metropolis-Hastings algorithm special is the following theorem, which shows that if we simulate the chain long enough, we will simulate draws from the posterior.
Theorem 5. If J(θ, θ') is such that the Markov chain θ⁰, θ¹, . . . produced by the Metropolis-Hastings algorithm has a unique stationary distribution, then the stationary distribution is p(θ|y).
Proof. We will use the fact given in (23) that if the posterior p(θ|y) satisfies
the detailed balance equation, then it is the stationary distribution. Thus we
wish to show:
K(θ, θ' )p(θ|y) = K(θ' , θ)p(θ' |y), for all θ, θ' .
where the transition kernel comes from Metropolis-Hastings:
K(θ, θ' ) = (probability of proposing θ' )×
(probability of accepting θ' given it was proposed)
= J(θ, θ' )α(θ, θ' ).
To show that the detailed balance equation holds, take any θ and θ' , and
without loss of generality, suppose that α is less than or equal to 1 for the
transition θ to θ' , which means
J(θ, θ' )p(θ|y) ≥ J(θ' , θ)p(θ' |y),
so that

α(θ, θ') = J(θ', θ)p(θ'|y) / (J(θ, θ')p(θ|y)),

and α(θ', θ) = 1. Now let’s plug in:

K(θ, θ')p(θ|y) = J(θ, θ')α(θ, θ')p(θ|y)
               = J(θ, θ')p(θ|y) · J(θ', θ)p(θ'|y) / (J(θ, θ')p(θ|y))
               = J(θ', θ)p(θ'|y)    (cancel terms)
               = J(θ', θ)α(θ', θ)p(θ'|y)    (1 in disguise)
               = K(θ', θ)p(θ'|y).
And that’s the detailed balance equation.
In fact, it is often the case in hierarchical models that their structure allows us to determine these conditional posterior distributions analytically. This can be done for LDA; the derivation is quite lengthy but can be found in the Wikipedia article on LDA. The Gibbs sampler updates the posterior variables θ1, θ2, . . . , θd one at a time. At each step, all of them are held constant in their current state except for one (j), which is then updated by drawing from its (known) conditional posterior distribution, p(θj|θ1, . . . , θ_{j−1}, θ_{j+1}, . . . , θd, y). We then hold it in its new state and move on to the next variable, updating it similarly. When we do this iterative updating for long enough, we eventually simulate draws from the full posterior. The full algorithm is below; a small numerical sketch follows the steps.
Step 1. Initialize θ⁰ = [θ1⁰, . . . , θd⁰]. Set t = 1.
Step 2. For j ∈ {1, . . . , d}, sample θj^t from p(θj | θ1^t, . . . , θ_{j−1}^t, θ_{j+1}^{t−1}, . . . , θd^{t−1}, y).
Step 3. Until the stationary distribution and the desired number of draws are reached, increment t ← t + 1 and return to Step 2.
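As a toy illustration of Steps 1-3 (not the LDA sampler; the target here is a standard bivariate normal with correlation ρ, where both conditionals are known univariate normals):

import numpy as np

rng = np.random.default_rng(0)

# Gibbs sampling for a standard bivariate normal with correlation rho.
# Each conditional is univariate: theta1 | theta2 ~ N(rho*theta2, 1 - rho^2),
# and symmetrically for theta2 | theta1.
rho = 0.8
theta1, theta2 = 0.0, 0.0    # Step 1: initialize
draws = []
for t in range(20_000):      # Steps 2-3: sweep the coordinates in turn
    theta1 = rng.normal(rho * theta2, np.sqrt(1 - rho ** 2))
    theta2 = rng.normal(rho * theta1, np.sqrt(1 - rho ** 2))
    draws.append((theta1, theta2))

samples = np.array(draws[10_000:])    # discard burn-in
print(np.corrcoef(samples.T)[0, 1])   # should be close to rho = 0.8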
This is probably the largest flaw in MCMC sampling. There are a large number of heuristics used to assess convergence that generally involve looking at how θ^t varies with time. In general, the initial simulations depend strongly on the starting point and are thrown away. This is referred to as burn-in, and often involves discarding the first half of all draws.
The resulting acceptance ratio is easy to compute. This formula and a uniform random number generator for the proposal distribution are all that is required to implement the Metropolis-Hastings algorithm.

In the next figure, with r = 1 the steps are too large and we reject most of the proposals. This leads to a small number of accepted draws after 2000 proposals.
The chain with r = 0.1 is the happy medium, where we rapidly reach the
stationary distribution, and accept most of the samples.
The final figure shows a histogram of the resulting draws, along with the analytical posterior. The running time to generate the MCMC draws was less than a second, and they are a reasonable approximation to the posterior.
By this point it should come as no surprise that our first example will be the coin flip example. Specifying the model in OpenBUGS is very natural, and uses two main symbols: “<-” means deterministic assignment, as in R, and “~” means distributional assignment. OpenBUGS includes many different distributions; a full list of how to specify them is at:

http://mathstat.helsinki.fi/openbugs/Manuals/ModelSpecification.html#ContentsAI

The coin flip model is:

model {
    Y ~ dbin(theta, m)
    theta ~ dbeta(alpha, beta)
}
Here dbin and dbeta are the binomial and beta distributions, respectively. Notice that in the model we do not specify the constants m, α, and β. We then specify the data, including all constants. Data are specified in OpenBUGS as an R list. Here, for 10 Heads out of 40 flips and priors α = β = 5, we write:

list(Y=10, m=40, alpha=5, beta=5)

Finally, we specify the initial parameter values for each chain, each as an R list. Here we will start one chain from θ = 0, and the other from θ = 1:

list(theta=0)
list(theta=1)
Alternatively, instead of loading initial values for both chains, we could have pressed the button “gen inits” in the model specification window, which generates random initial values for both chains. We may now close the model specification window.

OpenBUGS will only store whatever results we tell it to. Before simulating the chains, we must tell OpenBUGS to store the values of theta that it simulates.
To view the results, we close the update tool and return to the sample monitor tool (Inference -> Samples). Here there are a few useful quantities to look at. We type in “theta” as the node we are interested in, and press accept.
Pressing “accept” shows the acceptance rate vs. iteration number, as in the following figure.
http://www.openbugs.info/Examples/Pumps.html
This example is a hierarchical model for pump failure. There are 10 pumps,
pump i being in operation for a certain amount of time ti and having had a
certain number of failures xi . We wish to estimate the failure rate for each
pump, and we will do so with a Bayesian hierarchical model. We suspect that
the failure rates are related for all of the pumps, so we include a common prior.
We give flexibility to the prior by sampling it from a diffuse distribution. The
generative model for the observed number of pump failures is then:
model {
    alpha ~ dexp(1)
    beta ~ dgamma(0.1, 1.0)
    for (i in 1 : 10) {
        theta[i] ~ dgamma(alpha, beta)
        lambda[i] <- theta[i] * t[i]
        x[i] ~ dpois(lambda[i])
    }
}

Notice in the above that OpenBUGS does not allow mathematical operations inside of calls to distributions like dpois(theta[i]*t[i]), so we had to use a deterministic relationship to specify the new variable lambda[i]. The data to be specified are “t” and “x”, and we do this with an R list:
list(t = c(94.3, 15.7, 62.9, 126, 5.24, 31.4, 1.05, 1.05, 2.1, 10.5),
x = c( 5, 1, 5, 14, 3, 19, 1, 1, 4, 22))
MIT OpenCourseWare
http://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.