Bayesian Inference
Semester 1, 2019–20
You were introduced to the Bayesian approach to statistical inference in MAS2903. That module presented statistical analysis in a very different light to the frequentist approach used in other courses. The frequentist approach bases inference on the sampling distribution of (usually unbiased) estimators; as you may recall, the Bayesian framework combines information expressed as expert subjective opinion with experimental data. You have probably realised that the Bayesian approach has many advantages over the frequentist approach. In particular, it provides a more natural way of dealing with parameter uncertainty, and its inferences are far more straightforward to interpret.
Much of the work in this module will be concerned with extending the ideas presented in
MAS2903 to more realistic models with many parameters that you may encounter in real
life situations. These notes are split into four chapters:
• Chapter 1 reviews some of the key results for Bayesian inference of single param-
eter problems studied in Stage 2. It also introduces the idea of a mixture prior
distribution.
• Chapter 2 studies the case of a random sample from a normal population and
determines how to make inferences about the population mean and precision, and
about future values from the population. The Group Project is based on this
material.
• Chapter 3 contains some general results for multi-parameter problems. You will
encounter familiar concepts, such as how to represent vague prior information and
the asymptotic normal posterior distribution.
• Chapter 4 introduces Markov chain Monte Carlo techniques which have truly revo-
lutionised the use of Bayesian inference in applications. Inference proceeds by sim-
ulating realisations from the posterior distribution. The ideas will be demonstrated
using an R library specially written for the module. This material is extended in the
4th year module MAS8951: Modern Bayesian Inference.
Chapter 1: Single Parameter Problems
This chapter reviews some of the key results for Bayesian inference of single parameter
problems studied in MAS2903.
Suppose we have data x = (x1 , x2 , . . . , xn )T which we model using the probability (density)
function f (x|θ), which depends on a single parameter θ. Once we have observed the data,
f (x|θ) is the likelihood function for θ and is a function of θ (for fixed x) rather than of x
(for fixed θ).
Also, suppose we have prior beliefs about likely values of θ expressed by a probability
(density) function π(θ). We can combine both pieces of information using the following
version of Bayes Theorem. The resulting distribution for θ is called the posterior distri-
bution for θ as it expresses our beliefs about θ after seeing the data. It summarises all
our current knowledge about the parameter θ.
Using Bayes Theorem, the posterior probability (density) function for θ is
$$\pi(\theta|x) = \frac{\pi(\theta)\, f(x|\theta)}{f(x)},$$
where
$$f(x) = \begin{cases} \displaystyle\int_\Theta \pi(\theta)\, f(x|\theta)\, d\theta & \text{if } \theta \text{ is continuous}, \\[2ex] \displaystyle\sum_\Theta \pi(\theta)\, f(x|\theta) & \text{if } \theta \text{ is discrete}. \end{cases}$$
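In the conjugate examples below f(x) never needs to be evaluated explicitly, but Bayes Theorem can also be applied numerically. The following R sketch (not part of the notes) approximates a posterior density on a grid; the Poisson likelihood and Ga(2, 1) prior anticipate Example 1.1, and the data vector x used here is purely illustrative.

x     <- c(0, 2, 1, 0, 1)                    # hypothetical Poisson counts
theta <- seq(0.001, 6, by = 0.001)           # grid of theta values
prior <- dgamma(theta, shape = 2, rate = 1)  # Ga(2, 1) prior density
lik   <- sapply(theta, function(t) prod(dpois(x, t)))   # likelihood f(x | theta)
post  <- prior * lik / sum(prior * lik * 0.001)         # normalise (Riemann sum approximates f(x))
plot(theta, post, type = "l", xlab = expression(theta), ylab = "density")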
Example 1.1
Table 1.1 shows some data on the number of cases of foodborne botulism in England and Wales. It is believed that cases occur at random at a constant rate θ in time (a Poisson process) and so can be modelled as a random sample from a Poisson distribution with mean θ.

Table 1.1: Number of cases of foodborne botulism in England and Wales, 1998–2005

An expert in the epidemiology of similar diseases gives their prior distribution for the rate θ as a Ga(2, 1) distribution, with density
$$\pi(\theta) = \theta e^{-\theta}, \qquad \theta > 0,$$
and mean E(θ) = 2 and variance Var(θ) = 2. Determine the posterior distribution for θ.
Solution
Bayes Theorem combines the expert opinion with the observed data, and gives the pos-
terior density function as
The only continuous distribution with density of the form $k\theta^{g-1} e^{-h\theta}$, θ > 0, is the Ga(g, h) distribution. Therefore, the posterior distribution must be θ|x ∼ Ga(8, 9).
Thus the data have updated our beliefs about θ from a Ga(2, 1) distribution to a Ga(8, 9)
distribution. Plots of these distributions are given in Figure 1.1, and Table 1.2 gives a summary of the main changes induced by incorporating the data; recall that a Ga(g, h) distribution has mean g/h, variance g/h² and mode (g − 1)/h.

[Figure 1.1: prior Ga(2, 1) and posterior Ga(8, 9) densities for θ.]
Notice that, as the mode of the likelihood function is close to that of the prior distribution,
the information in the data is consistent with that in the prior distribution. Also there
is a reduction in variability from the prior to the posterior distributions. The similarity
between the prior beliefs and the data has reduced the uncertainty we have about the
rate θ at which cases occur.
Table 1.2: Summary of the prior, likelihood and posterior for θ

            Prior (1.1)   Likelihood (1.2)   Posterior (1.3)
  Mode(θ)   1.00          0.75               0.78
  E(θ)      2.00          –                  0.89
  SD(θ)     1.41          –                  0.31
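A Ga(8, 9) posterior under the Ga(2, 1) prior corresponds to n = 8 years of data with a total of 6 cases, and the summaries in Table 1.2 are then easy to reproduce in R. The sketch below is a check, not part of the original notes.

g <- 2; h <- 1                  # prior parameters
n <- 8; sumx <- 6               # eight years (1998-2005) with six cases in total
G <- g + sumx; H <- h + n       # posterior parameters: Ga(8, 9)
c(prior.mode = (g - 1) / h, lik.mode = sumx / n, post.mode = (G - 1) / H)
c(prior.mean = g / h, post.mean = G / H)
c(prior.sd = sqrt(g) / h, post.sd = sqrt(G) / H)
theta <- seq(0, 5, length.out = 500)
plot(theta, dgamma(theta, G, H), type = "l", ylab = "density")   # posterior, as in Figure 1.1
lines(theta, dgamma(theta, g, h), lty = 2)                       # prior (dashed)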
Example 1.2
Suppose, more generally, that we have a random sample x = (x1, x2, . . . , xn)T from a Po(θ) distribution and that our prior beliefs about θ follow a Ga(g, h) distribution. Determine the posterior distribution for θ.
Solution
where k is a constant that does not depend on θ. Therefore, the posterior density takes the form $k\theta^{G-1} e^{-H\theta}$, θ > 0, and so the posterior must be a gamma distribution. Thus we have θ|x ∼ Ga(G = g + nx̄, H = h + n).
Summary:
If we have a random sample from a P o(θ) distribution and our prior beliefs about θ follow
a Ga(g, h) distribution then, after incorporating the data, our (posterior) beliefs about θ
follow a Ga(g + nx̄, h + n) distribution.
The changes in our beliefs about θ are summarised in Table 1.3, taking g ≥ 1. Notice that the posterior mean is greater than the prior mean if and only if the likelihood mode is greater than the prior mean, that is,
$$E(\theta|x) > E(\theta) \iff \bar{x} > \frac{g}{h}.$$
The standard deviation of the posterior distribution is smaller than that of the prior distribution if and only if the sample mean is not too large, that is,
$$SD(\theta|x) < SD(\theta) \iff \text{Mode}_\theta\{f(x|\theta)\} < \left(2 + \frac{n}{h}\right) E(\theta),$$
and this will be true in large samples.
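As a check (this derivation is not written out in the notes), the condition follows directly from the Ga(g + nx̄, h + n) posterior:
$$SD(\theta|x)^2 < SD(\theta)^2
\iff \frac{g + n\bar{x}}{(h + n)^2} < \frac{g}{h^2}
\iff h^2 n\bar{x} < g(2hn + n^2)
\iff \bar{x} < \left(2 + \frac{n}{h}\right)\frac{g}{h},$$
and the right-hand side is (2 + n/h) E(θ), which grows without bound as n → ∞ (for fixed g and h), so the condition holds in large samples.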
Example 1.3
Suppose we have a random sample from a normal distribution. In Bayesian statistics, when
dealing with the normal distribution, the mathematics is more straightforward working
with the precision (= 1/variance) of the distribution rather than the variance itself.
So we will assume that this population has unknown mean µ but known precision τ :
Xi |µ ∼ N(µ, 1/τ ), i = 1, 2, . . . , n (independent), where τ is known. Suppose our prior
beliefs about µ can be summarised by a N(b, 1/d) distribution, with probability density
function
$$\pi(\mu) = \left(\frac{d}{2\pi}\right)^{1/2} \exp\left\{-\frac{d}{2}(\mu - b)^2\right\}. \tag{1.7}$$
Determine the posterior distribution for µ.
Hint:
$$d(\mu - b)^2 + n\tau(\bar{x} - \mu)^2 = (d + n\tau)\left(\mu - \frac{db + n\tau\bar{x}}{d + n\tau}\right)^2 + c,$$
where c does not depend on µ.
Solution
Now
$$\begin{aligned}
\sum_{i=1}^n (x_i - \mu)^2 &= \sum_{i=1}^n (x_i - \bar{x} + \bar{x} - \mu)^2 \\
&= \sum_{i=1}^n \left\{(x_i - \bar{x})^2 + (\bar{x} - \mu)^2 + 2(x_i - \bar{x})(\bar{x} - \mu)\right\} \\
&= \sum_{i=1}^n \left\{(x_i - \bar{x})^2 + (\bar{x} - \mu)^2\right\} + 2(\bar{x} - \mu)\sum_{i=1}^n (x_i - \bar{x}) \\
&= \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2.
\end{aligned}$$
Let $s^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$, and so
$$\sum_{i=1}^n (x_i - \mu)^2 = n\left\{s^2 + (\bar{x} - \mu)^2\right\}.$$
Therefore
$$f(x|\mu) = \left(\frac{\tau}{2\pi}\right)^{n/2} \exp\left\{-\frac{n\tau}{2}\left[s^2 + (\bar{x} - \mu)^2\right]\right\}. \tag{1.8}$$
Using Bayes Theorem, the posterior density function is, for µ ∈ ℝ,
$$\begin{aligned}
\pi(\mu|x) &\propto \pi(\mu)\, f(x|\mu) \\
&\propto \left(\frac{d}{2\pi}\right)^{1/2} \exp\left\{-\frac{d}{2}(\mu - b)^2\right\} \times \left(\frac{\tau}{2\pi}\right)^{n/2} \exp\left\{-\frac{n\tau}{2}\left[s^2 + (\bar{x} - \mu)^2\right]\right\} \\
&\propto \exp\left\{-\frac{1}{2}\left[d(\mu - b)^2 + n\tau(\bar{x} - \mu)^2\right]\right\} \\
&\propto \exp\left\{-\frac{1}{2}(d + n\tau)\left(\mu - \frac{db + n\tau\bar{x}}{d + n\tau}\right)^2\right\},
\end{aligned}$$
using the hint, since the term c does not depend on µ and can be absorbed into the constant of proportionality. Therefore the posterior distribution is µ|x ∼ N(B, 1/D), where
$$B = \frac{db + n\tau\bar{x}}{d + n\tau} \quad\text{and}\quad D = d + n\tau.$$
Summary:
If we have a random sample from a N(µ, 1/τ ) distribution (with τ known) and our prior
beliefs about µ follow a N(b, 1/d) distribution then, after incorporating the data, our
(posterior) beliefs about µ follow a N(B, 1/D) distribution.
The changes in our beliefs about µ are summarised in Table 1.4. Notice that the posterior mean is greater than the prior mean if and only if the likelihood mode (sample mean) is greater than the prior mean, that is,
$$E(\mu|x) > E(\mu) \iff \bar{x} > b.$$
Also, the standard deviation of the posterior distribution is smaller than that of the prior
distribution.
Example 1.4
The 18th century physicist Henry Cavendish made 23 experimental determinations of the
earth’s density, and these data (in g/cm3 ) are given below.
Suppose that Cavendish asserts that the error standard deviation of these measurements
is 0.2 g/cm3 , and assume that they are normally distributed with mean equal to the
true earth density µ. Using a normal prior distribution for µ with mean 5.41 g/cm3 and
standard deviation 0.4 g/cm3 , derive the posterior distribution for µ.
Solution
From the data we calculate x̄ = 5.4848 and s = 0.1882. Therefore, the assumed standard deviation σ = 0.2 is probably okay. We also have τ = 1/0.2², b = 5.41, d = 1/0.4² and n = 23. Therefore, using Example 1.3, the posterior distribution is µ|x ∼ N(B, 1/D), where
$$B = \frac{db + n\tau\bar{x}}{d + n\tau} = \frac{5.41/0.4^2 + 23 \times 5.4848/0.2^2}{1/0.4^2 + 23/0.2^2} = 5.4840$$
and
$$D = d + n\tau = \frac{1}{0.4^2} + \frac{23}{0.2^2} = \frac{1}{0.0415^2}.$$
Therefore the posterior distribution is µ|x ∼ N(5.484, 0.0415²) and is shown in Figure 1.2.

The actual mean density of the earth is 5.515 g/cm3 (Wikipedia). We can determine the (posterior) probability that the mean density is within 0.1 of this value as follows. The posterior distribution is µ|x ∼ N(5.484, 0.0415²) and so
$$Pr(5.415 < \mu < 5.615 \mid x) = 0.9510,$$
calculated using the R command pnorm(5.615,5.484,0.0415)-pnorm(5.415,5.484,0.0415).

Without the data, the only basis for determining the earth's density is via the prior distribution. Here the prior distribution is µ ∼ N(5.41, 0.4²) and so the (prior) probability that the mean density is within 0.2 of the (now known) true value is
Figure 1.2: Prior (dashed) and posterior (solid) densities for the earth’s density
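The calculations above are easily checked in R; the sketch below is not part of the notes and simply re-uses the sample mean x̄ = 5.4848 reported in the solution.

b <- 5.41; d <- 1 / 0.4^2          # prior N(b, 1/d)
tau <- 1 / 0.2^2; n <- 23          # known measurement precision and sample size
xbar <- 5.4848                     # sample mean reported above
B <- (d * b + n * tau * xbar) / (d + n * tau)    # posterior mean
D <- d + n * tau                                 # posterior precision
c(B = B, post.sd = 1 / sqrt(D))                  # approximately 5.484 and 0.0415
pnorm(5.615, B, 1 / sqrt(D)) - pnorm(5.415, B, 1 / sqrt(D))   # approximately 0.951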
We have substantial prior information for θ when the prior distribution dominates the
posterior distribution, that is π(θ|x) ∼ π(θ).
When we have substantial prior information there can be some difficulties:
When prior information about θ is limited, the pragmatic approach is to choose a distri-
bution which makes the Bayes updating from prior to posterior mathematically straight-
forward, and use what prior information is available to determine the parameters of this
distribution. For example
In these examples, the prior distribution and the posterior distribution come from the
same family. This leads us to the following definition.
Definition 1.1
Suppose that data x are to be observed with distribution f (x|θ). A family F of prior
distributions for θ is said to be conjugate to f (x|θ) if for every prior distribution π(θ) ∈ F,
the posterior distribution π(θ|x) is also in F.
Notice that the conjugate family depends crucially on the model chosen for the data x.
For example, the only family conjugate to the model “random sample from a Poisson
distribution” is the Gamma family.
If we have very little or no prior information about the model parameters θ, we must
still choose a prior distribution in order to operate Bayes Theorem. Obviously, it would
be sensible to choose a prior distribution which is not concentrated about any particular
value, that is, one with a very large variance. In particular, most of the information
about θ will be passed through to the posterior distribution via the data, and so we have
π(θ|x) ∼ f (x|θ).
We represent vague prior knowledge by using a prior distribution which is conjugate to
the model for x and which is as diffuse as possible, that is, has as large a variance as
possible.
Example 1.5
Suppose we have a random sample from a N(µ, 1/τ ) distribution (with τ known). De-
termine the posterior distribution assuming a vague prior for µ.
Solution
Vague prior knowledge about µ corresponds to a very large prior variance, that is, d → 0, in which case B → x̄ and D → nτ.
Therefore, assuming vague prior knowledge for µ results in a N{x̄, 1/(nτ )} posterior
distribution.
Notice that the posterior mean is the sample mean (the likelihood mode) and that the
posterior variance 1/(nτ ) → 0 as n → ∞.
Example 1.6
Suppose we have a random sample from a Poisson distribution, that is, Xi |θ ∼ P o(θ),
i = 1, 2, . . . , n (independent). Determine the posterior distribution assuming a vague
prior for θ.
Solution
The conjugate prior distribution is a Gamma distribution. Recall that a Ga(g, h) distribution has mean m = g/h and variance v = g/h². Rearranging these formulae we obtain
$$g = \frac{m^2}{v} \quad\text{and}\quad h = \frac{m}{v}.$$
Clearly g → 0 and h → 0 as v → ∞ (for fixed m). We have seen how taking a Ga(g, h)
prior distribution results in a Ga(g + nx̄, h + n) posterior distribution. Therefore, taking
a vague prior distribution will give a Ga(nx̄, n) posterior distribution.
Note that the posterior mean is x̄ (the likelihood mode) and that the posterior variance x̄/n → 0 as n → ∞.
If we have a statistical model f (x|θ) for data x = (x1 , x2 , . . . , xn )T , together with a prior
distribution π(θ) for θ then
$$\sqrt{J(\hat\theta)}\,(\theta - \hat\theta)\,\Big|\, x \xrightarrow{D} N(0, 1) \quad\text{as } n \to \infty,$$
where θ̂ is the likelihood mode and J(θ) = −∂² log f(x|θ)/∂θ² is the observed information. This means that, with increasing amounts of data, the posterior distribution looks more and more like a normal distribution. The result also gives us a useful approximation to the posterior distribution for θ when n is large:
$$\theta|x \sim N\{\hat\theta, J(\hat\theta)^{-1}\} \quad\text{approximately}.$$
Note that this limiting result is similar to one used in Frequentist statistics for the distri-
bution of the maximum likelihood estimator, namely
$$\sqrt{I(\theta)}\,(\hat\theta - \theta) \xrightarrow{D} N(0, 1) \quad\text{as } n \to \infty,$$
where Fisher’s information I(θ) is the expected value of the observed information, where
the expectation is taken over the distribution of X|θ, that is, I(θ) = EX|θ [J(θ)]. You may
also have seen this result written as an approximation to the distribution of the maximum
likelihood estimator in large samples, namely
$$\hat\theta \sim N\{\theta, I(\theta)^{-1}\} \quad\text{approximately}.$$
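As an illustration (not part of the notes), consider the Poisson model for the botulism data in Example 1.1, where n = 8 and x̄ = 0.75. A short calculation gives J(θ) = nx̄/θ², so θ̂ = x̄ and J(θ̂) = n/x̄, and the asymptotic posterior is N(0.75, 0.75/8). The R sketch below compares this approximation with the exact Ga(8, 9) posterior.

n <- 8; xbar <- 0.75                           # summaries from Example 1.1
theta <- seq(0.01, 3, length.out = 500)
exact  <- dgamma(theta, 8, 9)                  # exact posterior under the Ga(2, 1) prior
approx <- dnorm(theta, xbar, sqrt(xbar / n))   # asymptotic N(theta.hat, J(theta.hat)^-1)
plot(theta, exact, type = "l", ylab = "density")
lines(theta, approx, lty = 2)                  # dashed: asymptotic approximation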
Example 1.7
Suppose we have a random sample from a N(µ, 1/τ ) distribution (with τ known). De-
termine the asymptotic posterior distribution for µ.
Recall that
$$f(x|\mu) = \left(\frac{\tau}{2\pi}\right)^{n/2} \exp\left\{-\frac{\tau}{2}\sum_{i=1}^n (x_i - \mu)^2\right\},$$
and therefore
$$\log f(x|\mu) = \frac{n}{2}\log\tau - \frac{n}{2}\log(2\pi) - \frac{\tau}{2}\sum_{i=1}^n (x_i - \mu)^2$$
$$\Rightarrow \frac{\partial}{\partial\mu}\log f(x|\mu) = -\frac{\tau}{2}\sum_{i=1}^n \{-2(x_i - \mu)\} = \tau\sum_{i=1}^n (x_i - \mu) = n\tau(\bar{x} - \mu)$$
$$\Rightarrow \frac{\partial^2}{\partial\mu^2}\log f(x|\mu) = -n\tau \quad\Rightarrow\quad J(\mu) = -\frac{\partial^2}{\partial\mu^2}\log f(x|\mu) = n\tau.$$
Solution
We have
$$\frac{\partial}{\partial\mu}\log f(x|\mu) = 0 \implies \hat\mu = \bar{x} \implies J(\hat\mu) = n\tau \implies J(\hat\mu)^{-1} = \frac{1}{n\tau},$$
and so µ|x ∼ N{x̄, 1/(nτ)} approximately.
Here the asymptotic posterior distribution is the same as the posterior distribution under
vague prior knowledge.
The posterior distribution π(θ|x) summarises all our information about θ to date. How-
ever, sometimes it is helpful to reduce this distribution to a few key summary measures.
1.4.1 Estimation
Point estimates
There are many useful summaries of a typical value of a random variable with a particular distribution; for example, the mean, mode and median. In Bayesian statistics, the mode is used as a summary rather more often than it is in frequentist statistics.
Confidence intervals/regions
A more useful summary of the posterior distribution is one which also reflects its variation.
For example, a 100(1 − α)% Bayesian confidence interval for θ is any region Cα that
satisfies P r (θ ∈ Cα |x) = 1 − α. If θ is a continuous quantity with posterior probability
density function π(θ|x) then
$$\int_{C_\alpha} \pi(\theta|x)\, d\theta = 1 - \alpha.$$
The usual correction is made for discrete θ, that is, we take the largest region Cα such
that P r (θ ∈ Cα |x) ≤ 1 − α. Bayesian confidence intervals are sometimes called credible
regions or plausible regions. Clearly these intervals are not unique, since there will be
many intervals with the correct probability coverage for a given posterior distribution.
A 100(1 − α)% highest density interval (HDI) for θ is the region
$$C_\alpha = \{\theta : \pi(\theta|x) \geq \gamma\},$$
where γ is chosen so that Pr(θ ∈ Cα|x) = 1 − α.

[Figure 1.3: an HDI (a, b) for a posterior density.]
• the interval CF covers the true value θ on 95% of occasions — in repeated appli-
cations of the formula.
Example 1.8
Suppose, as in Example 1.5 (vague prior knowledge), that the posterior distribution is µ|x ∼ N{x̄, 1/(nτ)}. Determine the 100(1 − α)% HDI for µ.
Solution
This distribution has a symmetric bell shape and so the HDI is an equi-tailed interval Cα = (a, b) with Pr(µ < a|x) = α/2 and Pr(µ > b|x) = α/2, that is,
$$a = \bar{x} - \frac{z_{\alpha/2}}{\sqrt{n\tau}} \quad\text{and}\quad b = \bar{x} + \frac{z_{\alpha/2}}{\sqrt{n\tau}},$$
where zα is the upper α-quantile of the N(0, 1) distribution. For example, the 95% HDI for µ is
$$\left(\bar{x} - \frac{1.96}{\sqrt{n\tau}},\ \bar{x} + \frac{1.96}{\sqrt{n\tau}}\right).$$
Note that this interval is numerically identical to the 95% frequentist confidence interval
for the (population) mean of a normal random sample with known variance. However,
the interpretation is very different.
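For instance, the end-points of a 95% HDI for any normal posterior can be computed in one line of R; the check below (not part of the notes) uses the N(5.484, 0.0415²) posterior from Example 1.4.

post.mean <- 5.484; post.sd <- 0.0415              # posterior from Example 1.4
post.mean + c(-1, 1) * qnorm(0.975) * post.sd      # 95% HDI: approximately (5.403, 5.565)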
Example 1.9
Recall Example 1.1 on the number of cases of foodborne botulism in England and
Wales. The data were modelled as a random sample from a Poisson distribution with
mean θ. Using a Ga(2, 1) prior distribution, we found the posterior distribution to be
θ|x ∼ Ga(8, 9). This posterior density is shown in Figure 1.4. Determine the 100(1−α)%
HDI for θ.
[Figure 1.4: the Ga(8, 9) posterior density for θ.]
Solution
The HDI must take the form Cα = (a, b) if it is to include the values of θ with the highest probability density. Suppose that F(·) and f(·) are the posterior distribution and density functions. Then the end-points a and b must satisfy
$$F(b) - F(a) = 1 - \alpha$$
and
$$f(a) = f(b).$$
The R package nclbayes contains functions to determine the HDI for several distribu-
tions. The function for the Gamma distribution is hdiGamma and we can calculate the
95% HDI for the Ga(8, 9) posterior distribution by using the commands
library(nclbayes)
hdiGamma(p=0.95,a=8,b=9)
Taking 1 − α = 0.95 and using such R code gives a = 0.3304362 and b = 1.5146208.
To check this answer, R gives P r (a < θ < b|x) = 0.95, π(θ = b|x) = 0.1877215 and
π(θ = a|x) = 0.1877427. Thus the 95% HDI is (0.3304362, 1.514621).
The package also has functions hdiBeta for the Beta distribution and hdiInvchi for the
Inv-Chi distribution (introduced in Chapter 2).
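If nclbayes is not available, the same interval can be found numerically from the two defining conditions above using only base R. The sketch below is illustrative (the function name hdi.gamma is ours, not part of any package) and assumes a posterior with an interior mode (shape > 1).

# Numerical HDI for a Ga(shape, rate) posterior: find (a, b) with
# F(b) - F(a) = p and f(a) = f(b)
hdi.gamma <- function(p = 0.95, shape, rate) {
  bfun <- function(a) qgamma(pgamma(a, shape, rate) + p, shape, rate)
  g <- function(a) dgamma(a, shape, rate) - dgamma(bfun(a), shape, rate)
  a <- uniroot(g, c(1e-6, qgamma(1 - p, shape, rate)))$root
  c(lower = a, upper = bfun(a))
}
hdi.gamma(0.95, shape = 8, rate = 9)   # approximately (0.330, 1.515), as above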
1.4.2 Prediction
Much of statistical inference (both Frequentist and Bayesian) is aimed towards making
statements about a parameter θ. Often the inferences are used as a yardstick for sim-
ilar future experiments. For example, we may want to predict the outcome when the
experiment is performed again.
Clearly there will be uncertainty about the future outcome of an experiment. Suppose
this future outcome Y is described by a probability (density) function f (y |θ). There are
several ways we could make inferences about what values of Y are likely. For example, if
we have an estimate θ̂ of θ we might base our inferences on f (y |θ = θ̂). Obviously this
is not the best we can do, as such inferences ignore the fact that it is very unlikely that
θ = θ̂.
Implicit in the Bayesian framework is the concept of the predictive distribution. This
distribution describes how likely are different outcomes of a future experiment. The
predictive probability (density) function is calculated as
$$f(y|x) = \int_\Theta f(y|\theta)\, \pi(\theta|x)\, d\theta$$
when θ is a continuous quantity. From this equation, we can see that the predictive
distribution is formed by weighting the possible values of θ in the future experiment
f (y |θ) by how likely we believe they are to occur π(θ|x).
If the true value of θ were known, say θ0 , then any prediction can do no better than one
based on f (y |θ = θ0 ). However, as (generally) θ is unknown, the predictive distribution
is used as the next best alternative.
We can use the predictive distribution to provide a useful range of plausible values for the outcome of a future experiment. Such a prediction interval is constructed in the same way as an HDI: a 100(1 − α)% prediction interval for Y is the region Cα = {y : f(y|x) ≥ γ}, where γ is chosen so that Pr(Y ∈ Cα|x) = 1 − α.
Example 1.10
Recall Example 1.1 on the number of cases of foodborne botulism in England and Wales.
The data for 1998–2005 were modelled by a Poisson distribution with mean θ. Using
a Ga(2, 1) prior distribution, we found the posterior distribution to be θ|x ∼ Ga(8, 9).
Determine the predictive distribution for the number of cases for the following year (2006).
Solution
The predictive probability function is, for y = 0, 1, 2, . . .,
$$\begin{aligned}
f(y|x) &= \int_0^\infty f(y|\theta)\, \pi(\theta|x)\, d\theta
= \int_0^\infty \frac{\theta^y e^{-\theta}}{y!} \times \frac{9^8\, \theta^{7} e^{-9\theta}}{\Gamma(8)}\, d\theta \\
&= \frac{9^8}{y!\,\Gamma(8)} \int_0^\infty \theta^{y+7} e^{-10\theta}\, d\theta \\
&= \frac{9^8}{y!\,\Gamma(8)} \times \frac{\Gamma(y+8)}{10^{y+8}} \\
&= \frac{(y+7)!}{y!\,7!} \times 0.9^8 \times 0.1^y \\
&= \binom{y+7}{7} \times 0.9^8 \times 0.1^y.
\end{aligned}$$
You may not recognise this probability function but it is related to that of a negative
binomial distribution. Suppose Z ∼ NegBin(r, p) with probability function
$$Pr(Z = z) = \binom{z-1}{r-1}\, p^r (1-p)^{z-r}, \qquad z = r, r+1, \ldots.$$
This is the same probability function as our predictive probability function, with r = 8, p = 0.9 and z = y + 8. Therefore Y|x ∼ NegBin(8, 0.9) − 8. Note that, unfortunately, R also calls the distribution of W = Z − r (the number of failures before the r-th success) a negative binomial distribution with parameters r and p. To distinguish between this distribution and the NegBin(r, p) distribution used above, we shall denote the distribution of W as a NegBinR(r, p) distribution; it has mean r(1 − p)/p and variance r(1 − p)/p². Thus Y|x ∼ NegBinR(8, 0.9).
We can compare this predictive distribution with a naive predictive distribution based on
an estimate of θ. Here we shall base our naive predictive distribution on the maximum
likelihood estimate θ̂ = 0.75, that is, use the distribution Y |θ = θ̂ ∼ P o(0.75). Thus,
the naive predictive probability function is
$$f(y|\theta = \hat\theta) = \frac{0.75^y e^{-0.75}}{y!}, \qquad y = 0, 1, \ldots.$$
Numerical values for the predictive and naive predictive probability functions are given in
Table 1.5.
Table 1.5: Predictive (correct) and naive predictive probability functions

   y    correct f(y|x)    naive f(y|θ = θ̂)
   0    0.430             0.472
   1    0.344             0.354
   2    0.155             0.133
   3    0.052             0.033
   4    0.014             0.006
   5    0.003             0.001
  ≥6    0.005             0.002
Again, the naive predictive distribution is a predictive distribution which, instead of using the correct posterior distribution, uses a degenerate posterior distribution π*(θ|x) which essentially allows only one value: Prπ*(θ = 0.75|x) = 1, and so has standard deviation SDπ*(θ|x) = 0. Note that the correct posterior standard deviation of θ is SDπ(θ|x) = $\sqrt{8}/9$ = 0.314. Using a degenerate posterior distribution results in the naive predictive distribution having too small a standard deviation:
$$SD(Y|x) = \begin{cases} 0.994 & \text{using the correct } \pi(\theta|x), \\ 0.866 & \text{using the naive } \pi^*(\theta|x), \end{cases}$$
these values being calculated from the NegBinR(8, 0.9) and Po(0.75) distributions.
Using the numerical table of predictive probabilities, we can see that {0, 1, 2} is a 92.9%
prediction set/interval. This is to be contrasted with the more “optimistic” calculation
using the naive predictive distribution which shows that {0, 1, 2} is a 95.9% prediction
set/interval.
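These calculations use only standard R functions, since dnbinom employs the "number of failures" parameterisation NegBinR described above; the following check is not part of the original notes.

y <- 0:5
round(dnbinom(y, size = 8, prob = 0.9), 3)   # correct predictive: 0.430 0.344 0.155 ...
round(dpois(y, 0.75), 3)                     # naive predictive:   0.472 0.354 0.133 ...
sqrt(8 * 0.1 / 0.9^2)                        # SD(Y|x) = 0.994 using the correct posterior
sqrt(0.75)                                   # SD(Y|x) = 0.866 using the naive posterior
sum(dnbinom(0:2, 8, 0.9))                    # Pr(Y in {0,1,2} | x) = 0.929
sum(dpois(0:2, 0.75))                        # naive version: 0.959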
Candidate’s formula
In the previous example, a non-trivial integral had to be evaluated. However, when the past data x and future data y are independent (given θ) and we use a conjugate prior distribution, another (easier) method can be used to determine the predictive distribution. Since x and y are conditionally independent given θ, Bayes Theorem gives π(θ|x, y) ∝ f(y|θ) π(θ|x), and rearranging this relationship yields Candidate's formula
$$f(y|x) = \frac{f(y|\theta)\, \pi(\theta|x)}{\pi(\theta|x, y)},$$
whose right-hand side takes the same value whatever value of θ is used to evaluate it.
Example 1.11
Rework Example 1.10 using Candidate's formula to determine the predictive distribution for the number of cases in 2006.
Solution
Let Y denote the number of cases in 2006. We know that θ|x ∼ Ga(8, 9) and Y|θ ∼ Po(θ). Using Example 1.2, including the additional observation y gives θ|x, y ∼ Ga(8 + y, 9 + 1) = Ga(8 + y, 10), and so Candidate's formula gives
$$\begin{aligned}
f(y|x) &= \frac{f(y|\theta)\, \pi(\theta|x)}{\pi(\theta|x, y)}
= \frac{\Gamma(8+y)\, 9^8}{y!\,\Gamma(8)\, 10^{8+y}} \\
&= \frac{(y+7)!}{y!\,7!} \times 0.9^8 \times 0.1^y \\
&= \binom{y+7}{7} \times 0.9^8 \times 0.1^y,
\end{aligned}$$
as in Example 1.10, but without evaluating an integral.
Sometimes prior beliefs cannot be adequately represented by a simple distribution, for ex-
ample, a normal distribution or a beta distribution. In such cases, mixtures of distributions
can be useful.
Example 1.12
Investigations into infants suffering from severe idiopathic respiratory distress syndrome
have shown that whether the infant survives may be related to their weight at birth.
Suppose that you are interested in developing a prior distribution for the mean birth weight µ of such infants. You might have a normal N(2.3, 0.52²) prior distribution for the mean birth weight (in kg) of infants who survive and a normal N(1.7, 0.66²) prior distribution for infants who die. If you believe that the proportion of infants that survive is 0.6, what is your prior distribution of birth weights of infants suffering from this syndrome?
Solution
Let T = 1, 2 denote whether the infant survives or dies. Then the information above tells us that µ|T = 1 ∼ N(2.3, 0.52²), µ|T = 2 ∼ N(1.7, 0.66²) and Pr(T = 1) = 0.6, and so
$$\pi(\mu) = Pr(T = 1)\, \pi(\mu|T = 1) + Pr(T = 2)\, \pi(\mu|T = 2).$$
We write this as
$$\mu \sim 0.6\, N(2.3, 0.52^2) + 0.4\, N(1.7, 0.66^2).$$
This prior distribution is a mixture of two normal distributions. Figure 1.5 shows the
overall (mixture) prior distribution π(µ) and the “component” distributions describing
prior beliefs about the mean weights of those who survive and those who die. Notice that,
in this example, although the mixture distribution is a combination of two distributions,
each with one mode, this mixture distribution has only one mode. Also, although the
component distributions are symmetric, the mixture distribution is not symmetric.
Figure 1.5: Plot of the mixture density (solid) with its component densities (survive –
dashed; die – dotted)
Definition 1.2
The distribution of θ is a mixture distribution with m components if its probability (density) function can be written as
$$\pi(\theta) = \sum_{i=1}^m p_i\, \pi_i(\theta), \tag{1.11}$$
where the π_i(θ) are probability (density) functions (the component distributions) and the weights p_i satisfy p_i > 0 and $\sum_{i=1}^m p_i = 1$.
Figure 1.6 contains a plot of two quite different mixture distributions. One mixture
distribution has a single mode and the other has two modes. In general, a mixture
distribution whose m component distributions each have a single mode will have at most
m modes.
Figure 1.6: Plot of two mixture densities: solid is 0.6 N(1, 1) + 0.4 N(2, 1); dashed is 0.9 Exp(1) + 0.1 N(2, 0.25²)
We can calculate the mean and variance of a mixture distribution as follows. We will assume, for simplicity, that θ is a scalar. Let E_i(θ) and Var_i(θ) be the mean and variance of the distribution for θ in component i, that is,
$$E_i(\theta) = \int_\Theta \theta\, \pi_i(\theta)\, d\theta \quad\text{and}\quad Var_i(\theta) = \int_\Theta \{\theta - E_i(\theta)\}^2\, \pi_i(\theta)\, d\theta.$$
Then the mean of the mixture distribution is
$$E(\theta) = \int_\Theta \theta\, \pi(\theta)\, d\theta = \sum_{i=1}^m p_i E_i(\theta). \tag{1.12}$$
We also have
$$E(\theta^2) = \sum_{i=1}^m p_i E_i(\theta^2) = \sum_{i=1}^m p_i\left\{Var_i(\theta) + E_i(\theta)^2\right\}, \tag{1.13}$$
from which we can calculate the variance of the mixture distribution using Var(θ) = E(θ²) − E(θ)².
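For example, applying (1.12) and (1.13) to the mixture prior in Example 1.12 gives its overall mean and standard deviation; the short R sketch below is illustrative and not part of the notes.

p <- c(0.6, 0.4)                  # mixture weights
m <- c(2.3, 1.7)                  # component means
s <- c(0.52, 0.66)                # component standard deviations
mix.mean <- sum(p * m)                          # E(theta), from (1.12)
mix.var  <- sum(p * (s^2 + m^2)) - mix.mean^2   # Var(theta) = E(theta^2) - E(theta)^2
c(mean = mix.mean, sd = sqrt(mix.var))          # approximately 2.06 and 0.65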
Combining a mixture prior distribution with data x using Bayes Theorem produces the posterior density
$$\pi(\theta|x) = \frac{\pi(\theta)\, f(x|\theta)}{f(x)} = \sum_{i=1}^m \frac{p_i\, \pi_i(\theta)\, f(x|\theta)}{f(x)}, \tag{1.14}$$
where f(x) is a constant with respect to θ. Now if the prior density were π_i(θ) (instead of the mixture distribution), using Bayes Theorem, the posterior density would be
$$\pi_i(\theta|x) = \frac{\pi_i(\theta)\, f(x|\theta)}{f_i(x)},$$
and so (1.14) can be rewritten as
$$\pi(\theta|x) = \sum_{i=1}^m p_i^*\, \pi_i(\theta|x), \quad\text{where}\quad p_i^* = \frac{p_i f_i(x)}{\displaystyle\sum_{j=1}^m p_j f_j(x)}, \qquad i = 1, 2, \ldots, m.$$
Hence, combining data x with a mixture prior distribution (pi , πi (θ)) produces a posterior
mixture distribution (pi∗ , πi (θ|x)). The effect of introducing the data is to “update” the
mixture weights (pi → pi∗ ) and the component distributions (πi (θ) → πi (θ|x)).
Example 1.13
Suppose we have a random sample of size 20 from an exponential distribution, that is,
Xi |θ ∼ Exp(θ), i = 1, 2, . . . , 20 (independent). Also suppose that the prior distribution
for θ is the mixture distribution
as shown in Figure 1.7. Here the component distributions are π1 (θ) = Ga(5, 10) and
π2 (θ) = Ga(15, 10), with weights p1 = 0.6 and p2 = 0.4.
[Figure 1.7: the mixture prior density for θ.]
We have already seen that combining a random sample of size 20 from an exponential distribution with a Ga(g, h) prior distribution results in a Ga(g + 20, h + 20x̄) posterior distribution. Therefore the component posterior distributions here are π1(θ|x) = Ga(25, 10 + 20x̄) and π2(θ|x) = Ga(35, 10 + 20x̄).
We now calculate new values for the weights p1* and p2* = 1 − p1*, which will depend on both prior information and the data. We have
$$p_1^* = \frac{0.6 f_1(x)}{0.6 f_1(x) + 0.4 f_2(x)},$$
from which
$$(p_1^*)^{-1} - 1 = \frac{0.4 f_2(x)}{0.6 f_1(x)}.$$
In general, the functions
$$f_i(x) = \int_\Theta \pi_i(\theta)\, f(x|\theta)\, d\theta$$
are potentially complicated integrals (solved either analytically or numerically). However, as with Candidate's formula, these calculations become much simpler when we have a conjugate prior distribution: rewriting Bayes Theorem, we obtain
$$f(x) = \frac{\pi(\theta)\, f(x|\theta)}{\pi(\theta|x)},$$
and so when the prior and posterior densities have a simple form (as they do when using a conjugate prior), it is straightforward to determine f(x) using algebra rather than having to use calculus.
In this example we know that the gamma distribution is the conjugate prior distribution: using a random sample of size n with mean x̄ and a Ga(g, h) prior distribution gives a Ga(g + n, h + nx̄) posterior distribution, and so
$$f(x) = \frac{\pi(\theta)\, f(x|\theta)}{\pi(\theta|x)}
= \frac{\dfrac{h^g \theta^{g-1} e^{-h\theta}}{\Gamma(g)} \times \theta^n e^{-n\bar{x}\theta}}{\dfrac{(h + n\bar{x})^{g+n}\, \theta^{g+n-1} e^{-(h+n\bar{x})\theta}}{\Gamma(g+n)}}
= \frac{h^g\, \Gamma(g+n)}{\Gamma(g)\, (h + n\bar{x})^{g+n}}.$$
Therefore
$$(p_1^*)^{-1} - 1 = \frac{0.4 \times 10^{15}\, \Gamma(35)}{\Gamma(15)\, (10 + 20\bar{x})^{35}} \bigg/ \frac{0.6 \times 10^{5}\, \Gamma(25)}{\Gamma(5)\, (10 + 20\bar{x})^{25}}
= \frac{2\, \Gamma(35)\, \Gamma(5)}{3\, \Gamma(25)\, \Gamma(15)\, (1 + 2\bar{x})^{10}}
= \frac{611320}{7(1 + 2\bar{x})^{10}},$$
and so
$$p_1^* = \frac{1}{1 + \dfrac{611320}{7(1 + 2\bar{x})^{10}}}, \qquad p_2^* = 1 - p_1^*.$$
Recall that the most likely value of θ from the data alone, the likelihood mode, is 1/x̄. Therefore, large values of x̄ indicate that θ is small and vice versa. With this in mind, it is not surprising that the weight p1* (of the component distribution with the smallest mean) is increasing in x̄, and p1* → 1 as x̄ → ∞. Using (1.12), the posterior mean is
$$\begin{aligned}
E(\theta|x) &= p_1^* \times \frac{25}{10 + 20\bar{x}} + (1 - p_1^*) \times \frac{35}{10 + 20\bar{x}} \\
&= \frac{35 - 10 p_1^*}{10 + 20\bar{x}} \\
&= \frac{1}{2(1 + 2\bar{x})}\left(7 - \frac{2}{1 + \dfrac{611320}{7(1 + 2\bar{x})^{10}}}\right).
\end{aligned}$$
The posterior standard deviation can be calculated using (1.12) and (1.13).
Table 1.6 shows the posterior distributions which result when various sample means x̄
are observed together with the posterior mean and the posterior standard deviation.
Graphs of these posterior distributions, together with the prior distribution, are given in
Figure 1.8. When considering the effect on beliefs of observing the sample mean x̄, it
is important to remember that large values of x̄ indicate that θ is small and vice versa.
Plots of the posterior mean against the sample mean reveal that the posterior mean lies
between the prior mean and the likelihood mode only for x̄ ∈ (0, 0.70) ∪ (1.12, ∞). Note
that observing the data has focussed our beliefs about θ in the sense that the posterior
standard deviation is less than the prior standard deviation – and considerably less in some
cases.
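The summaries in Table 1.6 can be reproduced for any sample mean with a few lines of R; the sketch below (not part of the notes) computes p1*, the posterior mean and the posterior standard deviation, and the value x̄ = 1 used to run it is purely illustrative.

post.summary <- function(xbar, n = 20) {
  w1 <- 1 / (1 + 611320 / (7 * (1 + 2 * xbar)^10))   # updated weight p1*
  shape <- c(5, 15) + n                              # component posteriors Ga(25, .) and Ga(35, .)
  rate  <- 10 + n * xbar
  p <- c(w1, 1 - w1)
  m <- shape / rate                                  # component posterior means
  v <- shape / rate^2                                # component posterior variances
  post.mean <- sum(p * m)                            # mixture mean, as in (1.12)
  post.var  <- sum(p * (v + m^2)) - post.mean^2      # via (1.13)
  c(p1.star = w1, mean = post.mean, sd = sqrt(post.var))
}
post.summary(xbar = 1)   # illustrative sample mean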
Table 1.6: Posterior distributions (with summaries) for various sample means x̄
Figure 1.8: Plot of the prior distribution and various posterior distributions
By the end of this chapter, you should be able to:

• determine the likelihood function using a random sample from any distribution
• combine this likelihood function with any prior distribution to obtain the posterior
distribution
• do all the above for a particular data set or for a general case with random sample
x1 , . . . , xn
• describe the different levels of prior information; determine and use conjugate priors
and vague priors
• describe and calculate Bayesian confidence intervals, HDIs and prediction intervals