Ma40189 2016/17 Problem Sheet Solutions (merged version)
2016/17 Semester II
1. Let X1, . . . , Xn be conditionally independent given θ, so f(x | θ) = ∏_{i=1}^n f(xi | θ) where x = (x1, . . . , xn), with each Xi | θ ∼ Po(θ). Suppose we judge that θ ∼ Gamma(α, β).

As θ ∼ Gamma(α, β) then

  f(θ) = (β^α/Γ(α)) θ^{α−1} e^{−βθ},   (1)

whilst as Xi | θ ∼ Po(θ),

  f(x | θ) = ∏_{i=1}^n θ^{xi} e^{−θ}/xi! = θ^{nx̄} e^{−nθ} / ∏_{i=1}^n xi!.   (2)
(c) Let Z be a future (unobserved) observation. Find the mean and variance of the predictive distribution Z | x. You may assume that X and Z are conditionally independent given θ and that Z | θ ∼ Po(θ).

We do not need to find the full predictive distribution. As X and Z are conditionally independent given θ we can use the expectation and variance formulas given in lectures, so

  E(Z | X) = E(E(Z | θ) | X)
           = E(θ | X)   (3)
           = (α + nx̄)/(β + n)   (4)

where (3) follows as Z | θ ∼ Po(θ) and (4) as θ | x ∼ Gamma(α + nx̄, β + n). Similarly,

  Var(Z | X) = E(Var(Z | θ) | X) + Var(E(Z | θ) | X)
             = E(θ | X) + Var(θ | X)
             = (α + nx̄)/(β + n) + (α + nx̄)/(β + n)².   (5)
(d) The data in the table below are the number of fatal accidents on scheduled airline flights between 1976 and 1985.

  Year:       1976 1977 1978 1979 1980 1981 1982 1983 1984 1985
  Accidents:    24   25   31   31   22   21   26   20   16   22

Let Z be the number of fatal accidents in 1986 and Xi the number of fatal accidents in 1975 + i. Adopting the model given above, with α = 101 and β = 5 and assuming that Z | x may be approximated by a N(E(Z | X), Var(Z | X)), find an approximate 95% prediction interval for the number of accidents in 1986, that is an interval (zL, zU) such that P(zL < Z < zU | x) = 0.95.
Taking α = 101 and β = 5 we observe ∑_{i=1}^{10} xi = 238 with n = 10. Hence α + nx̄ = 101 + 238 = 339 and β + n = 5 + 10 = 15, so that θ | x ∼ Gamma(339, 15). From (4) and (5) we have

  E(Z | X) = 339/15 = 113/5,   Var(Z | X) = (339 × 16)/15² = 1808/75,

so we use Z | x ∼ N(113/5, 1808/75) approximately. Recalling that if Y ∼ N(0, 1) then P(−1.96 < Y < 1.96) = 0.95, the (approximate) 95% prediction interval is

  113/5 ± 1.96 √(1808/75) = (12.98, 32.22).
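The interval can be checked numerically; the following is an illustrative R sketch (not part of the original solution) using the accident data and the posterior and normal approximation above.

  x <- c(24, 25, 31, 31, 22, 21, 26, 20, 16, 22)   # fatal accidents 1976-1985
  alpha <- 101; beta <- 5; n <- length(x)
  alpha.n <- alpha + sum(x); beta.n <- beta + n     # posterior Gamma(339, 15)
  pred.mean <- alpha.n / beta.n                     # E(Z | x) = 113/5
  pred.var  <- pred.mean + alpha.n / beta.n^2       # Var(Z | x) = 1808/75
  pred.mean + c(-1, 1) * qnorm(0.975) * sqrt(pred.var)   # approx (12.98, 32.22)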
2. Let X1, . . . , Xn be a finitely exchangeable sequence of random quantities and consider any i ≠ j ∈ {1, . . . , n}.

(a) Explain why the marginal distribution of any Xi does not depend upon i so that E(Xi) and Var(Xi) do not depend upon i.
By exchangeability, f(x1, . . . , xn) = f(xπ(1), . . . , xπ(n)) for all permutations π defined on the set {1, . . . , n}. For a given permutation there exists j ∈ {1, . . . , n} such that xπ(j) = xi. Let x−i denote the set {x1, . . . , xn}\xi. The marginal distribution of Xi is given by

  fXi(xi) = ∫_{x−i} f(x1, . . . , xn) dx−i
          = ∫_{x−i} f(xπ(1), . . . , xπ(j−1), xπ(j), xπ(j+1), . . . , xπ(n)) dx−i
          = ∫_{x−i} f(xπ(1), . . . , xπ(j−1), xi, xπ(j+1), . . . , xπ(n)) dx−i
          = fXj(xi).

As this holds for all permutations then it holds for all j ∈ {1, . . . , n}. Thus, the marginal distribution of Xi does not depend upon i. As E(Xi) and Var(Xi) are constructed over the marginal of Xi, it follows immediately that these also do not depend upon i.
(b) Explain why the joint distribution of any Xi, Xj does not depend upon either i or j so that Cov(Xi, Xj) does not depend on either i or j.

For i ≠ j, let x−i,−j denote the set {x1, . . . , xn}\{xi, xj}. For a given permutation there exists k ≠ l ∈ {1, . . . , n} such that xπ(k) = xi and xπ(l) = xj. Suppose without loss of generality that k < l. The joint distribution of Xi and Xj is given by

  fXi,Xj(xi, xj) = ∫_{x−i,−j} f(x1, . . . , xn) dx−i,−j
                 = ∫_{x−i,−j} f(xπ(1), . . . , xπ(k−1), xπ(k), xπ(k+1), . . . , xπ(l−1), xπ(l), xπ(l+1), . . . , xπ(n)) dx−i,−j
                 = ∫_{x−i,−j} f(xπ(1), . . . , xπ(k−1), xi, xπ(k+1), . . . , xπ(l−1), xj, xπ(l+1), . . . , xπ(n)) dx−i,−j
                 = fXk,Xl(xi, xj).

As this holds for all permutations then it holds for all k ≠ l ∈ {1, . . . , n}. Thus, the joint distribution of Xi and Xj does not depend upon i ≠ j. As Cov(Xi, Xj) is constructed over the joint distribution of Xi and Xj, it follows immediately that Cov(Xi, Xj) does not depend on either i or j.
(c) Let Y = ∑_{i=1}^n Xi. By considering Var(Y) ≥ 0, or otherwise, show that

  Corr(Xi, Xj) = Cov(Xi, Xj)/√(Var(Xi)Var(Xj)) ≥ −1/(n − 1).

Taking the variance of Y = ∑_{i=1}^n Xi we have

  Var(Y) = Var(∑_{i=1}^n Xi)
         = ∑_{i=1}^n Var(Xi) + ∑_{i=1}^n ∑_{j≠i} Cov(Xi, Xj).

Now, from part (a), Var(Xi) does not depend upon i and, from part (b), Cov(Xi, Xj) does not depend upon either i or j. Thus,

  0 ≤ Var(Y) = n Var(Xi) + n(n − 1) Cov(Xi, Xj),

so that Cov(Xi, Xj) ≥ −Var(Xi)/(n − 1) and hence, dividing by √(Var(Xi)Var(Xj)) = Var(Xi),

  Corr(Xi, Xj) ≥ −1/(n − 1).
  f(x, y | µ, δ) = f(x | µ, δ) f(y | µ, δ)
                 = f(x | µ) f(y | µ, δ)
                 = (1/(2πσ²)) exp{−(1/2σ²)[(x − µ)² + (y − µ − δ)²]}.

  f(µ, δ) ∝ 1.

As f(µ, δ) ∝ 1 then

  f(µ, δ | x, y) ∝ f(x, y | µ, δ)
                 ∝ exp{−(1/2σ²)[(x − µ)² + (y − µ − δ)²]}
                 ∝ exp{−(1/2σ²)[µ² − 2xµ + (µ + δ)² − 2y(µ + δ)]}
                 = exp{−(1/2σ²)[2µ² − 2(x + y)µ + 2µδ + δ² − 2yδ]}.   (6)

Using the properties of multivariate Normal distributions, we can read off the marginal distributions, so µ | x, y ∼ N(x, σ²) and δ | x, y ∼ N(y − x, 2σ²). We shall derive these marginal distributions directly in parts (c) and (d) as these give insights into the types of techniques used when the marginals are not so conveniently obtained.
(c) Find the marginal posterior distribution f(δ | x, y).

  f(δ | x, y) = ∫_{−∞}^{∞} f(µ, δ | x, y) dµ
              ∝ ∫_{−∞}^{∞} exp{−(1/2σ²)[2µ² − 2(x + y)µ + 2µδ + δ² − 2yδ]} dµ
              = exp{−(1/2σ²)[δ² − 2yδ]} ∫_{−∞}^{∞} exp{−(1/2σ²)[2µ² − 2(x + y − δ)µ]} dµ
              = exp{−(1/2σ²)[δ² − 2yδ]} × ∫_{−∞}^{∞} exp{−(1/σ²)[(µ − (x + y − δ)/2)² − (x + y − δ)²/4]} dµ
              = exp{−(1/4σ²)[2δ² − 4yδ − (x + y − δ)²]} × ∫_{−∞}^{∞} exp{−(1/σ²)(µ − (x + y − δ)/2)²} dµ
              ∝ exp{−(1/4σ²)[2δ² − 4yδ − (x + y − δ)²]}
              ∝ exp{−(1/4σ²)[δ² − 2(2y − (x + y))δ]}
              = exp{−(1/4σ²)[δ² − 2(y − x)δ]}

which is a kernel of a N(y − x, 2σ²) density. Hence, δ | x, y ∼ N(y − x, 2σ²).

(d) Find the marginal posterior distribution f(µ | x, y).

  f(µ | x, y) = ∫_{−∞}^{∞} f(µ, δ | x, y) dδ
              ∝ ∫_{−∞}^{∞} exp{−(1/2σ²)[2µ² − 2(x + y)µ + 2µδ + δ² − 2yδ]} dδ
              = exp{−(1/2σ²)[2µ² − 2(x + y)µ]} × ∫_{−∞}^{∞} exp{−(1/2σ²)[δ² − 2(y − µ)δ]} dδ
              = exp{−(1/2σ²)[2µ² − 2(x + y)µ]} × ∫_{−∞}^{∞} exp{−(1/2σ²)[δ − (y − µ)]² + (1/2σ²)(y − µ)²} dδ
              = exp{−(1/2σ²)[2µ² − 2(x + y)µ − (y − µ)²]} × ∫_{−∞}^{∞} exp{−(1/2σ²)[δ − (y − µ)]²} dδ
              ∝ exp{−(1/2σ²)[2µ² − 2(x + y)µ − (y − µ)²]}
              ∝ exp{−(1/2σ²)[µ² − 2(x + y − y)µ]}
              = exp{−(1/2σ²)[µ² − 2xµ]}

which is a kernel of a N(x, σ²) density. Hence, µ | x, y ∼ N(x, σ²).
(e) Consider a future observation Z where Z | µ, δ ∼ N(µ − δ, σ²) and Z is conditionally independent of X and Y given µ and δ. Find the predictive distribution f(z | x, y).

Notice that

  f(z | µ, δ) ∝ exp{−(1/2σ²)[z² − 2(µ − δ)z + (µ − δ)²]}

so that, as Z is conditionally independent of X and Y given µ and δ and using (6),

  f(z | x, y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(z | µ, δ) f(µ, δ | x, y) dµ dδ
              ∝ ∫∫ exp{−(1/2σ²)[z² − 2(µ − δ)z + (µ − δ)²]} × exp{−(1/2σ²)[2µ² − 2(x + y)µ + 2µδ + δ² − 2yδ]} dµ dδ
              = ∫∫ exp{−(1/2σ²)[z² + 3µ² − 2(x + y + z)µ + 2δ² − 2(y − z)δ]} dµ dδ
              = exp{−z²/2σ²} ∫ exp{−(1/2σ²)[3µ² − 2(x + y + z)µ]} dµ × ∫ exp{−(1/2σ²)[2δ² − 2(y − z)δ]} dδ
              = exp{−z²/2σ²} ∫ exp{−(3/2σ²)[(µ − (x + y + z)/3)² − (x + y + z)²/9]} dµ × ∫ exp{−(1/σ²)[(δ − (y − z)/2)² − (y − z)²/4]} dδ
              ∝ exp{−z²/2σ²} exp{(x + y + z)²/6σ²} exp{(y − z)²/4σ²}
              ∝ exp{−(1/12σ²)[z² − 4(x + y)z + 6yz]}
              = exp{−(1/12σ²)[z² − 2(2x − y)z]}

which is a kernel of a N(2x − y, 6σ²) density. Hence, Z | x, y ∼ N(2x − y, 6σ²).
As Xi | θ ∼ U(0, θ) then

  f(xi | θ) = 1/θ for 0 ≤ xi ≤ θ, and 0 otherwise.

Let I(0,θ)(x) denote the indicator function for the event 0 ≤ x ≤ θ, so I(0,θ)(x) = 1 if 0 ≤ x ≤ θ and 0 otherwise. Then we can write f(xi | θ) = (1/θ) I(0,θ)(xi) so that

  f(x | θ) = ∏_{i=1}^n (1/θ) I(0,θ)(xi)
           = (1/θⁿ) ∏_{i=1}^n I(0,θ)(xi)
           = (1/θⁿ) I(0,θ)(m)

where m = maxi(xi). Thus, f(x | θ) = g(θ, m) h(x) where g(θ, m) = (1/θⁿ) I(0,θ)(m) and h(x) = 1, so that M = maxi(Xi) is sufficient for X = (X1, . . . , Xn) for learning about θ.
(b) Show that the Pareto distribution, θ ∼ Pareto(a, b),

  f(θ) = a bᵃ / θ^{a+1},  θ ≥ b,

is a conjugate prior distribution.

Let I{θ≥c} denote the indicator function for the event θ ≥ c, so I{θ≥c} = 1 if θ ≥ c and 0 otherwise. Then, viewing the likelihood as a function of θ rather than m = maxi(xi), we may write the likelihood as

  f(x | θ) = (1/θⁿ) I{θ≥m}

whilst the prior may be expressed as

  f(θ) = (a bᵃ / θ^{a+1}) I{θ≥b}.

The posterior is

  f(θ | x) ∝ (1/θⁿ) I{θ≥m} × (a bᵃ / θ^{a+1}) I{θ≥b}
           ∝ (1/θ^{a+n+1}) I{θ≥m} I{θ≥b}
           = (1/θ^{a+n+1}) I{θ≥max(b,m)}

which is a kernel of a Pareto(a + n, max(b, m)) density so that θ | x ∼ Pareto(a + n, max(b, m)). Thus, the prior and posterior are in the same family and so, relative to the U(0, θ) likelihood, the Pareto distribution is a conjugate family.
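A small numerical illustration of this conjugate update (not part of the original solution); the prior values a and b and the simulated data below are hypothetical.

  set.seed(1)
  theta.true <- 4
  x <- runif(20, 0, theta.true)      # X_i | theta ~ U(0, theta)
  a <- 2; b <- 1                     # hypothetical Pareto(a, b) prior
  n <- length(x); m <- max(x)
  a.post <- a + n                    # posterior is Pareto(a + n, max(b, m))
  b.post <- max(b, m)
  a.post * b.post / (a.post - 1)     # posterior mean of theta (exists since a.post > 1)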
  f(x | θ) = θ^{nx̄} (1 − θ)^{n−nx̄}

so that θ | x ∼ Beta(α + nx̄, β + n − nx̄). Thus, the prior and the posterior are in the same family, giving conjugacy.
In conventional form,

  f(x | θ) ∝ θ^{−n/2} exp{−(1/2θ) ∑_{i=1}^n (xi − µ)²}.

  f(xi | θ) = exp{−θ xi²/2 + (3/2) log θ + 2 log xi + (1/2) log(2/π)}
In conventional form,

  f(x | θ) = (2/π)^{n/2} (∏_{i=1}^n xi²) θ^{3n/2} exp{−(θ/2) ∑_{i=1}^n xi²}
           ∝ θ^{3n/2} exp{−(θ/2) ∑_{i=1}^n xi²}

which, viewing f(x | θ) as a function of θ, we recognise as a kernel of a Gamma distribution (in particular, Gamma((3n + 2)/2, (1/2) ∑_{i=1}^n xi²)).
iii. By first finding the corresponding posterior distribution for θ given
x = (x1 , . . . , xn ), show that this family of distributions is conjugate
with respect to the likelihood f (x | θ).
(c) Suppose now that the prior for θ is instead given by the probability density function

  f(θ) = θ^α (1 − θ)^{β−1}/(2B(α + 1, β)) + θ^{α−1} (1 − θ)^β/(2B(α, β + 1)),

where B(α, β) denotes the Beta function evaluated at α and β. Show that the posterior probability density function can be written as

  f(θ | x) = λ f1(θ) + (1 − λ) f2(θ)

where

  λ = (α + n)β / [(α + n)β + (β − n + ∑_{i=1}^n xi)α]

and f1(θ) and f2(θ) are probability density functions.

  f(θ | x) ∝ f(x | θ) f(θ)
           = θⁿ (1 − θ)^{nx̄−n} {θ^α (1 − θ)^{β−1}/B(α + 1, β) + θ^{α−1} (1 − θ)^β/B(α, β + 1)}
           = θ^{α1} (1 − θ)^{β1−1}/B(α + 1, β) + θ^{α1−1} (1 − θ)^{β1}/B(α, β + 1)

where α1 = α + n and β1 = β + nx̄ − n. Finding the constant of proportionality we observe that θ^{α1}(1 − θ)^{β1−1} is a kernel of a Beta(α1 + 1, β1) and θ^{α1−1}(1 − θ)^{β1} is a kernel of a Beta(α1, β1 + 1). So,

  f(θ | x) = c {(B(α1 + 1, β1)/B(α + 1, β)) f1(θ) + (B(α1, β1 + 1)/B(α, β + 1)) f2(θ)}

where f1(θ) is the density function of Beta(α1 + 1, β1) and f2(θ) the density function of Beta(α1, β1 + 1). Hence,

  c^{−1} = B(α1 + 1, β1)/B(α + 1, β) + B(α1, β1 + 1)/B(α, β + 1).
Using de Finetti's Representation Theorem (Theorem 2 of the on-line notes), the joint distribution has an integral representation of the form

  f(x1, . . . , xn) = ∫_θ {∏_{i=1}^n f(xi | θ)} f(θ) dθ.

If Xi | θ ∼ Exp(θ) then

  ∏_{i=1}^n f(xi | θ) = ∏_{i=1}^n θ exp(−θxi) = θⁿ exp(−θ ∑_{i=1}^n xi).

Notice that, viewed as a function of θ, this looks like a kernel of a Gamma(n + 1, ∑_{i=1}^n xi). The result holds if we can find an f(θ) such that

  n! (1 + ∑_{i=1}^n xi)^{−(n+1)} = ∫_θ θⁿ exp(−θ ∑_{i=1}^n xi) f(θ) dθ.

The left hand side looks like the normalising constant of a Gamma(n + 1, 1 + ∑_{i=1}^n xi) (as n! = Γ(n + 1)) and if f(θ) = exp(−θ) then the integrand on the right hand side is a kernel of a Gamma(n + 1, 1 + ∑_{i=1}^n xi). So, if θ ∼ Gamma(1, 1) then f(θ) = exp(−θ) and we have the desired representation.
5. Let X1 , . . . , Xn be exchangeable so that the Xi are conditionally indepen-
dent given a parameter θ. Suppose that Xi | θ is distributed as a Poisson
distribution with mean θ.
(a) Show that, with respect to this Poisson likelihood, the gamma family
of distributions is conjugate.
  f(x | θ) = ∏_{i=1}^n P(Xi = xi | θ)
           ∝ ∏_{i=1}^n θ^{xi} exp{−θ}
           = θ^{nx̄} exp{−nθ}.

As θ ∼ Gamma(α, β) then

  f(θ | x) ∝ f(x | θ) f(θ)
           ∝ θ^{nx̄} exp{−nθ} θ^{α−1} exp{−βθ}
           = θ^{α+nx̄−1} exp{−(β + n)θ},

a kernel of a Gamma(α + nx̄, β + n) density, so the prior and posterior are in the same family and the gamma family is conjugate.

  f(λ | x) ∝ λ/(n + λ)^{nx̄+1}

where x̄ = (1/n) ∑_{i=1}^n xi. Here θ | λ ∼ Exp(λ) so f(θ | λ) = λ exp{−λθ}, and

  f(λ | x) ∝ λ Γ(nx̄ + 1)/(n + λ)^{nx̄+1}
           ∝ λ/(n + λ)^{nx̄+1}.
(a) Xi | θ ∼ Bern(θ).

The likelihood is

  f(x | θ) = ∏_{i=1}^n θ^{xi} (1 − θ)^{1−xi} = θ^{nx̄} (1 − θ)^{n−nx̄}.

As θ is univariate,

  ∂²/∂θ² log f(x | θ) = ∂²/∂θ² {nx̄ log θ + (n − nx̄) log(1 − θ)}
                      = −nx̄/θ² − (n − nx̄)/(1 − θ)².

The Fisher information is thus

  I(θ) = E(nX̄ | θ)/θ² + (n − E(nX̄ | θ))/(1 − θ)² = n/θ + n/(1 − θ) = n/(θ(1 − θ)),

and the Jeffreys prior is

  f(θ) ∝ √(n/(θ(1 − θ))) ∝ θ^{−1/2} (1 − θ)^{−1/2}.
(b) Xi | θ ∼ P o(θ).
The likelihood is
  f(x | θ) = ∏_{i=1}^n θ^{xi} e^{−θ}/xi! = θ^{nx̄} e^{−nθ} / ∏_{i=1}^n xi!.

As θ is univariate,

  ∂²/∂θ² log f(x | θ) = ∂²/∂θ² (nx̄ log θ − nθ − ∑_{i=1}^n log xi!)
                      = −nx̄/θ².

The Fisher information is thus

  I(θ) = −E(−nX̄/θ² | θ) = n E(X̄ | θ)/θ² = n/θ.

The Jeffreys prior is then

  f(θ) ∝ √(n/θ) ∝ θ^{−1/2}.

This is often expressed as the improper Gamma(1/2, 0) distribution. This makes it straightforward to obtain the posterior using the conjugacy of the Gamma distribution relative to the Poisson likelihood. The posterior is Gamma(1/2 + nx̄, n). (If you are not sure why, see Solution Sheet Four Question 4(a) with α = 1/2, β = 0.)
(c) Xi | θ ∼ Maxwell(θ), the Maxwell distribution with parameter θ so that

  f(xi | θ) = (2/π)^{1/2} θ^{3/2} xi² exp{−θxi²/2},  xi > 0,

and E(Xi | θ) = 2√(2/(πθ)), Var(Xi | θ) = (3π − 8)/(πθ).

The likelihood is

  f(x | θ) = ∏_{i=1}^n (2/π)^{1/2} θ^{3/2} xi² exp{−θxi²/2}
           = (2/π)^{n/2} θ^{3n/2} (∏_{i=1}^n xi²) exp{−(θ/2) ∑_{i=1}^n xi²}.

As θ is univariate,

  ∂²/∂θ² log f(x | θ) = ∂²/∂θ² {(n/2) log(2/π) + (3n/2) log θ + ∑_{i=1}^n log xi² − (θ/2) ∑_{i=1}^n xi²}
                      = −3n/(2θ²).

The Fisher information is thus I(θ) = 3n/(2θ²) and the Jeffreys prior is

  f(θ) ∝ √(3n/(2θ²)) ∝ θ^{−1}.
We write

  f(xi | λ) = λ e^{−λxi} = exp{−λxi + log λ}

which is of the form exp{φ1(λ)u1(xi) + g(λ) + h(xi)} where φ1(λ) = −λ, u1(xi) = xi, g(λ) = log λ and h(xi) = 0. From the Proposition given in Lecture 9 (see p26 of the on-line notes) a sufficient statistic is

  t(X) = ∑_{i=1}^n u1(Xi) = ∑_{i=1}^n Xi.
(b) Find the Jeffreys prior and comment upon whether or not it is im-
proper. Find the posterior distribution for this prior.
The likelihood is

  f(x | λ) = ∏_{i=1}^n λe^{−λxi} = λⁿ e^{−λnx̄}.   (1)

As λ is univariate,

  ∂²/∂λ² log f(x | λ) = ∂²/∂λ² (n log λ − λnx̄) = −n/λ².

The Fisher information is

  I(λ) = −E(−n/λ² | λ) = n/λ²

so that the Jeffreys prior is

  f(λ) ∝ √(n/λ²) ∝ λ^{−1},   (2)

which is improper since ∫₀^∞ λ^{−1} dλ diverges. The corresponding posterior is f(λ | x) ∝ λⁿ e^{−λnx̄} λ^{−1} = λ^{n−1} e^{−nx̄λ}, a kernel of a Gamma(n, nx̄) density, so λ | x ∼ Gamma(n, nx̄).
We have

  ∂²/∂φ² log L(φ) = ∂²/∂φ² (nφ − nx̄e^φ) = −nx̄e^φ,

so that

  I(φ) = −E(−nX̄e^φ | φ) = n E(Xi | φ) e^φ.

Now E(Xi | φ) = E(Xi | λ) = λ^{−1} = e^{−φ} so that I(φ) = n. The Jeffreys prior is then

  fφ(φ) ∝ √n ∝ 1.   (3)

From (2) we have that fλ(λ) ∝ λ^{−1}. Using the familiar change of variables formula,

  fφ(φ) = fλ(e^φ) |∂λ/∂φ|

with ∂λ/∂φ = e^φ as λ = e^φ, we have that fφ(φ) ∝ e^{−φ} e^φ = 1, which agrees with (3): the Jeffreys prior is invariant to the reparametrisation.
The likelihood is

  f(x | θ) = ∏_{i=1}^n (1/√(2πθ)) exp{−(1/2θ)(xi − µ)²}
           = (2π)^{−n/2} θ^{−n/2} exp{−(1/2θ) ∑_{i=1}^n (xi − µ)²}.

As θ is univariate,

  ∂²/∂θ² log f(x | θ) = ∂²/∂θ² {−(n/2) log 2π − (n/2) log θ − (1/2θ) ∑_{i=1}^n (xi − µ)²}
                      = n/(2θ²) − (1/θ³) ∑_{i=1}^n (xi − µ)².

The Fisher information is thus

  I(θ) = −E{n/(2θ²) − (1/θ³) ∑_{i=1}^n (Xi − µ)² | θ} = −n/(2θ²) + nθ/θ³ = n/(2θ²),

since E((Xi − µ)² | θ) = θ, and the Jeffreys prior is f(θ) ∝ √(n/(2θ²)) ∝ θ^{−1}.
(b) Now suppose that Xi | θ ∼ N(µ, σ²) where θ = (µ, σ²). Find the Jeffreys prior for θ.

The likelihood is

  f(x | θ) = (2π)^{−n/2} (σ²)^{−n/2} exp{−(1/2σ²) ∑_{i=1}^n (xi − µ)²}

so that

  log f(x | θ) = −(n/2) log 2π − (n/2) log σ² − (1/2σ²) ∑_{i=1}^n (xi − µ)².

Now,

  ∂/∂µ log f(x | θ) = (1/σ²) ∑_{i=1}^n (xi − µ);
  ∂/∂(σ²) log f(x | θ) = −n/(2σ²) + (1/2σ⁴) ∑_{i=1}^n (xi − µ)²;

so that

  ∂²/∂µ² log f(x | θ) = −n/σ²;
  ∂²/∂µ∂(σ²) log f(x | θ) = −(1/σ⁴) ∑_{i=1}^n (xi − µ);
  ∂²/∂(σ²)² log f(x | θ) = n/(2σ⁴) − (1/σ⁶) ∑_{i=1}^n (xi − µ)².

Taking expectations, E(Xi − µ | θ) = 0 and E((Xi − µ)² | θ) = σ², so the Fisher information matrix is

  I(θ) = [ n/σ²  0 ; 0  n/(2σ⁴) ]

and the Jeffreys prior is

  f(θ) ∝ √(det I(θ)) ∝ √(n²/(2σ⁶)) ∝ (σ²)^{−3/2}.
(a) Obtain the Jeffreys prior distribution for each of the two methods. You may find it useful to note that E(Y | θ) = r/θ.

For the binomial, X | θ ∼ Bin(n, θ), so that

  ∂²/∂θ² log fX(x | θ) = −x/θ² − (n − x)/(1 − θ)².

The Fisher information is I(θ) = n/θ + n/(1 − θ) = n/(θ(1 − θ)) and the Jeffreys prior is

  f(θ) ∝ √(n/(θ(1 − θ))) ∝ θ^{−1/2} (1 − θ)^{−1/2}.

For the negative binomial, Y | θ, so that

  ∂²/∂θ² log fY(y | θ) = −r/θ² − (y − r)/(1 − θ)².

The Fisher information is I(θ) = r/θ² + (E(Y | θ) − r)/(1 − θ)² = r/θ² + (r/θ − r)/(1 − θ)² = r/(θ²(1 − θ)) and the Jeffreys prior is

  f(θ) ∝ √(r/(θ²(1 − θ))) ∝ θ^{−1} (1 − θ)^{−1/2}.
Notice that if x = r and y = n then the two approaches have identical likelihoods: in both cases we observed x successes in n trials, but θ | x ∼ Beta(1/2 + x, 1/2 + n − x) whereas θ | y ∼ Beta(x, 1/2 + n − x). Although the likelihoods are the same, Jeffreys' approach yields different posterior distributions, which seems to contradict the notion of a noninformative prior. This occurs because Jeffreys' prior violates the likelihood principle. In short this principle states that the likelihood contains all the information about the data x, so that two likelihoods contain the same information if they are proportional. In this case, the two likelihoods are identical but yield different posterior distributions. Classical statistics violates the likelihood principle but Bayesian statistics (using proper prior distributions) does not.
  f(x | θ) = g(t, θ) h(x)

where g(t, θ) depends upon t(x) and θ and h(x) does not depend upon θ but may depend upon x. In this case we have that

  f(x | µ, σ²) = g(x̄, s², µ, σ²) (2π)^{−n/2}.

  f(µ, σ² | x) ∝ f(x | µ, σ²) f(µ, σ²)
              ∝ (2πσ²)^{−n/2} exp{−(1/2σ²)[(n − 1)s² + n(x̄ − µ)²]} (1/σ²)
              ∝ (σ²)^{−(n/2+1)} exp{−(1/2σ²)[(n − 1)s² + n(x̄ − µ)²]}.
(c) Show that µ | σ², x ∼ N(x̄, σ²/n). Hence, explain why, in this case, the chosen prior distribution for θ is noninformative.

  f(µ | σ², x) = f(µ, σ² | x)/f(σ² | x)
               ∝ f(µ, σ² | x)  (with respect to µ)
               ∝ exp{−(n/2σ²)(x̄ − µ)²}.

We recognise this as a kernel of N(x̄, σ²/n) so µ | σ², x ∼ N(x̄, σ²/n). In the classical model for µ when σ² is known, the mle is x̄ and its variance is σ²/n. The distribution is coming only from the data (given σ²), showing that the prior is noninformative in this case. Notice that a symmetric 100(1 − α)% credible interval for µ | σ², x is

  (x̄ − z_{1−α/2} σ/√n, x̄ + z_{1−α/2} σ/√n)

which agrees with the 100(1 − α)% confidence interval for µ when σ² is known and X1, . . . , Xn are iid N(µ, σ²).
(d) By integrating f(µ, σ² | x) over σ², show that

  f(µ | x) ∝ [1 + (1/(n − 1))((x̄ − µ)/(s/√n))²]^{−n/2}.

so that

  f(µ | x) ∝ Γ(n/2) [(1/2){(n − 1)s² + n(x̄ − µ)²}]^{−n/2}
           ∝ [(1/2){(n − 1)s² + n(x̄ − µ)²}]^{−n/2}
           ∝ [(n − 1)s² + n(x̄ − µ)²]^{−n/2}
           ∝ [1 + (1/(n − 1))((x̄ − µ)/(s/√n))²]^{−n/2}.
We recognise this as a kernel of tn−1(x̄, s²/n) so that µ | x ∼ tn−1(x̄, s²/n). This gives a further insight into how the prior distribution is noninformative. Inference about µ will mirror the classical approach when µ and σ² are unknown and (X̄ − µ)/(S/√n) ∼ tn−1 where tn−1 = tn−1(0, 1). For example, a symmetric 100(1 − α)% credible interval for µ | x is

  (x̄ − tn−1,α/2 s/√n, x̄ + tn−1,α/2 s/√n)

which agrees with the 100(1 − α)% confidence interval for µ when σ² is unknown and X1, . . . , Xn are iid N(µ, σ²).
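For illustration (not part of the original solution), the interval can be computed in R for any data vector x, assuming the noninformative prior above; ci.mu is a hypothetical helper name.

  ci.mu <- function(x, level = 0.95) {
    n <- length(x)
    xbar <- mean(x); s <- sd(x)
    # symmetric credible interval for mu | x under f(mu, sigma^2) propto 1/sigma^2,
    # numerically identical to the classical t interval
    xbar + c(-1, 1) * qt(1 - (1 - level) / 2, df = n - 1) * s / sqrt(n)
  }
  ci.mu(rnorm(30, mean = 10, sd = 2))   # example with simulated data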
2. Let X1, . . . , Xn be exchangeable with Xi | θ ∼ Bern(θ).

(a) Using the improper prior distribution f(θ) ∝ θ^{−1}(1 − θ)^{−1} find the posterior distribution of θ | x where x = (x1, . . . , xn). Find a normal approximation about the mode to this distribution.

The likelihood is

  f(x | θ) = ∏_{i=1}^n θ^{xi} (1 − θ)^{1−xi} = θ^{nx̄} (1 − θ)^{n−nx̄},

so that

  f(θ | x) ∝ θ^{nx̄−1} (1 − θ)^{n−nx̄−1},

that is θ | x ∼ Beta(nx̄, n − nx̄). The mode of this distribution is θ̃ = (nx̄ − 1)/(n − 2) and the observed information is I(θ | x) = −∂²/∂θ² log f(θ | x) = (nx̄ − 1)/θ² + (n − nx̄ − 1)/(1 − θ)², so that, evaluated at the mode,

  I(θ̃ | x) = (n − 2)²/(nx̄ − 1) + (n − 2)²/(n − nx̄ − 1)
            = (n − 2)³/((nx̄ − 1)(n − nx̄ − 1)).

Hence, approximately, θ | x ∼ N(θ̃, I^{−1}(θ̃ | x)).
(b) Show that the prior distribution f(θ) is equivalent to a uniform prior on

  β = log(θ/(1 − θ))

and find the posterior distribution of β | x. Find a normal approximation about the mode to this distribution.

To find the normal approximation about the mode, we first need to find the mode of β | x. The mode is the maximum of f(β | x) which is, equivalently, the maximum of log f(β | x). We have

  log f(β | x) = log c + βnx̄ − n log(1 + e^β)
  ⇒ ∂/∂β log f(β | x) = nx̄ − n e^β/(1 + e^β).

The mode β̃ satisfies

  nx̄ − n e^{β̃}/(1 + e^{β̃}) = 0
  ⇒ e^{β̃} = nx̄/(n − nx̄)
  ⇒ β̃ = log(x̄/(1 − x̄)).

The observed information is I(β | x) = −∂²/∂β² log f(β | x) = n e^β/(1 + e^β)², so that

  I(β̃ | x) = n × (x̄/(1 − x̄)) × (1 − x̄)² = nx̄(1 − x̄).

Hence, approximately,

  β | x ∼ N(log(x̄/(1 − x̄)), 1/(nx̄(1 − x̄))).
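A quick R comparison (illustrative only, not from the original solution) of the exact Beta(nx̄, n − nx̄) posterior for θ with the normal approximation on the β = logit(θ) scale; n and x̄ below are hypothetical summaries.

  set.seed(1)
  n <- 40; xbar <- 0.3                         # hypothetical Bernoulli data summary
  theta <- rbeta(1e5, n * xbar, n - n * xbar)  # draws from the exact posterior theta | x
  beta.draws <- log(theta / (1 - theta))       # transform to beta = logit(theta)
  c(mean(beta.draws), var(beta.draws))         # compare with the approximation:
  c(log(xbar / (1 - xbar)), 1 / (n * xbar * (1 - xbar)))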
(c) For which parameterisation does it make more sense to use a normal
approximation?
Whilst we can find a normal approximation about the mode either on the scale
of θ or of β, it makes more sense for β. We have 0 < θ < 1 and −∞ < β < ∞ so
only β has a sample space which agrees with the normal distribution.
3. In viewing a section through the pancreas, doctors see what are called
“islands”. Suppose that Xi denotes the number of islands observed in the
ith patient, i = 1, . . . , n, and we judge that X1 , . . . , Xn are exchangeable with
Xi | θ ∼ P o(θ). A doctor believes that for healthy patients θ will be on average
around 2; he thinks it is unlikely that θ is greater than 3. The number of
islands seen in 100 patients are summarised in the following table.
Number of islands 0 1 2 3 4 5 ≥6
Frequency 20 30 28 14 7 1 0
(a) Express the doctor’s prior beliefs as a normal distribution for θ. You
may interpret the term “unlikely” as meaning “with probability 0.01”.
The doctor thus asserts θ ∼ N(µ, σ²) with E(θ) = 2 and P(θ > 3) = 0.01. Note that, as θ is continuous, this is equivalent to P(θ ≥ 3) = 0.01. We use these two pieces of information to obtain µ and σ². Firstly, E(θ) = 2 ⇒ µ = 2. Secondly,

  P(θ > 3) = 0.01 ⇒ P((θ − 2)/σ > 1/σ) = 0.01.

As (θ − 2)/σ ∼ N(0, 1) we have that

  1/σ = 2.33 ⇒ σ² = 2.33^{−2} = 5.4289^{−1} = 0.1842.

Hence, θ ∼ N(2, 5.4289^{−1}).
(d) Why might you prefer to express the doctor's prior beliefs as a normal distribution on some other parameterisation φ = g(θ)? Suggest an appropriate choice of g(·) in this case. Now express the doctor's beliefs using a normal prior for φ; that for healthy patients φ will be on average around g(2) and it is "unlikely" that φ is greater than g(3). Give an expression for the density of φ | x up to a constant of proportionality.

By definition θ > 0 but both the specified normal prior distribution and the normal approximation for the posterior θ | x have a sample space of (−∞, ∞), so we might want to use some other parametrisation which has the same sample space as the normal distribution. An obvious choice is to use φ = g(θ) = log θ as if θ > 0 then −∞ < log θ < ∞. We assert φ ∼ N(µ0, σ0²) with E(φ) = log 2 and P(φ > log 3) = 0.01. So, E(φ) = log 2 ⇒ µ0 = log 2 and

  P(φ > log 3) = 0.01 ⇒ P((φ − log 2)/σ0 > (log 3 − log 2)/σ0) = 0.01.

As (φ − log 2)/σ0 ∼ N(0, 1) we have that

  1/σ0 = 2.33/log(3/2) ⇒ σ0² = 0.0303.

With Xi | φ ∼ Po(e^φ), n = 100 and ∑ xi = 161 from the table, the posterior density is, up to a constant of proportionality,

  f(φ | x) ∝ exp{161φ − 100e^φ} exp{−(φ − log 2)²/(2 × 0.0303)}.
The likelihood is

  f(x | λ) = ∏_{i=1}^{10} λe^{−λxi} = λ^{10} e^{−4λ}.

The posterior may be written f(λ | x) = k λ^{10} exp{−20λ² + 6λ} for λ > 0, where

  k^{−1} = ∫₀^∞ λ^{10} exp{−20λ² + 6λ} dλ.

The mode λ̃ satisfies 10/λ − 40λ + 6 = 0, that is 20λ² − 3λ − 5 = 0. Hence, as λ̃ > 0,

  λ̃ = (3 + √(9 + 4(20)(5)))/(2(20)) = 0.5806.
If λ1, . . . , λN are sampled independently from the N(0.5806, 69.6651^{−1}) proposal distribution then

  Î = (1/N) ∑_{i=1}^N [k λi^{11} exp{−20λi² + 6λi} I{λi>0}] / [√(69.6651/2π) exp{−(69.6651/2)(λi − 0.5806)²}]

is an unbiased estimate of the posterior mean of λ, with I{λi>0} denoting the indicator function for the event {λi > 0}.
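Since k is unknown in practice, a common variant is the self-normalised importance sampling estimate, in which k cancels; the R sketch below (not part of the original solution) uses the same N(0.5806, 1/69.6651) proposal.

  set.seed(1)
  N <- 1e5
  lambda <- rnorm(N, mean = 0.5806, sd = sqrt(1 / 69.6651))          # proposal draws
  g <- function(l) ifelse(l > 0, l^10 * exp(-20 * l^2 + 6 * l), 0)   # unnormalised posterior
  w <- g(lambda) / dnorm(lambda, mean = 0.5806, sd = sqrt(1 / 69.6651))
  sum(w * lambda) / sum(w)       # self-normalised estimate of E(lambda | x)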
Note that

  I = ∫₀^{10} exp(−2|x − 5|) dx = ∫₀^{10} 10 exp(−2|x − 5|) (1/10) dx
    = Ef(10 exp(−2|X − 5|)).

  I = ∫_{−∞}^{∞} [10 exp(−2|x − 5|) (1/10) I{0≤x≤10} / ((1/√2π) exp{−(x − 5)²/2})] (1/√2π) exp{−(x − 5)²/2} dx
    = ∫_{−∞}^{∞} (g(x)f(x)/q(x)) q(x) dx = Eq(g(X)f(X)/q(X))

where g(x) = 10 exp(−2|x − 5|), f(x) = (1/10) I{0≤x≤10} (the density of the U(0, 10) distribution) and q(x) = (1/√2π) exp{−(x − 5)²/2} (the density of the N(5, 1) distribution).
(i) sample n values x1, . . . , xn independently from q(x), the N(5, 1)
(ii) compute g(xi)f(xi)/q(xi) for each xi
(iii) estimate I by Ĩ, the sample mean of the g(xi)f(xi)/q(xi)'s, so that

  Ĩ = (1/n) ∑_{i=1}^n [10 exp(−2|xi − 5|) (1/10) I{0≤xi≤10}] / [(1/√2π) exp{−(xi − 5)²/2}].   (2)
(c) Use R to explore the methods used in parts (a) and (b). Which method
do you prefer?
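The original R listing is truncated in this copy; a minimal sketch of the two estimators being compared (plain Monte Carlo with U(0, 10) draws and importance sampling with N(5, 1) draws) is given below, with hypothetical function names mc.est and imp.est.

  g <- function(x) 10 * exp(-2 * abs(x - 5))        # so that I = E_f(g(X)) with f = U(0, 10)
  mc.est <- function(n) mean(g(runif(n, 0, 10)))    # plain Monte Carlo estimate
  imp.est <- function(n) {                          # importance sampling with q = N(5, 1)
    x <- rnorm(n, 5, 1)
    mean(g(x) * dunif(x, 0, 10) / dnorm(x, 5, 1))
  }
  c(mc.est(1000), imp.est(1000))   # both should be close to 1 - exp(-10)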
We explore the two methods for n = 100, 500, 1000, and 5000, in each case taking 1000 replications. The results are shown in the following table; note that the true value of I is 1 − exp(−10) = 0.9999546.

  Method   n      Mean of 1000 samples   Variance of 1000 samples
  Monte    100    1.013746               0.04300273
  Import   100    0.9997384              0.003624815
  Monte    500    0.999168               0.008210746
  Import   500    1.00074                0.0007107777
  Monte    1000   1.00279                0.004010403
  Import   1000   1.000122               0.0003694998
  Monte    5000   1.000884               0.000773445
  Import   5000   0.9997536              0.00007040271
[Figure: plot of exp(−2|x − 5|) over 0 ≤ x ≤ 10 together with the pdfs of the U(0, 10) and N(5, 1) distributions.]
  Î = (1/N) ∑_{i=1}^N h(θi)g(θi)/q(θi)   (3)

and, since c is known, estimate Ef(h(θ)) by cÎ.
(b) Suppose that c is unknown. By noting that

  c^{−1} = ∫_{−∞}^{∞} g(θ) dθ,

We have that

  1 = ∫_{−∞}^{∞} f(θ | x) dθ = ∫_{−∞}^{∞} c g(θ) dθ

so that

  c^{−1} = ∫_{−∞}^{∞} g(θ) dθ = ∫_{−∞}^{∞} (g(θ)/q(θ)) q(θ) dθ = Eq(g(θ)/q(θ)).

Thus, we compute g(θi)/q(θi) for each i = 1, . . . , N and estimate c^{−1} = Eq(g(θ)/q(θ)) by

  Ĩ = (1/N) ∑_{i=1}^N g(θi)/q(θi),   (4)

so that, combining with (3), Ef(h(θ)) may be estimated by the ratio Î/Ĩ.
Note that

  P(X > a) = ∫_a^∞ f(x) dx = ∫_{−∞}^∞ I{x>a} f(x) dx = Ef(I{X>a})

where I{X>a} is the indicator function of the event {X > a} and Ef(·) denotes expectation with respect to the distribution f. We thus may use Monte Carlo integration to estimate P(X > a) as follows:

(i) sample n values x1, . . . , xn independently from f(x), the N(0, 1)
(ii) compute I{xi>a} for each xi, so I{xi>a} = 1 if xi > a and is 0 otherwise
(iii) estimate P(X > a) = Ef(I{X>a}) by the sample mean of the I{xi>a}, (1/n) ∑_{i=1}^n I{xi>a}.
(b) What are the difficulties with the method for large values of a?

For large a the event {X > a} is very rare, so almost every I{xi>a} is zero and an extremely large n is needed before the estimate has reasonable relative accuracy.

Let q(x) denote the probability density function of the N(µ, 1). Then, as q(x) > 0 whenever f(x) > 0,

  f(x)/q(x) = exp{−x²/2}/exp{−(x − µ)²/2} = exp{µ²/2 − µx}.

Thus,

  Ef(I{X>a}) = Eq(I{X>a} exp{µ²/2 − µX}).

We thus use importance sampling to estimate P(X > a) = Ef(I{X>a}) as follows:

(i) sample n values x1, . . . , xn independently from q(x), the N(µ, 1)
(ii) compute I{xi>a} exp{µ²/2 − µxi} for each xi
(iii) estimate P(X > a) = Ef(I{X>a}) by

  Î = (1/n) ∑_{i=1}^n I{xi>a} exp{µ²/2 − µxi}.
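An illustrative R implementation of steps (i)-(iii) (not taken from the original solution), using a = 4.5 and µ = 4.5 as in the discussion below:

  set.seed(1)
  a <- 4.5; mu <- 4.5; n <- 1e4
  x <- rnorm(n, mean = mu, sd = 1)            # draws from q = N(mu, 1)
  mean((x > a) * exp(mu^2 / 2 - mu * x))      # importance sampling estimate of P(X > a)
  pnorm(a, lower.tail = FALSE)                # exact value, approximately 3.4e-06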
We observe that, even for n = 100 observations, the method of importance sam-
pling is doing well and that the variance of the estimate is much reduced when
compared to direct Monte Carlo integration. We repeat the process with a = 4.5
and µ = 4.5 so that P (X > 4.5) = 3.397673 × 10−6 .
Notice that even for sizes of n when Monte Carlo fails (as we haven’t observed any
non-zero values), the importance sampling method is working well. When we can
use Monte Carlo, for the same n, the variance of the estimate using importance
sampling is vastly reduced.
(d) Explain how to find P(X > 4.5) using the method of importance sampling where we sample from the Exp(1) distribution truncated at 4.5 which has probability density function

  q(x) = exp{−(x − 4.5)} for x > 4.5, and 0 otherwise.

In this part, the idea is to remove the need for the indicator function when estimating P(X > a) by sampling from a distribution, q(x), whose support is (a, ∞), so that

  P(X > a) = ∫_a^∞ f(x) dx = ∫_a^∞ (f(x)/q(x)) q(x) dx = Eq(f(X)/q(X)).

For a = 4.5 we have that, for X > 4.5,

  f(X)/q(X) = (1/√2π) exp{−X²/2} / exp{−(X − 4.5)} = (1/√2π) exp{−X²/2 + X − 4.5}

so that our importance sampling method is as follows:

(i) sample n values x1, . . . , xn independently from q(x), the Exp(1) distribution truncated at 4.5
(ii) compute (1/√2π) exp{−xi²/2 + xi − 4.5} for each xi
(iii) estimate P(X > 4.5) by the sample mean of these values.
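Again as an illustration (not in the original solution), draws from the Exp(1) distribution truncated at 4.5 can be obtained by shifting standard exponential draws:

  set.seed(1)
  n <- 1e4
  x <- 4.5 + rexp(n, rate = 1)      # Exp(1) truncated (shifted) at 4.5
  w <- dnorm(x) / exp(-(x - 4.5))   # f(x) / q(x)
  mean(w)                           # estimate of P(X > 4.5)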
[Figure 2: Convergence, as n increases, of the importance sampling estimate Î for P(X > 4.5) obtained by sampling from the Exp(1) distribution truncated at 4.5.]
Prove that the choice of q(x) which minimises the variance of Î (with respect to q(x)) is

  q*(x) = |g(x)|f(x) / ∫_X |g(z)|f(z) dz.

Let Varq(·) denote variance calculated with respect to q(x). Then, as X1, . . . , XN are independent observations from q(x),

  Varq(Î) = (1/N²) ∑_{i=1}^N Varq(g(Xi)f(Xi)/q(Xi)) = (1/N) Varq(g(X)f(X)/q(X)).

Hence, the choice of q(x) which minimises the variance of Î is equivalent to that which minimises

  Varq(g(X)f(X)/q(X)) = Eq(g²(X)f²(X)/q²(X)) − Eq²(g(X)f(X)/q(X))
                      = Eq(g²(X)f²(X)/q²(X)) − Ef²(g(X)).

Thus, as Ef²(g(X)) does not depend upon the choice of q(x), we need only consider finding a q(x) minimising Eq(g²(X)f²(X)/q²(X)). Now, for any q(x), as Varq(|g(X)|f(X)/q(X)) ≥ 0,

  Eq(g²(X)f²(X)/q²(X)) ≥ Eq²(|g(X)|f(X)/q(X)) = (∫_X |g(x)|f(x) dx)².   (5)

Notice that the lower bound does not depend upon the choice of q(x). Thus, if there exists a q(x) which achieves this minimum bound then it must minimise the variance of Î.
For q*(x) we have

  |g(X)|f(X)/q*(X) = ∫_X |g(z)|f(z) dz

so that, as ∫_X |g(z)|f(z) dz is a constant,

  Eq*(g²(X)f²(X)/(q*(X))²) = Eq*((∫_X |g(z)|f(z) dz)²) = (∫_X |g(z)|f(z) dz)²

and the lower bound in (5) is achieved, showing that q*(x) minimises the variance of Î.²
2 Notice that this is really a theoretical rather than a practical result. For example, if g(x) > 0 then
q ∗ (x) = (g(x)f (x))/Ef (g(X)) and so requires Ef (g(X)) which is the integral we’re trying to evaluate.
As Xi | θ ∼ Inv-gamma(α, θ) then

  f(x | θ) = ∏_{i=1}^n (θ^α/Γ(α)) xi^{−(α+1)} exp{−θ/xi}
           ∝ θ^{nα} exp{−θ ∑_{i=1}^n 1/xi}.

Similarly, as θ ∼ Gamma(α0, β0),

  f(θ) ∝ θ^{α0−1} exp(−θβ0).

Hence,

  f(θ | x) ∝ f(x | θ) f(θ)
           ∝ θ^{(α0+nα)−1} exp{−θ(β0 + ∑_{i=1}^n 1/xi)}

which we recognise as a kernel of a Gamma(α0 + nα, β0 + ∑_{i=1}^n 1/xi). Thus, θ | x ∼ Gamma(αn, βn) as required.
(b) We wish to use the Metropolis-Hastings algorithm to sample from the
posterior distribution θ | x using a normal distribution with mean θ and
chosen variance σ 2 as the symmetric proposal distribution.
i. Suppose that, at time t, the proposed value θ∗ ≤ 0. Briefly explain
why the corresponding acceptance probability is zero for such a θ∗
and thus that the sequence of values generated by the algorithm
are never less than zero.
Since θ | x ∼ Gamma(αn, βn), where αn = α0 + nα and βn = β0 + ∑_{i=1}^n 1/xi, we have f(θ* | x) = 0 for any θ* ≤ 0. As the proposal distribution is symmetric, the acceptance probability is

  α(θ^{(t−1)}, θ*) = min{1, f(θ* | x)/f(θ^{(t−1)} | x)},

which is therefore zero for any θ* ≤ 0, so such proposals are always rejected and, provided θ^{(0)} > 0, the sequence of values generated by the algorithm is never less than zero. The algorithm is:

A. Choose a starting value θ^{(0)} > 0.
B. At time t, sample θ* from the N(θ^{(t−1)}, σ²) proposal distribution and compute α(θ^{(t−1)}, θ*) as above.
• Generate U ∼ U(0, 1).
• If U ≤ α(θ^{(t−1)}, θ*) accept the move, θ^{(t)} = θ*. Otherwise reject the move, θ^{(t)} = θ^{(t−1)}.
C. Repeat step B.
The algorithm will produce a Markov Chain with stationary distribution
f (θ | x). After a sufficiently long time, b say, to allow for convergence, the
values {θ(t) } for t > b may be considered as a sample from f (θ | x), that is
a sample from Gamma(αn , βn ), where the samples {θ(t) } for t ≤ b are the
“burn-in”.
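A minimal R sketch of this random-walk Metropolis-Hastings sampler (illustrative only; the values of αn, βn and the proposal standard deviation below are hypothetical):

  set.seed(1)
  alpha.n <- 30; beta.n <- 4; sigma <- 0.5        # hypothetical alpha_n, beta_n, proposal sd
  N <- 1e4; theta <- numeric(N); theta[1] <- 1    # positive starting value
  for (t in 2:N) {
    prop <- rnorm(1, mean = theta[t - 1], sd = sigma)   # symmetric proposal
    ratio <- dgamma(prop, alpha.n, beta.n) / dgamma(theta[t - 1], alpha.n, beta.n)
    # dgamma returns 0 for prop <= 0, so such proposals are never accepted
    theta[t] <- if (runif(1) <= min(1, ratio)) prop else theta[t - 1]
  }
  mean(theta[-(1:1000)])      # posterior mean estimate after discarding burn-in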
2. Suppose that X | θ ∼ N (θ, σ 2 ) and Y | θ, δ ∼ N (θ − δ, σ 2 ), where σ 2 is a known
constant and X and Y are conditionally independent given θ and δ. It is
judged that the improper noninformative joint prior distribution f (θ, δ) ∝ 1
is appropriate.
(a) Show that the joint posterior distribution of θ and δ given x and y is
bivariate normal with mean vector µ and variance matrix Σ where
2
σ2
E(θ | X, Y ) x σ
µ = = ; Σ = .
E(δ | X, Y ) x−y σ 2 2σ 2
As X | θ ∼ N (θ, σ 2 ) then
1
f (x | θ) ∝ exp − 2 (θ2 − 2xθ) .
2σ
As Y | θ, δ ∼ N (θ − δ, σ 2 ) then
1 2
f (y | θ, δ) ∝ exp − 2 [(θ − δ) − 2y(θ − δ)] .
2σ
f (x, y | θ, δ) = f (x | θ)f (y | θ, δ)
1
∝ exp − 2 [θ2 − 2xθ + (θ − δ)2 − 2y(θ − δ)] .
2σ
So,
(b) Describe how the Gibbs sampler may be used to sample from the posterior distribution θ, δ | x, y, deriving all required conditional distributions.

With proportionality with respect to θ,

  f(θ | δ, x, y) ∝ f(θ, δ | x, y)
                 ∝ exp{−(1/2σ²)[2θ² − 2(x + y + δ)θ]}
                 = exp{−(1/2)(2/σ²)[θ² − 2((x + y + δ)/2)θ]}

which is a kernel of a normal density. Thus, θ | δ, x, y ∼ N((x + y + δ)/2, σ²/2). Now, with proportionality with respect to δ,

  f(δ | θ, x, y) ∝ f(θ, δ | x, y)
                 ∝ exp{−(1/2σ²)[δ² − 2(θ − y)δ]}

which is a kernel of a normal density. Thus, δ | θ, x, y ∼ N(θ − y, σ²).
For x = 2, y = 1 and σ² = 1 we shall use the Gibbs sampler to sample from N₂(µ, Σ) where µ = (2, 1)ᵀ and Σ = [1 1; 1 2]. We draw θ^{(t)} from N((3 + δ^{(t−1)})/2, 1/2) and δ^{(t)} from N(θ^{(t)} − 1, 1). Suppose that we start from (1, 1) and sample θ^{(1)} = 2.216715, δ^{(1)} = 1.475738, θ^{(2)} = 2.086383 and δ^{(2)} = 1.617760. The path of the trajectory of these points is shown in Figure 1.
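This Gibbs sampler can be written in a few lines of R (an illustrative sketch, not the original code used to produce Figure 1):

  set.seed(1)
  x <- 2; y <- 1; sigma2 <- 1
  N <- 5000; theta <- delta <- numeric(N)
  theta[1] <- 1; delta[1] <- 1                                          # starting point (1, 1)
  for (t in 2:N) {
    theta[t] <- rnorm(1, (x + y + delta[t - 1]) / 2, sqrt(sigma2 / 2))  # theta | delta, x, y
    delta[t] <- rnorm(1, theta[t] - y, sqrt(sigma2))                    # delta | theta, x, y
  }
  colMeans(cbind(theta, delta))    # should be close to (2, 1)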
(d) Suppose, instead, that we consider sampling from the posterior distri-
bution using the Metropolis-Hastings algorithm where the proposal dis-
tribution is the bivariate normal with mean vector µ̃(t−1) = (θ(t−1) , δ (t−1) )T
and known variance matrix Σ̃. Explain the Metropolis-Hastings algo-
rithm for this case, explicitly stating the acceptance ratio.
[Figure 1: A trajectory path of the Gibbs sampler sampling from N₂(µ, Σ).]
As the proposal distribution is symmetric, the acceptance probability for a proposed move from γ^{(t−1)} to γ* is

  α(γ^{(t−1)}, γ*) = min{1, f(γ* | x, y)/f(γ^{(t−1)} | x, y)}

where γ* = (θ*, δ*)ᵀ and γ^{(t−1)} = (θ^{(t−1)}, δ^{(t−1)})ᵀ.

• Generate U ∼ U(0, 1).
• If U ≤ α((θ^{(t−1)}, δ^{(t−1)}), (θ*, δ*)) accept the move, (θ^{(t)}, δ^{(t)}) = (θ*, δ*). Otherwise reject the move, (θ^{(t)}, δ^{(t)}) = (θ^{(t−1)}, δ^{(t−1)}).

iii. Repeat step ii.
The algorithm produces a Markov Chain with stationary distribution f (θ, δ | x, y).
After a sufficiently long time to allow for convergence, the values {θ(t) , δ (t) } for t >
b may be considered as a sample from f (θ, δ | x, y) where the samples {θ(t) , δ (t) }
for t ≤ b are the “burn-in”.
3. Let X1 , . . . , Xn be exchangeable so that the Xi are conditionally independent
given a parameter θ = (µ, λ). Suppose that Xi | θ ∼ N (µ, 1/λ) so that µ is the
mean and λ the precision of the distribution. Suppose that we judge that µ
and λ are independent with µ ∼ N (µ0 , 1/τ ), where µ0 and τ are known, and
λ ∼ Gamma(α, β), where α and β are known.
(a) Show that the posterior density f(µ, λ | x), where x = (x1, . . . , xn), can be expressed as

  f(µ, λ | x) ∝ λ^{α+n/2−1} exp{−(λ/2) ∑_{i=1}^n (xi − µ)² − (τ/2)µ² + τµ0µ − βλ}.

The likelihood is

  f(x | µ, λ) = ∏_{i=1}^n (λ/2π)^{1/2} exp{−(λ/2)(xi − µ)²}
              ∝ λ^{n/2} exp{−(λ/2) ∑_{i=1}^n (xi − µ)²}.

Combining this with the priors µ ∼ N(µ0, 1/τ) and λ ∼ Gamma(α, β),

  f(µ, λ | x) ∝ λ^{n/2} exp{−(λ/2) ∑_{i=1}^n (xi − µ)²} exp{−(τ/2)(µ − µ0)²} λ^{α−1} exp{−βλ}
              ∝ λ^{α+n/2−1} exp{−(λ/2) ∑_{i=1}^n (xi − µ)² − (τ/2)µ² + τµ0µ − βλ},

as required.
We have

  f(λ | µ, x) = f(λ, µ | x)/f(µ | x)
              ∝ f(λ, µ | x)  (with respect to λ)
              ∝ λ^{α+n/2−1} exp{−λ(β + (1/2) ∑_{i=1}^n (xi − µ)²)}

which is a kernel of a Gamma(α + n/2, β + (1/2) ∑_{i=1}^n (xi − µ)²).
(c) Given that µ | λ, x ∼ N((τµ0 + nλx̄)/(τ + nλ), 1/(τ + nλ)), where x̄ = (1/n) ∑_{i=1}^n xi, describe how the Gibbs sampler may be used to sample from the posterior distribution µ, λ | x. Give a sensible estimate of Var(λ | x).

i. Choose an arbitrary starting point (µ^{(0)}, λ^{(0)}) for which f(µ^{(0)}, λ^{(0)} | x) > 0.
ii. Obtain µ^{(t)} by sampling from the distribution of µ | λ^{(t−1)}, x and then obtain λ^{(t)} by sampling from the distribution of λ^{(t)} | µ^{(t)}, x.
iii. Repeat step ii.

The algorithm will produce a Markov Chain with stationary distribution f(µ, λ | x). After a sufficiently long time to allow for convergence, the values {µ^{(t)}, λ^{(t)}} for t > b may be considered as a sample from f(µ, λ | x) where the samples {µ^{(t)}, λ^{(t)}} for t ≤ b are the "burn-in".
Ergodic averages converge to the desired expectations under the target distribution so that, for example,

  (1/N) ∑_{t=1}^N g(λ^{(t)}) → E(g(λ) | X).

A sensible estimate of Var(λ | x) is therefore the sample variance of the post burn-in draws,

  (1/(N − b)) ∑_{t=b+1}^N (λ^{(t)} − λ̄)²,

where our sample is {µ^{(t)}, λ^{(t)} : t = b + 1, . . . , N} and λ̄ = (1/(N − b)) ∑_{t=b+1}^N λ^{(t)}, the sample mean.
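A short R sketch of this Gibbs sampler and the variance estimate (illustrative only; the data and hyperparameters below are hypothetical):

  set.seed(1)
  x <- rnorm(50, mean = 3, sd = 2)                       # hypothetical data
  n <- length(x); mu0 <- 0; tau <- 0.01; alpha <- 1; beta <- 1
  N <- 1e4; mu <- lambda <- numeric(N); mu[1] <- mean(x); lambda[1] <- 1
  for (t in 2:N) {
    prec <- tau + n * lambda[t - 1]
    mu[t] <- rnorm(1, (tau * mu0 + n * lambda[t - 1] * mean(x)) / prec, sqrt(1 / prec))
    lambda[t] <- rgamma(1, alpha + n / 2, beta + sum((x - mu[t])^2) / 2)
  }
  b <- 1000
  var(lambda[(b + 1):N])   # estimate of Var(lambda | x) from the post burn-in draws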
4. Consider a Poisson hierarchical model. At the first stage we have observa-
tions sj which are Poisson with mean tj λj for j = 1, . . . , p where each tj is
known. We assume that the λj are independent and identically dis-
tributed with Gamma(α, β) prior distributions. The parameter α is known
but β is unknown and is given a Gamma(γ, δ) distribution where γ and δ are
known.
(a) Find, up to a constant of integration, the joint posterior distribution of the unknown parameters given s = (s1, . . . , sp).

  f(λ, β) = {∏_{j=1}^p f(λj | β)} f(β)
          = {∏_{j=1}^p (β^α/Γ(α)) λj^{α−1} e^{−βλj}} × (δ^γ/Γ(γ)) β^{γ−1} e^{−δβ}
          ∝ {∏_{j=1}^p λj^{α−1}} β^{(pα+γ)−1} exp{−β(δ + ∑_{j=1}^p λj)}.

The likelihood is

  f(s | λ, β) = ∏_{j=1}^p (tjλj)^{sj} e^{−tjλj}/sj!
              ∝ {∏_{j=1}^p λj^{sj}} exp{−∑_{j=1}^p tjλj}.

Hence, up to a constant,

  f(λ, β | s) ∝ f(s | λ, β) f(λ, β)
              ∝ {∏_{j=1}^p λj^{α+sj−1}} β^{(pα+γ)−1} exp{−∑_{j=1}^p (β + tj)λj − δβ}.
(b) Describe how the Gibbs sampler may be used to sample from the posterior distribution, deriving all required conditional distributions.

For each j, let λ−j = λ\λj. For the Gibbs sampler we need to find the distributions of λj | λ−j, β, s, j = 1, . . . , p, and β | λ, s. Now,

  f(λj | λ−j, β, s) = f(λ, β | s)/f(λ−j, β | s)
                    ∝ f(λ, β | s)  (taken with respect to λj)
                    ∝ λj^{α+sj−1} exp{−(β + tj)λj}

which is a kernel of a Gamma(α + sj, β + tj) distribution. Hence, λj | λ−j, β, s ∼ Gamma(α + sj, β + tj). Similarly,

  f(β | λ, s) = f(λ, β | s)/f(λ | s)
              ∝ f(λ, β | s)  (taken with respect to β)
              ∝ β^{(pα+γ)−1} exp{−(δ + ∑_{j=1}^p λj)β}

which is a kernel of a Gamma(pα + γ, δ + ∑_{j=1}^p λj) distribution. So, β | λ, s ∼ Gamma(pα + γ, δ + ∑_{j=1}^p λj).
i. Choose an arbitrary starting point for which f(λ^{(0)}, β^{(0)} | s) > 0. As λj > 0 for j = 1, . . . , p and β > 0 then any collection of p + 1 positive numbers will suffice.
ii. • Obtain λ1^{(t)} by sampling from the distribution of λ1 | λ−1^{(t−1)}, β^{(t−1)}, s which is Gamma(α + s1, β^{(t−1)} + t1).
   ⋮
   • Obtain λj^{(t)} by sampling from the distribution of λj | λ1^{(t)}, . . . , λ_{j−1}^{(t)}, λ_{j+1}^{(t−1)}, . . . , λp^{(t−1)}, β^{(t−1)}, s which is Gamma(α + sj, β^{(t−1)} + tj).
   ⋮
   • Obtain λp^{(t)} by sampling from the distribution of λp | λ−p^{(t)}, β^{(t−1)}, s which is Gamma(α + sp, β^{(t−1)} + tp).
   • Obtain β^{(t)} by sampling from the distribution of β | λ^{(t)}, s, the Gamma(pα + γ, δ + ∑_{j=1}^p λj^{(t)}).
We first remove the burn-in samples. Suppose we determine the burn-in time b, where we assume that the Markov chain has sufficiently converged for t > b. (We assume that b < N.) Then we discard observations {λ^{(t)}, β^{(t)} : t ≤ b} and use the remaining sample {λ^{(t)}, β^{(t)} : t = b + 1, . . . , N} to estimate posterior summaries. Here λ^{(t)} = (λ1^{(t)}, . . . , λp^{(t)}).

The posterior mean of λj | s can be estimated by the sample mean of {λj^{(t)}, t = b + 1, . . . , N}, so that our estimate of E(λj | s) is (1/(N − b)) ∑_{t=b+1}^N λj^{(t)}.
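An illustrative R implementation of this Gibbs sampler (the counts s, exposures t and hyperparameters below are hypothetical values chosen only to make the sketch runnable):

  set.seed(1)
  s <- c(5, 1, 5, 14, 3, 19, 1, 1, 4, 22)          # hypothetical counts s_j
  t.exp <- c(94, 16, 63, 126, 5, 31, 1, 1, 2, 10)  # hypothetical exposures t_j
  p <- length(s); alpha <- 1.8; gam <- 0.01; del <- 1
  N <- 1e4; beta <- numeric(N); lambda <- matrix(0, N, p)
  beta[1] <- 1; lambda[1, ] <- s / t.exp + 0.1     # positive starting values
  for (k in 2:N) {
    # the lambda_j are conditionally independent given beta, so they can be drawn together
    lambda[k, ] <- rgamma(p, alpha + s, beta[k - 1] + t.exp)          # lambda_j | beta, s
    beta[k] <- rgamma(1, p * alpha + gam, del + sum(lambda[k, ]))     # beta | lambda, s
  }
  colMeans(lambda[-(1:1000), ])   # estimates of E(lambda_j | s) after burn-in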
(t)
For the Gibbs sampler, we update each θp one at a time by obtaining θp from the
We now demonstrate that this update step in the Gibbs sampler is equivalent to a single
(t−1) (t) (t)
Metropolis-Hastings step with acceptance probability 1. Let θ[p] = (θ1 , . . . , θp−1 ,
(t−1) (t−1) (t−1) (t) (t) (t) (t−1) (t−1) (t−1)
θp , θp+1 , . . . , θd ) and θ[p] = (θ1 , . . . , θp−1 , θp , θp+1 , . . . , θd ). Consider
(t−1) (t) (t) (t) (t) (t−1) (t−1)
the move from toθ[p] As is drawn from
θ[p] . θ[p] π(θp | θ1 , . . . , θp−1 , θp+1 , . . . , θd )
then we shall use this for the proposal distribution in the Metropolis-Hastings algorithm
so that
(t) (t−1) (t) (t) (t−1) (t−1)
q(θ[p] | θ[p] ) = π(θp(t) | θ1 , . . . , θp−1 , θp+1 , . . . , θd ). (1)
(t) (t−1) (t−1)
If, instead, we consider the move from θ[p] to θ[p] then, using Gibbs sampling, θ[p]
(t−1) (t) (t) (t−1) (t−1)
is obtained by drawing θp from π(θp | θ1 , . . . , θp−1 , θp+1 , . . . , θd ). Thus, we set
(t−1) (t) (t) (t) (t−1) (t−1)
q(θ[p] | θ[p] ) = π(θp(t−1) | θ1 , . . . , θp−1 , θp+1 , . . . , θd ). (2)
Now,

  π(θ[p]^{(t−1)}) = π(θ1^{(t)}, . . . , θ_{p−1}^{(t)}, θp^{(t−1)}, θ_{p+1}^{(t−1)}, . . . , θ_d^{(t−1)})
                  = π(θp^{(t−1)} | θ1^{(t)}, . . . , θ_{p−1}^{(t)}, θ_{p+1}^{(t−1)}, . . . , θ_d^{(t−1)}) π(θ1^{(t)}, . . . , θ_{p−1}^{(t)}, θ_{p+1}^{(t−1)}, . . . , θ_d^{(t−1)})   (3)
                  = q(θ[p]^{(t−1)} | θ[p]^{(t)}) π(θ1^{(t)}, . . . , θ_{p−1}^{(t)}, θ_{p+1}^{(t−1)}, . . . , θ_d^{(t−1)})   (4)

where (4) follows by substituting (2) into (3). In an identical fashion,

  π(θ[p]^{(t)}) = π(θ1^{(t)}, . . . , θ_{p−1}^{(t)}, θp^{(t)}, θ_{p+1}^{(t−1)}, . . . , θ_d^{(t−1)})
                = π(θp^{(t)} | θ1^{(t)}, . . . , θ_{p−1}^{(t)}, θ_{p+1}^{(t−1)}, . . . , θ_d^{(t−1)}) π(θ1^{(t)}, . . . , θ_{p−1}^{(t)}, θ_{p+1}^{(t−1)}, . . . , θ_d^{(t−1)})   (5)
                = q(θ[p]^{(t)} | θ[p]^{(t−1)}) π(θ1^{(t)}, . . . , θ_{p−1}^{(t)}, θ_{p+1}^{(t−1)}, . . . , θ_d^{(t−1)})   (6)

where (6) follows by substituting (1) into (5). Dividing (6) by (4) gives

  π(θ[p]^{(t)}) / π(θ[p]^{(t−1)}) = q(θ[p]^{(t)} | θ[p]^{(t−1)}) / q(θ[p]^{(t−1)} | θ[p]^{(t)})

so that

  π(θ[p]^{(t)}) q(θ[p]^{(t−1)} | θ[p]^{(t)}) / [π(θ[p]^{(t−1)}) q(θ[p]^{(t)} | θ[p]^{(t−1)})] = 1.   (7)
Thus, a Metropolis-Hastings algorithm for sampling from π(θ) for a proposed move from θ[p]^{(t−1)} to θ[p]^{(t)} using the proposal distribution given by (1) has an acceptance ratio of

  α(θ[p]^{(t−1)}, θ[p]^{(t)}) = min{1, π(θ[p]^{(t)}) q(θ[p]^{(t−1)} | θ[p]^{(t)}) / [π(θ[p]^{(t−1)}) q(θ[p]^{(t)} | θ[p]^{(t−1)})]}   (8)
                              = 1   (9)

where (9) follows from (8) using (7).
We consider the decision problem [Θ, D, π(θ), L(θ, d)]. Relative to distribution π the expected loss is minimised by the Bayes rule. With π(θ) = f(θ), the Bayes rule of the immediate decision is d* = E^{−1}(θ^{−1}). Now for θ ∼ Gamma(α, β),

  E(θ^{−1}) = ∫₀^∞ θ^{−1} (β^α/Γ(α)) θ^{α−1} e^{−βθ} dθ
            = (Γ(α − 1)/Γ(α)) (β^α/β^{α−1}) ∫₀^∞ (β^{α−1}/Γ(α − 1)) θ^{α−2} e^{−βθ} dθ   (2)
            = (Γ(α − 1)/Γ(α)) (β^α/β^{α−1}) = β/(α − 1).

Notice that in (2) we assume that we are integrating over the density of a Gamma(α − 1, β) distribution. This assumption requires α > 1. The Bayes rule of the immediate decision is thus

  d* = E^{−1}(θ^{−1}) = (α − 1)/β.   (3)

We now consider the case when π(θ) = f(θ | x). From (1) the Bayes rule is d* = E^{−1}(θ^{−1} | X). Now, as θ ∼ Gamma(α, β) and Xi | θ ∼ Po(θ) then, see Question Sheet Three Exercise 1(b), θ | x ∼ Gamma(α + ∑_{i=1}^n xi, β + n). We exploit this conjugacy to observe that the Bayes rule after sampling can be obtained from the Bayes rule of the immediate decision by substituting α + ∑_{i=1}^n xi for α and β + n for β in (3) to obtain

  d* = (α + ∑_{i=1}^n xi − 1)/(β + n).
  L(θ, d) = (θ − d)²/θ³.

(a) Find the Bayes rule and Bayes risk of an immediate decision.

We consider the decision problem [Θ, D, π(θ), L(θ, d)]. Relative to distribution π the expected loss is

  E_{(π)}{L(θ, d)} = E_{(π)}(θ^{−1}) − 2d E_{(π)}(θ^{−2}) + d² E_{(π)}(θ^{−3}),

which is minimised when −2E_{(π)}(θ^{−2}) + 2d E_{(π)}(θ^{−3}) = 0, giving Bayes rule

  d* = E_{(π)}(θ^{−2})/E_{(π)}(θ^{−3})   (4)

with corresponding Bayes risk

  ρ*(π) = E_{(π)}(θ^{−1}) − E_{(π)}(θ^{−2}) d*.   (5)
We now consider the immediate decision by solving the decision problem [Θ, D, f(θ), L(θ, d)]. As θ ∼ Gamma(α, β) then

  E(θ^k) = ∫₀^∞ θ^k (β^α/Γ(α)) θ^{α−1} e^{−βθ} dθ
         = (Γ(α + k)/Γ(α)) (β^α/β^{α+k}) ∫₀^∞ (β^{α+k}/Γ(α + k)) θ^{α+k−1} e^{−βθ} dθ   (6)
         = Γ(α + k)/(Γ(α)β^k)

provided that α + k > 0, so that we have integrated over a Gamma(α + k, β) density at (6). Notice that we require k ∈ {−3, −2, −1} and as α > 3 then for k ∈ {−3, −2, −1} we do have α + k > 0. Thus,

  E(θ^{−1}) = Γ(α − 1)/(Γ(α)β^{−1}) = β/(α − 1);
  E(θ^{−2}) = Γ(α − 2)/(Γ(α)β^{−2}) = β²/((α − 1)(α − 2));
  E(θ^{−3}) = Γ(α − 3)/(Γ(α)β^{−3}) = β³/((α − 1)(α − 2)(α − 3)).

So, from (4), the Bayes rule of the immediate decision is

  d* = E(θ^{−2})/E(θ^{−3}) = (α − 3)/β   (7)

with, from (5), corresponding Bayes risk

  ρ*(f(θ)) = E(θ^{−1}) − E(θ^{−2}) d*
           = β/(α − 1) − (β²/((α − 1)(α − 2)))((α − 3)/β) = β/((α − 1)(α − 2)).   (8)
We solve the decision problem [Θ, D, f(θ | x), L(θ, d)]. As Xi | θ ∼ Exp(θ) then

  f(x | θ) = ∏_{i=1}^n f(xi | θ) ∝ θⁿ exp{−(∑_{i=1}^n xi)θ}

so that θ | x ∼ Gamma(α + n, β + ∑_{i=1}^n xi). Substituting α + n for α and β + ∑_{i=1}^n xi for β in (7) and (8) gives

  d* = (α + n − 3)/(β + ∑_{i=1}^n xi)   (9)

as the Bayes rule after observing a sample of size n, with corresponding Bayes risk

  ρ*(f(θ | x)) = (β + ∑_{i=1}^n xi)/((α + n − 1)(α + n − 2)).   (10)
(c) Find the Bayes risk of the sampling procedure.
We consider the decision problem [Ω, D, π(ω), L(ω, d)]. Relative to distribution π
the expected loss is
We now consider the immediate decision by solving the decision problem [Ω, D, f(ω), L(ω, d)]. As ω ∼ Beta(α, β) then

  E(ω^t) = ∫₀¹ ω^t (Γ(α + β)/(Γ(α)Γ(β))) ω^{α−1} (1 − ω)^{β−1} dω
         = (Γ(α + t)Γ(α + β)/(Γ(α)Γ(α + β + t))) ∫₀¹ (Γ(α + β + t)/(Γ(α + t)Γ(β))) ω^{α+t−1} (1 − ω)^{β−1} dω   (15)
         = Γ(α + t)Γ(α + β)/(Γ(α)Γ(α + β + t)).

Hence,

  E(ω) = Γ(α + 1)Γ(α + β)/(Γ(α)Γ(α + β + 1)) = α/(α + β);
  E(ω^{−1}) = Γ(α − 1)Γ(α + β)/(Γ(α)Γ(α + β − 1)) = (α + β − 1)/(α − 1)

provided α > 1. So, from (13), the Bayes rule of the immediate decision is

  d* = E^{−1}(ω^{−1})/2 = (α − 1)/(2(α + β − 1))   (16)

with corresponding Bayes risk

  ρ*(f(ω)) = E(ω) − E^{−1}(ω^{−1})/2
           = α/(α + β) − (α − 1)/(2(α + β − 1)) = (α(α + β − 1) + β)/(2(α + β)(α + β − 1)).   (17)
(b) Suppose that we toss the coin n times before making our decision, and observe k heads and n − k tails. Find the new Bayes rule and Bayes risk.

We solve the decision problem [Ω, D, f(ω | k), L(ω, d)]. As ω ∼ Beta(α, β) and K | ω ∼ Bin(n, ω) then ω | k ∼ Beta(α + k, β + n − k). We can exploit this conjugacy to observe that the Bayes rule and Bayes risk after sampling can be found from substituting α + k for α and β + n − k for β in (16) and (17) to obtain

  d* = (α + k − 1)/(2(α + β + n − 1))   (18)

as the Bayes rule after observing a sample of k heads and n − k tails, with corresponding Bayes risk

  ρ*(f(ω | k)) = ((α + k)(α + β + n − 1) + (β + n − k))/(2(α + β + n)(α + β + n − 1)).   (19)
(c) Find the Bayes risk of the sampling procedure of tossing the coin n times and then making a decision when α = β = 2.

The Bayes rule, viewed as a random quantity prior to sampling, is

  δ* = (α + K − 1)/(2(α + β + n − 1)).

The risk of the sampling procedure, ρ*_n, is the expected value of (19) when viewed as a random quantity,

  ρ*_n = E{((α + K)(α + β + n − 1) + (β + n − K))/(2(α + β + n)(α + β + n − 1))}
       = ({α + E(K)}(α + β + n − 1) + {β + n − E(K)})/(2(α + β + n)(α + β + n − 1)).   (20)

We utilise the tower property of expectations, see Question Sheet One Exercise 5, to find E(K): E(K) = E(E(K | ω)) = nE(ω) = nα/(α + β). We have that

  α + E(K) = α + nα/(α + β) = α(α + β + n)/(α + β);   (21)
  β + n − E(K) = β + n − nα/(α + β) = β(α + β + n)/(α + β).   (22)

Substituting (21) and (22) into (20) gives ρ*_n = (α(α + β + n − 1) + β)/(2(α + β)(α + β + n − 1)) so that, with α = β = 2,

  ρ*_n = (2(n + 3) + 2)/(2 × 4 × (n + 3)) = (n + 4)/(4(n + 3)).
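The closed form can be checked by simulation; the R sketch below (not part of the original solution) averages the posterior Bayes risk (19) over draws of K for α = β = 2 and a hypothetical n = 10.

  set.seed(1)
  a <- 2; b <- 2; n <- 10
  omega <- rbeta(1e5, a, b)                      # omega ~ Beta(2, 2)
  K <- rbinom(1e5, n, omega)                     # K | omega ~ Bin(n, omega)
  risk <- ((a + K) * (a + b + n - 1) + (b + n - K)) /
    (2 * (a + b + n) * (a + b + n - 1))          # posterior Bayes risk (19)
  c(mean(risk), (n + 4) / (4 * (n + 3)))         # both approximately 0.269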