MA40189 2016-2017 Problem Sheet 3 - Solutions


MA40189 - Solution Sheet Three


Simon Shaw, [email protected]
http://people.bath.ac.uk/masss/ma40189.html

2016/17 Semester II

1. Let $X_1, \ldots, X_n$ be conditionally independent given $\theta$, so $f(x \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$
   where $x = (x_1, \ldots, x_n)$, with each $X_i \mid \theta \sim Po(\theta)$. Suppose we judge that
   $\theta \sim Gamma(\alpha, \beta)$.

(a) Find the distribution of θ | x.

As $\theta \sim Gamma(\alpha, \beta)$ then
\[
f(\theta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\theta^{\alpha-1}e^{-\beta\theta} \qquad (1)
\]
whilst as $X_i \mid \theta \sim Po(\theta)$
\[
f(x \mid \theta) = \prod_{i=1}^{n}\frac{\theta^{x_i}e^{-\theta}}{x_i!} = \frac{\theta^{n\bar{x}}e^{-n\theta}}{\prod_{i=1}^{n}x_i!}. \qquad (2)
\]
From (1) and (2) the posterior density is
\begin{eqnarray*}
f(\theta \mid x) & \propto & \frac{\theta^{n\bar{x}}e^{-n\theta}}{\prod_{i=1}^{n}x_i!} \times \frac{\beta^{\alpha}}{\Gamma(\alpha)}\theta^{\alpha-1}e^{-\beta\theta} \\
& \propto & \theta^{(\alpha+n\bar{x})-1}e^{-(\beta+n)\theta}
\end{eqnarray*}
which is a kernel of a $Gamma(\alpha+n\bar{x}, \beta+n)$ density, so $\theta \mid x \sim Gamma(\alpha+n\bar{x}, \beta+n)$.
(b) Show that the posterior mean can be written as a weighted average of
    the prior mean, denoted $\theta_0$, and the maximum likelihood estimate of
    $\theta$, $\bar{x}$.

    As $\theta \mid x \sim Gamma(\alpha+n\bar{x}, \beta+n)$,
    \[
    E(\theta \mid x) = \frac{\alpha+n\bar{x}}{\beta+n} = \frac{\beta\left(\frac{\alpha}{\beta}\right)+n\bar{x}}{\beta+n} = \lambda\theta_0 + (1-\lambda)\bar{x}
    \]
    where $\lambda = \frac{\beta}{\beta+n}$ and $\theta_0 = \frac{\alpha}{\beta} = E(\theta)$ as $\theta \sim Gamma(\alpha, \beta)$. Thus, the posterior
    mean is a weighted average of the prior mean and the maximum likelihood estimate.


(c) Let Z be a future (unobserved) observation. Find the mean and variance
    of the predictive distribution Z | x. You may assume that X and
    Z are conditionally independent given $\theta$ and that $Z \mid \theta \sim Po(\theta)$.

    We do not need to find the full predictive distribution. As X and Z are conditionally
    independent given $\theta$ we can use the expectation and variance formulas
    given in lectures, so
    \begin{eqnarray*}
    E(Z \mid X) & = & E(E(Z \mid \theta) \mid X) \\
    & = & E(\theta \mid X) \qquad (3) \\
    & = & \frac{\alpha+n\bar{x}}{\beta+n} \qquad (4)
    \end{eqnarray*}
    where (3) follows as $Z \mid \theta \sim Po(\theta)$ and (4) as $\theta \mid x \sim Gamma(\alpha+n\bar{x}, \beta+n)$.
    Similarly,
    \begin{eqnarray*}
    Var(Z \mid X) & = & Var(E(Z \mid \theta) \mid X) + E(Var(Z \mid \theta) \mid X) \\
    & = & Var(\theta \mid X) + E(\theta \mid X) \\
    & = & \frac{\alpha+n\bar{x}}{(\beta+n)^2} + \frac{\alpha+n\bar{x}}{\beta+n} \;=\; \frac{(\alpha+n\bar{x})(\beta+n+1)}{(\beta+n)^2}. \qquad (5)
    \end{eqnarray*}

(d) The data in the table below are the number of fatal accidents on scheduled
    airline flights between 1976 and 1985.

    Year:      1976 1977 1978 1979 1980 1981 1982 1983 1984 1985
    Accidents:   24   25   31   31   22   21   26   20   16   22

    Let Z be the number of fatal accidents in 1986 and $X_i$ the number
    of fatal accidents in 1975+i. Adopting the model given above, with
    $\alpha = 101$ and $\beta = 5$ and assuming that Z | x may be approximated by a
    $N(E(Z \mid X), Var(Z \mid X))$, find an approximate 95% prediction interval for
    the number of accidents in 1986, that is an interval $(z_L, z_U)$ such that
    $P(z_L < Z < z_U \mid x) = 0.95$.

    Taking $\alpha = 101$ and $\beta = 5$ we observe $\sum_{i=1}^{10}x_i = 238$ with $n = 10$. Hence,
    $\alpha + n\bar{x} = 101 + 238 = 339$ and $\beta + n = 5 + 10 = 15$ so that $\theta \mid x \sim Gamma(339, 15)$.
    From (4) and (5) we have $E(Z \mid X) = \frac{339}{15} = \frac{113}{5}$ and $Var(Z \mid X) = \frac{339 \times 16}{15^2} = \frac{1808}{75}$,
    so we use $Z \mid x \sim N(\frac{113}{5}, \frac{1808}{75})$ approximately. Recalling that if $Y \sim N(0, 1)$ then
    $P(-1.96 < Y < 1.96) = 0.95$, the (approximate) 95% prediction interval is
    \[
    \frac{113}{5} \pm 1.96\sqrt{\frac{1808}{75}} = (12.98, 32.22).
    \]
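The numbers in part (d) can be reproduced with a short R sketch; this is illustrative only (it is not part of the original solution) and simply evaluates the Gamma-Poisson update and the normal approximation above.

    # Illustrative sketch: Gamma-Poisson update and approximate prediction interval
    x <- c(24, 25, 31, 31, 22, 21, 26, 20, 16, 22)   # accidents 1976-1985
    alpha <- 101; beta <- 5; n <- length(x)
    a.post <- alpha + sum(x)                          # 339
    b.post <- beta + n                                # 15
    pred.mean <- a.post / b.post                      # 113/5 = 22.6
    pred.var  <- a.post * (b.post + 1) / b.post^2     # 1808/75
    pred.mean + c(-1, 1) * 1.96 * sqrt(pred.var)      # approximately (12.98, 32.22)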
2. Let $X_1, \ldots, X_n$ be a finitely exchangeable sequence of random quantities and
   consider any $i \neq j \in \{1, \ldots, n\}$.
   (a) Explain why the marginal distribution of any $X_i$ does not depend upon
       i so that $E(X_i)$ and $Var(X_i)$ do not depend upon i.

If $X_1, \ldots, X_n$ are finitely exchangeable then their joint density satisfies
\[
f(x_1, \ldots, x_n) = f(x_{\pi(1)}, \ldots, x_{\pi(n)})
\]
for all permutations $\pi$ defined on the set $\{1, \ldots, n\}$. For a given permutation there
exists $j \in \{1, \ldots, n\}$ such that $x_{\pi(j)} = x_i$. Let $x_{-i}$ denote the set $\{x_1, \ldots, x_n\}\setminus x_i$.
The marginal distribution of $X_i$ is given by
\begin{eqnarray*}
f_{X_i}(x_i) & = & \int_{x_{-i}} f(x_1, \ldots, x_n)\,dx_{-i} \\
& = & \int_{x_{-i}} f(x_{\pi(1)}, \ldots, x_{\pi(j-1)}, x_{\pi(j)}, x_{\pi(j+1)}, \ldots, x_{\pi(n)})\,dx_{-i} \\
& = & \int_{x_{-i}} f(x_{\pi(1)}, \ldots, x_{\pi(j-1)}, x_i, x_{\pi(j+1)}, \ldots, x_{\pi(n)})\,dx_{-i} \\
& = & f_{X_j}(x_i).
\end{eqnarray*}
As this holds for all permutations then it holds for all $j \in \{1, \ldots, n\}$. Thus, the
marginal distribution of $X_i$ does not depend upon i. As $E(X_i)$ and $Var(X_i)$ are
constructed over the marginal of $X_i$, it follows immediately that these also
do not depend upon i.
(b) Explain why the joint distribution of any $X_i, X_j$ does not depend upon
    either i or j so that $Cov(X_i, X_j)$ does not depend on either i or j.

    For $i \neq j$, let $x_{-i,-j}$ denote the set $\{x_1, \ldots, x_n\}\setminus\{x_i, x_j\}$. For a given permutation
    there exists $k \neq l \in \{1, \ldots, n\}$ such that $x_{\pi(k)} = x_i$ and $x_{\pi(l)} = x_j$. Suppose
    without loss of generality that $k < l$. The joint distribution of $X_i$ and $X_j$ is given by
    \begin{eqnarray*}
    f_{X_i,X_j}(x_i, x_j) & = & \int_{x_{-i,-j}} f(x_1, \ldots, x_n)\,dx_{-i,-j} \\
    & = & \int_{x_{-i,-j}} f(x_{\pi(1)}, \ldots, x_{\pi(k-1)}, x_{\pi(k)}, x_{\pi(k+1)}, \ldots, x_{\pi(l-1)}, x_{\pi(l)}, x_{\pi(l+1)}, \ldots, x_{\pi(n)})\,dx_{-i,-j} \\
    & = & \int_{x_{-i,-j}} f(x_{\pi(1)}, \ldots, x_{\pi(k-1)}, x_i, x_{\pi(k+1)}, \ldots, x_{\pi(l-1)}, x_j, x_{\pi(l+1)}, \ldots, x_{\pi(n)})\,dx_{-i,-j} \\
    & = & f_{X_k,X_l}(x_i, x_j).
    \end{eqnarray*}
    As this holds for all permutations then it holds for all $k \neq l \in \{1, \ldots, n\}$. Thus,
    the joint distribution of $X_i$ and $X_j$ does not depend upon $i \neq j$. As $Cov(X_i, X_j)$
    is constructed over the joint distribution of $X_i$ and $X_j$, it follows immediately
    that $Cov(X_i, X_j)$ does not depend on either i or j.
(c) Let $Y = \sum_{i=1}^{n}X_i$. By considering $Var(Y) \geq 0$, or otherwise, show that
    \[
    Corr(X_i, X_j) = \frac{Cov(X_i, X_j)}{\sqrt{Var(X_i)Var(X_j)}} \geq \frac{-1}{n-1}.
    \]

    Taking the variance of $Y = \sum_{i=1}^{n}X_i$ we have
    \[
    Var(Y) = Var\left(\sum_{i=1}^{n}X_i\right) = \sum_{i=1}^{n}Var(X_i) + \sum_{i=1}^{n}\sum_{j \neq i}Cov(X_i, X_j).
    \]
    Now, from part (a), $Var(X_i)$ does not depend upon i and, from part (b), $Cov(X_i, X_j)$
    does not depend upon either i or j. Thus,
    \[
    Var(Y) = nVar(X_i) + n(n-1)Cov(X_i, X_j).
    \]
    As $Var(Y) \geq 0$ it follows that $nVar(X_i) + n(n-1)Cov(X_i, X_j) \geq 0$. Rearranging
    this inequality gives
    \[
    \frac{Cov(X_i, X_j)}{Var(X_i)} \geq \frac{-1}{n-1}.
    \]
    As $Var(X_i)$ does not depend upon i then $Var(X_i) = \sqrt{Var(X_i)Var(X_j)}$ so that
    \[
    Corr(X_i, X_j) = \frac{Cov(X_i, X_j)}{\sqrt{Var(X_i)Var(X_j)}} \geq \frac{-1}{n-1}.
    \]
    Notice that an immediate consequence of this result is that if $X_1, X_2, \ldots$ are
    infinitely exchangeable then $Corr(X_i, X_j) \geq 0$ as we require $Corr(X_i, X_j) \geq \frac{-1}{n-1}$
    for all n.
3. Suppose $X \mid \mu \sim N(\mu, \sigma^2)$ and $Y \mid \mu, \delta \sim N(\mu+\delta, \sigma^2)$, where $\sigma^2$ is known and X
   and Y are conditionally independent given $\mu$ and $\delta$.
   (a) Find the joint distribution of X and Y given $\mu$ and $\delta$.

   As X and Y are conditionally independent given $\mu$ and $\delta$ then
   \begin{eqnarray*}
   f(x, y \mid \mu, \delta) & = & f(x \mid \mu, \delta)f(y \mid \mu, \delta) \\
   & = & f(x \mid \mu)f(y \mid \mu, \delta) \\
   & = & \frac{1}{2\pi\sigma^2}\exp\left\{-\frac{1}{2\sigma^2}\left[(x-\mu)^2 + (y-\mu-\delta)^2\right]\right\}.
   \end{eqnarray*}

(b) Consider the improper noninformative joint prior distribution
    \[
    f(\mu, \delta) \propto 1.
    \]
    Find, up to a constant of proportionality, the joint posterior distribution
    of $\mu$ and $\delta$ given x and y. Are $\mu \mid x, y$ and $\delta \mid x, y$ independent?

    As $f(\mu, \delta) \propto 1$ then
    \begin{eqnarray*}
    f(\mu, \delta \mid x, y) & \propto & f(x, y \mid \mu, \delta) \\
    & \propto & \exp\left\{-\frac{1}{2\sigma^2}\left[(x-\mu)^2 + (y-\mu-\delta)^2\right]\right\} \\
    & \propto & \exp\left\{-\frac{1}{2\sigma^2}\left[\mu^2 - 2x\mu + (\mu+\delta)^2 - 2y(\mu+\delta)\right]\right\} \\
    & = & \exp\left\{-\frac{1}{2\sigma^2}\left[2\mu^2 - 2(x+y)\mu + 2\mu\delta + \delta^2 - 2y\delta\right]\right\}. \qquad (6)
    \end{eqnarray*}
    Thus, as a consequence of the $\mu\delta$ term in (6), we cannot write $f(\mu, \delta \mid x, y) \propto
    g(\mu \mid x, y)h(\delta \mid x, y)$ so that $\mu \mid x, y$ and $\delta \mid x, y$ are not independent.


Note that the exponent of $f(\mu, \delta \mid x, y)$ is quadratic in $\mu$ and $\delta$ so that
\begin{eqnarray*}
f(\mu, \delta \mid x, y) & \propto & \exp\left\{-\frac{1}{2\sigma^2}\left(\begin{array}{cc}\mu-x & \delta-(y-x)\end{array}\right)\left(\begin{array}{cc}2 & 1 \\ 1 & 1\end{array}\right)\left(\begin{array}{c}\mu-x \\ \delta-(y-x)\end{array}\right)\right\} \\
& = & \exp\left\{-\frac{1}{2\sigma^2}\left(\begin{array}{cc}\mu-x & \delta-(y-x)\end{array}\right)\left(\begin{array}{cc}1 & -1 \\ -1 & 2\end{array}\right)^{-1}\left(\begin{array}{c}\mu-x \\ \delta-(y-x)\end{array}\right)\right\}
\end{eqnarray*}
which is the kernel of a bivariate Normal distribution. We have
\[
\left(\begin{array}{c}\mu \\ \delta\end{array}\right)\Big|\,x, y \;\sim\; N_2\left(\left(\begin{array}{c}x \\ y-x\end{array}\right), \left(\begin{array}{cc}\sigma^2 & -\sigma^2 \\ -\sigma^2 & 2\sigma^2\end{array}\right)\right).
\]
Using the properties of multivariate Normal distributions, we can read off the
marginal distributions, so $\mu \mid x, y \sim N(x, \sigma^2)$ and $\delta \mid x, y \sim N(y-x, 2\sigma^2)$. We
shall derive these marginal distributions directly in parts (c) and (d) as these
give insights into the types of techniques used when the marginals are not so
conventional to obtain.
(c) Find the marginal posterior distribution $f(\delta \mid x, y)$.

    \begin{eqnarray*}
    f(\delta \mid x, y) & = & \int_{-\infty}^{\infty} f(\mu, \delta \mid x, y)\,d\mu \\
    & \propto & \int_{-\infty}^{\infty}\exp\left\{-\frac{1}{2\sigma^2}\left[2\mu^2 - 2(x+y)\mu + 2\mu\delta + \delta^2 - 2y\delta\right]\right\}d\mu \\
    & = & \exp\left\{-\frac{1}{2\sigma^2}\left[\delta^2 - 2y\delta\right]\right\}\int_{-\infty}^{\infty}\exp\left\{-\frac{1}{2\sigma^2}\left[2\mu^2 - 2(x+y-\delta)\mu\right]\right\}d\mu \\
    & = & \exp\left\{-\frac{1}{2\sigma^2}\left[\delta^2 - 2y\delta\right]\right\}\int_{-\infty}^{\infty}\exp\left\{-\frac{1}{\sigma^2}\left[\left(\mu - \frac{x+y-\delta}{2}\right)^2 - \frac{(x+y-\delta)^2}{4}\right]\right\}d\mu \\
    & = & \exp\left\{-\frac{1}{4\sigma^2}\left[2\delta^2 - 4y\delta - (x+y-\delta)^2\right]\right\}\int_{-\infty}^{\infty}\exp\left\{-\frac{1}{\sigma^2}\left(\mu - \frac{x+y-\delta}{2}\right)^2\right\}d\mu \\
    & \propto & \exp\left\{-\frac{1}{4\sigma^2}\left[2\delta^2 - 4y\delta - (x+y-\delta)^2\right]\right\} \\
    & \propto & \exp\left\{-\frac{1}{4\sigma^2}\left[\delta^2 - 2(2y-(x+y))\delta\right]\right\}
    \end{eqnarray*}
    which is a kernel of a $N(y-x, 2\sigma^2)$ density. Hence, $\delta \mid x, y \sim N(y-x, 2\sigma^2)$.


(d) Find the marginal posterior distribution $f(\mu \mid x, y)$.

    \begin{eqnarray*}
    f(\mu \mid x, y) & = & \int_{-\infty}^{\infty} f(\mu, \delta \mid x, y)\,d\delta \\
    & \propto & \int_{-\infty}^{\infty}\exp\left\{-\frac{1}{2\sigma^2}\left[2\mu^2 - 2(x+y)\mu + 2\mu\delta + \delta^2 - 2y\delta\right]\right\}d\delta \\
    & = & \exp\left\{-\frac{1}{2\sigma^2}\left[2\mu^2 - 2(x+y)\mu\right]\right\}\int_{-\infty}^{\infty}\exp\left\{-\frac{1}{2\sigma^2}\left[\delta^2 - 2(y-\mu)\delta\right]\right\}d\delta \\
    & = & \exp\left\{-\frac{1}{2\sigma^2}\left[2\mu^2 - 2(x+y)\mu\right]\right\}\int_{-\infty}^{\infty}\exp\left\{-\frac{1}{2\sigma^2}\left[\delta - (y-\mu)\right]^2 + \frac{1}{2\sigma^2}(y-\mu)^2\right\}d\delta \\
    & = & \exp\left\{-\frac{1}{2\sigma^2}\left[2\mu^2 - 2(x+y)\mu - (y-\mu)^2\right]\right\}\int_{-\infty}^{\infty}\exp\left\{-\frac{1}{2\sigma^2}\left[\delta - (y-\mu)\right]^2\right\}d\delta \\
    & \propto & \exp\left\{-\frac{1}{2\sigma^2}\left[2\mu^2 - 2(x+y)\mu - (y-\mu)^2\right]\right\} \\
    & \propto & \exp\left\{-\frac{1}{2\sigma^2}\left[\mu^2 - 2(x+y-y)\mu\right]\right\}
    \end{eqnarray*}
    which is a kernel of a $N(x, \sigma^2)$ density. Hence, $\mu \mid x, y \sim N(x, \sigma^2)$.
(e) Consider a future observation Z where $Z \mid \mu, \delta \sim N(\mu-\delta, \sigma^2)$ and Z is
    conditionally independent of X and Y given $\mu$ and $\delta$. Find the predictive
    distribution $f(z \mid x, y)$.

    Notice that
    \[
    f(z \mid \mu, \delta) \propto \exp\left\{-\frac{1}{2\sigma^2}\left[z^2 - 2(\mu-\delta)z + (\mu-\delta)^2\right]\right\}
    \]
    so that, as Z is conditionally independent of X and Y given $\mu$ and $\delta$ and using (6),
    \begin{eqnarray*}
    f(z \mid x, y) & = & \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(z \mid \mu, \delta)f(\mu, \delta \mid x, y)\,d\mu\,d\delta \\
    & \propto & \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\exp\left\{-\frac{1}{2\sigma^2}\left[z^2 - 2(\mu-\delta)z + (\mu-\delta)^2\right]\right\} \times \\
    & & \qquad\qquad \exp\left\{-\frac{1}{2\sigma^2}\left[2\mu^2 - 2(x+y)\mu + 2\mu\delta + \delta^2 - 2y\delta\right]\right\}d\mu\,d\delta \\
    & = & \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\exp\left\{-\frac{1}{2\sigma^2}\left[z^2 + 3\mu^2 - 2(x+y+z)\mu + 2\delta^2 - 2(y-z)\delta\right]\right\}d\mu\,d\delta \\
    & = & \exp\left\{-\frac{1}{2\sigma^2}z^2\right\}\int_{-\infty}^{\infty}\exp\left\{-\frac{1}{2\sigma^2}\left[3\mu^2 - 2(x+y+z)\mu\right]\right\}d\mu \\
    & & \qquad\qquad \times \int_{-\infty}^{\infty}\exp\left\{-\frac{1}{2\sigma^2}\left[2\delta^2 - 2(y-z)\delta\right]\right\}d\delta \\
    & = & \exp\left\{-\frac{1}{2\sigma^2}z^2\right\}\int_{-\infty}^{\infty}\exp\left\{-\frac{3}{2\sigma^2}\left[\left(\mu - \frac{x+y+z}{3}\right)^2 - \frac{(x+y+z)^2}{9}\right]\right\}d\mu \\
    & & \qquad\qquad \times \int_{-\infty}^{\infty}\exp\left\{-\frac{1}{\sigma^2}\left[\left(\delta - \frac{y-z}{2}\right)^2 - \frac{(y-z)^2}{4}\right]\right\}d\delta \\
    & \propto & \exp\left\{-\frac{1}{2\sigma^2}z^2\right\}\exp\left\{\frac{3}{2\sigma^2}\frac{(x+y+z)^2}{9}\right\}\exp\left\{\frac{1}{\sigma^2}\frac{(y-z)^2}{4}\right\} \\
    & \propto & \exp\left\{-\frac{1}{12\sigma^2}\left[z^2 - 4(x+y)z + 6yz\right]\right\} \\
    & = & \exp\left\{-\frac{1}{12\sigma^2}\left[z^2 - 2(2x-y)z\right]\right\}
    \end{eqnarray*}
    which is a kernel of a $N(2x-y, 6\sigma^2)$ density. Hence, $Z \mid x, y \sim N(2x-y, 6\sigma^2)$.
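A quick Monte Carlo check of this predictive result can be run in R; the sketch below is illustrative only (the data values x, y and sigma are made up) and simulates from the bivariate posterior of part (b) and then from Z | µ, δ.

    # Illustrative sketch: simulate from the posterior and predictive and check N(2x - y, 6*sigma^2)
    set.seed(1)
    x <- 1; y <- 3; sigma <- 2; N <- 1e5
    mu    <- rnorm(N, mean = x, sd = sigma)                    # mu | x, y ~ N(x, sigma^2)
    delta <- rnorm(N, mean = (y - x) - (mu - x), sd = sigma)   # delta | mu, x, y from the bivariate normal
    z     <- rnorm(N, mean = mu - delta, sd = sigma)           # Z | mu, delta ~ N(mu - delta, sigma^2)
    c(mean(z), var(z))                                         # close to 2x - y = -1 and 6*sigma^2 = 24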

4. Let $X_1, \ldots, X_n$ be exchangeable so that the $X_i$ are conditionally independent
   given a parameter $\theta$. Suppose that each $X_i \mid \theta \sim U(0, \theta)$.
   (a) Show that $M = \max_i(X_i)$ is a sufficient statistic for $X = (X_1, \ldots, X_n)$.

   As $X_i \mid \theta \sim U(0, \theta)$ then
   \[
   f(x_i \mid \theta) = \left\{\begin{array}{cl}\frac{1}{\theta} & 0 \leq x_i \leq \theta; \\ 0 & \mbox{otherwise.}\end{array}\right.
   \]
   Let $I_{(0,\theta)}(x)$ denote the indicator function for the event $0 \leq x \leq \theta$, so $I_{(0,\theta)}(x) = 1$
   if $0 \leq x \leq \theta$ and 0 otherwise. Then we can write $f(x_i \mid \theta) = \frac{1}{\theta}I_{(0,\theta)}(x_i)$ so that
   \begin{eqnarray*}
   f(x \mid \theta) & = & \prod_{i=1}^{n}\frac{1}{\theta}I_{(0,\theta)}(x_i) \\
   & = & \frac{1}{\theta^n}\prod_{i=1}^{n}I_{(0,\theta)}(x_i) \\
   & = & \frac{1}{\theta^n}I_{(0,\theta)}(m)
   \end{eqnarray*}
   where $m = \max_i(x_i)$. Thus, $f(x \mid \theta) = g(\theta, m)h(x)$ where $g(\theta, m) = \frac{1}{\theta^n}I_{(0,\theta)}(m)$
   and $h(x) = 1$ so that $M = \max_i(X_i)$ is sufficient for $X = (X_1, \ldots, X_n)$ for
   learning about $\theta$.
(b) Show that the Pareto distribution, $\theta \sim Pareto(a, b)$,
    \[
    f(\theta) = \frac{ab^a}{\theta^{a+1}}, \quad \theta \geq b,
    \]
    is a conjugate prior distribution.

    Let $I_{\{\theta \geq c\}}$ denote the indicator function for the event $\theta \geq c$, so $I_{\{\theta \geq c\}} = 1$ if
    $\theta \geq c$ and 0 otherwise. Then, viewing the likelihood as a function of $\theta$ rather than
    $m = \max_i(x_i)$, we may write the likelihood as
    \[
    f(x \mid \theta) = \frac{1}{\theta^n}I_{\{\theta \geq m\}}
    \]
    whilst the prior may be expressed as
    \[
    f(\theta) = \frac{ab^a}{\theta^{a+1}}I_{\{\theta \geq b\}}.
    \]
    The posterior is
    \begin{eqnarray*}
    f(\theta \mid x) & \propto & \frac{1}{\theta^n}I_{\{\theta \geq m\}} \times \frac{ab^a}{\theta^{a+1}}I_{\{\theta \geq b\}} \\
    & \propto & \frac{1}{\theta^{a+n+1}}I_{\{\theta \geq m\}}I_{\{\theta \geq b\}} \\
    & = & \frac{1}{\theta^{a+n+1}}I_{\{\theta \geq \max(b,m)\}}
    \end{eqnarray*}
    which is a kernel of a $Pareto(a+n, \max(b, m))$ density so that $\theta \mid x \sim Pareto(a+n, \max(b, m))$.
    Thus, the prior and posterior are in the same family and so, relative
    to the $U(0, \theta)$ likelihood, the Pareto distribution is a conjugate family.
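As an illustration of this update rule (not part of the original solution), a minimal R sketch with made-up data and hypothetical hyperparameters a and b is:

    # Illustrative sketch: Uniform-Pareto conjugate update
    x <- c(2.1, 4.7, 3.3, 0.8)     # hypothetical observations
    a <- 2; b <- 3                 # hypothetical Pareto(a, b) prior
    a.post <- a + length(x)        # posterior shape a + n
    b.post <- max(b, max(x))       # posterior scale max(b, m)
    c(a.post, b.post)              # posterior is Pareto(6, 4.7)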


MA40189 2016-2017 Problem Sheet 4 - Solutions

Topics in Bayesian statistics (University of Bath)


MA40189 - Solution Sheet Four


Simon Shaw, [email protected]
http://people.bath.ac.uk/masss/ma40189.html

2016/17 Semester II

1. Let $X_1, \ldots, X_n$ be exchangeable so that the $X_i$ are conditionally independent
   given a parameter $\theta$.
   (a) Let $X_i \mid \theta \sim Bern(\theta)$.
       i. Show that $f(x_i \mid \theta)$ belongs to the 1-parameter exponential family
          and for $X = (X_1, \ldots, X_n)$ state the sufficient statistic for learning
          about $\theta$.

          Notice that we can write
          \begin{eqnarray*}
          f(x_i \mid \theta) & = & \theta^{x_i}(1-\theta)^{1-x_i} \\
          & = & \exp\left\{\left(\log\frac{\theta}{1-\theta}\right)x_i + \log(1-\theta)\right\}
          \end{eqnarray*}
          so that $f(x_i \mid \theta)$ belongs to the 1-parameter exponential family with $\phi_1(\theta) = \log\frac{\theta}{1-\theta}$,
          $u_1(x_i) = x_i$, $g(\theta) = \log(1-\theta)$ and $h(x_i) = 0$. Notice that, from
          Proposition 1 (see Lecture 11), $t_n = [n, \sum_{i=1}^{n}X_i]$ is a sufficient statistic.
       ii. By viewing the likelihood as a function of $\theta$, which generic family
           of distributions (over $\theta$) is the likelihood a kernel of?

           The likelihood, without expressing it in the explicit exponential family form, is
           \[
           f(x \mid \theta) = \theta^{n\bar{x}}(1-\theta)^{n-n\bar{x}}
           \]
           which, viewing it as a function of $\theta$, we immediately recognise as a Beta kernel
           (in particular, a $Beta(n\bar{x}+1, n-n\bar{x}+1)$).
       iii. By first finding the corresponding posterior distribution for $\theta$ given
            $x = (x_1, \ldots, x_n)$, show that this family of distributions is conjugate
            with respect to the likelihood $f(x \mid \theta)$.

            Taking $\theta \sim Beta(\alpha, \beta)$ we have that
            \begin{eqnarray*}
            f(\theta \mid x) & \propto & \theta^{n\bar{x}}(1-\theta)^{n-n\bar{x}} \times \theta^{\alpha-1}(1-\theta)^{\beta-1} \\
            & = & \theta^{\alpha+n\bar{x}-1}(1-\theta)^{\beta+n-n\bar{x}-1}
            \end{eqnarray*}
            so that $\theta \mid x \sim Beta(\alpha+n\bar{x}, \beta+n-n\bar{x})$. Thus, the prior and the posterior
            are in the same family giving conjugacy.

            Deriving the results directly from the exponential family representation

            Expressed in the 1-parameter exponential family form the likelihood is
            \[
            f(x \mid \theta) = \exp\left\{\left(\log\frac{\theta}{1-\theta}\right)\sum_{i=1}^{n}x_i + n\log(1-\theta)\right\}
            \]
            from which we immediately observe the sufficient statistic $t_n = [n, \sum_{i=1}^{n}x_i]$.
            Viewing $f(x \mid \theta)$ as a function of $\theta$ the natural conjugate prior is a member of
            the 2-parameter exponential family of the form
            \[
            f(\theta) = \exp\left\{a\log\frac{\theta}{1-\theta} + d\log(1-\theta) + c(a, d)\right\}
            \]
            where c(a, d) is the normalising constant. Hence,
            \begin{eqnarray*}
            f(\theta) & \propto & \exp\left\{a\log\frac{\theta}{1-\theta} + d\log(1-\theta)\right\} \\
            & = & \theta^a(1-\theta)^{d-a}
            \end{eqnarray*}
            which we recognise as a kernel of a Beta distribution. The convention is to
            label the hyperparameters as $\alpha$ and $\beta$ so that we put $\alpha = \alpha(a, d) = a+1$
            and $\beta = \beta(a, d) = d-a+1$ (equivalently, $a = a(\alpha, \beta) = \alpha-1$, $d = d(\alpha, \beta) = \beta+\alpha-2$).
            The conjugate prior distribution is $\theta \sim Beta(\alpha, \beta)$.
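A short R sketch (illustrative only; the data and the Beta(2, 2) prior are made up) shows the Beta-Bernoulli update above agreeing with a direct grid evaluation of likelihood times prior:

    # Illustrative sketch: Beta-Bernoulli conjugate update versus a numerical grid
    x <- c(1, 0, 1, 1, 0, 1, 1, 0)        # hypothetical Bernoulli data
    a0 <- 2; b0 <- 2                      # hypothetical Beta(2, 2) prior
    a.post <- a0 + sum(x)                 # alpha + n*xbar
    b.post <- b0 + length(x) - sum(x)     # beta + n - n*xbar
    theta <- seq(0.001, 0.999, by = 0.001)
    post.grid <- theta^sum(x) * (1 - theta)^(length(x) - sum(x)) * dbeta(theta, a0, b0)
    post.grid <- post.grid / sum(post.grid * 0.001)          # normalise numerically
    max(abs(post.grid - dbeta(theta, a.post, b.post)))       # should be small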
   (b) Let $X_i \mid \theta \sim N(\mu, \theta)$ with $\mu$ known.
       i. Show that $f(x_i \mid \theta)$ belongs to the 1-parameter exponential family
          and for $X = (X_1, \ldots, X_n)$ state the sufficient statistic for learning
          about $\theta$.

          Writing the Normal density in exponential family form (with parameter $\theta$, as $\mu$ is a
          known constant) we have
          \[
          f(x_i \mid \theta) = \exp\left\{-\frac{1}{2\theta}(x_i-\mu)^2 - \frac{1}{2}\log\theta - \frac{1}{2}\log 2\pi\right\}
          \]
          so that $f(x_i \mid \theta)$ belongs to the 1-parameter exponential family. The sufficient
          statistic is $t_n = [n, \sum_{i=1}^{n}(x_i-\mu)^2]$. Note that, expressed explicitly as a 1-parameter
          exponential family, the likelihood for $x = (x_1, \ldots, x_n)$ is
          \[
          f(x \mid \theta) = \exp\left\{-\frac{1}{2\theta}\sum_{i=1}^{n}(x_i-\mu)^2 - \frac{n}{2}\log\theta - n\log\sqrt{2\pi}\right\}
          \]
          so that the natural conjugate prior has the form
          \begin{eqnarray*}
          f(\theta) & = & \exp\left\{-a\frac{1}{\theta} - d\log\theta + c(a, d)\right\} \\
          & \propto & \theta^{-d}\exp\left\{-\frac{a}{\theta}\right\}
          \end{eqnarray*}
          which we recognise as a kernel of an Inverse-Gamma distribution.

       ii. By viewing the likelihood as a function of $\theta$, which generic family
           of distributions (over $\theta$) is the likelihood a kernel of?

           In conventional form,
           \[
           f(x \mid \theta) \propto \theta^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\theta}\sum_{i=1}^{n}(x_i-\mu)^2\right\}
           \]
           which, viewing $f(x \mid \theta)$ as a function of $\theta$, we recognise as a kernel of an
           Inverse-Gamma distribution (in particular, an Inv-gamma$(\frac{n-2}{2}, \frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^2)$).
       iii. By first finding the corresponding posterior distribution for $\theta$ given
            $x = (x_1, \ldots, x_n)$, show that this family of distributions is conjugate
            with respect to the likelihood $f(x \mid \theta)$.

            Taking $\theta \sim$ Inv-gamma$(\alpha, \beta)$ we have
            \begin{eqnarray*}
            f(\theta \mid x) & \propto & \theta^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\theta}\sum_{i=1}^{n}(x_i-\mu)^2\right\} \times \theta^{-(\alpha+1)}\exp\left\{-\frac{\beta}{\theta}\right\} \\
            & = & \theta^{-(\alpha+\frac{n}{2}+1)}\exp\left\{-\left(\beta + \frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^2\right)\frac{1}{\theta}\right\}
            \end{eqnarray*}
            which we recognise as a kernel of an Inverse-Gamma distribution so that
            $\theta \mid x \sim$ Inv-gamma$(\alpha+\frac{n}{2}, \beta+\frac{1}{2}\sum_{i=1}^{n}(x_i-\mu)^2)$. Hence, the prior and posterior
            are in the same family giving conjugacy.
   (c) Let $X_i \mid \theta \sim Maxwell(\theta)$, the Maxwell distribution with parameter $\theta$ so that
       \[
       f(x_i \mid \theta) = \left(\frac{2}{\pi}\right)^{\frac{1}{2}}\theta^{\frac{3}{2}}x_i^2\exp\left\{-\frac{\theta x_i^2}{2}\right\}, \quad x_i > 0
       \]
       and $E(X_i \mid \theta) = 2\sqrt{\frac{2}{\pi\theta}}$, $Var(X_i \mid \theta) = \frac{3\pi-8}{\pi\theta}$.

       i. Show that $f(x_i \mid \theta)$ belongs to the 1-parameter exponential family
          and for $X = (X_1, \ldots, X_n)$ state the sufficient statistic for learning
          about $\theta$.

          Writing the Maxwell density in exponential family form we have
          \[
          f(x_i \mid \theta) = \exp\left\{-\theta\frac{x_i^2}{2} + \frac{3}{2}\log\theta + \log x_i^2 + \frac{1}{2}\log\frac{2}{\pi}\right\}
          \]
          so that $f(x_i \mid \theta)$ belongs to the 1-parameter exponential family. The sufficient
          statistic is $t_n = [n, \sum_{i=1}^{n}x_i^2]$. Note that, expressed explicitly as a 1-parameter
          exponential family, the likelihood for $x = (x_1, \ldots, x_n)$ is
          \[
          f(x \mid \theta) = \exp\left\{-\theta\sum_{i=1}^{n}\frac{x_i^2}{2} + \frac{3n}{2}\log\theta + \sum_{i=1}^{n}\log x_i^2 + \frac{n}{2}\log\frac{2}{\pi}\right\}
          \]
          so that the natural conjugate prior has the form
          \begin{eqnarray*}
          f(\theta) & = & \exp\{-a\theta + d\log\theta + c(a, d)\} \\
          & \propto & \theta^d e^{-a\theta}
          \end{eqnarray*}
          which we recognise as a kernel of a Gamma distribution.
       ii. By viewing the likelihood as a function of $\theta$, which generic family
           of distributions (over $\theta$) is the likelihood a kernel of?

           In conventional form,
           \begin{eqnarray*}
           f(x \mid \theta) & = & \left(\frac{2}{\pi}\right)^{\frac{n}{2}}\left(\prod_{i=1}^{n}x_i^2\right)\theta^{\frac{3n}{2}}\exp\left\{-\left(\frac{\sum_{i=1}^{n}x_i^2}{2}\right)\theta\right\} \\
           & \propto & \theta^{\frac{3n}{2}}\exp\left\{-\left(\frac{\sum_{i=1}^{n}x_i^2}{2}\right)\theta\right\}
           \end{eqnarray*}
           which, viewing $f(x \mid \theta)$ as a function of $\theta$, we recognise as a kernel of a Gamma
           distribution (in particular, a $Gamma(\frac{3n+2}{2}, \frac{1}{2}\sum_{i=1}^{n}x_i^2)$).
       iii. By first finding the corresponding posterior distribution for $\theta$ given
            $x = (x_1, \ldots, x_n)$, show that this family of distributions is conjugate
            with respect to the likelihood $f(x \mid \theta)$.

            Taking $\theta \sim Gamma(\alpha, \beta)$ we have
            \begin{eqnarray*}
            f(\theta \mid x) & \propto & \theta^{\frac{3n}{2}}\exp\left\{-\left(\frac{\sum_{i=1}^{n}x_i^2}{2}\right)\theta\right\} \times \theta^{\alpha-1}e^{-\beta\theta} \\
            & = & \theta^{\alpha+\frac{3n}{2}-1}\exp\left\{-\left(\beta + \frac{1}{2}\sum_{i=1}^{n}x_i^2\right)\theta\right\}
            \end{eqnarray*}
            which, of course, is a kernel of a Gamma distribution so that $\theta \mid x \sim Gamma(\alpha+\frac{3n}{2}, \beta+\frac{1}{2}\sum_{i=1}^{n}x_i^2)$.
            The prior and the posterior are in the same family giving conjugacy.
2. Let $X_1, \ldots, X_n$ be exchangeable so that the $X_i$ are conditionally independent
   given a parameter $\theta$. Suppose that $X_i \mid \theta$ is geometrically distributed with
   probability density function
   \[
   f(x_i \mid \theta) = (1-\theta)^{x_i-1}\theta, \quad x_i = 1, 2, \ldots.
   \]
   (a) Show that $f(x \mid \theta)$, where $x = (x_1, \ldots, x_n)$, belongs to the 1-parameter
       exponential family. Hence, or otherwise, find the conjugate prior distribution
       and corresponding posterior distribution for $\theta$.

       As the $X_i$ are conditionally independent given $\theta$ then
       \begin{eqnarray*}
       f(x \mid \theta) & = & \prod_{i=1}^{n}f(x_i \mid \theta) \\
       & = & \prod_{i=1}^{n}(1-\theta)^{x_i-1}\theta \\
       & = & (1-\theta)^{n\bar{x}-n}\theta^n \\
       & = & \exp\{(n\bar{x}-n)\log(1-\theta) + n\log\theta\}
       \end{eqnarray*}
       and so belongs to the 1-parameter exponential family. The conjugate prior is of
       the form
       \begin{eqnarray*}
       f(\theta) & \propto & \exp\{a\log(1-\theta) + b\log\theta\} \\
       & = & \theta^b(1-\theta)^a
       \end{eqnarray*}
       which is a kernel of a Beta distribution. Letting $\alpha = b+1$, $\beta = a+1$ then we
       have $\theta \sim Beta(\alpha, \beta)$. The posterior is
       \begin{eqnarray*}
       f(\theta \mid x) & \propto & f(x \mid \theta)f(\theta) \\
       & \propto & \theta^n(1-\theta)^{n\bar{x}-n}\theta^{\alpha-1}(1-\theta)^{\beta-1}
       \end{eqnarray*}
       which is a kernel of a $Beta(\alpha+n, \beta+n\bar{x}-n)$ so that $\theta \mid x \sim Beta(\alpha+n, \beta+n\bar{x}-n)$.
   (b) Show that the posterior mean for $\theta$ can be written as a weighted average
       of the prior mean of $\theta$ and the maximum likelihood estimate, $\bar{x}^{-1}$.

       \begin{eqnarray*}
       E(\theta \mid X) & = & \frac{\alpha+n}{(\alpha+n)+(\beta+n\bar{x}-n)} \\
       & = & \frac{\alpha+n}{\alpha+\beta+n\bar{x}} \\
       & = & \left(\frac{\alpha+\beta}{\alpha+\beta+n\bar{x}}\right)\left(\frac{\alpha}{\alpha+\beta}\right) + \left(\frac{n\bar{x}}{\alpha+\beta+n\bar{x}}\right)\left(\frac{1}{\bar{x}}\right) \\
       & = & \lambda E(\theta) + (1-\lambda)\bar{x}^{-1}
       \end{eqnarray*}
       where $\lambda = \frac{\alpha+\beta}{\alpha+\beta+n\bar{x}}$.

   (c) Suppose now that the prior for $\theta$ is instead given by the probability
       density function
       \[
       f(\theta) = \frac{1}{2B(\alpha+1, \beta)}\theta^{\alpha}(1-\theta)^{\beta-1} + \frac{1}{2B(\alpha, \beta+1)}\theta^{\alpha-1}(1-\theta)^{\beta},
       \]
       where $B(\alpha, \beta)$ denotes the Beta function evaluated at $\alpha$ and $\beta$. Show
       that the posterior probability density function can be written as
       \[
       f(\theta \mid x) = \lambda f_1(\theta) + (1-\lambda)f_2(\theta)
       \]
       where
       \[
       \lambda = \frac{(\alpha+n)\beta}{(\alpha+n)\beta + (\beta-n+\sum_{i=1}^{n}x_i)\alpha}
       \]
       and $f_1(\theta)$ and $f_2(\theta)$ are probability density functions.

       \begin{eqnarray*}
       f(\theta \mid x) & \propto & f(x \mid \theta)f(\theta) \\
       & = & \theta^n(1-\theta)^{n\bar{x}-n}\left[\frac{\theta^{\alpha}(1-\theta)^{\beta-1}}{B(\alpha+1, \beta)} + \frac{\theta^{\alpha-1}(1-\theta)^{\beta}}{B(\alpha, \beta+1)}\right] \\
       & = & \frac{\theta^{\alpha_1}(1-\theta)^{\beta_1-1}}{B(\alpha+1, \beta)} + \frac{\theta^{\alpha_1-1}(1-\theta)^{\beta_1}}{B(\alpha, \beta+1)}
       \end{eqnarray*}
       where $\alpha_1 = \alpha+n$ and $\beta_1 = \beta+n\bar{x}-n$. Finding the constant of proportionality,
       we observe that $\theta^{\alpha_1}(1-\theta)^{\beta_1-1}$ is a kernel of a $Beta(\alpha_1+1, \beta_1)$ and $\theta^{\alpha_1-1}(1-\theta)^{\beta_1}$
       is a kernel of a $Beta(\alpha_1, \beta_1+1)$. So,
       \[
       f(\theta \mid x) = c\left[\frac{B(\alpha_1+1, \beta_1)}{B(\alpha+1, \beta)}f_1(\theta) + \frac{B(\alpha_1, \beta_1+1)}{B(\alpha, \beta+1)}f_2(\theta)\right]
       \]
       where $f_1(\theta)$ is the density function of a $Beta(\alpha_1+1, \beta_1)$ and $f_2(\theta)$ the density
       function of a $Beta(\alpha_1, \beta_1+1)$. Hence,
       \[
       c^{-1} = \frac{B(\alpha_1+1, \beta_1)}{B(\alpha+1, \beta)} + \frac{B(\alpha_1, \beta_1+1)}{B(\alpha, \beta+1)}
       \]
       so that $f(\theta \mid x) = \lambda f_1(\theta) + (1-\lambda)f_2(\theta)$ with
       \begin{eqnarray*}
       \lambda & = & \frac{\frac{B(\alpha_1+1, \beta_1)}{B(\alpha+1, \beta)}}{\frac{B(\alpha_1+1, \beta_1)}{B(\alpha+1, \beta)} + \frac{B(\alpha_1, \beta_1+1)}{B(\alpha, \beta+1)}} \\
       & = & \frac{\frac{\alpha_1(\alpha+\beta)B(\alpha_1, \beta_1)}{\alpha(\alpha_1+\beta_1)B(\alpha, \beta)}}{\frac{\alpha_1(\alpha+\beta)B(\alpha_1, \beta_1)}{\alpha(\alpha_1+\beta_1)B(\alpha, \beta)} + \frac{\beta_1(\alpha+\beta)B(\alpha_1, \beta_1)}{\beta(\alpha_1+\beta_1)B(\alpha, \beta)}} \\
       & = & \frac{\alpha_1\beta}{\alpha_1\beta + \beta_1\alpha} \\
       & = & \frac{(\alpha+n)\beta}{(\alpha+n)\beta + (\beta+\sum_{i=1}^{n}x_i-n)\alpha}.
       \end{eqnarray*}

3. Let $X_1, \ldots, X_n$ be exchangeable so that the $X_i$ are conditionally independent
   given a parameter $\theta$. Suppose that $X_i \mid \theta$ is distributed as a double-exponential
   distribution with probability density function
   \[
   f(x_i \mid \theta) = \frac{1}{2\theta}\exp\left\{-\frac{|x_i|}{\theta}\right\}, \quad -\infty < x_i < \infty
   \]
   for $\theta > 0$.
   (a) Find the conjugate prior distribution and corresponding posterior distribution
       for $\theta$ following observation of $x = (x_1, \ldots, x_n)$.

       \begin{eqnarray*}
       f(x \mid \theta) & = & \prod_{i=1}^{n}\frac{1}{2\theta}\exp\left\{-\frac{|x_i|}{\theta}\right\} \\
       & \propto & \frac{1}{\theta^n}\exp\left\{-\frac{1}{\theta}\sum_{i=1}^{n}|x_i|\right\}
       \end{eqnarray*}
       which, when viewed as a function of $\theta$, is a kernel of an Inv-gamma$(n-1, \sum_{i=1}^{n}|x_i|)$.
       We thus take $\theta \sim$ Inv-gamma$(\alpha, \beta)$ as the prior so that
       \begin{eqnarray*}
       f(\theta \mid x) & \propto & \frac{1}{\theta^n}\exp\left\{-\frac{1}{\theta}\sum_{i=1}^{n}|x_i|\right\}\frac{1}{\theta^{\alpha+1}}\exp\left\{-\frac{\beta}{\theta}\right\} \\
       & = & \frac{1}{\theta^{\alpha+n+1}}\exp\left\{-\frac{1}{\theta}\left(\beta+\sum_{i=1}^{n}|x_i|\right)\right\}
       \end{eqnarray*}
       which is a kernel of an Inv-gamma$(\alpha+n, \beta+\sum_{i=1}^{n}|x_i|)$. Thus, with respect to
       $X \mid \theta$, the prior and posterior are in the same family, showing conjugacy, with
       $\theta \mid x \sim$ Inv-gamma$(\alpha+n, \beta+\sum_{i=1}^{n}|x_i|)$.
   (b) Consider the transformation $\phi = \theta^{-1}$. Find the posterior distribution
       of $\phi \mid x$.

       We have $\phi = g(\theta)$ where $g(\theta) = \theta^{-1}$ so that $\theta = g^{-1}(\phi) = \phi^{-1}$. Transforming
       $f_{\theta}(\theta \mid x)$ to $f_{\phi}(\phi \mid x)$ we have
       \begin{eqnarray*}
       f_{\phi}(\phi \mid x) & = & f_{\theta}(g^{-1}(\phi) \mid x)\left|\frac{\partial\theta}{\partial\phi}\right| \\
       & \propto & \frac{1}{(1/\phi)^{\alpha+n+1}}\exp\left\{-\phi\left(\beta+\sum_{i=1}^{n}|x_i|\right)\right\}\left|-\frac{1}{\phi^2}\right| \\
       & = & \phi^{\alpha+n-1}\exp\left\{-\phi\left(\beta+\sum_{i=1}^{n}|x_i|\right)\right\}
       \end{eqnarray*}
       which is a kernel of a $Gamma(\alpha+n, \beta+\sum_{i=1}^{n}|x_i|)$ distribution. That is, $\phi \mid x \sim
       Gamma(\alpha+n, \beta+\sum_{i=1}^{n}|x_i|)$. The result highlights the relationship between
       the Gamma and Inverse-Gamma distributions shown in question 3(b)(i) of Question
       Sheet Two.
4. Let $X_1, \ldots, X_n$ be a finite subset of a sequence of infinitely exchangeable
   random quantities with joint density function
   \[
   f(x_1, \ldots, x_n) = n!\left(1+\sum_{i=1}^{n}x_i\right)^{-(n+1)}.
   \]
   Show that they can be represented as conditionally independent and exponentially
   distributed.

   Using de Finetti's Representation Theorem (Theorem 2 of the on-line notes), the joint
   distribution has an integral representation of the form
   \[
   f(x_1, \ldots, x_n) = \int_{\theta}\left\{\prod_{i=1}^{n}f(x_i \mid \theta)\right\}f(\theta)\,d\theta.
   \]
   If $X_i \mid \theta \sim Exp(\theta)$ then
   \[
   \prod_{i=1}^{n}f(x_i \mid \theta) = \prod_{i=1}^{n}\theta\exp(-\theta x_i) = \theta^n\exp\left(-\theta\sum_{i=1}^{n}x_i\right).
   \]
   Notice that, viewed as a function of $\theta$, this looks like a kernel of a $Gamma(n+1, \sum_{i=1}^{n}x_i)$.
   The result holds if we can find an $f(\theta)$ such that
   \[
   n!\left(1+\sum_{i=1}^{n}x_i\right)^{-(n+1)} = \int_{\theta}\theta^n\exp\left(-\theta\sum_{i=1}^{n}x_i\right)f(\theta)\,d\theta.
   \]
   The left hand side looks like the normalising constant of a $Gamma(n+1, 1+\sum_{i=1}^{n}x_i)$
   (as $n! = \Gamma(n+1)$) and if $f(\theta) = \exp(-\theta)$ then the integrand on the right hand side is
   a kernel of a $Gamma(n+1, 1+\sum_{i=1}^{n}x_i)$. So, if $\theta \sim Gamma(1, 1)$ then $f(\theta) = \exp(-\theta)$
   and we have the desired representation.
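A numerical check of this representation can be done in R; the sketch below is illustrative only, using made-up observations, and compares the stated joint density with the integral over theta when f(theta) = exp(-theta).

    # Illustrative sketch: numerical check of the de Finetti representation with theta ~ Gamma(1, 1)
    x <- c(0.5, 1.2, 2.0)                           # hypothetical observations
    n <- length(x); s <- sum(x)
    lhs <- factorial(n) * (1 + s)^(-(n + 1))
    rhs <- integrate(function(theta) theta^n * exp(-theta * s) * exp(-theta),
                     lower = 0, upper = Inf)$value
    c(lhs, rhs)                                      # the two values agree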
5. Let $X_1, \ldots, X_n$ be exchangeable so that the $X_i$ are conditionally independent
   given a parameter $\theta$. Suppose that $X_i \mid \theta$ is distributed as a Poisson
   distribution with mean $\theta$.
   (a) Show that, with respect to this Poisson likelihood, the gamma family
       of distributions is conjugate.

       \begin{eqnarray*}
       f(x \mid \theta) & = & \prod_{i=1}^{n}P(X_i = x_i \mid \theta) \\
       & \propto & \prod_{i=1}^{n}\theta^{x_i}\exp\{-\theta\} \\
       & = & \theta^{n\bar{x}}\exp\{-n\theta\}.
       \end{eqnarray*}
       As $\theta \sim Gamma(\alpha, \beta)$ then
       \begin{eqnarray*}
       f(\theta \mid x) & \propto & f(x \mid \theta)f(\theta) \\
       & \propto & \theta^{n\bar{x}}\exp\{-n\theta\}\theta^{\alpha-1}\exp\{-\beta\theta\} \\
       & = & \theta^{\alpha+n\bar{x}-1}\exp\{-(\beta+n)\theta\}
       \end{eqnarray*}
       which is a kernel of a $Gamma(\alpha+n\bar{x}, \beta+n)$ distribution. Hence, the prior and
       posterior are in the same family giving conjugacy.
   (b) Interpret the posterior mean of $\theta$ paying particular attention to the
       cases when we may have weak prior information and strong prior information.

       \begin{eqnarray*}
       E(\theta \mid X) & = & \frac{\alpha+n\bar{x}}{\beta+n} \\
       & = & \frac{\beta\left(\frac{\alpha}{\beta}\right)+n\bar{x}}{\beta+n} \\
       & = & \lambda\left(\frac{\alpha}{\beta}\right) + (1-\lambda)\bar{x}
       \end{eqnarray*}
       where $\lambda = \frac{\beta}{\beta+n}$. Hence, the posterior mean is a weighted average of the prior
       mean, $\frac{\alpha}{\beta}$, and the data mean, $\bar{x}$, which is also the maximum likelihood estimate.

       Weak prior information corresponds to a large variance of $\theta$ which can be viewed
       as small $\beta$ ($\beta$ is the inverse scale parameter). In this case, more weight is attached
       to $\bar{x}$ than to $\frac{\alpha}{\beta}$ in the posterior mean.

       Strong prior information corresponds to a small variance of $\theta$ which can be viewed
       as large $\beta$ (once again, $\beta$ is the inverse scale parameter). In this case, more weight
       is attached to $\frac{\alpha}{\beta}$ than to $\bar{x}$ in the posterior mean, which thus favours the prior mean.
   (c) Suppose now that the prior for $\theta$ is given hierarchically. Given $\lambda$, $\theta$
       is judged to follow an exponential distribution with mean $\frac{1}{\lambda}$ and $\lambda$ is
       given the improper distribution $f(\lambda) \propto 1$ for $\lambda > 0$. Show that
       \[
       f(\lambda \mid x) \propto \frac{\lambda}{(n+\lambda)^{n\bar{x}+1}}
       \]
       where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$.

       $\theta \mid \lambda \sim Exp(\lambda)$ so $f(\theta \mid \lambda) = \lambda\exp\{-\lambda\theta\}$.
       \begin{eqnarray*}
       f(\lambda, \theta \mid x) & \propto & f(x \mid \theta, \lambda)f(\theta, \lambda) \\
       & = & f(x \mid \theta)f(\theta \mid \lambda)f(\lambda) \\
       & \propto & \theta^{n\bar{x}}\exp\{-n\theta\}(\lambda\exp\{-\lambda\theta\}) \\
       & = & \lambda\theta^{n\bar{x}}\exp\{-(n+\lambda)\theta\}.
       \end{eqnarray*}
       Thus, integrating out $\theta$,
       \begin{eqnarray*}
       f(\lambda \mid x) & \propto & \int_{0}^{\infty}\lambda\theta^{n\bar{x}}\exp\{-(n+\lambda)\theta\}\,d\theta \\
       & = & \lambda\int_{0}^{\infty}\theta^{n\bar{x}}\exp\{-(n+\lambda)\theta\}\,d\theta.
       \end{eqnarray*}
       As the integrand is a kernel of a $Gamma(n\bar{x}+1, n+\lambda)$ distribution we thus have
       \begin{eqnarray*}
       f(\lambda \mid x) & \propto & \frac{\lambda\Gamma(n\bar{x}+1)}{(n+\lambda)^{n\bar{x}+1}} \\
       & \propto & \frac{\lambda}{(n+\lambda)^{n\bar{x}+1}}.
       \end{eqnarray*}
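The marginalisation over theta can also be checked numerically in R; this sketch is illustrative only, with made-up values of n and the data sum, and verifies that the integral is proportional to lambda/(n + lambda)^(n*xbar + 1).

    # Illustrative sketch: check the integration over theta numerically
    n <- 10; sx <- 23                 # hypothetical n and sum of the x_i (= n*xbar)
    f.joint <- function(lambda) sapply(lambda, function(l)
      integrate(function(theta) l * theta^sx * exp(-(n + l) * theta), 0, Inf)$value)
    f.target <- function(lambda) lambda / (n + lambda)^(sx + 1)
    lam <- c(0.5, 1, 2, 5)
    f.joint(lam) / f.target(lam)      # constant ratio, equal to gamma(sx + 1)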


MA40189 2016-2017 Problem Sheet 5 - Solutions

Topics in Bayesian statistics (University of Bath)


MA40189 - Solution Sheet Five


Simon Shaw, [email protected]
http://people.bath.ac.uk/masss/ma40189.html

2016/17 Semester II

1. Let $X_1, \ldots, X_n$ be exchangeable so that the $X_i$ are conditionally independent
   given a parameter $\theta$. For each of the following distributions for $X_i \mid \theta$ find
   the Jeffreys prior and the corresponding posterior distribution for $\theta$.

   (a) $X_i \mid \theta \sim Bern(\theta)$.

   The likelihood is
   \begin{eqnarray*}
   f(x \mid \theta) & = & \prod_{i=1}^{n}\theta^{x_i}(1-\theta)^{1-x_i} \\
   & = & \theta^{n\bar{x}}(1-\theta)^{n-n\bar{x}}.
   \end{eqnarray*}
   As $\theta$ is univariate,
   \begin{eqnarray*}
   \frac{\partial^2}{\partial\theta^2}\log f(x \mid \theta) & = & \frac{\partial^2}{\partial\theta^2}\{n\bar{x}\log\theta + (n-n\bar{x})\log(1-\theta)\} \\
   & = & -\frac{n\bar{x}}{\theta^2} - \frac{(n-n\bar{x})}{(1-\theta)^2}.
   \end{eqnarray*}
   The Fisher information is thus
   \begin{eqnarray*}
   I(\theta) & = & -E\left(-\frac{n\bar{X}}{\theta^2} - \frac{(n-n\bar{X})}{(1-\theta)^2}\,\Big|\,\theta\right) \\
   & = & \frac{nE(\bar{X} \mid \theta)}{\theta^2} + \frac{n-nE(\bar{X} \mid \theta)}{(1-\theta)^2} \\
   & = & \frac{n}{\theta} + \frac{n}{1-\theta} \;=\; \frac{n}{\theta(1-\theta)}.
   \end{eqnarray*}
   The Jeffreys prior is then
   \[
   f(\theta) \propto \sqrt{\frac{n}{\theta(1-\theta)}} \propto \theta^{-\frac{1}{2}}(1-\theta)^{-\frac{1}{2}}.
   \]
   This is the kernel of a $Beta(\frac{1}{2}, \frac{1}{2})$ distribution. This makes it straightforward to
   obtain the posterior using the conjugacy of the Beta distribution relative to the
   Bernoulli likelihood. The posterior is $Beta(\frac{1}{2}+n\bar{x}, \frac{1}{2}+n-n\bar{x})$. (If you are not
   sure why, see Solution Sheet Four Question 1(a) with $\alpha = \beta = \frac{1}{2}$.)


   (b) $X_i \mid \theta \sim Po(\theta)$.

   The likelihood is
   \begin{eqnarray*}
   f(x \mid \theta) & = & \prod_{i=1}^{n}\frac{\theta^{x_i}e^{-\theta}}{x_i!} \\
   & = & \frac{\theta^{n\bar{x}}e^{-n\theta}}{\prod_{i=1}^{n}x_i!}.
   \end{eqnarray*}
   As $\theta$ is univariate,
   \begin{eqnarray*}
   \frac{\partial^2}{\partial\theta^2}\log f(x \mid \theta) & = & \frac{\partial^2}{\partial\theta^2}\left(n\bar{x}\log\theta - n\theta - \sum_{i=1}^{n}\log x_i!\right) \\
   & = & -\frac{n\bar{x}}{\theta^2}.
   \end{eqnarray*}
   The Fisher information is thus
   \begin{eqnarray*}
   I(\theta) & = & -E\left(-\frac{n\bar{X}}{\theta^2}\,\Big|\,\theta\right) \\
   & = & \frac{nE(\bar{X} \mid \theta)}{\theta^2} \;=\; \frac{n}{\theta}.
   \end{eqnarray*}
   The Jeffreys prior is then
   \[
   f(\theta) \propto \sqrt{\frac{n}{\theta}} \propto \theta^{-\frac{1}{2}}.
   \]
   This is often expressed as the improper $Gamma(\frac{1}{2}, 0)$ distribution. This makes
   it straightforward to obtain the posterior using the conjugacy of the Gamma
   distribution relative to the Poisson likelihood. The posterior is $Gamma(\frac{1}{2}+n\bar{x}, n)$.
   (If you are not sure why, see Solution Sheet Four Question 5(a) with $\alpha = \frac{1}{2}$,
   $\beta = 0$.)
   (c) $X_i \mid \theta \sim Maxwell(\theta)$, the Maxwell distribution with parameter $\theta$ so that
       \[
       f(x_i \mid \theta) = \left(\frac{2}{\pi}\right)^{\frac{1}{2}}\theta^{\frac{3}{2}}x_i^2\exp\left\{-\frac{\theta x_i^2}{2}\right\}, \quad x_i > 0
       \]
       and $E(X_i \mid \theta) = 2\sqrt{\frac{2}{\pi\theta}}$, $Var(X_i \mid \theta) = \frac{3\pi-8}{\pi\theta}$.

   The likelihood is
   \begin{eqnarray*}
   f(x \mid \theta) & = & \prod_{i=1}^{n}\left(\frac{2}{\pi}\right)^{\frac{1}{2}}\theta^{\frac{3}{2}}x_i^2\exp\left\{-\frac{\theta x_i^2}{2}\right\} \\
   & = & \left(\frac{2}{\pi}\right)^{\frac{n}{2}}\theta^{\frac{3n}{2}}\left(\prod_{i=1}^{n}x_i^2\right)\exp\left\{-\frac{\theta}{2}\sum_{i=1}^{n}x_i^2\right\}.
   \end{eqnarray*}
   As $\theta$ is univariate,
   \begin{eqnarray*}
   \frac{\partial^2}{\partial\theta^2}\log f(x \mid \theta) & = & \frac{\partial^2}{\partial\theta^2}\left(\frac{n}{2}\log\frac{2}{\pi} + \frac{3n}{2}\log\theta + \sum_{i=1}^{n}\log x_i^2 - \frac{\theta}{2}\sum_{i=1}^{n}x_i^2\right) \\
   & = & -\frac{3n}{2\theta^2}.
   \end{eqnarray*}
   The Fisher information is
   \[
   I(\theta) = -E\left(-\frac{3n}{2\theta^2}\,\Big|\,\theta\right) = \frac{3n}{2\theta^2}.
   \]
   The Jeffreys prior is then
   \[
   f(\theta) \propto \sqrt{\frac{3n}{2\theta^2}} \propto \theta^{-1}.
   \]
   This is often expressed as the improper $Gamma(0, 0)$ distribution. This makes
   it straightforward to obtain the posterior using the conjugacy of the Gamma
   distribution relative to the Maxwell likelihood. The posterior is $Gamma(\frac{3n}{2}, \frac{1}{2}\sum_{i=1}^{n}x_i^2)$.
   (If you are not sure why, see Solution Sheet Four Question 1(c) with $\alpha = \beta = 0$.)
2. Let $X_1, \ldots, X_n$ be exchangeable so that the $X_i$ are conditionally independent
   given a parameter $\lambda$. Suppose that $X_i \mid \lambda \sim Exp(\lambda)$ where $\lambda$ represents the
   rate so that $E(X_i \mid \lambda) = \lambda^{-1}$.
   (a) Show that $X_i \mid \lambda \sim Exp(\lambda)$ is a member of the 1-parameter exponential
       family. Hence, write down a sufficient statistic $t(X)$ for $X = (X_1, \ldots, X_n)$
       for learning about $\lambda$.

       We write
       \[
       f(x_i \mid \lambda) = \lambda e^{-\lambda x_i} = \exp\{-\lambda x_i + \log\lambda\}
       \]
       which is of the form $\exp\{\phi_1(\lambda)u_1(x_i) + g(\lambda) + h(x_i)\}$ where $\phi_1(\lambda) = -\lambda$, $u_1(x_i) =
       x_i$, $g(\lambda) = \log\lambda$ and $h(x_i) = 0$. From the Proposition given in Lecture 9 (see p26
       of the on-line notes) a sufficient statistic is
       \[
       t(X) = \sum_{i=1}^{n}u_1(X_i) = \sum_{i=1}^{n}X_i.
       \]

   (b) Find the Jeffreys prior and comment upon whether or not it is improper.
       Find the posterior distribution for this prior.

       The likelihood is
       \[
       f(x \mid \lambda) = \prod_{i=1}^{n}\lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda n\bar{x}}. \qquad (1)
       \]
       As $\lambda$ is univariate,
       \begin{eqnarray*}
       \frac{\partial^2}{\partial\lambda^2}\log f(x \mid \lambda) & = & \frac{\partial^2}{\partial\lambda^2}(n\log\lambda - \lambda n\bar{x}) \\
       & = & -\frac{n}{\lambda^2}.
       \end{eqnarray*}
       The Fisher information is
       \[
       I(\lambda) = -E\left(-\frac{n}{\lambda^2}\,\Big|\,\lambda\right) = \frac{n}{\lambda^2}.
       \]
       The Jeffreys prior is thus
       \[
       f(\lambda) \propto \sqrt{\frac{n}{\lambda^2}} \propto \lambda^{-1}. \qquad (2)
       \]
       This is the improper $Gamma(0, 0)$ distribution: it does not integrate to a finite
       value over $(0, \infty)$. The posterior can be obtained
       using the conjugacy of the Gamma distribution relative to the Exponential likelihood.
       It is $Gamma(n, n\bar{x})$. (If you are not sure why, see Solution Sheet Three
       Question 1 (a) with $\alpha = \beta = 0$.)
   (c) Consider the transformation $\phi = \log\lambda$.
       i. By expressing $L(\lambda) = f(x \mid \lambda)$ as $L(\phi)$, find the Jeffreys prior for $\phi$.

          Inverting $\phi = \log\lambda$ we have $\lambda = e^{\phi}$. Substituting this into (1) we have
          \[
          L(\phi) = e^{n\phi}\exp\{-n\bar{x}e^{\phi}\}.
          \]
          We have
          \begin{eqnarray*}
          \frac{\partial^2}{\partial\phi^2}\log L(\phi) & = & \frac{\partial^2}{\partial\phi^2}(n\phi - n\bar{x}e^{\phi}) \\
          & = & -n\bar{x}e^{\phi}.
          \end{eqnarray*}
          The Fisher information is
          \begin{eqnarray*}
          I(\phi) & = & -E(-n\bar{X}e^{\phi} \mid \phi) \\
          & = & nE(X_i \mid \phi)e^{\phi}.
          \end{eqnarray*}
          Now $E(X_i \mid \phi) = E(X_i \mid \lambda) = \lambda^{-1} = e^{-\phi}$ so that $I(\phi) = n$. The Jeffreys prior
          is then
          \[
          f_{\phi}(\phi) \propto \sqrt{n} \propto 1 \qquad (3)
          \]
          which is the improper uniform distribution on $(-\infty, \infty)$.

       ii. By transforming the distribution of the Jeffreys prior for $\lambda$, $f(\lambda)$,
           find the distribution of $\phi$.

           From (2) we have that $f_{\lambda}(\lambda) \propto \lambda^{-1}$. Using the familiar change of variables
           formula,
           \[
           f_{\phi}(\phi) = f_{\lambda}(e^{\phi})\left|\frac{\partial\lambda}{\partial\phi}\right|
           \]
           with $\frac{\partial\lambda}{\partial\phi} = e^{\phi}$ as $\lambda = e^{\phi}$, we have that
           \[
           f_{\phi}(\phi) \propto |e^{\phi}|e^{-\phi} = 1
           \]
           which agrees with (3). This is an illustration of the invariance to reparameterisation
           of the Jeffreys prior.
3. The Jeffreys prior for Normal distributions. In Lecture 13 we showed that
for an exchangeable collection X = (X1 , . . . , Xn ) with Xi | θ ∼ N (θ, σ 2 ) where
σ 2 is known the Jeffreys prior for θ is f (θ) ∝ 1.


   (a) Consider, instead, that $X_i \mid \theta \sim N(\mu, \theta)$ where $\mu$ is known. Find the
       Jeffreys prior for $\theta$.

       The likelihood is
       \begin{eqnarray*}
       f(x \mid \theta) & = & \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\theta}}\exp\left\{-\frac{1}{2\theta}(x_i-\mu)^2\right\} \\
       & = & (2\pi)^{-\frac{n}{2}}\theta^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\theta}\sum_{i=1}^{n}(x_i-\mu)^2\right\}.
       \end{eqnarray*}
       As $\theta$ is univariate,
       \begin{eqnarray*}
       \frac{\partial^2}{\partial\theta^2}\log f(x \mid \theta) & = & \frac{\partial^2}{\partial\theta^2}\left(-\frac{n}{2}\log 2\pi - \frac{n}{2}\log\theta - \frac{1}{2\theta}\sum_{i=1}^{n}(x_i-\mu)^2\right) \\
       & = & \frac{n}{2\theta^2} - \frac{1}{\theta^3}\sum_{i=1}^{n}(x_i-\mu)^2.
       \end{eqnarray*}
       The Fisher information is
       \begin{eqnarray*}
       I(\theta) & = & -E\left(\frac{n}{2\theta^2} - \frac{1}{\theta^3}\sum_{i=1}^{n}(X_i-\mu)^2\,\Big|\,\theta\right) \\
       & = & -\frac{n}{2\theta^2} + \frac{1}{\theta^3}\sum_{i=1}^{n}E\{(X_i-\mu)^2 \mid \theta\} \\
       & = & -\frac{n}{2\theta^2} + \frac{n}{\theta^2} \;=\; \frac{n}{2\theta^2}.
       \end{eqnarray*}
       The Jeffreys prior is then
       \[
       f(\theta) \propto \sqrt{\frac{n}{2\theta^2}} \propto \theta^{-1}.
       \]

   (b) Now suppose that $X_i \mid \theta \sim N(\mu, \sigma^2)$ where $\theta = (\mu, \sigma^2)$. Find the Jeffreys
       prior for $\theta$.

       The likelihood is
       \[
       f(x \mid \theta) = (2\pi)^{-\frac{n}{2}}(\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right\}
       \]
       so that
       \[
       \log f(x \mid \theta) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2.
       \]
       As $\theta = (\mu, \sigma^2)$ then the Fisher information matrix is
       \[
       I(\theta) = -\left(\begin{array}{cc}
       E\left\{\frac{\partial^2}{\partial\mu^2}\log f(x \mid \theta)\,\big|\,\theta\right\} & E\left\{\frac{\partial^2}{\partial\mu\partial\sigma^2}\log f(x \mid \theta)\,\big|\,\theta\right\} \\
       E\left\{\frac{\partial^2}{\partial\mu\partial\sigma^2}\log f(x \mid \theta)\,\big|\,\theta\right\} & E\left\{\frac{\partial^2}{\partial(\sigma^2)^2}\log f(x \mid \theta)\,\big|\,\theta\right\}
       \end{array}\right).
       \]
       Now,
       \begin{eqnarray*}
       \frac{\partial}{\partial\mu}\log f(x \mid \theta) & = & \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu); \\
       \frac{\partial}{\partial(\sigma^2)}\log f(x \mid \theta) & = & -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2;
       \end{eqnarray*}
       so that
       \begin{eqnarray*}
       \frac{\partial^2}{\partial\mu^2}\log f(x \mid \theta) & = & -\frac{n}{\sigma^2}; \\
       \frac{\partial^2}{\partial\mu\partial(\sigma^2)}\log f(x \mid \theta) & = & -\frac{1}{\sigma^4}\sum_{i=1}^{n}(x_i-\mu); \\
       \frac{\partial^2}{\partial(\sigma^2)^2}\log f(x \mid \theta) & = & \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{i=1}^{n}(x_i-\mu)^2.
       \end{eqnarray*}
       Noting that $E(X_i-\mu \mid \theta) = 0$ and $E\{(X_i-\mu)^2 \mid \theta\} = \sigma^2$, the Fisher information
       matrix is
       \[
       I(\theta) = -\left(\begin{array}{cc}-\frac{n}{\sigma^2} & 0 \\ 0 & \frac{n}{2\sigma^4}-\frac{n}{\sigma^4}\end{array}\right) = \left(\begin{array}{cc}\frac{n}{\sigma^2} & 0 \\ 0 & \frac{n}{2\sigma^4}\end{array}\right).
       \]
       The Jeffreys prior is
       \[
       f(\theta) \propto \sqrt{|I(\theta)|} = \sqrt{\frac{n^2}{2\sigma^6}} \propto \sigma^{-3}.
       \]
(c) Comment upon your answers for these three Normal cases.

Suppose θ = (µ, σ 2 ) and Xi | θ ∼ N (µ, σ 2 ). If µ is unknown and σ 2 known then


the Jeffreys prior is f (µ) ∝ 1. If µ is known and σ 2 is unknown then the Jeffreys
prior is f (σ 2 ) ∝ σ −2 . When both µ and σ 2 are unknown then the Jeffreys
prior is f (µ, σ 2 ) ∝ σ −3 . Jeffreys himself found this inconsistent, arguing that
f (µ, σ 2 ) ∝ σ −2 , the product of the priors f (µ) and f (σ 2 ). Jeffreys’ argument was
that ignorance about µ and σ 2 should be represented by independent ignorance
priors for µ and σ 2 separately. However, it is not clear under what circumstances
this prior judgement of independence should be imposed.
4. Consider, given $\theta$, a sequence of independent Bernoulli trials with parameter
   $\theta$. We wish to make inferences about $\theta$ and consider two possible
   methods. In the first, we carry out n trials and let X denote the total
   number of successes in these trials. Thus, $X \mid \theta \sim Bin(n, \theta)$ with
   \[
   f_X(x \mid \theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}, \quad x = 0, 1, \ldots, n.
   \]
   In the second method, we count the total number Y of trials up to and
   including the rth success so that $Y \mid \theta \sim Nbin(r, \theta)$, the negative binomial
   distribution, with
   \[
   f_Y(y \mid \theta) = \binom{y-1}{r-1}\theta^r(1-\theta)^{y-r}, \quad y = r, r+1, \ldots.
   \]


   (a) Obtain the Jeffreys prior distribution for each of the two methods. You
       may find it useful to note that $E(Y \mid \theta) = \frac{r}{\theta}$.

       For $X \mid \theta \sim Bin(n, \theta)$ we have
       \[
       \log f_X(x \mid \theta) = \log\binom{n}{x} + x\log\theta + (n-x)\log(1-\theta)
       \]
       so that
       \[
       \frac{\partial^2}{\partial\theta^2}\log f_X(x \mid \theta) = -\frac{x}{\theta^2} - \frac{(n-x)}{(1-\theta)^2}.
       \]
       The Fisher information is
       \begin{eqnarray*}
       I(\theta) & = & -E\left(-\frac{X}{\theta^2} - \frac{(n-X)}{(1-\theta)^2}\,\Big|\,\theta\right) \\
       & = & \frac{n}{\theta} + \frac{n}{1-\theta} \;=\; \frac{n}{\theta(1-\theta)}.
       \end{eqnarray*}
       The Jeffreys prior is thus
       \[
       f(\theta) \propto \sqrt{\frac{n}{\theta(1-\theta)}} \propto \theta^{-\frac{1}{2}}(1-\theta)^{-\frac{1}{2}}
       \]
       which is a kernel of the $Beta(\frac{1}{2}, \frac{1}{2})$ distribution.

       For $Y \mid \theta \sim Nbin(r, \theta)$ we have
       \[
       \log f_Y(y \mid \theta) = \log\binom{y-1}{r-1} + r\log\theta + (y-r)\log(1-\theta)
       \]
       so that
       \[
       \frac{\partial^2}{\partial\theta^2}\log f_Y(y \mid \theta) = -\frac{r}{\theta^2} - \frac{(y-r)}{(1-\theta)^2}.
       \]
       The Fisher information is
       \begin{eqnarray*}
       I(\theta) & = & -E\left(-\frac{r}{\theta^2} - \frac{(Y-r)}{(1-\theta)^2}\,\Big|\,\theta\right) \\
       & = & \frac{r}{\theta^2} + \frac{r(\frac{1}{\theta}-1)}{(1-\theta)^2} \\
       & = & \frac{r}{\theta^2} + \frac{r}{\theta(1-\theta)} \;=\; \frac{r}{\theta^2(1-\theta)}.
       \end{eqnarray*}
       The Jeffreys prior is thus
       \[
       f(\theta) \propto \sqrt{\frac{r}{\theta^2(1-\theta)}} \propto \theta^{-1}(1-\theta)^{-\frac{1}{2}}
       \]
       which can be viewed as the improper $Beta(0, \frac{1}{2})$ distribution.


   (b) Suppose we observe x = r and y = n. For each method, calculate the
       posterior distribution for $\theta$ with the Jeffreys prior. Comment upon
       your answers.

       We summarise the results in a table.

       Prior           Likelihood                                         Posterior
       Beta(1/2, 1/2)  X | θ ∼ Bin(n, θ);  θ^x (1 − θ)^(n−x)               Beta(1/2 + x, 1/2 + n − x)
       Beta(0, 1/2)    Y | θ ∼ Nbin(r, θ); θ^r (1 − θ)^(y−r)               Beta(r, 1/2 + y − r)

       Notice that if x = r and y = n then the two approaches have identical likelihoods:
       in both cases we observed x successes in n trials, but $\theta \mid x \sim Beta(\frac{1}{2}+x, \frac{1}{2}+n-x)$
       whereas $\theta \mid y \sim Beta(x, \frac{1}{2}+n-x)$. Although the likelihoods are the same, Jeffreys'
       approach yields different posterior distributions, which seems to contradict the
       notion of a noninformative prior. This occurs because Jeffreys' prior violates the
       likelihood principle. In short, this principle states that the likelihood contains
       all the information about the data x so that two likelihoods contain the same
       information if they are proportional. In this case, the two likelihoods are identical
       but yield different posterior distributions. Classical statistics violates the
       likelihood principle but Bayesian statistics (using proper prior distributions) does
       not.
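The difference between the two posteriors can be visualised with a short R sketch; it is illustrative only, using the hypothetical values x = r = 3 successes in y = n = 12 trials.

    # Illustrative sketch: the two Jeffreys-prior posteriors for the same observed data
    theta <- seq(0.001, 0.999, length.out = 500)
    post.bin  <- dbeta(theta, 0.5 + 3, 0.5 + 12 - 3)   # binomial sampling: Beta(3.5, 9.5)
    post.nbin <- dbeta(theta, 3, 0.5 + 12 - 3)         # negative binomial sampling: Beta(3, 9.5)
    matplot(theta, cbind(post.bin, post.nbin), type = "l", lty = 1:2,
            xlab = "theta", ylab = "posterior density")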


MA40189 2016-2017 Problem Sheet 6 - Solutions

Topics in Bayesian statistics (University of Bath)


MA40189 - Solution Sheet Six


Simon Shaw, [email protected]
http://people.bath.ac.uk/masss/ma40189.html

2016/17 Semester II

1. Let $X_1, \ldots, X_n$ be exchangeable so that the $X_i$ are conditionally independent
   given a parameter $\theta = (\mu, \sigma^2)$. Suppose that $X_i \mid \theta \sim N(\mu, \sigma^2)$. It is judged
   that the improper joint prior distribution $f(\mu, \sigma^2) \propto 1/\sigma^2$ is appropriate.
   (a) Show that the likelihood $f(x \mid \mu, \sigma^2)$, where $x = (x_1, \ldots, x_n)$, can be expressed
       as
       \[
       f(x \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{x}-\mu)^2\right]\right\},
       \]
       where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i$ and $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$ are respectively the sample
       mean and variance. Hence, explain why $\bar{X}$ and $S^2$ are sufficient for
       $X = (X_1, \ldots, X_n)$ for learning about $\theta$.

       \begin{eqnarray*}
       f(x \mid \mu, \sigma^2) & = & \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2\sigma^2}(x_i-\mu)^2\right\} \\
       & = & (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right\} \\
       & = & (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left((x_i-\bar{x})+(\bar{x}-\mu)\right)^2\right\} \\
       & = & (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{x}-\mu)^2\right]\right\}.
       \end{eqnarray*}
       Recall that a statistic t(X) is said to be sufficient for X for learning about $\theta$ if we
       can write
       \[
       f(x \mid \theta) = g(t, \theta)h(x)
       \]
       where $g(t, \theta)$ depends upon t(x) and $\theta$ and h(x) does not depend upon $\theta$ but may
       depend upon x. In this case we have that
       \[
       f(x \mid \mu, \sigma^2) = g(\bar{x}, s^2, \mu, \sigma^2)(2\pi)^{-\frac{n}{2}}
       \]
       so that $\bar{X}$ and $S^2$ are sufficient.


   (b) Find, up to a constant of integration, the posterior distribution of $\theta$
       given x.

       \begin{eqnarray*}
       f(\mu, \sigma^2 \mid x) & \propto & f(x \mid \mu, \sigma^2)f(\mu, \sigma^2) \\
       & \propto & (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{x}-\mu)^2\right]\right\}\frac{1}{\sigma^2} \\
       & \propto & (\sigma^2)^{-(\frac{n}{2}+1)}\exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{x}-\mu)^2\right]\right\}.
       \end{eqnarray*}


   (c) Show that $\mu \mid \sigma^2, x \sim N(\bar{x}, \sigma^2/n)$. Hence, explain why, in this case, the
       chosen prior distribution for $\theta$ is noninformative.

       \begin{eqnarray*}
       f(\mu \mid \sigma^2, x) & = & \frac{f(\mu, \sigma^2 \mid x)}{f(\sigma^2 \mid x)} \\
       & \propto & f(\mu, \sigma^2 \mid x) \quad \mbox{(with respect to } \mu\mbox{)} \\
       & \propto & \exp\left\{-\frac{n}{2\sigma^2}(\bar{x}-\mu)^2\right\}.
       \end{eqnarray*}
       We recognise this as a kernel of $N(\bar{x}, \sigma^2/n)$ so $\mu \mid \sigma^2, x \sim N(\bar{x}, \sigma^2/n)$. In the
       classical model for $\mu$ when $\sigma^2$ is known, the mle is $\bar{x}$ and its variance (squared
       standard error) is $\sigma^2/n$. The distribution is coming only from the data (given $\sigma^2$),
       showing that the prior is noninformative in this case. Notice that a symmetric
       100(1 − α)% credible interval for $\mu \mid \sigma^2, x$ is
       \[
       \left(\bar{x} - z_{(1-\frac{\alpha}{2})}\frac{\sigma}{\sqrt{n}},\; \bar{x} + z_{(1-\frac{\alpha}{2})}\frac{\sigma}{\sqrt{n}}\right)
       \]
       which agrees with the 100(1 − α)% confidence interval for $\mu$ when $\sigma^2$ is known
       and $X_1, \ldots, X_n$ are iid $N(\mu, \sigma^2)$.
   (d) By integrating $f(\mu, \sigma^2 \mid x)$ over $\sigma^2$, show that
       \[
       f(\mu \mid x) \propto \left[1 + \frac{1}{n-1}\left(\frac{\bar{x}-\mu}{s/\sqrt{n}}\right)^2\right]^{-\frac{n}{2}}.
       \]
       Thus, explain why $\mu \mid x \sim t_{n-1}(\bar{x}, s^2/n)$, the non-central t-distribution
       with n − 1 degrees of freedom, location parameter $\bar{x}$ and squared scale
       parameter $s^2/n$. How does this result relate to the classical problem of
       making inferences about $\mu$ when $\sigma^2$ is also unknown?

       \begin{eqnarray*}
       f(\mu \mid x) & = & \int_{0}^{\infty} f(\mu, \sigma^2 \mid x)\,d\sigma^2 \\
       & \propto & \int_{0}^{\infty}(\sigma^2)^{-(\frac{n}{2}+1)}\exp\left\{-\frac{1}{2\sigma^2}\left[(n-1)s^2 + n(\bar{x}-\mu)^2\right]\right\}d\sigma^2.
       \end{eqnarray*}
       We recognise the integrand as a kernel of an Inv-gamma$\left(\frac{n}{2}, \frac{1}{2}\left[(n-1)s^2 + n(\bar{x}-\mu)^2\right]\right)$
       so that
       \begin{eqnarray*}
       f(\mu \mid x) & \propto & \Gamma(n/2)\left\{\frac{1}{2}\left[(n-1)s^2 + n(\bar{x}-\mu)^2\right]\right\}^{-\frac{n}{2}} \\
       & \propto & \left[(n-1)s^2 + n(\bar{x}-\mu)^2\right]^{-\frac{n}{2}} \\
       & \propto & \left[1 + \frac{1}{n-1}\left(\frac{\bar{x}-\mu}{s/\sqrt{n}}\right)^2\right]^{-\frac{n}{2}}.
       \end{eqnarray*}
       We recognise this as a kernel of $t_{n-1}(\bar{x}, s^2/n)$ so that $\mu \mid x \sim t_{n-1}(\bar{x}, s^2/n)$. This
       gives a further insight into how the prior distribution is noninformative. Inference
       about $\mu$ will mirror the classical approach when $\mu$ and $\sigma^2$ are unknown and
       $\frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{n-1}$ where $t_{n-1} = t_{n-1}(0, 1)$. For example, a symmetric 100(1 − α)% credible
       interval for $\mu \mid x$ is
       \[
       \left(\bar{x} - t_{n-1,\frac{\alpha}{2}}\frac{s}{\sqrt{n}},\; \bar{x} + t_{n-1,\frac{\alpha}{2}}\frac{s}{\sqrt{n}}\right)
       \]
       which agrees with the 100(1 − α)% confidence interval for $\mu$ when $\sigma^2$ is unknown
       and $X_1, \ldots, X_n$ are iid $N(\mu, \sigma^2)$.
2. Let $X_1, \ldots, X_n$ be exchangeable with $X_i \mid \theta \sim Bern(\theta)$.
   (a) Using the improper prior distribution $f(\theta) \propto \theta^{-1}(1-\theta)^{-1}$ find the posterior
       distribution of $\theta \mid x$ where $x = (x_1, \ldots, x_n)$. Find a normal approximation
       about the mode to this distribution.

       The likelihood is
       \[
       f(x \mid \theta) = \prod_{i=1}^{n}\theta^{x_i}(1-\theta)^{1-x_i} = \theta^{n\bar{x}}(1-\theta)^{n-n\bar{x}},
       \]
       where $x = (x_1, \ldots, x_n)$. With the given prior the posterior is
       \[
       f(\theta \mid x) \propto f(x \mid \theta)f(\theta) \propto \theta^{n\bar{x}-1}(1-\theta)^{n-n\bar{x}-1}
       \]
       which we recognise as a kernel of a $Beta(n\bar{x}, n-n\bar{x})$ density. Thus, $\theta \mid x \sim
       Beta(n\bar{x}, n-n\bar{x})$. The mode of a $Beta(\alpha, \beta)$ distribution is $\frac{\alpha-1}{\alpha+\beta-2}$ so for $\theta \mid x$ the
       mode is
       \[
       \tilde{\theta} = \frac{n\bar{x}-1}{n-2}.
       \]
       The observed information is
       \begin{eqnarray*}
       I(\theta \mid x) & = & -\frac{\partial^2}{\partial\theta^2}\log f(\theta \mid x) \\
       & = & -\frac{\partial^2}{\partial\theta^2}\left\{\log\frac{\Gamma(n)}{\Gamma(n\bar{x})\Gamma(n-n\bar{x})} + (n\bar{x}-1)\log\theta + (n-n\bar{x}-1)\log(1-\theta)\right\} \\
       & = & \frac{n\bar{x}-1}{\theta^2} + \frac{n-n\bar{x}-1}{(1-\theta)^2}.
       \end{eqnarray*}
       So, evaluating the observed information at the mode,
       \[
       I(\tilde{\theta} \mid x) = \frac{n\bar{x}-1}{\tilde{\theta}^2} + \frac{n-n\bar{x}-1}{(1-\tilde{\theta})^2}.
       \]
       Noting that $1-\tilde{\theta} = \frac{n-n\bar{x}-1}{n-2}$ we have that
       \begin{eqnarray*}
       I(\tilde{\theta} \mid x) & = & \frac{(n-2)^2}{n\bar{x}-1} + \frac{(n-2)^2}{n-n\bar{x}-1} \\
       & = & \frac{(n-2)^3}{(n\bar{x}-1)(n-n\bar{x}-1)}.
       \end{eqnarray*}
       So, approximately, $\theta \mid x \sim N(\tilde{\theta}, I^{-1}(\tilde{\theta} \mid x))$, that is, approximately,
       \[
       \theta \mid x \;\sim\; N\left(\frac{n\bar{x}-1}{n-2}, \frac{(n\bar{x}-1)(n-n\bar{x}-1)}{(n-2)^3}\right).
       \]

   (b) Show that the prior distribution $f(\theta)$ is equivalent to a uniform prior on
       \[
       \beta = \log\left(\frac{\theta}{1-\theta}\right)
       \]
       and find the posterior distribution of $\beta \mid x$. Find a normal approximation
       about the mode to this distribution.

       We have $\beta = g(\theta)$. We invert to find $\theta = g^{-1}(\beta)$. We find
       \[
       \theta = \frac{e^{\beta}}{1+e^{\beta}}.
       \]
       The prior $f_{\beta}(\beta)$ for $\beta$ is given by
       \begin{eqnarray*}
       f_{\beta}(\beta) & = & f_{\theta}(\theta)\left|\frac{\partial\theta}{\partial\beta}\right| \\
       & = & \left(\frac{e^{\beta}}{1+e^{\beta}}\right)^{-1}\left(1-\frac{e^{\beta}}{1+e^{\beta}}\right)^{-1}\frac{e^{\beta}}{(1+e^{\beta})^2} \\
       & = & \frac{e^{\beta}}{(1+e^{\beta})^2} \times \frac{1+e^{\beta}}{e^{\beta}} \times (1+e^{\beta}) \;=\; 1,
       \end{eqnarray*}
       which is equivalent to the (improper) uniform on $\beta$. The posterior is
       \begin{eqnarray*}
       f(\beta \mid x) & \propto & f(x \mid \beta)f(\beta) \\
       & \propto & \theta^{n\bar{x}}(1-\theta)^{n-n\bar{x}} \\
       & = & \left(\frac{e^{\beta}}{1+e^{\beta}}\right)^{n\bar{x}}\left(1-\frac{e^{\beta}}{1+e^{\beta}}\right)^{n-n\bar{x}} \;=\; \frac{e^{\beta n\bar{x}}}{(1+e^{\beta})^n}.
       \end{eqnarray*}
       Hence, $f(\beta \mid x) = \frac{ce^{\beta n\bar{x}}}{(1+e^{\beta})^n}$ where c is the constant of integration. For the normal
       approximation about the mode, we first need to find the mode of $\beta \mid x$. The mode
       is the maximum of $f(\beta \mid x)$ which is, equivalently, the maximum of $\log f(\beta \mid x)$.
       We have
       \begin{eqnarray*}
       \log f(\beta \mid x) & = & \log c + \beta n\bar{x} - n\log(1+e^{\beta}) \\
       \Rightarrow \;\frac{\partial}{\partial\beta}\log f(\beta \mid x) & = & n\bar{x} - \frac{ne^{\beta}}{1+e^{\beta}}.
       \end{eqnarray*}
       The mode $\tilde{\beta}$ satisfies
       \begin{eqnarray*}
       n\bar{x} - \frac{ne^{\tilde{\beta}}}{1+e^{\tilde{\beta}}} & = & 0 \\
       \Rightarrow\; e^{\tilde{\beta}} & = & \frac{n\bar{x}}{n-n\bar{x}} \\
       \Rightarrow\; \tilde{\beta} & = & \log\left(\frac{\bar{x}}{1-\bar{x}}\right).
       \end{eqnarray*}
       The observed information is
       \begin{eqnarray*}
       I(\beta \mid x) & = & -\frac{\partial^2}{\partial\beta^2}\log f(\beta \mid x) \\
       & = & \frac{ne^{\beta}}{(1+e^{\beta})^2}.
       \end{eqnarray*}
       Noting that $1+e^{\tilde{\beta}} = \frac{1}{1-\bar{x}}$ we have
       \[
       I(\tilde{\beta} \mid x) = n \times \frac{\bar{x}}{1-\bar{x}} \times (1-\bar{x})^2 = n\bar{x}(1-\bar{x}).
       \]
       Hence, approximately,
       \[
       \beta \mid x \;\sim\; N\left(\log\frac{\bar{x}}{1-\bar{x}}, \frac{1}{n\bar{x}(1-\bar{x})}\right).
       \]
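The mode and curvature on the log-odds scale can be confirmed numerically; the R sketch below is illustrative only, with made-up Bernoulli data.

    # Illustrative sketch: normal approximation on the log-odds scale
    x <- c(1, 0, 1, 1, 1, 0, 1, 0, 1, 1)             # hypothetical data
    n <- length(x); xbar <- mean(x)
    log.post <- function(beta) n * xbar * beta - n * log(1 + exp(beta))
    mode <- optimise(log.post, c(-10, 10), maximum = TRUE)$maximum
    c(mode, log(xbar / (1 - xbar)))                  # the two agree
    1 / (n * xbar * (1 - xbar))                      # approximate posterior variance of beta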

(c) For which parameterisation does it make more sense to use a normal
approximation?

Whilst we can find a normal approximation about the mode either on the scale
of θ or of β, it makes more sense for β. We have 0 < θ < 1 and −∞ < β < ∞ so
only β has a sample space which agrees with the normal distribution.

3. In viewing a section through the pancreas, doctors see what are called
“islands”. Suppose that Xi denotes the number of islands observed in the
ith patient, i = 1, . . . , n, and we judge that X1 , . . . , Xn are exchangeable with
Xi | θ ∼ P o(θ). A doctor believes that for healthy patients θ will be on average
around 2; he thinks it is unlikely that θ is greater than 3. The number of
islands seen in 100 patients are summarised in the following table.

Number of islands 0 1 2 3 4 5 ≥6
Frequency 20 30 28 14 7 1 0

   (a) Express the doctor's prior beliefs as a normal distribution for $\theta$. You
       may interpret the term "unlikely" as meaning "with probability 0.01".

       The doctor thus asserts $\theta \sim N(\mu, \sigma^2)$ with $E(\theta) = 2$ and $P(\theta > 3) = 0.01$. Note
       that, as $\theta$ is continuous, this is equivalent to $P(\theta \geq 3) = 0.01$. We use these two
       pieces of information to obtain $\mu$ and $\sigma^2$. Firstly, $E(\theta) = 2 \Rightarrow \mu = 2$. Secondly,
       \[
       P(\theta > 3) = 0.01 \;\Rightarrow\; P\left(\frac{\theta-2}{\sigma} > \frac{1}{\sigma}\right) = 0.01.
       \]
       As $\frac{\theta-2}{\sigma} \sim N(0, 1)$ we have that
       \[
       \frac{1}{\sigma} = 2.33 \;\Rightarrow\; \sigma^2 = 2.33^{-2} = 5.4289^{-1} = 0.1842.
       \]
       Hence, $\theta \sim N(2, 5.4289^{-1})$.


   (b) Find, up to a constant of proportionality, the posterior distribution $\theta \mid x$
       where $x = (x_1, \ldots, x_{100})$.

       As $X_i \mid \theta \sim Po(\theta)$, the likelihood is
       \[
       f(x \mid \theta) = \prod_{i=1}^{n}\frac{\theta^{x_i}e^{-\theta}}{x_i!} = \frac{\theta^{n\bar{x}}e^{-n\theta}}{\prod_{i=1}^{n}x_i!}.
       \]
       As $\theta \sim N(2, 5.4289^{-1})$, the prior is
       \[
       f(\theta) = \frac{2.33}{\sqrt{2\pi}}\exp\left\{-\frac{5.4289}{2}(\theta-2)^2\right\}.
       \]
       The posterior is thus
       \begin{eqnarray*}
       f(\theta \mid x) & \propto & f(x \mid \theta)f(\theta) \\
       & \propto & \theta^{n\bar{x}}e^{-n\theta} \times \exp\left\{-\frac{5.4289}{2}(\theta^2-4\theta)\right\} \\
       & = & \theta^{n\bar{x}}\exp\left\{-\frac{5.4289}{2}(\theta^2-4\theta) - n\theta\right\}.
       \end{eqnarray*}
       For the explicit data we have n = 100 and
       \[
       \sum_{i=1}^{100}x_i = (0\times 20) + (1\times 30) + (2\times 28) + (3\times 14) + (4\times 7) + (5\times 1) = 161.
       \]
       The posterior is thus
       \[
       f(\theta \mid x) = c\,\theta^{161}\exp\left\{-\frac{5.4289}{2}(\theta^2-4\theta) - 100\theta\right\}
       \]
       where c is the constant of proportionality.
   (c) Find a normal approximation to the posterior about the mode. Thus,
       estimate the posterior probability that the average number of islands
       is greater than 2.

       We find the mode of $\theta \mid x$ by maximising $f(\theta \mid x)$ or, equivalently, $\log f(\theta \mid x)$. We
       have
       \begin{eqnarray*}
       \frac{\partial}{\partial\theta}\log f(\theta \mid x) & = & \frac{\partial}{\partial\theta}\left\{\log c + 161\log\theta - \frac{5.4289}{2}(\theta^2-4\theta) - 100\theta\right\} \\
       & = & \frac{161}{\theta} - 5.4289\theta + 2(5.4289) - 100.
       \end{eqnarray*}
       So, the mode $\tilde{\theta}$ satisfies
       \begin{eqnarray*}
       5.4289\tilde{\theta}^2 + \{100 - 2(5.4289)\}\tilde{\theta} - 161 & = & 0 \\
       \Rightarrow\; 5.4289\tilde{\theta}^2 + 89.1422\tilde{\theta} - 161 & = & 0.
       \end{eqnarray*}
       Hence, as $\tilde{\theta} > 0$,
       \[
       \tilde{\theta} = \frac{-89.1422 + \sqrt{89.1422^2 + 4(5.4289)(161)}}{2(5.4289)} = 1.6419.
       \]
       The observed information is
       \[
       I(\theta \mid x) = -\frac{\partial^2}{\partial\theta^2}\log f(\theta \mid x) = \frac{161}{\theta^2} + 5.4289
       \]
       so that
       \[
       I(\tilde{\theta} \mid x) = \frac{161}{1.6419^2} + 5.4289 = 65.1506.
       \]
       So, approximately, $\theta \mid x \sim N(1.6419, 65.1506^{-1})$. Thus,
       \[
       P(\theta > 2 \mid x) = P\left\{Z > \sqrt{65.1506}\,(2-1.6419)\right\} = 0.0019.
       \]
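These calculations can be checked numerically in R; the following sketch is illustrative only and simply re-derives the mode, observed information and tail probability of the normal approximation.

    # Illustrative sketch: numerical check of the normal approximation in part (c)
    log.post <- function(theta) 161 * log(theta) - (5.4289 / 2) * (theta^2 - 4 * theta) - 100 * theta
    mode <- optimise(log.post, interval = c(0.01, 10), maximum = TRUE)$maximum   # about 1.6419
    obs.info <- 161 / mode^2 + 5.4289                                            # about 65.15
    pnorm(2, mean = mode, sd = sqrt(1 / obs.info), lower.tail = FALSE)           # about 0.0019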

   (d) Why might you prefer to express the doctor's prior beliefs as a normal
       distribution on some other parameterisation $\phi = g(\theta)$? Suggest an appropriate
       choice of g(·) in this case. Now express the doctor's beliefs
       using a normal prior for $\phi$; that for healthy patients $\phi$ will be on average
       around g(2) and it is "unlikely" that $\phi$ is greater than g(3). Give an
       expression for the density of $\phi \mid x$ up to a constant of proportionality.

       By definition $\theta > 0$ but both the specified normal prior distribution and the
       normal approximation for the posterior $\theta \mid x$ have a sample space of $(-\infty, \infty)$ so
       we might want to use some other parametrisation which has the same sample
       space as the normal distribution. An obvious choice is to use $\phi = g(\theta) = \log\theta$ as
       if $\theta > 0$ then $-\infty < \log\theta < \infty$. We assert $\phi \sim N(\mu_0, \sigma_0^2)$ with $E(\phi) = \log 2$ and
       $P(\phi > \log 3) = 0.01$. So, $E(\phi) = \log 2 \Rightarrow \mu_0 = \log 2$ and
       \[
       P(\phi > \log 3) = 0.01 \;\Rightarrow\; P\left(\frac{\phi-\log 2}{\sigma_0} > \frac{\log 3 - \log 2}{\sigma_0}\right) = 0.01.
       \]
       As $\frac{\phi-\log 2}{\sigma_0} \sim N(0, 1)$ we have that
       \[
       \frac{\log 3 - \log 2}{\sigma_0} = 2.33 \;\Rightarrow\; \sigma_0^2 = \left(\frac{\log\frac{3}{2}}{2.33}\right)^2 = 0.0303.
       \]
       As $\phi = \log\theta$ then $\theta = e^{\phi}$. The likelihood is thus, using (b),
       \[
       f(x \mid \phi) \propto (e^{\phi})^{n\bar{x}}e^{-ne^{\phi}} = e^{161\phi}e^{-100e^{\phi}}
       \]
       for the given data. The posterior is thus
       \begin{eqnarray*}
       f(\phi \mid x) & \propto & e^{161\phi}e^{-100e^{\phi}} \times \exp\left\{-\frac{1}{2(0.0303)}(\phi^2 - (\log 4)\phi)\right\} \\
       & = & \exp\left\{-100e^{\phi} - 16.5017\phi^2 + 183.8761\phi\right\}.
       \end{eqnarray*}


4. Let $X_1, \ldots, X_{10}$ be the length of time between arrivals at an ATM machine,
   and assume that the $X_i$ may be viewed as exchangeable with $X_i \mid \lambda \sim Exp(\lambda)$
   where $\lambda$ is the rate at which people arrive at the machine in one-minute
   intervals. Suppose we observe $\sum_{i=1}^{10}x_i = 4$. Suppose that the prior distribution
   for $\lambda$ is given by
   \[
   f(\lambda) = \left\{\begin{array}{cl}c\exp\{-20(\lambda-0.25)^2\} & \lambda \geq 0, \\ 0 & \mbox{otherwise}\end{array}\right.
   \]
   where c is a known constant.
   (a) Find, up to a constant k of proportionality, the posterior distribution
       $\lambda \mid x$ where $x = (x_1, \ldots, x_{10})$. Find also an expression for k which you
       need not evaluate.

       The likelihood is
       \[
       f(x \mid \lambda) = \prod_{i=1}^{10}\lambda e^{-\lambda x_i} = \lambda^{10}e^{-4\lambda}
       \]
       with the given data. The posterior is thus
       \[
       f(\lambda \mid x) \propto \lambda^{10}e^{-4\lambda} \times \exp\{-20(\lambda-0.25)^2\} = \lambda^{10}\exp\{-20(\lambda^2-0.5\lambda)-4\lambda\}
       \]
       so that, making the fact that $\lambda > 0$ explicit,
       \[
       f(\lambda \mid x) = \left\{\begin{array}{cl}k\lambda^{10}\exp\{-20\lambda^2+6\lambda\} & \lambda > 0 \\ 0 & \mbox{otherwise,}\end{array}\right.
       \]
       where
       \[
       k^{-1} = \int_{0}^{\infty}\lambda^{10}\exp\{-20\lambda^2+6\lambda\}\,d\lambda.
       \]

   (b) Find a normal approximation to this posterior distribution about the
       mode.

       We find the mode of $\lambda \mid x$ by maximising $f(\lambda \mid x)$ or, equivalently, $\log f(\lambda \mid x)$. We
       have
       \begin{eqnarray*}
       \frac{\partial}{\partial\lambda}\log f(\lambda \mid x) & = & \frac{\partial}{\partial\lambda}\left(10\log\lambda - 20\lambda^2 + 6\lambda\right) \\
       & = & \frac{10}{\lambda} - 40\lambda + 6.
       \end{eqnarray*}
       So, the mode $\tilde{\lambda}$ satisfies
       \[
       40\tilde{\lambda}^2 - 6\tilde{\lambda} - 10 = 0 \;\Rightarrow\; 20\tilde{\lambda}^2 - 3\tilde{\lambda} - 5 = 0.
       \]
       Hence, as $\tilde{\lambda} > 0$,
       \[
       \tilde{\lambda} = \frac{3 + \sqrt{9 + 4(20)(5)}}{2(20)} = 0.5806.
       \]
       The observed information is
       \[
       I(\lambda \mid x) = -\frac{\partial^2}{\partial\lambda^2}\log f(\lambda \mid x) = \frac{10}{\lambda^2} + 40
       \]
       so that
       \[
       I(\tilde{\lambda} \mid x) = \frac{10}{0.5806^2} + 40 = 69.6651.
       \]
       So, approximately, $\lambda \mid x \sim N(0.5806, 69.6651^{-1})$.


   (c) Let $Z_i$, i = 1, . . . , N be a sequence of independent and identically distributed
       standard Normal random quantities. Assuming the normalising
       constant k is known, explain carefully how the $Z_i$ may be used to
       obtain estimates of the mean of $\lambda \mid x$.

       We shall use importance sampling. If we wish to estimate some $E\{g(\lambda) \mid X\}$ with
       posterior density $f(\lambda \mid x)$ and can generate independent samples $\lambda_1, \ldots, \lambda_N$ from
       some $q(\lambda)$, an approximation of $f(\lambda \mid x)$, then
       \[
       \hat{I} = \frac{1}{N}\sum_{i=1}^{N}\frac{g(\lambda_i)f(\lambda_i \mid x)}{q(\lambda_i)}
       \]
       is an unbiased estimate of $E\{g(\lambda) \mid X\}$.

       As $Z_i \sim N(0, 1)$ then $\lambda_i = 69.6651^{-\frac{1}{2}}Z_i + 0.5806 \sim N(0.5806, 69.6651^{-1})$ so
       that we can generate an independent and identically distributed sample from the
       $N(0.5806, 69.6651^{-1})$ which is an approximation to the posterior of $\lambda$. Letting
       $g(\lambda) = \lambda$,
       \[
       f(\lambda \mid x) = \left\{\begin{array}{cl}k\lambda^{10}\exp\{-20\lambda^2+6\lambda\} & \lambda > 0 \\ 0 & \mbox{otherwise,}\end{array}\right.
       \qquad
       q(\lambda) = \frac{\sqrt{69.6651}}{\sqrt{2\pi}}\exp\left\{-\frac{69.6651}{2}(\lambda-0.5806)^2\right\},
       \]
       then
       \[
       \hat{I} = \frac{1}{N}\sum_{i=1}^{N}\frac{k\lambda_i^{11}\exp\{-20\lambda_i^2+6\lambda_i\}I_{\{\lambda_i>0\}}}{\frac{\sqrt{69.6651}}{\sqrt{2\pi}}\exp\left\{-\frac{69.6651}{2}(\lambda_i-0.5806)^2\right\}}
       \]
       is an unbiased estimate of the posterior mean of $\lambda$, with $I_{\{\lambda_i>0\}}$ denoting the
       indicator function for the event $\{\lambda_i > 0\}$.
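A possible R implementation of this estimator is sketched below; it is illustrative only, and here k is evaluated numerically (rather than assumed known) so that the estimate can be checked.

    # Illustrative sketch: importance sampling estimate of E(lambda | x)
    set.seed(1)
    g <- function(l) ifelse(l > 0, l^10 * exp(-20 * l^2 + 6 * l), 0)   # unnormalised posterior
    k <- 1 / integrate(g, 0, Inf)$value                                # normalising constant
    N <- 1e5
    lam <- rnorm(N, mean = 0.5806, sd = sqrt(1 / 69.6651))             # proposal q
    w <- k * g(lam) / dnorm(lam, mean = 0.5806, sd = sqrt(1 / 69.6651))
    mean(lam * w)                                                      # estimate of E(lambda | x)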


MA40189 2016-2017 Problem Sheet 7 - Solutions

Topics in Bayesian statistics (University of Bath)


MA40189 - Solution Sheet Seven


Simon Shaw, [email protected]
http://people.bath.ac.uk/masss/ma40189.html

2016/17 Semester II

1. Consider the integral
   \[
   I = \int_{0}^{10}\exp(-2|x-5|)\,dx.
   \]
   (a) Suppose that $X \sim U(0, 10)$ with probability density function
       \[
       f(x) = \left\{\begin{array}{cl}\frac{1}{10} & 0 \leq x \leq 10, \\ 0 & \mbox{otherwise.}\end{array}\right.
       \]
       Show that $I = E_f(10\exp(-2|X-5|))$, the expectation of $10\exp(-2|X-5|)$
       calculated with respect to the probability density function f(x). Hence,
       explain how to use the method of Monte Carlo integration to estimate I.

       Note that
       \begin{eqnarray*}
       I & = & \int_{0}^{10}\exp(-2|x-5|)\,dx \;=\; \int_{0}^{10}10\exp(-2|x-5|)\frac{1}{10}\,dx \\
       & = & E_f(10\exp(-2|X-5|)).
       \end{eqnarray*}
       We can then estimate I using Monte Carlo integration as follows:
       (i) sample n values $x_1, \ldots, x_n$ independently from U(0, 10)
       (ii) compute $10\exp(-2|x_i-5|)$ for each $x_i$
       (iii) estimate I by $\hat{I}$, the sample mean of the $10\exp(-2|x_i-5|)$s, so that
       \[
       \hat{I} = \frac{1}{n}\sum_{i=1}^{n}10\exp(-2|x_i-5|). \qquad (1)
       \]

   (b) Explain how to estimate I using the method of importance sampling
       where we sample from the N(5, 1) distribution.

       For importance sampling, we want to write I as an expectation calculated with
       respect to the N(5, 1) distribution which has support on $(-\infty, \infty)$. If we let
       $I_{\{0\leq x\leq 10\}}$ be the indicator function of $0 \leq x \leq 10$ then
       \begin{eqnarray*}
       I & = & \int_{0}^{10}\exp(-2|x-5|)\,dx \;=\; \int_{-\infty}^{\infty}\exp(-2|x-5|)I_{\{0\leq x\leq 10\}}\,dx \\
       & = & \int_{-\infty}^{\infty}\frac{10\exp(-2|x-5|)\frac{1}{10}I_{\{0\leq x\leq 10\}}}{\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}(x-5)^2\right\}}\,\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}(x-5)^2\right\}dx \\
       & = & \int_{-\infty}^{\infty}\frac{g(x)f(x)}{q(x)}q(x)\,dx \;=\; E_q\left(\frac{g(X)f(X)}{q(X)}\right)
       \end{eqnarray*}
       where $g(x) = 10\exp(-2|x-5|)$, $f(x) = \frac{1}{10}I_{\{0\leq x\leq 10\}}$ (the density of the U(0, 10)
       distribution) and $q(x) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}(x-5)^2\right\}$ (the density of the N(5, 1)
       distribution). We can then estimate I using importance sampling as follows:
       (i) sample n values $x_1, \ldots, x_n$ independently from N(5, 1)
       (ii) compute
       \[
       \frac{g(x_i)f(x_i)}{q(x_i)} = \frac{10\exp(-2|x_i-5|)\frac{1}{10}I_{\{0\leq x_i\leq 10\}}}{\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}(x_i-5)^2\right\}}
       \]
       for each $x_i$
       (iii) estimate I by $\tilde{I}$, the sample mean of the $\frac{g(x_i)f(x_i)}{q(x_i)}$s, so that
       \[
       \tilde{I} = \frac{1}{n}\sum_{i=1}^{n}\frac{10\exp(-2|x_i-5|)\frac{1}{10}I_{\{0\leq x_i\leq 10\}}}{\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}(x_i-5)^2\right\}}. \qquad (2)
       \]

   (c) Use R to explore the methods used in parts (a) and (b). Which method
       do you prefer?

       We can write a function monte0 in R to estimate I by $\hat{I}$, as given by equation (1).

       # Calculates m replications of the Monte Carlo
       # integration in part (a) with n = n1
       monte0 <- function(n1, m){
         vect <- c(1:m)
         for (i in 1:m)
         {
           x <- runif(n1, 0, 10)
           vect[i] <- sum(10*exp(-2*abs(x-5)))/n1
         }
         return(vect)
       }

       Similarly, the function import0 estimates I by $\tilde{I}$, as given by equation (2). Notice
       that dunif(x, 0, 10) = $\frac{1}{10}I_{\{0\leq x\leq 10\}}$.

       # Calculates m replications of the importance
       # sampling in part (b) with n = n1
       import0 <- function(n1, m){
         vect <- c(1:m)
         for (i in 1:m)
         {
           x <- rnorm(n1, 5, 1)
           weight <- dunif(x, 0, 10)/dnorm(x, 5, 1)
           vect[i] <- sum(10*exp(-2*abs(x-5))*weight)/n1
         }
         return(vect)
       }

       We explore the two methods for n = 100, 500, 1000 and 5000, in each case taking
       1000 replications. The results are shown in the following table; note that the true
       value of I is 1 − exp(−10) = 0.9999546.

       Method   n      Mean of 1000 samples   Variance of 1000 samples
       Monte    100    1.013746               0.04300273
       Import   100    0.9997384              0.003624815
       Monte    500    0.999168               0.008210746
       Import   500    1.00074                0.0007107777
       Monte    1000   1.00279                0.004010403
       Import   1000   1.000122               0.0003694998
       Monte    5000   1.000884               0.000773445
       Import   5000   0.9997536              0.00007040271

       We recall from lectures that both methods give unbiased estimates of I, so we
       compare their performance by considering the respective variances. From the
       table, we observe that the sample variance for the Monte Carlo integration is
       about ten times that of the importance sampling method.
       • for Monte Carlo integration, as $Var_f(\hat{I}) = \frac{1}{n}Var_f(g(X))$, where $g(x) =
         10\exp(-2|x-5|)$, then $Var_f(g(X)) \approx 4$
       • for importance sampling we have that $Var_q(\tilde{I}) = \frac{1}{n}Var_q\left(\frac{g(X)f(X)}{q(X)}\right)$ so that
         $Var_q\left(\frac{g(X)f(X)}{q(X)}\right) \approx 0.4$
       We thus see that the importance sampling method is preferable to the Monte
       Carlo integration. Figure 1 demonstrates why this is the case. The function
       $g(x) = 10\exp(-2|x-5|)$ has a peak at x = 5 and decays quickly elsewhere in the
       interval [0, 10], so that when we sample from U(0, 10) many of the sampled values
       will be where g(x) is very close to zero. We are likely to increase the precision
       of our estimate if we sample from a distribution which attaches more weight, or
       importance, to the region where g(x) is not close to zero. As Figure 1 shows,
       the N(5, 1) does this, leading to the reduced variance in the importance sampling
       scheme.
2. Suppose that $f(\theta \mid x) = cg(\theta)$ where $g(\theta) \propto f(x \mid \theta)f(\theta)$ and c is the normalising
   constant. For a function $h(\theta)$ we wish to find $E(h(\theta) \mid X) = E_f(h(\theta))$ where
   \[
   E_f(h(\theta)) = \int_{-\infty}^{\infty}h(\theta)f(\theta \mid x)\,d\theta.
   \]
   Suppose that we sample N values $\theta_1, \ldots, \theta_N$ independently from a distribution
   with probability density function $q(\theta)$ (you may assume that the
   support of $q(\theta)$ is the same as $f(\theta \mid x)$).
   (a) Show that $E_f(h(\theta)) = cE_q\left(\frac{h(\theta)g(\theta)}{q(\theta)}\right)$, where $E_q(\cdot)$ denotes expectation
       calculated with respect to the probability density function $q(\theta)$. If c is
       known, explain how to use $\theta_1, \ldots, \theta_N$ to estimate $E_f(h(\theta))$.


[Figure 1: A plot of g(x)/10 = exp(−2|x − 5|) together with the probability density function of
the U(0, 10) used for the Monte Carlo integration in part (a) and the probability density function
of the N(5, 1) used for the importance sampling in part (b).]

Note that, as $f(\theta \mid x) = cg(\theta)$, then
\begin{eqnarray*}
E_f(h(\theta)) & = & c\int_{-\infty}^{\infty}h(\theta)g(\theta)\,d\theta \\
& = & c\int_{-\infty}^{\infty}\frac{h(\theta)g(\theta)}{q(\theta)}q(\theta)\,d\theta \;=\; cE_q\left(\frac{h(\theta)g(\theta)}{q(\theta)}\right).
\end{eqnarray*}
We compute $\frac{h(\theta_i)g(\theta_i)}{q(\theta_i)}$ for each i = 1, . . . , N and estimate $E_q\left(\frac{h(\theta)g(\theta)}{q(\theta)}\right)$ by
\[
\hat{I} = \frac{1}{N}\sum_{i=1}^{N}\frac{h(\theta_i)g(\theta_i)}{q(\theta_i)} \qquad (3)
\]
and, since c is known, estimate $E_f(h(\theta))$ by $c\hat{I}$.
(b) Suppose that c is unknown. By noting that

c^{−1} = ∫_{−∞}^{∞} g(θ) dθ,

explain how to use θ1, . . . , θN to estimate c^{−1}.

We have that

1 = ∫_{−∞}^{∞} f(θ | x) dθ = ∫_{−∞}^{∞} cg(θ) dθ

so that

c^{−1} = ∫_{−∞}^{∞} g(θ) dθ = ∫_{−∞}^{∞} (g(θ)/q(θ)) q(θ) dθ = Eq(g(θ)/q(θ)).

Thus, we compute g(θi)/q(θi) for each i = 1, . . . , N and estimate c^{−1} = Eq(g(θ)/q(θ)) by

Ĩ = (1/N) Σ_{i=1}^{N} g(θi)/q(θi).                                        (4)

(c) In the case when c is unknown, suggest a sensible way of using θ1, . . . , θN
to estimate Ef(h(θ)).

By the Law of Large Numbers, Î, as given by equation (3), almost surely converges
to Eq(h(θ)g(θ)/q(θ)) whilst Ĩ, as given by equation (4), almost surely converges to c^{−1},
so that a sensible estimate of Ef(h(θ)) is

Ǐ = Î/Ĩ = Σ_{i=1}^{N} h(θi)wi / Σ_{i=1}^{N} wi

where wi = g(θi)/q(θi). Estimating Ef(h(θ)) by Ǐ is often called self-normalised
importance sampling and provides a method of estimating posterior variances
without needing to know the normalising constant c. It can be shown that (provided
that q(θ) > 0 whenever g(θ) > 0) Ǐ converges to Ef(h(θ)).
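As an illustration (not part of the original solution), a minimal R sketch of Ǐ under toy
assumptions: take g(θ) = exp(−θ²/2), so that f(θ | x) is the N(0, 1) density with c treated
as unknown, h(θ) = θ² and q(θ) the N(0, 2) density; the true value of Ef(h(θ)) is then 1.
# Self-normalised importance sampling sketch; toy choices of g, h and q as stated above.
N <- 10000
theta <- rnorm(N, 0, sqrt(2))                    # sample from q, the N(0, 2)
w <- exp(-theta^2/2)/dnorm(theta, 0, sqrt(2))    # w_i = g(theta_i)/q(theta_i)
sum(theta^2*w)/sum(w)                            # estimate of E_f(h(theta)), approximately 1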
3. Let X ∼ N (0, 1).
(a) Explain how to use Monte Carlo integration to find P (X > a) for a ∈ R.

Note that

P(X > a) = ∫_a^{∞} f(x) dx = ∫_{−∞}^{∞} I{x>a} f(x) dx = Ef(I{X>a})

where I{X>a} is the indicator function of the event {X > a} and Ef (·) denotes
expectation with respect to the distribution f . We thus may use Monte Carlo
integration to estimate P (X > a) as follows:
(i) sample n values x1 , . . . , xn independently from f (x), the N (0, 1)
(ii) compute I{xi >a} for each xi , so I{xi >a} = 1 if xi > a and is 0 otherwise
(iii) estimate P(X > a) = Ef(I{X>a}) by the sample mean of the I{xi>a},
(1/n) Σ_{i=1}^{n} I{xi>a}.
(b) What are the difficulties with the method for large values of a?

We can write a function monte in R to estimate P (X > a).


# Calculates m replications of the Monte Carlo
# integration in part (a) with a = y and n = n1
monte <- function(y,n1,m){
vect<-c(1:m)
for (i in 1:m)
{
x <- rnorm(n1)
total <- length(x[x > y])
vect[i] <- total/n1
}
return(vect)
}
We consider sampling n = 1000 observations to estimate P (X > a) for a =
0, 1.96, 3 and 4.5. We’ll repeat each experiment 1000 times to explore the sam-
pling properties of the estimate and also calculate the number of samples which
contained observations greater than a. For example, for a = 4.5 we compute
> sample<-monte(4.5,1000,1000)
> mean(sample)
[1] 5e-06
> var(sample)
[1] 4.97998e-09
> length(sample[sample > 0])
[1] 5
We summarise our results.
a      P(X > a)           Mean of 1000 samples   Variance of 1000 samples   No of non-zero samples
0      0.5                0.499914               0.000245                   1000
1.96   0.0249979          0.025027               2.437865 × 10^−5           1000
3      0.001349898        0.001395               1.394369 × 10^−6           754
4.5    3.397673 × 10^−6   5 × 10^−6              4.97998 × 10^−9            5

As a increases then P(X > a) decreases and we are interested in estimating the
probability of an increasingly rare event. If we let t = Σ_{i=1}^{n} I{xi>a} then t denotes
the number of the n observations that were greater than a and our estimate of
P(X > a) is t/n. For large a, we will need to take n to be extremely large to
even observe a non-zero t let alone obtain a stable estimate. For example, in our
1000 samples with n = 1000 and a = 4.5, only five of the samples contained an
observation greater than 4.5. An alternate approach to estimating P (X > a)
which doesn’t require n to be extremely large may be to use importance sampling
where we sample from a distribution that puts more mass on the event {X > a}.
Two possible approaches are considered in parts (c) and (d).
(c) Suppose instead we consider finding P (X > a) using the method of
importance sampling where we sample from N (µ, 1). By considering
the cases when a = 3 and µ = 4 and that when a = 4.5 and µ = 4.5
comment on the advantages of this approach over direct Monte Carlo
integration.

Let q(x) denote the probability density function of the N (µ, 1). Then, as q(x) > 0


for all x ∈ (−∞, ∞), we have

Ef(I{X>a}) = ∫_{−∞}^{∞} I{x>a} f(x) dx = ∫_{−∞}^{∞} (I{x>a} f(x)/q(x)) q(x) dx = Eq(I{X>a} f(X)/q(X))

where

f(X)/q(X) = [(1/√(2π)) exp(−X²/2)] / [(1/√(2π)) exp(−(X − µ)²/2)] = exp(µ²/2 − µX).

Thus,

Ef(I{X>a}) = Eq(I{X>a} exp(µ²/2 − µX)).
We thus use importance sampling to estimate P(X > a) = Ef(I{X>a}) as follows:
(i) sample n values x1, . . . , xn independently from q(x), the N(µ, 1)
(ii) compute I{xi>a} exp(µ²/2 − µxi) for each xi
(iii) estimate P(X > a) = Ef(I{X>a}) by

Î = (1/n) Σ_{i=1}^{n} I{xi>a} exp(µ²/2 − µxi).

We write a function import in R to implement this algorithm.


# Calculates m replications of the importance sampling
# in part (c) with a = y1, mu = y2 and n = n1
import <- function(y1,y2,n1,m)
{
vect <- c(1:m)
for (i in 1:m)
{
x <- rnorm(n1,y2)
weight <- dnorm(x)/dnorm(x, y2)
vect[i] <- sum(weight[x > y1])/n1
}
return(vect)
}
Suppose we consider a = 3 and µ = 4 so that P (X > 3) = 0.001349898. We
compare the importance sampling method with direct Monte Carlo integration
for various choices of n.
Method     n     Mean of 1000 samples   Variance of 1000 samples   No of non-zero samples
Monte     100    0.00119                1.16956 × 10^−5            113
Import    100    0.001349189            9.391652 × 10^−8           1000
Monte     500    0.001232               2.308484 × 10^−6           472
Import    500    0.001349657            1.949415 × 10^−8           1000
Monte    1000    0.001352               1.373469 × 10^−6           733
Import   1000    0.001348711            9.348038 × 10^−9           1000


We observe that, even for n = 100 observations, the method of importance sam-
pling is doing well and that the variance of the estimate is much reduced when
compared to direct Monte Carlo integration. We repeat the process with a = 4.5
and µ = 4.5 so that P (X > 4.5) = 3.397673 × 10−6 .

Method     n     Mean of 1000 samples   Variance of 1000 samples   No of non-zero samples
Monte     100                                                      0
Import    100    3.42554 × 10^−6        6.096552 × 10^−13          1000
Monte     500                                                      0
Import    500    3.395555 × 10^−6       1.202796 × 10^−13          1000
Monte    1000    2 × 10^−6              1.997998 × 10^−9           2
Import   1000    3.397125 × 10^−6       5.644111 × 10^−14          1000

Notice that even for sizes of n when Monte Carlo fails (as we haven’t observed any
non-zero values), the importance sampling method is working well. When we can
use Monte Carlo, for the same n, the variance of the estimate using importance
sampling is vastly reduced.
(d) Explain how to find P (X > 4.5) using the method of importance sam-
pling where we sample from the Exp(1) distribution truncated at 4.5
which has probability density function

q(x) = exp{−(x − 4.5)} if x > 4.5, and q(x) = 0 otherwise.

In this part, the idea is to remove the need for the indicator function when estimating
P(X > a) by sampling from a distribution, q(x), whose support is (a, ∞) so that

P(X > a) = ∫_a^{∞} f(x) dx = ∫_a^{∞} (f(x)/q(x)) q(x) dx = Eq(f(X)/q(X)).
For a = 4.5 we have that, for X > 4.5,

f(X)/q(X) = [(1/√(2π)) exp(−X²/2)] / exp{−(X − 4.5)} = (1/√(2π)) exp(−X²/2 + X − 4.5)

so that our importance sampling method is as follows:
(i) sample n values x1, . . . , xn independently from q(x), the Exp(1) distribution truncated at 4.5
(ii) compute (1/√(2π)) exp(−xi²/2 + xi − 4.5) for each xi
(iii) estimate P(X > a) = Ef(I{X>a}) by

Î = (1/n) Σ_{i=1}^{n} (1/√(2π)) exp(−xi²/2 + xi − 4.5).

We write a function import1 in R to implement this algorithm.¹ To demonstrate
the convergence of the estimate, in Figure 2 we plot the cumulative sum of Î for
each n.

¹ This example is inspired by Example 3.5 of Robert, C. P. and Casella, G. (2010). Introducing Monte
Carlo Methods with R, Springer, New York.


Figure 2: Convergence, as n increases, of the importance sampling estimate Î for P(X > 4.5)
obtained by sampling from the Exp(1) distribution truncated at 4.5.

# Calculates the importance sampling in part (d) for
# P(X > y) and q(x) the Exp(1) truncated at y
# Plots the cumulative estimate of P(X > y) with n = n1
import1 <- function(y,n1){
x <- rexp(n1)+y
weight <- dnorm(x)/dexp(x-y)
plot(cumsum(weight)/1:n1,type="l",ylab="Cumulative sum of estimate",xlab="n")
abline(a=pnorm(-y),b=0,col="red")
}
Figure 2 shows that the estimate is converging to the true value and that
this happens quickly, certainly when compared with the direct Monte Carlo
integration approach of part (a) whose limitations were noted in part (b).
4. Let Ef (g(X)) denote the expectation of g(X) with respect to the probability
density function f (x) so that
Ef(g(X)) = ∫_X g(x)f(x) dx.

Suppose that we sample N independent observations from a distribution
q(x) and use the method of importance sampling to estimate Ef(g(X)). Our
estimator is thus
Î = (1/N) Σ_{i=1}^{N} g(Xi)f(Xi)/q(Xi).


Prove that the choice of q(x) which minimises the variance of Î (with respect
to q(x)) is

q∗(x) = |g(x)|f(x) / ∫_X |g(z)|f(z) dz.

Let Varq(·) denote variance calculated with respect to q(x). Then, as X1, . . . , XN are
independent observations from q(x),

Varq(Î) = (1/N²) Σ_{i=1}^{N} Varq(g(Xi)f(Xi)/q(Xi)) = (1/N) Varq(g(X)f(X)/q(X)).

Hence, the choice of q(x) which minimises the variance of Î is equivalent to that which
minimises

Varq(g(X)f(X)/q(X)) = Eq(g²(X)f²(X)/q²(X)) − Eq²(g(X)f(X)/q(X))
                    = Eq(g²(X)f²(X)/q²(X)) − Ef²(g(X)).

Thus, as Ef²(g(X)) does not depend upon the choice of q(x), then we need only consider
finding a q(x) minimising Eq(g²(X)f²(X)/q²(X)). Now, for any q(x), as Varq(|g(X)|f(X)/q(X)) ≥ 0,

Eq(g²(X)f²(X)/q²(X)) ≥ Eq²(|g(X)|f(X)/q(X)) = (∫_X |g(x)|f(x) dx)².          (5)

Notice that the lower bound does not depend upon the choice of q(x). Thus, if there
exists a q(x) which achieves this minimum bound then it must minimise the variance
of Î.

If q∗(x) = |g(x)|f(x) / ∫_X |g(z)|f(z) dz then

|g(X)|f(X)/q∗(X) = ∫_X |g(z)|f(z) dz

so that, as ∫_X |g(z)|f(z) dz is a constant,

Eq(g²(X)f²(X)/(q∗(X))²) = Eq((∫_X |g(z)|f(z) dz)²) = (∫_X |g(z)|f(z) dz)²

and the lower bound in (5) is achieved showing that q∗(x) minimises the variance of
Î.²

² Notice that this is really a theoretical rather than a practical result. For example, if g(x) > 0 then
q∗(x) = (g(x)f(x))/Ef(g(X)) and so requires Ef(g(X)), which is the integral we're trying to evaluate.



MA40189 - Solution Sheet Eight


Simon Shaw, [email protected]
http://people.bath.ac.uk/masss/ma40189.html

2016/17 Semester II

1. Let X1 , . . . , Xn be exchangeable so that the Xi are conditionally independent


given a parameter θ. Suppose that Xi | θ ∼ Inv-gamma(α, θ), where α is
known, and we judge that θ ∼ Gamma(α0 , β0 ), where α0 and β0 are known.
(a) Show that θ | x ∼ Gamma(αn, βn) where αn = α0 + nα, βn = β0 + Σ_{i=1}^{n} 1/xi,
and x = (x1, . . . , xn).

As Xi | θ ∼ Inv-gamma(α, θ) then

f(x | θ) = Π_{i=1}^{n} (θ^α/Γ(α)) xi^{−(α+1)} exp(−θ/xi) ∝ θ^{nα} exp(−θ Σ_{i=1}^{n} 1/xi).

Similarly, as θ ∼ Gamma(α0, β0),

f(θ) ∝ θ^{α0−1} exp(−θβ0).

Hence,

f(θ | x) ∝ f(x | θ)f(θ) ∝ θ^{(α0+nα)−1} exp{−θ(β0 + Σ_{i=1}^{n} 1/xi)}

which we recognise as a kernel of a Gamma(α0 + nα, β0 + Σ_{i=1}^{n} 1/xi). Thus, θ | x ∼
Gamma(αn, βn) as required.
(b) We wish to use the Metropolis-Hastings algorithm to sample from the
posterior distribution θ | x using a normal distribution with mean θ and
chosen variance σ 2 as the symmetric proposal distribution.
i. Suppose that, at time t, the proposed value θ∗ ≤ 0. Briefly explain
why the corresponding acceptance probability is zero for such a θ∗
and thus that the sequence of values generated by the algorithm
are never less than zero.

As the proposal distribution is symmetric then the Metropolis-Hastings algorithm
reduces to the Metropolis algorithm with acceptance probability

α(θ(t−1), θ∗) = min{1, f(θ∗ | x)/f(θ(t−1) | x)}

where θ | x ∼ Gamma(αn, βn). Now, assuming θ(t−1) > 0, if θ∗ ≤ 0 then
f(θ∗ | x) = 0 and α(θ(t−1), θ∗) = 0 so the move to θ∗ ≤ 0 is rejected and
θ(t) = θ(t−1). So, provided θ(0) > 0 then the sequence of values generated by
the algorithm are never less than zero.
ii. Describe how the Metropolis-Hastings algorithm works for this ex-
ample, giving the acceptance probability in its simplest form.

The algorithm is performed as follows.
A. Choose a starting point θ(0) for which f(θ(0) | x) > 0. This will hold for
any θ(0) > 0 as θ | x ∼ Gamma(αn, βn).
B. At time t
• Sample θ∗ ∼ N(θ(t−1), σ²).
• Calculate the acceptance probability

α(θ(t−1), θ∗) = min{1, f(θ∗ | x)/f(θ(t−1) | x)}
             = min{1, (θ∗)^{αn−1} exp(−βnθ∗) / ((θ(t−1))^{αn−1} exp(−βnθ(t−1)))} if θ∗ > 0, and 0 if θ∗ ≤ 0
             = min{1, (θ∗/θ(t−1))^{αn−1} exp{−βn(θ∗ − θ(t−1))}} if θ∗ > 0, and 0 if θ∗ ≤ 0

where αn = α0 + nα, βn = β0 + Σ_{i=1}^{n} 1/xi.
• Generate U ∼ U (0, 1).
• If U ≤ α(θ(t−1) , θ∗ ) accept the move, θ(t) = θ∗ . Otherwise reject the
move, θ(t) = θ(t−1) .
C. Repeat step B.
The algorithm will produce a Markov Chain with stationary distribution
f (θ | x). After a sufficiently long time, b say, to allow for convergence, the
values {θ(t) } for t > b may be considered as a sample from f (θ | x), that is
a sample from Gamma(αn , βn ), where the samples {θ(t) } for t ≤ b are the
“burn-in”.
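A minimal R sketch of this Metropolis algorithm is given below, assuming the illustrative
values αn = 12, βn = 3, proposal standard deviation σ = 1 and starting value θ(0) = 1
(none of which are specified in the question).
# Metropolis sampler for theta | x ~ Gamma(alpha.n, beta.n); illustrative values assumed.
metrop.gamma <- function(N, alpha.n, beta.n, sigma, theta0){
vect <- numeric(N)
vect[1] <- theta0
for (t in 2:N)
{
prop <- rnorm(1, vect[t-1], sigma)
if (prop > 0){
log.alpha <- (alpha.n - 1)*(log(prop) - log(vect[t-1])) - beta.n*(prop - vect[t-1])
accept <- (log(runif(1)) <= log.alpha)
} else accept <- FALSE          # acceptance probability is zero for prop <= 0
vect[t] <- if (accept) prop else vect[t-1]
}
return(vect)
}
theta <- metrop.gamma(10000, 12, 3, 1, 1)
mean(theta[-(1:1000)])            # after burn-in, compare with alpha.n/beta.n = 4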
2. Suppose that X | θ ∼ N (θ, σ 2 ) and Y | θ, δ ∼ N (θ − δ, σ 2 ), where σ 2 is a known
constant and X and Y are conditionally independent given θ and δ. It is
judged that the improper noninformative joint prior distribution f (θ, δ) ∝ 1
is appropriate.
(a) Show that the joint posterior distribution of θ and δ given x and y is
bivariate normal with mean vector µ and variance matrix Σ where

µ = (E(θ | X, Y), E(δ | X, Y))^T = (x, x − y)^T;   Σ = ( σ²  σ² ; σ²  2σ² ).

As X | θ ∼ N(θ, σ²) then

f(x | θ) ∝ exp{−(1/(2σ²))(θ² − 2xθ)}.

As Y | θ, δ ∼ N(θ − δ, σ²) then

f(y | θ, δ) ∝ exp{−(1/(2σ²))[(θ − δ)² − 2y(θ − δ)]}.

Now, by conditional independence,

f(x, y | θ, δ) = f(x | θ)f(y | θ, δ) ∝ exp{−(1/(2σ²))[θ² − 2xθ + (θ − δ)² − 2y(θ − δ)]}.

So,

f(θ, δ | x, y) ∝ f(x, y | θ, δ)f(θ, δ)
             ∝ f(x, y | θ, δ)
             ∝ exp{−(1/(2σ²))[θ² − 2xθ + (θ − δ)² − 2y(θ − δ)]}
             ∝ exp{−(1/(2σ²))[2θ² − 2(x + y)θ − 2θδ + δ² + 2yδ]}.
The exponential argument is quadratic in θ, δ so θ, δ | x, y is bivariate normal. We
can find the exact normal by completing the square or, in this case, using the
information given in the question. We have

Σ^{−1} = (1/σ²)( 2  −1 ; −1  1 )

so that the density of the N2(µ, Σ) is

exp{−(1/2)(θ − x, δ − (x − y)) (1/σ²)( 2  −1 ; −1  1 ) (θ − x, δ − (x − y))^T}
  ∝ exp{−(1/(2σ²))[2θ² − 2(2x − (x − y))θ − 2θδ + δ² − 2((x − y) − x)δ]}
  = exp{−(1/(2σ²))[2θ² − 2(x + y)θ − 2θδ + δ² + 2yδ]}
  ∝ f(θ, δ | x, y).
(b) Describe how the Gibbs sampler may be used to sample from the poste-
rior distribution θ, δ | x, y, deriving all required conditional distributions.

The Gibbs sampler requires the conditional distributions θ | δ, x, y and δ | θ, x, y.
Now, with proportionality with respect to θ,

f(θ | δ, x, y) ∝ f(θ, δ | x, y)
             ∝ exp{−(1/(2σ²))[2θ² − 2(x + y + δ)θ]}
             = exp{−(1/2)(2/σ²)[θ² − 2((x + y + δ)/2)θ]}

which is a kernel of a normal density. Thus, θ | δ, x, y ∼ N((x + y + δ)/2, σ²/2). Now, with
proportionality with respect to δ,

f(δ | θ, x, y) ∝ f(θ, δ | x, y)
             ∝ exp{−(1/(2σ²))[δ² − 2(θ − y)δ]}

which is a kernel of a normal density. Thus, δ | θ, x, y ∼ N(θ − y, σ²). The Gibbs
sampler algorithm is
i. Choose a starting value (θ(0), δ(0)) for which f(θ(0), δ(0) | x, y) > 0.
ii. At iteration t generate the new values (θ(t), δ(t)) as follows:
• draw θ(t) from N((x + y + δ(t−1))/2, σ²/2), the distribution of θ(t) | δ(t−1), x, y.
• draw δ(t) from N(θ(t) − y, σ²), the distribution of δ(t) | θ(t), x, y.
iii. Repeat step ii.
The algorithm produces a Markov Chain with stationary distribution f(θ, δ | x, y).
After a sufficiently long time to allow for convergence, the values {θ(t), δ(t)} for t > b
may be considered as a sample from f(θ, δ | x, y) where the samples {θ(t), δ(t)}
for t ≤ b are the “burn-in”.
(c) Suppose that x = 2, y = 1 and σ 2 = 1. Sketch the contours of the joint
posterior distribution. Starting from the origin, add to your sketch the
first four steps of a typical Gibbs sampler path.

For x = 2, y = 1 and σ² = 1 we shall use the Gibbs sampler to sample from
N2(µ, Σ) where µ = (2, 1)^T and Σ = ( 1  1 ; 1  2 ). We draw θ(t) from N((3 + δ(t−1))/2, 1/2)
and δ(t) from N(θ(t) − 1, 1). Suppose that we start from (1, 1) and sample θ(1) =
2.216715, δ(1) = 1.475738, θ(2) = 2.086383 and δ(2) = 1.617760. The path of the
trajectory of these points is shown in Figure 1.
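A minimal R sketch of this Gibbs sampler, using the values x = 2, y = 1 and σ² = 1
above (the number of iterations and the starting value are arbitrary illustrative choices):
# Gibbs sampler for theta, delta | x, y with x = 2, y = 1, sigma^2 = 1.
x <- 2; y <- 1; sigma2 <- 1
N <- 5000
theta <- delta <- numeric(N)
theta[1] <- 0; delta[1] <- 0                                    # start from the origin
for (t in 2:N)
{
theta[t] <- rnorm(1, (x + y + delta[t-1])/2, sqrt(sigma2/2))    # theta | delta, x, y
delta[t] <- rnorm(1, theta[t] - y, sqrt(sigma2))                # delta | theta, x, y
}
colMeans(cbind(theta, delta)[-(1:500), ])                       # compare with mu = (2, 1)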
(d) Suppose, instead, that we consider sampling from the posterior distri-
bution using the Metropolis-Hastings algorithm where the proposal dis-
tribution is the bivariate normal with mean vector µ̃(t−1) = (θ(t−1) , δ (t−1) )T
and known variance matrix Σ̃. Explain the Metropolis-Hastings algo-
rithm for this case, explicitly stating the acceptance ratio.

For the proposal we have

q(θ∗, δ∗ | θ(t−1), δ(t−1)) ∝ exp{−(1/2)(θ∗ − θ(t−1), δ∗ − δ(t−1)) Σ̃^{−1} (θ∗ − θ(t−1), δ∗ − δ(t−1))^T}

so that q(θ∗, δ∗ | θ(t−1), δ(t−1)) = q(θ(t−1), δ(t−1) | θ∗, δ∗). The proposal distribution
is symmetric and so the Metropolis-Hastings algorithm reduces to the
Metropolis algorithm. The algorithm is performed as follows.
i. Choose a starting point (θ(0) , δ (0) ) for which f (θ(0) , δ (0) | x, y) > 0.
ii. At time t
• Sample (θ∗ , δ ∗ ) ∼ N2 (µ̃(t−1) , Σ̃).


Figure 1: A trajectory path of the Gibbs sampler sampling from N2(µ, Σ).

• Calculate the acceptance probability

α((θ(t−1), δ(t−1)), (θ∗, δ∗)) = min{1, f(θ∗, δ∗ | x, y)/f(θ(t−1), δ(t−1) | x, y)}
  = min{1, exp{−(1/2)(γ∗ − µ)^T Σ^{−1}(γ∗ − µ)} / exp{−(1/2)(γ(t−1) − µ)^T Σ^{−1}(γ(t−1) − µ)}}

where γ∗ = (θ∗, δ∗)^T and γ(t−1) = (θ(t−1), δ(t−1))^T.
• Generate U ∼ U (0, 1).
• If U ≤ α((θ(t−1) , δ (t−1) ), (θ∗ , δ ∗ )) accept the move, (θ(t) , δ (t) ) = (θ∗ , δ ∗ ).
Otherwise reject the move, (θ(t) , δ (t) ) = (θ(t−1) , δ (t−1) ).
iii. Repeat step ii.
The algorithm produces a Markov Chain with stationary distribution f (θ, δ | x, y).
After a sufficiently long time to allow for convergence, the values {θ(t) , δ (t) } for t >
b may be considered as a sample from f (θ, δ | x, y) where the samples {θ(t) , δ (t) }
for t ≤ b are the “burn-in”.
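A minimal R sketch of this Metropolis algorithm for the case x = 2, y = 1, σ² = 1, taking
Σ̃ = diag(1, 1) as an illustrative (not prescribed) choice of proposal variance so that the
bivariate normal proposal can be drawn as two independent normals:
# Metropolis sampler for theta, delta | x, y; Sigma.tilde = diag(1, 1) assumed.
mu <- c(2, 1)
Sigma.inv <- matrix(c(2, -1, -1, 1), 2, 2)         # (1/sigma^2)[[2,-1],[-1,1]] with sigma^2 = 1
log.post <- function(g) drop(-0.5*t(g - mu) %*% Sigma.inv %*% (g - mu))
N <- 5000
out <- matrix(0, N, 2)                             # start from the origin
for (t in 2:N)
{
prop <- out[t-1, ] + rnorm(2)                      # N2(mu.tilde(t-1), diag(1,1)) proposal
log.alpha <- log.post(prop) - log.post(out[t-1, ])
out[t, ] <- if (log(runif(1)) <= log.alpha) prop else out[t-1, ]
}
colMeans(out[-(1:500), ])                          # compare with mu = (2, 1)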
3. Let X1 , . . . , Xn be exchangeable so that the Xi are conditionally independent
given a parameter θ = (µ, λ). Suppose that Xi | θ ∼ N (µ, 1/λ) so that µ is the
mean and λ the precision of the distribution. Suppose that we judge that µ
and λ are independent with µ ∼ N (µ0 , 1/τ ), where µ0 and τ are known, and
λ ∼ Gamma(α, β), where α and β are known.


(a) Show that the posterior density f(µ, λ | x), where x = (x1, . . . , xn), can be
expressed as

f(µ, λ | x) ∝ λ^{α+n/2−1} exp{−(λ/2) Σ_{i=1}^{n} (xi − µ)² − (τ/2)µ² + τµ0µ − βλ}.

As Xi | θ ∼ N(µ, 1/λ) and θ = (µ, λ) then

f(x | µ, λ) ∝ Π_{i=1}^{n} λ^{1/2} exp{−(λ/2)(xi − µ)²} = λ^{n/2} exp{−(λ/2) Σ_{i=1}^{n} (xi − µ)²}.

µ ∼ N(µ0, 1/τ) so that

f(µ) ∝ exp{−(τ/2)(µ − µ0)²} ∝ exp{−(τ/2)µ² + τµ0µ}.

Finally, as λ ∼ Gamma(α, β) then

f(λ) ∝ λ^{α−1} exp{−βλ}.

Hence, as µ and λ are independent,

f(µ, λ | x) ∝ f(x | µ, λ)f(µ)f(λ) ∝ λ^{α+n/2−1} exp{−(λ/2) Σ_{i=1}^{n} (xi − µ)² − (τ/2)µ² + τµ0µ − βλ}.

(b) Hence show that

λ | µ, x ∼ Gamma(α + n/2, β + (1/2) Σ_{i=1}^{n} (xi − µ)²).

We have

f(λ | µ, x) = f(λ, µ | x)/f(µ | x) ∝ f(λ, µ | x)

where the proportionality is with respect to λ. So, from part (a) we have

f(λ | µ, x) ∝ λ^{α+n/2−1} exp{−(λ/2) Σ_{i=1}^{n} (xi − µ)² − (τ/2)µ² + τµ0µ − βλ}
           ∝ λ^{α+n/2−1} exp{−λ(β + (1/2) Σ_{i=1}^{n} (xi − µ)²)}

which is a kernel of a Gamma(α + n/2, β + (1/2) Σ_{i=1}^{n} (xi − µ)²).


(c) Given that µ | λ, x ∼ N((τµ0 + nλx̄)/(τ + nλ), 1/(τ + nλ)), where x̄ = (1/n) Σ_{i=1}^{n} xi,
describe how the Gibbs sampler may be used to sample from the posterior
distribution µ, λ | x. Give a sensible estimate of Var(λ | x).

The Gibbs sampler requires the conditional distributions λ | µ, x and µ | λ, x, which
we have. The algorithm is
i. Choose a starting value (µ(0), λ(0)) for which f(µ(0), λ(0) | x) > 0.
ii. At iteration t generate the new values (µ(t), λ(t)) as follows:
• draw µ(t) from N((τµ0 + nλ(t−1)x̄)/(τ + nλ(t−1)), 1/(τ + nλ(t−1))), the distribution
of µ(t) | λ(t−1), x.
• draw λ(t) from Gamma(α + n/2, β + (1/2) Σ_{i=1}^{n} (xi − µ(t))²), the distribution
of λ(t) | µ(t), x.
iii. Repeat step ii.
The algorithm will produce a Markov Chain with stationary distribution f(µ, λ | x).
After a sufficiently long time to allow for convergence, the values {µ(t), λ(t)} for
t > b may be considered as a sample from f(µ, λ | x) where the samples {µ(t), λ(t)}
for t ≤ b are the “burn-in”.
Ergodic averages converge to the desired expectations under the target distribution
so that, for example,

(1/N) Σ_{t=1}^{N} g(λ(t)) → E(g(λ) | X)

as N → ∞. As this includes the observations prior to convergence, we will
do better to estimate only using the observations after the “burn-in”. Noting
that Var(λ | X) corresponds to g(λ) = (λ − E(λ | X))² then any sensible sample
variance will suffice. For example, we could estimate Var(λ | X) = E(λ² | X) −
E²(λ | X) by

(1/(N − b)) Σ_{t=b+1}^{N} (λ(t) − λ̄)² = ((1/(N − b)) Σ_{t=b+1}^{N} (λ(t))²) − λ̄²

where our sample is {µ(t), λ(t) : t = b + 1, . . . , N} and λ̄ = (1/(N − b)) Σ_{t=b+1}^{N} λ(t), the
sample mean.
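A minimal R sketch of this Gibbs sampler, assuming illustrative data x and hyperparameter
values µ0, τ, α, β (none are given in the question):
# Gibbs sampler for mu, lambda | x; data and hyperparameters are illustrative assumptions.
set.seed(1)
x <- rnorm(20, 3, 2); n <- length(x); xbar <- mean(x)
mu0 <- 0; tau <- 0.1; alpha <- 2; beta <- 2
N <- 10000; b <- 1000
mu <- lambda <- numeric(N)
mu[1] <- xbar; lambda[1] <- 1
for (t in 2:N)
{
prec <- tau + n*lambda[t-1]
mu[t] <- rnorm(1, (tau*mu0 + n*lambda[t-1]*xbar)/prec, sqrt(1/prec))   # mu | lambda, x
lambda[t] <- rgamma(1, alpha + n/2, beta + 0.5*sum((x - mu[t])^2))     # lambda | mu, x
}
var(lambda[-(1:b)])                     # sample-variance estimate of Var(lambda | x)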
4. Consider a Poisson hierarchical model. At the first stage we have observa-
tions sj which are Poisson with mean tj λj for j = 1, . . . , p where each tj is
known. We assume that the λj are independent and identically dis-
tributed with Gamma(α, β) prior distributions. The parameter α is known
but β is unknown and is given a Gamma(γ, δ) distribution where γ and δ are
known.
(a) Find, up to a constant of integration, the joint posterior distribution
of the unknown parameters given s = (s1 , . . . , sp ).

Let λ = (λ1, . . . , λp); then the unknown parameters are θ = (λ1, . . . , λp, β) =
(λ, β). The prior is

f(λ, β) = f(λ | β)f(β)
        = {Π_{j=1}^{p} f(λj | β)} f(β)
        = {Π_{j=1}^{p} (β^α/Γ(α)) λj^{α−1} e^{−βλj}} × (δ^γ/Γ(γ)) β^{γ−1} e^{−δβ}
        ∝ {Π_{j=1}^{p} λj^{α−1}} β^{(pα+γ)−1} exp{−β(δ + Σ_{j=1}^{p} λj)}.

The likelihood is

f(s | λ, β) = Π_{j=1}^{p} (tjλj)^{sj} e^{−tjλj}/sj! ∝ {Π_{j=1}^{p} λj^{sj}} exp{−Σ_{j=1}^{p} tjλj}.

The posterior is thus

f(λ, β | s) ∝ f(s | λ, β)f(λ, β) ∝ {Π_{j=1}^{p} λj^{α+sj−1}} β^{(pα+γ)−1} exp{−β(δ + Σ_{j=1}^{p} λj) − Σ_{j=1}^{p} tjλj}.

(b) Describe how the Gibbs sampler may be used to sample from the pos-
terior distribution, deriving all required conditional distributions.

For each j, let λ−j = λ\λj. For the Gibbs sampler we need to find the distributions
of λj | λ−j, β, s, j = 1, . . . , p, and β | λ, s. Now,

f(λj | λ−j, β, s) = f(λ, β | s)/f(λ−j, β | s)
                  ∝ f(λ, β | s)   (taken with respect to λj)
                  ∝ λj^{α+sj−1} exp{−(β + tj)λj}

which is a kernel of a Gamma(α + sj, β + tj) distribution. Hence, λj | λ−j, β, s ∼
Gamma(α + sj, β + tj).

f(β | λ, s) = f(λ, β | s)/f(λ | s)
            ∝ f(λ, β | s)   (taken with respect to β)
            ∝ β^{(pα+γ)−1} exp{−(δ + Σ_{j=1}^{p} λj)β}

which is a kernel of a Gamma(pα + γ, δ + Σ_{j=1}^{p} λj) distribution. So, β | λ, s ∼
Gamma(pα + γ, δ + Σ_{j=1}^{p} λj).

The Gibbs sampler is


i. Choose an arbitrary starting point for which f(λ(0), β(0) | s) > 0. As λj > 0
for j = 1, . . . , p and β > 0 then any p + 1 collection of positive numbers will
suffice.
ii. • Obtain λ1(t) by sampling from the distribution of λ1 | λ−1(t−1), β(t−1), s, which
is Gamma(α + s1, β(t−1) + t1).
...
• Obtain λj(t) by sampling from the distribution of λj | λ1(t), . . . , λj−1(t), λj+1(t−1),
. . . , λp(t−1), β(t−1), s, which is Gamma(α + sj, β(t−1) + tj).
...
• Obtain λp(t) by sampling from the distribution of λp | λ−p(t), β(t−1), s, which
is Gamma(α + sp, β(t−1) + tp).
• Obtain β(t) by sampling from the distribution of β | λ(t), s, the Gamma(pα + γ,
δ + Σ_{j=1}^{p} λj(t)).
iii. Repeat step ii.
The algorithm is run until convergence is reached. When convergence is reached,
the resulting value (λ(t), β(t)) is a realisation from the posterior λ, β | s.
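A minimal R sketch of this Gibbs sampler, assuming illustrative values of p, the tj, the sj
and the known hyperparameters α, γ, δ; since each λj conditional does not depend on the
other λ's, the p single-site updates can be drawn in one vectorised call.
# Gibbs sampler for lambda, beta | s; the data and hyperparameters are illustrative assumptions.
set.seed(1)
p <- 5; t.obs <- c(10, 20, 15, 30, 25); s.obs <- c(4, 9, 7, 12, 11)
alpha <- 2; gamma <- 2; delta <- 1
N <- 10000
lambda <- matrix(0, N, p); beta <- numeric(N)
lambda[1, ] <- rep(1, p); beta[1] <- 1                           # any positive starting values
for (t in 2:N)
{
lambda[t, ] <- rgamma(p, alpha + s.obs, beta[t-1] + t.obs)       # lambda_j | lambda_{-j}, beta, s
beta[t] <- rgamma(1, p*alpha + gamma, delta + sum(lambda[t, ]))  # beta | lambda, s
}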
(c) Let {λ1(t), . . . , λp(t), β(t); t = 1, . . . , N}, with N large, be a realisation of the
Gibbs sampler described above. Give sensible estimates of E(λj | s),
Var(β | s) and E(λj | a ≤ β ≤ b, s) where 0 < a < b are given constants.

We first remove the burn-in samples. Suppose we determine the burn-in time b,
where we assume that the Markov chain has sufficiently converged for t > b. (We
assume that b < N.) Then we discard observations {λ(t), β(t); t ≤ b} and use the
remaining sample {λ(t), β(t); t = b + 1, . . . , N} to estimate posterior summaries.
Here λ(t) = (λ1(t), . . . , λp(t)).

The posterior mean of λj | s can be estimated by the sample mean of {λj(t), t =
b + 1, . . . , N} so that our estimate of E(λj | s) is (1/(N − b)) Σ_{t=b+1}^{N} λj(t).

The posterior variance of β | s can be estimated by the sample variance of {β(t), t =
b + 1, . . . , N} so that our estimate of Var(β | s) is (1/(N − b)) Σ_{t=b+1}^{N} (β(t) − β̄)² =
((1/(N − b)) Σ_{t=b+1}^{N} (β(t))²) − β̄² where β̄ = (1/(N − b)) Σ_{t=b+1}^{N} β(t).

For estimating E(λj | a ≤ β ≤ b, s) we not only discard the burn-in observations
but also those elements λj(t) for which β(t) < a or β(t) > b for each t = b +
1, . . . , N. So, letting I[a ≤ β(t) ≤ b] denote the indicator function which is one
if a ≤ β(t) ≤ b, the number of observations after the burn-in with a ≤ β(t) ≤ b
is M = Σ_{t=b+1}^{N} I[a ≤ β(t) ≤ b] and our estimate of E(λj | a ≤ β ≤ b, s) is
(1/M) Σ_{t=b+1}^{N} λj(t) I[a ≤ β(t) ≤ b].
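Continuing the R sketch given after part (b), these estimates can be computed as follows;
the burn-in b = 1000 and the bounds (renamed a.lo and b.hi to avoid clashing with the
burn-in b) are illustrative assumptions.
# Posterior summaries from the Gibbs output above; b, a.lo and b.hi are illustrative choices.
b <- 1000; a.lo <- 0.5; b.hi <- 1.5
keep <- (b + 1):N
colMeans(lambda[keep, ])                     # estimates of E(lambda_j | s), j = 1,...,p
var(beta[keep])                              # estimate of Var(beta | s)
in.range <- keep[beta[keep] >= a.lo & beta[keep] <= b.hi]
colMeans(lambda[in.range, , drop = FALSE])   # estimates of E(lambda_j | a <= beta <= b, s)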
5. Show that the Gibbs sampler for sampling from a distribution π(θ) where
θ = (θ1 , . . . , θd ) can be viewed as a special case of the Metropolis-Hastings
algorithm where each iteration t consists of d Metropolis-Hastings steps
each with an acceptance probability of 1.

For the Gibbs sampler, we update each θp one at a time by obtaining θp(t) from the
conditional distribution π(θp | θ1(t), . . . , θp−1(t), θp+1(t−1), . . . , θd(t−1)). Thus, we move from
θp(t−1) to θp(t) leaving the remaining d − 1 parameters fixed.

We now demonstrate that this update step in the Gibbs sampler is equivalent to a single
Metropolis-Hastings step with acceptance probability 1. Let θ[p](t−1) = (θ1(t), . . . , θp−1(t),
θp(t−1), θp+1(t−1), . . . , θd(t−1)) and θ[p](t) = (θ1(t), . . . , θp−1(t), θp(t), θp+1(t−1), . . . , θd(t−1)). Consider
the move from θ[p](t−1) to θ[p](t). As θp(t) is drawn from π(θp | θ1(t), . . . , θp−1(t), θp+1(t−1), . . . , θd(t−1))
then we shall use this for the proposal distribution in the Metropolis-Hastings algorithm
so that

q(θ[p](t) | θ[p](t−1)) = π(θp(t) | θ1(t), . . . , θp−1(t), θp+1(t−1), . . . , θd(t−1)).    (1)

If, instead, we consider the move from θ[p](t) to θ[p](t−1) then, using Gibbs sampling, θ[p](t−1)
is obtained by drawing θp(t−1) from π(θp | θ1(t), . . . , θp−1(t), θp+1(t−1), . . . , θd(t−1)). Thus, we set

q(θ[p](t−1) | θ[p](t)) = π(θp(t−1) | θ1(t), . . . , θp−1(t), θp+1(t−1), . . . , θd(t−1)).    (2)

Now,

π(θ[p](t−1)) = π(θ1(t), . . . , θp−1(t), θp(t−1), θp+1(t−1), . . . , θd(t−1))
            = π(θp(t−1) | θ1(t), . . . , θp−1(t), θp+1(t−1), . . . , θd(t−1)) π(θ1(t), . . . , θp−1(t), θp+1(t−1), . . . , θd(t−1))    (3)
            = q(θ[p](t−1) | θ[p](t)) π(θ1(t), . . . , θp−1(t), θp+1(t−1), . . . , θd(t−1))    (4)

where (4) follows by substituting (2) into (3). In an identical fashion,

π(θ[p](t)) = π(θ1(t), . . . , θp−1(t), θp(t), θp+1(t−1), . . . , θd(t−1))
          = π(θp(t) | θ1(t), . . . , θp−1(t), θp+1(t−1), . . . , θd(t−1)) π(θ1(t), . . . , θp−1(t), θp+1(t−1), . . . , θd(t−1))    (5)
          = q(θ[p](t) | θ[p](t−1)) π(θ1(t), . . . , θp−1(t), θp+1(t−1), . . . , θd(t−1))    (6)

where (6) follows by substituting (1) into (5). Dividing (6) by (4) gives

π(θ[p](t)) / π(θ[p](t−1)) = q(θ[p](t) | θ[p](t−1)) / q(θ[p](t−1) | θ[p](t))

so that

π(θ[p](t)) q(θ[p](t−1) | θ[p](t)) / [π(θ[p](t−1)) q(θ[p](t) | θ[p](t−1))] = 1.    (7)

Thus, a Metropolis-Hastings algorithm for sampling from π(θ) for a proposed move
from θ[p](t−1) to θ[p](t) using the proposal distribution given by (1) has an acceptance ratio
of

α(θ[p](t−1), θ[p](t)) = min{1, π(θ[p](t)) q(θ[p](t−1) | θ[p](t)) / [π(θ[p](t−1)) q(θ[p](t) | θ[p](t−1))]}    (8)
                     = 1    (9)

where (9) follows from (8) using (7).



MA40189 - Solution Sheet Nine


Simon Shaw, [email protected]
http://people.bath.ac.uk/masss/ma40189.html

2016/17 Semester II

1. Consider the following loss function

L(θ, d) = d/θ − log(d/θ) − 1.

Let X1, . . . , Xn be exchangeable so that the Xi are conditionally independent
given parameter θ. Suppose that Xi | θ ∼ Po(θ) and θ ∼ Gamma(α, β) with
α > 1.
(a) Find the Bayes rule of an immediate decision.

We consider the decision problem [Θ, D, π(θ), L(θ, d)]. Relative to distribution π
the expected loss is

E(π){L(θ, d)} = d E(π)(θ^{−1}) − log d + E(π)(log θ) − 1.

Differentiating with respect to d we find

(∂/∂d) E(π){L(θ, d)} = E(π)(θ^{−1}) − 1/d

so that the Bayes rule is

d∗ = E(π)^{−1}(θ^{−1}).    (1)

With π(θ) = f(θ), the Bayes rule of the immediate decision is d∗ = E^{−1}(θ^{−1}).
Now for θ ∼ Gamma(α, β),

E(θ^{−1}) = ∫_0^{∞} θ^{−1} (β^α/Γ(α)) θ^{α−1} e^{−βθ} dθ
          = (Γ(α − 1)/Γ(α)) × (β^α/β^{α−1}) ∫_0^{∞} (β^{α−1}/Γ(α − 1)) θ^{α−2} e^{−βθ} dθ    (2)
          = (Γ(α − 1)/Γ(α)) × (β^α/β^{α−1}) = β/(α − 1).

Notice that in (2) we assume that we are integrating over the density of a
Gamma(α − 1, β) distribution. This assumption requires α > 1. The Bayes
rule of the immediate decision is thus

d∗ = E^{−1}(θ^{−1}) = (α − 1)/β.    (3)


(b) Find the Bayes rule after observing x = (x1, . . . , xn).

We now consider the case when π(θ) = f(θ | x). From (1) the Bayes rule is d∗ =
E^{−1}(θ^{−1} | X). Now, as θ ∼ Gamma(α, β) and Xi | θ ∼ Po(θ) then, see Question
Sheet Three Exercise 1.(b), θ | x ∼ Gamma(α + Σ_{i=1}^{n} xi, β + n). We exploit this
conjugacy to observe that the Bayes rule after sampling can be obtained from the
Bayes rule of the immediate decision by substituting α + Σ_{i=1}^{n} xi for α and β + n
for β in (3) to obtain

d∗ = (α + Σ_{i=1}^{n} xi − 1)/(β + n).
(c) Interpret the Bayes rules that you have found.

If θ ∼ Gamma(α, β) then the mode of the distribution is (α − 1)/β. So, the Bayes rule
of the immediate decision is the prior mode whilst the Bayes rule after sampling
is the posterior mode.
2. We wish to estimate the parameter, θ, of an exponential distribution.
We consider that X1, . . . , Xn are exchangeable with Xi | θ ∼ Exp(θ) so that
E(Xi | θ) = 1/θ. Our prior for θ is Gamma(α, β) with α > 3. We wish to
produce an estimate, d, for θ, with loss function

L(θ, d) = (θ − d)²/θ³.
(a) Find the Bayes rule and Bayes risk of an immediate decision.

We consider the decision problem [Θ, D, π(θ), L(θ, d)]. Relative to distribution π
the expected loss is

E(π){L(θ, d)} = E(π)(θ^{−1}) − 2dE(π)(θ^{−2}) + d²E(π)(θ^{−3}).

Differentiating with respect to d we find

(∂/∂d) E(π){L(θ, d)} = −2E(π)(θ^{−2}) + 2dE(π)(θ^{−3})

so that the Bayes rule is

d∗ = E(π)(θ^{−2})/E(π)(θ^{−3})    (4)

with corresponding Bayes risk

ρ∗(π) = E(π)(θ^{−1}) − 2(E(π)(θ^{−2})/E(π)(θ^{−3}))E(π)(θ^{−2}) + (E(π)(θ^{−2})/E(π)(θ^{−3}))²E(π)(θ^{−3})
      = E(π)(θ^{−1}) − E(π)²(θ^{−2})/E(π)(θ^{−3})
      = E(π)(θ^{−1}) − d∗E(π)(θ^{−2}).    (5)

We now consider the immediate decision by solving the decision problem [Θ, D,
f(θ), L(θ, d)]. As θ ∼ Gamma(α, β) then

E(θ^k) = ∫_0^{∞} θ^k (β^α/Γ(α)) θ^{α−1} e^{−βθ} dθ
       = (Γ(α + k)/Γ(α)) × (β^α/β^{α+k}) ∫_0^{∞} (β^{α+k}/Γ(α + k)) θ^{α+k−1} e^{−βθ} dθ    (6)
       = Γ(α + k)/(Γ(α)β^k)

provided that α + k > 0 so that we have integrated over a Gamma(α + k, β)
density at (6). Notice that we require k ∈ {−3, −2, −1} and as α > 3 then for
k ∈ {−3, −2, −1} we do have α + k > 0. Thus,

E(θ^{−1}) = Γ(α − 1)/(Γ(α)β^{−1}) = β/(α − 1);
E(θ^{−2}) = Γ(α − 2)/(Γ(α)β^{−2}) = β²/((α − 1)(α − 2));
E(θ^{−3}) = Γ(α − 3)/(Γ(α)β^{−3}) = β³/((α − 1)(α − 2)(α − 3)).

So, from (4), the Bayes rule of the immediate decision is

d∗ = E(θ^{−2})/E(θ^{−3}) = (α − 3)/β    (7)

with, from (5), corresponding Bayes risk

ρ∗(f(θ)) = E(θ^{−1}) − E(θ^{−2})d∗ = β/(α − 1) − (β²/((α − 1)(α − 2)))((α − 3)/β) = β/((α − 1)(α − 2)).    (8)

(b) Suppose that we may take a sample of n observations before estimating
θ. Find the Bayes rule and Bayes risk when we have observed x =
(x1, . . . , xn).

We solve the decision problem [Θ, D, f(θ | x), L(θ, d)]. As Xi | θ ∼ Exp(θ) then

f(x | θ) = Π_{i=1}^{n} f(xi | θ) ∝ θ^n exp{−(Σ_{i=1}^{n} xi)θ}

so that, as θ ∼ Gamma(α, β),

f(θ | x) ∝ θ^{α+n−1} exp{−(β + Σ_{i=1}^{n} xi)θ}.

Hence, θ | x ∼ Gamma(α + n, β + Σ_{i=1}^{n} xi). We can exploit this conjugacy to
observe that the Bayes rule and Bayes risk after sampling can be found from
substituting α + n for α and β + Σ_{i=1}^{n} xi for β in (7) and (8) to obtain

d∗ = (α + n − 3)/(β + Σ_{i=1}^{n} xi)    (9)

as the Bayes rule after observing a sample of size n with corresponding Bayes risk

ρ∗(f(θ | x)) = (β + Σ_{i=1}^{n} xi)/((α + n − 1)(α + n − 2)).    (10)
(c) Find the Bayes risk of the sampling procedure.

The Bayes decision function is (9) viewed as a random quantity, that is

δ∗ = (α + n − 3)/(β + Σ_{i=1}^{n} Xi).

The risk of the sampling procedure, ρn∗, is the expected value of (10) when viewed
as a random quantity,

ρn∗ = E[(β + Σ_{i=1}^{n} Xi)/((α + n − 1)(α + n − 2))] = (β + E(Σ_{i=1}^{n} Xi))/((α + n − 1)(α + n − 2)).    (11)

We utilise the tower property of expectations, see Question Sheet One Exercise
5, to find E(Σ_{i=1}^{n} Xi). We have that

E(Σ_{i=1}^{n} Xi) = E{E(Σ_{i=1}^{n} Xi | θ)} = E{Σ_{i=1}^{n} E(Xi | θ)} = Σ_{i=1}^{n} E(θ^{−1}) = nβ/(α − 1).    (12)

Substituting (12) into (11) gives

ρn∗ = (β + nβ/(α − 1))/((α + n − 1)(α + n − 2))
    = β(1 + n/(α − 1))/((α + n − 1)(α + n − 2))
    = β/((α − 1)(α + n − 2)).

Notice that when n = 0, ρ∗_{n=0} = β/((α − 1)(α − 2)) = ρ∗(f(θ)), the Bayes risk of the
immediate decision as per (8). Also, as n increases then ρn∗ decreases.
(d) If each observation costs a fixed amount c find the total risk of a sample
of size n and thus the optimal choice of n. (Hint: the total risk of the
sample is the Bayes risk of the sampling procedure added to the sample
cost.)

The total risk is

Rn = ρn∗ + nc = β/((α − 1)(α + n − 2)) + nc.

We want to choose n to minimise Rn. Differentiating with respect to n we have

∂Rn/∂n = −β/((α − 1)(α + n − 2)²) + c.

So, ∂Rn/∂n = 0 when

(α + n − 2)² = β/((α − 1)c)  ⇒  n = √(β/((α − 1)c)) + 2 − α.

If n ≤ 0 then we don't sample. If n > 0 then either n rounded up or down to an
integer is the optimal choice.

Notice how the choice of n depends upon the choice of the prior hyperparameters
α and β. Using E(θ^{−1}) = β/(α − 1) we may write the optimal n as

n = √(E(θ^{−1})/c) + 2 − α.

Hence, relative to E(θ^{−1}), increasing α decreases n. We can interpret this in
terms of the strength of prior information for θ^{−1}; note that as Xi | θ ∼ Exp(θ)
then E(Xi | θ) = θ^{−1} and E(Xi) = E(θ^{−1}). As θ ∼ Gamma(α, β) then θ^{−1} ∼
Inv-Gamma(α, β) with Var(θ^{−1}) = β²/((α − 1)²(α − 2)). So, increasing α can be seen
as decreasing Var(θ^{−1}) and thus strengthening the prior belief in θ^{−1}. We have
the property that the stronger the prior belief in θ^{−1} is, the smaller the optimal
value of n is.
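A short R illustration of this choice of n, assuming the illustrative values α = 4, β = 9 and
c = 0.01 (the cost is renamed cost to avoid masking R's c function):
# Optimal sample size under the total risk R_n; alpha, beta and the cost are assumed values.
alpha <- 4; beta <- 9; cost <- 0.01
n.star <- sqrt(beta/((alpha - 1)*cost)) + 2 - alpha       # approximately 15.32
total.risk <- function(n) beta/((alpha - 1)*(alpha + n - 2)) + n*cost
total.risk(floor(n.star)); total.risk(ceiling(n.star))    # choose whichever is smaller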
3. A certain coin has probability ω of landing heads. We assess that the prior
for ω is a Beta distribution with parameters α, β > 1. The loss function for
estimate d and value ω is

L(ω, d) = ((ω − d)² + d²)/ω.
(a) Find the Bayes rule and Bayes risk for an immediate decision.

We consider the decision problem [Ω, D, π(ω), L(ω, d)]. Relative to distribution π
the expected loss is

E(π){L(ω, d)} = E(π)(ω) − 2d + 2d²E(π)(ω^{−1}).

Differentiating with respect to d we find

(∂/∂d) E(π){L(ω, d)} = −2 + 4dE(π)(ω^{−1})

so that the Bayes rule is

d∗ = E(π)^{−1}(ω^{−1})/2    (13)

with corresponding Bayes risk

ρ∗(π) = E(π)(ω) − E(π)^{−1}(ω^{−1}) + E(π)^{−1}(ω^{−1})/2 = E(π)(ω) − E(π)^{−1}(ω^{−1})/2.    (14)


We now consider the immediate decision by solving the decision problem [Ω, D,
f(ω), L(ω, d)]. As ω ∼ Beta(α, β) then

E(ω^t) = ∫_0^1 ω^t (Γ(α + β)/(Γ(α)Γ(β))) ω^{α−1}(1 − ω)^{β−1} dω
       = (Γ(α + t)Γ(α + β)/(Γ(α)Γ(α + β + t))) ∫_0^1 (Γ(α + β + t)/(Γ(α + t)Γ(β))) ω^{α+t−1}(1 − ω)^{β−1} dω    (15)
       = Γ(α + t)Γ(α + β)/(Γ(α)Γ(α + β + t))

provided that α + t > 0 so that we have integrated over a Beta(α + t, β) density
at (15). Thus,

E(ω) = Γ(α + 1)Γ(α + β)/(Γ(α)Γ(α + β + 1)) = α/(α + β);
E(ω^{−1}) = Γ(α − 1)Γ(α + β)/(Γ(α)Γ(α + β − 1)) = (α + β − 1)/(α − 1)

provided α > 1. So, from (13), the Bayes rule of the immediate decision is

d∗ = E^{−1}(ω^{−1})/2 = (α − 1)/(2(α + β − 1))    (16)

with, from (14), corresponding Bayes risk

ρ∗(f(ω)) = E(ω) − E^{−1}(ω^{−1})/2 = α/(α + β) − (α − 1)/(2(α + β − 1)) = (α(α + β − 1) + β)/(2(α + β)(α + β − 1)).    (17)

(b) Suppose that we toss the coin n times before making our decision, and
observe k heads and n − k tails. Find the new Bayes rule and Bayes
risk.

We solve the decision problem [Ω, D, f(ω | k), L(ω, d)]. As ω ∼ Beta(α, β) and
K | ω ∼ Bin(n, ω) then ω | k ∼ Beta(α + k, β + n − k). We can exploit this
conjugacy to observe that the Bayes rule and Bayes risk after sampling can be
found from substituting α + k for α and β + n − k for β in (16) and (17) to obtain

d∗ = (α + k − 1)/(2(α + β + n − 1))    (18)

as the Bayes rule after observing a sample of k heads and n − k tails with corresponding
Bayes risk

ρ∗(f(ω | k)) = ((α + k)(α + β + n − 1) + (β + n − k))/(2(α + β + n)(α + β + n − 1)).    (19)

(c) Find the Bayes risk of the sampling procedure of tossing the coin n
times and then making a decision when α = β = 2.

The Bayes decision function is (18) viewed as a random quantity, that is

δ∗ = (α + K − 1)/(2(α + β + n − 1)).

The risk of the sampling procedure, ρn∗, is the expected value of (19) when viewed
as a random quantity,

ρn∗ = E[((α + K)(α + β + n − 1) + (β + n − K))/(2(α + β + n)(α + β + n − 1))]
    = ({α + E(K)}(α + β + n − 1) + {β + n − E(K)})/(2(α + β + n)(α + β + n − 1)).    (20)

We utilise the tower property of expectations, see Question Sheet One Exercise
5, to find E(K). We have that

E(K) = E{E(K | ω)} = E(nω) = nα/(α + β).

Hence,

α + E(K) = α + nα/(α + β) = α(α + β + n)/(α + β);    (21)
β + n − E(K) = β + n − nα/(α + β) = β(α + β + n)/(α + β).    (22)

Substituting (21) and (22) into (20) gives

ρn∗ = [α(α + β + n)(α + β + n − 1)/(α + β) + β(α + β + n)/(α + β)] / [2(α + β + n)(α + β + n − 1)]
    = (1/(2(α + β))) (α + β/(α + β + n − 1)).

Notice that when n = 0, ρ∗_{n=0} = (α(α + β − 1) + β)/(2(α + β)(α + β − 1)) = ρ∗(f(ω)), the Bayes
risk of the immediate decision. As n increases then ρn∗ decreases. If α = β = 2 then

ρn∗ = (n + 4)/(4(n + 3)).
