MATHEMATICAL STATISTICS III

Lecture Notes

Lecturer: Professor Patty Solomon

SCHOOL OF MATHEMATICAL SCIENCES
Contents

1 Distribution Theory
  1.1 Discrete distributions
    1.1.1 Bernoulli distribution
    1.1.2 Binomial distribution
    1.1.3 Geometric distribution
    1.1.4 Negative Binomial distribution
    1.1.5 Poisson distribution
    1.1.6 Hypergeometric distribution
  1.2 Continuous Distributions
    1.2.1 Uniform Distribution
    1.2.2 Exponential Distribution
    1.2.3 Gamma distribution
    1.2.4 Beta density function
    1.2.5 Normal distribution
    1.2.6 Standard Cauchy distribution
  1.3 Transformations of a single random variable
  1.4 CDF transformation
  1.5 Non-monotonic transformations
  1.6 Moments of transformed RVs
  1.7 Multivariate distributions
    1.7.1 Trinomial distribution
    1.7.2 Multinomial distribution
    1.7.3 Marginal and conditional distributions
    1.7.4 Continuous multivariate distributions
  1.8 Transformations of several RVs
    1.8.1 Multivariate transformation rule
    1.8.2 Method of regular transformations
  1.9 Moments
    1.9.1 Moment generating functions
    1.9.2 Marginal distributions and the MGF
    1.9.3 Vector notation
    1.9.4 Properties of variance matrices
  1.10 The multivariate normal distribution
    1.10.1 The multivariate normal MGF
    1.10.2 Independence and normality
  1.11 Limit Theorems
    1.11.1 Convergence of random variables

2 Statistical Inference
  2.1 Basic definitions and terminology
    2.1.1 Criteria for good estimators
  2.2 Minimum Variance Unbiased Estimation
    2.2.1 Likelihood, score and Fisher Information
    2.2.2 Cramer-Rao Lower Bound
    2.2.3 Exponential families of distributions
    2.2.4 Sufficient statistics
    2.2.5 The Rao-Blackwell Theorem
  2.3 Methods of Estimation
    2.3.1 Method of Moments
    2.3.2 Maximum Likelihood Estimation
    2.3.3 Elementary properties of MLEs
    2.3.4 Asymptotic Properties of MLEs
  2.4 Hypothesis Tests and Confidence Intervals
    2.4.1 Hypothesis testing
    2.4.2 Large sample tests and confidence intervals
    2.4.3 Optimal tests
1 Distribution Theory

A discrete random variable (RV) is described by its probability function

p(x) = P ({X = x})

and is represented by a probability histogram. A continuous RV is described by its


probability density function (PDF), f(x) ≥ 0, for which

P({a ≤ X ≤ b}) = ∫_a^b f(x) dx,   for all a ≤ b.

The PDF is a piecewise continuous function which integrates to 1 over the range of the RV. Note that

P(X = a) = ∫_a^a f(x) dx = 0   for any continuous RV.

Example: ‘Precipitation’ is neither a discrete nor a continuous RV, since there is zero
precipitation on some days; it is a mixture of both.
The cumulative distribution function (CDF) is defined by:

F(x) = P({X ≤ x})   (area to left of x),

computed by:

F(x) = Σ_{t: t ≤ x} p(t)   (X discrete),

F(x) = ∫_{−∞}^x f(t) dt   (X continuous).

The expected value E(X) of a discrete RV is given by:

E(X) = Σ_x x p(x),

provided that Σ_x |x| p(x) < ∞ (i.e., the sum must converge). Otherwise the expectation is not defined.

The expected value of a continuous RV is given by:

E(X) = ∫_{−∞}^∞ x f(x) dx,

provided that ∫_{−∞}^∞ |x| f(x) dx < ∞; otherwise it is not defined.


(For example, the Cauchy distribution is momentless.)

E{h(X)} = Σ_x h(x) p(x), provided Σ_x |h(x)| p(x) < ∞   (X discrete);

E{h(X)} = ∫_{−∞}^∞ h(x) f(x) dx, provided ∫_{−∞}^∞ |h(x)| f(x) dx < ∞   (X continuous).

The moment generating function (MGF) of a RV X is defined to be:

M_X(t) = E[e^{tX}] = Σ_x e^{tx} p(x)   (X discrete),

M_X(t) = E[e^{tX}] = ∫_{−∞}^∞ e^{tx} f(x) dx   (X continuous).

M_X(0) = 1 always; the MGF may or may not be defined for other values of t.
If M_X(t) is defined for all t in some open interval containing 0, then:

1. Moments of all orders exist;

2. E[X^r] = M_X^{(r)}(0) (the rth order derivative at 0);

3. M_X(t) uniquely determines the distribution of X:

M_X'(0) = E(X),  M_X''(0) = E(X²), and so on.

1.1 Discrete distributions

1.1.1 Bernoulli distribution

Parameter: 0 ≤ p ≤ 1
Possible values: {0, 1}
Prob. function:

p(x) = p^x (1 − p)^{1−x} =  p if x = 1;  1 − p if x = 0.

E(X) = p
Var(X) = p(1 − p)
M_X(t) = 1 + p(e^t − 1).

1.1.2 Binomial distribution

Parameters: 0 ≤ p ≤ 1; integer n > 0
MGF: M_X(t) = {1 + p(e^t − 1)}^n.
Consider a sequence of n independent Bern(p) trials. If X = total number of successes, then X ∼ B(n, p).
Probability function:

p(x) = C(n, x) p^x (1 − p)^{n−x},   where C(n, x) = n!/(x!(n − x)!).

Note: If Y1, . . . , Yn are the underlying Bernoulli RVs then X = Y1 + Y2 + · · · + Yn (same as counting the number of successes).
This is a valid probability function since:

p(x) ≥ 0 for x = 0, 1, . . . , n,  and  Σ_{x=0}^n p(x) = 1.
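As a quick numerical sanity check (an added illustration, not part of the original notes), the binomial probability function can be computed with Python's `math.comb` and the two defining properties verified directly, along with the standard fact that the mean is np:

```python
from math import comb

def binom_pmf(x, n, p):
    """B(n, p) probability function: C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.3
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

# The two defining properties: p(x) >= 0 and the probabilities sum to 1.
assert all(v >= 0 for v in pmf)
assert abs(sum(pmf) - 1) < 1e-12

# The mean of B(n, p) is np (a standard fact; here 3.0).
mean = sum(x * v for x, v in enumerate(pmf))
assert abs(mean - n * p) < 1e-12
```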

1.1.3 Geometric distribution

This is a discrete waiting time distribution. Suppose a sequence of independent Bernoulli trials is performed and let X be the number of failures preceding the first success. Then X ∼ Geom(p), with

p(x) = p(1 − p)^x,   x = 0, 1, 2, . . .
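The failures-before-first-success description can be simulated directly. The seeded sketch below is an added illustration; it also uses the fact that E(X) = (1 − p)/p for this parameterisation (a standard result, not derived at this point in the notes):

```python
import random

random.seed(1)
p = 0.4

def failures_before_success(p):
    """Run independent Bern(p) trials; count failures preceding the first success."""
    x = 0
    while random.random() >= p:   # a trial fails with probability 1 - p
        x += 1
    return x

N = 200_000
draws = [failures_before_success(p) for _ in range(N)]

# Empirical frequencies should match p(x) = p(1 - p)^x for small x.
for x in range(5):
    emp = sum(d == x for d in draws) / N
    assert abs(emp - p * (1 - p)**x) < 0.01

# E(X) = (1 - p)/p = 1.5 here (standard fact for this parameterisation).
assert abs(sum(draws) / N - (1 - p) / p) < 0.02
```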


1.1.4 Negative Binomial distribution

Suppose a sequence of independent Bernoulli trials is conducted. If X is the number of failures preceding the nth success, then X has a negative binomial distribution.
Probability function:

p(x) = C(n + x − 1, n − 1) p^n (1 − p)^x,

where C(n + x − 1, n − 1) counts the ways of allocating the failures and successes.

E(X) = n(1 − p)/p,
Var(X) = n(1 − p)/p²,
M_X(t) = { p / (1 − e^t(1 − p)) }^n.

1. If n = 1, we obtain the geometric distribution.


2. Also seen to arise as sum of n independent geometric variables.

1.1.5 Poisson distribution

Parameter: rate λ > 0
MGF: M_X(t) = e^{λ(e^t − 1)}
Probability function:

p(x) = e^{−λ} λ^x / x!,   x = 0, 1, 2, . . .

1. The Poisson distribution arises as the distribution for the number of “point events” observed from a Poisson process. Example: the number of incoming calls to a certain exchange in a given hour.

Figure 1: Poisson Example


2. The Poisson distribution also arises as the limiting form of the binomial distribution: n → ∞, p → 0, with np → λ.
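This limiting relationship is easy to check numerically. The sketch below (an added illustration, not part of the notes) compares B(n, λ/n) probabilities with Poisson(λ) probabilities for large n:

```python
from math import comb, exp, factorial

lam = 2.0

def poisson_pmf(x, lam):
    """Poisson probability function e^(-lam) lam^x / x!."""
    return exp(-lam) * lam**x / factorial(x)

def binom_pmf(x, n, p):
    """B(n, p) probability function."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# With n large and p = lam/n (so np = lam), binomial probabilities
# are close to the Poisson(lam) ones.
for x in range(6):
    approx = binom_pmf(x, 10_000, lam / 10_000)
    assert abs(approx - poisson_pmf(x, lam)) < 1e-3
```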

The derivation of the Poisson distribution (via the binomial) is underpinned by a Poisson process, i.e., a point process on [0, ∞); see Figure 1.
AXIOMS for a Poisson process of rate λ > 0 are:

(A) The numbers of occurrences in disjoint intervals are independent.

(B) The probability of 1 or more occurrences in any sub-interval [t, t + h) is λh + o(h) as h → 0 (the probability is approximately λ × the length of the interval).

(C) The probability of more than one occurrence in [t, t + h) is o(h) as h → 0 (i.e., the probability is small, negligible).

Note: o(h), pronounced “small order h”, is standard notation for any function r(h) with the property:

lim_{h→0} r(h)/h = 0.

Figure 2: Small order h: functions h^4 (yes) and h (no)


1.1.6 Hypergeometric distribution

Consider an urn containing M black and N white balls. Suppose n balls are sampled
randomly without replacement and let X be the number of black balls chosen. Then
X has a hypergeometric distribution.
Parameters: M, N > 0; 0 < n ≤ M + N
Possible values: max(0, n − N) ≤ x ≤ min(n, M)
Prob. function:

p(x) = C(M, x) C(N, n − x) / C(M + N, n),

E(X) = nM/(M + N),   Var(X) = [(M + N − n)/(M + N − 1)] · nMN/(M + N)².

The mgf exists, but there is no useful expression available.

1. The hypergeometric distribution is simply

(# samples with x black balls) / (# possible samples) = C(M, x) C(N, n − x) / C(M + N, n).

2. To see how the limits arise, observe we must have x ≤ n (no more black balls than the sample size) and x ≤ M, i.e., x ≤ min(n, M). Similarly, we must have x ≥ 0 (cannot have < 0 black balls in the sample) and n − x ≤ N (cannot have more white balls than are in the urn), i.e., x ≥ n − N; hence x ≥ max(0, n − N).

3. If we sample with replacement, we would get X ∼ B(n, p = M/(M + N)). It is interesting to compare moments:


hypergeometric:  E(X) = np,  Var(X) = [(M + N − n)/(M + N − 1)] [np(1 − p)]
binomial:        E(X) = np,  Var(X) = np(1 − p)

The factor (M + N − n)/(M + N − 1) is the finite population correction; when we sample all the balls in the urn (n = M + N), Var(X) = 0.

4. When M, N >> n, the difference between sampling with and without replace-
ment should be small.

Figure 3: p = 1/3

If a white ball is taken out:

Figure 4: p = 1/2 (without replacement)

Figure 5: p = 1/3 (with replacement)

Intuitively, this implies that for M, N >> n, the hypergeometric and binomial probabilities should be very similar, and this can be verified for fixed n, x:

lim_{M,N→∞, M/(M+N)→p}  C(M, x) C(N, n − x) / C(M + N, n)  =  C(n, x) p^x (1 − p)^{n−x}.
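The convergence can be seen numerically by growing the urn while keeping the sample size and the black-ball proportion fixed. A small sketch (added illustration, not part of the notes):

```python
from math import comb

def hyper_pmf(x, M, N, n):
    """Hypergeometric probability C(M, x) C(N, n - x) / C(M + N, n)."""
    return comb(M, x) * comb(N, n - x) / comb(M + N, n)

def binom_pmf(x, n, p):
    """B(n, p) probability function."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n = 5
gaps = []
for M, N in [(30, 60), (300, 600), (3000, 6000)]:
    p = M / (M + N)     # p = 1/3 in each case
    gaps.append(max(abs(hyper_pmf(x, M, N, n) - binom_pmf(x, n, p))
                    for x in range(n + 1)))

# The largest discrepancy shrinks as the urn grows relative to the sample.
assert gaps[0] > gaps[1] > gaps[2]
assert gaps[2] < 1e-3
```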


1.2 Continuous Distributions

1.2.1 Uniform Distribution

CDF, for a < x < b:

F(x) = ∫_{−∞}^x f(t) dt = ∫_a^x 1/(b − a) dt = (x − a)/(b − a),

that is,

F(x) = 0 for x ≤ a;  (x − a)/(b − a) for a < x < b;  1 for x ≥ b.

Figure 6: Uniform distribution CDF

M_X(t) = (e^{tb} − e^{ta}) / (t(b − a)).
A special case is the U(0, 1) distribution:

f(x) = 1 for 0 < x < 1;  0 otherwise,

F(x) = x for 0 < x < 1,

E(X) = 1/2,   Var(X) = 1/12,   M(t) = (e^t − 1)/t.


1.2.2 Exponential Distribution

CDF:

F(x) = 1 − e^{−λx},   x ≥ 0,  λ > 0,

MGF:

M_X(t) = λ/(λ − t),   for t < λ.

This is the distribution for the waiting time until the first occurrence in a Poisson process with rate parameter λ > 0.

1. If X ∼ Exp(λ) then,

P (X ≥ t + x|X ≥ t) = P (X ≥ x)
(memoryless property)

2. It can be obtained as limiting form of geometric distribution.
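The memoryless property can be verified directly from the survival function P(X ≥ x) = e^{−λx}: the conditional probability P(X ≥ t + x | X ≥ t) = e^{−λ(t+x)}/e^{−λt} = e^{−λx}. A minimal numerical sketch (an added illustration):

```python
from math import exp, isclose

lam = 1.5

def survival(x):
    """P(X >= x) for X ~ Exp(lam)."""
    return exp(-lam * x)

# Memoryless property: P(X >= t + x | X >= t) = P(X >= x) for all t, x >= 0.
for t in (0.5, 1.0, 3.0):
    for x in (0.2, 1.0, 2.5):
        assert isclose(survival(t + x) / survival(t), survival(x))
```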

1.2.3 Gamma distribution

f(x) = λ^α x^{α−1} e^{−λx} / Γ(α),   α > 0, λ > 0, x ≥ 0,

with the gamma function

Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt,

and MGF

M_X(t) = (λ/(λ − t))^α,   t < λ.

Suppose Y1 , . . . , YK are independent Exp(λ) random variables and let X = Y1 +· · ·+YK .


Then X ∼ Gamma(K, λ), for K integer. In general, X ∼ Gamma(α, λ), α > 0.

1. α is the shape parameter, λ is the scale parameter.
Note: if Y ∼ Gamma(α, 1) and X = Y/λ, then X ∼ Gamma(α, λ); that is, λ is a scale parameter.


Figure 7: Gamma Distribution

 
2. The Gamma(p/2, 1/2) distribution is also called the χ²_p (chi-square with p df) distribution if p is an integer; χ²_2 is an exponential distribution (2 df).

3. The Gamma(K, λ) distribution can be interpreted as the waiting time until the Kth occurrence in a Poisson process.
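The sum-of-exponentials characterisation can be checked by a seeded simulation (an added illustration; it also uses the standard facts, obtainable from the MGF, that Gamma(K, λ) has mean K/λ and variance K/λ²):

```python
import random
from math import isclose

random.seed(2)
K, lam = 3, 2.0
N = 200_000

# X = Y1 + ... + YK with Yi ~ Exp(lam) independently should be Gamma(K, lam).
xs = [sum(random.expovariate(lam) for _ in range(K)) for _ in range(N)]

mean = sum(xs) / N
var = sum((x - mean)**2 for x in xs) / N

# Gamma(K, lam) has mean K/lam = 1.5 and variance K/lam^2 = 0.75.
assert isclose(mean, K / lam, abs_tol=0.02)
assert isclose(var, K / lam**2, abs_tol=0.05)
```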

1.2.4 Beta density function

Suppose Y1 ∼ Gamma(α, λ) and Y2 ∼ Gamma(β, λ) independently. Then

X = Y1/(Y1 + Y2) ∼ Beta(α, β),   0 ≤ x ≤ 1.

Remark: see soon for derivation!

1.2.5 Normal distribution


X ∼ N(µ, σ²);   M_X(t) = e^{tµ + t²σ²/2}.

1.2.6 Standard Cauchy distribution

Possible values: x ∈ R

PDF: f(x) = 1/(π(1 + x²))   (location parameter θ = 0)
CDF: F(x) = 1/2 + (1/π) arctan x

E(X), Var(X), M_X(t) do not exist.


Figure 8: Beta Distribution

→ the Cauchy is a bell-shaped distribution symmetric about zero for which no moments
are defined.

Figure 9: Cauchy Distribution

(Pointier than normal distribution and tails go to zero much slower than normal distribution.)

If Z1 ∼ N(0, 1) and Z2 ∼ N(0, 1) independently, then X = Z1/Z2 has the Cauchy distribution.


1.3 Transformations of a single random variable

If X is a RV and Y = h(X), then Y is also a RV. If we know the distribution of X, then for a given function h(x) we should be able to find the distribution of Y = h(X).

Theorem. 1.3.1 (Discrete case)


Suppose X is a discrete RV with prob. function p_X(x) and let Y = h(X), where h is any function. Then:

p_Y(y) = Σ_{x: h(x)=y} p_X(x)

(sum over all values x for which h(x) = y)

Proof.

pY (y) = P (Y = y) = P {h(X) = y}

X
= P (X = x)
x:h(x)=y

X
= pX (x)
x:h(x)=y

Theorem. 1.3.2
Suppose X is a continuous RV with PDF f_X(x) and let Y = h(X), where h(x) is differentiable and monotonic, i.e., either strictly increasing or strictly decreasing. Then the PDF of Y is given by:

f_Y(y) = f_X(h^{−1}(y)) |(h^{−1})'(y)|

Proof. Assume h is increasing. Then

FY (y) = P (Y ≤ y) = P {h(X) ≤ y}

= P {X ≤ h−1 (y)}

= FX {h−1 (y)}


Figure 10: h increasing.

⇒ f_Y(y) = d/dy F_Y(y) = d/dy F_X(h^{−1}(y))   (use Chain Rule)

= f_X(h^{−1}(y)) (h^{−1})'(y).   (1)

Now consider the case of h(x) decreasing:

F_Y(y) = P(Y ≤ y) = P{h(X) ≤ y}

= P{X ≥ h^{−1}(y)}

= 1 − F_X(h^{−1}(y))

⇒ f_Y(y) = −f_X(h^{−1}(y)) (h^{−1})'(y)   (2)

Finally, observe that if h is increasing then h'(x), and hence (h^{−1})'(y), must be positive. Similarly, for h decreasing, (h^{−1})'(y) < 0. Hence (1) and (2) can be combined to give:

f_Y(y) = f_X(h^{−1}(y)) |(h^{−1})'(y)|

Examples:

1. Discrete transformation of single RV


Figure 11: h decreasing.

X ∼ P o(λ) and Y is X rounded to the nearest multiple of 10.


Possible values of Y are: 0, 10, 20, . . .

P(Y = 0) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4)

= e^{−λ} (1 + λ + λ²/2! + λ³/3! + λ⁴/4!);

P(Y = 10) = P(X = 5) + P(X = 6) + · · · + P(X = 14),

and so on.

2. Continuous transformation of single RV


Let Y = h(X) = aX + b, a ≠ 0.
To find h^{−1}(y) we solve for x in the equation y = h(x), i.e.,

y = ax + b  ⇒  x = (y − b)/a  ⇒  h^{−1}(y) = (y − b)/a = y/a − b/a

⇒ (h^{−1})'(y) = 1/a.

Hence, f_Y(y) = (1/|a|) f_X((y − b)/a).


Specifically:

1. Suppose Z ∼ N(0, 1) and let X = µ + σZ, σ > 0. Recall that Z has PDF:

φ(z) = (1/√(2π)) e^{−z²/2}.

Hence, f_X(x) = (1/σ) φ((x − µ)/σ) = (1/(√(2π)σ)) e^{−(x−µ)²/(2σ²)}, which is the N(µ, σ²) PDF.

2. Suppose X ∼ N(µ, σ²) and let Z = (X − µ)/σ. Find the PDF of Z.

Solution. Observe that Z = h(X) = X/σ − µ/σ,

⇒ f_Z(z) = σ f_X(µ + σz)

= σ (1/(√(2π)σ)) e^{−(µ+σz−µ)²/(2σ²)}

= (1/√(2π)) e^{−σ²z²/(2σ²)}

= (1/√(2π)) e^{−z²/2} = φ(z),

i.e., Z ∼ N(0, 1).
3. Suppose X ∼ Gamma(α, 1) and let Y = X/λ, λ > 0. Find the PDF of Y.

Solution. Since Y = (1/λ)X is a linear function, we have

f_Y(y) = λ f_X(λy) = λ (λy)^{α−1} e^{−λy} / Γ(α) = λ^α y^{α−1} e^{−λy} / Γ(α),

which is the Gamma(α, λ) PDF.


1.4 CDF transformation

Suppose X is a continuous RV with CDF F_X(x), which is increasing over the range of X. If U = F_X(X), then U ∼ U(0, 1).

Proof.
FU (u) = P (U ≤ u)

= P {FX (X) ≤ u}

= P {X ≤ FX−1 (u)}

= FX {FX−1 (u)}

= u, for 0 < u < 1.

This is simply the CDF of U (0, 1), so the result is proved.

The converse also applies. If U ∼ U(0, 1) and F is any CDF that is strictly increasing on its range, then X = F^{−1}(U) has CDF F(x), i.e.,

F_X(x) = P(X ≤ x) = P{F^{−1}(U) ≤ x} = P(U ≤ F(x)) = F(x), as required.

1.5 Non-monotonic transformations

Theorem 1.3.2 applies to monotonic (either strictly increasing or decreasing) transfor-


mations of a continuous RV.
In general, if h(x) is not monotonic, then h(X) may not even be a continuous RV. For example, if h(x) = [x] (the integer part of x), then the possible values of Y = h(X) are the integers, so Y is discrete.

However, it can be shown that h(X) is continuous if X is continuous and h(x) is


piecewise monotonic.

1.6 Moments of transformed RVs

Suppose X is a RV and let Y = h(X).

If we want to find E(Y), we can proceed as follows:

1. Find the distribution of Y = h(X) using the preceding methods.

2. Find

E(Y) = Σ_y y p(y)   (Y discrete),
E(Y) = ∫_{−∞}^∞ y f(y) dy   (Y continuous)

(that is, forget X ever existed!)

Or use
Theorem. 1.6.1
If X is a RV of either discrete or continuous type and h(x) is any transformation (not necessarily monotonic), then E{h(X)} (provided it exists) is given by:

E{h(X)} = Σ_x h(x) p(x)   (X discrete),
E{h(X)} = ∫_{−∞}^∞ h(x) f(x) dx   (X continuous).

Proof.

Not examinable.

Examples:

1. CDF transformation
Suppose U ∼ U (0, 1). How can we transform U to get an Exp(λ) RV?


Solution. Take X = F^{−1}(U), where F is the Exp(λ) CDF. Recall F(x) = 1 − e^{−λx} for the Exp(λ) distribution.
To find F^{−1}(u), we solve for x in F(x) = u, i.e.,

u = 1 − e^{−λx}

⇒ 1 − u = e^{−λx}

⇒ ln(1 − u) = −λx

⇒ x = −ln(1 − u)/λ.

Hence if U ∼ U(0, 1), it follows that X = −ln(1 − U)/λ ∼ Exp(λ).
Note: Y = −ln(U)/λ ∼ Exp(λ) as well, since both 1 − U and U have the U(0, 1) distribution.
This type of result is used to generate random numbers. That is, there are good methods for producing U(0, 1) (pseudo-random) numbers; to obtain Exp(λ) random numbers, we can just get U(0, 1) numbers and then calculate X = −log(U)/λ.
λ
2. Non-monotonic transformations
Suppose Z ∼ N(0, 1) and let X = Z²; h(Z) = Z² is not monotonic, so Theorem 1.3.2 does not apply. However we can proceed as follows:

F_X(x) = P(X ≤ x) = P(Z² ≤ x) = P(−√x ≤ Z ≤ √x) = Φ(√x) − Φ(−√x),

where

Φ(a) = ∫_{−∞}^a (1/√(2π)) e^{−t²/2} dt

is the N(0, 1) CDF. Differentiating,

f_X(x) = d/dx F_X(x) = φ(√x) (½ x^{−1/2}) − φ(−√x) (−½ x^{−1/2}),

where φ(a) = Φ'(a) = (1/√(2π)) e^{−a²/2} is the N(0, 1) PDF,

= ½ x^{−1/2} ( (1/√(2π)) e^{−x/2} + (1/√(2π)) e^{−x/2} )

= ((1/2)^{1/2}/√π) x^{1/2−1} e^{−x/2},   which is the Gamma(1/2, 1/2) PDF (recall Γ(1/2) = √π).

On the other hand, the distribution of Z² is also called the χ²_1 distribution. We have proved that it is the same as the Gamma(1/2, 1/2) distribution.

3. Moments of transformed RVs:

Example: if U ∼ U(0, 1) and Y = −log(U)/λ, then Y ∼ Exp(λ) ⇒ f(y) = λe^{−λy}, y > 0.
Can check:

E(Y) = ∫_0^∞ λy e^{−λy} dy = 1/λ.

4. Based on Theorem 1.6.1:

If U ∼ U(0, 1) and Y = −log(U)/λ, then according to Theorem 1.6.1,

E(Y) = ∫_0^1 (−log u / λ)(1) du

= −(1/λ) ∫_0^1 log u du

= −(1/λ) [u log u − u]_0^1

= −(1/λ)(0 − 1)

= 1/λ, as required.
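The worked examples above lend themselves to simulation. The seeded sketch below (an added illustration, not part of the notes) checks that −ln(1 − U)/λ behaves like an Exp(λ) variable with mean 1/λ, and that Z² has the Gamma(1/2, 1/2) = χ²_1 moments (mean 1, variance 2 — standard chi-square facts):

```python
import random
from math import isclose, log

random.seed(3)
N = 200_000
lam = 2.0

# Example 1: X = -ln(1 - U)/lam with U ~ U(0, 1) should be Exp(lam),
# so its mean should be 1/lam = 0.5.
xs = [-log(1 - random.random()) / lam for _ in range(N)]
assert isclose(sum(xs) / N, 1 / lam, abs_tol=0.01)

# Example 2: Z^2 with Z ~ N(0, 1) should be Gamma(1/2, 1/2) = chi^2_1,
# whose mean is 1 and variance is 2.
zs2 = [random.gauss(0, 1)**2 for _ in range(N)]
m = sum(zs2) / N
v = sum((z - m)**2 for z in zs2) / N
assert isclose(m, 1.0, abs_tol=0.02)
assert isclose(v, 2.0, abs_tol=0.1)
```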


There are some important consequences of Theorem 1.6.1:

1. If E(X) = µ and Var(X) = σ², and Y = aX + b for constants a, b, then E(Y) = aµ + b and Var(Y) = a²σ².

Proof. (Continuous case)


E(Y) = E(aX + b) = ∫_{−∞}^∞ (ax + b) f(x) dx

= a ∫_{−∞}^∞ x f(x) dx + b ∫_{−∞}^∞ f(x) dx = aµ + b,

since the first integral is E(X) and the second equals 1.

Var(Y) = E[(Y − E(Y))²]

= E[(aX + b − (aµ + b))²]

= E[a²(X − µ)²]

= a² E[(X − µ)²]

= a² Var(X)

= a²σ².

2. If X is a RV and h(X) is any function, then the MGF of Y = h(X), provided it exists, is

M_Y(t) = Σ_x e^{t h(x)} p(x)   (X discrete),
M_Y(t) = ∫_{−∞}^∞ e^{t h(x)} f(x) dx   (X continuous).


This gives us another way to find the distribution of Y = h(X).


i.e., Find MY (t). If we recognise MY (t), then by uniqueness we can conclude that
Y has that distribution.

Examples

1. Suppose X is continuous with CDF F (x), and F (a) = 0, F (b) = 1; (a, b can be
±∞ respectively).
Let U = F(X). Observe that

M_U(t) = ∫_a^b e^{tF(x)} f(x) dx = [ (1/t) e^{tF(x)} ]_a^b = (e^{tF(b)} − e^{tF(a)})/t = (e^t − 1)/t,

which is the U(0, 1) MGF.
2. Suppose X ∼ U(0, 1), and let Y = −log(X)/λ. Then

M_Y(t) = ∫_0^1 e^{t(−log x / λ)} (1) dx

= ∫_0^1 x^{−t/λ} dx

= [ x^{1−t/λ} / (1 − t/λ) ]_0^1

= 1/(1 − t/λ)

= λ/(λ − t),

which is the MGF of the Exp(λ) distribution. Hence we can conclude that Y = −log(X)/λ ∼ Exp(λ).
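The MGF identification can also be checked by Monte Carlo: estimate E[e^{tY}] from simulated values of Y and compare with λ/(λ − t). A seeded, illustrative sketch (not part of the notes):

```python
import random
from math import exp, isclose, log

random.seed(4)
lam, t = 2.0, 0.5          # need t < lam for the MGF to exist
N = 200_000

# Y = -log(X)/lam with X ~ U(0, 1); the MGF derived above is lam/(lam - t).
mgf_est = sum(exp(t * (-log(random.random()) / lam)) for _ in range(N)) / N
assert isclose(mgf_est, lam / (lam - t), abs_tol=0.01)   # lam/(lam - t) = 4/3
```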


1.7 Multivariate distributions

Definition. 1.7.1
If X1 , X2 , . . . , Xr are discrete RV’s then, X = (X1 , X2 , . . . , Xr )T is called a discrete
random vector.
The probability function P (x) is:

P (x) = P (X = x) = P ({X1 = x1 } ∩ {X2 = x2 } ∩ · · · ∩ {Xr = xr }) ;

P (x) = P (x1 , x2 , . . . , xr )

1.7.1 Trinomial distribution

r=2
Consider a sequence of n independent trials where each trial produces:

Outcome 1: with prob π1


Outcome 2: with prob π2
Outcome 3: with prob 1 − π1 − π2

If X1, X2 are the numbers of occurrences of outcomes 1 and 2 respectively, then (X1, X2) have the trinomial distribution.
Parameters: π1 > 0, π2 > 0 and π1 + π2 < 1; n > 0 fixed
Possible Values: integers (x1 , x2 ) s.t. x1 ≥ 0, x2 ≥ 0, x1 + x2 ≤ n
Probability function:

P(x1, x2) = n!/(x1! x2! (n − x1 − x2)!) π1^{x1} π2^{x2} (1 − π1 − π2)^{n−x1−x2}

for x1, x2 ≥ 0, x1 + x2 ≤ n.
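The trial-by-trial description can be simulated directly; the seeded sketch below (an added illustration) checks the marginal means, using the fact (shown in an example below) that marginally X1 ∼ B(n, π1), so E(X1) = nπ1:

```python
import random
from math import isclose

random.seed(5)
n, pi1, pi2 = 20, 0.3, 0.5
N = 100_000

x1s, x2s = [], []
for _ in range(N):
    x1 = x2 = 0
    for _ in range(n):
        u = random.random()
        if u < pi1:            # outcome 1, probability pi1
            x1 += 1
        elif u < pi1 + pi2:    # outcome 2, probability pi2
            x2 += 1
    x1s.append(x1)
    x2s.append(x2)

# Marginally, X1 counts successes in n Bern(pi1) trials: E(X1) = n*pi1 = 6.
assert isclose(sum(x1s) / N, n * pi1, abs_tol=0.05)
assert isclose(sum(x2s) / N, n * pi2, abs_tol=0.05)
```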


1.7.2 Multinomial distribution


Parameters: n > 0, π = (π1, π2, . . . , πr)^T with πi > 0 and Σ_{i=1}^r πi = 1
Possible values: integer-valued (x1, x2, . . . , xr) s.t. xi ≥ 0 and Σ_{i=1}^r xi = n
Probability function:

P(x) = (n; x1, x2, . . . , xr) π1^{x1} π2^{x2} · · · πr^{xr}   for xi ≥ 0, Σ_{i=1}^r xi = n.

Remarks

1. Note (n; x1, x2, . . . , xr) := n!/(x1! x2! · · · xr!) is the multinomial coefficient.
2. Multinomial distribution is the generalisation of the binomial distribution to r
types of outcome.

3. The formulation differs from the binomial and trinomial cases in that the redundant count xr = n − (x1 + x2 + · · · + x_{r−1}) is included as an argument of P(x).

1.7.3 Marginal and conditional distributions

Consider a discrete random vector X = (X1, X2, . . . , Xr)^T and let

X1 = (X1, X2, . . . , X_{r1})^T  and  X2 = (X_{r1+1}, X_{r1+2}, . . . , Xr)^T,

so that X = (X1, X2), stacked as a column.

Definition. 1.7.2


If X has joint probability function P_X(x) = P_X(x1, x2), then the marginal probability function for X1 is:

P_{X1}(x1) = Σ_{x2} P_X(x1, x2).

Observe that:

P_{X1}(x1) = Σ_{x2} P_X(x1, x2) = Σ_{x2} P({X1 = x1} ∩ {X2 = x2}) = P(X1 = x1),

by the law of total probability. Hence the marginal probability function for X1 is just the probability function we would have if X2 was not observed.
The marginal probability function for X2 is:

P_{X2}(x2) = Σ_{x1} P_X(x1, x2).

Definition. 1.7.3
 
Suppose X = (X1, X2). If P_{X1}(x1) > 0, we define the conditional probability function of X2 | X1 by:

P_{X2|X1}(x2|x1) = P_X(x1, x2) / P_{X1}(x1).

Remarks

1.
P ({X1 = x1 } ∩ {X2 = x2 })
PX2 |X1 (x2 |x1 ) =
P (X1 = x1 )

= P (X2 = x2 |X1 = x1 ).

2. Easy to check PX2 |X1 (x2 |x1 ) is a proper probability function with respect to x2
for each fixed x1 such that PX1 (x1 ) > 0.

3. P_{X1|X2}(x1|x2) is defined by:

P_{X1|X2}(x1|x2) = P_X(x1, x2) / P_{X2}(x2).


Definition. 1.7.4 (Independence)


Discrete RV’s X1 , X2 , . . . , Xr are said to be independent if their joint probability func-
tion satisfies
P (x1 , x2 , . . . , xr ) = p1 (x1 )p2 (x2 ) . . . pr (xr )
for some functions p1 , p2 , . . . , pr and all (x1 , x2 , . . . , xr ).

Remarks

1. Observe that:

P_{X1}(x1) = Σ_{x2} Σ_{x3} · · · Σ_{xr} P(x1, . . . , xr)

= p1(x1) Σ_{x2} Σ_{x3} · · · Σ_{xr} p2(x2) · · · pr(xr)

= c1 p1(x1).

Hence p1(x1) ∝ P_{X1}(x1), and similarly pi(xi) ∝ P_{Xi}(xi). Moreover,

P_{X1|X2...Xr}(x1|x2, . . . , xr) = P(x1, . . . , xr) / P_{X2...Xr}(x2, . . . , xr)

= p1(x1) p2(x2) · · · pr(xr) / [ Σ_{x1} p1(x1) p2(x2) · · · pr(xr) ]

= p1(x1) / Σ_{x1} p1(x1)

= (1/c1) P_{X1}(x1) / [ (1/c1) Σ_{x1} P_{X1}(x1) ]

= P_{X1}(x1),

since Σ_{x1} P_{X1}(x1) = 1. That is, P_{X1|X2...Xr}(x1|x2, . . . , xr) = P_{X1}(x1).


2. Clearly independence implies

P_{Xi|X1...Xi−1,Xi+1...Xr}(xi|x1, . . . , xi−1, xi+1, . . . , xr) = P_{Xi}(xi).

Moreover, we have P_{X1|X2}(x1|x2) = P_{X1}(x1) for any partitioning X = (X1, X2), if X1, . . . , Xr are independent.

1.7.4 Continuous multivariate distributions

Definition. 1.7.5
The random vector (X1 , . . . , Xr )T is said to have a continuous multivariate distribution
with PDF f (x) if
P(X ∈ A) = ∫ · · · ∫_A f(x1, . . . , xr) dx1 · · · dxr

for any measurable set A.

Examples

1. Suppose (X1, X2) have the trinomial distribution with parameters n, π1, π2. Then the marginal distribution of X1 can be seen to be B(n, π1), and the conditional distribution of X2 | X1 = x1 is B(n − x1, π2/(1 − π1)).

Outcome 1: π1
Outcome 2: π2
Outcome 3: 1 − π1 − π2

Examples of Definition 1.7.5

1. If X1, X2 have PDF

f(x1, x2) = 1 for 0 < x1 < 1, 0 < x2 < 1;  0 otherwise,

the distribution is called uniform on (0, 1) × (0, 1).
It follows that P(X ∈ A) = Area(A) for any A ⊆ (0, 1) × (0, 1):


Figure 12: Area A

2. Uniform distribution on the unit disk:

f(x1, x2) = 1/π for x1² + x2² < 1;  0 otherwise.

3. The Dirichlet distribution is defined by the PDF

f(x1, x2, . . . , xr) = [ Γ(α1 + α2 + · · · + αr + α_{r+1}) / (Γ(α1)Γ(α2) · · · Γ(αr)Γ(α_{r+1})) ] × x1^{α1−1} x2^{α2−1} · · · xr^{αr−1} (1 − x1 − · · · − xr)^{α_{r+1}−1}

for x1, x2, . . . , xr > 0, Σ xi < 1, and parameters α1, α2, . . . , α_{r+1} > 0.

Recall the joint PDF:

P(X ∈ A) = ∫ · · · ∫_A f(x1, . . . , xr) dx1 · · · dxr.

Note: the joint PDF must satisfy:

1. f(x) ≥ 0 for all x;

2. ∫_{−∞}^∞ · · · ∫_{−∞}^∞ f(x1, . . . , xr) dx1 · · · dxr = 1.


Definition. 1.7.6
 
If X = (X1, X2) has joint PDF f(x) = f(x1, x2), then the marginal PDF of X1 is given by:

f_{X1}(x1) = ∫_{−∞}^∞ · · · ∫_{−∞}^∞ f(x1, . . . , x_{r1}, x_{r1+1}, . . . , xr) dx_{r1+1} dx_{r1+2} · · · dxr,

and for f_{X1}(x1) > 0, the conditional PDF is given by

f_{X2|X1}(x2|x1) = f(x1, x2) / f_{X1}(x1).

Remarks:

1. fX2 |X1 (x2 |x1 ) cannot be interpreted as the conditional PDF of X2 |{X1 = x1 }
because P (X1 = x1 ) = 0 for any continuous distribution.
Proper interpretation is the limit as δ → 0 in X2 |X1 ∈ B(x1 , δ).

2. fX2 (x2 ), fX1 |X2 (x1 |x2 ) are defined analogously.

Definition. 1.7.7 (Independence)


Continuous RV’s X1 , X2 , . . . , Xr are said to be (mutually) independent if their joint
PDF satisfies:
f (x1 , x2 , . . . , xr ) = f1 (x1 )f2 (x2 ) . . . fr (xr ),
for some functions f1 , f2 , . . . , fr and all x1 , . . . , xr

Remarks

1. Easy to check that if X1, . . . , Xr are independent then each fi(xi) = ci f_{Xi}(xi), where the constants satisfy c1 c2 · · · cr = 1.

2. If X1, X2, . . . , Xr are independent, then it can be checked that

f_{X1|X2}(x1|x2) = f_{X1}(x1)

for any partitioning of X = (X1, X2).

Examples


1. If (X1 , X2 ) has the uniform distribution on the unit disk, find:

(a) the marginal PDF fX1 (x1 )


(b) the conditional PDF fX2 |X1 (x2 |x1 ).
Solution. Recall

f(x1, x2) = 1/π for x1² + x2² < 1;  0 otherwise.

(a) f_{X1}(x1) = ∫_{−∞}^∞ f(x1, x2) dx2

= ∫_{−√(1−x1²)}^{√(1−x1²)} (1/π) dx2

= (2/π) √(1 − x1²) for −1 < x1 < 1;  0 otherwise,

i.e., a semi-circular distribution (ours has some distortion).
(b) The conditional density for X2 | X1 is:

f_{X2|X1}(x2|x1) = f(x1, x2) / f_{X1}(x1)

= 1/(2√(1 − x1²)) for −√(1 − x1²) < x2 < √(1 − x1²);  0 otherwise,

which is uniform, U(−√(1 − x1²), √(1 − x1²)).

2. If X1 , . . . , Xr are any independent continuous RV’s then their joint PDF is:

f (x1 , . . . , xr ) = fX1 (x1 )fX2 (x2 ) . . . fXr (xr ).


Figure 13: A graphic showing the conditional distribution and a semi-circular distri-
bution.

E.g. If X1, X2, . . . , Xr are independent Exp(λ) RVs, then

f(x1, . . . , xr) = Π_{i=1}^r λ e^{−λxi} = λ^r e^{−λ Σ_{i=1}^r xi},

for xi > 0, i = 1, . . . , r.

3. Suppose (X1, X2) are uniformly distributed on (0, 1) × (0, 1). That is,

f(x1, x2) = 1 for 0 < x1 < 1, 0 < x2 < 1;  0 otherwise.

Claim: X1, X2 are independent.

Proof.
Let

f1(x1) = 1 for 0 < x1 < 1;  0 otherwise,
f2(x2) = 1 for 0 < x2 < 1;  0 otherwise;

then f(x1, x2) = f1(x1)f2(x2), and the two variables are independent.

4. (X1, X2) is uniform on the unit disk. Are X1, X2 independent? NO.

Proof.
We know:

f_{X2|X1}(x2|x1) = 1/(2√(1 − x1²)) for −√(1 − x1²) < x2 < √(1 − x1²);  0 otherwise,

i.e., U(−√(1 − x1²), √(1 − x1²)).
On the other hand,

f_{X2}(x2) = (2/π)√(1 − x2²) for −1 < x2 < 1;  0 otherwise,

i.e., a semicircular distribution.
Hence f_{X2}(x2) ≠ f_{X2|X1}(x2|x1), so the variables cannot be independent.
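The unit-disk computations above can be checked by a seeded simulation (an added illustration, not part of the notes): rejection-sample the disk, then compare the empirical marginal density of X1 near 0 with (2/π)√(1 − 0²) = 2/π, and check that X2 given X1 ≈ 0 looks uniform (mean near 0):

```python
import random
from math import isclose, pi

random.seed(6)

def disk_point():
    """Rejection sampling: a uniform point on the unit disk."""
    while True:
        x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
        if x1 * x1 + x2 * x2 < 1:
            return x1, x2

N = 200_000
pts = [disk_point() for _ in range(N)]

# Marginal density of X1 at 0 should be about 2/pi (~ 0.6366).
h = 0.05
near0 = [x2 for x1, x2 in pts if abs(x1) < h]
assert isclose(len(near0) / N / (2 * h), 2 / pi, abs_tol=0.05)

# Given X1 near 0, X2 is roughly U(-1, 1), so its mean should be near 0.
assert abs(sum(near0) / len(near0)) < 0.05
```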

Definition. 1.7.8
The joint CDF of RVS’s X1 , X2 , . . . , Xr is defined by:

F (x1 , x2 , . . . , xr ) = P ({X1 ≤ x1 } ∩ {X2 ≤ x2 } ∩ . . . ∩ ({Xr ≤ xr }).

Remarks:

1. Definition applies to RV’s of any type, i.e., discrete, continuous, “hybrid”.

2. Marginal CDF of X1 = (X1 , . . . , Xs )T is FX1 (x1 , . . . , xs ) = FX (x1 , . . . , xs , ∞, ∞, . . . , ∞),


for s < r.
3. RV’s X1 , . . . , Xr are defined to be independent if

F (x1 , . . . , xr ) = FX1 (x1 )FX2 (x2 ) . . . FXr (xr ).

4. The definitions above are completely general. However, in practice it is usually


easier to work with the PDF/probability function for any given example.


1.8 Transformations of several RVs

Theorem. 1.8.1
If X1 , X2 are discrete with joint probability function P (x1 , x2 ), and Y = X1 + X2 , then:

1. Y has probability function


X
PY (y) = P (x, y − x).
x

2. If X1 , X2 are independent,
X
PY (y) = PX1 (x)PX2 (y − x)
x

Proof. (1)

P({Y = y}) = Σ_x P({Y = y} ∩ {X1 = x})   (law of total probability)

[If A is an event and B1, B2, . . . are events s.t. ∪_i Bi = S and Bi ∩ Bj = ∅ for j ≠ i, then P(A) = Σ_i P(A ∩ Bi) = Σ_i P(A|Bi) P(Bi).]

= Σ_x P({X1 + X2 = y} ∩ {X1 = x})

= Σ_x P({X2 = y − x} ∩ {X1 = x})

= Σ_x P(x, y − x).

Proof. (2)


Just substitute
P (x, y − x) = PX1 (x)PX2 (y − x).
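The discrete convolution formula of Theorem 1.8.1(2) is easy to implement; a small sketch (an added illustration), applied to two binomials with the same p — the sum should be binomial, as shown analytically in the example below:

```python
from math import comb, isclose

def convolve_pmf(p1, p2):
    """pmf of Y = X1 + X2 for independent RVs on {0, 1, ..., len - 1}:
    P_Y(y) = sum_x P_X1(x) P_X2(y - x)."""
    out = [0.0] * (len(p1) + len(p2) - 1)
    for x, a in enumerate(p1):
        for z, b in enumerate(p2):
            out[x + z] += a * b
    return out

def binom_pmf(n, p):
    """Full pmf of B(n, p) as a list indexed by x = 0, ..., n."""
    return [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]

# Sum of independent B(3, 0.4) and B(5, 0.4) should be B(8, 0.4).
y = convolve_pmf(binom_pmf(3, 0.4), binom_pmf(5, 0.4))
target = binom_pmf(8, 0.4)
assert all(isclose(a, b, abs_tol=1e-12) for a, b in zip(y, target))
```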

Theorem. 1.8.2
Suppose X1 , X2 are continuous with PDF, f (x1 , x2 ), and let Y = X1 + X2 . Then
Z ∞
1. fY (y) = f (x, y − x) dx.
−∞

2. If X1 , X2 are independent, then


Z ∞
fY (y) = fX1 (x)fX2 (y − x) dx.
−∞

Proof. (1)

F_Y(y) = P(Y ≤ y) = P(X1 + X2 ≤ y)

= ∫_{x1=−∞}^∞ ∫_{x2=−∞}^{y−x1} f(x1, x2) dx2 dx1.

Let x2 = t − x1, so dx2 = dt:

= ∫_{−∞}^∞ ∫_{−∞}^y f(x1, t − x1) dt dx1

= ∫_{−∞}^y ( ∫_{−∞}^∞ f(x1, t − x1) dx1 ) dt

⇒ f_Y(y) = F_Y'(y) = ∫_{−∞}^∞ f(x, y − x) dx.

Proof. (2)

Take f (x, y − x) = fX1 (x) fX2 (y − x) if X1 , X2 independent.


Figure 14: Theorem 1.8.2

Examples
1. Suppose X1 ∼ B(n1, p), X2 ∼ B(n2, p) independently. Find the probability function for Y = X1 + X2.
Solution.

P_Y(y) = Σ_x P_{X1}(x) P_{X2}(y − x)

= Σ_{x=max(0, y−n2)}^{min(n1, y)} C(n1, x) p^x (1 − p)^{n1−x} C(n2, y − x) p^{y−x} (1 − p)^{n2+x−y}

= p^y (1 − p)^{n1+n2−y} Σ_{x=max(0, y−n2)}^{min(n1, y)} C(n1, x) C(n2, y − x)

= C(n1 + n2, y) p^y (1 − p)^{n1+n2−y} Σ_{x=max(0, y−n2)}^{min(n1, y)} C(n1, x) C(n2, y − x) / C(n1 + n2, y)

= C(n1 + n2, y) p^y (1 − p)^{n1+n2−y},

since the last sum is a sum of hypergeometric probabilities, which equals 1. That is, Y ∼ B(n1 + n2, p).


2. Suppose Z1 ∼ N(0, 1), Z2 ∼ N(0, 1) independently. Let X = Z1 + Z2. Find the PDF of X.
   Solution.

    fX(x) = ∫_{−∞}^{∞} φ(z) φ(x − z) dz

          = ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} · (1/√(2π)) e^{−(x−z)²/2} dz

          = ∫_{−∞}^{∞} (1/(2π)) e^{−(z² + (x−z)²)/2} dz

          = ∫_{−∞}^{∞} (1/(2π)) e^{−(z − x/2)²} e^{−x²/4} dz
            (completing the square: z² + (x−z)² = 2(z − x/2)² + x²/2)

          = (1/√(4π)) e^{−x²/4} ∫_{−∞}^{∞} (1/√(2π · ½)) e^{−(z − x/2)²/(2 · ½)} dz
            (the integrand is the N(x/2, 1/2) PDF, so the integral equals 1)

          = (1/√(4π)) e^{−x²/4},

   i.e., X ∼ N(0, 2).
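A quick Monte Carlo sanity check of the conclusion (a sketch only; the seed and sample size are arbitrary choices, and the tolerances are loose):

```python
import random
from statistics import fmean, pvariance

random.seed(0)
# draw X = Z1 + Z2 with Z1, Z2 i.i.d. N(0, 1)
xs = [random.gauss(0, 1) + random.gauss(0, 1) for _ in range(100_000)]

m = fmean(xs)       # should be near 0
v = pvariance(xs)   # should be near 2, consistent with X ~ N(0, 2)
```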
Theorem. 1.8.3 (Ratio of continuous RVs)
Suppose X1, X2 are continuous with joint PDF f(x1, x2), and let Y = X2/X1. Then Y has PDF

    fY(y) = ∫_{−∞}^{∞} |x| f(x, yx) dx.

If X1, X2 are independent, we obtain

    fY(y) = ∫_{−∞}^{∞} |x| fX1(x) fX2(yx) dx.


Proof.

    FY(y) = P({Y ≤ y})
          = P({Y ≤ y} ∩ {X1 < 0}) + P({Y ≤ y} ∩ {X1 > 0})
          = P({X2 ≥ yx1} ∩ {X1 < 0}) + P({X2 ≤ yx1} ∩ {X1 > 0})
          = ∫_{−∞}^{0} ∫_{x1 y}^{∞} f(x1, x2) dx2 dx1 + ∫_{0}^{∞} ∫_{−∞}^{x1 y} f(x1, x2) dx2 dx1.

Substitute x2 = t x1 in both inner integrals, so dx2 = x1 dt:

          = ∫_{−∞}^{0} ∫_{y}^{−∞} x1 f(x1, t x1) dt dx1 + ∫_{0}^{∞} ∫_{−∞}^{y} x1 f(x1, t x1) dt dx1
          = ∫_{−∞}^{0} ∫_{−∞}^{y} (−x1) f(x1, t x1) dt dx1 + ∫_{0}^{∞} ∫_{−∞}^{y} x1 f(x1, t x1) dt dx1
          = ∫_{−∞}^{0} ∫_{−∞}^{y} |x1| f(x1, t x1) dt dx1 + ∫_{0}^{∞} ∫_{−∞}^{y} |x1| f(x1, t x1) dt dx1
          = FY(y)

    ⇒ fY(y) = F′Y(y) = ∫_{−∞}^{∞} |x1| f(x1, y x1) dx1.

Example

1. If Z ∼ N(0, 1) and X ∼ χ²_k independently, then T = Z/√(X/k) is said to have the t-distribution with k degrees of freedom. Derive the PDF of T.


Solution. Step 1:
Let V = √(X/k). We need to find the PDF of V. Recall χ²_k is Gamma(k/2, 1/2), so

    fX(x) = ((1/2)^{k/2} / Γ(k/2)) x^{k/2 − 1} e^{−x/2},   for x > 0.

Now, v = h(x) = √(x/k), so h⁻¹(v) = kv² and (h⁻¹)′(v) = 2kv. Hence

    fV(v) = fX(h⁻¹(v)) |(h⁻¹)′(v)|
          = ((1/2)^{k/2} / Γ(k/2)) (kv²)^{k/2 − 1} e^{−kv²/2} (2kv)
          = (2(k/2)^{k/2} / Γ(k/2)) v^{k−1} e^{−kv²/2},   v > 0.

Step 2:
Apply Theorem 1.8.3 to find the PDF of T = Z/V:

    fT(t) = ∫_{−∞}^{∞} |v| fV(v) fZ(vt) dv
          = ∫_{0}^{∞} v · (2(k/2)^{k/2} / Γ(k/2)) v^{k−1} e^{−kv²/2} · (1/√(2π)) e^{−t²v²/2} dv
          = ((k/2)^{k/2} / (Γ(k/2)√(2π))) ∫_{0}^{∞} v^{k−1} e^{−(k+t²)v²/2} · 2v dv.

Substitute u = v², so du = 2v dv:

          = ((k/2)^{k/2} / (Γ(k/2)√(2π))) ∫_{0}^{∞} u^{(k−1)/2} e^{−(k+t²)u/2} du
          = ((k/2)^{k/2} / (Γ(k/2)√(2π))) · Γ((k+1)/2) / ((k+t²)/2)^{(k+1)/2}
            (using ∫_{0}^{∞} u^{α−1} e^{−λu} du = Γ(α)/λ^α with α = (k+1)/2, λ = (k+t²)/2)

          = (Γ((k+1)/2) / (Γ(k/2)√(kπ))) (1 + t²/k)^{−(k+1)/2},   −∞ < t < ∞.

Remarks:

1. If we take k = 1, we obtain
       f(t) ∝ 1/(1 + t²);
   that is, t₁ is the Cauchy distribution. One can also check that
       Γ(1) / (Γ(1/2)√π) = 1/π
   as required, so that f(t) = (1/π) · 1/(1 + t²).

2. As k → ∞, t_k → N(0, 1). To see this directly, consider the limit of the t-density as k → ∞, for t fixed:

       lim_{k→∞} (1 + t²/k)^{−(k+1)/2} = lim_{k→∞} (1 + t²/k)^{−k/2}     (let ℓ = k/2)
                                       = lim_{ℓ→∞} (1 + (t²/2)/ℓ)^{−ℓ} = e^{−t²/2},

   recalling the standard limit
       lim_{n→∞} (1 + a/n)^n = e^a.
   Together with the convergence of the normalising constant to 1/√(2π), this gives

       fT(t) → (1/√(2π)) e^{−t²/2}   as k → ∞.

1.8.1 Multivariate transformation rule

Suppose h : R^r → R^r is continuously differentiable. That is,

    h(x1, x2, . . . , xr) = (h1(x1, . . . , xr), h2(x1, . . . , xr), . . . , hr(x1, . . . , xr)),

where each hi(x) is continuously differentiable. Let H be the r × r matrix of partial derivatives with (i, j)th entry

    H_{ij} = ∂hi/∂xj,   i, j = 1, . . . , r.

If H is invertible for all x, then there exists an inverse mapping g : R^r → R^r with the property that

    g(h(x)) = x.

It can be proved that the matrix of partial derivatives G = [∂gi/∂yj] satisfies G = H⁻¹.
Theorem. 1.8.4
Suppose X1 , X2 , . . . , Xr have joint PDF fX (x1 , . . . , xr ), and let h, g, G be as above.
If Y = h(X), then Y has joint PDF

    fY(y1, y2, . . . , yr) = fX(g(y)) |det G(y)|.

Remark:
Can sometimes use H −1 instead of G, but need to be careful to evaluate H −1 (x) at
x = h−1 (y) = g(y).


1.8.2 Method of regular transformations

Suppose X = (X1, . . . , Xr)ᵀ and h : R^r → R^s, where s < r, and let

    Y = (Y1, . . . , Ys)ᵀ = h(X).

One approach is as follows:

1. Choose a function, d : Rr → Rr−s and let Z = d(X) = (Z1 , Z2 , . . . , Zr−s )T .

2. Apply theorem 1.8.4 to (Y1 , Y2 , . . . , Ys , Z1 , Z2 , . . . , Zr−s )T .

3. Integrate over Z1 , Z2 , . . . , Zr−s to get the marginal PDF for Y.

Examples
T
Suppose Z1 ∼ N(0, p 1), Z2 ∼ N(0, 1) independently. Consider h(z1 , z2 ) = (r, θ) , where
r = h1 (z1 , z2 ) = z12 + z22 , and
 
z2


arctan for z1 > 0



 z1



  
z2


arctan + π for z1 < 0


θ = h2 (z1 , z2 ) = z1



 π
sgnz2 for z1 = 0, z2 6= 0


2









0 for z1 = z2 = 0.

 
2 π 3π
h maps R → [0, ∞) × − , .
2 2
The inverse mapping, g, can be seen to be:
   
g1 (r, θ) r cos θ
g(r, θ) = =
g2 (r, θ) r sin θ

To find distribution of (R, θ):


Step 1

    G(r, θ) = [ ∂g1/∂r   ∂g1/∂θ
                ∂g2/∂r   ∂g2/∂θ ],

with

    ∂g1/∂r = ∂(r cos θ)/∂r = cos θ,     ∂g1/∂θ = ∂(r cos θ)/∂θ = −r sin θ,
    ∂g2/∂r = ∂(r sin θ)/∂r = sin θ,     ∂g2/∂θ = ∂(r sin θ)/∂θ = r cos θ.

    ⇒ G = [ cos θ   −r sin θ
            sin θ    r cos θ ]

    ⇒ det(G) = r cos²θ − (−r sin²θ) = r cos²θ + r sin²θ = r.

Step 2
Now apply Theorem 1.8.4. Recall,

    fZ(z1, z2) = (1/√(2π)) e^{−z1²/2} × (1/√(2π)) e^{−z2²/2} = (1/(2π)) e^{−(z1² + z2²)/2}.



    fR,θ(r, θ) = fZ(g(r, θ)) |det G(r, θ)|
               = fZ(r cos θ, r sin θ) |r|
               = (1/(2π)) r e^{−r²/2}   for r ≥ 0, −π/2 ≤ θ < 3π/2   (and 0 otherwise)
               = fθ(θ) fR(r),

where

    fθ(θ) = 1/(2π)   for −π/2 ≤ θ < 3π/2   (0 otherwise),

    fR(r) = r e^{−r²/2}   for r ≥ 0.

Hence, if Z1, Z2 are i.i.d. N(0, 1) and we transform to polar coordinates (R, θ), we find:

1. R, θ are independent.
2. θ ∼ U(−π/2, 3π/2).
3. R has the distribution called the Rayleigh distribution.

It is easy to check that R² ∼ Exp(1/2) ≡ Gamma(1, 1/2) ≡ χ²₂.
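These conclusions are easy to check by simulation. A sketch (plain Python; `math.atan2` returns an angle in (−π, π] rather than the [−π/2, 3π/2) convention above, but the uniformity and independence conclusions are unaffected; the seed and sample size are arbitrary):

```python
import math
import random

random.seed(1)
n = 50_000
r2_vals, thetas = [], []
for _ in range(n):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    r2_vals.append(z1**2 + z2**2)          # R^2, which should be Exp(1/2), i.e. chi^2 on 2 df
    thetas.append(math.atan2(z2, z1))      # the polar angle

mean_r2 = sum(r2_vals) / n                 # Exp(1/2) has mean 2
frac_upper = sum(t > 0 for t in thetas) / n  # a uniform angle is above the axis half the time
```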


Example
Suppose X1 ∼ Gamma(α, λ) and X2 ∼ Gamma(β, λ), independently. Find the distribution of Y = X1/(X1 + X2).
Solution. Step 1: try using Z = X1 + X2.
Step 2: If h(x1, x2) = (x1/(x1 + x2), x1 + x2)ᵀ, i.e. y = x1/(x1 + x2) and z = x1 + x2, then

    g(y, z) = (yz, (1 − y)z)ᵀ,

    G = [  z     y
          −z   1 − y ]

⇒ det(G) = z(1 − y) − (−zy) = z(1 − y) + zy = z,

where z > 0. Why?

    fY,Z(y, z) = fX1(yz) fX2((1 − y)z) z
               = (λ^α/Γ(α)) (yz)^{α−1} e^{−λyz} · (λ^β/Γ(β)) ((1 − y)z)^{β−1} e^{−λ(1−y)z} · z
               = (1/(Γ(α)Γ(β))) λ^{α+β} y^{α−1} (1 − y)^{β−1} z^{α+β−1} e^{−λz}.

Step 3:

    fY(y) = ∫_{0}^{∞} f(y, z) dz
          = (Γ(α+β)/(Γ(α)Γ(β))) y^{α−1} (1 − y)^{β−1} ∫_{0}^{∞} (λ^{α+β} z^{α+β−1}/Γ(α+β)) e^{−λz} dz
          = (Γ(α+β)/(Γ(α)Γ(β))) y^{α−1} (1 − y)^{β−1},   for 0 < y < 1,

since the last integrand is the Gamma(α+β, λ) PDF; that is, Y ∼ Beta(α, β).

Exercise: Justify the range of values for y.
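The Beta conclusion can be checked by simulation. A hedged sketch (plain Python; note `random.gammavariate(shape, scale)` is parameterised by scale = 1/λ, and α = 2, β = 3, λ = 1.5 are arbitrary illustrative values):

```python
import random

random.seed(2)
alpha, beta, lam = 2.0, 3.0, 1.5
n = 50_000

ys = []
for _ in range(n):
    x1 = random.gammavariate(alpha, 1 / lam)   # X1 ~ Gamma(alpha, lam)
    x2 = random.gammavariate(beta, 1 / lam)    # X2 ~ Gamma(beta, lam)
    ys.append(x1 / (x1 + x2))

# Beta(alpha, beta) has mean alpha/(alpha + beta) = 0.4 here
mean_y = sum(ys) / n
in_unit_interval = all(0 < y < 1 for y in ys)
```

The simulated Y values also illustrate the exercise: the ratio necessarily lies in (0, 1) because X1 and X2 are positive.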

1.9 Moments

Suppose X1, X2, . . . , Xr are RVs. If Y = h(X) is defined by a real-valued function h, then to find E(Y) we can:

1. Find the distribution of Y.

2. Calculate

       E(Y) = ∫_{−∞}^{∞} y fY(y) dy    (continuous), or
       E(Y) = Σ_y y p(y)               (discrete).


Theorem. 1.9.1
If h and X1, . . . , Xr are as above, then, provided it exists,

    E(h(X)) = Σ_{x1} Σ_{x2} · · · Σ_{xr} h(x1, . . . , xr) P(x1, . . . , xr)
              if X1, . . . , Xr are discrete;

    E(h(X)) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x1, . . . , xr) f(x1, . . . , xr) dx1 . . . dxr
              if X1, . . . , Xr are continuous.

Theorem. 1.9.2
Suppose X1 , . . . , Xr are RVs, h1 , . . . , hk are real-valued functions and a1 , . . . , ak are
constants. Then, provided it exists,

    E[a1 h1(X) + a2 h2(X) + · · · + ak hk(X)] = a1 E[h1(X)] + a2 E[h2(X)] + · · · + ak E[hk(X)].

Proof. (Continuous case)

    E[a1 h1(X) + · · · + ak hk(X)]
        = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} {a1 h1(x) + · · · + ak hk(x)} fX(x) dx1 dx2 . . . dxr
        = a1 ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h1(x) fX(x) dx1 . . . dxr
          + a2 ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h2(x) fX(x) dx1 . . . dxr + · · ·
          + ak ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} hk(x) fX(x) dx1 . . . dxr
        = a1 E[h1(X)] + a2 E[h2(X)] + · · · + ak E[hk(X)],

as required.

Corollary.

Provided it exists,

    E[a1 X1 + a2 X2 + · · · + ar Xr] = a1 E[X1] + a2 E[X2] + · · · + ar E[Xr].


Definition. 1.9.1.
If X1 , X2 are RVs with E[Xi ] = µi and Var(Xi ) = σi2 , i = 1, 2 we define
 
    σ12 = Cov(X1, X2) = E[(X1 − µ1)(X2 − µ2)] = E[X1 X2] − µ1 µ2,

    ρ12 = Corr(X1, X2) = σ12/(σ1 σ2).

Remark:
Cov(X, X) = Var(X) = E[X 2 ] − (E[X])2 .


In some contexts, it is convenient to use the notation

σii = Var(Xi ) instead of σi2 .

Theorem. 1.9.3
Suppose X1 , X2 , . . . , Xr are RVs with E[Xi ] = µi , Cov(Xi , Xj ) = σij , and let a1 , a2 , . . . , ar ,
b1 , b2 , . . . , br be constants. Then
    Cov( Σ_{i=1}^{r} ai Xi , Σ_{j=1}^{r} bj Xj ) = Σ_{i=1}^{r} Σ_{j=1}^{r} ai bj σij.


Proof.

    Cov(Σ_i ai Xi, Σ_j bj Xj)
        = E[ {Σ_i ai Xi − E(Σ_i ai Xi)} {Σ_j bj Xj − E(Σ_j bj Xj)} ]
        = E[ {Σ_i ai Xi − Σ_i ai µi} {Σ_j bj Xj − Σ_j bj µj} ]
        = E[ {Σ_i ai (Xi − µi)} {Σ_j bj (Xj − µj)} ]
        = E[ Σ_i Σ_j ai bj (Xi − µi)(Xj − µj) ]
        = Σ_i Σ_j ai bj E[(Xi − µi)(Xj − µj)]     (Theorem 1.9.2)
        = Σ_i Σ_j ai bj σij,   as required.

Corollary. 1
Under the above assumptions,

    Var(Σ_{i=1}^{r} ai Xi) = Σ_{i=1}^{r} Σ_{j=1}^{r} ai aj σij = Σ_{i=1}^{r} ai² σi² + 2 Σ_{i<j} ai aj σij

(the variance of a sum is its covariance with itself).
Corollary. 2
ρ = Corr(X1, X2) satisfies |ρ| ≤ 1; and |ρ| = 1 implies that X2 is a linear function of X1.

Proof. It follows from Corollary 1 that

    Var(X1/σ1 + X2/σ2) = 2(1 + ρ),

and similarly, we can show that

    Var(X1/σ1 − X2/σ2) = 2(1 − ρ).

Since variances are non-negative,

    2(1 + ρ) ≥ 0 ⇒ ρ ≥ −1   and   2(1 − ρ) ≥ 0 ⇒ ρ ≤ 1.

Hence, −1 ≤ ρ ≤ 1, as required.

Now consider the quadratic q(t) = Var(X1 − tX2) = σ1² − 2tσ12 + t²σ2². Since q(t) ≥ 0 for all t, its discriminant satisfies ∆ = 4σ12² − 4σ1²σ2² ≤ 0. If |ρ| = 1, then ∆ = 0, so there is a single t0 such that

    0 = q(t0) = Var(X1 − t0 X2),

i.e., X1 = t0 X2 + c with probability 1.
Theorem. 1.9.4
If X1 , X2 are independent, and Var(X1 ), Var(X2 ) exist, then Cov(X1 , X2 ) = 0.

Proof. (Continuous)


Cov(X1 , X2 ) = E (X1 − µ1 )(X2 − µ2 )

Z ∞ Z ∞
= (x1 − µ1 )(x2 − µ2 )f (x1 , x2 ) dx1 dx2
−∞ −∞

Z ∞ Z ∞
= (x1 − µ1 )(x2 − µ2 )fX1 (x1 )fX2 (x2 ) dx1 dx2 (since X1 , X2 independent)
−∞ −∞

Z ∞  Z ∞ 
= (x1 − µ1 )fX1 (x1 ) dx1 (x2 − µ2 )fX2 (x2 ) dx2
−∞ −∞

=0×0

= 0.


Remark:
But the converse does NOT apply, in general!
That is, Cov(X1 , X2 ) = 0 6⇒ X1 , X2 independent.

Definition. 1.9.2
If X1, X2 are RVs, we define the symbol E[X1|X2] to be the expectation of X1 calculated with respect to the conditional distribution of X1|X2, i.e.,

    E[X1|X2] = Σ_{x1} x1 P_{X1|X2}(x1|x2)              if X1|X2 is discrete,

    E[X1|X2] = ∫_{−∞}^{∞} x1 f_{X1|X2}(x1|x2) dx1      if X1|X2 is continuous.

Theorem. 1.9.5
Provided the relevant moments exist,

    E(X1) = E_{X2}[E(X1|X2)],

    Var(X1) = E_{X2}[Var(X1|X2)] + Var_{X2}[E(X1|X2)].


Proof. 1. (Continuous case)

    E_{X2}[E(X1|X2)] = ∫_{−∞}^{∞} E(X1|x2) fX2(x2) dx2
                     = ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} x1 f(x1|x2) dx1 ] fX2(x2) dx2
                     = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x1 f(x1|x2) fX2(x2) dx1 dx2
                     = ∫_{−∞}^{∞} x1 [ ∫_{−∞}^{∞} f(x1, x2) dx2 ] dx1     (the inner integral is fX1(x1))
                     = ∫_{−∞}^{∞} x1 fX1(x1) dx1
                     = E(X1).

1.9.1 Moment generating functions

Definition. 1.9.3
If X1, X2, . . . , Xr are RVs, then the joint MGF (provided it exists) is given by

    MX(t) = MX(t1, t2, . . . , tr) = E[e^{t1 X1 + t2 X2 + · · · + tr Xr}].
Theorem. 1.9.6
If X1 , X2 , . . . , Xr are mutually independent, then the joint MGF satisfies

MX (t1 , . . . , tr ) = MX1 (t1 )MX2 (t2 ) . . . MXr (tr ),

provided it exists.


Proof. (Theorem 1.9.6, continuous case)

Z ∞ Z ∞
MX (t1 , . . . , tr ) = ... et1 x1 +t2 x2 +···+tr xr fX (x1 , . . . , xr ) dx1 . . . dxr
−∞ −∞

Z ∞ Z ∞
= ... et1 x1 et2 x2 . . . etr xr fX1 (x1 )fX2 (x2 ) . . . fXr (xr ) dx1 . . . dxr
−∞ −∞

Z ∞  Z ∞  Z ∞ 
t1 x1 t2 x2 tr xr
= e fX1 (x1 ) dx1 e fX2 (x2 ) dx2 . . . e fXr (xr ) dxr
−∞ −∞ −∞

= MX1 (t1 )MX2 (t2 ) . . . MXr (tr ), as required.

We saw previously that if Y = h(X), then we can find MY (t) = E[eth(X) ] from the
joint distribution of X1 , . . . , Xr without calculating fY (y) explicitly.
A simple, but important case is:

Y = X1 + X2 + · · · + Xr .

Theorem. 1.9.7
Suppose X1 , . . . , Xr are RVs and let Y = X1 + X2 + · · · + Xr . Then (assuming MGFs
exist):

1. MY (t) = MX (t, t, t, . . . , t).

2. If X1 . . . Xr are independent then MY (t) = MX1 (t)MX2 (t) . . . MXr (t).

3. If X1, . . . , Xr are independent and identically distributed with common MGF MX(t), then:

       MY(t) = [MX(t)]^r.


Proof.

    MY(t) = E[e^{tY}]
          = E[e^{t(X1 + X2 + · · · + Xr)}]
          = E[e^{tX1 + tX2 + · · · + tXr}]
          = MX(t, t, t, . . . , t)     (using Definition 1.9.3).

For parts (2) and (3), substitute into (1) and use Theorem 1.9.6.

Examples
Consider RVs X, V , defined by V ∼ Gamma(α, λ) and the conditional distribution of
X|V ∼Po(V ).
Find E(X) and Var(X).
Solution. Use

    E(X) = E{E(X|V)},
    Var(X) = E_V{Var(X|V)} + Var_V{E(X|V)}.

Here E(X|V) = V and Var(X|V) = V, so

    E(X) = E{E(X|V)} = E(V) = α/λ.

    Var(X) = E_V{Var(X|V)} + Var_V{E(X|V)}
           = E_V(V) + Var_V(V)
           = α/λ + α/λ²
           = (α/λ)(1 + 1/λ) = α(1 + λ)/λ².


Remark
The marginal distribution of X is sometimes called the negative binomial distribution. In particular, when α is an integer, it corresponds to the definition given previously, with p = λ/(1 + λ).
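The moment formulas for this Gamma–Poisson mixture can be checked by simulation. A sketch (plain Python; the Poisson sampler uses Knuth's multiplication method, which is adequate for the moderate means arising here, and α = 2, λ = 1 are arbitrary choices):

```python
import math
import random

random.seed(3)

def poisson(mu):
    """Sample from Poisson(mu) by Knuth's method (fine for moderate mu)."""
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

alpha, lam = 2.0, 1.0
n = 50_000
# V ~ Gamma(alpha, lam); X | V ~ Poisson(V)
xs = [poisson(random.gammavariate(alpha, 1 / lam)) for _ in range(n)]

mean_x = sum(xs) / n                              # theory: alpha/lam = 2
var_x = sum((x - mean_x) ** 2 for x in xs) / n    # theory: alpha(1 + lam)/lam^2 = 4
```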
Examples

1. Suppose X ∼ Bernoulli with parameter p. Then MX(t) = 1 + p(e^t − 1).
   Now suppose X1, X2, . . . , Xn are i.i.d. Bernoulli with parameter p and

       Y = X1 + X2 + · · · + Xn;

   then MY(t) = [1 + p(e^t − 1)]^n, which agrees with the formula previously given for the binomial distribution.

2. Suppose X1 ∼ N(µ1, σ1²) and X2 ∼ N(µ2, σ2²) independently. Find the MGF of Y = X1 + X2.
   Solution. Recall that

       MX1(t) = e^{tµ1 + t²σ1²/2}   and   MX2(t) = e^{tµ2 + t²σ2²/2},

   so

       MY(t) = MX1(t) MX2(t) = e^{t(µ1+µ2) + t²(σ1²+σ2²)/2}

   ⇒ Y ∼ N(µ1 + µ2, σ1² + σ2²).


1.9.2 Marginal distributions and the MGF

To find the moment generating function of the marginal distribution of any set of
components of X, set to 0 the complementary elements of t in MX (t).
Let X = (X1, X2)ᵀ and t = (t1, t2)ᵀ. Then

    MX1(t1) = MX( (t1ᵀ, 0ᵀ)ᵀ ).

To see this result: Note that if A is a constant matrix, and b is a constant vector, then
T
MAX+b (t) = et b MX (AT t).


Proof.

    M_{AX+b}(t) = E[e^{tᵀ(AX+b)}] = e^{tᵀb} E[e^{tᵀAX}]
                = e^{tᵀb} E[e^{(Aᵀt)ᵀX}] = e^{tᵀb} MX(Aᵀt).

Now partition

    t_{r×1} = (t1ᵀ, t2ᵀ)ᵀ

and let

    A_{l×r} = ( I_{l×l}   0_{l×m} ),

where t1 is l × 1 and t2 is m × 1. Note that

    AX = ( I_{l×l}   0_{l×m} ) (X1ᵀ, X2ᵀ)ᵀ = X1,

and

    Aᵀ t1 = (t1ᵀ, 0ᵀ)ᵀ.

Hence,

    MX1(t1) = M_{AX}(t1) = MX(Aᵀt1) = MX( (t1ᵀ, 0ᵀ)ᵀ ),

as required.
Note that similar results hold for more than two random subvectors.
The major limitation of the MGF is that it may not exist. The characteristic function, on the other hand, is defined for all distributions. Its definition is similar to that of the MGF, with it in place of t, where i = √(−1); the properties of the characteristic function are similar to those of the MGF, but using it requires some familiarity with complex analysis.

1.9.3 Vector notation

Consider the random vector

X = (X1 , X2 , . . . , Xr )T ,

with E[Xi ] = µi , Var(Xi ) = σi2 = σii , Cov(Xi , Xj ) = σij .


Define the mean vector by:

    µ = E(X) = (µ1, µ2, . . . , µr)ᵀ,

and the variance matrix (covariance matrix) by:

    Σ = Var(X) = [σij],   i, j = 1, . . . , r.

Finally, the correlation matrix R is defined as the r × r matrix with 1's on the diagonal and (i, j)th off-diagonal entry

    ρij = σij / (√σii √σjj).

Now suppose that a = (a1, . . . , ar)ᵀ is a vector of constants and observe that:

    aᵀx = a1 x1 + a2 x2 + · · · + ar xr = Σ_{i=1}^{r} ai xi.

Theorem. 1.9.8
Suppose X has E(X) = µ and Var(X) = Σ, and let a, b ∈ Rr be fixed vectors. Then,

1. E(aT X) = aT µ

2. Var(aᵀX) = aᵀΣa

3. Cov(aT X, bT X) = aT Σb

Remark
It is easy to check that this is just a re-statement of Theorem 1.9.3 using matrix
notation.
Theorem. 1.9.9
Suppose X is a random vector with E(X) = µ, Var(X) = Σ, and let Ap×r and b ∈ Rp
be fixed.
If Y = AX + b, then

E(Y) = Aµ + b and Var(Y) = AΣAT .
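Theorem 1.9.9 can be illustrated with a small numeric example (a sketch only: the particular A, b, µ and Σ below are invented for illustration, and the matrix arithmetic is written out in bare Python lists to keep it self-contained):

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

# hypothetical example: Y = A X + b
A = [[1.0,  1.0],
     [1.0, -1.0]]          # rows give Y1 = X1 + X2, Y2 = X1 - X2
b = [0.5, -0.5]
mu = [[1.0], [2.0]]        # E(X) as a column vector
Sigma = [[2.0, 0.5],
         [0.5, 1.0]]       # Var(X)

mean_Y = [row[0] + bi for row, bi in zip(matmul(A, mu), b)]   # A mu + b
var_Y = matmul(matmul(A, Sigma), transpose(A))                # A Sigma A^T
```

Checking against Corollary 1: Var(X1 + X2) = 2 + 1 + 2(0.5) = 4 and Var(X1 − X2) = 2 + 1 − 2(0.5) = 2, which should be the diagonal of AΣAᵀ.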


Remark
This is also a re-statement of previously established results. To see this, observe that if aiᵀ is the ith row of A, then Yi = aiᵀX and, moreover, the (i, j)th element of AΣAᵀ is

    aiᵀΣaj = Cov(aiᵀX, ajᵀX) = Cov(Yi, Yj).

1.9.4 Properties of variance matrices

If Σ = Var(X) for some random vector X = (X1 , X2 , . . . , Xr )T , then it must satisfy


certain properties.
Since Cov(Xi , Xj ) = Cov(Xj , Xi ), it follows that Σ is a square (r × r) symmetric
matrix.
Definition. 1.9.4
The square, symmetric matrix M is said to be positive definite [non-negative definite]
if aT M a > 0 [≥ 0] for every vector a ∈ Rr s.t. a 6= 0.

=⇒ It is necessary and sufficient that Σ be non-negative definite in order that it can


be a variance matrix.
=⇒ To see the necessity of this condition, consider the linear combination aT X.
By Theorem 1.9.8, we have
0 ≤ Var(aT X) = aT Σa for every a;
=⇒ Σ must be non-negative definite.
Suppose λ is an eigenvalue of Σ, and let a be the corresponding eigenvector. Then
Σa = λa
=⇒ aT Σa = λaT a = λ||a||2 .

Hence Σ is non-negative definite (positive definite) iff its eigenvalues are all non-
negative (positive).
If Σ is non-negative definite but not positive definite, then there must be at least one
zero eigenvalue. Let a be the corresponding eigenvector.
=⇒ a 6= 0 but aT Σa = 0. That is, Var(aT X) = 0 for that a.
=⇒ the distribution of X is degenerate in the sense that either one of the Xi ’s is
constant or else a linear combination of the other components.


Finally, recall that if λ1, . . . , λr are the eigenvalues of Σ, then det(Σ) = Π_{i=1}^{r} λi

⇒ det(Σ) > 0 for Σ positive definite,

and

det(Σ) = 0 for Σ non-negative definite but not positive definite.

1.10 The multivariable normal distribution

Definition. 1.10.1
The random vector X = (X1 , . . . , Xr )T is said to have the r-dimensional multivariate
normal distribution with parameters µ ∈ Rr and Σr×r positive definite, if it has PDF
    fX(x) = (1 / ((2π)^{r/2} |Σ|^{1/2})) e^{−(x−µ)ᵀΣ⁻¹(x−µ)/2}.

We write X ∼ Nr (µ, Σ).

Examples

1. r = 2: the bivariate normal distribution.

   Let µ = (µ1, µ2)ᵀ and

       Σ = [ σ1²      ρσ1σ2
             ρσ1σ2    σ2²   ]

   ⇒ |Σ| = σ1²σ2²(1 − ρ²) and

       Σ⁻¹ = (1/(σ1²σ2²(1 − ρ²))) [  σ2²      −ρσ1σ2
                                    −ρσ1σ2     σ1²   ]

   ⇒ f(x1, x2) = (1/(2πσ1σ2√(1 − ρ²)))
                 × exp{ −(1/(2(1 − ρ²))) [ ((x1 − µ1)/σ1)² + ((x2 − µ2)/σ2)²
                                           − 2ρ ((x1 − µ1)/σ1)((x2 − µ2)/σ2) ] }.


2. If Z1, Z2, . . . , Zr are i.i.d. N(0, 1) and Z = (Z1, . . . , Zr)ᵀ, then

       fZ(z) = Π_{i=1}^{r} (1/√(2π)) e^{−zi²/2}
             = (1/(2π)^{r/2}) e^{−Σ_{i=1}^{r} zi²/2}
             = (1/(2π)^{r/2}) e^{−zᵀz/2},

   which is the Nr(0_r, I_{r×r}) PDF.

Theorem. 1.10.1
Suppose X ∼ Nr (µ, Σ) and let Y = AX + b, where b ∈ Rr and Ar×r invertible are
fixed. Then Y ∼ Nr (Aµ + b, AΣAT ).

Proof.
We use the transformation rule for PDFs: if X has joint PDF fX(x) and Y = g(X) for g : R^r → R^r invertible and continuously differentiable, then fY(y) = fX(h(y)) |det H|, where h = g⁻¹ and H = [∂hi/∂yj].
In this case we have g(x) = Ax + b, so

    h(y) = A⁻¹(y − b)     (solving y = Ax + b for x)

and H = A⁻¹. Hence,

    fY(y) = (1/((2π)^{r/2}|Σ|^{1/2})) e^{−(A⁻¹(y−b)−µ)ᵀ Σ⁻¹ (A⁻¹(y−b)−µ)/2} |det A⁻¹|
          = (1/((2π)^{r/2}|Σ|^{1/2}|AAᵀ|^{1/2})) e^{−(y−(Aµ+b))ᵀ (A⁻¹)ᵀΣ⁻¹A⁻¹ (y−(Aµ+b))/2}
          = (1/((2π)^{r/2}|AΣAᵀ|^{1/2})) e^{−(y−(Aµ+b))ᵀ (AΣAᵀ)⁻¹ (y−(Aµ+b))/2},

which is exactly the PDF of Nr(Aµ + b, AΣAᵀ).


Now suppose Σ_{r×r} is any symmetric positive definite matrix and recall that we can write Σ = EΛEᵀ, where Λ = diag(λ1, λ2, . . . , λr) and E_{r×r} is such that EEᵀ = EᵀE = I. Since Σ is positive definite, we must have λi > 0, i = 1, . . . , r, and we can define the symmetric square-root matrix by:

    Σ^{1/2} = EΛ^{1/2}Eᵀ,   where Λ^{1/2} = diag(√λ1, √λ2, . . . , √λr).

Note that Σ^{1/2} is symmetric and satisfies:

    (Σ^{1/2})² = Σ^{1/2}(Σ^{1/2})ᵀ = Σ.

Now recall that if Z1 , Z2 , . . . , Zr are i.i.d. N(0, 1) then Z = (Z1 , . . . , Zr )T ∼ Nr (0, I).
Because of the i.i.d. N(0, 1) assumption, we know in this case that E(Z) = 0, Var(Z) =
I.
Now, let X = Σ^{1/2}Z + µ:

1. From Theorem 1.9.9 we have

       E(X) = Σ^{1/2}E(Z) + µ = µ   and   Var(X) = Σ^{1/2} Var(Z) (Σ^{1/2})ᵀ = Σ^{1/2} I Σ^{1/2} = Σ.

2. From Theorem 1.10.1,

       X ∼ Nr(µ, Σ).

Since this construction is valid for any symmetric positive definite Σ and any
µ ∈ Rr , we have proved,

Theorem. 1.10.2

If X ∼ Nr (µ, Σ) then

E(X) = µ and Var(X) = Σ.

Theorem. 1.10.3
If X ∼ Nr (µ, Σ) then Z = Σ−1/2 (X − µ) ∼ Nr (0, I).

Proof.
Use Theorem 1.10.1.

Suppose

    (X1ᵀ, X2ᵀ)ᵀ ∼ N_{r1+r2}( (µ1ᵀ, µ2ᵀ)ᵀ, [ Σ11  Σ12
                                             Σ21  Σ22 ] ),

where Xi, µi are of dimension ri and Σij is ri × rj. In order that Σ be symmetric, we must have Σ11, Σ22 symmetric and Σ12 = Σ21ᵀ.
We will derive the marginal distribution of X2 and the conditional distribution of
X1 |X2 .

Lemma. 1.10.1
If M = [ B  A ; O  I ] is a square, partitioned matrix, then |M| = |B|.

Lemma. 1.10.2
If M = [ Σ11  Σ12 ; Σ21  Σ22 ], then |M| = |Σ22| |Σ11 − Σ12Σ22⁻¹Σ21|.

Proof. (of Lemma 1.10.2)

Let

    C = [ I   −Σ12Σ22⁻¹
          0    Σ22⁻¹    ]

and observe

    CM = [ Σ11 − Σ12Σ22⁻¹Σ21   0
           Σ22⁻¹Σ21            I ],

and

    |C| = 1/|Σ22|,   |CM| = |Σ11 − Σ12Σ22⁻¹Σ21|.

Finally, observe |CM| = |C||M|

    ⇒ |M| = |CM|/|C| = |Σ22| |Σ11 − Σ12Σ22⁻¹Σ21|.

Theorem. 1.10.4
Suppose that

    (X1ᵀ, X2ᵀ)ᵀ ∼ N_{r1+r2}( (µ1ᵀ, µ2ᵀ)ᵀ, [ Σ11  Σ12 ; Σ21  Σ22 ] ).

Then, the marginal distribution of X2 is X2 ∼ Nr2(µ2, Σ22), and the conditional distribution of X1|X2 is

    Nr1( µ1 + Σ12Σ22⁻¹(x2 − µ2),  Σ11 − Σ12Σ22⁻¹Σ21 ).


Proof.

We will exhibit f(x1, x2) in the form

    f(x1, x2) = h(x1, x2) g(x2),

where h(x1, x2) is a PDF with respect to x1 for each (given) x2. It follows then that fX1|X2(x1|x2) = h(x1, x2) and g(x2) = fX2(x2).
Write V = Σ11 − Σ12Σ22⁻¹Σ21, and observe that:

    f(x1, x2) = (1/((2π)^{(r1+r2)/2} |Σ|^{1/2}))
                × exp{ −(1/2) ((x1−µ1)ᵀ, (x2−µ2)ᵀ) Σ⁻¹ ((x1−µ1)ᵀ, (x2−µ2)ᵀ)ᵀ },

where Σ = [ Σ11  Σ12 ; Σ21  Σ22 ], and:

1. By Lemma 1.10.2,

       (2π)^{(r1+r2)/2} |Σ|^{1/2} = (2π)^{r1/2} |V|^{1/2} × (2π)^{r2/2} |Σ22|^{1/2}.

2. Using the standard formula for the inverse of a partitioned matrix,

       Σ⁻¹ = [ V⁻¹              −V⁻¹Σ12Σ22⁻¹
               −Σ22⁻¹Σ21V⁻¹      Σ22⁻¹ + Σ22⁻¹Σ21V⁻¹Σ12Σ22⁻¹ ],

   so the quadratic form expands as

       (x−µ)ᵀΣ⁻¹(x−µ) = (x1−µ1)ᵀV⁻¹(x1−µ1) − (x1−µ1)ᵀV⁻¹Σ12Σ22⁻¹(x2−µ2)
                        − (x2−µ2)ᵀΣ22⁻¹Σ21V⁻¹(x1−µ1)
                        + (x2−µ2)ᵀΣ22⁻¹Σ21V⁻¹Σ12Σ22⁻¹(x2−µ2)
                        + (x2−µ2)ᵀΣ22⁻¹(x2−µ2).

   (To see how the four cross terms arise, note that for a 2 × 2 block array,
   (x, y) [a b; c d] (x, y)ᵀ = xᵀax + xᵀby + yᵀcx + yᵀdy.)

   Collecting terms, using

       (x − y)ᵀA(x − y) = xᵀAx − xᵀAy − yᵀAx + yᵀAy

   together with Σ12ᵀ = Σ21, this equals

       (x1 − µ1 − Σ12Σ22⁻¹(x2−µ2))ᵀ V⁻¹ (x1 − µ1 − Σ12Σ22⁻¹(x2−µ2))
       + (x2−µ2)ᵀΣ22⁻¹(x2−µ2).

Combining (1) and (2), we obtain:

    f(x1, x2) = (1/((2π)^{(r1+r2)/2} |Σ|^{1/2})) exp{ −(1/2)(x−µ)ᵀΣ⁻¹(x−µ) }

              = (1/((2π)^{r1/2} |Σ11 − Σ12Σ22⁻¹Σ21|^{1/2}))
                × exp{ −(1/2) (x1 − µ1 − Σ12Σ22⁻¹(x2−µ2))ᵀ (Σ11 − Σ12Σ22⁻¹Σ21)⁻¹
                              (x1 − µ1 − Σ12Σ22⁻¹(x2−µ2)) }

                × (1/((2π)^{r2/2} |Σ22|^{1/2})) exp{ −(1/2)(x2−µ2)ᵀΣ22⁻¹(x2−µ2) }

              = h(x1, x2) g(x2),

where, for fixed x2, h(x1, x2) is the

    Nr1( µ1 + Σ12Σ22⁻¹(x2−µ2),  Σ11 − Σ12Σ22⁻¹Σ21 )

PDF, and g(x2) is the Nr2(µ2, Σ22) PDF.

Hence, the result is proved.

Theorem. 1.10.5
Suppose that X ∼ Nr (µ, Σ) and Y = AX + b, where Ap×r with linearly independent
rows and b ∈ Rp are fixed. Then Y ∼ Np (Aµ + b, AΣAT ).
[Note: p ≤ r.]

Proof. Use the method of regular transformations.

Since the rows of A are linearly independent, we can find B_{(r−p)×r} such that the r × r matrix (Bᵀ, Aᵀ)ᵀ, i.e. B stacked on A, is invertible.


If we take Z = BX, we have

    (Zᵀ, Yᵀ)ᵀ = (Bᵀ, Aᵀ)ᵀ X + (0ᵀ, bᵀ)ᵀ
              ∼ N( (Bµ)ᵀ, (Aµ + b)ᵀ )ᵀ with variance matrix [ BΣBᵀ  BΣAᵀ
                                                              AΣBᵀ  AΣAᵀ ],

by Theorem 1.10.1. Hence, from Theorem 1.10.4, the marginal distribution for Y is

    Np(Aµ + b, AΣAᵀ).

1.10.1 The multivariate normal MGF

The multivariate normal moment generating function for a random vector X ∼ N(µ, Σ) is given by

    MX(t) = e^{tᵀµ + tᵀΣt/2}.

Prove this result as an exercise!
The characteristic function of X is

    E[exp(itᵀX)] = exp( itᵀµ − tᵀΣt/2 ).
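One route to the exercise, sketched under the representation X = Σ^{1/2}Z + µ used earlier (so it relies on Theorem 1.10.1's setup and the univariate result M_Z(s) = e^{s²/2} for Z ∼ N(0, 1)):

```latex
M_X(t) = E\!\left[e^{t^T X}\right]
       = e^{t^T \mu}\, E\!\left[e^{(\Sigma^{1/2} t)^T Z}\right]
       \qquad (X = \Sigma^{1/2} Z + \mu,\ \text{write } u = \Sigma^{1/2} t)
\]
\[
       = e^{t^T \mu} \prod_{i=1}^{r} E\!\left[e^{u_i Z_i}\right]
       = e^{t^T \mu} \prod_{i=1}^{r} e^{u_i^2/2}
       \qquad (Z_1, \dots, Z_r \ \text{i.i.d.}\ N(0,1))
\]
\[
       = e^{t^T \mu + \|u\|^2/2}
       = e^{t^T \mu + \tfrac{1}{2} t^T \Sigma t}
       \qquad (\|u\|^2 = t^T \Sigma^{1/2} \Sigma^{1/2} t = t^T \Sigma t).
```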

The marginal distribution of X1 (or X2) is easy to derive using the multivariate normal MGF.
Let t = (t1ᵀ, t2ᵀ)ᵀ and µ = (µ1ᵀ, µ2ᵀ)ᵀ. Then the marginal distribution of X1 is obtained by setting t2 = 0 in the expression for the MGF of X.
Proof:

    MX(t) = exp( tᵀµ + tᵀΣt/2 )
          = exp( t1ᵀµ1 + t2ᵀµ2 + t1ᵀΣ11t1/2 + t1ᵀΣ12t2 + t2ᵀΣ22t2/2 ).

Now,

    MX1(t1) = MX( (t1ᵀ, 0ᵀ)ᵀ ) = exp( t1ᵀµ1 + t1ᵀΣ11t1/2 ),


which is the MGF of X1 ∼ Nr1 (µ1 , Σ11 ).


Similarly for X2 . This means that all marginal distributions of a multivariate nor-
mal distribution are multivariate normal themselves. Note though, that in general
the opposite implication is not true: there are examples of non-normal multivariate
distributions whose marginal distributions are normal.

1.10.2 Independence and normality

We have seen previously that X, Y independent ⇒ Cov(X, Y ) = 0, but not vice-versa.


An exception is when the data are (jointly) normally distributed.
In particular, if (X, Y ) have the bivariate normal distribution, then Cov(X, Y ) = 0 ⇐⇒
X, Y are independent.

Theorem. 1.10.6
Suppose X1 , X2 , . . . , Xr have a multivariate normal distribution. Then X1 , X2 , . . . , Xr
are independent if and only if Cov(Xi , Xj ) = 0 for all i 6= j.

Proof.

(=⇒)X1 , . . . , Xr independent

=⇒ Cov(Xi , Xj ) = 0 for i 6= j, has already been proved.

(⇐=) Suppose Cov(Xi , Xj ) = 0 for i 6= j

 
    ⇒ Var(X) = Σ = diag(σ11, σ22, . . . , σrr)

    ⇒ Σ⁻¹ = diag(σ11⁻¹, σ22⁻¹, . . . , σrr⁻¹)   and   |Σ| = σ11 σ22 . . . σrr

    ⇒ fX(x) = (1/((2π)^{r/2} |Σ|^{1/2})) e^{−(x−µ)ᵀΣ⁻¹(x−µ)/2}
            = (1/((2π)^{r/2} (σ11 σ22 . . . σrr)^{1/2})) e^{−(1/2) Σ_{i=1}^{r} (xi−µi)²/σii}
            = ( (1/(√(2π)√σ11)) e^{−(x1−µ1)²/(2σ11)} ) ( (1/(√(2π)√σ22)) e^{−(x2−µ2)²/(2σ22)} ) · · ·
              ( (1/(√(2π)√σrr)) e^{−(xr−µr)²/(2σrr)} )
            = f1(x1) f2(x2) . . . fr(xr)

    ⇒ X1, . . . , Xr are independent.

Note:
Since Σ⁻¹ = diag(σ11⁻¹, σ22⁻¹, . . . , σrr⁻¹), the quadratic form reduces to

    (x − µ)ᵀΣ⁻¹(x − µ) = Σ_{i=1}^{r} (xi − µi)²/σii.


Remark
The same methods can be used to establish a similar result for block diagonal matrices.
The simplest case is the following, which is most easily proved using moment generating
functions:
X1, X2 are independent if and only if Σ12 = 0, i.e.,

    (X1ᵀ, X2ᵀ)ᵀ ∼ N_{r1+r2}( (µ1ᵀ, µ2ᵀ)ᵀ, [ Σ11  0 ; 0  Σ22 ] ).

Theorem. 1.10.7
Suppose X1, X2, . . . , Xn are i.i.d. N(µ, σ²) RVs and let

    X̄ = (1/n) Σ_{i=1}^{n} Xi,

    S² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)².

Then X̄ ∼ N(µ, σ²/n) and (n−1)S²/σ² ∼ χ²_{n−1}, independently.

(Note: S² here is a random variable, and is not to be confused with the sample covariance matrix, which is also often denoted by S². Hopefully, the meaning of the notation will be clear in the context in which it is used.)

Proof. Observe first that if X = (X1, . . . , Xn)ᵀ, then the i.i.d. assumption may be written as:

    X ∼ Nn(µ1_n, σ² I_{n×n}).

1. X̄ ∼ N(µ, σ²/n):
   Observe that X̄ = BX, where

       B = (1/n, 1/n, . . . , 1/n).

   Hence, by Theorem 1.10.5,

       X̄ ∼ N(Bµ1_n, σ²BBᵀ) ≡ N(µ, σ²/n),   as required.


2. Independence of X̄ and S²: consider the n × n matrix

       A = [  1/n       1/n      · · ·   1/n       1/n
             1 − 1/n    −1/n     · · ·   −1/n      −1/n
              −1/n     1 − 1/n   · · ·   −1/n      −1/n
                ⋮         ⋮       ⋱        ⋮         ⋮
              −1/n      −1/n     · · ·  1 − 1/n    −1/n ]

   and observe

       AX = (X̄, X1 − X̄, X2 − X̄, . . . , X_{n−1} − X̄)ᵀ.

   We can check that Var(AX) = σ²AAᵀ is block diagonal, of the form

       σ² [ 1/n   0
             0    Σ22 ],

   with first row and column zero apart from the (1, 1) entry

   ⇒ X̄ and (X1 − X̄, X2 − X̄, . . . , X_{n−1} − X̄) are independent (by multivariate normality).

   Finally, since

       Xn − X̄ = −[ (X1 − X̄) + (X2 − X̄) + · · · + (X_{n−1} − X̄) ],

   it follows that S² is a function of (X1 − X̄, X2 − X̄, . . . , X_{n−1} − X̄) and hence is independent of X̄.

3. Prove that

       (n − 1)S²/σ² ∼ χ²_{n−1}.


Consider the identity:

    Σ_{i=1}^{n} (Xi − µ)² = Σ_{i=1}^{n} (Xi − X̄)² + n(X̄ − µ)²

(subtract and add X̄ inside the first term)

    ⇒ Σ_{i=1}^{n} (Xi − µ)²/σ² = Σ_{i=1}^{n} (Xi − X̄)²/σ² + ((X̄ − µ)/(σ/√n))².

Let

    R1 = Σ_{i=1}^{n} (Xi − µ)²/σ²,

    R2 = Σ_{i=1}^{n} (Xi − X̄)²/σ² = (n − 1)S²/σ²,

    R3 = ((X̄ − µ)/(σ/√n))².

If M1, M2, M3 are the MGFs of R1, R2, R3 respectively, then

    M1(t) = M2(t) M3(t),

since R1 = R2 + R3 with R2 and R3 independent (R2 depends only on S²; R3 depends only on X̄)

    ⇒ M2(t) = M1(t)/M3(t).

Next, observe that R3 = ((X̄ − µ)/(σ/√n))² ∼ χ²₁

    ⇒ M3(t) = 1/(1 − 2t)^{1/2},

and R1 = Σ_{i=1}^{n} (Xi − µ)²/σ² ∼ χ²_n

    ⇒ M1(t) = 1/(1 − 2t)^{n/2}.

    ∴ M2(t) = 1/(1 − 2t)^{(n−1)/2},

which is the MGF for χ²_{n−1}. Hence,

    (n − 1)S²/σ² ∼ χ²_{n−1}.

Corollary.

If X1, . . . , Xn are i.i.d. N(µ, σ²), then:

    T = (X̄ − µ)/(S/√n) ∼ t_{n−1}.

Proof.
Recall that t_k is the distribution of Z/√(V/k), where Z ∼ N(0, 1) and V ∼ χ²_k independently.
Now observe that:

    T = (X̄ − µ)/(S/√n) = [(X̄ − µ)/(σ/√n)] / √[ ((n−1)S²/σ²) / (n−1) ],

and note that

    (X̄ − µ)/(σ/√n) ∼ N(0, 1)   and   (n − 1)S²/σ² ∼ χ²_{n−1},

independently.


1.11 Limit Theorems

1.11.1 Convergence of random variables

Let X1 , X2 , X3 , . . . , be an infinite sequence of RVs. We will consider 4 different types


of convergence.

1. Convergence in probability (weak convergence)

   The sequence {Xn} is said to converge to the constant α ∈ R in probability if, for every ε > 0,

       lim_{n→∞} P(|Xn − α| > ε) = 0.

2. Convergence in quadratic mean

   The sequence {Xn} is said to converge to α in quadratic mean if

       lim_{n→∞} E[(Xn − α)²] = 0.

3. Almost sure convergence (strong convergence)

   The sequence {Xn} is said to converge almost surely to α if, for each ε > 0,

       |Xn − α| > ε for only a finite number of n ≥ 1.

Remarks
(1),(2),(3) are related by:

Figure 15: Relationships of convergence

4. Convergence in distribution
The sequence of RVs {Xn } with CDFs {Fn } is said to converge in distribution
to the RV X with CDF F (x) if:

lim Fn (x) = F (x) for all continuity points of F .


n→∞


Figure 16: Discrete case

Figure 17: Normal approximation to binomial (an application of the Central Limit
Theorem) continuous case

Remarks

1. If we take

       F(x) = 0 for x < α,   F(x) = 1 for x ≥ α,

   then we have X = α with probability 1, and convergence in distribution to this F is the same thing as convergence in probability to α.

2. Commonly used notation for convergence in distribution is either:

   (i) L[Xn] → L[X], or
   (ii) Xn →_D L[X], e.g., Xn →_D N(0, 1).

3. An important result that we will use without proof is as follows:
   Let Mn(t) be the MGF of Xn and M(t) be the MGF of X. If Mn(t) → M(t) for each t in some open interval containing 0, as n → ∞, then L[Xn] →_D L[X].
   (Sometimes called the Continuity Theorem.)

71
1. DISTRIBUTION THEORY

Theorem. 1.11.1 (Weak law of large numbers)

Suppose X1, X2, X3, . . . is a sequence of i.i.d. RVs with E[Xi] = µ, Var(Xi) = σ², and let

    X̄n = (1/n) Σ_{i=1}^{n} Xi.

Then X̄n converges to µ in probability.

Proof. We need to show, for each ε > 0, that

    lim_{n→∞} P(|X̄n − µ| > ε) = 0.

Now observe that E[X̄n] = µ and Var(X̄n) = σ²/n. So by Chebyshev's inequality,

    P(|X̄n − µ| > ε) ≤ σ²/(nε²) → 0   as n → ∞, for any fixed ε > 0.

Remarks

1. The proof given for Theorem 1.11.1 is really a corollary to the fact that X̄n also
converges to µ in quadratic mean.

2. There is also a version of this theorem involving almost sure convergence (strong
law of large numbers). We will not discuss this.

3. The law of large numbers is one of the fundamental principles of statistical infer-
ence. That is, it is the formal justification for the claim that the “sample mean
approaches the population mean for large n”.

Lemma. 1.11.1
Suppose an is a sequence of real numbers s.t. lim nan = a with |a| < ∞. Then,
n→∞

lim (1 + an )n = ea .
n→∞

Proof.
Omitted (but not difficult).


Remarks
This is a simple generalisation of the standard limit,

   lim_{n→∞} (1 + x/n)ⁿ = e^x.

Consider a sequence of i.i.d. RVs X1, X2, . . . with E(Xi) = µ, Var(Xi) = σ² and such
that the MGF, MX(t), is defined for all t in some open interval containing 0.

Let Sn = Σ_{i=1}^n Xi and note that E[Sn] = nµ and Var(Sn) = nσ².

Theorem. 1.11.2 (Central Limit Theorem)

Let X1, X2, . . . , Sn be as above and let Zn = (Sn − nµ)/(σ√n). Then

   L[Zn] → N(0, 1) as n → ∞.

Proof.
We will use the fact that it is sufficient to prove that

   MZn(t) → e^{t²/2} for each fixed t.

[Note: if Z ∼ N(0, 1) then MZ(t) = e^{t²/2}.]

Now let Ui = (Xi − µ)/σ and observe that Zn = (1/√n) Σ_{i=1}^n Ui.

Since the Ui are independent we have:

   MZn(t) = {MU(t/√n)}ⁿ

(using MaX(t) = E[e^{taX}] = MX(at)).

Now,

   E(Ui) = 0  ⇒  MU′(0) = 0,
   Var(Ui) = 1  ⇒  MU″(0) = 1.


Consider the second-order Taylor expansion,

   MU(t) = MU(0) + MU′(0)t + MU″(0)t²/2 + r(t)

         = 1 + 0 + t²/2 + r(t),   where lim_{s→0} r(s)/s² = 0

   (using MU(0) = 1 and MU′(0) = 0)

         = 1 + t²/2 + r(t).

Hence

   MZn(t) = {MU(t/√n)}ⁿ
          = {1 + t²/(2n) + r(t/√n)}ⁿ
          = (1 + an)ⁿ,

where

   an = t²/(2n) + r(t/√n).

Next observe that lim_{n→∞} n·an = t²/2 for fixed t.

To check this observe that:

   lim_{n→∞} n·t²/(2n) = t²/2,   and

   lim_{n→∞} n·r(t/√n) = t² lim_{n→∞} r(t/√n)/(t/√n)²
                        = t² lim_{s→0} r(s)/s² = 0 for fixed t.

(Note: s = t/√n → 0 as n → ∞.)


Hence we have shown MZn(t) = (1 + an)ⁿ, where lim_{n→∞} n·an = t²/2, so by Lemma 1.11.1,

   lim_{n→∞} MZn(t) = e^{t²/2} for each fixed t.

Remarks

1. The Central Limit Theorem can be stated equivalently for X̄n,

   i.e., L[(X̄n − µ)/(σ/√n)] → N(0, 1);   just note that (X̄n − µ)/(σ/√n) = (Sn − nµ)/(σ√n).

2. The Central Limit Theorem holds under conditions more general than those given
   above. In particular, with suitable assumptions,

   (i) MX(t) need not exist.

   (ii) X1, X2, . . . need not be i.i.d..

3. Theorems 1.11.1 and 1.11.2 are concerned with the asymptotic behaviour of X̄n.

   Theorem 1.11.1 states X̄n → µ in prob as n → ∞.

   Theorem 1.11.2 states (X̄n − µ)/(σ/√n) →D N(0, 1) as n → ∞.

   These results are not contradictory because Var(X̄n) → 0, but the Central Limit
   Theorem is concerned with (X̄n − E(X̄n))/√Var(X̄n).
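The theorem is easy to illustrate numerically. The following sketch (an illustration only, assuming NumPy) forms the standardised sums Zn = (Sn − nµ)/(σ√n) for i.i.d. Exp(1) data and compares them with N(0, 1):

```python
import numpy as np

# Illustrative sketch: standardised sums of i.i.d. Exp(1) variables are
# approximately N(0, 1) for large n, as the Central Limit Theorem asserts.
rng = np.random.default_rng(1)
n, reps = 400, 10000
mu = sigma = 1.0  # Exp(1) has mean 1 and variance 1

s_n = rng.exponential(scale=1.0, size=(reps, n)).sum(axis=1)
z_n = (s_n - n * mu) / (sigma * np.sqrt(n))

# Compare with N(0, 1): mean ~ 0, variance ~ 1, P(Z <= 1.96) ~ 0.975.
print(z_n.mean(), z_n.var(), np.mean(z_n <= 1.96))
```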


2 Statistical Inference

2.1 Basic definitions and terminology

Probability is concerned partly with the problem of predicting the behaviour of the RV
X assuming we know its distribution.
Statistical inference is concerned with the inverse problem:
Given data x1, x2, . . . , xn with unknown CDF F(x), what can we conclude about
F(x)?
In this course, we are concerned with parametric inference. That is, we assume F
belongs to a given family of distributions, indexed by the parameter θ:

   ℱ = {F(x; θ) : θ ∈ Θ},

where Θ is the parameter space.
Examples

(1) ℱ is the family of normal distributions:

   θ = (µ, σ²),
   Θ = {(µ, σ²) : µ ∈ R, σ² ∈ R⁺};

   then ℱ = {F(x) : F(x) = Φ((x − µ)/σ)}.

(2) ℱ is the family of Bernoulli distributions with success probability θ:
   Θ = {θ ∈ [0, 1] ⊂ R}.

In this framework, the problem is then to use the data x1 , . . . , xn to draw conclusions
about θ.
Definition. 2.1.1
A collection of i.i.d. RVs, X1 , . . . , Xn , with common CDF F (x; θ), is said to be a
random sample (from F (x; θ)).

Definition. 2.1.2
Any function T = T (x1 , x2 , . . . , xn ) that can be calculated from the data (without
knowledge of θ) is called a statistic.
Example
The sample mean x̄ = (1/n) Σ_{i=1}^n xi is a statistic.


Definition. 2.1.3
A statistic T with property T (x) ∈ Θ ∀x is called an estimator for θ.
Example
For x1, x2, . . . , xn i.i.d. N(µ, σ²), we have θ = (µ, σ²). The quantity (x̄, s²) is an
estimator for θ, where x̄ = (1/n) Σ_{i=1}^n xi and s² = (1/(n − 1)) Σ_{i=1}^n (xi − x̄)².

There are two important concepts here: the first is that estimators are random
variables; the second is that you need to be able to distinguish between random vari-
ables and their realisations. In particular, an estimate is a realisation of a random
variable.
For example, strictly speaking, x1, x2, . . . , xn are realisations of random variables
X1, X2, . . . , Xn, and x̄ = (1/n) Σ_{i=1}^n xi is a realisation of X̄; X̄ is an estimator,
and x̄ is an estimate.
From now on, however, we will find that it is often more convenient, if less rigorous,
to use the same symbol for estimator and estimate. This arises especially in the use of
θ̂ as both estimator and estimate, as we shall see.

An unsatisfactory aspect of Definition 2.1.3 is that it gives no guidance on how to
recognize (or construct) good estimators.
Unless stated otherwise, we will assume that θ is a scalar parameter in the following.

2.1.1 Criteria for good estimators

In broad terms, we would like an estimator to be “as close as possible to” θ with high
probability.
Definition. 2.1.4
The mean squared error of the estimator T of θ is defined by

MSE_T(θ) = E{(T − θ)²}.

Example:
Suppose X1, . . . , Xn are i.i.d. Bernoulli-θ RV's, and T = X̄ = 'proportion of successes'.


Since nT ∼ B(n, θ) we have

   E(nT) = nθ,   Var(nT) = nθ(1 − θ)

   ⇒ E(T) = θ,   Var(T) = θ(1 − θ)/n

   ⇒ MSE_T(θ) = Var(T) = θ(1 − θ)/n.
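A small simulation (an illustrative sketch, assuming NumPy) confirms the formula MSE_T(θ) = θ(1 − θ)/n:

```python
import numpy as np

# Illustrative sketch: Monte-Carlo check of MSE_T(theta) = theta*(1-theta)/n
# for T = Xbar with Bernoulli-theta data.
rng = np.random.default_rng(2)
n, reps, theta = 50, 40000, 0.3

t = rng.binomial(1, theta, size=(reps, n)).mean(axis=1)  # realisations of T
approx = np.mean((t - theta) ** 2)                       # simulated MSE
exact = theta * (1 - theta) / n                          # formula above
print(approx, exact)
```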

Remark:
This example shows that MSE_T(θ) must be thought of as a function of θ rather than
just a number.
For example: see Figure 18.

Figure 18: MSE_T(θ) as a function of θ

Intuitively, a good estimator is one for which the MSE is as small as possible. However,
quantifying this idea is complicated, because MSE_T(θ) is a function of θ, not just a
number. See Figure 19.
For this reason, it turns out it is not possible to construct a minimum MSE estimator
in general.
To see why, suppose T* is a minimum MSE estimator for θ. Now consider the estimator
Ta = a, where a ∈ R is arbitrary. Then for a = θ, Tθ = θ with MSE_{Tθ}(θ) = 0.


Figure 19: MSE_T(θ) as a function of θ

Observe MSE_{Ta}(a) = 0; hence if T* exists, then we must have MSE_{T*}(a) = 0. As a
is arbitrary, we must have MSE_{T*}(θ) = 0 ∀θ ∈ Θ
⇒ T* = θ with probability 1.
Therefore we conclude that (excluding trivial situations) no minimum MSE estimator
can exist.
Definition. 2.1.5 The bias of the estimator T is defined by:

   b_T(θ) = E(T) − θ.

An estimator T with b_T(θ) = 0, i.e., E(T) = θ, is said to be unbiased.

Remarks:

(1) Although unbiasedness is an appealing property, not all commonly used
    estimates are unbiased, and in some situations it is impossible to construct
    unbiased estimators for the parameter of interest.
Example: (Unbiasedness isn’t everything)

E(s2 ) = σ 2 .

If E(s) = σ,

then Var(s) = E(s2 ) − {E(s)}2

= 0

which is not the case

=⇒ E(s) < σ.

(2) Intuitively, unbiasedness would seem to be a pre-requisite for a good
    estimator. This is to some extent formalized as follows:

Theorem. 2.1.1

   MSE_T(θ) = Var(T) + b_T(θ)².

Remark:
Restricting attention to unbiased estimators excludes estimators of the form Ta =
a. We will see that this permits the construction of Minimum Variance Unbiased
Estimators (MVUE’s) in some cases.

2.2 Minimum Variance Unbiased Estimation

2.2.1 Likelihood, score and Fisher Information

Consider data x1, x2, . . . , xn assumed to be observations of RV's X1, X2, . . . , Xn with
joint PDF fX(x; θ) (or probability function PX(x; θ)).

Definition. 2.2.1


The likelihood function is defined by

   L(θ; x) = fX(x; θ) if X is continuous,   PX(x; θ) if X is discrete.

The log likelihood function is:

   ℓ(θ; x) = log L(θ; x)   (log is the natural log, i.e., ln).

Remark
If x1, x2, . . . , xn are independent, the log likelihood function can be written as:

   ℓ(θ; x) = Σ_{i=1}^n log fi(xi; θ).

If x1, x2, . . . , xn are i.i.d., we have:

   ℓ(θ; x) = Σ_{i=1}^n log f(xi; θ).

Definition. 2.2.2
Consider a statistical problem with log-likelihood ℓ(θ; x). The score is defined by:

   U(θ; x) = ∂ℓ/∂θ,

and the Fisher information is

   I(θ) = E[−∂²ℓ/∂θ²].

Remark
For a single observation with PDF f(x; θ), the information is

   i(θ) = E[−(∂²/∂θ²) log f(X; θ)].

In the case of x1, x2, . . . , xn i.i.d., we have I(θ) = n·i(θ).
Theorem. 2.2.1
Under suitable regularity conditions,
E{U(θ; X)} = 0

and Var{U(θ; X)} = I(θ).


Proof.
Observe

   E{U(θ; X)} = ∫···∫ U(θ; x) f(x; θ) dx1 . . . dxn

              = ∫···∫ [∂/∂θ log f(x; θ)] f(x; θ) dx1 . . . dxn

              = ∫···∫ [(∂f(x; θ)/∂θ) / f(x; θ)] f(x; θ) dx1 . . . dxn

              = ∫···∫ ∂f(x; θ)/∂θ dx1 . . . dxn

   (regularity ⇒)  = ∂/∂θ ∫···∫ f(x; θ) dx1 . . . dxn

              = ∂/∂θ (1)

              = 0, as required.

To calculate Var{U(θ; X)}, observe

   ∂²ℓ(θ; x)/∂θ² = ∂/∂θ [(∂f(x; θ)/∂θ) / f(x; θ)]

                 = [f(x; θ) ∂²f(x; θ)/∂θ² − (∂f(x; θ)/∂θ)²] / [f(x; θ)]²

                 = (∂²f(x; θ)/∂θ²) / f(x; θ) − U²(θ; x)

   ⇒ U²(θ; x) = (∂²f(x; θ)/∂θ²) / f(x; θ) − ∂²ℓ/∂θ²

   ⇒ Var{U(θ; X)} = E{U²(θ; X)}   (as E(U) = 0)

                  = E[(∂²f(X; θ)/∂θ²) / f(X; θ)] + I(θ)   (by definition of I(θ)).


Finally observe that:

   E[(∂²f(X; θ)/∂θ²) / f(X; θ)] = ∫···∫ [(∂²f(x; θ)/∂θ²) / f(x; θ)] f(x; θ) dx1 . . . dxn

                                = ∫···∫ ∂²f(x; θ)/∂θ² dx1 . . . dxn

                                = ∂²/∂θ² ∫···∫ f(x; θ) dx1 . . . dxn

                                = ∂²/∂θ² (1)

                                = 0.

Hence, we have proved Var{U(θ; X)} = I(θ).
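Theorem 2.2.1 is easy to check numerically. For a Po(λ) sample the score works out to U(λ; x) = n(x̄ − λ)/λ (this is derived in the example below), and the following sketch (an illustration only, assuming NumPy) verifies that it has mean 0 and variance I(λ) = n/λ:

```python
import numpy as np

# Illustrative sketch: for Po(lam) data the score is U = n*(xbar - lam)/lam;
# check E(U) = 0 and Var(U) = I(lam) = n/lam by simulation.
rng = np.random.default_rng(3)
lam, n, reps = 2.0, 30, 50000

xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
u = n * (xbar - lam) / lam   # score evaluated at the true lambda

info = n / lam               # Fisher information I(lam)
print(u.mean(), u.var(), info)
```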

2.2.2 Cramer-Rao Lower Bound

Theorem. 2.2.2 (Cramer-Rao Lower Bound)

If T is an unbiased estimator for θ, then Var(T) ≥ 1/I(θ).


Proof.

Observe

   Cov{T(X), U(θ; X)} = E{T(X) U(θ; X)}   (as E(U) = 0)

                      = ∫···∫ T(x) U(θ; x) f(x; θ) dx1 . . . dxn

                      = ∫···∫ T(x) [(∂f(x; θ)/∂θ) / f(x; θ)] f(x; θ) dx1 . . . dxn

                      = ∂/∂θ ∫···∫ T(x) f(x; θ) dx1 . . . dxn

                      = ∂/∂θ E{T(X)}

                      = ∂θ/∂θ   (T unbiased)

                      = 1.

To summarize, Cov(T, U) = 1.

Recall that Cov²(T, U) ≤ Var(T) Var(U) [i.e. |ρ| ≤ 1; divide both sides by the RHS]

   ⇒ Var(T) ≥ Cov²(T, U)/Var(U) = 1/I(θ), as required.

Example
Suppose X1, X2, . . . , Xn are i.i.d. Po(λ) RV's, and let X̄ = (1/n) Σ_{i=1}^n Xi. We will
prove that X̄ is a MVUE for λ.


Proof.

(1) Recall if X ∼ Po(λ), then P(x) = e^{−λ}λ^x/x!, E(X) = λ and Var(X) = λ.
    Hence, E(X̄) = λ and Var(X̄) = λ/n, so X̄ is unbiased for λ.

(2) To show that X̄ is MVUE, we will show that Var(X̄) = 1/I(λ).

    Step 1:
    The log-likelihood is

       ℓ(λ; x) = log P(x; λ)
               = log Π_{i=1}^n e^{−λ}λ^{x_i}/x_i!
               = log [e^{−nλ} λ^{Σx_i} Π_{i=1}^n (1/x_i!)]
               = −nλ + (Σ_{i=1}^n x_i) log λ − Σ_{i=1}^n log x_i!
               = −nλ + nx̄ log λ − Σ_{i=1}^n log x_i!.

    Step 2:
    Find ∂²ℓ/∂λ²:

       ∂ℓ/∂λ = −n + nx̄/λ  ⇒  ∂²ℓ/∂λ² = −nx̄/λ².
Step 3:


   I(λ) = −E[∂²ℓ/∂λ²]
        = E[nX̄/λ²]
        = (n/λ²) E(X̄)
        = (n/λ²) λ
        = n/λ.

(3) Finally, observe that Var(X̄) = λ/n = 1/I(λ).

By Theorem 2.2.2, any unbiased estimator T for λ must have

   Var(T) ≥ 1/I(λ) = λ/n

⇒ X̄ is a MVUE.
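This conclusion can be checked by simulation (an illustrative sketch, assuming NumPy): the sampling variance of X̄ matches the bound λ/n.

```python
import numpy as np

# Illustrative sketch: the variance of Xbar for Po(lam) data attains the
# Cramer-Rao lower bound 1/I(lam) = lam/n.
rng = np.random.default_rng(4)
lam, n, reps = 3.0, 25, 60000

xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
crlb = lam / n
print(xbar.mean(), xbar.var(), crlb)  # unbiased, with variance at the bound
```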

Theorem. 2.2.3
The unbiased estimator T(x) can achieve the Cramer-Rao Lower Bound only if the
joint PDF/probability function has the form:

   f(x; θ) = exp{A(θ)T(x) + B(θ) + h(x)},

where A, B are functions such that θ = −B′(θ)/A′(θ), and h is some function of x.

Proof. Recall from the proof of Theorem 2.2.2 that the bound arises from the inequality

   Corr²(T(X), U(θ; X)) ≤ 1,

where U(θ; x) = ∂ℓ/∂θ.
Moreover, it is easy to see that the Cramer-Rao Lower Bound (CRLB) can be achieved
only when

   Corr²{T(X), U(θ; X)} = 1.


Hence the CRLB is achieved only if U is a linear function of T with probability 1
(correlation equals 1):

   U = aT + b,   where a, b are constants (they can depend on θ but not x),

i.e.

   ∂ℓ/∂θ = U(θ; x) = a(θ)T(x) + b(θ) for all x.

Integrating wrt θ, we obtain:

   log f(x; θ) = ℓ(θ; x) = A(θ)T(x) + B(θ) + h(x),

where A′(θ) = a(θ), B′(θ) = b(θ). Hence,

   f(x; θ) = exp{A(θ)T(x) + B(θ) + h(x)}.

Finally, to ensure that E(T) = θ, recall that E{U(θ; X)} = 0 and observe that in this
case,

   E{U(θ; X)} = E[A′(θ)T(X) + B′(θ)]
              = A′(θ)E[T(X)] + B′(θ) = 0

   ⇒ E[T(X)] = −B′(θ)/A′(θ).

Hence, in order that T(X) be unbiased for θ, we must have −B′(θ)/A′(θ) = θ.

2.2.3 Exponential families of distributions

Definition. 2.2.3
A probability density function/probability function is said to be a single parameter
exponential family if it has the form:

   f(x; θ) = exp{A(θ)t(x) + B(θ) + h(x)}

for all x ∈ D ⊆ R, where D does not depend on θ.


If x1, . . . , xn is a random sample from an exponential family, the joint PDF/probability
function becomes

   f(x; θ) = Π_{i=1}^n exp{A(θ)t(xi) + B(θ) + h(xi)}
           = exp{A(θ) Σ_{i=1}^n t(xi) + nB(θ) + Σ_{i=1}^n h(xi)}.

In this case, a minimum variance unbiased estimator that achieves the CRLB can be
seen to be

   T = (1/n) Σ_{i=1}^n t(xi),

which is the MVUE for E(T) = −B′(θ)/A′(θ).

Example (Poisson example revisited)

Observe that the Poisson probability function,

   p(x; λ) = e^{−λ}λ^x/x!
           = exp{(log λ)x − λ − log x!}
           = exp{A(λ)t(x) + B(λ) + h(x)},

where

   A(λ) = log λ,   B(λ) = −λ,   t(x) = x,   h(x) = −log x!.

Hence X̄ = (1/n) Σ_{i=1}^n t(xi) is the MVUE for

   −B′(λ)/A′(λ) = −(−1)/(1/λ) = λ.


Example (2)

Consider the Exp(λ) distribution. The PDF is

   f(x) = λe^{−λx},  x > 0
        = exp{−λx + log λ};

⇒ f is an exponential family with

   A(λ) = −λ,   B(λ) = log λ,   t(x) = x,   h(x) = 0.

We can also check that E(X) = −B′(λ)/A′(λ). In particular, we have seen previously
that E(X) = 1/λ for X ∼ Exp(λ).
Now observe that A′(λ) = −1, B′(λ) = 1/λ

   ⇒ −B′(λ)/A′(λ) = 1/λ, as required.

It also follows that if x1, x2, . . . , xn are i.i.d. Exp(λ) observations, then
X̄ = (1/n) Σ_{i=1}^n t(xi) is the MVUE for 1/λ = E(X).

2.2.4 Sufficient statistics

Definition. 2.2.4
Consider data with PDF/prob. function, f (x; θ). A statistic, S(x), is called a sufficient
statistic for θ if f (x|s; θ) does not depend on θ for all s.

Remarks


(1) We will see that sufficient statistics capture all of the information in the
data x that is relevant to θ.

(2) If we consider vector-valued statistics, then this definition admits trivial
    examples, such as s = x, since

       P(X = x|S = s) = 1 if x = s, and 0 otherwise,

    which does not depend on θ.

Example
Suppose x1, x2, . . . , xn are i.i.d. Bernoulli-θ and let s = Σ_{i=1}^n xi. Then S is
sufficient for θ.

Proof.

   P(x) = Π_{i=1}^n θ^{x_i}(1 − θ)^{1−x_i}
        = θ^{Σx_i}(1 − θ)^{Σ(1−x_i)}
        = θ^s(1 − θ)^{n−s}.

Next observe that S ∼ B(n, θ)

   ⇒ P(s) = (n choose s) θ^s(1 − θ)^{n−s}

   ⇒ P(x|s) = P({X = x} ∩ {S = s}) / P(S = s)
            = P(X = x) / P(S = s)
            = 1/(n choose s) if Σ_{i=1}^n xi = s, and 0 otherwise.
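For small n this can be verified exactly by enumeration (an illustrative sketch, standard library only): the conditional probabilities P(x | S = s) computed under two different values of θ coincide.

```python
import itertools

# Illustrative sketch: enumerate all binary vectors of length n and check that
# P(x | S = s), with S = sum of the x_i, does not depend on theta.
n = 4

def cond_probs(theta):
    """Return {x: P(X = x | S = sum(x))} over all binary vectors x."""
    xs = list(itertools.product([0, 1], repeat=n))
    p = {x: theta ** sum(x) * (1 - theta) ** (n - sum(x)) for x in xs}
    p_s = {s: sum(v for x, v in p.items() if sum(x) == s) for s in range(n + 1)}
    return {x: p[x] / p_s[sum(x)] for x in xs}

a, b = cond_probs(0.2), cond_probs(0.7)
max_diff = max(abs(a[x] - b[x]) for x in a)
print(max_diff)  # ~0: the conditional law is the same for both values of theta
```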


Theorem. 2.2.4 The Factorization Theorem

Suppose x1, . . . , xn have joint PDF/probability function f(x; θ). Then S is a sufficient
statistic for θ if and only if

   f(x; θ) = g(s; θ)h(x)

for some functions g, h.

Proof.
Omitted.

Examples

(1) x1, x2, . . . , xn i.i.d. N(µ, σ²), σ² known.

    Then f(x; µ) = Π_{i=1}^n (1/√(2πσ²)) exp{−(xi − µ)²/(2σ²)}

                 = (2πσ²)^{−n/2} exp{−(1/(2σ²)) Σ_{i=1}^n (xi − µ)²}

                 = (2πσ²)^{−n/2} exp{−(1/(2σ²)) (Σ_{i=1}^n xi² − 2µ Σ_{i=1}^n xi + nµ²)}

                 = exp{(1/(2σ²)) (2µ Σ_{i=1}^n xi − nµ²)} exp{−(1/(2σ²)) Σ_{i=1}^n xi² − (n/2) log(2πσ²)}

    ⇒ S = Σ_{i=1}^n xi is sufficient for µ.
i=1

(2) If x1, x2, . . . , xn are i.i.d. with

       f(x) = exp{A(θ)t(x) + B(θ) + h(x)},

    then

       f(x; θ) = exp{A(θ) Σ_{i=1}^n t(xi) + nB(θ)} exp{Σ_{i=1}^n h(xi)}

    ⇒ S = Σ_{i=1}^n t(xi) is sufficient for θ in the exponential family, by the
    Factorization Theorem.


2.2.5 The Rao-Blackwell Theorem

Theorem. 2.2.5 Rao-Blackwell Theorem

If T is an unbiased estimator for θ and S is a sufficient statistic for θ, then the quantity
T* = E(T|S) is also an unbiased estimator for θ with Var(T*) ≤ Var(T). Moreover,
Var(T*) = Var(T) iff T* = T with probability 1.

Proof. (I) Unbiasedness:

   θ = E(T) = E_S{E(T|S)} = E(T*)  ⇒  E(T*) = θ.

(II) Variance inequality:

   Var(T) = E{Var(T|S)} + Var{E(T|S)}
          = E{Var(T|S)} + Var(T*)
          ≥ Var(T*),

since E{Var(T|S)} ≥ 0.

Observe also that Var(T) = Var(T*)

   ⇒ E{Var(T|S)} = 0
   ⇒ Var(T|S) = 0 with prob. 1
   ⇒ T = E(T|S) with prob. 1.

(III) T* is an estimator:
Since S is sufficient for θ,

   T* = ∫_{−∞}^{∞} T(x)f(x|s)dx,

which does not depend on θ.

Remarks


(1) The Rao-Blackwell Theorem can be used occasionally to construct
    estimators.

(2) The theoretical importance of this result is to observe that T* will always
    depend on x only through S. If T is already a MVUE then T* = T ⇒ the
    MVUE depends on x only through S.

Example
Suppose x1, . . . , xn ∼ i.i.d. N(µ, σ²), σ² known. We want to estimate µ. Take T = x1
as an unbiased estimator for µ.

S = Σ_{i=1}^n xi is a sufficient statistic for µ.

According to Rao-Blackwell, T* = E(T|S) will be an improved (unbiased) estimator
for µ.

If X ∼ Nn(µ1, σ²I), then (T, S)ᵀ = AX, where A is the matrix

   A = [1 0 . . . 0; 1 1 . . . 1].

Hence,

   (T, S)ᵀ ∼ N2(A(µ1), σ²AAᵀ) = N2((µ, nµ)ᵀ, [σ² σ²; σ² nσ²])

   ⇒ T|S = s ∼ N(µ + (σ²/(nσ²))(s − nµ), σ² − σ⁴/(nσ²))
             = N((1/n)s, (1 − 1/n)σ²)

   ∴ E(T|S) = (1/n) Σ xi = x̄ = T*

is a better (or equal) estimator for µ.
Finally, observe Var(X̄) = σ²/n ≤ σ² = Var(X1), with strict inequality for n > 1, σ² > 0.
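A short simulation (an illustrative sketch, assuming NumPy) shows the improvement from Rao-Blackwellising T = X1:

```python
import numpy as np

# Illustrative sketch: Rao-Blackwellising T = X_1 with S = sum X_i gives
# T* = Xbar; compare Var(T) = sigma^2 with Var(T*) = sigma^2/n.
rng = np.random.default_rng(5)
mu, sigma, n, reps = 1.0, 2.0, 10, 50000

x = rng.normal(mu, sigma, size=(reps, n))
t = x[:, 0]              # the crude unbiased estimator T = X_1
t_star = x.mean(axis=1)  # its Rao-Blackwellised version T* = Xbar

print(t.var(), t_star.var())  # ~ sigma^2 and ~ sigma^2/n
```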
Remarks

(1) We have given only an incomplete outline of the theory.

(2) There are also the concepts of minimal sufficient and complete statistics to be
    considered.


2.3 Methods Of Estimation

2.3.1 Method Of Moments

Consider a random sample X1, X2, . . . , Xn from F(θ) and let µ = µ(θ) = E(X).

The method of moments estimator θ̃ is defined as the solution to the equation

   X̄ = µ(θ̃).

Example
X1, . . . , Xn ∼ i.i.d. Exp(λ) ⇒ E(Xi) = 1/λ; the method of moments estimator is
defined as the solution to the equation

   X̄ = 1/λ̃  ⇒  λ̃ = 1/X̄.

Remark
The method of moments is appealing:

(1) it is simple;

(2) rationale is that X̄ is BLUE for µ.

If θ = (θ1, θ2, . . . , θp), the method of moments can be adapted as follows:

(1) let µk = µk(θ) = E(X^k), k = 1, . . . , p;

(2) let mk = (1/n) Σ_{i=1}^n xi^k.

The MoM estimator θ̃ is defined to be the solution to the system of equations

   m1 = µ1(θ̃)
   m2 = µ2(θ̃)
   ...
   mp = µp(θ̃)


Example
Suppose X1, X2, . . . , Xn are i.i.d. N(µ, σ²) and let θ = (µ, σ²).
p = 2 ⇒ two equations in two unknowns:

   µ1(θ) = E(X) = µ
   µ2(θ) = E(X²) = σ² + µ²

   m1 = (1/n) Σ_{i=1}^n xi
   m2 = (1/n) Σ_{i=1}^n xi²

∴ we need to solve for µ̃, σ̃²:

   µ̃ = x̄

   σ̃² + µ̃² = (1/n) Σ xi²  ⇒  σ̃² = (1/n) Σ xi² − x̄²
                                  = (1/n) Σ (xi − x̄)²
                                  = ((n − 1)/n) s².

Remark
The Method of Moments estimators can be seen to have good statistical properties.
Under fairly mild regularity conditions, the MoM estimator is

(1) Consistent, i.e., θ̃ → θ in prob as n → ∞, for all θ.

(2) Asymptotically Normal, i.e., (θ̃ − θ)/√Var(θ̃) →D N(0, 1) as n → ∞.

However, the MoM estimator has one serious defect.

Suppose X1, . . . , Xn is a random sample from F(x; θ) and let θ̃X be the MoM estimator.
Let Y = h(X) for some invertible function h; then Y1, . . . , Yn should contain the
same information as X1, . . . , Xn.
∴ we would like θ̃Y ≡ θ̃X. Unfortunately this does not hold.


Example
X1, . . . , Xn i.i.d. Exp(λ). We saw previously that λ̃X = 1/X̄.
Suppose Yi = Xi² (which is invertible for Xi > 0). To obtain λ̃Y, observe that
E(Y) = E(X²) = 2/λ²

   ⇒ λ̃Y = √(2n/Σ Xi²) ≠ 1/X̄.
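The defect is easy to exhibit numerically (an illustrative sketch, assuming NumPy): the two MoM estimates computed from the same data differ.

```python
import numpy as np

# Illustrative sketch: for the same Exp(lam) data, the MoM estimate from X
# differs from the MoM estimate from Y = X^2.
rng = np.random.default_rng(6)
lam, n = 2.0, 200
x = rng.exponential(scale=1 / lam, size=n)

lam_x = 1 / x.mean()                     # solves xbar = 1/lambda
lam_y = np.sqrt(2 * n / np.sum(x ** 2))  # solves (1/n)*sum(y_i) = 2/lambda^2
print(lam_x, lam_y)  # both near lambda = 2, but not equal to each other
```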

2.3.2 Maximum Likelihood Estimation

Consider a statistical problem with log-likelihood function ℓ(θ; x).

Definition. The maximum likelihood estimate θ̂ is the solution to the problem
max_{θ∈Θ} ℓ(θ; x), i.e., θ̂ = arg max_{θ∈Θ} ℓ(θ; x).

Remark
In practice, maximum likelihood estimates are obtained by solving the score equation

   ∂ℓ/∂θ = U(θ; x) = 0.

Example
If X1, X2, . . . , Xn are i.i.d. geometric-θ RV's, find θ̂.
Solution:

   ℓ(θ; x) = log Π_{i=1}^n θ(1 − θ)^{x_i−1}
           = n log θ + Σ_{i=1}^n (xi − 1) log(1 − θ)
           = n{log θ + (x̄ − 1) log(1 − θ)}

   ∴ U(θ; x) = ∂ℓ/∂θ = n[1/θ − (x̄ − 1)/(1 − θ)]
             = n(1 − θx̄)/(θ(1 − θ)).


To find θ̂, we solve for θ in 1 − θx̄ = 0

   ⇒ θ̂ = 1/x̄ = n/Σ_{i=1}^n xi = (# of successes)/(# of trials).
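A quick numerical check (an illustrative sketch, assuming NumPy) confirms that θ̂ = 1/x̄ maximises the log-likelihood, by comparing the closed form with a brute-force grid search:

```python
import numpy as np

# Illustrative sketch: theta_hat = 1/xbar maximises the geometric
# log-likelihood l(theta) = n*{log(theta) + (xbar - 1)*log(1 - theta)}.
rng = np.random.default_rng(7)
x = rng.geometric(0.3, size=500)  # i.i.d. geometric-theta data, theta = 0.3
n, xbar = len(x), x.mean()

theta = np.linspace(1e-3, 1 - 1e-3, 100000)
loglik = n * (np.log(theta) + (xbar - 1) * np.log(1 - theta))

theta_grid = theta[np.argmax(loglik)]  # maximiser found by grid search
theta_hat = 1 / xbar                   # closed form from the notes
print(theta_hat, theta_grid)           # the two agree to grid accuracy
```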

2.3.3 Elementary properties of MLEs

(1) MLE’s are invariant under invertible transformations of the data.

Proof. (Continuous i.i.d. RV’s, strictly increasing/decreasing translation)


Suppose X has PDF f (x; θ) and Y = h(X), where h is strictly monotonic;
=⇒ fY (y; θ) = fX (h−1 (y); θ)|h−1 (y)0 | when y = h(x).

Consider data x1 , x2 , . . . , xn and the transformed version y1 , y2 , . . . , yn :


( n )
Y
`Y (θ; y) = log fY (yi ; θ)
i=1
( n )
Y
= log fX (h−1 (yi ); θ)|h−1 (yi )0 |
i=1
n
Y n
Y
= log fX (xi ; θ) + log |h−1 (yi )0 |
i=1 i=1
n
!
Y
= `X (θ; x) + log |h−1 (yi )0 | ,
i=1
n
!
Y
since log |h−1 (yi )0 | does not depend on θ, it follows that θ̂ maximizes
i=1
`X (θ; x) iff it maximizes `Y (θ; y).

(2) If φ = φ(θ) is a 1-1 transformation of θ, then the MLE's obey the transformation
rule φ̂ = φ(θ̂).

Proof. It can be checked that if ℓφ(φ; x) is the log-likelihood with respect to φ, then

   ℓθ(θ; x) = ℓφ(φ(θ); x).

It follows that θ̂ maximizes ℓθ(θ; x) iff φ̂ = φ(θ̂) maximizes ℓφ(φ; x).


(3) If t(x) is a sufficient statistic for θ, then θ̂ depends on the data only as a
function of t(x).

Proof. By the Factorization Theorem, t(x) is sufficient for θ iff

   f(x; θ) = g(t(x); θ)h(x)

   ⇒ ℓ(θ; x) = log g(t(x); θ) + log h(x)

   ⇒ θ̂ maximizes ℓ(θ; x) ⇔ θ̂ maximizes log g(t(x); θ)

   ⇒ θ̂ is a function of t(x).

Example
Suppose X1, X2, . . . , Xn are i.i.d. Exp(λ); then

   fX(x; λ) = Π_{i=1}^n λe^{−λx_i} = λⁿe^{−λΣx_i} = λⁿe^{−λnx̄}.

By the Factorization Theorem, x̄ is sufficient for λ. To get the MLE,

   U(λ; x) = ∂ℓ/∂λ = ∂/∂λ (n log λ − nλx̄) = n/λ − nx̄;

   ∂ℓ/∂λ = 0  ⇒  1/λ = x̄  ⇒  λ̂ = 1/x̄.

Note: as proved, λ̂ is a function of the sufficient statistic x̄.
Let Y1, Y2, . . . , Yn be defined by Yi = log Xi.


If X ∼ Exp(λ) and Y = log X, then we can find fY(y) by taking h(x) = log x and using

   fY(y) = fX(h⁻¹(y))|(h⁻¹)′(y)|   [h⁻¹(y) = e^y, (h⁻¹)′(y) = e^y]
         = λe^{−λe^y} e^y

   ⇒ ℓY(λ; y) = log Π_{i=1}^n λe^{−λe^{y_i}} e^{y_i}
              = log [λⁿ e^{−λΣe^{y_i}} e^{Σy_i}]
              = n log λ − λ Σ_{i=1}^n e^{y_i} + Σ_{i=1}^n yi

   ⇒ ∂ℓY/∂λ = n/λ − Σ_{i=1}^n e^{y_i} = 0

   ⇒ λ̂ = n/Σ_{i=1}^n e^{y_i} = n/Σ_{i=1}^n e^{log x_i} = n/Σ_{i=1}^n xi = 1/x̄.

Finally, suppose we take θ = log λ ⇒ λ = e^θ

   ⇒ f(x; θ) = e^θ e^{−e^θ x}   (= λe^{−λx})

   ⇒ ℓθ(θ; x) = log Π_{i=1}^n e^θ e^{−e^θ x_i}
              = log [e^{nθ} e^{−e^θ nx̄}]
              = nθ − e^θ nx̄.


   ∂ℓθ/∂θ = n − nx̄e^θ

   ∴ ∂ℓθ/∂θ = 0  ⇒  1 = x̄e^θ  ⇒  e^θ = 1/x̄

   ⇒ θ̂ = −log x̄.

But λ̂ = 1/x̄  ⇒  log λ̂ = log(1/x̄) = −log x̄ = θ̂, as required.
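The invariance property can also be seen numerically (an illustrative sketch, assuming NumPy), by maximising the Exp(λ) likelihood over a grid in λ and, separately, over a grid in θ = log λ:

```python
import numpy as np

# Illustrative sketch: maximise l(lambda) and l_theta(theta) on grids; the two
# maximisers satisfy theta_hat = log(lambda_hat) = -log(xbar).
rng = np.random.default_rng(8)
x = rng.exponential(scale=1 / 2.0, size=400)  # Exp(lambda = 2) sample
n, xbar = len(x), x.mean()

lam = np.linspace(0.1, 10, 200000)
ll_lam = n * np.log(lam) - lam * n * xbar        # l(lam) = n log lam - n lam xbar

theta = np.linspace(np.log(0.1), np.log(10), 200000)
ll_theta = n * theta - np.exp(theta) * n * xbar  # l_theta = n theta - e^theta n xbar

lam_hat = lam[np.argmax(ll_lam)]
theta_hat = theta[np.argmax(ll_theta)]
print(lam_hat, theta_hat)  # theta_hat ~ log(lam_hat) ~ -log(xbar)
```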
Remark: (Not examinable)
Maximum likelihood estimation and the method of moments can both be generated by
the use of estimating functions.
An estimating function is a function, H(x; θ), with the property E{H(x; θ)} = 0.
H can be used to define an estimator θ̃ which is a solution to the equation

   H(x; θ) = 0.

For method of moments estimates, we can take

   H(x; θ) = x̄ − E(X)   [x̄ − E(X̄) if not the i.i.d. case].

For maximum likelihood, we can use

   H(x; θ) = U(θ; x) = ∂ℓ/∂θ,

so that to calculate the MLE we (1) find the log-likelihood, then (2) calculate the
score and set it equal to 0.
Recall we showed previously that E{U(θ; x)} = 0.

2.3.4 Asymptotic Properties of MLEs

Suppose X1, X2, X3, . . . is a sequence of i.i.d. RV's with common PDF (or probability
function) f(x; θ).
Assume:

(1) θ0 is the true value of the parameter θ;

(2) f is s.t. f(x; θ1) = f(x; θ2) for all x ⇒ θ1 = θ2.

We will show (in outline) that if θ̂n is the MLE (maximum likelihood estimator) based
on X1, X2, . . . , Xn, then:

(1) θ̂n → θ0 in probability, i.e., for each ε > 0,

      lim_{n→∞} P(|θ̂n − θ0| > ε) = 0.

(2) √(n i(θ0)) (θ̂n − θ0) →D N(0, 1) as n → ∞   (Asymptotic Normality).

Remark
The practical use of asymptotic normality is that for large n,

   θ̂ ∼ N(θ0, 1/(n i(θ0)))   approximately.

Outline of consistency and asymptotic normality for MLEs:

Consider i.i.d. data X1, X2, . . . with common PDF/prob. function f(x; θ).
If θ̂n is the MLE based on X1, X2, . . . , Xn, we will show (in outline) that θ̂n → θ0 in
probability as n → ∞, where θ0 is the true value for θ.
Lemma. 2.3.1
Suppose f is such that f(x; θ1) = f(x; θ2) for all x ⇒ θ1 = θ2. Then ℓ*(θ) =
E{log f(X; θ)} is maximized uniquely by θ = θ0.


Proof.

   ℓ*(θ) − ℓ*(θ0) = ∫_{−∞}^{∞} (log f(x; θ))f(x; θ0)dx − ∫_{−∞}^{∞} (log f(x; θ0))f(x; θ0)dx

                  = ∫_{−∞}^{∞} log[f(x; θ)/f(x; θ0)] f(x; θ0)dx

                  ≤ ∫_{−∞}^{∞} [f(x; θ)/f(x; θ0) − 1] f(x; θ0)dx   (since log u ≤ u − 1)

                  = ∫_{−∞}^{∞} (f(x; θ) − f(x; θ0))dx

                  = ∫_{−∞}^{∞} f(x; θ)dx − ∫_{−∞}^{∞} f(x; θ0)dx = 0.

Moreover, equality is achieved iff

   f(x; θ)/f(x; θ0) = 1 for all x
   ⇒ f(x; θ) = f(x; θ0) for all x
   ⇒ θ = θ0.

Lemma. 2.3.2
Let ℓ̄n(θ; x) = (1/n)ℓ(θ; x1, . . . , xn), i.e., (1/n) × the log likelihood based on x1, x2, . . . , xn.
Then for each θ, ℓ̄n(θ; x) → ℓ*(θ) in probability as n → ∞.

Proof. Observe that

   ℓ̄n(θ; x) = (1/n) log Π_{i=1}^n f(xi; θ)
            = (1/n) Σ_{i=1}^n log f(xi; θ)
            = (1/n) Σ_{i=1}^n Li(θ),   where Li(θ) = log f(xi; θ).

Since the Li(θ) are i.i.d. with E(Li(θ)) = ℓ*(θ), it follows by the weak law of large
numbers that ℓ̄n(θ; x) → ℓ*(θ) in probability as n → ∞.


To summarise, we have proved that:

(1) ℓ̄n(θ; x) → ℓ*(θ) in probability;

(2) ℓ*(θ) is maximized when θ = θ0.

Since θ̂n maximizes ℓ̄n(θ; x), it follows that θ̂n → θ0 in probability.

Theorem. 2.3.1
Under the above assumptions, let Un(θ; x) denote the score based on X1, . . . , Xn.
Then,

   Un(θ0; x)/√(n i(θ0)) →D N(0, 1).

Proof.

   Un(θ0; x) = ∂/∂θ log Π_{i=1}^n f(xi; θ)|_{θ=θ0}
             = Σ_{i=1}^n ∂/∂θ log f(xi; θ)|_{θ=θ0}
             = Σ_{i=1}^n Ui,   where Ui = ∂/∂θ log f(xi; θ)|_{θ=θ0}.

Since U1, U2, . . . are i.i.d. with E(Ui) = 0 and Var(Ui) = i(θ0), by the Central Limit
Theorem,

   [Σ_{i=1}^n Ui − nE(U)]/√(n Var(U)) = Un(θ0; x)/√(n i(θ0)) →D N(0, 1).

Theorem. 2.3.2
Under the above assumptions,

   √(n i(θ0)) (θ̂n − θ0) →D N(0, 1).


Proof. Consider the first order Taylor expansion of Un(θ̂n; x) about θ0:

   Un(θ̂n; x) ≈ Un(θ0; x) + Un′(θ0; x)(θ̂n − θ0).

For large n, since Un(θ̂n; x) = 0,

   Un(θ0; x) ≈ −Un′(θ0; x)(θ̂n − θ0).

By Theorem 2.3.1, Un(θ0; x)/√(n i(θ0)) →D N(0, 1)

   ⇒ −Un′(θ0; x)(θ̂n − θ0)/√(n i(θ0)) →D N(0, 1).

Now observe that

   −Un′(θ0; x)/n → i(θ0) as n → ∞

by the weak law of large numbers, since:

   −Un′(θ0; x)/n = −(1/n) Σ_{i=1}^n ∂²/∂θ² log f(xi; θ)|_{θ=θ0}

   ⇒ −Un′(θ0; x)/(n i(θ0)) → 1 in probability.

Hence, [−Un′(θ0; x)/√(n i(θ0))] (θ̂n − θ0) ≈ √(n i(θ0)) (θ̂n − θ0) for large n

   ⇒ √(n i(θ0)) (θ̂n − θ0) →D N(0, 1).

Remark
The preceding theory can be generalized to include vector-valued parameters. We will
not discuss the details.

2.4 Hypothesis Tests and Confidence Intervals

Motivating example:
Suppose X1, X2, . . . , Xn are i.i.d. N(µ, σ²), and consider H0 : µ = µ0 vs. Ha : µ ≠ µ0.
If σ² is known, then the test of H0 with significance level α is defined by the test
statistic

   Z = (X̄ − µ0)/(σ/√n),

and the rule: reject H0 if |z| ≥ z(α/2).

A 100(1 − α)% CI for µ is given by

   (X̄ − z(α/2) σ/√n, X̄ + z(α/2) σ/√n).

It is easy to check that the confidence interval contains all values of µ0 that are
acceptable null hypotheses.

2.4.1 Hypothesis testing

In general, consider a statistical problem with parameter

   θ ∈ Θ0 ∪ ΘA, where Θ0 ∩ ΘA = φ.

We consider the null hypothesis

   H0 : θ ∈ Θ0

and the alternative hypothesis

   HA : θ ∈ ΘA.

The hypothesis testing set-up can be represented as:

                            Actual Status
                      H0 true            HA true
   Accept H0          correct            type II error (β)
   Reject H0          type I error (α)   correct

We would like both the type I and type II error rates to be as small as possible.
However, these goals conflict with each other. To reduce the type I error rate we
need to "make it harder to reject H0". To reduce the type II error rate we need to
"make it easier to reject H0".
The standard (Neyman-Pearson) approach to hypothesis testing is to control the type
I error rate at a "small" value α and then use a test that makes the type II error as
small as possible.
The equivalence between confidence intervals and hypothesis tests can be formulated
as follows. Recall that a 100(1 − α)% CI for θ is a random interval (L, U) with the
property

   P((L, U) ∋ θ) = 1 − α.

It is easy to check that the test defined by the rule

   "Accept H0 : θ = θ0 iff θ0 ∈ (L, U)"

has significance level α. Here

   α = P(reject H0 | H0 true),
   β = P(retain H0 | HA true),
   1 − β = power = P(reject H0 | HA true), which is what we want.

Conversely, given a hypothesis test of H0 : θ = θ0 with significance level α, it can be
proved that the set {θ0 : H0 : θ = θ0 is accepted} is a 100(1 − α)% confidence region
for θ.

2.4.2 Large sample tests and confidence intervals

Consider a statistical problem with data x1, . . . , xn, log-likelihood ℓ(θ; x), score U(θ; x)
and information i(θ).
Consider also a hypothesis H0 : θ = θ0. The following three tests are often considered:

(1) The Wald Test:

    Test Statistic: W = √(n i(θ̂)) (θ̂ − θ0)

    Critical Region: reject for |W| ≥ z(α/2)

(2) The Score Test:

    Test Statistic: V = U(θ0; x)/√(n i(θ0))

    Critical Region: reject for |V| ≥ z(α/2)

(3) Likelihood-Ratio Test:

    Test Statistic: G² = 2(ℓ(θ̂) − ℓ(θ0))

    Critical Region: reject for G² ≥ χ²_{1,α}

Example:
Suppose X1, X2, . . . , Xn i.i.d. Po(λ), and consider H0 : λ = λ0.

   ℓ(λ; x) = log Π_{i=1}^n e^{−λ}λ^{x_i}/x_i!
           = n(x̄ log λ − λ) − Σ_{i=1}^n log x_i!

   U(λ; x) = ∂ℓ/∂λ = n(x̄/λ − 1) = n(x̄ − λ)/λ  ⇒  λ̂ = x̄

   I(λ) = n i(λ) = E[−∂²ℓ/∂λ²] = E[nX̄/λ²] = n/λ

   ⇒ W = √(n i(λ̂)) (λ̂ − λ0) = (λ̂ − λ0)/√(λ̂/n)

     V = U(λ0; x)/√(n i(λ0)) = (x̄ − λ0)/√(λ0/n)

     G² = 2(ℓ(λ̂) − ℓ(λ0))
        = 2n(x̄ log λ̂ − λ̂) − 2n(x̄ log λ0 − λ0)
        = 2n[x̄ log(λ̂/λ0) − (λ̂ − λ0)].
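The three statistics are easy to compute for simulated data (an illustrative sketch, assuming NumPy):

```python
import numpy as np

# Illustrative sketch: the Wald, score and LR statistics for H0: lam = lam0,
# computed on simulated Po(lam0) data (so H0 is true here).
rng = np.random.default_rng(9)
lam0, n = 4.0, 100
x = rng.poisson(lam0, size=n)
xbar = x.mean()
lam_hat = xbar  # the MLE

w = (lam_hat - lam0) / np.sqrt(lam_hat / n)                # Wald
v = (xbar - lam0) / np.sqrt(lam0 / n)                      # score
g2 = 2 * n * (xbar * np.log(xbar / lam0) - (xbar - lam0))  # likelihood ratio
print(w, v, g2)  # under H0: w, v roughly N(0,1) draws; g2 roughly chi-squared(1)
```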

Remarks

(1) It can be proved that the tests based on W, V, G² are asymptotically equivalent
    under H0.

(2) As a by-product, it follows that the null distribution of G² is χ²₁. (Recall
    from Theorems 2.3.1 and 2.3.2 that the null distributions of W and V are both
    N(0, 1).)

(3) To understand the motivation for the three tests, it is useful to consider
    their relation to the log-likelihood function. See Figure 20.

Figure 20: Relationships of the tests to the log-likelihood function

We have introduced the Wald test, the score test and the likelihood ratio test for
H0 : θ = θ0 vs. HA : θ = θa. These are large-sample tests in that asymptotic
distributions for the test statistics under H0 are available. It can also be proved that
the LR statistic and the score test statistic are invariant under transformation of the
parameter, but the Wald test is not.
Each of these three tests can be inverted to give a confidence interval (region) for θ:

Wald Test
Solve for θ0 in W² ≤ z(α/2)². Recall W = √(n i(θ̂)) (θ̂ − θ0)

   ⇒ θ̂ − z(α/2)/√(n i(θ̂)) ≤ θ0 ≤ θ̂ + z(α/2)/√(n i(θ̂)),

i.e., θ̂ ± z(α/2)/√(n i(θ̂)).

Score Test
Need to solve for θ0 in V² = [U(θ0; x)/√(n i(θ0))]² ≤ z(α/2)².

LR Test
Solve for θ0 in 2(ℓ(θ̂; x) − ℓ(θ0; x)) ≤ χ²_{1,α} = z(α/2)².
Example:


Example:
X1, . . . , Xn i.i.d. Po(λ).

Recall the Wald statistic is W = (λ̂ − λ0)/√(λ̂/n)

   ⇒ λ̂ ± z(α/2)√(λ̂/n)  ⇐⇒  x̄ ± z(α/2)√(x̄/n).

Score test:
Recall that the test statistic is V = (x̄ − λ0)/√(λ0/n). Hence to find a confidence
interval, we need to solve for λ0 in the equation

   V² ≤ z(α/2)²

   ⇒ (x̄ − λ0)²/(λ0/n) ≤ z(α/2)²

   ⇒ (x̄ − λ0)² ≤ λ0 z(α/2)²/n

   ⇒ λ0² − (2x̄ + z(α/2)²/n)λ0 + x̄² ≤ 0,

which is a quadratic in λ0 and hence can be solved: λ0 = (−b ± √(b² − 4ac))/(2a).

For the LR test, we have

   G² = 2n[x̄ log(x̄/λ0) − (x̄ − λ0)]

→ can solve numerically for λ0 in G² ≤ z(α/2)². See Figure 21.
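The score interval comes from the roots of this quadratic; the following sketch (illustrative only; the values of x̄ and n are made up for the example) computes it and compares it with the Wald interval:

```python
import numpy as np

# Illustrative sketch: 95% score interval for a Poisson mean from the roots of
# lam0^2 - (2*xbar + z^2/n)*lam0 + xbar^2 = 0 (a = 1 in the quadratic formula).
xbar, n, z = 3.2, 50, 1.96  # made-up summary statistics for illustration

b = -(2 * xbar + z ** 2 / n)
c = xbar ** 2
disc = np.sqrt(b ** 2 - 4 * c)   # discriminant, with a = 1
lo, hi = (-b - disc) / 2, (-b + disc) / 2

# For comparison, the Wald interval xbar +/- z*sqrt(xbar/n):
wald = (xbar - z * np.sqrt(xbar / n), xbar + z * np.sqrt(xbar / n))
print((lo, hi), wald)  # the two intervals nearly coincide here
```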

2.4.3 Optimal tests

Consider simple null and alternative hypotheses:

H0 : θ = θ0 and Ha : θ = θa


Figure 21: The 3 tests

Recall that the type I error rate is defined by

α = P(reject H0 | H0 true),

and the type II error rate by

β = P(accept H0 | H0 false).

The power is defined by 1 − β = P(reject H0 | H0 false) → so we want high power.


Theorem 2.4.1 (Neyman-Pearson Lemma)
Consider the test of H0 : θ = θ0 vs. Ha : θ = θa defined by the rule

reject H0 for f(x; θ0)/f(x; θa) ≤ k,


for some constant k.


Let α∗ be the type I error rate and 1 − β∗ the power of this test. Then any other
test with α ≤ α∗ will have (1 − β) ≤ (1 − β∗).

Proof. Let C be the critical region for the LR test and let D be the critical region (RR)
for any other test.
Let C1 = C ∩ D and let C2, D2 be such that

C = C1 ∪ C2 , C1 ∩ C2 = ∅

D = C1 ∪ D2 , C1 ∩ D2 = ∅.

Figure 22: Neyman-Pearson Lemma


See Figure 22 and observe that α ≤ α∗

=⇒ ∫_D f(x; θ0) dx1 … dxn ≤ ∫_C f(x; θ0) dx1 … dxn

=⇒ ∫_{D2} f(x; θ0) dx1 … dxn ≤ ∫_{C2} f(x; θ0) dx1 … dxn.

Now

(1 − β∗) − (1 − β) = ∫_C f(x; θa) dx1 … dxn − ∫_D f(x; θa) dx1 … dxn

= ∫_{C1} f(x; θa) dx1 … dxn + ∫_{C2} f(x; θa) dx1 … dxn
− ∫_{C1} f(x; θa) dx1 … dxn − ∫_{D2} f(x; θa) dx1 … dxn, since D = C1 ∪ D2,

= ∫_{C2} f(x; θa) dx1 … dxn − ∫_{D2} f(x; θa) dx1 … dxn

≥ (1/k) ∫_{C2} f(x; θ0) dx1 … dxn − (1/k) ∫_{D2} f(x; θ0) dx1 … dxn

≥ 0, as required.

See Figure 23.


Moreover, equality is achieved only if D2 is empty (D2 = ∅).

Example:
Suppose X1 , X2 , . . . , Xn are i.i.d. N(µ, σ²), with σ² given, and consider

H0 : µ = µ0 vs. Ha : µ = µa , µa > µ0


Figure 23: Neyman-Pearson Lemma

Then

f(x; µ0)/f(x; µa) = [ ∏_{i=1}^n (1/(√(2π) σ)) exp{−(xi − µ0)²/(2σ²)} ] / [ ∏_{i=1}^n (1/(√(2π) σ)) exp{−(xi − µa)²/(2σ²)} ]

= exp{ −(1/(2σ²)) ( Σ_{i=1}^n xi² − 2nx̄µ0 + nµ0² ) } / exp{ −(1/(2σ²)) ( Σ_{i=1}^n xi² − 2nx̄µa + nµa² ) }

= exp{ (1/(2σ²)) ( 2nx̄(µ0 − µa) − nµ0² + nµa² ) }.
2σ 2

For a constant k,

f(x; µ0)/f(x; µa) ≤ k

⇔ (µ0 − µa) x̄ ≤ k∗

⇔ x̄ ≥ c,

for a suitably chosen c: since µa > µ0, the coefficient µ0 − µa is negative, so dividing reverses the inequality and we reject when x̄ is too large.


To choose c, we use the fact that

Z = (X̄ − µ0)/(σ/√n) ∼ N(0, 1) under H0.

Hence the usual z-test,

reject H0 if ẑ ≥ z(α),

is the Neyman-Pearson LR test in this case.
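The size and power of this z-test are easy to check by simulation. A Python sketch using only the standard library (the function name and all parameter values are hypothetical choices for illustration):

```python
import math
import random
from statistics import NormalDist

def ztest_reject_rate(mu_true, mu0, sigma, n, alpha=0.05, reps=20000, seed=1):
    """Monte Carlo rejection rate of the one-sided z-test:
    reject H0 when (x̄ − µ0)/(σ/√n) ≥ z(α)."""
    rng = random.Random(seed)
    zcrit = NormalDist().inv_cdf(1 - alpha)  # z(α)
    se = sigma / math.sqrt(n)
    count = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu_true, sigma) for _ in range(n)) / n
        if (xbar - mu0) / se >= zcrit:
            count += 1
    return count / reps

size = ztest_reject_rate(mu_true=0.0, mu0=0.0, sigma=1.0, n=25)   # estimates α
power = ztest_reject_rate(mu_true=0.5, mu0=0.0, sigma=1.0, n=25)  # estimates 1 − β
```

The exact power here is 1 − Φ( z(α) − (µa − µ0)√n/σ ), which the simulated rate should approximate.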
Remarks

(1) This result shows that the one-sided z-test is also uniformly most powerful
for
H0 : µ = µ0 vs. HA : µ > µ0

(2) This can be extended to the case of

H0 : µ ≤ µ0 vs. HA : µ > µ0

In this case we take

α = max_{µ ∈ H0} P(reject H0 | µ), which here equals P(reject H0 | µ = µ0).

(3) This construction fails when we consider two-sided alternatives

i.e., H0 : µ = µ0 vs. HA : µ ≠ µ0

=⇒ no uniformly most powerful test exists for that case.

