MSIII Lecture Notes
School of Mathematical Sciences
Contents
1 Distribution Theory
  1.1 Discrete distributions
    1.1.1 Bernoulli distribution
    1.1.2 Binomial distribution
    1.1.3 Geometric distribution
    1.1.4 Negative Binomial distribution
    1.1.5 Poisson distribution
    1.1.6 Hypergeometric distribution
  1.2 Continuous Distributions
    1.2.1 Uniform Distribution
    1.2.2 Exponential Distribution
    1.2.3 Gamma distribution
    1.2.4 Beta density function
    1.2.5 Normal distribution
    1.2.6 Standard Cauchy distribution
  1.3 Transformations of a single random variable
  1.4 CDF transformation
  1.5 Non-monotonic transformations
  1.6 Moments of transformed RVs
  1.7 Multivariate distributions
    1.7.1 Trinomial distribution
    1.7.2 Multinomial distribution
    1.7.3 Marginal and conditional distributions
    1.7.4 Continuous multivariate distributions
  1.8 Transformations of several RVs
    1.8.1 Multivariate transformation rule
    1.8.2 Method of regular transformations
  1.9 Moments
    1.9.1 Moment generating functions
    1.9.2 Marginal distributions and the MGF
    1.9.3 Vector notation
    1.9.4 Properties of variance matrices
  1.10 The multivariate normal distribution
    1.10.1 The multivariate normal MGF
    1.10.2 Independence and normality
  1.11 Limit Theorems
    1.11.1 Convergence of random variables
2 Statistical Inference
  2.1 Basic definitions and terminology
    2.1.1 Criteria for good estimators
  2.2 Minimum Variance Unbiased Estimation
    2.2.1 Likelihood, score and Fisher Information
    2.2.2 Cramer-Rao Lower Bound
    2.2.3 Exponential families of distributions
    2.2.4 Sufficient statistics
    2.2.5 The Rao-Blackwell Theorem
  2.3 Methods of Estimation
    2.3.1 Method of Moments
    2.3.2 Maximum Likelihood Estimation
    2.3.3 Elementary properties of MLEs
    2.3.4 Asymptotic Properties of MLEs
  2.4 Hypothesis Tests and Confidence Intervals
    2.4.1 Hypothesis testing
    2.4.2 Large sample tests and confidence intervals
    2.4.3 Optimal tests
1 Distribution Theory
P(a ≤ X ≤ b) = ∫_a^b f(x) dx,  for all a ≤ b.
The PDF is a piecewise continuous function which integrates to 1 over the range of the RV. Note that
P(X = a) = ∫_a^a f(x) dx = 0 for any continuous RV.
Example: ‘Precipitation’ is neither a discrete nor a continuous RV, since there is zero
precipitation on some days; it is a mixture of both.
The cumulative distribution function (CDF) is defined by F(x) = P(X ≤ x).
E{h(X)} =
  Σ_x h(x) p(x),                 if Σ_x |h(x)| p(x) < ∞              (X discrete)
  ∫_{−∞}^{∞} h(x) f(x) dx,       if ∫_{−∞}^{∞} |h(x)| f(x) dx < ∞    (X continuous)

The moment generating function (MGF) of the RV X is

M_X(t) = E[e^{tX}] =
  Σ_x e^{tx} p(x),               X discrete
  ∫_{−∞}^{∞} e^{tx} f(x) dx,     X continuous
M_X(0) = 1 always; the MGF may or may not be defined for other values of t.
If M_X(t) is defined for all t in some open interval containing 0, then:
M_X′(0) = E(X),  M_X″(0) = E(X²), and so on.
Parameter: 0≤p≤1
Possible values: {0, 1}
Prob. function:
p(x) = { p,      x = 1
       { 1 − p,  x = 0
     = p^x (1 − p)^{1−x}

E(X) = p
Var(X) = p(1 − p)
M_X(t) = 1 + p(e^t − 1).
Parameter: 0 ≤ p ≤ 1; n > 0;
MGF: MX (t) = {1 + p(et − 1)}n .
Consider a sequence of n independent Bern(p) trials. If X = total number of successes,
then X ∼ B(n, p).
Probability function:
p(x) = C(n, x) p^x (1 − p)^{n−x},  x = 0, 1, . . . , n,
where p(x) ≥ 0 and Σ_{x=0}^{n} p(x) = 1.
Negative Binomial moments and MGF (X = number of failures before the n-th success):
E(X) = n(1 − p)/p,
Var(X) = n(1 − p)/p²,
M_X(t) = {p / (1 − e^t(1 − p))}^n.
2. The Poisson distribution also arises as the limiting form of the binomial distribution as
   n → ∞, p → 0, with np → λ.

(C) The probability of more than one occurrence in [t, t + h) is o(h) as h → 0 (i.e. the probability is small, negligible).

Note: o(h), pronounced "small order h", is standard notation for any function r(h) with the property
lim_{h→0} r(h)/h = 0.
Consider an urn containing M black and N white balls. Suppose n balls are sampled
randomly without replacement and let X be the number of black balls chosen. Then
X has a hypergeometric distribution.
Parameters: M, N > 0, 0<n≤M +N
Possible values: max (0, n − N ) ≤ x ≤ min (n, M )
Prob function:
p(x) = C(M, x) C(N, n − x) / C(M + N, n),
E(X) = n M/(M + N),
Var(X) = [(M + N − n)/(M + N − 1)] · nMN/(M + N)².
2. To see how the limits arise, observe we must have x ≤ n (i.e., no more than
sample size of black balls in the sample.) Also, x ≤ M , i.e., x ≤ min (n, M ).
Similarly, we must have x ≥ 0 (i.e., cannot have < 0 black balls in sample), and
n − x ≤ N (i.e., cannot have more white balls than number in urn).
i.e. x ≥ n − N
i.e. x ≥ max (0, n − N ).
3. If we sample with replacement, we would get X ∼ B(n, p = M/(M + N)). It is interesting to compare moments:
4. When M, N >> n, the difference between sampling with and without replace-
ment should be small.
Figure 3: p = 1/3
Figure 4: p = 1/2 (without replacement)
Figure 5: p = 1/3 (with replacement)

Intuitively, this implies that for M, N >> n, the hypergeometric and binomial probabilities should be very similar, and this can be verified for fixed n, x:
lim_{M,N→∞, M/(M+N)→p}  C(M, x) C(N, n − x) / C(M + N, n)  =  C(n, x) p^x (1 − p)^{n−x}.
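A quick numerical illustration of this limit (a sketch, not part of the notes; the values of M, N and n below are arbitrary) compares the two probability functions directly:

```python
import numpy as np
from scipy.stats import hypergeom, binom

# Illustrative values with M, N >> n (assumed for the demonstration).
M, N, n = 600, 1200, 10
p = M / (M + N)

x = np.arange(0, n + 1)
# scipy's hypergeom takes (population size, number of "black" items, sample size).
hg_pmf = hypergeom(M + N, M, n).pmf(x)
bin_pmf = binom(n, p).pmf(x)
print(np.max(np.abs(hg_pmf - bin_pmf)))  # small when M, N >> n
```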
F(x) = ∫_{−∞}^{x} f(u) du = ∫_a^x 1/(b − a) du = (x − a)/(b − a),
that is,
F(x) = 0 for x ≤ a;  (x − a)/(b − a) for a < x < b;  1 for b ≤ x.

M_X(t) = (e^{tb} − e^{ta}) / {t(b − a)}.
A special case is the U (0, 1) distribution:
f(x) = 1 for 0 < x < 1, and 0 otherwise,
E(X) = 1/2,  Var(X) = 1/12,  M_X(t) = (e^t − 1)/t.
CDF: F(x) = 1 − e^{−λx},  x ≥ 0, λ > 0.
MGF: M_X(t) = λ/(λ − t),  t < λ.
This is the distribution for the waiting time until the first occurrence in a Poisson
process with rate parameter λ > 0.
1. If X ∼ Exp(λ) then,
P (X ≥ t + x|X ≥ t) = P (X ≥ x)
(memoryless property)
with
gamma function: Γ(α) = ∫_0^∞ t^{α−1} e^{−t} dt,
MGF: M_X(t) = {λ/(λ − t)}^α,  t < λ.
2. The Gamma(p/2, 1/2) distribution is also called the χ²_p (chi-square with p df) distribution if p is an integer;
   χ²_2 is the exponential distribution with λ = 1/2 (the 2 df case).
3. The Gamma(k, λ) distribution can be interpreted as the waiting time until the k-th occurrence in a Poisson process.
X = Y₁/(Y₁ + Y₂) ∼ Beta(α, β),  0 ≤ x ≤ 1.
The Cauchy is a bell-shaped distribution, symmetric about zero, for which no moments are defined.
(It has a sharper peak than the normal distribution, and its tails go to zero much more slowly.)
If Z₁ ∼ N(0, 1) and Z₂ ∼ N(0, 1) independently, then X = Z₁/Z₂ has a Cauchy distribution.
Proof.
p_Y(y) = P(Y = y) = P{h(X) = y} = Σ_{x: h(x) = y} P(X = x) = Σ_{x: h(x) = y} p_X(x).
Theorem. 1.3.2
Suppose X is a continuous RV with PDF fX (x) and let Y = h(X), where h(x) is
differentiable and monotonic, i.e., either strictly increasing or strictly decreasing.
Then the PDF of Y is given by:
f_Y(y) = f_X(h^{−1}(y)) · |d h^{−1}(y)/dy|.

Proof. (Case: h strictly increasing.)
F_Y(y) = P(Y ≤ y) = P{h(X) ≤ y}
       = P{X ≤ h^{−1}(y)}
       = F_X{h^{−1}(y)}
⇒ f_Y(y) = d/dy F_Y(y) = d/dy F_X{h^{−1}(y)}  (use the Chain Rule).

(Case: h strictly decreasing.)
F_Y(y) = P(Y ≤ y) = P{h(X) ≤ y}
       = P{X ≥ h^{−1}(y)}
       = 1 − F_X{h^{−1}(y)}
Examples:
P (Y = 0) = P (X = 0) + P (X = 1) + P (X = 2) + P (X = 3) + P (X = 4)
P (Y = 10) = P (X = 5) + P (X = 6) + P (X = 7) + P (X = 8) + . . . + P (X = 14),
and so on.
y = ax + b
⇒ x = (y − b)/a,  so h^{−1}(y) = (y − b)/a = y/a − b/a
⇒ (h^{−1})′(y) = 1/a.
Hence, f_Y(y) = (1/|a|) f_X((y − b)/a).
Specifically:
1. Suppose Z ∼ N (0, 1) and let X = µ + σZ, σ > 0. Recall that Z has PDF:
φ(z) = (1/√(2π)) e^{−z²/2}.
Hence, f_X(x) = (1/|σ|) φ((x − µ)/σ) = (1/(√(2π)σ)) e^{−(x−µ)²/(2σ²)}, which is the N(µ, σ²) PDF.
2. Suppose X ∼ N(µ, σ²) and let Z = (X − µ)/σ.
Find the PDF of Z.
Solution. Observe that Z = h(X) = (1/σ)X − µ/σ, a linear function with a = 1/σ, b = −µ/σ,

⇒ f_Z(z) = σ f_X((z + µ/σ)/(1/σ)) = σ f_X(µ + σz)
         = σ · (1/(√(2π)σ)) e^{−(µ+σz−µ)²/(2σ²)}
         = (1/√(2π)) e^{−σ²z²/(2σ²)}
         = (1/√(2π)) e^{−z²/2} = φ(z),

i.e., Z ∼ N(0, 1).
3. Suppose X ∼ Gamma(α, 1) and let Y = X/λ, λ > 0.
Find the PDF of Y.
Solution. Since Y = (1/λ)X is a linear function, we have
f_Y(y) = λ f_X(λy) = λ · (1/Γ(α)) (λy)^{α−1} e^{−λy} = (λ^α/Γ(α)) y^{α−1} e^{−λy},
Suppose X is a continuous RV with CDF FX (x), which is increasing over the range of
X. If U = F_X(X), then U ∼ U(0, 1).
Proof.
F_U(u) = P(U ≤ u)
       = P{F_X(X) ≤ u}
       = P{X ≤ F_X^{−1}(u)}
       = F_X{F_X^{−1}(u)} = u,  0 ≤ u ≤ 1,
which is the U(0, 1) CDF.
The converse also applies. If U ∼ U(0, 1) and F is any CDF that is strictly increasing on its range, then X = F^{−1}(U) has CDF F(x), i.e.,
FX (x) = P (X ≤ x) = P {F −1 (U ) ≤ x}
= P (U ≤ F (x))
= F (x), as required.
Or use
Theorem. 1.6.1
If X is a RV of either discrete or continuous type and h(x) is any transformation (not
necessarily monotonic), then E{h(X)} (provided it exists) is given by:
E{h(X)} =
  Σ_x h(x) p(x),             X discrete
  ∫_{−∞}^{∞} h(x) f(x) dx,   X continuous
Proof.
Not examinable.
Examples:
1. CDF transformation
Suppose U ∼ U (0, 1). How can we transform U to get an Exp(λ) RV?
⇒ 1 − u = e^{−λx}
⇒ ln(1 − u) = −λx
⇒ x = −ln(1 − u)/λ.
Hence if U ∼ U(0, 1), it follows that X = −ln(1 − U)/λ ∼ Exp(λ).
Note: Y = −ln U/λ ∼ Exp(λ) as well [both (1 − U) and U have the U(0, 1) distribution].
This type of result is used to generate random numbers. That is, there are good methods for producing U(0, 1) (pseudo-random) numbers. To obtain Exp(λ) random numbers, we can just generate U(0, 1) numbers and then calculate X = −log U / λ.
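A minimal Python sketch of this inverse-CDF method (not part of the notes; λ and the sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0                      # rate parameter (illustrative)
u = rng.uniform(size=100_000)  # U(0, 1) pseudo-random numbers
x = -np.log(u) / lam           # X = -log(U)/lambda should be Exp(lambda)

print(x.mean(), 1 / lam)       # sample mean vs theoretical mean 1/lambda
print(x.var(), 1 / lam**2)     # sample variance vs 1/lambda^2
```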
2. Non-monotonic transformations
Suppose Z ∼ N (0, 1) and let X = Z 2 ; h(Z) = Z 2 is not monotonic, so Theorem
1.3.2 does not apply. However we can proceed as follows:
F_X(x) = P(X ≤ x)
       = P(Z² ≤ x)
       = P(−√x ≤ Z ≤ √x)
       = Φ(√x) − Φ(−√x),
where
Φ(a) = ∫_{−∞}^{a} (1/√(2π)) e^{−t²/2} dt
is the N(0, 1) CDF. Hence,
f_X(x) = d/dx F_X(x) = φ(√x) · (1/2) x^{−1/2} − φ(−√x) · (−(1/2) x^{−1/2}),
where φ(a) = Φ′(a) = (1/√(2π)) e^{−a²/2} is the N(0, 1) PDF. Thus
f_X(x) = (1/2) x^{−1/2} {(1/√(2π)) e^{−x/2} + (1/√(2π)) e^{−x/2}} = (1/√(2πx)) e^{−x/2},  x > 0.
On the other hand, the distribution of Z² is also called the χ²₁ distribution. We have proved that it is the same as the Gamma(1/2, 1/2) distribution.
(Using Theorem 1.6.1, the mean of X = −log U/λ, with U ∼ U(0, 1), can be computed directly:)
E(X) = −(1/λ) ∫_0^1 log u du
     = −(1/λ) [u log u − u]_0^1
     = −(1/λ) ([u log u]_0^1 − [u]_0^1)
     = −(1/λ) [0 − 1]
     = 1/λ, as required.
= a ∫_{−∞}^{∞} x f(x) dx + b ∫_{−∞}^{∞} f(x) dx     [first integral = E(X), second = 1]
= aµ + b.

Var(Y) = E[{Y − E(Y)}²]
       = E[{aX + b − (aµ + b)}²]
       = E[a²(X − µ)²]
       = a² E[(X − µ)²]
       = a² Var(X)
       = a² σ².
Examples
1. Suppose X is continuous with CDF F (x), and F (a) = 0, F (b) = 1; (a, b can be
±∞ respectively).
Let U = F (X). Observe that
M_U(t) = ∫_a^b e^{tF(x)} f(x) dx
       = [(1/t) e^{tF(x)}]_a^b = (e^{tF(b)} − e^{tF(a)}) / t
       = (e^t − 1)/t,
which is the U(0, 1) MGF.
2. Suppose X ∼ U(0, 1), and let Y = −log X / λ.

M_Y(t) = ∫_0^1 e^{t(−log x/λ)} (1) dx
       = ∫_0^1 x^{−t/λ} dx
       = [x^{1−t/λ}/(1 − t/λ)]_0^1
       = 1/(1 − t/λ)
       = λ/(λ − t),
which is the MGF of the Exp(λ) distribution. Hence we can conclude that Y = −log X / λ ∼ Exp(λ).
Definition. 1.7.1
If X1 , X2 , . . . , Xr are discrete RV’s then, X = (X1 , X2 , . . . , Xr )T is called a discrete
random vector.
The probability function P(x) is:
P(x) = P(x₁, x₂, . . . , x_r) = P(X₁ = x₁, X₂ = x₂, . . . , X_r = x_r).

Trinomial distribution (r = 2):
Consider a sequence of n independent trials where each trial produces one of three outcome types, with probabilities π₁, π₂ and 1 − π₁ − π₂. If X₁, X₂ count the outcomes of types 1 and 2, then
P(x₁, x₂) = [n!/(x₁! x₂! (n − x₁ − x₂)!)] π₁^{x₁} π₂^{x₂} (1 − π₁ − π₂)^{n−x₁−x₂}
for x₁, x₂ ≥ 0, x₁ + x₂ ≤ n.
Probability function:
P(x) = (n; x₁, x₂, . . . , x_r) π₁^{x₁} π₂^{x₂} · · · π_r^{x_r},  for x_i ≥ 0, Σ_{i=1}^{r} x_i = n.

Remarks
1. Note (n; x₁, x₂, . . . , x_r) := n!/(x₁! x₂! · · · x_r!) is the multinomial coefficient.
2. Multinomial distribution is the generalisation of the binomial distribution to r
types of outcome.
3. Formulation differs from binomial and trinomial cases in that the redundant count
xr = n − (x1 + x2 + . . . xr−1 ) is included as an argument of P (x).
Definition. 1.7.2
If X has joint probability function P_X(x) = P_X(x₁, x₂), then the marginal probability function for X₁ is:
P_{X₁}(x₁) = Σ_{x₂} P_X(x₁, x₂).
Observe that:
P_{X₁}(x₁) = Σ_{x₂} P_X(x₁, x₂) = Σ_{x₂} P({X₁ = x₁} ∩ {X₂ = x₂}) = P(X₁ = x₁).
Hence the marginal probability function for X1 is just the probability function we
would have if X2 was not observed.
The marginal probability function for X₂ is:
P_{X₂}(x₂) = Σ_{x₁} P_X(x₁, x₂).
Definition. 1.7.3
Suppose X = (X₁, X₂)ᵀ. If P_{X₁}(x₁) > 0, we define the conditional probability function of X₂ | X₁ by:
P_{X₂|X₁}(x₂|x₁) = P_X(x₁, x₂) / P_{X₁}(x₁).
Remarks
1. P_{X₂|X₁}(x₂|x₁) = P({X₁ = x₁} ∩ {X₂ = x₂}) / P(X₁ = x₁) = P(X₂ = x₂ | X₁ = x₁).
2. It is easy to check that P_{X₂|X₁}(x₂|x₁) is a proper probability function with respect to x₂ for each fixed x₁ such that P_{X₁}(x₁) > 0.
Similarly,
P_{X₁|X₂}(x₁|x₂) = P_X(x₁, x₂) / P_{X₂}(x₂).
Remarks
1. Observe that:
P_{X₁}(x₁) = Σ_{x₂} Σ_{x₃} · · · Σ_{x_r} P(x₁, . . . , x_r)
          = p₁(x₁) Σ_{x₂} Σ_{x₃} · · · Σ_{x_r} p₂(x₂) · · · p_r(x_r)
          = c₁ p₁(x₁).

Also,
P_{X₁|X₂...X_r}(x₁ | x₂, . . . , x_r) = P(x₁, . . . , x_r) / P_{X₂...X_r}(x₂, . . . , x_r)
   = p₁(x₁) p₂(x₂) · · · p_r(x_r) / Σ_{x₁} p₁(x₁) p₂(x₂) · · · p_r(x_r)
   = p₁(x₁) / Σ_{x₁} p₁(x₁)
   = (1/c₁) P_{X₁}(x₁) / {(1/c₁) Σ_{x₁} P_{X₁}(x₁)}
   = P_{X₁}(x₁).
That is, P_{X₁|X₂...X_r}(x₁ | x₂, . . . , x_r) = P_{X₁}(x₁).
2. Clearly independence implies
P_{X_i | X₁...X_{i−1}, X_{i+1}...X_r}(x_i | x₁, . . . , x_{i−1}, x_{i+1}, . . . , x_r) = P_{X_i}(x_i).
Moreover, P_{X₁|X₂}(x₁|x₂) = P_{X₁}(x₁) for any partitioning X = (X₁ᵀ, X₂ᵀ)ᵀ if X₁, . . . , X_r are independent.
Definition. 1.7.5
The random vector (X1 , . . . , Xr )T is said to have a continuous multivariate distribution
with PDF f (x) if
P(X ∈ A) = ∫ · · · ∫_A f(x₁, . . . , x_r) dx₁ · · · dx_r  for suitable sets A ⊆ ℝ^r.
Examples
1. If X₁, X₂ have PDF
f(x₁, x₂) = 1 for 0 < x₁ < 1, 0 < x₂ < 1, and 0 otherwise,
the distribution is called uniform on (0, 1) × (0, 1).
It follows that P(X ∈ A) = Area(A) for any A ⊆ (0, 1) × (0, 1):
Recall the joint PDF:
P(X ∈ A) = ∫ · · · ∫_A f(x₁, . . . , x_r) dx₁ · · · dx_r.
Definition. 1.7.6
If X = (X₁ᵀ, X₂ᵀ)ᵀ has joint PDF f(x) = f(x₁, x₂), then the marginal PDF of X₁ is given by:
f_{X₁}(x₁) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f(x₁, . . . , x_{r₁}, x_{r₁+1}, . . . , x_r) dx_{r₁+1} dx_{r₁+2} · · · dx_r,
and the conditional PDF of X₂ | X₁ is
f_{X₂|X₁}(x₂|x₁) = f(x₁, x₂) / f_{X₁}(x₁).
Remarks:
1. fX2 |X1 (x2 |x1 ) cannot be interpreted as the conditional PDF of X2 |{X1 = x1 }
because P (X1 = x1 ) = 0 for any continuous distribution.
Proper interpretation is the limit as δ → 0 in X2 |X1 ∈ B(x1 , δ).
Remarks
1. It is easy to check that if X₁, . . . , X_r are independent then each f_i(x_i) = c_i f_{X_i}(x_i), and moreover c₁ c₂ · · · c_r = 1.
Examples
(a) f_{X₁}(x₁) = ∫_{−∞}^{∞} f(x₁, x₂) dx₂
             = ∫_{−∞}^{−√(1−x₁²)} 0 dx₂ + ∫_{−√(1−x₁²)}^{√(1−x₁²)} (1/π) dx₂ + ∫_{√(1−x₁²)}^{∞} 0 dx₂
             = 0 + [x₂/π]_{−√(1−x₁²)}^{√(1−x₁²)} + 0
             = 2√(1 − x₁²)/π  for −1 < x₁ < 1,  and 0 otherwise,
i.e., a semi-circular distribution (the plotted version has some distortion).

(b) The conditional density for X₂ | X₁ is:
f_{X₂|X₁}(x₂|x₁) = f(x₁, x₂) / f_{X₁}(x₁)
               = 1/(2√(1 − x₁²))  for −√(1 − x₁²) < x₂ < √(1 − x₁²),  and 0 otherwise,
which is uniform, U(−√(1 − x₁²), √(1 − x₁²)).
2. If X1 , . . . , Xr are any independent continuous RV’s then their joint PDF is:
Figure 13: A graphic showing the conditional distribution and a semi-circular distri-
bution.
f(x₁, . . . , x_r) = Π_{i=1}^{r} f_{X_i}(x_i).
For example, if X₁, . . . , X_r are i.i.d. Exp(λ), then
f(x₁, . . . , x_r) = Π_{i=1}^{r} λ e^{−λx_i} = λ^r e^{−λ Σ_{i=1}^{r} x_i},  for x_i > 0, i = 1, . . . , r.
Proof.
Let
f₁(x₁) = 1 for 0 < x₁ < 1, and 0 otherwise,
f₂(x₂) = 1 for 0 < x₂ < 1, and 0 otherwise;
then f(x₁, x₂) = f₁(x₁) f₂(x₂), and the two variables are independent.
Proof.
We know:
f_{X₂|X₁}(x₂|x₁) = 1/(2√(1 − x₁²)) for −√(1 − x₁²) < x₂ < √(1 − x₁²), and 0 otherwise,
i.e., U(−√(1 − x₁²), √(1 − x₁²)).
On the other hand,
f_{X₂}(x₂) = (2/π)√(1 − x₂²) for −1 < x₂ < 1, and 0 otherwise,
so f_{X₂|X₁}(x₂|x₁) ≠ f_{X₂}(x₂) and X₁, X₂ are not independent.
Definition. 1.7.8
The joint CDF of RVs X₁, X₂, . . . , X_r is defined by:
F(x₁, . . . , x_r) = P(X₁ ≤ x₁, X₂ ≤ x₂, . . . , X_r ≤ x_r).
Remarks:
Theorem. 1.8.1
If X₁, X₂ are discrete with joint probability function P(x₁, x₂), and Y = X₁ + X₂, then:
1. P_Y(y) = Σ_x P(x, y − x).
2. If X₁, X₂ are independent,
   P_Y(y) = Σ_x P_{X₁}(x) P_{X₂}(y − x).
Proof. (1)
P({Y = y}) = Σ_x P({Y = y} ∩ {X₁ = x})     (law of total probability)

[Recall: if A is an event and B₁, B₂, . . . are events such that ∪_i B_i = S and B_i ∩ B_j = ∅ for i ≠ j,
then P(A) = Σ_i P(A ∩ B_i) = Σ_i P(A|B_i) P(B_i).]

           = Σ_x P({X₁ + X₂ = y} ∩ {X₁ = x})
           = Σ_x P({X₂ = y − x} ∩ {X₁ = x})
           = Σ_x P(x, y − x).
Proof. (2)
Just substitute P(x, y − x) = P_{X₁}(x) P_{X₂}(y − x).
Theorem. 1.8.2
Suppose X₁, X₂ are continuous with joint PDF f(x₁, x₂), and let Y = X₁ + X₂. Then
1. f_Y(y) = ∫_{−∞}^{∞} f(x, y − x) dx.
2. If X₁, X₂ are independent, f_Y(y) = ∫_{−∞}^{∞} f_{X₁}(x) f_{X₂}(y − x) dx.
Proof. (1)
F_Y(y) = P(Y ≤ y) = P(X₁ + X₂ ≤ y)
       = ∫_{x₁=−∞}^{∞} ∫_{x₂=−∞}^{y−x₁} f(x₁, x₂) dx₂ dx₁.
Let x₂ = t − x₁, so dx₂/dt = 1 and dx₂ = dt:
       = ∫_{−∞}^{∞} ∫_{−∞}^{y} f(x₁, t − x₁) dt dx₁
       = ∫_{−∞}^{y} ∫_{−∞}^{∞} f(x₁, t − x₁) dx₁ dt
⇒ f_Y(y) = F_Y′(y) = ∫_{−∞}^{∞} f(x, y − x) dx.
Proof. (2)
Just substitute f(x, y − x) = f_{X₁}(x) f_{X₂}(y − x).
Examples
1. Suppose X₁ ∼ B(n₁, p), X₂ ∼ B(n₂, p) independently. Find the probability function for Y = X₁ + X₂.
Solution.
P_Y(y) = Σ_x P_{X₁}(x) P_{X₂}(y − x)
       = Σ_{x=max(0, y−n₂)}^{min(n₁, y)} C(n₁, x) p^x (1 − p)^{n₁−x} C(n₂, y − x) p^{y−x} (1 − p)^{n₂+x−y}
       = p^y (1 − p)^{n₁+n₂−y} Σ_{x=max(0, y−n₂)}^{min(n₁, y)} C(n₁, x) C(n₂, y − x)
       = C(n₁ + n₂, y) p^y (1 − p)^{n₁+n₂−y} Σ_{x=max(0, y−n₂)}^{min(n₁, y)} [C(n₁, x) C(n₂, y − x) / C(n₁ + n₂, y)]
       [the sum is a hypergeometric probability function, so it equals 1]
       = C(n₁ + n₂, y) p^y (1 − p)^{n₁+n₂−y},
i.e., Y ∼ B(n₁ + n₂, p).
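The same conclusion can be checked by simulation; the following sketch (not part of the notes; the parameter values are arbitrary) compares the empirical distribution of Y = X₁ + X₂ with the B(n₁ + n₂, p) probability function:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)
n1, n2, p = 5, 7, 0.3                       # illustrative parameters
x1 = rng.binomial(n1, p, size=200_000)
x2 = rng.binomial(n2, p, size=200_000)
y = x1 + x2                                 # should behave like Binomial(n1 + n2, p)

emp = np.bincount(y, minlength=n1 + n2 + 1) / y.size
exact = binom(n1 + n2, p).pmf(np.arange(n1 + n2 + 1))
print(np.max(np.abs(emp - exact)))          # small discrepancy
```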
Example: if X₁, X₂ are i.i.d. N(0, 1) and X = X₁ + X₂, then
f_X(x) = ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/2} · (1/√(2π)) e^{−(x−z)²/2} dz
       = ∫_{−∞}^{∞} (1/(2π)) e^{−(z² + (x−z)²)/2} dz
       = e^{−x²/4} ∫_{−∞}^{∞} (1/(2π)) e^{−(z − x/2)²} dz     [since z² + (x−z)² = 2(z − x/2)² + x²/2]
       = (1/(2√π)) e^{−x²/4} ∫_{−∞}^{∞} (1/√π) e^{−(z − x/2)²} dz     [the integrand is the N(x/2, 1/2) PDF, so the integral is 1]
       = (1/√(4π)) e^{−x²/(2·2)},
i.e., X ∼ N(0, 2).
Theorem. 1.8.3 (Ratio of continuous RVs)
Suppose X₁, X₂ are continuous with joint PDF f(x₁, x₂), and let Y = X₂/X₁.
Then Y has PDF
f_Y(y) = ∫_{−∞}^{∞} |x| f(x, yx) dx.
If X₁, X₂ are independent, we obtain
f_Y(y) = ∫_{−∞}^{∞} |x| f_{X₁}(x) f_{X₂}(yx) dx.
Proof.
F_Y(y) = P(Y ≤ y) = P(X₂/X₁ ≤ y)
       = P({X₂ ≥ yX₁} ∩ {X₁ < 0}) + P({X₂ ≤ yX₁} ∩ {X₁ > 0})
       = ∫_{−∞}^{0} ∫_{x₁y}^{∞} f(x₁, x₂) dx₂ dx₁ + ∫_{0}^{∞} ∫_{−∞}^{x₁y} f(x₁, x₂) dx₂ dx₁.
Substituting x₂ = t x₁ (so dx₂ = x₁ dt):
       = ∫_{−∞}^{0} ∫_{y}^{−∞} x₁ f(x₁, t x₁) dt dx₁ + ∫_{0}^{∞} ∫_{−∞}^{y} x₁ f(x₁, t x₁) dt dx₁
       = ∫_{−∞}^{0} ∫_{−∞}^{y} (−x₁) f(x₁, t x₁) dt dx₁ + ∫_{0}^{∞} ∫_{−∞}^{y} x₁ f(x₁, t x₁) dt dx₁
       = ∫_{−∞}^{0} ∫_{−∞}^{y} |x₁| f(x₁, t x₁) dt dx₁ + ∫_{0}^{∞} ∫_{−∞}^{y} |x₁| f(x₁, t x₁) dt dx₁
⇒ f_Y(y) = F_Y′(y) = ∫_{−∞}^{∞} |x₁| f(x₁, y x₁) dx₁.
Example
1. If Z ∼ N(0, 1) and X ∼ χ²_k independently, then T = Z/√(X/k) is said to have the t-distribution with k degrees of freedom. Derive the PDF of T.
Solution. Step 1:
Let V = √(X/k). We need the PDF of V. Recall that χ²_k is Gamma(k/2, 1/2), so
f_X(x) = [(1/2)^{k/2}/Γ(k/2)] x^{k/2−1} e^{−x/2},  x > 0.
Now, v = h(x) = √(x/k) has inverse x = kv², with dx/dv = 2kv, so
f_V(v) = f_X(kv²) · 2kv = [(1/2)^{k/2}/Γ(k/2)] (kv²)^{k/2−1} e^{−kv²/2} (2kv),  v > 0.

Step 2:
Apply Theorem 1.8.3 to find the PDF of T = Z/V:
f_T(t) = ∫_{−∞}^{∞} |v| f_V(v) f_Z(vt) dv
       = ∫_{0}^{∞} v · [2(k/2)^{k/2}/Γ(k/2)] v^{k−1} e^{−kv²/2} · (1/√(2π)) e^{−t²v²/2} dv
       = [(k/2)^{k/2}/(Γ(k/2)√(2π))] ∫_{0}^{∞} v^{k−1} e^{−(1/2)(k+t²)v²} 2v dv;
Substitute u = v², so du = 2v dv:
f_T(t) = [(k/2)^{k/2}/(Γ(k/2)√(2π))] ∫_{0}^{∞} u^{(k−1)/2} e^{−(1/2)(k+t²)u} du
       = [Γ((k+1)/2)/(Γ(k/2)√(2π))] (k/2)^{k/2} {(1/2)(k + t²)}^{−(k+1)/2}
       = [Γ((k+1)/2)/(Γ(k/2)√(kπ))] (1 + t²/k)^{−(k+1)/2},   −∞ < t < ∞.
Remarks:
1. If we take k = 1, we obtain
   f(t) ∝ 1/(1 + t²),
   that is, t₁ is the Cauchy distribution.
   One can also check that Γ(1)/(Γ(1/2)√π) = 1/π, so that f(t) = (1/π) · 1/(1 + t²).

2. As k → ∞, for fixed t,
   lim_{k→∞} (1 + t²/k)^{−(k+1)/2} = lim_{k→∞} (1 + t²/k)^{−k/2}
   = lim_{ℓ→∞} {(1 + t²/(2ℓ))^{ℓ}}^{−1}     [putting ℓ = k/2]
   = 1/e^{t²/2} = e^{−t²/2}.
⇒ f_T(t) → (1/√(2π)) e^{−t²/2} as k → ∞.
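Both limiting cases can be verified numerically; the sketch below (not part of the notes; the grid of t values and the large value of k are arbitrary) uses scipy's t, Cauchy and normal densities:

```python
import numpy as np
from scipy.stats import t, cauchy, norm

ts = np.linspace(-4, 4, 9)
print(np.max(np.abs(t(df=1).pdf(ts) - cauchy().pdf(ts))))    # ~0: t_1 is the Cauchy
print(np.max(np.abs(t(df=200).pdf(ts) - norm().pdf(ts))))    # small: t_k -> N(0, 1)
```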
Remark:
Can sometimes use H −1 instead of G, but need to be careful to evaluate H −1 (x) at
x = h−1 (y) = g(y).
Y = (Y1 , . . . , Ys )T = h(X).
Examples
Suppose Z₁ ∼ N(0, 1), Z₂ ∼ N(0, 1) independently. Consider h(z₁, z₂) = (r, θ)ᵀ, where
r = h₁(z₁, z₂) = √(z₁² + z₂²), and

θ = h₂(z₁, z₂) =
  arctan(z₂/z₁)        for z₁ > 0
  arctan(z₂/z₁) + π    for z₁ < 0
  (π/2) sgn(z₂)        for z₁ = 0, z₂ ≠ 0
  0                    for z₁ = z₂ = 0.

h maps ℝ² → [0, ∞) × [−π/2, 3π/2).
The inverse mapping, g, can be seen to be:
g(r, θ) = (g₁(r, θ), g₂(r, θ))ᵀ = (r cos θ, r sin θ)ᵀ.

Step 1 (Jacobian):
∂g₁/∂r = cos θ,   ∂g₁/∂θ = −r sin θ,
∂g₂/∂r = sin θ,   ∂g₂/∂θ = r cos θ,

⇒ G = [ cos θ   −r sin θ
        sin θ    r cos θ ],   det G = r cos²θ + r sin²θ = r.
Step 2
Now apply theorem 1.8.4.
Recall,
f_Z(z₁, z₂) = (1/√(2π)) e^{−z₁²/2} × (1/√(2π)) e^{−z₂²/2} = (1/(2π)) e^{−(z₁² + z₂²)/2}.
f_{R,Θ}(r, θ) = f_Z(g(r, θ)) |det G(r, θ)|
             = (1/(2π)) r e^{−r²/2}  for r ≥ 0, −π/2 ≤ θ < 3π/2,  and 0 otherwise
             = f_Θ(θ) f_R(r),
where
f_Θ(θ) = 1/(2π) for −π/2 ≤ θ < 3π/2, and 0 otherwise,
f_R(r) = r e^{−r²/2} for r ≥ 0.
Hence, if Z₁, Z₂ are i.i.d. N(0, 1) and we transform to polar coordinates (R, Θ), we find:
1. R and Θ are independent.
2. Θ ∼ U(−π/2, 3π/2).
3. R has the distribution called the Rayleigh distribution.
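A short simulation (not from the notes; the sample size is arbitrary) illustrates these three conclusions. Note numpy's arctan2 returns an angle in (−π, π] rather than [−π/2, 3π/2), but the independence and uniformity conclusions are unchanged:

```python
import numpy as np

rng = np.random.default_rng(2)
z1 = rng.standard_normal(100_000)
z2 = rng.standard_normal(100_000)

r = np.sqrt(z1**2 + z2**2)           # Rayleigh distributed
theta = np.arctan2(z2, z1)           # uniform on an interval of length 2*pi

print(np.corrcoef(r, theta)[0, 1])   # near 0, consistent with independence
print(r.mean(), np.sqrt(np.pi / 2))  # Rayleigh mean is sqrt(pi/2)
```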
= z(1 − y) + zy = z,
where z > 0 (this is the Jacobian of the inverse transformation x₁ = yz, x₂ = (1 − y)z).

f_{Y,Z}(y, z) = f_{X₁}(yz) f_{X₂}((1 − y)z) · z
             = [λ^α/Γ(α)] (yz)^{α−1} e^{−λyz} · [λ^β/Γ(β)] ((1 − y)z)^{β−1} e^{−λ(1−y)z} · z
             = [1/(Γ(α)Γ(β))] λ^{α+β} y^{α−1} (1 − y)^{β−1} z^{α+β−1} e^{−λz}.

Step 3:
f_Y(y) = ∫_0^∞ f(y, z) dz
       = [Γ(α + β)/(Γ(α)Γ(β))] y^{α−1} (1 − y)^{β−1} ∫_0^∞ [λ^{α+β} z^{α+β−1}/Γ(α + β)] e^{−λz} dz
       = [Γ(α + β)/(Γ(α)Γ(β))] y^{α−1} (1 − y)^{β−1},   0 < y < 1,
i.e., Y ∼ Beta(α, β).
1.9 Moments
Theorem. 1.9.1
If h and X1 , . . . , Xr are as above, then provided it exists,
E[h(X)] =
  Σ_{x₁} Σ_{x₂} · · · Σ_{x_r} h(x₁, . . . , x_r) P(x₁, . . . , x_r)                    if X₁, . . . , X_r discrete
  ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h(x₁, . . . , x_r) f(x₁, . . . , x_r) dx₁ · · · dx_r     if X₁, . . . , X_r continuous
Theorem. 1.9.2
Suppose X1 , . . . , Xr are RVs, h1 , . . . , hk are real-valued functions and a1 , . . . , ak are
constants. Then, provided it exists,
E a1 h1 (X) + a2 h2 (X) + · · · + ak hk (X) = a1 E[h1 (X)] + a2 E[h2 (X)] + · · · + ak E [hk (X)]
Proof. (Continuous case.)
E[a₁h₁(X) + · · · + a_k h_k(X)]
  = a₁ ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h₁(x) f_X(x) dx₁ · · · dx_r
  + a₂ ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h₂(x) f_X(x) dx₁ · · · dx_r + · · ·
  + a_k ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} h_k(x) f_X(x) dx₁ · · · dx_r
  = a₁ E[h₁(X)] + · · · + a_k E[h_k(X)],
as required.
Corollary.
Provided it exists,
E a1 X1 + a2 X2 + · · · + ar Xr = a1 E[X1 ] + a2 E[X2 ] + · · · + ar E[Xr ].
Definition. 1.9.1.
If X₁, X₂ are RVs with E[X_i] = µ_i and Var(X_i) = σ_i², i = 1, 2, we define
σ₁₂ = Cov(X₁, X₂) = E[(X₁ − µ₁)(X₂ − µ₂)] = E[X₁X₂] − µ₁µ₂,
ρ₁₂ = Corr(X₁, X₂) = σ₁₂/(σ₁σ₂).

Remark:
Cov(X, X) = Var(X) = E[X²] − (E[X])².
Theorem. 1.9.3
Suppose X1 , X2 , . . . , Xr are RVs with E[Xi ] = µi , Cov(Xi , Xj ) = σij , and let a1 , a2 , . . . , ar ,
b1 , b2 , . . . , br be constants. Then
Cov(Σ_{i=1}^{r} a_i X_i, Σ_{j=1}^{r} b_j X_j) = Σ_{i=1}^{r} Σ_{j=1}^{r} a_i b_j σ_{ij}.
Proof.
Cov(Σ_i a_i X_i, Σ_j b_j X_j)
  = E[{Σ_i a_i X_i − E(Σ_i a_i X_i)}{Σ_j b_j X_j − E(Σ_j b_j X_j)}]
  = E[{Σ_i a_i X_i − Σ_i a_i µ_i}{Σ_j b_j X_j − Σ_j b_j µ_j}]
  = E[{Σ_i a_i (X_i − µ_i)}{Σ_j b_j (X_j − µ_j)}]
  = E[Σ_i Σ_j a_i b_j (X_i − µ_i)(X_j − µ_j)]
  = Σ_i Σ_j a_i b_j E[(X_i − µ_i)(X_j − µ_j)]     (Theorem 1.9.2)
  = Σ_i Σ_j a_i b_j σ_{ij},  as required.
Corollary. 1
Under the above assumptions,
Var(Σ_{i=1}^{r} a_i X_i) = Σ_{i=1}^{r} Σ_{j=1}^{r} a_i a_j σ_{ij} = Σ_{i=1}^{r} a_i² σ_i² + 2 Σ_{i<j} a_i a_j σ_{ij}
(the variance is the covariance of the sum with itself).
Corollary. 2
ρ = Corr(X1 , X2 ) satisfies |ρ| ≤ 1; and |ρ| = 1 implies that X2 is a linear function of
X1 .
Proof. (Continuous case: if X₁, X₂ are independent, then Cov(X₁, X₂) = 0.)
Cov(X₁, X₂) = E[(X₁ − µ₁)(X₂ − µ₂)]
  = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x₁ − µ₁)(x₂ − µ₂) f(x₁, x₂) dx₁ dx₂
  = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x₁ − µ₁)(x₂ − µ₂) f_{X₁}(x₁) f_{X₂}(x₂) dx₁ dx₂     (since X₁, X₂ independent)
  = {∫_{−∞}^{∞} (x₁ − µ₁) f_{X₁}(x₁) dx₁}{∫_{−∞}^{∞} (x₂ − µ₂) f_{X₂}(x₂) dx₂}
  = 0 × 0 = 0.
Remark:
But the converse does NOT apply, in general!
That is, Cov(X1 , X2 ) = 0 6⇒ X1 , X2 independent.
Definition. 1.9.2
If X₁, X₂ are RVs, we define the symbol E[X₁|X₂] to be the expectation of X₁ calculated with respect to the conditional distribution of X₁ | X₂, i.e.,
E[X₁|X₂] =
  Σ_{x₁} x₁ P_{X₁|X₂}(x₁|x₂)                X₁|X₂ discrete
  ∫_{−∞}^{∞} x₁ f_{X₁|X₂}(x₁|x₂) dx₁        X₁|X₂ continuous
Theorem. 1.9.5
Provided the relevant moments exist,
E(X₁) = E_{X₂}[E(X₁|X₂)],
Var(X₁) = E_{X₂}[Var(X₁|X₂)] + Var_{X₂}[E(X₁|X₂)].
Proof. (Continuous case, first part.)
E_{X₂}[E(X₁|X₂)] = ∫_{−∞}^{∞} {∫_{−∞}^{∞} x₁ f(x₁|x₂) dx₁} f_{X₂}(x₂) dx₂
  = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x₁ f(x₁|x₂) f_{X₂}(x₂) dx₁ dx₂
  = ∫_{−∞}^{∞} x₁ {∫_{−∞}^{∞} f(x₁, x₂) dx₂} dx₁     [inner integral = f_{X₁}(x₁)]
  = ∫_{−∞}^{∞} x₁ f_{X₁}(x₁) dx₁
  = E(X₁).
Definition. 1.9.3
If X₁, X₂, . . . , X_r are RVs, then the joint MGF (provided it exists) is given by
M_X(t₁, . . . , t_r) = E[e^{t₁X₁ + t₂X₂ + · · · + t_rX_r}].

Theorem. 1.9.6
If X₁, X₂, . . . , X_r are mutually independent, then the joint MGF satisfies
M_X(t₁, . . . , t_r) = M_{X₁}(t₁) M_{X₂}(t₂) · · · M_{X_r}(t_r),
provided it exists.
Proof. (Continuous case.)
M_X(t₁, . . . , t_r) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} e^{t₁x₁ + t₂x₂ + · · · + t_rx_r} f_X(x₁, . . . , x_r) dx₁ · · · dx_r
  = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} e^{t₁x₁} e^{t₂x₂} · · · e^{t_rx_r} f_{X₁}(x₁) f_{X₂}(x₂) · · · f_{X_r}(x_r) dx₁ · · · dx_r
  = {∫_{−∞}^{∞} e^{t₁x₁} f_{X₁}(x₁) dx₁}{∫_{−∞}^{∞} e^{t₂x₂} f_{X₂}(x₂) dx₂} · · · {∫_{−∞}^{∞} e^{t_rx_r} f_{X_r}(x_r) dx_r}
  = M_{X₁}(t₁) M_{X₂}(t₂) · · · M_{X_r}(t_r).
We saw previously that if Y = h(X), then we can find MY (t) = E[eth(X) ] from the
joint distribution of X1 , . . . , Xr without calculating fY (y) explicitly.
A simple, but important case is:
Y = X1 + X2 + · · · + Xr .
Theorem. 1.9.7
Suppose X₁, . . . , X_r are RVs and let Y = X₁ + X₂ + · · · + X_r. Then (assuming the MGFs exist):
1. M_Y(t) = M_X(t, t, . . . , t);
2. if X₁, . . . , X_r are independent, M_Y(t) = M_{X₁}(t) M_{X₂}(t) · · · M_{X_r}(t);
3. if X₁, . . . , X_r are i.i.d. with common MGF M_X(t), then M_Y(t) = {M_X(t)}^r.
Proof.
M_Y(t) = E[e^{tY}] = E[e^{t(X₁ + · · · + X_r)}] = M_X(t, t, . . . , t), which proves (1).
For parts (2) and (3), substitute into (1) and use Theorem 1.9.6.
Examples
Consider RVs X, V , defined by V ∼ Gamma(α, λ) and the conditional distribution of
X|V ∼Po(V ).
Find E(X) and Var(X).
Solution. Use
E(X) = E{E(X|V)},  Var(X) = E{Var(X|V)} + Var{E(X|V)},
with E(X|V) = V and Var(X|V) = V.
So E(X) = E{E(X|V)} = E(V) = α/λ, and
Var(X) = E_V(V) + Var_V(V)
       = α/λ + α/λ²
       = (α/λ)(1 + 1/λ) = α(1 + λ)/λ².
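A quick simulation of the Gamma–Poisson mixture (not part of the notes; α and λ are arbitrary) confirms these moments:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, lam = 3.0, 2.0                                    # illustrative parameters
v = rng.gamma(shape=alpha, scale=1 / lam, size=200_000)  # V ~ Gamma(alpha, lambda)
x = rng.poisson(v)                                       # X | V ~ Poisson(V)

print(x.mean(), alpha / lam)                 # E(X) = alpha/lambda
print(x.var(), alpha * (1 + lam) / lam**2)   # Var(X) = alpha(1 + lambda)/lambda^2
```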
Remark
The marginal distribution of X is sometimes called the negative binomial distribution.
In particular, when α is an integer, it corresponds to the definition given previously, with p = λ/(1 + λ).
Examples
1. Suppose X₁, . . . , X_n are i.i.d. Bern(p) RVs and let Y = X₁ + X₂ + · · · + X_n;
then M_Y(t) = {1 + p(e^t − 1)}^n, which agrees with the formula previously given for the binomial distribution.
2. Suppose X1 ∼ N(µ1 , σ12 ) and X2 ∼ N(µ2 , σ22 ) independently. Find the MGF of
Y = X1 + X2 .
Solution. Recall that M_{X₁}(t) = e^{tµ₁} e^{t²σ₁²/2} and M_{X₂}(t) = e^{tµ₂} e^{t²σ₂²/2}, so
M_Y(t) = M_{X₁}(t) M_{X₂}(t) = e^{t(µ₁+µ₂)} e^{t²(σ₁²+σ₂²)/2}
⇒ Y ∼ N(µ₁ + µ₂, σ₁² + σ₂²).
To find the moment generating function of the marginal distribution of any set of components of X, set to 0 the complementary elements of t in M_X(t).
Let X = (X₁, X₂)ᵀ and t = (t₁, t₂)ᵀ. Then
M_{X₁}(t₁) = M_X((t₁ᵀ, 0ᵀ)ᵀ).
To see this result, note that if A is a constant matrix and b is a constant vector, then
M_{AX+b}(t) = e^{tᵀb} M_X(Aᵀt).
Proof.
M_{AX+b}(t) = E[e^{tᵀ(AX+b)}] = e^{tᵀb} E[e^{tᵀAX}] = e^{tᵀb} E[e^{(Aᵀt)ᵀX}] = e^{tᵀb} M_X(Aᵀt).

Now partition t_{r×1} = (t₁ᵀ, t₂ᵀ)ᵀ and take A_{l×r} = (I_{l×l}  0_{l×m}).
Note that
AX = (I_{l×l}  0_{l×m}) (X₁ᵀ, X₂ᵀ)ᵀ = X₁,
and
Aᵀt₁ = (I_{l×l}; 0_{m×l}) t₁ = (t₁ᵀ, 0ᵀ)ᵀ.
Hence,
M_{X₁}(t₁) = M_{AX}(t₁) = M_X(Aᵀt₁) = M_X((t₁ᵀ, 0ᵀ)ᵀ),
as required.
Note that similar results hold for more than two random subvectors.
The major limitation of the MGF is that it may not exist. The characteristic function
on the other hand is defined for√all distributions. Its definition is similar to the MGF,
with it replacing t, where i = −1; the properties of the characteristic function are
similar to those of the MGF, but using it requires some familiarity with complex
analysis.
X = (X1 , X2 , . . . , Xr )T ,
Theorem. 1.9.8
Suppose X has E(X) = µ and Var(X) = Σ, and let a, b ∈ Rr be fixed vectors. Then,
1. E(aᵀX) = aᵀµ
2. Var(aᵀX) = aᵀΣa
3. Cov(aᵀX, bᵀX) = aᵀΣb
Remark
It is easy to check that this is just a re-statement of Theorem 1.9.3 using matrix
notation.
Theorem. 1.9.9
Suppose X is a random vector with E(X) = µ, Var(X) = Σ, and let Ap×r and b ∈ Rp
be fixed.
If Y = AX + b, then E(Y) = Aµ + b and Var(Y) = AΣAᵀ.
Remark
This is also a re-statement of previously established results. To see this, observe that
if a_iᵀ is the i-th row of A, then Y_i = a_iᵀX and, moreover, the (i, j)-th element of AΣAᵀ is a_iᵀΣa_j = Cov(a_iᵀX, a_jᵀX) = Cov(Y_i, Y_j).
Hence Σ is non-negative definite (positive definite) iff its eigenvalues are all non-
negative (positive).
If Σ is non-negative definite but not positive definite, then there must be at least one
zero eigenvalue. Let a be the corresponding eigenvector.
⇒ a ≠ 0 but aᵀΣa = 0. That is, Var(aᵀX) = 0 for that a.
=⇒ the distribution of X is degenerate in the sense that either one of the Xi ’s is
constant or else a linear combination of the other components.
Finally, recall that if λ₁, . . . , λ_r are the eigenvalues of Σ, then det(Σ) = Π_{i=1}^{r} λ_i, and det(Σ) = 0 for Σ non-negative definite but not positive definite.
Definition. 1.10.1
The random vector X = (X1 , . . . , Xr )T is said to have the r-dimensional multivariate
normal distribution with parameters µ ∈ Rr and Σr×r positive definite, if it has PDF
f_X(x) = [1/((2π)^{r/2} |Σ|^{1/2})] e^{−(1/2)(x−µ)ᵀΣ^{−1}(x−µ)}.
Examples
For r = 2 with Var(X₁) = σ₁², Var(X₂) = σ₂² and Corr(X₁, X₂) = ρ:
Σ^{−1} = [1/(σ₁²σ₂²(1 − ρ²))] [ σ₂²  −ρσ₁σ₂ ; −ρσ₁σ₂  σ₁² ]
⇒ f(x₁, x₂) = [1/(2πσ₁σ₂√(1 − ρ²))] exp{ −[1/(2(1 − ρ²))] [ ((x₁ − µ₁)/σ₁)² + ((x₂ − µ₂)/σ₂)² − 2ρ((x₁ − µ₁)/σ₁)((x₂ − µ₂)/σ₂) ] }.
Theorem. 1.10.1
Suppose X ∼ Nr (µ, Σ) and let Y = AX + b, where b ∈ Rr and Ar×r invertible are
fixed. Then Y ∼ Nr (Aµ + b, AΣAT ).
Proof.
The inverse transformation is X = g(Y) = A^{−1}(Y − b), with Jacobian matrix H = A^{−1}.
Hence,
f_Y(y) = [1/((2π)^{r/2}|Σ|^{1/2})] e^{−(1/2)(A^{−1}(y−b)−µ)ᵀΣ^{−1}(A^{−1}(y−b)−µ)} |A^{−1}|
       = [1/((2π)^{r/2}|Σ|^{1/2}|AAᵀ|^{1/2})] e^{−(1/2)(y−(Aµ+b))ᵀ(A^{−1})ᵀΣ^{−1}A^{−1}(y−(Aµ+b))}
       = [1/((2π)^{r/2}|AΣAᵀ|^{1/2})] e^{−(1/2)(y−(Aµ+b))ᵀ(AΣAᵀ)^{−1}(y−(Aµ+b))},
which is the N_r(Aµ + b, AΣAᵀ) density.
Now suppose Σ_{r×r} is any symmetric positive definite matrix and recall that we can write Σ = EΛEᵀ, where Λ = diag(λ₁, λ₂, . . . , λ_r) and E_{r×r} is such that EEᵀ = EᵀE = I.
Since Σ is positive definite, we must have λ_i > 0, i = 1, . . . , r, and we can define the symmetric square-root matrix by:
Σ^{1/2} = EΛ^{1/2}Eᵀ,  where Λ^{1/2} = diag(√λ₁, √λ₂, . . . , √λ_r).
Now recall that if Z₁, Z₂, . . . , Z_r are i.i.d. N(0, 1) then Z = (Z₁, . . . , Z_r)ᵀ ∼ N_r(0, I).
Because of the i.i.d. N(0, 1) assumption, we know in this case that E(Z) = 0 and Var(Z) = I.
Now let X = Σ^{1/2}Z + µ: then X ∼ N_r(µ, Σ) by Theorem 1.10.1.
Since this construction is valid for any symmetric positive definite Σ and any
µ ∈ Rr , we have proved,
Theorem. 1.10.2
If X ∼ N_r(µ, Σ), then E(X) = µ and Var(X) = Σ.
Theorem. 1.10.3
If X ∼ Nr (µ, Σ) then Z = Σ−1/2 (X − µ) ∼ Nr (0, I).
Proof.
Use Theorem 1.10.1.
Suppose
(X₁ᵀ, X₂ᵀ)ᵀ ∼ N_{r₁+r₂}( (µ₁ᵀ, µ₂ᵀ)ᵀ, [ Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂ ] ).
Lemma. 1.10.1
If M = [ B A ; O I ] is a square, partitioned matrix, then |M| = |B|.

Lemma. 1.10.2
If M = [ Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂ ], then |M| = |Σ₂₂| |Σ₁₁ − Σ₁₂Σ₂₂^{−1}Σ₂₁|.

Proof. Take
C = [ I  −Σ₁₂Σ₂₂^{−1} ; O  Σ₂₂^{−1} ],
and observe
CM = [ Σ₁₁ − Σ₁₂Σ₂₂^{−1}Σ₂₁   0 ; Σ₂₂^{−1}Σ₂₁   I ],
so
|C| = 1/|Σ₂₂|,  |CM| = |Σ₁₁ − Σ₁₂Σ₂₂^{−1}Σ₂₁|.
Finally, observe |CM| = |C||M|
⇒ |M| = |CM|/|C| = |Σ₂₂| |Σ₁₁ − Σ₁₂Σ₂₂^{−1}Σ₂₁|.
Theorem. 1.10.4
Suppose that
(X₁ᵀ, X₂ᵀ)ᵀ ∼ N_{r₁+r₂}( (µ₁ᵀ, µ₂ᵀ)ᵀ, [ Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂ ] ).
Then the marginal distribution of X₂ is X₂ ∼ N_{r₂}(µ₂, Σ₂₂), and the conditional distribution of X₁ | X₂ = x₂ is
X₁ | X₂ = x₂ ∼ N_{r₁}( µ₁ + Σ₁₂Σ₂₂^{−1}(x₂ − µ₂), Σ₁₁ − Σ₁₂Σ₂₂^{−1}Σ₂₁ ).
Proof.
Write V = Σ₁₁ − Σ₁₂Σ₂₂^{−1}Σ₂₁. The joint density of (X₁, X₂) contains the factor
exp{ −(1/2) ((x₁ − µ₁)ᵀ, (x₂ − µ₂)ᵀ) [ Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂ ]^{−1} ( x₁ − µ₁ ; x₂ − µ₂ ) },
and we use two facts:

1. By Lemma 1.10.2,
(2π)^{(r₁+r₂)/2} |[ Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂ ]|^{1/2}
   = (2π)^{r₁/2} |Σ₁₁ − Σ₁₂Σ₂₂^{−1}Σ₂₁|^{1/2} · (2π)^{r₂/2} |Σ₂₂|^{1/2}.

2. The inverse of the partitioned matrix is
[ Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂ ]^{−1} = [ V^{−1}   −V^{−1}Σ₁₂Σ₂₂^{−1} ; −Σ₂₂^{−1}Σ₂₁V^{−1}   Σ₂₂^{−1} + Σ₂₂^{−1}Σ₂₁V^{−1}Σ₁₂Σ₂₂^{−1} ],
so, expanding term by term,
((x₁ − µ₁)ᵀ, (x₂ − µ₂)ᵀ) [ Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂ ]^{−1} ( x₁ − µ₁ ; x₂ − µ₂ )
   = (x₁ − µ₁)ᵀ V^{−1} (x₁ − µ₁) − (x₁ − µ₁)ᵀ V^{−1}Σ₁₂Σ₂₂^{−1}(x₂ − µ₂)
     − (x₂ − µ₂)ᵀ Σ₂₂^{−1}Σ₂₁ V^{−1} (x₁ − µ₁) + (x₂ − µ₂)ᵀ Σ₂₂^{−1}Σ₂₁ V^{−1} Σ₁₂Σ₂₂^{−1} (x₂ − µ₂)
     + (x₂ − µ₂)ᵀ Σ₂₂^{−1} (x₂ − µ₂).
How did this come about:
(x, y) [ a b ; c d ] (x ; y) = (x, y) ( ax + by ; cx + dy ) = ax² + bxy + cyx + dy².

Collecting terms, and using Σ₁₂ᵀ = Σ₂₁ together with the note
(x − y)ᵀA(x − y) = (x − y)ᵀ(Ax − Ay) = xᵀAx − xᵀAy − yᵀAx + yᵀAy,
the quadratic form equals
{x₁ − µ₁ − Σ₁₂Σ₂₂^{−1}(x₂ − µ₂)}ᵀ V^{−1} {x₁ − µ₁ − Σ₁₂Σ₂₂^{−1}(x₂ − µ₂)} + (x₂ − µ₂)ᵀ Σ₂₂^{−1} (x₂ − µ₂)
   = {x₁ − (µ₁ + Σ₁₂Σ₂₂^{−1}(x₂ − µ₂))}ᵀ (Σ₁₁ − Σ₁₂Σ₂₂^{−1}Σ₂₁)^{−1} {x₁ − (µ₁ + Σ₁₂Σ₂₂^{−1}(x₂ − µ₂))}
     + (x₂ − µ₂)ᵀ Σ₂₂^{−1} (x₂ − µ₂).
Putting these together,
f(x₁, x₂) = [1/((2π)^{(r₁+r₂)/2} |Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂|^{1/2})]
            × exp{ −(1/2) ((x₁ − µ₁)ᵀ, (x₂ − µ₂)ᵀ) [ Σ₁₁ Σ₁₂ ; Σ₂₁ Σ₂₂ ]^{−1} ( x₁ − µ₁ ; x₂ − µ₂ ) }
  = [1/((2π)^{r₁/2} |Σ₁₁ − Σ₁₂Σ₂₂^{−1}Σ₂₁|^{1/2})]
    × exp{ −(1/2) (x₁ − (µ₁ + Σ₁₂Σ₂₂^{−1}(x₂ − µ₂)))ᵀ (Σ₁₁ − Σ₁₂Σ₂₂^{−1}Σ₂₁)^{−1} (x₁ − (µ₁ + Σ₁₂Σ₂₂^{−1}(x₂ − µ₂))) }
    × [1/((2π)^{r₂/2} |Σ₂₂|^{1/2})] exp{ −(1/2) (x₂ − µ₂)ᵀ Σ₂₂^{−1} (x₂ − µ₂) },
which is the product of the claimed conditional density of X₁ | X₂ = x₂ and the N_{r₂}(µ₂, Σ₂₂) marginal density of X₂.
Theorem. 1.10.5
Suppose that X ∼ Nr (µ, Σ) and Y = AX + b, where Ap×r with linearly independent
rows and b ∈ Rp are fixed. Then Y ∼ Np (Aµ + b, AΣAT ).
[Note: p ≤ r.]
Proof (sketch). Extend A to an invertible r × r matrix by appending r − p further linearly independent rows; the transformed vector is then multivariate normal by Theorem 1.10.1. Hence, from Theorem 1.10.4, the marginal distribution of Y (the first p components) is
N_p(Aµ + b, AΣAᵀ).
The multivariate normal moment generating function for a random vector X ∼ N (µ, Σ)
is given by
M_X(t) = exp(tᵀµ + (1/2)tᵀΣt).
Prove this result as an exercise!
The characteristic function of X is
E[exp(itᵀX)] = exp(itᵀµ − (1/2)tᵀΣt).
The marginal distribution of X1 (or X2 ) is easy to derive using the multivariate normal
MGF.
Let
t = (t₁ᵀ, t₂ᵀ)ᵀ,  µ = (µ₁ᵀ, µ₂ᵀ)ᵀ.
Then the marginal distribution of X1 is obtained by setting t2 = 0 in the expression
for the MGF of X.
Proof:
M_X(t) = exp(tᵀµ + (1/2)tᵀΣt)
       = exp(t₁ᵀµ₁ + t₂ᵀµ₂ + (1/2)t₁ᵀΣ₁₁t₁ + t₁ᵀΣ₁₂t₂ + (1/2)t₂ᵀΣ₂₂t₂).
Now,
M_{X₁}(t₁) = M_X((t₁ᵀ, 0ᵀ)ᵀ) = exp(t₁ᵀµ₁ + (1/2)t₁ᵀΣ₁₁t₁),
which is the N_{r₁}(µ₁, Σ₁₁) MGF.
Theorem. 1.10.6
Suppose X1 , X2 , . . . , Xr have a multivariate normal distribution. Then X1 , X2 , . . . , Xr
are independent if and only if Cov(X_i, X_j) = 0 for all i ≠ j.
Proof.
(⇒) If X₁, . . . , X_r are independent then, as shown earlier, Cov(X_i, X_j) = 0 for all i ≠ j.
(⇐) Conversely, suppose Cov(X_i, X_j) = 0 for all i ≠ j.
⇒ Var(X) = Σ = diag(σ₁₁, σ₂₂, . . . , σ_rr)
⇒ Σ^{−1} = diag(σ₁₁^{−1}, σ₂₂^{−1}, . . . , σ_rr^{−1})  and  |Σ| = σ₁₁σ₂₂ · · · σ_rr
⇒ f_X(x) = [1/((2π)^{r/2}|Σ|^{1/2})] e^{−(1/2)(x−µ)ᵀΣ^{−1}(x−µ)}
         = [1/((2π)^{r/2}(σ₁₁σ₂₂ · · · σ_rr)^{1/2})] e^{−(1/2) Σ_{i=1}^{r} (x_i−µ_i)²/σ_ii}
         = {[1/(√(2π)√σ₁₁)] e^{−(1/2)(x₁−µ₁)²/σ₁₁}} {[1/(√(2π)√σ₂₂)] e^{−(1/2)(x₂−µ₂)²/σ₂₂}} · · · {[1/(√(2π)√σ_rr)] e^{−(1/2)(x_r−µ_r)²/σ_rr}}
⇒ X₁, . . . , X_r are independent.

Note:
(x − µ)ᵀΣ^{−1}(x − µ) = (x − µ)ᵀ diag(σ₁₁^{−1}, . . . , σ_rr^{−1}) (x − µ) = Σ_{i=1}^{r} (x_i − µ_i)²/σ_ii.
Remark
The same methods can be used to establish a similar result for block diagonal matrices.
The simplest case is the following, which is most easily proved using moment generating
functions:
X₁, X₂ are independent if and only if Σ₁₂ = 0, i.e.,
(X₁ᵀ, X₂ᵀ)ᵀ ∼ N_{r₁+r₂}( (µ₁ᵀ, µ₂ᵀ)ᵀ, [ Σ₁₁ 0 ; 0 Σ₂₂ ] ).
Theorem. 1.10.7
Suppose X₁, X₂, . . . , X_n are i.i.d. N(µ, σ²) RVs and let
X̄ = (1/n) Σ_{i=1}^{n} X_i,
S² = [1/(n − 1)] Σ_{i=1}^{n} (X_i − X̄)².
Then X̄ ∼ N(µ, σ²/n) and (n − 1)S²/σ² ∼ χ²_{n−1}, independently.
(Note: S 2 here is a random variable, and is not to be confused with the sample covari-
ance matrix, which is also often denoted by S 2 . Hopefully, the meaning of the notation
will be clear in the context in which it is used.)
Proof. Observe first that if X = (X1 , . . . , Xn )T then the i.i.d. assumption may be
written as:
X ∼ Nn (µ1n , σ 2 In×n )
1. X̄ ∼ N(µ, σ²/n):
Observe that X̄ = BX, where B = (1/n, 1/n, . . . , 1/n) is 1 × n; hence, by Theorem 1.10.5, X̄ ∼ N(Bµ1_n, σ²BBᵀ) = N(µ, σ²/n).
and observe
AX = (X̄, X₁ − X̄, X₂ − X̄, . . . , X_{n−1} − X̄)ᵀ.
Since Cov(X̄, X_i − X̄) = 0 for each i and the components of AX are jointly normal, X̄ is independent of (X₁ − X̄, . . . , X_{n−1} − X̄); it follows that S² is a function of (X₁ − X̄, X₂ − X̄, . . . , X_{n−1} − X̄) and hence is independent of X̄.
3. Prove: (n − 1)S²/σ² ∼ χ²_{n−1}.
Starting from the identity
Σ_{i=1}^{n} (X_i − µ)²/σ² = Σ_{i=1}^{n} (X_i − X̄)²/σ² + {(X̄ − µ)/(σ/√n)}²,
let
R₁ = Σ_{i=1}^{n} (X_i − µ)²/σ²,
R₂ = Σ_{i=1}^{n} (X_i − X̄)²/σ² = (n − 1)S²/σ²,
R₃ = {(X̄ − µ)/(σ/√n)}².
Since R₂ and R₃ are independent (by part 2), the MGFs satisfy M₁(t) = M₂(t)M₃(t), so
M₂(t) = M₁(t)/M₃(t).
Next, observe that R₃ = {(X̄ − µ)/(σ/√n)}² ∼ χ²₁
⇒ M₃(t) = 1/(1 − 2t)^{1/2},
and
R₁ = Σ_{i=1}^{n} (X_i − µ)²/σ² ∼ χ²_n
⇒ M₁(t) = 1/(1 − 2t)^{n/2}.
∴ M₂(t) = 1/(1 − 2t)^{(n−1)/2},
which is the χ²_{n−1} MGF. Hence,
(n − 1)S²/σ² ∼ χ²_{n−1}.
Corollary.
T = (X̄ − µ)/(S/√n) ∼ t_{n−1}.

Proof.
Recall that t_k is the distribution of Z/√(V/k), where Z ∼ N(0, 1) and V ∼ χ²_k independently.
Now observe that:
T = (X̄ − µ)/(S/√n) = [(X̄ − µ)/(σ/√n)] / √{[(n − 1)S²/σ²]/(n − 1)},
and note that
(X̄ − µ)/(σ/√n) ∼ N(0, 1)  and  (n − 1)S²/σ² ∼ χ²_{n−1},
independently.
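The independence of X̄ and S², and the t_{n−1} distribution of T, can be checked by simulation (a sketch, not part of the notes; n, µ and σ are arbitrary):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4)
n, mu, sigma = 10, 5.0, 2.0
x = rng.normal(mu, sigma, size=(100_000, n))
xbar = x.mean(axis=1)
s2 = x.var(axis=1, ddof=1)

print(np.corrcoef(xbar, s2)[0, 1])                   # near 0 (they are in fact independent)
T = (xbar - mu) / np.sqrt(s2 / n)
print(np.mean(np.abs(T) > t(df=n - 1).ppf(0.975)))   # close to 0.05
```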
Remarks
(1),(2),(3) are related by:
4. Convergence in distribution
The sequence of RVs {X_n} with CDFs {F_n} is said to converge in distribution to the RV X with CDF F(x) if:
F_n(x) → F(x) as n → ∞, at every point x at which F is continuous.
Figure 17: Normal approximation to binomial (an application of the Central Limit
Theorem) continuous case
Remarks
1. If we take F(x) = 0 for x < α, and F(x) = 1 for x ≥ α,
Now observe that E[X̄_n] = µ and Var(X̄_n) = σ²/n. So by Chebyshev's inequality,
P(|X̄_n − µ| > ε) ≤ σ²/(nε²) → 0 as n → ∞, for any fixed ε > 0.
Remarks
1. The proof given for Theorem 1.11.1 is really a corollary to the fact that X̄n also
converges to µ in quadratic mean.
2. There is also a version of this theorem involving almost sure convergence (strong
law of large numbers). We will not discuss this.
3. The law of large numbers is one of the fundamental principles of statistical infer-
ence. That is, it is the formal justification for the claim that the “sample mean
approaches the population mean for large n”.
Lemma. 1.11.1
Suppose a_n is a sequence of real numbers such that lim_{n→∞} n a_n = a with |a| < ∞. Then
lim_{n→∞} (1 + a_n)^n = e^a.
Proof.
Omitted (but not difficult).
Remarks
This is a simple generalisation of the standard limit
lim_{n→∞} (1 + x/n)^n = e^x.
Theorem. 1.11.2 (Central Limit Theorem)
Suppose X₁, X₂, . . . are i.i.d. RVs with mean µ and variance σ² (0 < σ² < ∞), and let Z_n = (X̄_n − µ)/(σ/√n). Then
L[Z_n] → N(0, 1) as n → ∞.
Proof.
We will use the fact that it is sufficient to prove that
M_{Z_n}(t) → e^{t²/2} for each fixed t.
[Note: if Z ∼ N(0, 1) then M_Z(t) = e^{t²/2}.]
Now let U_i = (X_i − µ)/σ and observe that Z_n = (1/√n) Σ_{i=1}^{n} U_i.
Recall also that M_{aX}(t) = E[e^{taX}] = M_X(at).
Expanding M_U about 0,
M_U(t) = M_U(0) + M_U′(0) t + M_U″(0) t²/2 + r(t)
       = 1 + 0 + t²/2 + r(t),   where lim_{s→0} r(s)/s² = 0
(since M_U(0) = 1, M_U′(0) = E(U) = 0 and M_U″(0) = E(U²) = 1).
⇒
M_{Z_n}(t) = {M_U(t/√n)}^n = {1 + t²/(2n) + r(t/√n)}^n = (1 + a_n)^n,
where
a_n = t²/(2n) + r(t/√n).
Next observe that lim_{n→∞} n a_n = t²/2 for fixed t.
To check this, observe that
lim_{n→∞} n t²/(2n) = t²/2,  and
lim_{n→∞} n r(t/√n) = t² lim_{n→∞} r(t/√n)/(t/√n)² = t² lim_{s→0} r(s)/s² = 0 for fixed t
(note s = t/√n → 0 as n → ∞).
Hence, by Lemma 1.11.1,
lim_{n→∞} M_{Z_n}(t) = e^{t²/2} for each fixed t,
as required.
Remarks
2. The Central Limit Theorem holds under conditions more general than those given above; in particular, with suitable assumptions, the independence and identical-distribution requirements can be relaxed.
3. Theorems 1.11.1 and 1.11.2 are concerned with the asymptotic behaviour of X̄_n.
Theorem 1.11.2 states that (X̄_n − µ)/(σ/√n) →_D N(0, 1) as n → ∞.
These results are not contradictory, because Var(X̄_n) → 0, but the Central Limit Theorem is concerned with (X̄_n − E(X̄_n))/√(Var(X̄_n)).
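A simulation sketch of the Central Limit Theorem (not from the notes; the Exp(1) summands and n = 50 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
lam, n = 1.0, 50
x = rng.exponential(1 / lam, size=(100_000, n))
z = (x.mean(axis=1) - 1 / lam) / ((1 / lam) / np.sqrt(n))  # standardised sample mean

print(z.mean(), z.std())     # approximately 0 and 1
print(np.mean(z <= 1.96))    # approximately Phi(1.96) = 0.975
```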
2 Statistical Inference
Probability is concerned partly with the problem of predicting the behavior of the RV
X assuming we know its distribution.
Statistical inference is concerned with the inverse problem:
Given data x1 , x2 , . . . , xn with unknown CDF F (x), what can we conclude about
F (x)?
In this course, we are concerned with parametric inference. That is, we assume F
belongs to a given family of distributions, indexed by the parameter θ:
ℱ = {F(x; θ) : θ ∈ Θ},
where Θ is the parameter space.
Examples
In this framework, the problem is then to use the data x1 , . . . , xn to draw conclusions
about θ.
Definition. 2.1.1
A collection of i.i.d. RVs, X1 , . . . , Xn , with common CDF F (x; θ), is said to be a
random sample (from F (x; θ)).
Definition. 2.1.2
Any function T = T (x1 , x2 , . . . , xn ) that can be calculated from the data (without
knowledge of θ) is called a statistic.
Example
The sample mean x̄ = (1/n) Σ_{i=1}^{n} x_i is a statistic.
Definition. 2.1.3
A statistic T with property T (x) ∈ Θ ∀x is called an estimator for θ.
Example
For x₁, x₂, . . . , x_n i.i.d. N(µ, σ²), we have θ = (µ, σ²). The quantity (x̄, s²) is an estimator for θ, where x̄ = (1/n) Σ_{i=1}^{n} x_i and s² = [1/(n − 1)] Σ_{i=1}^{n} (x_i − x̄)².
There are two important concepts here: the first is that estimators are random
variables; the second is that you need to be able to distinguish between random vari-
ables and their realisations. In particular, an estimate is a realisation of a random
variable.
For example, strictly speaking, x1 , x2 , . . . , xn are realisations of random variables
n
1X
X1 , X2 , . . . , Xn , and x̄ = xi is a realisation of X̄; X̄ is an estimator, and x̄ is an
n i=1
estimate.
We will find from now on, however, that it is often more convenient, if less rigorous, to use the same symbol for estimator and estimate. This arises especially in the use of θ̂ as both estimator and estimate, as we shall see.
In broad terms, we would like an estimator to be “as close as possible to” θ with high
probability.
Definition. 2.1.4
The mean squared error of the estimator T of θ is defined by
MSE_T(θ) = E_θ[(T − θ)²].
Example:
Suppose X1 , . . . , Xn are i.i.d. Bernoulli θ RV’s, and T = X̄ =‘proportion of successes’.
⇒ E(T) = θ,  Var(T) = θ(1 − θ)/n
⇒ MSE_T(θ) = Var(T) = θ(1 − θ)/n.
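A short simulation (not part of the notes; θ and n are arbitrary) reproduces this MSE:

```python
import numpy as np

rng = np.random.default_rng(6)
theta, n = 0.3, 25
t_hat = rng.binomial(n, theta, size=200_000) / n   # T = Xbar, proportion of successes
print(np.mean((t_hat - theta) ** 2))               # empirical MSE
print(theta * (1 - theta) / n)                     # theoretical theta(1 - theta)/n
```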
Remark:
This example shows that M SET (θ) must be thought of as a function of θ rather than
just a number.
For example: see Figure 18
Intuitively, a good estimator is one for which the MSE is as small as possible. However, quantifying this idea is complicated, because MSE is a function of θ, not just a number. See Figure 19.
For this reason, it turns out that it is not possible to construct a minimum MSE estimator in general.
To see why, suppose T* is a minimum MSE estimator for θ. Now consider the estimator T_a = a, where a ∈ ℝ is arbitrary. Then for a = θ, T_θ = θ with MSE_{T_θ}(θ) = 0.
Observe MSE_{T_a}(a) = 0; hence if T* exists, then we must have MSE_{T*}(a) = 0. As a is arbitrary, we must have MSE_{T*}(θ) = 0 for all θ ∈ Θ
⇒ T* = θ with probability 1.
Therefore we conclude that (excluding trivial situations) no minimum MSE estimator can exist.
Definition. 2.1.5 The bias of the estimator T is defined by:
bT (θ) = E(T ) − θ.
Remarks:
E(s²) = σ².
If E(s) = σ, then Var(s) = E(s²) − {E(s)}² = σ² − σ² = 0, which is impossible (excluding trivial cases);
⇒ E(s) < σ.
Theorem. 2.1.1
MSE_T(θ) = Var(T) + {b_T(θ)}².
Remark:
Restricting attention to unbiased estimators excludes estimators of the form Ta =
a. We will see that this permits the construction of Minimum Variance Unbiased
Estimators (MVUE’s) in some cases.
Definition. 2.2.1
Given data x with joint PDF/probability function f(x; θ), the likelihood function is L(θ; x) = f(x; θ), regarded as a function of θ, and the log-likelihood is ℓ(θ; x) = log L(θ; x).
Remark
If x1 , x2 , . . . , xn are independent, the log likelihood function can be written as:
ℓ(θ; x) = Σ_{i=1}^{n} log f_i(x_i; θ).
Definition. 2.2.2
Consider a statistical problem with log-likelihood `(θ; x). The score is defined by:
U(θ; x) = ∂ℓ/∂θ
and the Fisher information is
I(θ) = E[−∂²ℓ/∂θ²].
Remark
For a single observation with PDF f(x; θ), the information is
i(θ) = E[−∂²/∂θ² log f(X; θ)].
In the case of x₁, x₂, . . . , x_n i.i.d., we have I(θ) = n i(θ).
Theorem. 2.2.1
Under suitable regularity conditions,
E{U(θ; X)} = 0  and  Var{U(θ; X)} = E{U²(θ; X)} = I(θ).
Proof.
Observe E{U(θ; X)} = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} U(θ; x) f(x; θ) dx₁ · · · dx_n
  = ∫ · · · ∫ [∂/∂θ log f(x; θ)] f(x; θ) dx₁ · · · dx_n
  = ∫ · · · ∫ {[∂f(x; θ)/∂θ] / f(x; θ)} f(x; θ) dx₁ · · · dx_n
  = ∫ · · · ∫ ∂f(x; θ)/∂θ dx₁ · · · dx_n
  = ∂/∂θ ∫ · · · ∫ f(x; θ) dx₁ · · · dx_n     (by the regularity conditions)
  = ∂/∂θ (1)
  = 0, as required.
For the second part, differentiate again:
∂²ℓ(θ; x)/∂θ² = ∂/∂θ {[∂f(x; θ)/∂θ] / f(x; θ)}
  = {[∂²f(x; θ)/∂θ²] f(x; θ) − [∂f(x; θ)/∂θ]²} / [f(x; θ)]²
  = [∂²f(x; θ)/∂θ²] / f(x; θ) − U²(θ; x)
⇒ U²(θ; x) = [∂²f(x; θ)/∂θ²] / f(x; θ) − ∂²ℓ/∂θ²
⇒ E[U²(θ; X)] = E{[∂²f(X; θ)/∂θ²] / f(X; θ)} + I(θ)     (by definition of I(θ)).
Finally,
E{[∂²f(X; θ)/∂θ²] / f(X; θ)} = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} {[∂²f(x; θ)/∂θ²] / f(x; θ)} f(x; θ) dx₁ · · · dx_n
  = ∫ · · · ∫ ∂²f(x; θ)/∂θ² dx₁ · · · dx_n
  = ∂²/∂θ² ∫ · · · ∫ f(x; θ) dx₁ · · · dx_n
  = ∂²/∂θ² (1)
  = 0,
so E[U²(θ; X)] = I(θ), and since E[U(θ; X)] = 0, Var{U(θ; X)} = I(θ).
Proof.
Cov(T, U) = E[T(X) U(θ; X)]     (since E(U) = 0)
  = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} T(x) {[∂f(x; θ)/∂θ] / f(x; θ)} f(x; θ) dx₁ · · · dx_n
  = ∂/∂θ ∫ · · · ∫ T(x) f(x; θ) dx₁ · · · dx_n
  = ∂/∂θ E{T(X)}
  = ∂θ/∂θ     (T unbiased for θ)
  = 1.
To summarize, Cov(T, U) = 1.
Recall that Cov²(T, U) ≤ Var(T) Var(U)   [i.e. |ρ| ≤ 1; divide both sides by Var(U)]
⇒ Var(T) ≥ Cov²(T, U)/Var(U) = 1/I(θ), as required.
Example
Suppose X₁, X₂, . . . , X_n are i.i.d. Po(λ) RVs, and let X̄ = (1/n) Σ_{i=1}^{n} X_i. We will prove that X̄ is a MVUE for λ.
Proof.
(1) Recall that if X ∼ Po(λ), then P(x) = e^{−λ}λ^x/x!, E(X) = λ and Var(X) = λ.
Hence, E(X̄) = λ and Var(X̄) = λ/n, so X̄ is unbiased for λ.
(2) To show that X̄ is a MVUE, we will show that Var(X̄) = 1/I(λ).

Step 1: The log-likelihood is
ℓ(λ; x) = log Π_{i=1}^{n} e^{−λ}λ^{x_i}/x_i!
        = log { e^{−nλ} λ^{Σ x_i} Π_{i=1}^{n} (1/x_i!) }
        = −nλ + Σ_{i=1}^{n} x_i log λ − log Π_{i=1}^{n} x_i!
        = −nλ + nx̄ log λ − log Π_{i=1}^{n} x_i!.

Step 2: Find ∂²ℓ/∂λ²:
∂ℓ/∂λ = −n + nx̄/λ  ⇒  ∂²ℓ/∂λ² = −nx̄/λ².

Step 3:
I(λ) = −E[∂²ℓ/∂λ²] = E[nX̄/λ²] = (n/λ²) E(X̄) = nλ/λ² = n/λ.

(3) Finally, observe that Var(X̄) = λ/n = 1/I(λ)
⇒ X̄ is a MVUE.
Theorem. 2.2.3
The unbiased estimator T(x) can achieve the Cramer-Rao Lower Bound only if the joint PDF/probability function has the form:
f(x; θ) = exp{A(θ) T(x) + B(θ) + h(x)}.
Proof. Recall from the proof of Theorem 2.2.2 that the bound arises from the inequality Cov²(T, U) ≤ Var(T) Var(U), and that equality holds only if U(θ; x) is a linear function of T(x), say U(θ; x) = A′(θ) T(x) + B′(θ). Integrating with respect to θ then gives ℓ(θ; x) = A(θ) T(x) + B(θ) + h(x), i.e. the stated form.
Finally, to ensure that E(T) = θ, recall that E{U(θ; X)} = 0 and observe that in this case
0 = E{U(θ; X)} = A′(θ) E[T(X)] + B′(θ)
⇒ E[T(X)] = −B′(θ)/A′(θ).
Hence, in order that T(X) be unbiased for θ, we must have −B′(θ)/A′(θ) = θ.
Definition. 2.2.3
A probability density function/probability function is said to be a single-parameter exponential family if it has the form:
f(x; θ) = exp{A(θ) t(x) + B(θ) + h(x)}.
In this case, a minimum variance unbiased estimator that achieves the CRLB can be seen to be the function
T = (1/n) Σ_{i=1}^{n} t(x_i),
which is the MVUE for E(T) = −B′(θ)/A′(θ).
Example (1): Po(λ). Writing p(x; λ) = e^{−λ}λ^x/x! = exp{x log λ − λ − log x!} gives
A(λ) = log λ,  B(λ) = −λ,  t(x) = x,  h(x) = −log x!,
and −B′(λ)/A′(λ) = 1/(1/λ) = λ.
Example (2): Exp(λ). Writing f(x; λ) = λe^{−λx} = exp{−λx + log λ} gives
A(λ) = −λ,  B(λ) = log λ,  t(x) = x,  h(x) = 0.
We can also check that E(X) = −B′(λ)/A′(λ). In particular, we have seen previously that E(X) = 1/λ for X ∼ Exp(λ).
Now observe that A′(λ) = −1 and B′(λ) = 1/λ
⇒ −B′(λ)/A′(λ) = 1/λ, as required.
It also follows that if x₁, x₂, . . . , x_n are i.i.d. Exp(λ) observations, then X̄ = (1/n) Σ_{i=1}^{n} t(x_i) is the MVUE for 1/λ = E(X).
Definition. 2.2.4
Consider data with PDF/prob. function, f (x; θ). A statistic, S(x), is called a sufficient
statistic for θ if f (x|s; θ) does not depend on θ for all s.
Remarks
(1) We will see that sufficient statistics capture all of the information in the
data x that is relevant to θ.
Example
Suppose x₁, x₂, . . . , x_n are i.i.d. Bernoulli-θ and let s = Σ_{i=1}^{n} x_i. Then S is sufficient for θ.
Proof.
P(x) = Π_{i=1}^{n} θ^{x_i}(1 − θ)^{1−x_i}
     = θ^{Σ x_i} (1 − θ)^{Σ(1−x_i)}
     = θ^s (1 − θ)^{n−s}.
Proof.
Omitted.
Examples
(I) ⇒ E(T*) = θ, so T* is unbiased.
(II) Var(T) = E{Var(T|S)} + Var{E(T|S)} ≥ Var(T*), with equality only if E{Var(T|S)} = 0.
(III) T* is an estimator:
Since S is sufficient for θ,
T* = E(T|S) = ∫_{−∞}^{∞} T(x) f(x|s) dx
does not depend on θ, and hence is a statistic.
Remarks
Example
Suppose x₁, . . . , x_n are i.i.d. N(µ, σ²), with σ² known. We want to estimate µ. Take T = x₁ as an unbiased estimator for µ.
S = Σ_{i=1}^{n} x_i is a sufficient statistic for µ.
⇒ T | S = s ∼ N( µ + (σ²/(nσ²))(s − nµ), σ² − σ⁴/(nσ²) )
         = N( (1/n)s, (1 − 1/n)σ² )
∴ E(T|S) = (1/n) Σ x_i = x̄ = T*
is a better (or equally good) estimator for µ.
Finally, observe Var(X̄) = σ²/n ≤ σ² = Var(X₁), with strict inequality for n > 1, σ² > 0.
Remarks
The method of moments estimator θ̃ is defined as the solution of X̄ = µ(θ̃), where µ(θ) = E_θ(X).
Example
If X₁, . . . , X_n are i.i.d. Exp(λ), then E(X_i) = 1/λ; the method of moments estimator is defined as the solution to the equation
X̄ = 1/λ̃  ⇒  λ̃ = 1/X̄.
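A minimal Python sketch of this method of moments estimate (not from the notes; the true λ and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
lam = 2.0                                  # true rate (illustrative)
x = rng.exponential(1 / lam, size=10_000)
lam_mom = 1 / x.mean()                     # solve xbar = 1/lambda for lambda
print(lam_mom)                             # close to 2.0
```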
Remark
The method of moments is appealing:
(1) it is simple;
(2) it extends naturally to several parameters: for θ of dimension p, solve the system
m₁ = µ₁(θ̃)
m₂ = µ₂(θ̃)
⋮
m_p = µ_p(θ̃),
where m_k = (1/n) Σ_{i=1}^{n} x_i^k and µ_k(θ) = E_θ(X^k).
Example
Suppose X₁, X₂, . . . , X_n are i.i.d. N(µ, σ²) and let θ = (µ, σ²).
p = 2 ⇒ two equations in two unknowns:
µ₁(θ) = E(X) = µ,
µ₂(θ) = E(X²) = σ² + µ²,
m₁ = (1/n) Σ_{i=1}^{n} x_i,
m₂ = (1/n) Σ_{i=1}^{n} x_i².
Hence µ̃ = x̄, and
σ̃² + µ̃² = (1/n) Σ x_i²  ⇒  σ̃² = (1/n) Σ x_i² − x̄² = (1/n) Σ (x_i − x̄)² = [(n − 1)/n] s².
Remark
The Method of Moments estimators can be seen to have good statistical properties. Under fairly mild regularity conditions, the MoM estimator is
(1) consistent, and
(2) asymptotically normal, i.e., (θ̃ − θ)/√(Var(θ̃)) →_D N(0, 1) as n → ∞.
Example
X₁, . . . , X_n i.i.d. Exp(λ). We saw previously that λ̃_X = 1/X̄.
Suppose Y_i = X_i² (an invertible transformation for X_i > 0). To obtain λ̃_Y, observe E(Y) = E(X²) = 2/λ²
⇒ λ̃_Y = √(2n/Σ X_i²) ≠ 1/X̄.
Remark
In practice, maximum likelihood estimates are obtained by solving the score equation
∂ℓ/∂θ = U(θ; x) = 0.
Example
If X1 , X2 , . . . , Xn are i.i.d. geometric-θ RV’s, find θ̂.
Solution:
ℓ(θ; x) = log { Π_{i=1}^{n} θ(1 − θ)^{x_i−1} }
        = n log θ + Σ_{i=1}^{n} (x_i − 1) log(1 − θ),
so
U(θ; x) = ∂ℓ/∂θ = n/θ − (nx̄ − n)/(1 − θ) = 0
⇒ (1 − θ̂) = θ̂(x̄ − 1)  ⇒  θ̂ = 1/x̄.
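A quick check of this MLE by simulation (not part of the notes; θ and n are arbitrary; numpy's geometric distribution has support 1, 2, . . . , matching the parameterisation above):

```python
import numpy as np

rng = np.random.default_rng(8)
theta = 0.4
x = rng.geometric(theta, size=10_000)   # support {1, 2, ...} with pmf theta(1-theta)^(k-1)
theta_hat = 1 / x.mean()                # MLE from the score equation
print(theta_hat)                        # close to 0.4
```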
(2) If φ = φ(θ) is a 1-1 transformation of θ, then the MLE’s obey the transfor-
mation rule, φ̂ = φ(θ̂).
(3) If T (x) is a sufficient statistic for θ, then θ̂ depends on the data only as a
function of t(x).
=⇒ θ̂ is a function of T (x).
Example
Suppose X1 , X2 , . . . , Xn are i.i.d. Exp(λ); then
n
Y Pn
fX (x; λ) = λe−λxi = λn e−λ i=1 xi
= λn e−λnx̄ .
i=1
If X ∼ Exp(λ) and Y = log X, then we can find f_Y(y) by taking h(x) = log x and using the transformation rule:
f_Y(y) = λ e^{−λe^y} e^y.
⇒ ℓ_Y(λ; y) = log { Π_{i=1}^{n} λ e^{−λe^{y_i}} e^{y_i} }
            = log { λ^n e^{−λ Σ e^{y_i}} e^{Σ y_i} }
            = n log λ − λ Σ_{i=1}^{n} e^{y_i} + Σ_{i=1}^{n} y_i
⇒ ∂ℓ_Y(λ; y)/∂λ = n/λ − Σ_{i=1}^{n} e^{y_i} = 0
⇒ λ̂ = n / Σ_{i=1}^{n} e^{y_i} = n / Σ_{i=1}^{n} e^{log x_i} = n / Σ_{i=1}^{n} x_i = n/(nx̄) = 1/x̄,
the same as the MLE based on the x_i directly.
Finally, suppose we take θ = log λ ⇒ λ = e^θ
⇒ f(x; θ) = e^θ e^{−e^θ x}   (i.e. λe^{−λx} with λ = e^θ)
⇒ ℓ_θ(θ; x) = log { Π_{i=1}^{n} e^θ e^{−e^θ x_i} }
            = log { e^{nθ} e^{−e^θ nx̄} }
            = nθ − e^θ nx̄.
∂ℓ/∂θ = n − nx̄e^θ
∴ ∂ℓ/∂θ = 0  ⇒  1 = x̄e^θ  ⇒  e^θ = 1/x̄
⇒ θ̂ = −log x̄.
But λ̂ = 1/x̄ ⇒ log λ̂ = log(1/x̄) = −log x̄ = θ̂, as required.
Remark: (Not examinable)
Maximum likelihood estimation and method of moments can both be generated by the
use of estimating functions.
An estimating function is a function, H(x; θ), with the property E{H(x; θ)} = 0.
H can be used to define an estimator θ̃ which is a solution to the equation
H(x; θ) = 0.
We will show (in outline) that if θ̂_n is the MLE (maximum likelihood estimator) based on X₁, X₂, . . . , X_n, then:
(1) θ̂_n → θ₀ in probability as n → ∞ (Consistency);
(2) √(n i(θ₀)) (θ̂_n − θ₀) →_D N(0, 1) as n → ∞ (Asymptotic Normality).
Remark
The practical use of asymptotic normality is that for large n,
θ̂ ≈ N(θ₀, 1/(n i(θ₀))).
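For example, for i.i.d. Po(λ₀) data the MLE is λ̂ = x̄ and i(λ₀) = 1/λ₀; the sketch below (not from the notes; λ₀ and n are arbitrary) checks the asymptotic normality numerically:

```python
import numpy as np

rng = np.random.default_rng(9)
lam0, n = 3.0, 100
lam_hat = rng.poisson(lam0, size=(100_000, n)).mean(axis=1)  # MLE of lambda is xbar
z = np.sqrt(n * (1 / lam0)) * (lam_hat - lam0)               # i(lambda) = 1/lambda

print(z.mean(), z.std())   # approximately 0 and 1
```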
Proof.
ℓ*(θ) − ℓ*(θ₀) = ∫_{−∞}^{∞} (log f(x; θ)) f(x; θ₀) dx − ∫_{−∞}^{∞} (log f(x; θ₀)) f(x; θ₀) dx
  = ∫_{−∞}^{∞} log{f(x; θ)/f(x; θ₀)} f(x; θ₀) dx
  ≤ ∫_{−∞}^{∞} {f(x; θ)/f(x; θ₀) − 1} f(x; θ₀) dx     (since log u ≤ u − 1)
  = ∫_{−∞}^{∞} (f(x; θ) − f(x; θ₀)) dx
  = ∫_{−∞}^{∞} f(x; θ) dx − ∫_{−∞}^{∞} f(x; θ₀) dx = 0,
with equality only when f(x; θ) = f(x; θ₀) for (almost) all x, i.e. ⇒ θ = θ₀.
Lemma. 2.3.2
Let ℓ̄_n(θ; x) = (1/n) ℓ(θ; x₁, . . . , x_n), i.e., (1/n) × the log-likelihood based on x₁, x₂, . . . , x_n.
Then, for each θ, ℓ̄_n(θ; x) → ℓ*(θ) in probability as n → ∞.
Proof. Write ℓ̄_n(θ; x) = (1/n) Σ_{i=1}^{n} L_i(θ), where L_i(θ) = log f(x_i; θ).
Since the L_i(θ) are i.i.d. with E(L_i(θ)) = ℓ*(θ), it follows by the weak law of large numbers that ℓ̄_n(θ; x) → ℓ*(θ) in probability as n → ∞.
Since θ̂n maximizes `¯n (θ, x), it follows that θ̂n → θ0 in probability.
Theorem. 2.3.1
Under the above assumptions, let U_n(θ; x) denote the score based on X₁, . . . , X_n. Then,
U_n(θ₀; x)/√(n i(θ₀)) →_D N(0, 1).
Proof.
U_n(θ₀; x) = ∂/∂θ log Π_{i=1}^{n} f(x_i; θ) |_{θ=θ₀}
           = Σ_{i=1}^{n} ∂/∂θ log f(x_i; θ) |_{θ=θ₀}
           = Σ_{i=1}^{n} U_i,   where U_i = ∂/∂θ log f(x_i; θ) |_{θ=θ₀}.
Since U₁, U₂, . . . are i.i.d. with E(U_i) = 0 and Var(U_i) = i(θ₀), by the Central Limit Theorem,
{Σ_{i=1}^{n} U_i − nE(U)} / √(n Var(U)) = U_n(θ₀; x)/√(n i(θ₀)) →_D N(0, 1).
Theorem. 2.3.2
Under the above assumptions,
√(n i(θ₀)) (θ̂_n − θ₀) →_D N(0, 1).
U_n(θ₀; x)/√(n i(θ₀)) →_D N(0, 1).
Remark:
The preceding theory can be generalized to include vector-valued parameters. We will not discuss the details.
Motivating example:
Suppose X1 , X2 , . . . , Xn are i.i.d. N(µ, σ 2 ), and consider H0 : µ = µ0 vs. Ha : µ 6= µ0 .
If σ² is known, then the test of H₀ with significance level α is defined by the test statistic
Z = (X̄ − µ₀)/(σ/√n),
θ ∈ Θ₀ ∪ Θ_A, where Θ₀ ∩ Θ_A = ∅.

Test result vs actual status:
                 H₀ true              H_A true
Accept H₀        correct              type II error (β)
Reject H₀        type I error (α)     correct
We would like both the type I and type II error rates to be as small as possible.
However, these results conflict with each other. To reduce the type I error rate we
need to “make it harder to reject H0 ”. To reduce the type II error rate we need to
“make it easier to reject H0 ”.
The standard (Neyman-Pearson) approach to hypothesis testing is to control the type
I error rate at a “small” value α and then use a test that makes the type II error as
small as possible.
The equivalence between confidence intervals and hypothesis tests can be formulated as follows: recall that a 100(1 − α)% CI for θ is a random interval (L, U) with the property
P((L, U) ∋ θ) = 1 − α.
Consider a statistical problem with data x1 , . . . , xn , log-likelihood `(θ; x), score U(θ; x)
and information i(θ).
Consider also a hypothesis H₀: θ = θ₀. The following three tests are often considered:
(1) the Wald test, based on W = √(n i(θ̂)) (θ̂ − θ₀);
(2) the score test, based on V = U(θ₀; x)/√(n i(θ₀));
(3) the likelihood ratio (LR) test, based on G² = 2{ℓ(θ̂; x) − ℓ(θ₀; x)}.
Under H₀, W and V are approximately N(0, 1) and G² is approximately χ²₁ for large n.
Example:
Suppose X₁, X₂, . . . , X_n are i.i.d. Po(λ), and consider H₀: λ = λ₀.
ℓ(λ; x) = log Π_{i=1}^{n} e^{−λ}λ^{x_i}/x_i!
        = n(x̄ log λ − λ) − log Π_{i=1}^{n} x_i!
U(λ; x) = ∂ℓ/∂λ = n(x̄/λ − 1) = n(x̄ − λ)/λ  ⇒  λ̂ = x̄
I(λ) = n i(λ) = E[−∂²ℓ/∂λ²] = E[nX̄/λ²] = n/λ

⇒ W = √(n i(λ̂)) (λ̂ − λ₀) = (λ̂ − λ₀)/√(λ̂/n)

V = U(λ₀; x)/√(n i(λ₀)) = [n(x̄ − λ₀)/λ₀] / √(n/λ₀) = (x̄ − λ₀)/√(λ₀/n)

G² = 2(ℓ(λ̂) − ℓ(λ₀))
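The three statistics are easy to compute; the following sketch (not from the notes; the data-generating values are arbitrary, and the constant term of the log-likelihood is dropped since it cancels) implements them for a Poisson sample:

```python
import numpy as np

def poisson_tests(x, lam0):
    """Wald, score and likelihood-ratio statistics for H0: lambda = lam0."""
    n, xbar = len(x), np.mean(x)
    w = (xbar - lam0) / np.sqrt(xbar / n)                  # Wald
    v = (xbar - lam0) / np.sqrt(lam0 / n)                  # score
    loglik = lambda lam: n * (xbar * np.log(lam) - lam)    # up to an additive constant
    g2 = 2 * (loglik(xbar) - loglik(lam0))                 # likelihood ratio
    return w, v, g2

rng = np.random.default_rng(10)
print(poisson_tests(rng.poisson(3.0, size=200), lam0=3.0))
```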
Remarks
(1) It can be proved that the tests based on W, V, G2 are asymptotically equiv-
alent for H0 true.
(3) To understand the motivation for the three tests, it is useful to consider
their relation to the log-likelihood function. See Figure 20.
We have introduced the Wald test, score test and the likelihood test for H0 : θ = θ0 vs.
HA : θ = θa . These are large-sample tests in that asymptotic distributions for the test
statistic under H0 are available. It can also be proved that the LR statistic and the
score test statistic are invariant under transformation of the parameter, but the Wald test is not.
Each of these three tests can be inverted to give a confidence interval (region) for θ:

Wald test
Solve for θ₀ in W² ≤ z(α/2)².
Recall W = √(n i(θ̂)) (θ̂ − θ₀), so
θ̂ − z(α/2)/√(n i(θ̂)) ≤ θ₀ ≤ θ̂ + z(α/2)/√(n i(θ̂)),
i.e., θ̂ ± z(α/2)/√(n i(θ̂)).

Score test
Need to solve for θ₀ in V² = {U(θ₀; x)/√(n i(θ₀))}² ≤ z(α/2)².

LR test
Solve for θ₀ in 2(ℓ(θ̂; x) − ℓ(θ₀; x)) ≤ χ²_{1,α} = z(α/2)².
Example:
X₁, . . . , X_n i.i.d. Po(λ).
Wald test: recall the Wald statistic is W = (λ̂ − λ₀)/√(λ̂/n), so the interval is
λ̂ ± z(α/2)√(λ̂/n)  ⟺  x̄ ± z(α/2)√(x̄/n).

Score test:
Recall that the test statistic is V = (x̄ − λ₀)/√(λ₀/n). Hence, to find a confidence interval, we need to solve for λ₀ in the equation
V² ≤ z(α/2)²
⇒ (x̄ − λ₀)²/(λ₀/n) ≤ z(α/2)²
⇒ (x̄ − λ₀)² ≤ λ₀ z(α/2)²/n
⇒ λ₀² − {2x̄ + z(α/2)²/n} λ₀ + x̄² ≤ 0,
a quadratic inequality in λ₀ whose two roots give the interval endpoints.
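The two roots of this quadratic give the score-test interval; a small solver sketch (not part of the notes; the x̄ and n values used in the call are arbitrary):

```python
import numpy as np

def poisson_score_ci(xbar, n, z=1.96):
    """Endpoints solve (xbar - lam)^2 = z^2 * lam / n, a quadratic in lam."""
    b = -(2 * xbar + z**2 / n)
    c = xbar**2
    disc = np.sqrt(b**2 - 4 * c)
    return (-b - disc) / 2, (-b + disc) / 2

print(poisson_score_ci(xbar=3.1, n=50))
# compare with the Wald interval 3.1 +/- 1.96*sqrt(3.1/50)
```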
H0 : θ = θ0 and Ha : θ = θa
Proof. Let C be the critical region for the LR test and let D be the critical region (RR)
for any other test.
Let C₁ = C ∩ D and let C₂, D₂ be such that
C = C₁ ∪ C₂, C₁ ∩ C₂ = ∅;  D = C₁ ∪ D₂, C₁ ∩ D₂ = ∅.
Since D = C₁ ∪ D₂, the difference in power is
∫_{C₂} f(x; θ_a) dx₁ · · · dx_n − ∫_{D₂} f(x; θ_a) dx₁ · · · dx_n
  ≥ (1/k) ∫_{C₂} f(x; θ₀) dx₁ · · · dx_n − (1/k) ∫_{D₂} f(x; θ₀) dx₁ · · · dx_n
  ≥ 0, as required.
(If the two tests coincide, D₂ is empty, i.e. D₂ = ∅.)
Example:
Suppose X1 , X2 , . . . , Xn are i.i.d. N(µ, σ 2 ), σ 2 given, and consider
H0 : µ = µ0 vs. Ha : µ = µa , µa > µ 0
Then
f(x; µ₀)/f(x; µ_a) = Π_{i=1}^{n} (1/√(2πσ²)) exp{−(x_i − µ₀)²/(2σ²)} / Π_{i=1}^{n} (1/√(2πσ²)) exp{−(x_i − µ_a)²/(2σ²)}
  = exp{−(1/(2σ²))(Σ x_i² − 2nx̄µ₀ + nµ₀²)} / exp{−(1/(2σ²))(Σ x_i² − 2nx̄µ_a + nµ_a²)}
  = exp{ (1/(2σ²)) (2nx̄(µ₀ − µ_a) − nµ₀² + nµ_a²) }.
For a constant k,
f(x; µ₀)/f(x; µ_a) ≤ k
⇔ (µ₀ − µ_a) x̄ ≤ k*
⇔ x̄ ≥ c
(since µ₀ − µ_a < 0, dividing by it reverses the inequality),
where c is chosen to give size α, using
Z = (X̄ − µ₀)/(σ/√n) ∼ N(0, 1) under H₀.

Remarks
(1) This result shows that the one-sided z test is also uniformly most powerful for
H₀: µ = µ₀ vs. H_A: µ > µ₀,
and for
H₀: µ ≤ µ₀ vs. H_A: µ > µ₀, with size
α = max_{µ ∈ Θ₀} P(rejecting) = P(rejecting | µ = µ₀).
(2) For the two-sided alternative, i.e., H₀: µ = µ₀ vs. H_A: µ ≠ µ₀,