Introduction To Maximum Likelihood
by
Yannis Kasparis
1 Introduction
The maximum likelihood (ML) procedure makes increased use of distributional assumptions on the errors of the model.
Consider the Classical Linear Regression Model (CLRM) with normal errors:
$$y = X\beta + \varepsilon \quad (\text{or } y_t = x_t'\beta + \varepsilon_t), \qquad \varepsilon \sim N(0, \sigma^2 I), \qquad X \text{ fixed}.$$
Under the normality assumption the distribution of y is completely specified (up to the unknown parameters β and σ²); the maximum likelihood procedure makes use of this fact.
• Moreover, just like the other estimation procedures, the estimators β̂, σ̂² are obtained by optimising an objective function: OLS minimises the residual sum of squares, while ML maximises the likelihood function
$$L(\beta, \sigma^2; y/X) = L(\theta; y/X), \qquad \theta = (\beta', \sigma^2)'.$$
The distributional assumption on the error term plays a role in the construction of the likelihood function: different distributional assumptions result in different likelihood functions.
2 The Multivariate Normal Density
Let Z be a (T × 1) normal random vector with mean µ and variance matrix Σ, i.e. Z ∼ N(µ, Σ). The density of Z is
$$f_Z(u; \mu, \Sigma) = (2\pi)^{-\frac{T}{2}} \,|\Sigma|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2}(u - \mu)'\Sigma^{-1}(u - \mu) \right\}.$$
• When Σ = σ²I (diagonal), |Σ| = (σ²)^T, so
$$f_Z(u; \mu, \Sigma) = (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\frac{1}{2\sigma^2}(u - \mu)'(u - \mu) \right\}.$$
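As a quick numerical illustration, the sketch below (in Python, assuming numpy and scipy are available) evaluates the density formula above directly and cross-checks it against scipy's multivariate normal; the values of T, a and σ² are illustrative choices, not taken from the notes.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Evaluate f_Z(u; mu, Sigma) exactly as written above (diagonal case Sigma = sigma^2 I)
T = 4
rng = np.random.default_rng(0)
a, sigma2 = 2.0, 1.5                      # illustrative parameter values
mu = np.full(T, a)                        # mean vector (a, a, ..., a)'
Sigma = sigma2 * np.eye(T)                # Sigma = sigma^2 * I
u = rng.normal(size=T)                    # an arbitrary evaluation point

quad = (u - mu) @ np.linalg.inv(Sigma) @ (u - mu)
f_formula = (2 * np.pi) ** (-T / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)
f_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(u)   # cross-check
print(f_formula, f_scipy)                 # the two numbers should coincide
```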
Example 1: Let z₁, ..., z_T be independent with z_t ∼ N(a, σ²). The density of a single z_t is
$$f_{z_t}(u_t; a, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(u_t - a)^2}{2\sigma^2} \right\}.$$
The joint density of the vector z = (z₁, z₂, ..., z_T)' is then
$$f_z(u; a, \sigma^2) = (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{t=1}^{T}(u_t - a)^2 \right\},$$
where u = (u₁, u₂, ..., u_T)' and µ = (a, a, ..., a)' are (T × 1) vectors.
The likelihood function is the joint density evaluated at the observation points. In Example 1 the observations are generated by z_t = a + ε_t, ε_t ∼ N(0, σ²), so z ∼ N(µ, σ²I) with µ = (a, ..., a)'. Hence the likelihood function is
$$L(a, \sigma^2; z) = (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{t=1}^{T}(z_t - a)^2 \right\}.$$
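As a preview of the estimation step discussed in the next section, here is a minimal sketch (Python with numpy/scipy assumed, data simulated purely for illustration) that maximises the Example 1 likelihood numerically through its logarithm; for this model the maximisers are the sample mean and the (1/T) sample variance, which the code uses as a check.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(10)
T = 500
z = rng.normal(loc=2.0, scale=1.5, size=T)     # simulated sample, true a = 2, sigma^2 = 2.25

def negloglik(theta):
    # negative log of L(a, sigma^2; z); sigma^2 is parameterised via its log to stay positive
    a, s2 = theta[0], np.exp(theta[1])
    return T / 2 * np.log(2 * np.pi) + T / 2 * np.log(s2) + np.sum((z - a) ** 2) / (2 * s2)

res = minimize(negloglik, x0=np.array([0.0, 0.0]), method="BFGS")
a_hat, s2_hat = res.x[0], np.exp(res.x[1])
print(a_hat, s2_hat)
print(z.mean(), ((z - z.mean()) ** 2).mean())  # closed-form ML estimates, for comparison
```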
Consider again the classical linear regression model with normal errors: y_t = x_t'β + ε_t. The likelihood function is
$$L(\theta; y/X) = (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) \right\},$$
where θ = (β', σ²)' is the ((k + 1) × 1) parameter vector (β is k × 1).
3 Estimation of the Likelihood Function
It is more convenient to work with the log-likelihood function l(θ; y/X) = ln L(θ; y/X) rather than the likelihood function itself: since ln(·) is monotonically increasing, the value of θ that maximises l(θ; y/X) also maximises L(θ; y/X). Note that for the CLRM with normal errors the log-likelihood is
$$l(\theta; y/X) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \sum_{t=1}^{T} \frac{\left(y_t - x_t'\beta\right)^2}{2\sigma^2}$$
$$= -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}.$$
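The log-likelihood above translates directly into code. A minimal sketch (Python, numpy assumed; the regressors and parameter values are simulated purely for illustration) showing that the sum-over-t form and the matrix form give the same number:

```python
import numpy as np

def loglik(beta, sigma2, y, X):
    """Log-likelihood l(theta; y/X) of the CLRM with normal errors, matrix form."""
    T = y.shape[0]
    e = y - X @ beta
    return -T / 2 * np.log(2 * np.pi) - T / 2 * np.log(sigma2) - (e @ e) / (2 * sigma2)

# Illustrative data (assumed, for demonstration only)
rng = np.random.default_rng(1)
T, k = 100, 3
X = rng.normal(size=(T, k))
beta0 = np.array([1.0, -0.5, 2.0])
y = X @ beta0 + rng.normal(size=T)

# Sum-over-t form of the same log-likelihood, evaluated at (beta0, sigma^2 = 1)
e = y - X @ beta0
sum_form = -T / 2 * np.log(2 * np.pi) - T / 2 * np.log(1.0) - np.sum(e ** 2) / 2
print(loglik(beta0, 1.0, y, X), sum_form)      # identical values
```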
The first derivative of l, s(θ; y/X) = ∂l(θ; y/X)/∂θ, is called the “score”. So the Maximum Likelihood estimator θ̂ solves the first order conditions s(θ; y/X) = 0. For the solution θ* to be a maximum, the Hessian H(θ*; y/X) = ∂²l(θ; y/X)/∂θ∂θ' must be negative definite (θ* is the solution from the FOCs). For the CLRM, the first order conditions are obtained as follows.
$$\frac{\partial l(\theta; y/X)}{\partial \beta} = \frac{\partial}{\partial \beta}\left\{ -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} \right\}$$
$$= -\frac{1}{2\sigma^2}\frac{\partial}{\partial \beta}\left\{ (y - X\beta)'(y - X\beta) \right\}$$
so
$$\frac{\partial l(\theta; y/X)}{\partial \beta} = 0 \;\rightarrow\; \frac{\partial}{\partial \beta}\left\{ (y - X\beta)'(y - X\beta) \right\} = 0.$$
So the ML estimator (the same as the OLS estimator when the errors are normal) is:
$$\hat\beta = \left(X'X\right)^{-1}X'y.$$
Now
$$\frac{\partial l(\theta; y/X)}{\partial (\sigma^2)} = \frac{\partial}{\partial (\sigma^2)}\left\{ -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} \right\}$$
$$= \frac{\partial}{\partial (\sigma^2)}\left\{ -\frac{T}{2}\ln(\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} \right\}$$
$$= -\frac{T}{2}\frac{\partial}{\partial (\sigma^2)}\ln(\sigma^2) - \frac{1}{2}(y - X\beta)'(y - X\beta)\frac{\partial}{\partial (\sigma^2)}\left(\sigma^2\right)^{-1}$$
$$= -\frac{T}{2}\frac{1}{\sigma^2} - \frac{1}{2}(y - X\beta)'(y - X\beta)(-1)\left(\sigma^2\right)^{-2}$$
$$= \frac{1}{2(\sigma^2)^2}(y - X\beta)'(y - X\beta) - \frac{T}{2\sigma^2}$$
so
$$\frac{\partial l(\theta; y/X)}{\partial (\sigma^2)} = 0 \;\rightarrow\; \frac{1}{2(\sigma^2)^2}(y - X\beta)'(y - X\beta) = \frac{T}{2\sigma^2}$$
$$\rightarrow\; \hat\sigma^2 = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{T} = \frac{\hat\varepsilon'\hat\varepsilon}{T}.$$
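A sketch (Python, numpy and scipy assumed, simulated data) that computes β̂ and σ̂² from the closed-form expressions above and confirms them with a numerical maximiser of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
T, k = 200, 2
X = rng.normal(size=(T, k))
y = X @ np.array([0.5, -1.0]) + rng.normal(scale=0.8, size=T)

# Closed-form ML estimators derived above
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y
e_hat = y - X @ beta_hat
sigma2_hat = e_hat @ e_hat / T                 # e'e / T (divides by T, not T - k)

# Numerical check: minimise the negative log-likelihood directly
def negloglik(theta):
    b, s2 = theta[:k], np.exp(theta[k])        # sigma^2 parameterised via its log
    e = y - X @ b
    return T / 2 * np.log(2 * np.pi) + T / 2 * np.log(s2) + (e @ e) / (2 * s2)

res = minimize(negloglik, x0=np.zeros(k + 1), method="BFGS")
print(beta_hat, sigma2_hat)
print(res.x[:k], np.exp(res.x[k]))             # should match the closed-form values
```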
In large samples the following approximations hold for the MLE θ̂:
$$\hat\theta \overset{\text{approx.}}{\sim} N\!\left(\theta,\, I(\theta/X)^{-1}\right), \qquad s(\theta; y/X) \overset{\text{approx.}}{\sim} N\!\left(0,\, I(\theta/X)\right),$$
where I(θ/X) = −E[H(θ; y/X)] is the information matrix.
The Cramer-Rao Inequality states that there is a lower bound for the Variance-
Covariance matrix of any unbiased estimator. The bound is given by the inverse of
the information matrix. Consider some estimator θ̃ that is unbiased for θ (E(θ̃) = θ). Then
$$\mathrm{Var}(\tilde\theta) \geq I(\theta)^{-1},$$
so the variance cannot be smaller than I(θ)−1 . We know that the variance of the
ML estimator is approximately I(θ)−1 . So the MLE attains the lower bound in large
samples.
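To illustrate how I(θ/X)⁻¹ is used in practice, here is a sketch (Python, numpy assumed, simulated data) that computes approximate standard errors for the CLRM estimates. The block-diagonal form of the information matrix used below, with blocks X'X/σ² for β and T/(2σ⁴) for σ², is a standard result for this model and is taken as given rather than derived here.

```python
import numpy as np

rng = np.random.default_rng(3)
T, k = 200, 2
X = rng.normal(size=(T, k))
y = X @ np.array([0.5, -1.0]) + rng.normal(size=T)

# ML estimates
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e_hat = y - X @ beta_hat
sigma2_hat = e_hat @ e_hat / T

# Information matrix blocks evaluated at the estimates
info_beta = X.T @ X / sigma2_hat               # block for beta
info_sigma2 = T / (2 * sigma2_hat ** 2)        # block for sigma^2

# Approximate standard errors: square roots of the diagonal of I(theta)^{-1}
se_beta = np.sqrt(np.diag(np.linalg.inv(info_beta)))
se_sigma2 = np.sqrt(1 / info_sigma2)
print(se_beta, se_sigma2)
```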
4 The Model with Heteroscedastic Errors
Let
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \Sigma), \qquad \Sigma \neq \sigma^2 I.$$
In particular
$$\Sigma = \operatorname{diag}(\sigma_t^2) = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \sigma_T^2 \end{pmatrix}, \qquad \sigma_t^2 = \alpha_0 + \alpha_1 w_t, \ \text{with } w_t \text{ a non-random variable.}$$
Hence
$$|\Sigma| = \prod_{t=1}^{T} (\alpha_0 + \alpha_1 w_t)$$
and
$$\ln|\Sigma| = \sum_{t=1}^{T} \ln(\alpha_0 + \alpha_1 w_t).$$
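Combining the multivariate normal density of Section 2 with the diagonal Σ above gives the log-likelihood l(θ) = −(T/2) ln(2π) − (1/2) Σ_t ln(α₀ + α₁w_t) − (1/2) Σ_t (y_t − x_t'β)²/(α₀ + α₁w_t), which in general has no closed-form maximiser. A sketch of ML estimation by numerical optimisation (Python, numpy/scipy assumed; the data and parameter values are simulated for illustration):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
T, k = 300, 2
X = rng.normal(size=(T, k))
w = rng.uniform(0.5, 2.0, size=T)                      # treated as fixed, as in the notes
beta_true, a0_true, a1_true = np.array([1.0, -0.5]), 0.5, 1.0
sigma2_t = a0_true + a1_true * w
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2_t), size=T)

def negloglik(theta):
    b, a0, a1 = theta[:k], theta[k], theta[k + 1]
    s2 = a0 + a1 * w                                   # sigma_t^2 = alpha_0 + alpha_1 w_t
    if np.any(s2 <= 0):                                # every variance must stay positive
        return np.inf
    e = y - X @ b
    # -l(theta) = (T/2)ln(2*pi) + (1/2) sum ln(s2_t) + (1/2) sum e_t^2 / s2_t
    return T / 2 * np.log(2 * np.pi) + 0.5 * np.sum(np.log(s2)) + 0.5 * np.sum(e ** 2 / s2)

theta0 = np.concatenate([np.zeros(k), [1.0, 0.0]])     # start from homoscedastic values
res = minimize(negloglik, theta0, method="Nelder-Mead",
               options={"maxiter": 20000, "maxfev": 20000, "xatol": 1e-8, "fatol": 1e-8})
print(res.x)                                           # estimates of (beta, alpha_0, alpha_1)
```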
5 Testing Parameter Restrictions
Consider the unrestricted model
$$y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + \beta_3 x_{3t} + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2). \qquad (1)$$
We are often interested in testing parameter restrictions such as β₁ = 0, or β₁ = β₂ = 0, or β₂ + β₃ = 1.
Similarly, in the model with heteroscedastic errors,
$$\sigma_t^2 = \alpha_0 + \alpha_1 w_t, \qquad (2)$$
testing for homoscedasticity amounts to testing the restriction α₁ = 0.
For the unrestricted model:
1 Obtain the likelihood function of the unrestricted model: l_UR(θ).
2 Estimate l_UR(θ): → θ̂_UR.
For the restricted model:
1 Obtain the likelihood function of the restricted model: l_R(θ). The restrictions Rθ = q can be imposed through a Lagrange multiplier term:
$$l_R(\theta) = l_{UR}(\theta) - \lambda'(R\theta - q),$$
where λ is an (m × 1) vector of Lagrange multipliers and m is the number of restrictions (constraints). The matrix R and the vector q specify the restrictions: R : (m × (k + 1)), q : (m × 1).
For example, suppose that the unrestricted model is the one in equation (1), so θ = (β₁, β₂, β₃, σ²)'. Consider the restrictions β₁ = 0 and β₂ + β₃ = 1. Then
$$R = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{pmatrix}, \qquad q = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.$$
$$R\theta = q \;\rightarrow\; \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \beta_3 \\ \sigma^2 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \;\rightarrow\; \begin{pmatrix} \beta_1 \\ \beta_2 + \beta_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$
Substituting β₁ = 0 and β₃ = 1 − β₂ into (1) gives y_t − x_{3t} = β₂(x_{2t} − x_{3t}) + ε_t, so
$$l_R(\theta) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \sum_{t=1}^{T}\frac{\left(y_t - x_{3t} - \beta_2(x_{2t} - x_{3t})\right)^2}{2\sigma^2}.$$
In our example, maximising l_R(θ) gives
$$\hat\theta_R = \begin{pmatrix} 0 \\ \hat\beta_{2R} \\ 1 - \hat\beta_{2R} \\ \hat\sigma_R^2 \end{pmatrix}.$$
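A sketch of this restricted estimation (Python, numpy assumed; the data and the true value of β₂ are simulated for illustration): the restrictions are imposed by substitution, i.e. (y_t − x_3t) is regressed on (x_2t − x_3t), and θ̂_R is then assembled as above.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 200
x1, x2, x3 = rng.normal(size=(3, T))
beta2_true = 0.7
y = 0.0 * x1 + beta2_true * x2 + (1 - beta2_true) * x3 + rng.normal(scale=0.5, size=T)

y_star = y - x3                                   # transformed dependent variable
x_star = x2 - x3                                  # transformed regressor
beta2_R = (x_star @ y_star) / (x_star @ x_star)   # ML (= OLS) in the restricted model
resid_R = y_star - beta2_R * x_star
sigma2_R = resid_R @ resid_R / T

theta_R = np.array([0.0, beta2_R, 1 - beta2_R, sigma2_R])   # (0, b2_R, 1 - b2_R, s2_R)'
print(theta_R)
```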
6 The Three ML Test Principles
The LR test
The idea behind this approach is as follows. Consider some parameter restrictions. The likelihood function will be larger when evaluated at θ̂_UR than at θ̂_R, i.e. l(θ̂_UR) ≥ l(θ̂_R). Now, if the restrictions imposed are valid, the difference between l(θ̂_UR) and l(θ̂_R) will be small. The LR test statistic measures this difference:
$$LR = 2\left(l(\hat\theta_{UR}) - l(\hat\theta_R)\right) = 2\ln\left[\frac{L(\hat\theta_{UR})}{L(\hat\theta_R)}\right], \qquad LR \overset{\text{approx.}}{\sim} \chi^2_m.$$
• Remark: you need both θ̂_UR and θ̂_R to calculate the LR statistic.
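A sketch of the LR test for the running example H₀: β₁ = 0, β₂ + β₃ = 1 (Python, numpy/scipy assumed; the data are simulated with the null true, and there are m = 2 restrictions):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
T = 200
X = rng.normal(size=(T, 3))
y = X @ np.array([0.0, 0.7, 0.3]) + rng.normal(scale=0.5, size=T)   # H0 holds here

def max_loglik(y, X):
    """Maximised Gaussian log-likelihood of a regression of y on X."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    s2 = e @ e / len(y)
    return -len(y) / 2 * (np.log(2 * np.pi) + np.log(s2) + 1)

l_UR = max_loglik(y, X)                                        # unrestricted
l_R = max_loglik(y - X[:, 2], (X[:, 1] - X[:, 2])[:, None])    # restricted, by substitution
LR = 2 * (l_UR - l_R)
print(LR, chi2.ppf(0.95, df=2))        # reject H0 (at 5%) if LR exceeds the critical value
```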
The LM test
This test measures the difference between the slopes (score functions) of the likelihood function at the points θ̂_UR and θ̂_R, i.e. s(θ̂_R) − s(θ̂_UR). If the restrictions imposed are valid, then this difference is small. Note, however, that the derivative at the unrestricted estimate θ̂_UR is zero (first order conditions: s(θ̂_UR) = 0), so the difference reduces to s(θ̂_R). The test statistic is
$$LM = s(\hat\theta_R)'\left[-H(\hat\theta_R)\right]^{-1} s(\hat\theta_R) \overset{\text{approx.}}{\sim} \chi^2_m$$
(the negative of the Hessian, −H(θ̂_R), estimates the information matrix I(θ/X), which makes the quadratic form positive).
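A sketch of the LM test for the same restrictions (Python, numpy/scipy assumed, simulated data). Only the restricted estimate is needed: the score and Hessian of the unrestricted Gaussian log-likelihood are evaluated at θ̂_R using their standard closed-form expressions, which are taken as given here since the notes do not derive them.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
T = 200
X = rng.normal(size=(T, 3))
y = X @ np.array([0.0, 0.7, 0.3]) + rng.normal(scale=0.5, size=T)

# Restricted estimates (by substitution, as in the previous section)
xs, ys = X[:, 1] - X[:, 2], y - X[:, 2]
b2R = (xs @ ys) / (xs @ xs)
betaR = np.array([0.0, b2R, 1 - b2R])
eR = y - X @ betaR
s2R = eR @ eR / T

# Score and Hessian of the unrestricted log-likelihood, evaluated at theta_R
score = np.concatenate([X.T @ eR / s2R,
                        [-T / (2 * s2R) + eR @ eR / (2 * s2R ** 2)]])
H = np.zeros((4, 4))
H[:3, :3] = -X.T @ X / s2R
H[:3, 3] = -X.T @ eR / s2R ** 2
H[3, :3] = H[:3, 3]
H[3, 3] = T / (2 * s2R ** 2) - eR @ eR / s2R ** 3

LM = score @ np.linalg.solve(-H, score)    # s'(-H)^{-1}s
print(LM, chi2.ppf(0.95, df=2))            # reject H0 (at 5%) if LM exceeds the critical value
```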
The Wald test
This test is based on the unrestricted estimate only: it measures how far Rθ̂_UR is from q. The test statistic is
$$W = \left(R\hat\theta_{UR} - q\right)'\left[R\left(-H(\hat\theta_{UR})\right)^{-1}R'\right]^{-1}\left(R\hat\theta_{UR} - q\right) \overset{\text{approx.}}{\sim} \chi^2_m.$$
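A sketch of the Wald test for the same restrictions (Python, numpy/scipy assumed, simulated data). Only the unrestricted estimates are needed; at the MLE, −H(θ̂_UR)⁻¹ is block diagonal with blocks σ̂²(X'X)⁻¹ and 2σ̂⁴/T, a standard result that is used here without derivation.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(8)
T = 200
X = rng.normal(size=(T, 3))
y = X @ np.array([0.0, 0.7, 0.3]) + rng.normal(scale=0.5, size=T)

# Unrestricted ML estimates
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
s2_hat = e @ e / T
theta_hat = np.concatenate([beta_hat, [s2_hat]])

# Estimated variance of theta_hat: -H(theta_hat)^{-1}, block diagonal at the MLE
V = np.zeros((4, 4))
V[:3, :3] = s2_hat * np.linalg.inv(X.T @ X)
V[3, 3] = 2 * s2_hat ** 2 / T

R = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0]])
q = np.array([0.0, 1.0])

d = R @ theta_hat - q
W = d @ np.linalg.solve(R @ V @ R.T, d)
print(W, chi2.ppf(0.95, df=2))             # reject H0 (at 5%) if W exceeds the critical value
```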
For a test of size α%, reject the null hypothesis when the test statistic (LR, LM or W) exceeds the relevant critical value of the χ²_m distribution.
7 Revision Questions
1) Comment on the following statements:
“When the regression errors are normal, the OLS estimator is BUE (Best Unbiased Estimator).”
“The ML estimator attains the Cramer-Rao bound, so the ML estimator has the smallest variance.”
2) Consider the model
$$y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2).$$
c) Hence suggest how one can test the hypothesis β₁ = β₂ using the Wald Test. (Explicitly state the null and the alternative hypothesis and explain how the test is carried out.)
3) Consider the model
$$y_t = \beta x_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma_t^2), \qquad \sigma_t^2 = \alpha_0 + \alpha_1 w_t, \ \text{with } w_t \text{ non-random.}$$
c) Hence suggest how one can test for heteroscedasticity using the Lagrange Multiplier Test. (Explicitly state the null and the alternative hypothesis and explain how the test is carried out.)