Introduction to Maximum Likelihood

by

Yannis Kasparis

1 Introduction

Maximum likelihood estimation is an alternative method of estimation that makes increased use of distributional assumptions on the errors of the model. Moreover it provides an extensive theory of inference that can be used to construct a large number of very important tests.¹

¹ These include the serial correlation and heteroscedasticity tests of MICROFIT.

Consider the Classical Linear Regression Model (CLRM) with normal errors:

$$y = X\beta + \varepsilon \quad (\text{or } y_t = x_t'\beta + \varepsilon_t), \qquad X \text{ fixed}, \qquad \varepsilon \sim N(0, \sigma^2 I) \quad (\text{or } \varepsilon_t \sim N(0, \sigma^2)).$$

Then this implies that

$$y \sim N(X\beta, \sigma^2 I) \quad (\text{or } y_t \sim N(x_t'\beta, \sigma^2)).$$

The maximum likelihood procedure makes use of this fact.

2 The Likelihood Function

Some Preliminary Points:

• The maximum likelihood procedure provides estimates for the model parameters β, σ². It is therefore comparable to other procedures, e.g. Ordinary Least Squares (OLS) and the Generalised Method of Moments (GMM).

• Moreover, just like the other procedures, the estimators β̂, σ̂² are obtained by optimising (minimising or maximising) a criterion function. For example, OLS minimises

$$S(\beta; X) = (y - X\beta)'(y - X\beta) \quad \left(\text{or } = \sum_t (y_t - x_t'\beta)^2\right).$$

• Maximum likelihood maximises the so-called Likelihood Function:

$$L(\beta, \sigma^2; X) = L(\theta; X), \qquad \theta = (\beta', \sigma^2)'.$$

• Maximum likelihood makes use of distributional assumptions about the error term. These play a role in the construction of the Likelihood Function: different distributional assumptions lead to different likelihood functions. We will assume that the errors are normal.

The Multivariate Normal Density

Let:

a) z be a (T × 1) vector of normal random variables,

b) E(z) = µ, a (T × 1) (mean) vector,

c) Var(z) = Σ, a (T × T) (covariance) matrix.

Then the density function of z, denoted f_z(u; µ, Σ), is

$$f_z(u; \mu, \Sigma) = (2\pi)^{-\frac{T}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2}(u - \mu)'\Sigma^{-1}(u - \mu) \right\}.$$

• When Σ = σ²I (diagonal), |Σ| = (σ²)^T, so

$$f_z(u; \mu, \Sigma) = (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\frac{1}{2\sigma^2}(u - \mu)'(u - \mu) \right\}.$$

Example 1:

Suppose that z_t, t = 1, 2, ..., T, is a random sample (i.i.d.) of normally distributed variables with E(z_t) = a, Var(z_t) = σ². The density of z_t (scalar) is

$$f_{z_t}(u_t; a, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(u_t - a)^2}{2\sigma^2} \right\}.$$

The joint density is (due to independence)

$$f_{z_1, z_2, \ldots, z_T}(u_1, u_2, \ldots, u_T; a, \sigma^2) = \prod_{t=1}^{T} f_{z_t}(u_t; a, \sigma^2)$$

$$= (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\sum_{t=1}^{T} \frac{(u_t - a)^2}{2\sigma^2} \right\}$$

$$= (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\frac{(u - \mu)'(u - \mu)}{2\sigma^2} \right\} \quad \text{(vector form)}$$

where the vectors u = (u_1, u_2, ..., u_T)' and µ = (a, a, ..., a)' are (T × 1).
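This factorisation can be checked numerically. The following is a minimal sketch (not part of the original notes); the sample size and the values of a and σ² are illustrative assumptions.

# Minimal check: for an i.i.d. normal sample, the product of the scalar densities
# equals the joint normal density with mean (a,...,a)' and covariance sigma^2 * I.
import numpy as np
from scipy.stats import norm, multivariate_normal

T = 5
a, sigma2 = 1.0, 2.0
rng = np.random.default_rng(0)
u = rng.normal(a, np.sqrt(sigma2), size=T)   # a draw playing the role of (u_1,...,u_T)'

# Product of the T scalar densities f_{z_t}(u_t; a, sigma^2)
product_of_scalars = np.prod(norm.pdf(u, loc=a, scale=np.sqrt(sigma2)))

# Joint density in vector form with mu = (a,...,a)' and Sigma = sigma^2 * I
joint_vector_form = multivariate_normal.pdf(u, mean=np.full(T, a), cov=sigma2 * np.eye(T))

print(product_of_scalars, joint_vector_form)   # the two numbers agree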

The Likelihood Function

The likelihood function is the joint density evaluated at the observation points. In

the notation of Example 1 the joint density is:

$$f_{z_1, z_2, \ldots, z_T}(u_1, u_2, \ldots, u_T; a, \sigma^2) = (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\sum_{t=1}^{T} \frac{(u_t - a)^2}{2\sigma^2} \right\}.$$
Hence the likelihood function is

$$L(a, \sigma^2; z_1, z_2, \ldots, z_T) = L(a, \sigma^2; z) = (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\sum_{t=1}^{T} \frac{(z_t - a)^2}{2\sigma^2} \right\}$$

$$= (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\frac{(z - \mu)'(z - \mu)}{2\sigma^2} \right\} \quad \text{(vector notation)}.$$

Note that zt can be written as a linear model

$$z_t = a + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2).$$

Consider again the classical linear regression model with normal errors: y_t = x_t'β + ε_t.

The likelihood function is


$$L(\theta; y/X) = (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\sum_{t=1}^{T} \frac{(y_t - x_t'\beta)^2}{2\sigma^2} \right\}$$

$$= (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} \right\}$$

where θ = (β', σ²)' is a (k × 1) vector.

3 Estimation of the Likelihood Function

It is easier to maximise the logarithm of the likelihood function

$$l(\theta; y/X) = \ln L(\theta; y/X)$$

rather than the likelihood function itself (the two have the same maximiser: $\arg\max_\theta l(\theta; y/X) = \arg\max_\theta L(\theta; y/X)$). Note that the log-likelihood for the CLRM is:

$$l(\theta; y/X) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \sum_{t=1}^{T} \frac{(y_t - x_t'\beta)^2}{2\sigma^2}$$

$$= -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}.$$
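For concreteness, here is a minimal Python sketch (not part of the original notes) of this log-likelihood; the function name and the small arrays in the example call are illustrative assumptions.

# Minimal sketch of the CLRM log-likelihood l(theta; y/X); names and data are illustrative.
import numpy as np

def log_likelihood(beta, sigma2, y, X):
    """Gaussian log-likelihood of the classical linear regression model."""
    T = y.shape[0]
    resid = y - X @ beta
    return (-T / 2 * np.log(2 * np.pi)
            - T / 2 * np.log(sigma2)
            - resid @ resid / (2 * sigma2))

# Example call with tiny illustrative arrays:
y = np.array([1.0, 2.0, 3.0])
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
print(log_likelihood(np.array([1.0, 1.0]), 0.5, y, X))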

• First Order Conditions (FOCs):


 
$$s(\theta; y/X) = \frac{\partial l(\theta; y/X)}{\partial \theta} = \begin{pmatrix} \dfrac{\partial l(\theta; y/X)}{\partial \beta} \\[2ex] \dfrac{\partial l(\theta; y/X)}{\partial (\sigma^2)} \end{pmatrix} = 0 \qquad (k \times 1)$$

s(θ; y/X), the first derivative of l, is called the "score". So the Maximum Likelihood Estimator (MLE) θ̂ is obtained as the solution of s(θ; y/X) = 0.

• Second Order Conditions:

Consider the matrix of the second derivatives (the Hessian):

$$H(\theta; y/X) = \frac{\partial^2 l(\theta; y/X)}{\partial \theta \partial \theta'} = \begin{pmatrix} \dfrac{\partial^2 l(\theta; y/X)}{\partial \beta \partial \beta'} & \dfrac{\partial^2 l(\theta; y/X)}{\partial (\sigma^2) \partial \beta} \\[2ex] \dfrac{\partial^2 l(\theta; y/X)}{\partial \beta \partial (\sigma^2)} & \dfrac{\partial^2 l(\theta; y/X)}{\partial (\sigma^2)^2} \end{pmatrix} \qquad (k \times k).$$

Then H(θ*; y/X) must be negative definite, where θ* is the solution from the FOCs.

ML Estimation of the CLRM

$$\frac{\partial l(\theta; y/X)}{\partial \beta} = \frac{\partial}{\partial \beta}\left\{ -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} \right\} = -\frac{1}{2\sigma^2}\frac{\partial}{\partial \beta}\left\{ (y - X\beta)'(y - X\beta) \right\}$$

so

$$\frac{\partial l(\theta; y/X)}{\partial \beta} = 0 \;\Rightarrow\; \frac{\partial}{\partial \beta}\left\{ (y - X\beta)'(y - X\beta) \right\} = 0.$$

So the ML estimator (the same as the OLS estimator when the errors are normal) is:

$$\hat{\beta} = (X'X)^{-1} X'y.$$

Now
$$\frac{\partial l(\theta; y/X)}{\partial(\sigma^2)} = \frac{\partial}{\partial(\sigma^2)}\left\{ -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} \right\}$$

$$= -\frac{T}{2}\frac{\partial}{\partial(\sigma^2)}\ln(\sigma^2) - \frac{1}{2}(y - X\beta)'(y - X\beta)\frac{\partial}{\partial(\sigma^2)}\left(\sigma^2\right)^{-1}$$

$$= -\frac{T}{2}\frac{1}{\sigma^2} - \frac{1}{2}(y - X\beta)'(y - X\beta)(-1)\left(\sigma^2\right)^{-2}$$

$$= \frac{1}{2(\sigma^2)^2}(y - X\beta)'(y - X\beta) - \frac{T}{2\sigma^2}$$

so

$$\frac{\partial l(\theta; y/X)}{\partial(\sigma^2)} = 0 \;\Rightarrow\; \frac{1}{2(\sigma^2)^2}(y - X\beta)'(y - X\beta) = \frac{T}{2\sigma^2} \;\Rightarrow\; \hat{\sigma}^2 = \frac{(y - X\hat{\beta})'(y - X\hat{\beta})}{T} = \frac{\hat{\varepsilon}'\hat{\varepsilon}}{T}$$

where β̂ is the estimate obtained above.
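A small numerical sketch (not part of the original notes) of these closed-form ML estimates; the simulated data are purely illustrative. Note that σ̂² divides by T, not by T − k as the unbiased OLS variance estimator does.

# Minimal sketch: closed-form ML estimates of the CLRM, beta_hat = (X'X)^{-1} X'y
# and sigma2_hat = e'e / T. Data are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
T, k = 200, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
beta_true = np.array([1.0, 0.5, -2.0])
y = X @ beta_true + rng.normal(scale=1.5, size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # same as the OLS estimator
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / T                 # ML divides by T, not T - k

print(beta_hat, sigma2_hat)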

Large Sample Distribution of ML estimator & the score vector

First consider the Hessian H(θ; y/X) matrix. H is a random matrix.

$$I(\theta/X) = -E\left[ H(\theta; y/X) \right]$$

I is the "Information Matrix". Its inverse is (approximately) the covariance matrix of the MLE θ̂:

$$\hat{\theta} \overset{\text{approx.}}{\sim} N\left(\theta, I(\theta/X)^{-1}\right), \qquad s(\theta; y/X) \overset{\text{approx.}}{\sim} N\left(0, I(\theta/X)\right).$$
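For the CLRM the information matrix can be written out explicitly. This is a standard result added here for reference (it is not derived in the original notes); it follows from taking expectations of the Hessian blocks of the log-likelihood above:

$$I(\theta/X) = \begin{pmatrix} \dfrac{X'X}{\sigma^2} & 0 \\[1.5ex] 0 & \dfrac{T}{2\sigma^4} \end{pmatrix}, \qquad I(\theta/X)^{-1} = \begin{pmatrix} \sigma^2 (X'X)^{-1} & 0 \\[1.5ex] 0 & \dfrac{2\sigma^4}{T} \end{pmatrix},$$

so the approximate variance of β̂ is σ²(X'X)⁻¹, the familiar OLS variance formula.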

Asymptotic efficiency of the ML estimator and the Cramer-Rao inequality

The Cramer-Rao Inequality states that there is a lower bound for the variance-covariance matrix of any unbiased estimator. The bound is given by the inverse of the information matrix. Consider some estimator θ̃ that is unbiased for θ (E(θ̃) = θ). Then

$$Var(\tilde{\theta}) \geq I(\theta)^{-1},$$

so the variance cannot be smaller than I(θ)⁻¹. We know that the variance of the ML estimator is approximately I(θ)⁻¹. So the MLE attains the lower bound in large samples.

4 The Model with Heteroscedastic Errors

Let

$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \Sigma), \qquad \Sigma \neq \sigma^2 I.$$

In particular,

$$\Sigma = \mathrm{diag}(\sigma_t^2) =
\begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & \sigma_T^2
\end{pmatrix},
\qquad \sigma_t^2 = \alpha_0 + \alpha_1 w_t, \text{ with } w_t \text{ a non-random variable}.$$

Hence

$$|\Sigma| = \prod_{t=1}^{T} (\alpha_0 + \alpha_1 w_t)$$

and

$$\ln|\Sigma| = \sum_{t=1}^{T} \ln(\alpha_0 + \alpha_1 w_t).$$

So the log-likelihood function is

$$l(\beta, \alpha_0, \alpha_1; y/X) = -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\ln(\alpha_0 + \alpha_1 w_t) - \frac{1}{2}\sum_{t=1}^{T}\frac{(y_t - x_t'\beta)^2}{\alpha_0 + \alpha_1 w_t}.$$
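Unlike the homoscedastic CLRM, this log-likelihood has no closed-form maximiser, so in practice it is maximised numerically. Below is a minimal sketch (not part of the original notes); the simulated data, starting values and the use of scipy.optimize.minimize are illustrative assumptions.

# Minimal sketch: numerical ML estimation of the heteroscedastic model
# sigma_t^2 = alpha0 + alpha1 * w_t, by minimising the negative log-likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
T = 300
w = rng.uniform(1.0, 3.0, size=T)                  # non-random variable driving the variance
X = np.column_stack([np.ones(T), rng.normal(size=T)])
beta_true, alpha0_true, alpha1_true = np.array([1.0, 0.5]), 0.5, 1.0
y = X @ beta_true + rng.normal(size=T) * np.sqrt(alpha0_true + alpha1_true * w)

def neg_loglik(params):
    beta, alpha0, alpha1 = params[:2], params[2], params[3]
    sigma2_t = alpha0 + alpha1 * w                 # must stay positive
    if np.any(sigma2_t <= 0):
        return np.inf
    resid = y - X @ beta
    return 0.5 * (T * np.log(2 * np.pi) + np.sum(np.log(sigma2_t))
                  + np.sum(resid**2 / sigma2_t))

start = np.array([0.0, 0.0, 1.0, 0.0])             # (beta1, beta2, alpha0, alpha1)
result = minimize(neg_loglik, start, method="Nelder-Mead")
print(result.x)                                    # ML estimates of beta, alpha0, alpha1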

10
11
5 Testing Parameter Restrictions

Consider the following regression model

$$y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + \beta_3 x_{3t} + \varepsilon_t. \tag{1}$$

One can test parameter restrictions suggested by economic theory, for example

$$\beta_1 = 0 \quad \text{or} \quad \beta_1 = \beta_2 = 0 \quad \text{or} \quad \beta_2 + \beta_3 = 1.$$

One can also test for heteroscedasticity (or serial correlation). If

$$\sigma_t^2 = \alpha_0 + \alpha_1 w_t \tag{2}$$

then one can test

$$\alpha_1 = 0.$$

ML provides a framework for testing. To carry out a testing procedure one needs to estimate:

A) The unrestricted model.

1 Obtain the likelihood function of the unrestricted model: l_UR(θ)

2 Estimate l_UR(θ): → θ̂_UR

B) The restricted model.

1 Obtain the likelihood function of the restricted model: l_R(θ)

Two ways of doing it:

a) Lagrange Multiplier type of problem:

$$l_R(\theta) = l_{UR}(\theta) - \lambda'(R\theta - q)$$

where λ is the Lagrange multiplier (vector) and Rθ − q = 0 collects the equations of restrictions (constraints):

R : (m × k), (m is the number of restrictions),

θ : (k × 1), (k is the number of parameters, m ≤ k),

q : (m × 1).

For example, suppose that the unrestricted model is the one in equation (1), so θ = (β₁, β₂, β₃, σ²)'. Consider the restrictions β₁ = 0 and β₂ + β₃ = 1. Then

   
 1 0 0 0   0 
R=

, q =  .
  
0 1 1 0 1

13
 
 β1 
    
 
1 0 0 0  β 
  2   0 
Rθ = q → 



=
 


0 1 1 0  β
 3 
 1
 
 
σ2
   
 β1   0 
→ 

= 
  
β2 + β3 1

b) Re-parameterise the model: with the restrictions β₁ = 0 and β₂ + β₃ = 1, the model of eq. (1) can be written as

$$y_t - x_{3t} = \beta_2 (x_{2t} - x_{3t}) + \varepsilon_t$$

so

$$l_R(\theta) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \sum_{t=1}^{T}\frac{\left(y_t - x_{3t} - \beta_2 (x_{2t} - x_{3t})\right)^2}{2\sigma^2}.$$

2 Estimate l_R(θ): → θ̂_R

In our example

$$\hat{\theta}_R = \begin{pmatrix} 0 \\ \hat{\beta}_{2R} \\ 1 - \hat{\beta}_{2R} \\ \hat{\sigma}_R^2 \end{pmatrix}.$$
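As a small illustration (not part of the original notes), the restricted estimates can be obtained by running the re-parameterised regression directly; the simulated data below are illustrative assumptions.

# Minimal sketch: estimating the restricted model of eq. (1) under beta1 = 0 and
# beta2 + beta3 = 1 by regressing (y - x3) on (x2 - x3). Data are illustrative.
import numpy as np

rng = np.random.default_rng(3)
T = 200
x1, x2, x3 = rng.normal(size=(3, T))
y = 0.0 * x1 + 0.7 * x2 + 0.3 * x3 + rng.normal(size=T)   # restrictions hold in the DGP

y_star = y - x3
x_star = x2 - x3
beta2_R = (x_star @ y_star) / (x_star @ x_star)           # ML/OLS estimate of beta2
resid = y_star - beta2_R * x_star
sigma2_R = resid @ resid / T

theta_R = np.array([0.0, beta2_R, 1.0 - beta2_R, sigma2_R])
print(theta_R)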

6 The Three ML Test Principles

1 Likelihood Ratio (LR) test

2 Lagrange Multiplier (LM) test

3 Wald (W) test

The LR test

The idea behind this approach is as follows. Consider some parameter restrictions. The likelihood function will be larger when evaluated at θ̂_UR than at θ̂_R, i.e. l(θ̂_UR) > l(θ̂_R). Now, if the restrictions imposed are valid, the difference between l(θ̂_UR) and l(θ̂_R) will be small. The LR test statistic measures this difference:

$$LR = 2\left( l(\hat{\theta}_{UR}) - l(\hat{\theta}_R) \right) = 2\ln\left[ \frac{L(\hat{\theta}_{UR})}{L(\hat{\theta}_R)} \right].$$

Under the Null Hypothesis (H₀: the restrictions are valid)

$$LR \overset{\text{approx.}}{\sim} \chi^2_m.$$

For a test of size α%,

Reject H₀ if LR > χ²_{m;α}

where χ²_{m;α} is the α% critical value.

• Remark: you need both θ̂_UR and θ̂_R to calculate the LR statistic.
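A minimal sketch (not part of the original notes) of how the LR statistic and its χ² critical value might be computed; the maximised log-likelihood values l_UR and l_R below are placeholders standing in for the values obtained by estimating the two models.

# Minimal sketch: the LR test, built from the maximised log-likelihoods of the
# unrestricted and restricted models. The numbers are illustrative placeholders.
from scipy.stats import chi2

l_UR = -280.4          # illustrative maximised log-likelihood, unrestricted model
l_R = -282.1           # illustrative maximised log-likelihood, restricted model
m = 2                  # number of restrictions (beta1 = 0 and beta2 + beta3 = 1)

LR = 2 * (l_UR - l_R)
critical_value = chi2.ppf(0.95, df=m)     # 5% test
p_value = chi2.sf(LR, df=m)
print(LR, critical_value, p_value)        # reject H0 if LR > critical_value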

The LM test

This test measures the difference between the slopes (score functions) of the likelihood function at the points θ̂_UR and θ̂_R, i.e. s(θ̂_R) − s(θ̂_UR). If the restrictions imposed are valid then this difference is small. Note however that the derivative at the unrestricted estimate θ̂_UR is zero (First Order Conditions: s(θ̂_UR) = 0). So the difference becomes equal to s(θ̂_R):

$$LM = s(\hat{\theta}_R)' \, I(\hat{\theta}_R)^{-1} \, s(\hat{\theta}_R) \overset{\text{approx.}}{\sim} \chi^2_m,$$

where the information matrix I can be estimated by −H(θ̂_R). For a test of size α%,

Reject H₀ if LM > χ²_{m;α}

where χ²_{m;α} is the α% critical value.

• Remark: you only need θ̂_R to calculate the LM statistic.
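As a minimal illustration (not part of the original notes), the sketch below computes the LM statistic for testing H₀: a = a₀ in the i.i.d. normal model z_t ∼ N(a, σ²) of Example 1, where the score with respect to a and the corresponding information entry have simple closed forms; the simulated data and the 5% level are illustrative assumptions.

# Minimal sketch: LM test of H0: a = a0 in the i.i.d. model z_t ~ N(a, sigma^2).
# Only the restricted estimate (a0, sigma2_R) is needed. Data are illustrative.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
T, a0 = 150, 0.0
z = rng.normal(loc=0.3, scale=1.0, size=T)        # true mean differs from a0

sigma2_R = np.mean((z - a0) ** 2)                 # restricted ML estimate of sigma^2
score_a = np.sum(z - a0) / sigma2_R               # score w.r.t. a at the restricted estimate
info_aa = T / sigma2_R                            # information for a (block-diagonal with sigma^2)

LM = score_a ** 2 / info_aa                       # = T * (mean(z) - a0)^2 / sigma2_R
print(LM, chi2.ppf(0.95, df=1))                   # reject H0 if LM exceeds the critical value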

The Wald test


Consider the distance between (Rθ̂_UR − q) and (Rθ̂_R − q). This is what the Wald statistic measures. But by definition Rθ̂_R − q = 0. The Wald statistic is

$$W = \left( R\hat{\theta}_{UR} - q \right)' \left[ R \, I(\hat{\theta}_{UR})^{-1} R' \right]^{-1} \left( R\hat{\theta}_{UR} - q \right) \overset{\text{approx.}}{\sim} \chi^2_m,$$

where I(θ̂_UR)⁻¹ is the (approximate) covariance matrix of θ̂_UR. For a test of size α%,

Reject H₀ if W > χ²_{m;α}

where χ²_{m;α} is the α% critical value.

• Remark: you only need θ̂_UR to calculate the W statistic.
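As a minimal sketch (not part of the original notes), the following computes the Wald statistic for the linear restrictions of the earlier example, using only unrestricted estimates; the simulated data and the use of σ̂²(X'X)⁻¹ as the approximate covariance of β̂ are illustrative assumptions.

# Minimal sketch: Wald test of the linear restrictions R beta = q in the CLRM,
# using only the unrestricted ML estimates (beta1 = 0 and beta2 + beta3 = 1).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
T = 250
X = rng.normal(size=(T, 3))                       # columns x1, x2, x3
beta_true = np.array([0.0, 0.7, 0.3])             # restrictions hold in the DGP
y = X @ beta_true + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # unrestricted ML / OLS estimate
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / T
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)    # approximate covariance of beta_hat

R = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
q = np.array([0.0, 1.0])

d = R @ beta_hat - q
W = d @ np.linalg.solve(R @ cov_beta @ R.T, d)
print(W, chi2.ppf(0.95, df=len(q)))               # reject H0 if W exceeds the critical value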

7 Revision Questions

1) “When the regression errors are normal, the OLS estimator is BUE (Best Unbiased

Estimator). So OLS has the smallest variance.”

“The ML estimator attains the Cramer-Rao bound, so the ML estimator has the

smallest variance”.

Are these two statements contradictory?

2) Consider the following regression model

$$y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2).$$

a) Write down the likelihood function.

b) Derive the Score function and the Hessian.

c) Hence suggest how one can test the hypothesis β₁ = β₂ using the Wald Test. (Explicitly state the null and the alternative hypotheses and explain how the test statistic can be obtained.)

3) Consider the following regression model

$$y_t = \beta x_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma_t^2), \qquad \sigma_t^2 = \alpha_0 + \alpha_1 w_t, \text{ with } w_t \text{ non-random}.$$

a) Write down the likelihood function.

b) Derive the Score function and the Hessian.

c) Hence suggest how one can test for heteroscedasticity using the Lagrange Multiplier Test. (Explicitly state the null and the alternative hypotheses and explain how the test statistic can be obtained.)
