Introduction to Maximum Likelihood

by

Yannis Kasparis

1 Introduction

Maximum likelihood estimation is an alternative method of estimation that makes increased use of distributional assumptions on the errors of the model. Moreover it provides an extensive theory of inference that can be used to construct a large number of very important tests.¹

¹ These include the serial correlation and heteroscedasticity tests of MICROFIT.

Consider the Classical Linear Regression Model (CLRM) with normal errors:

$$y = X\beta + \varepsilon \quad (\text{or } y_t = x_t'\beta + \varepsilon_t), \qquad X \text{ fixed}, \qquad \varepsilon \sim N(0, \sigma^2 I) \quad (\text{or } \varepsilon_t \sim N(0, \sigma^2)).$$

Then this implies that

$$y \sim N(X\beta, \sigma^2 I) \quad (\text{or } y_t \sim N(x_t'\beta, \sigma^2)).$$

The maximum likelihood procedure makes use of this fact.

2 The Likelihood Function

Some Preliminary Points:

• The maximum likelihood procedure provides estimates for the model parameters β, σ². It is therefore comparable to other procedures, e.g. Ordinary Least Squares (OLS) and the Generalised Method of Moments (GMM).

• Moreover, just like the other procedures, the estimators β̂, σ̂² are obtained by optimising (minimising or maximising) a criterion function. For example, OLS minimises

$$S(\beta; X) = (y - X\beta)'(y - X\beta) \quad \left(\text{or } = \sum_t (y_t - x_t'\beta)^2\right).$$

• Maximum likelihood maximises the so-called Likelihood Function:

$$L(\beta, \sigma^2; X) = L(\theta; X), \qquad \theta = (\beta', \sigma^2)'.$$

• Maximum likelihood makes use of distributional assumptions about the error term. These play a role in the construction of the Likelihood Function: different distributional assumptions lead to different likelihood functions. We will assume that the errors are normal.

The Multivariate Normal Density

Let:

a) z be a (T × 1) vector of normal random variables,

b) E(z) = µ, a (T × 1) (mean) vector,

c) Var(z) = Σ, a (T × T) (covariance) matrix.

Then the density function of z, denoted f_z(u; µ, Σ), is

$$f_z(u; \mu, \Sigma) = (2\pi)^{-\frac{T}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2}(u - \mu)'\Sigma^{-1}(u - \mu) \right\}.$$

• When Σ = σ²I (diagonal), |Σ| = (σ²)^T, so

$$f_z(u; \mu, \Sigma) = (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\frac{1}{2\sigma^2}(u - \mu)'(u - \mu) \right\}.$$

Example 1:

Suppose that z_t, t = 1, 2, ..., T, is a random sample (i.i.d.) of normally distributed variables with E(z_t) = a, Var(z_t) = σ². The density of z_t (scalar) is

$$f_{z_t}(u_t; a, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(u_t - a)^2}{2\sigma^2} \right\}.$$

The joint density is (due to independence)

$$f_{z_1, z_2, \ldots, z_T}(u_1, u_2, \ldots, u_T; a, \sigma^2) = \prod_{t=1}^{T} f_{z_t}(u_t; a, \sigma^2)$$

$$= (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\sum_{t=1}^{T} \frac{(u_t - a)^2}{2\sigma^2} \right\}$$

$$= (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\frac{(u - \mu)'(u - \mu)}{2\sigma^2} \right\} \quad \text{(vector form)}$$

where the vectors u = (u_1, u_2, ..., u_T)' and µ = (a, a, ..., a)' are (T × 1).
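This factorisation can be checked numerically. The following is a minimal sketch (not part of the original notes); the sample size and the values of a and σ² are illustrative assumptions.

# Minimal check: for an i.i.d. normal sample, the product of the scalar densities
# equals the joint normal density with mean (a,...,a)' and covariance sigma^2 * I.
import numpy as np
from scipy.stats import norm, multivariate_normal

T = 5
a, sigma2 = 1.0, 2.0
rng = np.random.default_rng(0)
u = rng.normal(a, np.sqrt(sigma2), size=T)   # a draw playing the role of (u_1,...,u_T)'

# Product of the T scalar densities f_{z_t}(u_t; a, sigma^2)
product_of_scalars = np.prod(norm.pdf(u, loc=a, scale=np.sqrt(sigma2)))

# Joint density in vector form with mu = (a,...,a)' and Sigma = sigma^2 * I
joint_vector_form = multivariate_normal.pdf(u, mean=np.full(T, a), cov=sigma2 * np.eye(T))

print(product_of_scalars, joint_vector_form)   # the two numbers agree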

The Likelihood Function

The likelihood function is the joint density evaluated at the observation points. In

the notation of Example 1 the joint density is:

$$f_{z_1, z_2, \ldots, z_T}(u_1, u_2, \ldots, u_T; a, \sigma^2) = (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\sum_{t=1}^{T} \frac{(u_t - a)^2}{2\sigma^2} \right\}.$$
Hence the likelihood function is

$$L(a, \sigma^2; z_1, z_2, \ldots, z_T) = L(a, \sigma^2; z) = (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\sum_{t=1}^{T} \frac{(z_t - a)^2}{2\sigma^2} \right\}$$

$$= (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\frac{(z - \mu)'(z - \mu)}{2\sigma^2} \right\} \quad \text{(vector notation)}.$$

Note that zt can be written as a linear model

$$z_t = a + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2).$$

Consider again the classical linear regression model with normal errors: y_t = x_t'β + ε_t.

The likelihood function is


$$L(\theta; y/X) = (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\sum_{t=1}^{T} \frac{(y_t - x_t'\beta)^2}{2\sigma^2} \right\}$$

$$= (2\pi)^{-\frac{T}{2}} (\sigma^2)^{-\frac{T}{2}} \exp\left\{ -\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} \right\}$$

where θ = (β', σ²)' is a (k × 1) vector.

3 Estimation of the Likelihood Function

It is easier to maximise the logarithm of the likelihood function

$$l(\theta; y/X) = \ln L(\theta; y/X)$$

rather than the likelihood function itself (the two have the same maximiser: $\arg\max_\theta l(\theta; y/X) = \arg\max_\theta L(\theta; y/X)$). Note that the log-likelihood for the CLRM is:

$$l(\theta; y/X) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \sum_{t=1}^{T} \frac{(y_t - x_t'\beta)^2}{2\sigma^2}$$

$$= -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}.$$
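For concreteness, here is a minimal Python sketch (not part of the original notes) of this log-likelihood; the function name and the small arrays in the example call are illustrative assumptions.

# Minimal sketch of the CLRM log-likelihood l(theta; y/X); names and data are illustrative.
import numpy as np

def log_likelihood(beta, sigma2, y, X):
    """Gaussian log-likelihood of the classical linear regression model."""
    T = y.shape[0]
    resid = y - X @ beta
    return (-T / 2 * np.log(2 * np.pi)
            - T / 2 * np.log(sigma2)
            - resid @ resid / (2 * sigma2))

# Example call with tiny illustrative arrays:
y = np.array([1.0, 2.0, 3.0])
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
print(log_likelihood(np.array([1.0, 1.0]), 0.5, y, X))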

• First Order Conditions (FOCs):


 
$$s(\theta; y/X) = \frac{\partial l(\theta; y/X)}{\partial \theta} = \begin{pmatrix} \dfrac{\partial l(\theta; y/X)}{\partial \beta} \\[2ex] \dfrac{\partial l(\theta; y/X)}{\partial (\sigma^2)} \end{pmatrix} = 0 \qquad (k \times 1)$$

s(θ; y/X), the first derivative of l, is called the "score". So the Maximum Likelihood Estimator (MLE) θ̂ is obtained as the solution of s(θ; y/X) = 0.

• Second Order Conditions:

Consider the matrix of the second derivatives (the Hessian):

$$H(\theta; y/X) = \frac{\partial^2 l(\theta; y/X)}{\partial \theta \partial \theta'} = \begin{pmatrix} \dfrac{\partial^2 l(\theta; y/X)}{\partial \beta \partial \beta'} & \dfrac{\partial^2 l(\theta; y/X)}{\partial (\sigma^2) \partial \beta} \\[2ex] \dfrac{\partial^2 l(\theta; y/X)}{\partial \beta \partial (\sigma^2)} & \dfrac{\partial^2 l(\theta; y/X)}{\partial (\sigma^2)^2} \end{pmatrix} \qquad (k \times k).$$

Then H(θ*; y/X) must be negative definite, where θ* is the solution from the FOCs.

ML Estimation of the CLRM

$$\frac{\partial l(\theta; y/X)}{\partial \beta} = \frac{\partial}{\partial \beta}\left\{ -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} \right\} = -\frac{1}{2\sigma^2}\frac{\partial}{\partial \beta}\left\{ (y - X\beta)'(y - X\beta) \right\}$$

so

$$\frac{\partial l(\theta; y/X)}{\partial \beta} = 0 \;\Rightarrow\; \frac{\partial}{\partial \beta}\left\{ (y - X\beta)'(y - X\beta) \right\} = 0.$$

So the ML estimator (the same as the OLS estimator when the errors are normal) is:

$$\hat{\beta} = (X'X)^{-1} X'y.$$

Now
$$\frac{\partial l(\theta; y/X)}{\partial(\sigma^2)} = \frac{\partial}{\partial(\sigma^2)}\left\{ -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2} \right\}$$

$$= -\frac{T}{2}\frac{\partial}{\partial(\sigma^2)}\ln(\sigma^2) - \frac{1}{2}(y - X\beta)'(y - X\beta)\frac{\partial}{\partial(\sigma^2)}\left(\sigma^2\right)^{-1}$$

$$= -\frac{T}{2}\frac{1}{\sigma^2} - \frac{1}{2}(y - X\beta)'(y - X\beta)(-1)\left(\sigma^2\right)^{-2}$$

$$= \frac{1}{2(\sigma^2)^2}(y - X\beta)'(y - X\beta) - \frac{T}{2\sigma^2}$$

so

$$\frac{\partial l(\theta; y/X)}{\partial(\sigma^2)} = 0 \;\Rightarrow\; \frac{1}{2(\sigma^2)^2}(y - X\beta)'(y - X\beta) = \frac{T}{2\sigma^2} \;\Rightarrow\; \hat{\sigma}^2 = \frac{(y - X\hat{\beta})'(y - X\hat{\beta})}{T} = \frac{\hat{\varepsilon}'\hat{\varepsilon}}{T}$$

where β̂ is the estimate obtained above.
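A small numerical sketch (not part of the original notes) of these closed-form ML estimates; the simulated data are purely illustrative. Note that σ̂² divides by T, not by T − k as the unbiased OLS variance estimator does.

# Minimal sketch: closed-form ML estimates of the CLRM, beta_hat = (X'X)^{-1} X'y
# and sigma2_hat = e'e / T. Data are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
T, k = 200, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
beta_true = np.array([1.0, 0.5, -2.0])
y = X @ beta_true + rng.normal(scale=1.5, size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # same as the OLS estimator
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / T                 # ML divides by T, not T - k

print(beta_hat, sigma2_hat)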

Large Sample Distribution of ML estimator & the score vector

First consider the Hessian H(θ; y/X) matrix. H is a random matrix.

$$I(\theta/X) = -E\left[ H(\theta; y/X) \right]$$

I is the "Information Matrix". Its inverse is (approximately) the covariance matrix of the MLE θ̂:

$$\hat{\theta} \overset{\text{approx.}}{\sim} N\left(\theta, I(\theta/X)^{-1}\right), \qquad s(\theta; y/X) \overset{\text{approx.}}{\sim} N\left(0, I(\theta/X)\right).$$
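For the CLRM the information matrix can be written out explicitly. This is a standard result added here for reference (it is not derived in the original notes); it follows from taking expectations of the Hessian blocks of the log-likelihood above:

$$I(\theta/X) = \begin{pmatrix} \dfrac{X'X}{\sigma^2} & 0 \\[1.5ex] 0 & \dfrac{T}{2\sigma^4} \end{pmatrix}, \qquad I(\theta/X)^{-1} = \begin{pmatrix} \sigma^2 (X'X)^{-1} & 0 \\[1.5ex] 0 & \dfrac{2\sigma^4}{T} \end{pmatrix},$$

so the approximate variance of β̂ is σ²(X'X)⁻¹, the familiar OLS variance formula.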

Asymptotic efficiency of the ML estimator and the Cramer-Rao inequality

The Cramer-Rao Inequality states that there is a lower bound for the variance-covariance matrix of any unbiased estimator. The bound is given by the inverse of the information matrix. Consider some estimator θ̃ that is unbiased for θ (E(θ̃) = θ). Then

$$Var(\tilde{\theta}) \geq I(\theta)^{-1},$$

so the variance cannot be smaller than I(θ)⁻¹. We know that the variance of the ML estimator is approximately I(θ)⁻¹. So the MLE attains the lower bound in large samples.

4 The Model with Heteroscedastic Errors

Let

$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \Sigma), \qquad \Sigma \neq \sigma^2 I.$$

In particular,

$$\Sigma = \mathrm{diag}(\sigma_t^2) =
\begin{pmatrix}
\sigma_1^2 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & & \vdots \\
\vdots & & \ddots & 0 \\
0 & \cdots & 0 & \sigma_T^2
\end{pmatrix},
\qquad \sigma_t^2 = \alpha_0 + \alpha_1 w_t, \text{ with } w_t \text{ a non-random variable}.$$

Hence

$$|\Sigma| = \prod_{t=1}^{T} (\alpha_0 + \alpha_1 w_t)$$

and

$$\ln|\Sigma| = \sum_{t=1}^{T} \ln(\alpha_0 + \alpha_1 w_t).$$

So the log-likelihood function is

$$l(\beta, \alpha_0, \alpha_1; y/X) = -\frac{T}{2}\ln(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\ln(\alpha_0 + \alpha_1 w_t) - \frac{1}{2}\sum_{t=1}^{T}\frac{(y_t - x_t'\beta)^2}{\alpha_0 + \alpha_1 w_t}.$$
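Unlike the homoscedastic CLRM, this log-likelihood has no closed-form maximiser, so in practice it is maximised numerically. Below is a minimal sketch (not part of the original notes); the simulated data, starting values and the use of scipy.optimize.minimize are illustrative assumptions.

# Minimal sketch: numerical ML estimation of the heteroscedastic model
# sigma_t^2 = alpha0 + alpha1 * w_t, by minimising the negative log-likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
T = 300
w = rng.uniform(1.0, 3.0, size=T)                  # non-random variable driving the variance
X = np.column_stack([np.ones(T), rng.normal(size=T)])
beta_true, alpha0_true, alpha1_true = np.array([1.0, 0.5]), 0.5, 1.0
y = X @ beta_true + rng.normal(size=T) * np.sqrt(alpha0_true + alpha1_true * w)

def neg_loglik(params):
    beta, alpha0, alpha1 = params[:2], params[2], params[3]
    sigma2_t = alpha0 + alpha1 * w                 # must stay positive
    if np.any(sigma2_t <= 0):
        return np.inf
    resid = y - X @ beta
    return 0.5 * (T * np.log(2 * np.pi) + np.sum(np.log(sigma2_t))
                  + np.sum(resid**2 / sigma2_t))

start = np.array([0.0, 0.0, 1.0, 0.0])             # (beta1, beta2, alpha0, alpha1)
result = minimize(neg_loglik, start, method="Nelder-Mead")
print(result.x)                                    # ML estimates of beta, alpha0, alpha1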

10
11
5 Testing Parameter Restrictions

Consider the following regression model

$$y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + \beta_3 x_{3t} + \varepsilon_t. \tag{1}$$

One can test parameter restrictions suggested by economic theory, for example

$$\beta_1 = 0 \quad \text{or} \quad \beta_1 = \beta_2 = 0 \quad \text{or} \quad \beta_2 + \beta_3 = 1.$$

One can also test for heteroscedasticity (or serial correlation). If

$$\sigma_t^2 = \alpha_0 + \alpha_1 w_t \tag{2}$$

then one can test

$$\alpha_1 = 0.$$

ML provides a framework for testing. To carry out a testing procedure one needs to estimate:

A) The unrestricted model.

1 Obtain the likelihood function of the unrestricted model: l_UR(θ)

2 Estimate l_UR(θ): → θ̂_UR

B) The restricted model.

1 Obtain the likelihood function of the restricted model: l_R(θ)

Two ways of doing it:

a) Lagrange Multiplier type of problem:

$$l_R(\theta) = l_{UR}(\theta) - \lambda'(R\theta - q)$$

where λ is the Lagrange multiplier (vector) and Rθ − q = 0 collects the equations of restrictions (constraints):

R : (m × k), (m is the number of restrictions),

θ : (k × 1), (k is the number of parameters, m ≤ k),

q : (m × 1).

For example, suppose that the unrestricted model is the one in equation (1), so θ = (β₁, β₂, β₃, σ²)'. Consider the restrictions β₁ = 0 and β₂ + β₃ = 1. Then

   
 1 0 0 0   0 
R=

, q =  .
  
0 1 1 0 1

13
 
 β1 
    
 
1 0 0 0  β 
  2   0 
Rθ = q → 



=
 


0 1 1 0  β
 3 
 1
 
 
σ2
   
 β1   0 
→ 

= 
  
β2 + β3 1

b) Re-parameterise the model: with the restrictions β₁ = 0 and β₂ + β₃ = 1, the model of eq. (1) can be written as

$$y_t - x_{3t} = \beta_2 (x_{2t} - x_{3t}) + \varepsilon_t$$

so

$$l_R(\theta) = -\frac{T}{2}\ln(2\pi) - \frac{T}{2}\ln(\sigma^2) - \sum_{t=1}^{T}\frac{\left(y_t - x_{3t} - \beta_2 (x_{2t} - x_{3t})\right)^2}{2\sigma^2}.$$

2 Estimate l_R(θ): → θ̂_R

In our example

$$\hat{\theta}_R = \begin{pmatrix} 0 \\ \hat{\beta}_{2R} \\ 1 - \hat{\beta}_{2R} \\ \hat{\sigma}_R^2 \end{pmatrix}.$$
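As a small illustration (not part of the original notes), the restricted estimates can be obtained by running the re-parameterised regression directly; the simulated data below are illustrative assumptions.

# Minimal sketch: estimating the restricted model of eq. (1) under beta1 = 0 and
# beta2 + beta3 = 1 by regressing (y - x3) on (x2 - x3). Data are illustrative.
import numpy as np

rng = np.random.default_rng(3)
T = 200
x1, x2, x3 = rng.normal(size=(3, T))
y = 0.0 * x1 + 0.7 * x2 + 0.3 * x3 + rng.normal(size=T)   # restrictions hold in the DGP

y_star = y - x3
x_star = x2 - x3
beta2_R = (x_star @ y_star) / (x_star @ x_star)           # ML/OLS estimate of beta2
resid = y_star - beta2_R * x_star
sigma2_R = resid @ resid / T

theta_R = np.array([0.0, beta2_R, 1.0 - beta2_R, sigma2_R])
print(theta_R)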

6 The Three ML Test Principles

1 Likelihood Ratio (LR) test

2 Lagrange Multiplier (LM) test

3 Wald (W) test

The LR test

The idea behind this approach is as follows. Consider some parameter restrictions. The likelihood function will be larger when evaluated at θ̂_UR than at θ̂_R, i.e. l(θ̂_UR) > l(θ̂_R). Now, if the restrictions imposed are valid, the difference between l(θ̂_UR) and l(θ̂_R) will be small. The LR test statistic measures this difference:

$$LR = 2\left( l(\hat{\theta}_{UR}) - l(\hat{\theta}_R) \right) = 2\ln\left[ \frac{L(\hat{\theta}_{UR})}{L(\hat{\theta}_R)} \right].$$

Under the Null Hypothesis (H₀: the restrictions are valid)

$$LR \overset{\text{approx.}}{\sim} \chi^2_m.$$

For a test of size α%,

Reject H₀ if LR > χ²_{m;α}

where χ²_{m;α} is the α% critical value.

• Remark: you need both θ̂_UR and θ̂_R to calculate the LR statistic.
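A minimal sketch (not part of the original notes) of how the LR statistic and its χ² critical value might be computed; the maximised log-likelihood values l_UR and l_R below are placeholders standing in for the values obtained by estimating the two models.

# Minimal sketch: the LR test, built from the maximised log-likelihoods of the
# unrestricted and restricted models. The numbers are illustrative placeholders.
from scipy.stats import chi2

l_UR = -280.4          # illustrative maximised log-likelihood, unrestricted model
l_R = -282.1           # illustrative maximised log-likelihood, restricted model
m = 2                  # number of restrictions (beta1 = 0 and beta2 + beta3 = 1)

LR = 2 * (l_UR - l_R)
critical_value = chi2.ppf(0.95, df=m)     # 5% test
p_value = chi2.sf(LR, df=m)
print(LR, critical_value, p_value)        # reject H0 if LR > critical_value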

The LM test

This test measures the difference between the slopes (score functions) of the likelihood function at the points θ̂_UR and θ̂_R, i.e. s(θ̂_R) − s(θ̂_UR). If the restrictions imposed are valid then this difference is small. Note however that the derivative at the unrestricted estimate θ̂_UR is zero (First Order Conditions: s(θ̂_UR) = 0). So the difference becomes equal to s(θ̂_R):

$$LM = s(\hat{\theta}_R)' \, I(\hat{\theta}_R)^{-1} \, s(\hat{\theta}_R) \overset{\text{approx.}}{\sim} \chi^2_m,$$

where the information matrix I can be estimated by −H(θ̂_R). For a test of size α%,

Reject H₀ if LM > χ²_{m;α}

where χ²_{m;α} is the α% critical value.

• Remark: you only need θ̂_R to calculate the LM statistic.
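As a minimal illustration (not part of the original notes), the sketch below computes the LM statistic for testing H₀: a = a₀ in the i.i.d. normal model z_t ∼ N(a, σ²) of Example 1, where the score with respect to a and the corresponding information entry have simple closed forms; the simulated data and the 5% level are illustrative assumptions.

# Minimal sketch: LM test of H0: a = a0 in the i.i.d. model z_t ~ N(a, sigma^2).
# Only the restricted estimate (a0, sigma2_R) is needed. Data are illustrative.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
T, a0 = 150, 0.0
z = rng.normal(loc=0.3, scale=1.0, size=T)        # true mean differs from a0

sigma2_R = np.mean((z - a0) ** 2)                 # restricted ML estimate of sigma^2
score_a = np.sum(z - a0) / sigma2_R               # score w.r.t. a at the restricted estimate
info_aa = T / sigma2_R                            # information for a (block-diagonal with sigma^2)

LM = score_a ** 2 / info_aa                       # = T * (mean(z) - a0)^2 / sigma2_R
print(LM, chi2.ppf(0.95, df=1))                   # reject H0 if LM exceeds the critical value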

The Wald test


Consider the distance between (Rθ̂_UR − q) and (Rθ̂_R − q). This is what the Wald statistic measures. But by definition Rθ̂_R − q = 0. The Wald statistic is

$$W = \left( R\hat{\theta}_{UR} - q \right)' \left[ R \, I(\hat{\theta}_{UR})^{-1} R' \right]^{-1} \left( R\hat{\theta}_{UR} - q \right) \overset{\text{approx.}}{\sim} \chi^2_m,$$

where I(θ̂_UR)⁻¹ is the (approximate) covariance matrix of θ̂_UR. For a test of size α%,

Reject H₀ if W > χ²_{m;α}

where χ²_{m;α} is the α% critical value.

• Remark: you only need θ̂_UR to calculate the W statistic.
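As a minimal sketch (not part of the original notes), the following computes the Wald statistic for the linear restrictions of the earlier example, using only unrestricted estimates; the simulated data and the use of σ̂²(X'X)⁻¹ as the approximate covariance of β̂ are illustrative assumptions.

# Minimal sketch: Wald test of the linear restrictions R beta = q in the CLRM,
# using only the unrestricted ML estimates (beta1 = 0 and beta2 + beta3 = 1).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
T = 250
X = rng.normal(size=(T, 3))                       # columns x1, x2, x3
beta_true = np.array([0.0, 0.7, 0.3])             # restrictions hold in the DGP
y = X @ beta_true + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # unrestricted ML / OLS estimate
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / T
cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)    # approximate covariance of beta_hat

R = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0]])
q = np.array([0.0, 1.0])

d = R @ beta_hat - q
W = d @ np.linalg.solve(R @ cov_beta @ R.T, d)
print(W, chi2.ppf(0.95, df=len(q)))               # reject H0 if W exceeds the critical value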

7 Revision Questions

1) “When the regression errors are normal, the OLS estimator is BUE (Best Unbiased

Estimator). So OLS has the smallest variance.”

“The ML estimator attains the Cramer-Rao bound, so the ML estimator has the

smallest variance”.

Are these two statements contradictory?

2) Consider the following regression model

$$y_t = \beta_1 x_{1t} + \beta_2 x_{2t} + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma^2).$$

a) Write down the likelihood function.

b) Derive the Score function and the Hessian.

c) Hence suggest how one can test the hypothesis β₁ = β₂ using the Wald Test. (Explicitly state the null and the alternative hypotheses and explain how the test statistic can be obtained.)

3) Consider the following regression model

$$y_t = \beta x_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma_t^2), \qquad \sigma_t^2 = \alpha_0 + \alpha_1 w_t, \text{ with } w_t \text{ non-random}.$$

a) Write down the likelihood function.

b) Derive the Score function and the Hessian.

c) Hence suggest how one can test for heteroscedasticity using the Lagrange Multiplier Test. (Explicitly state the null and the alternative hypotheses and explain how the test statistic can be obtained.)
