Econometrics I, Chapter 2
Thus, the way regression is used in econometrics has nothing to do with the original
definition of the term regression.
Examples of economic relationships:
a. Qd = f(P, Pc, Ps, taste, pop): a demand function
b. Y = f(X1, X2, X3, ..., Xk), where Xi, i = 1, 2, ..., k, are the explanatory variables:
1. if k = 1, we have a simple regression, e.g. Y = f(X);
2. if k > 1, we have a multiple regression.
In this chapter, we deal with the simple linear regression function (a single independent variable).
Recall that the economic model takes the form Yi = α + βXi, which is an abstract representation of reality. Suppose the pairs (x, y) are average weekly income (x) and expenditure (y) for a sample of households. It is unrealistic to expect each observed pair to lie exactly on the straight line. In addition, there are many other factors that affect the consumption of households. Thus, we move from the economic model to a statistical model in which the relation is inexact, which allows us to introduce an unknown and unobservable random variable, the error term ε.
Example: Qd = α + βXi + εi. The error term captures factors like the income of the consumer.
Footnote 1: A function is said to be linear in the parameters, say α and β, if the parameters appear with a power of one only and are not multiplied or divided by any other parameter.
For each fixed value of X there is a whole conditional distribution of Y values:
X:  X1                X2                X3    X4
Y:  (y1, y2, y3, …)   (y2, y4, y5, …)   …     …
3. The zero mean assumption: the disturbance term has zero mean, i.e.,
E(εi | Xi) = 0,  i = 1, 2, ..., n
That is, on average, the factors that are not explicitly included in the model do not systematically affect the mean value of Y. The positive values of the error term cancel out the negative values, so that their mean effect on the dependent variable is zero. Thus, on average, the regression line is correct.
Why do we need this assumption?
4. The assumption of homoscedasticity: common (constant) variance, i.e., Var(εi | Xi) = σ² for all i.
Prove that Var(Yi) = E(εi²) = σ². The proof begins with the definition of the variance of Y. We know that
Yi = α + βXi + εi  and  E(Yi) = α + βXi.
Then,
Var(Yi) = E[Yi − E(Yi)]² = E[α + βXi + εi − α − βXi]² = E(εi²) = σ².
This assumption implies that at all levels of X, the variation of Y around its mean value is the same. This means that all values of Y are equally reliable, reliability being measured by the variation about the mean of Y. Thus, all values of Y corresponding to different values of X are equally important. What is the importance of this assumption?
5. The assumption of no autocorrelation (no serial correlation): independence of the error terms (the error terms are not correlated), i.e., Cov(εi, εj) = 0 for all i ≠ j. How do we interpret this assumption?
6. The assumption of zero covariance between Xi and εi: independence of Xi and the error term.
Proof:
Cov(Xi, εi) = E{[Xi − E(Xi)][εi − E(εi)]}
            = E{[Xi − E(Xi)] εi}           (since E(εi) = 0)
            = E(Xi εi) − E(Xi) E(εi)
            = E(Xi εi)                     (since E(εi) = 0)
            = Xi E(εi)                     (since Xi is nonstochastic)
            = 0
This assumption implies that the independent variable (X) and the stochastic disturbance term have separate and additive effects on Y. If they are correlated, it is impossible to assess their individual effects on Y. That is, correlation between them causes problems for interpreting the model and for deriving desirable statistical properties.
7. The number of observations (n) must be greater than the number of explanatory variables in the model.
8. Variability in the values of X is important: the values of X in the sample must vary. That is, Var(X) must be a finite positive number.
9. There is no model specification problem.
Any econometric analysis begins with model specification. But what variables should be included and what is the specific mathematical form of the model? These are extremely important questions.
10. Normality assumption, i.e., εi ~ N(0, σ²) for all i.
Assumption 10, in conjunction with assumptions 3, 4 and 5, implies that the εi are normally and independently distributed: εi ~ NID(0, σ²).
The assumptions of classical linear regression analysis are needed for statistical estimation and inference.
Concepts of the Population Regression Function (PRF) and the Sample Regression Function (SRF)
By assumption, X is fixed in repeated sampling but the Y's vary. That is, we talk about the conditional distribution of Yi given the values of Xi. Again, by assumption E(εi) = 0. Now, for each conditional distribution of Y, we can compute its mean or average value, known as the conditional mean or conditional expectation of Y given that X takes the specific value Xi. It is denoted by E(Y | X = Xi). Thus, the conditional mean of Y given X is
E(Y | X = Xi) = α + βXi    (since E(εi) = 0),  or equivalently  Yi = α + βXi + εi.
However, we do not have data on all Y's (the population); we only have sample information on Y for given values of X. Unlike the population case, we have one value of Y for each value of X. We can estimate the PRF based on the sample information, but the estimation is not exact due to sampling fluctuations. The regression function obtained from the sample is known as the sample regression function (SRF).
Note that for different samples (n1 = n2 = … = nk) we obtain different SRFs, but which one best "fits" the PRF? Or which one best represents the true PRF? (We will answer this later.)
Now, consider
E(Y | X = Xi) = α + βXi ……………. PRF
Ŷi = α̂ + β̂Xi ………………………… SRF
The stochastic version of the SRF is given by Yi = α̂ + β̂Xi + ei, where ei is the residual, an estimate of εi. The residual is introduced into the SRF for the same reason that the error term is introduced into the PRF.
In summary, the main objective of regression analysis is the estimation of the PRF on the basis of the SRF.
2.2. Simple Regression Model: The Problem of Estimation
There are a number of methods for estimating the parameters of this model. The
most popular ones are
a) the method of moments
b) the method of least squares, and
c) the method of maximum likelihood.
Though they may give different results in more general models, in the case of the simple linear regression model all three give identical results.
The main assumptions introduced about the error term imply that
E(ε) = 0  and  Cov(X, ε) = 0.
In the method of moments, we replace these population conditions by their sample counterparts. Letting α̂ and β̂ be the estimators of α and β, respectively,
E(ε) = 0 is replaced by (1/n) Σ ei = 0, or Σ ei = 0, and
Cov(X, ε) = 0 is replaced by (1/n) Σ Xi ei = 0, or Σ Xi ei = 0.
Writing ei = Yi − α̂ − β̂Xi, these conditions become
Σ ei = 0     that is,  Σ (Yi − α̂ − β̂Xi) = 0, and
Σ Xi ei = 0  that is,  Σ Xi (Yi − α̂ − β̂Xi) = 0,
which give the normal equations
Σ Yi = n α̂ + β̂ Σ Xi   and   Σ XiYi = α̂ Σ Xi + β̂ Σ Xi²,
a system of two equations in the two unknowns α̂ and β̂.
Given data on observations of Y and X, solving these two equations we obtain the parameter estimates α̂ and β̂.
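To make the mechanics concrete, the short sketch below (an added illustration, not part of the original notes; Python with numpy is assumed) sets up and solves the two normal equations for the ten-worker data tabulated later in this chapter.

```python
import numpy as np

X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)   # labour hours
Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)  # output
n = len(X)

# Normal equations:  [ n    ΣX  ] [α̂]   [ ΣY  ]
#                    [ ΣX   ΣX² ] [β̂] = [ ΣXY ]
A = np.array([[n, X.sum()], [X.sum(), (X**2).sum()]])
b = np.array([Y.sum(), (X * Y).sum()])
alpha_hat, beta_hat = np.linalg.solve(A, b)
print(np.round([alpha_hat, beta_hat], 4))   # [3.6  0.75]
```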
Assignment I
Table 1. The data contain information on Y (output) and X (labour hours worked) for ten workers.
No. | X | Y | X² | XY | Ŷ | e
c. Suppose Y = consumption and X = current income. What is the economic interpretation of α and β?
d. Test the hypothesis that current income is insignificant in explaining consumption.
e. Predict the consumption level after ten years if income will be Birr 100.
To solve for α̂ and β̂ we need to compute Σ Xi, Σ Xi², Σ XiYi and Σ Yi.
The residuals (estimated errors) are then given by ei = Yi − 3.6 − 0.75 Xi, which can be calculated for each observation in our sample and are presented in the last column of the table above.
The residuals tell us the error we make if we try to predict the value of Y on the basis of the estimated regression equation and the values of X, within the range of the sample. We do not get residuals for X outside our range simply because we do not have the corresponding values of Y. By virtue of the first condition we put in the normal equations, Σ ei = 0.
The sum of squares of these residuals is given by:
Σ ei² = (−0.1)² + (1.15)² + (0.9)² + (−1.35)² + (0.4)² + (−2.6)² + (0.9)² + (1.15)² + (0.65)² + (−1.1)²
      = 14.65
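As a quick added check (not from the notes), the residuals and their sum of squares can be reproduced from the estimates α̂ = 3.6 and β̂ = 0.75 derived below:

```python
import numpy as np

X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)

e = Y - (3.6 + 0.75 * X)          # residuals e_i = Y_i - α̂ - β̂X_i
print(e)                          # [-0.1  1.15  0.9 -1.35  0.4 -2.6  0.9  1.15  0.65 -1.1]
print(round(e.sum(), 10))         # 0.0   (first normal-equation condition)
print(round((e**2).sum(), 4))     # 14.65 (residual sum of squares)
```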
The method of least squares, as we shall see soon, tries to obtain estimators that minimise
this sum.
The method of ordinary least squares (OLS) is attributed to Carl Friedrich Gauss. Under the assumptions of the classical linear regression model, OLS estimators have desirable statistical properties.
Starting point:
PRF: Yi = α + βXi + εi
SRF: Yi = α̂ + β̂Xi + ei
But Ŷi = α̂ + β̂Xi, so Yi = Ŷi + ei and ei = Yi − Ŷi.
Thus, the residual is the difference between the actual and the estimated (predicted) values of Y.
Now, given data on both X and Y, our objective is to find the SRF that best "fits" the PRF, i.e., the values of the estimators that are as close as possible to the population parameters. However, what criterion should we use to choose the SRF that best "fits" the actual PRF?
Criterion I: Minimize the sum of residuals
Choose the SRF in such a way that the sum of the residuals, Σ ei, is as small as possible. If the sum of the residuals is set to zero (Σ ei = 0), we obtain the method of moments. However, this criterion is not good, as it gives equal weight to all residuals (large, medium and small).
Criterion II: The least squares criterion
According to this criterion, we find the SRF that minimizes the sum of squared residuals, Σ ei². That is, we choose α̂ and β̂ such that the sum of squared residuals is minimized. This criterion is very important for two basic reasons:
a. it gives more weight to larger residuals and less weight to smaller residuals, and
b. the estimators obtained through the OLS method of estimation have desirable statistical properties.
Since different samples from a given population result in different estimators, which estimators minimize the sum of squared residuals? (Recall how to find extreme values of functions: the necessary and sufficient conditions.) Define
S = Σ (Yi − α̂ − β̂Xi)² = Σ ei²
The intuitive idea behind the procedure of least squares is given by looking at the
following figure.
[Figure: scatter plot of the data with the sample regression line; for each point, ei is the vertical distance between the observed value Yi and the fitted value Ŷi.]
The regression line passes through the points in such a way that it is 'as close as possible' to the data. Closeness could mean different things; the OLS minimisation procedure minimises the sum of squares of the vertical distances of the points from the line.
The Necessary (First Order) Conditions
∂S/∂α̂ = 0  and  ∂S/∂β̂ = 0
These are the first order conditions for minimisation. The first condition yields:
∂S/∂α̂ = −2 Σ (Yi − α̂ − β̂Xi) = 0
giving  Σ ei = 0
and     Σ Yi = n α̂ + β̂ Σ Xi                 EQ 1
so that Ȳ = α̂ + β̂ X̄  and  α̂ = Ȳ − β̂ X̄
The second condition yields:
∂S/∂β̂ = −2 Σ (Yi − α̂ − β̂Xi) Xi = 0
giving  Σ Xi ei = 0
and     Σ XiYi = α̂ Σ Xi + β̂ Σ Xi²           EQ 2
Equations 1 and 2 are known as the normal equations. To solve for β̂, we substitute α̂ from equation 1 into equation 2 and get
Σ XiYi = (Ȳ − β̂X̄) Σ Xi + β̂ Σ Xi²
Σ XiYi = Ȳ Σ Xi − β̂ X̄ Σ Xi + β̂ Σ Xi²
Σ XiYi − Ȳ Σ Xi = β̂ (Σ Xi² − X̄ Σ Xi)
Σ XiYi − (1/n) Σ Yi Σ Xi = β̂ [Σ Xi² − (1/n)(Σ Xi)²]
β̂ = [n Σ XiYi − Σ Xi Σ Yi] / [n Σ Xi² − (Σ Xi)²]
The numerator of the last equation can be simplified as follows:
n Σ YiXi − Σ Yi Σ Xi = n Σ YiXi − (nȲ)(nX̄)
                     = n (Σ YiXi − n Ȳ X̄)
                     = n Σ (YiXi − Ȳ X̄)
                     = n Σ (Yi − Ȳ)(Xi − X̄)
Similarly, the denominator can be rewritten as:
n Σ Xi² − (Σ Xi)² = n Σ Xi² − (nX̄)²
                  = n (Σ Xi² − n X̄²)
                  = n Σ (Xi − X̄)²
Therefore, we have:
β̂ = Σ (Yi − Ȳ)(Xi − X̄) / Σ (Xi − X̄)² = Σ xiyi / Σ xi²
where xi = Xi − X̄ and yi = Yi − Ȳ (the deviation form). Equivalently,
β̂ = [Σ (Xi − X̄)(Yi − Ȳ)/(n − 1)] / [Σ (Xi − X̄)²/(n − 1)] = Cov(X, Y) / Var(X)
If we let
Syy = Σ (Yi − Ȳ)² = Σ yi² = Σ Yi² − nȲ²,
Sxx = Σ (Xi − X̄)² = Σ xi² = Σ Xi² − nX̄², and
Syx = Σ (Yi − Ȳ)(Xi − X̄) = Σ yixi = Σ YiXi − nX̄Ȳ,
then
β̂ = Syx / Sxx
α̂ = Ȳ − β̂X̄ = Ȳ − (Syx / Sxx) X̄
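As an added illustration (not in the original notes), the deviation-form formulas can be applied directly to the ten-worker data used throughout this chapter; the same sums Sxx, Syx and Syy reappear in the worked example below.

```python
import numpy as np

X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)

x = X - X.mean()                  # deviations x_i = X_i - X̄
y = Y - Y.mean()                  # deviations y_i = Y_i - Ȳ
Sxx, Syx, Syy = (x**2).sum(), (x*y).sum(), (y**2).sum()

beta_hat = Syx / Sxx                         # 21/28 = 0.75
alpha_hat = Y.mean() - beta_hat * X.mean()   # 9.6 - 0.75*8 = 3.6
print(Sxx, Syx, round(Syy, 4), beta_hat, round(alpha_hat, 4))
# 28.0 21.0 30.4 0.75 3.6
```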
➢ Properties of OLS estimators and the Gauss-Markov theorem
a) Assumptions on Xi
a1) The values of X: X1, X2, …, Xn are fixed in advance (i.e., they are non-stochastic).
a2) Not all X values are equal (variability in the Xi's is important).
b) Assumptions on the error term (εi):
b1) E(εi) = 0 for all i
b2) Var(εi) = σ² for all i (homoscedasticity or common variance)
b3) Cov(εi, εj) = 0 for all i ≠ j (assumption of no autocorrelation)
Note that assumptions b1, b2 and b3 together are called the white noise assumptions on εi, denoted εi ~ WN(0, σ²).
Under these assumptions, each OLS estimator is
1. a linear function of the random variable Y,
2. unbiased, and
3. of minimum variance in the class of all linear unbiased estimators.
Gauss-Markov theorem: Given the assumptions of the classical linear regression model, the least-squares estimators have minimum variance in the class of linear unbiased estimators, i.e., they are BLUE.
Proof: We shall provide a proof for β̂; try to prove the same for α̂.
1. β̂ is a linear estimator of β
We know that
β̂ = Σ xiyi / Σ xi² = Σ xi(Yi − Ȳ) / Σ xi² = (Σ xiYi − Ȳ Σ xi) / Σ xi²
  = Σ xiYi / Σ xi²                              (since Σ xi = 0)          EQ 2
If we let ki = xi / Σ xi², EQ 2 can be written as
β̂ = Σ kiYi                                                                EQ 3
This shows that β̂ is linear in Yi; it is in fact a weighted average of the Yi, with the ki serving as weights.
Note the following properties of ki:
a) since the X variable is assumed non-stochastic, the ki are also fixed in advance;
b) Σ ki = 0;
c) Σ ki² = 1 / Σ xi²;
d) Σ kixi = Σ kiXi = 1.
2. β̂ is an unbiased estimator of β
β̂ = Σ kiYi
  = Σ ki(α + βXi + εi)                          (since Yi = α + βXi + εi)
  = α Σ ki + β Σ kiXi + Σ kiεi
  = β + Σ kiεi                                  (since Σ ki = 0 and Σ kiXi = 1)
Taking the expectation of this result, we get
E(β̂) = E(β + Σ kiεi) = β + Σ ki E(εi) = β       (since E(εi) = 0 for all i,
by the assumption of zero mean of the disturbance term).
Hence, β̂ is an unbiased estimator of β.
3. Among the set of all linear unbiased estimators of β, the least squares estimator β̂ has minimum variance. To show this we need to derive the variances and covariance of the least squares estimators. We do this for the estimator of the slope parameter, β̂.
Var(β̂) = E[β̂ − E(β̂)]² = E(β̂ − β)²                         (since E(β̂) = β)
       = E(Σ kiεi)²                                        (since β̂ − β = Σ kiεi)
       = E(k1²ε1² + k2²ε2² + … + kn²εn² + 2k1k2ε1ε2 + … + 2kn−1knεn−1εn)
       = E[Σ ki²εi² + ΣΣ(i≠j) kikjεiεj]
       = Σ ki² E(εi²) + ΣΣ(i≠j) kikj E(εiεj)
       = σ² Σ ki²                                          (since E(εiεj) = 0 for i ≠ j)
       = σ² / Σ xi²                                        (since Σ ki² = 1/Σ xi²)      EQ 4
Exercise: Show that E[β̂ − E(β̂)] = 0.
To show that β̂ has the minimum variance among the set of linear unbiased estimators of β, we proceed as follows. Recall that
β̂ = Σ kiYi  (a linear function of Y), where ki = (Xi − X̄) / Σ (Xi − X̄)² = xi / Σ xi².
Now define an alternative estimator β̃ of β, which is also a linear function of Y:
β̃ = Σ wiYi
For this estimator to be unbiased its expected value must equal β, i.e.,
E(β̃) = Σ wi E(Yi) = Σ wi (α + βXi) = α Σ wi + β Σ wiXi
Therefore, for β̃ to be unbiased, the following two conditions must hold:
Σ wi = 0  and  Σ wiXi = 1
The variance of β̃ is
Var(β̃) = Var(Σ wiYi) = Σ wi² Var(Yi) = σ² Σ wi²
Now, compare the variances of β̂ and β̃. Define di = wi − ki. Then
Σ di = Σ wi − Σ ki = 0   and   Σ diXi = Σ wiXi − Σ kiXi = 1 − 1 = 0,
so that
Σ dixi = Σ di(Xi − X̄) = Σ diXi − X̄ Σ di = 0,
and hence
Σ diki = Σ dixi / Σ xi² = 0.
Writing wi = ki + di,
Σ wi² = Σ (ki + di)² = Σ ki² + Σ di² + 2 Σ diki = Σ ki² + Σ di²
Therefore,
σ² Σ wi² = σ² Σ ki² + σ² Σ di²
Var(β̃) = Var(β̂) + σ² Σ di²
Var(β̂) ≤ Var(β̃)                                (since σ² Σ di² ≥ 0)
This establishes that the OLS estimator β̂ is the Best Linear Unbiased Estimator (BLUE) of β.
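The unbiasedness and minimum-variance claims can also be illustrated numerically. The simulation below is an added sketch (not part of the notes): it keeps the X values of the ten-worker example fixed in repeated samples, draws fresh normal errors each time, and checks that the OLS slope is centred on the assumed true β with sampling variance close to σ²/Σxi².

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
x = X - X.mean()
alpha, beta, sigma2 = 3.6, 0.75, 1.83   # treated as the "true" values for this simulation

betas = []
for _ in range(20000):
    eps = rng.normal(0.0, np.sqrt(sigma2), size=X.size)       # ε_i ~ N(0, σ²)
    Y = alpha + beta * X + eps
    betas.append((x * (Y - Y.mean())).sum() / (x**2).sum())   # OLS slope for this sample
betas = np.array(betas)

print(round(betas.mean(), 3))   # ≈ 0.75            (unbiasedness)
print(round(betas.var(), 4))    # ≈ σ²/Σx² ≈ 0.0654 (variance formula EQ 4)
```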
Exercise:
1. Show that α̂ = Σ ciYi, where ci = 1/n − X̄ ki.
Proof:
α̂ = Ȳ − β̂X̄
  = Σ Yi/n − X̄ Σ kiYi          (since β̂ = Σ kiYi)
  = Σ (1/n − X̄ ki) Yi
  = Σ ciYi
2. Show that Var(α̂) = σ² (1/n + X̄² / Σ xi²).
The covariance between the OLS estimators is derived as follows:
Cov(α̂, β̂) = E{[α̂ − E(α̂)][β̂ − E(β̂)]}                                        EQ 5
But α̂ = Ȳ − β̂X̄, Ȳ = α + βX̄ + ε̄, and E(α̂) = α, so
α̂ − E(α̂) = α + βX̄ + ε̄ − β̂X̄ − α = ε̄ − X̄(β̂ − β)
Therefore,
Cov(α̂, β̂) = E{[ε̄ − X̄(β̂ − β)](β̂ − β)} = E[ε̄(β̂ − β)] − X̄ E(β̂ − β)²
Since β̂ − β = Σ kiεi and the εi are uncorrelated with mean zero,
E[ε̄(β̂ − β)] = (1/n) Σi Σj kj E(εiεj) = (σ²/n) Σ kj = 0,
so that
Cov(α̂, β̂) = −X̄ E(β̂ − β)² = −X̄ Var(β̂) = −X̄ σ² / Σ xi²
To derive an unbiased estimator of σ², start from
Yi = α + βXi + εi,  i = 1, 2, ..., n                                          EQ 1
Averaging over i, we have
Ȳ = α + βX̄ + ε̄                                                                EQ 2
Subtracting EQ 2 from EQ 1 gives the population relation in deviation form:
yi = βxi + (εi − ε̄)                                                            EQ 3
But Yi = α̂ + β̂Xi + ei and Ȳ = α̂ + β̂X̄, so
Yi − Ȳ = β̂(Xi − X̄) + ei,  i.e.,  yi = β̂xi + ei                                 EQ 4
Subtract EQ 4 from EQ 3 to get
ei = (εi − ε̄) − (β̂ − β)xi                                                      EQ 5
Squaring and summing over i,
Σ ei² = Σ (εi − ε̄)² + (β̂ − β)² Σ xi² − 2(β̂ − β) Σ xi(εi − ε̄)
Taking expectations,
E(Σ ei²) = E[Σ (εi − ε̄)²] + Σ xi² E(β̂ − β)² − 2 E[(β̂ − β) Σ xi(εi − ε̄)]        EQ 6
Equation 6 has three components on its right-hand side, which can be evaluated as follows:
i) The first term:
E[Σ (εi − ε̄)²] = E[Σ εi² + n ε̄² − 2 ε̄ Σ εi]
               = E[Σ εi² + n ε̄² − 2n ε̄²]            (since Σ εi = n ε̄)
               = E[Σ εi² − n ε̄²]
               = Σ E(εi²) − n E(ε̄²)
Since E(εi²) = Var(εi) = σ² and E(ε̄²) = Var(ε̄) = σ²/n (both εi and ε̄ have zero mean),
E[Σ (εi − ε̄)²] = n σ² − n (σ²/n) = n σ² − σ² = (n − 1) σ²
ii) The second term:
Σ xi² E(β̂ − β)² = Σ xi² Var(β̂) = Σ xi² (σ² / Σ xi²) = σ²
iii) To simplify the third term in equation 6, we use the following result:
β̂ = Σ xiyi / Σ xi², and from EQ 3, yi = βxi + (εi − ε̄), so
β̂ = Σ xi[βxi + (εi − ε̄)] / Σ xi²
  = [β Σ xi² + Σ xi(εi − ε̄)] / Σ xi²
  = β + Σ xiεi / Σ xi²                          (since Σ xi = 0, the ε̄ term drops out)
Hence
β̂ − β = Σ xiεi / Σ xi²
Thus,
2 E[(β̂ − β) Σ xi(εi − ε̄)] = 2 E[(Σ xiεi / Σ xi²)(Σ xiεi − ε̄ Σ xi)]
                          = 2 E[(Σ xiεi)²] / Σ xi²          (since Σ xi = 0)
Now,
(Σ xiεi)² = (x1ε1 + x2ε2 + … + xnεn)²
E(Σ xiεi)² = x1² E(ε1²) + x2² E(ε2²) + … + xn² E(εn²) + 2x1x2 E(ε1ε2) + … + 2xn−1xn E(εn−1εn)
           = σ² Σ xi²                                        (since E(εiεj) = 0 for i ≠ j)
Therefore,
2 E[(β̂ − β) Σ xi(εi − ε̄)] = 2 σ² Σ xi² / Σ xi² = 2 σ²
Collecting the results obtained in i), ii) and iii) above, we get
E(Σ ei²) = (n − 1)σ² + σ² − 2σ² = σ²(n − 1 + 1 − 2) = (n − 2)σ²
It easily follows that if we set
σ̂² = Σ ei² / (n − 2) = RSS / (n − 2)
we have an unbiased estimator of σ².
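For the ten-worker example this gives σ̂² = 14.65/8 ≈ 1.83; a minimal added check (not from the notes):

```python
import numpy as np

X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)

e = Y - (3.6 + 0.75 * X)            # OLS residuals
rss = (e**2).sum()                  # 14.65
sigma2_hat = rss / (len(Y) - 2)     # unbiased estimator of σ²
print(round(rss, 4), round(sigma2_hat, 5))   # 14.65 1.83125
```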
The first order conditions,
Σ ei = 0,  i.e.,  ē = 0                                       EQ 1
Σ Xiei = 0                                                    EQ 2
where ei = Yi − α̂ − β̂Xi = Yi − Ŷi, imply that
1. the mean of the residuals is zero, and
2. the residuals and the explanatory variable are uncorrelated.
Given this, it follows that
Σ eiŶi = 0
That is, the residuals and the estimated (fitted) values of Y are uncorrelated.
Proof:
Σ eiŶi = Σ ei(α̂ + β̂Xi) = α̂ Σ ei + β̂ Σ eiXi = 0 + 0 = 0
Now,
Yi = Ŷi + ei                                                  EQ 3
(observed value of Yi = estimated value of Yi + residual)
Summing equation 3 over i, the sampled observations, gives
Σ Yi = Σ Ŷi + Σ ei = Σ Ŷi               (since Σ ei = 0)      EQ 4
so that Ȳ = (1/n) Σ Ŷi: the mean of the fitted values equals the mean of the observed values.
Subtracting the means,
Yi − Ȳ = Ŷi − Ȳ + ei,  i.e.,  yi = ŷi + ei                     EQ 5
Squaring both sides and then summing over i, we get
Σ yi² = Σ (ŷi + ei)² = Σ ŷi² + 2 Σ ŷiei + Σ ei² = Σ ŷi² + Σ ei²     EQ 6
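An added numerical check of this decomposition on the ten-worker data (illustration only, not in the notes):

```python
import numpy as np

X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)

Y_hat = 3.6 + 0.75 * X
e = Y - Y_hat

TSS = ((Y - Y.mean())**2).sum()       # Σ y_i²  = 30.4
ESS = ((Y_hat - Y.mean())**2).sum()   # Σ ŷ_i² = 15.75
RSS = (e**2).sum()                    # Σ e_i²  = 14.65
print(round(TSS, 4), round(ESS + RSS, 4))   # 30.4 30.4
```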
The coefficient of determination, denoted by R², is a measure of the goodness of fit of a regression line. Mathematically,
R² = ESS / TSS = 1 − RSS / TSS = 1 − Σ ei² / Σ (Yi − Ȳ)²
Verbally, it measures the proportion (or percentage) of the total variation in the dependent variable (Y) explained by the independent variable(s) included in the model.
NB 1. By definition, 0 ≤ R² ≤ 1.
2. It is a nonnegative quantity.
Alternative formulas for the coefficient of determination
From EQ 6, R² = Σ ŷi² / Σ yi².
Also, since Ŷi = α̂ + β̂Xi and α̂ = Ȳ − β̂X̄,
ŷi = Ŷi − Ȳ = α̂ + β̂Xi − Ȳ = (Ȳ − β̂X̄) + β̂Xi − Ȳ = β̂(Xi − X̄) = β̂xi
Then,
Σ ŷi² = β̂² Σ xi²
so that
R² = β̂² Σ xi² / Σ yi² = β̂ Σ xiyi / Σ yi²
The last step can also be seen directly:
Σ ŷi² = Σ (β̂xi) ŷi = β̂ Σ xi(yi − ei)                  (since yi = ŷi + ei)
      = β̂ Σ xiyi − β̂ Σ xiei = β̂ Σ xiyi                (since Σ xiei = 0)
It follows that
Σ ŷi² = (Σ xiyi / Σ xi²) Σ xiyi = (Σ xiyi)² / Σ xi²    (since β̂ = Σ xiyi / Σ xi²)
Hence
R² = β̂ Σ xiyi / Σ yi²                                                         EQ 7
Denote the square of the correlation between Y and Ŷ (the observed and fitted values of Y) by r²(y, ŷ); their correlation coefficient is
r(y, ŷ) = Σ yiŷi / √(Σ yi² Σ ŷi²)
Proposition: r²(y, ŷ) = R²
Proof: Start with the fact that yi = ŷi + ei. Multiply this equation throughout by ŷi and sum over i to get
Σ yiŷi = Σ ŷi² + Σ ŷiei = Σ ŷi²                        (since Σ ŷiei = 0)      EQ 8
Now,
r(y, ŷ) = Σ yiŷi / √(Σ yi² Σ ŷi²) = Σ ŷi² / √(Σ yi² Σ ŷi²) = √(Σ ŷi² / Σ yi²)
Hence
r²(y, ŷ) = Σ ŷi² / Σ yi² = ESS / TSS = R²
Therefore, r²(y, ŷ) = R².
Assignment: Show that r(y, ŷ) = r(y, x).
Sketch (for β̂ > 0):
r(y, ŷ) = Σ yiŷi / √(Σ yi² Σ ŷi²) = Σ yi(β̂xi) / √(Σ yi² Σ β̂²xi²) = β̂ Σ yixi / (β̂ √(Σ yi² Σ xi²)) = Σ yixi / √(Σ yi² Σ xi²) = r(y, x)
Example: Consider the data in Table 1, which contains information on Y (output) and X (labour hours worked) for 10 workers.

No.     Y    X    Y²    X²    YX     ei      ei²
1      11   10   121   100   110   -0.10    0.0100
2      10    7   100    49    70    1.15    1.3225
3      12   10   144   100   120    0.90    0.8100
4       6    5    36    25    30   -1.35    1.8225
5      10    8   100    64    80    0.40    0.1600
6       7    8    49    64    56   -2.60    6.7600
7       9    6    81    36    54    0.90    0.8100
8      10    7   100    49    70    1.15    1.3225
9      11    9   121    81    99    0.65    0.4225
10     10   10   100   100   100   -1.10    1.2100
Total  96   80   952   668   789    0.00   14.6500
Given this information, we want to determine the relationship between labour input and output. From the totals above we can compute
X̄ = 80/10 = 8;   Ȳ = 96/10 = 9.6
The following are also necessary pieces of information:
Sxx = Σ (Xi − X̄)² = Σ xi² = Σ Xi² − nX̄² = 668 − 10(8)² = 28
Syx = Σ (Yi − Ȳ)(Xi − X̄) = Σ yixi = Σ YiXi − nX̄Ȳ = 789 − 10(8)(9.6) = 21
Syy = Σ (Yi − Ȳ)² = Σ yi² = Σ Yi² − nȲ² = 952 − 10(9.6)² = 30.4
β̂ = Syx / Sxx = 21/28 = 0.75
α̂ = Ȳ − β̂X̄ = 9.6 − 0.75(8) = 3.6
Since the intercept is 3.6, the equation implies that output will be 3.6 when zero labour is used! Note the absurdity of trying to predict values of Y outside the range of the X values in the data set. We can go further and estimate
r(x, y) = Σ yixi / √(Σ yi² Σ xi²) = Syx / √(Sxx Syy) = 21 / √(28 × 30.4) = 21 / 29.175 ≈ 0.72
Therefore, r²(x, y) = R² ≈ 0.52.
Alternatively, R² = β̂ Σ xiyi / Σ yi² = (0.75)(21) / 30.4 ≈ 0.52
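These goodness-of-fit figures are easy to reproduce; the snippet below is an added illustration (not from the notes):

```python
import numpy as np

X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)

x, y = X - X.mean(), Y - Y.mean()
r = (x*y).sum() / np.sqrt((x**2).sum() * (y**2).sum())   # sample correlation
beta_hat = (x*y).sum() / (x**2).sum()
R2 = beta_hat * (x*y).sum() / (y**2).sum()               # R² = β̂ Σxy / Σy²

print(round(r, 4), round(r**2, 4), round(R2, 4))         # 0.7198 0.5181 0.5181
```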
Maximum Likelihood Estimation
To generate the maximum likelihood estimators (MLE) of the parameters of the simple linear regression model, we need to make assumptions about the distribution of εi, the disturbance term. In this setting εi is assumed to be normally and independently distributed with mean 0 and variance σ², i.e., εi ~ NID(0, σ²).
Yi | Xi ~ NID(α + βXi, σ²),  i = 1, 2, …, n                                   EQ 1
Given the assumption of independence of the Yi's, the joint probability density function of the sample (Y1, Y2, …, Yn) can be written as a product of n marginal density functions:
f(Y1, Y2, …, Yn | α + βXi, σ²) = f1(Y1 | α + βX1, σ²) · f2(Y2 | α + βX2, σ²) ⋯ fn(Yn | α + βXn, σ²)     EQ 2
where
fi(Yi | α + βXi, σ²) = [1/(σ√(2π))] exp{ −(1/2) [(Yi − α − βXi)/σ]² }          EQ 3
This is nothing but the density function of a normally distributed random variable with mean α + βXi and variance σ². Substituting EQ 3 into EQ 2 for each Yi we get
f(Y1, …, Yn | X1, …, Xn; α, β, σ²) = [1/(σⁿ(2π)^(n/2))] exp{ −(1/(2σ²)) Σ (Yi − α − βXi)² }     EQ 4
Its likelihood function, with parameters α, β and σ² (the unknowns), is defined as:
L(α, β, σ²; Y1, Y2, …, Yn, X1, X2, …, Xn) = [1/(σⁿ(2π)^(n/2))] exp{ −(1/(2σ²)) Σ (Yi − α − βXi)² }     EQ 5
The log-likelihood of this function is:
l ≡ ln L = −n ln σ − (n/2) ln 2π − (1/(2σ²)) Σ (Yi − α − βXi)²                 EQ 6
Let α̃, β̃ and σ̃², respectively, represent the ML estimators of α, β and σ². By evaluating the first order partial derivatives of the log-likelihood function at the maximum likelihood estimators and equating them to zero, we get the following first order conditions:
∂l/∂α = 0   giving   (1/σ̃²) Σ (Yi − α̃ − β̃Xi) = 0
∂l/∂β = 0   giving   (1/σ̃²) Σ (Yi − α̃ − β̃Xi) Xi = 0                            EQ 8
∂l/∂σ² = 0  giving   −n/(2σ̃²) + (1/(2σ̃⁴)) Σ (Yi − α̃ − β̃Xi)² = 0
Rearranging the first two first order conditions in EQ 8, one gets
n α̃ + β̃ Σ Xi = Σ Yi
α̃ Σ Xi + β̃ Σ Xi² = Σ XiYi                                                       EQ 9
Notice the similarity of these equations with the normal equations obtained using the method of moments and those obtained using OLS. Solving EQ 9, therefore, yields the ML estimators of α and β, which are identical to those of OLS and the method of moments, namely:
α̃ = Ȳ − β̃X̄ = α̂
β̃ = Σ xiyi / Σ xi² = β̂
Substituting these into the third first order condition in EQ 8, one obtains the MLE of σ² as follows:
−n/(2σ̃²) + (1/(2σ̃⁴)) Σ (Yi − α̃ − β̃Xi)² = 0
σ̃² = (1/n) Σ (Yi − α̃ − β̃Xi)² = (1/n) Σ (Yi − α̂ − β̂Xi)²                         EQ 11
σ̃² = (1/n) Σ ei²
Note that the OLS estimator is σ̂² = Σ ei² / (n − 2).
Thus the ML estimator of σ² is different from its OLS estimator. Since σ̃² = [(n − 2)/n] σ̂² and σ̂² is unbiased, the ML estimator is biased. The magnitude of the bias is easily obtained as follows:
E(σ̃²) = (1/n) E(Σ ei²) = (1/n)(n − 2)σ² = σ² − (2/n)σ²                          EQ 12
Thus σ̃² is biased downwards and underestimates the true σ².
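Because n is only 10 in the worked example, the two estimators of σ² differ noticeably; an added check (not from the notes):

```python
import numpy as np

X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)

e = Y - (3.6 + 0.75 * X)
rss, n = (e**2).sum(), len(Y)

sigma2_ml  = rss / n          # MLE:  RSS/n     = 1.465   (biased downwards)
sigma2_ols = rss / (n - 2)    # OLS:  RSS/(n-2) = 1.83125 (unbiased)
print(round(sigma2_ml, 5), round(sigma2_ols, 5))
```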
2.6. Confidence Intervals and Hypothesis Testing
This section deals with statistical inference: interval estimation and hypothesis testing. For this and related purposes we need
1. the variances of the OLS estimators,
2. the covariance between the OLS estimators, and
3. the unbiased estimator of σ².
In addition, we need the normality assumption on the error term. This assumption is particularly crucial for inference, because without it we cannot do any statistical testing on the parameters and interval estimation is impossible.
Implications of the normality assumption of the error terms for the distribution of the estimators of interest
Assumption of normality: εi ~ NID(0, σ²)
A linear function of independent, normally distributed random variables is itself normally distributed. Since εi ~ NID(0, σ²) and Yi is a linear function of εi, Yi is also independently and normally distributed:
Yi | Xi ~ NID(α + βXi, σ²)
The OLS estimators are linear functions of the Yi. Hence:
1. β̂ = Σ kiYi ~ N(β, σ²/Σ xi²), where ki = xi/Σ xi², Var(β̂) = σ²/Σ xi², and E(β̂) = β;
2. α̂ = Σ ciYi ~ N(α, σ²(1/n + X̄²/Σ xi²)), where ci = 1/n − X̄ki, Var(α̂) = σ²(1/n + X̄²/Σ xi²), and E(α̂) = α;
3. Σ ei² / σ² ~ χ²(n − 2).
The first two follow from the fact that α̂ and β̂ are linear combinations of the Yi, which are normal. The proof of the third proposition goes as follows. Given that εi ~ NID(0, σ²),
(εi − 0)/σ = εi/σ ~ N(0, 1), a standard normal variable, and
Σ εi²/σ² ~ χ²(n), since the sum of squares of n independent standard normal random variables has a chi-square distribution with n degrees of freedom. Now
Σ εi² = Σ (Yi − α − βXi)²
Adding and subtracting α̂ + β̂Xi on the right-hand side, and using α̂ − α = ε̄ − (β̂ − β)X̄, we get
εi = ei + (α̂ − α) + (β̂ − β)Xi = ei + ε̄ + (β̂ − β)xi
so that
Σ εi² = Σ ei² + n ε̄² + (β̂ − β)² Σ xi²
(the cross-product terms all vanish).
Assignment: Show that the sum of the cross-product terms in the above equation is equal to zero.
Dividing the whole equation by σ², we get
Σ εi²/σ²  =  Σ ei²/σ²  +  n ε̄²/σ²  +  (β̂ − β)² Σ xi²/σ²
  χ²(n)        χ²(n−2)      χ²(1)            χ²(1)
Hence
Σ ei²/σ² ~ χ²(n − 2)
Since σ̂² = Σ ei² / (n − 2), it follows that
(n − 2) σ̂² / σ² ~ χ²(n − 2)                                                     EQ 1
Recalling the fact that β̂ ~ N(β, σ²/Σ xi²), it then follows that
(β̂ − β) / (σ/√Σ xi²) ~ N(0, 1)                                                  EQ 2
i.e., it is standard normal. Recall that the ratio of a standard normal random variable to the square root of an independent chi-square random variable divided by its degrees of freedom follows a t-distribution. Since β̂ and Σ ei² are independent, it follows that:
[(β̂ − β)/(σ/√Σ xi²)] / √[(n − 2)σ̂² / ((n − 2)σ²)]  =  (β̂ − β) / (σ̂/√Σ xi²)  =  (β̂ − β) / ŜE(β̂)  ~  t(n − 2)
where
SE(β̂) = σ / √Σ xi²                 and   ŜE(β̂) = σ̂ / √Σ xi²
SE(α̂) = σ √(1/n + X̄²/Σ xi²)        and   ŜE(α̂) = σ̂ √(1/n + X̄²/Σ xi²)
These results are used for both estimating confidence intervals and hypothesis testing. Notice the switch from the variances of the estimators (which involve the unknown σ²) to their estimated variances (which replace σ² by σ̂²).
We shall use the data in our previous example to calculate the variances and standard errors of the estimators:
1. write down the variance formula;
2. substitute σ̂² for σ²;
3. take the square root of the resulting expression.
Now, Var(β̂) = σ² / Σ xi² = σ²/28 = 0.036 σ²
and Var(α̂) = σ² (1/n + X̄²/Σ xi²) = σ² (1/10 + 64/28) = 2.39 σ²
Recall that
σ̂² = Σ ei² / (n − 2) = 14.65 / 8 ≈ 1.83
(σ̂ is the standard error of the regression).
Therefore,
ŜE(α̂) = √[σ̂² (1/n + X̄²/Σ xi²)] = √[(1.83)(2.39)] ≈ 2.09
ŜE(β̂) = √(σ̂² / Σ xi²) = √[(0.036)(1.83)] ≈ 0.256
Usually, the complete result of the regression is written as follows:
Ŷi = 3.6 + 0.75 Xi,    R² = 0.52
    (2.09)  (0.256)
with the estimated standard errors in parentheses.
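An added sketch (not part of the notes) reproducing these standard errors from the raw data:

```python
import numpy as np

X = np.array([10, 7, 10, 5, 8, 8, 6, 7, 9, 10], dtype=float)
Y = np.array([11, 10, 12, 6, 10, 7, 9, 10, 11, 10], dtype=float)
n = len(Y)

x = X - X.mean()
beta_hat = (x * (Y - Y.mean())).sum() / (x**2).sum()
alpha_hat = Y.mean() - beta_hat * X.mean()
e = Y - alpha_hat - beta_hat * X
sigma2_hat = (e**2).sum() / (n - 2)                    # σ̂² ≈ 1.83

se_beta  = np.sqrt(sigma2_hat / (x**2).sum())
se_alpha = np.sqrt(sigma2_hat * (1/n + X.mean()**2 / (x**2).sum()))
print(round(se_alpha, 3), round(se_beta, 3))           # 2.09 0.256
```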
We can then easily obtain the confidence intervals for α and β by using the t distribution with n − 2 degrees of freedom. From the table of the t-distribution, t0.025(8) = 2.306. Now, since (β̂ − β)/ŜE(β̂) ~ t(8), it follows that:
Pr[ −2.306 ≤ (β̂ − β)/ŜE(β̂) ≤ 2.306 ] = 0.95
Pr[ −2.306 ŜE(β̂) ≤ β̂ − β ≤ 2.306 ŜE(β̂) ] = 0.95
Pr[ β̂ − 2.306 ŜE(β̂) ≤ β ≤ β̂ + 2.306 ŜE(β̂) ] = 0.95
Pr[ 0.75 − (2.306)(0.256) ≤ β ≤ 0.75 + (2.306)(0.256) ] = 0.95
Similarly for α,
Pr[ α̂ − 2.306 ŜE(α̂) ≤ α ≤ α̂ + 2.306 ŜE(α̂) ] = 0.95
Pr[ 3.6 − (2.306)(2.09) ≤ α ≤ 3.6 + (2.306)(2.09) ] = 0.95
Pr[ −1.22 ≤ α ≤ 8.42 ] = 0.95
Thus, the 95% confidence interval for β is (0.16, 1.34), and that for α is (−1.22, 8.42).
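The same intervals can be obtained programmatically; the sketch below is an added illustration (scipy is assumed to be available for the t quantile):

```python
import numpy as np
from scipy import stats

beta_hat, se_beta = 0.75, 0.256
alpha_hat, se_alpha = 3.6, 2.09
t_crit = stats.t.ppf(0.975, df=8)          # ≈ 2.306

ci_beta  = (beta_hat - t_crit*se_beta,  beta_hat + t_crit*se_beta)     # ≈ (0.16, 1.34)
ci_alpha = (alpha_hat - t_crit*se_alpha, alpha_hat + t_crit*se_alpha)  # ≈ (-1.22, 8.42)
print(np.round(ci_beta, 2), np.round(ci_alpha, 2))
```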
Hypothesis testing
The main problem in statistical hypothesis testing is to ask whether the observations or
findings one gets in his research are compatible with some stated hypotheses. What we
mean by compatibility here is whether what we have is sufficiently close to the
hypothesised value so that we do not reject the hypothesis. Suppose that there is prior
experience or expectation that the true slope coefficient of some regression function is
unity (1). But from the observed data we obtain 0.75, as we did in our production
function. The question we ask is: is this observation consistent with the stated hypothesis? If it is, we do not reject the hypothesis; otherwise we may reject it. The most
common hypothesis used in econometrics is whether the parameters of interest are
significantly different from zero (tests of significance).
In the language of statistics, the stated hypothesis is known as the null hypothesis and is
denoted by H0. This null is tested against an alternative hypothesis denoted by H1, which
may state that the true value of the slope parameter is different from unity (1). The alternative hypothesis may be simple or composite. For example, β = 1 is a simple hypothesis, while β ≠ 1 is a composite hypothesis.
Though there are different approaches in hypothesis testing, we use the most widely used
approach—what is known as the test of significance approach. This procedure uses the
sample results to verify the truth or falsity of a null hypothesis.
For our example on the production function we may put forward the following hypotheses:
H0: β = 1  versus  H1: β ≠ 1;  α = 5% (size of the test)
Now, since (β̂ − β)/ŜE(β̂) ~ t(n − 2), it follows that under H0
t_cal = (0.75 − 1)/0.256 ≈ −0.98,  so |t_cal| = 0.98
Now, from the t-table with 8 degrees of freedom we read that
t0.025(8) = 2.306,  t0.05(8) = 1.860,  t0.10(8) = 1.397
The probability Pr(|t| ≥ 0.98) is roughly 0.36 (by interpolation from the table). Since this is not a low probability, we do not reject the null hypothesis: what we obtained is not statistically different from 1. It is customary to use probability levels of 0.05 and 0.01 as cut-offs for rejecting the null hypothesis.
The most customary type of test is to test whether a regression parameter estimate is 'significant' or 'not significant'. What is meant here is that the parameter is significantly different from zero, in the statistical sense. The hypothesis tested here is
H0: β = 0  versus  H1: β > 0;  α = 5% (size of the test)
Following the procedure used earlier, we obtain
t_cal = (0.75 − 0)/0.256 ≈ 2.93
From the table, Pr[t > 2.896] = 0.01 and Pr[t > 3.355] = 0.005, so Pr[t > 2.93] is just below 0.01.
In this case, therefore, we reject the null hypothesis: our result is statistically different from zero. People customarily say that their parameters have been found to be significant.
Note that if we do the same significance test for α, at the 5% level of significance, we have
H0: α = 0  versus  H1: α > 0
t_cal = (3.6 − 0)/2.09 ≈ 1.72
Pr[t > 1.72] ≈ 0.06 for a t distribution with 8 df, which exceeds 0.05. Thus, we cannot reject the null hypothesis that α = 0 at conventional sizes of the test.
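An added sketch (not part of the notes) reproducing these test statistics together with exact p-values (scipy assumed):

```python
from scipy import stats

se_beta, se_alpha, df = 0.256, 2.09, 8

t_beta1 = (0.75 - 1) / se_beta               # ≈ -0.98,  H0: β = 1 (two-sided)
p_beta1 = 2 * stats.t.sf(abs(t_beta1), df)   # ≈ 0.36

t_beta0 = (0.75 - 0) / se_beta               # ≈ 2.93,   H0: β = 0 vs H1: β > 0
p_beta0 = stats.t.sf(t_beta0, df)            # ≈ 0.0095

t_alpha = (3.6 - 0) / se_alpha               # ≈ 1.72,   H0: α = 0 vs H1: α > 0
p_alpha = stats.t.sf(t_alpha, df)            # ≈ 0.06

print(round(p_beta1, 3), round(p_beta0, 4), round(p_alpha, 3))
```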
Prediction
Given the estimated regression equation Ŷi = α̂ + β̂Xi, we are interested in predicting values of Y given values of X. This is known as conditional prediction. Let the given value of X be Xf; then we predict the corresponding value Yf of Y by
Ŷf = α̂ + β̂Xf
ef = Yf − Ŷf   (error of prediction)
But Ŷf = α̂ + β̂Xf and Yf = α + βXf + εf. Therefore,
ef = Yf − Ŷf = εf − (α̂ − α) − (β̂ − β)Xf
Thus,
E(ef) = E(εf) − E(α̂ − α) − E(β̂ − β)Xf = 0
so that
E(Yf − Ŷf) = 0  and  E(Ŷf) = E(Yf) = α + βXf,
i.e., the predictor is unbiased. Moreover, Ŷf has the smallest variance among the linear unbiased predictors of Yf. To derive the variance of the prediction error, recall that ef = εf − (α̂ − α) − (β̂ − β)Xf; thus,
Var(ef) = Var(εf) + Var(α̂) + Xf² Var(β̂) + 2 Xf Cov(α̂, β̂)
        = σ² + σ²(1/n + X̄²/Σ xi²) + Xf² σ²/Σ xi² − 2 Xf X̄ σ²/Σ xi²
        = σ² [1 + 1/n + (Xf² + X̄² − 2 X̄ Xf)/Σ xi²]
        = σ² [1 + 1/n + (Xf − X̄)²/Σ xi²]
Thus, we observe that
1. The variance of the prediction error increases as Xf moves further away from the mean of X, X̄ (the mean of the observations on the basis of which α̂ and β̂ have been computed): as the distance between Xf and X̄ increases, the variance of the error of prediction increases.
2. The variance of the prediction error increases with the variance of the regression, σ².
3. It decreases with n (the number of observations used in the estimation of the parameters).
Interval prediction
1. Individual Prediction
In this case, our interest is to predict the individual value of Y (say Yf) corresponding to a given level of X (say Xf). Given the assumption of normality, it can be shown that the prediction error ef = Yf − Ŷf follows a normal distribution with mean
E(ef) = E(Yf − Ŷf) = (α + βXf) − (α + βXf) = 0
and variance
Var(ef) = σ² [1 + 1/n + (Xf − X̄)²/Σ xi²]
Therefore, it follows that
ef / (σ √[1 + 1/n + (Xf − X̄)²/Σ xi²]) ~ N(0, 1)   and   (n − 2) σ̂²/σ² ~ χ²(n − 2)
Substituting σ̂² for σ², it follows that
t = ef / ŜE(ef),   where ŜE(ef) = σ̂ √[1 + 1/n + (Xf − X̄)²/Σ xi²],
follows a t distribution with n − 2 degrees of freedom. This result can be used for both interval estimation and other inference purposes. Let us obtain the prediction interval for our earlier example. Recall the results obtained earlier:
σ̂² ≈ 1.83,  X̄ = 8,  and  Σ xi² = 28
For Xf = 8,
Ŷf = α̂ + β̂Xf = 3.6 + 0.75(8) = 9.6
and
ŜE(ef) = √{σ̂² [1 + 1/n + (Xf − X̄)²/Σ xi²]} = √{1.83 [1 + 1/10 + (8 − 8)²/28]} = √(1.83 × 1.1) = √2.013 ≈ 1.42
The t value for 95% confidence with 8 degrees of freedom is 2.306. Thus, the 95% prediction interval for Yf is 9.6 ± (2.306)(1.42), i.e., approximately (6.3, 12.9).
Now, suppose we wanted to predict the value of Yf for a value of Xf that is far away from X̄, say Xf = 20. Then
Ŷf = α̂ + β̂Xf = 3.6 + 0.75(20) = 18.6
and
ŜE(ef) = √{1.83 [1 + 1/10 + (20 − 8)²/28]} = √(1.83 × 6.24) ≈ √11.42 ≈ 3.38
Thus, the 95% prediction interval for Yf is 18.6 ± (2.306)(3.38), i.e., approximately (10.8, 26.4).
Notice how much wider the interval becomes as we move away from the mean.
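A short added check (not from the notes) of both prediction intervals:

```python
import numpy as np
from scipy import stats

n, xbar, Sxx, sigma2_hat = 10, 8.0, 28.0, 1.83125
alpha_hat, beta_hat = 3.6, 0.75
t_crit = stats.t.ppf(0.975, df=n - 2)       # ≈ 2.306

for Xf in (8.0, 20.0):
    y_f = alpha_hat + beta_hat * Xf
    se_f = np.sqrt(sigma2_hat * (1 + 1/n + (Xf - xbar)**2 / Sxx))
    print(Xf, round(y_f, 2), np.round((y_f - t_crit*se_f, y_f + t_crit*se_f), 2))
# Xf = 8  -> 9.6,  interval ≈ (6.33, 12.87)
# Xf = 20 -> 18.6, interval ≈ (10.80, 26.40)
```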
2. Mean Prediction
In mean prediction, our interest is not to predict an individual Yf but rather E(Yf | X = Xf). Thus, we are interested in the mean of Yf and not in Yf as such. Note that the mean predictor is unbiased:
E(Ŷf | X = Xf) = α + βXf              (since α̂ and β̂ are unbiased)
and the prediction error is
ef = E(Yf | X = Xf) − Ŷf = −(α̂ − α) − (β̂ − β)Xf
This is similar to what we did earlier. However, the variance of the prediction error is now
Var(ef) = σ² [1/n + (Xf − X̄)²/Σ xi²]          (Assignment: show this!)
Note that this is smaller than the variance of the individual-prediction error:
σ² [1 + 1/n + (Xf − X̄)²/Σ xi²]  >  σ² [1/n + (Xf − X̄)²/Σ xi²]
Solution (mean prediction at Xf = 100):
Var(ef) = Var(Ŷf) = σ̂² [1/n + (Xf − X̄)²/Σ xi²]
        = 42.159 [1/10 + (100 − 170)²/33000]
        = 10.4759
Thus, SE(Ŷf) = 3.2366. Then,
75.3645 − 2.306 × 3.2366  ≤  E(Yf | X = 100)  ≤  75.3645 + 2.306 × 3.2366
67.9009  ≤  E(Yf | X = 100)  ≤  82.8281
Thus, given Xf = 100, in repeated sampling, 95 out of 100 intervals constructed like this will contain the true mean value E(Yf | X = 100).