Chapter Two: Simple Linear Regression Analysis with Cross-Sectional Data

2.1. Concept of regression function

a. Historical Definition by Data Analysts

The person who coined the word in data analysis was Sir Francis Galton
(1822-1911). He found that although there was a tendency for tall parents to
have tall children and short parents to have short children, the average height
of children born of parents of a given height tended to move or “regress”
toward the average in the population as a whole (Galton’s law of universal
regression).

b. The Modern Definition of Regression
Regression analysis is concerned with describing and evaluating the economic
relationship between the left hand side variable (Y) and the right hand side variable(s)
(x1, x2, …, xk). The two sides go by several names:

Y                       x1, x2, …, xk
Dependent Variable      Independent Variables
Explained Variable      Explanatory Variables
Response Variable       Control Variables
Predicted Variable      Predictor Variables
Regressand              Regressors

Thus, the way regression is used in econometrics has nothing to do with the original
definition of the term regression.
Examples of economic relationship:
a. Qd = f ( p, pc , ps , taste, pop) demand function

b. Qs = f ( p, tax, tech) supply function

In general, suppose Y is the dependent variable and Xi’s are independent variables. Then
the relationship can be defined as:

Y = f ( X 1 , X 2 , X 3 ,..., X k )
Where, i = 1, 2, ..., k:
1. if k = 1, we say we have simple regression. Example Y = f ( X )

2. if k > 1, we have a multiple regression. Example Y = f ( X 1 , X 2 , X 3 )

In this chapter, we deal with simple linear regression function (with single
independent variable)

Recall that the economic model takes the form $Y_i = \alpha + \beta X_i$, which is the abstract
form of reality. Suppose the pairs (x, y) are average weekly income (x) and expenditure
(y) for a sample of households (HHs). It is unrealistic to expect each observed pair to lie on the
straight line. In addition, there are many other factors that affect the consumption of HHs.
Thus, we move from the economic model to a statistical model in which the relation is inexact,
which allows us to introduce an unknown and unobservable random variable ($\varepsilon_i$). The
statistical model takes the form:

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$

Where Y = explained/dependent variable
X = explanatory/independent variable
$\alpha$ and $\beta$ = parameters; specifically, $\alpha$ = intercept term and $\beta$ = coefficient or
slope (how do you interpret it?)
$\varepsilon_i$ = stochastic disturbance/error term
The subscript i refers to the ith observation.
Note that the values of the variables X and Y are observable, but those of the error term
are unobservable.
In econometrics, we deal with the stochastic relationship. Why? That
is, why should we add the error term $\varepsilon_i$?

There are several justifications for the inclusion of the error term in the model. These are:
1. Effect of the omitted variables from the model
Due to numbers of reasons some variables (factors), which might affect the
dependent variable, are omitted from the model:
➢ Lack of data (particularly on time series variables)
➢ Lack of knowledge on some factors that affect the dependent variable
➢ Difficulty in measuring factors that affect the dependent variable
➢ Some factors are random by their very nature (e.g. earthquake)
➢ Some factors taken individually may have very small effect

Example: $Q_d = \alpha + \beta X_i + \varepsilon_i$. The error term captures factors like the income of the consumer,
family size, tastes and prices of other commodities.

2. Measurement Error
Variables included in the model may be measured inaccurately and some
problems arise due to methods of data collection and processing. Such problems
are handled by the error term
3. Wrong Mathematical Specification of the Model
The equation may be misspecified in the sense that the particular functional
form chosen may be incorrect. For example, we may have specified a linear model when the
true relationship is non-linear, or estimated a single equation when the relationship
is in fact part of a system of equations.
4. Errors in Aggregation
There may be errors in aggregation over time, space, cross-section, etc and the
stochastic disturbance term captures these errors in aggregation.
5. The randomness of human behaviour
There are unpredictable elements in human behaviour that are taken care by the
stochastic disturbance term
In order to take all these sources of error into account, we introduce the stochastic/random
disturbance term into our econometric models and hence the complete simple
econometric model becomes: $Y_i = \alpha + \beta X_i + \varepsilon_i$

Variation in Y = Explained Variation + Unexplained Variation

Classical Assumptions of Linear Regression Model (Gauss-Markov Assumptions). Why
are these assumptions required?
1. The Linearity Assumption: The regression model is linear in the parameters. A function
is said to be linear in the parameters, say alpha and beta, if they appear with the power of
one only and are not multiplied or divided by any other parameter. Note
that the function can be non-linear in the dependent and explanatory variables. This
assumption is required for the purpose of simplicity.
2. X values are fixed in repeated sampling. The values taken by independent variables
are considered to be fixed in repeated samples. That is, independent variables are
assumed to be non-stochastic. Suppose you take four samples. In each sample, the
value of X is fixed (say at X = a) but the value of the dependent variable varies from
sample to sample. That is,

X    X1                 X2                 X3    X4
Y    (y1, y2, y3, …)    (y2, y4, y5, …)    …     …

3. The zero mean assumption: the disturbance term has zero mean, i.e.,
$$E(\varepsilon_i \mid X_i) = 0, \quad i = 1, 2, \dots, n$$
That is, on average, factors, which are not explicitly included in the model, do not
systematically affect the mean value of Y. This means that the positive values of the
error term cancel out the negative values of the error term so that their mean effect
on the dependent variable is zero. Thus, on average, the regression line is correct.
Why do we need this assumption?
4. The Assumption of homoscedasticity: Common (constant) variance, i.e.,

$$\operatorname{var}(\varepsilon_i) = E[\varepsilon_i - E(\varepsilon_i)]^2 = E(\varepsilon_i^2) = \sigma^2, \quad i = 1, 2, \dots, n.$$

Prove that $\operatorname{var}(Y_i) = E(\varepsilon_i^2) = \sigma^2$. The proof begins with the definition of the variance of Y:

$$\operatorname{var}(Y_i) = E[Y_i - E(Y_i)]^2$$

We know that
$$Y_i = \alpha + \beta X_i + \varepsilon_i \quad \text{and} \quad E(Y_i) = \alpha + \beta X_i$$
Then,
$$\operatorname{var}(Y_i) = E[\alpha + \beta X_i + \varepsilon_i - \alpha - \beta X_i]^2 = E(\varepsilon_i^2) = \sigma^2$$

This assumption implies that at all levels of X, the variation around the mean value
of Y is the same. This means that all values of Y are equally reliable, reliability measured
by the variation about the mean of Y. Thus, all values of Y corresponding to different
values of X are equally important. What is the importance of this assumption?
5. The assumption of no autocorrelation/no serial correlation: independence of error
terms (error terms are not correlated), i.e., $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$. How do
you define correlation? What is autocorrelation? It measures the degree of linear
relationship between consecutive values of a given variable (the error term in this
case). The meaning of no serial correlation is that, given the values of X, there is no
correlation between any two values of Y.

6. The assumption of zero covariance between $X_i$ and $\varepsilon_i$: independence of $X_j$ and the
disturbance terms, i.e., $\operatorname{Cov}(X_j, \varepsilon_i) = 0$ for all i, j. This assumption automatically
follows if we assume that the $X_j$ are fixed (non-random variables):

$$\operatorname{cov}(x_i, \varepsilon_i) = E\{[x_i - E(x_i)][\varepsilon_i - E(\varepsilon_i)]\}$$
$$= E\{[x_i - E(x_i)]\,\varepsilon_i\} \quad (\text{as } E(\varepsilon_i) = 0)$$
$$= E(x_i \varepsilon_i) - E(x_i)E(\varepsilon_i)$$
$$= E(x_i \varepsilon_i)$$
$$= x_i E(\varepsilon_i) \quad (\text{as } x_i \text{ is non-stochastic})$$
$$\Rightarrow \operatorname{cov}(x_i, \varepsilon_i) = 0$$

This assumption implies that the independent variable (X) and the stochastic disturbance term
have separate and additive effects on Y. If they are correlated, it is impossible to assess
their individual effects on Y. That is, it causes problems for interpreting the model and for
deriving desirable statistical properties.
7. The number of observations (n) must be greater than the number of explanatory
variables in the model.
8. Variability in the values of X is important. The values of X in the sample must not all be the same.
That is, var(X) must be a finite positive number.
9. There is no model specification problem
Any econometric analysis begins with model specification. But what variables should be
included and what is the specific mathematical form of the model? These are extremely
important questions.
10. Normality assumption, i.e., $\varepsilon_i \sim N(0, \sigma^2)$ for all i.
Assumption 10 in conjunction with assumptions 3, 4 and 5 implies that the $\varepsilon_i$ are normally
and independently distributed: $\varepsilon_i \sim NID(0, \sigma^2)$ for all i.

The assumptions of classical linear regression analysis are needed for statistical

inference (estimation and hypothesis testing)

Concepts of Population Regression function (PRF) and Sample Regression function (SRF)

By assumption X is fixed in repeated sampling but the Y values vary. That is, we talk about the
conditional distribution of $Y_i$ given values of $X_i$. Again, by assumption $E(\varepsilon_i) = 0$. Now for
each conditional distribution of Y, we can compute its mean or average value, known as
the conditional mean or conditional expectation of Y given that X takes the specific value
$X_i$. It is denoted by $E(Y_i \mid X = X_i)$. Thus, the conditional mean of Y given X is a function
of $X_i$. Symbolically, $E(Y_i \mid X = X_i) = f(X_i)$. Such a function is
known as the population regression function (PRF). It states that the mean of the distribution
of Y given X depends on $X_i$. In other words, it measures how the mean response of Y
varies with X. The specific functional form of $f(X_i)$ is assumed to be linear. That is,

$$E(Y_i \mid X = X_i) = \alpha + \beta X_i \quad (\text{as } E(\varepsilon_i) = 0), \quad \text{or} \quad Y_i = \alpha + \beta X_i + \varepsilon_i$$

However, we do not have data on the whole population of Y values; we only have sample information on
Y for given values of X. Unlike the population case, we have one value of Y for each
value of X. We can estimate the PRF from the sample information, but the estimate
is not exact because of sampling fluctuations. The regression function obtained from the
sample is known as the sample regression function (SRF).
Note that different samples (even of equal size, n1 = n2 = … = nk) give different SRFs. Which one
best “fits” the PRF? Or which one best represents the true PRF? (We will answer this later.)
Now, consider

PRF: $E(Y_i \mid X = X_i) = \alpha + \beta X_i$
SRF: $\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$

The stochastic version of the SRF is given by $Y_i = \hat{\alpha} + \hat{\beta} X_i + e_i$, where $e_i$ is the residual, an
estimate of $\varepsilon_i$. The residual is introduced into the SRF for the same reason as the error term is introduced into the PRF.

In summary, the main objective of regression analysis is the estimation of the PRF
($Y_i = \alpha + \beta X_i + \varepsilon_i$) based on the SRF ($Y_i = \hat{\alpha} + \hat{\beta} X_i + e_i$).
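The point that different samples give different SRFs can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "true" PRF parameters (3.6 and 0.75), the error standard deviation (1.0) and the function name `fit_srf` are hypothetical choices made for the illustration, and each line is fitted with the least-squares formula derived later in the chapter.

```python
import random

# Hypothetical "true" PRF, assumed only for this illustration:
# E(Y | X) = 3.6 + 0.75 X, with error standard deviation 1.0
ALPHA, BETA, SIGMA = 3.6, 0.75, 1.0
X = [10, 7, 10, 5, 8, 8, 6, 7, 9, 10]  # X values held fixed in repeated sampling


def fit_srf(x, y):
    """Return (alpha_hat, beta_hat) of the fitted line for one sample."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = s_xy / s_xx
    return y_bar - beta_hat * x_bar, beta_hat


rng = random.Random(0)  # fixed seed so the run is reproducible
srfs = []
for _ in range(5):
    # Same X values every time; only the disturbances (and hence Y) change
    y = [ALPHA + BETA * xi + rng.gauss(0.0, SIGMA) for xi in X]
    srfs.append(fit_srf(X, y))

for k, (a, b) in enumerate(srfs, start=1):
    print(f"sample {k}: Y_hat = {a:.3f} + {b:.3f} X")
```

Each of the five printed lines is a different SRF for the same PRF; none of them coincides exactly with the assumed population line.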

2.2. Simple Regression Model: the problem of Estimation

Given n observations on X and Y, the linear regression model can be written as

$$Y_i = \alpha + \beta X_i + \varepsilon_i, \quad i = 1, 2, \dots, n$$

and one of the objectives of the whole exercise in
regression is to obtain appropriate estimates for the parameters $\alpha$ and $\beta$.

There are a number of methods for estimating the parameters of this model. The
most popular ones are
a) the method of moments
b) the method of least squares, and
c) the method of maximum likelihood.
Though they give different outcomes in generalised models, in the case of the simple
regression all three give identical results.

2.2.1. Method of moments

The main assumptions introduced about the error term $\varepsilon$ imply that
$$E(\varepsilon) = 0 \quad \text{and} \quad \operatorname{cov}(X, \varepsilon) = 0$$
In the method of moments, we replace these assumptions by their sample counterparts.
Let $\hat{\alpha}$ and $\hat{\beta}$ be the estimators of $\alpha$ and $\beta$, respectively, and let the sample counterpart of
$\varepsilon$ be e (called the residual), defined as:

$$Y_i = \hat{\alpha} + \hat{\beta} X_i + e_i \quad \Rightarrow \quad e_i = Y_i - \hat{\alpha} - \hat{\beta} X_i$$

The two equations that determine $\hat{\alpha}$ and $\hat{\beta}$ are obtained by replacing the population
assumptions by their sample counterparts; i.e.,

$E(\varepsilon) = 0$ is replaced by $\frac{1}{n}\sum e_i = 0$ or $\sum e_i = 0$, and

$\operatorname{cov}(X, \varepsilon) = 0$ is replaced by $\frac{1}{n}\sum X_i e_i = 0$ or $\sum X_i e_i = 0$.

From these replacements we get the following two equations.

$$\sum e_i = 0 \quad \text{or} \quad \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i) = 0$$
$$\sum X_i e_i = 0 \quad \text{or} \quad \sum X_i (Y_i - \hat{\alpha} - \hat{\beta} X_i) = 0$$

These two equations can be written as follows.

$$\sum Y_i = n\hat{\alpha} + \hat{\beta} \sum X_i$$
$$\sum X_i Y_i = \hat{\alpha} \sum X_i + \hat{\beta} \sum X_i^2$$

These equations are known as the normal equations.

Given data on observations of Y and X, solving these two equations we obtain the
parameter values (estimates) of $\hat{\alpha}$ and $\hat{\beta}$.

Table 1. The data contains information on Y (output) and X (labour hours worked) for ten workers.

No.      X     Y    X^2     XY    Y-hat       e
1       10    11    100    110    11.10   -0.10
2        7    10     49     70     8.85    1.15
3       10    12    100    120    11.10    0.90
4        5     6     25     30     7.35   -1.35
5        8    10     64     80     9.60    0.40
6        8     7     64     56     9.60   -2.60
7        6     9     36     54     8.10    0.90
8        7    10     49     70     8.85    1.15
9        9    11     81     99    10.35    0.65
10      10    10    100    100    11.10   -1.10
Total   80    96    668    789    96.00    0.00
a. Estimate alpha and beta using the method of moments (the population regression function).
b. How do you interpret alpha and beta?

c. Suppose Y= consumption and X= current income, what is the economic
interpretation of alpha and beta?
d. Test the hypothesis that current income is insignificant in explaining consumption
e. Predict the consumption level after ten years if income will be Birr 100

To solve for $\hat{\alpha}$ and $\hat{\beta}$ we need to compute $\sum X_i$, $\sum X_i^2$, $\sum X_i Y_i$ and $\sum Y_i$.

The normal equations are then given by:

$$10\hat{\alpha} + 80\hat{\beta} = 96$$
$$80\hat{\alpha} + 668\hat{\beta} = 789$$

From the first equation we get

$$\hat{\alpha} = 9.6 - 8\hat{\beta}$$

Substituting this result into the second normal equation we get

$$789 = 80(9.6 - 8\hat{\beta}) + 668\hat{\beta} = 768 - 640\hat{\beta} + 668\hat{\beta}$$
$$\Rightarrow 21 = 28\hat{\beta} \quad \Rightarrow \quad \hat{\beta} = 0.75$$

Using this result in the first normal equation we get
$$\hat{\alpha} = 9.6 - 8(0.75) = 9.6 - 6 = 3.6$$

The residuals (estimated errors) are then given by $e_i = Y_i - 3.6 - 0.75 X_i$, which can be
calculated for each observation in our sample, and are presented in the last column of
Table 1 above.

The residuals tell us the error we would make if we tried to predict the value of Y on the basis of the
estimated regression equation and the values of X, within the range of the sample. We do
not get residuals for X outside our range simply because we do not have the corresponding
values of Y. By virtue of the first condition we imposed in the normal equations, $\sum e_i = 0$.
The sum of squares of these residuals is given by:

$$\sum e_i^2 = (-0.1)^2 + (1.15)^2 + (0.9)^2 + (-1.35)^2 + (0.4)^2 + (-2.6)^2 + (0.9)^2 + (1.15)^2 + (0.65)^2 + (-1.1)^2 = 14.65$$
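The arithmetic of this worked example can be reproduced with a few lines of Python. The sketch below uses the Table 1 data, solves the two normal equations by Cramer's rule (a choice made here for brevity; the text solves them by substitution) and then recomputes the residuals and their sum of squares. The variable names are illustrative only.

```python
# Table 1 data: X = labour hours worked, Y = output, for ten workers
X = [10, 7, 10, 5, 8, 8, 6, 7, 9, 10]
Y = [11, 10, 12, 6, 10, 7, 9, 10, 11, 10]

n = len(X)
sum_x, sum_y = sum(X), sum(Y)
sum_xx = sum(x * x for x in X)
sum_xy = sum(x * y for x, y in zip(X, Y))

# Normal equations:
#   n * a     + sum_x * b  = sum_y
#   sum_x * a + sum_xx * b = sum_xy
# solved here by Cramer's rule for the 2x2 system.
det = n * sum_xx - sum_x ** 2
alpha_hat = (sum_y * sum_xx - sum_x * sum_xy) / det
beta_hat = (n * sum_xy - sum_x * sum_y) / det

# Residuals e_i = Y_i - alpha_hat - beta_hat * X_i and their sum of squares
residuals = [y - alpha_hat - beta_hat * x for x, y in zip(X, Y)]
rss = sum(e * e for e in residuals)

print("sums:", sum_x, sum_y, sum_xx, sum_xy)
print("alpha_hat = %.4f, beta_hat = %.4f" % (alpha_hat, beta_hat))
print("sum of residuals = %.6f, sum of squares = %.4f" % (sum(residuals), rss))
```

Up to floating-point rounding, the script reproduces the estimates 3.6 and 0.75, a residual sum of zero and a sum of squared residuals of 14.65.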

The method of least squares, as we shall see soon, tries to obtain estimators that minimise
this sum.

2.2.2. Method of Ordinary Least Squares (OLS)

The method of Ordinary Least Squares (OLS) was developed by Carl Friedrich Gauss. Under certain
assumptions of classical linear regression model, OLS estimators have some desirable
statistical properties.
Starting point:

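PRF: $Y_i = \alpha + \beta X_i + \varepsilon_i$

SRF: $Y_i = \hat{\alpha} + \hat{\beta} X_i + e_i$

$$\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$$
$$\Rightarrow Y_i = \hat{Y}_i + e_i$$
$$\Rightarrow e_i = Y_i - \hat{Y}_i$$
Thus, the residual is the difference between the actual and the estimated/predicted values
of Y.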
Now, given data on both X and Y, our objective is to find the SRF that best “fits” the PRF
or the values of estimators, which are as close as possible to population parameters.
However, what is the possible criterion to choose the SRF that best “fits” the actual PRF?
Criterion I: Minimize the sum of residuals
Choose the SRF in such a way that the sum of the residuals (i.e., $\sum_{i=1}^{n} e_i$) is as small as
possible. If the sum of the residuals is zero (i.e., $\sum_{i=1}^{n} e_i = 0$), we have the method of
moments. However, this criterion is not good, as it gives equal weight to all kinds of
residuals (large, medium and small), and large positive and negative residuals can cancel out.
Criterion II: Least Squares Criterion
According to this criterion, find the SRF which minimizes the sum of the squared residuals
(i.e., $\min \sum_{i=1}^{n} e_i^2$). That is, we choose $\hat{\alpha}$ and $\hat{\beta}$ such that the sum of squares of
the residuals is minimized. This criterion is very important for two basic reasons:
a. it gives more weight to larger residuals and less weight to smaller residuals

b. the estimators obtained through the OLS method of estimation have some
desirable statistical properties
Since different samples from a given population result in different estimates, which
values of the estimators minimize the sum of squared residuals? (Recall how to find extreme
values of functions: the necessary and sufficient conditions.)

Now, given the population regression function as:

$Y_i = \alpha + \beta X_i + \varepsilon_i$, we want to substitute $\hat{\alpha}$ and $\hat{\beta}$ for $\alpha$ and $\beta$, and minimise

$$S = \sum_{i=1}^{n} (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2 = \sum_{i=1}^{n} e_i^2$$
The intuitive idea behind the procedure of least squares is given by looking at the
following figure.

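[Figure: scatter of observed points $Y_i$ around the regression line; each residual $e_i$ is the vertical distance between $Y_i$ and the fitted value $\hat{Y}_i$ on the line.]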

The regression line passes through points in such a way that it is ‘as close as possible’ to
the points of the data. Closeness could mean different things. The minimisation procedure
of the OLS implies that we minimise the sum of squares of the vertical distances of the
points from the line.
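A quick numerical check of this idea, using the Table 1 data and the estimates 3.6 and 0.75 obtained earlier by the method of moments: the sketch below evaluates the sum of squared vertical distances S at those estimates and on a small grid of nearby intercept/slope pairs. The grid steps (0.5 and 0.1) and the function name `sum_sq` are arbitrary illustrative choices, and a grid comparison is only a sanity check, not a proof of minimality.

```python
# Table 1 data: X = labour hours worked, Y = output
X = [10, 7, 10, 5, 8, 8, 6, 7, 9, 10]
Y = [11, 10, 12, 6, 10, 7, 9, 10, 11, 10]


def sum_sq(a, b):
    """Sum of squared vertical distances of the points from the line a + b*X."""
    return sum((y - a - b * x) ** 2 for x, y in zip(X, Y))


best = sum_sq(3.6, 0.75)

# Perturb the intercept and slope around the estimates and recompute S
nearby = {
    (da, db): sum_sq(3.6 + da, 0.75 + db)
    for da in (-0.5, 0.0, 0.5)
    for db in (-0.1, 0.0, 0.1)
}

print("S at (3.6, 0.75): %.4f" % best)
for (da, db), s in sorted(nearby.items()):
    print("da=%+.1f db=%+.1f -> S=%.4f" % (da, db, s))
```

Every perturbed line gives a value of S at least as large as the value at (3.6, 0.75), which is what the least squares criterion requires.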

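The Necessary Condition
$$\frac{\partial S}{\partial \hat{\alpha}} = \frac{\partial S}{\partial \hat{\beta}} = 0$$
These equations are known as the first order conditions for minimisation. This procedure yields:

$$\frac{\partial S}{\partial \hat{\alpha}} = 0 \Rightarrow \sum_{i=1}^{n} 2(Y_i - \hat{\alpha} - \hat{\beta} X_i)(-1) = 0 \Rightarrow \sum_{i=1}^{n} e_i = 0$$
$$\sum_{i=1}^{n} Y_i = n\hat{\alpha} + \hat{\beta} \sum_{i=1}^{n} X_i \qquad \text{EQ 1}$$
$$\Rightarrow \bar{Y} = \hat{\alpha} + \hat{\beta}\bar{X} \Rightarrow \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$$

$$\frac{\partial S}{\partial \hat{\beta}} = 0 \Rightarrow \sum_{i=1}^{n} 2(Y_i - \hat{\alpha} - \hat{\beta} X_i)(-X_i) = 0 \Rightarrow \sum_{i=1}^{n} X_i e_i = 0$$
$$\sum_{i=1}^{n} X_i Y_i = \hat{\alpha} \sum_{i=1}^{n} X_i + \hat{\beta} \sum_{i=1}^{n} X_i^2 \qquad \text{EQ 2}$$

Equations 1 and 2 are known as the normal equations. To solve for $\hat{\beta}$, we substitute for $\hat{\alpha}$
in equation 2 from equation 1 and we get

$$\sum_{i=1}^{n} Y_i X_i = (\bar{Y} - \hat{\beta}\bar{X}) \sum_{i=1}^{n} X_i + \hat{\beta} \sum_{i=1}^{n} X_i^2$$
$$\sum_{i=1}^{n} Y_i X_i = \bar{Y} \sum_{i=1}^{n} X_i - \hat{\beta}\bar{X} \sum_{i=1}^{n} X_i + \hat{\beta} \sum_{i=1}^{n} X_i^2$$
$$\sum_{i=1}^{n} Y_i X_i - \bar{Y} \sum_{i=1}^{n} X_i = \hat{\beta} \left( \sum_{i=1}^{n} X_i^2 - \bar{X} \sum_{i=1}^{n} X_i \right)$$
$$\sum_{i=1}^{n} Y_i X_i - \frac{1}{n} \sum_{i=1}^{n} Y_i \sum_{i=1}^{n} X_i = \hat{\beta} \left( \sum_{i=1}^{n} X_i^2 - \frac{1}{n} \sum_{i=1}^{n} X_i \sum_{i=1}^{n} X_i \right)$$
$$\frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{n} = \hat{\beta} \left[ \frac{n \sum X_i^2 - \left( \sum X_i \right)^2}{n} \right]$$
$$\hat{\beta} = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n \sum_{i=1}^{n} X_i^2 - \left( \sum_{i=1}^{n} X_i \right)^2}$$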
The numerator of the last equation can be simplified as follows:

$$n \sum_{i=1}^{n} Y_i X_i - \sum_{i=1}^{n} Y_i \sum_{i=1}^{n} X_i = n \sum_{i=1}^{n} Y_i X_i - (n\bar{Y})(n\bar{X})$$
$$= n \left[ \sum_{i=1}^{n} Y_i X_i - n\bar{Y}\bar{X} \right]$$
$$= n \left[ \sum_{i=1}^{n} (Y_i X_i - \bar{Y}\bar{X}) \right]$$
$$= n \left[ \sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X}) \right]$$

Similarly, the denominator of our last equation can be rewritten as:

$$n \sum_{i=1}^{n} X_i^2 - \left( \sum_{i=1}^{n} X_i \right)^2 = n \sum_{i=1}^{n} X_i^2 - (n\bar{X})^2$$
$$= n \left[ \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right]$$
$$= n \sum_{i=1}^{n} (X_i - \bar{X})^2$$
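Therefore, we have:

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$

where $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$ are the variables in deviation form.

Note also that this equation can be written as

$$\hat{\beta} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) / (n - 1)}{\sum_{i=1}^{n} (X_i - \bar{X})^2 / (n - 1)} = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}$$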
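If we let

$$S_{yy} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2$$
$$S_{xx} = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2, \text{ and}$$
$$S_{yx} = \sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X}) = \sum_{i=1}^{n} y_i x_i = \sum_{i=1}^{n} Y_i X_i - n\bar{X}\bar{Y}$$

we can then write the estimators of $\alpha$ and $\beta$ as

$$\hat{\beta} = \frac{S_{yx}}{S_{xx}}$$
$$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X} = \bar{Y} - \frac{S_{yx}}{S_{xx}} \bar{X}$$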
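The closed-form estimators translate directly into code. The following is a minimal sketch: the function name `ols_deviation_form` is hypothetical, and it simply evaluates S_xx, S_yx and the two estimators from raw sums, as in the formulas above, using the Table 1 data as input.

```python
def ols_deviation_form(x, y):
    """Return (alpha_hat, beta_hat, s_xx, s_yx) for a simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # S_xx = sum(X_i^2) - n * X_bar^2 ; S_yx = sum(Y_i X_i) - n * X_bar * Y_bar
    s_xx = sum(xi * xi for xi in x) - n * x_bar ** 2
    s_yx = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    if s_xx == 0:
        raise ValueError("all X values are equal, so the slope is not defined")
    beta_hat = s_yx / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat, s_xx, s_yx


# Table 1 data: X = labour hours worked, Y = output
X = [10, 7, 10, 5, 8, 8, 6, 7, 9, 10]
Y = [11, 10, 12, 6, 10, 7, 9, 10, 11, 10]

alpha_hat, beta_hat, s_xx, s_yx = ols_deviation_form(X, Y)
print("S_xx = %.2f, S_yx = %.2f" % (s_xx, s_yx))
print("alpha_hat = %.4f, beta_hat = %.4f" % (alpha_hat, beta_hat))
```

The guard against S_xx = 0 reflects the assumption that the values of X must not all be equal; without variation in X the slope cannot be computed.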
➢ Properties of OLS estimators and Gauss-Markov theorem

The classical assumptions (Gauss-Markov Assumptions) we put forward earlier can be
divided into two parts: those that are made on $X_i$ and those made on $\varepsilon_i$.

a) Assumptions on $X_i$
a1) The values of X: X1, X2, …, Xn, are fixed in advance (i.e., they are non-stochastic)
a2) Not all X are equal (variability in the $X_i$ is important).
b) Assumptions on the error term ($\varepsilon_i$):
b1) $E(\varepsilon_i) = 0$ for all i
b2) $\operatorname{Var}(\varepsilon_i) = \sigma^2$ for all i (homoscedasticity or common variance)
b3) $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \neq j$ (assumption of no autocorrelation)
Note that assumptions b1, b2 and b3 together are called the assumptions of white noise on $\varepsilon_i$, which is
denoted as $\varepsilon_i \sim WN[0, \sigma^2]$.
An estimator is said to be the best linear unbiased estimator (BLUE) if it is:
1. a linear function of the random variable,
2. unbiased, and
3. of minimum variance in the class of all linear unbiased estimators.

Gauss-Markov theorem: Given the assumptions of the classical linear regression model,
the least-squares estimators have minimum variance in the class of linear unbiased
estimators, i.e., they are BLUE.
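Proof: we shall provide a proof for $\hat{\beta}$; try to prove this for $\hat{\alpha}$.

1. $\hat{\beta}$ is a linear estimator of $\beta$
We know that

$$\hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2} = \frac{\sum x_i (Y_i - \bar{Y})}{\sum x_i^2} = \frac{\sum x_i Y_i - \bar{Y} \sum x_i}{\sum x_i^2}$$
$$= \frac{\sum x_i Y_i}{\sum x_i^2} = \sum \left( \frac{x_i}{\sum x_i^2} \right) Y_i \quad \text{as } \sum x_i = 0 \qquad \text{EQ 2}$$

If we let $k_i = \dfrac{x_i}{\sum x_i^2}$, EQ 2 can be written as:
$$\hat{\beta} = \sum k_i Y_i \qquad \text{EQ 3}$$

This shows that $\hat{\beta}$ is linear in $Y_i$; it is in fact a weighted average of the $Y_i$, with the $k_i$ serving as the weights.
Note the following properties of $k_i$.
a) Since the X variable is assumed to be non-stochastic, the $k_i$ are also fixed in advance
b) $\sum k_i = 0$
c) $\sum k_i^2 = \dfrac{1}{\sum x_i^2}$
d) $\sum k_i x_i = \sum k_i X_i = 1$

Assignment: show that properties b, c and d are true.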
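Before attempting the algebraic proof, properties b, c and d can be checked numerically. The sketch below computes the weights k_i for the X values of Table 1 and evaluates each sum; it is a sanity check on one data set, not a substitute for the proof the assignment asks for.

```python
# Table 1 data: X = labour hours worked, Y = output
X = [10, 7, 10, 5, 8, 8, 6, 7, 9, 10]
Y = [11, 10, 12, 6, 10, 7, 9, 10, 11, 10]

n = len(X)
x_bar = sum(X) / n
x_dev = [x - x_bar for x in X]            # x_i = X_i - X_bar
s_xx = sum(d * d for d in x_dev)          # sum of x_i^2
k = [d / s_xx for d in x_dev]             # k_i = x_i / sum of x_i^2

sum_k = sum(k)                                        # property b: 0
sum_k_sq = sum(ki * ki for ki in k)                   # property c: 1 / s_xx
sum_k_x = sum(ki * d for ki, d in zip(k, x_dev))      # property d: 1
sum_k_X = sum(ki * xi for ki, xi in zip(k, X))        # property d: 1

# The slope estimator as a weighted sum of the Y_i (EQ 3)
beta_hat = sum(ki * yi for ki, yi in zip(k, Y))

print("sum k_i       = %.10f" % sum_k)
print("sum k_i^2     = %.10f (1/S_xx = %.10f)" % (sum_k_sq, 1 / s_xx))
print("sum k_i x_i   = %.10f, sum k_i X_i = %.10f" % (sum_k_x, sum_k_X))
print("beta_hat      = %.4f" % beta_hat)
```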

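2. $\hat{\beta}$ is an unbiased estimator of $\beta$
From EQ 3 we know that:

$$\hat{\beta} = \sum k_i Y_i$$
$$= \sum k_i (\alpha + \beta X_i + \varepsilon_i) \quad \text{as } Y_i = \alpha + \beta X_i + \varepsilon_i$$
$$= \alpha \sum k_i + \beta \sum k_i X_i + \sum k_i \varepsilon_i$$
$$= \beta + \sum k_i \varepsilon_i \quad \text{as } \sum k_i = 0 \text{ and } \sum k_i X_i = 1$$
Taking the expectation of this result, we get

$$E(\hat{\beta}) = E\left( \beta + \sum k_i \varepsilon_i \right) = E(\beta) + E\left( \sum k_i \varepsilon_i \right)$$
$$= \beta + \sum k_i E(\varepsilon_i) = \beta \quad \text{as } E(\varepsilon_i) = 0 \text{ for all } i$$
(Assumption of zero mean of the disturbance term)
Hence, $\hat{\beta}$ is an unbiased estimator of $\beta$.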

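3. Among the set of all linear unbiased estimators of $\beta$, the least squares estimator $\hat{\beta}$ has
minimum variance. To show this we need to derive the variances and covariance of
the least squares estimators. We do this for the estimator of the slope parameter, $\hat{\beta}$.

By definition of variance, we have the variance of $\hat{\beta}$ as

$$\operatorname{Var}(\hat{\beta}) = E[\hat{\beta} - E(\hat{\beta})]^2 = E(\hat{\beta} - \beta)^2 \quad (\text{as } E(\hat{\beta}) = \beta)$$
$$= E\left( \sum k_i \varepsilon_i \right)^2 \quad (\text{as } \hat{\beta} - \beta = \sum k_i \varepsilon_i)$$
$$= E\left( k_1^2 \varepsilon_1^2 + k_2^2 \varepsilon_2^2 + \dots + k_n^2 \varepsilon_n^2 + 2 k_1 k_2 \varepsilon_1 \varepsilon_2 + \dots + 2 k_{n-1} k_n \varepsilon_{n-1} \varepsilon_n \right)$$
$$= E\left( \sum k_i^2 \varepsilon_i^2 + \sum_{i \neq j} k_i k_j \varepsilon_i \varepsilon_j \right)$$
$$= \sum k_i^2 E(\varepsilon_i^2) + \sum_{i \neq j} k_i k_j E(\varepsilon_i \varepsilon_j)$$
$$= \sigma^2 \sum k_i^2 \quad (\text{as } E(\varepsilon_i \varepsilon_j) = 0)$$
$$= \frac{\sigma^2}{\sum x_i^2} \quad \left( \text{as } \sum k_i^2 = \frac{1}{\sum x_i^2} \right) \qquad \text{EQ 4}$$

1. Show that $E[\hat{\beta} - E(\hat{\beta})] = 0$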
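Unbiasedness and the variance formula in EQ 4 can also be checked by simulation. This is a rough Monte Carlo sketch under assumptions that are not in the text: the "true" parameters (3.6, 0.75), the error standard deviation (1.0), normally distributed errors and the number of replications (5000) are all hypothetical choices made for the illustration.

```python
import random

# Hypothetical population values, assumed only for this simulation
ALPHA, BETA, SIGMA = 3.6, 0.75, 1.0
X = [10, 7, 10, 5, 8, 8, 6, 7, 9, 10]   # fixed in repeated sampling

n = len(X)
x_bar = sum(X) / n
s_xx = sum((x - x_bar) ** 2 for x in X)
k = [(x - x_bar) / s_xx for x in X]      # OLS weights k_i

rng = random.Random(42)                  # fixed seed for reproducibility
draws = []
for _ in range(5000):
    # New disturbances each replication, same X values
    y = [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for x in X]
    draws.append(sum(ki * yi for ki, yi in zip(k, y)))  # beta_hat = sum k_i Y_i

mean_beta_hat = sum(draws) / len(draws)
var_beta_hat = sum((b - mean_beta_hat) ** 2 for b in draws) / (len(draws) - 1)
theoretical_var = SIGMA ** 2 / s_xx      # EQ 4: sigma^2 / sum of x_i^2

print("mean of beta_hat over replications: %.4f (true beta = %.2f)" % (mean_beta_hat, BETA))
print("simulated variance: %.5f, sigma^2 / S_xx: %.5f" % (var_beta_hat, theoretical_var))
```

With this many replications the simulated mean and variance should sit close to the assumed slope and to σ²/Σx_i², although they will not match exactly because of simulation noise.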

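To show that $\hat{\beta}$ has the minimum variance among the set of linear unbiased estimators of
$\beta$, we proceed as follows:

$\hat{\beta} = \sum k_i Y_i$ is a linear function of Y, where
$$k_i = \frac{X_i - \bar{X}}{\sum (X_i - \bar{X})^2} = \frac{x_i}{\sum x_i^2}$$

Now define an alternative estimator $\beta^*$ of $\beta$, which is unbiased and a linear function
of Y. That is, $\beta^* = \sum w_i Y_i$.

For this estimator to be unbiased its expected value must be equal to $\beta$, i.e.,

$$E(\beta^*) = \sum w_i E(Y_i) = \sum w_i (\alpha + \beta X_i) = \alpha \sum w_i + \beta \sum w_i X_i$$

Therefore, for $\beta^*$ to be unbiased, the following two conditions must hold:

$$\sum w_i = 0 \quad \text{and} \quad \sum w_i X_i = 1$$

The variance of $\beta^*$ is

$$\operatorname{Var}(\beta^*) = \operatorname{Var}\left( \sum w_i Y_i \right) = \sum w_i^2 \operatorname{Var}(Y_i) = \sigma^2 \sum w_i^2$$

But $\operatorname{Var}(\hat{\beta}) = \sigma^2 \sum k_i^2$.

Now, compare the variances of $\hat{\beta}$ and $\beta^*$. Let

$$d_i = w_i - k_i, \quad \text{so that} \quad \sum d_i = \sum w_i - \sum k_i = 0 \quad \text{and} \quad w_i = d_i + k_i$$
$$\sum w_i^2 = \sum (d_i + k_i)^2 = \sum d_i^2 + \sum k_i^2 + 2 \sum d_i k_i$$

The cross-product term vanishes, because

$$\sum d_i k_i = \frac{\sum d_i x_i}{\sum x_i^2}$$
$$\sum d_i x_i = \sum d_i (X_i - \bar{X}) = \sum d_i X_i - \bar{X} \sum d_i = \sum d_i X_i$$
$$= \sum (w_i - k_i) X_i = \sum w_i X_i - \sum k_i X_i = 1 - 1 = 0$$

Therefore, $\sum w_i^2 = \sum d_i^2 + \sum k_i^2$, implying that

$$\sigma^2 \sum w_i^2 = \sigma^2 \sum k_i^2 + \sigma^2 \sum d_i^2$$
$$\Rightarrow \operatorname{Var}(\beta^*) = \operatorname{Var}(\hat{\beta}) + \sigma^2 \sum d_i^2$$
$$\Rightarrow \operatorname{Var}(\hat{\beta}) \leq \operatorname{Var}(\beta^*) \quad \text{as } \sigma^2 \sum d_i^2 \geq 0$$

This establishes that the OLS estimator, $\hat{\beta}$, is the Best Linear Unbiased Estimator
(BLUE) of $\beta$.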
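To make the comparison concrete, the sketch below builds one particular alternative linear unbiased estimator and compares its weights with the OLS weights. The alternative chosen here is hypothetical and is not taken from the text: it is the slope of the line through the observation with the smallest X and the observation with the largest X in Table 1, writing the difference of the two weight vectors as d_i = w_i − k_i. Since the variances are σ²Σw_i² and σ²Σk_i², comparing the sums of squared weights is enough.

```python
# X values from Table 1 (labour hours worked), fixed in repeated sampling
X = [10, 7, 10, 5, 8, 8, 6, 7, 9, 10]

n = len(X)
x_bar = sum(X) / n
s_xx = sum((x - x_bar) ** 2 for x in X)
k = [(x - x_bar) / s_xx for x in X]          # OLS weights

# Hypothetical alternative: slope through the points with smallest and largest X,
# i.e. beta_star = (Y_hi - Y_lo) / (X_hi - X_lo), which is linear in Y.
i_lo, i_hi = X.index(min(X)), X.index(max(X))
w = [0.0] * n
w[i_hi] = 1.0 / (X[i_hi] - X[i_lo])
w[i_lo] = -w[i_hi]

# Unbiasedness conditions: sum w_i = 0 and sum w_i X_i = 1
sum_w = sum(w)
sum_w_X = sum(wi * xi for wi, xi in zip(w, X))

d = [wi - ki for wi, ki in zip(w, k)]        # difference between the two weight vectors
sum_w_sq = sum(wi * wi for wi in w)
sum_k_sq = sum(ki * ki for ki in k)
sum_d_sq = sum(di * di for di in d)

print("sum w = %.4f, sum wX = %.4f" % (sum_w, sum_w_X))
print("sum w^2 = %.5f, sum k^2 = %.5f, sum d^2 = %.5f" % (sum_w_sq, sum_k_sq, sum_d_sq))
```

For these data the two-point estimator satisfies both unbiasedness conditions, yet its sum of squared weights exceeds that of OLS by exactly Σd_i², as the decomposition above predicts.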

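1. Show that $\hat{\alpha} = \sum c_i Y_i$, where $c_i = \dfrac{1}{n} - \bar{X} k_i$.

$$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$$
$$= \frac{\sum Y_i}{n} - \bar{X} \sum Y_i k_i \quad (\text{as } \hat{\beta} = \sum Y_i k_i)$$
$$= \sum \left( \frac{1}{n} - \bar{X} k_i \right) Y_i$$
$$\Rightarrow \hat{\alpha} = \sum c_i Y_i$$

2. Show that $\operatorname{Var}(\hat{\alpha}) = \sigma^2 \left( \dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum x_i^2} \right)$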

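Covariance between $\hat{\alpha}$ and $\hat{\beta}$

By definition, the covariance between $\hat{\alpha}$ and $\hat{\beta}$ is given as

$$\operatorname{Cov}(\hat{\alpha}, \hat{\beta}) = E\{[\hat{\alpha} - E(\hat{\alpha})][\hat{\beta} - E(\hat{\beta})]\} \qquad \text{EQ 5}$$

We have $\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X}$, $\bar{Y} = \alpha + \beta\bar{X} + \bar{\varepsilon}$ and $E(\hat{\alpha}) = \alpha$, so that

$$\hat{\alpha} - E(\hat{\alpha}) = \alpha + \beta\bar{X} + \bar{\varepsilon} - \hat{\beta}\bar{X} - \alpha$$
$$= \bar{\varepsilon} - \bar{X}(\hat{\beta} - \beta) = \bar{\varepsilon} - \bar{X}[\hat{\beta} - E(\hat{\beta})] \quad \text{as } E(\hat{\beta}) = \beta$$

Multiplying both sides of the last equation by $[\hat{\beta} - E(\hat{\beta})]$ gives

$$[\hat{\alpha} - E(\hat{\alpha})][\hat{\beta} - E(\hat{\beta})] = \bar{\varepsilon}[\hat{\beta} - E(\hat{\beta})] - \bar{X}[\hat{\beta} - E(\hat{\beta})]^2$$

Taking the expectation of both sides of the above equation yields:

$$E\{[\hat{\alpha} - E(\hat{\alpha})][\hat{\beta} - E(\hat{\beta})]\} = E\{\bar{\varepsilon}[\hat{\beta} - E(\hat{\beta})]\} - \bar{X} E[\hat{\beta} - E(\hat{\beta})]^2$$
$$= -\bar{X} \operatorname{Var}(\hat{\beta}) \quad \text{as } E\{\bar{\varepsilon}[\hat{\beta} - E(\hat{\beta})]\} = 0$$
$$\Rightarrow \operatorname{Cov}(\hat{\alpha}, \hat{\beta}) = -\bar{X} \frac{\sigma^2}{\sum x_i^2}$$

The first term vanishes because $\hat{\beta} - E(\hat{\beta}) = \sum k_i \varepsilon_i$, so that $E(\bar{\varepsilon} \sum k_i \varepsilon_i) = \frac{\sigma^2}{n} \sum k_i = 0$.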

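Deriving an unbiased least squares estimator of $\sigma^2$

Given the population regression equation

$$Y_i = \alpha + \beta X_i + \varepsilon_i, \quad i = 1, 2, \dots, n \qquad \text{EQ 1}$$

we have

$$\bar{Y} = \alpha + \beta\bar{X} + \bar{\varepsilon} \qquad \text{EQ 2}$$

Subtracting EQ 2 from EQ 1, we get

$$Y_i - \bar{Y} = \beta(X_i - \bar{X}) + (\varepsilon_i - \bar{\varepsilon})$$
$$y_i = \beta x_i + (\varepsilon_i - \bar{\varepsilon}) \qquad \text{EQ 3}$$

Similarly, from the sample regression,

$$Y_i = \hat{\alpha} + \hat{\beta} X_i + e_i \quad \text{and} \quad \bar{Y} = \hat{\alpha} + \hat{\beta}\bar{X}$$
$$\Rightarrow Y_i - \bar{Y} = \hat{\beta}(X_i - \bar{X}) + e_i$$
$$y_i = \hat{\beta} x_i + e_i \qquad \text{EQ 4}$$

Subtract EQ 4 from EQ 3 to get

$$e_i = (\varepsilon_i - \bar{\varepsilon}) - x_i(\hat{\beta} - \beta) \qquad \text{EQ 5}$$

Squaring both sides of EQ 5, one gets

$$e_i^2 = [(\varepsilon_i - \bar{\varepsilon}) - x_i(\hat{\beta} - \beta)]^2$$
$$= (\varepsilon_i - \bar{\varepsilon})^2 + x_i^2(\hat{\beta} - \beta)^2 - 2 x_i(\hat{\beta} - \beta)(\varepsilon_i - \bar{\varepsilon})$$

Summing this over the sample and taking expectations we get

$$E\left( \sum e_i^2 \right) = E\left[ \sum (\varepsilon_i - \bar{\varepsilon})^2 \right] + \sum x_i^2 E(\hat{\beta} - \beta)^2 - 2E\left[ (\hat{\beta} - \beta) \sum x_i(\varepsilon_i - \bar{\varepsilon}) \right] \qquad \text{EQ 6}$$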

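Equation 6 has three components on its right hand side, which can be reduced as follows:
i) The first element in the equation can be written as follows

$$E\left[ \sum (\varepsilon_i - \bar{\varepsilon})^2 \right] = E\left[ \sum (\varepsilon_i^2 + \bar{\varepsilon}^2 - 2\varepsilon_i\bar{\varepsilon}) \right]$$
$$= E\left[ \sum \varepsilon_i^2 + n\bar{\varepsilon}^2 - 2\bar{\varepsilon} \sum \varepsilon_i \right]$$
$$= E\left[ \sum \varepsilon_i^2 + n\bar{\varepsilon}^2 - 2n\bar{\varepsilon}^2 \right]$$
$$= E\left[ \sum \varepsilon_i^2 - n\bar{\varepsilon}^2 \right]$$
$$= \sum E(\varepsilon_i^2) - nE(\bar{\varepsilon}^2)$$

But we know that

$$E(\varepsilon_i^2) = \operatorname{var}(\varepsilon_i) + [E(\varepsilon_i)]^2 = \sigma^2 \quad \text{and} \quad E(\bar{\varepsilon}^2) = \operatorname{var}(\bar{\varepsilon}) + [E(\bar{\varepsilon})]^2 = \frac{\sigma^2}{n}$$

since $E(\varepsilon_i) = E(\bar{\varepsilon}) = 0$. Therefore,

$$E\left[ \sum (\varepsilon_i - \bar{\varepsilon})^2 \right] = \sum E(\varepsilon_i^2) - nE(\bar{\varepsilon}^2)$$
$$= n\sigma^2 - n\left( \frac{\sigma^2}{n} \right)$$
$$= n\sigma^2 - \sigma^2$$
$$= (n - 1)\sigma^2$$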
ii) The second part is also simplified as follows:

$$\sum x_i^2 E(\hat{\beta} - \beta)^2 = \sum x_i^2 \operatorname{var}(\hat{\beta}) = \sum x_i^2 \cdot \frac{\sigma^2}{\sum x_i^2} = \sigma^2$$
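iii) To simplify the third element in equation 6 we use the following results:

$$\hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2}, \quad \text{but } y_i = \beta x_i + \varepsilon_i^*, \text{ where } \varepsilon_i^* = \varepsilon_i - \bar{\varepsilon}$$
$$\hat{\beta} = \frac{\sum x_i(\beta x_i + \varepsilon_i^*)}{\sum x_i^2} = \frac{\beta \sum x_i^2 + \sum x_i \varepsilon_i^*}{\sum x_i^2} = \beta + \frac{\sum x_i \varepsilon_i^*}{\sum x_i^2}$$
$$\Rightarrow \hat{\beta} - \beta = \frac{\sum x_i \varepsilon_i^*}{\sum x_i^2} = \frac{\sum x_i \varepsilon_i}{\sum x_i^2} \quad (\text{as } \sum x_i = 0)$$

Then

$$2E\left[ (\hat{\beta} - \beta) \sum x_i(\varepsilon_i - \bar{\varepsilon}) \right] = 2E\left[ \frac{\sum x_i \varepsilon_i}{\sum x_i^2} \left( \sum x_i \varepsilon_i - \bar{\varepsilon} \sum x_i \right) \right]$$
$$= 2E\left[ \frac{\left( \sum x_i \varepsilon_i \right)^2}{\sum x_i^2} \right] \quad (\text{as } \sum x_i = 0)$$
$$= \frac{2}{\sum x_i^2} E\left( \sum x_i \varepsilon_i \right)^2$$

Now,

$$\left( \sum x_i \varepsilon_i \right)^2 = (x_1\varepsilon_1 + x_2\varepsilon_2 + \dots + x_n\varepsilon_n)^2$$
$$= x_1^2\varepsilon_1^2 + x_2^2\varepsilon_2^2 + \dots + x_n^2\varepsilon_n^2 + 2x_1x_2\varepsilon_1\varepsilon_2 + \dots + 2x_{n-1}x_n\varepsilon_{n-1}\varepsilon_n$$
$$E\left( \sum x_i \varepsilon_i \right)^2 = x_1^2 E(\varepsilon_1^2) + x_2^2 E(\varepsilon_2^2) + \dots + x_n^2 E(\varepsilon_n^2) + 2x_1x_2 E(\varepsilon_1\varepsilon_2) + \dots + 2x_{n-1}x_n E(\varepsilon_{n-1}\varepsilon_n)$$
$$= \sigma^2 \sum x_i^2$$

It easily follows then that

$$2E\left[ (\hat{\beta} - \beta) \sum x_i(\varepsilon_i - \bar{\varepsilon}) \right] = \frac{2}{\sum x_i^2} E\left( \sum x_i \varepsilon_i \right)^2 = \frac{2\sigma^2 \sum x_i^2}{\sum x_i^2} = 2\sigma^2$$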

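Collecting the results obtained in i), ii) and iii) above, we get

$$E\left( \sum e_i^2 \right) = E\left[ \sum (\varepsilon_i - \bar{\varepsilon})^2 \right] + \sum x_i^2 E(\hat{\beta} - \beta)^2 - 2E\left[ (\hat{\beta} - \beta) \sum x_i(\varepsilon_i - \bar{\varepsilon}) \right]$$
$$= (n - 1)\sigma^2 + \sigma^2 - 2\sigma^2$$
$$= \sigma^2(n - 1 + 1 - 2)$$
$$= \sigma^2(n - 2)$$
It easily follows that if we set
$$\hat{\sigma}^2 = \frac{\sum e_i^2}{n - 2} = \frac{RSS}{n - 2}$$
we have an unbiased estimator of $\sigma^2$.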
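As a small worked illustration, the unbiased estimator can be evaluated for the Table 1 data, where the sum of squared residuals is 14.65 and n = 10. The sketch below also plugs the result into the variance formula σ²/Σx_i² derived earlier to obtain an estimated variance of the slope; the variable names are illustrative only.

```python
# Table 1 data: X = labour hours worked, Y = output
X = [10, 7, 10, 5, 8, 8, 6, 7, 9, 10]
Y = [11, 10, 12, 6, 10, 7, 9, 10, 11, 10]

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
s_xx = sum((x - x_bar) ** 2 for x in X)
s_yx = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))

beta_hat = s_yx / s_xx
alpha_hat = y_bar - beta_hat * x_bar

# Residual sum of squares and the unbiased estimator RSS / (n - 2)
rss = sum((y - alpha_hat - beta_hat * x) ** 2 for x, y in zip(X, Y))
sigma2_hat = rss / (n - 2)

# Estimated variance of beta_hat: replace sigma^2 by its estimate in sigma^2 / S_xx
var_beta_hat = sigma2_hat / s_xx

print("RSS = %.4f" % rss)
print("sigma2_hat = RSS / (n - 2) = %.5f" % sigma2_hat)
print("estimated Var(beta_hat) = %.5f" % var_beta_hat)
```

Dividing by n − 2 rather than n reflects the two parameters estimated from the sample before the residuals are computed.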

Residuals and goodness of fit

Recall the results from the normal equations we had:

$$\sum_{i=1}^{n} e_i = 0 \Rightarrow \bar{e} = 0 \qquad \text{EQ 1}$$
$$\sum_{i=1}^{n} X_i e_i = 0 \Rightarrow \sum_{i=1}^{n} x_i e_i = 0 \qquad \text{EQ 2}$$

where
$$e_i = Y_i - \hat{\alpha} - \hat{\beta} X_i = Y_i - \hat{Y}_i$$

In this formulation $\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$ is the estimated (fitted) value of $Y_i$. Equations 1 and
2 imply that
1. the mean of the residuals is zero, and
2. the residuals and the explanatory variable are uncorrelated.
Given this, it follows that

$$\sum_{i=1}^{n} \hat{Y}_i e_i = 0$$

That is, the residuals and the estimated values of Y are uncorrelated. To see this:

$$\sum_{i=1}^{n} e_i \hat{Y}_i = \sum_{i=1}^{n} e_i (\hat{\alpha} + \hat{\beta} X_i)$$
$$= \hat{\alpha} \sum_{i=1}^{n} e_i + \hat{\beta} \sum_{i=1}^{n} e_i X_i$$
$$= 0 + 0 = 0$$

$$Y_i = \hat{Y}_i + e_i \qquad \text{EQ 3}$$
Observed value of $Y_i$ = estimated value of $Y_i$ + residual
Sum equation 3 over i, the sampled observations, to get

$$\sum Y_i = \sum \hat{Y}_i + \sum e_i$$
$$\sum Y_i = \sum \hat{Y}_i, \quad \text{as } \sum e_i = 0 \qquad \text{EQ 4}$$
$$\Rightarrow \bar{Y} = \bar{\hat{Y}}$$

Given the fact that $\bar{Y} = \bar{\hat{Y}}$, equation 3 can be written as:

$$Y_i - \bar{Y} = \hat{Y}_i - \bar{\hat{Y}} + e_i \qquad \text{EQ 5}$$
$$\Rightarrow y_i = \hat{y}_i + e_i$$
Squaring both sides and then summing over i we get

$$\sum y_i^2 = \sum (\hat{y}_i + e_i)^2 = \sum \hat{y}_i^2 + 2\sum \hat{y}_i e_i + \sum e_i^2$$

but $\sum \hat{y}_i e_i = 0$, which implies that

$$\sum y_i^2 = \sum \hat{y}_i^2 + \sum e_i^2 \qquad \text{EQ 6}$$
$$TSS = ESS + RSS$$

TSS = Total Sum of Squares of the dependent variable
ESS = Explained Sum of Squares
RSS = Residual Sum of Squares (or Unexplained Sum of Squares)

The coefficient of determination, denoted by $R^2$, is a measure of the goodness of fit of a
regression line:

$$R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum e_i^2}{\sum (Y_i - \bar{Y})^2}$$

Verbally, it measures the proportion or percentage of the total variation in the dependent
variable (Y) explained by the independent variable(s) included in the model.
NB 1. By definition $0 \leq R^2 \leq 1$
2. It is a nonnegative quantity
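Alternative formulas for the Coefficient of Determination

We know that $ESS = \sum \hat{y}_i^2$ and $TSS = \sum y_i^2$.

Thus, $R^2 = \dfrac{\sum \hat{y}_i^2}{\sum y_i^2}$

$$\hat{y}_i = \hat{Y}_i - \bar{Y}$$
$$= \hat{\alpha} + \hat{\beta} X_i - \bar{Y} \quad (\text{as } \hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i)$$
$$= \bar{Y} - \hat{\beta}\bar{X} + \hat{\beta} X_i - \bar{Y} \quad (\text{as } \hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X})$$
$$= \hat{\beta}(X_i - \bar{X}) = \hat{\beta} x_i$$
$$\Rightarrow \sum \hat{y}_i^2 = \hat{\beta}^2 \sum x_i^2$$

Thus, $R^2 = \dfrac{\hat{\beta}^2 \sum x_i^2}{\sum y_i^2}$

Alternatively,

$$\sum \hat{y}_i^2 = \sum \hat{\beta} x_i \hat{y}_i$$
$$= \hat{\beta} \sum x_i (y_i - e_i) \quad \text{as } y_i = \hat{y}_i + e_i$$
$$= \hat{\beta} \sum x_i y_i - \hat{\beta} \sum x_i e_i$$
$$= \hat{\beta} \sum x_i y_i \quad \text{as } \sum x_i e_i = 0$$

It follows that

$$\sum \hat{y}_i^2 = \frac{\sum x_i y_i}{\sum x_i^2} \sum x_i y_i = \frac{\left( \sum x_i y_i \right)^2}{\sum x_i^2} \quad \text{as } \hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2}$$

Given that $TSS = \sum y_i^2$, it follows that

$$R^2 = \frac{\hat{\beta} \sum x_i y_i}{\sum y_i^2} \qquad \text{EQ 7}$$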
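The decomposition TSS = ESS + RSS and the alternative formulas for R² can be checked on the Table 1 data. The sketch below is a plain numerical verification with illustrative variable names; it computes the three sums of squares directly from their definitions and then evaluates R² in three of the ways given above.

```python
# Table 1 data: X = labour hours worked, Y = output
X = [10, 7, 10, 5, 8, 8, 6, 7, 9, 10]
Y = [11, 10, 12, 6, 10, 7, 9, 10, 11, 10]

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n
x_dev = [x - x_bar for x in X]
y_dev = [y - y_bar for y in Y]

s_xx = sum(d * d for d in x_dev)
s_xy = sum(dx * dy for dx, dy in zip(x_dev, y_dev))
beta_hat = s_xy / s_xx
alpha_hat = y_bar - beta_hat * x_bar

fitted = [alpha_hat + beta_hat * x for x in X]
tss = sum(d * d for d in y_dev)                       # total sum of squares
ess = sum((f - y_bar) ** 2 for f in fitted)           # explained sum of squares
rss = sum((y - f) ** 2 for y, f in zip(Y, fitted))    # residual sum of squares

r2_from_ess = ess / tss                 # ESS / TSS
r2_from_rss = 1 - rss / tss             # 1 - RSS / TSS
r2_from_eq7 = beta_hat * s_xy / tss     # EQ 7: beta_hat * sum(x_i y_i) / sum(y_i^2)

print("TSS = %.4f, ESS = %.4f, RSS = %.4f" % (tss, ess, rss))
print("R^2: %.4f, %.4f, %.4f" % (r2_from_ess, r2_from_rss, r2_from_eq7))
```

All three expressions give the same value, roughly 0.52 for these data.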

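Denote the square of the correlation between $Y$ and $\hat{Y}$, i.e., the observed and fitted values
of Y, by $r^2_{y,\hat{y}}$. Their correlation coefficient is

$$r_{y,\hat{y}} = \frac{\sum y_i \hat{y}_i}{\sqrt{\sum y_i^2 \sum \hat{y}_i^2}}$$

and we claim that $r^2_{y,\hat{y}} = R^2$.

Start with the fact that
$$y_i = \hat{y}_i + e_i$$
Multiply the above equation throughout by $\hat{y}_i$ and sum over i to get

$$\sum y_i \hat{y}_i = \sum \hat{y}_i^2 + \sum \hat{y}_i e_i \qquad \text{EQ 8}$$
$$= \sum \hat{y}_i^2 \quad \text{as } \sum \hat{y}_i e_i = 0$$

Hence,

$$r^2_{y,\hat{y}} = \frac{\left( \sum y_i \hat{y}_i \right)^2}{\sum y_i^2 \sum \hat{y}_i^2} = \frac{\left( \sum \hat{y}_i^2 \right)^2}{\sum y_i^2 \sum \hat{y}_i^2} = \frac{\sum \hat{y}_i^2}{\sum y_i^2} = R^2$$

Therefore, $r^2_{y,\hat{y}} = R^2$.

Show that
$$r_{y,\hat{y}} = r_{yx}$$

The proof goes as follows (taking $\hat{\beta} > 0$, so that $\sqrt{\hat{\beta}^2} = \hat{\beta}$; for $\hat{\beta} < 0$ the two correlations differ in sign):

$$r_{y,\hat{y}} = \frac{\sum y_i \hat{y}_i}{\sqrt{\sum y_i^2 \sum \hat{y}_i^2}} = \frac{\sum y_i \hat{\beta} x_i}{\sqrt{\sum y_i^2 \sum (\hat{\beta} x_i)^2}} = \frac{\hat{\beta} \sum y_i x_i}{\hat{\beta} \sqrt{\sum y_i^2 \sum x_i^2}}$$
$$= \frac{\sum y_i x_i}{\sqrt{\sum y_i^2 \sum x_i^2}} = r_{y,x}$$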

Example: Consider the data in Table 1. The data contains information on Y (output) and
X (labour hours worked) for 10 workers.

No.      Y     X    Y^2    X^2    YX     e_i     e_i^2
1       11    10    121    100   110   -0.10    0.0100
2       10     7    100     49    70    1.15    1.3225
3       12    10    144    100   120    0.90    0.8100
4        6     5     36     25    30   -1.35    1.8225
5       10     8    100     64    80    0.40    0.1600
6        7     8     49     64    56   -2.60    6.7600
7        9     6     81     36    54    0.90    0.8100
8       10     7    100     49    70    1.15    1.3225
9       11     9    121     81    99    0.65    0.4225
10      10    10    100    100   100   -1.10    1.2100
Total   96    80    952    668   789    0.00   14.6500
Given this information, we want to determine the relationship between labour input and
output. From the information above we can determine
$$\bar{X} = \frac{80}{10} = 8; \quad \bar{Y} = \frac{96}{10} = 9.6$$
The following are also necessary pieces of information:

$$S_{xx} = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = 668 - 10(8)^2 = 28$$
$$S_{yx} = \sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X}) = \sum_{i=1}^{n} y_i x_i = \sum_{i=1}^{n} Y_i X_i - n\bar{X}\bar{Y} = 789 - 10(8)(9.6) = 21$$
$$S_{yy} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 = 952 - 10(9.6)^2 = 30.4$$

Thus the coefficients of the regression are

$$\hat{\beta} = \frac{S_{yx}}{S_{xx}} = \frac{21}{28} = 0.75$$
$$\hat{\alpha} = \bar{Y} - \hat{\beta}\bar{X} = 9.6 - (0.75)(8) = 3.6$$

Hence, the regression of Y on X is
$$Y_i = 3.6 + 0.75 X_i + e_i$$
Since Y measures output and X measures labour input, it follows that the slope coefficient
is the marginal product of labour:
$$\frac{\partial Y_i}{\partial X_i} = 0.75 \equiv MPL$$

Since the intercept is 3.6, the equation implies that output will be 3.6 when 0 labour is
used! Note the absurdity of trying to predict values of Y outside the range of the X values
in the data set! We can go further and estimate

$$r_{x,y} = \frac{\sum y_i x_i}{\sqrt{\sum y_i^2 \sum x_i^2}} = \frac{S_{yx}}{\sqrt{S_{yy} S_{xx}}} = \frac{21}{\sqrt{28(30.4)}} = \frac{21}{29.175} \approx 0.72$$

Therefore, $r^2_{x,y} = R^2 \approx 0.52$

Alternatively: $R^2 = \dfrac{\hat{\beta} \sum x_i y_i}{\sum y_i^2} = \dfrac{(0.75)(21)}{30.4} \approx 0.52$

Assignment: Check whether $r_{y,\hat{y}} = r_{yx}$
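For the Table 1 data this check can be carried out numerically. The sketch below computes the correlation between Y and X and the correlation between Y and the fitted values with a small helper function (`correlation` is a hypothetical name written for this illustration) and compares both with R².

```python
import math

# Table 1 data: X = labour hours worked, Y = output
X = [10, 7, 10, 5, 8, 8, 6, 7, 9, 10]
Y = [11, 10, 12, 6, 10, 7, 9, 10, 11, 10]


def correlation(a, b):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    s_aa = sum((ai - a_bar) ** 2 for ai in a)
    s_bb = sum((bi - b_bar) ** 2 for bi in b)
    return s_ab / math.sqrt(s_aa * s_bb)


# Fitted values from the estimated regression Y_hat = 3.6 + 0.75 X
fitted = [3.6 + 0.75 * x for x in X]

r_yx = correlation(Y, X)
r_y_yhat = correlation(Y, fitted)

print("r_yx     = %.4f" % r_yx)
print("r_y,yhat = %.4f" % r_y_yhat)
print("r^2      = %.4f" % (r_yx ** 2))
```

The two correlations coincide here because the estimated slope is positive, and their square reproduces the R² of about 0.52 found above.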


2.2.3. Maximum likelihood estimation

To generate the maximum likelihood estimators (MLE) for the parameters of a simple
linear regression model we need to make assumptions about the distribution of εi, the
disturbance term. In this setting εi is assumed to be normally and independently
distributed with mean 0 and variance 2, i.e.,  i ~ NIID (0,  2 ) .

Given Y i =  +  X i +  i and  i ~ NIID (0,  2 ) , it follows that

Yi X i ~ NIID ( +  X i , 2 ) i = 1, 2,  , n EQ 1

Given the assumption of independence of the $Y_i$'s, the joint probability density function
of the sample $(Y_1, Y_2, \ldots, Y_n)$ can be written as a product of n marginal density functions

$$f(Y_1, Y_2, \ldots, Y_n \mid \alpha + \beta X_i, \sigma^2) = f_1(Y_1 \mid \alpha + \beta X_1, \sigma^2)\, f_2(Y_2 \mid \alpha + \beta X_2, \sigma^2) \cdots f_n(Y_n \mid \alpha + \beta X_n, \sigma^2) \qquad \text{EQ 2}$$

where

$$f_i(Y_i \mid \alpha + \beta X_i, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{Y_i - \alpha - \beta X_i}{\sigma}\right)^2\right] \qquad \text{EQ 3}$$

This is nothing but the density function of a normally distributed random variable with
mean $\alpha + \beta X_i$ and variance $\sigma^2$. Substituting EQ 3 into EQ 2 for each $Y_i$ we get

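$$f(Y_1, Y_2, \ldots, Y_n \mid X_1, X_2, \ldots, X_n; \alpha, \beta, \sigma^2) = \frac{1}{\sigma^n (\sqrt{2\pi})^n} \exp\left[-\frac{1}{2\sigma^2} \sum (Y_i - \alpha - \beta X_i)^2\right] \qquad \text{EQ 4}$$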

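Its likelihood function with parameters $\alpha$, $\beta$, and $\sigma^2$ (the unknowns) is defined as:

$$L(\alpha, \beta, \sigma^2; Y_1, Y_2, \ldots, Y_n; X_1, X_2, \ldots, X_n) = \frac{1}{\sigma^n (\sqrt{2\pi})^n} \exp\left[-\frac{1}{2\sigma^2} \sum (Y_i - \alpha - \beta X_i)^2\right] \qquad \text{EQ 5}$$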

Taking logarithms gives the log-likelihood function:

$$l \equiv \ln L = -\frac{n}{2} \ln \sigma^2 - \frac{n}{2} \ln 2\pi - \frac{1}{2\sigma^2} \sum (Y_i - \alpha - \beta X_i)^2 \qquad \text{EQ 6}$$

Note: Maximizing the likelihood function is equivalent to maximizing the log-likelihood

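To maximise this function with respect to the unknown parameters, take its first partial
derivatives with respect to the unknown parameters, i.e.,

$$\frac{\partial l}{\partial \alpha} = \frac{1}{\sigma^2} \sum (Y_i - \alpha - \beta X_i)$$

$$\frac{\partial l}{\partial \beta} = \frac{1}{\sigma^2} \sum (Y_i - \alpha - \beta X_i) X_i \qquad \text{EQ 8}$$

$$\frac{\partial l}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum (Y_i - \alpha - \beta X_i)^2$$

Let $\tilde{\alpha}$, $\tilde{\beta}$ and $\tilde{\sigma}^2$, respectively, represent the ML estimators of $\alpha$, $\beta$, and $\sigma^2$. By evaluating
the first order partial derivatives of the log-likelihood function at the maximum likelihood
estimators and equating them to zero, we get the following first order conditions.

$$\frac{\partial l(\tilde{\alpha}, \tilde{\beta}, \tilde{\sigma}^2)}{\partial \alpha} = 0 \;\Rightarrow\; \frac{1}{\tilde{\sigma}^2} \sum (Y_i - \tilde{\alpha} - \tilde{\beta} X_i) = 0$$

$$\frac{\partial l(\tilde{\alpha}, \tilde{\beta}, \tilde{\sigma}^2)}{\partial \beta} = 0 \;\Rightarrow\; \frac{1}{\tilde{\sigma}^2} \sum (Y_i - \tilde{\alpha} - \tilde{\beta} X_i) X_i = 0 \qquad \text{EQ 9}$$

$$\frac{\partial l(\tilde{\alpha}, \tilde{\beta}, \tilde{\sigma}^2)}{\partial \sigma^2} = 0 \;\Rightarrow\; -\frac{n}{2\tilde{\sigma}^2} + \frac{1}{2\tilde{\sigma}^4} \sum (Y_i - \tilde{\alpha} - \tilde{\beta} X_i)^2 = 0$$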
Rearranging the first two first order conditions given in EQ 9, one gets

$$n\tilde{\alpha} + \tilde{\beta} \sum X_i = \sum Y_i$$
$$\tilde{\alpha} \sum X_i + \tilde{\beta} \sum X_i^2 = \sum X_i Y_i \qquad \text{EQ 10}$$
Notice the similarity of these equations with the normal equations obtained using the
method of moments and those obtained using OLS. Solving EQ 10, therefore, yields
the ML estimators of $\alpha$ and $\beta$, which are identical to those of OLS and the method of
moments:

$$\tilde{\alpha} = \bar{Y} - \tilde{\beta}\bar{X} = \hat{\alpha}$$

$$\tilde{\beta} = \frac{\sum x_i y_i}{\sum x_i^2} = \hat{\beta}$$

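Substituting these into the third first order condition given in EQ 9, one
obtains the MLE of $\sigma^2$ as follows:

$$\tilde{\sigma}^2 = \frac{1}{n} \sum (Y_i - \tilde{\alpha} - \tilde{\beta} X_i)^2 = \frac{1}{n} \sum (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2 = \frac{\sum e_i^2}{n} \qquad \text{EQ 11}$$

Note that the OLS estimator is
$$\hat{\sigma}^2 = \frac{\sum e_i^2}{n - 2}$$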
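Thus the ML estimator of $\sigma^2$ differs from its OLS counterpart: it divides $\sum e_i^2$ by n rather
than by n − 2. Because the OLS estimator $\hat{\sigma}^2$ is unbiased, the ML estimator
$\tilde{\sigma}^2 = \frac{n-2}{n}\hat{\sigma}^2$ must be biased. The magnitude of the bias is easily obtained as follows:

$$E(\tilde{\sigma}^2) = \frac{1}{n} E\left(\sum e_i^2\right) = \frac{1}{n}(n - 2)\sigma^2 = \sigma^2 - \frac{2}{n}\sigma^2$$

Thus $\tilde{\sigma}^2$ is biased downwards and underestimates the true $\sigma^2$; the bias, $-\frac{2}{n}\sigma^2$, shrinks to zero as n grows.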
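To make the difference between the two variance estimators concrete, the sketch below evaluates both for the production example (sum of squared residuals 14.65, n = 10) and checks that the log-likelihood, with the intercept and slope held at their fitted values, is higher at the ML value. The function name `log_likelihood` and the variable names are illustrative and not part of the original notes.

```python
import math

n = 10
ssr = 14.65  # sum of squared residuals from the worked example


def log_likelihood(sigma2, ssr=ssr, n=n):
    """Log-likelihood as a function of sigma^2, with alpha and beta at their fitted values."""
    return (-0.5 * n * math.log(sigma2)
            - 0.5 * n * math.log(2 * math.pi)
            - ssr / (2 * sigma2))


sigma2_ml = ssr / n              # ML estimator: divides by n
sigma2_unbiased = ssr / (n - 2)  # OLS estimator: divides by n - 2

print(sigma2_ml, sigma2_unbiased)
print(log_likelihood(sigma2_ml) > log_likelihood(sigma2_unbiased))
```

The ratio of the two estimates is (n − 2)/n = 0.8 here, which is the downward bias factor derived above.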

2.6. Confidence intervals and hypothesis testing
This section deals with statistical inference –estimation (interval) and hypothesis testing.
For this and related aspects we should derive
1. Variances of the OLS estimators
2. Covariance between the OLS estimators
3. The unbiased estimator of $\sigma^2$

The implications of the normality assumption of the error terms

This assumption is particularly crucial for inference: the exact tests and interval
estimates for the parameters derived in this section all rest on it.

Implications of the normality assumption of the error terms on the distribution of the
estimators of interest
Assumption of normality: $\varepsilon_i \sim NID(0, \sigma^2)$

A linear function of a normally distributed random variable is itself normally
distributed. (This is a property of the normal distribution, not a variant of the central
limit theorem.) Thus, since $\varepsilon_i \sim NIID(0, \sigma^2)$ and $Y_i$ is a linear function of the error term, it
follows that $Y_i$ is also independently and normally distributed. That is,
$Y_i \mid X_i \sim NIID(\alpha + \beta X_i, \sigma^2)$. The OLS estimators are linear functions of Y. Hence

1. $\hat{\beta} = \sum k_i Y_i \sim N(\beta, \sigma^2_{\hat{\beta}})$,
where $k_i = \dfrac{x_i}{\sum x_i^2}$, $\sigma^2_{\hat{\beta}} = \dfrac{\sigma^2}{\sum x_i^2} = Var(\hat{\beta})$ and $E(\hat{\beta}) = \beta$

2. $\hat{\alpha} = \sum \left(\dfrac{1}{n} - \bar{X} k_i\right) Y_i \sim N(\alpha, \sigma^2_{\hat{\alpha}})$,
where $\sigma^2_{\hat{\alpha}} = \sigma^2 \left(\dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum x_i^2}\right) = Var(\hat{\alpha})$ and $\alpha = E(\hat{\alpha})$

3. $\dfrac{\sum e_i^2}{\sigma^2} \sim \chi^2_{n-2}$

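The first two follow from the fact that $\hat{\alpha}$ and $\hat{\beta}$ are linear combinations of the $Y_i$, which are
normal. The proof of the 3rd proposition goes as follows. Given that $\varepsilon_i \sim NIID(0, \sigma^2)$,

$$\frac{\varepsilon_i - 0}{\sigma} = \frac{\varepsilon_i}{\sigma} \sim NIID(0, 1)$$

i.e., it has a standard normal distribution. Then

$$\sum \left(\frac{\varepsilon_i}{\sigma}\right)^2 = \frac{\sum \varepsilon_i^2}{\sigma^2} \sim \chi^2_n$$

since the sum of squares of n independent standard normal random variables has a
chi-square distribution with n degrees of freedom. Now

$$\sum \varepsilon_i^2 = \sum (Y_i - \alpha - \beta X_i)^2$$

Adding and subtracting $\hat{\alpha} + \hat{\beta} X_i$ on the right hand side of the equation we get

$$\sum \varepsilon_i^2 = \sum \left[(Y_i - \hat{\alpha} - \hat{\beta} X_i) + (\hat{\alpha} + \hat{\beta} X_i - \alpha - \beta X_i)\right]^2$$

$$= \sum \left[e_i + (\hat{\alpha} - \alpha) + (\hat{\beta} - \beta) X_i\right]^2$$

$$= \sum \left[e_i + \bar{\varepsilon} + (\hat{\beta} - \beta) x_i\right]^2, \quad \text{as } \hat{\alpha} - \alpha = \bar{\varepsilon} - (\hat{\beta} - \beta)\bar{X}$$

$$= \sum e_i^2 + n\bar{\varepsilon}^2 + (\hat{\beta} - \beta)^2 \sum x_i^2, \quad \text{as the cross product terms sum to zero}$$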

Assignment: Show that the cross product terms of the above equation each sum to
zero.
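Dividing the whole equation by $\sigma^2$, we get

$$\underbrace{\frac{\sum \varepsilon_i^2}{\sigma^2}}_{\chi^2_n} = \underbrace{\frac{\sum e_i^2}{\sigma^2}}_{\chi^2_{n-2}} + \underbrace{\frac{n\bar{\varepsilon}^2}{\sigma^2}}_{\chi^2_1} + \underbrace{\frac{(\hat{\beta} - \beta)^2 \sum x_i^2}{\sigma^2}}_{\chi^2_1}$$

Hence

$$\frac{\sum e_i^2}{\sigma^2} \sim \chi^2_{n-2}$$

We also know that

$$\hat{\sigma}^2 = \frac{\sum e_i^2}{n - 2} \;\Rightarrow\; \frac{(n - 2)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2} \qquad \text{EQ 1}$$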
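A small simulation can make the degrees-of-freedom result plausible: a chi-square variable with n − 2 degrees of freedom has mean n − 2, so the average of Σe²/σ² over many simulated samples should be close to n − 2. The sketch below is only illustrative; the regressor values, the true parameters, the seed, and the function names are all hypothetical choices and do not come from the notes.

```python
import random


def ols_ssr(x, y):
    """Sum of squared OLS residuals from regressing y on x with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y))


def mean_scaled_ssr(reps=4000, seed=12345):
    """Average of SSR / sigma^2 over repeated samples with normal errors."""
    rng = random.Random(seed)
    x = [float(i) for i in range(1, 11)]  # hypothetical fixed regressor, n = 10
    alpha, beta, sigma = 1.0, 2.0, 1.5    # hypothetical true parameters
    total = 0.0
    for _ in range(reps):
        y = [alpha + beta * xi + rng.gauss(0.0, sigma) for xi in x]
        total += ols_ssr(x, y) / sigma ** 2
    return total / reps


average = mean_scaled_ssr()
print(round(average, 2))  # should be close to n - 2 = 8
```

A simulation of this kind illustrates the result but does not prove it; the proof is the decomposition given above.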

Recalling the fact that $\hat{\beta} \sim N\left(\beta, \dfrac{\sigma^2}{\sum x_i^2}\right)$, it follows that

$$\frac{\hat{\beta} - \beta}{\sigma / \sqrt{\sum x_i^2}} \sim N(0, 1) \qquad \text{EQ 2}$$

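i.e., it is standard normal. Recall that the ratio of a standard normal random variable to the
square root of an independent chi-square random variable divided by its degrees of
freedom follows a t-distribution. Since $\hat{\beta}$ and $\sum e_i^2$ are independent, it follows that:

$$\frac{(\hat{\beta} - \beta)\sqrt{\sum x_i^2} \,/\, \sigma}{\sqrt{\dfrac{(n - 2)\hat{\sigma}^2 / \sigma^2}{n - 2}}} = \frac{\hat{\beta} - \beta}{\hat{\sigma} / \sqrt{\sum x_i^2}} = \frac{\hat{\beta} - \beta}{\widehat{SE}(\hat{\beta})} \sim t(n - 2)$$

where $\widehat{SE}(\hat{\beta}) = \dfrac{\hat{\sigma}}{\sqrt{\sum x_i^2}}$ and $SE(\hat{\beta}) = \dfrac{\sigma}{\sqrt{\sum x_i^2}}$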

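Similarly, the distribution of $\hat{\alpha}$ is given by:

$$\frac{\hat{\alpha} - \alpha}{\widehat{SE}(\hat{\alpha})} \sim t(n - 2)$$

where $SE(\hat{\alpha}) = \sigma \sqrt{\dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum x_i^2}}$ and $\widehat{SE}(\hat{\alpha}) = \hat{\sigma} \sqrt{\dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum x_i^2}}$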

These results are used for both estimating confidence intervals and hypothesis testing.
Notice the switch from the variance of $\hat{\beta}$ to the estimator of the variance of $\hat{\beta}$.
We shall use the data in our previous example to calculate the variances and standard
errors of the estimators:

$$\hat{Y}_i = 3.6 + 0.75 X_i, \qquad R^2 \approx 0.52$$

where $\sum e_i^2 = 14.65$, $\sum x_i^2 = 28$, $\bar{X} = 8$, $n = 10$

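The standard errors are obtained by

1. Calculating the variances of $\hat{\alpha}$ and $\hat{\beta}$
2. Substituting $\hat{\sigma}^2$ for $\sigma^2$
3. Taking the square root of the resulting expression

Now,
$$Var(\hat{\beta}) = \sigma^2_{\hat{\beta}} = \frac{\sigma^2}{\sum x_i^2} = \frac{\sigma^2}{28} \approx 0.036\sigma^2$$

and
$$Var(\hat{\alpha}) = \sigma^2_{\hat{\alpha}} = \sigma^2 \left(\frac{1}{n} + \frac{\bar{X}^2}{\sum x_i^2}\right) = \sigma^2 \left(\frac{1}{10} + \frac{64}{28}\right) \approx 2.39\sigma^2$$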

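Recall that:
$$\hat{\sigma}^2 = \frac{\sum e_i^2}{n - 2} = \frac{14.65}{8} \approx 1.83$$
(so the standard error of the regression is $\hat{\sigma} = \sqrt{1.83} \approx 1.35$). Therefore,

$$\widehat{SE}(\hat{\alpha}) = \hat{\sigma}_{\hat{\alpha}} = \sqrt{\hat{\sigma}^2 \left(\frac{1}{n} + \frac{\bar{X}^2}{\sum x_i^2}\right)} = \sqrt{(1.83)(2.39)} \approx 2.09$$

$$\widehat{SE}(\hat{\beta}) = \hat{\sigma}_{\hat{\beta}} = \frac{\hat{\sigma}}{\sqrt{\sum x_i^2}} = \sqrt{\frac{1.83}{28}} \approx 0.256$$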
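Usually, the complete result of the regression is written as follows, with the standard
errors in parentheses beneath the corresponding coefficients:
$$\hat{Y}_i = \underset{(2.09)}{3.6} + \underset{(0.256)}{0.75}\, X_i, \qquad R^2 = 0.52$$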
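The three steps for obtaining the standard errors can be sketched in a few lines. The inputs (Σe² = 14.65, Σx² = 28, mean of X = 8, n = 10) are the figures reported for the example; the variable names are illustrative.

```python
import math

n = 10
ssr = 14.65   # sum of squared residuals
s_xx = 28.0   # sum of squared deviations of X from its mean
x_bar = 8.0

# Unbiased estimator of the error variance.
sigma2_hat = ssr / (n - 2)

# Steps 1 and 2: variance formulas with sigma^2 replaced by its estimate.
var_beta = sigma2_hat / s_xx
var_alpha = sigma2_hat * (1.0 / n + x_bar ** 2 / s_xx)

# Step 3: square roots give the estimated standard errors.
se_beta = math.sqrt(var_beta)
se_alpha = math.sqrt(var_alpha)

print(round(se_alpha, 2), round(se_beta, 3))
```

Because the code carries the unrounded variance estimate through the calculation, later digits may differ slightly from a hand calculation that rounds at each step.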

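We can then easily obtain the confidence intervals for $\alpha$ and $\beta$ by using the t distribution
with n − 2 degrees of freedom.
From the table of the t-distribution, we know that $t_{0.025}(8) = 2.306$. Now since
$\dfrac{\hat{\alpha} - \alpha}{\widehat{SE}(\hat{\alpha})} \sim t(8)$, it follows that:

$$\Pr\left[-2.306 \le \frac{\hat{\alpha} - \alpha}{\widehat{SE}(\hat{\alpha})} \le 2.306\right] = 0.95$$

$$\Rightarrow \Pr\left[-\hat{\alpha} - 2.306\,\widehat{SE}(\hat{\alpha}) \le -\alpha \le -\hat{\alpha} + 2.306\,\widehat{SE}(\hat{\alpha})\right] = 0.95$$

$$\Rightarrow \Pr\left[\hat{\alpha} + 2.306\,\widehat{SE}(\hat{\alpha}) \ge \alpha \ge \hat{\alpha} - 2.306\,\widehat{SE}(\hat{\alpha})\right] = 0.95$$

$$\Rightarrow \Pr\left[\hat{\alpha} - 2.306\,\widehat{SE}(\hat{\alpha}) \le \alpha \le \hat{\alpha} + 2.306\,\widehat{SE}(\hat{\alpha})\right] = 0.95$$

$$\Rightarrow \Pr\left(3.6 - (2.306)(2.09) \le \alpha \le 3.6 + (2.306)(2.09)\right) = 0.95$$

$$\Rightarrow \Pr(-1.22 \le \alpha \le 8.42) = 0.95$$

Thus, the 95% confidence interval for $\alpha$ is (−1.22, 8.42).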

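Similarly, for β we obtain the 95% confidence interval as:

$$\Pr\left[-2.306 \le \frac{\hat{\beta} - \beta}{\widehat{SE}(\hat{\beta})} \le 2.306\right] = 0.95$$

$$\Rightarrow \Pr\left[-\hat{\beta} - 2.306\,\widehat{SE}(\hat{\beta}) \le -\beta \le -\hat{\beta} + 2.306\,\widehat{SE}(\hat{\beta})\right] = 0.95$$

$$\Rightarrow \Pr\left[\hat{\beta} + 2.306\,\widehat{SE}(\hat{\beta}) \ge \beta \ge \hat{\beta} - 2.306\,\widehat{SE}(\hat{\beta})\right] = 0.95$$

$$\Rightarrow \Pr\left[\hat{\beta} - 2.306\,\widehat{SE}(\hat{\beta}) \le \beta \le \hat{\beta} + 2.306\,\widehat{SE}(\hat{\beta})\right] = 0.95$$

$$\Rightarrow \Pr\left(0.75 - (2.306)(0.256) \le \beta \le 0.75 + (2.306)(0.256)\right) = 0.95$$

$$\Rightarrow \Pr(0.16 \le \beta \le 1.34) = 0.95$$

Thus, the 95% confidence interval for β is given by (0.16, 1.34).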
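Both intervals have the form estimate ± critical value × standard error, which the sketch below computes. The helper name `confidence_interval` is illustrative; the critical value 2.306 is the tabulated t value for 8 degrees of freedom quoted in the text, and the estimates and standard errors are the rounded figures from the example.

```python
T_CRIT_8DF = 2.306  # t_{0.025}(8) as read from the t-table in the text


def confidence_interval(estimate, std_error, t_crit=T_CRIT_8DF):
    """Return (lower, upper) bounds of estimate +/- t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


ci_alpha = confidence_interval(3.6, 2.09)
ci_beta = confidence_interval(0.75, 0.256)

print(tuple(round(bound, 2) for bound in ci_alpha))
print(tuple(round(bound, 2) for bound in ci_beta))
```

The interval for the intercept contains zero while the interval for the slope does not, which anticipates the outcome of two-sided significance tests at the 5% level.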

Hypothesis testing
The main question in statistical hypothesis testing is whether the observations or
findings obtained in one's research are compatible with some stated hypothesis. By
compatibility we mean that the estimate is sufficiently close to the
hypothesised value that we do not reject the hypothesis. Suppose that there is prior
experience or expectation that the true slope coefficient of some regression function is
unity (1), but from the observed data we obtain 0.75, as we did in our production
function. The question is whether this observation is consistent with the stated
hypothesis. If it is, we do not reject the hypothesis; otherwise we may reject it. The most
common hypothesis tested in econometrics is whether the parameters of interest are
significantly different from zero (tests of significance).

In the language of statistics, the stated hypothesis is known as the null hypothesis and is
denoted by H0. This null is tested against an alternative hypothesis denoted by H1, which
may state that the true value of the slope parameter is different from unity (1). The
alternative hypothesis may be simple or composite. For example, $\beta = 1$ is a simple
hypothesis, while $\beta \neq 1$ is a composite hypothesis.

Though there are different approaches in hypothesis testing, we use the most widely used
approach—what is known as the test of significance approach. This procedure uses the
sample results to verify the truth or falsity of a null hypothesis.

For our example on the production function we may put forward the following hypotheses:
$H_0: \beta = 1$, and $H_1: \beta \neq 1$; $\alpha = 5\%$ (size of the test)

Now, since $\dfrac{\hat{\beta} - \beta}{\widehat{SE}(\hat{\beta})} \sim t(n - 2)$, it follows that under $H_0$

$$t_{cal} = \frac{0.75 - 1}{0.256} \approx -0.98 \;\Rightarrow\; |t_{cal}| = 0.98$$

Decision rule: Reject $H_0$ if $|t_{cal}| > t_{\alpha/2}(n - 2)$; do not reject otherwise.

Now, from the t-table with 8 degrees of freedom we read that
$t_{0.025}(8) = 2.306$
$t_{0.05}(8) = 1.860$
$t_{0.10}(8) = 1.397$
Since $|t_{cal}| = 0.98 < t_{0.025}(8) = 2.306$, we do not reject the null hypothesis at the 5%
level. Equivalently, $\Pr(|t| > 0.98) \approx 0.38$ (using linear interpolation), which is not a
low probability. Thus, what we obtained is not
statistically different from 1. It is customary to use as cut-offs probability levels of 0.05
and 0.01 to reject the null hypothesis.

The most customary type of test is to test whether a regression parameter estimate is
‘significant’ or ‘not significant’. What is meant here is that the parameter is significantly
different from zero, in the statistical sense. The hypothesis tested here is
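$H_0: \beta = 0$, and $H_1: \beta > 0$; $\alpha = 5\%$ (size of the test)
Following the procedure used earlier, we obtain
$$t_{cal} = \frac{0.75 - 0}{0.256} \approx 2.93$$
From the t-table with 8 degrees of freedom, $\Pr[t > 2.896] = 0.01$
and $\Pr[t > 3.355] = 0.005$.
Thus $\Pr[t > 2.93] \approx 0.0096$ (by linear interpolation), which is well below 0.05.
In this case, therefore, we reject the null hypothesis. Thus, our estimate is statistically
different from zero. People customarily say that their parameters have been found to be
statistically significant.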
Note, if we do the same significance test for α, at 5% level of significance we have
$H_0: \alpha = 0$, and $H_1: \alpha > 0$
$$t_{cal} = \frac{3.6 - 0}{2.09} \approx 1.72$$
Since $t_{0.10}(8) = 1.397 < 1.72 < t_{0.05}(8) = 1.860$, we have $0.05 < \Pr[t > 1.72] < 0.10$ for a
t distribution with 8 df.

Thus, we cannot reject the null hypothesis, which states that α = 0, at the 5% size of
the test.
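The three tests in this subsection differ only in the hypothesised value and in whether the rejection region is one-sided or two-sided. The sketch below recomputes the test statistics from the rounded estimates and standard errors and compares them with the tabulated critical values for 8 degrees of freedom quoted above; the function and constant names are illustrative.

```python
def t_statistic(estimate, hypothesised, std_error):
    """t ratio: distance of the estimate from the hypothesised value in SE units."""
    return (estimate - hypothesised) / std_error


# Critical values for 8 degrees of freedom, as quoted from the t-table.
T_TWO_SIDED_5PCT = 2.306  # t_{0.025}(8)
T_ONE_SIDED_5PCT = 1.860  # t_{0.05}(8)

t_beta_unity = t_statistic(0.75, 1.0, 0.256)  # H0: beta = 1 against beta != 1
t_beta_zero = t_statistic(0.75, 0.0, 0.256)   # H0: beta = 0 against beta > 0
t_alpha_zero = t_statistic(3.6, 0.0, 2.09)    # H0: alpha = 0 against alpha > 0

reject_beta_unity = abs(t_beta_unity) > T_TWO_SIDED_5PCT
reject_beta_zero = t_beta_zero > T_ONE_SIDED_5PCT
reject_alpha_zero = t_alpha_zero > T_ONE_SIDED_5PCT

print(round(t_beta_unity, 2), reject_beta_unity)
print(round(t_beta_zero, 2), reject_beta_zero)
print(round(t_alpha_zero, 2), reject_alpha_zero)
```

Comparing a statistic with a tabulated critical value gives only the reject or do-not-reject decision at that test size; exact p-values would require the t distribution function, which is not computed here.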

2.7. Predictions with simple regression model.

Given the estimated regression equation $\hat{Y}_i = \hat{\alpha} + \hat{\beta} X_i$, we are interested in predicting
values of Y given values of X. This is known as conditional prediction. Let the given
value of X be $X_f$; then we predict the corresponding value $Y_f$ of Y by

$$\hat{Y}_f = \hat{\alpha} + \hat{\beta} X_f$$

where $\hat{Y}_f$ is the predicted value of $Y_f$.

Now, the true value of $Y_f$ is given by

$$Y_f = \alpha + \beta X_f + \varepsilon_f$$
where $\varepsilon_f$ is the disturbance (error) term.
We now look at the desirable properties of $\hat{Y}_f$.

First, note that $\hat{Y}_f$ is a linear function of $Y_1, Y_2, \ldots, Y_n$, since $\hat{\alpha}$ and $\hat{\beta}$ are linear in the $Y_i$.
Thus $\hat{Y}_f$ is a linear function of the sample observations, and hence a linear predictor.

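Second, $\hat{Y}_f$ is unbiased. Note that

$$e_f = Y_f - \hat{Y}_f \quad \text{(error of prediction)}$$

but $\hat{Y}_f = \hat{\alpha} + \hat{\beta} X_f$ and $Y_f = \alpha + \beta X_f + \varepsilon_f$, so

$$e_f = Y_f - \hat{Y}_f = \varepsilon_f - (\hat{\alpha} - \alpha) - (\hat{\beta} - \beta) X_f$$

Thus
$$E(e_f) = E(\varepsilon_f) - E(\hat{\alpha} - \alpha) - E(\hat{\beta} - \beta) X_f = 0$$
$$E(Y_f - \hat{Y}_f) = 0$$
$$E(\hat{Y}_f) = E(Y_f) = \alpha + \beta X_f$$

thus it is unbiased.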
Third, $\hat{Y}_f$ has the smallest variance among the linear unbiased predictors of $Y_f$. Thus, $\hat{Y}_f$
is the Best Linear Unbiased Predictor (BLUP) of $Y_f$.

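The variance of the predictor's error, $Var(e_f)$, is obtained as follows.

We have already obtained

$$e_f = \varepsilon_f - (\hat{\alpha} - \alpha) - (\hat{\beta} - \beta) X_f$$

Since $\varepsilon_f$ is independent of $\hat{\alpha}$ and $\hat{\beta}$, and $Cov(\hat{\alpha}, \hat{\beta}) = -\bar{X}\sigma^2 / \sum x_i^2$, we get

$$Var(e_f) = Var(\varepsilon_f) + Var(\hat{\alpha} - \alpha) + X_f^2\, Var(\hat{\beta} - \beta) + 2\, Cov\left((\hat{\alpha} - \alpha),\, X_f(\hat{\beta} - \beta)\right)$$

$$= \sigma^2 + \sigma^2\left(\frac{1}{n} + \frac{\bar{X}^2}{\sum x_i^2}\right) + \sigma^2 \frac{X_f^2}{\sum x_i^2} - 2\sigma^2 \frac{\bar{X} X_f}{\sum x_i^2}$$

$$= \sigma^2\left[1 + \frac{1}{n} + \frac{\bar{X}^2}{\sum x_i^2} + \frac{X_f^2}{\sum x_i^2} - \frac{2\bar{X} X_f}{\sum x_i^2}\right]$$

$$= \sigma^2\left[1 + \frac{1}{n} + \frac{X_f^2 + \bar{X}^2 - 2\bar{X} X_f}{\sum x_i^2}\right]$$

$$= \sigma^2\left[1 + \frac{1}{n} + \frac{(X_f - \bar{X})^2}{\sum x_i^2}\right]$$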

Thus, we observe that
1. The variance of the prediction error increases as $X_f$ moves further away from $\bar{X}$, the
mean of the observations on the basis of which $\hat{\alpha}$ and $\hat{\beta}$ have been computed. That is,
as the distance between $X_f$ and $\bar{X}$ increases, the variance of the error of
prediction increases.
2. The variance of the prediction error increases with the variance of the regression, $\sigma^2$.
3. It decreases with n (number of observations used in the estimation of the parameters).

Interval prediction

One of the purposes of econometric analysis is prediction or forecasting outside the
sampled data. There are two types of prediction, namely, individual prediction and
mean prediction.

1. Individual Prediction

In this case, our interest is to predict the individual value of Y (say $Y_f$) corresponding to a
given level of X (say $X_f$). Given the assumption of normality, it can be shown that the
prediction error, which is given by $e_f = Y_f - \hat{Y}_f$, follows a normal distribution with mean

$$E(e_f) = E(Y_f - \hat{Y}_f) = (\alpha + \beta X_f) - (\alpha + \beta X_f) = 0$$

and variance

$$Var(e_f) = \sigma^2\left[1 + \frac{1}{n} + \frac{(X_f - \bar{X})^2}{\sum x_i^2}\right]$$

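Therefore, it follows that:

$$\frac{e_f - 0}{\sigma\sqrt{1 + \dfrac{1}{n} + \dfrac{(X_f - \bar{X})^2}{\sum x_i^2}}} \sim N(0, 1) \quad \text{and} \quad \frac{(n - 2)\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n - 2)$$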

Substituting $\hat{\sigma}^2$ for $\sigma^2$ it follows that

$$t = \frac{e_f}{SE(e_f)}, \quad \text{where } SE(e_f) = \hat{\sigma}\sqrt{1 + \frac{1}{n} + \frac{(X_f - \bar{X})^2}{\sum x_i^2}}$$

follows a t distribution with n − 2 degrees of freedom. This result can be used for both
interval estimation and other inference purposes. Let us try to obtain confidence
intervals for our earlier example. Recall the results obtained earlier:
$\hat{\sigma}^2 \approx 1.83$, $\bar{X} = 8$, and $\sum x_i^2 = 28$


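Now, suppose we are interested in predicting $Y_f$ for $X_f = 8$. In this case

$$\hat{Y}_f = \hat{\alpha} + \hat{\beta} X_f = 3.6 + 0.75(8) = 9.6$$

$$SE(e_f) = \sqrt{\hat{\sigma}^2\left[1 + \frac{1}{n} + \frac{(X_f - \bar{X})^2}{\sum x_i^2}\right]} = \sqrt{1.83\left[1 + \frac{1}{10} + \frac{(8 - 8)^2}{28}\right]} = \sqrt{1.83(1.1)} = \sqrt{2.013} \approx 1.42$$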

The t value for 95% confidence with 8 degrees of freedom is 2.306. Thus, the 95%
confidence interval for $Y_f$ is

9.6 ± 2.306(1.42) = (6.33, 12.87)

Now, suppose we wanted to predict the value of $Y_f$ for a value of $X_f$ that is far away
from the mean of X, say, $X_f = 20$. Then the predicted value of $Y_f$ is

$$\hat{Y}_f = \hat{\alpha} + \hat{\beta} X_f = 3.6 + 0.75(20) = 18.6$$

$$SE(e_f) = \sqrt{1.83\left[1 + \frac{1}{10} + \frac{(20 - 8)^2}{28}\right]} = \sqrt{1.83(6.24)} = \sqrt{11.42} \approx 3.38$$

Thus, the 95% confidence interval for $Y_f$ is

18.6 ± 2.306(3.38) = (10.81, 26.39)

Notice the large confidence interval as we move away from the mean.
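The two individual-prediction intervals above can be reproduced with a short function. This is a minimal sketch using the rounded figures from the example (estimated variance 1.83, Σx² = 28, mean of X = 8, n = 10, critical value 2.306); the function name `individual_prediction_interval` and the constant names are illustrative.

```python
import math

# Figures from the worked example.
ALPHA_HAT, BETA_HAT = 3.6, 0.75
SIGMA2_HAT = 1.83   # rounded estimate of the error variance
S_XX = 28.0         # sum of squared deviations of X
X_BAR = 8.0
N = 10
T_CRIT = 2.306      # t_{0.025}(8)


def individual_prediction_interval(x_f):
    """Return (prediction, standard error, (lower, upper)) for an individual Y at x_f."""
    y_hat = ALPHA_HAT + BETA_HAT * x_f
    se = math.sqrt(SIGMA2_HAT * (1.0 + 1.0 / N + (x_f - X_BAR) ** 2 / S_XX))
    return y_hat, se, (y_hat - T_CRIT * se, y_hat + T_CRIT * se)


for x_f in (8.0, 20.0):
    y_hat, se, (lower, upper) = individual_prediction_interval(x_f)
    print(x_f, round(y_hat, 2), round(se, 2), round(lower, 2), round(upper, 2))
```

The interval at X = 20 is more than twice as wide as the one at the sample mean, because the term (X_f − mean of X)² / Σx² dominates the variance once X_f lies far outside the observed range.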

2. Mean Prediction

In mean prediction, our interest is not to predict the individual $Y_f$, but rather to predict
$E(Y_f \mid X = X_f)$. Thus, we are interested in the mean of $Y_f$ and not $Y_f$ as such. Note that the
predictor is unbiased for this mean:

$$E(\hat{Y}_f \mid X = X_f) = \alpha + \beta X_f = E(Y_f \mid X = X_f) \quad (\hat{\alpha} \text{ and } \hat{\beta} \text{ are unbiased})$$

Thus, the forecast error becomes

$$e_f = E(Y_f \mid X = X_f) - \hat{Y}_f = -(\hat{\alpha} - \alpha) - (\hat{\beta} - \beta) X_f$$

This is similar to what we did earlier. However, the variance of the prediction error will be
smaller, because the disturbance $\varepsilon_f$ no longer enters the error. It will actually be:

$$Var(e_f) = \sigma^2\left[\frac{1}{n} + \frac{(X_f - \bar{X})^2}{\sum x_i^2}\right]$$

Assignment: show this!

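Compare the two:

$$\underbrace{\sigma^2\left[1 + \frac{1}{n} + \frac{(X_f - \bar{X})^2}{\sum x_i^2}\right]}_{Var(e_f),\ \text{individual prediction}} \;>\; \underbrace{\sigma^2\left[\frac{1}{n} + \frac{(X_f - \bar{X})^2}{\sum x_i^2}\right]}_{Var(e_f),\ \text{mean prediction}}$$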

Example: $\hat{Y}_i = 24.4545 + 0.5091 X_i$

Suppose $X_f = 100$. Now, predict the value $E(Y_f \mid X_f = 100)$.

$$\hat{Y}_f = 24.4545 + 0.5091(100) = 75.3645$$

$\bar{X} = 170$, $\hat{\sigma}^2 = 42.159$, $\sum x_i^2 = 33000$

Suppose n = 10 and $\alpha = 0.05$.

Construct the CI for the true $E(Y_f \mid X = 100)$:

$$\hat{Y}_f \pm t_{\alpha/2}(n - 2)\, SE(\hat{Y}_f)$$

The variance of the prediction error, which here equals the variance of the predicted Y, is
estimated by

$$Var(e_f) = Var(\hat{Y}_f) = \hat{\sigma}^2\left[\frac{1}{n} + \frac{(X_f - \bar{X})^2}{\sum x_i^2}\right] = 42.159\left[\frac{1}{10} + \frac{(100 - 170)^2}{33000}\right] = 10.4759$$

Thus, $SE(\hat{Y}_f) = 3.2366$

$$75.3645 - 2.306(3.2366) \le E(Y_f \mid X = 100) \le 75.3645 + 2.306(3.2366)$$
$$\Rightarrow 67.9009 \le E(Y_f \mid X = 100) \le 82.8281$$

Thus, given $X_f = 100$, in repeated sampling, 95 out of 100 intervals like this will
contain the true mean value.
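The mean-prediction interval of this example can be recomputed as follows. The sketch takes the stated prediction 75.3645 and the other reported figures as given; the variable names are illustrative. For comparison it also evaluates the individual-prediction variance at the same point, which adds the error variance itself.

```python
import math

# Figures reported in the example.
y_hat_f = 75.3645      # predicted value at X_f = 100, as stated in the text
sigma2_hat = 42.159    # estimated error variance
n = 10
x_bar = 170.0
s_xx = 33000.0         # sum of squared deviations of X
x_f = 100.0
t_crit = 2.306         # t_{0.025}(8)

# Variance and standard error of the mean prediction.
leverage = 1.0 / n + (x_f - x_bar) ** 2 / s_xx
var_mean = sigma2_hat * leverage
se_mean = math.sqrt(var_mean)

lower = y_hat_f - t_crit * se_mean
upper = y_hat_f + t_crit * se_mean

# Individual prediction adds one full error variance on top.
var_individual = sigma2_hat * (1.0 + leverage)

print(round(var_mean, 4), round(se_mean, 4))
print(round(lower, 2), round(upper, 2))
print(var_individual > var_mean)
```

The interval is symmetric about the point prediction, which gives a quick consistency check on any hand-computed bounds.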

