Univariate Regression


PE I: Univariate Regression

General Concepts, OLS, R & Python implementation


(Chapters 3.1 & 3.2)

Andrius Buteikis, [email protected]


http://web.vu.lt/mif/a.buteikis/
Univariate Regression: Modelling Framework
Assume that we are interested in explaining Y in terms of X .
I In general, any single X will never be able to fully explain an
economic variable Y .
I All other non-observable factors, influencing Y , manifest themselves
as an unpredictable (i.e. random) disturbance ε.
Thus, we are looking for a function f (·), which describes this relationship:

Y = f (X , ε)

where:
I f (·) is called a regression function;
I X is called the independent variable, the control variable, the
explanatory variable, the predictor variable, or the regressor;
I Y is called the dependent variable, the response variable, the
explained variable, the predicted variable or the regressand;
I ε is called the random component, the error term, the disturbance
or the (economic) shock.
Y is a random variable (r.v.), which depends on ε (an r.v.) and X (either an
r.v. or not). The simplest functional form of f (·) is linear:

Y = β0 + β1 X + ε    (1)

I If this functional form is appropriate for the data, our goal is to
estimate its unknown coefficients β0 and β1 from the available
data sample.
I We will assume that we have a finite random sample
(X1 , Y1 ), ..., (XN , YN ) of independent identically distributed (i.i.d.)
r.v.’s (unless stated otherwise), with their realizations (observed
values) (x1 , y1 ), ..., (xN , yN ).
I Since ε is a collection of unknown r.v.’s, unlike X and Y (which we
do observe), we do not observe ε.
We expect that, on average, the above linear relationship holds. The
two-dimensional random vector (which we will also later refer to as an r.v.)
(X , Y ) is fully determined by the two-dimensional r.v. (X , ε).
I In order to get good estimates of β0 and β1 , we have to impose
certain restrictions on the properties of ε and on the interaction
between X and ε.
Let us denote by E(Y |X ) the conditional expectation of Y given X , i.e. the
expected value of Y , provided we know the value of the r.v. X .
Conditional Expectation Properties (1)
Below we provide some properties of the conditional expectation which will
be useful later on:
(1) E [g(X )|X ] = g(X ) for any function g(·);
(2) E [a(X )Y + b(X )|X ] = a(X ) · E [Y |X ] + b(X );
(3) If X and Y are independent, then E [Y |X ] = E [Y ] and
Cov(X , Y ) = 0;
(4) Law of total expectation: E [E [Y |X ]] = E [Y ];
(5) Let Y , X and ε follow eq. (1). The conditional expectation is a linear
operator:

E(Y |X ) = E(β0 + β1 X + ε|X )
         = E(β0 |X ) + E(β1 X |X ) + E(ε|X )    (2)
         = β0 + β1 X + E(ε|X )
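To make property (4), the law of total expectation, concrete, here is a minimal
simulation sketch (not from the slides; all names are illustrative): it estimates
E(Y |X = x) by averaging Y within each value of a discrete X and checks that the
weighted average of these conditional means matches E(Y ).

import numpy as np

np.random.seed(42)
# Simulate a discrete X and a Y that depends linearly on X:
xs = np.random.choice([0, 1, 2], size = 100000, p = [0.2, 0.5, 0.3])
ys = 1 + 2 * xs + np.random.normal(size = xs.size)

# E(Y|X = x), estimated by averaging Y within each value of X:
cond_means = {val: ys[xs == val].mean() for val in np.unique(xs)}
p_x = {val: np.mean(xs == val) for val in np.unique(xs)}

# Law of total expectation: sum_x E(Y|X = x) P(X = x) should match E(Y):
print(sum(cond_means[v] * p_x[v] for v in cond_means), ys.mean())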
Conditional Expectation Properties (2)
Depending on whether X and Y are discrete or continuous, the expected
value is calculated as follows:
I If both X and Y are discrete r.v.’s, then

E(Y |X = x ) = ∑_y y · P(Y = y |X = x ) = ∑_y y · P_{X,Y}(Y = y , X = x ) / P_X (X = x )

I If both X and Y are continuous r.v.’s, then

E(Y |X = x ) = ∫_{−∞}^{∞} y f_{Y|X}(y |x ) dy = ∫_{−∞}^{∞} y · f_{X,Y}(x , y ) / f_X (x ) dy ,

where f_{X,Y} is the joint probability density function (pdf) of X and Y ,
and f_X is the (marginal) density of X with f_X (x ) = ∫_{−∞}^{∞} f_{X,Y}(x , y ) dy ;
I If X is a discrete and Y is a continuous r.v., then

E(Y |X = x ) = ∫_{−∞}^{∞} y f_{Y|X}(y |X = x ) dy = ∫_{−∞}^{∞} y · f_{X,Y}(x , y ) / P_X (X = x ) dy
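For the discrete case, a small numerical illustration (not from the slides; the
joint pmf below is hypothetical) computes E(Y |X = x) directly from a joint
probability table:

import numpy as np

# Hypothetical joint pmf P(X = x, Y = y); rows index X, columns index Y:
x_vals = np.array([0, 1])
y_vals = np.array([10, 20, 30])
p_xy = np.array([[0.10, 0.20, 0.10],   # X = 0
                 [0.15, 0.25, 0.20]])  # X = 1 (all entries sum to 1)

p_x = p_xy.sum(axis = 1)                 # marginal P(X = x)
for i, xv in enumerate(x_vals):
    p_y_given_x = p_xy[i] / p_x[i]       # conditional pmf P(Y = y | X = x)
    print("E(Y|X =", xv, ") =", np.sum(y_vals * p_y_given_x))

# Sanity check via the law of total expectation: E[E(Y|X)] = E(Y)
print("E(Y) =", np.sum(y_vals * p_xy.sum(axis = 0)))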
Equation (2) implies that if E(ε|X ) = 0, then the r.v. Y = β0 + β1 X + ε
is on average β0 + β1 X :

E(β0 + β1 X + ε|X ) = β0 + β1 X

and:
I β0 - the intercept parameter, sometimes called the constant term -
is the average value of Y when X = 0. It often has no econometric
meaning, especially if X cannot take a zero value (e.g. if X is wage);
I β1 - the slope parameter - is the average change of Y provided X
increases by 1.
It may also be convenient to describe random variables as empirical
phenomena that have some statistical regularity, but no deterministic
regularity.
Regression and The Sample Data
Assume that the relationship between Y and X (i.e. the data generating
process) is a linear one as discussed before:

Y = β0 + β1 X + ε

where X and ε are r.v.’s. For simplicity assume that:
I β0 = 1, β1 = 2;
I ε ∼ N (0, 1), X ∼ N (0, 5²);
I E(ε|X ) = 0.
We want to simulate N = 50 observations (i.e. realizations) of the random
sample (X1 , Y1 ), ..., (XN , YN ).
The observations can be generated as follows:
I in R:

set.seed(1)
#
beta_0 <- 1
beta_1 <- 2
N <- 50
#
x <- rnorm(mean = 0, sd = 5, n = N)
eps <- rnorm(mean = 0, sd = 1, n = length(x))
y <- beta_0 + beta_1 * x + eps

I in Python:

import numpy as np
import pandas as pd

np.random.seed(1)
#
beta_0 = 1
beta_1 = 2
N = 50
#
x = np.random.normal(loc = 0, scale = 5, size = N)
eps = np.random.normal(loc = 0, scale = 1, size = len(x))
y = beta_0 + beta_1 * x + eps

The conditional expectation E(Y |X ) = β0 + β1 X is calculated as:

I in R: y_ce <- beta_0 + beta_1 * x
I in Python: y_ce = beta_0 + beta_1 * x
We can print the first 15 observations, as well as the conditional
expectation E(Yi |Xi = xi ) = 1 + 2xi for each point xi , i = 1, ..., N.
I in R:

tmp_data <- cbind(y_ce, y, x)
print(round(head(tmp_data, 15), 4))

##           y_ce        y        x
## [1,]   -5.2645  -4.8664  -3.1323
## [2,]    2.8364   2.2244   0.9182
## [3,]   -7.3563  -7.0152  -4.1781
## [4,]   16.9528  15.8234   7.9764
## [5,]    4.2951   5.7281   1.6475
## [6,]   -7.2047  -5.2243  -4.1023
## [7,]    5.8743   5.5071   2.4371
## [8,]    8.3832   7.3391   3.6916
## [9,]    6.7578   7.3275   2.8789
## [10,]  -2.0539  -2.1889  -1.5269
## [11,]  16.1178  18.5194   7.5589
## [12,]   4.8984   4.8592   1.9492
## [13,]  -5.2124  -4.5227  -3.1062
## [14,] -21.1470 -21.1190 -11.0735
## [15,]  12.2493  11.5060   5.6247

I in Python:

tmp_data = pd.DataFrame({'y_ce': y_ce, 'y': y, 'x': x})
print(tmp_data.head(15).round(4))

##        y_ce        y        x
## 0   17.2435  17.5436   8.1217
## 1   -5.1176  -5.4698  -3.0588
## 2   -4.2817  -5.4242  -2.6409
## 3   -9.7297 -10.0790  -5.3648
## 4    9.6541   9.4452   4.3270
## 5  -22.0154 -21.4288 -11.5077
## 6   18.4481  19.2871   8.7241
## 7   -6.6121  -5.6810  -3.8060
## 8    4.1904   4.4760   1.5952
## 9   -1.4937  -0.6086  -1.2469
## 10  15.6211  14.8667   7.3105
## 11 -19.6014 -18.3485 -10.3007
## 12  -2.2242  -1.7112  -1.6121
## 13  -2.8405  -3.1386  -1.9203
## 14  12.3377  12.8262   5.6688
To get a better understanding of how the conditional expectation relates
to the observed sample, we can look at the data scatter plots:
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
_ = plt.figure(num = 0)
_ = plt.plot(x, y, linestyle = "None", color = "black",
marker = "o", markerfacecolor = 'none',
label = "Sample data");
_ = plt.plot(x, y_ce, linestyle = "-", color = "red",
label = "$E(Y|X) = \\beta_0 + \\beta_1 X$");
_ = plt.legend()
plt.show()

[Figure: scatter plot of the sample data (black circles) with the conditional expectation line E(Y|X) = β0 + β1 X (red).]
We can do this in R as well:
plot(x = x, y = y)
lines(x = x, y = y_ce, col = "red")
legend("topleft",
legend = c(expression(paste("E(Y|X) = ", beta[0] + beta[1] * X)),
"Sample data"),
lty = c(1, NA), lwd = c(1, NA), pch = c(NA, 1), col = c("red", "black"))
[Figure: the equivalent R scatter plot of the sample data with the line E(Y|X) = β0 + β1 X in red.]
We see that, on average, the conditional expectation captures the
relationship between Y and X . Since we do not measure any other
variables (whose possible effects on Y are consequently gathered in ε),
we see that the sample values are scattered around the conditional
mean of the process.
Note:
I In this particular example, we could have alternatively taken X to be
a non-random sequence of values - since the arrangement of the
values does not matter, the horizontal axis is always ordered, so we
would get a similar scatter plot as before.
I In practical applications this is not the case - in order to determine if
X is random, we need to examine its variable definition (e.g. if X is
the number of holidays in a year - this is, usually, a non-random
number).
Parameter Terminology
In general, we do not know the underlying true values of the coefficients β0
and β1 . We will denote by β̂1 the estimate of the unknown parameter β1 .

We can talk about β̂1 in two ways:

I β̂1 as a concrete estimated value, or more generally, as a result.
Then we say that β̂1 is an estimate of β1 based on an observed data
sample;
I β̂1 as a random variable, or more generally, as a rule for calculating
an estimate based on observed data. Then we say that β̂1 is an
estimator of β1 .

When we are talking about β̂1 as an estimator, we can also talk about its
mean E(β̂1 ), variance Var(β̂1 ) and distribution, which are very important
when determining whether a particular estimation method is better than
some alternative one.

As you may have guessed, we are interested in estimating these unknown
parameters as accurately as possible. For that, we turn to examine their
estimation methods.
Ordinary Least Squares (OLS)
Let our (random) samples be: ε = (ε1 , ..., εN )⊤ , X = (X1 , ..., XN )⊤ , and
Y = (Y1 , ..., YN )⊤ .
(UR.1) The Data Generating Process (DGP), or in other words,
the population, is described by a linear (in terms of the coefficients)
model:
Y = β0 + β1 X + ε    (UR.1)
(UR.2) The error term ε has an expected value of zero, given any
value of the explanatory variable:

E(εi |Xj ) = 0, ∀i, j = 1, ..., N    (UR.2)

I log(Y ) = β0 + β1 · (1/X ) + ε is also a linear model;
I (UR.2) implies that E(Yi |Xi ) = β0 + β1 Xi , ∀i = 1, ..., N;
I In a linear model ε and Y are always dependent, thus
Cov(Y , ε) ≠ 0. However, ε may (or may not) depend on X . This
leads us to further requirements for ε.
(UR.3) The error term ε has the same variance given any value of
the explanatory variable (i.e. homoskedasticity) and the error terms
are not correlated across observations (i.e. no autocorrelation):

             [ Var(ε1)       ...  Cov(ε1 , εN) ]
Var(ε|X) =   [ Cov(ε2 , ε1)  ...  Cov(ε2 , εN) ]  = σ² I    (UR.3)
             [ ...           ...  ...          ]
             [ Cov(εN , ε1)  ...  Var(εN)      ]

(UR.4) (optional) The errors are normally distributed:

ε|X ∼ N (0, σ² I)    (UR.4)

(This optional condition simplifies some statistical properties of
the parameter estimators.)
We can combine the requirements and restate them as the following:
The Data Generating Process Y = β0 + β1 X + ε satisfies (UR.2)
and (UR.3) if (conditionally on all X’s): E(εi ) = 0, Var(εi ) = σ²
and Cov(εi , εj ) = 0, ∀i ≠ j, and Cov(εi , Xj ) = E(εi Xj ) = 0, ∀i, j.
I The linear relationship Y = β0 + β1 X + ε is also referred to as the
regression line with an intercept β0 and a slope β1 ;
I From (UR.2) we have that the regression line coincides with the
expected value of Yi , given Xi .
In general, we do not know the true coefficient β0 and β1 values but we
would like to estimate them from our sample data, which consists of
points (Xi , Yi ), i = 1, ..., N.
We would also like to use the data in the best way possible to obtain an
estimate of the regression:

Ŷ = β̂0 + β̂1 X .

Our random sample (X1 , Y1 ), ..., (XN , YN ) comes from the data generating
process:
Yi = β0 + β1 Xi + εi , i = 1, ..., N

We want to use the data to obtain estimates of the intercept β0 and the
slope β1 . We will assume that the process satisfies the (UR.1) - (UR.4)
conditions.
Simulated Data Example
We will use the following generated data sample for the estimation
methods outlined later on:
I β0 = 1, β1 = 0.5, Xi = i − 1, εi ∼ N (0, 1²), i = 1, ..., 51
I in R:

set.seed(234)
# Set the coefficients:
N = 50
beta_0 = 1
beta_1 = 0.5
# Generate sample data:
x <- 0:N
e <- rnorm(mean = 0, sd = 1, n = length(x))
y <- beta_0 + beta_1 * x + e

I in Python:

import numpy as np
import pandas as pd

np.random.seed(234)
# Set the coefficients:
N = 50
beta_0 = 1
beta_1 = 0.5
# Generate sample data:
x = np.arange(start = 0, stop = N + 1, step = 1)
# Alternative, but NOT an np.ndarray
#x = list(range(0, N + 1))
e = np.random.normal(loc = 0, scale = 1, size = len(x))
y = beta_0 + beta_1 * x + e
# Plot the data
_ = plt.figure(num = 1, figsize = (10, 6))
_ = plt.plot(x, y, linestyle = "None", marker = "o",
markerfacecolor = 'none', color = "black")
plt.show()

[Figure: scatter plot of the generated sample (x, y).]
The Method of Moments (MM)
The (UR.2) condition implies that E(ε) = 0 as well as
Cov(ε, X ) = E(εX ) − E(ε)E(X ) = E(εX ) = 0. Using these properties
and the linear relationship of Y and X , we have the following relations:

{ E(ε) = 0
{ E(εX ) = 0               ⟺    { E[Y − β0 − β1 X ] = 0
{ ε = Y − β0 − β1 X              { E[X (Y − β0 − β1 X )] = 0

We have two unknown parameters and two equations, which we can use to
estimate the unknown parameters.
Going back to our random sample, we can estimate the unknown
parameters by replacing the expectations with their sample counterparts.
Then, we want to choose the estimates β̂0 and β̂1 such that:

{ (1/N) ∑_{i=1}^N (Yi − β̂0 − β̂1 Xi ) = 0
{ (1/N) ∑_{i=1}^N Xi (Yi − β̂0 − β̂1 Xi ) = 0        (3)
This is an example of the method of moments estimation approach.
Setting Ȳ = (1/N) ∑_{i=1}^N Yi , X̄ = (1/N) ∑_{i=1}^N Xi allows us to rewrite the
first equation in (3) as:

β̂0 = Ȳ − β̂1 X̄        (4)

and plug it into the second equation of (3) to get:

∑_{i=1}^N Xi (Yi − Ȳ + β̂1 X̄ − β̂1 Xi ) = 0   ⟹   ∑_{i=1}^N Xi (Yi − Ȳ) = β̂1 ∑_{i=1}^N Xi (Xi − X̄)

which gives us:

β̂1 = ∑_{i=1}^N Xi (Yi − Ȳ) / ∑_{i=1}^N Xi (Xi − X̄) = ∑_{i=1}^N (Xi − X̄)(Yi − Ȳ) / ∑_{i=1}^N (Xi − X̄)² = Ĉov(X , Y ) / V̂ar(X )        (5)

It is important to note that by this construction:
I the sample mean of the OLS residuals is always zero;
I the residuals do not correlate with X .
The expectation is also referred to as the first moment, which
we then estimate from a sample, hence the name - Method of
Moments.
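To make the sample moment conditions in (3) concrete, here is a minimal Python
sketch (not from the slides) that solves the two equations numerically, using the
simulated x and y from the example above; the result should agree with the
closed-form estimates in (4) and (5).

import numpy as np

# The sample moment conditions in (3) are linear in (b0, b1):
#   b0           + b1 * mean(x)    = mean(y)
#   b0 * mean(x) + b1 * mean(x**2) = mean(x * y)
A = np.array([[1.0,        np.mean(x)],
              [np.mean(x), np.mean(x**2)]])
b = np.array([np.mean(y), np.mean(x * y)])
b0_hat, b1_hat = np.linalg.solve(A, b)
print(b0_hat, b1_hat)  # same values as the closed-form MoM/OLS estimates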
Note: to get the second equality of equation (5) to hold, we used the
following properties:

∑_{i=1}^N Xi (Xi − X̄) = ∑_{i=1}^N (Xi − X̄)²

∑_{i=1}^N Xi (Yi − Ȳ) = ∑_{i=1}^N (Xi − X̄)(Yi − Ȳ)

We generally employ various relationships and different equality expressions
to simplify/reduce some other, more complex, expression. Sometimes, this
is not obvious at first glance.
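As a quick sanity check (not from the slides), both identities can be verified
numerically on the simulated x and y:

import numpy as np

print(np.isclose(np.sum(x * (x - np.mean(x))),
                 np.sum((x - np.mean(x))**2)))                       # True
print(np.isclose(np.sum(x * (y - np.mean(y))),
                 np.sum((x - np.mean(x)) * (y - np.mean(y)))))       # True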
For a lighthearted take on this, see:
I Siegfried, John J, 1970. “A First Lesson in Econometrics”, Journal of
Political Economy, University of Chicago Press, vol. 78(6), pages
1378-1379, Nov.-Dec.
I Eldridge, D. S. (2014), A Comment On Siegfried’s First Lesson In
Econometrics, Economic Inquiry, 52: 503-504.
There is an alternative way of approaching our task, and it is by
attempting to minimize the sum of the squared errors.
OLS - System of Partial Derivatives Method
Suppose that we choose β̂0 and β̂1 to minimize the sum of squared
residuals:

RSS = ∑_{i=1}^N ε̂i² = ∑_{i=1}^N (Yi − β̂0 − β̂1 Xi )²

The term Ordinary Least Squares (OLS) comes from the fact
that these estimates minimize the sum of squared residuals.
In order to minimize the RSS, we have to differentiate it with respect to β̂0
and β̂1 and equate the derivatives to zero in order to solve the following
system of equations:

{ ∂RSS/∂β̂0 = −2 ∑_{i=1}^N (Yi − β̂0 − β̂1 Xi ) = 0
{ ∂RSS/∂β̂1 = −2 ∑_{i=1}^N Xi (Yi − β̂0 − β̂1 Xi ) = 0        (6)

We get the same equation system as in (3). So the solution here will
have the same expression as for the Method of Moments estimators:

{ β̂1 = ∑_{i=1}^N (Xi − X̄)(Yi − Ȳ) / ∑_{i=1}^N (Xi − X̄)² = Ĉov(X , Y ) / V̂ar(X )
{ β̂0 = Ȳ − β̂1 X̄

β̂1 and β̂0 derived in this way are called the OLS estimates of the linear
regression parameters.
This lets us estimate the parameters from our generated sample data:
beta_1_est = np.cov(x, y, bias = True)[0][1] / np.var(x)
beta_0_est = np.mean(y) - beta_1_est * np.mean(x)
print("Estimated beta_0 = " + str(beta_0_est) +
". True beta_0 = " + str(beta_0))

## Estimated beta_0 = 0.8385362094098401. True beta_0 = 1


print("Estimated beta_1 = " + str(beta_1_est) +
". True beta_1 = " + str(beta_1))

## Estimated beta_1 = 0.5099221954865624. True beta_1 = 0.5


Note that:
I bias = True calculates the population covariance (with division by
N), whereas bias = False calculates the sample covariance (with
N − 1 instead of N);
I the variance function var() calculates the population variance.
Since we are dividing the covariance by the variance, we only need to be
consistent and calculate both of them for the population (or both for the
sample) - the N (or N − 1) factor cancels in the ratio.
# Covariance:
print(np.sum((x - np.mean(x)) * (y - np.mean(y))) / (len(x) - 1)) # N - 1

## 112.69280520253027
print(np.cov(x, y, bias = False)[0][1])

## 112.69280520253028
# Variance:
print(np.sum((x - np.mean(x))**2) / (len(x) - 1)) # N-1

## 221.0
print(np.sum((x - np.mean(x))**2) / (len(x))) # N

## 216.66666666666666
print(np.var(x))

## 216.66666666666666
We can also calculate them, in R:
beta_1_est <- cov(x, y) / var(x)
beta_0_est <- mean(y) - beta_1_est * mean(x)
print(paste0("Estimated beta_0 = ", beta_0_est,
". True beta_0 = ", beta_0))

## [1] "Estimated beta_0 = 0.833740099062318. True beta_0 = 1"


print(paste0("Estimated beta_1 = ", beta_1_est,
". True beta_1 = ", beta_1))

## [1] "Estimated beta_1 = 0.504789456221484. True beta_1 = 0.5"


where cov() and var() are calculated for the sample:
# Covariance
print(sum((x - mean(x)) * (y - mean(y)))/(length(x) - 1)) # N - 1

## [1] 111.5585
print(cov(x, y))

## [1] 111.5585
# Variance
print(sum((x - mean(x))^2)/(length(x) - 1)) # N - 1

## [1] 221
print(var(x))

## [1] 221
We can further re-write the estimates in a matrix notation, which is often
easier to implement in numerical calculations.
OLS - The Matrix Method
It is convenient to use matrices when solving equation systems. Looking at
our random sample equations:

{ Y1 = β0 + β1 X1 + ε1
{ Y2 = β0 + β1 X2 + ε2
{ ...
{ YN = β0 + β1 XN + εN

which we can re-write in the following matrix notation:

[ Y1 ]   [ 1  X1 ]            [ ε1 ]
[ Y2 ] = [ 1  X2 ]  [ β0 ]  + [ ε2 ]
[ ...]   [ ...   ]  [ β1 ]    [ ...]
[ YN ]   [ 1  XN ]            [ εN ]
Or, in a more compact form:

Y = Xβ + ε

where Y = [Y1 , ..., YN ]⊤ , ε = [ε1 , ..., εN ]⊤ , β = [β0 , β1 ]⊤ , and

    [ 1  X1 ]
X = [ 1  X2 ]
    [ ...   ]
    [ 1  XN ]

As in the previous case, we want to minimize the sum of squared residuals:

RSS(β) = ε⊤ ε
       = (Y − Xβ)⊤ (Y − Xβ)
       = Y⊤ Y − β⊤ X⊤ Y − Y⊤ Xβ + β⊤ X⊤ Xβ → min over β0 , β1
After using some matrix calculus (more specifically, scalar-by-vector
identities) and equating the partial derivative to zero:

∂RSS(β̂)/∂β̂ = −2X⊤ Y + 2X⊤ Xβ̂ = 0

gives us the OLS estimator:

β̂ = (X⊤ X)⁻¹ X⊤ Y        (OLS)

Note that:
I with the matrix notation we estimate both parameters at the
same time;
I whereas with the Method of Moments, or by taking partial derivatives
of the RSS and solving the equation system, we needed to first
estimate β̂1 via (5) and then plug it into (4) in order to estimate β̂0 .
While (OLS) is a general expression (as we will see in the Multivariable
Regression case), in this case we can express the OLS estimators for the
univariate regression as:

β̂ = [ β̂0 ]    where    β̂0 = ( (1/N) ∑_{i=1}^N Xi² · Ȳ − X̄ · (1/N) ∑_{i=1}^N Xi Yi ) / V̂ar(X ),
    [ β̂1 ]             β̂1 = Ĉov(X , Y ) / V̂ar(X )

(Homework: prove this formula. For a detailed derivation, see the relevant
section in the lecture notes.)
Note that in this case we have an exact expression for β̂0 , which does
not require estimating β̂1 beforehand.
Defining the estimates in the matrix notation is not only convenient (as
they can be generalized to the multivariate case) but, in some cases, even
faster to compute via software.
We will continue our example and show not only how to calculate the
parameters manually using vectorization, but also using the built-in OLS
estimation methods.
OLS via manual matrix calculation
I R:
x_mat <- cbind(1, x)
beta_mat <- solve(t(x_mat) %*% x_mat) %*% t(x_mat) %*% y
row.names(beta_mat) <- c("beta_0", "beta_1")
print(beta_mat)
## [,1]
## beta_0 0.8337401
## beta_1 0.5047895
I Python:
x_mat = np.column_stack((np.ones(len(x)), x))
beta_mat = np.dot(np.linalg.inv(np.dot(np.transpose(x_mat), x_mat)),
np.dot(np.transpose(x_mat), y))
print(beta_mat)
## [0.83853621 0.5099222 ]
alternatively:
np.linalg.inv(x_mat.transpose().dot(x_mat)).dot(x_mat.transpose()).dot(y)
## array([0.83853621, 0.5099222 ])
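As a side note (not from the slides): explicitly inverting X⊤X can be numerically
unstable when the regressors are nearly collinear, so in practice a least-squares
solver is often preferred. A minimal sketch, assuming the x_mat and y objects
defined in the Python code above:

import numpy as np

# Solve min ||y - x_mat @ beta||^2 directly, without forming the inverse of X'X:
beta_lstsq, res, rank, sv = np.linalg.lstsq(x_mat, y, rcond = None)
print(beta_lstsq)  # should match the (X'X)^{-1} X'y estimates above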
OLS via the built-in methods
Both R and Python have packages with these estimation methods already
defined - we only need to specify the data.
We will explore these functions in more detail in further lectures, but for
now we will show only the relevant output for the currently discussed
topic below:
I in R:

# Estimate the parameters
lm_result <- lm(y ~ 1 + x)
# Extract the parameter estimates
print(lm_result$coefficients)

## (Intercept)           x
##   0.8337401   0.5047895

Note that we can use y ~ x instead of y ~ 1 + x - the constant term
(Intercept) is added automatically.

I in Python:

import statsmodels.api as sm
# Add a constant column - not optional!
x_mat = sm.add_constant(x)
# Create the OLS regression object
lm_model = sm.OLS(y, x_mat)
# Estimate the parameters
lm_fit = lm_model.fit()
# Extract the parameter estimates
print(lm_fit.params)

## [0.83853621 0.5099222 ]
Relationship between estimates, residuals, fitted and actual
values
After obtaining the estimates β̂0 and β̂1 , we may want to examine the
following values:
I The fitted values of Y , which are defined as the following OLS
regression line (or more generally, the estimated regression line):

Ŷi = β̂0 + β̂1 Xi

where β̂0 and β̂1 are estimated via OLS. By definition, each fitted
value Ŷi is on the estimated OLS regression line.
I The residuals, which are defined as the difference between the
actual and fitted values of Y :

ε̂i = Yi − Ŷi = Yi − β̂0 − β̂1 Xi

which are hopefully close to the errors εi .

I in R:

# The unknown DGP:
y_dgp <- beta_0 + beta_1 * x
# The fitted values:
y_fit <- beta_mat[1] + beta_mat[2] * x

I in Python:

# The unknown DGP:
y_dgp = beta_0 + beta_1 * x
# The fitted values:
y_fit = beta_mat[0] + beta_mat[1] * x
[Figure: scatter plot of the sample with the true conditional mean E(Y|X) = β0 + β1 X and the fitted OLS regression line Ŷ = β̂0 + β̂1 X.]
We see that the estimated regression is very close to the true underlying
population regression (i.e. the true E(Y |X )).
Looking closer at the fitted values in a subset of the data, we can see
where the residuals originate and how the fitted values compare to the
sample data:

[Figure: scatter diagram of the (X, Y) sample data and the regression line Ŷ = β̂0 + β̂1 X, annotating Ŷ12 as a fitted value and ε̂12 = Y12 − Ŷ12 as the corresponding residual.]
Properties of the OLS estimator
From the construction of the OLS estimators the following properties apply
to the sample:
1. The sum (and by extension, the sample average) of the OLS residuals
is zero:

∑_{i=1}^N ε̂i = 0        (7)

This follows from the first equation of (3). The OLS estimates β̂0
and β̂1 are chosen such that the residuals sum up to zero, regardless
of the underlying sample data.
I R:
resid <- y - y_fit
print(paste0("Sum of the residuals: ", sum(resid)))
## [1] "Sum of the residuals: -1.05582209641852e-13"
I Python:
resid = y - y_fit
print("Sum of the residuals: " + str(sum(resid)))
## Sum of the residuals: 2.042810365310288e-14
We see that the sum is very close to zero.
2. The sample covariance between the regressors and the OLS residuals
is zero:

∑_{i=1}^N Xi ε̂i = 0        (8)

This follows from the second equation of (3). Because the sample
average of the OLS residuals is zero, ∑_{i=1}^N Xi ε̂i is proportional to
the sample covariance between Xi and ε̂i .
I R:
print(paste0("Sum of X*resid: ", sum(resid * x)))

## [1] "Sum of X*resid: -5.92859095149834e-13"


print(paste0("Sample covariance of X and residuals: ", cov(resid, x)))

## [1] "Sample covariance of X and residuals: 4.1182578180976e-14"


I Python:
print("Sum of X*resid: " + str(sum(np.array(resid) * np.array(x))))

## Sum of X*resid: -3.765876499528531e-13


print("Sample covariance of X and residuals: " + str(np.cov(resid, x)[0][1]))

## Sample covariance of X and residuals: -1.7815748876159887e-14


We see that both the sum and the sample covariance are very close to zero.
3. The point (X̄ , Ȳ ) is always on the OLS regression line - if we
calculate β̂0 + β̂1 X̄ , the resulting value equals Ȳ .
I R:
print(paste0("Predicted value with mean(X): ",
beta_mat[1] + beta_mat[2] * mean(x)))

## [1] "Predicted value with mean(X): 13.4534765045994"


print(paste0("Sample mean of Y: ", mean(y)))

## [1] "Sample mean of Y: 13.4534765045994"


I Python:
print("Predicted value with mean(X): " +
str(beta_mat[0] + beta_mat[1] * np.mean(x)))

## Predicted value with mean(X): 13.586591096573901


print("Sample mean of Y: " + str(np.mean(y)))

## Sample mean of Y: 13.5865910965739


We see that the predicted value is identical to the sample mean of Y .
The properties in the previous slides are not the only ones that justify
the use of the OLS method instead of some other competing estimator.
The main advantage of the OLS estimators can be summarized by the
following Gauss-Markov theorem:
Gauss-Markov theorem
Under the assumption that conditions (UR.1) - (UR.3) hold true,
the OLS estimators β̂0 and β̂1 are BLUE (Best Linear Unbiased
Estimators) and Consistent.
We will prove the above theorem and examine the acronym in more detail.
What is an Estimator?
An estimator is a rule that can be applied to any sample of data to
produce an estimate. In other words the estimator is the rule and the
estimate is the result.
So, eq. (OLS) is the rule and therefore an estimator.
Next we move on to the remaining components of the acronym BLUE.

OLS estimators are Linear

From the specification of the relationship between Y and X (using the
matrix notation for generality):

Y = Xβ + ε

we see that the model is linear with respect to Y. Moreover, the OLS
estimator β̂ = (X⊤ X)⁻¹ X⊤ Y is itself a linear function of Y, which is the
sense in which the OLS estimators are linear.


OLS estimators are Unbiased
Using the matrix notation for the sample linear equations (Y = Xβ + ε)
and plugging it into eq. (OLS) gives us the following:

β̂ = (X⊤ X)⁻¹ X⊤ Y
  = (X⊤ X)⁻¹ X⊤ (Xβ + ε)
  = (X⊤ X)⁻¹ X⊤ Xβ + (X⊤ X)⁻¹ X⊤ ε
  = β + (X⊤ X)⁻¹ X⊤ ε

If we take the expectation of both sides and use the law of total
expectation:

E[β̂] = β + E[(X⊤ X)⁻¹ X⊤ ε]
     = β + E[ E[(X⊤ X)⁻¹ X⊤ ε | X] ]
     = β + E[ (X⊤ X)⁻¹ X⊤ E(ε|X) ]
     = β

since E(ε|X) = 0 from (UR.2). We have shown that E[β̂] = β.
Some more notes on unbiasedness:
I Unbiasedness does not guarantee that the estimate we get with any
particular sample is equal (or even very close) to β.
I It means that if we could repeatedly draw random samples from the
population and compute the estimate each time, then the average of
these estimates would be (very close to) β.
I However, in most applications we have just one random sample to
work with, though as we will see later on, there are methods for
creating additional samples from the available data by creating and
analysing different subsamples.
OLS estimators are Best (Efficient)
I When there is more than one unbiased method of estimation to
choose from, that estimator which has the lowest variance is the best.
I We want to show that OLS estimators are best in the sense that β̂
is an efficient estimator of β (i.e. it has the smallest variance).
To do so, we will calculate the variance - the average squared deviation
from the mean - as follows (remember that for OLS estimators
condition (UR.3) holds true):
From the proof of unbiasedness of the OLS we have that:

β̂ = β + (X⊤ X)⁻¹ X⊤ ε    ⟹    β̂ − β = (X⊤ X)⁻¹ X⊤ ε

which we can then use for calculating the variance-covariance matrix of
the OLS estimator:

Var(β̂) = E[(β̂ − E(β̂))(β̂ − E(β̂))⊤] = E[(β̂ − β)(β̂ − β)⊤]
       = E[((X⊤ X)⁻¹ X⊤ ε)((X⊤ X)⁻¹ X⊤ ε)⊤] = E[(X⊤ X)⁻¹ X⊤ εε⊤ X (X⊤ X)⁻¹]
       = (X⊤ X)⁻¹ X⊤ E(εε⊤) X (X⊤ X)⁻¹ = (X⊤ X)⁻¹ X⊤ (σ² I) X (X⊤ X)⁻¹
       = σ² (X⊤ X)⁻¹ X⊤ X (X⊤ X)⁻¹ = σ² (X⊤ X)⁻¹
For the univariate case, we have already calculated (X⊤ X)⁻¹ , which leads
to:

Var(β̂0 ) = σ² · ∑_{i=1}^N Xi² / ( N ∑_{i=1}^N (Xi − X̄)² )

Var(β̂1 ) = σ² · 1 / ∑_{i=1}^N (Xi − X̄)²

These correspond to the diagonal elements of the variance-covariance
matrix:

Var(β̂) = [ Var(β̂0 )        Cov(β̂0 , β̂1 ) ]
         [ Cov(β̂1 , β̂0 )   Var(β̂1 )      ]

Note that we applied the following relationship:

N ∑_{i=1}^N Xi² − ( ∑_{i=1}^N Xi )² = N ∑_{i=1}^N (Xi − X̄)² .
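As an illustrative check (not from the slides), we can compare the matrix formula
σ²(X⊤X)⁻¹ with the two univariate expressions above, plugging in the true error
variance σ² = 1 used to simulate the data and the x and x_mat objects from the
Python code above:

import numpy as np

sigma2 = 1.0  # true error variance used in the simulation
var_beta_matrix = sigma2 * np.linalg.inv(np.dot(np.transpose(x_mat), x_mat))

# Univariate expressions for the diagonal elements:
var_beta0 = sigma2 * np.sum(x**2) / (len(x) * np.sum((x - np.mean(x))**2))
var_beta1 = sigma2 / np.sum((x - np.mean(x))**2)

print(np.diag(var_beta_matrix))   # matrix formula
print(var_beta0, var_beta1)       # should match the diagonal above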
Next, assume that we have some other estimator of β, which is also
unbiased and can be expressed as:

β̃ = [ (X⊤ X)⁻¹ X⊤ + D ] Y = CY

Then, since E[ε] = 0:

E[β̃] = E[ ((X⊤ X)⁻¹ X⊤ + D)(Xβ + ε) ]
     = ((X⊤ X)⁻¹ X⊤ + D) E[Xβ + ε]
     = ((X⊤ X)⁻¹ X⊤ + D) Xβ
     = (I + DX) β
     = β    ⟺    DX = 0

So, β̃ is unbiased if and only if DX = 0.

Again, it is important that a competing estimator is unbiased. Hence
why we have chosen a specific form for β̃.
Then, we can calculate its variance as:

Var(β̃) = Var(CY)
       = C Var(Xβ + ε) C⊤
       = σ² CC⊤
       = σ² ((X⊤ X)⁻¹ X⊤ + D)(X (X⊤ X)⁻¹ + D⊤)
       = σ² (X⊤ X)⁻¹ X⊤ X (X⊤ X)⁻¹ + σ² (X⊤ X)⁻¹ (DX)⊤
         + σ² DX (X⊤ X)⁻¹ + σ² DD⊤
       = σ² [ (X⊤ X)⁻¹ + DD⊤ ]
       = Var(β̂) + σ² DD⊤ ≥ Var(β̂)

since DD⊤ is a positive semi-definite matrix.

This means that β̂ has the smallest variance.
Estimating the variance parameter of the error term
We see an immediate problem from the OLS estimator variance formulas -
we do not know the true error variance σ².
However, we can estimate it by calculating the sample residual variance:

σ̂² = s² = ε̂⊤ ε̂ / (N − 2) = (1/(N − 2)) ∑_{i=1}^N ε̂i²

Note that if we take N instead of N − 2 for the univariate regression
case in the denominator, then the variance estimate would be biased.
This is because the variance estimator would not account for the two
restrictions that must be satisfied by the OLS residuals, namely (7) and
(8):

∑_{i=1}^N ε̂i = 0,    ∑_{i=1}^N ε̂i Xi = 0

So, we take N − 2 instead of N because of the number of restrictions on
the residuals.
Using σ̂² allows us to calculate the estimate of Var(β̂), i.e. we can
calculate V̂ar(β̂). One way to view these restrictions is this: if we know
N − 2 of the residuals, we can always get the other two residuals by using
the restrictions implied by (7) and (8).
Thus, there are only N − 2 degrees of freedom in the OLS residuals, as
opposed to N degrees of freedom in the errors.
Note that this is an estimated variance. Nevertheless, it is a key
component in assessing the accuracy of the parameter estimates (when
calculating test statistics and confidence intervals).
Since we estimate β̂ from a random sample, the estimator β̂ is a
random variable as well. We can measure the uncertainty of β̂ via its
standard deviation. This leads to the standard error of our estimate of β:
The square roots of the diagonal elements of the estimated variance-
covariance matrix V̂ar(β̂) are called the standard errors (se) of the
corresponding OLS estimators β̂, which we use to estimate how far β̂i
typically deviates from βi :

se(β̂i ) = √( V̂ar(β̂i ) )

The standard errors describe the accuracy of an estimator (the
smaller, the better).
I The standard errors are measures of the sampling variability of the
least squares estimates β̂0 and β̂1 in repeated samples;
I If we collect a number of different data samples, the OLS estimates
will be different for each sample. As such, the OLS estimators are
random variables and have their own distribution.
Now is also a good time to highlight the difference between the errors and
the residuals.

I The random sample, taken from a Data Generating
Process (i.e. the population), is described via

Yi = β0 + β1 Xi + εi

where εi is the error for observation i.

I After estimating the unknown parameters β0 , β1 , we can
re-write the equation as:

Yi = β̂0 + β̂1 Xi + ε̂i

where ε̂i is the residual for observation i.

The errors show up in the underlying (i.e. true) DGP equation,
while the residuals show up in the estimated equation. The errors
are never observed, while the residuals are calculated from
the data.
We can also re-write the residuals in terms of the error term and the
difference between the true and estimated parameters:

ε̂i = Yi − Ŷi = β0 + β1 Xi + εi − (β̂0 + β̂1 Xi ) = εi − (β̂0 − β0 ) − (β̂1 − β1 ) Xi
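This decomposition can be checked on the simulated example (a check not in the
slides, using the Python objects e, resid, x and beta_mat defined earlier):

import numpy as np

# resid_i should equal e_i - (beta_0_hat - beta_0) - (beta_1_hat - beta_1) * x_i
recon = e - (beta_mat[0] - beta_0) - (beta_mat[1] - beta_1) * x
print(np.allclose(resid, recon))  # True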

Going back to our example, we can estimate the standard errors of our
coefficients by applying the formulas that we have seen:
I R:
sigma2_est <- sum(resid^2) / (length(x) - 2)
var_beta <- sigma2_est * solve(t(x_mat) %*% x_mat)
print(sqrt(diag(var_beta)))

## x
## 0.266578700 0.009188728
I Python:
sigma2_est = sum(resid**2) / (len(x) - 2)
var_beta = sigma2_est * np.linalg.inv(np.dot(np.transpose(x_mat), x_mat))
print(np.sqrt(np.diag(var_beta)))

## [0.26999713 0.00930656]
We can also use the built-in functions, just like we did with the coefficients:
I R:
out <- summary(lm_result)
print(out$coefficients[, 2, drop = FALSE])

## Std. Error
## (Intercept) 0.266578700
## x 0.009188728
I Python:
print(lm_fit.bse)

## [0.26999713 0.00930656]
This highlights a potential problem, which will be addressed in later
lectures concerning model adequacy/goodness of fit: if the residuals are
large (since their mean is always zero, this means that their estimated
variance is large), then the standard errors of the coefficients are
large as well.
OLS estimators are Consistent
A consistent estimator has the property that, as the number of data points
(which are used to estimate the parameters) increases (i.e. N → ∞), the
estimates converge in probability to the true parameter, i.e.:
Definition
Let WN be an estimator of a parameter θ based on a sample Y1 , ..., YN .
Then we say that WN is a consistent estimator of θ if ∀ε > 0:

P (|WN − θ| > ε) → 0, as N → ∞

We can denote this as WN →p θ or plim(WN ) = θ. If WN is not consistent,
then we say that WN is inconsistent.
I Unlike unbiasedness, consistency involves the behavior of the
sampling distribution of the estimator as the sample size N gets large
- the distribution of WN becomes more and more concentrated about
θ. In other words, for larger sample sizes, WN is less and less likely to
be very far from θ.
I An inconsistent estimator does not help us learn about θ, regardless
of the size of the data sample.
I For this reason, consistency is a minimal requirement of an
estimator used in statistics or econometrics.
Unbiased estimators are not necessarily consistent, but those whose
variances shrink to zero as the sample size grows are consistent.
For some examples, see the lecture notes.
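As a small illustration of this point (not from the lecture notes), consider
estimating a population mean θ: the estimator WN = Y1 (just the first
observation) is unbiased but inconsistent, since its variance never shrinks,
while the sample mean is unbiased and consistent. A minimal simulation sketch:

import numpy as np

rng = np.random.default_rng(0)
theta = 2.0
for n in (10, 100, 1000):
    samples = rng.normal(loc = theta, scale = 1.0, size = (2000, n))
    w_first = samples[:, 0]          # unbiased, but its variance stays at 1
    w_mean = samples.mean(axis = 1)  # unbiased AND consistent: variance is 1/n
    print(n, w_first.var().round(3), w_mean.var().round(5))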
Going back to our OLS estimators β̂0 and β̂1 , as N → ∞ we have that:

β̂  →  [ β0 ]  +  (1 / Var(X )) [ E(ε) · E(X ²) − E(X ) · E(X ε) ]
       [ β1 ]                   [ E(X ε) − E(X ) · E(ε)         ]

    =  [ β0 ]  +  (1 / Var(X )) [ 0 ]   =  [ β0 ]
       [ β1 ]                   [ 0 ]      [ β1 ]

since E(ε) = 0 and E(X ε) = E(X ε) − E(X )E(ε) = Cov(X , ε) = 0.

This means that β̂ → β, as N → ∞.
So, the OLS parameter vector β̂ is a consistent estimator of β.
Practical illustration of the OLS properties
We will return to our example in this chapter. We have proved the
unbiasedness and consistency of OLS estimators.
To illustrate these properties empirically, we will:
I generate 5000 replications (i.e. different samples) for each of the
different sample sizes N ∈ {11, 101, 1001}.
I for each replication of each sample size we will estimate the unknown
regression parameters β;
I for each sample size, we will calculate the average of these parameter
vectors.
This method of using repeated sampling is also known as a Monte Carlo
method.
The extensive code can be found in the lecture notes.
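A condensed Python sketch of such a Monte Carlo experiment (a simplified
stand-in for the extensive code in the lecture notes; the coefficients and the
DGP follow the simulated example above):

import numpy as np

np.random.seed(123)
beta_0, beta_1 = 1.0, 0.5
R = 5000  # number of replications

for n in (10, 100, 1000):
    x = np.arange(n)                                 # fixed regressor values
    x_mat = np.column_stack((np.ones(n), x))
    estimates = np.empty((R, 2))
    for r in range(R):
        y = beta_0 + beta_1 * x + np.random.normal(size = n)
        estimates[r] = np.linalg.lstsq(x_mat, y, rcond = None)[0]
    print("N =", n,
          "mean:", estimates.mean(axis = 0).round(5),
          "variance:", estimates.var(axis = 0).round(5))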
In our experimentation the true parameter values are:
## True beta_0 = 1. True beta_1 = 0.5
while the average values of the parameters from 5000 different samples for
each sample size is:
## With N = 10:
## the AVERAGE of the estimated parameters:
## beta_0: 1.00437
## beta_1: 0.49984
## With N = 100:
## the AVERAGE of the estimated parameters:
## beta_0: 0.99789
## beta_1: 0.50006
## With N = 1000:
## the AVERAGE of the estimated parameters:
## beta_0: 0.99957
## beta_1: 0.5
The variance of these estimates can also be examined:
## With N = 10:
## the VARIANCE of the estimated parameters:
## beta_0: 0.3178
## beta_1: 0.00904
## With N = 100:
## the VARIANCE of the estimated parameters:
## beta_0: 0.03896
## beta_1: 1e-05
## With N = 1000:
## the VARIANCE of the estimated parameters:
## beta_0: 0.00394
## beta_1: 0.0
Note that unbiasedness is true for any N, while consistency is an
asymptotic property, i.e. it holds when N → ∞.
We can see that:
I The mean of the estimated parameters is close to the true
parameter values regardless of the sample size.
I The variance of the estimated parameters decreases with larger
sample size, i.e. the larger the sample size, the closer our
estimated parameters will be to the true values.
[Figure: histograms of the estimated β̂0 (left column) and β̂1 (right column) across the 5000 replications for N = 10, N = 100 and N = 1000; the distributions narrow around the true values as N grows.]
We see that the histograms of the OLS estimators have a bell-shaped
distribution.

Under assumption (UR.4) it can be shown that, since ε|X ∼ N (0, σ² I),
the linear combination of the errors in β̂ = β + (X⊤ X)⁻¹ X⊤ ε will
also be normal, i.e.

β̂|X ∼ N ( β, σ² (X⊤ X)⁻¹ )
I We have shown how to derive an OLS estimation method in order to
estimate the unknown parameters of a linear regression with one
variable.
I We have also shown that these estimators (under conditions (UR.1) -
(UR.3)) are good in the sense that, as the sample size increases, the
estimated parameter values approach the true parameter values and
that the average of all the estimators, estimated from a number of
different random samples, is very close to the true underlying
parameter vector.
Examples using empirical data
From the lecture notes Ch. 3.10, select one (or more) dataset and do the
tasks from Exercise Set 1 of Ch. 3.10. See Ch. 3.11 for an
example.
