Univariate Regression
Y = f(X, ε)
where:
I f (·) is called a regression function;
I X is called the independent variable, the control variable, the
explanatory variable, the predictor variable, or the regressor;
I Y is called the dependent variable, the response variable, the
explained variable, the predicted variable or the regressand;
I ε is called the random component, the error term, the disturbance or the (economic) shock.
Y is a random variable (r.v.), which depends on the r.v. ε and on X (either a r.v., or not). The simplest functional form of f(·) is linear:

Y = β0 + β1 X + ε    (1)

Under the assumption that E(ε|X) = 0 (see (UR.2) below), the conditional expectation of Y is:

E(β0 + β1 X + ε | X) = β0 + β1 X
and:
I β0 - the intercept parameter, sometimes called the constant term -
is the average value of Y when X = 0. It often has no econometric
meaning, especially if X cannot take a zero value (e.g. if X is wage);
I β1 - the slope parameter - is the average change in Y when X increases by one unit (see the worked equation below).
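To make the slope interpretation concrete, a short worked equation using the conditional mean above: for any value x of X,

$$E(Y \mid X = x + 1) - E(Y \mid X = x) = \left[\beta_0 + \beta_1 (x + 1)\right] - \left[\beta_0 + \beta_1 x\right] = \beta_1,$$

so a one-unit increase in X changes the average value of Y by β1.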
It may also be convenient to describe random variables as empirical
phenomena, that have some statistical regularity but not deterministic
regularity.
Regression and The Sample Data
Assume that the relationship between Y and X (i.e. the data generating
process) is a linear one as discussed before:
Y = β0 + β1 X + ε
I R:

set.seed(1)
beta_0 <- 1
beta_1 <- 2
N <- 50
x <- rnorm(mean = 0, sd = 5, n = N)
eps <- rnorm(mean = 0, sd = 1, n = length(x))
y <- beta_0 + beta_1 * x + eps

I Python:

import numpy as np

np.random.seed(1)
beta_0 = 1
beta_1 = 2
N = 50
x = np.random.normal(loc = 0, scale = 5, size = N)
eps = np.random.normal(loc = 0, scale = 1, size = len(x))
y = beta_0 + beta_1 * x + eps
[Figure: scatter plot of the sample data together with the conditional mean line E(Y|X) = β0 + β1X]
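The figure above can be reproduced in Python roughly as follows - a minimal sketch, assuming matplotlib is used for plotting and reusing the x, y, beta_0 and beta_1 generated above (y_ce denotes the conditional expectation, the same quantity used in the R code below):

import matplotlib.pyplot as plt

y_ce = beta_0 + beta_1 * x  # the conditional expectation E(Y|X)
x_sorted = np.sort(x)
_ = plt.plot(x, y, linestyle = "None", marker = "o",
             markerfacecolor = "none", color = "black", label = "Sample data")
_ = plt.plot(x_sorted, beta_0 + beta_1 * x_sorted, color = "red",
             label = "E(Y|X) = beta_0 + beta_1 * X")
_ = plt.legend(loc = "upper left")
plt.show()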
We can do this in R as well:
y_ce <- beta_0 + beta_1 * x  # the conditional expectation E(Y|X)
plot(x = x, y = y)
lines(x = x, y = y_ce, col = "red")
legend("topleft",
       legend = c(expression(paste("E(Y|X) = ", beta[0] + beta[1] * X)),
                  "Sample data"),
       lty = c(1, NA), lwd = c(1, NA), pch = c(NA, 1), col = c("red", "black"))
[Figure: R scatter plot of the sample data (x, y) with the conditional mean line E(Y|X) = β0 + β1X shown in red]
We see that, on average, the conditional expectation captures the relationship between Y and X. Since we do not measure any other variables (whose possible effects on Y are consequently gathered in ε), the sample values are scattered around the conditional mean of the process.
Note:
I In this particular example, we could have alternatively taken X to be
a non-random sequence of values - since the arrangement of the
values does not matter, the horizontal axis is always ordered, so we
would get a similar scatter plot as before.
I In practical applications this is not the case - in order to determine if
X is random, we need to examine its variable definition (e.g. if X is
the number of holidays in a year - this is, usually, a non-random
number).
Parameter Terminology
In general, we do not know the underlying true values of the coefficients β0 and β1. We will denote by β̂1 the estimate of the unknown parameter β1. When we talk about β̂1 as an estimator, we can also talk about its mean E(β̂1), variance Var(β̂1) and distribution, which are very important when determining whether a particular estimation method is better than some alternative one.
I log(Y) = β0 + β1 · (1/X) + ε is also a linear model, since it is linear in the parameters β0 and β1;
I (UR.2) implies that E(Yi | Xi) = β0 + β1 Xi, ∀i = 1, ..., N;
I In a linear model, ε and Y are always dependent, thus Cov(Y, ε) ≠ 0. However, ε may (or may not) depend on X. This leads us to further requirements for ε.
(UR.3) The error term has the same variance given any value of
the explanatory variable (i.e. homoskedasticity) and the error terms
are not correlated across observations (i.e. no autocorrelation):
$$\text{Var}\left(\boldsymbol{\varepsilon} \mid \mathbf{X}\right) = \begin{bmatrix} \text{Var}(\varepsilon_1) & \dots & \text{Cov}(\varepsilon_1, \varepsilon_N) \\ \text{Cov}(\varepsilon_2, \varepsilon_1) & \dots & \text{Cov}(\varepsilon_2, \varepsilon_N) \\ \vdots & \ddots & \vdots \\ \text{Cov}(\varepsilon_N, \varepsilon_1) & \dots & \text{Var}(\varepsilon_N) \end{bmatrix} = \sigma^2 \mathbf{I} \qquad \text{(UR.3)}$$
$$\boldsymbol{\varepsilon} \mid \mathbf{X} \sim \mathcal{N}\left(\mathbf{0},\ \sigma^2 \mathbf{I}\right) \qquad \text{(UR.4)}$$
The estimated (fitted) regression line is then Ŷ = β̂0 + β̂1 X.
Our random sample (X1 , Y1 ), ..., (XN , YN ) comes from the data generating
process:
Yi = β0 + β1 Xi + εi,  i = 1, ..., N
We want to use the data to obtain estimates of the intercept β0 and the
slope β1 . We will assume that the process satisfies (UR.1) - (UR.4)
conditions.
Simulated Data Example
We will use the following generated data sample for the estimation
methods outlined later on:
I β0 = 1, β1 = 0.5, Xi = i − 1, εi ∼ N(0, 1²), i = 1, ..., 51
I R:

set.seed(234)
# Set the coefficients:
N <- 50
beta_0 <- 1
beta_1 <- 0.5
# Generate sample data:
x <- 0:N
e <- rnorm(mean = 0, sd = 1, n = length(x))
y <- beta_0 + beta_1 * x + e

I Python:

import numpy as np
import pandas as pd

np.random.seed(234)
# Set the coefficients:
N = 50
beta_0 = 1
beta_1 = 0.5
# Generate sample data:
x = np.arange(start = 0, stop = N + 1, step = 1)
# Alternative, but NOT an np.ndarray:
# x = list(range(0, N + 1))
e = np.random.normal(loc = 0, scale = 1, size = len(x))
y = beta_0 + beta_1 * x + e
import matplotlib.pyplot as plt

# Plot the data
_ = plt.figure(num = 1, figsize = (10, 6))
_ = plt.plot(x, y, linestyle = "None", marker = "o",
             markerfacecolor = 'none', color = "black")
plt.show()
[Figure: scatter plot of the simulated sample data (Xi, Yi), i = 1, ..., 51]
The Method of Moments (MM)
The (UR.2) condition states that E(ε) = 0 as well as Cov(ε, X) = E(εX) − E(ε)E(X) = E(εX) = 0. Using these properties and the linear relationship between Y and X, we have the following relations:

$$\begin{cases} E(\varepsilon) = 0 \\ E(\varepsilon X) = 0 \end{cases} \overset{\varepsilon\, =\, Y - \beta_0 - \beta_1 X}{\Longleftrightarrow} \begin{cases} E\left[Y - \beta_0 - \beta_1 X\right] = 0 \\ E\left[X\left(Y - \beta_0 - \beta_1 X\right)\right] = 0 \end{cases}$$
We have two unknown parameters and two equations, which we can use to
estimate the unknown parameters.
Going back to our random sample, we can estimate the unknown
parameters by replacing the expectation with its sample counterpart.
Then, we want to choose the estimates β̂0 and β̂1 such that:

$$\begin{cases} \dfrac{1}{N}\sum_{i=1}^N \left[ Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i \right] = 0 \\[1em] \dfrac{1}{N}\sum_{i=1}^N X_i \left[ Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i \right] = 0 \end{cases} \qquad (3)$$
This is an example of the method of moments estimation approach.
Setting $\overline{Y} = \frac{1}{N}\sum_{i=1}^N Y_i$ and $\overline{X} = \frac{1}{N}\sum_{i=1}^N X_i$ allows us to rewrite the first equation in (3) as:

$$\widehat{\beta}_0 = \overline{Y} - \widehat{\beta}_1 \overline{X} \qquad (4)$$
and plug it into the second equation of (3) to get:

$$\sum_{i=1}^N X_i \left( Y_i - \overline{Y} + \widehat{\beta}_1 \overline{X} - \widehat{\beta}_1 X_i \right) = 0 \implies \sum_{i=1}^N X_i \left( Y_i - \overline{Y} \right) = \widehat{\beta}_1 \sum_{i=1}^N X_i \left( X_i - \overline{X} \right)$$

which gives the estimator of the slope parameter:

$$\widehat{\beta}_1 = \frac{\sum_{i=1}^N X_i \left( Y_i - \overline{Y} \right)}{\sum_{i=1}^N X_i \left( X_i - \overline{X} \right)} \qquad (5)$$
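As an illustration, the sample moment conditions in (3) can also be solved numerically as a 2 x 2 linear system; a minimal Python sketch using the simulated x and y generated above (the variable names are illustrative):

import numpy as np

# Moment conditions (3) rewritten as a linear system in (beta_0_hat, beta_1_hat):
#   beta_0_hat           + beta_1_hat * mean(x)   = mean(y)
#   beta_0_hat * mean(x) + beta_1_hat * mean(x^2) = mean(x * y)
A = np.array([[1.0,        np.mean(x)],
              [np.mean(x), np.mean(x**2)]])
c = np.array([np.mean(y), np.mean(x * y)])
beta_mm = np.linalg.solve(A, c)
print(beta_mm)  # method of moments estimates of beta_0 and beta_1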
The term Ordinary Least Squares (OLS) comes from the fact
that these estimates minimize the sum of squared residuals.
In order to minimize the residual sum of squares

$$\text{RSS}(\widehat{\beta}_0, \widehat{\beta}_1) = \sum_{i=1}^N \widehat{\varepsilon}_i^{\,2} = \sum_{i=1}^N \left( Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i \right)^2,$$

we differentiate it with respect to β̂0 and β̂1 and equate the derivatives to zero, which gives the following system of equations:

$$\begin{cases} \dfrac{\partial \text{RSS}}{\partial \widehat{\beta}_0} = -2\sum_{i=1}^N \left( Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i \right) = 0 \\[1em] \dfrac{\partial \text{RSS}}{\partial \widehat{\beta}_1} = -2\sum_{i=1}^N X_i \left( Y_i - \widehat{\beta}_0 - \widehat{\beta}_1 X_i \right) = 0 \end{cases} \qquad (6)$$
We get the same equation system as in (3). So the solution here will
have the same expression as for the Method of Moments estimators.
$$\widehat{\beta}_1 = \frac{\sum_{i=1}^N \left( X_i - \overline{X} \right)\left( Y_i - \overline{Y} \right)}{\sum_{i=1}^N \left( X_i - \overline{X} \right)^2} = \frac{\widehat{\text{Cov}}(X, Y)}{\widehat{\text{Var}}(X)}, \qquad \widehat{\beta}_0 = \overline{Y} - \widehat{\beta}_1 \overline{X}$$

β̂1 and β̂0 derived in this way are called the OLS estimates of the linear regression parameters.
This lets us estimate the parameters from our generated sample data:
beta_1_est = np.cov(x, y, bias = True)[0][1] / np.var(x)
beta_0_est = np.mean(y) - beta_1_est * np.mean(x)
print("Estimated beta_0 = " + str(beta_0_est) +
". True beta_0 = " + str(beta_0))
## 112.69280520253027
print(np.cov(x, y, bias = False)[0][1])
## 112.69280520253028
# Variance:
print(np.sum((x - np.mean(x))**2) / (len(x) - 1)) # N-1
## 221.0
print(np.sum((x - np.mean(x))**2) / (len(x))) # N
## 216.66666666666666
print(np.var(x))
## 216.66666666666666
We can also calculate them, in R:
beta_1_est <- cov(x, y) / var(x)
beta_0_est <- mean(y) - beta_1_est * mean(x)
print(paste0("Estimated beta_0 = ", beta_0_est,
". True beta_0 = ", beta_0))
## [1] 111.5585
print(cov(x, y))
## [1] 111.5585
# Variance
print(sum((x - mean(x))^2)/(length(x) - 1)) # N - 1
## [1] 221
print(var(x))
## [1] 221
We can further rewrite the estimates in matrix notation, which is often easier to implement in numerical calculations.
OLS - The Matrix Method
It is convenient to use matrices when solving equation systems. Looking at
our random sample equations:
$$\begin{aligned} Y_1 &= \beta_0 + \beta_1 X_1 + \varepsilon_1 \\ Y_2 &= \beta_0 + \beta_1 X_2 + \varepsilon_2 \\ &\ \ \vdots \\ Y_N &= \beta_0 + \beta_1 X_N + \varepsilon_N \end{aligned}$$

or, in matrix form:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where $\mathbf{Y} = [Y_1, ..., Y_N]^\top$, $\boldsymbol{\varepsilon} = [\varepsilon_1, ..., \varepsilon_N]^\top$, $\boldsymbol{\beta} = [\beta_0, \beta_1]^\top$ and $\mathbf{X} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_N \end{bmatrix}$.
As in the previous case, we want to minimize the sum of squared residuals:
$$\begin{aligned} \text{RSS}(\boldsymbol{\beta}) &= \boldsymbol{\varepsilon}^\top \boldsymbol{\varepsilon} \\ &= \left(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\right)^\top \left(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}\right) \\ &= \mathbf{Y}^\top\mathbf{Y} - \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{Y} - \mathbf{Y}^\top\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}^\top\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} \rightarrow \min_{\beta_0, \beta_1} \end{aligned}$$
After using some matrix calculus (more specifically scalar-by-vector
identities) and equating the partial derivative to zero:
$$\frac{\partial\, \text{RSS}(\widehat{\boldsymbol{\beta}})}{\partial \widehat{\boldsymbol{\beta}}} = -2\mathbf{X}^\top\mathbf{Y} + 2\mathbf{X}^\top\mathbf{X}\widehat{\boldsymbol{\beta}} = \mathbf{0}$$

$$\widehat{\boldsymbol{\beta}} = \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{Y} \qquad \text{(OLS)}$$
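To connect this formula with the code used later on, here is a minimal Python sketch of (OLS) applied to the simulated data; the design matrix is called x_mat here, which is presumably how the x_mat object appearing in the standard error calculations further below is constructed:

import numpy as np

# Design matrix: a column of ones (for the intercept) next to the regressor x
x_mat = np.column_stack((np.ones(len(x)), x))
# OLS: beta_hat = (X'X)^{-1} X'Y
beta_est = np.linalg.inv(x_mat.T @ x_mat) @ (x_mat.T @ y)
print(beta_est)  # [beta_0 estimate, beta_1 estimate]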
Note that:
I with the matrix notation we estimate both parameters at the
same time;
I whereas with the Method of Moments, or by taking partial derivatives from the RSS and solving the equation system, we needed to first estimate β̂1 via (5) and then plug it into (4) in order to estimate β̂0.
While (OLS) is a general expression (as we will see in the Multivariable
Regression case), in this case, we can express the OLS estimators for the
univariate regression as:
(Homework: prove the second formula.)

$$\widehat{\boldsymbol{\beta}} = \begin{bmatrix} \widehat{\beta}_0 \\[0.5em] \widehat{\beta}_1 \end{bmatrix} = \begin{bmatrix} \dfrac{\overline{X^2} \cdot \overline{Y} - \overline{X} \cdot \overline{XY}}{\widehat{\text{Var}}(X)} \\[1.2em] \dfrac{\widehat{\text{Cov}}(X, Y)}{\widehat{\text{Var}}(X)} \end{bmatrix}$$

where the bars denote sample averages, e.g. $\overline{X^2} = \frac{1}{N}\sum_{i=1}^N X_i^2$ and $\overline{XY} = \frac{1}{N}\sum_{i=1}^N X_i Y_i$.
I The fitted values are defined as:

Ŷi = β̂0 + β̂1 Xi

where β̂0 and β̂1 are estimated via OLS. By definition, each fitted value Ŷi is on the estimated OLS regression line.
I The residuals, which are defined as the difference between the actual and fitted values of Y:

ε̂i = Yi − Ŷi = Yi − β̂0 − β̂1 Xi
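In code, the fitted values follow directly from the estimated coefficients; a minimal Python sketch (the y_fit object used in the property checks below is presumably obtained this way, with an analogous line in R):

y_fit = beta_0_est + beta_1_est * x  # fitted values: points on the estimated OLS regression line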
[Figure: the simulated sample data together with the true conditional mean E(Y|X) = β0 + β1X and the fitted OLS regression line Ŷ = β̂0 + β̂1X]
We see that the estimated regression is very close to the true underlying
population regression (i.e. the true E(Y |X )).
Looking closer at the fitted values in a subset of the data, we can see
where the residuals originate and how the fitted values compare to the
sample data:
[Figure: a zoomed-in view of the fitted OLS line Ŷ = β̂0 + β̂1X, showing the observations Y8 and Y12 at x8 and x12, the fitted values Ŷ8 and Ŷ12 on the line, and the residual ε̂12 as the vertical distance between Y12 and Ŷ12]
Properties of the OLS estimator
From the construction of the OLS estimators the following properties apply
to the sample:
1. The sum (and by extension, the sample average) of the OLS residuals
is zero:
$$\sum_{i=1}^N \widehat{\varepsilon}_i = 0 \qquad (7)$$

This follows from the first equation of (3). The OLS estimates β̂0 and β̂1 are chosen such that the residuals sum to zero, regardless of the underlying sample data.
I R:
resid <- y - y_fit
print(paste0("Sum of the residuals: ", sum(resid)))
## [1] "Sum of the residuals: -1.05582209641852e-13"
I Python:
resid = y - y_fit
print("Sum of the residuals: " + str(sum(resid)))
## Sum of the residuals: 2.042810365310288e-14
We see that the sum is very close to zero.
2. The sample covariance between the regressors and the OLS residuals
is zero:
$$\sum_{i=1}^N X_i \widehat{\varepsilon}_i = 0 \qquad (8)$$

This follows from the second equation of (3). Because the sample average of the OLS residuals is zero, $\sum_{i=1}^N X_i \widehat{\varepsilon}_i$ is proportional to the sample covariance between Xi and ε̂i.
I R:
print(paste0("Sum of X*resid: ", sum(resid * x)))
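I Python (a sketch of the same check; as with the first property, the result is expected to be numerically close to zero rather than exactly zero):

print("Sum of X*resid: " + str(sum(resid * x)))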
OLS estimators are Unbiased
Since the data come from Y = Xβ + ε, we have:

$$\widehat{\boldsymbol{\beta}} = \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{Y} = \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\left(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\right) = \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{X}\boldsymbol{\beta} + \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon} = \boldsymbol{\beta} + \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}$$
If we take the expectation of both sides and use the law of total expectation:

$$E\left[\widehat{\boldsymbol{\beta}}\right] = \boldsymbol{\beta} + E\left[\left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon}\right] = \boldsymbol{\beta} + E\left[E\left[\left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon} \,\middle|\, \mathbf{X}\right]\right] = \boldsymbol{\beta} + E\left[\left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top E\left(\boldsymbol{\varepsilon} \mid \mathbf{X}\right)\right] = \boldsymbol{\beta}$$

since E(ε|X) = 0 from (UR.2). We have thus shown that E[β̂] = β.
Some more notes on unbiasedness:
I Unbiasedness does not guarantee that the estimate we get with any
particular sample is equal (or even very close) to β.
I It means that if we could repeatedly draw random samples from the
population and compute the estimate each time, then the average of
these estimates would be (very close to) β.
I However, in most applications we have just one random sample to
work with, though as we will see later on, there are methods for
creating additional samples from the available data by creating and
analysing different subsamples.
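This repeated-sampling idea can be illustrated with a small simulation; a minimal Python sketch, where the number of replications, the sample size and the parameter values are illustrative:

import numpy as np

np.random.seed(123)
beta_0, beta_1, N, R = 1.0, 0.5, 50, 5000
b1_estimates = np.zeros(R)
for r in range(R):
    # Draw a new sample from the same data generating process
    x_s = np.random.normal(loc = 0, scale = 5, size = N)
    y_s = beta_0 + beta_1 * x_s + np.random.normal(loc = 0, scale = 1, size = N)
    # OLS estimate of the slope for this sample
    b1_estimates[r] = np.cov(x_s, y_s, bias = True)[0][1] / np.var(x_s)
print(np.mean(b1_estimates))  # the average should be very close to the true beta_1 = 0.5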
OLS estimators are Best (Efficient)
I When there is more than one unbiased method of estimation to
choose from, that estimator which has the lowest variance is the best.
I We want to show that the OLS estimators are best in the sense that β̂ is an efficient estimator of β (i.e. it has the smallest variance).
To do so we will calculate the variance - the average squared distance of an element from its mean - as follows (remember that for the OLS estimators, condition (UR.3) holds true):
From the proof of unbiasedness of the OLS we have that:
$$\widehat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon} \implies \widehat{\boldsymbol{\beta}} - \boldsymbol{\beta} = \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\boldsymbol{\varepsilon},$$

so that, using (UR.3),

$$\text{Var}\left(\widehat{\boldsymbol{\beta}} \mid \mathbf{X}\right) = \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\, \text{Var}\left(\boldsymbol{\varepsilon} \mid \mathbf{X}\right)\, \mathbf{X}\left(\mathbf{X}^\top\mathbf{X}\right)^{-1} = \sigma^2\left(\mathbf{X}^\top\mathbf{X}\right)^{-1}.$$

In the univariate case this gives, in particular:

$$\text{Var}\left(\widehat{\beta}_1\right) = \sigma^2 \cdot \frac{1}{\sum_{i=1}^N \left(X_i - \overline{X}\right)^2}$$
The coefficient variances correspond to the diagonal elements of the variance-covariance matrix:

$$\text{Var}(\widehat{\boldsymbol{\beta}}) = \begin{bmatrix} \text{Var}(\widehat{\beta}_0) & \text{Cov}(\widehat{\beta}_0, \widehat{\beta}_1) \\ \text{Cov}(\widehat{\beta}_1, \widehat{\beta}_0) & \text{Var}(\widehat{\beta}_1) \end{bmatrix}$$
Note that we applied the following relationship:
$$N\sum_{i=1}^N X_i^2 - \left(\sum_{i=1}^N X_i\right)^2 = N\sum_{i=1}^N \left(X_i - \overline{X}\right)^2.$$
Next, assume that we have some other estimator of β which is also unbiased and can be expressed as:

$$\widetilde{\boldsymbol{\beta}} = \left[\left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top + \mathbf{D}\right]\mathbf{Y} = \mathbf{C}\mathbf{Y}$$

for some non-zero matrix D. Substituting Y = Xβ + ε and taking conditional expectations gives E[β̃ | X] = β + DXβ, so β̃ is unbiased if and only if DX = 0.
$$\begin{aligned} \text{Var}(\widetilde{\boldsymbol{\beta}}) &= \text{Var}(\mathbf{C}\mathbf{Y}) \\ &= \mathbf{C}\,\text{Var}(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon})\,\mathbf{C}^\top \\ &= \sigma^2 \mathbf{C}\mathbf{C}^\top \\ &= \sigma^2 \left[\left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top + \mathbf{D}\right]\left[\mathbf{X}\left(\mathbf{X}^\top\mathbf{X}\right)^{-1} + \mathbf{D}^\top\right] \\ &= \sigma^2\left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{X}\left(\mathbf{X}^\top\mathbf{X}\right)^{-1} + \sigma^2\left(\mathbf{X}^\top\mathbf{X}\right)^{-1}(\mathbf{D}\mathbf{X})^\top + \sigma^2\mathbf{D}\mathbf{X}\left(\mathbf{X}^\top\mathbf{X}\right)^{-1} + \sigma^2\mathbf{D}\mathbf{D}^\top \\ &= \sigma^2\left(\mathbf{X}^\top\mathbf{X}\right)^{-1} + \sigma^2\mathbf{D}\mathbf{D}^\top \qquad \text{(since } \mathbf{D}\mathbf{X} = \mathbf{0}\text{)} \\ &= \text{Var}(\widehat{\boldsymbol{\beta}}) + \sigma^2\mathbf{D}\mathbf{D}^\top \geq \text{Var}(\widehat{\boldsymbol{\beta}}) \end{aligned}$$
Yi = β0 + β1 Xi + εi
Yi = β̂0 + β̂1 Xi + ε̂i

Since the errors εi are unobserved, σ² is estimated from the residuals as

$$\widehat{\sigma}^2 = \frac{1}{N - 2}\sum_{i=1}^N \widehat{\varepsilon}_i^{\,2},$$

where the denominator N − 2 accounts for the two estimated parameters.
Going back to our example, we can estimate the standard errors of our
coefficients by applying the formulas that we have seen:
I R:
sigma2_est <- sum(resid^2) / (length(x) - 2)
var_beta <- sigma2_est * solve(t(x_mat) %*% x_mat)
print(sqrt(diag(var_beta)))
## x
## 0.266578700 0.009188728
I Python:
sigma2_est = sum(resid**2) / (len(x) - 2)
var_beta = sigma2_est * np.linalg.inv(np.dot(np.transpose(x_mat), x_mat))
print(np.sqrt(np.diag(var_beta)))
## [0.26999713 0.00930656]
We can also use the built-in functions, just like we did with the coefficients:
I R:
out <- summary(lm_result)
print(out$coefficients[, 2, drop = FALSE])
## Std. Error
## (Intercept) 0.266578700
## x 0.009188728
I Python:
print(lm_fit.bse)
## [0.26999713 0.00930656]
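The lm_result and lm_fit objects above are fitted-model objects returned by the built-in estimation routines (lm() in R); the code that created them is not shown on these slides. A minimal Python sketch of how such an object could be obtained, assuming statsmodels:

import statsmodels.api as sm

# Fit Y = beta_0 + beta_1 * X by OLS; add_constant() appends the intercept column
lm_fit = sm.OLS(y, sm.add_constant(x)).fit()
print(lm_fit.params)  # estimated coefficients
print(lm_fit.bse)     # their standard errors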
This highlights a potential problem, which will be addressed in later lectures concerning model adequacy/goodness of fit: if the residuals are large (since their mean is always zero, this means that their estimated variance is large), then the standard errors of the coefficients are large as well.
OLS estimators are Consistent
A consistent estimator has the property that, as the number of data points used to estimate the parameters increases (i.e. N → ∞), the estimates converge in probability to the true parameter, i.e.:
Definition
Let WN be an estimator of a parameter θ based on a sample Y1, ..., YN. Then we say that WN is a consistent estimator of θ if ∀ε > 0:

$$P\left(|W_N - \theta| > \varepsilon\right) \rightarrow 0, \text{ as } N \rightarrow \infty$$

We can denote this as $W_N \xrightarrow{P} \theta$ or plim(WN) = θ. If WN is not consistent, then we say that WN is inconsistent.
I Unlike unbiasedness, consistency involves the behaviour of the sampling distribution of the estimator as the sample size N gets large - the distribution of WN becomes more and more concentrated about θ. In other words, for larger sample sizes, WN is less and less likely to be very far from θ.
I An inconsistent estimator does not help us learn about θ, regardless
of the size of the data sample.
I For this reason, consistency is a minimal requirement of an
estimator used in statistics or econometrics.
Unbiased estimators are not necessarily consistent, but those whose
variances shrink to zero as the sample size grows are consistent.
For some examples, see the lecture notes.
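One simple way to visualise consistency is to re-estimate β1 on ever larger samples from the same data generating process and watch the estimate settle down around the true value; a minimal Python sketch (the seed and sample sizes are illustrative):

import numpy as np

np.random.seed(42)
beta_0, beta_1 = 1.0, 0.5
for N in [10, 100, 1000, 10000, 100000]:
    x_s = np.random.normal(loc = 0, scale = 5, size = N)
    y_s = beta_0 + beta_1 * x_s + np.random.normal(loc = 0, scale = 1, size = N)
    b1_hat = np.cov(x_s, y_s, bias = True)[0][1] / np.var(x_s)
    print(N, b1_hat)  # b1_hat should approach the true beta_1 = 0.5 as N grows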
Going back to our OLS estimators β̂0 and β̂1, as N → ∞ we have that:

$$\widehat{\boldsymbol{\beta}} \mid \mathbf{X} \sim \mathcal{N}\left(\boldsymbol{\beta},\ \sigma^2 \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\right)$$
I We have shown how to derive an OLS estimation method in order to
estimate the unknown parameters of a linear regression with one
variable.
I We have also shown that these estimators (under conditions (UR.1) - (UR.3)) are good in the sense that, as the sample size increases, the estimated parameter values approach the true parameter values, and that the average of the estimates obtained from a number of different random samples is very close to the true underlying parameter vector.
Examples using empirical data
From the lecture notes, Ch. 3.10, select one (or more) dataset and do the tasks from Exercise Set 1 in Ch. 3.10. See Ch. 3.11 for an example.