
Econometría Avanzada, Maestría en Economía - FCE - UNLP. Prof. Lamarche

The Linear Regression Model


Lecture 1a

The plan of this lecture is to offer a review of regression analysis for the conditional mean
function E(y|xj , x−j ). We will introduce the basic regression model and then concentrate
on the simple case of a bivariate regression. At the end of the lecture, we discuss statistical
properties of the Ordinary Least Squares (OLS) estimator.

1 Multivariate Regression Analysis


We start assuming a linear model with p independent variables,

y = x′β + u,  (1)

with dimensions 1×1 = (1×p)(p×1) + 1×1,

where y is the response or dependent variable and the β’s are the coefficients associated with
the independent variables x’s. At the core of econometrics is the concept of causal effects
or causal relationships. Clearly, we are not referring to the notion of correlation between
xj and y, since correlation is symmetric, Corr(xj , y) = Corr(y, xj ), and cannot distinguish
cause from effect.
We define xj to be an independent variable, y to be a dependent variable and we might
have other (independent) variables we keep at a constant level, x−j . To investigate causal
relationships we employ models and the classical model of interest is the conditional expected
value of the dependent variable conditional on the independent variables,

E(y|xj , x−j ) = E(y|x) = ∫ y fY |X=x (y) dy.

We can also define:


∂E(y|xj , x−j )/∂xj := partial effect,
obtained by keeping x−j equal to a constant. Note that this is well defined if xj is a continuous
variable.
To estimate causal effects, a crucial assumption is that all observable factors (the inde-
pendent variables x1 , x2 , ..., xp ) that affect the dependent variable y are uncorrelated with


the error term.


E(u|x1 , x2 , ..., xp ) = 0. (2)
If condition (2) is satisfied, we say that the variables {x1 , x2 , ..., xp } are strictly exogenous.
As explained below, the implication is that the error term has mean equal to zero.

E(u|x) = E(u) = 0.

To summarize, the first two assumptions of the classical regression model are:

A1 (linearity) The model is linear as in (1)

A2 (strict exogeneity) We assume that E(u|x1 , x2 , ..., xp ) = 0

We will use later a model with p independent variables as in equation (1). Let us now
concentrate on one particular case, assuming that we have one independent variable (p = 1).

2 Conditional Expectations
The simple regression model is,
y = β0 + β1 x + u (3)
where y is the dependent variable, x is the independent variable, and u is the error term or
disturbance. The model suggests that changes in y are induced by changes in x and u. We
want to know how much y can be explained in terms of x.

Example 1. Suppose we want to know the effect of education on wages, and that wages and
education are linearly related. Then, we can write

wages = β0 + β1 educ + u (4)

In this example, we assume that the relation between the response variable and the inde-
pendent variable is linear in the parameters β0 and β1 . It may be the case that the model
is
log(wages) = β0 + β1 educ + β2 educ² + u,
which is nonlinear in the variables but still linear in the parameters.

We can define the population regression function (PRF) or, in this case, the population
regression line (PRL) of equation (3) as,

E(y|x) = E(β0 + β1 x + u|x) = β0 + β1 x + E(u|x) = β0 + β1 x, (5)


by Assumptions A1 and A2. It says that the expected value of the dependent variable for
a given value of the independent variables is equal to the parameter β0 plus β1 times the
independent variable. In the case of the PRL with one slope parameter, the model contains
two unknown parameters β0 (the intercept) and β1 (the slope).

INSERT FIGURE

Our interest lies in estimating the PRF, which implies estimating the two unknowns, i.e.,
the intercept and the slope.
We now discuss the implications of the previous conditions A1 and A2. We can write the
model as:

y = E(y|x) + u (6)
E(u|x) = 0. (7)

Equation (7) implies:

E(u) = 0 (8)
Cov(u, xj ) = 0, for j = 1, ...p, (9)
Cov(u, h(xj )) = 0, for j = 1, ...p, (10)

where Cov denotes covariance and h(·) is a known function (e.g. h(x1 ) = x21 , h(x1 , x2 ) = x1 x22 ,
h(x2 ) = exp(x2 + x22 ), etc.). To see these results, first note that by definition u := y − E(y|x).
Then,
E(u) = E(y − E(y|x)) = E(y) − E(E(y|x)) = E(y) − E(y) = 0. (11)

To show that condition (7) also implies that the error term is uncorrelated with the
covariates, we need to show that
E(ux) = E(E(ux|x)) = E(xE(u|x)) = E(x × 0) = 0. (12)
Therefore, conditions (11) and (12) imply, for j = 1, ..., p,

Cov(u, xj ) = E(xj u) − E(xj )E(u) = 0. (13)
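
To make these population conditions concrete, here is a minimal simulation sketch (illustrative only and not part of the original notes; the DGP and values are hypothetical) that checks the sample analogs of (11)-(13):

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                 # independent variable
u = rng.normal(size=n)                 # error term, E(u|x) = 0 by construction
y = 1.0 + 2.0 * x + u                  # hypothetical DGP: beta0 = 1, beta1 = 2

print(u.mean())                        # sample analog of E(u) = 0
print(np.cov(u, x)[0, 1])              # sample analog of Cov(u, x) = 0
print(np.cov(u, x**2)[0, 1])           # sample analog of Cov(u, h(x)) = 0 with h(x) = x^2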

3 Best Linear Predictor


Define a loss function of a generic form, L(u) = L(y − m(x)), where m(x) is the predictor
in terms of x and u is the prediction error. Naturally, a decision maker tries to solve the
following problem:
min_{m∈M} E{L(y − m(x))|x},


that is, she would like to choose a predictor of y that minimizes the expected loss.
It is convenient to assume a quadratic loss (i.e., L(u) = u²). Under quadratic loss, the
problem becomes minimizing the mean square error of m(x) for y,

E{(y − m(x))² | x} = E{(y − µ(x) + µ(x) − m(x))² | x}
= V (y|x) + (µ(x) − m(x))² + 2E[(µ(x) − m(x))(y − µ(x)) | x]
= V (y|x) + (µ(x) − m(x))²,

where µ(x) := E(y|x), V (y|x) is the conditional variance of y, and the cross term vanishes
because E(y − µ(x) | x) = 0.


If the first two conditional moments of y exist (i.e., E(y|x) < ∞ and V (y|x) < ∞), the
(unconditional) mean square error is minimized, attaining the value E[V (y|x)], when m(x)
is equal to the conditional mean µ(x) := E(y|x),

E(y − m(x))2 = EV (y|x) + E(m(x) − µ(x))2 .

Suppose now that the functional form of m is known. For instance, consider a model that
is linear in parameters:

min_{β} E{(y − x′β)²} = min_{β} {E(y²) − 2E(yx′)β + β′E(xx′)β}.

Taking the first-order condition with respect to β, we have

−E(xy) + E(xx′ )β = 0,

thus,
β = (E(xx′ ))−1 E(xy)
Therefore, the parameter β can be interpreted as the best linear predictor under squared
error loss.
Note that for β to exist, we need the p × p matrix E(xx′) to be invertible. (We also
need that E(y²) < ∞ and E(∑j x²j) < ∞; see Hansen 2017, Theorem 2.18.1.) If no
redundant variables are included in the vector x, then the design matrix E(xx′) is invertible
and we can say that β is unique and identified. A requirement is that E(xx′) is positive
definite. Identification in this class of linear models means that we can write the parameter
of interest in terms of moment conditions that are functions of observed variables such as y
and x1 , ..., xp .
This is related to another assumption of the linear model:

A3 (full rank design matrix) rank(E(xi x′i )) = p

This condition rules out perfect collinearity; in other words, it rules out exact linear


relationships among the independent variables.
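
As an illustration of the best linear predictor formula above, the following sketch (hypothetical simulated data, not part of the original notes) computes β from the sample analogs of E(xx′) and E(xy):

import numpy as np

rng = np.random.default_rng(1)
n, p = 50_000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # includes a constant
beta_true = np.array([1.0, 0.5, -2.0])                           # hypothetical values
y = X @ beta_true + rng.normal(size=n)

# Sample analogs of E(xx') and E(xy); beta = E(xx')^{-1} E(xy)
Exx = X.T @ X / n
Exy = X.T @ y / n
beta_blp = np.linalg.solve(Exx, Exy)
print(beta_blp)   # should be close to beta_true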

4 More on Identification ...


Identification of β refers to what we can learn about the unknown parameter from knowing
(features of) the population of the data.
Before we continue, there will be a change in notation. I will refer to θ as the parameter
of interest. For example, it could be one slope parameter in the vector β in the model above.
Also, I will use capital letters for random variables, just to make clearer where identification
is coming from.

4.1 Examples
Example 2. An example of θ includes coefficients in a regression model. Suppose we have
scalars Y and X and there is an unknown parameter θ. Our model is

Y = Xθ + U, (14)

where U is a random variable representing the error term of a regression model. The popu-
lation mean of U is zero and E(U X) = 0. Assume that the second moment of X exists and
is nonzero, i.e. 0 < E(X²) < ∞. Multiplying both sides of (14) by X, taking expectations,
and using E(U X) = 0, it can be shown that:

θ = E(Y X)/E(X²). (15)
The parameter θ is a function of second moments of X and Y .
Therefore, if we can learn from data about E(X 2 ) and E(Y X), then we are able to
point-identify θ (or θ is point identified). The unknown parameter is uniquely identified by
the second moments.

Example 3. Another example involves the average treatment effect (ATE), which is typ-
ically considered in the evaluation of policies and social programs. Suppose now we have
a binary treatment indicator, D ∈ {0, 1}, that has been randomly assigned to different
individuals in a population. Suppose also that Y is an outcome of interest.
It follows that the ATE is

θ = E(Y |D = 1) − E(Y |D = 0), (16)

which is identified based on the existence of two conditional mean functions. For θ to be
well-defined, both E(Y |D = 1) and E(Y |D = 0) need to exist and can be learned from data.
The unknown θ is interpreted as the difference between the mean of Y among people who


are on treatment (D = 1) and the mean of Y among people who are in the control group
(D = 0).
A slightly different example is a difference-in-differences parameter. Suppose Yb is the
outcome of interest observed before a program is implemented and Ya is the outcome of
interest observed after the program is implemented. Suppose that D now indicates individuals under
treatment (note that D = 0 before the program is implemented). It follows that

θ = (E(Ya |D = 1) − E(Ya |D = 0)) − (E(Yb |D = 1) − E(Yb |D = 0)), (17)

which is again identified based on the existence of conditional mean functions.


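As a simple numerical illustration (hypothetical data and effect sizes, not from the notes), both the ATE in (16) and the difference-in-differences parameter in (17) are just differences of sample means:

import numpy as np

rng = np.random.default_rng(2)
n = 20_000
D = rng.integers(0, 2, size=n)               # randomly assigned treatment
Yb = rng.normal(size=n)                      # outcome before the program
Ya = Yb + 1.5 * D + rng.normal(size=n)       # outcome after; hypothetical true effect = 1.5

ate = Ya[D == 1].mean() - Ya[D == 0].mean()                      # sample analog of (16)
did = ((Ya[D == 1].mean() - Ya[D == 0].mean())
       - (Yb[D == 1].mean() - Yb[D == 0].mean()))                # sample analog of (17)
print(ate, did)                              # both close to 1.5 since D is randomly assigned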

Example 4. Other examples include error distributions. For instance, suppose you have
two measurements of the outcome of interest Y , denoted by Y1 and Y2 . The interest is in
the errors of the outcome equations, denoted by U , ϵ1 and ϵ2 . Under some assumptions, the
distribution of U , ϵ1 and ϵ2 can be identified based on the joint distribution of Y1 and Y2 .

4.2 Identification in practice and in theory


Often, in empirical work, you might need to answer the following questions: “How would
you identify the parameter θ?” or “What is the source of identification?” These questions are
not about whether the parameter θ exists or is unique. They are usually about what
information you have in the data to recover θ.
For instance, in our previous example on ATEs, we naturally need variation of D, which
implies that we need individuals receiving treatment and individuals that are not receiving
treatment. Moreover, we need that D is randomly assigned. Thus, randomization allows us
to identify the parameter of interest.
Note too that the moments and the assumptions I used above imply that there is an under-
lying model. Of course, the implication is that identification requires a model.
But what is a model, then? In this class, and in general in econometrics and statistics,
a model is a set of equations and a set of assumptions or conditions about the stochastic
process that generated the data (we call it DGP, or data generating process). We usually have
assumptions that correspond to behavior (how the data is generated) and/or assumptions
that are statistical or related to econometric issues (selection, measurement error, attrition,
etc.).
We can consider two additional simple examples. First, suppose we want to identify the
median of a distribution.


Suppose we have independently and identically distributed random variables Y that are
draws from FY (y), the distribution function of Y . If we know F , we can determine the
median:

θ = F⁻¹(0.5), (18)

because, by definition,

0.5 = F (θ). (19)
However, it is clear that the assumptions above are not enough. We need to assume that F
is strictly increasing (and hence invertible) to be able to “solve” for θ.
Another example is on the ATE. Suppose, in practice, we face the situation that while D
has been randomly assigned, individuals who completed the treatment are, in general, more
educated. In this case, there might be issues of selection and a usual strategy is to find a
set of observable variables X and “control for observables”. (Naturally, in our example, X
includes education of the individual).
The idea is to control for selection into treatment, and X is typically included in regression
analysis or in other estimation methods used to estimate the parameters of social interventions
(e.g., matching, RDD, difference-in-differences, etc.).
It follows that the ATE can be written as:

θ = E{E(Y |D = 1, X = x) − E(Y |D = 0, X = x)}, (20)

and naturally we require individuals who have D = 1 and X = x as well as individuals
who have D = 0 and X = x. This is another assumption, often called common support,
and it implies that identification needs E(Y |D = 1, X = x) and E(Y |D = 0, X = x) and
not simply E(Y |D = 1) and E(Y |D = 0).
Obviously, if we do not observe individuals with X = x in both the control and treatment
groups, we fail to point identify the parameter θ as defined in equation (20).
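
A minimal sketch of the identification argument in (20), assuming a single discrete covariate X with common support (names, values, and the DGP are hypothetical illustrations):

import numpy as np

rng = np.random.default_rng(3)
n = 50_000
X = rng.integers(0, 3, size=n)                     # discrete covariate with common support
p_treat = 0.3 + 0.2 * X                            # selection into treatment depends on X
D = (rng.uniform(size=n) < p_treat).astype(int)
Y = 2.0 * D + X + rng.normal(size=n)               # hypothetical true ATE = 2

# θ = E{ E(Y|D=1,X=x) − E(Y|D=0,X=x) }, averaging over the distribution of X
ate = 0.0
for x in np.unique(X):
    w = (X == x).mean()
    ate += w * (Y[(D == 1) & (X == x)].mean() - Y[(D == 0) & (X == x)].mean())
print(ate)   # close to 2; a naive difference in means ignoring X is biased upward here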

4.3 Failure of Point Identification


There are numerous examples in the literature where a parameter or a model is not point
identified. In these cases, the analysis relies on partial identification, a topic that goes
beyond what we need to do here.
There are several reasons why one may not be able to identify an unknown parameter θ,
among them:

• Nonlinearities

• Model misspecification


• Singularity issues

• Endogeneity

• Simultaneity issues

• Unobserved heterogeneity

After a brief review of regression, we will start addressing these issues one by one during
the semester. The only exception is the first point on nonlinearities, as our course is mainly
on models linear in parameters.
Suppose we consider the following model:

Y = (X − θ0 )2 + U, (21)

where U is the error term satisfying E(U ) = 0 and θ0 ∈ Θ is the true parameter value among
all possible values θ in the compact set Θ.
Based on the assumption on the error term, we can write:
E(Y − (X − θ0 )²) = 0, (22)

which gives one nonlinear equation in one unknown. Expanding the square, θ0 must be a root
of the following quadratic equation, which in general has two roots:

a + bθ + cθ² = 0, (23)

where a = E(Y − X 2 ), b = 2E(X), and c = −1. The parameter θ0 is not uniquely identified
even if we have the entire population (all the data on Y and X).
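
To see the failure of point identification numerically, here is an illustrative sketch assuming the population moments are known and take the hypothetical values E(X) = 1 and E(Y − X²) = 3; both roots of (23) satisfy the moment condition (22):

import numpy as np

EX = 1.0                   # E(X), hypothetical value
EY_minus_EX2 = 3.0         # E(Y − X^2), hypothetical value

a, b, c = EY_minus_EX2, 2.0 * EX, -1.0
roots = np.roots([c, b, a])          # solves c*θ^2 + b*θ + a = 0
print(roots)                         # two values of θ (here 3 and −1) consistent with (22)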

5 Estimation
Consider a random sample {yi , xi } of size n from the population. The pair of data points
(yi , xi ) represents the ith observation. The size of the sample, n, denotes the total number of
observations. Since each data point comes from model (3),

yi = β0 + β1 xi + ui (24)

where i = 1, ..., n. Our goal is to estimate the parameters (β0 , β1 ).


5.1 Method of Moments


We first review an estimation method called the Method of Moments using the following example.

Example 5. Suppose you want to estimate the first moment of a random variable. The
method of moments (MM) can be used to estimate the population mean. Consider {yi }
to be independently and identically distributed (i.i.d.) random variables drawn from
a distribution whose first moment exists (e.g., y is not distributed as Cauchy). In the
population,
E(y − µ) = 0
so we can use the corresponding sample moment condition

n⁻¹ ∑_{i=1}^{n} (yi − µ) = 0 ⇒ µ̂ = ȳ.

To obtain the MM estimator for the linear model, recall the moment condition,

E(xi ui ) = E(E(xi ui |xi )) = E(xi E(ui |xi )) = E(xi × 0) = 0,

or alternatively,
E(xi (yi − x′i β)) = 0,
where xi = (1, xi1 )′ and β = (β0 , β1 )′ . The MM estimator could be obtained as the solution
of the corresponding sample moment condition,

n⁻¹ ∑_{i=1}^{n} xi (yi − x′i β) = 0,

yielding

β̂ = (∑_{i=1}^{n} xi x′i )⁻¹ ∑_{i=1}^{n} xi yi .
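
A minimal numerical sketch of the MM estimator above (illustrative simulated data; names and values are hypothetical, not from the notes):

import numpy as np

rng = np.random.default_rng(4)
n = 10_000
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])          # xi = (1, xi1)'
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

# Solve the sample moment condition (1/n) Σ xi (yi − xi'β) = 0
beta_mm = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_mm)                                 # close to (0.5, 1.5)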

We now turn to some basic linear regression models.

5.2 Ordinary Least Squares Estimation (OLS)


Our model’s assumptions were

E(u) = E(y − β0 − β1 x) = 0
Cov(x, u) = E(xu) − E(x)E(u) = E(xu) = E(x(y − β0 − β1 x)) = 0


These two equations are restrictions imposed on the joint probability distribution of x
and y. There are two unknowns and two equations, which suggests that we can solve for β0
and β1 . The sample counterparts of these equations are

n⁻¹ ∑_{i=1}^{n} (yi − β0 − β1 xi ) = 0 (25)

n⁻¹ ∑_{i=1}^{n} xi (yi − β0 − β1 xi ) = 0 (26)

The Ordinary Least Squares Estimator is the argument that minimizes the following
criterion function:
∑_{i=1}^{n} (yi − β0 − β1 xi )², (27)

with respect to the unknowns β0 and β1 . It is easy to see that the normal equations corre-
sponding to this problem are identical to the two previous equations (25) and (26). Thus,
the OLS estimator is a MM estimator in this case.
From equation (25), we obtain

ȳ − β̂0 − β̂1 x̄ = 0 ⇒ β̂0 = ȳ − β̂1 x̄ (28)

where ȳ = n⁻¹ ∑_{i=1}^{n} yi . Thus, if we know the ordinary least squares estimator of the slope,
β̂1 , we can obtain an estimate of the intercept. How can we obtain β̂1 ? Plug the last result
into equation (26) to obtain,

n⁻¹ ∑_{i=1}^{n} xi (yi − (ȳ − β̂1 x̄) − β̂1 xi ) = 0

n⁻¹ ∑_{i=1}^{n} xi ((yi − ȳ) − β̂1 (xi − x̄)) = 0

∑_{i=1}^{n} xi (yi − ȳ) = β̂1 ∑_{i=1}^{n} xi (xi − x̄)

It can be shown that the last equation is equal to



∑_{i=1}^{n} (xi − x̄)(yi − ȳ) = β̂1 ∑_{i=1}^{n} (xi − x̄)²


which implies that the estimator for the parameter β1 is


∑n
(x − x̄)(yi − ȳ)
β̂1 = i=1∑n i (29)
i=1 (xi − x̄)
2

Therefore, β̂0 , β̂1 are the ordinary least squares (OLS) estimators for the intercept and slope
parameters of the simple linear regression model.
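
The closed-form expressions (28) and (29) translate directly into code. A minimal sketch with illustrative simulated data (not from the notes; parameter values are hypothetical):

import numpy as np

rng = np.random.default_rng(5)
n = 5_000
x = rng.normal(size=n)
y = 0.5 + 1.5 * x + rng.normal(size=n)     # hypothetical true (beta0, beta1) = (0.5, 1.5)

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)   # eq. (29)
beta0_hat = y.mean() - beta1_hat * x.mean()                                       # eq. (28)
print(beta0_hat, beta1_hat)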

5.3 Arithmetic Properties of OLS


It is useful to briefly discuss properties of the OLS estimators. Consider the following 4
properties:

1. The sum of the residuals ûi is zero since β̂0 is chosen such that

∑_{i=1}^{n} (yi − β̂0 − β̂1 xi ) = ∑_{i=1}^{n} ûi = 0

2. The sample covariance between the independent variable and the residuals is zero.
To see this, it is crucial to find that

∑_{i=1}^{n} xi ûi = ∑_{i=1}^{n} xi (yi − β̂0 − β̂1 xi ) = ∑_{i=1}^{n} xi (yi − x′i β̂)
= ∑_{i=1}^{n} xi yi − ∑_{i=1}^{n} xi x′i β̂
= ∑_{i=1}^{n} xi yi − ∑_{i=1}^{n} xi x′i (∑_{i=1}^{n} xi x′i )⁻¹ ∑_{i=1}^{n} xi yi
= ∑_{i=1}^{n} xi yi − ∑_{i=1}^{n} xi yi = 0.

3. The sample average of the fitted values is equal to the sample average of the observations:

yi = ŷi + ûi
∑_{i=1}^{n} yi = ∑_{i=1}^{n} ŷi + ∑_{i=1}^{n} ûi
n⁻¹ ∑_{i=1}^{n} yi = n⁻¹ ∑_{i=1}^{n} ŷi , that is, ȳ equals the sample mean of the fitted values.


4. The point (x̄, ȳ) is always on the OLS regression line:

ȳ = n⁻¹ ∑_{i=1}^{n} ŷi = n⁻¹ ∑_{i=1}^{n} (β̂0 + β̂1 xi ) = β̂0 + β̂1 x̄
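
These four properties can be checked numerically on any fitted sample. A short illustrative sketch (hypothetical data, not from the notes):

import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=1_000)
y = 0.5 + 1.5 * x + rng.normal(size=1_000)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
u_hat = y - y_hat

print(u_hat.sum())                       # property 1: residuals sum to zero
print(np.sum(x * u_hat))                 # property 2: orthogonality between x and residuals
print(y.mean() - y_hat.mean())           # property 3: means of y and fitted values coincide
print(y.mean() - (b0 + b1 * x.mean()))   # property 4: (x̄, ȳ) lies on the fitted line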

5.4 Bias of the OLS estimator


Under conditions A1 and A2, the OLS estimator is unbiased, that is, the expected value of
the OLS estimator is equal to the unknown parameter.
We can show the result as follows. Recall that,
β̂1 = ∑_{i=1}^{n} (xi − x̄)(yi − ȳ) / ∑_{i=1}^{n} (xi − x̄)² = [∑_{i=1}^{n} (xi − x̄)²]⁻¹ ∑_{i=1}^{n} (xi − x̄) yi

= [∑_{i=1}^{n} (xi − x̄)²]⁻¹ ∑_{i=1}^{n} (xi − x̄)(β0 + β1 xi + ui )

= [∑_{i=1}^{n} (xi − x̄)²]⁻¹ (β0 ∑_{i=1}^{n} (xi − x̄) + β1 ∑_{i=1}^{n} (xi − x̄) xi + ∑_{i=1}^{n} (xi − x̄) ui )

= [∑_{i=1}^{n} (xi − x̄)²]⁻¹ (β1 ∑_{i=1}^{n} (xi − x̄)² + ∑_{i=1}^{n} (xi − x̄) ui ),

using the facts that ∑_{i=1}^{n} (xi − x̄) = 0 and ∑_{i=1}^{n} (xi − x̄) xi = ∑_{i=1}^{n} (xi − x̄)².

Therefore, the OLS estimator for the slope can be written as:

β̂1 = β1 + [∑_{i=1}^{n} (xi − x̄)²]⁻¹ ∑_{i=1}^{n} (xi − x̄) ui . (30)

Taking expectations (conditional on the xi ’s),

E(β̂1 ) = β1 + [∑_{i=1}^{n} (xi − x̄)²]⁻¹ ∑_{i=1}^{n} E((xi − x̄) ui ).

Since E(u|x) = E(u) = 0, it follows that E(β̂1 ) = β1 . Also, recall that

β̂0 = ȳ − β̂1 x̄ = β0 + β1 x̄ + ū − β̂1 x̄,

where ū = n⁻¹ ∑_{i=1}^{n} ui , so that

E(β̂0 ) = β0 + β1 x̄ + E(ū) − E(β̂1 )x̄ = β0 .

Therefore, the OLS estimators are unbiased, i.e.

E(β̂0 ) = β0 ; E(β̂1 ) = β1 (31)
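
A small Monte Carlo sketch of the unbiasedness result (the DGP, sample size, and replication count are arbitrary illustrative choices, not from the notes):

import numpy as np

rng = np.random.default_rng(7)
beta0, beta1, n, reps = 0.5, 1.5, 200, 5_000
b1_draws = np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)
    y = beta0 + beta1 * x + rng.normal(size=n)
    b1_draws[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)

print(b1_draws.mean())   # close to beta1 = 1.5: the estimator is unbiased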


5.5 Variance of the OLS estimator


We are going to obtain the variance of the OLS estimators, but we have to state first one of
the assumptions of the simple linear regression model:

A4 (homoskedasticity) Var(u|x) = σ²

A5 (uncorrelated errors) Cov(ui , uj ) = 0 for all i ̸= j

In other words, the variance of the error term is constant (it does not change with x) and
the error ui is uncorrelated with uj for j ̸= i. Assumptions A4 and A5 together imply spherical
errors, or more precisely, a spherical error variance matrix.
What is the variance of β̂1 ? We know that

β̂1 = β1 + [∑_{i=1}^{n} (xi − x̄)²]⁻¹ ∑_{i=1}^{n} (xi − x̄) ui

Considering A4 and A5, we obtain the variance as follows:


Var(β̂1 ) = Var( [∑_{i=1}^{n} (xi − x̄)²]⁻¹ ∑_{i=1}^{n} (xi − x̄) ui )
= [∑_{i=1}^{n} (xi − x̄)²]⁻² Var( ∑_{i=1}^{n} (xi − x̄) ui )
= [∑_{i=1}^{n} (xi − x̄)²]⁻² ∑_{i=1}^{n} Var((xi − x̄) ui ) (by A5)
= [∑_{i=1}^{n} (xi − x̄)²]⁻² ∑_{i=1}^{n} (xi − x̄)² Var(ui )
⇒ Var(β̂1 ) = σ² / ∑_{i=1}^{n} (xi − x̄)² (by A4)


Similarly, to obtain the variance of the intercept, under the stated assumptions, we write,

Var(β̂0 ) = Var(ȳ − β̂1 x̄) = Var( ȳ − x̄ ∑_{i=1}^{n} (xi − x̄) yi / ∑_{i=1}^{n} (xi − x̄)² )
= Var( ∑_{i=1}^{n} [1/n − x̄(xi − x̄)/∑_{i=1}^{n} (xi − x̄)²] yi )
= ∑_{i=1}^{n} [1/n − x̄(xi − x̄)/∑_{i=1}^{n} (xi − x̄)²]² Var(ui )
= σ² ∑_{i=1}^{n} [1/n² + x̄²(xi − x̄)²/(∑_{i=1}^{n} (xi − x̄)²)² − 2x̄(xi − x̄)/(n ∑_{i=1}^{n} (xi − x̄)²)]
= σ² ∑_{i=1}^{n} [1/n² + x̄²(xi − x̄)²/(∑_{i=1}^{n} (xi − x̄)²)²] (since ∑_{i=1}^{n} (xi − x̄) = 0)
= σ² [1/n + x̄²/∑_{i=1}^{n} (xi − x̄)²] = σ² [∑_{i=1}^{n} (xi − x̄)² + n x̄²] / [n ∑_{i=1}^{n} (xi − x̄)²]
= σ² [∑_{i=1}^{n} x²i − 2n x̄² + n x̄² + n x̄²] / [n ∑_{i=1}^{n} (xi − x̄)²]
= σ² ∑_{i=1}^{n} x²i / [n ∑_{i=1}^{n} (xi − x̄)²],

which leads to the variance of the estimated intercept,

Var(β̂0 ) = [n⁻¹ ∑_{i=1}^{n} x²i / ∑_{i=1}^{n} (xi − x̄)²] σ²

Although we do not know the parameter σ², we can plug in its estimator σ̂².

Remark 1. The variance of the error term is another unknown parameter of the population,
but we can estimate it using the analog principle. Recall that u has mean zero and
variance σ²,

E(u) = 0; V (u) = E(u2 ) − (E(u))2 = E(u2 ) = σ 2 ,

which implies that we can use (from the sample counterpart),

n⁻¹ ∑_{i=1}^{n} û²i

as an estimator of σ 2 . Unfortunately, this is a biased estimator, but we can easily correct


this to have an unbiased estimator of the variance of the error term,

σ̂² = (n − 2)⁻¹ ∑_{i=1}^{n} û²i (32)

where (n − 2) are the degrees of freedom.

Therefore, the estimators of the variances of the OLS estimators are

Var(β̂0 ) = [n⁻¹ ∑_{i=1}^{n} x²i / ∑_{i=1}^{n} (xi − x̄)²] σ̂² ; Var(β̂1 ) = σ̂² / ∑_{i=1}^{n} (xi − x̄)² (33)

Then, we can define the standard error of the slope coefficient as

se(β̂1 ) = σ̂ / [∑_{i=1}^{n} (xi − x̄)²]^{1/2} (34)
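
A short sketch putting (32), (33) and (34) together on illustrative simulated data (not from the notes; parameter values are hypothetical):

import numpy as np

rng = np.random.default_rng(8)
n = 1_000
x = rng.normal(size=n)
y = 0.5 + 1.5 * x + rng.normal(size=n)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
u_hat = y - b0 - b1 * x

sigma2_hat = np.sum(u_hat**2) / (n - 2)                          # eq. (32)
var_b1 = sigma2_hat / np.sum((x - x.mean())**2)                  # eq. (33)
var_b0 = sigma2_hat * np.mean(x**2) / np.sum((x - x.mean())**2)  # eq. (33)
se_b1 = np.sqrt(var_b1)                                          # eq. (34)
print(b1, se_b1, np.sqrt(var_b0))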

6 Some Practical Implications


Recall that if the PRL is not correctly specified because E(u|x1 , . . . , xp ) ̸= 0, the parameter
β1 in the previous model does not have a causal interpretation, unless x1 is independent of
other variables and factors. Experiments that generate x1 randomly are not common, so
researchers rely on control variables to eliminate biases.
This is connected to what is known as the conditional independence assumption, or CIA
(as in Angrist and Pischke 2009). This is a condition we want to satisfy for estimation and
inference and here is why. Consider a simple bivariate model

yi = β0 + β1 xi + ϵi ,

where β1 is a constant causal parameter or effect. For instance, yi is earnings and xi is having
a college degree or an indicator of veteran status. The error term ϵi captures other factors
affecting earnings. Suppose that
ϵi = δ0 + δ1 zi + vi ,
where zi denotes an “observable” variable, or simply a vector of observables, that may be
correlated with xi , and vi is assumed to satisfy E(vi |xi , zi ) = 0. Then, in general,
E(ϵi |xi ) ̸= 0 but E(vi |xi , zi ) = 0. Therefore, we estimate an “augmented” model with control zi ,

yi = β0 + β1 xi + β2 zi + ui ,

by OLS, obtaining an unbiased estimate of β1 . Note that this would not be possible if we used
OLS to estimate the bivariate model without zi .
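
A small simulation sketch of this point (hypothetical DGP and parameter values, not from the notes): when zi is correlated with xi and omitted, the short regression is biased, while the augmented regression recovers β1.

import numpy as np

rng = np.random.default_rng(9)
n = 100_000
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)                    # x is correlated with the omitted factor z
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)    # hypothetical true beta1 = 2

X_short = np.column_stack([np.ones(n), x])
X_long = np.column_stack([np.ones(n), x, z])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0]
print(b_short[1])   # biased estimate of beta1 (picks up part of the effect of z)
print(b_long[1])    # close to 2 once z is included as a control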


Table 1: The effect of veteran status on life-cycle earnings by race

              Log earnings (non-white)                    Log earnings (white)
              1974      1974      1985      1985          1974      1974      1985      1985
Veteran       0.002    -0.034+    0.131**   0.085**      -0.079    -0.049**   0.040**  -0.003
             (0.074)   (0.018)   (0.012)   (0.007)       (0.072)   (0.018)   (0.010)   (0.006)
Controls      No        Yes       No        Yes           No        Yes       No        Yes
Obs.          1,763     1,763     2,499     2,499         2,096     2,096     3,155     3,155
R2            0.000     0.942     0.048     0.688         0.001     0.942     0.005     0.664

Standard errors in parentheses
** p<0.01, * p<0.05, + p<0.1

Example 6. The role of covariates can be seen in the illustration presented in Table 1. We
use Social Security and administrative data from Angrist (1998) which studied the effect
of voluntary military service on earnings of soldiers. We use workers who applied to the
military between 1976 and 1981, and we estimate the model before military service and after
(i.e., years 1974 and 1985, as shown in the table). Voluntary military service is obviously
non-random and most likely correlated with education, age, test scores, cohort effects, etc.
We also know that these observables affect earnings. Therefore, omission of these variables
is likely to introduce biases.
The point is that after controlling for all these “observable” differences (using interactions
too), veterans and non-veterans are comparable. Whether or not you believe this premise,
the results shown in Table 1 point to the fact that the model estimated without controls
gives results that appear misleading.

Remark 2. We refer to controls as exogenous variables or observables, not to variables that
are determined by the variable of interest. These controls should not be potential right-hand-side
variables in other equations (that is, endogenous variables in general). For instance, in the
Mincer equation,
log(wages)i = β0 + β1 educi + αi + ui ,
αi measures the ability of worker i and is not observed (it is part of the error term). It makes
sense to assume that Cov(αi , educi ) ̸= 0. In the CIA world, it is tempting to include variables
to satisfy the condition and, sometimes, these variables are available to practitioners. For
instance, you might have test scores used to screen job applicants. The real problem is that
the performance in the test is likely to be correlated with education too.
For simplicity, consider that test scores are modeled as scorei = π0 + π1 educi + αi + εi .


If we use test scores as a control variable, we estimate

log(wages)i = γ0 + γ1 educi + γ2 scorei + vi ,

where γ1 = β1 − π1 ̸= β1 . Angrist and Pischke (2009) call these variables, including the test
score used in this example, ‘bad controls’.
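
A brief simulation sketch of the bad-control problem (hypothetical parameter values, not from the notes; the noise in the score is kept small so the bias matches the expression above): conditioning on a variable that is itself an outcome of educ pushes the coefficient on educ toward β1 − π1.

import numpy as np

rng = np.random.default_rng(10)
n = 200_000
educ = rng.normal(size=n)
alpha = rng.normal(size=n)                               # unobserved ability
eps = 0.1 * rng.normal(size=n)                           # small noise in the test score
score = 1.0 + 0.8 * educ + alpha + eps                   # pi1 = 0.8
logw = 0.5 + 1.0 * educ + alpha + rng.normal(size=n)     # beta1 = 1.0

X_bad = np.column_stack([np.ones(n), educ, score])
g = np.linalg.lstsq(X_bad, logw, rcond=None)[0]
print(g[1])   # ≈ beta1 − pi1 = 0.2 (exactly so as the noise in score vanishes), far from beta1 = 1.0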
