Lecture 1a
The plan of this lecture is to offer a review of regression analysis for the conditional mean
function E(y|xj , x−j ). We will introduce the basic regression model and then concentrate
on the simple case of a bivariate regression. At the end of the lecture, we discuss statistical
properties of the Ordinary Least Squares (OLS) estimator.
$$\underbrace{y}_{1\times 1} = \underbrace{x'}_{1\times p}\,\underbrace{\beta}_{p\times 1} + \underbrace{u}_{1\times 1} \quad (1)$$
where y is the response or dependent variable and the β’s are the coefficients associated with
the independent variables x’s. At the core of econometrics, there is the concept of causal
effects or causal relationships. Obviously, we are not referring to the notion of correlation
between xj and y since Corr(xj , y) = Corr(y, xj ).
We define xj to be an independent variable, y to be a dependent variable and we might
have other (independent) variables we keep at a constant level, x−j . To investigate causal
relationships we employ models and the classical model of interest is the conditional expected
value of the dependent variable conditional on the independent variables,
$$E(y \mid x_j, x_{-j}) = E(y \mid x) = \int y\, f_{Y|X=x}(y)\, dy.$$
E(u|x) = E(u) = 0.
To summarize, the first two assumptions of the classical regression model are:
We will later use a model with p independent variables as in equation (1). Let us now
concentrate on one particular case, assuming that we have one independent variable, p = 1.
2 Conditional Expectations
The simple regression model is,
y = β0 + β1 x + u (3)
where y is the dependent variable, x is the independent variable, and u is the error term or
disturbance. The model suggests that changes in y are induced by changes in x and u. We
want to know how much y can be explained in terms of x.
Example 1. Suppose we want to know the effect of education on wages, and that wages and
education are linearly related. Then, we can write
In this example, we assume that the relation between the response variable and the
independent variable is linear in the parameters β0 and β1. It may be the case that the model is
$$\log(\text{wages}) = \beta_0 + \beta_1\,\text{educ} + \beta_2\,\text{educ}^2 + u,$$
which is nonlinear in variables.
We can define the population regression function (PRF) or, in this case, the population
regression line (PRL) of equation (3) as,
$$E(y \mid x) = \beta_0 + \beta_1 x,$$
by Assumptions A1 and A2. It says that the expected value of the dependent variable for
a given value of the independent variables is equal to the parameter β0 plus β1 times the
independent variable. In the case of the PRL with one slope parameter, the model contains
two unknown parameters β0 (the intercept) and β1 (the slope).
Our interest lies in estimating the PRF, which implies estimating the two unknowns, i.e.
the intercept and the slope.
We now discuss the implications of the previous conditions A1 and A2. We can write the
model as:
y = E(y|x) + u (6)
E(u|x) = 0. (7)
E(u) = 0 (8)
Cov(u, xj ) = 0, for j = 1, ..., p, (9)
Cov(u, h(xj )) = 0, for j = 1, ..., p, (10)
where Cov denotes covariance and h(·) is a known function (e.g. $h(x_1) = x_1^2$, $h(x_1, x_2) = x_1 x_2^2$,
$h(x_2) = \exp(x_2 + x_2^2)$, etc.). To see these results, first note that by definition u := y − E(y|x), so that
E(u) = E(y − E(y|x)) = E(y) − E(E(y|x)) = E(y) − E(y) = 0. (11)
To show that E(u|x) = 0 also implies that the error term is uncorrelated with the
covariates, we need to show that
E(ux) = E(E(ux|x)) = E(xE(u|x)) = E(x × 0) = 0. (12)
Therefore, conditions (11) and (12) imply, for j = 1, ...p,
that is, she would like to choose a predictor of y that minimizes the expected loss.
It is convenient to assume a quadratic loss (e.g. $L(u) = u^2$). Therefore the problem may
be transformed to minimize the mean square error of m(x) for y,
Suppose now that the functional form of m is known. For instance, consider a
model linear in parameters:
−E(xy) + E(xx′ )β = 0,
β = (E(xx′ ))−1 E(xy)
Therefore, the parameter β can be interpreted as the best linear predictor under squared
error loss.
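The step from the minimization problem to the first-order condition above can be sketched as follows. This is only a worked outline, and it assumes that the objective is the mean squared error with m(x) = x′β (the symbol S(β) is introduced here purely for illustration):

```latex
\begin{align*}
S(\beta) &= E\big[(y - x'\beta)^2\big]
          = E(y^2) - 2\beta' E(xy) + \beta' E(xx')\beta, \\
\frac{\partial S(\beta)}{\partial \beta}
         &= -2E(xy) + 2E(xx')\beta = 0
         \;\Longrightarrow\; -E(xy) + E(xx')\beta = 0, \\
\beta    &= \big(E(xx')\big)^{-1} E(xy).
\end{align*}
```

The second derivative is 2E(xx′), so the solution is a minimum whenever E(xx′) is positive definite.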
Note that for β to exist, we need the p × p matrix E(xx′) to be invertible. (We also
need that $E(y^2) < \infty$ and $E(\sum_j x_j^2) < \infty$ (see Hansen 2017, Theorem 2.18.1).) If no
redundant variables are included in the vector x, then the design matrix E(xx′) is invertible
and we can say that β is unique and identified. A requirement is that E(xx′) is positive
definite. Identification in this class of linear models means that we can write the parameter
of interest in terms of moment conditions that are functions of observed variables such as y
and x1, ..., xp.
This is related to another assumption of the linear model. The condition
This condition rules out perfect collinearity, or in other words, it rules out exact linear
relationships among the independent variables.
4.1 Examples
Example 2. An example of θ includes coefficients in a regression model. Suppose we have
scalars Y and X and there is an unknown parameter θ. Our model is
Y = Xθ + U, (14)
where U is a random variable representing the error term of a regression model. The
population mean of U is zero and E(UX) = 0. Assume that the second moment of X exists and that
$E(X^2) \neq 0$. It can be shown that:
$$\theta = \frac{E(YX)}{E(X^2)}. \quad (15)$$
The parameter θ is a function of second moments of X and Y .
Therefore, if we can learn from data about E(X 2 ) and E(Y X), then we are able to
point-identify θ (or θ is point identified). The unknown parameter is uniquely identified by
the second moments.
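As a rough illustration of this point, the sketch below replaces the two population moments in (15) with sample averages. The data-generating process, the seed, and the function name `estimate_theta` are assumptions made only for this example.

```python
import random


def estimate_theta(xs, ys):
    """Sample analog of theta = E(YX) / E(X^2)."""
    n = len(xs)
    m_yx = sum(x * y for x, y in zip(xs, ys)) / n  # estimates E(YX)
    m_xx = sum(x * x for x in xs) / n              # estimates E(X^2)
    return m_yx / m_xx


# Hypothetical population: Y = X * theta + U with U independent of X.
rng = random.Random(0)
theta_true = 2.0
xs = [rng.gauss(1.0, 1.0) for _ in range(20000)]
ys = [x * theta_true + rng.gauss(0.0, 1.0) for x in xs]

theta_hat = estimate_theta(xs, ys)
print(theta_hat)  # close to 2.0 in a sample of this size
```

With noiseless data the ratio recovers θ exactly, which is the sense in which the two second moments pin down the parameter.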
Example 3. Another example involves the average treatment effect (ATE), which is
typically considered in the evaluation of policies and social programs. Suppose now we have
a binary treatment indicator, D ∈ {0, 1}, that has been randomly assigned to different
individuals in a population. Suppose also that Y is an outcome of interest.
It follows that the ATE is
$$\theta = E(Y \mid D = 1) - E(Y \mid D = 0),$$
which is identified based on the existence of two conditional mean functions. For θ to be
well-defined, both E(Y |D = 1) and E(Y |D = 0) need to exist and be learnable from data.
The unknown θ is interpreted as the difference between the mean of Y among people who
are on treatment (D = 1) and the mean of Y among people who are in the control group
(D = 0).
A slightly different example is a difference-in-differences parameter. Suppose Yb is the
outcome of interest before a program is implemented and Ya is the outcome of interest
observed after the program is implemented. Suppose that D now indicates individuals under
treatment (note that D = 0 before the program is implemented). It follows that
Example 4. Other examples include error distributions. For instance, suppose you have
two measurements of the outcomes of interest Y denoted by Y1 and Y2 . The interest is in
the errors of the outcome equations, denoted by U , ϵ1 and ϵ2 . Under some assumptions, the
distribution of U , ϵ1 and ϵ2 can be identified based on the joint distribution of Y1 and Y2 .
Suppose we have identically and independently distributed random variables Y that are
draws from FY (y), the distribution function of Y . If we know F , we can determine the median,
$$\theta = y = F^{-1}(0.5), \quad (18)$$
because, by definition,
0.5 = F (y). (19)
However, it is clear that the assumptions above are not enough. We need to assume that F
is monotonically increasing to be able to “solve” for θ.
Another example is on the ATE. Suppose, in practice, we face the situation that while D
has been randomly assigned, individuals who completed the treatment are, in general, more
educated. In this case, there might be issues of selection and a usual strategy is to find a
set of observable variables X and “control for observables”. (Naturally, in our example, X
includes education of the individual).
The idea is to control for selection into treatment and X is typically included in regression
analysis or other estimation methods to estimate parameters of social interventions (e.g.,
matching, RDD, difference-in-differences, etc.).
It follows that the ATE can be written as:
and naturally we need individuals who have D = 1 and X = x as well
as individuals who have D = 0 and X = x. This is another assumption, often
called common support, and it implies that identification needs E(Y |D = 1, X = x) and
E(Y |D = 0, X = x) and not simply E(Y |D = 1) and E(Y |D = 0).
Obviously, if we do not have X for individuals in both the control and treatment groups,
we fail to point identify the parameter θ as defined in equation (20).
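The role of controlling for observables and of common support can be illustrated with a small simulation. This is only a sketch under stated assumptions: X is taken to be a binary education indicator, the adjusted estimate averages the within-X differences E(Y |D = 1, X = x) − E(Y |D = 0, X = x) using the sample shares of X as weights, and the data-generating process and function names are hypothetical.

```python
import random
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def naive_difference(data):
    """Sample analog of E(Y | D = 1) - E(Y | D = 0), ignoring X."""
    treated = [y for y, d, _ in data if d == 1]
    control = [y for y, d, _ in data if d == 0]
    return mean(treated) - mean(control)


def stratified_difference(data):
    """Weighted average over x of E(Y | D=1, X=x) - E(Y | D=0, X=x)."""
    cells = defaultdict(list)
    counts = defaultdict(int)
    for y, d, x in data:
        cells[(d, x)].append(y)
        counts[x] += 1
    total = 0.0
    for x, count in counts.items():
        # Common support: both groups must be observed at this value of X.
        if not cells[(1, x)] or not cells[(0, x)]:
            raise ValueError(f"common support fails at X = {x}")
        gap = mean(cells[(1, x)]) - mean(cells[(0, x)])
        total += (count / len(data)) * gap
    return total


# Hypothetical population: X = 1 marks more-educated individuals, who are
# more likely to complete treatment; the true effect of D on Y is 1.
rng = random.Random(3)
data = []
for _ in range(20000):
    x = 1 if rng.random() < 0.5 else 0
    d = 1 if rng.random() < (0.7 if x == 1 else 0.3) else 0
    y = 1.0 * d + 2.0 * x + rng.gauss(0.0, 1.0)
    data.append((y, d, x))

naive = naive_difference(data)
adjusted = stratified_difference(data)
print(naive, adjusted)
```

In this design the unadjusted difference mixes the treatment effect with the education gap between groups, while the within-X comparison does not. When one of the (D, X) cells is empty, the function raises an error instead of returning a number, mirroring the failure of point identification.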
• Nonlinearities
• Model misspecification
• Singularity issues
• Endogeneity
• Simultaneity issues
• Unobserved heterogeneity
After a brief review of regression, we will start addressing these issues one by one during
the semester. The only exception is the first point on nonlinearities, as our course is mainly
on models linear in parameters.
Suppose we consider the following model:
$$Y = (X - \theta_0)^2 + U, \quad (21)$$
where U is the error term satisfying E(U ) = 0 and θ0 ∈ Θ is the true parameter value among all
possible values of θ in the compact set Θ.
Based on the assumption on the error term, we can write:
$$E\big(Y - (X - \theta_0)^2\big) = 0, \quad (22)$$
so that, expanding the square, θ0 must solve
$$a + b\theta + c\theta^2 = 0, \quad (23)$$
where a = E(Y − X 2 ), b = 2E(X), and c = −1. The parameter θ0 is not uniquely identified
even if we have the entire population (all the data on Y and X).
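A small numerical sketch may help to see why: equation (23) is a quadratic in θ, so it generally has two solutions. The population moments plugged in below are hypothetical (X with mean 0 and variance 1, and θ0 = 1, which gives E(Y) = 2), and the function name is an assumption of this example.

```python
import math


def solve_moment_condition(e_y, e_x, e_x2):
    """Solve a + b*theta + c*theta^2 = 0 with
    a = E(Y - X^2), b = 2E(X), c = -1."""
    a = e_y - e_x2
    b = 2.0 * e_x
    c = -1.0
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return sorted({(-b + root) / (2.0 * c), (-b - root) / (2.0 * c)})


# E(Y) = E((X - 1)^2) = E(X^2) - 2E(X) + 1 = 2 under the assumed moments.
roots = solve_moment_condition(e_y=2.0, e_x=0.0, e_x2=1.0)
print(roots)  # → [-1.0, 1.0]
```

Both θ = 1 and θ = −1 satisfy the moment condition, so knowledge of the population moments alone cannot tell them apart.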
5 Estimation
Consider a random sample {yi , xi } of size n from the population. The pair
(yi , xi ) represents the ith observation, and n denotes the total number of
observations. Since each data point comes from model (1),
yi = β0 + β1 xi + ui (24)
Example 5. Suppose you want to estimate the first moment of a random variable. The
method of moments (MM) can be used to estimate the population mean. Consider {yi }
to be independently and identically distributed (i.i.d.) random variables drawn from
a distribution whose first moment exists (e.g., y is not distributed as Cauchy). In the population,
E(y − µ) = 0
so we can use the corresponding sample moment condition
$$\frac{1}{n}\sum_{i=1}^{n} (y_i - \mu) = 0 \;\Rightarrow\; \hat\mu = \bar y.$$
To obtain the MM estimator for the linear model, recall the moment condition,
$$E(x_i u_i) = 0,$$
or alternatively,
E(xi (yi − x′i β)) = 0,
where xi = (1, xi1 )′ and β = (β0 , β1 )′ . The MM estimator could be obtained as the solution
of the corresponding sample moment condition,
$$\frac{1}{n}\sum_{i=1}^{n} x_i (y_i - x_i'\beta) = 0,$$
yielding,
$$\hat\beta = \left(\sum_{i=1}^{n} x_i x_i'\right)^{-1} \sum_{i=1}^{n} x_i y_i.$$
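The matrix formula for the MM estimator can be sketched in a few lines for the case xi = (1, xi1)′, where the matrix to invert is 2 × 2. The function name and the simulated data are assumptions made for illustration only.

```python
import random


def mm_estimator(xs, ys):
    """Solve (sum x_i x_i') beta = sum x_i y_i for x_i = (1, x_i1)'."""
    n = len(xs)
    s_x = sum(xs)
    s_xx = sum(x * x for x in xs)
    s_y = sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys))
    # sum x_i x_i' = [[n, s_x], [s_x, s_xx]]; invert the 2x2 matrix by hand.
    det = n * s_xx - s_x * s_x
    if det == 0:
        raise ValueError("sum of x_i x_i' is singular (perfect collinearity)")
    b0 = (s_xx * s_y - s_x * s_xy) / det
    b1 = (n * s_xy - s_x * s_y) / det
    return b0, b1


# Hypothetical sample from y = 1 + 0.5 x + u.
rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(5000)]
ys = [1.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

b0_hat, b1_hat = mm_estimator(xs, ys)
print(b0_hat, b1_hat)
```

The singularity check corresponds to the invertibility requirement on the sample counterpart of E(xx′): if x takes a single value in the sample, the determinant is zero and the estimator does not exist.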
E(u) = E(y − β0 − β1 x) = 0
Cov(x, u) = E(xu) − E(x)E(u) = E(xu) = E(x(y − β0 − β1 x)) = 0
These two equations are restrictions on the joint probability
distribution of x and y. There are two unknowns and two equations, which suggests that we
can solve for β0 and β1 . The sample counterparts of these equations are
$$\frac{1}{n}\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0 \quad (25)$$
$$\frac{1}{n}\sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0 \quad (26)$$
The Ordinary Least Squares estimator is the argument that minimizes the following
criterion function:
$$\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2, \quad (27)$$
with respect to the unknowns β0 and β1 . It is easy to see that the normal equations
corresponding to this problem are identical to the two previous equations (25) and (26). Thus,
the OLS estimator is an MM estimator in this case.
From equation (25), we obtain
$$\bar y - \beta_0 - \beta_1 \bar x = 0 \;\Rightarrow\; \hat\beta_0 = \bar y - \hat\beta_1 \bar x \quad (28)$$
where $\bar y = n^{-1}\sum_{i=1}^{n} y_i$. Thus, if we know the ordinary least squares estimator of the slope,
$\hat\beta_1$, we can obtain an estimate of the intercept. How can we obtain $\hat\beta_1$? Plug the last result
into equation (26) to obtain,
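$$\frac{1}{n}\sum_{i=1}^{n} x_i \big(y_i - (\bar y - \hat\beta_1 \bar x) - \hat\beta_1 x_i\big) = 0$$
$$\frac{1}{n}\sum_{i=1}^{n} x_i \big((y_i - \bar y) - \hat\beta_1 (x_i - \bar x)\big) = 0$$
$$\sum_{i=1}^{n} x_i (y_i - \bar y) = \hat\beta_1 \sum_{i=1}^{n} x_i (x_i - \bar x)$$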
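The last equality can be solved for the slope. The short derivation below is a sketch of the remaining algebra, and it assumes there is some variation in x in the sample, so that the denominator is not zero:

```latex
\begin{align*}
\sum_{i=1}^{n} x_i (y_i - \bar y)
  &= \sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)
  && \text{since } \bar x \sum_{i=1}^{n} (y_i - \bar y) = 0, \\
\sum_{i=1}^{n} x_i (x_i - \bar x)
  &= \sum_{i=1}^{n} (x_i - \bar x)^2
  && \text{since } \bar x \sum_{i=1}^{n} (x_i - \bar x) = 0, \\
\hat\beta_1
  &= \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}
          {\sum_{i=1}^{n} (x_i - \bar x)^2}.
\end{align*}
```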
Therefore, β̂0 , β̂1 are the ordinary least squares (OLS) estimators for the intercept and slope
parameters of the simple linear regression model.
1. The sum of the residuals ûi is zero since β̂0 is chosen such that
$$\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = \sum_{i=1}^{n} \hat u_i = 0$$
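2. The sample covariance between the independent variable and the residuals is zero.
To see this, note that
$$
\begin{aligned}
\sum_{i=1}^{n} x_i \hat u_i &= \sum_{i=1}^{n} x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = \sum_{i=1}^{n} x_i (y_i - x_i'\hat\beta) \\
&= \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i x_i' \hat\beta \\
&= \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i x_i' \left(\sum_{i=1}^{n} x_i x_i'\right)^{-1} \sum_{i=1}^{n} x_i y_i \\
&= \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i y_i = 0.
\end{aligned}
$$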
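3. The sample average of the fitted values is equal to the sample average of the observations.
$$y_i = \hat y_i + \hat u_i$$
$$\sum_{i} y_i = \sum_{i} \hat y_i + \sum_{i} \hat u_i$$
$$\frac{1}{n}\sum_{i} y_i = \frac{1}{n}\sum_{i} \hat y_i \;\Rightarrow\; \bar y = \bar{\hat y}$$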
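The three algebraic properties can be checked numerically. The sketch below uses the standard closed-form expressions for the intercept and slope of a simple regression, and the simulated data and variable names are assumptions made for this example only.

```python
import random


def ols_simple(xs, ys):
    """Closed-form OLS intercept and slope for a simple regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical sample from y = 1 + 0.5 x + u.
rng = random.Random(42)
xs = [rng.uniform(0.0, 10.0) for _ in range(200)]
ys = [1.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]

b0, b1 = ols_simple(xs, ys)
fitted = [b0 + b1 * x for x in xs]
resid = [y - f for y, f in zip(ys, fitted)]

sum_resid = sum(resid)                                   # property 1
sum_x_resid = sum(x * u for x, u in zip(xs, resid))      # property 2
mean_gap = sum(ys) / len(ys) - sum(fitted) / len(fitted) # property 3
print(sum_resid, sum_x_resid, mean_gap)  # all zero up to rounding error
```

These identities hold in every sample by construction of the estimator. They say nothing about whether the population error term is uncorrelated with x.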
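$$
\begin{aligned}
\hat\beta_1 &= \frac{1}{\sum_{i=1}^{n} (x_i - \bar x)^2} \sum_{i=1}^{n} (x_i - \bar x)(\beta_0 + \beta_1 x_i + u_i) \\
&= \frac{1}{\sum_{i=1}^{n} (x_i - \bar x)^2} \left(\beta_0 \sum_{i=1}^{n} (x_i - \bar x) + \beta_1 \sum_{i=1}^{n} (x_i - \bar x) x_i + \sum_{i=1}^{n} (x_i - \bar x) u_i\right) \\
&= \frac{1}{\sum_{i=1}^{n} (x_i - \bar x)^2} \left(\beta_1 \sum_{i=1}^{n} (x_i - \bar x)^2 + \sum_{i=1}^{n} (x_i - \bar x) u_i\right).
\end{aligned}
$$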
Therefore, the OLS estimator for the slope can be written as:
$$\hat\beta_1 = \beta_1 + \frac{1}{\sum_{i=1}^{n} (x_i - \bar x)^2} \sum_{i=1}^{n} (x_i - \bar x) u_i. \quad (30)$$
Taking expectations,
$$E(\hat\beta_1) = \beta_1 + \frac{1}{\sum_{i=1}^{n} (x_i - \bar x)^2} \sum_{i=1}^{n} E\big((x_i - \bar x) u_i\big)$$
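The expectation can be completed as follows. This is a sketch that conditions on the observed regressors and assumes random sampling together with E(u|x) = 0, so that each error has conditional mean zero given all of the xi:

```latex
\begin{align*}
E(\hat\beta_1 \mid x_1, \dots, x_n)
  &= \beta_1 + \frac{1}{\sum_{i=1}^{n} (x_i - \bar x)^2}
     \sum_{i=1}^{n} (x_i - \bar x)\, E(u_i \mid x_1, \dots, x_n) \\
  &= \beta_1 + \frac{1}{\sum_{i=1}^{n} (x_i - \bar x)^2}
     \sum_{i=1}^{n} (x_i - \bar x) \cdot 0 = \beta_1 .
\end{align*}
```

By the law of iterated expectations, the unconditional expectation of the slope estimator is then also equal to β1, so the estimator is unbiased.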
A4 (homoskedasticity) $\mathrm{Var}(u \mid x) = \sigma^2$
In other words, the variance of the error term is constant (it does not change with x) and
the error ui is independent of uj for j ≠ i. Assumptions A4 and A5 together imply spherical
errors, or more precisely, spherical error variance.
What is the variance of β̂1 ? We know that
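$$\hat\beta_1 = \beta_1 + \frac{1}{\sum_{i=1}^{n} (x_i - \bar x)^2} \sum_{i=1}^{n} (x_i - \bar x) u_i$$
$$
\begin{aligned}
\mathrm{Var}(\hat\beta_1) &= \frac{1}{\left(\sum_{i=1}^{n} (x_i - \bar x)^2\right)^2} \sum_{i=1}^{n} \mathrm{Var}\big((x_i - \bar x) u_i\big) \\
&= \frac{1}{\left(\sum_{i=1}^{n} (x_i - \bar x)^2\right)^2} \sum_{i=1}^{n} (x_i - \bar x)^2\, \mathrm{Var}(u_i)
\end{aligned}
$$
$$\Rightarrow \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar x)^2}$$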
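Similarly, to obtain the variance of the intercept, under the stated assumptions, we write,
$$
\begin{aligned}
\mathrm{Var}(\hat\beta_0) &= \mathrm{Var}(\bar y - \hat\beta_1 \bar x) = \mathrm{Var}\left(\bar y - \bar x\, \frac{\sum_i (x_i - \bar x) y_i}{\sum_i (x_i - \bar x)^2}\right) \\
&= \mathrm{Var}\left(\sum_i \left(\frac{1}{n} - \frac{\bar x (x_i - \bar x)}{\sum_j (x_j - \bar x)^2}\right) y_i\right) \\
&= \sum_i \left(\frac{1}{n} - \frac{\bar x (x_i - \bar x)}{\sum_j (x_j - \bar x)^2}\right)^2 \mathrm{Var}(u_i) \\
&= \sigma^2 \sum_i \left(\frac{1}{n^2} + \left(\frac{\bar x (x_i - \bar x)}{\sum_j (x_j - \bar x)^2}\right)^2 - \frac{2 \bar x (x_i - \bar x)}{n \sum_j (x_j - \bar x)^2}\right) \\
&= \sigma^2 \sum_i \left(\frac{1}{n^2} + \left(\frac{\bar x (x_i - \bar x)}{\sum_j (x_j - \bar x)^2}\right)^2\right) \\
&= \sigma^2 \left(\frac{1}{n} + \frac{\bar x^2}{\sum_i (x_i - \bar x)^2}\right) = \sigma^2 \left(\frac{\sum_i (x_i - \bar x)^2 + n \bar x^2}{n \sum_i (x_i - \bar x)^2}\right) \\
&= \sigma^2 \left(\frac{\sum_i x_i^2 + n \bar x^2 - 2 \bar x \sum_i x_i + n \bar x^2}{n \sum_i (x_i - \bar x)^2}\right) \\
&= \sigma^2 \left(\frac{\sum_i x_i^2 + n \bar x^2 - 2 n \bar x^2 + n \bar x^2}{n \sum_i (x_i - \bar x)^2}\right) = \sigma^2 \left(\frac{\sum_i x_i^2}{n \sum_i (x_i - \bar x)^2}\right).
\end{aligned}
$$
The cross term drops out in the fifth line because $\sum_i (x_i - \bar x) = 0$.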
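The two variance formulas can be checked with a small Monte Carlo experiment. This is only a sketch: the regressors are held fixed across replications, the errors are drawn as independent normals with constant variance, and all parameter values and names are assumptions made for illustration.

```python
import random
import statistics


def ols_simple(xs, ys):
    """Closed-form OLS intercept and slope for a simple regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def theoretical_variances(xs, sigma2):
    """Var(b0) = sigma^2 * sum x_i^2 / (n * Sxx) and Var(b1) = sigma^2 / Sxx."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    var_b0 = sigma2 * sum(x * x for x in xs) / (n * sxx)
    var_b1 = sigma2 / sxx
    return var_b0, var_b1


rng = random.Random(7)
xs = [float(i) for i in range(1, 26)]  # fixed design, n = 25
beta0, beta1, sigma = 1.0, 0.5, 2.0    # hypothetical true values

draws_b0, draws_b1 = [], []
for _ in range(4000):
    ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
    b0, b1 = ols_simple(xs, ys)
    draws_b0.append(b0)
    draws_b1.append(b1)

var_b0_theory, var_b1_theory = theoretical_variances(xs, sigma ** 2)
var_b0_mc = statistics.pvariance(draws_b0)
var_b1_mc = statistics.pvariance(draws_b1)
print(var_b0_theory, var_b0_mc)
print(var_b1_theory, var_b1_mc)
```

Across replications the simulated variances should be close to the theoretical ones, and the average of the slope estimates should be close to the true slope, in line with unbiasedness.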
Remark 1. The variance of the error term is another unknown parameter of the population,
but we can estimate it using the analog principle. Recall that u is distributed as a normal
variable with mean zero and variance $\sigma^2$,
$$\frac{1}{n}\sum_{i=1}^{n} u_i^2$$
$$\hat\sigma^2 = \frac{1}{n-2}\sum_{i=1}^{n} \hat u_i^2 \quad (32)$$
yi = β0 + β1 xi + ϵi ,
where β1 is a constant causal parameter or effect. For instance, yi is earnings and xi is having
a college degree or an indicator of veteran status. The error term ϵi captures other factors
affecting earnings. If,
ϵi = δ0 + δ1 zi + vi ,
where zi denotes an “observable” variable or simply a vector of observables. Because vi
is assumed to be uncorrelated with zi , we have that E(ϵi |xi ) ≠ 0 but E(vi |xi , zi ) = 0.
Therefore, we estimate an “augmented” model with control zi ,
yi = β0 + β1 xi + β2 zi + ui ,
by OLS, obtaining an unbiased estimate of β1 . Note that this would not be possible if we used OLS
to estimate (6).
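A simulation can make the role of the control concrete. The sketch below is built on assumptions chosen for illustration: x and z are continuous and correlated, the true β1 is 1 and δ1 is 2, and the coefficient on x in the augmented regression is obtained through the Frisch-Waugh-Lovell result (regress y and x on z, then regress the residuals on each other) so that only simple regressions are needed.

```python
import random


def ols_simple(xs, ys):
    """Closed-form OLS intercept and slope for a simple regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def residuals(xs, ys):
    """Residuals from the simple regression of ys on xs."""
    b0, b1 = ols_simple(xs, ys)
    return [y - b0 - b1 * x for x, y in zip(xs, ys)]


# Hypothetical design: z affects both x and y, and v is pure noise.
rng = random.Random(11)
n = 20000
zs = [rng.gauss(0.0, 1.0) for _ in range(n)]
xs = [0.8 * z + rng.gauss(0.0, 1.0) for z in zs]
eps = [2.0 * z + rng.gauss(0.0, 1.0) for z in zs]   # eps = delta1 * z + v
ys = [0.5 + 1.0 * x + e for x, e in zip(xs, eps)]   # true beta1 = 1

# Short regression: y on x only, with z left in the error term.
_, b1_short = ols_simple(xs, ys)

# Augmented regression: coefficient on x after partialling out z.
x_net = residuals(zs, xs)
y_net = residuals(zs, ys)
_, b1_long = ols_simple(x_net, y_net)

print(b1_short, b1_long)
```

In this design the short regression converges to 1 + 2 × 0.8/1.64, roughly 1.98, while the regression that controls for z recovers a slope close to the true value of 1.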
Example 6. The role of covariates can be seen in the illustration presented in Table 1. We
use Social Security and administrative data from Angrist (1998), who studied the effect
of voluntary military service on earnings of soldiers. We use workers who applied to the
military between 1976 and 1981, and we estimate the model before military service and after
(i.e., years 1974 and 1985, as shown in the table). Voluntary military service is obviously
non-random and most likely correlated with education, age, test scores, cohort effects, etc.
We also know that these observables affect earnings. Therefore, omission of these variables
is likely to introduce biases.
The point is that after controlling for all these “observable” differences (using interactions
too), veterans and non-veterans are comparable. Whether or not you believe this premise,
the results shown in Table 1 point to the fact that the model estimated without controls
gives results that appear misleading.
where γ1 = β1 − π1 ≠ β1 . Angrist and Pischke (2009) call these variables, including the test