Method of Moments
$$y_i = x_i'\beta + u_i, \quad i = 1, \dots, N$$
The residual û_i ≡ y_i − x_i'b_lse, which is an estimator for u_i, has zero sample mean and zero sample covariance with the regressors due to the first-order condition:

$$\frac{1}{N}\sum_i x_i(y_i - x_i'b_{lse}) = \left(\frac{1}{N}\sum_i \hat u_i,\ \frac{1}{N}\sum_i x_{i2}\hat u_i,\ \dots,\ \frac{1}{N}\sum_i x_{ik}\hat u_i\right)' = 0.$$
Instead of minimizing N⁻¹∑_i (y_i − x_i'b)², LSE can be motivated directly from a moment condition. Observe that the LSE first-order condition at b = β is N⁻¹∑_i x_i u_i = 0, and its population version is
$$E(xu) = 0 \iff \begin{bmatrix} E(u) \\ E(x_2 u) \\ \vdots \\ E(x_k u) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$
as COV(x_j, u) = E(x_j u) − E(x_j)E(u), where COV and COR stand for covariance and correlation, respectively.
Replacing u with y − x'β yields

$$E\{x(y - x'\beta)\} = 0,$$

which is a restriction on the joint distribution of (x', y). Assuming that E(xx') is invertible, we get

$$\beta = \{E(xx')\}^{-1} \cdot E(xy).$$
The LSE b_lse is just a sample analog of this expression of β, obtained by replacing E(xx') and E(xy) with their sample versions N⁻¹∑_i x_i x_i' and N⁻¹∑_i x_i y_i. Instead of identifying β by minimizing the prediction error, here β is identified by the "information" (i.e., the assumption) that the observed x is "orthogonal" to the unobserved u. A sketch of this sample-analog computation appears below.
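The sample-analog formula can be checked numerically. Below is a minimal Python/NumPy sketch; the data-generating process (the names rng, X, beta) is hypothetical, chosen only so that E(xu) = 0 holds by construction.

```python
import numpy as np

# Hypothetical DGP: regressors x_i (with intercept), error u_i with E(xu) = 0
rng = np.random.default_rng(0)
N = 10_000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # rows are x_i'
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=N)

Sxx = X.T @ X / N                    # sample version of E(xx')
Sxy = X.T @ y / N                    # sample version of E(xy)
b_lse = np.linalg.solve(Sxx, Sxy)    # {N^-1 sum x_i x_i'}^-1 N^-1 sum x_i y_i
print(b_lse)                         # close to beta = (1, 2, -0.5) for large N
```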
For any k × 1 constant vector γ, γ'E(xx')γ = E{(x'γ)²} ≥ 0, so E(xx') is positive semi-definite; the invertibility assumption requires it to be positive definite. Substituting y_i = x_i'β + u_i into b_lse gives

$$b_{lse} = \beta + \left(\frac{1}{N}\sum_i x_i x_i'\right)^{-1} \frac{1}{N}\sum_i x_i u_i.$$

Clearly, b_lse ≠ β due to the second term on the right-hand side (rhs), which shows that each x_i u_i contributes to the deviation b_lse − β. Using the LLN, we have
$$\frac{1}{N}\sum_i x_i u_i \to^p E(xu) = 0 \quad\text{and}\quad \frac{1}{N}\sum_i x_i x_i' \to^p E(xx').$$
Substituting these into the preceding display, we can get b_lse →p β, but we need to deal with the inverse: for a square random matrix W_N, when W_N →p W, will W_N⁻¹ converge to W⁻¹ in probability?
It is known that, for a rv matrix W_N and a constant matrix W_o, if W_N →p W_o and W_o is invertible, then W_N⁻¹ →p W_o⁻¹; matrix inversion is continuous at any invertible matrix.
Therefore, b_lse is β plus a product of two terms, one consistent for a zero vector and the other consistent for a bounded matrix; thus the product is consistent for zero, and we have b_lse →p β: b_lse is consistent for β.
1.2.2 CLT and √N-Consistency
For the asymptotic distribution of the LSE, a central limit theorem (CLT) is needed: for an iid random vector sequence z_1, ..., z_N with finite second moments,

$$\frac{1}{\sqrt N}\sum_i \{z_i - E(z)\} \rightsquigarrow N\big(0,\ E[\{z - E(z)\}\{z - E(z)\}']\big) \quad\text{as } N \to \infty$$
where "⇝" denotes convergence in distribution; i.e., letting Ψ(·) denote the df of N(0, E[{z − E(z)}{z − E(z)}']),

$$\lim_{N\to\infty} P\left[\frac{1}{\sqrt N}\sum_i \{z_i - E(z)\} \le t\right] = \Psi(t) \quad \forall\, t.$$
A single rv z always satisfies P{|z| > δ_ε} < ε, because we can capture "all but ε" of the probability mass by choosing δ_ε large enough; a random sequence w_1, w_2, ... is Op(1) ("bounded in probability") when a single δ_ε captures all but ε probability mass for every rv in the sequence. Any random sequence converging in distribution is Op(1), which implies N^(−1/2)∑_i {z_i − E(z)} = Op(1).
To understand Op better, consider N⁻¹ and N⁻², both of which converge to 0. Observe N⁻¹/N⁻¹ = 1, but N⁻¹/N^(−1+ε) = 1/N^ε → 0 whereas N⁻¹/N^(−1−ε) = N^ε → ∞ for any constant ε > 0. Thus the "(fastest) convergence rate" is N⁻¹, which, when divided into N⁻¹, makes the resulting ratio Op(1). Apply now the LLN to N⁻¹∑_i x_i x_i' and the CLT to N^(−1/2)∑_i x_i u_i in the display for b_lse − β to get

$$\sqrt N(b_{lse} - \beta) \rightsquigarrow N(0, \Omega) \quad\text{where } \Omega \equiv E^{-1}(xx')\,E(xx'u^2)\,E^{-1}(xx'): \qquad (*)$$
√N(b_lse − β) is asymptotically normal with mean 0 and variance Ω. Often this convergence in distribution (or "in law") of √N(b_lse − β) is informally stated as

$$b_{lse} \sim N\left(\beta,\ \frac{1}{N}E^{-1}(xx')E(xx'u^2)E^{-1}(xx')\right). \qquad (*')$$
The asymptotic variance Ω of √N(b_lse − β) can be estimated consistently with (this point will be further discussed later)

$$\Omega_N \equiv \left(\frac{1}{N}\sum_i x_i x_i'\right)^{-1}\left(\frac{1}{N}\sum_i x_i x_i'\hat u_i^2\right)\left(\frac{1}{N}\sum_i x_i x_i'\right)^{-1}. \qquad (*'')$$
Alternatively (and informally), the asymptotic variance of b_lse is estimated consistently with

$$\frac{\Omega_N}{N} = \left(\sum_i x_i x_i'\right)^{-1}\left(\sum_i x_i x_i'\hat u_i^2\right)\left(\sum_i x_i x_i'\right)^{-1}.$$
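The estimator (*'') is straightforward to compute. A minimal sketch follows, with a hypothetical heteroskedastic DGP so that the robust standard errors differ from the homoskedastic ones.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
u = rng.normal(size=N) * (1 + 0.5 * np.abs(X[:, 1]))   # heteroskedastic error
y = X @ np.array([1.0, 2.0, -0.5]) + u

b = np.linalg.solve(X.T @ X, X.T @ y)                  # LSE
uhat = y - X @ b                                       # residuals
Sxx = X.T @ X / N
Mid = (X * (uhat**2)[:, None]).T @ X / N               # N^-1 sum x_i x_i' uhat_i^2
Sinv = np.linalg.inv(Sxx)
Omega_N = Sinv @ Mid @ Sinv                            # (*'')
se_het = np.sqrt(np.diag(Omega_N) / N)                 # robust SE of b_lse
print(b, se_het)
```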
Define the projection matrices

$$P_X \equiv X(X'X)^{-1}X' \quad\text{and}\quad Q_X \equiv I_N - P_X$$

to get

$$Y = I_N Y = P_X Y + (I_N - P_X)Y = P_X Y + Q_X Y = X b_{lse} + \hat U.$$
Also note

$$P_X X = X \quad\text{and}\quad Q_X X = 0:$$

extracting the X part of X gives X itself, and removing the X part of X yields 0.
Suppose we use 1 as the only regressor. Defining 1_N as the N × 1 vector of 1's and denoting Q_{1_N} just as Q_1,

$$Q_1 Y = \left\{I_N - 1_N(1_N'1_N)^{-1}1_N'\right\}Y = \left(I_N - \frac{1}{N}1_N 1_N'\right)Y = \begin{bmatrix} y_1 - \bar y \\ y_2 - \bar y \\ \vdots \\ y_N - \bar y \end{bmatrix}.$$
The part (1_N'1_N)⁻¹1_N'Y = ȳ demonstrates that the LSE with 1 as the sole regressor is just the sample mean ȳ. Q_1 may be called the "mean-deviation" or "mean-subtracting" matrix.
$$R^2 \equiv \frac{\hat Y'Q_1\hat Y}{Y'Q_1 Y} = \frac{\hat Y'Q_1 Y \cdot \hat Y'Q_1\hat Y}{\hat Y'Q_1 Y \cdot Y'Q_1 Y} = \frac{(\hat Y'Q_1 Y)^2}{\hat Y'Q_1\hat Y \cdot Y'Q_1 Y} = \frac{\left\{\sum_i(\hat y_i - \bar{\hat y})(y_i - \bar y)\right\}^2}{\sum_i(\hat y_i - \bar{\hat y})^2 \cdot \sum_i(y_i - \bar y)^2} = (\text{sample correlation of } Y \text{ and } \hat Y)^2.$$
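The equality of the variation ratio and the squared correlation is easy to verify numerically. A minimal sketch (hypothetical data, intercept included so that Q_1Û = Û holds):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 0.8, -0.3]) + rng.normal(size=N)
b = np.linalg.solve(X.T @ X, X.T @ y)
yhat = X @ b

# R^2 two ways: variation ratio and squared sample correlation of y and yhat
Q1y = y - y.mean()
Q1yhat = yhat - yhat.mean()
r2_ratio = (Q1yhat @ Q1yhat) / (Q1y @ Q1y)
r2_corr = np.corrcoef(y, yhat)[0, 1] ** 2
print(r2_ratio, r2_corr)   # the two agree
```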
EXAMPLE: HOUSE SALE. A data set of size 467 was collected from the State College District in Pennsylvania for the year 1991. State College is a small college town with a population of about 50,000. The houses sold during the year were sampled, and the sale prices and the durations until sale since the first listing in the market were recorded.
The dependent variable is the discount (DISC) percentage, defined as 100 times the natural log of the list price (LP) over the sale price (SP) of a house:

$$100\cdot\ln\frac{LP}{SP} = 100\cdot\ln\left(1 + \frac{LP - SP}{SP}\right) \simeq 100\cdot\frac{LP - SP}{SP} = \text{discount \%}.$$
LP and SP are measured in $1000. Since LP is the initial list price, given LP, explaining DISC is equivalent to explaining SP. The following is the list of regressors—the measurement units should be kept in mind: the number of days on the market until sold (T), year built minus 1900 (YR), number of rooms (ROOM), number of bathrooms (BATH), dummy for heating by electricity (ELEC), property tax in $1000 (TAX), dummies for spring listing (L1), summer listing (L2), and fall listing (L3), sale-month interest rate in % (RATE), dummy for sale by a big broker (BIGS), and the number of houses on the market divided by 100 in the month when the house is listed (SUPPLY).
In Table 1, examine only the first three columns for a while. ln(T) appears before 1 because ln(T) is different from the other regressors—it is determined nearly simultaneously with DISC—and thus needs special attention. Judging from the t-values in "tv-het," most regressors are statistically significant at the 5% level, for their absolute t-values are greater than 1.96; "tv-ho" will be used in the next subsection, where the qualifiers "het" and "ho" will be explained. A larger ln(T) implies a bigger DISC: with ∂ln T ≃ ∂T/T, an increase of ∂ln T = 1 (i.e., a 100% increase in T) means a 4.6% increase in DISC, which in turn means that a 1% increase in T leads to a 0.046% increase in DISC.
Variable Mean SD
DISC 7.16 7.64
L1 0.29 0.45
L2 0.31 0.46
L3 0.19 0.39
SP 115 57.7
T 188 150
BATH 2.02 0.67
ELEC 0.52 0.50
ROOM 7.09 1.70
TAX 1.38 0.65
YR 73.0 15.1
LP 124 64.9
BIGS 0.78 0.42
RATE 9.33 0.32
SUPPLY 0.62 0.19
A newer house commands a smaller DISC: a one-year-newer house gets 0.15% less DISC, and thus a 10-year-newer house gets 1.5% less DISC. A higher RATE means a lower DISC (a 1% increase in RATE causes a 2.99% DISC drop); this finding seems, however, counter-intuitive, because a higher mortgage rate means a lower demand for houses. R² = 0.34 shows that 34% of the DISC variance is explained by x'b_lse, and s_N = 6.20 shows that about 95% of the u_i's fall in the range ±1.96 × 6.20 if the u_i's follow N(0, V(u)).
As just noted, a 1% increase in T causes a 0.046% increase in DISC. Since this may not be easy to grasp, T is used instead of ln(T) for the LSE in the last two columns of the table. The estimate for T is significant with the magnitude 0.027, meaning that a 100-day increase in T leads to a 2.7% DISC increase, which seems reasonable. This kind of query—whether the popular logged variable ln(T), the level T, or some other function of T should be used—will be addressed later when we deal with "transformation of variables" in nonlinear models.
X_f can be written as

$$\underset{N\times k_f}{X_f} = \underset{N\times k}{X}\cdot \underset{k\times k_f}{S_f}$$

where S_f is a "selection matrix" consisting only of 1's and 0's to select the components of X for X_f; analogously we can get X_g = X·S_g.
Observe

$$P_{X_f}P_X = P_{X_f} \quad\text{and}\quad Q_{X_f}Q_X = Q_X.$$

In words, for P_{X_f}P_X = P_{X_f}, extracting first the X part (with P_X) and then its subset X_f part (with P_{X_f}) is the same as extracting only the X_f part. As for Q_{X_f}Q_X = Q_X, removing first the X part and then its subset X_f part is the same as removing the whole X part.
Multiply Y = X_f b_f + X_g b_g + Û by Q_{X_f} to get

$$Q_{X_f}Y = Q_{X_f}X_g b_g + \hat U,$$

because Q_{X_f}X_f = 0 and Q_{X_f}Û = Û (Û is orthogonal to X, hence to X_f). Multiplying this by X_g'Q_{X_f} and solving for b_g (X_g'Q_{X_f}Û = 0) yields

$$b_g = (X_g'Q_{X_f}X_g)^{-1}X_g'Q_{X_f}Y.$$

This expression shows that the LSE b_g for β_g can be obtained in two stages. First, do the LSE of Y on X_f to get the partial residual Q_{X_f}Y, and do the LSE of X_g on X_f to get the partial residual Q_{X_f}X_g. Second, do the LSE of Q_{X_f}Y on Q_{X_f}X_g:
$$(X_g'Q_{X_f}'Q_{X_f}X_g)^{-1}X_g'Q_{X_f}'Q_{X_f}Y = (X_g'Q_{X_f}X_g)^{-1}X_g'Q_{X_f}Y.$$
Using the vector notation, the partial regression for the slopes b_g is nothing but the LSE with the mean-deviation variables: with x_i = (1, x̃_i')' and x̄̃ ≡ N⁻¹∑_i x̃_i,

$$b_g = \left\{\sum_i (\tilde x_i - \bar{\tilde x})(\tilde x_i - \bar{\tilde x})'\right\}^{-1}\sum_i (\tilde x_i - \bar{\tilde x})(y_i - \bar y).$$
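This partial-regression (Frisch–Waugh) equivalence can be verified directly. A minimal sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 800
Xf = np.column_stack([np.ones(N), rng.normal(size=N)])     # X_f: intercept + one regressor
Xg = rng.normal(size=(N, 1)) + 0.5 * Xf[:, [1]]            # X_g, correlated with X_f
X = np.hstack([Xf, Xg])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=N)

# Full LSE: the X_g coefficient
b_full = np.linalg.solve(X.T @ X, X.T @ y)

# Partial regression: residualize y and X_g on X_f, then regress
Pf = Xf @ np.linalg.solve(Xf.T @ Xf, Xf.T)                 # P_{X_f}
ry = y - Pf @ y                                            # Q_{X_f} Y
rg = Xg - Pf @ Xg                                          # Q_{X_f} X_g
b_partial = np.linalg.solve(rg.T @ rg, rg.T @ ry)
print(b_full[-1], b_partial.ravel())                       # identical
```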
Suppose the LSE of y on x_f alone is run while the true model is y = x_f'β_f + x_g'β_g + u, so that the omitted v ≡ x_g'β_g + u acts as the error term. The LSE b_f then satisfies

$$b_f = \beta_f + \left(\frac{1}{N}\sum_i x_{if}x_{if}'\right)^{-1}\frac{1}{N}\sum_i x_{if}v_i = \beta_f + \left(\frac{1}{N}\sum_i x_{if}x_{if}'\right)^{-1}\frac{1}{N}\sum_i x_{if}(x_{ig}'\beta_g + u_i)$$
$$= \beta_f + \left(\frac{1}{N}\sum_i x_{if}x_{if}'\right)^{-1}\frac{1}{N}\sum_i x_{if}x_{ig}'\cdot\beta_g + \left(\frac{1}{N}\sum_i x_{if}x_{if}'\right)^{-1}\frac{1}{N}\sum_i x_{if}u_i,$$

which is consistent for β_f + E⁻¹(x_f x_f')E(x_f x_g')·β_g.
The term other than β_f is called the omitted variable bias, which is 0 if either β_g = 0 (i.e., x_g does not belong to the model, so nothing is omitted) or E⁻¹(x_f x_f')E(x_f x_g') = 0, which is the population linear projection coefficient of regressing x_g on x_f. In simple words, if COR(x_f, x_g) = 0, then there is no omitted variable bias. When LSE is run on some data and the resulting estimates do not make sense intuitively, in most cases the omitted variable bias formula will provide a good guide to what might have gone wrong; the sketch below simulates the formula.
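A minimal simulation of the bias formula, under a hypothetical DGP in which the projection coefficient of x_g on x_f is 0.8:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
xf = rng.normal(size=N)
xg = 0.8 * xf + rng.normal(size=N)           # COR(x_f, x_g) != 0
y = 1.0 + 0.5 * xf + 2.0 * xg + rng.normal(size=N)

# LSE omitting x_g: slope converges to beta_f + (projection coef) * beta_g
Xs = np.column_stack([np.ones(N), xf])
b_short = np.linalg.solve(Xs.T @ Xs, Xs.T @ y)
print(b_short[1])                            # approx 0.5 + 0.8 * 2.0 = 2.1
```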
One question that might arise when COR(x_f, x_g) ≠ 0 is what happens if a subvector x_f2 of x_f is correlated with x_g while the other subvector x_f1 of x_f is not, where x_f = (x_f1', x_f2')'. In this case, will x_f1 still be subject to the omitted variable bias? The answer depends on COR(x_f1, x_f2), as can be seen in

$$E^{-1}(x_f x_f')E(x_f x_g') = \begin{bmatrix} E(x_{f1}x_{f1}') & E(x_{f1}x_{f2}') \\ E(x_{f2}x_{f1}') & E(x_{f2}x_{f2}') \end{bmatrix}^{-1}\begin{bmatrix} 0 \\ E(x_{f2}x_g') \end{bmatrix} \quad\text{as } E(x_{f1}x_g') = 0$$
$$= \begin{bmatrix} 0 \\ E^{-1}(x_{f2}x_{f2}')E(x_{f2}x_g') \end{bmatrix} \quad\text{if } E(x_{f1}x_{f2}') = 0.$$
seat belt makes the driver go faster, which results in more accidents. That is, driving speed x_g in the error term is correlated with x_f, and the omitted variable bias dominates β_f so that the following sum becomes positive:

$$\beta_f + E^{-1}(x_f x_f')E(x_f x_g')\cdot\beta_g > 0.$$
In this case, enacting the seat belt law will increase y, not because β f > 0
but rather because it will cause xg to increase.
What the state has in mind is the ceteris paribus ("direct") effect β_f with all the other variables held constant, but what is estimated is the total effect, the sum of the direct effect β_f and the indirect effect of x_f on y through x_g. Both the direct and indirect effects can be estimated consistently using the LSE of y on x_f and x_g, but enacting only the seat belt law will not have the intended effect because the indirect effect will occur. A solution is enacting both the seat belt law and a speed limit law to assure COR(x_f, x_g) = 0 after the laws are passed.
In the example, omitted variable bias helped explain an apparently nonsensical result. But it can also help negate an apparently plausible result. Suppose that there are two types of people, one cautious and the other reckless, with x_g denoting the proportion of cautious people, and that cautious people tend to wear seat belts more (COR(x_f, x_g) > 0) and have fewer traffic accidents. Also suppose β_f = 0, i.e., no true effect of seat belt wearing. In this case, the LSE of y on x_f converges to a negative number and, due to omitting x_g, we may wrongly conclude that wearing a seat belt will lower y and enact the seat belt law. Here the endogeneity problem of x_f leads to an ineffective policy, as the seat belt law will have no true effect on y. Note that, differently from the x_g = speed example, there is no indirect effect of forcing seat belt wearing, because wearing a seat belt will not change a person's type.
(or there is "homoskedasticity"). Although we assume that (u_i, x_i) are iid across i, u_i|x_i are not iid across i under heteroskedasticity. For example, suppose each observation i is a city-level average over the n_i people in city i, so that the city-level error is

$$u_i \equiv \frac{1}{n_i}\sum_j u_{ji}.$$
That is, what is available is a random sample on cities with (n_i, x_i, y_i), i = 1, ..., N, where n_i is the total number of people in city i and N is the number of sampled cities. Suppose that u_ji is independent of x_ji, and that the u_ji's are iid with zero mean and variance σ² (i.e., u_ji ~ (0, σ²)). Then u_1, ..., u_N are independent, and u_i|(x_i, n_i) ~ (0, σ²/n_i): the error terms in the city-level model are heteroskedastic wrt n_i, but not wrt x_i. Note that all of n_i, x_i, and y_i are random, as we do not know which city gets drawn.
This type of heteroskedasticity can be dealt with by minimizing ∑_i (y_i − x_i'b)²n_i, which is equivalent to applying LSE to the transformed equation

$$y_i^* = x_i^{*\prime}\beta + u_i^*, \quad\text{where } y_i^* \equiv y_i\sqrt{n_i},\ x_i^* \equiv x_i\sqrt{n_i},\ u_i^* \equiv u_i\sqrt{n_i}.$$

Hence u_1*, ..., u_N* are iid (0, σ²). This LSE motivates the "weighted LSE" to appear later; a sketch follows.
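A minimal sketch of this √n_i-weighted LSE, under a hypothetical city-average DGP with V(u_i|n_i) = σ²/n_i:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 2000
n = rng.integers(10, 500, size=N)            # city sizes n_i
x = np.column_stack([np.ones(N), rng.normal(size=N)])
beta = np.array([1.0, 2.0])
u = rng.normal(size=N) / np.sqrt(n)          # V(u_i | n_i) = sigma^2 / n_i, sigma = 1
y = x @ beta + u

# Transformed (weighted) LSE: multiply each row by sqrt(n_i)
w = np.sqrt(n)
Xs, ys = x * w[:, None], y * w
b_wls = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
print(b_wls)                                 # close to beta, with iid transformed errors
```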
Two remarks on the city-level data example. First, there is no unity in the transformed regressors because 1 is replaced with √n_i. This requires a different definition of R², which was defined using Q_1Û = Û. R² for the transformed model can be defined as {sample COR(y, ŷ)}², not as {sample COR(y*, ŷ*)}², where ŷ_i = x_i'b*_lse, ŷ_i* = x_i*'b*_lse, and b*_lse is the LSE for the transformed model. This definition of R² can also be used for the "weighted LSE" below. Second, we assumed above that sampling is done at the city level and what is available is the averaged variables y_i and x_i along with n_i. If, instead, all cities are included but n_i individuals get sampled in city i, where n_i is a predetermined (i.e., fixed) constant ahead of sampling, then n_i is not random (but still may vary across i); in contrast, (x_i, y_i) is still random because it depends on the sampled individuals. In this case, the u_i's are independent but non-identically distributed (inid) due to V(u_i) = σ²/n_i, where V(u_i) is the marginal variance of u_i. Clearly, how sampling is done matters greatly.
Recall the variance decomposition formula

$$V(y) = E\{V(y|x)\} + V\{E(y|x)\},$$

which can help understand the sources of V(y). Suppose that x is a rv taking on 1, 2, or 3. Decompose the population with x into 3 groups (i.e., subpopulations). Each group has its conditional variance, and we may be tempted to think that E{V(y|x)}, which is the weighted average of V(y|x) with P(x) as the weight, yields the marginal variance V(y). But the variance decomposition formula demonstrates V(y) ≠ E{V(y|x)} unless V{E(y|x)} = 0 (i.e., unless E(y|x) is constant across x), although E(y) = E{E(y|x)} always holds. That is, the source of the variance is not just the "within-group variance" V(y|x), but also the "between-group variance" V{E(y|x)} of the group mean E(y|x).
If the variance decomposition is done with an observable variable x, then we may dig deeper by estimating E(y|x) and V(y|x). But a decomposition with an unobservable variable u can also be thought of, as we can choose any variable we want in the variance decomposition: V(y) = E{V(y|u)} + V{E(y|u)}. In this case, the decomposition can help us imagine the sources depending on u. For instance, if y is income and u is ability (whereas x is education), then the income variance is the weighted average of the ability-group variances plus the variance between the average group incomes.
Two polar cases are of interest. Suppose that y is income and x is the education group: 1 for "below high school graduation," 2 for "high school graduation" to "below college graduation," and 3 for "college graduation or above." One extreme case is the same mean income for all education groups:

$$E(y|x) = \mu \ \forall x \implies V\{E(y|x)\} = 0 \implies V(y) = E\{V(y|x)\}.$$

The other extreme case is the same variance in each education group:

$$V(y|x) = \sigma^2 \ \forall x \implies E\{V(y|x)\} = \sigma^2 \implies V(y) = \sigma^2 + V\{E(y|x)\};$$

if σ² = 0, then V(y) = V{E(y|x)}: the variance comes solely from the differences of E(y|x) across the groups.
Then the decomposition y_ij − ȳ = (y_ij − ȳ_j) + (ȳ_j − ȳ) is used in one-way ANOVA, where ȳ_j − ȳ is for V{E(y|x)}.
Specifically, take ∑_{j=1}^J ∑_{i=1}^{N_j} on (y_ij − ȳ)² = {(y_ij − ȳ_j) + (ȳ_j − ȳ)}² to see that the cross-product term is zero because

$$\sum_{j=1}^J\sum_{i=1}^{N_j}(y_{ij} - \bar y_j)(\bar y_j - \bar y) = \sum_{j=1}^J(\bar y_j - \bar y)\sum_{i=1}^{N_j}(y_{ij} - \bar y_j) = \sum_{j=1}^J(\bar y_j - \bar y)(N_j\bar y_j - N_j\bar y_j) = 0.$$
Thus we get

$$\underbrace{\sum_{j=1}^J\sum_{i=1}^{N_j}(y_{ij} - \bar y)^2}_{\text{total variation}} = \underbrace{\sum_{j=1}^J\sum_{i=1}^{N_j}(y_{ij} - \bar y_j)^2}_{\text{unexplained variation}} + \underbrace{\sum_{j=1}^J N_j(\bar y_j - \bar y)^2}_{\text{explained variation}}$$

where the two terms on the rhs, when divided by N, are for σ² and V{E(y|x)}, respectively.
The aforementioned test statistic for mean equality is

$$\frac{(J-1)^{-1}\sum_{j=1}^J N_j(\bar y_j - \bar y)^2}{(N-J)^{-1}\sum_{j=1}^J\sum_{i=1}^{N_j}(y_{ij} - \bar y_j)^2} \sim F(J-1,\ N-J).$$

To understand the dof's, note that there are J-many "observations" (the ȳ_j's) in the numerator, and 1 is subtracted in the dof because the grand mean gets estimated by ȳ. In the denominator, there are N-many observations y_ij, and J is subtracted in the dof because the group means get estimated by the ȳ_j's. Under the H_0, the test statistic is close to zero, as the numerator is, because V{E(y|x)} = 0; a sketch of the computation follows.
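A minimal sketch of the one-way ANOVA F statistic, with hypothetical groups generated under H_0 (equal means); scipy is assumed available for the p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
J, Nj = 3, np.array([40, 55, 65])
means = np.array([10.0, 10.0, 10.0])          # H0: equal group means
groups = [m + rng.normal(size=n) for m, n in zip(means, Nj)]

y = np.concatenate(groups)
N = y.size
ybar = y.mean()
ybar_j = np.array([g.mean() for g in groups])

explained = (Nj * (ybar_j - ybar) ** 2).sum()             # between-group variation
unexplained = sum(((g - g.mean()) ** 2).sum() for g in groups)  # within-group
F = (explained / (J - 1)) / (unexplained / (N - J))
pval = 1 - stats.f.cdf(F, J - 1, N - J)
print(F, pval)                                # ~ F(J-1, N-J) under H0
```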
The model y_ij = μ_j + u_ij can be rewritten as a familiar linear model. Define J − 1 dummy variables, say x_i2, ..., x_iJ, where x_ij = 1 if observation i belongs to group j and x_ij = 0 otherwise. Then, with x_i ≡ (1, x_i2, ..., x_iJ)', the model is y_i = x_i'β + u_i where

$$\beta = (\mu_1,\ \mu_2 - \mu_1,\ \dots,\ \mu_J - \mu_1)'.$$
Here the intercept is for μ_1 and the slopes are for the deviations from μ_1; group 1 is typically the "control (i.e., no-treatment) group" whereas the other groups are the "treatment groups." For instance, if observation i belongs to treatment group 2, then E(y_i|x_i) = μ_1 + (μ_2 − μ_1) = μ_2.
Instead of the above F-test, we can test for H_0: μ_1 = ... = μ_J with the "Wald test" to appear later without assuming normality; the Wald test checks whether all slopes are zero or not.
"Two-way ANOVA" generalizes one-way ANOVA. There are two "factors" now, and we get y_ijk where j and k index group (j, k), j = 1, ..., J and k = 1, ..., K; group (j, k) has N_jk observations. The model is

$$y_{ijk} = \mu + \alpha_j + \beta_k + \gamma_{jk} + u_{ijk}$$

where α_j is the factor-1 effect, β_k is the factor-2 effect, and γ_jk is the interaction effect between the two factors. The relevant decomposition is

$$y_{ijk} - \bar y = (\bar y_{j.} - \bar y) + (\bar y_{.k} - \bar y) + (y_{ijk} - \bar y_{j.} - \bar y_{.k} + \bar y)$$

where ȳ is the grand mean, ȳ_j. is the average of all observations with j fixed (i.e., ȳ_j. ≡ ∑_{k=1}^K ∑_{i=1}^{N_jk} y_ijk / ∑_{k=1}^K N_jk), and ȳ_.k is analogously defined. Various F-test statistics can be devised by squaring and summing up this display, but the two-way ANOVA model can also be written as a familiar linear model, to which "Wald tests" can be applied.
• First, do the LSE of y_i on x_i to get the residuals û_i.

• Second, estimate θ by the LSE of û_i² on m_i to get the LSE θ̂ for θ; this is motivated by E(u²|x) = m'θ.

The assumption m_i'θ̂ > 0 for all m_i can be avoided if V(u|x) = exp(m'θ) and if θ is estimated with the "nonlinear LSE" that will appear later. The assumption m_i'θ̂ > 0 for all m_i is simply to illustrate WLS using LSE in the first step.
An easy practical alternative to guarantee positive estimated weights is adopting a log-linear model ln u_i² = m_i'ζ + v_i with v_i being an error term. The log-linear model is equivalent to

$$u_i^2 = e^{m_i'\zeta}e^{v_i} = (e^{m_i'\zeta/2}\nu_i)^2 \quad\text{where } \nu_i \equiv e^{v_i/2},$$

and e^{m_i'ζ/2} may be taken as the scale factor SD(u|x_i) for ν_i (but e^{m_i'ζ/2}ν_i > 0, and thus the error u_i cannot be e^{m_i'ζ/2}ν_i although u_i² = (e^{m_i'ζ/2}ν_i)²). This suggests using SD(u|x_i) = e^{m_i'ζ̂/2} for WLS weighting, where ζ̂ is the LSE for ζ. Strictly speaking, this "suggestion" is not valid because, for SD(u|x_i) = e^{m_i'ζ/2} to hold, we need

$$\ln E(u^2|x_i) = m_i'\zeta \iff E(u^2|x_i) = \exp(m_i'\zeta),$$
but ln u_i² = m_i'ζ + v_i postulates instead E(ln u²|x_i) = m_i'ζ. Since ln E(u²|x_i) ≠ E(ln u²|x_i), ln u_i² = m_i'ζ + v_i is not compatible with SD(u|x_i) = e^{m_i'ζ/2}. Despite this, however, defining û_i* ≡ y_i − x_i'b_wls where b_wls is the WLS with weight exp(m_i'ζ̂/2), so long as the LSE of û_i*² on m_i returns insignificant slopes, we can still say that the weight exp(m_i'ζ̂/2) is adequate, because the heteroskedasticity has been removed by the weight no matter how it was obtained. A sketch of this two-step procedure follows.
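A minimal sketch of the log-linear weighting, under a hypothetical DGP where SD(u|x) = exp(m'ζ/2) actually holds (taking m_i = x_i for illustration). Note that a multiplicative constant in the weight does not change the WLS point estimates, so the ln u² vs. ln E(u²|x) discrepancy does not matter for this step.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 5000
x = np.column_stack([np.ones(N), rng.normal(size=N)])
m = x                                           # m_i = x_i, for illustration
zeta = np.array([0.0, 1.0])
u = np.exp(m @ zeta / 2) * rng.normal(size=N)   # SD(u|x) = exp(m' zeta / 2)
y = x @ np.array([1.0, 2.0]) + u

# Step 1: LSE, then LSE of ln(uhat^2) on m to get zeta-hat
b0 = np.linalg.solve(x.T @ x, x.T @ y)
uhat = y - x @ b0
zhat = np.linalg.solve(m.T @ m, m.T @ np.log(uhat ** 2))

# Step 2: WLS with weight exp(m' zhat / 2), always positive
w = 1 / np.exp(m @ zhat / 2)
Xs, ys = x * w[:, None], y * w
b_wls = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
print(b_wls)
```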
In short, the exact specification of the heteroskedasticity has different implications for how we go about LSE. If, for example, u_i = w_i exp(x_i'θ/2) with w_i ~ (0, 1) independent of x_i, then V(u|x) = exp(x'θ), and we can do WLS with this. This is also convenient in viewing y_i: y_i is obtained by generating x_i and w_i first and then summing up x_i'β and w_i exp(x_i'θ/2). But if the specified form of heteroskedasticity exp(x'θ) is wrong, then the asymptotic variance of the WLS is no longer E⁻¹{xx'/V(u|x)}. So it is safer to use LSE with the heteroskedasticity-robust variance. From now on, we will not invoke the homoskedasticity assumption unless it gives helpful insights for the problem at hand, which does happen from time to time.
We list both tv-het and tv-ho; the latter is in (·) and was computed with b_lse,j/√v_N,jj, j = 1, ..., k, where V_N ≡ [v_N,hj], h, j = 1, ..., k, is defined as s_N²(∑_i x_i x_i')⁻¹. The large differences between the two types of t-values indicate that the homoskedasticity assumption would not be valid for this model.
In this time-series data, if the form of heteroskedasticity is correctly modeled,
the gain in significance (i.e., the gain in the precision of the estimators) would
be substantial. Indeed, such modeling is often done in financial time-series.
Typically, we test for some chosen elements of β being zero jointly. In that case, R consists of the column vectors picking up the chosen elements of β (each column consists of k − 1 zeros and a single 1) and c is a zero vector.
Given the above C and R, define H and Λ such that

$$R'CR = H\Lambda H',$$

where H is orthonormal and Λ is the diagonal matrix of eigenvalues, and set S ≡ HΛ^{−0.5}H'. Further observe

$$\sqrt N\cdot R'(b_N - \beta) \rightsquigarrow N(0, R'CR) \quad \{\text{from } \sqrt N(b_N - \beta) \rightsquigarrow N(0, C) \text{ "times } R'\text{"}\},$$
$$\sqrt N\cdot S R'(b_N - \beta) \rightsquigarrow N(0, I_g), \quad\text{since } S\cdot R'CR\cdot S' = H\Lambda^{-0.5}H'\cdot H\Lambda H'\cdot H\Lambda^{-0.5}H' = I_g.$$

Hence N(R'b_N − R'β)'S'S(R'b_N − R'β) is a sum of g-many squared, asymptotically uncorrelated N(0, 1) random variables (rv). Replacing R'β with c under H_0: R'β = c, we get a Wald test statistic

$$N(R'b_N - c)'(R'C_N R)^{-1}(R'b_N - c) \rightsquigarrow \chi^2_g;$$

a sketch of its computation follows.
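A minimal sketch of the Wald statistic with the robust C_N, under a hypothetical DGP in which H_0 (two zero slopes) is true; scipy is assumed available for the chi-squared p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
N, k = 1000, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=N)

b = np.linalg.solve(X.T @ X, X.T @ y)
uhat = y - X @ b
Sinv = np.linalg.inv(X.T @ X / N)
C_N = Sinv @ ((X * (uhat**2)[:, None]).T @ X / N) @ Sinv   # robust C_N

# H0: beta_3 = beta_4 = 0, i.e., R'beta = c with c = 0
R = np.zeros((k, 2)); R[2, 0] = 1; R[3, 1] = 1
g = R.shape[1]
diff = R.T @ b                                             # R'b - c, c = 0
W = N * diff @ np.linalg.solve(R.T @ C_N @ R, diff)        # Wald statistic
pval = 1 - stats.chi2.cdf(W, df=g)
print(W, pval)                                             # ~ chi^2_g under H0
```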
3.2 Remarks
When b_N is the LSE of y on x, we get C = E⁻¹(xx')E(xx'u²)E⁻¹(xx'), estimated with C_N ≡ Ω_N in (*''). Here, we take the "working proposition" that, for the expected value E(h(x, y, β)) where h(x, y, β) is a (matrix-valued) function of x, y, and β, it holds in general that

$$\frac{1}{N}\sum_i h(x_i, y_i, b_N) - E(h(x, y, \beta)) = o_p(1), \quad\text{if } b_N \to^p \beta.$$
$$\tilde C_N \equiv (N-1)(X'X)^{-1}\left(X'\tilde D X - \frac{X'\tilde r\tilde r'X}{N}\right)(X'X)^{-1}, \quad\text{where}$$
$$\tilde D \equiv \mathrm{diag}(\tilde r_1^2, \dots, \tilde r_N^2), \quad \tilde r_i \equiv \frac{y_i - x_i'b_{lse}}{1 - d_{ii}}, \quad \tilde r \equiv (\tilde r_1, \dots, \tilde r_N)',$$

and d_ii is the ith diagonal element of the matrix X(X'X)⁻¹X'. C̃_N and C_N are asymptotically equivalent, as the term X'r̃r̃'X/N in C̃_N is of smaller order than X'D̃X.
Although the two variance estimators C_N and C_No numerically differ in finite samples, we have C_N − C_No = o_p(1) under homoskedasticity. As already noted, too much difference between C_N and C_No would indicate the presence of heteroskedasticity, which is the basis for the White (1980) test for heteroskedasticity. We will not, however, test for heteroskedasticity; instead, we will just allow it by using the heteroskedasticity-robust variance estimator C_N. There have been criticisms of the heteroskedasticity-robust variance estimator. For instance, Kauermann and Carroll (2001) showed that, when homoskedasticity holds, the heteroskedasticity-robust variance estimator has a higher variance than the variance estimator under homoskedasticity, and that confidence intervals based on the former have a coverage probability lower than the nominal value.
Suppose

$$y_i = x_i'\beta + d_i\beta_d + d_i w_i'\beta_{dw} + u_i$$

where d_i is a dummy variable of interest (e.g., a key policy variable on (d = 1) or off (d = 0)), and w_i consists of elements of x_i interacting with d_i. Here, the effect of d_i on y_i is β_d + w_i'β_dw, which varies across i; i.e., we get N different individual effects. A way to summarize the N-many effects is using β_d + E(w')β_dw (the effect evaluated at the "mean person") or β_d + Med(w')β_dw (the effect evaluated at the "median person").
adding the other to the model when one is already included does not add any new explanatory power. With k = 11, g = 2, c = (0, 0)',

$$\underset{2\times 11}{R'} = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & \cdots & 0 \end{bmatrix}, \quad \underset{11\times 1}{\beta} = (\beta_{\ln(T)}, \beta_1, \beta_{bath}, \beta_{elec}, \beta_{room}, \dots, \beta_{supply})',$$
the Wald test statistic is 0.456 with the p-value 0.796 = P(χ²₂ > 0.456) for the model with ln(T) and C_N: the joint null hypothesis is not rejected. When C_No is used instead of C_N, the Wald test statistic is 0.501 with the p-value 0.779—hardly any change. Although BATH and ROOM are important variables for house prices, they do not explain the discount % DISC. The t-values with C_N and C̃_N shown below for the 11 regressors differ little (tv-C_N was shown already) because N = 467 is not too small for the number of regressors:
$$\frac{1}{2}\beta_{pq}\ln x_p\ln x_q + \frac{1}{2}\beta_{qp}\ln x_q\ln x_p = \frac{\beta_{pq} + \beta_{qp}}{2}\ln x_p\ln x_q:$$

we can only identify the average of β_pq and β_qp, and β_pq = β_qp essentially redefines the average as β_pq.
If we take the translog function as a second-order approximation to an underlying smooth function, say y = exp{f(x)}, then β_pq = β_qp is a natural restriction from the symmetry of the second-order derivative matrix. Specifically, observe

$$\beta_1 + \beta_2 + \beta_3 = 1, \quad \beta_{11} + \beta_{12} + \beta_{13} = 0,$$
$$\beta_{12} + \beta_{22} + \beta_{23} = 0 \ (\text{from } \beta_{21} + \beta_{22} + \beta_{23} = 0), \quad\text{and}$$
$$\beta_{13} + \beta_{23} + \beta_{33} = 0 \ (\text{from } \beta_{31} + \beta_{32} + \beta_{33} = 0).$$
$$\beta = E^{-1}(zx')\cdot E(zy).$$

While IVE in its broad sense includes any estimator using instruments, here we define IVE in its narrow sense as the one taking this particular form. IVE includes LSE as a special case when z = x (or Z = X in matrices).
Substitute y_i = x_i'β + u_i into the b_ive formula to get

$$b_{ive} = \left(\frac{1}{N}\sum_i z_i x_i'\right)^{-1}\frac{1}{N}\sum_i z_i(x_i'\beta + u_i) = \beta + \left(\frac{1}{N}\sum_i z_i x_i'\right)^{-1}\frac{1}{N}\sum_i z_i u_i.$$
The consistency of the IVE follows simply by applying the LLN to the terms other than β in the last equation. As for the asymptotic distribution, observe

$$\sqrt N(b_{ive} - \beta) = \left(\frac{1}{N}\sum_i z_i x_i'\right)^{-1}\frac{1}{\sqrt N}\sum_i z_i u_i.$$

Applying the LLN to N⁻¹∑_i z_i x_i' and the CLT to N^(−1/2)∑_i z_i u_i, it holds that

$$\sqrt N(b_{ive} - \beta) \rightsquigarrow N\big(0,\ E^{-1}(zx')E(zz'u^2)E^{-1}(xz')\big).$$

This is informally stated as

$$b_{ive} \sim N\left(\beta,\ \frac{1}{N}E^{-1}(zx')E(zz'u^2)E^{-1}(xz')\right);$$

a sketch with a simulated endogenous regressor follows.
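A minimal sketch of the just-identified IVE, under a hypothetical DGP where x_2 is endogenous (correlated with u through v) and z_2 is a valid instrument:

```python
import numpy as np

rng = np.random.default_rng(8)
N = 20_000
z2 = rng.normal(size=N)                       # instrument
v = rng.normal(size=N)
u = 0.7 * v + rng.normal(size=N)              # error correlated with x2 via v
x2 = 0.6 * z2 + v                             # endogenous regressor
X = np.column_stack([np.ones(N), x2])
Z = np.column_stack([np.ones(N), z2])         # z has the same dimension as x
y = X @ np.array([1.0, 2.0]) + u

b_ive = np.linalg.solve(Z.T @ X, Z.T @ y)     # (sum z_i x_i')^-1 sum z_i y_i
b_lse = np.linalg.solve(X.T @ X, X.T @ y)     # inconsistent for beta
print(b_ive, b_lse)                           # IVE near (1, 2); LSE biased upward
```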
(i) COR(w, u) = 0 (⟺ E(wu) = 0),
(ii) COR(w, x_4) ≠ 0 ("inclusion restriction"),
(iii) w does not appear in the y equation ("exclusion restriction"),

then w is a valid instrumental variable (IV)—or just instrument—for x_4, and we can use z_i = (1, x_i2, x_i3, w_i)'. The reason why (ii) is called the "inclusion restriction" is that w should be in the x_4 equation for (ii) to hold. Conditions (ii) and (iii) together are simply called "inclusion/exclusion restrictions."
As an example, suppose that y is blood pressure, x_2 is age, x_3 is gender, x_4 is exercise, u includes health concern, and w is a randomized education dummy variable on the health benefits of exercise (i.e., a coin is flipped to give person i the education if heads comes up). Those who are health-conscious may exercise more, which means COR(x_4, u) ≠ 0. Checking out (i)–(iii) for w: first, w satisfies (i) because w is randomized. Second, those who received the health education are likely to exercise more, thus implying (ii). Third, receiving the education alone cannot affect blood pressure, and hence (iii) holds. (iii) does not mean that w should not influence y at all: (iii) is that w can affect y only indirectly through x_4.
Condition (i) is natural in view of E(zu) = 0. Condition (ii) is necessary as w is used as a "proxy" for x_4; if COR(w, x_4) = 0, then w cannot represent x_4—a rv from a coin toss is independent of x_4 and fails (ii) despite satisfying (i) and (iii). Condition (iii) is necessary to make E(zx') invertible; an exogenous regressor x_2 (or x_3) already in the y-equation cannot be used as an instrument for x_4 despite satisfying (i) and possibly (ii), because E(zx') is not invertible if z = (1, x_2, x_3, x_2)'.
Recalling partial regression, only the part of x_4 not explained by the other regressors (1, x_2, x_3) in the y equation contributes to explaining y. Of that part of x_4, w picks only the part uncorrelated with u, because w is uncorrelated with u by condition (i). The instrument w is said to extract the "exogenous variation" in x_4. In view of this, to be more precise, (ii) should be replaced with

(ii') w is correlated with the part of x_4 that is not explained by the other regressors (1, x_2, x_3).
Condition (ii) can be (and should be) verified by the LSE of x_4 on w and the other regressors: the slope coefficient of w should be non-zero in this LSE for w to be a valid instrument. But conditions (i) and (iii) cannot be checked out; they can only be "argued for." In short, an instrument should be excluded from the response equation and included in the endogenous regressor equation, with zero correlation with the error term.
There are a number of sources for the endogeneity of x_4. One is measurement error: if only an error-ridden version x_4^e ≡ x_4 + e of x_4 is observed, we may rewrite β_4 x_4 as β_4 x_4^e − β_4 e and use x_4^e as a regressor. But the new error u − β_4 e is correlated with x_4^e through e.
When this is estimated by LSE, the slope estimator for x_3 is consistent for β_3 + β_4γ_2, where β_4γ_2 is nothing but the bias due to omitting x_4 in the LSE. The slope parameter β_3 + β_4γ_2 of x_3 consists of two parts: the "direct effect" of x_3 on y, and the "indirect part" of x_3 on y through x_4. If x_3 affects x_4 but not the other way around, then the indirect part can be interpreted as the "indirect effect" of x_3 on y through x_4. So long as we are interested in the total effect β_3 + β_4γ_2, the LSE is all right. But usually in economics, the desired effect is the "ceteris paribus" effect of changing x_3 while holding all the other variables (including x_4) constant.
The IVE can also be cast into a minimization problem. The sample analog of E(zu) is N⁻¹∑_i z_i u_i. Since u_i is unobservable, replace u_i by y_i − x_i'b to get N⁻¹∑_i z_i(y_i − x_i'b). We can get the IVE by minimizing the deviation of N⁻¹∑_i z_i(y_i − x_i'b) from 0. Since N⁻¹∑_i z_i(y_i − x_i'b) is a k × 1 vector, we need to choose how to measure the distance from 0. Adopting the squared Euclidean norm as usual and ignoring N⁻¹, we get

$$\left\{\sum_i z_i(y_i - x_i'b)\right\}'\cdot\sum_i z_i(y_i - x_i'b) = \{Z'(Y - Xb)\}'\cdot Z'(Y - Xb)$$
$$= (Y - Xb)'ZZ'(Y - Xb) = Y'ZZ'Y - 2b'X'ZZ'Y + b'X'ZZ'Xb.$$
               #Children   More than   First birth   Same     Twin second   First birth
               ever        two         boy           sex      birth         age
All women      2.50        0.375       0.512         0.505    0.012         21.8
               (0.76)      (0.48)      (0.50)        (0.50)   (0.108)       (3.5)
Wives          2.48        0.367       0.514         0.503    0.011         22.4
               (0.74)      (0.48)      (0.50)        (0.50)   (0.105)       (3.5)
The table shows that there is not much difference between the all-women data and the wives-only data, that the probability of a boy is slightly higher than the probability of a girl, and that the probability of a twin birth is about 1%. Part of Table 8 for the all-women data in Angrist and Evans (1998) is summarized as follows: having the third child decreases weeks worked by 6–9% and hours worked by 4–7 hours per week. Overall, the IVE magnitudes are about 50–100% smaller than the LSE magnitudes.
which is the y "reduced form (RF)." Substituting the y RF into x_i4 = q_i'γ + αy_i + v_i, we also get the x_4 RF:

$$x_{i4} = q_i'\gamma + \frac{\alpha}{1 - \beta_4\alpha}(\beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 q_i'\gamma + \beta_4 v_i + u_i) + v_i.$$
Judging from u's slope α(1 − β_4α)⁻¹ > 0, we get COR(x_4, u) > 0. Suppose that LSE is run for the y equation ignoring the simultaneity. Then, with x_i = (1, x_i2, x_i3, x_i4)', the LSE of y on x will be inconsistent by the magnitude

$$E^{-1}(xx')E(xu) = E^{-1}(xx')\{0, 0, 0, E(x_4 u)\}':$$

the LSE for β_4 is upward biased, and hence the LSE for β_4 can even be positive. Recalling the discussion on omitted variable bias, we can see that the bias is not restricted to β_4 if x_4 is correlated with x_2 or x_3, because the last column of E⁻¹(xx') can "spread" E(x_4 u) ≠ 0 to all components of the LSE.
One way to overcome the simultaneity problem is to use data for short periods. For instance, if y is a monthly crime number for city i and x_4 is the number of policemen in the same month, then it is unlikely that y affects x_4, as it takes time to adjust x_4, whereas x_4 can affect y almost instantly. Another way is to find an instrument. Levitt (1997) noted that the change in x_4 takes place almost always in election years, mayoral or gubernatorial. Thus he set up a "panel (or longitudinal) data" model where y_it is a change in crime numbers for city i and year t, x_it,4 is a change in policemen, and w_it = 1 if year t is an election year in city i and 0 otherwise; w_it is unlikely to be correlated with the error term in the crime-number-change equation. Levitt (1997) concluded that the police force size reduces (violent) crimes.
As McCrary (2002) noted, however, there was a small error in Levitt (1997). Levitt (2002) thus proposed the number of firefighters per capita as a new instrument for the number of policemen per capita in a panel data model. Part of Table 3 in Levitt (2002) for the police effect is shown below with SD in (·), where "LSE without city dummies" means the LSE without city dummies but with year dummies. By not using city dummies, the parameters are identified mainly with cross-city variation, because cross-city variation is much greater than over-time variation, and this LSE is thus similar to a cross-section LSE pooling all panel data.
This table shows that the LSE's are upward biased as analyzed above, although the bias is smaller when the city dummies are used, and that police force expansion indeed reduces the number of crimes. The number of firefighters is an attractive instrument, but somewhat less convincing than the instruments in the fertility example.
The model is

$$y_{it} = x_{it}'\beta + \delta_i + u_{it}$$

where y_it is the number of various offenses per 100,000 people in county i at year t; x_it includes the mean log weekly wage of non-college-educated men (wage_it), the unemployment rate of non-college-educated men (ur_it), the mean log household income (inc_it), and time dummies; δ_i is a time-constant error and u_it is a time-variant error. Our presentation in the following is a rough simplification of their longer models.
Since δ_i represents each county's unobserved long-standing culture and practice, such as how extensively crimes are reported and so on, δ_i is likely to be correlated with x_it. They take the difference between 1979 and 1989 to remove δ_i and get (removing δ_i by differencing is a "standard" procedure in panel data)

$$\Delta y_i = \Delta x_i'\beta + \Delta u_i,$$

where Δy_i ≡ y_i,1989 − y_i,1979, and Δx_i and Δu_i are analogously defined, with N = 564. Their estimation results are as follows, with SD in (·):

LSE: b_wage = −1.13 (0.38), b_ur = 2.35 (0.62), b_inc = 0.71 (0.35), R² = 0.094;
IVE: b_wage = −1.06 (0.59), b_ur = 2.71 (0.97), b_inc = 0.093 (0.55);
the instruments will be explained below. All three estimates in the LSE are significant and show that low wage, high unemployment rate, and high household income increase the crime rate. The IVE is close to the LSE in wage_it and ur_it, but much smaller for inc_it and insignificant. See Freeman (1999) for a survey on crime and economics.
It is possible that crime rates influence local labor market conditions, because firms may move out in response to high crime rates or may offer higher wages to compensate for high crime rates. This means that a simultaneous relation problem may occur between crime rates and labor market conditions. To avoid this problem, Gould et al. (2002) constructed a number of instruments. One of the instruments is

$$\sum_j (\text{employment share of industry } j \text{ in county } i \text{ in 1979})\cdot(\text{national growth rate of industry } j \text{ for 1979–1989}).$$
For the product AB of two matrices A and B where B⁻¹ exists, rank(AB) = rank(A); i.e., multiplication by a non-singular matrix B does not alter the rank of A. This fact implies that E(xz')E⁻¹(zz')E(zx') has rank k and thus is invertible, so that

$$\beta = \{E(xz')E^{-1}(zz')E(zx')\}^{-1}E(xz')E^{-1}(zz')E(zy).$$

If E(zx') is invertible (the case p = k), then this expression becomes E⁻¹(zx')·E(zy), and the resulting b_ive is the IVE when the number of instruments is the same as the number of parameters.
The sample analog of β is the following instrumental variable estimator:

$$b_{ive} = \left\{\sum_i x_i z_i'\left(\sum_i z_i z_i'\right)^{-1}\sum_i z_i x_i'\right\}^{-1}\cdot\sum_i x_i z_i'\left(\sum_i z_i z_i'\right)^{-1}\sum_i z_i y_i$$

where the many N⁻¹'s that cancel one another are ignored. The consistency is obvious, and the asymptotic distribution of √N(b_ive − β) is

$$N\big(0,\ A^{-1}\,E(xz')E^{-1}(zz')E(zz'u^2)E^{-1}(zz')E(zx')\,A^{-1}\big), \quad A \equiv E(xz')E^{-1}(zz')E(zx').$$

Defining X̂ ≡ Z(Z'Z)⁻¹Z'X = P_Z X, b_ive can be written as (X̂'X̂)⁻¹X̂'Y,
as if b_ive were the LSE for the equation Y = X̂β + error, where the "error" is Y − X̂β. This rewriting accords an interesting interpretation to b_ive: as X̂ is the part of X explained by the exogenous z, only the exogenous variation in X is used. The expression (X̂'X̂)⁻¹X̂'Y also demonstrates that the so-called "two-stage LSE (2SLSE)" for simultaneous equations is nothing but IVE. For simplification, consider two simultaneous equations with two endogenous variables y_1 and y_2:
ables y1 and y2 :
$$y_1 = \alpha_1 y_2 + x_1'\beta_1 + u_1, \quad y_2 = \alpha_2 y_1 + x_2'\beta_2 + u_2,$$
$$COR(x_j, u_{j'}) = 0,\ j, j' = 1, 2, \quad\text{and}\quad x_1 \neq x_2.$$
Let z denote the system exogenous regressors (i.e., the collection of the elements in x_1 and x_2). Denoting the regressors for the y_1 equation as x ≡ (y_2, x_1')', the first step of 2SLSE for (α_1, β_1) is the LSE of y_2 on z to obtain the fitted value ŷ_2 of y_2, and the second step is the LSE of y_1 on (ŷ_2, x_1). This 2SLSE is nothing but the IVE where the first step is P_Z X to obtain the LSE fitted value of x on z—the LSE fitted value of x_1 on z is simply x_1—and the second step is the LSE of y_1 on P_Z X; a numerical sketch follows.
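A minimal sketch verifying that the two-stage computation and the one-step IVE formula coincide, under a hypothetical over-identified DGP (p = 3 instruments, k = 2 parameters):

```python
import numpy as np

rng = np.random.default_rng(9)
N = 10_000
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # p = 3 instruments
v = rng.normal(size=N)
x_endo = Z[:, 1] + 0.5 * Z[:, 2] + v                        # endogenous regressor
X = np.column_stack([np.ones(N), x_endo])                   # k = 2 < p
u = 0.8 * v + rng.normal(size=N)
y = X @ np.array([1.0, 2.0]) + u

# First stage: project X on Z; second stage: LSE of y on Xhat
Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)
Xhat = Pz @ X
b_2sls = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)

# Same estimator in one step: {X'Z(Z'Z)^-1 Z'X}^-1 X'Z(Z'Z)^-1 Z'Y
b_ive = np.linalg.solve(X.T @ Pz @ X, X.T @ Pz @ y)
print(b_2sls, b_ive)                                        # identical
```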
The justification was given ahead already: Y = X̂β + error with the error term asymptotically orthogonal to X̂.
Observe

$$E(xx') = E(xz')E^{-1}(zz')E(zx') + E\{(x - \gamma'z)(x - \gamma'z)'\} \quad\text{with } \gamma \equiv E^{-1}(zz')E(zx').$$

This is a decomposition of E(xx') into two parts, one explained by z and the other unexplained by z; x − γ'z is the "residual" (γ'z is the linear projection of x on z; in comparison, E(x|z) is often called the projection of x on z). From the decomposition, we get E(xz')E⁻¹(zz')E(zx') ≤ E(xx') in the positive semi-definite sense.
There are many ways to combine the moment condition E(zu) = 0 with the model, the IVE being just one of them. If we use E(xz')W⁻¹ where W is a p × p p.d. matrix, we will get

$$E(xz')W^{-1}E(zy) = E(xz')W^{-1}E(zx')\beta \implies \beta = \{E(xz')W^{-1}E(zx')\}^{-1}E(xz')W^{-1}E(zy).$$

As it turns out, W = E(zz'u²) is optimal for iid samples, which is the theme of this section.
$$b_W = (X'ZW^{-1}Z'X)^{-1}\cdot(X'ZW^{-1}Z'Y) = \left\{\sum_i x_i z_i'\,W^{-1}\sum_i z_i x_i'\right\}^{-1}\cdot\sum_i x_i z_i'\,W^{-1}\sum_i z_i y_i, \quad\text{in vectors.}$$
With W = E(zz'u²), this matrix becomes {E(xz')E⁻¹(zz'u²)E(zx')}⁻¹, and we get the GMM with

$$\sqrt N(b_{gmm} - \beta) \rightsquigarrow N\big(0,\ \{E(xz')E^{-1}(zz'u^2)E(zx')\}^{-1}\big).$$

Differently from IVE, Z(Z'DZ)⁻¹Z' is no longer the linear projection matrix of Z. A consistent estimator for the GMM asymptotic variance {E(xz')E⁻¹(zz'u²)E(zx')}⁻¹ is easily obtained: it is simply the first part {X'Z(Z'DZ)⁻¹Z'X}⁻¹ of b_gmm times N; a two-step sketch follows.
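A minimal sketch of two-step GMM (IVE first stage for residuals, then the weighting with Z'DZ), under a hypothetical heteroskedastic over-identified DGP:

```python
import numpy as np

rng = np.random.default_rng(10)
N = 10_000
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # p = 3
v = rng.normal(size=N)
x2 = Z[:, 1] - Z[:, 2] + v
X = np.column_stack([np.ones(N), x2])                        # k = 2
u = (1 + 0.5 * Z[:, 1] ** 2) * rng.normal(size=N) + 0.6 * v  # heteroskedastic, E(zu)=0
y = X @ np.array([1.0, 2.0]) + u

# Step 1: IVE to get residuals r_i
Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)
b_ive = np.linalg.solve(X.T @ Pz @ X, X.T @ Pz @ y)
r = y - X @ b_ive

# Step 2: GMM with weighting Z'DZ, D = diag(r_i^2)
ZDZ = (Z * (r**2)[:, None]).T @ Z                            # sum z_i z_i' r_i^2
XZ = X.T @ Z
A = XZ @ np.linalg.solve(ZDZ, XZ.T)                          # X'Z(Z'DZ)^-1 Z'X
b_gmm = np.linalg.solve(A, XZ @ np.linalg.solve(ZDZ, Z.T @ y))
se = np.sqrt(np.diag(np.linalg.inv(A)))                      # from N * A^-1, divided by N
print(b_gmm, se)
```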
Too big a value, greater than an upper quantile of χ²_{p−k}, indicates that some moment conditions do not hold (or some other assumptions of the model may be violated). The reader may wonder how we can test for the very moment conditions that were used to get the GMM. If there are only k moment conditions, this concern is valid. But when there are more than k moment conditions (p-many), essentially only k of them get used in obtaining the GMM. The GMM over-identification test checks whether the remaining p − k moment conditions are satisfied by the GMM, as can be seen in the degrees of freedom ("dof") of the test.
The test statistic may be viewed as

$$\left(\frac{1}{\sqrt N}\sum_i z_i u_{Ni}\right)'\left(\frac{1}{N}\sum_i z_i z_i'u_{Ni}^2\right)^{-1}\left(\frac{1}{\sqrt N}\sum_i z_i u_{Ni}\right).$$
Defining the matrix version of z_i'u_Ni/√N as G—i.e., the ith row of G is z_i'u_Ni/√N—this display can be written as

$$\{G(G'G)^{-1}G'1_N\}'\,G(G'G)^{-1}G'1_N = 1_N'G(G'G)^{-1}G'1_N,$$

using the symmetry and idempotency of G(G'G)⁻¹G'. The inner-product form shows that the test statistic is non-negative at least; a sketch follows.
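A minimal sketch of the over-identification (J) statistic; the function can be applied to the variables from the GMM sketch above (a hypothetical DGP), and scipy is assumed available for the p-value.

```python
import numpy as np
from scipy import stats

def j_test(Z, X, y, b_gmm):
    """Over-identification test: J ~ chi^2(p - k) under valid moment conditions."""
    r = y - X @ b_gmm                         # GMM residuals u_Ni
    g = Z.T @ r                               # sum_i z_i u_Ni
    S = (Z * (r**2)[:, None]).T @ Z           # sum_i z_i z_i' u_Ni^2
    J = g @ np.linalg.solve(S, g)             # quadratic form, non-negative
    dof = Z.shape[1] - X.shape[1]             # p - k
    return J, 1 - stats.chi2.cdf(J, dof)

# print(j_test(Z, X, y, b_gmm))               # usable with the previous snippet
```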
Using the GMM over-identification test, a natural thing to do is to use
only those moment conditions that are not rejected by the test. This can be
done in practice by doing GMM on various subsets of the moment conditions,
which would be ad hoc, however. Andrews (1999) and Hall and Peixe (2003)
provided a formal discussion on this issue of selecting valid moment condi-
tions, although how popular these suggestions will be in practice remains to
be seen.
Under the homoskedasticity E(u²|z) = σ², W = σ²E(zz') = σ²Z'Z/N + o_p(1). But any multiplicative scalar in W is irrelevant for the minimization. Hence setting W = Z'Z is enough, and b_gmm becomes b_ive under homoskedasticity; the aforementioned optimality of b_ive comes from the GMM optimality under homoskedasticity. Under homoskedasticity, we do not need an initial estimator to get the residuals r_i. But when we do not know whether homoskedasticity holds or not, GMM is obtained in two stages: first apply IVE to get the r_i's, then use ∑_i z_i z_i'r_i² to get the GMM. For this reason, the GMM is sometimes called a "two-stage IVE."
We can summarize our analysis for the linear model under the p × 1 moment condition E(zu) = 0 and heteroskedasticity of unknown form as follows. First, the efficient estimator when p ≥ k is

$$b_{gmm} = \{X'Z(Z'DZ)^{-1}Z'X\}^{-1}X'Z(Z'DZ)^{-1}Z'Y.$$

If homoskedasticity prevails, the efficient estimator is

$$b_{ive} = \{X'Z(Z'Z)^{-1}Z'X\}^{-1}X'Z(Z'Z)^{-1}Z'Y.$$
Then y_{t−1}, x_{t−1}, y_{t−2}, x_{t−2}, ... are all valid instruments, because the error term E(y_{t+1}|I_t) − y_{t+1} is uncorrelated with all available information up to t. If E(x_t ε_t) = 0, then x_t is also a good instrument.
can be dealt with by IVE and GMM. We will use L1, L2, and L3 as instruments. An argument for the instruments would be that, while the market conditions and the characteristics of the house and realtor may influence DISC directly, it is unlikely that DISC is affected directly by when the house is listed in the market. The reader may object to this argument, in which case the following should be taken just as an illustration.
Although we cannot test for the exclusion restriction, we can at least check whether the three variables have explanatory power for the potentially endogenous regressor ln(T). For this, the LSE of ln(T) on the instruments and the exogenous regressors was done to yield (heteroskedasticity-robust variance used):

$$\ln(T_i) = \underset{(-0.73)}{-1.352} + \dots \underset{(-2.77)}{-0.294}\cdot L1\ \underset{(-2.21)}{-0.269}\cdot L2\ \underset{(-1.22)}{-0.169}\cdot L3, \quad R^2 = 0.098 \quad (t\text{-values in }(\cdot)),$$
which shows that L1, L2, and L3 indeed have explanatory power for ln(T). Table 2 shows the LSE, IVE, and GMM results (LSE is provided here again for the sake of comparison). The pseudo R² for the IVE is 0.144; compare this to the R² = 0.34 of the LSE. The GMM over-identification test statistic value and its p-value are, respectively, 2.548 and 0.280, not rejecting the moment conditions.

In the IVE, the tv-ho's are little different from the tv-het's, other than for YR. The difference between the tv for GMM and the tv-het for IVE is also negligible. The LSE has far more significant variables than the IVE and GMM, which are close to each other. The ln(T) estimate is about 50% smaller in LSE than in IVE and GMM. ELEC and ln(LP) lose their significance in the IVE and GMM, and their estimate sizes are also halved. YR has almost the same estimates and t-values across the three estimators. BIGS has similar estimates across the three estimators, but is not significant in the IVE and GMM. RATE is significant for all three estimators, and its value changes from −3 in LSE to −5 in the IVE and GMM. Overall, the signs of the significant estimates are the same across all estimators, and most earlier remarks made for LSE apply to IVE and GMM.
√
As in WLS, we need to replace θ with a first-stage N -consistent esti-
mator, say θ̂; call the GLS with θ̂ the “feasible GLS ” and the GLS with θ
the “infeasible GLS.” Whether the feasible GLS is consistent with the same
asymptotic distribution as the infeasible GLS follows depends on the form of
Ω(X; θ), but in all cases we will consider GLS for, this will be the case as to
be shown in a later chapter. In the transformed equation, the error terms are
iid and homoskedastic with unit variance. Thus we get
√
N (bGLS − β) N (0, E −1 (x∗ x∗ )).
This is the asymptotic variance of GMM, which means that GMM is efficient under E(zu) = 0, y = x'β + u, and the iid assumption. When z = x and u is a scalar, the efficiency bound becomes

$$\{E(xx')E^{-1}(xx'u^2)E(xx')\}^{-1} = E^{-1}(xx')E(xx'u^2)E^{-1}(xx'),$$

which is the asymptotic variance of LSE. Thus LSE is the most efficient under the moment condition E(xu) = 0, the linear model, and the iid assumption. Chamberlain (1987) also showed that if the stronger condition E(u|x) = 0 is used, the efficiency bound is E⁻¹{xx'/V(u|x)}, which becomes σ²E⁻¹(xx') under homoskedasticity. To compare the LSE asymptotic variance with this bound, write, with SD(u|x) ≡ √E(u²|x),

$$E(xx')E^{-1}(xx'u^2)E(xx') = E(xx')\,E^{-1}\{xx'E(u^2|x)\}\,E(xx')$$
$$= E\left[\frac{x}{SD(u|x)}\{x\,SD(u|x)\}'\right]E^{-1}\big[\{x\,SD(u|x)\}\{x\,SD(u|x)\}'\big]\,E\left[\{x\,SD(u|x)\}\frac{x'}{SD(u|x)}\right].$$